CSE 6242 / CX 4242: Data and Visual Analytics
HW 3: Spark, Docker, Databricks, AWS and GCP
Q1 [15 points] Analyzing trips data with PySpark
Follow these instructions to download and set up a preconfigured Docker image that you will use for this assignment.
Why use Docker? In earlier iterations of this course, students installed software on their own machines, and we (both students and the instructor team) ran into all sorts of issues, some of which could not be resolved satisfactorily. Docker lets us distribute a cross-platform, preconfigured image with all of the requisite software and correct package versions. Once Docker is installed and the container is running, access Jupyter by browsing to http://localhost:6242. There is no need to install any additional Java or PySpark dependencies, as they are all bundled in the Docker container.
Imagine that your boss gives you a large dataset which contains trip information of New York City Taxi and Limousine Commission (TLC). You are asked to provide summaries for the most common trips, as well as information related to fares and traffic. This information might help in positioning taxis depending on the demand at each location.
You are provided with a Jupyter notebook (q1.ipynb), which you will complete with PySpark using the provided Docker image. Be sure to save your work often! If you do not see your notebook in Jupyter, double-check that the file is present in the folder and that Docker has been set up correctly. If, after checking both, the file still does not appear in Jupyter, you can still move forward by clicking the “upload” button in Jupyter and uploading the file. Note that with this approach your file will not be saved to disk when you save in Jupyter; you will need to download your work via File > Download as… > Notebook (.ipynb), so download as often as you would normally save!
- Regular PySpark Dataframe Operations and PySpark SQL operations can be used.
- If you re-run cells, remember to restart the kernel to clear the Spark context; otherwise, an existing Spark context may cause errors.
You will use the yellow_tripdata_2019-01_short.csv dataset. This dataset is a modified record of NYC Yellow Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts. When processing the data or performing calculations, do not round any values. Download the data here.
a. [1 pt] You will be modifying the function clean_data to clean the data. Cast the following columns to the specified data types:
a. passenger_count – integer
b. total_amount – float
c. tip_amount – float
d. trip_distance – float
e. fare_amount – float
f. tpep_pickup_datetime – timestamp
g. tpep_dropoff_datetime – timestamp
b. [4 pts] You will be modifying the function common_pair. Find the top 10 pickup-dropoff location pairs having the highest number of trips (count). The location pairs should be ordered by count in descending order. If two or more pairs have the same number of trips, break the tie using the trip amount per distance travelled (trip_rate) in descending order. Use columns total_amount and trip_distance to calculate the trip amount per distance. In certain situations, the pickup and dropoff locations may be the same (include such entries as well).
While calculating trip_rate, first get the average trip_distance and the average total_amount for each pair of PULocationID and DOLocationID (using group by). Then take their ratio to get the trip_rate for a pickup-dropoff pair.
PULocationID   DOLocationID   Count   trip_rate
1              2              23      5.242345
3              3              5
c. [4 pts] You will be modifying the function time_of_cheapest_fare. Divide each day into two periods: Day (from 9am to 8:59:59pm, both inclusive) and Night (from 9pm to 8:59:59am, both inclusive). Calculate the average total amount per unit distance travelled (use column total_amount) for both time periods. Sort the result by trip_rate in ascending order to determine when the fare rate is the cheapest. Use tpep_pickup_datetime to divide trips into Day and Night.
Day     4.2632344561
Night   6.42342882
d. [3 pts] You will be modifying the function passenger_count_for_most_tip. Filter the data for trips having fares (fare_amount) greater than $2 and a number of passengers (passenger_count) greater than 0. Calculate the average fare and tip (tip_amount) for each passenger group size, and calculate the tip percent (tip_amount * 100 / fare_amount). Sort the result in descending order of tip percent to obtain the group size that tips the most generously.
2   14.22345234
1   12.523334576
3   12.17345231
e. [3 pts] You will be modifying the function day_with_traffic. Sort the days of the week (using tpep_pickup_datetime) in descending order of traffic (the day having the highest traffic should be at the top). Calculate traffic for a particular day using the average speed of all taxi trips on that day of the week. Calculate the average speed as the average trip distance divided by the average trip time, expressed as distance per hour. If the average speed is equal for multiple days, order the days alphabetically. A day with a low average speed indicates high levels of traffic. The average speed may be 0, indicating very high levels of traffic. Not all days of the week may be present in the data (do not include missing days of the week in your output). Use date_format along with the appropriate pattern letters to format the day of the week so that it matches the example output below.
day_of_week   average_speed
Fri           0.953452345
Mon           5.2424622
[15 points] q1.ipynb – the notebook for the given question with your code
- a) [1 pt] clean_data to clean the data
- b) [4 pts] common_pair to find the top 10 pickup-dropoff pairs
- c) [4 pts] time_of_cheapest_fare to find when the trips are cheapest (Day vs Night)
- d) [3 pts] passenger_count_for_most_tip to find which group size tips most generously
- e) [3 pts] day_with_traffic to find which day has the highest traffic (slowest trips)
Q2 [30 pts] Analyzing dataset with Spark/Scala on Databricks
First, go over this Spark on Databricks tutorial to learn the basics of creating Spark jobs, loading data, and working with data.
You will analyze nyc-tripdata.csv¹ using Spark and Scala on the Databricks platform. (A short description of how Spark and Scala are related can be found here.) You will also need to use the taxi zone lookup table
1 Graph derived from the NYC Taxi and Limousine Commission
(taxi_zone_lookup.csv), which maps each location ID to the actual name of the region in NYC. The nyc-tripdata dataset is a modified record of the NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts.
- Use only Firefox, Safari or Chrome when configuring anything related to Databricks. The setup process has been verified to work on these browsers.
- Carefully follow the instructions in the Databricks Setup Guide. (The datasets mentioned above must be downloaded from here; this link is also mentioned in the guide.) Open the link in a private browser window if you get a ‘Permission Denied’ message.
- You must choose the Databricks Runtime (DBR) version as “6.4 (includes Apache Spark 2.4.5, Scala 2.11)”. We will grade your work using this version.
- You must not choose the default DBR version (>= 7.2).
- Note that you do not need to install Scala or Spark on your local machine; they are provided with the DBR environment.
- You must use only Scala DataFrame operations for this question. Scala DataFrames are just another name for a Spark Dataset of Rows (Dataset[Row]). You can use the Dataset API in Spark to work with these DataFrames. Here is a Spark document that will help you get started on working with DataFrames in Spark. You will lose points if you use SQL queries, Python, or R to manipulate a DataFrame.
- Refer to this link to understand how to avoid using other languages. After selecting the default language as Scala, do not use the language magic %<language> to switch to other languages such as %r, %python, or %sql. The language magics override the default language, which you must not do for this assignment.
- You must not use full SQL queries in lieu of the Spark DataFrame API; that is, you must not use functions like sql(), which allow you to write full SQL queries directly, such as spark.sql("SELECT * FROM col1 WHERE …"). This should be df.select("*") instead.
- The template Scala notebook q2.dbc (in hw3-skeleton) provides you with code that reads a data file nyc-tripdata.csv. The input data is loaded into a DataFrame, inferring the schema using reflection (refer to the Databricks Setup Guide above). It also contains code that filters the data to keep only the rows where the pickup location is different from the drop-off location and the trip distance is strictly greater than 2.0 (>2.0).
- All tasks listed below must be performed on this filtered DataFrame, or you will end up with wrong answers.
- Carefully read the instructions in the notebook, which also provide hints for solving the problems.
- Some tasks in this question have specified data types for the results that are of lower precisions (e.g., float). For these tasks, we will accept relevant higher precision formats (e.g., double). Similarly, we will accept results stored in data types that offer “greater range” (e.g., long, bigint) than what we have specified (e.g., int).