Big Data Assignment 2 – Spark DataFrames

1. 15 Points

Datafile: BreadBasket_DMS.zip
Solve: Show the total count for each item, per day and per hour.
Example, given the input:
Bread, 2016-10-30, 09, 1
Bread, 2016-10-30, 10, 12
:
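
One way to approach task 1 is sketched below. It is not a reference solution. It assumes the archive has been unzipped to a CSV with a header row and columns named Item, Date and Time (with Time formatted as HH:MM:SS), and it reads "total number" as a row count. The function names, the Hour and Total column names, and the file path are all hypothetical.

```python
def format_item_row(item, date, hour, total):
    """Render one result row in the layout of the task's example lines."""
    return f"{item}, {date}, {hour}, {total}"


def items_per_day_hour(df):
    """Count rows per (Item, Date, Hour).

    Assumes df has string columns Item, Date and Time ("HH:MM:SS").
    """
    # Imported lazily so format_item_row works on machines without Spark.
    from pyspark.sql import functions as F

    return (
        df.withColumn("Hour", F.substring(F.col("Time"), 1, 2))  # "09:58:11" -> "09"
        .groupBy("Item", "Date", "Hour")
        .agg(F.count("*").alias("Total"))
        .orderBy("Item", "Date", "Hour")
    )


def main(path="BreadBasket_DMS.csv"):
    """Hypothetical driver; the path and header=True are assumptions."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bread-basket").getOrCreate()
    df = spark.read.csv(path, header=True)
    for r in items_per_day_hour(df).collect():
        print(format_item_row(r["Item"], r["Date"], r["Hour"], r["Total"]))
    spark.stop()
```

Calling main() on a machine with PySpark installed prints one line per item, day and hour. If the file records a quantity per row rather than one row per item sold, a sum over that column would replace the count.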

2. 15 Points

Dataset: Restaurants_in_Durham_County_NC.csv
NOTE: This file is colon-delimited, not comma-delimited. Do not preprocess it; read it
with spark.read…
Solve: Summarize the number of entities by “rpt_area_desc”
Example:
“Swimming Pools”, 13
“Tatoo Establishment”, 2
:
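
A minimal sketch for task 2 follows, assuming the file has a header row. The delimiter is left as a parameter instead of being hard-coded: the task describes the file as colon-delimited, so the first line of the file is worth checking before choosing the value. The function names and the area alias are hypothetical.

```python
def format_area_row(area, total):
    """Render one result row in the layout of the task's example lines."""
    return f'"{area}", {total}'


def count_by_area(spark, path, sep):
    """Read the delimited file directly with Spark and count rows per rpt_area_desc.

    Assumes the file has a header row; sep is the single-character delimiter.
    """
    # Imported lazily so format_area_row works on machines without Spark.
    from pyspark.sql import functions as F

    df = spark.read.csv(path, sep=sep, header=True)
    return (
        # The alias gives the result a predictable column name for Row lookups.
        df.groupBy(F.col("rpt_area_desc").alias("area"))
        .count()
        .orderBy("area")
    )


def main(path="Restaurants_in_Durham_County_NC.csv", sep=":"):
    """Hypothetical driver; sep=":" follows the task's description of the file."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("durham-restaurants").getOrCreate()
    for r in count_by_area(spark, path, sep).collect():
        print(format_area_row(r["area"], r["count"]))
    spark.stop()
```

Passing the delimiter through the reader's sep option keeps the file untouched on disk, which is what the note about not preprocessing asks for.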

3. 25 Points

Dataset: populationbycountry19802010millions.csv
Solve: For each year and each region, compute the year-over-year percentage increase
in population. Note that 1980 has no preceding year.
Also show each region's yearly population increase as a percentage of the global
population increase for that year.
Display the top 10 in decreasing order of global growth.

Example:

Year, Region, Yearly increase, Percent of global yearly increase (these results are
made up)
1981, North America, 1.30%, 1%
1982, Aruba, …
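
The arithmetic in task 3 can be sketched as below, under several assumptions that the task does not state. The data is assumed to be already reshaped to long form with columns region, year and population (numeric). The global figure is assumed to come from a row whose region is "World". "Order of global growth" is read as ordering by each region's share of the global increase. All function and column names are hypothetical.

```python
def pct_increase(prev, curr):
    """Percentage change from prev to curr for a single pair of values.

    Returns None when there is no usable previous value (for example, 1980).
    """
    if prev is None or curr is None or prev == 0:
        return None
    return (curr - prev) / prev * 100.0


def yearly_growth(long_df, world="World"):
    """Top 10 (year, region) rows by share of the global yearly increase.

    Assumes long_df has columns region, year, population (numeric) and that
    the global totals are stored under the region name given by world.
    """
    # Imported lazily so pct_increase works on machines without Spark.
    from pyspark.sql import Window, functions as F

    w = Window.partitionBy("region").orderBy("year")
    inc = (
        long_df.withColumn("prev", F.lag("population").over(w))
        .withColumn("increase", F.col("population") - F.col("prev"))
        # Same formula as pct_increase; when() yields null for a missing or zero prev.
        .withColumn(
            "pct_increase",
            F.when(F.col("prev") != 0, F.col("increase") / F.col("prev") * 100),
        )
    )
    world_inc = inc.filter(F.col("region") == world).select(
        "year", F.col("increase").alias("world_increase")
    )
    return (
        inc.filter(F.col("region") != world)
        .join(world_inc, "year")
        .withColumn(
            "pct_of_global",
            F.when(
                F.col("world_increase") != 0,
                F.col("increase") / F.col("world_increase") * 100,
            ),
        )
        .filter(F.col("pct_of_global").isNotNull())  # drops 1980 and missing data
        .orderBy(F.col("pct_of_global").desc())
        .select("year", "region", "pct_increase", "pct_of_global")
        .limit(10)
    )
```

The lag over a window partitioned by region supplies the previous year's value, so the first year in each partition has a null predecessor and falls out of the result.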

4. 15 Points

Dataset: romeo-juliet-pg1777.txt
Solve: WordCount
Do a word count exercise using PySpark. Ignore punctuation and normalize to
lowercase. Accept only the characters in this set: [0-9a-zA-Z]
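
A possible shape for task 4 is sketched below. It assumes that "ignore punctuation" means any character outside [0-9a-zA-Z] acts as a word separator, so "o'er" becomes "o" and "er". Deleting such characters instead would be an equally defensible reading. The function names are hypothetical. tokenize shows the rule on a single line in plain Python, and word_count applies the same rule with DataFrame functions.

```python
import re


def tokenize(line):
    """Lower-case a line and return its runs of [0-9a-z] characters."""
    return re.findall(r"[0-9a-z]+", line.lower())


def word_count(spark, path):
    """Return a DataFrame of (word, count), most frequent first."""
    # Imported lazily so tokenize works on machines without Spark.
    from pyspark.sql import functions as F

    lines = spark.read.text(path)  # one row per line, in a column named "value"
    # Lower-case, then turn every run of other characters into a single space.
    cleaned = F.regexp_replace(F.lower(F.col("value")), "[^0-9a-z]+", " ")
    words = lines.select(F.explode(F.split(cleaned, " ")).alias("word")).filter(
        F.col("word") != ""  # leading or trailing separators leave empty strings
    )
    return words.groupBy("word").count().orderBy(F.col("count").desc())
```

For example, word_count(spark, "romeo-juliet-pg1777.txt").show(20) would list the twenty most frequent words, given an active SparkSession named spark.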