Big Data Assignment 2 – Spark DataFrames
1. 15 Points
Datafile: BreadBasket_DMS.zip
Solve: Show the total count of each item, per day and per hour
Example, given the input:
Bread, 2016-10-30, 09, 1
Bread, 2016-10-30, 10, 12
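One possible way to approach the grouping is sketched below as a Spark SQL query over a temporary view; the result of spark.sql is still a DataFrame. Several things here are assumptions, not facts from the brief: that the archive has been extracted to a CSV with a header row, that the columns are named Date, Time and Item, that Time is stored as text in HH:MM:SS form, and that each row represents one item. The names ITEMS_PER_HOUR_SQL, items_per_day_hour and the view name basket are hypothetical.

```python
# Assumed column names (Date, Time, Item); adjust them to the real CSV header.
ITEMS_PER_HOUR_SQL = """
SELECT Item,
       `Date`,
       substr(`Time`, 1, 2) AS HourOfDay,
       COUNT(*) AS Total
FROM basket
GROUP BY Item, `Date`, substr(`Time`, 1, 2)
ORDER BY Item, `Date`, HourOfDay
"""


def items_per_day_hour(spark, csv_path):
    """Return a DataFrame with one row per (Item, Date, HourOfDay) and its row count.

    `spark` is an existing SparkSession; `csv_path` points at the extracted CSV.
    """
    # Without inferSchema every column is read as a string, so the first two
    # characters of Time are the zero-padded hour (assuming HH:MM:SS text).
    basket = spark.read.csv(csv_path, header=True)
    basket.createOrReplaceTempView("basket")
    return spark.sql(ITEMS_PER_HOUR_SQL)
```

Calling items_per_day_hour(spark, path).show() would print the grouped rows. Taking the hour as a substring keeps the leading zero seen in the example ("09"), which an integer hour would lose.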
2. 15 Points
Dataset: Restaurants_in_Durham_County_NC.csv
NOTE: This file is colon-delimited (not comma-delimited). Do not preprocess it; read it
with spark.read…
Solve: Summarize the number of entities by “rpt_area_desc”
“Swimming Pools”, 13
“Tatoo Establishment”, 2
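A minimal sketch of reading a delimited file without preprocessing and counting rows per category follows. It assumes the file has a header row containing rpt_area_desc and that one row corresponds to one entity; the separator defaults to ":" only because the note above says colon, so pass a different value if the file turns out to use another character. The names ENTITIES_BY_AREA_SQL, entities_by_area and the view name restaurants are hypothetical.

```python
ENTITIES_BY_AREA_SQL = """
SELECT rpt_area_desc,
       COUNT(*) AS entities
FROM restaurants
GROUP BY rpt_area_desc
ORDER BY rpt_area_desc
"""


def entities_by_area(spark, csv_path, sep=":"):
    """Return a DataFrame of row counts per rpt_area_desc value.

    `spark` is an existing SparkSession; `sep` is the field delimiter
    (an assumption taken from the assignment note, not verified here).
    """
    restaurants = (
        spark.read
        .option("header", True)   # first line holds the column names (assumed)
        .option("sep", sep)       # custom delimiter instead of the default comma
        .csv(csv_path)
    )
    restaurants.createOrReplaceTempView("restaurants")
    return spark.sql(ENTITIES_BY_AREA_SQL)
```

The same aggregation could also be written directly on the DataFrame as restaurants.groupBy("rpt_area_desc").count().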
3. 25 Points
Dataset: populationbycountry19802010millions.csv
Solve: For each year and each region, compute the year-over-year percentage increase
in population. Note that 1980 has no preceding year.
Also show each region's yearly population increase as a percentage of the global
population increase for that year.
Display the top 10 in decreasing order of global growth
Year, Region, yearly increase, percent of global year increase (these results are
made up)
1981, North America, 1.30%, 1%
1982, Aruba, …
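The year-over-year logic can be sketched with a LAG window function, shown here under loud assumptions: the data has already been reshaped into a long-format DataFrame with columns region (string), year (integer) and population (numeric), and the global figure is taken from a row whose region is 'World'. Neither the reshaping step nor the presence of a 'World' row is confirmed by the brief, and the names TOP_GROWTH_SQL, top_growth and population_long are hypothetical.

```python
TOP_GROWTH_SQL = """
WITH increases AS (
    SELECT region,
           year,
           population
             - LAG(population) OVER (PARTITION BY region ORDER BY year) AS increase,
           LAG(population) OVER (PARTITION BY region ORDER BY year) AS previous
    FROM population_long
),
world AS (
    -- Assumption: the global total is the row labelled 'World'.
    SELECT year, increase AS world_increase
    FROM increases
    WHERE region = 'World'
)
SELECT i.year,
       i.region,
       100.0 * i.increase / i.previous AS pct_increase,
       100.0 * i.increase / w.world_increase AS pct_of_global
FROM increases i
JOIN world w ON i.year = w.year
WHERE i.increase IS NOT NULL      -- drops the first year, which has no predecessor
  AND i.region <> 'World'
ORDER BY pct_of_global DESC
LIMIT 10
"""


def top_growth(spark, long_df):
    """Return the 10 (year, region) rows with the largest share of global growth.

    `long_df` must have columns region, year, population (assumed layout).
    """
    long_df.createOrReplaceTempView("population_long")
    return spark.sql(TOP_GROWTH_SQL)
```

Here pct_increase is (current - previous) / previous and pct_of_global is the region's increase divided by the world increase for the same year, both scaled by 100. Years in which the previous or world value is zero would need separate handling.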
4. 15 Points
Dataset: romeo-juliet-pg1777.txt
Solve: WordCount
Do a word count exercise using pyspark. Ignore punctuation, and normalize to
lower case. Accept only the characters in this set: [0-9a-zA-Z]
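A minimal sketch of the normalization and counting steps follows, using the RDD API through an existing SparkSession. It reads the character rule one particular way, which is an assumption: any character outside [0-9a-zA-Z] is treated as a separator, so a word such as "o'er" becomes two tokens. Stripping such characters instead of splitting on them would be an equally defensible reading. The names tokenize and word_count are hypothetical.

```python
import re
from operator import add

# Runs of accepted characters; everything else acts as a separator (assumption).
WORD_RE = re.compile(r"[0-9a-zA-Z]+")


def tokenize(line):
    """Split a line into lower-cased tokens made only of [0-9a-zA-Z]."""
    return [word.lower() for word in WORD_RE.findall(line)]


def word_count(spark, text_path):
    """Return an RDD of (word, count) pairs, most frequent first.

    `spark` is an existing SparkSession; `text_path` points at the text file.
    """
    lines = spark.sparkContext.textFile(text_path)
    return (
        lines.flatMap(tokenize)              # one record per token
        .map(lambda word: (word, 1))         # pair each token with a count of 1
        .reduceByKey(add)                    # sum the counts per token
        .sortBy(lambda pair: pair[1], ascending=False)
    )
```

For instance, word_count(spark, "romeo-juliet-pg1777.txt").take(10) would return the ten most frequent tokens as (word, count) tuples.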