Big Data | Hive | Pig

Big Data Report (40%) Group Assignment
− This is a group assignment, groups of 2 and from the same tutorial ONLY.
− There is no interview for this assignment.
− You will present this work as a group in Presentation of Big Data Report (10%). The
presentation will be for Part 2 of this assignment.
This report consists of two parts:
• The first part is performance evaluation. You will perform a number of tasks and queries
in the Hortonworks environment using Hive and Pig. You need to write the correct queries for Pig and Hive to produce the results specified in the assignment. Then you will record all the details that logs and reports show in Hortonworks. You will use all this information to compare the performance of Pig and Hive such as how long it took for each or how many MapReduce jobs were executed etc. A table should be included along with brief but informative discussions in a paragraph format.
• The second part (see page 6) involves research. You will select a specific area in big data, and read 4 seminal papers about your selected area. Then you will discuss, analyse and compare these papers based on their approaches, contributions, methods, limitations, and any other criteria. This part has to be written according to a specified template, with high quality and correct APA referencing.

You will submit both parts in in a well-structured report (ONE Word document).
Part 1 (20 marks)
An overview:
In this part, you will use six files Master.csv, AwardsCoaches.csv, AwardsPlayers.csv, Coaches.csv, Scoring.csv and Teams.csv from the hockey dataset: http://opensourcesports.com/files/hockey/hdb-2012-06-23.zip
You will use our Hortonworks tutorials as the guideline.
Due Date: Week 12, Thursday by 3pm
The data used in this assignment is courtesy of Open Source Sports. The Hockey Databank project allows for free usage of its data. In exchange for any usage of data, in whole or in part, we agree to display the following statement prominently and in its entirety on your end product:
“The information used herein was obtained free of charge from and is copyrighted by the Hockey Databank project. For more information about the Hockey Databank project please visit http://sports.groups.yahoo.com/group/hockey-databank”
Page 1 of 8
Ice hockey is played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a hockey puck into their opponent’s net to score points. Normally, each team has six players.
Common terms
Positions: Between the six players on the ice, they are typically divided into three forwards and two defensemen and a goaltender (goalie) (G). The forward positions consist of a centre (C) and two wingers: a left wing (L) and a right wing (R). The defencemen (D) usually stay together as a pair generally divided between left and right.
Assists: In ice hockey, point has three contemporary meanings: A point is awarded to a player for each goal scored or assist earned. The total number of goals plus assists equals total points. An assist is attributed to up to two players of the scoring team who shot, passed or deflected the puck towards the scoring teammate, or touched it in any other way which enabled the goal, meaning that they were “assisting” in the goal. There can be a maximum of two assists per goal.
Short-handed: A short-handed goal is a goal scored in ice hockey when a team’s on-ice players are outnumbered by the opposing team. Normally, a team would be outnumbered because of a penalty incurred.
Power team: In ice hockey, a team is said to be on a power play when at least one opposing player is serving a penalty, and the team has a numerical advantage on the ice
(Source: Wikipedia)
Tables details:
Table Master.csv
playerID coachID firstName lastName nameNote nameGiven nameNick height weight shootCatch pos birthYear birthMon birthDay birthCountry birthState birthCity
Player ID Coach ID First name
Last name
Note about player’s name Given name
Nickname
Height in inches Weight in pounds
Shooting hand (catching hand for goalies/goalkeepers) Position (L, R, D, C, G)
Year of birth
Month of birth Day of birth
Country of birth
State or province of birth City of birth
Table AwardsPlayers.csv
playerID award year
lgID
note teammate pos
Player ID
Name of award or trophy
League ID
Position (for all-star teams)
Page 2 of 8
Table Scoring.csv
playerID year stint tmID lgID
Player ID
Year (2005-06 listed as “2005”)
Stint (order of appearance in a season) Team ID
GTG SOG PostGP PostG PostA PostPts PostPIM Post+/- PostPPG PostPPA PostSHG PostSHA PostGWG PostSOG
Game-tying goals – In a tie game, the game-tying goal (GTG) is the last goal scored by the player Shots on goal – Total shots taken on net (the sum of the goals and the opposing goaltender’s saves)
Postseason games played Postseason goals Postseason assists Postseason points
Postseason penalty minutes Postseason plus / minus
Postseason power play goals Postseason power play assists Postseason shorthanded goals
Postseason shorthanded assists Postseason game-winning goals
Postseason shots on goal
League ID
Position (explained earlier)
pos
GP
G
A
Pts
PIM
PPG
PPA
SHG
SHA
GWG
the losing team. If the losing team scores three goals, the fourth goal scored by the player (winning team) is the GWG
Games played
Goals – Total number of goals the player has scored in the current season Assists – Number of goals the player has assisted in the current season Points – Scoring points, calculated as the sum of G and A
Penalty minutes
Power play goals – Number of goals the player has scored while his team was on the power play. Power play assists – Number of goals the player has assisted in while his team was on the power play.
Shorthanded goals – Number of goals the player has scored while his team was shorthanded. Shorthanded assists – Number of goals the player has assisted in while his team was shorthanded.
Game-winning goals – the goal for the winning team that is one more than the total number of goals scored by
Table Coaches.csv
coachID year tmID lgID stint notes
G
W
L
T PostG PostW PostL PostT
Coach ID Year
Team ID League ID
Coaching order Miscellaneous comments
Games Wins
Losses
Ties
Postseason games
Postseason wins Postseason losses Postseason ties
Table AwardsCoaches.csv
coachID award year lgID note
Coach ID
Name of award or trophy
Year League ID
“tie” indicates a tie with another coach
Table Teams.csv
year lgID tmID franchID rank
G
W
L
T OTL Pts SoW SoL GF GA name PIM PPG PPC SHA PKG PKC SHF
Year League ID
Team ID Franchise ID
Final standing Games
Wins Losses Ties
Overtime losses Points
Shootout wins Shootout losses
Goals for Goals against
Full team name Penalty minutes Power play goals Power play chances
Shorthanded goals against Power play goals against Penalty kill chances
Shorthanded goals for
Page 3 of 8
Instruction:
− For Part 1, you will write a technical report that very briefly describes how each task was performed and what the results were using all the documentation mentioned in each task. The report should be written in the order of the tasks using the numbered headings and subheadings.
Task 1 (1 mark):
− Upload the files in Pig and Hive as specified in our tutorials. − Provide one screenshot (first page) per table.
Task 2 Players (6 marks):
a) Find out the player/s who had the highest GWG (you can repeat a player if the year is different). In the query results show the following details: year, player first and last names, date of birth (day, month and year), country of birth, and team name and league ID, their positions, and GWG.
b) Among those player/s who had the highest GWG, find out the player who had won the highest number of awards. In the query results show the following details: player first and last name, and award count.
c) Then using the first and last name of the player who had won the highest number of awards find out the points that the player had earned for each year that the player received an award. Display the following details: the award names, the award year, and the points/Pts that the player scored that year.
− You will write the queries for the following questions in both pig and hive.
− Record and compare the Hive and Pig based on the total time taken as well as other factors
such as no of jobs, maps and reducers, etc.
− You will provide your hive queries, pig scripts and the screenshots of the results (and tables).
Task 3 Coaches (6 marks):
a) Find out the coach who had won the highest number of awards. In the query results show the following details: coach first and last name, date of birth, birth country and number of awards.
b) Find out the coach who had the highest wins. In the query results show the following details: coach first and last name, year, games, wins, losses and ties.
− In your experiments and comparative evaluation, record your machine/computer details that you used like its model, operating system, memory, processor capacity, etc.
Page 4 of 8
c) Perform the same query (b) mentioned above but in the results also display the number of awards won by this coach.
− You will write the queries for the following questions in both pig and hive.
− Record and compare the Hive and Pig based on the total time taken as well as other factors
such as no of jobs, maps and reducers, etc.
− You will provide your hive queries, pig scripts and the screenshots of the results (and tables).
Task 4 Teams (4 marks):
a) Find out the total number of points, goals and assists by each team. In the query results show the following details for all the teams: team name, total number of points, total number goals and total number of assists.
b) Find out for each team which player had scored the highest total number of points. In the query results show the following details: team name, player first and last name, year, number of points, goals and assists.
− You will write the queries for the following questions in both pig and hive.
− Record and compare the Hive and Pig based on the total time taken as well as other factors
such as no of jobs, maps and reducers, etc.
− You will provide your hive queries, pig scripts and the screenshots of the results (and tables).
Task 5 (3 marks):
− In the Hortonworks shell, execute the first query in Task 3 using Pig, and record your
time with and without enabling the Tez. Compare the results.
− Then, in Ambari, for Hive, enable the Tez and perform the same query (the first query in
Task 3) with and without Tez, and compare your results.
− For the same query, for Hive, this time use Cost Based Optimization (CBO) with Tez on.
Record total time taken and compare the results such as DAG details, Graphic View or
others with the previous results.
Provide the results of your experiments in table/s where possible along with the screenshots.
Read this carefully
In the report, make sure you include all Hive queries, Pig scripts, and screenshots of all the results, logs, etc. that show the completion of each task and its component, and all the comparison discussions that you will write in paragraphs and comparison tables. (Note: when the results of a query do not fit in a screenshot, you have to provide two screenshots, one from the first page of results and one from the last page of results). There will be mark deduction if the report is incomplete.
Page 5 of 8
Part 2 (20 marks)
In this part, you will do research on one of the big data areas to gain a better understanding of the state of the art in this field. You will write your findings in a high quality report (1500 word limit). This part must demonstrate your ability to study and discuss peer-reviewed journal articles and conference papers, carry out in-depth analysis, and arrive at substantial conclusions.
Step 1 – You need to select only one of the following topics:
• Big data models and theories
• Machine learning and AI for big data
• Big data mining
• Big data standardization
• Green issues of big data
• Big data analytics and social media
• Real-time analysis of big data (technologies and algorithms) • Big data case studies and applications
Step 2 – After selecting your research area, you need to read and research the related journal articles and conference papers (the books, websites, technical reports or any other sources not accepted here).
• You need to search only these databases in Monash Library: IEEE, ScienceDirect, Springer, ACM, ProQuest, IOS Press, or Scopus (ONLY 2014-present, not older).
• You will select 4 seminal and most related papers.
• Papers should be completed research papers. DO NOT choose research-in-progress
papers, surveys, or review papers.
• The selection of seminal papers must be based on considering the most influential, well-known and cited papers in that research area, and whether they are full research papers with a proposed approach and its implementation and evaluation.
(3 marks)
Step 3 – After selecting the four papers, you need to identify each paper’s contributions, the proposed approach/method, the research issues/challenges it addresses, main findings and finally any remaining open issues.
You need to read, fully understand and analyse each paper and provide a professional and brief description for each:
a. The research challenges and issues that each paper addressing (there might be more than one paper addressing the same issue)
Page 6 of 8
b. The paper contributions, what approach/method/model they are proposing and developing to address those challenges. You need to briefly describe their proposed approaches/methods/models, avoiding technical details and jargons.
c. What are the main findings and results of each paper (usually discussed after the evaluation section), and any open issues for further research.
Step 4 – You will consolidate all the results of step 3 into a research paper following the specified guidelines below.
You need to follow the following structure for Part (b):
1. Introduction (about 100 words): a brief description about what your paper is about. (2 marks)
2. Current state of the art (about 1300 words):
1. Briefly and clearly describe the proposed approaches/methods of each paper;
(2 marks)
2. Discuss the challenges/issues that these papers focus on; (2 marks)
3. Summarise the findings and results of evaluation/experiments. This should
include what improvement or impact each paper had. Provide evidence from
their evaluation results. (4 marks)
4. Add your judgement on their results at the end. If the papers address the same
problem, here you need to compare how their improvements are different or which approach outperforms the other one. (3 marks)
To write this section, use paragraphs rather than bullets or other styles, and make sure the paragraphs have a logical and consistent flow.
3. Conclusion (about 100 words) – Conclude by saying what paper was about, briefly discussing the main or interesting findings and making your final point (a strong one). (2 marks)
• References (this will not be counted in word limit) – list the details of all the references you used in the paper according to the APA style (in addition to your 4 papers you can reference other related papers). In-text citation and references should be correct and need to be according to APA style (refer to QManual on Moodle). (2 marks)
Submission Requirements:
All the following files should be uploaded to Moodle as a zip file and use the following naming convention: FIT5043-A2-[StudentID].zip. One group member will upload the file. There is a mark deduction for any missing document.
1. An Assessment Cover Sheet for the group
2. Provide a report that includes Part a report and Part b report as specified in ONE
Word document in the order of Tasks. Use numbered heading and subheadings. Page 7 of 8
3. The font used Times New Roman for the text, size 12 and 1.5 line space.
4. The paper must be proof read and spell-checked before submission.
Late Submission:
Late Assignments or extensions will not be accepted unless you submit a special consideration form and provide valid documentation such as a medical certificate prior to the submission deadline (NOT after). Otherwise, there will be 5% penalty per day including weekends.
PLEASE NOTE.
Before submitting your assignment, please make sure that you haven’t breached the University plagiarism and cheating policy. It is the student’s responsibility to make themselves familiar with the contents of these documents.
Please also note the following from the Plagiarism Procedures of Monash, available at http://www.policy.monash.edu/policy-bank/academic/education/conduct/plagiarism- procedures.html
Plagiarism occurs when students fail to acknowledge that the ideas of others are being used. Specifically it occurs when:
• other people’s work and/or ideas are paraphrased and presented without a reference;
• other students’ work is copied or partly copied;
• other people’s designs, codes or images are presented as the student’s own
work;
• phrases and passages are used verbatim without quotation marks and/or
without a reference to the author or a web page;
• lecture notes are reproduced without due acknowledgement.