Python Assignment Writing | SIT742 Modern Data Science Unit Assessment Handbook

This Python assignment involves writing programs for data processing.
SIT742 Modern Data Science

0.1.3 Important Dates
Please be aware of the following important dates:
Task 1 Due Date CloudDeakin Submission, by 23:59 on 18/04/2020 (Week 06 Sunday; revised from 11/04/2020, Week 05 Sunday).
Task 2 Group Sign-Up Due Date Task 2 group sign-up on CloudDeakin by 23:59 on
18/04/2020 (Week 06 Saturday).
Task 2 Due Date CloudDeakin Submission, by 23:59 on 30/05/2020 (Week 11 Sunday; revised from 23/05/2020, Week 10 Sunday).
Task 3 Due Date CloudDeakin Online Quiz, in Week 10.
0.1.4 Assignment Results
Tasks 1 and 2 will be marked based on your submitted PDF report and Jupyter notebook,
while Task 3 will be marked automatically online.
• The marking report is expected to be released to CloudDeakin within 14 working
days of the due date;
• Within 3 working days after the result is released, any student who wishes to
challenge the mark must contact or approach the unit chair during contact hours
and bring a copy of their assignment and their mark breakdown. Cloud
students can email the unit chair instead.
0.2 General Requirements
1. Any text or code adapted from any source must be clearly labelled and referenced.
You should clearly indicate the start and end of any such text/code.
2. All SIT742 assignments must be submitted as required by their corresponding assessment specifications. Assignments will not be accepted by
any other means without prior approval. Students should note that this means
that email-based submissions will ordinarily be rejected.
R375 (2020-03-06 20:37:40Z) 2 Last changed by: Gang Li
3. Penalties for late submissions are indicated in the Unit Guide. Submissions close
on the due date, and each subsequent penalty day ends, at 11:59pm local
time. Students outside of Victoria, Australia, should note that the local time zone
is UTC+10, or UTC+11 during daylight saving.
4. Information regarding assignment extensions is provided in the Unit Guide &
Information in CloudDeakin. Students must not assume an extension will be
granted. Late penalties still apply in the case of a failed application for extension.
Thus until an extension is granted students should submit any work completed
before the assignment is due. Note that extensions cannot be granted for system
outages or encumbrances.
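For students in other time zones, the offsets in item 3 can be applied as in the following minimal sketch. It assumes fixed offsets of UTC+10 and UTC+11 exactly as stated above; the function name `deadline_in_utc` and the `daylight_saving` flag are hypothetical, and the caller has to decide whether daylight saving applies on the date in question.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets taken from the text above
# (UTC+10 standard time, UTC+11 during daylight saving).
VIC_STANDARD = timezone(timedelta(hours=10))
VIC_DAYLIGHT = timezone(timedelta(hours=11))


def deadline_in_utc(local_deadline, daylight_saving=False):
    """Attach the Victorian offset to a naive local deadline, then convert to UTC."""
    offset = VIC_DAYLIGHT if daylight_saving else VIC_STANDARD
    return local_deadline.replace(tzinfo=offset).astimezone(timezone.utc)


# Illustrative input: a 23:59 close of submissions on 18/04/2020.
task1_utc = deadline_in_utc(datetime(2020, 4, 18, 23, 59))
print(task1_utc.isoformat())
```

Converting the UTC value to your own local time zone then shows when submissions close where you are.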

ASSESSMENT TASK ONE
DATA EXPLORATION: DATA SCIENTISTS SURVEY
Contents
1.1 Background
1.2 Task Description
1.2.1 Data Exploration
1.2.2 Text analysis
1.3 What to Submit?
This task contributes 25% of your final SIT742 mark. It must be completed individually,
and submitted to CloudDeakin by 23:59 on 18/04/2020 (Week 06 Sunday; revised from 11/04/2020, Week 05 Sunday).
1.1 Background
In 2017, Kaggle (a data science community and competition platform) conducted a survey
of a large number of users registered as data scientists on its platform. The survey
data broadly cover the skill sets of the data scientists, their demographics,
their feedback on the platform, and much other information.
1.2 Task Description
We provide one Jupyter notebook, 2020SIT742Task1.ipynb, at GitHub-SIT742, together
with three data files in the data subfolder:
MCQResponses.csv Participants' answers to the multiple-choice questions. Each column contains the respondents' answers to one specific question.
ConversionRates.csv Currency conversion rates to USD.
JobPostings.csv Data scientist job advertisements in the US, with job descriptions, from
JobPikr.
You are required to develop a data exploration report by completing the provided
Jupyter notebook, applying exploratory data analysis and visualization skills
to carry out the required analysis. Detailed requirements can be found in the provided
notebook; follow them to complete the coding and
include the results in the report SIT742T1Report.pdf.
1.2.1 Data Exploration
For a data scientist, the first and most crucial task after obtaining a dataset is to gain a
good understanding of the data he or she is dealing with. This includes examining the
data attributes (or equivalently, data fields), seeing what they look like, identifying the data
type of each field, and, from this information, determining suitable numerical and visual
descriptions.
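As a rough illustration of this first pass over the fields, the sketch below profiles each column of a small CSV using only the standard library. The sample rows, the column names, and the helper `profile_columns` are hypothetical stand-ins, not the contents of the provided data files.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a survey file;
# the real column names and values may differ.
SAMPLE_CSV = """education,age,salary
Master's degree,29,85000
Bachelor's degree,34,
Master's degree,41,120000
Doctoral degree,38,not disclosed
"""


def is_number(text):
    """Return True when the text parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def profile_columns(rows):
    """Summarise each field: missing count, distinct values, guessed type."""
    profile = {}
    for name in rows[0].keys():
        values = [row[name].strip() for row in rows]
        present = [v for v in values if v]
        # Treat a field as numeric only if every non-empty value parses.
        all_numeric = bool(present) and all(is_number(v) for v in present)
        profile[name] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "kind": "numeric" if all_numeric else "categorical",
            "top": Counter(present).most_common(1)[0] if present else None,
        }
    return profile


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = profile_columns(rows)
for field, stats in summary.items():
    print(field, stats)
```

A profile like this suggests which description fits each field: counts and bar charts for categorical fields, and summary statistics and histograms for numeric ones. It also flags fields that look numeric but contain free-text entries.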
In this part of the assessment task, you need to complete the coding parts of the provided
notebook and carry out the required analysis of attributes such as ‘education’, ‘salary’ and
related demographic information (70%).
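Comparing salaries across respondents generally requires a common currency first. The following sketch shows one way that step might look, assuming a rates table that maps a currency code to a USD rate. All column names, rates, and rows here are hypothetical; check the headers of the real files before adapting it.

```python
import csv
import io
from statistics import median

# Hypothetical contents and column names, for illustration only.
RATES_CSV = """currency,rate_to_usd
USD,1.0
AUD,0.8
INR,0.0156
"""
RESPONSES_CSV = """education,salary,currency
Master's degree,100000,AUD
Bachelor's degree,1000000,INR
Master's degree,120000,USD
Doctoral degree,,USD
"""


def load_rates(text):
    """Map each currency code to its USD conversion rate."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["currency"]: float(row["rate_to_usd"]) for row in reader}


def salaries_in_usd(text, rates):
    """Group USD salaries by education level, skipping unusable rows."""
    by_education = {}
    for row in csv.DictReader(io.StringIO(text)):
        rate = rates.get(row["currency"])
        if not row["salary"] or rate is None:
            continue  # missing salary or unknown currency
        usd = round(float(row["salary"]) * rate, 2)
        by_education.setdefault(row["education"], []).append(usd)
    return by_education


rates = load_rates(RATES_CSV)
usd_by_education = salaries_in_usd(RESPONSES_CSV, rates)
median_usd = {
    level: median(values) for level, values in usd_by_education.items()
}
print(median_usd)
```

The median is used here rather than the mean because self-reported salaries often contain extreme values; that is a design choice of this sketch, not a requirement stated in the task.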
1.2.2 Text analysis
For the job advertisement data JobPostings.csv, you are required to write Python
code to remove the stop-words and to extract the high-frequency words used in the job
advertisements.
After that, you can carry out one self-defined text analysis task to gain insight into the
advertisements (30%).
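The stop-word removal and word-frequency steps can be sketched as follows. This is a minimal standard-library version: the stop-word set is a tiny illustrative list (a real analysis would likely use a fuller one), and the sample advertisements and function names are hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {
    "a", "an", "and", "are", "for", "in", "is",
    "of", "on", "our", "the", "to", "we", "with",
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(descriptions, n=5):
    """Count non-stop-word tokens over all descriptions; return the top n."""
    counts = Counter()
    for description in descriptions:
        counts.update(
            token for token in tokenize(description)
            if token not in STOP_WORDS
        )
    return counts.most_common(n)


# Hypothetical job descriptions standing in for the real advertisement text.
ads = [
    "We are hiring a data scientist with Python and SQL experience.",
    "The data team is looking for an analyst with Python skills.",
    "Experience in machine learning and data visualization is required.",
]
frequent = top_words(ads, n=3)
print(frequent)
```

The same counts could feed a self-defined follow-up analysis, for example comparing the most frequent terms across job titles or locations.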
1.3 What to Submit?
Please familiarise yourself with the General Requirements (see Section 0.2) on assignment
submission. By the due date, you are required to submit the following files to the
corresponding Assignment (Dropbox) in CloudDeakin:
SIT742Task1.ipynb Your Jupyter notebook solution source file for the data exploration
of the data scientist survey data. Fill in your name and Deakin ID
at the relevant place in the first markdown cell.
Please follow the PEP 8 guidelines (Section 3.1) for source code style. Your commenting and adherence to code standards will be considered when marking.
SIT742T1Report.pdf This PDF report contains the required source code and the required
answers to selected questions, as specified in the notebook file.
No Special Consideration will be granted for this assessment task. Students who have
difficulty meeting the deadline because of illness, etc. must apply for an assignment
extension no later than noon on the day prior to the deadline.
ASSESSMENT TASK TWO
DATA ANALYTICS: FIFA 2019
Contents
2.1 Background
2.2 Task Description
2.2.1 FIFA19 Data Analytics
2.2.2 Project Report
2.3 What to Submit?
2.3.1 Important Dates
2.3.2 Files to Submit
This task contributes 40% of your final SIT742 mark. It can be done in a group of 3
members and submitted to CloudDeakin by 23:59 on 30/05/2020 (Week 11 Sunday; revised from 23/05/2020, Week 10 Sunday).
2.1 Background
Recently, Kaggle (a data science community and competition platform) released a data
set, FIFA19, which consists of 18K+ FIFA 19 players with around 90 attributes extracted
from the FIFA database. Here, we redistribute this data set for this assessment task:
2020T2Data.csv The file contains detailed information about each FIFA 19 player.
More information about this dataset can be found at fifa19.

You are required to analyse this dataset using a Jupyter notebook with Spark packages,
including spark.sql and pyspark.ml.
2.2.1 FIFA19 Data Analytics
To systematically investigate this dataset, your Jupyter notebook should complete the
following 3 kinds of analysis (80%):
Part 1 – Exploratory Data Analysis: Data visualization and understanding.
Part 2 – Clustering Analysis: Identify the inherent clusters among players, and for each
cluster, identify its profile.
Part 3 – Classification Analysis: Build classifiers to predict the ‘position_group’ of the
player. You are also required to evaluate the performance of at least 3 models using
cross-validation.
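To make the cross-validation requirement in Part 3 concrete, the sketch below shows the k-fold idea in plain Python: the rows are split into k folds, each fold is held out once for testing, and the k scores are averaged. It is not a Spark implementation. In pyspark.ml the same idea is available through `pyspark.ml.tuning.CrossValidator` with its `numFolds` parameter. The helper names and the dummy scoring function here are hypothetical.

```python
import random


def k_fold_indices(n_rows, k, seed=42):
    """Yield (train, test) index lists for k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        ]
        yield train, test


def cross_validate(fit_and_score, n_rows, k=5):
    """Average the score from fit_and_score(train, test) over k folds."""
    scores = [
        fit_and_score(train, test)
        for train, test in k_fold_indices(n_rows, k)
    ]
    return sum(scores) / len(scores)


# Hypothetical stand-in for training and scoring a classifier:
# it only reports the share of rows held out for testing.
def held_out_share(train, test):
    return len(test) / (len(train) + len(test))


mean_score = cross_validate(held_out_share, n_rows=10, k=5)
print(mean_score)
```

Running each of the three or more candidate models through the same folds keeps the comparison fair, since every model is then scored on identical held-out rows.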