Python Assignment Help | SIT742 Modern Data Science
This Python assignment performs statistical analysis of ‘education’, ‘salary’ and related demographic information.
SIT742 Modern Data Science
You are required to develop a data exploration report by completing the provided
Jupyter notebook, applying exploratory data analysis and visualization skills to carry
out the required analysis. Detailed requirements can be found in the provided
notebook; you need to follow them to complete the coding and
include the results in the report SIT742T1Report.pdf.
1.2.1 Data Exploration
For a data scientist, the first and most crucial task after obtaining a dataset is to gain a
good understanding of the data he or she is dealing with. This includes examining the
data attributes (or equivalently, data fields), seeing what they look like, identifying the data
type of each field, and, from this information, determining suitable numerical or visual
descriptions.
In this part of the assessment task, you need to complete the coding parts of the provided
notebook and finish the required analysis of attributes such as ‘education’, ‘salary’ and
related demographic information (70%).
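The idea of choosing a description based on each field's data type can be sketched as follows. This is a minimal, library-free illustration: the records, field values and the `describe_field` helper are hypothetical, and the real attributes and required analysis are those specified in the provided notebook.

```python
import statistics
from collections import Counter

# Hypothetical records standing in for rows of the real dataset; the actual
# column names and values come from the provided notebook.
records = [
    {"education": "Bachelor", "salary": 85000},
    {"education": "Master", "salary": 98000},
    {"education": "Master", "salary": 105000},
    {"education": "PhD", "salary": 120000},
    {"education": "Bachelor", "salary": 79000},
]


def describe_field(rows, field):
    """Return a small summary of one field, chosen by its data type."""
    values = [row[field] for row in rows]
    if all(isinstance(value, (int, float)) for value in values):
        # Numeric field: report range and central tendency.
        return {
            "type": "numeric",
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        }
    # Categorical field: report how often each category occurs.
    return {"type": "categorical", "counts": Counter(values)}


education_summary = describe_field(records, "education")
salary_summary = describe_field(records, "salary")
print(education_summary)
print(salary_summary)
```

A categorical summary like this would typically be shown as a bar chart and a numeric one as a histogram or box plot, although the notebook determines which visualizations are actually required.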
1.2.2 Text analysis
For the job advertisement data JobPostings.csv, you are required to write Python
code to remove the stop-words and to extract the high-frequency words used in the job
advertisements.
After that, you can carry out one self-defined text analysis task to gain insight into the
advertisement data (30%).
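The two required steps, stop-word removal and high-frequency word extraction, can be sketched with the standard library alone. The stop-word list and advertisement texts below are made up for illustration; in the actual task the texts would be read from JobPostings.csv, and a fuller stop-word list would normally be used.

```python
import re
from collections import Counter

# Hypothetical stop-word list and advertisement texts, for illustration only.
STOP_WORDS = {
    "a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "we",
    "with",
}

job_ads = [
    "We are looking for a data scientist with Python skills.",
    "The data engineer is responsible for the data pipeline.",
    "Experience in Python and machine learning is required.",
]


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(texts, stop_words, n=3):
    """Return the n most frequent non-stop-words across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(
            word for word in tokenize(text) if word not in stop_words
        )
    return counts.most_common(n)


print(top_words(job_ads, STOP_WORDS))
```

Keeping tokenization and counting in separate functions makes it easy to swap in a different tokenizer or stop-word list for the self-defined analysis.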
1.3 What to Submit?
Please familiarise yourself with the General Requirements (see Section 0.2) on Assignments
Submission. By the due date, you are required to submit the following files to the
corresponding Assignment (Dropbox) in CloudDeakin:
SIT742Task1.ipynb Your Jupyter notebook solution source file for the data exploration
of the data-scientist-related data. You can fill in your name and Deakin ID
in the relevant place in the first markdown cell.
Please follow the PEP 8 guidelines (Section 3.1) for source code style. Your commenting and adherence to code standards will be considered when marking.
SIT742T1Report.pdf This PDF report contains the required source code and the required
answers to selected questions, as specified in the notebook file.
No Special Consideration will be granted for this assessment task. Students who have
difficulty meeting the deadline because of illness, etc. must apply for an assignment
extension no later than noon on the day prior to the deadline.
R375 (2020-03-06 20:37:40Z) 6 Last changed by: Gang Li
ASSESSMENT TASK TWO
DATA ANALYTICS: FIFA 2019
Contents
2.1 Background
2.2 Task Description
2.2.1 FIFA19 Data Analytics
2.2.2 Project Report
2.3 What to Submit?
2.3.1 Important Dates
2.3.2 Files to Submit
This task contributes 40% of your final SIT742 mark. It can be done in a group of 3
members and must be submitted to CloudDeakin by 23:59, 30/05/2020 (Week 11 Sunday;
revised from 23/05/2020, Week 10).
2.1 Background
Recently, Kaggle (a data science community and competition platform) released a data
set, FIFA19, which consists of 18K+ FIFA 19 players with around 90 attributes extracted
from the FIFA database. Here, we redistribute this data set for this assessment task:
2020T2Data.csv The file contains detailed information about each FIFA 19 player.
fifa19.
You are required to analyse this dataset using a Jupyter notebook with Spark packages,
including spark.sql and pyspark.ml.
2.2.1 FIFA19 Data Analytics
To systematically investigate this dataset, your Jupyter notebook should complete the
following 3 kinds of analysis (80%):
Part 1 – Exploratory Data Analysis: data visualization and understanding.
Part 2 – Clustering Analysis: identify the inherent clusters among players and, for each
cluster, identify its profile.
Part 3 – Classification Analysis: build classifiers to predict the ‘position_group’ of the
player. You are also required to evaluate the performance of at least 3 models using
cross-validation.
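To make the idea of cross-validation concrete, the sketch below splits sample indices into k folds so that each sample is used for testing exactly once. It is a library-free illustration only, not the Spark API: in the notebook, the equivalent step would normally be handled by pyspark.ml utilities such as `pyspark.ml.tuning.CrossValidator`. The helper names `k_fold_indices` and `cross_validate` are hypothetical, and no shuffling is done here, which a real evaluation would usually add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    fold_sizes = [base + (1 if i < extra else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(score_fn, n_samples, k=5):
    """Average a caller-supplied score_fn(train, test) over k folds."""
    scores = [
        score_fn(train, test)
        for train, test in k_fold_indices(n_samples, k)
    ]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), len(test))
```

Comparing at least 3 models then amounts to running the same folds for each model and comparing the averaged scores.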
2.2.2 Project Report
Based on your implementation in the Jupyter notebook, you are required to
write a report, SIT742T2Report.pdf, of 1000-1500 words, which should include the
following information:
(1) The required report ‘Section 1’ to ‘Section 3’ (results and analysis) as specified
in the notebook.
(2) In the report’s ‘Section 4’, discuss any findings you can draw from this data set,
such as any rising stars, any omni players, etc. (10%)
(3) In the report’s ‘Section 5’, reflect on the project group activities, such as the task
distribution, the contribution of each group member, and what you have learnt
during this project. (10%)
2.3 What to Submit?
2.3.1 Important Dates
Please be aware of the following important dates:
Group Sign-Up The group needs to be finalized on CloudDeakin by 23:59, 18/04/2020
(Week 06 Saturday). If there is any issue or a group correction is needed, please email
the SIT742 unit chair by 23:59, 18/04/2020 (Week 06 Saturday).
Final Submission The due date for this assessment task submission is 23:59,
30/05/2020 (Week 11 Sunday; revised from 23/05/2020, Week 10).
2.3.2 Files to Submit
Please familiarise yourself with the General Requirements (see Section 0.2) on Assignments
Submission. By the due date, you are required to submit the following files to the
corresponding Assignment (Dropbox) in CloudDeakin:
SIT742Task1.ipynb Your Jupyter notebook solution source file for the data exploration
of the bank marketing data. You can fill in your group information in the relevant
place in the first markdown cell. Please follow the PEP 8 guidelines (Section 3.1)
for source code style.
SIT742T2Report.pdf A 1000-1500 word report describing and discussing your analysis
results and reflecting on the project group activities.
No Special Consideration will be granted for this project. Students who have difficulty
meeting the deadline because of illness, etc. must apply for an assignment extension no
later than noon on the day prior to the deadline.
APPENDIX THREE
Contents
3.1 Code Style: PEP 8
3.2 Academic Skills
3.2.1 How to Find Papers?
3.2.2 How to Read Papers?
3.2.3 How to Write a Paper?
3.1 Code Style: PEP 8
PEP 8 is the de facto code style guide for Python (https://www.python.org/dev/peps/pep-0008/).
Skim the style guide to gain a basic understanding of what is required.
Conforming your Python code to PEP 8 is generally a good idea and helps make the code
more consistent when working on projects with other developers.
In your assessment task, if the source code or IPython notebook is to be included, you
are required to format your code so that it meets at least the following major PEP 8
guidelines:
Comment Please follow this style for Python comments:
1. To explain the functionality of a group of statements, place block comments
before the statements. Indent the comments to the same level as the code.
2. Write documentation strings (i.e. docstrings) for your functions.
Code Lay-out Please follow this style for Python code layout:
1. Blank lines: Surround top-level function and class definitions with two blank
lines. Use blank lines within functions, sparingly, to indicate logical sections.
2. Indentation: Use four spaces instead of tabs for indentation.
White spaces in expressions and statements Please follow this style for Python
white spaces:
1. Surround binary operators with a single space on either side.
2. If operators with different priorities are used, consider adding whitespace around
the operators with the lowest priority(ies). However, never use more than one
space.
You should use:
    i = i + 1
    num += 1
    x = x*2 - 1
rather than this:
    i=i+1
    num +=1
    x = x * 2 - 1
String quotes Use either single-quoted or double-quoted strings. Pick one of them and
stick to it for consistency. Only use the other one when a string contains single or
double quote characters.
Naming Conventions Make sure the names of your variables follow a consistent style, e.g.
lowercase, lower_case_with_underscores, or mixedCase.
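The guidelines above can be seen together in a short example. This is an illustrative sketch only: the function names and values are made up, and it shows one consistent set of choices (single-quoted strings, lower_case_with_underscores names) among those the guidelines allow.

```python
def average_salary(salaries):
    """Return the mean of a list of salaries, or 0.0 for an empty list."""
    if not salaries:
        return 0.0

    # Sum the values first, then divide by the number of values.
    total = 0
    for salary in salaries:
        total += salary
    return total / len(salaries)


def describe_person(name, salary):
    """Return a one-line description of a person's salary."""
    return name + ' earns ' + str(salary)


mean_salary = average_salary([50000, 70000])
print(describe_person('Alex', mean_salary))
```

Note the docstring on each function, the block comment indented to the level of the code it explains, the two blank lines around top-level definitions, the four-space indentation, and the single spaces around binary operators.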
3.2 Academic Skills
3.2.1 How to Find Papers?
For the assessment tasks in this unit, you can try to find related references in
highly respected journals and conferences. You can find papers through Scopus, IEEE Xplore,
ACM Portal, Elsevier ScienceDirect, and DBLP.