Phase 1 Due at the End of Week 5 (18%):
– Data cleaning and preprocessing (dealing with missing values, dropping ID-like columns,
data aggregation, etc.) as appropriate.
– Data exploration and visualisation (charts, graphs, interactions, etc.) as appropriate.
Phase 2 Due at the End of Week 12 (27%):
– Predictive modelling of data as appropriate.
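The Phase 1 cleaning steps above can be sketched in pandas (a minimal toy example; the column names, the median/mode fill strategies, and the tiny inline dataset are illustrative assumptions, not requirements):

```python
import pandas as pd
import numpy as np

# Hypothetical toy dataset standing in for your own CSV.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],                          # ID-like column: no predictive signal
    "age": [25, np.nan, 47, 38],                 # numeric feature with a missing value
    "city": ["Perth", "Sydney", None, "Perth"],  # categorical feature with a missing value
    "income": [52000, 61000, 87000, 66000],      # target feature
})

# Drop ID-like columns.
df = df.drop(columns=["id"])

# Fill missing numeric values with the median, categorical values with the mode.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df.isna().sum().sum())  # 0 — no missing values remain
```

How you actually handle missing values (imputation vs. dropping rows) should be justified for your particular dataset; the above merely shows the mechanics.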
1. This is a loosely defined project to give you the maximum level of flexibility. In
particular, you will need to choose a project dataset yourself.
2. Dataset resources: There are no hard restrictions on the dataset that you can select
for your project. You can choose a public dataset from popular data repositories, or
you can find some other suitable dataset on any website on the Internet. You can
also use data from your work. Please check the bottom of this
page for some suggested resources for finding a suitable dataset for your project.
3. Guidance for selecting a dataset: As a piece of friendly advice, you might want to
select a dataset in line with your future career plans and the particular industries
you are interested in. For example, if you plan on working in the banking industry
(or conducting academic research in this area), you might want to select a
finance-related dataset so that your course project can be a talking point during
your job interviews.
4. Blacklisted datasets: The dataset you choose obviously needs to be appropriate for
a major machine learning course project. For instance, the following datasets are
not allowed:
– US Adult Income Dataset
– Wisconsin Breast Cancer Dataset
5. Minimum requirements: Your dataset must have at least 200 rows and at least 8
descriptive (that is, independent or explanatory) features after dropping all
unnecessary features but before one-hot-encoding of any categorical
descriptive features. Please remember: “features”, “attributes”, or “variables” are all
the same thing: they are just columns in your dataset. Likewise, “dependent”,
“target”, and “response” all refer to the same feature: the variable you are
predicting (as part of a supervised machine learning problem).
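One way to sanity-check these minimum requirements is to count rows and descriptive features after dropping unnecessary columns but before one-hot encoding (a sketch; the DataFrame contents and the target column name `target` are hypothetical):

```python
import pandas as pd

# Hypothetical DataFrame: 8 descriptive features plus one target column.
df = pd.DataFrame({f"feat{i}": range(200) for i in range(8)} | {"target": range(200)})

n_rows = len(df)
n_descriptive = df.shape[1] - 1  # all columns except the target feature

# Check the minimum requirements before any one-hot encoding.
assert n_rows >= 200, "need at least 200 rows"
assert n_descriptive >= 8, "need at least 8 descriptive features"
print(n_rows, n_descriptive)  # 200 8
```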
6. Random sampling for very large datasets: There is no upper limit on the number of
rows, but if your dataset has more than 5000 rows, you might want to select a random
subset with at most 5000 rows so that you do not fry your laptop! That is, you will not
lose any points for selecting only a relatively small subset of rows in case your
dataset has too many rows.
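Subsampling like this is a one-liner in pandas (a sketch; the fixed `random_state` for reproducibility and the toy 20,000-row DataFrame are assumptions beyond the text):

```python
import pandas as pd

# Hypothetical large dataset with 20,000 rows.
df = pd.DataFrame({"x": range(20_000), "y": range(20_000)})

# Keep at most 5000 randomly chosen rows; fix the seed so the subset is reproducible.
if len(df) > 5000:
    df = df.sample(n=5000, random_state=42).reset_index(drop=True)

print(len(df))  # 5000
```

Fixing the seed matters here: it lets you (and your marker) rerun the notebook and get the same subset, and hence the same downstream results.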