Machine learning coursework
This coursework is designed for you to apply some of the methods that you
have learned during our Machine Learning unit and that are also commonly
applied in practice. Given that this is your only assessment for this unit, the
coursework is designed to be relatively open-ended with some guidelines, so
that you can demonstrate your knowledge of what was taught – both in the
labs and in the lectures.
In this coursework, we will focus on the classic hand-written MNIST dataset
and the California housing regression dataset. We recommend that you first
get a basic implementation, and start writing your report with some plots
with results across all four topics, and then gradually improve them. Where
suitable you should discuss your results in light of the concepts covered in
the lectures (e.g. curse of dimensionality, overfitting, etc.).
2.1 Analysing MNIST
To gain a deeper understanding of a particular dataset it is often a good
strategy to analyse it using unsupervised methods.
Run PCA on the MNIST dataset. How much variance does each principal
component explain? Plot the two components that explain the most variance.
Interpret and discuss your results.
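A minimal sketch of the PCA step, using scikit-learn's small load_digits set as a lightweight stand-in for MNIST (for the coursework itself, swap in the full dataset, e.g. via fetch_openml):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# load_digits is an 8x8-pixel stand-in for MNIST used to keep this sketch fast
X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)

# Fraction of variance explained by each principal component
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i + 1}: {ratio:.3f}")

# The first two columns of X_pca are the coordinates to scatter-plot,
# e.g. plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y) with matplotlib.
```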
Apply K-means (with K = 10) using the first two components from the PCA
analysis above. Plot your clusters in 2D and relate them to the digit classes.
What does each cluster correspond to? How good is the match between a
given cluster and a specific digit? Interpret and discuss your results.
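One way to relate clusters to digit classes is to look at the majority digit within each cluster; a rough sketch, again with load_digits standing in for MNIST:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)  # small stand-in for MNIST
X2 = PCA(n_components=2).fit_transform(X)  # first two principal components

km = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = km.fit_predict(X2)

# Majority digit per cluster gives a simple cluster-to-class match score
for k in range(10):
    digits_in_k = y[labels == k]
    majority = np.bincount(digits_in_k, minlength=10).argmax()
    purity = (digits_in_k == majority).mean()
    print(f"cluster {k}: mostly digit {majority} (purity {purity:.2f})")
```

Scatter-plotting X2 coloured by `labels` (and, side by side, by the true digit `y`) makes the comparison visual.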
2.2 Classifying MNIST
Train an ANN and plot the training and validation learning curves. Does the
model overfit? What are your results on the test dataset? Interpret and
discuss your results. How do they compare with SVMs? How do the
hyperparameters (e.g. learning rate) affect performance?
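A minimal sketch using scikit-learn's MLPClassifier (a Keras or PyTorch network would work equally well), with load_digits standing in for MNIST to keep the example fast:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)  # small stand-in for MNIST
X_train, X_test, y_train, y_test = train_test_split(
    X / 16.0, y, test_size=0.3, random_state=0)

ann = MLPClassifier(hidden_layer_sizes=(64,), learning_rate_init=1e-3,
                    max_iter=200,
                    early_stopping=True,      # holds out a validation split
                    validation_fraction=0.1, random_state=0)
ann.fit(X_train, y_train)

# ann.loss_curve_ holds the training loss per epoch and
# ann.validation_scores_ the validation accuracy per epoch -- plot both
# to diagnose overfitting, then report the held-out test score.
print(f"test accuracy: {ann.score(X_test, y_test):.3f}")
```

Varying `learning_rate_init` and `hidden_layer_sizes` and re-plotting the curves gives the hyperparameter analysis the question asks for.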
Train an SVM (with a kernel of your choice) and perform the same analyses as
for the ANN. Interpret and discuss your results. Does the model overfit?
How does it compare with the ANN, and why? How does the type of kernel
(e.g. linear, RBF, etc.) affect performance?
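A companion sketch for the SVM experiment, again with load_digits standing in for MNIST; swapping the `kernel` argument (e.g. "linear", "rbf", "poly") gives the kernel comparison:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # small stand-in for MNIST
X_train, X_test, y_train, y_test = train_test_split(
    X / 16.0, y, test_size=0.3, random_state=0)

svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)

# Comparing train and test accuracy is a first check for overfitting
print(f"train accuracy: {svm.score(X_train, y_train):.3f}")
print(f"test accuracy:  {svm.score(X_test, y_test):.3f}")
```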
2.3 Bayesian linear regression with PyMC3
In this task you are required to use PyMC3 to perform Bayesian linear
regression on the California housing dataset, which is easily available via the
sklearn.datasets.fetch_california_housing function. The goal with this dataset
is to predict the median house value in a ‘block’ in California. A block is
a small geographical area with a population of between 600 and 3000 people. Each datapoint in this dataset corresponds to a block. Consult the
scikit-learn documentation for details of the predictor variables.
As always with Bayesian analysis it is up to you to choose your prior
distributions. Be sure to justify your choice of priors in your report. What
do the results produced by PyMC3 tell you about what influences house
value in California? Is it necessary and/or useful to transform the data in
some way before running MCMC?
2.4 Building an ensemble
Here, you will implement both of the following parts for the California housing regression task.
2.4.1 Random Forest
This part builds on the related lab (week 7). First, run a random forest
regressor on the California housing dataset, and contrast this with your
previous Bayesian linear regression method. For this you can use the
RandomForestRegressor class from scikit-learn.
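A sketch of the random forest baseline; make_regression provides a self-contained stand-in here, and should be replaced with fetch_california_housing for the coursework:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the housing dataset
X, y = make_regression(n_samples=1000, n_features=8, noise=10.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(f"R^2 on test set: {rf.score(X_test, y_test):.3f}")
```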
Analyse the effect of the hyperparameters of the random forest, such as
the number of estimators (or base models, i.e., the number of decision trees
that are combined into the random forest). Look at the constructor of the
RandomForestRegressor class to see what hyperparameters you can set. In
your analysis, include the following plots and discussions but you may wish
to add further analysis of your own:
1. Plot the relationship between a hyperparameter and the performance
of the model.
2. Optimise the hyperparameter on a validation set.
3. Plot the trade-off between time taken for training and prediction performance.
4. What do you think is a good choice for the number of estimators on
this dataset?
5. What is the effect of setting the maximum tree depth or maximum
number of features?
6. Is the random forest interpretable? Are the decision trees that make
up the forest interpretable?
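One way to produce the plots asked for in points 1–3 is to sweep a hyperparameter such as n_estimators, recording validation performance and training time at each setting (make_regression again stands in for the housing data):

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=8, noise=10.0,
                       random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

results = []  # (n_estimators, validation R^2, fit time in seconds)
for n in [10, 50, 100]:
    rf = RandomForestRegressor(n_estimators=n, random_state=0)
    start = time.perf_counter()
    rf.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    score = rf.score(X_val, y_val)
    results.append((n, score, elapsed))
    print(f"n_estimators={n:3d}  val R^2={score:.3f}  fit time={elapsed:.2f}s")

# Plot n_estimators against the score and time columns of `results`
# with matplotlib to visualise the performance/time trade-off.
```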
2.4.2 Stacking
Bayesian linear regression and decision trees are two very different approaches
to regression. Ensemble methods can exploit such diversity between different
methods to improve performance. So now you will try combining the random
forest and Bayesian linear regression using stacking. Scikit-learn includes
the StackingRegressor class to help you with this. In the report, explain
the stacking approach and describe your results, making sure to cover the
following:
1. When does stacking improve performance over the individual models
(e.g., try stacking with a random forest with max_depth = 10 and
n_estimators = 10)?
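A hedged sketch of the stacking step. BayesianRidge stands in for the hand-built PyMC3 model here, since StackingRegressor expects scikit-learn estimators, and make_regression stands in for the housing data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=800, n_features=8, noise=15.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingRegressor(
    estimators=[
        # Base models: the hyperparameters suggested in point 1
        ("rf", RandomForestRegressor(max_depth=10, n_estimators=10,
                                     random_state=0)),
        ("blr", BayesianRidge()),
    ],
    final_estimator=BayesianRidge(),  # meta-model combining the two
)
stack.fit(X_train, y_train)
print(f"stacked R^2: {stack.score(X_test, y_test):.3f}")
```

Comparing this score against each base model fitted on its own shows when stacking actually helps.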