
Machine learning coursework

1 Introduction

This coursework is designed for you to apply some of the methods that you

have learned during our Machine Learning unit and that are also commonly

applied in practice. Given that this is your only assessment for this unit, the

coursework is designed to be relatively open-ended with some guidelines, so

that you can demonstrate your knowledge of what was taught – both in the

labs and in the lectures.

2 Tasks

In this coursework, we will focus on the classical hand-written MNIST dataset

and the California housing regression dataset. We recommend that you first

get a basic implementation working, start writing your report with some

plots of results across all four topics, and then gradually improve them. Where

suitable you should discuss your results in light of the concepts covered in

the lectures (e.g. curse of dimensionality, overfitting, etc.).

2.1 Analysing MNIST

To gain a deeper understanding of a particular dataset it is often a good

strategy to analyse it using unsupervised methods.

2.1.1 PCA

Run PCA on the MNIST dataset. How much variance does each principal

component explain? Plot the two components that explain the most variance.

Interpret and discuss your results.
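The PCA step might be sketched as follows. This is a minimal sketch assuming scikit-learn is available; it uses the small built-in 8x8 digits dataset as a stand-in so the snippet runs quickly, and the full 28x28 MNIST can be substituted via fetch_openml("mnist_784").

```python
# Sketch: PCA on MNIST-like digit data (8x8 digits as a stand-in for MNIST).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)   # 1797 samples, 64 pixel features

pca = PCA(n_components=10)
X_2d = pca.fit_transform(X)[:, :2]    # keep the two strongest components for plotting

# Fraction of total variance explained by each principal component,
# sorted from largest to smallest
print(pca.explained_variance_ratio_)

# X_2d[:, 0] vs X_2d[:, 1] can now be scattered, coloured by the digit
# label y, e.g. plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y) with matplotlib.
```

The explained_variance_ratio_ array answers the "how much variance" question directly; cumulative sums of it show how many components are needed to retain, say, 90% of the variance.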

2.1.2 K-Means

Apply K-means (with K = 10) using the first two components from the PCA

analysis above. Plot your clusters in 2D and relate them to the digit classes.

What does each cluster correspond to? How good is the match between a

given cluster and a specific digit? Interpret and discuss your results.
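One way to relate clusters to digit classes is to look at the majority label within each cluster. A minimal sketch, again assuming scikit-learn and the 8x8 digits data as a stand-in for MNIST:

```python
# Sketch: K-means (K = 10) on the first two PCA components.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_2d)

# For each cluster, report the most common digit and how pure the cluster is
for k in range(10):
    labels, counts = np.unique(y[km.labels_ == k], return_counts=True)
    purity = counts.max() / counts.sum()
    print(f"cluster {k}: mostly digit {labels[counts.argmax()]} "
          f"({purity:.0%} of members)")
```

Low purity values indicate digits that overlap in the two-dimensional projection, which is worth discussing in terms of how much information only two principal components retain.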

2.2.1 ANNs

Train an ANN, plot the training and validation learning curves. Does the

model overfit? What are your results in the testing dataset? Interpret and

discuss your results. How do they compare with SVMs? How do the hyperparameters (e.g. learning rate) impact performance?
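A possible starting point, assuming scikit-learn's MLPClassifier (other frameworks such as PyTorch would also be acceptable) and the 8x8 digits data as a quick stand-in for MNIST:

```python
# Sketch: a small ANN with a held-out validation set for learning curves.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X / 16.0, y, test_size=0.2, random_state=0)  # scale pixels to [0, 1]

ann = MLPClassifier(hidden_layer_sizes=(64,), learning_rate_init=1e-3,
                    early_stopping=True, validation_fraction=0.1,
                    max_iter=200, random_state=0).fit(X_train, y_train)

# ann.loss_curve_ (training loss per iteration) and ann.validation_scores_
# (validation accuracy per iteration) give the two learning curves;
# a training score that keeps improving while validation stalls or drops
# suggests overfitting.
print("test accuracy:", ann.score(X_test, y_test))
```

Re-running with different learning_rate_init values (e.g. 1e-1, 1e-3, 1e-5) is a simple way to generate the hyperparameter discussion the task asks for.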

2.2.2 SVMs

Train an SVM (with a chosen Kernel) and perform the same analyses as for

ANNs. Interpret and discuss your results. Does the model overfit? How

do they compare with ANNs? And why? How does the type of kernel (e.g.

linear, RBF, etc.) impact performance?
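The SVM can be trained on the same split used for the ANN so the two are directly comparable. A minimal sketch assuming scikit-learn; swapping kernel="rbf" for "linear" or "poly" gives the kernel comparison:

```python
# Sketch: an SVM with an RBF kernel on the same train/test split as the ANN.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X / 16.0, y, test_size=0.2, random_state=0)

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

# A training accuracy far above the test accuracy is one symptom of overfitting
print("train accuracy:", svm.score(X_train, y_train))
print("test accuracy:", svm.score(X_test, y_test))
```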

2.3 Bayesian linear regression with PyMC3

In this task you are required to use PyMC3 to perform Bayesian linear regression on the California housing dataset, which is easily available via the

sklearn.datasets.fetch_california_housing function. The goal with this dataset

is to predict the median house value in a ‘block’ in California. A block is

a small geographical area with a population of between 600 and 3000 people. Each datapoint in this dataset corresponds to a block. Consult the

scikit-learn documentation for details of the predictor variables.

As always with Bayesian analysis it is up to you to choose your prior

distributions. Be sure to justify your choice of priors in your report. What

do the results produced by PyMC3 tell you about what influences house

value in California? Is it necessary and/or useful to transform the data in

some way before running MCMC?

2.4 Building an ensemble

Here, you will implement both of the following steps for the California Housing regression task.

2.4.1 Random Forest

This part builds on the related lab (week 7). First, run a random forest

regressor for the California housing dataset, and contrast this with your

previous Bayesian linear regression method. For this you can use the RandomForestRegressor class from Scikit-learn.

Analyse the effect of the hyperparameters of the random forest, such as

the number of estimators (or base models, i.e., the number of decision trees

that are combined into the random forest). Look at the constructor of the

RandomForestRegressor class to see what hyperparameters you can set. In

your analysis, include the following plots and discussions, but you may wish

to add further analysis of your own:

1. Plot the relationship between a hyperparameter and the performance

of the model.

2. Optimise the hyperparameter on a validation set.

3. Plot the trade-off between training time and prediction performance.

4. What do you think is a good choice for the number of estimators on

this dataset?

5. What is the effect of setting the maximum tree depth or maximum

number of features?

6. Is the random forest interpretable? Are the decision trees that make

up the forest interpretable?
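Points 1-4 above can be approached with a simple sweep over one hyperparameter on a held-out validation set. A minimal sketch assuming scikit-learn (the chosen values of n_estimators and max_depth are just examples):

```python
# Sketch: sweeping n_estimators for a random forest on California housing.
import time
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

for n in (10, 50, 100):
    start = time.time()
    rf = RandomForestRegressor(n_estimators=n, max_depth=10,
                               random_state=0, n_jobs=-1).fit(X_train, y_train)
    print(f"n_estimators={n:3d}  validation R^2={rf.score(X_val, y_val):.3f}  "
          f"fit time={time.time() - start:.1f}s")
```

Plotting validation R^2 and fit time against n_estimators from such a sweep gives the performance and trade-off plots directly; the same loop works for max_depth or max_features.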

2.4.2 Stacking

Bayesian linear regression and decision trees are two very different approaches

to regression. Ensemble methods can exploit such diversity between different

methods to improve performance. So now you will try combining the random

forest and Bayesian linear regression using stacking. Scikit-learn includes

the StackingRegressor class to help you with this. In the report, explain

the stacking approach and describe your results, making sure to cover the

following points:

1. When does stacking improve performance over the individual models

(e.g., try stacking with a random forest with max_depth = 10 and

n_estimators = 10)?
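A minimal sketch of the stacking setup, assuming scikit-learn. BayesianRidge stands in here for the PyMC3 regression; plugging your actual PyMC3 model into StackingRegressor would require a small scikit-learn-compatible wrapper with fit and predict methods.

```python
# Sketch: stacking a random forest with a Bayesian linear model.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

rf = RandomForestRegressor(max_depth=10, n_estimators=10, random_state=0)
blr = BayesianRidge()

# The final (meta) estimator defaults to RidgeCV, trained on out-of-fold
# predictions of the base models
stack = StackingRegressor(estimators=[("rf", rf), ("blr", blr)])
stack.fit(X_train, y_train)

print("random forest  R^2:", rf.fit(X_train, y_train).score(X_test, y_test))
print("Bayesian ridge R^2:", blr.fit(X_train, y_train).score(X_test, y_test))
print("stacked model  R^2:", stack.score(X_test, y_test))
```

Comparing the three test scores shows directly whether the ensemble beats its individual members, which is the question posed above.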
