MAST90138 Assignment 3: R Data Analysis



The assignment contains 3 problems worth a total of 100 points, which will count towards
15% of the final mark for the course. If you typeset your assignment neatly with LaTeX and
knitr, you can earn up to an extra 0.75% towards the final mark for the course as extra
credit.

Use tables, graphs and concise text explanations to support your answers. Unclear answers
may not be marked, at your own cost. All tables and graphs must be clearly labelled
and commented.

No late submission is allowed.

Data: In the assignment you will analyze some rainfall data. The dataset is available in .txt
format on the LMS website. To load the data into R you can use the function read.table()
or any command of your choice. You may need to change the data format (data frames
or matrices) depending on the task. The data are split into a training set and a test set.

The training set contains $p = 365$ explanatory variables $X_1, \ldots, X_p$ and one class membership
($G = 0$ or $1$) for $n_{\text{train}} = 150$ individuals. The test set contains $p = 365$ explanatory variables
$X_1, \ldots, X_p$ and one class membership ($G = 0$ or $1$) for $n_{\text{test}} = 41$ individuals.

In these data, for each individual, $X_1, \ldots, X_p$ correspond to the amount of rainfall on each
of the $p = 365$ days in a year. Each individual in this case is a place in Australia located either
in the North ($G = 0$) or in the South ($G = 1$) of the country. Thus, the two classes (North
and South) are coded by 0 and 1.

You will use the training data to fit your models or train classifiers. Once you have fitted
your model or trained your classifiers with the training data, you will need to check how well
the fitted models/trained classifiers work on the test data.

The training and test data are placed in two separate text files: XGtrainRain.txt, which
contains the training X data (values of the $p$ explanatory X-variables) for $n_{\text{train}} = 150$
individuals as well as their class (0 or 1) label, and XGtestRain.txt, which contains the test X
data (values of the $p$ explanatory X-variables) for $n_{\text{test}} = 41$ individuals as well as their
class (0 or 1) label. The test class membership is provided to you ONLY TO COMPUTE THE ERROR OF
CLASSIFICATION of your classifier.
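
As a rough illustration of the loading step, the sketch below writes a tiny made-up file and reads it back with read.table(). The layout it assumes (a header row and a label column named G) is only an assumption for illustration; check the actual XGtrainRain.txt and XGtestRain.txt files and adjust the arguments accordingly.

```r
# Toy stand-in for XGtrainRain.txt. ASSUMPTION: the real file has a header
# row and a label column named "G"; neither is confirmed by the handout.
set.seed(1)
toy <- data.frame(matrix(rexp(5 * 4), nrow = 5))   # 5 places, 4 "days"
names(toy) <- paste0("X", 1:4)
toy$G <- c(0, 1, 0, 1, 1)
path <- file.path(tempdir(), "toy_XGtrain.txt")    # hypothetical file name
write.table(toy, path, row.names = FALSE)

# Read the file back and split it into predictors and class labels.
train   <- read.table(path, header = TRUE)
X_train <- as.matrix(train[, names(train) != "G"]) # n x p numeric matrix
G_train <- train$G                                 # 0 = North, 1 = South
dim(X_train)
table(G_train)
```

Keeping the predictors as a matrix and the labels as a separate vector makes it easy to switch between functions that expect matrices and functions that expect data frames.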

Please include all the necessary R code to answer the questions, but not
superfluous R code that is not relevant. Marks may be taken off for R code that is
poorly presented.

You may take classification error/test error to be the proportion/percentage out of the
41 test samples that are misclassified.
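
The definition above amounts to a one-line computation. The following is a minimal sketch in which class_error is a hypothetical helper name and the label vectors are made up for illustration, not taken from the rainfall data.

```r
# Proportion of misclassified cases. `predicted` and `actual` are
# hypothetical vectors of 0/1 labels of equal length.
class_error <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  mean(predicted != actual)
}

# Made-up labels for illustration only (not the rainfall data).
actual    <- c(0, 0, 1, 1, 1)
predicted <- c(0, 1, 1, 1, 0)
class_error(predicted, actual)   # 2 of 5 wrong, i.e. 0.4
```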

Problem 1 [60 marks]:

In this problem you will train quadratic discriminant analysis (QDA) and logistic regression
classifiers to predict the class labels (0 or 1) in the test set.

(a) Use standard functions in R to train the QDA classifier and the logistic classifier, with all
the p predictors in the training set. What happened? And why did it happen? Do you
recommend using these two classifiers on the test set? (Hint: For the logistic classifier,
use the summary function to take a look at the trained model object) [10]

(b) Use the prcomp and plsr (package pls) functions to obtain, respectively, the PCA and
PLS (partial least squares) components of the explanatory variables in the training set.
Here, when considering the covariance maximization problem of PLS, we maximise the
covariance between $X = (X_1, \ldots, X_p)^T$ and $Y = \mathbf{1}\{G = 1\}$, the indicator variable that an
individual belongs to group 1. For each case, you will need to use the "projection matrix"
(i.e., the matrices for PCA and for PLS discussed in class) reported by the function to re-compute
the components "manually" to check that you understand how the components are
obtained. [10]

(c) Train a QDA classifier with the PLS components, and another one with the PCA
components. In each case, pick the number of components to use based on leave-one-out
cross-validation (LOOCV); consider using up to 50 components. Plot the leave-one-out CV
error against the number of components considered. Report the final chosen number of
components. (Refer to the lab in Week 7 to get some ideas.)

Do the same for the logistic classifier.

(If you want to pick your number of components based on methods other than LOOCV,
please explain your choice in a clear and concise manner)

(d) For each of the QDA and logistic classifiers, which version (PCA or PLS) do you prefer?
Why? (Answer this question without any knowledge of the test-set results in the next
problem) [5]

(e) Apply your trained classifiers in (c) to the test set, and report the resulting classification
error (test error). Be careful about how you should center the data in your test set to
produce your prediction. The lab in Week 7 may give you some ideas again. [15]
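
The workflow in parts (b), (c) and (e) can be sketched on synthetic data as follows. This is only an illustration under stated assumptions, not a model answer: the data are simulated rather than read from the rainfall files, only the PCA plus QDA combination is shown, at most 10 components are tried, and the PCA is computed once on the full training set instead of being recomputed inside each leave-one-out fold. The point it illustrates is that test scores are obtained with the training means and the training loadings.

```r
library(MASS)  # qda(); MASS ships with standard R installations

set.seed(1)
# Synthetic stand-in for the rainfall data (NOT the real files).
n_train <- 60; n_test <- 20; p <- 100
G_train <- rep(0:1, each = n_train / 2)
G_test  <- rep(0:1, each = n_test / 2)
X_train <- matrix(rnorm(n_train * p), n_train, p) + 0.8 * G_train
X_test  <- matrix(rnorm(n_test * p),  n_test,  p) + 0.8 * G_test

# PCA on the training predictors only (prcomp centres by default).
pca <- prcomp(X_train)

# Manual scores: centre with the TRAINING means, then project.
Z_train <- scale(X_train, center = pca$center, scale = FALSE) %*% pca$rotation
stopifnot(isTRUE(all.equal(unname(Z_train), unname(pca$x))))

# Leave-one-out CV error of QDA for k = 1, ..., K components.
# qda(..., CV = TRUE) returns the leave-one-out predicted classes.
# Simplification: the PCA is not recomputed for each left-out case.
K <- 10
cv_err <- sapply(1:K, function(k) {
  fit <- qda(Z_train[, 1:k, drop = FALSE], grouping = G_train, CV = TRUE)
  mean(fit$class != G_train)
})
plot(1:K, cv_err, type = "b",
     xlab = "Number of PCA components", ylab = "LOOCV error")
k_best <- which.min(cv_err)

# Test scores use the training means and training loadings.
Z_test <- scale(X_test, center = pca$center, scale = FALSE) %*% pca$rotation
fit  <- qda(Z_train[, 1:k_best, drop = FALSE], grouping = G_train)
pred <- predict(fit, Z_test[, 1:k_best, drop = FALSE])$class
test_err <- mean(pred != G_test)
test_err
```

Centring the test set with its own column means would project it onto a slightly different coordinate system from the one the classifier was trained in, which is why the training centre is reused here.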