R Programming Assignment Help | Problem Set 3: Hyperparameter Search, Decision Trees, and Regularization

This is a sample R machine learning coding assignment.

1. Regularization short questions

For each of parts a–f, indicate which of the following statements is correct:

1. Will have better performance due to increased flexibility when its increase in bias is less than its
decrease in variance.

2. Will have better performance due to increased flexibility when its increase in variance is less than its
decrease in bias.

3. Will have better performance due to decreased flexibility when its increase in bias is less than its
decrease in variance.

4. Will have better performance due to decreased flexibility when its increase in variance is less than its
decrease in bias.

5. The models are equivalent so their performance will be the same.

6. Not enough information to tell.
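
Recall the bias–variance decomposition of expected test error, which is what statements 1–4 are trading off (here f̂ is the fitted model and ε the irreducible noise):

E[(y₀ − f̂(x₀))²] = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε)

Increasing flexibility typically lowers bias and raises variance; decreasing flexibility does the reverse. A change in flexibility helps only when the component that shrinks outweighs the component that grows.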

a. Polynomial regression of degree 3, relative to polynomial regression of degree 5 (choose from answers
above):

b. Ridge regression, relative to least squares (choose from the answers above):

c. The lasso with λ = 0.005, relative to the lasso with λ = 0.1 (choose from answers above):

d. Cost-complexity pruned trees with α = 1 relative to unpruned trees (choose from answers above):

e. Ridge regression, relative to lasso (choose from the answers above):

f. KNN regression with k=1 relative to least squares (choose from answers above):

2. Spam classification using LASSO

Spam detection involves building classifiers that learn patterns in word counts from messages manually
labeled as spam or ham (not spam).

In this problem, we will do this using a logistic regression model with an ℓ1 penalty (the lasso), trained on
a corpus of email messages from the now-defunct company Enron. Each of these messages is labeled
“spam” or “ham” in the variable Spam.Ham.

First, let's install and load the required packages.

#install.packages("doMC")
#install.packages("glmnet")
#install.packages("quanteda")
#install.packages("readtext")

library(doMC)
library(glmnet)
library(quanteda)
library(readtext)

In this first block, I have provided code that downloads, extracts, and preprocesses these data into a matrix
of word counts (columns) for each document (rows). Each document is labeled spam or ham (not spam) in
the document variable Spam.Ham.

if (!file.exists("enron_spam_data.zip")) {
  download.file("https://github.com/MWiechmann/enron_spam_data/raw/master/enron_spam_data.zip", "enron_spam_data.zip")
  unzip("enron_spam_data.zip")
}
texts <- read.csv("enron_spam_data.csv")
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## embedded nul(s) found in input
N <- nrow(texts)
corpus <- corpus(texts, text_field = "Message") # create a corpus
dfm <- dfm(tokens(corpus)) # create features of word counts for each document
dfm <- dfm_trim(dfm, min_termfreq = 50) # remove word features occurring fewer than 50 times
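
To sanity-check the preprocessing, you can inspect the dimensions of the document-feature matrix, its most frequent terms, and the label balance (a quick optional check, not part of the assignment):

dim(dfm) # documents x retained word features
topfeatures(dfm, 10) # ten most frequent terms across the corpus
table(docvars(dfm, "Spam.Ham")) # class balance of the labels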

Below is starter code to help you properly train a lasso model. While troubleshooting or debugging, it may
be helpful to reduce nfolds to 3 so the code runs faster.

tr <- sample(1:N, floor(N/3)) # indexes for training data
registerDoMC(cores = 5) # trains all 5 folds in parallel (at once rather than one by one)
mod <- cv.glmnet(dfm[tr,], docvars(dfm, "Spam.Ham")[tr], nfolds = 5, parallel = TRUE, family = "binomial", type.measure = "class")
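
The rows not sampled into tr can serve as a held-out test set for parts b and c; you may also want to call set.seed() before sample() so the split is reproducible. A minimal sketch (the name te is my own):

te <- (1:N)[-tr] # indexes for held-out test data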

a. Plot misclassification error for all values of λ chosen by cv.glmnet. How many non-zero coefficients
are in the model where misclassification error is minimized? How many non-zero coefficients are in the
model one standard error from the minimum? Explain how the value of λ is related to bias and variance.
How is the value of λ related to an ordinary logit model?
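
One possible sketch for part a: cv.glmnet stores the number of non-zero coefficients for each λ in nzero, alongside lambda.min and lambda.1se:

plot(mod) # CV misclassification error vs. log(lambda)
mod$nzero[mod$lambda == mod$lambda.min] # non-zero coefficients at the error-minimizing lambda
mod$nzero[mod$lambda == mod$lambda.1se] # non-zero coefficients at the 1 S.E. lambda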

b. According to the estimate of the test error obtained by cross-validation, what is the optimal λ stored
in your cv.glmnet() output? What is the CV error for this value of λ? Hint: The vector of λ values
will need to be subsetted by the index of the minimum CV error.
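
A sketch following the hint, indexing the λ vector by the position of the minimum CV error:

idx <- which.min(mod$cvm) # index of the minimum CV error
mod$lambda[idx] # the optimal lambda (equals mod$lambda.min)
mod$cvm[idx] # the CV misclassification error at that lambda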

c. Report and plot test ROC for the λ that minimizes CV error and the 1 S.E. λ. How well did CV error
estimate test error? How is ROC calculated? Why might we want to use ROC over other error metrics
in this particular classification problem?
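
One way to sketch part c, assuming the held-out index te defined earlier and the pROC package (an extra package, not loaded above) to compute ROC curves and AUC:

library(pROC)
te <- (1:N)[-tr] # held-out test rows (repeated here for self-containment)
y_te <- docvars(dfm, "Spam.Ham")[te] # true labels
p_min <- as.numeric(predict(mod, dfm[te,], s = "lambda.min", type = "response")) # spam probabilities
p_1se <- as.numeric(predict(mod, dfm[te,], s = "lambda.1se", type = "response"))
roc_min <- roc(y_te, p_min) # ROC sweeps the decision threshold over the scores
roc_1se <- roc(y_te, p_1se)
plot(roc_min)
lines(roc_1se, col = "red")
auc(roc_min); auc(roc_1se) # area under each curve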

d. For the model you have identified with the minimum CV error, write an analysis of non-zero coefficients
and the terms associated with them. How many coefficients were shrunk to zero? Which terms appear
to be most important? What do the magnitudes and signs of the coefficients represent? Can we
interpret the coefficients as we would in a non-regularized logistic regression model?
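
A sketch for examining the coefficients at lambda.min (note that whether positive coefficients lean "spam" or "ham" depends on how glmnet coded the two factor levels):

b <- as.matrix(coef(mod, s = "lambda.min"))[, 1] # named coefficient vector, intercept first
sum(b == 0) # coefficients shrunk to exactly zero
head(sort(b[b != 0], decreasing = TRUE), 10) # terms with the largest positive coefficients
head(sort(b[b != 0]), 10) # terms with the largest negative coefficients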

e. In parts a–d, we have looked at the body of the email (Message). Do the important terms in a model
of Subject differ from the important terms in the above model of Message? How?
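
A sketch for part e, reusing the pipeline above with text_field switched to Subject (subject lines are short, so the trimmed vocabulary will be much smaller; the guard against missing subjects is a precaution):

texts$Subject[is.na(texts$Subject)] <- "" # guard against missing subject lines
corpus_s <- corpus(texts, text_field = "Subject") # same documents, subject lines only
dfm_s <- dfm_trim(dfm(tokens(corpus_s)), min_termfreq = 50)
mod_s <- cv.glmnet(dfm_s[tr,], docvars(dfm_s, "Spam.Ham")[tr], nfolds = 5, parallel = TRUE,
                   family = "binomial", type.measure = "class")
b_s <- as.matrix(coef(mod_s, s = "lambda.min"))[, 1]
head(sort(b_s[b_s != 0], decreasing = TRUE), 10) # compare against the Message model's top terms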
