Problem Set 3: Hyperparameter Search, Decision Trees, and Regularization
1. Regularization short questions
For each of parts a-f, indicate which of the following is correct:
1. Will have better performance due to increased flexibility when its increase in bias is less than its
 decrease in variance.
2. Will have better performance due to increased flexibility when its increase in variance is less than its
 decrease in bias.
3. Will have better performance due to decreased flexibility when its increase in bias is less than its
 decrease in variance.
4. Will have better performance due to decreased flexibility when its increase in variance is less than its
 decrease in bias.
5. The models are equivalent so their performance will be the same.
6. Not enough information to tell.
a. Polynomial regression of degree 3, relative to polynomial regression of degree 5 (choose from the answers above):
b. Ridge regression, relative to least squares (choose from the answers above):
c. The lasso with λ = .005, relative to the lasso with λ = .1 (choose from the answers above; a small simulated illustration follows this list):
d. Cost-complexity pruned trees with α = 1, relative to unpruned trees (choose from the answers above):
e. Ridge regression, relative to the lasso (choose from the answers above):
f. KNN regression with k = 1, relative to least squares (choose from the answers above):
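To build intuition for part c, here is a minimal, optional sketch on simulated data (the seed, dimensions, and coefficient values are arbitrary choices for illustration, not part of the assignment). A larger λ zeroes out more coefficients, giving a less flexible fit with more bias and less variance.
library(glmnet)
set.seed(1)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1:3] %*% c(3, -2, 1) + rnorm(n) # only the first 3 features matter
fit <- glmnet(X, y)
sum(coef(fit, s = 0.005) != 0) # non-zero coefficients at lambda = .005
sum(coef(fit, s = 0.1) != 0) # fewer non-zero coefficients at lambda = .1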
2. Spam classification using LASSO
Spam detection involves building classifiers that learn patterns in word counts from messages manually labeled as spam or ham (not spam).
In this problem, we will do this using a logistic regression model with an ℓ1 penalty (the lasso), trained on a corpus of email messages from the now-defunct company Enron. Each of these messages is labeled “spam” or “ham” in the variable Spam.Ham.
First, let's install and load the required packages.
#install.packages("doMC")
#install.packages("glmnet")
#install.packages("quanteda")
#install.packages("readtext")
library(doMC)
library(glmnet)
library(quanteda)
library(readtext)
In this first block, I have provided code that downloads, extracts, and preprocesses these data into a matrix
 of word counts (columns) for each document (rows). Each document is labeled spam or ham (not spam) in
 the document variable Spam.Ham.
if (!file.exists("enron_spam_data.zip")) {
  download.file("https://github.com/MWiechmann/enron_spam_data/raw/master/enron_spam_data.zip", "enron_spam_data.zip")
  unzip("enron_spam_data.zip")
}
texts <- read.csv("enron_spam_data.csv")
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## embedded nul(s) found in input
N <- nrow(texts)
corpus <- corpus(texts, text_field="Message") # create a corpus
dfm <- dfm(tokens(corpus)) # create features of word counts for each document
dfm <- dfm_trim(dfm, min_termfreq = 50) # remove word features occurring less than 50 times
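A few optional sanity checks (not required by the assignment) can confirm the preprocessing worked as expected:
dim(dfm) # documents x word features
topfeatures(dfm, 20) # 20 most frequent terms in the corpus
table(docvars(dfm, "Spam.Ham")) # class balance of the labels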
Below is starter code to help you train a lasso model properly. As you work on this problem, it may be helpful when troubleshooting or debugging to reduce nfolds to 3 so that your code runs faster. (A sketch for constructing the held-out test set follows the starter code.)
tr <- sample(1:N, floor(N/3)) # indexes for training data
registerDoMC(cores=5) # trains all 5 folds in parallel (at once rather than one by one)
mod <- cv.glmnet(dfm[tr,], docvars(dfm,"Spam.Ham")[tr], nfolds=5, parallel=TRUE, family="binomial", type.measure="class")
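Parts b and c require evaluating on data the model never saw. One possible way to build the held-out test set (te, test_x, and test_y are names introduced here for illustration, not part of the starter code):
te <- setdiff(1:N, tr) # indexes not used for training
test_x <- dfm[te, ] # held-out word counts
test_y <- docvars(dfm, "Spam.Ham")[te] # held-out labels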
a. Plot misclassification error for all values of λ chosen by cv.glmnet. How many non-zero coefficients are in the model where misclassification error is minimized? How many non-zero coefficients are in the model one standard error from where misclassification error is minimized? Explain how the value of lambda is related to bias and variance. How is the value of lambda related to an ordinary logit model? (A sketch of useful accessor calls follows this question list.)
b. According to the estimate of the test error obtained by cross-validation, what is the optimal λ stored
 in your cv.glmnet() output? What is the CV error for this value of λ? Hint: The vector of λ values
 will need to be subsetted by the index of the minimum CV error.
c. Report and plot the test-set ROC for the λ that minimizes CV error and for the 1 S.E. λ. How well did the CV error estimate the test error? How is ROC calculated? Why might we want to use ROC over other error metrics in this particular classification problem?
d. For the model you have identified with the minimum CV error, write an analysis of non-zero coefficients
 and the terms associated with them. How many coefficients were shrunk to zero? Which terms appear
 to be most important? What do the magnitudes and signs of the coefficients represent? Can we
 interpret the coefficients as we would in a non-regularized logistic regression model?
e. In parts a-d, we have looked at the body of the email (Message). Do the important terms in a model of Subject differ from the important terms in the above model of Message? How?
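As a starting point for parts a and b, the sketch below shows where cv.glmnet stores the quantities the questions ask about; these are standard fields of a cv.glmnet object, but verify them against your own output:
plot(mod) # misclassification error across log(lambda)
mod$lambda.min # lambda minimizing CV error
mod$lambda.1se # largest lambda within 1 S.E. of the minimum
min(mod$cvm) # CV error at lambda.min
sum(coef(mod, s = "lambda.min") != 0) # non-zero coefficients at lambda.min
sum(coef(mod, s = "lambda.1se") != 0) # non-zero coefficients at lambda.1se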