# R Assignment Writing Service | Problem Set 3: Hyperparameter Search, Decision Trees, and Regularization

This is a sample case from an R machine learning code-writing service.

## 1. Regularization short questions

For a-f, indicate which of the following are correct:

1. Will have better performance due to increased flexibility when its increase in bias is less than its decrease in variance.
2. Will have better performance due to increased flexibility when its increase in variance is less than its decrease in bias.
3. Will have better performance due to decreased flexibility when its increase in bias is less than its decrease in variance.
4. Will have better performance due to decreased flexibility when its increase in variance is less than its decrease in bias.
5. The models are equivalent so their performance will be the same.
6. Not enough information to tell.

a. Polynomial regression of degree 3, relative to polynomial regression of degree 5 (choose from the answers above):

b. Ridge regression, relative to least squares (choose from the answers above):

c. The lasso with λ = 0.005, relative to the lasso with λ = 0.1 (choose from the answers above):

d. Cost-complexity pruned trees with α = 1, relative to unpruned trees (choose from the answers above):

e. Ridge regression, relative to the lasso (choose from the answers above):

f. KNN regression with k = 1, relative to least squares (choose from the answers above):
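
Every option above trades bias against variance. As a reminder, under the standard assumption (not stated in the problem set itself) that $y = f(x) + \varepsilon$ with mean-zero noise, the expected squared test error at a point $x_0$ decomposes as:

```latex
\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big]
  = \operatorname{Var}\big(\hat{f}(x_0)\big)
  + \big[\operatorname{Bias}\big(\hat{f}(x_0)\big)\big]^2
  + \operatorname{Var}(\varepsilon)
```

A change in flexibility lowers expected test error only when the term it shrinks falls by more than the term it inflates grows; the noise term $\operatorname{Var}(\varepsilon)$ is unaffected by the choice of model.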

## 2. Spam classification using LASSO

Spam detection involves building classifiers to learn patterns in word counts from manually labeled spam and ham (not spam).

In this problem, we will do this using a logistic regression model with an ℓ1 penalty (the lasso) trained on a corpus of email messages from the now-defunct company ENRON. Each of these messages is labeled “spam” or “ham” in the variable Spam.Ham.

First, let's install and load packages.

    # install.packages("doMC")
    # install.packages("glmnet")
    # install.packages("quanteda")
    # install.packages("readtext")
    library(doMC)
    library(glmnet)
    library(quanteda)
    library(readtext)

In this first block, I have provided code that downloads, extracts, and preprocesses these data into a matrix of word counts (columns) for each document (rows). Each document is labeled spam or ham (not spam) in the document variable Spam.Ham.

    if (!file.exists("enron_spam_data.zip")) {
      download.file("https://github.com/MWiechmann/enron_spam_data/raw/master/enron_spam_data.zip", "enron_s
      unzip("enron_spam_data.zip")
    }

    texts <- read.csv("enron_spam_data.csv")
    ## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
    ## embedded nul(s) found in input
    N <- nrow(texts)
    corpus <- corpus(texts, text_field = "Message") # create a corpus
    dfm <- dfm(tokens(corpus)) # create features of word counts for each document
    dfm <- dfm_trim(dfm, min_termfreq = 50) # remove word features occurring fewer than 50 times
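
To see what the preprocessing steps above produce, here is a minimal sketch on a three-message toy data frame. The data frame `toy`, its contents, and the threshold of 2 are made up for illustration; only the quanteda calls mirror the provided code.

```r
library(quanteda)

# Hypothetical toy data with the same column names as the Enron file.
toy <- data.frame(
  Message  = c("win money now", "meeting at noon", "win a free prize now"),
  Spam.Ham = c("spam", "ham", "spam"),
  stringsAsFactors = FALSE
)

toy_corpus <- corpus(toy, text_field = "Message")   # remaining columns become docvars
toy_dfm    <- dfm(tokens(toy_corpus))               # rows = documents, columns = word counts
toy_dfm    <- dfm_trim(toy_dfm, min_termfreq = 2)   # keep terms seen at least twice overall

dim(toy_dfm)                   # documents x retained terms
featnames(toy_dfm)             # the surviving vocabulary
docvars(toy_dfm, "Spam.Ham")   # labels travel with the documents
```

Only "win" and "now" occur twice across the toy messages, so they are the only columns left after trimming, while all three documents (and their labels) are kept.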

Below is starter code to help you properly train a lasso model. While troubleshooting or debugging, it may help to lower nfolds to 3 to cut the time it takes to run the code.

    tr <- sample(1:N, floor(N/3)) # indexes for training data
    registerDoMC(cores=5) # trains all 5 folds in parallel (at once rather than one by one)
    mod <- cv.glmnet(dfm[tr,], docvars(dfm,"Spam.Ham")[tr], nfolds=5, parallel=TRUE, family="binomial", type

a. Plot misclassification error for all values of λ chosen by cv.glmnet. How many non-zero coefficients are in the model where misclassification error is minimized? How many non-zero coefficients are in the model one standard error from where misclassification error is minimized? Explain how the value of λ is related to bias and variance. How is the value of λ related to an ordinary logit model?

b. According to the estimate of the test error obtained by cross-validation, what is the optimal λ stored in your cv.glmnet() output? What is the CV error for this value of λ? Hint: The vector of λ values will need to be subsetted by the index of the minimum CV error.

c. Report and plot test ROC for the λ that minimizes CV error and the 1 S.E. λ. How well did CV error estimate test error? How is ROC calculated? Why might we want to use ROC over other error metrics in this particular classification problem?

d. For the model you have identified with the minimum CV error, write an analysis of non-zero coefficients and the terms associated with them. How many coefficients were shrunk to zero? Which terms appear to be most important? What do the magnitudes and signs of the coefficients represent? Can we interpret the coefficients as we would in a non-regularized logistic regression model?

e. In parts a-d, we have looked at the body of the email (Message). Do the important terms in a model of Subject differ from important terms in the above model of Message? How?
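
The mechanics behind parts a-d can be sketched on simulated data. This is not a solution for the Enron corpus: the count matrix `x`, the labels `y`, and the helper `auc()` are all hypothetical, `type.measure = "class"` is an assumption (the starter call is cut off after `type`), and "report ROC" is read here as reporting the area under the ROC curve. The fit runs sequentially so the example does not depend on doMC.

```r
library(glmnet)

set.seed(1)
n <- 600; p <- 50
# Simulated word counts; only the first 5 "terms" carry signal (assumption).
x <- matrix(rpois(n * p, lambda = 1), nrow = n,
            dimnames = list(NULL, paste0("term", seq_len(p))))
prob <- plogis(rowSums(x[, 1:5]) - 5)
y <- factor(ifelse(rbinom(n, 1, prob) == 1, "spam", "ham"))  # levels: ham, spam

tr <- sample(seq_len(n), floor(n / 3))   # same split rule as the starter code

# Assumed completion of the call: misclassification error as the CV measure.
fit <- cv.glmnet(x[tr, ], y[tr], nfolds = 5, family = "binomial",
                 type.measure = "class")

# Part a: CV error curve and model sizes at the two reference values of lambda.
plot(fit)
i_min <- which.min(fit$cvm)                      # index of the minimum CV error
i_1se <- which(fit$lambda == fit$lambda.1se)
nonzero <- c(min = fit$nzero[[i_min]], one_se = fit$nzero[[i_1se]])

# Part b: subset lambda and the CV error by that index, as the hint suggests.
lambda_min <- fit$lambda[i_min]
cv_err_min <- fit$cvm[i_min]

# Part c: held-out AUC from the rank (Mann-Whitney) formula.
auc <- function(score, is_pos) {
  r <- rank(score)
  n_pos <- sum(is_pos); n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
is_spam <- y[-tr] == "spam"
auc_min <- auc(drop(predict(fit, newx = x[-tr, ], s = "lambda.min",
                            type = "response")), is_spam)
auc_1se <- auc(drop(predict(fit, newx = x[-tr, ], s = "lambda.1se",
                            type = "response")), is_spam)

# Part d: non-zero coefficients (intercept excluded), largest magnitude first.
b <- as.matrix(coef(fit, s = "lambda.min"))[, 1]
b <- b[names(b) != "(Intercept)"]
n_zero <- sum(b == 0)
top_terms <- b[b != 0][order(abs(b[b != 0]), decreasing = TRUE)]
```

For part e, the same steps would apply after rebuilding the document-feature matrix with `text_field = "Subject"` in place of `"Message"`.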
