# COMS 4771-2 Fall-B 2020 HW 2

## Problem 1 (10 points)
In this problem, you will reason about optimal predictions for mean squared error.
Suppose Y1, . . . , Yn, Y are iid random variables—the distribution of Y is unknown to you. You
observe Y1, . . . , Yn as “training data” and must make a (real-valued) prediction of Y .
(a) Assume Y has a probability density function given by

$$p_\theta(y) := \begin{cases} \frac{1}{\theta^2}\, y\, e^{-y/\theta} & \text{if } y > 0, \\ 0 & \text{if } y \le 0, \end{cases}$$

for some θ > 0. Suppose that θ is known to you. What is the "optimal prediction" $\hat y^\star$ of Y that has the smallest mean squared error $\mathbb{E}[(\hat y^\star - Y)^2]$? And what is this smallest mean squared error?
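A numerical illustration of what "smallest mean squared error" means here (a sketch, not the derivation; the value θ = 2.0, sample size, and grid are made-up choices for the demonstration):

```python
import random

# Sample from p_theta and grid-search for the constant prediction c that
# minimizes the empirical mean squared error E[(c - Y)^2].
random.seed(0)
theta = 2.0
n = 100_000

# p_theta is the Gamma(2, theta) density, so Y has the same distribution
# as the sum of two independent Exponential(rate 1/theta) variables.
ys = [random.expovariate(1 / theta) + random.expovariate(1 / theta)
      for _ in range(n)]

m = sum(ys) / n                   # first empirical moment
s2 = sum(y * y for y in ys) / n   # second empirical moment

def mse(c):
    # E[(c - Y)^2] = c^2 - 2 c E[Y] + E[Y^2], using empirical moments
    return c * c - 2 * c * m + s2

grid = [i * 0.01 for i in range(1001)]  # candidate predictions c in [0, 10]
best = min(grid, key=mse)
print(f"empirical minimizer: {best:.2f} (sample mean: {m:.2f})")
```

Comparing the grid minimizer against the sample mean hints at the general structure of the answer, without substituting for the calculation.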
(b) (Continuing from Part (a).) In reality, θ is unknown to you. Suppose you observe $(Y_1, \dots, Y_n) = (y_1, \dots, y_n)$ for some positive real numbers $y_1, \dots, y_n > 0$. Derive the following:

• the MLE $\hat\theta(y_1, \dots, y_n)$ of θ given this data;
• the prediction $\hat y(y_1, \dots, y_n)$ of Y based on the plug-in principle (using $\hat\theta(y_1, \dots, y_n)$).

Show the steps of your derivation. The MLE and prediction should be given as simple formulas involving $y_1, \dots, y_n$.
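Once you have a candidate formula, a simulation can serve as a sanity check (a sketch; the true θ, sample size, and grid bounds below are arbitrary choices): simulate data from $p_\theta$ with a known θ, maximize the log-likelihood numerically, and see whether the peak sits where your formula says it should.

```python
import math
import random

# Simulate data from p_theta with a known theta, then maximize the
# log-likelihood over a grid of candidate theta values.
random.seed(1)
true_theta = 2.0
n = 20_000
ys = [random.expovariate(1 / true_theta) + random.expovariate(1 / true_theta)
      for _ in range(n)]

# log p_theta(y) = -2 log(theta) + log(y) - y / theta, so the log-likelihood
# depends on the sample only through two sufficient statistics.
sum_y = sum(ys)
sum_log_y = sum(math.log(y) for y in ys)

def log_likelihood(theta):
    return -2 * n * math.log(theta) + sum_log_y - sum_y / theta

grid = [0.5 + i * 0.01 for i in range(451)]  # theta in [0.5, 5.0]
theta_hat = max(grid, key=log_likelihood)
print(f"grid MLE: {theta_hat:.2f} (true theta: {true_theta})")
```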
(c) Now, instead assume Y ∼ Bernoulli(θ) for some θ ∈ [0, 1]. Suppose that θ is known to you. What is the prediction $\hat y^\star$ of Y that has the smallest mean squared error $\mathbb{E}[(\hat y^\star - Y)^2]$? And what is this smallest mean squared error? Your answers should be given in terms of θ. (Note: $\hat y^\star$ is allowed to be any real number!)
(d) (Continuing from Part (c).) Define the following loss function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ by

$$\ell(\hat y, y) := \begin{cases} 2(\hat y - y)^2 & \text{if } \hat y \ge y, \\ (\hat y - y)^2 & \text{if } \hat y < y. \end{cases}$$

This loss function is a different way to measure how "bad" a prediction is. With this loss function, a prediction that is too high is more costly than one that is too low. What is the prediction $\hat y^\star$ of Y that has the smallest expected loss $\mathbb{E}[\ell(\hat y^\star, Y)]$? And what is this smallest expected loss?
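Because Y takes only two values, the expected loss can be computed exactly for any candidate prediction, which makes Part (d) easy to explore numerically (a sketch, assuming θ = 0.5 purely for illustration):

```python
# For Y ~ Bernoulli(theta), the expectation is a weighted sum over the
# two possible outcomes y = 0 and y = 1.

def loss(yhat, y):
    # the asymmetric loss from the problem statement
    return 2 * (yhat - y) ** 2 if yhat >= y else (yhat - y) ** 2

theta = 0.5

def expected_loss(yhat):
    return theta * loss(yhat, 1) + (1 - theta) * loss(yhat, 0)

# Locate the minimizer on a fine grid (searching beyond [0, 1] on purpose,
# since the prediction may be any real number).
grid = [-1 + i / 1000 for i in range(3001)]
yhat_star = min(grid, key=expected_loss)
print(f"grid minimizer: {yhat_star:.3f}")
```

Comparing the grid minimizer to your answer from Part (c) shows how the asymmetry of the loss shifts the optimal prediction downward.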
## Problem 2 (15 points)
In this problem, you’ll practice analyzing a simple data set with linear regression.
Obtain the Jupyter notebook Linear_regression_on_Dartmouth_data.ipynb from Courseworks,
and run the code there (e.g., using Google Colaboratory) to fit linear regression models to the
Dartmouth College GPA data described in lecture.
You’ll now apply a similar linear regression analysis to a data set concerning prostate cancer:
• https://www.cs.columbia.edu/~djhsu/coms4771-f20/data/prostate-train.csv
Regard this data set as “training data” in which the goal is to predict the variable lpsa (the
logarithm of the prostate specific antigen level) using the remaining variables (lcavol, lweight,
age, lbph, svi, lcp, gleason, pgg45) as features.
(a) For each of the eight features, find the best fit affine function of that variable to the label
lpsa. Report the “slope” and “intercept” in each case.
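A minimal least-squares helper for the single-feature fits in Part (a) can be sketched as follows (the actual loading of prostate-train.csv is omitted; the data points below are made-up stand-ins for one feature column and the lpsa column):

```python
def affine_fit(xs, ys):
    """Return (slope, intercept) of the least-squares affine fit y ≈ a*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # slope = sample covariance of (x, y) divided by sample variance of x
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical example: points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = affine_fit(xs, ys)
print(slope, intercept)  # → 2.0 1.0
```

Applying `affine_fit` to each feature column against lpsa yields the eight slope/intercept pairs the problem asks for.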
(b) Now find the best fit affine function of all eight features (together as a vector in $\mathbb{R}^8$) to the label lpsa. Report the coefficients in the weight vector and the "intercept" term.
You should find that some of the variables have a negative coefficient in the weight vector from Part (b) even though their corresponding affine functions from Part (a) have positive slopes. This
might seem like a paradox: for such a feature, Part (a) might lead you to think that increasing the
feature’s value should, on average, increase the value of lpsa; whereas Part (b) might lead you to
think that increasing the feature’s value should, on average, decrease the value of lpsa.
Of course there is no paradox. Here is a simple example to show how this can happen. Suppose $X_1 \sim N(0, 1)$ and $X_2 \sim N(0, 1)$, and $\mathbb{E}[X_1 X_2] = \frac{2}{3}$. Furthermore, suppose $Y = \frac{3}{2} X_1 - \frac{3}{4} X_2$.
(c) What is the linear function of X1 that has smallest mean squared error for predicting Y ?
What is the linear function of X2 that has smallest mean squared error for predicting Y ?
And finally, what is the linear function of (X1, X2) that has smallest mean squared error for
predicting Y ?
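A Monte Carlo sanity check of this example can be sketched as follows (the seed and sample size are arbitrary; this simulates the joint distribution and compares the single-variable fits against the two-variable fit, without substituting for the exact calculation):

```python
import math
import random

random.seed(2)
n = 200_000

# X1, X2 standard normal with E[X1 X2] = 2/3: take
# X2 = (2/3) X1 + sqrt(1 - (2/3)^2) Z with Z independent of X1.
c = 2 / 3
x1s, x2s, ys = [], [], []
for _ in range(n):
    x1 = random.gauss(0, 1)
    z = random.gauss(0, 1)
    x2 = c * x1 + math.sqrt(1 - c * c) * z
    x1s.append(x1)
    x2s.append(x2)
    ys.append(1.5 * x1 - 0.75 * x2)

def slope_through_origin(xs, ys):
    # best linear (not affine) predictor of Y from a single X: a = E[XY] / E[X^2]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

a1 = slope_through_origin(x1s, ys)   # single-variable fit on X1
a2 = slope_through_origin(x2s, ys)   # single-variable fit on X2

# Two-variable fit: solve the 2x2 normal equations by Cramer's rule.
s11 = sum(x * x for x in x1s)
s22 = sum(x * x for x in x2s)
s12 = sum(u * v for u, v in zip(x1s, x2s))
b1 = sum(u * v for u, v in zip(x1s, ys))
b2 = sum(u * v for u, v in zip(x2s, ys))
det = s11 * s22 - s12 * s12
w1 = (b1 * s22 - b2 * s12) / det
w2 = (s11 * b2 - s12 * b1) / det

print(f"X1 alone: {a1:+.2f}, X2 alone: {a2:+.2f}, together: ({w1:+.2f}, {w2:+.2f})")
```

Both single-variable slopes come out positive while one coefficient of the joint fit is negative, which is exactly the phenomenon discussed below.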
You should find that even though each of X1 and X2 is positively correlated with Y (analogous
to the situation in Part (a)), the best linear predictor of Y that considers both X1 and X2 has a
positive coefficient for one variable and a negative coefficient for the other variable (analogous to
the situation in Part (b)).