
CS 189/289A Introduction to Machine Learning


Spring 2021


  • The exam is open book, open notes for material on paper. On your computer screen, you may have only this exam, Zoom (if you are running it on your computer instead of a mobile device), and four browser windows/tabs: Gradescope, the exam instructions, clarifications on Piazza, and the form for submitting clarification requests.
  • You will submit your answers to the multiple-choice questions directly into Gradescope via the assignment Midterm Multiple Choice; please do not submit your multiple-choice answers on paper. If you are in the DSP program and have been granted extra time, select the DSP, 150% or DSP, 200% option. By contrast, you will submit your answers to the written questions by writing them on paper by hand, scanning them, and submitting them through Gradescope via the assignment Midterm Free Response.
  • Please write your name at the top of each page of your written answers. (You may do this before the exam.) Please start each top-level question (Q2, Q3, etc.) on a new sheet of paper. Clearly label all written questions and all subparts of each written question.
  • You have 80 minutes to complete the midterm exam (7:40-9:00 PM). (If you are in the DSP program and have an allowance of 150% or 200% time, that comes to 120 minutes or 160 minutes, respectively.)
  • When the exam ends (9:00 PM), stop writing. You must submit your multiple-choice answers before 9:00 PM sharp. Late multiple-choice submissions will be penalized at a rate of 5 points per minute after 9:00 PM. (The multiple-choice questions are worth 40 points total.)
  • From 9:00 PM, you have 15 minutes to scan the written portion of your exam and turn it in to Gradescope via the assignment Midterm Free Response. Most of you will use your cellphone/pad and a third-party scanning app. If you have a physical scanner, you may use that. Late written submissions will be penalized at a rate of 10 points per minute after 9:15 PM. (The written portion is worth 60 points total.)
  • Following the exam, you must use Gradescope's page selection mechanism to mark which questions are on which pages of your exam (as you do for the homeworks). Please get this done before 2:00 AM. This can be done on a computer different from the device you submitted with.
  • The total number of points is 100. There are 10 multiple-choice questions worth 4 points each, and four written questions worth a total of 60 points.
  • For multiple answer questions, fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct choice. NO partial credit on multiple answer questions: the set of all correct answers must be checked.

Q1. [40 pts] Multiple Answer

Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct choice. NO partial credit: the set of all correct answers must be checked.

(a) [4 pts] Which of the following cost functions are smooth, i.e., have continuous gradients everywhere?
A: the perceptron risk function
B: the sum (over sample points) of logistic losses
C: least squares with ℓ2 regularization
D: least squares with ℓ1 regularization
(b) [4 pts] Which of the following changes would commonly cause an SVM's margin 1/‖w‖ to shrink?
A: Soft margin SVM: increasing the value of C
B: Hard margin SVM: adding a sample point that violates the margin
C: Soft margin SVM: decreasing the value of C
D: Hard margin SVM: adding a new feature to each sample point

The greater the value of C is, the higher the penalty for violating the margin. The soft margin shrinks to compensate.

If you add a sample point that violates the margin, a hard margin always shrinks.

If you add a feature, the old solution can still be used (by setting the weight associated with the new feature to zero). Although the new feature might enable a new solution with a wider margin, the optimal solution can't be worse than the old solution.

(c) [4 pts] Recall the logistic function s(γ) and its derivative s′(γ) = (d/dγ) s(γ). Let γ̂ be the value of γ that maximizes s′(γ).
A: γ̂ = 0.25
B: s(γ̂) = 0.5
C: s′(γ̂) = 0.5
D: s′(γ̂) = 0.25

A glance at the logistic curve and its symmetry makes it clear that the slope is maximized when γ̂ = 0 and s(γ̂) = 0.5. Recall that s′ = s(1 − s); hence s′(γ̂) = 0.25.
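
To see this numerically, here is a short numpy sketch (illustrative only) that evaluates s and s′ = s(1 − s) on a grid and confirms the maximizing point:

    import numpy as np

    # Evaluate the logistic function s and its derivative s' = s(1 - s) on a grid.
    gamma = np.linspace(-10, 10, 100001)
    s = 1.0 / (1.0 + np.exp(-gamma))
    s_prime = s * (1.0 - s)

    i = np.argmax(s_prime)
    print(gamma[i], s[i], s_prime[i])   # approximately 0.0, 0.5, 0.25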

(d) [4 pts] You are running logistic regression to classify two-dimensional sample points Xi ∈ R^2 into two classes yi ∈ {0, 1} with the regression function h(z) = s(w⊤z + α), where s is the logistic function. Unfortunately, regular logistic regression isn't fitting the data very well. To remedy this, you try appending an extra feature, ‖Xi‖^2, to the end of each sample point Xi. After you run logistic regression again with the new feature, the decision boundary in R^2 could be
A: a line.
B: a circle.
C: an ellipse.
D: an S-shaped logistic curve.

The decision boundary is the set of points {z : s(w⊤z + α) = 0.5}, which is the set of points {z : w⊤z + α = 0}. If z = [x1  x2  x1^2 + x2^2], then the decision boundary for points in x-space is w1 x1 + w2 x2 + w3 (x1^2 + x2^2) + α = 0. This equation can express arbitrary circles and lines. It can't express ellipses because there is only one quadratic feature, and it causes x1^2 and x2^2 to always have the same coefficient. It definitely can't express any S-shaped curves.
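
For concreteness, here is the completing-the-square step (a supplementary derivation) showing why the boundary is a circle when w3 ≠ 0 and a line when w3 = 0:

\[
w_3(x_1^2 + x_2^2) + w_1 x_1 + w_2 x_2 + \alpha = 0
\iff
\left(x_1 + \frac{w_1}{2w_3}\right)^2 + \left(x_2 + \frac{w_2}{2w_3}\right)^2
= \frac{w_1^2 + w_2^2}{4w_3^2} - \frac{\alpha}{w_3},
\]

a circle centered at (−w1/(2w3), −w2/(2w3)); when w3 = 0 the equation reduces to the line w1 x1 + w2 x2 + α = 0.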

(e) [4 pts] We are performing least-squares linear regression, with the use of a fictitious dimension (so the regression function isn't restricted to satisfy h(0) = 0). Which of the following will never increase the training error, as measured by the mean squared-error cost function?
A: Adding polynomial features
B: Using backward stepwise selection to remove some features, thereby reducing validation error
C: Using Lasso to encourage sparse weights
D: Centering the sample points

Adding new features never increases the training error. Centering the sample points never changes the training error (as we're using a fictitious dimension), as the change in coordinate system can be compensated for by a translation of the regression function (effected by a change in the bias term). Options B and C usually increase training error for the purpose of reducing validation error.

(f) [4 pts] Given a design matrix X ∈ R^{n×d}, labels y ∈ R^n, and λ > 0, we find the weight vector w* that minimizes ‖Xw − y‖^2 + λ‖w‖^2. Suppose that w* ≠ 0.
A: The variance of the method decreases if λ increases enough.
B: There may be multiple solutions for w*.
C: The bias of the method increases if λ increases enough.
D: w* = X^+ y, where X^+ is the pseudoinverse of X.

A and C: as λ → ∞, w* → 0, so the variance also approaches zero and the bias grows. When λ > 0, the objective is strongly convex and the minimizer is unique. D holds only when λ = 0, which we have said is not the case.
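
As a small illustration (a sketch with made-up data, not a unique recipe), the ridge solution can be computed from its normal equations, and its norm shrinks as λ grows:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))    # toy design matrix
    y = rng.normal(size=20)         # toy labels

    def ridge(X, y, lam):
        # Minimize ||Xw - y||^2 + lam ||w||^2 via (X^T X + lam I) w = X^T y.
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    for lam in [0.1, 1.0, 10.0, 1000.0]:
        print(lam, np.linalg.norm(ridge(X, y, lam)))   # the norm of w* shrinks toward 0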

(g) [4 pts] The following two questions use the following assumptions. You want to train a dog identifier with Gaussian discriminant analysis. Your classifier takes an image vector as its input and outputs 1 if it thinks it is a dog, and 0 otherwise. You use the CIFAR10 dataset, modified so all the classes that are not dog have the label 0. Your training set has 5,000 dog images and 45,000 non-dog (other) images. Which of the following statements seem likely to be correct?
A: LDA has an advantage over QDA because the two classes have different numbers of training examples.
B: QDA has an advantage over LDA because the two classes have different numbers of training examples.
C: LDA has an advantage over QDA because the two classes are expected to have very different covariance matrices.
D: QDA has an advantage over LDA because the two classes are expected to have very different covariance matrices.

The dog class (label 1) only contains images of dogs, so compared to the other class (label 0), it should have less variance. The other class has images from many different classes, so it will have greater variance. QDA fits different covariance matrices to different classes, so it is more appropriate than LDA, which fits a single covariance matrix to both classes.

(h) [4 pts] This question is a continuation of the previous question. You train your classifier with LDA and the 0-1 loss. You observe that at test time, your classifier always predicts "other" and never predicts "dog". What is a likely reason for this and how can we solve it? (Check all that apply.)
A: Reason: The prior for the other class is very large, so predicting other on every test point minimizes the (estimated) risk.
B: Reason: As LDA fits the same covariance matrix to both classes, the class with more examples will be predicted for all points in R^d.
C: Solve it by using a loss function that penalizes dogs misclassified as other more than others misclassified as dogs.
D: Solve it by learning an isotropic pooled covariance instead of an anisotropic one; that is, the covariance matrix computed by LDA has the form σ^2 I.

A: The 0-1 loss does not account for the imbalance in the number of examples per class, so it is minimized by predicting the class with the larger prior for a very large set of inputs if that prior is large enough (in this case, that set is large enough to contain the entire set of training images). B: LDA can still predict the class with fewer examples when trained to minimize the 0-1 loss if the means of the two classes are different enough. C: An asymmetric loss can counteract the larger prior on the other class. D: lolno

(i) [4 pts] We do an ROC analysis of 5 binary classifiers C1, C2, C3, C4, C5 trained on the training points Xtrain and labels ytrain. We compute their true positive and false positive rates on the validation points Xval and labels yval and plot them in the ROC space, illustrated below. In Xval and yval, there are np points in class positive and nn points in class negative. We use a 0-1 loss.
[Figure: ROC analysis of five classifiers. FPR = false positive rate; TPR = true positive rate.]
A: If np = nn, C2 is the classifier with the highest validation accuracy.
B: If np = nn, all five classifiers have higher validation accuracy than any random classifier.
C: There exists some np and nn such that C1 is the classifier with the highest validation accuracy.
D: There exists some np and nn such that C3 is the classifier with the highest validation accuracy.

If np = nn, the validation accuracy of each classifier is 0.5 (true positive rate) + 0.5 (true negative rate), and C2 is the best classifier, with an accuracy of about 80%. All five classifiers lie above the diagonal line, so they all have validation accuracy better than 50%, whereas any random classifier (biased or not) will have a validation accuracy of exactly 50% when np = nn. C1 is the best classifier when nearly all the validation data is negative, but C3 is never the best classifier for any np and nn, as it lies below the ROC convex hull.
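
One way to check such claims (an illustrative sketch; the TPR/FPR numbers below are hypothetical, not read off the actual figure) is to convert an ROC point directly into validation accuracy:

    # Validation accuracy of a classifier with ROC point (TPR, FPR),
    # on a validation set with n_p positive and n_n negative points.
    def accuracy(tpr, fpr, n_p, n_n):
        tnr = 1.0 - fpr
        return (tpr * n_p + tnr * n_n) / (n_p + n_n)

    print(accuracy(0.8, 0.2, 100, 100))   # balanced classes: 0.5 TPR + 0.5 TNR
    print(accuracy(0.3, 0.01, 1, 999))    # mostly-negative data favors a low-FPR classifier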

(j) [4 pts] Tell us about feature subset selection.
A: Ridge regression is more effective for feature subset selection than Lasso.
B: If the best model uses only features 2 and 4 (i.e., the second and fourth columns of the design matrix), forward stepwise selection is guaranteed to find that model.
C: Stepwise subset selection uses the accuracy on the training data to decide which features to include.
D: Backward stepwise selection could train a model with only features 1 and 3. It could train a model with only features 2 and 4. But it will never train both models.

A: Lasso can be used for subset selection because it encourages weights of zero. Ridge regression almost never produces weights of exactly zero.

C: Subset selection uses validation accuracy, not training accuracy.

B: Stepwise subset selection can miss the best subset. If the best 1-feature model uses only feature 3, then forward stepwise selection won't investigate the model that uses features 2 and 4.

D: It will never train both because it only tries training 2-feature models when it has already reduced the number of features to three, and those three features cannot contain all of features 1, 2, 3, and 4.

Q2. [14 pts] Eigendecompositions

(a) [5 pts] Consider a symmetric, square, real matrix A ∈ R^{d×d}. Let A = VΛV⊤ be its eigendecomposition. Let vi denote the ith column of V. Let λi denote Λii, the scalar component on the ith row and ith column of Λ.
Consider the matrix M = A − βA^2, where β ∈ R. What are the eigenvalues and eigenvectors of M? (Expressed in terms of parts of A's eigendecomposition and β. No proof required.)

Every eigenvector vi of A is also an eigenvector of M. For each eigenvalue λi of A, λi − βλi^2 is an eigenvalue of M.
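
A quick numerical sanity check of this claim (illustrative only; the matrix and β are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.normal(size=(4, 4))
    A = (B + B.T) / 2                 # random symmetric matrix
    beta = 0.7                        # arbitrary scalar

    lam, V = np.linalg.eigh(A)        # A = V diag(lam) V^T
    M = A - beta * A @ A

    # Columns of V are eigenvectors of M with eigenvalues lam - beta * lam^2.
    print(np.allclose(M @ V, V @ np.diag(lam - beta * lam**2)))   # True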

(b) [4 pts] Suppose that A is a sample covariance matrix for a set of n sample points stored in a design matrix X ∈ R^{n×d}, and that β ∈ R is a fixed constant. Is it always true (for any such A and β) that there exists another design matrix Z ∈ R^{n×d} such that M = A − βA^2 is the sample covariance matrix for Z? Explain your answer.

No. M could have negative eigenvalues, but all sample covariance matrices are positive semidefinite.

(c) [5 pts] In lecture, we talked about decorrelating a centered design matrix Ẋ. We used an eigendecomposition to do that. Explain (in English, not math) what the eigendecomposition tells us about the sample points, and how that information helps us decorrelate a design matrix.
The eigenvectors of _____________ tell us _____________.
With this information, we decorrelate the centered design matrix by _____________.

The eigenvectors of the sample covariance matrix tell us some orthogonal directions (alternative coordinate axes) along which the points are not correlated.

With this information, we decorrelate the centered design matrix by doing a change of coordinates (a rotation) to a coordinate system in which the features have covariance zero.
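
A minimal numpy sketch of this procedure (toy data; one common convention among several):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))   # correlated toy points

    Xc = X - X.mean(axis=0)             # center the design matrix
    cov = Xc.T @ Xc / Xc.shape[0]       # sample covariance matrix
    lam, V = np.linalg.eigh(cov)        # eigenvectors: uncorrelated directions

    Z = Xc @ V                          # rotate into the eigenvector coordinate system
    print(np.round(Z.T @ Z / Z.shape[0], 8))   # diagonal: the features of Z are uncorrelated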

Q3. [10 pts] Maximum Likelihood Estimation

There are 5 balls in a bag. Each ball is either red or blue. Let θ (an integer) be the number of blue balls. We want to estimate θ, so we draw 4 balls with replacement out of the bag, replacing each one before drawing the next. We get blue, red, blue, and blue (in that order).

(a) [5 pts] Assuming θ is fixed, what is the likelihood of getting exactly that sequence of colors (expressed as a function of θ)?

L(θ; X) = P(X; θ) = ((5 − θ)/5) · (θ/5)^3.
(b) [3 pts] Draw a table showing (as a fraction) the likelihood of getting exactly that sequence of colors, for every value of θ from zero to 5 inclusive.
 θ    L(θ; blue, red, blue, blue)
 0    ?
 1    ?
 2    ?
 3    ?
 4    ?
 5    ?

 θ    L(θ; blue, red, blue, blue)
 0    0
 1    4/625
 2    24/625
 3    54/625
 4    64/625
 5    0
(c) [2 pts] What is the maximum likelihood estimate for θ? (Chosen among all integers; not among all real numbers.)

The maximum likelihood estimate for θ is 4.
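
A short script confirming parts (b) and (c) (illustrative, using exact fractions):

    from fractions import Fraction

    # Likelihood of (blue, red, blue, blue) when theta of the 5 balls are blue.
    def likelihood(theta):
        p_blue = Fraction(theta, 5)
        return (1 - p_blue) * p_blue**3

    for theta in range(6):
        print(theta, likelihood(theta))     # 0, 4/625, 24/625, 54/625, 64/625, 0

    print(max(range(6), key=likelihood))    # maximum likelihood estimate: 4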

Q4. [20 pts] Tikhonov Regularization

Let's take a look at a more complicated version of ridge regression called Tikhonov regularization. We use a regularization parameter similar to λ, but instead of a scalar, we use a real, square matrix Γ ∈ R^{d×d} (called the Tikhonov matrix). Given a design matrix X ∈ R^{n×d} and a vector of labels y ∈ R^n, our regression algorithm finds the weight vector w ∈ R^d that minimizes the cost function J(w) = ‖Xw − y‖_2^2 + ‖Γw‖_2^2.

(a) [7 pts] Derive the normal equations for this minimization problem, that is, a linear system of equations whose solution(s) is the optimal weight vector w*. Show your work. (If you prefer, you can write an explicit closed formula for w*.)

The gradient of the cost function is ∇J(w) = 2X⊤(Xw − y) + 2Γ⊤Γw.

Setting ∇J = 0 gives the normal equations, (X⊤X + Γ⊤Γ) w = X⊤y.

If the matrix is not singular, we can write the closed formula

w = (X⊤X + Γ⊤Γ)^{−1} X⊤y.
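
A numpy sketch of this closed form (made-up data; it assumes X⊤X + Γ⊤Γ is invertible, which holds for instance under the condition in part (b)):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 4
    X = rng.normal(size=(n, d))            # toy design matrix
    y = rng.normal(size=n)                 # toy labels
    Gamma = rng.normal(size=(d, d))        # toy Tikhonov matrix

    # Normal equations: (X^T X + Gamma^T Gamma) w = X^T y
    w = np.linalg.solve(X.T @ X + Gamma.T @ Gamma, X.T @ y)

    # The gradient 2 X^T (Xw - y) + 2 Gamma^T Gamma w should vanish at the minimizer.
    grad = 2 * X.T @ (X @ w - y) + 2 * Gamma.T @ Gamma @ w
    print(np.allclose(grad, 0))            # True
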
(b) [3 pts] Give a simple, sufficient and necessary condition on Γ (involving Γ only; not X nor y) that guarantees that J(w) has only one unique minimum w*. (To be precise, the uniqueness guarantee must hold for all values of X and y, although the unique w* will be different for different values of X and y.) (A sufficient but not necessary condition will receive part marks.)

The following conditions are all equivalent to each other, and are all correct answers: Γ⊤Γ is positive definite; Γ is invertible; Γ is nonsingular; Γ has full rank; Γ has rank d.

(c) [5 pts] Recall the Bayesian justification of ridge regression. We impose an isotropic normal prior distribution on the weight vector; that is, we assume that w ∼ N(0, σ^2 I). (This encodes our suspicion that small weights are more likely to be correct than large ones.) Bayes' Theorem gives us a posterior distribution f(w | X, y). We apply maximum likelihood estimation (MLE) to estimate w in that posterior distribution, and it tells us to find w by minimizing ‖Xw − y‖_2^2 + λ‖w‖_2^2 for some constant λ.
Suppose we change the prior distribution to an anisotropic normal distribution: w ∼ N(0, Σ) for some symmetric, positive definite covariance matrix Σ. Then MLE on the new posterior tells us to do Tikhonov regularization! What value of Γ does MLE tell us to use when we minimize J(w)?
Give a one-sentence explanation of your answer.
Give a one-sentence explanation of your answer.

Γ = Σ^{−1/2}. (Answers that incorporate an extra scalar factor, like Γ = σΣ^{−1/2}, are also acceptable, maybe even preferable.) Because the log posterior will include a term equal to some constant times w⊤Σ^{−1}w, which is equal to ‖Γw‖^2 if Γ = Σ^{−1/2}.
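
A quick numerical check of the identity ‖Γw‖^2 = w⊤Σ^{−1}w when Γ = Σ^{−1/2} (toy Σ; the matrix square root is computed by eigendecomposition):

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.normal(size=(3, 3))
    Sigma = B @ B.T + 0.1 * np.eye(3)      # symmetric positive definite toy covariance

    lam, V = np.linalg.eigh(Sigma)
    Gamma = V @ np.diag(lam**-0.5) @ V.T   # Gamma = Sigma^(-1/2)

    w = rng.normal(size=3)
    print(np.allclose(np.linalg.norm(Gamma @ w)**2,
                      w @ np.linalg.inv(Sigma) @ w))   # True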

(d) [5 pts] Suppose you solve a Tikhonov regularization problem in a two-dimensional feature space (d = 2) and obtain a weight vector w* that minimizes J(w). The solution w* lies on an isocontour of ‖Xw − y‖_2^2 and on an isocontour of ‖Γw‖_2^2. Draw a diagram that plausibly depicts both of these two isocontours, in a case where Γ is not diagonal and y ≠ 0. (You don't need to choose specific values of X, y, or Γ; your diagram just needs to look plausible.)
Your diagram must contain the following elements:
  • The two axes (coordinate system) of the space you are optimizing in, with both axes labeled.
  • The specified isocontour of ‖Xw − y‖^2, labeled.
  • The specified isocontour of ‖Γw‖^2, labeled.
  • The point w*.
These elements must be in a plausible geometric relationship to each other.

The diagram should contain: the two coordinate axes labeled w1 and w2; an ellipse or circle, not centered at the origin, labeled ‖Xw − y‖^2; an ellipse, not axis-aligned and not circular, centered at the origin, labeled ‖Γw‖^2; a point w* on both ellipses; and the two ellipses must touch each other tangentially without crossing each other.

Q5. [16 pts] Multiclass Bayes Decision Theory

Let's apply Bayes decision theory to three-class classification. Consider a weather station that constantly receives data from its radar systems and must predict what the weather will be on the next day. Concretely:

  • The inputXis a scalar value representing the level of cloud cover, with only four discrete levels: 25, 50, 75, and 100 (the percentage of cloud cover).
  • The station must predict one of three classes Y corresponding to the weather tomorrow. Y = y0 means sunny, y1 means cloudy, and y2 means rain.
  • The priors for each class are as follows: P(Y = y0) = 0.5, P(Y = y1) = 0.3, and P(Y = y2) = 0.2.
  • The station has measured the cloud cover on the days prior to 100 sunny days, 100 cloudy days, and 100 rainy days. From these numbers they estimated the class-conditional probability mass functions P(X | Y):

 Prior-Day Cloud Cover (X)   Sunny, P(X | Y = y0)   Cloudy, P(X | Y = y1)   Rain, P(X | Y = y2)
 25                          0.7                    0.3                     0.
 50                          0.2                    0.3                     0.
 75                          0.1                    0.3                     0.
 100                         0                      0.1                     0.
  • We use an asymmetric loss. Let z be the predicted class and y the true class (label).

 L(z, y) = 0 if z = y,
           1 if y = y0 and z ≠ y0,
           2 if y = y1 and z ≠ y1,
           4 if y = y2 and z ≠ y2.
(a) [8 pts] Consider the constant decision rule r0(x) = y0, which always predicts y0 (sunny). What is the risk R(r0) of the decision rule r0? Your answer should be a number, but show all your work.

R(r0(x)) = Σ_i Σ_x P(Y = yi) L(r0(x), yi) P(X = x | Y = yi)
         = P(Y = y1) L(y0, y1) Σ_x P(X = x | Y = y1) + P(Y = y2) L(y0, y2) Σ_x P(X = x | Y = y2)
         = 0.3 · 2 · 1 + 0.2 · 4 · 1
         = 1.4.

As r0 does not actually vary with x at all, it is also acceptable to notice that we don't have to break down the risk by x:

R(r0(x)) = R(y0)
         = Σ_i P(Y = yi) L(y0, yi)
         = 0.3 · 2 + 0.2 · 4
         = 1.4.
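
A one-line check of this number (the class-conditionals drop out because each row of P(X | Y) sums to 1):

    # Risk of the constant rule r0(x) = y0: sum over classes of prior * L(y0, y).
    priors = {"sunny": 0.5, "cloudy": 0.3, "rain": 0.2}
    loss_y0 = {"sunny": 0, "cloudy": 2, "rain": 4}     # L(y0, y) for each true class y

    print(sum(priors[y] * loss_y0[y] for y in priors))   # 1.4 (up to floating-point rounding)
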
(b) [8 pts] Derive the Bayes optimal decision rule r*(x), the rule that minimizes the risk R(r).
Hint: Write down a table calculating L(z, yi) P(X | Y = yi) P(Y = yi), for each class yi and each possible value of X (12 entries total), in the cases where the prediction z is wrong. Then figure out how to use it to minimize R. This problem can be solved without wasting time computing P(X).

The usual principle is to pick the class with the biggest posterior probability, but in this case we also need to weight each class with an asymmetrical loss. We don't need to fully compute each posterior probability, because we don't need to know the value of the denominator P(X). The following table shows L(z, yi) P(X | Y = yi) P(Y = yi), for each class yi and each possible value of X, in the cases where the prediction z is wrong. (When z is correct, the loss is always zero.)

 Prior-Day Cloud Cover (X)   Sunny (y0)   Cloudy (y1)   Rain (y2)
 25                          0.35         0.18          0.
 50                          0.10         0.18          0.
 75                          0.05         0.18          0.
 100                         0.00         0.06          0.

The Bayes optimal decision rule always picks the class with the highest loss-weighted posterior to minimize risk, so for each input X we pick the class with the largest entry in the corresponding row of the table above. Formally,

r*(x) = y0 if x = 25,
        y1 if x = 50,
        y2 if x = 75 or x = 100.