### DS-GA-1003: Machine Learning (Spring 2022)


### Final Exam (6:00pm-8:00pm, May 12)

- You should finish the exam within 1 hour and 45 minutes and submit through Gradescope by 8pm.
- You can refer to textbooks, lecture slides, and notes. However, searching for answers online and collaborating with others are not allowed.
- Please write your solution on a separate sheet or the released exam sheet clearly, then upload the photo of your handwriting or the annotated pdf to Gradescope.
- Make sure you leave enough time to upload the exam. Gradescope will terminate the submission window at 8:00pm promptly. There is no grace period.

| Question | Points | Score |
| --- | --- | --- |
| Probabilistic models | 14 | |
| Bayesian methods | 14 | |
| Multiclass classification | 15 | |
| Decision trees and ensemble methods | 12 | |
| Boosting | 11 | |
| Neural Networks | 19 | |
| Clustering | 15 | |
| Total: | 100 | |

- Help the TAs! Vishakh and Colin are trying to model the time it takes to answer a student's question during office hours so that they can better plan ahead. You will help them build a model to do this. One common distribution for modeling the duration of independent events is the exponential distribution, which has the following density function:

$$
p(y) = \begin{cases} \lambda e^{-\lambda y} & y \geq 0 \\ 0 & y < 0 \end{cases}
$$
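As a quick sanity check on the density above, the following minimal Python sketch evaluates the exponential density $p(y) = \lambda e^{-\lambda y}$ for $y \geq 0$ and numerically confirms that it integrates to approximately 1. The function names, the rate value `2.0`, and the integration bounds are illustrative assumptions, not part of the exam.

```python
import math


def exponential_pdf(y: float, lam: float) -> float:
    """Density of the exponential distribution with rate lam > 0."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    # The density is zero for negative durations.
    return lam * math.exp(-lam * y) if y >= 0 else 0.0


def riemann_integral(lam: float, upper: float = 50.0, steps: int = 200_000) -> float:
    """Midpoint-rule approximation of the integral of the density over [0, upper]."""
    width = upper / steps
    return sum(exponential_pdf((k + 0.5) * width, lam) for k in range(steps)) * width


# Hypothetical rate chosen only for illustration.
total_mass = riemann_integral(lam=2.0)
print(f"approximate total probability mass: {total_mass:.6f}")
```

The midpoint rule is used here only because it needs no external libraries; any numerical integrator would serve the same purpose.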

(a) You have collected the duration of each student visit for several office hours. Now use this dataset $\mathcal{D} = \{y_i\}_{i=1}^{n}$ to estimate the parameter $\lambda > 0$ of the distribution.

i. (2 points) Write down the log-likelihood function $\ell(\lambda)$ (you don't need to simplify the equation).

ii. (3 points) Solve for the optimal $\lambda$ by maximum likelihood estimation.

(b) To have a more accurate estimate for each individual student who comes to the office hour, you decide to predict the duration $y$ given features of each student and their questions $x \in \mathbb{R}^d$ using a linear probabilistic model $p(y \mid x; w)$ where $w \in \mathbb{R}^d$ is the weight vector. You would still like to model $y$ as a random variable from the exponential distribution, but now parametrized by $w^T x$ (instead of $\lambda$).

i. (2 points) Recall that we need a transfer function to map the score $w^T x$ to the parameter space of the exponential distribution. What are the input and output spaces of this map?

A. $\mathbb{R} \to [0, 1]$
B. $\mathbb{R} \to (0, 1]$
C. $\mathbb{R} \to [0, +\infty)$
D. $\mathbb{R} \to (0, +\infty)$

ii. (3 points) Based on your answer above, propose a transfer function $f$.

iii. (4 points) Give an expression for the log-likelihood of one data point $(x, y)$: $\log p(y \mid x; w)$ using the transfer function above.

- Bayesian methods

(a) (4 points) Suppose we are trying to estimate a parameter $\theta$ from data $\mathcal{D}$. What is the difference between a maximum likelihood estimate and a maximum a posteriori estimate of $\theta$? Use the mathematical expressions for both quantities.

Suppose we are trying to estimate the parameters of a biased die, that is, each trial is a draw from a categorical distribution with probabilities $(\theta_1, \ldots, \theta_6)$ (for example, $P(X = 1) = \theta_1$).

(b) (3 points) We throw the die $N$ times. The throws are independent of each other. What is the likelihood of an outcome $(n_1, \ldots, n_6)$, where $\sum_{i=1}^{6} n_i = N$ and $n_1$ represents the number of trials where the result was 1?

(c) (4 points) We will use the Dirichlet prior, which has the following form, where $\alpha_1, \ldots, \alpha_6$ are parameters:

$$
f(\theta_1, \ldots, \theta_6) \propto \prod_{i=1}^{6} \theta_i^{\alpha_i - 1}
$$

Given this prior, what is the posterior, up to the normalizing constant (i.e. you can use the $\propto$ symbol in your response)?

(d) (3 points) Is the Dirichlet prior a conjugate prior to the categorical distribution? Explain your answer.

- Dog classification App. Alice is trying to build an app to help beginning dog lovers recognize different types of dogs. To start with an easy setting, she decides to build a classifier with three classes: Golden Retriever, Husky, Not Dog (class 1, 2, and 3).

(a) First, she decides to use the AvA (all vs all) method, where each pairwise classifier $h_{ij}$ is a binary classifier that separates class $i$ (positive) from class $j$ (negative).

i. (3 points) Consider the following outputs from each classifier: $h_{12}(x) = 1$, $h_{13}(x) = 1$, $h_{23}(x) = 1$. What prediction should the App make for an example $x$? Your answer should be Golden Retriever, Husky, or Not Dog.

ii. (3 points) How many more classifiers does Alice need to train to update her App to include German Shepherd?
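The AvA scheme described above combines pairwise decisions by voting. The sketch below illustrates one common way to do this, a simple majority vote, under the assumption that each pairwise classifier outputs +1 when it prefers its positive class and -1 otherwise. The class names `"A"`, `"B"`, `"C"` and the outputs are hypothetical and unrelated to the exam's values; the tie-breaking behavior is an arbitrary choice.

```python
from collections import Counter
from typing import Dict, Tuple


def ava_predict(pairwise_outputs: Dict[Tuple[str, str], int]) -> str:
    """Majority vote over pairwise classifiers.

    pairwise_outputs[(i, j)] is +1 if the classifier for the pair
    prefers class i (positive) and -1 if it prefers class j (negative).
    """
    votes: Counter = Counter()
    for (class_i, class_j), output in pairwise_outputs.items():
        votes[class_i if output > 0 else class_j] += 1
    # Ties are resolved in favor of the class that received a vote first,
    # which is an arbitrary convention for this sketch.
    return votes.most_common(1)[0][0]


# Hypothetical outputs for three made-up classes.
example_outputs = {("A", "B"): -1, ("A", "C"): 1, ("B", "C"): 1}
prediction = ava_predict(example_outputs)
print(prediction)
```

In this made-up example, class B wins two of its pairwise comparisons and class A wins one, so the vote selects B.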

(b) A friend who tried the initial version of the App gave the feedback that it is really annoying to see the App sometimes classify a dog as Not Dog. To improve user trust, Alice then decides to penalize the model more when it classifies either Golden Retriever or Husky as Not Dog.

i. (3 points) To achieve this, Alice starts with the multiclass zero-one loss $\Delta(y, \hat{y})$. Recall that in class we defined $\Delta(y, \hat{y})$ to be $\mathbb{1}\{y \neq \hat{y}\}$. Now, redefine $\Delta(y, \hat{y})$ for Alice by filling in the table below to indicate the preference that confusion between any dog and Not Dog incurs a penalty of 5 whereas any other misclassification incurs a penalty of 1.

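| $\Delta(y, \hat{y})$ | $\hat{y}$ = Golden Retriever | $\hat{y}$ = Husky | $\hat{y}$ = Not Dog |
| --- | --- | --- | --- |
| $y$ = Golden Retriever | | | |
| $y$ = Husky | | | |
| $y$ = Not Dog | | | |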

ii. (6 points) Next, Alice adapts the generalized hinge loss by plugging in the new $\Delta$ function defined above:

$$
\ell_{\text{hinge}}(y, x, w) \overset{\text{def}}{=} \max_{y' \in \mathcal{Y}} \left( \Delta(y, y') - \langle w, \Psi(x, y) - \Psi(x, y') \rangle \right)
$$

For simplicity, let $s(x, y)$ denote the compatibility score $\langle w, \Psi(x, y) \rangle$. Given an image $x$, consider the following prediction:

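- $s(x, \text{Golden Retriever}) = 1.5$
- $s(x, \text{Husky}) = 4.5$
- $s(x, \text{Not Dog}) = 3.5$

What is the loss with respect to different ground-truth labels $y$? (Your answer should be a real number.)

- $\ell_{\text{hinge}}(\text{Golden Retriever}, x, w) =$
- $\ell_{\text{hinge}}(\text{Husky}, x, w) =$
- $\ell_{\text{hinge}}(\text{Not Dog}, x, w) =$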
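To make the mechanics of the generalized hinge loss concrete, here is a minimal Python sketch that evaluates $\max_{y'} \left( \Delta(y, y') - (s(x, y) - s(x, y')) \right)$ from a dictionary of compatibility scores. The labels, the scores, and the plain zero-one $\Delta$ used here are hypothetical and deliberately different from the exam's setup; the function names are assumptions made for illustration.

```python
from typing import Callable, Dict


def generalized_hinge_loss(
    true_label: str,
    scores: Dict[str, float],
    delta: Callable[[str, str], float],
) -> float:
    """Return max over y' of delta(y, y') - (s(x, y) - s(x, y'))."""
    true_score = scores[true_label]
    return max(
        delta(true_label, other) - (true_score - other_score)
        for other, other_score in scores.items()
    )


def zero_one(y: str, y_hat: str) -> float:
    """Standard multiclass zero-one penalty."""
    return 0.0 if y == y_hat else 1.0


# Hypothetical scores for three made-up labels.
hypothetical_scores = {"cat": 2.0, "bird": 0.5, "fish": 1.5}
loss_cat = generalized_hinge_loss("cat", hypothetical_scores, zero_one)
loss_bird = generalized_hinge_loss("bird", hypothetical_scores, zero_one)
print(loss_cat, loss_bird)
```

Because the maximum includes $y' = y$, where both the penalty and the score gap are zero, the loss is never negative. A different penalty table can be swapped in by passing another `delta` function.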

- Decision trees and ensemble methods

For each of the following statements, indicate whether it is true or false, and explain your answer.

(a) (3 points) Decision trees favor high-variance features; centering and scaling the values of all features before fitting the tree can counteract this tendency.

(b) (3 points) Boosting is more difficult to parallelize than bagging.

(c) (3 points) Decision trees are generally less sensitive to outliers than linear regression.

(d) (3 points) Bagging leverages the fact that the decision tree fitting procedure we discussed in class is nondeterministic: because fitting a tree to the same dataset results in a different tree each time, averaging the predictions of multiple such trees reduces variance.

- Monotonic regression. You are going to use gradient boosting to solve a regression problem, i.e. learn a function $f: \mathbb{R} \to \mathbb{R}$ where

$$
f \in \left\{ \sum_{m=1}^{M} v_m h_m(x) \;\middle|\; v_m \in \mathbb{R},\ h_m \in \mathcal{H},\ m = 1, \ldots, M \right\}
$$

Let's assume that the true function is monotonically increasing. For example: [figure not reproduced]

(a) (2 points) What loss function $\ell(y, \hat{y})$ (where $\hat{y}$ is the prediction) will you use?

(b) (3 points) What is the pseudo residual for an example $(x_i, y_i)$? (You can directly use $f$ in the expression.)

(c) (3 points) Does it make sense to choose the base hypothesis space $\mathcal{H}$ to be all linear predictors ($h(x) = wx$)? Why or why not?

(d) (3 points) Which of the following base predictors ($w \in \mathbb{R}$ is the parameter) ensures that $f$ will be monotonic? (Select all that apply.)

A. $h(x) = \begin{cases} 1 & x > w \\ 0 & x \leq w \end{cases}$

B. $h(x) = \begin{cases} x & x > w \\ 0 & x \leq w \end{cases}$

C. $h(x) = |x - w|$

D. $h(x) = \dfrac{1}{1 + e^{-(x - w)}}$

- Neural networks

Suppose our inputs have binary features $x_1 \in \{0, 1\}$ and $x_2 \in \{0, 1\}$, and we would like to define a classifier $f$ where $f(x_1, x_2) = 1$ if $x_1 = 1$ or $x_2 = 1$, but not both, and $f(x_1, x_2) = 0$ otherwise. For the following questions, you can use the sign function $\sigma(x) = \mathbb{1}\{x \geq 0\}$ as the nonlinearity, for simplicity.

(a) (2 points) Can a single-layer neural network (with no hidden units) compute this function? If so, spell out the equations computed by the network; if not, explain why.

(b) (2 points) Can a classification tree compute this function? If so, draw the tree; if not, explain why.

(c) (3 points) Can a two-layer neural network (with one hidden layer) compute this function? If so, spell out the equations computed by the network and provide an appropriate set of weights; if not, explain why.

In the following questions we will optimize the parameters for ridge regression using backpropagation. Recall that ridge regression is linear regression with the regularization term $\lambda w^T w$, where $w$ are the weights and $\lambda$ is a hyperparameter. We're going to use the standard square loss.

(d) (2 points) What is the objective function for a training instance $(x, y)$?

(e) (4 points) Draw the computation graph for this function, and below the graph express each node explicitly as a function of its inputs (for example, the last node in the graph might be $J = L + R$, where $J$ is the objective, $R$ is the regularization term and $L$ is the loss).

(f) (6 points) Using backpropagation, compute the partial derivative of the objective function with respect to a particular weight $w_j$. Show all intermediate derivatives. You will not be penalized for calculus mistakes (though try not to make them anyway!).

- K-means clustering of a toy dataset. Recall that the $k$-means algorithm aims to minimize the following objective on a dataset $\mathcal{D} = \{x_i\}_{i=1}^{n}$:

$$
J(c, \mu) = \sum_{i=1}^{n} \| x_i - \mu_{c_i} \|^2,
$$

where $c_i \in \{1, \ldots, k\}$ is the cluster assignment for each example $x_i$, and $\mu_j$ is the centroid of cluster $j \in \{1, \ldots, k\}$.

(a) (2 points) Suppose there is a single cluster (i.e. $k = 1$). Give an expression for the optimal centroid $\mu_1$ (no proof needed).

(b) (3 points) Suppose there are $n$ clusters (i.e. $k = n$). Give an expression for the optimal cluster assignments $c_i$ and centroids $\mu_i$ (no proof needed). What is the optimal objective $J(c, \mu)$ in this case?
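For readers who want to see the alternating procedure behind the $k$-means objective in action, the following is a minimal Python sketch of the standard iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). The toy points, the initial centroids, and all function names are hypothetical and are not taken from the exam's figure; keeping an empty cluster's centroid unchanged is one convention among several.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def assign(points: Sequence[Point], centroids: Sequence[Point]) -> List[int]:
    # Each point goes to the index of its nearest centroid.
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def update(points: Sequence[Point], assignments: Sequence[int],
           centroids: Sequence[Point]) -> List[Point]:
    new_centroids: List[Point] = []
    for j, old in enumerate(centroids):
        members = [p for p, c in zip(points, assignments) if c == j]
        if not members:
            new_centroids.append(old)  # keep the centroid of an empty cluster
            continue
        new_centroids.append((
            sum(p[0] for p in members) / len(members),
            sum(p[1] for p in members) / len(members),
        ))
    return new_centroids


def objective(points: Sequence[Point], assignments: Sequence[int],
              centroids: Sequence[Point]) -> float:
    # Sum of squared distances from each point to its assigned centroid.
    return sum(squared_distance(p, centroids[c]) for p, c in zip(points, assignments))


def k_means(points: Sequence[Point], initial_centroids: Sequence[Point],
            max_iters: int = 100) -> Tuple[List[int], List[Point]]:
    centroids = list(initial_centroids)
    assignments = assign(points, centroids)
    for _ in range(max_iters):
        centroids = update(points, assignments, centroids)
        new_assignments = assign(points, centroids)
        if new_assignments == assignments:  # converged: assignments are stable
            break
        assignments = new_assignments
    return assignments, centroids


# Hypothetical toy data: two well-separated pairs of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
assignments, centroids = k_means(toy_points, [(0.0, 0.0), (4.0, 1.0)])
final_objective = objective(toy_points, assignments, centroids)
print(assignments, centroids, final_objective)
```

The result depends on the initial centroids, so running the sketch with different starting points on the same data can converge to a different clustering.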

(c) Consider the following 2D dataset.

[Figure: scatter plot of the dataset, with $x_1$ on the horizontal axis (0 to 6) and $x_2$ on the vertical axis (0 to 4); the data points are not reproduced.]

You will be asked to circle or mark points in the graph. You can either draw directly on the figure, or list the points by their coordinates in text.

You are going to run the $k$-means algorithm to cluster the dataset with $k = 2$. The points circled in red denote the initial cluster centroids.

i. (3 points) Show the cluster centroids and the points in each cluster when the algorithm converges.

ii. (3 points) Is this the clustering you would come up with by inspecting the pattern in the data? If not, indicate the two clusters that make more sense to you.

(d) (4 points) Select two points (out of the ten blue dots) as initial cluster centroids such that the $k$-means algorithm would converge to the desired clusters that you gave in the previous question.

[Figure: scatter plot with the same axes, $x_1$ from 0 to 6 and $x_2$ from 0 to 4; the data points are not reproduced.]

Congratulations! You have reached the end of the exam.