# Algorithm | Machine learning代做 – Homework 2

### Homework 2

Algorithm | Machine learning代做 – 该题目是一个常规的Algorithm的练习题目代写, 涉及了Algorithm/Machine learning等代写方面 • Your answers to Questions 1, 2, and 3 as a PDF file titledhw2_writeup.pdf. You can produce the file however you like (e.g. LATEX, Microsoft Word, scanner), as long as it is readable.

Neatness Point:One of the 10 points will be given for neatness. You will receive this point as long as we dont have a hard time reading your solutions or understanding the structure of your code.

Late Submission:10% of the marks will be deducted for each day late, up to a maximum of 3 days. After that, no submissions will be accepted.

Weekly homeworks are individual work. See the Course Information handouthttp://www.cs. toronto.edu/~mren/teach/csc411_19s/syllabus.pdffor detailed policies.

1. [4pts] Information Theory.The goal of this question is to help you become more familiar with the basic equalities and inequalities of information theory. They appear in many contexts in Machine learning and elsewhere, so having some experience with them is quite helpful. We review some concepts from information theory, and ask you a few questions. Recall the definition of the entropy of a discrete random variableXwith probability mass functionp: H(X) =
``````xp(x) log 2
``````

#### (

``````1
p(x)
``````

#### )

. Here the summation is over all possible values of x X, which (for simplicity) we assume is finite. For example,X might be{ 1 , 2 ,…,N}. You may assumep(x) log 2 (

#### (

``````1
p(x)
``````

#### )

``````= 0 ifp(x) = 0.
``````
``````(a)[1pt]Prove that the entropyH(X) is non-negative.
``````
``````An important concept in information theory is the relative entropy or theKL-divergence of
two distributionspandq. It is defined as
``````
``````KL(p||q) =
``````
``````x
``````
``````p(x) log 2
``````
``````p(x)
q(x)
``````

#### .

``````TheKL-divergence is one of the most commonly used measure of difference (or divergence)
between two distributions, and it regularly appears in information theory, machine learning,
and statistics. For this question, you may assumep(x)>0 andq(x)>0 for allx.
If two distributions are close to each other, theirKLdivergence is small. If they are exactly
the same, theirKLdivergence is zero. KLdivergence is not a true distance metric (since it
isnt symmetric and doesnt satisfy the triangle inequality), but we often use it as a measure
of dissimilarity between two probability distributions.
``````
``````(b)[2pt]Prove thatKL(p||q) is non-negative.Hint: you may want to use Jensens Inequal-
ity, which is described in the Appendix.
``````
``````(c) [1pt]The Information Gain or Mutual Information betweenX andY is I(Y;X) =
H(Y)H(Y|X). Show that
``````
``````I(Y;X) =KL(p(x,y)||p(x)p(y)),
``````
``````wherep(x) =
``````
``````yp(x,y) is the marginal distribution ofX.
``````
1. [2pts] Benefit of Averaging.Considermestimatorsh 1 ,…,hm, each of which accepts an inputxand produces an outputy, i.e.,yi=hi(x). These estimators might be generated through a Bagging procedure, but that is not necessary to the result that we want to prove. Consider the squared error loss functionL(y,t) =^12 (yt)^2. Show that the loss of the average estimator h (x) =^1 m
``````m
``````
``````i=
``````
``````hi(x),
``````
``````is smaller than the average loss of the estimators. That is, for anyxandt, we have
``````
``````L(h (x),t)
``````

#### 1

``````m
``````
``````m
``````
``````i=
``````
``````L(hi(x),t).
``````
``````Hint: you may want to use Jensens Inequality, which is described in the Appendix.
``````
1. [3pts] AdaBoost.The goal of this question is to show that the AdaBoost Algorithm changes the weights in order to force the weak learner to focus on difficult data points. Here we consider the case that the target labels are from the set{ 1 ,+1}and the weak learner also returns a classifier whose outputs belongs to{ 1 ,+1}(instead of{ 0 , 1 }). Consider thet-th iteration of AdaBoost, where the weak learner is
``````htargmin
hH
``````

#### N

``````i=
``````
``````wiI{h(x(i)) 6 =t(i)},
``````
``````thew-weighted classification error is
``````
``````errt=
``````

#### N

``````i=1wiI{ht(x
``````
``````(i)) 6 =t(i)}
N
i=1wi
``````

#### ,

``````and the classifier coefficient ist=^12 log^1 errerrtt. (Here, log denotes the natural logarithm.)
AdaBoost changes the weights of each sample depending on whether the weak learnerht
classifies it correctly or incorrectly. The updated weights for sampleiis denoted bywiand is
``````
``````wiwiexp
``````

#### (

``````tt(i)ht(x(i))
``````

#### .

``````Show that the error w.r.t. (w 1 ,...,wN) is exactly^12. That is, show that
``````
``````errt=
``````

#### N

``````i=1w

iI{ht(x
(i)) 6 =t(i)}
N
i=1w

i
``````

#### .

``````Note that here we use the weak learner of iterationtand evaluate it according to the new
weights, which will be used to learn thet+ 1-st weak learner. What is the interpretation of
this result?
``````

Appendix: Convexity and Jensens Inequality. Here, we give some background on con- vexity which you may find useful for some of the questions in this assignment. You may assume anything given here. Convexity is an important concept in mathematics with many uses in machine learning. We briefly define convex set and function and some of their properties here. Using these properties are useful in solving some of the questions in the rest of this homework. If you are interested to know more about convexity, refer to Boyd and Vandenberghe, Convex Optimization, 2004. A setCisconvexif the line segment between any two points inClies withinC, i.e., if for any x 1 ,x 2 Cand for any 01, we have

``````x 1 + (1)x 2 C.
``````

For example, a cube or sphere inRdare convex sets, but a cross (a shape likeX) is not. A functionf:RdRisconvexif its domain is a convex set and if for allx 1 ,x 2 in its domain, and for any 01, we have

``````f(x 1 + (1)x 2 )f(x 1 ) + (1)f(x 2 ).
``````

This inequality means that the line segment between (x 1 ,f(x 1 )) and (x 2 ,f(x 2 )) lies above the graph off. A convex function looks like`. We say thatf isconcaveiff is convex. A concave function looks likea. Some examples of convex and concave functions are (you do not need to use most of them in your homework, but knowing them is useful):

• Powers: xpis convex on the set of positive real numbers whenp1 orp0. It is concave for 0p1.
• Exponential: eaxis convex onR, for anyaR.
• Logarithm: log(x) is concave on the set of positive real numbers.
• Norms: Every norm onRdis convex.
• Max function:f(x) = max{x 1 ,x 2 ,…,xd}is convex onRd.
• Log-sum-exp: The functionf(x) = log(ex^1 +…+exd) is convex onRd.

An important property of convex and concave functions, which you may need to use in your homework, isJensens inequality. Jensens inequality states that if(x) is a convex function ofx, we have (E[X])E[(X)].

In words, if we apply a convex function to the expectation of a random variable, it is less than or equal to the expected value of that convex function when its argument is the random variable. If the function is concave, the direction of the inequality is reversed. Jensens inequality has a physical interpretation: Consider a setX ={x 1 ,…,xN}of points onR. Corresponding to each point, we have a probabilityp(xi). If we interpret the probability as mass, and we put an object with massp(xi) at location (xi,(xi)), then the centre of gravity of these objects, which is inR^2 , is located at the point (E[X],E[(X)]). Ifis convex`, the centre of gravity lies above the curvex7(x), and vice versa for a concave functiona.