# CIS 520: Machine Learning, Spring 2018, Problem Set 4
1. (20 points) EM Practice: Red and Blue Coins. Your friend has two coins: a red coin and a blue coin, with biases pr and pb, respectively (i.e. the red coin comes up heads with probability pr, and the blue coin does so with probability pb). She also has an inherent preference π for the red coin. She conducts a sequence of m coin tosses: for each toss, she first picks either the red coin with probability π or the blue coin with probability 1 − π, and then tosses the corresponding coin; the process for each toss is carried out independently of all other tosses. You don't know which coin was used on each toss; all you are told are the outcomes of the m tosses (heads or tails). In particular, for each toss i, define a random variable Xi as
Xi = 1 if the i-th toss results in heads, and Xi = 0 otherwise.
Then the data you see are the values x1, . . . , xm taken by these m random variables. Based on this data, you want to estimate the parameters θ = (π,pr,pb). To help with this, for each toss i, define a latent (unobserved) random variable Zi as follows:
Zi = 1 if the i-th toss used the red coin, and Zi = 0 otherwise.
(a) Let X be a random variable denoting the outcome of a coin toss according to the process described above, and let Z be the corresponding latent random variable indicating which coin was used, also as described above (both X and Z take values in {0,1} as above). Write an expression for the joint distribution of X and Z. Give your answer in the form
p(x, z; θ) = ______.
(b) Write an expression for the complete-data log-likelihood, ln Lc(θ) = Σ_{i=1}^{m} ln p(xi, zi; θ).
(c) Suppose you knew the values zi taken by the latent variables Zi. What would be the maximum-likelihood parameter estimates θ̂? Give expressions for π̂, p̂r, and p̂b (in terms of the xi and zi).
(d) In the absence of knowledge of zi, one possibility for estimating θ is to use the EM algorithm. Recall that the algorithm starts with some initial parameter estimates θ0, and then on each iteration t, performs an E-step followed by an M-step. Let θt denote the parameter estimates at the start of iteration t. In the E-step, for each toss i, the algorithm requires computing the posterior distribution of the latent variable Zi under the current parameters θt. Calculate the posterior probability P(Zi = 1 | Xi = xi; θt).
(e) For each toss i, denote the posterior probability computed in part (d) above by γit (so that γit = P(Zi = 1|Xi = xi; θt)). Then the expected complete-data log-likelihood with respect to these posterior distributions is
Σ_{i=1}^{m} [ γit · ln p(xi, 1; θ) + (1 − γit) · ln p(xi, 0; θ) ].
The M-step of the EM algorithm requires finding parameters θt+1 that maximize this expected
complete-data log-likelihood. Determine the updated parameters θt+1. Give expressions for π^(t+1), p_r^(t+1), and p_b^(t+1) (in terms of the xi and γit).
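For a numerical sanity check of parts (d) and (e), the E- and M-steps can be implemented in a few lines. The sketch below is in Python (not required by this problem set) and uses the standard mixture-of-Bernoullis EM updates; the function name `em_two_coins` and the initial estimates are our own choices, so treat this as a reference to check a derivation against, not as the derivation itself.

```python
import numpy as np

def em_two_coins(x, n_iters=200):
    """EM for the red/blue coin mixture: estimates theta = (pi, p_r, p_b).

    x : array of 0/1 outcomes of the m tosses.
    The updates below are the standard mixture-of-Bernoullis EM updates;
    deriving them is exactly parts (d) and (e) of the problem.
    """
    x = np.asarray(x, dtype=float)
    pi, p_r, p_b = 0.6, 0.7, 0.3          # arbitrary initial estimates theta^0
    for _ in range(n_iters):
        # E-step: gamma_i = P(Z_i = 1 | X_i = x_i; theta^t), by Bayes' rule
        lik_r = pi * p_r**x * (1 - p_r)**(1 - x)
        lik_b = (1 - pi) * p_b**x * (1 - p_b)**(1 - x)
        gamma = lik_r / (lik_r + lik_b)
        # M-step: maximize the expected complete-data log-likelihood
        pi = gamma.mean()
        p_r = (gamma * x).sum() / gamma.sum()
        p_b = ((1 - gamma) * x).sum() / (1 - gamma).sum()
    return pi, p_r, p_b
```

Note that with one observation per toss the incomplete-data likelihood depends on θ only through the overall heads probability π·pr + (1 − π)·pb, so EM will drive that combination to the empirical heads frequency while π, pr, pb individually depend on the initialization.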
2. (25 points) Programming Exercise: Gaussian Mixture Models. Write a piece of MATLAB code to implement the EM algorithm for learning a Gaussian mixture model (GMM) given a data set X, where X is an m × d matrix (m instances, each of dimension d), and a target number of Gaussians K: your program should take the training data X and the target number of Gaussians K as input, and should output estimated parameters of a mixture of K Gaussians (specifically, it should output mixing coefficients π̂ ∈ ∆K, mean vectors μ̂1, . . . , μ̂K ∈ Rd, and covariance matrices Σ̂1, . . . , Σ̂K ∈ Rd×d).

For this problem, you are provided a 2-dimensional data set generated from a mixture of 3 Gaussians. The data set is divided into training and test sets; the training set has 480 data points and the test set has 120 data points. The goal is to learn a GMM from the training data; the quality of the learned GMM will be measured via its log-likelihood on the test data.
EM initialization and stopping criterion: Initialize the mixing coefficients to a uniform distribution, and each of the covariance matrices to be the d × d identity matrix: π^0 = (1/K, . . . , 1/K)⊤, and Σ^0_1 = . . . = Σ^0_K = Id. Initializations for the means will be provided to you (if you like, you can write your program to take the mean initialization as an additional input). For the stopping criterion, on each iteration t, keep track of the (incomplete-data) log-likelihood on the training data, Σ_{i=1}^{m} ln p(xi; θt); stop when either the change in log-likelihood, Σ_{i=1}^{m} ln p(xi; θt+1) − Σ_{i=1}^{m} ln p(xi; θt), becomes smaller than 10^{-6}, or when the number of iterations reaches 1000 (whichever occurs earlier; here θt denotes the parameter estimates on iteration t).
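For orientation, the training loop just described (uniform mixing weights, identity covariances, provided mean initialization, log-likelihood-based stopping) can be sketched as below. This is a Python sketch rather than the required MATLAB, `gmm_em` and `mu_init` are our own names, and it assumes the standard GMM EM updates.

```python
import numpy as np

def gmm_em(X, K, mu_init, max_iters=1000, tol=1e-6):
    """EM for a K-component GMM with the initialization and stopping
    criterion described above. mu_init is a K x d array of initial means
    (the course provides these in MeanInitialization). Returns the
    estimated (pi, mu, Sigma) and the per-iteration log-likelihoods."""
    m, d = X.shape
    pi = np.full(K, 1.0 / K)                  # uniform mixing coefficients
    mu = np.array(mu_init, dtype=float)       # provided mean initialization
    Sigma = np.stack([np.eye(d)] * K)         # identity covariances

    def log_gauss(X, mu_k, Sigma_k):
        # log N(x; mu_k, Sigma_k) for every row of X, via Cholesky
        diff = X - mu_k
        L = np.linalg.cholesky(Sigma_k)
        sol = np.linalg.solve(L, diff.T)      # L^{-1} (x - mu)^T
        maha = np.sum(sol**2, axis=0)         # Mahalanobis distances
        logdet = 2 * np.sum(np.log(np.diag(L)))
        return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

    prev_llh = -np.inf
    llhs = []
    for _ in range(max_iters):
        # E-step: responsibilities gamma[i, k], computed via log-sum-exp
        logp = np.log(pi) + np.column_stack(
            [log_gauss(X, mu[k], Sigma[k]) for k in range(K)])
        mx = logp.max(axis=1, keepdims=True)
        log_norm = mx[:, 0] + np.log(np.exp(logp - mx).sum(axis=1))
        gamma = np.exp(logp - log_norm[:, None])
        llh = log_norm.sum()                  # incomplete-data log-likelihood
        llhs.append(llh)
        if llh - prev_llh < tol:              # stopping criterion from above
            break
        prev_llh = llh
        # M-step: reestimate mixing weights, means, covariances
        Nk = gamma.sum(axis=0)
        pi = Nk / m
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, Sigma, llhs
```

The log-sum-exp step in the E-step mirrors the underflow advice given for the HMM problem below: responsibilities are formed from log-densities, never from raw products of densities.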
(a) Known K. For this part, assume you are given K = 3.
i. Learning curve. Use your implementation of EM for GMMs to learn a mixture of 3 Gaussians from increasing fractions of the training data (10% of the training data, then 20%, then 30%, and so on up to 100%; you must use the subsets provided in the folder TrainSubsets); in each case, to initialize the means of the Gaussians, use the initializations provided in the folder MeanInitialization (see the README.txt file for more details). In each case, calculate the normalized (incomplete-data) log-likelihood of the learned model on the training data, as well as the normalized log-likelihood on the test data (here normalizing means dividing the log-likelihood by the number of data points); you can use the provided file compute_nllh.m to compute the normalized log-likelihood. In a single plot, give curves showing the normalized train and test log-likelihoods (on the y-axis) as a function of the fraction of data used for training (on the x-axis). (Note: the training log-likelihood should be calculated only on the subset of examples used for training, not on all the training examples available in the given data set.)
ii. Analysis of learned models. For each of the learned models (corresponding to the different subsets of the training data), show a plot of the Gaussians in the learned GMM overlaid on the test data (you can use the code in sample_plot.m to help with the plotting). What do you observe? For the final model (learned from 100% of the training data), also write down all the parameters of the learned GMM, together with the (normalized) train and test log-likelihoods.
(b) Unknown K. Now suppose you do not know the right number of Gaussians in the mixture model to be learned. Select the number of Gaussians K from the range {1, . . . , 5} using 5-fold cross-validation on the full training data (use the folds provided in the folder CrossValidation). For each K, to initialize the means of the K Gaussians, use the initializations provided in the folder MeanInitialization. In a single plot, draw 3 curves showing the (normalized) train, test, and cross-validation log-likelihoods (on the y-axis) as a function of the number of Gaussians K (on the x-axis); again, you can use the provided file compute_nllh.m to compute the normalized log-likelihood. Write down the chosen value of K (the one with the highest cross-validation log-likelihood).
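The cross-validation mechanics of part (b) can be sketched as follows (again Python rather than MATLAB). Here `select_k_by_cv`, `fit`, and `loglik` are hypothetical names standing in for your EM implementation and normalized log-likelihood routine, and the random folds are only a placeholder: for the actual assignment, use the fixed folds provided in the CrossValidation folder.

```python
import numpy as np

def select_k_by_cv(X, fit, loglik, ks=range(1, 6), n_folds=5, seed=0):
    """Choose K by 5-fold cross-validated (normalized) log-likelihood.

    fit(X_train, K) should return a fitted model and loglik(model, X_val)
    its normalized log-likelihood on held-out data; both are placeholders
    for your own EM code. Random folds are a stand-in for the provided ones.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    cv_llh = {}
    for K in ks:
        scores = []
        for f in range(n_folds):
            val = folds[f]
            train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            model = fit(X[train], K)
            scores.append(loglik(model, X[val]))
        cv_llh[K] = float(np.mean(scores))       # mean held-out log-likelihood
    best_K = max(cv_llh, key=cv_llh.get)         # highest CV log-likelihood wins
    return best_K, cv_llh
```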
3. (15 points) Hidden Markov Models: Strange Behavior. On any given day, Bob is in one of two states: happy or sad. You do not know his internal state, but get to observe his activities in the evening. Each evening, he either sings, goes for a walk, or watches TV.
Bob's state on any day is random. His state Z1 on day 1 is determined by P(Z1 = happy) = 3/5.
Given his state Zt on day t, his state Zt+1 on the next day is governed by the following probabilities (and is conditionally independent of his previous states and activities):
P(Zt+1 = happy|Zt = happy) = 3/8 P(Zt+1 = happy|Zt = sad) = 3/5.

Bob's activities are also random. His activities vary based on his state; given his state Zt on day t, his activity Xt on that day is governed by the following probabilities (and is conditionally independent of everything else):

P(Xt = sing|Zt = happy) = 4/9 P(Xt = walk|Zt = happy) = 1/9 P(Xt = TV|Zt = happy) = 4/9
P(Xt = sing|Zt = sad) = 2/9 P(Xt = walk|Zt = sad) = 4/9 P(Xt = TV|Zt = sad) = 3/9.

Suppose you observe Bob singing on day 1 and watching TV on day 2, i.e. you observe x1:2 = (sing, TV). Find the joint probability of this observation sequence together with each possible hidden state sequence that could be associated with it, i.e. find P(x1:2, z1:2) for each of the four state sequences z1:2 ∈ {(happy, happy), (happy, sad), (sad, happy), (sad, sad)}. Show your calculations.

Based on these probabilities, what is the most likely hidden state sequence z1:2? What is the individually most likely hidden state on day 2?

4. (20 points) Hidden Markov Models: A Part-of-Speech Tagging Application. In this problem, you will use hidden Markov models (HMMs) for automatic part-of-speech (POS) tagging of English sentences. In particular, we will view words in a sentence as observations xt and the corresponding POS tags as hidden states zt. We will be using a vocabulary of 14,394 words, and 218 POS tags as provided by the Natural Language Toolkit (NLTK).1 You are provided the initial, transition, and emission probabilities associated with an HMM trained on an English language corpus.2 You are also given three 'test' sentences; the goal is to find the most likely sequences of POS tags for these sentences.

Specifically, you are given the following data:

vocabulary.txt – Vocabulary of words (contains 14394 unique words)
pos.txt – POS tags (contains 218 unique POS tags)
init.txt – Initial probabilities (1 × 218 vector)
trans.txt – Transition probabilities (218 × 218 matrix)
emit.txt – Emission probabilities (218 × 14394 matrix)
test_sents.txt – Test sentences separated by a newline (each sentence contains tab-delimited words)

Write a small piece of MATLAB code to implement the Viterbi algorithm. For each of the three test sentences x1:T in the file test_sents.txt, use your Viterbi implementation to find the most likely POS tag sequence z1:T (one POS tag for each word in the sentence; the sentence length T may differ across sentences). For each test sentence, write down both the most likely POS tag sequence z1:T (using the actual POS tags as provided in the file pos.txt, not just the tag numbers) and the maximal joint probability p(x1:T, z1:T).

Hints: (a) In your code, rather than multiplying probabilities (which can lead to underflow), you may find it helpful to add their logarithms (and later exponentiate to obtain the final result). (b) The probabilities provided are indexed according to the order of POS tags in the file pos.txt and vocabulary words in the file vocabulary.txt. You may find the textscan and string array functions in MATLAB useful for processing the test sentences.
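A minimal log-space Viterbi of the kind suggested by hint (a) of the POS-tagging problem might look as follows. This is a Python sketch rather than the required MATLAB, and `viterbi` is our own name; the two-state happy/sad model above makes a convenient small sanity check before running it on the 218-tag HMM.

```python
import numpy as np

def viterbi(init, trans, emit, obs):
    """Most likely hidden state sequence for a discrete HMM, in log-space.

    init: (S,) initial probabilities; trans: (S, S) transition matrix
    (trans[s', s] = P(next = s | current = s')); emit: (S, V) emission
    matrix; obs: list of observation indices. Returns the best path
    (as state indices) and its log joint probability.
    """
    with np.errstate(divide="ignore"):          # log(0) -> -inf is fine here
        li, lt, le = np.log(init), np.log(trans), np.log(emit)
    T, S = len(obs), len(init)
    delta = li + le[:, obs[0]]                  # best log prob of ending in each state
    back = np.zeros((T, S), dtype=int)          # backpointers
    for t in range(1, T):
        cand = delta[:, None] + lt              # cand[s', s]: come from s' into s
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + le[:, obs[t]]
    path = [int(delta.argmax())]                # trace back from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return path, float(delta.max())
```

Exponentiating the returned log probability recovers the maximal joint probability p(x1:T, z1:T), as hint (a) describes.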
5. (20 points) Bayesian Networks. Consider the Bayesian network over 8 random variables X1, X2, X3, X4, X5, X6, X7, X8 shown below (assume for simplicity that each random variable takes 2 possible values):
1. For more details, see http://www.nltk.org/book/ch05.html
2. The HMM parameters were estimated using maximum likelihood estimation on a (labeled) data set obtained from the news category of the Brown corpus (containing 4623 sentences), tagged using the NLTK POS tagger.

(a) Write an expression for the joint probability mass function p(x1, x2, x3, x4, x5, x6, x7, x8) that makes the same (conditional) independence assumptions as the Bayesian network above.
(b) Consider a joint probability distribution satisfying the following factorization:
p(x1, x2, x3, x4, x5, x6, x7, x8) = p(x1)p(x2)p(x3)p(x4 | x1)p(x5 | x2, x3)p(x6 | x3, x5)p(x7)p(x8 | x5) .
Is this distribution included in the class of joint probability distributions that can be represented by the Bayesian network above? Briefly explain your answer.
(c) If the edge from X5 to X8 is removed from the above network, will the class of joint probability distributions that can be represented by the resulting Bayesian network be smaller or larger than that associated with the original network? Briefly explain your answer.
(d) Given the above figure, determine whether each of the following is true or false. Briefly justify your answer.
i. X4 ⊥ X5
ii. X2 ⊥ X6
iii. X3 ⊥ X4 | X8
iv. X2 ⊥ X8 | X5