The EM algorithm

Background: The EM algorithm

The EM Algorithm is a method for estimating parameters in models with unobserved variables. Classical examples of applications are found in model-based clustering and in sequence analysis. EM stands for Expectation-Maximization, and it describes an iterative method that maximizes an expected value at each iteration.

Problem statement. Assume that we observe data, y, from a probability distribution which is defined in a hierarchical way, as follows:

p(y | θ) = ∫ p(y | z, θ) p(z | θ) dz = E[ p(y | z, θ) | θ ].

In this formula, θ is the parameter of interest, and z is an unobserved (hidden) variable (the parameter θ and the hidden variable z can have large dimensions). The integral symbol is a generic symbol that represents the summation symbol when z is a discrete variable, and the multiple integration symbol when z is a continuous multidimensional variable. A way to estimate θ is by maximizing the log-likelihood function, where the likelihood represents the probability distribution of the observed variables, y, given θ. The solution of the optimization problem can be formalized as follows:

θ̂ = argmax_θ L(θ) = argmax_θ log p(y | θ).

The concern with this approach is that the integral (a summation in the discrete case) that appears in the formula of the log-likelihood,

log p(y | θ) = log ( ∫ p(y | z, θ) p(z | θ) dz ),

is very difficult to evaluate in general.

Algorithm. To overcome the above problem, the EM algorithm repeats an iterative process that is guaranteed to increase the likelihood of the parameter at each iteration, and that converges to a local maximum of the likelihood function. Let θ^0 denote the current value of the parameter. The EM update rule replaces θ^0 by θ^1, the value of θ that maximizes the following quantity

Q(θ, θ^0) = E[ log p(y, z | θ) | y, θ^0 ] = ∫ log p(y, z | θ) p(z | y, θ^0) dz,

and, more generally, we have

θ^{t+1} = argmax_θ Q(θ, θ^t).

The EM algorithm is useful when the quantity log p(y, z | θ) has a simple expression, for example a linear function of the hidden variable z, and when the probability p(z | y, θ^0) can easily be obtained from the Bayes formula

p(z | y, θ^0) ∝ p(y | z, θ^0) p(z | θ^0).
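
One standard route to the monotonicity claim above (stated here as background; the exercise below asks you to reconstruct the full argument) is the decomposition

log p(y | θ) = Q(θ, θ^0) + H(θ, θ^0),   where   H(θ, θ^0) = −∫ log p(z | y, θ) p(z | y, θ^0) dz.

By Gibbs' inequality, H(θ, θ^0) ≥ H(θ^0, θ^0) for every θ, so any θ^1 satisfying Q(θ^1, θ^0) ≥ Q(θ^0, θ^0) also satisfies log p(y | θ^1) ≥ log p(y | θ^0).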

Exercise. Basic arguments and remarks. Answer the following questions.

  1. Find justifications for why the likelihood increases at each iteration and summarize the key arguments of the proof.
  2. There are obvious limitations of the EM algorithm. Describe two potential concerns with this method.

Problem 1. An EM algorithm for detecting frequency change points in a binary sequence. We observe a sequence of binary data that consists of n observations, y = (y_1, …, y_n), y_i ∈ {0, 1}. The n binary variables correspond to independent signals from a random source, for which the frequency of 1s is unknown and may be modified at an unknown time point from θ₁ to θ₂. For example, the sequence of observations can be as follows:

y = 01100... 00110 ↓ 1110011 ... 1100111

In this representation, the ↓ symbol indicates that a change occurred at position z, which can be any point between 2 and n. For all i < z, the frequency of 1s is θ₁, and for i ≥ z the frequency of 1s is θ₂. By convention, z = 1 corresponds to the situation where no change occurs. In this case, the frequency of 1s is constant, and it is equal to θ₂.
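
To make the data-generating process concrete, here is a minimal R sketch that simulates such a sequence; the values of n, z, θ₁ and θ₂ are illustrative choices, not those underlying the challenge data.

    # Simulate a binary sequence with a frequency change at position z
    set.seed(1)
    n <- 399; z <- 150; theta1 <- 0.3; theta2 <- 0.7  # illustrative values
    y <- c(rbinom(z - 1, 1, theta1),      # frequency theta1 for i < z
           rbinom(n - z + 1, 1, theta2))  # frequency theta2 for i >= z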

Challenge and evaluation rule. Download the data from the following URL:

http://membres-timc.imag.fr/Olivier.Francois/sequence_2019.txt

The data consists of a sequence of 399 binary items. The objective of the challenge is to provide a list of (one or more) change points with the following information:

  • Most likely change point position, z, in the range [1, 399].
  • Lower and upper values z_l and z_u such that p(z ∈ [z_l, z_u]) = 0.75.
  • Estimates of the frequencies θ₁ and θ₂ before and after z.
  • Number of iterations of the EM algorithm.

The output file must be formatted as follows

number position lower upper theta1 theta2 iter
1 42 35 56 .72 .43 13
2 3

A README file including comments and describing the options used when analyzing the data is required. The results will be evaluated on the basis of 1) the number of correct detections, and 2) the evaluation of uncertainty on each correctly detected position (i.e., the correctness of the difference (upper − lower)).

Derivation of an EM algorithm. We use the following notation:

θ = (θ₁, θ₂).

For z = 2, …, n, we have

p(y_i | z, θ) = θ₁^{y_i} (1 − θ₁)^{1 − y_i},   i = 1, …, z − 1,

and

p(y_i | z, θ) = θ₂^{y_i} (1 − θ₂)^{1 − y_i},   i = z, …, n.

A priori, we assume that the change point z is sampled from the uniform distribution

p(z) = 1/n,   z = 1, …, n.

We call this distribution the prior distribution on z. The goal of this problem is to propose an EM algorithm for estimating the model parameter θ and for evaluating the conditional probabilities p(z | y) for all z = 1, …, n.

  1. Let z = 1. Give a formula for the probability p(y | z, θ). Same question for z > 1.
  2. Show that, for z > 1, we have

log p(y | z, θ) = log θ₁ ( Σ_{i=1}^{z−1} y_i ) + log(1 − θ₁) ( z − 1 − Σ_{i=1}^{z−1} y_i )
                + log θ₂ ( Σ_{i=z}^{n} y_i ) + log(1 − θ₂) ( n − z + 1 − Σ_{i=z}^{n} y_i ).
  3. Suppose θ₁ and θ₂ are known. Using the Bayes formula, show that, for z = 2, …, n,

R(z) = p(z | y, θ) / p(z = 1 | y, θ) = ∏_{j=1}^{z−1} [ p(y_j | θ₁, z) / p(y_j | θ₂, z = 1) ].
  4. Show that there is a relationship between R(z) and R(z − 1) for all z = 2, …, n, and propose an algorithm for computing p(z | y) for all z = 1, …, n.
  5. Propose an algorithm for computing the expected value

E₁ = E[ Σ_{i=1}^{z−1} y_i | y, θ^0 ]

by averaging over all values of z. Apply the same approach to the three other quantities found in question 2. What is the complexity of the algorithm?
  6. Describe the EM algorithm for estimating θ from y (a minimal implementation sketch follows this list).
  7. Generate simulated data for known values of θ and z. Apply the EM algorithm to the simulated data, and evaluate the convergence of the algorithm by testing several initial values θ^0. Plot histograms for p(z | y) using plot(., type = "h").
  8. Discuss the choice of a uniform prior distribution for z. How could the EM algorithm be modified to account for informative prior distributions?
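
The following R sketch assembles the pieces from questions 3 to 6; it is one possible implementation, not a model answer, and the function name, starting values and numerical guards are my own choices. The E-step computes p(z | y, θ^0) from the log-ratios log R(z) in a single cumulative pass, and the M-step updates θ₁ and θ₂ from the expected counts of question 5.

    # EM for the change-point model; y is a 0/1 vector (e.g., simulated as above)
    em_changepoint <- function(y, theta0 = c(0.4, 0.6), max_iter = 500, tol = 1e-8) {
      n <- length(y)
      theta <- theta0
      for (iter in 1:max_iter) {
        t1 <- theta[1]; t2 <- theta[2]
        # E-step: log R(z) = sum_{j < z} log[ p(y_j | theta1) / p(y_j | theta2) ], R(1) = 1
        lr <- y * (log(t1) - log(t2)) + (1 - y) * (log(1 - t1) - log(1 - t2))
        logR <- c(0, cumsum(lr)[1:(n - 1)])
        post <- exp(logR - max(logR))
        post <- post / sum(post)          # p(z | y, theta^0), uniform prior on z
        # M-step: expected counts before and after the change point (question 5)
        S <- c(0, cumsum(y))              # S[z] = sum of y_i over i < z
        E_y1 <- sum(post * S[1:n])        # E[ sum_{i<z} y_i | y, theta^0 ]
        E_n1 <- sum(post * (0:(n - 1)))   # E[ z - 1 | y, theta^0 ]
        new_theta <- c(E_y1 / max(E_n1, 1e-9),
                       (sum(y) - E_y1) / (n - E_n1))
        new_theta <- pmin(pmax(new_theta, 1e-6), 1 - 1e-6)  # keep away from 0 and 1
        if (max(abs(new_theta - theta)) < tol) break
        theta <- new_theta
      }
      list(theta = theta, posterior = post, iterations = iter)
    }

From the returned posterior, which.max gives the most likely change point, and bounds z_l, z_u with p(z ∈ [z_l, z_u]) = 0.75 can be read off cumsum(posterior).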

Problem 2. Gaussian mixture models. Consider a population P consisting of two subpopulations P₀ and P₁ with equal sizes. We sample n individuals from P, but their origins are not observed. A quantity y_i is measured for each individual. Grouping individuals based on the observations is called an unsupervised clustering task. In order to group individuals into clusters, we assume that the subpopulation labels are missing data. We want to estimate the cluster locations m₀ and m₁ for each group and the proportion of individuals sampled from subpopulation P₀ or P₁. In addition, we want to estimate the probability that each individual is sampled from population P₁ (or P₀). Let θ = (m₀, m₁). We define a Gaussian mixture model as follows: for y_i ∈ ℝ,

p(y_i | θ) = p p₁(y_i | θ) + q p₀(y_i | θ),

where p_k(y_i | θ) = N(y_i | m_k, σ² = 1) is the Gaussian density function, for k = 0, 1. The probability p is the probability of sampling from P₁, and q = 1 − p.
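
As an illustration of this model (and one possible approach to question 2 below), here is a minimal R sketch for sampling from the mixture; the values of p, m₀ and m₁ are arbitrary.

    # Draw n = 200 samples from p*N(m1, 1) + q*N(m0, 1); p, m0, m1 are illustrative
    set.seed(2)
    n <- 200; p <- 0.5; m0 <- 0; m1 <- 3
    z <- rbinom(n, 1, p)                           # hidden labels z_i
    y <- rnorm(n, mean = ifelse(z == 1, m1, m0))   # sd = 1 by default
    hist(y, breaks = 30)                           # bimodal when |m1 - m0| is large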

  1. To begin, we consider that the variance σ² is a known parameter equal to σ² = 1, and that p = 1/2. For all individuals, we consider hidden variables z_i ∈ {0, 1} representing their unobserved label of source population. Show that p(y_i | θ) = p(z_i = 1) p(y_i | z_i = 1, θ) + p(z_i = 0) p(y_i | z_i = 0, θ), where p(z_i = 1) = 1/2.
  2. Write a computer program for drawing samples from p(y_i | θ) of size n = 200, for fixed values of (p, m₀, m₁); the sketch above shows one possible approach. Check that the program is correct by drawing a histogram of the simulated data.
  3. Considering the unobserved vector z = (z₁, …, z_n), show that

log p(y, z | θ) = −(1/2) Σ_{i=1}^{n} [ (1 − z_i)(y_i − m₀)² + z_i (y_i − m₁)² ] + C_n.
  4. Let n₁ = Σ_{i=1}^{n} z_i and n₀ = n − n₁. Show that the above expression is maximized for

m₁ = (1/n₁) Σ_{i=1}^{n} y_i z_i,   m₀ = (1/n₀) Σ_{i=1}^{n} y_i (1 − z_i).   (⋆)
  5. We suppose p(z_i = 1) = 1/2. Show that the conditional probability of z_i = 1 given y_i and θ^0 is equal to

p(z_i = 1 | y_i, θ^0) = exp( −(y_i − m₁^0)² / 2 ) / [ exp( −(y_i − m₀^0)² / 2 ) + exp( −(y_i − m₁^0)² / 2 ) ],

where θ^0 = (m₀^0, m₁^0).

  6. In equations (⋆), replace the hidden variable z_i by p(z_i = 1 | y_i, θ^0), and show that this operation corresponds to writing the EM algorithm for estimating θ.
  7. Write the EM algorithm in the R programming language (see the sketch after this list). Generate simulated data for known values of θ and z. Apply the EM algorithm to the simulated data, and evaluate the convergence of the algorithm by testing several initial values θ^0.
  8. Extend the EM algorithm to the case where the variance σ² is unknown and p is arbitrary. Extend it further to the case where the two classes have unequal (unknown) variances, and p is arbitrary.
  9. Download the data from the following URL:
http://membres-timc.imag.fr/Olivier.Francois/data2.txt
  10. Apply the EM algorithm to the data. Evaluate the convergence of the algorithm by testing several initial values θ^0. Report estimates for m₀ and m₁, and display p(z | y) by using the barplot command to visualize the probability matrix of size n × 2.
  11. Install the R package mclust from the CRAN web site. Look at the different options of the Mclust function (models and outputs), and run the Mclust command on the data for G = 1 to 5.
  12. Find a definition of the Bayesian Information Criterion (BIC). Discuss the choice of a model for the data using the BIC.
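
A minimal R sketch of the EM iteration of questions 5 and 6 (known variance σ² = 1 and p = 1/2); the function name, starting values and stopping rule are my own illustrative choices, not part of the assignment.

    # EM for the two-component Gaussian mixture with sigma^2 = 1 and p = 1/2
    em_mixture <- function(y, m0 = min(y), m1 = max(y), max_iter = 200, tol = 1e-8) {
      for (iter in 1:max_iter) {
        # E-step: w_i = p(z_i = 1 | y_i, theta^0), as in question 5
        a <- exp(-(y - m1)^2 / 2)
        w <- a / (a + exp(-(y - m0)^2 / 2))
        # M-step: plug the posteriors into the updates (⋆) of question 4
        new_m1 <- sum(w * y) / sum(w)
        new_m0 <- sum((1 - w) * y) / sum(1 - w)
        if (max(abs(c(new_m0 - m0, new_m1 - m1))) < tol) break
        m0 <- new_m0; m1 <- new_m1
      }
      list(m0 = m0, m1 = m1, posterior = w, iterations = iter)
    }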

Challenge and evaluation rule. Download the data from the following URL:

http://membres-timc.imag.fr/Olivier.Francois/matrix_2019.txt

The data consists of a matrix of 499 rows and 6421 columns with entries in {0, 1, 2}. The objective of the challenge is to evaluate the number of clusters (for rows) in the data set and to assign a cluster label to each row. The result is a list of 499 cluster labels, one for each row. The output file must contain the resulting list formatted as a sequence of integer values separated by space characters as follows

12 12 1 6 7 6 6 3 11 11 …

A README file describing the options used when analyzing the data is required. The results will be evaluated on the basis of the confusion matrix and the number of wrongly classified rows.

Important comments: Use a dimension reduction algorithm such as principal component analysis or multidimensional scaling to reduce the dimension of the data set before applying model-based clustering algorithms. Then, prefer using the Mclust algorithm rather than reprogramming your own EM method.
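
A possible skeleton for this pipeline, shown only as a sketch: the local file name assumes the data have been downloaded from the URL above and are a whitespace-separated table, and the number of retained components and the range of G are assumptions that should be justified in the README.

    # Dimension reduction followed by model-based clustering, as recommended above
    library(mclust)
    X <- as.matrix(read.table("matrix_2019.txt"))  # assumes a whitespace-separated table
    pcs <- prcomp(X)$x[, 1:5]      # keep a few leading principal components (a choice)
    fit <- Mclust(pcs, G = 1:20)   # Mclust selects the number of clusters by BIC
    cat(fit$classification, file = "labels.txt")   # space-separated labels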

Problem 3. ABO groups and genetics

A geneticist studies genotypes at the ABO locus (alleles A, B, O) in blood samples from n individuals, and observes the phenotype (blood group) of each individual. Four distinct phenotypes can be observed:

  • type A corresponds to genotypes A/A and A/O (sample size n_A),
  • type B corresponds to genotypes B/B and B/O (sample size n_B),
  • type AB corresponds to genotype A/B (sample size n_AB),
  • type O corresponds to genotype O/O (sample size n_O),

where we suppose that A/O and O/A are the same genotype (and similarly for B), and

n = n_A + n_B + n_AB + n_O.

We say that types A and B are codominant, whereas type O is recessive and is observed only if an individual carries two copies of allele O. We want to estimate the frequency p_A of allele A in the sampled population.

  1. We first assume that all allele frequencies are known parameters. Using the Hardy-Weinberg principle, show that the expected number of genotypes A/A in a sample of size n is equal to

n_{A/A} = n_A p_A² / ( p_A² + 2 p_A p_O ).

  2. Find a similar equation for n_{A/O}.
  3. Suppose we know n_{A/A}, n_{A/O}, and n_{AB}. Give an estimate p̂_A of the frequency p_A.
  4. Use a circular (iterative) argument to compute an estimate p̂_A from the data.
  5. Write an EM algorithm for estimating all allele frequencies (a sketch follows this list).
  6. Application: We observe n = 521 cases of peptic ulcer disease, for which n_A = 186, n_B = 38, n_AB = 13 and n_O = 284. Find estimates for p_A, p_B and p_O.
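
For question 5, here is a sketch of the classical gene-counting EM, with the counts of question 6 as defaults; the starting frequencies and the function name are arbitrary choices.

    # Gene-counting EM for ABO allele frequencies under Hardy-Weinberg proportions
    em_abo <- function(nA = 186, nB = 38, nAB = 13, nO = 284,
                       pA = 1/3, pB = 1/3, max_iter = 100, tol = 1e-8) {
      n <- nA + nB + nAB + nO
      for (iter in 1:max_iter) {
        pO <- 1 - pA - pB
        # E-step: expected genotype counts within phenotypes A and B (questions 1-2)
        nAA <- nA * pA^2 / (pA^2 + 2 * pA * pO)
        nBB <- nB * pB^2 / (pB^2 + 2 * pB * pO)
        nAO <- nA - nAA
        nBO <- nB - nBB
        # M-step: count allele copies among the 2n genes in the sample
        newA <- (2 * nAA + nAO + nAB) / (2 * n)
        newB <- (2 * nBB + nBO + nAB) / (2 * n)
        if (max(abs(c(newA - pA, newB - pB))) < tol) break
        pA <- newA; pB <- newB
      }
      c(pA = pA, pB = pB, pO = 1 - pA - pB)
    }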