Description
Please read these instructions to ensure you receive full credit on your homework. Submit the written portion of your homework as a single PDF file (under 5MB) through Courseworks. In addition to your PDF write-up, submit all code written by you with its original file extensions (e.g., .m, .r, .py) through Courseworks. Any programming language is acceptable. Do not wrap your files in .rar, .zip, or .tar archives, and do not submit your write-up as .doc or any other file type. Your grade will be based on the contents of the single PDF file and the original source code; additional files will be ignored. We will not run your code, so everything you are asked to show must appear in the PDF file. Show all work for full credit.
Problem Set-up
We are given observations X = {x1, …, xn} where each xi ∈ Rd. We model this as being generated from a Gaussian mixture model of the form

p(xi | π, µ, Λ) = Σ_{j=1}^{K} πj Normal(xi | µj, Λj^{-1}),

where π is the vector of mixing weights and (µj, Λj) are the mean and precision of the jth Gaussian. Equivalently, each xi is drawn by first sampling a cluster indicator ci ∼ Discrete(π) and then drawing xi | ci ∼ Normal(µ_{ci}, Λ_{ci}^{-1}).
In this homework, you will implement three algorithms for learning this mixture model: one based on maximum likelihood EM, one on variational inference, and one on Gibbs sampling. Use the data provided for all experiments.
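As a warm-up, the generative process above can be sketched in a few lines of NumPy, treating π, the µj, and the Λj as given (this sketch is for intuition only; the homework uses the data provided, not simulated data):

```python
import numpy as np

def sample_gmm(n, pi, mu, Lam, seed=0):
    """Draw n points from the mixture: c_i ~ Discrete(pi), x_i ~ N(mu_{c_i}, Lam_{c_i}^{-1})."""
    rng = np.random.default_rng(seed)
    K, d = mu.shape
    c = rng.choice(K, size=n, p=pi)          # cluster indicators c_i
    X = np.empty((n, d))
    for j in range(K):
        idx = np.where(c == j)[0]
        cov = np.linalg.inv(Lam[j])          # covariance is the inverse precision
        X[idx] = rng.multivariate_normal(mu[j], cov, size=len(idx))
    return X, c
```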
Problem 1. (30 points)
In this problem, you will implement the EM algorithm for learning maximum likelihood values of π and each (µj,Λj) for j = 1,…,K. The algorithm is given in the notes, and also in Section 9.2 of Bishop’s book.
a) Implement the EM-GMM algorithm and run it for 100 iterations on the data provided for K = 2, 4, 8, 10.
b) For each K, plot the log likelihood over the 100 iterations. What pattern do you observe, and why might this not be the best way to do model selection?
c) For the final iteration of each model, plot the data and indicate the most probable cluster of each observation according to q(ci) by a cluster-specific symbol. What do you notice about these plots as a function of K?
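The E- and M-steps can be sketched as below. This is a minimal NumPy implementation; initializing the means at randomly chosen data points and lightly regularizing the covariances for numerical stability are both assumptions, not requirements of the assignment:

```python
import numpy as np

def em_gmm(X, K, iters=100, seed=0):
    """Maximum likelihood EM for a GMM (precision parameterization).
    Returns pi, means, precisions, and the log likelihood per iteration."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, K, replace=False)]       # assumption: init means at random data points
    Lam = np.tile(np.eye(d), (K, 1, 1))           # precisions Lambda_j
    ll = []
    for _ in range(iters):
        # E-step: log pi_j + log N(x_i | mu_j, Lam_j^{-1})
        logp = np.empty((n, K))
        for j in range(K):
            diff = X - mu[j]
            _, logdet = np.linalg.slogdet(Lam[j])
            quad = np.einsum('ni,ij,nj->n', diff, Lam[j], diff)
            logp[:, j] = (np.log(pi[j]) + 0.5 * logdet
                          - 0.5 * d * np.log(2 * np.pi) - 0.5 * quad)
        m = logp.max(axis=1, keepdims=True)       # log-sum-exp for the log likelihood
        ll.append(float((m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))).sum()))
        q = np.exp(logp - m)
        q /= q.sum(axis=1, keepdims=True)         # responsibilities q(c_i = j)
        # M-step: weighted maximum likelihood updates
        nj = q.sum(axis=0)
        pi = nj / n
        for j in range(K):
            mu[j] = q[:, j] @ X / nj[j]
            diff = X - mu[j]
            cov = (q[:, j, None] * diff).T @ diff / nj[j]
            Lam[j] = np.linalg.inv(cov + 1e-6 * np.eye(d))  # assumption: small ridge for stability
    return pi, mu, Lam, ll
```

The recorded log likelihood should be non-decreasing across iterations, which is the pattern part b) asks about.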
Problem 2. (35 points)
In this problem, you will implement a variational inference algorithm for approximating the posterior distribution of the GMM variables. We therefore require prior distributions on these variables. For this problem, we use
π ∼ Dirichlet(α), µj ∼ Normal(0,cI), Λj ∼ Wishart(a,B)
For this problem, set α = 1, c = 10, a = d, and B defined in terms of A, where A is the empirical covariance of the data. Approximate the posterior distribution of these variables with q distributions factorized on π, and each µj, Λj and ci as discussed in class.
a) Implement the variational inference algorithm discussed in class and in the notes for K = 2, 4, 10, 25, running 100 iterations each.
b) For each K, plot the variational objective function over the 100 iterations. What pattern do you observe?
c) For the final iteration of each model, plot the data and indicate the most probable cluster of each observation according to q(ci) by a cluster-specific symbol. What do you notice about these plots as a function of K?
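The coordinate-ascent updates can be sketched as follows. Since the handout's definition of B is elided above, the sketch takes B0 as a parameter defaulting to the empirical covariance A (an assumption); the variational objective (ELBO) needed for part b) is omitted for brevity, and only the update equations are shown:

```python
import numpy as np
from scipy.special import digamma

def vi_gmm(X, K, iters=100, alpha0=1.0, c=10.0, B0=None, seed=0):
    """CAVI for the GMM with priors pi ~ Dir(alpha), mu_j ~ N(0, cI), Lam_j ~ Wishart(a, B)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    a0 = float(d)
    if B0 is None:
        B0 = np.cov(X.T)                      # assumption: B set from the empirical covariance A
    B0inv = np.linalg.inv(B0)
    # Variational parameters
    r = rng.dirichlet(np.ones(K), size=n)     # q(c_i): responsibilities
    m = X[rng.choice(n, K, replace=False)].copy()   # q(mu_j) means (init at data points)
    S = np.tile(c * np.eye(d), (K, 1, 1))     # q(mu_j) covariances
    a = np.full(K, a0)                        # q(Lam_j) degrees of freedom
    Binv = np.tile(B0inv, (K, 1, 1))          # q(Lam_j) inverse scale
    alpha = np.full(K, alpha0)
    for _ in range(iters):
        nj = r.sum(axis=0)
        alpha = alpha0 + nj                   # q(pi) = Dirichlet(alpha0 + n_j)
        ELam = np.array([a[j] * np.linalg.inv(Binv[j]) for j in range(K)])
        for j in range(K):                    # q(mu_j): precision (1/c)I + n_j E[Lam_j]
            prec = np.eye(d) / c + nj[j] * ELam[j]
            S[j] = np.linalg.inv(prec)
            m[j] = S[j] @ ELam[j] @ (r[:, j] @ X)
        for j in range(K):                    # q(Lam_j) = Wishart(a0 + n_j, B_j)
            diff = X - m[j]
            Sj = (r[:, j, None] * diff).T @ diff + nj[j] * S[j]
            a[j] = a0 + nj[j]
            Binv[j] = B0inv + Sj
        ELam = np.array([a[j] * np.linalg.inv(Binv[j]) for j in range(K)])
        Elnpi = digamma(alpha) - digamma(alpha.sum())
        logr = np.empty((n, K))
        for j in range(K):                    # q(c_i): uses E[ln pi_j], E[ln|Lam_j|], E[quad form]
            _, logdetBinv = np.linalg.slogdet(Binv[j])
            ElnLam = digamma((a[j] - np.arange(d)) / 2).sum() + d * np.log(2) - logdetBinv
            diff = X - m[j]
            quad = np.einsum('ni,ij,nj->n', diff, ELam[j], diff) + np.trace(ELam[j] @ S[j])
            logr[:, j] = Elnpi[j] + 0.5 * ElnLam - 0.5 * quad
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
    return r, alpha, m, S, a, Binv
```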
Problem 3. (35 points)
In this problem, you will implement a Bayesian nonparametric sampler for a marginalized version of the GMM. In contrast to Problem 2, in this problem we will use a joint prior on (µj,Λj). This is done for computational convenience in calculating the marginal distribution of the data.
Specifically, we use the prior distribution
µj | Λj ∼ Normal(m, (cΛj)^{-1}), Λj ∼ Wishart(a, B)
as well as the limit of the prior π ∼ Dirichlet(α/K,…,α/K) as K → ∞.
In this problem you will implement the marginal sampler where π is integrated out. For this problem, set m to be the empirical mean of the data, c = 1/10, a = d and B = c · d · A where A is the empirical covariance of the data. For the “cluster innovation parameter” set α = 1.
a) Implement the above-mentioned Gibbs sampling algorithm discussed in class and described in the notes. Run your algorithm on the data provided for 500 iterations.
b) Plot the number of observations per cluster as a function of iteration for the six most probable clusters. These should be shown as lines that never cross; for example, the ith value of the "second" line is the number of observations in the second-largest cluster after completing the ith iteration. If there are fewer than six clusters, set the remaining values to zero.
c) Plot the total number of clusters that contain data as a function of iteration.
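The collapsed (Chinese-restaurant-process) sampler can be sketched as below, using the standard Normal-Wishart posterior predictive, which is a multivariate Student-t. Initializing all points in a single cluster is an assumption, as is the compact relabeling step:

```python
import numpy as np
from scipy.special import gammaln

def log_predictive(x, data, m0, c0, a0, B0inv):
    """Log posterior-predictive density (multivariate Student-t) of x under the
    joint Normal-Wishart prior, conditioned on the rows of `data` (may be empty)."""
    d = x.shape[0]
    n = data.shape[0]
    if n == 0:
        cn, mn, an, Binv = c0, m0, a0, B0inv
    else:
        xbar = data.mean(axis=0)
        Ssq = (data - xbar).T @ (data - xbar)
        cn = c0 + n
        mn = (c0 * m0 + n * xbar) / cn
        an = a0 + n
        Binv = B0inv + Ssq + (c0 * n / cn) * np.outer(xbar - m0, xbar - m0)
    nu = an - d + 1                                  # Student-t degrees of freedom
    Sigma = (cn + 1) / (cn * nu) * Binv              # Student-t scale matrix
    diff = x - mn
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return (gammaln((nu + d) / 2) - gammaln(nu / 2)
            - 0.5 * d * np.log(nu * np.pi) - 0.5 * logdet
            - 0.5 * (nu + d) * np.log1p(quad / nu))

def crp_gibbs(X, iters=500, alpha=1.0, seed=0):
    """Marginal Gibbs sampler with pi and (mu_j, Lam_j) integrated out."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m0 = X.mean(axis=0)                   # empirical mean
    c0 = 1 / 10
    a0 = float(d)
    A = np.cov(X.T)
    B0inv = np.linalg.inv(c0 * d * A)     # B = c*d*A; Wishart inverse scale
    z = np.zeros(n, dtype=int)            # assumption: start with one cluster
    for _ in range(iters):
        for i in range(n):
            # Remove x_i and relabel the remaining clusters compactly
            mask = np.ones(n, bool)
            mask[i] = False
            labels, z_rest = np.unique(z[mask], return_inverse=True)
            z[mask] = z_rest
            Kcur = len(labels)
            logw = np.empty(Kcur + 1)
            for j in range(Kcur):         # existing cluster: weight n_j * predictive
                members = X[mask][z_rest == j]
                logw[j] = np.log(len(members)) + log_predictive(X[i], members, m0, c0, a0, B0inv)
            # New cluster: weight alpha * prior predictive
            logw[Kcur] = np.log(alpha) + log_predictive(X[i], X[:0], m0, c0, a0, B0inv)
            w = np.exp(logw - logw.max())
            w /= w.sum()
            z[i] = rng.choice(Kcur + 1, p=w)
    return z
```

Recording `np.sort(np.bincount(z))[::-1]` after each iteration gives the per-cluster counts needed for parts b) and c).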



