CS446: Machine Learning

• This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam.
• This exam booklet contains four problems. You need to solve all problems to get 100%.
• Please make sure that your exam booklet contains 20 pages.
• The exam ends at 1:45 PM. You have 75 minutes to earn a total of 100 points.
• Answer each question in the space provided. If you need more room, write on the reverse side of the paper and indicate that you have done so.
• Besides having the correct answer, being concise and clear is very important. For full credit, you must show your work and explain your answers.
• A list of potentially useful functions has been included in the appendix at the back.
Good Luck!
Name (NetID): (1 Point)
Naïve Bayes /25
Expectation Maximization /25
Multiclass Classification and Graphical Models /25
Short Questions /24
Total /100
Naïve Bayes [25 points]
For every deal, we have two attributes: number of views (X1), and time taken to receive 100 views (X2).
We assume that the number of views (X1) is related to each category (A, B) via a geometric distribution with a category-specific parameter (θA, θB resp.) and that the time taken to receive 100 views (X2) is related to each category (A, B) via an exponential distribution with a category-specific parameter (λA, λB resp.).
Also, (γA,γB) are our prior beliefs for each of the categories (A, B), resp. The summary of the model assumptions is given below:
Pr[Y = y] = γy   ∀ y ∈ {A, B}
Pr[X1 = x1 | Y = y] = θy (1 − θy)^(x1 − 1)   ∀ y ∈ {A, B}
Pr[X2 = x2 | Y = y] = λy e^(−λy x2)   ∀ y ∈ {A, B}
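To make the model assumptions concrete, here is a minimal Python sketch of the class-conditional probabilities and the naïve Bayes joint; the function and variable names are illustrative and not part of the exam:

```python
import math

def p_x1_given_y(x1, theta_y):
    """Geometric pmf: Pr[X1 = x1 | Y = y] = theta_y * (1 - theta_y)^(x1 - 1)."""
    return theta_y * (1.0 - theta_y) ** (x1 - 1)

def p_x2_given_y(x2, lambda_y):
    """Exponential density: Pr[X2 = x2 | Y = y] = lambda_y * exp(-lambda_y * x2)."""
    return lambda_y * math.exp(-lambda_y * x2)

def joint(x1, x2, gamma_y, theta_y, lambda_y):
    """Naive Bayes joint: Pr[Y = y] * Pr[X1 = x1 | y] * Pr[X2 = x2 | y]."""
    return gamma_y * p_x1_given_y(x1, theta_y) * p_x2_given_y(x2, lambda_y)
```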
(a) [15 points] Assume DA to be the set of training instances with label A, and DB to be the set of training instances with label B.
i. (5 points) Under the given naïve Bayes assumption, and using the notation x1(i) and x2(i) to represent the values of X1 and X2 respectively for the ith training instance, write down the expression for the log likelihood (LL) of the dataset.
LL = LL(γA) + LL(γB) + LL(θA) + LL(θB) + LL(λA) + LL(λB)
where,
LL(γA) = Σ_{i∈DA} log γA,   LL(γB) = Σ_{i∈DB} log γB
LL(θy) = Σ_{i∈Dy} [ log θy + (x1(i) − 1) log(1 − θy) ],   y ∈ {A, B}
LL(λy) = Σ_{i∈Dy} [ log λy − λy x2(i) ],   y ∈ {A, B}
ii. (5 points) Now, assume the following notation:
|DA| = nA,   |DB| = nB
Σ_{i∈DA} x1(i) = fA,   Σ_{i∈DB} x1(i) = fB
Σ_{i∈DA} x2(i) = gA,   Σ_{i∈DB} x2(i) = gB
Using this notation, derive the expressions for the MLE estimates of the parameters of your model.
• θA, θB:
Setting the derivatives of LL(θA) and LL(θB) to zero yields the following maximum likelihood estimates:
θA = nA / fA,   θB = nB / fB
• λA, λB:
Setting the derivatives of LL(λA) and LL(λB) to zero yields the following maximum likelihood estimates:
λA = nA / gA,   λB = nB / gB
• γA, γB:
The estimates of γA and γB can be calculated directly by counting the number of instances with each of the labels:
γA = nA / (nA + nB),   γB = nB / (nA + nB)
iii. (5 points) Assume that the given data in Table 1 is generated by a naïve Bayes model. Use this data and your MLE expressions obtained above to compute the prior probabilities (γA, γB) and parameter values (θA, θB, λA, λB).
That is, fill out Table 2. (Keep the solutions as fractions.)
X1 X2 Y
2 12 A
4 5 A
3 7 A
12 11 B
1 1 B
7 8 B
12 4 B
Table 1: Dataset for the naïve Bayes model
γA = 3/7 γB = 4/7
θA = 1/3 θB = 1/8
λA = 1/8 λB = 1/6
Table 2: Parameters for naïve Bayes
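As a sanity check, the entries of Table 2 can be reproduced with a few lines of Python; the variable names mirror the notation of part ii and are otherwise illustrative:

```python
from fractions import Fraction

# Table 1: (x1, x2, label)
data = [(2, 12, 'A'), (4, 5, 'A'), (3, 7, 'A'),
        (12, 11, 'B'), (1, 1, 'B'), (7, 8, 'B'), (12, 4, 'B')]

n = {y: sum(1 for *_, lbl in data if lbl == y) for y in 'AB'}      # n_A, n_B
f = {y: sum(x1 for x1, _, lbl in data if lbl == y) for y in 'AB'}  # f_A, f_B
g = {y: sum(x2 for _, x2, lbl in data if lbl == y) for y in 'AB'}  # g_A, g_B

gamma = {y: Fraction(n[y], n['A'] + n['B']) for y in 'AB'}   # 3/7, 4/7
theta = {y: Fraction(n[y], f[y]) for y in 'AB'}              # 1/3, 1/8
lam   = {y: Fraction(n[y], g[y]) for y in 'AB'}              # 1/8, 1/6

print(gamma, theta, lam)
```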
(b) [5 points] Derive an algebraic expression for the naïve Bayes predictor for Y in terms of the parameters of the model. That is, predict Y = A iff α > 1, where
α = [ Pr(Y = A) Pr(X1, X2 | Y = A) ] / [ Pr(Y = B) Pr(X1, X2 | Y = B) ]
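Substituting the model's class-conditional distributions gives the expanded form (a worked expression, written in the notation above):
α = [ γA θA (1 − θA)^(x1 − 1) λA e^(−λA x2) ] / [ γB θB (1 − θB)^(x1 − 1) λB e^(−λB x2) ],
and we predict Y = A iff α > 1.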

(c) [3 points] Based on the parameter values from Table 2, compute
Pr(Y = A | X1 = 2, X2 = 16) / Pr(Y = B | X1 = 2, X2 = 16)
(Use 16 ≈ 24 ln(2) to simplify your calculations.)
Pr(Y = A | X1 = 2, X2 = 16) / Pr(Y = B | X1 = 2, X2 = 16)
= [ Pr(Y = A) Pr(X1 = 2, X2 = 16 | Y = A) ] / [ Pr(Y = B) Pr(X1 = 2, X2 = 16 | Y = B) ]
= [ (3/7) · (1/3)(2/3) · (1/8) e^(−16/8) ] / [ (4/7) · (1/8)(7/8) · (1/6) e^(−16/6) ]
With 16 ≈ 24 ln(2), we have e^(−16/8) ≈ 2^(−3) and e^(−16/6) ≈ 2^(−4). Solving the above, we get
= 16/7
(d) [2 points] What will the classifier predict as the value of Y, given the above data point, i.e., X1 = 2, X2 = 16?
Y = A
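A quick numerical check of parts (c) and (d) under the Table 2 parameters; the names are illustrative, and the exact fraction 16/7 arises only under the approximation 16 ≈ 24 ln 2:

```python
import math
from fractions import Fraction

gamma = {'A': Fraction(3, 7), 'B': Fraction(4, 7)}
theta = {'A': Fraction(1, 3), 'B': Fraction(1, 8)}
lam   = {'A': Fraction(1, 8), 'B': Fraction(1, 6)}

def joint(y, x1, x2):
    # gamma_y * theta_y (1 - theta_y)^(x1 - 1) * lambda_y e^(-lambda_y x2)
    return float(gamma[y] * theta[y] * (1 - theta[y]) ** (x1 - 1) * lam[y]) \
           * math.exp(-float(lam[y]) * x2)

x1, x2 = 2, 16
ratio = joint('A', x1, x2) / joint('B', x1, x2)
print(ratio)                      # about 2.23; the exam's 16/7 (about 2.29) uses 16 = 24 ln 2
print('A' if ratio > 1 else 'B')  # -> A
```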
Expectation Maximization [25 points]
Consider the following generative probabilistic model:
W → X ← Z.
over the Boolean variables W, X, Z, where the data is generated according to:
• The variable W is set to 1 with probability α, and 0 with probability 1 − α.
• The variable Z is set to 1 with probability β, and 0 with probability 1 − β.
• If (W,Z) = (1,1) then X = 1 with probability λ11
If (W,Z) = (0,1) then X = 1 with probability λ01
If (W,Z) = (1,0) then X = 1 with probability λ10
If (W,Z) = (0,0) then X = 1 with probability λ00
This information is summarized below.
P(W = 1) = α
P(Z = 1) = β
P(X = 1|W = 1,Z = 1) = λ11
P(X = 1|W = 0,Z = 1) = λ01
P(X = 1|W = 1,Z = 0) = λ10
P(X = 1|W = 0,Z = 0) = λ00
You need to estimate the parameters of this model. However, one of the variables, Z, is not observed. You are given a sample of m data points {(w(j), x(j))}, j = 1, ..., m, in which the values of Z are missing.
In order to estimate the parameters of the model, α, β, λ11, λ01, λ10, λ00, you will derive update rules for them via the EM algorithm.
(a) (3 points) Choose the correct expression for P(w(j), x(j)) in terms of the unknown parameters α, β, λ11, λ01, λ10, λ00. (Circle one of the four options given below.)
i. P(w(j), x(j)) = (1 − β) [α λ11^x(j) (1 − λ11)^(1−x(j))]^w(j) [(1 − α) λ01^x(j) (1 − λ01)^(1−x(j))]^(1−w(j))
+ β [α λ10^x(j) (1 − λ10)^(1−x(j))]^w(j) [(1 − α) λ00^x(j) (1 − λ00)^(1−x(j))]^(1−w(j))
ii. P(w(j), x(j)) = β [α λ11^x(j)]^w(j) [(1 − α) λ01^x(j)]^(1−w(j))
+ (1 − β) [α λ10^x(j)]^w(j) [(1 − α) λ00^x(j)]^(1−w(j))
iii. P(w(j), x(j)) = β [α λ11^x(j) (1 − λ11)^(1−x(j))]^w(j) [(1 − α) λ01^x(j) (1 − λ01)^(1−x(j))]^(1−w(j))
+ (1 − β) [α λ10^x(j) (1 − λ10)^(1−x(j))]^w(j) [(1 − α) λ00^x(j) (1 − λ00)^(1−x(j))]^(1−w(j))
iv.
Ans: iii
(b) (4 points) Let f_z(j) = P(Z = z | w(j), x(j)), the probability that the hidden variable Z has value z for the jth example. Choose the correct expression for f_z(j) in terms of the unknown parameters α, β, λ11, λ01, λ10, λ00. (Circle one of the four options given below.)
i.
ii.
iii.
iv.
Ans: i
(c) (10 points) Choose the correct expression for the expected log likelihood (LL) of the entire dataset, given the new parameter estimates α, β, λ11, λ01, λ10, λ00. (Circle one of the four options given below.)
i.
ii.
iii.
iv.
Ans: ii
(d) (8 points) Maximize the LL and select the correct update rule for β according to the EM algorithm. (Circle one of the four options given below.)
i.
ii.
iii.
iv. β = (1/m) Σ_{j=1}^{m} f_1(j)
Ans: iv
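To make the E- and M-steps concrete, here is a minimal sketch of one EM iteration for this model. The data layout and the name f[j][z] for P(Z = z | w(j), x(j)) are illustrative assumptions, not part of the exam:

```python
def em_step(data, alpha, beta, lam):
    """One EM iteration for the W -> X <- Z model with Z hidden.

    data: list of (w, x) pairs with w, x in {0, 1}
    lam:  dict mapping (w, z) -> P(X = 1 | W = w, Z = z)
    """
    m = len(data)

    def p_x(x, w, z):
        p1 = lam[(w, z)]
        return p1 if x == 1 else 1.0 - p1

    def p_w(w):
        return alpha if w == 1 else 1.0 - alpha

    def p_z(z):
        return beta if z == 1 else 1.0 - beta

    # E-step: f[j][z] = P(Z = z | w(j), x(j)) under the current parameters.
    f = []
    for (w, x) in data:
        joint = {z: p_w(w) * p_z(z) * p_x(x, w, z) for z in (0, 1)}
        total = joint[0] + joint[1]
        f.append({z: joint[z] / total for z in (0, 1)})

    # M-step: maximize the expected log likelihood.
    alpha_new = sum(w for (w, _) in data) / m          # W is observed
    beta_new = sum(fj[1] for fj in f) / m              # matches option iv above
    lam_new = {}
    for (w, z) in [(1, 1), (0, 1), (1, 0), (0, 0)]:
        denom = sum(fj[z] for (wj, _), fj in zip(data, f) if wj == w)
        numer = sum(fj[z] for (wj, xj), fj in zip(data, f) if wj == w and xj == 1)
        lam_new[(w, z)] = numer / denom if denom > 0 else lam[(w, z)]

    return alpha_new, beta_new, lam_new
```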
Multiclass Classification and Graphical Models [25 points]
(a) (13 points) In this part we consider a discriminative learning approach.
i. (1 point) Learning using a discriminative approach can be viewed as estimating the conditional probability Pr(y | x) directly.
ii. (10 points) You are now tasked with choosing a discriminative model for this problem. Given your machine learning expertise, you have already narrowed down your modeling choices to:
• one versus all (OvA),
• all versus all (AvA),
• a minimal size error correcting output code (ECOC), and
• a multiclass SVM (MSVM).
Furthermore, you plan on using linear classifiers of the form h(x) = 1[w · x + θ ≥ 0] for every binary classification problem that arises from these models. The goal of this question is to determine the number of parameters required to represent each model's hypothesis.
Note that the number of parameters is the number of real-valued variables whose values you are choosing during the learning process; for example, a single linear classifier of the form mentioned before has 6 parameters, consisting of each of the five dimensions of w as well as θ.
Question: What is the total number of parameters required to represent each of these four hypotheses for this problem? In each case, explain how you derive your results.
• one versus all (OvA):
There are 4 binary classification problems (one for each class), each of which requires 6 parameters. Thus, the total number of parameters here is 24.
• all versus all (AvA):
There are C(4, 2) = 6 binary classification problems (one for each pair of classes), each of which requires 6 parameters. Thus, the total number of parameters here is 36.
• a minimal size error correcting output code (ECOC) (that is, use the smallest number of hypotheses needed for ECOC in this case):
Four classes can be represented using a two-bit binary code; this means there will be two binary classification problems, each of which requires 6 parameters. Thus, the total number of parameters here is 12.
• multiclass SVM (MSVM)
Representing MSVM requires 4 vectors (one per class) of length 6 (for the features/bias). Thus, the total number of parameters here is 24.
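The counts above can be reproduced mechanically; the sketch below assumes 4 classes and 5-dimensional feature vectors, as implied by the solutions:

```python
from math import comb, ceil, log2

k, d = 4, 5              # number of classes and feature dimension (assumed from the problem)
per_clf = d + 1          # w has d components plus the threshold theta

print("OvA :", k * per_clf)              # 4 binary problems            -> 24
print("AvA :", comb(k, 2) * per_clf)     # C(4, 2) = 6 binary problems  -> 36
print("ECOC:", ceil(log2(k)) * per_clf)  # minimal 2-bit code           -> 12
print("MSVM:", k * per_clf)              # one weight vector per class  -> 24
```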
iii. (2 points) Suppose you decide to use the minimal ECOC model. Briefly discuss any potential issues with using this model to solve the classification problem.
(b) (12 points) In this part we will consider a generative approach.
i. (1 point) Learning using a generative approach can be viewed as estimating the joint probability Pr(x, y), i.e., Pr(x | y) together with the prior Pr(y).
ii. (4 points) We model the problem using a Bayesian network. After some thought, you narrow down the candidate graphs to the following two choices:
Figure 1: Option A
Figure 2: Option B
Question: Write down the factored joint probability distribution represented by each model.
• Model A:
p(y,x1,x2,x3,x4,x5) = p(y)p(x1|y)p(x2|y)p(x3|y)p(x4|y)p(x5|y)
• Model B:
p(y,x1,x2,x3,x4,x5) = p(y)p(x1|y)p(x2|y)p(x3|y)p(x4|x1,x2)p(x5|x2,x3)
iii. (4 points) We are interested in figuring out the number of parameters needed to represent each of the models in Figures 1 and 2 above. Note that in the context of graphical models, a parameter is a probability value for a given variable assignment (e.g. Pr(X1 = 0,X2 = 1|X3 = 1) is a single parameter). Compute the minimum number of parameters required to represent each model. Explain as needed.
• Model A: p(y) requires 3 parameters (one for each label, except the last, which can be computed from the other three). Each feature requires 4 parameters (one for each possible label y, since the other probability value can be computed from the first). Thus, a total of 3 + 5 × 4 = 23 parameters are required.
• Model B: p(y) requires 3 parameters (same reason as before). x1, x2, and x3 require 4 parameters each (same reason as before). x4 and x5 require 4 parameters each (since there are four possible assignments to their parents). Thus, a total of 3 + 5 × 4 = 23 parameters are required here as well.
iv. (3 points) After staring at your data for a few hours, you realize that the features x4 and x5 are not conditionally independent given the label y. Given this piece of information, which of the two Bayesian networks is a better choice for this problem? Explain your answer.
Option B is better: Option A encodes that x4 and x5 are conditionally independent given the label, which the data implies is not true. Option B does not encode this assumption; therefore (given no further information), it can better represent the data.
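The same counting can be expressed generically: each conditional probability table contributes (number of values − 1) free parameters per assignment of its parents. A short sketch, assuming binary features and a 4-valued label as in the solutions:

```python
from functools import reduce
from operator import mul

def cpt_params(card, parent_cards=()):
    """Free parameters in one CPT: (card - 1) * product of parent cardinalities."""
    return (card - 1) * reduce(mul, parent_cards, 1)

K = 4  # label values; features are assumed binary, consistent with the solution

model_a = cpt_params(K) + 5 * cpt_params(2, (K,))
model_b = (cpt_params(K) + 3 * cpt_params(2, (K,))   # y, and x1..x3 given y
           + cpt_params(2, (2, 2))                    # x4 | x1, x2
           + cpt_params(2, (2, 2)))                   # x5 | x2, x3
print(model_a, model_b)  # 23, 23
```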
Short Answer Questions [24 points]
(a) (8 points) For the purpose of this question, consider the AdaBoost algorithm. Let Dt be the probability distribution in the tth round of AdaBoost, ht be the weak learning hypothesis learned in the tth round, and εt its error. Now, fill in the blanks to complete the algorithm:
D1(i) = 1/m
Given Dt and ht:
Dt+1(i) = ___ if yi ≠ ht(xi)
Dt+1(i) = ___ if yi = ht(xi)
where ___ and where ___
Options:
a.) ___   b.) ___   c.) ___   d.) ___
e.) Σi Dt(i) exp(αt yi ht(xi))
f.) Σi Dt(i) exp(−αt yi ht(xi))
g.) Σt Dt(i) exp(−αt yi ht(xi))
h.) 1/2 ___   i.) 1/2 ___
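For reference, here is a minimal sketch of one round of the standard AdaBoost reweighting, with αt = (1/2) ln((1 − εt)/εt); the variable names are illustrative and this is not tied to any particular option above:

```python
import numpy as np

def adaboost_round(D, y, h_x, eps):
    """One AdaBoost reweighting step (standard form).

    D:    current distribution over the m examples (sums to 1)
    y:    labels in {-1, +1}
    h_x:  weak-learner predictions h_t(x_i) in {-1, +1}
    eps:  weighted error of h_t under D
    """
    alpha = 0.5 * np.log((1.0 - eps) / eps)
    # Misclassified points (y_i != h_t(x_i)) get factor exp(+alpha), correct ones exp(-alpha).
    D_new = D * np.exp(-alpha * y * h_x)
    Z = D_new.sum()                 # normalizer: sum_i D_t(i) exp(-alpha_t y_i h_t(x_i))
    return D_new / Z, alpha
```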
(b) (8 points) Given the instance space X = R2, consider the hypothesis class
H = { h(x1, x2) = 1[(x1 − a)^2 + (x2 − b)^2 ≤ r^2] : a, b ∈ R, r ∈ R+ }
That is, each h ∈ H is a circle with radius r and center (a,b) whose interior is labeled as positive and whose exterior is labeled as negative.
Is H PAC learnable? Explain your answer. (It is sufficient to explain the structure of the argument, without getting to all the technical details.)
Yes. The hypothesis class is PAC learnable because its VC dimension is finite: circles (disks) in the plane can shatter at most 3 points, so VC(H) = 3, and any hypothesis class with finite VC dimension is PAC learnable.
(c) (8 points) [Support Vector Machine]
Recall the objective function for soft SVM:
min over w, θ, ξ of  (1/2) ||w||^2 + C Σ_{i=1}^{m} ξi            (1)
s.t.  yi (w · xi + θ) ≥ 1 − ξi,  ξi ≥ 0,  ∀ i = 1, ..., m        (2)
where m is the number of examples.
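For a fixed (w, θ), the optimal slack ξi equals the hinge loss max(0, 1 − yi(w · xi + θ)), so the objective can be evaluated directly; a minimal sketch with illustrative names:

```python
import numpy as np

def soft_svm_objective(w, theta, X, y, C):
    """0.5 * ||w||^2 + C * sum_i xi_i, with xi_i = max(0, 1 - y_i (w . x_i + theta))."""
    margins = y * (X @ w + theta)
    xi = np.maximum(0.0, 1.0 - margins)   # optimal slack for fixed (w, theta)
    return 0.5 * np.dot(w, w) + C * xi.sum()
```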
i. State whether the following statements about the SVM formulation above are correct. In each case, use one sentence to explain your answer (no need for a mathematical derivation or a proof).
A. When using the value of C = 0, we obtain the Hard-SVM objective.

Correct/Incorrect
Reason: Incorrect. Choosing C = 0 removes the penalty on the slack variables, so the objective is minimized by the trivial solution w = 0; this is not the Hard-SVM objective.
B. Choosing higher values of C leads to over-fitting the training data.

Correct/Incorrect
Reason: Correct. A higher value of C puts more weight on not making mistakes on the training examples, which leads to over-fitting; equivalently, the regularization term receives less emphasis in the objective function.
C. The slack variable ξi for a data point xi always takes the value 0 if the data point is correctly classified by the hyper-plane.

Correct/Incorrect
Reason: Incorrect. Data points that are classified correctly but lie within the margin also have non-zero slack values.
D. The optimal weight vector w can be calculated as a linear combination of the training data points. [You need not prove this.]

Correct/Incorrect
Reason:
Correct. We can use the dual representation, in which the weight vector is w = Σi αi yi xi, where i ranges over the training data points.
ii. Circles dataset. Consider the following data set.

All data points inside a circle of some radius are marked as positive (+1) and points outside the circle are marked as negative (-1).
A. Given the data set above that is separable by a circle, explain how Hard-SVM can be used to learn a valid separator in this case.
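One standard approach (a sketch, not necessarily the intended solution): map each point to a feature space in which a circle becomes a linear separator, e.g. φ(x1, x2) = (x1, x2, x1^2 + x2^2), and run Hard-SVM in that space. The snippet below approximates Hard-SVM with a linear SVM and a very large C; the dataset and names are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 <= 1.0, 1, -1)   # positive inside the unit circle

# Feature map phi(x1, x2) = (x1, x2, x1^2 + x2^2): the circle becomes a hyperplane.
Phi = np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])

clf = SVC(kernel='linear', C=1e6)   # very large C approximates Hard-SVM on separable data
clf.fit(Phi, y)
print(clf.score(Phi, y))            # expected 1.0: the data are separable in this feature space
```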

B. Which of the figures above is more likely to be the separator that would be learned by the Hard-SVM formulation? Justify your choice briefly.
Separator (B) is more likely for Hard-SVM because SVM is a max-margin classifier.

Appendix: Potentially Useful Functions
(c) P(A, B) = P(A|B) P(B)
(d) Let p define the probability distribution of a discrete random variable X; then:
Ep[f(X)] = Σx p(X = x) f(x)
