CS446: Machine Learning

Description

5/5 – (1 vote)

• This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. Note that there is an appendix with possibly useful formulae and computational shortcuts at the end.
• This exam booklet contains five problems, out of which you are expected to answer four problems of your choice.
• The exam ends at 10:45 AM. You have 75 minutes to earn a total of 100 points. You can earn 25 additional (bonus) points if you successfully attempt all five problems.
• If you choose to attempt all five problems, the four problems with the highest points will be considered for your final exam score and the lowest will be considered as bonus.
• Answer each question in the space provided. If you need more room, write on the reverse side of the paper and indicate that you have done so.
• Besides having the correct answer, being concise and clear is very important. For full credit, you must show your work and explain your answers.
Good Luck!
Name (NetID): (1 Point)
Short Questions /24
Support Vector Machines /25
Probability /25
Naive Bayes /25
Expectation Maximization /25
Total /100
Bonus /25
Short Questions [24 points]
(a) [10 points] We define a class Cr,k,n of r-of-k functions in the following way. Let X = {0,1}n. For a chosen set of k relevant variables and a given number r, an rof-k function f(x1,…,xn) is 1 if and only if at least r of the k relevant attributes are 1. We assume that 1 ≤ r ≤ k ≤ n.
1. [5 points] Phrase this problem as a problem of learning a Boolean disjunction over some feature space. Define the feature space and the learning problem.
2. [5 points] Assume you are learning this function using Winnow. What mistake bound do you obtain?
(b) [8 points] According to the CDC, Women who smoke are 21 times more likely to develop lung cancer compared to those who don’t smoke. Furthermore, CDC tells us that about 10% of the total women smoke. If you learn that a woman has been diagnosed with lung cancer, and you know nothing else about her, what is the probability that she is a smoker?
(c) [6 points] Fill in the blanks with options given below:
(a) δ (c) 1/δ (e) 1 − δ
(g) m (h) n (i) size(C) (j) size(H)
(k) number of examples (l) instance size (m) computation time
(n) linear (o) polynomial (p) exponential
A concept class C defined over the instance space X (with instances of length n) is PAC learnable by learner L using a hypothesis space H if
for all
for all distributions D over X, and fixed 1], given a sample of m examples sampled independently according to the distribution D, the learner L produces with a probability
{at least | at most | equal to} {one of (a) to (f)}
a hypothesis with error (ErrorD = PrD[f(x) 6= g(x)])
{at least | at most | equal to} {one of (a) to (f)}
where the is in
{one of (k) to (m)} {one of (n) to (p)}
, , ,and .
{four of (a) to (j)}
Support Vector Machines [25 points]
We are given the following set of training examples , where are integer-valued features, and y(i) are binary labels.
x1 x2 y
-2 -4 +
-2 0 +
0 2 +
2 2 –
2 -2 –
0 -4 –
Our objective is to learn a hyperplane w1x1 + w2x2 + b = 0 using the hard-SVM objective:
minimize subject to
Use the grid below to answer the following questions (you will place a few points and lines on this grid).

4
2
x2 0
−2
−4
−4 −2 0 2 4
x1
(a) [10 points] Finding the hard-SVM hyperplane:
1. [2 points] Place the training examples on the grid, and indicate the support vectors.
2. [3 points] Draw the hyperplane produced by the hard-SVM on the grid.
3. [5 points] Find the values of w1,w2,b ∈ R that optimize the hard-SVM objective.
(b) [10 points] Experimental evaluation:
1. [2 points] Provide the classification rule used to classify an example with features x1,x2, using the hyperplane produced by hard-SVM.
2. [2 points] What will the error of your classifier be on the training examples D (expressed as the fraction of training examples misclassified)?
3. [2 points] Draw on the grid, the hyperplane that will be produced by hardSVM when you use all training examples except a = (0,2,+). Using this hyperplane, will you classify a correctly?
4. [2 points] Draw on the grid, the hyperplane that will be produced by hardSVM when you use all training examples except b = (2,2,−). Using this hyperplane, will you classify b correctly?
5. [2 points] What will be the average error if you use 6-fold cross validation on the training set D?
(c) [5 points] Soft-SVM formulation:
1. [3 points] Write the soft-SVM objective below. Circle either min or max.
.
2. [2 points] For what range of positive C values, will the hyperplane produced by this soft-SVM objective be most similar to the hyperplane produced by hard-SVM. Circle one of the following.
very small moderate very large
Probability [25 points]
You are given the following sample S of data points in order to learn a model. This question will use this data.
Example A B C
1 1 1 0
2 0 1 1
3 1 0 0
4 0 0 0
5 1 1 0
6 0 0 0
7 1 0 1
8 0 1 1
9 1 1 0
10 0 0 0
11 1 1 1
12 0 0 0
(a) [3 points] What would be your estimate for the probability of the following data points, given the sample S, if you were not given any information on a model? (That is, you would estimate the probability directly from the data.)
1. P(A = 1,B = 1,C = 0)
2. P(A = 0,B = 1,C = 1)
3. P(A = 0,B = 0,C = 1)
(b) [10 points] Consider the following graphical model M over three variables A,B, and C.
A → B → C
1. [5 points] What are the parameters you need to estimate in order to completely define the model M? Circle these parameters from Table 1.
(1) P[A = 1] (5) P[B = 1] (9) P[C = 1]
(2) P[A = 1|B = b] b ∈ {0,1} (6) P[B = 1|C = c] c ∈ {0,1} (10) P[C = 1|A = a] a ∈ {0,1}
(3) P[A = 1|C = c] c ∈ {0,1} (7) P[B = 1|A = a] a ∈ {0,1} (11) P[C = 1|B = b] b ∈ {0,1}
(4) P[A = 1|B,C = b,c] b,c ∈ {0,1} (8) P[B = 1|A,C = a,c] a,c ∈ {0,1} (12) P[C = 1|A,B = a,b] a,b ∈ {0,1}
Table 1: Options to choose from to explain model M.
2. [5 points] Use the data to estimate the parameters you have circled in (b).1.

[6 points] Use the parameters chosen in (b).1 to write down expressions for the probabilities of the same data points according to model M and compute these probabilities using the estimated parameters.
1. PM(A = 1,B = 1,C = 0)
2. PM(A = 0,B = 1,C = 1)
3. PM(A = 0,B = 0,C = 1)
(d) [6 points] Use the parameters chosen in (b).1 to write down the expressions for the following probabilities for model M and compute these probabilities.
1. PM(B = 1)
2. PM(A = 1|B = 0)
3. PM(A = 0|B = 0,C = 0)
Naive Bayes [25 points]
You would like to study the effects of irrigation, fertilization and pesticide use with respect to the yield of a farm. Suppose you are provided with a collection D = {D1,…,Dm} of m data points corresponding to m different farms. Each farm has three binary attributes IsIrrigated (X1), IsFertilized (X2) and UsesPesticide (X3), and each has either a high yield (V = 1) or a low yield (V = 0). The label is Yield. A natural model for this is the multi-variate Bernoulli model.
Below is a table representing a specific collection S of data points for 8 farms to illustrate how a collection might look like.
# IsIrrigated (X1) IsFertilized (X2) UsesPesticide (X3) Yield (V )

(1) αi = Pr(Xi = 1) (7) β = Pr(V = 1)
(2) γi = Pr(Xi = 0) (8) ϕ = Pr(V = 0)
(3) pi = Pr(Xi = 1 | V = 1) (9) qi = Pr(V = 1 | Xi = 1)
(4) ri = Pr(Xi = 0 | V = 1) (10) si = Pr(V = 0 | Xi = 1)
(5) ti = Pr(Xi = 1 | V = 0) (11) ui = Pr(V = 1 | Xi = 0)
(6) wi = Pr(Xi = 0 | V = 0) (12) yi = Pr(V = 0 | Xi = 0)
(b) [3 points] How many independent parameters do you have to estimate to learn this model?
[5 points] Write explicitly the na¨ıve Bayes classifier for this model as a function of the model parameters selected in (a):
Pr(V = v | X1 = x1,X2 = x2,X3 = x3)
=
(d) [5 points] Write the expression for L, the log likelihood of the entire data set D, using the parameters that you have identified in (a).
(e) [6 points] We would like to train a Na¨ıve Bayes classifier on S to help us predict the yield on a new farm S9.
1. [3 points] What is the decision rule for the Na¨ıve Bayes classifier trained on S? vNB =
2. [3 points] Predict the yield for the following farm using the decision rule written earlier.
# IsIrrigated (X1) IsFertilized (X2) UsesPesticide (X3) Yield (V )
9 Yes (1) Yes (1) Yes (1) ?
Expectation Maximization [25 points]
Assume that a set of 3-dimensional points (x,y,z) is generated according to the following probabilistic generative model over Boolean variables X,Y,Z ∈ {0,1}:
Y ← X → Z
(a) [4 points] What parameters from Table 2 will you need to estimate in order to completely define the model?
(1) P(X=1) (2) P(Y=1) (3) P(Z=1)
(4) P(X|Y=b) (5) P(X|Z=b) (6) P(Y|X=b) (7) P(Y|Z=b)
(8) P(Z|X=b) (9) P(Z|Y=b) (10) P(X|Y=b,Z=c) (11) 3
Table 2: Options to choose from. b,c ∈ {0,1}.
(b) [4 points] You are given a sample of m data points sampled independently at random. However, when the observations are given to you, the value of X is always omitted. Hence, you get to see {(y1,z1),…,(ym,zm)}. In order to estimate the parameters you identified in part (a), in the course of this question you will derive update rules for them via the EM algorithm for the given model.
Express Pr(yj,zj) for an observed sample (yj,zj) in terms of the unknown parameters.
[4 points] Let pji = Pr(X =i | yj,zj) be the probability that hidden variable X has the value i ∈ {0,1} for an observation (yj,zj),j ∈ {1,…,m}. Express pji in terms of the unknown parameters.
(d) [4 points] Let (xj,yj,zj) represent the completed jth example, j ∈ {1,…,m}. Derive an expression for the expected log likelihood (LL) of the completed data set , given the parameters in (a).
(e) [9 points] Maximize LL, and determine update rules for any two unknown pa-

rameters of your choice (from those you identified in part (a)).

• P(A,B) = P(A|B)P(B)
), where k is number of values

Reviews

There are no reviews yet.

Be the first to review “CS446: Machine Learning”

Description

Reviews

Related products

CS446: Machine Learning

CS446 – Machine Problem 9 Solved

CS446 – Machine Problem 12 Solved

CS 446: Machine Learning Homework Solved

CS 446: Machine Learning Homework Solved