100% Guaranteed Results


CS446 – •
$ 20.99
Category:

Description

5/5 – (1 vote)

• This exam booklet contains four problems. You need to solve all problems to get 100%.
• The exam ends at 1:45 PM. You have 75 minutes to earn a total of 100 points.
• Answer each question in the space provided. If you need more room, write on the reverse side of the paper and indicate that you have done so.
• Besides having the correct answer, being concise and clear is very important. For full credit, you must show your work and explain your answers.
Good Luck!
Name (NetID): (1 Point)
Short Questions /29
Kernels /25
Online Learning /25
Decision Trees /20
Total /100
Short Questions [29 points]
(a) (5 points) In the following definition of PAC learning we left a few blank fields. Fill in the blanks by choosing, for each empty field, one of the options given below. Note that under each line defining a blank we provided a small set of options for you to choose from.
(a) δ (e) 1 − δ
(g) m (h) n(j) size(H)
(k) number of examples (l) instance size (m) computation time
(n) linear (o) polynomial (p) exponential
(s) 1 − γ
A concept class C defined over the instance space X (with instances of length n) is strongly PAC learnable by learner L using a hypothesis space H if for all concepts f ∈ C, for all distributions D on X, and for all fixed 1], given a sample of m examples sampled independently according to the distribution D, the learner L produces with a probability

(b) (8 points) Consider the following algorithm B for functions in C. B has the property that given any polynomial size sample of m labeled examples {(x1,c(x1)),…(xm,c(xm))} according to some c ∈ C, B outputs a hypothesis h ∈ H that is incorrect on at most one of the m examples.
Then, the concept class C .
{PAC learnable | not PAC learnable} Justify your answer.
(c) (8 points)
In this question we consider the Winnow algorithm.
(1) In class we proved that Winnow can learn k-monotone disjunctions over a domain of n variables. State the upper bound on the number of examples Winnow will make.
(2) We consider a learning problem of the domain X = {1,2,…,N}d. A ddimensional hyper-rectangle over this domain X is a subset c ⊆ X defined by 2d values: 1 ≤ ai ≤ bi ≤ N, for i = 1,…,d.
The subset is
c = {(x1,…,xd) ∈ X : ai ≤ xi ≤ bi∀i = 1,…,d}.
Let RECT denote the class of all such d-dimensional hyper-rectangles over X.
i. Show that you can use Winnow to learn the concept class RECT. Justify your answer.
ii. What is the mistake bound of your algorithm? Explain.
(d) (8 points) We define a set of concepts
H = {sgn(ax2 + bx + c);a,b,c,∈ R},
where sgn(z) is 1 when z is positive, and 0 otherwise. What is the VC dimension of H? Prove your claim.
Kernels [25 points]
In this question we will define kernels, study some of their properties and develop one specific kerenl.
(a) (1 point) Choose one of the options below:
A function K(x,z) is a valid kernel if it corresponds to an
in some feature space, between feature rep-
{”inner(dot) product” | ”sum”}
resentations that correspond to x and z.
(b) (12 points) In the next few questions we guide you to prove the following property of kernels:
Linear Combination Property: if ∀i, ki(x,x0) are valid kernels, and ci > 0 are constants, then k(x,x0) = Pi ciki(x,x0) is a valid kernel.
i. Given a valid kernel k1(x,x0) and a constant c > 0, use the definition in(a) to show that k(x,x0) = ck1(x,x0) is also a valid kernel. ii. Given valid kernels k1(x,x0) and k2(x,x0), use the definition in (a) to show that k(x,x0) = k1(x,x0) + k2(x,x0) is also a valid kernel.
iii. Conclude that the Linear Combination Property holds.
(c) (12 points) In order to learn functions over sets, we represent sets as feature vectors in an n dimensional space, and define a kernel in this space. Our instances are all subsets of a set S, of size |S| = m. The n dimensional feature representation of A ⊆ S is: ϕ(A) = (φU1(A),φU2(A),…,φUn(A)),
where U1,U2,…,Un are all the subsets of S, and the coordinates are defined using the following feature mapping function:
, if U ⊆ A
, otherwise.
That is, ϕ(A) ∈ {0,1}n, where n = 2m.
Let A,B be subsets of S. We define the kernel
K(A,B) = ϕ(A)>ϕ(B).
Show that K(A,B) can be computed in time that is polynomial in m.
On-Line Learning [25 points]
In this problem we will consider a few variations of the Perceptron Learning algorithm and their properties. Recall the perceptron algorithm,
1: for do
2: if y(w|x) ≤ 0 then
3: w0 ← w + ηyx
4: w ← w0
5: end if
6: end for
7: return w

(a) (10 points) Suggest a variation of the preceptron update rule (line 3) which has the following property:
If the algorithm sees two consecutive occurrences of the same example, it will never make a mistake on the second occurrence.
(Hint: determine an appropriate learning rate that guarantees this property).
Your work will result in one of the options below.
Which one?
{one of (a) to (d)}
| | | |
Prove that your proposed variation satisfies the property stated above (that is, show how you derive the learning rate you chose).
(b) (5 points) Consider the following variant of perceptron: Instead of returning w in line 7, return .
Choose one of the following options and justify your answer: The decision boundary of the variant and the standard perceptron are because:
{same | different}
(c) (5 points) Consider the following variant of perceptron: Following each mistake, instead of the usual update in line 3, perform the following update,
.
Choose one of the following options and justify your answer: The decision boundary of the variant and the standard perceptron are because:
{same | different}
(d) (5 points) Derive the SGD update for the SVM algorithm, which has the following loss:
M kwk2 + Xmax(0,1 − yi(w|xi))
i=1
(disregard the single point where the objective function is not differentiable. )
Decision Trees [20 points]
(a) (10 points) Let x be a vector of n Boolean variables and let k ≤ n be an integer. We define the class Pk of k-parity functions over the n variables. A k-parity function fS is defined as follows: a set S ⊆ {x1,…,xn} is chosen such that |S| = k. Let x = (x1,x2,…xn) ∈ {0,1}n. Then fS(x) = 1 iff an odd number of variables in S are set to 1 in x.
Example: Let n = 3 and consider functions in P2. Let S = {x1,x2}. Then fS(100) = fS(101) = fS(010) = fS(011) = 1, and fS(000) = fS(001) = fS(110) = fS(111) = 0.
i. Fix S ⊆ {x1,…,xn}, and assume that you are using ID3 to learn a decision tree from data that is consistent with fS. Consider two variables, x1,x2 such that x1 ∈ S and x2 ∈6 S. Which of these variables is more likely to be the root of the decision tree for fS?
{x1 ; x2} Justify your answer: ii. Consider fS ∈ Pk. State the depth of the smallest possible consistent decision tree for fS in terms of n and k. Describe the shape of the decision tree for fS.
Justify your answer.
(b) (10 points) Assume that you are using an implementation of ID3 that takes an upper bound on the depth of the output decision tree as a parameter. You will use this implementation of ID3 to learn from a dataset Dtrain, compute the empirical error, and then evaluate the learned tree on a test set Dtest. You will learn two trees, Tk, learned with bound k on the depth, and Tm, learned with a depth bound m.
Assume k < m. (But note that we say nothing about the size of k; it could be a very small number or a very large number).
i. Which tree, Tk or Tm, is likely to have larger empirical error (that is, error
on Dtrain)?
{Tk ; Tm ; impossible to tell} Justify your answer: ii. Which tree, Tk or Tm, is likely to have larger error when tested on Dtest?

{Tk ; Tm ; impossible to tell} Justify your answer:
This page was intentionally left blank.

Reviews

There are no reviews yet.

Be the first to review “CS446 – •”

Your email address will not be published. Required fields are marked *

Related products