Homework 1: Mathematical Fundamentals, Ridge
Regression, Gradient Descent, and SGD
1 Introduction
In this homework you will first solve some probability and linear algebra questions and then you will implement ridge regression using gradient descent and stochastic gradient descent. We’ve provided a lot of support Python code to get you started on the right track. References below to particular functions that you should modify are referring to the support code, which you can download from the website. If you have time after completing the assignment, you might pursue some of the following:
• Study up on numpy’s broadcasting to see if you can simplify and/or speed up your code.
• Think about how you could make the code more modular so that you could easily try different loss functions and step size methods.
• Experiment with more sophisticated approaches to setting the step sizes for SGD (e.g. try out the recommendations in “Bottou’s SGD Tricks” on the website).
• Instead of taking 1 data point at a time to get your step direction in SGD, try minibatch gradient descent, where you use multiple points at a time to get your step direction. How does this affect convergence speed? Are you getting computational speedup as well by using vectorized code?
• Advanced: What kind of loss function will give us “quantile regression”?
2 Mathematical Fundamentals
The following questions are designed to check how prepared you are to take this class. Familiarity with linear algebra and probability at the level of these questions is expected for the class.
2.1 Probability
Let (X1, X2, · · · , Xd) have a d-dimensional Gaussian distribution, i.e. (X1, X2, · · · , Xd) ∼ N(µ, Σ), with mean vector µ ∈ Rd and covariance matrix Σ ∈ Rd×d. Use µi to denote the ith element of µ and Σij to denote the element at the ith row and jth column of Σ.
1. Let x, y ∈ Rd be two independent samples drawn from N(µ, Σ). Give expressions for E‖x‖₂² and E‖x − y‖₂², where ‖x‖₂² represents the squared ℓ2-norm of the vector x. Express your answers as functions of µ and Σ.
2. Find the distribution of Z = αiXi + αjXj, for i ≠ j and 1 ≤ i, j ≤ d. The answer will belong to a familiar class of distributions. Report the answer by identifying this class of distributions and specifying the parameters.
3. (Optional) Assume W and R are two Gaussian distributed random variables. Is W + R still Gaussian? Justify your answer.
2.2 Linear Algebra
1. Let A be a d × d matrix with rank k. Consider the set SA := {x ∈ Rd | Ax = 0}. What is the dimension of SA?
2. Assume Sv is a k-dimensional subspace in Rd and v1, v2, · · · , vk form an orthonormal basis of Sv. Let w be an arbitrary vector in Rd. Find

x∗ = argmin_{x ∈ Sv} ‖w − x‖₂²,

where ‖w − x‖₂ is the Euclidean distance between w and x. Express x∗ as a function of v1, v2, · · · , vk and w.
3. (Optional) Continuing from above, x∗ can be expressed as

x∗ = Mw,

where M is a d × d matrix. Prove that such an M exists, or more precisely, find an expression for M as a function of v1, v2, · · · , vk. Compute the eigenvalues and one set of eigenvectors of M corresponding to the nonzero eigenvalues.
3 Linear Regression
3.1 Feature Normalization
When feature values differ greatly, we can get much slower rates of convergence of gradient-based algorithms. Furthermore, when we start using regularization (introduced in a later problem), features with larger values are treated as “more important”, which is not usually what you want. One common approach to feature normalization is to perform an affine transformation (i.e. shift and rescale) on each feature so that all feature values in the training set are in [0,1]. Each feature gets its own transformation. We then apply the same transformations to each feature on the test set. It’s important that the transformation is “learned” on the training set, and then applied to the test set. It is possible that some transformed test set values will lie outside the [0,1] interval.
Modify function feature_normalization to normalize all the features to [0,1]. (Can you use numpy’s “broadcasting” here?) Note that a feature with constant value cannot be normalized in this way. Your function should discard features that are constant in the training set.
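For instance, a broadcasting-based sketch of this function might look as follows. The argument names and return convention here are illustrative; follow the signature given in the support code.

```python
import numpy as np

def feature_normalization(train, test):
    """Rescale each feature to [0, 1] using statistics from the training set only.

    Illustrative sketch: per-feature minima/maxima are computed once on the
    training set, constant features are discarded, and the same affine
    transform is applied to the test set (whose values may leave [0, 1]).
    """
    col_min = train.min(axis=0)          # broadcasting: one min per feature
    col_max = train.max(axis=0)
    keep = col_max > col_min             # drop features constant in training
    lo = col_min[keep]
    span = (col_max - col_min)[keep]
    return (train[:, keep] - lo) / span, (test[:, keep] - lo) / span
```

Note how broadcasting lets the per-feature shift and rescale apply to every row without an explicit loop.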
3.2 Gradient Descent Setup
In linear regression, we consider the hypothesis space of linear functions hθ : Rd →R, where
hθ(x) = θTx,
for θ,x ∈Rd, and we choose θ that minimizes the following “average square loss” objective function:
J(θ) = (1/m) ∑_{i=1}^{m} (hθ(xi) − yi)²,
where (x1, y1), . . . , (xm, ym) ∈ Rd × R is our training data. While this formulation of linear regression is very convenient, it’s more standard to use a hypothesis space of “affine” functions:
hθ,b(x) = θTx + b,
which allows a “bias” or nonzero intercept term. The standard way to achieve this, while still maintaining the convenience of the first representation, is to add an extra dimension to x that is always a fixed value, such as 1. You should convince yourself that this is equivalent. We’ll assume this representation, and thus we’ll actually take θ,x ∈Rd+1.
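Concretely, the extra dimension can be added with a single numpy call (an illustrative snippet; the support code may already do this for you):

```python
import numpy as np

# Append a constant-1 column to the design matrix so that the last
# coordinate of theta plays the role of the bias term b.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])                      # m x d
X_with_bias = np.hstack((X, np.ones((X.shape[0], 1))))   # m x (d+1)
# Now theta @ x_with_bias recovers theta[:d] @ x + theta[d].
```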
1. Let X ∈ Rm×(d+1) be the design matrix, where the ith row of X is xi. Let y = (y1, . . . , ym)T ∈ Rm×1 be the “response”. Write the objective function J(θ) as a matrix/vector expression, without using an explicit summation sign. [Being able to write expressions as
matrix/vector expressions without summations is crucial to making implementations that are useful in practice, since you can use numpy (or more generally, an efficient numerical linear algebra library) to implement these matrix/vector operations orders of magnitude faster than naively implementing with loops in Python.]
2. Write down an expression for the gradient of J (again, as a matrix/vector expression, without using an explicit summation sign).
3. In our search for a θ that minimizes J, suppose we take a step from θ to θ + ηh, where h ∈ Rd+1 is the “step direction” (recall, this is not necessarily a unit vector) and η ∈ (0, ∞) is the “step size” (note that this is not the actual length of the step, which is η‖h‖). Use the gradient to write down an approximate expression for the change in objective function value J(θ + ηh) − J(θ). [This approximation is called a “linear” or “first-order” approximation.]
4. Write down the expression for updating θ in the gradient descent algorithm. Let η be the step size.
5. Modify the function compute_square_loss, to compute J(θ) for a given θ. You might want to create a small dataset for which you can compute J(θ) by hand, and verify that your compute_square_loss function returns the correct value.
6. Modify the function compute_square_loss_gradient, to compute ∇θJ(θ). You may again want to use a small dataset to verify that your compute_square_loss_gradient function returns the correct value.
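As a sketch of what these two functions might look like in vectorized form, assuming the matrix expressions J(θ) = (1/m)‖Xθ − y‖² and ∇θJ(θ) = (2/m)Xᵀ(Xθ − y) (your derivation in steps 1–2 should confirm these; the skeleton’s conventions may differ by a constant factor):

```python
import numpy as np

def compute_square_loss(X, y, theta):
    # J(theta) = (1/m) * ||X theta - y||^2, fully vectorized.
    m = X.shape[0]
    residual = X @ theta - y
    return residual @ residual / m

def compute_square_loss_gradient(X, y, theta):
    # grad J(theta) = (2/m) * X^T (X theta - y)
    m = X.shape[0]
    return 2.0 / m * X.T @ (X @ theta - y)
```

A small hand-checkable dataset, as the problem suggests, is the easiest way to validate both functions.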
3.3 (OPTIONAL) Gradient Checker
For many optimization problems, coding up the gradient correctly can be tricky. Luckily, there is a nice way to numerically check the gradient calculation. If J : Rd → R is differentiable, then for any vector h ∈ Rd, the directional derivative of J at θ in the direction h is given by²

limε→0 [J(θ + εh) − J(θ − εh)] / (2ε).
We can approximate this directional derivative by choosing a small value of ε > 0 and evaluating the quotient above. We can get an approximation to the gradient by approximating the directional derivatives in each coordinate direction and putting them together into a vector. In other words, take h = (1,0,0,…,0) to get the first component of the gradient. Then take h = (0,1,0,…,0) to get the second component. And so on. See http://ufldl.stanford.edu/wiki/index. php/Gradient_checking_and_advanced_optimization for details.
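The coordinate-by-coordinate procedure above can be sketched as follows. This is a generic checker taking callables for convenience; the skeleton’s grad_checker takes (X, y, theta, ...) instead, so treat the signature here as hypothetical.

```python
import numpy as np

def grad_checker(J, grad_J, theta, epsilon=1e-6, tolerance=1e-4):
    """Compare an analytic gradient against central finite differences.

    For each coordinate direction h = e_i, approximate the directional
    derivative by (J(theta + eps*h) - J(theta - eps*h)) / (2*eps).
    """
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        h = np.zeros_like(theta)
        h[i] = 1.0
        approx[i] = (J(theta + epsilon * h) - J(theta - epsilon * h)) / (2 * epsilon)
    return np.linalg.norm(approx - grad_J(theta)) < tolerance
```

For example, with J(θ) = θᵀθ the checker should accept the gradient 2θ and reject the (wrong) gradient θ.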
3.4 Batch Gradient Descent³
At the end of the skeleton code, the data is loaded, split into a training and test set, and normalized. We’ll now finish the job of running regression on the training set. Later on we’ll plot the results together with SGD results.
1. Complete batch_gradient_descent.
²Of course, it is also given by the more standard definition of directional derivative, limε→0 (1/ε)[J(θ + εh) − J(θ)]. The form given is a better approximation to the derivative when we are using small, but not infinitesimally small, ε.
³Sometimes people say “batch gradient descent” or “full batch gradient descent” to mean gradient descent, defined as we discussed in class. They do this to distinguish it from stochastic gradient descent and minibatch gradient descent, which they probably use as their default.
3. (Optional) Implement backtracking line search (google it). How does it compare to the best fixed step-size you found in terms of number of steps? In terms of time? How does the extra time to run backtracking line search at each step compare to the time it takes to compute the gradient? (You can also compare the operation counts.)
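A minimal sketch of the full-batch update loop, assuming the gradient expression (2/m)Xᵀ(Xθ − y) from the previous section (the skeleton version also records a history of θ and the loss, omitted here):

```python
import numpy as np

def batch_grad_descent(X, y, alpha=0.1, num_step=1000):
    """Plain (full-batch) gradient descent on the average square loss.

    Each step moves theta against the full gradient with fixed step size
    alpha; with a small enough alpha this converges on well-conditioned data.
    """
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_step):
        theta -= alpha * (2.0 / m) * X.T @ (X @ theta - y)
    return theta
```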
3.5 Ridge Regression (i.e. Linear Regression with ℓ2 regularization)
When we have a large number of features compared to instances, regularization can help control overfitting. Ridge regression is linear regression with ℓ2 regularization. The regularization term is sometimes called a penalty term. The objective function for ridge regression is
J(θ) = (1/m) ∑_{i=1}^{m} (hθ(xi) − yi)² + λθTθ,
where λ is the regularization parameter, which controls the degree of regularization. Note that the bias parameter is being regularized as well. We will address that below.
1. Compute the gradient of J(θ) and write down the expression for updating θ in the gradient descent algorithm. (Matrix/vector expression – no summations please.)
2. Implement compute_regularized_square_loss_gradient.
3. Implement regularized_grad_descent.
5. (Optional) Develop a formal statement of the claim in the previous problem, and prove the statement.
6. (Optional) Try various values of B to see what performs best in test.
7. Now fix B = 1. Choosing a reasonable step-size (or using backtracking line search), find the θ that minimizes J(θ) over a range of λ. You should plot the training average square loss and the test average square loss (just the average square loss part, without the regularization, in each case) as a function of λ. Your goal is to find the λ that gives the minimum average square loss on the test set. It’s hard to predict what λ that will be, so you should start your search very broadly, looking over several orders of magnitude. For example, λ ∈ {10⁻⁷, 10⁻⁵, 10⁻³, 10⁻¹, 1, 10, 100}. Once you find a range that works better, keep zooming in. You may want to have log(λ) on the x-axis rather than λ. [If you like, you may use sklearn to help with the hyperparameter search.]
8. What θ would you select for deployment and why?
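A minimal sketch of the regularized gradient from step 2, assuming the objective (1/m)‖Xθ − y‖² + λθᵀθ given above (so the penalty contributes 2λθ to the gradient; check against your own derivation in step 1):

```python
import numpy as np

def compute_regularized_square_loss_gradient(X, y, theta, lambda_reg):
    # grad of (1/m)*||X theta - y||^2 + lambda * theta^T theta
    #   = (2/m) * X^T (X theta - y) + 2 * lambda * theta
    m = X.shape[0]
    return 2.0 / m * X.T @ (X @ theta - y) + 2.0 * lambda_reg * theta
```

With λ = 0 this reduces to the unregularized gradient, which is a handy sanity check.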
3.6 Stochastic Gradient Descent
When the training data set is very large, evaluating the gradient of the objective function can take a long time, since it requires looking at each training example to take a single gradient step. When the objective function takes the form of an average of many values, such as

J(θ) = (1/m) ∑_{i=1}^{m} fi(θ)

(as it does in the empirical risk), stochastic gradient descent (SGD) can be very effective. In SGD, rather than taking −∇J(θ) as our step direction, we take −∇fi(θ) for some i chosen uniformly at random from {1, . . . , m}. The approximation is poor, but we will show it is unbiased. In machine learning applications, each fi(θ) would be the loss on the ith example (and of course
we’d typically write n instead of m, for the number of training points). In practical implementations for ML, the data points are randomly shuffled, and then we sweep through the whole training set one by one, and perform an update for each training example individually. One pass through the data is called an epoch. Note that each epoch of SGD touches as much data as a single step of batch gradient descent. You can use the same ordering for each epoch, though optionally you could investigate whether reshuffling after each epoch affects the convergence speed.
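The shuffle-then-sweep procedure described above might be sketched as follows for the ridge objective, with fi(θ) = (θᵀxi − yi)² + λθᵀθ. This is a hypothetical version of the skeleton’s function: parameter names, the fixed seed, and the returned history all differ in the actual support code.

```python
import numpy as np

def stochastic_grad_descent(X, y, alpha=0.01, lambda_reg=0.01, num_epoch=10):
    """One SGD update per training example; one shuffled pass = one epoch.

    Uses grad f_i(theta) = 2*(x_i . theta - y_i)*x_i + 2*lambda*theta.
    The same ordering is reused each epoch (reshuffling per epoch is the
    optional variation mentioned in the text).
    """
    m, d = X.shape
    theta = np.zeros(d)
    rng = np.random.default_rng(0)       # fixed seed for reproducibility
    order = rng.permutation(m)
    for _ in range(num_epoch):
        for i in order:
            grad_i = 2.0 * (X[i] @ theta - y[i]) * X[i] + 2.0 * lambda_reg * theta
            theta -= alpha * grad_i
    return theta
```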
1. Show that the ridge regression objective function

J(θ) = (1/m) ∑_{i=1}^{m} (hθ(xi) − yi)² + λθTθ

can be written in the form J(θ) = (1/m) ∑_{i=1}^{m} fi(θ) by giving an expression for fi(θ) that makes the two expressions equivalent.
2. Show that the stochastic gradient ∇fi(θ), for i chosen uniformly at random from {1, . . . , m}, is an unbiased estimator of ∇J(θ). In other words, show that E[∇fi(θ)] = ∇J(θ) for any θ. (Hint: It will be easier, notationally, to prove this for a general J(θ) of this averaged form, rather than the specific case of ridge regression. You can start by writing down an expression for E[∇fi(θ)]. . . )
3. Write down the update rule for θ in SGD for the ridge regression objective function.
4. Implement stochastic_grad_descent. (Note: You could potentially generalize the code you wrote for batch gradient to handle minibatches of any size, including 1, but this is not necessary.)
5. Use SGD to find θ that minimizes the ridge regression objective for the λ and B that you selected in the previous problem. (If you could not solve the previous problem, choose λ = 10⁻² and B = 1). Try a few fixed step sizes (at least try η ∈ {0.05, .005}). Note that SGD may not converge with a fixed step size. Simply note your results. Next try step sizes that decrease with the step number according to the following schedules:

ηt = C/t and ηt = C/√t, where C ≤ 1.

Please include C = 0.1 in your submission. You’re encouraged to try different values of C (see notes below for details). For each step size rule, plot the value of the objective function (or the log of the objective function if that is more clear) as a function of epoch (or step number, if you prefer) for each of the approaches to step size. How do the results compare? Some things to note:
• In this case we are investigating the convergence rate of the optimization algorithm with different step size schedules, thus we’re interested in the value of the objective function, which includes the regularization term.
• Sometimes the initial step size (C for C/t and C/√t) is too aggressive and will get you into a part of parameter space from which you can’t recover. Try reducing C to counter this problem.
• (Optional) There is another variant of SGD, sometimes called averaged SGD, in which rather than using the last parameter value we visit, say θT, we use the average of all parameter values we visit along the optimization path: (1/T) ∑_{t=1}^{T} θt, where T is the total number of steps taken. Try this approach and see how it compares.
6. (Optional) Try a stepsize rule of the form ηt = η0/(1 + η0λt), where λ is your regularization constant, and η0 a constant you can choose. How do the results compare?
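The decreasing step-size schedules above can be written as small helpers (a sketch; the constants C = 0.1, λ = 10⁻², and the η0 value are just the example values suggested in the text, and t is the 1-based step number):

```python
import numpy as np

C, eta0, lam = 0.1, 0.05, 1e-2  # example constants from the text; eta0 is a guess

def eta_inverse_t(t):
    return C / t                         # eta_t = C / t

def eta_inverse_sqrt_t(t):
    return C / np.sqrt(t)                # eta_t = C / sqrt(t)

def eta_decaying(t):
    return eta0 / (1 + eta0 * lam * t)   # the optional schedule above
```

Each helper can be plugged into the SGD loop in place of a fixed alpha to compare convergence curves.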
4 Risk Minimization
4.1 Square Loss
1. Let y be a random variable with a known distribution, and consider the square loss function ℓ(a, y) = (a − y)². We want to find the action a∗ that has minimal risk. That is, we want to find a∗ = argmin_a E(a − y)², where the expectation is with respect to y. Show that a∗ = Ey, and the Bayes risk (i.e. the risk of a∗) is Var(y). In other words, if you want to try to predict the value of a random variable, the best you can do (for minimizing expected square loss) is to predict the mean of the distribution. Your expected loss for predicting the mean will be the variance of the distribution. [Hint: Recall that Var(y) = Ey² − (Ey)².]
2. Now let’s introduce an input. Recall that the expected loss or “risk” of a decision function
f : X → A is R(f) = Eℓ(f(x),y),
where (x, y) ∼ PX×Y, and the Bayes decision function f∗ : X → A is a function that achieves the minimal risk among all possible functions:

f∗ = argmin_f R(f).

Here we consider the regression setting, in which A = Y = R. We will show for the square loss ℓ(a, y) = (a − y)², the Bayes decision function is f∗(x) = E[y | x], where the expectation is over y. As before, we assume we know the data-generating distribution PX×Y.
(a) We’ll approach this problem by finding the optimal action for any given x. If somebody tells us x, we know that the corresponding y is coming from the conditional distribution y | x. For a particular x, what value should we predict (i.e. what action a should we produce) that has minimal expected loss? Express your answer as a decision function f(x), which gives the best action for any given x. In mathematical notation, we’re looking for f∗(x) = argmin_a E[(a − y)² | x], where the expectation is with respect to y. (Hint: There is really nothing to do here except write down the answer, based on the previous question. But make sure you understand what’s happening. . . )
(b) In the previous problem we produced a decision function f∗(x) that minimized the risk for each x. In other words, for any other decision function f(x), f∗(x) is going to be at least as good as f(x), for every single x. In math, we mean
E[(f∗(x) − y)² | x] ≤ E[(f(x) − y)² | x],
for all x. To show that f∗(x) is the Bayes decision function, we need to show that
E(f∗(x) − y)² ≤ E(f(x) − y)²
for any f. Explain why this is true. (Hint: Law of iterated expectations.)
4.2 (OPTIONAL) Median Loss
1. Show that for the absolute loss ℓ(ŷ, y) = |y − ŷ|, f∗(x) is a Bayes decision function if f∗(x) is the median of the conditional distribution of y given x. [Hint: As in the previous section, . . . ]



