Description
General considerations
For the first part of this assignment we will consider a synthetic prediction problem to develop our intuition about the error decomposition. Consider the random variables x ∈ X = [0,1], distributed uniformly (x ∼ Unif([0,1])), and y ∈ Y = R, defined as a polynomial of degree 2 in x: there exists (a0, a1, a2) ∈ R³ such that the values of x and y are linked as y = g(x) = a0 + a1x + a2x². Note that this relation fixes the joint distribution P_{X×Y}.
From the knowledge of a sample {(x_i, y_i)}_{i=1,···,N}, we would like to predict the relation between x and y, that is, find a function f to make predictions ŷ = f(x). We denote by Hd the set of polynomial functions on R of degree at most d, and we will consider the hypothesis classes Hd for varying d. We will minimize the squared loss ℓ(y, ŷ) = (y − ŷ)² to solve the regression problem.
1. Recall the definition of the expected risk R(f) of a predictor f. While this cannot be computed in general, note that here we defined P_{X×Y}. Which function f∗ is an obvious Bayes predictor? Make sure to explain why the risk is minimized at f = f∗.
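As a reminder of the standard definition (a restatement only, not a solution to the question), the expected risk of a predictor f under the squared loss reads:

    R(f) = \mathbb{E}_{(x,y) \sim P_{X \times Y}}\big[\ell(y, f(x))\big]
         = \mathbb{E}\big[(y - f(x))^2\big]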
2. Using H2 as your hypothesis class, which function f∗_{H2} is a risk minimizer in H2? Recall the definition of the approximation error. What is the approximation error achieved by f∗_{H2}?
3. Consider now Hd, with d > 2. Justify an inequality between R(f∗_{H2}) and R(f∗_{Hd}). Which function is a risk minimizer in Hd? What is the approximation error achieved by f∗_{Hd}?
4. For this question we assume a0 = 0. Considering H = {f : x → b1x ; b1 ∈ R}, which function f∗_H is a risk minimizer in H? What is the approximation error achieved by f∗_H? In particular, what is the approximation error achieved if furthermore a2 = 0 in the definition of the true underlying relation g(x) above?
Polynomial regression as linear least squares
In practice, P_{X×Y} is usually unknown and we use the empirical risk minimizer (ERM). We will reformulate the problem as a (d+1)-dimensional linear regression problem. First note that functions in Hd are parametrized by a vector b = [b0, b1, ···, bd]^⊤; we will use the notation f_b. Similarly, we will denote by a ∈ R³ the vector parametrizing g(x) = f_a(x). We will also gather the data points from the training sample in the following matrix and vector:
X ∈ R^{N×(d+1)}, whose i-th row is (1, x_i, x_i², ···, x_i^d), and y = [y_1, y_2, ···, y_N]^⊤ ∈ R^N. (1)
These notations allow us to take advantage of the efficient formalism of linear algebra. X is called the design matrix.
5. Show that the empirical risk minimizer (ERM) b̂ is given by the following minimization: b̂ = argmin_{b ∈ R^{d+1}} ‖Xb − y‖₂².
6. If N > d and X is full rank, show that b̂ = (X^⊤X)^{−1} X^⊤y. (Hint: take the gradient of the loss above with respect to b.) Why do we need the conditions N > d and X full rank?
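For reference, the hint unpacks as follows (a standard least-squares derivation sketch; filling in the details is the point of the question):

    \nabla_b \, \|Xb - y\|_2^2 = 2\, X^\top (Xb - y) = 0
    \quad \Longrightarrow \quad X^\top X \, \hat{b} = X^\top y

When X has full column rank, which requires N > d, the (d+1)×(d+1) matrix X^⊤X is invertible, which yields the closed form above.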
Hands on
Open the source code file hw1_code_source.py from the .zip folder. Using the function get_a, get a value for a, and draw a sample x_train, y_train of size N = 10 and a sample x_test, y_test of size N_test = 1000 using the function draw_sample.
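A minimal sketch of this step; the call signatures of get_a and draw_sample used below are assumptions, so check them against hw1_code_source.py:

    import numpy as np
    from hw1_code_source import get_a, draw_sample   # names from the assignment

    a = get_a()                                # true coefficients (a0, a1, a2)
    x_train, y_train = draw_sample(a, 10)      # N = 10        (assumed signature)
    x_test,  y_test  = draw_sample(a, 1000)    # N_test = 1000 (assumed signature)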
7. Write a function called least_square_estimator taking as input a design matrix X ∈ R^{N×(d+1)} and the corresponding vector y ∈ R^N, returning b̂ ∈ R^{d+1}. Your function should handle any value of N and d, and in particular return an error if N ≤ d. (Drawing x at random from the uniform distribution makes it almost certain that any design matrix X with d ≥ 1 we generate is full rank.)
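A possible implementation sketch, solving the normal equations with np.linalg.solve (np.linalg.lstsq would work equally well):

    import numpy as np

    def least_square_estimator(X, y):
        """Return b_hat = argmin_b ||Xb - y||^2, assuming X has full column rank."""
        N, num_params = X.shape        # num_params = d + 1
        d = num_params - 1
        if N <= d:
            raise ValueError("Need N > d so that X^T X is invertible.")
        return np.linalg.solve(X.T @ X, X.T @ y)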
8. Recall the definition of the empirical risk R̂(f̂) on a sample for a prediction function f̂. Write a function empirical_risk to compute the empirical risk of f_b, taking as input a design matrix X ∈ R^{N×(d+1)}, a vector y ∈ R^N and the vector b ∈ R^{d+1} parametrizing the predictor.
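A corresponding sketch, computing the mean squared loss over the sample:

    def empirical_risk(X, y, b):
        """Empirical risk (mean squared loss) of the predictor f_b on the sample (X, y)."""
        residuals = X @ b - y
        return np.mean(residuals ** 2)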
9. Use your code to estimate b̂ from x_train, y_train using d = 5. Compare b̂ and a. Make a single plot (Plot 1) in the (x, y) plane displaying the points in the training set, the values of the true underlying function g(x) on [0,1], and the values of the estimated function f_b̂(x) on [0,1]. Make sure to include a legend in your plot.
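A sketch of Plot 1, assuming the functions above; the design matrices are built with np.vander, and the plotting choices are illustrative:

    import matplotlib.pyplot as plt

    d = 5
    X_train = np.vander(x_train, d + 1, increasing=True)   # N x (d+1) design matrix
    b_hat = least_square_estimator(X_train, y_train)

    x_grid = np.linspace(0, 1, 200)
    g_true = np.vander(x_grid, len(a), increasing=True) @ a     # true g(x)
    f_hat = np.vander(x_grid, d + 1, increasing=True) @ b_hat   # estimated f_b_hat(x)

    plt.scatter(x_train, y_train, label="training points")
    plt.plot(x_grid, g_true, label="true function g")
    plt.plot(x_grid, f_hat, label="estimated function f_b_hat")
    plt.xlabel("x"); plt.ylabel("y")
    plt.legend()
    plt.show()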
10. Now you can adjust d. What is the minimum value of d for which we get a "perfect fit"? How does this result relate to your conclusions on the approximation error above?
In presence of noise
Now we will modify the true underlying P_{X×Y} by adding some noise: y = g(x) + ε, with ε a standard normal random variable independent of x. We will call training error e_t the empirical risk on the train set, and generalization error e_g the empirical risk on the test set.
12. Recall the definition of the estimation error. Using the test set (which we intentionally chose large so as to take advantage of the law of large numbers), give an empirical estimator of the estimation error. For the same values of N and d as above, plot the estimation error as a function of N (Plot 3).
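One possible sketch of such an estimator, under the assumptions that d ≥ 2 (so the approximation error is zero) and that the Bayes risk equals the noise variance, which is 1 for standard normal noise; the test risk then approximates R(f_b̂), and subtracting 1 isolates the estimation error. The grid of N values is hypothetical:

    d = 5
    # assuming draw_sample now draws from the noisy distribution y = g(x) + eps;
    # the code source may provide a dedicated function for this
    x_test, y_test = draw_sample(a, 1000)
    X_test = np.vander(x_test, d + 1, increasing=True)

    Ns = [10, 20, 50, 100, 200, 500]   # hypothetical grid; reuse the values from Plot 2
    est_errors = []
    for N in Ns:
        x_tr, y_tr = draw_sample(a, N)
        X_tr = np.vander(x_tr, d + 1, increasing=True)
        b_hat = least_square_estimator(X_tr, y_tr)
        # test risk ~ R(f_b_hat); subtracting the Bayes risk (noise variance = 1)
        # approximates the estimation error since the approximation error is 0 for d >= 2
        est_errors.append(empirical_risk(X_test, y_test, b_hat) - 1.0)

    plt.plot(Ns, est_errors, marker="o")
    plt.xlabel("N"); plt.ylabel("estimated estimation error")
    plt.title("Plot 3")
    plt.show()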
13. In practice, the generalization error gives information related to the estimation error. Comment on the results (Plots 2 and 3). What is the effect of increasing N? What is the effect of increasing d?
14. Besides the approximation and estimation errors, there is a last source of error we have not discussed here. Can you comment on the optimization error of the algorithm we are implementing?
Application to Ozone data (optional)
You can now use the code we developed on the synthetic example on a real-world dataset. Using the command np.loadtxt('ozone_wind.data'), load the data from the .zip. The first column corresponds to ozone measurements and the second to wind measurements. You can try polynomial fits of the ozone values as a function of the wind values.
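A loading and fitting sketch reusing the functions above (the degree chosen here is only an example, and the column roles are as stated: ozone first, wind second):

    data = np.loadtxt('ozone_wind.data')
    ozone, wind = data[:, 0], data[:, 1]

    d = 3                                     # example degree; vary d as asked below
    X = np.vander(wind, d + 1, increasing=True)
    b_hat = least_square_estimator(X, ozone)
    print("empirical risk:", empirical_risk(X, ozone, b_hat))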
15. Reporting plots, discuss again in this context the results obtained when varying N (subsampling the training data) and d.