DS-GA1003 – Agenda Solved

Agenda
Motivation
Calculating the gradient
Coding exercise
Intro Question

Intro Solution

Intuition
Why don’t we directly compute the derivative ∂J(θ, y)/∂θ and solve for the θ∗ that makes it equal to 0?
But what if the analytical solution of ∂J(θ∗, y)/∂θ = 0 is computationally intractable?
We can improve an initial guess θ0 by iteratively “updating” it using the gradient of the loss ∇J(θ).
This process will give us a “good enough” approximation of θ∗.
To do so, we need to know how to calculate the gradient ∇J(θ).
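As a minimal sketch of this iterative update (assuming a user-supplied gradient function grad_J and an illustrative fixed step size eta, neither of which is specified on this slide):

import numpy as np

def gradient_descent(grad_J, theta0, eta=0.1, num_steps=100):
    # Start from an initial guess and repeatedly step in the
    # direction of the negative gradient of the loss.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - eta * grad_J(theta)
    return theta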
Directional Derivative
Partial Derivative
Let ei = (0, 0, …, 0, 1, 0, …, 0) denote the ith standard basis vector, with i − 1 zeros before the single 1 in the ith entry.
The ith partial derivative is defined to be the directional derivative along ei.
It can be written many ways:
f′(x; ei) = ∂f(x)/∂xi = ∂xi f(x) = ∂i f(x).
What is the intuitive meaning of ∂xi f(x)? For example, what does a large value of ∂x3 f(x) imply?
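As an illustrative example (not from the slides): if f(x) = x1² x2, then ∂x1 f(x) = 2 x1 x2 and ∂x2 f(x) = x1². A large value of ∂x3 f(x) means that a small change in the third coordinate of x produces a large change in f near x, i.e. f is very sensitive to x3 at that point.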
Differentiability and Gradients
We say a function f : Rn → R is differentiable at x ∈ Rn if
lim_{v→0} [f(x + v) − f(x) − gᵀv] / ‖v‖₂ = 0,
for some g ∈ Rn. (Proof: page 8 of 19ML notes.)
If it exists, this g is unique and is called the gradient of f at x, with notation g = ∇f(x).
It can be shown that
∇f(x) = (∂f(x)/∂x1, …, ∂f(x)/∂xn)ᵀ,
i.e. the gradient is the vector whose ith entry is the ith partial derivative of f at x.

Computing Gradients
The gradient of this function can be computed as follows:

Computing Gradients
In Homework 2, we will show that
∇J(θ) = 2(XᵀXθ − Xᵀy) = 2Xᵀ(Xθ − y)
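As a minimal NumPy sketch of this formula for the squared-error objective J(θ) = ‖Xθ − y‖₂² (the data below are made up purely for illustration):

import numpy as np

# Toy data: 3 examples, 2 features (illustrative values only)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.5, -0.5])

def J(theta):
    residual = X @ theta - y
    return residual @ residual          # squared Euclidean norm of the residual

def grad_J(theta):
    return 2 * X.T @ (X @ theta - y)    # the analytic gradient shown above

print(J(theta), grad_J(theta))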
Gradient Checker
So far we have worked with relatively simple functions where it is straightforward to compute the gradient.
For more complex functions, the gradient computation can be notoriously difficult to debug and get right. Example from www.quora.com/What-is-the-most-complex-equation-in-the-world.
How can we test if our gradient computation is correct?
Gradient Checker
Recall the mathematical definition of the derivative:
∂f(θ)/∂θ = lim_{ε→0} [f(θ + ε) − f(θ − ε)] / (2ε)
We can approximate the gradient (left-hand side) by evaluating the right-hand side with ε set to a small constant.
Now let’s extend this method to deal with vector input θ ∈ Rd. Say we want to verify our gradient at dimension i, (∇f(θ))i. We can make use of the one-hot vector ei, in which all dimensions except the ith are 0 and the ith dimension has a value of 1: ei = [0, 0, …, 1, …, 0]ᵀ. The gradient at the ith dimension can then be approximated as
(∇f(θ))i ≈ [f(θ + ε·ei) − f(θ − ε·ei)] / (2ε)
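A possible Python implementation of this check (the helper name grad_checker, the epsilon, and the tolerance below are illustrative choices, not the ones required in the homework):

import numpy as np

def grad_checker(f, grad_f, theta, epsilon=1e-6, tolerance=1e-4):
    # Approximate each coordinate of the gradient with a central difference
    # along the one-hot vector e_i, then compare to the analytic gradient.
    theta = np.asarray(theta, dtype=float)
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0
        approx[i] = (f(theta + epsilon * e_i) - f(theta - epsilon * e_i)) / (2 * epsilon)
    return np.linalg.norm(grad_f(theta) - approx) <= tolerance

# Example: f(theta) = theta_1^2 + 3*theta_2 has analytic gradient (2*theta_1, 3)
f = lambda t: t[0] ** 2 + 3 * t[1]
grad_f = lambda t: np.array([2 * t[0], 3.0])
print(grad_checker(f, grad_f, np.array([1.0, 2.0])))   # True if the gradients agree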
Recap
To find a good decision function we will minimize the empirical loss on the training data.
Having fixed a hypothesis space parameterized by θ, finding a good decision function means finding a good θ.
Given a current guess for θ, we will use the gradient of the empirical loss (w.r.t. θ) to get a local linear approximation.
If the gradient is non-zero, taking a sufficiently small step in the direction of the negative gradient is guaranteed to decrease the empirical loss.
This motivates the minimization algorithm called gradient descent.
Coding Exercise
In the provided notebook, we will use Python to:
calculate the gradient for two example functions
implement the gradient checker

References
