Description
Agenda
Gradient Descent
Adaptive Learning Rate
Coding Exercise
Stochastic Gradient Descent
Coding Exercise
Application
Linear Regression
Logistic Regression
Agenda
Gradient Descent
Gradient Descent Recap
Gradient Descent
Gradient Descent
Learning Rate
Application
Suppose we would like to find θ∗ ∈ R that minimizes f (θ) = θ2 − 2θ + 1. The gradient (in this case, the derivative) ∇f (θ) = 2θ − 2. We can easily see that θ∗ = argminθ f (θ) = 1.
Learning Rate
We applied gradient descent for 1000 iterations on f (θ) = θ2 − 2θ + 1 with varying learning rate
When the learning rate is too large ( ) does not decrease through iterations. The value of θi at each iteration significantly fluctuates.
When the learning rate is too small ( ) decreases very slowly.
Adaptive Learning Rate
Instead of using a fixed learning rate through all iterations, we can adjust our learning rate in each iteration using a simple algorithm.
At each iteration i:
if δ ≥ threshold: we achieve a satisfactory reduction on f (θ) θi = θ¯ maybe we can consider increasing the learning rate for next iteration the reduction is unsatisfactory θi = θi−1
the learning rate is too large, so we reduce the learning rate
Adaptive Learning Rate
How to decide a proper threshold for f (θi−1) − f (θ˜)
Stochastic Gradient Descent
Stochastic Gradient Descent
Application
Application
Gradient Descent for Linear Regression
Goal: Finding the set of parameters that minimize the empirical risk:
n
Rˆn(θ) = 1
n
i=1
where θ ∈ Rd parameterizes the hypothesis space F Set our cost function:
n
J
Approach 1: Closed Form Solution (set derivatives equal to zero and solve for parameters) pros: one shot algorithm! cons: does not scale to large datasets (matrix inverse is bottleneck)
Approach 2: Gradient Descent (take larger, more certain steps toward the negative gradient) pros: conceptually simple, guaranteed convergence cons: batch, often slow to converge
Approach 3: Stochastic Gradient Descent (take many small, quick steps opposite the gradient) pros: memory efficient, fast convergence, less prone to local optima cons: convergence in practice requires tuning and fancier variants
Approach 1: Close-Form Solution
Transform the cost function in matrix form
n J
i=1
To minimize J(θ), take derivative and set to zero:
X T Xθ − X T y X T (Xθ − y) = 0
XT y
Ensure invertibility of X>X
What if X has less than full column rank?
Approach 2: Iterative Method GD
Recall:
d J(θ) dθ1
d
∇θJ(θ) = dθ2 J(θ)
d dθd
Approach 3: Iterative Method SGD
Gradient Descent for Logistic Regression
Learning Logistic Regression
Assumption: (x1,y1),…,(xn,yn) are independently generated
Likelihood: The probability of getting the y1 ……,yn in D from the corresponding x1,…,xn
n
P (y1,…,yn | x1,…,xn) = Yp(y(i) | x(i);θ)
i=1
n
= Y(hθ(x(i)))y(i)(1 − hθ(x(i)))1−y(i)
i=1
Goal: maximize the log likelihood (Easier)
n
`(θ) = log L(θ) = Xy(i) log hθ(x(i)) + (1 − y(i))log(1 − hθ(x(i)))
i=1
Equivalent to minimize the objective function with logistic loss: J where `(h (x),y) = −y log (hθ(x)) − (1 − y)log (hθ(x))
Learning Logistic Regression
Analytic solution won’t work
Find optimum using iterative methods: Gradient Ascent or Stochastic
Gradient Ascent
Gradient ascent rule: θ := θ + α∇θ`(θ)
Stochastic gradient ascent rule: θ := θ + α∇θ`(i)(θ) for random training pair (xi,yi)
For one training example (x,y), the partial derivative of log likelihood `(θ):
∂ 1 − (1 − y) 1 ) ∂ g(θT x)
`(θ) = (y
∂θj g(θT x) 1 − g(θT x) ∂θj
= (y 1 1 T − g(θT x) ∂ θT x
− (1 − y) g(θ x)(1
g(θT x) 1 − g(θT x) ∂θj
= (y(1 − g(θT x)) − (1 − y)g(θT x))xj
= (y − hθ(x)) · xj
References
References
Stanford CS229 Note1
CMU 10-701 Linear Regression Slide
Reviews
There are no reviews yet.