1. [13 points] Q-Learning
(a) State the Bellman optimality principle as a function of the optimal Q-function Q∗(s,a), the expected reward function R(s,a,s′) and the transition probability P(s′|s,a), where s is the current state, s′ is the next state and a is the action taken in state s.
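As a sketch of the answer, in the notation of the question, the Bellman optimality equation expresses Q∗ through the expected one-step reward plus the discounted optimal value of the successor state:

```latex
Q^*(s,a) \;=\; \sum_{s'} P(s' \mid s, a)\,\Big[\, R(s,a,s') \;+\; \gamma \max_{a'} Q^*(s',a') \,\Big]
```

Here γ is the discount factor and the maximum ranges over the actions available in the next state s′.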
(b) In case the transition probability P(s′|s,a) and the expected reward R(s,a,s′) are unknown, a stochastic approach is used to approximate the optimal Q-function. After observing a transition of the form (s,a,r,s′), write down the update of the Q-function at the observed state-action pair (s,a) as a function of the learning rate α, the discount factor γ, Q(s,a) and Q(s′,a′).
Your answer:
• Obtain a sample mini-batch B ⊆ D, where D = {(s,a,r,s′)}
• Compute the target y_j for each transition (s,a,r,s′) ∈ B
• Use stochastic (semi-)gradient descent to optimize:
Q(s,a) ← (1 − α)Q(s,a) + αy_j
       = (1 − α)Q(s,a) + α(R(s,a,s′) + γ max_{a′∈A(s′)} Q(s′,a′))
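The update above can be sketched as a small Python function; the state/action labels and the tuple-keyed Q-table are illustrative choices, not part of the original question:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.5):
    """One tabular Q-learning update for an observed transition (s, a, r, s')."""
    # Bootstrapped target: immediate reward plus discounted best next-state value.
    target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    # Exponential moving average toward the target with step size alpha.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]

Q = defaultdict(float)  # Q-table entries initialized to zero
q_update(Q, 'S1', 'a1', -10.0, 'S1', actions=('a1', 'a2'))  # yields -5.0
```

For a mini-batch B, the same update would simply be applied to every transition sampled from the batch.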
(c) What is the advantage of an epsilon-greedy strategy?
Your answer:
An epsilon-greedy strategy balances exploration and exploitation: with probability ε a random action is taken instead of the current greedy one, so the agent keeps generating diverse (more stochastic) samples of the state space rather than only following its current estimate, which helps it avoid committing prematurely to a suboptimal policy.
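A minimal sketch of epsilon-greedy action selection (the tuple-keyed Q-table mirrors the update sketch above and is an assumption, not from the original):

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps, explore with a random action; otherwise exploit greedily."""
    if random.random() < eps:
        return random.choice(list(actions))
    # Greedy choice: action with the highest current Q-value in state s.
    return max(actions, key=lambda a: Q[(s, a)])
```

With eps=0 this reduces to the purely greedy policy; with eps=1 it is uniform random exploration.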
(d) What is the advantage of using a replay-memory?
Your answer:
• Q-learning updates are incremental and do not converge quickly, so multiple passes over the same data are beneficial, especially when there is low variance in the immediate outcomes (r,s′) given the same state-action pair (s,a).
• It yields better convergence behavior when training a function approximator, because randomly sampled mini-batches are closer to the i.i.d. data assumed in most supervised-learning convergence proofs.
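A replay memory is typically a fixed-capacity ring buffer of transitions; a minimal sketch (the class name and capacity are illustrative assumptions):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of transitions (s, a, r, s'); oldest entries are evicted first."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks temporal correlation between consecutive
        # transitions, yielding near-i.i.d. mini-batches for training.
        return random.sample(self.buffer, batch_size)
```

Each observed transition is pushed once but can be sampled many times, which is what enables the multiple passes over the same data mentioned above.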
(e) Consider a system with two states S1 and S2 and two actions a1 and a2. You perform actions and observe the rewards and transitions listed below. Each step lists the current state, reward, action and resulting transition as: Si; R = r; ak : Si → Sj. Perform Q-learning using a learning rate of α = 0.5 and a discount factor of γ = 0.5 for each step by applying the formula from part (b). The Q-table entries are initialized to zero. Fill in the tables below corresponding to the following four transitions. What is the optimal policy after having observed the four transitions?
i. S1; R = −10; a1 : S1 → S1
ii. S1; R = −10; a2 : S1 → S2
iii. S2; R = 18.5; a1 : S2 → S1
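As a sketch, the updates from part (b) can be applied step by step in Python to the three transitions listed above (the fourth transition is not reproduced in this extract, so only these three are computed):

```python
from collections import defaultdict

alpha, gamma = 0.5, 0.5
actions = ('a1', 'a2')
Q = defaultdict(float)  # Q-table initialized to zero

# The three transitions listed above, as (s, r, a, s') tuples.
transitions = [
    ('S1', -10.0, 'a1', 'S1'),   # i.
    ('S1', -10.0, 'a2', 'S2'),   # ii.
    ('S2',  18.5, 'a1', 'S1'),   # iii.
]

for s, r, a, s_next in transitions:
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

# Greedy policy per state after these updates (ties resolve to the first action).
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in ('S1', 'S2')}
```

This gives Q(S1,a1) = Q(S1,a2) = −5, Q(S2,a1) = 8 and Q(S2,a2) = 0, so after these three transitions the greedy policy chooses a1 in S2, while the two actions in S1 are still tied.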