CS234 – 1

1 Bellman Operator Properties

Consider an MDP M = (S, A, R, P, γ) where |S| < ∞ and |A| < ∞. Recall the Bellman operator T is defined such that:

(TV)(s) = max_{a∈A} [R(s,a) + γ Σ_{s′∈S} P(s′ | s,a) V(s′)]

For a policy π, the Bellman operator associated with π, Tπ, is defined such that:

(TπV)(s) = R(s,π(s)) + γ Σ_{s′∈S} P(s′ | s,π(s)) V(s′)
We consider the following properties of these operators. Note that = and ≤ are element-wise comparisons. For example, V ≤ V′ if V(s) ≤ V′(s) for all s ∈ S.
(a) Let V and V′ be two value functions. Prove that for any policy π, V ≤ V′ ⇒ TπV ≤ TπV′. Also prove V ≤ V′ ⇒ TV ≤ TV′.
(b) Let c be a constant and ~1 be the all-ones vector. Prove that for any policy π, Tπ(V + c~1) = TπV + γc~1. Also prove T(V + c~1) = TV + γc~1.
(c) For α, β ≥ 0 with α + β = 1, prove T(αV + βV′) ≤ αTV + βTV′.
(d) Prove or disprove the following statement: For all V , TV ≤ V .
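As a numerical sanity check (not a substitute for the proofs), the monotonicity and constant-shift properties can be spot-checked on a small randomly generated MDP. The MDP below is made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9

# Random made-up MDP: R[s, a] rewards, P[s, a, s'] transition probabilities.
R = rng.uniform(-1, 1, size=(nS, nA))
P = rng.uniform(size=(nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)

def bellman_optimality(V):
    """(TV)(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    return np.max(R + gamma * P @ V, axis=1)

# Check V <= V' implies TV <= TV' on random pairs.
for _ in range(100):
    V = rng.uniform(-5, 5, size=nS)
    Vp = V + rng.uniform(0, 5, size=nS)   # construction guarantees V <= Vp
    assert np.all(bellman_optimality(V) <= bellman_optimality(Vp))

# Check T(V + c*1) = TV + gamma*c*1.
V, c = rng.uniform(-5, 5, size=nS), 2.5
assert np.allclose(bellman_optimality(V + c), bellman_optimality(V) + gamma * c)
print("monotonicity and constant-shift checks passed")
```

A finite sample of random value functions cannot prove the properties, but a failed assertion would immediately falsify a wrong conjecture.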
2 Value Iteration
(a) After each iteration of value iteration you can extract the current best policy given the current value function (by taking an argmax instead of a max). If the best policy at step k is the same as the best policy at step k + 1 of value iteration, can the policy ever change again with further iterations? Prove that it cannot, or disprove with a counterexample.
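To build intuition for (a), here is a minimal value-iteration loop that records the greedy policy after every sweep; the 2-state MDP at the bottom is made up for illustration:

```python
import numpy as np

def value_iteration_with_policies(R, P, gamma, num_iters):
    """Run value iteration, recording the greedy policy after each sweep.

    R: (nS, nA) rewards; P: (nS, nA, nS) transition probabilities.
    """
    V = np.zeros(R.shape[0])
    policies = []
    for _ in range(num_iters):
        Q = R + gamma * P @ V                  # Q[s, a] = R(s,a) + gamma * E[V(s')]
        policies.append(np.argmax(Q, axis=1))  # greedy policy at this sweep
        V = np.max(Q, axis=1)                  # Bellman optimality backup
    return V, policies

# Tiny made-up 2-state MDP: action 1 always jumps to state 1 and pays +1
# from state 0; action 0 always moves to state 0 with no reward.
R = np.array([[0.0, 1.0], [0.0, 0.0]])
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
V, policies = value_iteration_with_policies(R, P, gamma=0.9, num_iters=20)
```

Comparing consecutive entries of `policies` on MDPs of your own design is a quick way to look for a counterexample before committing to a proof.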
(b) Consider an MDP with a finite state space and a finite action set. Empirically we often halt value iteration after a fixed number of steps for computational reasons, yielding an approximately optimal value function Ṽ. If we then extract the best policy for Ṽ (by taking the argmax of one more backup) we will get a policy. Let's assume we know that the maximum difference across any state is at most ε (for extra understanding, think about how the difference ‖TV − V‖∞ allows us to ensure this):

|Ṽ(s) − V∗(s)| ≤ ε, for all s
Then prove that the best policy extracted from Ṽ also has performance close to that of the optimal policy. In particular, for the greedy policy πṼ derived from Ṽ, define the loss function LṼ as follows:

LṼ(s) = V∗(s) − VπṼ(s), for all s

where V∗ is the optimal value function and VπṼ is the value function associated with πṼ. Prove that

LṼ(s) ≤ 2γε/(1 − γ), for all s
If you use outside sources you must write up your own solution and cite the source.
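Before writing the proof, it can help to spot-check the claimed bound LṼ(s) ≤ 2γε/(1 − γ) numerically. The MDP below is randomly generated for illustration; the check solves for V∗ by value iteration, perturbs it within ε, extracts the greedy policy, and evaluates that policy exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, eps = 4, 3, 0.9, 0.05

# Random made-up MDP.
R = rng.uniform(0, 1, size=(nS, nA))
P = rng.uniform(size=(nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)

def backup(V):
    return R + gamma * P @ V            # Q[s, a] = R(s,a) + gamma * E[V(s')]

# Solve for V* (to numerical precision) by value iteration.
V = np.zeros(nS)
for _ in range(2000):
    V = np.max(backup(V), axis=1)

# Perturb within eps, extract the greedy policy, evaluate it exactly.
V_tilde = V + rng.uniform(-eps, eps, size=nS)
pi = np.argmax(backup(V_tilde), axis=1)
R_pi = R[np.arange(nS), pi]
P_pi = P[np.arange(nS), pi]
V_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)  # V^pi = (I - gamma P_pi)^-1 R_pi

loss = np.max(V - V_pi)
bound = 2 * gamma * eps / (1 - gamma)   # the bound you are asked to prove
assert loss <= bound + 1e-6
print(f"max loss {loss:.4f} <= bound {bound:.4f}")
```

A single random instance proves nothing, but trying many seeds and perturbations gives a feel for how tight (or loose) the 2γε/(1 − γ) bound is in practice.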
3 Grid Policies
Consider the following grid environment. Starting from any unshaded square, you can move up, down, left, or right. Actions are deterministic and always succeed (e.g. going left from state 21 goes to state 20) unless they will cause the agent to run into a wall. The thicker edges indicate walls, and attempting to move in the direction of a wall results in staying in the same square. Taking any action from the green target square (no. 11) earns a reward of +5 and ends the episode. Taking any action from the red square of death (no. 15) earns a reward of −5 and ends the episode. Otherwise, each move is associated with some reward r ∈ {−1,0,+1}. Assume the discount factor γ = 1 unless otherwise specified.
20 21 22 23 24
15 16 17 18 19
10 11 12 13 14
5 6 7 8 9
0 1 2 3 4
(a) Define the reward r for all states (except state 15 and state 11 whose rewards are specified above) that would cause the optimal policy to return the shortest path to the green target square (no. 11).
(b) Using r from part (a), find the optimal value function for each square.
(c) Does setting γ = 0.5 change the optimal policy? Why or why not?
(d) The green target state is now even more awesome! Its reward is +20. Assuming the rest of the rewards for all states are the same as you defined in part (a), does the value function change and why? Does the policy change?
4 Frozen Lake MDP
Now you will implement value iteration and policy iteration for the Frozen Lake environment from OpenAI Gym (https://gym.openai.com/envs/FrozenLake-v0). We have provided custom versions of this environment in the starter code.
(a) (coding) Implement policy iteration in vi_and_pi.py. The stopping tolerance is tol = 10⁻³. Use γ = 0.9. Return the optimal value function and the optimal policy.
(b) (coding) Implement value iteration in vi_and_pi.py. The stopping tolerance is tol = 10⁻³. Use γ = 0.9. Return the optimal value function and the optimal policy.
(c) (written) Run both methods on the Deterministic-4×4-FrozenLake-v0 and Stochastic-4×4-FrozenLake-v0 environments. In the second environment, the dynamics of the world are stochastic. How does stochasticity affect the number of iterations required and the resulting policy?
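A sketch of the value-iteration half of the exercise is below. It assumes the Gym-style transition dictionary `P[s][a]` of `(prob, next_state, reward, terminal)` tuples that FrozenLake exposes; the starter code's exact interface may differ, and whether terminal transitions bootstrap is a convention you should check against it (here they do not):

```python
import numpy as np

def value_iteration(P, nS, nA, gamma=0.9, tol=1e-3):
    """Value iteration over Gym-style transitions.

    P[s][a] is a list of (prob, next_state, reward, terminal) tuples.
    Returns the converged value function and the greedy policy.
    """
    def q_value(V, s, a):
        # Expected one-step return; terminal transitions do not bootstrap.
        return sum(p * (r + gamma * V[ns] * (not done))
                   for p, ns, r, done in P[s][a])

    V = np.zeros(nS)
    while True:
        V_new = np.array([max(q_value(V, s, a) for a in range(nA))
                          for s in range(nS)])
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # One extra backup extracts the greedy policy.
    policy = np.array([max(range(nA), key=lambda a: q_value(V, s, a))
                       for s in range(nS)], dtype=int)
    return V, policy
```

Policy iteration follows the same `q_value` helper, alternating exact policy evaluation with greedy improvement instead of the max-backup sweep.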
5 Frozen Lake Reinforcement Learning
Continuing with only the Stochastic-4×4-FrozenLake-v0 environment, you will now do full reinforcement learning using model-based and model-free methods! This problem is considered solved when the agent achieves an average score of at least 0.78 over 100 consecutive trials, so you can use this to check the performance of your agent.
(a) (coding) Implement Q-learning with ε-greedy exploration in model_free_learning.py. Return an array of the Q-values for each state-action pair.
(b) (coding) Implement SARSA with ε-greedy exploration in model_free_learning.py. Return an array of the Q-values for each state-action pair.
(c) (coding, written) Plot the running average score of the Q-learning agent for the first 1000 training episodes. The x-axis should be the episode number, and the y-axis should be the average score over all training episodes so far. Include the plot in your writeup.
(d) (coding, written) Now implement model-based learning in model_based_learning.py: after each episode, fit an MDP model to the experience from that episode, then run value iteration on your updated model of the environment to compute the current best policy. Explore using ε-greedy. Plot performance on the same axes as part (c). Compare and contrast Q-learning and model-based learning, including at least one benefit of each. For model-based learning, your agent does not always have to win the environment, but it should consistently score an average of 0.5 or higher.
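The tabular Q-learning update with ε-greedy exploration in (a), together with the running-average score tracked for the plot in (c), can be sketched generically. The `step_fn`/`reset_fn` environment interface below is an assumption for self-containment, not the starter code's API:

```python
import numpy as np

def q_learning(step_fn, reset_fn, nS, nA, num_episodes=1000,
               lr=0.1, gamma=0.99, eps=0.1, rng=None):
    """Tabular Q-learning with epsilon-greedy exploration.

    step_fn(s, a) -> (next_state, reward, done); reset_fn() -> start state.
    Returns the Q table and the running-average score per episode.
    """
    rng = rng or np.random.default_rng(0)
    Q = np.zeros((nS, nA))
    scores, running_avg = [], []
    for _ in range(num_episodes):
        s, done, total = reset_fn(), False, 0.0
        while not done:
            # Epsilon-greedy: random action with prob eps, else greedy.
            a = int(rng.integers(nA)) if rng.random() < eps else int(np.argmax(Q[s]))
            ns, r, done = step_fn(s, a)
            target = r + (0.0 if done else gamma * np.max(Q[ns]))
            Q[s, a] += lr * (target - Q[s, a])   # TD(0) update
            s, total = ns, total + r
        scores.append(total)
        running_avg.append(np.mean(scores))      # y-axis for the plot in (c)
    return Q, running_avg
```

SARSA differs only in the target: it bootstraps from the Q-value of the action actually taken next rather than the max. `running_avg` is exactly the curve to plot against the episode number.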
