Submission:
1. The entire code including the starter code and the files you have edited.
2. A writeup answering the questions in the assignment and showing the plots.
Include everything in a folder and submit a single zip file with the name <Roll No.>HW2.zip.
Introduction
In this homework you will implement deep Q-learning, following DeepMind's papers [1, 2], which learn to play Atari games from raw pixels. The purpose is to demonstrate the effectiveness of deep neural networks as well as some of the techniques used in practice to stabilize training and achieve better performance. We expect you to have some familiarity with PyTorch (which is why Deep Learning was kept as a prerequisite).
We are aware that not all of you are equipped with the compute resources to train deep RL algorithms on complex environments, and we want you to save your allotted GCP credits for your project. So, instead of using an Atari environment (which would have been fun!), we are providing a simpler test environment that you can easily run locally on a CPU.
Test Environment
We are providing you with this simple test environment for your code. You should be able to run your models on CPU in no more than a few minutes on the following environment:
• 4 states: 0,1,2,3
• 5 actions: 0,1,2,3,4. Action i, for 0 ≤ i ≤ 3, moves the agent to state i, while action 4 makes the agent stay in its current state.
• Rewards: Taking an action that lands in state i from state 0, 1, or 3 gives a reward R(i), where R(0) = 0.2, R(1) = −0.1, R(2) = 0.0, R(3) = −0.3. If the action is taken while in state 2, the rewards above are multiplied by −10. See Table 1 for the full transition and reward structure.
• One episode lasts 5 time steps (for a total of 5 actions) and always starts in state 0 (no rewards at the initial state).
State (s) Action (a) Next State (s′) Reward (R)
0 0 0 0.2
0 1 1 -0.1
0 2 2 0.0
0 3 3 -0.3
0 4 0 0.2
1 0 0 0.2
1 1 1 -0.1
1 2 2 0.0
1 3 3 -0.3
1 4 1 -0.1
2 0 0 -2.0
2 1 1 1.0
2 2 2 0.0
2 3 3 3.0
2 4 2 0.0
3 0 0 0.2
3 1 1 -0.1
3 2 2 0.0
3 3 3 -0.3
3 4 3 -0.3
Table 1: Transition table for the Test Environment
An example of a trajectory (or episode) in the test environment is shown in Figure 1, and the trajectory can be represented in terms of (s_t, a_t, R_t) as: s0 = 0, a0 = 1, R0 = −0.1, s1 = 1, a1 = 2, R1 = 0.0, s2 = 2, a2 = 4, R2 = 0.0, s3 = 2, a3 = 3, R3 = 3.0, s4 = 3, a4 = 0, R4 = 0.2, s5 = 0.
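The dynamics above can be sketched as a small Python class (the class name and interface are illustrative; the starter code structures its environment differently):

```python
# Reward for landing in state i, as defined above.
R = [0.2, -0.1, 0.0, -0.3]

class TestEnv:
    """Minimal sketch of the 4-state test environment (hypothetical interface)."""

    def __init__(self):
        self.state = 0
        self.t = 0

    def reset(self):
        self.state, self.t = 0, 0
        return self.state

    def step(self, action):
        # Action i (0..3) moves to state i; action 4 stays in the current state.
        next_state = action if action < 4 else self.state
        reward = R[next_state]
        if self.state == 2:          # acting from state 2 multiplies rewards by -10
            reward *= -10
        self.state = next_state
        self.t += 1
        done = self.t >= 5           # an episode lasts exactly 5 time steps
        return next_state, reward, done

env = TestEnv()
env.reset()
rewards = [env.step(a)[1] for a in [1, 2, 4, 3, 0]]
# reproduces the example trajectory: rewards [-0.1, 0.0, 0.0, 3.0, 0.2]
```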
Figure 1: Example of a trajectory in the Test Environment
1 Tabular Q-Learning (5 pts)
If the state and action spaces are sufficiently small, we can simply maintain a table containing the value of Q(s,a), an estimate of Q*(s,a), for every (s,a) pair. In this tabular setting, given an experience sample (s, a, r, s′), the update rule is

Q(s,a) ← Q(s,a) + α (r + γ max_{a′∈A} Q(s′,a′) − Q(s,a))        (1)

where α > 0 is the learning rate and γ ∈ [0,1) is the discount factor.
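The tabular update can be sketched in a few lines of plain Python (dictionary-backed table; hyperparameter values are illustrative, and the starter code uses its own data structures):

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9   # learning rate and discount factor (illustrative)
N_ACTIONS = 5

Q = defaultdict(float)    # Q[(s, a)], implicitly initialized to 0

def update(s, a, r, s_next):
    # Tabular Q-learning update, equation (1):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = r + GAMMA * max(Q[(s_next, b)] for b in range(N_ACTIONS))
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```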
ε-Greedy Exploration Strategy. For exploration, we use an ε-greedy strategy: with probability ε, an action is chosen uniformly at random from A, and with probability 1 − ε, the greedy action (i.e., argmax_{a∈A} Q(s,a)) is chosen.
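The strategy can be sketched as follows (a standalone helper for illustration; the starter code's version has its own signature):

```python
import random

def epsilon_greedy(q_row, eps):
    """Pick an action given q_row, the list of Q-values for one state:
    uniformly random with probability eps, greedy otherwise."""
    if random.random() < eps:
        return random.randrange(len(q_row))
    # argmax over actions; ties resolve to the lowest-index action
    return max(range(len(q_row)), key=lambda a: q_row[a])
```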
Implement the get_action and update functions in q1_schedule.py. Test your implementation by running python q1_schedule.py.
2 Q-Learning with Function Approximation (15 points)
When the state space is too large to maintain a table, we instead approximate the Q function with a parameterized function Qθ(s,a) and apply the update directly to the parameters θ:

θ ← θ + α (r + γ max_{a′∈A} Qθ(s′,a′) − Qθ(s,a)) ∇θ Qθ(s,a)        (2)

where (s, a, r, s′) is a transition from the MDP.
To improve the data efficiency and stability of the training process, DeepMind’s paper [1] employed two strategies:
• A replay buffer to store transitions observed during training. When updating the Q function, transitions are drawn from this replay buffer. This improves data efficiency by allowing each transition to be used in multiple updates.
• A target network with parameters θ̄ to compute the target value of the next state, max_{a′} Qθ̄(s′,a′). The update becomes

θ ← θ + α (r + γ max_{a′∈A} Qθ̄(s′,a′) − Qθ(s,a)) ∇θ Qθ(s,a)        (3)
Updates of the form (3) applied to transitions sampled from a replay buffer D can be interpreted as performing stochastic gradient descent on the following objective function:
L_DQN(θ) = E_{(s,a,r,s′)∼D} [ (r + γ max_{a′∈A} Qθ̄(s′,a′) − Qθ(s,a))² ]        (4)
Note that this objective is also a function of both the replay buffer D and the target network Qθ̄. The target network parameters θ̄ are held fixed and not updated by SGD; instead, periodically — every C steps — we synchronize the target network by copying θ̄ ← θ.
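The pieces above — replay buffer, target network, and objective (4) — can be sketched without PyTorch using a tabular stand-in for Qθ (a 2-D weight array indexed by state and action; all names here are illustrative, not the starter code's API):

```python
import random
from collections import deque

import numpy as np

GAMMA = 0.9
buffer = deque(maxlen=10_000)    # replay buffer D (old transitions fall off the end)

# Tabular stand-in for Q_theta: Q(s, a) = w[s, a]
w = np.zeros((4, 5))             # theta
w_bar = w.copy()                 # theta_bar (target network), held fixed between syncs

def dqn_loss(batch):
    """Monte-Carlo estimate of objective (4) on a sampled minibatch."""
    loss = 0.0
    for s, a, r, s_next in batch:
        target = r + GAMMA * w_bar[s_next].max()   # target uses theta_bar, not theta
        loss += (target - w[s, a]) ** 2
    return loss / len(batch)

# Store transitions observed during training, then sample a minibatch from D,
# so each transition can be reused in multiple updates.
buffer.extend([(0, 1, -0.1, 1), (1, 2, 0.0, 2), (2, 3, 3.0, 3)])
batch = random.sample(buffer, 2)
loss = dqn_loss(batch)

# Every C steps, synchronize: theta_bar <- theta
w_bar = w.copy()
```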
We will now examine some implementation details.
2.1 Linear Approximation (10 pts)
We will now implement linear approximation in PyTorch. This question will set up the pipeline for the next part of the assignment. You'll need to implement the following functions in q2_1_linear_torch.py (please read through the file first):
• initialize_models
• get_q_values
• update_target
• calc_loss
• add_optimizer
Test your code by running python q2_1_linear_torch.py locally on CPU. This will run linear approximation with PyTorch on the test environment from Problem 1. Running this implementation should only take a minute.
Do you reach the optimal achievable reward on the test environment? Attach the plot scores.png from the directory results/q2_1_linear to your writeup.
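Conceptually, the linear model maps a one-hot encoding of the state to one Q-value per action. A numpy sketch of that mapping (the assignment's version operates on PyTorch tensors via a torch.nn.Linear layer; the names here are illustrative):

```python
import numpy as np

n_states, n_actions = 4, 5
rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(n_actions, n_states))   # weight matrix (theta)
b = np.zeros(n_actions)                             # bias, one per action

def linear_q_values(state):
    """Q(s, a) for all actions a: a single linear layer on a one-hot state."""
    x = np.zeros(n_states)
    x[state] = 1.0
    return W @ x + b          # shape (n_actions,)
```

With a one-hot input, the model reduces to reading out column s of W, so linear approximation on this environment is just a learned table.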
2.2 Implementing DeepMind’s DQN (5 pts)
Implement the deep Q-network as described in [1] by implementing initialize_models and get_q_values in q2_2_nature_torch.py. The rest of the code inherits from what you wrote for linear approximation.
Test your implementation locally on CPU on the test environment by running python q2_2_nature_torch.py.
Running this implementation should only take a minute or two.
Attach the plot of scores, scores.png, from the directory results/q2_2_nature to your writeup. Compare this model with linear approximation. How do the final performances compare? How about the training time?
References
[1] Volodymyr Mnih et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540 (2015), pp. 529–533.
[2] Volodymyr Mnih et al. “Playing Atari With Deep Reinforcement Learning”. In: NIPS Deep Learning Workshop. 2013.