Homework #2
The λ-Return
Description
Given an MDP and a particular time step $t$ of a task (continuing or episodic), the $\lambda$-return, $G_t^\lambda$, $0 \le \lambda \le 1$, is a weighted combination of the $n$-step returns $G_{t:t+n}$, $n \ge 1$:

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}$$
While the $n$-step return $G_{t:t+n}$ can be viewed as the target of an $n$-step TD update rule, the $\lambda$-return can be viewed as the target of the update rule for the TD($\lambda$) prediction algorithm, which you will become familiar with in Project 1.
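To make the definition concrete, here is a minimal sketch (not part of the assignment template) of computing the $\lambda$-return for an episodic task from a list of $n$-step returns; the function name lambda_return and the toy inputs are illustrative assumptions. For an episode terminating at time $T$, every $n$-step return with $t + n \ge T$ equals the full Monte Carlo return $G_t$, so the infinite sum in the definition collapses to a finite weighted sum plus a tail term $\lambda^{T-t-1} G_t$.

import numpy as np

def lambda_return(n_step_returns, lam):
    """Compute the lambda-return from [G_{t:t+1}, ..., G_{t:T}] for an
    episodic task. The last entry is the full Monte Carlo return G_t;
    the infinite tail of the definition collapses onto it because all
    n-step returns past termination equal G_t."""
    G = np.asarray(n_step_returns, dtype=float)
    horizon = len(G)                                   # T - t
    weights = (1.0 - lam) * lam ** np.arange(horizon - 1)
    return weights @ G[:-1] + lam ** (horizon - 1) * G[-1]

# Sanity checks: lam = 0 recovers the 1-step TD target,
# and lam = 1 recovers the Monte Carlo return.
print(lambda_return([2.0, 3.0, 5.0], lam=0.0))   # 2.0
print(lambda_return([2.0, 3.0, 5.0], lam=1.0))   # 5.0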
Consider the Markov reward process described by the following state diagram and assume the agent is in state 0 at time $t$ (also assume the discount rate is $\gamma = 1$). A Markov reward process can be thought of as an MDP with only one action possible from each state (denoted as action 0 in the figure below).
Procedure
You will implement your solution using the solve() method in the code below. You will be given p, the probability of transitioning from state 0 to state 1; V, the estimate of the value function at time $t$, represented as a vector $[V(0), V(1), V(2), V(3), V(4), V(5), V(6)]$; and rewards, a vector of the rewards $[r_0, r_1, r_2, r_3, r_4, r_5, r_6]$ corresponding to the MDP.
Your return value should be a value of $\lambda$, strictly less than 1, such that the expected value of the $\lambda$-return equals the expected Monte Carlo return at time $t$.
Your answer must be correct to 3 decimal places, truncated (e.g. 3.14159265 becomes 3.141).
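For reference, a minimal skeleton of the expected interface is sketched below. The test cases confirm that a TDAgent class with a solve() method is expected; everything in the body, including the root-finding comment and the truncation example, is an illustrative assumption, and the actual solution logic is left to you.

import math

class TDAgent(object):
    def solve(self, p, V, rewards):
        """Return a lambda strictly less than 1 such that the expected
        lambda-return from state 0 at time t equals the expected
        Monte Carlo return, truncated to 3 decimal places."""
        # One possible approach (sketch, not the required one): express the
        # expected lambda-return as a function f(lam) of p, V, and rewards,
        # then find a root of f(lam) - G_MC on [0, 1).
        raise NotImplementedError  # your implementation goes here

Note that truncation differs from rounding: math.trunc(3.14159265 * 1000) / 1000 gives 3.141, whereas round(3.14159265, 3) would give 3.142.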
Resources
The concepts explored in this homework are covered by:
Lecture Lesson 3: TD and Friends
Chapter 7 (7.1 $n$-step TD Prediction) and Chapter 12 (12.1 The $\lambda$-return) of
http://incompleteideas.net/book/the-book-2nd.html
‘Learning to Predict by the Method of Temporal Differences’, R. Sutton, 1988
Submission
Submit your finished notebook on Gradescope. Your grade is based on a set of hidden test cases. You will have unlimited submissions – only the last score is kept. Use the template below to implement your code. We have also provided some test cases for you. If your code passes the given test cases, it will run (though possibly not pass all the tests) on Gradescope.
Gradescope is using Python 3.6.x and numpy==1.18.0. Besides numpy, you can use any core library (i.e., anything in the Python standard library); no other library can be used. Also, make sure the name of your notebook matches the name of the provided notebook.
Gradescope times out after 10 minutes.
Test cases
We have provided some test cases for you to help verify your implementation.
In [ ]: ## DO NOT MODIFY THIS CODE. This code will ensure that your submission
## will work properly with the autograder
import unittest
import numpy as np

class TestTDNotebook(unittest.TestCase):
    def test_case_1(self):
        agent = TDAgent()
        np.testing.assert_almost_equal(
            agent.solve(p=0.81,
                        V=[0.0, 4.0, 25.7, 0.0, 20.1, 12.2, 0.0],
                        rewards=[7.9, -5.1, 2.5, -7.2, 9.0, 0.0, 1.6]),
            0.622, decimal=3)

    def test_case_2(self):
        agent = TDAgent()
        np.testing.assert_almost_equal(
            agent.solve(p=0.22,
                        V=[12.3, -5.2, 0.0, 25.4, 10.6, 9.2, 0.0],
                        rewards=[-2.4, 0.8, 4.0, 2.5, 8.6, -6.4, 6.1]),
            0.519, decimal=3)

    def test_case_3(self):
        agent = TDAgent()
        np.testing.assert_almost_equal(
            agent.solve(p=0.64,
                        V=[-6.5, 4.9, 7.8, -2.3, 25.5, -10.2, 0.0],
                        rewards=[-2.4, 9.6, -7.8, 0.1, 3.4, -2.1, 7.9]),
            0.207, decimal=3)

unittest.main(argv=[''], verbosity=2, exit=False)