Problem 1: Value Iteration
a. [3 points] Give the value of 𝑽𝒐𝒑𝒕(𝒔) for each state s after 0, 1, and 2 iterations of value iteration. Iteration 0 just initializes all the values of V to 0. Terminal states do not have any optimal policies and take on a value of 0.
Answer:
I will use the Bellman equation to solve this problem. The Bellman optimality equation is defined as follows:
V_opt(s) = max_{a ∈ Actions(s)} Σ_{s′} T(s, a, s′) [Reward(s, a, s′) + γ V_opt(s′)]
The basic structure of problem a. is: if an action transitions you into state -2, you receive a reward of 20; if it transitions you into state 2, you receive a reward of 100; otherwise, the reward is -5. Assume the discount factor γ is 1.
State:         -2     -1      0      1     2
Iteration 0:    0      0      0      0     0
Iteration 1:    0   15.0   -5.0   26.5     0
Iteration 2:    0   14.0  13.45   23.0     0
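The sweeps above can be reproduced with a generic value-iteration sketch. This is a hypothetical minimal implementation, not the assignment's submission.py; the MDP encoding (a dict mapping (state, action) to (next state, probability, reward) triples) and the two-state example at the bottom are my own assumptions for illustration.

```python
# Minimal value-iteration sketch. Iteration 0 initializes all values to 0,
# and terminal states (no actions) keep value 0, as in the problem statement.

def value_iteration(states, actions, T, gamma, iters):
    """Run `iters` sweeps of the Bellman optimality update.

    T: dict mapping (s, a) -> list of (next_state, prob, reward) triples.
    """
    V = {s: 0.0 for s in states}          # iteration 0: all zeros
    for _ in range(iters):
        V_new = {}
        for s in states:
            acts = actions(s)
            if not acts:                  # terminal state: value stays 0
                V_new[s] = 0.0
                continue
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in T[(s, a)])
                for a in acts
            )
        V = V_new
    return V

# Hypothetical two-state example: from state 0 you may 'stay' (reward 1)
# or 'quit' into terminal state 1 (reward 5).
states = [0, 1]
actions = lambda s: ['stay', 'quit'] if s == 0 else []
T = {(0, 'stay'): [(0, 1.0, 1.0)], (0, 'quit'): [(1, 1.0, 5.0)]}
print(value_iteration(states, actions, T, gamma=1.0, iters=2))  # {0: 6.0, 1: 0.0}
```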
b. [3 points] What is the resulting optimal policy 𝝅𝒐𝒑𝒕 for all non-terminal states?
This is a policy extraction problem: once the optimal values are known, the optimal policy is the greedy policy with respect to them.
Answer:
V_π(s) = Σ_{s′} T(s, π(s), s′) [Reward(s, π(s), s′) + γ V_π(s′)]
∀s ∈ S, π_opt(s) = argmax_{a ∈ Actions(s)} Q_opt(s, a) = argmax_{a ∈ Actions(s)} Σ_{s′} T(s, a, s′) [Reward(s, a, s′) + γ V_opt(s′)]
Finally, it will converge to:
State:   -2   -1    0    1    2
π_opt:    0   -1   +1   +1    0
An entry of 0 marks a terminal state, which has no action; -1 means move left and +1 means move right.
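The greedy-extraction equation above can be sketched in code. Again, this is a hypothetical helper rather than the graded code; the MDP encoding and the toy two-state example are assumed for illustration.

```python
# Sketch of policy extraction from a converged value table: for each
# non-terminal state, pick the action maximizing the one-step Q-value.

def extract_policy(states, actions, T, V, gamma):
    """T: dict (s, a) -> list of (next_state, prob, reward) triples."""
    pi = {}
    for s in states:
        acts = actions(s)
        if not acts:
            pi[s] = None                  # terminal state: no action
            continue
        pi[s] = max(
            acts,
            key=lambda a: sum(p * (r + gamma * V[s2]) for s2, p, r in T[(s, a)]),
        )
    return pi

# Hypothetical two-state MDP with precomputed optimal values.
states = [0, 1]
actions = lambda s: ['stay', 'quit'] if s == 0 else []
T = {(0, 'stay'): [(0, 1.0, 1.0)], (0, 'quit'): [(1, 1.0, 5.0)]}
V = {0: 6.0, 1: 0.0}
print(extract_policy(states, actions, T, V, gamma=1.0))  # {0: 'stay', 1: None}
```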
Problem 2: Transforming MDPs
a. [3 points] If we add noise to the transitions of an MDP, does the optimal value always get worse? Specifically, consider an MDP with reward function 𝐑𝐞𝐰𝐚𝐫𝐝(𝐬, 𝐚, 𝐬′), states States, and transition function 𝐓(𝐬, 𝐚, 𝐬′). Let's define a new MDP which is identical to the original, except that on each action, with probability 1/2, we randomly jump to one of the states that we could have reached before with positive probability. Let V1 be the optimal value function for the original MDP, and V2 the optimal value function for the modified MDP. Is it always the case that 𝐕𝟏(𝐒𝐬𝐭𝐚𝐫𝐭) ≥ 𝐕𝟐(𝐒𝐬𝐭𝐚𝐫𝐭)?
T′(s, a, s′) = (1/2) T(s, a, s′) + (1/2) · 1 / |{s′′ : T(s, a, s′′) > 0}|
Answer:
No, not always; there are situations where V1(S_start) < V2(S_start). Suppose we have the simple MDP below:
State_C ← State_A → State_B
We have three states: A, B, and C. A is the start state, and B and C are both terminal states. Assume there is only one action a: the original transition function is T(A, a, B) = 0.1 and T(A, a, C) = 0.9, the reward function is R(A, a, B) = 100 and R(A, a, C) = 1, and the discount factor γ = 1. Because there is only one action, the value function is:
V∗(s) = Σ_{s′} T(s, a, s′) [R(s, a, s′) + V∗(s′)]
V1(S_start) = 0.9 · (1 + 0) + 0.1 · (100 + 0) = 10.9
Adding the noise gives T′(A, a, C) = 0.5 · 0.9 + 0.5/2 = 0.7 and T′(A, a, B) = 0.5 · 0.1 + 0.5/2 = 0.3, so
V2(S_start) = 0.7 · (1 + 0) + 0.3 · (100 + 0) = 30.7
Hence V1(S_start) < V2(S_start).
I also wrote the code for this situation in CounterexampleMDP.
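The arithmetic of the counterexample can be checked in a few lines. The dict encoding and the names T1, T2 here are just illustrative, not the CounterexampleMDP class itself.

```python
# Numeric check of the counterexample: mixing the original transition with a
# uniform jump over the reachable successors raises the start state's value.

T1 = {'B': 0.1, 'C': 0.9}                       # original transitions from A
R = {'B': 100.0, 'C': 1.0}                      # rewards on reaching B / C
n = len([s for s, p in T1.items() if p > 0])    # reachable successors: 2
T2 = {s: 0.5 * p + 0.5 / n for s, p in T1.items()}  # noisy transitions

V1 = sum(T1[s] * R[s] for s in T1)              # B and C are terminal (value 0)
V2 = sum(T2[s] * R[s] for s in T2)
print(round(V1, 4), round(V2, 4))               # 10.9 30.7
```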
b. [3 points] Suppose we have an acyclic MDP (you will not visit a state a second time in this process). We could run value iteration, which would require multiple iterations. Briefly explain a more efficient algorithm that only requires one pass over all the (𝐬, 𝐚, 𝐬′) triples.
Answer:
For acyclic MDPs (for example, the MDP for Blackjack), one iteration suffices provided we process the states in reverse topological order of the graph. This is the same setup as dynamic programming for search problems; only the equations are different. Concretely: topologically sort the state graph from the start state to the terminal states, then compute each state's value in reverse topological order, so every update uses only successor values that are already final. A single pass over all the (s, a, s′) triples then yields the exact optimal values.
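The reverse-topological-order method can be sketched as follows. The graph encoding and the tiny A → B → C chain are hypothetical, not the assignment's code.

```python
# One-pass value computation for an acyclic MDP: processing states in
# reverse topological order guarantees every successor's value is final
# before it is used, so no repeated sweeps are needed.

def acyclic_values(topo_order, actions, T, gamma):
    """topo_order: states listed so every edge goes left-to-right.
    T: dict (s, a) -> list of (next_state, prob, reward) triples."""
    V = {}
    for s in reversed(topo_order):        # successors are already computed
        acts = actions(s)
        V[s] = 0.0 if not acts else max(
            sum(p * (r + gamma * V[s2]) for s2, p, r in T[(s, a)])
            for a in acts
        )
    return V

# Tiny DAG: A -> B -> C, with C terminal.
topo = ['A', 'B', 'C']
actions = lambda s: [] if s == 'C' else ['go']
T = {('A', 'go'): [('B', 1.0, 2.0)], ('B', 'go'): [('C', 1.0, 3.0)]}
print(acyclic_values(topo, actions, T, gamma=1.0))  # {'C': 0.0, 'B': 3.0, 'A': 5.0}
```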
c. [3 points] Suppose we have an MDP with states States and a discount factor γ < 1, but we have an MDP solver that can only solve MDPs with discount 1. How can we leverage the MDP solver to solve the original MDP?
Answer:
Let us define a new MDP with states States′ = States ∪ {o}, where o is a new absorbing state. We use the same actions (Actions′(s) = Actions(s)) but set the discount γ′ = 1. The job is to define new transition probabilities T′(s, a, s′) and rewards Reward′(s, a, s′) in terms of the old MDP such that the optimal values V(s) for all s ∈ States are equal under the original MDP and the new MDP. Write V1 for the optimal values of the original MDP and V2 for those of the new MDP.
Original MDP:
V1(s) = max_a Σ_{s′} T(s, a, s′) [R(s, a, s′) + γ V1(s′)],   γ < 1
New MDP:
V2(s) = max_a Σ_{s′} T′(s, a, s′) [R′(s, a, s′) + V2(s′)]
We want V1(s) = V2(s) for all s ∈ States.
First, we define the new transition probabilities and rewards on States′ = States ∪ {o} with Actions′(s) = Actions(s):
T′(s, a, s′) = γ T(s, a, s′)           for s′ ∈ States
R′(s, a, s′) = R(s, a, s′) / γ         for s′ ∈ States
T′(s, a, o) = 1 − γ
R′(s, a, o) = 0
T′(o, a, o) = 1
R′(o, a, o) = 0
Note that the new transitions still sum to 1: Σ_{s′} γ T(s, a, s′) + (1 − γ) = 1.
Then, we check that the Bellman backup is preserved. For any state s1 ∈ States:
V2(s1) = max_a { T′(s1, a, o) [R′(s1, a, o) + V2(o)] + Σ_{s′} T′(s1, a, s′) [R′(s1, a, s′) + V2(s′)] }
       = max_a { 0 + Σ_{s′} γ T(s1, a, s′) [R(s1, a, s′)/γ + V2(s′)] }
       = max_a Σ_{s′} { T(s1, a, s′) R(s1, a, s′) + γ T(s1, a, s′) V2(s′) }
       = max_a Σ_{s′} T(s1, a, s′) [R(s1, a, s′) + γ V2(s′)]
This is exactly the Bellman equation for V1(s1), so V1(s) = V2(s) at every iteration, and the discount-1 solver returns the original optimal values.
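The construction can also be written out as a small transformation function. The MDP encoding here (dicts keyed by (s, a, s′) plus an actions callable) is an assumption for illustration, not the solver's real interface.

```python
# Sketch of the discount-removal construction: rescale transitions by gamma,
# divide rewards by gamma, and send the leftover 1 - gamma probability mass
# to an absorbing zero-reward state `o`.

def remove_discount(states, actions, T, R, gamma):
    """T: dict (s, a, s2) -> prob; R: dict (s, a, s2) -> reward.
    Returns (states2, T2, R2) for an equivalent discount-1 MDP."""
    o = 'o'                                # new absorbing state
    T2, R2 = {}, {}
    for s in states:
        for a in actions(s):
            for s2 in states:
                p = T.get((s, a, s2), 0.0)
                if p > 0:
                    T2[(s, a, s2)] = gamma * p
                    R2[(s, a, s2)] = R[(s, a, s2)] / gamma
            T2[(s, a, o)] = 1.0 - gamma    # jump to o with the leftover mass
            R2[(s, a, o)] = 0.0
    T2[(o, 'stay', o)] = 1.0               # o loops forever with zero reward
    R2[(o, 'stay', o)] = 0.0
    return states + [o], T2, R2

# Hypothetical one-transition MDP: A -> B with reward 4 and gamma = 0.5.
states = ['A', 'B']
actions = lambda s: ['go'] if s == 'A' else []
T = {('A', 'go', 'B'): 1.0}
R = {('A', 'go', 'B'): 4.0}
states2, T2, R2 = remove_discount(states, actions, T, R, gamma=0.5)
print(T2[('A', 'go', 'B')], R2[('A', 'go', 'B')], T2[('A', 'go', 'o')])  # 0.5 8.0 0.5
```

Each backup term γT · (R/γ + V) expands to T·R + γT·V, which is exactly the original discounted Bellman backup.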
Problem 3: Peeking Blackjack
a. [10 points] Implement the game of Blackjack as an MDP by filling out the
𝐬𝐮𝐜𝐜𝐀𝐧𝐝𝐏𝐫𝐨𝐛𝐑𝐞𝐰𝐚𝐫𝐝() function of class BlackjackMDP.
I have finished the code in submission.py. Please check. Thanks!
b. [4 points] Let’s say you’re running a casino, and you’re trying to design a deck to make people peek a lot. Assuming a fixed threshold of 20, and a peek cost of 1, design a deck where for at least 10% of states, the optimal policy is to peek. Fill out the function 𝐩𝐞𝐞𝐤𝐢𝐧𝐠𝐌𝐃𝐏() to return an instance of BlackjackMDP where the optimal action is to peek in at least 10% of states.
I have finished the code in submission.py. Please check. Thanks!
Problem 4: Learning to Play Blackjack
a. [8 points]
I have finished the code in submission.py. Please check. Thanks!
b. [4 points] Call simulate using your algorithm and the 𝐢𝐝𝐞𝐧𝐭𝐢𝐭𝐲𝐅𝐞𝐚𝐭𝐮𝐫𝐞𝐄𝐱𝐭𝐫𝐚𝐜𝐭𝐨𝐫() on the MDP smallMDP, with 30000 trials. Compare the policy learned in this case to the policy learned by value iteration. Don’t forget to set the explorationProb of your Qlearning algorithm to 0 after learning the policy. How do the two policies compare (i.e., for how many states do they produce a different action)? Now run 𝐬𝐢𝐦𝐮𝐥𝐚𝐭𝐞() on largeMDP. How does the policy learned in this case compare to the policy learned by value iteration? What went wrong?
Answer:
c. [5 points]
I have finished the code in submission.py. Please check. Thanks!
d. [4 points] Now let's explore the way in which value iteration responds to a change in the rules of the MDP. Run value iteration on originalMDP to compute an optimal policy. Then apply your policy to newThresholdMDP by calling simulate with FixedRLAlgorithm, instantiated using your computed policy. What reward do you get? What happens if you run Q-learning on newThresholdMDP instead? Explain.
Answer:
Running value iteration on originalMDP to compute an optimal policy and applying that policy to newThresholdMDP, we obtain an average reward of 6.836 over 30,000 simulated games. If we instead run Q-learning on newThresholdMDP, we obtain an average reward of almost 9.5 over the same 30,000 games, so Q-learning is clearly better. Inspecting the individual decisions of both methods, I found that the value-iteration policy stops taking cards too early, because in its 'cognition' the threshold is still 10. Q-learning earns higher rewards because, as a reinforcement learning method, it keeps learning from interaction and adapts to the new environment.
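The adaptation comes from the Q-learning update rule, which can be sketched minimally. This is a generic tabular sketch, not the submission.py implementation; the state and action names in the example are made up.

```python
# Tabular Q-learning: after each observed (s, a, r, s') transition, move
# Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
# A fixed policy never does this update, so it cannot adapt to rule changes.

import random
from collections import defaultdict

def q_update(Q, s, a, r, s2, next_actions, eta, gamma):
    """One Q-learning step with learning rate eta."""
    target = r + gamma * max((Q[(s2, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += eta * (target - Q[(s, a)])

def epsilon_greedy(Q, s, acts, eps):
    """Explore with probability eps, otherwise act greedily w.r.t. Q
    (set eps = 0 after learning, like explorationProb in the assignment)."""
    if random.random() < eps:
        return random.choice(acts)
    return max(acts, key=lambda a: Q[(s, a)])

Q = defaultdict(float)
# Hypothetical transition: action 'take' in 'start' yields reward 5,
# landing in a terminal state with no further actions.
q_update(Q, 'start', 'take', 5.0, 'end', [], eta=0.5, gamma=1.0)
print(Q[('start', 'take')])   # 2.5
```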



