1 Nash Equilibria
Compute all (mixed and pure) Nash equilibria for each of the following normal-form games:
Sample Answer:
(T,L),(T,R),((1/3,2/3),(3/4,1/4))
In particular, if there are no mixed equilibria, write: (T,L),(T,R)
If one player is indifferent among all mixed strategies, write:
((any),(P,1-P))
2 Strategies
Consider the following game in matrix form with two players. Payoffs for the row player Izzy are indicated first in each cell, and payoffs for the column player Jack are second.
       X        Y        Z
S    5, 2    10, 6    25, 10
T   10, 12    5, 6     0, 0
(a) This game has two pure strategy Nash equilibria. What are they (justify your answer)? Of the two pure equilibria, which would Izzy prefer? Which would Jack prefer?
Sample Answer:
(S,X),(S,Y)
Izzy(S,X)
Jack(S,Y)
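For part (a), the pure equilibria can be checked mechanically, cell by cell, for mutual best responses. Below is a minimal Python sketch of that check; the payoffs are transcribed from the matrix above, and the dictionary and variable names are my own, not part of the assignment.

```python
# Brute-force check for pure Nash equilibria in the matrix above
# (payoffs transcribed from the table; names are mine).
izzy = {("S", "X"): 5,  ("S", "Y"): 10, ("S", "Z"): 25,
        ("T", "X"): 10, ("T", "Y"): 5,  ("T", "Z"): 0}
jack = {("S", "X"): 2,  ("S", "Y"): 6,  ("S", "Z"): 10,
        ("T", "X"): 12, ("T", "Y"): 6,  ("T", "Z"): 0}
rows, cols = ["S", "T"], ["X", "Y", "Z"]

for r in rows:
    for c in cols:
        # (r, c) is a pure equilibrium iff neither player gains by deviating unilaterally
        izzy_best = izzy[(r, c)] == max(izzy[(r2, c)] for r2 in rows)
        jack_best = jack[(r, c)] == max(jack[(r, c2)] for c2 in cols)
        if izzy_best and jack_best:
            print((r, c), "is a pure-strategy Nash equilibrium")
```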
(b) Suppose Izzy plays a strictly mixed strategy, where both S and T are chosen with positive probability. With what probabilities should Izzy choose S and T so that each of Jack’s three pure strategies is a best response to Izzy’s mixed strategy?
Sample Answer:
(1/2,1/2)
Note that the former is the probability of S and the latter is the probability of T.
(c) Suppose Jack wants to play a mixed strategy in which he selects X with probability 0.7. With what probabilities should Jack play actions Y and Z so that both of Izzy’s pure strategies are best responses to Jack’s mixed strategy?
Sample Answer:
(1/4,3/4)
Note that the former is the probability of Y and the latter is the probability of Z.
(d) Based on your responses above, describe a mixed strategy equilibrium for this game in which both Jack and Izzy play each of their actions (pure strategies) with positive probability (you can rely on the quantities computed in the prior parts of this question).
Sample Answer:
((1/4,3/4),(7/10,3/20,3/20))
Note that the first tuple is Izzy’s strategy and the second is Jack’s. A sketch of the underlying indifference computations follows below.
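For parts (b)–(d), each probability comes from an indifference condition: the mixing player’s probabilities are chosen so that the opponent’s candidate pure strategies yield equal expected payoff. The Python/NumPy sketch below only automates that linear algebra; the helper linear_root and all variable names are my own, so verify the resulting probabilities by hand.

```python
import numpy as np

izzy = np.array([[5.0, 10.0, 25.0],    # row S vs columns X, Y, Z
                 [10.0, 5.0, 0.0]])    # row T
jack = np.array([[2.0, 6.0, 10.0],
                 [12.0, 6.0, 0.0]])

def linear_root(f, x0, x1):
    """Root of a function that is linear in x, recovered from two samples."""
    y0, y1 = f(x0), f(x1)
    return x0 - y0 * (x1 - x0) / (y1 - y0)

# (b) p = P(S) equalizing Jack's expected payoffs for X and Z (his payoff for Y
# is constant in p here; in general every pair of columns must be checked).
def jack_gap(p):
    payoffs = p * jack[0] + (1 - p) * jack[1]   # Jack's expected payoff per column
    return payoffs[0] - payoffs[2]
p_S = linear_root(jack_gap, 0.0, 1.0)

# (c) With P(X) = 0.7 fixed, q = P(Y) equalizing Izzy's payoffs for S and T.
def izzy_gap(q):
    col_mix = np.array([0.7, q, 0.3 - q])
    return izzy[0] @ col_mix - izzy[1] @ col_mix
q_Y = linear_root(izzy_gap, 0.0, 0.3)

# (d) Combine the two mixes into one strategy profile.
print("Izzy:", (p_S, 1 - p_S), " Jack:", (0.7, q_Y, 0.3 - q_Y))
```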
(e) Suppose we swap two of Izzy’s payoffs in this matrix; in other words, we replace one of his payoffs r in the matrix with another of his payoffs t from the matrix, and replace t with r. This can make one of his strategies dominant. What swap should we make, and which strategy becomes dominant?
Sample Answer:
(S,X),(S,Y),S
Note: The two tuples identify the cells whose payoffs for Izzy you want to exchange; please write them in dictionary order, e.g. (S,Y) before (T,X). The final S is the strategy that becomes dominant after the payoff exchange.
3 Solving MDPs
Consider the gridworld MDP for which Left and Right actions are 100% successful.
Specifically, the available actions in each state are to move to the neighboring grid squares. From state a, there is also an exit action available, which results in going to the terminal state and collecting a reward of 10. Similarly, in state e, the reward for the exit action is 1. Exit actions are successful 100% of the time.
Let the discount factor γ = 1. Write the following quantities in one line.
V_0(d), V_1(d), V_2(d), V_3(d), V_4(d), V_5(d)
Sample Answer:
0,0,0,0,1,10
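The V_k(d) values can be generated directly by running value iteration. The sketch below assumes the usual layout for this gridworld: five states a through e in a row (a leftmost), deterministic Left/Right moves with zero reward, and only the two exit rewards described above. The state encoding is my reconstruction, so check it against the actual figure.

```python
# Value iteration on the assumed 5-state chain a..e (gamma = 1); prints V_k(d).
GAMMA = 1.0
STATES = ["a", "b", "c", "d", "e"]

def successors(s):
    """(reward, next_state) pairs for each available action; None = terminal."""
    i = STATES.index(s)
    moves = []
    if i > 0:
        moves.append((0.0, STATES[i - 1]))   # Left
    if i < len(STATES) - 1:
        moves.append((0.0, STATES[i + 1]))   # Right
    if s == "a":
        moves.append((10.0, None))           # Exit from a
    if s == "e":
        moves.append((1.0, None))            # Exit from e
    return moves

V = {s: 0.0 for s in STATES}
for k in range(6):
    print(f"V_{k}(d) =", V["d"])
    V = {s: max(r + (GAMMA * V[ns] if ns is not None else 0.0)
                for r, ns in successors(s))
         for s in STATES}
```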
4 Value Iteration Convergence Values
Consider the gridworld where Left and Right actions are successful 100% of the time.
Specifically, the available actions in each state are to move to the neighboring grid squares. From state a, there is also an exit action available, which results in going to the terminal state and collecting a reward of 10. Similarly, in state e, the reward for the exit action is 1. Exit actions are successful 100% of the time.
Let the discount factor γ = 0.2. Fill in the following quantities, writing them in one line.
Sample Answer:
0,0,0,0,1
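With γ = 0.2 the quantities asked for are the values at convergence. The self-contained variant below makes the same layout assumption as the previous sketch, with states a–e indexed 0–4, and simply iterates until the values stop changing.

```python
# Value iteration to convergence on the assumed chain, gamma = 0.2.
GAMMA = 0.2
EXIT = {0: 10.0, 4: 1.0}      # exit rewards at a (index 0) and e (index 4)

V = [0.0] * 5
while True:
    newV = []
    for i in range(5):
        options = [GAMMA * V[j] for j in (i - 1, i + 1) if 0 <= j < 5]  # Left/Right
        if i in EXIT:
            options.append(EXIT[i])                                     # Exit
        newV.append(max(options))
    if max(abs(a - b) for a, b in zip(newV, V)) < 1e-12:
        break
    V = newV
print(V)   # converged values for a, b, c, d, e
```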
5 Value Iteration Properties
Which of the following are true about value iteration? We assume the MDP has a finite number of actions and states, and that the discount factor satisfies 0 < γ < 1.
A. Value iteration is guaranteed to converge.
B. Value iteration will converge to the same vector of values (V∗) no matter what values we use to initialize V.
C. None of the above.
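Statement B can be illustrated (though not proved) numerically: run value iteration from two different initial value vectors on the same small MDP and compare the results. The snippet below reuses the chain layout assumed in the earlier sketches, with an arbitrary discount factor 0 < γ < 1.

```python
# Numerical illustration: value iteration from two initializations converges
# to the same values (chain layout assumed as in the earlier sketches).
GAMMA = 0.5
EXIT = {0: 10.0, 4: 1.0}

def sweep(V):
    out = []
    for i in range(5):
        options = [GAMMA * V[j] for j in (i - 1, i + 1) if 0 <= j < 5]
        if i in EXIT:
            options.append(EXIT[i])
        out.append(max(options))
    return out

def run(V, sweeps=200):
    for _ in range(sweeps):
        V = sweep(V)
    return V

print(run([0.0] * 5))
print(run([123.0, -7.0, 42.0, 0.5, 99.0]))   # different start, same fixed point
```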
6 Policy Iteration
Consider the gridworld where Left and Right actions are successful 100% of the time.
Specifically, the available actions in each state are to move to the neighboring grid squares. From state a, there is also an exit action available, which results in going to the terminal state and collecting a reward of 10. Similarly, in state e, the reward for the exit action is 1. Exit actions are successful 100% of the time.
The discount factor (γ) is 1.
(a) Consider the policy π1 shown below, and evaluate the following quantities for this policy. Write your answers in one line.
Sample Answer:
0,0,0,0,1
(b) Consider the policy π2 shown below, and evaluate the following quantities for this policy. Write your answers in the same format as part (a), in one line.
Sample Answer:
0,0,0,0,1
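The figures for π1 and π2 are not reproduced above, so the sketch below uses a placeholder policy dictionary to be filled in from the figure. It performs iterative policy evaluation on the same assumed chain as before, with γ = 1.

```python
# Iterative policy evaluation on the assumed chain (gamma = 1). The `policy`
# dict is a hypothetical placeholder; the actual pi_1 / pi_2 come from the figures.
GAMMA = 1.0
STATES = ["a", "b", "c", "d", "e"]
policy = {"a": "Exit", "b": "Left", "c": "Left", "d": "Right", "e": "Exit"}  # placeholder

def step(s, action):
    """Deterministic (reward, next_state); next_state None means terminal."""
    if action == "Exit":                      # only legal in a and e
        return (10.0 if s == "a" else 1.0), None
    i = STATES.index(s)
    j = i - 1 if action == "Left" else i + 1
    return 0.0, STATES[j]

V = {s: 0.0 for s in STATES}
for _ in range(50):                           # plenty of sweeps for this small chain
    for s in STATES:
        r, ns = step(s, policy[s])
        V[s] = r + (GAMMA * V[ns] if ns is not None else 0.0)
print(V)   # V^pi for the placeholder policy
```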
7 Policy Iteration
Consider the gridworld where Left and Right actions are successful 100% of the time.
Specifically, the available actions in each state are to move to the neighboring grid squares. From state a, there is also an exit action available, which results in going to the terminal state and collecting a reward of 10. Similarly, in state e, the reward for the exit action is 1. Exit actions are successful 100% of the time.
The discount factor (γ) is 0.9.
(a) Policy Iteration
Consider the policy πi shown below, and evaluate the following quantities for this policy. Write your answers in one line.
Sample Answer:
0,0,0,0,1
(b) Policy Improvement
Perform a policy improvement step. The current policy’s values are the ones from part (a) (so make sure you correctly answer part (a) before moving on to part (b)). Write your answers in one line.
(i). π_{i+1}(a) =
A. Exit B. Right
(ii). π_{i+1}(b) =
A. Exit B. Right
(iii). π_{i+1}(c) =
A. Exit B. Right
(iv). π_{i+1}(d) =
A. Exit B. Right
(v). π_{i+1}(e) =
A. Exit B. Right
Sample Answer:
A,A,A,A,B
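Part (b) is a one-step greedy improvement with respect to the part (a) values. A generic sketch is below; V_pi is a placeholder for the values computed in part (a), and the state/action encoding is the same assumed chain as in the earlier sketches.

```python
# One policy-improvement step (gamma = 0.9) on the assumed chain. V_pi is a
# placeholder for the V^{pi_i} values from part (a).
GAMMA = 0.9
STATES = ["a", "b", "c", "d", "e"]
V_pi = {s: 0.0 for s in STATES}               # replace with the part (a) values

def available_actions(s):
    i = STATES.index(s)
    acts = []
    if i > 0:
        acts.append("Left")
    if i < len(STATES) - 1:
        acts.append("Right")
    if s in ("a", "e"):
        acts.append("Exit")
    return acts

def q_value(s, a):
    if a == "Exit":
        return 10.0 if s == "a" else 1.0
    i = STATES.index(s)
    j = i - 1 if a == "Left" else i + 1
    return 0.0 + GAMMA * V_pi[STATES[j]]

pi_next = {s: max(available_actions(s), key=lambda a: q_value(s, a)) for s in STATES}
print(pi_next)   # the improved policy pi_{i+1}
```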
8 Wrong Discount Factor
Bob notices that value iteration converges more quickly with a smaller discount factor, so rather than using the true discount factor γ, he decides to use a discount factor of αγ with 0 < α < 1 when running value iteration. Write all of the options that are guaranteed to be true:
A. While Bob will not find the optimal value function, he could simply rescale the values he finds by ___ to find the optimal value function.
B. If the MDP’s transition model is deterministic and the MDP has zero rewards everywhere, except for a single transition at the goal with a positive reward, then Bob will still find the optimal policy.
C. If the MDP’s transition model is deterministic, then Bob will still find the optimal policy.
D. Bob’s policy will tend to more heavily favor short-term rewards over long-term rewards compared to the optimal policy.
E. None of the above.
9 MDP Properties
(a) Which of the following statements are true for an MDP?
A. If the only difference between two MDPs is the value of the discount factor then they must have the same optimal policy.
B. For an infinite horizon MDP with a finite number of states and actions and with a discount factor γ that satisfies 0 < γ < 1, value iteration is guaranteed to converge.
C. When running value iteration, if the policy (the greedy policy with respect to the values) has converged, the values must have converged as well.
D. None of the above.
(b) Which of the following statements are true for an MDP?
A. If one is using value iteration and the values have converged, the policy must have converged as well.
B. Expectimax will generally run in the same amount of time as value iteration on a given MDP.
C. For an infinite horizon MDP with a finite number of states and actions and with a discount factor γ that satisfies 0 < γ < 1, policy iteration is guaranteed to converge.
D. None of the above.
10 Policies
John, James, Alvin and Michael all get to act in an MDP (S, A, T, γ, R, s_0).
• John runs value iteration until he finds V∗, which satisfies
∀s ∈ S : V∗(s) = max_{a∈A} Σ_{s′} T(s, a, s′)[R(s, a, s′) + γV∗(s′)],
and acts according to
π_John(s) = argmax_{a∈A} Σ_{s′} T(s, a, s′)[R(s, a, s′) + γV∗(s′)].
• James acts according to an arbitrary policy πJames.
• Alvin takes James’s policy πJames and runs one round of policy iteration to find his policy πAlvin.
• Michael takes John’s policy and runs one round of policy iteration to find his policy πMichael.
Note: One round of policy iteration = performing policy evaluation followed by performing policy improvement.
Write all of the options that are guaranteed to be true:
A. It is guaranteed that ∀s ∈ S : V^{π_James}(s) ≥ V^{π_Alvin}(s).
B. It is guaranteed that ∀s ∈ S : V^{π_Michael}(s) ≥ V^{π_Alvin}(s).
C. It is guaranteed that ∀s ∈ S : V^{π_Michael}(s) > V^{π_John}(s).
D. It is guaranteed that ∀s ∈ S : V^{π_James}(s) ≥ V^{π_John}(s).
E. None of the above.