Now we assume that we have perfect knowledge of the problem. The RL problem becomes more of an optimization problem than a learning one: we simply want to find the policy that maximizes some known objective function. Here, we will sometimes use learning as a tool for approximately solving the RL optimization problem. We start with a short background on MDPs. If you’re familiar with MDPs, you can skip to the following section.

Background: MDP and Relevant Algorithms

There’s a standard mathematical formulation for our needs: the Markov Decision Process (MDP). Without going into formal definitions, the idea is relatively simple: a model is a probabilistic function that, given the current state and an action, returns the next state. Essentially it means that all the relevant information is in the current state (there are no long-term hidden effects). We can assume the model is given by a conditional distribution \(\tau(s^\prime|s,a)\). The goal of RL is to find a good policy. But what is “good”? In MDPs, the quality of a policy is measured based on a reward function \(r(s, a)\). Summarizing the rewards of an episode into a single number can be done either by summing them or by taking a discounted sum:

\[ R = \sum_{t} r(s_t, a_t) \qquad \text{or} \qquad R = \sum_{t} \gamma^t r(s_t, a_t) \]
where \(\gamma\) is called the discounting factor, telling our agent to prefer rewards in the near future over rewards in the distant future.

If we have an MDP and a policy \(\pi\), we can define the following two functions, conveniently called the V-function and the Q-function:

\[ V_\pi(s) = \mathbb{E}\left[ \sum_{t} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s,\ a_t \sim \pi(s_t) \right] \]
\[ Q_\pi(s, a) = \mathbb{E}\left[ \sum_{t} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a,\ a_t \sim \pi(s_t) \text{ for } t \geq 1 \right] \]
And it so happens that for every policy \(\pi\), the following recursive equations hold:

\[ V_\pi(s) = \mathbb{E}_{a \sim \pi(s)}\left[ r(s, a) + \gamma \, \mathbb{E}_{s^\prime \sim \tau(\cdot|s, a)} V_\pi(s^\prime) \right] \]
\[ Q_\pi(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s^\prime \sim \tau(\cdot|s, a)} \, \mathbb{E}_{a^\prime \sim \pi(s^\prime)} Q_\pi(s^\prime, a^\prime) \]
In particular, for an optimal policy \(\pi^*\) the following equations (known as the Bellman equations) hold, as a special case of the previous ones:

\[ V^*(s) = \max_a \left[ r(s, a) + \gamma \, \mathbb{E}_{s^\prime \sim \tau(\cdot|s, a)} V^*(s^\prime) \right] \]
\[ Q^*(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s^\prime \sim \tau(\cdot|s, a)} \max_{a^\prime} Q^*(s^\prime, a^\prime) \]
If we somehow got either \(V^*\) or \(Q^*\), it’s easy to calculate an optimal policy \(\pi^*\):

\[ \pi^*(s) = \underset{a}{\text{argmax}} \left[ r(s, a) + \gamma \, \mathbb{E}_{s^\prime \sim \tau(\cdot|s, a)} V^*(s^\prime) \right] = \underset{a}{\text{argmax}} \, Q^*(s, a) \]
Our implementation of Pong is an MDP, if we fix the left paddle to the “Follow” policy: the next state is determined solely based on the current state and action.

V/Q Evaluations

How can we find \(V_\pi\) and \(Q_\pi\) for a given policy \(\pi\)? There’s a simple algorithmic way, using the recursive equations from above: start with an arbitrary function \(V^{(0)}\) or \(Q^{(0)}\), and iteratively update it:

\[ V^{(k+1)}(s) = \mathbb{E}_{a \sim \pi(s)}\left[ r(s, a) + \gamma \, \mathbb{E}_{s^\prime \sim \tau(\cdot|s, a)} V^{(k)}(s^\prime) \right] \]
\[ Q^{(k+1)}(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s^\prime \sim \tau(\cdot|s, a)} \, \mathbb{E}_{a^\prime \sim \pi(s^\prime)} Q^{(k)}(s^\prime, a^\prime) \]
This process is guaranteed to converge to \(V_\pi\) and \(Q_\pi\) respectively, and it does so relatively quickly (exactly how fast depends on \(\gamma\)).

V/Q Iteration

What can we do in order to find an optimal policy \(\pi^*\)? We can use the same idea, but with the following update rule:

\[ V^{(k+1)}(s) = \max_a \left[ r(s, a) + \gamma \, \mathbb{E}_{s^\prime \sim \tau(\cdot|s, a)} V^{(k)}(s^\prime) \right] \]
\[ Q^{(k+1)}(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s^\prime \sim \tau(\cdot|s, a)} \max_{a^\prime} Q^{(k)}(s^\prime, a^\prime) \]
This process converges to \(V^*\) and \(Q^*\) respectively, from which an optimal policy can be extracted as shown above.
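
To make V-Iteration concrete, here is a minimal tabular sketch for a small finite MDP. The arrays `P` and `R` are hypothetical placeholders for the model \(\tau\) and the reward function \(r\); they are not part of our Pong code.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, iters=1000):
    """Tabular V-Iteration.

    P: transition probabilities, shape (S, A, S), with P[s, a, s2] = tau(s2 | s, a)
    R: rewards, shape (S, A), with R[s, a] = r(s, a)
    Returns the value function V and a greedy policy extracted from it.
    """
    S, A, _ = P.shape
    V = np.zeros(S)                # arbitrary V^(0)
    for _ in range(iters):
        # Q[s, a] = r(s, a) + gamma * E_{s' ~ tau}[ V(s') ]
        Q = R + gamma * (P @ V)    # shape (S, A)
        V = Q.max(axis=1)          # V^(k+1)(s) = max_a Q[s, a]
    policy = Q.argmax(axis=1)      # pi(s) = argmax_a Q[s, a]
    return V, policy
```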

P-Iteration (a.k.a. Policy Iteration)

Here’s an alternative approach to V/Q Iteration that also finds an optimal policy. Start with some random policy \(\pi_0\), and perform:

  1. Estimate \(Q_{\pi_i}\) using Q-Evaluation.
  2. Create \(\pi_{i+1} \) such that \( \pi_{i+1}(s) = \underset{a}{\text{argmax}} Q_{\pi_i} (s, a) \).
  3. Repeat.

It can be proved that this algorithm converges to \(\pi^*\), and empirical testing shows that it happens within very few iterations. Of course, step (1) is computationally demanding on its own: it performs a full Q-Evaluation in every iteration of P-Iteration.
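
Here is a matching tabular sketch of P-Iteration, in the same hypothetical `P`/`R` setting as the V-Iteration sketch above; step (1) is the inner Q-Evaluation loop.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, eval_iters=500, max_policy_iters=20):
    """Tabular P-Iteration: alternate Q-Evaluation and greedy improvement."""
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)                 # arbitrary pi_0
    for _ in range(max_policy_iters):
        # Step 1: Q-Evaluation of the current policy pi_i.
        Q = np.zeros((S, A))
        for _ in range(eval_iters):
            V_pi = Q[np.arange(S), policy]          # V_pi(s) = Q_pi(s, pi(s))
            Q = R + gamma * (P @ V_pi)
        # Step 2: greedy improvement, pi_{i+1}(s) = argmax_a Q_{pi_i}(s, a).
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):      # no change: we are done
            break
        policy = new_policy
    return policy
```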

Rollout-based Monte Carlo Tree Search (MCTS)

MCTS is an alternative to V/Q/P Iteration. Those algorithms can become computationally intractable when the number of states in the MDP is huge (or worse - infinite). In contrast to those algorithms, MCTS doesn’t attempt to find an explicit policy, that is, a function from all the states to actions. It’s an algorithm that gives you a possible move for a single given state.

Assume that you already have some stochastic policy \(\pi\) and some evaluation function \(v(s)\) that gives you the expected reward at a given state \(s\). Neither \(\pi\) nor \(v(s)\) has to be very good. You are at state \(s\) at the moment, and you need to decide on an action, so you want to approximate \(Q(s, a)\) for every possible \(a\). Here’s a possible way to do it: for every \(a\), perform \(n\) episodes in a simulator, starting from state \(s\) and action \(a\), and continuing according to \(\pi\). Run each episode simulation until its end, or else terminate it after \(d\) steps. If you reached the episode’s end, use the total reward of the episode; otherwise, use your evaluation function \(v(s)\) to estimate its tail. Taking the average reward over those \(n\) simulations will give you a reasonable approximation \(\hat{Q}_{\pi}(s, a)\), and you can choose \( a = \underset{a}{\text{argmax}} \hat{Q}_{\pi} (s, a) \).

Evaluations of a single action using rollouts.
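
Here is a rough sketch of this rollout estimate. The simulator interface (`sim.set_state`, `sim.step`, `sim.state`) is hypothetical and only meant to illustrate the idea, not our actual code.

```python
def estimate_q_rollout(sim, s, actions, policy, value_fn, n=10, d=50, gamma=1.0):
    """Estimate Q_hat(s, a) for every action a using n rollouts of at most d steps."""
    q_hat = {}
    for a in actions:
        total = 0.0
        for _ in range(n):
            sim.set_state(s)                     # reset the simulator to state s
            reward, done = sim.step(a)           # perform the action in question
            ret, discount = reward, gamma
            for _ in range(d - 1):               # then follow the policy pi
                if done:
                    break
                reward, done = sim.step(policy(sim.state))
                ret += discount * reward
                discount *= gamma
            if not done:                         # truncated: estimate the tail with v
                ret += discount * value_fn(sim.state)
            total += ret
        q_hat[a] = total / n
    return q_hat

# Usage (all objects hypothetical):
# q_hat = estimate_q_rollout(sim, s, actions, policy, value_fn)
# best_action = max(q_hat, key=q_hat.get)
```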

The problem with this approach is that you don’t follow the policy \(\pi\), but the policy \(\text{MCTS}(\pi)\). So you don’t really want to estimate \(Q_{\pi}(s, a)\), but \(Q_{\text{MCTS}(\pi)}(s, a)\), and pick its maximal value. So we can apply the idea recursively. For example, depth 2: for each pair of actions \((a_1, a_2)\) set your simulator to state \(s\), perform \(a_1\) and then \(a_2\), and from then on act according to \(\pi\). It should give a better approximation of the best action for state \(s\).

Evaluations of two actions using rollouts.

The Rollout-based MCTS algorithm is a bit different, but this is the main idea. Instead of enumerating all possible actions up to some depth \(d\), we simply run many simulations starting at state \(s\) and following the policy \(\pi\), but for each simulation we don’t merely record its total reward; we also create nodes in a game tree. That is, instead of creating a full game tree of depth \(d\) in advance and running simulations from its leaves, we create the tree as needed, on-the-fly.

Illustration of MCTS

So we created such a tree by running simulations from the state \(s\) using \(\pi\). It allows us to estimate \(Q(s, a)\) for our real policy, because we can follow \(\text{MCTS}(\pi)\) all along the tree. It is true that further down the tree the estimates become worse, because they are based on fewer simulations. But we can still hope to get reasonable estimates for the top layer of the tree, and that’s all we need. Once we decide on an action \(a\), we can perform MCTS again for our new state \(s^\prime\).

Deciding on an action using MCTS.

We hope that this is a reasonable description of the basic MCTS algorithm. But what we really need is an improvement of this algorithm, called UCT-MCTS. The idea is that instead of picking the actions in the simulation simply based on \(\pi\), we can use an algorithm from Online Learning called UCB, which ensures that we explore the different possible actions in a somewhat wiser fashion.
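
For reference, the standard UCB1 selection rule used in UCT picks, at a tree node for state \(s\), the action

\[ a = \underset{a}{\text{argmax}} \left[ \hat{Q}(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right] \]

where \(N(s)\) is the number of visits to the node, \(N(s, a)\) is the number of times action \(a\) was tried there, and \(c\) is an exploration constant. Rarely tried actions get a large exploration bonus, which is exactly the “wiser” exploration mentioned above.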

There are theoretical guarantees about MCTS - given enough simulations, it will return the best possible action. But there is very little to say about how many simulations are needed - possibly exponentially many (exponential in the episode length).

Back to Pong: Discretization & V-Iteration

Recap: we have an algorithm that finds the best policy for a given MDP, namely V-Iteration. How can we apply it to Pong? We cannot run it as is, because the state-space is infinite. But this doesn’t appear to be a big problem, because small differences in the state shouldn’t matter too much, right? It shouldn’t matter if the ball’s position is \((0.51, 0.37)\) or \((0.52, 0.36)\). So we can choose a finite number of states to approximate the real state — that is, use a discretization of the state-space. Because Pong is deterministic, we can even simplify the previous update formula, writing \(\tau(s, a)\) for the (now deterministic) next state, and get:

\[ V^{(k+1)}(s) = \max_a \left[ r(s, a) + \gamma \, V^{(k)}\big(\tau(s, a)\big) \right] \]
We still need to decide how exactly to discretize the state, and which reward scheme to use. We decided to start with the simplest approach: give the paddle a reward of +1 whenever it hits the ball and -1 whenever it misses. We also decided to simplify the game: we only try to teach the paddle to reach the ball while it approaches, and manually program it to go back to the center when the ball is moving away.

How to discretize? It takes the ball 67 steps to get from one side of the table to the other. This means we cannot use fewer than 67 cells for the x-axis, because otherwise there will be at least one cell where a step of the ball doesn’t move it to the next cell, and the algorithm will get stuck. So we decided to use 100 cells on each of the 4 relevant dimensions, which yields \(100^{4} = 10^{8}\) states. We ran it for a few hours and got a 5% win rate:

Simple discretization and Value Iteration. Performs very poorly.
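
For concreteness, the kind of discretization we are talking about is just a per-dimension binning of the state, roughly like the sketch below. The state layout and the exact bounds are illustrative assumptions, not our exact implementation.

```python
import numpy as np

N_CELLS = 100   # cells per dimension, as described above

def discretize(state, lows, highs, n_cells=N_CELLS):
    """Map a continuous state vector to a tuple of cell indices.

    lows / highs are the per-dimension bounds used for the binning.
    """
    state = np.asarray(state, dtype=float)
    normalized = (state - lows) / (highs - lows)              # each dimension in [0, 1]
    cells = np.clip((normalized * n_cells).astype(int), 0, n_cells - 1)
    return tuple(cells)

# With 4 relevant dimensions and 100 cells each, the V table has 100**4 = 10**8 entries.
```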

Ouch. So what happened here? We found the best policy for the wrong model, which apparently is different enough from the real one. Note how “apathetic” the paddle seems to be: this is because it has no notion of improving the probability of success. Either it thinks it has a move that will ensure a win, in which case it takes it, or it thinks there is nothing it can do, in which case it doesn’t care.

We can “improve” it by changing the simulator and making it truly discrete (also known as “fixing the world to match the model”):

Here the agent plays perfectly, as we expected, and we can learn a lot about V-Iteration from this video. The paddle is still “apathetic” — it “just” catches the ball at the last minute. This is because it cannot differentiate between the different kinds of hits.

In what sense is the model wrong? Well, obviously, we lose some information by the discretization. But this is probably not enough to explain those poor results. It’s easy to program a good policy that’s based only on this information:

Manually programmed policy that uses only the information available after discretization to a very low resolution. Uses linear regression in order to get good approximations of the velocity and positions.

So the issue is not lack of information, but that after the discretization our assumptions about the model don’t hold anymore. Firstly, the model is no longer deterministic: given only partial information about the state, the real state remains unknown, and so there can be more than one possible next-real-state, which will often lead to more than one possible next-partial-state. This claim is trivial once you look at the following visualization:

Discretization creates indeterminism.

We can, hypothetically, deal with this issue. Value Iteration works just as well in probabilistic settings, and the deterministic model assumption can be removed. It would be much more complicated to implement, and would be somewhat more demanding computationally, but still possible.

The second error is much worse: after discretization the model is no longer an MDP. You can deduce more information about the real velocity and position of the ball by looking at several consecutive states. As before, it is a trivial claim with the proper drawing:

Discretization creates Non-Markovity.

Conclusion: finding an optimal policy for the wrong model is not the best choice of algorithm. Which, in retrospect, is not so surprising.

One last attempt before moving on. Among the many tests we ran in order to make V-Iteration work, we tried smoothing the V-function. Instead of picking an action by:

\[ \pi(s) = \underset{a}{\text{argmax}} \left[ r(s, a) + \gamma \, V\big(\tau(s, a)\big) \right] \]
we picked an action by:

\[ \pi(s) = \underset{a}{\text{argmax}} \ \mathbb{E}_{\epsilon} \left[ r(s + \epsilon, a) + \gamma \, V\big(\tau(s + \epsilon, a)\big) \right] \]

where \(\epsilon\) ranges over small perturbations taking \(s\) to its adjacent states. We used the same V table from above. This is the result:

A policy based on the same V table from before, but adding a small degree of uncertainty to the state. Performs well.

As you can see, it works very well. There can be several possible explanations, but honestly we don’t feel that we fully understand it.

Function Approximation & Q-Iteration

Discretizing a continuous case is tricky. Can we do something different? The answer is yes. Instead of discretization, which allows us to pass over all possible states at every iteration, we can use function approximation. Even a simple NN can approximate a wide variety of functions, and we will use our NN architecture to approximate the value function. Randomly initializing the weights of the NN gives us an initial (arbitrary) \(V^{(0)}\). We next perform iterations of V-Iteration approximately, by solving the following optimization problem using SGD:

\[ V^{(k+1)} = \underset{V_\theta}{\text{argmin}} \ \mathbb{E}_{s} \left[ \Big( V_\theta(s) - \max_a \big[ r(s, a) + \gamma \, V^{(k)}(\tau(s, a)) \big] \Big)^2 \right] \]
The expectation is over some way to sample the state space — in our case, we picked a uniform sampling, which is the continuous analog of “passing over all possible states” in the discrete case. Note that efficient uniform sampling is often not an option (even with a known model).

Because it is somewhat easier to implement, we used Q-Iteration instead of V-Iteration. We also removed the simplifying assumptions from the previous part: we actually attempted to solve the full model. And, in contrast to the discrete case, it works perfectly and takes only a few minutes to run:

Policy learned using Deep-Q-Iteration
Deep-Q-Iteration win rate as a function of number of SGD steps.
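
Here is a rough sketch of one such approximate Q-Iteration SGD step, written with PyTorch purely for illustration. The network architecture, the 4-dimensional state, the `sample_states` / `model_step` helpers and all hyper-parameters are assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn

N_ACTIONS = 3     # e.g. up / stay / down (assumed)
GAMMA = 0.99

# A small network mapping a state vector to one value per action: Q_theta(s, .)
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
q_prev = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
q_prev.load_state_dict(q_net.state_dict())     # frozen copy playing the role of Q^(k)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q_iteration_sgd_step(sample_states, model_step):
    """One SGD step of approximate Q-Iteration.

    sample_states() -> uniformly sampled states, tensor of shape (B, 4)
    model_step(s, a) -> (reward, next_state) from the known model, batched
    """
    states = sample_states()
    with torch.no_grad():
        targets = torch.empty(states.shape[0], N_ACTIONS)
        for a in range(N_ACTIONS):
            reward, next_states = model_step(states, a)
            # Bellman target: r(s, a) + gamma * max_a' Q^(k)(s', a')
            targets[:, a] = reward + GAMMA * q_prev(next_states).max(dim=1).values
    loss = nn.functional.mse_loss(q_net(states), targets)    # squared Bellman error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every so often, copy q_net into q_prev to move from Q^(k) to Q^(k+1).
```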

Random Walk & MCTS

An interesting fact about MCTS is that at no point in the description of the algorithm did we assume a finite state space. The only thing we need is a discrete action space. This means that we can apply MCTS as-is, without discretization or function approximation. We do need to choose some initial policy \(\pi\) and evaluation function \(v(s)\), but we can just choose a policy with a uniform distribution over all actions and \(v(s)=0\). We set the number of simulations to 100 and the maximal simulation length to 50, and got:

MCTS on a uniform policy playing against Follow

Which is actually quite telling. We can see that the MCTS paddle occasionally goes straight to the ball, but most of the time the paddle ignores it. Actually, the paddle goes to the ball only if it doesn’t have to move too far. The reason is that we performed only 100 random simulations. If in one of those simulations the paddle managed to hit the ball, it would act accordingly. Otherwise, as far as the paddle knows, there’s nothing it can do in order to avoid losing, and so it does nothing.

Note that the video is somewhat misleading here: every step requires a few seconds - the paddle must perform 100 simulations for every step. Creating this video took about 30 minutes on our computer.
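
In terms of the rollout sketch from the MCTS section (with the same hypothetical simulator interface), this setup corresponds roughly to:

```python
import random

ACTIONS = ["up", "stay", "down"]        # assumed discrete action set

def uniform_policy(state):              # pi: uniform distribution over the actions
    return random.choice(ACTIONS)

def zero_value(state):                  # v(s) = 0
    return 0.0

# Roughly: 100 simulations per decision, each truncated after 50 steps.
# q_hat = estimate_q_rollout(sim, s, ACTIONS, uniform_policy, zero_value, n=100, d=50)
# action = max(q_hat, key=q_hat.get)
```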

Conclusions

Many classical algorithms assume a small (and finite) state space. We have shown two main approaches to utilize them in Pong — discretization and function approximation. Discretization is both computationally expensive (it suffers from the “curse of dimensionality”) and prone to distribution mismatch (an optimal policy for the discretized world might perform poorly in the non-discretized world). Function approximation worked much better for us.

Next Chapter

In Chapter 3: AlphaZero we describe the AlphaZero algorithm: it still assumes we know the rules of the game, but it manages to find a good policy even when the algorithms described above fail. We will talk (briefly) about how AlphaZero manages to find a good policy even without knowing the model exactly - it attempts to find a policy that would defeat any rival policy, not just a specific one.