In this part, we survey algorithms that don’t assume any knowledge of the model, only that we can interact with it by playing. This is a much more realistic setting than the previous ones, and in essence it is the “real” scenario of RL: learning how to act in an environment simply by interacting with it.

Introduction to Learning while Playing

Before we go into the different algorithms, there are some common issues that need to be addressed.

Snake or Pong: all these algorithms can be run both in a Snake-like scenario, training them to defeat the “Follow” policy, and in a Pong-like scenario, training them against themselves (or, theoretically, each other), trying to reach an “optimal” policy not specifically designed to defeat “Follow”. At first the Snake-like scenario might seem easier: the “Follow” policy is better than a random one, but it isn’t very good, and defeating it is not a very difficult task. The Pong-like scenario, on the other hand, seems harder, because you don’t train the policy in the setting in which it will eventually be tested (we ultimately test the policies against “Follow”). But consider what would happen if we replaced “Follow” with “Predict” (or, more precisely, with an optimal policy). In that case, no matter what the agent does, it will never win, so it cannot learn much, as there is no differentiation between good and bad actions. In a Pong-like scenario, on the other hand, the agent is expected to win about 50% of the games, so even if it currently plays according to a very weak policy, it can learn which decisions were better or worse, and can, hypothetically, gradually improve until it matches “Predict”. So there are reasons to think that the Snake-like scenario is easier, and others to think that the Pong-like scenario is easier. Ultimately, which one works better depends on the details of the case.

Computational Complexity: Generally speaking, these algorithms require many episodes. Moreover, the actions of the agent in those episodes are chosen using some NN. It turns out that in many of them, more time is spent playing and creating new data than performing SGD steps. This is surprising, and it affects how the code should be optimized in practice. It also gives us a computational incentive to get the most out of each played episode, even if it means running a much heavier algorithm for the actual training of the network.

Objective: Assume two algorithms achieve similar performance in terms of winning rate. How should we compare them? There are different possible objectives, and weighing them differently might lead to a different choice of algorithm. Here is a non-comprehensive list of possible objectives:

  1. Runtime: if you have a simulator (as we do), perhaps you don’t care about anything except the running time. Even if you do care about the other objectives, a reasonable running time is certainly a necessary condition for the algorithm to be useful.
  2. Number of Episodes: if you don’t have a simulator, and playing an episode has a cost (think of real-life cases, or arcade machines), then you might want to optimize the total number of episodes. This means trying to get the most out of each episode, even at the cost of heavier computations.
  3. Number of Generations: many of these algorithms work in “generations” - they fix a policy, play many episodes with it, and only then update the policy to a different one (“creating a new generation”). This allows for better parallelization, as you can run the different episodes in parallel. If you can run many episodes in parallel, but each episode takes a long time, then you might prefer an algorithm that requires fewer generations.
Learning in generations.

Online vs. Batch Learning: Often, when discussing learning while playing, we think of an online setting, where we would like to guarantee we are never doing “too badly” at any point in time. In this article, we are interested in a batch setting: there is a “training session” in which the algorithms can do whatever they wish with no regard to rewards, and a “testing session” in which they are evaluated, but in a “single shot” manner - they can’t learn from the testing session.

Policy Gradient: Because Simpler is Better

Perhaps the most direct approach is to try to optimize our objective using standard optimization methods. We can phrase it using only \(\pi\):
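One standard way to write this objective (with \(\tau\) denoting a full episode and \(R(\tau)\) its total reward, e.g. \(+1\) for a win and \(-1\) for a loss) is

\[
\max_{\pi} \; J(\pi) \;=\; \max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\big[R(\tau)\big],
\]

that is, we look for the policy maximizing the expected total reward of an episode played according to it.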

So why can’t we approximate \(\pi\) by a NN, and optimize over this objective?

Well, we can’t even evaluate this objective for a given policy \(\pi\) without knowledge of the model. We can, however, estimate it by simply playing according to the policy \(\pi\) (and taking an average over many episodes). The ability to estimate a function at a point is enough for some optimization algorithms (“zero-order methods”), but it turns out we can actually do better: if the weights of the NN are aggregated into a vector \(\theta\), we can obtain an unbiased estimator of the gradient of the objective with respect to \(\theta\) by playing according to \(\pi_\theta\).

The Policy Gradient theorem:
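In its common (REINFORCE) form, it states that

\[
\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]
\;=\;
\mathbb{E}_{\tau \sim \pi_\theta}\Big[R(\tau)\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big],
\]

so playing a single episode according to \(\pi_\theta\) and computing \(R(\tau)\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\) already gives an unbiased estimate of the gradient.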

The proof of the theorem is simple, and can be found elsewhere. Once we have the ability to estimate the gradient, we can apply SGD as is.

Some notes:

  1. It only works if \(\pi_\theta\) is a stochastic policy, assigning a probability to every possible action. We know that there is an optimal deterministic policy, but if we limit ourselves to deterministic policies, the objective is not even continuous, and therefore not differentiable.
  2. We can estimate the gradient using a single episode, but we can also play many episodes and average the estimator, getting a better estimate (with lower variance), as in the formula below.
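Concretely, averaging the single-episode estimator over \(N\) independent episodes gives

\[
\hat{g} \;=\; \frac{1}{N}\sum_{i=1}^{N} R(\tau_i)\sum_{t} \nabla_\theta \log \pi_\theta\big(a^{(i)}_t \mid s^{(i)}_t\big),
\]

which is still unbiased and has \(N\) times smaller variance.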

The Algorithm

  1. Play 1000 episodes using \(\pi_{\theta_i}\).
  2. Create \( \pi_{\theta_{i+1}} \) by performing a single gradient step in the direction calculated by the formula above.
  3. Repeat.
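For concreteness, here is a minimal sketch of a single generation, written with PyTorch and a hypothetical helper play_episode(policy) that plays one full game and returns the visited states, the chosen actions and the total reward (the helper and its exact interface are assumptions for illustration, not the code we actually ran):

```python
import torch

# Sketch of one Policy Gradient generation. `play_episode(policy)` is a
# hypothetical helper that plays a single game and returns a tuple
# (states, actions, total_reward): `states` is a float tensor of shape
# (T, state_dim), `actions` is a long tensor of shape (T,), and
# `total_reward` is e.g. +1 for a win and -1 for a loss.

def policy_gradient_generation(policy, optimizer, play_episode, n_episodes=1000):
    episodes = [play_episode(policy) for _ in range(n_episodes)]

    optimizer.zero_grad()
    loss = 0.0
    for states, actions, total_reward in episodes:
        log_probs = torch.log_softmax(policy(states), dim=-1)          # (T, n_actions)
        chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # (T,)
        # REINFORCE estimator: weight the log-probabilities of the chosen
        # actions by the episode's total reward, and average over episodes.
        loss = loss - total_reward * chosen.sum() / n_episodes
    loss.backward()   # accumulates an unbiased estimate of -grad J(theta)
    optimizer.step()  # a single gradient step per generation
```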

Results

We ran PG, estimating every gradient using 1000 episodes, for 3500 generations. It took a long time, but eventually we got an 88% winning rate:

Win rate of Policy Gradient

To be fair, our setting was probably less than ideal for PG. We wanted to be able to compare algorithms easily, so we decided to fix the number of episodes per generation at 1000. In the PG case this turns out to be a very bad idea - for every 1000 episodes, we perform only a single gradient step. We can get better results by reducing the number of episodes per iteration.

Self Play

Policy Gradient trained in self-play mode, playing against itself.
Policy Gradient trained in self-play mode, playing against Follow.
Win rate of Policy Gradient trained by self-play.

Success Learning: Few Ways to Succeed, Many Ways to Fail

To the best of our knowledge this is a new RL algorithm, which we discovered by accident, thanks to a bug in our Policy Gradient implementation. But it proved to be a very efficient and stable algorithm, achieving good results in a matter of minutes instead of hours, so perhaps it will be of some interest. Of course, some of its efficiency is a result of how easy Pong is.

The algorithm is so intuitive that it can be presented immediately, and explained afterwards.

The Algorithm

  1. Play 1000 episodes using \(\pi_{\theta_i}\).
  2. Throw away all the episodes in which the policy lost.
  3. Create \(\pi_{\theta_{i+1}}\) by imitating the actions taken by \(\pi_{\theta_i}\) in the remaining episodes.
  4. Repeat.
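Under the same assumed interface as in the Policy Gradient sketch above, a single Success Learning generation might look roughly like this (again, an illustration rather than our actual implementation):

```python
import torch

# Sketch of one Success Learning generation, reusing the hypothetical
# `play_episode(policy)` helper described in the Policy Gradient sketch.

def success_learning_generation(policy, optimizer, play_episode,
                                n_episodes=1000, n_sgd_steps=100):
    episodes = [play_episode(policy) for _ in range(n_episodes)]

    # Step 2: throw away every episode the policy lost.
    wins = [(s, a) for s, a, total_reward in episodes if total_reward > 0]
    if not wins:  # degenerate case: no wins at all in this generation
        return

    # Step 3: imitate the actions taken in the winning episodes, i.e. plain
    # supervised learning with a cross-entropy loss. The same data is reused
    # for many gradient steps within a single generation.
    states = torch.cat([s for s, _ in wins])
    actions = torch.cat([a for _, a in wins])
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(n_sgd_steps):
        optimizer.zero_grad()
        loss = loss_fn(policy(states), actions)
        loss.backward()
        optimizer.step()
```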

Results

We ran it for 50 generations of 1000 episodes each, and got:

Win rate of Success Learning

Informal Explanation

The basic idea is that a stochastic policy will occasionally perform good steps by chance, and that we can improve our policy by reinforcing those behaviours.

But this might seem wasteful: there is some information in the discarded episodes as well: “do not act like this”. We can simply say that it is an empirical fact that throwing them away improved the performance, but there is probably a deeper insight here. There are relatively few ways to win in Pong, and many ways to lose. Accordingly, there are few directions in which we can move to improve \(\pi_{\theta}\) and many that can worsen it. Each successful episode gives us some information about the “good” directions; each unsuccessful one gives us some information about the “bad” directions. So the situation can be represented graphically like this:

Representation of a single generation of Success Learning.

Now it should be obvious why successful episodes are more informative than unsuccessful ones. Giving the unsuccessful ones negative weights would mean making a step in the average of the following directions:

Same image with the failed episodes inverted.

And you can see that the unsuccessful episodes add a lot of noise. It might be possible to overcome this obstacle by assigning failed episodes small negative weights (e.g. -0.001), to reflect the fact that we trust successful episodes to a greater degree. In practice, we didn’t find such a weight that gave better results (a small enough weight gave the same results as throwing them away).
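In code, such a weighting could look roughly like the following sketch (same assumed episode format as before; the -0.001 is just an illustrative value):

```python
import torch
import torch.nn.functional as F

# Sketch of the weighted variant: keep all episodes, but give the failed ones
# a small negative weight in the imitation loss instead of discarding them.
def weighted_imitation_loss(policy, episodes, fail_weight=-0.001):
    loss = 0.0
    for states, actions, total_reward in episodes:
        w = 1.0 if total_reward > 0 else fail_weight
        loss = loss + w * F.cross_entropy(policy(states), actions)
    return loss / len(episodes)
```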

Another interesting point is that the situation actually gets worse as our policy improves. If the failed episodes are pretty good (e.g. the agent just misses the ball), then the previous illustrations become:

Representation of a single generation of Success Learning at late stages of training.
Same image with the failed episodes inverted.

And making a step in the opposite direction to the failed episodes is a really bad idea.

Similarities & Differences between Success Learning and Policy Gradient

In our experiments on Pong, Success Learning worked much better than Policy Gradient: about two orders of magnitude faster in both time and number of episodes. However, if you take a look at the code, you’ll see that Success Learning is only a minor variation of Policy Gradient. There are only two differences between the two:

  1. In Success Learning, we throw away all the failed episodes, while in Policy Gradient we give them negative weights.
  2. In Success Learning we use the same episodes to perform many gradient steps: it’s not that we make a bigger step in the direction of a single gradient (i.e. increase the learning rate), but that we calculate many gradients from the same episodes.

We believe that the two are connected: (1) greatly reduces the variance, giving us much cleaner information about the direction of “good” policies, which allows (2): exploiting this information.

In many ways this is related to “baseline” methods for Policy Gradient, which also attempt to reduce the variance using similar ideas, but this is an easier way to do so.
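For reference, a baseline method replaces the raw reward in the gradient estimator with \(R(\tau) - b\) for some reference value \(b\) (for example, the average reward of the current generation):

\[
\hat{g} \;=\; \frac{1}{N}\sum_{i=1}^{N}\big(R(\tau_i) - b\big)\sum_{t}\nabla_\theta \log \pi_\theta\big(a^{(i)}_t \mid s^{(i)}_t\big).
\]

As long as \(b\) does not depend on the actions, the estimator stays unbiased, but its variance can be much smaller.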

Self Play

Success Learning trained in self-play mode, playing against itself.
Success Learning trained in self-play mode, playing against Follow.
Win rate of Success Learning trained in self-play mode.

Next Chapter

In Chapter 5: Learning While Playing Part 2 we continue with the “learning while playing” problem, but add the assumption that the model is an MDP. We describe the famous Deep-Q-Learning algorithm, and many small variations of it.