An overview of the evaluation procedures for the Atari 2600 domain

Originally published in 2017

Just around two weeks ago "Deep Reinforcement Learning that Matters" took Reinforcement Learning community by storm. This is my favourite figure from the paper:

In two words, the paper asks a great question: "Does the research we are doing right now make sense? If not, what should we do to make Deep RL great again?". The answer includes publishing implementations for the reproducibility, do proper significance testing, developing methods agnostic to hyperparameters, and thinking about settings where the algorithms in question will be applied.

After reading this paper I feel brave enough to write a post about the evaluation practice in another domain popular for RL research: Atari 2600. Having read quite a few papers on the subject recently, I found a lot of moments the reasons of which were unknown to me. I do not claim I will answer these questions here, but I will point them out so that we can think together.

I have no computational resources to reproduce the papers' results like the authors from the paper did, so, I will just describe how the evaluation procedures were conducted in major papers on the topic and say what I find bizarre. Why do I do it? Evaluation procedure is what makes paper results look credible. When you compare the results with all other methods, it's important that the comparison is fair. In order to make it fair we need to understand the procedure well enough. But reading the Atari 2600 papers, I have a feeling that almost each paper has its own evaluation approach. Why? I have no idea. But let's start.

Playing Atari with Deep Reinforcement Learning

I'll start from the paper that revived the interest to using neural nets in RL.
There are some details that might seem obvious to a person with some experience in the field, but were not obvious for me at all at the beginning, so, I will mention them.

Fist of them is the frameskip parameter. When we want a model to predict the action given the state, we do it for every k-th frame, where k is the frameskip parameter. The environment will repeat the chosen action for all the other 'skipped' frames. The authors chose k = 4 (with an exception for Space Invaders with k = 3), and, usually the number is never changed in the Atari 2600 papers.

The next important thing is the reward clipping. During training an agent receives either –1 if the reward is negative, 1 if it's positive and zero otherwise. That means, our agent is not able to distinguish between excellent moves and just okayish moves, and this might be a cause of sub-par performance. At the same time, as the authors mention, this limits the scale of the error derivatives and makes it possible to use the same learning rate across games. This should also keep our Q-values more or less comparable.

For the evaluation the authors choose the strategy as in [11] and report the average score obtained by running an ε-greedy policy with ε=0.05 for a fixed number of steps.

The number of steps is not mentioned and there are two options: either 10,000 as on Fig. 2. or 18,000 as in [11]:

Evaluation was performed by dividing gameplay into episodes, as described in Section 2.2, with the maximum episode length set to 5 minutes of real-time play (18,000 frames). Each method was evaluated over 30 trials, with each trial composed of 5,000 training episodes followed by 500 evaluation episodes during which no learning took place. The agent’s performance was measured as the average score achieved during this evaluation period.

500 episodes in [11] look really brutal, but, as far as I get, by 'choosing the strategy', the authors mean the first sentence of the citation above.

The main table with the results contains the average performance of the method as well as the best run. Unfortunately there is no variance or standard error of the mean. Given this we, first, have no idea about how deterministic the agent is (given that ALE is deterministic, this would be an interesting point). And, secondly, we have no clue about how stable the performance is.

That's it for this paper, let's move now to the next one closely related to the current.

Human-Level Control Through Deep reinforcement Learning

The paper is an extension of the previous one with 49 games and uses exactly the same hyperparameters across different games for training. The code can be found here. But I like this repo more since it has issues and discussions.

We also have the reward clipping as in the previous paper. But the code also has the option for reward scaling, and, we might hypothesize that the authors also tried this.

The thing I like about the paper is that it has the whole subsection on the evaluation (you can find it in the appendix) and all the hyperparameters can be found in the Table 1.

The method was again evaluated with a 0.05 ε-greedy policy on 30 episodes (but no more than 5 minutes each). It looks like we need these five-minute limit to avoid situations when an agent does nothing and the episode can last forever, e.g. in Montezuma's Revenge).

There is a fix for skipframe coefficient here. As I said before, the authors used k=3 for Space Invaders since the laster beams blink every 4th frame. To make it possible to use k=4 the authors do the max over last 2 frames and use it as the observation.

To minimize overfitting, the authors used a 'no-op' parameter, i.e. at the beginning of the episode the agent did nothing up to 30 frames. The number of frames is randomly sampled for each of the episodes (look here and here.

The paper also reports the normalized performance, that equals to 100*(DQN score - random score)/(human score - random score).

The bizarre thing is in the random baseline. The authors say that they report the results for choosing randomly an action with skipframe coefficient equal to 6, since it is how the fastest human can press the fire button. Okay, why, in this case, DQN can press the buttons faster than a human player with k=4? They also say, that they had tried choosing the action randomly for each of the frames (i.e. at 60Hz), but it had a minimal effect.

Another question is why we don’t have the standard deviation for the random agent. It's super easy to implement and it would be interesting to see, what you can take from here. For instance, for Pinball you can be really good by just being super quick no matter what you are pressing [5].

The cool thing about the results table is that we have the standard deviation for the score. And we know the number of episodes, so we can compute the SEM here. Great!

When writing this post, I found an error in [5]. We say, that there is not information about the number of episodes in the Nature paper, so, we cannot infer the SEM. But we can, actually.

Massively Parallel Methods for Deep Reinforcement Learning

The paper has a really clear Evaluation section (5.2) which is a must-read since the evaluation procedure will be used in the subsequent papers.

Following DQN, we periodically evaluated each model during training and kept the best performing network parameters for the final evaluation. We average these final evaluations over the 5 runs, and compare the mean evaluations with DQN and human expert scores.

There is something new here: human starts! What is that? A human demonstrator starts to play a game. For each of the games, they record 100 of such beginnings and the agent continues then to play from these saved checkpoints.

Why do we want to have such a thing? ALE is deterministic, and this approach should make it more generalizable.

Deep Reinforcement Learning with Double Q-learning

I read this paper before the previous one, and I did not get a thing about the evaluation procedure. It was like a riddle to me. But after reading [6], it gets quite understandable. The evaluation, actually, is the same as in [6].

Again, the method is evaluated with 0.05 ε-greedy policy. The authors also use the normalized score as in [4]. The list of games for the papers is 57!!!

Apart from the raw scores in the Appendix, the authors present quite strange Tables 1 and 2, which show the mean and median performance for all 49 games. Why? What information should we get from these two tables? That DDQN is definitely better? Okay.

Given, that Double DQN is muuuuch better in Video Pinball, the increase in the mean mostly reflects our awesomness in this game. I re-typed the numbers into a Google Sheet and took out the Video Pinball. The results change dramatically: with Pinball, DDQN's mean is about 37% better, whereas without it, only by 19%.
At the same time I agree, that the median should be robust to this kind of stuff.

The paper also presents the results for the tuned version of the DDQN. The idea is the following, the DQN hyperparameters were tuned for DQN and are not the best fit for DDQN. Fair enough. The authors change the target network update period from 10k to 30k, the ε value was changed to 0.01 both for training and evaluation (so, the agent is more deterministic). The results are clearly better as the authors point out.

Dueling Network Architectures for Deep Reinforcement Learning

Again, we have 57 games here and evaluation with no-op and human starts. ε is 0.001, so, the agent is more deterministic.

It is cool to see, that the authors thought about the normalized score and adjusted the indicator accordingly:

We took the maximum over human and baseline agent scores as it prevents insignificant changes to appear as large improvements when neither the agent in question nor the baseline are doing well. For example, an agent that achieves 2% human performance should not be interpreted as two times better when the baseline agent achieves 1% human performance. We also chose not to measure performance in terms of percentage of human performance alone because a tiny difference relative to the baseline on some games can translate into hundreds of percent in human performance difference.

Another reason the paper is great that it actually mentions the fact, that the network output is not 18 (maximum amount of ALE actions). It varies from 3 to 18 depending on the game. I haven't seen any other paper saying this.

Neural Episodic Control

I really like this paper. Everything, from the idea to the beautiful Figures 1 and 2. But the evaluation procedure description really lags behind.

There is no exploration parameter ε, there is no number of episodes to average over. There is no mention of no-op or human starts.

The only thing we have is that the learning curves are averaged over 5 different initial random seeds and that the method was evaluated each 200k frames, when A3C and DQN baselines were evaluated each 1 million frames. A3C and DQN use the reward clipping.

At the same time, the paper tries to perform some qualitative analysis that is not a usual practice at all.

Learning from Demonstrations for Real World Reinforcement Learning

The cool thing about this paper is that all the training hyperparameters are available as a table in the appendix. The evaluation is done with ε=0.01. The weird thing is that the amount of games is 42 (not 49 or 57 as we saw before). More interesting is that in the last version of the paper the authors say that these 42 games are a randomly selected subset. Why? I have no idea.

The authors rescaled the rewards by using a log scale: r_agent = sign(r)*log(1+|r|). That makes a complete sense since the agent starts to distinguish what is attractive and what is super attractive.

Another weird moment is:

DQfD scores are the best 3 million step window averaged over four seeds, which is 508 episodes on average.

So, the reported results are the results responsible for choosing the model. It's not like we take some model based on our metric (e.g. evaluation during training), and then rerun the evaluation and report the results.

The last weird thing about the paper is the reported Double DQN baseline. May be it's because of the evaluation described above, but this is really strange. I will put the original Dueling Network paper results and the paper results below.

First, DDQN:

And, DQFD:

Look at Assault or Video Pinball, for instance. The numbers are totally different. May be I miss something, but I haven't found what.

Outro

So, we are done. As we have seen, even one of the most high-quality papers have some weird stuff inside. May be it's just me. May be it's just badly described and I misunderstood something. But it would be cool to know what do you think about it. Or, may be, you've seen something similar that I haven't mentioned. Just drop a line and I'll update the post. I hope it was interesting to read and, may be, it might be useful to some of you who is working on the Atari 2600 domain. [11] also makes a general description of the common evaluation practice, and you can also have a look on it.

I'd like to conclude with the following. We can often hear now, that Atari 2600 is solved, let's move to something more challenging, let's finally solve the intelligence! But is it really so?

I beg to differ. Okay, according to the mean/median metrics we've seen today, we are crushing stupid humans. But does it really mean that RL agents are smarter in this sense? Not at all. It is easy to double your score in such reactive-control games as Space Invaders, Breakout or Video Pinball. But it's not the case for Montezuma's Revenge, for instance. And RL success in reactive-control games hides our inferiority in games which require reasoning: Montezuma's Revenge or, even, Ms. PacMan.

It would be cool to see a progress in one of these kind of games using some revolutionary approaches. We do have to avoid feature engineering, though.
In this sense, I like Microsoft's focus on Minecraft and DeepMind's work on Starcraft II. Hope to see something interesting soon.

References

[1] Henderson, Peter, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. "Deep Reinforcement Learning that Matters." arXiv preprint arXiv:1709.06560 (2017), link

[2] Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013), link

[3] Machado, Marlos C., Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents." arXiv preprint arXiv:1709.06009 (2017), link

[4] Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533, link

[5] Kurin, Vitaly, Sebastian Nowozin, Katja Hofmann, Lucas Beyer, and Bastian Leibe. "The Atari Grand Challenge Dataset." arXiv preprint arXiv:1705.10998 (2017), link

[6] Nair, Arun, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam et al. "Massively parallel methods for deep reinforcement learning." arXiv preprint arXiv:1507.04296 (2015), link

[7] Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016, link

[8] Wang, Ziyu, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. "Dueling network architectures for deep reinforcement learning." arXiv preprint arXiv:1511.06581(2015). link

[9] Pritzel, Alexander, Benigno Uria, Sriram Srinivasan, Adrià Puigdomènech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. "Neural Episodic Control." arXiv preprint arXiv:1703.01988 (2017). link

[10] Hester, Todd, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris et al. "Learning from Demonstrations for Real World Reinforcement Learning." arXiv preprint arXiv:1704.03732 (2017). link

[11] Bellemare, Marc G., Yavar Naddaf, Joel Veness, and Michael Bowling. "The Arcade Learning Environment: An evaluation platform for general agents." J. Artif. Intell. Res.(JAIR) 47 (2013): 253-279, link