DeepMind--Rainbow: Review

Introduction

Since the publication of Playing Atari with Deep Reinforcement Learning, DeepMind has developed six further extensions (the “rainbow” 🌈):

  1. Deep Reinforcement Learning with Double Q-learning - 2015
  2. Prioritized Experience Replay - 2015
  3. Dueling Network Architectures for Deep Reinforcement Learning - 2015
  4. Asynchronous Methods for Deep Reinforcement Learning - 2016
  5. Noisy Networks for Exploration - 2017
  6. A Distributional Perspective on Reinforcement Learning - 2017

The article published on October 6th, Rainbow: Combining Improvements in Deep Reinforcement Learning, tries to combine the above six methods. It shows promising and expected results. More interestingly, Figure 3 reveals the marginal improvement offered by each extension.

Which extensions stand out?

[Figure 3 of the Rainbow paper: marginal improvement offered by each extension]

We can tell from Figure 3:

  • N-step returns, prioritized replay, and the distributional Bellman update generate the biggest marginal improvements individually.
  • Dueling and Double Q offer the smallest improvement among all extensions.

We know that:

  • Dueling decomposes the state-action value into a state value plus an advantage term.
  • Double Q counters the overestimation of state-action values.
  • Basically, Dueling and Double Q address the same issue: estimating the Q value more accurately against its defined target (i.e., reducing variance and bias). Both are sketched in the equations below.
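For quick reference, each idea can be written in one line (notation lightly simplified from the Double Q-learning and Dueling papers; θ is the online network and θ⁻ the target network):

```latex
% Dueling: decompose the action value into a state value plus a mean-centred advantage
Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')

% Double Q-learning target: the online network \theta selects the action,
% the target network \theta^- evaluates it
y_t = r_{t+1} + \gamma \, Q\big(s_{t+1}, \operatorname*{arg\,max}_{a} Q(s_{t+1}, a; \theta); \theta^-\big)
```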

We also know that:

  • N-step returns let the agent bootstrap from the rewards actually collected over the next n steps rather than from a single step.
  • Prioritized replay replays surprising transitions (e.g., rare reward signals) more often, which speeds up learning.
  • The distributional Bellman update treats the “value function” as a distribution over returns rather than a single expected value.
  • These three extensions aim to create/select a better target for the action-value function to learn from (written out below).
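Concretely, the targets these three extensions build can be written roughly as follows (notation simplified from the respective papers):

```latex
% n-step return target: accumulate real rewards for n steps, then bootstrap
y_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k \, r_{t+k+1} + \gamma^n \max_{a} Q(s_{t+n}, a; \theta^-)

% Prioritized replay: sample transition i in proportion to its priority p_i
% (derived from its TD error), sharpened by the exponent \alpha
P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}

% Distributional Bellman update: the return Z is a random variable, not a scalar
T Z(s, a) \overset{D}{=} r + \gamma \, Z(s', a')
```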

Conclusion

With that said, and combining the results of Figure 3, we can conclude that for a value-based method, having a better target to learn from is generally more effective than learning the target better.

Further, adding a final noisy dense layer does help exploration (i.e., finding better targets to learn from) in general.
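For intuition on what a “noisy dense layer” is, here is a minimal sketch of a factorised-Gaussian noisy linear layer in the spirit of Noisy Networks for Exploration. It assumes PyTorch; the class name, initialisation constants, and resampling the noise on every forward pass are illustrative choices, not DeepMind's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyLinear(nn.Module):
    """Linear layer whose weights and biases carry learnable Gaussian noise."""

    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Learnable means and noise scales for weights and biases.
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        bound = in_features ** -0.5
        nn.init.uniform_(self.weight_mu, -bound, bound)
        nn.init.uniform_(self.bias_mu, -bound, bound)
        nn.init.constant_(self.weight_sigma, sigma0 * bound)
        nn.init.constant_(self.bias_sigma, sigma0 * bound)

    @staticmethod
    def _scale(x):
        # Noise-shaping function f(x) = sgn(x) * sqrt(|x|) used for factorised noise.
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        # Factorised noise: one vector per input unit, one per output unit.
        eps_in = self._scale(torch.randn(self.in_features, device=x.device))
        eps_out = self._scale(torch.randn(self.out_features, device=x.device))
        weight = self.weight_mu + self.weight_sigma * (eps_out.unsqueeze(1) * eps_in.unsqueeze(0))
        bias = self.bias_mu + self.bias_sigma * eps_out
        return F.linear(x, weight, bias)
```

Because the noise scales are learned, the agent can shrink its own exploration where it is confident instead of relying on a hand-tuned ε-greedy schedule.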

[Figure 1 of the Rainbow paper: performance comparison of the baseline agents]

Also, Figure 1 shows that Prioritized DQN and Distributional DQN sit at about the same level, and that DDQN aided by Dueling beats plain DDQN, which in turn beats plain DQN. This implies Dueling and Double Q (both “learning the target better” methods) make a good combination, yet Dueling DDQN barely matches Prioritized DQN and Distributional DQN (which learn from a better target).

Going back to Figure 3, it seems Rainbow may only need one “learning the target better” method, and it would be interesting to see how Rainbow performs without Dueling and Double Q.