Deep Reinforcement Learning


On this chapter we will learn the effects of merging Deep Neural Networks with Reinforcement learning. If you follow AI news you may heard about some stuff that AI is not capable to do without any specific programming:
  • Learn how to play atari from raw image pixels
  • Learn how to beat Go champions (Huge branching factor)
  • Robots learning how to walk
All those achievements fall on the Reinforcement Learning umbrella, more specific Deep Reinforcement Learning. Basically all those achievements arrived not due to new algorithms, but due to more Data and more powerful resources (GPUs, FPGAs, ASICs). Examples:
  • Atari: Old Q-Learning algorithm but with a CNN as function aproximator (Since 1988 people talk about standard RL with function approximators)
  • AlphaGo: Policy gradients that use Monte Carlo Tree Search (MCTS), which is pretty standard.
Nowadays Policy Gradients it's the favorite choice for attacking Reinforcement learning(RL) problems. Previously it was DQN (Deep Q learning). One advantage of Policy Gradients is because it can be learned end-to-end.

Policy Gradient

We will start our study with policy gradient using the pong game:
Here we have to possible actions (UP/DOWN) and our objective is make the ball pass the opponent paddle. Our inputs are going to be a 80x80x1 image pre-processed from a 210x160x3 image. Our game enviroment (openAI gym) will give a reward of +1 if you win the opponent, -1 if you lose or 0 otherwise.
Here each node is a particular game state, and each edge is a possible transition, also each edge can gives a reward. Our goal is to find a policy (map from states to actions) that will give us the best action to do on each state. The whole point is that we don't know this graph, otherwise the whole thing would be just a reflex agent, and also sometimes we cannot fit all this on memory.


On this pong example we do some operations on the original 210x160x3 (RGB) image to a simpler grayscale 80x80. This is done to simplify the neural network size.
Also to give some notion of time to the network a difference between the current image and the previous one is calculated. (On the DQN paper it was used a convolution neural network with 6 image samples)

Policy Network

The basic idea is to use a machine learning model that will learn a good policy from playing the game, and receiving rewards. So adding the machine learning part. We're going to define a policy network (ex: 2 layer neural network). This simple neural network will receive the entire image and output the probability of going up.
After the probability of going up is calculated we sample an action (UP/DOWN) from a uniform distribution.

Policy Gradient

One difference between supervised learning and reinforcement learning is that we don't have the correct labels to calculate the error to then back-propagate. Here is where the policy gradient idea comes. Imagine the following example:
During the training our policy network gave a "UP" probability of 30% (0.3), therefor our "DOWN" probability will be 100-30=70% (0.7). During our action sampling from this distribution the "DOWN" action was selected. Now what we do is choose a unit gradient "1" and wait for the possible rewards (0,+1,-1). This will modulate our gradient to 3 possible values:
  • Zero: Our weights will remain the same
  • Positive: Make the network more likely to repeat this action on the future
  • Negative: Make the network less likely to repeat this action on the future
So this is the magic of policy gradient, we choose a unit gradient and modulate with our rewards. After a lot of good/bad actions been selected and properly rewarded the policy network map a "rational/optimum" policy.

Implementation notes

Here we show how to initialize our policy network as a 2 layer neural network with the first layer having 200 neurons and the second output layer 1 neuron.
On the code above, D is our input difference image after pre-processing and H is the number of neurons on the hidden layer.
Bellow we have the python code for this network forward propagation (policy_forward), on this neural network we didn't use bias, but you could.
From the code above we have 2 set of weights W1,W2. Where W1 or the weights of the hidden layer can detect states from the images (Ball is above/bellow paddle), and W2 can decide what to do (action up/down) on those states. So the only problem now is to find W1 and W2 that lead to expert (rational) policy.
Actually we do a forward propagation to get the score for the "UP" action given an input image "x". Then from this "score" we "toss a biased coin" to actually do our atari joystick action (UP-2, Down-5).
By doing this we hope that some times we're going to do a good action and if we do this a lot of times, our policy network will learn.

Training the policy network

  1. 1.
    First we will initialize randomly W1,W2.
  2. 2.
    Then we will play 20 games (one eposide).
  3. 3.
    Keep track of all games and their result (win/loose)
  4. 4.
    After a configured number of episodes we update our policy network
Assuming that each game has 200 frames, and each episode has 20 games, we have to make 4000 decisions (up/down) per episode. Suppose that we run 100 games 12 we win (+1) and 88 we loose(-1). We need to take all the 12x200=2400 decisions and do a positive(+1) update. (This will encourage do do those actions in the future for the same detected states). Now all the 88x200=17600 decisions that make us loose the game we do a negative(-1) update.
The network will now become slightly more likely to repeat actions that worked, and slightly less likely to repeat actions that didn't work.

Some words about Policy Gradient

The policy gradient, are the current (2016) state of the art, but is still very far from human reasoning. For example, you don't need to crash your car hundred of times to start avoiding it. Policy gradient need a lot of samples to start to internalize correct actions, and it must have constant rewards of it. You can imagine that you are doing a kind of "brute-force" search where on the beginning we jitter some actions, and accidentally we do a good one. Also this good actions should be repeated hundred of times before the policy network start to repeat good actions. That's why the agent had to train for 3 days before start to play really good.
When policy gradient will shine:
  • When the problem does not rely on very long-term planning
  • When it receive frequent reward signals for good/bad actions.
Also if the action space start to be to complex (a lot of actions commands) there are other variants, like "deterministic policy gradients". This variant uses another network called "critic" that learns the score function.