Deep Q Learning

Introduction

In the previous chapter we learned about "old school" Q learning, where we used matrices (tables) to represent our Q values. This implies that you know in advance at least how many states (rows) your environment has; the problem is that sometimes this is not true.

In such a Q table we can say that the expectation of winning the game by taking action 1 in state 3 is 80. Bigger Q values mean that we expect to win.

We also learned that reinforcement learning is about learning how to behave in some environment where our only feedback is sparse and time-delayed "labels" called rewards. They are time-delayed because there are cases where the environment only tells you whether your action was good or bad some time after you actually moved.

Some Formalization Before Continuing

Episode:

Assuming our environment is "Markovian", which means that the current state encodes everything needed to take a decision, we define an episode (game) as a finite sequence of states, actions and rewards.

$$\{s_0,a_0,r_1\},\{s_1,a_1,r_2\},\{s_2,a_2,r_3\},\dots,\{s_{n-1},a_{n-1},r_n\},s_{terminal}$$

For instance, here $r_1$ is the reward we get by taking action $a_0$ in state $s_0$. An episode always finishes on an end (terminal) state: game over.

Policy

A policy is defined as a "map" from states to actions, and the reinforcement learning objective is to find the optimal policy. An optimal policy can be derived from a Q function.

Return or Future Total Reward

We define the return as the sum of all immediate rewards in an episode, which means the sum of rewards until the episode finishes.

$$R_t=r_1+r_2+r_3+\dots+r_n \;\;\therefore\;\; R_t=\sum_{t=0}^{N-1} r_{t+1}$$

Future Discounted Reward

To give some flexibility on how much we care about the future, we add to our total reward a parameter called gamma ($\gamma$). If $\gamma=0$ all future rewards are discarded (only the immediate reward counts); if $\gamma=1$ all future rewards are fully considered.

$$R_t=\sum_{t=0}^{N-1} \gamma^t \, r_{t+1}$$
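As a quick worked example, here is a tiny Python function that computes this discounted sum for a made-up list of rewards (the values and the function name are illustrative only):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_{t+1} over an episode's immediate rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 0.0, 5.0]            # made-up immediate rewards
print(discounted_return(rewards, gamma=1.0))   # 6.0  -> plain (undiscounted) sum
print(discounted_return(rewards, gamma=0.0))   # 0.0  -> only the first reward counts
print(discounted_return(rewards, gamma=0.9))   # 0.9^2*1 + 0.9^4*5 ~= 4.09
```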

Q function

We define the function Q(s,a) as the maximum expected "future discounted reward" that we can get if we take action "a" in state "s" and continue optimally from that point until the end of the episode. "Continue optimally" means that we keep choosing actions from the policy derived from Q.

$$Q(s_t,a_t)=\max(R_{t+1})$$

Below we show how to derive a policy from the Q function; basically we pick the action that gives the biggest Q value in state "s":

$$\pi(s)=\operatorname{argmax}_a\, Q(s,a)$$

The function "maxamax_a" will search for the action "a" that maximizes Q(s,a) You can think that the Q function is "the best possible score at the end of the game after performing action a in state s". So if you have the Q function you have everything needed to win all games!.

Greedy policy

If we ALWAYS choose the action that maximizes the "discounted future reward", we are acting greedily. This means that we are not exploring, and we could miss some better actions. This is called the exploration-exploitation problem. To deal with it we use an algorithm called $\epsilon$-greedy, where with a small probability $\epsilon$ we choose a completely random action from time to time.
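A minimal sketch of $\epsilon$-greedy action selection in Python (the function name and the Q values below are illustrative, not from any particular implementation):

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=np.random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.rand() < epsilon:
        return rng.randint(len(q_values))   # explore: completely random action
    return int(np.argmax(q_values))         # exploit: action with the biggest Q value

q_values = np.array([1.2, 0.3, 80.0, -4.0])  # made-up Q values for one state
action = epsilon_greedy_action(q_values, epsilon=0.1)
```

In practice $\epsilon$ is usually annealed from 1.0 down to a small value as training progresses, so the agent explores a lot at the beginning and mostly exploits later.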

How to get the Q function

Basically, we learn the Q function from experience, using an iterative update based on the Bellman equation.

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha\left[r+\gamma\,\max_{a_{t+1}} Q(s_{t+1},a_{t+1})-Q(s_t,a_t)\right]$$
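For reference, a minimal sketch of this tabular update with numpy (the state/action indices and hyper-parameters are made up; the `done` flag for terminal transitions is an addition not shown in the formula above):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done=False, alpha=0.1, gamma=0.9):
    """One Bellman-style update on a tabular Q of shape (n_states, n_actions)."""
    # For terminal transitions there is no future reward to bootstrap from.
    td_target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])   # move Q(s,a) a bit towards the target
    return Q

Q = np.zeros((5, 2))                           # toy table: 5 states, 2 actions
Q = q_learning_update(Q, s=3, a=1, r=1.0, s_next=4)
```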

Deep Q Network

Deep Q learning is about using deep learning techniques to represent the Q table (i.e. approximating the Q function with a neural network). It is also a kind of recipe for applying Q learning to games.

Input

The first thing in this recipe is to get our input; as you may imagine, we take the information directly from the screen. To add some notion of time (motion), we actually use 4 consecutive screens.

  1. Get the images

  2. Rescale to 84x84

  3. Convert to 8-bit grayscale (256 gray levels).

Now our environment state is an 84x84x4 stack of 8-bit screens. With 256 gray levels per pixel this gives $256^{84 \times 84 \times 4}$ possible states, far too many to enumerate in a table. To find structure in this data we will need a convolutional neural network!
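A minimal sketch of this preprocessing, assuming OpenCV for the resizing and grayscale conversion (the function names and the interpolation mode are illustrative choices, not prescribed by the paper):

```python
import numpy as np
import cv2   # OpenCV, used here only for grayscale conversion and resizing

def preprocess_frame(frame):
    """RGB game screen -> 84x84 8-bit grayscale image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA).astype(np.uint8)

def stack_frames(last_4_frames):
    """Stack the 4 most recent raw screens into an 84x84x4 state tensor."""
    return np.stack([preprocess_frame(f) for f in last_4_frames], axis=-1)
```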

Adding a Convolutional Neural Network

Now we feed this 84x84x4 tensor to a CNN. The model has one output per action, each representing the Q-value of that action. So with a single forward pass we get the Q values of all possible actions for the particular state encoded in the screens. It's worth pointing out that this is not a classification problem but a regression: each Q-value output is a scalar number, not a class.

In the Atari paper this CNN had the following structure:
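The table with the exact layer sizes is in the paper; as a reference, a sketch roughly following the later Nature-DQN layout could look like this in tf.keras. Treat the filter counts, strides and the 512-unit dense layer as assumptions taken from the 2015 paper, not as a transcription of this chapter's table:

```python
import tensorflow as tf

def build_q_network(num_actions):
    """84x84x4 stacked screens in, one Q value per action out (no pooling layers)."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(84, 84, 4)),
        tf.keras.layers.Rescaling(1.0 / 255.0),                        # bytes -> [0, 1]
        tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu"),
        tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(num_actions, activation="linear"),       # regression outputs
    ])
```

The last layer is linear because each output is an unbounded scalar Q value, one per action.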

Notice that there are no pooling layers; this is because we want to retain the spatial information (where things are on the screen matters when choosing an action).

Loss function

As our CNN is a regression model, our loss function will be a "squared error loss":

$$Loss=\frac{1}{2}\left[\underbrace{r+\gamma\,\max_{a_{t+1}} Q(s_{t+1},a_{t+1};\theta_{t-1})}_{\text{target}}-\underbrace{Q(s_t,a_t;\theta)}_{\text{prediction}}\right]^2$$
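To make the formula concrete, here is the target and the loss for a single transition written with numpy (the numbers are made up; in the real setup both Q values come from forward passes of the network):

```python
import numpy as np

gamma = 0.99
r = 1.0                                    # reward observed for this transition
q_prediction = 2.5                         # Q(s_t, a_t; theta) for the action we took
q_next = np.array([1.0, 3.0, 0.5])         # Q(s_{t+1}, . ; theta_{t-1}) from the old weights

target = r + gamma * np.max(q_next)        # r + gamma * max_a' Q(s', a')
loss = 0.5 * (target - q_prediction) ** 2  # squared error between target and prediction
```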

How to get the Deep Q function

Now, to iterate and approximate the real Q values using deep learning, we follow the steps below (a code sketch of these steps comes right after the list):

  1. Do a forward pass for the current state $s_t$ (stacked screens) to get the Q values of all possible actions.

  2. Do a forward pass on the next state $s_{t+1}$ and find $\max_{a_{t+1}} Q(s_{t+1},a_{t+1})$ (the biggest Q value over the next actions).

  3. Set the Q-value target for the action taken to $r + \gamma \max_{a'} Q(s', a')$ (use the max calculated in step 2). For all other actions, set the Q-value target to the value returned in step 1, making the error 0 for those outputs.

  4. Use back-propagation with mini-batch stochastic gradient descent to update the network.
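Putting these steps into code, a minimal sketch of one update on a mini-batch with tf.keras could look like the following. The function name, the 0/1 `dones` array for terminal transitions and the hyper-parameters are assumptions, not from the original code:

```python
import numpy as np
import tensorflow as tf

def train_step(model, optimizer, states, actions, rewards, next_states, dones, gamma=0.99):
    """One DQN update on a mini-batch of transitions, following steps 1-4 above."""
    # Step 2: forward pass on s_{t+1} and take the biggest Q value per sample.
    next_q = model.predict(next_states, verbose=0)    # shape: (batch, n_actions)
    max_next_q = np.max(next_q, axis=1)

    # Steps 1 and 3: forward pass on s_t, then overwrite the target only for the taken action.
    targets = model.predict(states, verbose=0)        # other outputs keep their value -> error 0
    targets[np.arange(len(actions)), actions] = (
        rewards + gamma * max_next_q * (1.0 - dones)  # for terminal transitions the target is just r
    )

    # Step 4: back-propagation with the squared-error loss.
    with tf.GradientTape() as tape:
        predictions = model(states, training=True)
        loss = tf.reduce_mean(tf.square(targets - predictions))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return float(loss)
```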

Problems with this approach

Unfortunately, this approach has some problems that we need to solve. Before talking about the possible solutions, let's check what the problems are.

  • Exploration-exploitation issue: this one is easy, we just add some random actions along the way (i.e. use $\epsilon$-greedy).

  • Local minima: during training we are going to see a lot of screens that are highly correlated, which may drive the network to learn just a replay of the episode. To solve this we need to somehow shuffle the input mini-batch with other playing data (of course from the same game, but from different moments in time).

Experience Replay

As mentioned, we need to break the similarity of consecutive frames during our update step. To do this we store all game experiences during the episode inside a "replay memory"; then, during training, we take random mini-batches from this memory. You can also add some human experience by inserting human-played episodes into the replay memory.
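A minimal sketch of such a replay memory (the class name and default capacity are illustrative):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Random mini-batch, which breaks the correlation between consecutive frames."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```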

Complete Recipe
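Putting the previous sketches together, a rough training loop could look like the one below. It assumes a Gym-style environment (reset() returning the first frame, step() returning observation, reward, done, info) and reuses the hypothetical helpers sketched earlier in this chapter (preprocess_frame, build_q_network, ReplayMemory, epsilon_greedy_action, train_step); none of these names come from the linked TensorFlow example.

```python
import numpy as np
import tensorflow as tf

def train_dqn(env, num_episodes=1000, batch_size=32, gamma=0.99, epsilon=0.1):
    """Rough DQN loop over a Gym-style env, using the helper sketches from above."""
    model = build_q_network(env.action_space.n)
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
    memory = ReplayMemory()

    for episode in range(num_episodes):
        frame = env.reset()
        frames = [preprocess_frame(frame)] * 4                 # bootstrap the 4-frame stack
        state = np.stack(frames, axis=-1)
        done = False
        while not done:
            q_values = model.predict(state[np.newaxis], verbose=0)[0]
            action = epsilon_greedy_action(q_values, epsilon)  # epsilon-greedy exploration

            frame, reward, done, _ = env.step(action)
            frames = frames[1:] + [preprocess_frame(frame)]    # slide the 4-frame window
            next_state = np.stack(frames, axis=-1)

            memory.push(state, action, reward, next_state, float(done))
            state = next_state

            if len(memory) >= batch_size:                      # experience-replay update
                batch = memory.sample(batch_size)
                s, a, r, s2, d = map(np.array, zip(*batch))
                train_step(model, optimizer, s, a, r, s2, d, gamma)
    return model
```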

A complete tensorflow example can be found here.
