Q_Learning_Simple
Q-Learning is a model-free reinforcement learning technique. Here we are interested in finding, through experience with the environment, the action-value function Q. Once the Q function is found, we can obtain the optimal policy simply by selecting, in each state, the action that gives the biggest expected utility (reward).
Consider a robot that needs to learn how to leave a house by the best possible path. In this example we have a house with 5 rooms, plus one "exit" room.
Above we show the house floor plan and also a graph representing it. In this graph the rooms are nodes, and the arrows are the actions that can be taken from each node. The arrow values are the immediate rewards the agent receives by taking a given action in a specific room. We choose our reinforcement learning environment to give a reward of 0 for every room that is not the exit room; on our target room we give a reward of 100.
To summarize:
Actions: 0, 1, 2, 3, 4, 5
States: 0, 1, 2, 3, 4, 5
Rewards: 0, 100
Goal state: 5
We can represent all these mechanics in a reward table.
In this table the rows represent the rooms (states) and the columns the actions. The values in this matrix are the rewards; the value -1 indicates that a specific action is not available.
For example, looking at row 4 of this matrix gives the following information:
Available actions: go to room 0, 3 or 5
Possible rewards from room 4: 0 (room 4 to 0), 0 (room 4 to 3), 100 (room 4 to 5)
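To have something concrete to refer to, here is one possible encoding of this reward table in Python/NumPy. The rows for rooms 1, 4 and 5 follow the description in this chapter; the remaining connections (rooms 0 and 2 having a single door, to rooms 4 and 3 respectively, and room 3 connecting to 1, 2 and 4) are read off the house graph, so treat this as a sketch rather than the original listing.

```python
import numpy as np

# Reward matrix R: rows are current rooms (states), columns are target rooms (actions).
# -1 marks an action that is not available from that room.
R = np.array([
    [-1, -1, -1, -1,  0, -1],   # room 0: only door goes to room 4 (assumed from the house graph)
    [-1, -1, -1,  0, -1, 100],  # room 1: can go to room 3 or to room 5 (the exit)
    [-1, -1, -1,  0, -1, -1],   # room 2: only door goes to room 3 (assumed from the house graph)
    [-1,  0,  0, -1,  0, -1],   # room 3: can go to rooms 1, 2 or 4
    [ 0, -1, -1,  0, -1, 100],  # room 4: can go to rooms 0, 3 or 5 (the exit)
    [-1,  0, -1, -1,  0, 100],  # room 5: can go to rooms 1, 4 or stay at 5
])
```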
The whole point of Q-learning is that the matrix R is available only to the environment; the agent needs to learn its contents by itself through experience.
What the agent has is a Q matrix that encodes states, actions and rewards. It is initialized with zeros and, through experience, comes to resemble the matrix R.
As seen in previous chapters, once we have found the Q matrix we have an optimal policy and we're done: we just use the Q table to choose the action that gives the best expected reward. You can also think of the Q table as the memory of what the agent has learned so far through its experiences.
Just as a quick reminder, let's describe the steps of the Q-Learning algorithm:
Initialize the Q matrix with zeros
For each episode (a sequence of actions that starts at some initial state and ends at the goal state):
Select a random initial state
While the current state is not the goal state:
Select a random possible action for the current state
Using this possible action, consider going to this next state
Get the maximum Q value for this next state (over all actions available from it)
Update Q(current state, action) = R(current state, action) + gamma * max[Q(next state, all actions)], where gamma is a discount factor between 0 and 1
Set the current state to this next state
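To make these steps concrete, here is a minimal sketch of the training loop in Python/NumPy, using the R matrix sketched earlier. The function name train_q, the discount factor gamma = 0.8 and the episode count are illustrative assumptions (a gamma of 0.8 is consistent with the maximum Q value of 500 mentioned further below).

```python
import numpy as np

def train_q(R, goal_state=5, gamma=0.8, episodes=1000):
    """Learn a Q table for the reward matrix R by running random episodes."""
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = np.random.randint(n_states)      # random initial state
        while state != goal_state:
            # Actions with reward >= 0 are the ones available from this state
            possible = np.where(R[state] >= 0)[0]
            action = np.random.choice(possible)
            next_state = action                  # taking action "a" means moving to room "a"
            # Q-learning update: immediate reward plus discounted best value of the next state
            Q[state, action] = R[state, action] + gamma * Q[next_state].max()
            state = next_state
    return Q

Q = train_q(R)   # R is the reward matrix sketched earlier
```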
After we have a good Q table we just need to follow it:
1. Set current state = initial state.
2. From the current state, find the action with the highest Q value.
3. Set current state = next state (the state reached by the action chosen in step 2).
4. Repeat steps 2 and 3 until the current state is the goal state.
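These steps amount to a small greedy loop. The following Python sketch (the name follow_q is just illustrative) expects a Q table such as the one produced by the training sketch above:

```python
import numpy as np

def follow_q(Q, start_state, goal_state=5):
    """Follow the learned Q table greedily from start_state until the goal is reached."""
    state = start_state
    path = [state]
    while state != goal_state:
        # Action with the highest Q value; ties are broken by the lowest room index
        state = int(np.argmax(Q[state]))
        path.append(state)
    return path
```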
Let's exercise what we learned so far by doing a few episodes by hand, taking "room 1" as our initial state. As we don't have any prior experience, we start our Q matrix with all entries set to zero.
Now take a look at our reward table (part of the environment).
Starting from state "room 1" (the second row), the only available actions are 3 (reward 0) and 5 (reward 100); imagine that we randomly choose action 5.
From this next state, 5, there are 3 possible actions: 1, 4 or 5, with rewards 0, 0 and 100; these are all the non-negative values in row 5. We want the biggest Q value among these possible actions, i.e. max[Q(5,1), Q(5,4), Q(5,5)]. But remember that at this point the Q table is still filled with zeros.
As the new state, 5, is the goal state, we finish our episode. At the end of this episode the Q matrix will be:
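In other words, assuming a discount factor of 0.8 (the exact value does not matter here, since the Q values for state 5 are still zero), the update for this transition is Q(1,5) = R(1,5) + 0.8 * max[Q(5,1), Q(5,4), Q(5,5)] = 100 + 0.8 * 0 = 100, so the only nonzero entry in the Q matrix is now Q(1,5) = 100.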
For the next episode we randomly select "room 3" as the initial state. Looking at the R matrix, there are 3 possible actions from this state, and by chance we choose action 1.
By selecting action 1, our next state will be room 1, which has its own set of possible actions (row 1 of the R matrix: actions 3 and 5).
Now let's update our Q table:
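With the same assumed discount factor of 0.8, the update for this step is Q(3,1) = R(3,1) + 0.8 * max[Q(1,3), Q(1,5)] = 0 + 0.8 * 100 = 80, so the Q matrix now has two nonzero entries: Q(1,5) = 100 and Q(3,1) = 80.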
As our new current state, 1, is not the goal state, we continue the process. By chance, from the possible actions of state 1, we choose action 5. For the resulting next state 5 the possible actions are 1, 4 and 5 [Q(5,1), Q(5,4), Q(5,5)]; unfortunately we have not computed these values yet (they are still zero), so the Q matrix remains unchanged. Since state 5 is the goal state, this episode ends here as well.
After a lot of episodes our Q matrix can be considered to have converged; in this case Q will look like this:
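The converged table itself is not reproduced here, but solving the update rule with the assumed discount factor of 0.8 and the R matrix sketched earlier gives values like the following (note that the largest entry is 500, which matches the normalization described next):

```python
# Converged Q table for gamma = 0.8 (an assumption); entries for unavailable actions stay at 0
Q_converged = np.array([
    [  0,   0,   0,   0, 400,   0],   # room 0
    [  0,   0,   0, 320,   0, 500],   # room 1
    [  0,   0,   0, 320,   0,   0],   # room 2
    [  0, 400, 256,   0, 400,   0],   # room 3
    [320,   0,   0, 320,   0, 500],   # room 4
    [  0, 400,   0,   0, 400, 500],   # room 5
])
```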
If we divide all of the nonzero elements by the greatest value (in this case 500), we normalize the Q table (this step is optional).
Now, with this good Q table, imagine that we start from state "room 2". If we keep choosing the action that gives the maximum Q value, we go from state 2 to 3, then from 3 to 1, then from 1 to 5, and stay at 5. In other words, we visit the rooms [2, 3, 1, 5].
Just one point to pay attention to: at state 3 we could choose either action 1 or action 4, because both have the same maximum value. We choose action 1, but the other action would give the same cumulative reward [0 + 0 + 100].
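Using the follow_q sketch from earlier together with the converged values above, this walk can be checked directly:

```python
print(follow_q(Q_converged, start_state=2))   # -> [2, 3, 1, 5]
```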
Now we're ready to put these things together on the computer. We're going to develop 2 functions, one representing our agent and the other the environment.
We start by modeling the environment, which is the part that receives an action from the agent and gives back as feedback an immediate reward and state information. This state is actually the state of the environment after the action is taken. Just to remember: if our system is Markovian, all the information necessary for choosing the best future action is encoded in this state.
Observe that the environment holds the matrix R and all the models needed to completely implement the universe. For example, the environment could be a game, a grid world, the real world, etc.
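The original listing for the environment is not included here, so the sketch below is only one possible shape for it (the signature and return values are assumptions). It reuses NumPy and the R matrix from the earlier sketch: it applies the chosen action and answers with the next state, the immediate reward and whether the goal was reached. A second small helper tells the agent which actions are legal, so the agent never has to read R directly.

```python
GOAL_STATE = 5

def simple_RL_enviroment(state, action):
    """Apply 'action' in 'state' and return (next_state, reward, done).

    R is the reward matrix sketched earlier; it belongs to the environment only.
    """
    if R[state, action] < 0:
        raise ValueError("Action %d is not available from state %d" % (action, state))
    next_state = action                  # moving to room "action"
    reward = R[state, action]            # immediate reward taken from R
    done = next_state == GOAL_STATE      # the episode ends at the exit room
    return next_state, reward, done

def available_actions(state):
    """Also provided by the environment: the legal moves from 'state'."""
    return np.where(R[state] >= 0)[0]
```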
Now, on the agent side, we must train to gather experience from the environment. After this we use our learned Q table to act optimally, which means simply following the Q table, looking for the actions that give the maximum expected reward. As you might expect, the agent interfaces with the external world through the "simple_RL_enviroment" function.
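Again, the actual agent listing is not shown here. The sketch below mirrors the train_q loop from earlier, but all feedback now comes through simple_RL_enviroment and available_actions; the parameters (gamma = 0.8, 1000 episodes) remain illustrative assumptions.

```python
def agent_train(n_states=6, goal_state=5, gamma=0.8, episodes=1000):
    """Fill the Q table purely by interacting with the environment functions."""
    Q = np.zeros((n_states, n_states))        # one column per "go to room" action
    for _ in range(episodes):
        state = np.random.randint(n_states)   # random starting room
        done = state == goal_state
        while not done:
            action = np.random.choice(available_actions(state))
            next_state, reward, done = simple_RL_enviroment(state, action)
            # Same Q-learning update as before, using the reward reported by the environment
            Q[state, action] = reward + gamma * Q[next_state].max()
            state = next_state
    return Q

Q = agent_train()
path = follow_q(Q, start_state=2)   # e.g. [2, 3, 1, 5], as in the walkthrough above
```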