Reinforcement Learning

Introduction

In this chapter we will learn the basics of reinforcement learning (RL), a branch of machine learning concerned with taking a sequence of actions in order to maximize some reward.

Basically, an RL agent does not know anything about the environment at first; it learns what to do by exploring it. It takes actions and receives states and rewards. The agent can only change its environment through actions.

One of the big difficulties of RL is that some actions take time to produce a reward, and learning these dynamics can be challenging. Also, the reward received from the environment may not be related to the last action, but to some action taken in the past.

Some concepts:

  • Agents take actions in an environment and receive states and rewards

  • Goal is to find a policy $\pi(s) = \text{action}$ that maximizes the utility function $U(s)$

  • Inspired by research in psychology and animal learning

Here we don't know which actions will produce rewards, nor when an action will produce them; sometimes you take an action whose reward only arrives much later.

Basically, everything is learned through interaction with the environment.
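To make this loop concrete, here is a minimal sketch of the agent-environment interaction in Python. The environment object (with `reset()` and `step(action)` methods returning `(next_state, reward, done)`) and the random `choose_action` helper are assumptions for illustration, not a specific library's API.

```python
# Minimal sketch of the agent-environment interaction loop.
# The env object and the random policy are illustrative assumptions.
import random

ACTIONS = ["slow", "fast"]

def choose_action(state):
    # Placeholder policy: act randomly until we have learned something better.
    return random.choice(ACTIONS)

def run_episode(env, max_steps=100):
    state = env.reset()                         # observe the initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(state)           # agent acts on the environment
        state, reward, done = env.step(action)  # environment returns new state and reward
        total_reward += reward                  # accumulate reward over the episode
        if done:
            break
    return total_reward
```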

Reinforcement learning components:

  • Agent: Our robot

  • Environment: The game, or where the agent lives.

  • A set of states $s \in S$

  • Policy: A map from states to actions

  • Reward Function $R(s,a,s')$: Gives the immediate reward for each transition

  • Value Function: Gives the total amount of reward the agent can expect to accumulate starting from a particular state. With the value function you can derive a policy.

  • Model $T(s,a,s')$ (Optional): Used to do planning, instead of the simple trial-and-error approach common to reinforcement learning.

    Here $s'$ means the possible next state after we take action $a$ in state $s$ (a small sketch of these components in code follows this list).
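As a rough illustration of how these components can be represented, here is a sketch of a tiny MDP using plain Python dictionaries. The states and actions echo the fast/slow racing example used later in this chapter, but all probabilities and reward values are made up for illustration.

```python
# Sketch: representing the RL/MDP components for a toy problem.
# All numbers below are illustrative, not taken from any real task.

states = ["cool", "warm", "overheated"]          # set of states S
actions = ["slow", "fast"]                       # set of actions A

# Policy pi(s): a map from state to action.
policy = {"cool": "fast", "warm": "slow", "overheated": "slow"}

# Reward function R(s, a, s'): immediate reward for a transition.
def R(s, a, s_next):
    if s_next == "overheated":
        return -10.0                             # overheating is bad
    return 2.0 if a == "fast" else 1.0           # going fast pays more

# Model T(s, a, s'): probability of landing in s' after doing a in s.
# Only needed for planning; a pure trial-and-error agent never sees it.
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"overheated": 1.0},
}
```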

There is a variant of reinforcement learning called Deep Reinforcement Learning, where you use neural networks as function approximators for the following (see the sketch after this list):

  • Policy (select the next action when you are in some particular state)

  • Value-Functions (Measure how good a state or state-action pair is right now)

  • The whole Model/World dynamics, so you can predict next states and rewards.
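As a rough sketch of what "neural network as function approximator" means here, the snippet below defines a small policy network that maps a state vector to a distribution over actions. The use of PyTorch, the layer sizes, and the state/action dimensions are illustrative assumptions.

```python
# Sketch: a small policy network mapping a state vector to action probabilities.
# PyTorch and the layer sizes are illustrative choices, not prescribed by the text.
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        # Return a probability distribution over actions for the given state.
        return torch.softmax(self.net(state), dim=-1)

# Usage: sample an action for a (made-up) 4-dimensional state.
policy = PolicyNetwork(state_dim=4, num_actions=2)
probs = policy(torch.randn(4))
action = torch.multinomial(probs, num_samples=1).item()
```

A value network looks the same except that the final layer outputs a single scalar (the value of the state) instead of a distribution over actions.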

In our minds we will still assume that there is a Markov decision process (MDP), which has:

  • A set of states $s \in S$

  • A set of actions $a \in A$

  • A transition model $T(s,a,s')$

  • A reward function $R(s,a,s')$

We're looking for a policy $\pi(s)$, that is, a map that gives us the optimal action for every state.

The only problem is that we no longer explicitly have $T(s,a,s')$ or $R(s,a,s')$, so we don't know which states are good or what the actions do. The only way to learn those things is to try them out and learn from our samples.

In reinforcement learning we know that we can move fast or slow (actions) and whether we're cool, warm, or overheated (states), but we don't know what our actions do in terms of how they change states.

Offline (MDPs) vs Online (RL)

Another difference is that a normal MDP planning agent finds the optimal solution by means of search and simulation (planning), while an RL agent learns from trial and error, so it will do something bad before it knows it should not do it. Also, to learn that something is really bad or good, the agent will have to repeat it many times.
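To make the contrast concrete, here is a sketch of offline planning with value iteration, which assumes the model $T$ and reward $R$ are known in advance (in the toy dictionary format from the earlier sketch); an RL agent, by contrast, has to estimate these quantities from experience. The discount factor and iteration count are illustrative.

```python
# Sketch: offline planning with value iteration (model T and R are known).
# Expects T, R, states, actions in the toy format from the earlier sketch.

def value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        new_V = {}
        for s in states:
            q_values = []
            for a in actions:
                outcomes = T.get((s, a), {})     # terminal states have no outcomes
                if outcomes:
                    # Expected discounted return of taking a in s, then acting optimally.
                    q = sum(p * (R(s, a, s2) + gamma * V[s2])
                            for s2, p in outcomes.items())
                    q_values.append(q)
            new_V[s] = max(q_values) if q_values else 0.0
        V = new_V
    return V
```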

How it works

We're going to learn an approximation of what the actions do and what rewards we get through experience. For instance, we could take random actions and record the transitions and rewards we observe.
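One simple way to do this, sketched below under the assumption that we can gather transitions as `(state, action, reward, next_state)` tuples, is to count how often each outcome occurs and average the rewards; the function and variable names are made up for illustration.

```python
# Sketch: estimating the model from experience gathered with random actions.
# Transitions are (state, action, reward, next_state) tuples; names are illustrative.
from collections import defaultdict

def estimate_model(transitions):
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': visit count}
    reward_sums = defaultdict(float)                 # (s, a, s') -> summed reward
    for s, a, r, s2 in transitions:
        counts[(s, a)][s2] += 1
        reward_sums[(s, a, s2)] += r

    T_hat, R_hat = {}, {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        # Empirical transition probabilities and average rewards.
        T_hat[(s, a)] = {s2: n / total for s2, n in outcomes.items()}
        for s2, n in outcomes.items():
            R_hat[(s, a, s2)] = reward_sums[(s, a, s2)] / n
    return T_hat, R_hat
```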

Late Reward

We push the agent to collect good rewards as soon as possible by discounting rewards over time with a factor $\gamma$. Basically, you modulate how much of a rush your agent is in by making the per-step reward $R(s)$ more negative.

You can also change the behavior of your agent by changing the amount of time (horizon) it has.
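A small worked example of how the discount factor shapes the return: the same reward sequence is worth less when later rewards are discounted more heavily. The reward values and discount factors below are illustrative.

```python
# Sketch: discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
# Reward values and discount factors are illustrative.

def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 10.0]                 # a big reward that arrives late
print(discounted_return(rewards, gamma=0.99))   # ~12.67: the late reward still matters
print(discounted_return(rewards, gamma=0.50))   # ~3.00: the agent is "in a rush"
```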

Exploration and Exploitation

One of the problems of Reinforcement learning is the exploration vs exploitation dilemma.

  • Exploitation: Make the best decision with the knowledge we already have (e.g. choose the action with the biggest Q-value)

  • Exploration: Gather more information by doing different (stochastic) actions from known states.

Example: Restaurant

  • Exploitation: Go to your favorite restaurant when you are hungry (gives a known reward)

  • Exploration: Try a new restaurant (could give more reward)

One technique to always keep exploring a bit is $\epsilon$-greedy exploration, where with some small probability $\epsilon$ we take a random action instead of the greedy one.
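A minimal sketch of $\epsilon$-greedy action selection, assuming a table of Q-values indexed by `(state, action)`; the table layout and the value of $\epsilon$ are illustrative assumptions.

```python
# Sketch: epsilon-greedy action selection over a Q-value table.
# The Q-table layout and epsilon value are illustrative assumptions.
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                     # explore: random action
    # Exploit: action with the biggest Q-value in this state.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```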
