Introduction

  • Reinforcement Learning is learning how to map situations to actions so as to maximise a numerical reward signal.
  • A policy is a function that maps states to actions.
  • There are two types of policies:
    • Target policy: The policy being learned, i.e. the policy whose decision making we optimise.
    • Behaviour policy: The policy used to take actions in the environment, i.e. the policy used to navigate the environment and collect data.
  • Off-policy RL algorithms have different behaviour and target policies, so they can decouple data collection from training.
  • On-policy RL algorithms have the same behaviour and target policies: the agent takes actions and learns using the same policy.
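To make the distinction concrete, here is a minimal Python sketch of the two policy types. The Q-values and epsilon are made-up assumptions for illustration: a greedy target policy next to an ε-greedy behaviour policy over the same Q-table.

```python
import random

# Hypothetical Q-values for one state with two actions (illustrative only)
Q = {"left": 0.2, "right": 0.8}

def target_policy(Q):
    """Greedy target policy: always pick the highest-valued action."""
    return max(Q, key=Q.get)

def behaviour_policy(Q, epsilon=0.3):
    """Epsilon-greedy behaviour policy: explore with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(list(Q))   # explore: random action
    return target_policy(Q)             # exploit: greedy action

print(target_policy(Q))  # always "right"
```

In an off-policy algorithm, `behaviour_policy` generates the transitions while `target_policy` defines the update target; in an on-policy algorithm, one policy plays both roles.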

Q-Learning is an Off-Policy RL Algorithm

  • Say that the agent is randomly choosing actions to execute in the environment, i.e., the behaviour policy is random.
  • On taking the action "right" in state S, we update the Q-value for (S, right) using the Bellman-equation-based TD update:
    Q(S, right) ← Q(S, right) + α [R + γ max_a Q(S', a) − Q(S, right)]
  • Note that in the above equation, the action a is not taken: it is only selected by our target policy (the max over actions at S'), but never executed.
  • For most off-policy algorithms,
    • the target policy is greedy.
    • the behaviour policy can be random, ε-greedy or greedy.
  • The Q(S, right) value observed on taking the action is used to update our target policy via this TD method.
  • Since collecting data and learning the target policy are decoupled, Q-Learning is an off-policy RL algorithm.
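A single TD update can be worked through with concrete numbers. All values here (α, γ, the reward, and both Q-tables) are illustrative assumptions, not taken from any real environment:

```python
# Hypothetical values for one transition S --right--> S'
alpha, gamma = 0.5, 0.9                  # learning rate and discount (assumed)
reward = 1.0                             # reward observed for taking "right"
Q_S = {"left": 0.0, "right": 0.0}        # Q-values at state S
Q_S_next = {"left": 0.2, "right": 0.6}   # Q-values at next state S'

# The TD target uses the greedy (max) action at S' -- selected, never executed
td_target = reward + gamma * max(Q_S_next.values())   # 1.0 + 0.9 * 0.6 = 1.54
Q_S["right"] += alpha * (td_target - Q_S["right"])    # 0.0 + 0.5 * 1.54 = 0.77
```

Note that the behaviour policy plays no part in the target: whichever action the agent actually takes next, the update bootstraps from the greedy value at S'.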

Q-Learning (off-policy TD control) for estimating Q ≈ q*
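A minimal runnable sketch of the tabular Q-learning control loop. The environment is an assumed toy chain (states 0..4, reward 1 for reaching the goal on the right); the hyperparameters and episode count are also illustrative assumptions:

```python
import random

random.seed(0)

# Assumed toy chain environment: move left/right along states 0..4,
# reward 1.0 on reaching the goal state, 0 otherwise.
N_STATES, GOAL = 5, 4
ACTIONS = ("left", "right")

def step(s, a):
    s2 = min(s + 1, GOAL) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(200):                      # episodes
    s, done = 0, False
    while not done:
        # Behaviour policy: epsilon-greedy over current Q
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        # Target policy: greedy max over actions at S' (selected, not executed)
        best_next = max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

print(max(ACTIONS, key=lambda x: Q[(0, x)]))  # "right"
```

After training, the greedy policy extracted from Q heads right toward the goal, regardless of how exploratory the behaviour policy was.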

Sarsa (on-policy TD control) for estimating Q ≈ q*
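For contrast, a minimal runnable sketch of Sarsa on the same kind of assumed toy chain environment. The key difference from Q-learning is that the update bootstraps from Q(S', A'), where A' is the action the ε-greedy policy will actually execute next, so one policy both selects and executes actions:

```python
import random

random.seed(0)

# Assumed toy chain environment: move left/right along states 0..4,
# reward 1.0 on reaching the goal state, 0 otherwise.
N_STATES, GOAL = 5, 4
ACTIONS = ("left", "right")

def step(s, a):
    s2 = min(s + 1, GOAL) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def eps_greedy(Q, s, epsilon):
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda x: Q[(s, x)])

alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(200):                          # episodes
    s, done = 0, False
    a = eps_greedy(Q, s, epsilon)             # same policy selects AND executes
    while not done:
        s2, r, done = step(s, a)
        a2 = eps_greedy(Q, s2, epsilon)       # next action actually taken
        # On-policy update: bootstrap from the executed action A'
        Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
        s, a = s2, a2
```

Because the update target depends on the behaviour policy's own next action, data collection and learning cannot be decoupled, which is what makes Sarsa on-policy.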