On-policy vs Off-policy RL

Reinforcement Learning is learning how to map situations to actions so as to maximise a numerical reward signal.
A policy is a function that maps states to actions.
There are two types of policies:
- Target policy: Policy used to optimise the decision making.
- Behaviour policy: Policy used to take actions in the environment or policy used to navigate the environment.
Off-policy RL algorithms have different behaviour and target policies, so they can decouple data collection from training.
On-policy RL algorithms use the same policy as both behaviour and target policy. The agent takes actions and learns using the same policy.

Say that the agent chooses the actions it executes in the environment at random, i.e., the behaviour policy is random.
We get the Q value for (S, right) using the Bellman equation $Q(S, \text{right}) = R + \max_{a} Q(S', a)$
Note that the action $a$ in the above equation is not executed. It is only selected by the target policy (here, the greedy maximum) in order to evaluate the next state $S'$.
For most off-policy algorithms,
- the target policy is greedy.
- the behaviour policy can be random, $ϵ$-greedy or greedy.
The value of Q(S, right) observed after taking the action is then used to update the target policy with a TD (temporal-difference) update.
Because collecting data and learning the target policy can be decoupled in this way, Q-Learning is an off-policy RL algorithm.

Q-Learning (off-policy TD control) for estimating $π \approx π_{*}$
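
The update named above can be sketched for a tabular setting as follows. This is a minimal illustration, not a reference implementation: the dictionary-based Q table, the state names `S1` and `S2`, the step size `alpha`, the discount `gamma` (set to 1 to match the undiscounted equation in these notes), and the function names are all assumptions made for the example.

```python
import random

def random_behaviour_policy(actions, rng):
    """Behaviour policy (assumed here to be uniformly random): picks the action that is executed."""
    return rng.choice(actions)

def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=1.0):
    """One Q-learning step for a (state, action, reward, next_state) transition.

    The target uses max over next actions (the greedy target policy).
    That maximising action is only evaluated, never executed.
    """
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    td_target = reward + gamma * best_next
    old_value = Q.get((state, action), 0.0)
    Q[(state, action)] = old_value + alpha * (td_target - old_value)
    return Q[(state, action)]

# Hypothetical two-action example; unseen entries default to 0.
actions = ["left", "right"]
Q = {("S2", "left"): 1.0, ("S2", "right"): 3.0}

rng = random.Random(0)
executed = random_behaviour_policy(actions, rng)  # data collection step

# Update for a transition where "right" was taken in S1 with reward 1.
new_value = q_learning_update(Q, "S1", "right", 1.0, "S2", actions)
```

Because the target depends only on the stored transition and the current Q table, the transition could have been produced by any behaviour policy, which is what makes the update off-policy.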

Sarsa (on-policy TD control) for estimating $Q \approx q_{*}$
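
For contrast with the max-based target in the Bellman equation earlier, a Sarsa step can be sketched as below. Again this is only an illustrative sketch under assumptions: the Q table as a dictionary, the state names, `alpha`, `gamma`, `epsilon`, and the helper names are hypothetical choices for the example.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Single policy used both to act and to learn (assumed epsilon-greedy)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.5, gamma=1.0):
    """One Sarsa step: the target uses the next action actually chosen
    by the same policy, not the maximum over actions."""
    td_target = reward + gamma * Q.get((next_state, next_action), 0.0)
    old_value = Q.get((state, action), 0.0)
    Q[(state, action)] = old_value + alpha * (td_target - old_value)
    return Q[(state, action)]

# Hypothetical two-action example; unseen entries default to 0.
actions = ["left", "right"]
Q = {("S2", "left"): 1.0, ("S2", "right"): 3.0}

# Suppose the policy explored and picked "left" in S2.
value = sarsa_update(Q, "S1", "right", 1.0, "S2", "left")
```

With the same transition, a max-based target would bootstrap from Q(S2, right) = 3, whereas Sarsa bootstraps from the action the policy really selected, here Q(S2, left) = 1. In a full loop, `next_action` would come from `epsilon_greedy` and would then be executed on the following step.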
