Introduction

  • The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions. This is what creates the need for active exploration, for an explicit search for good behavior.
  • Why is exploration needed?

    • Exploitation alone cannot find the optimal policy, because the feedback we receive is purely evaluative; we have to actively try actions to learn how good they are.
  • The two main aspects of RL are

    • evaluative feedback, i.e., how good the action taken was, but not whether it was the best or worst action possible.
    • the associative property, i.e., the best action depends on the situation. The K-armed bandit problem setting helps in understanding the non-associative, evaluative-feedback aspect of RL.

K-armed Bandit Problem

  • “Learning Action-value estimates”, i.e., Action-value Methods
    • ε-greedy (see the sketch after this list)
      • Incremental updates to the estimate of the value of each action
      • ε is the exploration probability
      • A decreasing step size (e.g., 1/N(a)) lets the estimates converge for a stationary reward distribution, whereas a constant step size is used for a non-stationary reward distribution so the estimates keep tracking the changing values.
    • Optimistic Initial Values
      • For a stationary reward distribution, setting optimistic initial values encourages early exploration and faster convergence.
      • For a non-stationary reward distribution it does not help, since the mean of the targets is constantly changing and the initial values only matter early on.
    • Upper-Confidence-Bound Action Selection
      • allows exploration with a preference among actions, based on how close their estimates are to being maximal and on the uncertainty in those estimates: A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ].
      • If N_t(a) = 0, then a is considered a maximizing action and is selected first.
  • “Learning a Numerical Preference” is an alternative to methods that estimate action values.
    • Gradient Bandit Algorithm
      • The numerical preference H_t(a) has no interpretation in terms of reward, in contrast with an estimated action value (which is the estimated mean reward for that action).
      • π_t(a), the soft-max of the preferences, is the probability of taking action a at time t.
      • The learning algorithm updates the soft-max action preferences based on the idea of stochastic gradient ascent (a sketch follows the implementation note below).
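
Below is a minimal sketch, in Python, of the action-value methods above: ε-greedy with incremental updates, optimistic initial values, and UCB action selection. The class and parameter names (ActionValueBandit, step_size, ucb_c, initial_value) are illustrative assumptions, not from the book or any particular library.

```python
import numpy as np

class ActionValueBandit:
    """Sample-average / constant-step-size action-value bandit agent (sketch)."""

    def __init__(self, k, epsilon=0.1, step_size=None, initial_value=0.0, ucb_c=None):
        self.k = k                          # number of arms
        self.epsilon = epsilon              # exploration probability for ε-greedy
        self.step_size = step_size          # None -> 1/N(a) (stationary), float -> constant (non-stationary)
        self.ucb_c = ucb_c                  # None -> ε-greedy; float c -> UCB selection
        self.Q = np.full(k, float(initial_value))  # estimates; > 0 gives optimistic initial values
        self.N = np.zeros(k)                # per-action selection counts
        self.t = 0                          # time step

    def select_action(self):
        self.t += 1
        if self.ucb_c is not None:
            untried = np.flatnonzero(self.N == 0)
            if untried.size:                # N_t(a) = 0: treat a as maximizing, select it first
                return int(untried[0])
            bonus = self.ucb_c * np.sqrt(np.log(self.t) / self.N)
            return int(np.argmax(self.Q + bonus))
        if np.random.rand() < self.epsilon:          # explore with probability ε
            return np.random.randint(self.k)
        return int(np.argmax(self.Q))                # otherwise exploit the current estimates

    def update(self, action, reward):
        self.N[action] += 1
        # 1/N(a) step size converges for stationary rewards;
        # a constant step size keeps tracking non-stationary rewards.
        alpha = self.step_size if self.step_size is not None else 1.0 / self.N[action]
        self.Q[action] += alpha * (reward - self.Q[action])


# Example run on an assumed 10-armed testbed with Gaussian rewards.
true_values = np.random.normal(0.0, 1.0, 10)
agent = ActionValueBandit(k=10, epsilon=0.1)   # or initial_value=5.0, or ucb_c=2.0
for _ in range(1000):
    a = agent.select_action()
    r = np.random.normal(true_values[a], 1.0)
    agent.update(a, r)
```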

The implementation related to the above algorithms can be found here, along with more detailed/rough notes.
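
As a rough companion to those notes, here is a minimal sketch of the gradient bandit algorithm: a soft-max over numerical preferences H(a), updated by stochastic gradient ascent with an average-reward baseline. Again, the names (GradientBandit, alpha) are illustrative assumptions.

```python
import numpy as np

class GradientBandit:
    """Gradient bandit sketch: soft-max preferences updated by stochastic gradient ascent."""

    def __init__(self, k, alpha=0.1):
        self.k = k
        self.alpha = alpha          # step size for the preference updates
        self.H = np.zeros(k)        # numerical preferences: no interpretation in terms of reward
        self.avg_reward = 0.0       # baseline: running average of the rewards seen so far
        self.t = 0

    def policy(self):
        # π_t(a): probability of taking action a at time t (soft-max of the preferences)
        e = np.exp(self.H - self.H.max())   # subtract the max for numerical stability
        return e / e.sum()

    def select_action(self):
        return int(np.random.choice(self.k, p=self.policy()))

    def update(self, action, reward):
        self.t += 1
        self.avg_reward += (reward - self.avg_reward) / self.t
        pi = self.policy()
        # Stochastic gradient ascent on expected reward: the taken action's preference
        # rises if the reward beats the baseline, the other preferences fall (and vice versa).
        err = reward - self.avg_reward
        self.H -= self.alpha * err * pi
        self.H[action] += self.alpha * err
```

The average-reward baseline is what keeps the update well behaved: preferences move according to whether a reward is better or worse than average, rather than drifting with the absolute scale of the rewards.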

Bandits vs Contextual Bandits vs Full RL problems

  • Associative search tasks (contextual bandits) are intermediate between the k-armed bandit problem and the full reinforcement learning problem.
  • They are like the full reinforcement learning problem in that they involve learning a policy, but they are also like the k-armed bandit problem in that each action affects only the immediate reward.
  • If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem.