
Q-learning

Reinforcement Learning and Deep Reinforcement Learning - This article is part of a series.
Part 3: This Article

Q-learning, a cornerstone of reinforcement learning, is the quest of an agent to master its environment by discerning the quality of its actions. Through iterative experiences, the agent constructs a map of rewards, forging paths to optimal outcomes.

Critic

  • A critic does not directly determine the action.
  • Given an actor \(\pi\), it evaluates how good \(\pi\) is.
  • There is a critic called the state value function \(V^{\pi}(s)\)
    • When using \(\pi\) (to interact with the env), it is the cumulated reward expected after visiting state \( s \).

It literally just makes a prediction: feed the state \( s \) into \( V^{\pi} \), and it yields a scalar, \(V^{\pi}(s)\).

The output of a critic depends on the actor being evaluated.
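To make this concrete, below is a minimal sketch of such a critic network, assuming PyTorch; the class name, layer sizes, and hidden dimension are made up for illustration. It maps a state vector to the scalar \(V^{\pi}(s)\).

```python
# Minimal sketch (assuming PyTorch) of a state value critic V^pi(s):
# it takes a state vector and predicts a single scalar, the cumulated
# reward expected when the current actor pi continues from that state.
import torch
import torch.nn as nn

class ValueCritic(nn.Module):
    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one output neuron: the scalar V^pi(s)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, state_dim) -> (batch,)
        return self.net(state).squeeze(-1)
```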

Another Critic

  • State-action value function \( Q^{\pi}(s, a)\)
    • When using actor \( \pi \), it is the cumulated reward expected after taking action \( a \) at state \( s \) (see the sketch below).

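As a companion to the value-critic sketch above, here is a minimal Q-critic sketch for a discrete action space (again assuming PyTorch; names and sizes are illustrative). Rather than taking the pair \((s, a)\) as input, it outputs one \(Q^{\pi}(s, a)\) value per action, which is the usual arrangement when actions are discrete.

```python
# Minimal sketch (assuming PyTorch, discrete actions) of a state-action value
# critic: the network takes a state and outputs one Q^pi(s, a) per action, so
# the value of a specific action is read off the corresponding output neuron.
import torch
import torch.nn as nn

class QCritic(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),  # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, state_dim) -> (batch, n_actions)
        return self.net(state)
```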

Policy \(\pi\) is a network with parameter \(\theta\).

  • Input: the observation of the machine, represented as a vector or a matrix.
  • Output: each action corresponds to a neuron in the output layer (see the sketch below).
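Below is a minimal sketch of such a policy network, assuming PyTorch, a vector observation, and a discrete action space; the class name and layer sizes are illustrative.

```python
# Minimal sketch (assuming PyTorch, discrete actions): the policy network
# pi_theta maps an observation vector to a probability for each action,
# with one output neuron per action, normalized by a softmax.
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, obs_dim) -> p_theta(a | s): (batch, n_actions)
        return torch.softmax(self.net(obs), dim=-1)
```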

From the previous article, a trajectory is \(\tau = \{ s_1, a_1, s_2, a_2, …, s_T, a_T \} \). For every trajectory, its probability can be calculated (given the parameters \(\theta\) of the actor):

$$ p_{\theta}(\tau) = p(s_1)p_{\theta}(a_1|s_1)p(s_2|s_1, a_1)p_{\theta}(a_2|s_2)p(s_3|s_2,a_2)… $$

$$ = p(s_1)\prod_{t=1}^{T}p_{\theta}(a_t|s_t)p(s_{t+1}|s_t, a_t) $$

\(p(s_1)\) and \(p(s_{t+1}|s_t, a_t)\) depend on the behavior of the environment, which can NOT be controlled.

Also, there is a reward \(r_t\) at each time step, and the total reward of a trajectory is \(R(\tau)=\sum_{t=1}^{T}r_t\).

The objective is to adjust \(\theta\) to maximize \(R(\tau)\). However, the reward is a random variable (r.v.), due to the stochasticity of both the actor and the environment (the actor at a given state \(s\) gives some random action \(a\), and the environment's transitions are random as well). Therefore, the expected reward is calculated:

$$ \bar{R}_θ = \sum_τ R(\tau)p_θ(\tau) = E _{\tau \sim p _{\theta}(\tau)}[R(\tau)] $$

Policy Gradient

Given \(\bar{R}_{\theta} = \sum_{\tau} R(\tau)p_{\theta}(\tau)\), what is \(\nabla \bar{R}_{\theta}\)?

$$ \nabla \bar{R}_θ = \sum_τ R(\tau)\nabla p_θ(\tau) $$

\(R(\tau)\) does not have to be differentiable; it can even be a black box (similar to GAN 🧐).

$$ \nabla \bar{R}_{\theta} = \sum_{\tau} R(\tau) p_{\theta}(\tau) \frac{\nabla p_{\theta}(\tau)}{p_{\theta}(\tau)} \\ = \sum_{\tau} R(\tau) p_{\theta}(\tau) \nabla \log p_{\theta}(\tau) \\ = E_{\tau \sim p_{\theta}(\tau)}[R(\tau) \nabla \log p_{\theta}(\tau)] \approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^n) \nabla \log p_{\theta}(\tau^n) $$

$$ \nabla \bar{R}_{\theta} \approx \frac{1}{N}\sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p_{\theta}(a_t^n | s_t^n) $$

Note: \(\nabla f(x) = f(x)\nabla \log f(x)\) (logarithmic derivative). The environment components in \(p_{\theta}(\tau)\) do not depend on \(\theta\), so the gradient \(\nabla\) only acts on the \( \log p_{\theta}(a_t^n | s_t^n) \) terms.

Intuitively, among ALL sampled data, executing action \(a_t\) at state \(s_t\) (\((a_t | s_t)\) is SOME state-action pair within the trajectory) contributes to the trajectory \(\tau\). If a certain \((a_t | s_t)\) pair appears in a trajectory with positive reward, then increase the likelihood of that pair, and vice versa.
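A minimal sketch of this estimate in code (assuming PyTorch and the hypothetical PolicyNetwork sketched earlier): maximizing \(\bar{R}_{\theta}\) by gradient ascent is the same as minimizing the negative of the weighted log-probabilities, so an autograd framework can produce the gradient from the "loss" below.

```python
# Minimal sketch of the policy gradient estimate (assuming PyTorch and the
# hypothetical PolicyNetwork above). Minimizing this loss is equivalent to
# ascending (1/N) * sum_n sum_t R(tau^n) * log p_theta(a_t^n | s_t^n),
# up to a constant factor absorbed by the learning rate.
import torch

def policy_gradient_loss(policy, states, actions, returns):
    # states:  (num_steps, obs_dim)  every s_t^n from the sampled trajectories
    # actions: (num_steps,)          the a_t^n taken at those states
    # returns: (num_steps,)          R(tau^n) of the trajectory each step belongs to
    probs = policy(states)                                         # p_theta(a | s)
    log_probs = torch.log(probs.gather(1, actions.unsqueeze(1))).squeeze(1)
    return -(returns * log_probs).mean()
```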

More Math

$$ \theta \leftarrow \theta + \eta \nabla \bar{R}_θ $$

$$ \nabla \bar{R}_{\theta} \approx \frac{1}{N}\sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p_{\theta}(a_t^n | s_t^n) $$

The process cycles between data collection and model updates. The policy is denoted \(\pi_{\theta}\): the existing agent with parameters \(\theta\) is used to interact with the environment.

Using a game as an example: game play one, \(\tau^1\), goes through \((a_1^1 | s_1^1)\), \((a_2^1 | s_2^1)\), … and results in \(R(\tau^1)\); game play two is \(\tau^2\) (just change the superscript), and so on. The collected data are then used to calculate \(\nabla \bar{R}_{\theta}\).

Essentially, plug in ALL the \((a_t | s_t)\) pairs to calculate their log probabilities, take the gradient, then multiply by a weight, which is the total reward collected from the corresponding game play.
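As a rough sketch of one collection-and-update cycle (assuming PyTorch, the hypothetical PolicyNetwork and policy_gradient_loss from the earlier sketches, and a placeholder `env` with a simplified `reset()`/`step()` interface assumed purely for illustration):

```python
# Rough sketch of one data-collection / update cycle. `env` is a placeholder
# environment whose step() is assumed to return (next_state, reward, done);
# real environments (e.g. Gym) differ slightly in their return signature.
import torch
from torch.distributions import Categorical

def train_one_iteration(policy, optimizer, env, num_episodes=10):
    states, actions, returns = [], [], []
    for _ in range(num_episodes):                  # play N games with pi_theta
        ep_states, ep_actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            s_t = torch.as_tensor(s, dtype=torch.float32)
            dist = Categorical(policy(s_t))        # p_theta(. | s_t)
            a = dist.sample()
            s, r, done = env.step(a.item())
            ep_states.append(s_t)
            ep_actions.append(a)
            rewards.append(r)
        R = sum(rewards)                           # R(tau^n): same weight for every step
        states += ep_states
        actions += ep_actions
        returns += [R] * len(rewards)

    loss = policy_gradient_loss(policy, torch.stack(states),
                                torch.stack(actions),
                                torch.tensor(returns, dtype=torch.float32))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # gradient step on theta
```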

Extra Tips

Add a Baseline

  • In \(\theta \leftarrow \theta + \eta \nabla \bar{R}_{\theta} \), it is possible that \( R(\tau^n)\) is always positive.
  • According to \(\nabla \bar{R}_{\theta} \approx \frac{1}{N}\sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p_{\theta}(a_t^n | s_t^n)\) (and the explanation in the previous section), the likelihood of \( p_{\theta}(a_t^n | s_t^n) \) shall increase whenever it yields a positive \( R(\tau)\).

Due to the nature of sampling (ONLY some \( (a_t^n | s_t^n) \) pairs can be sampled, so the probability of the un-sampled actions will decrease), a baseline term \( b \approx E[R(\tau)] \) can be subtracted.

$$ \nabla \bar{R}_{\theta} \approx \frac{1}{N}\sum_{n=1}^{N} \sum_{t=1}^{T_n} [R(\tau^n) - b]\nabla \log p_{\theta}(a_t^n | s_t^n) $$
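A minimal sketch of the baseline version, reusing the hypothetical policy_gradient_loss from earlier and estimating \(b\) by the mean of the sampled returns:

```python
# Minimal sketch: subtract a baseline b ~ E[R(tau)], here estimated by the
# mean return over the sampled data, before weighting the log-probabilities.
def policy_gradient_loss_with_baseline(policy, states, actions, returns):
    baseline = returns.mean()                      # b, estimated from the samples
    return policy_gradient_loss(policy, states, actions, returns - baseline)
```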

Assign Suitable Credit

Instead of calculating the sum of rewards over the ENTIRE \(\tau\), it is BETTER to use the sum of rewards obtained after a certain action is performed.

Use \(\sum_{t^{\prime}=t}^{T_n} r_{t^{\prime}}^n\) to represent the sum of rewards from time \(t\) (when the action was performed) till the end of the game.

We can even take it one step further by adding a discount factor \(\gamma \lt 1 \) that down-weights rewards further in the future relative to the current action, which yields \(\sum_{t^{\prime}=t}^{T_n} \gamma^{t^{\prime}-t} r_{t^{\prime}}^n\). This is reasonable because rewards collected not too long after an action should be weighted MORE than later rewards.
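A minimal sketch of computing this discounted sum for one episode (plain Python; the function name and the \(\gamma\) value are illustrative):

```python
# Minimal sketch: for each step t, sum the rewards from t to the end of the
# episode, discounting later rewards by gamma (computed backwards in O(T)).
def discounted_rewards_to_go(rewards, gamma=0.99):
    out, running = [], 0.0
    for r in reversed(rewards):        # G_t = r_t + gamma * G_{t+1}
        running = r + gamma * running
        out.append(running)
    return list(reversed(out))

# Example: rewards [1, 0, 2] with gamma = 0.5 give [1.5, 1.0, 2.0].
```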

Advantage Function

Eventually, the weighting term \(R(\tau^n)-b\) becomes the advantage function \( A^{\theta}(s_t, a_t)\).

It measures how good it is to take \( a_t \) rather than other actions at \( s_t \). The superscript \( \theta \) indicates that it is estimated by a network.
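One common way to estimate it, sketched below under the assumptions of the earlier hypothetical ValueCritic and discounted_rewards_to_go sketches, is the discounted reward-to-go minus the critic's value prediction for the same state, so the critic plays the role of a state-dependent baseline:

```python
# Minimal sketch of an advantage estimate: discounted reward-to-go minus the
# value critic's prediction V(s_t) for the same states (one episode at a time).
import torch

def estimate_advantage(critic, states, rewards, gamma=0.99):
    rewards_to_go = torch.tensor(discounted_rewards_to_go(rewards, gamma),
                                 dtype=torch.float32)
    with torch.no_grad():
        values = critic(states)        # V(s_t) for each step, shape (T,)
    return rewards_to_go - values      # A(s_t, a_t) estimate, shape (T,)
```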

