Actor-Critic#

  • The Actor is responsible for the policy update, i.e., deciding which action to take in a given state.
  • The Critic is responsible for policy evaluation (value estimation), which judges how good the Actor's current policy is.

QAC (Q-Actor-Critic)#

\begin{aligned}
&\text{Critic (value update):} \\
&w_{t+1} = w_t + \alpha_w \left[ r_{t+1} + \gamma q(s_{t+1}, a_{t+1}, w_t) - q(s_t, a_t, w_t) \right] \nabla_w q(s_t, a_t, w_t) \\
&\text{Actor (policy update):} \\
&\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \ln \pi(a_t | s_t, \theta_t)\, q(s_t, a_t, w_{t+1})
\end{aligned}
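
To make these two updates concrete, below is a minimal sketch of a single QAC step, assuming one-hot features for a linear critic and a tabular softmax actor; the environment size, step sizes, and function names are illustrative only.

```python
import numpy as np

n_states, n_actions, gamma = 4, 3, 0.9
alpha_w, alpha_theta = 0.1, 0.01

w = np.zeros(n_states * n_actions)       # critic weights for q(s, a, w)
theta = np.zeros((n_states, n_actions))  # actor logits for pi(a|s, theta)

def phi(s, a):
    """One-hot feature vector for the state-action pair (s, a)."""
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

def q(s, a, w):
    return phi(s, a) @ w

def pi(s, theta):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def qac_step(s, a, r, s_next, a_next, w, theta):
    # Critic (value update): TD(0) step toward r + gamma * q(s', a', w)
    td_error = r + gamma * q(s_next, a_next, w) - q(s, a, w)
    w = w + alpha_w * td_error * phi(s, a)
    # Actor (policy update): grad ln pi weighted by the updated critic value
    grad_log_pi = -pi(s, theta)   # softmax: grad_theta[s] ln pi(a|s) = e_a - pi(.|s)
    grad_log_pi[a] += 1.0
    theta[s] = theta[s] + alpha_theta * grad_log_pi * q(s, a, w)
    return w, theta

# e.g. w, theta = qac_step(s=0, a=1, r=1.0, s_next=2, a_next=0, w=w, theta=theta)
```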

A2C (Advantage Actor-Critic)#

  • A baseline function is introduced to reduce variance.
\begin{aligned}
\nabla_{\theta} J(\theta) &= \mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_{\theta} \ln \pi(A|S, \theta_t)\, q_{\pi}(S, A) \right] \\
&= \mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_{\theta} \ln \pi(A|S, \theta_t) \left( q_{\pi}(S, A) - b(S) \right) \right]
\end{aligned}

For the equality above to hold, we need:

\mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_{\theta} \ln \pi(A|S, \theta_t)\, b(S) \right] = 0

The derivation is as follows:

\begin{aligned}
\mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_{\theta} \ln \pi(A|S, \theta_t)\, b(S) \right]
&= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi(a|s, \theta_t) \nabla_{\theta} \ln \pi(a|s, \theta_t)\, b(s) \\
&= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta_t)\, b(s) \\
&= \sum_{s \in \mathcal{S}} \eta(s)\, b(s) \nabla_{\theta} \sum_{a \in \mathcal{A}} \pi(a|s, \theta_t) \\
&= \sum_{s \in \mathcal{S}} \eta(s)\, b(s) \nabla_{\theta} 1 = 0
\end{aligned}
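
The zero-expectation property derived above is easy to check numerically. Below is a minimal sketch assuming a softmax policy over five actions in a single state, for which \nabla_\theta \ln \pi(a|s, \theta) = e_a - \pi(\cdot|s, \theta); the specific numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=5)                    # theta for one state, 5 actions
pi = np.exp(logits) / np.exp(logits).sum()     # softmax policy pi(.|s)
b = 3.7                                        # arbitrary baseline value b(s)

# For a softmax policy, grad_theta ln pi(a|s) = one_hot(a) - pi(.|s)
grad_log_pi = np.eye(5) - pi                   # row a holds the gradient for action a

# E_{A~pi}[grad ln pi(A|s) * b(s)] = sum_a pi(a) * grad_log_pi[a] * b
expectation = sum(pi[a] * grad_log_pi[a] * b for a in range(5))
print(expectation)                             # ~0 up to floating-point error
```
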
  • The baseline is mainly used to control variance, so the goal is to find the baseline that minimizes it. The optimal baseline is:
b^*(s) = \frac{\mathbb{E}_{A \sim \pi} \left[ \|\nabla_{\theta} \ln \pi(A|s, \theta_t)\|^2\, q(s, A) \right]}{\mathbb{E}_{A \sim \pi} \left[ \|\nabla_{\theta} \ln \pi(A|s, \theta_t)\|^2 \right]}
  • Because this form is too expensive to compute, practice usually drops the weighting term \mathbb{E}_{A \sim \pi} [\|\nabla_{\theta} \ln \pi(A|s, \theta_t)\|^2] and uses the approximation:
b(s) = \mathbb{E}_{A \sim \pi} [q(s, A)] = v_{\pi}(s)

When b(s) = v_\pi(s):

  • The gradient-ascent algorithm is:
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \mathbb{E} \left[ \nabla_\theta \ln \pi(A|S, \theta_t) \left[ q_\pi(S, A) - v_\pi(S) \right] \right] \\
&\doteq \theta_t + \alpha \mathbb{E} \left[ \nabla_\theta \ln \pi(A|S, \theta_t)\, \delta_\pi(S, A) \right]
\end{aligned}

where:

\delta_\pi(S, A) \doteq q_\pi(S, A) - v_\pi(S)

This term is called the advantage function. By the definition of v_{\pi}(S), it is the expected value of the action values in state S; if an action's q value exceeds this average v, the action has an "advantage".

  • The stochastic version of this algorithm is:
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t) \left[ q_t(s_t, a_t) - v_t(s_t) \right] \\
&= \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, \delta_t(s_t, a_t)
\end{aligned}

The algorithm can also be rewritten as:

\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, \delta_t(s_t, a_t) \\
&= \theta_t + \alpha \frac{\nabla_\theta \pi(a_t|s_t, \theta_t)}{\pi(a_t|s_t, \theta_t)}\, \delta_t(s_t, a_t) \\
&= \theta_t + \underbrace{\alpha \left( \frac{\delta_t(s_t, a_t)}{\pi(a_t|s_t, \theta_t)} \right)}_{\text{step size}} \nabla_\theta \pi(a_t|s_t, \theta_t)
\end{aligned}
  • The update step size is proportional to the relative value \delta_t rather than the absolute value q_t, which is more reasonable.
  • It still balances exploration and exploitation well.

Approximating with the TD error:

\delta_t = q_t(s_t, a_t) - v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t)
  • This approximation is justified because:
\mathbb{E}[q_\pi(S, A) - v_\pi(S) \mid S = s_t, A = a_t] = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) - v_\pi(S_t) \mid S = s_t, A = a_t]
  • Advantage: only one neural network is needed to approximate v_\pi(s); there is no need to maintain two networks for q_\pi(s, a) and v_\pi(s). A minimal one-step sketch follows below.
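
Putting the TD-error advantage and the policy update together, the sketch below shows one A2C step using PyTorch for illustration; the network sizes, learning rates, and function names are placeholders, not a canonical implementation.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))  # pi(a|s, theta)
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))         # v(s, w)
opt_actor = torch.optim.SGD(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.SGD(critic.parameters(), lr=1e-2)

def a2c_step(s, a, r, s_next):
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # TD error delta_t = r + gamma * v(s') - v(s), used as the advantage estimate
    with torch.no_grad():
        td_target = r + gamma * critic(s_next)
    v_s = critic(s)
    delta = (td_target - v_s).detach().squeeze()

    # Critic: minimizing the squared TD error gives the semi-gradient update
    # w <- w + alpha_w * delta * grad_w v(s, w)
    critic_loss = (td_target - v_s).pow(2).mean()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor: gradient ascent on delta * ln pi(a|s, theta)  (loss is the negative)
    log_pi = torch.log_softmax(actor(s), dim=-1)[a]
    actor_loss = -(delta * log_pi)
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

# e.g. a2c_step(s=[0.1, 0.2, -0.3, 0.4], a=1, r=1.0, s_next=[0.0, 0.1, 0.2, 0.3])
```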

Importance Sampling and Off-policy#

Importance sampling technique#

Note that:

\mathbb{E}_{X \sim p_0}[X] = \sum_x p_0(x)\, x = \sum_x p_1(x) \underbrace{\frac{p_0(x)}{p_1(x)} x}_{f(x)} = \mathbb{E}_{X \sim p_1}[f(X)]
  • Therefore, \mathbb{E}_{X \sim p_0}[X] can be estimated by estimating \mathbb{E}_{X \sim p_1}[f(X)].
  • The estimate is constructed as follows:

Let:

\bar{f} \doteq \frac{1}{n} \sum_{i=1}^n f(x_i), \quad \text{where } x_i \sim p_1

Then:

\mathbb{E}_{X \sim p_1}[\bar{f}] = \mathbb{E}_{X \sim p_1}[f(X)], \qquad \text{var}_{X \sim p_1}[\bar{f}] = \frac{1}{n} \text{var}_{X \sim p_1}[f(X)]

Hence \bar{f} is a good approximation of \mathbb{E}_{X \sim p_0}[X]:

\mathbb{E}_{X \sim p_0}[X] \approx \bar{f} = \frac{1}{n} \sum_{i=1}^n f(x_i) = \frac{1}{n} \sum_{i=1}^n \frac{p_0(x_i)}{p_1(x_i)} x_i
  • The ratio \frac{p_0(x_i)}{p_1(x_i)} is called the importance weight.
  • If p_1(x_i) = p_0(x_i), the importance weight is 1 and \bar{f} reduces to the ordinary arithmetic mean.
  • If p_0(x_i) > p_1(x_i), the sample x_i occurs more often under p_0 than under p_1, so an importance weight greater than 1 amplifies its contribution to the estimate. A small numerical check follows below.
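
As a quick numerical check of the estimator above, the sketch below samples from a behavior distribution p_1 and reweights the samples to recover the mean under a different target distribution p_0; both distributions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.array([1.0, 2.0, 3.0])
p0 = np.array([0.7, 0.2, 0.1])   # target distribution
p1 = np.array([0.5, 0.3, 0.2])   # behavior (sampling) distribution

idx = rng.choice(len(values), size=100_000, p=p1)   # x_i ~ p1
x = values[idx]
weights = p0[idx] / p1[idx]                          # importance weights p0(x_i)/p1(x_i)

print("true E_p0[X] =", values @ p0)           # 0.7*1 + 0.2*2 + 0.1*3 = 1.4
print("IS estimate  =", np.mean(weights * x))  # ~1.4
```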

The objective function is defined as:

J(\theta) = \sum_{s \in \mathcal{S}} d_\beta(s)\, v_\pi(s) = \mathbb{E}_{S \sim d_\beta} [v_\pi(S)]

Its gradient is:

\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \rho, A \sim \beta} \left[ \frac{\pi(A | S, \theta)}{\beta(A | S)} \nabla_\theta \ln \pi(A | S, \theta)\, q_\pi(S, A) \right]

The off-policy gradient is likewise invariant to a baseline b(s):

\nabla_{\theta} J(\theta) = \mathbb{E}_{S \sim \rho, A \sim \beta} \left[ \frac{\pi(A|S, \theta)}{\beta(A|S)} \nabla_{\theta} \ln \pi(A|S, \theta) \left( q_{\pi}(S, A) - b(S) \right) \right]
  • To reduce estimation variance, we again choose the baseline b(S) = v_{\pi}(S), which gives:
\nabla_{\theta} J(\theta) = \mathbb{E} \left[ \frac{\pi(A|S, \theta)}{\beta(A|S)} \nabla_{\theta} \ln \pi(A|S, \theta) \left( q_{\pi}(S, A) - v_{\pi}(S) \right) \right]

The corresponding stochastic gradient-ascent algorithm is:

\theta_{t+1} = \theta_t + \alpha_{\theta} \frac{\pi(a_t|s_t, \theta_t)}{\beta(a_t|s_t)} \nabla_{\theta} \ln \pi(a_t|s_t, \theta_t) \left( q_t(s_t, a_t) - v_t(s_t) \right)

As in the on-policy case:

q_t(s_t, a_t) - v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t) \doteq \delta_t(s_t, a_t)

The final form of the algorithm becomes:

\theta_{t+1} = \theta_t + \alpha_{\theta} \frac{\pi(a_t|s_t, \theta_t)}{\beta(a_t|s_t)} \nabla_{\theta} \ln \pi(a_t|s_t, \theta_t)\, \delta_t(s_t, a_t)

Rewritten to highlight the step-size relationship:

\theta_{t+1} = \theta_t + \alpha_{\theta} \left( \frac{\delta_t(s_t, a_t)}{\beta(a_t|s_t)} \right) \nabla_{\theta} \pi(a_t|s_t, \theta_t)
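
A minimal sketch of this final off-policy update for a tabular softmax target policy is shown below; beta_prob stands for the behavior policy's probability \beta(a_t|s_t) of the logged action, and all names and constants are illustrative.

```python
import numpy as np

n_states, n_actions, alpha_theta = 4, 3, 0.01
theta = np.zeros((n_states, n_actions))   # target-policy logits

def pi(s, theta):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def off_policy_actor_step(s, a, beta_prob, delta, theta):
    probs = pi(s, theta)
    rho = probs[a] / beta_prob    # importance weight pi(a|s, theta) / beta(a|s)
    grad_log_pi = -probs          # softmax: grad_theta[s] ln pi(a|s) = e_a - pi(.|s)
    grad_log_pi[a] += 1.0
    theta[s] = theta[s] + alpha_theta * rho * delta * grad_log_pi
    return theta

# e.g. theta = off_policy_actor_step(s=2, a=1, beta_prob=0.25, delta=0.8, theta=theta)
```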

Deterministic Actor-Critic (DPG)#

How the policy representation evolves:

  • Up to this point, the general policy has been written \pi(a|s, \theta) \in [0, 1], and it is typically stochastic.
  • Now we introduce a deterministic policy, written:
a = \mu(s, \theta) \doteq \mu(s)
  • \mu is a direct mapping from the state space \mathcal{S} to the action space \mathcal{A}.
  • In practice \mu is usually parameterized by a neural network with parameters \theta, taking s as input and outputting the action a directly.

The gradient of the objective function is:

\begin{aligned}
\nabla_{\theta} J(\theta) &= \sum_{s \in \mathcal{S}} \rho_{\mu}(s) \nabla_{\theta} \mu(s) \left( \nabla_{a} q_{\mu}(s, a) \right)\big|_{a=\mu(s)} \\
&= \mathbb{E}_{S \sim \rho_{\mu}} \left[ \nabla_{\theta} \mu(S) \left( \nabla_{a} q_{\mu}(S, a) \right)\big|_{a=\mu(S)} \right]
\end{aligned}

Based on the deterministic policy gradient, the gradient-ascent algorithm for maximizing J(\theta) is:

\theta_{t+1} = \theta_t + \alpha_\theta \mathbb{E}_{S \sim \rho_\mu} \left[ \nabla_\theta \mu(S) \left( \nabla_a q_\mu(S, a) \right)\big|_{a=\mu(S)} \right]

The corresponding single-step stochastic gradient-ascent update is:

\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \mu(s_t) \left( \nabla_a q_\mu(s_t, a) \right)\big|_{a=\mu(s_t)}

The overall update logic of the architecture is as follows:

TD error

\delta_t = r_{t+1} + \gamma q(s_{t+1}, \mu(s_{t+1}, \theta_t), w_t) - q(s_t, a_t, w_t)

Critic (value update)

w_{t+1} = w_t + \alpha_w \delta_t \nabla_w q(s_t, a_t, w_t)

Actor (policy update)

\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \mu(s_t, \theta_t) \left( \nabla_a q(s_t, a, w_{t+1}) \right)\big|_{a=\mu(s_t)}
  • This is an off-policy implementation: the data-collection (behavior) policy \beta generally differs from the target policy \mu currently being optimized.
  • To ensure exploration, the behavior policy \beta is usually set to the target policy plus noise, i.e., \beta = \mu + \text{noise}.
  • Choices of function approximation for q(s, a, w):
    • Linear function: q(s, a, w) = \phi^T(s, a) w, where \phi(s, a) is a hand-designed feature vector.
    • Neural network: when deep neural networks approximate both the value function and the policy, the method becomes Deep Deterministic Policy Gradient (DDPG). A minimal single-step sketch follows below.
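
The three updates above can be wired together in a few lines. The sketch below performs one DPG-style step using PyTorch for illustration; the network sizes and learning rates are placeholders, and the replay buffer, exploration noise, and target networks of a full DDPG implementation are omitted.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 3, 1, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim))       # mu(s, theta)
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))  # q(s, a, w)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def q(s, a):
    return critic(torch.cat([s, a], dim=-1))

def dpg_step(s, a, r, s_next):
    s = torch.as_tensor(s, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # TD error: delta_t = r + gamma * q(s', mu(s')) - q(s, a)
    with torch.no_grad():
        td_target = r + gamma * q(s_next, actor(s_next))

    # Critic (value update): move q(s, a, w) toward the TD target
    critic_loss = (td_target - q(s, a)).pow(2).mean()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor (policy update): ascend grad_theta mu(s) * grad_a q(s, a)|_{a=mu(s)};
    # backpropagating through q(s, mu(s)) applies exactly this chain rule
    actor_loss = -q(s, actor(s)).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

# e.g. dpg_step(s=[0.1, -0.3, 0.5], a=[0.2], r=1.0, s_next=[0.0, 0.1, 0.4])
```
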
RL Study Notes: Actor-Critic Algorithms
https://zh.maxtonniu.com/blog/rl_chapter10
Author: Maxton Niu
Published: February 22, 2026
License: CC BY-NC-SA 4.0