Large Language Model (LLM) Reinforcement Learning (RL) systems today typically achieve high rollout utilization by introducing off-policy data, whose distribution mismatch is usually corrected via Importance Sampling (IS).
Three dominant rollout strategies have emerged: 1) Partial Rollout (PR), which, once a newer policy is available, aborts and resumes the trajectory with the latest policy, producing trajectories with mixed policy composition, at the cost of (repeated) re-prefill; 2) Partial Rollout with Stale KV (PR-SKV), which is similar to PR but keeps the stale KV cache to avoid re-prefill, at the cost of introducing distribution shift; 3) Consistent Rollout (CR), which completes each trajectory with the single stale policy it started from, avoiding both re-prefill and distribution shift, at the cost of greater off-policyness.
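As a toy illustration of the three strategies, the sketch below (all names and step counts are hypothetical, not from any real system) records which policy version generates each token of a trajectory while the trainer publishes a new policy at a fixed interval:

```python
# Toy sketch of the three rollout strategies (names/parameters hypothetical).
# A trajectory of `length` tokens is generated while the trainer publishes a
# new policy version every `update_every` tokens; we record which policy
# version produced each token.

def rollout_composition(strategy, length=12, update_every=5):
    tokens = []       # policy version that generated each token
    gen_version = 0   # version currently used for generation
    for t in range(length):
        latest = t // update_every  # latest published policy version
        if strategy in ("PR", "PR-SKV") and latest > gen_version:
            # PR: abort, then re-prefill the prefix with the latest policy
            # (attention cost ~ O(t^2) FLOPs per re-prefill).
            # PR-SKV: switch policy but reuse the stale KV cache instead,
            # trading re-prefill cost for distribution shift.
            gen_version = latest
        # CR: keep the policy the trajectory started with (stays stale).
        tokens.append(gen_version)
    return tokens

print(rollout_composition("PR"))  # mixed versions: [0, ..., 1, ..., 2, 2]
print(rollout_composition("CR"))  # a single stale version: [0, 0, ..., 0]
```

Note that under PR(-SKV) the number of policy switches, and hence re-prefills, grows with trajectory length, which is the linear factor in the cubic-cost argument below.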
This post compares the three strategies from both algorithmic and system perspectives, drawing the following main conclusions: 1) The re-prefill overhead in PR scales cubically with the context length (the dot-product attention complexity scales quadratically with it, and the re-prefill frequency scales linearly), making it prohibitive for long-context tasks. 2) When used as IS data, both PR(-SKV) and CR can produce unbiased estimates, while their variance comparison is nuanced: PR(-SKV) can be less off-policy because it always switches to the latest policy once available, but CR fits naturally into the Multiple Importance Sampling (MIS) formulation and can exploit techniques such as the balance heuristic to reduce variance.
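To make the MIS point concrete, here is a minimal sketch of the balance heuristic. The distributions (a 3-token "vocabulary", two stale behavior policies, one latest target policy) are made up for illustration. Each sample is weighted by the target density over the sample-count-weighted mixture of the behavior policies, which yields an unbiased estimate of the expectation under the latest policy:

```python
import random

# Minimal Multiple Importance Sampling sketch with the balance heuristic
# (illustrative only; all distributions below are made up).
random.seed(0)

vocab = [0, 1, 2]
p  = [0.2, 0.3, 0.5]     # latest (target) policy
q1 = [0.5, 0.3, 0.2]     # stale policy 1, generated n1 samples
q2 = [0.3, 0.4, 0.3]     # stale policy 2, generated n2 samples
f  = lambda x: float(x)  # quantity whose expectation under p we want

def sample(dist):
    return random.choices(vocab, weights=dist)[0]

def mis_estimate(n1=20000, n2=20000):
    total, n = 0.0, n1 + n2
    for dist, count in ((q1, n1), (q2, n2)):
        for _ in range(count):
            x = sample(dist)
            # Balance heuristic: each sample's effective proposal is the
            # count-weighted mixture n1*q1 + n2*q2, regardless of which
            # stale policy actually produced it.
            mixture = n1 * q1[x] + n2 * q2[x]
            total += f(x) * p[x] * n / mixture
    return total / n

true_value = sum(pi * f(x) for x, pi in zip(vocab, p))  # 1.3 for these numbers
print(true_value, mis_estimate())  # estimate should land close to 1.3
```

The design point the sketch demonstrates: because every CR trajectory is generated end-to-end by one known stale policy, the samples slot directly into the roles of the q_i proposals above, which is what makes the balance heuristic applicable.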
Experiments are still in progress.