[WIP] Partial Rollout vs Consistent Rollout in LLM RL

English
Technical
Authors

Yuxuan Tong

Yingru Li

Guangming Sheng

Published

2026/01/07

Last Modified

2026/03/09

Abstract

Large Language Model (LLM) Reinforcement Learning (RL) systems nowadays usually achieve high rollout utilization by introducing off-policy data, whose distribution mismatch is typically corrected for via Importance Sampling (IS).

Three dominant rollout strategies have emerged: 1) Partial Rollout (PR), which, once a newer policy is available, aborts and resumes the trajectory with the latest policy, producing trajectories with mixed policy compositions, at the cost of (repeated) re-prefill; 2) Partial Rollout with Stale KV (PR-SKV), which is similar to PR but keeps the stale KV cache to avoid re-prefill, at the cost of introducing distribution shift; 3) Consistent Rollout (CR), which completes each trajectory with a single stale policy, avoiding re-prefill and distribution shift, at the cost of more off-policiness.

This post compares the three strategies from both algorithmic and system perspectives, drawing the following main conclusions: 1) The re-prefill overhead in PR scales cubically with the context length — the dot-product attention complexity scales quadratically and the re-prefill frequency scales linearly — making it prohibitive for long-context tasks. 2) As data for IS, both PR(-SKV) and CR can produce unbiased estimates, while their variance comparison is nuanced: PR(-SKV) can be less off-policy by always using the latest policy once available, but CR fits naturally into the Multiple Importance Sampling formulation and can exploit techniques like the balance heuristic to reduce variance.

Experiments are still in progress.

Keywords

Partial Rollout, Consistent Rollout, Importance Sampling, Off-Policy Reinforcement Learning, Large Language Model

1 Three Rollout Strategies in Fully-Utilized LLM RL Systems

In Reinforcement Learning (RL) systems for Large Language Models (LLMs), especially those that pursue full utilization of hardware resources, we commonly have to train on off-policy data, typically a batch of trajectory samples collected from one or multiple stale policies.

Three dominant rollout strategies have emerged to handle this, each producing a different kind of off-policy data:

  1. Partial Rollout (PR) — once a newer policy is available, aborts and resumes the trajectory with the latest policy, producing trajectories with mixed policy compositions, at the cost of (repeated) re-prefill.
  2. Partial Rollout with Stale KV (PR-SKV) — similar to PR, but keeps the stale KV cache to avoid re-prefill, at the cost of introducing distribution shift.
  3. Consistent Rollout (CR) — completes each trajectory with a single (possibly stale) policy, avoiding re-prefill and distribution shift, at the cost of more off-policiness.
|  | Partial Rollout (PR) | Partial Rollout with Stale KV (PR-SKV) | Consistent Rollout (CR) |
| --- | --- | --- | --- |
| Policy per trajectory | Mixed (multiple versions within one trajectory) | Mixed (multiple versions within one trajectory) | Single (single consistent version) |
| Re-prefill overhead | \(O(L^3/K)\) (cubic) | None (reuses stale KV) | None (no mid-trajectory switch) |
| KV consistency | Fully consistent (fresh KV) | Inconsistent (stale KV + new weights) | Fully consistent |
| Off-policiness | Lower (always uses latest policy once available) | Lower (always uses latest policy once available) | Higher (keeps stale policy throughout) |
| IS formulation | Trajectory-dependent \(\mu_i\) | Trajectory-dependent \(\mu_i\) (approximate) | Clean (Multiple Importance Sampling) |

1.1 Partial Rollout (PR)

In a Partial Rollout system, the latest policy available at each moment is always used to sample actions. This means a single trajectory can span multiple policy versions.

Formally, a partial rollout trajectory can be defined as \(\tau = (s_{0}, a_{0,\pi_{\theta_{n}}}, \ldots, s_{t}, a_{t,\pi_{\theta_{n+m}}}, \ldots, s_{T}, a_{T,\pi_{\theta_{n+M}}})\), where \(s_{t}\) is the state at timestep \(t\) and \(a_{t,\pi_{\theta_{n+m}}}\) is the action sampled from \(\pi_{\theta_{n+m}}\), the latest policy available in the system at that timestep. Note that for \(l > m\), \(\pi_{\theta_{n+l}}\) is usually more on-policy than \(\pi_{\theta_{n+m}}\), i.e., closer to the current training policy.

Partial rollout trajectories in a batch usually have different policy compositions, depending on the system dynamics.

Partial Rollout is a popular design choice in LLM RL systems, adopted by Kimi k1.5 (Kimi Team et al. 2025), AReaL (Fu et al. 2025), etc., dating back to SEED RL (Espeholt et al. 2020) in traditional distributed RL. In these systems, the ongoing trajectory rollouts are

  1. aborted when
    • either, in synchronous systems like Kimi k1.5, there are enough samples collected for training, releasing the resources for the training engine to update the model weights,
    • or, in asynchronous systems like AReaL and PipelineRL, a new version of model weights is produced by the trainer;
  2. continued with the latest version of model weights.
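The abort-and-resume loop above can be sketched as follows. This is a minimal toy simulation; `ToyPolicy`, the per-step version schedule, and all names are illustrative, not taken from any of the cited systems:

```python
import random

class ToyPolicy:
    """Stand-in for an LLM policy: a version number plus a trivial sampler."""
    def __init__(self, version):
        self.version = version
        self.prefilled_tokens = 0  # counts tokens re-processed on resume

    def prefill(self, context):
        # Re-prefill: the resumed policy must re-process the whole existing
        # context (prompt + generated tokens) to rebuild its KV cache.
        self.prefilled_tokens += len(context)

    def sample(self, context, rng):
        return rng.randrange(100)  # dummy token

def partial_rollout(prompt, latest_policy_at_step, rng):
    """One PR trajectory: decode with the newest available policy, aborting
    and re-prefilling whenever a newer weight version appears mid-rollout."""
    tokens = list(prompt)
    versions = []  # policy version that produced each generated token
    current = latest_policy_at_step[0]
    for latest in latest_policy_at_step:
        if latest.version != current.version:
            current = latest
            current.prefill(tokens)  # O(len(tokens)^2) attention cost
        tokens.append(current.sample(tokens, rng))
        versions.append(current.version)
    return tokens, versions
```

Running it with a weight switch after five decode steps yields a mixed-policy trajectory (`versions == [0]*5 + [1]*5`) and charges the new policy one re-prefill over the full prefix.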

Partial Rollout is motivated by its efficiency and simplicity, which we elaborate on below.

The earliest LLM RL systems (Yao et al. 2023; Shen et al. 2024; Sheng, Zhang, et al. 2025; Hu et al. 2025) all adopt synchronous architectures, where the trainer waits for all trajectories to finish before updating the weights with the collected data. However, as the context length of LLMs scales up, the trajectory length distribution becomes increasingly skewed. If all trajectories must complete in the same rollout stage, only a few long-tail requests may remain in the system, causing severe under-utilization (typically <30% in practice).

Kimi k1.5 addresses this by aborting all ongoing rollouts once enough training samples are collected, directly updating the weights instead of waiting for all trajectories to finish, and then continuing the rollouts with the new model weights.

In asynchronous RL systems with an experience buffer, managing multiple versions of model weights within the rollout engine is cumbersome.

AReaL proposes to maintain only the latest model weights across all instances. Once a new version of weights is produced by the trainer, all rollouts of the stale policy are aborted and then continued with the latest policy.

This is simply implemented by loading the latest weights into all inference engine instances, avoiding the need to route requests across different instances.

However, PR comes with a critical system cost: re-prefill. When a trajectory is aborted and resumed with a newer policy, the new policy must re-process the entire existing context (prompt + previously generated tokens) to rebuild its KV cache before it can continue generation. We analyze this overhead in detail in Section 2.

Beyond the system cost, there are algorithmic concerns about mixing multiple policies within a single trajectory, since most prior works formulate IS with a single consistent behavior policy \(\mu(\cdot \mid \boldsymbol{s})\) and the mixed-policy formulation remains relatively under-explored. We defer a more detailed discussion to Section 3.

1.2 Partial Rollout with Stale KV (PR-SKV)

PR-SKV is a variant of PR designed to avoid the re-prefill overhead, proposed by PipelineRL (Piché et al. 2025). Like PR, it aborts the ongoing trajectory and switches to the latest policy \(\pi_{\theta_{n+1}}\) once available. However, instead of re-computing the KV cache from scratch, PR-SKV retains and reuses the stale KV cache that was computed under the old policy \(\pi_{\theta_{n}}\).

In standard autoregressive transformer inference, the KV cache stores the key and value projections of all past tokens:

\[ \boldsymbol{K}_{0:t} = \boldsymbol{W}_K^{(n)} \boldsymbol{X}_{0:t}, \quad \boldsymbol{V}_{0:t} = \boldsymbol{W}_V^{(n)} \boldsymbol{X}_{0:t}, \]

where \(\boldsymbol{W}_K^{(n)}, \boldsymbol{W}_V^{(n)}\) are the key and value projection matrices of policy \(\pi_{\theta_n}\), and \(\boldsymbol{X}_{0:t}\) denotes the hidden representations of the tokens generated so far.

When the policy switches from \(\pi_{\theta_n}\) to \(\pi_{\theta_{n+1}}\), PR-SKV continues using the KV cache computed under \(\pi_{\theta_n}\) while applying \(\pi_{\theta_{n+1}}\)’s weights for subsequent token generation. This means the query projections for new tokens are computed with \(\boldsymbol{W}_Q^{(n+1)}\), but they attend to keys and values still computed with \(\boldsymbol{W}_K^{(n)}, \boldsymbol{W}_V^{(n)}\). This inconsistency between the query and key-value projections introduces a distribution shift: the resulting token distribution is neither exactly \(\pi_{\theta_n}\) nor \(\pi_{\theta_{n+1}}\), but some hybrid that is difficult to characterize analytically.
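A toy single-layer, single-head illustration of this query/key-value inconsistency follows. All dimensions, seeds, and the drift magnitude are arbitrary; real models have many layers and heads where the effect compounds:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def stale_kv_gap(drift, d=8, ctx=16, seed=0):
    """Output gap of one attention head when pi_{n+1}'s query attends to
    pi_n's cached K/V (PR-SKV) instead of freshly re-prefilled ones (PR).
    `drift` controls how far pi_{n+1}'s projections moved from pi_n's."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(ctx, d))        # hidden states of the context tokens
    Wq_new = rng.normal(size=(d, d))     # query projection of pi_{n+1}
    Wk_old = rng.normal(size=(d, d))     # K/V projections of pi_n
    Wv_old = rng.normal(size=(d, d))
    Wk_new = Wk_old + drift * rng.normal(size=(d, d))  # pi_{n+1} = pi_n + drift
    Wv_new = Wv_old + drift * rng.normal(size=(d, d))
    q = X[-1] @ Wq_new                   # new token's query under pi_{n+1}

    def attend(K, V):
        scores = q @ K.T / np.sqrt(d)
        return softmax(scores) @ V

    fresh = attend(X @ Wk_new, X @ Wv_new)  # PR: KV rebuilt by re-prefill
    stale = attend(X @ Wk_old, X @ Wv_old)  # PR-SKV: stale KV reused
    return float(np.linalg.norm(fresh - stale))
```

With `drift = 0` the two outputs coincide; any nonzero drift makes the effective sampling distribution a hybrid of the two policies, as described above.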

The trade-off is clear:

  • Advantage: eliminates the re-prefill overhead entirely (Section 2), matching the system efficiency of CR.
  • Disadvantage: the distribution shift makes the actual sampling distribution hard to specify, complicating the importance sampling formulation. In contrast, standard PR and CR both sample from well-defined policy distributions at each timestep.

In practice, if the policy update is small (e.g., the parameter change \(\|\theta_{n+1} - \theta_n\|\) is small), the distribution shift introduced by the stale KV cache may be negligible. But for larger updates or long contexts where many layers of stale KV accumulate, the shift can become significant.

1.3 Consistent Rollout (CR)

In a Consistent Rollout system, each trajectory is sampled entirely with a single stale policy \(\pi_{\theta_{n}}\), though different trajectories in the same batch may come from different policy versions (i.e., different \(n\)).

This within-trajectory policy consistency simplifies the IS formulation. Consistent Rollout is widely used in traditional distributed RL systems like IMPALA (Espeholt et al. 2018).

However, implementing Consistent Rollout efficiently while avoiding long-tail bubbles in LLM RL systems is more challenging, as it requires managing multiple versions of model weights simultaneously within the rollout engine and dynamically routing requests of varying lengths.

Sheng, Tong, et al. (2025) implement an LLM RL system that achieves Consistent Rollout with full hardware utilization.

The key advantages of CR are:

  • No re-prefill overhead: each trajectory runs to completion with a single policy, so no KV cache invalidation occurs.
  • No distribution shift: unlike PR-SKV, the KV cache is always consistent with the generating policy.
  • Clean IS formulation: each trajectory has a well-defined, single behavior policy, fitting naturally into Multiple Importance Sampling (Section 3.2).

The main cost is more off-policiness: by the time a trajectory finishes (especially a long one), the trainer may have produced several newer policy versions, making the policy that completes the trajectory more stale than the one PR would use.

2 System Perspective: Cubic Overhead of Full Re-Prefill

A key system-level distinction between PR and the other two strategies (PR-SKV and CR) is the re-prefill overhead. When PR aborts a trajectory and resumes with a newer policy, the new policy must re-process the entire existing context to rebuild its KV cache. We show that this overhead scales cubically with the context length, making it prohibitive for long-context tasks, which are increasingly significant in LLM applications.

2.1 Re-Prefill Frequency Scales Linearly with Context Length

Consider a trajectory of total length \(L\) tokens (including the prompt). Suppose the policy is updated approximately every \(K\) tokens of generation (this interval is determined by the training frequency and system throughput).

The number of re-prefill events during this trajectory is approximately \(L / K\), which grows linearly with \(L\). Longer sequences take more time to generate, during which more policy updates occur, triggering more re-prefills.

2.2 Each Re-Prefill Scales Quadratically with Context Length

Each re-prefill requires the new policy to process all previously generated tokens with dot-product attention. For a context of length \(l\), the attention computation costs \(O(l^2)\) (ignoring the linear terms from feed-forward layers).

Moreover, each successive re-prefill processes a longer context than the previous one: the \(m\)-th re-prefill (at approximately \(mK\) tokens of generation) must process a context of length \(\sim mK\).

2.3 Total Re-Prefill Overhead Scales Cubically

The total re-prefill overhead sums the per-event costs: \(\sum_{m=1}^{L/K} O\!\left((mK)^2\right) = O(L^3/K)\). Intuitively, the cubic scaling is the product of two factors:

  1. The quadratic attention complexity \(O(L^2)\) per re-prefill.
  2. The linear growth \(O(L/K)\) in the number of re-prefills.
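The cubic total can be checked with a few lines. This is a back-of-the-envelope cost model that counts only the quadratic attention term and ignores constants:

```python
def total_reprefill_cost(L, K):
    """Total attention cost (up to a constant) of all re-prefills for one
    trajectory of length L with a policy update every K generated tokens.
    The m-th re-prefill re-processes a context of ~mK tokens at quadratic cost."""
    return sum((m * K) ** 2 for m in range(1, L // K + 1))

# Doubling L multiplies the total by roughly 8, matching O(L^3 / K):
costs = [total_reprefill_cost(L, 1_000) for L in (10_000, 20_000, 40_000)]
ratios = [b / a for a, b in zip(costs, costs[1:])]
```

For these values the ratios come out near 8 (about 7.5 and 7.7), approaching 8 exactly as \(L/K\) grows.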

As frontier models scale their context lengths from 100K+ to 1M+ tokens, this re-prefill overhead becomes increasingly prohibitive.

3 Algorithm Perspective: Importance Sampling

Importance Sampling (IS) has been widely used in LLM RL systems.

IS is useful for estimating the expectation of a function of random variables — typically the gradient of the return with respect to the policy parameters — under the target policy distribution. In practice, with a batch of trajectory samples, the IS estimate is often formulated as:

\[ \begin{aligned} \hat{\mathbb{E}}_{\mu}(f(\boldsymbol{\tau})) =& \frac{1}{N} \sum_{i=1}^{N} \frac{p_{\theta}(\boldsymbol{\tau}_i)}{q_{\mu}(\boldsymbol{\tau}_i)}f(\boldsymbol{\tau}_i) \\ =& \frac{1}{N} \sum_{i=1}^{N} \frac{p(\boldsymbol{s}_{0}) \prod_{t=0}^{T-1} \pi_{\theta}(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})p(\boldsymbol{s}_{t+1} \mid \boldsymbol{s}_{t}, \boldsymbol{a}_{t})}{p(\boldsymbol{s}_{0}) \prod_{t=0}^{T-1} \mu(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})p(\boldsymbol{s}_{t+1} \mid \boldsymbol{s}_{t}, \boldsymbol{a}_{t})}f(\boldsymbol{\tau}_i) \\ =& \frac{1}{N} \sum_{i=1}^{N} \prod_{t=0}^{T-1} \frac{\pi_{\theta}(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})}{\mu(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})}f(\boldsymbol{\tau}_i) \end{aligned} \tag{1}\]

where

  • \(N\) is the batch size,
  • \(f(\cdot)\) can be any function of the trajectory sample \(\boldsymbol{\tau}\), the most important being the gradient of the optimization objective \(\mathbb{E}_{\boldsymbol{\tau} \sim p_{\theta}}[J(\cdot)]\) with respect to the policy parameters \(\theta\),
  • \(p_{\theta}(\cdot)\) is the probability density function of the trajectory distribution induced by the target policy \(\pi_{\theta}\) parameterized by \(\theta\) and the environment state transition distribution \(p(\cdot \mid \boldsymbol{s}, \boldsymbol{a})\),
  • \(q_{\mu}(\cdot)\) is the probability density function of the trajectory distribution induced by the behavior policy \(\mu\) and \(p(\cdot \mid \boldsymbol{s}, \boldsymbol{a})\).

A key property of Equation 1 is that it is unbiased, i.e.,

\[ \mathbb{E}_{\boldsymbol{\tau} \sim q_{\mu}}[\hat{\mathbb{E}}_{\mu}(f(\boldsymbol{\tau}))] = \mathbb{E}_{\boldsymbol{\tau} \sim p_{\theta}}[f(\boldsymbol{\tau})] \tag{2}\]
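As a quick numerical sanity check of Equation 2, the following toy single-token example (distributions chosen arbitrarily) recovers \(\mathbb{E}_{p_{\theta}}[f]\) from samples drawn under a different behavior policy:

```python
import random

def is_estimate(f, p, q, n, rng):
    """Plain IS: sample x ~ q, weight each sample by p(x)/q(x).
    Unbiased for E_{x~p}[f(x)] whenever q > 0 wherever p*f != 0."""
    xs = list(p)
    weights = [q[x] for x in xs]
    total = 0.0
    for _ in range(n):
        x = rng.choices(xs, weights=weights)[0]
        total += p[x] / q[x] * f(x)
    return total / n

p = {0: 0.6, 1: 0.3, 2: 0.1}  # target policy pi_theta over a 3-token vocabulary
q = {0: 0.4, 1: 0.4, 2: 0.2}  # stale behavior policy mu
f = float                      # any function of the sample
exact = sum(p[x] * f(x) for x in p)          # E_p[f] = 0.5
est = is_estimate(f, p, q, 200_000, random.Random(0))
```

For trajectories, the per-step ratios multiply as in Equation 1; the single-step case already shows the mechanism.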

Among the components above, the behavior policy \(\mu(\cdot \mid \boldsymbol{s})\) is the only one whose computation is ambiguous, and correctly specifying it for the off-policy data produced by the fully-utilized LLM RL systems described above is not straightforward.

3.1 Simple but Intractable: Global Behavior Policy \(\mu^{*}(\cdot \mid \boldsymbol{s})\)

Since there is always an actual distribution we are sampling from, we can always formulate the IS estimate with a global behavior policy \(\mu^{*}(\cdot \mid \boldsymbol{s})\).

The most natural approach is to use \(\mu^{*}(\cdot \mid \boldsymbol{s})\) directly for IS. But this requires computing the probabilities under it.

Fu et al. (2025) Proposition 1 constructs a global behavior policy \(\mu^{*}(\boldsymbol{a} \mid \boldsymbol{s})\) satisfying \(\mu^{*}(a_{t} \mid s_{t}) = \pi_{\theta_{n+m}}(a_{t} \mid s_{t})\) for each \((s_{t}, a_{t})\) pair in the trajectory samples. This seems reasonable for LLM RL, since autoregressive generation never revisits a past state within the same trajectory.

However, a complication arises: the same state \(s\) may appear in two different trajectory samples \(\tau_{i}\) and \(\tau_{j}\) (this is especially common for the initial state \(s_{0}\), i.e., the prompt), with the same action \(a\) sampled under different policies \(\pi_{\theta_{n+m}}\) and \(\pi_{\theta_{n+l}}\) (\(l \neq m\)) in each. In general, \(\pi_{\theta_{n+m}}(a \mid s) \neq \pi_{\theta_{n+l}}(a \mid s)\), making it infeasible to construct a single behavior policy for both trajectory samples: \(\mu^{*}(a \mid s)=\pi_{\theta_{n+m}}(a \mid s)\) contradicts \(\mu^{*}(a \mid s)=\pi_{\theta_{n+l}}(a \mid s)\).

Where, then, is the flaw in this construction?

The underlying issue is that \(\mu(\cdot \mid \boldsymbol{s})\) does not account for the probability distribution over which policy \(\pi_{\theta_{n+m}}\) is used. Moreover, this distribution is intractable, as it depends on the system dynamics.

3.2 IS with Consistent Rollout Data

IS with consistent rollout data is simpler to formulate, since each trajectory is sampled from a single policy \(\pi_{\theta_{n}}\), and we need only correctly identify which policy generated each trajectory.

3.2.1 Mixture Importance Sampling

In some simple cases, the distribution over which policy is used, i.e., the hyper-policy distribution, is known; in these cases, we can formulate the IS estimate as Mixture Importance Sampling (Owen 2013).

3.2.2 Multiple Importance Sampling

However, in more general cases, we only know post hoc that \(n_{j}\) trajectories were sampled from each policy \(\pi_{\theta_{j}}\), where \(N = \sum_{j} n_{j}\). For such cases, we can formulate the IS estimate as Multiple Importance Sampling (Owen 2013):

Suppose that \(\boldsymbol{X}_{i j} \sim q_j\) for \(i=1, \ldots, n_j\) and \(j=1, \ldots, J\) and that \(\omega_j\) are a partition of unity. The multiple importance sampling estimate is

\[\widetilde{\mu}_\omega=\sum_{j=1}^J \frac{1}{n_j} \sum_{i=1}^{n_j} \omega_j\left(\boldsymbol{X}_{i j}\right) \frac{f\left(\boldsymbol{X}_{i j}\right) p\left(\boldsymbol{X}_{i j}\right)}{q_j\left(\boldsymbol{X}_{i j}\right)} .\]

Now assume that \(q_j(\boldsymbol{x})>0\) whenever \(\omega_j(\boldsymbol{x}) p(\boldsymbol{x}) f(\boldsymbol{x}) \neq 0\). Then multiple importance sampling is unbiased, because \[\mathbb{E}\left(\widetilde{\mu}_\omega\right)=\sum_{j=1}^J \mathbb{E}_{q_j}\left(\omega_j(\boldsymbol{X}) \frac{f(\boldsymbol{X}) p(\boldsymbol{X})}{q_j(\boldsymbol{X})}\right)=\sum_{j=1}^J \int \omega_j(\boldsymbol{x}) f(\boldsymbol{x}) p(\boldsymbol{x}) \mathrm{d} \boldsymbol{x}=\mu .\]

3.2.3 Balance Heuristic

The natural follow-up question is how to choose the partition of unity \(\omega_j\):

Among the proposals for functions \(\omega_j(\boldsymbol{x})\), the most studied one is the balance heuristic with \(\omega_j(\boldsymbol{x}) \propto n_j q_j(\boldsymbol{x})\), that is

\[\omega_j(\boldsymbol{x})=\omega_j^{\mathrm{BH}}(\boldsymbol{x}) \equiv \frac{n_j q_j(\boldsymbol{x})}{\sum_{k=1}^J n_k q_k(\boldsymbol{x})} .\]

By construction \(q_j(\boldsymbol{x})>0\) holds whenever \(\left(\omega_j^{\mathrm{BH}} p f\right)(\boldsymbol{x}) \neq 0\). Let \(n=\sum_{j=1}^J n_j\) and define \(\alpha_j=n_j / n\). Then using the balance heuristic, \(\widetilde{\mu}_{\omega^{\mathrm{BH}}}\) simplifies to \[\widetilde{\mu}_\alpha=\frac{1}{n} \sum_{j=1}^J \sum_{i=1}^{n_j} \frac{f\left(\boldsymbol{X}_{i j}\right) p\left(\boldsymbol{X}_{i j}\right)}{\sum_{k=1}^J \alpha_k q_k\left(\boldsymbol{X}_{i j}\right)} .\]

In other words, multiple importance sampling with weights from the balance heuristic reduces to the same estimator we would use in mixture importance sampling with mixture weights \(\alpha_j=n_j / n\). Once again, the weight on a given sampled value \(\boldsymbol{X}_{i j}\) does not depend on which mixture component it came from. The balance heuristic is nearly optimal in the following sense:

Theorem 9.8. Let \(n_j \geqslant 1\) be positive integers for \(j=1, \ldots, J\). Let \(\omega_1, \ldots, \omega_J\) be a partition of unity and let \(\omega^{\mathrm{BH}}\) be the balance heuristic. Suppose that \(q_j(\boldsymbol{x})>0\) whenever \(\omega_j(\boldsymbol{x}) p(\boldsymbol{x}) f(\boldsymbol{x}) \neq 0\). Then

\[\operatorname{Var}\left(\widetilde{\mu}_{\omega^{\mathrm{BH}}}\right) \leqslant \operatorname{Var}\left(\widetilde{\mu}_\omega\right)+\left(\frac{1}{\min _j n_j}-\frac{1}{\sum_j n_j}\right) \mu^2 .\]

(Owen 2013)

The intuition behind the balance heuristic (\(\omega_j^{\mathrm{BH}} \propto n_{j}q_j(\boldsymbol{x})\)) can also be understood as: the more samples we have from a policy, the more information we have about it, and thus the more weight it should receive.
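The MIS estimator with balance-heuristic weights can be sketched as follows. This is a toy single-token setting with two stale behavior policies; all distributions are illustrative:

```python
import random

def mis_balance_heuristic(f, p, qs, counts, rng):
    """Multiple IS with the balance heuristic: n_j samples from each q_j,
    each weighted by p(x) / sum_k alpha_k q_k(x) with alpha_k = n_k / n,
    i.e., the mixture-IS estimator the balance heuristic reduces to."""
    n = sum(counts)
    alphas = [c / n for c in counts]
    xs = list(p)
    total = 0.0
    for q, n_j in zip(qs, counts):
        weights = [q[x] for x in xs]
        for _ in range(n_j):
            x = rng.choices(xs, weights=weights)[0]
            mixture = sum(a * qk[x] for a, qk in zip(alphas, qs))
            total += f(x) * p[x] / mixture
    return total / n

p  = {0: 0.6, 1: 0.3, 2: 0.1}  # target policy; E_p[x] = 0.5
q1 = {0: 0.5, 1: 0.3, 2: 0.2}  # older stale policy, n_1 samples
q2 = {0: 0.3, 1: 0.5, 2: 0.2}  # newer stale policy, n_2 samples
est = mis_balance_heuristic(float, p, [q1, q2], [30_000, 10_000],
                            random.Random(0))
```

Note that the sample counts \(n_j\) enter only through the mixture weights \(\alpha_j\); the weight on a given sample does not depend on which policy produced it, matching the reduction above.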

3.3 IS with Partial Rollout Data

IS with partial rollout data (applicable to both PR and PR-SKV) is more difficult to formulate, since we must also account for the policy composition within each trajectory. To the best of our knowledge, this discussion is absent from previous works, including Espeholt et al. (2020), which first proposed Partial Rollout.

In this section, we discuss the formulation and some properties of IS with partial rollout data that may be useful in practice.

Note that for PR-SKV, the actual sampling distribution at each timestep is affected by the stale KV cache and is not exactly any single policy \(\pi_{\theta_{n+m}}\). The formulation below assumes we can identify the policy used at each timestep, which holds exactly for PR but only approximately for PR-SKV (with the approximation quality depending on how much the policy has changed).

3.3.1 Trajectory-dependent Behavior Policy \(\mu_{i}(\cdot \mid \boldsymbol{s})\)

A general but straightforward formulation for an IS estimate with a batch of trajectory samples is to make the behavior policy trajectory-dependent, i.e., use \(\mu_{i}(\boldsymbol{a} \mid \boldsymbol{s})\) for each trajectory \(\tau_{i}\).

Now for the counter-example mentioned in Section 3.1, \(\mu_i(a \mid s)=\pi_{\theta_{n+m}}(a \mid s)\) and \(\mu_j(a \mid s)=\pi_{\theta_{n+l}}(a \mid s)\) are obviously compatible.

With a batch of \(N\) trajectory samples, each sampled from its own behavior policy \(\boldsymbol{\tau}_1 \sim q_{\mu_1}, \ldots, \boldsymbol{\tau}_N \sim q_{\mu_N}\), we can only use the special case of the batch estimate Equation 1 with \(N=1\), i.e., the single-sample estimate

\[ \hat{\mathbb{E}}_{\mu_i}(\boldsymbol{\tau}_i) = \prod_{t=0}^{T-1} \frac{\pi_{\theta}(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})}{\mu_i(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})}f(\boldsymbol{\tau}_i) \tag{3}\]

Note that the single-sample estimate is also unbiased, i.e., the unbiasedness of Equation 1 does not depend on the sample size \(N\).

The standard approach to combine them into a batch estimate is to average the single-sample estimates from Equation 3:

\[ \hat{\mathbb{E}}_{\text{avg}} = \frac{1}{N} \sum_{i=1}^{N} \hat{\mathbb{E}}_{\mu_i} = \frac{1}{N} \sum_{i=1}^{N} \prod_{t=0}^{T-1} \frac{\pi_{\theta}(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})}{\mu_i(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})}f(\boldsymbol{\tau}_i) \tag{4}\]

By the linearity of expectation, the average estimate \(\hat{\mathbb{E}}_{\text{avg}}\) is also unbiased.
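A toy two-step version of Equations 3 and 4 for a PR-style trajectory, where each timestep's action comes from a different behavior policy. The policies here are state-independent and chosen arbitrarily, purely for illustration:

```python
import random

def sample_token(dist, rng):
    xs = list(dist)
    return rng.choices(xs, weights=[dist[x] for x in xs])[0]

def pr_is_estimate(f, pi_target, mus_per_step, n_traj, rng):
    """Equation-3/4-style estimate for PR data: each timestep t of a
    trajectory is sampled from its own behavior policy mu_t (the latest
    available policy), and the importance weight is the product of the
    per-step ratios pi_theta(a_t) / mu_t(a_t)."""
    total = 0.0
    for _ in range(n_traj):
        traj, weight = [], 1.0
        for mu_t in mus_per_step:
            a = sample_token(mu_t, rng)
            weight *= pi_target[a] / mu_t[a]
            traj.append(a)
        total += weight * f(traj)
    return total / n_traj

pi     = {0: 0.7, 1: 0.3}    # target policy pi_theta
mu_old = {0: 0.5, 1: 0.5}    # stale policy used at step 0
mu_new = {0: 0.65, 1: 0.35}  # fresher policy used at step 1, after the switch
f = lambda traj: float(sum(traj))
exact = 2 * (0.7 * 0 + 0.3 * 1)  # E_pi[a_0 + a_1] = 0.6
est = pr_is_estimate(f, pi, [mu_old, mu_new], 100_000, random.Random(0))
```

The estimate recovers the on-policy expectation even though no single policy generated the whole trajectory, consistent with the unbiasedness argument above.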

Now consider the variance. Let the variance of each single-sample estimate be \(\sigma^2_{\mu_i}\). Since the samples are independent, the variance of \(\hat{\mathbb{E}}_{\text{avg}}\) is:

\[ \sigma^2_{\text{avg}} = \frac{1}{N^2} \sum_{i=1}^{N} \sigma^2_{\mu_i} \tag{5}\]

Since the batch variance decomposes into individual single-sample variances, we focus on analyzing the variance of each single-sample estimate.

3.3.2 Variance of Single-Sample Estimate – Consistent vs. Partial Rollout

Consistent Rollout and Partial Rollout simply correspond to two different cases of the single-sample estimate. A natural question is: given a trajectory \(\boldsymbol{\tau}_{i}\), which of

  1. the consistent old policy \(\pi_{\theta_{n}}\) (Consistent Rollout) and
  2. the sliding latest policy \(\mu_{\theta_{n},M}\) using \(\pi_{\theta_{n}},\ldots,\pi_{\theta_{n+M}}\) successively (Partial Rollout)

is better for the unbiased single-sample IS estimate Equation 3, i.e., leads to a lower variance?

Metelli et al. (2020) provided a family of bounds on the variance of the IS estimate in terms of the Rényi divergence:

Intuitively, the magnitude of the importance weights provides an indication of how much the probability measures \(P\) and \(Q\) are dissimilar. This notion can be formalized by the Rényi divergence (Rényi, 1961; Van Erven and Harremos, 2014), an information-theoretic dissimilarity index between probability measures.

Remark 1 (Rényi divergence) Let \(P\) and \(Q\) be two probability measures on a measurable space \((\mathcal{X}, \mathscr{F})\) such that \(P \ll Q\) (\(P\) is absolutely continuous w.r.t. \(Q\)) and \(Q\) is \(\sigma\)-finite. Let \(P\) and \(Q\) admit \(p\) and \(q\) as Lebesgue probability density functions (p.d.f.), respectively. The \(\alpha\)-Rényi divergence between \(P\) and \(Q\) is defined as:

\[D_\alpha(P \| Q)=\frac{1}{\alpha-1} \log \int_{\mathcal{X}}\left(\frac{\mathrm{d} P}{\mathrm{~d} Q}\right)^\alpha \mathrm{d} Q=\frac{1}{\alpha-1} \log \int_{\mathcal{X}} q(x)\left(\frac{p(x)}{q(x)}\right)^\alpha \mathrm{d} x,\]

where \(\mathrm{d} P / \mathrm{d} Q\) is the Radon-Nikodym derivative of \(P\) w.r.t. \(Q\) and \(\alpha \in[0, \infty]\). Some remarkable cases, defined as limits, are: \(\alpha=1\) when \(D_1(P \| Q)=D_{\mathrm{KL}}(P \| Q)\) and \(\alpha=\infty\) yielding \(D_{\infty}(P \| Q)=\log \operatorname{ess} \sup _{\mathcal{X}}\left\{\frac{\mathrm{d} P}{\mathrm{~d} Q}\right\} .\) (\(\operatorname{ess} \sup\) is the essential supremum of a measurable function \(f\), i.e., the smallest \(M\) such that \(f(x) \leqslant M\) almost everywhere.) Importing the notation from Cortes et al. (2010), we denote the exponentiated \(\alpha\)-Rényi divergence as \(d_\alpha(P \| Q)=\exp \left\{D_\alpha(P \| Q)\right\}\). With little abuse of notation, we will replace \(D_\alpha(P \| Q)\) with \(D_\alpha(p \| q)\) whenever possible within the context. When \(P\) and \(Q\) are (multivariate) Gaussian distributions, i.e., \(P \sim \mathcal{N}\left(\boldsymbol{\mu}_P, \boldsymbol{\Sigma}_P\right)\) and \(Q \sim \mathcal{N}\left(\boldsymbol{\mu}_Q, \boldsymbol{\Sigma}_Q\right)\), the Rényi divergence admits a closed-form for \(\alpha \in[0, \infty)\) (Burbea, 1984):

\[D_\alpha(P \| Q)=\frac{\alpha}{2}\left(\boldsymbol{\mu}_P-\boldsymbol{\mu}_Q\right)^T \boldsymbol{\Sigma}_\alpha^{-1}\left(\boldsymbol{\mu}_P-\boldsymbol{\mu}_Q\right)-\frac{1}{2(\alpha-1)} \log \frac{\operatorname{det}\left(\boldsymbol{\Sigma}_\alpha\right)}{\operatorname{det}\left(\boldsymbol{\Sigma}_P\right)^{1-\alpha} \operatorname{det}\left(\boldsymbol{\Sigma}_Q\right)^\alpha},\]

where \(\boldsymbol{\Sigma}_\alpha=\alpha \boldsymbol{\Sigma}_Q+(1-\alpha) \boldsymbol{\Sigma}_P\) under the assumption that \(\boldsymbol{\Sigma}_\alpha\) is positive-definite. The Rényi divergence can be computed in closed form also for several widely used distributions (Gil et al., 2013).

The Rényi divergence provides a convenient expression for the moments of the importance weights: \(\mathbb{E}_{x \sim Q}\left[w_{P / Q}(x)^\alpha\right]=d_\alpha(P \| Q)^{\alpha-1}\). Moreover, we can relate the Rényi divergence with the variance and the essential supremum of the importance weights (Cortes et al., 2010):

\[\begin{aligned}& \underset{x \sim Q}{\operatorname{Var}}\left[w_{P / Q}(x)\right]=d_2(P \| Q)-1 \\ & \underset{x \sim Q}{\operatorname{ess} \sup }\left\{w_{P / Q}(x)\right\}=d_{\infty}(P \| Q) .\end{aligned}\]

Lemma 1. Let \(P\) and \(Q\) be two probability measures on the measurable space \((\mathcal{X}, \mathscr{F})\) such that \(P \ll Q\). Let \(\alpha \in[1,+\infty], \mathbf{x}=\left(x_1, x_2, \ldots, x_N\right)^T\) be i.i.d. random variables sampled from \(Q\) and \(f: \mathcal{X} \rightarrow \mathbb{R}\) be a function with bounded \(\frac{2 \alpha}{\alpha-1}\)-moment under \(Q\left(\|f\|_{Q, \frac{2 \alpha}{\alpha-1}}<+\infty\right)\). Then, for any \(N>0\), the variance of the IS estimator \(\widehat{\mu}_{P / Q}\) can be upper bounded as:

\[\operatorname{Var}_{\mathbf{x} \sim Q}\left[\hat{\mu}_{P / Q}\right] \leqslant \frac{1}{N}\|f\|_{Q, \frac{2 \alpha}{\alpha-1}}^2 d_{2 \alpha}(P \| Q)^{2-\frac{1}{\alpha}},\]

where we used the abbreviation \(\mathbf{x} \sim Q\) for denoting \(x_i \sim Q\) for all \(i=1,2, \ldots, N\) all independent.

This result generalizes Lemma 4.1 of Metelli et al. (2018), that can be recovered by setting \(\alpha=1\) under the condition that \(\|f\|_{\infty}<+\infty\) :

\[\operatorname{Var}_{\mathbf{x} \sim Q}\left[\widehat{\mu}_{P / Q}\right] \leqslant \frac{1}{N}\|f\|_{\infty}^2 d_2(P \| Q) .\]

When \(\alpha = 1\), the Rényi divergence reduces to the Kullback-Leibler divergence, which is widely used in RL analysis.

When \(P=Q\) almost everywhere, we get \(\operatorname{Var}_{\mathbf{x} \sim Q}\left[\hat{\mu}_{Q / Q}\right] \leqslant \frac{1}{N}\|f\|_{\infty}^2\), a well-known upper bound to the variance of a Monte Carlo estimator. Recalling the definition of ESS, \(\operatorname{ESS}(P \| Q) = N / d_2(P \| Q)\) (Equation 7 of Metelli et al. (2020)), we can rewrite the previous bound as:

\[\underset{\mathbf{x} \sim Q}{\operatorname{Var}}\left[\hat{\mu}_{P / Q}\right] \leqslant \frac{\|f\|_{\infty}^2}{\operatorname{ESS}(P \| Q)} .\]

Thus, the variance scales with ESS instead of \(N\), justifying the definition of ESS.

In our context, \(P\) is the target trajectory distribution \(p_{\theta}\), and \(Q\) is the trajectory distribution induced by \(\mu_{\theta_{n},M}\) (Partial Rollout) or \(\pi_{\theta_{n}}\) (Consistent Rollout).

It is possible that \(d_{2 \alpha}(p_{\theta} \| q_{\theta_{n},M}) < d_{2 \alpha}(p_{\theta} \| q_{\theta_{n}})\), since the newer the policy, the more similar its induced distribution is to \(p_{\theta}\). Then, as long as \(\|f\|_{q_{\theta_{n},M}, \frac{2 \alpha}{\alpha-1}}\) is not much larger than \(\|f\|_{q_{\theta_{n}}, \frac{2 \alpha}{\alpha-1}}\), the estimate using \(\mu_{\theta_{n},M}\) has a better variance bound than that using \(\pi_{\theta_{n}}\).
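The exponentiated 2-Rényi divergence, and the identity \(\operatorname{Var}_{x \sim Q}[w_{P/Q}(x)] = d_2(P \| Q) - 1\), are easy to check numerically for discrete toy policies (the distributions below are illustrative). Note how the less stale policy gives a smaller divergence and hence a tighter variance bound:

```python
import random

def d2(p, q):
    """Exponentiated 2-Renyi divergence for discrete p, q:
    d_2(P||Q) = sum_x p(x)^2 / q(x)."""
    return sum(p[x] ** 2 / q[x] for x in p)

p       = {0: 0.6, 1: 0.3, 2: 0.1}     # target policy p_theta
q_stale = {0: 0.4, 1: 0.4, 2: 0.2}     # more stale behavior policy
q_fresh = {0: 0.55, 1: 0.33, 2: 0.12}  # fresher policy, closer to p_theta

# Fresher policy => smaller divergence => tighter bound Var <= ||f||^2 d_2 / N.
gap_stale = d2(p, q_stale) - 1
gap_fresh = d2(p, q_fresh) - 1

# Monte Carlo check of Var_{x~q}[p(x)/q(x)] = d_2(P||Q) - 1:
rng = random.Random(0)
xs = list(p)
ws = [p[x] / q_stale[x]
      for x in rng.choices(xs, weights=[q_stale[x] for x in xs], k=200_000)]
mean = sum(ws) / len(ws)
var_emp = sum((w - mean) ** 2 for w in ws) / len(ws)
```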

In practice, the effectiveness of IS estimates is often assessed via empirical diagnostic metrics such as the Effective Sample Size (ESS). For example, Piché et al. (2025) measured the ESS of different sampling policies they used.
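The standard empirical ESS diagnostic on a batch of importance weights can be computed as below. The helper is a generic sketch, not code from Piché et al. (2025):

```python
def effective_sample_size(weights):
    """Empirical ESS of a set of importance weights:
    ESS = (sum_i w_i)^2 / sum_i w_i^2.
    Equals N for uniform weights and approaches 1 as one weight dominates."""
    s = sum(weights)
    return s * s / sum(w * w for w in weights)

# Uniform weights: every sample counts fully.
ess_uniform = effective_sample_size([1.0] * 100)
# One dominant weight: most samples are effectively wasted.
ess_skewed = effective_sample_size([10.0] + [0.1] * 99)
```

For the skewed case the ESS collapses to about 4 effective samples out of 100, which is the kind of degradation this diagnostic is meant to surface.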

3.4 Comparing PR(-SKV) and CR for Importance Sampling

We can now summarize the comparison between the three strategies from the IS perspective.

Bias. Both PR(-SKV) and CR can produce unbiased IS estimates, as shown above. For CR, each trajectory has a single well-defined behavior policy, and the unbiasedness follows directly from Equation 2. For PR, the trajectory-dependent behavior policy formulation (Equation 3) preserves unbiasedness, and averaging over the batch (Equation 4) inherits it by linearity of expectation. For PR-SKV, the unbiasedness holds approximately under the assumption that the stale KV cache introduces negligible distribution shift.

Variance. The variance comparison between PR(-SKV) and CR is more nuanced, and it is hard to declare a clear winner:

  • PR(-SKV) can be less off-policy. By always switching to the latest policy once available, PR(-SKV) ensures that the latter portion of each trajectory is generated by a more recent (less stale) policy. This can lead to importance weights closer to 1, reducing variance through smaller Rényi divergence \(d_{2\alpha}(p_{\theta} \| q_{\mu})\).

  • CR can exploit Multiple Importance Sampling. Because each trajectory is sampled from a single consistent policy, CR naturally fits the Multiple Importance Sampling formulation (Section 3.2). This opens the door to variance reduction techniques like the balance heuristic, which provides near-optimal weighting across samples from different policies (Theorem 9.8 in Owen (2013)). PR data, with its mixed-policy trajectories, cannot directly leverage this formulation without additional approximations.

3.5 Discussions

3.5.1 Distribution Shift from Reusing Stale KV Cache

The distribution shift from reusing stale KV cache is an important practical concern that warrants further investigation. When the stale KV cache is reused with a new policy, the effective sampling distribution at each timestep depends on the entire history of policy switches and the accumulated staleness of the cached representations. Quantifying this shift and its impact on IS estimate quality remains an open question.

3.5.2 Minimizing Variance by Optimizing the Weighting

In Equation 4, we use the simple uniform weight \(\frac{1}{N}\). In fact, we can use any weight functions \(\omega_i(\boldsymbol{x})\) that form a partition of unity, i.e., \(\omega_i(\boldsymbol{x}) \geqslant 0\) with \(\sum_{i=1}^N \omega_i(\boldsymbol{x}) = 1\) for all \(\boldsymbol{x}\), while preserving unbiasedness. Different partitions of unity yield estimates that are all unbiased but may differ in variance. The balance heuristic is a special case of this in the Multiple Importance Sampling formulation, where \(\omega_i^{\mathrm{BH}}(\boldsymbol{x}) \propto n_i q_i(\boldsymbol{x})\). However, the optimal weighting remains an open question.
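The variance gap between two valid partitions of unity can be seen numerically. The sketch below (again hypothetical 1-D Gaussian policies, not an LLM experiment) compares the uniform weight against the balance heuristic on the same sampling layout: both are unbiased for \(\mathbb{E}_p[f]\), but when the behavior policies cover different regions of the target, the balance heuristic is typically much tighter.

```python
import numpy as np

rng = np.random.default_rng(1)

def pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Hypothetical setup: target p = N(0, 1); two behavior policies covering
# opposite sides of the target, with n samples from each.
p = lambda x: pdf(x, 0.0, 1.0)
q_params = [(2.0, 1.0), (-2.0, 1.0)]
f = lambda x: x  # E_p[f] = 0
n = 500

def estimate(weighting):
    xs = [rng.normal(mu, s, n) for mu, s in q_params]
    if weighting == "uniform":
        # Uniform partition of unity: plain per-policy IS, then average.
        return np.mean([np.mean(f(x) * p(x) / pdf(x, mu, s))
                        for x, (mu, s) in zip(xs, q_params)])
    # Balance heuristic: omega_j(x) proportional to n_j q_j(x).
    return sum(np.sum(f(x) * p(x) / sum(n * pdf(x, mu, s)
                                        for mu, s in q_params))
               for x in xs)

var_uniform = np.var([estimate("uniform") for _ in range(200)])
var_balance = np.var([estimate("balance") for _ in range(200)])
print(var_uniform, var_balance)  # balance heuristic is much tighter here
```

The mechanism matches the note above: under the uniform weight, samples landing where one behavior policy is thin receive huge importance ratios, while the balance heuristic's shared denominator keeps every weight bounded.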

Acknowledgments

Thanks to Wang Zhang for helpful discussions.

Citation

BibTeX:

@article{tong2026partialvsconsistent,
  author = {Yuxuan Tong and Yingru Li and Guangming Sheng},
  title = {{Partial Rollout vs Consistent Rollout in LLM RL}},
  journal = {Blog},
  date = {2026-01-07},
  url = {https://tongyx361.github.io/posts/pr-vs-cr},
  language = {English},
}

References

Espeholt, Lasse, Raphaël Marinier, Piotr Stanczyk, Ke Wang, and Marcin Michalski. 2020. “SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference.” In International Conference on Learning Representations. https://openreview.net/forum?id=rkgvXlrKwH.
Espeholt, Lasse, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, et al. 2018. “IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures.” In Proceedings of the 35th International Conference on Machine Learning, edited by Jennifer Dy and Andreas Krause, 80:1407–16. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v80/espeholt18a.html.
Fu, Wei, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, et al. 2025. “AREAL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning.” In The Thirty-Ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=X9diEuva9R.
Hu, Jian, Xibin Wu, Wei Shen, Jason Klein Liu, Weixun Wang, Songlin Jiang, Haoran Wang, et al. 2025. “OpenRLHF: A Ray-Based Easy-to-Use, Scalable and High-Performance RLHF Framework.” In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, edited by Ivan Habernal, Peter Schulam, and Jörg Tiedemann, 656–66. Suzhou, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.emnlp-demos.48.
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, et al. 2025. “Kimi K1.5: Scaling Reinforcement Learning with LLMs.” https://arxiv.org/abs/2501.12599.
Metelli, Alberto Maria, Matteo Papini, Nico Montali, and Marcello Restelli. 2020. “Importance Sampling Techniques for Policy Optimization.” Journal of Machine Learning Research 21 (141): 1–75. http://jmlr.org/papers/v21/20-124.html.
Owen, Art B. 2013. Monte Carlo Theory, Methods and Examples. https://artowen.su.domains/mc/.
Piché, Alexandre, Ehsan Kamalloo, Rafael Pardinas, Xiaoyin Chen, and Dzmitry Bahdanau. 2025. “PipelineRL: Faster on-Policy Reinforcement Learning for Long Sequence Generation.” https://arxiv.org/abs/2509.19128.
Shen, Gerald, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, et al. 2024. “NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment.” In First Conference on Language Modeling. https://openreview.net/forum?id=yK2eGE8QVW.
Sheng, Guangming, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, et al. 2025. “Laminar: A Scalable Asynchronous RL Post-Training Framework.” https://arxiv.org/abs/2510.12633.
Sheng, Guangming, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. “HybridFlow: A Flexible and Efficient RLHF Framework.” In Proceedings of the Twentieth European Conference on Computer Systems, 1279–97. EuroSys ’25. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3689031.3696075.
Yao, Zhewei, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, et al. 2023. “DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-Like Models at All Scales.” https://arxiv.org/abs/2308.01320.