1 Background
1.1 Importance Sampling (IS)
It is common practice to optimize an LLM policy on off-policy data with Importance Sampling (IS):
\[ \hat{\mathbb{E}}_{\mu} = \frac{p_{\theta}(\boldsymbol{\tau})}{q_{\mu}(\boldsymbol{\tau})}f(\boldsymbol{\tau}), \quad \boldsymbol{\tau} \sim q_{\mu}, \]
where \(f(\boldsymbol{\tau})\) can be the gradient of the objective function to be optimized, \(p_{\theta}(\boldsymbol{\tau})\) is the trajectory distribution induced by the target policy \(\pi_{\theta}\) and \(q_{\mu}(\boldsymbol{\tau})\) is the trajectory distribution induced by the behavior policy \(\mu\).
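For concreteness, here is a minimal Python sketch (the helper names are ours, not from any particular codebase) of how the trajectory-level importance weight \(p_{\theta}(\boldsymbol{\tau})/q_{\mu}(\boldsymbol{\tau})\) is typically computed for an auto-regressive policy, as the exponential of the summed per-token log-probability differences:

```python
import math

def is_weight(target_logprobs, behavior_logprobs):
    """Trajectory-level importance weight p_theta(tau) / q_mu(tau).

    For an auto-regressive policy, both trajectory probabilities factor into
    per-token conditionals (any shared environment dynamics cancel), so the
    ratio is exp(sum of per-token log-prob differences). The behavior
    log-probs are the ones recorded at sampling time, possibly produced by
    different policy versions (see Section 1.2).
    """
    assert len(target_logprobs) == len(behavior_logprobs)
    log_ratio = sum(t - b for t, b in zip(target_logprobs, behavior_logprobs))
    return math.exp(log_ratio)

def single_sample_is_estimate(target_logprobs, behavior_logprobs, f_value):
    """Single-sample IS estimate: (p_theta(tau) / q_mu(tau)) * f(tau)."""
    return is_weight(target_logprobs, behavior_logprobs) * f_value

# Example: a 3-token trajectory with f(tau) = 1.0 (e.g., a unit reward).
print(single_sample_is_estimate([-1.0, -0.5, -2.0], [-1.1, -0.4, -2.0], 1.0))
```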
1.2 Sliding Latest Policy Trajectory (SLAPT)
We use Sliding Latest Policy Trajectories (SLAPTs) to refer to the trajectories sampled in a Reinforcement Learning (RL) system where we always use the latest policy \(\pi_{\theta_{t}}\) at each moment to sample the actions.
Formally, a SLAPT can be defined as a trajectory \(\tau = (s_{0}, a_{0,\pi_{\theta_{n}}}, \ldots, s_{t}, a_{t,\pi_{\theta_{n+m}}}, \ldots, s_{T}, a_{T,\pi_{\theta_{n+M}}})\), where \(s_{t}\) is the state at timestep \(t\) and \(a_{t,\pi_{\theta_{n+m}}}\) is the action sampled from \(\pi_{\theta_{n+m}}\), the latest policy available in the system at sampling time. So for \(l > m\), \(\pi_{\theta_{n+l}}\) is usually more on-policy than \(\pi_{\theta_{n+m}}\).
SLAPTs are off-policy by definition, but they are as close to on-policy as the system can achieve on a best-effort basis.
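To make the bookkeeping explicit, a SLAPT can be represented by tagging every sampled token with the policy version that produced it and the log-probability it was sampled with. A minimal sketch (the field names are illustrative, not taken from any particular system):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SLAPT:
    """A Sliding Latest Policy Trajectory.

    Each action (token) records the version of the policy that sampled it
    and its behavior log-probability, so that the trajectory-wise behavior
    probability can later be recovered as the product of these per-token
    probabilities (see Section 2.2).
    """
    prompt: str
    tokens: List[int] = field(default_factory=list)
    policy_versions: List[int] = field(default_factory=list)   # n + m per token
    behavior_logprobs: List[float] = field(default_factory=list)

    def append(self, token: int, version: int, logprob: float) -> None:
        self.tokens.append(token)
        self.policy_versions.append(version)
        self.behavior_logprobs.append(logprob)
```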
1.3 Partial Rollout
SLAPTs are a natural result of a popular design choice in today's LLM RL systems called Partial Rollout, adopted by Kimi k1.5 (Kimi Team et al. 2025), AReaL (Fu et al. 2025) and PipelineRL (Piché et al. 2025), among others, and dating back to the traditional distributed RL system SEED RL (Espeholt et al. 2020):
The ongoing trajectory rollouts are
- aborted when
- either, in synchronous systems like Kimi k1.5, there are enough samples collected for training, releasing the resources for the training engine to update the model weights,
- or, in asynchronous systems like AReaL and PipelineRL, a new version of model weights is produced by the trainer;
- continued with the latest version of model weights (a minimal sketch of this abort-and-continue loop follows the list).
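For illustration, here is a schematic sketch of that loop, reusing the SLAPT container sketched in Section 1.2; the sampler and the new-weights signal are mocked, and none of the names correspond to a real system API:

```python
import random

def sample_token(trajectory, policy_version):
    """Mock sampler; a real system would query the inference engine here."""
    return random.randint(0, 999), -random.random()   # (token id, log-prob)

def partial_rollout(trajectory, latest_version, new_weights_ready,
                    max_new_tokens=64, eos_token=0):
    """Extend one SLAPT under the latest policy until it finishes or is aborted.

    Returns True if the trajectory reached EOS within this call, and False
    if it should be resumed later under newer weights, either because the
    abort condition fired (new weights arrived, or enough samples were
    collected) or because the per-call token budget ran out.
    """
    for _ in range(max_new_tokens):
        if new_weights_ready():
            return False                      # abort; resume with new weights
        token, logprob = sample_token(trajectory, latest_version)
        trajectory.append(token, latest_version, logprob)
        if token == eos_token:
            return True                       # finished normally
    return False
```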
Partial Rollout is motivated by its efficiency and simplicity, which we explain in detail below.
1.3.1 Efficiency: Eliminating Long-Tail Bubbles
The earliest LLM RL systems (Hu et al. 2025) all adopt synchronous architectures, where the trainer always waits for all trajectories to finish before updating the weights with the collected data.
However, as the context length of LLMs scales up, the trajectory length distribution becomes increasingly skewed. If the distribution is heavily skewed but all trajectories are required to finish within the same rollout stage, only a few long-tail requests may remain in the system near the end of the stage, causing severe under-utilization (typically <30% in practice) (Sheng, Tong, et al. 2025).
Kimi k1.5 proposes to fix this issue by aborting all ongoing rollouts once enough training samples are collected, updating the weights directly with these data instead of waiting for all trajectories to finish, and then continuing the rollouts with the new model weights.
1.3.2 Simplicity: Only Maintaining One Version of Model Weights
In asynchronous RL systems with an experience buffer, it is troublesome to manage multiple versions of model weights within the rollout engine.
AReaL proposes to maintain only the latest model weights across all instances. Once a new version of model weights is produced by the trainer, all rollouts of the stale policy are aborted and then continued with the latest policy.
This can be implemented simply by always loading the latest weights into all the inference engine instances, avoiding the hassle of managing which requests run on which instance.
2 IS with Sliding Latest Policy Trajectories
Despite Partial Rollout’s efficiency and simplicity, there have been worries about applying IS with SLAPTs, since most previous distributed RL works, including the most representative IMPALA (Espeholt et al. 2018), formulate IS with a single consistent behavior policy \(\mu(a \mid s)\), and it feels odd to mix multiple policies \(\pi_{\theta_{n}},\ldots,\pi_{\theta_{n+m}},\ldots,\pi_{\theta_{n+M}}\) within a single trajectory.
2.1 IS with Global Behavior Policy \(\mu(\boldsymbol{a} \mid \boldsymbol{s})\)
Since there is always an actual distribution we are sampling from, we can always formulate it as a global behavior policy \(\mu(\boldsymbol{a} \mid \boldsymbol{s})\).
It is natural to try to use \(\mu(\boldsymbol{a} \mid \boldsymbol{s})\) for IS. So how can we calculate it?
It is easy to notice that, for each trajectory of an LLM policy, we can usually construct a behavior policy \(\mu(\boldsymbol{a} \mid \boldsymbol{s})\) satisfying \(\mu(a_{t} \mid s_{t}) = \pi_{\theta_{n+m}}(a_{t} \mid s_{t})\) at every timestep \(t\), since the LLM is usually auto-regressive and thus never revisits a past state within the same trajectory.
However, it might be confusing when we also notice that, for the same state \(s\) appearing in two different SLAPTs \(\tau_{i}\) and \(\tau_{j}\), the same action \(a\) might be sampled from different policies \(\pi_{\theta_{n+m}}\) and \(\pi_{\theta_{n+l}}\) (\(l \neq m\)), respectively. Since it is likely that \(\pi_{\theta_{n+m}}(a_{t} \mid s_{t}) \neq \pi_{\theta_{n+l}}(a_{t} \mid s_{t})\), we cannot simply construct the same behavior policy for both SLAPTs, i.e., \(\mu(a_{t} \mid s_{t})=\pi_{\theta_{n+m}}(a_{t} \mid s_{t})\) contradicts \(\mu(a_{t} \mid s_{t})=\pi_{\theta_{n+l}}(a_{t} \mid s_{t})\).
So where does the construction of \(\mu(\boldsymbol{a} \mid \boldsymbol{s})\) above go wrong?
The hidden problem is that the construction above does not account for the probability distribution over which LLM policy \(\pi_{\theta_{n+m}}\) is used at each timestep, and this distribution is intractable since it depends on the system dynamics.
2.2 IS with Trajectory-wise Behavior Policy \(\mu_{i}(\boldsymbol{a} \mid \boldsymbol{s})\)
A trivial solution is to construct a trajectory-wise behavior policy \(\mu_{i}(\boldsymbol{a} \mid \boldsymbol{s})\) for each trajectory \(\tau_{i}\).
Now for the counter-example mentioned above, \(\mu_i(\boldsymbol{a} \mid \boldsymbol{s})=\pi_{\theta_{n+m}}(\boldsymbol{a} \mid \boldsymbol{s})\) and \(\mu_j(\boldsymbol{a} \mid \boldsymbol{s})=\pi_{\theta_{n+l}}(\boldsymbol{a} \mid \boldsymbol{s})\) are obviously compatible.
With a batch of \(N\) trajectories, each sampled from its own behavior policy, \(\boldsymbol{\tau}_1 \sim q_{\mu_1}, \ldots, \boldsymbol{\tau}_N \sim q_{\mu_N}\), it is easy to formulate the single-sample estimate \(\hat{\mathbb{E}}_{\mu_i} = \frac{p_{\theta}(\boldsymbol{\tau}_i)}{q_{\mu_i}(\boldsymbol{\tau}_i)}f(\boldsymbol{\tau}_i)\).
Note that the single-sample estimate is already unbiased, i.e., the unbiasedness does not depend on the sample size.
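Concretely, unbiasedness of each single-sample estimate follows in one line from the definitions above, provided \(q_{\mu_i}(\boldsymbol{\tau}) > 0\) wherever \(p_{\theta}(\boldsymbol{\tau}) f(\boldsymbol{\tau}) \neq 0\):
\[ \mathbb{E}_{\boldsymbol{\tau}_i \sim q_{\mu_i}}\left[\hat{\mathbb{E}}_{\mu_i}\right] = \int q_{\mu_i}(\boldsymbol{\tau}) \frac{p_{\theta}(\boldsymbol{\tau})}{q_{\mu_i}(\boldsymbol{\tau})} f(\boldsymbol{\tau}) \,\mathrm{d}\boldsymbol{\tau} = \int p_{\theta}(\boldsymbol{\tau}) f(\boldsymbol{\tau}) \,\mathrm{d}\boldsymbol{\tau} = \mathbb{E}_{\boldsymbol{\tau} \sim p_{\theta}}\left[f(\boldsymbol{\tau})\right]. \]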
The common practice to combine them for an estimate is to average the single-sample estimates \(\hat{\mathbb{E}}_{\mu_i}, i = 1, \ldots, N\), i.e., \(\hat{\mathbb{E}}_{\text{avg}} = \frac{1}{N} \sum_{i=1}^{N} \hat{\mathbb{E}}_{\mu_i}\).
Due to linearity of expectation, \(\hat{\mathbb{E}}_{\text{avg}}\) is still unbiased.
Now let’s take a look at the variance. Let the variance of each single-sample estimate be \(\sigma^2_{\mu_i}\). Since the samples are independent, the variance of the average estimate \(\hat{\mathbb{E}}_{\text{avg}}\) is:
\[ \sigma^2_{\text{avg}} = \frac{1}{N^2} \sum_{i=1}^{N} \sigma^2_{\mu_i} \]
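As a sanity check on the averaged estimate, here is a small self-contained numerical sketch (toy discrete distributions stand in for trajectory distributions; all names and numbers are illustrative): it draws one sample from each of \(N\) distinct behavior distributions, averages the single-sample IS estimates, and compares the mean over many repetitions against the exact value of \(\mathbb{E}_{p}[f]\).

```python
import random

random.seed(0)

# Toy target distribution p and N per-trajectory behavior distributions q_i,
# all over the same 3-element support (stand-ins for trajectory distributions).
support = [0, 1, 2]
p = [0.2, 0.5, 0.3]

def f(x):
    return float(x * x)                         # the function being estimated

true_value = sum(pi * f(x) for x, pi in zip(support, p))

def behavior(i, N):
    """q_i: interpolates from a stale distribution towards p as i grows."""
    stale = [0.6, 0.3, 0.1]
    w = i / max(N - 1, 1)
    return [(1 - w) * s + w * t for s, t in zip(stale, p)]

def avg_is_estimate(N=16):
    total = 0.0
    for i in range(N):
        q = behavior(i, N)
        x = random.choices(support, weights=q)[0]   # tau_i ~ q_{mu_i}
        total += p[x] / q[x] * f(x)                 # single-sample IS estimate
    return total / N

estimates = [avg_is_estimate() for _ in range(20000)]
print("true value:", true_value)
print("mean of averaged estimates:", sum(estimates) / len(estimates))
```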
2.3 Mixed Sliding Latest Policy Might Have Lower Variance than Consistent Old Policy
Now we wonder: given a trajectory \(\boldsymbol{\tau}_{i}\), which behavior policy leads to the better IS estimate, i.e., a lower variance: the mixed best-effort policy \(\mu_{\theta_{n},M}\) built from \(\pi_{\theta_{n}},\ldots,\pi_{\theta_{n+M}}\), or the consistent old policy \(\pi_{\theta_{n}}\)?
Metelli et al. (2020) provided a family of bounds on the variance of the IS estimator in terms of the Rényi divergence:
Lemma 1. Let \(P\) and \(Q\) be two probability measures on the measurable space \((\mathcal{X}, \mathscr{F})\) such that \(P \ll Q\). Let \(\alpha \in[1,+\infty], \mathbf{x}=\left(x_1, x_2, \ldots, x_N\right)^T\) be i.i.d. random variables sampled from \(Q\) and \(f: \mathcal{X} \rightarrow \mathbb{R}\) be a function with bounded \(\frac{2 \alpha}{\alpha-1}\)-moment under \(Q\left(\|f\|_{Q, \frac{2 \alpha}{\alpha-1}}<+\infty\right)\). Then, for any \(N>0\), the variance of the IS estimator \(\widehat{\mu}_{P / Q}\) can be upper bounded as:
\[\operatorname{Var}_{\mathbf{x} \sim Q}\left[\hat{\mu}_{P / Q}\right] \leqslant \frac{1}{N}\|f\|_{Q, \frac{2 \alpha}{\alpha-1}}^2 d_{2 \alpha}(P \| Q)^{2-\frac{1}{\alpha}},\]
where we used the abbreviation \(\mathbf{x} \sim Q\) for denoting \(x_i \sim Q\) for all \(i=1,2, \ldots, N\) all independent.
This result generalizes Lemma 4.1 of Metelli et al. (2018), that can be recovered by setting \(\alpha=1\) under the condition that \(\|f\|_{\infty}<+\infty\) :
\[\operatorname{Var}_{\mathbf{x} \sim Q}\left[\widehat{\mu}_{P / Q}\right] \leqslant \frac{1}{N}\|f\|_{\infty}^2 d_2(P \| Q) .\]
Note that the Rényi divergence of order 1 is the Kullback-Leibler divergence widely used in RL analysis, while the bound at \(\alpha = 1\) involves the order-2 divergence \(d_{2}\).
When \(P=Q\) almost everywhere, we get \(\operatorname{Var}_{\mathbf{x} \sim Q}\left[\hat{\mu}_{Q / Q}\right] \leqslant \frac{1}{N}\|f\|_{\infty}^2\), a well-known upper bound to the variance of a Monte Carlo estimator. Recalling the definition of ESS, \(\operatorname{ESS}(P \| Q) = N / d_2(P \| Q)\), we can rewrite the previous bound as:
\[\underset{\mathbf{x} \sim Q}{\operatorname{Var}}\left[\hat{\mu}_{P / Q}\right] \leqslant \frac{\|f\|_{\infty}^2}{\operatorname{ESS}(P \| Q)} .\]
Thus, the variance scales with ESS instead of \(N\), justifying the definition of ESS.
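To make these quantities concrete, here is a small sketch computing \(d_2(P \| Q) = \mathbb{E}_{x \sim Q}[(P(x)/Q(x))^2]\) (the exponentiated Rényi divergence of order 2 used above) and the corresponding \(\operatorname{ESS}(P \| Q) = N / d_2(P \| Q)\) for toy discrete distributions; the numbers are illustrative, not from any experiment.

```python
def d2(p, q):
    """Exponentiated Renyi divergence of order 2: E_{x~Q}[(P(x)/Q(x))^2]."""
    assert all(qi > 0 for qi in q)
    return sum(pi * pi / qi for pi, qi in zip(p, q))

def ess(p, q, n):
    """Effective sample size of n draws from Q when estimating under P."""
    return n / d2(p, q)

# Toy numbers: a target P, a fresher behavior Q_new and a staler Q_old.
P     = [0.20, 0.50, 0.30]
Q_new = [0.25, 0.45, 0.30]
Q_old = [0.60, 0.30, 0.10]
N = 1024
print("ESS under the fresher behavior:", ess(P, Q_new, N))   # close to N
print("ESS under the staler behavior:", ess(P, Q_old, N))    # noticeably smaller
```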
In our context, \(P\) is the target distribution \(p_{\theta}\) and \(Q\) is \(\mu_{\theta_{n},M}\) or \(\pi_{\theta_{n}}\).
Intuitively, it is likely that \(d_{2 \alpha}(p_{\theta} \| q_{\mu_{\theta_{n},M}}) < d_{2 \alpha}(p_{\theta} \| p_{\theta_{n}})\), since the newer a policy is, the more similar its induced trajectory distribution is to \(p_{\theta}\).
As long as \(\|f\|_{q_{\mu_{\theta_{n},M}}, \frac{2 \alpha}{\alpha-1}}\) is not much larger than \(\|f\|_{p_{\theta_{n}}, \frac{2 \alpha}{\alpha-1}}\), the estimate using \(\mu_{\theta_{n},M}\) has a better guarantee than the estimate using \(\pi_{\theta_{n}}\).
These conjectures can be verified empirically. For example, Figures 6 and 7 of Piché et al. (2025) show that \(\mu_{\theta_{n},M}\) has a higher ESS and a lower KL divergence relative to \(p_{\theta}\) than \(\pi_{\theta_{n}}\), respectively.
We are also planning experiments to further verify these conjectures.
2.4 Exploiting (Approximately) Identical Distributions with Multiple IS
Besides the likely superiority of each single-sample estimate, the combination of multiple importance sampling estimates has other properties that can be exploited.
Formally, the combination can generalize from the uniform weight \(\frac{1}{N}\) to any “partition of unity”, which is a collection of \(J \geqslant 1\) weight functions \(\omega_j(\boldsymbol{x}) \geqslant 0\) satisfying \(\sum_{j=1}^J \omega_j(\boldsymbol{x})=1\) for all \(\boldsymbol{x}\). Different partitions of unity lead to estimates that are all unbiased but have different variances.
Given the property that SLAPTs of similar lengths usually conform to (approximately) identical distributions, which can be made exact if we restrict the timesteps at which a trajectory can be aborted, we can reformulate the estimation as Multiple Importance Sampling (Owen 2013):
Suppose that \(\boldsymbol{X}_{i j} \sim q_j\) for \(i=1, \ldots, n_j\) and \(j=1, \ldots, J\) and that \(\omega_j\) are a partition of unity. The multiple importance sampling estimate is
\[\widetilde{\mu}_\omega=\sum_{j=1}^J \frac{1}{n_j} \sum_{i=1}^{n_j} \omega_j\left(\boldsymbol{X}_{i j}\right) \frac{f\left(\boldsymbol{X}_{i j}\right) p\left(\boldsymbol{X}_{i j}\right)}{q_j\left(\boldsymbol{X}_{i j}\right)} .\] (Owen 2013)
Now assume that \(q_j(\boldsymbol{x})>0\) whenever \(\omega_j(\boldsymbol{x}) p(\boldsymbol{x}) f(\boldsymbol{x}) \neq 0\). Then multiple importance sampling is unbiased, because \[\mathbb{E}\left(\widetilde{\mu}_\omega\right)=\sum_{j=1}^J \mathbb{E}_{q_j}\left(\omega_j(\boldsymbol{X}) \frac{f(\boldsymbol{X}) p(\boldsymbol{X})}{q_j(\boldsymbol{X})}\right)=\sum_{j=1}^J \int \omega_j(\boldsymbol{x}) f(\boldsymbol{x}) p(\boldsymbol{x}) \mathrm{d} \boldsymbol{x}=\mu .\]
Among the proposals for functions \(\omega_j(\boldsymbol{x})\), the most studied one is the balance heuristic with \(\omega_j(\boldsymbol{x}) \propto n_j q_j(\boldsymbol{x})\), that is
\[\omega_j(\boldsymbol{x})=\omega_j^{\mathrm{BH}}(\boldsymbol{x}) \equiv \frac{n_j q_j(\boldsymbol{x})}{\sum_{k=1}^J n_k q_k(\boldsymbol{x})} .\]
By construction \(q_j(\boldsymbol{x})>0\) holds whenever \(\left(\omega_j^{\mathrm{BH}} p f\right)(\boldsymbol{x}) \neq 0\). Let \(n=\sum_{j=1}^J n_j\) and define \(\alpha_j=n_j / n\). Then using the balance heuristic, \(\widetilde{\mu}_{\omega^{\mathrm{BH}}}\) simplifies to \[\widetilde{\mu}_\alpha=\frac{1}{n} \sum_{j=1}^J \sum_{i=1}^{n_j} \frac{f\left(\boldsymbol{X}_{i j}\right) p\left(\boldsymbol{X}_{i j}\right)}{\sum_{k=1}^J \alpha_k q_k\left(\boldsymbol{X}_{i j}\right)} .\]
In other words, multiple importance sampling with weights from the balance heuristic reduces to the same estimator we would use in mixture importance sampling with mixture weights \(\alpha_j=n_j / n\). Once again, the weight on a given sampled value \(\boldsymbol{X}_{i j}\) does not depend on which mixture component it came from. The balance heuristic is nearly optimal in the following sense:
Theorem 9.8. Let \(n_j \geqslant 1\) be positive integers for \(j=1, \ldots, J\). Let \(\omega_1, \ldots, \omega_J\) be a partition of unity and let \(\omega^{\mathrm{BH}}\) be the balance heuristic. Suppose that \(q_j(\boldsymbol{x})>0\) whenever \(\omega_j(\boldsymbol{x}) p(\boldsymbol{x}) f(\boldsymbol{x}) \neq 0\). Then
\[\operatorname{Var}\left(\widetilde{\mu}_{\omega^{\mathrm{BH}}}\right) \leqslant \operatorname{Var}\left(\widetilde{\mu}_\omega\right)+\left(\frac{1}{\min _j n_j}-\frac{1}{\sum_j n_j}\right) \mu^2 .\]
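To illustrate, here is a small self-contained sketch of the balance-heuristic estimator \(\widetilde{\mu}_{\omega^{\mathrm{BH}}}\) in its simplified mixture form, using toy discrete component distributions; in our setting, each \(q_j\) would be the (approximately) shared trajectory distribution of a group of SLAPTs, and the toy numbers here are purely illustrative.

```python
import random

random.seed(0)

support = [0, 1, 2, 3]
p = [0.1, 0.4, 0.3, 0.2]                      # target distribution

def f(x):
    return float(x)                            # the integrand

true_value = sum(pi * f(x) for x, pi in zip(support, p))

# J component distributions q_j and per-component sample counts n_j.
qs = [[0.4, 0.3, 0.2, 0.1],
      [0.2, 0.4, 0.3, 0.1],
      [0.1, 0.4, 0.3, 0.2]]
ns = [50, 100, 200]

def balance_heuristic_estimate(p, qs, ns):
    """Multiple importance sampling with balance-heuristic weights.

    With omega_j(x) = n_j q_j(x) / sum_k n_k q_k(x), the estimator reduces
    to the mixture form: (1/n) * sum over all samples of
    f(x) p(x) / sum_k alpha_k q_k(x), with alpha_k = n_k / n.
    """
    n = sum(ns)
    alphas = [nj / n for nj in ns]
    total = 0.0
    for q, nj in zip(qs, ns):
        for x in random.choices(support, weights=q, k=nj):    # X_ij ~ q_j
            mixture = sum(a * qk[x] for a, qk in zip(alphas, qs))
            total += f(x) * p[x] / mixture
    return total / n

print("true value:", true_value)
print("balance-heuristic estimate:", balance_heuristic_estimate(p, qs, ns))
```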