[WIP] Justifying Importance Sampling with Sliding Latest Policy Trajectories

English
Technical
Authors

Yuxuan Tong

Yingru Li

Guangming Sheng

Published

2026/01/07

Abstract

Modern Large Language Model (LLM) Reinforcement Learning (RL) systems, such as Kimi k1.5 and AReaL, increasingly adopt Partial Rollout strategies to mitigate long-tail latency and simplify weight management. These systems generate Sliding Latest Policy Trajectories (SLAPTs) – sequences where the sampling policy may update mid-rollout – thereby challenging the standard assumptions of Importance Sampling (IS) which typically rely on a consistent behavior policy.

This post provides a justification for applying Importance Sampling to SLAPTs. We demonstrate that by constructing trajectory-wise behavior policies, we can maintain unbiased estimates despite the mixed-policy nature of the rollouts. Furthermore, leveraging variance bounds based on Rényi divergence, we argue that SLAPTs – being closer to the target policy than consistent but stale behavior policies – are likely to yield lower variance. Finally, we propose that Multiple Importance Sampling (MulIS) with the Balance Heuristic (BH) can be utilized to further exploit the distributional properties of these trajectories for better policy optimization.

Keywords

Importance Sampling, Off-Policy Reinforcement Learning, Large Language Model

1 Background

1.1 Importance Sampling (IS)

It is common practice to optimize an LLM policy on off-policy data with Importance Sampling (IS):

\[ \hat{\mathbb{E}}_{\mu} = \frac{p_{\theta}(\boldsymbol{\tau})}{q_{\mu}(\boldsymbol{\tau})}f(\boldsymbol{\tau}), \qquad \boldsymbol{\tau} \sim q_{\mu}, \]

where \(f(\boldsymbol{\tau})\) can be the gradient of the objective function to be optimized, \(p_{\theta}(\boldsymbol{\tau})\) is the trajectory distribution induced by the target policy \(\pi_{\theta}\) and \(q_{\mu}(\boldsymbol{\tau})\) is the trajectory distribution induced by the behavior policy \(\mu\).
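As a concrete illustration, here is a minimal sketch (a hypothetical helper, not from any particular framework) of computing the trajectory-level IS ratio \(\frac{p_{\theta}(\boldsymbol{\tau})}{q_{\mu}(\boldsymbol{\tau})}\) for an autoregressive LLM, where the deterministic transition of appending a token cancels and the ratio factorizes over sampled tokens:

```python
import torch

def trajectory_is_weight(target_logprobs: torch.Tensor,
                         behavior_logprobs: torch.Tensor) -> torch.Tensor:
    """Compute p_theta(tau) / q_mu(tau) for one trajectory.

    Both arguments hold per-token log-probabilities of the sampled actions
    (shape (T,)): under the target policy pi_theta and under the behavior
    policy mu, respectively. The trajectory-level ratio is the product of
    per-token ratios, i.e. the exponential of the summed log-ratio.
    """
    return torch.exp(target_logprobs.sum() - behavior_logprobs.sum())
```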

1.2 Sliding Latest Policy Trajectory (SLAPT)

We use Sliding Latest Policy Trajectories (SLAPTs) to refer to the trajectories sampled in a Reinforcement Learning (RL) system that always uses the latest policy available at each moment to sample the actions.

Formally, a SLAPT can be defined as a trajectory \(\boldsymbol{\tau} = (s_{0}, a_{0,\pi_{\theta_{n}}}, \ldots, s_{t}, a_{t,\pi_{\theta_{n+m}}}, \ldots, s_{T}, a_{T,\pi_{\theta_{n+M}}})\), where \(s_{t}\) is the state at timestep \(t\) and \(a_{t,\pi_{\theta_{n+m}}}\) is the action sampled from \(\pi_{\theta_{n+m}}\), the latest policy available in the system at the moment of sampling. Since the policy version can only increase along the trajectory, for \(l > m\), \(\pi_{\theta_{n+l}}\) is usually more on-policy than \(\pi_{\theta_{n+m}}\).

SLAPTs are off-policy by definition, but they are as close to on-policy as the system can make them on a best-effort basis.
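To make this concrete, here is a minimal bookkeeping sketch (field names are ours, not taken from any particular system) of what a rollout engine would need to record per trajectory so that the trajectory-wise behavior policy discussed in Section 2.2 can be reconstructed:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SLAPT:
    """One Sliding Latest Policy Trajectory.

    Each sampled token records the version of the policy that produced it
    and its log-probability under that policy at sampling time, which is
    all that is needed later to evaluate the trajectory-wise behavior density.
    """
    prompt_ids: List[int]                                          # the initial state s_0
    token_ids: List[int] = field(default_factory=list)             # sampled actions a_0, ..., a_T
    policy_versions: List[int] = field(default_factory=list)       # version of the policy that sampled each token
    behavior_logprobs: List[float] = field(default_factory=list)   # log mu_i(a_t | s_t) recorded at sampling time
```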

1.3 Partial Rollout

SLAPTs are a natural result of a design choice called Partial Rollout that is popular in today's LLM RL systems, such as Kimi k1.5 (Kimi Team et al. 2025), AReaL (Fu et al. 2025), and PipelineRL (Piché et al. 2025), and which dates back to the classic distributed RL system SEED RL (Espeholt et al. 2020):

The ongoing trajectory rollouts are

  1. aborted when
    • either, in synchronous systems like Kimi k1.5, enough samples have been collected for training, releasing the resources for the training engine to update the model weights,
    • or, in asynchronous systems like AReaL and PipelineRL, a new version of model weights is produced by the trainer;
  2. continued with the latest version of model weights.

Partial Rollout is motivated by its efficiency and simplicity. We explain both aspects in detail below.

1.3.1 Efficiency: Eliminating Long-Tail Bubbles

The earliest LLM RL systems, such as OpenRLHF (Hu et al. 2025), adopt synchronous architectures, where the trainer waits until all trajectories have finished before updating the weights with the collected data.

However, as the context length of LLMs scales up, the trajectory length distribution becomes increasingly skewed. If all trajectories are required to finish within the same rollout stage, only a few long-tail requests may remain in the system near the end of each stage, causing severe under-utilization (typically <30% in practice) (Sheng, Tong, et al. 2025).

Kimi k1.5 proposes to fix this issue by aborting all ongoing rollouts once enough training samples have been collected, updating the weights with these data immediately instead of waiting for every trajectory to finish, and then continuing the aborted rollouts with the new model weights.

1.3.2 Simplicity: Only Maintaining One Version of Model Weights

In asynchronous RL systems with an experience buffer, it is troublesome to manage multiple versions of model weights within the rollout engine.

AReaL proposes to maintain only the latest model weights across all instances. Once a new version of model weights is produced by the trainer, all rollouts of the stale policy are aborted and then continued with the latest policy.

This can be implemented simply by always loading the latest weights into every inference engine instance, avoiding the hassle of tracking which weight version each request is served by.
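The following toy sketch illustrates the abort-and-continue behavior; `latest_version` and `sample_token` are hypothetical stand-ins for the trainer's weight store and the inference engine, not any system's actual API:

```python
def rollout_with_partial_restarts(prompts, latest_version, sample_token,
                                  eos_id, chunk_size=8, max_rounds=64):
    """Toy abort-and-continue loop that produces SLAPTs.

    Each round generates at most `chunk_size` tokens per unfinished trajectory
    with whatever policy version is currently the latest; between rounds the
    trainer may have published new weights, so a finished trajectory can mix
    several policy versions.
    """
    trajs = [{"token_ids": list(p), "policy_versions": [], "done": False}
             for p in prompts]
    for _ in range(max_rounds):
        if all(t["done"] for t in trajs):
            break
        version = latest_version()  # always continue with the newest weights
        for t in trajs:
            if t["done"]:
                continue
            for _ in range(chunk_size):
                tok = sample_token(t["token_ids"], version)
                t["token_ids"].append(tok)
                t["policy_versions"].append(version)
                if tok == eos_id:
                    t["done"] = True
                    break
    return trajs
```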

2 IS with Sliding Latest Policy Trajectories

Despite Partial Rollout’s efficiency and simplicity, there have been concerns about applying IS to SLAPTs, since most previous distributed RL works, including the most representative IMPALA (Espeholt et al. 2018), formulate IS with a single consistent behavior policy \(\mu(a \mid s)\), and it feels odd to mix multiple policies \(\pi_{\theta_{n}},\ldots,\pi_{\theta_{n+m}},\ldots,\pi_{\theta_{n+M}}\) within a single trajectory.

2.1 IS with Global Behavior Policy \(\mu(\boldsymbol{a} \mid \boldsymbol{s})\)

Since there is always an actual distribution we are sampling from, we can always formulate it as a global behavior policy \(\mu(\boldsymbol{a} \mid \boldsymbol{s})\).

It is natural to try to use \(\mu(\boldsymbol{a} \mid \boldsymbol{s})\) for IS. So how can we calculate it?

It is easy to notice that, for each trajectory of an LLM policy, we can usually construct a behavior policy \(\mu(\boldsymbol{a} \mid \boldsymbol{s})\) satisfying \(\mu(a_{t} \mid s_{t}) = \pi_{\theta_{n+m}}(a_{t} \mid s_{t})\) at every timestep \(t\), where \(\pi_{\theta_{n+m}}\) is whichever policy actually sampled \(a_{t}\), since the LLM is usually auto-regressive and thus never revisits a past state within the same trajectory.

However, it might be confusing when we also notice that, for the same state \(s\) appearing in two different SLAPTs \(\boldsymbol{\tau}_{i}\) and \(\boldsymbol{\tau}_{j}\), the same action \(a\) might be sampled from different policies \(\pi_{\theta_{n+m}}\) and \(\pi_{\theta_{n+l}}\) (\(l \neq m\)), respectively. It is likely that \(\pi_{\theta_{n+m}}(a \mid s) \neq \pi_{\theta_{n+l}}(a \mid s)\), making it impossible to construct the same behavior policy for both SLAPTs, i.e., \(\mu(a \mid s)=\pi_{\theta_{n+m}}(a \mid s)\) contradicts \(\mu(a \mid s)=\pi_{\theta_{n+l}}(a \mid s)\).

So what is wrong with the construction of \(\mu(\boldsymbol{a} \mid \boldsymbol{s})\) above?

The hidden problem is that a single global \(\mu(\boldsymbol{a} \mid \boldsymbol{s})\) would have to account for the probability distribution over which LLM policy \(\pi_{\theta_{n+m}}\) is used at each state, and this distribution is intractable since it depends on the system dynamics (e.g., when the trainer happens to publish new weights relative to each ongoing request).

2.2 IS with Trajectory-wise Behavior Policy \(\mu_{i}(\boldsymbol{a} \mid \boldsymbol{s})\)

A trivial solution is to construct a trajectory-wise behavior policy \(\mu_{i}(\boldsymbol{a} \mid \boldsymbol{s})\) for each trajectory \(\tau_{i}\).

Now, for the counter-example above, setting \(\mu_i(a \mid s)=\pi_{\theta_{n+m}}(a \mid s)\) for \(\boldsymbol{\tau}_i\) and \(\mu_j(a \mid s)=\pi_{\theta_{n+l}}(a \mid s)\) for \(\boldsymbol{\tau}_j\) at that state causes no contradiction, since \(\mu_i\) and \(\mu_j\) are two separate behavior policies.

With a batch of \(N\) trajectories, each sampled from its own behavior policy, \(\boldsymbol{\tau}_1 \sim q_{\mu_1}, \ldots, \boldsymbol{\tau}_N \sim q_{\mu_N}\), it is easy to formulate the single-sample estimate \(\hat{\mathbb{E}}_{\mu_i} = \frac{p_{\theta}(\boldsymbol{\tau}_i)}{q_{\mu_i}(\boldsymbol{\tau}_i)}f(\boldsymbol{\tau}_i)\).
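Spelling out the trajectory densities (writing \(v_i(t)\), our notation, for the version index of the policy that sampled \(a_t\) in \(\boldsymbol{\tau}_i\)), the deterministic transition of an autoregressive LLM cancels in the ratio, so the importance weight reduces to a product of per-token probability ratios:

\[ \frac{p_{\theta}(\boldsymbol{\tau}_i)}{q_{\mu_i}(\boldsymbol{\tau}_i)} = \prod_{t=0}^{T} \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\mu_i(a_{t} \mid s_{t})} = \prod_{t=0}^{T} \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{v_i(t)}}(a_{t} \mid s_{t})}. \]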

Note that the single-sample estimate is already unbiased, i.e., the unbiasedness does not depend on the sample size.

The common practice to combine them for an estimate is to average the single-sample estimates \(\hat{\mathbb{E}}_{\mu_i}, i = 1, \ldots, N\), i.e., \(\hat{\mathbb{E}}_{\text{avg}} = \frac{1}{N} \sum_{i=1}^{N} \hat{\mathbb{E}}_{\mu_i}\).

Due to linearity of expectation, \(\hat{\mathbb{E}}_{\text{avg}}\) is still unbiased.
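For completeness, here is the one-line derivation (assuming \(q_{\mu_i}(\boldsymbol{\tau}) > 0\) wherever \(p_{\theta}(\boldsymbol{\tau}) f(\boldsymbol{\tau}) \neq 0\)):

\[ \mathbb{E}\big[\hat{\mathbb{E}}_{\text{avg}}\big] = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{\boldsymbol{\tau}_i \sim q_{\mu_i}}\!\left[\frac{p_{\theta}(\boldsymbol{\tau}_i)}{q_{\mu_i}(\boldsymbol{\tau}_i)} f(\boldsymbol{\tau}_i)\right] = \frac{1}{N}\sum_{i=1}^{N} \int p_{\theta}(\boldsymbol{\tau}) f(\boldsymbol{\tau})\,\mathrm{d}\boldsymbol{\tau} = \mathbb{E}_{\boldsymbol{\tau} \sim p_{\theta}}\big[f(\boldsymbol{\tau})\big]. \]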

Now let’s look at the variance. Let the variance of each single-sample estimate be \(\sigma^2_{\mu_i}\). Since the samples are independent, the variance of the average estimate \(\hat{\mathbb{E}}_{\text{avg}}\) is:

\[ \sigma^2_{\text{avg}} = \frac{1}{N^2} \sum_{i=1}^{N} \sigma^2_{\mu_i} \]

2.3 Mixed Sliding Latest Policy Might Have Lower Variance than Consistent Old Policy

Now we wonder: given a trajectory \(\boldsymbol{\tau}_{i}\), which behavior policy leads to a lower-variance IS estimate, the mixed best-effort policy \(\mu_{\theta_{n},M}\) built from \(\pi_{\theta_{n}},\ldots,\pi_{\theta_{n+M}}\), or the consistent but old policy \(\pi_{\theta_{n}}\)?

Metelli et al. (2020) provide a family of bounds on the variance of the IS estimate in terms of the Rényi divergence:

Lemma 1. Let \(P\) and \(Q\) be two probability measures on the measurable space \((\mathcal{X}, \mathscr{F})\) such that \(P \ll Q\). Let \(\alpha \in[1,+\infty], \mathbf{x}=\left(x_1, x_2, \ldots, x_N\right)^T\) be i.i.d. random variables sampled from \(Q\) and \(f: \mathcal{X} \rightarrow \mathbb{R}\) be a function with bounded \(\frac{2 \alpha}{\alpha-1}\)-moment under \(Q\left(\|f\|_{Q, \frac{2 \alpha}{\alpha-1}}<+\infty\right)\). Then, for any \(N>0\), the variance of the IS estimator \(\widehat{\mu}_{P / Q}\) can be upper bounded as:

\[\operatorname{Var}_{\mathbf{x} \sim Q}\left[\hat{\mu}_{P / Q}\right] \leqslant \frac{1}{N}\|f\|_{Q, \frac{2 \alpha}{\alpha-1}}^2 d_{2 \alpha}(P \| Q)^{2-\frac{1}{\alpha}},\]

where we used the abbreviation \(\mathbf{x} \sim Q\) for denoting \(x_i \sim Q\) for all \(i=1,2, \ldots, N\) all independent.

This result generalizes Lemma 4.1 of Metelli et al. (2018), which can be recovered by setting \(\alpha=1\) under the condition that \(\|f\|_{\infty}<+\infty\):

\[\operatorname{Var}_{\mathbf{x} \sim Q}\left[\widehat{\mu}_{P / Q}\right] \leqslant \frac{1}{N}\|f\|_{\infty}^2 d_2(P \| Q) .\]

When \(\alpha = 1\), the Rényi divergence is the Kullback-Leibler divergence widely used in RL analysis.
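Here \(d_{\alpha}\) denotes the exponentiated Rényi divergence used by Metelli et al. (2020):

\[ d_{\alpha}(P \| Q) = \exp\big(D_{\alpha}(P \| Q)\big), \qquad D_{\alpha}(P \| Q) = \frac{1}{\alpha-1} \log \int_{\mathcal{X}} \left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right)^{\alpha} \mathrm{d}Q. \]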

When \(P=Q\) almost everywhere, we get \(\operatorname{Var}_{\mathbf{x} \sim Q}\left[\hat{\mu}_{Q / Q}\right] \leqslant \frac{1}{N}\|f\|_{\infty}^2\), a well-known upper bound to the variance of a Monte Carlo estimator. Recalling the definition of ESS, \(\operatorname{ESS}(P \| Q) = N / d_{2}(P \| Q)\) (Metelli et al. 2020), we can rewrite the previous bound as:

\[\underset{\mathbf{x} \sim Q}{\operatorname{Var}}\left[\hat{\mu}_{P / Q}\right] \leqslant \frac{\|f\|_{\infty}^2}{\operatorname{ESS}(P \| Q)} .\]

Thus, the variance scales with ESS instead of \(N\), justifying the definition of ESS.
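In practice, the ESS of a batch of SLAPTs can be monitored with the standard empirical estimator; the sketch below assumes trajectory-level log-ratios \(\log p_{\theta}(\boldsymbol{\tau}_i) - \log q_{\mu_i}(\boldsymbol{\tau}_i)\) are available:

```python
import torch

def empirical_ess(log_ratios: torch.Tensor) -> torch.Tensor:
    """Empirical effective sample size (sum_i w_i)^2 / sum_i w_i^2 of a batch
    of importance weights, computed from log-ratios. The max is subtracted
    for numerical stability; the common scale cancels in the ratio.
    """
    w = torch.exp(log_ratios - log_ratios.max())
    return w.sum() ** 2 / (w * w).sum()
```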

In our context, \(P\) is the target trajectory distribution \(p_{\theta}\), and \(Q\) is the trajectory distribution induced by either the mixed sliding policy \(\mu_{\theta_{n},M}\) (denoted \(q_{\theta_{n},M}\)) or the consistent old policy \(\pi_{\theta_{n}}\) (denoted \(p_{\theta_{n}}\)).

Intuitively, it is likely that \(d_{2 \alpha}(p_{\theta} \| q_{\theta_{n},M}) < d_{2 \alpha}(p_{\theta} \| p_{\theta_{n}})\), since the newer the policies used for sampling are, the more similar the induced distribution is to \(p_{\theta}\).

As long as \(\|f\|_{q_{\theta_{n},M}, \frac{2 \alpha}{\alpha-1}}\) is not much larger than \(\|f\|_{p_{\theta_{n}}, \frac{2 \alpha}{\alpha-1}}\), the estimate using \(\mu_{\theta_{n},M}\) enjoys a better variance guarantee than the estimate using \(\pi_{\theta_{n}}\).

These conjectures can be checked empirically. For example, Figures 6 and 7 of Piché et al. (2025) show that \(\mu_{\theta_{n},M}\) achieves higher ESS and lower KL divergence with respect to \(p_{\theta}\) than \(\pi_{\theta_{n}}\), respectively.

We are also planning experiments to further verify these conjectures.
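One simple diagnostic for such experiments is a Monte Carlo estimate of the KL divergence from the behavior distribution to the target, which needs only the behavior log-probabilities recorded at sampling time and one forward pass under the current policy. This is a sketch; the direction \(\mathrm{KL}(q \,\|\, p_{\theta})\) is our choice here and is not necessarily the one reported by Piché et al. (2025):

```python
import torch

def kl_behavior_to_target(behavior_logprobs: torch.Tensor,
                          target_logprobs: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of KL(q || p_theta) from trajectories tau_i ~ q:
    KL(q || p_theta) = E_q[log q(tau) - log p_theta(tau)], estimated by the
    mean log-ratio. Inputs are per-trajectory summed log-probabilities, shape (N,).
    """
    return (behavior_logprobs - target_logprobs).mean()
```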

2.4 Exploiting (Approximately) Identical Distributions with Multiple IS

Besides the likely lower variance of each single-sample estimate, the way multiple single-sample estimates are combined offers further properties that can be exploited.

Formally, the combination can actually generalize from the uniform weight \(\frac{1}{N}\) to any “partition of unity”, which is a collection of \(J \geqslant 1\) weight functions \(\omega_j(\boldsymbol{x}) \geqslant 0\) that satisfy \(\sum_{j=1}^J \omega_j(\boldsymbol{x})=1\) for all \(\boldsymbol{x}\). Different partitions of unity lead to estimates that are all unbiased but have different variances.

Given that SLAPTs of similar lengths usually conform to approximately identical distributions (this can be made exact if we restrict the timesteps at which a trajectory can be aborted), we can reformulate the estimation as Multiple Importance Sampling (Owen 2013):

Suppose that \(\boldsymbol{X}_{i j} \sim q_j\) for \(i=1, \ldots, n_j\) and \(j=1, \ldots, J\) and that \(\omega_j\) are a partition of unity. The multiple importance sampling estimate is

\[\widetilde{\mu}_\omega=\sum_{j=1}^J \frac{1}{n_j} \sum_{i=1}^{n_j} \omega_j\left(\boldsymbol{X}_{i j}\right) \frac{f\left(\boldsymbol{X}_{i j}\right) p\left(\boldsymbol{X}_{i j}\right)}{q_j\left(\boldsymbol{X}_{i j}\right)} .\] (Owen 2013)

Now assume that \(q_j(\boldsymbol{x})>0\) whenever \(\omega_j(\boldsymbol{x}) p(\boldsymbol{x}) f(\boldsymbol{x}) \neq 0\). Then multiple importance sampling is unbiased, because \[\mathbb{E}\left(\widetilde{\mu}_\omega\right)=\sum_{j=1}^J \mathbb{E}_{q_j}\left(\omega_j(\boldsymbol{X}) \frac{f(\boldsymbol{X}) p(\boldsymbol{X})}{q_j(\boldsymbol{X})}\right)=\sum_{j=1}^J \int \omega_j(\boldsymbol{x}) f(\boldsymbol{x}) p(\boldsymbol{x}) \mathrm{d} \boldsymbol{x}=\mu .\]

Among the proposals for functions \(\omega_j(\boldsymbol{x})\), the most studied one is the balance heuristic with \(\omega_j(\boldsymbol{x}) \propto n_j q_j(\boldsymbol{x})\), that is

\[\omega_j(\boldsymbol{x})=\omega_j^{\mathrm{BH}}(\boldsymbol{x}) \equiv \frac{n_j q_j(\boldsymbol{x})}{\sum_{k=1}^J n_k q_k(\boldsymbol{x})} .\]

By construction \(q_j(\boldsymbol{x})>0\) holds whenever \(\left(\omega_j^{\mathrm{BH}} p f\right)(\boldsymbol{x}) \neq 0\). Let \(n=\sum_{j=1}^J n_j\) and define \(\alpha_j=n_j / n\). Then using the balance heuristic, \(\widetilde{\mu}_{\omega^{\mathrm{BH}}}\) simplifies to \[\widetilde{\mu}_\alpha=\frac{1}{n} \sum_{j=1}^J \sum_{i=1}^{n_j} \frac{f\left(\boldsymbol{X}_{i j}\right) p\left(\boldsymbol{X}_{i j}\right)}{\sum_{k=1}^J \alpha_k q_k\left(\boldsymbol{X}_{i j}\right)} .\]

In other words, multiple importance sampling with weights from the balance heuristic reduces to the same estimator we would use in mixture importance sampling with mixture weights \(\alpha_j=n_j / n\). Once again, the weight on a given sampled value \(\boldsymbol{X}_{i j}\) does not depend on which mixture component it came from. The balance heuristic is nearly optimal in the following sense:

Theorem 9.8. Let \(n_j \geqslant 1\) be positive integers for \(j=1, \ldots, J\). Let \(\omega_1, \ldots, \omega_J\) be a partition of unity and let \(\omega^{\mathrm{BH}}\) be the balance heuristic. Suppose that \(q_j(\boldsymbol{x})>0\) whenever \(\omega_j(\boldsymbol{x}) p(\boldsymbol{x}) f(\boldsymbol{x}) \neq 0\). Then

\[\operatorname{Var}\left(\widetilde{\mu}_{\omega^{\mathrm{BH}}}\right) \leqslant \operatorname{Var}\left(\widetilde{\mu}_\omega\right)+\left(\frac{1}{\min _j n_j}-\frac{1}{\sum_j n_j}\right) \mu^2 .\]
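As a concrete sketch of how the balance heuristic could be applied (e.g., grouping trajectories by the set of policy versions that produced them), the generic density-based implementation below computes \(\widetilde{\mu}_\alpha\); it is not tied to any RL library:

```python
def balance_heuristic_mis(samples, f, p, qs):
    """Multiple importance sampling estimate with balance-heuristic weights.

    samples: list over j of lists of draws X_{ij} from q_j
    f, p:    callables returning f(x) and the target density p(x)
    qs:      list of callables, qs[j](x) = q_j(x)

    Implements (1/n) * sum_{j,i} f(X_{ij}) p(X_{ij}) / sum_k alpha_k q_k(X_{ij})
    with alpha_k = n_k / n, which equals the balance-heuristic MIS estimate.
    """
    ns = [len(xs) for xs in samples]
    n = sum(ns)
    alphas = [nj / n for nj in ns]
    total = 0.0
    for xs in samples:
        for x in xs:
            mixture = sum(a * q(x) for a, q in zip(alphas, qs))
            total += f(x) * p(x) / mixture
    return total / n
```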

Citation

BibTeX:

@article{tong2026is-slapt,
  author = {Yuxuan Tong and Yingru Li and Guangming Sheng},
  title = {{Justifying Importance Sampling with Sliding Latest Policy Trajectories}},
  journal = {Blog},
  year = {2026},
  url = {https://tongyx361.github.io/posts/is-slapt},
  language = {English},
}

References

Espeholt, Lasse, Raphaël Marinier, Piotr Stanczyk, Ke Wang, and Marcin Michalski. 2020. “SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference.” In International Conference on Learning Representations. https://openreview.net/forum?id=rkgvXlrKwH.
Espeholt, Lasse, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, et al. 2018. “IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures.” In Proceedings of the 35th International Conference on Machine Learning, edited by Jennifer Dy and Andreas Krause, 80:1407–16. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v80/espeholt18a.html.
Fu, Wei, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, et al. 2025. “AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning.” In The Thirty-Ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=X9diEuva9R.
Hu, Jian, Xibin Wu, Wei Shen, Jason Klein Liu, Weixun Wang, Songlin Jiang, Haoran Wang, et al. 2025. “OpenRLHF: A Ray-Based Easy-to-Use, Scalable and High-Performance RLHF Framework.” In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, edited by Ivan Habernal, Peter Schulam, and Jörg Tiedemann, 656–66. Suzhou, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.emnlp-demos.48.
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, et al. 2025. “Kimi K1.5: Scaling Reinforcement Learning with LLMs.” https://arxiv.org/abs/2501.12599.
Metelli, Alberto Maria, Matteo Papini, Nico Montali, and Marcello Restelli. 2020. “Importance Sampling Techniques for Policy Optimization.” Journal of Machine Learning Research 21 (141): 1–75. http://jmlr.org/papers/v21/20-124.html.
Owen, Art B. 2013. Monte Carlo Theory, Methods and Examples. https://artowen.su.domains/mc/.
Piché, Alexandre, Ehsan Kamalloo, Rafael Pardinas, Xiaoyin Chen, and Dzmitry Bahdanau. 2025. “PipelineRL: Faster on-Policy Reinforcement Learning for Long Sequence Generation.” https://arxiv.org/abs/2509.19128.
Shen, Gerald, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, et al. 2024. “NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment.” In First Conference on Language Modeling. https://openreview.net/forum?id=yK2eGE8QVW.
Sheng, Guangming, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, et al. 2025. “Laminar: A Scalable Asynchronous RL Post-Training Framework.” https://arxiv.org/abs/2510.12633.
Sheng, Guangming, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. “HybridFlow: A Flexible and Efficient RLHF Framework.” In Proceedings of the Twentieth European Conference on Computer Systems, 1279–97. EuroSys ’25. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3689031.3696075.
Yao, Zhewei, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, et al. 2023. “DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-Like Models at All Scales.” https://arxiv.org/abs/2308.01320.
