[WIP] Importance Sampling Done Right with Off-Policy Data in Fully-Utilized LLM RL Systems

English

Technical
Authors

Yuxuan Tong

Yingru Li

Guangming Sheng

Published

2026/01/07

Last Modified

2026/01/12

Abstract

In the pursuit of maximizing hardware utilization for Large Language Model (LLM) Reinforcement Learning, systems have introduced new sampling strategies such as Partial Rollout and Asynchronous Rollout. These strategies produce complex off-policy data structures, which we categorize into Sliding Latest Policy Trajectories (SLAPTs) and Multiple Consistent Stale Policy Trajectories (MCSPTs). This post discusses the properties of applying Importance Sampling (IS) to these distinct data forms, combining theoretical tools like Multiple Importance Sampling and the Rényi divergence with practical observations like the (approximately) identical distribution of SLAPTs of similar lengths.

Keywords

Importance Sampling, Off-Policy Reinforcement Learning, Large Language Model

1 Off-Policy Data in Fully-Utilized LLM RL Systems

In Reinforcement Learning (RL) systems for Large Language Models (LLMs), especially those that pursue full utilization of the hardware resources, it is common that we have to utilize off-policy data, typically a batch of trajectory samples collected from one or multiple stale policies.

So far, there have been two main kinds of off-policy data emerging in such systems, which in this post we refer to as:

  1. Sliding Latest Policy Trajectories (SLAPTs)
  2. Multiple Consistent Stale Policy Trajectories (MCSPTs)

1.1 Sliding Latest Policy Trajectories (SLAPTs)

We use Sliding Latest Policy Trajectories to refer to such batches of trajectories that are sampled in an RL system where we always use the latest policy \(\pi_{\theta_{t}}\) at each moment to sample the actions.

Formally, a SLAPT can be defined as a trajectory \(\tau = (s_{0}, a_{0,\pi_{\theta_{n}}}, \ldots, s_{t}, a_{t,\pi_{\theta_{n+m}}}, \ldots, s_{T}, a_{T,\pi_{\theta_{n+M}}})\), where \(s_{t}\) is the state at timestep \(t\) and \(a_{t,\pi_{\theta_{n+m}}}\) is the action sampled from the policy \(\pi_{\theta_{n+m}}\) that is the latest policy available in the system when sampling. So for \(l > m\), \(\pi_{\theta_{n+l}}\) is usually more on-policy than \(\pi_{\theta_{n+m}}\).

SLAPTs in a batch usually have different policy compositions, depending on the system dynamics.
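For concreteness, below is a minimal sketch of how a SLAPT might be represented in code. The class and field names are illustrative assumptions rather than the schema of any particular system, but they capture the per-token policy version and behavior log-probability that the IS formulations in Section 2 will need.

```python
from dataclasses import dataclass, field

@dataclass
class SLAPT:
    """One trajectory whose tokens may come from several policy versions.

    Field names are illustrative; real systems use their own schemas.
    """
    prompt_ids: list[int]                                          # s_0: the prompt tokens
    token_ids: list[int] = field(default_factory=list)             # a_0, ..., a_T
    behavior_logprobs: list[float] = field(default_factory=list)   # log mu_i(a_t | s_t), recorded at sampling time
    policy_versions: list[int] = field(default_factory=list)       # which pi_{theta_{n+m}} produced each token

    def segments(self) -> dict[int, int]:
        """Number of tokens contributed by each policy version."""
        counts: dict[int, int] = {}
        for v in self.policy_versions:
            counts[v] = counts.get(v, 0) + 1
        return counts
```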

SLAPTs are a natural result of Partial Rollout, a design choice now popular in LLM RL systems such as Kimi k1.5 (Kimi Team et al. 2025), AReaL (Fu et al. 2025), and PipelineRL (Piché et al. 2025), which dates back to the traditional distributed RL system SEED RL (Espeholt et al. 2020):

The ongoing trajectory rollouts are

  1. aborted when
    • either, in synchronous systems like Kimi k1.5, enough samples have been collected for training, releasing the resources for the training engine to update the model weights,
    • or, in asynchronous systems like AReaL and PipelineRL, a new version of model weights is produced by the trainer;
  2. continued with the latest version of model weights.

Partial Rollout is motivated by its efficiency and simplicity. Let us explain in detail.

The earliest LLM RL systems (Hu et al. 2025) all adopt synchronous architectures, where the trainer always waits for all trajectories to finish before updating the weights with the collected data. However, as the context length of LLMs scales up, the trajectory length distribution becomes increasingly skewed. If the distribution is very skewed but all trajectories are required to finish within the same rollout stage, only a few long-tail requests may remain in the system, causing severe under-utilization (typically <30% in practice).

Kimi k1.5 proposes to fix this issue by aborting all ongoing rollouts once enough training samples have been collected, updating the weights directly with these data instead of waiting for all trajectories to finish, and then continuing the rollouts with the new model weights.

In asynchronous RL systems with an experience buffer, it is troublesome to manage multiple versions of model weights within the rollout engine.

AReaL proposes to always only maintain the latest model weights across all the instances. Once a new version of model weights is produced by the trainer, all the rollouts of the stale policy are aborted and then continued with the latest policy.

This can be implemented simply by always loading the latest weights into all the inference engine instances, avoiding the hassle of managing requests across instances.
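Below is a minimal, single-threaded sketch of this abort-and-continue logic. The `engine` and `weight_store` interfaces (`load_weights`, `decode_step`, `latest`) are hypothetical placeholders for whatever the inference engine and weight broadcaster actually expose; real systems run this asynchronously across many instances.

```python
def partial_rollout(engine, prompts, weight_store, max_new_tokens=1024):
    """Sketch of Partial Rollout: always decode with the latest weights,
    aborting and resuming in-flight requests whenever new weights arrive.
    All interfaces here are hypothetical, not the API of any particular system."""
    version = weight_store.latest()          # object with .version_id and .weights (assumed)
    engine.load_weights(version.weights)
    # Each in-flight request keeps its generated tokens and, per token,
    # the policy version and behavior log-prob (needed later for IS).
    requests = [{"prompt": p, "tokens": [], "versions": [], "logprobs": []} for p in prompts]
    finished = []
    while requests:
        if weight_store.latest().version_id != version.version_id:
            # "Abort": stop decoding, reload the newest weights, and continue
            # the same partial sequences with the new policy.
            version = weight_store.latest()
            engine.load_weights(version.weights)
        still_running = []
        for req in requests:
            token, logprob, done = engine.decode_step(req["prompt"], req["tokens"])
            req["tokens"].append(token)
            req["versions"].append(version.version_id)
            req["logprobs"].append(logprob)
            (finished if done or len(req["tokens"]) >= max_new_tokens else still_running).append(req)
        requests = still_running
    return finished  # each finished request is one SLAPT
```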

Despite Partial Rollout’s efficiency and simplicity, there have been worries about mixing multiple policies within a single trajectory, since most previous works formulate IS with a single consistent behavior policy \(\mu(\cdot \mid \boldsymbol{s})\), while the mixed-policy formulation remains under-explored. We defer a more detailed discussion to Section 2.

1.2 Multiple Consistent Stale Policy Trajectories (MCSPTs)

We use Multiple Consistent Stale Policy Trajectories (MCSPTs) to refer to such batches of trajectories that

  • each of them is sampled with a consistent stale policy \(\pi_{\theta_{n}}\),
  • while there might be different \(n\) for different trajectories.

The consistency of the policy within each trajectory makes the formulation simpler; this form of off-policy data is widely used in traditional distributed RL systems like IMPALA (Espeholt et al. 2018).

In contrast, MCSPT sampling is more difficult to implement efficiently against long-tail bubbles in LLM RL systems, because it requires managing multiple versions of model weights simultaneously within the rollout engine and dynamically transferring requests of various lengths.

Sheng, Tong, et al. (2025) implement an LLM RL system that samples MCSPTs with full utilization of the hardware resources.

2 Importance Sampling with Off-Policy Data in (Fully-Utilized) LLM RL Systems

Importance Sampling (IS) has been widely used in LLM RL systems.

It is useful for estimating the expectation of some function, typically the gradient of the return with respect to the policy parameters, under the target policy distribution. In practice, with a batch of trajectory samples, the IS estimate is often formulated as:

\[ \begin{aligned} \hat{\mathbb{E}}_{\mu}(f(\boldsymbol{\tau})) =& \frac{1}{N} \sum_{i=1}^{N} \frac{p_{\theta}(\boldsymbol{\tau}_i)}{q_{\mu}(\boldsymbol{\tau}_i)}f(\boldsymbol{\tau}_i) \\ =& \frac{1}{N} \sum_{i=1}^{N} \frac{p(\boldsymbol{s}_{0}) \prod_{t=0}^{T-1} \pi_{\theta}(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})p(\boldsymbol{s}_{t+1} \mid \boldsymbol{s}_{t}, \boldsymbol{a}_{t})}{p(\boldsymbol{s}_{0}) \prod_{t=0}^{T-1} \mu(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})p(\boldsymbol{s}_{t+1} \mid \boldsymbol{s}_{t}, \boldsymbol{a}_{t})}f(\boldsymbol{\tau}_i) \\ =& \frac{1}{N} \sum_{i=1}^{N} \prod_{t=0}^{T-1} \frac{\pi_{\theta}(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})}{\mu(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})}f(\boldsymbol{\tau}_i) \end{aligned} \tag{1}\]

where

  • \(N\) is the batch size,
  • \(f(\cdot)\) can be any function of the trajectory sample \(\boldsymbol{\tau}\), of which the most important one is the gradient of the optimization objective \(\mathbb{E}_{\boldsymbol{\tau} \sim p_{\theta}}[J(\cdot)]\) with respect to the policy parameters \(\theta\),
  • \(p_{\theta}(\cdot)\) is the probability density function of the trajectory distribution induced by the target policy \(\pi_{\theta}\) parameterized by \(\theta\) and the environment state transition distribution \(p(\cdot \mid \boldsymbol{s}, \boldsymbol{a})\),
  • \(q_{\mu}(\cdot)\) is the probability density function of the trajectory distribution induced by the behavior policy \(\mu\) and \(p(\cdot \mid \boldsymbol{s}, \boldsymbol{a})\).
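As a minimal numerical sketch of Equation 1, assuming per-token log-probabilities under both \(\pi_{\theta}\) and \(\mu\) are already available (the state-transition terms cancel), the per-trajectory ratio is best accumulated in log space:

```python
import numpy as np

def trajectory_is_weight(target_logprobs, behavior_logprobs):
    """Per-trajectory IS ratio prod_t pi_theta(a_t | s_t) / mu(a_t | s_t),
    accumulated in log space for numerical stability."""
    diff = np.asarray(target_logprobs, dtype=np.float64) - np.asarray(behavior_logprobs, dtype=np.float64)
    return float(np.exp(diff.sum()))

def is_estimate(batch_target_logprobs, batch_behavior_logprobs, f_values):
    """Batch IS estimate of E_{p_theta}[f(tau)] from samples tau_i ~ q_mu (Equation 1)."""
    weights = np.array([
        trajectory_is_weight(t, b)
        for t, b in zip(batch_target_logprobs, batch_behavior_logprobs)
    ])
    return float(np.mean(weights * np.asarray(f_values, dtype=np.float64)))
```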

One important property of Equation 1 is that it is unbiased, i.e.,

\[ \mathbb{E}_{\boldsymbol{\tau} \sim q_{\mu}}[\hat{\mathbb{E}}_{\mu}(f(\boldsymbol{\tau}))] = \mathbb{E}_{\boldsymbol{\tau} \sim p_{\theta}}[f(\boldsymbol{\tau})] \tag{2}\]

Of the components above, the only one that is ambiguous to compute is the behavior policy \(\mu(\cdot \mid \boldsymbol{s})\).

However, computing \(\mu(\cdot \mid \boldsymbol{s})\) correctly for the off-policy data in the fully-utilized LLM RL systems described above is not straightforward.

2.1 Simple but Intractable: Global Behavior Policy \(\mu^{*}(\cdot \mid \boldsymbol{s})\)

Since there is always an actual distribution we are sampling from, we can always formulate the IS estimate with a global behavior policy \(\mu^{*}(\cdot \mid \boldsymbol{s})\).

The most direct idea is to use \(\mu^{*}(\cdot \mid \boldsymbol{s})\) for IS in practice, but this requires calculating probabilities under it.

Proposition 1 of Fu et al. (2025) resorts to constructing a behavior policy \(\mu^{*}(\boldsymbol{a} \mid \boldsymbol{s})\) that satisfies \(\mu^{*}(a_{t} \mid s_{t}) = \pi_{\theta_{n+m}}(a_{t} \mid s_{t})\) for each \((s_{t}, a_{t})\) pair in the trajectory samples. This sounds reasonable for LLM RL since the LLM is auto-regressive and thus never revisits a past state within the same trajectory.

However, it might be confusing when we also notice that the same state \(s\) might appear in two different trajectory samples \(\tau_{i}\) and \(\tau_{j}\), especially for the initial states \(s_{0}\), i.e., the prompts, and the same action \(a\) might be sampled from different policies \(\pi_{\theta_{n+m}}\) and \(\pi_{\theta_{n+l}}\) (\(l \neq m\)), respectively. It is very likely that \(\pi_{\theta_{n+m}}(a \mid s) \neq \pi_{\theta_{n+l}}(a \mid s)\), making it infeasible to construct the same behavior policy for both trajectory samples, i.e., \(\mu^{*}(a \mid s)=\pi_{\theta_{n+m}}(a \mid s)\) contradicts \(\mu^{*}(a \mid s)=\pi_{\theta_{n+l}}(a \mid s)\).

So where is the problem in the construction of \(\mu^{*}(\cdot \mid \boldsymbol{s})\) mentioned above?

The hidden problem is that this construction does not account for the probability distribution over which LLM policy \(\pi_{\theta_{n+m}}\) is used. Furthermore, this distribution is intractable in practice since it depends on the system dynamics.

2.2 IS with MCSPTs

IS with MCSPTs is simpler to formulate since each trajectory is sampled with a consistent stale policy \(\pi_{\theta_{n}}\), and we only need to correctly formulate the policy used by each trajectory.

2.2.1 Mixture Importance Sampling

In some simple cases, the distribution over which policy is used, i.e., the hyper-policy distribution, is known, in which case we can formulate the IS estimate as Mixture Importance Sampling (Owen 2013).
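A minimal sketch of such a mixture IS estimate, assuming the mixture weights \(\alpha_j\) and per-trajectory log-probabilities under every component policy are available (the latter requires extra forward passes in an LLM system):

```python
import numpy as np
from scipy.special import logsumexp

def mixture_is_estimate(target_logprobs, component_logprobs, alphas, f_values):
    """Mixture IS: samples are drawn from q_alpha = sum_j alpha_j q_j and each
    sample is weighted by p(x) / q_alpha(x).

    target_logprobs:    (N,)   log p_theta(tau_i), i.e. summed per-token log-probs
                               (the shared state-transition terms cancel)
    component_logprobs: (N, J) log q_j(tau_i) under every component policy j
    alphas:             (J,)   known mixture weights, summing to 1
    """
    target_logprobs = np.asarray(target_logprobs, dtype=np.float64)
    component_logprobs = np.asarray(component_logprobs, dtype=np.float64)
    log_alphas = np.log(np.asarray(alphas, dtype=np.float64))
    # log q_alpha(tau_i) = logsumexp_j [ log alpha_j + log q_j(tau_i) ]
    log_mixture = logsumexp(component_logprobs + log_alphas[None, :], axis=1)
    weights = np.exp(target_logprobs - log_mixture)
    return float(np.mean(weights * np.asarray(f_values, dtype=np.float64)))
```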

2.2.2 Multiple Importance Sampling

However, in more general cases, we only know, after the fact, the number of trajectories \(n_{j}\) sampled from each policy \(\pi_{\theta_{j}}\), where \(N = \sum_{j} n_{j}\). For such cases, we can formulate the IS estimate as Multiple Importance Sampling (Owen 2013):

Suppose that \(\boldsymbol{X}_{i j} \sim q_j\) for \(i=1, \ldots, n_j\) and \(j=1, \ldots, J\) and that \(\omega_j\) are a partition of unity. The multiple importance sampling estimate is

\[\widetilde{\mu}_\omega=\sum_{j=1}^J \frac{1}{n_j} \sum_{i=1}^{n_j} \omega_j\left(\boldsymbol{X}_{i j}\right) \frac{f\left(\boldsymbol{X}_{i j}\right) p\left(\boldsymbol{X}_{i j}\right)}{q_j\left(\boldsymbol{X}_{i j}\right)} .\]

Now assume that \(q_j(\boldsymbol{x})>0\) whenever \(\omega_j(\boldsymbol{x}) p(\boldsymbol{x}) f(\boldsymbol{x}) \neq 0\). Then multiple importance sampling is unbiased, because \[\mathbb{E}\left(\widetilde{\mu}_\omega\right)=\sum_{j=1}^J \mathbb{E}_{q_j}\left(\omega_j(\boldsymbol{X}) \frac{f(\boldsymbol{X}) p(\boldsymbol{X})}{q_j(\boldsymbol{X})}\right)=\sum_{j=1}^J \int \omega_j(\boldsymbol{x}) f(\boldsymbol{x}) p(\boldsymbol{x}) \mathrm{d} \boldsymbol{x}=\mu .\]

2.2.3 Balance Heuristic

The natural next question is how to choose the partition of unity \(\omega_j\):

Among the proposals for functions \(\omega_j(\boldsymbol{x})\), the most studied one is the balance heuristic with \(\omega_j(\boldsymbol{x}) \propto n_j q_j(\boldsymbol{x})\), that is

\[\omega_j(\boldsymbol{x})=\omega_j^{\mathrm{BH}}(\boldsymbol{x}) \equiv \frac{n_j q_j(\boldsymbol{x})}{\sum_{k=1}^J n_k q_k(\boldsymbol{x})} .\]

By construction \(q_j(\boldsymbol{x})>0\) holds whenever \(\left(\omega_j^{\mathrm{BH}} p f\right)(\boldsymbol{x}) \neq 0\). Let \(n=\sum_{j=1}^J n_j\) and define \(\alpha_j=n_j / n\). Then using the balance heuristic, \(\widetilde{\mu}_{\omega^{\mathrm{BH}}}\) simplifies to \[\widetilde{\mu}_\alpha=\frac{1}{n} \sum_{j=1}^J \sum_{i=1}^{n_j} \frac{f\left(\boldsymbol{X}_{i j}\right) p\left(\boldsymbol{X}_{i j}\right)}{\sum_{k=1}^J \alpha_k q_k\left(\boldsymbol{X}_{i j}\right)} .\]

In other words, multiple importance sampling, with weights from the balance heuristic reduces to the same estimator we would use in mixture importance sampling with mixture weights \(\alpha_j=n_j / n\). Once again, the weight on a given sampled value \(\boldsymbol{X}_{i j}\) does not depend on which mixture component it came from. The balance heuristic is nearly optimal in the following sense:

Theorem 9.8. Let \(n_j \geqslant 1\) be positive integers for \(j=1, \ldots, J\). Let \(\omega_1, \ldots, \omega_J\) be a partition of unity and let \(\omega^{\mathrm{BH}}\) be the balance heuristic. Suppose that \(q_j(\boldsymbol{x})>\) 0 whenever \(\omega_j(\boldsymbol{x}) p(\boldsymbol{x}) f(\boldsymbol{x}) \neq 0\). Then

\[\operatorname{Var}\left(\widetilde{\mu}_{\omega^{\mathrm{BH}}}\right) \leqslant \operatorname{Var}\left(\widetilde{\mu}_\omega\right)+\left(\frac{1}{\min _j n_j}-\frac{1}{\sum_j n_j}\right) \mu^2 .\]

(Owen 2013)

The intuition behind the balance heuristic, \(\omega_j^{\mathrm{BH}} \propto n_{j}q_j(\boldsymbol{x})\), can be understood as: the more samples we have from a policy, the more information we have about it, and thus the more weight it should receive.
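A minimal sketch of the balance-heuristic MIS estimator, under the same assumption that every trajectory's log-probability under every candidate stale policy is available; the direct exponentiation is for illustration only, since real trajectory log-probabilities should be combined in log space as in the mixture IS sketch above:

```python
import numpy as np

def balance_heuristic_mis(target_logprobs, behavior_logprobs, counts, f_values):
    """General MIS estimator with balance-heuristic weights
    omega_j(x) = n_j q_j(x) / sum_k n_k q_k(x).

    Samples are assumed grouped by component: the first n_1 rows were drawn
    from q_1, the next n_2 from q_2, etc.

    target_logprobs:   (N,)   log p_theta(tau_i)
    behavior_logprobs: (N, J) log q_j(tau_i) under each stale policy pi_{theta_j}
    counts:            list of ints n_j, with N = sum_j n_j
    """
    p = np.exp(np.asarray(target_logprobs, dtype=np.float64))    # p(x_i); illustration only,
    q = np.exp(np.asarray(behavior_logprobs, dtype=np.float64))  # real log-probs need log-space handling
    n = np.asarray(counts, dtype=np.float64)
    f = np.asarray(f_values, dtype=np.float64)
    omega = (n[None, :] * q) / (n[None, :] * q).sum(axis=1, keepdims=True)  # omega_j^BH(x_i)

    estimate, start = 0.0, 0
    for j, n_j in enumerate(counts):
        idx = slice(start, start + n_j)
        estimate += (1.0 / n_j) * np.sum(omega[idx, j] * f[idx] * p[idx] / q[idx, j])
        start += n_j
    # Equivalent to mixture IS with alpha_j = n_j / N (Owen 2013):
    # estimate == np.mean(p * f / (q @ (n / n.sum())))
    return float(estimate)
```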

2.3 IS with SLAPTs

IS with SLAPTs is more difficult to formulate since we also need to consider the policy composition within a trajectory. As far as we know, such a discussion is absent from previous works, including Espeholt et al. (2020), which first proposed the usage of SLAPTs.

In this section, we try to discuss the formulation and some properties of IS with SLAPTs that might be useful in practice.

2.3.1 Trajectory-dependent Behavior Policy \(\mu_{i}(\cdot \mid \boldsymbol{s})\)

A general but trivial formulation for the IS estimate with a batch of trajectory samples is to make the behavior policy trajectory-dependent, i.e., to use \(\mu_{i}(\boldsymbol{a} \mid \boldsymbol{s})\) for each trajectory \(\tau_{i}\).

Now for the counter-example mentioned in Section 2.1, \(\mu_i(a \mid s)=\pi_{\theta_{n+m}}(a \mid s)\) and \(\mu_j(a \mid s)=\pi_{\theta_{n+l}}(a \mid s)\) are obviously compatible.

With a batch of \(N\) trajectory samples, each sampled from its own behavior policy, \(\boldsymbol{\tau}_1 \sim q_{\mu_1}, \ldots, \boldsymbol{\tau}_N \sim q_{\mu_N}\), we can only use the special case of the batch estimate Equation 1 with \(N=1\), i.e., the single-sample estimate

\[ \hat{\mathbb{E}}_{\mu_i}(f(\boldsymbol{\tau}_i)) = \prod_{t=0}^{T-1} \frac{\pi_{\theta}(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})}{\mu_i(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})}f(\boldsymbol{\tau}_i) \tag{3}\]

Note that the single-sample estimate is also unbiased, i.e., the unbiasedness of Equation 1 does not depend on the sample size \(N\).

The common practice for combining them into a batch estimate is to average the single-sample estimates in Equation 3, i.e.,

\[ \hat{\mathbb{E}}_{\text{avg}} = \frac{1}{N} \sum_{i=1}^{N} \hat{\mathbb{E}}_{\mu_i} = \frac{1}{N} \sum_{i=1}^{N} \prod_{t=0}^{T-1} \frac{\pi_{\theta}(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})}{\mu_i(\boldsymbol{a}_{t} \mid \boldsymbol{s}_{t})}f(\boldsymbol{\tau}_i) \tag{4}\]

By the linearity of expectation, the unbiasedness of the average estimate \(\hat{\mathbb{E}}_{\text{avg}}\) follows immediately.
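A minimal sketch of Equations 3 and 4 for SLAPTs, under the convention that each token's behavior log-probability is the one recorded at sampling time from whichever policy version emitted it; the dictionary keys and the `target_logprob_fn` helper are illustrative assumptions:

```python
import numpy as np

def slapt_average_estimate(batch, target_logprob_fn):
    """Average of single-sample IS estimates (Equation 4) for SLAPTs.

    Each element of `batch` is assumed to be a dict with
      - "behavior_logprobs": log mu_i(a_t | s_t), i.e. the log-prob recorded at
        sampling time from whichever policy version emitted a_t,
      - "f": the value f(tau_i), e.g. a per-trajectory scalar.
    `target_logprob_fn(traj)` recomputes log pi_theta(a_t | s_t) under the
    current target policy; both names are hypothetical.
    """
    estimates = []
    for traj in batch:
        target = np.asarray(target_logprob_fn(traj), dtype=np.float64)
        behavior = np.asarray(traj["behavior_logprobs"], dtype=np.float64)
        ratio = np.exp(np.sum(target - behavior))   # prod_t pi_theta / mu_i  (Equation 3)
        estimates.append(ratio * traj["f"])
    return float(np.mean(estimates))                # Equation 4
```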

Now let us look at the variance. Let the variance of each single-sample estimate be \(\sigma^2_{\mu_i}\). Since the samples are independent, the variance of \(\hat{\mathbb{E}}_{\text{avg}}\) is:

\[ \sigma^2_{\text{avg}} = \frac{1}{N^2} \sum_{i=1}^{N} \sigma^2_{\mu_i} \tag{5}\]

Since the batch variance is determined by the variances of the single-sample estimates, we first analyze the variance of a single-sample estimate.

2.3.2 Variance of Single-Sample Estimate – MCSPT vs. SLAPT

MCSPT and SLAPT simply form two different cases of the single-sample estimate. We might wonder, given a trajectory \(\boldsymbol{\tau}_{i}\), which of

  1. the consistent old policy \(\pi_{\theta_{n}}\) and
  2. the sliding latest policy \(\mu_{\theta_{n},M}\) using \(\pi_{\theta_{n}},\ldots,\pi_{\theta_{n+M}}\) successively

is better for the unbiased single-sample IS estimate Equation 3, i.e., leads to a lower variance?

Metelli et al. (2020) provide a family of bounds on the variance of the IS estimate in terms of the Rényi divergence:

Lemma 1. Let \(P\) and \(Q\) be two probability measures on the measurable space \((\mathcal{X}, \mathscr{F})\) such that \(P \ll Q\). Let \(\alpha \in[1,+\infty], \mathbf{x}=\left(x_1, x_2, \ldots, x_N\right)^T\) be i.i.d. random variables sampled from \(Q\) and \(f: \mathcal{X} \rightarrow \mathbb{R}\) be a function with bounded \(\frac{2 \alpha}{\alpha-1}\)-moment under \(Q\left(\|f\|_{Q, \frac{2 \alpha}{\alpha-1}}<+\infty\right)\). Then, for any \(N>0\), the variance of the IS estimator \(\widehat{\mu}_{P / Q}\) can be upper bounded as:

\[\operatorname{Var}_{\mathbf{x} \sim Q}\left[\hat{\mu}_{P / Q}\right] \leqslant \frac{1}{N}\|f\|_{Q, \frac{2 \alpha}{\alpha-1}}^2 d_{2 \alpha}(P \| Q)^{2-\frac{1}{\alpha}},\]

where we used the abbreviation \(\mathbf{x} \sim Q\) for denoting \(x_i \sim Q\) for all \(i=1,2, \ldots, N\) all independent.

This result generalizes Lemma 4.1 of Metelli et al. (2018), that can be recovered by setting \(\alpha=1\) under the condition that \(\|f\|_{\infty}<+\infty\) :

\[\operatorname{Var}_{\mathbf{x} \sim Q}\left[\widehat{\mu}_{P / Q}\right] \leqslant \frac{1}{N}\|f\|_{\infty}^2 d_2(P \| Q) .\]

Recall that the Rényi divergence of order 1 recovers the Kullback-Leibler divergence widely used in RL analysis (while the bound above at \(\alpha=1\) involves the order-2 divergence \(d_2\)).

When \(P=Q\) almost everywhere, we get \(\operatorname{Var}_{\mathbf{x} \sim Q}\left[\hat{\mu}_{Q / Q}\right] \leqslant \frac{1}{N}\|f\|_{\infty}^2\), a well-known upper bound to the variance of a Monte Carlo estimator. Recalling the definition of ESS (Equation 7 of Metelli et al. 2020), we can rewrite the previous bound as:

\[\underset{\mathbf{x} \sim Q}{\operatorname{Var}}\left[\hat{\mu}_{P / Q}\right] \leqslant \frac{\|f\|_{\infty}^2}{\operatorname{ESS}(P \| Q)} .\]

Thus, the variance bound scales as \(1/\operatorname{ESS}\) instead of \(1/N\), justifying the definition of the ESS.

In our context, \(P\) is the target trajectory distribution \(p_{\theta}\), and \(Q\) is the trajectory distribution induced by \(\mu_{\theta_{n},M}\) or by \(\pi_{\theta_{n}}\), i.e., \(q_{\mu_{\theta_{n},M}}\) or \(p_{\theta_{n}}\).

It is possible that \(d_{2 \alpha}(p_{\theta} \| q_{\mu_{\theta_{n},M}}) < d_{2 \alpha}(p_{\theta} \| p_{\theta_{n}})\), since the newer the policy, the more similar its induced distribution is to \(p_{\theta}\). Then, as long as \(\|f\|_{q_{\mu_{\theta_{n},M}}, \frac{2 \alpha}{\alpha-1}}\) is not much larger than \(\|f\|_{p_{\theta_{n}}, \frac{2 \alpha}{\alpha-1}}\), the estimate using \(\mu_{\theta_{n},M}\) has a better guarantee than the estimate using \(\pi_{\theta_{n}}\).

In practice, the effectiveness of the IS estimate is usually assessed with empirical diagnostic metrics like the Effective Sample Size (ESS). For example, Piché et al. (2025) measure the ESS of the different sampling policies they use.
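For reference, a minimal sketch of the standard ESS diagnostic computed from importance weights; since the ESS ratio is scale-invariant, the weights can be exponentiated after subtracting the maximum log-ratio:

```python
import numpy as np

def effective_sample_size(weights):
    """Standard ESS diagnostic from importance weights w_i = p(x_i) / q(x_i):
    ESS = (sum_i w_i)^2 / sum_i w_i^2, which lies in [1, N]."""
    w = np.asarray(weights, dtype=np.float64)
    return float(w.sum() ** 2 / np.sum(w ** 2))

# Example with per-trajectory log ratios (log p - log q); the rescaling by the
# maximum cancels in the ESS ratio.
log_ratios = np.array([-0.2, 0.1, -1.5, 0.3, 0.0])
weights = np.exp(log_ratios - log_ratios.max())
print(effective_sample_size(weights))  # about 4.25 effective samples out of N = 5
```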

2.3.3 Minimizing Variance by Optimizing the Weighting – Exploiting the Practical Problem Structure

Beyond comparing individual single-sample estimates, we can also consider their optimal weighting. In Equation 4, we use the simplest uniform weight \(\frac{1}{N}\) for each single-sample estimate, but we can actually use any other weight functions \(\omega_i(\boldsymbol{x})\) to combine them, as long as they form a partition of unity, i.e., a collection of \(J \geqslant 1\) weight functions \(\omega_j(\boldsymbol{x}) \geqslant 0\) that satisfy \(\sum_{j=1}^J \omega_j(\boldsymbol{x})=1\) for all \(\boldsymbol{x}\). Different partitions of unity lead to estimates that are all unbiased but have different variances.

The optimal weighting remains an open problem that is beyond our scope. However, there are still some properties of the SLAPT structure that we can exploit. For example, in practical systems, SLAPTs of similar lengths often conform to (approximately) identical distributions, which becomes exact if we limit the timesteps at which a trajectory can be aborted; we can then also apply the Multiple IS formulation to SLAPTs as in Section 2.2 (see the sketch below). This also supports the comparability of SLAPT and MCSPT.
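As a hedged illustration of how this structure could be exploited, the sketch below buckets SLAPTs by length and treats each bucket as one MIS component; the bucket width and dictionary keys are arbitrary illustrative choices, and applying the balance heuristic across buckets would still require evaluating each bucket's behavior distribution on trajectories from the other buckets.

```python
from collections import defaultdict

def bucket_slapts_by_length(batch, bucket_width=512):
    """Group SLAPTs whose generated lengths fall into the same bucket, under the
    approximation that similar-length SLAPTs share a behavior distribution;
    `bucket_width` and the dict keys are illustrative choices."""
    buckets = defaultdict(list)
    for traj in batch:
        buckets[len(traj["behavior_logprobs"]) // bucket_width].append(traj)
    return list(buckets.values())

# Each bucket can then be treated as one MIS component q_j with n_j = len(bucket)
# and combined with the balance-heuristic estimator sketched in Section 2.2.3,
# which additionally requires evaluating each bucket's behavior log-probabilities
# on trajectories from the other buckets.
```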

2.4 Discussions

2.4.1 Off-policyness Allocation

With similar utilization of the same hardware resources, the total number of actions (tokens) that can be sampled with each version of the (LLM) policy should be roughly the same. Such a constraint might be helpful for analyzing how to allocate the “off-policyness” optimally for the IS estimate.

Specifically, SLAPTs allocate the actions from different policies to different segments within each trajectory, while MCSPTs allocate the actions from different policies to different whole trajectories.

Acknowledgments

Thanks to Wang Zhang for helpful discussions.

Citation

BibTeX:

@article{tong2026isdr,
  author = {Yuxuan Tong and Yingru Li and Guangming Sheng},
  title = {{Importance Sampling Done Right with Off-Policy Data in Fully-Utilized LLM RL Systems}},
  journal = {Blog},
  date = {2026-01-07},
  url = {https://tongyx361.github.io/posts/isdr},
  language = {English},
}

References

Espeholt, Lasse, Raphaël Marinier, Piotr Stanczyk, Ke Wang, and Marcin Michalski. 2020. “SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference.” In International Conference on Learning Representations. https://openreview.net/forum?id=rkgvXlrKwH.
Espeholt, Lasse, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, et al. 2018. “IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures.” In Proceedings of the 35th International Conference on Machine Learning, edited by Jennifer Dy and Andreas Krause, 80:1407–16. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v80/espeholt18a.html.
Fu, Wei, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, et al. 2025. “AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning.” In The Thirty-Ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=X9diEuva9R.
Hu, Jian, Xibin Wu, Wei Shen, Jason Klein Liu, Weixun Wang, Songlin Jiang, Haoran Wang, et al. 2025. “OpenRLHF: A Ray-Based Easy-to-Use, Scalable and High-Performance RLHF Framework.” In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, edited by Ivan Habernal, Peter Schulam, and Jörg Tiedemann, 656–66. Suzhou, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.emnlp-demos.48.
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, et al. 2025. “Kimi K1.5: Scaling Reinforcement Learning with LLMs.” https://arxiv.org/abs/2501.12599.
Metelli, Alberto Maria, Matteo Papini, Nico Montali, and Marcello Restelli. 2020. “Importance Sampling Techniques for Policy Optimization.” Journal of Machine Learning Research 21 (141): 1–75. http://jmlr.org/papers/v21/20-124.html.
Owen, Art B. 2013. Monte Carlo Theory, Methods and Examples. https://artowen.su.domains/mc/.
Piché, Alexandre, Ehsan Kamalloo, Rafael Pardinas, Xiaoyin Chen, and Dzmitry Bahdanau. 2025. “PipelineRL: Faster on-Policy Reinforcement Learning for Long Sequence Generation.” https://arxiv.org/abs/2509.19128.
Shen, Gerald, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, et al. 2024. “NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment.” In First Conference on Language Modeling. https://openreview.net/forum?id=yK2eGE8QVW.
Sheng, Guangming, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, et al. 2025. “Laminar: A Scalable Asynchronous RL Post-Training Framework.” https://arxiv.org/abs/2510.12633.
Sheng, Guangming, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. “HybridFlow: A Flexible and Efficient RLHF Framework.” In Proceedings of the Twentieth European Conference on Computer Systems, 1279–97. EuroSys ’25. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3689031.3696075.
Yao, Zhewei, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, et al. 2023. “DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-Like Models at All Scales.” https://arxiv.org/abs/2308.01320.
