Modern Large Language Model (LLM) Reinforcement Learning (RL) systems, such as Kimi k1.5 and AReaL, increasingly adopt Partial Rollout strategies to mitigate long-tail latency and simplify weight management. These systems generate Sliding Latest Policy Trajectories (SLAPTs) – sequences whose sampling policy may be updated mid-rollout – which challenges the standard setup of Importance Sampling (IS), where each trajectory is assumed to come from a single, fixed behavior policy.
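To make the mixed-policy structure concrete, one way to formalize a SLAPT (our notation, not taken from the systems above) is as a sample from a piecewise behavior policy: each token is drawn from whichever policy version was live at that step. Writing $k(t)$ for the index of the policy version in effect when token $a_t$ was sampled, and $s_t$ for the prompt plus previously generated tokens, the trajectory-level behavior density and the resulting importance weight are

$$
\mu(\tau) \;=\; \prod_{t=0}^{T-1} \pi_{\theta_{k(t)}}(a_t \mid s_t),
\qquad
w(\tau) \;=\; \frac{\pi_\theta(\tau)}{\mu(\tau)} \;=\; \prod_{t=0}^{T-1} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{k(t)}}(a_t \mid s_t)},
$$

where any environment or transition terms would cancel in the ratio. This is the sense in which a well-defined behavior policy can still be attached to each trajectory, even though no single policy generated it end to end.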
This post provides a justification for applying Importance Sampling to SLAPTs. We show that by constructing trajectory-wise behavior policies, we can retain unbiased estimates despite the mixed-policy nature of the rollouts. Furthermore, leveraging variance bounds based on Rényi divergence, we argue that SLAPTs – being closer to the target policy than consistent but stale behavior policies – are likely to yield lower variance. Finally, we propose that Multiple Importance Sampling (MulIS) with the Balance Heuristic (BH) can be used to further exploit the distributional structure of these trajectories for better policy optimization.
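To give a flavor of the MulIS estimator discussed later, below is a minimal sketch of the Balance Heuristic applied to a pool of trajectories produced by several behavior distributions (e.g., successive policy versions). The function and argument names, and the use of trajectory-level log-probabilities, are illustrative assumptions, not an interface from Kimi k1.5 or AReaL.

```python
import numpy as np
from scipy.special import logsumexp


def balance_heuristic_value_estimate(logp_target, logp_behaviors, counts, returns):
    """Estimate E_pi[R] via Multiple Importance Sampling with the Balance Heuristic.

    logp_target:    (N,) log pi(tau_i), trajectory log-prob under the target policy.
    logp_behaviors: (N, K) log mu_k(tau_i), trajectory log-prob under each of the
                    K candidate behavior distributions that contributed rollouts.
    counts:         (K,) number of trajectories sampled from each behavior distribution.
    returns:        (N,) observed return R(tau_i) of each trajectory.
    """
    counts = np.asarray(counts, dtype=float)
    # Balance-heuristic denominator: log( sum_k n_k * mu_k(tau_i) ), computed stably.
    log_mix = logsumexp(logp_behaviors + np.log(counts), axis=1)
    # BH estimate: sum_i pi(tau_i) * R(tau_i) / sum_k n_k * mu_k(tau_i),
    # summed over all trajectories regardless of which policy produced them.
    weights = np.exp(logp_target - log_mix)
    return np.sum(weights * returns)
```

A convenient property of the Balance Heuristic, visible in this sketch, is that the estimator never needs to know which behavior distribution actually produced a given trajectory; it only needs every candidate density evaluated at that trajectory, which is feasible when the relevant policy checkpoints (or their cached token log-probabilities) are retained.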