Large Language Model (LLM) Reinforcement Learning (RL) systems today typically achieve high rollout utilization by introducing off-policy data, whose distribution mismatch is usually corrected via Importance Sampling (IS).
Three dominant rollout strategies have emerged: 1) Partial Rollout (PR), which, once a newer policy is available, aborts and resumes the trajectory with the latest policy, producing trajectories with mixed policy composition, at the cost of (repeated) re-prefill; 2) Partial Rollout with Stale KV (PR-SKV), which is similar to PR but keeps the stale KV cache to avoid re-prefill, at the cost of introducing distribution shift; 3) Consistent Rollout (CR), which completes each trajectory with the single stale policy it started from, avoiding both re-prefill and distribution shift, at the cost of greater off-policyness.
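As a toy illustration of the three strategies, the sketch below (all names and step counts are hypothetical, not from any real system) records which policy version generates each token of a trajectory while the trainer publishes a new policy at a fixed interval:

```python
# Toy sketch of the three rollout strategies (names/parameters hypothetical).
# A trajectory of `length` tokens is generated while the trainer publishes a
# new policy version every `update_every` tokens; we record which policy
# version produced each token.

def rollout_composition(strategy, length=12, update_every=5):
    tokens = []       # policy version that generated each token
    gen_version = 0   # version currently used for generation
    for t in range(length):
        latest = t // update_every  # latest published policy version
        if strategy in ("PR", "PR-SKV") and latest > gen_version:
            # PR: abort, then re-prefill the prefix with the latest policy
            # (attention cost ~ O(t^2) FLOPs per re-prefill).
            # PR-SKV: switch policy but reuse the stale KV cache instead,
            # trading re-prefill cost for distribution shift.
            gen_version = latest
        # CR: keep the policy the trajectory started with (stays stale).
        tokens.append(gen_version)
    return tokens

print(rollout_composition("PR"))  # mixed versions: [0, ..., 1, ..., 2, 2]
print(rollout_composition("CR"))  # a single stale version: [0, 0, ..., 0]
```

Note that under PR(-SKV) the number of policy switches, and hence re-prefills, grows with trajectory length, which is the linear factor in the cubic-cost argument below.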
This post compares the three strategies from both algorithmic and system perspectives, drawing the following main conclusions: 1) The re-prefill overhead in PR scales cubically with the context length (the dot-product attention complexity scales quadratically with it, and the re-prefill frequency scales linearly), making it prohibitive for long-context tasks. 2) When used as IS data, both PR(-SKV) and CR can produce unbiased estimates, while their variance comparison is nuanced: PR(-SKV) can be less off-policy because it always switches to the latest policy once available, but CR fits naturally into the Multiple Importance Sampling (MIS) formulation and can exploit techniques such as the balance heuristic to reduce variance.
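To make the MIS point concrete, here is a minimal sketch of the balance heuristic. The distributions (a 3-token "vocabulary", two stale behavior policies, one latest target policy) are made up for illustration. Each sample is weighted by the target density over the sample-count-weighted mixture of the behavior policies, which yields an unbiased estimate of the expectation under the latest policy:

```python
import random

# Minimal Multiple Importance Sampling sketch with the balance heuristic
# (illustrative only; all distributions below are made up).
random.seed(0)

vocab = [0, 1, 2]
p  = [0.2, 0.3, 0.5]     # latest (target) policy
q1 = [0.5, 0.3, 0.2]     # stale policy 1, generated n1 samples
q2 = [0.3, 0.4, 0.3]     # stale policy 2, generated n2 samples
f  = lambda x: float(x)  # quantity whose expectation under p we want

def sample(dist):
    return random.choices(vocab, weights=dist)[0]

def mis_estimate(n1=20000, n2=20000):
    total, n = 0.0, n1 + n2
    for dist, count in ((q1, n1), (q2, n2)):
        for _ in range(count):
            x = sample(dist)
            # Balance heuristic: each sample's effective proposal is the
            # count-weighted mixture n1*q1 + n2*q2, regardless of which
            # stale policy actually produced it.
            mixture = n1 * q1[x] + n2 * q2[x]
            total += f(x) * p[x] * n / mixture
    return total / n

true_value = sum(pi * f(x) for x, pi in zip(vocab, p))  # 1.3 for these numbers
print(true_value, mis_estimate())  # estimate should land close to 1.3
```

The design point the sketch demonstrates: because every CR trajectory is generated end-to-end by one known stale policy, the samples slot directly into the roles of the q_i proposals above, which is what makes the balance heuristic applicable.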
Experiments are still in progress.