ByteDance Seed & Tsinghua University
2025/04/26
Reinforcement Learning (RL) for LLM Post-Training can typically be modeled as a dataflow graph, consisting of multiple models (actor, critic, reference, reward), multiple computation workloads (generation, inference, training), and the data dependencies between them.
In practice, we need to implement this dataflow graph as an execution pattern on a GPU cluster.
From 0.2.0.post2 until now (after 0.3.0.post1), we have achieved a speedup of ~1.4x on the DAPO (w/o dynamic sampling) workload.
verl introduces a hybrid-controller paradigm, consisting of:
- a single controller (e.g., RayPPOTrainer) that concentrates the training control logic in a single process
- multiple workers (e.g., ActorRolloutWorker) that conduct the distributed computation in a complex but efficient way

Thanks to the single-controller programming model, verl allows implementing different RL algorithms by modifying only a few lines, usually only in the fit function.
# PPO: uses a critic (value model) and GAE advantages.
for prompts in dataloader:
    # Stage 1: Sampling Trajectories
    batch = actor.generate_sequences(prompts)
    # Stage 2: Preparing Experiences
    batch = reward.compute_reward(batch)
    batch = reference.compute_log_prob(batch)
    batch = critic.compute_values(batch)
    batch = compute_advantage(batch, "gae")
    # Stage 3: Training
    critic.update_critic(batch)
    actor.update_actor(batch)
# GRPO: compared with PPO, no critic (value model) is needed.
for prompts in dataloader:
    # Stage 1: Sampling Trajectories
    batch = actor.generate_sequences(prompts)
    # Stage 2: Preparing Experiences
    batch = reward.compute_reward(batch)
    batch = reference.compute_log_prob(batch)
    batch = compute_advantage(batch, "grpo")
    # Stage 3: Training
    actor.update_actor(batch)
The optimal execution patterns for different workloads, e.g., training and generation, are usually different.
Instead of splitting the devices and deploying separate engines for different workloads, which causes many bubbles,
verl implements a hybrid engine that can switch between the different procedures on the same cluster, fully utilizing all the GPUs.
Thanks to the hybrid engine, verl allows flexibly switching between different parallelism strategies to achieve optimal performance.
Generation:
Training & Inference:
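To make this concrete, below is a minimal conceptual sketch of the hybrid-engine idea. It assumes hypothetical train_engine / rollout_engine objects with sync_weights / generate / release_memory / update methods; it is not verl's real implementation, which reshards weights between the training and rollout backends on the same devices.

# Conceptual sketch only (not verl's actual classes or API): the hybrid engine
# keeps the training backend and the rollout backend on the SAME devices and
# switches between them, instead of dedicating separate GPU groups to each.
class HybridEngineSketch:
    def __init__(self, train_engine, rollout_engine):
        self.train_engine = train_engine      # owns and updates the actor weights
        self.rollout_engine = rollout_engine  # reuses the same GPUs for generation

    def generate_sequences(self, prompts):
        # Enter the generation phase: push the latest actor weights into the
        # rollout engine's (possibly different) parallel layout.
        self.rollout_engine.sync_weights(self.train_engine.state_dict())
        outputs = self.rollout_engine.generate(prompts)
        # Release rollout-side memory (e.g., KV cache) before training resumes.
        self.rollout_engine.release_memory()
        return outputs

    def update_actor(self, batch):
        # Back to the training phase on the same devices.
        return self.train_engine.update(batch)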
Data Parallelism (DP) like FSDP is the most commonly used parallelism strategy.
However, DP performance can be degraded by load imbalance, which is especially severe in long-context training.
verl implements the following features to improve load balance:
- balance_batch: makes the number of tokens dispatched to each DP rank as balanced as possible by reordering the samples within each batch (see the sketch after this list). However, with gradient accumulation, balancing only the total token count per rank in a batch is not enough, since DP synchronizes at the granularity of micro batches. So here comes the second feature:
- use_dynamic_bsz: divides the batch into micro batches such that the token counts of the micro batches are as balanced as possible.

Further efficiency features include:
- Sequence packing (use_remove_padding): verl can save computation by removing padding tokens, based on Flash Attention 2.
- Gradient checkpointing (enable_gradient_checkpointing)
- torch.compile (use_torch_compile)
- Liger Kernel (use_liger)
- LoRA (lora_rank, etc.)
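As a hedged illustration of the load-balancing idea behind balance_batch (not verl's actual implementation, which also keeps the per-rank sample count fixed), a greedy longest-first assignment of samples to DP ranks by token count could look like:

# Minimal sketch: greedily assign each sample to the DP rank that currently
# holds the fewest tokens, so per-rank token counts end up roughly equal.
def balance_by_tokens(sample_lengths, num_ranks):
    """Return a list of sample indices per rank, roughly balanced by token count."""
    ranks = [[] for _ in range(num_ranks)]
    rank_tokens = [0] * num_ranks
    # Longest-first greedy assignment keeps the final imbalance small.
    for idx in sorted(range(len(sample_lengths)), key=lambda i: -sample_lengths[i]):
        target = min(range(num_ranks), key=lambda r: rank_tokens[r])
        ranks[target].append(idx)
        rank_tokens[target] += sample_lengths[idx]
    return ranks

# Example: 8 samples with very different lengths, 2 DP ranks.
print(balance_by_tokens([4096, 128, 2048, 256, 1024, 512, 64, 32], num_ranks=2))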
A canonical RL dataset in verl has the following fields:
- prompt: a list of messages {"role": "...", "content": "..."}
- data_source: used to choose the reward function
- reward_model: a dict containing "ground_truth" and "style" (like "model" or "rule")
- extra_info: a dict containing extra information

For VLM RL, verl expects the fields "images" and/or "videos".
For examples, please check the examples/data_preprocess directory.
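For concreteness, a single row produced by such a preprocessing script might look like the following sketch (all values are illustrative; only the field names follow the list above):

# Illustrative example of one row in a canonical verl RL dataset.
example_row = {
    "prompt": [
        {"role": "user", "content": "What is 2 + 3? Put the answer in \\boxed{}."},
    ],
    "data_source": "openai/gsm8k",   # used to select the reward function
    "reward_model": {
        "style": "rule",             # rule-based verification
        "ground_truth": "5",
    },
    "extra_info": {"split": "train", "index": 0},
}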
You could also customize the field names via config. Please check the data section in config files like ppo_trainer.yaml for more details.
For further customization, verl provides the data.custom_cls config, which points to a .py file and the name of a custom dataset class defined in it.
The custom dataset class defined in the .py file is required to accept the following initialization parameters:
verl allows defining a custom reward function via the custom_reward_function config:
The custom reward function defined in the .py file is required to accept the parameters passed from the reward manager's __call__ method. For example, the NaiveRewardManager is defined as follows:
To implement more complex features, you might also want to directly add a new reward manager like PRIMERewardManager or DAPORewardManager.
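For illustration only, a simple rule-based custom reward function might look like the sketch below; the argument names (data_source, solution_str, ground_truth, extra_info) are an assumption based on verl's bundled reward examples and may differ for other reward managers.

# Illustrative sketch: the signature below is an assumption, not a verl API
# guarantee; adapt it to whatever your reward manager's __call__ passes through.
def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return 1.0 if the model's final \\boxed{...} answer matches the ground truth."""
    answer = solution_str.rsplit("\\boxed{", 1)[-1].split("}", 1)[0].strip()
    return 1.0 if answer == str(ground_truth).strip() else 0.0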
To modify the loss function, the most convenient way is to directly modify the loss computation before the .backward() call, e.g., compute_policy_loss, entropy_loss, etc.
For example, the DataParallelPPOActor.update_policy method defines the loss function as follows:
class DataParallelPPOActor(BasePPOActor):
    def update_policy(self, data: DataProto):
        # ... (micro-batching and the forward pass are elided)
        pg_loss = compute_policy_loss(
            old_log_prob=old_log_prob, log_prob=log_prob,
            advantages=advantages,  # ...
        )
        entropy_loss = agg_loss(loss_mat=entropy)
        policy_loss = pg_loss - entropy_loss * entropy_coeff
        kld = kl_penalty(
            logprob=log_prob, ref_logprob=ref_log_prob,  # ...
        )
        kl_loss = agg_loss(loss_mat=kld)
        policy_loss = policy_loss + kl_loss * self.config.kl_loss_coef
        # `loss` is `policy_loss` scaled for gradient accumulation (elided)
        loss.backward()
As mentioned above, the main training logic is concentrated in the fit function of trainer classes like RayPPOTrainer.
For example, the RayDAPOTrainer class overrides the fit function to implement the “dynamic sampling” feature:
(See the next slide for the code ➡️)
class RayDAPOTrainer(RayPPOTrainer):
    def fit(self):
        # (`num_gen_batches`, `gen_batch`, and `num_prompt_in_batch` are set in code elided here)
        for epoch in range(self.config.trainer.total_epochs):
            batch = None
            for batch_dict in self.train_dataloader:
                new_batch = DataProto.from_single_dict(batch_dict)
                num_gen_batches += 1
                gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
                new_batch = new_batch.union(gen_batch_output)
                if not self.config.algorithm.filter_groups.enable:
                    batch = new_batch
                else:
                    # Getting `kept_traj_idxs` ...
                    new_batch = new_batch[kept_traj_idxs]
                    batch = new_batch if batch is None else DataProto.concat([batch, new_batch])
                    prompt_bsz = self.config.data.train_batch_size
                    if num_prompt_in_batch < prompt_bsz:
                        max_num_gen_batches = self.config.algorithm.filter_groups.max_num_gen_batches
                        if max_num_gen_batches <= 0 or num_gen_batches < max_num_gen_batches:
                            # Not enough prompts kept yet: keep generating more batches.
                            continue
                    else:
                        traj_bsz = self.config.data.train_batch_size * self.config.actor_rollout_ref.rollout.n
                        batch = batch[:traj_bsz]
                # ... (experience preparation and training on `batch`)
verl is approaching full support for efficient RL training of huge MoE models like DeepSeek-V3-671B, based on features including:
- the GPTModel class (from Megatron) for actor and critic

For more details, please check our PR #708.
The awesome SGLang RL team has integrated tool calling via OpenAIFunctionTool with end-to-end training. For more details, please check their PR #1037.
Besides, our team has also integrated an async engine based on vLLM V1 AsyncLLM. Kudos to Xibin Wu for his great work!
For the most timely updates of important features, please keep an eye on verl’s README.
For related resources, please scan the QR code:
verl: Flexible and Efficient RL for LLMs