update readme

This commit is contained in:
YeAnbang
2025-11-07 19:19:54 +08:00
parent 40b6a914f3
commit 535eba85e2
2 changed files with 74 additions and 0 deletions

View File

@@ -14,6 +14,7 @@ This repository implements a distributed Reinforcement Learning (RL) training fr
* **Rollout and Policy Decoupling**: Efficient generation and consumption of data through parallel inferencer-trainer architecture.
* **Evaluation Integration**: Easily plug in task-specific eval datasets.
* **Checkpoints and Logging**: Configurable intervals and directories.
* **[New]**: Zero Bubble training framework that supports GRPO and DAPO. [(read more)](./zero_bubble/README.md)
---

View File

@@ -0,0 +1,73 @@
# Zero Bubble Distributed RL Framework for Language Model Fine-Tuning
This folder contains code for the Zero Bubble distributed RL framework. It currently supports **GRPO** and **DAPO**. See the [main README](../README.md) for general installation instructions and usage.
**Note:** This project is under active development — expect changes.
## 🛠 Installation
1. Follow the general installation guide in the [main README](../README.md).
2. Install [pygloo](https://github.com/ray-project/pygloo). Build pygloo for Ray from source following the instructions in its repository README.
## Design idea
We aim to reduce the *“bubble”* — the idle time that occurs between rollouts and training steps (illustrated in Fig. 1).
<div align="center">
<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/all_sync.png" width=700/>
</p>
</div>
**Fig. 1** - In an all-sync online RL framework, rollout workers wait for the trainer to finish training and synchronize weights, and the trainer waits for rollouts. This causes large GPU idle time.
<div align="center">
<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/zero_bubble.png" width=700/>
</p>
</div>
**Fig. 2** - Our Zero Bubble pipeline follows a producerconsumer pattern:
* A global **data buffer** temporarily stores rollouts produced by inference workers.
* A **weights distributor** buffers updated model weights and distributes them to inference workers.
* When the data buffer has enough data, the trainer continuously consumes from it and pushes updated weights to the weights distributor.
* After finishing a mini-batch, each inference worker checks the weights distributor and synchronizes to a newer weight version if available.
Under ideal conditions (inference workers produce data at the same rate the trainer consumes it), the pipeline eliminates idle time. We call it *zero bubble* because, with an unlimited data buffer, inference and training can run indefinitely without waiting. In practice, to avoid wasted compute and stale/off-policy data, we set a bounded buffer size so inference workers will briefly wait when the buffer is full.
## Usage
In addition to the general parameters (see the main README), the Zero Bubble pipeline introduces one additional parameter:
* **`data_actor_buffer_size_limit`** - Maximum number of rollout batches the data buffer may hold. Defaults to **twice** the trainers mini-batch size. Avoid setting this too large — a very large buffer increases off-policy training. For DAPO, since only effective prompts count, you may need to raise `data_actor_buffer_size_limit` depending on sample utility.
Example: RL training on 8 GPUs with Zero Bubble (zero2)
```bash
python rl_example_zero_bubble.py \
--dataset /path/to/your/dataset.jsonl \
--model /path/to/your/model \
-t 4 -i 4 -b vllm -a DAPO \
-imbs 8 -ibs 8 -tbs 8 -e 2 -rt boxed \
-si 25 -s "Please reason step by step, and put your final answer within \\boxed{}." \
-tMbs 2 -tmbs 2 -p Rebase_Experiments -zero 2 -mpt 512 -mnt 3584
```
## Performance
<div align="center">
<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/zero_bubble_gpu_util.png" width=700/>
</p>
</div>
**Fig. 3** - Performance of the Zero Bubble pipeline tested with an unlimited buffer size.
---
If you'd like, I can:
* Produce a short "What changed" summary for the repo (listing grammar/clarity edits).
* Create a compact one-paragraph summary for the project page.
* Convert this into a prettier doc with badges, table of contents, or a changelog. Which would you prefer?