diff --git a/applications/ColossalChat/coati/distributed/README.md b/applications/ColossalChat/coati/distributed/README.md
index 21647a8cc..4f3fe94f6 100644
--- a/applications/ColossalChat/coati/distributed/README.md
+++ b/applications/ColossalChat/coati/distributed/README.md
@@ -14,6 +14,7 @@ This repository implements a distributed Reinforcement Learning (RL) training fr
 * **Rollout and Policy Decoupling**: Efficient generation and consumption of data through parallel inferencer-trainer architecture.
 * **Evaluation Integration**: Easily plug in task-specific eval datasets.
 * **Checkpoints and Logging**: Configurable intervals and directories.
+* **[New]**: Zero Bubble training framework that supports GRPO and DAPO. [(read more)](./zero_bubble/README.md)
 
 ---
 
diff --git a/applications/ColossalChat/coati/distributed/zero_bubble/README.md b/applications/ColossalChat/coati/distributed/zero_bubble/README.md
new file mode 100644
index 000000000..ec140684b
--- /dev/null
+++ b/applications/ColossalChat/coati/distributed/zero_bubble/README.md
@@ -0,0 +1,73 @@
+# Zero Bubble Distributed RL Framework for Language Model Fine-Tuning
+
+This folder contains code for the Zero Bubble distributed RL framework. It currently supports **GRPO** and **DAPO**. See the [main README](../README.md) for general installation instructions and usage.
+
+**Note:** This project is under active development; expect changes.
+
+## 🛠 Installation
+
+1. Follow the general installation guide in the [main README](../README.md).
+2. Install [pygloo](https://github.com/ray-project/pygloo). Build pygloo for Ray from source following the instructions in its repository README.
+
+## Design idea
+
+We aim to reduce the *“bubble”*, the idle time that occurs between rollouts and training steps (illustrated in Fig. 1).
+
+
+**Fig. 1** - In an all-sync online RL framework, rollout workers wait for the trainer to finish training and synchronize weights, and the trainer waits for rollouts. This leaves GPUs idle for long periods.
+
+
+**Fig. 2** - Our Zero Bubble pipeline follows a producer-consumer pattern:
+
+* A global **data buffer** temporarily stores rollouts produced by inference workers.
+* A **weights distributor** buffers updated model weights and distributes them to inference workers.
+* When the data buffer has enough data, the trainer continuously consumes from it and pushes updated weights to the weights distributor.
+* After finishing a mini-batch, each inference worker checks the weights distributor and synchronizes to a newer weight version if available.
+
+Under ideal conditions (inference workers produce data at the same rate the trainer consumes it), the pipeline eliminates idle time. We call it *zero bubble* because, with an unlimited data buffer, inference and training can run indefinitely without waiting. In practice, to avoid wasted compute and stale, off-policy data, we bound the buffer size, so inference workers wait briefly when the buffer is full.
+
+## Usage
+
+In addition to the general parameters (see the main README), the Zero Bubble pipeline introduces one parameter:
+
+* **`data_actor_buffer_size_limit`** - Maximum number of rollout batches the data buffer may hold. Defaults to **twice** the trainer’s mini-batch size. Avoid setting this too large: a very large buffer lets rollouts fall further behind the current weights, which makes training more off-policy. For DAPO, since only effective prompts count, you may need to raise `data_actor_buffer_size_limit` depending on sample utility.
+
+Example: RL training on 8 GPUs with Zero Bubble (zero2)
+
+```bash
+python rl_example_zero_bubble.py \
+  --dataset /path/to/your/dataset.jsonl \
+  --model /path/to/your/model \
+  -t 4 -i 4 -b vllm -a DAPO \
+  -imbs 8 -ibs 8 -tbs 8 -e 2 -rt boxed \
+  -si 25 -s "Please reason step by step, and put your final answer within \\boxed{}." \
+  -tMbs 2 -tmbs 2 -p Rebase_Experiments -zero 2 -mpt 512 -mnt 3584
+```
+
+## Performance
+
+
+**Fig. 3** - Performance of the Zero Bubble pipeline tested with an unlimited buffer size.
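
As a reading aid for the design described in the new README, the producer-consumer pattern (bounded data buffer, weights distributor, version check after each mini-batch) can be sketched in plain Python. This is a minimal single-process illustration, not the project's implementation: it assumes threads and `queue.Queue` in place of the distributed workers, and the names `WeightsDistributor`, `run_pipeline`, `num_steps`, `train_batch`, and `buffer_limit` are hypothetical.

```python
import queue
import threading


class WeightsDistributor:
    """Hypothetical helper: holds the newest weight version published by the trainer."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0

    def publish(self, version):
        with self._lock:
            self._version = max(self._version, version)

    def latest(self):
        with self._lock:
            return self._version


def run_pipeline(num_steps=6, train_batch=2, buffer_limit=4):
    """Run one inference worker and one trainer; return (consumed, final_version)."""
    buffer = queue.Queue(maxsize=buffer_limit)  # bounded data buffer
    distributor = WeightsDistributor()
    consumed = []  # (rollout_id, weight_version_used) pairs, in training order

    def inference_worker():
        local_version = 0
        for rollout_id in range(num_steps * train_batch):
            # Before each mini-batch, pick up newer weights if any were published.
            local_version = max(local_version, distributor.latest())
            # put() blocks while the buffer is full: the bounded-buffer wait.
            buffer.put((rollout_id, local_version))

    def trainer():
        version = 0
        for _ in range(num_steps):
            # get() blocks until enough rollouts are available.
            batch = [buffer.get() for _ in range(train_batch)]
            consumed.extend(batch)
            version += 1  # stand-in for one optimizer step
            distributor.publish(version)

    threads = [
        threading.Thread(target=inference_worker),
        threading.Thread(target=trainer),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed, distributor.latest()


if __name__ == "__main__":
    consumed, final_version = run_pipeline()
    print(len(consumed), final_version)
```

With a small `buffer_limit`, the inference worker blocks in `put()` whenever it gets ahead of the trainer, which corresponds to the brief wait the README describes for a full buffer. A larger limit removes that wait but lets rollouts generated with older weight versions accumulate in the buffer.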