diff --git a/applications/ColossalChat/coati/distributed/README.md b/applications/ColossalChat/coati/distributed/README.md index ca6a49fb0..74e30bc46 100644 --- a/applications/ColossalChat/coati/distributed/README.md +++ b/applications/ColossalChat/coati/distributed/README.md @@ -129,6 +129,7 @@ pip install ray ``` Install Other Dependencies + ```bash export ATB_LLM_HCCL_ENABLE=1 export ATB_LLM_COMM_BACKEND="hccl" @@ -284,6 +285,8 @@ In addition to the two default training settings we provided--- original `GRPO` train_pipeline_parallelism_size / train_tensor_parallelism_size) ``` + + --- ## 🧪 Example: single machine 8-GPU Zero2 Strategy @@ -334,6 +337,54 @@ plugin_config={ }, # for pp, tp ``` +```bash +# Hint1: replace /models/Qwen/Qwen2.5-7B to your model path +# replace /datasets/train-alignment.jsonl to your dataset path +python rl_example.py + -m /path/to/Qwen2.5-Math-7B/ \ + -d /path/to/train_data.jsonl \ + --master_address '10.0.0.3' + -t 16 \ + -i 16 \ + -p GRPO-Train-Align-Debug \ + -g 2 \ + -ibs 1 \ + -tbs 2 \ + -tMbs 1 \ + -tmbs 2 \ + -imbs 1 \ + -s "Please reason step by step, and put your final answer within \\boxed{}." \ + -tMbs 8 \ + -p GRPO-Train-Align-Debug \ +``` + +## 🧪 Example: multi-machine TP+PP Strategy + +### Create ray cluster on multi-machine +For example, now we have 4 nodes and their IPs are 10.0.0.3, 10.0.0.4, 10.0.0.5, 10.0.0.6. +We use 10.0.0.3 as master node. First we start a ray cluster on 10.0.0.3: +```bash +ray start --head --node-ip-address=10.0.0.3 +``` + +Then, for each slave node (10.0.0.4/10.0.0.5/10.0.0.6), we add to the ray cluser by following code: +```bash +ray start --address='10.0.0.3:6379' +``` + +Modify plugin_config in ./applications/ColossalChat/rl_example.py +```python +plugin_config={ + "tp_size": 4, + "pp_size": 2, + "microbatch_size": max( + 1, args.train_microbatch_size // 2 + ), # microbatch size should be set to train_microbatch_size // pp_size + "zero_stage": 1, + "max_norm": 1.0, + }, # for pp, tp +``` + ```bash # Hint1: replace /models/Qwen/Qwen2.5-7B to your model path # replace /datasets/train-alignment.jsonl to your dataset path