seq parallel doc
This commit is contained in:
parent
ea24e7b9ec
commit
765db38e48
@@ -163,7 +163,7 @@ Among the sequence parallelism methods mentioned, both ring attention and Ulysses
Model structure generalization: Ring attention generalizes better than Ulysses. Ulysses requires that the model config satisfy ```head number % (tp group size * sp group size) == 0```, while ring attention has no such restriction.
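As a rough illustration of this constraint, here is a minimal sketch of the divisibility check Ulysses implies; the variable names are hypothetical and not part of the ColossalAI API:

```python
# Minimal sketch of the head-count constraint imposed by Ulysses (all_to_all).
# All names below are illustrative, not ColossalAI API.
num_attention_heads = 32   # from the model config
tp_size = 2                # tensor parallel group size
sp_size = 8                # sequence parallel group size

# Ulysses redistributes attention heads across the tp and sp groups,
# so the head count must divide evenly; ring attention has no such requirement.
assert num_attention_heads % (tp_size * sp_size) == 0, (
    "Ulysses (all_to_all) requires head number divisible by tp_size * sp_size"
)
```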
-Due to its simplicity and non-intrusive modification to attention calculation, Ulysses is currently the mainstream method for sequence parallelism. Both methods can be compatible with other high-performance attention methods such as Flash Attention, and can also be combined with other parallel training strategies like ZeRO, TP, PP, and DP.
+Due to its simplicity and non-intrusive modification to the attention calculation, Ulysses is currently the mainstream method for sequence parallelism. All sequence parallelism methods are compatible with high-performance attention implementations such as Flash Attention, and can also be combined with other parallel training strategies like ZeRO, TP, PP, and DP.
Overall, we recommend using Ulysses. You only need to specify ```--sp_mode all_to_all``` during startup. Based on our testing, in a two-node, 16-GPU setup with the startup parameters ```--tp 2 --sp 8 --sp_mode all_to_all```, it is easy to train sequences of up to 128k tokens, and the performance is the best among all sequence parallelism methods, reaching approximately 480+ TFLOPS on dual H800 nodes. However, if you are aiming for extreme performance optimization or training long sequences on a larger scale of machines, you may want to consider using ring attention.
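A minimal sketch of how these settings could be expressed with ColossalAI's `HybridParallelPlugin` is shown below; the exact parameter names (`sp_size`, `sequence_parallelism_mode`, `enable_sequence_parallelism`, etc.) are assumptions based on the plugin's documented interface and may differ across versions, so check your version's API reference:

```python
# Minimal sketch, assuming HybridParallelPlugin exposes the parameters below;
# verify the exact signature against your installed ColossalAI version.
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

colossalai.launch_from_torch()  # started via torchrun; older versions may require config={}

plugin = HybridParallelPlugin(
    tp_size=2,                               # mirrors --tp 2
    pp_size=1,
    sp_size=8,                               # mirrors --sp 8
    enable_sequence_parallelism=True,
    sequence_parallelism_mode="all_to_all",  # mirrors --sp_mode all_to_all (Ulysses)
    enable_flash_attention=True,
    precision="bf16",
)
booster = Booster(plugin=plugin)
# model, optimizer, and dataloader are then wrapped via booster.boost(...)
```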
<!-- doc-test-command: torchrun --standalone --nproc_per_node=4 sequence_parallelism.py -->