seq parallel doc
This commit is contained in:
parent
ea24e7b9ec
commit
765db38e48
@@ -163,7 +163,7 @@ Among the sequence parallelism methods mentioned, both ring attention and Ulysses
Model structure generalization: Ring attention generalizes better than Ulysses. Ulysses requires that the model config satisfy ```head number % (tp group size * sp group size) == 0```, while ring attention has no such restriction.
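As a rough illustration of this constraint, here is a minimal sketch of the divisibility check Ulysses implies; the variable names are hypothetical and not part of the ColossalAI API:

```python
# Minimal sketch of the head-count constraint imposed by Ulysses (all_to_all).
# All names below are illustrative, not ColossalAI API.
num_attention_heads = 32   # from the model config
tp_size = 2                # tensor parallel group size
sp_size = 8                # sequence parallel group size

# Ulysses redistributes attention heads across the tp and sp groups,
# so the head count must divide evenly; ring attention has no such requirement.
assert num_attention_heads % (tp_size * sp_size) == 0, (
    "Ulysses (all_to_all) requires head number divisible by tp_size * sp_size"
)
```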
-Due to its simplicity and non-intrusive modification to attention calculation, Ulysses is currently the mainstream method for sequence parallelism. Both methods can be compatible with other high-performance attention methods such as Flash Attention, and can also be combined with other parallel training strategies like ZeRO, TP, PP, and DP.
+Due to its simplicity and non-intrusive modification to the attention calculation, Ulysses is currently the mainstream method for sequence parallelism. All sequence parallelism methods are compatible with high-performance attention implementations such as Flash Attention, and can also be combined with other parallel training strategies like ZeRO, TP, PP, and DP.
Overall, we recommend using Ulysses. You only need to specify ```--sp_mode all_to_all``` during startup. Based on our testing, in a two-node, 16-GPU setup with the startup parameters ```--tp 2 --sp 8 --sp_mode all_to_all```, it is easy to train sequences of up to 128k tokens, and the performance is the best among all sequence parallelism methods, reaching approximately 480+ TFLOPS on dual H800 nodes. However, if you are aiming for extreme performance optimization or training long sequences on a larger scale of machines, you may want to consider using ring attention.
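A minimal sketch of how these settings could be expressed with ColossalAI's `HybridParallelPlugin` is shown below; the exact parameter names (`sp_size`, `sequence_parallelism_mode`, `enable_sequence_parallelism`, etc.) are assumptions based on the plugin's documented interface and may differ across versions, so check your version's API reference:

```python
# Minimal sketch, assuming HybridParallelPlugin exposes the parameters below;
# verify the exact signature against your installed ColossalAI version.
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

colossalai.launch_from_torch()  # started via torchrun; older versions may require config={}

plugin = HybridParallelPlugin(
    tp_size=2,                               # mirrors --tp 2
    pp_size=1,
    sp_size=8,                               # mirrors --sp 8
    enable_sequence_parallelism=True,
    sequence_parallelism_mode="all_to_all",  # mirrors --sp_mode all_to_all (Ulysses)
    enable_flash_attention=True,
    precision="bf16",
)
booster = Booster(plugin=plugin)
# model, optimizer, and dataloader are then wrapped via booster.boost(...)
```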
<!-- doc-test-command: torchrun --standalone --nproc_per_node=4 sequence_parallelism.py -->