seq parallel doc

wangbluo 2024-09-27 16:36:53 +08:00
parent ee368e75b5
commit 36aea5472d
2 changed files with 5 additions and 5 deletions

View File

@@ -160,5 +160,5 @@ Among the sequence parallelism methods mentioned, both ring attention and Ulysse
Memory usage: Both are similar in terms of memory consumption.
Model structure generalization: Ring attention generalizes better than Ulysses. Ulysses requires that the number of attention heads be divisible by ```tp group size * sp group size``` (see the check sketched after this hunk), while ring attention has no such restriction.
Due to its simplicity and non-intrusive modification of the attention calculation, Ulysses is currently the mainstream method for sequence parallelism. Both methods are compatible with high-performance attention implementations such as Flash Attention, and can be combined with other parallel training strategies like ZeRO, TP, PP, and DP.
-Overall, we recommend using Ulysses. You only need to specify ```--sp_mode all_to_all``` during startup. Based on testing, in a two-node, 16-GPU setup, using the startup parameters ```--tp 2 --sp 8 --sp_mode all_to_all```, it is easy to train sequences of up to 128k length, and its performance is the best among all sequence parallelism methods. However, if you are aiming for extreme performance optimization or training long texts on a larger number of machines, you may want to consider ring attention.
+Overall, we recommend using Ulysses. You only need to specify ```--sp_mode all_to_all``` during startup. Based on testing, in a two-node, 16-GPU setup, using the startup parameters ```--tp 2 --sp 8 --sp_mode all_to_all```, it is easy to train sequences of up to 128k length, and its performance is the best among all sequence parallelism methods, reaching approximately 480+ TFLOPS on dual H800 nodes. However, if you are aiming for extreme performance optimization or training long texts on a larger number of machines, you may want to consider ring attention.
<!-- doc-test-command: torchrun --standalone --nproc_per_node=4 sequence_parallelism.py -->
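
To make the head-count condition above concrete, here is a tiny sketch (the head counts are hypothetical, chosen only for illustration) that checks whether a configuration can use Ulysses:

```python
def ulysses_compatible(num_heads: int, tp_size: int, sp_size: int) -> bool:
    """Ulysses shards attention heads across the tp and sp groups, so the head
    count must be divisible by tp_size * sp_size; ring attention has no such
    requirement."""
    return num_heads % (tp_size * sp_size) == 0

# Hypothetical head counts, checked against the --tp 2 --sp 8 example (2 * 8 = 16):
print(ulysses_compatible(32, tp_size=2, sp_size=8))  # True:  32 % 16 == 0
print(ulysses_compatible(24, tp_size=2, sp_size=8))  # False: 24 % 16 != 0
```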

View File

@@ -101,7 +101,7 @@ plugin = HybridParallelPlugin(
)
```
-Example launch command parameters: ```--tp 2 --sp 8 --sp_mode split_gather```
+Example launch parameters: ```--tp 2 --sp 8 --sp_mode split_gather```
#### Using DeepSpeed-Ulysses
Define the plugin. In DeepSpeed-Ulysses sequence parallelism, the tp group and sp group are orthogonal,
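
For readers without the full file at hand, here is a hedged sketch of what the plugin definition around these hunks likely looks like. The argument names follow ColossalAI's HybridParallelPlugin; the concrete sizes simply mirror the ```--tp 2 --sp 8``` example flags and are not fixed by this commit.

```python
from colossalai.booster.plugin import HybridParallelPlugin

# Ulysses (DeepSpeed-Ulysses) style sequence parallelism: tp and sp groups are
# orthogonal. Distributed init (e.g. colossalai.launch_from_torch()) must run first.
plugin = HybridParallelPlugin(
    tp_size=2,                               # mirrors --tp 2
    pp_size=1,
    sp_size=8,                               # mirrors --sp 8
    enable_sequence_parallelism=True,
    sequence_parallelism_mode="all_to_all",  # mirrors --sp_mode all_to_all
)
# For ring attention, set sequence_parallelism_mode="ring_attn" and make sure
# sp_size is the exact sequence-parallel size; for split_gather, use "split_gather".
```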
@@ -114,7 +114,7 @@ plugin = HybridParallelPlugin(
    sequence_parallelism_mode="all_to_all",
)
```
-Example launch command parameters: ```--tp 2 --sp 8 --sp_mode all_to_all```
+Example launch parameters: ```--tp 2 --sp 8 --sp_mode all_to_all```
#### Using ring attention
Define the plugin. In ring attention sequence parallelism, the tp group and sp group are orthogonal, and sp_size must be set to the exact sequence-parallel size.
@@ -127,7 +127,7 @@ plugin = HybridParallelPlugin(
    sequence_parallelism_mode="ring_attn",
)
```
-Example launch command parameters: ```--tp 2 --sp 8 --sp_mode ring_attn```
+Example launch parameters: ```--tp 2 --sp 8 --sp_mode ring_attn```
#### Using booster
```python
@@ -165,7 +165,7 @@ for step, batch in enumerate(tqdm(dataloader, desc="Step", disable=not dist.get_
Because it is simple to use and does not intrusively modify the attention computation, Ulysses is currently the mainstream form of sequence parallelism. All of these sequence parallelism methods are compatible with high-performance attention implementations such as flash attention, and can be combined with other parallel training strategies such as ZeRO, TP, PP, and DP.
-Overall, we recommend using Ulysses: you only need to specify ```--sp_mode all_to_all``` at launch. In our tests, on two nodes with 16 GPUs and the launch parameters ```--tp 2 --sp 8 --sp_mode all_to_all```, it is easy to train sequences of up to 128k length, and its performance is the best among all sequence parallelism modes. However, if you are after extreme performance optimization or are training long texts on a larger number of machines, consider the ring attention mode of sequence parallelism.
+Overall, we recommend using Ulysses: you only need to specify ```--sp_mode all_to_all``` at launch. In our tests, on two nodes with 16 GPUs and the launch parameters ```--tp 2 --sp 8 --sp_mode all_to_all```, it is easy to train sequences of up to 128k length, and its performance is the best among all sequence parallelism modes, reaching roughly 480+ TFLOPS on dual H800 nodes. However, if you are after extreme performance optimization or are training long texts on a larger number of machines, consider the ring attention mode of sequence parallelism.
<!-- doc-test-command: torchrun --standalone --nproc_per_node=4 sequence_parallelism.py -->
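
Finally, a compact, hedged sketch of how the plugin, booster, and training loop shown in these hunks fit together. The HybridParallelPlugin/Booster calls are ColossalAI APIs and the flag values mirror the examples above, but the model, data, and loop details are illustrative assumptions rather than the commit's actual sequence_parallelism.py.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import LlamaConfig, LlamaForCausalLM

import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

colossalai.launch_from_torch()  # run under torchrun with tp * sp GPUs in total (2 * 8 = 16 here)

# Plugin as in the sketch above: --tp 2 --sp 8 --sp_mode all_to_all.
plugin = HybridParallelPlugin(tp_size=2, pp_size=1, sp_size=8,
                              enable_sequence_parallelism=True,
                              sequence_parallelism_mode="all_to_all")
booster = Booster(plugin=plugin)

# Tiny stand-in model and random token data so the sketch is self-contained;
# note that 16 attention heads is divisible by tp * sp = 16, as Ulysses requires.
model = LlamaForCausalLM(LlamaConfig(hidden_size=512, intermediate_size=1024,
                                     num_hidden_layers=2, num_attention_heads=16))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
tokens = torch.randint(0, 32000, (16, 4096))  # 16 samples of 4k tokens each
dataloader = DataLoader(TensorDataset(tokens), batch_size=1)

model, optimizer, _, dataloader, _ = booster.boost(model, optimizer, dataloader=dataloader)

for step, (batch,) in enumerate(dataloader):
    batch = batch.cuda()
    outputs = model(input_ids=batch, labels=batch)   # causal-LM loss
    booster.backward(outputs.loss, optimizer)        # plugin-aware backward
    optimizer.step()
    optimizer.zero_grad()
```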