diff --git a/docs/source/en/features/sequence_parallelism.md b/docs/source/en/features/sequence_parallelism.md
index 354b8af59..e422a24a9 100644
--- a/docs/source/en/features/sequence_parallelism.md
+++ b/docs/source/en/features/sequence_parallelism.md
@@ -157,7 +157,7 @@ Currently, the `MoeHybridParallelPlugin` only supports DeepSpeed-Ulysses sequenc
 ### Conclusion
 Among the sequence parallelism methods mentioned, both ring attention and Ulysses have their pros and cons, and we need to choose the appropriate sequence parallelism method based on the situation:
 
-Communication: Ulysses has lower communication overhead compared to ring attention, as it primarily involves three All-to-All communication ops, whereas the communication cost of ring attention grows quadratically with the sequence length. However, on the other hand, All-to-All op also demands more bandwidth from the hardware.
+Communication: Ulysses has lower communication overhead compared to ring attention, as it primarily involves three All-to-All communication ops, whereas the communication cost of ring attention grows quadratically with the sequence length. However, on the other hand, All-to-All op also demands dense network topologies, e.g. NVLink + NVSwitch, so it doesn't scale well across multiple nodes.
 
 Memory usage: Both are similar in terms of memory consumption.
 
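diff --git a/docs/source/zh-Hans/features/sequence_parallelism.md b/docs/source/zh-Hans/features/sequence_parallelism.md
index 5280a5813..ddf1ae293 100644
--- a/docs/source/zh-Hans/features/sequence_parallelism.md
+++ b/docs/source/zh-Hans/features/sequence_parallelism.md
@@ -157,7 +157,7 @@ for step, batch in enumerate(tqdm(dataloader, desc="Step", disable=not dist.get_
 ### Conclusion
 Among the sequence parallelism methods above, ring attention and Ulysses each have their strengths and weaknesses, and we need to choose the appropriate sequence parallelism method based on the situation:
 
-Communication: Ulysses has a lower communication volume than ring attention. Ulysses mainly involves three All2All communications, whereas the communication of ring attention grows quadratically with the sequence length. On the other hand, all2all also places higher demands on the underlying hardware.
+Communication: Ulysses has a lower communication volume than ring attention. Ulysses mainly involves three All2All communications, whereas the communication of ring attention grows quadratically with the sequence length. On the other hand, because the all2all op requires a more complex network topology, such as NVLink and NVSwitch, its performance does not improve well as the number of machines increases in multi-node settings.
 
 Memory usage: The two are similar.
 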