mirror of https://github.com/hpcaitech/ColossalAI.git
Commit ab6023c5ea: Merge ceadef35d5 into 46ed5d856b
@ -101,6 +101,7 @@ plugin = HybridParallelPlugin(
sequence_parallelism_mode="split_gather",
)
```
Example of startup command parameters: ```--tp 2 --sp 8 --sp_mode split_gather```
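
A minimal sketch of the full plugin definition for this mode is shown below for context. Only `sequence_parallelism_mode="split_gather"` comes from the snippet above; the sizes (mirroring the example command) and the launch call are assumptions added for illustration.

```python
import colossalai
from colossalai.booster.plugin import HybridParallelPlugin

# Initialize the distributed environment first (assumes a torchrun-style launch).
colossalai.launch_from_torch()

# Sketch only: tp_size mirrors the --tp 2 flag above; the other arguments are illustrative.
plugin = HybridParallelPlugin(
    tp_size=2,
    pp_size=1,
    enable_sequence_parallelism=True,
    sequence_parallelism_mode="split_gather",
)
```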
#### Using DeepSpeed-Ulysses
Define the plugin. In DeepSpeed-Ulysses sequence parallelism, the tp group and sp group are orthogonal.
@ -113,6 +114,7 @@ plugin = HybridParallelPlugin(
sequence_parallelism_mode="all_to_all",
)
```
Example of startup command parameters: ```--tp 2 --sp 8 --sp_mode all_to_all```
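
Again for context, a hedged sketch of how the constructor might look for this mode. Because the tp group and sp group are orthogonal here, this layout needs at least tp_size * sp_size GPUs; the concrete sizes simply mirror the example command and are assumptions.

```python
from colossalai.booster.plugin import HybridParallelPlugin

# Sketch only: tp and sp are independent process-group dimensions in this mode,
# so tp_size=2 and sp_size=8 imply 16 GPUs (times pp_size and the data-parallel size, if any).
plugin = HybridParallelPlugin(
    tp_size=2,
    pp_size=1,
    sp_size=8,
    enable_sequence_parallelism=True,
    sequence_parallelism_mode="all_to_all",
)
```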
#### Using Ring Attention
Define the plugin. In ring attention sequence parallelism, the tp group and sp group are orthogonal, and sp_size must be set to the correct parallel size.

@ -125,6 +127,8 @@ plugin = HybridParallelPlugin(
sequence_parallelism_mode="ring_attn",
)
```
Example of startup command parameters: ```--tp 2 --sp 8 --sp_mode ring_attn```
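
A corresponding sketch for ring attention, where the point emphasized above is that `sp_size` must be passed explicitly and match the intended sequence-parallel size. The sizes and the flash-attention flag are assumptions for illustration.

```python
from colossalai.booster.plugin import HybridParallelPlugin

# Sketch only: sp_size must equal the actual sequence-parallel degree in this mode.
plugin = HybridParallelPlugin(
    tp_size=2,
    pp_size=1,
    sp_size=8,
    enable_sequence_parallelism=True,
    sequence_parallelism_mode="ring_attn",
    enable_flash_attention=True,  # ring attention builds on flash-attention kernels (assumption)
)
```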
#### Using Booster

```python
booster = Booster(plugin=plugin)
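# The lines below are not part of the original snippet; they sketch the usual next step,
# assuming `model`, `optimizer`, `criterion`, `dataloader`, and `lr_scheduler` were created earlier.
# booster.boost wraps them so that the plugin's sequence parallelism takes effect during training.
model, optimizer, criterion, dataloader, lr_scheduler = booster.boost(
    model, optimizer, criterion=criterion, dataloader=dataloader, lr_scheduler=lr_scheduler
)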
@ -151,6 +155,15 @@ Currently, the `MoeHybridParallelPlugin` only supports DeepSpeed-Ulysses sequenc
### Conclusion
Among the sequence parallelism methods mentioned, ring attention has no requirements for the number of attention heads and can train ultra-long sequences. However, due to the division of computation, its performance may decrease. TP+SP and DeepSpeed-Ulysses have requirements for the number of attention heads, which must be divisible by the sp group size. These sequence parallelism methods are all compatible with high-performance attention mechanisms like flash attention. Sequence parallelism can also be used with Gemini to train extremely large-scale models, and it can be combined with TP, PP, and DP to form 4D parallelism.

Among the sequence parallelism methods mentioned, both ring attention and Ulysses have their pros and cons, and we need to choose the appropriate sequence parallelism method based on the situation:
Communication: Ulysses has lower communication overhead than ring attention, as it primarily involves three All-to-All communication ops, whereas the communication cost of ring attention grows quadratically with the sequence length. On the other hand, the All-to-All op demands a dense network topology, e.g. NVLink + NVSwitch, so it doesn't scale well across multiple nodes.

Memory usage: Both are similar in terms of memory consumption.

Model structure generalization: Ring attention is better than Ulysses in terms of generalization. Ulysses requires that the number of attention heads in the model config be divisible by ```tp group size * sp group size```, while ring attention has no such restriction.
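
To make the constraint concrete, here is a tiny sanity check you could run before building the plugin. The variable names are placeholders for values taken from your model config and launch flags, not identifiers from the original text.

```python
# Hypothetical values: read these from your model config / launch arguments.
num_attention_heads = 32
tp_size, sp_size = 2, 8

# Ulysses (all_to_all) needs the head count to split evenly across the tp and sp dimensions;
# ring attention has no such constraint.
assert num_attention_heads % (tp_size * sp_size) == 0, (
    f"num_attention_heads={num_attention_heads} is not divisible by "
    f"tp_size * sp_size = {tp_size * sp_size}"
)
```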
Because it is simple to use and does not intrusively modify the attention computation, Ulysses is currently the mainstream approach to sequence parallelism. All sequence parallel methods are compatible with other high-performance attention implementations such as Flash Attention, and can also be combined with other parallel training strategies like ZeRO, TP, PP, and DP.
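
As one illustration of that composability, a hedged sketch of a plugin that stacks Ulysses-style sequence parallelism with flash attention and ZeRO. The specific sizes, ZeRO stage, and precision are assumptions, not recommendations from the original text.

```python
from colossalai.booster.plugin import HybridParallelPlugin

# Sketch only: sequence parallelism combined with flash attention and ZeRO-1 data parallelism.
plugin = HybridParallelPlugin(
    tp_size=2,
    pp_size=1,
    sp_size=8,
    enable_sequence_parallelism=True,
    sequence_parallelism_mode="all_to_all",
    enable_flash_attention=True,
    zero_stage=1,
    precision="bf16",
)
```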
Overall, we recommend using Ulysses. You only need to specify ```--sp_mode all_to_all``` during startup. Based on testing, in a two-node, 16-GPU setup, using the startup parameters ```--tp 2 --sp 8 --sp_mode all_to_all```, it is easy to train sequences of up to 128k length, and the performance is the best among all sequence parallelism methods, reaching approximately 480+ TFLOPS on dual H800 nodes. However, if you're aiming for extreme performance optimization or training long texts on a larger number of machines, you might want to consider using ring attention.
<!-- doc-test-command: torchrun --standalone --nproc_per_node=4 sequence_parallelism.py -->
@ -101,6 +101,8 @@ plugin = HybridParallelPlugin(
)
```

Example of startup command parameters: ```--tp 2 --sp 8 --sp_mode split_gather```
#### Using DeepSpeed-Ulysses

Define the plugin. In DeepSpeed-Ulysses sequence parallelism, the tp group and sp group are orthogonal.

```python
@ -112,6 +114,7 @@ plugin = HybridParallelPlugin(
sequence_parallelism_mode="all_to_all",
)
```

Example of startup command parameters: ```--tp 2 --sp 8 --sp_mode all_to_all```
#### Using Ring Attention

Define the plugin. In ring attention sequence parallelism, the tp group and sp group are orthogonal, and sp_size must be set to the correct parallel size.

@ -124,6 +127,8 @@ plugin = HybridParallelPlugin(
sequence_parallelism_mode="ring_attn",
)
```
Example of startup command parameters: ```--tp 2 --sp 8 --sp_mode ring_attn```

#### Using Booster

```python
booster = Booster(plugin=plugin)
@ -150,6 +155,17 @@ for step, batch in enumerate(tqdm(dataloader, desc="Step", disable=not dist.get_
### Conclusion

Among the sequence parallelism methods above, ring attention places no requirement on the head number and can train ultra-long sequences, but because the computation is split into finer pieces, compute performance drops somewhat. TP+SP and DeepSpeed-Ulysses do place a requirement on the head number, which must be divisible by the sp group size. All of these sequence parallelism methods are compatible with other high-performance attention implementations such as flash attention. SP can be used together with Gemini to train extremely large models, and it can also be combined with TP, PP, and DP to form 4D parallelism.

Among the sequence parallelism methods above, ring attention and Ulysses each have their pros and cons, and we need to choose the appropriate sequence parallelism method based on the situation:
Communication: Ulysses has a lower communication volume than ring attention; Ulysses mainly involves three All-to-All communication ops, whereas the communication of ring attention grows quadratically with the sequence length. On the other hand, the All-to-All op requires a denser network topology, such as NVLink and NVSwitch, so in multi-node settings performance does not improve well as the number of machines increases.

Memory usage: The two are similar.
Model structure generalization: ring attention is better than Ulysses. Ulysses generalizes less well; it places a requirement on the head number, which must be divisible by `tp group size * sp group size`, while ring attention has no such restriction.

Because it is simple to use and does not intrusively modify the attention computation, Ulysses is currently the mainstream choice for sequence parallelism. All of these sequence parallelism methods are compatible with other high-performance attention implementations such as flash attention, and they can also be mixed with other parallel training strategies such as ZeRO, TP, PP, and DP.
Overall, we recommend using Ulysses: you only need to specify ```--sp_mode all_to_all``` at startup. In our tests on a two-node, 16-GPU setup, the startup parameters ```--tp 2 --sp 8 --sp_mode all_to_all``` make it easy to train sequences of 128k length, and its performance is the best among all sequence parallelism modes, reaching about 480+ TFLOPS on dual H800 nodes. However, if you are pursuing extreme performance optimization, or training long texts on more machines, you may want to consider ring-attention-mode sequence parallelism.
<!-- doc-test-command: torchrun --standalone --nproc_per_node=4 sequence_parallelism.py -->