diff --git a/docs/source/en/basics/booster_plugins.md b/docs/source/en/basics/booster_plugins.md
index 075b17a1b..57fa81343 100644
--- a/docs/source/en/basics/booster_plugins.md
+++ b/docs/source/en/basics/booster_plugins.md
@@ -19,26 +19,17 @@ We currently provide the following plugins:
 
 More plugins are coming soon.
 
+## Choosing Your Plugin
+
+Generally, only one plugin is used to train a model. Our recommended use cases for each plugin are as follows.
+
+- [Torch DDP Plugin](#torch-ddp-plugin): It is suitable for models with less than 2 billion parameters (e.g. Bert-3m, GPT2-1.5b).
+- [Torch FSDP Plugin](#torch-fsdp-plugin) / [Low Level Zero Plugin](#low-level-zero-plugin): It is suitable for models with less than 10 billion parameters (e.g. GPTJ-6b, MegatronLM-8b).
+- [Gemini Plugin](#gemini-plugin): It is suitable for models with more than 10 billion parameters (e.g. TuringNLG-17b) and is ideal for scenarios with **high cross-node bandwidth and medium to small-scale clusters (below a thousand cards)** (e.g. Llama2-70b).
+- [Hybrid Parallel Plugin](#hybrid-parallel-plugin): It is suitable for models with more than 60 billion parameters, as well as special models such as those with exceptionally long sequences or very large vocabularies, and is best suited for scenarios with **low cross-node bandwidth and large-scale clusters (a thousand cards or more)** (e.g. GPT3-175b, Bloom-176b).
+
 ## Plugins
 
-### Torch DDP Plugin
-
-More details can be found in [Pytorch Docs](https://pytorch.org/docs/main/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel).
-
-{{ autodoc:colossalai.booster.plugin.TorchDDPPlugin }}
-
-### Torch FSDP Plugin
-
-> ⚠ This plugin is not available when torch version is lower than 1.12.0.
-
-> ⚠ This plugin does not support save/load sharded model checkpoint now.
-
-> ⚠ This plugin does not support optimizer that use multi params group.
-
-More details can be found in [Pytorch Docs](https://pytorch.org/docs/main/fsdp.html).
-
-{{ autodoc:colossalai.booster.plugin.TorchFSDPPlugin }}
-
 ### Low Level Zero Plugin
 
 This plugin implements Zero-1 and Zero-2 (w/wo CPU offload), using `reduce` and `gather` to synchronize gradients and weights.
@@ -87,13 +78,22 @@ This plugin implements the combination of various parallel training strategies a
 
 {{ autodoc:colossalai.booster.plugin.HybridParallelPlugin }}
 
-## Choosing Your Plugin
+### Torch DDP Plugin
 
-Generally only one plugin is used to train a model. Our recommended use case for each plugin is as follows.
+More details can be found in [Pytorch Docs](https://pytorch.org/docs/main/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel).
 
-- [Torch DDP Plugin](#torch-ddp-plugin): It is suitable for models with less than 2 billion parameters (e.g. Bert-3m, GPT2-1.5b).
-- [Torch FSDP Plugin](#torch-fsdp-plugin) / [Low Level Zero Plugin](#low-level-zero-plugin): It is suitable for models with less than 10 billion parameters (e.g. GPTJ-6b, MegatronLM-8b).
-- [Gemini Plugin](#gemini-plugin): It is suitable for models with more than 10 billion parameters (e.g. TuringNLG-17b) and is ideal for scenarios with **high cross-node bandwidth and medium to small-scale clusters (below a thousand cards)** (e.g. Llama2-70b).
-- [Hybrid Pararllel Plugin](#hybrid-parallel-plugin): It is suitable for models with more than 60 billion parameters, or special models such as those with exceptionally long sequences, very large vocabularies, and is best suited for scenarios with **low cross-node bandwidth and large-scale clusters (a thousand cards or more)** (e.g. GPT3-175b, Bloom-176b).
+{{ autodoc:colossalai.booster.plugin.TorchDDPPlugin }}
+
+### Torch FSDP Plugin
+
+> ⚠ This plugin is not available when the torch version is lower than 1.12.0.
+
+> ⚠ This plugin does not yet support saving/loading sharded model checkpoints.
+
+> ⚠ This plugin does not support optimizers that use multiple parameter groups.
+
+More details can be found in [Pytorch Docs](https://pytorch.org/docs/main/fsdp.html).
+
+{{ autodoc:colossalai.booster.plugin.TorchFSDPPlugin }}
diff --git a/docs/source/zh-Hans/basics/booster_plugins.md b/docs/source/zh-Hans/basics/booster_plugins.md
index 0857f44e1..d4ef7012f 100644
--- a/docs/source/zh-Hans/basics/booster_plugins.md
+++ b/docs/source/zh-Hans/basics/booster_plugins.md
@@ -1,6 +1,7 @@
 # Booster 插件
 
-作者: [Hongxin Liu](https://github.com/ver217), [Baizhou Zhang](https://github.com/Fridge003)
+作者: [Hongxin Liu](https://github.com/ver217), [Baizhou Zhang](https://github.com/Fridge003), [Pengtai Xu](https://github.com/ppt0011)
+
 
 **前置教程:**
 - [Booster API](./booster_api.md)
@@ -19,27 +20,14 @@
 
 更多插件即将推出。
 
+## 插件选择
+- [Torch DDP 插件](#torch-ddp-插件): 适用于参数少于 20 亿的模型(例如 Bert-3m、GPT2-1.5b)。
+- [Torch FSDP 插件](#torch-fsdp-插件) / [Low Level Zero 插件](#low-level-zero-插件): 适用于参数少于 100 亿的模型(例如 GPTJ-6b、MegatronLM-8b)。
+- [Gemini 插件](#gemini-插件): 适合参数超过 100 亿的模型(例如 TuringNLG-17b),且**跨节点带宽高、中小规模集群(千卡以下)**的场景(例如 Llama2-70b)。
+- [Hybrid Parallel 插件](#hybrid-parallel-插件): 适合参数超过 600 亿的模型、超长序列、超大词表等特殊模型,且**跨节点带宽低、大规模集群(千卡以上)**的场景(例如 GPT3-175b、Bloom-176b)。
+
 ## 插件
 
-### Torch DDP 插件
-
-更多详细信息,请参阅 [Pytorch 文档](https://pytorch.org/docs/main/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel).
-
-{{ autodoc:colossalai.booster.plugin.TorchDDPPlugin }}
-
-### Torch FSDP 插件
-
-> ⚠ 如果 torch 版本低于 1.12.0,此插件将不可用。
-
-> ⚠ 该插件现在还不支持保存/加载分片的模型 checkpoint。
-
-> ⚠ 该插件现在还不支持使用了multi params group的optimizer。
-
-更多详细信息,请参阅 [Pytorch 文档](https://pytorch.org/docs/main/fsdp.html).
-
-{{ autodoc:colossalai.booster.plugin.TorchFSDPPlugin }}
-
-
 ### Low Level Zero 插件
 
 该插件实现了 Zero-1 和 Zero-2(使用/不使用 CPU 卸载),使用`reduce`和`gather`来同步梯度和权重。
@@ -87,10 +75,22 @@ Zero-2 不支持局部梯度累积。如果您坚持使用,虽然可以积累
 
 {{ autodoc:colossalai.booster.plugin.HybridParallelPlugin }}
 
-## 插件选择
-- [Torch DDP 插件](#torch-ddp-插件): 适用于参数少于 20 亿的模型(例如 Bert-3m、GPT2-1.5b)。
-- [Torch FSDP 插件](#torch-fsdp-插件) / [Low Level Zero 插件](#low-level-zero-插件): 适用于参数少于 100 亿的模型(例如 GPTJ-6b、MegatronLM-8b)。
-- [Gemini 插件](#gemini-插件): 适合参数超过 100 亿的模型(例如 TuringNLG-17b),且**跨节点带宽高、中小规模集群(千卡以下)**的场景(例如 Llama2-70b)。
-- [Hybrid Pararllel 插件](#hybrid-parallel-插件): 适合参数超过 600 亿的模型、超长序列、超大词表等特殊模型,且**跨节点带宽低、大规模集群(千卡以上)**的场景(例如 GPT3-175b、Bloom-176b)。
+### Torch DDP 插件
+
+更多详细信息,请参阅 [Pytorch 文档](https://pytorch.org/docs/main/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel).
+
+{{ autodoc:colossalai.booster.plugin.TorchDDPPlugin }}
+
+### Torch FSDP 插件
+
+> ⚠ 如果 torch 版本低于 1.12.0,此插件将不可用。
+
+> ⚠ 该插件现在还不支持保存/加载分片的模型 checkpoint。
+
+> ⚠ 该插件现在还不支持使用了multi params group的optimizer。
+
+更多详细信息,请参阅 [Pytorch 文档](https://pytorch.org/docs/main/fsdp.html).
+
+{{ autodoc:colossalai.booster.plugin.TorchFSDPPlugin }}
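For reference, a minimal sketch of how one of the recommended plugins is handed to the `Booster` (the launch call, toy model, and optimizer below are illustrative placeholders, not taken from the patch; swap `TorchDDPPlugin` for `GeminiPlugin`, `LowLevelZeroPlugin`, or `HybridParallelPlugin` per the guidance above):

```python
# Sketch only: assumes the colossalai Booster API documented above and a
# distributed launch via `torchrun` (which sets the RANK/WORLD_SIZE env vars).
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin  # or GeminiPlugin, LowLevelZeroPlugin, ...

colossalai.launch_from_torch(config={})   # initialize the distributed environment

# Placeholder model/optimizer; a real model under ~2B params suits TorchDDPPlugin.
model = torch.nn.Linear(128, 128).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

plugin = TorchDDPPlugin()                 # pick one plugin per the recommendations
booster = Booster(plugin=plugin)
model, optimizer, *_ = booster.boost(model, optimizer)

x = torch.randn(8, 128, device="cuda")
loss = model(x).sum()
booster.backward(loss, optimizer)         # plugin-aware backward pass
optimizer.step()
optimizer.zero_grad()
```

Only the plugin line changes when moving between the strategies listed in "Choosing Your Plugin"; the rest of the training loop stays the same.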