[doc] add tutorial for booster plugins (#3758)

* [doc] add en booster plugins doc * [doc] add booster plugins doc in sidebar * [doc] add zh booster plugins doc * [doc] fix zh booster plugin translation * [doc] reoganize tutorials order of basic section * [devops] force sync to test ci
2026-01-05 07:34:30 +00:00 · 2023-05-19 12:12:42 +08:00
parent 5ce6c9d86f
commit 21e29e2212
3 changed files with 132 additions and 3 deletions
--- a/docs/source/en/basics/booster_plugins.md
+++ b/docs/source/en/basics/booster_plugins.md
@@ -0,0 +1,64 @@
+# Booster Plugins
+
+Author: [Hongxin Liu](https://github.com/ver217)
+
+**Prerequisite:**
+- [Booster API](./booster_api.md)
+
+## Introduction
+
+As mentioned in [Booster API](./booster_api.md), we can use booster plugins to customize the parallel training. In this tutorial, we will introduce how to use booster plugins.
+
+We currently provide the following plugins:
+
+- [Low Level Zero Plugin](#low-level-zero-plugin): It wraps the `colossalai.zero.low_level.LowLevelZeroOptimizer` and can be used to train models with zero-dp. It only supports zero stage-1 and stage-2.
+- [Gemini Plugin](#gemini-plugin): It wraps the [Gemini](../features/zero_with_chunk.md) which implements Zero-3 with chunk-based and heterogeneous memory management.
+- [Torch DDP Plugin](#torch-ddp-plugin): It is a wrapper of `torch.nn.parallel.DistributedDataParallel` and can be used to train models with data parallelism.
+- [Torch FSDP Plugin](#torch-fsdp-plugin): It is a wrapper of `torch.distributed.fsdp.FullyShardedDataParallel` and can be used to train models with zero-dp.
+
+More plugins are coming soon.
+
+## Plugins
+
+### Low Level Zero Plugin
+
+This plugin implements Zero-1 and Zero-2 (w/wo CPU offload), using `reduce` and `gather` to synchronize gradients and weights.
+
+Zero-1 can be regarded as a better substitute of Torch DDP, which is more memory efficient and faster. It can be easily used in hybrid parallelism.
+
+Zero-2 does not support local gradient accumulation. Though you can accumulate gradient if you insist, it cannot reduce communication cost. That is to say, it's not a good idea to use Zero-2 with pipeline parallelism.
+
+{{ autodoc:colossalai.booster.plugin.LowLevelZeroPlugin }}
+
+We've tested compatibility on some famous models, following models may not be supported:
+
+- `timm.models.convit_base`
+- dlrm and deepfm models in `torchrec`
+- `diffusers.VQModel`
+- `transformers.AlbertModel`
+- `transformers.AlbertForPreTraining`
+- `transformers.BertModel`
+- `transformers.BertForPreTraining`
+- `transformers.GPT2DoubleHeadsModel`
+
+Compatibility problems will be fixed in the future.
+
+### Gemini Plugin
+
+This plugin implements Zero-3 with chunk-based and heterogeneous memory management. It can train large models without much loss in speed. It also does not support local gradient accumulation. More details can be found in [Gemini Doc](../features/zero_with_chunk.md).
+
+{{ autodoc:colossalai.booster.plugin.GeminiPlugin }}
+
+### Torch DDP Plugin
+
+More details can be found in [Pytorch Docs](https://pytorch.org/docs/main/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel).
+
+{{ autodoc:colossalai.booster.plugin.TorchDDPPlugin }}
+
+### Torch FSDP Plugin
+
+> ⚠ This plugin is not available when torch version is lower than 1.12.0.
+
+More details can be found in [Pytorch Docs](https://pytorch.org/docs/main/fsdp.html).
+
+{{ autodoc:colossalai.booster.plugin.TorchFSDPPlugin }}