[doc] add model examples for each plugin

This commit is contained in:
Pengtai Xu
2023-09-19 18:01:23 +08:00
parent a04337bfc3
commit e10d9f087e
2 changed files with 8 additions and 8 deletions

View File

@@ -91,9 +91,9 @@ This plugin implements the combination of various parallel training strategies a
Generally only one plugin is used to train a model. Our recommended use case for each plugin is as follows.
- [Torch DDP Plugin](#torch-ddp-plugin): It is suitable for models with less than 2 billion parameters.
- [Torch FSDP Plugin](#torch-fsdp-plugin) / [Low Level Zero Plugin](#low-level-zero-plugin): It is suitable for models with less than 10 billion parameters.
- [Gemini Plugin](#gemini-plugin): It is suitable for models with more than 10 billion parameters and is ideal for scenarios with high cross-node bandwidth and medium to small-scale clusters (below a thousand cards).
- [Hybrid Pararllel Plugin](#hybrid-parallel-plugin): It is suitable for models with more than 60 billion parameters, or special models such as those with exceptionally long sequences, very large vocabularies, and is best suited for scenarios with low cross-node bandwidth and large-scale clusters (a thousand cards or more).
- [Torch DDP Plugin](#torch-ddp-plugin): It is suitable for models with less than 2 billion parameters (e.g. Bert-3m, GPT2-1.5b).
- [Torch FSDP Plugin](#torch-fsdp-plugin) / [Low Level Zero Plugin](#low-level-zero-plugin): It is suitable for models with less than 10 billion parameters (e.g. GPTJ-6b, MegatronLM-8b).
- [Gemini Plugin](#gemini-plugin): It is suitable for models with more than 10 billion parameters (e.g. TuringNLG-17b) and is ideal for scenarios with **high cross-node bandwidth and medium to small-scale clusters (below a thousand cards)** (e.g. Llama2-70b).
- [Hybrid Pararllel Plugin](#hybrid-parallel-plugin): It is suitable for models with more than 60 billion parameters, or special models such as those with exceptionally long sequences, very large vocabularies, and is best suited for scenarios with **low cross-node bandwidth and large-scale clusters (a thousand cards or more)** (e.g. GPT3-175b, Bloom-176b).
<!-- doc-test-command: echo -->