[doc] update and revise some typos and errs in docs (#4107)

* fix some typos and problems in doc * fix some typos and problems in doc * add doc test
2025-09-13 05:01:44 +00:00 · 2023-06-28 19:30:37 +08:00
parent 769cddcb2c
commit 711e2b4c00
6 changed files with 99 additions and 62 deletions
--- a/docs/source/zh-Hans/basics/booster_api.md
+++ b/docs/source/zh-Hans/basics/booster_api.md
@@ -1,35 +1,44 @@
 # booster 使用
-作者: [Mingyan Jiang](https://github.com/jiangmingyan)
+
+作者: [Mingyan Jiang](https://github.com/jiangmingyan) [Jianghai Chen](https://github.com/CjhHa1)

 **预备知识:**
+
 - [分布式训练](../concepts/distributed_training.md)
 - [Colossal-AI 总览](../concepts/colossalai_overview.md)

 **示例代码**
- [使用booster训练](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/new_api/cifar_resnet/README.md)
+
+<!-- update this url-->
+
+- [使用 booster 训练](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/new_api/cifar_resnet/README.md)

 ## 简介
-在我们的新设计中， `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如，模型、优化器、数据加载器）无缝注入您的训练组件中。 使用booster API, 您可以更友好地将我们的并行策略整合到待训练模型中. 调用 `colossalai.booster` 是您进入训练循环前的基本操作。
+
+在我们的新设计中， `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如，模型、优化器、数据加载器)无缝注入您的训练组件中。 使用 booster API, 您可以更友好地将我们的并行策略整合到待训练模型中. 调用 `colossalai.booster` 是您进入训练循环前的基本操作。
 在下面的章节中，我们将介绍 `colossalai.booster` 是如何工作的以及使用时我们要注意的细节。

-### Booster插件
-Booster插件是管理并行配置的重要组件（eg：gemini插件封装了gemini加速方案）。目前支持的插件如下：
+### Booster 插件

-***GeminiPlugin:*** GeminiPlugin插件封装了 gemini 加速解决方案，即基于块内存管理的 ZeRO优化方案。
+Booster 插件是管理并行配置的重要组件（eg：gemini 插件封装了 gemini 加速方案）。目前支持的插件如下：

-***TorchDDPPlugin:*** TorchDDPPlugin插件封装了DDP加速方案，实现了模型级别的数据并行，可以跨多机运行。
+**_GeminiPlugin:_** GeminiPlugin 插件封装了 gemini 加速解决方案，即基于块内存管理的 ZeRO 优化方案。

-***LowLevelZeroPlugin:*** LowLevelZeroPlugin插件封装了零冗余优化器的 1/2 阶段。阶段 1：切分优化器参数，分发到各并发进程或并发GPU上。阶段 2：切分优化器参数及梯度，分发到各并发进程或并发GPU上。
+**_TorchDDPPlugin:_** TorchDDPPlugin 插件封装了 DDP 加速方案，实现了模型级别的数据并行，可以跨多机运行。

-### Booster接口
+**_LowLevelZeroPlugin:_** LowLevelZeroPlugin 插件封装了零冗余优化器的 1/2 阶段。阶段 1：切分优化器参数，分发到各并发进程或并发 GPU 上。阶段 2：切分优化器参数及梯度，分发到各并发进程或并发 GPU 上。
+
+### Booster 接口
+
+<!--TODO: update autodoc -->

 {{ autodoc:colossalai.booster.Booster }}

 ## 使用方法及示例

-在使用colossalai训练时，首先需要在训练脚本的开头启动分布式环境，并创建需要使用的模型、优化器、损失函数、数据加载器等对象。之后，调用`colossalai.booster` 将特征注入到这些对象中，您就可以使用我们的booster API去进行您接下来的训练流程。
+在使用 colossalai 训练时，首先需要在训练脚本的开头启动分布式环境，并创建需要使用的模型、优化器、损失函数、数据加载器等对象。之后，调用`colossalai.booster` 将特征注入到这些对象中，您就可以使用我们的 booster API 去进行您接下来的训练流程。

-以下是一个伪代码示例，将展示如何使用我们的booster API进行模型训练:
+以下是一个伪代码示例，将展示如何使用我们的 booster API 进行模型训练:

 ```python
 import torch
--- a/docs/source/zh-Hans/features/mixed_precision_training_with_booster.md
+++ b/docs/source/zh-Hans/features/mixed_precision_training_with_booster.md
@@ -3,12 +3,13 @@
 作者: [Mingyan Jiang](https://github.com/jiangmingyan)

 **前置教程**
+
 - [定义配置文件](../basics/define_your_config.md)
- [booster使用](../basics/booster_api.md)
+- [booster 使用](../basics/booster_api.md)

 **相关论文**
- [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)

+- [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)

 ## 引言

@@ -19,18 +20,17 @@ AMP 代表自动混合精度训练。
 2. apex.amp
 3. naive amp

+| Colossal-AI    | 支持张量并行 | 支持流水并行 | fp16 范围                                                 |
+| -------------- | ------------ | ------------ | --------------------------------------------------------- |
+| AMP_TYPE.TORCH | ✅           | ❌           | 在前向和反向传播期间，模型参数、激活和梯度向下转换至 fp16 |
+| AMP_TYPE.APEX  | ❌           | ❌           | 更细粒度，我们可以选择 opt_level O0, O1, O2, O3           |
+| AMP_TYPE.NAIVE | ✅           | ✅           | 模型参数、前向和反向操作，全都向下转换至 fp16             |

-| Colossal-AI | 支持张量并行 | 支持流水并行 | fp16范围 |
-| ----------- | ----------------------- | ------------------------- | ----------- |
-| AMP_TYPE.TORCH | ✅ | ❌ | 在前向和反向传播期间，模型参数、激活和梯度向下转换至fp16 |
-| AMP_TYPE.APEX | ❌ | ❌ | 更细粒度，我们可以选择 opt_level O0, O1, O2, O3 |
-| AMP_TYPE.NAIVE | ✅ | ✅ | 模型参数、前向和反向操作，全都向下转换至fp16 |
+前两个依赖于 PyTorch (1.6 及以上) 和 NVIDIA Apex 的原始实现。最后一种方法类似 Apex O2。在这些方法中，Apex-AMP 与张量并行不兼容。这是因为张量是以张量并行的方式在设备之间拆分的，因此，需要在不同的进程之间进行通信，以检查整个模型权重中是否出现 inf 或 nan。我们修改了 torch amp 实现，使其现在与张量并行兼容。

-前两个依赖于 PyTorch (1.6及以上) 和 NVIDIA Apex 的原始实现。最后一种方法类似 Apex O2。在这些方法中，Apex-AMP 与张量并行不兼容。这是因为张量是以张量并行的方式在设备之间拆分的，因此，需要在不同的进程之间进行通信，以检查整个模型权重中是否出现inf或nan。我们修改了torch amp实现，使其现在与张量并行兼容。
-
-> ❌️ fp16与ZeRO不兼容
+> ❌️ fp16 与 ZeRO 不兼容
 >
-> ⚠️ 流水并行目前仅支持naive amp
+> ⚠️ 流水并行目前仅支持 naive amp

 我们建议使用 torch AMP，因为在不使用流水并行时，它通常比 NVIDIA AMP 提供更好的准确性。

@@ -57,11 +57,14 @@ AMP 代表自动混合精度训练。

 ## Colossal-AI 中的 AMP

-我们支持三种 AMP 训练方法，并允许用户在没有改变代码的情况下使用 AMP 进行训练。booster支持amp特性注入，如果您要使用混合精度训练，则在创建booster实例时指定`mixed_precision`参数，我们现已支持torch amp，apex amp, naive amp（现已移植torch amp至booster，apex amp, naive amp仍由`colossalai.initialize`方式启动，如您需使用，请[参考](./mixed_precision_training.md）;后续将会拓展`bf16`,`pf8`的混合精度训练.
+我们支持三种 AMP 训练方法，并允许用户在没有改变代码的情况下使用 AMP 进行训练。booster 支持 amp 特性注入，如果您要使用混合精度训练，则在创建 booster 实例时指定`mixed_precision`参数，我们现已支持 torch amp，apex amp, naive amp（现已移植 torch amp 至 booster，apex amp, naive amp 仍由`colossalai.initialize`方式启动，如您需使用，请[参考](./mixed_precision_training.md);后续将会拓展`bf16`,`pf8`的混合精度训练.
+
+#### booster 启动方式
+
+您可以在创建 booster 实例时，指定`mixed_precision="fp16"`即使用 torch amp。

-#### booster启动方式
-您可以在创建booster实例时，指定`mixed_precision="fp16"`即使用torch amp。
 <!--- doc-test-ignore-start -->
+
 ```python
 """
    初始化映射关系如下：
@@ -74,9 +77,13 @@ AMP 代表自动混合精度训练。
 from colossalai import Booster
 booster = Booster(mixed_precision='fp16',...)
 ```
+
 <!--- doc-test-ignore-end -->
+
 或者您可以自定义一个`FP16TorchMixedPrecision`对象，如
+
 <!--- doc-test-ignore-start -->
+
 ```python
 from colossalai.mixed_precision import FP16TorchMixedPrecision
 mixed_precision = FP16TorchMixedPrecision(
@@ -86,8 +93,10 @@ mixed_precision = FP16TorchMixedPrecision(
    growth_interval=2000)
 booster = Booster(mixed_precision=mixed_precision,...)
 ```
+
 <!--- doc-test-ignore-end -->
-其他类型的amp使用方式也是一样的。
+
+其他类型的 amp 使用方式也是一样的。

 ### Torch AMP 配置

@@ -96,7 +105,7 @@ booster = Booster(mixed_precision=mixed_precision,...)
 ### Apex AMP 配置

 对于这种模式，我们依靠 Apex 实现混合精度训练。我们支持这个插件，因为它允许对混合精度的粒度进行更精细的控制。
-例如, O2 水平 (优化器水平2) 将保持 batch normalization 为 FP32。
+例如, O2 水平 (优化器水平 2) 将保持 batch normalization 为 FP32。

 如果你想了解更多细节，请参考 [Apex Documentation](https://nvidia.github.io/apex/)。

@@ -104,7 +113,7 @@ booster = Booster(mixed_precision=mixed_precision,...)

 ### Naive AMP 配置

-在 Naive AMP 模式中, 我们实现了混合精度训练，同时保持了与复杂张量和流水并行的兼容性。该 AMP 模式将所有操作转为 FP16 。下列代码块展示了该模式的booster启动方式。
+在 Naive AMP 模式中, 我们实现了混合精度训练，同时保持了与复杂张量和流水并行的兼容性。该 AMP 模式将所有操作转为 FP16 。下列代码块展示了该模式的 booster 启动方式。

 {{ autodoc:colossalai.booster.mixed_precision.FP16NaiveMixedPrecision }}

@@ -186,7 +195,8 @@ lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=NUM_EPOCHS
 ```

 ### 步骤 4. 插入 AMP
-创建一个MixedPrecision对象（如果需要）及torchDDPPlugin对象，调用 `colossalai.boost` 将所有训练组件转为为FP16模式.
+
+创建一个 MixedPrecision 对象（如果需要）及 torchDDPPlugin 对象，调用 `colossalai.boost` 将所有训练组件转为为 FP16 模式.

 ```python
 plugin = TorchDDPPlugin()
@@ -209,7 +219,7 @@ model, optimizer, criterion, dataloader, lr_scheduler = booster.boost(model, opt

 ### 步骤 5. 使用 booster 训练

-使用booster构建一个普通的训练循环。
+使用 booster 构建一个普通的训练循环。

 ```python
 model.train()
@@ -232,4 +242,5 @@ for epoch in range(NUM_EPOCHS):
 ```shell
 colossalai run --nproc_per_node 1 train.py
 ```
+
 <!-- doc-test-command: torchrun --standalone --nproc_per_node=1 mixed_precision_training_with_booster.py  -->
--- a/docs/source/zh-Hans/get_started/run_demo.md
+++ b/docs/source/zh-Hans/get_started/run_demo.md
@@ -4,8 +4,8 @@ Colossal-AI 是一个集成的大规模深度学习系统，具有高效的并

 ## 单 GPU

-Colossal-AI 可以用在只有一个 GPU 的系统上训练深度学习模型，并达到 baseline 的性能。 我们提供了一个 [在CIFAR10数据集上训练ResNet](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/resnet) 的例子，该例子只需要一个 GPU。
-您可以在 [ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI-Examples) 中获取该例子。详细说明可以在其 `README.md` 中获取。
+Colossal-AI 可以用在只有一个 GPU 的系统上训练深度学习模型，并达到 baseline 的性能。 我们提供了一个 [在 CIFAR10 数据集上训练 ResNet](https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/resnet) 的例子，该例子只需要一个 GPU。
+您可以在 [ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI/tree/main/examples) 中获取该例子。详细说明可以在其 `README.md` 中获取。

 ## 多 GPU

@@ -13,16 +13,20 @@ Colossal-AI 可用于在具有多个 GPU 的分布式系统上训练深度学习

 #### 1. 数据并行

-您可以使用与上述单 GPU 演示相同的 [ResNet例子](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/resnet)。 通过设置 `--nproc_per_node` 为您机器上的 GPU 数量，您就能把数据并行应用在您的例子上了。
+您可以使用与上述单 GPU 演示相同的 [ResNet 例子](https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/resnet)。 通过设置 `--nproc_per_node` 为您机器上的 GPU 数量，您就能把数据并行应用在您的例子上了。

 #### 2. 混合并行

-混合并行包括数据、张量和流水线并行。在 Colossal-AI 中，我们支持不同类型的张量并行（即 1D、2D、2.5D 和 3D）。您可以通过简单地改变 `config.py` 中的配置在不同的张量并行之间切换。您可以参考 [GPT example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/gpt), 更多细节能在它的 `README.md` 中被找到。
+混合并行包括数据、张量和流水线并行。在 Colossal-AI 中，我们支持不同类型的张量并行（即 1D、2D、2.5D 和 3D）。您可以通过简单地改变 `config.py` 中的配置在不同的张量并行之间切换。您可以参考 [GPT example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt), 更多细节能在它的 `README.md` 中被找到。

-#### 3. MoE并行
+#### 3. MoE 并行

-我们提供了一个 [WideNet例子](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/widenet) 来验证 MoE 的并行性。 WideNet 使用 Mixture of Experts（MoE）来实现更好的性能。更多的细节可以在我们的教程中获取：[教会您如何把Mixture of Experts整合到模型中](../advanced_tutorials/integrate_mixture_of_experts_into_your_model.md)。
+<!-- TODO: 在colossalai中实现这个例子 -->
+
+我们提供了一个 [ViT-MoE 例子](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/moe) 来验证 MoE 的并行性。 WideNet 使用 Mixture of Experts（MoE）来实现更好的性能。更多的细节可以在我们的教程中获取：[教会您如何把 Mixture of Experts 整合到模型中](../advanced_tutorials/integrate_mixture_of_experts_into_your_model.md)。

 #### 4. 序列并行

-序列并行是为了解决NLP任务中的内存效率和序列长度限制问题。 我们在 [ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI-Examples) 中提供了一个 [BERT例子](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/bert/sequene_parallel)。您可以按照 `README.md` 来执行代码。
+序列并行是为了解决 NLP 任务中的内存效率和序列长度限制问题。 我们在 [ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI/tree/main/examples) 中提供了一个 [Sequence Parallelism 例子](https://github.com/hpcaitech/ColossalAI/tree/main/examples/tutorial/sequence_parallel)。您可以按照 `README.md` 来执行代码。
+
+<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 run_demo.py  -->