[doc] polish shardformer doc (#4779)

* fix example format in docstring

* polish shardformer doc
Baizhou Zhang
2023-09-26 10:57:47 +08:00
committed by GitHub
parent 26cd6d850c
commit a2db75546d
8 changed files with 215 additions and 68 deletions


@@ -214,9 +214,56 @@ In addition, xFormers's `cutlass_op` can serve as a backup for flash attention.
Enabling `Shardformer` through `Booster` initialized with `HybridParallelPlugin` is the recommended way to awaken the power of Shardformer.
The main reason is that pipeline parallelism cannot work without calling the `execute_pipeline` method of `Booster`. Besides, `HybridParallelPlugin` makes it possible to combine the features of `Shardformer` with other useful features, such as mixed precision training or ZeRO.
More details about this usage can be found in chapter [Booster API](../basics/booster_api.md) and [Booster Plugins](../basics/booster_plugins.md).
[Here](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/bert) is an example of how to trigger `Shardformer` through `HybridParallelPlugin`. Move to the root directory of this example and execute
```bash
torchrun --standalone --nproc_per_node 4 finetune.py --target_f1 0.86 --plugin "hybrid_parallel" --model_type "bert"
```
Then you can start finetuning a BERT model wrapped by `Shardformer`; the wrapping is handled by `HybridParallelPlugin`.
Please be aware that forward and backward are performed differently depending on whether pipeline parallelism is enabled or not.
Let's delve into the code of `finetune.py`.
In the `main` function, the plugin is created with the following code:
```python
...
elif args.plugin == "hybrid_parallel":
# modify the param accordingly for finetuning test cases
plugin = HybridParallelPlugin(
tp_size=1,
pp_size=2,
num_microbatches=None,
microbatch_size=1,
enable_all_optimization=True,
zero_stage=1,
precision="fp16",
initial_scale=1,
)
```
Here you can change the plugin configuration by setting `tp_size`, `pp_size` or `zero_stage` to other values. More details about plugin configuration can be found in [Booster Plugins Doc](../basics/booster_plugins.md).
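For context, a minimal sketch of how such a plugin is then handed to `Booster` to boost the model could look like the following (the names `model`, `optimizer`, `_criterion`, `train_dataloader` and `lr_scheduler` are placeholders here, not definitions taken from the snippet above):
```python
from colossalai.booster import Booster

booster = Booster(plugin=plugin)
# boost() wraps the passed objects according to the plugin configuration,
# which includes sharding the model with Shardformer
model, optimizer, _criterion, train_dataloader, lr_scheduler = booster.boost(
    model, optimizer, criterion=_criterion, dataloader=train_dataloader, lr_scheduler=lr_scheduler
)
```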
If pipeline parallelism is not enabled, training works the same way as with other booster plugins: first boost the model with `Booster`, then run forward and backward in the usual way, as sketched below.
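A minimal sketch of this non-pipeline path (batch handling and variable names are assumptions, not code taken from the example):
```python
# non-pipeline path: an ordinary forward/backward step with the boosted objects
for batch in train_dataloader:
    outputs = model(**batch)
    loss = _criterion(outputs, batch)  # schematic loss computation
    booster.backward(loss, optimizer)  # lets the plugin handle fp16 / ZeRO details
    optimizer.step()
    optimizer.zero_grad()
```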
However, if pipeline parallelism is enabled, the workflow differs from the normal case in several ways:
1. Before doing forward or backward, the criterion function (loss function) is wrapped to match the argument signature required by the pipeline:
```python
def _criterion(outputs, inputs):
outputs = output_transform_fn(outputs)
loss = criterion(outputs)
return loss
```
2. In the `train_epoch` function, the dataloader is converted into an iterator before running the pipeline:
```python
train_dataloader_iter = iter(train_dataloader)
```
3. Do the forward and backward passes by calling the `Booster.execute_pipeline` method:
```python
outputs = booster.execute_pipeline(
train_dataloader_iter, model, _criterion, optimizer, return_loss=True, return_outputs=True
)
```
The backward pass is completed inside this method, so there is no need to call `loss.backward()` afterwards; a full training step is sketched after this list.
More details about `Booster.execute_pipeline` can be found in [Booster API Doc](../basics/booster_api.md).
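Putting the pieces above together, one pipeline-parallel training step might look roughly like the following sketch (reading the loss from the returned value and the `is_last_stage()` check follow the pattern of the example code, but the exact names here are assumptions):
```python
# one training step with pipeline parallelism enabled
outputs = booster.execute_pipeline(
    train_dataloader_iter, model, _criterion, optimizer, return_loss=True, return_outputs=True
)
# only the last pipeline stage holds a meaningful loss
if booster.plugin.stage_manager.is_last_stage():
    loss = outputs["loss"]
# backward has already been performed inside execute_pipeline,
# so only the optimizer update remains
optimizer.step()
optimizer.zero_grad()
```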
#### 2. Enabling Shardformer Through Shardformer APIs (Not Recommended)
@@ -224,7 +271,26 @@ More details about this usage can be found in chapter [Booster API](../basics/bo
You can also use Shardformer by manually calling the Shardformer APIs. However, this usage is not recommended, since pipeline parallelism cannot run without `Booster`.
[Here](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/examples/convergence_benchmark.py)
is an example of how to trigger `Shardformer` by calling the Shardformer APIs. In the `train` function of the example code, the model is wrapped by `Shardformer` with the following code:
```python
...
if dist.get_world_size() > 1:
tp_group = dist.new_group(backend="nccl")
# First create configuration for Shardformer
shard_config = ShardConfig(
tensor_parallel_process_group=tp_group,
enable_tensor_parallelism=True,
enable_all_optimization=True
)
# Then create ShardFormer object with created config
shard_former = ShardFormer(shard_config=shard_config)
# Finally shard the model using ShardFormer.optimize method
model, _ = shard_former.optimize(model)
...
```
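Since no `Booster` is involved here, training after `shard_former.optimize` is a plain PyTorch loop; a minimal sketch with `batch`, `criterion` and `optimizer` as assumed placeholders:
```python
# after sharding: ordinary PyTorch forward/backward on the sharded model
outputs = model(**batch)
loss = criterion(outputs)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```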
### Precautions
@@ -241,6 +307,8 @@ is an example on how to trigger `Shardformer` through calling Shardformer APIs.
## How Shardformer Works
### Main Idea
Generally, Shardformer works through the following four kinds of *replacements*:
1. Replacing the original PyTorch modules (e.g. `nn.Linear`, `nn.Embedding`) with crafted distributed modules.