added Chinese documents and fixed some typos in English documents

Fan Cui
2021-11-02 23:01:13 +08:00
parent ccbc918c11
commit 18ba66e012
19 changed files with 1064 additions and 130 deletions


@@ -5,52 +5,51 @@
We support multiple parallelization strategies in our library: data parallelism, pipeline parallelism and tensor parallelism (1D, 2D, 2.5D, 3D), which can be combined into hybrid parallelism. You can initialize the corresponding process group by setting `parallel` in our config. The parallel configuration can be easily deployed by a dictionary in the configuration file. The configuration dictionary must obey the following format. Data parallel size will be inferred automatically based on your inputs to pipeline parallelism and tensor parallelism.
```python
parallel = dict(
    pipeline=dict("size": int),
    tensor=dict("size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any)
)
```
The name of the dictionary variable should be **parallel**. All the arguments, even **parallel** itself, are optional, and the data, pipeline and tensor parallel sizes will default to 1. The value of data, pipeline and tensor can be an int representing the size of the specific parallel dimension, or a dictionary with a key called "size". The key "mode"
represents the way of tensor parallelism.
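For instance, both the int form and the dictionary form are accepted; a minimal sketch with illustrative sizes, assuming 16 GPUs in total:
```python
# 16 GPUs: pipeline size 2 and tensor size 4 leave a data parallel
# size of 16 / (2 * 4) = 2, which is inferred automatically
parallel = dict(
    pipeline=2,                      # int form
    tensor=dict(size=4, mode='2d'),  # dictionary form with a tensor parallel mode
)
```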
## Data Parallel
Data parallel is the most common way to distribute your training task by splitting data into several shards and training on a single shard on each device. The configuration for data parallel is detected automatically and set for you; you do not have to set it explicitly in your configuration. When the data parallel size is larger than 1, ColossalAI automatically
adds the distributed data sampler to the dataloader to shard the dataset.
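Under the hood, the automatic sharding is comparable to attaching a distributed sampler yourself; a minimal plain-PyTorch sketch, only to illustrate what ColossalAI does on your behalf (the dataset and batch size are placeholders):
```python
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

def build_sharded_dataloader(dataset, batch_size):
    # each data parallel rank sees a disjoint shard of the dataset
    sampler = DistributedSampler(
        dataset,
        num_replicas=dist.get_world_size(),  # == data parallel size in a pure data parallel run
        rank=dist.get_rank(),                # this process's shard index
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```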
## 1D, 2D, 2.5D and 3D Parallel
To enable hybrid parallelism, we provide an array of tensor parallelism methods. Below is the list of papers corresponding to each
tensor parallel method. These parallel modes need to work with the distributed layers provided by ColossalAI.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data,
model weights and layer outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$
devices where $N$ is the number of tensor chunks in a single dimension.
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which further
parallelizes 2D tensor parallelism. A total of $P = N^2 d$ processors are arranged into $d$ layers,
where each layer performs matrix multiplication operations independently with a dimension $N$.
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves
the optimal $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage are evenly distributed
through optimized load balancing of parameters as well as activations.
```python
# 1D parallel
parallel = dict(
    tensor=dict(size=4, mode='1d')   # the size value here is illustrative
)
```
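The other tensor parallel modes are configured the same way; a sketch with illustrative sizes (the `depth` keyword argument for 2.5D is an assumption here, corresponding to the $d$ layers described above):
```python
# 2D parallel: P = N^2 devices, here N = 2
parallel = dict(
    tensor=dict(size=4, mode='2d')
)

# 2.5D parallel: P = N^2 * d devices, here N = 2 and d = 2
parallel = dict(
    tensor=dict(size=8, mode='2.5d', depth=2)  # `depth` is assumed, passed via kwargs
)

# 3D parallel: P devices arranged as a cube, here 2 x 2 x 2 = 8
parallel = dict(
    tensor=dict(size=8, mode='3d')
)
```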
@@ -85,8 +84,8 @@ and the second layer to the second GPU. This example of course wastes the comput
the idea of pipeline parallelism.
As PyTorch is based on a dynamic computation graph, the computation flow is not known until execution. To support pipeline
parallelism in PyTorch, you may need to add one more attribute, `layers_cfg`, in your model class which tells ColossalAI
the sequence of execution. One example you can refer to is `colossalai.nn.model.VanillaResNet`.
```python
from colossalai.nn import BaseModel

class VanillaResNet(BaseModel):
    ...
    # the layers_cfg attribute (elided here) spells out the execution sequence
    layers_cfg = [
        ...
    ]
```
You can set the number of pipeline stages in your configuration file. When the pipeline size is larger than 1, ColossalAI
will automatically create the pipeline schedule which defines the forward and backward steps. You can specify how many microbatches
to run in each step in the `schedule` configuration.
```python
schedule = dict(
    num_microbatches = 4  # set the number of microbatches per step
)
```
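The number of pipeline stages itself is set in the `parallel` block described earlier; a minimal sketch with an illustrative size of 2:
```python
parallel = dict(
    pipeline=dict(size=2)  # split the model into 2 pipeline stages
)
```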
This feature is still in development and is only experimental for now.
## Sequence Parallel (experimental)
Sequence parallelism is designed to support long-sequence modelling such as document-level text understanding and medical imaging.
This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
This feature is still in development and is only experimental for now.