[doc] migrate the markdown files (#2652)
docs/source/en/features/1D_tensor_parallel.md
# 1D Tensor Parallelism

Author: Zhengda Bian, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)

**Example Code**
- [ColossalAI-Examples 1D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_1d.py)

**Related Paper**
- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf)

## Introduction

Tensor parallelism partitions model weights across multiple devices in order to reduce memory load.
An efficient 1D tensor parallelism implementation was introduced by [Megatron-LM](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf).

Let's take a linear layer as an example, which consists of a GEMM $Y = XA$. Given 2 processors, we split the columns of $A$ into $[A_1 ~ A_2]$, and calculate $Y_i = XA_i$ on each processor, which then forms $[Y_1 ~ Y_2] = [XA_1 ~ XA_2]$. This is called a column-parallel fashion.

When a second linear layer $Z=YB$ follows the column-parallel one, we split $B$ into $\left[\begin{matrix} B_1 \\ B_2 \end{matrix} \right]$,
which is called a row-parallel fashion.
To calculate $Z = [Y_1 ~ Y_2] \left[\begin{matrix} B_1 \\ B_2 \end{matrix} \right]$, we first calculate $Y_iB_i$ on each processor, then use an all-reduce to aggregate the results as $Z=Y_1B_1+Y_2B_2$.

We also need to note that in the backward pass, the column-parallel linear layer needs to aggregate the gradients of the input tensor $X$, because on each processor $i$ we only have $\dot{X_i}=\dot{Y_i}A_i^T$.
Thus, we apply an all-reduce across the processors to get $\dot{X}=\dot{Y}A^T=\dot{Y_1}A_1^T+\dot{Y_2}A_2^T$.
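The column- and row-parallel steps can be checked with a small single-process sketch in plain PyTorch that keeps one shard per simulated processor and replaces the all-reduce with an explicit sum (an illustration of the math only, not Colossal-AI's implementation):

```python
import torch

torch.manual_seed(0)
X = torch.randn(16, 256, dtype=torch.float64)    # input
A = torch.randn(256, 1024, dtype=torch.float64)  # weight of the column-parallel layer
B = torch.randn(1024, 256, dtype=torch.float64)  # weight of the row-parallel layer

# Column-parallel: each simulated processor holds half of A's columns.
A1, A2 = A.chunk(2, dim=1)
Y1, Y2 = X @ A1, X @ A2          # local GEMMs, no communication needed in the forward pass

# Row-parallel: each simulated processor holds half of B's rows.
B1, B2 = B.chunk(2, dim=0)
Z = Y1 @ B1 + Y2 @ B2            # the sum plays the role of the all-reduce

assert torch.allclose(Z, X @ A @ B)
```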
## Efficiency

Given $P$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 1D tensor parallelism.

| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/P)$ | $O(1/P)$ | $O(1)$ | $O(2(P-1)/P)$ | $O(2(P-1))$ |

## Usage

To enable 1D tensor parallelism for our model, e.g. on 2 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,
    tensor=dict(size=2, mode='1d'),
))
```
Then Colossal-AI will automatically apply 1D parallelism to all the layers from `colossalai.nn`.

Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0

class MLP(torch.nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        intermediate_dim = dim * 4
        self.dense_1 = col_nn.Linear(dim, intermediate_dim)
        print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.transpose(0, 1).shape}')
        self.activation = torch.nn.GELU()
        self.dense_2 = col_nn.Linear(intermediate_dim, dim)
        print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.transpose(0, 1).shape}')
        self.dropout = col_nn.Dropout(0.1)

    def forward(self, x):
        x = self.dense_1(x)
        print_rank_0(f'Output of the first linear layer: {x.shape}')
        x = self.activation(x)
        x = self.dense_2(x)
        print_rank_0(f'Output of the second linear layer: {x.shape}')
        x = self.dropout(x)
        return x
```

Launch Colossal-AI on 2 GPUs and build the model.

```python
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch(config=CONFIG,
                  rank=args.rank,
                  world_size=args.world_size,
                  local_rank=args.local_rank,
                  host=args.host,
                  port=args.port)

m = MLP()
```
We will see the shapes of the partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([256, 512])
Weight of the second linear layer: torch.Size([512, 256])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the column-parallel partitioning, it becomes `[256, 512]`.
Similarly, the second row-parallel layer partitions the weight `[1024, 256]` into `[512, 256]`.

We can run the model with some random inputs.
```python
from colossalai.utils import get_current_device

x = torch.randn((16, 256), device=get_current_device())
torch.distributed.broadcast(x, src=0)  # synchronize input

x = m(x)
```
Then we can see the shapes of activation results.
```shell
Output of the first linear layer: torch.Size([16, 512])
Output of the second linear layer: torch.Size([16, 256])
```
The output of the first linear layer is split into 2 partitions (each has the shape `[16, 512]`), while the second layer has identical outputs across the GPUs.
docs/source/en/features/2D_tensor_parallel.md
# 2D Tensor Parallelism

Author: Zhengda Bian, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
- [1D Tensor Parallelism](./1D_tensor_parallel.md)

**Example Code**
- [ColossalAI-Examples - 2D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_2d.py)

**Related Paper**
- [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/pdf/2104.05343.pdf)

## Introduction

1D tensor parallelism does not partition activations, which can also consume a great amount of memory for large-scale models.
To evenly distribute the computation and memory load, [an efficient 2D tensor parallelism algorithm](https://arxiv.org/pdf/2104.05343.pdf) was introduced based on SUMMA (Scalable Universal Matrix Multiplication Algorithm).

Let's still take a linear layer $Y = XA$ as an example.
Given $P=q\times q$ processors (necessary condition), e.g. $q=2$, we split both the input $X$ and weight $A$ into

$$
\left[\begin{matrix} X_{10} & X_{11} \\ X_{00} & X_{01} \end{matrix} \right]
\text{~and~}
\left[\begin{matrix} A_{10} & A_{11} \\ A_{00} & A_{01} \end{matrix} \right].
$$

The calculation includes $q$ steps. When $t=1$, $X_{i0}$ is broadcasted in its row, and $A_{0j}$ is broadcasted in its column. So, we have

$$
\left[\begin{matrix} X_{10},A_{00} & X_{10},A_{01} \\ X_{00},A_{00} & X_{00},A_{01} \end{matrix} \right].
$$

Then we multiply $X_{i0}$ and $A_{0j}$ on each processor $(i, j)$ as

$$
\left[\begin{matrix} X_{10}A_{00} & X_{10}A_{01} \\ X_{00}A_{00} & X_{00}A_{01} \end{matrix} \right] (1).
$$

Similarly, when $t=2$, $X_{i1}$ is broadcasted in its row, $A_{1j}$ is broadcasted in its column, and we multiply them as

$$
\left[\begin{matrix} X_{11}A_{10} & X_{11}A_{11} \\ X_{01}A_{10} & X_{01}A_{11} \end{matrix} \right] (2).
$$

By adding $(1)$ and $(2)$ up, we have

$$
Y = XA = \left[\begin{matrix} X_{10}A_{00}+X_{11}A_{10} & X_{10}A_{01}+X_{11}A_{11} \\ X_{00}A_{00}+X_{01}A_{10} & X_{00}A_{01}+X_{01}A_{11} \end{matrix} \right].
$$
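The $q$-step procedure can be verified with a small single-process sketch that keeps the blocks of $X$ and $A$ in a $q \times q$ grid of tensors and accumulates the partial products exactly as above (an illustration of SUMMA only, with the broadcasts replaced by plain indexing; not Colossal-AI's distributed implementation):

```python
import torch

q = 2
torch.manual_seed(0)
X = torch.randn(8, 8, dtype=torch.float64)
A = torch.randn(8, 8, dtype=torch.float64)

# X_blocks[i][j] / A_blocks[i][j] is the block held by processor (i, j).
X_blocks = [list(row.chunk(q, dim=1)) for row in X.chunk(q, dim=0)]
A_blocks = [list(row.chunk(q, dim=1)) for row in A.chunk(q, dim=0)]

# SUMMA: at step t, the t-th block column of X and the t-th block row of A
# are broadcast, and every processor accumulates one partial product.
Y_blocks = [[torch.zeros(8 // q, 8 // q, dtype=torch.float64) for _ in range(q)] for _ in range(q)]
for t in range(q):
    for i in range(q):
        for j in range(q):
            Y_blocks[i][j] += X_blocks[i][t] @ A_blocks[t][j]

Y = torch.cat([torch.cat(row, dim=1) for row in Y_blocks], dim=0)
assert torch.allclose(Y, X @ A)
```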
## Efficiency

Given $P=q\times q$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 2D tensor parallelism.

| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/q^2)$ | $O(1/q^2)$ | $O(1/q^2)$ | $O(6(q-1)/q)$ | $O(6(q-1))$ |

## Usage

To enable 2D tensor parallelism for our model, e.g. on 4 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,
    tensor=dict(size=4, mode='2d'),
))
```
Then Colossal-AI will automatically apply 2D parallelism to all the layers from `colossalai.nn`.

Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0

class MLP(torch.nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        intermediate_dim = dim * 4
        self.dense_1 = col_nn.Linear(dim, intermediate_dim)
        print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
        self.activation = torch.nn.GELU()
        self.dense_2 = col_nn.Linear(intermediate_dim, dim)
        print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
        self.dropout = col_nn.Dropout(0.1)

    def forward(self, x):
        x = self.dense_1(x)
        print_rank_0(f'Output of the first linear layer: {x.shape}')
        x = self.activation(x)
        x = self.dense_2(x)
        print_rank_0(f'Output of the second linear layer: {x.shape}')
        x = self.dropout(x)
        return x
```
Launch Colossal-AI on 4 GPUs and build the model.
```python
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch(config=CONFIG,
                  rank=args.rank,
                  world_size=args.world_size,
                  local_rank=args.local_rank,
                  host=args.host,
                  port=args.port)

m = MLP()
```
We will see the shapes of the partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([128, 512])
Weight of the second linear layer: torch.Size([512, 128])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the partitioning of 2D parallelism, it becomes `[128, 512]` on each GPU.
Similarly, the second layer partitions the weight `[1024, 256]` into `[512, 128]`.

We can run the model with some random inputs.
```python
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.utils import get_current_device

x = torch.randn((16, 256), device=get_current_device())
# partition input
torch.distributed.broadcast(x, src=0)
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL)]
x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW)]
print_rank_0(f'Input: {x.shape}')

x = m(x)
```
Then we can see the shapes of activation results.
```shell
Input: torch.Size([8, 128])
Output of the first linear layer: torch.Size([8, 512])
Output of the second linear layer: torch.Size([8, 128])
```
The activation tensors in 2D parallelism are all split in both row and column.
E.g. the output of the first linear layer has the shape `[8, 512]`, while the second layer has the output of `[8, 128]`.
docs/source/en/features/2p5D_tensor_parallel.md
# 2.5D Tensor Parallelism

Author: Zhengda Bian, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
- [1D Tensor Parallelism](./1D_tensor_parallel.md)
- [2D Tensor Parallelism](./2D_tensor_parallel.md)

**Example Code**
- [ColossalAI-Examples - 2.5D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_2p5d.py)

**Related Paper**
- [2.5-dimensional distributed model training](https://arxiv.org/pdf/2105.14500.pdf)

## Introduction

Compared with 1D tensor parallelism, 2D parallelism reduces the memory cost, but may introduce more communication.
Therefore, a [2.5D tensor parallelism algorithm](https://arxiv.org/pdf/2105.14500.pdf) was proposed based on 2.5D SUMMA to reduce communication by using more devices.

Let's still take a linear layer $Y = XA$ as an example.
Given $P=q \times q \times d$ processors (necessary condition), e.g. $q=d=2$, we split the input $X$ into $d\times q$ rows and $q$ columns as

$$
\left[\begin{matrix} X_{30} & X_{31} \\ X_{20} & X_{21} \\ X_{10} & X_{11} \\ X_{00} & X_{01}\end{matrix} \right],
$$
which can be reshaped into $d$ layers as

$$
\left[\begin{matrix} X_{10} & X_{11} \\ X_{00} & X_{01} \end{matrix} \right] \text{~and~}\left[\begin{matrix} X_{30} & X_{31} \\ X_{20} & X_{21} \end{matrix} \right].
$$

Also, the weight $A$ is split into

$$
\left[\begin{matrix} A_{10} & A_{11} \\ A_{00} & A_{01} \end{matrix} \right].
$$

For each layer of $X$, we use the SUMMA algorithm to multiply $X$ and $A$.
Then, we have the output

$$
\left[\begin{matrix} Y_{10}=X_{10}A_{00}+X_{11}A_{10} & Y_{11}=X_{10}A_{01}+X_{11}A_{11} \\ Y_{00}=X_{00}A_{00}+X_{01}A_{10} & Y_{01}=X_{00}A_{01}+X_{01}A_{11} \end{matrix} \right]
\text{~and~}
$$
$$
\left[\begin{matrix} Y_{30}=X_{30}A_{00}+X_{31}A_{10} & Y_{31}=X_{30}A_{01}+X_{31}A_{11} \\ Y_{20}=X_{20}A_{00}+X_{21}A_{10} & Y_{21}=X_{20}A_{01}+X_{21}A_{11} \end{matrix} \right].
$$

## Efficiency

Given $P=q \times q \times d$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 2.5D tensor parallelism.

| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/dq^2)$ | $O(1/q^2)$ | $O(1/dq^2)$ | $\small O(3(q-1)(d+1)/dq)$ | $O(6(q-1))$ |

## Usage

To enable 2.5D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,
    tensor=dict(size=8, mode='2.5d', depth=2),
))
```
Then Colossal-AI will automatically apply 2.5D parallelism to all the layers from `colossalai.nn`.

Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0

class MLP(torch.nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        intermediate_dim = dim * 4
        self.dense_1 = col_nn.Linear(dim, intermediate_dim)
        print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
        self.activation = torch.nn.GELU()
        self.dense_2 = col_nn.Linear(intermediate_dim, dim)
        print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
        self.dropout = col_nn.Dropout(0.1)

    def forward(self, x):
        x = self.dense_1(x)
        print_rank_0(f'Output of the first linear layer: {x.shape}')
        x = self.activation(x)
        x = self.dense_2(x)
        print_rank_0(f'Output of the second linear layer: {x.shape}')
        x = self.dropout(x)
        return x
```
Launch Colossal-AI on 8 GPUs and build the model.
```python
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch(config=CONFIG,
                  rank=args.rank,
                  world_size=args.world_size,
                  local_rank=args.local_rank,
                  host=args.host,
                  port=args.port)

m = MLP()
```
We will see the shapes of the partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([128, 512])
Weight of the second linear layer: torch.Size([512, 128])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the partitioning of 2.5D parallelism, it becomes `[128, 512]` on each GPU.
Similarly, the second layer partitions the weight `[1024, 256]` into `[512, 128]`.

We can run the model with some random inputs.
```python
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.utils import get_current_device

x = torch.randn((16, 256), device=get_current_device())
# partition input
torch.distributed.broadcast(x, src=0)
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP)]
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL)]
x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW)]
print_rank_0(f'Input: {x.shape}')

x = m(x)
```
Then we can see the shapes of activation results.
```shell
Input: torch.Size([4, 128])
Output of the first linear layer: torch.Size([4, 512])
Output of the second linear layer: torch.Size([4, 128])
```
The activation tensors in 2.5D parallelism are all split by $d \times q$ in the row and $q$ in the column.
E.g. the output of the first linear layer has the shape `[4, 512]`, while the second layer has the output of `[4, 128]`.
Note that 2.5D parallelism uses the same partition method as 2D parallelism for weights; the difference lies in the partition of the input.
docs/source/en/features/3D_tensor_parallel.md
# 3D Tensor Parallelism

Author: Zhengda Bian, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
- [1D Tensor Parallelism](./1D_tensor_parallel.md)
- [2D Tensor Parallelism](./2D_tensor_parallel.md)

**Example Code**
- [ColossalAI-Examples - 3D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_3d.py)

**Related Paper**
- [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/pdf/2105.14450.pdf)

## Introduction

[3D tensor parallelism](https://arxiv.org/pdf/2105.14450.pdf) is an approach to parallelizing the computation of neural models that aims at the optimal communication cost.

Let's still take a linear layer $Y = XA$ as an example.
Given $P=q \times q \times q$ processors (necessary condition), e.g. $q=2$, we split the input $X$ and weight $A$ into

$$
\left[\begin{matrix}
X_{000} & X_{001} \\
X_{010} & X_{011} \\
X_{100} & X_{101} \\
X_{110} & X_{111} \end{matrix}
\right]
\text{~and~}
\left[\begin{matrix}
A_{000} & A_{001} & A_{010} & A_{011} \\
A_{100} & A_{101} & A_{110} & A_{111} \end{matrix}
\right]
\text{~respectively,}$$
where each $X_{ijl}$ and $A_{lji}$ are stored at processor $(i,j,l)$, as shown in the figure below.

<center>
<img src="https://s2.loli.net/2022/02/17/JevO6SED5z4PFdp.png" width = "200" height = "250" />
<img src="https://s2.loli.net/2022/02/17/qvtwjdfNXMAb4nF.png" width = "200" height = "250" />
<img src="https://s2.loli.net/2022/02/17/WFzm2N4IwKf1jXZ.png" width = "200" height = "250" />
<img src="https://s2.loli.net/2022/02/17/r2dZQ4hKxwTuIv6.png" width = "200" height = "250" />
</center>

Then we all-gather $X_{ijl}$ across $(i, 0...q,l)$, as well as $A_{lji}$ across $(0...q, j, l)$.
So, we have $X_{il}$ and $A_{lj}$ on each processor $(i,j,l)$ to get $X_{il}A_{lj}$.
Finally, we reduce-scatter the results across $(i, j, 0...q)$ to get $Y_{ijl}$, which forms
$$
Y=
\left[\begin{matrix}
Y_{000} & Y_{001} \\
Y_{010} & Y_{011} \\
Y_{100} & Y_{101} \\
Y_{110} & Y_{111} \end{matrix}
\right].
$$

We also need to note that in the backward pass, we need to all-gather the gradient $\dot{Y_{ijl}}$, and then reduce-scatter the gradients $\dot{X_{il}}=\dot{Y_{ij}}A_{lj}^T$ and $\dot{A_{lj}}=X_{il}^T\dot{Y_{ij}}$.
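Ignoring the communication, the computation every processor contributes to is $Y_{ij} = \sum_l X_{il}A_{lj}$, which a single-process einsum over a $q \times q$ grid of blocks reproduces exactly (the sum over the depth index $l$ plays the role of the reduce-scatter; this is a sketch of the math only, not of the distributed implementation):

```python
import torch

q, n = 2, 8
torch.manual_seed(0)
X = torch.randn(n, n, dtype=torch.float64)
A = torch.randn(n, n, dtype=torch.float64)

# Arrange the operands as q x q grids of blocks: Xb has block indices [i, l],
# Ab has block indices [l, j].
Xb = X.reshape(q, n // q, q, n // q).permute(0, 2, 1, 3)
Ab = A.reshape(q, n // q, q, n // q).permute(0, 2, 1, 3)

# Each processor (i, j, l) multiplies X_{il} by A_{lj}; summing over l below
# corresponds to the reduce-scatter of the partial results.
Yb = torch.einsum('ilab,ljbc->ijac', Xb, Ab)

Y = Yb.permute(0, 2, 1, 3).reshape(n, n)
assert torch.allclose(Y, X @ A)
```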
## Efficiency

Given $P=q \times q \times q$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 3D tensor parallelism.

| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/q^3)$ | $O(1/q^3)$ | $O(1/q^3)$ | $O(6(q-1)/q^3)$ | $O(6(q-1))$ |

## Usage

To enable 3D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,
    tensor=dict(size=8, mode='3d'),
))
```
Then Colossal-AI will automatically apply 3D parallelism to all the layers from `colossalai.nn`.

Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0

class MLP(torch.nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        intermediate_dim = dim * 4
        self.dense_1 = col_nn.Linear(dim, intermediate_dim)
        print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
        self.activation = torch.nn.GELU()
        self.dense_2 = col_nn.Linear(intermediate_dim, dim)
        print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
        self.dropout = col_nn.Dropout(0.1)

    def forward(self, x):
        x = self.dense_1(x)
        print_rank_0(f'Output of the first linear layer: {x.shape}')
        x = self.activation(x)
        x = self.dense_2(x)
        print_rank_0(f'Output of the second linear layer: {x.shape}')
        x = self.dropout(x)
        return x
```
Launch Colossal-AI on 8 GPUs and build the model.
```python
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch(config=CONFIG,
                  rank=args.rank,
                  world_size=args.world_size,
                  local_rank=args.local_rank,
                  host=args.host,
                  port=args.port)

m = MLP()
```
We will see the shapes of the partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([128, 256])
Weight of the second linear layer: torch.Size([512, 64])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the partitioning of 3D parallelism, it becomes `[128, 256]` on each GPU.
Similarly, the second layer partitions the weight `[1024, 256]` into `[512, 64]`.

We can run the model with some random inputs.
```python
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.utils import get_current_device

x = torch.randn((16, 256), device=get_current_device())
# partition input
torch.distributed.broadcast(x, src=0)
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_3D_WEIGHT)]
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_3D_INPUT)]
x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_3D_OUTPUT)]
print_rank_0(f'Input: {x.shape}')

x = m(x)
```
Then we can see the shapes of activation results.
```shell
Input: torch.Size([4, 128])
Output of the first linear layer: torch.Size([4, 512])
Output of the second linear layer: torch.Size([4, 128])
```
The activation tensors in 3D parallelism are all split by $q^2$ in the row and $q$ in the column.
E.g. the output of the first linear layer has the shape `[4, 512]`, while the second layer has the output of `[4, 128]`.
Note that although the results of 3D parallelism have the same shape as those of 2.5D parallelism for weights here, the content of each partition is different.
docs/source/en/features/gradient_accumulation.md
# Gradient Accumulation

Author: Shenggui Li, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)

**Example Code**
- [ColossalAI-Examples Gradient Accumulation](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)

## Introduction

Gradient accumulation is a common way to enlarge your batch size for training.
When training large-scale models, memory can easily become the bottleneck and the batch size can be very small (e.g. 2),
leading to unsatisfactory convergence. Gradient accumulation works by adding up the gradients calculated in multiple iterations,
and only updating the parameters at the preset iteration.
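In plain PyTorch the same idea looks roughly like the sketch below (a conceptual illustration, not Colossal-AI's internal implementation): gradients from several micro-batches are accumulated in `.grad` before a single optimizer step.

```python
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
accumulation_steps = 4

optimizer.zero_grad()
for step in range(8):
    data = torch.randn(2, 10)                    # a tiny micro-batch
    label = torch.randint(0, 2, (2,))
    loss = criterion(model(data), label) / accumulation_steps   # keep the effective loss scale
    loss.backward()                              # gradients are summed into .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                         # parameters are updated only here
        optimizer.zero_grad()
```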
## Usage

It is simple to use gradient accumulation in Colossal-AI. Just add the following configuration into your config file.
The integer represents the number of iterations over which gradients are accumulated.

```python
gradient_accumulation = <int>
```

## Hands-on Practice

We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
to demonstrate gradient accumulation. In this example, we set the gradient accumulation size to be 4. You can run the script using this command:

```shell
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
```

You will see output similar to the text below. This shows that gradients are indeed accumulated, as the parameters are not updated
in the first 3 steps but only in the last step.

```text
iteration 0, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 1, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 2, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 3, first 10 elements of param: tensor([-0.0141, 0.0464, 0.0507, 0.0321, 0.0356, -0.0150, 0.0172, -0.0118, 0.0222, 0.0473], device='cuda:0', grad_fn=<SliceBackward0>)
```
docs/source/en/features/gradient_clipping.md
# Gradient Clipping

Author: Boxiang Wang, Haichen Huang, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)

**Example Code**
- [ColossalAI-Examples Gradient Clipping](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)

**Related Paper**
- [On the difficulty of training Recurrent Neural Networks](https://arxiv.org/abs/1211.5063)

## Introduction

In order to speed up the training process and seek the global optimum for better performance, more and more learning
rate schedulers have been proposed. They control the learning rate to adjust the descent pace during training,
which works as expected only when the gradient vectors have a roughly uniform magnitude in every step.
Gradient clipping, a technique that rescales the gradient vector so that its norm stays within a preset bound,
therefore becomes indispensable for those who want better performance from their models.

You do not have to worry about implementing gradient clipping when using Colossal-AI; we support gradient
clipping in a powerful and convenient way. All you need is an additional line in your configuration file.

## Why you should use gradient clipping provided by Colossal-AI

The reason why we do not recommend users to write gradient clipping by themselves is that naive gradient clipping
may fail when tensor parallelism, pipeline parallelism or MoE is applied.

According to the illustration below, each GPU only owns a portion of the parameters of the weight in a linear layer.
To get the correct norm of the gradient of this weight, the gradient norms computed on every GPU
have to be combined (by summing their squares) across the devices.
A further complication is that the bias is distributed differently from the weight,
so a different communication group is needed for its sum operation.

(PS: This situation is an old version of 2D parallelism; the implementation in the code is not the same.
But it is a good example of the difficulty of unifying all the communication in gradient clipping.)

<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/KXiJPHt3Dum82cA.png"/>
<figcaption>Layout of parameters</figcaption>
</figure>
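The sketch below illustrates the point on a single process: the shards are simulated as a list of tensors and the all-reduce is replaced by a plain sum. Clipping with the per-shard norms gives a different scaling coefficient on every simulated device, whereas the correct coefficient needs the combined global norm first.

```python
import torch

torch.manual_seed(0)
full_grad = torch.randn(1024)
shards = list(full_grad.chunk(4))    # pretend each shard lives on a different GPU
max_norm = 1.0

# Correct: sum the squared shard norms first (the role of the all-reduce),
# then scale every shard with the same global coefficient.
global_norm = torch.sqrt(sum(s.pow(2).sum() for s in shards))
assert torch.isclose(global_norm, full_grad.norm())
global_coef = (max_norm / (global_norm + 1e-6)).clamp(max=1.0)

# Naive: clipping each shard by its own local norm yields different
# coefficients per shard and therefore a wrong overall result.
local_coefs = [(max_norm / (s.norm() + 1e-6)).clamp(max=1.0) for s in shards]
print(global_coef, local_coefs)
```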
Do not worry about it, since Colossal-AI has handled it for you.

### Usage
To use gradient clipping, you can simply add the gradient clipping norm to your configuration file.
```python
clip_grad_norm = 1.0
```

### Hands-On Practice

We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)
to demonstrate gradient clipping. In this example, we set the gradient clipping vector norm to be 1.0. You can run the script using this command:

```shell
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 train_with_engine.py
```
docs/source/en/features/gradient_handler.md
# Gradient Handler

Author: Shenggui Li, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)

**Example Code**
- [ColossalAI-Examples Gradient Handler](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)

## Introduction

In distributed training, gradient synchronization is required at the end of each iteration. This is important because we
need to make sure the parameters are updated with the same gradients on different machines so that the resulting parameters
are the same. This is often seen in data parallel training, as the model is replicated across data parallel ranks.

In Colossal-AI, we provide an interface for users to customize how they want to handle the synchronization. This brings
flexibility in cases such as implementing a new parallelism method.

When gradient handlers are used, PyTorch `DistributedDataParallel` will not be used, as it synchronizes automatically.

## Customize Your Gradient Handlers

To implement a customized gradient handler, you need to follow these steps.
1. inherit `BaseGradientHandler` in Colossal-AI.
2. register the gradient handler into the `GRADIENT_HANDLER` registry.
3. implement the `handle_gradient` method.

```python
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine.gradient_handler import BaseGradientHandler


@GRADIENT_HANDLER.register_module
class MyGradientHandler(BaseGradientHandler):

    def handle_gradient(self):
        # synchronize the gradients in your own way here,
        # e.g. all-reduce them over a custom process group
        do_something()
```

## Usage

To use a gradient handler, you need to specify your gradient handler in the config file. The gradient handler
will be automatically built and attached to the engine.

```python
gradient_handler = [dict(type='MyGradientHandler')]
```

### Hands-On Practice

We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)
to demonstrate the use of gradient handlers. In this example, we use `DataParallelGradientHandler` instead of PyTorch
`DistributedDataParallel` for data parallel training.

```shell
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py
```
docs/source/en/features/mixed_precision_training.md
# Auto Mixed Precision Training

Author: Chuanrui Wang, Shenggui Li, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)

**Example Code**
- [ColossalAI-Examples AMP](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp)

**Related Paper**
- [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)

## Introduction

AMP stands for automatic mixed precision training.
In Colossal-AI, we have incorporated different implementations of mixed precision training:

1. torch.cuda.amp
2. apex.amp
3. naive amp

| Colossal-AI | Tensor parallel support | Pipeline parallel support | fp16 extent |
| ----------- | ----------------------- | ------------------------- | ----------- |
| AMP_TYPE.TORCH | ✅ | ❌ | Model parameters, activations and gradients are downcast to fp16 during forward and backward propagation |
| AMP_TYPE.APEX | ❌ | ❌ | More fine-grained; we can choose opt_level O0, O1, O2, O3 |
| AMP_TYPE.NAIVE | ✅ | ✅ | Model parameters, forward and backward operations are all downcast to fp16 |

The first two rely on the original implementations of PyTorch (version 1.6 and above) and NVIDIA Apex.
The last method is similar to Apex O2 level.
Among these methods, Apex AMP is not compatible with tensor parallelism.
This is because tensors are split across devices in tensor parallelism, so communication among different processes is required to check if inf or nan occurs anywhere in the model weights.
We modified the torch AMP implementation so that it is now compatible with tensor parallelism.

> ❌️ fp16 and zero configuration are not compatible
>
> ⚠️ Pipeline only supports naive AMP currently

We recommend you use torch AMP, as it generally gives better accuracy than naive AMP if no pipeline is used.

## Table of Contents

In this tutorial we will cover:

1. AMP introduction
2. AMP in Colossal-AI
3. Hands-on Practice

## AMP Introduction

Automatic Mixed Precision training is a mixture of FP16 and FP32 training.

Half-precision floating point format (FP16) has lower arithmetic complexity and higher compute efficiency.
Besides, fp16 requires half of the storage needed by fp32 and saves memory & network bandwidth, which makes more memory
available for larger batch sizes and models.

However, there are other operations, like reductions, which require the dynamic range of fp32 to avoid numeric overflow/underflow. That's the reason why we introduce automatic mixed precision, which attempts to match each operation to its appropriate data type, reducing the memory footprint and improving training efficiency.

<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/URzLJ3MPeDQbtck.png"/>
<figcaption>Illustration of an ordinary AMP (figure from <a href="https://arxiv.org/abs/2108.05818">PatrickStar paper</a>)</figcaption>
</figure>
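Under the hood, this is essentially what torch AMP does: run the forward pass under an autocast context so that each op picks an appropriate dtype, and scale the loss so that small fp16 gradients do not underflow. A minimal plain-PyTorch sketch (independent of Colossal-AI) looks like this:

```python
import torch

model = torch.nn.Linear(256, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    data = torch.randn(32, 256, device='cuda')
    label = torch.randint(0, 10, (32,), device='cuda')

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():     # ops run in fp16 where safe, fp32 elsewhere
        loss = criterion(model(data), label)
    scaler.scale(loss).backward()       # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)              # unscales gradients and skips the step on inf/NaN
    scaler.update()                     # adjusts the scale factor for the next iteration
```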
## AMP in Colossal-AI

We support three AMP training methods and allow the user to train with AMP with no code change. You can simply add the `fp16`
configuration in your configuration file to use AMP.

```python
from colossalai.amp import AMP_TYPE

# use Torch AMP
fp16=dict(
    mode=AMP_TYPE.TORCH
)

# use naive AMP
fp16=dict(
    mode=AMP_TYPE.NAIVE
)

# use NVIDIA Apex AMP
fp16=dict(
    mode=AMP_TYPE.APEX
)
```

> These are the minimum configurations; the full configurations are stated in the sections below.

### AMP Modularity

The AMP module is designed to be completely modular and can be used independently.
If you wish to only use AMP in your code base without `colossalai.initialize`,
you can use `colossalai.amp.convert_to_amp`.

```python
from colossalai.amp import AMP_TYPE

# example of using torch amp
model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
                                                            optimizer,
                                                            criterion,
                                                            AMP_TYPE.TORCH)
```

### Torch AMP Configuration

```python
from colossalai.amp import AMP_TYPE

fp16=dict(
    mode=AMP_TYPE.TORCH,

    # below are default values for grad scaler
    init_scale=2.**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
    enabled=True
)
```
With optional arguments:
- init_scale(float, optional, default=2.**16): Initial scale factor
- growth_factor(float, optional, default=2.0): Factor by which the scale is multiplied during `update` if no inf/NaN gradients occur for ``growth_interval`` consecutive iterations.
- backoff_factor(float, optional, default=0.5): Factor by which the scale is multiplied during `update` if inf/NaN gradients occur in an iteration.
- growth_interval(int, optional, default=2000): Number of consecutive iterations without inf/NaN gradients that must occur for the scale to be multiplied by ``growth_factor``.
- enabled(bool, optional, default=True): If ``False``, disables gradient scaling. `step` simply invokes the underlying ``optimizer.step()``, and other methods become no-ops.

### Apex AMP Configuration

For this mode, we rely on the Apex implementation for mixed precision training.
We support this plugin because it allows for finer control on the granularity of mixed precision.
For example, O2 level (optimization level 2) will keep batch normalization in fp32.

If you are looking for more details, please refer to the [Apex Documentation](https://nvidia.github.io/apex/).

```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.APEX,

    # below are the default values
    enabled=True,
    opt_level='O1',
    cast_model_type=None,
    patch_torch_functions=None,
    keep_batchnorm_fp32=None,
    master_weights=None,
    loss_scale=None,
    cast_model_outputs=None,
    num_losses=1,
    verbosity=1,
    min_loss_scale=None,
    max_loss_scale=16777216.0
)
```

Parameters:
- enabled(bool, optional, default=True): If False, renders all AMP calls no-ops, so your script should run as if Amp were not present.

- opt_level(str, optional, default="O1"): Pure or mixed precision optimization level.
Accepted values are "O0", "O1", "O2", and "O3", explained in detail in the Apex AMP documentation.

- num_losses(int, optional, default=1): Option to tell AMP in advance how many losses/backward passes you plan to use.
When used in conjunction with the loss_id argument to `amp.scale_loss`, enables Amp to use a different loss scale per
loss/backward pass, which can improve stability. If num_losses is left to 1, Amp will still support multiple
losses/backward passes, but use a single global loss scale for all of them.

- verbosity(int, default=1): Set to 0 to suppress Amp-related output.

- min_loss_scale(float, default=None): Sets a floor for the loss scale values that can be chosen by dynamic loss scaling.
The default value of None means that no floor is imposed. If dynamic loss scaling is not used, min_loss_scale is ignored.

- max_loss_scale(float, default=2.**24): Sets a ceiling for the loss scale values that can be chosen by dynamic loss
scaling. If dynamic loss scaling is not used, max_loss_scale is ignored.

Currently, the under-the-hood properties that govern pure or mixed precision training are the following:
cast_model_type, patch_torch_functions, keep_batchnorm_fp32, master_weights, loss_scale.
They are optional properties that override the defaults once opt_level is determined:

- cast_model_type: Casts your model's parameters and buffers to the desired type.
- patch_torch_functions: Patch all Torch functions and Tensor methods to perform Tensor Core-friendly ops like GEMMs and convolutions in FP16, and any ops that benefit from FP32 precision in FP32.
- keep_batchnorm_fp32: To enhance precision and enable cudnn batchnorm (which improves performance), it's often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
- master_weights: Maintain FP32 master weights to accompany any FP16 model weights. FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.
- loss_scale: If loss_scale is a float value, use this value as the static (fixed) loss scale. If loss_scale is the string "dynamic", adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.

### Naive AMP Configuration

In Naive AMP mode, we achieve mixed precision training while maintaining compatibility with complex tensor and pipeline parallelism.
This AMP mode will cast all operations into fp16.
The following code block shows the `config.py` file for this mode.

```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.NAIVE,

    # below are the default values
    log_num_zeros_in_grad=False,
    initial_scale=2 ** 32,
    min_scale=1,
    growth_factor=2,
    backoff_factor=0.5,
    growth_interval=1000,
    hysteresis=2
)
```

The default parameters of Naive AMP:
- log_num_zeros_in_grad(bool): return number of zeros in the gradients.
- initial_scale(int): initial scale of gradient scaler
- growth_factor(int): the growth rate of loss scale
- backoff_factor(float): the decrease rate of loss scale
- hysteresis(int): delay shift in dynamic loss scaling
- max_scale(int): maximum loss scale allowed
- verbose(bool): if set to `True`, will print debug info

When using `colossalai.initialize`, you are required to first instantiate a model, an optimizer and a criterion.
The output model is converted to an AMP model with smaller memory consumption.
If your input model is already too large to fit in a GPU, please instantiate your model weights with `dtype=torch.float16`.
Otherwise, try smaller models or check out more parallel training techniques!
## Hands-on Practice

We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp) which demonstrates
the use of AMP with Colossal-AI. In this practice, we will use Torch AMP as an example, but do note that config files are provided for all AMP modes.

### Step 1. Create a config file

Create a `config.py` and add the `fp16` configuration.

```python
# in config.py
from colossalai.amp import AMP_TYPE

BATCH_SIZE = 128
DROP_RATE = 0.1
NUM_EPOCHS = 300

fp16 = dict(
    mode=AMP_TYPE.TORCH,
)

clip_grad_norm = 1.0
```

### Step 2. Import libraries in train_with_engine.py

Create a `train_with_engine.py` and import the necessary dependencies. Remember to install `scipy` and `timm` by running
`pip install timm scipy`.

```python
import os
import colossalai
import torch
from pathlib import Path
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.utils import get_dataloader
from colossalai.trainer import Trainer, hooks
from colossalai.nn.lr_scheduler import LinearWarmupLR
from timm.models import vit_base_patch16_224
from torchvision import datasets, transforms
```

### Step 3. Initialize Distributed Environment

We then need to initialize the distributed environment. For demo purposes, we use `launch_from_torch`. You can refer to [Launch Colossal-AI](../basics/launch_colossalai.md)
for other initialization methods.

```python
# initialize distributed setting
parser = colossalai.get_default_parser()
args = parser.parse_args()

# launch from torch
colossalai.launch_from_torch(config=args.config)
```

### Step 4. Create training components

Build your model, optimizer, loss function, lr scheduler and dataloaders. Note that the root path of the dataset is
obtained from the environment variable `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
to a path on your machine. Data will be automatically downloaded to the root path.

```python
# build model
model = vit_base_patch16_224(drop_rate=0.1)

# build dataloader
train_dataset = datasets.Caltech101(
    root=Path(os.environ['DATA']),
    download=True,
    transform=transforms.Compose([
        transforms.Resize(256),
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        Gray2RGB(),  # custom transform from the example script that converts grayscale images to 3-channel RGB
        transforms.Normalize([0.5, 0.5, 0.5],
                             [0.5, 0.5, 0.5])
    ]))

train_dataloader = get_dataloader(dataset=train_dataset,
                                  shuffle=True,
                                  batch_size=gpc.config.BATCH_SIZE,
                                  num_workers=1,
                                  pin_memory=True,
                                  )

# build optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=0.1)

# build loss
criterion = torch.nn.CrossEntropyLoss()

# lr scheduler
lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=gpc.config.NUM_EPOCHS)
```

### Step 5. Inject AMP Feature

Call `colossalai.initialize` to convert the training components to be running with FP16.

```python
engine, train_dataloader, _, _ = colossalai.initialize(
    model, optimizer, criterion, train_dataloader,
)
```

### Step 6. Train with Engine

Use the engine in a normal training loop.

```python
engine.train()
for epoch in range(gpc.config.NUM_EPOCHS):
    for img, label in train_dataloader:  # iterate over the dataloader directly
        img = img.cuda()
        label = label.cuda()
        engine.zero_grad()
        output = engine(img)
        loss = engine.criterion(output, label)
        engine.backward(loss)
        engine.step()
        lr_scheduler.step()
```

### Step 7. Invoke Training Scripts

Use the following command to start the training scripts. You can change `--nproc_per_node` to use a different number of GPUs.

```shell
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py --config config/config_AMP_torch.py
```
docs/source/en/features/nvme_offload.md
# NVMe offload

Author: Hongxin Liu

**Prerequisite:**
- [Zero Redundancy Optimizer with chunk-based memory management](../features/zero_with_chunk.md)

## Introduction

If a model has `N` parameters, when using Adam, it has `8N` bytes of optimizer states (the fp32 momentum and variance of each parameter). For billion-scale models, optimizer states take at least 32 GB of memory. GPU memory limits the model scale we can train, which is called the GPU memory wall. If we offload optimizer states to disk, we can break through the GPU memory wall.

We implement a user-friendly and efficient asynchronous Tensor I/O library: [TensorNVMe](https://github.com/hpcaitech/TensorNVMe). With this library, we can easily implement NVMe offload.

> This library is compatible with all kinds of disks (HDD, SATA SSD, and NVMe SSD). As the I/O bandwidth of an HDD or SATA SSD is low, it's recommended to use this library only on NVMe disks.

When optimizing a parameter, we can divide the optimization process into three stages: read, compute and offload. We perform the optimization process in a pipelined fashion, which can overlap computation and I/O.

<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/08/16/CvRnowrsNyB4hza.jpg"/>
<figcaption>Optimization process</figcaption>
</figure>
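The benefit comes from overlapping the three stages: while one group of parameters is being updated, the states of the next group are already being read back and those of the previous group are being written out. A rough, library-agnostic sketch of this prefetch-compute-offload pattern with a background I/O thread pool (illustrative only; TensorNVMe's actual API differs) could look like:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def read_states(group):      # stand-in for an asynchronous disk read
    time.sleep(0.01)
    return f'states of group {group}'

def write_states(group):     # stand-in for an asynchronous disk write
    time.sleep(0.01)

def update(group, states):   # stand-in for the actual optimizer computation
    time.sleep(0.01)

groups = list(range(8))
with ThreadPoolExecutor(max_workers=2) as io:
    next_read = io.submit(read_states, groups[0])
    for i, group in enumerate(groups):
        states = next_read.result()                            # wait for this group's states
        if i + 1 < len(groups):
            next_read = io.submit(read_states, groups[i + 1])  # prefetch the next group
        update(group, states)                                  # compute overlaps the prefetch
        io.submit(write_states, group)                         # offload in the background
```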
## Usage

First, please make sure you installed [TensorNVMe](https://github.com/hpcaitech/TensorNVMe):

```shell
pip install packaging
pip install tensornvme
```

We implement NVMe offload of optimizer states for Adam ([CPUAdam](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.nn.optimizer.cpu_adam.html) and [HybridAdam](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.nn.optimizer.hybrid_adam.html)).

```python
from colossalai.nn.optimizer import CPUAdam, HybridAdam

optimizer = HybridAdam(model.parameters(), lr=1e-3, nvme_offload_fraction=1.0, nvme_offload_dir='./')
```

`nvme_offload_fraction` is the fraction of optimizer states to be offloaded to NVMe. `nvme_offload_dir` is the directory to save NVMe offload files. If `nvme_offload_dir` is `None`, a random temporary directory will be used.

It's compatible with all parallel methods in ColossalAI.
docs/source/en/features/pipeline_parallel.md
# Pipeline Parallel
|
||||
|
||||
Author: Guangyang Lu, Hongxin Liu, Yongbin Li
|
||||
|
||||
**Prerequisite**
|
||||
- [Define Your Configuration](../basics/define_your_config.md)
|
||||
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
|
||||
- [Configure Parallelization](../basics/configure_parallelization.md)
|
||||
|
||||
**Example Code**
|
||||
- [ColossalAI-Examples ResNet with pipeline](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/pipeline_parallel)
|
||||
|
||||
**Related Paper**
|
||||
- [Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training](https://arxiv.org/abs/2110.14883)
|
||||
- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473)
|
||||
- [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965)
|
||||
|
||||
## Quick introduction
|
||||
|
||||
In this tutorial, you will learn how to use pipeline parallel. In Colossal-AI, we use 1F1B pipeline, introduced by Nvidia. In this case, ViT and Imagenet are too large to use. Therefore, here we use ResNet and Cifar as example.
|
||||
|
||||
## Table Of Content
|
||||
|
||||
In this tutorial we will cover:
|
||||
|
||||
1. Introduction of 1F1B pipeline.
|
||||
2. Usage of non-interleaved and interleaved schedule.
|
||||
3. Training ResNet with pipeline.
|
||||
|
||||
## Introduction of 1F1B pipeline

First of all, let us introduce GPipe for a better understanding of 1F1B.

<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/OAucPF6mWYynUtV.png"/>
<figcaption>Figure 1: GPipe. This figure is from the <a href="https://arxiv.org/pdf/2104.04473.pdf">Megatron-LM</a> paper.</figcaption>
</figure>

As you can see, in GPipe the backward passes are executed only after the forward passes of all microbatches in a batch have finished.

In general, 1F1B (one forward pass followed by one backward pass) is more efficient than GPipe (in memory, or in both memory and time). There are two schedules of the 1F1B pipeline, the non-interleaved and the interleaved, shown in the figures below.

<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/iJrVkp2HLcahjsT.png"/>
<figcaption>Figure 2: This figure is from the <a href="https://arxiv.org/pdf/2104.04473.pdf">Megatron-LM</a> paper. The top part shows the default non-interleaved schedule, and the bottom part shows the interleaved schedule.</figcaption>
</figure>

### Non-interleaved Schedule

The non-interleaved schedule can be divided into three phases. The first is the warm-up phase, where workers perform differing numbers of forward passes. In the following phase, each worker performs one forward pass followed by one backward pass. In the last phase, workers finish the remaining backward passes.

This mode is more memory-efficient than GPipe. However, it takes the same amount of time to complete a round of passes as GPipe.

### Interleaved Schedule

This schedule requires **the number of microbatches to be an integer multiple of the number of pipeline stages**.

In this schedule, each device computes multiple subsets of layers (called model chunks) instead of a single contiguous set of layers. For example, where device 1 previously held layers 1-4 and device 2 held layers 5-8, now device 1 holds layers 1, 2, 9, 10, device 2 holds layers 3, 4, 11, 12, and so on. With this scheme, each device in the pipeline is assigned multiple pipeline stages, and each pipeline stage involves less computation, as the small sketch below illustrates.

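The sketch below only illustrates the assignment rule described above and is not part of the Colossal-AI API; the helper name `assign_layers` and the concrete sizes are made up for the example.

```python
def assign_layers(num_layers: int, num_stages: int, num_chunks: int) -> dict:
    """Toy sketch of the interleaved assignment: each device (pipeline stage)
    receives num_chunks non-contiguous blocks of consecutive layers."""
    layers_per_block = num_layers // (num_stages * num_chunks)
    assignment = {stage: [] for stage in range(num_stages)}
    for chunk in range(num_chunks):
        for stage in range(num_stages):
            start = (chunk * num_stages + stage) * layers_per_block
            assignment[stage].extend(range(start, start + layers_per_block))
    return assignment

# 16 layers, 4 devices, 2 model chunks per device (0-based layer indices):
# {0: [0, 1, 8, 9], 1: [2, 3, 10, 11], 2: [4, 5, 12, 13], 3: [6, 7, 14, 15]}
print(assign_layers(num_layers=16, num_stages=4, num_chunks=2))
```
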
This mode is both memory-efficient and time-efficient.

## Usage of non-interleaved and interleaved schedule

In Colossal-AI, we provide both the non-interleaved schedule (as `PipelineSchedule`) and the interleaved schedule (as `InterleavedPipelineSchedule`).

You just need to set `NUM_MICRO_BATCHES` in the config file, and additionally set `NUM_CHUNKS` if you want to use the interleaved schedule. If you know for certain the shape of each pipeline stage's output tensor, and the shapes are all the same, you can set `TENSOR_SHAPE` in the config file to further reduce communication. Otherwise, you can simply omit `tensor_shape`, and the shapes will be exchanged across pipeline stages automatically. Colossal-AI will then generate an appropriate schedule for you.

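As a rough configuration sketch for 2 pipeline stages (the concrete values are illustrative, and the comments mark our assumptions rather than definitive behaviour):

```python
NUM_MICRO_BATCHES = 4

CONFIG = dict(
    NUM_MICRO_BATCHES=NUM_MICRO_BATCHES,
    # Assumption: NUM_CHUNKS > 1 selects the interleaved schedule (InterleavedPipelineSchedule);
    # leaving it at 1 (or omitting it) keeps the non-interleaved PipelineSchedule.
    NUM_CHUNKS=2,
    # Optional: the identical output shape of every pipeline stage, e.g.
    # (micro-batch size, sequence length, hidden size) for a transformer.
    # TENSOR_SHAPE=(4, 128, 1024),
    parallel=dict(pipeline=2),
)
```
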
## Training ResNet with pipeline

Let's build the `ResNet` model first with Colossal-AI's `PipelinableContext`:
```python
import os
from typing import Callable, List, Optional, Type, Union
import torch
import torch.nn as nn
import colossalai
import colossalai.nn as col_nn

from colossalai.core import global_context as gpc
from colossalai.logging import disable_existing_loggers, get_dist_logger
from colossalai.trainer import Trainer, hooks
from colossalai.utils import MultiTimer, get_dataloader
from colossalai.context import ParallelMode
from colossalai.pipeline.pipelinable import PipelinableContext

from titans.dataloader.cifar10 import build_cifar
from torchvision.models import resnet50
from torchvision.models.resnet import BasicBlock, Bottleneck, conv1x1

# Define some configs
BATCH_SIZE = 64
NUM_EPOCHS = 2
NUM_CHUNKS = 1
CONFIG = dict(NUM_MICRO_BATCHES=4, parallel=dict(pipeline=2))

# Train
disable_existing_loggers()
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch_from_torch(backend=args.backend, config=CONFIG)
logger = get_dist_logger()
pipelinable = PipelinableContext()

# build model
with pipelinable:
    model = resnet50()
```

Define an execution sequence.
```python
exec_seq = [
    'conv1', 'bn1', 'relu', 'maxpool', 'layer1', 'layer2', 'layer3', 'layer4', 'avgpool',
    (lambda x: torch.flatten(x, 1), "behind"), 'fc'
]
pipelinable.to_layer_list(exec_seq)
```

Partition the model into pipeline stages.
```python
model = pipelinable.partition(NUM_CHUNKS, gpc.pipeline_parallel_size, gpc.get_local_rank(ParallelMode.PIPELINE))
```

In this tutorial, we use `Trainer` to train `ResNet`:
```python
# build criterion
criterion = nn.CrossEntropyLoss()

# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# build dataloader
root = os.environ.get('DATA', './data')
train_dataloader, test_dataloader = build_cifar(BATCH_SIZE, root, padding=4, crop=32, resize=32)

lr_scheduler = col_nn.lr_scheduler.LinearWarmupLR(optimizer, NUM_EPOCHS, warmup_steps=1)
engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model, optimizer, criterion,
                                                                                train_dataloader, test_dataloader,
                                                                                lr_scheduler)
timer = MultiTimer()

trainer = Trainer(engine=engine, timer=timer, logger=logger)

hook_list = [
    hooks.LossHook(),
    hooks.AccuracyHook(col_nn.metric.Accuracy()),
    hooks.LogMetricByEpochHook(logger),
    hooks.LRSchedulerHook(lr_scheduler, by_epoch=True)
]

trainer.fit(train_dataloader=train_dataloader,
            epochs=NUM_EPOCHS,
            test_dataloader=test_dataloader,
            test_interval=1,
            hooks=hook_list,
            display_progress=True)
```

We use `2` pipeline stages and the batch is split into `4` micro-batches.

262
docs/source/en/features/zero_with_chunk.md
Normal file
@@ -0,0 +1,262 @@

# Zero Redundancy Optimizer with chunk-based memory management

Author: [Hongxiu Liu](https://github.com/ver217), [Jiarui Fang](https://github.com/feifeibear), [Zijian Ye](https://github.com/ZijianYY)

**Prerequisite:**
- [Define Your Configuration](../basics/define_your_config.md)

**Example Code**
- [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt)

**Related Paper**
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
- [PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management](https://arxiv.org/abs/2108.05818)

## Introduction

The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) instead of replicating them.
By doing so, memory efficiency is boosted drastically compared to classic data parallelism, while the computational granularity and communication efficiency are retained.

1. **Shard Optimizer States**: The optimizer states (e.g., for the [Adam optimizer](https://arxiv.org/abs/1412.6980), the 32-bit weights and the first and second moment estimates) are partitioned across the processes, so that each process updates only its own partition, as the sketch after this list illustrates.

2. **Shard Gradients**: After reduction inside the data-parallel process group, gradient tensors are also partitioned such that each process only stores the gradients corresponding to its partition of the optimizer states. Note that Colossal-AI converts gradients to fp32 format to participate in the parameter update.

3. **Shard Parameters**: The 16-bit model parameters are partitioned across the processes of a data-parallel group.

4. **[Gemini](../advanced_tutorials/meet_gemini.md)**: A dynamic heterogeneous memory space manager for parameters, gradients and optimizer states.

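To make the first point concrete, here is a toy, single-process sketch of how the optimizer states of a flattened fp16 parameter tensor could be partitioned across `world_size` data-parallel ranks. It only illustrates the idea and is not the Colossal-AI implementation.

```python
import torch

def shard_optimizer_states(flat_params: torch.Tensor, world_size: int, rank: int) -> dict:
    """Each rank keeps fp32 master weights and Adam moments only for its own 1/world_size slice."""
    shard_size = (flat_params.numel() + world_size - 1) // world_size
    start = rank * shard_size
    end = min(start + shard_size, flat_params.numel())
    my_slice = flat_params[start:end]
    return {
        "master_weights": my_slice.float().clone(),                         # 32-bit weights
        "exp_avg": torch.zeros(my_slice.numel(), dtype=torch.float32),      # first moment
        "exp_avg_sq": torch.zeros(my_slice.numel(), dtype=torch.float32),   # second moment
    }

flat_params = torch.randn(10_000, dtype=torch.float16)
rank0_states = shard_optimizer_states(flat_params, world_size=4, rank=0)
```
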
In addition, this article will introduce the Zero Redundancy Optimizer with chunk-based memory management.

When using ZeRO, we distribute the model by sharding its parameters. The advantage of this method is that the memory load of each node is balanced. However, this approach has two significant disadvantages. First, during communication a temporary memory buffer has to be allocated and released afterwards, which leads to memory fragmentation. Second, using individual tensors as the granularity of communication leaves the network bandwidth underutilized; generally, the longer the transmitted message, the higher the bandwidth utilization.

Using the chunk mechanism introduced in Colossal-AI v0.1.8, we can improve the efficiency of ZeRO. We store a contiguous set of parameters, in initialization order, into a chunk (a chunk is a contiguous memory space), and every chunk has the same size. Organizing memory in chunks leads to efficient use of network bandwidth between PCI-e and GPU-GPU, reduces the number of communication operations, and avoids potential memory fragmentation, as the toy sketch below illustrates.

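The following sketch only illustrates the packing idea behind chunks and is not the Colossal-AI API; the helper `pack_into_chunks` and the chunk size are made up for the example.

```python
import torch

def pack_into_chunks(params, chunk_numel: int):
    """Pack parameter tensors, in registration order, into fixed-size flat buffers ("chunks").
    Assumes every single parameter fits into one chunk."""
    chunks = [torch.zeros(chunk_numel)]
    used = 0
    for p in params:
        n = p.numel()
        if used + n > chunk_numel:           # current chunk is full, open a new one
            chunks.append(torch.zeros(chunk_numel))
            used = 0
        chunks[-1][used:used + n] = p.detach().reshape(-1)
        used += n
    return chunks

model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.Linear(256, 256))
# Pack all parameters into chunks of 64 * 1024 fp32 elements (256 KB each);
# each later communication can then move one whole chunk instead of many small tensors.
chunks = pack_into_chunks(model.parameters(), chunk_numel=64 * 1024)
```
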
Before v0.1.8, ZeRO had a high communication cost for parameter communication. If a parameter was used multiple times by several consecutive operators, there would be repeated communication operations, which seriously hurt efficiency. This situation is very common when using the gradient checkpointing technique, where the forward propagation of a checkpointed block is recomputed during backward propagation.

Taking GPT as an example, checkpointing is applied to each GPT block, and each GPT block contains a self-attention layer and an MLP layer. During the backward pass, the forward of the self-attention layer and the MLP layer is recomputed in turn, and then the backward of the MLP layer and the self-attention layer is computed in turn.

In addition, because of the communication and memory movement of small tensors, the bandwidth of NVLink and PCI-e cannot be fully utilized, and every communication and memory movement incurs kernel-launch overhead. With chunks, many small tensor communications and memory movements are merged into one large tensor communication and memory movement, which not only improves bandwidth utilization but also reduces kernel-launch overhead.

We also provide a lightweight chunk search mechanism to help users automatically find the chunk size with the smallest memory fragmentation.

## Usage

### GeminiDDP

We use `GeminiDDP` to run ZeRO with chunk-based memory management. It is our new `torch.nn.Module` wrapper which combines ZeRO-DP and Gemini: ZeRO handles parallelism and Gemini handles memory management.

Also make sure that your model is initialized within the `ColoInitContext` context.

```python
with ColoInitContext(device='cpu', default_dist_spec=default_dist_spec, default_pg=default_pg):
    model = gpt2_medium(checkpoint=True)
```

Create the chunk manager and Gemini manager as follows:

```python
chunk_manager = init_chunk_manager(model=module,
                                   init_device=device,
                                   hidden_dim=hidden_dim,
                                   search_range_mb=search_range_mb,
                                   min_chunk_size_mb=min_chunk_size_mb)
gemini_manager = GeminiManager(placement_policy, chunk_manager)
```

`hidden_dim` is the hidden dimension of the DNN. Users can provide this argument to speed up the search; it is fine not to know it before training, in which case a default value of 1024 is used. `min_chunk_size_mb` is the minimum chunk size in megabytes. If the aggregate size of parameters is smaller than the minimum chunk size, all parameters will be compacted into one small chunk.

Initialize the optimizer.
```python
optimizer = GeminiAdamOptimizer(model, lr=1e-3, initial_scale=2**5)
```

Train your model as usual.
```python
optimizer.zero_grad()
outputs = model(input_ids, attn_mask)
loss = criterion(outputs, input_ids)
optimizer.backward(loss)
optimizer.step()
```
> ⚠️ Note: Please do not use `loss.backward()`; the standard way to write it is `optimizer.backward(loss)`.

### Train GPT

In this example, we use Hugging Face Transformers; you have to install `transformers` before running this example. We take GPT2-Medium as the example here.

For simplicity, we just use randomly generated data here.

First, we only need to import `GPT2LMHeadModel` from Hugging Face Transformers to define our model; users do not need to define or modify the model themselves, which makes it more convenient to use.

```python
class GPTLMModel(nn.Module):

    def __init__(self,
                 hidden_size=768,
                 num_layers=12,
                 num_attention_heads=12,
                 max_seq_len=1024,
                 vocab_size=50257,
                 checkpoint=False):
        super().__init__()
        self.checkpoint = checkpoint
        self.model = GPT2LMHeadModel(
            GPT2Config(n_embd=hidden_size,
                       n_layer=num_layers,
                       n_head=num_attention_heads,
                       n_positions=max_seq_len,
                       n_ctx=max_seq_len,
                       vocab_size=vocab_size))
        if checkpoint:
            self.model.gradient_checkpointing_enable()

    def forward(self, input_ids, attention_mask):
        # only return lm_logits; disable the kv-cache when gradient checkpointing is enabled
        return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]


def gpt2_medium(checkpoint=False):
    return GPTLMModel(hidden_size=1024, num_layers=24, num_attention_heads=16, checkpoint=checkpoint)
```

Define our loss function:

```python
class GPTLMLoss(nn.Module):

    def __init__(self):
        super().__init__()
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, logits, labels):
        # standard causal LM loss: predict token t+1 from tokens up to t
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        return self.loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
```

Define tensor parallel and parameter sharding strategies for tensor parallelism:

```python
def tensor_parallelize(model: torch.nn.Module, pg: ProcessGroup):
    for mn, module in model.named_modules():
        for pn, param in module.named_parameters(recurse=False):
            if hasattr(param, 'visited'):
                continue
            param.set_dist_spec(ReplicaSpec())
            if 'mlp.c_fc' in mn:
                if 'weight' in pn or 'bias' in pn:
                    split_param_col_tp1d(param, pg)
                    param.compute_spec.set_output_replicate(False)
                else:
                    param.set_dist_spec(ReplicaSpec())
            elif 'mlp.c_proj' in mn:
                if 'weight' in pn:
                    split_param_row_tp1d(param, pg)
                else:
                    param.set_dist_spec(ReplicaSpec())
            elif 'wte' in mn or 'wpe' in mn:
                split_param_col_tp1d(param, pg)
            elif 'c_attn' in mn or 'c_proj' in mn:
                split_param_col_tp1d(param, pg)
            else:
                param.set_dist_spec(ReplicaSpec())

            param.visited = True


def split_param_single_dim_tp1d(dim: int, param: ColoParameter, pg: ProcessGroup):
    spec = (ShardSpec([dim], [pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
    param.set_tensor_spec(*spec)


def split_param_row_tp1d(param: ColoParameter, pg: ProcessGroup):
    split_param_single_dim_tp1d(0, param, pg)


def split_param_col_tp1d(param: ColoParameter, pg: ProcessGroup):
    split_param_single_dim_tp1d(-1, param, pg)
```

Define a model which uses Gemini + ZeRO DDP:

```python
def gemini_zero_dpp(model: torch.nn.Module, pg: ProcessGroup, placement_policy: str = "auto"):
    cai_version = colossalai.__version__
    if version.parse(cai_version) > version.parse("0.1.10"):
        from colossalai.nn.parallel import GeminiDDP
        model = GeminiDDP(model,
                          device=get_current_device(),
                          placement_policy=placement_policy,
                          pin_memory=True,
                          search_range_mb=32)
    elif version.parse(cai_version) <= version.parse("0.1.10") and version.parse(cai_version) >= version.parse("0.1.9"):
        from colossalai.gemini import ChunkManager, GeminiManager
        chunk_size = ChunkManager.search_chunk_size(model, 64 * 1024**2, 32)
        # the chunk manager must be created before the Gemini manager that consumes it
        chunk_manager = ChunkManager(chunk_size,
                                     pg,
                                     enable_distributed_storage=True,
                                     init_device=GeminiManager.get_default_device(placement_policy))
        gemini_manager = GeminiManager(placement_policy, chunk_manager)
        model = ZeroDDP(model, gemini_manager)
    else:
        raise NotImplementedError(f"CAI version {cai_version} is not supported")
    return model
```

As we pre-train GPT in this example, we just use a simple language model loss.

Write a function to get random inputs:

```python
def get_data(batch_size, seq_len, vocab_size):
    input_ids = torch.randint(0, vocab_size, (batch_size, seq_len), device=torch.cuda.current_device())
    attention_mask = torch.ones_like(input_ids)
    return input_ids, attention_mask
```

Finally, we can define our training loop:

```python
def main():
    args = parse_args()
    BATCH_SIZE = 8
    SEQ_LEN = 1024
    VOCAB_SIZE = 50257
    NUM_STEPS = 10
    colossalai.launch_from_torch(config={})

    # build criterion
    criterion = GPTLMLoss()

    torch.manual_seed(123)
    default_pg = ProcessGroup(tp_degree=args.tp_degree)
    default_dist_spec = ShardSpec([-1], [args.tp_degree]) if args.shardinit else None
    # build GPT model
    with ColoInitContext(device='cpu', default_dist_spec=default_dist_spec, default_pg=default_pg):
        model = gpt2_medium(checkpoint=True)
    pg = default_pg
    # Tensor Parallelism (TP)
    tensor_parallelize(model, pg)
    # Gemini + ZeRO DP; note that it must be applied after TP
    model = gemini_zero_dpp(model, pg, args.placement)
    # build optimizer
    optimizer = GeminiAdamOptimizer(model, lr=1e-3, initial_scale=2**5)
    numel = sum([p.numel() for p in model.parameters()])
    get_tflops_func = partial(get_tflops, numel, BATCH_SIZE, SEQ_LEN)
    torch.cuda.synchronize()
    model.train()
    for n in range(NUM_STEPS):
        # we just use randomly generated data here
        input_ids, attn_mask = get_data(BATCH_SIZE, SEQ_LEN, VOCAB_SIZE)
        optimizer.zero_grad()
        outputs = model(input_ids, attn_mask)
        loss = criterion(outputs, input_ids)
        optimizer.backward(loss)
        optimizer.step()

    torch.cuda.synchronize()
```
> ⚠️ Note: If you want to use the Gemini module, please do not use the [Gradient Accumulation](../features/gradient_accumulation.md) feature mentioned before.

The complete example can be found on [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt).