Mirror of https://github.com/hpcaitech/ColossalAI.git (synced 2025-09-13 13:11:05 +00:00)
[doc] clean up outdated docs (#4765)
* [doc] clean up outdated docs
* [doc] fix linking
* [doc] fix linking
@@ -2,10 +2,6 @@
Author: Zhengda Bian, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)

**Example Code**
- [Tensor Parallelism with Shardformer](https://github.com/hpcaitech/ColossalAI/tree/main/colossalai/shardformer/examples)
@@ -3,8 +3,6 @@
Author: Zhengda Bian, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
- [1D Tensor Parallelism](./1D_tensor_parallel.md)

**Example Code**
@@ -3,8 +3,6 @@
Author: Zhengda Bian, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
- [1D Tensor Parallelism](./1D_tensor_parallel.md)
- [2D Tensor Parallelism](./2D_tensor_parallel.md)
@@ -3,8 +3,6 @@
Author: Zhengda Bian, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
- [1D Tensor Parallelism](./1D_tensor_parallel.md)
- [2D Tensor Parallelism](./2D_tensor_parallel.md)
@@ -1,47 +0,0 @@
# Gradient Accumulation (Outdated)

Author: Shenggui Li, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)

**Example Code**
- [ColossalAI-Examples Gradient Accumulation](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)

## Introduction

Gradient accumulation is a common way to enlarge your effective batch size for training.
When training large-scale models, memory can easily become the bottleneck and the batch size may have to be very small (e.g. 2),
leading to unsatisfactory convergence. Gradient accumulation works by adding up the gradients computed over multiple iterations
and only updating the parameters at the preset interval.
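Conceptually, the engine does the following for you; here is a minimal plain-PyTorch sketch of the same idea (assuming a generic `model`, `optimizer`, `criterion` and `train_dataloader`, which are not part of this document):

```python
ACCUMULATION_SIZE = 4  # number of iterations to accumulate over

optimizer.zero_grad()
for idx, (img, label) in enumerate(train_dataloader):
    output = model(img.cuda())
    # divide the loss so the accumulated gradient matches a large-batch gradient
    loss = criterion(output, label.cuda()) / ACCUMULATION_SIZE
    loss.backward()                      # gradients are summed into .grad
    if (idx + 1) % ACCUMULATION_SIZE == 0:
        optimizer.step()                 # parameters are updated only every N iterations
        optimizer.zero_grad()
```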
## Usage

Using gradient accumulation in Colossal-AI is simple: just add the following line to your config file.
The integer represents the number of iterations over which gradients are accumulated.

```python
gradient_accumulation = <int>
```

## Hands-on Practice

We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
to demonstrate gradient accumulation. In this example, we set the gradient accumulation size to 4. You can run the script with this command:

```shell
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
```

You will see output similar to the text below. This shows that gradients are indeed accumulated, as the parameters are not updated
during the first 3 steps and are only updated in the last step.
```text
iteration 0, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 1, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 2, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 3, first 10 elements of param: tensor([-0.0141, 0.0464, 0.0507, 0.0321, 0.0356, -0.0150, 0.0172, -0.0118, 0.0222, 0.0473], device='cuda:0', grad_fn=<SliceBackward0>)
```

<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 gradient_accumulation.py -->
@@ -1,9 +1,8 @@
# Gradient Accumulation (Latest)
# Gradient Accumulation

Author: [Mingyan Jiang](https://github.com/jiangmingyan)

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Training Booster](../basics/booster_api.md)

## Introduction
@@ -1,64 +0,0 @@
# Gradient Clipping (Outdated)

Author: Boxiang Wang, Haichen Huang, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)

**Example Code**
- [ColossalAI-Examples Gradient Clipping](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)

**Related Paper**
- [On the difficulty of training Recurrent Neural Networks](https://arxiv.org/abs/1211.5063)

## Introduction

To speed up training and seek a better optimum, more and more learning rate schedulers have been proposed.
They control the learning rate to adjust the descent pace during training, which works best when the gradient
vectors keep a uniform magnitude at every step. As a result, gradient clipping, a technique that rescales the
gradient vector so that its norm stays within a preset bound, has become indispensable for anyone who wants
better model performance.
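On a single GPU this is the familiar plain-PyTorch one-liner between backward and step; a minimal sketch (with a generic `model`, `optimizer` and `loss` assumed to exist) looks like:

```python
import torch

loss.backward()
# rescale all gradients together so that their global L2 norm does not exceed 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```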
You do not have to worry about implementing gradient clipping when using Colossal-AI; we support gradient
clipping in a powerful and convenient way. All you need is one additional line in your configuration
file.

## Why you should use gradient clipping provided by Colossal-AI

The reason we do not recommend users to implement gradient clipping themselves is that naive gradient clipping
may fail when tensor parallelism, pipeline parallelism or MoE is applied.

As the illustration below shows, each GPU only owns a portion of the weight of a linear layer.
To get the correct norm of the weight's gradient, the partial norms computed on each GPU
must be combined.
To make things more complicated, the bias is distributed differently from the weight,
so the sum operation needs a different communication group.

(PS: This situation corresponds to an old version of 2D parallelism and does not match the current implementation,
but it is a good example of how difficult it is to unify all the communication in gradient clipping.)

<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/KXiJPHt3Dum82cA.png"/>
<figcaption>Layout of parameters</figcaption>
</figure>
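For intuition only, here is a hedged sketch of how a correct global gradient norm could be assembled when a weight is sharded across GPUs. This is not the actual Colossal-AI implementation; it assumes `torch.distributed` is already initialized and that `local_params` holds the shards owned by this rank:

```python
import torch
import torch.distributed as dist

def global_grad_norm(local_params, group=None):
    """Global L2 gradient norm over parameters whose shards live on different ranks."""
    local_sq = sum(p.grad.pow(2).sum() for p in local_params if p.grad is not None)
    local_sq = torch.as_tensor(local_sq, dtype=torch.float32, device='cuda')
    # every rank contributes the squared norm of its shard; sum them across the group
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM, group=group)
    return local_sq.sqrt()

# Parameters distributed differently (e.g. the bias above) would need a different
# process group, which is exactly what makes a unified implementation hard.
```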
Do not worry about it, since Colossal-AI has handled it for you.

### Usage
To use gradient clipping, simply add the gradient clipping norm to your configuration file.
```python
clip_grad_norm = 1.0
```

### Hands-On Practice

We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)
to demonstrate gradient clipping. In this example, we set the gradient clipping norm to 1.0. You can run the script with this command:

```shell
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 train_with_engine.py
```

<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 gradient_clipping.py -->
@@ -1,9 +1,8 @@
# Gradient Clipping (Latest)
# Gradient Clipping

Author: [Mingyan Jiang](https://github.com/jiangmingyan)

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Training Booster](../basics/booster_api.md)

**Related Paper**
@@ -1,64 +0,0 @@
# Gradient Handler

Author: Shenggui Li, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)

**Example Code**
- [ColossalAI-Examples Gradient Handler](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)

## Introduction

In distributed training, gradient synchronization is required at the end of each iteration. This is important because we
need to make sure the parameters are updated with the same gradients on different machines so that the resulting parameters
are the same. This is most often seen in data parallel training, as the model is replicated across data-parallel ranks.

In Colossal-AI, we provide an interface for users to customize how they want to handle the synchronization. This brings
flexibility in cases such as implementing a new parallelism method.

When gradient handlers are used, PyTorch `DistributedDataParallel` will not be used, as it synchronizes gradients automatically.

## Customize Your Gradient Handlers

To implement a customized gradient handler, you need to follow these steps:
1. Inherit `BaseGradientHandler` in Colossal-AI.
2. Register the gradient handler into the `GRADIENT_HANDLER` registry.
3. Implement the `handle_gradient` method.

```python
from colossalai.legacy.registry import GRADIENT_HANDLER
from colossalai.legacy.engine.gradient_handler import BaseGradientHandler


@GRADIENT_HANDLER.register_module
class MyGradientHandler(BaseGradientHandler):

    def handle_gradient(self):
        # synchronize or post-process the gradients here
        do_something()
```
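As a rough illustration only (not the built-in implementation), a handler for plain data parallelism could all-reduce and average every gradient across ranks inside `handle_gradient`; it is assumed here that `BaseGradientHandler` keeps a reference to the model as `self._model`:

```python
import torch.distributed as dist

from colossalai.legacy.registry import GRADIENT_HANDLER
from colossalai.legacy.engine.gradient_handler import BaseGradientHandler


@GRADIENT_HANDLER.register_module
class AllReduceGradientHandler(BaseGradientHandler):
    """Hypothetical handler: average gradients across all ranks after backward."""

    def handle_gradient(self):
        world_size = dist.get_world_size()
        for param in self._model.parameters():  # assumes the base class stores the model
            if param.grad is not None:
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                param.grad.div_(world_size)
```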
## Usage

To use a gradient handler, you need to specify your gradient handler in the config file. The gradient handler
will be automatically built and attached to the engine.

```python
gradient_handler = [dict(type='MyGradientHandler')]
```

### Hands-On Practice

We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)
to demonstrate the use of gradient handlers. In this example, we use `DataParallelGradientHandler` instead of PyTorch
`DistributedDataParallel` for data parallel training.

```shell
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py
```
<!-- doc-test-command: echo -->
@@ -1,368 +0,0 @@
# Auto Mixed Precision Training (Outdated)

Author: Chuanrui Wang, Shenggui Li, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)

**Example Code**
- [ColossalAI-Examples AMP](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp)

**Related Paper**
- [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)


## Introduction

AMP stands for automatic mixed precision training.
In Colossal-AI, we have incorporated different implementations of mixed precision training:

1. torch.cuda.amp
2. apex.amp
3. naive amp


| Colossal-AI | support tensor parallel | support pipeline parallel | fp16 extent |
| ----------- | ----------------------- | ------------------------- | ----------- |
| AMP_TYPE.TORCH | ✅ | ❌ | Model parameters, activations and gradients are downcast to fp16 during forward and backward propagation |
| AMP_TYPE.APEX | ❌ | ❌ | More fine-grained; we can choose opt_level O0, O1, O2, O3 |
| AMP_TYPE.NAIVE | ✅ | ✅ | Model parameters, forward and backward operations are all downcast to fp16 |

The first two rely on the original implementations in PyTorch (version 1.6 and above) and NVIDIA Apex.
The last method is similar to the Apex O2 level.
Among these methods, Apex AMP is not compatible with tensor parallelism.
This is because tensors are split across devices in tensor parallelism; thus, communication among different processes is required to check whether inf or nan occurs anywhere in the model.
We modified the torch AMP implementation so that it is now compatible with tensor parallelism.

> ❌️ fp16 and zero configurations are not compatible
>
> ⚠️ Pipeline parallelism currently only supports naive AMP

We recommend using torch AMP, as it generally gives better accuracy than naive AMP when no pipeline is used.
## Table of Contents

In this tutorial we will cover:

1. AMP introduction
2. AMP in Colossal-AI
3. Hands-on Practice

## AMP Introduction

Automatic Mixed Precision training is a mixture of FP16 and FP32 training.

Half-precision floating point format (FP16) has lower arithmetic complexity and higher compute efficiency.
Besides, fp16 requires only half of the storage needed by fp32 and saves memory & network bandwidth, which makes more memory
available for larger batch sizes and model sizes.

However, there are other operations, like reductions, which require the dynamic range of fp32 to avoid numeric overflow/underflow. That's the reason why we introduce automatic mixed precision, which attempts to match each operation to its appropriate data type, reducing the memory footprint and improving training efficiency.

<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/URzLJ3MPeDQbtck.png"/>
<figcaption>Illustration of an ordinary AMP (figure from <a href="https://arxiv.org/abs/2108.05818">PatrickStar paper</a>)</figcaption>
</figure>
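As a point of reference, this is roughly what AMP looks like in plain PyTorch with `torch.cuda.amp`, the first of the three implementations listed above (a generic `model`, `optimizer`, `criterion` and `train_dataloader` are assumed):

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # keeps a dynamic loss scale to avoid fp16 underflow

for img, label in train_dataloader:
    img, label = img.cuda(), label.cuda()
    optimizer.zero_grad()
    # autocast runs matmul/conv-type ops in fp16 and keeps reduction-type ops in fp32
    with torch.cuda.amp.autocast():
        output = model(img)
        loss = criterion(output, label)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales gradients and skips the step on inf/NaN
    scaler.update()
```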
## AMP in Colossal-AI

We support three AMP training methods and allow the user to train with AMP without changing their code. Simply add the `fp16`
configuration to your configuration file to use AMP.


```python
from colossalai.amp import AMP_TYPE

# use Torch AMP
fp16=dict(
    mode = AMP_TYPE.TORCH
)

# use naive AMP
fp16=dict(
    mode = AMP_TYPE.NAIVE
)

# use NVIDIA Apex AMP
fp16=dict(
    mode = AMP_TYPE.APEX
)
```

> These are the minimum configurations; the full configurations are described in the sections below.

### AMP Modularity

The AMP module is designed to be completely modular and can be used independently.
If you wish to only use AMP in your code base without `colossalai.initialize`,
you can use `colossalai.amp.convert_to_amp`.

```python
import colossalai
from colossalai.amp import AMP_TYPE

# example of using torch amp
model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
                                                            optimizer,
                                                            criterion,
                                                            AMP_TYPE.TORCH)
```

### Torch AMP Configuration

```python
from colossalai.amp import AMP_TYPE

fp16=dict(
    mode=AMP_TYPE.TORCH,

    # below are default values for grad scaler
    init_scale=2.**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
    enabled=True
)
```

With optional arguments:
- init_scale(float, optional, default=2.**16): Initial scale factor
- growth_factor(float, optional, default=2.0): Factor by which the scale is multiplied during `update` if no inf/NaN gradients occur for ``growth_interval`` consecutive iterations.
- backoff_factor(float, optional, default=0.5): Factor by which the scale is multiplied during `update` if inf/NaN gradients occur in an iteration.
- growth_interval(int, optional, default=2000): Number of consecutive iterations without inf/NaN gradients that must occur for the scale to be multiplied by ``growth_factor``.
- enabled(bool, optional, default=True): If ``False``, disables gradient scaling. `step` simply invokes the underlying ``optimizer.step()``, and other methods become no-ops.
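These mirror the constructor arguments of PyTorch's own `torch.cuda.amp.GradScaler`; the same defaults could be passed directly in plain PyTorch:

```python
from torch.cuda.amp import GradScaler

# Same default values as the fp16 config above, passed to PyTorch's grad scaler directly.
scaler = GradScaler(
    init_scale=2.**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
    enabled=True,
)
```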
### Apex AMP Configuration

For this mode, we rely on the Apex implementation for mixed precision training.
We support this plugin because it allows finer control over the granularity of mixed precision.
For example, the O2 level (optimization level 2) will keep batch normalization in fp32.

If you are looking for more details, please refer to the [Apex Documentation](https://nvidia.github.io/apex/).
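For reference only, a hedged sketch of how these options are consumed when Apex is used directly rather than through the Colossal-AI config (assuming Apex is installed and a generic `model`, `optimizer` and `loss` exist):

```python
from apex import amp

# opt_level='O2' keeps batchnorm in fp32 as described above; options such as
# keep_batchnorm_fp32 or loss_scale can be passed to override the opt_level defaults.
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')

# The loss is scaled through Apex's context manager during backward.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```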
```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.APEX,

    # below are the default values
    enabled=True,
    opt_level='O1',
    cast_model_type=None,
    patch_torch_functions=None,
    keep_batchnorm_fp32=None,
    master_weights=None,
    loss_scale=None,
    cast_model_outputs=None,
    num_losses=1,
    verbosity=1,
    min_loss_scale=None,
    max_loss_scale=16777216.0
)
```

Parameters:
- enabled(bool, optional, default=True): If False, renders all AMP calls no-ops, so your script should run as if Amp were not present.

- opt_level(str, optional, default="O1"): Pure or mixed precision optimization level.
Accepted values are "O0", "O1", "O2", and "O3", explained in detail in the Apex AMP documentation.

- num_losses(int, optional, default=1): Option to tell AMP in advance how many losses/backward passes you plan to use.
When used in conjunction with the loss_id argument to `amp.scale_loss`, enables Amp to use a different loss scale per
loss/backward pass, which can improve stability. If num_losses is left to 1, Amp will still support multiple
losses/backward passes, but use a single global loss scale for all of them.

- verbosity(int, default=1): Set to 0 to suppress Amp-related output.

- min_loss_scale(float, default=None): Sets a floor for the loss scale values that can be chosen by dynamic loss scaling.
The default value of None means that no floor is imposed. If dynamic loss scaling is not used, min_loss_scale is ignored.

- max_loss_scale(float, default=2.**24): Sets a ceiling for the loss scale values that can be chosen by dynamic loss
scaling. If dynamic loss scaling is not used, max_loss_scale is ignored.

Currently, the under-the-hood properties that govern pure or mixed precision training are the following:
cast_model_type, patch_torch_functions, keep_batchnorm_fp32, master_weights, loss_scale.
They are optional properties that are overridden once opt_level is determined:

- cast_model_type: Casts your model's parameters and buffers to the desired type.
- patch_torch_functions: Patch all Torch functions and Tensor methods to perform Tensor Core-friendly ops like GEMMs and convolutions in FP16, and any ops that benefit from FP32 precision in FP32.
- keep_batchnorm_fp32: To enhance precision and enable cudnn batchnorm (which improves performance), it's often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
- master_weights: Maintain FP32 master weights to accompany any FP16 model weights. FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.
- loss_scale: If loss_scale is a float value, use this value as the static (fixed) loss scale. If loss_scale is the string "dynamic", adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.


### Naive AMP Configuration

In Naive AMP mode, we achieved mixed precision training while maintaining compatibility with complex tensor and pipeline parallelism.
This AMP mode will cast all operations into fp16.
The following code block shows the `config.py` file for this mode.

```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.NAIVE,

    # below are the default values
    log_num_zeros_in_grad=False,
    initial_scale=2 ** 32,
    min_scale=1,
    growth_factor=2,
    backoff_factor=0.5,
    growth_interval=1000,
    hysteresis=2
)
```

The default parameters of Naive AMP:
- log_num_zeros_in_grad(bool): whether to log the number of zeros in the gradients.
- initial_scale(int): initial scale of the gradient scaler
- growth_factor(int): the growth rate of the loss scale
- backoff_factor(float): the decrease rate of the loss scale
- hysteresis(int): delay shift in dynamic loss scaling
- max_scale(int): maximum loss scale allowed
- verbose(bool): if set to `True`, will print debug info
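As a rough schematic of the mechanism these parameters control (illustrative only, not the actual Colossal-AI implementation), dynamic loss scaling with hysteresis behaves along these lines:

```python
# Illustrative dynamic loss scaling with hysteresis, using the defaults above.
scale = 2 ** 32            # initial_scale
min_scale = 1
growth_factor = 2
backoff_factor = 0.5
growth_interval = 1000
hysteresis = 2

good_steps = 0
overflow_budget = hysteresis

def update_scale(found_inf_or_nan: bool) -> None:
    """Update the loss scale after each optimizer step."""
    global scale, good_steps, overflow_budget
    if found_inf_or_nan:
        good_steps = 0
        overflow_budget -= 1
        # back off only after `hysteresis` consecutive overflowing steps
        if overflow_budget <= 0:
            scale = max(scale * backoff_factor, min_scale)
            overflow_budget = hysteresis
    else:
        good_steps += 1
        overflow_budget = hysteresis
        # grow the scale after `growth_interval` consecutive clean steps
        if good_steps % growth_interval == 0:
            scale *= growth_factor
```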
When using `colossalai.initialize`, you are required to first instantiate a model, an optimizer and a criterion.
The output model is converted to an AMP model with smaller memory consumption.
If your input model is already too large to fit on a GPU, please instantiate your model weights in `dtype=torch.float16`.
Otherwise, try smaller models or check out more parallel training techniques!
## Hands-on Practice

We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp) which demonstrates
the use of AMP with Colossal-AI. In this practice, we will use Torch AMP as an example, but do note that config files are provided for all AMP modes.

### Step 1. Create a config file

Create a `config.py` and add the `fp16` configuration.

```python
# in config.py
from colossalai.amp import AMP_TYPE

BATCH_SIZE = 128
DROP_RATE = 0.1
NUM_EPOCHS = 300

fp16 = dict(
    mode=AMP_TYPE.TORCH,
)

clip_grad_norm = 1.0
```

### Step 2. Import libraries in train_with_engine.py

Create a `train_with_engine.py` and import the necessary dependencies. Remember to install `scipy` and `timm` by running
`pip install timm scipy`.

```python
import os
import colossalai
import torch
from pathlib import Path
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.utils import get_dataloader
from colossalai.legacy.trainer import Trainer, hooks
from colossalai.nn.lr_scheduler import LinearWarmupLR
from timm.models import vit_base_patch16_224
from torchvision import datasets, transforms
```

### Step 3. Initialize Distributed Environment

We then need to initialize the distributed environment. For demo purposes, we use `launch_from_torch`. You can refer to [Launch Colossal-AI](../basics/launch_colossalai.md)
for other initialization methods.

```python
# initialize distributed setting
parser = colossalai.get_default_parser()
args = parser.parse_args()

# launch from torch
colossalai.launch_from_torch(config=args.config)
```

### Step 4. Create training components

Build your model, optimizer, loss function, lr scheduler and dataloaders. Note that the root path of the dataset is
obtained from the environment variable `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
to a path on your machine. Data will be automatically downloaded to the root path.
```python
# build model
model = vit_base_patch16_224(drop_rate=0.1)

# build dataloader
train_dataset = datasets.Caltech101(
    root=Path(os.environ['DATA']),
    download=True,
    transform=transforms.Compose([
        transforms.Resize(256),
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        # Gray2RGB is a custom transform from the example that converts
        # grayscale Caltech101 images to 3-channel RGB tensors
        Gray2RGB(),
        transforms.Normalize([0.5, 0.5, 0.5],
                             [0.5, 0.5, 0.5])
    ]))

train_dataloader = get_dataloader(dataset=train_dataset,
                                  shuffle=True,
                                  batch_size=gpc.config.BATCH_SIZE,
                                  num_workers=1,
                                  pin_memory=True,
                                  )

# build optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=0.1)

# build loss
criterion = torch.nn.CrossEntropyLoss()

# lr_scheduler
lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=gpc.config.NUM_EPOCHS)
```
### Step 5. Inject AMP Feature

Call `colossalai.initialize` to convert the training components to run with FP16.

```python
engine, train_dataloader, _, _ = colossalai.initialize(
    model, optimizer, criterion, train_dataloader,
)
```

### Step 6. Train with Engine

Use the engine in a normal training loop.
```python
engine.train()
for epoch in range(gpc.config.NUM_EPOCHS):
    for img, label in train_dataloader:
        img = img.cuda()
        label = label.cuda()
        engine.zero_grad()
        output = engine(img)
        loss = engine.criterion(output, label)
        engine.backward(loss)
        engine.step()
        lr_scheduler.step()
```

### Step 7. Invoke Training Scripts

Use the following command to start the training script. You can change `--nproc_per_node` to use a different number of GPUs.

```shell
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py --config config/config_AMP_torch.py
```
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 mixed_precision_training.py -->
@@ -1,10 +1,9 @@
# Auto Mixed Precision Training (Latest)
# Auto Mixed Precision Training

Author: [Mingyan Jiang](https://github.com/jiangmingyan)

**Prerequisite**

- [Define Your Configuration](../basics/define_your_config.md)
- [Training Booster](../basics/booster_api.md)

**Related Paper**
@@ -61,7 +60,7 @@ However, there are other operations, like reductions, which require the dynamic
## AMP in Colossal-AI

We supported three AMP training methods and allowed the user to train with AMP with no code change. If you want to train with AMP, just assign `mixed_precision` with `fp16` when you instantiate the `Booster`. Currently the booster supports torch AMP; the other two (apex AMP, naive AMP) are still started via `colossalai.initialize`. If needed, please refer to [this](./mixed_precision_training.md). Next we will support `bf16` and `fp8`.
We supported three AMP training methods and allowed the user to train with AMP with no code change. If you want to train with AMP, just assign `mixed_precision` with `fp16` when you instantiate the `Booster`. Next we will support `bf16` and `fp8`.

### Start with Booster
@@ -204,7 +204,7 @@ def main():
torch.cuda.synchronize()
```
> ⚠️ Note: If you want to use the Gemini module, please do not use the [Gradient Accumulation](../features/gradient_accumulation.md) we mentioned before.
> ⚠️ Note: If you want to use the Gemini module, please do not use the [Gradient Accumulation](../features/gradient_accumulation_with_booster.md) we mentioned before.
The complete example can be found on [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt).

<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 zero_with_chunk.py -->