Mirror of https://github.com/hpcaitech/ColossalAI.git (synced 2025-09-03 10:06:44 +00:00)
[doc] Fix typo under colossalai and doc(#3618)
* Fixed several spelling errors under colossalai
* Fix the spelling error in colossalai and docs directory
* Cautious Changed the spelling error under the example folder
* Update runtime_preparation_pass.py
  revert autograft to autograd
* Update search_chunk.py
  utile to until
* Update check_installation.py
  change misteach to mismatch in line 91
* Update 1D_tensor_parallel.md
  revert to perceptron
* Update 2D_tensor_parallel.md
  revert to perceptron in line 73
* Update 2p5D_tensor_parallel.md
  revert to perceptron in line 71
* Update 3D_tensor_parallel.md
  revert to perceptron in line 80
* Update README.md
  revert to resnet in line 42
* Update reorder_graph.py
  revert to indice in line 7
* Update p2p.py
  revert to megatron in line 94
* Update initialize.py
  revert to torchrun in line 198
* Update routers.py
  change to detailed in line 63
* Update routers.py
  change to detailed in line 146
* Update README.md
  revert random number in line 402
@@ -4,7 +4,7 @@ Colossal-Auto simplifies the process of deploying large-scale machine learning m

### 1. Basic usage

-Colossal-Auto can be used to find a hybrid SPMD parallel strategy includes data, tensor(i.e., 1D, 2D, sequencial) for each operation. You can follow the [GPT example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt/experiments/auto_parallel).
+Colossal-Auto can be used to find a hybrid SPMD parallel strategy includes data, tensor(i.e., 1D, 2D, sequential) for each operation. You can follow the [GPT example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt/experiments/auto_parallel).
Detailed instructions can be found in its `README.md`.

### 2. Integration with activation checkpoint

@@ -44,7 +44,7 @@ In some solutions, the [Zero-offload](https://arxiv.org/abs/2101.06840) adopted
</figure>

-Colossal-AI designed Gemini, just like two-stars, which manages the memory space of CPU and GPU efficiently. It can make the tensor dynamically distributed in the storage space of CPU-GPU during training, so that the model training can break through the memory wall of GPU. The memory manager consists of two parts: **MemStatsCollector (MSC)** and **StatefuleTensorMgr (STM)**.
+Colossal-AI designed Gemini, just like two-stars, which manages the memory space of CPU and GPU efficiently. It can make the tensor dynamically distributed in the storage space of CPU-GPU during training, so that the model training can break through the memory wall of GPU. The memory manager consists of two parts: **MemStatsCollector (MSC)** and **StatefulTensorMgr (STM)**.

We take advantage of the iterative characteristics of the deep learning network training process. We divide iterations into two stages: warmup and non-warmup. One or several iterative steps at the beginning belong to the warmup stage, and the other iterative steps belong to the non-warmup stage. In the warmup stage, we collect information for the MSC, while in the non-warmup stage, STM gets the information collected by the MSC to move the tensor, so as to minimize the CPU-GPU data movement volume.

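The warmup/non-warmup split described in this hunk can be illustrated with a short sketch. This is a conceptual illustration only: the class names, sizes and the eviction policy below are simplified stand-ins, not the actual `MemStatsCollector` and `StatefulTensorMgr` implementations in ColossalAI.

```python
class MemStatsCollectorSketch:
    """Warmup stage: record how much non-model GPU memory each step needs."""
    def __init__(self):
        self.period_stats = []

    def record(self, non_model_mem: int):
        self.period_stats.append(non_model_mem)

    def max_non_model_mem(self) -> int:
        return max(self.period_stats) if self.period_stats else 0


class StatefulTensorMgrSketch:
    """Non-warmup stage: keep model tensors on GPU only up to the remaining budget,
    offloading the rest to CPU so later CPU-GPU traffic stays small."""
    def __init__(self, collector: MemStatsCollectorSketch, gpu_capacity: int):
        self.collector = collector
        self.gpu_capacity = gpu_capacity

    def adjust_layout(self, tensor_sizes: dict) -> list:
        budget = self.gpu_capacity - self.collector.max_non_model_mem()
        used, to_cpu = 0, []
        for name, nbytes in sorted(tensor_sizes.items(), key=lambda kv: kv[1]):
            if used + nbytes <= budget:
                used += nbytes          # stays on GPU
            else:
                to_cpu.append(name)     # would be moved to CPU memory
        return to_cpu


msc = MemStatsCollectorSketch()
msc.record(2 * 1024 ** 3)               # one warmup step observed 2 GB of non-model memory
stm = StatefulTensorMgrSketch(msc, gpu_capacity=16 * 1024 ** 3)
print(stm.adjust_layout({"layer1.weight": 4 * 1024 ** 3, "layer2.weight": 12 * 1024 ** 3}))
```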
@@ -20,7 +20,7 @@ To launch the distributed inference service quickly, you can download the OPT-12

2. Prepare a prebuilt service image

-Pull a docker image from dockerhub installed with Colossal-AI inference.
+Pull a docker image from docker hub installed with Colossal-AI inference.

```bash
docker pull hpcaitech/energon-ai:latest

@@ -12,7 +12,7 @@ Author: Yuxuan Lou

## Introduction

-In this example for ViT model, Colossal-AI provides three different parallelism techniques which acclerate model training: data parallelism, pipeline parallelism and tensor parallelism.
+In this example for ViT model, Colossal-AI provides three different parallelism techniques which accelerate model training: data parallelism, pipeline parallelism and tensor parallelism.
We will show you how to train ViT on CIFAR-10 dataset with these parallelism techniques. To run this example, you will need 2-4 GPUs.

@@ -31,7 +31,7 @@ pip install colossalai

## Data Parallelism
-Data parallism is one basic way to accelerate model training process. You can apply data parallism to training by only two steps:
+Data parallism is one basic way to accelerate model training process. You can apply data parallelism to training by only two steps:
1. Define a configuration file
2. Change a few lines of code in train script

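The two steps above can be sketched roughly as follows. This assumes the legacy `colossalai.launch_from_torch` / `colossalai.initialize` / engine API used throughout this tutorial; the model, data and hyperparameter values are placeholders, not the ViT example itself.

```python
import colossalai
import torch
from torch.utils.data import DataLoader, TensorDataset

# Step 1: a config file (e.g. ./config.py) holds the features and hyperparameters;
# for plain data parallelism it can be almost empty -- the data-parallel size is
# inferred from how many processes torchrun starts.
colossalai.launch_from_torch(config='./config.py')

# Step 2: build components as usual, then hand them to colossalai.initialize.
model = torch.nn.Linear(32, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(512, 32), torch.randint(0, 10, (512,)))
train_dataloader = DataLoader(dataset, batch_size=128, shuffle=True)

engine, train_dataloader, _, _ = colossalai.initialize(model=model,
                                                       optimizer=optimizer,
                                                       criterion=criterion,
                                                       train_dataloader=train_dataloader)

engine.train()
for img, label in train_dataloader:
    img, label = img.cuda(), label.cuda()
    engine.zero_grad()
    loss = engine.criterion(engine(img), label)
    engine.backward(loss)
    engine.step()
```

Launched with a command of the form `torchrun --standalone --nproc_per_node <NUM_GPUs> train_dp.py --config ./config.py`, each process holds a full model replica and gradients are averaged across them.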
@@ -108,7 +108,7 @@ disable_existing_loggers()
logger = get_dist_logger()
```

-After initialization, you can acess the variables in the config file by using `colossalai.core.global_context`.
+After initialization, you can access the variables in the config file by using `colossalai.core.global_context`.

```python
#access parameters

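# Illustrative continuation (an assumption, not taken from the hunk above):
# in the legacy API the values defined in the config file appear as attributes
# of gpc.config; BATCH_SIZE / NUM_EPOCHS are the names used elsewhere in this tutorial.
from colossalai.core import global_context as gpc

print(gpc.config.BATCH_SIZE)
print(gpc.config.NUM_EPOCHS)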
@@ -162,7 +162,7 @@ optimizer = colossalai.nn.Lamb(model.parameters(), lr=1.8e-2, weight_decay=0.1)
# build loss
criterion = torch.nn.CrossEntropyLoss()

-# lr_scheduelr
+# lr_scheduler
lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=gpc.config.NUM_EPOCHS)
```

@@ -230,10 +230,10 @@ torchrun --standalone --nproc_per_node <NUM_GPUs> train_dp.py --config ./config

## Pipeline Parallelism
-Aside from data parallelism, Colossal-AI also support pipleline parallelism. In specific, Colossal-AI uses 1F1B pipeline introduced by NVIDIA. For more details, you can view the related [documents](https://www.colossalai.org/tutorials/features/pipeline_parallel).
+Aside from data parallelism, Colossal-AI also support pipeline parallelism. In specific, Colossal-AI uses 1F1B pipeline introduced by NVIDIA. For more details, you can view the related [documents](https://www.colossalai.org/tutorials/features/pipeline_parallel).

### Define your configuration file(`hybrid_parallel/configs/vit_pipeline.py`)
-To apply pipleline parallel on the data parallel basis, you only need to add a **parallel dict**
+To apply pipeline parallel on the data parallel basis, you only need to add a **parallel dict**
```python
from colossalai.amp import AMP_TYPE

@@ -250,7 +250,7 @@ clip_grad_norm = 1.0

Other configs:
```python
-# hyperparameters
+# hyper parameters
# BATCH_SIZE is as per GPU
# global batch size = BATCH_SIZE x data parallel size
BATCH_SIZE = 256

@@ -276,7 +276,7 @@ Colossal-AI provides two methods to build a pipeline model from the existing mod
- `colossalai.builder.build_pipeline_model_from_cfg`
- `colossalai.builder.build_pipeline_model`

-Besides, you can also build a pipeline model from scrath with Colossal-AI.
+Besides, you can also build a pipeline model from scratch with Colossal-AI.
```python
import math
from typing import Callable

@@ -521,7 +521,7 @@ def build_cifar(batch_size):
    return train_dataloader, test_dataloader

-# craete dataloaders
+# create dataloaders
train_dataloader , test_dataloader = build_cifar()

# create loss function

@@ -539,7 +539,7 @@ lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer,
#### Start Colossal-AI engine

```python
-# intiailize
+# initialize
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model=model,
                                                                     optimizer=optimizer,
                                                                     criterion=criterion,

@@ -615,7 +615,7 @@ TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LENGTH, HIDDEN_SIZE)

Ohter configs:
```python
-# hyperparameters
+# hyper parameters
# BATCH_SIZE is as per GPU
# global batch size = BATCH_SIZE x data parallel size
BATCH_SIZE = 256

@@ -42,7 +42,7 @@ Therefore, when using Distributed Spec, we only need to describe the way that th

## Compute Spec

-An instance of class [ComputeSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.compute_spec.html#colossalai.tensor.compute_spec.ComputeSpec) describes how a Coloensor be used in DNN training. Currently, we will set the correct Compute Pattern for the ColoTensor as the parameters of the module. The specific application scenarios will be shown in the next document.
+An instance of class [ComputeSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.compute_spec.html#colossalai.tensor.compute_spec.ComputeSpec) describes how a Colotensor be used in DNN training. Currently, we will set the correct Compute Pattern for the ColoTensor as the parameters of the module. The specific application scenarios will be shown in the next document.

## ColoParameter

@@ -172,7 +172,7 @@ In this config file, we specify that we want to use batch size 128 per GPU and r
#### Step 2. Initialize Distributed Environment

We need to initialize the distributed training environment. This has been introduced in the tutorial on how to
-[launch Colossal-AI](./launch_colossalai.md). For this demostration, we use `launch_from_torch` and PyTorch launch utility.
+[launch Colossal-AI](./launch_colossalai.md). For this demonstration, we use `launch_from_torch` and PyTorch launch utility.

```python
import colossalai

@@ -6,18 +6,18 @@ Author: Shenggui Li, Siqi Mai

With the development of deep learning model size, it is important to shift to a new training paradigm. The traditional training method with no parallelism and optimization became a thing of the past and new training methods are the key to make training large-scale models efficient and cost-effective.

-Colossal-AI is designed to be a unfied system to provide an integrated set of training skills and utilities to the user. You can find the common training utilities such as mixed precision training and gradient accumulation. Besides, we provide an array of parallelism including data, tensor and pipeline parallelism. We optimize tensor parallelism with different multi-dimensional distributed matrix-matrix multiplication algorithm. We also provided different pipeline parallelism methods to allow the user to scale their model across nodes efficiently. More advanced features such as offloading can be found in this tutorial documentation in detail as well.
+Colossal-AI is designed to be a unified system to provide an integrated set of training skills and utilities to the user. You can find the common training utilities such as mixed precision training and gradient accumulation. Besides, we provide an array of parallelism including data, tensor and pipeline parallelism. We optimize tensor parallelism with different multi-dimensional distributed matrix-matrix multiplication algorithm. We also provided different pipeline parallelism methods to allow the user to scale their model across nodes efficiently. More advanced features such as offloading can be found in this tutorial documentation in detail as well.

## General Usage

-We aim to make Colossal-AI easy to use and non-instrusive to user code. There is a simple general workflow if you want to use Colossal-AI.
+We aim to make Colossal-AI easy to use and non-intrusive to user code. There is a simple general workflow if you want to use Colossal-AI.

<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/ZK7ICWzbMsVuJof.png"/>
<figcaption>Workflow</figcaption>
</figure>

-1. Prepare a configiguration file where specifies the features you want to use and your parameters.
+1. Prepare a configuration file where specifies the features you want to use and your parameters.
2. Initialize distributed backend with `colossalai.launch`
3. Inject the training features into your training components (e.g. model, optimizer) with `colossalai.initialize`.
4. Run training and testing

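As a rough illustration of step 1, a config file simply assigns features and hyperparameters as module-level variables. The particular entries below (torch AMP and gradient accumulation) are examples drawn from other sections of this documentation, not a required set, and the values are placeholders.

```python
# config.py -- an illustrative example of step 1
from colossalai.amp import AMP_TYPE

BATCH_SIZE = 128
NUM_EPOCHS = 2

fp16 = dict(mode=AMP_TYPE.TORCH)   # enable mixed precision training via torch AMP
gradient_accumulation = 4          # accumulate gradients over 4 steps before each update
```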
@@ -42,7 +42,7 @@ Given $P$ processors, we present the theoretical computation and memory cost, as

## Usage

-To enable 1D tensor parallelism for our model, e.g. on 2 GPUs, we need to configure the parallism setting as below.
+To enable 1D tensor parallelism for our model, e.g. on 2 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
    data=1,

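# Illustrative completion of the truncated config above (assumed legacy form):
# the tensor entry selects 2-way tensor parallelism in '1d' mode.
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,
    tensor=dict(size=2, mode='1d'),
))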
@@ -60,7 +60,7 @@ Given $P=q\times q$ processors, we present the theoretical computation and memor

## Usage

-To enable 2D tensor parallelism for our model, e.g. on 4 GPUs, we need to configure the parallism setting as below.
+To enable 2D tensor parallelism for our model, e.g. on 4 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
    data=1,

@@ -57,7 +57,7 @@ Given $P=q \times q \times d$ processors, we present the theoretical computation

## Usage

-To enable 2.5D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallism setting as below.
+To enable 2.5D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
    data=1,

@@ -28,7 +28,7 @@ gradient_accumulation = <int>
## Hands-on Practice

We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
-to demonstrate gradient accumulation. In this example, we set the gradinet accumulation size to be 4. You can run the script using this command:
+to demonstrate gradient accumulation. In this example, we set the gradient accumulation size to be 4. You can run the script using this command:

```shell
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py

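Setting `gradient_accumulation = 4` is roughly equivalent to the plain-PyTorch pattern below, shown only as a sketch of the mechanism rather than ColossalAI's internal implementation: gradients from several consecutive micro-batches are summed before a single optimizer step, so the effective batch size is the dataloader batch size times the accumulation size.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

accumulation_size = 4
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
loader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,))), batch_size=4)

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accumulation_size  # scale so the summed grads match one big batch
    loss.backward()                                     # gradients accumulate in .grad
    if (step + 1) % accumulation_size == 0:
        optimizer.step()
        optimizer.zero_grad()
```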
@@ -101,7 +101,7 @@ you can use `colossalai.amp.convert_to_amp`.
```python
from colossalai.amp import AMP_TYPE

-# exmaple of using torch amp
+# example of using torch amp
model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
                                                            optimizer,
                                                            criterion,

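# Illustrative completion of the call above (assumed legacy signature): the final
# argument picks the AMP flavour -- AMP_TYPE.TORCH here, with AMP_TYPE.APEX and
# AMP_TYPE.NAIVE as the alternatives described in this tutorial.
model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
                                                            optimizer,
                                                            criterion,
                                                            AMP_TYPE.TORCH)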
@@ -220,7 +220,7 @@ The default parameters of Naive AMP:
- initial_scale(int): initial scale of gradient scaler
- growth_factor(int): the growth rate of loss scale
- backoff_factor(float): the decrease rate of loss scale
-- hysterisis(int): delay shift in dynamic loss scaling
+- hysteresis(int): delay shift in dynamic loss scaling
- max_scale(int): maximum loss scale allowed
- verbose(bool): if set to `True`, will print debug info

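These knobs are normally supplied through the `fp16` entry of the config file when the NAIVE mode is selected. The dictionary below is a hedged illustration using the parameter names listed above; the values are placeholders, not the documented defaults.

```python
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.NAIVE,
    initial_scale=2 ** 32,   # initial scale of the gradient scaler
    growth_factor=2,         # growth rate of the loss scale
    backoff_factor=0.5,      # decrease rate of the loss scale
    hysteresis=2,            # delay shift in dynamic loss scaling
    max_scale=2 ** 32,       # maximum loss scale allowed
    verbose=False,           # print debug info when True
)
```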
@@ -292,7 +292,7 @@ colossalai.launch_from_torch(config=args.config)
### Step 4. Create training components

Build your model, optimizer, loss function, lr scheduler and dataloaders. Note that the root path of the dataset is
-obtained from the environment varialbe `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
+obtained from the environment variable `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
to a path on your machine. Data will be automatically downloaded to the root path.

```python

@@ -326,7 +326,7 @@ to a path on your machine. Data will be automatically downloaded to the root pat
# build loss
criterion = torch.nn.CrossEntropyLoss()

-# lr_scheduelr
+# lr_scheduler
lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=gpc.config.NUM_EPOCHS)
```

@@ -57,7 +57,7 @@ It's compatible with all parallel methods in ColossalAI.

Let's start from two simple examples -- training GPT with different methods. These examples relies on `transformers`.

-We should install denpendencies first:
+We should install dependencies first:

```shell
pip install psutil transformers

@@ -99,7 +99,7 @@ class GPTLMLoss(nn.Module):
                            shift_labels.view(-1))
```

-And we define some utility functions, which generates random data, computes the number of paramters of a model and get memory usage of current process:
+And we define some utility functions, which generates random data, computes the number of parameters of a model and get memory usage of current process:

```python
def get_data(batch_size: int, seq_len: int,

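# A sketch of what such utilities can look like; the names, vocabulary size and
# memory measurement below are assumptions -- the GPT example's actual helpers
# may differ in detail. Imports are included so the sketch stands alone.
import psutil
import torch


def get_data(batch_size: int, seq_len: int, vocab_size: int = 50257):
    # random token ids plus an all-ones attention mask are enough to drive a forward pass
    input_ids = torch.randint(0, vocab_size, (batch_size, seq_len), device='cuda')
    attention_mask = torch.ones_like(input_ids)
    return input_ids, attention_mask


def get_model_numel(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())


def get_cpu_mem_usage() -> float:
    # resident set size of the current process, in MB (psutil is installed above)
    return psutil.Process().memory_info().rss / 1024 ** 2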
@@ -251,7 +251,7 @@ Time: 3.691 s
Mem usage: 5298.344 MB
```

-NVME offload saves about 294 MB memory. Note that enabling `pin_memory` of Gemini can accelerate training but increase memory usage. So this result also meets our expectation. If we disable `pin_memory`, we can aslo observe a memory usage drop about 900 MB.
+NVME offload saves about 294 MB memory. Note that enabling `pin_memory` of Gemini can accelerate training but increase memory usage. So this result also meets our expectation. If we disable `pin_memory`, we can also observe a memory usage drop about 900 MB.

## API Reference

@@ -32,11 +32,11 @@ and the first and second momentum estimates) are partitioned across the processe

3. **Shard Parameter**: The 16-bit model parameters are partitioned across the processes of a data parallel group.

-4. **[Gemini](../advanced_tutorials/meet_gemini.md)**: Dynamic heterogeneous memory space manager for paramters, gradients and optimizer states.
+4. **[Gemini](../advanced_tutorials/meet_gemini.md)**: Dynamic heterogeneous memory space manager for parameters, gradients and optimizer states.

Besides, this article will introduce the Zero Redundancy Optimizer with chunk-based memory management.

-When using ZeRO, we distributed the model by sharding the parameters. The advantage of this method is that the memory of each node is load balanced. But this approach has two significiant disadvantages. First, during communication, a temporary memory buffer needs to be allocated and released afterwards, leading to the memory fragmentation problem. Secondly, using tensor as the granularity for communication will cause the network bandwidth underutilized. Generally, the longer the transmitted message length, the higher the bandwidth utilization.
+When using ZeRO, we distributed the model by sharding the parameters. The advantage of this method is that the memory of each node is load balanced. But this approach has two significant disadvantages. First, during communication, a temporary memory buffer needs to be allocated and released afterwards, leading to the memory fragmentation problem. Secondly, using tensor as the granularity for communication will cause the network bandwidth underutilized. Generally, the longer the transmitted message length, the higher the bandwidth utilization.

Using the Chunk mechanism introduced in ColossalAI v0.1.8, we can improve the efficiency of ZeRO. We store a continuous set of parameters in initialization order into a Chunk (a chunk is a continuous memory space), and each Chunk has the same size. Organizing memory in chunks can lead to efficient use of network bandwidth between PCI-e and GPU-GPU, reduce the number of communications, and avoid potential memory fragmentation.

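A conceptual sketch of that chunking idea follows; it is not ColossalAI's actual `Chunk` class. Parameters are packed in initialization order into fixed-capacity flat buffers, so each buffer can be moved between devices or communicated as one large message instead of many small ones.

```python
import torch


def pack_into_chunks(params, chunk_numel: int):
    """Greedily pack parameters, in the order given, into flat buffers of at most chunk_numel elements."""
    chunks, current, used = [], [], 0
    for p in params:
        if current and used + p.numel() > chunk_numel:
            chunks.append(torch.cat([q.detach().reshape(-1) for q in current]))
            current, used = [], 0
        current.append(p)
        used += p.numel()
    if current:
        chunks.append(torch.cat([q.detach().reshape(-1) for q in current]))
    return chunks  # padding every chunk to exactly chunk_numel is omitted for brevity


model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Linear(64, 64))
chunks = pack_into_chunks(model.parameters(), chunk_numel=8192)
print([c.numel() for c in chunks])  # [4160, 4160]: each Linear's weight and bias land in one chunk
```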