Mirror of https://github.com/hpcaitech/ColossalAI.git (synced 2025-09-08 20:40:34 +00:00)
fix typo docs/
@@ -56,7 +56,7 @@ follow the steps below to create a new distributed initialization.
 world_size: int,
 config: Config,
 data_parallel_size: int,
-pipeline_parlalel_size: int,
+pipeline_parallel_size: int,
 tensor_parallel_size: int,
 arg1,
 arg2):
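The hunk above edits the signature of a custom process-group initializer. For context, a hedged sketch of what such an initializer can look like is given below; the import paths (`colossalai.registry`, `colossalai.context.process_group_initializer`) and the `rank` parameter follow older ColossalAI releases and are assumptions, not part of this commit.

```python
from colossalai.context import Config
from colossalai.context.process_group_initializer import ProcessGroupInitializer
from colossalai.registry import DIST_GROUP_INITIALIZER


@DIST_GROUP_INITIALIZER.register_module
class MyParallelInitializer(ProcessGroupInitializer):
    """A toy initializer whose __init__ matches the signature shown above."""

    def __init__(self,
                 rank: int,
                 world_size: int,
                 config: Config,
                 data_parallel_size: int,
                 pipeline_parallel_size: int,
                 tensor_parallel_size: int,
                 arg1,
                 arg2):
        super().__init__(rank, world_size, config,
                         data_parallel_size, pipeline_parallel_size, tensor_parallel_size)
        self.arg1 = arg1
        self.arg2 = arg2

    def init_dist_group(self):
        # Build and return the process groups for the new parallel mode here,
        # e.g. via torch.distributed.new_group(...); omitted in this sketch.
        pass
```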
@@ -121,7 +121,7 @@ Inside the initialization of Experts, the local expert number of each GPU will b


 ## Train Your Model
-Do not to forget to use `colossalai.initialize` function in `colosalai` to add gradient handler for the engine.
+Do not to forget to use `colossalai.initialize` function in `colossalai` to add gradient handler for the engine.
 We handle the back-propagation of MoE models for you.
 In `colossalai.initialize`, we will automatically create a `MoeGradientHandler` object to process gradients.
 You can find more information about the handler `MoeGradientHandler` in colossal directory.
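For readers following this hunk, a hedged sketch of that initialization step is shown below, based on the legacy `colossalai.initialize` API; the model, optimizer and criterion are toy placeholders, and a prior launch call (e.g. `colossalai.launch_from_torch`) is assumed to have set up the distributed environment.

```python
import torch
import colossalai

model = torch.nn.Linear(16, 16)      # placeholder model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

# colossalai.initialize returns an engine; for MoE models it also attaches the
# MoeGradientHandler mentioned above, so gradients are reduced correctly during
# engine.backward().
engine, *_ = colossalai.initialize(model=model,
                                   optimizer=optimizer,
                                   criterion=criterion)
```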
@@ -53,7 +53,7 @@ export CHECKPOINT_DIR="your_opt_checkpoint_path"
 # the ${CONFIG_DIR} must contain a server.sh file as the entry of service
 export CONFIG_DIR="config_file_path"

-docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host energonai:lastest
+docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host energonai:latest
 ```

 Then open `https://[IP-ADDRESS]:8020/docs#` in your browser to try out!
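As a quick sanity check after starting the container above, something like the following can be used; it assumes the service is reachable on the local host over plain HTTP and only verifies that the docs page mentioned above responds.

```python
import requests

# Replace 127.0.0.1 with the actual [IP-ADDRESS] of the serving host.
resp = requests.get("http://127.0.0.1:8020/docs", timeout=5)
print(resp.status_code)  # 200 indicates the OPT service is up and serving its docs page
```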
@@ -69,7 +69,7 @@ After the forward operation of the embedding module, each word in all sequences
 <figcaption>The embedding module</figcaption>
 </figure>

-Each transformer layer contains two blocks. The self-attention operation is called in the first block and a two-layer percepton is located in the second block.
+Each transformer layer contains two blocks. The self-attention operation is called in the first block and a two-layer perception is located in the second block.

 <figure style={{textAlign: "center"}}>
 <img src="https://s2.loli.net/2022/08/17/LAVzDlpRcj4dYeb.png"/>
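The sentence being fixed above describes the standard transformer block: self-attention followed by a two-layer MLP (multi-layer perceptron). A generic PyTorch sketch of that structure is shown below for reference; it is illustrative and not the implementation used in the tutorial.

```python
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Pre-norm transformer block: self-attention, then a two-layer MLP."""

    def __init__(self, hidden_size: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(                      # the "two-layer perceptron" block
            nn.Linear(hidden_size, mlp_ratio * hidden_size),
            nn.GELU(),
            nn.Linear(mlp_ratio * hidden_size, hidden_size),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x
```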
@@ -195,7 +195,7 @@ def build_cifar(batch_size):

 ## Training ViT using pipeline

-You can set the size of pipeline parallel and number of microbatches in config. `NUM_CHUNKS` is useful when using interleved-pipeline (for more details see [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) ). The original batch will be split into `num_microbatches`, and each stage will load a micro batch each time. Then we will generate an approriate schedule for you to execute the pipeline training. If you don't need the output and label of model, you can set `return_output_label` to `False` when calling `trainer.fit()` which can further reduce GPU memory usage.
+You can set the size of pipeline parallel and number of microbatches in config. `NUM_CHUNKS` is useful when using interleaved-pipeline (for more details see [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) ). The original batch will be split into `num_microbatches`, and each stage will load a micro batch each time. Then we will generate an appropriate schedule for you to execute the pipeline training. If you don't need the output and label of model, you can set `return_output_label` to `False` when calling `trainer.fit()` which can further reduce GPU memory usage.

 You should `export DATA=/path/to/cifar`.

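A hedged sketch of the config entries this paragraph refers to; the concrete values are illustrative, not the ones used in the example.

```python
# config.py (illustrative values)
BATCH_SIZE = 256
NUM_MICRO_BATCHES = 4        # the batch is split into this many micro batches
NUM_CHUNKS = 1               # > 1 enables the interleaved pipeline schedule
parallel = dict(pipeline=2)  # number of pipeline stages
```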
@@ -16,14 +16,14 @@ In this example for ViT model, Colossal-AI provides three different parallelism
 We will show you how to train ViT on CIFAR-10 dataset with these parallelism techniques. To run this example, you will need 2-4 GPUs.


-## Tabel of Contents
+## Table of Contents
 1. Colossal-AI installation
 2. Steps to train ViT with data parallelism
 3. Steps to train ViT with pipeline parallelism
 4. Steps to train ViT with tensor parallelism or hybrid parallelism

 ## Colossal-AI Installation
-You can install Colossal-AI pacakage and its dependencies with PyPI.
+You can install Colossal-AI package and its dependencies with PyPI.
 ```bash
 pip install colossalai
 ```
@@ -31,7 +31,7 @@ pip install colossalai


 ## Data Parallelism
-Data parallism is one basic way to accelerate model training process. You can apply data parallelism to training by only two steps:
+Data parallelism is one basic way to accelerate model training process. You can apply data parallelism to training by only two steps:
 1. Define a configuration file
 2. Change a few lines of code in train script

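For step 1, a minimal, hedged example of such a configuration file is shown below; the entries (batch size, epochs, optional mixed precision) are illustrative and may differ from the ViT example's actual config.

```python
# config.py -- a minimal data-parallel configuration (illustrative)
from colossalai.amp import AMP_TYPE

BATCH_SIZE = 128
NUM_EPOCHS = 200
fp16 = dict(mode=AMP_TYPE.TORCH)  # optional: enable torch AMP mixed precision
```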
@@ -94,7 +94,7 @@ from torchvision import transforms
 from torchvision.datasets import CIFAR10
 ```

-#### Lauch Colossal-AI
+#### Launch Colossal-AI

 In train script, you need to initialize the distributed environment for Colossal-AI after your config file is prepared. We call this process `launch`. In Colossal-AI, we provided several launch methods to initialize the distributed backend. In most cases, you can use `colossalai.launch` and `colossalai.get_default_parser` to pass the parameters via command line. Besides, Colossal-AI can utilize the existing launch tool provided by PyTorch as many users are familiar with by using `colossalai.launch_from_torch`. For more details, you can view the related [documents](https://www.colossalai.org/docs/basics/launch_colossalai).

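A hedged sketch of the two launch paths described in that paragraph; the config path is a placeholder.

```python
import colossalai

# Option 1: read rank / world size / host / port from the command line.
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch(config='./config.py',
                  rank=args.rank,
                  world_size=args.world_size,
                  host=args.host,
                  port=args.port)

# Option 2: reuse the environment variables set by torchrun / torch.distributed.launch.
# colossalai.launch_from_torch(config='./config.py')
```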
@@ -613,7 +613,7 @@ NUM_MICRO_BATCHES = parallel['pipeline']
 TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LENGTH, HIDDEN_SIZE)
 ```

-Ohter configs:
+Other configs:
 ```python
 # hyper parameters
 # BATCH_SIZE is as per GPU
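A small worked example of the `TENSOR_SHAPE` expression above, with illustrative values that are not taken from the original config:

```python
# Illustrative values only
BATCH_SIZE, NUM_MICRO_BATCHES, SEQ_LENGTH, HIDDEN_SIZE = 64, 4, 512, 768
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LENGTH, HIDDEN_SIZE)
print(TENSOR_SHAPE)  # (16, 512, 768): the activation shape passed between pipeline stages per micro batch
```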
@@ -14,9 +14,9 @@ In our new design, `colossalai.booster` replaces the role of `colossalai.initial
 ### Plugin
 Plugin is an important component that manages parallel configuration (eg: The gemini plugin encapsulates the gemini acceleration solution). Currently supported plugins are as follows:

-***GeminiPlugin:*** This plugin wrapps the Gemini acceleration solution, that ZeRO with chunk-based memory management.
+***GeminiPlugin:*** This plugin wraps the Gemini acceleration solution, that ZeRO with chunk-based memory management.

-***TorchDDPPlugin:*** This plugin wrapps the DDP acceleration solution, it implements data parallelism at the module level which can run across multiple machines.
+***TorchDDPPlugin:*** This plugin wraps the DDP acceleration solution, it implements data parallelism at the module level which can run across multiple machines.

 ***LowLevelZeroPlugin:*** This plugin wraps the 1/2 stage of Zero Redundancy Optimizer. Stage 1 : Shards optimizer states across data parallel workers/GPUs. Stage 2 : Shards optimizer states + gradients across data parallel workers/GPUs.

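A hedged sketch of how a plugin is used with the booster API referred to above; the toy model/optimizer and the empty launch config are placeholders, and the exact `launch_from_torch` signature may vary across versions.

```python
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin  # or GeminiPlugin, LowLevelZeroPlugin

colossalai.launch_from_torch(config={})               # distributed setup via torchrun env vars

plugin = TorchDDPPlugin()
booster = Booster(plugin=plugin)

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.MSELoss()

# boost() wraps the objects according to the chosen plugin's parallel strategy.
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)
```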
@@ -52,7 +52,7 @@ An instance of class [ComputeSpec](https://colossalai.readthedocs.io/en/latest/c

 ## Example

-Let's see an example. A ColoTensor is initialized and sharded on 8 GPUs using tp_degree=4, dp_dgree=2. And then the tensor is sharded along the last dim among the TP process groups. Finally, we reshard it along the first dim (0 dim) among the TP process groups. We encourage users to run the code and observe the shape of each tensor.
+Let's see an example. A ColoTensor is initialized and sharded on 8 GPUs using tp_degree=4, dp_degree=2. And then the tensor is sharded along the last dim among the TP process groups. Finally, we reshard it along the first dim (0 dim) among the TP process groups. We encourage users to run the code and observe the shape of each tensor.


 ```python
@@ -67,7 +67,7 @@ Given $P=q \times q \times q$ processors, we present the theoretical computation

 ## Usage

-To enable 3D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallism setting as below.
+To enable 3D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
 ```python
 CONFIG = dict(parallel=dict(
     data=1,
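For reference, a hedged completion of that config for 8 GPUs (a 2 x 2 x 2 cube); the `pipeline` entry and formatting are illustrative.

```python
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,
    tensor=dict(size=8, mode='3d'),
))
```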
@@ -75,7 +75,7 @@ Build your model, optimizer, loss function, lr scheduler and dataloaders. Note t
 NUM_EPOCHS = 200
 BATCH_SIZE = 128
 GRADIENT_CLIPPING = 0.1
-# build resnetå
+# build resnet
 model = resnet34(num_classes=10)
 # build dataloaders
 train_dataset = CIFAR10(root=Path(os.environ.get('DATA', './data')),
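The `CIFAR10(...)` call above is cut off by the hunk boundary; a hedged sketch of how such a dataloader setup typically continues is given below, with illustrative transforms that are not taken from the original script.

```python
import os
from pathlib import Path

from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import CIFAR10

BATCH_SIZE = 128
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_dataset = CIFAR10(root=Path(os.environ.get('DATA', './data')),
                        train=True,
                        download=True,
                        transform=transform_train)
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True,
                              num_workers=4, pin_memory=True)
```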
@@ -53,7 +53,7 @@ It's compatible with all parallel methods in ColossalAI.

 > ⚠ It only offloads optimizer states on CPU. This means it only affects CPU training or Zero/Gemini with offloading.

-## Exampls
+## Examples

 Let's start from two simple examples -- training GPT with different methods. These examples relies on `transformers`.

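This hunk appears to come from the optimizer-state offload tutorial; assuming that context, a hedged sketch of creating the offload-capable optimizer is shown below. The `HybridAdam` arguments (`nvme_offload_fraction`, `nvme_offload_dir`) follow that tutorial's API as we understand it and may differ across versions; the linear model is a placeholder for the GPT models built with `transformers`.

```python
import torch
from colossalai.nn.optimizer import HybridAdam

model = torch.nn.Linear(1024, 1024)                   # placeholder for a GPT model from `transformers`
optimizer = HybridAdam(model.parameters(),
                       lr=1e-3,
                       nvme_offload_fraction=1.0,     # fraction of optimizer states to offload
                       nvme_offload_dir='./offload')  # directory on a fast (e.g. NVMe) device
```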
@@ -156,4 +156,4 @@ trainer.fit(train_dataloader=train_dataloader,
             display_progress=True)
 ```

-We use `2` pipeline stages and the batch will be splitted into `4` micro batches.
+We use `2` pipeline stages and the batch will be split into `4` micro batches.
@@ -72,7 +72,7 @@ chunk_manager = init_chunk_manager(model=module,
 gemini_manager = GeminiManager(placement_policy, chunk_manager)
 ```

-`hidden_dim` is the hidden dimension of DNN. Users can provide this argument to speed up searching. If users do not know this argument before training, it is ok. We will use a default value 1024. `min_chunk_size_mb` is the the minimum chunk size in MegaByte. If the aggregate size of parameters is still samller than the minimum chunk size, all parameters will be compacted into one small chunk.
+`hidden_dim` is the hidden dimension of DNN. Users can provide this argument to speed up searching. If users do not know this argument before training, it is ok. We will use a default value 1024. `min_chunk_size_mb` is the the minimum chunk size in MegaByte. If the aggregate size of parameters is still smaller than the minimum chunk size, all parameters will be compacted into one small chunk.

 Initialization of the optimizer.
 ```python
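A hedged sketch of passing the two arguments discussed in that paragraph to the chunk manager; the import paths follow ColossalAI releases of roughly this period and, together with the toy `module` and the 'cuda' placement policy, are assumptions rather than part of the original tutorial.

```python
import torch
from colossalai.gemini import GeminiManager
from colossalai.gemini.chunk import init_chunk_manager

module = torch.nn.Linear(1024, 1024)                     # placeholder for the real model
chunk_manager = init_chunk_manager(model=module,
                                   init_device=torch.device('cpu'),
                                   hidden_dim=1024,       # speeds up the chunk-size search
                                   min_chunk_size_mb=32)  # lower bound on chunk size, in MB
gemini_manager = GeminiManager('cuda', chunk_manager)     # placement_policy, then the chunk manager
```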