Support TP-compatible Torch AMP and Update trainer API (#27)

* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
Author: Frank Lee
Date: 2021-11-18 19:45:06 +08:00 (committed by GitHub)
Parent: 2b05de4c64
Commit: 3defa32aee
80 changed files with 2194 additions and 1584 deletions


@@ -0,0 +1,5 @@
colossalai.engine.amp.amp\_type
===============================

.. automodule:: colossalai.engine.amp.amp_type
   :members:


@@ -0,0 +1,5 @@
colossalai.engine.amp.grad\_scaler
==================================

.. automodule:: colossalai.engine.amp.grad_scaler
   :members:


@@ -0,0 +1,12 @@
colossalai.engine.amp
=====================

.. automodule:: colossalai.engine.amp
   :members:

.. toctree::
   :maxdepth: 2

   colossalai.engine.amp.amp_type
   colossalai.engine.amp.grad_scaler


@@ -1,5 +0,0 @@
colossalai.engine.amp\_type
===========================

.. automodule:: colossalai.engine.amp_type
   :members:


@@ -7,11 +7,6 @@ colossalai.engine
.. toctree::
   :maxdepth: 2

   colossalai.engine.amp
   colossalai.engine.gradient_handler
   colossalai.engine.schedule


@@ -21,7 +21,6 @@ colossalai
.. toctree::
   :maxdepth: 2

   colossalai.constants
   colossalai.core
   colossalai.initialize


@@ -0,0 +1,5 @@
colossalai.utils.checkpointing
==============================

.. automodule:: colossalai.utils.checkpointing
   :members:


@@ -9,6 +9,7 @@ colossalai.utils
   :maxdepth: 2

   colossalai.utils.activation_checkpoint
   colossalai.utils.checkpointing
   colossalai.utils.common
   colossalai.utils.cuda
   colossalai.utils.memory


@@ -17,38 +17,40 @@ parallel = dict(
)
```
The name of the dictionary variable should be **parallel**. All the arguments, including **parallel** itself, are
optional, and the data, pipeline, and tensor parallel sizes default to 1. The value of data, pipeline, and tensor can be
an int representing the size of the corresponding parallel dimension, or a dictionary with a key called "size". The key
"mode" specifies the tensor parallelism method.
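As an illustration, a configuration following these rules might look like the sketch below; the sizes and the `'2d'` mode are example values to adapt to your own hardware, not a recommended setting.

```python
# Hypothetical layout: pipeline size 2 and tensor size 4. With 16 GPUs in
# total, the data parallel size is then detected automatically as 2.
parallel = dict(
    pipeline=dict(size=2),
    tensor=dict(size=4, mode='2d'),
)
```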
## Data Parallel
Data parallelism is the most common way to distribute a training task: the data is split into several shards and each
device trains on a single shard. The configuration for data parallelism is detected and set automatically; you do not
have to set it explicitly in your configuration. When the data parallel size is larger than 1, Colossal-AI automatically
adds a distributed data sampler to the dataloader to shard the dataset. For example, with 16 GPUs and the sketch above,
the data parallel size would be inferred as 16 / (2 × 4) = 2.
## 1D, 2D, 2.5D and 3D Parallel
To enable hybrid parallelism, we provide an array of tensor parallelism methods and list the paper behind each one.
These parallel modes need to work with the distributed layers provided by Colossal-AI.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data, model weights and layer
outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$ devices where
$N$ is the number of tensor chunks in a single dimension.
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallelism further parallelizes 2D tensor parallelism:
$P = N^2 d$ processors are arranged into $d$ layers, where each layer performs matrix multiplication operations
independently with a dimension $N$.
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method
achieves the optimal, $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage
are evenly distributed through optimized load balancing of parameters as well as activations.
```python
# 1D parallel
...
```

@@ -78,12 +80,12 @@ parallel = dict(
## Pipeline Parallel (experimental)
Pipeline parallelism splits the model into several partitions by layer. For example, let's assume we have a simple
model which consists of two linear layers. We have two GPUs, and we can allocate the first linear layer to the first GPU
and the second layer to the second GPU. This example of course wastes computing resources and is only meant to
demonstrate the idea of pipeline parallelism.
As PyTorch is based on a dynamic computation graph, the computation flow is not known until execution. To support
pipeline parallelism in PyTorch, you may need to add one more attribute, `layers_cfg`, to your model class to tell
Colossal-AI the sequence of execution. One example you can refer to is `colossalai.nn.model.VanillaResNet`.
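To make this concrete, here is a minimal sketch of a model carrying such an attribute; the two-layer model and the exact entries of `layers_cfg` are hypothetical, so consult `colossalai.nn.model.VanillaResNet` for the real schema.

```python
import torch.nn as nn


class TwoLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 256)
        self.layer2 = nn.Linear(256, 10)
        # Hypothetical layers_cfg: it lists the submodules in execution order
        # so that the model can be partitioned into pipeline stages.
        self.layers_cfg = [
            dict(layer='layer1'),
            dict(layer='layer2'),
        ]

    def forward(self, x):
        return self.layer2(self.layer1(x))
```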
@@ -192,9 +194,9 @@ class VanillaResNet(BaseModel):
]
```
You can set the number of pipeline stages in your configuration file. When the pipeline size is larger than 1,
Colossal-AI automatically creates the pipeline schedule, which defines the forward and backward steps. You can specify
how many microbatches to run in each step in the `schedule` configuration.
```python
parallel = dict(
@@ -206,10 +208,11 @@ schedule = dict(
    num_microbatches = 4  # set the number of microbatches per step
)
```
This feature is still in development and is only experimental for now.
## Sequence Parallel (experimental)
Sequence parallelism supports long-sequence modelling such as document-level text understanding and medical imaging.
This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
This feature is still in development and is only experimental for now.


@@ -1,8 +1,8 @@
# Quick demo
Colossal-AI is an integrated large-scale deep learning system with efficient parallelization techniques. The system can
accelerate model training on distributed systems with multiple GPUs by applying parallelization techniques. The system
can also run on systems with only one GPU. Quick demos showing how to use Colossal-AI are given below.
## Single GPU
@@ -32,25 +32,17 @@ realizes the training process.
```python
import colossalai
from colossalai.core import global_context as gpc
from colossalai.logging import get_global_dist_logger
from colossalai.trainer import Trainer


def run_trainer():
    engine, train_dataloader, test_dataloader = colossalai.initialize()
    logger = get_global_dist_logger()

    trainer = Trainer(engine=engine,
                      hooks_cfg=gpc.config.hooks,
                      verbose=True)
    logger.info("trainer is built", ranks=[0])
@@ -58,11 +50,13 @@ def run_trainer():
    trainer.fit(
        train_dataloader=train_dataloader,
        test_dataloader=test_dataloader,
        epochs=gpc.config.num_epochs,
        hooks_cfg=gpc.config.hooks,
        display_progress=True,
        test_interval=2
    )


if __name__ == '__main__':
    run_trainer()
```
@@ -72,9 +66,9 @@ Zoo. The detailed substitution process is elaborated [here](model.md).
## Features
Colossal-AI provides a collection of parallel training components for you. We aim to support you with your development
of distributed deep learning models just like how you write single-GPU deep learning models. We provide friendly tools
to kickstart distributed training in a few lines.
- [Data Parallelism](parallelization.md)
- [Pipeline Parallelism](parallelization.md)


@@ -4,40 +4,36 @@ Colossal-AI is an integrated large-scale deep learning system with efficient parallelization techniques
## Single-GPU System

On non-distributed systems with a GPU, Colossal-AI reaches the current baseline efficiency for model training.
[Here](https://colab.research.google.com/drive/1fJnqqFzPuzZ_kn1lwCpG2nh3l2ths0KE?usp=sharing#scrollTo=cQ_y7lBG09LS) we
give a Google Colab example showing how to train a LeNet model on the CIFAR10 dataset with Colossal-AI on a
non-distributed system.
## Multi-GPU System

When training deep learning models on multi-GPU distributed systems, Colossal-AI uses efficient parallelization
techniques to significantly accelerate the training process; these techniques are described in detail in the
[Parallelization](parallelization.md) section below. The following code trains a ViT model on a distributed system with
four GPUs, where the `HOST` variable is the IP address of your distributed system. Note that the code below uses the
[Slurm](https://slurm.schedmd.com/documentation.html) job scheduler.
```bash
HOST=xxx.xxx.xxx.xxx srun ./scripts/slurm_dist_train.sh ./examples/run_trainer.py ./configs/vit/vit_2d.py
```
`./configs/vit/vit_2d.py` is a [configuration file](config.md). Colossal-AI uses configuration files to define the
parameters needed during training, such as the model type, the dataset, the optimizer, and the learning rate scheduler.
You can train different models by writing different configuration files. `./examples/run_trainer.py` is a standard
training script; its full code is attached below. The script reads the training parameters from the configuration file
and trains the model.
```python
import colossalai
from colossalai.core import global_context as gpc
from colossalai.logging import get_global_dist_logger
from colossalai.trainer import Trainer


def run_trainer():
    engine, train_dataloader, test_dataloader = colossalai.initialize()
    logger = get_global_dist_logger()

    trainer = Trainer(engine=engine,
                      hooks_cfg=gpc.config.hooks,
                      verbose=True)
    logger.info("trainer is built", ranks=[0])
@@ -45,11 +41,13 @@ def run_trainer():
    trainer.fit(
        train_dataloader=train_dataloader,
        test_dataloader=test_dataloader,
        epochs=gpc.config.num_epochs,
        hooks_cfg=gpc.config.hooks,
        display_progress=True,
        test_interval=2
    )


if __name__ == '__main__':
    run_trainer()
```


@@ -2,9 +2,9 @@
## Build your engine
To better understand how the `Engine` class works, let's start from the concept of the process function in common
engines. The process function usually controls the behavior over a batch of a dataset, and the `Engine` class just
controls the process function. Here we give a standard process function in the following code block.
```python
def process_function(dataloader, model, criterion, optim):
@@ -16,32 +16,33 @@ def process_function(dataloader, model, criterion, optim):
    optim.step()
```
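For reference, a complete process function along these lines might look like the sketch below; it is plain PyTorch with illustrative names, not the exact function used by Colossal-AI.

```python
def process_function(dataloader, model, criterion, optim):
    # fetch one batch, run a forward and a backward pass, then update weights
    data, label = next(iter(dataloader))
    optim.zero_grad()
    output = model(data)
    loss = criterion(output, label)
    loss.backward()
    optim.step()
```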
In `ignite.engine` or `keras.engine`, the process function is always provided by users. However, it is tricky for users
to write their own process functions for pipeline parallelism. Aiming at offering accessible hybrid parallelism for
users, we provide the powerful `Engine` class. This class enables pipeline parallelism and offers a
one-forward-one-backward non-interleaving strategy. Also, you can use a pre-defined learning rate scheduler in
the `Engine` class to adjust the learning rate during training.
In order to build your engine, just set the variables `model`, `criterion`, `optimizer`, `lr_scheduler` and `schedule`.
The following code block provides an example. **The engine is automatically created from the config file for you if you
start with `colossalai.initialize`.**
```python
import torch
import torch.nn as nn
import torchvision.models as models
import colossalai
from colossalai.engine import Engine

model = models.resnet18()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
lr_scheduler = colossalai.nn.lr_scheduler.CosineAnnealingLR(optimizer, 1000)
schedule = colossalai.engine.NoPipelineSchedule()

MyEngine = Engine(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    step_schedule=schedule
)
```
@@ -51,21 +52,24 @@ More information regarding the class can be found in the API references.
### Overview
To learn how to customize a trainer which meets your needs, let's first take a look at the `Trainer` class. We highly
recommend that you read the *Get Started* section and *Build your engine* first.
The `Trainer` class enables researchers and engineers to use our system more conveniently. Instead of having to write
your own scripts, you can simply construct your own trainer by calling the `Trainer` class, just like what we did in the
following code block.
```python
MyTrainer = Trainer(my_engine)
```
After that, you can use the `fit` method to train or evaluate your model. In order to make our `Trainer` class even more
powerful, we incorporate a set of handy tools into the class. For example, you can monitor or record the running states
and metrics which indicate the current performance of the model. These functions are realized by hooks. The `BasicHook`
class allows you to execute your hook functions at specified times. We have already created some practical hooks for
you, as listed below. What you need to do is just pick the ones which suit your needs. Detailed descriptions of the
class can be found in the API references.
```python
hooks = [
@@ -80,18 +84,21 @@ hooks = [
]
```
These hook functions will record metrics, elapsed time and memory usage and write them to log after each epoch. Besides,
they print the current loss and accuracy to let users monitor the performance of the model.
### Hook
If you have specific needs, feel free to extend our `BaseHook` class to add your own functions, or our `MetricHook`
class to write a metric collector. These hook functions can be called at twelve points in the trainer's life cycle.
Besides, you can define the priorities of all hooks to arrange their execution order. More information can be found in
the API references.
### Metric
You can write your own metrics by extending our `Metric` class. It should be used with the `MetricHook` class. When you
write your own metric hooks, please set the priority carefully and make sure the hook is called before other hooks which
might require the results of the metric hook.
We've already provided some metric hooks and we store metric objects in `runner.states['metrics']`. It is a dictionary
and metrics can be accessed by their names.


@@ -14,28 +14,30 @@ def process_function(dataloader, model, criterion, optim):
    optim.step()
```
`ignite.engine``keras.engine`中,进程函数需要由用户提供,然而,用户很难为流水线并行编写进程函数。为了向用户提供方便的混合并行,我们提供了具备强大功能的`Engine`类,该类支持流水线并行,并提供前向传播后向传播不交织的策略。同时,您可以在`Engine`类中使用您事先定义好的学习率调度器来在训练过程中调整学习率。
`ignite.engine``keras.engine`中,进程函数需要由用户提供,然而,用户很难为流水线并行编写进程函数。为了向用户提供方便的混合并行,我们提供了具备强大功能的`Engine`
类,该类支持流水线并行,并提供前向传播后向传播不交织的策略。同时,您可以在`Engine`类中使用您事先定义好的学习率调度器来在训练过程中调整学习率。
To build your engine, you only need to define variables such as `model`, `criterion`, `optimizer`, `lr_scheduler` and
`schedule`; the following code block gives an example. **If you start with `colossalai.initialize`, the engine is built
automatically from the config file.**
```python
import torch
import torch.nn as nn
import torchvision.models as models
import colossalai
from colossalai.engine import Engine

model = models.resnet18()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
lr_scheduler = colossalai.nn.lr_scheduler.CosineAnnealingLR(optimizer, 1000)
schedule = colossalai.engine.NoPipelineSchedule()

MyEngine = Engine(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    step_schedule=schedule
)
```
@@ -48,10 +50,12 @@ MyEngine = Engine(
The `Trainer` class is designed to let researchers and engineers use our system more conveniently. You do not need to
write your own scripts; simply call the `Trainer` class to construct your trainer, as done in the following code block.
```python
MyTrainer = Trainer(my_engine)
```
After that, you can use the `fit` method to train or evaluate your model. Besides, to make our `Trainer` class even more
powerful, we have added a series of handy tools. For example, you can continuously monitor and record the model's
running state and performance during training; these functions are all realized through hooks. The `BasicHook` class
lets you execute your hook functions at specified times. As shown in the code block below, we have pre-defined some
practical hooks for you; all you need to do is find the ones that meet your needs. More information about the class can
be found in the API references.
```python
hooks = [
    ...
]
```

@@ -70,7 +74,8 @@ hooks = [
### Hooks

If you have specific needs, you can extend our `BaseHook` class to add your own hook functions, or extend our
`MetricHook` class to write the metric collectors you need. These hook functions can be called at twelve points in the
`Trainer` life cycle. More information about the class can be found in the API references.

### Metrics