Mirror of https://github.com/hpcaitech/ColossalAI.git (synced 2025-09-02 01:28:31 +00:00)
Support TP-compatible Torch AMP and Update trainer API (#27)
* Add gradient accumulation, fix lr scheduler
* Fix FP16 optimizer and adapt torch amp to work with tensor parallelism (#18)
* Fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* Fixed trainer
* Revert "fixed trainer"
  This reverts commit 2e0b0b7699.
* Improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
docs/colossalai/colossalai.engine.amp.amp_type.rst (new file, 5 lines)

@@ -0,0 +1,5 @@
colossalai.engine.amp.amp\_type
===============================

.. automodule:: colossalai.engine.amp.amp_type
   :members:
docs/colossalai/colossalai.engine.amp.grad_scaler.rst (new file, 5 lines)

@@ -0,0 +1,5 @@
colossalai.engine.amp.grad\_scaler
==================================

.. automodule:: colossalai.engine.amp.grad_scaler
   :members:
docs/colossalai/colossalai.engine.amp.rst (new file, 12 lines)

@@ -0,0 +1,12 @@
colossalai.engine.amp
=====================

.. automodule:: colossalai.engine.amp
   :members:


.. toctree::
   :maxdepth: 2

   colossalai.engine.amp.amp_type
   colossalai.engine.amp.grad_scaler
docs/colossalai/colossalai.engine.amp_type.rst (removed)

@@ -1,5 +0,0 @@
colossalai.engine.amp\_type
===========================

.. automodule:: colossalai.engine.amp_type
   :members:
@@ -7,11 +7,6 @@ colossalai.engine
.. toctree::
   :maxdepth: 2

   colossalai.engine.amp
   colossalai.engine.gradient_handler
   colossalai.engine.schedule
@@ -21,7 +21,6 @@ colossalai
.. toctree::
   :maxdepth: 2

   colossalai.constants
   colossalai.core
   colossalai.initialize
docs/colossalai/colossalai.utils.checkpointing.rst (new file, 5 lines)

@@ -0,0 +1,5 @@
colossalai.utils.checkpointing
==============================

.. automodule:: colossalai.utils.checkpointing
   :members:
@@ -9,6 +9,7 @@ colossalai.utils
   :maxdepth: 2

   colossalai.utils.activation_checkpoint
   colossalai.utils.checkpointing
   colossalai.utils.common
   colossalai.utils.cuda
   colossalai.utils.memory
@@ -17,38 +17,40 @@ parallel = dict(
```
)
```

The name of the dictionary variable should be **parallel**. All the arguments, even **parallel** itself, are optional, and the data, pipeline and tensor parallel sizes default to 1. The value of data, pipeline and tensor can be an int representing the size of that parallel dimension, or a dictionary with a key called "size". The key "mode" represents the way of tensor parallelism.
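For illustration, a configuration that follows these rules might look like the sketch below; the sizes and the tensor-parallel `mode` string are example values rather than defaults, so adjust them to your own setup.

```python
# Illustrative parallel configuration following the description above.
# The sizes and the mode string are example values, not defaults.
parallel = dict(
    data=2,                          # data parallel size given as a plain int
    pipeline=dict(size=2),           # or a dictionary with a "size" key
    tensor=dict(size=4, mode='2d'),  # "mode" selects the tensor parallel method
)
```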
## Data Parallel

Data parallelism is the most common way to distribute a training task: the data is split into several shards and each device trains on a single shard. The configuration for data parallelism is detected automatically and set for you; you do not have to set it explicitly in your configuration. When the data parallel size is larger than 1, Colossal-AI automatically adds a distributed data sampler to the dataloader to shard the dataset.

## 1D, 2D, 2.5D and 3D Parallel

To enable hybrid parallelism, we provide an array of tensor parallelism methods, and we list the paper on which each tensor parallel method is based. These parallel modes need to work with the distributed layers provided by Colossal-AI.

- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)

- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
  2D parallelism relies on the SUMMA matrix multiplication algorithm and splits the input data, model weights and layer outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$ devices, where $N$ is the number of tensor chunks in a single dimension.

- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
  Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallelism introduces a novel tensor parallelism that further parallelizes 2D tensor parallelism. A total of $P = N^2 \times d$ processors are arranged into $d$ layers, where each layer performs matrix multiplication operations independently with a dimension of $N$.

- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
  We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves the optimal $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage are evenly distributed through optimized load balancing of parameters as well as activations.

```python
# 1D parallel
```
@@ -78,12 +80,12 @@ parallel = dict(

## Pipeline Parallel (experimental)

Pipeline parallelism splits the model into several partitions by layer. For example, let's assume we have a simple model which consists of two linear layers. We have two GPUs, and we can allocate the first linear layer to the first GPU and the second layer to the second GPU. This example of course wastes computing resources and is only meant to demonstrate the idea of pipeline parallelism.

As PyTorch is based on a dynamic computation graph, the computation flow is not known until execution. To support pipeline parallelism in PyTorch, you may need to add one more attribute, `layers_cfg`, to your model class, which tells Colossal-AI the sequence of execution. One example you can refer to is `colossalai.nn.model.VanillaResNet`.

@@ -192,9 +194,9 @@ class VanillaResNet(BaseModel):
```python
]
```

You can set the number of pipeline stages in your configuration file. When the pipeline size is larger than 1, Colossal-AI automatically creates the pipeline schedule, which defines the forward and backward steps. You can specify how many microbatches to run in each step in the `schedule` configuration.

```python
parallel = dict(
```

@@ -206,10 +208,11 @@ schedule = dict(

```python
    num_microbatches = 4  # set the number of microbatches per step
)
```
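Putting the fragments of this hunk together, a complete configuration of this kind might look like the following sketch; the pipeline size of 2 is an illustrative value and does not come from the elided lines of the original example.

```python
# Illustrative sketch only; the elided portion of the original example is not reproduced here.
parallel = dict(
    pipeline=2,  # number of pipeline stages, as an int or dict(size=2)
)

schedule = dict(
    num_microbatches=4  # set the number of microbatches per step
)
```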
This feature is still in development and is only experimental for now.

## Sequence Parallel (experimental)

Sequence parallelism supports long-sequence modelling such as document-level text understanding and medical imaging. This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120). This feature is still in development and is only experimental for now.
@@ -1,8 +1,8 @@
# Quick demo

Colossal-AI is an integrated large-scale deep learning system with efficient parallelization techniques. The system can accelerate model training on distributed systems with multiple GPUs by applying parallelization techniques. The system can also run on systems with only one GPU. Quick demos showing how to use Colossal-AI are given below.

## Single GPU
@@ -32,25 +32,17 @@ realizes the training process.

```python
import colossalai
from colossalai.core import global_context as gpc
from colossalai.logging import get_global_dist_logger
from colossalai.trainer import Trainer


def run_trainer():
    engine, train_dataloader, test_dataloader = colossalai.initialize()
    logger = get_global_dist_logger()

    logger.info("engine is built", ranks=[0])

    trainer = Trainer(engine=engine,
                      verbose=True)
    logger.info("trainer is built", ranks=[0])

    trainer.fit(
        train_dataloader=train_dataloader,
        test_dataloader=test_dataloader,
        epochs=gpc.config.num_epochs,
        hooks_cfg=gpc.config.hooks,
        display_progress=True,
        test_interval=2
    )


if __name__ == '__main__':
    run_trainer()
```
@@ -72,9 +66,9 @@ Zoo. The detailed substitution process is elaborated [here](model.md).

## Features

Colossal-AI provides a collection of parallel training components for you. We aim to support you in developing distributed deep learning models just like how you write single-GPU deep learning models. We provide friendly tools to kickstart distributed training in a few lines.

- [Data Parallelism](parallelization.md)
- [Pipeline Parallelism](parallelization.md)
@@ -4,40 +4,36 @@ Colossal-AI is a large-scale deep learning system with efficient parallelization techniques.

## Single-GPU system

When training a model on a non-distributed system with a GPU, Colossal-AI reaches the current baseline efficiency. [Here](https://colab.research.google.com/drive/1fJnqqFzPuzZ_kn1lwCpG2nh3l2ths0KE?usp=sharing#scrollTo=cQ_y7lBG09LS) we give a Google Colab example showing how to use Colossal-AI to train a LeNet model on the CIFAR10 dataset on a non-distributed system.

## Multi-GPU system

When training deep learning models on a multi-GPU distributed system, Colossal-AI can use efficient parallelization techniques to significantly accelerate training; these techniques are described in detail in the [parallelization](parallelization.md) section below. The code below trains a ViT model on a distributed system with four GPUs, where `HOST` is the IP address of your distributed system. Note that the code below uses the [Slurm](https://slurm.schedmd.com/documentation.html) job scheduler.

```bash
HOST=xxx.xxx.xxx.xxx srun ./scripts/slurm_dist_train.sh ./examples/run_trainer.py ./configs/vit/vit_2d.py
```

`./configs/vit/vit_2d.py` is a [config file](config.md); Colossal-AI uses config files to define the parameters needed during training, such as the model type, the dataset, and the optimizer and learning rate scheduler. You can train different models by writing different config files. `./examples/run_trainer.py` is a standard training script, whose full code is attached below; it reads the training parameters from the config file and trains the model.
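For orientation, a hedged sketch of what such a config file might contain is shown below. It is not the contents of `./configs/vit/vit_2d.py`; it only lists the fields that the training script below actually reads (`num_epochs` and `hooks`, via `gpc.config`) plus a parallel setting of the kind described in the parallelization guide.

```python
# Hypothetical minimal config sketch -- not the real ./configs/vit/vit_2d.py.
num_epochs = 2                       # read by the script as gpc.config.num_epochs
hooks = []                           # hook configurations, read as gpc.config.hooks
parallel = dict(
    tensor=dict(size=4, mode='2d'),  # example tensor parallel setting
)
```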
```python
import colossalai
from colossalai.core import global_context as gpc
from colossalai.logging import get_global_dist_logger
from colossalai.trainer import Trainer


def run_trainer():
    engine, train_dataloader, test_dataloader = colossalai.initialize()
    logger = get_global_dist_logger()

    logger.info("engine is built", ranks=[0])

    trainer = Trainer(engine=engine,
                      verbose=True)
    logger.info("trainer is built", ranks=[0])

    trainer.fit(
        train_dataloader=train_dataloader,
        test_dataloader=test_dataloader,
        epochs=gpc.config.num_epochs,
        hooks_cfg=gpc.config.hooks,
        display_progress=True,
        test_interval=2
    )


if __name__ == '__main__':
    run_trainer()
```
@@ -2,9 +2,9 @@

## Build your engine

To better understand how the `Engine` class works, let's start from the concept of the process function in common engines. The process function usually controls the behavior over a batch of a dataset; the `Engine` class just controls the process function. Here we give a standard process function in the following code block.

```python
def process_function(dataloader, model, criterion, optim):
```
@@ -16,32 +16,33 @@ def process_function(dataloader, model, criterion, optim):

```python
    optim.step()
```
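The body of the function falls between these two hunks and is therefore not shown here. Purely as a generic illustration (this is not the documentation's own elided code), a process function of this shape typically performs one forward/backward/update step:

```python
# Generic sketch of a process function; illustrative only.
def process_function(dataloader, model, criterion, optim):
    data, label = next(iter(dataloader))  # fetch one batch
    optim.zero_grad()
    output = model(data)
    loss = criterion(output, label)
    loss.backward()
    optim.step()
```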
In `ignite.engine` or `keras.engine`, the process function is always provided by users. However, it is tricky for users to write their own process functions for pipeline parallelism. Aiming at offering accessible hybrid parallelism for users, we provide the powerful `Engine` class. This class enables pipeline parallelism and offers a one-forward-one-backward non-interleaving strategy. Also, you can use a pre-defined learning rate scheduler in the `Engine` class to adjust the learning rate during training.

In order to build your engine, just set the variables `model`, `criterion`, `optimizer`, `lr_scheduler` and `schedule`. The following code block provides an example. **The engine is automatically created from the config file for you if you start with `colossalai.initialize`.**

```python
import torch
import torch.nn as nn
import torchvision.models as models
import colossalai
from colossalai.engine import Engine

model = models.resnet18()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
lr_scheduler = colossalai.nn.lr_scheduler.CosineAnnealingLR(optimizer, 1000)
schedule = colossalai.engine.NoPipelineSchedule()

my_engine = Engine(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    step_schedule=schedule
)
```
@@ -51,21 +52,24 @@ More information regarding the class can be found in the API references.

### Overview

To learn how to customize a trainer which meets your needs, let's first take a look at the `Trainer` class. We highly recommend that you read the *Get Started* section and *Build your engine* first.

The `Trainer` class enables researchers and engineers to use our system more conveniently. Instead of having to write your own scripts, you can simply construct your own trainer by calling the `Trainer` class, just like what we did in the following code block.

```python
MyTrainer = Trainer(my_engine)
```

After that, you can use the `fit` method to train or evaluate your model. In order to make our `Trainer` class even more powerful, we incorporate a set of handy tools into the class. For example, you can monitor or record the running states and metrics which indicate the current performance of the model. These functions are realized by hooks. The `BasicHook` class allows you to execute your hook functions at specified times. We have already created some practical hooks for you, as listed below. What you need to do is just pick the ones which suit your needs. Detailed descriptions of the class can be found in the API references.

```python
hooks = [
```
@@ -80,18 +84,21 @@ hooks = [

```python
]
```

These hook functions will record metrics, elapsed time and memory usage and write them to a log after each epoch. Besides, they print the current loss and accuracy to let users monitor the performance of the model.
### Hook

If you have specific needs, feel free to extend our `BaseHook` class to add your own functions, or our `MetricHook` class to write a metric collector. These hook functions can be called at twelve points in the trainer's life cycle. Besides, you can define the priorities of all hooks to arrange their execution order. More information can be found in the API references.
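For illustration only, a custom hook might look like the sketch below. The import path and the hook-point name `after_train_iter` are assumptions on our part, not the documented API; consult the API references for the twelve real hook points and their exact signatures.

```python
from colossalai.trainer.hooks import BaseHook  # assumed import path


class IterationCounterHook(BaseHook):
    """Illustrative custom hook; the hook-point name and the way priority is
    configured are placeholders rather than the documented interface."""

    def __init__(self):
        super().__init__()
        self.num_iterations = 0

    def after_train_iter(self, *args, **kwargs):
        # hypothetical hook point assumed to run once per training iteration
        self.num_iterations += 1
```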
### Metric

You can write your own metrics by extending our `Metric` class. It should be used with the `MetricHook` class. When you write your own metric hooks, please set the priority carefully and make sure the hook is called before other hooks which might require the results of the metric hook.

We've already provided some metric hooks, and we store metric objects in `runner.states['metrics']`. It is a dictionary, and metrics can be accessed by their names.
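As a small usage illustration (the metric name `Loss` is hypothetical), fetching a metric object recorded by one of the provided hooks is plain dictionary access:

```python
# runner.states['metrics'] is a dictionary keyed by metric name, so a metric
# registered under the (hypothetical) name 'Loss' can be looked up like this.
metrics = runner.states['metrics']
loss_metric = metrics['Loss']
```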
@@ -14,28 +14,30 @@ def process_function(dataloader, model, criterion, optim):

```python
    optim.step()
```
In `ignite.engine` and `keras.engine`, the process function must be provided by the user. However, it is hard for users to write a process function for pipeline parallelism. To offer users convenient hybrid parallelism, we provide the powerful `Engine` class, which supports pipeline parallelism and provides a non-interleaved forward-backward strategy. You can also use a learning rate scheduler that you defined beforehand in the `Engine` class to adjust the learning rate during training.

You only need to define the variables `model`, `criterion`, `optimizer`, `lr_scheduler` and `schedule` to construct the engine; the following code block gives such an example. **If you start with `colossalai.initialize`, the engine is built from the config file automatically.**
```python
import torch
import torch.nn as nn
import torchvision.models as models
import colossalai
from colossalai.engine import Engine

model = models.resnet18()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
lr_scheduler = colossalai.nn.lr_scheduler.CosineAnnealingLR(optimizer, 1000)
schedule = colossalai.engine.NoPipelineSchedule()

my_engine = Engine(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    step_schedule=schedule
)
```
@@ -48,10 +50,12 @@ MyEngine = Engine(

The `Trainer` class is designed to let researchers and engineers use our system more conveniently. You do not need to write your own scripts; just call the `Trainer` class to construct your trainer, as is done in the following code block.

```python
MyTrainer = Trainer(my_engine)
```
After that, you can use the `fit` method to train or evaluate your model. In addition, to make our `Trainer` class even more powerful, we have added a series of handy tools. For example, you can continuously monitor and record the running state and performance of the model during training; these functions are realized through hooks. The `BasicHook` class lets you execute your hook functions at specified times. As shown in the code block below, we have pre-defined some practical hook functions for you; all you need to do is pick the ones that meet your needs. More information about this class can be found in the API references.

```python
hooks = [
```

@@ -70,7 +74,8 @@ hooks = [
### Hook

If you have specific needs, you can extend our `BaseHook` class to add your own hook functions, or extend our `MetricHook` to write the metrics you need. These hook functions can be executed at twelve points in the `Trainer` life cycle. More information about these classes can be found in the API references.
### Metric