[misc] refactor launch API and tensor constructor (#5666)

* [misc] remove config arg from initialize

* [misc] remove old tensor constructor

* [plugin] add npu support for ddp

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [devops] fix doc test ci

* [test] fix test launch

* [doc] update launch doc

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Hongxin Liu
2024-04-29 10:40:11 +08:00
committed by GitHub
parent 91fa553775
commit 7f8b16635b
223 changed files with 294 additions and 403 deletions
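
For callers that still pass `config`, a small shim can make the migration mechanical. The following is a minimal sketch and not part of this commit: `drop_config_kwarg` is a hypothetical helper name, and it relies only on what the hunks below show, namely that the launch calls no longer take a `config` argument.

```python
import warnings
from typing import Any, Dict


def drop_config_kwarg(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of ``kwargs`` without the removed ``config`` entry.

    Hypothetical migration helper: it never mutates the caller's dict and
    emits a DeprecationWarning whenever it actually drops something.
    """
    cleaned = dict(kwargs)
    if "config" in cleaned:
        cleaned.pop("config")
        warnings.warn(
            "the `config` argument was removed from the launch API; ignoring it",
            DeprecationWarning,
            stacklevel=2,
        )
    return cleaned


# Intended use (assumes colossalai is installed and a launcher started the process):
#   colossalai.launch_from_torch(**drop_config_kwarg({"config": {}, "seed": 42}))
```

Filtering at the call site keeps old scripts running during the transition while the warning points at every place that still needs cleaning up.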

View File

@@ -62,7 +62,7 @@ plugin = HybridParallelPlugin(
## Create the distributed environment
```python
# Launch ColossalAI
-colossalai.launch_from_torch(config={}, seed=42)
+colossalai.launch_from_torch(seed=42)
coordinator = DistCoordinator()
```
## Define the training components for the GPT-2 model

View File

@@ -70,7 +70,7 @@ PP_SIZE = 2
First, we create a distributed environment.
```python
# Launch ColossalAI
-colossalai.launch_from_torch(config={}, seed=SEED)
+colossalai.launch_from_torch(seed=SEED)
coordinator = DistCoordinator()
world_size = coordinator.world_size
```

View File

@@ -60,7 +60,7 @@ from colossalai.booster.plugin import TorchDDPPlugin
def train():
    # launch colossalai
-    colossalai.launch(config=dict(), rank=rank, world_size=world_size, port=port, host='localhost')
+    colossalai.launch(rank=rank, world_size=world_size, port=port, host='localhost')
    # create plugin and objects for training
    plugin = TorchDDPPlugin()

View File

@@ -74,8 +74,7 @@ import colossalai
args = colossalai.get_default_parser().parse_args()
# launch distributed environment
-colossalai.launch(config=args.config,
-                  rank=args.rank,
+colossalai.launch(rank=args.rank,
                  world_size=args.world_size,
                  host=args.host,
                  port=args.port,
@@ -93,20 +92,11 @@ PyTorch's built-in launcher has to be invoked on every node in order to launch multi
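First, we need to specify the launch method in the code. Because this launcher is a wrapper around the PyTorch launcher, the natural choice is `colossalai.launch_from_torch`.
The arguments required for the distributed environment, such as rank, world size, host and port, are all set by the PyTorch launcher and can be read directly from environment variables.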
-config.py
-```python
-BATCH_SIZE = 512
-LEARNING_RATE = 3e-3
-WEIGHT_DECAY = 0.3
-NUM_EPOCHS = 2
-```
train.py
```python
import colossalai
-colossalai.launch_from_torch(
-    config="./config.py",
-)
+colossalai.launch_from_torch()
...
```
@@ -186,7 +176,6 @@ colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1 --e
import colossalai
colossalai.launch_from_slurm(
-    config=<CONFIG>,
    host=args.host,
    port=args.port
)
@@ -206,7 +195,6 @@ srun python train.py --host <master_node> --port 29500
You can try the following in your training script.
```python
colossalai.launch_from_openmpi(
-    config=<CONFIG>,
    host=args.host,
    port=args.port
)
@@ -219,3 +207,5 @@ mpirun --hostfile <my_hostfile> -np <num_process> python train.py --host <node n
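- --hostfile: specifies the list of hosts to run on.
- --np: sets the total number of processes (GPUs) to launch. For example, with --np 4, 4 python processes will be initialized to run train.py.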
+<!-- doc-test-command: echo -->
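
The launch doc changed in this file says that rank, world size, host and port are set by the PyTorch launcher and read from environment variables. A minimal sketch of that lookup follows. It is not ColossalAI's implementation: `DistEnv` and `read_torch_dist_env` are hypothetical names, and the sketch assumes only the variables that `torchrun` exports (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`).

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistEnv(NamedTuple):
    """Distributed settings resolved from the launcher's environment."""

    rank: int
    local_rank: int
    world_size: int
    host: str
    port: int


def read_torch_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read the per-process variables exported by the PyTorch launcher.

    ``env`` defaults to ``os.environ``; passing a plain dict makes the
    function easy to exercise without starting a distributed job.
    """
    env = os.environ if env is None else env
    required = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(
            "process was not started by a PyTorch launcher; missing: "
            + ", ".join(missing)
        )
    return DistEnv(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        host=env["MASTER_ADDR"],
        port=int(env["MASTER_PORT"]),
    )


if __name__ == "__main__":
    # Illustrative values only; a real run gets them from torchrun.
    demo = {
        "RANK": "1",
        "LOCAL_RANK": "1",
        "WORLD_SIZE": "4",
        "MASTER_ADDR": "localhost",
        "MASTER_PORT": "29500",
    }
    print(read_torch_dist_env(demo))
```

Because every setting comes from the environment, the training script needs no launch arguments of its own, which is why `colossalai.launch_from_torch()` can be called with no parameters after this change.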

View File

@@ -46,7 +46,7 @@ parser = colossalai.get_default_parser()
args = parser.parse_args()
# launch from torch
-colossalai.launch_from_torch(config=dict())
+colossalai.launch_from_torch()
```

View File

@@ -61,7 +61,7 @@ from colossalai.nn.lr_scheduler import CosineAnnealingLR
We need to initialize the distributed environment. For a quick demonstration, we use `launch_from_torch`. You can refer to [Launch Colossal-AI](../basics/launch_colossalai.md)
```python
-colossalai.launch_from_torch(config=dict())
+colossalai.launch_from_torch()
logger = get_dist_logger()
```

View File

@@ -29,7 +29,7 @@ from colossalai.booster.plugin import GeminiPlugin
from transformers import LlamaForCausalLM, LlamaConfig, BertForPreTraining
-colossalai.launch({})
+colossalai.launch()
plugin = GeminiPlugin()
booster = Booster(plugin)

View File

@@ -19,11 +19,11 @@ AMP stands for automatic mixed precision training.
2. apex.amp
3. naive amp
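-| Colossal-AI    | Supports tensor parallelism | Supports pipeline parallelism | fp16 extent |
-| -------------- | --------------------------- | ----------------------------- | ----------- |
-| AMP_TYPE.TORCH | ✅ | ❌ | Model parameters, activations and gradients are downcast to fp16 during forward and backward propagation |
-| AMP_TYPE.APEX  | ❌ | ❌ | More fine-grained; we can choose opt_level O0, O1, O2, O3 |
-| AMP_TYPE.NAIVE | ✅ | ✅ | Model parameters and forward and backward operations are all downcast to fp16 |
+| Colossal-AI    | Supports tensor parallelism | Supports pipeline parallelism | fp16 extent |
+|----------------|-----------------------------|-------------------------------|-------------|
+| AMP_TYPE.TORCH | ✅ | ❌ | Model parameters, activations and gradients are downcast to fp16 during forward and backward propagation |
+| AMP_TYPE.APEX  | ❌ | ❌ | More fine-grained; we can choose opt_level O0, O1, O2, O3 |
+| AMP_TYPE.NAIVE | ✅ | ✅ | Model parameters and forward and backward operations are all downcast to fp16 |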
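The first two rely on the original implementations in PyTorch (1.6 and above) and NVIDIA Apex. The last method is similar to Apex O2. Among these methods, Apex AMP is not compatible with tensor parallelism. This is because tensors are split across devices under tensor parallelism, so the processes must communicate with each other to check whether inf or nan appears anywhere in the model weights. We modified the torch amp implementation so that it is now compatible with tensor parallelism.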
@@ -153,7 +153,7 @@ parser = colossalai.get_default_parser()
args = parser.parse_args()
# launch from torch
-colossalai.launch_from_torch(config=dict())
+colossalai.launch_from_torch()
```
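
The AMP doc in this file explains that under tensor parallelism each process holds only a slice of the weights, so the inf/nan check has to be combined across processes. The sketch below illustrates that idea with plain Python lists standing in for per-rank shards. It is an assumption-laden toy, not the ColossalAI code: `local_overflow` and `global_overflow` are hypothetical names, and a real job would replace the `max` over flags with a collective such as `torch.distributed.all_reduce` using `ReduceOp.MAX`.

```python
import math
from typing import Sequence


def local_overflow(shard: Sequence[float]) -> bool:
    """True if this rank's slice of the values contains inf or nan."""
    return any(math.isinf(x) or math.isnan(x) for x in shard)


def global_overflow(shards: Sequence[Sequence[float]]) -> bool:
    """Combine per-rank flags the way a MAX all-reduce would.

    Each rank contributes 0 or 1; taking the maximum means every rank
    learns about an overflow even when its own shard is clean.
    """
    flags = [1 if local_overflow(shard) else 0 for shard in shards]
    return max(flags, default=0) == 1


if __name__ == "__main__":
    # Two simulated ranks: only the second shard has overflowed.
    shards = [[0.5, -1.25], [3.0, float("inf")]]
    print([local_overflow(s) for s in shards])
    print(global_overflow(shards))
```

Without the reduction, the first rank would apply its optimizer step while the second skips it, and the two halves of the model would fall out of sync.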

View File

@@ -175,7 +175,7 @@ Mem usage: 4968.016 MB
```python
def train_gemini_cpu(nvme_offload_fraction: float = 0.0):
-    colossalai.launch_from_torch({})
+    colossalai.launch_from_torch()
    config = GPT2Config()
    with ColoInitContext(device=torch.cuda.current_device()):
        model = GPT2LMHeadModel(config)

View File

@@ -174,7 +174,7 @@ def main():
    SEQ_LEN = 1024
    VOCAB_SIZE = 50257
    NUM_STEPS = 10
-    colossalai.launch_from_torch(config={})
+    colossalai.launch_from_torch()
    # build criterion
    criterion = GPTLMLoss()