[example] add train resnet/vit with booster example (#3694)

* [example] add train vit with booster example

* [example] update readme

* [example] add train resnet with booster example

* [example] enable ci

* [example] enable ci

* [example] add requirements

* [hotfix] fix analyzer init

* [example] update requirements
Hongxin Liu
2023-05-08 10:42:30 +08:00
committed by GitHub
parent 2629f9717d
commit f83ea813f5
17 changed files with 578 additions and 174 deletions


@@ -0,0 +1,37 @@
# Train ViT on CIFAR-10 from scratch
## 🚀 Quick Start
This example provides a script for training ViT on the CIFAR-10 dataset from scratch.
- Training Arguments
  - `-p`, `--plugin`: Plugin to use. Choices: `torch_ddp`, `torch_ddp_fp16`, `low_level_zero`. Defaults to `torch_ddp`.
  - `-r`, `--resume`: Resume from checkpoint file path. Defaults to `-1`, which disables resuming.
  - `-c`, `--checkpoint`: The folder to save checkpoints to. Defaults to `./checkpoint`.
  - `-i`, `--interval`: Interval, in epochs, between checkpoint saves. Defaults to `5`. If set to `0`, no checkpoint is saved.
  - `--target_acc`: Target accuracy. An exception is raised if it is not reached. Defaults to `None`.
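
The arguments above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the actual `train.py`: the function name `build_parser`, the help strings, and the argument types (in particular treating `--resume` as a string and `--target_acc` as a float) are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented training arguments."""
    parser = argparse.ArgumentParser(description="Train ViT on CIFAR-10 (sketch)")
    parser.add_argument(
        "-p", "--plugin",
        type=str,
        default="torch_ddp",
        choices=["torch_ddp", "torch_ddp_fp16", "low_level_zero"],
        help="plugin to use",
    )
    # Assumption: the value is kept as a string; "-1" means "do not resume".
    parser.add_argument("-r", "--resume", type=str, default="-1",
                        help="checkpoint to resume from, -1 to disable")
    parser.add_argument("-c", "--checkpoint", type=str, default="./checkpoint",
                        help="folder to save checkpoints to")
    parser.add_argument("-i", "--interval", type=int, default=5,
                        help="epoch interval between checkpoint saves, 0 to disable")
    # Assumption: accuracy is passed as a float; None skips the check.
    parser.add_argument("--target_acc", type=float, default=None,
                        help="target accuracy, raise if not reached")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Restricting `--plugin` with `choices` means an unsupported plugin name is rejected at startup rather than partway through training.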
### Install requirements
```bash
pip install -r requirements.txt
```
### Train
```bash
# train with torch DDP with fp32
colossalai run --nproc_per_node 4 train.py -c ./ckpt-fp32
# train with torch DDP with mixed precision training
colossalai run --nproc_per_node 4 train.py -c ./ckpt-fp16 -p torch_ddp_fp16
# train with low level zero
colossalai run --nproc_per_node 4 train.py -c ./ckpt-low_level_zero -p low_level_zero
```
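
Since none of these commands pass `-i`, they use the default checkpoint interval of `5`. As a rough sketch of how that setting could translate into a save schedule, the helper below decides whether a checkpoint is written after a given epoch. The name `should_save_checkpoint` and the zero-based epoch counter are assumptions for illustration; the actual `train.py` may behave differently.

```python
def should_save_checkpoint(epoch: int, interval: int) -> bool:
    """Return True if a checkpoint should be written after `epoch`.

    Assumptions (not taken from train.py): `epoch` is zero-based, and a
    checkpoint is written once every `interval` completed epochs.
    An interval of 0 disables saving, as the argument list describes.
    """
    if interval <= 0:
        return False
    return (epoch + 1) % interval == 0


if __name__ == "__main__":
    # List the zero-based epochs that would trigger a save in a 12-epoch run.
    saved = [e for e in range(12) if should_save_checkpoint(e, 5)]
    print(saved)
```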
Expected accuracy:
| Model | Single-GPU Baseline FP32 | Booster DDP with FP32 | Booster DDP with FP16 | Booster Low Level Zero |
| --------- | ------------------------ | --------------------- | --------------------- | ---------------------- |
| ViT | 83.00% | 84.03% | 84.00% | 84.43% |
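
The `--target_acc` argument can turn figures like these into an automated check, for example in CI. The sketch below shows one way such a check might look. The function name `check_target_accuracy`, the use of `RuntimeError`, and the convention that accuracy is a fraction between 0 and 1 are all assumptions, and the values in the usage lines are hypothetical rather than measured results.

```python
from typing import Optional


def check_target_accuracy(accuracy: float, target_acc: Optional[float]) -> None:
    """Raise if `accuracy` falls short of `target_acc`.

    Assumptions (not taken from train.py): both values are fractions in
    [0, 1], and a target of None disables the check.
    """
    if target_acc is None:
        return
    if accuracy < target_acc:
        raise RuntimeError(
            f"accuracy {accuracy:.4f} is below the target {target_acc:.4f}"
        )


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    check_target_accuracy(accuracy=0.84, target_acc=None)  # check skipped
    check_target_accuracy(accuracy=0.84, target_acc=0.80)  # target reached
```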