# Train ResNet on CIFAR-10 from scratch

## 🚀 Quick Start

This example provides a training script and an evaluation script. The training script shows how to train ResNet on the CIFAR-10 dataset from scratch.

- Training Arguments
  - `-p`, `--plugin`: Plugin to use. Choices: `torch_ddp`, `torch_ddp_fp16`, `low_level_zero`. Defaults to `torch_ddp`.
  - `-r`, `--resume`: Resume from checkpoint file path. Defaults to `-1`, which means not resuming.
  - `-c`, `--checkpoint`: The folder to save checkpoints. Defaults to `./checkpoint`.
  - `-i`, `--interval`: Epoch interval to save checkpoints. Defaults to `5`. If set to `0`, no checkpoint will be saved.
  - `--target_acc`: Target accuracy. Raises an exception if not reached. Defaults to `None`.

- Eval Arguments
  - `-e`, `--epoch`: The epoch to evaluate.
  - `-c`, `--checkpoint`: The folder where checkpoints are found.
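As a rough illustration of how the training arguments above fit together, here is a minimal `argparse` sketch. It is not the actual `train.py`: the function name `build_train_parser` is hypothetical, and the argument types (in particular treating `--resume` as a string with `-1` as the sentinel) are assumptions, since the README states only names, choices, and defaults.

```python
import argparse


def build_train_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented training arguments."""
    parser = argparse.ArgumentParser(description="Train ResNet on CIFAR-10 (sketch)")
    parser.add_argument(
        "-p", "--plugin",
        choices=["torch_ddp", "torch_ddp_fp16", "low_level_zero"],
        default="torch_ddp",
        help="Plugin to use.",
    )
    # Assumption: kept as a string because the README describes a checkpoint
    # path with "-1" meaning "do not resume"; the real script may differ.
    parser.add_argument("-r", "--resume", default="-1",
                        help="Resume from checkpoint file path; -1 disables resuming.")
    parser.add_argument("-c", "--checkpoint", default="./checkpoint",
                        help="Folder to save checkpoints.")
    # Assumption: the interval is an integer number of epochs.
    parser.add_argument("-i", "--interval", type=int, default=5,
                        help="Epoch interval to save checkpoints; 0 disables saving.")
    # Assumption: the target accuracy is parsed as a float.
    parser.add_argument("--target_acc", type=float, default=None,
                        help="Target accuracy; an exception is raised if not reached.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_train_parser().parse_args(["-p", "low_level_zero", "-c", "./ckpt-low_level_zero"])
print(args.plugin, args.checkpoint, args.interval)
```

Restricting `--plugin` with `choices` means an unsupported plugin name is rejected at parse time rather than partway through setup.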
### Install requirements

```bash
pip install -r requirements.txt
```
### Train

The checkpoint folders will be created automatically.

```bash
# train with torch DDP in fp32
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp32

# train with torch DDP and mixed precision (fp16)
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp16 -p torch_ddp_fp16

# train with low level zero
colossalai run --nproc_per_node 2 train.py -c ./ckpt-low_level_zero -p low_level_zero
```
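The checkpoint behaviour described for `-c` and `-i` can be sketched as follows. This is an illustration only, not the code in `train.py`: the helper names `should_save_checkpoint` and `ensure_checkpoint_dir` are hypothetical, and the rule of saving after every `interval`-th completed epoch (with 0-indexed epochs) is an assumption about how the interval is applied.

```python
import os
import tempfile


def should_save_checkpoint(epoch: int, interval: int) -> bool:
    """Return True if a checkpoint is due after finishing `epoch` (0-indexed).

    Assumption: a checkpoint is written after every `interval`-th completed
    epoch. An interval of 0 disables saving, as the README states.
    """
    if interval <= 0:
        return False
    return (epoch + 1) % interval == 0


def ensure_checkpoint_dir(path: str) -> str:
    """Create the checkpoint folder if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path


# Use a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as root:
    ckpt_dir = ensure_checkpoint_dir(os.path.join(root, "ckpt-fp32"))
    # 1-indexed epoch numbers at which a checkpoint would be written.
    saved = [e + 1 for e in range(20) if should_save_checkpoint(e, interval=5)]
    print(os.path.isdir(ckpt_dir), saved)
```

Under this assumption, the default interval of `5` yields checkpoints at epochs 5, 10, 15, and so on.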
### Eval

```bash
# evaluate fp32 training
python eval.py -c ./ckpt-fp32 -e 80

# evaluate fp16 mixed precision training
python eval.py -c ./ckpt-fp16 -e 80

# evaluate low level zero training
python eval.py -c ./ckpt-low_level_zero -e 80
```
Expected accuracy:

| Model     | Single-GPU Baseline FP32 | Booster DDP with FP32 | Booster DDP with FP16 | Booster Low Level Zero | Booster Gemini |
| --------- | ------------------------ | --------------------- | --------------------- | ---------------------- | -------------- |
| ResNet-18 | 85.85%                   | 84.91%                | 85.46%                | 84.50%                 | 84.60%         |

**Note: the baseline is adapted from this [script](https://pytorch-tutorial.readthedocs.io/en/latest/tutorial/chapter03_intermediate/3_2_2_cnn_resnet_cifar10/) to use `torchvision.models.resnet18`.**
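The `--target_acc` option is described as raising an exception when the target is not reached. A minimal sketch of such a check is shown below. The function name `check_target_accuracy`, the use of `RuntimeError`, and expressing accuracy as a fraction in `[0, 1]` are all assumptions; the README does not say how the real script implements the check.

```python
from typing import Optional


def check_target_accuracy(accuracy: float, target_acc: Optional[float]) -> None:
    """Raise if `accuracy` is below `target_acc`; do nothing when no target is set.

    Assumption: both values are fractions in [0, 1], e.g. 0.8491 for 84.91%.
    """
    if target_acc is None:
        return  # mirrors the documented default of `None`: no check is made
    if accuracy < target_acc:
        raise RuntimeError(
            f"Accuracy {accuracy:.4f} did not reach the target {target_acc:.4f}"
        )


# A DDP FP32 run at 84.91% passes a hypothetical target of 84%.
check_target_accuracy(0.8491, 0.84)
```

A target slightly below the figures in the table leaves room for run-to-run variation.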