mirror of
https://github.com/hpcaitech/ColossalAI.git
synced 2026-05-05 12:24:38 +00:00
[SC] add GPT example for auto checkpoint (#1889)
* [sc] SC tutorial for auto checkpoint
* [sc] polish examples
* [sc] polish readme
* [sc] polish readme and help information
* [sc] polish readme and help information
@@ -15,3 +15,82 @@ export DATA=/path/to/data
```bash
colossalai run --nproc_per_node 4 auto_parallel_demo.py
```

## Auto Checkpoint Benchmarking

We provide three demos for benchmarking auto checkpoint. `demo_resnet50.py` and `demo_gpt2_medium.py` show how the solver searches for a checkpoint strategy that fits within a given memory budget.

Both scripts are used as follows:

```bash
python demo_resnet50.py --help
usage: ResNet50 Auto Activation Benchmark [-h] [--batch_size BATCH_SIZE] [--num_steps NUM_STEPS] [--sample_points SAMPLE_POINTS] [--free_memory FREE_MEMORY]
                                          [--start_factor START_FACTOR]

optional arguments:
  -h, --help            show this help message and exit
  --batch_size BATCH_SIZE
                        batch size for benchmark, default 128
  --num_steps NUM_STEPS
                        number of test steps for benchmark, default 5
  --sample_points SAMPLE_POINTS
                        number of sample points for benchmark from start memory budget to maximum memory budget (free_memory), default 15
  --free_memory FREE_MEMORY
                        maximum memory budget in MB for benchmark, default 11000 MB
  --start_factor START_FACTOR
                        start memory budget factor for benchmark, the start memory budget will be free_memory / start_factor, default 4

# run with default settings
python demo_resnet50.py

python demo_gpt2_medium.py --help
usage: GPT2 medium Auto Activation Benchmark [-h] [--batch_size BATCH_SIZE] [--num_steps NUM_STEPS] [--sample_points SAMPLE_POINTS] [--free_memory FREE_MEMORY]
                                             [--start_factor START_FACTOR]

optional arguments:
  -h, --help            show this help message and exit
  --batch_size BATCH_SIZE
                        batch size for benchmark, default 8
  --num_steps NUM_STEPS
                        number of test steps for benchmark, default 5
  --sample_points SAMPLE_POINTS
                        number of sample points for benchmark from start memory budget to maximum memory budget (free_memory), default 15
  --free_memory FREE_MEMORY
                        maximum memory budget in MB for benchmark, default 56000 MB
  --start_factor START_FACTOR
                        start memory budget factor for benchmark, the start memory budget will be free_memory / start_factor, default 10

# run with default settings
python demo_gpt2_medium.py
```
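
The relationship between `--free_memory`, `--start_factor`, and `--sample_points` can be sketched as below. This is a minimal illustration, not the demos' actual code: the helper name `sample_budgets` is hypothetical, and evenly spaced points are an assumption, since the help text only states the start budget (`free_memory / start_factor`), the maximum budget, and the number of points.

```python
def sample_budgets(free_memory, start_factor, sample_points):
    """Return memory budgets in MB from free_memory / start_factor up to free_memory.

    Even spacing is an assumption; the demo scripts may sample differently.
    """
    start = free_memory / start_factor
    if sample_points < 2:
        # A single point can only be the maximum budget.
        return [float(free_memory)]
    step = (free_memory - start) / (sample_points - 1)
    return [start + i * step for i in range(sample_points)]


# ResNet50 defaults from the help text: 11000 MB, start factor 4, 15 points.
resnet_budgets = sample_budgets(11000, 4, 15)
print(resnet_budgets[0], resnet_budgets[-1])
```

Under the same assumption, the GPT2 medium defaults (56000 MB, start factor 10) would give a first budget of 5600 MB.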

Here are some results for reference:

### ResNet 50


### GPT2 Medium


We also provide `demo_resnet152.py` to demonstrate the benefit of auto activation checkpoint with large batch sizes. Its usage is as follows:

```bash
python demo_resnet152.py --help
usage: ResNet152 Auto Activation Through Put Benchmark [-h] [--num_steps NUM_STEPS]

optional arguments:
  -h, --help            show this help message and exit
  --num_steps NUM_STEPS
                        number of test steps for benchmark, default 5

# run with default settings
python demo_resnet152.py
```
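
For readers who want to see what an images-per-second figure means, here is a rough sketch of a throughput measurement over `num_steps` steps. It is not taken from `demo_resnet152.py`: `measure_throughput` and `dummy_step` are hypothetical names, and the formula `batch_size * num_steps / elapsed` is an assumption about how such a number is typically computed.

```python
import time


def measure_throughput(step_fn, batch_size, num_steps=5):
    """Run step_fn num_steps times and return processed samples per second."""
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return batch_size * num_steps / elapsed


def dummy_step():
    # Stand-in for one forward/backward pass on a batch (hypothetical).
    time.sleep(0.01)


throughput = measure_throughput(dummy_step, batch_size=512)
print(f"through put: {throughput:.3f} images/s")
```

On a GPU, kernels run asynchronously, so a real measurement would likely need a synchronization call such as `torch.cuda.synchronize()` before reading the clock.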

Here are some results from our runs for reference:

```bash
===============test summary================
batch_size: 512, peak memory: 73314.392 MB, through put: 254.286 images/s
batch_size: 1024, peak memory: 73316.216 MB, through put: 397.608 images/s
batch_size: 2048, peak memory: 72927.837 MB, through put: 277.429 images/s
```
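
To compare runs, the summary lines can be parsed with a few lines of Python. The sketch below is only an illustration: the regular expression assumes the exact line format shown above, and `parse_summary` is a hypothetical helper, not part of the demo.

```python
import re

SUMMARY = """\
batch_size: 512, peak memory: 73314.392 MB, through put: 254.286 images/s
batch_size: 1024, peak memory: 73316.216 MB, through put: 397.608 images/s
batch_size: 2048, peak memory: 72927.837 MB, through put: 277.429 images/s
"""

# Matches one summary line; assumes the format printed in the sample output.
LINE = re.compile(
    r"batch_size: (\d+), peak memory: ([\d.]+) MB, through put: ([\d.]+) images/s"
)


def parse_summary(text):
    """Turn summary lines into a list of dicts with numeric fields."""
    rows = []
    for match in LINE.finditer(text):
        rows.append(
            {
                "batch_size": int(match.group(1)),
                "peak_memory_mb": float(match.group(2)),
                "throughput": float(match.group(3)),
            }
        )
    return rows


rows = parse_summary(SUMMARY)
best = max(rows, key=lambda row: row["throughput"])
print(best["batch_size"])  # prints 1024
```

In this sample, batch size 1024 gives the highest throughput, while peak memory stays close to 73 GB for all three batch sizes.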

Each of the tests above prints a test summary and produces a plot of the benchmarking results.