[doc] add tutorial for booster checkpoint (#3785)

* [doc] add checkpoint related docstr for booster
* [doc] add en checkpoint doc
* [doc] add zh checkpoint doc
* [doc] add booster checkpoint doc in sidebar
* [doc] add caution about ckpt for plugins
* [doc] add doctest placeholder
* [doc] add doctest placeholder
* [doc] add doctest placeholder

docs/source/en/basics/booster_checkpoint.md (new file, 48 lines)
@@ -0,0 +1,48 @@

# Booster Checkpoint

Author: [Hongxin Liu](https://github.com/ver217)

**Prerequisite:**

- [Booster API](./booster_api.md)

## Introduction

We've introduced the [Booster API](./booster_api.md) in the previous tutorial. In this tutorial, we will show how to save and load checkpoints using Booster.
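
The examples below assume that the objects have already been created and boosted, along the lines of the following minimal sketch. The plugin choice, model, optimizer, scheduler, and hyperparameters are placeholders, and the exact launch call may differ between Colossal-AI versions.

```python
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

# Assumes the script is launched with `torchrun`; the launch API may differ across versions.
colossalai.launch_from_torch(config={})

plugin = TorchDDPPlugin()  # any plugin works; see the plugin docs for checkpoint caveats
booster = Booster(plugin=plugin)

# Placeholder model, optimizer and LR scheduler.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)

# Objects must be boosted before their checkpoints can be saved or loaded.
model, optimizer, _, _, lr_scheduler = booster.boost(model, optimizer, lr_scheduler=lr_scheduler)
```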

## Model Checkpoint

{{ autodoc:colossalai.booster.Booster.save_model }}

The model must be boosted by `colossalai.booster.Booster` before saving. `checkpoint` is the path to the saved checkpoint. It can be a file if `shard=False`; otherwise, it should be a directory. If `shard=True`, the checkpoint is saved in a sharded way, which is useful when the checkpoint is too large to be saved in a single file. Our sharded checkpoint format is compatible with [huggingface/transformers](https://github.com/huggingface/transformers).
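
For example, using the boosted model from the setup sketch above (the paths are arbitrary):

```python
# Save the whole model into a single file.
booster.save_model(model, "model.pt")

# Save the model in a sharded way; here the checkpoint path is a directory.
booster.save_model(model, "model_ckpt", shard=True)
```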

{{ autodoc:colossalai.booster.Booster.load_model }}

The model must be boosted by `colossalai.booster.Booster` before loading. The checkpoint format is detected automatically and loaded in the corresponding way.
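
Loading mirrors saving, for example:

```python
# The checkpoint may be a single file or a sharded checkpoint directory;
# the format is detected automatically.
booster.load_model(model, "model.pt")
```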

## Optimizer Checkpoint

> ⚠ Saving the optimizer checkpoint in a sharded way is not supported yet.

{{ autodoc:colossalai.booster.Booster.save_optimizer }}

The optimizer must be boosted by `colossalai.booster.Booster` before saving.

{{ autodoc:colossalai.booster.Booster.load_optimizer }}

The optimizer must be boosted by `colossalai.booster.Booster` before loading.
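
For example, with the boosted optimizer from the setup sketch above (the path is arbitrary):

```python
# Sharded saving is not supported for optimizers yet, so save to a single file.
booster.save_optimizer(optimizer, "optimizer.pt")

# Later, load it back into a boosted optimizer.
booster.load_optimizer(optimizer, "optimizer.pt")
```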

## LR Scheduler Checkpoint

{{ autodoc:colossalai.booster.Booster.save_lr_scheduler }}

The LR scheduler must be boosted by `colossalai.booster.Booster` before saving. `checkpoint` is the local path to the checkpoint file.

{{ autodoc:colossalai.booster.Booster.load_lr_scheduler }}

The LR scheduler must be boosted by `colossalai.booster.Booster` before loading. `checkpoint` is the local path to the checkpoint file.
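
For example (the file name is arbitrary):

```python
booster.save_lr_scheduler(lr_scheduler, "lr_scheduler.pt")
booster.load_lr_scheduler(lr_scheduler, "lr_scheduler.pt")
```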

## Checkpoint Design

More details about the checkpoint design can be found in our discussion [A Unified Checkpoint System Design](https://github.com/hpcaitech/ColossalAI/discussions/3339).

<!-- doc-test-command: echo -->

@@ -43,12 +43,16 @@ We've tested compatibility on some famous models, following models may not be su

Compatibility problems will be fixed in the future.

> ⚠ This plugin can currently only load an optimizer checkpoint that was saved by itself with the same number of processes. This will be fixed in the future.

### Gemini Plugin

This plugin implements Zero-3 with chunk-based and heterogeneous memory management. It can train large models without much loss in speed. Note that it does not support local gradient accumulation. More details can be found in [Gemini Doc](../features/zero_with_chunk.md).

{{ autodoc:colossalai.booster.plugin.GeminiPlugin }}

> ⚠ This plugin can currently only load an optimizer checkpoint that was saved by itself with the same number of processes. This will be fixed in the future.
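
As an illustration only, checkpointing with this plugin uses the same Booster calls shown above; the default `GeminiPlugin()` constructor and the paths below are placeholders and may need adjusting for your version and setup:

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

plugin = GeminiPlugin()  # placeholder defaults; tune placement and memory options as needed
booster = Booster(plugin=plugin)
model, optimizer, *_ = booster.boost(model, optimizer)

booster.save_model(model, "gemini_model.pt")
# Note: the optimizer checkpoint must be reloaded with the same number of processes.
booster.save_optimizer(optimizer, "gemini_optimizer.pt")
```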

### Torch DDP Plugin

More details can be found in [Pytorch Docs](https://pytorch.org/docs/main/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel).

@@ -62,3 +66,5 @@ More details can be found in [Pytorch Docs](https://pytorch.org/docs/main/genera

More details can be found in [Pytorch Docs](https://pytorch.org/docs/main/fsdp.html).

{{ autodoc:colossalai.booster.plugin.TorchFSDPPlugin }}

<!-- doc-test-command: echo -->