Mirror of https://github.com/hpcaitech/ColossalAI.git (synced 2025-09-12 20:54:35 +00:00)
[doc] added reference to related works (#2994)
* [doc] added reference to related works
* polish code
@@ -119,5 +119,6 @@ model on a single machine.
</figure>
Related paper:
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
- [PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management](https://arxiv.org/abs/2108.05818)
@@ -5,6 +5,11 @@ Author: Hongxin Liu
**Prerequisite:**
- [Zero Redundancy Optimizer with chunk-based memory management](../features/zero_with_chunk.md)
**Related Paper**
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
## Introduction
If a model has `N` parameters, Adam keeps two extra states (momentum and variance) for each parameter, which take `8N` bytes in fp32. For billion-scale models, optimizer states alone take at least 32 GB of memory. GPU memory therefore limits the model scale we can train, which is known as the GPU memory wall. If we offload optimizer states to disk, we can break through the GPU memory wall.
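A minimal sketch of this arithmetic, assuming fp32 Adam states at 8 bytes per parameter; the helper `adam_state_gb` below is purely illustrative and not part of the ColossalAI API:

```python
# Back-of-the-envelope estimate of Adam optimizer-state memory.
# Assumption: momentum and variance are stored in fp32, i.e. 2 * 4 = 8 bytes per parameter.

BYTES_PER_PARAM_ADAM_STATES = 8  # fp32 exp_avg (4 B) + fp32 exp_avg_sq (4 B)


def adam_state_gb(num_params: int) -> float:
    """Memory (in GB) taken by Adam's optimizer states for a model with `num_params` parameters."""
    return num_params * BYTES_PER_PARAM_ADAM_STATES / 1e9


for billions in (1, 4, 10, 100):
    n = billions * 10**9
    print(f"{billions:>3}B params -> {adam_state_gb(n):7.1f} GB of Adam states")

# A 4-billion-parameter model already needs ~32 GB just for optimizer states,
# which is why moving them out of GPU memory onto NVMe/disk breaks the memory wall.
```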
@@ -1,6 +1,7 @@
# Zero Redundancy Optimizer with chunk-based memory management
Author: [Hongxiu Liu](https://github.com/ver217), [Jiarui Fang](https://github.com/feifeibear), [Zijian Ye](https://github.com/ZijianYY)
**Prerequisite:**
- [Define Your Configuration](../basics/define_your_config.md)
@@ -9,9 +10,11 @@ Author: [Hongxiu Liu](https://github.com/ver217), [Jiarui Fang](https://github.c
- [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt)
**Related Paper**
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
- [DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters](https://dl.acm.org/doi/10.1145/3394486.3406703)
- [PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management](https://arxiv.org/abs/2108.05818)
## Introduction