Mirror of https://github.com/hpcaitech/ColossalAI.git
[misc] Add dist optim to doc sidebar (#5806)
* add to sidebar
* fix chinese
This commit is contained in:
parent 8795bb2e80
commit 7f9ec599be
@@ -56,6 +56,7 @@
   "features/pipeline_parallel",
   "features/nvme_offload",
   "features/lazy_init",
+  "features/distributed_optimizers",
   "features/cluster_utils"
 ]
 },
@@ -14,12 +14,6 @@ Apart from the widely adopted Adam and SGD, many modern optimizers require layer
 ## Optimizers
 
 Adafactor is a first-order Adam variant that uses non-negative matrix factorization (NMF) to reduce its memory footprint. CAME improves on this by introducing a confidence matrix to correct the NMF approximation. GaLore further reduces memory by projecting gradients into a low-rank space and applying 8-bit block-wise quantization. Lamb allows huge batch sizes without losing accuracy via a layer-wise adaptive update bounded by the inverse of its Lipschitz constant.
 
-## API Reference
-
-{{ autodoc:colossalai.nn.optimizer.distributed_adafactor.DistributedAdaFactor }}
-{{ autodoc:colossalai.nn.optimizer.distributed_lamb.DistributedLamb }}
-{{ autodoc:colossalai.nn.optimizer.distributed_galore.DistGaloreAwamW }}
-{{ autodoc:colossalai.nn.optimizer.distributed_came.DistributedCAME }}
 ## Hands-On Practice
 
 We now demonstrate how to use Distributed Adafactor with the booster API, combining Tensor Parallel and ZeRO 2 on 4 GPUs. **Note that even if you are not familiar with distributed optimizers, the plugins automatically cast your optimizer to the distributed version for convenience.**
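The memory saving described in the optimizer overview above comes from Adafactor replacing the full element-wise second-moment matrix with row and column statistics. Below is a minimal sketch of that factorization in plain PyTorch, for illustration only; the function is our own and not part of ColossalAI's API.

```python
import torch

def factored_second_moment(grad: torch.Tensor, eps: float = 1e-30):
    """Approximate the (n, m) second-moment matrix of `grad` with per-row and
    per-column statistics, so an optimizer stores n + m values instead of n * m."""
    sq = grad.float() ** 2 + eps
    row = sq.mean(dim=-1)                        # per-row mean of squared gradients, shape (n,)
    col = sq.mean(dim=-2)                        # per-column mean, shape (m,)
    approx = torch.outer(row, col) / row.mean()  # rank-1 reconstruction used to scale the update
    return row, col, approx
```

The distributed variants maintain these factored statistics even when the underlying weight is sharded across tensor-parallel ranks.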
@@ -140,3 +134,10 @@ optim = DistGaloreAwamW(
 </table>
 
 <!-- doc-test-command: colossalai run --nproc_per_node 4 distributed_optimizers.py -->
+
+## API Reference
+
+{{ autodoc:colossalai.nn.optimizer.distributed_adafactor.DistributedAdaFactor }}
+{{ autodoc:colossalai.nn.optimizer.distributed_lamb.DistributedLamb }}
+{{ autodoc:colossalai.nn.optimizer.distributed_galore.DistGaloreAwamW }}
+{{ autodoc:colossalai.nn.optimizer.distributed_came.DistributedCAME }}
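To make the Hands-On Practice setup concrete, here is a minimal sketch of launching Distributed Adafactor through the booster API with tensor-parallel size 2 and ZeRO 2 on 4 GPUs. The toy nn.Linear model, the bf16 precision choice, and the bare constructor call are illustrative assumptions; the full example in the doc uses a ShardFormer-supported transformer instead.

```python
# Illustrative sketch; launch with: colossalai run --nproc_per_node 4 distributed_optimizers.py
import colossalai
import torch.nn as nn
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin
from colossalai.nn.optimizer.distributed_adafactor import DistributedAdaFactor

colossalai.launch_from_torch()  # older releases require launch_from_torch(config={})

model = nn.Linear(1024, 1024).cuda()              # stand-in; tensor parallelism needs a ShardFormer-supported model
optim = DistributedAdaFactor(model.parameters())  # a plain optimizer would also be cast to its distributed version

# 4 GPUs = tensor-parallel group of 2 x ZeRO-2 data-parallel group of 2
plugin = HybridParallelPlugin(tp_size=2, pp_size=1, zero_stage=2, precision="bf16")
booster = Booster(plugin=plugin)
model, optim, *_ = booster.boost(model, optim)
```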
@@ -13,12 +13,6 @@ Author: Wenxuan Tan, Junwen Duan, Renjie Mao
 ## Optimizers
 
 Adafactor is a first-order Adam variant that uses non-negative matrix factorization (NMF) to reduce memory usage. CAME improves the NMF approximation by introducing a confidence matrix. GaLore further reduces memory usage by projecting gradients into a low-rank space and using 8-bit block-wise quantization. Lamb allows huge batch sizes without losing accuracy through layer-wise adaptive updates bounded by the inverse of the Lipschitz constant.
 
-## API Reference
-
-{{ autodoc:colossalai.nn.optimizer.distributed_adafactor.DistributedAdaFactor }}
-{{ autodoc:colossalai.nn.optimizer.distributed_lamb.DistributedLamb }}
-{{ autodoc:colossalai.nn.optimizer.distributed_galore.DistGaloreAwamW }}
-{{ autodoc:colossalai.nn.optimizer.distributed_came.DistributedCAME }}
 ## Usage
 
 We now demonstrate how to use Distributed Adafactor with the booster API, combining Tensor Parallel and ZeRO 2. Even if you are not using a distributed optimizer, the plugin automatically converts your optimizer to the distributed version for convenience.
@@ -137,3 +131,10 @@ optim = DistGaloreAwamW(
 </table>
 
 <!-- doc-test-command: colossalai run --nproc_per_node 4 distributed_optimizers.py -->
+
+## API Reference
+
+{{ autodoc:colossalai.nn.optimizer.distributed_adafactor.DistributedAdaFactor }}
+{{ autodoc:colossalai.nn.optimizer.distributed_lamb.DistributedLamb }}
+{{ autodoc:colossalai.nn.optimizer.distributed_galore.DistGaloreAwamW }}
+{{ autodoc:colossalai.nn.optimizer.distributed_came.DistributedCAME }}