[doc] FP8 training and communication document (#6050)

* Add FP8 training and communication document

* add fp8 docstring for plugins

* fix typo

* fix typo
Author: Guangyao Zhang
Committed: 2024-09-14 11:01:05 +08:00 (via GitHub)
Parent: f20b066c59
Commit: bdb125f83f
7 changed files with 31 additions and 7 deletions


@@ -9,6 +9,7 @@ Author: [Mingyan Jiang](https://github.com/jiangmingyan)
**Related Paper**
- [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)
- [FP8 Formats for Deep Learning](https://arxiv.org/pdf/2209.05433)
## Introduction
@@ -60,7 +61,11 @@ However, there are other operations, like reductions, which require the dynamic
## AMP in Colossal-AI
We supported three AMP training methods and allowed the user to train with AMP with no code. If you want to train with amp, just assign `mixed_precision` with `fp16` when you instantiate the `Booster`. Next we will support `bf16`, `fp8`.
We support three AMP training methods and allow users to train with AMP with no code changes. If you want to train with AMP, just assign `mixed_precision` with `fp16` when you instantiate the `Booster`. Next we will support `bf16`.
Currently we only support `fp8` mixed precision training for the `Linear` layer. Please specify the `use_fp8` parameter when creating the plugin object.
To reduce the inter-node communication volume in low-bandwidth scenarios, we support FP8 communication compression. Please specify the `fp8_communication` parameter when creating the plugin object.
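The snippet below is a minimal sketch of how these two flags might be passed when creating a plugin; it assumes `GeminiPlugin` accepts both `use_fp8` and `fp8_communication` as described above, and is meant as an illustration rather than a canonical recipe.

```python
# A minimal sketch, assuming GeminiPlugin exposes the use_fp8 and
# fp8_communication flags described above.
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

plugin = GeminiPlugin(
    use_fp8=True,            # FP8 mixed precision for Linear layers
    fp8_communication=True,  # compress inter-node collectives to FP8
)
booster = Booster(plugin=plugin)
```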
### Start with Booster
@@ -74,7 +79,6 @@ instantiate `Booster` with `mixed_precision="fp16"`, then you can train with tor
'fp16': torch amp
'fp16_apex': apex amp,
'bf16': bf16,
'fp8': fp8,
'fp16_naive': naive amp
"""
from colossalai import Booster
@@ -128,6 +132,10 @@ The output model is converted to AMP model of smaller memory consumption.
If your input model is already too large to fit in a GPU, please instantiate your model weights in `dtype=torch.float16`.
Otherwise, try smaller models or checkout more parallelization training techniques!
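As a toy illustration of the tip above (plain PyTorch, nothing Colossal-AI specific), the weights can be allocated in half precision at construction time:

```python
import torch
import torch.nn as nn

# Allocate the parameters directly in torch.float16, so a full fp32 copy
# of the model never needs to fit in GPU memory.
model = nn.Sequential(
    nn.Linear(4096, 4096, dtype=torch.float16),
    nn.GELU(),
    nn.Linear(4096, 4096, dtype=torch.float16),
)
```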
### FP8 Communication
In low-bandwidth scenarios, to reduce the communication load across multiple nodes, we support FP8 communication compression, which can be enabled by passing `fp8_communication=True` when you create the plugin object (such as `GeminiPlugin`). Inter-node all-to-all, all-gather and P2P operations will then use the FP8 format for data transmission. FP8 communication for reduction operators such as all-reduce and reduce-scatter is currently not supported due to the lack of support in the NCCL library.
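The same flag is intended to work with other plugins as well; the sketch below assumes `HybridParallelPlugin` also accepts `fp8_communication` (the parallel sizes are placeholders).

```python
# A sketch, assuming HybridParallelPlugin also accepts fp8_communication;
# tp_size/pp_size are placeholder values for a multi-node setup.
from colossalai.booster.plugin import HybridParallelPlugin

plugin = HybridParallelPlugin(
    tp_size=2,
    pp_size=1,
    fp8_communication=True,  # all-to-all / all-gather / P2P sent in FP8
)
```

All-reduce and reduce-scatter still run in their original precision, as noted above.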
## Hands-on Practice
Now we will introduce the use of AMP with Colossal-AI. In this practice, we will use Torch AMP as an example.