Mirror of https://github.com/hpcaitech/ColossalAI.git (synced 2025-05-05 06:58:09 +00:00)

Commit 7471f97fc3 (parent c2fdc6a011): update results on a single GPU, highlight quick view (#981)
Chinese README:

@@ -28,7 +28,7 @@
     <li><a href="#为何选择-Colossal-AI">Why Colossal-AI</a> </li>
     <li><a href="#特点">Features</a> </li>
     <li>
-      <a href="#展示样例">Demo</a>
+      <a href="#并行样例展示">Parallel Demo</a>
       <ul>
         <li><a href="#ViT">ViT</a></li>
        <li><a href="#GPT-3">GPT-3</a></li>
@@ -37,6 +37,13 @@
         <li><a href="#PaLM">PaLM</a></li>
       </ul>
     </li>
+    <li>
+      <a href="#单GPU样例展示">Single GPU Demo</a>
+      <ul>
+        <li><a href="#GPT-2-Single">GPT-2</a></li>
+        <li><a href="#PaLM-Single">PaLM</a></li>
+      </ul>
+    </li>
     <li>
       <a href="#安装">Installation</a>
@@ -83,7 +90,7 @@ Colossal-AI provides you with a collection of parallel training components. Our goal is to…
 - Configuration-file-based parallelism
 <p align="right">(<a href="#top">back to top</a>)</p>

-## Demo
+## Parallel Demo
 ### ViT
 <p align="center">
 <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/ViT.png" width="450" />
@@ -120,43 +127,49 @@

 <p align="right">(<a href="#top">back to top</a>)</p>

+## Single GPU Demo
+
+### GPT-2
+<p id="GPT-2-Single" align="center">
+<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/GPT2-GPU1.png" width=450/>
+</p>
+
+- Train a 20x larger model with the same hardware
+
+### PaLM
+<p id="PaLM-Single" align="center">
+<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/PaLM-GPU1.png" width=450/>
+</p>
+
+- Train a 34x larger model with the same hardware
+
+<p align="right">(<a href="#top">back to top</a>)</p>
+
 ## Installation

-### PyPI
+### Install From Official Releases

-```bash
-pip install colossalai
-```
-This command will install the CUDA extension if you have installed CUDA, NVCC and torch.

-If you don't want to install the CUDA extension, add `--global-option="--no_cuda_ext"` to the command, e.g.:
-```bash
-pip install colossalai --global-option="--no_cuda_ext"
-```
-
-If you want to use `ZeRO`, you can run:
-```bash
-pip install colossalai[zero]
-```
-
-### Install From Source Code
+You can visit our [Download](/download) page to install Colossal-AI; the releases published there come with pre-built CUDA extensions.
+
+### Install From Source

-> The version of Colossal-AI will be in line with the main branch of this project. Feel free to report any problems you run into via an issue :)
+> This document will be kept in line with the main branch of the repository. If you run into any problems, feel free to open an issue :)

 ```shell
 git clone https://github.com/hpcaitech/ColossalAI.git
 cd ColossalAI
-# 安装依赖
+
+# install dependency
 pip install -r requirements/requirements.txt

-# 安装 colossalai
+# install colossalai
 pip install .
 ```

-If you don't want to install and use CUDA kernel fusion (required when using the fused optimizer):
+If you don't want to install and enable CUDA kernel fusion (compulsory installation when using the fused optimizer):

 ```shell
-pip install --global-option="--no_cuda_ext" .
+NO_CUDA_EXT=1 pip install .
 ```

 <p align="right">(<a href="#top">back to top</a>)</p>
@@ -201,78 +214,23 @@ docker run -ti --gpus all --rm --ipc=host colossalai bash
 ### Start Distributed Training in a Few Lines

 ```python
-import colossalai
-from colossalai.utils import get_dataloader
-
-
-# my_config can be a path to a config file or a dictionary object
-# 'localhost' only applies to a single node; specify the node name when using multiple nodes
-colossalai.launch(
-    config=my_config,
-    rank=rank,
-    world_size=world_size,
-    backend='nccl',
-    port=29500,
-    host='localhost'
+parallel = dict(
+    pipeline=2,
+    tensor=dict(mode='2.5d', depth = 1, size=4)
 )
-
-# build your model
-model = ...
-
-# build your dataset; the dataloader uses a distributed data sampler by default
-train_dataset = ...
-train_dataloader = get_dataloader(dataset=dataset,
-                                  shuffle=True
-                                  )
-
-
-# build your optimizer
-optimizer = ...
-
-# build your loss function
-criterion = ...
-
-# initialize colossalai
-engine, train_dataloader, _, _ = colossalai.initialize(
-    model=model,
-    optimizer=optimizer,
-    criterion=criterion,
-    train_dataloader=train_dataloader
-)
-
-# start training
-engine.train()
-for epoch in range(NUM_EPOCHS):
-    for data, label in train_dataloader:
-        engine.zero_grad()
-        output = engine(data)
-        loss = engine.criterion(output, label)
-        engine.backward(loss)
-        engine.step()
-
 ```

-### Write a Simple 2D Parallel Model
+### Start Heterogeneous Training in a Few Lines

-Let's say we have a huge MLP model whose very large hidden size makes it difficult to fit into a single GPU. We can distribute the model weights across GPUs in a 2D mesh while still building the model in the way you are used to.
-
 ```python
-from colossalai.nn import Linear2D
-import torch.nn as nn
-
-
-class MLP_2D(nn.Module):
-
-    def __init__(self):
-        super().__init__()
-        self.linear_1 = Linear2D(in_features=1024, out_features=16384)
-        self.linear_2 = Linear2D(in_features=16384, out_features=1024)
-
-    def forward(self, x):
-        x = self.linear_1(x)
-        x = self.linear_2(x)
-        return x
-
+zero = dict(
+    model_config=dict(
+        tensor_placement_policy='auto',
+        shard_strategy=TensorShardStrategy(),
+        reuse_fp16_shard=True
+    ),
+    optimizer_config=dict(initial_scale=2**5, gpu_margin_mem_ratio=0.2)
+)
 ```

 <p align="right">(<a href="#top">back to top</a>)</p>

README.md (128 changed lines)

@@ -28,7 +28,7 @@
     <li><a href="#Why-Colossal-AI">Why Colossal-AI</a> </li>
     <li><a href="#Features">Features</a> </li>
     <li>
-      <a href="#Demo">Demo</a>
+      <a href="#Parallel-Demo">Parallel Demo</a>
       <ul>
         <li><a href="#ViT">ViT</a></li>
         <li><a href="#GPT-3">GPT-3</a></li>
@@ -37,6 +37,13 @@
         <li><a href="#PaLM">PaLM</a></li>
       </ul>
     </li>
+    <li>
+      <a href="#Single-GPU-Demo">Single GPU Demo</a>
+      <ul>
+        <li><a href="#GPT-2-Single">GPT-2</a></li>
+        <li><a href="#PaLM-Single">PaLM</a></li>
+      </ul>
+    </li>
     <li>
       <a href="#Installation">Installation</a>
@@ -88,7 +95,7 @@ distributed training in a few lines.

 <p align="right">(<a href="#top">back to top</a>)</p>

-## Demo
+## Parallel Demo
 ### ViT
 <p align="center">
 <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/ViT.png" width="450" />
@@ -124,27 +131,39 @@ Please visit our [documentation and tutorials](https://www.colossalai.org/) for

 <p align="right">(<a href="#top">back to top</a>)</p>

+## Single GPU Demo
+
+### GPT-2
+<p id="GPT-2-Single" align="center">
+<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/GPT2-GPU1.png" width=450/>
+</p>
+
+- 20x larger model size on the same hardware
+
+### PaLM
+<p id="PaLM-Single" align="center">
+<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/PaLM-GPU1.png" width=450/>
+</p>
+
+- 34x larger model size on the same hardware
+
+<p align="right">(<a href="#top">back to top</a>)</p>
+
 ## Installation

-### PyPI
+### Download From Official Releases

-```bash
-pip install colossalai
-```
-This command will install CUDA extension if your have installed CUDA, NVCC and torch.
+You can visit the [Download](/download) page to download Colossal-AI with pre-built CUDA extensions.

-If you don't want to install CUDA extension, you should add `--global-option="--no_cuda_ext"`, like:
-```bash
-pip install colossalai --global-option="--no_cuda_ext"
-```
-
-### Install From Source
+### Download From Source

-> The version of Colossal-AI will be in line with the main branch of the repository. Feel free to create an issue if you encounter any problems. :-)
+> The version of Colossal-AI will be in line with the main branch of the repository. Feel free to raise an issue if you encounter any problem. :)

 ```shell
 git clone https://github.com/hpcaitech/ColossalAI.git
 cd ColossalAI

 # install dependency
 pip install -r requirements/requirements.txt

@@ -155,7 +174,7 @@ pip install .
 If you don't want to install and enable CUDA kernel fusion (compulsory installation when using fused optimizer):

 ```shell
-pip install --global-option="--no_cuda_ext" .
+NO_CUDA_EXT=1 pip install .
 ```

 <p align="right">(<a href="#top">back to top</a>)</p>
@@ -200,80 +219,23 @@ Thanks so much to all of our amazing contributors!
 ### Start Distributed Training in Lines

 ```python
-import colossalai
-from colossalai.utils import get_dataloader
-
-
-# my_config can be path to config file or a dictionary obj
-# 'localhost' is only for single node, you need to specify
-# the node name if using multiple nodes
-colossalai.launch(
-    config=my_config,
-    rank=rank,
-    world_size=world_size,
-    backend='nccl',
-    port=29500,
-    host='localhost'
+parallel = dict(
+    pipeline=2,
+    tensor=dict(mode='2.5d', depth = 1, size=4)
 )
-
-# build your model
-model = ...
-
-# build you dataset, the dataloader will have distributed data
-# sampler by default
-train_dataset = ...
-train_dataloader = get_dataloader(dataset=dataset,
-                                  shuffle=True
-                                  )
-
-
-# build your optimizer
-optimizer = ...
-
-# build your loss function
-criterion = ...
-
-# initialize colossalai
-engine, train_dataloader, _, _ = colossalai.initialize(
-    model=model,
-    optimizer=optimizer,
-    criterion=criterion,
-    train_dataloader=train_dataloader
-)
-
-# start training
-engine.train()
-for epoch in range(NUM_EPOCHS):
-    for data, label in train_dataloader:
-        engine.zero_grad()
-        output = engine(data)
-        loss = engine.criterion(output, label)
-        engine.backward(loss)
-        engine.step()
-
 ```

-### Write a Simple 2D Parallel Model
+### Start Heterogeneous Training in Lines

-Let's say we have a huge MLP model and its very large hidden size makes it difficult to fit into a single GPU. We can
-then distribute the model weights across GPUs in a 2D mesh while you still write your model in a familiar way.
-
 ```python
-from colossalai.nn import Linear2D
-import torch.nn as nn
-
-
-class MLP_2D(nn.Module):
-
-    def __init__(self):
-        super().__init__()
-        self.linear_1 = Linear2D(in_features=1024, out_features=16384)
-        self.linear_2 = Linear2D(in_features=16384, out_features=1024)
-
-    def forward(self, x):
-        x = self.linear_1(x)
-        x = self.linear_2(x)
-        return x
-
+zero = dict(
+    model_config=dict(
+        tensor_placement_policy='auto',
+        shard_strategy=TensorShardStrategy(),
+        reuse_fp16_shard=True
+    ),
+    optimizer_config=dict(initial_scale=2**5, gpu_margin_mem_ratio=0.2)
+)
 ```

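The quick-view snippets added by this commit show only configuration dictionaries; the launch-and-train skeleton they plug into is the example code the same commit removes from both READMEs. Reassembled into one hedged sketch, with the free names from the original (`my_config`, `rank`, `world_size`, `NUM_EPOCHS`, and the model/dataset/optimizer/criterion placeholders) replaced by hypothetical stand-ins, it might look like this:

```python
# Sketch only: the colossalai.launch / colossalai.initialize / engine calls come from the
# example removed by this commit; the toy model, dataset and environment handling are stand-ins.
import os

import torch
from torch.utils.data import TensorDataset

import colossalai
from colossalai.utils import get_dataloader

# config may be a path to a config file or a dict; an empty dict keeps the defaults.
# The commit's quick-view `parallel` / `zero` dicts would go into this config, but note that
# the `parallel` example (2 pipeline stages x 4-way 2.5D tensor parallelism) needs at least
# 8 GPUs and a model built from Colossal-AI's parallel layers.
rank = int(os.environ.get('RANK', 0))
world_size = int(os.environ.get('WORLD_SIZE', 1))
colossalai.launch(config=dict(),
                  rank=rank,
                  world_size=world_size,
                  host='localhost',   # single node; use the node name for multi-node jobs
                  port=29500,
                  backend='nccl')

# toy stand-ins for the "model = ..." / "train_dataset = ..." placeholders in the README
model = torch.nn.Linear(32, 10)
train_dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
train_dataloader = get_dataloader(dataset=train_dataset, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

engine, train_dataloader, _, _ = colossalai.initialize(model=model,
                                                       optimizer=optimizer,
                                                       criterion=criterion,
                                                       train_dataloader=train_dataloader)

# training loop as in the removed README example, with an explicit device move added
engine.train()
for epoch in range(10):
    for data, label in train_dataloader:
        data, label = data.cuda(), label.cuda()
        engine.zero_grad()
        output = engine(data)
        loss = engine.criterion(output, label)
        engine.backward(loss)
        engine.step()
```

The script would be started with a distributed launcher such as `torchrun`, which sets the `RANK` and `WORLD_SIZE` variables read above.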
|
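The heterogeneous-training quick view (`zero = dict(...)`) is likewise a bare dictionary: it references `TensorShardStrategy` without an import and is meant to live in the config passed to `colossalai.launch`. A possible complete config file is sketched below; the file name and the import path are assumptions, since neither appears in this diff.

```python
# config.py (hypothetical file name) -- heterogeneous (ZeRO) training config built around the
# quick-view snippet this commit adds; the import path below is an assumption not shown here.
from colossalai.zero.shard_utils import TensorShardStrategy

zero = dict(
    model_config=dict(
        # 'auto' lets tensors migrate between GPU and CPU memory as GPU memory fills up,
        # which is what enables the 20x (GPT-2) / 34x (PaLM) single-GPU model sizes
        # highlighted by this commit.
        tensor_placement_policy='auto',
        shard_strategy=TensorShardStrategy(),
        reuse_fp16_shard=True
    ),
    optimizer_config=dict(
        initial_scale=2**5,          # initial loss scale for mixed-precision training
        gpu_margin_mem_ratio=0.2     # share of leftover GPU memory the optimizer may use
    )
)
```

With such a config, the launch call in the sketch above would become `colossalai.launch(config='./config.py', ...)`, and the model would normally be constructed inside Colossal-AI's ZeRO initialization context so that parameters are sharded at build time; that step is outside the scope of this commit and is only mentioned here for orientation.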