Develop/experiments (#59)

* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapt torch amp to work with tensor parallel (#18)

* fixed compatibility bugs between torch amp and tensor parallel and made some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>

* Split conv2d, class token, and positional embedding in 2D; fix random numbers in DDP;
fix convergence on CIFAR-10 and ImageNet-1K

* Integrate 1d tensor parallel in Colossal-AI (#39)

* fixed 1D and 2D convergence (#38)

* optimized 2D operations

* fixed 1D ViT convergence problem

* Feature/ddp (#49)

* remove redundant func in setup (#19) (#20)

* use env to control the language of doc (#24) (#25)

* Support TP-compatible Torch AMP and Update trainer API (#27)

* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapt torch amp to work with tensor parallel (#18)

* fixed compatibility bugs between torch amp and tensor parallel and made some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)

* add explanation for ViT example (#35) (#36)

* support torch ddp

* fix loss accumulation

* add log for ddp

* change seed

* modify timing hook

Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* Feature/pipeline (#40)

* remove redundant func in setup (#19) (#20)

* use env to control the language of doc (#24) (#25)

* Support TP-compatible Torch AMP and Update trainer API (#27)

* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapt torch amp to work with tensor parallel (#18)

* fixed compatibility bugs between torch amp and tensor parallel and made some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)

* add explanation for ViT example (#35) (#36)

* optimize communication of pipeline parallel

* fix grad clip for pipeline

Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* optimized 3D layers to fix slow computation; tested ImageNet performance with 3D; reworked lr_scheduler config definition; fixed launch args; fixed some printing issues; simplified APIs of 3D layers (#51)

* Update 2.5D layer code to get similar accuracy on the ImageNet-1K dataset

* update api for better usability (#58)

update api for better usability

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: puck_WCR <46049915+WANG-CR@users.noreply.github.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
Author: Frank Lee
Date: 2021-12-09 15:08:29 +08:00 (committed by GitHub)
Parent: eb2f8b1f6b
Commit: da01c234e1
229 changed files with 6532 additions and 8741 deletions

View File

@@ -1,20 +1,4 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "colossal_cifar_demo.ipynb",
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
@@ -27,6 +11,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -34,14 +19,10 @@
"id": "vP7LvCpG23a2",
"outputId": "b37f7203-8a02-4736-c527-603f2bb34d7d"
},
"source": [
"!pip install ColossalAI deepspeed"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: ColossalAI in /usr/local/lib/python3.7/dist-packages (0.1)\n",
"Requirement already satisfied: deepspeed in /usr/local/lib/python3.7/dist-packages (0.5.4)\n",
@@ -60,10 +41,14 @@
"Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from triton->deepspeed) (3.3.0)\n"
]
}
],
"source": [
"!pip install ColossalAI deepspeed"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -71,24 +56,23 @@
"id": "UVKEurtS4SFS",
"outputId": "99fb6050-5da7-4f27-b4eb-9b3ccf830efb"
},
"source": [
"import colossalai\n",
"from colossalai.engine import Engine, NoPipelineSchedule\n",
"from colossalai.trainer import Trainer\n",
"from colossalai.context import Config\n",
"import torch"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"output_type": "stream",
"text": [
"Please install apex to use FP16 Optimizer\n",
"Apex should be installed to use the FP16 optimizer\n",
"apex is required for mixed precision training\n"
]
}
],
"source": [
"import colossalai\n",
"from colossalai.engine import Engine, NonPipelineSchedule\n",
"from colossalai.trainer import Trainer\n",
"from colossalai.context import Config\n",
"import torch"
]
},
{
@@ -102,6 +86,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -109,6 +94,28 @@
"id": "8yF7Lc-K7NAS",
"outputId": "01312349-a8b0-4de4-9103-7d1b48e6cc36"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"colossalai - torch.distributed.distributed_c10d - 2021-10-15 03:27:51,596 INFO: Added key: store_based_barrier_key:1 to store for rank: 0\n",
"colossalai - torch.distributed.distributed_c10d - 2021-10-15 03:27:51,598 INFO: Rank 0: Completed store-based barrier for 1 nodes.\n",
"colossalai - torch.distributed.distributed_c10d - 2021-10-15 03:27:51,602 INFO: Added key: store_based_barrier_key:2 to store for rank: 0\n",
"colossalai - torch.distributed.distributed_c10d - 2021-10-15 03:27:51,605 INFO: Rank 0: Completed store-based barrier for 1 nodes.\n",
"colossalai - torch.distributed.distributed_c10d - 2021-10-15 03:27:51,608 INFO: Added key: store_based_barrier_key:3 to store for rank: 0\n",
"colossalai - torch.distributed.distributed_c10d - 2021-10-15 03:27:51,610 INFO: Rank 0: Completed store-based barrier for 1 nodes.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"process rank 0 is bound to device 0\n",
"initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1124,the default parallel seed is ParallelMode.DATA.\n"
]
}
],
"source": [
"parallel_cfg = Config(dict(parallel=dict(\n",
" data=dict(size=1),\n",
@@ -121,29 +128,6 @@
" host='127.0.0.1',\n",
" port=8888,\n",
" backend='nccl')"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"colossalai - torch.distributed.distributed_c10d - 2021-10-15 03:27:51,596 INFO: Added key: store_based_barrier_key:1 to store for rank: 0\n",
"colossalai - torch.distributed.distributed_c10d - 2021-10-15 03:27:51,598 INFO: Rank 0: Completed store-based barrier for 1 nodes.\n",
"colossalai - torch.distributed.distributed_c10d - 2021-10-15 03:27:51,602 INFO: Added key: store_based_barrier_key:2 to store for rank: 0\n",
"colossalai - torch.distributed.distributed_c10d - 2021-10-15 03:27:51,605 INFO: Rank 0: Completed store-based barrier for 1 nodes.\n",
"colossalai - torch.distributed.distributed_c10d - 2021-10-15 03:27:51,608 INFO: Added key: store_based_barrier_key:3 to store for rank: 0\n",
"colossalai - torch.distributed.distributed_c10d - 2021-10-15 03:27:51,610 INFO: Rank 0: Completed store-based barrier for 1 nodes.\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"process rank 0 is bound to device 0\n",
"initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1124,the default parallel seed is ParallelMode.DATA.\n"
]
}
]
},
{
@@ -157,13 +141,24 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZyGhyD47-dUY",
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ZyGhyD47-dUY",
"outputId": "98bbf2d1-a1c4-4bb4-b6df-600777b1e8f5"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Files already downloaded and verified\n",
"Files already downloaded and verified\n"
]
}
],
"source": [
"transform_cfg = [\n",
" dict(type='ToTensor'),\n",
@@ -179,17 +174,6 @@
"\n",
"testset = colossalai.nn.data.CIFAR10Dataset(transform_cfg, root='./data', train=False)\n",
"testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=2)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Files already downloaded and verified\n",
"Files already downloaded and verified\n"
]
}
]
},
{
@@ -203,9 +187,11 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cQ_y7lBG09LS"
},
"outputs": [],
"source": [
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
@@ -232,9 +218,7 @@
"\n",
"\n",
"model = Net().cuda()"
],
"execution_count": null,
"outputs": []
]
},
{
"cell_type": "markdown",
@@ -247,6 +231,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -254,6 +239,18 @@
"id": "YtaDoCax1BCf",
"outputId": "b33b1641-03d8-4597-c8c2-1a4c1d61e9b0"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"colossalai - rank_0 - 2021-10-15 03:27:56,018 WARNING: No gradient handler is set up, please make sure you do not need to all-reduce the gradients after a training step.\n",
"colossalai - rank_0 - 2021-10-15 03:27:56,024 INFO: build LogMetricByEpochHook for train, priority = 1\n",
"colossalai - rank_0 - 2021-10-15 03:27:56,026 INFO: build LossHook for train, priority = 10\n",
"colossalai - rank_0 - 2021-10-15 03:27:56,029 INFO: build AccuracyHook for train, priority = 10\n"
]
}
],
"source": [
"import torch.optim as optim\n",
"\n",
@@ -270,19 +267,6 @@
"trainer = Trainer(engine=engine,\n",
" hooks_cfg=[dict(type='LossHook'), dict(type='LogMetricByEpochHook'), dict(type='AccuracyHook')],\n",
" verbose=True)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"colossalai - rank_0 - 2021-10-15 03:27:56,018 WARNING: No gradient handler is set up, please make sure you do not need to all-reduce the gradients after a training step.\n",
"colossalai - rank_0 - 2021-10-15 03:27:56,024 INFO: build LogMetricByEpochHook for train, priority = 1\n",
"colossalai - rank_0 - 2021-10-15 03:27:56,026 INFO: build LossHook for train, priority = 10\n",
"colossalai - rank_0 - 2021-10-15 03:27:56,029 INFO: build AccuracyHook for train, priority = 10\n"
]
}
]
},
{
@@ -296,6 +280,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -303,22 +288,10 @@
"id": "w-J3IP-J1sfx",
"outputId": "bdb76939-04f1-4124-ce5e-3af44c0d902c"
},
"source": [
"num_epochs = 10\n",
"test_interval = 1\n",
"trainer.fit(\n",
" train_dataloader=trainloader,\n",
" test_dataloader=testloader,\n",
" max_epochs=num_epochs,\n",
" display_progress=True,\n",
" test_interval=test_interval\n",
" )"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"output_type": "stream",
"text": [
"[Epoch 0 train]: 0%| | 0/391 [00:00<?, ?it/s]/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)\n",
" return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)\n",
@@ -364,7 +337,34 @@
"colossalai - rank_0 - 2021-10-15 03:30:57,332 INFO: Testing - Epoch 10 - LogMetricByEpochHook: Loss = 1.41242, Accuracy = 0.48500\n"
]
}
],
"source": [
"num_epochs = 10\n",
"test_interval = 1\n",
"trainer.fit(\n",
" train_dataloader=trainloader,\n",
" test_dataloader=testloader,\n",
" max_epochs=num_epochs,\n",
" display_progress=True,\n",
" test_interval=test_interval\n",
" )"
]
}
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"name": "colossal_cifar_demo.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@@ -3,13 +3,13 @@
import colossalai
from colossalai.core import global_context as gpc
from colossalai.logging import get_global_dist_logger
from colossalai.logging import get_dist_logger
from colossalai.trainer import Trainer
def run_trainer():
engine, train_dataloader, test_dataloader = colossalai.initialize()
logger = get_global_dist_logger()
logger = get_dist_logger()
engine.schedule.data_sync = False
logger.info("engine is built", ranks=[0])

View File

@@ -1,54 +1,40 @@
# Overview
A common way to speed up AI model training is to implement large-batch training with the help of data parallelism, but this requires expensive supercomputer clusters. In this example, we used a small server with only 4 GPUs to reproduce the large-scale pre-training of Vision Transformer (ViT) on ImageNet-1K in 14 hours.
Here is an example of training ViT-B/16 on ImageNet-1K with a batch size of 32K.
We use 8x NVIDIA A100 GPUs in this example.
# How to run
On a single server, you can directly use torch.distributed to start pre-training on multiple GPUs in parallel.
```shell
python -m torch.distributed.launch --nproc_per_node <num_of_gpus> train_dali.py --world_size <num_of_gpus> --config <path to your config file>
```
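For instance, one possible invocation on the 8-GPU server described above (using the `vit-b16.py` config discussed below) would be:
```shell
python -m torch.distributed.launch --nproc_per_node 8 train_dali.py --world_size 8 --config vit-b16.py
```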
For scaling on a GPU cluster, you can use the [Slurm](https://slurm.schedmd.com/documentation.html) Workload Manager to start the following commands and get running environment information.
Using [Slurm](https://slurm.schedmd.com/documentation.html):
```shell
srun python train_dali.py --local_rank=$SLURM_PROCID --world_size=$SLURM_NPROCS --host=$HOST --port=29500 --config=vit-b16.py
```
# Experiments
To facilitate more people to reproduce the experiments with large-scale data parallel, we pre-trained ViT-Base/32 in only 14.58 hours on a small server with 4 NVIDIA A100 GPUs using ImageNet-1K dataset with batch size 32K for 300 epochs maintaining accuracy. For more complex pre-training of ViT-Base/16 and ViT-Large/32, it also takes only 78.58 hours and 37.83 hours to complete. Since the server used in this example is not a standard NVIDIA DGX A100 supercomputing unit, perhaps a better acceleration can be obtained on more professional hardware.
# Results
![Loss Curve](./loss.jpeg)
![Accuracy](./acc.jpeg)
As can be seen from the above figure, the ViT model eventually converges well after training 300 epochs. It is worth noting that, unlike the common small-batch training convergence process, the model performance has a temporary decline in the middle of the large-batch training process. This is due to the difficulty of convergence in large-batch training. As the number of iterations is reduced, a larger learning rate is needed to ensure the final convergence. Since we did not carefully adjust the parameters, perhaps other parameter settings could get better convergence.
# Details
`vit-b16.py`
This is a [configuration file](https://colossalai.org/config.html) that defines training parameters used by Colossal-AI, such as model, dataset, training methods (optimizer, learning rate scheduler, number of epochs, etc.). The config content can be accessed through `gpc.config` in the program.
It is a [config file](https://colossalai.org/config.html), which is used by Colossal-AI to define all kinds of training arguments, such as the model, dataset, and training method (optimizer, lr_scheduler, epochs, etc.). You can access the config content through `gpc.config`.
In this example, we trained ViT-Base/16 for 300 epochs on the ImageNet-1K dataset. The batch size is expanded to 32K through data parallelism. Since only 4 A100 GPUs on one small server are used, and the GPU memory is limited, the batch size of 32K cannot be used directly. Therefore, the batch size used on each GPU is only 256, and the 256 batch size is equivalently expanded to 8K through gradient accumulation 32 times. Finally, data parallelism is used between 4 GPUs to achieve an equivalent batch size of 32K.
In this example, we train the ViT-Base/16 model for 300 epochs on ImageNet-1K. The batch size is scaled to 32K through data parallelism (4K on each GPU, from 16x gradient accumulation with a per-step batch size of 256). Since this batch size is much larger than in common usage and leads to convergence difficulties, we use the
large-batch optimizer [LAMB](https://arxiv.org/abs/1904.00962), which allows us to scale the batch size to 32K with little accuracy loss. The learning rate and weight decay of the optimizer are set to 1.8e-2 and 0.1, respectively. We use a linear warmup learning rate scheduler with 150 warmup epochs.
We introduce FP16 mixed precision to accelerate training and use gradient clipping to help convergence.
For simplicity and speed, we didn't apply `RandAug` and just used [Mixup](https://arxiv.org/abs/1710.09412) in data augmentation.
Since the batch size of 32K far exceeds the use range of common optimizers and is difficult to train, we use the large-batch optimizer [LAMB](https://arxiv.org/abs/1904.00962) provided by Colossal-AI to achieve a better convergence. The learning rate and weight decay of [LAMB](https://arxiv.org/abs/1904.00962) are set to 1.8e-2 and 0.1, respectively. The learning rate scheduler uses a linear warmup strategy of 150 epochs. We also used FP16 mixed precision to speed up the training process, and introduced gradient clipping to help convergence. For simplicity and speed, we only use [Mixup](https://arxiv.org/abs/1710.09412) instead of `RandAug` in data augmentation.
If you have enough computing resources, you can expand this example conveniently with data parallel on a very large scale without gradient accumulation, and finish the training process even within one hour.
By tuning the parallelism, this example can be quickly deployed to a single server with several GPUs or to a large cluster with many nodes and GPUs. If there are enough computing resources to allow data parallelism to be directly extended to hundreds or even thousands of GPUs, a training process that would take several days on a single A100 GPU can be shortened to less than half an hour. A rough sketch of how these settings map onto a config file is shown below.
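The following is an illustrative sketch only: the `engine` block mirrors the config change shown at the end of this commit, while the optimizer and scheduler keys (and their type names) are assumptions for illustration, not the repository's actual `vit-b16.py`.
```python
# Illustrative config sketch; only the `engine` block is taken from the diff in
# this commit. Other keys and type names are assumptions, not the real config.
BATCH_SIZE = 256            # per-GPU, per-step batch size
NUM_EPOCHS = 300

optimizer = dict(
    type='Lamb',            # hypothetical registry name for the LAMB optimizer
    lr=1.8e-2,
    weight_decay=0.1,
)

lr_scheduler = dict(
    type='LinearWarmupLR',  # hypothetical name; 150 warmup epochs as described above
    warmup_epochs=150,
)

engine = dict(
    schedule=None,
    gradient_handlers=None,
    gradient_accumulation=16,   # 256 x 16 = 4K per GPU
    gradient_clipping=1.0,
)
```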
`imagenet_dali_dataloader.py`
To accelerate the training process, we use [DALI](https://github.com/NVIDIA/DALI) to read data and require the dataset to be in TFRecord format, which avoids directly reading a large number of raw image files and being limited by the efficiency of the file system.
To accelerate the training process, we use [DALI](https://github.com/NVIDIA/DALI) as the data loader. Note that it requires the dataset to be in TFRecord format, which avoids reading a large number of raw image files and being limited by file system efficiency.
`train_dali.py`
We call DALI in this file to read data and start the training process using Colossal-AI.
We build the DALI data loader and set up the training process with Colossal-AI here.
`mixup.py`
Since Mixup is used as data augmentation, we define the loss function of Mixup here.
Since we use Mixup, we define the mixup loss in this file.
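For reference, a mixup loss typically combines the criterion against both of the labels mixed into each sample; a minimal PyTorch sketch (not necessarily the repository's `mixup.py`) might look like:
```python
import torch.nn.functional as F

def mixup_criterion(pred, y_a, y_b, lam):
    # Convex combination of the losses against the two mixed labels,
    # weighted by the mixing coefficient lam sampled for the batch.
    return lam * F.cross_entropy(pred, y_a) + (1 - lam) * F.cross_entropy(pred, y_b)
```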
`hooks.py`
We define hook functions that record running information to help debugging.
# How to build TFRecords dataset
As we use [DALI](https://github.com/NVIDIA/DALI) to read data, we use the TFRecords dataset instead of raw Imagenet dataset. If you don't have TFRecords dataset, follow [imagenet-tools](https://github.com/ver217/imagenet-tools) to build one.
We also define useful hooks that log information to help debugging.

View File

@@ -3,7 +3,7 @@ import os
import colossalai
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.logging import get_global_dist_logger
from colossalai.logging import get_dist_logger
from colossalai.trainer import Trainer
from colossalai.utils import set_global_multitimer_status
from dataloader.imagenet_dali_dataloader import DaliDataloader
@@ -49,7 +49,7 @@ def main():
train_dataloader=build_dali_train,
test_dataloader=build_dali_test
)
logger = get_global_dist_logger()
logger = get_dist_logger()
set_global_multitimer_status(True)
timer = colossalai.utils.get_global_multitimer()
trainer = Trainer(engine=engine,

View File

@@ -73,6 +73,6 @@ dali = dict(
engine = dict(
schedule=None,
gradient_handlers=None,
gradient_accumulation=32,
gradient_accumulation=16,
gradient_clipping=1.0,
)
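As a quick sanity check of the numbers above (assuming 8-way data parallelism as stated in the README), the effective global batch size still works out to 32K after this change:
```python
per_gpu_batch = 256
grad_accum = 16          # gradient_accumulation value after this change
data_parallel_size = 8   # 8x A100 GPUs
print(per_gpu_batch * grad_accum * data_parallel_size)  # 32768, i.e. 32K
```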