diff --git a/README-zh-Hans.md b/README-zh-Hans.md
new file mode 100644
index 000000000..208f85671
--- /dev/null
+++ b/README-zh-Hans.md
@@ -0,0 +1,207 @@
+# Colossal-AI
+
+[](https://www.colossalai.org/)
+
+
+
+
+
+ [](https://github.com/hpcaitech/ColossalAI/actions/workflows/PR_CI.yml)
+ [](https://colossalai.readthedocs.io/en/latest/?badge=latest)
+ [](https://codebeat.co/projects/github-com-hpcaitech-colossalai-main)
+
+ | [English](README.md) | [中文](README-zh-Hans.md) |
+
+An integrated large-scale model training system with efficient parallelization techniques.
+
+## Features
+
+Colossal-AI provides a collection of parallel training components. Our goal is to make training your distributed AI models as simple as training an ordinary single-GPU model. The friendly tools we provide let you kick off distributed training in just a few lines of code.
+
+- Data Parallelism
+- Pipeline Parallelism
+- 1D, 2D, 2.5D, 3D tensor parallelism
+- Sequence parallelism
+- Friendly trainer and engine
+- Extensible for new parallelism
+- Mixed Precision Training
+- Zero Redundancy Optimizer (ZeRO)
+
+## Examples
+### ViT
+
+![ViT with Colossal-AI](./docs/images/ViT_TP.png)
+
+- 14x larger batch size
+- 5x faster training
+
+### GPT-3 & GPT-2
+
+![GPT-2 & GPT-3 with Colossal-AI](./docs/images/GPT_2_3.png)
+
+- GPT-3: frees up 50% of GPU resources, or 10.7% acceleration
+- GPT-2: 11x lower GPU memory consumption, or superlinear scaling
+
+### BERT
+
+![BERT with Colossal-AI](./docs/images/BERT_seq.png)
+
+- 2x faster training
+- 1.5x longer sequence length
+
+Please visit our [documentation and tutorials](https://www.colossalai.org/) for more details.
+
+
+## Installation
+
+### PyPI
+
+```bash
+pip install colossalai
+```
+This command will install the CUDA extensions if CUDA, NVCC and torch are already installed on your machine.
+
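+If you are unsure whether these prerequisites are met, one quick way to check before installing is:
+
+```bash
+# verify that the NVCC compiler is available
+nvcc --version
+# verify that torch is installed and was built with CUDA support
+python -c "import torch; print(torch.__version__, torch.version.cuda)"
+```
+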
+If you do not want to install the CUDA extensions, add `--global-option="--no_cuda_ext"` to the command, for example:
+```bash
+pip install colossalai --global-option="--no_cuda_ext"
+```
+
+If you want to use `ZeRO`, you can run:
+```bash
+pip install colossalai[zero]
+```
+
+### Install From Source
+
+> The version of Colossal-AI will be in line with the main branch of the repository. Feel free to raise an issue if you encounter any problems. :)
+
+```shell
+git clone https://github.com/hpcaitech/ColossalAI.git
+cd ColossalAI
+# install dependencies
+pip install -r requirements/requirements.txt
+
+# install colossalai
+pip install .
+```
+
+If you do not want to install and enable CUDA kernel fusion (installation is required if you use fused optimizers):
+
+```shell
+pip install --global-option="--no_cuda_ext" .
+```
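+
+Either way, a quick sanity check that the installation succeeded is to make sure the package imports cleanly:
+
+```bash
+python -c "import colossalai; print('colossalai imported successfully')"
+```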
+
+## Use Docker
+
+Run the following command to build a docker image from the Dockerfile provided.
+
+```bash
+cd ColossalAI
+docker build -t colossalai ./docker
+```
+
+Run the following command to start the docker container in interactive mode.
+
+```bash
+docker run -ti --gpus all --rm --ipc=host colossalai bash
+```
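+
+If you want your working copy of the repository available inside the container, a common variant (the `/workspace` mount point is just an example) is:
+
+```bash
+# mount the current directory into the container at /workspace
+docker run -ti --gpus all --rm --ipc=host -v $PWD:/workspace colossalai bash
+```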
+
+## Contributing
+
+If you wish to contribute to this project, please refer to the [Contributing Guidelines](./CONTRIBUTING.md).
+
+
+## Quick View
+
+### Start Distributed Training in a Few Lines
+
+```python
+import colossalai
+from colossalai.utils import get_dataloader
+
+
+# my_config can be a path to a config file or a dictionary object
+# 'localhost' only works for single-node training; use the actual node name for multi-node training
+colossalai.launch(
+ config=my_config,
+ rank=rank,
+ world_size=world_size,
+ backend='nccl',
+ port=29500,
+ host='localhost'
+)
+
+# build your model
+model = ...
+
+# build your dataset; get_dataloader sets up a distributed sampler by default
+train_dataset = ...
+train_dataloader = get_dataloader(dataset=train_dataset,
+                                  shuffle=True
+                                  )
+
+
+# build your optimizer
+optimizer = ...
+
+# build your loss function
+criterion = ...
+
+# initialize colossalai
+engine, train_dataloader, _, _ = colossalai.initialize(
+ model=model,
+ optimizer=optimizer,
+ criterion=criterion,
+ train_dataloader=train_dataloader
+)
+
+# start training
+engine.train()
+for epoch in range(NUM_EPOCHS):
+ for data, label in train_dataloader:
+ engine.zero_grad()
+ output = engine(data)
+ loss = engine.criterion(output, label)
+ engine.backward(loss)
+ engine.step()
+
+```
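+
+The `rank` and `world_size` variables above are placeholders. Assuming your training script fills them in from the `RANK` and `WORLD_SIZE` environment variables that `torch.distributed.launch` sets for each process, a single-node run on 4 GPUs might look like the sketch below (the script name `train.py` is just an example):
+
+```bash
+python -m torch.distributed.launch --nproc_per_node=4 train.py
+```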
+
+### Write a Simple 2D Parallel Model
+
+Let's say we have a huge MLP model whose large hidden size makes it difficult to fit into a single GPU. We can distribute the model's weights across a 2D mesh of GPUs while you still write your model in the familiar way.
+
+```python
+from colossalai.nn import Linear2D
+import torch.nn as nn
+
+
+class MLP_2D(nn.Module):
+
+ def __init__(self):
+ super().__init__()
+ self.linear_1 = Linear2D(in_features=1024, out_features=16384)
+ self.linear_2 = Linear2D(in_features=16384, out_features=1024)
+
+ def forward(self, x):
+ x = self.linear_1(x)
+ x = self.linear_2(x)
+ return x
+
+```
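+
+For the layers above to actually be split across a 2D device mesh, 2D tensor parallelism has to be enabled in the configuration passed to `colossalai.launch`. The sketch below shows what such a config file might look like; the exact keys and values are assumptions based on the config-dictionary pattern used earlier (`my_config`), so please check the documentation for the authoritative format.
+
+```python
+# config.py (hypothetical): 4 GPUs arranged as a 2 x 2 mesh for 2D tensor parallelism
+parallel = dict(
+    pipeline=dict(size=1),
+    tensor=dict(size=4, mode='2d'),
+)
+```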
+
+
+## Cite Us
+
+```
+@article{bian2021colossal,
+ title={Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training},
+ author={Bian, Zhengda and Liu, Hongxin and Wang, Boxiang and Huang, Haichen and Li, Yongbin and Wang, Chuanrui and Cui, Fan and You, Yang},
+ journal={arXiv preprint arXiv:2110.14883},
+ year={2021}
+}
+```
diff --git a/README.md b/README.md
index 93282185f..65e05991b 100644
--- a/README.md
+++ b/README.md
@@ -13,9 +13,52 @@
[](https://github.com/hpcaitech/ColossalAI/actions/workflows/PR_CI.yml)
[](https://colossalai.readthedocs.io/en/latest/?badge=latest)
[](https://codebeat.co/projects/github-com-hpcaitech-colossalai-main)
+
+ | [English](README.md) | [中文](README-zh-Hans.md) |
An integrated large-scale model training system with efficient parallelization techniques.
+
+## Features
+
+Colossal-AI provides a collection of parallel training components for you. We aim to let you write your
+distributed deep learning models just as you write your single-GPU model. We provide friendly tools to kickstart
+distributed training in a few lines.
+
+- Data Parallelism
+- Pipeline Parallelism
+- 1D, 2D, 2.5D, 3D tensor parallelism
+- Sequence parallelism
+- Friendly trainer and engine
+- Extensible for new parallelism
+- Mixed Precision Training
+- Zero Redundancy Optimizer (ZeRO)
+
+## Examples
+### ViT
+
+![ViT with Colossal-AI](./docs/images/ViT_TP.png)
+
+- 14x larger batch size
+- 5x faster training
+
+### GPT-3 & GPT-2
+
+![GPT-2 & GPT-3 with Colossal-AI](./docs/images/GPT_2_3.png)
+
+- Frees up 50% of GPU resources, or 10.7% acceleration, for GPT-3
+- 11x lower GPU memory consumption, or superlinear scaling, for GPT-2
+
+### BERT
+
+![BERT with Colossal-AI](./docs/images/BERT_seq.png)
+
+- 2x faster training
+- 50% longer sequence length
+
+Please visit our [documentation and tutorials](https://www.colossalai.org/) for more details.
+
+
## Installation
### PyPI
@@ -37,7 +80,7 @@ pip install colossalai[zero]
### Install From Source
-> The documentation will be in line with the main branch of the repository. Feel free to raise an issue if you encounter any problem. :)
+> The version of Colossal-AI will be in line with the main branch of the repository. Feel free to raise an issue if you encounter any problem. :)
```shell
git clone https://github.com/hpcaitech/ColossalAI.git
@@ -107,13 +150,13 @@ train_dataloader = get_dataloader(dataset=dataset,
)
-# build your
+# build your optimizer
optimizer = ...
# build your loss function
criterion = ...
-# build your lr_scheduler
+# initialize colossalai
engine, train_dataloader, _, _ = colossalai.initialize(
model=model,
optimizer=optimizer,
@@ -157,21 +200,6 @@ class MLP_2D(nn.Module):
```
-## Features
-
-Colossal-AI provides a collection of parallel training components for you. We aim to support you to write your
-distributed deep learning models just like how you write your single-GPU model. We provide friendly tools to kickstart
-distributed training in a few lines.
-
-- Data Parallelism
-- Pipeline Parallelism
-- 1D, 2D, 2.5D, 3D and sequence parallelism
-- Friendly trainer and engine
-- Extensible for new parallelism
-- Mixed Precision Training
-- Zero Redundancy Optimizer (ZeRO)
-
-Please visit our [documentation and tutorials](https://www.colossalai.org/) for more details.
## Cite Us
diff --git a/docs/images/BERT_seq.png b/docs/images/BERT_seq.png
new file mode 100644
index 000000000..1cdf78269
Binary files /dev/null and b/docs/images/BERT_seq.png differ
diff --git a/docs/images/GPT_2_3.png b/docs/images/GPT_2_3.png
new file mode 100644
index 000000000..08181c29d
Binary files /dev/null and b/docs/images/GPT_2_3.png differ
diff --git a/docs/images/ViT_TP.png b/docs/images/ViT_TP.png
new file mode 100644
index 000000000..f142cfefd
Binary files /dev/null and b/docs/images/ViT_TP.png differ