Mirror of https://github.com/hpcaitech/ColossalAI.git, synced 2025-04-30 20:55:17 +00:00
Compare commits
27 Commits
- 46ed5d856b
- 7ecdf9a211
- 44d4053fec
- 6d676ee0e9
- 56fe130b15
- f32861ccc5
- b9e60559b8
- 7595c453a5
- 53834b74b9
- 0171884664
- 9379cbd668
- 24dee8f0b7
- f73ae55394
- f8b9e88484
- d54642a263
- d20c8ffd97
- ce0ec40811
- 5ff5323538
- 014837e725
- ec73f1b5e2
- 5c09d726a6
- 2b415e5999
- 17062c83b9
- ca0aa2365d
- 97e60cbbcb
- 5b094a836b
- ee81366cac
@@ -1,3 +1,3 @@
-2.2.2-12.1.0
 2.3.0-12.1.0
 2.4.0-12.4.1
+2.5.1-12.4.1
@@ -1,11 +1,11 @@
 {
     "build": [
         {
-            "torch_command": "pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121",
+            "torch_command": "pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121",
             "cuda_image": "hpcaitech/cuda-conda:12.1"
         },
         {
-            "torch_command": "pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124",
+            "torch_command": "pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124",
             "cuda_image": "hpcaitech/cuda-conda:12.4"
         }
     ]
.github/ISSUE_TEMPLATE/bug-report.yml (vendored): 20 changes
@@ -15,6 +15,26 @@ body:
      options:
        - label: I have searched the existing issues
          required: true

  - type: checkboxes
    attributes:
      label: The bug has not been fixed in the latest main branch
      options:
        - label: I have checked the latest main branch
          required: true

  - type: dropdown
    id: share_script
    attributes:
      label: Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
      description: If not, please share your setting/training config, and/or point to the line in the repo that throws the error.
        If the issue is not easily reproducible by us, it will reduce the likelihood of getting responses.
      options:
        - Yes, I will share a minimal reproducible script.
        - No, I prefer not to share.
    validations:
      required: true

  - type: textarea
    attributes:
      label: 🐛 Describe the bug
.github/workflows/build_on_pr.yml (vendored): 9 changes
@@ -87,10 +87,10 @@ jobs:
    name: Build and Test Colossal-AI
    needs: detect
    if: needs.detect.outputs.anyLibraryFileChanged == 'true'
    runs-on: [self-hosted, gpu]
    runs-on: ubuntu-latest
    container:
      image: hpcaitech/pytorch-cuda:2.2.2-12.1.0
      options: --gpus all --rm -v /dev/shm -v /data/scratch:/data/scratch
      image: image-cloud.luchentech.com/hpcaitech/pytorch-cuda:2.2.2-12.1.0
      options: --gpus all --shm-size=2g --rm -v /dev/shm -v /data/scratch:/data/scratch
    timeout-minutes: 90
    defaults:
      run:
@@ -166,6 +166,7 @@ jobs:
        LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
        LLAMA_PATH: /data/scratch/llama-tiny
        MOE_TENSOR_PATH: /data/scratch/moe_tensors
        HF_ENDPOINT: https://hf-mirror.com

    - name: Collate artifact
      env:
@@ -199,7 +200,7 @@ jobs:
        fi

    - name: Upload test coverage artifact
      uses: actions/upload-artifact@v3
      uses: actions/upload-artifact@v4
      with:
        name: report
        path: report/
.github/workflows/build_on_schedule.yml (vendored): 1 change
@@ -70,6 +70,7 @@ jobs:
        LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
        LLAMA_PATH: /data/scratch/llama-tiny
        MOE_TENSOR_PATH: /data/scratch/moe_tensors
        HF_ENDPOINT: https://hf-mirror.com

    - name: Notify Lark
      id: message-preparation
@@ -79,3 +79,4 @@ jobs:
        LD_LIBRARY_PATH: /github/home/.tensornvme/lib
        LLAMA_PATH: /data/scratch/llama-tiny
        MOE_TENSOR_PATH: /data/scratch/moe_tensors
        HF_ENDPOINT: https://hf-mirror.com
@@ -73,3 +73,4 @@ jobs:
        LD_LIBRARY_PATH: /github/home/.tensornvme/lib
        LLAMA_PATH: /data/scratch/llama-tiny
        MOE_TENSOR_PATH: /data/scratch/moe_tensors
        HF_ENDPOINT: https://hf-mirror.com
@@ -67,6 +67,7 @@ jobs:
        LD_LIBRARY_PATH: /github/home/.tensornvme/lib
        LLAMA_PATH: /data/scratch/llama-tiny
        MOE_TENSOR_PATH: /data/scratch/moe_tensors
        HF_ENDPOINT: https://hf-mirror.com

    - name: Notify Lark
      id: message-preparation
.github/workflows/run_chatgpt_examples.yml (vendored): 4 changes
@@ -31,13 +31,12 @@ jobs:

    - name: Install Colossal-AI
      run: |
        BUILD_EXT=1 pip install --no-cache-dir -v -e .
        pip install --no-cache-dir -v -e .

    - name: Install ChatGPT
      run: |
        cd applications/ColossalChat
        pip install --no-cache-dir -v .
        export BUILD_EXT=1
        pip install --no-cache-dir -r examples/requirements.txt

    - name: Install Transformers
@@ -61,5 +60,6 @@ jobs:
        PRETRAINED_MODEL_PATH: ./models
        SFT_DATASET: ./sft_data
        PROMPT_DATASET: ./prompt_data
        PROMPT_RLVR_DATASET: ./prompt_data
        PREFERENCE_DATASET: ./preference_data
        KTO_DATASET: ./kto_data
@@ -38,12 +38,13 @@ Limited Academic Bonuses:

<div align="center">
  <a href="https://hpc-ai.com/?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai">
  <img src="https://github.com/hpcaitech/public_assets/blob/main/colossalai/img/2.gif" width="850" />
  <img src="https://github.com/hpcaitech/public_assets/blob/main/colossalai/img/2-2.gif" width="850" />
  </a>
</div>

## Latest News
* [2025/02] [DeepSeek 671B Fine-Tuning Guide Revealed—Unlock the Upgraded DeepSeek Suite with One Click, AI Players Ecstatic!](https://company.hpc-ai.com/blog/shocking-release-deepseek-671b-fine-tuning-guide-revealed-unlock-the-upgraded-deepseek-suite-with-one-click-ai-players-ecstatic)
* [2024/12] [The development cost of video generation models has saved by 50%! Open-source solutions are now available with H200 GPU vouchers](https://company.hpc-ai.com/blog/the-development-cost-of-video-generation-models-has-saved-by-50-open-source-solutions-are-now-available-with-h200-gpu-vouchers) [[code]](https://github.com/hpcaitech/Open-Sora/blob/main/scripts/train.py) [[vouchers]](https://colossalai.org/zh-Hans/docs/get_started/bonus/)
* [2024/10] [How to build a low-cost Sora-like app? Solutions for you](https://company.hpc-ai.com/blog/how-to-build-a-low-cost-sora-like-app-solutions-for-you)
* [2024/09] [Singapore Startup HPC-AI Tech Secures 50 Million USD in Series A Funding to Build the Video Generation AI Model and GPU Platform](https://company.hpc-ai.com/blog/singapore-startup-hpc-ai-tech-secures-50-million-usd-in-series-a-funding-to-build-the-video-generation-ai-model-and-gpu-platform)
applications/ColossalChat/.gitignore (vendored): 1 change
@@ -158,6 +158,7 @@ temp/
applications/ColossalChat/logs
applications/ColossalChat/models
applications/ColossalChat/sft_data
applications/ColossalChat/kto_data
applications/ColossalChat/prompt_data
applications/ColossalChat/preference_data
applications/ColossalChat/temp
@@ -7,31 +7,23 @@
## Table of Contents

- [Table of Contents](#table-of-contents)
- [What is ColossalChat and Coati ?](#what-is-colossalchat-and-coati-)
- [What is ColossalChat?](#what-is-colossalchat)
- [Online demo](#online-demo)
- [Install](#install)
  - [Install the environment](#install-the-environment)
  - [Install the Transformers](#install-the-transformers)
- [How to use?](#how-to-use)
- [Introduction](#introduction)
  - [Supervised datasets collection](#step-1-data-collection)
  - [RLHF Training Stage1 - Supervised instructs tuning](#rlhf-training-stage1---supervised-instructs-tuning)
  - [RLHF Training Stage2 - Training reward model](#rlhf-training-stage2---training-reward-model)
  - [RLHF Training Stage3 - Training model with reinforcement learning by human feedback](#rlhf-training-stage3---proximal-policy-optimization)
  - [Alternative Option for RLHF: GRPO](#alternative-option-for-rlhf-group-relative-policy-optimization-grpo)
  - [Alternative Option For RLHF: DPO](#alternative-option-for-rlhf-direct-preference-optimization)
  - [Alternative Option For RLHF: SimPO](#alternative-option-for-rlhf-simple-preference-optimization-simpo)
  - [Alternative Option For RLHF: ORPO](#alternative-option-for-rlhf-odds-ratio-preference-optimization-orpo)
  - [Alternative Option For RLHF: KTO](#alternative-option-for-rlhf-kahneman-tversky-optimization-kto)
  - [SFT for DeepSeek V3/R1](#sft-for-deepseek-v3)
  - [Inference Quantization and Serving - After Training](#inference-quantization-and-serving---after-training)
- [Coati7B examples](#coati7b-examples)
  - [Generation](#generation)
  - [Open QA](#open-qa)
  - [Limitation for LLaMA-finetuned models](#limitation)
  - [Limitation of dataset](#limitation)
- [Alternative Option For RLHF: DPO](#alternative-option-for-rlhf-direct-preference-optimization)
- [Alternative Option For RLHF: SimPO](#alternative-option-for-rlhf-simple-preference-optimization-simpo)
- [Alternative Option For RLHF: ORPO](#alternative-option-for-rlhf-odds-ratio-preference-optimization-orpo)
- [Alternative Option For RLHF: KTO](#alternative-option-for-rlhf-kahneman-tversky-optimization-kto)
- [O1 Journey](#o1-journey)
  - [Inference with Self-refined MCTS](#inference-with-self-refined-mcts)
- [FAQ](#faq)
  - [How to save/load checkpoint](#faq)
  - [How to train with limited resources](#faq)
- [Invitation to open-source contribution](#invitation-to-open-source-contribution)
- [Quick Preview](#quick-preview)
- [Authors](#authors)
@@ -40,9 +32,9 @@

---

## What Is ColossalChat And Coati ?
## What is ColossalChat?

[ColossalChat](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat) is the project to implement LLM with RLHF, powered by the [Colossal-AI](https://github.com/hpcaitech/ColossalAI) project.
[ColossalChat](https://github.com/hpcaitech/ColossalAI/tree/main/applications/ColossalChat) is a project to implement LLM with RLHF, powered by the [Colossal-AI](https://github.com/hpcaitech/ColossalAI).

Coati stands for `ColossalAI Talking Intelligence`. It is the name for the module implemented in this project and is also the name of the large language model developed by the ColossalChat project.

@@ -53,8 +45,6 @@ The Coati package provides a unified large language model framework that has imp
- Supervised instructions fine-tuning
- Training reward model
- Reinforcement learning with human feedback
- Quantization inference
- Fast model deploying
- Perfectly integrated with the Hugging Face ecosystem, a high degree of model customization

<div align="center">
@@ -114,77 +104,16 @@ cd $COLOSSAL_AI_ROOT/applications/ColossalChat
pip install .
```

## How To Use?
## Introduction

### RLHF Training Stage1 - Supervised Instructs Tuning

Stage1 is supervised instructs fine-tuning (SFT). This step is a crucial part of the RLHF training process, as it involves training a machine learning model using human-provided instructions to learn the initial behavior for the task at hand. Here's a detailed guide on how to SFT your LLM with ColossalChat. More details can be found in [example guideline](./examples/README.md).

#### Step 1: Data Collection
The first step in Stage 1 is to collect a dataset of human demonstrations of the following format.

```json
[
    {"messages":
      [
        {
          "from": "user",
          "content": "what are some pranks with a pen i can do?"
        },
        {
          "from": "assistant",
          "content": "Are you looking for practical joke ideas?"
        },
      ]
    },
]
```

#### Step 2: Preprocessing
Once you have collected your SFT dataset, you will need to preprocess it. This involves four steps: data cleaning, data deduplication, formatting, and tokenization. In this section, we will focus on formatting and tokenization.

In this code, we provide a flexible way for users to set the conversation template for formatting chat data using Hugging Face's chat template feature. Please follow the [example guideline](./examples/README.md) on how to format and tokenize data.
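For intuition, below is a minimal, hypothetical sketch of what that formatting-and-tokenization step can look like with Hugging Face's chat template; it is not the repository's own data preparation scripts, and the tokenizer name is only an illustration.

```python
# Minimal sketch: render one SFT sample (in the format above) with a chat template
# and tokenize it. The tokenizer choice is an assumption for illustration only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

sample = {
    "messages": [
        {"from": "user", "content": "what are some pranks with a pen i can do?"},
        {"from": "assistant", "content": "Are you looking for practical joke ideas?"},
    ]
}

# The dataset stores roles as "from"/"content"; chat templates expect "role"/"content".
chat = [
    {"role": "user" if m["from"] in ("user", "human") else "assistant", "content": m["content"]}
    for m in sample["messages"]
]

text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False)
input_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
print(text)
print(len(input_ids), "tokens")
```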
#### Step 3: Training
Choose a suitable model architecture for your task. Note that your model should be compatible with the tokenizer that you used to tokenize the SFT dataset. You can run [train_sft.sh](./examples/training_scripts/train_sft.sh) to start a supervised instructs fine-tuning. More details can be found in [example guideline](./examples/README.md).
Stage1 is supervised instructs fine-tuning (SFT). This step is a crucial part of the RLHF training process, as it involves training a machine learning model using human-provided instructions to learn the initial behavior for the task at hand. More details can be found in [example guideline](./examples/README.md).

### RLHF Training Stage2 - Training Reward Model

Stage2 trains a reward model, which obtains corresponding scores by manually ranking different outputs for the same prompt and supervises the training of the reward model.

#### Step 1: Data Collection
Below shows the preference dataset format used in training the reward model.

```json
[
    {"context": [
        {
          "from": "human",
          "content": "Introduce butterflies species in Oregon."
        }
      ],
     "chosen": [
        {
          "from": "assistant",
          "content": "About 150 species of butterflies live in Oregon, with about 100 species are moths..."
        },
      ],
     "rejected": [
        {
          "from": "assistant",
          "content": "Are you interested in just the common butterflies? There are a few common ones which will be easy to find..."
        },
      ]
    },
]
```

#### Step 2: Preprocessing
Similar to the second step in the previous stage, we format the reward data into the same structured format as used in step 2 of the SFT stage. You can run [prepare_preference_dataset.sh](./examples/data_preparation_scripts/prepare_preference_dataset.sh) to prepare the preference data for reward model training.

#### Step 3: Training
You can run [train_rm.sh](./examples/training_scripts/train_rm.sh) to start the reward model training. More details can be found in [example guideline](./examples/README.md).
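For intuition, here is a minimal sketch of the pairwise ranking objective behind reward-model training on this data format: the model should assign a higher scalar score to the chosen response than to the rejected one. This illustrates the idea only; it is not the repository's loss implementation.

```python
# Pairwise ranking loss sketch: -log(sigmoid(r_chosen - r_rejected)), averaged over the batch.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy scores standing in for reward-model outputs on chosen/rejected responses.
chosen = torch.tensor([1.2, 0.3, 0.8])
rejected = torch.tensor([0.4, 0.5, -0.1])
print(pairwise_rm_loss(chosen, rejected))
```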
### RLHF Training Stage3 - Proximal Policy Optimization

In stage3 we will use the reinforcement learning algorithm Proximal Policy Optimization (PPO), which is the most complex part of the training process:

@@ -193,86 +122,26 @@ In stage3 we will use reinforcement learning algorithm--- Proximal Policy Optimi
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/stage-3.jpeg" width=800/>
</p>

#### Step 1: Data Collection
PPO uses two kinds of training data: the prompt data and the SFT data (optional). The first dataset is mandatory; data samples within the prompt dataset end with a line from "human", and the "assistant" needs to generate a response to answer the "human". Note that you can still use a conversation that ends with a line from the "assistant"; in that case, the last line will be dropped. Here is an example of the prompt dataset format.

```json
[
    {"messages":
      [
        {
          "from": "human",
          "content": "what are some pranks with a pen i can do?"
        }
      ]
    },
]
```

#### Step 2: Data Preprocessing
To prepare the prompt dataset for PPO training, simply run [prepare_prompt_dataset.sh](./examples/data_preparation_scripts/prepare_prompt_dataset.sh)

#### Step 3: Training
You can run [train_ppo.sh](./examples/training_scripts/train_ppo.sh) to start PPO training. Here are some unique arguments for PPO; please refer to the training configuration section for other training configuration. More details can be found in [example guideline](./examples/README.md).

```bash
--pretrain $PRETRAINED_MODEL_PATH \
--rm_pretrain $PRETRAINED_MODEL_PATH \ # reward model architecture
--tokenizer_dir $PRETRAINED_TOKENIZER_PATH \
--rm_checkpoint_path $REWARD_MODEL_PATH \ # reward model checkpoint path
--prompt_dataset ${prompt_dataset[@]} \ # List of string, the prompt dataset
--ptx_dataset ${ptx_dataset[@]} \ # List of string, the SFT data used in the SFT stage
--ptx_batch_size 1 \ # batch size for calculating the ptx loss
--ptx_coef 0.0 \ # non-zero if the ptx loss is enabled
--num_episodes 2000 \ # number of episodes to train
--num_collect_steps 1 \
--num_update_steps 1 \
--experience_batch_size 8 \
--train_batch_size 4 \
--accumulation_steps 2
```

Each episode has two phases, the collect phase and the update phase. During the collect phase, we collect experiences (answers generated by the actor) and store them in the ExperienceBuffer. The data in the ExperienceBuffer is then used during the update phase to update the parameters of the actor and critic.

- Without tensor parallelism,
```
experience buffer size
= num_process * num_collect_steps * experience_batch_size
= train_batch_size * accumulation_steps * num_process
```

- With tensor parallelism,
```
num_tp_group = num_process / tp
experience buffer size
= num_tp_group * num_collect_steps * experience_batch_size
= train_batch_size * accumulation_steps * num_tp_group
```
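As a quick sanity check of the bookkeeping above, here is a tiny hypothetical helper (not part of the codebase; the variable names simply mirror the script arguments) that evaluates the buffer size:

```python
# Evaluate the experience-buffer-size identity from the formulas above.
def experience_buffer_size(num_process: int, num_collect_steps: int,
                           experience_batch_size: int, tp: int = 1) -> int:
    num_tp_group = num_process // tp
    return num_tp_group * num_collect_steps * experience_batch_size

# 8 GPUs with tp=2 and one collect step of 8 experiences:
# 4 tp groups * 1 * 8 = 32, which must equal train_batch_size * accumulation_steps * num_tp_group.
print(experience_buffer_size(num_process=8, num_collect_steps=1, experience_batch_size=8, tp=2))
```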
## Alternative Option For RLHF: Direct Preference Optimization (DPO)
### Alternative Option For RLHF: Direct Preference Optimization (DPO)
For those seeking an alternative to Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO) presents a compelling option. As detailed in this [paper](https://arxiv.org/abs/2305.18290), DPO offers a low-cost way to perform RLHF and usually requires fewer computation resources than PPO. Read this [README](./examples/README.md) for more information.

### DPO Training Stage1 - Supervised Instructs Tuning

Please refer to the [SFT section](#dpo-training-stage1---supervised-instructs-tuning) in the PPO part.

### DPO Training Stage2 - DPO Training
#### Step 1: Data Collection & Preparation
For DPO training, you only need the preference dataset. Please follow the instructions in the [preference dataset preparation section](#rlhf-training-stage2---training-reward-model) to prepare the preference data for DPO training.

#### Step 2: Training
You can run [train_dpo.sh](./examples/training_scripts/train_dpo.sh) to start DPO training. More details can be found in [example guideline](./examples/README.md).

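For reference, a minimal sketch of the DPO objective as defined in the paper; this is an illustration written from the paper's loss, not the repository's `DpoLoss` implementation.

```python
# DPO loss sketch: -log sigmoid(beta * ((logp_c - logp_c_ref) - (logp_r - logp_r_ref)))
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy per-response summed log-probabilities for two preference pairs.
lp = lambda *v: torch.tensor(v)
print(dpo_loss(lp(-12.0, -9.5), lp(-14.0, -10.0), lp(-12.5, -9.8), lp(-13.0, -9.9)))
```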
## Alternative Option For RLHF: Simple Preference Optimization (SimPO)
### Alternative Option For RLHF: Simple Preference Optimization (SimPO)
Simple Preference Optimization (SimPO) from this [paper](https://arxiv.org/pdf/2405.14734) is similar to DPO but abandons the use of the reference model, which makes training more efficient. It also adds a reward-shaping term called the target reward margin to enhance training stability, and it uses length normalization to better align with the inference process. Read this [README](./examples/README.md) for more information.
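A sketch of the SimPO objective under the paper's definition (length-normalized implicit reward, no reference model, target reward margin gamma); illustrative only, not this repository's code.

```python
# SimPO loss sketch: -log sigmoid(beta * logp_c / |y_c| - beta * logp_r / |y_r| - gamma)
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps_sum, chosen_len, rejected_logps_sum, rejected_len,
               beta: float = 2.0, gamma: float = 1.0):
    chosen_reward = beta * chosen_logps_sum / chosen_len        # length-normalized
    rejected_reward = beta * rejected_logps_sum / rejected_len
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()

print(simpo_loss(torch.tensor([-40.0]), torch.tensor([20.0]),
                 torch.tensor([-55.0]), torch.tensor([22.0])))
```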
## Alternative Option For RLHF: Odds Ratio Preference Optimization (ORPO)
### Alternative Option For RLHF: Odds Ratio Preference Optimization (ORPO)
Odds Ratio Preference Optimization (ORPO) from this [paper](https://arxiv.org/pdf/2403.07691) is a reference-model-free alignment method that uses a mixture of the SFT loss and a reinforcement learning loss calculated from an odds-ratio-based implicit reward, which makes training more efficient and stable. Read this [README](./examples/README.md) for more information.
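A sketch of the odds-ratio term that ORPO adds on top of the usual SFT cross-entropy, written from the paper's definition rather than taken from this repository:

```python
# Odds-ratio term sketch: log odds uses p = exp(mean per-token log-prob) of a response.
import torch
import torch.nn.functional as F

def odds_ratio_loss(chosen_logp_mean: torch.Tensor, rejected_logp_mean: torch.Tensor) -> torch.Tensor:
    def log_odds(lp):
        p = torch.exp(lp).clamp(max=1.0 - 1e-6)
        return lp - torch.log1p(-p)   # log(p / (1 - p))
    return -F.logsigmoid(log_odds(chosen_logp_mean) - log_odds(rejected_logp_mean)).mean()

# In ORPO this term is weighted and added to the SFT loss: loss = sft_loss + lam * odds_ratio_loss(...)
print(odds_ratio_loss(torch.tensor([-0.8]), torch.tensor([-1.6])))
```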
## Alternative Option For RLHF: Kahneman-Tversky Optimization (KTO)
### Alternative Option For RLHF: Kahneman-Tversky Optimization (KTO)
We support the method introduced in the paper [KTO: Model Alignment as Prospect Theoretic Optimization](https://arxiv.org/pdf/2402.01306) (KTO), an alignment method that directly maximizes the "human utility" of generation results. Read this [README](./examples/README.md) for more information.
## Inference Quantization and Serving - After Training
### Alternative Option For RLHF: Group Relative Policy Optimization (GRPO)
We support the main algorithm used to train the DeepSeek R1 model, a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO. Read this [README](./examples/README.md) for more information.
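For intuition, here is a sketch of the group-relative advantage at the heart of GRPO: each prompt is sampled several times, and every response's reward is normalized by the mean and standard deviation of its own group, so no critic/value model is needed. This mirrors the idea rather than the exact training code.

```python
# Group-relative advantage sketch: normalize per-response rewards within each prompt's group.
import torch

def grpo_advantages(rewards: torch.Tensor, num_generation: int, eps: float = 1e-4) -> torch.Tensor:
    grouped = rewards.view(-1, num_generation)         # (num_prompts, num_generation)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)
    return ((grouped - mean) / (std + eps)).view(-1)   # flatten back to per-response advantages

# 2 prompts x 4 generations each
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0, 0.2, 0.2, 0.9, 0.1])
print(grpo_advantages(rewards, num_generation=4))
```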
### SFT for DeepSeek V3
We support fine-tuning the DeepSeek V3/R1 model with LoRA. Read this [README](./examples/README.md) for more information.

### Inference Quantization and Serving - After Training

We provide an online inference server and a benchmark. We aim to run inference on a single GPU, so quantization is essential when using large models.
@@ -281,182 +150,7 @@ We support 8-bit quantization (RTN), 4-bit quantization (GPTQ), and FP16 inferen
Online inference server scripts can help you deploy your own services.
For more details, see [`inference/`](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/inference).

## O1 Journey
### Inference with Self-refined MCTS
We provide an implementation of the MCT Self-Refine (MCTSr) algorithm, an innovative integration of Large Language Models with Monte Carlo Tree Search.
You can serve the model using vLLM, update the config in `Qwen32B_prompt_CFG`, and then run the following script.
```python
from coati.reasoner.guided_search.mcts import MCTS
from coati.reasoner.guided_search.prompt_store.qwen import Qwen32B_prompt_CFG

problem = "How Many R in 'Strawberry'"

search_tree = MCTS(problem=problem, max_simulations=8, cfg=Qwen32B_prompt_CFG)
answer = search_tree.simulate()
print(answer)
```

## Coati7B examples

### Generation

<details><summary><b>E-mail</b></summary>



</details>

<details><summary><b>coding</b></summary>



</details>

<details><summary><b>regex</b></summary>



</details>

<details><summary><b>Tex</b></summary>



</details>

<details><summary><b>writing</b></summary>



</details>

<details><summary><b>Table</b></summary>



</details>

### Open QA

<details><summary><b>Game</b></summary>



</details>

<details><summary><b>Travel</b></summary>



</details>

<details><summary><b>Physical</b></summary>



</details>

<details><summary><b>Chemical</b></summary>



</details>

<details><summary><b>Economy</b></summary>



</details>

You can find more examples in this [repo](https://github.com/XueFuzhao/InstructionWild/blob/main/comparison.md).

### Limitation

<details><summary><b>Limitation for LLaMA-finetuned models</b></summary>
- Both Alpaca and ColossalChat are based on LLaMA. It is hard to compensate for the missing knowledge in the pre-training stage.
- Lack of counting ability: Cannot count the number of items in a list.
- Lack of Logics (reasoning and calculation)
- Tend to repeat the last sentence (fail to produce the end token).
- Poor multilingual results: LLaMA is mainly trained on English datasets (Generation performs better than QA).
</details>

<details><summary><b>Limitation of dataset</b></summary>
- Lack of summarization ability: No such instructions in finetune datasets.
- Lack of multi-turn chat: No such instructions in finetune datasets
- Lack of self-recognition: No such instructions in finetune datasets
- Lack of Safety:
  - When the input contains fake facts, the model makes up false facts and explanations.
  - Cannot abide by OpenAI's policy: When generating prompts from OpenAI API, it always abides by its policy. So no violation case is in the datasets.
</details>

## FAQ

<details><summary><b>How to save/load checkpoint</b></summary>

We have integrated the Transformers save and load pipeline, allowing users to freely call Hugging Face's language models and save them in the HF format.

- Option 1: Save the model weights, model config and generation config (Note: tokenizer will not be saved) which can be loaded using HF's from_pretrained method.

```python
# if use lora, you can choose to merge lora weights before saving
if args.lora_rank > 0 and args.merge_lora_weights:
    from coati.models.lora import LORA_MANAGER

    # NOTE: set model to eval to merge LoRA weights
    LORA_MANAGER.merge_weights = True
    model.eval()
# save model checkpoint after fitting on only rank0
booster.save_model(model, os.path.join(args.save_dir, "modeling"), shard=True)
```

- Option 2: Save the model weights, model config, generation config, as well as the optimizer, learning rate scheduler, running states (Note: tokenizer will not be saved) which are needed for resuming training.

```python
from coati.utils import save_checkpoint
# save model checkpoint after fitting on only rank0
save_checkpoint(
    save_dir=actor_save_dir,
    booster=actor_booster,
    model=model,
    optimizer=optim,
    lr_scheduler=lr_scheduler,
    epoch=0,
    step=step,
    batch_size=train_batch_size,
    coordinator=coordinator,
)
```

To load the saved checkpoint

```python
from coati.utils import load_checkpoint
start_epoch, start_step, sampler_start_idx = load_checkpoint(
    load_dir=checkpoint_path,
    booster=booster,
    model=model,
    optimizer=optim,
    lr_scheduler=lr_scheduler,
)
```
</details>

<details><summary><b>How to train with limited resources</b></summary>

Here are some suggestions that can allow you to train a 7B model on a single or multiple consumer-grade GPUs.

`batch_size`, `lora_rank` and `grad_checkpoint` are the most important parameters for training the model successfully. To maintain a decent batch size for gradient calculation, consider increasing `accumulation_steps` and reducing `batch_size` on each rank.

If you only have a single 24G GPU, using LoRA and the "zero2-cpu" plugin will generally be sufficient.

`gemini` and `gemini-auto` can enable a single 24G GPU to train the whole model without using LoRA if you have sufficient CPU memory, but these strategies don't support gradient accumulation.

If you have multiple GPUs, each with very limited VRAM (say 8GB), you can try the `3d` plugin option, which supports tensor parallelism; set `--tp` to the number of GPUs that you have.
</details>

### Real-time progress

You can find our progress on the GitHub [project board](https://github.com/orgs/hpcaitech/projects/17/views/1).

## Invitation to open-source contribution

Referring to the successful attempts of [BLOOM](https://bigscience.huggingface.co/) and [Stable Diffusion](https://en.wikipedia.org/wiki/Stable_Diffusion), any and all developers and partners with computing power, datasets, or models are welcome to join and build the Colossal-AI community, making efforts towards the era of big AI models from the starting point of replicating ChatGPT!

You may contact us or participate in the following ways:
@@ -500,25 +194,17 @@ Thanks so much to all of our amazing contributors!
- Increase the capacity of the fine-tuning model by up to 3.7 times on a single GPU
- Keep a sufficiently high running speed

| Model Pair    | Alpaca-7B ⚔ Coati-7B | Coati-7B ⚔ Alpaca-7B |
| :-----------: | :------------------: | :------------------: |
| Better Cases  |     38 ⚔ **41**      |     **45** ⚔ 33      |
| Win Rate      |    48% ⚔ **52%**     |    **58%** ⚔ 42%     |
| Average Score |   7.06 ⚔ **7.13**    |   **7.31** ⚔ 6.82    |

- Our Coati-7B model performs better than Alpaca-7B when using GPT-4 to evaluate model performance. The Coati-7B model we evaluated is an old version we trained a few weeks ago, and the new version is around the corner.
## Authors

Coati is developed by ColossalAI Team:

- [ver217](https://github.com/ver217) Leading the project while contributing to the main framework.
- [ver217](https://github.com/ver217) Leading the project while contributing to the main framework (System Lead).
- [Tong Li](https://github.com/TongLi3701) Leading the project while contributing to the main framework (Algorithm Lead).
- [Anbang Ye](https://github.com/YeAnbang) Contributing to the refactored PPO version with updated acceleration framework. Add support for DPO, SimPO, ORPO.
- [FrankLeeeee](https://github.com/FrankLeeeee) Providing ML infra support and also taking charge of both front-end and back-end development.
- [htzhou](https://github.com/ht-zhou) Contributing to the algorithm and development for RM and PPO training.
- [Fazzie](https://fazzie-key.cool/about/index.html) Contributing to the algorithm and development for SFT.
- [ofey404](https://github.com/ofey404) Contributing to both front-end and back-end development.
- [Wenhao Chen](https://github.com/CWHer) Contributing to subsequent code enhancements and performance improvements.
- [Anbang Ye](https://github.com/YeAnbang) Contributing to the refactored PPO version with updated acceleration framework. Add support for DPO, SimPO, ORPO.

The PhD student from [(HPC-AI) Lab](https://ai.comp.nus.edu.sg/) also contributed a lot to this project.
- [Zangwei Zheng](https://github.com/zhengzangw)
@@ -527,7 +213,6 @@ The PhD student from [(HPC-AI) Lab](https://ai.comp.nus.edu.sg/) also contribute
We also appreciate the valuable suggestions provided by [Jian Hu](https://github.com/hijkzzz) regarding the convergence of the PPO algorithm.

## Citations

```bibtex
@article{Hu2021LoRALA,
    title = {LoRA: Low-Rank Adaptation of Large Language Models},
@@ -598,8 +283,22 @@ We also appreciate the valuable suggestions provided by [Jian Hu](https://github
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2403.07691},
}
@misc{shao2024deepseekmathpushinglimitsmathematical,
    title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
    author={Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Xiao Bi and Haowei Zhang and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
    year={2024},
    eprint={2402.03300},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2402.03300},
}
@misc{logic-rl,
    author = {Tian Xie and Qingnan Ren and Yuqian Hong and Zitian Gao and Haoming Luo},
    title = {Logic-RL},
    howpublished = {https://github.com/Unakar/Logic-RL},
    note = {Accessed: 2025-02-03},
    year = {2025}
}
```

## Licenses

Coati is licensed under the [Apache 2.0 License](LICENSE).
@@ -141,7 +141,7 @@ def setup_conversation_template(
            pass
        except ValueError as e:
            raise ValueError(e)
    if not dist.is_initialized() or dist.get_rank() == 0:
    if save_path is not None and (not dist.is_initialized() or dist.get_rank() == 0):
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        with open(save_path, "w", encoding="utf8") as f:
            logger.info(f"Successfully generated a conversation tempalte config, save to {save_path}.")
@@ -8,6 +8,7 @@ import os
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, Sequence, Union

import jsonlines
import torch
import torch.nn.functional as F
from coati.dataset.utils import chuncate_sequence, pad_to_max_len
@@ -155,13 +156,14 @@ class DataCollatorForPromptDataset(DataCollatorForSupervisedDataset):
            `input_ids`: `torch.Tensor` of shape (bsz, max_len);
            `attention_mask`: `torch.BoolTensor` of shape (bsz, max_len);
        """
        gt_answer = [ins.get("gt_answer", None) for ins in instances]
        instances = [{"input_ids": ins["input_ids"], "labels": ins["input_ids"]} for ins in instances]
        ret = super().__call__(instances=instances)
        input_ids = F.pad(
            ret["input_ids"], (self.max_length - ret["input_ids"].size(1), 0), value=self.tokenizer.pad_token_id
        )
        attention_mask = F.pad(ret["attention_mask"], (self.max_length - ret["attention_mask"].size(1), 0), value=False)
        return {"input_ids": input_ids, "attention_mask": attention_mask}
        return {"input_ids": input_ids, "attention_mask": attention_mask, "gt_answer": gt_answer}


@dataclass
@@ -344,3 +346,77 @@ class StatefulDistributedSampler(DistributedSampler):

    def set_start_index(self, start_index: int) -> None:
        self.start_index = start_index


def apply_chat_template_and_mask(
    tokenizer: PreTrainedTokenizer,
    chat: List[Dict[str, str]],
    max_length: Optional[int] = None,
    padding: bool = True,
    truncation: bool = True,
    ignore_idx: int = -100,
) -> Dict[str, torch.Tensor]:
    tokens = []
    assistant_mask = []
    for i, msg in enumerate(chat):
        msg_tokens = tokenizer.apply_chat_template([msg], tokenize=True)
        # remove unexpected bos token
        if i > 0 and msg_tokens[0] == tokenizer.bos_token_id:
            msg_tokens = msg_tokens[1:]
        tokens.extend(msg_tokens)
        if msg["role"] == "assistant":
            assistant_mask.extend([True] * len(msg_tokens))
        else:
            assistant_mask.extend([False] * len(msg_tokens))
    attention_mask = [1] * len(tokens)
    if max_length is not None:
        if padding and len(tokens) < max_length:
            to_pad = max_length - len(tokens)
            if tokenizer.padding_side == "right":
                tokens.extend([tokenizer.pad_token_id] * to_pad)
                assistant_mask.extend([False] * to_pad)
                attention_mask.extend([0] * to_pad)
            else:
                tokens = [tokenizer.pad_token_id] * to_pad + tokens
                assistant_mask = [False] * to_pad + assistant_mask
                attention_mask = [0] * to_pad + attention_mask
        if truncation and len(tokens) > max_length:
            tokens = tokens[:max_length]
            assistant_mask = assistant_mask[:max_length]
            attention_mask = attention_mask[:max_length]
    input_ids = torch.tensor(tokens, dtype=torch.long)
    attention_mask = torch.tensor(attention_mask, dtype=torch.long)
    labels = input_ids.clone()
    labels[~torch.tensor(assistant_mask, dtype=torch.bool)] = ignore_idx

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


class RawConversationDataset(Dataset):
    """
    Raw conversation dataset.
    Each instance is a dictionary with fields `system`, `roles`, `messages`, `offset`, `sep_style`, `seps`.
    """

    def __init__(self, tokenizer: PreTrainedTokenizer, input_file: str, max_length: int) -> None:
        self.tokenizer = tokenizer
        self.raw_texts = []
        with jsonlines.open(input_file) as f:
            for line in f:
                self.raw_texts.append(line)
        self.tokenized_texts = [None] * len(self.raw_texts)
        self.max_length = max_length

    def __len__(self) -> int:
        return len(self.raw_texts)

    def __getitem__(self, index: int):
        if self.tokenized_texts[index] is None:
            message = self.raw_texts[index]
            tokens = apply_chat_template_and_mask(self.tokenizer, message, self.max_length)
            self.tokenized_texts[index] = dict(tokens)
        return self.tokenized_texts[index]
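For readers skimming the hunk above, here is a hypothetical usage of the new `apply_chat_template_and_mask` helper; it assumes the function above is in scope, and the tokenizer choice is only an illustration.

```python
# Only assistant-turn tokens keep their ids in `labels`; user and padding positions become -100.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
chat = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]
batch = apply_chat_template_and_mask(tokenizer, chat, max_length=64)
print(batch["input_ids"].shape, int((batch["labels"] != -100).sum()), "supervised tokens")
```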
@@ -147,7 +147,6 @@ def tokenize_prompt(
        ignore_index: the ignore index when calculate loss during training
        max_length: the maximum context length
    """

    messages = data_point["messages"]
    template = deepcopy(conversation_template)
    template.messages = []
@@ -167,7 +166,6 @@ def tokenize_prompt(
    if len(template.messages) % 2 != 1:
        # exclude the answer if provided. keep only the prompt
        template.messages = template.messages[:-1]

    # Prepare data
    prompt = template.get_prompt(length=len(template.messages), add_generation_prompt=True)
    tokenized = tokenizer([prompt], add_special_tokens=False)["input_ids"][0]
@@ -185,12 +183,21 @@ def tokenize_prompt(
        )

    # `inputs_decode` can be used to check whether the tokenization method is true.
    return dict(
        input_ids=tokenized,
        inputs_decode=prompt,
        seq_length=len(tokenized),
        seq_category=data_point["category"] if "category" in data_point else "None",
    )
    if "gt_answer" in data_point:
        return dict(
            input_ids=tokenized,
            inputs_decode=prompt,
            seq_length=len(tokenized),
            seq_category=data_point["category"] if "category" in data_point else "None",
            gt_answer=data_point["gt_answer"],
        )
    else:
        return dict(
            input_ids=tokenized,
            inputs_decode=prompt,
            seq_length=len(tokenized),
            seq_category=data_point["category"] if "category" in data_point else "None",
        )


def apply_rlhf_data_format(template: Conversation, tokenizer: Any):
@@ -27,6 +27,8 @@ class NaiveExperienceBuffer(ExperienceBuffer):
            self.target_device = torch.device(f"cuda:{torch.cuda.current_device()}")
        # TODO(ver217): add prefetch
        self.items: List[BufferItem] = []
        self.rng_sequence = []
        self.ptr = 0

    @torch.no_grad()
    def append(self, experience: Experience) -> None:
@@ -40,6 +42,9 @@ class NaiveExperienceBuffer(ExperienceBuffer):
            if samples_to_remove > 0:
                logger.warning(f"Experience buffer is full. Removing {samples_to_remove} samples.")
                self.items = self.items[samples_to_remove:]
        self.rng_sequence = [i for i in range(len(self.items))]
        random.shuffle(self.rng_sequence)
        self.ptr = 0

    def clear(self) -> None:
        self.items.clear()
@@ -52,7 +57,10 @@ class NaiveExperienceBuffer(ExperienceBuffer):
        Returns:
            A batch of sampled experiences.
        """
        items = random.sample(self.items, self.sample_batch_size)
        items = []
        for _ in range(self.sample_batch_size):
            self.ptr = (self.ptr + 1) % len(self.items)
            items.append(self.items[self.rng_sequence[self.ptr]])
        experience = make_experience_batch(items)
        if self.cpu_offload:
            experience.to_device(self.target_device)
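The hunk above swaps uniform re-sampling (`random.sample`) for a shuffled ring of indices, so each stored experience is visited once per pass through the buffer. A tiny standalone illustration of that sampling pattern (not the buffer class itself):

```python
# Cycle through a shuffled index sequence instead of resampling uniformly each call.
import random

items = [f"exp{i}" for i in range(6)]
rng_sequence = list(range(len(items)))
random.shuffle(rng_sequence)
ptr = 0

def sample(batch_size: int):
    global ptr
    batch = []
    for _ in range(batch_size):
        ptr = (ptr + 1) % len(items)
        batch.append(items[rng_sequence[ptr]])
    return batch

print(sample(4), sample(4))
```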
@@ -2,6 +2,8 @@
experience maker.
"""

from typing import Any

import torch
import torch.nn.functional as F
from coati.dataset.utils import find_first_occurrence_subsequence
@@ -38,14 +40,27 @@ class NaiveExperienceMaker(ExperienceMaker):
        kl_coef: float = 0.01,
        gamma: float = 1.0,
        lam: float = 0.95,
        use_grpo: bool = False,
        num_generation: int = 8,
        inference_batch_size: int = None,
        logits_forward_batch_size: int = 2,
    ) -> None:
        super().__init__(actor, critic, reward_model, initial_model)
        self.tokenizer = tokenizer
        self.kl_coef = kl_coef
        self.gamma = gamma
        self.lam = lam
        self.use_grpo = use_grpo
        self.num_generation = num_generation
        self.inference_batch_size = inference_batch_size
        self.logits_forward_batch_size = logits_forward_batch_size
        if not self.use_grpo:
            assert self.critic is not None, "Critic model is required for PPO training."
        else:
            assert self.critic is None, "Critic model is not required for GRPO training."
            assert self.num_generation > 1, "Number of generations should be greater than 1 for GRPO training."

    @torch.no_grad()
    @torch.inference_mode()
    def calculate_advantage(self, value: torch.Tensor, reward: torch.Tensor, num_actions: int) -> torch.Tensor:
        """
        Calculates the advantage values for each action based on the value and reward tensors.
@@ -69,7 +84,9 @@ class NaiveExperienceMaker(ExperienceMaker):
        return advantages

    @torch.no_grad()
    def make_experience(self, input_ids: torch.Tensor, attention_mask: torch.Tensor, **generate_kwargs) -> Experience:
    def make_experience(
        self, input_ids: torch.Tensor, attention_mask: torch.Tensor, gt_answer: Any = None, **generate_kwargs
    ) -> Experience:
        """
        Generates an experience using the given input_ids and attention_mask.

@@ -83,98 +100,204 @@ class NaiveExperienceMaker(ExperienceMaker):

        """
        self.actor.eval()
        self.critic.eval()
        if self.critic:
            self.critic.eval()
        self.initial_model.eval()
        self.reward_model.eval()
        pad_token_id = self.tokenizer.pad_token_id

        stop_token_ids = generate_kwargs.get("stop_token_ids", None)
        if isinstance(stop_token_ids, int):
            stop_token_ids = [[stop_token_ids]]
        elif isinstance(stop_token_ids[0], int):
            stop_token_ids = [stop_token_ids]
        elif isinstance(stop_token_ids[0], list):
            pass
        else:
            raise ValueError(
                f"stop_token_ids should be a list of list of integers, a list of integers or an integers. got {stop_token_ids}"
            )
        generate_kwargs["stop_token_ids"] = stop_token_ids
        torch.manual_seed(41)  # for tp, gurantee the same input for reward model

        sequences = generate(self.actor, input_ids, self.tokenizer, **generate_kwargs)
        if self.use_grpo and self.num_generation > 1:
            # Generate multiple responses for each prompt
            input_ids = input_ids.repeat_interleave(self.num_generation, dim=0)
            gt_answer_tmp = []
            for t in gt_answer:
                gt_answer_tmp.extend([t] * self.num_generation)
            gt_answer = gt_answer_tmp
        if self.inference_batch_size is None:
            self.inference_batch_size = input_ids.size(0)

        # Pad to max length
        sequences = F.pad(sequences, (0, generate_kwargs["max_length"] - sequences.size(1)), value=pad_token_id)
        sequence_length = sequences.size(1)
        batch_sequences = []
        batch_input_ids_rm = []
        batch_attention_mask_rm = []
        batch_attention_mask = []
        batch_r = []
        batch_action_log_probs = []
        batch_base_action_log_probs = []
        batch_action_mask = []
        num_actions = 0

        # Calculate auxiliary tensors
        attention_mask = None
        if pad_token_id is not None:
            attention_mask = sequences.not_equal(pad_token_id).to(dtype=torch.long, device=sequences.device)
        for inference_mini_batch_id in range(0, input_ids.size(0), self.inference_batch_size):
            s, e = inference_mini_batch_id, inference_mini_batch_id + self.inference_batch_size
            if input_ids[s:e].size(0) == 0:
                break
            sequences = generate(self.actor, input_ids[s:e], self.tokenizer, **generate_kwargs)
            # pad to max_len, you don't want to get an OOM error after a thousands of steps
            sequences = F.pad(sequences, (0, generate_kwargs["max_length"] - sequences.size(1)), value=pad_token_id)

        input_len = input_ids.size(1)
        if stop_token_ids is None:
            # End the sequence with eos token
            eos_token_id = self.tokenizer.eos_token_id
            if eos_token_id is None:
                action_mask = torch.ones_like(sequences, dtype=torch.bool)
            else:
                # Left padding may be applied, only mask action
                action_mask = (sequences[:, input_len:] == eos_token_id).cumsum(dim=-1) == 0
                action_mask = F.pad(action_mask, (1 + input_len, -1), value=True)  # include eos token and input
        else:
            # stop_token_ids are given, generation ends with stop_token_ids
            action_mask = torch.ones_like(sequences, dtype=torch.bool)
            for i in range(sequences.size(0)):
                stop_index = find_first_occurrence_subsequence(
                    sequences[i][input_len:], torch.tensor(stop_token_ids).to(sequences.device)
                )
                if stop_index == -1:
                    # Sequence does not contain stop_token_ids, this should never happen BTW
                    logger.warning(
                        "Generated sequence does not contain stop_token_ids. Please check your chat template config"
                    )
            # Pad to max length
            sequence_length = sequences.size(1)

            # Calculate auxiliary tensors
            attention_mask = None
            if pad_token_id is not None:
                attention_mask = sequences.not_equal(pad_token_id).to(dtype=torch.long, device=sequences.device)

            input_len = input_ids.size(1)
            if stop_token_ids is None:
                # End the sequence with eos token
                eos_token_id = self.tokenizer.eos_token_id
                if eos_token_id is None:
                    action_mask = torch.ones_like(sequences, dtype=torch.bool)
                else:
                    # Keep stop tokens
                    stop_index = input_len + stop_index
                    action_mask[i, stop_index + len(stop_token_ids) :] = False

        generation_end_index = (action_mask == True).sum(dim=-1) - 1
        action_mask[:, :input_len] = False
        action_mask = action_mask[:, 1:]
        action_mask = action_mask[:, -(sequences.size(1) - input_len) :]
        num_actions = action_mask.size(1)

        actor_output = self.actor(input_ids=sequences, attention_mask=attention_mask)["logits"]
        action_log_probs = calc_action_log_probs(actor_output, sequences, num_actions)

        base_model_output = self.initial_model(input_ids=sequences, attention_mask=attention_mask)["logits"]

        base_action_log_probs = calc_action_log_probs(base_model_output, sequences, num_actions)

        # Convert to right padding for the reward model and the critic model
        input_ids_rm = torch.zeros_like(sequences, device=sequences.device)
        attention_mask_rm = torch.zeros_like(sequences, device=sequences.device)
        for i in range(sequences.size(0)):
            sequence = sequences[i]
            bos_index = (sequence != pad_token_id).nonzero().reshape([-1])[0]
            eos_index = generation_end_index[i]
            sequence_to_pad = sequence[bos_index:eos_index]
            sequence_padded = F.pad(
                sequence_to_pad, (0, sequence_length - sequence_to_pad.size(0)), value=self.tokenizer.pad_token_id
            )
            input_ids_rm[i] = sequence_padded
            if sequence_length - sequence_to_pad.size(0) > 0:
                attention_mask_rm[i, : sequence_to_pad.size(0) + 1] = 1
                # Left padding may be applied, only mask action
                action_mask = (sequences[:, input_len:] == eos_token_id).cumsum(dim=-1) == 0
                action_mask = F.pad(action_mask, (1 + input_len, -1), value=True)  # include eos token and input
            else:
                attention_mask_rm[i, :] = 1
        attention_mask_rm = attention_mask_rm.to(dtype=torch.bool)
                # stop_token_ids are given, generation ends with stop_token_ids
                action_mask = torch.ones_like(sequences, dtype=torch.bool)
                for i in range(sequences.size(0)):
                    stop_token_pos = [
                        find_first_occurrence_subsequence(
                            sequences[i][input_len:], torch.tensor(stop_token_id).to(sequences.device)
                        )
                        for stop_token_id in stop_token_ids
                    ]
                    stop_index = min([i for i in stop_token_pos if i != -1], default=-1)
                    stop_token_id = stop_token_ids[stop_token_pos.index(stop_index)]
                    if stop_index == -1:
                        # Sequence does not contain stop_token_ids, this should never happen BTW
                        logger.warning(
                            "Generated sequence does not contain stop_token_ids. Please check your chat template config"
                        )
                        print(self.tokenizer.decode(sequences[i], skip_special_tokens=True))
                    else:
                        # Keep stop tokens
                        stop_index = input_len + stop_index
                        action_mask[i, stop_index + len(stop_token_id) :] = False

        r = self.reward_model(
            input_ids=input_ids_rm.to(dtype=torch.long, device=sequences.device),
            attention_mask=attention_mask_rm.to(device=sequences.device),
        )
            generation_end_index = (action_mask == True).sum(dim=-1) - 1
            action_mask[:, :input_len] = False
            action_mask = action_mask[:, 1:]
            action_mask = action_mask[:, -(sequences.size(1) - input_len) :]
            num_actions = action_mask.size(1)
            torch.cuda.empty_cache()
            with torch.inference_mode():
                actor_output = []
                base_model_output = []
                for i in range(0, sequences.size(0), self.logits_forward_batch_size):
                    actor_output.append(
                        self.actor(
                            input_ids=sequences[i : i + self.logits_forward_batch_size],
                            attention_mask=attention_mask[i : i + self.logits_forward_batch_size],
                            use_cache=False,
                        )["logits"]
                    )
                    base_model_output.append(
                        self.initial_model(
                            input_ids=sequences[i : i + self.logits_forward_batch_size],
                            attention_mask=attention_mask[i : i + self.logits_forward_batch_size],
                            use_cache=False,
                        )["logits"]
                    )
                actor_output = torch.cat(actor_output, dim=0)
                base_model_output = torch.cat(base_model_output, dim=0)
                action_log_probs = calc_action_log_probs(actor_output, sequences, num_actions)
                base_action_log_probs = calc_action_log_probs(base_model_output, sequences, num_actions)

        value = self.critic(
            input_ids=input_ids_rm.to(dtype=torch.long, device=sequences.device),
            attention_mask=attention_mask_rm.to(device=sequences.device),
        )
        reward, kl = compute_reward(r, self.kl_coef, action_log_probs, base_action_log_probs, action_mask=action_mask)
        value = value[:, -num_actions:] * action_mask
        advantages = self.calculate_advantage(value, reward, num_actions)
            # Convert to right padding for the reward model and the critic model
            input_ids_rm = torch.zeros_like(sequences, device=sequences.device)
            response_start = []
            response_end = []
            attention_mask_rm = torch.zeros_like(sequences, device=sequences.device)
            for i in range(sequences.size(0)):
                sequence = sequences[i]
                bos_index = (sequence != pad_token_id).nonzero().reshape([-1])[0]
                eos_index = generation_end_index[i] + 1  # include the stop token
                sequence_to_pad = sequence[bos_index:eos_index]
                response_start.append(input_len - bos_index)
                response_end.append(eos_index - bos_index)
                sequence_padded = F.pad(
                    sequence_to_pad, (0, sequence_length - sequence_to_pad.size(0)), value=self.tokenizer.pad_token_id
                )
                input_ids_rm[i] = sequence_padded
                if sequence_length - sequence_to_pad.size(0) > 0:
                    attention_mask_rm[i, : sequence_to_pad.size(0) + 1] = 1
                else:
                    attention_mask_rm[i, :] = 1
            attention_mask_rm = attention_mask_rm.to(dtype=torch.bool)

        advantages = advantages.detach()
        value = value.detach()
            r = self.reward_model(
                input_ids=input_ids_rm.to(dtype=torch.long, device=sequences.device),
                attention_mask=attention_mask_rm.to(device=sequences.device),
                response_start=response_start,
                response_end=response_end,
                gt_answer=gt_answer[s:e],
            )

            batch_sequences.append(sequences)
            batch_input_ids_rm.append(input_ids_rm)
            batch_attention_mask_rm.append(attention_mask_rm)
            batch_attention_mask.append(attention_mask)
            batch_r.append(r)
            batch_action_log_probs.append(action_log_probs.cpu())
            batch_base_action_log_probs.append(base_action_log_probs.cpu())
            batch_action_mask.append(action_mask)

        sequences = torch.cat(batch_sequences, dim=0)
        input_ids_rm = torch.cat(batch_input_ids_rm, dim=0)
        attention_mask_rm = torch.cat(batch_attention_mask_rm, dim=0)
        attention_mask = torch.cat(batch_attention_mask, dim=0)
        r = torch.cat(batch_r, dim=0)
        action_log_probs = torch.cat(batch_action_log_probs, dim=0).to(sequences.device)
        base_action_log_probs = torch.cat(batch_base_action_log_probs, dim=0).to(sequences.device)
        action_mask = torch.cat(batch_action_mask, dim=0).to(sequences.device)
        if not self.use_grpo:
            value = self.critic(
                input_ids=input_ids_rm.to(dtype=torch.long, device=sequences.device),
                attention_mask=attention_mask_rm.to(device=sequences.device),
            )
            value = value[:, -num_actions:] * action_mask
            reward, kl = compute_reward(
                r, self.kl_coef, action_log_probs, base_action_log_probs, action_mask=action_mask
            )
            advantages = self.calculate_advantage(value, reward, num_actions)
            advantages = advantages.detach()
            value = value.detach()
        else:
            # GRPO advantage calculation
            kl = torch.sum(
                -self.kl_coef * (action_log_probs - base_action_log_probs) * action_mask, dim=-1
            ) / torch.sum(
                action_mask, dim=-1
            )  # address numerical instability issue
            r = kl + r
            mean_gr = r.view(-1, self.num_generation).mean(dim=1)
            std_gr = r.view(-1, self.num_generation).std(dim=1)
            mean_gr = mean_gr.repeat_interleave(self.num_generation, dim=0)
            std_gr = std_gr.repeat_interleave(self.num_generation, dim=0)
            advantages = (r - mean_gr) / (std_gr + 1e-4)
            value = r.detach()  # dummy value
            r = r.detach()

        return Experience(sequences, action_log_probs, value, r, kl, advantages, attention_mask, action_mask)
        return Experience(
            sequences.cpu(),
            action_log_probs.cpu(),
            value.cpu(),
            r.cpu(),
            kl.cpu(),
            advantages.cpu(),
            attention_mask.cpu(),
            action_mask.cpu(),
        )
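Conceptually, the PPO branch above shapes the scalar reward with a KL penalty between the actor and the frozen initial model before computing advantages. A minimal sketch of that idea (not the repository's `compute_reward`):

```python
# KL-shaped reward sketch: penalize divergence of the actor from the reference policy.
import torch

def kl_shaped_reward(r, action_log_probs, base_action_log_probs, action_mask, kl_coef=0.01):
    # approximate per-sequence KL between actor and reference on the generated tokens
    kl = ((action_log_probs - base_action_log_probs) * action_mask).sum(dim=-1) / action_mask.sum(dim=-1)
    return r - kl_coef * kl, kl

r = torch.tensor([1.0, 0.2])
actor_logps = torch.randn(2, 5)
ref_logps = torch.randn(2, 5)
mask = torch.ones(2, 5)
print(kl_shaped_reward(r, actor_logps, ref_logps, mask))
```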
@@ -4,12 +4,14 @@ from .generation import generate, generate_streaming, prepare_inputs_fn, update_
from .lora import LoraConfig, convert_to_lora_module, lora_manager
from .loss import DpoLoss, KTOLoss, LogExpLoss, LogSigLoss, PolicyLoss, ValueLoss
from .reward_model import RewardModel
from .rlvr_reward_model import RLVRRewardModel
from .utils import disable_dropout

__all__ = [
    "BaseModel",
    "Critic",
    "RewardModel",
    "RLVRRewardModel",
    "PolicyLoss",
    "ValueLoss",
    "LogSigLoss",
@ -1,3 +1,4 @@
|
||||
import copy
|
||||
from typing import Any, Callable, List, Optional
|
||||
|
||||
import torch
|
||||
@ -88,13 +89,14 @@ def update_model_kwargs_fn(outputs: dict, new_mask, **model_kwargs) -> dict:
|
||||
return model_kwargs
|
||||
|
||||
|
||||
def prepare_inputs_fn(input_ids: torch.Tensor, pad_token_id: int, **model_kwargs) -> dict:
|
||||
def prepare_inputs_fn(input_ids: torch.Tensor, **model_kwargs) -> dict:
|
||||
model_kwargs["input_ids"] = input_ids
|
||||
return model_kwargs
|
||||
|
||||
|
||||
def _sample(
|
||||
model: Any,
|
||||
tokenizer: Any,
|
||||
input_ids: torch.Tensor,
|
||||
max_length: int,
|
||||
early_stopping: bool = True,
|
||||
@ -137,8 +139,8 @@ def _sample(
|
||||
if max_new_tokens is None:
|
||||
max_new_tokens = max_length - context_length
|
||||
if context_length + max_new_tokens > max_length or max_new_tokens == 0:
|
||||
print("Exeeded length limitation")
|
||||
return input_ids
|
||||
|
||||
logits_processor = _prepare_logits_processor(top_k, top_p, temperature)
|
||||
unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)
|
||||
past = None
|
||||
@ -183,18 +185,14 @@ def _sample(
|
||||
|
||||
if stop_token_ids is not None:
|
||||
# If the last len(stop_token_ids) tokens of input_ids are equal to stop_token_ids, set sentence to finished.
|
||||
tokens_to_check = input_ids[:, -len(stop_token_ids) :]
|
||||
unfinished_sequences = unfinished_sequences.mul(
|
||||
torch.any(tokens_to_check != torch.LongTensor(stop_token_ids).to(input_ids.device), dim=1).long()
|
||||
)
|
||||
for stop_token_id in stop_token_ids:
|
||||
tokens_to_check = input_ids[:, -len(stop_token_id) :]
|
||||
unfinished_sequences = unfinished_sequences.mul(
|
||||
torch.any(tokens_to_check != torch.LongTensor(stop_token_id).to(input_ids.device), dim=1).long()
|
||||
)
|
||||
|
||||
# Stop when each sentence is finished if early_stopping=True
|
||||
if (early_stopping and _is_sequence_finished(unfinished_sequences)) or i == context_length + max_new_tokens - 1:
|
||||
if i == context_length + max_new_tokens - 1:
|
||||
# Force to end with stop token ids
|
||||
input_ids[input_ids[:, -1] != pad_token_id, -len(stop_token_ids) :] = (
|
||||
torch.LongTensor(stop_token_ids).to(input_ids.device).long()
|
||||
)
|
||||
return input_ids
|
||||
|
||||
|
||||
@ -237,8 +235,10 @@ def generate(
|
||||
raise NotImplementedError
|
||||
elif is_sample_gen_mode:
|
||||
# Run sample
|
||||
generation_kwargs = copy.deepcopy(model_kwargs)
|
||||
res = _sample(
|
||||
model,
|
||||
tokenizer,
|
||||
input_ids,
|
||||
max_length,
|
||||
early_stopping=early_stopping,
|
||||
@ -249,8 +249,9 @@ def generate(
|
||||
temperature=temperature,
|
||||
prepare_inputs_fn=prepare_inputs_fn,
|
||||
update_model_kwargs_fn=update_model_kwargs_fn,
|
||||
**model_kwargs,
|
||||
**generation_kwargs,
|
||||
)
|
||||
del generation_kwargs
|
||||
return res
|
||||
elif is_beam_gen_mode:
|
||||
raise NotImplementedError
|
||||
@ -350,11 +351,17 @@ def _sample_streaming(
|
||||
unfinished_sequences = unfinished_sequences.mul((next_tokens != eos_token_id).long())
|
||||
|
||||
if stop_token_ids is not None:
|
||||
# If the last len(stop_token_ids) tokens of input_ids are equal to stop_token_ids, set sentence to finished.
|
||||
tokens_to_check = input_ids[:, -len(stop_token_ids) :]
|
||||
unfinished_sequences = unfinished_sequences.mul(
|
||||
torch.any(tokens_to_check != torch.LongTensor(stop_token_ids).to(input_ids.device), dim=1).long()
|
||||
)
|
||||
if isinstance(stop_token_ids[0], int):
|
||||
# If the last len(stop_token_ids) tokens of input_ids are equal to stop_token_ids, set sentence to finished.
|
||||
unfinished_sequences = unfinished_sequences.mul(
|
||||
torch.any(tokens_to_check != torch.LongTensor(stop_token_ids).to(input_ids.device), dim=1).long()
|
||||
)
|
||||
else:
|
||||
for stop_token_id in stop_token_ids:
|
||||
unfinished_sequences = unfinished_sequences.mul(
|
||||
torch.any(tokens_to_check != torch.LongTensor(stop_token_id).to(input_ids.device), dim=1).long()
|
||||
)
|
||||
|
||||
# Stop when each sentence is finished if early_stopping=True
|
||||
if (
|
||||
|
@ -25,7 +25,9 @@ class RewardModel(BaseModel):
|
||||
self.value_head = nn.Linear(self.last_hidden_state_size, 1)
|
||||
self.value_head.weight.data.normal_(mean=0.0, std=1 / (self.last_hidden_state_size + 1))
|
||||
|
||||
def forward(self, input_ids: torch.LongTensor, attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
|
||||
def forward(
|
||||
self, input_ids: torch.LongTensor, attention_mask: Optional[torch.Tensor] = None, **kwargs
|
||||
) -> torch.Tensor:
|
||||
outputs = self.model(input_ids, attention_mask=attention_mask)
|
||||
|
||||
last_hidden_states = outputs["last_hidden_state"]
|
||||
|
50
applications/ColossalChat/coati/models/rlvr_reward_model.py
Normal file
@ -0,0 +1,50 @@
|
||||
"""
|
||||
reward model
|
||||
"""
|
||||
|
||||
from typing import Callable, List, Optional
|
||||
|
||||
import torch
|
||||
|
||||
|
||||
class RLVRRewardModel:
|
||||
"""
|
||||
RLVR reward model class. Supports verifiable rewards.
|
||||
|
||||
Args:
|
||||
reward_fn_list (List[Callable]): list of reward functions
|
||||
**kwargs: all other kwargs as in reward functions
|
||||
"""
|
||||
|
||||
def __init__(self, reward_fn_list: List[Callable], **kwargs) -> None:
|
||||
self.reward_fn_list = reward_fn_list
|
||||
self.kwargs = kwargs
|
||||
|
||||
def __call__(
|
||||
self,
|
||||
input_ids: torch.LongTensor,
|
||||
attention_mask: Optional[torch.Tensor] = None,
|
||||
response_start: List = None,
|
||||
response_end: List = None,
|
||||
gt_answer: List = None,
|
||||
) -> torch.Tensor:
|
||||
# apply verifiable reward
|
||||
bs = input_ids.size(0)
|
||||
rewards = torch.zeros(bs, device=input_ids.device)
|
||||
for i in range(bs):
|
||||
for reward_fn in self.reward_fn_list:
|
||||
rewards[i] += reward_fn(
|
||||
input_ids[i],
|
||||
attention_mask[i],
|
||||
response_start=response_start[i],
|
||||
response_end=response_end[i],
|
||||
gt_answer=gt_answer[i],
|
||||
**self.kwargs,
|
||||
)
|
||||
return rewards
|
||||
|
||||
def to(self, device):
|
||||
return self
|
||||
|
||||
def eval(self):
|
||||
return self
|
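A small usage sketch for the RLVRRewardModel added above; `toy_reward_fn` is a hypothetical stand-in (the shipped verifiable-reward functions live in coati/utils/reward_score):

```python
import torch
from coati.models import RLVRRewardModel


def toy_reward_fn(input_ids, attention_mask, response_start=None, response_end=None, gt_answer=None, **kwargs):
    # hypothetical rule: reward 1.0 whenever a ground-truth answer is attached
    return torch.tensor(1.0 if gt_answer is not None else 0.0, device=input_ids.device)


reward_model = RLVRRewardModel(reward_fn_list=[toy_reward_fn])
input_ids = torch.zeros(2, 8, dtype=torch.long)
attention_mask = torch.ones_like(input_ids)
rewards = reward_model(
    input_ids,
    attention_mask,
    response_start=[4, 4],
    response_end=[7, 7],
    gt_answer=["42", None],
)
print(rewards)  # tensor([1., 0.])
```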
@ -142,3 +142,17 @@ def disable_dropout(model: torch.nn.Module):
|
||||
for module in model.modules():
|
||||
if isinstance(module, torch.nn.Dropout):
|
||||
module.p = 0.0
|
||||
|
||||
|
||||
def repad_to_left(tensor, tokenizer):
|
||||
repadded_input_ids = []
|
||||
max_non_padded_seq_len = 0
|
||||
for i in range(tensor.size(0)):
|
||||
non_pad_indices = (tensor[i] != tokenizer.pad_token_id).nonzero(as_tuple=True)[0]
|
||||
start, end = non_pad_indices.min(), non_pad_indices.max()
|
||||
repadded_input_ids.append(tensor[i][start : end + 1])
|
||||
max_non_padded_seq_len = max(max_non_padded_seq_len, repadded_input_ids[-1].size(0))
|
||||
repadded_input_ids = [
|
||||
F.pad(t, (max_non_padded_seq_len - t.size(0), 0), value=tokenizer.pad_token_id) for t in repadded_input_ids
|
||||
]
|
||||
return torch.stack(repadded_input_ids)
|
||||
|
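A quick sketch of what the `repad_to_left` helper above does: it trims each row to its non-padded span and re-pads on the left. The import path is assumed from the file shown in this hunk, and the tokenizer stub only supplies `pad_token_id`:

```python
import torch
from coati.models.utils import repad_to_left  # path assumed from the hunk above


class _Tok:
    # minimal stand-in: repad_to_left only reads pad_token_id
    pad_token_id = 0


# two right-padded sequences of different lengths
batch = torch.tensor([[5, 6, 7, 0, 0],
                      [8, 9, 0, 0, 0]])
print(repad_to_left(batch, _Tok()))
# tensor([[5, 6, 7],
#         [0, 8, 9]])
```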
@ -1,26 +0,0 @@
|
||||
import openai
|
||||
from openai.types.chat.chat_completion import ChatCompletion
|
||||
from openai.types.chat.chat_completion_message_param import ChatCompletionMessageParam
|
||||
|
||||
API_KEY = "Dummy API Key"
|
||||
|
||||
|
||||
def get_client(base_url: str | None = None) -> openai.Client:
|
||||
return openai.Client(api_key=API_KEY, base_url=base_url)
|
||||
|
||||
|
||||
def chat_completion(
|
||||
messages: list[ChatCompletionMessageParam],
|
||||
model: str,
|
||||
base_url: str | None = None,
|
||||
temperature: float = 0.8,
|
||||
**kwargs,
|
||||
) -> ChatCompletion:
|
||||
client = get_client(base_url)
|
||||
response = client.chat.completions.create(
|
||||
model=model,
|
||||
messages=messages,
|
||||
temperature=temperature,
|
||||
**kwargs,
|
||||
)
|
||||
return response
|
@ -1,250 +0,0 @@
|
||||
"""
|
||||
Implementation of MCTS + Self-refine algorithm.
|
||||
|
||||
Reference:
|
||||
1. "Accessing GPT-4 level Mathematical Olympiad Solutions via Monte
|
||||
Carlo Tree Self-refine with LLaMa-3 8B: A Technical Report"
|
||||
2. https://github.com/BrendanGraham14/mcts-llm/
|
||||
3. https://github.com/trotsky1997/MathBlackBox/
|
||||
4. https://github.com/openreasoner/openr/blob/main/reason/guided_search/tree.py
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import math
|
||||
from collections import deque
|
||||
|
||||
import numpy as np
|
||||
import tqdm
|
||||
from coati.reasoner.guided_search.llm import chat_completion
|
||||
from coati.reasoner.guided_search.prompt_store.base import PromptCFG
|
||||
from pydantic import BaseModel
|
||||
|
||||
|
||||
class MCTSNode(BaseModel):
|
||||
"""
|
||||
Node for MCTS.
|
||||
"""
|
||||
|
||||
answer: str
|
||||
parent: MCTSNode = None
|
||||
children: list[MCTSNode] = []
|
||||
num_visits: int = 0
|
||||
Q: int = 0
|
||||
rewards: list[int] = []
|
||||
|
||||
def expand_node(self, node) -> None:
|
||||
self.children.append(node)
|
||||
|
||||
def add_reward(self, reward: int) -> None:
|
||||
self.rewards.append(reward)
|
||||
self.Q = (np.min(self.rewards) + np.mean(self.rewards)) / 2
|
||||
|
||||
|
||||
class MCTS(BaseModel):
|
||||
"""
|
||||
Simulation of MCTS process.
|
||||
"""
|
||||
|
||||
problem: str
|
||||
max_simulations: int
|
||||
cfg: PromptCFG
|
||||
C: float = 1.4
|
||||
max_children: int = 2
|
||||
epsilon: float = 1e-5
|
||||
root: MCTSNode = None
|
||||
|
||||
def initialization(self):
|
||||
"""
|
||||
Root Initiation.
|
||||
"""
|
||||
# Simple answer as root. You can also use negative response such as "I do not know" as a response.
|
||||
base_answer = self.sample_base_answer()
|
||||
self.root = MCTSNode(answer=base_answer)
|
||||
self.self_evaluate(self.root)
|
||||
|
||||
def is_fully_expanded(self, node: MCTSNode):
|
||||
return len(node.children) >= self.max_children or any(child.Q > node.Q for child in node.children)
|
||||
|
||||
def select_node(self) -> MCTSNode:
|
||||
"""
|
||||
Select next node to explore.
|
||||
"""
|
||||
candidates: list[MCTSNode] = []
|
||||
to_explore = deque([self.root])
|
||||
|
||||
while to_explore:
|
||||
current_node = to_explore.popleft()
|
||||
if not self.is_fully_expanded(current_node):
|
||||
candidates.append(current_node)
|
||||
to_explore.extend(current_node.children)
|
||||
|
||||
if not candidates:
|
||||
return self.root
|
||||
|
||||
return max(candidates, key=self.compute_uct)
|
||||
|
||||
def self_evaluate(self, node: MCTSNode):
|
||||
"""
|
||||
Sample reward of the answer.
|
||||
"""
|
||||
reward = self.sample_reward(node)
|
||||
node.add_reward(reward)
|
||||
|
||||
def back_propagation(self, node: MCTSNode):
|
||||
"""
|
||||
Back propagate the value of the refined answer.
|
||||
"""
|
||||
parent = node.parent
|
||||
while parent:
|
||||
best_child_Q = max(child.Q for child in parent.children)
|
||||
parent.Q = (parent.Q + best_child_Q) / 2
|
||||
parent.num_visits += 1
|
||||
parent = parent.parent
|
||||
|
||||
def compute_uct(self, node: MCTSNode):
|
||||
"""
|
||||
Compute UCT.
|
||||
"""
|
||||
if node.parent is None:
|
||||
return -100
|
||||
return node.Q + self.C * math.sqrt(math.log(node.parent.num_visits + 1) / (node.num_visits + self.epsilon))
|
||||
|
||||
def simulate(self):
|
||||
self.initialization()
|
||||
for _ in tqdm.tqdm(range(self.max_simulations)):
|
||||
node = self.select_node()
|
||||
child = self.self_refine(node)
|
||||
node.expand_node(child)
|
||||
self.self_evaluate(child)
|
||||
self.back_propagation(child)
|
||||
|
||||
return self.get_best_answer()
|
||||
|
||||
def get_best_answer(self):
|
||||
to_visit = deque([self.root])
|
||||
best_node = self.root
|
||||
|
||||
while to_visit:
|
||||
current_node = to_visit.popleft()
|
||||
if current_node.Q > best_node.Q:
|
||||
best_node = current_node
|
||||
to_visit.extend(current_node.children)
|
||||
|
||||
return best_node.answer
|
||||
|
||||
def self_refine(self, node: MCTSNode):
|
||||
"""
|
||||
Refine node.
|
||||
"""
|
||||
critique_response = chat_completion(
|
||||
messages=[
|
||||
{
|
||||
"role": "system",
|
||||
"content": self.cfg.critic_system_prompt,
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "\n\n".join(
|
||||
[
|
||||
f"<problem>\n{self.problem}\n</problem>",
|
||||
f"<current_answer>\n{node.answer}\n</current_answer>",
|
||||
]
|
||||
),
|
||||
},
|
||||
],
|
||||
model=self.cfg.model,
|
||||
base_url=self.cfg.base_url,
|
||||
max_tokens=self.cfg.max_tokens,
|
||||
)
|
||||
critique = critique_response.choices[0].message.content
|
||||
assert critique is not None
|
||||
refined_answer_response = chat_completion(
|
||||
messages=[
|
||||
{
|
||||
"role": "system",
|
||||
"content": self.cfg.refine_system_prompt,
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "\n\n".join(
|
||||
[
|
||||
f"<problem>\n{self.problem}\n</problem>",
|
||||
f"<current_answer>\n{node.answer}\n</current_answer>",
|
||||
f"<critique>\n{critique}\n</critique>",
|
||||
]
|
||||
),
|
||||
},
|
||||
],
|
||||
model=self.cfg.model,
|
||||
base_url=self.cfg.base_url,
|
||||
max_tokens=self.cfg.max_tokens,
|
||||
)
|
||||
refined_answer = refined_answer_response.choices[0].message.content
|
||||
assert refined_answer is not None
|
||||
|
||||
return MCTSNode(answer=refined_answer, parent=node)
|
||||
|
||||
def sample_base_answer(self):
|
||||
response = chat_completion(
|
||||
messages=[
|
||||
{
|
||||
"role": "system",
|
||||
"content": self.cfg.base_system_prompt,
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": f"<problem>\n {self.problem} \n</problem> \nLet's think step by step",
|
||||
},
|
||||
],
|
||||
model=self.cfg.model,
|
||||
base_url=self.cfg.base_url,
|
||||
max_tokens=self.cfg.max_tokens,
|
||||
)
|
||||
assert response.choices[0].message.content is not None
|
||||
return response.choices[0].message.content
|
||||
|
||||
def sample_reward(self, node: MCTSNode):
|
||||
"""
|
||||
Calculate reward.
|
||||
"""
|
||||
messages = [
|
||||
{
|
||||
"role": "system",
|
||||
"content": self.cfg.evaluate_system_prompt,
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "\n\n".join(
|
||||
[
|
||||
f"<problem>\n{self.problem}\n</problem>",
|
||||
f"<answer>\n{node.answer}\n</answer>",
|
||||
]
|
||||
),
|
||||
},
|
||||
]
|
||||
for attempt in range(3):
|
||||
try:
|
||||
response = chat_completion(
|
||||
messages=messages,
|
||||
model=self.cfg.model,
|
||||
base_url=self.cfg.base_url,
|
||||
max_tokens=self.cfg.max_tokens,
|
||||
)
|
||||
assert response.choices[0].message.content is not None
|
||||
return int(response.choices[0].message.content)
|
||||
except ValueError:
|
||||
messages.extend(
|
||||
[
|
||||
{
|
||||
"role": "assistant",
|
||||
"content": response.choices[0].message.content,
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Failed to parse reward as an integer.",
|
||||
},
|
||||
]
|
||||
)
|
||||
if attempt == 2:
|
||||
raise
|
@ -1,11 +0,0 @@
|
||||
from pydantic import BaseModel
|
||||
|
||||
|
||||
class PromptCFG(BaseModel):
|
||||
model: str
|
||||
base_url: str
|
||||
max_tokens: int = 4096
|
||||
base_system_prompt: str
|
||||
critic_system_prompt: str
|
||||
refine_system_prompt: str
|
||||
evaluate_system_prompt: str
|
@ -1,22 +0,0 @@
|
||||
"""
|
||||
Prompts for Qwen Series.
|
||||
"""
|
||||
|
||||
from coati.reasoner.guided_search.prompt_store.base import PromptCFG
|
||||
|
||||
Qwen32B_prompt_CFG = PromptCFG(
|
||||
base_url="http://0.0.0.0:8008/v1",
|
||||
model="Qwen2.5-32B-Instruct",
|
||||
base_system_prompt="The user will present a problem. Analyze and solve the problem in the following structure:\n"
|
||||
"Begin with [Reasoning Process] to explain the approach. \n Proceed with [Verification] to confirm the solution. \n Conclude with [Final Answer] in the format: 'Answer: [answer]'",
|
||||
critic_system_prompt="Provide a detailed and constructive critique of the answer, focusing on ways to improve its clarity, accuracy, and relevance."
|
||||
"Highlight specific areas that need refinement or correction, and offer concrete suggestions for enhancing the overall quality and effectiveness of the response.",
|
||||
refine_system_prompt="""# Instruction
|
||||
Refine the answer based on the critique. The response should begin with [reasoning process]...[Verification]... and end with [Final Answer].
|
||||
""",
|
||||
evaluate_system_prompt=(
|
||||
"Critically analyze this answer and provide a reward score between -100 and 100 based on strict standards."
|
||||
"The score should clearly reflect the quality of the answer."
|
||||
"Make sure the reward score is an integer. You should only return the score. If the score is greater than 95, return 95."
|
||||
),
|
||||
)
|
@ -1,5 +1,6 @@
|
||||
from .base import OLTrainer, SLTrainer
|
||||
from .dpo import DPOTrainer
|
||||
from .grpo import GRPOTrainer
|
||||
from .kto import KTOTrainer
|
||||
from .orpo import ORPOTrainer
|
||||
from .ppo import PPOTrainer
|
||||
@ -15,4 +16,5 @@ __all__ = [
|
||||
"DPOTrainer",
|
||||
"ORPOTrainer",
|
||||
"KTOTrainer",
|
||||
"GRPOTrainer",
|
||||
]
|
||||
|
@ -96,6 +96,7 @@ class OLTrainer(ABC):
|
||||
self.sample_buffer = sample_buffer
|
||||
self.dataloader_pin_memory = dataloader_pin_memory
|
||||
self.callbacks = callbacks
|
||||
self.num_train_step = 0
|
||||
|
||||
@contextmanager
|
||||
def _fit_ctx(self) -> None:
|
||||
@ -212,5 +213,6 @@ class OLTrainer(ABC):
|
||||
self._update_phase(update_step)
|
||||
# NOTE: this is for on-policy algorithms
|
||||
self.data_buffer.clear()
|
||||
if self.save_interval > 0 and (episode + 1) % (self.save_interval) == 0:
|
||||
self._save_checkpoint(episode + 1)
|
||||
|
||||
if self.num_train_step > 0 and (self.num_train_step + 1) % (self.save_interval) == 0:
|
||||
self._save_checkpoint(self.num_train_step + 1)
|
||||
|
@ -343,7 +343,7 @@ class DPOTrainer(SLTrainer):
|
||||
self.accumulative_meter.add("loss", loss_mean.to(torch.float16).item())
|
||||
self.accumulative_meter.add("accuracy", reward_accuracies_mean.to(torch.float16).item())
|
||||
|
||||
if (i + 1) % self.accumulation_steps == 0:
|
||||
if (self.num_train_step + 1) % self.accumulation_steps == 0:
|
||||
self.optimizer.step()
|
||||
self.optimizer.zero_grad()
|
||||
self.actor_scheduler.step()
|
||||
@ -358,29 +358,30 @@ class DPOTrainer(SLTrainer):
|
||||
)
|
||||
step_bar.update()
|
||||
if self.writer and is_rank_0():
|
||||
self.writer.add_scalar("train/loss", self.accumulative_meter.get("loss"), self.num_train_step)
|
||||
self.writer.add_scalar("train/lr", self.optimizer.param_groups[0]["lr"], self.num_train_step)
|
||||
global_step = (self.num_train_step + 1) / self.accumulation_steps
|
||||
self.writer.add_scalar("train/loss", self.accumulative_meter.get("loss"), global_step)
|
||||
self.writer.add_scalar("train/lr", self.optimizer.param_groups[0]["lr"], global_step)
|
||||
self.writer.add_scalar(
|
||||
"train/chosen_rewards", self.accumulative_meter.get("chosen_rewards"), self.num_train_step
|
||||
"train/chosen_rewards", self.accumulative_meter.get("chosen_rewards"), global_step
|
||||
)
|
||||
self.writer.add_scalar(
|
||||
"train/rejected_rewards",
|
||||
self.accumulative_meter.get("rejected_rewards"),
|
||||
self.num_train_step,
|
||||
global_step,
|
||||
)
|
||||
self.writer.add_scalar(
|
||||
"train/margin",
|
||||
self.accumulative_meter.get("chosen_rewards")
|
||||
- self.accumulative_meter.get("rejected_rewards"),
|
||||
self.num_train_step,
|
||||
global_step,
|
||||
)
|
||||
self.writer.add_scalar(
|
||||
"train/accuracy",
|
||||
self.accumulative_meter.get("accuracy"),
|
||||
self.num_train_step,
|
||||
global_step,
|
||||
)
|
||||
self.num_train_step += 1
|
||||
self.accumulative_meter.reset()
|
||||
self.num_train_step += 1
|
||||
|
||||
if self.save_dir is not None and self.num_train_step > 0 and self.num_train_step % self.save_interval == 0:
|
||||
# save checkpoint
|
||||
|
386
applications/ColossalChat/coati/trainer/grpo.py
Executable file
@ -0,0 +1,386 @@
|
||||
"""
|
||||
GRPO trainer
|
||||
"""
|
||||
|
||||
import os
|
||||
from typing import Dict, List, Optional, Union
|
||||
|
||||
import torch
|
||||
import wandb
|
||||
from coati.experience_buffer import NaiveExperienceBuffer
|
||||
from coati.experience_maker import Experience, NaiveExperienceMaker
|
||||
from coati.models import RewardModel, RLVRRewardModel
|
||||
from coati.models.loss import GPTLMLoss, PolicyLoss
|
||||
from coati.models.utils import calc_action_log_probs
|
||||
from coati.trainer.callbacks import Callback
|
||||
from coati.trainer.utils import all_reduce_mean
|
||||
from coati.utils import AccumulativeMeanMeter, save_checkpoint
|
||||
from torch.optim import Optimizer
|
||||
from torch.optim.lr_scheduler import _LRScheduler
|
||||
from torch.utils.data import DataLoader, DistributedSampler
|
||||
from tqdm import tqdm
|
||||
from transformers import PreTrainedModel, PreTrainedTokenizerBase
|
||||
|
||||
from colossalai.booster import Booster
|
||||
from colossalai.booster.plugin import GeminiPlugin
|
||||
from colossalai.cluster import DistCoordinator
|
||||
from colossalai.utils import get_current_device
|
||||
|
||||
from .base import OLTrainer
|
||||
from .utils import AnnealingScheduler, CycledDataLoader, is_rank_0, to_device
|
||||
|
||||
|
||||
def _set_default_generate_kwargs(actor: PreTrainedModel) -> Dict:
|
||||
"""
|
||||
Set default keyword arguments for generation based on the actor model.
|
||||
|
||||
Args:
|
||||
actor (PreTrainedModel): The actor model.
|
||||
|
||||
Returns:
|
||||
Dict: A dictionary containing the default keyword arguments for generation.
|
||||
"""
|
||||
unwrapped_model = actor.unwrap()
|
||||
new_kwargs = {}
|
||||
# use huggingface models method directly
|
||||
if hasattr(unwrapped_model, "prepare_inputs_for_generation"):
|
||||
new_kwargs["prepare_inputs_fn"] = unwrapped_model.prepare_inputs_for_generation
|
||||
if hasattr(unwrapped_model, "_update_model_kwargs_for_generation"):
|
||||
new_kwargs["update_model_kwargs_fn"] = unwrapped_model._update_model_kwargs_for_generation
|
||||
return new_kwargs
|
||||
|
||||
|
||||
class GRPOTrainer(OLTrainer):
|
||||
"""
|
||||
Trainer for GRPO algorithm.
|
||||
|
||||
Args:
|
||||
strategy (Booster): the strategy to use for training
|
||||
actor (Actor): the actor model in ppo algorithm
|
||||
reward_model (RewardModel): the reward model in rlhf algorithm to make reward of sentences
|
||||
initial_model (Actor): the initial model in rlhf algorithm to generate reference logics to limit the update of actor
|
||||
actor_optim (Optimizer): the optimizer to use for actor model
|
||||
kl_coef (float, defaults to 0.1): the coefficient of kl divergence loss
|
||||
train_batch_size (int, defaults to 8): the batch size to use for training
|
||||
buffer_limit (int, defaults to 0): the max_size limitation of buffer
|
||||
buffer_cpu_offload (bool, defaults to True): whether to offload buffer to cpu
|
||||
eps_clip (float, defaults to 0.2): the clip coefficient of policy loss
|
||||
vf_coef (float, defaults to 1.0): the coefficient of value loss
|
||||
ptx_coef (float, defaults to 0.9): the coefficient of ptx loss
|
||||
value_clip (float, defaults to 0.4): the clip coefficient of value loss
|
||||
sample_buffer (bool, defaults to False): whether to sample from buffer
|
||||
dataloader_pin_memory (bool, defaults to True): whether to pin memory for data loader
|
||||
offload_inference_models (bool, defaults to True): whether to offload inference models to cpu during training process
|
||||
callbacks (List[Callback], defaults to []): the callbacks to call during training process
|
||||
generate_kwargs (dict, optional): the kwargs to use while model generating
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
actor_booster: Booster,
|
||||
actor: PreTrainedModel,
|
||||
reward_model: Union[RewardModel, RLVRRewardModel],
|
||||
initial_model: PreTrainedModel,
|
||||
actor_optim: Optimizer,
|
||||
actor_lr_scheduler: _LRScheduler,
|
||||
tokenizer: PreTrainedTokenizerBase,
|
||||
kl_coef: float = 0.1,
|
||||
ptx_coef: float = 0.9,
|
||||
train_batch_size: int = 8,
|
||||
buffer_limit: int = 0,
|
||||
buffer_cpu_offload: bool = True,
|
||||
eps_clip: float = 0.2,
|
||||
vf_coef: float = 1.0,
|
||||
value_clip: float = 0.2,
|
||||
sample_buffer: bool = False,
|
||||
dataloader_pin_memory: bool = True,
|
||||
offload_inference_models: bool = True,
|
||||
apply_loss_mask: bool = True,
|
||||
accumulation_steps: int = 1,
|
||||
save_interval: int = 0,
|
||||
save_dir: str = None,
|
||||
use_tp: bool = False,
|
||||
num_generation: int = 8,
|
||||
inference_batch_size: int = None,
|
||||
logits_forward_batch_size: int = None,
|
||||
temperature_annealing_config: Optional[Dict] = None,
|
||||
coordinator: DistCoordinator = None,
|
||||
callbacks: List[Callback] = [],
|
||||
**generate_kwargs,
|
||||
) -> None:
|
||||
if isinstance(actor_booster, GeminiPlugin):
|
||||
assert not offload_inference_models, "GeminiPlugin is not compatible with manual model.to('cpu')"
|
||||
|
||||
data_buffer = NaiveExperienceBuffer(train_batch_size, buffer_limit, buffer_cpu_offload)
|
||||
super().__init__(actor_booster, None, data_buffer, sample_buffer, dataloader_pin_memory, callbacks=callbacks)
|
||||
self.generate_kwargs = _set_default_generate_kwargs(actor)
|
||||
self.generate_kwargs.update(generate_kwargs)
|
||||
|
||||
self.actor = actor
|
||||
self.actor_booster = actor_booster
|
||||
self.actor_scheduler = actor_lr_scheduler
|
||||
self.tokenizer = tokenizer
|
||||
self.experience_maker = NaiveExperienceMaker(
|
||||
self.actor,
|
||||
None,
|
||||
reward_model,
|
||||
initial_model,
|
||||
self.tokenizer,
|
||||
kl_coef,
|
||||
use_grpo=True,
|
||||
num_generation=num_generation,
|
||||
inference_batch_size=inference_batch_size,
|
||||
logits_forward_batch_size=logits_forward_batch_size,
|
||||
)
|
||||
if temperature_annealing_config:
|
||||
# use annealing
|
||||
self.temperature_annealing_scheduler = AnnealingScheduler(
|
||||
temperature_annealing_config["start_temperature"],
|
||||
temperature_annealing_config["end_temperature"],
|
||||
temperature_annealing_config["annealing_warmup_steps"],
|
||||
temperature_annealing_config["annealing_steps"],
|
||||
)
|
||||
else:
|
||||
self.temperature_annealing_scheduler = None
|
||||
|
||||
self.train_batch_size = train_batch_size
|
||||
|
||||
self.actor_loss_fn = PolicyLoss(eps_clip)
|
||||
self.vf_coef = vf_coef
|
||||
self.ptx_loss_fn = GPTLMLoss()
|
||||
self.ptx_coef = ptx_coef
|
||||
self.actor_optim = actor_optim
|
||||
self.save_interval = save_interval
|
||||
self.apply_loss_mask = apply_loss_mask
|
||||
self.coordinator = coordinator
|
||||
self.actor_save_dir = os.path.join(save_dir, "actor")
|
||||
self.num_train_step = 0
|
||||
self.accumulation_steps = accumulation_steps
|
||||
self.use_tp = use_tp
|
||||
self.accumulative_meter = AccumulativeMeanMeter()
|
||||
self.offload_inference_models = offload_inference_models
|
||||
self.device = get_current_device()
|
||||
|
||||
def _before_fit(
|
||||
self,
|
||||
prompt_dataloader: DataLoader,
|
||||
pretrain_dataloader: Optional[DataLoader] = None,
|
||||
log_dir: Optional[str] = None,
|
||||
use_wandb: bool = False,
|
||||
):
|
||||
"""
|
||||
Args:
|
||||
prompt_dataloader (DataLoader): the dataloader to use for prompt data
|
||||
pretrain_dataloader (DataLoader): the dataloader to use for pretrain data
|
||||
"""
|
||||
self.prompt_dataloader = CycledDataLoader(prompt_dataloader)
|
||||
self.pretrain_dataloader = CycledDataLoader(pretrain_dataloader) if pretrain_dataloader is not None else None
|
||||
|
||||
self.writer = None
|
||||
if use_wandb and is_rank_0():
|
||||
assert log_dir is not None, "log_dir must be provided when use_wandb is True"
|
||||
import wandb
|
||||
|
||||
self.wandb_run = wandb.init(project="Coati-grpo", sync_tensorboard=True)
|
||||
if log_dir is not None and is_rank_0():
|
||||
import os
|
||||
import time
|
||||
|
||||
from torch.utils.tensorboard import SummaryWriter
|
||||
|
||||
log_dir = os.path.join(log_dir, "grpo")
|
||||
log_dir = os.path.join(log_dir, time.strftime("%Y-%m-%d_%H:%M:%S", time.localtime()))
|
||||
self.writer = SummaryWriter(log_dir=log_dir)
|
||||
|
||||
def _setup_update_phrase_dataload(self):
|
||||
"""
|
||||
why not use distributed_dataloader?
|
||||
if tp is used, input on each rank is the same and we use the same dataloader to feed same experience to all ranks
|
||||
if tp is not used, input on each rank is different and we expect different experiences to be fed to each rank
|
||||
"""
|
||||
self.dataloader = DataLoader(
|
||||
self.data_buffer,
|
||||
batch_size=self.train_batch_size,
|
||||
shuffle=True,
|
||||
drop_last=True,
|
||||
pin_memory=self.dataloader_pin_memory,
|
||||
collate_fn=self.data_buffer.collate_fn,
|
||||
)
|
||||
|
||||
def _make_experience(self, collect_step: int) -> Experience:
|
||||
"""
|
||||
Make experience
|
||||
"""
|
||||
prompts = self.prompt_dataloader.next()
|
||||
if self.offload_inference_models:
|
||||
# TODO(ver217): this may be controlled by strategy if they are prepared by strategy
|
||||
self.experience_maker.initial_model.to(self.device)
|
||||
self.experience_maker.reward_model.to(self.device)
|
||||
if self.temperature_annealing_scheduler:
|
||||
self.generate_kwargs["temperature"] = self.temperature_annealing_scheduler.get_temperature()
|
||||
return self.experience_maker.make_experience(
|
||||
input_ids=prompts["input_ids"].to(get_current_device()),
|
||||
attention_mask=prompts["attention_mask"].to(get_current_device()),
|
||||
gt_answer=prompts["gt_answer"],
|
||||
**self.generate_kwargs,
|
||||
)
|
||||
|
||||
def _training_step(self, experience: Experience):
|
||||
"""
|
||||
Args:
|
||||
experience:
|
||||
sequences: [batch_size, prompt_length + response_length] --- <PAD>...<PAD><PROMPT>...<PROMPT><RESPONSE>...<RESPONSE><PAD>...<PAD>
|
||||
"""
|
||||
self.actor.train()
|
||||
num_actions = experience.action_log_probs.size(1)
|
||||
# policy loss
|
||||
|
||||
actor_logits = self.actor(input_ids=experience.sequences, attention_mask=experience.attention_mask)[
|
||||
"logits"
|
||||
] # [batch size, prompt_length + response_length]
|
||||
action_log_probs = calc_action_log_probs(actor_logits, experience.sequences, num_actions)
|
||||
actor_loss, to_skip, max_ratio = self.actor_loss_fn(
|
||||
action_log_probs,
|
||||
experience.action_log_probs,
|
||||
experience.advantages.unsqueeze(dim=-1).repeat_interleave(action_log_probs.size(-1), dim=-1),
|
||||
action_mask=experience.action_mask if self.apply_loss_mask else None,
|
||||
)
|
||||
# sequences that do not end properly are not counted in the token cost
|
||||
token_cost = torch.sum(
|
||||
(experience.sequences[:, -num_actions:] != self.tokenizer.pad_token_id).to(torch.float), axis=-1
|
||||
).to(actor_logits.device)
|
||||
end_properly = experience.sequences[:, -1] == self.tokenizer.pad_token_id
|
||||
mean_token_cost = torch.sum(token_cost * end_properly) / torch.sum(end_properly)
|
||||
actor_loss = (1 - self.ptx_coef) * actor_loss
|
||||
if not to_skip:
|
||||
self.actor_booster.backward(loss=actor_loss, optimizer=self.actor_optim)
|
||||
|
||||
# ptx loss
|
||||
if self.ptx_coef != 0:
|
||||
batch = self.pretrain_dataloader.next()
|
||||
batch = to_device(batch, self.device)
|
||||
outputs = self.actor(batch["input_ids"], attention_mask=batch["attention_mask"], labels=batch["labels"])
|
||||
ptx_loss = outputs.loss
|
||||
ptx_loss = self.ptx_coef * ptx_loss
|
||||
self.actor_booster.backward(loss=ptx_loss, optimizer=self.actor_optim)
|
||||
|
||||
# sync
|
||||
actor_loss_mean = all_reduce_mean(tensor=actor_loss)
|
||||
max_ratio_mean = all_reduce_mean(tensor=max_ratio)
|
||||
reward_mean = all_reduce_mean(tensor=experience.reward.mean())
|
||||
advantages_mean = all_reduce_mean(tensor=experience.advantages.mean())
|
||||
kl_mean = all_reduce_mean(tensor=experience.kl.mean())
|
||||
mean_token_cost = all_reduce_mean(tensor=mean_token_cost)
|
||||
if self.ptx_coef != 0:
|
||||
ptx_loss_mean = all_reduce_mean(tensor=ptx_loss)
|
||||
|
||||
self.accumulative_meter.add("actor_loss", actor_loss_mean.to(torch.float16).mean().item())
|
||||
self.accumulative_meter.add("max_ratio", max_ratio_mean.to(torch.float16).item())
|
||||
self.accumulative_meter.add("reward", reward_mean.to(torch.float16).mean().item())
|
||||
self.accumulative_meter.add("advantages", advantages_mean.to(torch.float16).item())
|
||||
self.accumulative_meter.add("skip_ratio", 1.0 if to_skip else 0.0)
|
||||
self.accumulative_meter.add("mean_token_cost", mean_token_cost.to(torch.float16).item())
|
||||
self.accumulative_meter.add("kl", kl_mean.to(torch.float16).item())
|
||||
if self.ptx_coef != 0:
|
||||
self.accumulative_meter.add("ptx_loss", ptx_loss_mean.to(torch.float16).mean().item())
|
||||
|
||||
if self.num_train_step % self.accumulation_steps == self.accumulation_steps - 1:
|
||||
self.actor_optim.step()
|
||||
self.actor_optim.zero_grad()
|
||||
self.actor_scheduler.step()
|
||||
|
||||
if self.temperature_annealing_scheduler:
|
||||
self.temperature_annealing_scheduler.step_forward()
|
||||
|
||||
# preparing logging model output and corresponding rewards.
|
||||
if self.num_train_step % 10 == 0:
|
||||
response_text = self.experience_maker.tokenizer.batch_decode(
|
||||
experience.sequences, skip_special_tokens=True
|
||||
)
|
||||
for i in range(len(response_text)):
|
||||
response_text[i] = response_text[i] + f"\n\nReward: {experience.reward[i]}"
|
||||
|
||||
if self.writer and is_rank_0() and "wandb_run" in self.__dict__:
|
||||
# log output to wandb
|
||||
my_table = wandb.Table(
|
||||
columns=[f"sample response {i}" for i in range(len(response_text))], data=[response_text]
|
||||
)
|
||||
try:
|
||||
self.wandb_run.log({"sample_response": my_table})
|
||||
except OSError as e:
|
||||
self.coordinator.print_on_master(e)
|
||||
elif self.writer and is_rank_0():
|
||||
for line in response_text:
|
||||
self.coordinator.print_on_master(line)
|
||||
|
||||
if self.writer and is_rank_0():
|
||||
global_step = (self.num_train_step + 1) / self.accumulation_steps
|
||||
self.writer.add_scalar("train/max_ratio", self.accumulative_meter.get("max_ratio"), global_step)
|
||||
self.writer.add_scalar("train/skip_ratio", self.accumulative_meter.get("skip_ratio"), global_step)
|
||||
self.writer.add_scalar("train/actor_loss", self.accumulative_meter.get("actor_loss"), global_step)
|
||||
self.writer.add_scalar("train/lr_actor", self.actor_optim.param_groups[0]["lr"], global_step)
|
||||
if self.ptx_coef != 0:
|
||||
self.writer.add_scalar("train/ptx_loss", self.accumulative_meter.get("ptx_loss"), global_step)
|
||||
self.writer.add_scalar("reward", self.accumulative_meter.get("reward"), global_step)
|
||||
self.writer.add_scalar("token_cost", self.accumulative_meter.get("mean_token_cost"), global_step)
|
||||
self.writer.add_scalar("approx_kl", self.accumulative_meter.get("kl"), global_step)
|
||||
self.writer.add_scalar("advantages", self.accumulative_meter.get("advantages"), global_step)
|
||||
self.accumulative_meter.reset()
|
||||
self.num_train_step += 1
|
||||
|
||||
def _learn(self, update_step: int):
|
||||
"""
|
||||
Perform the learning step of the GRPO algorithm.
|
||||
|
||||
Args:
|
||||
update_step (int): The current update step.
|
||||
|
||||
Returns:
|
||||
None
|
||||
"""
|
||||
if self.offload_inference_models:
|
||||
self.experience_maker.initial_model.to("cpu")
|
||||
self.experience_maker.reward_model.to("cpu")
|
||||
# buffer may be empty at first, we should rebuild at each training
|
||||
if self.sample_buffer:
|
||||
experience = self.data_buffer.sample()
|
||||
self._on_learn_batch_start()
|
||||
experience.to_device(self.device)
|
||||
self._training_step(experience)
|
||||
self._on_learn_batch_end(experience)
|
||||
else:
|
||||
if isinstance(self.dataloader.sampler, DistributedSampler):
|
||||
self.dataloader.sampler.set_epoch(update_step)
|
||||
pbar = tqdm(self.dataloader, desc=f"Train epoch [{update_step + 1}]", disable=not is_rank_0())
|
||||
for experience in pbar:
|
||||
self._on_learn_batch_start()
|
||||
experience.to_device(self.device)
|
||||
self._training_step(experience)
|
||||
self._on_learn_batch_end(experience)
|
||||
|
||||
def _save_checkpoint(self, num_train_step: int = 0):
|
||||
"""
|
||||
Save the actor checkpoints with running states.
|
||||
|
||||
Args:
|
||||
num_train_step (int): The current num_train_step number.
|
||||
|
||||
Returns:
|
||||
None
|
||||
"""
|
||||
|
||||
self.coordinator.print_on_master("\nStart saving actor checkpoint with running states")
|
||||
save_checkpoint(
|
||||
save_dir=self.actor_save_dir,
|
||||
booster=self.actor_booster,
|
||||
model=self.actor,
|
||||
optimizer=self.actor_optim,
|
||||
lr_scheduler=self.actor_scheduler,
|
||||
epoch=0,
|
||||
step=num_train_step + 1,
|
||||
batch_size=self.train_batch_size,
|
||||
coordinator=self.coordinator,
|
||||
)
|
||||
self.coordinator.print_on_master(
|
||||
f"Saved actor checkpoint at episode {(num_train_step + 1)} at folder {self.actor_save_dir}"
|
||||
)
|
@ -217,25 +217,25 @@ class KTOTrainer(SLTrainer):
|
||||
self.accumulative_meter.add("rejected_rewards", rejected_rewards_mean.to(torch.float16).mean().item())
|
||||
self.accumulative_meter.add("loss", loss_mean.to(torch.float16).detach().item())
|
||||
|
||||
if i % self.accumulation_steps == self.accumulation_steps - 1:
|
||||
self.num_train_step += 1
|
||||
if self.num_train_step % self.accumulation_steps == self.accumulation_steps - 1:
|
||||
step_bar.update()
|
||||
# logging
|
||||
if self.writer and is_rank_0():
|
||||
self.writer.add_scalar("train/loss", self.accumulative_meter.get("loss"), self.num_train_step)
|
||||
self.writer.add_scalar("train/lr", self.optimizer.param_groups[0]["lr"], self.num_train_step)
|
||||
global_step = (self.num_train_step + 1) / self.accumulation_steps
|
||||
self.writer.add_scalar("train/loss", self.accumulative_meter.get("loss"), global_step)
|
||||
self.writer.add_scalar("train/lr", self.optimizer.param_groups[0]["lr"], global_step)
|
||||
self.writer.add_scalar(
|
||||
"train/chosen_rewards", self.accumulative_meter.get("chosen_rewards"), self.num_train_step
|
||||
"train/chosen_rewards", self.accumulative_meter.get("chosen_rewards"), global_step
|
||||
)
|
||||
self.writer.add_scalar(
|
||||
"train/rejected_rewards",
|
||||
self.accumulative_meter.get("rejected_rewards"),
|
||||
self.num_train_step,
|
||||
global_step,
|
||||
)
|
||||
self.writer.add_scalar(
|
||||
"train/margin",
|
||||
self.accumulative_meter.get("chosen_rewards") - self.accumulative_meter.get("rejected_rewards"),
|
||||
self.num_train_step,
|
||||
global_step,
|
||||
)
|
||||
self.accumulative_meter.reset()
|
||||
|
||||
@ -256,6 +256,7 @@ class KTOTrainer(SLTrainer):
|
||||
self.coordinator.print_on_master(
|
||||
f"Saved checkpoint at epoch {epoch} step {self.save_interval} at folder {self.save_dir}"
|
||||
)
|
||||
self.num_train_step += 1
|
||||
|
||||
step_bar.close()
|
||||
|
||||
|
@ -184,35 +184,35 @@ class ORPOTrainer(SLTrainer):
|
||||
self.accumulative_meter.add("log_odds_ratio", log_odds_ratio.to(torch.float16).mean().item())
|
||||
self.accumulative_meter.add("accuracy", reward_accuracies_mean.to(torch.float16).item())
|
||||
|
||||
if i % self.accumulation_steps == self.accumulation_steps - 1:
|
||||
self.num_train_step += 1
|
||||
if self.num_train_step % self.accumulation_steps == self.accumulation_steps - 1:
|
||||
step_bar.update()
|
||||
global_step = (self.num_train_step + 1) / self.accumulation_steps
|
||||
# logging
|
||||
if self.writer and is_rank_0():
|
||||
self.writer.add_scalar("train/loss", self.accumulative_meter.get("loss"), self.num_train_step)
|
||||
self.writer.add_scalar("train/lr", self.optimizer.param_groups[0]["lr"], self.num_train_step)
|
||||
self.writer.add_scalar("train/loss", self.accumulative_meter.get("loss"), global_step)
|
||||
self.writer.add_scalar("train/lr", self.optimizer.param_groups[0]["lr"], global_step)
|
||||
self.writer.add_scalar(
|
||||
"train/chosen_rewards", self.accumulative_meter.get("chosen_rewards"), self.num_train_step
|
||||
"train/chosen_rewards", self.accumulative_meter.get("chosen_rewards"), global_step
|
||||
)
|
||||
self.writer.add_scalar(
|
||||
"train/rejected_rewards",
|
||||
self.accumulative_meter.get("rejected_rewards"),
|
||||
self.num_train_step,
|
||||
global_step,
|
||||
)
|
||||
self.writer.add_scalar(
|
||||
"train/margin",
|
||||
self.accumulative_meter.get("chosen_rewards") - self.accumulative_meter.get("rejected_rewards"),
|
||||
self.num_train_step,
|
||||
global_step,
|
||||
)
|
||||
self.writer.add_scalar(
|
||||
"train/accuracy",
|
||||
self.accumulative_meter.get("accuracy"),
|
||||
self.num_train_step,
|
||||
global_step,
|
||||
)
|
||||
self.writer.add_scalar(
|
||||
"train/log_odds_ratio",
|
||||
self.accumulative_meter.get("log_odds_ratio"),
|
||||
self.num_train_step,
|
||||
global_step,
|
||||
)
|
||||
self.accumulative_meter.reset()
|
||||
|
||||
@ -233,6 +233,7 @@ class ORPOTrainer(SLTrainer):
|
||||
self.coordinator.print_on_master(
|
||||
f"Saved checkpoint at epoch {epoch} step {self.save_interval} at folder {self.save_dir}"
|
||||
)
|
||||
self.num_train_step += 1
|
||||
|
||||
step_bar.close()
|
||||
|
||||
|
@ -3,13 +3,13 @@ PPO trainer
|
||||
"""
|
||||
|
||||
import os
|
||||
from typing import Dict, List, Optional
|
||||
from typing import Dict, List, Optional, Union
|
||||
|
||||
import torch
|
||||
import wandb
|
||||
from coati.experience_buffer import NaiveExperienceBuffer
|
||||
from coati.experience_maker import Experience, NaiveExperienceMaker
|
||||
from coati.models import Critic, RewardModel
|
||||
from coati.models import Critic, RewardModel, RLVRRewardModel
|
||||
from coati.models.loss import GPTLMLoss, PolicyLoss, ValueLoss
|
||||
from coati.models.utils import calc_action_log_probs
|
||||
from coati.trainer.callbacks import Callback
|
||||
@ -84,7 +84,7 @@ class PPOTrainer(OLTrainer):
|
||||
critic_booster: Booster,
|
||||
actor: PreTrainedModel,
|
||||
critic: Critic,
|
||||
reward_model: RewardModel,
|
||||
reward_model: Union[RewardModel, RLVRRewardModel],
|
||||
initial_model: PreTrainedModel,
|
||||
actor_optim: Optimizer,
|
||||
critic_optim: Optimizer,
|
||||
@ -210,6 +210,7 @@ class PPOTrainer(OLTrainer):
|
||||
return self.experience_maker.make_experience(
|
||||
input_ids=prompts["input_ids"].to(get_current_device()),
|
||||
attention_mask=prompts["attention_mask"].to(get_current_device()),
|
||||
gt_answer=prompts["gt_answer"],
|
||||
**self.generate_kwargs,
|
||||
)
|
||||
|
||||
@ -219,7 +220,6 @@ class PPOTrainer(OLTrainer):
|
||||
experience:
|
||||
sequences: [batch_size, prompt_length + response_length] --- <PAD>...<PAD><PROMPT>...<PROMPT><RESPONSE>...<RESPONSE><PAD>...<PAD>
|
||||
"""
|
||||
self.num_train_step += 1
|
||||
self.actor.train()
|
||||
self.critic.train()
|
||||
num_actions = experience.action_log_probs.size(1)
|
||||
@ -293,7 +293,7 @@ class PPOTrainer(OLTrainer):
|
||||
self.critic_scheduler.step()
|
||||
|
||||
# preparing logging model output and corresponding rewards.
|
||||
if self.num_train_step % 10 == 1:
|
||||
if self.num_train_step % 10 == 0:
|
||||
response_text = self.experience_maker.tokenizer.batch_decode(
|
||||
experience.sequences, skip_special_tokens=True
|
||||
)
|
||||
@ -335,6 +335,7 @@ class PPOTrainer(OLTrainer):
|
||||
self.writer.add_scalar("value", self.accumulative_meter.get("value"), self.num_train_step)
|
||||
self.writer.add_scalar("advantages", self.accumulative_meter.get("advantages"), self.num_train_step)
|
||||
self.accumulative_meter.reset()
|
||||
self.num_train_step += 1
|
||||
|
||||
def _learn(self, update_step: int):
|
||||
"""
|
||||
|
@ -150,29 +150,29 @@ class RewardModelTrainer(SLTrainer):
|
||||
self.accumulative_meter.add("loss", loss_mean.to(torch.float16).item())
|
||||
self.accumulative_meter.add("accuracy", accuracy_mean.mean().to(torch.float16).item())
|
||||
|
||||
if (i + 1) % self.accumulation_steps == 0:
|
||||
if (self.num_train_step + 1) % self.accumulation_steps == 0:
|
||||
self.optimizer.step()
|
||||
self.optimizer.zero_grad()
|
||||
self.actor_scheduler.step()
|
||||
step_bar.update()
|
||||
self.num_train_step += 1
|
||||
|
||||
# Logging
|
||||
if self.writer and is_rank_0():
|
||||
self.writer.add_scalar("train/loss", self.accumulative_meter.get("loss"), self.num_train_step)
|
||||
self.writer.add_scalar("train/lr", self.optimizer.param_groups[0]["lr"], self.num_train_step)
|
||||
global_step = (self.num_train_step + 1) / self.accumulation_steps
|
||||
self.writer.add_scalar("train/loss", self.accumulative_meter.get("loss"), global_step)
|
||||
self.writer.add_scalar("train/lr", self.optimizer.param_groups[0]["lr"], global_step)
|
||||
self.writer.add_scalar(
|
||||
"train/dist",
|
||||
self.accumulative_meter.get("chosen_rewards") - self.accumulative_meter.get("rejected_rewards"),
|
||||
self.num_train_step,
|
||||
global_step,
|
||||
)
|
||||
self.writer.add_scalar(
|
||||
"train/reward_chosen", self.accumulative_meter.get("chosen_rewards"), self.num_train_step
|
||||
"train/reward_chosen", self.accumulative_meter.get("chosen_rewards"), global_step
|
||||
)
|
||||
self.writer.add_scalar(
|
||||
"train/reward_reject", self.accumulative_meter.get("rejected_rewards"), self.num_train_step
|
||||
"train/reward_reject", self.accumulative_meter.get("rejected_rewards"), global_step
|
||||
)
|
||||
self.writer.add_scalar("train/acc", self.accumulative_meter.get("accuracy"), self.num_train_step)
|
||||
self.writer.add_scalar("train/acc", self.accumulative_meter.get("accuracy"), global_step)
|
||||
|
||||
self.accumulative_meter.reset()
|
||||
|
||||
@ -193,6 +193,7 @@ class RewardModelTrainer(SLTrainer):
|
||||
self.coordinator.print_on_master(
|
||||
f"Saved checkpoint at epoch {epoch} step {(i + 1)/self.accumulation_steps} at folder {self.save_dir}"
|
||||
)
|
||||
self.num_train_step += 1
|
||||
step_bar.close()
|
||||
|
||||
def _eval(self, epoch):
|
||||
|
@ -143,18 +143,18 @@ class SFTTrainer(SLTrainer):
|
||||
self.accumulative_meter.add("loss", loss_mean.to(torch.float16).item())
|
||||
|
||||
# Gradient accumulation
|
||||
if (i + 1) % self.accumulation_steps == 0:
|
||||
if (self.num_train_step + 1) % self.accumulation_steps == 0:
|
||||
self.optimizer.step()
|
||||
self.optimizer.zero_grad()
|
||||
self.scheduler.step()
|
||||
|
||||
global_step = (self.num_train_step + 1) / self.accumulation_steps
|
||||
step_bar.set_postfix({"train/loss": self.accumulative_meter.get("loss")})
|
||||
if self.writer:
|
||||
self.writer.add_scalar("train/loss", self.accumulative_meter.get("loss"), self.num_train_step)
|
||||
self.writer.add_scalar("train/lr", self.scheduler.get_last_lr()[0], self.num_train_step)
|
||||
self.num_train_step += 1
|
||||
self.writer.add_scalar("train/loss", self.accumulative_meter.get("loss"), global_step)
|
||||
self.writer.add_scalar("train/lr", self.scheduler.get_last_lr()[0], global_step)
|
||||
self.accumulative_meter.reset()
|
||||
step_bar.update()
|
||||
self.num_train_step += 1
|
||||
|
||||
# Save checkpoint
|
||||
if (
|
||||
|
@ -12,6 +12,27 @@ from torch.utils.data import DataLoader
|
||||
from colossalai.booster import Plugin
|
||||
|
||||
|
||||
class AnnealingScheduler:
|
||||
def __init__(self, start, end, warmup_steps=100, annealing_step=2000):
|
||||
self.start = start
|
||||
self.end = end
|
||||
self.warmup_steps = warmup_steps
|
||||
self.step = 0
|
||||
self.annealing_step = annealing_step
|
||||
|
||||
def get_temperature(self):
|
||||
if self.step <= self.warmup_steps:
|
||||
return self.start # hold the start temperature during warm-up
|
||||
elif self.step >= self.annealing_step:
|
||||
return self.end
|
||||
# Linear annealing
|
||||
temp = self.start - (self.step / self.annealing_step) * (self.start - self.end)
|
||||
return temp
|
||||
|
||||
def step_forward(self):
|
||||
self.step += 1
|
||||
|
||||
|
||||
class CycledDataLoader:
|
||||
"""
|
||||
A data loader that cycles through the data when it reaches the end.
|
||||
|
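A usage sketch for the `AnnealingScheduler` introduced at the top of this hunk (step counts are illustrative): the temperature is held at `start` during warm-up and then decays linearly toward `end`.

```python
from coati.trainer.utils import AnnealingScheduler

scheduler = AnnealingScheduler(start=1.0, end=0.3, warmup_steps=10, annealing_step=100)
temps = []
for _ in range(120):
    temps.append(scheduler.get_temperature())
    scheduler.step_forward()
print(temps[0], temps[50], temps[119])  # 1.0 during warm-up, 0.65 halfway, 0.3 after annealing_step
```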
@ -0,0 +1,4 @@
|
||||
from .competition import math_competition_reward_fn
|
||||
from .gsm8k import gsm8k_reward_fn
|
||||
|
||||
__all__ = ["gsm8k_reward_fn", "math_competition_reward_fn"]
|
@ -0,0 +1,26 @@
|
||||
import torch
|
||||
|
||||
from .utils import extract_solution, validate_response_structure
|
||||
|
||||
|
||||
def math_competition_reward_fn(input_ids, attention_mask, **kwargs):
|
||||
# apply verifiable reward
|
||||
# reward 10 points if the final answer is correct, reward 1 point if format is correct
|
||||
|
||||
gt_answer = kwargs["gt_answer"]
|
||||
tokenizer = kwargs["tokenizer"]
|
||||
s, e = kwargs["response_start"], kwargs["response_end"]
|
||||
reward = torch.tensor(0.0).to(input_ids.device)
|
||||
if gt_answer is None:
|
||||
return reward
|
||||
decoded_final_answer = tokenizer.decode(input_ids[s : e + 1], skip_special_tokens=True)
|
||||
final_answer, processed_str = extract_solution(decoded_final_answer)
|
||||
|
||||
format_valid = validate_response_structure(processed_str, kwargs["tags"])
|
||||
if not format_valid:
|
||||
return reward
|
||||
else:
|
||||
reward += 1.0
|
||||
if gt_answer.strip().replace(" ", "").lower() == final_answer.strip().replace(" ", "").lower():
|
||||
reward = reward + 9.0
|
||||
return reward
|
31
applications/ColossalChat/coati/utils/reward_score/gsm8k.py
Normal file
@ -0,0 +1,31 @@
|
||||
import torch
|
||||
|
||||
from .utils import extract_solution, validate_response_structure
|
||||
|
||||
|
||||
def gsm8k_reward_fn(input_ids, attention_mask, **kwargs):
|
||||
# apply verifiable reward
|
||||
# reward 10 points if the final answer is correct, reward 1 point if format is correct
|
||||
|
||||
gt_answer = kwargs["gt_answer"]
|
||||
tokenizer = kwargs["tokenizer"]
|
||||
s, e = kwargs["response_start"], kwargs["response_end"]
|
||||
reward = torch.tensor(0.0).to(input_ids.device)
|
||||
if gt_answer is None:
|
||||
return reward
|
||||
decoded_final_answer = tokenizer.decode(input_ids[s:e], skip_special_tokens=True)
|
||||
final_answer, processed_str = extract_solution(decoded_final_answer)
|
||||
is_valid = True
|
||||
try:
|
||||
int(final_answer.strip())
|
||||
except Exception:
|
||||
is_valid = False
|
||||
|
||||
format_valid = validate_response_structure(processed_str, kwargs["tags"])
|
||||
if not is_valid or not format_valid:
|
||||
return reward
|
||||
else:
|
||||
reward += 1.0
|
||||
if gt_answer.strip().replace(" ", "").lower() == final_answer.strip().replace(" ", "").lower():
|
||||
reward = reward + 9.0
|
||||
return reward
|
76
applications/ColossalChat/coati/utils/reward_score/utils.py
Normal file
@ -0,0 +1,76 @@
|
||||
# Copyright Unakar
|
||||
# Modified from https://github.com/Unakar/Logic-RL/blob/086373176ac198c97277ff50f4b6e7e1bfe669d3/verl/utils/reward_score/kk.py#L99
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import re
|
||||
from typing import Dict, Optional, Tuple
|
||||
|
||||
|
||||
def validate_response_structure(processed_str: str, tags: Dict = None) -> bool:
|
||||
"""Performs comprehensive validation of response structure.
|
||||
|
||||
Args:
|
||||
processed_str: Processed response string from the model
|
||||
|
||||
Returns:
|
||||
Boolean indicating whether all formatting requirements are met
|
||||
"""
|
||||
validation_passed = True
|
||||
# Check required tags
|
||||
if tags is None:
|
||||
tags = {
|
||||
"think_start": {"text": "<think>", "num_occur": 1},
|
||||
"think_end": {"text": "</think>", "num_occur": 1},
|
||||
"answer_start": {"text": "<answer>", "num_occur": 1},
|
||||
"answer_end": {"text": "</answer>", "num_occur": 1},
|
||||
}
|
||||
positions = {}
|
||||
for tag_name, tag_info in tags.items():
|
||||
tag_str = tag_info["text"]
|
||||
expected_count = tag_info["num_occur"]
|
||||
count = processed_str.count(tag_str)
|
||||
positions[tag_name] = pos = processed_str.find(tag_str)
|
||||
if count != expected_count:
|
||||
validation_passed = False
|
||||
# Verify tag order
|
||||
if (
|
||||
positions["think_start"] > positions["think_end"]
|
||||
or positions["think_end"] > positions["answer_start"]
|
||||
or positions["answer_start"] > positions["answer_end"]
|
||||
):
|
||||
validation_passed = False
|
||||
if len(processed_str) - positions["answer_end"] != len(tags["answer_end"]["text"]):
|
||||
validation_passed = False
|
||||
return validation_passed
|
||||
|
||||
|
||||
def extract_solution(solution_str: str) -> Tuple[Optional[str], str]:
|
||||
"""Extracts the final answer from the model's response string.
|
||||
|
||||
Args:
|
||||
solution_str: Raw response string from the language model
|
||||
|
||||
Returns:
|
||||
Tuple containing (extracted_answer, processed_string)
|
||||
"""
|
||||
|
||||
# Extract final answer using XML-style tags
|
||||
answer_pattern = r"<answer>(.*?)</answer>"
|
||||
matches = list(re.finditer(answer_pattern, solution_str, re.DOTALL))
|
||||
|
||||
if not matches:
|
||||
return None, solution_str
|
||||
|
||||
final_answer = matches[-1].group(1).strip()
|
||||
return final_answer, solution_str
|
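A small behavioural sketch for the two helpers above, using the default tag set (the import path follows the file location shown in this diff):

```python
from coati.utils.reward_score.utils import extract_solution, validate_response_structure

response = "<think>2 + 2 = 4</think><answer>4</answer>"
final_answer, processed = extract_solution(response)
print(final_answer)                            # "4"
print(validate_response_structure(processed))  # True: one pair of each tag, in order, nothing after </answer>
```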
@ -0,0 +1,8 @@
|
||||
{
|
||||
"chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
|
||||
"system_message": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
|
||||
"stop_ids": [
|
||||
122753
|
||||
],
|
||||
"end_of_assistant": "<|im_end|>"
|
||||
}
|
@ -0,0 +1,26 @@
|
||||
{
|
||||
"chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
|
||||
"system_message": "You are a helpful assistant. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>. Now the user asks you to solve a math problem that involves reasoning. After thinking, when you finally reach a conclusion, clearly output the final answer without explanation within the <answer> </answer> tags, your final answer should be a integer without unit, currency mark, thousands separator or other text. i.e., <answer> 123 </answer>.\n",
|
||||
"stop_ids": [
|
||||
151643
|
||||
],
|
||||
"end_of_assistant": "<|endoftext|>",
|
||||
"response_format_tags": {
|
||||
"think_start": {
|
||||
"text": "<think>",
|
||||
"num_occur": 1
|
||||
},
|
||||
"think_end": {
|
||||
"text": "</think>",
|
||||
"num_occur": 1
|
||||
},
|
||||
"answer_start": {
|
||||
"text": "<answer>",
|
||||
"num_occur": 1
|
||||
},
|
||||
"answer_end": {
|
||||
"text": "</answer>",
|
||||
"num_occur": 1
|
||||
}
|
||||
}
|
||||
}
|
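A hedged sketch of how a conversation-template JSON like the one above can be exercised with a Hugging Face tokenizer; the file path is a placeholder and the model name follows the Qwen2.5-3B setup described later in the README:

```python
import json

from transformers import AutoTokenizer

with open("conversation_template/qwen2.5-3b-r1.json") as f:  # illustrative path
    cfg = json.load(f)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
messages = [
    {"role": "system", "content": cfg["system_message"]},
    {"role": "user", "content": "What is 17 + 25?"},
]
prompt = tokenizer.apply_chat_template(
    messages, chat_template=cfg["chat_template"], tokenize=False, add_generation_prompt=True
)
print(prompt)  # ends with "<|im_start|>assistant\n", ready for generation
```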
@@ -2,8 +2,6 @@


## Table of Contents


- [Examples](#examples)
  - [Table of Contents](#table-of-contents)
  - [Install Requirements](#install-requirements)
@@ -27,13 +25,14 @@
    - [Reward](#reward)
    - [KL Divergence](#approximate-kl-divergence)
  - [Note on PPO Training](#note-on-ppo-training)
  - [GRPO Training and DeepSeek R1 reproduction](#grpo-training-and-deepseek-r1-reproduction)
  - [Alternative Option For RLHF: Direct Preference Optimization](#alternative-option-for-rlhf-direct-preference-optimization)
    - [DPO Stage 1: Supervised Instruction Tuning](#dpo-training-stage1---supervised-instructs-tuning)
    - [DPO Stage 2: DPO Training](#dpo-training-stage2---dpo-training)
  - [Alternative Option For RLHF: Simple Preference Optimization](#alternative-option-for-rlhf-simple-preference-optimization)
  - [Alternative Option For RLHF: Kahneman-Tversky Optimization (KTO)](#alternative-option-for-rlhf-kahneman-tversky-optimization-kto)
  - [Alternative Option For RLHF: Odds Ratio Preference Optimization](#alternative-option-for-rlhf-odds-ratio-preference-optimization)
  - [List of Supported Models](#list-of-supported-models)
  - [SFT for DeepSeek V3](#sft-for-deepseek-v3)
  - [Hardware Requirements](#hardware-requirements)
  - [Inference example](#inference-example)
  - [Attention](#attention)
@@ -725,10 +724,69 @@ Answer: The causes of this problem are two-fold. Check your reward model, make s
#### Q4: Generation is garbage
Answer: Yes, this happens and is well documented by other implementations. After training for too many episodes, the actor gradually deviates from its original state, which may lead to a decrease in language modeling capabilities. A way to fix this is to add a supervised loss during PPO: set ptx_coef to a non-zero value (between 0 and 1), which balances the PPO loss and the SFT loss.

## GRPO Training and DeepSeek R1 reproduction
We support GRPO (Group Relative Policy Optimization), the reinforcement learning algorithm used in the DeepSeek R1 paper. In this section, we walk through GRPO training with an example that aims to reproduce DeepSeek R1's results on mathematical problem solving.

**Note: Currently, our PPO and GRPO pipelines are still under extensive development (integration with Ray and the inference engine). The speed is primarily limited by the rollout process, as we are using a naive generation approach without any acceleration. This experiment is focused solely on verifying the correctness of the GRPO algorithm. We will open-source the new version of the code as soon as possible, so please stay tuned.**

### GRPO Model Selection
We finally select the base version of [Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B). We also experimented with the instruct version [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), but the latter fails to explore more diverse outputs. We recommend using base models (without SFT) and running a few SFT steps (see the [SFT section](#rlhf-training-stage1---supervised-instructs-tuning)) to correct the base model's output format before GRPO.

### Reinforcement Learning with Verifiable Reward
Both PPO and GRPO support reinforcement learning with verifiable reward (RLVR). In this experiment on mathematical problem solving, we define the reward function as follows; the format is considered correct if the response contains exactly one pair of <think></think> and <answer></answer> tags and the tags appear in the correct order.

- reward=0, if the format is incorrect.
- reward=1, if the format is correct but the answer does not match the ground truth answer exactly.
- reward=10, if the format is correct and the answer matches the ground truth answer exactly.

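The rules above can be expressed as a short, self-contained Python sketch. This is only an illustration of the three scoring rules, assuming plain string inputs; it is not the exact implementation shipped in `coati.utils.reward_score`:

```python
import re
from typing import Optional


def format_is_correct(response: str) -> bool:
    """Exactly one pair of <think></think> and <answer></answer> tags, in the right order."""
    for tag in ("<think>", "</think>", "<answer>", "</answer>"):
        if response.count(tag) != 1:
            return False
    return (
        response.find("<think>")
        < response.find("</think>")
        < response.find("<answer>")
        < response.find("</answer>")
    )


def extract_answer(response: str) -> Optional[str]:
    # Take the content between the <answer> tags, if present.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else None


def math_reward(response: str, ground_truth: str) -> float:
    if not format_is_correct(response):
        return 0.0  # malformed response
    answer = extract_answer(response)
    if answer is not None and answer == ground_truth.strip():
        return 10.0  # correct format and exact answer match
    return 1.0  # correct format, wrong answer
```

Any function with this shape can serve as a verifiable reward, since correctness is checked programmatically rather than by a learned reward model.
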
### Step 1: Data Collection & Preparation
For GRPO training, you only need the prompt dataset. Please follow the instructions in the [prompt dataset preparation](#rlhf-training-stage3---proximal-policy-optimization) section to prepare the prompt data for GRPO training. In our reproduction experiment, we use the [qwedsacf/competition_math dataset](https://huggingface.co/datasets/qwedsacf/competition_math), which is available on Hugging Face.

### Step 2: Training
You can run [train_grpo.sh](./training_scripts/train_grpo.sh) to start GRPO training. The script shares most of its arguments with the PPO script (please refer to the [PPO training section](#step-3-training) for more details). Here are some arguments unique to GRPO.

```bash
--num_generations 8 \ # number of rollouts to collect for each prompt
--inference_batch_size 8 \ # batch size used during rollout
--logits_forward_batch_size 1 \ # batch size used to calculate logits for GRPO training
--initial_temperature \ # initial temperature for the annealing algorithm
--final_temperature \ # final temperature for the annealing algorithm
```

As GRPO requires collecting a group of responses for each prompt (usually more than 8), the effective batch size must satisfy the following constraints:

- Without tensor parallelism,
```
experience buffer size
= num_process * num_collect_steps * experience_batch_size * num_generations
= train_batch_size * accumulation_steps * num_process
```

- With tensor parallelism,
```
num_tp_group = num_process / tp
experience buffer size
= num_tp_group * num_collect_steps * experience_batch_size * num_generations
= train_batch_size * accumulation_steps * num_tp_group
```

During rollout, we perform rebatching both before the rollout and before calculating logits to prevent out-of-memory errors. Please choose proper settings for "inference_batch_size" and "logits_forward_batch_size" based on your device.

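As a quick sanity check before launching a run, the constraint can be verified numerically. The values below are purely hypothetical placeholders, not recommended settings; with `tp=1`, `num_tp_group` equals `num_process`, so the same check covers the case without tensor parallelism:

```python
# Hypothetical launch configuration -- replace with your own values.
num_process = 8            # total number of training processes (GPUs)
tp = 2                     # tensor parallel size (1 = no tensor parallelism)
num_collect_steps = 1
experience_batch_size = 8
num_generations = 8
train_batch_size = 32
accumulation_steps = 2

num_tp_group = num_process // tp
experience_buffer_size = num_tp_group * num_collect_steps * experience_batch_size * num_generations
consumed_per_episode = train_batch_size * accumulation_steps * num_tp_group

assert experience_buffer_size == consumed_per_episode, (
    f"experience buffer size ({experience_buffer_size}) must equal "
    f"train_batch_size * accumulation_steps * num_tp_group ({consumed_per_episode})"
)
print("Batch size configuration is consistent.")
```
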
### GRPO Result
#### Reward and Response Length
<div style="display: flex; justify-content: space-between;">
  <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/grpo/reward.png" style="width: 48%;" />
  <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/grpo/token_cost.png" style="width: 48%;" />
</div>

#### Response Length Distribution (After Training) and Sample Response
<div style="display: flex; justify-content: space-between;">
  <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/grpo/token_cost_eval.png" style="width: 48%;" />
  <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/grpo/sample.png" style="width: 48%;" />
</div>

## Alternative Option For RLHF: Direct Preference Optimization

For those seeking an alternative to Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO) presents a compelling option. As detailed in the paper available at [https://arxiv.org/abs/2305.18290](https://arxiv.org/abs/2305.18290), DPO offers a low-cost way to perform RLHF and usually requires fewer computational resources than PPO.

@@ -814,8 +872,95 @@ For training, use the [train_kto.sh](./examples/training_scripts/train_orpo.sh)
<img width="1000" alt="image" src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/KTO.png">
</p>

## Hardware Requirements

### SFT for DeepSeek V3
We add a script to supervised-finetune the DeepSeek V3/R1 model with LoRA. The script is located at `examples/training_scripts/lora_finetune.py`. It is similar to the SFT script for Coati7B, but with a few differences, and it is compatible with PEFT.

#### Dataset preparation

This script takes a JSONL file as the input dataset; each line should be a list of chat dialogues. E.g.
```json
[{"role": "user", "content": "Hello, how are you?"}, {"role": "assistant", "content": "I'm doing great. How can I help you today?"}]
```
```json
[{"role": "user", "content": "火烧赤壁 曹操为何不拨打119求救?"}, {"role": "assistant", "content": "因为在三国时期,还没有电话和现代的消防系统,所以曹操无法拨打119求救。"}]
```

The dialogues can span multiple turns and may contain a system prompt. For more details, see [chat_templating](https://huggingface.co/docs/transformers/main/chat_templating).

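For reference, a minimal sketch of producing such a JSONL file from Python; the file name and the conversation below are made-up examples:

```python
import json

# Each line of the JSONL file is one conversation: a list of {"role", "content"} messages.
conversations = [
    [
        {"role": "user", "content": "Hello, how are you?"},
        {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    ],
]

with open("sft_data.jsonl", "w", encoding="utf-8") as f:
    for dialogue in conversations:
        f.write(json.dumps(dialogue, ensure_ascii=False) + "\n")
```

Because each line is an independent JSON document, the file can be streamed line by line during preprocessing.
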
#### Model weights preparation

We use bf16 weights for finetuning. If you downloaded fp8 DeepSeek V3/R1 weights, you can use this [script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) to convert the weights to bf16 on GPU. For Ascend NPU, you can use this [script](https://gitee.com/ascend/ModelZoo-PyTorch/blob/master/MindIE/LLM/DeepSeek/DeepSeek-V2/NPU_inference/fp8_cast_bf16.py).

Below we also show how to load a LoRA adapter, merge it into the base model, and run inference.
```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
from peft import (
    PeftModel
)
import torch

# Set model path
model_name = "Qwen/Qwen2.5-3B"
lora_adapter = "Qwen2.5-3B_lora"  # Your lora model path
merged_model_path = "Qwen2.5-3B_merged"

######
# How to Load lora Model
######
# 1. Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# 2. Load lora model
peft_model = PeftModel.from_pretrained(
    base_model,
    lora_adapter,
    torch_dtype=torch.bfloat16
)

# 3. Merge lora model
merged_model = peft_model.merge_and_unload()

# 4. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    pad_token="<|endoftext|>"
)

# 5. Save merged lora model
merged_model.save_pretrained(
    merged_model_path,
    safe_serialization=True
)
tokenizer.save_pretrained(merged_model_path)

# 6. Run Inference
test_input = tokenizer("Instruction: Finding prime numbers up to 100\nAnswer:", return_tensors="pt").to("cuda")
output = merged_model.generate(**test_input, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
|
||||
|
||||
#### Usage

After preparing the dataset and model weights, you can run the script with the following command:
```bash
colossalai run --hostfile path-to-host-file --nproc_per_node 8 lora_finetune.py --pretrained path-to-DeepSeek-R1-bf16 --dataset path-to-dataset.jsonl --plugin moe --lr 2e-5 --max_length 256 -g --ep 8 --pp 3 --batch_size 24 --lora_rank 8 --lora_alpha 16 --num_epochs 2 --warmup_steps 8 --tensorboard_dir logs --save_dir DeepSeek-R1-bf16-lora
```

For more details on each argument, you can run `python lora_finetune.py --help`.

The sample command does not use CPU offload, in order to get better throughput. The minimum hardware requirement for the sample command is 32 Ascend 910B NPUs (with `ep=8,pp=4`) or 24 H100/H800 GPUs (with `ep=8,pp=3`). If you enable CPU offload via `--zero_cpu_offload`, the hardware requirement can be further reduced.

## Hardware Requirements
For SFT, we recommend using the zero2 or zero2-cpu plugin for 7B models, and tensor parallelism if your model is extra large. We tested the VRAM consumption on a dummy dataset with a sequence length of 2048. In all experiments, we use H800 GPUs with 80GB VRAM and enable gradient checkpointing and flash attention.
- 2 H800 GPUs
  - zero2-cpu, micro batch size=4, VRAM Usage=22457.98 MB
@@ -872,35 +1017,9 @@ For KTO, we recommend using zero2-cpu or zero2 plugin, We tested the VRAM consum
  - zero2_cpu, micro batch size=2, VRAM_USAGE=32443.22 MB
  - zero2, micro batch size=4, VRAM_USAGE=59307.97 MB

## List of Supported Models

For SFT, we support the following models/series:
- Colossal-LLaMA-2
- ChatGLM2
- ChatGLM3 (only with zero2, zero2_cpu plugin)
- Baichuan2
- LLaMA2
- Qwen1.5-7B-Chat (with transformers==4.39.1)
- Yi-1.5

For PPO and DPO, we theoretically support the following models/series (without guarantee):
- Colossal-LLaMA-2 (tested)
- ChatGLM2
- Baichuan2
- LLaMA2 (tested)
- Qwen1.5-7B-Chat (with transformers==4.39.1)
- Yi-1.5

*-* The zero2 and zero2_cpu plugins also support a wide range of chat models not listed above.

## Inference example

We support different inference options, including int8 and int4 quantization.
For details, see [`inference/`](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/inference).

## Attention

The examples are demos of the whole training process; you need to tune the hyper-parameters to achieve good performance.
@@ -11,4 +11,4 @@ python prepare_dataset.py --type prompt \
    --data_cache_dir $SAVE_DIR/cache \
    --data_jsonl_output_dir $SAVE_DIR/jsonl \
    --data_arrow_output_dir $SAVE_DIR/arrow \
-    --max_length 1024
+    --max_length 300
@ -1,181 +0,0 @@
|
||||
import argparse
|
||||
import os
|
||||
import socket
|
||||
from functools import partial
|
||||
|
||||
import pandas as pd
|
||||
import ray
|
||||
from coati.quant import llama_load_quant, low_resource_init
|
||||
from coati.ray.detached_trainer_ppo import DetachedPPOTrainer
|
||||
from coati.ray.experience_maker_holder import ExperienceMakerHolder
|
||||
from coati.ray.utils import (
|
||||
get_actor_from_args,
|
||||
get_critic_from_args,
|
||||
get_reward_model_from_args,
|
||||
get_strategy_from_args,
|
||||
get_tokenizer_from_args,
|
||||
)
|
||||
from torch.utils.data import DataLoader
|
||||
from transformers import AutoConfig
|
||||
from transformers.modeling_utils import no_init_weights
|
||||
|
||||
|
||||
def get_free_port():
|
||||
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
|
||||
s.bind(("", 0))
|
||||
return s.getsockname()[1]
|
||||
|
||||
|
||||
def get_local_ip():
|
||||
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
|
||||
s.connect(("8.8.8.8", 80))
|
||||
return s.getsockname()[0]
|
||||
|
||||
|
||||
def main(args):
|
||||
master_addr = str(get_local_ip())
|
||||
# trainer_env_info
|
||||
trainer_port = str(get_free_port())
|
||||
env_info_trainers = [
|
||||
{
|
||||
"local_rank": "0",
|
||||
"rank": str(rank),
|
||||
"world_size": str(args.num_trainers),
|
||||
"master_port": trainer_port,
|
||||
"master_addr": master_addr,
|
||||
}
|
||||
for rank in range(args.num_trainers)
|
||||
]
|
||||
|
||||
# maker_env_info
|
||||
maker_port = str(get_free_port())
|
||||
env_info_maker = {
|
||||
"local_rank": "0",
|
||||
"rank": "0",
|
||||
"world_size": "1",
|
||||
"master_port": maker_port,
|
||||
"master_addr": master_addr,
|
||||
}
|
||||
|
||||
# configure tokenizer
|
||||
tokenizer = get_tokenizer_from_args(args.model)
|
||||
|
||||
def trainer_model_fn():
|
||||
actor = get_actor_from_args(args.model, args.pretrain).half().cuda()
|
||||
critic = get_critic_from_args(args.model, args.critic_pretrain).half().cuda()
|
||||
return actor, critic
|
||||
|
||||
# configure Trainer
|
||||
trainer_refs = [
|
||||
DetachedPPOTrainer.options(name=f"trainer{i}", num_gpus=1, max_concurrency=2).remote(
|
||||
experience_maker_holder_name_list=["maker1"],
|
||||
strategy_fn=partial(get_strategy_from_args, args.trainer_strategy),
|
||||
model_fn=trainer_model_fn,
|
||||
env_info=env_info_trainer,
|
||||
train_batch_size=args.train_batch_size,
|
||||
buffer_limit=16,
|
||||
eval_performance=True,
|
||||
debug=args.debug,
|
||||
update_lora_weights=not (args.lora_rank == 0),
|
||||
)
|
||||
for i, env_info_trainer in enumerate(env_info_trainers)
|
||||
]
|
||||
|
||||
def model_fn():
|
||||
actor = get_actor_from_args(args.model, args.pretrain).requires_grad_(False).half().cuda()
|
||||
critic = get_critic_from_args(args.model, args.critic_pretrain).requires_grad_(False).half().cuda()
|
||||
reward_model = get_reward_model_from_args(args.model, args.critic_pretrain).requires_grad_(False).half().cuda()
|
||||
if args.initial_model_quant_ckpt is not None and args.model == "llama":
|
||||
# quantize initial model
|
||||
actor_cfg = AutoConfig.from_pretrained(args.pretrain)
|
||||
with low_resource_init(), no_init_weights():
|
||||
initial_model = get_actor_from_args(args.model, config=actor_cfg)
|
||||
initial_model.model = (
|
||||
llama_load_quant(
|
||||
initial_model.model, args.initial_model_quant_ckpt, args.quant_bits, args.quant_group_size
|
||||
)
|
||||
.cuda()
|
||||
.requires_grad_(False)
|
||||
)
|
||||
else:
|
||||
initial_model = get_actor_from_args(args.model, args.pretrain).requires_grad_(False).half().cuda()
|
||||
return actor, critic, reward_model, initial_model
|
||||
|
||||
# configure Experience Maker
|
||||
experience_holder_ref = ExperienceMakerHolder.options(name="maker1", num_gpus=1, max_concurrency=2).remote(
|
||||
detached_trainer_name_list=[f"trainer{i}" for i in range(args.num_trainers)],
|
||||
strategy_fn=partial(get_strategy_from_args, args.maker_strategy),
|
||||
model_fn=model_fn,
|
||||
env_info=env_info_maker,
|
||||
experience_batch_size=args.experience_batch_size,
|
||||
kl_coef=0.1,
|
||||
debug=args.debug,
|
||||
update_lora_weights=not (args.lora_rank == 0),
|
||||
# sync_models_from_trainers=True,
|
||||
# generation kwargs:
|
||||
max_length=512,
|
||||
do_sample=True,
|
||||
temperature=1.0,
|
||||
top_k=50,
|
||||
pad_token_id=tokenizer.pad_token_id,
|
||||
eos_token_id=tokenizer.eos_token_id,
|
||||
eval_performance=True,
|
||||
use_cache=True,
|
||||
)
|
||||
|
||||
# uncomment this function if sync_models_from_trainers is True
|
||||
# ray.get([
|
||||
# trainer_ref.sync_models_to_remote_makers.remote()
|
||||
# for trainer_ref in trainer_refs
|
||||
# ])
|
||||
|
||||
wait_tasks = []
|
||||
|
||||
total_steps = args.experience_batch_size * args.experience_steps // (args.num_trainers * args.train_batch_size)
|
||||
for trainer_ref in trainer_refs:
|
||||
wait_tasks.append(trainer_ref.fit.remote(total_steps, args.update_steps, args.train_epochs))
|
||||
|
||||
dataset_size = args.experience_batch_size * 4
|
||||
|
||||
def build_dataloader():
|
||||
def tokenize_fn(texts):
|
||||
batch = tokenizer(texts, return_tensors="pt", max_length=96, padding="max_length", truncation=True)
|
||||
return {k: v.cuda() for k, v in batch.items()}
|
||||
|
||||
dataset = pd.read_csv(args.prompt_path)["prompt"]
|
||||
dataloader = DataLoader(dataset=dataset, batch_size=dataset_size, shuffle=True, collate_fn=tokenize_fn)
|
||||
return dataloader
|
||||
|
||||
wait_tasks.append(experience_holder_ref.workingloop.remote(build_dataloader, num_steps=args.experience_steps))
|
||||
|
||||
ray.get(wait_tasks)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--prompt_path", type=str, default=None)
|
||||
parser.add_argument("--num_trainers", type=int, default=1)
|
||||
parser.add_argument(
|
||||
"--trainer_strategy",
|
||||
choices=["ddp", "colossalai_gemini", "colossalai_zero2", "colossalai_gemini_cpu", "colossalai_zero2_cpu"],
|
||||
default="ddp",
|
||||
)
|
||||
parser.add_argument("--maker_strategy", choices=["naive"], default="naive")
|
||||
parser.add_argument("--model", default="gpt2", choices=["gpt2", "bloom", "opt", "llama"])
|
||||
parser.add_argument("--critic_model", default="gpt2", choices=["gpt2", "bloom", "opt", "llama"])
|
||||
parser.add_argument("--pretrain", type=str, default=None)
|
||||
parser.add_argument("--critic_pretrain", type=str, default=None)
|
||||
parser.add_argument("--experience_steps", type=int, default=4)
|
||||
parser.add_argument("--experience_batch_size", type=int, default=8)
|
||||
parser.add_argument("--train_epochs", type=int, default=1)
|
||||
parser.add_argument("--update_steps", type=int, default=2)
|
||||
parser.add_argument("--train_batch_size", type=int, default=8)
|
||||
parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank")
|
||||
|
||||
parser.add_argument("--initial_model_quant_ckpt", type=str, default=None)
|
||||
parser.add_argument("--quant_bits", type=int, default=4)
|
||||
parser.add_argument("--quant_group_size", type=int, default=128)
|
||||
parser.add_argument("--debug", action="store_true")
|
||||
args = parser.parse_args()
|
||||
ray.init(namespace=os.environ["RAY_NAMESPACE"], runtime_env={"env_vars": dict(os.environ)})
|
||||
main(args)
|
@ -1,201 +0,0 @@
|
||||
import argparse
|
||||
import os
|
||||
import socket
|
||||
from functools import partial
|
||||
|
||||
import pandas as pd
|
||||
import ray
|
||||
from coati.quant import llama_load_quant, low_resource_init
|
||||
from coati.ray.detached_trainer_ppo import DetachedPPOTrainer
|
||||
from coati.ray.experience_maker_holder import ExperienceMakerHolder
|
||||
from coati.ray.utils import (
|
||||
get_actor_from_args,
|
||||
get_critic_from_args,
|
||||
get_receivers_per_sender,
|
||||
get_reward_model_from_args,
|
||||
get_strategy_from_args,
|
||||
)
|
||||
from torch.utils.data import DataLoader
|
||||
from transformers import AutoConfig, AutoTokenizer
|
||||
from transformers.modeling_utils import no_init_weights
|
||||
|
||||
|
||||
def get_free_port():
|
||||
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
|
||||
s.bind(("", 0))
|
||||
return s.getsockname()[1]
|
||||
|
||||
|
||||
def get_local_ip():
|
||||
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
|
||||
s.connect(("8.8.8.8", 80))
|
||||
return s.getsockname()[0]
|
||||
|
||||
|
||||
def main(args):
|
||||
master_addr = str(get_local_ip())
|
||||
# trainer_env_info
|
||||
trainer_port = str(get_free_port())
|
||||
env_info_trainers = [
|
||||
{
|
||||
"local_rank": "0",
|
||||
"rank": str(rank),
|
||||
"world_size": str(args.num_trainers),
|
||||
"master_port": trainer_port,
|
||||
"master_addr": master_addr,
|
||||
}
|
||||
for rank in range(args.num_trainers)
|
||||
]
|
||||
|
||||
# maker_env_info
|
||||
maker_port = str(get_free_port())
|
||||
env_info_makers = [
|
||||
{
|
||||
"local_rank": "0",
|
||||
"rank": str(rank),
|
||||
"world_size": str(args.num_makers),
|
||||
"master_port": maker_port,
|
||||
"master_addr": master_addr,
|
||||
}
|
||||
for rank in range(args.num_makers)
|
||||
]
|
||||
|
||||
# configure tokenizer
|
||||
tokenizer = AutoTokenizer.from_pretrained(args.pretrain)
|
||||
tokenizer.pad_token = tokenizer.eos_token
|
||||
|
||||
def model_fn():
|
||||
actor = get_actor_from_args(args.model, args.pretrain).requires_grad_(False).half().cuda()
|
||||
critic = get_critic_from_args(args.model, args.critic_pretrain).requires_grad_(False).half().cuda()
|
||||
reward_model = get_reward_model_from_args(args.model, args.critic_pretrain).requires_grad_(False).half().cuda()
|
||||
if args.initial_model_quant_ckpt is not None and args.model == "llama":
|
||||
# quantize initial model
|
||||
actor_cfg = AutoConfig.from_pretrained(args.pretrain)
|
||||
with low_resource_init(), no_init_weights():
|
||||
initial_model = get_actor_from_args(args.model, config=actor_cfg)
|
||||
initial_model.model = (
|
||||
llama_load_quant(
|
||||
initial_model.model, args.initial_model_quant_ckpt, args.quant_bits, args.quant_group_size
|
||||
)
|
||||
.cuda()
|
||||
.requires_grad_(False)
|
||||
)
|
||||
else:
|
||||
initial_model = get_actor_from_args(args.model, args.pretrain).requires_grad_(False).half().cuda()
|
||||
return actor, critic, reward_model, initial_model
|
||||
|
||||
# configure Experience Maker
|
||||
experience_holder_refs = [
|
||||
ExperienceMakerHolder.options(name=f"maker{i}", num_gpus=1, max_concurrency=2).remote(
|
||||
detached_trainer_name_list=[
|
||||
f"trainer{x}"
|
||||
for x in get_receivers_per_sender(i, args.num_makers, args.num_trainers, allow_idle_sender=False)
|
||||
],
|
||||
strategy_fn=partial(get_strategy_from_args, args.maker_strategy),
|
||||
model_fn=model_fn,
|
||||
env_info=env_info_maker,
|
||||
kl_coef=0.1,
|
||||
debug=args.debug,
|
||||
update_lora_weights=not (args.lora_rank == 0),
|
||||
# sync_models_from_trainers=True,
|
||||
# generation kwargs:
|
||||
max_length=512,
|
||||
do_sample=True,
|
||||
temperature=1.0,
|
||||
top_k=50,
|
||||
pad_token_id=tokenizer.pad_token_id,
|
||||
eos_token_id=tokenizer.eos_token_id,
|
||||
eval_performance=True,
|
||||
use_cache=True,
|
||||
)
|
||||
for i, env_info_maker in enumerate(env_info_makers)
|
||||
]
|
||||
|
||||
def trainer_model_fn():
|
||||
actor = get_actor_from_args(args.model, args.pretrain, lora_rank=args.lora_rank).half().cuda()
|
||||
critic = get_critic_from_args(args.model, args.critic_pretrain, lora_rank=args.lora_rank).half().cuda()
|
||||
return actor, critic
|
||||
|
||||
# configure Trainer
|
||||
trainer_refs = [
|
||||
DetachedPPOTrainer.options(name=f"trainer{i}", num_gpus=1, max_concurrency=2).remote(
|
||||
experience_maker_holder_name_list=[
|
||||
f"maker{x}"
|
||||
for x in get_receivers_per_sender(i, args.num_trainers, args.num_makers, allow_idle_sender=True)
|
||||
],
|
||||
strategy_fn=partial(get_strategy_from_args, args.trainer_strategy),
|
||||
model_fn=trainer_model_fn,
|
||||
env_info=env_info_trainer,
|
||||
train_batch_size=args.train_batch_size,
|
||||
buffer_limit=16,
|
||||
eval_performance=True,
|
||||
debug=args.debug,
|
||||
update_lora_weights=not (args.lora_rank == 0),
|
||||
)
|
||||
for i, env_info_trainer in enumerate(env_info_trainers)
|
||||
]
|
||||
|
||||
dataset_size = args.experience_batch_size * 4
|
||||
|
||||
def build_dataloader():
|
||||
def tokenize_fn(texts):
|
||||
batch = tokenizer(texts, return_tensors="pt", max_length=96, padding="max_length", truncation=True)
|
||||
return {k: v.cuda() for k, v in batch.items()}
|
||||
|
||||
dataset = pd.read_csv(args.prompt_path)["prompt"]
|
||||
dataloader = DataLoader(dataset=dataset, batch_size=dataset_size, shuffle=True, collate_fn=tokenize_fn)
|
||||
return dataloader
|
||||
|
||||
# uncomment this function if sync_models_from_trainers is True
|
||||
# ray.get([
|
||||
# trainer_ref.sync_models_to_remote_makers.remote()
|
||||
# for trainer_ref in trainer_refs
|
||||
# ])
|
||||
|
||||
wait_tasks = []
|
||||
|
||||
for experience_holder_ref in experience_holder_refs:
|
||||
wait_tasks.append(experience_holder_ref.workingloop.remote(build_dataloader, num_steps=args.experience_steps))
|
||||
|
||||
total_steps = (
|
||||
args.experience_batch_size
|
||||
* args.experience_steps
|
||||
* args.num_makers
|
||||
// (args.num_trainers * args.train_batch_size)
|
||||
)
|
||||
for trainer_ref in trainer_refs:
|
||||
wait_tasks.append(trainer_ref.fit.remote(total_steps, args.update_steps, args.train_epochs))
|
||||
|
||||
ray.get(wait_tasks)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--prompt_path", type=str, default=None)
|
||||
parser.add_argument("--num_makers", type=int, default=1)
|
||||
parser.add_argument("--num_trainers", type=int, default=1)
|
||||
parser.add_argument(
|
||||
"--trainer_strategy",
|
||||
choices=["ddp", "colossalai_gemini", "colossalai_zero2", "colossalai_gemini_cpu", "colossalai_zero2_cpu"],
|
||||
default="ddp",
|
||||
)
|
||||
parser.add_argument("--maker_strategy", choices=["naive"], default="naive")
|
||||
parser.add_argument("--model", default="gpt2", choices=["gpt2", "bloom", "opt", "llama"])
|
||||
parser.add_argument("--critic_model", default="gpt2", choices=["gpt2", "bloom", "opt", "llama"])
|
||||
parser.add_argument("--pretrain", type=str, default=None)
|
||||
parser.add_argument("--critic_pretrain", type=str, default=None)
|
||||
parser.add_argument("--experience_steps", type=int, default=4)
|
||||
parser.add_argument("--experience_batch_size", type=int, default=8)
|
||||
parser.add_argument("--train_epochs", type=int, default=1)
|
||||
parser.add_argument("--update_steps", type=int, default=2)
|
||||
parser.add_argument("--train_batch_size", type=int, default=8)
|
||||
parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank")
|
||||
|
||||
parser.add_argument("--initial_model_quant_ckpt", type=str, default=None)
|
||||
parser.add_argument("--quant_bits", type=int, default=4)
|
||||
parser.add_argument("--quant_group_size", type=int, default=128)
|
||||
parser.add_argument("--debug", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
ray.init(namespace=os.environ["RAY_NAMESPACE"], runtime_env={"env_vars": dict(os.environ)})
|
||||
main(args)
|
@@ -1 +0,0 @@
ray
@@ -1,12 +0,0 @@
#!/bin/bash

set -xe
BASE=$(realpath $(dirname $0))

export RAY_NAMESPACE=admin
export DATA=/data/scratch/chatgpt/prompts.csv

# install requirements
pip install -r ${BASE}/requirements.txt

python ${BASE}/mmmt_prompt.py --prompt_path $DATA --num_makers 2 --num_trainers 2 --trainer_strategy colossalai_gemini --model opt --critic_model opt --pretrain facebook/opt-350m --critic_pretrain facebook/opt-125m --experience_batch_size 4 --train_batch_size 2
@@ -1,4 +1,4 @@
pandas>=1.4.1
sentencepiece
-colossalai==0.4.0
+colossalai==0.4.7
prompt_toolkit
@ -0,0 +1,455 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
Supervised fine-tuning of MoE models like Deepseek V3/R1 on a downstream task.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import resource
|
||||
from contextlib import nullcontext
|
||||
from types import MethodType
|
||||
|
||||
import torch
|
||||
import torch.distributed as dist
|
||||
from coati.dataset.loader import RawConversationDataset
|
||||
from peft import LoraConfig
|
||||
from tqdm import tqdm
|
||||
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
import colossalai
|
||||
from colossalai.accelerator import get_accelerator
|
||||
from colossalai.booster import Booster
|
||||
from colossalai.booster.plugin import (
|
||||
GeminiPlugin,
|
||||
HybridParallelPlugin,
|
||||
LowLevelZeroPlugin,
|
||||
MoeHybridParallelPlugin,
|
||||
Plugin,
|
||||
TorchDDPPlugin,
|
||||
)
|
||||
from colossalai.cluster import DistCoordinator
|
||||
from colossalai.lazy import LazyInitContext
|
||||
from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR
|
||||
from colossalai.nn.optimizer import HybridAdam
|
||||
from colossalai.utils import get_current_device
|
||||
|
||||
|
||||
def all_reduce_mean(loss: torch.Tensor, plugin: Plugin) -> torch.Tensor:
|
||||
loss = loss.data
|
||||
group = getattr(plugin, "dp_group", None)
|
||||
dist.all_reduce(loss, group=group)
|
||||
return loss / dist.get_world_size(group)
|
||||
|
||||
|
||||
def train(args) -> None:
|
||||
# ==============================
|
||||
# Initialize Distributed Training
|
||||
# ==============================
|
||||
colossalai.launch_from_torch()
|
||||
accelerator = get_accelerator()
|
||||
coordinator = DistCoordinator()
|
||||
|
||||
# ==============================
|
||||
# Initialize Booster
|
||||
# ==============================
|
||||
if args.plugin == "ddp":
|
||||
plugin = TorchDDPPlugin(find_unused_parameters=True if args.use_grad_checkpoint is False else False)
|
||||
elif args.plugin == "gemini":
|
||||
plugin = GeminiPlugin(
|
||||
precision=args.mixed_precision,
|
||||
initial_scale=2**16,
|
||||
max_norm=args.grad_clip,
|
||||
enable_gradient_accumulation=(args.accumulation_steps > 1),
|
||||
enable_fused_normalization=get_accelerator().is_available(),
|
||||
enable_flash_attention=args.use_flash_attn,
|
||||
)
|
||||
elif args.plugin == "gemini_auto":
|
||||
plugin = GeminiPlugin(
|
||||
precision=args.mixed_precision,
|
||||
placement_policy="auto",
|
||||
initial_scale=2**16,
|
||||
max_norm=args.grad_clip,
|
||||
enable_gradient_accumulation=(args.accumulation_steps > 1),
|
||||
enable_fused_normalization=get_accelerator().is_available(),
|
||||
enable_flash_attention=args.use_flash_attn,
|
||||
)
|
||||
elif args.plugin == "zero2":
|
||||
plugin = LowLevelZeroPlugin(
|
||||
stage=2,
|
||||
precision=args.mixed_precision,
|
||||
initial_scale=2**16,
|
||||
max_norm=args.grad_clip,
|
||||
)
|
||||
elif args.plugin == "zero2_cpu":
|
||||
plugin = LowLevelZeroPlugin(
|
||||
stage=2,
|
||||
precision=args.mixed_precision,
|
||||
initial_scale=2**16,
|
||||
cpu_offload=True,
|
||||
max_norm=args.grad_clip,
|
||||
)
|
||||
elif args.plugin == "3d":
|
||||
plugin = HybridParallelPlugin(
|
||||
tp_size=args.tp,
|
||||
pp_size=args.pp,
|
||||
sp_size=args.sp,
|
||||
sequence_parallelism_mode=args.sp_mode,
|
||||
zero_stage=args.zero_stage,
|
||||
enable_flash_attention=args.use_flash_attn,
|
||||
enable_fused_normalization=get_accelerator().is_available(),
|
||||
enable_sequence_parallelism=args.enable_sequence_parallelism,
|
||||
cpu_offload=True if args.zero_stage >= 1 and args.zero_cpu_offload else False,
|
||||
max_norm=args.grad_clip,
|
||||
precision=args.mixed_precision,
|
||||
microbatch_size=args.microbatch_size,
|
||||
)
|
||||
elif args.plugin == "moe":
|
||||
plugin = MoeHybridParallelPlugin(
|
||||
ep_size=args.ep,
|
||||
tp_size=args.tp,
|
||||
pp_size=args.pp,
|
||||
zero_stage=args.zero_stage,
|
||||
sp_size=args.sp,
|
||||
sequence_parallelism_mode=args.sp_mode,
|
||||
enable_sequence_parallelism=args.sp > 1,
|
||||
enable_fused_normalization=get_accelerator().is_available(),
|
||||
enable_flash_attention=args.use_flash_attn,
|
||||
max_norm=args.grad_clip,
|
||||
precision=args.mixed_precision,
|
||||
microbatch_size=args.microbatch_size,
|
||||
)
|
||||
else:
|
||||
raise ValueError(f"Unknown plugin {args.plugin}")
|
||||
|
||||
booster = Booster(plugin=plugin)
|
||||
|
||||
def is_master():
|
||||
if isinstance(plugin, HybridParallelPlugin) and plugin.pp_size > 1:
|
||||
return coordinator.rank == coordinator.world_size - 1
|
||||
return coordinator.is_master()
|
||||
|
||||
# ==============================
|
||||
# Initialize Tensorboard and Save Config
|
||||
# ==============================
|
||||
if is_master():
|
||||
if args.tensorboard_dir is not None:
|
||||
from torch.utils.tensorboard import SummaryWriter
|
||||
|
||||
os.makedirs(args.tensorboard_dir, exist_ok=True)
|
||||
writer = SummaryWriter(args.tensorboard_dir)
|
||||
|
||||
with open(args.config_file, "w") as f:
|
||||
json.dump(args.__dict__, f, indent=4)
|
||||
|
||||
# ======================================================
|
||||
# Initialize Tokenizer, Dataset, Collator and Dataloader
|
||||
# ======================================================
|
||||
tokenizer = AutoTokenizer.from_pretrained(args.pretrained, trust_remote_code=True)
|
||||
|
||||
coordinator.print_on_master(
|
||||
f"Training Info:\nConfig file: {args.config_file} \nTensorboard logs: {args.tensorboard_dir} \nModel checkpoint: {args.save_dir}"
|
||||
)
|
||||
|
||||
coordinator.print_on_master(f"Load dataset: {args.dataset}")
|
||||
dataset = RawConversationDataset(
|
||||
tokenizer,
|
||||
args.dataset,
|
||||
args.max_length,
|
||||
)
|
||||
|
||||
dataloader = plugin.prepare_dataloader(
|
||||
dataset=dataset,
|
||||
batch_size=args.batch_size,
|
||||
shuffle=True,
|
||||
drop_last=True,
|
||||
)
|
||||
|
||||
coordinator.print_on_master(
|
||||
f"Max device memory after data loader: {accelerator.max_memory_allocated() / 1024 ** 2:.2f} MB"
|
||||
)
|
||||
|
||||
# ======================================================
|
||||
# Initialize Model, Objective, Optimizer and LR Scheduler
|
||||
# ======================================================
|
||||
# When training the ChatGLM model, LoRA and gradient checkpointing are incompatible.
|
||||
init_ctx = (
|
||||
LazyInitContext(default_device=get_current_device())
|
||||
if isinstance(plugin, (GeminiPlugin, HybridParallelPlugin))
|
||||
else nullcontext()
|
||||
)
|
||||
attn_impl = "eager" if get_accelerator().name == "npu" else "flash_attention_2"
|
||||
|
||||
config = AutoConfig.from_pretrained(args.pretrained, trust_remote_code=True)
|
||||
|
||||
with init_ctx:
|
||||
# from_pretrained is not compatible with LoRA, we load pretrained weights later.
|
||||
# model = AutoModelForCausalLM.from_pretrained(
|
||||
# args.pretrained,
|
||||
# torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
|
||||
# trust_remote_code=True,
|
||||
# attn_implementation=attn_impl,
|
||||
# )
|
||||
model = AutoModelForCausalLM.from_config(
|
||||
config,
|
||||
trust_remote_code=True,
|
||||
attn_implementation=attn_impl,
|
||||
torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
|
||||
)
|
||||
|
||||
if args.lora_rank > 0:
|
||||
if model.__class__.__name__.startswith("DeepseekV3"):
|
||||
lora_config = LoraConfig(
|
||||
task_type="CAUSAL_LM",
|
||||
r=args.lora_rank,
|
||||
lora_alpha=args.lora_alpha,
|
||||
target_modules=["gate_proj", "up_proj", "down_proj"],
|
||||
)
|
||||
else:
|
||||
lora_config = LoraConfig(task_type="CAUSAL_LM", r=args.lora_rank, lora_alpha=args.lora_alpha)
|
||||
model = booster.enable_lora(model, lora_config=lora_config)
|
||||
|
||||
# this is essential, otherwise the grad checkpoint will not work.
|
||||
model.train()
|
||||
|
||||
if args.use_grad_checkpoint:
|
||||
model.gradient_checkpointing_enable()
|
||||
coordinator.print_on_master(msg="Gradient checkpointing enabled successfully")
|
||||
if model.config.__class__.__name__.startswith("DeepseekV3"):
|
||||
model.config.use_cache = False
|
||||
model.eval()
|
||||
# enable grad for moe layers
|
||||
for m in model.modules():
|
||||
if m.__class__.__name__ == "DeepseekV3MoE":
|
||||
m.moe_infer = MethodType(m.moe_infer.__wrapped__, m)
|
||||
|
||||
model_numel = sum(p.numel() for p in model.parameters())
|
||||
coordinator.print_on_master(f"Model params: {model_numel / 1e9:.2f} B")
|
||||
|
||||
optimizer = HybridAdam(
|
||||
model_params=model.parameters(),
|
||||
lr=args.lr,
|
||||
betas=(0.9, 0.95),
|
||||
weight_decay=args.weight_decay,
|
||||
adamw_mode=True,
|
||||
)
|
||||
|
||||
if args.warmup_steps is None:
|
||||
args.warmup_steps = int(args.num_epochs * 0.025 * (len(dataloader) // args.accumulation_steps))
|
||||
coordinator.print_on_master(f"Warmup steps is set to {args.warmup_steps}")
|
||||
|
||||
lr_scheduler = CosineAnnealingWarmupLR(
|
||||
optimizer=optimizer,
|
||||
total_steps=args.num_epochs * (len(dataloader) // args.accumulation_steps),
|
||||
warmup_steps=args.warmup_steps,
|
||||
eta_min=0.1 * args.lr,
|
||||
)
|
||||
|
||||
# Flash attention will be disabled because it does NOT support fp32.
|
||||
default_dtype = torch.float16 if args.mixed_precision == "fp16" else torch.bfloat16
|
||||
torch.set_default_dtype(default_dtype)
|
||||
model, optimizer, _, dataloader, lr_scheduler = booster.boost(
|
||||
model=model,
|
||||
optimizer=optimizer,
|
||||
lr_scheduler=lr_scheduler,
|
||||
dataloader=dataloader,
|
||||
)
|
||||
|
||||
torch.set_default_dtype(torch.float)
|
||||
booster.load_model(model, args.pretrained, low_cpu_mem_mode=False, num_threads=8)
|
||||
|
||||
coordinator.print_on_master(
|
||||
f"Booster init max device memory: {accelerator.max_memory_allocated() / 1024 ** 2:.2f} MB"
|
||||
)
|
||||
coordinator.print_on_master(
|
||||
f"Booster init max CPU memory: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024:.2f} MB"
|
||||
)
|
||||
|
||||
start_epoch = 0
|
||||
start_step = 0
|
||||
|
||||
num_steps_per_epoch = len(dataloader) // args.accumulation_steps
|
||||
|
||||
for epoch in range(start_epoch, args.num_epochs):
|
||||
dataloader.sampler.set_epoch(epoch=epoch)
|
||||
if isinstance(plugin, HybridParallelPlugin) and plugin.pp_size > 1:
|
||||
data_iter = iter(dataloader)
|
||||
step_bar = tqdm(
|
||||
range(len(dataloader)),
|
||||
desc="Step",
|
||||
disable=not is_master(),
|
||||
)
|
||||
for step in step_bar:
|
||||
outputs = booster.execute_pipeline(
|
||||
data_iter,
|
||||
model,
|
||||
criterion=lambda outputs, inputs: outputs[0],
|
||||
optimizer=optimizer,
|
||||
return_loss=True,
|
||||
)
|
||||
loss = outputs["loss"]
|
||||
if booster.plugin.stage_manager.is_last_stage():
|
||||
global_loss = all_reduce_mean(loss, plugin)
|
||||
|
||||
optimizer.step()
|
||||
|
||||
if booster.plugin.stage_manager.is_last_stage():
|
||||
grad_norm = optimizer.get_grad_norm()
|
||||
step_bar.set_postfix({"loss": global_loss.item(), "grad_norm": grad_norm})
|
||||
|
||||
if args.tensorboard_dir is not None and is_master():
|
||||
global_step = (epoch * num_steps_per_epoch) + (step + 1) // args.accumulation_steps
|
||||
writer.add_scalar(tag="Loss", scalar_value=global_loss.item(), global_step=global_step)
|
||||
writer.add_scalar(
|
||||
tag="Learning Rate",
|
||||
scalar_value=lr_scheduler.get_last_lr()[0],
|
||||
global_step=global_step,
|
||||
)
|
||||
writer.add_scalar(tag="Grad Norm", scalar_value=grad_norm, global_step=global_step)
|
||||
|
||||
lr_scheduler.step()
|
||||
optimizer.zero_grad()
|
||||
|
||||
else:
|
||||
pbar = tqdm(
|
||||
dataloader,
|
||||
desc=f"Epoch {epoch}",
|
||||
disable=not is_master(),
|
||||
initial=start_step // args.accumulation_steps,
|
||||
)
|
||||
total_loss = torch.tensor(0.0, device=get_current_device())
|
||||
for step, batch in enumerate(pbar, start=start_step // args.accumulation_steps):
|
||||
batch = {k: v.to(get_current_device()) for k, v in batch.items() if isinstance(v, torch.Tensor)}
|
||||
|
||||
batch_output = model(**batch)
|
||||
|
||||
loss = batch_output.loss / args.accumulation_steps
|
||||
total_loss.add_(loss.data)
|
||||
|
||||
booster.backward(loss=loss, optimizer=optimizer)
|
||||
|
||||
if (step + 1) % args.accumulation_steps == 0:
|
||||
all_reduce_mean(total_loss, plugin)
|
||||
|
||||
optimizer.step()
|
||||
|
||||
grad_norm = optimizer.get_grad_norm()
|
||||
pbar.set_postfix({"loss": total_loss.item(), "grad_norm": grad_norm})
|
||||
if args.tensorboard_dir is not None and is_master():
|
||||
global_step = (epoch * num_steps_per_epoch) + (step + 1) // args.accumulation_steps
|
||||
writer.add_scalar(tag="Loss", scalar_value=total_loss.item(), global_step=global_step)
|
||||
writer.add_scalar(
|
||||
tag="Learning Rate",
|
||||
scalar_value=lr_scheduler.get_last_lr()[0],
|
||||
global_step=global_step,
|
||||
)
|
||||
writer.add_scalar(tag="Grad Norm", scalar_value=grad_norm, global_step=global_step)
|
||||
|
||||
lr_scheduler.step()
|
||||
optimizer.zero_grad()
|
||||
|
||||
total_loss.fill_(0.0)
|
||||
|
||||
# Delete cache.
|
||||
# del batch, batch_labels, batch_output, loss
|
||||
accelerator.empty_cache()
|
||||
|
||||
# Final save.
|
||||
coordinator.print_on_master("Start saving final model checkpoint")
|
||||
if args.lora_rank > 0:
|
||||
booster.save_lora_as_pretrained(model, os.path.join(args.save_dir, "lora"))
|
||||
else:
|
||||
booster.save_model(model, os.path.join(args.save_dir, "modeling"), shard=True)
|
||||
coordinator.print_on_master(f"Saved final model checkpoint at epoch {epoch} at folder {args.save_dir}")
|
||||
|
||||
coordinator.print_on_master(f"Max device memory usage: {accelerator.max_memory_allocated()/1024**2:.2f} MB")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser()
|
||||
# Basic training information.
|
||||
parser.add_argument(
|
||||
"-m",
|
||||
"--pretrained",
|
||||
type=str,
|
||||
required=True,
|
||||
help="Address of the pre-trained model",
|
||||
)
|
||||
parser.add_argument("-d", "--dataset", type=str, required=True, help="Raw Jonl dataset for training.")
|
||||
parser.add_argument(
|
||||
"-p",
|
||||
"--plugin",
|
||||
type=str,
|
||||
default="zero2",
|
||||
choices=["gemini", "gemini_auto", "zero2", "zero2_cpu", "3d", "ddp", "moe"],
|
||||
help="Choose which plugin to use",
|
||||
)
|
||||
parser.add_argument("--save_dir", type=str, default="checkpoint_dir", help="Checkpoint directory")
|
||||
parser.add_argument("--tensorboard_dir", type=str, default=None, help="Tensorboard directory")
|
||||
parser.add_argument("--config_file", type=str, default="training_config.json", help="Config file")
|
||||
# Training parameters
|
||||
parser.add_argument("-n", "--num_epochs", type=int, default=1, help="Number of training epochs")
|
||||
parser.add_argument("--accumulation_steps", type=int, default=1, help="Number of accumulation steps")
|
||||
parser.add_argument("--batch_size", type=int, default=2, help="Global Batch size of each process")
|
||||
parser.add_argument("--lr", type=float, default=3e-4, help="Learning rate")
|
||||
parser.add_argument("--max_length", type=int, default=8192, help="Model max length")
|
||||
parser.add_argument(
|
||||
"--mixed_precision",
|
||||
type=str,
|
||||
default="bf16",
|
||||
choices=["fp16", "bf16"],
|
||||
help="Mixed precision",
|
||||
)
|
||||
parser.add_argument("--grad_clip", type=float, default=1.0, help="Gradient clipping value")
|
||||
parser.add_argument("--weight_decay", type=float, default=0.1, help="Weight decay")
|
||||
parser.add_argument("--warmup_steps", type=int, default=None, help="Warmup steps")
|
||||
parser.add_argument(
|
||||
"-g",
|
||||
"--use_grad_checkpoint",
|
||||
action="store_true",
|
||||
default=False,
|
||||
help="Use gradient checkpointing",
|
||||
)
|
||||
parser.add_argument(
|
||||
"-f",
|
||||
"--use_flash_attn",
|
||||
action="store_true",
|
||||
default=False,
|
||||
help="Use flash-attention",
|
||||
)
|
||||
|
||||
# Additional arguments for 3d plugin.
|
||||
parser.add_argument("--tp", type=int, default=1, help="TP size, used for 3d plugin.")
|
||||
parser.add_argument("--pp", type=int, default=1, help="PP size, used for 3d plugin.")
|
||||
parser.add_argument("--sp", type=int, default=1, help="SP size, used for 3d plugin.")
|
||||
parser.add_argument("--ep", type=int, default=1, help="EP size, used for moe plugin.")
|
||||
parser.add_argument("--zero_stage", type=int, default=1, help="Zero stage, used for 3d plugin.", choices=[0, 1, 2])
|
||||
parser.add_argument(
|
||||
"--sp_mode",
|
||||
type=str,
|
||||
default="split_gather",
|
||||
choices=["split_gather", "ring", "all_to_all"],
|
||||
help="SP mode, used for 3d plugin.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--enable_sequence_parallelism",
|
||||
default=False,
|
||||
action="store_true",
|
||||
help="Whether to enable SP, used for 3d plugin.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--zero_cpu_offload", default=False, action="store_true", help="Whether to use offloading, used for 3d plugin."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--microbatch_size", type=int, default=1, help="Batch size for each process in PP, used for 3d plugin."
|
||||
)
|
||||
parser.add_argument("--lora_rank", type=int, default=0, help="lora rank when using lora to train.")
|
||||
parser.add_argument("--lora_alpha", type=int, default=8, help="lora alpha when using lora to train.")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.plugin in ["3d", "moe"] and args.pp > 1 and args.accumulation_steps > 1:
|
||||
raise ValueError("Accumulation steps should be 1 when using PP. Please adjust batch size directly.")
|
||||
|
||||
train(args)
|
@@ -0,0 +1,10 @@
[{"role": "user", "content": "只剩一个心脏了还能活吗?"}, {"role": "assistant", "content": "能,人本来就只有一个心脏。"}]
[{"role": "user", "content": "爸爸再婚,我是不是就有了个新娘?"}, {"role": "assistant", "content": "不是的,你有了一个继母。\"新娘\"是指新婚的女方,而你爸爸再婚,他的新婚妻子对你来说是继母。"}]
[{"role": "user", "content": "马上要上游泳课了,昨天洗的泳裤还没干,怎么办"}, {"role": "assistant", "content": "游泳时泳裤本来就会湿,不用晾干。"}]
[{"role": "user", "content": "为什么没人说ABCD型的成语?🤔"}, {"role": "assistant", "content": "这是因为中文成语一般都是四字成语,每个字都有其特定的含义,四个字合在一起构成一个完整的意思。而ABCD型的成语最常见,所以大家不会刻意强调。"}]
[{"role": "user", "content": "太阳还有五十亿年就没了,那到时候向日葵看哪呢?"}, {"role": "assistant", "content": "实际上,向日葵并不会意识到太阳的存在与否,它们只是通过光敏反应来追踪太阳。如果太阳消失,向日葵也不会存活下来,所以它们不会看向任何地方。"}]
[{"role": "user", "content": "我把我车钥匙落在车里,现在锁不了车门了。 有人能告诉我怎么办吗?"}, {"role": "assistant", "content": "如果你的车钥匙已经锁在车里,那么你实际上已经无法再从外部打开车门了,所以你的车已经被锁住了。你需要做的是找专业的开锁服务或者联系汽车救援服务来帮你打开车门,取回你的车钥匙。"}]
[{"role": "user", "content": "同学说steam好玩,我现在八级烫伤了怎么办? 救救我"}, {"role": "assistant", "content": "\"Steam\"是一个电子游戏分发平台,不是steam蒸汽。建议立即去医院,顺便看看脑子。"}]
[{"role": "user", "content": "人在岩浆里憋气最长时间是多少? 刚才好奇搜了一下,水里憋气时间最长的是克罗地亚54岁在水中闭气24分33秒,但是我搜在岩浆里憋气时间最长,居然发现一个没有,不都是液体吗?为什么没人在岩浆里憋气?"}, {"role": "assistant", "content": "因为岩浆的温度极高,达到了800-1200摄氏度,人如果接触到岩浆,会立刻被烧伤,甚至瞬间死亡,所以没有人能在岩浆里憋气。这并不是一个合理的比较,因为水和岩浆虽然都是液体,但是它们的性质差别极大。"}]
[{"role": "user", "content": "出生证丢了怎么证明自己出生了"}, {"role": "assistant", "content": "这是个很有意思的问题,从现实意义来讲,你既然已经活着,就已经证明你出生了。如果需要法律意义上的证明,可以通过其他官方文件,例如护照、身份证、驾驶证等。此外,医院的出生记录、学校的入学记录、家庭照片等也可以作为间接证据。如果所有这些都没有,可以向出生地的公共卫生部门或者医院申请复制出生证明。"}]
[{"role": "user", "content": "为什么抄袭永远都是今人抄袭古人,而没有古人抄袭今人的案例?"}, {"role": "assistant", "content": "这是因为时间的流逝是单向的,古人无法预知今人的思想和作品,所以古人无法抄袭今人。而今人可以通过学习古人的作品,因此有可能出现抄袭古人的情况。"}]
applications/ColossalChat/examples/training_scripts/train_grpo.py (new executable file, 494 lines)
@ -0,0 +1,494 @@
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import resource
|
||||
from contextlib import nullcontext
|
||||
|
||||
import torch
|
||||
import torch.distributed as dist
|
||||
from coati.dataset import (
|
||||
DataCollatorForPromptDataset,
|
||||
DataCollatorForSupervisedDataset,
|
||||
StatefulDistributedSampler,
|
||||
load_tokenized_dataset,
|
||||
setup_conversation_template,
|
||||
)
|
||||
from coati.models import LoraConfig, RewardModel, RLVRRewardModel, convert_to_lora_module, disable_dropout, lora_manager
|
||||
from coati.trainer import GRPOTrainer
|
||||
from coati.utils import load_checkpoint
|
||||
from coati.utils.reward_score import *
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
import colossalai
|
||||
from colossalai.booster import Booster
|
||||
from colossalai.booster.plugin import GeminiPlugin, HybridParallelPlugin, LowLevelZeroPlugin
|
||||
from colossalai.cluster import DistCoordinator
|
||||
from colossalai.logging import get_dist_logger
|
||||
from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR
|
||||
from colossalai.nn.optimizer import HybridAdam
|
||||
from colossalai.shardformer.policies.auto_policy import get_autopolicy
|
||||
|
||||
logger = get_dist_logger()
|
||||
# default settings for response format tags, overwrite it in chat_template definition if needed
|
||||
response_format_tags = {
|
||||
"think_start": {"text": "<think>", "num_occur": 1},
|
||||
"think_end": {"text": "</think>", "num_occur": 1},
|
||||
"answer_start": {"text": "<answer>", "num_occur": 1},
|
||||
"answer_end": {"text": "</answer>", "num_occur": 1},
|
||||
}
|
||||
|
||||
|
||||
def train(args):
|
||||
global response_format_tags
|
||||
lora_config = None
|
||||
if args.lora_config is not None:
|
||||
lora_config = LoraConfig.from_file(args.lora_config)
|
||||
# check lora compatibility
|
||||
if "gemini" in args.plugin and lora_config is not None and lora_config.r > 0:
|
||||
raise ValueError("LoRA is not supported in GeminiPlugin. Please use other plugin")
|
||||
if args.plugin == "gemini_auto" and args.accumulation_steps > 1:
|
||||
raise ValueError("Gradient accumulation is not supported in GeminiPlugin. Please use other plugin")
|
||||
# ==============================
|
||||
# Initialize Distributed Training
|
||||
# ==============================
|
||||
colossalai.launch_from_torch()
|
||||
coordinator = DistCoordinator()
|
||||
|
||||
# ======================================================
|
||||
# Initialize Model, Objective, Optimizer and LR Scheduler
|
||||
# ======================================================
|
||||
# Temp Fix: Disable lazy init due to version conflict
|
||||
# init_ctx = (
|
||||
# LazyInitContext(default_device=get_current_device()) if isinstance(plugin, (GeminiPlugin,)) else nullcontext()
|
||||
# )
|
||||
|
||||
init_ctx = nullcontext()
|
||||
with init_ctx:
|
||||
if args.use_flash_attn:
|
||||
actor = AutoModelForCausalLM.from_pretrained(
|
||||
args.pretrain,
|
||||
torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
|
||||
use_flash_attention_2=True,
|
||||
local_files_only=True,
|
||||
trust_remote_code=True,
|
||||
)
|
||||
ref_model = AutoModelForCausalLM.from_pretrained(
|
||||
args.pretrain,
|
||||
torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
|
||||
use_flash_attention_2=True,
|
||||
local_files_only=True,
|
||||
trust_remote_code=True,
|
||||
)
|
||||
if args.rm_pretrain:
|
||||
reward_model = RewardModel(
|
||||
args.rm_pretrain,
|
||||
torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
|
||||
use_flash_attention_2=True,
|
||||
trust_remote_code=True,
|
||||
)
|
||||
coordinator.print_on_master(msg="Flash-attention enabled successfully")
|
||||
else:
|
||||
actor = AutoModelForCausalLM.from_pretrained(args.pretrain, local_files_only=True, trust_remote_code=True)
|
||||
if args.rm_pretrain:
|
||||
reward_model = RewardModel(args.rm_pretrain, trust_remote_code=True)
|
||||
ref_model = AutoModelForCausalLM.from_pretrained(
|
||||
args.pretrain, local_files_only=True, trust_remote_code=True
|
||||
)
|
||||
|
||||
if args.lora_config is not None:
|
||||
actor = convert_to_lora_module(actor, lora_config=lora_config)
|
||||
for name, module in actor.named_modules():
|
||||
if "norm" in name or "gate" in name:
|
||||
module = module.to(torch.float32)
|
||||
lora_manager.able_to_merge = False
|
||||
|
||||
# Disable dropout
|
||||
disable_dropout(actor)
|
||||
|
||||
if args.grad_checkpoint:
|
||||
actor.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
|
||||
coordinator.print_on_master(msg="Gradient checkpointing enabled successfully")
|
||||
|
||||
# configure tokenizer
|
||||
tokenizer_dir = args.tokenizer_dir if args.tokenizer_dir is not None else args.pretrain
|
||||
tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir, use_fast=False, trust_remote_code=True)
|
||||
if os.path.exists(args.conversation_template_config):
|
||||
with open(args.conversation_template_config, "r", encoding="utf8") as f:
|
||||
conversation_template_config = json.load(f)
|
||||
dist.barrier()
|
||||
if "response_format_tags" in conversation_template_config:
|
||||
logger.warning(f"Overwrite default response format tags with {args.conversation_template_config}")
|
||||
response_format_tags = conversation_template_config.get("response_format_tags", response_format_tags)
|
||||
conversation_template = setup_conversation_template(
|
||||
tokenizer, chat_template_config=conversation_template_config, save_path=args.conversation_template_config
|
||||
)
|
||||
stop_ids = conversation_template.stop_ids if len(conversation_template.stop_ids) > 0 else None
|
||||
else:
|
||||
raise ValueError("Conversation template config is not provided or incorrect")
|
||||
if hasattr(tokenizer, "pad_token") and hasattr(tokenizer, "eos_token") and tokenizer.eos_token is not None:
|
||||
try:
|
||||
# Some tokenizers doesn't allow to set pad_token mannually e.g., Qwen
|
||||
tokenizer.pad_token = tokenizer.eos_token
|
||||
except AttributeError as e:
|
||||
logger.warning(f"Unable to set pad token to eos token, {str(e)}")
|
||||
if not hasattr(tokenizer, "pad_token") or tokenizer.pad_token is None:
|
||||
logger.warning(
|
||||
"The tokenizer does not have a pad token which is required. May lead to unintended behavior in training, Please consider manually set them."
|
||||
)
|
||||
|
||||
tokenizer.add_bos_token = False
|
||||
tokenizer.add_eos_token = False
|
||||
tokenizer.padding_side = "left" # left padding for generation (online learning)
|
||||
|
||||
# configure generation config
|
||||
actor.generation_config.update(
|
||||
pad_token_id=tokenizer.eos_token_id, bos_token_id=tokenizer.bos_token_id, eos_token_id=tokenizer.eos_token_id
|
||||
)
|
||||
|
||||
# configure optimizer
|
||||
coordinator.print_on_master(f"setting up optimizer for actor: lr={args.lr}, weight_decay={args.weight_decay}")
|
||||
actor_optim = HybridAdam(
|
||||
model_params=actor.parameters(),
|
||||
lr=args.lr,
|
||||
betas=(0.9, 0.95),
|
||||
weight_decay=args.weight_decay,
|
||||
adamw_mode=True,
|
||||
)
|
||||
|
||||
if args.warmup_steps is None:
|
||||
args.warmup_steps = int(0.025 * args.num_episodes)
|
||||
coordinator.print_on_master(f"Warmup steps is set to {args.warmup_steps}")
|
||||
|
||||
actor_lr_scheduler = CosineAnnealingWarmupLR(
|
||||
optimizer=actor_optim,
|
||||
total_steps=args.num_episodes,
|
||||
warmup_steps=args.warmup_steps,
|
||||
eta_min=0.1 * args.lr,
|
||||
)
|
||||
|
||||
# ==============================
|
||||
# Initialize Booster
|
||||
# ==============================
|
||||
if args.plugin == "ddp":
|
||||
"""
|
||||
Default torch ddp plugin without any acceleration, for
|
||||
debugging purpose acceleration, for debugging purpose
|
||||
"""
|
||||
plugin = TorchDDPPlugin(find_unused_parameters=True)
|
||||
elif args.plugin == "gemini":
|
||||
plugin = GeminiPlugin(
|
||||
precision=args.mixed_precision,
|
||||
placement_policy="static",
|
||||
initial_scale=2**16,
|
||||
max_norm=args.grad_clip,
|
||||
enable_gradient_accumulation=True,
|
||||
enable_flash_attention=args.use_flash_attn,
|
||||
)
|
||||
elif args.plugin == "gemini_auto":
|
||||
plugin = GeminiPlugin(
|
||||
precision=args.mixed_precision,
|
||||
placement_policy="auto",
|
||||
initial_scale=2**16,
|
||||
max_norm=args.grad_clip,
|
||||
enable_flash_attention=args.use_flash_attn,
|
||||
)
|
||||
elif args.plugin == "zero2":
|
||||
plugin = LowLevelZeroPlugin(
|
||||
stage=2,
|
||||
precision=args.mixed_precision,
|
||||
initial_scale=2**16,
|
||||
max_norm=args.grad_clip,
|
||||
)
|
||||
elif args.plugin == "zero2_cpu":
|
||||
plugin = LowLevelZeroPlugin(
|
||||
stage=2,
|
||||
precision=args.mixed_precision,
|
||||
initial_scale=2**16,
|
||||
cpu_offload=True,
|
||||
max_norm=args.grad_clip,
|
||||
)
|
||||
elif args.plugin == "3d":
|
||||
if args.use_flash_attn and (args.tp > 1 or args.pp > 1 or args.sp > 1 or args.enable_sequence_parallelism):
|
||||
logger.warning("Flash attention cannot be used with 3D parallelism for PPO training. Disabling it.")
|
||||
args.use_flash_attn = False
|
||||
plugin = HybridParallelPlugin(
|
||||
tp_size=args.tp,
|
||||
pp_size=args.pp,
|
||||
sp_size=args.sp,
|
||||
sequence_parallelism_mode=args.sp_mode,
|
||||
zero_stage=args.zero_stage,
|
||||
enable_flash_attention=args.use_flash_attn,
|
||||
enable_sequence_parallelism=args.enable_sequence_parallelism,
|
||||
cpu_offload=True if args.zero_stage >= 1 and args.zero_cpu_offload else False,
|
||||
parallel_output=False,
|
||||
max_norm=args.grad_clip,
|
||||
precision=args.mixed_precision,
|
||||
)
|
||||
if args.rm_pretrain:
|
||||
custom_plugin = HybridParallelPlugin(
|
||||
tp_size=args.tp,
|
||||
pp_size=args.pp,
|
||||
sp_size=args.sp,
|
||||
sequence_parallelism_mode=args.sp_mode,
|
||||
zero_stage=args.zero_stage,
|
||||
enable_flash_attention=args.use_flash_attn,
|
||||
enable_sequence_parallelism=args.enable_sequence_parallelism,
|
||||
cpu_offload=True if args.zero_stage >= 1 and args.zero_cpu_offload else False,
|
||||
parallel_output=False,
|
||||
max_norm=args.grad_clip,
|
||||
precision=args.mixed_precision,
|
||||
custom_policy=get_autopolicy(reward_model.model),
|
||||
)
|
||||
else:
|
||||
raise ValueError(f"Unknown plugin {args.plugin}")
|
||||
|
||||
if args.plugin != "3d" and args.rm_pretrain:
|
||||
custom_plugin = plugin
|
||||
|
||||
# configure dataset
|
||||
coordinator.print_on_master(f"Load dataset: {args.prompt_dataset}")
|
||||
mode_map = {"train": "train", "valid": "validation", "test": "test"}
|
||||
train_prompt_dataset = load_tokenized_dataset(dataset_paths=args.prompt_dataset, mode="train", mode_map=mode_map)
|
||||
|
||||
data_collator = DataCollatorForPromptDataset(tokenizer=tokenizer, max_length=args.max_length - args.max_seq_len)
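# Prompts are truncated to max_length - max_seq_len so that the prompt plus the
# generated continuation still fits within max_length (assumed intent of this split).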
|
||||
|
||||
train_prompt_dataloader = plugin.prepare_dataloader(
|
||||
dataset=train_prompt_dataset,
|
||||
batch_size=args.experience_batch_size,
|
||||
shuffle=True,
|
||||
drop_last=True,
|
||||
collate_fn=data_collator,
|
||||
distributed_sampler_cls=StatefulDistributedSampler,
|
||||
)
|
||||
|
||||
if len(args.ptx_dataset) > 0:
|
||||
train_ptx_dataset = load_tokenized_dataset(dataset_paths=args.ptx_dataset, mode="train", mode_map=mode_map)
|
||||
data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer, max_length=args.max_length)
|
||||
train_pretrain_dataloader = plugin.prepare_dataloader(
|
||||
dataset=train_ptx_dataset,
|
||||
batch_size=args.ptx_batch_size,
|
||||
shuffle=True,
|
||||
drop_last=True,
|
||||
collate_fn=data_collator,
|
||||
distributed_sampler_cls=StatefulDistributedSampler,
|
||||
)
|
||||
else:
|
||||
train_pretrain_dataloader = None
|
||||
|
||||
actor_booster = Booster(plugin=plugin)
|
||||
ref_booster = Booster(plugin=plugin)
|
||||
if args.rm_pretrain:
|
||||
rm_booster = Booster(plugin=custom_plugin)
|
||||
|
||||
default_dtype = torch.float16 if args.mixed_precision == "fp16" else torch.bfloat16
|
||||
torch.set_default_dtype(default_dtype)
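# (assumed intent) boost the models while the default dtype is half precision so that
# shards are materialized in fp16/bf16; the default dtype is restored to float32 below.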
|
||||
actor, actor_optim, _, train_prompt_dataloader, actor_lr_scheduler = actor_booster.boost(
|
||||
model=actor,
|
||||
optimizer=actor_optim,
|
||||
lr_scheduler=actor_lr_scheduler,
|
||||
dataloader=train_prompt_dataloader,
|
||||
)
|
||||
if args.rm_pretrain:
|
||||
reward_model, _, _, _, _ = rm_booster.boost(model=reward_model, dataloader=train_prompt_dataloader)
|
||||
else:
|
||||
if args.reward_functions:
|
||||
reward_fn_list = []
|
||||
for reward_fn in args.reward_functions:
|
||||
"""
|
||||
To define custom reward function, you can define your functions under:
|
||||
colossalai/applications/ColossalChat/coati/utils/reward_score/__init__.py
|
||||
and use it here by mofiying the following line:
|
||||
"""
|
||||
if reward_fn == "gsm8k_reward_fn":
|
||||
reward_fn_list.append(gsm8k_reward_fn)
|
||||
elif reward_fn == "math_competition_reward_fn":
|
||||
reward_fn_list.append(math_competition_reward_fn)
|
||||
else:
|
||||
raise ValueError(f"Unknown reward function {reward_fn}")
|
||||
reward_model = RLVRRewardModel(
|
||||
reward_fn_list=reward_fn_list, tokenizer=tokenizer, tags=response_format_tags
|
||||
)
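# Illustration only: the exact reward-function interface lives in
# coati/utils/reward_score/__init__.py. A custom function registered above would
# hypothetically look like
#
#     def my_reward_fn(input_ids, attention_mask, **kwargs):
#         # decode the response, compare it with kwargs["gt_answer"], return a scalar reward
#         ...
#
# and be added via reward_fn_list.append(my_reward_fn) before building RLVRRewardModel.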
|
||||
|
||||
ref_model, _, _, _, _ = ref_booster.boost(model=ref_model, dataloader=train_prompt_dataloader)
|
||||
|
||||
torch.set_default_dtype(torch.float)
|
||||
|
||||
coordinator.print_on_master(f"Booster init max CUDA memory: {torch.cuda.max_memory_allocated() / 1024 ** 2:.2f} MB")
|
||||
coordinator.print_on_master(
|
||||
f"Booster init max CPU memory: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024:.2f} MB"
|
||||
)
|
||||
|
||||
sampler_start_idx = 0
|
||||
start_step = 0
|
||||
|
||||
if args.rm_checkpoint_path is not None:
|
||||
if "modeling" in args.rm_checkpoint_path:
|
||||
rm_booster.load_model(reward_model, args.rm_checkpoint_path)
|
||||
else:
|
||||
_, _, _ = load_checkpoint(
|
||||
load_dir=args.rm_checkpoint_path,
|
||||
booster=rm_booster,
|
||||
model=reward_model,
|
||||
optimizer=None,
|
||||
lr_scheduler=None,
|
||||
)
|
||||
coordinator.print_on_master(f"Loaded reward model checkpoint {args.rm_checkpoint_path}")
|
||||
if args.checkpoint_path is not None:
|
||||
if "modeling" in args.checkpoint_path:
|
||||
actor_booster.load_model(actor, args.checkpoint_path)
|
||||
ref_booster.load_model(ref_model, args.checkpoint_path)
|
||||
coordinator.print_on_master(f"Loaded actor and reference model {args.checkpoint_path}")
|
||||
else:
|
||||
_, start_step, sampler_start_idx = load_checkpoint(
|
||||
load_dir=args.checkpoint_path,
|
||||
booster=actor_booster,
|
||||
model=actor,
|
||||
optimizer=actor_optim,
|
||||
lr_scheduler=actor_lr_scheduler,
|
||||
)
|
||||
_, _, _ = load_checkpoint(load_dir=args.checkpoint_path, booster=ref_booster, model=ref_model)
|
||||
assert isinstance(train_prompt_dataloader.sampler, StatefulDistributedSampler)
|
||||
train_prompt_dataloader.sampler.set_start_index(start_index=sampler_start_idx)
|
||||
|
||||
coordinator.print_on_master(
|
||||
f"Loaded actor and reference model checkpoint {args.checkpoint_path} at spisode {start_step}"
|
||||
)
|
||||
coordinator.print_on_master(f"Loaded sample at index {sampler_start_idx}")
|
||||
|
||||
coordinator.print_on_master(
|
||||
f"Checkpoint loaded max CUDA memory: {torch.cuda.max_memory_allocated() / 1024 ** 2:.2f} MB"
|
||||
)
|
||||
coordinator.print_on_master(
|
||||
f"Checkpoint loaded CUDA memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB"
|
||||
)
|
||||
coordinator.print_on_master(
|
||||
f"Checkpoint loaded max CPU memory: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024:.2f} MB"
|
||||
)
|
||||
|
||||
# configure trainer
|
||||
trainer = GRPOTrainer(
|
||||
actor_booster,
|
||||
actor,
|
||||
reward_model,
|
||||
ref_model,
|
||||
actor_optim,
|
||||
actor_lr_scheduler,
|
||||
tokenizer=tokenizer,
|
||||
stop_token_ids=[stop_ids],
|
||||
kl_coef=args.kl_coef,
|
||||
ptx_coef=args.ptx_coef,
|
||||
train_batch_size=args.train_batch_size,
|
||||
buffer_limit=args.num_collect_steps * args.experience_batch_size * args.num_generations,
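# buffer sized for one full collection round:
# num_collect_steps x experience_batch_size prompts x num_generations samples per prompt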
|
||||
max_length=args.max_length,
|
||||
use_cache=True,
|
||||
do_sample=True,
|
||||
apply_loss_mask=not args.disable_loss_mask,
|
||||
accumulation_steps=args.accumulation_steps,
|
||||
save_dir=args.save_path,
|
||||
save_interval=args.save_interval,
|
||||
top_k=50,
|
||||
use_tp=args.tp > 1,
|
||||
num_generations=args.num_generations,
|
||||
inference_batch_size=args.inference_batch_size,
|
||||
logits_forward_batch_size=args.logits_forward_batch_size,
|
||||
offload_inference_models="gemini" not in args.plugin,
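# (assumption) Gemini manages parameter placement itself, so offloading the
# inference models is only enabled for non-Gemini plugins.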
|
||||
coordinator=coordinator,
|
||||
max_tokens_thinking=args.max_tokens_thinking if args.max_tokens_thinking else args.max_length - 100,
|
||||
temperature_annealing_config={
|
||||
"start_temperature": args.initial_temperature,
|
||||
"end_temperature": args.final_temperature,
|
||||
"annealing_warmup_steps": min(100, int(args.num_episodes / 6)),
|
||||
"annealing_steps": min(600, int(args.num_episodes / 2)),
|
||||
},
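# (assumed behavior) the sampling temperature is annealed from initial_temperature to
# final_temperature over `annealing_steps` after an `annealing_warmup_steps` hold;
# see the GRPOTrainer implementation for the exact schedule.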
|
||||
# Hack: some older models' default update_model_kwargs_fn/prepare_inputs_fn may not work due to version conflicts with transformers; you can overwrite them here
|
||||
# update_model_kwargs_fn=update_model_kwargs_fn,
|
||||
# prepare_inputs_fn = None
|
||||
)
|
||||
|
||||
trainer.fit(
|
||||
num_episodes=args.num_episodes,
|
||||
num_collect_steps=args.num_collect_steps,
|
||||
num_update_steps=args.num_update_steps,
|
||||
prompt_dataloader=train_prompt_dataloader,
|
||||
pretrain_dataloader=train_pretrain_dataloader,
|
||||
log_dir=args.log_dir,
|
||||
use_wandb=args.use_wandb,
|
||||
)
|
||||
|
||||
if lora_config is not None and lora_config.r > 0:
|
||||
# NOTE: set model to eval to merge LoRA weights
|
||||
lora_manager.able_to_merge = True
|
||||
actor.eval()
|
||||
# save model checkpoint after fitting on only rank0
|
||||
coordinator.print_on_master("Start saving final actor model checkpoint")
|
||||
actor_booster.save_model(actor, os.path.join(trainer.actor_save_dir, "modeling"), shard=True)
|
||||
coordinator.print_on_master(
|
||||
f"Saved final actor model checkpoint at episodes {args.num_episodes} at folder {args.save_path}"
|
||||
)
|
||||
coordinator.print_on_master(f"Max CUDA memory usage: {torch.cuda.max_memory_allocated()/1024**2:.2f} MB")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--prompt_dataset", nargs="+", default=[])
|
||||
parser.add_argument("--ptx_dataset", nargs="+", default=[])
|
||||
parser.add_argument(
|
||||
"--plugin",
|
||||
type=str,
|
||||
default="gemini",
|
||||
choices=["gemini", "gemini_auto", "zero2", "zero2_cpu", "3d"],
|
||||
help="Choose which plugin to use",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--conversation_template_config",
|
||||
type=str,
|
||||
default=None,
|
||||
help="Path \
|
||||
to save conversation template config files.",
|
||||
)
|
||||
parser.add_argument("--grad_clip", type=float, default=1.0, help="Gradient clipping value")
|
||||
parser.add_argument("--weight_decay", type=float, default=0.1, help="Weight decay")
|
||||
parser.add_argument("--warmup_steps", type=int, default=None, help="Warmup steps")
|
||||
parser.add_argument("--tokenizer_dir", type=str, default=None)
|
||||
parser.add_argument("--tp", type=int, default=1)
|
||||
parser.add_argument("--pp", type=int, default=1)
|
||||
parser.add_argument("--sp", type=int, default=1)
|
||||
parser.add_argument("--enable_sequence_parallelism", default=False, action="store_true")
|
||||
parser.add_argument("--zero_stage", type=int, default=0, help="Zero stage", choices=[0, 1, 2])
|
||||
parser.add_argument("--zero_cpu_offload", default=False, action="store_true")
|
||||
parser.add_argument("--sp_mode", type=str, default="split_gather", choices=["split_gather", "ring", "all_to_all"])
|
||||
parser.add_argument("--pretrain", type=str, default=None)
|
||||
parser.add_argument("--rm_pretrain", type=str, default=None)
|
||||
parser.add_argument("--checkpoint_path", type=str, default=None)
|
||||
parser.add_argument("--rm_checkpoint_path", type=str, help="Reward model checkpoint path")
|
||||
parser.add_argument("--reward_functions", type=str, nargs="+", default=None, help="Reward functions to use")
|
||||
parser.add_argument("--save_path", type=str, default="actor_checkpoint_prompts")
|
||||
parser.add_argument("--num_episodes", type=int, default=1)
|
||||
parser.add_argument("--num_collect_steps", type=int, default=2)
|
||||
parser.add_argument("--num_update_steps", type=int, default=5)
|
||||
parser.add_argument("--num_generations", type=int, default=8)
|
||||
parser.add_argument("--inference_batch_size", type=int, default=None)
|
||||
parser.add_argument("--save_interval", type=int, default=1000)
|
||||
parser.add_argument("--train_batch_size", type=int, default=16)
|
||||
parser.add_argument("--logits_forward_batch_size", type=int, default=1)
|
||||
parser.add_argument("--experience_batch_size", type=int, default=16)
|
||||
parser.add_argument("--ptx_batch_size", type=int, default=4)
|
||||
parser.add_argument("--lora_config", type=str, default=None, help="low-rank adaptation config file path")
|
||||
parser.add_argument("--mixed_precision", type=str, default="fp16", choices=["fp16", "bf16"], help="Mixed precision")
|
||||
parser.add_argument("--accumulation_steps", type=int, default=8)
|
||||
parser.add_argument("--lr", type=float, default=1e-6)
|
||||
parser.add_argument("--kl_coef", type=float, default=0.7)
|
||||
parser.add_argument("--ptx_coef", type=float, default=0.0)
|
||||
parser.add_argument("--disable_loss_mask", default=False, action="store_true")
|
||||
parser.add_argument("--max_length", type=int, default=2048)
|
||||
parser.add_argument("--max_tokens_thinking", type=int, default=2000)
|
||||
parser.add_argument("--max_seq_len", type=int, default=256)
|
||||
parser.add_argument("--initial_temperature", type=float, default=1.0)
|
||||
parser.add_argument("--final_temperature", type=float, default=0.9)
|
||||
parser.add_argument("--log_dir", default=None, type=str)
|
||||
parser.add_argument("--use_wandb", default=False, action="store_true")
|
||||
parser.add_argument("--grad_checkpoint", default=False, action="store_true")
|
||||
parser.add_argument("--use_flash_attn", default=False, action="store_true")
|
||||
|
||||
args = parser.parse_args()
|
||||
train(args)
|
86
applications/ColossalChat/examples/training_scripts/train_grpo.sh
Executable file
@ -0,0 +1,86 @@
|
||||
#!/bin/bash
|
||||
set_n_least_used_CUDA_VISIBLE_DEVICES() {
|
||||
local n=${1:-"9999"}
|
||||
echo "GPU Memory Usage:"
|
||||
local FIRST_N_GPU_IDS=$(nvidia-smi --query-gpu=memory.used --format=csv |
|
||||
tail -n +2 |
|
||||
nl -v 0 |
|
||||
tee /dev/tty |
|
||||
sort -g -k 2 |
|
||||
awk '{print $1}' |
|
||||
head -n $n)
|
||||
export CUDA_VISIBLE_DEVICES=$(echo $FIRST_N_GPU_IDS | sed 's/ /,/g')
|
||||
echo "Now CUDA_VISIBLE_DEVICES is set to:"
|
||||
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
|
||||
}
|
||||
set_n_least_used_CUDA_VISIBLE_DEVICES 8
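# picks the 8 GPUs with the lowest current memory usage and exposes them via CUDA_VISIBLE_DEVICES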
|
||||
|
||||
PROJECT_NAME="PPO-RLVR"
|
||||
|
||||
PARENT_SAVE_DIR="" # Path to a folder to save checkpoints
|
||||
PARENT_CONFIG_FILE="" # Path to a folder to save training config logs
|
||||
PRETRAINED_MODEL_PATH="" # local pretrained model path (from RLHF step 1: SFT)
|
||||
PRETRAINED_TOKENIZER_PATH="" # huggingface or local tokenizer path
|
||||
CONVERSATION_TEMPLATE_CONFIG_PATH="" # path to the conversation config file
|
||||
LOGDIR=""
|
||||
|
||||
declare -a prompt_dataset=(
|
||||
YOUR/PROMPT/DATA/DIR/arrow/part-00000
|
||||
YOUR/PROMPT/DATA/DIR/arrow/part-00001
|
||||
YOUR/PROMPT/DATA/DIR/arrow/part-00002
|
||||
YOUR/PROMPT/DATA/DIR/arrow/part-00003
|
||||
YOUR/PROMPT/DATA/DIR/arrow/part-00004
|
||||
YOUR/PROMPT/DATA/DIR/arrow/part-00005
|
||||
YOUR/PROMPT/DATA/DIR/arrow/part-00006
|
||||
YOUR/PROMPT/DATA/DIR/arrow/part-00007
|
||||
YOUR/PROMPT/DATA/DIR/arrow/part-00008
|
||||
YOUR/PROMPT/DATA/DIR/arrow/part-00009
|
||||
)
|
||||
|
||||
declare -a ptx_dataset=(
|
||||
YOUR/SFT/DATA/DIR/arrow/part-00000
|
||||
YOUR/SFT/DATA/DIR/arrow/part-00001
|
||||
YOUR/SFT/DATA/DIR/arrow/part-00002
|
||||
YOUR/SFT/DATA/DIR/arrow/part-00003
|
||||
YOUR/SFT/DATA/DIR/arrow/part-00004
|
||||
YOUR/SFT/DATA/DIR/arrow/part-00005
|
||||
YOUR/SFT/DATA/DIR/arrow/part-00006
|
||||
YOUR/SFT/DATA/DIR/arrow/part-00007
|
||||
YOUR/SFT/DATA/DIR/arrow/part-00008
|
||||
YOUR/SFT/DATA/DIR/arrow/part-00009
|
||||
)
|
||||
|
||||
TIMESTAMP=$(date +%Y-%m-%d-%H-%M-%S)
|
||||
FULL_PROJECT_NAME="${PROJECT_NAME}-${TIMESTAMP}"
|
||||
SAVE_DIR="${PARENT_SAVE_DIR}${FULL_PROJECT_NAME}"
|
||||
CONFIG_FILE="${PARENT_CONFIG_FILE}${FULL_PROJECT_NAME}.json"
|
||||
|
||||
colossalai run --nproc_per_node 8 --num_nodes 1 --hostfile ./hostfile train_grpo.py \
|
||||
--pretrain $PRETRAINED_MODEL_PATH \
|
||||
--tokenizer_dir $PRETRAINED_TOKENIZER_PATH \
|
||||
--prompt_dataset ${prompt_dataset[@]} \
|
||||
--conversation_template_config $CONVERSATION_TEMPLATE_CONFIG_PATH \
|
||||
--ptx_coef 0.0 \
|
||||
--plugin "zero2_cpu" \
|
||||
--reward_functions math_competition_reward_fn \
|
||||
--save_interval 250 \
|
||||
--save_path $SAVE_DIR \
|
||||
--num_episodes 100 \
|
||||
--num_collect_steps 8 \
|
||||
--num_update_steps 1 \
|
||||
--experience_batch_size 1 \
|
||||
--train_batch_size 4 \
|
||||
--inference_batch_size 8 \
|
||||
--logits_forward_batch_size 2 \
|
||||
--accumulation_steps 4 \
|
||||
--lr 1e-6 \
|
||||
--mixed_precision "bf16" \
|
||||
--grad_clip 0.1\
|
||||
--weight_decay 0.01 \
|
||||
--kl_coef 0.01 \
|
||||
--warmup_steps 40 \
|
||||
--max_length 2000 \
|
||||
--max_seq_len 1700 \
|
||||
--log_dir $LOGDIR \
|
||||
--use_flash_attn \
|
||||
--grad_checkpoint
|
@ -13,9 +13,18 @@ from coati.dataset import (
|
||||
load_tokenized_dataset,
|
||||
setup_conversation_template,
|
||||
)
|
||||
from coati.models import Critic, LoraConfig, RewardModel, convert_to_lora_module, disable_dropout, lora_manager
|
||||
from coati.models import (
|
||||
Critic,
|
||||
LoraConfig,
|
||||
RewardModel,
|
||||
RLVRRewardModel,
|
||||
convert_to_lora_module,
|
||||
disable_dropout,
|
||||
lora_manager,
|
||||
)
|
||||
from coati.trainer import PPOTrainer
|
||||
from coati.utils import load_checkpoint
|
||||
from coati.utils.reward_score import *
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
import colossalai
|
||||
@ -29,8 +38,17 @@ from colossalai.shardformer.policies.auto_policy import get_autopolicy
|
||||
|
||||
logger = get_dist_logger()
|
||||
|
||||
# default settings for response format tags, overwrite it in chat_template definition if needed
|
||||
response_format_tags = {
|
||||
"think_start": {"text": "<think>", "num_occur": 1},
|
||||
"think_end": {"text": "</think>", "num_occur": 1},
|
||||
"answer_start": {"text": "<answer>", "num_occur": 1},
|
||||
"answer_end": {"text": "</answer>", "num_occur": 1},
|
||||
}
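# e.g., a response satisfying these tags (exactly one occurrence of each):
#   "<think> ...reasoning... </think><answer>4</answer>"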
|
||||
|
||||
|
||||
def train(args):
|
||||
global response_format_tags
|
||||
lora_config = None
|
||||
if args.lora_config is not None:
|
||||
lora_config = LoraConfig.from_file(args.lora_config)
|
||||
@ -61,28 +79,36 @@ def train(args):
|
||||
torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
|
||||
use_flash_attention_2=True,
|
||||
local_files_only=True,
|
||||
trust_remote_code=True,
|
||||
)
|
||||
ref_model = AutoModelForCausalLM.from_pretrained(
|
||||
args.pretrain,
|
||||
torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
|
||||
use_flash_attention_2=True,
|
||||
local_files_only=True,
|
||||
trust_remote_code=True,
|
||||
)
|
||||
reward_model = RewardModel(
|
||||
args.rm_pretrain,
|
||||
torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
|
||||
use_flash_attention_2=True,
|
||||
)
|
||||
if not args.no_neural_reward_model:
|
||||
reward_model = RewardModel(
|
||||
args.rm_pretrain,
|
||||
torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
|
||||
use_flash_attention_2=True,
|
||||
trust_remote_code=True,
|
||||
)
|
||||
critic = Critic(
|
||||
args.rm_pretrain,
|
||||
torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
|
||||
use_flash_attention_2=True,
|
||||
trust_remote_code=True,
|
||||
)
|
||||
coordinator.print_on_master(msg="Flash-attention enabled successfully")
|
||||
else:
|
||||
actor = AutoModelForCausalLM.from_pretrained(args.pretrain, local_files_only=True)
|
||||
ref_model = AutoModelForCausalLM.from_pretrained(args.pretrain, local_files_only=True)
|
||||
reward_model = RewardModel(args.rm_pretrain)
|
||||
actor = AutoModelForCausalLM.from_pretrained(args.pretrain, local_files_only=True, trust_remote_code=True)
|
||||
ref_model = AutoModelForCausalLM.from_pretrained(
|
||||
args.pretrain, local_files_only=True, trust_remote_code=True
|
||||
)
|
||||
if not args.no_neural_reward_model:
|
||||
reward_model = RewardModel(args.rm_pretrain, trust_remote_code=True)
|
||||
critic = Critic(args.rm_pretrain)
|
||||
|
||||
if args.lora_config is not None:
|
||||
@ -112,6 +138,9 @@ def train(args):
|
||||
with open(args.conversation_template_config, "r", encoding="utf8") as f:
|
||||
conversation_template_config = json.load(f)
|
||||
dist.barrier()
|
||||
if "response_format_tags" in conversation_template_config:
|
||||
logger.warning(f"Overwrite default response format tags with {args.conversation_template_config}")
|
||||
response_format_tags = conversation_template_config.get("response_format_tags", response_format_tags)
|
||||
conversation_template = setup_conversation_template(
|
||||
tokenizer, chat_template_config=conversation_template_config, save_path=args.conversation_template_config
|
||||
)
|
||||
@ -245,7 +274,7 @@ def train(args):
|
||||
parallel_output=False,
|
||||
max_norm=args.grad_clip,
|
||||
precision=args.mixed_precision,
|
||||
custom_policy=get_autopolicy(reward_model.model),
|
||||
custom_policy=get_autopolicy(critic.model),
|
||||
)
|
||||
else:
|
||||
raise ValueError(f"Unknown plugin {args.plugin}")
|
||||
@ -284,7 +313,8 @@ def train(args):
|
||||
|
||||
actor_booster = Booster(plugin=plugin)
|
||||
ref_booster = Booster(plugin=plugin)
|
||||
rm_booster = Booster(plugin=custom_plugin)
|
||||
if not args.no_neural_reward_model:
|
||||
rm_booster = Booster(plugin=custom_plugin)
|
||||
critic_booster = Booster(plugin=custom_plugin)
|
||||
|
||||
default_dtype = torch.float16 if args.mixed_precision == "fp16" else torch.bfloat16
|
||||
@ -302,7 +332,28 @@ def train(args):
|
||||
lr_scheduler=critic_lr_scheduler,
|
||||
dataloader=train_prompt_dataloader,
|
||||
)
|
||||
reward_model, _, _, _, _ = rm_booster.boost(model=reward_model, dataloader=train_prompt_dataloader)
|
||||
if not args.no_neural_reward_model:
|
||||
reward_model, _, _, _, _ = rm_booster.boost(model=reward_model, dataloader=train_prompt_dataloader)
|
||||
else:
|
||||
if args.reward_functions:
|
||||
reward_fn_list = []
|
||||
for reward_fn in args.reward_functions:
|
||||
"""
|
||||
To define a custom reward function, you can define your functions under:
|
||||
colossalai/applications/ColossalChat/coati/utils/reward_score/__init__.py
|
||||
and use it here by modifying the following line:
|
||||
"""
|
||||
if reward_fn == "gsm8k_reward_fn":
|
||||
reward_fn_list.append(gsm8k_reward_fn)
|
||||
elif reward_fn == "math_competition_reward_fn":
|
||||
reward_fn_list.append(math_competition_reward_fn)
|
||||
else:
|
||||
raise ValueError(f"Unknown reward function {reward_fn}")
|
||||
reward_fn_list.append(eval(reward_fn))
|
||||
reward_model = RLVRRewardModel(
|
||||
reward_fn_list=reward_fn_list, tokenizer=tokenizer, tags=response_format_tags
|
||||
)
|
||||
|
||||
ref_model, _, _, _, _ = ref_booster.boost(model=ref_model, dataloader=train_prompt_dataloader)
|
||||
|
||||
torch.set_default_dtype(torch.float)
|
||||
@ -481,9 +532,11 @@ if __name__ == "__main__":
|
||||
parser.add_argument("--sp_mode", type=str, default="split_gather", choices=["split_gather", "ring", "all_to_all"])
|
||||
parser.add_argument("--pretrain", type=str, default=None)
|
||||
parser.add_argument("--rm_pretrain", type=str, default=None)
|
||||
parser.add_argument("--no_neural_reward_model", default=False, action="store_true")
|
||||
parser.add_argument("--checkpoint_path", type=str, default=None)
|
||||
parser.add_argument("--critic_checkpoint_path", type=str, default=None)
|
||||
parser.add_argument("--rm_checkpoint_path", type=str, help="Reward model checkpoint path")
|
||||
parser.add_argument("--reward_functions", type=str, nargs="+", default=None, help="Reward functions to use")
|
||||
parser.add_argument("--save_path", type=str, default="actor_checkpoint_prompts")
|
||||
parser.add_argument("--num_episodes", type=int, default=1)
|
||||
parser.add_argument("--num_collect_steps", type=int, default=2)
|
||||
|
@ -2,7 +2,7 @@ transformers==4.39.3
|
||||
tqdm
|
||||
datasets==2.14.7
|
||||
loralib
|
||||
colossalai>=0.4.0
|
||||
colossalai>=0.4.7
|
||||
torch>=2.1.0
|
||||
langchain
|
||||
tokenizers
|
||||
@ -21,3 +21,4 @@ ninja==1.11.1
|
||||
sentencepiece==0.1.99
|
||||
flash-attn
|
||||
tiktoken
|
||||
jsonlines
|
||||
|
@ -20,6 +20,15 @@ prompt_seed = {
|
||||
},
|
||||
]
|
||||
}
|
||||
prompt_rlvr_seed = {
|
||||
"messages": [
|
||||
{
|
||||
"from": "user",
|
||||
"content": "What is the degree of the polynomial $(4 +5x^3 +100 +2\pi x^4 + \sqrt{10}x^4 +9)$?",
|
||||
},
|
||||
],
|
||||
"gt_answer": "4",
|
||||
}
|
||||
preference_seed = {
|
||||
"context": [
|
||||
{"from": "user", "content": "What kind of noises did dinosaurs make?"},
|
||||
@ -72,6 +81,8 @@ if __name__ == "__main__":
|
||||
seed = sft_seed
|
||||
elif args.data_type == "prompt":
|
||||
seed = prompt_seed
|
||||
elif args.data_type == "prompt_rlvr":
|
||||
seed = prompt_rlvr_seed
|
||||
elif args.data_type == "preference":
|
||||
seed = preference_seed
|
||||
elif args.data_type == "kto":
|
||||
|
16
applications/ColossalChat/tests/prepare_test_env.sh
Executable file
@ -0,0 +1,16 @@
|
||||
# run under /ColossalAI/applications/ColossalChat
|
||||
export NCCL_SHM_DISABLE=1
|
||||
export MAX_JOBS=1
|
||||
export PRETRAINED_MODEL_PATH=./models
|
||||
export SFT_DATASET=./sft_data
|
||||
export PROMPT_DATASET=./prompt_data
|
||||
export PROMPT_RLVR_DATASET=./prompt_data
|
||||
export PREFERENCE_DATASET=./preference_data
|
||||
export KTO_DATASET=./kto_data
|
||||
mkdir models
|
||||
mkdir sft_data
|
||||
mkdir prompt_data
|
||||
mkdir preference_data
|
||||
mkdir kto_data
|
||||
# ./tests/test_data_preparation.sh
|
||||
# ./tests/test_train.sh
|
@ -24,7 +24,12 @@ if [ -z "$SFT_DATASET" ]; then
|
||||
fi
|
||||
|
||||
if [ -z "$PROMPT_DATASET" ]; then
|
||||
echo "Please set \$PROMPT_DATASET to the path to prompts."
|
||||
echo "Please set \$PROMPT_DATASET to the path to prompts dataset."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [ -z "$PROMPT_RLVR_DATASET" ]; then
|
||||
echo "Please set \$PROMPT_RLVR_DATASET to the path to prompts dataset with gt_answer labels."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
@ -69,6 +74,8 @@ get_data_input_dirs() {
|
||||
echo "$SFT_DATASET"
|
||||
elif [[ $data_type == "prompt" ]]; then
|
||||
echo "$PROMPT_DATASET"
|
||||
elif [[ $data_type == "prompt_rlvr" ]]; then
|
||||
echo "$PROMPT_RLVR_DATASET"
|
||||
elif [[ $data_type == "preference" ]]; then
|
||||
echo "$PREFERENCE_DATASET"
|
||||
elif [[ $data_type == "kto" ]]; then
|
||||
@ -123,6 +130,10 @@ python $TEST_DIR/generate_dummy_datasets_for_testing.py \
|
||||
--data_dir $(get_data_input_dirs prompt) \
|
||||
--data_type "prompt"
|
||||
|
||||
python $TEST_DIR/generate_dummy_datasets_for_testing.py \
|
||||
--data_dir $(get_data_input_dirs prompt_rlvr) \
|
||||
--data_type "prompt_rlvr"
|
||||
|
||||
python $TEST_DIR/generate_dummy_datasets_for_testing.py \
|
||||
--data_dir $(get_data_input_dirs kto) \
|
||||
--data_type "kto"
|
||||
@ -266,6 +277,52 @@ for model in ${MODELS[@]}; do
|
||||
done
|
||||
|
||||
|
||||
echo "[Test]: testing prepare_prompt_dataset.py (with verifiable reward)..."
|
||||
|
||||
# FIXME: This is a hack to skip tests that are not working
|
||||
SKIPPED_TESTS=(
|
||||
)
|
||||
|
||||
# test prepare_prompt_dataset
|
||||
for model in ${MODELS[@]}; do
|
||||
data_type="prompt_rlvr"
|
||||
if [[ " ${SKIPPED_TESTS[*]} " =~ " $model-$data_type " ]]; then
|
||||
echo "[Test]: Skipped $model-$data_type"
|
||||
continue
|
||||
fi
|
||||
cache_dir=$DATA_SAVE_PATH/tokenized_${model}_${data_type}/cache
|
||||
jsonl_dir=$DATA_SAVE_PATH/tokenized_${model}_${data_type}/jsonl
|
||||
arrow_dir=$DATA_SAVE_PATH/tokenized_${model}_${data_type}/arrow
|
||||
data_input_dirs=$(get_data_input_dirs $data_type)
|
||||
tokenizer_dir=$(get_tokenizer_dirs $model)
|
||||
conversation_template=$(get_conversation_template_config $model)
|
||||
for i in $(seq $NUM_RETRY); do
|
||||
rm -rf $cache_dir
|
||||
rm -rf $jsonl_dir
|
||||
rm -rf $arrow_dir
|
||||
echo "[Test]: $model-$data_type, attempt $i"
|
||||
python $EXAMPLES_DIR/data_preparation_scripts/prepare_dataset.py \
|
||||
--type prompt \
|
||||
--data_input_dirs $data_input_dirs \
|
||||
--conversation_template_config $conversation_template \
|
||||
--tokenizer_dir $tokenizer_dir \
|
||||
--data_cache_dir $cache_dir \
|
||||
--data_jsonl_output_dir $jsonl_dir \
|
||||
--data_arrow_output_dir $arrow_dir \
|
||||
--max_length 400 \
|
||||
--num_samples_per_datafile 100 \
|
||||
--num_spliced_dataset_bins 1
|
||||
passed=$?
|
||||
if [ $passed -eq 0 ]; then
|
||||
break
|
||||
fi
|
||||
done
|
||||
if [ $passed -ne 0 ]; then
|
||||
echo "[Test]: Failed $model-$data_type"
|
||||
exit 1
|
||||
fi
|
||||
done
|
||||
|
||||
echo "[Test]: testing prepare_kto_dataset.py ..."
|
||||
|
||||
# FIXME: This is a hack to skip tests that are not working
|
||||
|
@ -81,8 +81,242 @@ random_choice() {
|
||||
echo ${arr[$idx]}
|
||||
}
|
||||
|
||||
echo "[Test]: testing grpo ..."
|
||||
|
||||
|
||||
SKIPPED_TESTS=(
|
||||
llama-3d # 3d plugin doesn't support lora
|
||||
llama-gemini # gemini doesn't support lora
|
||||
)
|
||||
|
||||
GRAD_CKPTS=('--grad_checkpoint')
|
||||
REWARD_FLAG=('nn' 'vr')
|
||||
for reward_type in ${REWARD_FLAG[@]}; do
|
||||
for lora_rank in ${LORA_RANK[@]}; do
|
||||
for model in ${MODELS[@]}; do
|
||||
for plugin in ${PLUGINS[@]}; do
|
||||
if [[ $plugin == "gemini_auto" ]]; then
|
||||
echo "[Test]: Skipped $model-$plugin"
|
||||
continue # gemini_auto plugin doesn't support generation
|
||||
fi
|
||||
if [[ " ${SKIPPED_TESTS[*]} " =~ " $model-$plugin-$lora_rank " ]]; then
|
||||
echo "[Test]: Skipped $model-$plugin-$lora_rank"
|
||||
continue
|
||||
elif [[ " ${SKIPPED_TESTS[*]} " =~ " $model-$plugin " ]]; then
|
||||
echo "[Test]: Skipped $model-$plugin"
|
||||
continue
|
||||
fi
|
||||
pretrain=$(get_pretrain $model)
|
||||
rm_pretrain="--rm_pretrain $pretrain"
|
||||
reward_fn=""
|
||||
if [[ $reward_type == "vr" ]]; then
|
||||
rm_pretrain=""
|
||||
reward_fn="--reward_functions gsm8k_reward_fn"
|
||||
fi
|
||||
tokenizer_dir=$(get_tokenizer_dirs $model)
|
||||
grad_ckpt=$(random_choice "${GRAD_CKPTS[@]}")
|
||||
tp='1'
|
||||
bs='2'
|
||||
ebs='1'
|
||||
conversation_template=$(get_conversation_template_config $model)
|
||||
if [[ $plugin == "zero2" ]]; then
|
||||
lora_config=$LORA_CONFIG_ENABLE
|
||||
else
|
||||
lora_config=""
|
||||
fi
|
||||
if [[ $plugin == "3d" ]]; then
|
||||
tp='2'
|
||||
bs='2'
|
||||
ebs='1'
|
||||
fi
|
||||
grad_accu='2'
|
||||
# gemini_auto and gemini don't support gradient accumulation
|
||||
if [[ $plugin == "gemini_auto" ]]; then
|
||||
grad_accu='1'
|
||||
fi
|
||||
# gemini_auto and gemini don't support generation
|
||||
if [[ $plugin == "gemini_auto" ]]; then
|
||||
# gemini-auto doesn't support generation
|
||||
echo "[Test]: Skipped $model-$plugin"
|
||||
continue
|
||||
fi
|
||||
for i in $(seq $NUM_RETRY); do
|
||||
echo "[Test]: $model-$plugin-$lora_rank-$reward_type, attempt $i"
|
||||
declare -a prompt_dataset=()
|
||||
for split in $(seq -f "%05g" 0 0); do
|
||||
if [[ $reward_type == "vr" ]]; then
|
||||
prompt_dataset+=("$TEMP_DIR/rlhf_data/tokenized_${model}_prompt_rlvr/arrow/part-$split")
|
||||
else
|
||||
prompt_dataset+=("$TEMP_DIR/rlhf_data/tokenized_${model}_prompt/arrow/part-$split")
|
||||
fi
|
||||
done
|
||||
declare -a ptx_dataset=()
|
||||
for split in $(seq -f "%05g" 0 0); do
|
||||
ptx_dataset+=("$TEMP_DIR/rlhf_data/tokenized_${model}_sft/arrow/part-$split")
|
||||
done
|
||||
colossalai run --nproc_per_node 2 --master_port 31332 $EXAMPLES_DIR/training_scripts/train_grpo.py \
|
||||
--pretrain $pretrain \
|
||||
$rm_pretrain \
|
||||
--tokenizer_dir $tokenizer_dir \
|
||||
--conversation_template_config $conversation_template \
|
||||
--prompt_dataset ${prompt_dataset[@]} \
|
||||
--ptx_dataset ${ptx_dataset[@]} \
|
||||
--ptx_batch_size 1 \
|
||||
--num_generations 2 \
|
||||
--ptx_coef 0.2 \
|
||||
--save_path $MODEL_SAVE_PATH \
|
||||
$lora_config \
|
||||
--plugin $plugin \
|
||||
--num_episodes 5 \
|
||||
--num_collect_steps 1 \
|
||||
--num_update_steps 1 \
|
||||
--experience_batch_size $ebs \
|
||||
--train_batch_size $bs \
|
||||
--accumulation_steps $grad_accu \
|
||||
--lr 9e-6 \
|
||||
--mixed_precision "bf16" \
|
||||
--grad_clip 1.0 \
|
||||
--tp $tp \
|
||||
--lr 2e-5 \
|
||||
$grad_ckpt \
|
||||
--max_len 200 \
|
||||
--max_seq_len 10 \
|
||||
$reward_fn
|
||||
# --use_flash_attn
|
||||
passed=$?
|
||||
if [ $passed -eq 0 ]; then
|
||||
rm -rf ${MODEL_SAVE_PATH:?}/*
|
||||
rm -rf ${MODELS_DIR:?}/*
|
||||
break
|
||||
fi
|
||||
done
|
||||
if [ $passed -ne 0 ]; then
|
||||
echo "[Test]: Failed $model-$plugin-$lora_rank-$reward_type"
|
||||
exit 1
|
||||
fi
|
||||
done
|
||||
done
|
||||
done
|
||||
done
|
||||
|
||||
|
||||
echo "[Test]: testing ppo ..."
|
||||
|
||||
|
||||
SKIPPED_TESTS=(
|
||||
llama-3d # 3d plugin doesn't support lora
|
||||
llama-gemini # gemini doesn't support lora
|
||||
)
|
||||
|
||||
GRAD_CKPTS=('--grad_checkpoint')
|
||||
REWARD_FLAG=('vr' 'nn')
|
||||
for reward_type in ${REWARD_FLAG[@]}; do
|
||||
for lora_rank in ${LORA_RANK[@]}; do
|
||||
for model in ${MODELS[@]}; do
|
||||
for plugin in ${PLUGINS[@]}; do
|
||||
if [[ $plugin == "gemini_auto" ]]; then
|
||||
echo "[Test]: Skipped $model-$plugin"
|
||||
continue # gemini_auto plugin doesn't support generation
|
||||
fi
|
||||
if [[ " ${SKIPPED_TESTS[*]} " =~ " $model-$plugin-$lora_rank " ]]; then
|
||||
echo "[Test]: Skipped $model-$plugin-$lora_rank"
|
||||
continue
|
||||
elif [[ " ${SKIPPED_TESTS[*]} " =~ " $model-$plugin " ]]; then
|
||||
echo "[Test]: Skipped $model-$plugin"
|
||||
continue
|
||||
fi
|
||||
pretrain=$(get_pretrain $model)
|
||||
reward_fn=""
|
||||
no_nn=""
|
||||
if [[ $reward_type == "vr" ]]; then
|
||||
reward_fn="--reward_functions gsm8k_reward_fn"
|
||||
no_nn="--no_neural_reward_model"
|
||||
fi
|
||||
tokenizer_dir=$(get_tokenizer_dirs $model)
|
||||
grad_ckpt=$(random_choice "${GRAD_CKPTS[@]}")
|
||||
tp='1'
|
||||
bs='2'
|
||||
ebs='2'
|
||||
conversation_template=$(get_conversation_template_config $model)
|
||||
if [[ $plugin == "zero2" ]]; then
|
||||
lora_config=$LORA_CONFIG_ENABLE
|
||||
else
|
||||
lora_config=""
|
||||
fi
|
||||
if [[ $plugin == "3d" ]]; then
|
||||
tp='2'
|
||||
bs='2'
|
||||
ebs='2'
|
||||
fi
|
||||
grad_accu='2'
|
||||
# gemini_auto and gemini don't support gradient accumulation
|
||||
if [[ $plugin == "gemini_auto" ]]; then
|
||||
grad_accu='1'
|
||||
fi
|
||||
# gemini_auto and gemini don't support generation
|
||||
if [[ $plugin == "gemini_auto" ]]; then
|
||||
# gemini-auto doesn't support generation
|
||||
echo "[Test]: Skipped $model-$plugin"
|
||||
continue
|
||||
fi
|
||||
for i in $(seq $NUM_RETRY); do
|
||||
echo "[Test]: $model-$plugin-$lora_rank-$reward_type, attempt $i"
|
||||
declare -a prompt_dataset=()
|
||||
for split in $(seq -f "%05g" 0 0); do
|
||||
if [[ $reward_type == "vr" ]]; then
|
||||
prompt_dataset+=("$TEMP_DIR/rlhf_data/tokenized_${model}_prompt_rlvr/arrow/part-$split")
|
||||
else
|
||||
prompt_dataset+=("$TEMP_DIR/rlhf_data/tokenized_${model}_prompt/arrow/part-$split")
|
||||
fi
|
||||
done
|
||||
declare -a ptx_dataset=()
|
||||
for split in $(seq -f "%05g" 0 0); do
|
||||
ptx_dataset+=("$TEMP_DIR/rlhf_data/tokenized_${model}_sft/arrow/part-$split")
|
||||
done
|
||||
colossalai run --nproc_per_node 2 --master_port 31332 $EXAMPLES_DIR/training_scripts/train_ppo.py \
|
||||
--pretrain $pretrain \
|
||||
--rm_pretrain $pretrain \
|
||||
--tokenizer_dir $tokenizer_dir \
|
||||
--conversation_template_config $conversation_template \
|
||||
--prompt_dataset ${prompt_dataset[@]} \
|
||||
--ptx_dataset ${ptx_dataset[@]} \
|
||||
--ptx_batch_size 1 \
|
||||
--ptx_coef 0.2 \
|
||||
--save_path $MODEL_SAVE_PATH \
|
||||
$lora_config \
|
||||
--plugin $plugin \
|
||||
--num_episodes 5 \
|
||||
--num_collect_steps 1 \
|
||||
--num_update_steps 1 \
|
||||
--experience_batch_size $ebs \
|
||||
--train_batch_size $bs \
|
||||
--accumulation_steps $grad_accu \
|
||||
--lr 9e-6 \
|
||||
--mixed_precision "bf16" \
|
||||
--grad_clip 1.0 \
|
||||
--tp $tp \
|
||||
--lr 2e-5 \
|
||||
$grad_ckpt \
|
||||
--max_len 400 \
|
||||
--max_seq_len 10 \
|
||||
$reward_fn \
|
||||
$no_nn
|
||||
# --use_flash_attn
|
||||
passed=$?
|
||||
if [ $passed -eq 0 ]; then
|
||||
rm -rf ${MODEL_SAVE_PATH:?}/*
|
||||
rm -rf ${MODELS_DIR:?}/*
|
||||
break
|
||||
fi
|
||||
done
|
||||
if [ $passed -ne 0 ]; then
|
||||
echo "[Test]: Failed $model-$plugin-$lora_rank-$reward_type"
|
||||
exit 1
|
||||
fi
|
||||
done
|
||||
done
|
||||
done
|
||||
done
|
||||
|
||||
echo "[Test]: testing sft ..."
|
||||
|
||||
@ -316,111 +550,6 @@ for lora_rank in ${LORA_RANK[@]}; do
|
||||
done
|
||||
done
|
||||
|
||||
|
||||
echo "[Test]: testing ppo ..."
|
||||
|
||||
|
||||
SKIPPED_TESTS=(
|
||||
llama-3d # 3d plugin doesn't support lora
|
||||
llama-gemini # gemini doesn't support lora
|
||||
)
|
||||
|
||||
GRAD_CKPTS=('--grad_checkpoint')
|
||||
for lora_rank in ${LORA_RANK[@]}; do
|
||||
for model in ${MODELS[@]}; do
|
||||
for plugin in ${PLUGINS[@]}; do
|
||||
if [[ $plugin == "gemini_auto" ]]; then
|
||||
echo "[Test]: Skipped $model-$plugin"
|
||||
continue # gemini_auto plugin doesn't support generation
|
||||
fi
|
||||
if [[ " ${SKIPPED_TESTS[*]} " =~ " $model-$plugin-$lora_rank " ]]; then
|
||||
echo "[Test]: Skipped $model-$plugin-$lora_rank"
|
||||
continue
|
||||
elif [[ " ${SKIPPED_TESTS[*]} " =~ " $model-$plugin " ]]; then
|
||||
echo "[Test]: Skipped $model-$plugin"
|
||||
continue
|
||||
fi
|
||||
pretrain=$(get_pretrain $model)
|
||||
tokenizer_dir=$(get_tokenizer_dirs $model)
|
||||
grad_ckpt=$(random_choice "${GRAD_CKPTS[@]}")
|
||||
tp='1'
|
||||
bs='4'
|
||||
ebs='8'
|
||||
conversation_template=$(get_conversation_template_config $model)
|
||||
if [[ $plugin == "zero2" ]]; then
|
||||
lora_config=$LORA_CONFIG_ENABLE
|
||||
else
|
||||
lora_config=""
|
||||
fi
|
||||
if [[ $plugin == "3d" ]]; then
|
||||
tp='2'
|
||||
bs='16'
|
||||
ebs='32'
|
||||
fi
|
||||
grad_accu='2'
|
||||
# gemini_auto and gemini doesn't support gradient accumulation
|
||||
if [[ $plugin == "gemini_auto" ]]; then
|
||||
grad_accu='1'
|
||||
fi
|
||||
# gemini_auto and gemini doesn't support generation
|
||||
if [[ $plugin == "gemini_auto" ]]; then
|
||||
# gemini-auto doesn't support generation
|
||||
echo "[Test]: Skipped $model-$plugin"
|
||||
continue
|
||||
fi
|
||||
for i in $(seq $NUM_RETRY); do
|
||||
echo "[Test]: $model-$plugin-$lora_rank, attempt $i"
|
||||
declare -a prompt_dataset=()
|
||||
for split in $(seq -f "%05g" 0 0); do
|
||||
prompt_dataset+=("$TEMP_DIR/rlhf_data/tokenized_${model}_prompt/arrow/part-$split")
|
||||
done
|
||||
declare -a ptx_dataset=()
|
||||
for split in $(seq -f "%05g" 0 0); do
|
||||
ptx_dataset+=("$TEMP_DIR/rlhf_data/tokenized_${model}_sft/arrow/part-$split")
|
||||
done
|
||||
colossalai run --nproc_per_node 2 --master_port 31332 $EXAMPLES_DIR/training_scripts/train_ppo.py \
|
||||
--pretrain $pretrain \
|
||||
--rm_pretrain $pretrain \
|
||||
--tokenizer_dir $tokenizer_dir \
|
||||
--conversation_template_config $conversation_template \
|
||||
--prompt_dataset ${prompt_dataset[@]} \
|
||||
--ptx_dataset ${ptx_dataset[@]} \
|
||||
--ptx_batch_size 1 \
|
||||
--ptx_coef 0.2 \
|
||||
--save_path $MODEL_SAVE_PATH \
|
||||
$lora_config \
|
||||
--plugin $plugin \
|
||||
--num_episodes 5 \
|
||||
--num_collect_steps 1 \
|
||||
--num_update_steps 1 \
|
||||
--experience_batch_size $ebs \
|
||||
--train_batch_size $bs \
|
||||
--accumulation_steps $grad_accu \
|
||||
--lr 9e-6 \
|
||||
--mixed_precision "bf16" \
|
||||
--grad_clip 1.0 \
|
||||
--tp $tp \
|
||||
--lr 2e-5 \
|
||||
$grad_ckpt \
|
||||
--max_len 400 \
|
||||
--max_seq_len 10 \
|
||||
# --use_flash_attn
|
||||
passed=$?
|
||||
if [ $passed -eq 0 ]; then
|
||||
rm -rf ${MODEL_SAVE_PATH:?}/*
|
||||
rm -rf ${MODELS_DIR:?}/*
|
||||
break
|
||||
fi
|
||||
done
|
||||
if [ $passed -ne 0 ]; then
|
||||
echo "[Test]: Failed $model-$plugin-$lora_rank"
|
||||
exit 1
|
||||
fi
|
||||
done
|
||||
done
|
||||
done
|
||||
|
||||
|
||||
echo "[Test]: testing DPO ..."
|
||||
|
||||
SKIPPED_TESTS=(
|
||||
@ -446,7 +575,7 @@ for lora_rank in ${LORA_RANK[@]}; do
|
||||
bs='2'
|
||||
if [[ $plugin == "3d" ]]; then
|
||||
tp='2'
|
||||
bs='8'
|
||||
bs='2'
|
||||
fi
|
||||
if [[ $plugin == "zero2" ]]; then
|
||||
lora_config=$LORA_CONFIG_ENABLE
|
||||
@ -503,10 +632,10 @@ for lora_rank in ${LORA_RANK[@]}; do
|
||||
done
|
||||
|
||||
|
||||
|
||||
echo "[Test]: testing ORPO ..."
|
||||
|
||||
SKIPPED_TESTS=(
|
||||
llama-3d-0
|
||||
llama-3d-20 # 3d plugin doesn't support lora
|
||||
llama-gemini_auto-20 # gemini_auto plugin doesn't support lora
|
||||
llama-gemini-20 # gemini doesn't support lora
|
||||
@ -529,7 +658,7 @@ for lora_rank in ${LORA_RANK[@]}; do
|
||||
bs='2'
|
||||
if [[ $plugin == "3d" ]]; then
|
||||
tp='2'
|
||||
bs='8'
|
||||
bs='2'
|
||||
fi
|
||||
if [[ $plugin == "zero2" ]]; then
|
||||
lora_config=$LORA_CONFIG_ENABLE
|
||||
@ -585,11 +714,10 @@ for lora_rank in ${LORA_RANK[@]}; do
|
||||
done
|
||||
done
|
||||
|
||||
|
||||
|
||||
echo "[Test]: testing KTO ..."
|
||||
|
||||
SKIPPED_TESTS=(
|
||||
llama-3d-0
|
||||
llama-3d-20 # 3d plugin doesn't support lora
|
||||
llama-gemini_auto-20 # gemini_auto plugin doesn't support lora
|
||||
llama-gemini-20 # gemini doesn't support lora
|
||||
@ -612,7 +740,7 @@ for lora_rank in ${LORA_RANK[@]}; do
|
||||
bs='2'
|
||||
if [[ $plugin == "3d" ]]; then
|
||||
tp='2'
|
||||
bs='8'
|
||||
bs='2'
|
||||
fi
|
||||
if [[ $plugin == "zero2" ]]; then
|
||||
lora_config=$LORA_CONFIG_ENABLE
|
||||
|
@ -20,7 +20,7 @@ from colossalai.checkpoint_io.utils import (
|
||||
create_pinned_state_dict,
|
||||
get_model_base_filenames,
|
||||
get_optimizer_base_filenames,
|
||||
load_shard_state_dict,
|
||||
load_state_dict_shards,
|
||||
save_config_file,
|
||||
save_state_dict,
|
||||
save_state_dict_shards,
|
||||
@ -29,7 +29,6 @@ from colossalai.cluster import DistCoordinator, ProcessGroupMesh
|
||||
from colossalai.interface import ModelWrapper, OptimizerWrapper
|
||||
from colossalai.logging import get_dist_logger
|
||||
from colossalai.shardformer import ShardConfig, ShardFormer
|
||||
from colossalai.utils.safetensors import load_flat
|
||||
from colossalai.zero import GeminiDDP, GeminiOptimizer
|
||||
from colossalai.zero.gemini.memory_tracer import MemStats
|
||||
|
||||
@ -86,11 +85,11 @@ class GeminiCheckpointIO(GeneralCheckpointIO):
|
||||
if use_async:
|
||||
from colossalai.utils.safetensors import save
|
||||
|
||||
if id(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[id(model)] = create_pinned_state_dict(state_dict)
|
||||
if hash(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[hash(model)] = create_pinned_state_dict(state_dict)
|
||||
for k, v in state_dict.items():
|
||||
self.pinned_state_dicts[id(model)][k].copy_(v)
|
||||
state_dict[k] = self.pinned_state_dicts[id(model)][k]
|
||||
self.pinned_state_dicts[hash(model)][k].copy_(v)
|
||||
state_dict[k] = self.pinned_state_dicts[hash(model)][k]
|
||||
writer = save(checkpoint, state_dict)
|
||||
self.async_writers.append(writer)
|
||||
else:
|
||||
@ -173,9 +172,9 @@ class GeminiCheckpointIO(GeneralCheckpointIO):
|
||||
Path(checkpoint_path).mkdir(parents=True, exist_ok=True)
|
||||
|
||||
if use_async and self.coordinator.is_master():
|
||||
if id(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[id(model)] = {}
|
||||
pinned_state_dicts = self.pinned_state_dicts[id(model)]
|
||||
if hash(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[hash(model)] = {}
|
||||
pinned_state_dicts = self.pinned_state_dicts[hash(model)]
|
||||
else:
|
||||
pinned_state_dicts = None
|
||||
state_dict_shard = model.state_dict_shard(
|
||||
@ -350,11 +349,9 @@ class GeminiCheckpointIO(GeneralCheckpointIO):
|
||||
|
||||
# Load optimizer states from shard files under checkpoint path.
|
||||
# For each file, only load the states managed by current process.
|
||||
for shard_file in checkpoint_files:
|
||||
if shard_file.endswith(".safetensors"):
|
||||
state_dict_shard = load_flat(shard_file)
|
||||
else:
|
||||
state_dict_shard = load_shard_state_dict(Path(shard_file), use_safetensors=False)
|
||||
for state_dict_shard in load_state_dict_shards(
|
||||
checkpoint_files, True, False, low_cpu_mem_mode=low_cpu_mem_mode
|
||||
):
|
||||
if not low_cpu_mem_mode:
|
||||
state_dict_shard = create_pinned_state_dict(state_dict_shard, empty=False, num_threads=num_threads)
|
||||
optimizer.load_param_states(state_dict_shard)
|
||||
|
@ -10,6 +10,7 @@ from typing import Any, Callable, Dict, Iterator, List, Optional, OrderedDict, T
|
||||
import numpy as np
|
||||
import torch
|
||||
import torch.distributed as dist
|
||||
from peft import PeftModel
|
||||
from torch import Tensor, inf
|
||||
from torch.distributed import ProcessGroup, get_world_size
|
||||
from torch.nn import Module, SyncBatchNorm
|
||||
@ -25,6 +26,7 @@ from colossalai.amp.naive_amp.mixed_precision_optimizer import MixedPrecisionOpt
|
||||
from colossalai.checkpoint_io import CheckpointIO, HybridParallelCheckpointIO
|
||||
from colossalai.cluster import ProcessGroupMesh
|
||||
from colossalai.interface import AMPModelMixin, ModelWrapper, OptimizerWrapper
|
||||
from colossalai.interface.model import PeftUnwrapMixin
|
||||
from colossalai.interface.optimizer import DistributedOptim
|
||||
from colossalai.logging import get_dist_logger
|
||||
from colossalai.nn.optimizer import DistGaloreAwamW, cast_to_distributed
|
||||
@ -219,11 +221,13 @@ class HybridParallelModule(ModelWrapper, AMPModelMixin):
|
||||
with self._hook_context():
|
||||
return super().forward(*args, **kwargs)
|
||||
|
||||
def unwrap(self):
|
||||
module = super().unwrap()
|
||||
if isinstance(module, DDP):
|
||||
module = module.module
|
||||
return module
|
||||
def unwrap(self, unwrap_peft: bool = True):
|
||||
model = self.module
|
||||
if isinstance(model, DDP):
|
||||
model = model.module
|
||||
if unwrap_peft and isinstance(model, PeftModel):
|
||||
model = PeftUnwrapMixin(model)
|
||||
return model
|
||||
|
||||
def _force_wait_all_gather(self):
|
||||
for p in self.module.parameters():
|
||||
@ -1188,6 +1192,15 @@ class HybridParallelPlugin(PipelinePluginBase):
|
||||
else:
|
||||
self.sp_group = self.pg_mesh.get_group_along_axis(self.sp_axis)
|
||||
|
||||
# sync gradients across DP * SP ranks
|
||||
# sync gradients across DP * SP ranks
|
||||
# Apply Hybrid ZeRO across DP * SP ranks
|
||||
if self.enable_sequence_parallelism and not is_share_sp_tp(self.sequence_parallelism_mode):
|
||||
self.mixed_dp_group = self.pg_mesh.create_group_along_axis([self.dp_axis, self.sp_axis])
|
||||
self.dp_size = get_world_size(self.mixed_dp_group)
|
||||
else:
|
||||
self.mixed_dp_group = self.dp_group
|
||||
|
||||
self.shard_config = ShardConfig(
|
||||
tensor_parallel_process_group=self.tp_group,
|
||||
sequence_parallel_process_group=self.sp_group,
|
||||
@ -1298,19 +1311,11 @@ class HybridParallelPlugin(PipelinePluginBase):
|
||||
use_ddp = (self.dp_size > 1 and self.pp_size == 1 and self.zero_stage == 0) or (
|
||||
self.dp_size == 1 and self.pp_size == 1
|
||||
)
|
||||
# sync gradients across DP * SP ranks
|
||||
# sync gradients across DP * SP ranks
|
||||
# Apply Hybrid ZeRO across DP * SP ranks
|
||||
if self.enable_sequence_parallelism and not is_share_sp_tp(self.sequence_parallelism_mode):
|
||||
dp_group = self.pg_mesh.create_group_along_axis([self.dp_axis, self.sp_axis])
|
||||
self.dp_size = get_world_size(dp_group)
|
||||
else:
|
||||
dp_group = self.dp_group
|
||||
model = HybridParallelModule(
|
||||
model,
|
||||
precision=self.precision,
|
||||
shard_config=self.shard_config,
|
||||
dp_group=dp_group,
|
||||
dp_group=self.mixed_dp_group,
|
||||
tp_group=self.tp_group,
|
||||
sp_group=self.sp_group,
|
||||
use_ddp=use_ddp,
|
||||
@ -1359,7 +1364,7 @@ class HybridParallelPlugin(PipelinePluginBase):
|
||||
model,
|
||||
use_pipeline=self.enable_pipeline_parallelism,
|
||||
param_info=param_info,
|
||||
dp_process_group=dp_group,
|
||||
dp_process_group=self.mixed_dp_group,
|
||||
tp_process_group=self.tp_group,
|
||||
pp_process_group=self.pp_group,
|
||||
verbose=True,
|
||||
@ -1488,7 +1493,9 @@ class HybridParallelPlugin(PipelinePluginBase):
|
||||
)
|
||||
|
||||
def get_checkpoint_io(self) -> CheckpointIO:
|
||||
return HybridParallelCheckpointIO(self.dp_group, self.pp_group, self.tp_group, self.sp_group, self.zero_stage)
|
||||
return HybridParallelCheckpointIO(
|
||||
self.mixed_dp_group, self.pp_group, self.tp_group, self.sp_group, self.zero_stage
|
||||
)
|
||||
|
||||
def no_sync(self, model: Module, optimizer: OptimizerWrapper) -> Iterator[None]:
|
||||
assert (
|
||||
@ -1506,7 +1513,7 @@ class HybridParallelPlugin(PipelinePluginBase):
|
||||
from peft import PeftModel, get_peft_model
|
||||
|
||||
assert not isinstance(model, HybridParallelModule), "Lora should be enabled before boosting the model."
|
||||
assert self.pp_size == 1 and self.tp_size == 1
|
||||
assert self.tp_size == 1
|
||||
self.lora_enabled = True
|
||||
self.logger.warning("You have enabled LoRa training. Please check the hyperparameters such as lr", ranks=[0])
|
||||
|
||||
|
@ -24,8 +24,8 @@ from colossalai.checkpoint_io.utils import (
|
||||
get_optimizer_base_filenames,
|
||||
get_shard_filename,
|
||||
load_param_groups_into_optimizer,
|
||||
load_shard_state_dict,
|
||||
load_state_dict,
|
||||
load_state_dict_shards,
|
||||
load_states_into_optimizer,
|
||||
save_param_groups,
|
||||
save_state_dict,
|
||||
@ -276,13 +276,7 @@ class LowLevelZeroCheckpointIO(TorchDDPCheckpointIO):
|
||||
|
||||
checkpoint_files, _ = ckpt_index_file.get_checkpoint_filenames()
|
||||
|
||||
for shard_file in checkpoint_files:
|
||||
if shard_file.endswith(".safetensors"):
|
||||
from colossalai.utils.safetensors import load_flat
|
||||
|
||||
state_dict = load_flat(shard_file)
|
||||
else:
|
||||
state_dict = load_shard_state_dict(Path(shard_file), use_safetensors=False)
|
||||
for state_dict in load_state_dict_shards(checkpoint_files, True, False, low_cpu_mem_mode):
|
||||
# shard state dict
|
||||
for param_idx, state in state_dict.items():
|
||||
for k, v in state.items():
|
||||
@ -365,23 +359,10 @@ class LowLevelZeroCheckpointIO(TorchDDPCheckpointIO):
|
||||
model, checkpoint_path, gather_dtensor, prefix, max_shard_size, use_safetensors, use_async=use_async
|
||||
)
|
||||
|
||||
def save_lora_as_pretrained(self, model, checkpoint, use_safetensors):
|
||||
if os.path.isfile(checkpoint):
|
||||
self.logger.error(f"Provided path ({checkpoint}) should be a directory, not a file", ranks=[0])
|
||||
return
|
||||
from peft import PeftModel
|
||||
|
||||
assert isinstance(model, ModelWrapper), "Please boost the model before saving!"
|
||||
def save_lora_as_pretrained(self, model, checkpoint, use_safetensors, state_dict: Optional[dict] = None):
|
||||
assert isinstance(model, LowLevelZeroModel), "Please boost the model before saving!"
|
||||
model._force_wait_all_gather()
|
||||
peft_model = model.unwrap()
|
||||
assert isinstance(
|
||||
peft_model, PeftModel
|
||||
), "The model doesn't have lora adapters, please enable lora before saving."
|
||||
return peft_model.save_pretrained(
|
||||
checkpoint,
|
||||
safe_serialization=use_safetensors,
|
||||
state_dict=tree_map(lambda x: x.data if torch.is_tensor(x) else x, peft_model.state_dict()),
|
||||
)
|
||||
super().save_lora_as_pretrained(model, checkpoint, use_safetensors, state_dict=state_dict)
|
||||
|
||||
|
||||
class LowLevelZeroPlugin(DPPluginBase):
|
||||
|
@ -19,7 +19,6 @@ from colossalai.booster.plugin.hybrid_parallel_plugin import (
|
||||
HybridParallelPlugin,
|
||||
HybridParallelZeroOptimizer,
|
||||
get_param_info,
|
||||
reinitialize_optimizer,
|
||||
)
|
||||
from colossalai.checkpoint_io import MoECheckpointIO
|
||||
from colossalai.cluster.process_group_mesh import ProcessGroupMesh
|
||||
@ -351,6 +350,14 @@ class MoeHybridParallelPlugin(HybridParallelPlugin):
|
||||
self.sp_group = self.pg_mesh.get_group_along_axis(self.tp_axis)
|
||||
else:
|
||||
self.sp_group = self.pg_mesh.get_group_along_axis(self.sp_axis)
|
||||
|
||||
# sync gradients across DP * SP ranks
|
||||
if self.enable_sequence_parallelism and self.sequence_parallelism_mode == "all_to_all":
|
||||
self.mixed_dp_group = self.pg_mesh.create_group_along_axis([self.moe_dp_axis, self.ep_axis, self.sp_axis])
|
||||
self.dp_size = dist.get_world_size(self.mixed_dp_group)
|
||||
else:
|
||||
self.mixed_dp_group = self.dp_group
|
||||
|
||||
self.use_fp8 = use_fp8
|
||||
|
||||
self.shard_config = ShardConfig(
|
||||
@ -404,7 +411,7 @@ class MoeHybridParallelPlugin(HybridParallelPlugin):
|
||||
|
||||
def get_checkpoint_io(self) -> MoECheckpointIO:
|
||||
return MoECheckpointIO(
|
||||
self.dp_group,
|
||||
self.mixed_dp_group,
|
||||
self.pp_group,
|
||||
self.tp_group,
|
||||
self.sp_group,
|
||||
@ -435,12 +442,6 @@ class MoeHybridParallelPlugin(HybridParallelPlugin):
|
||||
and self.sequence_parallelism_mode == "all_to_all"
|
||||
)
|
||||
|
||||
# sync gradients across DP * SP ranks
|
||||
if self.enable_sequence_parallelism and self.sequence_parallelism_mode == "all_to_all":
|
||||
dp_group = self.pg_mesh.create_group_along_axis([self.moe_dp_axis, self.ep_axis, self.sp_axis])
|
||||
else:
|
||||
dp_group = self.dp_group
|
||||
|
||||
if use_ddp:
|
||||
self.logger.warning(
|
||||
f"Will have to check all params are used in pytorch DDP since not all experts are always activated",
|
||||
@ -448,7 +449,7 @@ class MoeHybridParallelPlugin(HybridParallelPlugin):
|
||||
)
|
||||
self.ddp_config["find_unused_parameters"] = True
|
||||
|
||||
if dist.get_process_group_ranks(dp_group) != dist.get_process_group_ranks(self.moe_dp_group):
|
||||
if dist.get_process_group_ranks(self.mixed_dp_group) != dist.get_process_group_ranks(self.moe_dp_group):
|
||||
raise ValueError(
|
||||
f"if pytorch DDP is used, dp_group and moe_dp_group are expected to be the same since DDP can only reduce grad across a single group, but found dp_group {dist.get_process_group_ranks(dp_group)} and moe_dp_group {dist.get_process_group_ranks(self.moe_dp_group)}, you might want to modify your config to bypass DDP \nhint: check the above ddp condition to by pass this"
|
||||
)
|
||||
@ -457,7 +458,7 @@ class MoeHybridParallelPlugin(HybridParallelPlugin):
|
||||
module=model,
|
||||
precision=self.precision,
|
||||
shard_config=self.shard_config,
|
||||
dp_group=dp_group,
|
||||
dp_group=self.mixed_dp_group,
|
||||
tp_group=self.tp_group,
|
||||
sp_group=self.sp_group,
|
||||
use_ddp=use_ddp,
|
||||
@ -466,18 +467,13 @@ class MoeHybridParallelPlugin(HybridParallelPlugin):
|
||||
use_fp8=self.use_fp8,
|
||||
)
|
||||
if optimizer is not None and not isinstance(optimizer, OptimizerWrapper):
|
||||
if self.ep_size > 1:
|
||||
# if ep is enabled, the num of (moe) paramaters changed since they are sharded among ep groups
|
||||
# but the optimizer is not aware of ep, so we need to update the optimizer
|
||||
reinitialize_optimizer(optimizer, model)
|
||||
|
||||
if self.zero_stage == 0:
|
||||
is_zero = False
|
||||
if self.precision in ["fp16", "bf16"]:
|
||||
optimizer = HybridParallelAMPOptimizer(
|
||||
optimizer,
|
||||
model,
|
||||
use_pipeline=self.enable_pipeline_parallelism,
|
||||
use_pipeline=self.enable_pipeline_parallelism or self.ep_size > 1,
|
||||
param_info=param_info,
|
||||
precision=self.precision,
|
||||
max_norm=self.max_norm,
|
||||
@ -487,7 +483,7 @@ class MoeHybridParallelPlugin(HybridParallelPlugin):
|
||||
optimizer = HybridParallelNaiveOptimizer(
|
||||
optimizer,
|
||||
model,
|
||||
use_pipeline=self.enable_pipeline_parallelism,
|
||||
use_pipeline=self.enable_pipeline_parallelism or self.ep_size > 1,
|
||||
param_info=param_info,
|
||||
max_norm=self.max_norm,
|
||||
pp_process_group=self.pp_group,
|
||||
@ -505,9 +501,9 @@ class MoeHybridParallelPlugin(HybridParallelPlugin):
|
||||
optimizer = MoeHybridParallelZeroOptimizer(
|
||||
optimizer,
|
||||
model,
|
||||
use_pipeline=self.enable_pipeline_parallelism,
|
||||
use_pipeline=self.enable_pipeline_parallelism or self.ep_size > 1,
|
||||
param_info=param_info,
|
||||
dp_process_group=dp_group,
|
||||
dp_process_group=self.mixed_dp_group,
|
||||
tp_process_group=self.tp_group,
|
||||
pp_process_group=self.pp_group,
|
||||
moe_dp_group=self.moe_dp_group,
|
||||
|
@ -2,6 +2,7 @@ from typing import Callable, Dict, Iterator, List, Optional, Tuple, Union
|
||||
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from peft import PeftModel
|
||||
from torch.nn.parallel import DistributedDataParallel as DDP
|
||||
from torch.optim import Optimizer
|
||||
from torch.optim.lr_scheduler import _LRScheduler as LRScheduler
|
||||
@ -11,6 +12,7 @@ from torch.utils.data import DataLoader
|
||||
from colossalai.checkpoint_io import CheckpointIO, GeneralCheckpointIO
|
||||
from colossalai.cluster import DistCoordinator
|
||||
from colossalai.interface import ModelWrapper, OptimizerWrapper
|
||||
from colossalai.interface.model import PeftUnwrapMixin
|
||||
from colossalai.logging import get_dist_logger
|
||||
from colossalai.quantization import BnbQuantizationConfig, quantize_model
|
||||
from colossalai.utils import get_current_device
|
||||
@ -166,7 +168,11 @@ class TorchDDPCheckpointIO(GeneralCheckpointIO):
|
||||
)
|
||||
|
||||
def save_lora_as_pretrained(
|
||||
self, model: Union[nn.Module, ModelWrapper], checkpoint: str, use_safetensors: bool = False
|
||||
self,
|
||||
model: Union[nn.Module, ModelWrapper],
|
||||
checkpoint: str,
|
||||
use_safetensors: bool = False,
|
||||
state_dict: Optional[dict] = None,
|
||||
) -> None:
|
||||
"""
|
||||
Save the lora adapters and adapter configuration file to the checkpoint directory.
|
||||
@ -174,15 +180,17 @@ class TorchDDPCheckpointIO(GeneralCheckpointIO):
|
||||
from peft import PeftModel
|
||||
|
||||
assert isinstance(model, ModelWrapper), "Please boost the model before saving!"
|
||||
peft_model = model.unwrap(unwrap_peft=False)
|
||||
assert isinstance(
|
||||
peft_model, PeftModel
|
||||
), "The model doesn't have lora adapters, please enable lora before saving."
|
||||
if state_dict is None:
|
||||
state_dict = tree_map(lambda x: x.data.cpu() if torch.is_tensor(x) else x, peft_model.state_dict())
|
||||
if self.coordinator.is_master():
|
||||
peft_model = model.unwrap()
|
||||
assert isinstance(
|
||||
peft_model, PeftModel
|
||||
), "The model doesn't have lora adapters, please enable lora before saving."
|
||||
return peft_model.save_pretrained(
|
||||
checkpoint,
|
||||
safe_serialization=use_safetensors,
|
||||
state_dict=tree_map(lambda x: x.data if torch.is_tensor(x) else x, peft_model.state_dict()),
|
||||
state_dict=state_dict,
|
||||
)
|
||||
|
||||
|
||||
@ -191,8 +199,11 @@ class TorchDDPModel(ModelWrapper):
|
||||
super().__init__(module)
|
||||
self.module = DDP(module, *args, **kwargs)
|
||||
|
||||
def unwrap(self):
|
||||
return self.module.module
|
||||
def unwrap(self, unwrap_peft: bool = True) -> nn.Module:
|
||||
model = self.module.module
|
||||
if unwrap_peft and isinstance(model, PeftModel):
|
||||
model = PeftUnwrapMixin(model)
|
||||
return model
|
||||
|
||||
|
||||
class TorchDDPPlugin(DPPluginBase):
|
||||
|
@ -103,11 +103,11 @@ class TorchFSDPCheckpointIO(GeneralCheckpointIO):
|
||||
if use_async:
|
||||
from colossalai.utils.safetensors import save
|
||||
|
||||
if id(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[id(model)] = create_pinned_state_dict(full_model_state)
|
||||
if hash(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[hash(model)] = create_pinned_state_dict(full_model_state)
|
||||
for k, v in full_model_state.items():
|
||||
self.pinned_state_dicts[id(model)][k].copy_(v)
|
||||
full_model_state[k] = self.pinned_state_dicts[id(model)][k]
|
||||
self.pinned_state_dicts[hash(model)][k].copy_(v)
|
||||
full_model_state[k] = self.pinned_state_dicts[hash(model)][k]
|
||||
writer = save(checkpoint, full_model_state)
|
||||
self.async_writers.append(writer)
|
||||
else:
|
||||
@ -186,9 +186,9 @@ class TorchFSDPCheckpointIO(GeneralCheckpointIO):
|
||||
state_dict = model.unwrap().state_dict()
|
||||
|
||||
if use_async and self.coordinator.is_master():
|
||||
if id(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[id(model)] = {}
|
||||
pinned_state_dicts = self.pinned_state_dicts[id(model)]
|
||||
if hash(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[hash(model)] = {}
|
||||
pinned_state_dicts = self.pinned_state_dicts[hash(model)]
|
||||
else:
|
||||
pinned_state_dicts = None
|
||||
state_dict_shard = utils.shard_model_checkpoint(
|
||||
@ -255,8 +255,8 @@ class TorchFSDPCheckpointIO(GeneralCheckpointIO):
|
||||
checkpoint_files, _ = ckpt_index_file.get_checkpoint_filenames()
|
||||
|
||||
fsdp_state_dict = {}
|
||||
for shard_file in checkpoint_files:
|
||||
fsdp_state_dict.update(utils.load_shard_state_dict(Path(shard_file), use_safetensors))
|
||||
for state_dict in utils.load_state_dict_shards(checkpoint_files, False, use_safetensors):
|
||||
fsdp_state_dict.update(state_dict)
|
||||
|
||||
with FSDP.state_dict_type(model.unwrap(), StateDictType.FULL_STATE_DICT):
|
||||
model.unwrap().load_state_dict(fsdp_state_dict, strict=False)
|
||||
@ -388,11 +388,7 @@ class TorchFSDPCheckpointIO(GeneralCheckpointIO):
|
||||
# Load param
|
||||
fsdp_optim_state = {}
|
||||
checkpoint_files, _ = ckpt_index_file.get_checkpoint_filenames()
|
||||
for shard_file in checkpoint_files:
|
||||
if shard_file.endswith(".safetensors"):
|
||||
state_dict_shard = load_flat(shard_file, seperator=".")
|
||||
else:
|
||||
state_dict_shard = utils.load_shard_state_dict(Path(shard_file), use_safetensors=False)
|
||||
for state_dict_shard in utils.load_state_dict_shards(checkpoint_files, True, False):
|
||||
fsdp_optim_state.update(state_dict_shard)
|
||||
|
||||
fsdp_optim_dict = dict(state=fsdp_optim_state, param_groups=saved_param_groups)
|
||||
@ -441,9 +437,6 @@ class TorchFSDPModel(ModelWrapper):
|
||||
super().__init__(module)
|
||||
self.module = FSDP(module, *args, **kwargs)
|
||||
|
||||
def unwrap(self):
|
||||
return self.module
|
||||
|
||||
|
||||
class FSDPOptimizerWrapper(OptimizerWrapper):
|
||||
def __init__(self, optimizer: Optimizer, model: nn.Module):
|
||||
|
@ -437,7 +437,11 @@ class CheckpointIO(ABC):
|
||||
|
||||
@abstractmethod
|
||||
def save_lora_as_pretrained(
|
||||
self, model: Union[nn.Module, ModelWrapper], checkpoint: str, use_safetensors: bool = False
|
||||
self,
|
||||
model: Union[nn.Module, ModelWrapper],
|
||||
checkpoint: str,
|
||||
use_safetensors: bool = False,
|
||||
state_dict: Optional[dict] = None,
|
||||
) -> None:
|
||||
"""
|
||||
Save the lora adapters and adapter configuration file to a pretrained checkpoint directory.
|
||||
@ -446,4 +450,5 @@ class CheckpointIO(ABC):
|
||||
model (Union[nn.Module, ModelWrapper]): A model boosted by Booster.
|
||||
checkpoint (str): Path to the checkpoint directory. It must be a local path.
|
||||
use_safetensors (bool, optional): Whether to use safe tensors when saving. Defaults to False.
|
||||
state_dict (Optional[dict], optional): The state dict to save. Defaults to None.
|
||||
"""
@ -18,9 +18,9 @@ from .utils import (
|
||||
get_optimizer_base_filenames,
|
||||
is_safetensors_available,
|
||||
load_param_groups_into_optimizer,
|
||||
load_shard_state_dict,
|
||||
load_state_dict,
|
||||
load_state_dict_into_model,
|
||||
load_state_dict_shards,
|
||||
load_states_into_optimizer,
|
||||
save_config_file,
|
||||
save_param_groups,
|
||||
@ -60,9 +60,9 @@ class GeneralCheckpointIO(CheckpointIO):
|
||||
if use_async:
|
||||
from colossalai.utils.safetensors import move_and_save
|
||||
|
||||
if id(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[id(model)] = create_pinned_state_dict(state_dict)
|
||||
writer = move_and_save(checkpoint, state_dict, self.pinned_state_dicts[id(model)])
|
||||
if hash(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[hash(model)] = create_pinned_state_dict(state_dict)
|
||||
writer = move_and_save(checkpoint, state_dict, self.pinned_state_dicts[hash(model)])
|
||||
self.async_writers.append(writer)
|
||||
else:
|
||||
# save the checkpoint
|
||||
@ -94,11 +94,7 @@ class GeneralCheckpointIO(CheckpointIO):
|
||||
|
||||
checkpoint_files, _ = ckpt_index_file.get_checkpoint_filenames()
|
||||
|
||||
for shard_file in checkpoint_files:
|
||||
if shard_file.endswith(".safetensors"):
|
||||
state_dict = load_flat(shard_file)
|
||||
else:
|
||||
state_dict = load_shard_state_dict(Path(shard_file), use_safetensors=False)
|
||||
for state_dict in load_state_dict_shards(checkpoint_files, True, False, low_cpu_mem_mode):
|
||||
if not low_cpu_mem_mode:
|
||||
state_dict = create_pinned_state_dict(state_dict, empty=False, num_threads=num_threads)
|
||||
load_states_into_optimizer(optimizer, state_dict, id_map)
|
||||
@ -238,7 +234,7 @@ class GeneralCheckpointIO(CheckpointIO):
|
||||
index_file = CheckpointIndexFile(checkpoint_path)
|
||||
|
||||
if use_async:
|
||||
pinned_state_dict = self.pinned_state_dicts.get(id(model), None)
|
||||
pinned_state_dict = self.pinned_state_dicts.get(hash(model), None)
|
||||
total_size, new_pinned_state_dict, writers = async_move_save_state_dict_shards(
|
||||
sharded_state_dict=state_dict_shard,
|
||||
checkpoint=checkpoint_path,
|
||||
@ -247,7 +243,7 @@ class GeneralCheckpointIO(CheckpointIO):
|
||||
is_master=True,
|
||||
pinned_state_dict=pinned_state_dict,
|
||||
)
|
||||
self.pinned_state_dicts[id(model)] = new_pinned_state_dict
|
||||
self.pinned_state_dicts[hash(model)] = new_pinned_state_dict
|
||||
self.async_writers.extend(writers)
|
||||
else:
|
||||
# Save shards of optimizer states.
|
||||
@ -295,8 +291,7 @@ class GeneralCheckpointIO(CheckpointIO):
|
||||
checkpoint_files, _ = ckpt_index_file.get_checkpoint_filenames()
|
||||
missing_keys = []
|
||||
|
||||
for shard_file in checkpoint_files:
|
||||
state_dict = load_shard_state_dict(Path(shard_file), use_safetensors)
|
||||
for state_dict in load_state_dict_shards(checkpoint_files, False, use_safetensors, low_cpu_mem_mode):
|
||||
if not low_cpu_mem_mode:
|
||||
state_dict = create_pinned_state_dict(state_dict, empty=False, num_threads=num_threads)
|
||||
load_state_dict_into_model(model, state_dict, missing_keys, strict, load_sub_module)
|
||||
@ -313,5 +308,7 @@ class GeneralCheckpointIO(CheckpointIO):
|
||||
)
|
||||
)
|
||||
|
||||
def save_lora_as_pretrained(self, model: nn.Module, checkpoint: str, use_safetensors: bool = False) -> None:
|
||||
def save_lora_as_pretrained(
|
||||
self, model: nn.Module, checkpoint: str, use_safetensors: bool = False, state_dict: Optional[dict] = None
|
||||
) -> None:
|
||||
raise NotImplementedError
|
||||
|
@ -1,6 +1,7 @@
|
||||
import copy
|
||||
import logging
|
||||
import os
|
||||
from collections import defaultdict
|
||||
from functools import reduce
|
||||
from pathlib import Path
|
||||
from shutil import rmtree
|
||||
@ -10,6 +11,7 @@ import torch
|
||||
import torch.distributed as dist
|
||||
import torch.nn as nn
|
||||
from torch.distributed import ProcessGroup
|
||||
from torch.optim import Optimizer
|
||||
from torch.optim.lr_scheduler import _LRScheduler as LRScheduler
|
||||
from torch.utils._pytree import tree_map
|
||||
|
||||
@ -31,13 +33,14 @@ from .utils import (
|
||||
async_save_state_dict_shards,
|
||||
create_pinned_state_dict,
|
||||
gather_distributed_param,
|
||||
gather_state_dict_fast,
|
||||
get_lora_state_dict,
|
||||
get_model_base_filenames,
|
||||
get_optimizer_base_filenames,
|
||||
is_safetensors_available,
|
||||
load_shard_state_dict,
|
||||
load_state_dict,
|
||||
load_state_dict_into_model,
|
||||
load_states_into_optimizer,
|
||||
save_config_file,
|
||||
save_param_groups,
|
||||
save_state_dict,
|
||||
@ -107,9 +110,9 @@ class HybridParallelCheckpointIO(GeneralCheckpointIO):
|
||||
if param is None:
|
||||
continue
|
||||
# Gather tensor pieces when using tensor parallel.
|
||||
if is_padded_tensor(param):
|
||||
param = to_unpadded_tensor(param)
|
||||
param_ = gather_distributed_param(param, keep_vars=False)
|
||||
if is_padded_tensor(param_):
|
||||
param_ = to_unpadded_tensor(param_)
|
||||
if pinned_state_dicts is not None:
|
||||
if (prefix + name) not in pinned_state_dicts:
|
||||
pinned_state_dicts[prefix + name] = torch.empty_like(param_, pin_memory=True, device="cpu")
|
||||
@ -246,9 +249,9 @@ class HybridParallelCheckpointIO(GeneralCheckpointIO):
|
||||
# Only devices with tp_rank == 0 are responsible for model saving.
|
||||
control_saving = self.tp_rank == 0 and self.sp_rank == 0
|
||||
if control_saving and use_async:
|
||||
if id(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[id(model)] = {}
|
||||
pinned_state_dicts = self.pinned_state_dicts[id(model)]
|
||||
if hash(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[hash(model)] = {}
|
||||
pinned_state_dicts = self.pinned_state_dicts[hash(model)]
|
||||
else:
|
||||
pinned_state_dicts = None
|
||||
state_dict_shard = HybridParallelCheckpointIO._model_sharder(
|
||||
@ -724,26 +727,37 @@ class HybridParallelCheckpointIO(GeneralCheckpointIO):
|
||||
state_dict = load_shard_state_dict(Path(file_path), use_safetensors=False)
|
||||
if not low_cpu_mem_mode:
|
||||
state_dict = create_pinned_state_dict(state_dict, empty=False, num_threads=num_threads)
|
||||
load_states_into_optimizer(optimizer.optim, state_dict, id_map, strict=True)
|
||||
self.load_states_into_optimizer(optimizer, state_dict, id_map)
|
||||
loaded_file.add(filename)
|
||||
|
||||
# Then shard the loaded optimizer states if using tp/zero.
|
||||
for param, state in optimizer.optim.state.items():
|
||||
device = param.device
|
||||
if master_to_working_map is not None:
|
||||
working_param = master_to_working_map[id(param)]
|
||||
else:
|
||||
working_param = param
|
||||
original_shape = optimizer.param_info["param2shape"][id(working_param)]
|
||||
sharded_state = self.shard_from_complete_optimizer_state(
|
||||
state, current_shape=working_param.shape, original_shape=original_shape, device=device, inplace=True
|
||||
)
|
||||
optimizer.optim.state[param] = sharded_state
|
||||
|
||||
sharded_optimizer_loading_epilogue(optimizer.optim)
|
||||
if self.verbose and self.coordinator.is_master():
|
||||
logging.info(f"The optimizer has been successfully loaded from sharded checkpoint: {ckpt_root_path}.")
|
||||
|
||||
def load_states_into_optimizer(self, optimizer: Optimizer, state_dict: dict, id_map: dict):
|
||||
state_dict = {int(k): v for k, v in state_dict.items()}
|
||||
new_states = defaultdict(dict)
|
||||
master_to_working_map = optimizer.get_master_to_working_map()
|
||||
for k, state in state_dict.items():
|
||||
if k in id_map:
|
||||
param = id_map[k]
|
||||
device = param.device
|
||||
dtype = param.dtype
|
||||
if master_to_working_map is not None:
|
||||
working_param = master_to_working_map[id(param)]
|
||||
else:
|
||||
working_param = param
|
||||
original_shape = optimizer.param_info["param2shape"][id(working_param)]
|
||||
new_states[param] = self.shard_from_complete_optimizer_state(
|
||||
state,
|
||||
current_shape=working_param.shape,
|
||||
original_shape=original_shape,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
inplace=True,
|
||||
)
|
||||
optimizer.optim.state.update(new_states)
|
||||
|
||||
def save_unsharded_model(
|
||||
self, model: ModelWrapper, checkpoint: str, gather_dtensor: bool, use_safetensors: bool, use_async: bool = False
|
||||
):
|
||||
@ -775,11 +789,11 @@ class HybridParallelCheckpointIO(GeneralCheckpointIO):
|
||||
if use_async:
|
||||
from colossalai.utils.safetensors import save
|
||||
|
||||
if id(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[id(model)] = create_pinned_state_dict(state_dict)
|
||||
if hash(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[hash(model)] = create_pinned_state_dict(state_dict)
|
||||
for name, param in state_dict.items():
|
||||
self.pinned_state_dicts[id(model)][name].copy_(param)
|
||||
state_dict[name] = self.pinned_state_dicts[id(model)][name]
|
||||
self.pinned_state_dicts[hash(model)][name].copy_(param)
|
||||
state_dict[name] = self.pinned_state_dicts[hash(model)][name]
|
||||
writer = save(path=checkpoint, state_dict=state_dict)
|
||||
self.async_writers.append(writer)
|
||||
else:
|
||||
@ -797,11 +811,11 @@ class HybridParallelCheckpointIO(GeneralCheckpointIO):
|
||||
if use_async:
|
||||
from colossalai.utils.safetensors import save
|
||||
|
||||
if id(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[id(model)] = create_pinned_state_dict(complete_state_dict)
|
||||
if hash(model) not in self.pinned_state_dicts:
|
||||
self.pinned_state_dicts[hash(model)] = create_pinned_state_dict(complete_state_dict)
|
||||
for name, param in complete_state_dict.items():
|
||||
self.pinned_state_dicts[id(model)][name].copy_(param)
|
||||
complete_state_dict[name] = self.pinned_state_dicts[id(model)][name]
|
||||
self.pinned_state_dicts[hash(model)][name].copy_(param)
|
||||
complete_state_dict[name] = self.pinned_state_dicts[hash(model)][name]
|
||||
writer = save(path=checkpoint, state_dict=complete_state_dict)
|
||||
self.async_writers.append(writer)
|
||||
else:
|
||||
@ -988,22 +1002,7 @@ class HybridParallelCheckpointIO(GeneralCheckpointIO):
|
||||
for param in pg["params"]:
|
||||
param_id = _get_param_id_from_optimizer_param(param, master_to_working_map)
|
||||
id_map[param_id] = param
|
||||
load_states_into_optimizer(optimizer.optim, state_dict["state"], id_map, strict=True)
|
||||
|
||||
# Then shard the loaded optimizer states if using tp/zero.
|
||||
for param, state in optimizer.optim.state.items():
|
||||
if param is None:
|
||||
continue
|
||||
device = param.device
|
||||
if master_to_working_map is not None:
|
||||
working_param = master_to_working_map[id(param)]
|
||||
else:
|
||||
working_param = param
|
||||
original_shape = optimizer.param_info["param2shape"][id(working_param)]
|
||||
sharded_state = self.shard_from_complete_optimizer_state(
|
||||
state, current_shape=working_param.shape, original_shape=original_shape, device=device, inplace=True
|
||||
)
|
||||
optimizer.optim.state[param] = sharded_state
|
||||
self.load_states_into_optimizer(optimizer, state_dict["state"], id_map)
|
||||
|
||||
sharded_optimizer_loading_epilogue(optimizer.optim)
|
||||
|
||||
@ -1086,6 +1085,7 @@ class HybridParallelCheckpointIO(GeneralCheckpointIO):
|
||||
current_shape: torch.Size,
|
||||
original_shape: torch.Size,
|
||||
device: torch.device,
|
||||
dtype: torch.dtype,
|
||||
inplace: bool,
|
||||
) -> OrderedDict:
|
||||
"""
|
||||
@ -1135,11 +1135,11 @@ class HybridParallelCheckpointIO(GeneralCheckpointIO):
|
||||
slice_size = v.numel() // self.global_dp_size
|
||||
v = v.split(slice_size, dim=0)[self.dp_rank]
|
||||
|
||||
state_[k] = v.detach().clone().to(device)
|
||||
state_[k] = v.detach().clone().to(device=device, dtype=dtype)
|
||||
|
||||
return state_
|
||||
|
||||
def save_lora_as_pretrained(self, model, checkpoint, use_safetensors):
|
||||
def save_lora_as_pretrained(self, model, checkpoint, use_safetensors, state_dict: Optional[dict] = None):
|
||||
if os.path.isfile(checkpoint):
|
||||
logging.error(f"Provided path ({checkpoint}) should be a directory, not a file")
|
||||
return
|
||||
@ -1147,12 +1147,21 @@ class HybridParallelCheckpointIO(GeneralCheckpointIO):
|
||||
|
||||
assert isinstance(model, ModelWrapper), "Please boost the model before saving!"
|
||||
model._force_wait_all_gather()
|
||||
peft_model = model.unwrap()
|
||||
peft_model = model.unwrap(unwrap_peft=False)
|
||||
assert isinstance(
|
||||
peft_model, PeftModel
|
||||
), "The model doesn't have lora adapters, please enable lora before saving."
|
||||
return peft_model.save_pretrained(
|
||||
checkpoint,
|
||||
safe_serialization=use_safetensors,
|
||||
state_dict=tree_map(lambda x: x.data if torch.is_tensor(x) else x, peft_model.state_dict()),
|
||||
)
|
||||
if state_dict is None:
|
||||
state_dict = tree_map(lambda x: x.data if torch.is_tensor(x) else x, peft_model.state_dict())
|
||||
if self.pp_size > 1:
|
||||
lora_state_dict = get_lora_state_dict(peft_model, state_dict)
|
||||
gathered_lora_state_dict = gather_state_dict_fast(lora_state_dict, self.pp_group, device="cpu")
|
||||
if self.pp_rank == 0:
|
||||
state_dict.update(gathered_lora_state_dict)
|
||||
state_dict = tree_map(lambda x: x.cpu() if torch.is_tensor(x) else x, state_dict)
|
||||
if self.coordinator.is_master():
|
||||
return peft_model.save_pretrained(
|
||||
checkpoint,
|
||||
safe_serialization=use_safetensors,
|
||||
state_dict=state_dict,
|
||||
)
|
||||
|
@ -10,6 +10,7 @@ import torch.distributed as dist
|
||||
import torch.nn as nn
|
||||
from torch.distributed import ProcessGroup
|
||||
from torch.distributed.distributed_c10d import get_global_rank
|
||||
from torch.utils._pytree import tree_map
|
||||
|
||||
from colossalai.checkpoint_io import CheckpointIndexFile
|
||||
from colossalai.checkpoint_io.hybrid_parallel_checkpoint_io import HybridParallelCheckpointIO
|
||||
@ -17,6 +18,8 @@ from colossalai.checkpoint_io.index_file import CheckpointIndexFile
|
||||
from colossalai.checkpoint_io.utils import (
|
||||
StateDictSharder,
|
||||
gather_distributed_param,
|
||||
gather_state_dict_fast,
|
||||
get_lora_state_dict,
|
||||
get_model_base_filenames,
|
||||
get_optimizer_base_filenames,
|
||||
load_shard_state_dict,
|
||||
@ -698,15 +701,18 @@ class MoECheckpointIO(HybridParallelCheckpointIO):
|
||||
all_param = None
|
||||
# gather param from every ep rank
|
||||
# dist.all_gather(all_param, param, group=ep_group)
|
||||
dist.gather(param, all_param, group=ep_group)
|
||||
dist.gather(param, all_param, dst=dist.get_global_rank(ep_group, 0), group=ep_group)
|
||||
if ep_rank == 0:
|
||||
all_param = torch.cat(all_param, dim=0)
|
||||
state_dict[name] = all_param.cpu()
|
||||
|
||||
if self.pp_size > 1:
|
||||
if self.dp_rank == 0:
|
||||
out = [None for _ in range(self.pp_size)]
|
||||
dist.gather_object(state_dict, out, group=self.pp_group)
|
||||
if self.pp_rank == 0:
|
||||
out = [None for _ in range(self.pp_size)]
|
||||
else:
|
||||
out = None
|
||||
dist.gather_object(state_dict, out, dst=dist.get_global_rank(self.pp_group, 0), group=self.pp_group)
|
||||
if self.pp_rank == 0:
|
||||
new_state_dict = {}
|
||||
for o in out:
|
||||
@ -889,3 +895,26 @@ class MoECheckpointIO(HybridParallelCheckpointIO):
|
||||
optimizer.optim.state[param] = sharded_state
|
||||
sharded_optimizer_loading_epilogue(optimizer.optim)
|
||||
dist.barrier()
|
||||
|
||||
def save_lora_as_pretrained(self, model, checkpoint, use_safetensors, state_dict=None):
|
||||
if os.path.isfile(checkpoint):
|
||||
logging.error(f"Provided path ({checkpoint}) should be a directory, not a file")
|
||||
return
|
||||
from peft import PeftModel
|
||||
|
||||
assert isinstance(model, ModelWrapper), "Please boost the model before saving!"
|
||||
model._force_wait_all_gather()
|
||||
peft_model = model.unwrap(unwrap_peft=False)
|
||||
assert isinstance(
|
||||
peft_model, PeftModel
|
||||
), "The model doesn't have lora adapters, please enable lora before saving."
|
||||
if state_dict is None:
|
||||
state_dict = tree_map(lambda x: x.data if torch.is_tensor(x) else x, peft_model.state_dict())
|
||||
if self.ep_size > 1:
|
||||
lora_state_dict = get_lora_state_dict(peft_model, state_dict)
|
||||
moe_params = set(n for n, p in peft_model.named_parameters() if is_moe_tensor(p))
|
||||
expert_state_dict = {n: p for n, p in lora_state_dict.items() if n in moe_params}
|
||||
gathered_expert_state_dict = gather_state_dict_fast(expert_state_dict, self.ep_group)
|
||||
if self.ep_rank == 0:
|
||||
state_dict.update(gathered_expert_state_dict)
|
||||
return super().save_lora_as_pretrained(model, checkpoint, use_safetensors, state_dict)
|
||||
|
@ -2,26 +2,33 @@
|
||||
import concurrent.futures
|
||||
import os
|
||||
import re
|
||||
import warnings
|
||||
from collections import abc as container_abcs
|
||||
from collections import defaultdict
|
||||
from itertools import chain
|
||||
from pathlib import Path
|
||||
from typing import Dict, Iterator, List, Mapping, Optional, OrderedDict, Tuple, Union
|
||||
from typing import Dict, Generator, Iterator, List, Mapping, Optional, OrderedDict, Tuple, Union
|
||||
|
||||
import torch
|
||||
import torch.distributed as dist
|
||||
import torch.nn as nn
|
||||
from packaging.version import Version
|
||||
from peft import PeftModel, PeftType
|
||||
from peft.utils.other import EMBEDDING_LAYER_NAMES, check_file_exists_on_hf_hub
|
||||
from peft.utils.save_and_load import get_embedding_layer_name, has_valid_embedding_base_layer
|
||||
from torch.optim import Optimizer
|
||||
from torch.utils._pytree import tree_flatten, tree_map, tree_unflatten
|
||||
|
||||
from colossalai.accelerator import get_accelerator
|
||||
from colossalai.interface.model import PeftUnwrapMixin
|
||||
from colossalai.tensor.d_tensor import (
|
||||
is_customized_distributed_tensor,
|
||||
is_distributed_tensor,
|
||||
to_global,
|
||||
to_global_for_customized_distributed_tensor,
|
||||
)
|
||||
from colossalai.utils.safetensors import _flatten_optim_state_dict
|
||||
from colossalai.utils import get_current_device
|
||||
from colossalai.utils.safetensors import _flatten_optim_state_dict, load_flat
|
||||
|
||||
SAFE_WEIGHTS_NAME = "model.safetensors"
|
||||
WEIGHTS_NAME = "pytorch_model.bin"
|
||||
@ -309,12 +316,13 @@ def async_save_state_dict_shards(
|
||||
checkpoint_file_path = os.path.join(checkpoint, shard_file)
|
||||
|
||||
if state_preprocess:
|
||||
state_dict, _ = _flatten_optim_state_dict(state_dict=shard, seperator=".")
|
||||
state_dict, metadata = _flatten_optim_state_dict(state_dict=shard, seperator=".")
|
||||
else:
|
||||
state_dict = shard
|
||||
metadata = None
|
||||
|
||||
# Only save on master rank.
|
||||
writer = save(checkpoint_file_path, state_dict=state_dict)
|
||||
writer = save(checkpoint_file_path, state_dict=state_dict, metadata=metadata)
|
||||
writers.append(writer)
|
||||
shard_filenames.append(shard_file)
|
||||
del shard
|
||||
@ -371,9 +379,10 @@ def async_move_save_state_dict_shards(
|
||||
checkpoint_file_path = os.path.join(checkpoint, shard_file)
|
||||
|
||||
if state_preprocess:
|
||||
state_dict, _ = _flatten_optim_state_dict(state_dict=shard)
|
||||
state_dict, metadata = _flatten_optim_state_dict(state_dict=shard)
|
||||
else:
|
||||
state_dict = shard
|
||||
metadata = None
|
||||
|
||||
if pinned_state_dict is not None:
|
||||
sub_pinned_state_dict = {k: pinned_state_dict[k] for k in state_dict.keys()}
|
||||
@ -382,7 +391,7 @@ def async_move_save_state_dict_shards(
|
||||
returned_state_dict.update(sub_pinned_state_dict)
|
||||
|
||||
# Only save on master rank.
|
||||
writer = move_and_save(checkpoint_file_path, state_dict, sub_pinned_state_dict)
|
||||
writer = move_and_save(checkpoint_file_path, state_dict, sub_pinned_state_dict, metadata)
|
||||
writers.append(writer)
|
||||
shard_filenames.append(shard_file)
|
||||
del shard
|
||||
@ -546,6 +555,8 @@ def save_config_file(model: nn.Module, checkpoint_path: str, is_master: bool = T
|
||||
from transformers.modeling_utils import unwrap_model as unwrap_huggingface_model
|
||||
except ImportError:
|
||||
return
|
||||
if isinstance(model, PeftUnwrapMixin):
|
||||
model = model.base_model
|
||||
if not isinstance(model, PreTrainedModel):
|
||||
return
|
||||
|
||||
@ -684,6 +695,9 @@ def load_state_dict_into_model(
|
||||
state_dict (dict): a dict containing parameters and
|
||||
persistent buffers.
|
||||
"""
|
||||
if isinstance(model, PeftUnwrapMixin):
|
||||
state_dict = model.patch_state_dict(state_dict)
|
||||
model = model.base_model
|
||||
if not isinstance(state_dict, Mapping):
|
||||
raise TypeError("Expected state_dict to be dict-like, got {}.".format(type(state_dict)))
|
||||
|
||||
@ -972,3 +986,170 @@ def create_pinned_state_dict(
|
||||
idx = future_to_idx[future]
|
||||
elems[idx] = future.result()
|
||||
return tree_unflatten(elems, spec)
|
||||
|
||||
|
||||
def load_optim_or_model_shard(path: str, is_optim: bool, use_safetensors: bool) -> dict:
|
||||
if is_optim:
|
||||
if path.endswith(".safetensors"):
|
||||
state_dict = load_flat(path)
|
||||
else:
|
||||
state_dict = load_shard_state_dict(Path(path), use_safetensors=False)
|
||||
else:
|
||||
state_dict = load_shard_state_dict(Path(path), use_safetensors)
|
||||
return state_dict
|
||||
|
||||
|
||||
def load_state_dict_shards(
|
||||
checkpoint_files: List[str],
|
||||
is_optim: bool,
|
||||
use_safetensors: bool,
|
||||
low_cpu_mem_mode: bool = True,
|
||||
prefetch: int = 3,
|
||||
) -> Generator[dict, None, None]:
|
||||
if low_cpu_mem_mode:
|
||||
for shard_file in checkpoint_files:
|
||||
state_dict = load_optim_or_model_shard(shard_file, is_optim, use_safetensors)
|
||||
yield state_dict
|
||||
else:
|
||||
with concurrent.futures.ThreadPoolExecutor(max_workers=prefetch) as executor:
|
||||
futures = []
|
||||
for shard_file in checkpoint_files:
|
||||
future = executor.submit(load_optim_or_model_shard, shard_file, is_optim, use_safetensors)
|
||||
futures.append(future)
|
||||
for future in concurrent.futures.as_completed(futures):
|
||||
yield future.result()
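
A short sketch of how this generator is meant to be consumed; the `checkpoint_files` list is assumed to come from `CheckpointIndexFile.get_checkpoint_filenames()` as in the call sites above:

```python
# Illustrative consumer (not part of the diff) of the shard-loading generator above.
from colossalai.checkpoint_io.utils import load_state_dict_shards

def count_tensors_in_shards(checkpoint_files) -> int:
    total = 0
    # is_optim=False, use_safetensors=True -> plain model shards;
    # low_cpu_mem_mode=False prefetches up to `prefetch` shards with a thread pool.
    for state_dict in load_state_dict_shards(checkpoint_files, False, True, low_cpu_mem_mode=False):
        total += len(state_dict)
    return total
```
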
# adapted from `peft/utils/save_and_load.py`
|
||||
def get_lora_state_dict(
|
||||
model: PeftModel, state_dict: dict, adapter_name="default", save_embedding_layers="auto"
|
||||
) -> dict:
|
||||
config = model.peft_config[adapter_name]
|
||||
if config.peft_type != PeftType.LORA:
|
||||
raise ValueError(f"Adapter {adapter_name} is not a LORA adapter.")
|
||||
# to_return = lora_state_dict(model, bias=model.peft_config.bias)
|
||||
# adapted from `https://github.com/microsoft/LoRA/blob/main/loralib/utils.py`
|
||||
# to be used directly with the state dict which is necessary when using DeepSpeed or FSDP
|
||||
bias = config.bias
|
||||
if bias == "none":
|
||||
to_return = {k: state_dict[k] for k in state_dict if "lora_" in k}
|
||||
elif bias == "all":
|
||||
to_return = {k: state_dict[k] for k in state_dict if "lora_" in k or "bias" in k}
|
||||
elif bias == "lora_only":
|
||||
to_return = {}
|
||||
for k in state_dict:
|
||||
if "lora_" in k:
|
||||
to_return[k] = state_dict[k]
|
||||
bias_name = k.split("lora_")[0] + "bias"
|
||||
if bias_name in state_dict:
|
||||
to_return[bias_name] = state_dict[bias_name]
|
||||
else:
|
||||
raise NotImplementedError
|
||||
to_return = {k: v for k, v in to_return.items() if (("lora_" in k and adapter_name in k) or ("bias" in k))}
|
||||
if config.use_dora:
|
||||
# Here we take care of a refactor of DoRA which changed lora_magnitude_vector from a ParameterDict to a
|
||||
# ModuleDict with a DoraLayer instance. The old parameter is now the "weight" attribute of that layer. Since
|
||||
# we want the state_dict format not to change, we remove the "weight" part.
|
||||
new_dora_suffix = f"lora_magnitude_vector.{adapter_name}.weight"
|
||||
|
||||
def renamed_dora_weights(k):
|
||||
if k.endswith(new_dora_suffix):
|
||||
k = k[:-7] # remove ".weight"
|
||||
return k
|
||||
|
||||
to_return = {renamed_dora_weights(k): v for k, v in to_return.items()}
|
||||
|
||||
# DEAL WITH EMBEDDINGS
|
||||
# check the common embedding layers in `target_modules` to reset `save_embedding_layers` if necessary
|
||||
is_embedding_in_target_modules = False
|
||||
if (
|
||||
save_embedding_layers == "auto"
|
||||
and hasattr(config, "target_modules")
|
||||
and any(k in config.target_modules for k in EMBEDDING_LAYER_NAMES)
|
||||
):
|
||||
warnings.warn("Setting `save_embedding_layers` to `True` as embedding layers found in `target_modules`.")
|
||||
save_embedding_layers = is_embedding_in_target_modules = True
|
||||
elif save_embedding_layers == "auto":
|
||||
vocab_size = getattr(getattr(model, "config", None), "vocab_size", None)
|
||||
model_id = getattr(config, "base_model_name_or_path", None)
|
||||
|
||||
# For some models e.g. diffusers the text config file is stored in a subfolder
|
||||
# we need to make sure we can download that config.
|
||||
has_base_config = False
|
||||
|
||||
# ensure that this check is not performed in HF offline mode, see #1452
|
||||
if model_id is not None:
|
||||
local_config_exists = os.path.exists(os.path.join(model_id, "config.json"))
|
||||
exists = local_config_exists or check_file_exists_on_hf_hub(model_id, "config.json")
|
||||
if exists is None:
|
||||
# check failed, could not determine if it exists or not
|
||||
warnings.warn(
|
||||
f"Could not find a config file in {model_id} - will assume that the vocabulary was not modified."
|
||||
)
|
||||
has_base_config = False
|
||||
else:
|
||||
has_base_config = exists
|
||||
|
||||
# check if the vocab size of the base model is different from the vocab size of the finetuned model
|
||||
if (
|
||||
vocab_size
|
||||
and model_id
|
||||
and has_base_config
|
||||
and (vocab_size != model.config.__class__.from_pretrained(model_id).vocab_size)
|
||||
):
|
||||
warnings.warn(
|
||||
"Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning."
|
||||
)
|
||||
save_embedding_layers = True
|
||||
else:
|
||||
save_embedding_layers = False
|
||||
|
||||
if save_embedding_layers and hasattr(model, "get_input_embeddings"):
|
||||
for layer in [model.get_input_embeddings(), model.get_output_embeddings()]:
|
||||
if not is_embedding_in_target_modules or has_valid_embedding_base_layer(layer):
|
||||
# support from version >= 0.6.2
|
||||
embedding_module_name = get_embedding_layer_name(model, layer, is_embedding_in_target_modules)
|
||||
if embedding_module_name:
|
||||
to_return.update({k: v for k, v in state_dict.items() if embedding_module_name in k})
|
||||
elif save_embedding_layers:
|
||||
warnings.warn("Could not identify embedding layer(s) because the model is not a 🤗 transformers model.")
|
||||
|
||||
return to_return
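
This helper is combined with `gather_state_dict_fast` in the pipeline-parallel LoRA save path added earlier; a condensed sketch of that flow (the function name and arguments here are illustrative, not from the diff):

```python
import torch.distributed as dist

from colossalai.checkpoint_io.utils import gather_state_dict_fast, get_lora_state_dict

def collect_lora_adapters(peft_model, state_dict, pp_group):
    """Gather every pipeline stage's LoRA tensors onto pp rank 0 (sketch of the save path above)."""
    lora_state_dict = get_lora_state_dict(peft_model, state_dict)
    gathered = gather_state_dict_fast(lora_state_dict, pp_group, device="cpu")
    if dist.get_rank(pp_group) == 0:  # only the destination rank receives a dict
        state_dict.update(gathered)
    return state_dict
```
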
def gather_state_dict_fast(
|
||||
state_dict: Dict[str, torch.Tensor],
|
||||
group: dist.ProcessGroup,
|
||||
device: Optional[Union[torch.device, str]] = None,
|
||||
dst: int = 0,
|
||||
) -> Optional[Dict[str, torch.Tensor]]:
|
||||
if device is None:
|
||||
device = get_current_device()
|
||||
rank = dist.get_rank(group)
|
||||
world_size = dist.get_world_size(group)
|
||||
metadata = [(k, v.shape, v.dtype) for k, v in state_dict.items()]
|
||||
all_meta_data = [None] * world_size
|
||||
if rank == dst:
|
||||
returned_state_dict = state_dict.copy()
|
||||
dist.gather_object(metadata, all_meta_data, dst=dist.get_global_rank(group, rank), group=group)
|
||||
for i, target_metadata in enumerate(all_meta_data):
|
||||
if i == dst:
|
||||
continue
|
||||
ops = []
|
||||
for k, shape, dtype in target_metadata:
|
||||
buffer = torch.empty(shape, dtype=dtype, device=get_current_device())
|
||||
returned_state_dict[k] = buffer
|
||||
ops.append(dist.P2POp(dist.irecv, buffer, dist.get_global_rank(group, i), group))
|
||||
reqs = dist.batch_isend_irecv(ops)
|
||||
for req, (k, *_) in zip(reqs, target_metadata):
|
||||
req.wait()
|
||||
returned_state_dict[k] = returned_state_dict[k].to(device)
|
||||
return returned_state_dict
|
||||
else:
|
||||
dist.gather_object(metadata, dst=dist.get_global_rank(group, dst), group=group)
|
||||
ops = []
|
||||
for k, *_ in metadata:
|
||||
ops.append(dist.P2POp(dist.isend, state_dict[k], dist.get_global_rank(group, dst), group))
|
||||
reqs = dist.batch_isend_irecv(ops)
|
||||
for req in reqs:
|
||||
req.wait()
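
A hypothetical two-process demo of the gather helper above (launch with `torchrun --nproc_per_node=2`); the tensor names, shapes, and the NCCL backend are assumptions:

```python
import os

import torch
import torch.distributed as dist

from colossalai.checkpoint_io.utils import gather_state_dict_fast

def demo():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    rank = dist.get_rank()
    # Each rank holds a disjoint slice of the full state dict, as pipeline stages or EP ranks do.
    local = {f"layer{rank}.weight": torch.full((2, 2), float(rank), device="cuda")}
    gathered = gather_state_dict_fast(local, dist.group.WORLD, device="cpu", dst=0)
    if rank == 0:
        print(sorted(gathered))  # ['layer0.weight', 'layer1.weight'], moved to CPU
    else:
        assert gathered is None  # non-destination ranks only send

if __name__ == "__main__":
    demo()
```
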
@ -64,7 +64,10 @@ class ProcessGroupMesh:
        system resources.
        """
        for group in self._ranks_to_group.values():
            dist.destroy_process_group(group)
            try:
                dist.destroy_process_group(group)
            except ValueError:
                pass

        # Manually clear all process groups to save memory
        gc.collect()
|
||||
|
@ -62,7 +62,7 @@ engine = InferenceEngine(model, tokenizer, inference_config, verbose=True)

# Step 4: try inference
prompts = ['Who is the best player in the history of NBA?']
response = engine.generate(prompts)
response = engine.generate(prompts=prompts)
pprint(response)
```
|
||||
|
||||
|
@ -1,4 +1,102 @@
|
||||
import re
|
||||
from typing import Dict, Set
|
||||
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from peft import PeftModel, PeftType
|
||||
|
||||
|
||||
def extract_lora_layers(model: PeftModel, names: Set[str], adapter_name: str = "default"):
|
||||
config = model.peft_config[adapter_name]
|
||||
if config.peft_type != PeftType.LORA:
|
||||
raise ValueError(f"Adapter {adapter_name} is not a LORA adapter.")
|
||||
# to_return = lora_state_dict(model, bias=model.peft_config.bias)
|
||||
# adapted from `https://github.com/microsoft/LoRA/blob/main/loralib/utils.py`
|
||||
# to be used directly with the state dict which is necessary when using DeepSpeed or FSDP
|
||||
bias = config.bias
|
||||
if bias == "none":
|
||||
to_return = {k for k in names if "lora_" in k}
|
||||
elif bias == "all":
|
||||
to_return = {k for k in names if "lora_" in k or "bias" in k}
|
||||
elif bias == "lora_only":
|
||||
to_return = set()
|
||||
for k in names:
|
||||
if "lora_" in k:
|
||||
to_return.add(k)
|
||||
bias_name = k.split("lora_")[0] + "bias"
|
||||
if bias_name in names:
|
||||
to_return.add(bias_name)
|
||||
else:
|
||||
raise NotImplementedError
|
||||
to_return = {k for k in to_return if (("lora_" in k and adapter_name in k) or ("bias" in k))}
|
||||
if config.use_dora:
|
||||
# Here we take care of a refactor of DoRA which changed lora_magnitude_vector from a ParameterDict to a
|
||||
# ModuleDict with a DoraLayer instance. The old parameter is now the "weight" attribute of that layer. Since
|
||||
# we want the state_dict format not to change, we remove the "weight" part.
|
||||
new_dora_suffix = f"lora_magnitude_vector.{adapter_name}.weight"
|
||||
|
||||
def renamed_dora_weights(k):
|
||||
if k.endswith(new_dora_suffix):
|
||||
k = k[:-7] # remove ".weight"
|
||||
return k
|
||||
|
||||
to_return = {renamed_dora_weights(k) for k in to_return}
|
||||
|
||||
to_return = {re.sub(f"lora_\S\.{adapter_name}\.(weight|bias)", "base_layer", k) for k in to_return}
|
||||
return to_return
|
||||
|
||||
|
||||
class PeftUnwrapMixin:
|
||||
def __init__(self, peft_model: PeftModel):
|
||||
self.base_model = peft_model.get_base_model()
|
||||
# peft does not affect buffers
|
||||
self.lora_layers = extract_lora_layers(peft_model, set(n for n, p in self.base_model.named_parameters()))
|
||||
potential_lora_weights = set()
|
||||
for n in self.lora_layers:
|
||||
potential_lora_weights.add(f"{n}.weight")
|
||||
potential_lora_weights.add(f"{n}.bias")
|
||||
self.lora_param_to_origin_param = {n: n.replace("base_layer.", "") for n in potential_lora_weights}
|
||||
self.origin_param_to_lora_param = {v: k for k, v in self.lora_param_to_origin_param.items()}
|
||||
|
||||
def named_parameters(self):
|
||||
for n, p in self.base_model.named_parameters():
|
||||
if n in self.lora_param_to_origin_param:
|
||||
n = self.lora_param_to_origin_param[n]
|
||||
yield n, p
|
||||
|
||||
def named_buffers(self):
|
||||
return self.base_model.named_buffers()
|
||||
|
||||
@property
|
||||
def _modules(self):
|
||||
return self.base_model._modules
|
||||
|
||||
@property
|
||||
def _non_persistent_buffers_set(self):
|
||||
return self.base_model._non_persistent_buffers_set
|
||||
|
||||
def patch_state_dict(self, state_dict: Dict[str, torch.Tensor]):
|
||||
new_state_dict = {}
|
||||
for k, v in state_dict.items():
|
||||
if k in self.origin_param_to_lora_param:
|
||||
k = self.origin_param_to_lora_param[k]
|
||||
new_state_dict[k] = v
|
||||
return new_state_dict
|
||||
|
||||
def state_dict(self):
|
||||
state_dict = {}
|
||||
for k, v in self.base_model.state_dict().items():
|
||||
if k in self.lora_param_to_origin_param:
|
||||
k = self.lora_param_to_origin_param[k]
|
||||
state_dict[k] = v
|
||||
return state_dict
|
||||
|
||||
def load_state_dict(self, state_dict, strict: bool = True, assign: bool = False):
|
||||
state_dict = self.patch_state_dict(state_dict)
|
||||
self.base_model.load_state_dict(state_dict, strict=strict, assign=assign)
|
||||
|
||||
def __hash__(self):
|
||||
return hash(self.base_model)
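
This `__hash__` is what lets the `id(model)` → `hash(model)` cache-key changes in the checkpoint IO classes above keep working across re-wrapping: hashing delegates to the underlying base model, so a freshly built wrapper still hits the pinned-state-dict cache. A toy, pure-Python illustration (the `Wrapper` class below is an assumption, not from the diff):

```python
import torch.nn as nn

class Wrapper:
    """Stand-in for a model wrapper whose hash delegates to the wrapped module."""
    def __init__(self, model: nn.Module):
        self.model = model
    def __hash__(self):
        return hash(self.model)

base = nn.Linear(4, 4)
w1, w2 = Wrapper(base), Wrapper(base)

cache = {}
cache[hash(w1)] = "pinned buffers"           # mirrors self.pinned_state_dicts[hash(model)] = ...
assert id(w1) != id(w2)                      # an id()-keyed cache would miss for the new wrapper
assert cache[hash(w2)] == "pinned buffers"   # the hash()-keyed cache still hits
```
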
class ModelWrapper(nn.Module):
|
||||
@ -13,13 +111,17 @@ class ModelWrapper(nn.Module):
|
||||
super().__init__()
|
||||
self.module = module
|
||||
|
||||
def unwrap(self):
|
||||
def unwrap(self, unwrap_peft: bool = True):
|
||||
"""
|
||||
Unwrap the model to return the original model for checkpoint saving/loading.
|
||||
"""
|
||||
if isinstance(self.module, ModelWrapper):
|
||||
return self.module.unwrap()
|
||||
return self.module
|
||||
model = self.module.unwrap()
|
||||
else:
|
||||
model = self.module
|
||||
if unwrap_peft and isinstance(model, PeftModel):
|
||||
model = PeftUnwrapMixin(model)
|
||||
return model
|
||||
|
||||
def forward(self, *args, **kwargs):
|
||||
return self.module(*args, **kwargs)
|
||||
|
@ -104,7 +104,7 @@ def _data_tolist(tensor: torch.Tensor) -> list:
|
||||
return tensor.data.tolist()
|
||||
|
||||
|
||||
def _convert_cls(tensor: "LazyTensor", target: torch.Tensor) -> torch.Tensor:
|
||||
def _convert_cls(tensor: "LazyTensor", target: torch.Tensor, requires_grad=None) -> torch.Tensor:
|
||||
"""Convert a lazy tensor's class to target's class, with target's data.
|
||||
|
||||
The reason why we change the class of a lazy tensor in-place is that this can easily handle shared modules/parameters, which is common in huggingface models.
|
||||
@ -117,13 +117,14 @@ def _convert_cls(tensor: "LazyTensor", target: torch.Tensor) -> torch.Tensor:
|
||||
Returns:
|
||||
torch.Tensor: the converted tensor
|
||||
"""
|
||||
requires_grad = target.requires_grad if requires_grad is None else requires_grad
|
||||
cls_to_become = Parameter if isinstance(tensor, Parameter) else torch.Tensor
|
||||
tensor.__class__ = cls_to_become
|
||||
if cls_to_become is Parameter:
|
||||
# to fit UninitializedParameter
|
||||
delattr(tensor, "_is_param")
|
||||
tensor.data = target
|
||||
tensor.requires_grad = target.requires_grad
|
||||
tensor.requires_grad = requires_grad
|
||||
# subclass of torch.Tensor does not have tolist() method
|
||||
# overwrite this method after materialization or distribution
|
||||
tensor.tolist = MethodType(_data_tolist, tensor)
|
||||
@ -212,9 +213,10 @@ class LazyTensor(torch.Tensor):
|
||||
Returns:
|
||||
torch.Tensor: The materialized tensor (self).
|
||||
"""
|
||||
requires_grad = self.requires_grad
|
||||
target = self._materialize_data()
|
||||
self.clean()
|
||||
return _convert_cls(self, target)
|
||||
return _convert_cls(self, target, requires_grad=requires_grad)
|
||||
|
||||
def clean(self) -> None:
|
||||
"""Clean all stored operations, meta data and materialized data, which prevents memory leaking. This should be called after all tensors are materialized."""

@ -156,7 +156,9 @@ def _check_for_nccl_hccl_backend(group):
    while isinstance(pg, c10d._ProcessGroupWrapper):
        pg = pg.wrapped_pg

    return (c10d.is_nccl_available() or torch.distributed.is_hccl_available()) and pg.name() == c10d.Backend.NCCL
    return (c10d.is_nccl_available() or torch.distributed.is_hccl_available()) and (
        pg.name() == c10d.Backend.NCCL or pg.name() == c10d.Backend.HCCL
    )


def _check_device(group):
@ -384,7 +384,7 @@ class Linear1D_Row(ParallelModule):
|
||||
out_features (int): size of each output sample.
|
||||
bias (bool, optional): If set to ``False``, the layer will not learn an additive bias, defaults to ``True``.
|
||||
dtype (`torch.dtype`): The dtype of parameters, defaults to None.
|
||||
parallel_input (bool): If set to ``True``, it's assumed that the input is split, defaults to False.
|
||||
parallel_input (bool): If set to ``True``, it's assumed that the input is already split/copied across each rank, defaults to False.
|
||||
process_group (`torch.distributed.ProcessGroup`): The process group to be used for weight sharding and communication, defaults to None.
|
||||
seq_parallel_mode (`str`): The type of sp mode, it will use sequence parallel when `seq_parallel_mode` is not None. Defaults to None.
|
||||
seq_parallel_dim (`int`): Which dim will sequence parallelism split and gather the sequence.
|
||||
@ -544,14 +544,14 @@ class Linear1D_Row(ParallelModule):
|
||||
if self.parallel_input:
|
||||
assert (
|
||||
input_.shape[-1] == self.weight.shape[-1]
|
||||
), "Invalid shapes in Linear1D_Row forward: input={}, weight={}. Expected last dim of input {}.".format(
|
||||
), "Invalid shapes in Linear1D_Row forward: input={}, weight={}. Expected feature dim of input {}.".format(
|
||||
input_.shape, self.weight.shape, self.weight.shape[-1]
|
||||
)
|
||||
input_ = input_
|
||||
else:
|
||||
assert (
|
||||
divide(input_.shape[-1], self.num_partitions) == self.weight.shape[-1]
|
||||
), "Invalid shapes in Linear1D_Row forward: input={}, weight={}. Expected last dim of input {}.".format(
|
||||
), "Invalid shapes in Linear1D_Row forward: input={}, weight={}. Expected feature dim of input {}.".format(
|
||||
input_.shape, self.weight.shape, self.weight.shape[-1] * self.num_partitions
|
||||
)
|
||||
input_ = split_forward_gather_backward(
|
||||
|
colossalai/shardformer/modeling/deepseek_v3.py (new file, 385 additions)
@ -0,0 +1,385 @@
|
||||
from typing import List, Optional, Tuple, Union
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
import torch.distributed as dist
|
||||
from torch.distributed import ProcessGroup
|
||||
from torch.nn import CrossEntropyLoss
|
||||
from transformers.cache_utils import Cache, DynamicCache
|
||||
from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask
|
||||
from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
|
||||
|
||||
from colossalai.lazy import LazyInitContext
|
||||
from colossalai.moe._operation import (
|
||||
DPGradScalerIn,
|
||||
DPGradScalerOut,
|
||||
EPGradScalerIn,
|
||||
EPGradScalerOut,
|
||||
all_to_all_uneven,
|
||||
)
|
||||
from colossalai.pipeline.stage_manager import PipelineStageManager
|
||||
from colossalai.shardformer.layer.linear import ParallelModule
|
||||
from colossalai.shardformer.shard.utils import set_tensors_to_none
|
||||
from colossalai.tensor.moe_tensor.api import set_moe_tensor_ep_group
|
||||
|
||||
|
||||
class EpDeepseekV3MoE(ParallelModule):
|
||||
"""
|
||||
A mixed expert module containing shared experts.
|
||||
"""
|
||||
|
||||
def __init__(self, config):
|
||||
raise RuntimeError(f"Please use `from_native_module` to create an instance of {self.__class__.__name__}")
|
||||
|
||||
def setup_process_groups(
|
||||
self,
|
||||
moe_dp_group: ProcessGroup,
|
||||
ep_group: ProcessGroup,
|
||||
):
|
||||
assert moe_dp_group is not None
|
||||
assert ep_group is not None
|
||||
|
||||
self.ep_size = dist.get_world_size(ep_group)
|
||||
self.ep_rank = dist.get_rank(ep_group)
|
||||
self.num_experts = self.config.n_routed_experts
|
||||
assert self.num_experts % self.ep_size == 0
|
||||
|
||||
self.ep_group = ep_group
|
||||
self.num_experts_per_ep = self.num_experts // self.ep_size
|
||||
self.experts_per_rank = self.num_experts_per_ep
|
||||
self.expert_start_idx = self.ep_rank * self.num_experts_per_ep
|
||||
held_experts = self.experts[self.expert_start_idx : self.expert_start_idx + self.num_experts_per_ep]
|
||||
|
||||
set_tensors_to_none(self.experts, exclude=set(held_experts))
|
||||
|
||||
# setup moe_dp group
|
||||
self.moe_dp_group = moe_dp_group
|
||||
self.moe_dp_size = dist.get_world_size(moe_dp_group)
|
||||
|
||||
for p in self.experts.parameters():
|
||||
set_moe_tensor_ep_group(p, ep_group)
|
||||
|
||||
@staticmethod
|
||||
def from_native_module(
|
||||
module,
|
||||
moe_dp_group: ProcessGroup,
|
||||
ep_group: ProcessGroup,
|
||||
*args,
|
||||
**kwargs,
|
||||
) -> "EpDeepseekV3MoE":
|
||||
if module.__class__.__name__ != "DeepseekV3MLP":
|
||||
module.__class__ = EpDeepseekV3MoE
|
||||
module.setup_process_groups(moe_dp_group, ep_group)
|
||||
LazyInitContext.materialize(module)
|
||||
return module
|
||||
|
||||
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
|
||||
identity = hidden_states
|
||||
orig_shape = hidden_states.shape
|
||||
topk_idx, topk_weight = self.gate(hidden_states)
|
||||
hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
|
||||
y = self.moe_forward(hidden_states, topk_idx, topk_weight).view(*orig_shape)
|
||||
if self.config.n_shared_experts is not None:
|
||||
y = y + self.shared_experts(identity)
|
||||
return y
|
||||
|
||||
def moe_forward(self, x: torch.Tensor, topk_ids: torch.Tensor, topk_weight: torch.Tensor) -> torch.Tensor:
|
||||
cnts = topk_ids.new_zeros((topk_ids.shape[0], len(self.experts)))
|
||||
cnts.scatter_(1, topk_ids, 1)
|
||||
tokens_per_expert = cnts.sum(dim=0)
|
||||
idxs = topk_ids.view(-1).argsort()
|
||||
sorted_tokens = x[idxs // topk_ids.shape[1]]
|
||||
if self.ep_size > 1:
|
||||
tokens_per_ep_rank = tokens_per_expert.view(self.ep_size, -1).sum(dim=1)
|
||||
tokens_per_expert_group = tokens_per_expert.new_empty(tokens_per_expert.shape[0])
|
||||
dist.all_to_all_single(tokens_per_expert_group, tokens_per_expert, group=self.ep_group)
|
||||
|
||||
output_splits = tokens_per_expert_group.view(self.ep_size, -1).sum(1).tolist()
|
||||
input_split_sizes = tokens_per_ep_rank.tolist()
|
||||
|
||||
gathered_tokens, _ = all_to_all_uneven(sorted_tokens, input_split_sizes, output_splits, self.ep_group)
|
||||
tokens_per_expert_post_gather = tokens_per_expert_group.view(self.ep_size, self.experts_per_rank).sum(dim=0)
|
||||
gatherd_idxs = np.zeros(shape=(gathered_tokens.shape[0],), dtype=np.int32)
|
||||
s = 0
|
||||
for i, k in enumerate(tokens_per_expert_group.cpu().numpy()):
|
||||
gatherd_idxs[s : s + k] = i % self.experts_per_rank
|
||||
s += k
|
||||
gatherd_idxs = gatherd_idxs.argsort()
|
||||
sorted_tokens = gathered_tokens[gatherd_idxs]
|
||||
tokens_per_expert = tokens_per_expert_post_gather
|
||||
|
||||
# moe-dp related code
|
||||
activate_experts = tokens_per_expert_post_gather > 0
|
||||
activate_experts = activate_experts.int()
|
||||
dist.all_reduce(activate_experts, group=self.moe_dp_group)
|
||||
|
||||
# ep related code
|
||||
sorted_tokens = EPGradScalerIn.apply(sorted_tokens, self.ep_size)
|
||||
|
||||
tokens_per_expert = tokens_per_expert.cpu().numpy()
|
||||
|
||||
outputs = []
|
||||
start_idx = 0
|
||||
for i, num_tokens in enumerate(tokens_per_expert):
|
||||
end_idx = start_idx + num_tokens
|
||||
if num_tokens == 0:
|
||||
continue
|
||||
expert = self.experts[i + self.ep_rank * self.experts_per_rank]
|
||||
tokens_for_this_expert = sorted_tokens[start_idx:end_idx]
|
||||
# moe-dp related code
|
||||
tokens_for_this_expert = DPGradScalerIn.apply(tokens_for_this_expert, self.moe_dp_size, activate_experts[i])
|
||||
expert_out = expert(tokens_for_this_expert)
|
||||
# moe-dp related code
|
||||
expert_out = DPGradScalerOut.apply(expert_out, self.moe_dp_size, activate_experts[i])
|
||||
outputs.append(expert_out)
|
||||
start_idx = end_idx
|
||||
|
||||
if len(outputs) > 0:
|
||||
outs = torch.cat(outputs, dim=0)
|
||||
else:
|
||||
assert sorted_tokens.numel() == 0, f"sorted_tokens: should be empty, but got {sorted_tokens.shape}"
|
||||
outs = sorted_tokens
|
||||
|
||||
if self.ep_size > 1:
|
||||
outs = EPGradScalerOut.apply(outs, self.ep_size)
|
||||
new_x = torch.empty_like(outs)
|
||||
new_x[gatherd_idxs] = outs
|
||||
gathered_tokens, _ = all_to_all_uneven(new_x, output_splits, input_split_sizes, self.ep_group)
|
||||
outs = gathered_tokens
|
||||
|
||||
new_x = torch.empty_like(outs)
|
||||
new_x[idxs] = outs
|
||||
final_out = (
|
||||
(new_x.view(*topk_ids.shape, -1).type(topk_weight.dtype) * topk_weight.unsqueeze(dim=-1))
|
||||
.sum(dim=1)
|
||||
.type(new_x.dtype)
|
||||
)
|
||||
|
||||
return final_out
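
A toy numeric illustration (not from the diff) of how the uneven all-to-all split sizes above are derived, assuming `ep_size == 2` and four routed experts (two per rank); the counts are made up:

```python
import torch

ep_size = 2
# Tokens this rank routed to each of the 4 global experts:
tokens_per_expert = torch.tensor([3, 1, 0, 5])

# input_split_sizes: how many of my locally sorted tokens go to each EP rank.
input_split_sizes = tokens_per_expert.view(ep_size, -1).sum(dim=1).tolist()  # [4, 5]

# dist.all_to_all_single exchanges the per-expert counts; suppose this rank received:
tokens_per_expert_group = torch.tensor([3, 1, 2, 2])  # [rank0: e0, e1 | rank1: e0, e1]
# output_splits: how many tokens arrive from each peer for my local experts.
output_splits = tokens_per_expert_group.view(ep_size, -1).sum(dim=1).tolist()  # [4, 4]
# Per-local-expert totals after the exchange (tokens_per_expert_post_gather above):
per_local_expert = tokens_per_expert_group.view(ep_size, -1).sum(dim=0).tolist()  # [5, 3]

# These lists are exactly what all_to_all_uneven(sorted_tokens, input_split_sizes,
# output_splits, ep_group) consumes in moe_forward above.
```
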
def deepseek_v3_model_forward(
|
||||
self,
|
||||
input_ids: torch.LongTensor = None,
|
||||
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_values: Optional[List[torch.FloatTensor]] = None,
    inputs_embeds: Optional[torch.FloatTensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    stage_manager: Optional[PipelineStageManager] = None,
    stage_index: Optional[List[int]] = None,
    hidden_states_internal: Optional[torch.Tensor] = None,
) -> Union[Tuple, BaseModelOutputWithPast]:
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    use_cache = use_cache if use_cache is not None else self.config.use_cache

    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # retrieve input_ids and inputs_embeds
    if input_ids is not None and inputs_embeds is not None:
        raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
    elif input_ids is not None:
        batch_size, seq_length = input_ids.shape[:2]
    elif inputs_embeds is not None:
        batch_size, seq_length = inputs_embeds.shape[:2]
    else:
        raise ValueError("You have to specify either input_ids or inputs_embeds")

    past_key_values_length = 0
    if use_cache:
        use_legacy_cache = not isinstance(past_key_values, Cache)
        if use_legacy_cache:
            past_key_values = DynamicCache.from_legacy_cache(past_key_values)
        past_key_values_length = past_key_values.get_usable_length(seq_length)

    if position_ids is None:
        device = input_ids.device if input_ids is not None else inputs_embeds.device
        position_ids = torch.arange(
            past_key_values_length,
            seq_length + past_key_values_length,
            dtype=torch.long,
            device=device,
        )
        position_ids = position_ids.unsqueeze(0)

    if stage_manager is None or stage_manager.is_first_stage():
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
    else:
        inputs_embeds = hidden_states_internal

    if self._use_flash_attention_2:
        # 2d mask is passed through the layers
        attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
    else:
        # 4d mask is passed through the layers
        attention_mask = _prepare_4d_causal_attention_mask(
            attention_mask,
            (batch_size, seq_length),
            inputs_embeds,
            past_key_values_length,
        )

    # embed positions
    hidden_states = inputs_embeds

    # decoder layers
    all_hidden_states = () if output_hidden_states else None
    all_self_attns = () if output_attentions else None
    next_decoder_cache = None

    if stage_index is not None:
        start_idx, end_idx = stage_index
    else:
        start_idx, end_idx = 0, len(self.layers)
    for i, decoder_layer in enumerate(self.layers[start_idx:end_idx], start=start_idx):
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        if self.gradient_checkpointing and i > 0:
            layer_outputs = self._gradient_checkpointing_func(
                decoder_layer.__call__,
                hidden_states,
                attention_mask,
                position_ids,
                past_key_values,
                output_attentions,
                use_cache,
            )
        else:
            layer_outputs = decoder_layer(
                hidden_states,
                attention_mask=attention_mask,
                position_ids=position_ids,
                past_key_value=past_key_values,
                output_attentions=output_attentions,
                use_cache=use_cache,
            )

        hidden_states = layer_outputs[0]

        if use_cache:
            next_decoder_cache = layer_outputs[2 if output_attentions else 1]

        if output_attentions:
            all_self_attns += (layer_outputs[1],)

    if stage_manager is None or stage_manager.is_last_stage():
        hidden_states = self.norm(hidden_states)

    # add hidden states from the last decoder layer
    if output_hidden_states:
        all_hidden_states += (hidden_states,)

    next_cache = None
    if use_cache:
        next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
    if stage_manager is not None and not stage_manager.is_last_stage():
        return {
            "hidden_states_internal": hidden_states,
        }
    if not return_dict:
        return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
    return BaseModelOutputWithPast(
        last_hidden_state=hidden_states,
        past_key_values=next_cache,
        hidden_states=all_hidden_states,
        attentions=all_self_attns,
    )


def deepseek_v3_for_causal_lm_forward(
    self,
    input_ids: torch.LongTensor = None,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_values: Optional[List[torch.FloatTensor]] = None,
    inputs_embeds: Optional[torch.FloatTensor] = None,
    labels: Optional[torch.LongTensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    stage_manager: Optional[PipelineStageManager] = None,
    stage_index: Optional[List[int]] = None,
    hidden_states_internal: Optional[torch.Tensor] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
    r"""
    Args:
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should either be in
            `[0, ..., config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100`
            are ignored (masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

    Returns:

    Example:

    ```python
    >>> from transformers import AutoTokenizer, DeepseekV3ForCausalLM

    >>> model = DeepseekV3ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
    >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)

    >>> prompt = "Hey, are you conscious? Can you talk to me?"
    >>> inputs = tokenizer(prompt, return_tensors="pt")

    >>> # Generate
    >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
    >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
    ```"""
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
    outputs = deepseek_v3_model_forward(
        self.model,
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
        stage_manager=stage_manager,
        stage_index=stage_index,
        hidden_states_internal=hidden_states_internal,
    )
    if stage_manager is not None and not stage_manager.is_last_stage():
        return outputs

    hidden_states = outputs[0]

    logits = self.lm_head(hidden_states)
    logits = logits.float()

    loss = None
    if labels is not None:
        # Shift so that tokens < n predict n
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        # Flatten the tokens
        loss_fct = CrossEntropyLoss()
        shift_logits = shift_logits.view(-1, self.config.vocab_size)
        shift_labels = shift_labels.view(-1)
        # Enable model parallelism
        shift_labels = shift_labels.to(shift_logits.device)
        loss = loss_fct(shift_logits, shift_labels)

    if not return_dict:
        output = (logits,) + outputs[1:]
        return (loss,) + output if loss is not None else output

    return CausalLMOutputWithPast(
        loss=loss,
        logits=logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )
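Both forwards above thread pipeline state through `stage_manager`, `stage_index`, and `hidden_states_internal`: a non-last stage returns a plain dict carrying `hidden_states_internal`, and the next stage feeds that tensor back in instead of re-embedding `input_ids`. The following is only a minimal sketch of that hand-off with hypothetical stage-manager placeholders; the real driver is ColossalAI's pipeline schedule, not a hand-written loop.

```python
# Hedged sketch: chaining the stage-aware forward across a two-stage split.
# `first_stage_mgr` / `last_stage_mgr` are placeholder objects assumed to answer
# is_first_stage()/is_last_stage(); input_ids is still passed to the second call
# because the forward derives batch/sequence shapes and position_ids from it.
def run_two_stages(model, input_ids, first_stage_mgr, last_stage_mgr, n_layers):
    # Stage 0: embed tokens, run the first half of the decoder layers.
    mid = deepseek_v3_model_forward(
        model,
        input_ids=input_ids,
        stage_manager=first_stage_mgr,
        stage_index=[0, n_layers // 2],
    )
    # A non-last stage returns a dict, not a BaseModelOutputWithPast.
    hidden = mid["hidden_states_internal"]

    # Stage 1: skip the embedding, consume the carried hidden states,
    # run the remaining layers and the final norm.
    out = deepseek_v3_model_forward(
        model,
        input_ids=input_ids,
        stage_manager=last_stage_mgr,
        stage_index=[n_layers // 2, n_layers],
        hidden_states_internal=hidden,
    )
    return out.last_hidden_state
```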
@@ -167,6 +167,13 @@ _POLICY_LIST = {
    "transformers_modules.modeling_deepseek.DeepseekForCausalLM": PolicyLocation(
        file_name="deepseek", class_name="DeepseekForCausalLMPolicy"
    ),
    # DeepseekV3
    "transformers_modules.modeling_deepseek.DeepseekV3Model": PolicyLocation(
        file_name="deepseek_v3", class_name="DeepseekV3ModelPolicy"
    ),
    "transformers_modules.modeling_deepseek.DeepseekV3ForCausalLM": PolicyLocation(
        file_name="deepseek_v3", class_name="DeepseekV3ForCausalLMPolicy"
    ),
    # Falcon
    "transformers.models.falcon.modeling_falcon.FalconModel": PolicyLocation(
        file_name="falcon", class_name="FalconModelPolicy"

164 colossalai/shardformer/policies/deepseek_v3.py Normal file
@@ -0,0 +1,164 @@
from functools import partial
from typing import Callable, Dict, List, Union

import torch.nn as nn

from colossalai.shardformer.layer import FusedRMSNorm
from colossalai.shardformer.modeling.deepseek_v3 import (
    EpDeepseekV3MoE,
    deepseek_v3_for_causal_lm_forward,
    deepseek_v3_model_forward,
)
from colossalai.shardformer.policies.base_policy import ModulePolicyDescription, Policy, SubModuleReplacementDescription

__all__ = ["DeepseekV3Policy", "DeepseekV3ModelPolicy", "DeepseekV3ForCausalLMPolicy"]


class DeepseekV3Policy(Policy):
    def config_sanity_check(self):
        assert not self.shard_config.enable_tensor_parallelism, "DeepSeekV3 does not support tensor parallelism"
        assert not self.shard_config.enable_sequence_parallelism, "DeepSeekV3 does not support sequence parallelism"
        if self.shard_config.pipeline_stage_manager:
            assert not self.shard_config.pipeline_stage_manager.use_zbv, "DeepSeekV3 does not support ZBV"

    def preprocess(self):
        return self.model

    def module_policy(self) -> Dict[Union[str, nn.Module], ModulePolicyDescription]:

        policy = {}

        # support gradient checkpointing
        if self.shard_config.pipeline_stage_manager is None:
            policy["DeepseekV3Model"] = ModulePolicyDescription(
                method_replacement={"forward": deepseek_v3_model_forward}
            )

        if self.shard_config.expert_parallel_size > 1:
            # expert parallel
            self.append_or_create_submodule_replacement(
                description=[
                    SubModuleReplacementDescription(
                        suffix="mlp",
                        target_module=EpDeepseekV3MoE,
                        kwargs={
                            "ep_group": self.shard_config.ep_group,
                            "moe_dp_group": self.shard_config.moe_dp_group,
                        },
                    )
                ],
                policy=policy,
                target_key="DeepseekV3DecoderLayer",
            )

        # optimization configuration
        if self.shard_config.enable_fused_normalization:
            # TODO: prevent casting to fp32
            self.append_or_create_submodule_replacement(
                description=[
                    SubModuleReplacementDescription(
                        suffix="input_layernorm",
                        target_module=FusedRMSNorm,
                    ),
                    SubModuleReplacementDescription(
                        suffix="post_attention_layernorm",
                        target_module=FusedRMSNorm,
                    ),
                ],
                policy=policy,
                target_key="DeepseekV3DecoderLayer",
            )

            self.append_or_create_submodule_replacement(
                description=SubModuleReplacementDescription(
                    suffix="norm",
                    target_module=FusedRMSNorm,
                ),
                policy=policy,
                target_key="DeepseekV3Model",
            )

        return policy

    def postprocess(self):
        return self.model

    def set_pipeline_forward(self, model_cls: str, new_forward: Callable, policy: Dict) -> None:
        """Under a pipeline-parallel setting, replace the original Hugging Face forward method
        with the customized forward method and record this change in the policy."""
        if self.pipeline_stage_manager:
            num_layers = self.model.config.num_hidden_layers
            stage_manager = self.pipeline_stage_manager

            layers_per_stage = stage_manager.distribute_layers(num_layers)
            stage_index = stage_manager.get_stage_index(layers_per_stage)
            method_replacement = {"forward": partial(new_forward, stage_manager=stage_manager, stage_index=stage_index)}
            self.append_or_create_method_replacement(
                description=method_replacement, policy=policy, target_key=model_cls
            )

        return

    def get_held_layers(self) -> List[nn.Module]:
        """Get pipeline layers for current stage."""
        assert self.pipeline_stage_manager is not None

        module = self.model
        if module.__class__.__name__.startswith("PeftModel"):
            module = module.get_base_model()
        if module.__class__.__name__ != "DeepseekV3Model":
            module = module.model

        stage_manager = self.pipeline_stage_manager

        held_layers = []

        if stage_manager.is_interleave:
            assert stage_manager.num_model_chunks is not None
            layers_per_stage = stage_manager.distribute_layers(len(module.layers))
            stage_indices = stage_manager.get_stage_index(layers_per_stage)
            stage_manager.stage_indices = stage_indices
            if stage_manager.is_first_stage(ignore_chunk=True):
                held_layers.append(module.embed_tokens)
            for start_idx, end_idx in stage_indices:
                held_layers.extend(module.layers[start_idx:end_idx])
            if (stage_manager.use_zbv and stage_manager.is_first_stage(ignore_chunk=True)) or (
                not stage_manager.use_zbv and stage_manager.is_last_stage(ignore_chunk=True)
            ):
                # for zbv, when is_first_stage (last fwd), we append norm
                # for interleaved, when is_last_stage (last fwd), we also append norm
                held_layers.append(module.norm)
        else:
            layers_per_stage = stage_manager.distribute_layers(len(module.layers))
            if stage_manager.is_first_stage():
                held_layers.append(module.embed_tokens)
            start_idx, end_idx = stage_manager.get_stage_index(layers_per_stage)
            held_layers.extend(module.layers[start_idx:end_idx])
            if stage_manager.is_last_stage():
                held_layers.append(module.norm)
        return held_layers


class DeepseekV3ModelPolicy(DeepseekV3Policy):
    def module_policy(self):
        policy = super().module_policy()
        if self.shard_config.pipeline_stage_manager:
            self.set_pipeline_forward("DeepseekV3Model", deepseek_v3_model_forward, policy)
        return policy


class DeepseekV3ForCausalLMPolicy(DeepseekV3Policy):
    def module_policy(self):
        policy = super().module_policy()
        if self.shard_config.pipeline_stage_manager:
            self.set_pipeline_forward("DeepseekV3ForCausalLM", deepseek_v3_for_causal_lm_forward, policy)
        return policy

    def get_held_layers(self):
        stage_manager = self.pipeline_stage_manager
        held_layers = super().get_held_layers()
        if stage_manager.use_zbv and stage_manager.is_first_stage(ignore_chunk=True):
            held_layers.append(self.model.lm_head)
        elif stage_manager.is_last_stage(ignore_chunk=True):
            held_layers.append(self.model.lm_head)
        return held_layers
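Registered through `_POLICY_LIST` above, these policies are resolved automatically when a DeepSeek-V3 model is boosted. The snippet below is only a rough usage sketch: the plugin argument names (`tp_size`, `pp_size`, `ep_size`, `precision`) follow the MoE plugin's usual signature and are assumptions here, not taken from this diff.

```python
# Hedged sketch: boost a DeepSeek-V3 model so DeepseekV3ForCausalLMPolicy is applied
# (expert parallelism via EpDeepseekV3MoE, fused RMSNorm, pipeline forwards).
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import MoeHybridParallelPlugin
from transformers import AutoConfig, AutoModelForCausalLM

colossalai.launch_from_torch()
plugin = MoeHybridParallelPlugin(
    tp_size=1,   # tensor parallelism is rejected by config_sanity_check above
    pp_size=2,   # pipeline parallelism uses the stage-aware forwards
    ep_size=4,   # >1 so MoE blocks are wrapped by EpDeepseekV3MoE
    precision="bf16",
)
booster = Booster(plugin=plugin)

config = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V3", trust_remote_code=True)
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model, optimizer, *_ = booster.boost(model, optimizer)
```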
@@ -68,6 +68,10 @@ class ShardConfig:
    def sequence_parallel_size(self):
        return self._sequence_parallel_size

    @property
    def expert_parallel_size(self):
        return self._expert_parallel_size

    def __post_init__(self):
        # turn on all optimization if all_optimization is set to True
        if self.enable_all_optimization:
@@ -103,6 +107,8 @@ class ShardConfig:
        else:
            self._sequence_parallel_size = dist.get_world_size(self.sequence_parallel_process_group)

        self._expert_parallel_size = dist.get_world_size(self.ep_group) if self.ep_group else 1

    def _turn_on_all_optimization(self):
        """
        Turn on all optimization.
@@ -786,30 +786,36 @@ class LowLevelZeroOptimizer(OptimizerWrapper):
        """
        zero_state = dict()
        device = get_accelerator().get_current_device()
        for param, state in self.optim.state.items():
            working_param = self.master_to_working_param[id(param)]
            pg = self.param_to_pg[working_param]
            if not only_on_master or get_nd_rank(pg) == 0:
                zero_state[param] = copy.deepcopy(state)
            else:
                zero_state[param] = {}
        for param_group in self.optim.param_groups:
            for param in param_group["params"]:
                if param not in self.optim.state:
                    continue
                state = self.optim.state[param]
                working_param = self.master_to_working_param[id(param)]
                pg = self.param_to_pg[working_param]
                if not only_on_master or get_nd_rank(pg) == 0:
                    zero_state[param] = copy.deepcopy(state)
                else:
                    zero_state[param] = {}

            if pinned_state_dicts is not None and param not in pinned_state_dicts:
                pinned_state_dicts[param] = {}
                if pinned_state_dicts is not None and param not in pinned_state_dicts:
                    pinned_state_dicts[param] = {}

            for k, v in state.items():
                if isinstance(v, torch.Tensor) and k != "step":
                    gathered_tensor = torch.empty(v.numel() * get_nd_world_size(pg), device=device, dtype=v.dtype)
                    all_gather_into_flat_tensor_nd(gathered_tensor, v.to(device).flatten(), pg)
                    param_state = gathered_tensor[: working_param.numel()].reshape_as(working_param)
                    if not only_on_master or get_nd_rank(pg) == 0:
                        if pinned_state_dicts is not None and k not in pinned_state_dicts[param]:
                            pinned_state_dicts[param][k] = torch.empty_like(param_state, pin_memory=True, device="cpu")
                        if pinned_state_dicts is not None:
                            pinned_state_dicts[param][k].copy_(param_state)
                            zero_state[param][k] = pinned_state_dicts[param][k]
                        else:
                            zero_state[param][k] = param_state.cpu()
                for k, v in state.items():
                    if isinstance(v, torch.Tensor) and k != "step":
                        gathered_tensor = torch.empty(v.numel() * get_nd_world_size(pg), device=device, dtype=v.dtype)
                        all_gather_into_flat_tensor_nd(gathered_tensor, v.to(device).flatten(), pg)
                        param_state = gathered_tensor[: working_param.numel()].reshape_as(working_param)
                        if not only_on_master or get_nd_rank(pg) == 0:
                            if pinned_state_dicts is not None and k not in pinned_state_dicts[param]:
                                pinned_state_dicts[param][k] = torch.empty_like(
                                    param_state, pin_memory=True, device="cpu"
                                )
                            if pinned_state_dicts is not None:
                                pinned_state_dicts[param][k].copy_(param_state)
                                zero_state[param][k] = pinned_state_dicts[param][k]
                            else:
                                zero_state[param][k] = param_state.cpu()

        states_dict = self._pack_state(zero_state)

@@ -865,48 +871,52 @@ class LowLevelZeroOptimizer(OptimizerWrapper):
        device = get_accelerator().get_current_device()
        local_states = self.optim.state_dict()["state"]

        idx2master = {}
        master2idx = {}
        cnt = 0
        for param_group in self.optim.param_groups:
            for param in param_group["params"]:
                idx2master[cnt] = param
                master2idx[param] = cnt
                cnt += 1
        for param_idx, states in local_states.items():
            current_block_size = 0
            if pinned_state_dicts is not None and param_idx not in pinned_state_dicts:
                pinned_state_dicts[param_idx] = {}
            master_param = idx2master[param_idx]
            working_param = self.master_to_working_param[id(master_param)]
            pg = self.param_to_pg[working_param]
            if not only_on_master or get_nd_rank(pg) == 0:
                current_block = copy.deepcopy(states)
            else:
                current_block = {}

            for k, v in states.items():
                if isinstance(v, torch.Tensor) and k != "step":
                    state_tensor = torch.empty(v.numel() * get_nd_world_size(pg), device=device, dtype=v.dtype)
                    all_gather_into_flat_tensor_nd(state_tensor, v.to(device).flatten(), pg)
                    state_tensor = state_tensor[: working_param.numel()].reshape_as(working_param)
                    if not only_on_master or get_nd_rank(pg) == 0:
                        if pinned_state_dicts is not None and k not in pinned_state_dicts[param_idx]:
                            pinned_state_dicts[param_idx][k] = torch.empty_like(
                                state_tensor, pin_memory=True, device="cpu"
                            )
                        if pinned_state_dicts is not None:
                            pinned_state_dicts[param_idx][k].copy_(state_tensor)
                            current_block[k] = pinned_state_dicts[param_idx][k]
                        else:
                            current_block[k] = state_tensor.cpu()
                    current_block_size += calculate_tensor_size(state_tensor)
        for param_group in self.optim.param_groups:
            for master_param in param_group["params"]:
                param_idx = master2idx[master_param]
                states = local_states[param_idx]

            if ret_block_size + current_block_size > max_shard_size and len(ret_block) > 0:
                yield ret_block, ret_block_size
                ret_block = dict()
                ret_block_size = 0
                current_block_size = 0
                if pinned_state_dicts is not None and param_idx not in pinned_state_dicts:
                    pinned_state_dicts[param_idx] = {}
                working_param = self.master_to_working_param[id(master_param)]
                pg = self.param_to_pg[working_param]
                if not only_on_master or get_nd_rank(pg) == 0:
                    current_block = copy.deepcopy(states)
                else:
                    current_block = {}

            ret_block[param_idx] = current_block
            ret_block_size += current_block_size
                for k, v in states.items():
                    if isinstance(v, torch.Tensor) and k != "step":
                        state_tensor = torch.empty(v.numel() * get_nd_world_size(pg), device=device, dtype=v.dtype)
                        all_gather_into_flat_tensor_nd(state_tensor, v.to(device).flatten(), pg)
                        state_tensor = state_tensor[: working_param.numel()].reshape_as(working_param)
                        if not only_on_master or get_nd_rank(pg) == 0:
                            if pinned_state_dicts is not None and k not in pinned_state_dicts[param_idx]:
                                pinned_state_dicts[param_idx][k] = torch.empty_like(
                                    state_tensor, pin_memory=True, device="cpu"
                                )
                            if pinned_state_dicts is not None:
                                pinned_state_dicts[param_idx][k].copy_(state_tensor)
                                current_block[k] = pinned_state_dicts[param_idx][k]
                            else:
                                current_block[k] = state_tensor.cpu()
                        current_block_size += calculate_tensor_size(state_tensor)

                if ret_block_size + current_block_size > max_shard_size and len(ret_block) > 0:
                    yield ret_block, ret_block_size
                    ret_block = dict()
                    ret_block_size = 0

                ret_block[param_idx] = current_block
                ret_block_size += current_block_size

        yield ret_block, ret_block_size
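Both hunks rely on the same gather pattern: each rank holds a flattened shard of an optimizer-state tensor, so a full copy is rebuilt by all-gathering the shards, trimming any padding, and reshaping to the working parameter. Below is a stripped-down illustration of that pattern using plain `torch.distributed`, not the ColossalAI helpers (`all_gather_into_flat_tensor_nd`, `get_nd_world_size`) used above; shapes and the process group are illustrative.

```python
# Hedged sketch: rebuild a full optimizer-state tensor from per-rank flat shards.
import torch
import torch.distributed as dist


def gather_state(shard: torch.Tensor, working_param: torch.Tensor, pg: dist.ProcessGroup) -> torch.Tensor:
    world_size = dist.get_world_size(pg)
    # Every rank contributes shard.numel() elements; the concatenation may extend
    # past working_param.numel() due to padding, so the tail is cut off afterwards.
    gathered = torch.empty(shard.numel() * world_size, device=shard.device, dtype=shard.dtype)
    dist.all_gather_into_tensor(gathered, shard.contiguous(), group=pg)
    return gathered[: working_param.numel()].reshape_as(working_param)
```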
@@ -25,6 +25,7 @@
</div>

## News
* [2025/02] [DeepSeek 671B Fine-Tuning Guide Revealed—Unlock the Upgraded DeepSeek Suite with One Click, AI Players Ecstatic!](https://company.hpc-ai.com/blog/shocking-release-deepseek-671b-fine-tuning-guide-revealed-unlock-the-upgraded-deepseek-suite-with-one-click-ai-players-ecstatic)
* [2024/12] [The development cost of video generation models has saved by 50%! Open-source solutions are now available with H200 GPU vouchers](https://company.hpc-ai.com/blog/the-development-cost-of-video-generation-models-has-saved-by-50-open-source-solutions-are-now-available-with-h200-gpu-vouchers) [[code]](https://github.com/hpcaitech/Open-Sora/blob/main/scripts/train.py) [[vouchers]](https://colossalai.org/zh-Hans/docs/get_started/bonus/)
* [2024/10] [How to build a low-cost Sora-like app? Solutions for you](https://company.hpc-ai.com/blog/how-to-build-a-low-cost-sora-like-app-solutions-for-you)
* [2024/09] [Singapore Startup HPC-AI Tech Secures 50 Million USD in Series A Funding to Build the Video Generation AI Model and GPU Platform](https://company.hpc-ai.com/blog/singapore-startup-hpc-ai-tech-secures-50-million-usd-in-series-a-funding-to-build-the-video-generation-ai-model-and-gpu-platform)
@@ -4,11 +4,13 @@ import resource
import time
import warnings
from contextlib import nullcontext
from types import MethodType

import torch
import torch.distributed as dist
from data_utils import RandomDataset
from model_utils import format_numel_str, get_model_numel
from peft import LoraConfig
from performance_evaluator import PerformanceEvaluator, get_profile_context
from tqdm import tqdm
from transformers import AutoConfig, AutoModelForCausalLM
@@ -29,7 +31,7 @@ warnings.filterwarnings("ignore")

# We have lots of llamas for your choice!
MODEL_CONFIGS = {
    "100m": lambda: AutoConfig.from_pretrained(
    "100m": AutoConfig.from_pretrained(
        "deepseek-ai/deepseek-moe-16b-base",
        max_position_embeddings=4096,
        num_hidden_layers=1,
@@ -44,20 +46,29 @@ MODEL_CONFIGS = {
        attn_implementation="flash_attention_2",
        trust_remote_code=True,
    ),
    "7b": lambda: AutoConfig.from_pretrained(
    "7b": AutoConfig.from_pretrained(
        "deepseek-ai/deepseek-moe-16b-base",
        max_position_embeddings=4096,
        num_hidden_layers=13,
        attn_implementation="flash_attention_2",
        trust_remote_code=True,
    ),
    "14b": lambda: AutoConfig.from_pretrained(
    "14b": AutoConfig.from_pretrained(
        "deepseek-ai/deepseek-moe-16b-base",
        max_position_embeddings=4096,
        num_hidden_layers=26,
        attn_implementation="flash_attention_2",
        trust_remote_code=True,
    ),
    "v3-7b": AutoConfig.from_pretrained(
        "deepseek-ai/DeepSeek-V3",
        num_hidden_layers=6,
        first_k_dense_replace=2,
        n_routed_experts=32,
        vocab_size=8192,
        attn_implementation="flash_attention_2",
        trust_remote_code=True,
    ),
}


@@ -119,6 +130,7 @@ def main():
        help="Sequence parallelism mode",
    )
    parser.add_argument("--debug", action="store_true", help="Enable debug mode")
    parser.add_argument("--enable_lora", action="store_true", help="Enable LoRA")
    args = parser.parse_args()

    colossalai.launch_from_torch()
@@ -151,7 +163,7 @@ def main():
            sp_size=args.sp,
            sequence_parallelism_mode=args.sp_mode,
            enable_sequence_parallelism=args.sp > 1,
            enable_fused_normalization=torch.cuda.is_available(),
            enable_fused_normalization=get_accelerator().is_available(),
            enable_flash_attention=args.xformers,
            microbatch_size=args.mbs,
            precision="bf16",
@@ -171,7 +183,10 @@ def main():
    # ==============================
    dp_size = getattr(plugin, "dp_size", coordinator.world_size)

    config = MODEL_CONFIGS[args.config]()
    if args.config in MODEL_CONFIGS:
        config = MODEL_CONFIGS[args.config]
    else:
        config = AutoConfig.from_pretrained(args.config, trust_remote_code=True)

    torch.cuda.manual_seed(42)

@@ -189,11 +204,26 @@ def main():
        else nullcontext()
    )

    attn_impl = "eager" if get_accelerator().name == "npu" else "flash_attention_2"
    with init_ctx:
        model = AutoModelForCausalLM.from_config(config, trust_remote_code=True).to(torch.bfloat16)
        model = AutoModelForCausalLM.from_config(
            config, trust_remote_code=True, attn_implementation=attn_impl, torch_dtype=torch.bfloat16
        ).to(torch.bfloat16)
        if args.enable_lora:
            model = booster.enable_lora(
                model,
                lora_config=LoraConfig(task_type="CAUSAL_LM", target_modules=["gate_proj", "up_proj", "down_proj"]),
            )

    if args.grad_checkpoint:
        model.gradient_checkpointing_enable()
    if config.__class__.__name__.startswith("DeepseekV3"):
        model.config.use_cache = False
        model.eval()
        # enable grad for moe layers
        for m in model.modules():
            if m.__class__.__name__ == "DeepseekV3MoE":
                m.moe_infer = MethodType(m.moe_infer.__wrapped__, m)

    model_numel = get_model_numel(model)
    coordinator.print_on_master(f"Model params: {format_numel_str(model_numel)}")
@@ -228,40 +258,42 @@ def main():
    ) as prof:  # , distributed_debug_mode(10, enable=True):
        if isinstance(plugin, MoeHybridParallelPlugin) and args.pp > 1:
            data_iter = iter(dataloader)
            for step in tqdm(range(len(dataloader)), desc="Step", disable=not coordinator.is_master()):
                performance_evaluator.on_step_start(step)
                outputs = booster.execute_pipeline(
                    data_iter,
                    model,
                    criterion=lambda outputs, inputs: outputs[0],
                    optimizer=optimizer,
                    return_loss=True,
                )
                loss = outputs["loss"]
                if dist.get_rank() == dist.get_world_size() - 1:
                    print(f"Step {step} loss: {loss}")
                optimizer.step()
                optimizer.zero_grad()
            with tqdm(
                range(len(dataloader)), desc="Step", disable=dist.get_rank() != dist.get_world_size() - 1
            ) as pbar:
                for step in pbar:
                    performance_evaluator.on_step_start(step)
                    outputs = booster.execute_pipeline(
                        data_iter,
                        model,
                        criterion=lambda outputs, inputs: outputs[0],
                        optimizer=optimizer,
                        return_loss=True,
                    )
                    loss = outputs["loss"]
                    loss_scalar = loss.item() if loss is not None else None
                    pbar.set_postfix({"loss": loss_scalar})
                    optimizer.step()
                    optimizer.zero_grad()

                performance_evaluator.on_step_end(input_ids=torch.empty(args.batch_size, args.max_length))
                prof.step()
                print(f"rank {dist.get_rank()} step {step} passed")
                    performance_evaluator.on_step_end(input_ids=torch.empty(args.batch_size, args.max_length))
                    prof.step()
        else:
            for step, batch in enumerate(tqdm(dataloader, desc="Step", disable=not coordinator.is_master())):
                performance_evaluator.on_step_start(step)
                outputs = model(**batch)
                loss = outputs[0]
                del outputs  # free memory
            with tqdm(dataloader, desc="Step", disable=not coordinator.is_master()) as pbar:
                for step, batch in enumerate(pbar):
                    performance_evaluator.on_step_start(step)
                    outputs = model(**batch)
                    loss = outputs[0]
                    del outputs  # free memory

                if dist.get_rank() == dist.get_world_size() - 1:
                    print(f"Step {step} loss: {loss}")
                    pbar.set_postfix({"loss": loss.item()})

                booster.backward(loss, optimizer)
                optimizer.step()
                optimizer.zero_grad()
                    booster.backward(loss, optimizer)
                    optimizer.step()
                    optimizer.zero_grad()

                performance_evaluator.on_step_end(**batch)
                prof.step()
                    performance_evaluator.on_step_end(**batch)
                    prof.step()

    performance_evaluator.on_fit_end()
    coordinator.print_on_master(f"Max CUDA memory usage: {get_accelerator().max_memory_allocated()/1024**2:.2f} MB")
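One non-obvious step in the benchmark is `m.moe_infer = MethodType(m.moe_infer.__wrapped__, m)`: the upstream DeepSeek `moe_infer` method is decorated with `torch.no_grad()`, which would block gradients through the MoE experts during training, so the script rebinds the undecorated function exposed via `__wrapped__`. The following is a small self-contained illustration of the same unwrapping trick using a toy module, not the real `DeepseekV3MoE`.

```python
# Hedged sketch: re-enable autograd through a method decorated with torch.no_grad()
# by rebinding its __wrapped__ (undecorated) function, as the benchmark does.
from types import MethodType

import torch
import torch.nn as nn


class ToyExpert(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

    @torch.no_grad()
    def moe_infer(self, x):
        return self.proj(x)


m = ToyExpert()
x = torch.randn(2, 4, requires_grad=True)

assert not m.moe_infer(x).requires_grad  # decorator blocks gradient tracking
m.moe_infer = MethodType(m.moe_infer.__wrapped__, m)
assert m.moe_infer(x).requires_grad      # gradients flow again
```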
@@ -7,6 +7,7 @@ from torch import Tensor
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

from colossalai.cluster import DistCoordinator
from colossalai.utils import get_current_device


def divide(x: float, y: float) -> float:
@@ -29,7 +30,7 @@ def all_reduce_mean(x: float, world_size: int) -> float:
    # tensor = tensor / world_size
    # return tensor.item()

    tensor = torch.tensor([x], device=torch.cuda.current_device(), dtype=torch.float)
    tensor = torch.tensor([x], device=get_current_device(), dtype=torch.float)
    dist.all_reduce(tensor)
    tensor = tensor / world_size
    return tensor.item()
@@ -8,7 +8,7 @@ click
fabric
contexttimer
ninja
torch>=2.2.0,<=2.4.1
torch>=2.2.0,<=2.5.1
safetensors
einops
pydantic
10 tests/conftest.py Normal file
@@ -0,0 +1,10 @@
import gc

from colossalai.accelerator import get_accelerator


def pytest_runtest_setup(item):
    # called before each test runs: free cached accelerator memory and collect garbage
    accelerator = get_accelerator()
    accelerator.empty_cache()
    gc.collect()
@@ -13,7 +13,7 @@ _HID_DIM = 128


class Net(nn.Module):
    def __init__(self, in_dim=_IN_DIM, hid_dim=_HID_DIM, identity=False, dtype=torch.float32):
    def __init__(self, in_dim=_IN_DIM, hid_dim=_HID_DIM, identity=True, dtype=torch.float32):
        super().__init__()
        if identity:
            self.fc0 = nn.Identity()
@@ -30,7 +30,7 @@ class Net(nn.Module):
class TPNet(nn.Module):
    def __init__(
        self,
        fc0=nn.Linear(_IN_DIM, _IN_DIM),
        fc0=nn.Identity(),
        fc1=nn.Linear(_IN_DIM, _HID_DIM),
        fc2=nn.Linear(_HID_DIM, _IN_DIM),
        tp_group=None,
75 tests/kit/model_zoo/transformers/deepseek_v3.py Normal file
@@ -0,0 +1,75 @@
# modified from tests/kit/model_zoo/transformers/mistral.py
from types import MethodType

import torch
import transformers
from transformers import AutoConfig

# ===============================
# Register single-sentence DeepSeek-V3
# ===============================


def data_gen():
    # Generated from the following code snippet
    #
    # from transformers import AutoModelForCausalLM, AutoTokenizer
    # tokenizer = AutoTokenizer.from_pretrained("mixtralai/Mixtral-7B-v0.1")
    # input = 'My favourite condiment is vinegar' (last two words repeated to satisfy length requirement)
    # tokenized_input = tokenizer([input], return_tensors="pt")
    # input_ids = tokenized_input['input_ids']
    # attention_mask = tokenized_input['attention_mask']
    input_ids = torch.tensor([[1, 22, 55, 77, 532, 349, 43, 22]], dtype=torch.int64)
    attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 1, 1]], dtype=torch.int64)
    return dict(input_ids=input_ids, attention_mask=attention_mask)


def data_gen_for_lm():
    # LM data gen
    # for the LM objective, `labels` are the output tokens; since there is no padding, reuse `input_ids` as `labels`
    data = data_gen()
    data["labels"] = data["input_ids"].clone()
    return data


# define output transform function
output_transform_fn = lambda x: x

# define loss function
loss_fn = lambda x: x[0].mean()
loss_fn_for_lm = lambda x: x.loss


def init_deepseek():

    config = AutoConfig.from_pretrained(
        "deepseek-ai/DeepSeek-V3",
        hidden_size=128,
        intermediate_size=320,
        kv_lora_rank=4,
        moe_intermediate_size=32,
        num_attention_heads=4,
        num_experts_per_tok=4,
        n_group=4,
        num_hidden_layers=3,
        num_key_value_heads=4,
        first_k_dense_replace=1,
        q_lora_rank=8,
        torch_dtype="bfloat16",
        n_routed_experts=16,
        topk_group=2,
        v_head_dim=32,
        qk_nope_head_dim=32,
        qk_rope_head_dim=32,
        trust_remote_code=True,
        vocab_size=2048,
    )

    if hasattr(config, "pad_token_id"):
        config.pad_token_id = config.eos_token_id
    model = transformers.AutoModelForCausalLM.from_config(config, trust_remote_code=True)
    # enable grad for moe layers
    for m in model.modules():
        if m.__class__.__name__ == "DeepseekV3MoE":
            m.moe_infer = MethodType(m.moe_infer.__wrapped__, m)
    return model
@@ -126,6 +126,7 @@ def run_fn(stage, shard, offload, model_fn, data_gen_fn, output_transform_fn, lo

        booster.save_lora_as_pretrained(model, model_ckpt_path)
        booster.save_optimizer(optimizer, optimizer_ckpt_path, shard=False)
        dist.barrier()
        new_model = new_booster.enable_lora(new_model, pretrained_dir=model_ckpt_path, lora_config=lora_config)
        new_model, new_optimizer, criterion, _, _ = new_booster.boost(new_model, new_optimizer, criterion)
        check_state_dict_equal(model.state_dict(), new_model.state_dict())
@@ -1,7 +1,7 @@
from colossalai.cluster.device_mesh_manager import DeviceMeshInfo, DeviceMeshManager
from colossalai.initialize import launch
from colossalai.logging import disable_existing_loggers
from colossalai.testing import spawn
from colossalai.testing import rerun_if_address_is_in_use, spawn


def check_device_mesh_manager(rank, world_size, port):
@@ -24,6 +24,7 @@ def check_device_mesh_manager(rank, world_size, port):
    assert device_mesh_with_shape._logical_mesh_id.tolist() == [[0, 1], [2, 3]]


@rerun_if_address_is_in_use()
def test_device_mesh_manager():
    spawn(check_device_mesh_manager, 4)

@@ -6,9 +6,10 @@ from torch.testing import assert_close
from colossalai import launch
from colossalai.accelerator import get_accelerator
from colossalai.quantization.fp8 import all_to_all_single_fp8
from colossalai.testing import parameterize, rerun_if_address_is_in_use, spawn
from colossalai.testing import clear_cache_before_run, parameterize, rerun_if_address_is_in_use, spawn


@clear_cache_before_run()
@parameterize("shape", [(4,), (1, 8, 16), (4, 8, 16)])
@parameterize("dtype", [torch.bfloat16, torch.float16])
@parameterize("async_op", [True, False])
@@ -24,6 +25,7 @@ def check_all2all(shape, dtype, async_op):
    assert_close(output, output_fp8, rtol=0.1, atol=0.1)


@clear_cache_before_run()
@parameterize("shape", [(8, 8, 16)])
@parameterize("dtype", [torch.bfloat16, torch.float16])
@parameterize("async_op", [True, False])

@@ -6,9 +6,10 @@ from torch.testing import assert_close
from colossalai import launch
from colossalai.accelerator import get_accelerator
from colossalai.quantization.fp8 import _all_to_all_fp8
from colossalai.testing import parameterize, rerun_if_address_is_in_use, spawn
from colossalai.testing import clear_cache_before_run, parameterize, rerun_if_address_is_in_use, spawn


@clear_cache_before_run()
@parameterize("shape", [(16, 8, 4)])
@parameterize("scatter_dim", [0, 1, 2])
@parameterize("dtype", [torch.bfloat16, torch.float16])

@@ -6,11 +6,12 @@ from torch.testing import assert_close
from colossalai import launch
from colossalai.accelerator import get_accelerator
from colossalai.quantization.fp8 import all_to_all_single_fp8
from colossalai.testing import parameterize, rerun_if_address_is_in_use, spawn
from colossalai.testing import clear_cache_before_run, parameterize, rerun_if_address_is_in_use, spawn

dist.all_to_all_single


@clear_cache_before_run()
@parameterize("shape", [(4), (8, 7), (4, 8, 16)])
@parameterize("dtype", [torch.bfloat16, torch.float16])
@parameterize("fp8_format", ["e4m3", "e5m2"])

@@ -6,9 +6,10 @@ from torch.testing import assert_close
from colossalai import launch
from colossalai.accelerator import get_accelerator
from colossalai.quantization.fp8 import _all_gather_fp8
from colossalai.testing import parameterize, rerun_if_address_is_in_use, spawn
from colossalai.testing import clear_cache_before_run, parameterize, rerun_if_address_is_in_use, spawn


@clear_cache_before_run()
@parameterize(
    "shape",
    [(3, 7, 16)],

@@ -5,7 +5,7 @@ from torch.testing import assert_close
from colossalai import launch
from colossalai.accelerator import get_accelerator
from colossalai.quantization.fp8 import all_reduce_fp8
from colossalai.testing import parameterize, rerun_if_address_is_in_use, spawn
from colossalai.testing import clear_cache_before_run, parameterize, rerun_if_address_is_in_use, spawn


@parameterize(
@@ -20,6 +20,7 @@ from colossalai.testing import parameterize, rerun_if_address_is_in_use, spawn
        (8,),
    ],
)
@clear_cache_before_run()
@parameterize("dtype", [torch.float16, torch.bfloat16])
@parameterize("fp8_format", ["e4m3", "e5m2"])
@parameterize("async_op", [True, False])

@@ -3,9 +3,10 @@ from torch.testing import assert_close

from colossalai.accelerator import get_accelerator
from colossalai.quantization.fp8 import cast_from_fp8, cast_from_fp8_pipeline, cast_to_fp8, cast_to_fp8_pipeline
from colossalai.testing import parameterize
from colossalai.testing import clear_cache_before_run, parameterize


@clear_cache_before_run()
@parameterize("shape", [(100, 10), (10, 100), (3, 7), (2, 1), (1, 2), (2, 2), (4, 2), (5,), (4,), (2,)])
@parameterize("dtype", [torch.bfloat16, torch.float16, torch.float32])
@parameterize("fp8_format", ["e4m3", "e5m2"])

@@ -8,7 +8,7 @@ from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.testing import assert_close

from colossalai import launch
from colossalai.testing import parameterize, rerun_if_address_is_in_use, spawn
from colossalai.testing import clear_cache_before_run, parameterize, rerun_if_address_is_in_use, spawn

# example modified from https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

@@ -28,6 +28,7 @@ class ToyModel(nn.Module):
        return self.net2(self.relu(self.net1(x)))


@clear_cache_before_run()
@parameterize("mode", ["grad", "params"])
def run_model(mode):
    rank = dist.get_rank()

@@ -6,9 +6,10 @@ from torch.testing import assert_close
from colossalai import launch
from colossalai.accelerator import get_accelerator
from colossalai.quantization.fp8 import reduce_scatter_fp8
from colossalai.testing import parameterize, rerun_if_address_is_in_use, spawn
from colossalai.testing import clear_cache_before_run, parameterize, rerun_if_address_is_in_use, spawn


@clear_cache_before_run()
@parameterize("shape", [(16, 8, 4)])
@parameterize("scatter_dim", [0, 1, 2])
@parameterize("dtype", [torch.bfloat16, torch.float16])
@@ -1,10 +1,13 @@
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.testing import assert_close

import colossalai
from colossalai.shardformer.layer.utils import Randomizer
from colossalai.tensor.d_tensor import get_layout, get_sharding_spec, is_distributed_tensor
from colossalai.tensor.d_tensor.api import clear_layout_converter
from colossalai.tensor.d_tensor.sharding_spec import DimSpec
from colossalai.testing import parameterize, spawn
from tests.kit.model_zoo import model_zoo
from tests.test_shardformer.test_model._utils import (
@@ -15,6 +18,88 @@ from tests.test_shardformer.test_model._utils import (
)


def force_assign_grad(p, g_dtype, grad=None):
    """Bypass the inconsistent grad and param dtype error when assigning a grad."""
    orig_p = p.data
    p.data = torch.randn_like(p, device=orig_p.device, dtype=g_dtype) if grad is None else grad.clone().to(g_dtype)
    p.grad = p.data
    p.data = orig_p


def setup_param_groups(model: nn.Module) -> list:
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": 0.1,
        },
        {
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
    return optimizer_grouped_parameters


# setup flatten param groups, sharding spec and shape; (for dist Adafactor and CAME)
def setup_flatten_param_groups_sharding_spec_shape(model: nn.Module) -> dict:
    flatten_optimizer_grouped_parameters = []
    sharding_spec = {}  # {id(flatten param): get_sharding_spec(p)}
    param_shape = {}  # {id(flatten param): get_layout(p).global_shape}
    for n, p in model.named_parameters():
        # flatten_p = copy.deepcopy(p).flatten()
        flatten_p = nn.Parameter(p.clone().flatten().requires_grad_(True))
        flatten_optimizer_grouped_parameters.append(flatten_p)
        if is_distributed_tensor(p):
            sharding_spec[id(flatten_p)] = get_sharding_spec(p)
            param_shape[id(flatten_p)] = get_layout(p).global_shape
        else:
            sharding_spec[id(flatten_p)] = None
            param_shape[id(flatten_p)] = p.shape
    return flatten_optimizer_grouped_parameters, sharding_spec, param_shape


def set_master_param_to_shard_param(master_param_list) -> dict:
    master_param_to_shard_param = {id(p): p for p in master_param_list}
    return master_param_to_shard_param


def set_dist_grad(
    dist_module: nn.Module,
    torch_model: nn.Module,
    g_dtype: torch.dtype,
    group: dist.ProcessGroup,
    tp_spec: DimSpec,
) -> None:
    """
    Set split grads for Tensor Parallel or ZeRO DP.
    We do not need a separate treatment for ZeRO,
    as the wrapper takes care of reduce-scattering grads.
    """
    rank = dist.get_rank(group)
    world_size = dist.get_world_size(group)

    for p, torch_p in zip(dist_module.parameters(), torch_model.parameters()):
        if torch_p.grad is None:
            torch_p.grad = torch.zeros_like(torch_p)

        is_distributed = hasattr(p, "dist_layout")
        if is_distributed:
            sharding = p.dist_layout.sharding_spec.sharding_sequence
            split_dim = sharding.index(tp_spec)
            shape = torch_p.split(world_size, dim=split_dim)[rank].shape

            indices = torch.arange(shape[split_dim] * rank, shape[split_dim] * (rank + 1))
            # Generate grads only for the correctly split chunk
            torch_p.grad.index_add_(split_dim, indices, torch.randn(shape, device=torch_p.device, dtype=g_dtype))

        else:
            shape = torch_p.shape
            torch_p.grad += torch.randn(shape, device=torch_p.device, dtype=g_dtype)

        force_assign_grad(p, g_dtype, grad=torch_p.grad)


def check_optim_states(org_optim, sharded_optim):
    for group in org_optim.param_groups:
        for p in group["params"]:
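`force_assign_grad` exists because PyTorch rejects `p.grad = g` when `g.dtype` differs from `p.dtype`; temporarily swapping `p.data` to the grad dtype sidesteps that check, so the mixed-precision optimizer tests can attach, say, bf16 grads to fp32 master parameters. A short usage sketch (toy parameter, arbitrary values):

```python
# Hedged sketch: attach a bf16 gradient to an fp32 parameter via force_assign_grad.
# Directly setting fp32_param.grad = bf16_tensor would raise a dtype mismatch error.
import torch
import torch.nn as nn

param = nn.Parameter(torch.zeros(4, dtype=torch.float32))
ref_grad = torch.ones(4)

force_assign_grad(param, torch.bfloat16, grad=ref_grad)

assert param.dtype == torch.float32        # parameter storage is untouched
assert param.grad.dtype == torch.bfloat16  # gradient carries the low-precision dtype
```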
@@ -8,6 +8,7 @@ from torch.optim import Adam, AdamW

from colossalai.nn.optimizer import CPUAdam, FusedAdam, HybridAdam
from tests.kit.model_zoo import model_zoo
from tests.test_optimizer._utils import force_assign_grad, setup_param_groups

_ALLOWED_OPTIM_DEVICES = [
    (FusedAdam, torch.device("cuda:0")),
@@ -26,29 +27,11 @@ _ALLOWED_P_G_TYPES = [
N_STEPS = 3


def setup_param_groups(bert_model: nn.Module) -> list:
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in bert_model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": 0.1,
        },
        {
            "params": [p for n, p in bert_model.named_parameters() if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
    return optimizer_grouped_parameters


def set_grad(model: nn.Module, torch_model: nn.Module, g_dtype: torch.dtype) -> None:
    for p, torch_p in zip(model.parameters(), torch_model.parameters()):
        torch_p.grad = torch.rand_like(torch_p)
        # avoid inconsistent grad and param dtype error
        orig_p = p.data
        p.data = torch_p.grad.clone().to(g_dtype)
        p.grad = p.data
        p.data = orig_p
        force_assign_grad(p, g_dtype, torch_p.grad)


@pytest.mark.parametrize("optim_cls, device", _ALLOWED_OPTIM_DEVICES)
Some files were not shown because too many files have changed in this diff.