[MoE/ZeRO] Moe refactor with zero refactor (#5821)

* [moe] removed openmoe-coupled code and rectify mixstral code (#5471) * [Feauture] MoE refractor; Intergration with Mixtral (#5682) * cherry pick from refractor-moe branch * tests passed * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support ep + zero --------- Co-authored-by: Edenzzzz <wtan45@wisc.edu> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add mixtral auto policy & move pipeline forward code to modeling folder * [moe refactor] modify kernel test without Route Class * [moe refactor] add moe tensor test path environment variable to github workflow * fix typos * fix moe test bug due to the code rebase * [moe refactor] fix moe zero test, and little bug in low level zero * fix typo * add moe tensor path to github workflow * remove some useless code * fix typo & unify global variable XX_AXIS logic without using -1 * fix typo & prettifier the code * remove print code & support zero 2 test * remove useless code * reanme function * fix typo * fix typo * Further improve the test code * remove print code * [moe refactor] change test model from fake moe model to mixtral moe layer and remove useless test * [moe refactor] skip some unit test which will be refactored later * [moe refactor] fix unit import error * [moe refactor] fix circular import issues * [moe refactor] remove debug code * [moe refactor] update github workflow * [moe/zero] refactor low level optimizer (#5767) * [zero] refactor low level optimizer * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [Feature] MoE refactor with newest version of ZeRO (#5801) * [zero] remove redundant members in BucketStore (#5802) * [zero] align api with previous version * [Moe/Zero] Update MoeHybridParallelPlugin with refactored ZeRO and Fix Zero bug (#5819) * [moe refactor] update unit test with the refactored ZeRO and remove useless test * move moe checkpoint to checkpoint folder and exchange global axis to class member * update moe hybrid parallel plugin with newest version of zero & fix zero working/master params bug * fix zero unit test * Add an assertion to prevent users from using it incorrectly * [hotfix]Solve the compatibility issue of zero refactor (#5823) * [moe refactor] update unit test with the refactored ZeRO and remove useless test * move moe checkpoint to checkpoint folder and exchange global axis to class member * update moe hybrid parallel plugin with newest version of zero & fix zero working/master params bug * fix zero unit test * Add an assertion to prevent users from using it incorrectly * Modify function parameter names to resolve compatibility issues * [zero] fix missing hook removal (#5824) * [MoE] Resolve .github conflict (#5829) * [Fix/Example] Fix Llama Inference Loading Data Type (#5763) * [fix/example] fix llama inference loading dtype * revise loading dtype of benchmark llama3 * [release] update version (#5752) * [release] update version * [devops] update compatibility test * [devops] update compatibility test * [devops] update compatibility test * [devops] update compatibility test * [test] fix ddp plugin test * [test] fix gptj and rpc test * [devops] fix cuda ext compatibility * [inference] fix flash decoding test * [inference] fix flash decoding test * fix (#5765) * [test] Fix/fix testcase (#5770) * [fix] branch for fix testcase; * [fix] fix test_analyzer & test_auto_parallel; * [fix] remove local change about moe; * [fix] rm local change moe; * [Hotfix] Add missing init file in inference.executor (#5774) * [CI/tests] simplify some test case to reduce testing time (#5755) * [ci/tests] simplify some test case to reduce testing time * [ci/tests] continue to remove test case to reduce ci time cost * restore some test config * [ci/tests] continue to reduce ci time cost * [misc] update dockerfile (#5776) * [misc] update dockerfile * [misc] update dockerfile * [devops] fix docker ci (#5780) * [Inference]Add Streaming LLM (#5745) * Add Streaming LLM * add some parameters to llama_generation.py * verify streamingllm config * add test_streamingllm.py * modified according to the opinions of review * add Citation * change _block_tables tolist * [hotfix] fix llama flash attention forward (#5777) * [misc] Accelerate CI for zero and dist optim (#5758) * remove fp16 from lamb * remove d2h copy in checking states --------- Co-authored-by: Edenzzzz <wtan45@wisc.edu> * [Test/CI] remove test cases to reduce CI duration (#5753) * [test] smaller gpt2 test case * [test] reduce test cases: tests/test_zero/test_gemini/test_zeroddp_state_dict.py * [test] reduce test cases: tests/test_zero/test_gemini/test_grad_accum.py * [test] reduce test cases tests/test_zero/test_gemini/test_optim.py * Revert "[test] smaller gpt2 test case" Some tests might depend on the size of model (num of chunks) This reverts commit df705a5210. * [test] reduce test cases: tests/test_checkpoint_io/test_gemini_checkpoint_io.py * [CI] smaller test model for two mwo the two modifid cases * [CI] hardcode gpt model for tests/test_zero/test_gemini/test_search.py since we need a fixed answer there * [hotfix] fix testcase in test_fx/test_tracer (#5779) * [fix] branch for fix testcase; * [fix] fix test_analyzer & test_auto_parallel; * [fix] remove local change about moe; * [fix] rm local change moe; * [fix] fix test_deepfm_model & test_dlrf_model； * [fix] fix test_hf_albert & test_hf_gpt; * [gemini] optimize reduce scatter d2h copy (#5760) * [gemini] optimize reduce scatter d2h copy * [fix] fix missing reduce variable * [refactor] remove legacy async reduce scatter code * [gemini] missing sync * Revert "[refactor] remove legacy async reduce scatter code" This reverts commit 58ad76d466. * [gemini] further optimize with async all reduce * [fix] pass flag from manager to chunk * Allow building cuda extension without a device. (#5535) Added FORCE_CUDA environment variable support, to enable building extensions where a GPU device is not present but cuda libraries are. * [misc] fix dist logger (#5782) * [install]fix setup (#5786) * fix * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [misc] update requirements (#5787) * [shardformer] fix import (#5788) * upgrade colossal-chat support tp_group>1, add sp for sft * upgrade ppo dpo rm script * run pre-commit * moupdate ci tests, st ci test cases passed, tp failed in generation for ppo, sp is buggy * fix training script * fix ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix transformers version * remove duplicated test * fix datasets version * remove models that require huggingface auth from ci * remove local data path * update ci * remove baichuan from template test due to transformer version conflict * merge * Refactor modeling by adding attention backend Signed-off-by: char-1ee <xingjianli59@gmail.com> * Fix tests and naming Signed-off-by: char-1ee <xingjianli59@gmail.com> * Pass inference model shard configs for module init Signed-off-by: char-1ee <xingjianli59@gmail.com> * Clean up Signed-off-by: char-1ee <xingjianli59@gmail.com> * replace the customized dataloader setup with the build-in one * replace the customized dataloader setup with the build-in one * Remove flash attention backend Signed-off-by: char-1ee <xingjianli59@gmail.com> * fix readme * Fix test import Signed-off-by: char-1ee <xingjianli59@gmail.com> * update sft trainning script * [Inference]refactor baichuan (#5791) * refactor baichuan * remove unused code and add TODO for lazyinit * [test] fix chatglm test kit (#5793) * [shardformer] fix modeling of bloom and falcon (#5796) * [test] fix qwen2 pytest distLarge (#5797) * [Inference] Fix flash-attn import and add model test (#5794) * Fix torch int32 dtype Signed-off-by: char-1ee <xingjianli59@gmail.com> * Fix flash-attn import Signed-off-by: char-1ee <xingjianli59@gmail.com> * Add generalized model test Signed-off-by: char-1ee <xingjianli59@gmail.com> * Remove exposed path to model Signed-off-by: char-1ee <xingjianli59@gmail.com> * Add default value for use_flash_attn Signed-off-by: char-1ee <xingjianli59@gmail.com> * Rename model test Signed-off-by: char-1ee <xingjianli59@gmail.com> --------- Signed-off-by: char-1ee <xingjianli59@gmail.com> * [Gemini] Use async stream to prefetch and h2d data moving (#5781) * use async stream to prefetch and h2d data moving * Remove redundant code * [gemini] quick fix on possible async operation (#5803) * [gemini] quick fix on possible async operation * [gemini] quick fix on possible async operation * [shardformer] upgrade transformers to 4.39.3 (#5815) * [shardformer]upgrade transformers for gpt2/gptj/whisper (#5807) * [shardformer] fix modeling of gpt2 and gptj * [shardformer] fix whisper modeling * [misc] update requirements --------- Co-authored-by: ver217 <lhx0217@gmail.com> * [shardformer]upgrade transformers for mistral (#5808) * upgrade transformers for mistral * fix * fix * [shardformer]upgrade transformers for llama (#5809) * update transformers fix * fix * fix * [inference] upgrade transformers (#5810) * update transformers fix * fix * fix * fix * fix * [gemini] update transformers for gemini (#5814) --------- Co-authored-by: ver217 <lhx0217@gmail.com> * Support 4d parallel + flash attention (#5789) * support tp + sp + pp * remove comments --------- Co-authored-by: Edenzzzz <wtan45@wisc.edu> --------- Signed-off-by: char-1ee <xingjianli59@gmail.com> Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Co-authored-by: Hongxin Liu <lhx0217@gmail.com> Co-authored-by: flybird11111 <1829166702@qq.com> Co-authored-by: duanjunwen <935724073@qq.com> Co-authored-by: yuehuayingxueluo <867460659@qq.com> Co-authored-by: Edenzzzz <wenxuan.tan@wisc.edu> Co-authored-by: Edenzzzz <wtan45@wisc.edu> Co-authored-by: botbw <wang1570@e.ntu.edu.sg> Co-authored-by: Charles Coulombe <ccoulombe@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: YeAnbang <anbangy2@outlook.com> Co-authored-by: char-1ee <xingjianli59@gmail.com> Co-authored-by: Runyu Lu <77330637+LRY89757@users.noreply.github.com> Co-authored-by: YeAnbang <44796419+YeAnbang@users.noreply.github.com> Co-authored-by: Guangyao Zhang <xjtu521@qq.com> * [zero] fix hook bug * [zero] add low level optimizer back (#5839) * [zero] fix param & refactor * [zero] add back original low level opt * [zero] remove moe related * [zero] pass zero tests * [zero] refactor * [chore] add del func back * [zero] comments and naming (#5840) * [zero] modify api (#5843) * [zero] modify api * [test] remove _grad_store access in tests * [test] fix (#5857) * [CI] skip openmoe CI check * [CI] fox pre-commit * [zero] remove redundant memebr init (#5862) * [misc] remove useless code, modify the pg mesh implementation * [misc] remove useless code, modify the pg mesh implementation * [misc] use tempfile * resolve conflict with main branch * [misc] use tempfile in test_moe_checkpoint.py * [misc] remove useless code, add assertion about sequence parallel, move logger into function * [misc] remove useless code --------- Signed-off-by: char-1ee <xingjianli59@gmail.com> Co-authored-by: Frank Lee <somerlee.9@gmail.com> Co-authored-by: Edenzzzz <wenxuan.tan@wisc.edu> Co-authored-by: Edenzzzz <wtan45@wisc.edu> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: botbw <wang1570@e.ntu.edu.sg> Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Co-authored-by: Hongxin Liu <lhx0217@gmail.com> Co-authored-by: flybird11111 <1829166702@qq.com> Co-authored-by: duanjunwen <935724073@qq.com> Co-authored-by: yuehuayingxueluo <867460659@qq.com> Co-authored-by: Charles Coulombe <ccoulombe@users.noreply.github.com> Co-authored-by: YeAnbang <anbangy2@outlook.com> Co-authored-by: char-1ee <xingjianli59@gmail.com> Co-authored-by: Runyu Lu <77330637+LRY89757@users.noreply.github.com> Co-authored-by: YeAnbang <44796419+YeAnbang@users.noreply.github.com> Co-authored-by: Guangyao Zhang <xjtu521@qq.com>
2025-09-02 17:46:42 +00:00 · 2024-06-28 14:00:08 +08:00
parent 773d9f964a
commit 416580b314
69 changed files with 1780 additions and 3076 deletions
--- a/tests/test_moe/test_moe_checkpoint.py
+++ b/tests/test_moe/test_moe_checkpoint.py
@@ -1,201 +1,176 @@
-import importlib
 import os
-import shutil
-import sys
+import tempfile
+from contextlib import nullcontext
+from copy import deepcopy

 import pytest
 import torch
 import torch.distributed as dist
-from transformers.models.llama import LlamaConfig
+from torch.optim import Adam
+from transformers.models.mixtral.configuration_mixtral import MixtralConfig
+from transformers.models.mixtral.modeling_mixtral import MixtralForCausalLM

 import colossalai
-from colossalai.accelerator import get_accelerator
 from colossalai.booster import Booster
 from colossalai.booster.plugin.moe_hybrid_parallel_plugin import MoeHybridParallelPlugin
-from colossalai.testing import DummyDataloader, check_state_dict_equal, rerun_if_address_is_in_use, spawn
+from colossalai.checkpoint_io import MoECheckpointIO
+from colossalai.tensor.moe_tensor.api import is_moe_tensor
+from colossalai.testing.utils import spawn

-sys.path.append(
-    os.path.join(
-        os.path.dirname(os.path.dirname(os.path.dirname(__file__))),
-        "examples/language/openmoe",
-    )
-)
-
-OpenMoeForCausalLM = importlib.import_module("model.modeling_openmoe").OpenMoeForCausalLM
-set_openmoe_args = importlib.import_module("model.modeling_openmoe").set_openmoe_args
-OpenMoeForCausalLMPolicy = importlib.import_module("model.openmoe_policy").OpenMoeForCausalLMPolicy
+tokens, n_experts = 7, 4
+hidden_size = 8
+top_k = 2


-def data_gen_fn(batch_size: int = 2, max_length: int = 4, vocab_size: int = 20):
-    input_ids = torch.randint(0, vocab_size, (batch_size, max_length), device=get_accelerator().get_current_device())
-    attention_mask = torch.ones_like(input_ids)
+def check_model_equal(model1, model2):
+    assert set(model1.state_dict().keys()) == set(model2.state_dict().keys())
+    for i, ((name, p1), p2) in enumerate(zip(model1.named_parameters(), model2.parameters())):
+        if not torch.equal(p1.half(), p2.half()):
+            print(f"Model parameter {name} is not equal. is_moe_tensor: {is_moe_tensor(p1)}")
+            raise AssertionError(f"Model parameter {name} is not equal")
+
+
+def get_optimizer_snapshot(optim):
+    state = {id(k): deepcopy(v) for k, v in optim.state.items()}
+    param_groups = []
+    for group in optim.param_groups:
+        params = [id(p) for p in group["params"]]
+        new_group = {"params": params}
+        for k, v in group.items():
+            if k != "params":
+                new_group[k] = v
+        param_groups.append(new_group)
    return {
-        "input_ids": input_ids,
-        "attention_mask": attention_mask,
-        "labels": input_ids,
+        "state": state,
+        "param_groups": param_groups,
    }


-def run_fwd_bwd(
-    model, data, label, criterion, optimizer, enable_autocast=False, pipeline=False, booster=None, plugin=None
-):
-    model.train()
-    if pipeline:
-        train_dataloader_iter = DummyDataloader(data_gen_fn, length=1)
-        is_pp_last_stage = booster.plugin.stage_manager.is_last_stage()
-        y = booster.execute_pipeline(
-            train_dataloader_iter,
-            model,
-            lambda x, y: x.loss,
-            optimizer,
-            return_loss=True,
-        )
-        # Backward and optimize
-        if is_pp_last_stage:
-            loss = y["loss"]
-    else:
-        if criterion:
-            y = model(data).logits
-            loss = criterion(y)
+def check_optimizer_snapshot_equal(snapshot1, snapshot2, param2name, moe_dp_group=None):
+    assert len(snapshot1["param_groups"]) == len(snapshot2["param_groups"])
+    for group1, group2 in zip(snapshot1["param_groups"], snapshot2["param_groups"]):
+        assert set(group1.keys()) == set(group2.keys())
+        for k in group1.keys():
+            assert group1[k] == group2[k]
+    # check state
+    assert set(snapshot1["state"].keys()) == set(
+        snapshot2["state"].keys()
+    ), f"{snapshot1['state'].keys()}, {snapshot2['state'].keys()}"
+
+    passed = True
+    count = 0
+    for pid in snapshot1["state"].keys():
+        state1, state2 = snapshot1["state"][pid], snapshot2["state"][pid]
+        assert set(state1.keys()) == set(state2.keys())
+        bug = False
+        for k in state1.keys():
+            if isinstance(state1[k], torch.Tensor):
+                if not torch.equal(state1[k], state2[k]):
+                    bug = True
+                    count += 1
+            else:
+                assert state1[k] == state2[k]
+        if bug:
+            passed = False
+
+    if not passed:
+        raise AssertionError(f"A total of {count} optim states are not equal")
+
+
+def check_mixtral_moe_layer():
+    context = tempfile.TemporaryDirectory() if dist.get_rank() == 0 else nullcontext()
+    with context as f:
+        torch.cuda.set_device(dist.get_rank())
+        if dist.get_rank() == 0:
+            broadcast_objects = [f]  # any picklable object
        else:
-            loss = model(data, label)
-        loss = loss.float()
+            broadcast_objects = [None]
+        dist.broadcast_object_list(broadcast_objects, src=0)

-        if optimizer is not None:
-            optimizer.backward(loss)
-        else:
-            loss.backward()
-    return y
-
-
-def get_config():
-    config = LlamaConfig(
-        vocab_size=300,
-        hidden_size=16,
-        intermediate_size=32,
-        num_hidden_layers=2,
-        num_attention_heads=2,
-        head_dim=4,
-        dropout_rate=0.0,
-        hidden_act="swiglu",
-    )
-    set_openmoe_args(config, num_experts=8, moe_layer_interval=1)
-    return config
-
-
-def get_model(parallel):
-    config = get_config()
-    model = OpenMoeForCausalLM(config)
-    optim = torch.optim.Adam(model.parameters())
-
-    if parallel == None:
-        plugin = MoeHybridParallelPlugin(
-            precision="bf16",
-            tp_size=1,
-            pp_size=1,
-            ep_size=1,
-            zero_stage=2,
-            custom_policy=OpenMoeForCausalLMPolicy(),
+        config = MixtralConfig(
+            hidden_size=hidden_size,
+            intermediate_size=hidden_size * 2,
+            num_local_experts=n_experts,
+            num_experts_per_tok=top_k,
+            num_attention_heads=2,
+            num_key_value_heads=2,
        )
-    elif parallel == "ep":
+        torch.manual_seed(0)
+        input_ids = torch.randint(0, 100, (2, tokens)).cuda()
+        orig_model = MixtralForCausalLM(config).cuda()
+        model = deepcopy(orig_model)
+        optimizer = Adam(model.parameters(), lr=1e-3)
        plugin = MoeHybridParallelPlugin(
-            precision="bf16",
-            tp_size=1,
-            pp_size=1,
-            ep_size=dist.get_world_size(),
-            zero_stage=2,
-            custom_policy=OpenMoeForCausalLMPolicy(),
-        )
-    elif parallel == "ep_zero":
-        plugin = MoeHybridParallelPlugin(
-            precision="bf16",
-            tp_size=1,
-            pp_size=1,
-            ep_size=2,
-            zero_stage=2,
-            extra_dp_size=2,
-            custom_policy=OpenMoeForCausalLMPolicy(),
-        )
-    elif parallel == "hybrid":
-        plugin = MoeHybridParallelPlugin(
-            precision="bf16",
-            tp_size=1,
            pp_size=2,
            ep_size=2,
-            zero_stage=1,
+            tp_size=1,
+            checkpoint_io=MoECheckpointIO,
            microbatch_size=1,
-            custom_policy=OpenMoeForCausalLMPolicy(),
+            zero_stage=1,
        )
-    booster = Booster(plugin=plugin)
-    model, optim, _, _, _ = booster.boost(model=model, optimizer=optim)
-    return model, booster, optim
+        booster = Booster(plugin=plugin)
+        model, optimizer, *_ = booster.boost(model=model, optimizer=optimizer)
+        # initialize grads
+        data_iter = iter(
+            [{"input_ids": input_ids, "attention_mask": torch.ones_like(input_ids), "labels": input_ids.clone()}]
+        )
+        booster.execute_pipeline(
+            data_iter,
+            model,
+            lambda outputs, inputs: outputs.loss,
+            optimizer,
+        )
+
+        tmpdirname = broadcast_objects[0]
+        model_dir = os.path.join(tmpdirname, "mixtral_model")
+        hf_model_dir = os.path.join(tmpdirname, "mixtral_hf_model")
+        optim_dir = os.path.join(tmpdirname, "mixtral_optim")
+
+        booster.save_model(model, model_dir, shard=True)
+        dist.barrier()
+        if dist.get_rank() == 0:
+            saved_model = MixtralForCausalLM.from_pretrained(model_dir).cuda()
+            check_model_equal(orig_model, saved_model)
+            # check_model_equal(model, saved_model)
+            saved_model.save_pretrained(hf_model_dir)
+        dist.barrier()
+        # check load model
+        new_model = MixtralForCausalLM(config).cuda()
+        new_optimizer = Adam(new_model.parameters(), lr=1e-3)
+        new_model, new_optimizer, *_ = booster.boost(model=new_model, optimizer=new_optimizer)
+        booster.load_model(new_model, hf_model_dir)
+        check_model_equal(model, new_model)
+
+        # check save optimizer
+        optimizer.step()
+        for group in optimizer.param_groups:
+            group["lr"] = 0.1
+        snapshot = get_optimizer_snapshot(optimizer.unwrap())
+        booster.save_optimizer(optimizer, optim_dir, shard=True)
+        dist.barrier()
+
+        # reset optimizer state
+        for state in optimizer.unwrap().state.values():
+            for v in state.values():
+                if isinstance(v, torch.Tensor):
+                    v.zero_()
+        booster.load_optimizer(optimizer, optim_dir)
+        loaded_snapshot = get_optimizer_snapshot(optimizer.unwrap())
+        check_optimizer_snapshot_equal(snapshot, loaded_snapshot, None, model)
+        # Ensure rank 0 waits for all other ranks to finish
+        dist.barrier()


-def _test_moe_checkpoint(rank, parallel):
-    model1, booster1, optim1 = get_model(parallel)
-    model2, booster2, optim2 = get_model(parallel)
-    model3, booster3, optim3 = get_model(parallel)
-
-    # param ckpt
-    # shard
-    booster1.save_model(model1, "./tmp_ckpt1", shard=True, size_per_shard=1)
-    booster2.load_model(model2, "./tmp_ckpt1")
-    # unshard
-    booster1.save_model(model1, "./tmp_ckpt1.pth")
-    booster3.load_model(model3, "./tmp_ckpt1.pth")
-    # check
-    check_state_dict_equal(model1.state_dict(), model2.state_dict(), False)
-    check_state_dict_equal(model1.state_dict(), model3.state_dict(), False)
-
-    # optim ckpt
-    criterion = lambda x: x.mean()
-    data = torch.randint(0, 4, (2, 4)).cuda()
-    label = torch.randint(0, 4, (2,)).cuda()
-    if parallel == "hybrid":
-        kwargs = {"pipeline": True, "booster": booster1, "plugin": booster1.plugin}
-    else:
-        kwargs = {}
-    run_fwd_bwd(model1, data, label, criterion, optim1, **kwargs)
-    optim1.step()
-    optim1.zero_grad()
-    # shard
-    booster1.save_optimizer(optim1, "./tmp_ckpt2", shard=True, size_per_shard=1)
-    dist.barrier()
-    booster2.load_optimizer(optim2, "./tmp_ckpt2")
-    # unshard
-    booster1.save_optimizer(optim1, "./tmp_ckpt2.pth")
-    booster3.load_optimizer(optim3, "./tmp_ckpt2.pth")
-    # check
-    check_state_dict_equal(optim1.optim.state_dict(), optim2.optim.state_dict(), False)
-    check_state_dict_equal(optim1.optim.state_dict(), optim3.optim.state_dict(), False)
-
-    if dist.get_rank() == 0:
-        shutil.rmtree("./tmp_ckpt1")
-        shutil.rmtree("./tmp_ckpt2")
-        os.remove("./tmp_ckpt1.pth")
-        os.remove("./tmp_ckpt2.pth")
+def run_dist(rank: int, world_size: int, port: int):
+    colossalai.launch(rank, world_size, "localhost", port)
+    check_mixtral_moe_layer()


-def _run_dist(rank, world_size, port, parallel):
-    colossalai.launch(
-        config=dict(),
-        rank=rank,
-        world_size=world_size,
-        host="localhost",
-        port=port,
-        backend="nccl",
-    )
-    _test_moe_checkpoint(rank, parallel)
-
-
-@pytest.mark.skip(reason="This is tested in ColossalMOE")
-@pytest.mark.dist
+# Test EP + ZeRO + PP
@pytest.mark.parametrize("world_size", [4])
-@pytest.mark.parametrize("parallel", [None, "ep", "ep_zero", "hybrid"])
-@rerun_if_address_is_in_use()
-def test_moe_checkpoint(world_size, parallel):
-    spawn(_run_dist, world_size, parallel=parallel)
+def test_mixtral_moe_layer(world_size: int):
+    spawn(run_dist, world_size)


 if __name__ == "__main__":
-    test_moe_checkpoint(world_size=4, parallel="hybrid")
+    test_mixtral_moe_layer(4)