[Feature] LoRA rebased to main branch (#5622)

* [Inference]ADD Bench Chatglm2 script (#4963) * add bench chatglm * fix bug and make utils --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [Pipeline inference] Combine kvcache with pipeline inference (#4938) * merge kvcache with pipeline inference and refactor the code structure * support ppsize > 2 * refactor pipeline code * do pre-commit * modify benchmark * fix bench mark * polish code * add docstring and update readme * refactor the code * fix some logic bug of ppinfer * polish readme * fix typo * skip infer test * updated c++17 compiler flags (#4983) * [Inference] Dynamic Batching Inference, online and offline (#4953) * [inference] Dynamic Batching for Single and Multiple GPUs (#4831) * finish batch manager * 1 * first * fix * fix dynamic batching * llama infer * finish test * support different lengths generating * del prints * del prints * fix * fix bug --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [inference] Async dynamic batching (#4894) * finish input and output logic * add generate * test forward * 1 * [inference]Re push async dynamic batching (#4901) * adapt to ray server * finish async * finish test * del test --------- Co-authored-by: yuehuayingxueluo <867460659@qq.com> * Revert "[inference]Re push async dynamic batching (#4901)" (#4905) This reverts commit fbf3c09e67. * Revert "[inference] Async dynamic batching (#4894)" This reverts commit fced140250. * Revert "[inference] Async dynamic batching (#4894)" (#4909) This reverts commit fced140250. 
* Add Ray Distributed Environment Init Scripts * support DynamicBatchManager base function * revert _set_tokenizer version * add driver async generate * add async test * fix bugs in test_ray_dist.py * add get_tokenizer.py * fix code style * fix bugs about No module named 'pydantic' in ci test * fix bugs in ci test * fix bugs in ci test * fix bugs in ci test * [infer]Add Ray Distributed Environment Init Scripts (#4911) * Revert "[inference] Async dynamic batching (#4894)" This reverts commit fced140250. * Add Ray Distributed Environment Init Scripts * support DynamicBatchManager base function * revert _set_tokenizer version * add driver async generate * add async test * fix bugs in test_ray_dist.py * add get_tokenizer.py * fix code style * fix bugs about No module named 'pydantic' in ci test * fix bugs in ci test * fix bugs in ci test * fix bugs in ci test * support dynamic batch for bloom model and is_running function * [Inference]Test for new Async engine (#4935) * infer engine * infer engine * test engine * test engine * new manager * change step * add * test * fix * fix * finish test * finish test * finish test * finish test * add license --------- Co-authored-by: yuehuayingxueluo <867460659@qq.com> * add assertion for config (#4947) * [Inference] Finish dynamic batching offline test (#4948) * test * fix test * fix quant * add default * fix * fix some bugs * fix some bugs * fix * fix bug * fix bugs * reset param --------- Co-authored-by: yuehuayingxueluo <867460659@qq.com> Co-authored-by: Cuiqing Li <lixx3527@gmail.com> Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention (#4965) * adding flash-decoding * clean * adding kernel * adding flash-decoding * add integration * add * adding kernel * adding kernel * adding triton 2.1.0 features for inference * update bloom triton kernel * remove useless vllm kernels * clean codes * fix * adding files * fix readme * update llama 
flash-decoding --------- Co-authored-by: cuiqing.li <lixx336@gmail.com> * fix ColossalEval (#4992) Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com> * [doc]Update doc for colossal-inference (#4989) * update doc * Update README.md --------- Co-authored-by: cuiqing.li <lixx336@gmail.com> * [hotfix] Fix the bug where process groups were not being properly released. (#4940) * Fix the bug where process groups were not being properly released. * test * Revert "test" This reverts commit 479900c139. * [hotfix] fix the bug of repeatedly storing param group (#4951) * [doc] add supported feature diagram for hybrid parallel plugin (#4996) * [Pipeline Inference] Merge pp with tp (#4993) * refactor pipeline into new CaiInferEngine * updata llama modeling forward * merge tp with pp * update docstring * optimize test workflow and example * fix typo * add assert and todo * [release] update version (#4995) * [release] update version * [hotfix] fix ci * [moe] merge moe into main (#4978) * update moe module * support openmoe * [hotfix] fix grad accumulation plus clipping for gemini (#5002) * [hotfix] Add layer norm gradients all-reduce for sequence parallel (#4926) * [hotfix] Add layer norm gradients all-reduce for sequence parallel. (#4915) * Add layer norm gradients all-reduce for sequence parallel. * skip pipeline inference test * [hotfix] fixing polices of sequence parallel (#4922) * Add layer norm gradients all-reduce for sequence parallel. * fix parameter passing when calling get_autopolicy --------- Co-authored-by: littsk <1214689160@qq.com> * Hotfix/add grad all reduce for sequence parallel (#4927) * Add layer norm gradients all-reduce for sequence parallel. 
* fix parameter passing when calling get_autopolicy * fix bug using wrong variables --------- Co-authored-by: littsk <1214689160@qq.com> * fix policy initialization * fix bloom and chatglm policices * polish code of handling layernorm * fix moe module * polish code of class initializing --------- Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com> * [format] applied code formatting on changed files in pull request 4926 (#5007) Co-authored-by: github-actions <github-actions@github.com> * [Inference] Fix bug in ChatGLM2 Tensor Parallelism (#5014) * fix bug * fix * fix multiquery * fix multiquery --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [misc] add code owners (#5024) * [moe] support optimizer checkpoint (#5015) * Refactor MoE Manager setup method * unshard optim ckpt * optim io * update transformer version * update requirements * update ckpt * update ckpt * update ckpt * fix engine * fix engine * Support mtbench (#5025) Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com> * [moe]: fix ep/tp tests, add hierarchical all2all (#4982) * fix: add warning for EP different behavior * fix: use shard_data in ep & tp model * to: add used_capacity * fix: fix router test * feat: add create_ep_node_group * feat: add create_ep_hierarchical_group fn * feat: add HierarchicalAllToAll * test: add hierarchical all2all test * fix: fix test errors * fix: simplify create_ep_hierarchical_group * fix: add hierarchical_alltoall arg * fix: fix environ typo * revert: revert process mesh order * to: add todo mark * fix: skip hierarchical_comm if torch < 1.13.1 * [shardformer] Fix serialization error with Tensor Parallel state saving (#5018) * Fix serialization error with Tensor Parallel state saving * Refactor state_dict CPU transfer using tree_map * [gemini] gemini support tensor parallelism. 
(#4942) * [colossalai]fix typo * [inference] Add smoothquant for llama (#4904) * [inference] add int8 rotary embedding kernel for smoothquant (#4843) * [inference] add smoothquant llama attention (#4850) * add smoothquant llama attention * remove useless code * remove useless code * fix import error * rename file name * [inference] add silu linear fusion for smoothquant llama mlp (#4853) * add silu linear * update skip condition * catch smoothquant cuda lib exception * process exception for tests * [inference] add llama mlp for smoothquant (#4854) * add llama mlp for smoothquant * fix down out scale * remove duplicate lines * add llama mlp check * delete useless code * [inference] add smoothquant llama (#4861) * add smoothquant llama * fix attention accuracy * fix accuracy * add kv cache and save pretrained * refactor example * delete smooth * refactor code * [inference] add smooth function and delete useless code for smoothquant (#4895) * add smooth function and delete useless code * update datasets * remove duplicate import * delete useless file * refactor codes (#4902) * refactor code * add license * add torch-int and smoothquant license * Update flash_attention_patch.py To be compatible with the new change in the Transformers library, where a new argument 'padding_mask' was added to the forward function of the attention layer.
https://github.com/huggingface/transformers/pull/25598 * [kernel] support pure fp16 for cpu adam and update gemini optim tests (#4921) * [kernel] support pure fp16 for cpu adam (#4896) * [kernel] fix cpu adam kernel for pure fp16 and update tests (#4919) * [kernel] fix cpu adam * [test] update gemini optim test * [format] applied code formatting on changed files in pull request 4908 (#4918) Co-authored-by: github-actions <github-actions@github.com> * [gemini] support gradient accumulation (#4869) * add test * fix no_sync bug in low level zero plugin * fix test * add argument for grad accum * add grad accum in backward hook for gemini * finish implementation, rewrite tests * fix test * skip stuck model in low level zero test * update doc * optimize communication & fix gradient checkpoint * modify doc * cleaning codes * update cpu adam fp16 case * [hotfix] fix torch 2.0 compatibility (#4936) * [hotfix] fix launch * [test] fix test gemini optim * [shardformer] fix vit * [test] add no master test for low level zero plugin (#4934) * [format] applied code formatting on changed files in pull request 4820 (#4886) Co-authored-by: github-actions <github-actions@github.com> * [nfc] fix some typo with colossalai/ docs/ etc. 
(#4920) * [Refactor] Integrated some lightllm kernels into token-attention (#4946) * add some req for inference * clean codes * add codes * add some lightllm deps * clean codes * hello * delete rms files * add some comments * add comments * add doc * add lightllm deps * add lightllm cahtglm2 kernels * add lightllm cahtglm2 kernels * replace rotary embedding with lightllm kernel * add some commnets * add some comments * add some comments * add * replace fwd kernel att1 * fix a arg * add * add * fix token attention * add some comments * clean codes * modify comments * fix readme * fix bug * fix bug --------- Co-authored-by: cuiqing.li <lixx336@gmail.com> Co-authored-by: CjhHa1 <cjh18671720497@outlook.com> * [test] merge old components to test to model zoo (#4945) * [test] add custom models in model zoo * [test] update legacy test * [test] update model zoo * [test] update gemini test * [test] remove components to test * [inference] add reference and fix some bugs (#4937) * add reference and fix some bugs * update gptq init --------- Co-authored-by: Xu Kai <xukai16@foxamil.com>
* [gemini] gemini support tp * fix * update checkpointIO * support fused layernorm * update fusedlayernorm * add sequence parallel to gemini * fix * fix comments * fix * fix t5 * clear cache * fix * activate ci * activate ci * fix * fix * fix * fix * revert * modify tp gather method * fix test --------- Co-authored-by: Xu Kai <xukai16@foxmail.com> Co-authored-by: Zian(Andy) Zheng <62330719+Orion-Zheng@users.noreply.github.com> Co-authored-by: Hongxin Liu <lhx0217@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn> Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com> Co-authored-by: digger yu <digger-yu@outlook.com> Co-authored-by: Cuiqing Li <lixx3527@gmail.com> Co-authored-by: cuiqing.li <lixx336@gmail.com> Co-authored-by: CjhHa1 <cjh18671720497@outlook.com> Co-authored-by: Xu Kai <xukai16@foxamil.com> Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com> Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com> Co-authored-by: yuehuayingxueluo <867460659@qq.com> Co-authored-by: Yuanchen <70520919+chengeharrison@users.noreply.github.com> Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com> Co-authored-by: littsk <1214689160@qq.com> Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com> * [hotfix] Suport extra_kwargs in ShardConfig (#5031) * [refactor]: replace inference args with extra_kwargs in ShardConfig * modify shardconfig * polish code * fix policy bug in llama * fix bug in auto policy * remove setattr in ShardConfig * fix wrong EOS token in ColossalChat * [Kernels]Update triton kernels into 2.1.0 (#5046) * update flash-context-attention * adding kernels * fix * reset * add build script * add building process * add llama2 exmaple * add colossal-llama2 test * clean * fall back test setting * fix test file * clean * clean * clean --------- Co-authored-by: cuiqing.li <lixx336@gmail.com> * [pipeline,shardformer] Fix p2p efficiency in pipeline, allow skipping loading weight not in weight_map when `strict=False`, fix llama flash attention forward, add flop estimation by megatron in llama benchmark (#5017) * Use p2p * Cannot bidirectonal send p2p * Refactor tensor creation and serialization in P2P communication * Fix llama forward args in flash attention * Add flop estimate from megatron * Support loading weight not in weight_map when strict=False in hybrid_parallel * Use send_forward_recv_backward, etc in 1f1b * Use dataclass for metdata Remove torch.cuda.synchronize() as suggested * Add comment about the torch.cuda.synchronize for potential error * Typo * Update hybrid_parallel_checkpoint_io.py * Update p2p.py * Update one_f_one_b.py * Update p2p.py --------- Co-authored-by: flybird11111 <1829166702@qq.com> * [gemini] gemini support extra-dp (#5043) * support ddp * fix * fix * fix fix * support ddp * fix * fix * fix fix * 
simplify tests * fix * fix * fix fix fix * fix * [shardformer] fix llama error when transformers upgraded. (#5055) * fix-llama * Update llama.py * [hotfix]: modify create_ep_hierarchical_group and add test (#5032) * feat: modify create_ep_hierarchical_group args * test: add ep tests * fix: remove get_process_group_ranks * fix: fix src_rank * [exampe] fix llama example' loss error when using gemini plugin (#5060) fix llama example * [inference] Refactor inference architecture (#5057) * [inference] support only TP (#4998) * support only tp * enable tp * add support for bloom (#5008) * [refactor] refactor gptq and smoothquant llama (#5012) * refactor gptq and smoothquant llama * fix import error * fix linear import torch-int * fix smoothquant llama import error * fix import accelerate error * fix bug * fix import smooth cuda * fix smoothcuda * [Inference Refactor] Merge chatglm2 with pp and tp (#5023) merge chatglm with pp and tp * [Refactor] remove useless inference code (#5022) * remove useless code * fix quant model * fix test import bug * mv original inference legacy * fix chatglm2 * [Refactor] refactor policy search and quant type controlling in inference (#5035) * [Refactor] refactor policy search and quant type controling in inference * [inference] update readme (#5051) * update readme * update readme * fix architecture * fix table * fix table * [inference] udpate example (#5053) * udpate example * fix run.sh * fix rebase bug * fix some errors * update readme * add some features * update interface * update readme * update benchmark * add requirements-infer --------- Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com> Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com> * [Kernels]added flash-decoidng of triton (#5063) * added flash-decoidng of triton based on lightllm kernel * add req * clean * clean * delete build.sh --------- Co-authored-by: cuiqing.li <lixx336@gmail.com> * [misc] remove outdated submodule (#5070) * [npu] add npu support for 
gemini and zero (#5067) * [npu] setup device utils (#5047) * [npu] add npu device support * [npu] support low level zero * [test] update npu zero plugin test * [hotfix] fix import * [test] recover tests * [npu] gemini support npu (#5052) * [npu] refactor device utils * [gemini] support npu * [example] llama2+gemini support npu * [kernel] add arm cpu adam kernel (#5065) * [kernel] add arm cpu adam * [optim] update adam optimizer * [kernel] arm cpu adam remove bf16 support * [hotfix/hybridengine] fix bug when tp*pp size = 1 (#5069) * [inference] update examples and engine (#5073) * update examples and engine * fix choices * update example * [format] applied code formatting on changed files in pull request 5067 (#5072) Co-authored-by: github-actions <github-actions@github.com> * [hotfix/hybridengine] Fix init model with random parameters in benchmark (#5074) * fix init model with random parameters * fix example * [inference] refactor examples and fix schedule (#5077) * [setup] refactor infer setup * [hotfix] fix infenrece behavior on 1 1 gpu * [exmaple] refactor inference examples * fix thrust-transform-reduce error (#5078) * [nfc] fix typo in docs/ (#4972) * [nfc] fix typo and author name (#5089) * [gemini]fix gemini optimzer, saving Shardformer in Gemini got list assignment index out of range (#5085) * [Hotfix] Fix model policy matching strategy in ShardFormer (#5064) * hotfix/Fix get model policy strategy in ShardFormer * fix bug in auto policy * [shardformer]fix flash attention, when mask is casual, just don't unpad it (#5084) * fix flash attn * fix fix * [npu] add npu support for hybrid plugin and llama (#5090) * llama 3d * update * fix autocast * [Feature] Add document retrieval QA (#5020) * add langchain * add langchain * Add files via upload * add langchain * fix style * fix style: remove extra space * add pytest; modified retriever * add pytest; modified retriever * add tests to build_on_pr.yml * fix build_on_pr.yml * fix build on pr; fix environ vars * 
seperate unit tests for colossalqa from build from pr * fix container setting; fix environ vars * commented dev code * add incremental update * remove stale code * fix style * change to sha3 224 * fix retriever; fix style; add unit test for document loader * fix ci workflow config * fix ci workflow config * add set cuda visible device script in ci * fix doc string * fix style; update readme; refactored * add force log info * change build on pr, ignore colossalqa * fix docstring, captitalize all initial letters * fix indexing; fix text-splitter * remove debug code, update reference * reset previous commit * update LICENSE update README add key-value mode, fix bugs * add files back * revert force push * remove junk file * add test files * fix retriever bug, add intent classification * change conversation chain design * rewrite prompt and conversation chain * add ui v1 * ui v1 * fix atavar * add header * Refactor the RAG Code and support Pangu * Refactor the ColossalQA chain to Object-Oriented Programming and the UI demo. * resolved conversation. tested scripts under examples. web demo still buggy * fix ci tests * Some modifications to add ChatGPT api * modify llm.py and remove unnecessary files * Delete applications/ColossalQA/examples/ui/test_frontend_input.json * Remove OpenAI api key * add colossalqa * move files * move files * move files * move files * fix style * Add Readme and fix some bugs. 
* Add something to readme and modify some code * modify a directory name for clarity * remove redundant directory * Correct a type in llm.py * fix AI prefix * fix test_memory.py * fix conversation * fix some erros and typos * Fix a missing import in RAG_ChatBot.py * add colossalcloud LLM wrapper, correct issues in code review --------- Co-authored-by: YeAnbang <anbangy2@outlook.com> Co-authored-by: Orion-Zheng <zheng_zian@u.nus.edu> Co-authored-by: Zian(Andy) Zheng <62330719+Orion-Zheng@users.noreply.github.com> Co-authored-by: Orion-Zheng <zhengzian@u.nus.edu> * remove duplicate import (#5100) * fix typo change lazy_iniy to lazy_init (#5099) * [nfc] fix typo change directoty to directory (#5111) * [FEATURE] Add Safety Eval Datasets to ColossalEval (#5095) * add safetybench and cvalues(responsibility) eval dataset * Modify code according to review suggestions --------- Co-authored-by: Orion-Zheng <zhengzian@u.nus.edu> * [hotfix] fixed memory usage of shardformer module replacement (#5122) * [shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088) * [shardformer] implement policy for all GPT-J models and test * [shardformer] support interleaved pipeline parallel for bert finetune * [shardformer] shardformer support falcon (#4883) * [shardformer]: fix interleaved pipeline for bert model (#5048) * [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093) * Add Mistral support for Shardformer (#5103) * [shardformer] add tests to mistral (#5105) --------- Co-authored-by: Pengtai Xu <henryxu880@gmail.com> Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com> Co-authored-by: flybird11111 <1829166702@qq.com> Co-authored-by: eric8607242 <e0928021388@gmail.com> * [doc] add moe news (#5128) * [doc] add moe news * [doc] add moe news * [doc] add moe news * [doc] updated paper citation (#5131) * fix typo change JOSNL TO JSONL etc. 
(#5116) * [format] applied code formatting on changed files in pull request 5088 (#5127) Co-authored-by: github-actions <github-actions@github.com> * [format] applied code formatting on changed files in pull request 5124 (#5125) Co-authored-by: github-actions <github-actions@github.com> * [format] applied code formatting on changed files in pull request 5115 (#5118) Co-authored-by: github-actions <github-actions@github.com> * [accelerator] init the accelerator module (#5129) * [accelerator] init the accelerator module * polish code * polish code * polish code * polish code * [npu] support triangle attention for llama (#5130) * update fused attn * update spda * tri attn * update triangle * import * fix * fix * [plugin]fix 3d checkpoint load when booster boost without optimizer. (#5135) * fix 3d checkpoint load when booster boost without optimizer fix 3d checkpoint load when booster boost without optimizer * test ci * revert ci * fix fix * [ColossalQA] refactor server and webui & add new feature (#5138) * refactor server and webui & add new feature * add requirements * modify readme and ui * [doc] fix colossalqa document (#5146) * fix doc * modify doc * fix (#5158) fix * [Colossal-Llama-2] Add finetuning Colossal-Llama-2 example (#4878) * Add finetuning Colossal-Llama-2 example * Add finetuning Colossal-Llama-2 example 2 * Add finetuning Colossal-Llama-2 example and support NEFTuning * Add inference example and refine neftune * Modify readme file * update the imports --------- Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com> Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com> * [gemini] hotfix NaN loss while using Gemini + tensor_parallel (#5150) * fix aaa fix fix fix * fix * fix * test ci * fix ci fix * [colossalqa] fix pangu api (#5170) * fix pangu api * add comment * [ColossalEval] Support GSM, Data Leakage Evaluation and Tensor Parallel (#5169) * Support GSM, Data Leakage Evaluation and Tensor Parallel * remove redundant code and 
update inference.py in examples/gpt_evaluation --------- Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com> * [shardformer] llama support DistCrossEntropy (#5176) * fix aaa fix fix fix * fix * fix * test ci * fix ci fix * llama support dist-cross fix fix fix fix fix fix fix fix * fix * fix * fix fix * test ci * test ci * fix * [Colossal-Llama-2] Add finetuning Colossal-Llama-2 example (#4878) * Add finetuning Colossal-Llama-2 example * Add finetuning Colossal-Llama-2 example 2 * Add finetuning Colossal-Llama-2 example and support NEFTuning * Add inference example and refine neftune * Modify readme file * update the imports --------- Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com> Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com> * llama support dist-cross fix fix fix fix fix fix fix fix * fix * fix * fix fix * test ci * test ci * fix * fix ci * fix ci --------- Co-authored-by: Yuanchen <70520919+chengeharrison@users.noreply.github.com> Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com> Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com> * Fix ColossalEval (#5186) Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com> * [doc] update pytorch version in documents. 
(#5177) * fix aaa fix fix fix * fix * fix * test ci * fix ci fix * update pytorch version in documents * polish readme in application/chat (#5194) * [pipeline]: fix p2p comm, add metadata cache and support llama interleaved pp (#5134) * test: add more p2p tests * fix: remove send_forward_recv_forward as p2p op list need to use the same group * fix: make send and receive atomic * feat: update P2PComm fn * feat: add metadata cache in 1f1b * feat: add metadata cache in interleaved pp * feat: modify is_xx_stage fn * revert: add _broadcast_object_list * feat: add interleaved pp in llama policy * feat: set NCCL_BUFFSIZE in HybridParallelPlugin * Improve logic for selecting metrics (#5196) Co-authored-by: Xu <yuanchen.xu00@gmail.com> * [doc] Update required third-party library list for testing and torch comptibility checking (#5207) * doc/update requirements-test.txt * update torch-cuda compatibility check * support linear accumulation fusion (#5199) support linear accumulation fusion support linear accumulation fusion fix * [pipeline]: support arbitrary batch size in forward_only mode (#5201) * fix: remove drop last in val & test dataloader * feat: add run_forward_only, support arbitrary bs * chore: modify ci script * [pipeline]: add p2p fallback order and fix interleaved pp deadlock (#5214) * fix: add fallback order option and update 1f1b * fix: fix deadlock comm in interleaved pp * test: modify p2p test * [devops] update torch versoin in ci (#5217) * fix-test (#5210) fix-test fix-test * fix flash attn (#5209) * [nfc] fix typo colossalai/shardformer/ (#5133) * [Colossal-LLaMA-2] Release Colossal-LLaMA-2-13b-base model (#5224) * update readme * update readme * update link * update * update readme * update * update * update * update title * update example * update example * fix content * add conclusion * add license * update * update * update version * fix minor * [doc] Update README.md of Colossal-LLAMA2 (#5233) * Update README.md * Update README.md * [doc] Make 
leaderboard format more uniform and good-looking (#5231) * Make leaderboard format more unifeid and good-looking * Update README.md * Update README.md * [doc] add Colossal-LLaMA-2-13B (#5234) * [doc] add Colossal-LLaMA-2-13B * [doc] add Colossal-LLaMA-2-13B * [doc] add Colossal-LLaMA-2-13B * [format] applied code formatting on changed files in pull request 5234 (#5235) Co-authored-by: github-actions <github-actions@github.com> * [doc] SwiftInfer release (#5236) * [doc] SwiftInfer release * [doc] SwiftInfer release * [doc] SwiftInfer release * [doc] SwiftInfer release * [doc] SwiftInfer release * [npu] use extension for op builder (#5172) * update extension * update cpu adam * update is * add doc for cpu adam * update kernel * update commit * update flash * update memory efficient * update flash attn * update flash attention loader * update api * fix * update doc * update example time limit * reverse change * fix doc * remove useless kernel * fix * not use warning * update * update * [pipeline] A more general _communicate in p2p (#5062) * A more general _communicate * feat: finish tree_flatten version p2p * fix: update p2p api calls --------- Co-authored-by: Wenhao Chen <cwher@outlook.com> * [npu] change device to accelerator api (#5239) * update accelerator * fix timer * fix amp * update * fix * update bug * add error raise * fix autocast * fix set device * remove doc accelerator * update doc * update doc * update doc * use nullcontext * update cpu * update null context * change time limit for example * udpate * update * update * update * [npu] polish accelerator code --------- Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com> Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com> * [hotfix] removed unused flag (#5242) * [doc] fix typo in Colossal-LLaMA-2/README.md (#5247) * [workflow] fixed build CI (#5240) * [workflow] fixed build CI * polish * polish * polish * polish * polish * [ci] fixed booster test (#5251) * [ci] fixed booster test * [ci] fixed 
booster test * [ci] fixed booster test * [ci] fixed ddp test (#5254) * [ci] fixed ddp test * polish * fix typo in applications/ColossalEval/README.md (#5250) * [ci] fix shardformer tests. (#5255) * fix ci fix * revert: revert p2p * feat: add enable_metadata_cache option * revert: enable t5 tests --------- Co-authored-by: Wenhao Chen <cwher@outlook.com> * [doc] fix doc typo (#5256) * [doc] fix annotation display * [doc] fix llama2 doc * [hotfix]: add pp sanity check and fix mbs arg (#5268) * fix: fix misleading mbs arg * feat: add pp sanity check * fix: fix 1f1b sanity check * [workflow] fixed incomplete bash command (#5272) * [workflow] fixed oom tests (#5275) * [workflow] fixed oom tests * polish * polish * polish * [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276) * fix ci fix * fix test * revert: revert p2p * feat: add enable_metadata_cache option * revert: enable t5 tests * fix --------- Co-authored-by: Wenhao Chen <cwher@outlook.com> * [shardformer] hybridparallelplugin support gradients accumulation. (#5246) * support gradients acc fix fix fix fix fix fix fix fix fix fix fix fix fix * fix fix * fix fix fix * [hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230) * fix auto loading gpt2 tokenizer (#5279) * [doc] add llama2-13B disyplay (#5285) * Update README.md * fix 13b typo --------- Co-authored-by: binmakeswell <binmakeswell@gmail.com> * fix llama pretrain (#5287) * [hotfix] fix 3d plugin test (#5292) * fix bug for mefture (#5299) * [NFC] polish applications/Colossal-LLaMA-2/colossal_llama2/tokenizer/init_tokenizer.py code style (#5228) * fix some typo (#5307) * [feat] refactored extension module (#5298) * [feat] refactored extension module * polish * polish * polish * polish * polish * polish * polish * polish * polish * polish * [workflow] updated CI image (#5318) * [accelerator] fixed npu api * [tests] fix t5 test. (#5322) * [ci] fix shardformer tests. 
(#5255) * fix ci fix * revert: revert p2p * feat: add enable_metadata_cache option * revert: enable t5 tests --------- Co-authored-by: Wenhao Chen <cwher@outlook.com> * fix t5 test --------- Co-authored-by: Wenhao Chen <cwher@outlook.com> * [doc] added docs for extensions (#5324) * [doc] added docs for extensions * polish * polish * fix typo under extensions/ (#5330) * fix typo change dosen't to doesn't (#5308) * [extension] fixed exception catch (#5342) * [Chat] fix sft loss nan (#5345) * fix script * fix script * fix chat nan * fix chat nan * [checkpointio] fix gemini and hybrid parallel optim checkpoint (#5347) * [checkpointio] fix hybrid parallel optim checkpoint * [extension] fix cuda extension * [checkpointio] fix gemini optimizer checkpoint * polish code * [fix] remove unnecessary dp_size assert (#5351) * fix: remove unnecessary assert * test: add more 3d plugin tests * fix: add warning * [gemini] fix param op hook when output is tuple (#5355) * [gemini] fix param op hook when output is tuple * [gemini] fix param op hook * [llama] fix dataloader for hybrid parallel (#5358) * [plugin] refactor prepare dataloader * [plugin] update train script * [llama] update training script (#5360) * [llama] update training script * [doc] polish docstr * [llama] add flash attn patch for npu (#5362) * [llama] fix neftune & pbar with start_step (#5364) * [eval] update llama npu eval (#5366) * [llama] polish training script and fix optim ckpt (#5368) * [lr-scheduler] fix load state dict and add test (#5369) * [llama] fix memory issue (#5371) * [llama] fix memory issue * [llama] add comment * [moe] init mixtral impl * [moe] update capacity computing (#5253) * [moe] top2 allow uneven input * [moe] update capacity computing * [moe] remove debug info * [moe] update capacity computing * [moe] update capacity computing * [moe] support mixtral (#5309) * [moe] add mixtral block for single expert * [moe] mixtral block fwd support uneven ep * [moe] mixtral block bwd support uneven ep * 
[moe] add mixtral moe layer * [moe] simplify replace * [meo] support save sharded mixtral * [meo] support load sharded mixtral * [meo] support save sharded optim * [meo] integrate moe manager into plug * [meo] fix optimizer load * [meo] fix mixtral layer * [moe] fix mixtral checkpoint io (#5314) * [moe] fix mixtral forward default value (#5329) * [moe] fix mixtral optim checkpoint (#5344) * [moe] fix tests * [release] update version (#5380) * [llama] fix training and inference scripts (#5384) * [llama] refactor inference example to fit sft * [llama] fix training script to fit gemini * [llama] fix inference script * [doc] Fix typo (#5361) * [doc] updated installation command (#5389) * [hotfix] fix variable type for top_p (#5313) Co-authored-by: binmakeswell <binmakeswell@gmail.com> * [hotfix] Fix wrong import in meta_registry (#5392) * [extension] hotfix jit extension setup (#5402) * [example] reuse flash attn patch (#5400) * [fsdp] impl save/load shard model/optimizer (#5357) * [setup] fixed nightly release (#5388) * [shardformer]gather llama logits (#5398) * gather llama logits * fix * update requirements (#5407) * [workflow] added pypi channel (#5412) * [doc] fix blog link * [doc] fix blog link * fix sft single turn inference example (#5416) * [example]add gpt2 benchmark example script. (#5295) * benchmark gpt2 * fix fix fix fix * [doc] fix typo in Colossal-LLaMA-2/README.md (#5247) * [workflow] fixed build CI (#5240) * [workflow] fixed build CI * polish * polish * polish * polish * polish * [ci] fixed booster test (#5251) * [ci] fixed booster test * [ci] fixed booster test * [ci] fixed booster test * [ci] fixed ddp test (#5254) * [ci] fixed ddp test * polish * fix typo in applications/ColossalEval/README.md (#5250) * [ci] fix shardformer tests. 
(#5255) * fix ci fix * revert: revert p2p * feat: add enable_metadata_cache option * revert: enable t5 tests --------- Co-authored-by: Wenhao Chen <cwher@outlook.com> * [doc] fix doc typo (#5256) * [doc] fix annotation display * [doc] fix llama2 doc * [hotfix]: add pp sanity check and fix mbs arg (#5268) * fix: fix misleading mbs arg * feat: add pp sanity check * fix: fix 1f1b sanity check * [workflow] fixed incomplete bash command (#5272) * [workflow] fixed oom tests (#5275) * [workflow] fixed oom tests * polish * polish * polish * [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276) * fix ci fix * fix test * revert: revert p2p * feat: add enable_metadata_cache option * revert: enable t5 tests * fix --------- Co-authored-by: Wenhao Chen <cwher@outlook.com> * [shardformer] hybridparallelplugin support gradients accumulation. (#5246) * support gradients acc fix fix fix fix fix fix fix fix fix fix fix fix fix * fix fix * fix fix fix * [hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230) * fix auto loading gpt2 tokenizer (#5279) * [doc] add llama2-13B disyplay (#5285) * Update README.md * fix 13b typo --------- Co-authored-by: binmakeswell <binmakeswell@gmail.com> * fix llama pretrain (#5287) * fix * fix * fix fix * fix fix fix * fix fix * benchmark gpt2 * fix fix fix fix * [workflow] fixed build CI (#5240) * [workflow] fixed build CI * polish * polish * polish * polish * polish * [ci] fixed booster test (#5251) * [ci] fixed booster test * [ci] fixed booster test * [ci] fixed booster test * fix fix * fix fix fix * fix * fix fix fix fix fix * fix * Update shardformer.py --------- Co-authored-by: digger yu <digger-yu@outlook.com> Co-authored-by: Frank Lee <somerlee.9@gmail.com> Co-authored-by: Wenhao Chen <cwher@outlook.com> Co-authored-by: binmakeswell <binmakeswell@gmail.com> Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com> Co-authored-by: Michelle <97082656+MichelleMa8@users.noreply.github.com> Co-authored-by: Desperado-Jia 
<502205863@qq.com> * [doc] sora release (#5425) * [doc] sora release * [doc] sora release * [doc] sora release * [doc] sora release * [devops] fix extention building (#5427) * [hotfix] fix sd vit import error (#5420) * fix import error * Update dpt_depth.py --------- Co-authored-by: binmakeswell <binmakeswell@gmail.com> * [hotfix] fix typo of openmoe model source (#5403) * [doc] update some translations with README-zh-Hans.md (#5382) * [hotfix] fix typo change _descrption to _description (#5331) * [hotfix] fix typo change enabel to enable under colossalai/shardformer/ (#5317) * [eval-hotfix] set few_shot_data to None when few shot is disabled (#5422) * [hotfix] fix typo change MoECheckpintIO to MoECheckpointIO (#5335) Co-authored-by: binmakeswell <binmakeswell@gmail.com> * [doc] Fix typo s/infered/inferred/ (#5288) Signed-off-by: hugo-syn <hugo.vincent@synacktiv.com> * [hotfix] fix stable diffusion inference bug. (#5289) * Update train_ddp.yaml delete "strategy" to fix DDP config loading bug in "main.py" * Update train_ddp.yaml fix inference with scripts/txt2img.py config file load bug. * Update README.md add pretrain model test code. * [colossal-llama2] add stream chat examlple for chat version model (#5428) * add stream chat for chat version * remove os.system clear * modify function name * [release] update version (#5411) * fix tensor data update for gemini loss caluculation (#5442) * [hotfix] fix typo s/keywrods/keywords etc. 
(#5429) * [devops] fix compatibility (#5444) * [devops] fix compatibility * [hotfix] update compatibility test on pr * [devops] fix compatibility * [devops] record duration during comp test * [test] decrease test duration * fix falcon * [shardformer] fix gathering output when using tensor parallelism (#5431) * fix * padding vocab_size when using pipeline parallellism padding vocab_size when using pipeline parallellism fix fix * fix * fix fix fix * fix gather output * fix * fix * fix fix resize embedding fix resize embedding * fix resize embedding fix * revert * revert * revert * [doc] release Open-Sora 1.0 with model weights (#5468) * [doc] release Open-Sora 1.0 with model weights * [doc] release Open-Sora 1.0 with model weights * [doc] release Open-Sora 1.0 with model weights * [doc] update open-sora demo (#5479) * [doc] update open-sora demo * [doc] update open-sora demo * [doc] update open-sora demo * [example] add grok-1 inference (#5485) * [misc] add submodule * remove submodule * [example] support grok-1 tp inference * [example] add grok-1 inference script * [example] refactor code * [example] add grok-1 readme * [exmaple] add test ci * [exmaple] update readme * [release] grok-1 314b inference (#5490) * [release] grok-1 inference * [release] grok-1 inference * [release] grok-1 inference * [example] update Grok-1 inference (#5495) * revise grok-1 example * remove unused arg in scripts * prevent re-installing torch * update readme * revert modifying colossalai requirements * add perf * trivial * add tokenizer url * [hotfix] set return_outputs=False in examples and polish code (#5404) * fix: simplify merge_batch * fix: use return_outputs=False to eliminate extra memory consumption * feat: add return_outputs warning * style: remove `return_outputs=False` as it is the default value * [release] grok-1 inference benchmark (#5500) * [release] grok-1 inference benchmark * [release] grok-1 inference benchmark * [release] grok-1 inference benchmark * [release] grok-1 
inference benchmark * [release] grok-1 inference benchmark * [shardformer]Fix lm parallel. (#5480) * fix * padding vocab_size when using pipeline parallellism padding vocab_size when using pipeline parallellism fix fix * fix * fix fix fix * fix gather output * fix * fix * fix fix resize embedding fix resize embedding * fix resize embedding fix * revert * revert * revert * fix lm forward distribution * fix * test ci * fix * [fix] fix grok-1 example typo (#5506) * [devops] fix example test ci (#5504) * Fix ColoTensorSpec for py11 (#5440) * fixed layout converter caching and updated tester * Empty-Commit * [shardformer] update colo attention to support custom mask (#5510) * [feature] refactor colo attention (#5462) * [extension] update api * [feature] add colo attention * [feature] update sdpa * [feature] update npu attention * [feature] update flash-attn * [test] add flash attn test * [test] update flash attn test * [shardformer] update modeling to fit colo attention (#5465) * [misc] refactor folder structure * [shardformer] update llama flash-attn * [shardformer] fix llama policy * [devops] update tensornvme install * [test] update llama test * [shardformer] update colo attn kernel dispatch * [shardformer] update blip2 * [shardformer] update chatglm * [shardformer] update gpt2 * [shardformer] update gptj * [shardformer] update opt * [shardformer] update vit * [shardformer] update colo attention mask prep * [shardformer] update whisper * [test] fix shardformer tests (#5514) * [test] fix shardformer tests * [test] fix shardformer tests * [format] applied code formatting on changed files in pull request 5510 (#5517) Co-authored-by: github-actions <github-actions@github.com> * [shardformer] fix pipeline forward error if custom layer distribution is used (#5189) * Use self.[distribute_layers|get_stage_index] to exploit custom layer distribution * Change static methods for t5 layer distribution to member functions * Change static methods for whisper layer distribution to 
member functions * Replace whisper policy usage with self one * Fix test case to use non-static layer distribution methods * fix: fix typo --------- Co-authored-by: Wenhao Chen <cwher@outlook.com> * [Fix] Grok-1 use tokenizer from the same pretrained path (#5532) * [fix] use tokenizer from the same pretrained path * trust remote code * [ColossalChat] Update RLHF V2 (#5286) * Add dpo. Fix sft, ppo, lora. Refactor all * fix and tested ppo * 2 nd round refactor * add ci tests * fix ci * fix ci * fix readme, style * fix readme style * fix style, fix benchmark * reproduce benchmark result, remove useless files * rename to ColossalChat * use new image * fix ci workflow * fix ci * use local model/tokenizer for ci tests * fix ci * fix ci * fix ci * fix ci timeout * fix rm progress bar. fix ci timeout * fix ci * fix ci typo * remove 3d plugin from ci temporary * test environment * cannot save optimizer * support chat template * fix readme * fix path * test ci locally * restore build_or_pr * fix ci data path * fix benchmark * fix ci, move ci tests to 3080, disable fast tokenizer * move ci to 85 * support flash attention 2 * add all-in-one data preparation script. 
Fix colossal-llama2-chat chat template * add hardware requirements * move ci test data * fix save_model, add unwrap * fix missing bos * fix missing bos; support grad accumulation with gemini * fix ci * fix ci * fix ci * fix llama2 chat template config * debug sft * debug sft * fix colossalai version requirement * fix ci * add sanity check to prevent NaN loss * fix requirements * add dummy data generation script * add dummy data generation script * add dummy data generation script * add dummy data generation script * update readme * update readme * update readme and ignore * fix logger bug * support parallel_output * modify data preparation logic * fix tokenization * update lr * fix inference * run pre-commit --------- Co-authored-by: Tong Li <tong.li352711588@gmail.com> * [shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama (#5508) * feat: add `GradientCheckpointConfig` and `PipelineGradientCheckpointConfig` * feat: apply `GradientCheckpointConfig` to policy and llama_forward * feat: move `distribute_layer` and `get_stage_index` to PipelineStageManager * fix: add optional args for `distribute_layer` and `get_stage_index` * fix: fix changed API calls * test: update llama tests * style: polish `GradientCheckpointConfig` * fix: fix pipeline utils tests * fix incorrect sharding without zero (#5545) Co-authored-by: Edenzzzz <wtan45@wisc.edu> * [shardformer] Sequence Parallelism Optimization (#5533) * sequence parallel optimization * validate sequence parallel in llama (code to be polished) * shardformer api writing * integrate sequence parallel in ShardFormer * fix pp bugs and sp bugs for LlaMa model * integrating ring-based sequence parallelism into ShardFormer * [sequence parallelism]: Add fused megatron function * integrating ring-based sequence parallelism into ShardFormer --------- Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn> * fix bugs when useing sp and flashattention together * fix operation function name 
* support flash attention for ulysses-style sp * clarify sp process group * fix compatibility bugs in moe plugin * fix fused linear bugs * fix linear layer test * support gpt model all-to-all sp * modify shard data dimension (meant to be dim=-1) * support megtron-style sp and distributed attn for llama model * [shardformer] add megatron sp to llama * support llama7B 128k with distributed attention * [shardformer] robustness enhancement * add block attn * sp mode 1: keep input as a complete sequence * fix sp compatability * finish sp mode 3 support for gpt * using all_to_all_single when batch size is 1 * support mode 2 sp in gpt2 (#5) * [shardformer] add megatron sp to llama * support llama7B 128k with distributed attention * [shardformer] robustness enhancement * add block attn * sp mode 1: keep input as a complete sequence * fix sp compatability * refactor ring implementation * support mode 2 sp in gpt2 * polish code * enable distributed attn mask when using sp mode 2 and 3 in llama * automatically enable flash attn when using sp mode 2 and 3 in llama * inplace attn mask * add zero2 support for sequence parallel * polish code * fix bugs * fix gemini checkpoint io * loose tensor checking atol and rtol * add comment * fix llama layernorm grad * fix zero grad * fix zero grad * fix conflict * update split and gather auto grad func * sequence parallel: inside text split (#6) * polish code (part 1) * polish code (part 2) * polish code (part 2.5) * polish code (part 3) * sequence parallel: inside text split * miscellaneous minor fixes * polish code * fix ulysses style ZeRO * sequence parallel: inside text split * miscellaneous minor fixes * disaggregate sp group and dp group for sp * fix llama and gpt sp * polish code * move ulysses grad sync to ddp (#9) * remove zero_stage and unbind the grad sync for alltoall sp * add 2d group creation test * move ulysses grad sync to ddp * add 2d group creation test * remove useless code * change shard config not to enable sp when 
enable_all_optimizations * add sp warnings for several model * remove useless code --------- Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn> * [hotfix] quick fixes to make legacy tutorials runnable (#5559) Co-authored-by: Edenzzzz <wtan45@wisc.edu> * [fix] fix typo s/muiti-node /multi-node etc. (#5448) * [hotfix] fix typo s/get_defualt_parser /get_default_parser (#5548) * [devops] remove post commit ci (#5566) * [devops] remove post commit ci * [misc] run pre-commit on all files * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [doc] fix ColossalMoE readme (#5599) * fix readme * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [zero] support multiple (partial) backward passes (#5596) * [zero] support multiple (partial) backward passes * [misc] update requirements * [shardformer] refactor embedding resize (#5603) * [branch rebase] rebase main to Feature/resize_embedding (#5554) * fix * [release] update version (#5411) * [hotfix] fix typo s/keywrods/keywords etc. 
(#5429) * [devops] fix compatibility (#5444) * [devops] fix compatibility * [hotfix] update compatibility test on pr * [devops] fix compatibility * [devops] record duration during comp test * [test] decrease test duration * fix falcon * [shardformer] fix gathering output when using tensor parallelism (#5431) * fix * padding vocab_size when using pipeline parallellism padding vocab_size when using pipeline parallellism fix fix * fix * fix fix fix * fix gather output * fix * fix * fix fix resize embedding fix resize embedding * fix resize embedding fix * revert * revert * revert * [doc] release Open-Sora 1.0 with model weights (#5468) * [doc] release Open-Sora 1.0 with model weights * [doc] release Open-Sora 1.0 with model weights * [doc] release Open-Sora 1.0 with model weights * [doc] update open-sora demo (#5479) * [doc] update open-sora demo * [doc] update open-sora demo * [doc] update open-sora demo * [example] add grok-1 inference (#5485) * [misc] add submodule * remove submodule * [example] support grok-1 tp inference * [example] add grok-1 inference script * [example] refactor code * [example] add grok-1 readme * [exmaple] add test ci * [exmaple] update readme --------- Co-authored-by: Hongxin Liu <lhx0217@gmail.com> Co-authored-by: digger yu <digger-yu@outlook.com> Co-authored-by: binmakeswell <binmakeswell@gmail.com> * [CI] run pre-commit (#5577) * fix * [release] update version (#5411) * [hotfix] fix typo s/keywrods/keywords etc. 
(#5429) * [devops] fix compatibility (#5444) * [devops] fix compatibility * [hotfix] update compatibility test on pr * [devops] fix compatibility * [devops] record duration during comp test * [test] decrease test duration * fix falcon * [shardformer] fix gathering output when using tensor parallelism (#5431) * fix * padding vocab_size when using pipeline parallellism padding vocab_size when using pipeline parallellism fix fix * fix * fix fix fix * fix gather output * fix * fix * fix fix resize embedding fix resize embedding * fix resize embedding fix * revert * revert * revert * [doc] release Open-Sora 1.0 with model weights (#5468) * [doc] release Open-Sora 1.0 with model weights * [doc] release Open-Sora 1.0 with model weights * [doc] release Open-Sora 1.0 with model weights * [doc] update open-sora demo (#5479) * [doc] update open-sora demo * [doc] update open-sora demo * [doc] update open-sora demo * [example] add grok-1 inference (#5485) * [misc] add submodule * remove submodule * [example] support grok-1 tp inference * [example] add grok-1 inference script * [example] refactor code * [example] add grok-1 readme * [exmaple] add test ci * [exmaple] update readme * run pre-commit --------- Co-authored-by: Hongxin Liu <lhx0217@gmail.com> Co-authored-by: digger yu <digger-yu@outlook.com> Co-authored-by: binmakeswell <binmakeswell@gmail.com> * [rebase] rebase main to resize-embedding (#5581) * [release] grok-1 314b inference (#5490) * [release] grok-1 inference * [release] grok-1 inference * [release] grok-1 inference * [example] update Grok-1 inference (#5495) * revise grok-1 example * remove unused arg in scripts * prevent re-installing torch * update readme * revert modifying colossalai requirements * add perf * trivial * add tokenizer url * [hotfix] set return_outputs=False in examples and polish code (#5404) * fix: simplify merge_batch * fix: use return_outputs=False to eliminate extra memory consumption * feat: add return_outputs warning * style: remove 
`return_outputs=False` as it is the default value * [release] grok-1 inference benchmark (#5500) * [release] grok-1 inference benchmark * [release] grok-1 inference benchmark * [release] grok-1 inference benchmark * [release] grok-1 inference benchmark * [release] grok-1 inference benchmark * [shardformer]Fix lm parallel. (#5480) * fix * padding vocab_size when using pipeline parallellism padding vocab_size when using pipeline parallellism fix fix * fix * fix fix fix * fix gather output * fix * fix * fix fix resize embedding fix resize embedding * fix resize embedding fix * revert * revert * revert * fix lm forward distribution * fix * test ci * fix * [fix] fix grok-1 example typo (#5506) * [devops] fix example test ci (#5504) * Fix ColoTensorSpec for py11 (#5440) * fixed layout converter caching and updated tester * Empty-Commit * [shardformer] update colo attention to support custom mask (#5510) * [feature] refactor colo attention (#5462) * [extension] update api * [feature] add colo attention * [feature] update sdpa * [feature] update npu attention * [feature] update flash-attn * [test] add flash attn test * [test] update flash attn test * [shardformer] update modeling to fit colo attention (#5465) * [misc] refactor folder structure * [shardformer] update llama flash-attn * [shardformer] fix llama policy * [devops] update tensornvme install * [test] update llama test * [shardformer] update colo attn kernel dispatch * [shardformer] update blip2 * [shardformer] update chatglm * [shardformer] update gpt2 * [shardformer] update gptj * [shardformer] update opt * [shardformer] update vit * [shardformer] update colo attention mask prep * [shardformer] update whisper * [test] fix shardformer tests (#5514) * [test] fix shardformer tests * [test] fix shardformer tests * [format] applied code formatting on changed files in pull request 5510 (#5517) Co-authored-by: github-actions <github-actions@github.com> * [shardformer] fix pipeline forward error if custom layer 
distribution is used (#5189) * Use self.[distribute_layers|get_stage_index] to exploit custom layer distribution * Change static methods for t5 layer distribution to member functions * Change static methods for whisper layer distribution to member functions * Replace whisper policy usage with self one * Fix test case to use non-static layer distribution methods * fix: fix typo --------- Co-authored-by: Wenhao Chen <cwher@outlook.com> * [Fix] Grok-1 use tokenizer from the same pretrained path (#5532) * [fix] use tokenizer from the same pretrained path * trust remote code * [ColossalChat] Update RLHF V2 (#5286) * Add dpo. Fix sft, ppo, lora. Refactor all * fix and tested ppo * 2 nd round refactor * add ci tests * fix ci * fix ci * fix readme, style * fix readme style * fix style, fix benchmark * reproduce benchmark result, remove useless files * rename to ColossalChat * use new image * fix ci workflow * fix ci * use local model/tokenizer for ci tests * fix ci * fix ci * fix ci * fix ci timeout * fix rm progress bar. fix ci timeout * fix ci * fix ci typo * remove 3d plugin from ci temporary * test environment * cannot save optimizer * support chat template * fix readme * fix path * test ci locally * restore build_or_pr * fix ci data path * fix benchmark * fix ci, move ci tests to 3080, disable fast tokenizer * move ci to 85 * support flash attention 2 * add all-in-one data preparation script. 
Fix colossal-llama2-chat chat template * add hardware requirements * move ci test data * fix save_model, add unwrap * fix missing bos * fix missing bos; support grad accumulation with gemini * fix ci * fix ci * fix ci * fix llama2 chat template config * debug sft * debug sft * fix colossalai version requirement * fix ci * add sanity check to prevent NaN loss * fix requirements * add dummy data generation script * add dummy data generation script * add dummy data generation script * add dummy data generation script * update readme * update readme * update readme and ignore * fix logger bug * support parallel_output * modify data preparation logic * fix tokenization * update lr * fix inference * run pre-commit --------- Co-authored-by: Tong Li <tong.li352711588@gmail.com> * [shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama (#5508) * feat: add `GradientCheckpointConfig` and `PipelineGradientCheckpointConfig` * feat: apply `GradientCheckpointConfig` to policy and llama_forward * feat: move `distribute_layer` and `get_stage_index` to PipelineStageManager * fix: add optional args for `distribute_layer` and `get_stage_index` * fix: fix changed API calls * test: update llama tests * style: polish `GradientCheckpointConfig` * fix: fix pipeline utils tests * fix incorrect sharding without zero (#5545) Co-authored-by: Edenzzzz <wtan45@wisc.edu> * [shardformer] Sequence Parallelism Optimization (#5533) * sequence parallel optimization * validate sequence parallel in llama (code to be polished) * shardformer api writing * integrate sequence parallel in ShardFormer * fix pp bugs and sp bugs for LlaMa model * integrating ring-based sequence parallelism into ShardFormer * [sequence parallelism]: Add fused megatron function * integrating ring-based sequence parallelism into ShardFormer --------- Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn> * fix bugs when useing sp and flashattention together * fix operation function name 
* support flash attention for ulysses-style sp * clarify sp process group * fix compatibility bugs in moe plugin * fix fused linear bugs * fix linear layer test * support gpt model all-to-all sp * modify shard data dimension (meant to be dim=-1) * support megtron-style sp and distributed attn for llama model * [shardformer] add megatron sp to llama * support llama7B 128k with distributed attention * [shardformer] robustness enhancement * add block attn * sp mode 1: keep input as a complete sequence * fix sp compatability * finish sp mode 3 support for gpt * using all_to_all_single when batch size is 1 * support mode 2 sp in gpt2 (#5) * [shardformer] add megatron sp to llama * support llama7B 128k with distributed attention * [shardformer] robustness enhancement * add block attn * sp mode 1: keep input as a complete sequence * fix sp compatability * refactor ring implementation * support mode 2 sp in gpt2 * polish code * enable distributed attn mask when using sp mode 2 and 3 in llama * automatically enable flash attn when using sp mode 2 and 3 in llama * inplace attn mask * add zero2 support for sequence parallel * polish code * fix bugs * fix gemini checkpoint io * loose tensor checking atol and rtol * add comment * fix llama layernorm grad * fix zero grad * fix zero grad * fix conflict * update split and gather auto grad func * sequence parallel: inside text split (#6) * polish code (part 1) * polish code (part 2) * polish code (part 2.5) * polish code (part 3) * sequence parallel: inside text split * miscellaneous minor fixes * polish code * fix ulysses style ZeRO * sequence parallel: inside text split * miscellaneous minor fixes * disaggregate sp group and dp group for sp * fix llama and gpt sp * polish code * move ulysses grad sync to ddp (#9) * remove zero_stage and unbind the grad sync for alltoall sp * add 2d group creation test * move ulysses grad sync to ddp * add 2d group creation test * remove useless code * change shard config not to enable sp when 
enable_all_optimizations * add sp warnings for several model * remove useless code --------- Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn> * [hotfix] quick fixes to make legacy tutorials runnable (#5559) Co-authored-by: Edenzzzz <wtan45@wisc.edu> * [fix] fix typo s/muiti-node /multi-node etc. (#5448) * [hotfix] fix typo s/get_defualt_parser /get_default_parser (#5548) * [devops] remove post commit ci (#5566) * [devops] remove post commit ci * [misc] run pre-commit on all files * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> --------- Co-authored-by: binmakeswell <binmakeswell@gmail.com> Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Co-authored-by: Wenhao Chen <cwher@outlook.com> Co-authored-by: Hongxin Liu <lhx0217@gmail.com> Co-authored-by: Rocky Duan <dementrock@users.noreply.github.com> Co-authored-by: Edenzzzz <wtan45@wisc.edu> Co-authored-by: Edenzzzz <wenxuan.tan@wisc.edu> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Insu Jang <insujang@umich.edu> Co-authored-by: YeAnbang <44796419+YeAnbang@users.noreply.github.com> Co-authored-by: Tong Li <tong.li352711588@gmail.com> Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com> Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn> Co-authored-by: digger yu <digger-yu@outlook.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [shardformer]enable padding vocabulary size. 
(#5489) * padding vocab_size when using pipeline parallellism padding vocab_size when using pipeline parallellism fix fix * fix * fix fix fix * fix gather output * fix * fix * fix fix resize embedding fix resize embedding * fix resize embedding fix * revert * revert * revert * padding vocab * padding vocabe * fix * fix * fxi * test ci * fix fix fix fix * fix fix * fix * fix * Update hybrid_parallel_plugin.py fix fix fix * fix fix * fix fix * fix * resolve super init resolve super init resolve super init resolve super init * resolve comments * fix * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * vocab checkpointio * padding vocab_size when using pipeline parallellism padding vocab_size when using pipeline parallellism fix fix * fix fix fix * fix * fix fix resize embedding fix resize embedding * fix resize embedding fix * revert * revert * padding vocab * fix * fix fix * fix fix * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix ci * fix * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix * cherry-pick * revert moe modify * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix fix fix fix fix fix fix fix * resolve comments resolve comments resolve comments resolve comments resolve comments * ptensor ptensor resolve comments fix fix fix fix fix resolve comments resolve comments resolve comments resolve comments resolve comments --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Hongxin Liu <lhx0217@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix rebase * fix rebase --------- Co-authored-by: Hongxin Liu <lhx0217@gmail.com> Co-authored-by: digger yu <digger-yu@outlook.com> Co-authored-by: binmakeswell 
<binmakeswell@gmail.com> Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Co-authored-by: Wenhao Chen <cwher@outlook.com> Co-authored-by: Rocky Duan <dementrock@users.noreply.github.com> Co-authored-by: Edenzzzz <wtan45@wisc.edu> Co-authored-by: Edenzzzz <wenxuan.tan@wisc.edu> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Insu Jang <insujang@umich.edu> Co-authored-by: YeAnbang <44796419+YeAnbang@users.noreply.github.com> Co-authored-by: Tong Li <tong.li352711588@gmail.com> Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com> Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [hotfix] Fix examples no pad token & auto parallel codegen bug; (#5606) * fix no pad token bug * fixed some auto parallel codegen bug, but might not run on torch 2.1 --------- Co-authored-by: Edenzzzz <wtan45@wisc.edu> * [shardformer] fix pipeline grad ckpt (#5620) * [shardformer] fix pipeline grad ckpt * [lora] add lora APIs for booster, support lora for TorchDDP (#4981) * add apis and peft requirement * add liscense and implement apis * add checkpointio apis * add torchddp fwd_bwd test * add support_lora methods * add checkpointio test and debug * delete unneeded codes * remove peft from LICENSE * add concrete methods for enable_lora * simplify enable_lora api * fix requirements * [LowLevelZero] low level zero support lora (#5153) * low level zero support lora low level zero support lora * add checkpoint test * add checkpoint test * fix * fix * fix * fix fix fix fix * fix * fix fix fix fix fix fix fix * fix * fix fix fix fix fix fix fix * fix * test ci * git # This is a combination of 3 commits. 
Update low_level_zero_plugin.py Update low_level_zero_plugin.py fix fix fix * fix naming fix naming fix naming fix * [feature] qlora support * qlora follow commit * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * migrate qutization folder to colossalai/ * minor fixes * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * gptj sp fix * remove redundancies from pre-commit * minor fixes * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: hugo-syn <hugo.vincent@synacktiv.com> Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com> Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com> Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com> Co-authored-by: yuehuayingxueluo <867460659@qq.com> Co-authored-by: Cuiqing Li <lixx3527@gmail.com> Co-authored-by: cuiqing.li <lixx336@gmail.com> Co-authored-by: Yuanchen <70520919+chengeharrison@users.noreply.github.com> Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com> Co-authored-by: littsk <1214689160@qq.com> Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn> Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com> Co-authored-by: Hongxin Liu <lhx0217@gmail.com> Co-authored-by: Xuanlei Zhao <43881818+oahzxl@users.noreply.github.com> Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Wenhao Chen <cwher@outlook.com> Co-authored-by: Jun Gao <imgaojun@gmail.com> Co-authored-by: flybird11111 <1829166702@qq.com> Co-authored-by: Xu Kai <xukai16@foxmail.com> Co-authored-by: Zian(Andy) Zheng <62330719+Orion-Zheng@users.noreply.github.com> Co-authored-by: digger yu <digger-yu@outlook.com> Co-authored-by: CjhHa1 <cjh18671720497@outlook.com> 
Co-authored-by: Xu Kai <xukai16@foxamil.com> Co-authored-by: Orion-Zheng <zheng_zian@u.nus.edu> Co-authored-by: Elsa Granger <zeyugao@outlook.com> Co-authored-by: YeAnbang <44796419+YeAnbang@users.noreply.github.com> Co-authored-by: YeAnbang <anbangy2@outlook.com> Co-authored-by: Orion-Zheng <zhengzian@u.nus.edu> Co-authored-by: Pengtai Xu <henryxu880@gmail.com> Co-authored-by: eric8607242 <e0928021388@gmail.com> Co-authored-by: binmakeswell <binmakeswell@gmail.com> Co-authored-by: Frank Lee <somerlee.9@gmail.com> Co-authored-by: Michelle <97082656+MichelleMa8@users.noreply.github.com> Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com> Co-authored-by: BlueRum <70618399+ht-zhou@users.noreply.github.com> Co-authored-by: Tong Li <tong.li352711588@gmail.com> Co-authored-by: JIMMY ZHAO <knightyzhao@gmail.com> Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com> Co-authored-by: Desperado-Jia <502205863@qq.com> Co-authored-by: 李文军 <40464906+liwenjuna@users.noreply.github.com> Co-authored-by: yixiaoer <miyaku@yixiaoer.sg> Co-authored-by: CZYCW <czyczf@163.com> Co-authored-by: Stephan Kölker <stephankoe@users.noreply.github.com> Co-authored-by: QinLuo <eric.x.sun@gmail.com> Co-authored-by: MickeyCHAN <76671016+danyow-cheung@users.noreply.github.com> Co-authored-by: Luo Yihang <luo_yihang@outlook.com> Co-authored-by: Dongruixuan Li <dongruixuan@hotmail.com> Co-authored-by: hugo-syn <61210734+hugo-syn@users.noreply.github.com> Co-authored-by: Youngon <Youngon_wyl@163.com> Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Co-authored-by: Rocky Duan <dementrock@users.noreply.github.com> Co-authored-by: Edenzzzz <wtan45@wisc.edu> Co-authored-by: Edenzzzz <wenxuan.tan@wisc.edu> Co-authored-by: Insu Jang <insujang@umich.edu> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2025-09-24 03:03:37 +00:00 · 2024-04-23 17:57:44 +08:00
parent 52a2dded36
commit fcf776ff1b
973 changed files with 57285 additions and 24951 deletions
--- a/examples/inference/bench_bloom.py
+++ b/examples/inference/bench_bloom.py
@@ -1,100 +0,0 @@
-import argparse
-import os
-import time
-
-import torch
-from transformers import BloomForCausalLM, BloomTokenizerFast
-
-import colossalai
-from colossalai.inference.tensor_parallel.engine import TPInferEngine
-from colossalai.logging import disable_existing_loggers
-from colossalai.shardformer import ShardConfig
-from colossalai.testing import clear_cache_before_run, rerun_if_address_is_in_use, spawn
-
-os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "true"
-
-
-def print_perf_stats(latency_set, config, bs, warmup=3):
-    # trim warmup queries
-    latency_set = list(latency_set)
-    latency_set = latency_set[warmup:]
-    count = len(latency_set)
-
-    if count > 0:
-        latency_set.sort()
-        avg = sum(latency_set) / count
-        num_layers = getattr(config, "num_layers", config.num_hidden_layers)
-        num_parameters = num_layers * config.hidden_size * config.hidden_size * 12
-        num_bytes = 2  # float16
-
-        print("Avg Per Token Latency: {0:8.2f} ms".format(avg * 1000))
-        print("Avg BW: {0:8.2f} GB/s".format(1 / avg * num_parameters * num_bytes / 1e9))
-        print("Avg flops: {0:8.2f} TFlops/s".format(1 / avg * num_parameters * num_bytes * bs / 1e12))
-        print("Avg Throughput: tokens/s: {}".format((1000 / (avg * 1000)) * bs))
-
-
-def bench_bloom(args):
-    model_path = args.path
-    max_batch_size = args.batch_size
-    max_input_len = args.input_len
-    max_output_len = args.output_len
-
-    tokenizer = BloomTokenizerFast.from_pretrained(model_path)
-    tokenizer.pad_token = tokenizer.eos_token
-    model = BloomForCausalLM.from_pretrained(model_path, pad_token_id=tokenizer.eos_token_id)
-    model = model.half()
-
-    # init TPInferEngine and shard the original model
-    # To benchmark torch original, comment out the line of optimizing model
-    shard_config = ShardConfig(enable_tensor_parallelism=True if args.tp_size > 1 else False, inference_only=True)
-    infer_engine = TPInferEngine(model, shard_config, max_batch_size, max_input_len, max_output_len)
-
-    # prepare data for generation
-    generate_kwargs = dict(max_new_tokens=max_output_len, do_sample=False)
-    input_tokens = {
-        "input_ids": torch.randint(10, 1000, (max_batch_size, max_input_len)),
-        "attention_mask": torch.ones((max_batch_size, max_input_len)),
-    }
-    for t in input_tokens:
-        if torch.is_tensor(input_tokens[t]):
-            input_tokens[t] = input_tokens[t].to(torch.cuda.current_device())
-            print(f" input_tokens[{t}].shape: {input_tokens[t].shape}")
-
-    iters = 10
-    times = []
-    for i in range(iters):
-        torch.cuda.synchronize()
-        start = time.time()
-        outputs = infer_engine.generate(input_tokens, **generate_kwargs)
-        torch.cuda.synchronize()
-        end = time.time()
-        out_len = outputs.shape[1]
-        print(f" iter {i}: out len {str(out_len)}, generation time {str(end - start)} s")
-        times.append((end - start) / (out_len - max_input_len))
-
-    print_perf_stats(times, model.config, max_batch_size)
-
-
-def check_bloom(rank, world_size, port, args):
-    disable_existing_loggers()
-    colossalai.launch(config={}, rank=rank, world_size=world_size, host="localhost", port=port, backend="nccl")
-    bench_bloom(args)
-
-
-@rerun_if_address_is_in_use()
-@clear_cache_before_run()
-def test_bloom(args):
-    spawn(check_bloom, args.tp_size, args=args)
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument("-p", "--path", type=str, help="Model path", required=True)
-    parser.add_argument("-tp", "--tp_size", type=int, default=1, help="Tensor parallel size")
-    parser.add_argument("-b", "--batch_size", type=int, default=16, help="Maximum batch size")
-    parser.add_argument("--input_len", type=int, default=1024, help="Maximum input length")
-    parser.add_argument("--output_len", type=int, default=128, help="Maximum output length")
-
-    args = parser.parse_args()
-
-    test_bloom(args)
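Note on the removed `print_perf_stats` helpers: each one drops the first `warmup` latencies, averages the rest, and derives bandwidth from a rough parameter count of `12 * num_layers * hidden_size**2`. For a standard transformer block with a 4x MLP that is 4h² for attention plus 8h² for the MLP, so for LLaMA-style MLPs it is only an approximation. Below is a minimal torch-free sketch of that arithmetic. The names `estimate_params` and `perf_stats` are hypothetical and not part of the repository.

```python
def estimate_params(num_layers: int, hidden_size: int) -> int:
    """Rough weight count used by the removed scripts: 12 * L * h^2 (embeddings ignored)."""
    return 12 * num_layers * hidden_size * hidden_size


def perf_stats(latencies, num_layers, hidden_size, batch_size, warmup=3, num_bytes=2):
    """Summarize per-token latencies (in seconds) after trimming warmup runs.

    Hypothetical helper mirroring the removed print_perf_stats; it returns the
    numbers instead of printing them. num_bytes=2 corresponds to float16.
    """
    kept = list(latencies)[warmup:]  # discard warmup iterations
    if not kept:
        return None  # nothing left to average
    avg = sum(kept) / len(kept)
    params = estimate_params(num_layers, hidden_size)
    return {
        "latency_ms": avg * 1000,
        # bytes of weights read per generated token, divided by time per token
        "bandwidth_gb_s": params * num_bytes / avg / 1e9,
        "throughput_tokens_s": batch_size / avg,
    }
```

With the `llama-7b` shape from the new `CONFIG_MAP` (32 layers, hidden size 4096), `estimate_params(32, 4096)` gives about 6.4e9, a little under the nominal 7B because embeddings are excluded.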
--- a/examples/inference/bench_llama.py
+++ b/examples/inference/bench_llama.py
@@ -1,133 +0,0 @@
-import argparse
-import os
-import time
-
-import torch
-from transformers import LlamaForCausalLM, LlamaTokenizer
-
-import colossalai
-from colossalai.inference.tensor_parallel.engine import TPInferEngine
-from colossalai.logging import disable_existing_loggers
-from colossalai.shardformer import ShardConfig
-from colossalai.testing import clear_cache_before_run, rerun_if_address_is_in_use, spawn
-
-os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "true"
-
-
-def print_perf_stats(latency_set, config, bs, warmup=3):
-    torch.cuda.empty_cache()
-    # trim warmup queries
-    latency_set = list(latency_set)
-    latency_set = latency_set[warmup:]
-    count = len(latency_set)
-
-    if count > 0:
-        latency_set.sort()
-        avg = sum(latency_set) / count
-        num_layers = getattr(config, "num_layers", config.num_hidden_layers)
-        num_parameters = num_layers * config.hidden_size * config.hidden_size * 12
-        num_bytes = 2
-
-        print("Avg Per Token Latency: {0:8.2f} ms".format(avg * 1000))
-        print("Avg BW: {0:8.2f} GB/s".format(1 / avg * num_parameters * num_bytes / 1e9))
-        print("Avg flops: {0:8.2f} TFlops/s".format(1 / avg * num_parameters * num_bytes * bs / 1e12))
-
-
-def run_llama_test(args):
-    llama_model_path = args.path
-    max_batch_size = args.batch_size
-    max_input_len = args.input_len
-    max_output_len = args.output_len
-    args.test_mode
-
-    print("max_batch_size : " + str(max_batch_size))
-
-    tokenizer = LlamaTokenizer.from_pretrained(llama_model_path)
-    tokenizer.pad_token_id = tokenizer.unk_token_id
-    model = LlamaForCausalLM.from_pretrained(llama_model_path, pad_token_id=tokenizer.eos_token_id)
-    model = model.half()
-    model.config
-
-    shard_config = ShardConfig(enable_tensor_parallelism=True if args.tp_size > 1 else False, inference_only=True)
-    infer_engine = TPInferEngine(model, shard_config, max_batch_size, max_input_len, max_output_len)
-
-    generate_kwargs = dict(max_new_tokens=1, do_sample=False)
-    input_tokens = {
-        "input_ids": torch.randint(1, 1000, (max_batch_size, max_input_len), device="cuda"),
-        "attention_mask": torch.ones((max_batch_size, max_input_len), device="cuda"),
-    }
-
-    iters = 10
-    prefill_times = []
-
-    warmup = 3
-
-    for i in range(iters):
-        torch.cuda.synchronize()
-        start = time.time()
-        outputs = infer_engine.generate(input_tokens, **generate_kwargs)
-        torch.cuda.synchronize()
-        end = time.time()
-        out_len = outputs.shape[1]
-        print("generation time {} s".format(str(end - start)))
-        print(out_len - max_input_len)
-        prefill_times.append((end - start) / (out_len - max_input_len))
-
-    prefill_times = prefill_times[warmup:]
-    prefill_time_avg = sum(prefill_times) / len(prefill_times)
-    generate_kwargs = dict(max_new_tokens=max_output_len, do_sample=False)
-
-    times = []
-    decoder_times = []
-    for i in range(iters):
-        torch.cuda.synchronize()
-        start = time.time()
-        outputs = infer_engine.generate(input_tokens, **generate_kwargs)
-        torch.cuda.synchronize()
-        end = time.time()
-        out_len = outputs.shape[1]
-        print("generation time {} s".format(str(end - start)))
-        print(out_len - max_input_len)
-        times.append((end - start) / (out_len - max_input_len))
-        if args.test_mode == "decoder_test":
-            decoder_times.append((end - start - prefill_time_avg) / (out_len - max_input_len - 1))
-
-    times = times[warmup:]
-    latency = sum(times) / len(times)
-    print("total process latency is : " + str(latency) + " s")
-    print("total throughput is : " + str(1 / latency * max_batch_size))
-
-    if args.test_mode == "decoder_test":
-        decoder_times = decoder_times[warmup:]
-        latency = sum(decoder_times) / len(decoder_times)
-
-        print("decoder process latency is : " + str(latency) + " s")
-        print("decoder throughput is : " + str(1 / latency * max_batch_size))
-
-
-def check_llama(rank, world_size, port, args):
-    disable_existing_loggers()
-    colossalai.launch(config={}, rank=rank, world_size=world_size, host="localhost", port=port, backend="nccl")
-    run_llama_test(args)
-
-
-@rerun_if_address_is_in_use()
-@clear_cache_before_run()
-def test_llama(args):
-    spawn(check_llama, args.tp_size, args=args)
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument("-p", "--path", type=str, help="Model path", required=True)
-    parser.add_argument("-tp", "--tp_size", type=int, default=1, help="Tensor parallel size")
-    parser.add_argument("-b", "--batch_size", type=int, default=16, help="Maximum batch size")
-    parser.add_argument("--input_len", type=int, default=256, help="Maximum input length")
-    parser.add_argument("--output_len", type=int, default=128, help="Maximum output length")
-    parser.add_argument(
-        "--test_mode", type=str, help="Test mode", default="e2e_test", choices=["e2e_test", "decoder_test"]
-    )
-
-    args = parser.parse_args()
-
-    test_llama(args)
--- a/examples/inference/benchmark_llama.py
+++ b/examples/inference/benchmark_llama.py
@@ -0,0 +1,167 @@
+import argparse
+import time
+
+import torch
+import torch.distributed as dist
+import transformers
+
+import colossalai
+from colossalai.accelerator import get_accelerator
+from colossalai.inference import InferenceEngine
+from colossalai.testing import clear_cache_before_run, rerun_if_address_is_in_use, spawn
+
+GIGABYTE = 1024**3
+MEGABYTE = 1024 * 1024
+
+CONFIG_MAP = {
+    "toy": transformers.LlamaConfig(num_hidden_layers=4),
+    "llama-7b": transformers.LlamaConfig(
+        hidden_size=4096,
+        intermediate_size=11008,
+        num_attention_heads=32,
+        num_hidden_layers=32,
+        num_key_value_heads=32,
+        max_position_embeddings=2048,
+    ),
+    "llama-13b": transformers.LlamaConfig(
+        hidden_size=5120,
+        intermediate_size=13824,
+        num_attention_heads=40,
+        num_hidden_layers=40,
+        num_key_value_heads=40,
+        max_position_embeddings=2048,
+    ),
+    "llama2-7b": transformers.LlamaConfig(
+        hidden_size=4096,
+        intermediate_size=11008,
+        num_attention_heads=32,
+        num_hidden_layers=32,
+        num_key_value_heads=32,
+        max_position_embeddings=4096,
+    ),
+    "llama2-13b": transformers.LlamaConfig(
+        hidden_size=5120,
+        intermediate_size=13824,
+        num_attention_heads=40,
+        num_hidden_layers=40,
+        num_key_value_heads=40,
+        max_position_embeddings=4096,
+    ),
+}
+
+
+def data_gen(batch_size: int = 4, seq_len: int = 512):
+    input_ids = torch.randint(10, 30000, (batch_size, seq_len), device=get_accelerator().get_current_device())
+    attention_mask = torch.ones_like(input_ids)
+    data = dict(input_ids=input_ids, attention_mask=attention_mask)
+    return data
+
+
+def print_details_info(outputs, model_config, args, whole_end2end):
+    msg: str = ""
+
+    if dist.get_rank() == 0:
+        msg += "-------Perf Summary-------\n"
+        if args.verbose:
+            timestamps = outputs[1]
+            prefill = []
+            encoder = []
+            end2end = []
+            for timestamp in timestamps:
+                prefill.append(timestamp[1] - timestamp[0])
+                encoder.append(
+                    sum(timestamp[i + 1] - timestamp[i] for i in range(1, len(timestamp) - 1)) / (len(timestamp) - 2)
+                )
+                end2end.append(timestamp[-1] - timestamp[0])
+
+            mb_avg_end2end = sum(end2end) / len(end2end)
+            mb_avg_latency = mb_avg_end2end / (args.output_len * args.mb_size)
+
+            msg += f"Average prefill time: {sum(prefill) / len(prefill) * 1000:.2f} ms\n"
+            msg += f"Average decode time: {sum(encoder) / len(encoder) * 1000:.2f} ms\n"
+            msg += f"Average micro batch end2end time: {mb_avg_end2end * 1000:.2f} ms\n"
+            msg += f"Average micro batch per token latency: {mb_avg_latency * 1000:.2f} ms\n"
+
+        whole_avg_latency = whole_end2end / (args.output_len * args.batch_size)
+        num_layers = getattr(model_config, "num_layers", model_config.num_hidden_layers)
+        num_parameters = num_layers * model_config.hidden_size * model_config.hidden_size * 12 / args.pp_size
+        if args.dtype in ["fp16", "bf16"]:
+            num_bytes = 2
+        else:
+            num_bytes = 4
+
+        msg += f"Whole batch end2end time: {whole_end2end * 1000:.2f} ms\n"
+        msg += f"Whole batch per token latency: {whole_avg_latency * 1000:.2f} ms\n"
+        msg += f"Throughput: {args.output_len * args.batch_size / whole_end2end:.2f} tokens/s\n"
+        msg += f"Flops: {num_parameters * num_bytes / whole_avg_latency / 1e12:.2f} TFLOPS\n"
+
+    if torch.cuda.is_available():
+        msg += f"-------Memory Summary Device:{get_accelerator().current_device()}-------\n"
+        msg += f"Max memory allocated: {get_accelerator().max_memory_allocated() / GIGABYTE:.2f} GB\n"
+        msg += f"Max memory reserved: {get_accelerator().max_memory_reserved() / GIGABYTE:.2f} GB\n"
+
+    print(msg)
+
+
+def benchmark_inference(args):
+    config = CONFIG_MAP[args.model]
+    model = transformers.LlamaForCausalLM(config)
+    if dist.get_rank() == 0:
+        print("Model loaded")
+    engine = InferenceEngine(
+        pp_size=args.pp_size,
+        tp_size=args.tp_size,
+        dtype=args.dtype,
+        micro_batch_size=args.mb_size,
+        model=model,
+        verbose=args.verbose,
+        max_batch_size=args.batch_size,
+        max_input_len=args.seq_len,
+        max_output_len=args.output_len,
+    )
+    data = data_gen(args.batch_size, args.seq_len)
+
+    N_WARMUP_STEPS = 2
+
+    for _ in range(N_WARMUP_STEPS):
+        engine.generate(data)
+
+    torch.cuda.synchronize()
+    whole_end2end = time.time()
+    outputs = engine.generate(data)
+    torch.cuda.synchronize()
+    whole_end2end = time.time() - whole_end2end
+
+    print_details_info(outputs, model.config, args, whole_end2end)
+
+
+def hybrid_inference(rank, world_size, port, args):
+    colossalai.launch(config={}, rank=rank, world_size=world_size, host="localhost", port=port, backend="nccl")
+    benchmark_inference(args)
+
+
+@rerun_if_address_is_in_use()
+@clear_cache_before_run()
+def benchmark(args):
+    spawn(hybrid_inference, nprocs=args.tp_size * args.pp_size, args=args)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "-m",
+        "--model",
+        default="toy",
+        help="the size of model",
+        choices=["toy", "llama-7b", "llama-13b", "llama2-7b", "llama2-13b"],
+    )
+    parser.add_argument("-b", "--batch_size", type=int, default=8, help="batch size")
+    parser.add_argument("-s", "--seq_len", type=int, default=8, help="sequence length")
+    parser.add_argument("--mb_size", type=int, default=1, help="micro_batch_size")
+    parser.add_argument("--pp_size", type=int, default=1, help="pipeline size")
+    parser.add_argument("--tp_size", type=int, default=1, help="tensor parallel size")
+    parser.add_argument("--output_len", type=int, default=128, help="Output length")
+    parser.add_argument("--dtype", type=str, default="fp16", help="data type")
+    parser.add_argument("-v", "--verbose", default=False, action="store_true")
+    args = parser.parse_args()
+    benchmark(args)
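Note on the verbose path of `print_details_info`: as the diff reads, `outputs[1]` holds one list of timestamps per micro batch. The first interval is treated as prefill, the remaining intervals are averaged as per-step time, and first-to-last is the end-to-end time. Below is a standalone sketch of that reduction, assuming each list has at least three timestamps. The function name `summarize_timestamps` and the returned dict layout are hypothetical.

```python
def summarize_timestamps(timestamps):
    """Reduce per-micro-batch timestamp lists to average timings in seconds.

    Hypothetical helper: each inner list is assumed to look like
    [start, after_prefill, after_step_1, ..., after_last_step].
    """
    if not timestamps:
        raise ValueError("need at least one micro batch")
    prefill, step, end2end = [], [], []
    for ts in timestamps:
        if len(ts) < 3:
            raise ValueError("need at least three timestamps per micro batch")
        prefill.append(ts[1] - ts[0])
        # intervals after the prefill; there are len(ts) - 2 of them
        gaps = [ts[i + 1] - ts[i] for i in range(1, len(ts) - 1)]
        step.append(sum(gaps) / len(gaps))
        end2end.append(ts[-1] - ts[0])
    n = len(timestamps)
    return {
        "prefill_s": sum(prefill) / n,
        "per_step_s": sum(step) / n,
        "end2end_s": sum(end2end) / n,
    }
```

Returning the numbers rather than appending to a message string keeps the reduction easy to check separately from the rank-0 printing logic.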
--- a/examples/inference/build_smoothquant_weight.py
+++ b/examples/inference/build_smoothquant_weight.py
@@ -29,7 +29,7 @@ def parse_args():
        type=str,
        help="location of the calibration dataset",
    )
-    parser.add_argument("--num-samples", type=int, default=512)
+    parser.add_argument("--num-samples", type=int, default=10)
    parser.add_argument("--seq-len", type=int, default=512)
    args = parser.parse_args()
    return args
@@ -41,13 +41,12 @@ def main():
    model_path = args.model_name
    dataset_path = args.dataset_path
    output_path = args.output_path
-    num_samples = 10
-    seq_len = 512
+    num_samples = args.num_samples
+    seq_len = args.seq_len

    model, tokenizer = build_model_and_tokenizer(model_path)
    if not os.path.exists(dataset_path):
-        print(f"Cannot find the dataset at {args.dataset_path}")
-        raise FileNotFoundError
+        raise FileNotFoundError(f"Cannot find the dataset at {args.dataset_path}")
    dataset = load_dataset("json", data_files=dataset_path, split="train")

    model.quantized(tokenizer, dataset, num_samples=num_samples, seq_len=seq_len)
@@ -55,15 +54,6 @@ def main():

    model.save_quantized(output_path, model_basename="llama-7b")

-    model = SmoothLlamaForCausalLM.from_quantized(output_path, model_basename="llama-7b")
-    model = model.cuda()
-
-    generate_kwargs = dict(max_new_tokens=16, do_sample=False, use_cache=True)
-    input_tokens = tokenizer(["today is "], return_tensors="pt").to("cuda")
-    out = model.generate(**input_tokens, **generate_kwargs)
-    text = tokenizer.batch_decode(out)
-    print("out is:", text)
-

 if __name__ == "__main__":
    main()
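Note on the `build_smoothquant_weight.py` hunks: the change makes `main()` honor `--num-samples` and `--seq-len` instead of hard-coded values, and moves the error text into the exception itself. Below is a minimal sketch of that pattern in isolation. The `--dataset-path` flag spelling and the helper names `parse_args` and `require_dataset` are assumptions for illustration; only `--num-samples`, `--seq-len`, and the error message come from the diff.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse calibration options; defaults follow the values shown in the diff."""
    parser = argparse.ArgumentParser()
    # Flag name assumed; the diff only shows that args.dataset_path exists.
    parser.add_argument("--dataset-path", type=str, help="location of the calibration dataset")
    parser.add_argument("--num-samples", type=int, default=10)
    parser.add_argument("--seq-len", type=int, default=512)
    return parser.parse_args(argv)


def require_dataset(path):
    """Raise with a descriptive message instead of printing and raising a bare error."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"Cannot find the dataset at {path}")
    return path
```

Putting the message in the exception means it also shows up in tracebacks and logs that capture only the exception, not stdout.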
--- a/examples/inference/gptq_bloom.py
+++ b/examples/inference/gptq_bloom.py
@@ -1,120 +0,0 @@
-import argparse
-import os
-import time
-
-import torch
-from auto_gptq import AutoGPTQForCausalLM
-from transformers import BloomTokenizerFast
-
-import colossalai
-from colossalai.inference.tensor_parallel.engine import TPInferEngine
-from colossalai.logging import disable_existing_loggers
-from colossalai.shardformer import ShardConfig
-from colossalai.testing import clear_cache_before_run, rerun_if_address_is_in_use, spawn
-
-os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "true"
-
-
-def print_perf_stats(latency_set, config, bs, warmup=3):
-    # trim warmup queries
-    latency_set = list(latency_set)
-    latency_set = latency_set[warmup:]
-    count = len(latency_set)
-
-    if count > 0:
-        latency_set.sort()
-        avg = sum(latency_set) / count
-        num_layers = getattr(config, "num_layers", config.num_hidden_layers)
-        num_parameters = num_layers * config.hidden_size * config.hidden_size * 12
-        num_bytes = 2  # float16
-
-        print("Avg Per Token Latency: {0:8.2f} ms".format(avg * 1000))
-        print("Avg BW: {0:8.2f} GB/s".format(1 / avg * num_parameters * num_bytes / 1e9))
-        print("Avg flops: {0:8.2f} TFlops/s".format(1 / avg * num_parameters * num_bytes * bs / 1e12))
-        print("Avg Throughput: tokens/s: {}".format((1000 / (avg * 1000)) * bs))
-
-
-def bench_bloom(args):
-    pretrained_model_dir = args.path
-    quantized_model_dir = args.quantized_path
-    max_batch_size = args.batch_size
-    max_input_len = args.input_len
-    max_output_len = args.output_len
-
-    tokenizer = BloomTokenizerFast.from_pretrained(pretrained_model_dir)
-    tokenizer.pad_token = tokenizer.eos_token
-
-    # load quantized model to the first GPU
-    model = AutoGPTQForCausalLM.from_quantized(
-        quantized_model_dir, device=torch.cuda.current_device(), inject_fused_attention=False
-    )
-
-    model = model.half()
-
-    model_config = model.config
-    shard_config = ShardConfig(enable_tensor_parallelism=True if args.tp_size > 1 else False, inference_only=True)
-    infer_engine = TPInferEngine(model, shard_config, max_batch_size, max_input_len, max_output_len)
-    generate_kwargs = dict(max_new_tokens=max_output_len, do_sample=False)
-
-    input_tokens = {
-        "input_ids": torch.randint(1, 1000, (max_batch_size, max_input_len), device="cuda"),
-        "attention_mask": torch.ones((max_batch_size, max_input_len), device="cuda"),
-    }
-
-    # init TPInferEngine and shard the original model
-    # To benchmark torch original, comment out the line of optimizing model
-    shard_config = ShardConfig(
-        enable_tensor_parallelism=True if args.tp_size > 1 else False, inference_only=True, inference_gptq=True
-    )
-    infer_engine = TPInferEngine(model, shard_config, max_batch_size, max_input_len, max_output_len)
-
-    # prepare data for generation
-    generate_kwargs = dict(max_new_tokens=max_output_len, do_sample=False)
-    input_tokens = {
-        "input_ids": torch.randint(10, 1000, (max_batch_size, max_input_len)),
-        "attention_mask": torch.ones((max_batch_size, max_input_len)),
-    }
-    for t in input_tokens:
-        if torch.is_tensor(input_tokens[t]):
-            input_tokens[t] = input_tokens[t].to(torch.cuda.current_device())
-            # print(f" input_tokens[{t}].shape: {input_tokens[t].shape}")
-
-    iters = 10
-    times = []
-    for i in range(iters):
-        torch.cuda.synchronize()
-        start = time.time()
-        outputs = infer_engine.generate(input_tokens, **generate_kwargs)
-        torch.cuda.synchronize()
-        end = time.time()
-        out_len = outputs.shape[1]
-        print(f" iter {i}: out len {str(out_len)}, generation time {str(end - start)} s")
-        times.append((end - start) / (out_len - max_input_len))
-
-    print_perf_stats(times, model_config, max_batch_size)
-
-
-def check_bloom(rank, world_size, port, args):
-    disable_existing_loggers()
-    colossalai.launch(config={}, rank=rank, world_size=world_size, host="localhost", port=port, backend="nccl")
-    bench_bloom(args)
-
-
-@rerun_if_address_is_in_use()
-@clear_cache_before_run()
-def test_bloom(args):
-    spawn(check_bloom, args.tp_size, args=args)
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument("-p", "--path", type=str, help="Model path", required=True)
-    parser.add_argument("-q", "--quantized_path", type=str, help="Model path", required=True)
-    parser.add_argument("-tp", "--tp_size", type=int, default=1, help="Tensor parallel size")
-    parser.add_argument("-b", "--batch_size", type=int, default=16, help="Maximum batch size")
-    parser.add_argument("--input_len", type=int, default=1024, help="Maximum input length")
-    parser.add_argument("--output_len", type=int, default=128, help="Maximum output length")
-
-    args = parser.parse_args()
-
-    test_bloom(args)
--- a/examples/inference/gptq_llama.py
+++ b/examples/inference/gptq_llama.py
@@ -1,104 +0,0 @@
-import argparse
-import os
-import time
-
-import torch
-from auto_gptq import AutoGPTQForCausalLM
-from transformers import LlamaTokenizer
-
-import colossalai
-from colossalai.inference.tensor_parallel.engine import TPInferEngine
-from colossalai.logging import disable_existing_loggers
-from colossalai.shardformer import ShardConfig
-from colossalai.testing import clear_cache_before_run, rerun_if_address_is_in_use, spawn
-
-os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "true"
-
-
-def print_perf_stats(latency_set, config, bs, warmup=3):
-    # trim warmup queries
-    latency_set = list(latency_set)
-    latency_set = latency_set[warmup:]
-    count = len(latency_set)
-
-    if count > 0:
-        latency_set.sort()
-        avg = sum(latency_set) / count
-        num_layers = getattr(config, "num_layers", config.num_hidden_layers)
-        num_parameters = num_layers * config.hidden_size * config.hidden_size * 12
-        num_bytes = 2
-
-        print("Avg Per Token Latency: {0:8.2f} ms".format(avg * 1000))
-        print("Avg BW: {0:8.2f} GB/s".format(1 / avg * num_parameters * num_bytes / 1e9))
-        print("Avg flops: {0:8.2f} TFlops/s".format(1 / avg * num_parameters * num_bytes * bs / 1e12))
-        print("Avg Throughput: tokens/s: {}".format((1000 / (avg * 1000)) * bs))
-
-
-def run_llama_test(args):
-    pretrained_model_dir = args.path
-    quantized_model_dir = args.quantized_path
-    max_batch_size = args.batch_size
-    max_input_len = args.input_len
-    max_output_len = args.output_len
-
-    tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
-    tokenizer.pad_token_id = tokenizer.eos_token_id
-
-    # load quantized model to the first GPU
-    model = AutoGPTQForCausalLM.from_quantized(
-        quantized_model_dir, device=torch.cuda.current_device(), inject_fused_attention=False
-    )
-
-    model_config = model.config
-    shard_config = ShardConfig(
-        enable_tensor_parallelism=True if args.tp_size > 1 else False, inference_only=True, inference_gptq=True
-    )
-    infer_engine = TPInferEngine(model, shard_config, max_batch_size, max_input_len, max_output_len)
-
-    generate_kwargs = dict(max_new_tokens=max_output_len, do_sample=False)
-
-    input_tokens = {
-        "input_ids": torch.randint(1, 1000, (max_batch_size, max_input_len), device="cuda"),
-        "attention_mask": torch.ones((max_batch_size, max_input_len), device="cuda"),
-    }
-
-    iters = 10
-    times = []
-
-    for i in range(iters):
-        torch.cuda.synchronize()
-        start = time.time()
-        outputs = infer_engine.generate(input_tokens, **generate_kwargs)
-        torch.cuda.synchronize()
-        end = time.time()
-        out_len = outputs.shape[1]
-        print(f" iter {i}: out len {str(out_len)}, generation time {str(end - start)} s")
-        times.append((end - start) / (out_len - max_input_len))
-
-    print_perf_stats(times, model_config, max_batch_size)
-
-
-def check_llama(rank, world_size, port, args):
-    disable_existing_loggers()
-    colossalai.launch(config={}, rank=rank, world_size=world_size, host="localhost", port=port, backend="nccl")
-    run_llama_test(args)
-
-
-@rerun_if_address_is_in_use()
-@clear_cache_before_run()
-def test_llama(args):
-    spawn(check_llama, args.tp_size, args=args)
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument("-p", "--path", type=str, help="Model path", required=True)
-    parser.add_argument("-q", "--quantized_path", type=str, help="Model path", required=True)
-    parser.add_argument("-tp", "--tp_size", type=int, default=1, help="Tensor parallel size")
-    parser.add_argument("-b", "--batch_size", type=int, default=16, help="Maximum batch size")
-    parser.add_argument("--input_len", type=int, default=1024, help="Maximum input length")
-    parser.add_argument("--output_len", type=int, default=128, help="Maximum output length")
-
-    args = parser.parse_args()
-
-    test_llama(args)
--- a/examples/inference/run_benchmark.sh
+++ b/examples/inference/run_benchmark.sh
@@ -0,0 +1,15 @@
+ROOT=$(realpath $(dirname $0))
+PY_SCRIPT=${ROOT}/benchmark_llama.py
+GPU=$(nvidia-smi -L | head -1 | cut -d' ' -f4 | cut -d'-' -f1)
+
+mkdir -p logs
+
+# benchmark llama2-7b on a single GPU
+for bsz in 16 32 64; do
+    python3 ${PY_SCRIPT} -m llama2-7b --tp_size 1 --pp_size 1 -b $bsz -s 256 --output_len 128 | tee logs/${GPU}_${bsz}_256.txt
+done
+
+
+for bsz in 4 8 16 32 64; do
+    python3 ${PY_SCRIPT} -m llama2-7b --tp_size 1 --pp_size 1 -b $bsz -s 1024 --output_len 128 | tee logs/${GPU}_${bsz}_1024.txt
+done
--- a/examples/inference/run_llama_inference.py
+++ b/examples/inference/run_llama_inference.py
@@ -0,0 +1,98 @@
+import argparse
+
+import torch
+import torch.distributed as dist
+from transformers import LlamaForCausalLM, LlamaTokenizer
+
+import colossalai
+from colossalai.accelerator import get_accelerator
+from colossalai.inference import InferenceEngine
+from colossalai.testing import spawn
+
+INPUT_TEXTS = [
+    "What is the longest river in the world?",
+    "Explain the difference between process and thread in computer science.",
+]
+
+
+def run_inference(args):
+    llama_model_path = args.model_path
+    llama_tokenize_path = args.tokenizer_path or args.model_path
+
+    max_input_len = args.max_input_len
+    max_output_len = args.max_output_len
+    max_batch_size = args.batch_size
+    micro_batch_size = args.micro_batch_size
+    tp_size = args.tp_size
+    pp_size = args.pp_size
+    rank = dist.get_rank()
+
+    tokenizer = LlamaTokenizer.from_pretrained(llama_tokenize_path, padding_side="left")
+    tokenizer.pad_token_id = tokenizer.eos_token_id
+
+    if args.quant is None:
+        model = LlamaForCausalLM.from_pretrained(llama_model_path, pad_token_id=tokenizer.pad_token_id)
+    elif args.quant == "gptq":
+        from auto_gptq import AutoGPTQForCausalLM
+
+        model = AutoGPTQForCausalLM.from_quantized(
+            llama_model_path, inject_fused_attention=False, device=torch.cuda.current_device()
+        )
+    elif args.quant == "smoothquant":
+        from colossalai.inference.quant.smoothquant.models.llama import SmoothLlamaForCausalLM
+
+        model = SmoothLlamaForCausalLM.from_quantized(llama_model_path, model_basename=args.smoothquant_base_name)
+        model = model.cuda()
+
+    engine = InferenceEngine(
+        tp_size=tp_size,
+        pp_size=pp_size,
+        model=model,
+        max_input_len=max_input_len,
+        max_output_len=max_output_len,
+        max_batch_size=max_batch_size,
+        micro_batch_size=micro_batch_size,
+        quant=args.quant,
+        dtype=args.dtype,
+    )
+
+    inputs = tokenizer(INPUT_TEXTS, return_tensors="pt", padding="longest", max_length=max_input_len, truncation=True)
+    inputs = {k: v.to(get_accelerator().get_current_device()) for k, v in inputs.items()}
+    outputs = engine.generate(inputs)
+
+    if rank == 0:
+        output_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
+        for input_text, output_text in zip(INPUT_TEXTS, output_texts):
+            print(f"Input: {input_text}")
+            print(f"Output: {output_text}")
+
+
+def run_tp_pipeline_inference(rank, world_size, port, args):
+    colossalai.launch(config={}, rank=rank, world_size=world_size, host="localhost", port=port, backend="nccl")
+    run_inference(args)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-p", "--model_path", type=str, help="Model path", required=True)
+    parser.add_argument("-i", "--input", default="What is the longest river in the world?")
+    parser.add_argument("-t", "--tokenizer_path", type=str, help="Tokenizer path", default=None)
+    parser.add_argument(
+        "-q",
+        "--quant",
+        type=str,
+        choices=["gptq", "smoothquant"],
+        default=None,
+        help="quantization type: 'gptq' or 'smoothquant'",
+    )
+    parser.add_argument("--smoothquant_base_name", type=str, default=None, help="smoothquant base name")
+    parser.add_argument("--tp_size", type=int, default=1, help="Tensor parallel size")
+    parser.add_argument("--pp_size", type=int, default=1, help="Pipeline parallel size")
+    parser.add_argument("-b", "--batch_size", type=int, default=4, help="Maximum batch size")
+    parser.add_argument("--max_input_len", type=int, default=2048, help="Maximum input length")
+    parser.add_argument("--max_output_len", type=int, default=64, help="Maximum output length")
+    parser.add_argument("--micro_batch_size", type=int, default=1, help="Micro batch size")
+    parser.add_argument("--dtype", default="fp16", type=str)
+
+    args = parser.parse_args()
+    spawn(run_tp_pipeline_inference, nprocs=args.tp_size * args.pp_size, args=args)
--- a/examples/inference/serving/ray_serve/Colossal_Inference_rayserve.py
+++ b/examples/inference/serving/ray_serve/Colossal_Inference_rayserve.py
@@ -1,151 +0,0 @@
-import logging
-import os
-from typing import Any, List, Union
-
-import ray
-import ray.util.collective as collective
-import starlette
-import torch
-from pydantic import BaseModel
-from ray import serve
-from ray.serve import Application
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-import colossalai
-from colossalai.inference.tensor_parallel.engine import TPInferEngine
-from colossalai.shardformer import ShardConfig
-from colossalai.testing import free_port
-
-ray_serve_logger = logging.getLogger("ray.serve")
-
-
-class GenConfigArgs(BaseModel):
-    """Config for generation"""
-
-    path: str
-    tp_size: int = 2
-    max_batch_size: int = 4
-    max_input_len: int = 128
-    max_output_len: int = 32
-
-
-def log_cuda_info(scope_name: str):
-    ray_serve_logger.info(f" {scope_name}: ray.get_gpu_ids(): {ray.get_gpu_ids()}")
-    ray_serve_logger.info(
-        f" {scope_name}: CUDA_VISIBLE_DEVICES: {os.getenv('CUDA_VISIBLE_DEVICES', 'NO DEVICES FOUND!')}"
-    )
-    if torch.cuda.is_available():
-        ray_serve_logger.info(
-            f" {scope_name}: cuda current_device: {torch.cuda.current_device()}, cuda device count: {torch.cuda.device_count()}"
-        )
-    else:
-        ray_serve_logger.info(f" {scope_name}: cuda is not available!")
-
-
-@ray.remote(num_gpus=1)
-class Worker:
-    def __init__(self, model_path: str, tp_size: int, max_batch_size: int, max_input_len: int, max_output_len: int):
-        log_cuda_info("Worker.init")
-        self.tp_size = tp_size
-        self.model_path = model_path
-        self.max_batch_size = max_batch_size
-        self.max_input_len = max_input_len
-        self.max_output_len = max_output_len
-
-    def setup(self, world_size, rank, port):
-        # initialize a ray collective group, otherwise colossalai distributed env won't be built successfully
-        collective.init_collective_group(world_size, rank, "nccl", "default")
-        # initialize and set distributed environment
-        colossalai.launch(config={}, rank=rank, world_size=world_size, host="localhost", port=port, backend="nccl")
-        ray_serve_logger.info(f"Worker with rank {rank} (world size {world_size}) setting up..")
-        log_cuda_info("Worker.setup")
-
-        # Load model
-        self.tokenizer = AutoTokenizer.from_pretrained(self.model_path)
-        if self.tokenizer.pad_token is None:
-            self.tokenizer.pad_token = self.tokenizer.eos_token
-        self.model = AutoModelForCausalLM.from_pretrained(
-            self.model_path, pad_token_id=self.tokenizer.pad_token_id, torch_dtype=torch.float16
-        )
-
-        shard_config = ShardConfig(enable_tensor_parallelism=True if world_size > 1 else False, inference_only=True)
-        self.infer_engine = TPInferEngine(
-            self.model, shard_config, self.max_batch_size, self.max_input_len, self.max_output_len
-        )
-        self.generate_kwargs = dict(max_new_tokens=self.max_output_len, do_sample=False)
-
-        return True
-
-    def generate(self, text: Union[str, List[str]]) -> str:
-        input_tokens = self.tokenizer.batch_encode_plus(text, return_tensors="pt", padding=True)
-        ray_serve_logger.info(f"text: {text},\ninput_tokens: {input_tokens}")
-
-        model_output = self.infer_engine.generate(input_tokens, **self.generate_kwargs)
-        ray_serve_logger.info(f"model_output.shape: {model_output.shape}")
-
-        text_output = []
-        for i in range(len(model_output)):
-            text_output.append(self.tokenizer.decode(model_output[i]))
-        ray_serve_logger.info(f"output: {text_output}")
-
-        return text_output
-
-
-@serve.deployment(
-    ray_actor_options={"num_cpus": 1, "num_gpus": 0},
-    max_concurrent_queries=5,
-    autoscaling_config={
-        "target_num_ongoing_requests_per_replica": 1,
-        "min_replicas": 1,
-        "initial_replicas": 1,
-        "max_replicas": 1,
-    },
-)
-class Driver:
-    def __init__(self, config: GenConfigArgs):
-        log_cuda_info("Driver:init")
-        model_path = config.path
-        tp_size = config.tp_size
-
-        self.num_workers = tp_size
-        self.workers = []
-        init_rets = []
-
-        # Just grab a free port on localhost
-        # NOTE workers in this communication group listen to the same port
-        available_port = free_port()
-
-        for i in range(self.num_workers):
-            worker_name = "worker_idx_{}".format(i)
-            w = Worker.options(name=worker_name).remote(
-                model_path, self.num_workers, config.max_batch_size, config.max_input_len, config.max_output_len
-            )
-            self.workers.append(w)
-            init_rets.append(w.setup.remote(self.num_workers, i, available_port))
-        _options = {
-            "group_name": "default_driver",
-            "world_size": self.num_workers,
-            "ranks": [i for i in range(self.num_workers)],
-            "backend": "nccl",
-        }
-        collective.create_collective_group(self.workers, **_options)
-        _ = ray.get(init_rets)
-
-    # set batch wait delay in seconds and maximum number of sequences in a batch
-    @serve.batch(batch_wait_timeout_s=0.8, max_batch_size=4)
-    async def batch_generate(self, requests: List[str]):
-        ray_serve_logger.info(f"Driver.batch_generate: requests length: {len(requests)}\n requests: {requests}")
-        results = ray.get([w.generate.remote(requests) for w in self.workers])
-        text_res = results[0]  # get any one of the copies
-        return text_res
-
-    async def __call__(self, request: starlette.requests.Request) -> Any:
-        return await self.batch_generate(request.query_params["text"])
-
-
-def app(args: GenConfigArgs) -> Application:
-    print(args)
-    if args.path is None or not os.path.exists(args.path):
-        raise ValueError("Model path not provided or invalid path!")
-
-    return Driver.options(name="Colossal-Inference-Driver").bind(config=args)
--- a/examples/inference/serving/ray_serve/README.md
+++ b/examples/inference/serving/ray_serve/README.md
@@ -1,86 +0,0 @@
-# Colossal-Inference with Ray Serve
-
-This example is used for demonstrating and testing the deployment of Colossal Inference from `colossalai.inference` with [Ray Serve](https://docs.ray.io/en/latest/serve/index.html). It imports inference modules from colossalai and is based on https://github.com/hpcaitech/ColossalAI/tree/a22706337a57dd1c98b95739dd09d98bd55947a0.
-
-Single-gpu inference as well as multiple-gpu inference (i.e. tensor parallel) serving are supported.
-
-## Installation
-
-### Conda Environment
-```bash
-# create a new conda env with python 3.8
-conda create -n ray_test python=3.8.18
-
-# use torch1.13+cuda11.6
-pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
-
-# install ray from wheels
-pip install -U "ray[default,serve]"
-
-# install cuda toolkit (e.g. nvcc, etc)
-conda install -c "nvidia/label/cuda-11.6.2" cuda-toolkit
-
-# install cuDNN, cuTENSOR, and NCCL
-conda install -c conda-forge cupy cudnn cutensor nccl cuda-version=11.6
-
-# install colossalai with PyTorch extensions
-cd <path_to_ColossalAI_repo>
-CUDA_EXT=1 pip install -e .
-
-# install other dependencies
-pip install triton==2.0.0.dev20221202
-pip install transformers
-```
-
-## Launch Ray Serve and run the app
-### Method #1. CLI command
-
-Under the current directory, we could launch the app by the following command:
-```bash
-RAY_DEDUP_LOGS=0 serve run Colossal_Inference_rayserve:app path="PATH_TO_YOUR_MODEL_DIR"
-```
-
-By default, Ray deduplicates logs across cluster. Here we set `RAY_DEDUP_LOGS=0` to disable log deduplication, enabling each actor to log information in CLI. `serve run` runs an application from the specified import path. The formats should be `<filename>:<app_name>`.
-
-Then we could send requests by running python script in another window:
-```bash
-python send_request.py
-```
-
-### Method #2. Run inside script
-
-We could also launch ray serve and run the app inside a single script by making some modifications:
-To avoid ray handler from raising error in serializing pydantic objects, we'll replace the config class from `class GenConfigArgs(BaseModel)` to
-```python
-from dataclasses import dataclass
-@dataclass
-class GenConfigArgs:
-    # attributes remain unchanged
-```
-Comment out the app builder
-```python
-# def app(args: GenConfigArgs) -> Application:
-#     ...
-#     return Driver.options(name="Colossal-Inference-Driver").bind(config=args)
-```
-And attach the following lines to the end of the file,
-```python
-from ray.serve.handle import DeploymentHandle, DeploymentResponse
-
-app = Driver.bind(config=GenConfigArgs(path="<Path_to_model_dir>"))
-handle: DeploymentHandle = serve.run(app).options(use_new_handle_api=True)
-response: DeploymentResponse = handle.batch_generate.remote(requests="Introduce some landmarks in Beijing")
-print(response.result())
-```
-Then we could run the script
-```python
-python Colossal_Inference_rayserve.py
-```
-
-### Terminate Ray Serve
-Ray serve and the application would terminate automatically as you choose the second method to run any job in the script. If you choose the first method (serve run), you might want to apply `ctrl+c` to shut down the application, or use `serve shutdown` to shut down serve and deletes all applications on the ray cluster.
-
-To make sure all the active Ray processes are killed, run
-```bash
-ray stop
-```
--- a/examples/inference/serving/ray_serve/send_request.py
+++ b/examples/inference/serving/ray_serve/send_request.py
@@ -1,15 +0,0 @@
-import ray
-import requests
-
-
-@ray.remote
-def send_query(text):
-    resp = requests.get("http://localhost:8000/?text={}".format(text))
-    return resp.text
-
-
-test_sentence = "Introduce some landmarks in Beijing"
-
-result = ray.get(send_query.remote(test_sentence))
-print("Result returned:")
-print(result)
--- a/examples/inference/serving/ray_serve/send_requests.py
+++ b/examples/inference/serving/ray_serve/send_requests.py
@@ -1,27 +0,0 @@
-import ray
-import requests
-
-
-@ray.remote
-def send_query(text):
-    resp = requests.get("http://localhost:8000/?text={}".format(text))
-    return resp.text
-
-
-test_sentences = [
-    "Introduce some landmarks in Beijing",
-    "What is the weather today",
-    "Coding requires practice and patience",
-    "Rainy days inspire cozy reading",
-    "Laughter is contagious and heartwarming",
-    "Hiking mountains builds strength and resilience",
-    "Family bonds grow stronger with time",
-    "Science unlocks mysteries of the universe",
-    "Music soothes the soul and ignites passion",
-    "Artistic expression knows no boundaries",
-]
-
-results = ray.get([send_query.remote(text) for text in test_sentences])
-print("Result returned:")
-for res in results:
-    print(res)
--- a/examples/inference/serving/test_ci.sh
+++ b/examples/inference/serving/test_ci.sh
--- a/examples/inference/serving/torch_serve/Colossal_Inference_Handler.py
+++ b/examples/inference/serving/torch_serve/Colossal_Inference_Handler.py
@@ -1,193 +0,0 @@
-import logging
-import os
-import zipfile
-from abc import ABC
-
-import torch
-import transformers
-from transformers import AutoTokenizer, BloomForCausalLM, BloomTokenizerFast, LlamaForCausalLM
-from ts.torch_handler.base_handler import BaseHandler
-
-import colossalai
-from colossalai.inference.tensor_parallel.engine import TPInferEngine
-from colossalai.shardformer import ShardConfig
-from colossalai.testing import free_port
-
-logger = logging.getLogger(__name__)
-logger.info("Transformers version %s", transformers.__version__)
-logger.info("ColossalAI version %s", colossalai.__version__)
-
-
-class ColossalInferenceHandler(BaseHandler, ABC):
-    """
-    Transformers handler class for testing
-    """
-
-    def __init__(self):
-        super(ColossalInferenceHandler, self).__init__()
-        self.infer_engine = None
-        self.max_batch_size = None
-        self.max_input_len = None
-        self.max_output_len = None
-        self.tokenizer = None
-        self.initialized = False
-
-    def initialize(self, ctx):
-        """Expected behaviour: the sharded Bloom/Llama model is loaded.
-
-        Args:
-            ctx (context): It is a JSON Object containing information
-            pertaining to the model artefacts parameters.
-        """
-        if ctx is not None or not hasattr(ctx, "model_yaml_config"):
-            logger.error("Context ctx and model-config are not appropriately passed in.")
-
-        self.manifest = ctx.manifest
-        gpu_id = ctx.system_properties.get("gpu_id", -1)
-        model_dir = ctx.system_properties.get("model_dir")
-
-        # Inference configs are collected together in model yaml config for handler use
-        inference_config = ctx.model_yaml_config["handler"]
-        self.inference_config = inference_config
-        logger.info(self.inference_config)
-
-        self.tp_size = self.inference_config.get("tp_size", 1)
-        self.max_batch_size = self.inference_config.get("max_batch_size", 4)
-        self.max_input_len = self.inference_config.get("max_input_len", 1024)
-        self.max_output_len = self.inference_config.get("max_output_len", 128)
-
-        self.device = torch.device("cuda:" + str(gpu_id) if torch.cuda.is_available() and gpu_id >= 0 else "cpu")
-        logger.info(f"Device set to {self.device}")
-        logger.info(f"torch.cuda.device_count() {torch.cuda.device_count()}")
-
-        # Unpacking from model_dir
-        model_dir_path = os.path.join(model_dir, "model")
-        with zipfile.ZipFile(model_dir + "/model.zip", "r") as zip_ref:
-            zip_ref.extractall(model_dir_path)
-        logger.info(f"Loading {self.inference_config['model_type']} pretrain model and tokenizer")
-        if self.inference_config["model_type"] == "bloom":
-            self.model = BloomForCausalLM.from_pretrained(
-                model_dir_path,
-            )
-            self.tokenizer = BloomTokenizerFast.from_pretrained(model_dir_path, return_tensors="pt")
-        elif self.inference_config["model_type"] == "llama":
-            self.model = LlamaForCausalLM.from_pretrained(
-                model_dir_path,
-            )
-            self.tokenizer = AutoTokenizer.from_pretrained(model_dir_path, return_tensors="pt")
-        else:
-            logger.warning(f"Model type {self.inference_config['model_type']} not supported yet.")
-
-        logger.info("Transformer model from path %s loaded successfully", model_dir)
-
-        # NOTE world_size, rank, host, port here are used to launch colossalai dist environment
-        # This world_size is different from the world size of TorchServe
-        world_size = int(os.getenv("WORLD_SIZE", self.tp_size))
-        assert world_size == 1, "Colossal-Inference with tensor parallel is not supported on TorchServe for now"
-        rank = int(os.getenv("RANK", gpu_id))
-        local_rank = int(os.getenv("LOCAL_RANK", gpu_id))
-        host = os.getenv("MASTER_ADDR", "localhost")
-        port = os.getenv("MASTER_PORT", free_port())  # use a random free port
-
-        logger.info(
-            f"  world_size {world_size}" f"  local_rank {local_rank}" f"  rank {rank}" f"  host {host}" f"  port {port}"
-        )
-
-        torch.cuda.set_device(self.device)
-        self.model.half()
-        self.model.cuda()
-        self.model.eval()
-
-        colossalai.launch(config={}, rank=rank, world_size=world_size, host=host, port=port, backend="nccl")
-        logger.info("Initializing TPInferEngine ...")
-        shard_config = ShardConfig(enable_tensor_parallelism=True if self.tp_size > 1 else False, inference_only=True)
-        self.infer_engine = TPInferEngine(
-            self.model, shard_config, self.max_batch_size, self.max_input_len, self.max_output_len
-        )
-        logger.info("TPInferEngine initialized successfully")
-
-        self.model = self.infer_engine.model
-        self.initialized = True
-
-    def preprocess(self, requests):
-        """Basic text preprocessing, based on the user's chocie of application mode.
-        Args:
-            requests: The Input data in the form of text is passed on to the preprocess
-            function.
-        Returns:
-            list : The preprocess function returns a list of Tensor for the size of the word tokens.
-        """
-        logger.info("Pre-processing requests")
-        input_ids_batch = None
-        attention_mask_batch = None
-        for idx, data in enumerate(requests):
-            input_text = data.get("data")
-            if input_text is None:
-                input_text = data.get("body")
-            if isinstance(input_text, (bytes, bytearray)):
-                input_text = input_text.decode("utf-8")
-
-            logger.info("Received text: '%s'", input_text)
-
-            inputs = self.tokenizer.encode_plus(
-                input_text,
-                max_length=self.max_input_len,
-                padding=True,
-                add_special_tokens=True,
-                return_tensors="pt",
-                truncation=True,
-            )
-
-            input_ids = inputs["input_ids"].to(self.device)
-            attention_mask = inputs["attention_mask"].to(self.device)
-            # making a batch out of the recieved requests
-            # attention masks are passed for cases where input tokens are padded.
-            if input_ids.shape is not None:
-                if input_ids_batch is None:
-                    input_ids_batch = input_ids
-                    attention_mask_batch = attention_mask
-                else:
-                    input_ids_batch = torch.cat((input_ids_batch, input_ids), 0)
-                    attention_mask_batch = torch.cat((attention_mask_batch, attention_mask), 0)
-        return (input_ids_batch, attention_mask_batch)
-
-    def inference(self, input_batch):
-        """Predict the class (or classes) of the received text using the
-        serialized transformers checkpoint.
-        Args:
-            input_batch (list): List of Text Tensors from the pre-process function is passed here
-        Returns:
-            list : It returns a list of the predicted value for the input text
-        """
-        input_ids_batch, attention_mask_batch = input_batch
-        inferences = []
-
-        do_sample = self.inference_config.get("do_sample", True)
-        top_p = self.inference_config.get("top_p", 0.95 if do_sample else 1.0)
-        top_k = self.inference_config.get("top_k", 60 if do_sample else 50)
-        input_ids_batch = input_ids_batch.to(self.device)
-        outputs = self.infer_engine.generate(
-            dict(input_ids=input_ids_batch, attention_mask=attention_mask_batch),
-            do_sample=do_sample,
-            top_p=top_p,
-            top_k=top_k,
-        )
-
-        for i, _ in enumerate(outputs):
-            inferences.append(self.tokenizer.decode(outputs[i], skip_special_tokens=True))
-
-        # For testing only
-        logger.info(
-            f"Generated text: {inferences}",
-        )
-
-        return inferences
-
-    def postprocess(self, inference_output):
-        """Post Process Function converts the predicted response into Torchserve readable format.
-        Args:
-            inference_output (list): It contains the predicted response of the input text.
-        Returns:
-            (list): Returns a list of the Predictions and Explanations.
-        """
-        return inference_output
--- a/examples/inference/serving/torch_serve/README.md
+++ b/examples/inference/serving/torch_serve/README.md
@@ -1,109 +0,0 @@
-# Colossal-Inference with TorchServe
-
-## Overview
-
-This demo is used for testing and demonstrating the usage of Colossal Inference from `colossalai.inference` with deployment with TorchServe. It imports inference modules from colossalai and is based on
-https://github.com/hpcaitech/ColossalAI/tree/3e05c07bb8921f2a8f9736b6f6673d4e9f1697d0. For now, single-gpu inference serving is supported.
-
-## Environment for testing
-### Option #1: Use Conda Env
-Records to create a conda env to test locally as follows. We might want to use docker or configure env on cloud platform later.
-
-*NOTE*: It requires the installation of jdk and the set of `JAVA_HOME`. We recommend to install open-jdk-17 (Please refer to https://openjdk.org/projects/jdk/17/)
-
-```bash
-# use python 3.8 or 3.9
-conda create -n infer python=3.9
-
-# use torch 1.13+cuda11.6 for inference
-pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
-
-# conda cuda toolkit (e.g. nvcc, etc)
-conda install -c "nvidia/label/cuda-11.6.2" cuda-toolkit
-
-# install colossalai with PyTorch extensions
-cd <path_to_ColossalAI_repo>
-pip install -r requirements/requirements.txt
-pip install -r requirements/requirements-test.txt
-CUDA_EXT=1 pip install -e .
-
-# install torchserve
-cd <path_to_torch_serve_repo>
-python ./ts_scripts/install_dependencies.py --cuda=cu116
-pip install torchserve torch-model-archiver torch-workflow-archiver
-```
-
-### Option #2: Use Docker
-To use the stable diffusion Docker image, you can build using the provided the [Dockerfile](./docker/Dockerfile).
-
-```bash
-# build from dockerfile
-cd ColossalAI/examples/inference/serving/torch_serve/docker
-docker build -t hpcaitech/colossal-infer-ts:0.2.0 .
-```
-
-Once you have the image ready, you can launch the image with the following command
-
-```bash
-cd ColossalAI/examples/inference/serving/torch_serve
-
-# run the docker container
-docker run --rm \
-    -it --gpus all \
-    --name <name_you_assign> \
-    -v <your-data-dir>:/data/scratch \
-    -w <ColossalAI_dir> \
-    hpcaitech/colossal-infer-ts:0.2.0 \
-    /bin/bash
-```
-
-## Steps to deploy a model
-
-###  1.download/prepare a model
-We will download a bloom model, and then zip the downloaded model. You could download the model from [HuggingFace](https://huggingface.co/models) manually, or you might want to refer to this script [download_model.py](https://github.com/pytorch/serve/blob/c3ca2599b4d36d2b61302064b02eab1b65e1908d/examples/large_models/utils/Download_model.py) provided by pytorch-serve team to help you download a snapshot of the model.
-
-```bash
-# download snapshots
-cd <path_to_torch_serve>/examples/large_models/utils/
-huggingface-cli login
-python download_model.py --model_name bigscience/bloom-560m -o <path_to_store_downloaded_model>
-
-# zip the model repo
-cd <path_to_store_downloaded_model>/models--bigscience--bloom-560m/snapshots/<specific_revision>
-zip -r <path_to_place_zipped_model>//model.zip *
-```
-
-> **_NOTE:_**  The torch archiver and server will use `/tmp/` folder. Depending on the limit of disk quota, using torch-model-archiver might cause OSError "Disk quota exceeded". To prevent the OSError, set tmp dir environment variable as follows:
-`export TMPDIR=<dir_with_enough_space>/tmp` and `export TEMP=<dir_with_enough_space>/tmp`,
-or use relatively small models (as we did) for local testing.
-
-### 2. Archive the model
-With torch archiver, we will pack the model file (.zip) as well as handler file (.py) together into a .mar file. And then in serving process these files will be unpacked by TorchServe. Revelant model configs and inference configs can be set in `model-config.yaml`.
-```bash
-cd ./ColossalAI/examples/inference/serving/torch_serve
-# create a folder under the current directory to store the packed model created by torch archiver
-mkdir model_store
-torch-model-archiver --model-name bloom --version 0.1 --handler Colossal_Inference_Handler.py --config-file model-config.yaml --extra-files <dir_zipped_model>/model.zip --export-path ./model_store/
-```
-
-### 3. Launch serving
-
-Modify `load_models` in config.properties to select the model(s) stored in <model_store> directory to be deployed. By default we use `load_models=all` to load and deploy all the models (.mar) we have.
-
-```bash
-torchserve --start --ncs --ts-config config.properties
-```
-We could set inference, management, and metrics addresses and other TorchServe settings in `config.properties`.
-
-TorchServe will create a folder `logs/` under the current directory to store ts, model, and metrics logs.
-
-### 4. Run inference
-
-```bash
-# check inference status
-curl http://0.0.0.0:8084/ping
-
-curl -X POST http://localhost:8084/predictions/bloom -T sample_text.txt
-```
-
-To stop TorchServe, run `torchserve --stop`
--- a/examples/inference/serving/torch_serve/config.properties
+++ b/examples/inference/serving/torch_serve/config.properties
@@ -1,10 +0,0 @@
-inference_address=http://0.0.0.0:8084
-management_address=http://0.0.0.0:8085
-metrics_address=http://0.0.0.0:8086
-enable_envvars_config=true
-install_py_dep_per_model=true
-number_of_gpu=1
-load_models=all
-max_response_size=655350000
-default_response_timeout=6000
-model_store=./model_store
--- a/examples/inference/serving/torch_serve/docker/Dockerfile
+++ b/examples/inference/serving/torch_serve/docker/Dockerfile
@@ -1,57 +0,0 @@
-FROM hpcaitech/pytorch-cuda:1.13.0-11.6.0
-
-# enable passwordless ssh
-RUN mkdir ~/.ssh && \
-    printf "Host * \n    ForwardAgent yes\nHost *\n    StrictHostKeyChecking no" > ~/.ssh/config && \
-    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa && \
-    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
-
-# install curl
-RUN apt-get update && \
-    apt-get -y install curl && \
-    apt-get clean && \
-    rm -rf /var/lib/apt/lists/*
-
-# Download and extract OpenJDK 17
-ENV JAVA_HOME /opt/openjdk-17
-RUN apt-get update && \
-    apt-get install -y wget && \
-    wget -q https://download.java.net/openjdk/jdk17/ri/openjdk-17+35_linux-x64_bin.tar.gz -O /tmp/openjdk.tar.gz && \
-    mkdir -p $JAVA_HOME && \
-    tar xzf /tmp/openjdk.tar.gz -C $JAVA_HOME --strip-components=1 && \
-    rm /tmp/openjdk.tar.gz && \
-    apt-get purge -y --auto-remove wget && \
-    rm -rf /var/lib/apt/lists/*
-
-ENV PATH $JAVA_HOME/bin:$PATH
-RUN export JAVA_HOME
-RUN java -version
-
-# install ninja
-RUN apt-get update && \
-    apt-get install -y --no-install-recommends ninja-build && \
-    apt-get clean && \
-    rm -rf /var/lib/apt/lists/*
-
-# install colossalai
-ARG VERSION=main
-RUN git clone -b ${VERSION} https://github.com/hpcaitech/ColossalAI.git && \
-    cd ./ColossalAI && \
-    git checkout 3e05c07bb8921f2a8f9736b6f6673d4e9f1697d0 && \
-    CUDA_EXT=1 pip install -v --no-cache-dir .
-
-# install titans
-RUN pip install --no-cache-dir titans
-
-# install transformers
-RUN pip install --no-cache-dir transformers
-
-# install triton
-RUN pip install --no-cache-dir triton==2.0.0.dev20221202
-
-# install torchserve
-ARG VERSION=master
-RUN git clone -b ${VERSION} https://github.com/pytorch/serve.git && \
-    cd ./serve && \
-    python ./ts_scripts/install_dependencies.py --cuda=cu116 && \
-    pip install torchserve torch-model-archiver torch-workflow-archiver
--- a/examples/inference/serving/torch_serve/model-config.yaml
+++ b/examples/inference/serving/torch_serve/model-config.yaml
@@ -1,16 +0,0 @@
-# TS frontend parameters settings
-minWorkers: 1        # minimum number of workers of a model
-maxWorkers: 1        # maximum number of workers of a model
-batchSize: 8         # batch size of a model
-maxBatchDelay: 100   # maximum delay of a batch (ms)
-responseTimeout: 120 # timeout of a specific model's response (in sec)
-deviceType: "gpu"
-# deviceIds: [0, 1]    # setting CUDA_VISIBLE_DEVICES
-
-handler:
-    mode: "text_generation"
-    model_type: "bloom"
-    tp_size: 1
-    max_batch_size: 8
-    max_input_len: 1024
-    max_output_len: 128
--- a/examples/inference/serving/torch_serve/sample_text.txt
+++ b/examples/inference/serving/torch_serve/sample_text.txt
@@ -1 +0,0 @@
-Introduce some landmarks in Beijing