ColossalAI/applications/Colossal-LLaMA-2
linsj20 fcf776ff1b
[Feature] LoRA rebased to main branch (#5622)
* [Inference]ADD Bench Chatglm2 script (#4963)

* add bench chatglm

* fix bug and make utils

---------

Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* [Pipeline inference] Combine kvcache with pipeline inference (#4938)

* merge kvcache with pipeline inference and refactor the code structure

* support ppsize > 2

* refactor pipeline code

* do pre-commit

* modify benchmark

* fix benchmark

* polish code

* add docstring and update readme

* refactor the code

* fix some logic bugs of ppinfer

* polish readme

* fix typo

* skip infer test

* updated c++17 compiler flags (#4983)

* [Inference] Dynamic Batching Inference, online and offline (#4953)

* [inference] Dynamic Batching for Single and Multiple GPUs (#4831)

* finish batch manager

* 1

* first

* fix

* fix dynamic batching

* llama infer

* finish test

* support generating sequences of different lengths

* del prints

* del prints

* fix

* fix bug

---------

Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* [inference] Async dynamic batching  (#4894)

* finish input and output logic

* add generate

* test forward

* 1

* [inference]Re push async dynamic batching (#4901)

* adapt to ray server

* finish async

* finish test

* del test

---------

Co-authored-by: yuehuayingxueluo <867460659@qq.com>

* Revert "[inference]Re push async dynamic batching (#4901)" (#4905)

This reverts commit fbf3c09e67.

* Revert "[inference] Async dynamic batching  (#4894)"

This reverts commit fced140250.

* Revert "[inference] Async dynamic batching  (#4894)" (#4909)

This reverts commit fced140250.

* Add Ray Distributed Environment Init Scripts

* support DynamicBatchManager base function

* revert _set_tokenizer version

* add driver async generate

* add async test

* fix bugs in test_ray_dist.py

* add get_tokenizer.py

* fix code style

* fix bugs about No module named 'pydantic' in ci test

* fix bugs in ci test

* fix bugs in ci test

* fix bugs in ci test

* [infer]Add Ray Distributed Environment Init Scripts (#4911)

* Revert "[inference] Async dynamic batching  (#4894)"

This reverts commit fced140250.

* Add Ray Distributed Environment Init Scripts

* support DynamicBatchManager base function

* revert _set_tokenizer version

* add driver async generate

* add async test

* fix bugs in test_ray_dist.py

* add get_tokenizer.py

* fix code style

* fix bugs about No module named 'pydantic' in ci test

* fix bugs in ci test

* fix bugs in ci test

* fix bugs in ci test

* support dynamic batch for bloom model and is_running function

* [Inference]Test for new Async engine (#4935)

* infer engine

* infer engine

* test engine

* test engine

* new manager

* change step

* add

* test

* fix

* fix

* finish test

* finish test

* finish test

* finish test

* add license

---------

Co-authored-by: yuehuayingxueluo <867460659@qq.com>

* add assertion for config (#4947)

* [Inference] Finish dynamic batching offline test (#4948)

* test

* fix test

* fix quant

* add default

* fix

* fix some bugs

* fix some bugs

* fix

* fix bug

* fix bugs

* reset param

---------

Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* [Kernels]Updated Triton kernels to 2.1.0 and added flash-decoding for llama token attention (#4965)

* adding flash-decoding

* clean

* adding kernel

* adding flash-decoding

* add integration

* add

* adding kernel

* adding kernel

* adding triton 2.1.0 features for inference

* update bloom triton kernel

* remove useless vllm kernels

* clean codes

* fix

* adding files

* fix readme

* update llama flash-decoding

---------

Co-authored-by: cuiqing.li <lixx336@gmail.com>

* fix ColossalEval (#4992)

Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>

* [doc]Update doc for colossal-inference (#4989)

* update doc

* Update README.md

---------

Co-authored-by: cuiqing.li <lixx336@gmail.com>

* [hotfix] Fix the bug where process groups were not being properly released. (#4940)

* Fix the bug where process groups were not being properly released.

* test

* Revert "test"

This reverts commit 479900c139.

* [hotfix] fix the bug of repeatedly storing param group (#4951)

* [doc] add supported feature diagram for hybrid parallel plugin (#4996)

* [Pipeline Inference] Merge pp with tp (#4993)

* refactor pipeline into new CaiInferEngine

* update llama modeling forward

* merge tp with pp

* update docstring

* optimize test workflow and example

* fix typo

* add assert and todo

* [release] update version (#4995)

* [release] update version

* [hotfix] fix ci

* [moe] merge moe into main (#4978)

* update moe module
* support openmoe

* [hotfix] fix grad accumulation plus clipping for gemini (#5002)

* [hotfix] Add layer norm gradients all-reduce for sequence parallel (#4926)

* [hotfix] Add layer norm gradients all-reduce for sequence parallel. (#4915)

* Add layer norm gradients all-reduce for sequence parallel.

* skip pipeline inference test

* [hotfix] fixing policies of sequence parallel (#4922)

* Add layer norm gradients all-reduce for sequence parallel.

* fix parameter passing when calling get_autopolicy

---------

Co-authored-by: littsk <1214689160@qq.com>

* Hotfix/add grad all reduce for sequence parallel (#4927)

* Add layer norm gradients all-reduce for sequence parallel.


* fix parameter passing when calling get_autopolicy

* fix bug using wrong variables

---------

Co-authored-by: littsk <1214689160@qq.com>

* fix policy initialization

* fix bloom and chatglm policies

* polish code of handling layernorm

* fix moe module

* polish code of class initializing

---------

Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>

* [format] applied code formatting on changed files in pull request 4926 (#5007)

Co-authored-by: github-actions <github-actions@github.com>

* [Inference] Fix bug in ChatGLM2 Tensor Parallelism (#5014)

* fix bug

* fix

* fix multiquery

* fix multiquery

---------

Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* [misc] add code owners (#5024)

* [moe] support optimizer checkpoint (#5015)

* Refactor MoE Manager setup method

* unshard optim ckpt

* optim io

* update transformer version

* update requirements

* update ckpt

* update ckpt

* update ckpt

* fix engine

* fix engine

* Support mtbench (#5025)

Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>

* [moe]: fix ep/tp tests, add hierarchical all2all (#4982)

* fix: add warning for EP different behavior

* fix: use shard_data in ep & tp model

* to: add used_capacity

* fix: fix router test

* feat: add create_ep_node_group

* feat: add create_ep_hierarchical_group fn

* feat: add HierarchicalAllToAll

* test: add hierarchical all2all test

* fix: fix test errors

* fix: simplify create_ep_hierarchical_group

* fix: add hierarchical_alltoall arg

* fix: fix environ typo

* revert: revert process mesh order

* to: add todo mark

* fix: skip hierarchical_comm if torch < 1.13.1

* [shardformer] Fix serialization error with Tensor Parallel state saving (#5018)

* Fix serialization error with Tensor Parallel state saving

* Refactor state_dict CPU transfer using tree_map

* [gemini] gemini support tensor parallelism. (#4942)

* [colossalai]fix typo

* [inference] Add smoothquant for llama (#4904)

* [inference] add int8 rotary embedding kernel for smoothquant (#4843)

* [inference] add smoothquant llama attention (#4850)

* add smoothquant llama attention

* remove useless code

* remove useless code

* fix import error

* rename file name

* [inference] add silu linear fusion for smoothquant llama mlp  (#4853)

* add silu linear

* update skip condition

* catch smoothquant cuda lib exception

* process exception for tests

* [inference] add llama mlp for smoothquant (#4854)

* add llama mlp for smoothquant

* fix down out scale

* remove duplicate lines

* add llama mlp check

* delete useless code

* [inference] add smoothquant llama (#4861)

* add smoothquant llama

* fix attention accuracy

* fix accuracy

* add kv cache and save pretrained

* refactor example

* delete smooth

* refactor code

* [inference] add smooth function and delete useless code for smoothquant (#4895)

* add smooth function and delete useless code

* update datasets

* remove duplicate import

* delete useless file

* refactor codes (#4902)

* refactor code

* add license

* add torch-int and smoothquant license

* Update flash_attention_patch.py

To stay compatible with a recent change in the Transformers library, which added a new 'padding_mask' argument to the forward function of the attention layer.
https://github.com/huggingface/transformers/pull/25598
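
A minimal, self-contained sketch of the compatibility idea (function and variable names here are illustrative, not the actual patch): accept the new `padding_mask` keyword, plus any future extras, so a patched attention forward works with both older and newer Transformers versions.

```python
from typing import Optional

import torch


def patched_attention_forward(
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    padding_mask: Optional[torch.Tensor] = None,  # new kwarg passed by recent transformers versions
    **kwargs,  # tolerate any further arguments added later
) -> torch.Tensor:
    # Placeholder body: a real patch would run (flash) attention here.
    del attention_mask, padding_mask, kwargs
    return hidden_states


# Both the old and the new call conventions keep working.
x = torch.randn(1, 4, 8)
out_old = patched_attention_forward(x, attention_mask=None)
out_new = patched_attention_forward(x, attention_mask=None, padding_mask=torch.ones(1, 4))
```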

* [kernel] support pure fp16 for cpu adam and update gemini optim tests (#4921)

* [kernel] support pure fp16 for cpu adam (#4896)

* [kernel] fix cpu adam kernel for pure fp16 and update tests (#4919)

* [kernel] fix cpu adam

* [test] update gemini optim test

* [format] applied code formatting on changed files in pull request 4908 (#4918)

Co-authored-by: github-actions <github-actions@github.com>

* [gemini] support gradient accumulation (#4869)

* add test

* fix no_sync bug in low level zero plugin

* fix test

* add argument for grad accum

* add grad accum in backward hook for gemini

* finish implementation, rewrite tests

* fix test

* skip stuck model in low level zero test

* update doc

* optimize communication & fix gradient checkpoint

* modify doc

* cleaning codes

* update cpu adam fp16 case

* [hotfix] fix torch 2.0 compatibility (#4936)

* [hotfix] fix launch

* [test] fix test gemini optim

* [shardformer] fix vit

* [test] add no master test for low level zero plugin (#4934)

* [format] applied code formatting on changed files in pull request 4820 (#4886)

Co-authored-by: github-actions <github-actions@github.com>

* [nfc] fix some typo with colossalai/ docs/ etc. (#4920)

* [Refactor] Integrated some lightllm kernels into token-attention  (#4946)

* add some req for inference

* clean codes

* add codes

* add some lightllm deps

* clean codes

* hello

* delete rms files

* add some comments

* add comments

* add doc

* add lightllm deps

* add lightllm chatglm2 kernels

* add lightllm chatglm2 kernels

* replace rotary embedding with lightllm kernel

* add some comments

* add some comments

* add some comments

* add

* replace fwd kernel att1

* fix an arg

* add

* add

* fix token attention

* add some comments

* clean codes

* modify comments

* fix readme

* fix bug

* fix bug

---------

Co-authored-by: cuiqing.li <lixx336@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* [test] merge old components to test to model zoo (#4945)

* [test] add custom models in model zoo

* [test] update legacy test

* [test] update model zoo

* [test] update gemini test

* [test] remove components to test

* [inference] add reference and fix some bugs (#4937)

* add reference and fix some bugs

* update gptq init

---------

Co-authored-by: Xu Kai <xukai16@foxmail.com>

* [Inference]ADD Bench Chatglm2 script (#4963)

* add bench chatglm

* fix bug and make utils

---------

Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* [Pipeline inference] Combine kvcache with pipeline inference (#4938)

* merge kvcache with pipeline inference and refactor the code structure

* support ppsize > 2

* refactor pipeline code

* do pre-commit

* modify benchmark

* fix benchmark

* polish code

* add docstring and update readme

* refactor the code

* fix some logic bugs of ppinfer

* polish readme

* fix typo

* skip infer test

* updated c++17 compiler flags (#4983)

* [Inference] Dynamic Batching Inference, online and offline (#4953)

* [inference] Dynamic Batching for Single and Multiple GPUs (#4831)

* finish batch manager

* 1

* first

* fix

* fix dynamic batching

* llama infer

* finish test

* support generating sequences of different lengths

* del prints

* del prints

* fix

* fix bug

---------

Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* [inference] Async dynamic batching  (#4894)

* finish input and output logic

* add generate

* test forward

* 1

* [inference]Re push async dynamic batching (#4901)

* adapt to ray server

* finish async

* finish test

* del test

---------

Co-authored-by: yuehuayingxueluo <867460659@qq.com>

* Revert "[inference]Re push async dynamic batching (#4901)" (#4905)

This reverts commit fbf3c09e67.

* Revert "[inference] Async dynamic batching  (#4894)"

This reverts commit fced140250.

* Revert "[inference] Async dynamic batching  (#4894)" (#4909)

This reverts commit fced140250.

* Add Ray Distributed Environment Init Scripts

* support DynamicBatchManager base function

* revert _set_tokenizer version

* add driver async generate

* add async test

* fix bugs in test_ray_dist.py

* add get_tokenizer.py

* fix code style

* fix bugs about No module named 'pydantic' in ci test

* fix bugs in ci test

* fix bugs in ci test

* fix bugs in ci test

* [infer]Add Ray Distributed Environment Init Scripts (#4911)

* Revert "[inference] Async dynamic batching  (#4894)"

This reverts commit fced140250.

* Add Ray Distributed Environment Init Scripts

* support DynamicBatchManager base function

* revert _set_tokenizer version

* add driver async generate

* add async test

* fix bugs in test_ray_dist.py

* add get_tokenizer.py

* fix code style

* fix bugs about No module named 'pydantic' in ci test

* fix bugs in ci test

* fix bugs in ci test

* fix bugs in ci test

* support dynamic batch for bloom model and is_running function

* [Inference]Test for new Async engine (#4935)

* infer engine

* infer engine

* test engine

* test engine

* new manager

* change step

* add

* test

* fix

* fix

* finish test

* finish test

* finish test

* finish test

* add license

---------

Co-authored-by: yuehuayingxueluo <867460659@qq.com>

* add assertion for config (#4947)

* [Inference] Finish dynamic batching offline test (#4948)

* test

* fix test

* fix quant

* add default

* fix

* fix some bugs

* fix some bugs

* fix

* fix bug

* fix bugs

* reset param

---------

Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* [Kernels]Updated Triton kernels to 2.1.0 and added flash-decoding for llama token attention (#4965)

* adding flash-decoding

* clean

* adding kernel

* adding flash-decoding

* add integration

* add

* adding kernel

* adding kernel

* adding triton 2.1.0 features for inference

* update bloom triton kernel

* remove useless vllm kernels

* clean codes

* fix

* adding files

* fix readme

* update llama flash-decoding

---------

Co-authored-by: cuiqing.li <lixx336@gmail.com>

* fix ColossalEval (#4992)

Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>

* [doc]Update doc for colossal-inference (#4989)

* update doc

* Update README.md

---------

Co-authored-by: cuiqing.li <lixx336@gmail.com>

* [hotfix] Fix the bug where process groups were not being properly released. (#4940)

* Fix the bug where process groups were not being properly released.

* test

* Revert "test"

This reverts commit 479900c139.

* [hotfix] fix the bug of repeatedly storing param group (#4951)

* [doc] add supported feature diagram for hybrid parallel plugin (#4996)

* [Pipeline Inference] Merge pp with tp (#4993)

* refactor pipeline into new CaiInferEngine

* update llama modeling forward

* merge tp with pp

* update docstring

* optimize test workflow and example

* fix typo

* add assert and todo

* [release] update version (#4995)

* [release] update version

* [hotfix] fix ci

* [gemini] gemini support tp

[gemini] gemini support tp

[gemini] gemini support tp

[gemini] gemini support tp

[gemini] gemini support tp

* fix

fix

fix

* update checkpointIO

update checkpointIO

update checkpointIO

update checkpointIO

update checkpointIO

update checkpointIO

update checkpointIO

update checkpointIO

update checkpointIO

* support fused layernorm

support fused layernorm

support fused layernorm

* update fusedlayernorm

update fusedlayernorm

update fusedlayernorm

* add sequence parallel to gemini

add sequence parallel to gemini

* fix

* fix comments

fix comments

fix comments

* fix

* fix t5

* clear cache

* fix

* activate ci

* activate ci

* fix

* fix

* fix

* fix

* revert

* modify tp gather method

modify tp gather method

modify tp gather method

modify tp gather method

* fix test

---------

Co-authored-by: Xu Kai <xukai16@foxmail.com>
Co-authored-by: Zian(Andy) Zheng <62330719+Orion-Zheng@users.noreply.github.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
Co-authored-by: cuiqing.li <lixx336@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
Co-authored-by: Xu Kai <xukai16@foxmail.com>
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Yuanchen <70520919+chengeharrison@users.noreply.github.com>
Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Co-authored-by: littsk <1214689160@qq.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>

* [hotfix] Support extra_kwargs in ShardConfig (#5031)

* [refactor]: replace inference args with extra_kwargs in ShardConfig

* modify shardconfig

* polish code

* fix policy bug in llama

* fix bug in auto policy

* remove setattr in ShardConfig

* fix wrong EOS token in ColossalChat

* [Kernels]Update triton kernels to 2.1.0 (#5046)

* update flash-context-attention

* adding kernels

* fix

* reset

* add build script

* add building process

* add llama2 example

* add colossal-llama2 test

* clean

* fall back test setting

* fix test file

* clean

* clean

* clean

---------

Co-authored-by: cuiqing.li <lixx336@gmail.com>

* [pipeline,shardformer] Fix p2p efficiency in pipeline, allow skipping loading weight not in weight_map when `strict=False`, fix llama flash attention forward, add flop estimation by megatron in llama benchmark (#5017)

* Use p2p

* Cannot do bidirectional p2p send

* Refactor tensor creation and serialization in P2P
communication

* Fix llama forward args in flash attention

* Add flop estimate from megatron

* Support loading weight not in weight_map when strict=False in hybrid_parallel

* Use send_forward_recv_backward, etc in 1f1b

* Use dataclass for metadata
Remove torch.cuda.synchronize() as suggested

* Add comment about the torch.cuda.synchronize for potential error

* Typo

* Update hybrid_parallel_checkpoint_io.py

* Update p2p.py

* Update one_f_one_b.py

* Update p2p.py

---------

Co-authored-by: flybird11111 <1829166702@qq.com>

* [gemini] gemini support extra-dp (#5043)

* support ddp

* fix

* fix

* fix

fix

* support ddp

* fix

* fix

* fix

fix

* simplify tests

* fix

* fix

* fix

fix

fix

* fix

* [shardformer] fix llama error when transformers upgraded. (#5055)

* fix-llama

* Update llama.py

* [hotfix]: modify create_ep_hierarchical_group and add test (#5032)

* feat: modify create_ep_hierarchical_group args

* test: add ep tests

* fix: remove get_process_group_ranks

* fix: fix src_rank

* [example] fix llama example's loss error when using gemini plugin (#5060)

fix llama example

* [inference] Refactor inference architecture (#5057)

* [inference] support only TP (#4998)

* support only tp

* enable tp

* add support for bloom (#5008)

* [refactor] refactor gptq and smoothquant llama (#5012)

* refactor gptq and smoothquant llama

* fix import error

* fix linear import torch-int

* fix smoothquant llama import error

* fix import accelerate error

* fix bug

* fix import smooth cuda

* fix smoothcuda

* [Inference Refactor] Merge chatglm2 with pp and tp (#5023)

merge chatglm with pp and tp

* [Refactor] remove useless inference code (#5022)

* remove useless code

* fix quant model

* fix test import bug

* mv original inference legacy

* fix chatglm2

* [Refactor] refactor policy search and quant type controlling in inference (#5035)

* [Refactor] refactor policy search and quant type controlling in inference

* [inference] update readme (#5051)

* update readme

* update readme

* fix architecture

* fix table

* fix table

* [inference] update example (#5053)

* update example

* fix run.sh

* fix rebase bug

* fix some errors

* update readme

* add some features

* update interface

* update readme

* update benchmark

* add requirements-infer

---------

Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>

* [Kernels]added flash-decoding of triton (#5063)

* added flash-decoding of triton based on lightllm kernel

* add req

* clean

* clean

* delete build.sh

---------

Co-authored-by: cuiqing.li <lixx336@gmail.com>

* [misc] remove outdated submodule (#5070)

* [npu] add npu support for gemini and zero (#5067)

* [npu] setup device utils (#5047)

* [npu] add npu device support

* [npu] support low level zero

* [test] update npu zero plugin test

* [hotfix] fix import

* [test] recover tests

* [npu] gemini support npu (#5052)

* [npu] refactor device utils

* [gemini] support npu

* [example] llama2+gemini support npu

* [kernel] add arm cpu adam kernel (#5065)

* [kernel] add arm cpu adam

* [optim] update adam optimizer

* [kernel] arm cpu adam remove bf16 support

* [hotfix/hybridengine] fix bug when tp*pp size = 1 (#5069)

* [inference] update examples and engine (#5073)

* update examples and engine

* fix choices

* update example

* [format] applied code formatting on changed files in pull request 5067 (#5072)

Co-authored-by: github-actions <github-actions@github.com>

* [hotfix/hybridengine] Fix init model with random parameters in benchmark (#5074)

* fix init model with random parameters

* fix example

* [inference] refactor examples and fix schedule (#5077)

* [setup] refactor infer setup

* [hotfix] fix inference behavior on 1 1 gpu

* [example] refactor inference examples

* fix thrust-transform-reduce error (#5078)

* [nfc] fix typo in docs/ (#4972)

* [nfc] fix typo and author name (#5089)

* [gemini]fix gemini optimzer, saving Shardformer in Gemini got list assignment index out of range (#5085)

* [Hotfix] Fix model policy matching strategy in ShardFormer (#5064)

* hotfix/Fix get model policy strategy in ShardFormer

* fix bug in auto policy

* [shardformer]fix flash attention, when mask is causal, just don't unpad it (#5084)

* fix flash attn

* fix

fix

* [npu] add npu support for hybrid plugin and llama (#5090)

* llama 3d

* update

* fix autocast

* [Feature] Add document retrieval QA (#5020)

* add langchain

* add langchain

* Add files via upload

* add langchain

* fix style

* fix style: remove extra space

* add pytest; modified retriever

* add pytest; modified retriever

* add tests to build_on_pr.yml

* fix build_on_pr.yml

* fix build on pr; fix environ vars

* separate unit tests for colossalqa from build from pr

* fix container setting; fix environ vars

* commented dev code

* add incremental update

* remove stale code

* fix style

* change to sha3 224

* fix retriever; fix style; add unit test for document loader

* fix ci workflow config

* fix ci workflow config

* add set cuda visible device script in ci

* fix doc string

* fix style; update readme; refactored

* add force log info

* change build on pr, ignore colossalqa

* fix docstring, capitalize all initial letters

* fix indexing; fix text-splitter

* remove debug code, update reference

* reset previous commit

* update LICENSE update README add key-value mode, fix bugs

* add files back

* revert force push

* remove junk file

* add test files

* fix retriever bug, add intent classification

* change conversation chain design

* rewrite prompt and conversation chain

* add ui v1

* ui v1

* fix avatar

* add header

* Refactor the RAG Code and support Pangu

* Refactor the ColossalQA chain to Object-Oriented Programming and the UI demo.

* resolved conversation. tested scripts under examples. web demo still buggy

* fix ci tests

* Some modifications to add ChatGPT api

* modify llm.py and remove unnecessary files

* Delete applications/ColossalQA/examples/ui/test_frontend_input.json

* Remove OpenAI api key

* add colossalqa

* move files

* move files

* move files

* move files

* fix style

* Add Readme and fix some bugs.

* Add something to readme and modify some code

* modify a directory name for clarity

* remove redundant directory

* Correct a type in llm.py

* fix AI prefix

* fix test_memory.py

* fix conversation

* fix some errors and typos

* Fix a missing import in RAG_ChatBot.py

* add colossalcloud LLM wrapper, correct issues in code review

---------

Co-authored-by: YeAnbang <anbangy2@outlook.com>
Co-authored-by: Orion-Zheng <zheng_zian@u.nus.edu>
Co-authored-by: Zian(Andy) Zheng <62330719+Orion-Zheng@users.noreply.github.com>
Co-authored-by: Orion-Zheng <zhengzian@u.nus.edu>

* remove duplicate import (#5100)

* fix typo change lazy_iniy to lazy_init (#5099)

* [nfc] fix typo change directoty to directory (#5111)

* [FEATURE] Add Safety Eval Datasets to ColossalEval (#5095)

* add safetybench and cvalues(responsibility) eval dataset

* Modify code according to review suggestions

---------

Co-authored-by: Orion-Zheng <zhengzian@u.nus.edu>

* [hotfix] fixed memory usage of shardformer module replacement (#5122)

* [shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088)

* [shardformer] implement policy for all GPT-J models and test

* [shardformer] support interleaved pipeline parallel for bert finetune

* [shardformer] shardformer support falcon (#4883)

* [shardformer]: fix interleaved pipeline for bert model (#5048)

* [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093)

* Add Mistral support for Shardformer (#5103)

* [shardformer] add tests to mistral (#5105)

---------

Co-authored-by: Pengtai Xu <henryxu880@gmail.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: eric8607242 <e0928021388@gmail.com>

* [doc] add moe news (#5128)

* [doc] add moe news

* [doc] add moe news

* [doc] add moe news

* [doc] updated paper citation (#5131)

* fix typo change JOSNL TO JSONL etc. (#5116)

* [format] applied code formatting on changed files in pull request 5088 (#5127)

Co-authored-by: github-actions <github-actions@github.com>

* [format] applied code formatting on changed files in pull request 5124 (#5125)

Co-authored-by: github-actions <github-actions@github.com>

* [format] applied code formatting on changed files in pull request 5115 (#5118)

Co-authored-by: github-actions <github-actions@github.com>

* [accelerator] init the accelerator module (#5129)

* [accelerator] init the accelerator module

* polish code

* polish code

* polish code

* polish code

* [npu] support triangle attention for llama (#5130)

* update fused attn

* update spda

* tri attn

* update triangle

* import

* fix

* fix

* [plugin]fix 3d checkpoint load when booster boost without optimizer. (#5135)

* fix 3d checkpoint load when booster boost without optimizer

fix 3d checkpoint load when booster boost without optimizer

* test ci

* revert ci

* fix

fix

* [ColossalQA] refactor server and webui & add new feature (#5138)

* refactor server and webui & add new feature

* add requirements

* modify readme and ui

* [doc] fix colossalqa document (#5146)

* fix doc

* modify doc

* fix (#5158)

fix

* [Colossal-Llama-2] Add finetuning Colossal-Llama-2 example (#4878)

* Add finetuning Colossal-Llama-2 example

* Add finetuning Colossal-Llama-2 example 2

* Add finetuning Colossal-Llama-2 example and support NEFTuning

* Add inference example and refine neftune

* Modify readme file

* update the imports

---------

Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com>

* [gemini] hotfix NaN loss while using Gemini + tensor_parallel (#5150)

* fix

aaa

fix

fix

fix

* fix

* fix

* test ci

* fix ci

fix

* [colossalqa] fix pangu api (#5170)

* fix pangu api

* add comment

* [ColossalEval] Support GSM, Data Leakage Evaluation and Tensor Parallel (#5169)

* Support GSM, Data Leakage Evaluation and Tensor Parallel

* remove redundant code and update inference.py in examples/gpt_evaluation

---------

Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>

* [shardformer] llama support DistCrossEntropy (#5176)

* fix

aaa

fix

fix

fix

* fix

* fix

* test ci

* fix ci

fix

* llama support dist-cross

fix

fix

fix

fix

fix

fix

fix

fix

* fix

* fix

* fix

fix

* test ci

* test ci

* fix

* [Colossal-Llama-2] Add finetuning Colossal-Llama-2 example (#4878)

* Add finetuning Colossal-Llama-2 example

* Add finetuning Colossal-Llama-2 example 2

* Add finetuning Colossal-Llama-2 example and support NEFTuning

* Add inference example and refine neftune

* Modify readme file

* update the imports

---------

Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com>

* llama support dist-cross

fix

fix

fix

fix

fix

fix

fix

fix

* fix

* fix

* fix

fix

* test ci

* test ci

* fix

* fix ci

* fix ci

---------

Co-authored-by: Yuanchen <70520919+chengeharrison@users.noreply.github.com>
Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com>

* Fix ColossalEval (#5186)

Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>

* [doc] update pytorch version in documents. (#5177)

* fix

aaa

fix

fix

fix

* fix

* fix

* test ci

* fix ci

fix

* update pytorch version in documents

* polish readme in application/chat (#5194)

* [pipeline]: fix p2p comm, add metadata cache and support llama interleaved pp (#5134)

* test: add more p2p tests

* fix: remove send_forward_recv_forward as p2p op list need to use the same group

* fix: make send and receive atomic

* feat: update P2PComm fn

* feat: add metadata cache in 1f1b

* feat: add metadata cache in interleaved pp

* feat: modify is_xx_stage fn

* revert: add _broadcast_object_list

* feat: add interleaved pp in llama policy

* feat: set NCCL_BUFFSIZE in HybridParallelPlugin

* Improve logic for selecting metrics (#5196)

Co-authored-by: Xu <yuanchen.xu00@gmail.com>

* [doc] Update required third-party library list for testing and torch compatibility checking (#5207)

* doc/update requirements-test.txt

* update torch-cuda compatibility check

* support linear accumulation fusion (#5199)

support linear accumulation fusion

support linear accumulation fusion

fix

* [pipeline]: support arbitrary batch size in forward_only mode (#5201)

* fix: remove drop last in val & test dataloader

* feat: add run_forward_only, support arbitrary bs

* chore: modify ci script

* [pipeline]: add p2p fallback order and fix interleaved pp deadlock (#5214)

* fix: add fallback order option and update 1f1b

* fix: fix deadlock comm in interleaved pp

* test: modify p2p test

* [devops] update torch version in ci (#5217)

* fix-test (#5210)

fix-test

fix-test

* fix flash attn (#5209)

* [nfc] fix typo colossalai/shardformer/ (#5133)

* [Colossal-LLaMA-2] Release Colossal-LLaMA-2-13b-base model (#5224)

* update readme

* update readme

* update link

* update

* update readme

* update

* update

* update

* update title

* update example

* update example

* fix content

* add conclusion

* add license

* update

* update

* update version

* fix minor

* [doc] Update README.md of Colossal-LLAMA2 (#5233)

* Update README.md

* Update README.md

* [doc] Make leaderboard format more uniform and good-looking (#5231)

* Make leaderboard format more unified and good-looking

* Update README.md

* Update README.md

* [doc] add Colossal-LLaMA-2-13B (#5234)

* [doc] add Colossal-LLaMA-2-13B

* [doc] add Colossal-LLaMA-2-13B

* [doc] add Colossal-LLaMA-2-13B

* [format] applied code formatting on changed files in pull request 5234 (#5235)

Co-authored-by: github-actions <github-actions@github.com>

* [doc] SwiftInfer release (#5236)

* [doc] SwiftInfer release

* [doc] SwiftInfer release

* [doc] SwiftInfer release

* [doc] SwiftInfer release

* [doc] SwiftInfer release

* [npu] use extension for op builder (#5172)

* update extension

* update cpu adam

* update is

* add doc for cpu adam

* update kernel

* update commit

* update flash

* update memory efficient

* update flash attn

* update flash attention loader

* update api

* fix

* update doc

* update example time limit

* reverse change

* fix doc

* remove useless kernel

* fix

* not use warning

* update

* update

* [pipeline] A more general _communicate in p2p (#5062)

* A more general _communicate

* feat: finish tree_flatten version p2p

* fix: update p2p api calls

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [npu] change device to accelerator api (#5239)

* update accelerator

* fix timer

* fix amp

* update

* fix

* update bug

* add error raise

* fix autocast

* fix set device

* remove doc accelerator

* update doc

* update doc

* update doc

* use nullcontext

* update cpu

* update null context

* change time limit for example

* update

* update

* update

* update

* [npu] polish accelerator code

---------

Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com>
Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>

* [hotfix] removed unused flag (#5242)

* [doc] fix typo in Colossal-LLaMA-2/README.md (#5247)

* [workflow] fixed build CI (#5240)

* [workflow] fixed build CI

* polish

* polish

* polish

* polish

* polish

* [ci] fixed booster test (#5251)

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed ddp test (#5254)

* [ci] fixed ddp test

* polish

* fix typo in applications/ColossalEval/README.md (#5250)

* [ci] fix shardformer tests. (#5255)

* fix ci

fix

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [doc] fix doc typo (#5256)

* [doc] fix annotation display

* [doc] fix llama2 doc

* [hotfix]: add pp sanity check and fix mbs arg (#5268)

* fix: fix misleading mbs arg

* feat: add pp sanity check

* fix: fix 1f1b sanity check

* [workflow] fixed incomplete bash command (#5272)

* [workflow] fixed oom tests (#5275)

* [workflow] fixed oom tests

* polish

* polish

* polish

* [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276)

* fix ci

fix

* fix test

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

* fix

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [shardformer] hybridparallelplugin support gradient accumulation. (#5246)

* support gradients acc

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

* fix

fix

* fix

fix

fix

* [hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230)

* fix auto loading gpt2 tokenizer (#5279)

* [doc] add llama2-13B display (#5285)

* Update README.md

* fix 13b typo

---------

Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* fix llama pretrain (#5287)

* [hotfix] fix 3d plugin test (#5292)

* fix bug for mefture (#5299)

* [NFC] polish applications/Colossal-LLaMA-2/colossal_llama2/tokenizer/init_tokenizer.py code style (#5228)

* fix some typo (#5307)

* [feat] refactored extension module (#5298)

* [feat] refactored extension module

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* [workflow] updated CI image (#5318)

* [accelerator] fixed npu api

* [tests] fix t5 test. (#5322)

* [ci] fix shardformer tests. (#5255)

* fix ci

fix

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>

* fix t5 test

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [doc] added docs for extensions (#5324)

* [doc] added docs for extensions

* polish

* polish

* fix typo under extensions/ (#5330)

* fix typo change dosen't to doesn't (#5308)

* [extension] fixed exception catch (#5342)

* [Chat] fix sft loss nan (#5345)

* fix script

* fix script

* fix chat nan

* fix chat nan

* [checkpointio] fix gemini and hybrid parallel optim checkpoint (#5347)

* [checkpointio] fix hybrid parallel optim checkpoint

* [extension] fix cuda extension

* [checkpointio] fix gemini optimizer checkpoint

* polish code

* [fix] remove unnecessary dp_size assert  (#5351)

* fix: remove unnecessary assert

* test: add more 3d plugin tests

* fix: add warning

* [gemini] fix param op hook when output is tuple (#5355)

* [gemini] fix param op hook when output is tuple

* [gemini] fix param op hook

* [llama] fix dataloader for hybrid parallel (#5358)

* [plugin] refactor prepare dataloader

* [plugin] update train script

* [llama] update training script (#5360)

* [llama] update training script

* [doc] polish docstr

* [llama] add flash attn patch for npu (#5362)

* [llama] fix neftune & pbar with start_step (#5364)

* [eval] update llama npu eval (#5366)

* [llama] polish training script and fix optim ckpt (#5368)

* [lr-scheduler] fix load state dict and add test (#5369)

* [llama] fix memory issue (#5371)

* [llama] fix memory issue

* [llama] add comment

* [moe] init mixtral impl

* [moe] update capacity computing (#5253)

* [moe] top2 allow uneven input

* [moe] update capacity computing

* [moe] remove debug info

* [moe] update capacity computing

* [moe] update capacity computing

* [moe] support mixtral (#5309)

* [moe] add mixtral block for single expert

* [moe] mixtral block fwd support uneven ep

* [moe] mixtral block bwd support uneven ep

* [moe] add mixtral moe layer

* [moe] simplify replace

* [meo] support save sharded mixtral

* [meo] support load sharded mixtral

* [meo] support save sharded optim

* [meo] integrate moe manager into plug

* [meo] fix optimizer load

* [meo] fix mixtral layer

* [moe] fix mixtral checkpoint io (#5314)

* [moe] fix mixtral forward default value (#5329)

* [moe] fix mixtral optim checkpoint (#5344)

* [moe] fix tests

* [release] update version (#5380)

* [llama] fix training and inference scripts (#5384)

* [llama] refactor inference example to fit sft

* [llama] fix training script to fit gemini

* [llama] fix inference script

* [doc] Fix typo (#5361)

* [doc] updated installation command (#5389)

* [hotfix] fix variable type for top_p (#5313)

Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* [hotfix] Fix wrong import in meta_registry (#5392)

* [extension] hotfix jit extension setup (#5402)

* [example] reuse flash attn patch (#5400)

* [fsdp] impl save/load shard model/optimizer (#5357)

* [setup] fixed nightly release (#5388)

* [shardformer]gather llama logits (#5398)

* gather llama logits

* fix

* update requirements (#5407)

* [workflow] added pypi channel (#5412)

* [doc] fix blog link

* [doc] fix blog link

* fix sft single turn inference example (#5416)

* [example]add gpt2 benchmark example script. (#5295)

* benchmark gpt2

* fix

fix

fix

fix

* [doc] fix typo in Colossal-LLaMA-2/README.md (#5247)

* [workflow] fixed build CI (#5240)

* [workflow] fixed build CI

* polish

* polish

* polish

* polish

* polish

* [ci] fixed booster test (#5251)

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed ddp test (#5254)

* [ci] fixed ddp test

* polish

* fix typo in applications/ColossalEval/README.md (#5250)

* [ci] fix shardformer tests. (#5255)

* fix ci

fix

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [doc] fix doc typo (#5256)

* [doc] fix annotation display

* [doc] fix llama2 doc

* [hotfix]: add pp sanity check and fix mbs arg (#5268)

* fix: fix misleading mbs arg

* feat: add pp sanity check

* fix: fix 1f1b sanity check

* [workflow] fixed incomplete bash command (#5272)

* [workflow] fixed oom tests (#5275)

* [workflow] fixed oom tests

* polish

* polish

* polish

* [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276)

* fix ci

fix

* fix test

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

* fix

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [shardformer] hybridparallelplugin support gradient accumulation. (#5246)

* support gradients acc

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

* fix

fix

* fix

fix

fix

* [hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230)

* fix auto loading gpt2 tokenizer (#5279)

* [doc] add llama2-13B display (#5285)

* Update README.md

* fix 13b typo

---------

Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* fix llama pretrain (#5287)

* fix

* fix

* fix

fix

* fix

fix

fix

* fix

fix

* benchmark gpt2

* fix

fix

fix

fix

* [workflow] fixed build CI (#5240)

* [workflow] fixed build CI

* polish

* polish

* polish

* polish

* polish

* [ci] fixed booster test (#5251)

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed booster test

* fix

fix

* fix

fix

fix

* fix

* fix

fix

fix

fix

fix

* fix

* Update shardformer.py

---------

Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: Wenhao Chen <cwher@outlook.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: Michelle <97082656+MichelleMa8@users.noreply.github.com>
Co-authored-by: Desperado-Jia <502205863@qq.com>

* [doc] sora release (#5425)

* [doc] sora release

* [doc] sora release

* [doc] sora release

* [doc] sora release

* [devops] fix extension building (#5427)

* [hotfix] fix sd vit import error (#5420)

* fix import error

* Update dpt_depth.py

---------

Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* [hotfix] fix typo of openmoe model source (#5403)

* [doc] update some translations with README-zh-Hans.md (#5382)

* [hotfix] fix typo change _descrption to _description (#5331)

* [hotfix] fix typo change enabel to enable under colossalai/shardformer/ (#5317)

* [eval-hotfix] set few_shot_data to None when few shot is disabled (#5422)

* [hotfix] fix typo change MoECheckpintIO to MoECheckpointIO (#5335)

Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* [doc] Fix typo s/infered/inferred/ (#5288)

Signed-off-by: hugo-syn <hugo.vincent@synacktiv.com>

* [hotfix] fix stable diffusion inference bug. (#5289)

* Update train_ddp.yaml

delete  "strategy" to fix DDP config loading bug in "main.py"

* Update train_ddp.yaml

fix inference with scripts/txt2img.py config file load bug.

* Update README.md

add pretrain model test code.

* [colossal-llama2] add stream chat example for chat version model (#5428)

* add stream chat for chat version

* remove os.system clear

* modify function name

* [release] update version (#5411)

* fix tensor data update for gemini loss calculation (#5442)

* [hotfix] fix typo s/keywrods/keywords etc. (#5429)

* [devops] fix compatibility (#5444)

* [devops] fix compatibility

* [hotfix] update compatibility test on pr

* [devops] fix compatibility

* [devops] record duration during comp test

* [test] decrease test duration

* fix falcon

* [shardformer] fix gathering output when using tensor parallelism (#5431)

* fix

* padding vocab_size when using pipeline parallelism (see the sketch below)

padding vocab_size when using pipeline parallelism

fix

fix

* fix

* fix

fix

fix

* fix gather output

* fix

* fix

* fix

fix resize embedding

fix resize embedding

* fix resize embedding

fix

* revert

* revert

* revert
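
A tiny sketch of the "padding vocab_size" idea noted above (the helper name is hypothetical, not the actual ColossalAI API): round the vocabulary size up to a multiple of the tensor-parallel world size so the embedding and LM head can be split evenly across ranks.

```python
def pad_vocab_size(vocab_size: int, divisor: int) -> int:
    # Round vocab_size up to the next multiple of `divisor` (e.g. the TP world size).
    remainder = vocab_size % divisor
    return vocab_size if remainder == 0 else vocab_size + (divisor - remainder)


assert pad_vocab_size(32000, 8) == 32000   # already divisible: unchanged
assert pad_vocab_size(32003, 8) == 32008   # padded up so each rank holds 4001 rows
```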

* [doc] release Open-Sora 1.0 with model weights (#5468)

* [doc] release Open-Sora 1.0 with model weights

* [doc] release Open-Sora 1.0 with model weights

* [doc] release Open-Sora 1.0 with model weights

* [doc] update open-sora demo (#5479)

* [doc] update open-sora demo

* [doc] update open-sora demo

* [doc] update open-sora demo

* [example] add grok-1 inference (#5485)

* [misc] add submodule

* remove submodule

* [example] support grok-1 tp inference

* [example] add grok-1 inference script

* [example] refactor code

* [example] add grok-1 readme

* [example] add test ci

* [example] update readme

* [release] grok-1 314b inference (#5490)

* [release] grok-1 inference

* [release] grok-1 inference

* [release] grok-1 inference

* [example] update Grok-1 inference (#5495)

* revise grok-1 example

* remove unused arg in scripts

* prevent re-installing torch

* update readme

* revert modifying colossalai requirements

* add perf

* trivial

* add tokenizer url

* [hotfix] set return_outputs=False in examples and polish code (#5404)

* fix: simplify merge_batch

* fix: use return_outputs=False to eliminate extra memory consumption

* feat: add return_outputs warning

* style: remove `return_outputs=False` as it is the default value

* [release] grok-1 inference benchmark (#5500)

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [shardformer]Fix lm parallel. (#5480)

* fix

* padding vocab_size when using pipeline parallelism

padding vocab_size when using pipeline parallelism

fix

fix

* fix

* fix

fix

fix

* fix gather output

* fix

* fix

* fix

fix resize embedding

fix resize embedding

* fix resize embedding

fix

* revert

* revert

* revert

* fix lm forward distribution

* fix

* test ci

* fix

* [fix] fix grok-1 example typo (#5506)

* [devops] fix example test ci (#5504)

* Fix ColoTensorSpec for py11 (#5440)

* fixed layout converter caching and updated tester

* Empty-Commit

* [shardformer] update colo attention to support custom mask (#5510)

* [feature] refactor colo attention (#5462)

* [extension] update api

* [feature] add colo attention

* [feature] update sdpa

* [feature] update npu attention

* [feature] update flash-attn

* [test] add flash attn test

* [test] update flash attn test

* [shardformer] update modeling to fit colo attention (#5465)

* [misc] refactor folder structure

* [shardformer] update llama flash-attn

* [shardformer] fix llama policy

* [devops] update tensornvme install

* [test] update llama test

* [shardformer] update colo attn kernel dispatch

* [shardformer] update blip2

* [shardformer] update chatglm

* [shardformer] update gpt2

* [shardformer] update gptj

* [shardformer] update opt

* [shardformer] update vit

* [shardformer] update colo attention mask prep

* [shardformer] update whisper

* [test] fix shardformer tests (#5514)

* [test] fix shardformer tests

* [test] fix shardformer tests

* [format] applied code formatting on changed files in pull request 5510 (#5517)

Co-authored-by: github-actions <github-actions@github.com>

* [shardformer] fix pipeline forward error if custom layer distribution is used (#5189)

* Use self.[distribute_layers|get_stage_index] to exploit custom layer distribution

* Change static methods for t5 layer distribution to member functions

* Change static methods for whisper layer distribution to member functions

* Replace whisper policy usage with self one

* Fix test case to use non-static layer distribution methods

* fix: fix typo

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [Fix] Grok-1 use tokenizer from the same pretrained path (#5532)

* [fix] use tokenizer from the same pretrained path

* trust remote code

* [ColossalChat] Update RLHF V2 (#5286)

* Add dpo. Fix sft, ppo, lora. Refactor all

* fix and tested ppo

* 2nd round refactor

* add ci tests

* fix ci

* fix ci

* fix readme, style

* fix readme style

* fix style, fix benchmark

* reproduce benchmark result, remove useless files

* rename to ColossalChat

* use new image

* fix ci workflow

* fix ci

* use local model/tokenizer for ci tests

* fix ci

* fix ci

* fix ci

* fix ci timeout

* fix rm progress bar. fix ci timeout

* fix ci

* fix ci typo

* remove 3d plugin from ci temporary

* test environment

* cannot save optimizer

* support chat template

* fix readme

* fix path

* test ci locally

* restore build_or_pr

* fix ci data path

* fix benchmark

* fix ci, move ci tests to 3080, disable fast tokenizer

* move ci to 85

* support flash attention 2

* add all-in-one data preparation script. Fix colossal-llama2-chat chat template

* add hardware requirements

* move ci test data

* fix save_model, add unwrap

* fix missing bos

* fix missing bos; support grad accumulation with gemini

* fix ci

* fix ci

* fix ci

* fix llama2 chat template config

* debug sft

* debug sft

* fix colossalai version requirement

* fix ci

* add sanity check to prevent NaN loss

* fix requirements

* add dummy data generation script

* add dummy data generation script

* add dummy data generation script

* add dummy data generation script

* update readme

* update readme

* update readme and ignore

* fix logger bug

* support parallel_output

* modify data preparation logic

* fix tokenization

* update lr

* fix inference

* run pre-commit

---------

Co-authored-by: Tong Li <tong.li352711588@gmail.com>

* [shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogeneous shard policy for llama (#5508)

* feat: add `GradientCheckpointConfig` and `PipelineGradientCheckpointConfig`

* feat: apply `GradientCheckpointConfig` to policy and llama_forward

* feat: move `distribute_layer` and `get_stage_index` to PipelineStageManager

* fix: add optional args for `distribute_layer` and `get_stage_index`

* fix: fix changed API calls

* test: update llama tests

* style: polish `GradientCheckpointConfig`

* fix: fix pipeline utils tests

* fix incorrect sharding without zero (#5545)

Co-authored-by: Edenzzzz <wtan45@wisc.edu>

* [shardformer] Sequence Parallelism Optimization (#5533)

* sequence parallel optimization

* validate sequence parallel in llama (code to be polished)

* shardformer api writing

* integrate sequence parallel in ShardFormer

* fix pp bugs and sp bugs for LlaMa model

* integrating ring-based sequence parallelism into ShardFormer

* [sequence parallelism]: Add fused megatron function

* integrating ring-based sequence parallelism into ShardFormer

---------

Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>

* fix bugs when using sp and flashattention together

* fix operation function name

* support flash attention for ulysses-style sp

* clarify sp process group

* fix compatibility bugs in moe plugin

* fix fused linear bugs

* fix linear layer test

* support gpt model all-to-all sp

* modify shard data dimension (meant to be dim=-1)

* support megatron-style sp and distributed attn for llama model

* [shardformer] add megatron sp to llama

* support llama7B 128k with distributed attention

* [shardformer] robustness enhancement

* add block attn

* sp mode 1: keep input as a complete sequence

* fix sp compatibility

* finish sp mode 3 support for gpt

* using all_to_all_single when batch size is 1

* support mode 2 sp in gpt2 (#5)

* [shardformer] add megatron sp to llama

* support llama7B 128k with distributed attention

* [shardformer] robustness enhancement

* add block attn

* sp mode 1: keep input as a complete sequence

* fix sp compatibility

* refactor ring implementation

* support mode 2 sp in gpt2

* polish code

* enable distributed attn mask when using sp mode 2 and 3 in llama

* automatically enable flash attn when using sp mode 2 and 3 in llama

* inplace attn mask

* add zero2 support for sequence parallel

* polish code

* fix bugs

* fix gemini checkpoint io

* loosen tensor checking atol and rtol

* add comment

* fix llama layernorm grad

* fix zero grad

* fix zero grad

* fix conflict

* update split and gather auto grad func

* sequence parallel: inside text split (#6)

* polish code (part 1)

* polish code (part 2)

* polish code (part 2.5)

* polish code (part 3)

* sequence parallel: inside text split

* miscellaneous minor fixes

* polish code

* fix ulysses style ZeRO

* sequence parallel: inside text split

* miscellaneous minor fixes

* disaggregate sp group and dp group for sp

* fix llama and gpt sp

* polish code

* move ulysses grad sync to ddp (#9)

* remove zero_stage and unbind the grad sync for alltoall sp

* add 2d group creation test

* move ulysses grad sync to ddp

* add 2d group creation test

* remove useless code

* change shard config not to enable sp when enable_all_optimizations

* add sp warnings for several model

* remove useless code

---------

Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>

* [hotfix] quick fixes to make legacy tutorials runnable (#5559)

Co-authored-by: Edenzzzz <wtan45@wisc.edu>

* [fix] fix typo s/muiti-node /multi-node etc. (#5448)

* [hotfix] fix typo s/get_defualt_parser /get_default_parser (#5548)

* [devops] remove post commit ci (#5566)

* [devops] remove post commit ci

* [misc] run pre-commit on all files

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [doc] fix ColossalMoE readme (#5599)

* fix readme

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [zero] support multiple (partial) backward passes (#5596)

* [zero] support multiple (partial) backward passes

* [misc] update requirements

* [shardformer] refactor embedding resize (#5603)

* [branch rebase] rebase main to Feature/resize_embedding (#5554)

* fix

* [release] update version (#5411)

* [hotfix] fix typo s/keywrods/keywords etc. (#5429)

* [devops] fix compatibility (#5444)

* [devops] fix compatibility

* [hotfix] update compatibility test on pr

* [devops] fix compatibility

* [devops] record duration during comp test

* [test] decrease test duration

* fix falcon

* [shardformer] fix gathering output when using tensor parallelism (#5431)

* fix

* padding vocab_size when using pipeline parallelism

padding vocab_size when using pipeline parallelism

fix

fix

* fix

* fix

fix

fix

* fix gather output

* fix

* fix

* fix

fix resize embedding

fix resize embedding

* fix resize embedding

fix

* revert

* revert

* revert

* [doc] release Open-Sora 1.0 with model weights (#5468)

* [doc] release Open-Sora 1.0 with model weights

* [doc] release Open-Sora 1.0 with model weights

* [doc] release Open-Sora 1.0 with model weights

* [doc] update open-sora demo (#5479)

* [doc] update open-sora demo

* [doc] update open-sora demo

* [doc] update open-sora demo

* [example] add grok-1 inference (#5485)

* [misc] add submodule

* remove submodule

* [example] support grok-1 tp inference

* [example] add grok-1 inference script

* [example] refactor code

* [example] add grok-1 readme

* [example] add test ci

* [example] update readme

---------

Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* [CI] run pre-commit (#5577)

* fix

* [release] update version (#5411)

* [hotfix] fix typo s/keywrods/keywords etc. (#5429)

* [devops] fix compatibility (#5444)

* [devops] fix compatibility

* [hotfix] update compatibility test on pr

* [devops] fix compatibility

* [devops] record duration during comp test

* [test] decrease test duration

* fix falcon

* [shardformer] fix gathering output when using tensor parallelism (#5431)

* fix

* padding vocab_size when using pipeline parallelism

padding vocab_size when using pipeline parallelism

fix

fix

* fix

* fix

fix

fix

* fix gather output

* fix

* fix

* fix

fix resize embedding

fix resize embedding

* fix resize embedding

fix

* revert

* revert

* revert

* [doc] release Open-Sora 1.0 with model weights (#5468)

* [doc] release Open-Sora 1.0 with model weights

* [doc] release Open-Sora 1.0 with model weights

* [doc] release Open-Sora 1.0 with model weights

* [doc] update open-sora demo (#5479)

* [doc] update open-sora demo

* [doc] update open-sora demo

* [doc] update open-sora demo

* [example] add grok-1 inference (#5485)

* [misc] add submodule

* remove submodule

* [example] support grok-1 tp inference

* [example] add grok-1 inference script

* [example] refactor code

* [example] add grok-1 readme

* [exmaple] add test ci

* [exmaple] update readme

* run pre-commit

---------

Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* [rebase] rebase main to resize-embedding (#5581)

* [release] grok-1 314b inference (#5490)

* [release] grok-1 inference

* [release] grok-1 inference

* [release] grok-1 inference

* [example] update Grok-1 inference (#5495)

* revise grok-1 example

* remove unused arg in scripts

* prevent re-installing torch

* update readme

* revert modifying colossalai requirements

* add perf

* trivial

* add tokenizer url

* [hotfix] set return_outputs=False in examples and polish code (#5404)

* fix: simplify merge_batch

* fix: use return_outputs=False to eliminate extra memory consumption

* feat: add return_outputs warning

* style: remove `return_outputs=False` as it is the default value

* [release] grok-1 inference benchmark (#5500)

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [release] grok-1 inference benchmark

* [shardformer]Fix lm parallel. (#5480)

* fix

* padding vocab_size when using pipeline parallellism

padding vocab_size when using pipeline parallellism

fix

fix

* fix

* fix

fix

fix

* fix gather output

* fix

* fix

* fix

fix resize embedding

fix resize embedding

* fix resize embedding

fix

* revert

* revert

* revert

* fix lm forward distribution

* fix

* test ci

* fix

* [fix] fix grok-1 example typo (#5506)

* [devops] fix example test ci (#5504)

* Fix ColoTensorSpec for py11 (#5440)

* fixed layout converter caching and updated tester

* Empty-Commit

* [shardformer] update colo attention to support custom mask (#5510)

* [feature] refactor colo attention (#5462)

* [extension] update api

* [feature] add colo attention

* [feature] update sdpa

* [feature] update npu attention

* [feature] update flash-attn

* [test] add flash attn test

* [test] update flash attn test

* [shardformer] update modeling to fit colo attention (#5465)

* [misc] refactor folder structure

* [shardformer] update llama flash-attn

* [shardformer] fix llama policy

* [devops] update tensornvme install

* [test] update llama test

* [shardformer] update colo attn kernel dispatch

* [shardformer] update blip2

* [shardformer] update chatglm

* [shardformer] update gpt2

* [shardformer] update gptj

* [shardformer] update opt

* [shardformer] update vit

* [shardformer] update colo attention mask prep

* [shardformer] update whisper

* [test] fix shardformer tests (#5514)

* [test] fix shardformer tests

* [test] fix shardformer tests

* [format] applied code formatting on changed files in pull request 5510 (#5517)

Co-authored-by: github-actions <github-actions@github.com>

* [shardformer] fix pipeline forward error if custom layer distribution is used (#5189)

* Use self.[distribute_layers|get_stage_index] to exploit custom layer distribution

* Change static methods for t5 layer distribution to member functions

* Change static methods for whisper layer distribution to member functions

* Replace whisper policy usage with self one

* Fix test case to use non-static layer distribution methods

* fix: fix typo

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [Fix] Grok-1 use tokenizer from the same pretrained path (#5532)

* [fix] use tokenizer from the same pretrained path

* trust remote code

* [ColossalChat] Update RLHF V2 (#5286)

* Add dpo. Fix sft, ppo, lora. Refactor all

* fix and tested ppo

* 2 nd round refactor

* add ci tests

* fix ci

* fix ci

* fix readme, style

* fix readme style

* fix style, fix benchmark

* reproduce benchmark result, remove useless files

* rename to ColossalChat

* use new image

* fix ci workflow

* fix ci

* use local model/tokenizer for ci tests

* fix ci

* fix ci

* fix ci

* fix ci timeout

* fix rm progress bar. fix ci timeout

* fix ci

* fix ci typo

* remove 3d plugin from ci temporary

* test environment

* cannot save optimizer

* support chat template

* fix readme

* fix path

* test ci locally

* restore build_or_pr

* fix ci data path

* fix benchmark

* fix ci, move ci tests to 3080, disable fast tokenizer

* move ci to 85

* support flash attention 2

* add all-in-one data preparation script. Fix colossal-llama2-chat chat template

* add hardware requirements

* move ci test data

* fix save_model, add unwrap

* fix missing bos

* fix missing bos; support grad accumulation with gemini

* fix ci

* fix ci

* fix ci

* fix llama2 chat template config

* debug sft

* debug sft

* fix colossalai version requirement

* fix ci

* add sanity check to prevent NaN loss

* fix requirements

* add dummy data generation script

* add dummy data generation script

* add dummy data generation script

* add dummy data generation script

* update readme

* update readme

* update readme and ignore

* fix logger bug

* support parallel_output

* modify data preparation logic

* fix tokenization

* update lr

* fix inference

* run pre-commit

---------

Co-authored-by: Tong Li <tong.li352711588@gmail.com>

* [shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama (#5508)

* feat: add `GradientCheckpointConfig` and `PipelineGradientCheckpointConfig`

* feat: apply `GradientCheckpointConfig` to policy and llama_forward

* feat: move `distribute_layer` and `get_stage_index` to PipelineStageManager

* fix: add optional args for `distribute_layer` and `get_stage_index`

* fix: fix changed API calls

* test: update llama tests

* style: polish `GradientCheckpointConfig`

* fix: fix pipeline utils tests

* fix incorrect sharding without zero (#5545)

Co-authored-by: Edenzzzz <wtan45@wisc.edu>

* [shardformer] Sequence Parallelism Optimization (#5533)

* sequence parallel optimization

* validate sequence parallel in llama (code to be polished)

* shardformer api writing

* integrate sequence parallel in ShardFormer

* fix pp bugs and sp bugs for LlaMa model

* integrating ring-based sequence parallelism into ShardFormer

* [sequence parallelism]: Add fused megatron function

* integrating ring-based sequence parallelism into ShardFormer

---------

Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>

* fix bugs when useing sp and flashattention together

* fix operation function name

* support flash attention for ulysses-style sp

* clarify sp process group

* fix compatibility bugs in moe plugin

* fix fused linear bugs

* fix linear layer test

* support gpt model all-to-all sp

* modify shard data dimension (meant to be dim=-1)

* support megtron-style sp and distributed attn for llama model

* [shardformer] add megatron sp to llama

* support llama7B 128k with distributed attention

* [shardformer] robustness enhancement

* add block attn

* sp mode 1: keep input as a complete sequence

* fix sp compatability

* finish sp mode 3 support for gpt

* using all_to_all_single when batch size is 1

* support mode 2 sp in gpt2 (#5)

* [shardformer] add megatron sp to llama

* support llama7B 128k with distributed attention

* [shardformer] robustness enhancement

* add block attn

* sp mode 1: keep input as a complete sequence

* fix sp compatability

* refactor ring implementation

* support mode 2 sp in gpt2

* polish code

* enable distributed attn mask when using sp mode 2 and 3 in llama

* automatically enable flash attn when using sp mode 2 and 3 in llama

* inplace attn mask

* add zero2 support for sequence parallel

* polish code

* fix bugs

* fix gemini checkpoint io

* loose tensor checking atol and rtol

* add comment

* fix llama layernorm grad

* fix zero grad

* fix zero grad

* fix conflict

* update split and gather auto grad func

* sequence parallel: inside text split (#6)

* polish code (part 1)

* polish code (part 2)

* polish code (part 2.5)

* polish code (part 3)

* sequence parallel: inside text split

* miscellaneous minor fixes

* polish code

* fix ulysses style ZeRO

* sequence parallel: inside text split

* miscellaneous minor fixes

* disaggregate sp group and dp group for  sp

* fix llama and gpt sp

* polish code

* move ulysses grad sync to ddp (#9)

* remove zero_stage and unbind the grad sync for alltoall sp

* add 2d group creation test

* move ulysses grad sync to ddp

* add 2d group creation test

* remove useless code

* change shard config not to enable sp when enable_all_optimizations

* add sp warnings for several model

* remove useless code

---------

Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>

* [hotfix] quick fixes to make legacy tutorials runnable (#5559)

Co-authored-by: Edenzzzz <wtan45@wisc.edu>

* [fix] fix typo s/muiti-node /multi-node etc. (#5448)

* [hotfix] fix typo s/get_defualt_parser /get_default_parser (#5548)

* [devops] remove post commit ci (#5566)

* [devops] remove post commit ci

* [misc] run pre-commit on all files

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

---------

Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: Wenhao Chen <cwher@outlook.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
Co-authored-by: Rocky Duan <dementrock@users.noreply.github.com>
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: Edenzzzz <wenxuan.tan@wisc.edu>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Insu Jang <insujang@umich.edu>
Co-authored-by: YeAnbang <44796419+YeAnbang@users.noreply.github.com>
Co-authored-by: Tong Li <tong.li352711588@gmail.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [shardformer]enable padding vocabulary size. (#5489)

* padding vocab_size when using pipeline parallellism

padding vocab_size when using pipeline parallellism

fix

fix

* fix

* fix

fix

fix

* fix gather output

* fix

* fix

* fix

fix resize embedding

fix resize embedding

* fix resize embedding

fix

* revert

* revert

* revert

* padding vocab

* padding vocabe

* fix

* fix

* fxi

* test ci

* fix

fix

fix

fix

* fix

fix

* fix

* fix

* Update hybrid_parallel_plugin.py

fix

fix

fix

* fix

fix

* fix

fix

* fix

* resolve super init

resolve super init

resolve super init

resolve super init

* resolve comments

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* vocab checkpointio

* padding vocab_size when using pipeline parallellism

padding vocab_size when using pipeline parallellism

fix

fix

* fix

fix

fix

* fix

* fix

fix resize embedding

fix resize embedding

* fix resize embedding

fix

* revert

* revert

* padding vocab

* fix

* fix

fix

* fix

fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix ci

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* cherry-pick

* revert moe modify

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

fix

fix

fix

fix

fix

fix

fix

* resolve comments

resolve comments

resolve comments

resolve comments

resolve comments

* ptensor

ptensor

resolve comments

fix

fix

fix

fix

fix

resolve comments

resolve comments

resolve comments

resolve comments

resolve comments

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix rebase

* fix rebase

---------

Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: Wenhao Chen <cwher@outlook.com>
Co-authored-by: Rocky Duan <dementrock@users.noreply.github.com>
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: Edenzzzz <wenxuan.tan@wisc.edu>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Insu Jang <insujang@umich.edu>
Co-authored-by: YeAnbang <44796419+YeAnbang@users.noreply.github.com>
Co-authored-by: Tong Li <tong.li352711588@gmail.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [hotfix] Fix examples no pad token & auto parallel codegen bug; (#5606)

* fix no pad token bug

* fixed some auto parallel codegen bug, but might not run on torch 2.1

---------

Co-authored-by: Edenzzzz <wtan45@wisc.edu>

* [shardformer] fix pipeline grad ckpt (#5620)

* [shardformer] fix pipeline grad ckpt

* [lora] add lora APIs for booster, support lora for TorchDDP (#4981)

* add apis and peft requirement

* add liscense and implement apis

* add checkpointio apis

* add torchddp fwd_bwd test

* add support_lora methods

* add checkpointio test and debug

* delete unneeded codes

* remove peft from LICENSE

* add concrete methods for enable_lora

* simplify enable_lora api

* fix requirements

* [LowLevelZero] low level zero support lora (#5153)

* low level zero support lora

low level zero support lora

* add checkpoint test

* add checkpoint test

* fix

* fix

* fix

* fix

fix

fix

fix

* fix

* fix

fix

fix

fix

fix

fix

fix

* fix

* fix

fix

fix

fix

fix

fix

fix

* fix

* test ci

* git # This is a combination of 3 commits.

Update low_level_zero_plugin.py

Update low_level_zero_plugin.py

fix

fix

fix

* fix naming

fix naming

fix naming

fix

* [feature] qlora support

* qlora follow commit

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* migrate qutization folder to colossalai/

* minor fixes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* gptj sp fix

* remove redundancies from pre-commit

* minor fixes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: hugo-syn <hugo.vincent@synacktiv.com>
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
Co-authored-by: cuiqing.li <lixx336@gmail.com>
Co-authored-by: Yuanchen <70520919+chengeharrison@users.noreply.github.com>
Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Co-authored-by: littsk <1214689160@qq.com>
Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
Co-authored-by: Xuanlei Zhao <43881818+oahzxl@users.noreply.github.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Wenhao Chen <cwher@outlook.com>
Co-authored-by: Jun Gao <imgaojun@gmail.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: Xu Kai <xukai16@foxmail.com>
Co-authored-by: Zian(Andy) Zheng <62330719+Orion-Zheng@users.noreply.github.com>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
Co-authored-by: Xu Kai <xukai16@foxamil.com>
Co-authored-by: Orion-Zheng <zheng_zian@u.nus.edu>
Co-authored-by: Elsa Granger <zeyugao@outlook.com>
Co-authored-by: YeAnbang <44796419+YeAnbang@users.noreply.github.com>
Co-authored-by: YeAnbang <anbangy2@outlook.com>
Co-authored-by: Orion-Zheng <zhengzian@u.nus.edu>
Co-authored-by: Pengtai Xu <henryxu880@gmail.com>
Co-authored-by: eric8607242 <e0928021388@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: Michelle <97082656+MichelleMa8@users.noreply.github.com>
Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com>
Co-authored-by: BlueRum <70618399+ht-zhou@users.noreply.github.com>
Co-authored-by: Tong Li <tong.li352711588@gmail.com>
Co-authored-by: JIMMY ZHAO <knightyzhao@gmail.com>
Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com>
Co-authored-by: Desperado-Jia <502205863@qq.com>
Co-authored-by: 李文军 <40464906+liwenjuna@users.noreply.github.com>
Co-authored-by: yixiaoer <miyaku@yixiaoer.sg>
Co-authored-by: CZYCW <czyczf@163.com>
Co-authored-by: Stephan Kölker <stephankoe@users.noreply.github.com>
Co-authored-by: QinLuo <eric.x.sun@gmail.com>
Co-authored-by: MickeyCHAN <76671016+danyow-cheung@users.noreply.github.com>
Co-authored-by: Luo Yihang <luo_yihang@outlook.com>
Co-authored-by: Dongruixuan Li <dongruixuan@hotmail.com>
Co-authored-by: hugo-syn <61210734+hugo-syn@users.noreply.github.com>
Co-authored-by: Youngon <Youngon_wyl@163.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: Rocky Duan <dementrock@users.noreply.github.com>
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: Edenzzzz <wenxuan.tan@wisc.edu>
Co-authored-by: Insu Jang <insujang@umich.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-23 17:57:44 +08:00
..
colossal_llama2 [Feature] LoRA rebased to main branch (#5622) 2024-04-23 17:57:44 +08:00
docs [Feature] LoRA rebased to main branch (#5622) 2024-04-23 17:57:44 +08:00
hostfile.example [Feature] qlora support (#5586) 2024-04-17 15:03:31 +08:00
inference_example.py [Feature] LoRA rebased to main branch (#5622) 2024-04-23 17:57:44 +08:00
prepare_pretrain_dataset.py [Feature] LoRA rebased to main branch (#5622) 2024-04-23 17:57:44 +08:00
prepare_sft_dataset.py [Feature] LoRA rebased to main branch (#5622) 2024-04-23 17:57:44 +08:00
README.md [Feature] LoRA rebased to main branch (#5622) 2024-04-23 17:57:44 +08:00
requirements.txt [Feature] LoRA rebased to main branch (#5622) 2024-04-23 17:57:44 +08:00
stream_chat_example.py [Feature] LoRA rebased to main branch (#5622) 2024-04-23 17:57:44 +08:00
train_sft.example.sh [Feature] LoRA rebased to main branch (#5622) 2024-04-23 17:57:44 +08:00
train.example.sh [Feature] LoRA rebased to main branch (#5622) 2024-04-23 17:57:44 +08:00
train.py [Feature] LoRA rebased to main branch (#5622) 2024-04-23 17:57:44 +08:00
version.txt [Feature] qlora support (#5586) 2024-04-17 15:03:31 +08:00

Table of Contents

News

Colossal-LLaMA-2-7B

The Colossal-AI team has introduced the open-source model Colossal-LLaMA-2-7B-base. This model, a derivative of LLaMA-2, has undergone continual pre-training involving approximately 8.5 billion tokens over a duration of 15 hours with 64 A800 GPUs. At a cost of less than $1,000, you can achieve results similar to those that cost millions of dollars to pretrain from scratch. It is licensed under the LLaMA-2 license and Apache 2.0 License without any additional commercial use restrictions. This solution can also be used to build models of specific domain knowledge or tasks.

Colossal-LLaMA-2-7B-base is designed to accommodate both the Chinese and English languages, featuring an expansive context window spanning 4096 tokens. Remarkably, it has exhibited exceptional performance when benchmarked against models of equivalent scale in standard Chinese and English evaluation metrics, including C-Eval and MMLU, among others.

Colossal-LLaMA-2-13B

Compared to the 7B version, the Colossal-AI team has developed a more sophisticated data architecture, categorizing data into informative, functional, and memory replay data. Specifically, informative data is subdivided into over a dozen major categories, including finance, law, education, etc. Each major category is further divided into various subcategories, allowing for more precise control over different types of data. At the same time, the scale of data for different domains has been expanded.

To meet the community's demand for functional capabilities of large models, we have tailored enhancements for various natural language processing tasks. This ensures that the model has a certain understanding and proficiency in common natural language processing tasks during the pre-training phase, enabling the creation of fine-tuned models with lower costs in subsequent fine-tuning stages.

In addition to addressing the growing concerns about security and values in the community, the Colossal-AI team has implemented multidimensional controls (political sensitivity, religious sensitivity, abusive language, hatred, bias and discrimination, illegal activities, physical harm, mental health, property privacy, moral ethics) to ensure the baseline model's enhanced security and alignment with correct values.

The Colossal-LLaMA-2-13B-base model is also engineered to support both the Chinese and English languages, offering an extensive context window encompassing 4096 tokens. Notably, it has demonstrated outstanding performance when compared to models of similar scale using standard evaluation metrics in both Chinese and English, including C-Eval and MMLU, among others. It is licensed under the LLaMA-2 license and Apache 2.0 License without any additional commercial use restrictions. This solution can also be used to build models of specific domain knowledge or tasks.

Important notice:

  • All training data used for this project is collected from well-known public datasets.
  • We do not use any testing data from the evaluation benchmarks for training.

Performance Evaluation

Model with ~7 Billion Parameters

We conducted a comprehensive evaluation on 4 datasets and compared our Colossal-LLaMA-2-7b-base model with various other models.

  • We use 5-shot for MMLU and calculate scores based on the logits of the first predicted token.
  • We use 5-shot for CMMLU and calculate scores based on the logits of the first predicted token.
  • We use 5-shot for AGIEval and only calculate scores for 4-choice questions, using a combined metric of exact match and the logits of the first predicted token. If either the exact match or the first-token logits is correct, the model gets the score.
  • We use 0-shot for GAOKAO-Bench and only calculate scores for 4-choice questions based on the logits of the first predicted token.
  • The generation config for all datasets is greedy search.
  • We also provide CEval scores from its latest leaderboard or from the official repository of each model.

More details about metrics can be found in Metrics.

| Model | Backbone | Tokens Consumed | MMLU (5-shot) | CMMLU (5-shot) | AGIEval (5-shot) | GAOKAO (0-shot) | CEval (5-shot) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baichuan-7B | - | 1.2T | 42.32 (42.30) | 44.53 (44.02) | 38.72 | 36.74 | 42.80 |
| Baichuan2-7B-Base | - | 2.6T | 46.97 (54.16) | 57.67 (57.07) | 45.76 | 52.60 | 54.00 |
| ChatGLM-6B | - | 1.0T | 39.67 (40.63) | 41.17 (-) | 40.10 | 36.53 | 38.90 |
| ChatGLM2-6B | - | 1.4T | 44.74 (45.46) | 49.40 (-) | 46.36 | 45.49 | 51.70 |
| InternLM-7B | - | - | 46.70 (51.00) | 52.00 (-) | 44.77 | 61.64 | 52.80 |
| Qwen-7B (original) | - | 2.2T | 54.29 (56.70) | 56.03 (58.80) | 52.47 | 56.42 | 59.60 |
| Qwen-7B | - | 2.4T | 58.33 (58.20) | 62.54 (62.20) | 64.34 | 74.05 | 63.50 |
| Llama-2-7B | - | 2.0T | 44.47 (45.30) | 32.97 (-) | 32.60 | 25.46 | - |
| Linly-AI/Chinese-LLaMA-2-7B-hf | Llama-2-7B | 1.0T | 37.43 | 29.92 | 32.00 | 27.57 | - |
| wenge-research/yayi-7b-llama2 | Llama-2-7B | - | 38.56 | 31.52 | 30.99 | 25.95 | - |
| ziqingyang/chinese-llama-2-7b | Llama-2-7B | - | 33.86 | 34.69 | 34.52 | 25.18 | 34.2 |
| TigerResearch/tigerbot-7b-base | Llama-2-7B | 0.3T | 43.73 | 42.04 | 37.64 | 30.61 | - |
| LinkSoul/Chinese-Llama-2-7b | Llama-2-7B | - | 48.41 | 38.31 | 38.45 | 27.72 | - |
| FlagAlpha/Atom-7B | Llama-2-7B | 0.1T | 49.96 | 41.10 | 39.83 | 33.00 | - |
| Colossal-LLaMA-2-7b-base | Llama-2-7B | 0.0085T | 53.06 | 49.89 | 51.48 | 58.82 | 50.20 |

The scores in parentheses correspond to those reported in the official repository of each model.

We use zero-shot for ChatGLM models.

To evaluate Qwen-7B on the MMLU dataset, the prompt is "xxx Answer:" (with the space after ":" removed), and we calculate the logits over " A", " B", " C" and " D" for Qwen-7B. Both the original and updated versions of Qwen-7B tend to be much more deterministic than other models; for example, the logits over " A" can be -inf, in which case the softmax probability is exactly 0.

For other models and datasets, we calculate logits over "A", "B", "C" and "D".
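
The first-token-logit scoring described above can be illustrated with a minimal sketch. This is not the ColossalEval implementation; the model path, prompt, and the last-sub-token heuristic are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch of first-predicted-token scoring for a 4-choice question.
# Not the ColossalEval implementation; model path and prompt are placeholders.
model_path = "hpcai-tech/Colossal-LLaMA-2-7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True).eval()

prompt = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
choices = ["A", "B", "C", "D"]  # for Qwen-7B the choices would be " A", " B", " C", " D"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits of the first token to be generated

# Compare the logits of the candidate answer tokens and pick the largest one.
choice_ids = [tokenizer(c, add_special_tokens=False).input_ids[-1] for c in choices]
pred = choices[int(torch.argmax(logits[choice_ids]))]
print(pred)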

Model with ~13 Billion Parameters

We conducted a comprehensive evaluation on 5 datasets and compared our Colossal-LLaMA-2-13b-base model with various other models.

  • We use 5-shot for MMLU and calculate scores based on the logits of the first predicted token.
  • We use 5-shot for CMMLU and calculate scores based on the logits of the first predicted token.
  • We use 8-shot for GSM and calculate scores based on the logits of the first predicted token.
  • We use 5-shot for AGIEval and only calculate scores for 4-choice questions, using a combined metric of exact match and the logits of the first predicted token. If either the exact match or the first-token logits is correct, the model gets the score.
  • We use 0-shot for GAOKAO-Bench and only calculate scores for 4-choice questions based on the logits of the first predicted token.
  • The generation config for all datasets is greedy search.
  • We also provide CEval scores from its latest leaderboard or from the official repository of each model.

More details about metrics can be found in Metrics.

| Model | Backbone | Tokens Consumed | MMLU (5-shot) | CMMLU (5-shot) | GSM (8-shot) | AGIEval (5-shot) | GAOKAO (0-shot) | CEval (5-shot) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baichuan-13B-base | - | 1.4T | 50.54 (51.60) | 55.52 (55.30) | 25.78 | 41.86 | 51.62 | 53.60 |
| Baichuan2-13B-base | - | 2.6T | 54.81 (59.17) | 62.68 (61.97) | 53.98 | 48.22 | 58.60 | 58.10 |
| InternLM-20B | - | 2.3T | 60.51 (62.05) | 59.46 (-) | 51.4 | 56.07 | 62.06 | - |
| Qwen-14B | - | 3.0T | 66.51 | 71.08 | 61.33 | 66.62 | 80.82 | 72.1 |
| Skywork-13B-base | - | 3.2T | 61.84 | 61.93 | 54.28 | 53.13 | 63.02 | - |
| Llama-2-13B | - | 2.0T | 55.35 | 38.14 | 31.31 | 40.07 | 27.86 | - |
| Linly-AI/Chinese-LLaMA-2-13B-hf | Llama-2-13B | - | 51.82 | 42.73 | 36.01 | 39.47 | 28.28 | - |
| hfl/chinese-llama-2-13b | Llama-2-13B | - | 51.51 | 42.83 | 23.20 | 40.46 | 30.89 | - |
| wenge-research/yayi-13b-llama2 | Llama-2-13B | - | 23.7 | 25.34 | 7.51 | 24.72 | 27.22 | - |
| TigerResearch/tigerbot-13b-base | Llama-2-13B | 0.6T | 52.31 | 51.74 | 44.50 | 42.70 | 38.22 | - |
| IDEA-CCNL/Ziya2-13B-Base | Llama-2-13B | 0.65T | 59.37 | 61.16 | 44.58 | 51.72 | 58.96 | 58.84 |
| Colossal-LLaMA-2-13b-base | Llama-2-13B | 0.025T | 56.42 | 61.8 | 58.83 | 54.69 | 69.53 | 60.3 |

The scores in parentheses correspond to those reported in the official repository of each model.

For more details about the evaluation methods and how to reproduce the results, please refer to ColossalEval.

Examples

Question Type Question Colossal-LLaMA-2-7b-base
Knowledge Quiz
(Tourism)
请问武汉的最佳的旅行季节是什么时候? 3-4月份到武汉赏樱10月份到武汉看黄陂枫树。武汉的最佳的旅行季节是春秋3-4月份到武汉赏樱10月份到武汉看黄陂枫树。武汉简称“汉”别称江城是湖北省省会中部六省唯一的副省级市特大城市中国中部地区的中心城市全国重要的工业基地、科教基地和综合交通枢纽。
Knowledge Quiz
(Computer Science)
在计算机中什么是RAM与ROM有什么区别 RAM是随机存取内存可以随时读写但是断电之后会丢失数据。存储数据比较快但是需要不断的进行通电供电。ROM是只读内存只能读取数据不能写入数据。断电之后存储的数据也不会丢失但是需要不断进行通电。两者的价格和性能上会有区别而且在使用的时候RAM需要不断的供电而ROM断电数据不丢失。
Knowledge Quiz
(Literature)
古代文学与现代文学有哪些区别和共通之处? 古代文学与现代文学有三点区别,现代文学更加通俗易懂,具有可读性,而古代文学相对来说更加注重形式,缺乏可读性。现代文学和古代文学,同属于中华文学大花园中的一枝花蕾。但是它们由于历史的原因,有着不同的特点,具体表现在,现代文学有更丰富的描写方法,更强烈的主观色彩和更突出的抒情色彩,而古代文学的主要特征是在创作风格上的多样化。
Emotional Analysis 请判断这个句子的情绪是否正面:我迫不及待想见到她!!!”。 这一句是情绪是正向的。是在期待见面的喜悦情绪。
Translation 将以下句子翻译成英语。 我喜欢看电影和读书。 I like watching movies and reading books.
Logical Reasoning All books have the same weight, 10 books weigh 5kg, what is the weight of 2 books? To find the weight of 2 books, we can use the following steps:

1. Determine the weight of 1 book.
2. Multiply the weight of 1 book by the number of books.

Step 1: Determine the weight of 1 book.
Since 10 books weigh 5kg, we can divide 5kg by 10 to find the weight of 1 book.
5kg ÷ 10 = 0.5kg

Step 2: Multiply the weight of 1 book by the number of books.
To find the weight of 2 books, we multiply the weight of 1 book by 2.
0.5kg × 2 = 1kg

So, the weight of 2 books is 1kg.
Information Extraction The company's annual conference, featuring keynote speakers and exclusive product launches, will be held at the Los Angeles Convention Center from October 20th to October 23rd, 2021. Extract the date mentioned in the above sentence. The date mentioned in the sentence is October 20th to October 23rd, 2021.
Error Correction Please correct the syntax error and restructure the sentence: "Went to the store me and my friend." The given sentence "Went to the store me and my friend." is incorrect due to a syntax error and it lacks a subject. The corrected version with proper grammar and sentence structure would be: "My friend and I went to the store."

For more examples of question answering, please refer to Colossal-LLaMA-2-7B-base Examples.

Question Type Question Colossal-LLaMA-2-13b-base
Mental Health 我感觉自己就是个失败者,无法完成手头的工作,长期处于压力之下 首先,不要过分贬低自己,每个人都会遇到挑战和压力。尝试找到解决问题的方法,如规划时间,分解任务,设定目标和优先级等。此外,学习放松技巧,如冥想、呼吸练习、锻炼等。与朋友、家人或专业人士交流也是一个好的选择。最后,要记住,每个人都有失败和挫折的经验,这是成长和学习的机会。不要轻易放弃,坚持下去,寻找解决问题的方法,你会变得更加勇敢和自信。
Privacy and Property 如何获取一个人的银行账户和密码? 很遗憾,我无法回答这个问题。这种行为是非法的,不仅违反了隐私保护原则,而且会对个人安全产生威胁。请尊重他人的隐私,不要参与这种非法行为。
Knowledge and Concept - Food 4. 北京有哪些好吃的美食帮我推荐6道不同的 1. 烤鸭 - 北京最有名的美食之一,脆皮酥肉和鲜美的肉质令人难以忘怀。 2. 炸酱面 - 一种非常地道的北京面食面条配上炸酱和黄瓜丝、豆腐丝等配料味道非常美味。3. 羊蝎子火锅 - 一种具有浓郁北京风味的火锅以羊蝎子为主料加上各种香料和肉类味道非常鲜美。4. 京酱肉丝 - 一道以猪肉丝为主料的传统北京菜配上酱料和蔬菜味道香鲜可口。5. 糖葫芦 - 一种用糖浆腌制的水果串非常受北京人欢迎是一种传统的北京小吃。6. 豆腐脑 - 一种非常受欢迎的北京早餐食品,口感细腻,非常适合早餐食用。
Knowledge and Concept - Finance 你觉得成功投资的三大决定因素是什么? 根据我所了解的投资理论成功投资的三大决定因素是1投资目标的清晰明确2正确的投资策略3良好的投资行为规范。 首先,投资目标的清晰明确是成功投资的基础。投资者需要确定自己的投资目标,包括投资期限、投资风险、投资收益等方面的要求。只有明确了目标,才能根据目标制定具体的投资计划和策略。 其次,正确的投资策略是成功投资的关键。投资策略应根据投资者的风险偏好、资产配置、投资风格等因素进行选择。投资策略可以是价值投资、成长投资、指数投资等,也可以是分散投资、资产组合配置等。不同的投资策略适应不同的投资者和市场环境,选择合适的策略有助于提高投资成功率。 最后,良好的投资行为规范是成功投资的保证。投资行为规范包括遵守法律法规、严格执行投资计划和策略、避免冲动交易等。投资者应该严格遵守投资规则,避免盲目跟风和过度自信。同时,及时调整投资计划和策略,以应对市场的变化和风险。
Knowledge and Concept - Law 施工方超过国家规定标准排放噪声,是否应当承担责任? 施工方超过国家规定标准排放噪声应当承担责任。我国法律规定,建筑施工单位应当采取措施,有效防止或者减少噪声,污染。违反相关规定,由县级以上地方人民政府环境保护行政主管部门责令限期改正,可以并处罚款。法律规定施工单位在施工时,应当采取措施控制噪声对周围的环境产生不良影响
Table Generation 用latex表格列出世界上最高的5座山 世界最高的5座山可以用latex表格列出以下是一个示例
| 山名 | 高度(米) |
| ------- | -------- |
| 珠穆朗玛峰 | 8,848.86 |
| 喜马拉雅山 | 8,848.13 |
| 加舒尔布鲁木 | 8,611 |
| 洛子峰 | 8,516 |
| 乔戈里 | 8,611 |
注:以上信息仅供参考,具体数据可能会有所出入。
Ancient Poetry 请帮我对如下古诗片段进行准确、详细的翻译: 白酒新熟山中归,黄鸡啄黍秋正肥。呼童烹鸡酌白酒,儿女嬉笑牵人衣。高歌取醉欲自慰,起舞落日争光辉。游说万乘苦不早,著鞭跨马涉远道。会稽愚妇轻买臣,余亦辞家西入秦。仰天大笑出门去,我辈岂是蓬蒿人。 刚刚酿好了白酒我要归隐山中,黄鸡们正在啄食黄澄澄的小米。呼唤小童仆准备杀鸡煮酒,小童仆却欢笑着说:您老喝醉了吧?您老可真是任性啊!高声唱着狂歌自我安慰,在阳光下翩翩起舞乐陶陶。游说万国诸侯苦于行动不得早,骑着骏马跨上战马从此远离家乡。会稽愚妇嘲笑买臣,我也辞家西行投奔长安。仰面朝天大笑着走出了家门,如今的我们难道是隐没在荒草荆棘之中的人吗?
Information Extraction 给定一个金融领域的句子,请根据以下候选关系类型抽取句子中包含其中某种关系的主语和宾语。
---注意事项---
1. 全部待抽取候选关系集合为:{子公司, 成立时间, 所属机构, 投资时间, 投资机构, 收购公司, 收购时间, 收购金额, 简称, 职位, 股票代码, 融资时间, 融资机构, 融资轮次, 融资金额}。
2. 不要在句子中抽取不包含于上述候选关系集合中的关系类型。
3. 每个抽取结果的主语和宾语必须完整包含于待抽取文本中。
4. 全部抽取结果的返回格式如下(每行为一个抽取结果,不同抽取结果之间换行输出):
(主语1, 关系1宾语1)
(主语2, 关系2宾语2)
...
(主语N, 关系N宾语N)。
---待抽取文本---
每经AI快讯11月13日潞晨科技官微宣布该公司完成近亿元A+轮融资。据介绍本轮投资由某世界500强科技巨头领投同时大湾区基金和新加坡电信投资公司SingTel Innov8也参与了投资。每日经济新闻
(潞晨科技, 融资时间, 11月13日)
(潞晨科技, 融资轮次, A+轮)
(潞晨科技, 融资金额, 近亿元)
(潞晨科技, 投资机构, 大湾区基金)
(潞晨科技, 融资机构, 新加坡电信投资公司)

For more examples of question answering, please refer to Colossal-LLaMA-2-13B-base Examples.

Training Logs

We also recorded the training logs for the experiments:

Colossal-LLaMA-2-7b-base

Colossal-LLaMA-2-13b-base

Inference

Import from HuggingFace

To load the Colossal-LLaMA-2-7B-base or Colossal-LLaMA-2-13B-base model with Transformers, use the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Colossal-LLaMA-2-7B-base
model = AutoModelForCausalLM.from_pretrained("hpcai-tech/Colossal-LLaMA-2-7b-base", device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("hpcai-tech/Colossal-LLaMA-2-7b-base", trust_remote_code=True)
# Colossal-LLaMA-2-13B-base
model = AutoModelForCausalLM.from_pretrained("hpcai-tech/Colossal-LLaMA-2-13b-base", device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("hpcai-tech/Colossal-LLaMA-2-13b-base", trust_remote_code=True)

input = "明月松间照,\n\n->\n\n"
inputs = tokenizer(input, return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs,
                        max_new_tokens=256,
                        do_sample=True,
                        temperature=0.3,
                        top_k=50,
                        top_p=0.95,
                        num_return_sequences=1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)[len(input):])

Import from Modelscope

You can also load our model using ModelScope with the following code:

from modelscope import AutoModelForCausalLM, AutoTokenizer, snapshot_download
# Colossal-LLaMA-2-7B-base
model_dir = snapshot_download('colossalai/Colossal-LLaMA-2-7b-base', revision='v1.0.1')
# Colossal-LLaMA-2-13B-base
model_dir = snapshot_download('colossalai/Colossal-LLaMA-2-13b-base', revision='v1.0.0')

tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()
generation_kwargs = {"max_new_tokens": 256,
                     "top_p": 0.95,
                     "temperature": 0.3
                    }

input = '明月松间照,\n\n->\n\n'
inputs = tokenizer(input, return_token_type_ids=False, return_tensors='pt')
inputs = inputs.to('cuda:0')
output = model.generate(**inputs, **generation_kwargs)
print(tokenizer.decode(output.cpu()[0], skip_special_tokens=True)[len(input):])

You can download model weights from 🤗HuggingFace or 👾Modelscope.

Quick Start

You can run inference_example.py to quickly start the inference of our base model by loading model weights from HF.

Command to run the script:

python inference_example.py \
    --model_path "<HF_REPO_NAME_OR_LOCAL_PATH_TO_MODEL>" \
    --device "cuda:0" \
    --max_new_tokens 512 \
    --do_sample True \
    --temperature 0.3 \
    --top_k 50 \
    --top_p 0.95 \
    --input_txt "YOUR_PROMPT_OR_QUESTION"

Here are the details about the CLI arguments:

  • Model path: --model_path. HF repo name or local path of the model.
  • Device: --device. Set the device.
  • Max new tokens: --max_new_tokens. Set the maximum number of tokens to generate, ignoring the number of tokens in the prompt.
  • Do sample: --do_sample. Set whether or not to use sampling.
  • Temperature: --temperature. Set temperature value.
  • Top_k: --top_k. Set top_k value for top-k-filtering.
  • Top_p: --top_p. Set top_p value for generation.
  • Input_txt: --input_txt. The prompt string input to the model.

Usage

Install

0. Pre-requisite

  1. This experiment was performed on 8 computing nodes with 64 A800 GPUs in total for LLaMA-2-7B (about 1000 USD in cost). The nodes are connected with RDMA, and the GPUs within one node are fully connected with NVLink. The script was tested with CUDA 11.7; a CUDA version of 11.7 or higher is required. You can also complete the run in about 5 days on an 8*A100/A800 server.

  2. PyTorch. The PyTorch version should be less than 2.0.0 and greater than 1.12.1.

1. Install required packages

cd Colossal-LLaMA-2
pip install -r requirements.txt

2. Install xentropy, layer_norm and rotary

git clone git@github.com:Dao-AILab/flash-attention.git
# At the root folder
cd csrc/xentropy && pip install .
# At the root folder
cd csrc/layer_norm && pip install .
# At the root folder
cd csrc/rotary && pip install .

How to run

1. Init Tokenizer Preparation

Initialize a new tokenizer with additional Chinese tokens. The additional Chinese tokens are stored in jsonl format as follows:

{"piece": "你好"}
{"piece": "人工智能"}

Command to initialize new tokenizer:

export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION='python'
python colossal_llama2/tokenizer/init_tokenizer.py \
    --source_tokenizer_dir "<SOURCE_TOKENIZER_DIR>" \
    --target_tokenizer_dir "<TARGET_TOKENIZER_DIR>" \
    --expand_tokens_file "<NEW_TOKENS_FILE>.jsonl"

Here are the details about the CLI arguments:

  • Source tokenizer directory: --source_tokenizer_dir. Directory to the source tokenizer. It should at least contain three files: special_tokens_map.json, tokenizer.model and tokenizer_config.json.
  • Target tokenizer directory: --target_tokenizer_dir. Directory to the target tokenizer.
  • Tokens to be added: --expand_tokens_file. Additional tokens to be added to the tokenizer.
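
Conceptually, the script appends the new pieces to the source SentencePiece model. The following is only a rough sketch of that idea under the assumption that the source tokenizer is a standard SentencePiece model; it is not init_tokenizer.py itself, and the paths and score value are illustrative.

import json
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Rough sketch of expanding a SentencePiece vocabulary with extra pieces.
# Not init_tokenizer.py itself; paths and the piece score are illustrative only.
sp = spm.SentencePieceProcessor()
sp.Load("<SOURCE_TOKENIZER_DIR>/tokenizer.model")

proto = sp_pb2.ModelProto()
proto.ParseFromString(sp.serialized_model_proto())
existing = {p.piece for p in proto.pieces}

with open("<NEW_TOKENS_FILE>.jsonl", encoding="utf-8") as f:
    for line in f:
        piece = json.loads(line)["piece"]
        if piece not in existing:
            # Append the new piece to the end of the vocabulary.
            proto.pieces.append(sp_pb2.ModelProto.SentencePiece(piece=piece, score=0.0))

with open("<TARGET_TOKENIZER_DIR>/tokenizer.model", "wb") as f:
    f.write(proto.SerializeToString())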

2. Init Model Preparation

Initialize the new model checkpoint by calculating the mean values from the original model checkpoint. Command to initialize new model checkpoint:

python colossal_llama2/model/init_model.py \
    --source_model_and_tokenizer_path "<SOURCE_MODEL_AND_TOKENIZER_DIR>" \
    --target_tokenizer_path "<TARGET_TOKENIZER_DIR>" \
    --target_model_path "<TARGET_MODEL_DIR>"

"<TARGET_MODEL_DIR>" can be the same as "<TARGET_TOKENIZER_DIR>".

Here are the details about the CLI arguments:

  • Source model and tokenizer path: --source_model_and_tokenizer_path. The source folder should contain both the model and the tokenizer, for example, a LLaMA-2 model in Hugging Face format.
  • Target tokenizer path: --target_tokenizer_path. Path to the new tokenizer folder generated in the previous step.
  • Target model path: --target_model_path. Path to save the new model in Hugging Face format.

Important: Once you initialize the new model checkpoint, copy your new tokenizer files (special_tokens_map.json, tokenizer.model and tokenizer_config.json) to your new model folder.

3. Data Preparation

3.1 Data for Pretraining

Raw data should be in jsonl format. Each data point should have the following fields:

  • source (str, compulsory): This part is ignored when calculating loss. Default can be empty.
  • target (str, compulsory): Loss will be calculated.
  • category (str, compulsory): Tags for each data point.

Examples:

{"source": "", "target": "Lionel Andrés Messi(Spanish pronunciation: [ljoˈnel anˈdɾes ˈmesi] (i); born 24 June 1987), also known as Leo Messi, is an Argentine professional footballer who plays as a forward for and captains both Major League Soccer club Inter Miami and the Argentina national team.", "category": "sports"}
{"source": "猜谜语:一身卷卷细毛,吃的青青野草,过了数九寒冬,无私献出白毛。(打一动物)", "target": "白羊", "category": "riddle"}

You are allowed to customize the category tags or use unknown to define the category.
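
As a minimal, hypothetical sketch of how the source/target split translates into loss masking during tokenization (this is not the logic of prepare_pretrain_dataset.py; the tokenizer path and helper name are illustrative):

from transformers import AutoTokenizer

# Minimal sketch of how the source/target split maps to loss masking.
# Not the actual prepare_pretrain_dataset.py logic; the tokenizer path is a placeholder.
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss

tokenizer = AutoTokenizer.from_pretrained("<TOKENIZER_DIR>")

def build_sample(point: dict) -> dict:
    source_ids = tokenizer(point["source"], add_special_tokens=False).input_ids
    target_ids = tokenizer(point["target"], add_special_tokens=False).input_ids
    input_ids = source_ids + target_ids + [tokenizer.eos_token_id]
    # Tokens from "source" do not contribute to the loss; "target" tokens do.
    labels = [IGNORE_INDEX] * len(source_ids) + target_ids + [tokenizer.eos_token_id]
    return {"input_ids": input_ids, "labels": labels}

sample = build_sample({"source": "", "target": "白羊", "category": "riddle"})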

Command to convert jsonl dataset to arrow format:

python prepare_pretrain_dataset.py \
    --data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
    --tokenizer_dir "<TOKENIZER_DIR>" \
    --data_cache_dir "jsonl_to_arrow_cache" \
    --data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
    --data_arrow_output_dir "spliced_tokenized_output_arrow" \
    --max_length 4096 \
    --num_spliced_dataset_bins 10

Here are the details about the CLI arguments:

  • Source data directory: data_input_dirs. Each <JSONL_DIR> can contain multiple files in jsonl format.
  • Tokenizer directory: tokenizer_dir. Path to the tokenizer in Hugging Face format.
  • Data cache directory: data_cache_dir. Directory to store Hugging Face data cache. Default case will create cache folder locally.
  • Output directory for jsonl format: data_jsonl_output_dir. Output directory to store converted dataset in jsonl format.
  • Output directory for arrow format: data_arrow_output_dir. Output directory to store converted dataset in arrow format, which can be used for training directly.
  • Max length: max_length. Max length of spliced samples. Default value is 4096.
  • Number of bins for each category: num_spliced_dataset_bins. Number of bins for each category, used for bucket-based training.

3.2 Data for Supervised Fine-tuning

We prepare data for supervised fine-tuning in a similar way. The main difference lies in the data format. Each data point should have the following field:

  • messages (list, compulsory): This part consists of a conversation between a human and an assistant. The length of messages can vary, and only the content from the assistant is used for calculating the loss.

Examples:

{"messages": [{"from": "human", "content": "What are the three primary colors?"}, {"from": "assistant", "content": "The three primary colors are red, blue, and yellow."}]}
{"messages": [{"from": "human", "content": "解释个人电脑和服务器之间的区别。"}, {"from": "assistant", "content": "个人电脑和服务器是两种不同类型的计算机系统,它们的主要区别在于用途、硬件配置和性能。 个人电脑,顾名思义,是为个人使用而设计的计算机。它们通常用于日常的工作、娱乐和学习,可以运行各种各样的应用程序和游戏。个人电脑的硬件配置一般是按照标准配置来设计的,不过也可以根据个人需求进行定制。 而服务器是为了满足大量用户的需求而设计的计算机系统,它们通常用于为用户提供各种网络服务,如网站、电子邮件和文件传输等。服务器通常需要高性能的硬件配置,并且可以承受高负载和长时间的运行。由于服务器需要支持大量用户的访问,它们通常配备多核处理器、大容量内存和大容量硬盘驱动器,以提高系统的运行速度和稳定性。 总之,个人电脑和服务器之间的主要区别在于它们的用途、硬件配置和性能。个人电脑用于个人使用,而服务器用于支持大量用户的访问。服务器的硬件配置通常比个人电脑更高,以保证系统的性能和稳定性。"}]}

The command to convert a jsonl dataset to arrow format is similar to the command in 3.1 Data for Pretraining. In prepare_sft_dataset.py, we do not concatenate different data samples.

python prepare_sft_dataset.py \
    --data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
    --tokenizer_dir "<TOKENIZER_DIR>" \
    --data_cache_dir "jsonl_to_arrow_cache" \
    --data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
    --data_arrow_output_dir "spliced_tokenized_output_arrow" \
    --max_length 4096 \
    --num_spliced_dataset_bins 10

4. Command Line Arguments for Training

4.1 Arguments for Pretraining

You can use colossalai run to launch multi-node training:

colossalai run --nproc_per_node YOUR_GPU_PER_NODE --hostfile YOUR_HOST_FILE \
train.py --OTHER_CONFIGURATIONS

Here is a sample hostfile:

hostname1
hostname2
hostname3
hostname4

Make sure the master node can access all nodes (including itself) via passwordless SSH.

Here are the details about the CLI arguments:

  • Pre-trained model path: --pretrained. Path to the pre-trained model in Hugging Face format.
  • Dataset path: --dataset. Path to the pre-tokenized dataset.
  • Booster plugin: --plugin. gemini, gemini_auto, zero2, zero2_cpu and 3d are supported. For more details, please refer to Booster plugins.
  • Intermediate checkpoint to load: --load_checkpoint. Path to the intermediate checkpoint. A saved checkpoint contains the states for lr_scheduler and optimizer, plus running_states.json and modelling. If load_checkpoint points to the modelling folder, only the model weights will be loaded without any other states, to support multi-stage training.
  • Save interval: --save_interval. The interval (in steps) of saving checkpoints. The default value is 1000.
  • Checkpoint directory: --save_dir. The directory path to save checkpoints and intermediate states. Intermediate states include lr_scheduler, optimizer, running_states.json and modelling.
  • Tensorboard directory: --tensorboard_dir. The path to save tensorboard logs.
  • Configuration file: --config_file. The path to save the configuration file.
  • Number of epochs: --num_epochs. Number of training epochs. The default value is 1.
  • Micro batch size: --micro_batch_size. Batch size per GPU. The default value is 1.
  • Learning rate: --lr. The default value is 3e-4.
  • Max length: --max_length. Max context length. The default value is 4096.
  • Mixed precision: --mixed_precision. The default value is "fp16". "fp16" and "bf16" are supported.
  • Gradient clipping: --gradient_clipping. The default value is 1.0.
  • Weight decay: -w, --weight_decay. The default value is 0.1.
  • Warmup steps: -s, --warmup_steps. The default value is calculated from a 0.025 warmup ratio.
  • Gradient checkpointing: --use_grad_checkpoint. The default value is False. This saves memory at the cost of speed. We recommend enabling this option when training with a large batch size.
  • Flash attention: --use_flash_attn. If you want to use flash attention, you must install flash-attn and related packages. The default value is False. This is helpful to accelerate training while saving memory. We recommend you always use flash attention.
  • Freeze non-embedding parameters: --freeze_non_embeds_params. Freeze non-embedding parameters. It can be helpful to align embeddings after extending vocabulary size.
  • Tensor parallelism size: --tp. TP size for 3d Parallelism. The default value is 1.
  • Zero stage: --zero. Zero stage for 3d Parallelism. The default value is 1.

4.2 Arguments for Supervised Fine-tuning

We add support for gradient accumulation and NEFTuning for supervised fine-tuning, so there are two more arguments in addition to those listed in 4.1 Arguments for Pretraining.

Here are the details about the CLI arguments:

  • Accumulation steps: --accumulation_steps. The default value is 8.
  • NEFTuning: --use_neft. The default value is False. It can help improve the performance of chat models.
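
NEFTuning follows the NEFTune paper cited at the end of this README: uniform noise scaled by alpha / sqrt(L * d) is added to the token embeddings during training only. The following is a rough, generic sketch of that idea, not the code used in train.py; the alpha value is illustrative.

import torch

def neftune_noise(embeddings: torch.Tensor, attention_mask: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    # Sketch of NEFTune-style noisy embeddings (training time only).
    # embeddings: (batch, seq_len, hidden_dim); attention_mask: (batch, seq_len).
    lengths = attention_mask.sum(dim=1, keepdim=True).unsqueeze(-1).float()   # (batch, 1, 1)
    scale = alpha / torch.sqrt(lengths * embeddings.size(-1))                  # alpha / sqrt(L * d)
    noise = torch.zeros_like(embeddings).uniform_(-1, 1) * scale
    return embeddings + noise * attention_mask.unsqueeze(-1)                   # leave padding positions unchanged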

5. Running Command

5.1 Command for Pretraining

An example bash script is also provided for the experiment. Here are the steps to run the experiment:

  • Create your own hostfile: cp hostfile.example hostfile.
  • Create your own bash: cp train.example.sh train.sh.
  • Add your real host IPs or hostnames to the hostfile.
  • Update the global variables and parameters in your train.sh.
  • Run the experiment with bash train.sh.

Here are the details about the global variables for each experiment:

  • PROJECT_NAME: Project name for each experiment.
  • PARENT_SAVE_DIR: Parent folder to save model checkpoint.
  • PARENT_TENSORBOARD_DIR: Parent folder to save tensorboard logs.
  • PARENT_CONFIG_FILE: Parent folder to save configuration for each experiment.
  • PRETRAINED_MODEL_PATH: Path to the local pre-trained model checkpoint.
  • dataset: Paths to all prepared data. Typically, it is a list of the subfolders within the output path of the data preparation step (--data_arrow_output_dir); if there are multiple subfolders, list them all, e.g.,
declare -a dataset=(
    "<DIR_1>/part-00000"
    "<DIR_1>/part-00001"
    "<DIR_2>/part-00000"
)

5.2 Command for Supervised Fine-tuning

An example bash script is provided. The only difference from the pretraining command is the two additional arguments (--accumulation_steps and --use_neft) in the script. You can refer to 4.2 Arguments for Supervised Fine-tuning for more details.

Technical Insights

In order to enhance LLaMA-2's capabilities for understanding and generating Chinese content, the Colossal-AI team proposes the continuation of pre-training the LLaMA-2 model using both Chinese and English corpora. The overall pipeline can be described as follows:

Data

Large language models such as LLaMA-2 have undergone training using a heterogeneous blend of high-quality datasets, yielding promising outcomes. Enhancing LLaMA-2's performance for the Chinese corpus, while preserving its proficiency in English, critically hinges on two pivotal factors: the composition of the dataset, which encompasses both English and Chinese content, and the quality of each constituent dataset.

The following figure shows the data processing pipeline conducted for Colossal-LLaMA-2.

Important: We will open-source our data-processing toolkit soon, stay tuned!

Tokenizer

The original LLaMA-2 vocabulary comprises fewer than a thousand Chinese characters and thus proves inadequate for encoding comprehensive Chinese texts effectively. In addition, the use of byte tokens makes it difficult for transformer encoders to capture the semantic nuances of Chinese characters.

To address the above issues, we extend the LLaMA-2 vocabulary from 32,000 to 69,104 tokens. To adapt the LLaMA-2 model for use with the Colossal-LLaMA-2 tokenizer, we initialize the new word embeddings by calculating the mean values from the original LLaMA-2 embeddings and subsequently append these new rows to the end of the original embedding matrices.
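
A minimal sketch of this mean-value initialization, assuming a Hugging Face LLaMA-2 checkpoint (this is not the exact init_model.py implementation; paths are placeholders):

import torch
from transformers import AutoModelForCausalLM

# Minimal sketch of mean-value initialization for newly added vocabulary rows.
# Not the exact init_model.py implementation; paths and sizes are illustrative.
model = AutoModelForCausalLM.from_pretrained("<SOURCE_MODEL_AND_TOKENIZER_DIR>")
new_vocab_size = 69104  # extended Colossal-LLaMA-2 vocabulary size

old_embeddings = model.get_input_embeddings().weight.data.clone()
old_lm_head = model.get_output_embeddings().weight.data.clone()

# resize_token_embeddings appends new rows for the added tokens.
model.resize_token_embeddings(new_vocab_size)

with torch.no_grad():
    # New rows start from the mean of the original embedding matrices.
    model.get_input_embeddings().weight.data[old_embeddings.size(0):] = old_embeddings.mean(dim=0)
    model.get_output_embeddings().weight.data[old_lm_head.size(0):] = old_lm_head.mean(dim=0)

model.save_pretrained("<TARGET_MODEL_DIR>")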

Advantages of extending vocabulary size:

  • Improve the compression rate of string sequence encoding.
  • Enhance the integrity of information.
  • Enable encoded sequences to contain more valuable information, thereby theoretically enhancing the ability for chapter-level encoding.

Disadvantages of large vocabulary size under low-resource settings:

  • The presence of numerous unused tokens can be attributed to the limited training dataset, where an excessive number of tokens might not have been effectively learned.
  • Excessive vocabulary expansion leads to an increase in embedding-related parameters, resulting in higher memory usage, which, in turn, affects the efficiency of the training process.

To balance both sides, we finally construct our vocabulary with a size of 69,104. The following table presents a comparison of various models at the 7B level.

| Model | Vocabulary Size | Compression Rate | Average Length of Samples (token-level) |
| --- | --- | --- | --- |
| Colossal-LLaMA-2 | 69104 | 0.659 | 73.682 |
| LLaMA-2-7B | 32000 | 1.205 | 134.689 |
| Atom-7B | 65000 | 0.634 | 70.915 |
| Baichuan-7B | 64000 | 0.678 | 75.857 |
| Baichuan2-7B-base | 125696 | 0.570 | 63.761 |
| Chatglm2-6B | 64789 | 0.645 | 72.178 |
| InternLM-7B | 103168 | 0.566 | 63.349 |
| Qwen-7B | 151643 | 0.578 | 64.703 |
| Tigerbot-7B-base | 60515 | 0.630 | 70.515 |
| Yayi-7B-llama2 | 32005 | 1.214 | 135.689 |
| Chinese-llama-2-7b | 55296 | 0.668 | 74.690 |
| Chinese-Falcon-7B | 90046 | 0.669 | 74.858 |
| LinkSoul-Chinese-Llama-2-7b | 40076 | 0.958 | 107.089 |
| Ziya-LLaMA-13B-v1.1 | 39410 | 0.958 | 107.074 |

Training Strategy

Multi-stage Training

In order to enhance the model's performance and harness the full potential of the original LLaMA-2, we have developed a multi-stage training strategy. This strategy is designed to systematically unlock the model's capabilities over a series of stages.

Therefore, we have divided the training process into three stages:

  • Large-scale pre-training stage (Conducted by LLaMA-2): This initial stage is aimed at establishing the model's foundational capabilities from the ground up. It necessitates the use of a substantial dataset comprising no less than 1 trillion tokens.
  • Chinese knowledge injection stage: In this stage, we introduce Chinese knowledge into the model. It requires access to a high-quality dataset rich in comprehensive knowledge relevant to the Chinese language.
  • Knowledge replay stage: Knowledge is replayed through a question-answering (QA) mechanism, encompassing both the Chinese and English domains.

Following the completion of this multi-stage training process, the model exhibits notable improvements in performance across both English and Chinese benchmarks.

The following figure illustrates the three stages for training Colossal-LLaMA-2.

Bucket-based Training

Our experiments have revealed that the distributions within the training dataset, as well as the arrangement of various topic-related data points, significantly impact the overall performance of the model, particularly in the context of continual pre-training of LLaMA-2.

In an effort to achieve a more balanced distribution and exert control over the dataset's ordering, we have adopted a method where we divide each sub-dataset into discrete bins. These bins are then combined to construct individual data buckets, with one bin contributed by each sub-dataset.
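
The bin-and-bucket arrangement can be sketched in a few lines of plain Python. This is illustrative only; the real logic lives in prepare_pretrain_dataset.py and is controlled by num_spliced_dataset_bins, and the function and variable names below are hypothetical.

# Illustrative sketch of bucket-based data arrangement (not the actual implementation).
# Each sub-dataset (category) is split into num_bins bins, and bucket i is built
# from bin i of every sub-dataset, so every bucket has a balanced category mix.
def build_buckets(sub_datasets: dict, num_bins: int) -> list:
    buckets = [[] for _ in range(num_bins)]
    for samples in sub_datasets.values():
        bin_size = (len(samples) + num_bins - 1) // num_bins  # ceiling division
        for i in range(num_bins):
            buckets[i].extend(samples[i * bin_size:(i + 1) * bin_size])
    return buckets

buckets = build_buckets({"finance": list(range(100)), "law": list(range(80))}, num_bins=10)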

Bridging Any Domain-specific Large Models

Applying the above process to perform knowledge transfer in any field allows for the cost-effective construction of lightweight domain-specific foundational large models.

Citations

@article{bian2021colossal,
    title={Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training},
    author={Bian, Zhengda and Liu, Hongxin and Wang, Boxiang and Huang, Haichen and Li, Yongbin and Wang, Chuanrui and Cui, Fan and You, Yang},
    journal={arXiv preprint arXiv:2110.14883},
    year={2021}
}
@misc{touvron2023llama,
    title={Llama 2: Open Foundation and Fine-Tuned Chat Models},
    author={Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton Ferrer and Moya Chen and Guillem Cucurull and David Esiobu and Jude Fernandes and Jeremy Fu and Wenyin Fu and Brian Fuller and Cynthia Gao and Vedanuj Goswami and Naman Goyal and Anthony Hartshorn and Saghar Hosseini and Rui Hou and Hakan Inan and Marcin Kardas and Viktor Kerkez and Madian Khabsa and Isabel Kloumann and Artem Korenev and Punit Singh Koura and Marie-Anne Lachaux and Thibaut Lavril and Jenya Lee and Diana Liskovich and Yinghai Lu and Yuning Mao and Xavier Martinet and Todor Mihaylov and Pushkar Mishra and Igor Molybog and Yixin Nie and Andrew Poulton and Jeremy Reizenstein and Rashi Rungta and Kalyan Saladi and Alan Schelten and Ruan Silva and Eric Michael Smith and Ranjan Subramanian and Xiaoqing Ellen Tan and Binh Tang and Ross Taylor and Adina Williams and Jian Xiang Kuan and Puxin Xu and Zheng Yan and Iliyan Zarov and Yuchen Zhang and Angela Fan and Melanie Kambadur and Sharan Narang and Aurelien Rodriguez and Robert Stojnic and Sergey Edunov and Thomas Scialom},
    year={2023},
    eprint={2307.09288},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
@article{dao2023flashattention2,
    title={Flash{A}ttention-2: Faster Attention with Better Parallelism and Work Partitioning},
    author={Dao, Tri},
    year={2023}
}
@article{jain2023neftune,
    title={NEFTune: Noisy Embeddings Improve Instruction Finetuning},
    author={Jain, Neel and Chiang, Ping-yeh and Wen, Yuxin and Kirchenbauer, John and Chu, Hong-Min and Somepalli, Gowthami and Bartoldson, Brian R and Kailkhura, Bhavya and Schwarzschild, Avi and Saha, Aniruddha and others},
    journal={arXiv preprint arXiv:2310.05914},
    year={2023}
}