mirror of https://github.com/hpcaitech/ColossalAI.git synced 2025-05-04 22:48:15 +00:00
Commit Graph

3811 Commits

Author SHA1 Message Date
ver217
184a653704 [checkpointio] fix pinned state dict 2024-11-19 14:51:39 +08:00
ver217
5fa657f0a1 [checkpointio] fix size compute 2024-11-19 14:51:39 +08:00
flybird11111
eb69e640e5 [async io] support async io ()
* support async optimizer save/load

* fix

* fix

* support pin mem

* Update low_level_zero_plugin.py

* fix

* fix

* fix

* fix

* fix
2024-11-19 14:51:39 +08:00
Hongxin Liu
b90835bd32 [checkpointio] fix performance issue () 2024-11-19 14:51:39 +08:00
Wang Binluo
8e08c27e19 [ckpt] Add async ckpt api ()
* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix
2024-11-19 14:51:39 +08:00
Hongxin Liu
d4a436051d [checkpointio] support async model save ()
* [checkpointio] support async model save

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-11-19 14:51:39 +08:00
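The [checkpointio] entries above all revolve around asynchronous checkpoint saving: the state dict is first snapshotted into host-side (pinned) buffers, then written to disk on a background thread so the training loop is not blocked by I/O. A minimal pure-Python sketch of that pattern — names and the pickle backend are illustrative, not ColossalAI's actual API:

```python
import copy
import pickle
import threading

def async_save(state_dict, path):
    """Snapshot the state dict, then persist it off the critical path."""
    # 1. Take a snapshot first (in the real implementation this is a
    #    device-to-host copy into pinned memory); the live dict may be
    #    mutated by training immediately afterwards.
    snapshot = copy.deepcopy(state_dict)

    # 2. Write the snapshot on a background thread.
    def _write():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    writer = threading.Thread(target=_write)
    writer.start()
    return writer  # caller joins before exiting or reusing the path

# usage
state = {"step": 10, "weights": [0.1, 0.2]}
writer = async_save(state, "checkpoint.pkl")
state["step"] = 11   # training continues immediately
writer.join()        # barrier before the next save to the same path
```

The snapshot-before-mutation step is what the "fix pinned state dict" and "fix size compute" commits harden: the background writer must see a stable copy, not the live training state.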
Hongxin Liu
5a03d2696d
[cli] support run as module option () 2024-11-14 18:10:37 +08:00
Hanks
cc40fe0e6f
[fix] multi-node backward slowdown ()
* remove redundant memcpy during backward

* get back record_stream
2024-11-14 17:45:49 +08:00
duanjunwen
c2fe3137e2
[hotfix] fix flash attn window_size err ()
* [fix] fix flash attn

* [hotfix] fix flash-atten version

* [fix] fix flash_atten version

* [fix] fix flash-atten versions

* [fix] fix flash-attn not enough values to unpack error

* [fix] fix test_ring_attn

* [fix] fix test ring attn
2024-11-14 17:11:35 +08:00
Hongxin Liu
a2596519fd
[zero] support extra dp ()
* [zero] support extra dp

* [zero] update checkpoint

* fix bugs

* fix bugs
2024-11-12 11:20:46 +08:00
Tong Li
30a9443132
[Coati] Refine prompt for better inference ()
* refine prompt

* update prompt

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-11-08 11:00:37 +08:00
Tong Li
7a60161035
update readme () 2024-11-06 17:24:08 +08:00
Hongxin Liu
a15ab139ad
[plugin] support get_grad_norm () 2024-11-05 18:12:47 +08:00
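A plugin-level `get_grad_norm()` typically returns the global p-norm over all local gradients (all-reduced across ranks in the distributed case). A minimal single-process sketch, with gradients modeled as flat float lists — illustrative only, not the actual ColossalAI signature:

```python
import math

def get_grad_norm(grads, norm_type=2.0):
    """Global gradient norm over all parameter gradients.

    The p-norm of the concatenation of every gradient tensor;
    norm_type=math.inf gives the max absolute entry.
    """
    if norm_type == math.inf:
        return max(max(abs(x) for x in g) for g in grads)
    total = sum(sum(abs(x) ** norm_type for x in g) for g in grads)
    return total ** (1.0 / norm_type)

print(get_grad_norm([[3.0], [4.0]]))  # → 5.0
```

This is the same quantity `torch.nn.utils.clip_grad_norm_` computes and returns; exposing it from the plugin lets callers log or clip without walking sharded parameters themselves.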
Hongxin Liu
13ffa08cfa
[release] update version () 2024-11-04 17:26:28 +08:00
pre-commit-ci[bot]
2f583c1549
[pre-commit.ci] pre-commit autoupdate ()
updates:
- [github.com/psf/black-pre-commit-mirror: 24.8.0 → 24.10.0](https://github.com/psf/black-pre-commit-mirror/compare/24.8.0...24.10.0)
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.2](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.2)
- [github.com/pre-commit/pre-commit-hooks: v4.6.0 → v5.0.0](https://github.com/pre-commit/pre-commit-hooks/compare/v4.6.0...v5.0.0)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-10-31 18:18:01 +08:00
Hongxin Liu
c2e8f61592
[checkpointio] fix hybrid plugin model save () 2024-10-31 17:04:53 +08:00
Tong Li
89a9a600bc
[MCTS] Add self-refined MCTS ()
* add reasoner

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update code

* delete llama

* update prompts

* update readme

* update readme

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-10-24 17:51:19 +08:00
binmakeswell
4294ae83bb
[doc] sora solution news ()
* [doc] sora solution news

* [doc] sora solution news
2024-10-24 13:24:37 +08:00
Hongxin Liu
80a8ca916a
[extension] hotfix compile check () 2024-10-24 11:11:44 +08:00
Hanks
dee63cc5ef
Merge pull request from BurkeHulk/hotfix/lora_ckpt
[hotfix] fix lora ckpt saving format
2024-10-21 14:13:04 +08:00
BurkeHulk
6d6cafabe2 pre-commit fix 2024-10-21 14:04:32 +08:00
BurkeHulk
b10339df7c fix lora ckpt save format (ColoTensor to Tensor) 2024-10-21 13:55:43 +08:00
Hongxin Liu
19baab5fd5
[release] update version () 2024-10-21 10:19:08 +08:00
Hongxin Liu
58d8b8a2dd
[misc] fit torch api upgrade and remove legacy import ()
* [amp] fit torch's new api

* [amp] fix api call

* [amp] fix api call

* [misc] fit torch pytree api upgrade

* [misc] remove legacy import

* [misc] fit torch amp api

* [misc] fit torch amp api
2024-10-18 16:48:52 +08:00
Hongxin Liu
5ddad486ca
[fp8] add fallback and make compile option configurable () 2024-10-18 13:55:31 +08:00
botbw
3b1d7d1ae8 [chore] refactor 2024-10-17 11:04:47 +08:00
botbw
2bcd0b6844 [ckpt] add safetensors util 2024-10-17 11:04:47 +08:00
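In practice a safetensors utility would wrap the `safetensors` library (`safetensors.torch.save_file` / `load_file`). Purely to illustrate the on-disk layout the format defines — an 8-byte little-endian header size, a JSON header mapping tensor names to dtype/shape/offsets, then the raw tensor bytes — here is a toy float32 writer and reader (a sketch, not the util the commit adds):

```python
import json
import struct

def save_safetensors(path, tensors):
    """Write {name: [floats]} in the safetensors layout:
    [8-byte LE header size][JSON header][raw tensor bytes]."""
    header, blob, offset = {}, b"", 0
    for name, values in tensors.items():
        data = struct.pack(f"<{len(values)}f", *values)  # float32, LE
        header[name] = {
            "dtype": "F32",
            "shape": [len(values)],
            "data_offsets": [offset, offset + len(data)],
        }
        blob += data
        offset += len(data)
    hjson = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(hjson)) + hjson + blob)

def load_safetensors(path):
    """Read the layout back into {name: [floats]}."""
    with open(path, "rb") as f:
        (hsize,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(hsize))
        blob = f.read()
    out = {}
    for name, meta in header.items():
        lo, hi = meta["data_offsets"]
        out[name] = list(struct.unpack(f"<{(hi - lo) // 4}f", blob[lo:hi]))
    return out
```

Because the header carries explicit offsets, a reader can memory-map the file and load tensors lazily and safely — the property that makes the format attractive for checkpointing.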
Hongxin Liu
cd61353bae
[pipeline] hotfix backward for multiple outputs ()
* [pipeline] hotfix backward for multiple outputs

* [pipeline] hotfix backward for multiple outputs
2024-10-16 17:27:33 +08:00
Wenxuan Tan
62c13e7969
[Ring Attention] Improve comments ()
* improve comments

* improve comments

---------

Co-authored-by: Edenzzzz <wtan45@wisc.edu>
2024-10-16 11:23:35 +08:00
Wang Binluo
dcd41d0973
Merge pull request from wangbluo/ring_attention
[Ring Attention] fix the 2d ring attn when using multiple machines
2024-10-15 15:17:21 +08:00
wangbluo
83cf2f84fb fix 2024-10-15 14:50:27 +08:00
wangbluo
bc7eeade33 fix 2024-10-15 13:28:33 +08:00
wangbluo
fd92789af2 fix 2024-10-15 13:26:44 +08:00
wangbluo
6be9862aaf fix 2024-10-15 11:56:49 +08:00
wangbluo
3dc08c8a5a fix 2024-10-15 11:01:34 +08:00
wangbluo
8ff7d0c780 fix 2024-10-14 18:16:03 +08:00
wangbluo
fe9208feac fix 2024-10-14 18:07:56 +08:00
wangbluo
3201377e94 fix 2024-10-14 18:06:24 +08:00
wangbluo
23199e34cc fix 2024-10-14 18:01:53 +08:00
wangbluo
d891e50617 fix 2024-10-14 14:56:05 +08:00
wangbluo
e1e86f9f1f fix 2024-10-14 11:45:35 +08:00
Tong Li
4c8e85ee0d
[Coati] Train DPO using PP ()
* update dpo

* remove unsupport plugin

* update msg

* update dpo

* remove unsupport plugin

* update msg

* update template

* update dataset

* add pp for dpo

* update dpo

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add dpo fn

* update dpo

* update dpo

* update dpo

* update dpo

* minor update

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update loss

* update help

* polish code

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-10-11 19:32:00 +08:00
wangbluo
703bb5c18d fix the test 2024-10-11 17:34:20 +08:00
wangbluo
4e0e99bb6a fix the test 2024-10-11 17:31:40 +08:00
wangbluo
1507a7528f fix 2024-10-11 06:20:34 +00:00
wangbluo
0002ae5956 fix 2024-10-11 14:16:21 +08:00
Hongxin Liu
dc2cdaf3e8
[shardformer] optimize seq parallelism ()
* [shardformer] optimize seq parallelism

* [shardformer] fix gpt2 fused linear col

* [plugin] update gemini plugin

* [plugin] update moe hybrid plugin

* [test] update gpt2 fused linear test

* [shardformer] fix gpt2 fused linear reduce
2024-10-11 13:44:40 +08:00
wangbluo
efe3042bb2 fix 2024-10-10 18:38:47 +08:00
梁爽
6b2c506fc5
Update README.md ()
add HPC-AI.COM activity
2024-10-10 17:02:49 +08:00
wangbluo
5ecc27e150 fix 2024-10-10 15:35:52 +08:00