Commit Graph

70 Commits

Author SHA1 Message Date
HELSON
e5ea3fdeef [gemini] add GeminiMemoryManager (#832)
* refactor StatefulTensor, tensor utilities

* add unit test for GeminiMemoryManager
2022-04-24 13:08:48 +08:00
Ziyue Jiang
4b01da24cd [TP] change the check assert in split batch 2d (#772) 2022-04-16 21:29:57 +08:00
アマデウス
b8899e0905 [TP] allow layernorm without bias (#750) 2022-04-14 11:43:56 +08:00
Frank Lee
eda30a058e [compatibility] fixed tensor parallel compatibility with torch 1.9 (#700) 2022-04-11 13:44:50 +08:00
HELSON
a9b8300d54 [zero] improve adaptability for not-shard parameters (#708)
* adapt post grad hooks for not-shard parameters
* adapt optimizer for not-shard parameters
* offload gradients for not-replicated parameters
2022-04-11 13:38:51 +08:00
アマデウス
3fc8a204dc Corrected 3d vocab parallel embedding (#707) 2022-04-11 10:17:55 +08:00
HELSON
b31daed4cf fix bugs in CPU adam (#633)
* add cpu adam counter for all cpu adam

* fixed updating error in adam kernel
2022-04-02 17:04:05 +08:00
Liang Bowen
828e465622 [hotfix] Raise messages for indivisible batch sizes with tensor parallelism (#622) 2022-04-02 16:12:04 +08:00
アマデウス
77ad24bf94 [model checkpoint] updated saving/loading for 3d layers (#597) 2022-04-01 16:52:47 +08:00
アマデウス
93089ed708 [model checkpoint] updated saving/loading for 2.5d layers (#596) 2022-04-01 16:52:33 +08:00
アマデウス
c50bfb807b [model checkpoint] updated saving/loading for 1d layers (#594) 2022-04-01 16:51:52 +08:00
アマデウス
7636d518e1 [model checkpoint] updated saving/loading for 2d layers (#595) 2022-04-01 16:50:34 +08:00
アマデウス
cd13b63832 [model checkpoint] reworked unified layers for ease of save/load states (#593) 2022-04-01 16:49:56 +08:00
Ziyue Jiang
1c40ee8749 [TP] add assert for tp1d (#621) 2022-04-01 16:44:23 +08:00
ver217
e619a651fb polish optimizer docstring (#619) 2022-04-01 16:27:03 +08:00
ver217
8432dc7080 polish moe docstring (#618) 2022-04-01 16:15:36 +08:00
ver217
104cbbb313 [hotfix] add hybrid adam to __init__ (#584) 2022-03-31 19:08:34 +08:00
HELSON
e6d50ec107 [zero] adapt zero for unsharded parameters (#561)
* support existing sharded and unsharded parameters in zero

* add unit test for moe-zero model init

* polish moe gradient handler
2022-03-31 18:34:11 +08:00
Wesley
46c9ba33da update code format 2022-03-31 17:15:08 +08:00
Wesley
666cfd094a fix parallel_input flag for Linear1D_Col gather_output 2022-03-31 17:15:08 +08:00
Liang Bowen
2c45efc398 html refactor (#555) 2022-03-31 11:36:56 +08:00
LuGY
c44d797072 [docs] updated docs of hybrid adam and cpu adam (#552) 2022-03-30 18:14:59 +08:00
Ziyue Jiang
763dc325f1 [TP] Add gather_out arg to Linear (#541) 2022-03-30 09:35:46 +08:00
HELSON
8c90d4df54 [zero] add zero context manager to change config during initialization (#546) 2022-03-29 17:57:59 +08:00
Liang Bowen
ec5086c49c Refactored docstring to google style 2022-03-29 17:17:47 +08:00
LuGY
105c5301c3 [zero] added hybrid adam, removed loss scale in adam (#527)
* [zero] added hybrid adam, removed loss scale of adam

* remove useless code
2022-03-25 18:03:54 +08:00
LuGY
6a3f9fda83 [cuda] modify the fused adam, support hybrid of fp16 and fp32 (#497) 2022-03-25 14:15:53 +08:00
Jiarui Fang
a445e118cf [polish] polish singleton and global context (#500) 2022-03-23 18:03:39 +08:00
ver217
9ec1ce6ab1 [zero] sharded model support the reuse of fp16 shard (#495)
* sharded model supports reuse fp16 shard

* rename variable

* polish code

* polish code

* polish code
2022-03-23 14:59:59 +08:00
HELSON
c9023d4078 [MOE] support PR-MOE (#488) 2022-03-22 16:48:22 +08:00
ver217
62b0a8d644 [zero] sharded optim support hybrid cpu adam (#486)
* sharded optim support hybrid cpu adam

* update unit test

* polish docstring
2022-03-22 14:56:59 +08:00
HELSON
d7ea63992b [MOE] add FP32LinearGate for MOE in NaiveAMP context (#480) 2022-03-22 10:50:20 +08:00
Jiarui Fang
65c0f380c2 [format] polish name format for MOE (#481) 2022-03-21 23:19:47 +08:00
HELSON
7544347145 [MOE] add unit test for MOE experts layout, gradient handler and kernel (#469) 2022-03-21 13:35:04 +08:00
HELSON
aff9d354f7 [MOE] polish moe_env (#467) 2022-03-19 15:36:25 +08:00
HELSON
bccbc15861 [MOE] changed parallelmode to dist process group (#460) 2022-03-19 13:46:29 +08:00
Jiarui Fang
0fcfb1e00d [test] make zero engine test really work (#447) 2022-03-17 17:24:25 +08:00
Jiarui Fang
237d08e7ee [zero] hybrid cpu adam (#445) 2022-03-17 15:05:41 +08:00
HELSON
dbdc9a7783 added Multiply Jitter and capacity factor eval for MOE (#434) 2022-03-16 16:47:44 +08:00
HELSON
3f70a2b12f removed noisy function during evaluation of MoE router (#419) 2022-03-15 12:06:09 +08:00
Jiang Zhuo
5a4a3b77d9 fix format (#376) 2022-03-11 15:50:28 +08:00
LuGY
de46450461 Added activation offload (#331)
* Added activation offload

* Fixed the import bug, used the pytest
2022-03-11 15:50:28 +08:00
Kai Wang (Victor Kai)
53bb3bcc0a fix format (#362) 2022-03-11 15:50:28 +08:00
Yuer867
4a0f8c2c50 fix format parallel_2p5d (#357) 2022-03-11 15:50:28 +08:00
Liang Bowen
7eb87f516d flake8 style (#352) 2022-03-11 15:50:28 +08:00
xuqifan897
148207048e Qifan formatted file ColossalAI\colossalai\nn\layer\parallel_1d\layers.py (#342) 2022-03-11 15:50:28 +08:00
DouJS
cbb6436ff0 fix format for dir-[parallel_3d] (#333) 2022-03-11 15:50:28 +08:00
LuGY
a3269de5c9 [zero] cpu adam kernel (#288)
* Added CPU Adam

* finished the cpu adam

* updated the license

* delete useless parameters, removed resnet

* modified the method of cpu adam unittest

* deleted some useless codes

* removed useless codes

Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
2022-03-11 15:50:28 +08:00
1SAA
82023779bb Added TPExpert for special situation 2022-03-11 15:50:28 +08:00
HELSON
36b8477228 Fixed parameter initialization in FFNExpert (#251) 2022-03-11 15:50:28 +08:00