Commit Graph

250 Commits

Author SHA1 Message Date
HELSON
df4f020ee3 [zero1&2] only append parameters with gradients (#2681) 2023-02-13 18:00:16 +08:00
HELSON
b528eea0f0 [zero] add zero wrappers (#2523)
* [zero] add zero wrappers

* change names

* add wrapper functions to init
2023-01-29 17:52:58 +08:00
HELSON
077a5cdde4 [zero] fix gradient clipping in hybrid parallelism (#2521)
* [zero] fix gradient clipping in hybrid parallelism

* [testing] change model name to avoid pytest warning

* [hotfix] fix unit testing
2023-01-29 15:09:57 +08:00
HELSON
d565a24849 [zero] add unit testings for hybrid parallelism (#2486) 2023-01-18 10:36:10 +08:00
HELSON
a5dc4253c6 [zero] polish low level optimizer (#2473) 2023-01-13 14:56:17 +08:00
Jiarui Fang
867c8c2d3a [zero] low level optim supports ProcessGroup (#2464) 2023-01-13 10:05:58 +08:00
HELSON
7829aa094e [ddp] add is_ddp_ignored (#2434)
[ddp] rename to is_ddp_ignored
2023-01-11 12:22:45 +08:00
HELSON
62c38e3330 [zero] polish low level zero optimizer (#2275) 2023-01-03 17:22:34 +08:00
HELSON
a7d95b7024 [example] add zero1, zero2 example in GPT examples (#2146)
* [example] add zero1 and zero2 for GPT

* update readme in gpt example

* polish code

* change init value

* update readme
2022-12-20 14:30:27 +08:00
Jiarui Fang
c89c66a858 [Gemini] update API of the chunkmemstatscollector. (#2129) 2022-12-14 00:47:06 +08:00
Jiarui Fang
2938edf446 [Gemini] update the non model data record method in runtime memory tracer (#2128) 2022-12-13 17:11:31 +08:00
Jiarui Fang
e99edfcb51 [NFC] polish comments for Chunk class (#2116) 2022-12-12 15:39:31 +08:00
Jiarui Fang
33f4412102 [Gemini] use MemStats to store the tracing data. Seperate it from Collector. (#2084) 2022-12-06 16:43:06 +08:00
Jiarui Fang
b3b89865e2 [Gemini] ParamOpHook -> ColoParamOpHook (#2080) 2022-12-05 17:11:06 +08:00
HELSON
a1ce02d740 [zero] test gradient accumulation (#1964)
* [zero] fix memory leak for zero2

* [zero] test gradient accumulation

* [zero] remove grad clip test
2022-11-29 13:00:30 +08:00
Jiarui Fang
cc0ed7cf33 [Gemini] ZeROHookV2 -> GeminiZeROHook (#1972) 2022-11-17 14:43:49 +08:00
Jiarui Fang
c4739a725a [Gemini] polish memstats collector (#1962) 2022-11-16 15:45:57 +08:00
Jiarui Fang
f7e276fa71 [Gemini] add GeminiAdamOptimizer (#1960) 2022-11-16 14:44:28 +08:00
HELSON
7066dfbf82 [zero] fix memory leak for zero2 (#1955) 2022-11-16 11:43:24 +08:00
HELSON
6e51d296f0 [zero] migrate zero1&2 (#1878)
* add zero1&2 optimizer

* rename test ditectory

* rename test files

* change tolerance in test
2022-11-11 09:26:40 +08:00
Zihao
20e255d4e8 MemStatsCollectorStatic (#1765) 2022-11-07 16:49:03 +08:00
HELSON
c6a1a62636 [hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786)
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12

* [zero] add cpu shard init

* [zero] add tiny example test

* [colo_tensor] fix bugs for torch-1.11
2022-11-02 16:11:34 +08:00
CsRic
ea961d8fd1 [NFC] polish colossalai/zero/sharded_param/__init__.py code style (#1717)
Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>
2022-10-19 12:20:51 +08:00
HELSON
1468e4bcfc [zero] add constant placement policy (#1705)
* fixes memory leak when paramter is in fp16 in ZeroDDP init.
* bans chunk releasement in CUDA. Only when a chunk is about to offload, it is allowed to release.
* adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.
2022-10-14 17:53:16 +08:00
HELSON
b28991dd0a [feature] A new ZeRO implementation (#1644) 2022-10-09 09:18:51 +08:00
Jiarui Fang
c5d39215f6 Revert "[feature] new zero implementation (#1623)" (#1643)
This reverts commit 5be118f405.
2022-09-26 10:06:03 +08:00
HELSON
5be118f405 [feature] new zero implementation (#1623) 2022-09-24 19:58:18 +08:00
HELSON
f7f2248771 [moe] fix MoE bugs (#1628)
* remove forced FP32 modules

* correct no_shard-contexts' positions
2022-09-22 13:56:30 +08:00
ver217
c9e8ce67b8 fix move fp32 shards (#1604) 2022-09-16 17:33:16 +08:00
Fazzie-Maqianli
06dccdde44 [NFC] polish colossalai/zero/sharded_model/reduce_scatter.py code style (#1554) 2022-09-08 22:11:04 +08:00
ver217
821c6172e2 [utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442) 2022-08-11 22:58:58 +08:00
ver217
6df3e19be9 [hotfix] zero optim prevents calling inner optim.zero_grad (#1422) 2022-08-09 16:08:12 +08:00
ver217
8dced41ad0 [zero] zero optim state_dict takes only_rank_0 (#1384)
* zero optim state_dict takes only_rank_0

* fix unit test
2022-07-29 13:22:50 +08:00
ver217
828b9e5e0d [hotfix] fix zero optim save/load state dict (#1381) 2022-07-28 17:19:39 +08:00
ver217
6b43c789fd fix zero optim backward_by_grad and save/load (#1353) 2022-07-21 16:43:58 +08:00
ver217
d068af81a3 [doc] update rst and docstring (#1351)
* update rst

* add zero docstr

* fix docstr

* remove fx.tracer.meta_patch

* fix docstr

* fix docstr

* update fx rst

* fix fx docstr

* remove useless rst
2022-07-21 15:54:53 +08:00
ver217
ce470ba37e [checkpoint] sharded optim save/load grad scaler (#1350) 2022-07-21 15:21:21 +08:00
ver217
7a05367101 [hotfix] shared model returns cpu state_dict (#1328) 2022-07-15 22:11:37 +08:00
Jiarui Fang
4165eabb1e [hotfix] remove potiential circle import (#1307)
* make it faster

* [hotfix] remove circle import
2022-07-14 13:44:26 +08:00
ver217
a45ddf2d5f [hotfix] fix sharded optim step and clip_grad_norm (#1226) 2022-07-08 13:34:48 +08:00
Jiarui Fang
a444633d13 warmup ratio configration (#1192) 2022-06-30 15:23:50 +08:00
Jiarui Fang
372f791444 [refactor] move chunk and chunkmgr to directory gemini (#1182) 2022-06-29 13:31:02 +08:00
ver217
9e1daa63d2 [zero] sharded optim supports loading local state dict (#1170)
* sharded optim supports loading local state dict

* polish code

* add unit test
2022-06-24 18:05:16 +08:00
ver217
561e90493f [zero] zero optim supports loading local state dict (#1171)
* zero optim supports loading local state dict

* polish code

* add unit test
2022-06-24 17:25:57 +08:00
ver217
8106d7b8c7 [ddp] refactor ColoDDP and ZeroDDP (#1146)
* ColoDDP supports overwriting default process group

* rename ColoDDPV2 to ZeroDDP

* add docstr for ZeroDDP

* polish docstr
2022-06-21 16:35:23 +08:00
ver217
6690a61b4d [hotfix] prevent nested ZeRO (#1140) 2022-06-21 11:33:53 +08:00
Frank Lee
15aab1476e [zero] avoid zero hook spam by changing log to debug level (#1137) 2022-06-21 10:44:01 +08:00
ver217
a1a7899cae [hotfix] fix zero init ctx numel (#1128) 2022-06-16 17:17:27 +08:00
ver217
f0a954f16d [ddp] add set_params_to_ignore for ColoDDP (#1122)
* add set_params_to_ignore for ColoDDP

* polish code

* fix zero hook v2

* add unit test

* polish docstr
2022-06-16 12:54:46 +08:00
Frank Lee
14e5b11d7f [zero] fixed api consistency (#1098) 2022-06-10 16:59:59 +08:00