Commit Graph

1213 Commits

Author | SHA1 | Message | Date
Jiarui Fang
ea0a2ed25f [hotfix] fix the numel() bug in ColoTensor (#845) 2022-04-24 12:32:10 +08:00
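
For context, a hotfix like the one above typically guards numel() in a wrapper that keeps global shape metadata apart from a payload that may be sharded or not yet allocated. A minimal sketch under that assumption; `MiniColoTensor` and its fields are hypothetical, not the real ColoTensor:

```python
import math
import torch

class MiniColoTensor:
    """Hypothetical wrapper: global shape metadata lives apart from the payload."""

    def __init__(self, shape, payload=None):
        self._shape = torch.Size(shape)
        self._payload = payload  # may be None (lazy) or only a local shard

    def numel(self):
        # A buggy variant would return self._payload.numel(), which
        # under-counts when the payload is a shard or not yet allocated.
        return math.prod(self._shape)

t = MiniColoTensor((4, 8))   # nothing allocated yet
assert t.numel() == 32       # still answers from the global shape
```
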
Jiarui Fang
8789850eea Init Context supports lazy allocation of model memory (#842) 2022-04-22 18:03:35 +08:00
Frank Lee
943982d29a [unittest] refactored unit tests for change in dependency (#838) 2022-04-22 15:39:07 +08:00
Frank Lee
01e9f834f5 [dependency] removed torchvision (#833)
* [dependency] removed torchvision

* fixed transforms
2022-04-22 15:24:35 +08:00
Jiarui Fang
cb5a4778e1 Revert "[WIP] Applying ColoTensor on TP-1D-row Linear. (#831)" (#835)
This reverts commit ac88de6dfc.
2022-04-22 14:45:57 +08:00
Jiarui Fang
ac88de6dfc [WIP] Applying ColoTensor on TP-1D-row Linear. (#831)
* revert zero tensors back

* [tensor] init row 1d linear
2022-04-22 14:03:26 +08:00
Jiarui Fang
294a6060d0 [tensor] ZeRO use ColoTensor as the base class. (#828)
* [refactor] moving InsertPostInitMethodToModuleSubClasses to utils.

* [tensor] ZeRO use ColoTensor as the base class.

* polish
2022-04-22 12:00:48 +08:00
Ziyue Jiang
8e6fdb4f29 [tensor]fix test_linear (#826) 2022-04-21 17:18:56 +08:00
Ziyue Jiang
1a9e2c2dff [tensor] fix kwargs in colo_tensor torch_function (#825) 2022-04-21 16:47:35 +08:00
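
For context on the fix above: the `__torch_function__` protocol passes `kwargs=None` (not an empty dict) when a call carries no keyword arguments, so forwarding `**kwargs` unguarded raises a `TypeError`. The usual one-line guard, in a minimal sketch (`WrappedTensor` is illustrative only, not the actual ColoTensor):

```python
import torch

class WrappedTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}  # the fix: kwargs arrives as None, not {}
        return super().__torch_function__(func, types, args, kwargs)

x = torch.randn(2, 2).as_subclass(WrappedTensor)
y = torch.add(x, x)            # kwargs is None inside the hook
z = torch.add(x, x, alpha=2)   # kwargs is {'alpha': 2}
```
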
Jiarui Fang
2ecc3d7a55 [tensor] lazy init (#823) 2022-04-21 15:40:23 +08:00
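
The idea behind lazy init, in a minimal hypothetical sketch (`LazyTensor` is not the actual implementation): record the construction recipe at model build time and allocate only when the data is first touched:

```python
import torch

class LazyTensor:
    """Hypothetical sketch: record the recipe now, allocate on first use."""

    def __init__(self, factory, *args, **kwargs):
        self._factory, self._args, self._kwargs = factory, args, kwargs
        self._data = None

    def materialize(self):
        if self._data is None:  # allocate exactly once, on demand
            self._data = self._factory(*self._args, **self._kwargs)
        return self._data

w = LazyTensor(torch.empty, 1024, 1024)  # no memory touched yet
payload = w.materialize()                # 1024x1024 buffer allocated here
```
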
Jiarui Fang
660d2d1f1b [Tensor] apply ColoTensor on Torch functions (#821)
* Revert "[zero] add ZeroTensorShardStrategy (#793)"

This reverts commit 88759e289e.

* [gemini] set cpu memory capacity

* [log] local throughput collecting

* polish (×4)

* polish code (×2)

* add a new tensor structure and override linear for it

* polish (×11)

* [tensor] rename and reorganize the directory structure.

* rm useless dir

* polish (×2)

* [tensor] handle functions that are not wrapped
2022-04-21 14:21:10 +08:00
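
The mechanism behind applying ColoTensor on Torch functions, sketched under assumptions: a registry maps torch functions to custom implementations, and anything not wrapped falls through to the stock behaviour (the "handle functions that are not wrapped" bullet above). All names here are illustrative, not the repository's API:

```python
import torch
import torch.nn.functional as F

_OVERRIDES = {}  # hypothetical registry: torch function -> custom impl

def register(torch_fn):
    def deco(fn):
        _OVERRIDES[torch_fn] = fn
        return fn
    return deco

def _unwrap(t):
    # Drop back to a plain torch.Tensor so the stock kernel below does
    # not re-enter __torch_function__ and recurse.
    return t.as_subclass(torch.Tensor) if isinstance(t, torch.Tensor) else t

@register(F.linear)
def colo_linear(inp, weight, bias=None):
    # A real override would dispatch on the weight's shard layout
    # (e.g. TP-1D-row); this sketch just runs the dense kernel.
    return F.linear(_unwrap(inp), _unwrap(weight), _unwrap(bias))

class ColoLikeTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func in _OVERRIDES:
            return _OVERRIDES[func](*args, **kwargs)
        # Functions nobody wrapped fall through to the default behaviour.
        return super().__torch_function__(func, types, args, kwargs)

x = torch.randn(2, 4).as_subclass(ColoLikeTensor)
w = torch.randn(3, 4).as_subclass(ColoLikeTensor)
out = F.linear(x, w)   # routed through colo_linear via the registry
```
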
Jiarui Fang
0ce8924ceb [tensor] reorganize files (#820) 2022-04-21 14:15:48 +08:00
Jiarui Fang
ab962b9735 [gemini] a new tensor structure (#818)
* Revert "[zero] add ZeroTensorShardStrategy (#793)"

This reverts commit 88759e289e.

* [gemini] set cpu memory capacity

* [log] local throughput collecting

* polish (×4)

* polish code (×2)

* add a new tensor structure and override linear for it

* polish (×11)
2022-04-21 11:42:37 +08:00
Jiarui Fang
e761ad2cd7 Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806) 2022-04-19 14:40:02 +08:00
HELSON
88759e289e [zero] add ZeroTensorShardStrategy (#793) 2022-04-19 14:32:45 +08:00
Jiarui Fang
681addb512 [refactor] moving grad acc logic to engine (#804) 2022-04-19 14:03:21 +08:00
Jiarui Fang
4d9332b4c5 [refactor] moving memtracer to gemini (#801) 2022-04-19 10:13:08 +08:00
HELSON
4c4388c46e [hotfix] fix memory leak in zero (#781) 2022-04-18 13:57:03 +08:00
Frank Lee
5a1a095b92 [test] refactored with the new rerun decorator (#763)
* [test] refactored with the new rerun decorator

* polish test case
2022-04-15 00:33:04 +08:00
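
A rerun decorator retries a flaky distributed test (say, a rendezvous port that is occasionally busy) instead of failing the suite outright. A minimal hypothetical sketch, not ColossalAI's actual decorator:

```python
import functools
import time

def rerun_on_exception(exc_type=Exception, max_try=3, delay=1.0):
    """Hypothetical sketch of a rerun decorator for flaky tests."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_try + 1):
                try:
                    return fn(*args, **kwargs)
                except exc_type:
                    if attempt == max_try:
                        raise          # out of retries: surface the failure
                    time.sleep(delay)  # back off, then rerun the test
        return wrapper
    return decorator

@rerun_on_exception(exc_type=ConnectionError, max_try=3)
def test_init_process_group():
    ...  # flaky body, e.g. a rendezvous port that is sometimes busy
```
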
Jiarui Fang
10ef8afdd2 [gemini] init gemini individual directory (#754) 2022-04-14 16:40:26 +08:00
ver217
dcca614eee [hotfix] fix test_stateful_tensor_mgr (#762) 2022-04-14 15:50:09 +08:00
ver217
a93a7d7364 [hotfix] fix reuse_fp16_shard of sharded model (#756)
* fix reuse_fp16_shard

* disable test stm

* polish code
2022-04-14 14:56:46 +08:00
HELSON
84c6700b2a [zero] refactor memstats_collector (#746) 2022-04-14 12:01:12 +08:00
ver217
e396bb71f2 [zero] add tensor placement policies (#743)
* add tensor placement policies

* polish comments

* polish comments

* update moe unit tests
2022-04-13 15:00:48 +08:00
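
Tensor placement policies decide where managed tensors live: pinned to CPU, pinned to CUDA, or chosen automatically from memory headroom. A hypothetical sketch of the shape such policies take (class names and the budget heuristic are assumptions):

```python
import torch

class PlacementPolicy:
    """Hypothetical interface: decide where managed tensors should live."""
    def target_device(self) -> torch.device:
        raise NotImplementedError

class CPUPolicy(PlacementPolicy):
    def target_device(self):
        return torch.device('cpu')   # always keep tensors on the host

class CUDAPolicy(PlacementPolicy):
    def target_device(self):
        return torch.device('cuda')  # always keep tensors on the GPU

class AutoPolicy(PlacementPolicy):
    """Stay on GPU while under budget, otherwise spill to the host."""
    def __init__(self, cuda_budget_bytes):
        self.budget = cuda_budget_bytes

    def target_device(self):
        used = torch.cuda.memory_allocated()  # bytes held by this process
        return torch.device('cuda' if used < self.budget else 'cpu')
```
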
HELSON
22c4b88d56 [zero] refactor ShardedParamV2 for convenience (#742) 2022-04-13 14:54:26 +08:00
Frank Lee
f4f42d4c3c [bug] fixed DDP compatibility with torch 1.8 (#739) 2022-04-13 00:08:46 +08:00
Jiarui Fang
53cb584808 [utils] correct cpu memory used and capacity in the context of multi-process (#726) 2022-04-12 14:57:54 +08:00
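
The multi-process pitfall behind the fix above: `psutil.virtual_memory().used` counts every process on the node, and reporting the whole node's capacity to each worker overstates its share. A hedged sketch of a per-process view (the helper and the divide-by-local-world-size heuristic are assumptions, not the commit's code):

```python
import os
import psutil

def my_cpu_memory_info():
    # LOCAL_WORLD_SIZE is the torchrun convention for processes per node;
    # dividing capacity by it is this sketch's assumption.
    local_world_size = int(os.environ.get('LOCAL_WORLD_SIZE', '1'))
    used = psutil.Process(os.getpid()).memory_info().rss  # this process only
    capacity = psutil.virtual_memory().total // local_world_size
    return used, capacity  # bytes I use, and my fair share of the node
```
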
HELSON
b9b469ea50 [moe] add checkpoint for moe zero test (#729) 2022-04-12 12:11:54 +08:00
FrankLeeeee
e88a498c9c [test] removed trivial outdated test 2022-04-12 11:08:15 +08:00
FrankLeeeee
62b4ce7326 [test] added missing decorators to model checkpointing tests 2022-04-12 11:08:15 +08:00
Jiarui Fang
4d90a7b513 [refactor] zero directory (#724) 2022-04-11 23:13:02 +08:00
Frank Lee
20ab1f5520 [bug] fixed broken test_found_inf (#725) 2022-04-11 22:00:27 +08:00
Jiarui Fang
193dc8dacb [refactor] refactor the memory utils (#715) 2022-04-11 16:47:57 +08:00
HELSON
dbd96fe90a [zero] check whether gradients have inf and nan in gpu (#712) 2022-04-11 15:40:13 +08:00
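
Checking overflow directly on the GPU avoids one device-to-host sync per gradient; `torch.isfinite` is false for both inf and nan, so a single flag can be reduced on-device. A minimal sketch (the helper name is illustrative):

```python
import torch

def grads_have_inf_or_nan(grads):
    """Run the overflow check on-device; sync to the host only once."""
    flag = torch.zeros((), dtype=torch.bool, device=grads[0].device)
    for g in grads:
        # isfinite is False for both inf and nan entries
        flag |= ~torch.isfinite(g).all()
    return bool(flag.item())  # single device-to-host transfer at the end
```
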
HELSON
a9b8300d54 [zero] improve adaptability for non-sharded parameters (#708)
* adapt post-grad hooks for non-sharded parameters
* adapt optimizer for non-sharded parameters
* offload gradients for non-replicated parameters
2022-04-11 13:38:51 +08:00
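
The gradient-offload idea from the last bullet above, sketched with today's PyTorch: once a gradient finishes accumulating, park it in host memory. `register_post_accumulate_grad_hook` requires PyTorch >= 2.1; the commit above predates that API and used its own hooks, so this shows the idea only:

```python
import torch

def attach_grad_offload_hook(param: torch.nn.Parameter):
    # Offloading gradients of non-replicated parameters: once a gradient
    # has finished accumulating, move it to the host to free GPU memory.
    def hook(p):
        p.grad = p.grad.to('cpu', non_blocking=True)
    # Requires PyTorch >= 2.1; not the API the original commit used.
    param.register_post_accumulate_grad_hook(hook)
```
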
ver217
ab8c6b4a0e [zero] refactor memstats collector (#706)
* refactor memstats collector

* fix disposable

* polish code
2022-04-11 10:46:08 +08:00
HELSON
ee112fe1da [zero] adapt zero hooks for unsharded module (#699) 2022-04-08 20:23:26 +08:00
ver217
3c9cd5bb5e [zero] stateful tensor manager (#687)
* [WIP] stateful tensor manager

* add eviction strategy

* polish code

* polish code

* polish comment

* add unit test

* fix sampler bug

* polish code

* fix max sampling cnt resetting bug

* fix sampler bug

* polish code

* fix bug

* fix unit test

Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
2022-04-08 17:51:34 +08:00
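
A stateful tensor manager tracks where each managed tensor currently lives and evicts GPU residents when memory runs short. A hypothetical sketch with the simplest eviction strategy (coldest-first); none of these names are the repository's:

```python
import torch

class StatefulTensorManager:
    """Hypothetical sketch: keep hot tensors on GPU, evict the coldest."""

    def __init__(self, cuda_budget_bytes):
        self.budget = cuda_budget_bytes
        self.on_cuda = []  # GPU residents, coldest first

    def _cuda_bytes(self):
        return sum(t.numel() * t.element_size() for t in self.on_cuda)

    def fetch(self, t):
        need = t.numel() * t.element_size()
        # Eviction strategy: push the coldest tensors to the host until
        # the requested one fits under the CUDA budget.
        while self.on_cuda and self._cuda_bytes() + need > self.budget:
            cold = self.on_cuda.pop(0)
            cold.data = cold.data.cpu()
        t.data = t.data.cuda()   # bring the requested tensor in
        self.on_cuda.append(t)   # newest access = hottest
        return t
```
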
HELSON
d7ecaf362b [zero] fix init bugs in zero context (#686)
* adapt model weight initialization for methods in PyTorch nn.init
2022-04-07 17:38:45 +08:00
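
One kind of issue such an adaptation must handle: if `torch.nn.init` methods run on already-sharded parameters, fan-in/fan-out statistics come from the shard's shape rather than the full one. A hedged sketch of the initialize-then-shard remedy (the helper is hypothetical, not the commit's code):

```python
import torch

def init_then_shard(full_shape, init_fn, world_size, rank):
    # Run the nn.init method on the full tensor so fan-in / fan-out are
    # computed from the true shape, then keep only this rank's slice.
    full = torch.empty(full_shape)
    init_fn(full)  # e.g. torch.nn.init.kaiming_uniform_
    shard = full.flatten().chunk(world_size)[rank]
    return shard.clone()  # this rank keeps only its own slice

w = init_then_shard((256, 256), torch.nn.init.kaiming_uniform_,
                    world_size=4, rank=0)
```
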
Jiarui Fang
0aab52301e [hotfix] fix a bug in model data stats tracing (#655) 2022-04-03 21:48:06 +08:00
YuliangLiu0306
ade05a5d83 [refactor] pipeline, put runtime schedule into engine. (#627) 2022-04-03 20:46:45 +08:00
HELSON
e5d615aeee [hotfix] fix bugs in testing (#659)
* remove hybrid adam in test_moe_zero_optim

* fix activation checkpointing and its unit test
2022-04-02 21:58:47 +08:00
HELSON
b31daed4cf fix bugs in CPU adam (#633)
* add a cpu adam counter for all cpu adam instances

* fixed updating error in adam kernel
2022-04-02 17:04:05 +08:00
HELSON
055fbf5be6 [zero] adapt zero for unsharded parameters (Optimizer part) (#601) 2022-04-01 20:10:47 +08:00
アマデウス
354b7954d1 [model checkpoint] added unit tests for checkpoint save/load (#599) 2022-04-01 16:53:32 +08:00
FredHuang99
93f14d2a33 [zero] test zero tensor utils (#609) 2022-04-01 15:16:59 +08:00
Jiarui Fang
e956d93ac2 [refactor] memory utils (#577) 2022-04-01 09:22:33 +08:00
HELSON
e6d50ec107 [zero] adapt zero for unsharded parameters (#561)
* support existing sharded and unsharded parameters in zero

* add unit test for moe-zero model init

* polish moe gradient handler
2022-03-31 18:34:11 +08:00
ver217
7c6c427db1 [zero] trace states of fp16/32 grad and fp32 param (#571) 2022-03-31 16:26:54 +08:00
Jiarui Fang
7675366fce [polish] rename col_attr -> colo_attr (#558) 2022-03-31 12:25:45 +08:00