Commit Graph

2276 Commits

Author SHA1 Message Date
Ziyue Jiang
1a9e2c2dff [tensor] fix kwargs in colo_tensor torch_function (#825) 2022-04-21 16:47:35 +08:00
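
A note on #825 (and #822 below): custom tensor types like ColoTensor hook into PyTorch through the `__torch_function__` protocol, whose `kwargs` argument arrives as `None` rather than an empty dict when the caller passed no keyword arguments; forgetting that guard is exactly the kind of bug the title suggests. A minimal sketch of the documented protocol with hypothetical names, not ColoTensor's actual code:

```python
import torch

class WrappedTensor:
    """Minimal wrapper type that participates in torch function dispatch."""

    def __init__(self, data: torch.Tensor):
        self.data = data

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # PyTorch passes kwargs as None, not {}, when no keyword arguments
        # were given; skipping this guard crashes on `**kwargs` below.
        if kwargs is None:
            kwargs = {}

        # Unwrap our type before delegating to the real torch op.
        def unwrap(x):
            return x.data if isinstance(x, WrappedTensor) else x

        args = tuple(unwrap(a) for a in args)
        kwargs = {k: unwrap(v) for k, v in kwargs.items()}
        return func(*args, **kwargs)

# torch.add(WrappedTensor(torch.ones(2)), 1.0) dispatches through the hook.
```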
Jiarui Fang
eb1b89908c [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. (#824) 2022-04-21 16:03:18 +08:00
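
The class name describes the technique: wrap the `__init__` of `nn.Module` subclasses so a post-init hook runs on every module constructed inside a context. A simplified sketch of that idea (direct subclasses only; the real utility is more thorough):

```python
import torch.nn as nn

class PostInitContext:
    """Run `post_fn(module)` after the __init__ of every nn.Module subclass
    constructed inside the `with` block. Sketch only: it patches direct
    subclasses of nn.Module and misses classes defined after entry."""

    def __init__(self, post_fn):
        self.post_fn = post_fn
        self._originals = {}

    def __enter__(self):
        def make_wrapper(orig_init):
            def wrapped(module, *args, **kwargs):
                orig_init(module, *args, **kwargs)
                self.post_fn(module)
            return wrapped

        for cls in nn.Module.__subclasses__():
            self._originals[cls] = cls.__init__
            cls.__init__ = make_wrapper(cls.__init__)
        return self

    def __exit__(self, *exc):
        # Restore the original constructors on exit.
        for cls, orig_init in self._originals.items():
            cls.__init__ = orig_init

# e.g. cast parameters to fp16 at construction time, as in #808 below:
# with PostInitContext(lambda m: m.half()):
#     model = MyNet()
```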
Jiarui Fang
2ecc3d7a55 [tensor] lazy init (#823) 2022-04-21 15:40:23 +08:00
Jiarui Fang
68dcd51d41 [Tensor] update ColoTensor torch_function (#822)
* Revert "[zero] add ZeroTensorShardStrategy (#793)" (reverts commit 88759e289e)
* [gemini] set cpu memory capacity
* [log] local throughput collecting
* polish (×3)
* polish code
* polish
* polish code
* add a new tensor structure and override linear for it
* polish (×11)
* [tensor] renaming and reorganize directory structure.
* rm useless dir
* polish (×2)
* [tensor] handle the function not wrapped
* polish
2022-04-21 14:25:27 +08:00
Jiarui Fang
0ce8924ceb [tensor] reorganize files (#820) 2022-04-21 14:15:48 +08:00
Jiarui Fang
ab962b9735 [gemini] a new tensor structure (#818)
* Revert "[zero] add ZeroTensorShardStrategy (#793)" (reverts commit 88759e289e)
* [gemini] set cpu memory capacity
* [log] local throughput collecting
* polish (×3)
* polish code
* polish
* polish code
* add a new tensor structure and override linear for it
* polish (×11)
2022-04-21 11:42:37 +08:00
FrankLeeeee
70ed11d07e [cli] added check installation cli 2022-04-20 12:13:27 +08:00
YuliangLiu0306
c7eca40f51 Merge pull request #812 from FrankLeeeee/feature/cli
[cli] fixed single-node process launching
2022-04-20 11:40:07 +08:00
Jiarui Fang
3ddbd1bce1 [gemini] collect cpu-gpu moving volume in each iteration (#813) 2022-04-20 11:29:48 +08:00
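
"Moving volume" here plausibly means the number of bytes shuffled between host and device per iteration; a tensor occupies `numel() * element_size()` bytes. A hypothetical counter sketch, not the actual Gemini collector:

```python
import torch

class MovingVolumeCounter:
    """Accumulates bytes moved between CPU and GPU within one iteration."""

    def __init__(self):
        self.cpu_to_gpu = 0
        self.gpu_to_cpu = 0

    def record_move(self, tensor: torch.Tensor, target_device: torch.device):
        nbytes = tensor.numel() * tensor.element_size()
        if tensor.device.type == "cpu" and target_device.type == "cuda":
            self.cpu_to_gpu += nbytes
        elif tensor.device.type == "cuda" and target_device.type == "cpu":
            self.gpu_to_cpu += nbytes

    def reset(self):
        # Called at the start of each iteration.
        self.cpu_to_gpu = self.gpu_to_cpu = 0
```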
FrankLeeeee
d522cb704e [cli] fixed single-node process launching 2022-04-20 10:46:51 +08:00
Jiarui Fang
61c20b44bc [log] local throughput metrics (#811)
* Revert "[zero] add ZeroTensorShardStrategy (#793)" (reverts commit 88759e289e)
* [gemini] set cpu memory capacity
* [log] local throughput collecting
* polish (×3)
* polish code
* polish
2022-04-20 10:05:39 +08:00
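
"Local" throughput metrics presumably report per-rank samples per second without cross-rank aggregation. A minimal sketch of such a meter with hypothetical names:

```python
import time

class ThroughputMeter:
    """Tracks samples processed per second on the local rank."""

    def __init__(self):
        self.num_samples = 0
        self.elapsed = 0.0
        self._t0 = None

    def start(self):
        # For GPU work, call torch.cuda.synchronize() first so the
        # measured window covers the kernels actually executed.
        self._t0 = time.perf_counter()

    def stop(self, num_samples: int):
        self.elapsed += time.perf_counter() - self._t0
        self.num_samples += num_samples

    def throughput(self) -> float:
        return self.num_samples / self.elapsed if self.elapsed else 0.0
```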
ver217
dd92b90a68 [DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext (#808)
* init fp16 param directly
* polish code
2022-04-19 16:16:48 +08:00
Jiarui Fang
227d1cd4b3 [gemini] APIs to set cpu memory capacity (#809) 2022-04-19 16:05:22 +08:00
FrankLeeeee
f63e91d280 [cli] fixed a bug in user args and refactored the module structure 2022-04-19 15:15:16 +08:00
Jiarui Fang
e761ad2cd7 Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806) 2022-04-19 14:40:02 +08:00
HELSON
88759e289e [zero] add ZeroTensorShardStrategy (#793) 2022-04-19 14:32:45 +08:00
Jiarui Fang
681addb512 [refactor] moving grad acc logic to engine (#804) 2022-04-19 14:03:21 +08:00
Frank Lee
05d9ae5999 [cli] add missing requirement (#805) 2022-04-19 13:56:59 +08:00
YuliangLiu0306
de2f581d43 [cli] added micro benchmarking for tp (#789)
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher" (reverts commit df7e6506d4)
* [CLI] add cli benchmark feature
* fix CodeFactor issues.
* refactor the module structure.
2022-04-19 12:08:28 +08:00
YuliangLiu0306
cfadc9df8e [cli] added distributed launcher command (#791)
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher" (reverts commit df7e6506d4)
* [CLI] add cli launcher feature
* remove testing message used during development
* refactor the module structure.
2022-04-19 10:59:44 +08:00
Jiarui Fang
4d9332b4c5 [refactor] moving memtracer to gemini (#801) 2022-04-19 10:13:08 +08:00
Jiarui Fang
8711c706f4 [hotfix] fix grad offload when enabling reuse_fp16_shard 2022-04-18 14:58:21 +08:00
ver217
f1fa1a675f fix grad offload when enabling reuse_fp16_shard 2022-04-18 14:07:39 +08:00
HELSON
4c4388c46e [hotfix] fix memory leak in zero (#781) 2022-04-18 13:57:03 +08:00
Ziyue Jiang
4b01da24cd [TP] change the assertion check in split batch 2d (#772) 2022-04-16 21:29:57 +08:00
ver217
846406a07a [gemini] fix auto tensor placement policy (#775) 2022-04-16 21:29:31 +08:00
HELSON
a65cbb7e4e [zero] refactor shard and gather operation (#773) 2022-04-15 14:41:31 +08:00
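
For background on shard/gather in zero-style sharding: a common scheme flattens a tensor, pads it to a multiple of the world size, and keeps one equal slice per rank; gather is the inverse via all-gather. A sketch under those assumptions, not necessarily how this refactor implements it:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def shard_tensor(t: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Flatten, right-pad to a multiple of world_size, return this rank's slice."""
    flat = t.flatten()
    chunk = (flat.numel() + world_size - 1) // world_size
    padded = F.pad(flat, (0, chunk * world_size - flat.numel()))
    return padded[rank * chunk:(rank + 1) * chunk].clone()

def gather_tensor(shard: torch.Tensor, shape: torch.Size, world_size: int) -> torch.Tensor:
    """Inverse: all-gather the shards, strip padding, restore the shape."""
    out = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(out, shard)
    return torch.cat(out)[:shape.numel()].reshape(shape)
```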
ver217
6e553748a7 polish sharded optim docstr and warning (#770) 2022-04-14 21:03:59 +08:00
LuGY
80e37eec42 fix the ckpt bugs when using DDP (#769) 2022-04-14 21:03:24 +08:00
Frank Lee
920fe31526 [compatibility] used backward-compatible API for global process group (#758) 2022-04-14 17:20:35 +08:00
Frank Lee
4ea49cb536 [test] added a decorator for address already in use error with backward compatibility (#760)
* [test] added a decorator for address already in use error with backward compatibility (×2)
2022-04-14 16:48:44 +08:00
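
The decorator in #760 addresses flaky distributed tests where a previously used port has not been released yet; the failure surfaces as a RuntimeError mentioning "Address already in use", so the test can simply be retried. A hedged sketch with a hypothetical name:

```python
import functools

def retry_if_address_in_use(max_retries: int = 3):
    """Re-run the wrapped test when distributed init fails on a busy port."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for _ in range(max_retries - 1):
                try:
                    return fn(*args, **kwargs)
                except RuntimeError as e:
                    # Only swallow the port-conflict error; re-raise the rest.
                    if "address already in use" not in str(e).lower():
                        raise
            return fn(*args, **kwargs)  # last attempt: let any error surface
        return wrapper
    return decorator
```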
Jiarui Fang
10ef8afdd2 [gemini] init gemini individual directory (#754) 2022-04-14 16:40:26 +08:00
ver217
dcca614eee [hotfix] fix test_stateful_tensor_mgr (#762) 2022-04-14 15:50:09 +08:00
ver217
a93a7d7364 [hotfix] fix reuse_fp16_shard of sharded model (#756)
* fix reuse_fp16_shard
* disable test stm
* polish code
2022-04-14 14:56:46 +08:00
ver217
8f7ce94b8e [hotfix] fix auto tensor placement policy (#753) 2022-04-14 12:04:45 +08:00
HELSON
84c6700b2a [zero] refactor memstats_collector (#746) 2022-04-14 12:01:12 +08:00
アマデウス
b8899e0905 [TP] allow layernorm without bias (#750) 2022-04-14 11:43:56 +08:00
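
Independent of the TP-specific layers this PR touches, "layernorm without bias" needs only an optional bias parameter, since `F.layer_norm` accepts `bias=None`. A plain, non-parallel sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm(nn.Module):
    """LayerNorm whose bias term can be disabled."""

    def __init__(self, dim: int, eps: float = 1e-5, bias: bool = True):
        super().__init__()
        self.normalized_shape = (dim,)
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
        if bias:
            self.bias = nn.Parameter(torch.zeros(dim))
        else:
            # Registering None keeps state_dict and repr consistent.
            self.register_parameter("bias", None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
```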
Jiarui Fang
3d7dc46d33 [zero] use factory pattern for tensor_placement_policy (#752) 2022-04-14 11:07:29 +08:00
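
A factory pattern here typically means a registry mapping a policy name to its class, so call sites build a policy from a config string. A hypothetical sketch (the "cpu"/"cuda"/"auto" names mirror the policies added in #743 below; actual class names may differ):

```python
class TensorPlacementPolicy:
    """Base class: decides where managed tensors should live."""
    def __init__(self, device):
        self.device = device

class CPUPolicy(TensorPlacementPolicy):
    def __init__(self):
        super().__init__("cpu")

class CUDAPolicy(TensorPlacementPolicy):
    def __init__(self):
        super().__init__("cuda")

class AutoPolicy(TensorPlacementPolicy):
    def __init__(self):
        super().__init__(None)  # decided dynamically from memory statistics

_POLICY_REGISTRY = {"cpu": CPUPolicy, "cuda": CUDAPolicy, "auto": AutoPolicy}

def create_placement_policy(name: str) -> TensorPlacementPolicy:
    """Factory: look the policy class up by name and instantiate it."""
    try:
        return _POLICY_REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown tensor placement policy: {name!r}")
```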
ver217
4b048a8728 fix prepare grads in sharded optim (#749) 2022-04-13 22:36:11 +08:00
ver217
097772546e fix initialization of zero 2022-04-13 19:10:21 +08:00
ver217
e396bb71f2 [zero] add tensor placement policies (#743)
* add tensor placement policies
* polish comments (×2)
* update moe unit tests
2022-04-13 15:00:48 +08:00
HELSON
22c4b88d56 [zero] refactor ShardedParamV2 for convenience (#742) 2022-04-13 14:54:26 +08:00
HELSON
340e59f968 [utils] add synchronized cuda memory monitor (#740) 2022-04-13 10:50:54 +08:00
ver217
e6212f56cd [hotfix] fix memory leak in backward of sharded model (#741) 2022-04-13 09:59:05 +08:00
Frank Lee
a4e91bc87f [bug] fixed grad scaler compatibility with torch 1.8 (#735) 2022-04-12 16:04:21 +08:00
Jiarui Fang
53cb584808 [utils] correct cpu memory used and capacity in the context of multi-process (#726) 2022-04-12 14:57:54 +08:00
Jiarui Fang
7db3ccc79b [hotfix] remove duplicated param register to stateful tensor manager (#728) 2022-04-12 13:55:25 +08:00
Frank Lee
1cb7bdad3b [util] fixed communication API depth with PyTorch 1.9 (#721) 2022-04-12 09:44:40 +08:00
Frank Lee
2412429d54 [util] fixed activation checkpointing on torch 1.9 (#719) 2022-04-12 09:35:45 +08:00
Frank Lee
04ff5ea546 [utils] support detection of number of processes on current node (#723) 2022-04-12 09:28:19 +08:00
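
One standard way to count the processes on the current node is to all-gather hostnames and count matches with this rank's hostname. A sketch assuming an initialized process group, not necessarily the utility's real approach:

```python
import socket
import torch.distributed as dist

def num_processes_on_current_node() -> int:
    """Count ranks whose hostname matches this process's hostname."""
    hostname = socket.gethostname()
    hostnames = [None] * dist.get_world_size()
    dist.all_gather_object(hostnames, hostname)
    return sum(h == hostname for h in hostnames)
```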