Commit Graph

1546 Commits

Author SHA1 Message Date
Jiarui Fang
cb5a4778e1 Revert "[WIP] Applying ColoTensor on TP-1D-row Linear. (#831)" (#835)
This reverts commit ac88de6dfc.
2022-04-22 14:45:57 +08:00
Frank Lee
5e00e6cf23 [setup] allow installation with python 3.6 (#834) 2022-04-22 14:17:51 +08:00
Jiarui Fang
ac88de6dfc [WIP] Applying ColoTensor on TP-1D-row Linear. (#831)
* revert zero tensors back

* [tensor] init row 1d linear
2022-04-22 14:03:26 +08:00
Jiarui Fang
595bedf767 revert zero tensors back (#829) 2022-04-22 12:12:35 +08:00
Jiarui Fang
294a6060d0 [tensor] ZeRO use ColoTensor as the base class. (#828)
* [refactor] moving InsertPostInitMethodToModuleSubClasses to utils.

* [tensor] ZeRO use ColoTensor as the base class.

* polish
2022-04-22 12:00:48 +08:00
Ziyue Jiang
8e6fdb4f29 [tensor]fix test_linear (#826) 2022-04-21 17:18:56 +08:00
Ziyue Jiang
1a9e2c2dff [tensor] fix kwargs in colo_tensor torch_funtion (#825) 2022-04-21 16:47:35 +08:00
Jiarui Fang
eb1b89908c [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. (#824) 2022-04-21 16:03:18 +08:00
Jiarui Fang
2ecc3d7a55 [tensor] lazy init (#823) 2022-04-21 15:40:23 +08:00
Jiarui Fang
68dcd51d41 [Tensor] update ColoTensor torch_function (#822)
* Revert "[zero] add ZeroTensorShardStrategy (#793)"

This reverts commit 88759e289e.

* [gemini] set cpu memory capacity

* [log] local throughput collecting

* polish

* polish

* polish

* polish code

* polish

* polish code

* add a new tensor structure and override linear for it

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* [tensor] renaming and reorganize directory structure.

* rm useless dir

* polish

* polish

* [tensor] hander the function not wrapped

* polish
2022-04-21 14:25:27 +08:00
Jiarui Fang
660d2d1f1b [Tensor] apply ColoTensor on Torch functions (#821)
* Revert "[zero] add ZeroTensorShardStrategy (#793)"

This reverts commit 88759e289e.

* [gemini] set cpu memory capacity

* [log] local throughput collecting

* polish

* polish

* polish

* polish code

* polish

* polish code

* add a new tensor structure and override linear for it

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* [tensor] renaming and reorganize directory structure.

* rm useless dir

* polish

* polish

* [tensor] hander the function not wrapped
2022-04-21 14:21:10 +08:00
Jiarui Fang
0ce8924ceb [tensor] reorganize files (#820) 2022-04-21 14:15:48 +08:00
Jiarui Fang
ab962b9735 [gemini] a new tensor structure (#818)
* Revert "[zero] add ZeroTensorShardStrategy (#793)"

This reverts commit 88759e289e.

* [gemini] set cpu memory capacity

* [log] local throughput collecting

* polish

* polish

* polish

* polish code

* polish

* polish code

* add a new tensor structure and override linear for it

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish
2022-04-21 11:42:37 +08:00
github-actions[bot]
413ce30c45 Automated submodule synchronization (#819)
Co-authored-by: github-actions <github-actions@github.com>
2022-04-21 11:26:58 +08:00
github-actions[bot]
9aae4197bb Automated submodule synchronization (#810)
Co-authored-by: github-actions <github-actions@github.com>
2022-04-20 13:57:12 +08:00
YuliangLiu0306
e1b3899824 Merge pull request #815 from FrankLeeeee/feature/check-cli
[cli] added check installation cli
2022-04-20 12:19:50 +08:00
FrankLeeeee
70ed11d07e [cli] added check installation cli 2022-04-20 12:13:27 +08:00
YuliangLiu0306
c7eca40f51 Merge pull request #812 from FrankLeeeee/feature/cli
[cli] fixed single-node process launching
2022-04-20 11:40:07 +08:00
Jiarui Fang
3ddbd1bce1 [gemini] collect cpu-gpu moving volume in each iteration (#813) 2022-04-20 11:29:48 +08:00
FrankLeeeee
d522cb704e [cli] fixed single-node process launching 2022-04-20 10:46:51 +08:00
Jiarui Fang
61c20b44bc [log] local throughput metrics (#811)
* Revert "[zero] add ZeroTensorShardStrategy (#793)"

This reverts commit 88759e289e.

* [gemini] set cpu memory capacity

* [log] local throughput collecting

* polish

* polish

* polish

* polish code

* polish
2022-04-20 10:05:39 +08:00
ver217
dd92b90a68 [DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext (#808)
* init fp16 param directly

* polish code
2022-04-19 16:16:48 +08:00
Jiarui Fang
227d1cd4b3 [gemini] APIs to set cpu memory capacity (#809) 2022-04-19 16:05:22 +08:00
YuliangLiu0306
f6dcd23fb9 Merge pull request #807 from FrankLeeeee/feature/cli
[cli] fixed a bug in user args and refactored the module structure
2022-04-19 15:52:26 +08:00
FrankLeeeee
f63e91d280 [cli] fixed a bug in user args and refactored the module structure 2022-04-19 15:15:16 +08:00
Jiarui Fang
e761ad2cd7 Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806) 2022-04-19 14:40:02 +08:00
HELSON
88759e289e [zero] add ZeroTensorShardStrategy (#793) 2022-04-19 14:32:45 +08:00
Jiarui Fang
681addb512 [refactor] moving grad acc logic to engine (#804) 2022-04-19 14:03:21 +08:00
Frank Lee
05d9ae5999 [cli] add missing requirement (#805) 2022-04-19 13:56:59 +08:00
YuliangLiu0306
de2f581d43 [cli] added micro benchmarking for tp (#789)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [CLI]add cli benchmark feature

* fix CodeFactor issues.

* refactor the module structure.
2022-04-19 12:08:28 +08:00
YuliangLiu0306
cfadc9df8e [cli] added distributed launcher command (#791)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [CLI]add cli launcher feature

* remove testing message used during developing

* refactor the module structure.
2022-04-19 10:59:44 +08:00
Jiarui Fang
97cd9b03b3 [log] display tflops if available (#802) 2022-04-19 10:13:28 +08:00
Jiarui Fang
4d9332b4c5 [refactor] moving memtracer to gemini (#801) 2022-04-19 10:13:08 +08:00
Jiarui Fang
8711c706f4 [hotfix] fix grad offload when enabling reuse_fp16_shard 2022-04-18 14:58:21 +08:00
ver217
f1fa1a675f fix grad offload when enabling reuse_fp16_shard 2022-04-18 14:07:39 +08:00
HELSON
4c4388c46e [hotfix] fix memory leak in zero (#781) 2022-04-18 13:57:03 +08:00
Ziyue Jiang
4b01da24cd [TP] change the check assert in split batch 2d (#772) 2022-04-16 21:29:57 +08:00
ver217
846406a07a [gemini] fix auto tensor placement policy (#775) 2022-04-16 21:29:31 +08:00
ver217
38102cf61a update version (#779) v0.1.3 2022-04-16 17:09:24 +08:00
HELSON
a65cbb7e4e [zero] refactor shard and gather operation (#773) 2022-04-15 14:41:31 +08:00
Frank Lee
5a1a095b92 [test] refactored with the new rerun decorator (#763)
* [test] refactored with the new rerun decorator

* polish test case
2022-04-15 00:33:04 +08:00
binmakeswell
deaf99f4c9 [readme] sync CN readme (#766) 2022-04-14 21:04:51 +08:00
ver217
6e553748a7 polish sharded optim docstr and warning (#770) 2022-04-14 21:03:59 +08:00
LuGY
80e37eec42 fix the ckpt bugs when using DDP (#769) 2022-04-14 21:03:24 +08:00
Jiarui Fang
1f698f4406 [readme] polish readme (#764)
* [readme] polish readme

* centering image
2022-04-14 17:34:08 +08:00
Frank Lee
920fe31526 [compatibility] used backward-compatible API for global process group (#758) 2022-04-14 17:20:35 +08:00
Frank Lee
4ea49cb536 [test] added a decorator for address already in use error with backward compatibility (#760)
* [test] added a decorator for address already in use error with backward compatibility

* [test] added a decorator for address already in use error with backward compatibility
2022-04-14 16:48:44 +08:00
Jiarui Fang
10ef8afdd2 [gemini] init genimi individual directory (#754) 2022-04-14 16:40:26 +08:00
ver217
dcca614eee [hotfix] fix test_stateful_tensor_mgr (#762) 2022-04-14 15:50:09 +08:00
github-actions[bot]
6978980f6d Automated submodule synchronization (#751)
Co-authored-by: github-actions <github-actions@github.com>
2022-04-14 15:34:01 +08:00