Commit Graph

147 Commits

Author SHA1 Message Date
YuliangLiu0306
e414e4092b [DTensor] implementation of dtensor (#2946)
* [DTensor] implementation of dtensor

* test layout convert

* polish
2023-03-01 16:34:58 +08:00
HELSON
707b11d4a0 [gemini] update ddp strict mode (#2518)
* [zero] add strict ddp mode for chunk init

* [gemini] update gpt example
2023-01-28 14:35:25 +08:00
HELSON
2d1a7dfe5f [zero] add strict ddp mode (#2508)
* [zero] add strict ddp mode

* [polish] add comments for strict ddp mode

* [zero] fix test error
2023-01-20 14:04:38 +08:00
HELSON
d565a24849 [zero] add unit tests for hybrid parallelism (#2486) 2023-01-18 10:36:10 +08:00
HELSON
ea13a201bb [polish] polish code for get_static_torch_model (#2405)
* [gemini] polish code

* [testing] remove code

* [gemini] make more robust
2023-01-09 17:41:38 +08:00
HELSON
a3100bd50d [testing] add beit model for unit tests (#2196)
* [testing] add beit model

* [beit] fix bugs

* [beit] fix bugs

* [testing] fix bugs
2022-12-26 17:35:36 +08:00
Jiarui Fang
1f99205827 [Gemini] remove static tracer (#2083) 2022-12-06 12:53:58 +08:00
Jiarui Fang
2e9cbfca12 [Gemini] add unit tests to check gemini correctness (#2015) 2022-11-24 16:51:45 +08:00
Genghan Zhang
d655eea515 [autoparallel] mix gather (#1977)
* Add mix-gather

* Add comments

* Add comments

* Polish comments

* Change the global rank assumption

* Add tests

* Add two-step tests

* Fix 10 and 01

* Skip test because of the number of GPUs
2022-11-23 21:49:17 +08:00
Jiarui Fang
f7e276fa71 [Gemini] add GeminiAdamOptimizer (#1960) 2022-11-16 14:44:28 +08:00
Jiarui Fang
52c6ad26e0 [ColoTensor] reconfig ColoInitContext, decouple default_pg and default_dist_spec. (#1953) 2022-11-15 16:24:16 +08:00
Jiarui Fang
9f4fb3f28a [ColoTensor] ColoInitContext initialize parameters in shard mode. (#1937) 2022-11-14 16:05:09 +08:00
Jiarui Fang
3ce4463fe6 [utils] remove lazy_memory_allocate from ColoInitContext (#1844) 2022-11-09 11:50:33 +08:00
YuliangLiu0306
980ed21723 [autoparallel] shard param and buffer as expected (#1753)
* [autoparallel] shard param and buffer as expected

* fix unit test issue
2022-10-21 15:45:13 +08:00
Frank Lee
eee84908d4 [autoparallel] handled illegal sharding strategy (#1728)
* [autoparallel] handled illegal sharding strategy

* polish code
2022-10-19 12:53:06 +08:00
HELSON
f69f9bf223 [zero] add chunk init function for users (#1729)
* add chunk manager init function

* fix unit tests

* add comment

* add flush=True
2022-10-18 16:31:22 +08:00
HELSON
b28991dd0a [feature] A new ZeRO implementation (#1644) 2022-10-09 09:18:51 +08:00
YuliangLiu0306
3f068d1409 [autoparallel] update CommSpec (#1667) 2022-09-29 11:20:59 +08:00
Frank Lee
154d3ef432 [fix] fixed the collective pattern name for consistency (#1649)
* [fix] fixed the collective pattern name for consistency

* polish code
2022-09-26 16:39:37 +08:00
Jiarui Fang
c5d39215f6 Revert "[feature] new zero implementation (#1623)" (#1643)
This reverts commit 5be118f405.
2022-09-26 10:06:03 +08:00
HELSON
5be118f405 [feature] new zero implementation (#1623) 2022-09-24 19:58:18 +08:00
YuliangLiu0306
702dbc5288 [tensor] use communication autograd func (#1617)
* [tensor] use communication autograd func

* change all-to-all comm spec info

* rename pattern and distinguish fwd/bwd

* polish code
2022-09-23 13:31:15 +08:00
YuliangLiu0306
4b03c25f85 [tensor] add 1D device mesh (#1492) 2022-08-25 16:48:12 +08:00
YuliangLiu0306
b73fb7a077 [tensor] support runtime ShardingSpec apply (#1453)
* [tensor] support runtime ShardingSpec apply

* polish code

* polish code
2022-08-19 13:39:51 +08:00
YuliangLiu0306
0f3042363c [tensor] shape consistency generate transform path and communication cost (#1435)
* [tensor] shape consistency output transform path and communication cost

* polish code
2022-08-12 14:02:32 +08:00
Frank Lee
ae1b58cd16 [tensor] added linear implementation for the new sharding spec (#1416)
* [tensor] added linear implementation for the new sharding spec

* polish code
2022-08-12 11:33:09 +08:00
Jiarui Fang
89c434a0a6 [polish] add test_ops directory (#1431) 2022-08-10 15:35:26 +08:00
Jiarui Fang
10b3df65c8 [FAW] move coloparam setting in test code. (#1429) 2022-08-10 14:31:53 +08:00
Jiarui Fang
cb98cf5558 [FAW] parallel FreqAwareEmbedding (#1424) 2022-08-10 13:44:30 +08:00
YuliangLiu0306
33f0744d51 [tensor] add shape consistency feature to support auto spec transform (#1418)
* [tensor] add shape consistency feature to support auto sharding spec transform.

* [tensor] remove unused argument in simulator, add doc string for target pair.
2022-08-10 11:29:17 +08:00
Jiarui Fang
d209aff684 Add FreqAwareEmbeddingBag (#1421) 2022-08-09 16:26:12 +08:00
Jiarui Fang
504419d261 [FAW] add cache manager for the cached embedding (#1419) 2022-08-09 15:17:17 +08:00
YuliangLiu0306
7c96055c68 [tensor] build sharding spec to replace distspec in the future. (#1405) 2022-08-08 11:15:57 +08:00
HELSON
87775a0682 [colotensor] use cpu memory to store state_dict (#1367) 2022-07-26 14:13:38 +08:00
HELSON
4417804129 [unit test] add megatron init test in zero_optim (#1358) 2022-07-25 11:18:08 +08:00
HELSON
7a065dc9f6 [hotfix] fix megatron_init in test_gpt2.py (#1357) 2022-07-25 10:28:19 +08:00
HELSON
7a8702c06d [colotensor] add Tensor.view op and its unit test (#1343)
[colotensor] add megatron initialization for gpt2
2022-07-21 10:53:15 +08:00
HELSON
bf5066fba7 [refactor] refactor ColoTensor's unit tests (#1340) 2022-07-19 15:46:24 +08:00
ver217
0c51ff2c13 [hotfix] ZeroDDP use new process group (#1333)
* process group supports getting ranks in group

* chunk mgr receives a process group

* update unit test

* fix unit tests
2022-07-18 14:14:52 +08:00
HELSON
d49708ae43 [hotfix] fix ddp for unit test test_gpt2 (#1326) 2022-07-15 18:19:52 +08:00
HELSON
1b41686461 [hotfix] fix unit test test_module_spec (#1321) 2022-07-15 14:02:32 +08:00
Jiarui Fang
85f933b58b [Optimizer] Remove useless ColoOptimizer (#1312) 2022-07-14 16:57:48 +08:00
Jiarui Fang
9f10524313 [Optimizer] polish the init method of ColoOptimizer (#1310) 2022-07-14 16:37:33 +08:00
HELSON
36086927e1 [hotfix] fix ColoTensor GPT2 unit test (#1309) 2022-07-14 16:37:20 +08:00
HELSON
260a55804a [hotfix] fix shape error in backward when using ColoTensor (#1298) 2022-07-13 23:06:12 +08:00
Jiarui Fang
79fe7b027a [hotfix] test model unittest hotfix (#1281) 2022-07-12 23:45:29 +08:00
Jiarui Fang
e56731e916 [hotfix] test_gpt.py duplicated (#1279)
* make it faster

* [hotfix] torchvision fx tests

* [hotfix] rename the duplicated test_gpt.py
2022-07-12 23:29:17 +08:00
HELSON
abba4d84e1 [hotfix] fix bert model test in unit tests (#1272) 2022-07-12 23:26:45 +08:00
Jiarui Fang
c92f84fcdb [tensor] distributed checkpointing for parameters (#1240) 2022-07-12 15:51:06 +08:00
Jiarui Fang
1aad903c15 [tensor] redistribute among different process groups (#1247)
* make it faster

* [tensor] rename convert_to_dist -> redistribute

* [tensor] ShardSpec and ReplicaSpec

* [tensor] redistribute among diff pgs

* polish code
2022-07-12 10:24:05 +08:00