Commit Graph

1179 Commits

Each entry lists the author, the abbreviated commit SHA-1, the commit message, and the commit date.
Frank Lee
05fae1fd56 [fx] added activation checkpointing annotation (#1349)
* [fx] added activation checkpointing annotation

* polish code

* polish code
2022-07-21 11:14:28 +08:00
HELSON
7a8702c06d [colotensor] add Tensor.view op and its unit test (#1343)
[colotensor] add megatron initialization for gpt2
2022-07-21 10:53:15 +08:00
YuliangLiu0306
942c8cd1fb [fx] refactor tracer to trace complete graph (#1342)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx] refactor tracer to trace complete graph

* add comments and solve conflicts.
2022-07-20 11:20:38 +08:00
Frank Lee
2cc1175c76 [fx] tested the complete workflow for auto-parallel (#1336)
* [fx] tested the complete workflow for auto-parallel

* polish code

* polish code

* polish code
2022-07-20 10:45:17 +08:00
YuliangLiu0306
4631fef8a0 [fx] refactor tracer (#1335) 2022-07-19 15:50:42 +08:00
HELSON
bf5066fba7 [refactor] refactor ColoTensor's unit tests (#1340) 2022-07-19 15:46:24 +08:00
HELSON
f92c100ddd [checkpoint] use gather_tensor in checkpoint and update its unit test (#1339) 2022-07-19 14:15:28 +08:00
Frank Lee
f3ce7b8336 [fx] recovered skipped pipeline tests (#1338) 2022-07-19 09:49:50 +08:00
ver217
0c51ff2c13 [hotfix] ZeroDDP use new process group (#1333)
* process group supports getting ranks in group

* chunk mgr receives a process group

* update unit test

* fix unit tests
2022-07-18 14:14:52 +08:00
Frank Lee
75abc75c15 [fx] fixed compatibility issue with torch 1.10 (#1331) 2022-07-18 11:41:27 +08:00
Frank Lee
169954f87e [test] removed outdated unit test for meta context (#1329) 2022-07-15 23:16:23 +08:00
ver217
7a05367101 [hotfix] shared model returns cpu state_dict (#1328) 2022-07-15 22:11:37 +08:00
Frank Lee
b2475d8c5c [fx] fixed unit tests for torch 1.12 (#1327) 2022-07-15 18:22:15 +08:00
HELSON
d49708ae43 [hotfix] fix ddp for unit test test_gpt2 (#1326) 2022-07-15 18:19:52 +08:00
Frank Lee
250be4d31e [utils] integrated colotensor with lazy init context (#1324)
* [utils] integrated colotensor with lazy init context

* polish code

* polish code

* polish code
2022-07-15 17:47:12 +08:00
YuliangLiu0306
e8acf55e8b [fx] add balanced policy v2 (#1251)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx] add balanced policy v2

* add unittest
2022-07-15 14:54:26 +08:00
XYE
ca2d3f284f [fx] Add unit test and fix bugs for transform_mlp_pass (#1299)
* add test and fix bugs

* add functions back

* add comments
2022-07-15 14:37:58 +08:00
HELSON
1b41686461 [hotfix] fix unit test test_module_spec (#1321) 2022-07-15 14:02:32 +08:00
Jiarui Fang
9e4c6449b0 [checkpoint] add ColoOptimizer checkpointing (#1316) 2022-07-15 09:52:55 +08:00
Jiarui Fang
85f933b58b [Optimizer] Remove useless ColoOptimizer (#1312) 2022-07-14 16:57:48 +08:00
Jiarui Fang
9f10524313 [Optimizer] polish the init method of ColoOptimizer (#1310) 2022-07-14 16:37:33 +08:00
HELSON
36086927e1 [hotfix] fix ColoTensor GPT2 unittest (#1309) 2022-07-14 16:37:20 +08:00
Jiarui Fang
3ef3791a3b [checkpoint] add test for bert and hotfix save bugs (#1297) 2022-07-14 15:38:18 +08:00
Jiarui Fang
bd71e2a88b [hotfix] add missing file (#1308) 2022-07-14 14:43:15 +08:00
Frank Lee
4f4d8c3656 [fx] added apex normalization to patched modules (#1300)
* [fx] added apex normalization to patched modules

* remove unused imports
2022-07-14 14:24:13 +08:00
Jiarui Fang
4165eabb1e [hotfix] remove potential circular import (#1307)
* make it faster

* [hotfix] remove circular import
2022-07-14 13:44:26 +08:00
YuliangLiu0306
93a75433df [hotfix] skip some unittests due to CI environment. (#1301) 2022-07-14 10:55:18 +08:00
HELSON
260a55804a [hotfix] fix shape error in backward when using ColoTensor (#1298) 2022-07-13 23:06:12 +08:00
Frank Lee
7e8114a8dd [hotfix] skipped unsafe test cases (#1282) 2022-07-13 00:08:59 +08:00
Jiarui Fang
79fe7b027a [hotfix] test model unittest hotfix (#1281) 2022-07-12 23:45:29 +08:00
Jiarui Fang
e56731e916 [hotfix] test_gpt.py duplicated (#1279)
* make it faster

* [hotfix] torchvision fx tests

* [hotfix] rename the duplicated test_gpt.py
2022-07-12 23:29:17 +08:00
HELSON
abba4d84e1 [hotfix] fix bert model test in unit tests (#1272) 2022-07-12 23:26:45 +08:00
YuliangLiu0306
01ea68b2e6 [tests] remove T5 test skip decorator (#1271) 2022-07-12 23:25:30 +08:00
Jiarui Fang
ca9d5ee91c [hotfix] torchvision fx unittests missing pytest import (#1277) 2022-07-12 23:04:06 +08:00
Jiarui Fang
c92f84fcdb [tensor] distributed checkpointing for parameters (#1240) 2022-07-12 15:51:06 +08:00
Frank Lee
4a09fc0947 [fx] fixed tracing with apex-based T5 model (#1252)
* [fx] fixed tracing with apex-based T5 model

* polish code

* polish code
2022-07-12 15:19:25 +08:00
YuliangLiu0306
97d713855a [fx] methods to get fx graph property. (#1246)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* manipulation

* [fx] add graph manipulation methods.

* [fx] methods to get fx graph property.

* add unit test

* add docstring to explain top node and leaf node in this context
2022-07-12 14:10:37 +08:00
YuliangLiu0306
30b4fc0eb0 [fx] add split module pass and unit test from pipeline passes (#1242)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx] add split module pass and unit test from pipeline passes

* fix MNASNet bug

* polish
2022-07-12 13:45:01 +08:00
Jiarui Fang
1aad903c15 [tensor] redistribute among different process groups (#1247)
* make it faster

* [tensor] rename convert_to_dist -> redistribute

* [tensor] ShardSpec and ReplicaSpec

* [tensor] redistribute among different process groups

* polish code
2022-07-12 10:24:05 +08:00
Jiarui Fang
9bcd2fd4af [tensor] a shorter shard and replicate spec (#1245) 2022-07-11 15:51:48 +08:00
Jiarui Fang
2699dfbbfd [rename] convert_to_dist -> redistribute (#1243) 2022-07-11 13:05:44 +08:00
HELSON
f6add9b720 [tensor] redirect .data.__get__ to a tensor instance (#1239) 2022-07-11 11:41:29 +08:00
Jiarui Fang
20da6e48c8 [checkpoint] save sharded optimizer states (#1237) 2022-07-08 16:33:13 +08:00
Jiarui Fang
4a76084dc9 [tensor] add zero_like colo op, important for Optimizer (#1236) 2022-07-08 14:55:27 +08:00
Jiarui Fang
3b500984b1 [tensor] fix some unittests (#1234) 2022-07-08 14:18:30 +08:00
HELSON
0453776def [tensor] fix an assertion in colo_tensor cross_entropy (#1232) 2022-07-08 11:18:00 +08:00
Jiarui Fang
0e199d71e8 [hotfix] fx get comm size bugs (#1233)
* init a checkpoint dir

* [checkpoint] support resume for cosinewarmuplr

* [checkpoint] add unit test

* fix some bugs but still not OK

* fix bugs

* make it faster

* [checkpoint] support generalized scheduler

* polish

* [tensor] torch function return colotensor

* polish

* fix bugs

* remove debug info

* polish

* polish

* [tensor] test_model pass unittests

* polish

* [hotfix] fx get comm size bug

Co-authored-by: ZhaoYi1222 <zhaoyi9499@gmail.com>
2022-07-08 10:54:41 +08:00
HELSON
42ab36b762 [tensor] add unit test for colo_tensor 1DTP cross_entropy (#1230) 2022-07-07 19:17:23 +08:00
Yi Zhao
04537bf83e [checkpoint] support generalized scheduler (#1222) 2022-07-07 18:16:38 +08:00
Jiarui Fang
a98319f023 [tensor] torch function return colotensor (#1229) 2022-07-07 18:09:18 +08:00