1213 Commits

Author SHA1 Message Date
XYE
ca2d3f284f [fx] Add unit test and fix bugs for transform_mlp_pass (#1299)
* add test and fix bugs

* add functions back

* add comments
2022-07-15 14:37:58 +08:00
HELSON
1b41686461 [hotfix] fix unit test test_module_spec (#1321) 2022-07-15 14:02:32 +08:00
Jiarui Fang
9e4c6449b0 [checkpoint] add ColoOptimizer checkpointing (#1316) 2022-07-15 09:52:55 +08:00
Jiarui Fang
85f933b58b [Optimizer] Remove useless ColoOptimizer (#1312) 2022-07-14 16:57:48 +08:00
Jiarui Fang
9f10524313 [Optimizer] polish the init method of ColoOptimizer (#1310) 2022-07-14 16:37:33 +08:00
HELSON
36086927e1 [hotfix] fix ColoTensor GPT2 unitest (#1309) 2022-07-14 16:37:20 +08:00
Jiarui Fang
3ef3791a3b [checkpoint] add test for bert and hotfix save bugs (#1297) 2022-07-14 15:38:18 +08:00
Jiarui Fang
bd71e2a88b [hotfix] add missing file (#1308) 2022-07-14 14:43:15 +08:00
Frank Lee
4f4d8c3656 [fx] added apex normalization to patched modules (#1300)
* [fx] added apex normalization to patched modules

* remove unused imports
2022-07-14 14:24:13 +08:00
Jiarui Fang
4165eabb1e [hotfix] remove potiential circle import (#1307)
* make it faster

* [hotfix] remove circle import
2022-07-14 13:44:26 +08:00
YuliangLiu0306
93a75433df [hotfix] skip some unittest due to CI environment. (#1301) 2022-07-14 10:55:18 +08:00
HELSON
260a55804a [hotfix] fix shape error in backward when using ColoTensor (#1298) 2022-07-13 23:06:12 +08:00
Frank Lee
7e8114a8dd [hotfix] skipped unsafe test cases (#1282) 2022-07-13 00:08:59 +08:00
Jiarui Fang
79fe7b027a [hotfix] test model unittest hotfix (#1281) 2022-07-12 23:45:29 +08:00
Jiarui Fang
e56731e916 [hotfix] test_gpt.py duplicated (#1279)
* make it faster

* [hotfix] torchvison fx tests

* [hotfix] rename duplicated named test_gpt.py
2022-07-12 23:29:17 +08:00
HELSON
abba4d84e1 [hotfix] fix bert model test in unitests (#1272) 2022-07-12 23:26:45 +08:00
YuliangLiu0306
01ea68b2e6 [tests] remove T5 test skip decorator (#1271) 2022-07-12 23:25:30 +08:00
Jiarui Fang
ca9d5ee91c [hotfix] torchvison fx unittests miss import pytest (#1277) 2022-07-12 23:04:06 +08:00
Jiarui Fang
c92f84fcdb [tensor] distributed checkpointing for parameters (#1240) 2022-07-12 15:51:06 +08:00
Frank Lee
4a09fc0947 [fx] fixed tracing with apex-based T5 model (#1252)
* [fx] fixed tracing with apex-based T5 model

* polish code

* polish code
2022-07-12 15:19:25 +08:00
YuliangLiu0306
97d713855a [fx] methods to get fx graph property. (#1246)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* manipulation

* [fx]add graph manipulation methods.

* [fx]methods to get fx graph property.

* add unit test

* add docstring to explain top node and leaf node in this context
2022-07-12 14:10:37 +08:00
YuliangLiu0306
30b4fc0eb0 [fx]add split module pass and unit test from pipeline passes (#1242)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx]add split module pass and unit test from pipeline passes

* fix MNASNet bug

* polish
2022-07-12 13:45:01 +08:00
Jiarui Fang
1aad903c15 [tensor] redistribute among different process groups (#1247)
* make it faster

* [tensor] rename convert_to_dist -> redistribute

* [tensor] ShardSpec and ReplicaSpec

* [tensor] redistribute among diff pgs

* polish code
2022-07-12 10:24:05 +08:00
Jiarui Fang
9bcd2fd4af [tensor] a shorter shard and replicate spec (#1245) 2022-07-11 15:51:48 +08:00
Jiarui Fang
2699dfbbfd [rename] convert_to_dist -> redistribute (#1243) 2022-07-11 13:05:44 +08:00
HELSON
f6add9b720 [tensor] redirect .data.__get__ to a tensor instance (#1239) 2022-07-11 11:41:29 +08:00
Jiarui Fang
20da6e48c8 [checkpoint] save sharded optimizer states (#1237) 2022-07-08 16:33:13 +08:00
Jiarui Fang
4a76084dc9 [tensor] add zero_like colo op, important for Optimizer (#1236) 2022-07-08 14:55:27 +08:00
Jiarui Fang
3b500984b1 [tensor] fix some unittests (#1234) 2022-07-08 14:18:30 +08:00
HELSON
0453776def [tensor] fix a assertion in colo_tensor cross_entropy (#1232) 2022-07-08 11:18:00 +08:00
Jiarui Fang
0e199d71e8 [hotfix] fx get comm size bugs (#1233)
* init a checkpoint dir

* [checkpoint]support resume for cosinewarmuplr

* [checkpoint]add unit test

* fix some bugs but still not OK

* fix bugs

* make it faster

* [checkpoint]support generalized scheduler

* polish

* [tensor] torch function return colotensor

* polish

* fix bugs

* remove debug info

* polish

* polish

* [tensor] test_model pass unittests

* polish

* [hotfix] fx get comm size bug

Co-authored-by: ZhaoYi1222 <zhaoyi9499@gmail.com>
2022-07-08 10:54:41 +08:00
HELSON
42ab36b762 [tensor] add unitest for colo_tensor 1DTP cross_entropy (#1230) 2022-07-07 19:17:23 +08:00
Yi Zhao
04537bf83e [checkpoint]support generalized scheduler (#1222) 2022-07-07 18:16:38 +08:00
Jiarui Fang
a98319f023 [tensor] torch function return colotensor (#1229) 2022-07-07 18:09:18 +08:00
Frank Lee
5581170890 [fx] fixed huggingface OPT and T5 results misalignment (#1227) 2022-07-07 16:29:58 +08:00
YuliangLiu0306
2b7dca44b5 [fx]get communication size between partitions (#1224)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx]get communication size between partitions.

* polish
2022-07-07 16:22:00 +08:00
Frank Lee
84f2298a96 [fx] added patches for tracing swin transformer (#1228) 2022-07-07 15:20:13 +08:00
Frank Lee
37fcf96b7f [fx] fixed timm tracing result misalignment (#1225) 2022-07-07 14:45:15 +08:00
Frank Lee
b6cb5a47ad [fx] added timm model tracing testing (#1221) 2022-07-07 14:02:17 +08:00
Jiarui Fang
15d988f954 [tensor] sharded global process group (#1219) 2022-07-07 13:38:48 +08:00
Frank Lee
11973d892d [fx] added torchvision model tracing testing (#1216)
* [fx] added torchvision model tracing testing

* remove unused imports
2022-07-06 21:37:56 +08:00
Jiarui Fang
52736205d9 [checkpoint] make unitest faster (#1217) 2022-07-06 17:39:46 +08:00
Jiarui Fang
f38006ea83 [checkpoint] checkpoint for ColoTensor Model (#1196) 2022-07-06 17:22:03 +08:00
Jiarui Fang
ae7d3f4927 [refactor] move process group from _DistSpec to ColoTensor. (#1203) 2022-07-06 16:15:16 +08:00
Frank Lee
5da87ce35d [fx] added testing for all albert variants (#1211) 2022-07-06 15:11:08 +08:00
Frank Lee
2d13a45a3b [fx] added testing for all gpt variants (#1210)
* [fx] added testing for all gpt variants

* polish code

* polish code
2022-07-06 14:03:13 +08:00
YuliangLiu0306
189946c5c4 [fx]add uniform policy (#1208)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx]add uniform policy
2022-07-06 13:48:11 +08:00
Frank Lee
426a279ce7 [fx] added testing for all bert variants (#1207)
* [fx] added testing for all bert variants

* polish code
2022-07-06 10:50:49 +08:00
Frank Lee
f7878f465c [fx] supported model tracing for huggingface bert (#1201)
* [fx] supported model tracing for huggingface bert

* polish test
2022-07-05 13:19:57 +08:00
Jiarui Fang
060b917daf [refactor] remove gpc dependency in colotensor's _ops (#1189) 2022-07-04 18:54:37 +08:00