Commit Graph

660 Commits

Author SHA1 Message Date
Jiarui Fang
d209aff684 Add FreqAwareEmbeddingBag (#1421) 2022-08-09 16:26:12 +08:00
ver217
6df3e19be9 [hotfix] zero optim prevents calling inner optim.zero_grad (#1422) 2022-08-09 16:08:12 +08:00
Jiarui Fang
504419d261 [FAW] add cache manager for the cached embedding (#1419) 2022-08-09 15:17:17 +08:00
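The two FAW commits above concern a software-managed cache for embedding rows. As a rough illustration of the idea (hot rows kept in a small fast buffer, cold rows evicted by access frequency), here is a minimal, hypothetical sketch — the class and parameter names are invented and are not the ColossalAI `FreqAwareEmbeddingBag` API:

```python
class CacheManager:
    """Toy frequency-aware cache: keeps the most-accessed row ids in a
    fixed-size 'fast' set, evicting the least-frequently-used id.
    Illustrative only -- not the ColossalAI FreqAwareEmbeddingBag API."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.freq = {}        # row id -> lifetime access count
        self.cached = set()   # row ids currently in the fast buffer

    def access(self, row_id):
        """Record an access; return True on cache hit, False on miss."""
        self.freq[row_id] = self.freq.get(row_id, 0) + 1
        if row_id in self.cached:
            return True
        if len(self.cached) >= self.capacity:
            # evict the least-frequently-used cached row
            victim = min(self.cached, key=lambda r: self.freq[r])
            self.cached.remove(victim)
        self.cached.add(row_id)
        return False          # miss: row would be fetched from slow memory
```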
Kirigaya Kazuto
44fd3c83ab [communication] add p2p_v2.py to support communication with List[Any] (#1407)
* support p2p communication with any type of object | pass test

* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test

* [communication] add p2p_v2.py to support communication with List[Any]

* Delete _pipeline_schedule_v2.py

* Delete test_cifar_with_data_pipeline_tensor_v2.py

* [engine/schedule] use p2p_v2 to reconstruct pipeline_schedule

* [communication] remove print code

* [communication] remove print code
2022-08-09 11:40:04 +08:00
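The p2p_v2 commit above is about sending arbitrary Python objects (e.g. `List[Any]`) between pipeline stages, not just tensors. The usual trick is to serialize the object into a length-prefixed byte frame that a byte/tensor channel can carry. A self-contained sketch of that framing, using only the standard library (the function names are illustrative, not the p2p_v2.py API):

```python
import pickle
import struct

def encode_obj(obj):
    """Serialize an arbitrary Python object (e.g. a List[Any]) into a
    length-prefixed byte frame -- the kind of payload a tensor-based
    p2p channel can ship between pipeline stages. Sketch only."""
    payload = pickle.dumps(obj)
    # 4-byte big-endian length header, then the pickled payload
    return struct.pack(">I", len(payload)) + payload

def decode_obj(frame):
    """Inverse of encode_obj: read the header, unpickle the payload."""
    (length,) = struct.unpack(">I", frame[:4])
    return pickle.loads(frame[4:4 + length])
```

In a real pipeline the frame bytes would be copied into a communication buffer and sent with a point-to-point primitive; the length header lets the receiver allocate before reading.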
YuliangLiu0306
7c96055c68 [tensor]build sharding spec to replace distspec in future. (#1405) 2022-08-08 11:15:57 +08:00
ver217
12b4887097 [hotfix] fix CPUAdam kernel nullptr (#1410) 2022-08-05 19:45:45 +08:00
YuliangLiu0306
0442f940f0 [device] add DeviceMesh class to support logical device layout (#1394)
* [device] add DeviceMesh class to support logical device layout

* polish code

* add doc string
2022-08-02 19:23:48 +08:00
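The DeviceMesh commit above introduces a logical layout over physical devices: flat ranks are arranged into a grid so that parallel groups can be addressed by mesh axis. A toy sketch of that mapping follows — the class is hypothetical and does not mirror the actual ColossalAI `DeviceMesh` interface:

```python
class DeviceMesh:
    """Toy logical device layout: arranges flat ranks 0..n-1 into a
    row-major 2D grid so parallel groups can be addressed by axis.
    Hypothetical sketch, not the ColossalAI DeviceMesh API."""

    def __init__(self, num_ranks, mesh_shape):
        rows, cols = mesh_shape
        assert rows * cols == num_ranks, "mesh shape must cover all ranks"
        self.shape = mesh_shape
        # row-major placement of ranks onto the logical grid
        self.mesh = [[r * cols + c for c in range(cols)] for r in range(rows)]

    def coordinate(self, rank):
        """Return the (row, col) logical coordinate of a flat rank."""
        return divmod(rank, self.shape[1])

    def row_group(self, rank):
        """Ranks sharing this rank's row (e.g. one tensor-parallel group)."""
        row, _ = self.coordinate(rank)
        return self.mesh[row]
```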
ver217
04c9a86af8 [zero] ZeroDDP supports controlling outputs' dtype (#1399) 2022-08-02 17:49:11 +08:00
HELSON
4e98e938ce [zero] alleviate memory usage in ZeRODDP state_dict (#1398) 2022-08-02 15:49:13 +08:00
ver217
56b8863b87 [zero] chunk manager allows filtering ex-large params (#1393) 2022-08-02 10:40:27 +08:00
Frank Lee
7d6293927f [fx] patched torch.max and data movement operator (#1391)
* [fx] patched torch.max and data movement operator

* polish code
2022-08-01 15:31:50 +08:00
Frank Lee
89e60d1505 [fx] fixed indentation error in checkpointing codegen (#1385) 2022-07-30 00:27:12 +08:00
HELSON
c7221cb2d4 [hotfix] adapt ProcessGroup and Optimizer to ColoTensor (#1388) 2022-07-29 19:33:24 +08:00
Frank Lee
ad678921db [fx] patched torch.full for huggingface opt (#1386) 2022-07-29 17:56:28 +08:00
HELSON
527758b2ae [hotfix] fix a running error in test_colo_checkpoint.py (#1387) 2022-07-29 15:58:06 +08:00
Jiarui Fang
f792507ff3 [chunk] add PG check for tensor appending (#1383) 2022-07-29 13:27:05 +08:00
ver217
8dced41ad0 [zero] zero optim state_dict takes only_rank_0 (#1384)
* zero optim state_dict takes only_rank_0

* fix unit test
2022-07-29 13:22:50 +08:00
YuliangLiu0306
df54481473 [hotfix] fix some bugs during gpt2 testing (#1379) 2022-07-28 17:21:07 +08:00
ver217
828b9e5e0d [hotfix] fix zero optim save/load state dict (#1381) 2022-07-28 17:19:39 +08:00
HELSON
b6fd165f66 [checkpoint] add kwargs for load_state_dict (#1374) 2022-07-28 15:56:52 +08:00
ver217
83328329dd [hotfix] fix zero ddp buffer cast (#1376)
* fix zero ddp buffer cast

* fix zero ddp ignore params
2022-07-28 10:54:44 +08:00
ver217
5d5031e946 fix zero ddp state dict (#1378) 2022-07-28 09:31:42 +08:00
Frank Lee
0c1a16ea5b [util] standard checkpoint function naming (#1377) 2022-07-28 09:29:30 +08:00
YuliangLiu0306
52bc2dc271 [fx] update split module pass and add customized policy (#1373)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx]update split module pass and add customized policy
2022-07-27 13:40:54 +08:00
Super Daniel
be229217ce [fx] add torchaudio test (#1369)
* [fx]add torchaudio test

* [fx]add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test and test patches

* Delete ~

* [fx] add patches and patches test

* [fx] add patches and patches test

* [fx] fix patches

* [fx] fix rnn patches

* [fx] fix rnn patches

* [fx] fix rnn patches

* [fx] fix rnn patches

* [fx] merge upstream

* [fx] fix import errors
2022-07-27 11:03:14 +08:00
ver217
c415240db6 [nvme] CPUAdam and HybridAdam support NVMe offload (#1360)
* impl nvme optimizer

* update cpu adam

* add unit test

* update hybrid adam

* update docstr

* add TODOs

* update CI

* fix CI

* fix CI

* fix CI path

* fix CI path

* fix CI path

* fix install tensornvme

* fix CI

* fix CI path

* fix CI env variables

* test CI

* test CI

* fix CI

* fix nvme optim __del__

* fix adam __del__

* fix nvme optim

* fix CI env variables

* fix nvme optim import

* test CI

* test CI

* fix CI
2022-07-26 17:25:24 +08:00
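The NVMe-offload commit above lets CPUAdam/HybridAdam keep optimizer states on an NVMe drive rather than in RAM. As a rough sketch of the core idea (states live on disk and are paged in only when a step touches them), here is a stdlib-only toy — the real path uses the tensornvme library with asynchronous reads and writes, which this does not attempt to model:

```python
import os
import pickle
import tempfile

class DiskOffloadedState:
    """Toy optimizer-state offload: state values live in a file (standing
    in for an NVMe drive) and are loaded into memory only on demand.
    Sketch of the idea only -- not the CPUAdam/HybridAdam implementation."""

    def __init__(self):
        fd, self.path = tempfile.mkstemp(suffix=".state")
        os.close(fd)
        self.offsets = {}  # state name -> (offset, length) within the file

    def write(self, name, values):
        """Append a state blob to the backing file and record its span."""
        with open(self.path, "ab") as f:
            start = f.tell()
            data = pickle.dumps(values)
            f.write(data)
        self.offsets[name] = (start, len(data))

    def read(self, name):
        """Page a state blob back into memory for the optimizer step."""
        start, length = self.offsets[name]
        with open(self.path, "rb") as f:
            f.seek(start)
            return pickle.loads(f.read(length))
```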
HELSON
8463290642 [checkpoint] use args, kwargs in save_checkpoint, load_checkpoint (#1368) 2022-07-26 14:41:53 +08:00
YuliangLiu0306
5542816690 [fx]add gpt2 passes for pipeline performance test (#1366)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx]add gpt2 passes for pipeline performance test
2022-07-26 14:31:00 +08:00
HELSON
87775a0682 [colotensor] use cpu memory to store state_dict (#1367) 2022-07-26 14:13:38 +08:00
HELSON
943a96323e [hotfix] fix no optimizer in save/load (#1363) 2022-07-26 10:53:53 +08:00
Frank Lee
cd063ac37f [fx] added activation checkpoint codegen support for torch < 1.12 (#1359) 2022-07-25 23:35:31 +08:00
Frank Lee
644582eee9 [fx] added activation checkpoint codegen (#1355) 2022-07-25 09:39:10 +08:00
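The two codegen commits above generate code that applies activation checkpointing: rather than caching a layer's output for the backward pass, only its input is kept and the forward is recomputed when needed, trading compute for memory. A framework-free toy illustrating that trade-off (the class is invented; the real feature emits torch code via FX codegen):

```python
class Checkpointed:
    """Toy activation checkpoint: keep only a layer's input and
    recompute the forward when the activation is needed again.
    Illustrates the compute-for-memory trade the codegen automates."""

    def __init__(self, fn):
        self.fn = fn
        self.saved_input = None
        self.recompute_count = 0

    def forward(self, x):
        self.saved_input = x       # cheap: keep input, drop activation
        return self.fn(x)

    def recompute(self):
        """Called at 'backward' time to regenerate the activation."""
        self.recompute_count += 1
        return self.fn(self.saved_input)
```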
ver217
6b43c789fd fix zero optim backward_by_grad and save/load (#1353) 2022-07-21 16:43:58 +08:00
ver217
d068af81a3 [doc] update rst and docstring (#1351)
* update rst

* add zero docstr

* fix docstr

* remove fx.tracer.meta_patch

* fix docstr

* fix docstr

* update fx rst

* fix fx docstr

* remove useless rst
2022-07-21 15:54:53 +08:00
Frank Lee
274c1a3b5f [fx] fixed apex normalization patch exception (#1352) 2022-07-21 15:29:11 +08:00
ver217
ce470ba37e [checkpoint] sharded optim save/load grad scaler (#1350) 2022-07-21 15:21:21 +08:00
Frank Lee
05fae1fd56 [fx] added activation checkpointing annotation (#1349)
* [fx] added activation checkpointing annotation

* polish code

* polish code
2022-07-21 11:14:28 +08:00
YuliangLiu0306
051592c64e [fx] update MetaInforProp pass to process more complex node.meta (#1344)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx] update MetaInforProp pass to process more complex node.meta
2022-07-21 10:57:52 +08:00
HELSON
7a8702c06d [colotensor] add Tensor.view op and its unit test (#1343)
[colotensor] add megatron initialization for gpt2
2022-07-21 10:53:15 +08:00
YuliangLiu0306
942c8cd1fb [fx] refactor tracer to trace complete graph (#1342)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx] refactor tracer to trace complete graph

* add comments and solve conflicts.
2022-07-20 11:20:38 +08:00
Frank Lee
2cc1175c76 [fx] tested the complete workflow for auto-parallel (#1336)
* [fx] tested the complete workflow for auto-parallel

* polish code

* polish code

* polish code
2022-07-20 10:45:17 +08:00
YuliangLiu0306
4631fef8a0 [fx]refactor tracer (#1335) 2022-07-19 15:50:42 +08:00
HELSON
f92c100ddd [checkpoint] use gather_tensor in checkpoint and update its unit test (#1339) 2022-07-19 14:15:28 +08:00
ver217
0c51ff2c13 [hotfix] ZeroDDP use new process group (#1333)
* process group supports getting ranks in group

* chunk mgr receives a process group

* update unit test

* fix unit tests
2022-07-18 14:14:52 +08:00
Frank Lee
75abc75c15 [fx] fixed compatibility issue with torch 1.10 (#1331) 2022-07-18 11:41:27 +08:00
ver217
7a05367101 [hotfix] shared model returns cpu state_dict (#1328) 2022-07-15 22:11:37 +08:00
Frank Lee
b2475d8c5c [fx] fixed unit tests for torch 1.12 (#1327) 2022-07-15 18:22:15 +08:00
HELSON
d49708ae43 [hotfix] fix ddp for unit test test_gpt2 (#1326) 2022-07-15 18:19:52 +08:00
Frank Lee
250be4d31e [utils] integrated colotensor with lazy init context (#1324)
* [utils] integrated colotensor with lazy init context

* polish code

* polish code

* polish code
2022-07-15 17:47:12 +08:00
YuliangLiu0306
e8acf55e8b [fx] add balanced policy v2 (#1251)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx] add balanced policy v2

* add unittest
2022-07-15 14:54:26 +08:00