Commit Graph

1213 Commits

Author | SHA1 | Message | Date
ver217
821c6172e2 [utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442) 2022-08-11 22:58:58 +08:00
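The clip_grad_norm commit above implements global gradient-norm clipping for sharded tensors. As a concept sketch of what such clipping does (plain Python over lists, not the ColossalAI `ColoTensor`/`ZeroOptimizer` API):

```python
import math

def clip_grad_norm_(grads, max_norm, eps=1e-6):
    """Scale a list of gradient vectors in place so their global L2 norm
    does not exceed max_norm; returns the pre-clip norm (same contract as
    torch.nn.utils.clip_grad_norm_, here over plain Python lists)."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    clip_coef = max_norm / (total_norm + eps)
    if clip_coef < 1.0:
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= clip_coef  # uniform rescale preserves direction
    return total_norm
```

The ZeRO twist the commit addresses is that each rank holds only a shard of the gradients, so `total_norm` must be reduced across ranks before the scale factor is applied.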
HELSON
b80340168e [zero] add chunk_managerV2 for all-gather chunk (#1441) 2022-08-11 19:17:24 +08:00
Super Daniel
3b26516c69 [fx] add vanilla activation checkpoint search with test on resnet and densenet (#1433)
* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.
2022-08-11 15:46:39 +08:00
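The "Chen strategies" named in the commit body refer to the sqrt(n) heuristic from Chen et al.'s "Training Deep Nets with Sublinear Memory Cost": checkpoint roughly every sqrt(n)-th node so that both stored activations and peak recompute stay O(sqrt(n)). A minimal sketch of that checkpoint placement (illustrative, not the solver in the PR):

```python
import math

def chen_sqrt_checkpoints(num_nodes):
    """Place a checkpoint every ~sqrt(n) nodes of a linear graph, the
    sqrt(n) heuristic from Chen et al.; returns checkpointed node indices."""
    if num_nodes <= 1:
        return []
    k = max(1, round(math.sqrt(num_nodes)))  # segment length ~ sqrt(n)
    return list(range(k, num_nodes, k))
```

During backward, each segment between consecutive checkpoints is recomputed from its stored boundary activation instead of being kept in memory.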
Jiarui Fang
30b4dd17c0 [FAW] export FAW in _ops (#1438) 2022-08-11 13:43:24 +08:00
HELSON
9056677b13 [zero] add chunk size searching algorithm for parameters in different groups (#1436) 2022-08-11 13:32:19 +08:00
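Chunk-based ZeRO packs parameters into fixed-size buffers, so the chunk size determines how much padding is wasted at chunk boundaries. A toy version of such a search (a hypothetical stand-in for the algorithm in the commit, which additionally handles multiple parameter groups):

```python
def search_chunk_size(param_sizes, candidates):
    """Pick the candidate chunk size that wastes the least padding when
    parameters are packed greedily, in order, into fixed-size chunks."""
    def waste(chunk_size):
        used, total = 0, 0
        for size in param_sizes:
            if size > chunk_size:
                return float("inf")        # candidate cannot hold this parameter
            if used + size > chunk_size:   # close current chunk, open a new one
                total += chunk_size - used # padding lost in the closed chunk
                used = 0
            used += size
        total += chunk_size - used if used else 0
        return total
    return min(candidates, key=waste)
```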
HELSON
039b7ed3bc [polish] add update directory in gemini; rename AgChunk to ChunkV2 (#1432) 2022-08-10 16:40:29 +08:00
Super Daniel
f20cb4e893 [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages (#1425)
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
2022-08-10 16:36:35 +08:00
Jiarui Fang
89c434a0a6 [polish] add test_ops directory (#1431) 2022-08-10 15:35:26 +08:00
Jiarui Fang
10b3df65c8 [FAW] move coloparam setting in test code. (#1429) 2022-08-10 14:31:53 +08:00
Jiarui Fang
cb98cf5558 [FAW] parallel FreqAwareEmbedding (#1424) 2022-08-10 13:44:30 +08:00
HELSON
0d212183c4 [zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk (#1426) 2022-08-10 11:37:28 +08:00
YuliangLiu0306
33f0744d51 [tensor] add shape consistency feature to support auto spec transform (#1418)
* [tensor] add shape consistency feature to support auto sharding spec transform.

* [tensor] remove unused argument in simulator, add doc string for target pair.
2022-08-10 11:29:17 +08:00
HELSON
4fb3c52cf0 [zero] add unit test for AgChunk's append, close, access (#1423) 2022-08-09 18:03:10 +08:00
Jiarui Fang
d209aff684 Add FreqAwareEmbeddingBag (#1421) 2022-08-09 16:26:12 +08:00
Jiarui Fang
504419d261 [FAW] add cache manager for the cached embedding (#1419) 2022-08-09 15:17:17 +08:00
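The FAW (frequency-aware) embedding line of commits caches hot embedding rows on the device and leaves cold rows in host memory. A toy cache manager showing the eviction idea (an illustration only, not the `CachedEmbedding` implementation):

```python
from collections import Counter

class RowCacheManager:
    """Keep the most frequently accessed embedding rows 'resident',
    evicting the coldest row by access count when the cache is full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cached = set()    # row ids currently resident
        self.freq = Counter()  # lifetime access counts per row id

    def access(self, row_id):
        self.freq[row_id] += 1
        if row_id in self.cached:
            return "hit"
        if len(self.cached) >= self.capacity:
            coldest = min(self.cached, key=lambda r: self.freq[r])
            self.cached.remove(coldest)  # evict least-frequently-used row
        self.cached.add(row_id)
        return "miss"
```

The real embedding bag would pair each hit with a device-side lookup and each miss with a host-to-device row copy.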
Kirigaya Kazuto
44fd3c83ab [communication] add p2p_v2.py to support communication with List[Any] (#1407)
* support p2p communication with any type of object | pass test

* reconstruct pipeline schedule with p2p_v2.py (support communication with List[Any]) | pass test

* [communication] add p2p_v2.py to support communication with List[Any]

* Delete _pipeline_schedule_v2.py

* Delete test_cifar_with_data_pipeline_tensor_v2.py

* [engine/schedule] use p2p_v2 to reconstruct pipeline_schedule

* [communication] remove print code

* [communication] remove print code
2022-08-09 11:40:04 +08:00
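Sending `List[Any]` over a tensor-only transport generally means serializing the object into a byte buffer, sending its length first so the receiver can allocate, then deserializing. A transport-agnostic sketch of that pattern (conceptual; the PR does this over `torch.distributed` send/recv):

```python
import pickle

def object_to_buffer(obj):
    """Serialize an arbitrary picklable object into (length, payload),
    the shape needed by channels that only move fixed-size byte buffers."""
    payload = pickle.dumps(obj)
    return len(payload), payload

def buffer_to_object(length, payload):
    """Inverse of object_to_buffer; length lets the receiver trim padding
    if the payload arrived in an over-allocated buffer."""
    return pickle.loads(payload[:length])
```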
YuliangLiu0306
7c96055c68 [tensor] build sharding spec to replace distspec in the future (#1405) 2022-08-08 11:15:57 +08:00
ver217
12b4887097 [hotfix] fix CPUAdam kernel nullptr (#1410) 2022-08-05 19:45:45 +08:00
YuliangLiu0306
0442f940f0 [device] add DeviceMesh class to support logical device layout (#1394)
* [device] add DeviceMesh class to support logical device layout

* polish code

* add doc string
2022-08-02 19:23:48 +08:00
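A device mesh views the flat list of ranks as an n-dimensional logical grid, so sharding strategies can address devices by coordinate. The core bookkeeping is the row-major rank/coordinate mapping (a sketch of the concept, not the `DeviceMesh` class itself):

```python
def rank_to_coordinate(rank, mesh_shape):
    """Map a flat rank to its row-major coordinate in a logical mesh."""
    coord = []
    for dim in reversed(mesh_shape):
        coord.append(rank % dim)
        rank //= dim
    return tuple(reversed(coord))

def coordinate_to_rank(coord, mesh_shape):
    """Inverse mapping: logical coordinate back to the flat rank."""
    rank = 0
    for c, dim in zip(coord, mesh_shape):
        rank = rank * dim + c
    return rank
```

For a 2x4 mesh, rank 5 sits at coordinate (1, 1): row 1 (e.g. the pipeline axis), column 1 (e.g. the tensor-parallel axis).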
HELSON
4e98e938ce [zero] alleviate memory usage in ZeRODDP state_dict (#1398) 2022-08-02 15:49:13 +08:00
Frank Lee
adf5054ff8 [fx] fixed torchaudio conformer tracing (#1392) 2022-08-01 16:08:28 +08:00
Frank Lee
7d6293927f [fx] patched torch.max and data movement operator (#1391)
* [fx] patched torch.max and data movement operator

* polish code
2022-08-01 15:31:50 +08:00
HELSON
527758b2ae [hotfix] fix a running error in test_colo_checkpoint.py (#1387) 2022-07-29 15:58:06 +08:00
ver217
8dced41ad0 [zero] zero optim state_dict takes only_rank_0 (#1384)
* zero optim state_dict takes only_rank_0

* fix unit test
2022-07-29 13:22:50 +08:00
ver217
7d5d628e07 [DDP] test ddp state dict uses more strict threshold (#1382) 2022-07-28 17:29:04 +08:00
ver217
828b9e5e0d [hotfix] fix zero optim save/load state dict (#1381) 2022-07-28 17:19:39 +08:00
Super Daniel
be229217ce [fx] add torchaudio test (#1369)
* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test and test patches

* Delete ~

* [fx] add patches and patches test

* [fx] add patches and patches test

* [fx] fix patches

* [fx] fix rnn patches

* [fx] fix rnn patches

* [fx] fix rnn patches

* [fx] fix rnn patches

* [fx] merge upstream

* [fx] fix import errors
2022-07-27 11:03:14 +08:00
Boyuan Yao
bb640ec728 [fx] Add colotracer compatibility test on torchrec (#1370) 2022-07-26 17:54:39 +08:00
ver217
c415240db6 [nvme] CPUAdam and HybridAdam support NVMe offload (#1360)
* impl nvme optimizer

* update cpu adam

* add unit test

* update hybrid adam

* update docstr

* add TODOs

* update CI

* fix CI

* fix CI

* fix CI path

* fix CI path

* fix CI path

* fix install tensornvme

* fix CI

* fix CI path

* fix CI env variables

* test CI

* test CI

* fix CI

* fix nvme optim __del__

* fix adam __del__

* fix nvme optim

* fix CI env variables

* fix nvme optim import

* test CI

* test CI

* fix CI
2022-07-26 17:25:24 +08:00
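NVMe offload keeps optimizer states (e.g. Adam's momentum buffers) on disk between steps, freeing host memory at the cost of read/write latency. A minimal round-trip sketch of the idea (the tensornvme backend used by the PR does this asynchronously with direct I/O; here we just use an ordinary temp file):

```python
import os
import pickle
import tempfile

class DiskOffloader:
    """Toy offloader: write a named optimizer state to disk after the
    update, read it back just before the next update."""
    def __init__(self):
        self.dir = tempfile.mkdtemp()
        self.paths = {}

    def offload(self, key, state):
        path = os.path.join(self.dir, f"{key}.bin")
        with open(path, "wb") as f:
            pickle.dump(state, f)       # state leaves host memory
        self.paths[key] = path

    def fetch(self, key):
        with open(self.paths[key], "rb") as f:
            return pickle.load(f)       # state restored before the update
```

The fix-heavy tail of the commit body (repeated "fix CI", `__del__` fixes) reflects the extra lifecycle care such out-of-core state needs: files must outlive the optimizer step but be cleaned up on teardown.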
HELSON
87775a0682 [colotensor] use cpu memory to store state_dict (#1367) 2022-07-26 14:13:38 +08:00
Frank Lee
cd063ac37f [fx] added activation checkpoint codegen support for torch < 1.12 (#1359) 2022-07-25 23:35:31 +08:00
HELSON
4417804129 [unit test] add megatron init test in zero_optim (#1358) 2022-07-25 11:18:08 +08:00
HELSON
7a065dc9f6 [hotfix] fix megatron_init in test_gpt2.py (#1357) 2022-07-25 10:28:19 +08:00
Frank Lee
644582eee9 [fx] added activation checkpoint codegen (#1355) 2022-07-25 09:39:10 +08:00
Frank Lee
05fae1fd56 [fx] added activation checkpointing annotation (#1349)
* [fx] added activation checkpointing annotation

* polish code

* polish code
2022-07-21 11:14:28 +08:00
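The activation-checkpointing commits (annotation, codegen, solver) all build on the same trade: store only some intermediates during the forward pass and recompute the rest from the nearest stored one during backward. A framework-free sketch over a chain of functions (illustrative; the real feature generates `torch.utils.checkpoint` calls):

```python
def run_with_checkpoints(funcs, x, checkpoint_every):
    """Run a chain of functions, saving only every k-th intermediate;
    the returned recompute closure rebuilds any dropped intermediate
    from the nearest saved checkpoint."""
    saved = {0: x}
    out = x
    for i, f in enumerate(funcs, start=1):
        out = f(out)
        if i % checkpoint_every == 0:
            saved[i] = out              # keep only checkpointed activations

    def recompute_activation(i):
        j = max(k for k in saved if k <= i)  # nearest saved checkpoint <= i
        v = saved[j]
        for f in funcs[j:i]:                 # replay the dropped segment
            v = f(v)
        return v

    return out, recompute_activation
```

With `checkpoint_every=2` over n functions, only ~n/2 activations are stored, and any dropped one costs at most one extra function evaluation to recover.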
HELSON
7a8702c06d [colotensor] add Tensor.view op and its unit test (#1343)
[colotensor] add megatron initialization for gpt2
2022-07-21 10:53:15 +08:00
YuliangLiu0306
942c8cd1fb [fx] refactor tracer to trace complete graph (#1342)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx] refactor tracer to trace complete graph

* add comments and solve conflicts.
2022-07-20 11:20:38 +08:00
Frank Lee
2cc1175c76 [fx] tested the complete workflow for auto-parallel (#1336)
* [fx] tested the complete workflow for auto-parallel

* polish code

* polish code

* polish code
2022-07-20 10:45:17 +08:00
YuliangLiu0306
4631fef8a0 [fx] refactor tracer (#1335) 2022-07-19 15:50:42 +08:00
HELSON
bf5066fba7 [refactor] refactor ColoTensor's unit tests (#1340) 2022-07-19 15:46:24 +08:00
HELSON
f92c100ddd [checkpoint] use gather_tensor in checkpoint and update its unit test (#1339) 2022-07-19 14:15:28 +08:00
Frank Lee
f3ce7b8336 [fx] recovered skipped pipeline tests (#1338) 2022-07-19 09:49:50 +08:00
ver217
0c51ff2c13 [hotfix] ZeroDDP use new process group (#1333)
* process group supports getting ranks in group

* chunk mgr receives a process group

* update unit test

* fix unit tests
2022-07-18 14:14:52 +08:00
Frank Lee
75abc75c15 [fx] fixed compatibility issue with torch 1.10 (#1331) 2022-07-18 11:41:27 +08:00
Frank Lee
169954f87e [test] removed outdated unit test for meta context (#1329) 2022-07-15 23:16:23 +08:00
ver217
7a05367101 [hotfix] shared model returns cpu state_dict (#1328) 2022-07-15 22:11:37 +08:00
Frank Lee
b2475d8c5c [fx] fixed unit tests for torch 1.12 (#1327) 2022-07-15 18:22:15 +08:00
HELSON
d49708ae43 [hotfix] fix ddp for unit test test_gpt2 (#1326) 2022-07-15 18:19:52 +08:00
Frank Lee
250be4d31e [utils] integrated colotensor with lazy init context (#1324)
* [utils] integrated colotensor with lazy init context

* polish code

* polish code

* polish code
2022-07-15 17:47:12 +08:00
YuliangLiu0306
e8acf55e8b [fx] add balanced policy v2 (#1251)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx] add balanced policy v2

* add unittest
2022-07-15 14:54:26 +08:00