Commit Graph

75 Commits

Hongxin Liu
079bf3cb26 [misc] update pre-commit and run all files (#4752)
* [misc] update pre-commit

* [misc] run pre-commit

* [misc] remove useless configuration files

* [misc] ignore cuda for clang-format
2023-09-19 14:20:26 +08:00
Hongxin Liu
27061426f7 [gemini] improve compatibility and add static placement policy (#4479)
* [gemini] remove distributed-related part from colotensor (#4379)

* [gemini] remove process group dependency

* [gemini] remove tp part from colo tensor

* [gemini] patch inplace op

* [gemini] fix param op hook and update tests

* [test] remove useless tests

* [test] remove useless tests

* [misc] fix requirements

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [misc] update requirements

* [gemini] refactor gemini optimizer and gemini ddp (#4398)

* [gemini] update optimizer interface

* [gemini] renaming gemini optimizer

* [gemini] refactor gemini ddp class

* [example] update gemini related example

* [example] update gemini related example

* [plugin] fix gemini plugin args

* [test] update gemini ckpt tests

* [gemini] fix checkpoint io

* [example] fix opt example requirements

* [example] fix opt example

* [example] fix opt example

* [example] fix opt example

* [gemini] add static placement policy (#4443)

* [gemini] add static placement policy

* [gemini] fix param offload

* [test] update gemini tests

* [plugin] update gemini plugin

* [plugin] update gemini plugin docstr

* [misc] fix flash attn requirement

* [test] fix gemini checkpoint io test

* [example] update resnet example result (#4457)

* [example] update bert example result (#4458)

* [doc] update gemini doc (#4468)

* [example] update gemini related examples (#4473)

* [example] update gpt example

* [example] update dreambooth example

* [example] update vit

* [example] update opt

* [example] update palm

* [example] update vit and opt benchmark

* [hotfix] fix bert in model zoo (#4480)

* [hotfix] fix bert in model zoo

* [test] remove chatglm gemini test

* [test] remove sam gemini test

* [test] remove vit gemini test

* [hotfix] fix opt tutorial example (#4497)

* [hotfix] fix opt tutorial example

* [hotfix] fix opt tutorial example
2023-08-24 09:29:25 +08:00
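The static placement policy introduced above (#4443, surfaced through the Gemini plugin in #4479) can be pictured with the minimal sketch below. This is an illustrative assumption, not code from this repository: the keyword names (`placement_policy`, `offload_optim_frac`) and the launch call may differ between ColossalAI releases.

```python
# Hypothetical usage sketch of the Gemini plugin's static placement policy.
# Argument names are assumptions and may not match every release.
import torch.nn as nn
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch(config={})        # run this script under torchrun

model = nn.Linear(1024, 1024)
optimizer = HybridAdam(model.parameters(), lr=1e-3)

# "static" pins a fixed fraction of parameters/optimizer states to CPU,
# in contrast to the dynamic "auto" policy.
plugin = GeminiPlugin(placement_policy="static", offload_optim_frac=0.0)
booster = Booster(plugin=plugin)
model, optimizer, *_ = booster.boost(model, optimizer)
```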
digger-yu
b9a8dff7e5 [doc] Fix typos under colossalai and doc (#3618)
* Fixed several spelling errors under colossalai

* Fix spelling errors in the colossalai and docs directories

* Carefully changed the spelling errors under the example folder

* Update runtime_preparation_pass.py

revert autograft to autograd

* Update search_chunk.py

change utile to until

* Update check_installation.py

change misteach to mismatch in line 91

* Update 1D_tensor_parallel.md

revert to perceptron

* Update 2D_tensor_parallel.md

revert to perceptron in line 73

* Update 2p5D_tensor_parallel.md

revert to perceptron in line 71

* Update 3D_tensor_parallel.md

revert to perceptron in line 80

* Update README.md

revert to resnet in line 42

* Update reorder_graph.py

revert to indice in line 7

* Update p2p.py

revert to megatron in line 94

* Update initialize.py

revert to torchrun in line 198

* Update routers.py

change to detailed in line 63

* Update routers.py

change to detailed in line 146

* Update README.md

revert to random number in line 402
2023-04-26 11:38:43 +08:00
YH
1a229045af Add interface for colo tensor dp size (#3227) 2023-03-27 09:42:21 +08:00
Jiatong (Julius) Han
8c8a39be95 [hotfix]: Remove math.prod dependency (#2837)
* Remove math.prod dependency

* Fix style

* Fix style

---------

Co-authored-by: Jiatong Han <jiatong.han@u.nus.edu>
2023-02-23 23:56:15 +08:00
HELSON
552183bb74 [polish] polish ColoTensor and its submodules (#2537) 2023-02-03 11:44:10 +08:00
HELSON
707b11d4a0 [gemini] update ddp strict mode (#2518)
* [zero] add strict ddp mode for chunk init

* [gemini] update gpt example
2023-01-28 14:35:25 +08:00
Jiarui Fang
1aaeb596c6 [example] gpt, shard init on all processes (#2366) 2023-01-06 15:44:50 +08:00
xcnick
85178a397a [hotfix] fix error for torch 2.0 (#2243) 2022-12-30 23:11:55 +08:00
HELSON
2458659919 [zero] fix error for BEiT models (#2169)
* [zero] fix error for BEiT models

* [ColoParameter] add unpack operation for tuple arguments

* fix bugs

* fix chunkv2 unit testing

* add assertion for gradient state
2022-12-26 15:03:54 +08:00
Jiarui Fang
2827f41898 [Gemini] convert GeminiDDP to a PyTorch Module. (#2151) 2022-12-20 10:19:36 +08:00
YuliangLiu0306
49216d7ab1 [autoparallel] fix bugs caused by negative dim key (#1808)
* [autoparallel] fix bugs caused by negative dim key

* fix import error

* fix matmul test issue

* fix unit test issue
2022-11-08 17:03:50 +08:00
HELSON
c6a1a62636 [hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786)
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12

* [zero] add cpu shard init

* [zero] add tiny example test

* [colo_tensor] fix bugs for torch-1.11
2022-11-02 16:11:34 +08:00
Jiarui Fang
a1476ea882 [NFC] polish doc style for ColoTensor (#1457) 2022-08-16 09:21:05 +08:00
HELSON
7a8702c06d [colotensor] add Tensor.view op and its unit test (#1343)
[colotensor] add megatron initialization for gpt2
2022-07-21 10:53:15 +08:00
HELSON
f92c100ddd [checkpoint] use gather_tensor in checkpoint and update its unit test (#1339) 2022-07-19 14:15:28 +08:00
HELSON
1b41686461 [hotfix] fix unit test test_module_spec (#1321) 2022-07-15 14:02:32 +08:00
HELSON
260a55804a [hotfix] fix shape error in backward when using ColoTensor (#1298) 2022-07-13 23:06:12 +08:00
ver217
7aadcbd070 hotfix colotensor _scan_for_pg_from_args (#1276) 2022-07-12 20:46:31 +08:00
Jiarui Fang
c92f84fcdb [tensor] distributed checkpointing for parameters (#1240) 2022-07-12 15:51:06 +08:00
Jiarui Fang
1aad903c15 [tensor] redistribute among different process groups (#1247)
* make it faster

* [tensor] rename convert_to_dist -> redistribute

* [tensor] ShardSpec and ReplicaSpec

* [tensor] redistribute among diff pgs

* polish code
2022-07-12 10:24:05 +08:00
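A rough sketch of how the `ShardSpec`/`ReplicaSpec` and `redistribute` names introduced in the commit above might be used. It is reconstructed from the commit titles only; constructor signatures such as `ShardSpec([0], [2])` and `ColoTensorSpec(pg)` are assumptions, not code from the repository.

```python
# Hypothetical sketch based on the commit titles above; exact signatures
# are assumptions and may not match the repository.
import torch
from colossalai.tensor import (
    ColoTensor, ColoTensorSpec, ProcessGroup, ShardSpec, ReplicaSpec,
)

pg = ProcessGroup(tp_degree=2)                       # assumed: 2-way tensor-parallel group
t = ColoTensor(torch.randn(8, 8), ColoTensorSpec(pg))

# Shard dim 0 across the 2 ranks of the group (formerly `convert_to_dist`).
t = t.redistribute(ShardSpec([0], [2]))

# Gather the shards back so every rank holds a full replica.
t = t.redistribute(ReplicaSpec())
```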
Jiarui Fang
9bcd2fd4af [tensor] a shorter shard and replicate spec (#1245) 2022-07-11 15:51:48 +08:00
Jiarui Fang
2699dfbbfd [rename] convert_to_dist -> redistribute (#1243) 2022-07-11 13:05:44 +08:00
HELSON
f6add9b720 [tensor] redirect .data.__get__ to a tensor instance (#1239) 2022-07-11 11:41:29 +08:00
Jiarui Fang
4a76084dc9 [tensor] add zero_like colo op, important for Optimizer (#1236) 2022-07-08 14:55:27 +08:00
Jiarui Fang
3b500984b1 [tensor] fix some unittests (#1234) 2022-07-08 14:18:30 +08:00
HELSON
f071b500b6 [polish] polish __repr__ for ColoTensor, DistSpec, ProcessGroup (#1235) 2022-07-08 13:25:57 +08:00
Yi Zhao
04537bf83e [checkpoint] support generalized scheduler (#1222) 2022-07-07 18:16:38 +08:00
Jiarui Fang
a98319f023 [tensor] torch function return colotensor (#1229) 2022-07-07 18:09:18 +08:00
Jiarui Fang
ae7d3f4927 [refactor] move process group from _DistSpec to ColoTensor. (#1203) 2022-07-06 16:15:16 +08:00
Jiarui Fang
060b917daf [refactor] remove gpc dependency in colotensor's _ops (#1189) 2022-07-04 18:54:37 +08:00
Jiarui Fang
c463f8adf9 [tensor] remove gpc in tensor tests (#1186) 2022-06-29 14:08:40 +08:00
Jiarui Fang
1b657f9ce1 [tensor] revert local view back (#1178) 2022-06-27 18:38:34 +08:00
Jiarui Fang
aa7bef73d4 [Tensor] distributed view supports inter-process hybrid parallel (#1169) 2022-06-27 09:45:26 +08:00
Jiarui Fang
4b9bba8116 [ColoTensor] rename APIs and add output_replicate to ComputeSpec (#1168) 2022-06-24 13:08:54 +08:00
Jiarui Fang
f4ef224358 [Tensor] remove ParallelAction, use ComputeSpec instead (#1166) 2022-06-23 17:34:59 +08:00
Jiarui Fang
177c374401 remove gather out in parallel action (#1163) 2022-06-23 16:35:05 +08:00
Jiarui Fang
8cdce0399c [ColoTensor] improves init functions. (#1150) 2022-06-21 18:28:38 +08:00
ver217
a3b66f6def [tensor] refactor parallel action (#1007)
* refactor parallel action

* polish unit tests
2022-05-20 20:19:58 +08:00
ver217
ad536e308e [tensor] refactor colo-tensor (#992)
* refactor colo-tensor and update linear op

* polish code

* polish code

* update ops and unit tests

* update unit tests

* polish code

* rename dist_spec module

* polish code

* polish code

* remove unneeded import

* fix pipelinable
2022-05-19 12:44:59 +08:00
Jiarui Fang
802ac297cc [Tensor] remove useless import in tensor dir (#997) 2022-05-18 14:54:51 +08:00
ver217
67c33f57eb [tensor] design DistSpec and DistSpecManager for ColoTensor (#934)
* add dist spec

* update linear op

* polish code

* polish code

* update embedding op

* polish unit tests

* polish unit tests

* polish comments

* polish code

* add test_dist_spec_mgr

* polish code

* refactor folder structure

* polish unit tests

* add get_process_group() for TensorSpec

* polish code
2022-05-13 15:13:52 +08:00
ver217
4ca732349e [tensor] colo tensor overrides mul (#927)
* colo tensor overrides mul

* polish code
2022-05-10 16:04:08 +08:00
ver217
45b9124df4 [tensor] hijack addmm for colo tensor (#923)
* hijack addmm for colo tensor

* fix bugs

* polish unit test

* polish comments
2022-05-09 18:55:49 +08:00
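The "hijack addmm" commit above relies on PyTorch's function-override protocol. The snippet below is a generic `__torch_function__` illustration of that technique, not the ColossalAI implementation; `MyColoTensor` is a made-up name.

```python
# Generic illustration of intercepting torch.addmm for a tensor subclass via
# __torch_function__, so a parallel-aware kernel could be substituted.
import torch

class MyColoTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.addmm:
            # A real implementation would dispatch to a tensor-parallel addmm
            # here; this sketch just logs and falls through to the dense op.
            print("addmm hijacked")
        return super().__torch_function__(func, types, args, kwargs)

x = torch.randn(2, 3).as_subclass(MyColoTensor)
w = torch.randn(3, 4)
b = torch.randn(2, 4)
out = torch.addmm(b, x, w)   # routed through MyColoTensor.__torch_function__
```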
Ziyue Jiang
c195d2814c [Tensor] add from_pretrained support and bert pretrained test (#921)
* add from_pretrained support and test

* polish

* polish

* polish

* polish
2022-05-09 16:11:47 +08:00
Jiarui Fang
845856ea29 [Graph] building computing graph with ColoTensor, Linear only (#917) 2022-05-07 17:10:37 +08:00
Jiarui Fang
ab95ec9aea [Tensor] init ColoParameter (#914) 2022-05-06 12:57:14 +08:00
Ziyue Jiang
f593a5637e [Tensor] add embedding tp1d row (#904) 2022-04-29 14:10:05 +08:00
Ziyue Jiang
2c0d19d755 [Tensor] add ColoTensor TP1Dcol Embedding (#899) 2022-04-28 17:45:06 +08:00
Jiarui Fang
676f191532 [Tensor] activation is an attr of ColoTensor (#897) 2022-04-28 14:43:22 +08:00