1
0
mirror of https://github.com/hpcaitech/ColossalAI.git synced 2025-05-10 17:38:32 +00:00
Commit Graph

162 Commits

Author SHA1 Message Date
Super Daniel
e8a9bebc87
[autoparallel] refactor and add rotorc. ()
* [autoparallel] refactor and add rotorc.

* [autoparallel] refactor and add rotorc.
2022-11-03 12:32:51 +08:00
YuliangLiu0306
2c4c7b3618
[autoparallel] add getattr handler ()
* [autoparallel] add getattr haandler

* polish code

* add extra processes for Parameters

* add unit test for param resharding cost

* add docstring and polish test
2022-11-03 12:31:33 +08:00
YuliangLiu0306
e859380bf7
[fx] support module with bias addition ()
* [autoparallel] refactor tracer to fix bias addition issue

* [fx] support module with bias addition

* create bias_addition_module

* refactor file structure

* polish code

* fix unit test
2022-11-01 22:53:51 +08:00
Super Daniel
1e88811c7a
[autoparallel] move ckpt solvers to autoparallel folder / refactor code ()
* [autoparallel] first move.

* [autoparallel] add solver rotor.

* [autoparallel] add ckpt solvers.

* [autoparallel] modify codegen.

* [fx] fix annotation in test.

* [fx] remove check.

* [autoparallel] polish docstring.

* [fx] refactor MetaTensor.
2022-11-01 10:43:15 +08:00
Super Daniel
0584654c79
[fx] refactor memory utils and extend shard utils. ()
* [fx] change memory.py to memory_utils.py.

* [fx] add shard utils.

* [fx] fix import.

* [fx] check code style.

* [fx] add comment.

* [autoparallel] first move.

* [fx] add time computations.
2022-10-26 14:24:41 +08:00
YuliangLiu0306
314d8c497f
[autoparallel] refactor the runtime apply pass and add docstring to passes ()
* [autoparallel] refactor the runtime apply pass and add doc string to passes

* fix unit test

* polish
2022-10-25 14:32:22 +08:00
YuliangLiu0306
d2fc067231
[autoparallel] fix param hook issue in transform pass () 2022-10-24 13:13:38 +08:00
Frank Lee
262652c8bc
[autoparallel] added addbmm handler () 2022-10-21 18:55:48 +08:00
YuliangLiu0306
980ed21723
[autoparallel] shard param and buffer as expected ()
* [autoparallel] shard param and buffer as expected

* fix unit test issue
2022-10-21 15:45:13 +08:00
YuliangLiu0306
a4ce180e85
[autoparallel] add sequential order to communication actions () 2022-10-20 18:48:18 +08:00
Frank Lee
993b8875b6
[autoparallel] handled illegal sharding strategy in shape consistency ()
* [autoparallel] handled illegal sharding strategy in shape consistency

* polish code
2022-10-20 12:06:25 +08:00
Super Daniel
30874f1692
[fx/profiler] debug the fx.profiler / add an example test script for fx.profiler ()
* [fx/profiler] add test.

* [fx] fix file names.

* [fx] add docstring and comment.

* [fx] polish profiler.py.

* [fx] fix import errors.

* [fx] fix profiler.

* [fx] fix names.
2022-10-19 14:24:51 +08:00
YuliangLiu0306
51b89d2202
[autoparallel] runtime_backward_apply () 2022-10-18 10:44:58 +08:00
Super Daniel
393f594051
[fx/meta/rpc] move _meta_registration.py to fx folder / register fx functions with compatibility checks / remove color debug ()
* [fx] move meta registration

* [fx] fix tests.

* [fx] fix test.

* [fx] fix.

* [meta] refactor meta registration.py.

* [fx] add compatibility descriptions.

* [fx] polish import.

* [fx] add a decorator.

* [fx] fix tests.

* [fx] remove print.

* [fx] edit raise error.

* [fx] edit raise error.

* [fx] add type hint.

* [fx] fix import in experimental.

* [rpc] remove color debug.

* [meta] fix naming.
2022-10-18 10:44:23 +08:00
YuliangLiu0306
845ff4a47a
[autoparallel] resnet block runtime apply ()
* [autoparallel] resnet block runtime apply

* seperate buffer and parameter in MemoryCost

* polish code

* add comments and todos

* fix test issue
2022-10-17 13:37:38 +08:00
YuliangLiu0306
451cd72dea
[autoparallel] adapt runtime passes ()
* [autoparallel] adapt runtime passes v2

* polish code
2022-10-14 10:14:07 +08:00
Boyuan Yao
31d2f03d27
[autoparallel] fix C version rotor inconsistency () 2022-10-12 15:21:58 +08:00
Super Daniel
3dd6994427
[fx/profiler] assigned UUID to each unrecorded tensor/ improved performance on GPT-2 ()
* [fx/profiler] modify data_ptr into uuid for all tensors.

* [fx] modify uuid.

* [fx/profiler] tune performance on GPT-2.

* [fx] updates.

* [fx] debug.

* [fx] debug.

* [fx] cuda.
2022-10-11 11:03:35 +08:00
Boyuan Yao
b1be5b88bd
[autoparallel] fix insecure subprocess ()
* [autoparallel] fix insecure subprocess

* [fx] fix insecure subprocess
2022-10-06 15:07:03 +08:00
Boyuan Yao
d8420f81a4
[hotfix] fix wrong type name in profiler () 2022-10-05 21:59:05 +08:00
Boyuan Yao
132b4306b7
[fx] Add concrete info prop ()
* [fx] concreteinfoprop

* [fx] add concreteinfoprop

* [fx] modify docstring of ConcreteInfoProp

* [fx] fix device error

* [fx] modify parameter calculation

* [fx] modify parameters calculation
2022-10-04 16:48:24 +08:00
Boyuan Yao
1df98d5b66
[autoparallel] add rotor C version ()
* [autoparallel] add rotor c version

* [fx] remove metainfoprop in rotor solver

* [autoparallel] modify C
 code format

* [autoparallel] remove build.py

* [autoparallel] fix C extension build

* [autoparallel] add C solver consistency test

* [autoparallel] remove some unused imports

* [autoparallel] refactor rotor solver code

* [autoparallel] replace print with colossalai logger

* [autoparallel] ranks fixed
2022-10-03 17:13:30 +08:00
Super Daniel
6135e178b3
[fx] refactor code for profiler / enable fake tensor movement. ()
* [fx/profiling] provide summary for MetaInfoProp.

* [fx/profiler] provide a table of summary.

* [fx/profiler] provide a table of summary.

* [fx/profiler] provide a table of summary.

* [fx/profiler] provide a table of summary.

* [fx] optimize table repr.

* [fx] optimize table repr.

* [fx] refactor code for profiler.

* [fx] add docstring.

* [fx] add docstring.

* [fx] skip test.

* [fx] redo.

* [fx] redo.

* [fx] fix import error for torch11.

* [fx] fix import error for torch11.
2022-09-27 10:26:52 +08:00
Boyuan Yao
5d0fdb9cb4
[fx] fix offload codegen test ()
* [fx] fix offload codegen test

* [fx] modify typing
2022-09-27 10:25:27 +08:00
Boyuan Yao
f921733621
[autoparallel] Add pofo sequence annotation ()
* [autoparallel] annotate pofo sequence

* [autoparallel] remove unused print

* [autoparallel] fix some code
2022-09-24 01:52:57 +08:00
Super Daniel
04bbabeea8
[fx/profiler] provide a table of summary. ()
* [fx/profiling] provide summary for MetaInfoProp.

* [fx/profiler] provide a table of summary.

* [fx] optimize table repr.
2022-09-23 18:12:43 +08:00
Boyuan Yao
d6b01feb66
[fx] Modify offload codegen ()
* [fx] modify offload codegen

* [fx] remove repeated hook definitions

* [fx] modify offload test
2022-09-23 11:04:52 +08:00
Super Daniel
d967779a32
[fx/profiler] tuned the calculation of memory estimation ()
* [fx] tuned the meta info and rotor solver.

* [fx] remove import.

* [fx] remove import.

* [fx] remove import.

* [fx] tune the meta calculations.

* [fx] polish comments.

* [fx] remove assertions.

* [fx] modify test cases.

* [fx] modify test cases.

* [fx] optimize import.

* [fx
2022-09-23 10:59:47 +08:00
YuliangLiu0306
7d1bb71d5d
[fx] PoC of runtime shape consistency application ()
* [fx] PoC of runtime shape consistency application

* polish code
2022-09-20 14:00:04 +08:00
Boyuan Yao
933b6c6367
[fx] Add pofo solver ()
* [fx] add pofo algorithm

* [fx] Add pofo solver

* [fx] code refactor

* [fx] fix test_linearize import
2022-09-20 11:20:48 +08:00
Super Daniel
cd5cf2bcc9
[fx/tuning] tune performance on rotor with meta info. () 2022-09-15 14:46:36 +08:00
Boyuan Yao
a7cda6f57d
[fx] Add offload codegen ()
* [fx] add input activation offload to codegen

* [fx] modify unit test

* [fx] remove two skips in torch11

* [fx] use all_input_nodes instead of _input_nodes
2022-09-14 15:49:06 +08:00
Super Daniel
c8e9b2ad78
[hotfix/rotor] fix variable names ()
* [fx] add some comment and docstrings.

* [fx] add dataflow analysis for an autograd graph.

* add intepretation for graph analysis.

* [fx] before doing save_tensor_hooks.

* [fx] provide an accurate estimation of memory except for GPT-2.

* [fx] provide an accurate estimation of memory except for GPT-2.

* [fx] provide an accurate estimation of memory except for GPT-2.

* [fx] a very accurate version on GPT-2.

* [fx] refactor code.

* [fx] remove redundant inplace=True.

* [fx] refactor code.

* [fx] refactor code.

* [fx] refactor code.

* [fx] dive into backward memory.

* [fx] fix variable names in ckpt_solvers and unskip tests.

* [fx] commit my changes.

* [fx] restore skips.

* [fx] restore skips.

* [fx] chaange stage into phase.

* [fx] chaange stage into phase.

* [fx] chaange stage into phase.
2022-09-14 14:27:04 +08:00
Super Daniel
5c494d4540
[fx] provide an accurate estimation of memory. ()
* [fx] add some comment and docstrings.

* [fx] add dataflow analysis for an autograd graph.

* add intepretation for graph analysis.

* [fx] before doing save_tensor_hooks.

* [fx] provide an accurate estimation of memory except for GPT-2.

* [fx] provide an accurate estimation of memory except for GPT-2.

* [fx] provide an accurate estimation of memory except for GPT-2.

* [fx] a very accurate version on GPT-2.

* [fx] refactor code.

* [fx] remove redundant inplace=True.

* [fx] refactor code.

* [fx] refactor code.

* [fx] refactor code.

* [fx] dive into backward memory.
2022-09-14 09:36:43 +08:00
Boyuan Yao
49ccf8b5f8
[fx] Improve linearize and rotor solver ()
* [fx] add nested activation_checkpoint codegen

* undo algorithms commits

* solver

* undo some commits

* [fx] torch11 add nested activation checkpoint codegen

* remove some imports

* [fx] add some comments in activation codegen

* [fx] codegen instance error fix

* [fx] imporve linearize and rotor solver

* [fx] some comments and format modification
2022-09-13 14:50:04 +08:00
Boyuan Yao
f3687e4ee2
[fx] Add nested checkpoint in activation checkpoint codegen ()
* [fx] add nested activation_checkpoint codegen

* undo algorithms commits

* solver

* undo some commits

* [fx] torch11 add nested activation checkpoint codegen

* remove some imports

* [fx] add some comments in activation codegen

* [fx] codegen instance error fix
2022-09-12 20:00:48 +08:00
Xue Fuzhao
e070ca45c6 [NFC] polish colossalai/fx/tracer/meta_patch/patched_module/convolution.py code style () 2022-09-08 22:11:04 +08:00
Super Daniel
4f59693207
[fx] provide a stable but not accurate enough version of profiler. ()
* [fx] compute memory stat and flop count for MetaInfoProp.

* [fx] modify node attribute.

* [fx] modify ckpt_chen.

* [fx] fix compatibility.

* [fx] fix import error.

* [fx] skip test for MetaInfoProp.

* [fx] skip test for MetaInfoProp.

* [fx] skip test for MetaInfoProp.

* [fx] skip test for MetaInfoProp.

* [fx] skip if torch 1.11.0.

* [fx] recover MetaInfoProp support for PyTorch 1.11.

* [fx] provide a stable but not accurate enough version of profiler.

* [fx] provide a stable but not accurate enough version of profiler.

* [fx] fix compatibility in tests.

* [fx] fix compatibility in tests.

* [fx] fix compatibility in tests.

* [fx] fix compatibility in tests.

* [fx] fix compatibility in tests.

* [fx] fix compatibility in tests.

* [fx] fix compatibility in tests.

* [fx] fix compatibility in tests.

* [fx] fix compatibility in tests.

* [fx] fix compatibility in tests.

* [fx] fix import error.
2022-09-07 11:21:04 +08:00
Super Daniel
d8a5aded19
[hotfix] change namespace for meta_trace. () 2022-09-06 11:46:12 +08:00
Boyuan Yao
46c6cc79a9
[fx] Add common node in model linearize ()
* [fx] Add common node into linearize

* [fx] Add common node to solver
2022-09-05 18:35:05 +08:00
Super Daniel
70129603aa
[fx] support meta tracing for aten level computation graphs like functorch. ()
* [fx] support meta tracing for aten level computation graphs like functorch.

* [fx] support meta tracing for aten level computation graphs like functorch.

* [fx] remove redundant import.

* [fx] add docstring.
2022-09-05 12:10:09 +08:00
Boyuan Yao
56159049e8
[fx] Modify solver linearize and add corresponding test ()
* [fx] modify solver linearize and add test

* [fx] add torch11 test of linearize but skip it

* [fx] remove some unused imports
2022-09-02 10:24:41 +08:00
YuliangLiu0306
4b3d6caeb3
[fx]patch nn.functional convolution () 2022-09-01 19:05:07 +08:00
Super Daniel
112a1f0a8f
[hotfix] avoid conflict of meta registry with torch 1.13.0. ()
* [hotfix] avoid conflict of meta registry with torch 1.13.0.

* [hotfix] avoid conflict of meta registry with torch 1.13.0.
2022-09-01 15:31:21 +08:00
Boyuan Yao
b231430bcb
[fx] Fix wrong index in annotation and minimal flops in ckpt solver ()
* [fx] fix wrong variable name in solver rotor

* [fx] fix wrong variable name in solver rotor

* [fx] fix the discretize bug

* [fx] fix the first op in activation checkpoint codegen

* [fx] fix some bugs of ckpt solver

* [fx] modify test_ckpt_torchvision

* [fx] set sequence to __sequence__ attr of GraphModule

* [fx] docstring modification

* [fx] remove performance test
2022-08-31 18:10:48 +08:00
Super Daniel
5cc849f6ce
[fx] hack __torch_dispatch__ for meta tensor and autograd. ()
* [fx] hack __torch_dispatch__ for meta tensor and autograd.

* [fx] hack __torch_dispatch__ for meta tensor and autograd.

* [fx] hack __torch_dispatch__ for meta tensor and autograd.

* [fx] hack __torch_dispatch__ for meta tensor and autograd.

* [fx] hack __torch_dispatch__ for meta tensor and autograd.

* [fx] add bad case detections.

* [fx] add bad case detections.

* [fx] rename MetaTensor attributes.

* [fx] fix unexpected error.

* [fx] fix unexpected error.

* [fx] fix unexpected error.

* [fx] fix unexpected error.

* [fx] fix unexpected error.

* [fx] add register backward for native_batch_norm_backward.

* [fx] add more meta backend support for nn.Modules.

* [fx] add meta backend to support timm and torchvision models.

* [fx] add meta hardswish for timm models.
2022-08-31 16:30:16 +08:00
Frank Lee
a0436a62ee
[autoparallel] added liveness analysis ()
* [autoparallel] added liveness analysis

* remove memory cost
2022-08-30 15:54:37 +08:00
Super Daniel
ea1a95b8b9
[hotfix] fix coloproxy typos. () 2022-08-30 11:39:03 +08:00
Boyuan Yao
4acc58ee20
[fx] Fix activation codegen dealing with checkpointing first op () 2022-08-27 19:39:21 +08:00
Boyuan Yao
ac3a453a50
[fx] fix the discretize bug ()
* [fx] fix wrong variable name in solver rotor

* [fx] fix wrong variable name in solver rotor

* code modification

* [fx] fix the discretize bug
2022-08-26 17:15:52 +08:00
Boyuan Yao
31fffd3fc5
[fx] fix wrong variable name in solver rotor ()
* [fx] fix wrong variable name in solver rotor

* [fx] fix wrong variable name in solver rotor

* code modification
2022-08-26 15:47:08 +08:00
Boyuan Yao
de1e716dc4
[fx] Add activation checkpoint solver rotor ()
* [fx] fix defining ckpt functions inside forward

* [fx] Modify activation checkpoint codegen and add ColoGraphModule

* [fx] some modification

* some modifications

* some modifications

* some modifications

* some modifications

* some code modifications

* [automatic_parallel] ckpt solver rotor

* [fx] add ckpt_solver_rotor

* [fx] modification

* code refactor

* code refactor
2022-08-26 10:34:21 +08:00
Super Daniel
09c023bee2
[fx] add more op patches for profiler and error message for unsupported ops. ()
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] merge development into main ()

* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.

* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.

* [fx] fix lowercase naming conventions.

* [fx] simplify test for ckpt.

* [fx] add rules to linearize computation graphs for searching. ()

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] merge development into main ()

* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.

* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.

* [fx] fix lowercase naming conventions.

* [fx] simplify test for ckpt.

* [fx] fix test and algorithm bugs in activation checkpointing.

* [fx] polish ckpt_test.

* [fx] add rules to linearize computation graphs for searching.

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] fix inconsistencies.

* [fx] fix MetaInfoProp.

* [fx] fix MetaInfoProp.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] fix error in tests.

* [fx] unfix bug.

* [fx] unfix bug.

* [fx] patch more modules and functions.

* [fx] change name of utils.py to profiler.py

* [fx] add profiler for rnn.

* [fx] add profiler for rnn.

* [fx] polish and add more patch for profiler.

* [fx] polish and add more patch for profiler.
2022-08-25 23:11:13 +08:00
YuliangLiu0306
413c053453
[autoparallel] add cost graph class ()
* [autoparallel] add cost graph class

* polish code
2022-08-25 17:19:59 +08:00
Frank Lee
3da68d6b1b
[fx] fixed adapative pooling size concatenation error () 2022-08-25 09:05:07 +08:00
Super Daniel
32efe8e740
[fx] add profiler for fx nodes. ()
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] merge development into main ()

* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.

* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.

* [fx] fix lowercase naming conventions.

* [fx] simplify test for ckpt.

* [fx] add rules to linearize computation graphs for searching. ()

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] merge development into main ()

* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.

* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.

* [fx] fix lowercase naming conventions.

* [fx] simplify test for ckpt.

* [fx] fix test and algorithm bugs in activation checkpointing.

* [fx] polish ckpt_test.

* [fx] add rules to linearize computation graphs for searching.

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] fix inconsistencies.

* [fx] fix MetaInfoProp.

* [fx] fix MetaInfoProp.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] fix error in tests.

* [fx] unfix bug.

* [fx] unfix bug.
2022-08-24 16:22:44 +08:00
Boyuan Yao
1f2e547f7a
[fx] Fix ckpt functions' definitions in forward ()
* [fx] fix defining ckpt functions inside forward

* [fx] Modify activation checkpoint codegen and add ColoGraphModule

* [fx] some modification

* some modifications

* some modifications

* some modifications

* some modifications

* some code modifications
2022-08-22 16:59:54 +08:00
Super Daniel
bbc58d881b
[fx] fix MetaInfoProp for incorrect calculations and add detections for inplace op. ()
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] merge development into main ()

* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.

* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.

* [fx] fix lowercase naming conventions.

* [fx] simplify test for ckpt.

* [fx] add rules to linearize computation graphs for searching. ()

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] merge development into main ()

* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.

* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.

* [fx] fix lowercase naming conventions.

* [fx] simplify test for ckpt.

* [fx] fix test and algorithm bugs in activation checkpointing.

* [fx] polish ckpt_test.

* [fx] add rules to linearize computation graphs for searching.

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] fix inconsistencies.

* [fx] fix MetaInfoProp.

* [fx] fix MetaInfoProp.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.
2022-08-18 11:27:06 +08:00
Super Daniel
e7383f578b
[fx] add rules to linearize computation graphs for searching. ()
* [fx] polish ckpt_test.

* [fx] add rules to linearize computation graphs for searching.

* [fx] remove chen_sqrt for sake of simplicity

* [fx] fix inconsistencies.
2022-08-17 14:47:12 +08:00
Boyuan Yao
092b9c8f49
[fx] Add use_reentrant=False to checkpoint in codegen ()
* [utils] Add use_reetrant=False into colossalai checkpoint

* [utils] add some annotation in utils.activaion_checkpoint

* [test] add reset_seed at the beginning of tests in test_actiavion_checkpointing.py

* [test] modify test_activation_checkpoint.py

* [test] modify test for reentrant=False

* [fx] Add use_reentrant=False of checkpoint into codegen
2022-08-17 10:34:50 +08:00
Jiarui Fang
36824a304c
[Doc] add more doc for ColoTensor. () 2022-08-16 10:38:41 +08:00
Super Daniel
0dbd61c29b
[fx] fix test and algorithm bugs in activation checkpointing. ()
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] merge development into main ()

* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.

* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.

* [fx] fix lowercase naming conventions.

* [fx] simplify test for ckpt.

* [fx] fix test and algorithm bugs in activation checkpointing.

* mend

[fx] fix test and algorithm bugs in activation checkpointing.

* mend

[fx] fix test and algorithm bugs in activation checkpointing.

* mend

[fx] fix test and algorithm bugs in activation checkpointing.

* mend

[fx] fix test and algorithm bugs in activation checkpointing.

* [fx] polish ckpt_test.

* [fx] polish ckpt_test.

* [fx] polish ckpt_test.
2022-08-15 19:09:19 +08:00
Jiarui Fang
b1553fdf96
[NFC] global vars should be upper case () 2022-08-15 09:50:29 +08:00
Boyuan Yao
5774fe0270
[fx] Use colossalai checkpoint and add offload recognition in codegen ()
* [fx] Use colossalai.utils.checkpoint to replace torch.utils.checkpoint for offload activation and add offload annotation recognition in codegen

* [fx] Use colossalai.utils.checkpoint to replace torch.utils.checkpoint for offload activation and add offload annotation recognition in codegen

* Modification of test and add TODO in codegen

* [fx] Modification of colossal ckpt usage

* [fx] add gpc.destroy() to test_codegen
2022-08-12 12:23:30 +08:00
Super Daniel
d40a9392ba
[fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. ()
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* mend

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.

* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.

* [fx] fix lowercase naming conventions.
2022-08-12 11:28:50 +08:00
Super Daniel
3b26516c69
[fx] add vanilla activation checkpoint search with test on resnet and densenet ()
* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.
2022-08-11 15:46:39 +08:00
Super Daniel
f20cb4e893
[fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages ()
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
2022-08-10 16:36:35 +08:00
Frank Lee
7d6293927f
[fx] patched torch.max and data movement operator ()
* [fx] patched torch.max and data movement operator

* polish code
2022-08-01 15:31:50 +08:00
Frank Lee
89e60d1505
[fx] fixed indentation error in checkpointing codegen () 2022-07-30 00:27:12 +08:00
Frank Lee
ad678921db
[fx] patched torch.full for huggingface opt () 2022-07-29 17:56:28 +08:00
YuliangLiu0306
df54481473
[hotfix] fix some bugs during gpt2 testing () 2022-07-28 17:21:07 +08:00
YuliangLiu0306
52bc2dc271
[fx] update split module pass and add customized policy ()
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx]update split module pass and add customized policy
2022-07-27 13:40:54 +08:00
Super Daniel
be229217ce
[fx] add torchaudio test ()
* [fx]add torchaudio test

* [fx]add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test and test patches

* Delete ~

* [fx] add patches and patches test

* [fx] add patches and patches test

* [fx] fix patches

* [fx] fix rnn patches

* [fx] fix rnn patches

* [fx] fix rnn patches

* [fx] fix rnn patches

* [fx] merge upstream

* [fx] fix import errors
2022-07-27 11:03:14 +08:00
YuliangLiu0306
5542816690
[fx]add gpt2 passes for pipeline performance test ()
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx]add gpt2 passes for pipeline performance test
2022-07-26 14:31:00 +08:00
Frank Lee
cd063ac37f
[fx] added activation checkpoint codegen support for torch < 1.12 () 2022-07-25 23:35:31 +08:00
Frank Lee
644582eee9
[fx] added activation checkpoint codegen () 2022-07-25 09:39:10 +08:00
ver217
d068af81a3
[doc] update rst and docstring ()
* update rst

* add zero docstr

* fix docstr

* remove fx.tracer.meta_patch

* fix docstr

* fix docstr

* update fx rst

* fix fx docstr

* remove useless rst
2022-07-21 15:54:53 +08:00
Frank Lee
274c1a3b5f
[fx] fixed apex normalization patch exception () 2022-07-21 15:29:11 +08:00
Frank Lee
05fae1fd56
[fx] added activation checkpointing annotation ()
* [fx] added activation checkpointing annotation

* polish code

* polish code
2022-07-21 11:14:28 +08:00
YuliangLiu0306
051592c64e
[fx] update MetaInforProp pass to process more complex node.meta ()
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx] update MetaInforProp pass to process more complex node.meta
2022-07-21 10:57:52 +08:00
YuliangLiu0306
942c8cd1fb
[fx] refactor tracer to trace complete graph ()
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx] refactor tracer to trace complete graph

* add comments and solve conflicts.
2022-07-20 11:20:38 +08:00
Frank Lee
2cc1175c76
[fx] tested the complete workflow for auto-parallel ()
* [fx] tested the complete workflow for auto-parallel

* polish code

* polish code

* polish code
2022-07-20 10:45:17 +08:00
YuliangLiu0306
4631fef8a0
[fx]refactor tracer () 2022-07-19 15:50:42 +08:00
Frank Lee
75abc75c15
[fx] fixed compatiblity issue with torch 1.10 () 2022-07-18 11:41:27 +08:00
Frank Lee
b2475d8c5c
[fx] fixed unit tests for torch 1.12 () 2022-07-15 18:22:15 +08:00
YuliangLiu0306
e8acf55e8b
[fx] add balanced policy v2 ()
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx] add balanced policy v2

* add unittest
2022-07-15 14:54:26 +08:00
XYE
ca2d3f284f
[fx] Add unit test and fix bugs for transform_mlp_pass ()
* add test and fix bugs

* add functions back

* add comments
2022-07-15 14:37:58 +08:00
Frank Lee
4f4d8c3656
[fx] added apex normalization to patched modules ()
* [fx] added apex normalization to patched modules

* remove unused imports
2022-07-14 14:24:13 +08:00
Frank Lee
fb35460595
[fx] added ndim property to proxy () 2022-07-12 15:27:13 +08:00
Frank Lee
4a09fc0947
[fx] fixed tracing with apex-based T5 model ()
* [fx] fixed tracing with apex-based T5 model

* polish code

* polish code
2022-07-12 15:19:25 +08:00
Frank Lee
7531c6271f
[fx] refactored the file structure of patched function and module ()
* [fx] refactored the file structure of patched function and module

* polish code
2022-07-12 15:01:58 +08:00
YuliangLiu0306
97d713855a
[fx] methods to get fx graph property. ()
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* manipulation

* [fx]add graph manipulation methods.

* [fx]methods to get fx graph property.

* add unit test

* add docstring to explain top node and leaf node in this context
2022-07-12 14:10:37 +08:00
YuliangLiu0306
30b4fc0eb0
[fx]add split module pass and unit test from pipeline passes ()
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx]add split module pass and unit test from pipeline passes

* fix MNASNet bug

* polish
2022-07-12 13:45:01 +08:00
Jiarui Fang
9bcd2fd4af
[tensor] a shorter shard and replicate spec () 2022-07-11 15:51:48 +08:00
Jiarui Fang
0e199d71e8
[hotfix] fx get comm size bugs ()
* init a checkpoint dir

* [checkpoint]support resume for cosinewarmuplr

* [checkpoint]add unit test

* fix some bugs but still not OK

* fix bugs

* make it faster

* [checkpoint]support generalized scheduler

* polish

* [tensor] torch function return colotensor

* polish

* fix bugs

* remove debug info

* polish

* polish

* [tensor] test_model pass unittests

* polish

* [hotfix] fx get comm size bug

Co-authored-by: ZhaoYi1222 <zhaoyi9499@gmail.com>
2022-07-08 10:54:41 +08:00
YuliangLiu0306
2b7dca44b5
[fx]get communication size between partitions ()
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx]get communication size between partitions.

* polish
2022-07-07 16:22:00 +08:00
Frank Lee
84f2298a96
[fx] added patches for tracing swin transformer () 2022-07-07 15:20:13 +08:00
Frank Lee
b6cb5a47ad
[fx] added timm model tracing testing () 2022-07-07 14:02:17 +08:00
Jiarui Fang
db1bef9032
[hotfix] fx shard 1d pass bug fixing () 2022-07-07 13:37:31 +08:00
Frank Lee
11973d892d
[fx] added torchvision model tracing testing ()
* [fx] added torchvision model tracing testing

* remove unused imports
2022-07-06 21:37:56 +08:00