From 7a58dc5ad244f68d162e34a269a4a7c96f9896ab Mon Sep 17 00:00:00 2001
From: Boyuan Yao <70263930+Cypher30@users.noreply.github.com>
Date: Fri, 27 Jan 2023 09:52:21 +0800
Subject: [PATCH] Update metainfo patch branch (#2517)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* init
* rename and remove useless func
* basic chunk
* add evoformer
* align evoformer
* add meta
* basic chunk
* basic memory
* finish basic inference memory estimation
* finish memory estimation
* fix bug
* finish memory estimation
* add part of index tracer
* finish basic index tracer
* add doc string
* add doc str
* polish code
* polish code
* update active log
* polish code
* add possible region search
* finish region search loop
* finish chunk define
* support new op
* rename index tracer
* finish codegen on msa
* redesign index tracer, add source and change compute
* pass outproduct mean
* code format
* code format
* work with outerproductmean and msa
* code style
* code style
* code style
* code style
* change threshold
* support check_index_duplicate
* support index duplicate and update loop
* support output
* update memory estimate
* optimise search
* fix layernorm
* move flow tracer
* refactor flow tracer
* format code
* refactor flow search
* code style
* adapt codegen to prepose node
* code style
* remove abandoned function
* remove flow tracer
* code style
* code style
* reorder nodes
* finish node reorder
* update run
* code style
* add chunk select class
* add chunk select
* code style
* add chunksize in emit, fix bug in reassign shape
* code style
* turn off print mem
* add evoformer openfold init
* init openfold
* add benchmark
* add print
* code style
* code style
* init openfold
* update openfold
* align openfold
* use max_mem to control strategy
* update source add
* add reorder in mem estimator
* improve reorder efficiency
* support ones_like, add prompt if fit mode search fail
* fix a bug in ones_like, don't gen chunk if dim size is 1
* fix bug again
* update min memory strategy, reduce mem usage by 30%
* last version of benchmark
* refactor structure
* restruct dir
* update test
* rename
* take apart chunk code gen
* close mem and code print
* code format
* rename ambiguous variable
* separate flow tracer
* separate input node dim search
* separate prepose_nodes
* separate non chunk input
* separate reorder
* rename
* add reorder graph
* separate trace flow
* code style
* code style
* fix typo
* set benchmark
* rename test
* update codegen test
* Fix state_dict key missing issue of the ZeroDDP (#2363)
* Fix state_dict output for ZeroDDP duplicated parameters
* Rewrite state_dict based on get_static_torch_model
* Modify get_static_torch_model to be compatible with the lower version (ZeroDDP)
* update codegen test
* update codegen test
* add chunk search test
* code style
* add available
* [hotfix] fix gpt gemini example (#2404)
* [hotfix] fix gpt gemini example
* [example] add new assertions
* remove autochunk_available
* [workflow] added nightly release to pypi (#2403)
* add comments
* code style
* add doc for search chunk
* [doc] updated readme regarding pypi installation (#2406)
* add doc for search
* [doc] updated kernel-related optimisers' docstring (#2385)
* [doc] updated kernel-related optimisers' docstring
* polish doc
* rename trace_index to trace_indice
* rename function from index to indice
* rename
* rename in doc
* [polish] polish code for get_static_torch_model (#2405)
* [gemini] polish code
* [testing] remove code
* [gemini] make more robust
* rename
* rename
* remove useless function
* [workflow] added coverage test (#2399)
* [workflow] added coverage test
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
* add doc for trace indice
* [docker] updated Dockerfile and release workflow (#2410)
* add doc
* update doc
* add available
* change imports
* add test in import
* [workflow] refactored the example check workflow (#2411)
* [workflow] refactored the example check workflow
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
* Update parallel_context.py (#2408)
* [hotfix] add DISTPAN argument for benchmark (#2412)
* change the benchmark config file
* change config
* revert config file
* rename distpan to distplan
* [workflow] added precommit check for code consistency (#2401)
* [workflow] added precommit check for code consistency
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
* adapt new fx
* [workflow] added translation for non-english comments (#2414)
* [setup] refactored setup.py for dependency graph (#2413)
* change import
* update doc
* [workflow] auto comment if precommit check fails (#2417)
* [hotfix] add norm clearing for the overflow step (#2416)
* [examples] adding tflops to PaLM (#2365)
* [workflow] auto comment with test coverage report (#2419)
* [workflow] auto comment with test coverage report
* polish code
* polish yaml
* [doc] added documentation for CI/CD (#2420)
* [doc] added documentation for CI/CD
* polish markdown
* polish markdown
* polish markdown
* [example] removed duplicated stable diffusion example (#2424)
* [zero] add inference mode and its unit test (#2418)
* [workflow] report test coverage even if below threshold (#2431)
* [example] improved the clarity of the example readme (#2427)
* [example] improved the clarity of the example readme
* polish workflow
* polish workflow
* polish workflow
* polish workflow
* polish workflow
* polish workflow
* [ddp] add is_ddp_ignored (#2434)

  [ddp] rename to is_ddp_ignored
* [workflow] make test coverage report collapsable (#2436)
* [autoparallel] add shard option (#2423)
* [fx] allow native ckpt trace and codegen. (#2438)
* [cli] provided more details if colossalai run fail (#2442)
* [autoparallel] integrate device mesh initialization into autoparallelize (#2393)
* [autoparallel] integrate device mesh initialization into autoparallelize
* add megatron solution
* update gpt autoparallel examples with latest api
* adapt beta value to fit the current computation cost
* [zero] fix state_dict and load_state_dict for ddp ignored parameters (#2443)
* [ddp] add is_ddp_ignored

  [ddp] rename to is_ddp_ignored
* [zero] fix state_dict and load_state_dict
* fix bugs
* [zero] update unit test for ZeroDDP
* [example] updated the hybrid parallel tutorial (#2444)
* [example] updated the hybrid parallel tutorial
* polish code
* [zero] add warning for ignored parameters (#2446)
* [example] updated large-batch optimizer tutorial (#2448)
* [example] updated large-batch optimizer tutorial
* polish code
* polish code
* [example] fixed seed error in train_dreambooth_colossalai.py (#2445)
* [workflow] fixed the on-merge condition check (#2452)
* [workflow] automated the compatibility test (#2453)
* [workflow] automated the compatibility test
* polish code
* [autoparallel] update binary elementwise handler (#2451)
* [autoparallel] update binary elementwise handler
* polish
* [workflow] automated bdist wheel build (#2459)
* [workflow] automated bdist wheel build
* polish workflow
* polish readme
* polish readme
* Fix False warning in initialize.py (#2456)
* Update initialize.py
* pre-commit run check
* [examples] update autoparallel tutorial demo (#2449)
* [examples] update autoparallel tutorial demo
* add test_ci.sh
* polish
* add conda yaml
* [cli] fixed hostname mismatch error (#2465)
* [example] integrate autoparallel demo with CI (#2466)
* [example] integrate autoparallel demo with CI
* polish code
* polish code
* polish code
* polish code
* [zero] low level optim supports ProcessGroup (#2464)
* [example] update vit ci script (#2469)
* [example] update vit ci script
* [example] update requirements
* [example] update requirements
* [example] integrate seq-parallel tutorial with CI (#2463)
* [zero] polish low level optimizer (#2473)
* polish pp middleware (#2476)

  Co-authored-by: Ziyue Jiang
* [example] update gpt gemini example ci test (#2477)
* [zero] add unit test for low-level zero init (#2474)
* [workflow] fixed the skip condition of example weekly check workflow (#2481)
* [example] stable diffusion add roadmap
* add dummy test_ci.sh
* [example] stable diffusion add roadmap (#2482)
* [CI] add test_ci.sh for palm, opt and gpt (#2475)
* polish code
* [example] titans for gpt
* polish readme
* remove license
* polish code
* update readme
* [example] titans for gpt (#2484)
* [autoparallel] support origin activation ckpt on autoparallel system (#2468)
* [autochunk] support evoformer tracer (#2485)

  support the full evoformer tracer, which is a main module of AlphaFold. Previously we only supported a simplified version of it.
  1. support some of evoformer's ops in fx
  2. support evoformer test
  3. add repos for test code
* [example] fix requirements (#2488)
* [zero] add unit testings for hybrid parallelism (#2486)
* [hotfix] gpt example titans bug #2493
* polish code and fix dataloader bugs
* [hotfix] gpt example titans bug #2493 (#2494)
* [fx] allow control of ckpt_codegen init (#2498)
* [fx] allow control of ckpt_codegen init

  Currently in ColoGraphModule, ActivationCheckpointCodeGen will be set automatically in __init__. But other codegens can't be set if so. So I add an arg to control whether to set ActivationCheckpointCodeGen in __init__.
* code style
* [example] dreambooth example
* add test_ci.sh to dreambooth
* [autochunk] support autochunk on evoformer (#2497)
* Revert "Update parallel_context.py (#2408)"

  This reverts commit 7d5640b9db01b501e95b66e91be9fe27b58d2e58.
* add avg partition (#2483)

  Co-authored-by: Ziyue Jiang
* [auto-chunk] support extramsa (#3) (#2504)
* [utils] lazy init. (#2148)
* [utils] lazy init.
* [utils] remove description.
* [utils] complete.
* [utils] finalize.
* [utils] fix names.
* [autochunk] support parsing blocks (#2506)
* [zero] add strict ddp mode (#2508)
* [zero] add strict ddp mode
* [polish] add comments for strict ddp mode
* [zero] fix test error
* [doc] update opt and tutorial links (#2509)
* [workflow] fixed changed file detection (#2515)

Co-authored-by: oahzxl
Co-authored-by: eric8607242
Co-authored-by: HELSON
Co-authored-by: Frank Lee
Co-authored-by: Haofan Wang
Co-authored-by: Jiarui Fang
Co-authored-by: ZijianYY <119492445+ZijianYY@users.noreply.github.com>
Co-authored-by: YuliangLiu0306 <72588413+YuliangLiu0306@users.noreply.github.com>
Co-authored-by: Super Daniel <78588128+super-dainiu@users.noreply.github.com>
Co-authored-by: ver217
Co-authored-by: Ziyue Jiang
Co-authored-by: Ziyue Jiang
Co-authored-by: oahzxl <43881818+oahzxl@users.noreply.github.com>
Co-authored-by: binmakeswell
Co-authored-by: Fazzie-Maqianli <55798671+Fazziekey@users.noreply.github.com>
Co-authored-by: アマデウス
---
 .bdist.json                                   | 24 +
 .compatibility                                | 3 +
 .github/workflows/README.md                   | 149 ++
 .github/workflows/auto_compatibility_test.yml | 74 +
 .github/workflows/auto_example_check.yml      | 143 ++
 .github/workflows/auto_release_bdist.yml      | 70 +
 .github/workflows/build.yml                   | 29 +-
 ...rigger_examples_check_and_weekly_check.yml | 119 --
 ...st.yml => dispatch_compatibility_test.yml} | 2 +-
 ...example.yml => dispatch_example_check.yml} | 44 +-
 .../workflows/draft_github_release_post.yml   | 3 +-
 .github/workflows/pre_commit.yml              | 71 +
 .github/workflows/release_docker.yml          | 29 +-
 .github/workflows/release_nightly.yml         | 86 +-
 .../workflows/report_precommit_failure.yml    | 67 +
 .github/workflows/report_test_coverage.yml    | 74 +
 .../example_checks/check_dispatch_inputs.py   | 27 +
 .../check_example_weekly.py}                  | 9 +-
 .../detect_changed_example.py}                | 11 +-
 .../workflows/scripts/input_check_example.py  | 23 -
 .github/workflows/translate_comment.yml       | 18 +
 .gitignore                                    | 4 +
 README-zh-Hans.md                             | 55 +-
 README.md                                     | 39 +-
 .../passes/runtime_apply_pass.py              | 33 +
 .../passes/runtime_preparation_pass.py        | 2 +
 .../auto_parallel/tensor_shard/initialize.py  | 72 +-
 .../tensor_shard/node_handler/__init__.py     | 3 +-
 .../binary_elementwise_handler.py             | 27 +-
 .../tensor_shard/node_handler/node_handler.py | 18 +
 .../tensor_shard/node_handler/option.py       | 17 +
 colossalai/autochunk/autochunk_codegen.py     | 523 ++++++
 colossalai/autochunk/estimate_memory.py       | 323 ++++
 colossalai/autochunk/reorder_graph.py         | 117 ++
 colossalai/autochunk/search_chunk.py          | 319 ++++
 colossalai/autochunk/select_chunk.py          | 224 +++
 colossalai/autochunk/trace_flow.py            | 445 +++++
 colossalai/autochunk/trace_indice.py          | 703 ++++++++
 colossalai/autochunk/utils.py                 | 132 ++
 colossalai/cli/launcher/hostinfo.py           | 9 +-
 colossalai/cli/launcher/multinode_runner.py   | 12 +-
 colossalai/cli/launcher/run.py                | 54 +-
 colossalai/device/alpha_beta_profiler.py      | 4 +-
 colossalai/device/device_mesh.py              | 30 +-
 colossalai/fx/graph_module.py                 | 22 +-
 .../fx/passes/adding_split_node_pass.py       | 36 +
 colossalai/fx/passes/meta_info_prop.py        | 3 +-
 colossalai/fx/profiler/opcount.py             | 5 +-
 colossalai/fx/profiler/tensor.py              | 45 +-
 colossalai/fx/tracer/_symbolic_trace.py       | 3 +-
 colossalai/fx/tracer/experimental.py          | 42 +-
 colossalai/gemini/chunk/search_utils.py       | 9 +-
 colossalai/gemini/chunk/utils.py              | 15 +-
 colossalai/gemini/gemini_mgr.py               | 18 +-
 colossalai/initialize.py                      | 29 +-
 .../kernel/cuda_native/scaled_softmax.py      | 17 +-
 colossalai/nn/optimizer/cpu_adam.py           | 2 +-
 colossalai/nn/optimizer/fused_adam.py         | 3 +-
 colossalai/nn/optimizer/fused_lamb.py         | 3 +-
 colossalai/nn/optimizer/fused_sgd.py          | 3 +-
 colossalai/nn/optimizer/hybrid_adam.py        | 2 +-
 colossalai/nn/optimizer/zero_optimizer.py     | 22 +-
 colossalai/nn/parallel/data_parallel.py       | 104 +-
 colossalai/nn/parallel/gemini_parallel.py     | 3 +-
 colossalai/nn/parallel/utils.py               | 25 +-
 colossalai/pipeline/rpc/_pipeline_base.py     | 4 +-
 colossalai/pipeline/rpc/_pipeline_schedule.py | 3 -
 colossalai/utils/__init__.py                  | 42 +-
 colossalai/utils/common.py                    | 8 +-
 colossalai/utils/model/experimental.py        | 440 +++++
 colossalai/zero/sharded_optim/_utils.py       | 23 +-
 .../sharded_optim/bookkeeping/base_store.py   | 10 +-
 .../sharded_optim/bookkeeping/bucket_store.py | 19 +-
 .../bookkeeping/parameter_store.py            | 5 +-
 .../zero/sharded_optim/low_level_optim.py     | 285 +--
 colossalai/zero/utils/gemini_hook.py          | 5 +-
 docker/Dockerfile                             | 5 +-
 examples/README.md                            | 48 +-
 examples/images/diffusion/README.md           | 14 +-
 .../diffusion/test_ci.sh}                     | 0
 .../dreambooth/test_ci.sh}                    | 0
 .../dreambooth/train_dreambooth_colossalai.py | 28 +-
 examples/images/vit/configs/vit_1d_tp2_ci.py  | 32 +
 examples/images/vit/requirements.txt          | 6 +
 examples/images/vit/test_ci.sh                | 9 +
 examples/images/vit/train.py                  | 25 +-
 examples/images/vit/vit.py                    | 23 +-
 examples/language/gpt/README.md               | 17 +-
 .../auto_parallel/auto_parallel_with_gpt.py   | 20 +-
 .../saved_solution/solution_12_layers.pt      | Bin 0 -> 1903 bytes
 .../saved_solution/solution_1_layers.pt       | Bin 0 -> 559 bytes
 .../saved_solution/solution_4_layers.pt       | Bin 0 -> 943 bytes
 .../pipeline_parallel/requirements.txt        | 2 +
 .../pipeline_parallel/train_gpt_pp.py         | 2 +-
 .../language/gpt/gemini/benchmark_gemini.sh   | 30 +-
 .../language/gpt/gemini/commons/model_zoo.py  | 12 +
 examples/language/gpt/gemini/requirements.txt | 2 +
 examples/language/gpt/gemini/run_gemini.sh    | 11 +-
 examples/language/gpt/gemini/test_ci.sh       | 35 +
 .../language/gpt/gemini/train_gpt_demo.py     | 37 +-
 examples/language/gpt/requirements.txt        | 1 +
 examples/language/gpt/test_ci.sh              | 18 +-
 examples/language/gpt/titans/LICENSE          | 201 +++
 examples/language/gpt/titans/README.md        | 48 +
 .../titans/configs/gpt2_small_zero3_pp1d.py   | 31 +
 .../gpt/titans/configs/gpt3_zero3_pp1d.py     | 31 +
 .../language/gpt/titans/dataset/webtext.py    | 43 +
 .../language/gpt/titans/model/__init__.py     | 3 +
 examples/language/gpt/titans/model/embed.py   | 599 +++++++
 examples/language/gpt/titans/model/gpt1d.py   | 349 ++++
 .../gpt/titans/model/pipeline_gpt1d.py        | 322 ++++
 examples/language/gpt/titans/requirements.txt | 4 +
 examples/language/gpt/titans/run.sh           | 3 +
 examples/language/gpt/titans/test_ci.sh       | 1 +
 examples/language/gpt/titans/train_gpt.py     | 113 ++
 examples/language/opt/requirements.txt        | 2 +
 examples/language/opt/test_ci.sh              | 4 +
 examples/language/palm/run.sh                 | 2 +-
 examples/language/palm/test_ci.sh             | 9 +
 examples/language/palm/train.py               | 100 +-
 examples/tutorial/README.md                   | 26 -
 examples/tutorial/auto_parallel/README.md     | 113 +-
 .../auto_parallel_with_resnet.py              | 144 +-
 examples/tutorial/auto_parallel/config.py     | 4 +-
 .../tutorial/auto_parallel/requirements.txt   | 9 +-
 .../setup.py                                  | 6 +-
 examples/tutorial/auto_parallel/test_ci.sh    | 6 +
 examples/tutorial/hybrid_parallel/README.md   | 55 +-
 examples/tutorial/hybrid_parallel/config.py   | 10 +-
 .../tutorial/hybrid_parallel/requirements.txt | 5 +-
 examples/tutorial/hybrid_parallel/test_ci.sh  | 5 +
 examples/tutorial/hybrid_parallel/train.py    | 24 +-
 .../tutorial/large_batch_optimizer/README.md  | 42 +-
 .../tutorial/large_batch_optimizer/config.py  | 26 +-
 .../large_batch_optimizer/requirements.txt    | 5 +-
 .../tutorial/large_batch_optimizer/test_ci.sh | 8 +
 .../tutorial/large_batch_optimizer/train.py   | 74 +-
 examples/tutorial/sequence_parallel/README.md | 147 +-
 examples/tutorial/sequence_parallel/config.py | 15 +-
 .../sequence_parallel/requirements.txt        | 4 +-
 .../tutorial/sequence_parallel/test_ci.sh     | 7 +
 examples/tutorial/sequence_parallel/train.py  | 44 +-
 examples/tutorial/stable_diffusion/LICENSE    | 82 -
 examples/tutorial/stable_diffusion/README.md  | 149 --
 .../configs/train_colossalai.yaml             | 116 --
 .../configs/train_colossalai_cifar10.yaml     | 123 --
 .../stable_diffusion/configs/train_ddp.yaml   | 113 --
 .../configs/train_pokemon.yaml                | 121 --
 .../stable_diffusion/environment.yaml         | 34 -
 .../stable_diffusion/ldm/data/base.py         | 75 -
 .../stable_diffusion/ldm/data/cifar10.py      | 184 --
 .../stable_diffusion/ldm/data/imagenet.py     | 394 -----
 .../stable_diffusion/ldm/data/lsun.py         | 92 -
 .../stable_diffusion/ldm/lr_scheduler.py      | 98 --
 .../ldm/models/autoencoder.py                 | 544 ------
 .../ldm/models/diffusion/classifier.py        | 267 ---
 .../ldm/models/diffusion/ddim.py              | 240 ---
 .../ldm/models/diffusion/ddpm.py              | 1554 -----------------
 .../ldm/models/diffusion/plms.py              | 236 ---
 .../stable_diffusion/ldm/modules/attention.py | 314 ----
 .../ldm/modules/diffusionmodules/__init__.py  | 0
 .../ldm/modules/diffusionmodules/model.py     | 862 ---------
 .../modules/diffusionmodules/openaimodel.py   | 1152 ------------
 .../ldm/modules/diffusionmodules/util.py      | 276 ---
 .../ldm/modules/distributions/__init__.py     | 0
 .../modules/distributions/distributions.py    | 92 -
 .../stable_diffusion/ldm/modules/ema.py       | 76 -
 .../ldm/modules/encoders/__init__.py          | 0
 .../ldm/modules/encoders/modules.py           | 264 ---
 .../ldm/modules/flash_attention.py            | 50 -
 .../ldm/modules/image_degradation/__init__.py | 2 -
 .../ldm/modules/image_degradation/bsrgan.py   | 730 --------
 .../modules/image_degradation/bsrgan_light.py | 650 -------
 .../modules/image_degradation/utils/test.png  | Bin 441072 -> 0 bytes
 .../modules/image_degradation/utils_image.py  | 916 ----------
 .../ldm/modules/losses/__init__.py            | 1 -
 .../ldm/modules/losses/contperceptual.py      | 111 --
 .../ldm/modules/losses/vqperceptual.py        | 167 --
 .../ldm/modules/x_transformer.py              | 641 -------
 .../tutorial/stable_diffusion/ldm/util.py     | 203 ---
 examples/tutorial/stable_diffusion/main.py    | 830 ---------
 .../stable_diffusion/requirements.txt         | 22 -
 .../scripts/download_first_stages.sh          | 41 -
 .../scripts/download_models.sh                | 49 -
 .../stable_diffusion/scripts/img2img.py       | 293 ----
 .../stable_diffusion/scripts/inpaint.py       | 98 --
 .../stable_diffusion/scripts/knn2img.py       | 398 -----
 .../scripts/sample_diffusion.py               | 313 ----
 .../scripts/tests/test_checkpoint.py          | 37 -
 .../scripts/tests/test_watermark.py           | 18 -
 .../scripts/train_searcher.py                 | 147 --
 .../stable_diffusion/scripts/txt2img.py       | 344 ----
 examples/tutorial/stable_diffusion/train.sh   | 4 -
 requirements/requirements-test.txt            | 1 +
 setup.py                                      | 33 +-
 .../test_tensor_shard/test_checkpoint.py      | 70 +
 .../test_binary_elementwise_handler.py        | 65 +-
 .../test_node_handler/test_shard_option.py    | 112 ++
 .../test_node_handler/utils.py                | 5 +-
 .../benchmark_simple_evoformer.py             | 94 +
 tests/test_autochunk/test_evoformer_codegen.py | 163 ++
 .../test_evoformer_stack_codegen.py           | 163 ++
 tests/test_autochunk/test_extramsa_codegen.py | 164 ++
 .../test_simple_evoformer_codegen.py          | 104 ++
 .../test_simple_evoformer_search.py           | 97 +
 tests/test_gemini/update/test_grad_clip.py    | 2 -
 tests/test_gemini/update/test_inference.py    | 122 ++
 tests/test_gemini/update/test_optim.py        | 2 -
 .../update/test_zeroddp_state_dict.py         | 16 +-
 tests/test_tensor/common_utils/_utils.py      | 17 +-
 tests/test_tensor/test_tp_with_zero.py        | 4 +-
 .../test_zero/low_level_zero/test_grad_acc.py | 5 +-
 .../test_zero/low_level_zero/test_zero1_2.py  | 2 +-
 .../low_level_zero/test_zero_init.py          | 61 +
 .../test_zero/low_level_zero/test_zero_tp.py  | 98 ++
 215 files changed, 8523 insertions(+), 14916 deletions(-)
 create mode 100644 .bdist.json
 create mode 100644 .compatibility
 create mode 100644 .github/workflows/README.md
 create mode 100644 .github/workflows/auto_compatibility_test.yml
 create mode 100644 .github/workflows/auto_example_check.yml
 create mode 100644 .github/workflows/auto_release_bdist.yml
 delete mode 100644 .github/workflows/changed_file_trigger_examples_check_and_weekly_check.yml
 rename .github/workflows/{compatibility_test.yml => dispatch_compatibility_test.yml} (98%)
 rename .github/workflows/{workflow_dispatch_example.yml => dispatch_example_check.yml} (57%)
 create mode 100644 .github/workflows/pre_commit.yml
 create mode 100644 .github/workflows/report_precommit_failure.yml
 create mode 100644 .github/workflows/report_test_coverage.yml
 create mode 100644 .github/workflows/scripts/example_checks/check_dispatch_inputs.py
 rename .github/workflows/scripts/{weekly_check_example.py => example_checks/check_example_weekly.py} (76%)
 rename .github/workflows/scripts/{changed_example.py => example_checks/detect_changed_example.py} (52%)
 delete mode 100644 .github/workflows/scripts/input_check_example.py
 create mode 100644 .github/workflows/translate_comment.yml
 create mode 100644 colossalai/auto_parallel/tensor_shard/node_handler/option.py
 create mode 100644 colossalai/autochunk/autochunk_codegen.py
 create mode 100644 colossalai/autochunk/estimate_memory.py
 create mode 100644 colossalai/autochunk/reorder_graph.py
 create mode 100644 colossalai/autochunk/search_chunk.py
 create mode 100644 colossalai/autochunk/select_chunk.py
 create mode 100644 colossalai/autochunk/trace_flow.py
 create mode 100644 colossalai/autochunk/trace_indice.py
 create mode 100644 colossalai/autochunk/utils.py
 create mode 100644 colossalai/utils/model/experimental.py
 rename examples/{tutorial/stable_diffusion/ldm/data/__init__.py => images/diffusion/test_ci.sh} (100%)
 rename examples/{tutorial/stable_diffusion/ldm/models/diffusion/__init__.py => images/dreambooth/test_ci.sh} (100%)
 create mode 100644 examples/images/vit/configs/vit_1d_tp2_ci.py
 create mode 100644 examples/images/vit/test_ci.sh
 create mode 100644 examples/language/gpt/experiments/auto_parallel/saved_solution/solution_12_layers.pt
 create mode 100644 examples/language/gpt/experiments/auto_parallel/saved_solution/solution_1_layers.pt
 create mode 100644 examples/language/gpt/experiments/auto_parallel/saved_solution/solution_4_layers.pt
 create mode 100644 examples/language/gpt/experiments/pipeline_parallel/requirements.txt
 create mode 100644 examples/language/gpt/gemini/requirements.txt
 create mode 100644 examples/language/gpt/gemini/test_ci.sh
 create mode 100644 examples/language/gpt/titans/LICENSE
 create mode 100644 examples/language/gpt/titans/README.md
 create mode 100644 examples/language/gpt/titans/configs/gpt2_small_zero3_pp1d.py
 create mode 100644 examples/language/gpt/titans/configs/gpt3_zero3_pp1d.py
 create mode 100644 examples/language/gpt/titans/dataset/webtext.py
 create mode 100644 examples/language/gpt/titans/model/__init__.py
 create mode 100644 examples/language/gpt/titans/model/embed.py
 create mode 100644 examples/language/gpt/titans/model/gpt1d.py
 create mode 100644 examples/language/gpt/titans/model/pipeline_gpt1d.py
 create mode 100644 examples/language/gpt/titans/requirements.txt
 create mode 100644 examples/language/gpt/titans/run.sh
 create mode 100644 examples/language/gpt/titans/test_ci.sh
 create mode 100644 examples/language/gpt/titans/train_gpt.py
 create mode 100644 examples/language/opt/requirements.txt
 create mode 100644 examples/language/opt/test_ci.sh
 create mode 100644 examples/language/palm/test_ci.sh
 rename examples/tutorial/{stable_diffusion => auto_parallel}/setup.py (68%)
 create mode 100644 examples/tutorial/auto_parallel/test_ci.sh
 create mode 100644 examples/tutorial/hybrid_parallel/test_ci.sh
 create mode 100644 examples/tutorial/large_batch_optimizer/test_ci.sh
 create mode 100644 examples/tutorial/sequence_parallel/test_ci.sh
 delete mode 100644 examples/tutorial/stable_diffusion/LICENSE
 delete mode 100644 examples/tutorial/stable_diffusion/README.md
 delete mode 100644 examples/tutorial/stable_diffusion/configs/train_colossalai.yaml
 delete mode 100644 examples/tutorial/stable_diffusion/configs/train_colossalai_cifar10.yaml
 delete mode 100644 examples/tutorial/stable_diffusion/configs/train_ddp.yaml
 delete mode 100644 examples/tutorial/stable_diffusion/configs/train_pokemon.yaml
 delete mode 100644 examples/tutorial/stable_diffusion/environment.yaml
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/data/base.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/data/cifar10.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/data/imagenet.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/data/lsun.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/lr_scheduler.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/models/autoencoder.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/models/diffusion/classifier.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/models/diffusion/ddim.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/models/diffusion/ddpm.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/models/diffusion/plms.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/attention.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/diffusionmodules/__init__.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/diffusionmodules/model.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/diffusionmodules/openaimodel.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/diffusionmodules/util.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/distributions/__init__.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/distributions/distributions.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/ema.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/encoders/__init__.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/encoders/modules.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/flash_attention.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/image_degradation/__init__.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/image_degradation/bsrgan.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/image_degradation/bsrgan_light.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/image_degradation/utils/test.png
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/image_degradation/utils_image.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/losses/__init__.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/losses/contperceptual.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/losses/vqperceptual.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/modules/x_transformer.py
 delete mode 100644 examples/tutorial/stable_diffusion/ldm/util.py
 delete mode 100644 examples/tutorial/stable_diffusion/main.py
 delete mode 100644 examples/tutorial/stable_diffusion/requirements.txt
 delete mode 100644 examples/tutorial/stable_diffusion/scripts/download_first_stages.sh
 delete mode 100644 examples/tutorial/stable_diffusion/scripts/download_models.sh
 delete mode 100644 examples/tutorial/stable_diffusion/scripts/img2img.py
 delete mode 100644 examples/tutorial/stable_diffusion/scripts/inpaint.py
 delete mode 100644 examples/tutorial/stable_diffusion/scripts/knn2img.py
 delete mode 100644 examples/tutorial/stable_diffusion/scripts/sample_diffusion.py
 delete mode 100644 examples/tutorial/stable_diffusion/scripts/tests/test_checkpoint.py
 delete mode 100644 examples/tutorial/stable_diffusion/scripts/tests/test_watermark.py
 delete mode 100644 examples/tutorial/stable_diffusion/scripts/train_searcher.py
 delete mode 100644 examples/tutorial/stable_diffusion/scripts/txt2img.py
 delete mode 100644 examples/tutorial/stable_diffusion/train.sh
 create mode 100644 tests/test_auto_parallel/test_tensor_shard/test_checkpoint.py
 create mode 100644 tests/test_auto_parallel/test_tensor_shard/test_node_handler/test_shard_option.py
 create mode 100644 tests/test_autochunk/benchmark_simple_evoformer.py
 create mode 100644 tests/test_autochunk/test_evoformer_codegen.py
 create mode 100644 tests/test_autochunk/test_evoformer_stack_codegen.py
 create mode 100644 tests/test_autochunk/test_extramsa_codegen.py
 create mode 100644 tests/test_autochunk/test_simple_evoformer_codegen.py
 create mode 100644 tests/test_autochunk/test_simple_evoformer_search.py
 create mode 100644 tests/test_gemini/update/test_inference.py
 create mode 100644 tests/test_zero/low_level_zero/test_zero_init.py
 create mode 100644 tests/test_zero/low_level_zero/test_zero_tp.py

diff --git a/.bdist.json b/.bdist.json
new file mode 100644
index 000000000..8693bca48
--- /dev/null
+++ b/.bdist.json
@@ -0,0 +1,24 @@
+{
+    "build": [
+        {
+            "torch_version": "1.11.0",
+            "cuda_image": "hpcaitech/cuda-conda:10.2"
+        },
+        {
+            "torch_version": "1.11.0",
+            "cuda_image": "hpcaitech/cuda-conda:11.3"
+        },
+        {
+            "torch_version": "1.12.1",
+            "cuda_image": "hpcaitech/cuda-conda:10.2"
+        },
+        {
+            "torch_version": "1.12.1",
+            "cuda_image": "hpcaitech/cuda-conda:11.3"
+        },
+        {
+            "torch_version": "1.12.1",
+            "cuda_image": "hpcaitech/cuda-conda:11.6"
+        }
+    ]
+}
diff --git a/.compatibility b/.compatibility
new file mode 100644
index 000000000..c8ac4083d
--- /dev/null
+++ b/.compatibility
@@ -0,0 +1,3 @@
+1.12.0-11.3.0
+1.11.0-11.3.0
+1.10.1-11.3.0
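Each tag in the `.compatibility` file above names a prebuilt Docker image that the compatibility workflow later in this patch pulls for testing; as a hedged illustration (the `hpcaitech/pytorch-cuda` image repository name comes from `auto_compatibility_test.yml` below):

```sh
# One test container per line of .compatibility, e.g. the first entry:
docker pull hpcaitech/pytorch-cuda:1.12.0-11.3.0   # PyTorch 1.12.0 with CUDA 11.3.0
```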
diff --git a/.github/workflows/README.md b/.github/workflows/README.md
new file mode 100644
index 000000000..cda6a3139
--- /dev/null
+++ b/.github/workflows/README.md
@@ -0,0 +1,149 @@
+# CI/CD
+
+## Table of Contents
+
+- [CI/CD](#cicd)
+  - [Table of Contents](#table-of-contents)
+  - [Overview](#overview)
+  - [Workflows](#workflows)
+    - [Checks on Pull Requests](#checks-on-pull-requests)
+    - [Regular Checks](#regular-checks)
+    - [Release](#release)
+    - [Manual Dispatch](#manual-dispatch)
+      - [Release bdist wheel](#release-bdist-wheel)
+      - [Dispatch Example Test](#dispatch-example-test)
+      - [Compatibility Test](#compatibility-test)
+    - [User Friendliness](#user-friendliness)
+  - [Configuration](#configuration)
+  - [Progress Log](#progress-log)
+
+## Overview
+
+Automation makes our development more efficient as the machine automatically runs the pre-defined tasks for the contributors.
+This saves a lot of manual work and allows the developers to fully focus on the features and bug fixes.
+In Colossal-AI, we use [GitHub Actions](https://github.com/features/actions) to automate a wide range of workflows to ensure the robustness of the software.
+In the section below, we will dive into the details of the different workflows available.
+
+## Workflows
+
+### Checks on Pull Requests
+
+| Workflow Name               | File name                      | Description                                                                                                                                        |
+| --------------------------- | ------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `Build`                     | `build.yml`                    | This workflow is triggered when the label `Run build and Test` is assigned to a PR. It will run all the unit tests in the repository with 4 GPUs.  |
+| `Pre-commit`                | `pre_commit.yml`               | This workflow runs pre-commit checks for code style consistency.                                                                                   |
+| `Report pre-commit failure` | `report_precommit_failure.yml` | This workflow will put up a comment in the PR to explain the pre-commit failure and its remedy. It is executed when `Pre-commit` is done.          |
+| `Report test coverage`      | `report_test_coverage.yml`     | This workflow will put up a comment to report the test coverage results. It is executed when `Build` is completed.                                 |
+| `Test example`              | `auto_example_check.yml`       | The example will be automatically tested if its files are changed in the PR.                                                                       |
+
+### Regular Checks
+
+| Workflow Name           | File name                     | Description                                                                                                                                                      |
+| ----------------------- | ----------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `Test example`          | `auto_example_check.yml`      | This workflow will test all examples every Sunday.                                                                                                              |
+| `Compatibility Test`    | `auto_compatibility_test.yml` | This workflow will check the compatibility of Colossal-AI against PyTorch and CUDA every Sunday. The PyTorch and CUDA versions are specified in `.compatibility`. |
+| `Build on 8 GPUs`       | `build_gpu_8.yml`             | This workflow will run the unit tests every day with 8 GPUs.                                                                                                    |
+| `Synchronize submodule` | `submodule.yml`               | This workflow will check if any git submodule is updated. If so, it will create a PR to update the submodule pointers.                                          |
+| `Close inactive issues` | `close_inactive.yml`          | This workflow will close issues which have been stale for 14 days.                                                                                              |
+
+### Release
+
+| Workflow Name               | File name                       | Description                                                                                                                                                   |
+| --------------------------- | ------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `Draft GitHub Release Post` | `draft_github_release_post.yml` | Compose a GitHub release post draft based on the commit history. Triggered when the change of `version.txt` is merged.                                       |
+| `Release to PyPI`           | `release_pypi.yml`              | Build and release the wheel to PyPI. Triggered when the change of `version.txt` is merged.                                                                   |
+| `Release Nightly to PyPI`   | `release_nightly.yml`           | Build and release the nightly wheel to PyPI as `colossalai-nightly`. Automatically executed every Sunday.                                                    |
+| `Release Docker`            | `release_docker.yml`            | Build and release the Docker image to DockerHub. Triggered when the change of `version.txt` is merged.                                                       |
+| `Release bdist wheel`       | `release_bdist.yml`             | Build binary wheels with pre-built PyTorch extensions. Manually dispatched. See more details in the next section.                                            |
+| `Auto Release bdist wheel`  | `auto_release_bdist.yml`        | Build binary wheels with pre-built PyTorch extensions. Triggered when the change of `version.txt` is merged. Build specifications are stored in `.bdist.json`. |
+| `Auto Compatibility Test`   | `auto_compatibility_test.yml`   | Check Colossal-AI's compatibility against the PyTorch and CUDA versions specified in `.compatibility`. Triggered when `version.txt` is changed in a PR.      |
+
+### Manual Dispatch
+
+| Workflow Name                 | File name                         | Description                                            |
+| ----------------------------- | --------------------------------- | ------------------------------------------------------ |
+| `Release bdist wheel`         | `release_bdist.yml`               | Build binary wheels with pre-built PyTorch extensions. |
+| `Dispatch Example Test`       | `dispatch_example_check.yml`      | Manually test a specified example.                     |
+| `Dispatch Compatibility Test` | `dispatch_compatibility_test.yml` | Test PyTorch and CUDA compatibility.                   |
+
+Refer to this [documentation](https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow) on how to manually trigger a workflow.
+I will provide the details of each workflow below.
+
+#### Release bdist wheel
+
+Parameters:
+- `torch version`: torch versions to test against; multiple versions are supported but must be separated by commas. The default value is `all`, which will test all available torch versions listed in this [repository](https://github.com/hpcaitech/public_assets/tree/main/colossalai/torch_build/torch_wheels), which is regularly updated.
+- `cuda version`: cuda versions to test against; multiple versions are supported but must be separated by commas. The CUDA versions must be present in our [DockerHub repository](https://hub.docker.com/r/hpcaitech/cuda-conda).
+- `ref`: the branch or tag name to build the wheel for.
+
+#### Dispatch Example Test
+
+Parameters:
+- `example_directory`: the example directory to test. Multiple directories are supported and must be separated by commas, for example: `language/gpt, images/vit`. Inputting only `language` or only `gpt` does not work.
+
+
+#### Compatibility Test
+
+Parameters:
+- `torch version`: torch versions to test against; multiple versions are supported but must be separated by commas. The default value is `all`, which will test all available torch versions listed in this [repository](https://github.com/hpcaitech/public_assets/tree/main/colossalai/torch_build/torch_wheels).
+- `cuda version`: cuda versions to test against; multiple versions are supported but must be separated by commas. The CUDA versions must be present in our [DockerHub repository](https://hub.docker.com/r/hpcaitech/cuda-conda).
+
+> It only tests the compatibility of the main branch.
+
+
+### User Friendliness
+
+| Workflow Name     | File name               | Description                                                                                                                                |
+| ----------------- | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
+| `issue-translate` | `translate_comment.yml` | This workflow is triggered when a new issue comment is created. The comment will be translated into English if it is not written in English. |
+
+
+## Configuration
+
+This section lists the files used to configure the workflows.
+
+1. `.compatibility`
+
+This `.compatibility` file tells GitHub Actions which PyTorch and CUDA versions to test against. Each line in the file is in the format `${torch-version}-${cuda-version}`, which is a tag for a Docker image. Thus, this tag must be present in the [docker registry](https://hub.docker.com/r/pytorch/conda-cuda) so as to perform the test.
+
+2. `.bdist.json`
+
+This file controls which PyTorch/CUDA-compatible pre-built releases will be built and published. You can add a new entry according to the JSON schema below if there is a new wheel that needs to be built with AOT compilation of PyTorch extensions.
+
+```json
+{
+  "build": [
+    {
+      "torch_version": "",
+      "cuda_image": ""
+    },
+  ]
+}
+```
+
+## Progress Log
+
+- [x] unit testing
+  - [x] test on PR
+  - [x] report test coverage
+  - [x] regular test
+- [x] release
+  - [x] official release
+  - [x] nightly build
+  - [x] binary build
+  - [x] docker build
+  - [x] draft release post
+- [x] pre-commit
+  - [x] check on PR
+  - [x] report failure
+- [x] example check
+  - [x] check on PR
+  - [x] regular check
+  - [x] manual dispatch
+- [x] compatibility check
+  - [x] manual dispatch
+  - [x] auto test when release
+- [x] helpers
+  - [x] comment translation
+  - [x] submodule update
+  - [x] close inactive issue
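For the manual-dispatch workflows documented in the README above, a run can also be triggered from the GitHub CLI instead of the web UI; a hedged sketch (the `example_directory` input name is taken from `dispatch_example_check.yml` later in this patch, and `gh` must be authenticated against the repository):

```sh
# Manually dispatch the example check for two example directories.
gh workflow run dispatch_example_check.yml \
  --repo hpcaitech/ColossalAI \
  -f example_directory="language/gpt, images/vit"
```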
diff --git a/.github/workflows/auto_compatibility_test.yml b/.github/workflows/auto_compatibility_test.yml
new file mode 100644
index 000000000..4b026c63e
--- /dev/null
+++ b/.github/workflows/auto_compatibility_test.yml
@@ -0,0 +1,74 @@
+name: Compatibility Test
+
+on:
+  pull_request:
+    paths:
+      - 'version.txt'
+      - '.compatibility'
+  # run at 03:00 of every Sunday (Singapore time), which is 19:00 UTC of every Saturday
+  schedule:
+    - cron: '0 19 * * 6'
+
+jobs:
+  matrix_preparation:
+    name: Prepare Container List
+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.set-matrix.outputs.matrix }}
+    steps:
+      - uses: actions/checkout@v3
+      - id: set-matrix
+        run: |
+          IFS=','
+          DOCKER_IMAGE=()
+
+          while read tag; do
+            DOCKER_IMAGE+=("\"hpcaitech/pytorch-cuda:${tag}\"")
+          done <.compatibility
+
+          container=$( IFS=',' ; echo "${DOCKER_IMAGE[*]}" )
+          container="[${container}]"
+          echo "$container"
+          echo "::set-output name=matrix::{\"container\":$(echo "$container")}"
+
+  build:
+    name: Test for PyTorch Compatibility
+    needs: matrix_preparation
+    if: github.repository == 'hpcaitech/ColossalAI'
+    runs-on: [self-hosted, gpu]
+    strategy:
+      fail-fast: false
+      matrix: ${{fromJson(needs.matrix_preparation.outputs.matrix)}}
+    container:
+      image: ${{ matrix.container }}
+      options: --gpus all --rm -v /data/scratch/cifar-10:/data/scratch/cifar-10
+    timeout-minutes: 120
+    steps:
+      - name: Install dependencies
+        run: |
+          pip install -U pip setuptools wheel --user
+      - uses: actions/checkout@v2
+        with:
+          repository: hpcaitech/TensorNVMe
+          ssh-key: ${{ secrets.SSH_KEY_FOR_CI }}
+          path: TensorNVMe
+      - name: Install tensornvme
+        run: |
+          cd TensorNVMe
+          conda install cmake
+          pip install -r requirements.txt
+          pip install -v .
+      - uses: actions/checkout@v2
+        with:
+          ssh-key: ${{ secrets.SSH_KEY_FOR_CI }}
+      - name: Install Colossal-AI
+        run: |
+          pip install -v --no-cache-dir .
+          pip install -r requirements/requirements-test.txt
+      - name: Unit Testing
+        run: |
+          PYTHONPATH=$PWD pytest tests
+        env:
+          DATA: /data/scratch/cifar-10
+          NCCL_SHM_DISABLE: 1
+          LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
diff --git a/.github/workflows/auto_example_check.yml b/.github/workflows/auto_example_check.yml
new file mode 100644
index 000000000..df413f646
--- /dev/null
+++ b/.github/workflows/auto_example_check.yml
@@ -0,0 +1,143 @@
+name: Test Example
+on:
+  pull_request:
+    # any change in the examples folder will trigger a check for the corresponding example.
+    paths:
+      - 'examples/**'
+  # run at 00:00 of every Sunday (Singapore time), which is 16:00 UTC of every Saturday
+  schedule:
+    - cron: '0 16 * * 6'
+
+jobs:
+  # Detect changed example files and output a matrix containing all the corresponding directory names.
+  detect-changed-example:
+    if: |
+      github.event.pull_request.draft == false &&
+      github.base_ref == 'main' &&
+      github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI' && github.event_name == 'pull_request'
+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.setup-matrix.outputs.matrix }}
+      anyChanged: ${{ steps.setup-matrix.outputs.anyChanged }}
+    name: Detect changed example files
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0
+          ref: ${{ github.event.pull_request.head.sha }}
+
+      - name: Locate base commit
+        id: locate-base-sha
+        run: |
+          curBranch=$(git rev-parse --abbrev-ref HEAD)
+          commonCommit=$(git merge-base origin/main $curBranch)
+          echo $commonCommit
+          echo "baseSHA=$commonCommit" >> $GITHUB_OUTPUT
+
+      - name: Get all changed example files
+        id: changed-files
+        uses: tj-actions/changed-files@v35
+        with:
+          base_sha: ${{ steps.locate-base-sha.outputs.baseSHA }}
+
+      - name: setup matrix
+        id: setup-matrix
+        run: |
+          changedFileName=""
+          for file in ${{ steps.changed-files.outputs.all_changed_files }}; do
+            changedFileName="${file}:${changedFileName}"
+          done
+          echo "$changedFileName was changed"
+          res=`python .github/workflows/scripts/example_checks/detect_changed_example.py --fileNameList $changedFileName`
+          echo "All changed examples are $res"
+
+          if [ "$res" = "[]" ]; then
+            echo "anyChanged=false" >> $GITHUB_OUTPUT
+            echo "matrix=null" >> $GITHUB_OUTPUT
+          else
+            dirs=$( IFS=',' ; echo "${res[*]}" )
+            echo "anyChanged=true" >> $GITHUB_OUTPUT
+            echo "matrix={\"directory\":$(echo "$dirs")}" >> $GITHUB_OUTPUT
+          fi
+
+  # If no file is changed, it will prompt an error and show that the matrix does not have a value.
+  check-changed-example:
+    # Add this condition to avoid executing this job if the trigger event is workflow_dispatch.
+    if: |
+      github.event.pull_request.draft == false &&
+      github.base_ref == 'main' &&
+      github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI' && github.event_name == 'pull_request' &&
+      needs.detect-changed-example.outputs.anyChanged == 'true'
+    name: Test the changed example
+    needs: detect-changed-example
+    runs-on: [self-hosted, gpu]
+    strategy:
+      matrix: ${{fromJson(needs.detect-changed-example.outputs.matrix)}}
+    container:
+      image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
+      options: --gpus all --rm -v /data/scratch/examples-data:/data/
+    timeout-minutes: 10
+    steps:
+      - uses: actions/checkout@v3
+
+      - name: Install Colossal-AI
+        run: |
+          pip install -v .
+
+      - name: Test the example
+        run: |
+          example_dir=${{ matrix.directory }}
+          cd "${PWD}/examples/${example_dir}"
+          bash test_ci.sh
+        env:
+          NCCL_SHM_DISABLE: 1
+
+  # This is for all files' weekly check. Specifically, this job is to find all the directories.
+  matrix_preparation:
+    if: |
+      github.repository == 'hpcaitech/ColossalAI' &&
+      github.event_name == 'schedule'
+    name: Prepare matrix for weekly check
+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.setup-matrix.outputs.matrix }}
+    steps:
+      - name: 📚 Checkout
+        uses: actions/checkout@v3
+
+      - name: setup matrix
+        id: setup-matrix
+        run: |
+          res=`python .github/workflows/scripts/example_checks/check_example_weekly.py`
+          all_loc=$( IFS=',' ; echo "${res[*]}" )
+          echo "Found the examples: $all_loc"
+          echo "matrix={\"directory\":$(echo "$all_loc")}" >> $GITHUB_OUTPUT
+
+  weekly_check:
+    if: |
+      github.repository == 'hpcaitech/ColossalAI' &&
+      github.event_name == 'schedule'
+    name: Weekly check all examples
+    needs: matrix_preparation
+    runs-on: [self-hosted, gpu]
+    strategy:
+      matrix: ${{fromJson(needs.matrix_preparation.outputs.matrix)}}
+    container:
+      image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
+    timeout-minutes: 10
+    steps:
+      - name: 📚 Checkout
+        uses: actions/checkout@v3
+
+      - name: Install Colossal-AI
+        run: |
+          pip install -v .
+
+      - name: Traverse all files
+        run: |
+          example_dir=${{ matrix.directory }}
+          echo "Testing ${example_dir} now"
+          cd "${PWD}/examples/${example_dir}"
+          bash test_ci.sh
+        env:
+          NCCL_SHM_DISABLE: 1
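Both example-check jobs above assume that every example directory ships a `test_ci.sh` entry point, which this patch adds to each example. The real scripts vary per example; the following is only a minimal sketch of the contract, with the training command purely illustrative:

```sh
#!/bin/bash
set -euxo pipefail                  # fail the CI job on the first error

pip install -r requirements.txt
# Run a short smoke-test configuration; the actual command differs per example.
colossalai run --nproc_per_node 4 train.py --config config.py
```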
diff --git a/.github/workflows/auto_release_bdist.yml b/.github/workflows/auto_release_bdist.yml
new file mode 100644
index 000000000..56a3036f8
--- /dev/null
+++ b/.github/workflows/auto_release_bdist.yml
@@ -0,0 +1,70 @@
+name: Auto Release bdist wheel
+
+on:
+  workflow_dispatch:
+  pull_request:
+    paths:
+      - 'version.txt'
+    types:
+      - closed
+
+jobs:
+  matrix_preparation:
+    name: Prepare Container List
+    if: ( github.event_name == 'workflow_dispatch' || github.event.pull_request.merged == true ) && github.repository == 'hpcaitech/ColossalAI'
+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.set-matrix.outputs.matrix }}
+    steps:
+      - uses: actions/checkout@v3
+      - id: set-matrix
+        run: |
+          bdist=$(cat .bdist.json | tr '\n' ' ')
+          echo "matrix=${bdist}" >> $GITHUB_OUTPUT
+
+  build:
+    name: Release bdist wheels
+    needs: matrix_preparation
+    runs-on: [self-hosted, gpu]
+    strategy:
+      fail-fast: false
+      matrix: ${{fromJson(needs.matrix_preparation.outputs.matrix)}}
+    container:
+      image: ${{ matrix.build.cuda_image }}
+      options: --gpus all --rm
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+      # cub is for cuda 10.2
+      - name: Copy scripts
+        run: |
+          cp -r ./.github/workflows/scripts/* ./
+
+          # link the cache directories to the current path
+          ln -s /github/home/conda_pkgs ./conda_pkgs
+          ln -s /github/home/pip_wheels ./pip_wheels
+
+          # set the conda package path
+          echo "pkgs_dirs:\n  - $PWD/conda_pkgs" > ~/.condarc
+
+          # set safe directory
+          git config --global --add safe.directory /__w/ColossalAI/ColossalAI
+
+          # get cub package for cuda 10.2
+          wget https://github.com/NVIDIA/cub/archive/refs/tags/1.8.0.zip
+          unzip 1.8.0.zip
+      - name: Build bdist wheel
+        run: |
+          pip install beautifulsoup4 requests packaging
+          python ./build_colossalai_wheel.py --torch_version $TORCH_VERSIONS
+        env:
+          TORCH_VERSIONS: ${{ matrix.build.torch_version }}
+      - name: 🚀 Deploy
+        uses: garygrossgarten/github-action-scp@release
+        with:
+          local: all_dist
+          remote: ${{ secrets.PRIVATE_PYPI_DIR }}
+          host: ${{ secrets.PRIVATE_PYPI_HOST }}
+          username: ${{ secrets.PRIVATE_PYPI_USER }}
+          password: ${{ secrets.PRIVATE_PYPI_PASSWD }}
diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index 5366f69cc..8f334d599 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -20,15 +20,26 @@ jobs:
       - uses: actions/checkout@v2
         with:
           fetch-depth: 0
+          ref: ${{ github.event.pull_request.head.sha }}
+
+      - name: Locate base commit
+        id: locate-base-sha
+        run: |
+          curBranch=$(git rev-parse --abbrev-ref HEAD)
+          commonCommit=$(git merge-base origin/main $curBranch)
+          echo $commonCommit
+          echo "baseSHA=$commonCommit" >> $GITHUB_OUTPUT
+
       - name: Find the changed files
         id: find-changed-files
         uses: tj-actions/changed-files@v35
         with:
-          since_last_remote_commit: true
+          base_sha: ${{ steps.locate-base-sha.outputs.baseSHA }}
           files: |
             op_builder/**
             colossalai/kernel/**
             setup.py
+
       - name: List changed files
         run: |
           for file in ${{ steps.find-changed-files.outputs.all_changed_files }}; do
@@ -75,12 +86,26 @@ jobs:
       - name: Unit Testing
         run: |
-          PYTHONPATH=$PWD pytest tests
+          PYTHONPATH=$PWD pytest --cov=. --cov-report xml tests
         env:
           DATA: /data/scratch/cifar-10
           NCCL_SHM_DISABLE: 1
           LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+
+      - name: Collate artifact
+        env:
+          PR_NUMBER: ${{ github.event.number }}
+        run: |
+          mkdir report
+          echo $PR_NUMBER > ./report/pr_number
+          mv coverage.xml ./report
+
+      - name: Upload test coverage artifact
+        uses: actions/upload-artifact@v3
+        with:
+          name: report
+          path: report/
+
       - name: Store Cache
         run: |
           # -p flag is required to preserve the file timestamp to avoid ninja rebuild
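The `--cov` flags introduced above come from the `pytest-cov` plugin; a hedged local equivalent (assuming `pytest-cov` is installed, plausibly via the one-line addition to `requirements/requirements-test.txt` in this patch):

```sh
pip install pytest-cov                                 # provides --cov / --cov-report
PYTHONPATH=$PWD pytest --cov=. --cov-report xml tests
# writes coverage.xml, which the workflow then moves into ./report for upload
```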
diff --git a/.github/workflows/changed_file_trigger_examples_check_and_weekly_check.yml b/.github/workflows/changed_file_trigger_examples_check_and_weekly_check.yml
deleted file mode 100644
index 2b7ec3125..000000000
--- a/.github/workflows/changed_file_trigger_examples_check_and_weekly_check.yml
+++ /dev/null
@@ -1,119 +0,0 @@
-name: Test Example
-on:
-  pull_request:
-    # So only the changes in examples folder will trigger jobs below.
-    paths:
-      - 'examples/**'
-  # run at 00:00 of every Sunday(singapore time) so here is UTC time Saturday 16:00
-  schedule:
-    - cron: '0 16 * * 6'
-
-jobs:
-  # This is for changed example files detect and output a matrix containing all the corresponding directory name.
-  detect-changed-example:
-    if: |
-      github.event.pull_request.draft == false &&
-      github.base_ref == 'main' &&
-      github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI' && github.event_name == 'pull_request'
-    runs-on: ubuntu-latest
-    outputs:
-      matrix: ${{ steps.set-matrix.outputs.matrix }}
-    name: Check out all files
-    steps:
-      - uses: actions/checkout@v3
-        with:
-          fetch-depth: 2
-      - name: Get all changed example files
-        id: changed-files
-        uses: tj-actions/changed-files@v35
-        # Using this can trigger action each time a PR is submitted.
-        with:
-          since_last_remote_commit: true
-      - name: setup matrix
-        id: set-matrix
-        run: |
-          changedFileName=""
-          for file in ${{ steps.changed-files.outputs.all_changed_files }}; do
-            changedFileName="${file}:${changedFileName}"
-          done
-          echo "$changedFileName was changed"
-          res=`python .github/workflows/scripts/changed_example.py --fileNameList $changedFileName`
-          echo "All changed files are $res"
-          loc=$( IFS=',' ; echo "${res[*]}" )
-          echo "$loc"
-          echo "::set-output name=matrix::{\"loc\":$(echo "$loc")}"
-
-  # If no file is changed, it will prompt an error and shows the matrix do not have value.
-  check-all-changed-files:
-    # Add this condition to avoid executing this job if the trigger event is workflow_dispatch.
-    if: |
-      github.event.pull_request.draft == false &&
-      github.base_ref == 'main' &&
-      github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI' && github.event_name == 'pull_request'
-    name: Test each changed example files
-    needs: detect-changed-example
-    runs-on: [self-hosted, gpu]
-    strategy:
-      matrix: ${{fromJson(needs.detect-changed-example.outputs.matrix)}}
-    container:
-      image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
-    steps:
-      - uses: actions/checkout@v3
-        with:
-          fetch-depth: 2
-      - name: Install dependancies
-        run: |
-          pip install -r ./requirements/requirements.txt
-          pip install colossalai
-      - name: List all changed example files
-        run: |
-          res=${{ matrix.loc }}
-          cd "${PWD}/examples/${res}"
-          bash test_ci.sh
-
-  # This is for all files' weekly check. Specifically, this job is to find all the directories.
-  matrix_preparation:
-    if: |
-      github.event.pull_request.draft == false &&
-      github.base_ref == 'main' &&
-      github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI' && github.event_name == 'schedule'
-    name: Prepare Directory List for All files
-    runs-on: ubuntu-latest
-    outputs:
-      matrix: ${{ steps.set-matrix.outputs.matrix }}
-    steps:
-      - name: 📚 Checkout
-        uses: actions/checkout@v3
-      - name: setup matrix
-        id: set-matrix
-        run: |
-          res=`python .github/workflows/scripts/weekly_check_example.py`
-          all_loc=$( IFS=',' ; echo "${res[*]}" )
-          echo "$all_loc"
-          echo "::set-output name=matrix::{\"all_loc\":$(echo "$all_loc")}"
-
-  weekly_check:
-    if: |
-      github.event.pull_request.draft == false &&
-      github.base_ref == 'main' &&
-      github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI' && github.event_name == 'schedule'
-    name: Weekly check all examples
-    needs: matrix_preparation
-    runs-on: [self-hosted, gpu]
-    strategy:
-      matrix: ${{fromJson(needs.matrix_preparation.outputs.matrix)}}
-    container:
-      image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
-    steps:
-      - name: 📚 Checkout
-        uses: actions/checkout@v3
-      - name: Install the requirements
-        run: |
-          pip install -r ./requirements/requirements.txt
-          pip install colossalai
-      - name: Traverse all files
-        run: |
-          dir=${{ matrix.all_loc }}
-          echo "${dir} is current directory"
-          cd "${PWD}/examples/${dir}"
-          bash test_ci.sh
diff --git a/.github/workflows/compatibility_test.yml b/.github/workflows/dispatch_compatibility_test.yml
similarity index 98%
rename from .github/workflows/compatibility_test.yml
rename to .github/workflows/dispatch_compatibility_test.yml
index eadd07886..ac5669c6f 100644
--- a/.github/workflows/compatibility_test.yml
+++ b/.github/workflows/dispatch_compatibility_test.yml
@@ -1,4 +1,4 @@
-name: Compatibility Test
+name: Dispatch Compatibility Test
 
 on:
   workflow_dispatch:
diff --git a/.github/workflows/workflow_dispatch_example.yml b/.github/workflows/dispatch_example_check.yml
similarity index 57%
rename from .github/workflows/workflow_dispatch_example.yml
rename to .github/workflows/dispatch_example_check.yml
index d9d576910..e0333422f 100644
--- a/.github/workflows/workflow_dispatch_example.yml
+++ b/.github/workflows/dispatch_example_check.yml
@@ -8,7 +8,7 @@ on:
         required: true
 
 jobs:
-  manual_check_matrix_preparation:
+  matrix_preparation:
     if: |
       github.event.pull_request.draft == false &&
       github.base_ref == 'main' &&
@@ -16,31 +16,24 @@ jobs:
     name: Check the examples user want
     runs-on: ubuntu-latest
     outputs:
-      matrix: ${{ steps.set-matrix-1.outputs.matrix }}
+      matrix: ${{ steps.set-matrix.outputs.matrix }}
     steps:
       - name: 📚 Checkout
         uses: actions/checkout@v3
-      - name: Get manual directories
-        id: set-matrix-1
+      - name: Set up matrix
+        id: set-matrix
         env:
           check_dir: ${{ inputs.example_directory }}
         run: |
-          all_mannual_check_dir=()
-          for cdi in $check_dir
-          do
-            all_mannual_check_dir+=("\"${cdi}\"")
-          done
-          man_loc=$( IFS=',' ; echo "${all_mannual_check_dir[*]}" )
-          res=`python .github/workflows/scripts/input_check_example.py --fileNameList $man_loc`
-          echo "${res} is file existance. 1 for all exist, -1 for at least one file not exist."
-          if [ res == -1 ];then
-            exit(1)
+          res=`python .github/workflows/scripts/example_checks/check_dispatch_inputs.py --fileNameList $check_dir`
+          if [ res == "failure" ];then
+            exit -1
           fi
-          man_loc="[${man_loc}]"
-          echo "$man_loc"
-          echo "::set-output name=matrix::{\"man_loc\":$(echo "$man_loc")}"
+          dirs="[${check_dir}]"
+          echo "Testing examples in $dirs"
+          echo "matrix={\"directory\":$(echo "$dirs")}" >> $GITHUB_OUTPUT
 
-  manual_check:
+  test_example:
     if: |
       github.event.pull_request.draft == false &&
       github.base_ref == 'main' &&
@@ -52,16 +45,19 @@ jobs:
       matrix: ${{fromJson(needs.manual_check_matrix_preparation.outputs.matrix)}}
     container:
       image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
+      options: --gpus all --rm -v /data/scratch/examples-data:/data/
+    timeout-minutes: 10
     steps:
       - name: 📚 Checkout
         uses: actions/checkout@v3
-      - name: Install the requirements
+      - name: Install Colossal-AI
         run: |
-          pip install -r ./requirements/requirements.txt
-          pip install colossalai
+          pip install -v .
-      - name: Traverse all files
+      - name: Test the example
         run: |
-          dir=${{ matrix.man_loc }}
-          echo "${dir} is current directory"
+          dir=${{ matrix.directory }}
+          echo "Testing ${dir} now"
           cd "${PWD}/examples/${dir}"
           bash test_ci.sh
+        env:
+          NCCL_SHM_DISABLE: 1
diff --git a/.github/workflows/draft_github_release_post.yml b/.github/workflows/draft_github_release_post.yml
index 413714daf..53bfa9e8d 100644
--- a/.github/workflows/draft_github_release_post.yml
+++ b/.github/workflows/draft_github_release_post.yml
@@ -8,11 +8,10 @@ on:
     types:
       - closed
 
-
 jobs:
   release:
     name: Draft Release Post
-    if: github.repository == 'hpcaitech/ColossalAI'
+    if: ( github.event_name == 'workflow_dispatch' || github.event.pull_request.merged == true ) && github.repository == 'hpcaitech/ColossalAI'
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v2
diff --git a/.github/workflows/pre_commit.yml b/.github/workflows/pre_commit.yml
new file mode 100644
index 000000000..3e71be2fc
--- /dev/null
+++ b/.github/workflows/pre_commit.yml
@@ -0,0 +1,71 @@
+name: pre-commit
+
+on:
+  pull_request:
+
+jobs:
+  pre-commit:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+          ref: ${{ github.event.pull_request.head.sha }}
+
+      # the PR branch and the hpcaitech/colossal-ai main branch
+      # must share a common commit, we need to locate that commit,
+      # which is the commit checked-out or forked when the PR branch is created
+      # such that we can look for files changed since that commit
+      - name: Locate base commit
+        id: locate-base-sha
+        run: |
+          curBranch=$(git rev-parse --abbrev-ref HEAD)
+          commonCommit=$(git merge-base origin/main $curBranch)
+          echo $commonCommit
+          echo "baseSHA=$commonCommit" >> $GITHUB_OUTPUT
+
+      - name: Find the changed files
+        id: find-changed-files
+        uses: tj-actions/changed-files@v35
+        with:
+          base_sha: ${{ steps.locate-base-sha.outputs.baseSHA }}
+
+      - name: List all changed files
+        run: |
+          for file in ${{ steps.find-changed-files.outputs.all_changed_files }}; do
+            echo "$file was changed"
+          done
+
+      - uses: actions/setup-python@v3
+
+      - name: Cache pre-commit hooks
+        uses: actions/cache@v3
+        with:
+          path: ~/.cache/pre-commit
+          key: ${{ runner.os }}-pre-commit-hooks
+
+      - name: Set up pre-commit
+        run: |
+          pip install pre-commit
+          pre-commit install
+
+      - name: Run pre-commit on Changed Files
+        id: precommit
+        run: |
+          for file in ${{ steps.find-changed-files.outputs.all_changed_files }}; do
+            echo "======= running pre-commit on ${file} ======="
+            pre-commit run --files $file
+          done
+
+      - name: Save PR number
+        if: always()
+        env:
+          PR_NUMBER: ${{ github.event.number }}
+        run: |
+          mkdir -p ./pr
+          echo $PR_NUMBER > ./pr/pr_number
+
+      - uses: actions/upload-artifact@v3
+        if: always()
+        with:
+          name: pr_number
+          path: pr/
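The same checks can be reproduced locally before pushing; a hedged sketch mirroring the per-file loop in the workflow above (uses only standard `pre-commit` and `git` commands):

```sh
pip install pre-commit
pre-commit install                                   # register the git hook
# run the hooks on exactly the files changed relative to main
pre-commit run --files $(git diff --name-only origin/main...HEAD)
```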
diff --git a/.github/workflows/release_docker.yml b/.github/workflows/release_docker.yml
index 328d232a8..8da6e5f87 100644
--- a/.github/workflows/release_docker.yml
+++ b/.github/workflows/release_docker.yml
@@ -2,13 +2,16 @@ name: Publish Docker Image to DockerHub

 on:
   workflow_dispatch:
-  release:
-    types: [published]
+  pull_request:
+    paths:
+      - 'version.txt'
+    types:
+      - closed

 jobs:
   release:
     name: Publish Docker Image to DockerHub
-    if: github.repository == 'hpcaitech/ColossalAI'
+    if: ( github.event_name == 'workflow_dispatch' || github.event.pull_request.merged == true ) && github.repository == 'hpcaitech/ColossalAI'
     runs-on: [self-hosted, gpu]
     container:
       image: "hpcaitech/docker-in-docker:latest"
@@ -18,23 +21,17 @@
         with:
           fetch-depth: 0
       - name: Build Docker
+        id: build
         run: |
           version=$(cat version.txt)
-          docker build --build-arg http_proxy=http://172.17.0.1:7890 --build-arg https_proxy=http://172.17.0.1:7890 -t hpcaitech/colossalai:$version ./docker
+          tag=hpcaitech/colossalai:$version
+          docker build --build-arg http_proxy=http://172.17.0.1:7890 --build-arg https_proxy=http://172.17.0.1:7890 -t $tag ./docker
+          echo "tag=${tag}" >> $GITHUB_OUTPUT
       - name: Log in to Docker Hub
         uses: docker/login-action@f054a8b539a109f9f41c372932f1ae047eff08c9
         with:
           username: ${{ secrets.DOCKER_USERNAME }}
           password: ${{ secrets.DOCKER_PASSWORD }}
-      - name: Extract metadata (tags, labels) for Docker
-        id: meta
-        uses: docker/metadata-action@98669ae865ea3cffbcbaa878cf57c20bbf1c6c38
-        with:
-          images: hpcaitech/colossalai
-      - name: Build and push Docker image
-        uses: docker/build-push-action@ad44023a93711e3deb337508980b4b5e9bcdc5dc
-        with:
-          context: .
-          push: true
-          tags: ${{ steps.meta.outputs.tags }}
-          labels: ${{ steps.meta.outputs.labels }}
+      - name: Push Docker image
+        run: |
+          docker push ${{ steps.build.outputs.tag }}
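Note the migration that recurs throughout this patch: step outputs move from the deprecated `::set-output` workflow command to the `$GITHUB_OUTPUT` file, as in the `id: build` step above. A small sketch of what writing a step output amounts to (names and the sample tag are illustrative):

    import os

    def set_step_output(name: str, value: str) -> None:
        # GITHUB_OUTPUT points at a file the runner parses after the step;
        # each "name=value" line becomes steps.<step_id>.outputs.<name>
        with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as f:
            f.write(f"{name}={value}\n")

    # set_step_output("tag", "hpcaitech/colossalai:0.2.0") makes the value
    # available to later steps as ${{ steps.build.outputs.tag }}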
diff --git a/.github/workflows/release_nightly.yml b/.github/workflows/release_nightly.yml
index 6bc000d1f..8aa48b8ed 100644
--- a/.github/workflows/release_nightly.yml
+++ b/.github/workflows/release_nightly.yml
@@ -1,73 +1,29 @@
-name: Release bdist wheel for Nightly versions
+name: Publish Nightly Version to PyPI

 on:
-  schedule:
-    # run at 00:00 of every Sunday
-    - cron: '0 0 * * 6'
   workflow_dispatch:
+  schedule:
+    - cron: '0 0 * * 6' # release on every Sunday 00:00 UTC time

 jobs:
-  matrix_preparation:
-    name: Prepare Container List
+  build-n-publish:
+    if: github.event_name == 'workflow_dispatch' || github.repository == 'hpcaitech/ColossalAI'
+    name: Build and publish Python 🐍 distributions 📦 to PyPI
     runs-on: ubuntu-latest
-    outputs:
-      matrix: ${{ steps.set-matrix.outputs.matrix }}
+    timeout-minutes: 20
     steps:
-      - id: set-matrix
-        run: |
-          matrix="[\"hpcaitech/cuda-conda:11.3\", \"hpcaitech/cuda-conda:10.2\"]"
-          echo $matrix
-          echo "::set-output name=matrix::{\"container\":$(echo $matrix)}"
+      - uses: actions/checkout@v2

-  build:
-    name: Release bdist wheels
-    needs: matrix_preparation
-    if: github.repository == 'hpcaitech/ColossalAI' && contains(fromJson('["FrankLeeeee", "ver217", "feifeibear", "kurisusnowdeng"]'), github.actor)
-    runs-on: [self-hosted, gpu]
-    strategy:
-      fail-fast: false
-      matrix: ${{fromJson(needs.matrix_preparation.outputs.matrix)}}
-    container:
-      image: ${{ matrix.container }}
-      options: --gpus all --rm
-    steps:
-      - uses: actions/checkout@v2
-        with:
-          fetch-depth: 0
-      # cub is for cuda 10.2
-      - name: Copy scripts and checkout
-        run: |
-          cp -r ./.github/workflows/scripts/* ./
-          ln -s /github/home/pip_wheels ./pip_wheels
-          wget https://github.com/NVIDIA/cub/archive/refs/tags/1.8.0.zip
-          unzip 1.8.0.zip
-      - name: Build bdist wheel
-        run: |
-          pip install beautifulsoup4 requests packaging
-          python ./build_colossalai_wheel.py --nightly
-      - name: 🚀 Deploy
-        uses: garygrossgarten/github-action-scp@release
-        with:
-          local: all_dist
-          remote: ${{ secrets.PRIVATE_PYPI_NIGHTLY_DIR }}
-          host: ${{ secrets.PRIVATE_PYPI_HOST }}
-          username: ${{ secrets.PRIVATE_PYPI_USER }}
-          password: ${{ secrets.PRIVATE_PYPI_PASSWD }}
-  remove_old_build:
-    name: Remove old nightly build
-    runs-on: ubuntu-latest
-    needs: build
-    steps:
-      - name: executing remote ssh commands using password
-        uses: appleboy/ssh-action@master
-        env:
-          BUILD_DIR: ${{ secrets.PRIVATE_PYPI_NIGHTLY_DIR }}
-        with:
-          host: ${{ secrets.PRIVATE_PYPI_HOST }}
-          username: ${{ secrets.PRIVATE_PYPI_USER }}
-          password: ${{ secrets.PRIVATE_PYPI_PASSWD }}
-          envs: BUILD_DIR
-          script: |
-            cd $BUILD_DIR
-            find . -type f -mtime +0 -exec rm -f {} +
-          script_stop: true
+      - uses: actions/setup-python@v2
+        with:
+          python-version: '3.8.14'
+
+      - run: NIGHTLY=1 python setup.py sdist build
+
+      # publish to PyPI if executed on the main branch
+      - name: Publish package to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          user: __token__
+          password: ${{ secrets.PYPI_API_TOKEN }}
+          verbose: true
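The new job delegates versioning to `NIGHTLY=1 python setup.py sdist build`; how setup.py consumes that flag is outside this diff. A common pattern — assumed here, not taken from the repository — is to append a date suffix so each nightly gets a unique, PEP 440-compliant PyPI version:

    import datetime
    import os

    # hypothetical sketch of the NIGHTLY branch inside setup.py
    version = open('version.txt').read().strip()
    if os.environ.get('NIGHTLY'):
        # e.g. 0.2.0 -> 0.2.0.post20230127 for a one-per-day nightly upload
        version += '.post' + datetime.date.today().strftime('%Y%m%d')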
diff --git a/.github/workflows/report_precommit_failure.yml b/.github/workflows/report_precommit_failure.yml
new file mode 100644
index 000000000..e6ca7b01b
--- /dev/null
+++ b/.github/workflows/report_precommit_failure.yml
@@ -0,0 +1,67 @@
+name: Report Precommit Failure
+
+on:
+  workflow_run:
+    workflows: [pre-commit]
+    types:
+      - completed
+
+jobs:
+  # comment with a message on how to do pre-commit
+  # if the pre-commit check was not passed
+  report-precommit-failure:
+    runs-on: ubuntu-latest
+    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
+    steps:
+      - name: 'Download artifact'
+        uses: actions/github-script@v6
+        with:
+          script: |
+            let allArtifacts = await github.rest.actions.listWorkflowRunArtifacts({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              run_id: context.payload.workflow_run.id,
+            });
+            let matchArtifact = allArtifacts.data.artifacts.filter((artifact) => {
+              return artifact.name == "pr_number"
+            })[0];
+            let download = await github.rest.actions.downloadArtifact({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              artifact_id: matchArtifact.id,
+              archive_format: 'zip',
+            });
+            let fs = require('fs');
+            fs.writeFileSync(`${process.env.GITHUB_WORKSPACE}/pr_number.zip`, Buffer.from(download.data));
+
+      - name: 'Unzip artifact'
+        run: unzip pr_number.zip
+
+      - name: 'Comment on PR'
+        uses: actions/github-script@v6
+        with:
+          github-token: ${{ secrets.GITHUB_TOKEN }}
+          script: |
+            let fs = require('fs');
+            let issue_number = Number(fs.readFileSync('./pr_number'));
+            let owner = context.repo.owner;
+            let repo = context.repo.repo;
+            let run_id = context.payload.workflow_run.id;
+            let run_url = `https://github.com/${owner}/${repo}/actions/runs/${run_id}`
+            let body = `
+              Your pre-commit check failed; follow the steps below to run pre-commit on your files for code style consistency.
+
+              1. install pre-commit via "pip install pre-commit"
+              2. install pre-commit hooks via "pre-commit install"
+              3. run pre-commit on files with format errors via "pre-commit run --files path" by replacing "path" with the actual file path
+              4. commit and push to your branch
+
+              View your job at ${run_url}.
+              See "CONTRIBUTING.md" for more details on our code style.
+            `;
+
+            await github.rest.issues.createComment({
+              owner: owner,
+              repo: repo,
+              issue_number: issue_number,
+              body: body
+            });
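The artifact round-trip above exists because a `workflow_run` event carries no pull-request context, so the pre-commit run saves its PR number as an artifact and this reporter downloads it back. The same handoff expressed in Python against the GitHub REST API, for clarity (the helper name and token handling are illustrative):

    import io
    import zipfile

    import requests

    def read_pr_number(owner: str, repo: str, run_id: int, token: str) -> int:
        headers = {"Authorization": f"Bearer {token}"}
        url = f"https://api.github.com/repos/{owner}/{repo}/actions/runs/{run_id}/artifacts"
        artifacts = requests.get(url, headers=headers).json()["artifacts"]
        match = next(a for a in artifacts if a["name"] == "pr_number")
        # workflow artifacts always download as a zip archive
        blob = requests.get(match["archive_download_url"], headers=headers).content
        with zipfile.ZipFile(io.BytesIO(blob)) as zf:
            return int(zf.read("pr_number"))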
diff --git a/.github/workflows/report_test_coverage.yml b/.github/workflows/report_test_coverage.yml
new file mode 100644
index 000000000..dc3fe395f
--- /dev/null
+++ b/.github/workflows/report_test_coverage.yml
@@ -0,0 +1,74 @@
+name: Report Test Coverage
+
+on:
+  workflow_run:
+    workflows: [Build]
+    types:
+      - completed
+
+jobs:
+  report-test-coverage:
+    runs-on: ubuntu-latest
+    steps:
+      - name: 'Download artifact'
+        uses: actions/github-script@v6
+        with:
+          script: |
+            let allArtifacts = await github.rest.actions.listWorkflowRunArtifacts({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              run_id: context.payload.workflow_run.id,
+            });
+            let matchArtifact = allArtifacts.data.artifacts.filter((artifact) => {
+              return artifact.name == "report"
+            })[0];
+            let download = await github.rest.actions.downloadArtifact({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              artifact_id: matchArtifact.id,
+              archive_format: 'zip',
+            });
+            let fs = require('fs');
+            fs.writeFileSync(`${process.env.GITHUB_WORKSPACE}/report.zip`, Buffer.from(download.data));
+
+      - name: 'Unzip artifact'
+        run: |
+          unzip report.zip
+
+      - name: Code Coverage Report
+        uses: irongut/CodeCoverageSummary@v1.3.0
+        with:
+          filename: coverage.xml
+          badge: true
+          format: markdown
+          hide_branch_rate: false
+          hide_complexity: false
+          indicators: true
+          output: both
+          thresholds: '80 90'
+
+      - name: Make Coverage Report Collapsable
+        run: |
+          sed -i '2 i <details>' code-coverage-results.md
+          sed -i '3 i <summary>Click me to view the complete report</summary>' code-coverage-results.md
+          echo "</details>" >> code-coverage-results.md
+
+      - name: 'Comment on PR'
+        uses: actions/github-script@v6
+        with:
+          github-token: ${{ secrets.GITHUB_TOKEN }}
+          script: |
+            let fs = require('fs');
+            let issue_number = Number(fs.readFileSync('./pr_number'));
+            let owner = context.repo.owner;
+            let repo = context.repo.repo;
+            let run_id = context.payload.workflow_run.id;
+            let run_url = `https://github.com/${owner}/${repo}/actions/runs/${run_id}`
+            let body = fs.readFileSync('./code-coverage-results.md', {encoding:'utf8', flag:'r'})
+
+            await github.rest.issues.createComment({
+              owner: owner,
+              repo: repo,
+              issue_number: issue_number,
+              body: body
+            });
diff --git a/.github/workflows/scripts/example_checks/check_dispatch_inputs.py b/.github/workflows/scripts/example_checks/check_dispatch_inputs.py
new file mode 100644
index 000000000..04d2063ec
--- /dev/null
+++ b/.github/workflows/scripts/example_checks/check_dispatch_inputs.py
@@ -0,0 +1,27 @@
+import argparse
+import os
+
+
+def check_inputs(input_list):
+    for path in input_list:
+        real_path = os.path.join('examples', path)
+        if not os.path.exists(real_path):
+            return False
+    return True
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument('-f', '--fileNameList', type=str, help="List of file names")
+    args = parser.parse_args()
+    name_list = args.fileNameList.split(",")
+    is_correct = check_inputs(name_list)
+
+    if is_correct:
+        print('success')
+    else:
+        print('failure')
+
+
+if __name__ == '__main__':
+    main()
diff --git a/.github/workflows/scripts/weekly_check_example.py b/.github/workflows/scripts/example_checks/check_example_weekly.py
similarity index 76%
rename from .github/workflows/scripts/weekly_check_example.py
rename to .github/workflows/scripts/example_checks/check_example_weekly.py
index dfedc4628..941e90901 100644
--- a/.github/workflows/scripts/weekly_check_example.py
+++ b/.github/workflows/scripts/example_checks/check_example_weekly.py
@@ -5,9 +5,9 @@ def show_files(path, all_files):
     # Traverse all the folder/file in current directory
     file_list = os.listdir(path)
     # Determine the element is folder or file. If file, pass it into list, if folder, recurse.
-    for file in file_list:
+    for file_name in file_list:
         # Get the abs directory using os.path.join() and store into cur_path.
-        cur_path = os.path.join(path, file)
+        cur_path = os.path.join(path, file_name)
         # Determine whether folder
         if os.path.isdir(cur_path):
             show_files(cur_path, all_files)
@@ -26,9 +26,8 @@ def main():
     for file_loc in contents:
         split_loc = file_loc.split('/')
         # must have two sub-folder levels after examples folder, such as examples/images/vit is acceptable, examples/images/README.md is not, examples/requirements.txt is not.
-        if len(split_loc) - split_loc.index('examples') >= 3:
-            tmp_loc = split_loc[(split_loc.index('examples') + 1):(split_loc.index('examples') + 3)]
-            re_loc = join(tmp_loc, '/')
+        if len(split_loc) >= 4:
+            re_loc = '/'.join(split_loc[1:3])
             if re_loc not in all_loc:
                 all_loc.append(re_loc)
     print(all_loc)
diff --git a/.github/workflows/scripts/changed_example.py b/.github/workflows/scripts/example_checks/detect_changed_example.py
similarity index 52%
rename from .github/workflows/scripts/changed_example.py
rename to .github/workflows/scripts/example_checks/detect_changed_example.py
index ac2f0864e..df4fd6736 100644
--- a/.github/workflows/scripts/changed_example.py
+++ b/.github/workflows/scripts/example_checks/detect_changed_example.py
@@ -3,14 +3,19 @@ import argparse

 def main():
     parser = argparse.ArgumentParser()
-    parser.add_argument('--fileNameList', type=str)
+    parser.add_argument('-f', '--fileNameList', type=str, help="The list of changed files")
     args = parser.parse_args()
     name_list = args.fileNameList.split(":")
     folder_need_check = set()
     for loc in name_list:
-        # Find only the sub-folder of 'example' folder
+        # Find only the sub-sub-folder of 'example' folder
+        # the examples folder structure is like
+        # - examples
+        #   - area
+        #     - application
+        #       - file
         if loc.split("/")[0] == "examples" and len(loc.split("/")) >= 4:
-            folder_need_check.add(loc.split("/")[1] + "/" + loc.split("/")[2])
+            folder_need_check.add('/'.join(loc.split("/")[1:3]))
     # Output the result using print. Then the shell can get the values.
     print(list(folder_need_check))
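Both check_example_weekly.py and detect_changed_example.py now share the same convention: a path at least three levels deep under examples/ collapses to its area/application folder, and a set (or membership check) de-duplicates repeats. A short illustration of that mapping (the sample paths are made up):

    def collapse(file_loc: str):
        split_loc = file_loc.split('/')
        # examples/<area>/<application>/<file> has at least four components
        if split_loc[0] == 'examples' and len(split_loc) >= 4:
            return '/'.join(split_loc[1:3])
        return None

    assert collapse('examples/language/gpt/train.py') == 'language/gpt'
    assert collapse('examples/images/README.md') is None
    assert collapse('README.md') is None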
diff --git a/.github/workflows/scripts/input_check_example.py b/.github/workflows/scripts/input_check_example.py
deleted file mode 100644
index 5602d8f09..000000000
--- a/.github/workflows/scripts/input_check_example.py
+++ /dev/null
@@ -1,23 +0,0 @@
-import argparse
-import os
-
-
-def detect_correct(loc_li):
-    for loc in loc_li:
-        real_loc = 'examples/' + eval(loc)
-        if not os.path.exists(real_loc):
-            return -1
-    return 1
-
-
-def main():
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--fileNameList', type=str)
-    args = parser.parse_args()
-    name_list = args.fileNameList.split(",")
-    result = detect_correct(name_list)
-    print(result)
-
-
-if __name__ == '__main__':
-    main()
diff --git a/.github/workflows/translate_comment.yml b/.github/workflows/translate_comment.yml
new file mode 100644
index 000000000..83c127b3c
--- /dev/null
+++ b/.github/workflows/translate_comment.yml
@@ -0,0 +1,18 @@
+name: 'issue-translator'
+on:
+  issue_comment:
+    types: [created]
+  issues:
+    types: [opened]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: usthe/issues-translate-action@v2.7
+        with:
+          IS_MODIFY_TITLE: false
+          # not required, default false. Decide whether to modify the issue title.
+          # If true, the robot account @Issues-translate-bot must have modification permissions; invite @Issues-translate-bot to your project or use your own custom bot.
+          CUSTOM_BOT_NOTE: Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿
+          # not required. Customize the translation bot's prefix message.
diff --git a/.gitignore b/.gitignore
index 6b6f980e3..bf74a7538 100644
--- a/.gitignore
+++ b/.gitignore
@@ -151,3 +151,7 @@ colossalai/version.py
 
 # ignore python interface defition file
 .pyi
+
+# ignore coverage test file
+coverage.lcov
+coverage.xml
diff --git a/README-zh-Hans.md b/README-zh-Hans.md
index 8edcff28b..5ad22785c 100644
--- a/README-zh-Hans.md
+++ b/README-zh-Hans.md
@@ -5,10 +5,10 @@
 Colossal-AI: 一个面向大模型时代的通用深度学习系统

[The <a href="..."> markup in this hunk was lost in extraction. The hunk retargets the header links 论文 (paper) | 文档 (docs) | 例程 (examples) | 论坛 (forum) and the 博客 (blog) link; the visible link text is unchanged.]

 [![Build](https://github.com/hpcaitech/ColossalAI/actions/workflows/build.yml/badge.svg)](https://github.com/hpcaitech/ColossalAI/actions/workflows/build.yml)

@@ -35,7 +35,7 @@
[The remaining hunks update the table-of-contents anchors for 并行训练样例展示 (parallel training demo), 单GPU训练样例展示 (single-GPU training demo), 推理 (Energon-AI) 样例展示 (inference demo) and Colossal-AI 成功案例 (success stories); the link markup was likewise lost in extraction and the visible text is unchanged.]