duanjunwen
a9bedc7a43
[Shardformer] Support zbv in Shardformer Policy (#6150)
...
* [feat] Shardformer support zbv
* [feat] support chatglm2, command, deepseek for zbv
* [feat] support zbv in shardformer policy: falcon, gptj, mistral, opt, qwen2, t5, vit, whisper
* [feat] support GPT2FusedLinearConv1D
* [feat] support GPT2FusedLinear (without tp)
* [fix] debug FusedConvLinear
* [shardformer] support gpt2 policy for zbv, support GPT2FusedLinearConv Col and Row.
* [Shardformer] support FusedLinear1D base for zbv
* [shardformer] support zbv in FusedLinear1D base, Col, Row
* [shardformer] support zbv in blip2 and sam policy
* [shardformer] fix bug incorrect number of gradients; add fusedLinear base testcase;
* [fix] fix incorrect number of gradients;
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [Shardformer] add en doc for zbv;
* [fix] fix typo in Model compatibility table
* [fix] fix API Reference typo
* [Shardformer] add zh-Han doc for zbv
* [fix] fix Linear name; update en & zh doc
* [fix] fix shardformer doc import err
* [fix] fix shardconfig import in doc
* [fix] fix shardformer doc
* [fix] fix shardconfig doc
* [fix] fix config
* [fix] remove shardconfig
* [fix] fix doc
* [feat] add zbv doc string
* [fix] rm doc
* [fix] fix doc
* [fix] empty zbv doc
* [fix] fix torch version
* [fix] fix torch version
* [fix] fix torch versions
* [fix] fix torch versions
* [fix] fix pyramid versions
* [fix] fix pyramid, zope version
* [fix] try fix workflow
* [fix] try import ShardConfig in yml
* [fix] fix workflow
* [fix] fix workflow
* [fix] fix workflow
* [fix] fix workflow
* [fix] fix ci
* [fix] fix zbv doc
* [fix] fix param for qkv linear, gpt2fused linear; fix requirements;
* [fix] fix policy use fused_linear
* [fix] fix weight grad none, err caused by weight ptr change
* [fix] fix comm in WeightGradStore
* [fix] fix WeightGradStore pop param
* [fix] remove useless param in doc; fix gpt2 qkv test;
* [shardformer] simplify execute_w_pass_grad_accum;
* [fix] rm useless comments
* [shardformer] simplify execute_w_pass_grad_accum & execute_w_pass
* [shardformer] Run meaningful doc test
* [shardformer] fix doc test cmd;
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2025-01-02 10:22:26 +08:00
Wenxuan Tan
62c13e7969
[Ring Attention] Improve comments (#6085)
...
* improve comments
* improve comments
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
2024-10-16 11:23:35 +08:00
Wenxuan Tan
8fd25d6e09
[Feature] Split cross-entropy computation in SP (#5959)
...
* halfway
* fix cross-PP-stage position id length diff bug
* fix typo
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* unified cross entropy func for all shardformer models
* remove redundant lines
* add basic ring attn; debug cross entropy
* fwd bwd logic complete
* fwd bwd logic complete; add experimental triton rescale
* precision tests passed
* precision tests passed
* fix typos and remove misc files
* update softmax_lse shape by new interface
* change tester name
* remove buffer clone; support packed seq layout
* add varlen tests
* fix typo
* all tests passed
* add dkv_group; fix mask
* remove debug statements
* adapt chatglm, command-R, qwen
* debug
* halfway
* fix cross-PP-stage position id length diff bug
* fix typo
* fix typo
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* unified cross entropy func for all shardformer models
* remove redundant lines
* add basic ring attn; debug cross entropy
* fwd bwd logic complete
* fwd bwd logic complete; add experimental triton rescale
* precision tests passed
* precision tests passed
* fix typos and remove misc files
* add sp_mode to benchmark; fix varlen interface
* update softmax_lse shape by new interface
* add varlen tests
* fix typo
* all tests passed
* add dkv_group; fix mask
* remove debug statements
* add comments
* q1 index only once
* remove events to simplify stream sync
* simplify forward/backward logic
* 2d ring forward passed
* 2d ring backward passed
* fixes
* fix ring attn loss
* 2D ring backward + llama passed
* merge
* update logger
* fix typo
* rebase
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix typo
* remove typos
* fixes
* support GPT
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-09-10 12:06:50 +08:00
Edenzzzz
f5c84af0b0
[Feature] Zigzag Ring attention (#5905)
...
* halfway
* fix cross-PP-stage position id length diff bug
* fix typo
* fix typo
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* unified cross entropy func for all shardformer models
* remove redundant lines
* add basic ring attn; debug cross entropy
* fwd bwd logic complete
* fwd bwd logic complete; add experimental triton rescale
* precision tests passed
* precision tests passed
* fix typos and remove misc files
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add sp_mode to benchmark; fix varlen interface
* update softmax_lse shape by new interface
* change tester name
* remove buffer clone; support packed seq layout
* add varlen tests
* fix typo
* all tests passed
* add dkv_group; fix mask
* remove debug statements
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-08-16 13:56:38 +08:00
Zhongkai Zhao
8e412a548e
[shardformer] Sequence Parallelism Optimization (#5533)
...
* sequence parallel optimization
* validate sequence parallel in llama (code to be polished)
* shardformer api writing
* integrate sequence parallel in ShardFormer
* fix pp bugs and sp bugs for LlaMa model
* integrating ring-based sequence parallelism into ShardFormer
* [sequence parallelism]: Add fused megatron function
* integrating ring-based sequence parallelism into ShardFormer
---------
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
* fix bugs when using sp and flashattention together
* fix operation function name
* support flash attention for ulysses-style sp
* clarify sp process group
* fix compatibility bugs in moe plugin
* fix fused linear bugs
* fix linear layer test
* support gpt model all-to-all sp
* modify shard data dimension (meant to be dim=-1)
* support megatron-style sp and distributed attn for llama model
* [shardformer] add megatron sp to llama
* support llama7B 128k with distributed attention
* [shardformer] robustness enhancement
* add block attn
* sp mode 1: keep input as a complete sequence
* fix sp compatibility
* finish sp mode 3 support for gpt
* using all_to_all_single when batch size is 1
* support mode 2 sp in gpt2 (#5)
* [shardformer] add megatron sp to llama
* support llama7B 128k with distributed attention
* [shardformer] robustness enhancement
* add block attn
* sp mode 1: keep input as a complete sequence
* fix sp compatibility
* refactor ring implementation
* support mode 2 sp in gpt2
* polish code
* enable distributed attn mask when using sp mode 2 and 3 in llama
* automatically enable flash attn when using sp mode 2 and 3 in llama
* inplace attn mask
* add zero2 support for sequence parallel
* polish code
* fix bugs
* fix gemini checkpoint io
* loosen tensor checking atol and rtol
* add comment
* fix llama layernorm grad
* fix zero grad
* fix zero grad
* fix conflict
* update split and gather auto grad func
* sequence parallel: inside text split (#6)
* polish code (part 1)
* polish code (part 2)
* polish code (part 2.5)
* polish code (part 3)
* sequence parallel: inside text split
* miscellaneous minor fixes
* polish code
* fix ulysses style ZeRO
* sequence parallel: inside text split
* miscellaneous minor fixes
* disaggregate sp group and dp group for sp
* fix llama and gpt sp
* polish code
* move ulysses grad sync to ddp (#9)
* remove zero_stage and unbind the grad sync for alltoall sp
* add 2d group creation test
* move ulysses grad sync to ddp
* add 2d group creation test
* remove useless code
* change shard config not to enable sp when enable_all_optimizations
* add sp warnings for several models
* remove useless code
---------
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
2024-04-03 17:15:47 +08:00
Hongxin Liu
d202cc28c0
[npu] change device to accelerator api (#5239)
...
* update accelerator
* fix timer
* fix amp
* update
* fix
* update bug
* add error raise
* fix autocast
* fix set device
* remove doc accelerator
* update doc
* update doc
* update doc
* use nullcontext
* update cpu
* update null context
* change time limit for example
* update
* update
* update
* update
* [npu] polish accelerator code
---------
Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com>
Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>
2024-01-09 10:20:05 +08:00
Xuanlei Zhao
dd2c28a323
[npu] use extension for op builder (#5172)
...
* update extension
* update cpu adam
* update is
* add doc for cpu adam
* update kernel
* update commit
* update flash
* update memory efficient
* update flash attn
* update flash attention loader
* update api
* fix
* update doc
* update example time limit
* reverse change
* fix doc
* remove useless kernel
* fix
* not use warning
* update
* update
2024-01-08 11:39:16 +08:00
Xuanlei Zhao
d6df19bae7
[npu] support triangle attention for llama (#5130)
...
* update fused attn
* update sdpa
* tri attn
* update triangle
* import
* fix
* fix
2023-11-30 14:21:30 +08:00
Xuanlei Zhao
3acbf6d496
[npu] add npu support for hybrid plugin and llama (#5090)
...
* llama 3d
* update
* fix autocast
2023-11-22 19:23:21 +08:00
littsk
1a3315e336
[hotfix] Add layer norm gradients all-reduce for sequence parallel (#4926)
...
* [hotfix] Add layer norm gradients all-reduce for sequence parallel. (#4915)
* Add layer norm gradients all-reduce for sequence parallel.
* skip pipeline inference test
* [hotfix] fixing policies of sequence parallel (#4922)
* Add layer norm gradients all-reduce for sequence parallel.
* fix parameter passing when calling get_autopolicy
---------
Co-authored-by: littsk <1214689160@qq.com>
* Hotfix/add grad all reduce for sequence parallel (#4927)
* Add layer norm gradients all-reduce for sequence parallel.
* fix parameter passing when calling get_autopolicy
* fix bug using wrong variables
---------
Co-authored-by: littsk <1214689160@qq.com>
* fix policy initialization
* fix bloom and chatglm policies
* polish code of handling layernorm
* fix moe module
* polish code of class initializing
---------
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
2023-11-03 13:32:43 +08:00
Hongxin Liu
079bf3cb26
[misc] update pre-commit and run all files (#4752)
...
* [misc] update pre-commit
* [misc] run pre-commit
* [misc] remove useless configuration files
* [misc] ignore cuda for clang-format
2023-09-19 14:20:26 +08:00
Hongxin Liu
172f7fa3cf
[misc] resolve code factor issues (#4433)
2023-08-15 23:25:14 +08:00
Baizhou Zhang
ed4c448488
[pipeline] rewrite t5 tests & support multi-tensor transmitting in pipeline (#4388)
...
* fix remaining t5 bugs/rewrite t5 tests
* fix multi-tensor communication in pipeline
* rearrange test_config
* fix keyerror in sync_shared_params
* fix get_held_layers & Randomizer, complete t5 tests
* erase printing
* fix get_held_layers through modifying _release_unheld_layers
* fix _get_recursive_held_layers bug
2023-08-15 23:25:14 +08:00
Frank Lee
b1c2901530
[shardformer] supported bloom model (#4098)
2023-07-04 16:05:01 +08:00
Frank Lee
015af592f8
[shardformer] integrated linear 1D with dtensor (#3996)
...
* [shardformer] integrated linear 1D with dtensor
* polish code
2023-07-04 16:05:01 +08:00