[Feature] Zigzag Ring attention (#5905)

* halfway

* fix position id length mismatch across PP stages

* fix typo

* fix typo

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* unified cross entropy func for all shardformer models

* remove redundant lines

* add basic ring attn (see the sketch after this list); debug cross entropy

* fwd bwd logic complete

* fwd bwd logic complete; add experimental triton rescale

* precision tests passed

* precision tests passed

* fix typos and remove misc files

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add sp_mode to benchmark; fix varlen interface

* update softmax_lse shape by new interface

* change tester name

* remove buffer clone; support packed seq layout

* add varlen tests

* fix typo

* all tests passed

* add dkv_group; fix mask

* remove debug statements
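
For context on the ideas the commits above refer to (the zigzag split behind "add basic ring attn" and the log-sum-exp correction behind the "experimental triton rescale"), here is a minimal, illustrative PyTorch sketch. `zigzag_split` and `merge_attn_outputs` are hypothetical names, not functions added by this PR, and the actual rescale presumably runs inside a kernel rather than eagerly as shown here.

```python
# Illustrative sketch only (hypothetical helper names, not this PR's API).
import torch


def zigzag_split(x: torch.Tensor, rank: int, world_size: int, dim: int = 1) -> torch.Tensor:
    """Zigzag sequence split: rank i keeps chunks i and 2*world_size - 1 - i.

    Splitting the sequence into 2 * world_size chunks and pairing each chunk
    with its mirror balances the causal-attention workload across ring ranks
    (assumes the sequence length is divisible by 2 * world_size).
    """
    chunks = x.chunk(2 * world_size, dim=dim)
    return torch.cat([chunks[rank], chunks[2 * world_size - 1 - rank]], dim=dim)


def merge_attn_outputs(out1, lse1, out2, lse2):
    """Merge two partial attention outputs via their log-sum-exp (LSE).

    out1/out2: [B, S, H, D] partial outputs over disjoint key blocks.
    lse1/lse2: [B, H, S] log-sum-exp of the attention scores for those blocks.
    """
    lse = torch.logaddexp(lse1, lse2)                          # combined normalizer
    w1 = torch.exp(lse1 - lse).transpose(1, 2).unsqueeze(-1)   # [B, S, H, 1]
    w2 = torch.exp(lse2 - lse).transpose(1, 2).unsqueeze(-1)
    return w1 * out1 + w2 * out2, lse
```

In a ring schedule, each rank repeats this merge once per received KV block, so the rescale only ever touches the locally held queries.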

---------

Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Author: Edenzzzz
Date: 2024-08-16 13:56:38 +08:00
Committed by: GitHub
Parent: 887d2d579b
Commit: f5c84af0b0
50 changed files with 1870 additions and 326 deletions


@@ -57,14 +57,14 @@ class FlashAttentionDaoCudaExtension(_Extension):
             q_indices: Optional[torch.Tensor] = None,
             kv_indices: Optional[torch.Tensor] = None,
         ):
-            # [B, N, S, D] -> [B, S, N, D]
+            # [B, H, S, D] -> [B, S, H, D]
             q = q.transpose(1, 2)
             k = k.transpose(1, 2)
             v = v.transpose(1, 2)
             b, s_q = q.shape[:2]
             if cu_seqlens_q is not None:
                 # padded / padded causal
-                # unpad input: [B, S, N, D] -> [T, N, D]
+                # unpad input: [B, S, H, D] -> [T, H, D]
                 q = _unpad_input(q, q_indices)
                 kv = _unpad_input(torch.stack(tensors=(k, v), dim=2), kv_indices)
                 attn_output = flash_attn_varlen_kvpacked_func(
@@ -78,7 +78,7 @@ class FlashAttentionDaoCudaExtension(_Extension):
                     softmax_scale=scale,
                     causal=is_causal,
                 )
-                # pad output: [T, N, D] -> [B, S, N, D]
+                # pad output: [T, H, D] -> [B, S, H, D]
                 attn_output = pad_input(attn_output, q_indices, b, s_q)
             else:
                 # causal / no attn mask
@@ -90,7 +90,7 @@ class FlashAttentionDaoCudaExtension(_Extension):
                     softmax_scale=scale,
                     causal=is_causal,
                 )
-            # [B, S, N, D] -> [B, N, S, D]
+            # [B, S, H, D] -> [B, H, S, D]
             return attn_output.transpose(1, 2)
 
         return flash_attention
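
To make the shape bookkeeping in the comments above concrete (B = batch, S = padded sequence length, H = heads, D = head dim, T = total unpadded tokens), here is a plain-PyTorch sketch of the padded-to-packed conversion. The extension itself delegates this to flash-attn's padding helpers; `unpad`/`pad` below and the `[B, S]` boolean `attention_mask` are illustrative assumptions, not the code in this diff.

```python
# Sketch of [B, S, H, D] <-> [T, H, D] conversion with cu_seqlens (illustrative only).
import torch


def unpad(x_bshd: torch.Tensor, attention_mask: torch.Tensor):
    """[B, S, H, D] -> [T, H, D], plus cu_seqlens and the flat valid-token indices."""
    b, s = attention_mask.shape
    seqlens = attention_mask.sum(dim=-1, dtype=torch.int32)                              # [B]
    cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))   # [B + 1]
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()          # [T]
    x_thd = x_bshd.reshape(b * s, *x_bshd.shape[2:])[indices]
    return x_thd, cu_seqlens, indices


def pad(x_thd: torch.Tensor, indices: torch.Tensor, b: int, s: int):
    """[T, H, D] -> [B, S, H, D], inverse of `unpad` (padded positions are zero)."""
    out = x_thd.new_zeros(b * s, *x_thd.shape[1:])
    out[indices] = x_thd
    return out.reshape(b, s, *x_thd.shape[1:])
```

The varlen kernel consumes the packed [T, H, D] tensors together with cu_seqlens, so padded positions never enter the attention computation; padding is only restored on the output before the final transpose back to [B, H, S, D].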