Edenzzzz 2025-05-06 14:14:22 -05:00
parent 35c2c44d52
commit 35f45ffd36


@@ -410,7 +410,7 @@ class RingAttention(torch.autograd.Function):
 We also adopt the double ring topology from LoongTrain to fully utilize available
 NICs on each node, by computing attention within an inner ring first and then sending all KVs to the next
 ring at once.
-Our implementation references
+Our implementation references code from
 - ring-flash-attention: https://github.com/zhuzilin/ring-flash-attention/tree/main
 - Megatron Context Parallel: https://github.com/NVIDIA/TransformerEngine/pull/726
 References:
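
For context on the double-ring topology the docstring describes, here is a minimal sketch of how the ranks could be split into inner rings (intra-node, attention computed first) and an outer ring (inter-node, all KVs forwarded in one step). The function name `create_double_ring_groups` and the `inner_ring_size` parameter are illustrative assumptions, not the actual ColossalAI API.

```python
import torch.distributed as dist

def create_double_ring_groups(world_size: int, inner_ring_size: int):
    """Hypothetical helper: build inner-ring and outer-ring process groups.

    Assumes the default process group is initialized, `world_size` is divisible
    by `inner_ring_size`, and ranks on the same node are consecutive.
    """
    assert world_size % inner_ring_size == 0
    num_inner_rings = world_size // inner_ring_size
    rank = dist.get_rank()

    inner_group = None
    inter_group = None

    # Inner rings: consecutive ranks (same node) circulate KV among themselves first.
    for i in range(num_inner_rings):
        ranks = list(range(i * inner_ring_size, (i + 1) * inner_ring_size))
        group = dist.new_group(ranks)  # must be called on every rank
        if rank in ranks:
            inner_group = group

    # Outer ring: ranks at the same local position across nodes, so each inner
    # ring sends all of its KVs to the next ring in a single inter-node step.
    for j in range(inner_ring_size):
        ranks = list(range(j, world_size, inner_ring_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            inter_group = group

    return inner_group, inter_group
```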