Mirror of https://github.com/hpcaitech/ColossalAI.git
[Inference] Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461)
* Support FP16/BF16 Flash Attention 2
* fix bugs in test_kv_cache_memcpy.py
* add context_kv_cache_memcpy_kernel.cu
* rm typename MT
* add tail process
* add high_precision
* add high_precision to config.py
* rm unused code
* change the comment for the high_precision parameter
* update test_rotary_embdding_unpad.py
* fix vector_copy_utils.h
* add comment for self.high_precision when using float32
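For context on the high_precision bullets above: with FP16/BF16 activations, rounding in the rotary multiply-adds can accumulate error, so the flag upcasts that math to float32 and casts back on output. Below is a minimal PyTorch sketch of the idea, assuming a [num_tokens, num_heads, head_dim] query layout; the function name and signature are illustrative, not the kernel's actual interface.

import torch

def rotary_embed(q: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                 high_precision: bool = True) -> torch.Tensor:
    """Apply rotary position embedding to q (illustrative sketch).

    q:   [num_tokens, num_heads, head_dim] in fp16/bf16/fp32
    cos: [num_tokens, head_dim // 2]
    sin: [num_tokens, head_dim // 2]
    """
    out_dtype = q.dtype
    if high_precision:
        # Upcast so the rotation's multiply-adds run in float32 even for
        # fp16/bf16 inputs; cast back to the input dtype at the end.
        q, cos, sin = q.float(), cos.float(), sin.float()
    half = q.shape[-1] // 2
    x1, x2 = q[..., :half], q[..., half:]
    cos = cos.unsqueeze(1)  # broadcast over the head dimension
    sin = sin.unsqueeze(1)
    rotated = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.to(out_dtype)

With high_precision=False the rotation stays in the input dtype, matching a faster low-precision path; the real fused CUDA kernel also interleaves this with the KV-cache copy, which this sketch does not attempt.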
@@ -12,6 +12,7 @@ class InferenceOpsCudaExtension(_CudaExtension):
             for fname in [
                 "cuda/pybind/inference.cpp",
                 "cuda/decode_kv_cache_memcpy_kernel.cu",
+                "cuda/context_kv_cache_memcpy_kernel.cu",
                 "cuda/fused_rotary_emb_and_cache_kernel.cu",
                 "cuda/activation_kernel.cu",
                 "cuda/rms_layernorm_kernel.cu",
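The one-line addition above registers the new context_kv_cache_memcpy_kernel.cu source so it is compiled into the inference ops extension. The commit message also notes that high_precision is surfaced through config.py; a hypothetical usage sketch follows, where the InferenceConfig stand-in and its fields are assumptions inferred from the message, not the verified API.

from dataclasses import dataclass

@dataclass
class InferenceConfig:  # stand-in for the real config object in config.py
    dtype: str = "fp16"
    # Assumed field, named after the commit message
    # ("add high_precision to config.py"): run rotary math in float32.
    high_precision: bool = False

config = InferenceConfig(dtype="bf16", high_precision=True)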