[Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418)

* add rotary embedding kernel * add rotary_embedding_kernel * add fused rotary_emb and kvcache memcopy * add fused_rotary_emb_and_cache_kernel.cu * add fused_rotary_emb_and_memcopy * fix bugs in fused_rotary_emb_and_cache_kernel.cu * fix ci bugs * use vec memcopy and opt the gloabl memory access * fix code style * fix test_rotary_embdding_unpad.py * codes revised based on the review comments * fix bugs about include path * rm inline
2025-09-07 12:01:39 +00:00 · 2024-03-13 17:20:03 +08:00
parent ed431de4e4
commit f366a5ea1f
13 changed files with 928 additions and 78 deletions
--- a/colossalai/inference/utils.py
+++ b/colossalai/inference/utils.py
@@ -47,5 +47,5 @@ def init_to_get_rotary(self, base=10000, use_elem=False):
    t = torch.arange(max_seq_len + 1024 * 64, device="cpu", dtype=torch.float32) / rope_scaling_factor
    freqs = torch.outer(t, inv_freq)

-    self._cos_cached = torch.cos(freqs).to(torch.float16).cuda()
-    self._sin_cached = torch.sin(freqs).to(torch.float16).cuda()
+    self._cos_cached = torch.cos(freqs).to(self.dtype).cuda()
+    self._sin_cached = torch.sin(freqs).to(self.dtype).cuda()