[Inference]Adapted to the triton attn kernels (#5264)

* adapted to the triton attn kernels
* fix pad input
* adapted to copy_kv_to_blocked_cache
* fix ci test
* update kv memcpy
* remove print
2025-09-21 09:29:47 +00:00 · 2024-01-17 16:03:10 +08:00
parent 0f2b46a41c
commit 86b63f720c
7 changed files with 221 additions and 101 deletions
--- a/colossalai/inference/core/engine.py
+++ b/colossalai/inference/core/engine.py
@@ -236,6 +236,7 @@ class InferenceEngine:
        output_list = []
        batch = self.request_handler.schedule()

+        # TODO: padding_id is used for generating attn_mask and will be removed if nopad version is supported.
        logits = self.model(
            batch,
            self.k_cahce,