[Inference]Adapted to the triton attn kernels (#5264)

* adapted to the triton attn kernels

* fix pad input

* adapted to copy_kv_to_blocked_cache

* fix ci test

* update kv memcpy

* remove print
yuehuayingxueluo
2024-01-17 16:03:10 +08:00
committed by GitHub
parent 0f2b46a41c
commit 86b63f720c
7 changed files with 221 additions and 101 deletions

@@ -236,6 +236,7 @@ class InferenceEngine:
        output_list = []
        batch = self.request_handler.schedule()
        # TODO: padding_id is used for generating attn_mask and will be removed if nopad version is supported.
        logits = self.model(
            batch,
            self.k_cahce,