[Fix/Inference] Fix GQA Triton and Support Llama3 (#5624)

* [fix] GQA calling of flash decoding triton
* fix kv cache alloc shape
* fix rotary triton - GQA
* fix sequence max length assigning
* Sequence max length logic
* fix scheduling and spec-dec
* skip without import error
* fix pytest - skip without ImportError

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2025-09-07 03:52:01 +00:00 · 2024-04-23 13:09:55 +08:00
parent ccf72797e3
commit 5d4c1fe8f5
9 changed files with 183 additions and 194 deletions
--- a/colossalai/inference/struct.py
+++ b/colossalai/inference/struct.py
@@ -117,6 +117,14 @@ class Sequence:

        return False

+    def revoke_finished_status(self) -> None:
+        """
+        Revoke the finished status of the sequence.
+        This is only used by speculative decoding for now.
+        """
+        if RequestStatus.is_finished(self.status):
+            self.status = RequestStatus.RUNNING
+
    def __hash__(self):
        return hash(self.request_id)
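
A minimal, self-contained sketch of how the new `revoke_finished_status` method behaves. The `RequestStatus` enum and `Sequence` class below are simplified stand-ins written for illustration only; the real definitions in `colossalai/inference/struct.py` have more members and fields. The motivation suggested in the comments (a speculative-decoding step marking a sequence finished on a token that is later rejected) is an assumption, not something the commit states.

```python
import enum
from dataclasses import dataclass


class RequestStatus(enum.Enum):
    """Simplified stand-in for the real enum (assumption: fewer members)."""

    WAITING = enum.auto()
    RUNNING = enum.auto()
    COMPLETED = enum.auto()

    @staticmethod
    def is_finished(status: "RequestStatus") -> bool:
        # In this sketch only COMPLETED counts as a finished state.
        return status == RequestStatus.COMPLETED


@dataclass
class Sequence:
    """Simplified stand-in holding only the fields this sketch needs."""

    request_id: int
    status: RequestStatus = RequestStatus.WAITING

    def revoke_finished_status(self) -> None:
        # Same logic as the method added in the diff: a finished sequence
        # goes back to RUNNING; any other status is left untouched.
        if RequestStatus.is_finished(self.status):
            self.status = RequestStatus.RUNNING


# Assumed scenario: a sequence was marked finished, but generation
# must continue, so the finished status is revoked.
finished_seq = Sequence(request_id=0, status=RequestStatus.COMPLETED)
finished_seq.revoke_finished_status()

# A sequence that was never finished is not modified.
waiting_seq = Sequence(request_id=1)
waiting_seq.revoke_finished_status()
```

Because the method checks `is_finished` first, calling it is a no-op for sequences that are still waiting or running, so a caller can apply it to a whole batch without filtering.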