[Inference] Optimize and Refactor Inference Batching/Scheduling (#5367)

* add kvcache manager funcs for batching

* add batch bucket for batching (see the sketch after this list)

* revise RunningList struct in handler

* add kvcache/batch funcs for compatibility

* use new batching methods

* fix indexing bugs

* revise abort logic

* use cpu seq lengths/block tables

* rm unused attr in Sequence

* fix type conversion/default arg

* add and revise pytests

* revise pytests, rm unused tests

* rm unused statements

* fix pop finished indexing issue

* fix: use index in batch when retrieving inputs/update seqs

* use dict instead of odict in batch struct

* arg type hinting

* fix make compress

* refine comments

* fix: pop_n_seqs to pop the first n seqs

* add check in request handler

* remove redundant conversion

* fix test for request handler

* fix pop method in batch bucket

* fix prefill adding
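
As a rough illustration of the batching changes listed above (a batch bucket struct, CPU-side seq lengths/block tables, a plain dict instead of an OrderedDict, and pop_n_seqs popping the first n seqs), here is a minimal, hypothetical sketch. Class and method names are simplified assumptions for illustration, not the PR's actual API:

    # Hypothetical, simplified sketch -- illustrative names, not the PR's actual API.
    from typing import Dict, List, Tuple


    class SimpleBatchBucket:
        """Keeps per-sequence bookkeeping on the CPU, keyed by request id."""

        def __init__(self) -> None:
            # Plain dicts (insertion-ordered since Python 3.7), echoing the
            # "use dict instead of odict in batch struct" entry above.
            self._seq_lengths: Dict[int, int] = {}          # request_id -> current length
            self._block_tables: Dict[int, List[int]] = {}   # request_id -> KV-cache block ids

        def add_seq(self, request_id: int, input_len: int, block_table: List[int]) -> None:
            self._seq_lengths[request_id] = input_len
            self._block_tables[request_id] = block_table

        def get_block_table(self, request_id: int) -> List[int]:
            return self._block_tables[request_id]

        def pop_n_seqs(self, n: int) -> Tuple[List[int], List[List[int]]]:
            # Pop the *first* n sequences in insertion order, matching the
            # "fix: pop_n_seqs to pop the first n seqs" entry above.
            popped = list(self._seq_lengths)[:n]
            tables = [self._block_tables.pop(rid) for rid in popped]
            for rid in popped:
                del self._seq_lengths[rid]
            return popped, tables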
Author: Yuanheng Zhao
Date:   2024-02-19 17:18:20 +08:00 (committed by GitHub)
Commit: b21aac5bae (parent 8c69debdc7)
11 changed files with 902 additions and 112 deletions


@@ -71,7 +71,6 @@ class Sequence:
     input_token_id: List[int]
     block_size: int
     sample_params: Any # SampleParams needs to be imported later.
-    block_table: torch.Tensor
     eos_token_id: int
     pad_token_id: int
     max_output_len: int = 256
@@ -158,7 +157,6 @@ class Sequence:
             f"prompt={self.prompt}, "
             f"status={self.status.name}, "
             f"sample_params={self.sample_params}, "
-            f"logical_block_number={self.block_table.shape[0]},"
             f"input_len={self.input_len}),"
             f"output_len={self.output_len})"
         )
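
The hunks above drop block_table from Sequence (and the corresponding logical_block_number field from its repr): block tables are now owned by the batching/KV-cache bookkeeping rather than by individual sequences. A usage example of the sketch above (assumed names, for illustration only):

    bucket = SimpleBatchBucket()
    bucket.add_seq(request_id=0, input_len=16, block_table=[0, 1, 2])
    bucket.add_seq(request_id=1, input_len=8, block_table=[3, 4])

    # The block table is looked up from the bucket, not read off a Sequence attribute.
    assert bucket.get_block_table(0) == [0, 1, 2]

    # Pops the first sequence (request 0) together with its block table.
    assert bucket.pop_n_seqs(1) == ([0], [[0, 1, 2]])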