Mirror of https://github.com/hwchase17/langchain.git (synced 2025-07-03 11:47:49 +00:00)
nvidia-trt[patch]: propagate InferenceClientException to the caller. (#16936)
- **Description:** Two changes: (1) propagate `InferenceServerException` to the caller, and (2) stop the gRPC receiver thread on exception. Before the change I got:

  ```
      for token in result_queue:
  >       result_str += token
  E       TypeError: can only concatenate str (not "InferenceServerException") to str

  ../../langchain_nvidia_trt/llms.py:207: TypeError
  ```

  and the stream thread kept running. After the change the request thread stops correctly and the caller gets the root-cause exception:

  ```
  E   tritonclient.utils.InferenceServerException: [request id: 4529729] expected number of inputs between 2 and 3 but got 10 inputs for model 'vllm_model'

  ../../langchain_nvidia_trt/llms.py:205: InferenceServerException
  ```

- **Issue:** the issue # it fixes if applicable
- **Dependencies:** any dependencies required for this change
- **Twitter handle:** [t.me/mkhl_spb](https://t.me/mkhl_spb)

I'm not sure about test coverage. Should I set up deep mocks, or is there some kind of Triton stub available via testcontainers or similar?
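Regarding the test question: one option that needs neither deep mocks nor a Triton container is to exercise the new consumption logic directly against a mocked client. A minimal sketch, assuming the loop is mirrored in a standalone `consume` helper for the test (illustrative only; in the package the loop lives inside `TritonTensorRTLLM`):

```python
import pytest
from unittest.mock import MagicMock
from tritonclient.utils import InferenceServerException


def consume(result_queue, client):
    """Stand-in for the patched loop in llms.py: drain the stream, re-raise
    any exception placed on the queue, and always stop the stream."""
    result_str = ""
    try:
        for token in result_queue:
            if isinstance(token, Exception):
                raise token
            result_str += token
    finally:
        client.stop_stream()
    return result_str


def test_exception_is_propagated_and_stream_is_stopped():
    client = MagicMock()
    error = InferenceServerException("expected number of inputs between 2 and 3")

    with pytest.raises(InferenceServerException):
        consume(iter(["partial ", error]), client)

    # stop_stream() must run even though iteration ended with an exception.
    client.stop_stream.assert_called_once()


def test_tokens_are_concatenated_on_success():
    client = MagicMock()
    assert consume(iter(["hello ", "world"]), client) == "hello world"
    client.stop_stream.assert_called_once()
```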
Parent: 6af912d7e0
Commit: 14ff1438e6
```diff
@@ -199,10 +199,13 @@ class TritonTensorRTLLM(BaseLLM):
 
         result_queue = self._invoke_triton(self.model_name, inputs, outputs, stop)
 
         result_str = ""
-        for token in result_queue:
-            result_str += token
-
-        self.client.stop_stream()
+        try:
+            for token in result_queue:
+                if isinstance(token, Exception):
+                    raise token
+                result_str += token
+        finally:
+            self.client.stop_stream()
 
         return result_str
```
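For context, the `isinstance(token, Exception)` check relies on the gRPC streaming callback enqueuing the error object itself onto the result queue. A minimal sketch of that producer side, with illustrative names and an assumed output tensor name (this is not the actual `_invoke_triton` implementation):

```python
import queue
from typing import Union

import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException


def streaming_callback(
    result_queue: "queue.Queue[Union[str, Exception]]",
    result: grpcclient.InferResult,
    error: InferenceServerException,
) -> None:
    """Triton invokes this for every streamed response; exactly one of
    ``result``/``error`` is set. Enqueue the error as-is so the consumer
    can detect it with ``isinstance(token, Exception)`` and re-raise."""
    if error is not None:
        result_queue.put(error)
    else:
        # "text_output" is an illustrative tensor name; it depends on the model.
        token = result.as_numpy("text_output")[0].decode("utf-8")
        result_queue.put(token)
```

Since the patched loop does `for token in result_queue`, the real module presumably wraps this queue in an iterable generator; the sketch only shows how an exception can end up on it.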