FEAT: Integrate Xinference LLMs and Embeddings (#8171)

- [Xorbits
Inference (Xinference)](https://github.com/xorbitsai/inference) is a
powerful and versatile library designed to serve language, speech
recognition, and multimodal models. Xinference supports a variety of
GGML-compatible models including chatglm, whisper, and vicuna, and
utilizes heterogeneous hardware and a distributed architecture for
seamless cross-device and cross-server model deployment.
- This PR integrates Xinference models and Xinference embeddings into
LangChain.
- Dependencies: To install the dependencies for this integration, run
    
    `pip install "xinference[all]"`
    
- Example Usage:

To start a local instance of Xinference, run `xinference`.

To deploy Xinference in a distributed cluster, first start an Xinference
supervisor using `xinference-supervisor`:

`xinference-supervisor -H "${supervisor_host}"`

Then, start the Xinference workers using `xinference-worker` on each
server you want to run them on.

`xinference-worker -e "http://${supervisor_host}:9997"`

To use Xinference with LangChain, you also need to launch a model. You
can use the command line interface (CLI) to do so. For example: `xinference
launch -n vicuna-v1.3 -f ggmlv3 -q q4_0` launches a model named
vicuna-v1.3 with `model_format="ggmlv3"` and `quantization="q4_0"`. A
model UID is returned for you to use.

Now you can use Xinference with LangChain:

```python
from langchain.llms import Xinference

llm = Xinference(
    server_url="http://0.0.0.0:9997", # suppose the supervisor_host is "0.0.0.0"
    model_uid = {model_uid} # model UID returned from launching a model
)

llm(
    prompt="Q: where can we visit in the capital of France? A:",
    generate_config={"max_tokens": 1024},
)
```
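
Streaming generation goes through the same `generate_config`; tokens are forwarded to LangChain callbacks as they arrive. A minimal sketch reusing the `llm` above; the `StreamingStdOutCallbackHandler` here is only an illustrative choice and not part of this PR:

```python
from langchain.callbacks import StreamingStdOutCallbackHandler

# With "stream": True the wrapper iterates over the token stream, forwards each
# token to the callback handler, and still returns the full completion string.
llm(
    prompt="Q: where can we visit in the capital of France? A:",
    callbacks=[StreamingStdOutCallbackHandler()],
    generate_config={"max_tokens": 1024, "stream": True},
)
```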

You can also use the RESTful client to launch a model:
```python
from xinference.client import RESTfulClient

client = RESTfulClient("http://0.0.0.0:9997")

model_uid = client.launch_model(model_name="vicuna-v1.3", model_size_in_billions=7, quantization="q4_0")
```
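
When you no longer need the model, it can be released again from the CLI (as the example notebooks do):

```bash
xinference terminate --model-uid "${model_uid}"
```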

The following code block demonstrates how to use Xinference embeddings
with LangChain:
```python
from langchain.embeddings import XinferenceEmbeddings

xinference = XinferenceEmbeddings(
    server_url="http://0.0.0.0:9997",
    model_uid=model_uid,
)
```

```python
query_result = xinference.embed_query("This is a test query")
```

```python
doc_result = xinference.embed_documents(["text A", "text B"])
```
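
The embeddings also plug into LangChain vector stores. A minimal sketch, assuming `faiss-cpu` is installed; FAISS is only an illustrative choice and not part of this integration:

```python
from langchain.vectorstores import FAISS

# Index a couple of texts with the Xinference embeddings defined above.
db = FAISS.from_texts(["text A", "text B"], embedding=xinference)

# Retrieve the closest document for a query.
docs = db.similarity_search("This is a test query", k=1)
print(docs[0].page_content)
```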

Xinference is still under rapid development. Feel free to [join our
Slack
community](https://xorbitsio.slack.com/join/shared_invite/zt-1z3zsm9ep-87yI9YZ_B79HLB2ccTq4WA)
to get the latest updates!

- Request for review: @hwchase17, @baskaryan
- Twitter handle: https://twitter.com/Xorbitsio

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Jiayi Ni, 2023-07-28 12:23:19 +08:00, committed by GitHub
parent 877d384bc9
commit 1efb9bae5f
13 changed files with 1556 additions and 541 deletions


@ -0,0 +1,176 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Xorbits Inference (Xinference)\n",
"\n",
"[Xinference](https://github.com/xorbitsai/inference) is a powerful and versatile library designed to serve LLMs, \n",
"speech recognition models, and multimodal models, even on your laptop. It supports a variety of models compatible with GGML, such as chatglm, baichuan, whisper, vicuna, orca, and many others. This notebook demonstrates how to use Xinference with LangChain."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installation\n",
"\n",
"Install `Xinference` through PyPI:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install \"xinference[all]\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy Xinference Locally or in a Distributed Cluster.\n",
"\n",
"For local deployment, run `xinference`. \n",
"\n",
"To deploy Xinference in a cluster, first start an Xinference supervisor using the `xinference-supervisor`. You can also use the option -p to specify the port and -H to specify the host. The default port is 9997.\n",
"\n",
"Then, start the Xinference workers using `xinference-worker` on each server you want to run them on. \n",
"\n",
"You can consult the README file from [Xinference](https://github.com/xorbitsai/inference) for more information.\n",
"## Wrapper\n",
"\n",
"To use Xinference with LangChain, you need to first launch a model. You can use command line interface (CLI) to do so:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model uid: 7167b2b0-2a04-11ee-83f0-d29396a3f064\n"
]
}
],
"source": [
"!xinference launch -n vicuna-v1.3 -f ggmlv3 -q q4_0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A model UID is returned for you to use. Now you can use Xinference with LangChain:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' You can visit the Eiffel Tower, Notre-Dame Cathedral, the Louvre Museum, and many other historical sites in Paris, the capital of France.'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.llms import Xinference\n",
"\n",
"llm = Xinference(\n",
" server_url=\"http://0.0.0.0:9997\",\n",
" model_uid = \"7167b2b0-2a04-11ee-83f0-d29396a3f064\"\n",
")\n",
"\n",
"llm(\n",
" prompt=\"Q: where can we visit in the capital of France? A:\",\n",
" generate_config={\"max_tokens\": 1024, \"stream\": True},\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Integrate with a LLMChain"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"A: You can visit many places in Paris, such as the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, the Champs-Elysées, Montmartre, Sacré-Cœur, and the Palace of Versailles.\n"
]
}
],
"source": [
"from langchain import PromptTemplate, LLMChain\n",
"\n",
"template = \"Where can we visit in the capital of {country}?\"\n",
"\n",
"prompt = PromptTemplate(template=template, input_variables=[\"country\"])\n",
"\n",
"llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
"\n",
"generated = llm_chain.run(country=\"France\")\n",
"print(generated)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, terminate the model when you do not need to use it:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"!xinference terminate --model-uid \"7167b2b0-2a04-11ee-83f0-d29396a3f064\""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "myenv3.9",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}


@ -0,0 +1,102 @@
# Xorbits Inference (Xinference)
This page demonstrates how to use [Xinference](https://github.com/xorbitsai/inference)
with LangChain.
`Xinference` is a powerful and versatile library designed to serve LLMs,
speech recognition models, and multimodal models, even on your laptop.
With Xorbits Inference, you can effortlessly deploy and serve your own or
state-of-the-art built-in models using just a single command.
## Installation and Setup
Xinference can be installed via pip from PyPI:
```bash
pip install "xinference[all]"
```
## LLM
Xinference supports various models compatible with GGML, including chatglm, baichuan, whisper,
vicuna, and orca. To view the builtin models, run the command:
```bash
xinference list --all
```
### Wrapper for Xinference
You can start a local instance of Xinference by running:
```bash
xinference
```
You can also deploy Xinference in a distributed cluster. To do so, first start an Xinference supervisor
on the server where you want to run it:
```bash
xinference-supervisor -H "${supervisor_host}"
```
Then, start the Xinference workers on each of the other servers you want to run them on:
```bash
xinference-worker -e "http://${supervisor_host}:9997"
```
Once Xinference is running, an endpoint will be accessible for model management via CLI or
Xinference client.
For local deployment, the endpoint will be http://localhost:9997.
For cluster deployment, the endpoint will be http://${supervisor_host}:9997.
Then, you need to launch a model. You can specify the model name and other attributes,
including `model_size_in_billions` and `quantization`, using the command line interface (CLI).
For example,
```bash
xinference launch -n orca -s 3 -q q4_0
```
A model UID will be returned.
Example usage:
```python
from langchain.llms import Xinference

llm = Xinference(
    server_url="http://0.0.0.0:9997",
    model_uid={model_uid}  # replace model_uid with the model UID returned from launching the model
)

llm(
    prompt="Q: where can we visit in the capital of France? A:",
    generate_config={"max_tokens": 1024, "stream": True},
)
```
### Usage
For more information and detailed examples, refer to the
[example notebook for xinference](../modules/models/llms/integrations/xinference.ipynb)
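Beyond direct calls, the wrapper can be dropped into a standard `LLMChain`; a brief sketch mirroring the example notebook, reusing the `llm` defined above:
```python
from langchain import PromptTemplate, LLMChain

template = "Where can we visit in the capital of {country}?"
prompt = PromptTemplate(template=template, input_variables=["country"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

# Runs the formatted prompt through the Xinference-served model.
generated = llm_chain.run(country="France")
print(generated)
```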
### Embeddings
Xinference also supports embedding queries and documents. See
[example notebook for xinference embeddings](../modules/data_connection/text_embedding/integrations/xinference.ipynb)
for a more detailed demo.
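The embeddings wrapper follows the same pattern; a minimal sketch, reusing the endpoint above and a model UID from `xinference launch`:
```python
from langchain.embeddings import XinferenceEmbeddings

xinference = XinferenceEmbeddings(
    server_url="http://0.0.0.0:9997",
    model_uid={model_uid}  # replace model_uid with the model UID returned from launching the model
)

query_result = xinference.embed_query("This is a test query")
doc_result = xinference.embed_documents(["text A", "text B"])
```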


@ -0,0 +1,144 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Xorbits inference (Xinference)\n",
"\n",
"This notebook goes over how to use Xinference embeddings within LangChain"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installation\n",
"\n",
"Install `Xinference` through PyPI:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install \"xinference[all]\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy Xinference Locally or in a Distributed Cluster.\n",
"\n",
"For local deployment, run `xinference`. \n",
"\n",
"To deploy Xinference in a cluster, first start an Xinference supervisor using the `xinference-supervisor`. You can also use the option -p to specify the port and -H to specify the host. The default port is 9997.\n",
"\n",
"Then, start the Xinference workers using `xinference-worker` on each server you want to run them on. \n",
"\n",
"You can consult the README file from [Xinference](https://github.com/xorbitsai/inference) for more information.\n",
"\n",
"## Wrapper\n",
"\n",
"To use Xinference with LangChain, you need to first launch a model. You can use command line interface (CLI) to do so:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model uid: 915845ee-2a04-11ee-8ed4-d29396a3f064\n"
]
}
],
"source": [
"!xinference launch -n vicuna-v1.3 -f ggmlv3 -q q4_0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A model UID is returned for you to use. Now you can use Xinference embeddings with LangChain:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings import XinferenceEmbeddings\n",
"\n",
"xinference = XinferenceEmbeddings(\n",
" server_url=\"http://0.0.0.0:9997\",\n",
" model_uid = \"915845ee-2a04-11ee-8ed4-d29396a3f064\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"query_result = xinference.embed_query(\"This is a test query\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"doc_result = xinference.embed_documents([\"text A\", \"text B\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, terminate the model when you do not need to use it:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"!xinference terminate --model-uid \"915845ee-2a04-11ee-8ed4-d29396a3f064\""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}


@ -42,6 +42,7 @@ from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddi
from langchain.embeddings.spacy_embeddings import SpacyEmbeddings
from langchain.embeddings.tensorflow_hub import TensorflowHubEmbeddings
from langchain.embeddings.vertexai import VertexAIEmbeddings
from langchain.embeddings.xinference import XinferenceEmbeddings
logger = logging.getLogger(__name__)
@ -78,6 +79,7 @@ __all__ = [
"SpacyEmbeddings",
"NLPCloudEmbeddings",
"GPT4AllEmbeddings",
"XinferenceEmbeddings",
"LocalAIEmbeddings",
"AwaEmbeddings",
]


@ -0,0 +1,113 @@
"""Wrapper around Xinference embedding models."""
from typing import Any, List, Optional
from langchain.embeddings.base import Embeddings
class XinferenceEmbeddings(Embeddings):
"""Wrapper around xinference embedding models.
To use, you should have the xinference library installed:
.. code-block:: bash
pip install xinference
Check out: https://github.com/xorbitsai/inference
To run, you need to start a Xinference supervisor on one server and Xinference workers on the other servers
Example:
To start a local instance of Xinference, run
.. code-block:: bash
$ xinference
You can also deploy Xinference in a distributed cluster. Here are the steps:
Starting the supervisor:
.. code-block:: bash
$ xinference-supervisor
Starting the worker:
.. code-block:: bash
$ xinference-worker
Then, launch a model using command line interface (CLI).
Example:
.. code-block:: bash
$ xinference launch -n orca -s 3 -q q4_0
It will return a model UID. Then you can use Xinference Embedding with LangChain.
Example:
.. code-block:: python
from langchain.embeddings import XinferenceEmbeddings
xinference = XinferenceEmbeddings(
server_url="http://0.0.0.0:9997",
model_uid = {model_uid} # replace model_uid with the model UID return from launching the model
)
""" # noqa: E501
client: Any
server_url: Optional[str]
"""URL of the xinference server"""
model_uid: Optional[str]
"""UID of the launched model"""
def __init__(
self, server_url: Optional[str] = None, model_uid: Optional[str] = None
):
try:
from xinference.client import RESTfulClient
except ImportError as e:
raise ImportError(
"Could not import RESTfulClient from xinference. Please install it"
" with `pip install xinference`."
) from e
super().__init__()
if server_url is None:
raise ValueError("Please provide server URL")
if model_uid is None:
raise ValueError("Please provide the model UID")
self.server_url = server_url
self.model_uid = model_uid
self.client = RESTfulClient(server_url)
def embed_documents(self, texts: List[str]) -> List[List[float]]:
"""Embed a list of documents using Xinference.
Args:
texts: The list of texts to embed.
Returns:
List of embeddings, one for each text.
"""
model = self.client.get_model(self.model_uid)
embeddings = [
model.create_embedding(text)["data"][0]["embedding"] for text in texts
]
return [list(map(float, e)) for e in embeddings]
def embed_query(self, text: str) -> List[float]:
"""Embed a query of documents using Xinference.
Args:
text: The text to embed.
Returns:
Embeddings for the text.
"""
model = self.client.get_model(self.model_uid)
embedding_res = model.create_embedding(text)
embedding = embedding_res["data"][0]["embedding"]
return list(map(float, embedding))


@ -56,6 +56,7 @@ from langchain.llms.textgen import TextGen
from langchain.llms.tongyi import Tongyi
from langchain.llms.vertexai import VertexAI
from langchain.llms.writer import Writer
from langchain.llms.xinference import Xinference
__all__ = [
"AI21",
@ -115,6 +116,7 @@ __all__ = [
"VertexAI",
"Writer",
"OctoAIEndpoint",
"Xinference",
]
type_to_cls_dict: Dict[str, Type[BaseLLM]] = {
@ -170,4 +172,5 @@ type_to_cls_dict: Dict[str, Type[BaseLLM]] = {
"openllm": OpenLLM,
"openllm_client": OpenLLM,
"writer": Writer,
"xinference": Xinference,
}


@ -0,0 +1,185 @@
from typing import TYPE_CHECKING, Any, Generator, List, Mapping, Optional, Union

from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import LLM

if TYPE_CHECKING:
    from xinference.client import RESTfulChatModelHandle, RESTfulGenerateModelHandle
    from xinference.model.llm.core import LlamaCppGenerateConfig


class Xinference(LLM):
    """Wrapper for accessing Xinference's large-scale model inference service.

    To use, you should have the xinference library installed:

    .. code-block:: bash

        pip install "xinference[all]"

    Check out: https://github.com/xorbitsai/inference
    To run, you need to start a Xinference supervisor on one server and Xinference
    workers on the other servers.

    Example:
        To start a local instance of Xinference, run

        .. code-block:: bash

            $ xinference

        You can also deploy Xinference in a distributed cluster. Here are the steps:

        Starting the supervisor:

        .. code-block:: bash

            $ xinference-supervisor

        Starting the worker:

        .. code-block:: bash

            $ xinference-worker

    Then, launch a model using the command line interface (CLI).

    Example:

    .. code-block:: bash

        $ xinference launch -n orca -s 3 -q q4_0

    It will return a model UID. Then, you can use Xinference with LangChain.

    Example:

    .. code-block:: python

        from langchain.llms import Xinference

        llm = Xinference(
            server_url="http://0.0.0.0:9997",
            model_uid={model_uid}  # replace model_uid with the model UID returned from launching the model
        )

        llm(
            prompt="Q: where can we visit in the capital of France? A:",
            generate_config={"max_tokens": 1024, "stream": True},
        )

    To view all the supported builtin models, run:

    .. code-block:: bash

        $ xinference list --all

    """  # noqa: E501

    client: Any
    server_url: Optional[str]
    """URL of the xinference server"""
    model_uid: Optional[str]
    """UID of the launched model"""

    def __init__(
        self, server_url: Optional[str] = None, model_uid: Optional[str] = None
    ):
        try:
            from xinference.client import RESTfulClient
        except ImportError as e:
            raise ImportError(
                "Could not import RESTfulClient from xinference. Please install it"
                " with `pip install xinference`."
            ) from e

        super().__init__(
            **{
                "server_url": server_url,
                "model_uid": model_uid,
            }
        )

        if self.server_url is None:
            raise ValueError("Please provide server URL")

        if self.model_uid is None:
            raise ValueError("Please provide the model UID")

        self.client = RESTfulClient(server_url)

    @property
    def _llm_type(self) -> str:
        """Return type of llm."""
        return "xinference"

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        """Get the identifying parameters."""
        return {
            **{"server_url": self.server_url},
            **{"model_uid": self.model_uid},
        }

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        """Call the xinference model and return the output.

        Args:
            prompt: The prompt to use for generation.
            stop: Optional list of stop words to use when generating.
            generate_config: Optional dictionary for the configuration used for
                generation.

        Returns:
            The generated string by the model.
        """
        model = self.client.get_model(self.model_uid)

        generate_config: "LlamaCppGenerateConfig" = kwargs.get("generate_config", {})

        if stop:
            generate_config["stop"] = stop

        if generate_config and generate_config.get("stream"):
            combined_text_output = ""
            for token in self._stream_generate(
                model=model,
                prompt=prompt,
                run_manager=run_manager,
                generate_config=generate_config,
            ):
                combined_text_output += token
            return combined_text_output
        else:
            completion = model.generate(prompt=prompt, generate_config=generate_config)
            return completion["choices"][0]["text"]

    def _stream_generate(
        self,
        model: Union["RESTfulGenerateModelHandle", "RESTfulChatModelHandle"],
        prompt: str,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        generate_config: Optional["LlamaCppGenerateConfig"] = None,
    ) -> Generator[str, None, None]:
        """
        Args:
            prompt: The prompt to use for generation.
            model: The model used for generation.
            stop: Optional list of stop words to use when generating.
            generate_config: Optional dictionary for the configuration used for
                generation.

        Yields:
            A string token.
        """
        streaming_response = model.generate(
            prompt=prompt, generate_config=generate_config
        )
        for chunk in streaming_response:
            if isinstance(chunk, dict):
                choices = chunk.get("choices", [])
                if choices:
                    choice = choices[0]
                    if isinstance(choice, dict):
                        token = choice.get("text", "")
                        log_probs = choice.get("logprobs")
                        if run_manager:
                            run_manager.on_llm_new_token(
                                token=token, verbose=self.verbose, log_probs=log_probs
                            )
                        yield token

File diff suppressed because it is too large.


@ -124,6 +124,7 @@ langsmith = "~0.0.11"
rank-bm25 = {version = "^0.2.2", optional = true} rank-bm25 = {version = "^0.2.2", optional = true}
amadeus = {version = ">=8.1.0", optional = true} amadeus = {version = ">=8.1.0", optional = true}
geopandas = {version = "^0.13.1", optional = true} geopandas = {version = "^0.13.1", optional = true}
xinference = {version = "^0.0.6", optional = true}
python-arango = {version = "^7.5.9", optional = true} python-arango = {version = "^7.5.9", optional = true}
[tool.poetry.group.test.dependencies] [tool.poetry.group.test.dependencies]
@ -219,7 +220,7 @@ playwright = "^1.28.0"
setuptools = "^67.6.1" setuptools = "^67.6.1"
[tool.poetry.extras] [tool.poetry.extras]
llms = ["anthropic", "clarifai", "cohere", "openai", "openllm", "openlm", "nlpcloud", "huggingface_hub", "manifest-ml", "torch", "transformers"] llms = ["anthropic", "clarifai", "cohere", "openai", "openllm", "openlm", "nlpcloud", "huggingface_hub", "manifest-ml", "torch", "transformers", "xinference"]
qdrant = ["qdrant-client"] qdrant = ["qdrant-client"]
openai = ["openai", "tiktoken"] openai = ["openai", "tiktoken"]
text_helpers = ["chardet"] text_helpers = ["chardet"]
@ -315,6 +316,7 @@ all = [
"octoai-sdk", "octoai-sdk",
"rdflib", "rdflib",
"amadeus", "amadeus",
"xinference",
"python-arango", "python-arango",
] ]
@ -356,6 +358,7 @@ extended_testing = [
"rank_bm25", "rank_bm25",
"geopandas", "geopandas",
"jinja2", "jinja2",
"xinference",
] ]
[tool.ruff] [tool.ruff]


@ -0,0 +1,74 @@
"""Test Xinference embeddings."""
import time
from typing import AsyncGenerator, Tuple
import pytest_asyncio
from langchain.embeddings import XinferenceEmbeddings
@pytest_asyncio.fixture
async def setup() -> AsyncGenerator[Tuple[str, str], None]:
import xoscar as xo
from xinference.deploy.supervisor import start_supervisor_components
from xinference.deploy.utils import create_worker_actor_pool
from xinference.deploy.worker import start_worker_components
pool = await create_worker_actor_pool(
f"test://127.0.0.1:{xo.utils.get_next_port()}"
)
print(f"Pool running on localhost:{pool.external_address}")
endpoint = await start_supervisor_components(
pool.external_address, "127.0.0.1", xo.utils.get_next_port()
)
await start_worker_components(
address=pool.external_address, supervisor_address=pool.external_address
)
# wait for the api.
time.sleep(3)
async with pool:
yield endpoint, pool.external_address
def test_xinference_embedding_documents(setup: Tuple[str, str]) -> None:
"""Test xinference embeddings for documents."""
from xinference.client import RESTfulClient
endpoint, _ = setup
client = RESTfulClient(endpoint)
model_uid = client.launch_model(
model_name="vicuna-v1.3",
model_size_in_billions=7,
model_format="ggmlv3",
quantization="q4_0",
)
xinference = XinferenceEmbeddings(server_url=endpoint, model_uid=model_uid)
documents = ["foo bar", "bar foo"]
output = xinference.embed_documents(documents)
assert len(output) == 2
assert len(output[0]) == 4096
def test_xinference_embedding_query(setup: Tuple[str, str]) -> None:
"""Test xinference embeddings for query."""
from xinference.client import RESTfulClient
endpoint, _ = setup
client = RESTfulClient(endpoint)
model_uid = client.launch_model(
model_name="vicuna-v1.3", model_size_in_billions=7, quantization="q4_0"
)
xinference = XinferenceEmbeddings(server_url=endpoint, model_uid=model_uid)
document = "foo bar"
output = xinference.embed_query(document)
assert len(output) == 4096


@ -0,0 +1,57 @@
"""Test Xinference wrapper."""
import time
from typing import AsyncGenerator, Tuple
import pytest_asyncio
from langchain.llms import Xinference
@pytest_asyncio.fixture
async def setup() -> AsyncGenerator[Tuple[str, str], None]:
import xoscar as xo
from xinference.deploy.supervisor import start_supervisor_components
from xinference.deploy.utils import create_worker_actor_pool
from xinference.deploy.worker import start_worker_components
pool = await create_worker_actor_pool(
f"test://127.0.0.1:{xo.utils.get_next_port()}"
)
print(f"Pool running on localhost:{pool.external_address}")
endpoint = await start_supervisor_components(
pool.external_address, "127.0.0.1", xo.utils.get_next_port()
)
await start_worker_components(
address=pool.external_address, supervisor_address=pool.external_address
)
# wait for the api.
time.sleep(3)
async with pool:
yield endpoint, pool.external_address
def test_xinference_llm_(setup: Tuple[str, str]) -> None:
from xinference.client import RESTfulClient
endpoint, _ = setup
client = RESTfulClient(endpoint)
model_uid = client.launch_model(
model_name="vicuna-v1.3", model_size_in_billions=7, quantization="q4_0"
)
llm = Xinference(server_url=endpoint, model_uid=model_uid)
answer = llm(prompt="Q: What food can we try in the capital of France? A:")
assert isinstance(answer, str)
answer = llm(
prompt="Q: where can we visit in the capital of France? A:",
generate_config={"max_tokens": 1024, "stream": True},
)
assert isinstance(answer, str)

poetry.lock (generated)

@ -1629,7 +1629,7 @@ files = [
[[package]] [[package]]
name = "langchain" name = "langchain"
version = "0.0.240" version = "0.0.244"
description = "Building applications with LLMs through composability" description = "Building applications with LLMs through composability"
optional = false optional = false
python-versions = ">=3.8.1,<4.0" python-versions = ">=3.8.1,<4.0"
@ -1651,15 +1651,15 @@ SQLAlchemy = ">=1.4,<3"
tenacity = "^8.1.0" tenacity = "^8.1.0"
[package.extras] [package.extras]
all = ["O365 (>=2.0.26,<3.0.0)", "aleph-alpha-client (>=2.15.0,<3.0.0)", "amadeus (>=8.1.0)", "anthropic (>=0.3,<0.4)", "arxiv (>=1.4,<2.0)", "atlassian-python-api (>=3.36.0,<4.0.0)", "awadb (>=0.3.3,<0.4.0)", "azure-ai-formrecognizer (>=3.2.1,<4.0.0)", "azure-ai-vision (>=0.11.1b1,<0.12.0)", "azure-cognitiveservices-speech (>=1.28.0,<2.0.0)", "azure-cosmos (>=4.4.0b1,<5.0.0)", "azure-identity (>=1.12.0,<2.0.0)", "beautifulsoup4 (>=4,<5)", "clarifai (>=9.1.0)", "clickhouse-connect (>=0.5.14,<0.6.0)", "cohere (>=3,<4)", "deeplake (>=3.6.8,<4.0.0)", "docarray[hnswlib] (>=0.32.0,<0.33.0)", "duckduckgo-search (>=3.8.3,<4.0.0)", "elasticsearch (>=8,<9)", "esprima (>=4.0.1,<5.0.0)", "faiss-cpu (>=1,<2)", "google-api-python-client (==2.70.0)", "google-auth (>=2.18.1,<3.0.0)", "google-search-results (>=2,<3)", "gptcache (>=0.1.7)", "html2text (>=2020.1.16,<2021.0.0)", "huggingface_hub (>=0,<1)", "jina (>=3.14,<4.0)", "jinja2 (>=3,<4)", "jq (>=1.4.1,<2.0.0)", "lancedb (>=0.1,<0.2)", "langkit (>=0.0.6,<0.1.0)", "lark (>=1.1.5,<2.0.0)", "libdeeplake (>=0.0.60,<0.0.61)", "lxml (>=4.9.2,<5.0.0)", "manifest-ml (>=0.0.1,<0.0.2)", "marqo (>=0.11.0,<0.12.0)", "momento (>=1.5.0,<2.0.0)", "nebula3-python (>=3.4.0,<4.0.0)", "neo4j (>=5.8.1,<6.0.0)", "networkx (>=2.6.3,<3.0.0)", "nlpcloud (>=1,<2)", "nltk (>=3,<4)", "nomic (>=1.0.43,<2.0.0)", "octoai-sdk (>=0.1.1,<0.2.0)", "openai (>=0,<1)", "openlm (>=0.0.5,<0.0.6)", "opensearch-py (>=2.0.0,<3.0.0)", "pdfminer-six (>=20221105,<20221106)", "pexpect (>=4.8.0,<5.0.0)", "pgvector (>=0.1.6,<0.2.0)", "pinecone-client (>=2,<3)", "pinecone-text (>=0.4.2,<0.5.0)", "psycopg2-binary (>=2.9.5,<3.0.0)", "pymongo (>=4.3.3,<5.0.0)", "pyowm (>=3.3.0,<4.0.0)", "pypdf (>=3.4.0,<4.0.0)", "pytesseract (>=0.3.10,<0.4.0)", "pyvespa (>=0.33.0,<0.34.0)", "qdrant-client (>=1.3.1,<2.0.0)", "rdflib (>=6.3.2,<7.0.0)", "redis (>=4,<5)", "requests-toolbelt (>=1.0.0,<2.0.0)", "sentence-transformers (>=2,<3)", "singlestoredb (>=0.7.1,<0.8.0)", "spacy (>=3,<4)", "steamship (>=2.16.9,<3.0.0)", "tensorflow-text (>=2.11.0,<3.0.0)", "tigrisdb (>=1.0.0b6,<2.0.0)", "tiktoken (>=0.3.2,<0.4.0)", "torch (>=1,<3)", "transformers (>=4,<5)", "weaviate-client (>=3,<4)", "wikipedia (>=1,<2)", "wolframalpha (==5.0.0)"] all = ["O365 (>=2.0.26,<3.0.0)", "aleph-alpha-client (>=2.15.0,<3.0.0)", "amadeus (>=8.1.0)", "anthropic (>=0.3,<0.4)", "arxiv (>=1.4,<2.0)", "atlassian-python-api (>=3.36.0,<4.0.0)", "awadb (>=0.3.3,<0.4.0)", "azure-ai-formrecognizer (>=3.2.1,<4.0.0)", "azure-ai-vision (>=0.11.1b1,<0.12.0)", "azure-cognitiveservices-speech (>=1.28.0,<2.0.0)", "azure-cosmos (>=4.4.0b1,<5.0.0)", "azure-identity (>=1.12.0,<2.0.0)", "beautifulsoup4 (>=4,<5)", "clarifai (>=9.1.0)", "clickhouse-connect (>=0.5.14,<0.6.0)", "cohere (>=4,<5)", "deeplake (>=3.6.8,<4.0.0)", "docarray[hnswlib] (>=0.32.0,<0.33.0)", "duckduckgo-search (>=3.8.3,<4.0.0)", "elasticsearch (>=8,<9)", "esprima (>=4.0.1,<5.0.0)", "faiss-cpu (>=1,<2)", "google-api-python-client (==2.70.0)", "google-auth (>=2.18.1,<3.0.0)", "google-search-results (>=2,<3)", "gptcache (>=0.1.7)", "html2text (>=2020.1.16,<2021.0.0)", "huggingface_hub (>=0,<1)", "jina (>=3.14,<4.0)", "jinja2 (>=3,<4)", "jq (>=1.4.1,<2.0.0)", "lancedb (>=0.1,<0.2)", "langkit (>=0.0.6,<0.1.0)", "lark (>=1.1.5,<2.0.0)", "libdeeplake (>=0.0.60,<0.0.61)", "lxml (>=4.9.2,<5.0.0)", "manifest-ml (>=0.0.1,<0.0.2)", "marqo (>=0.11.0,<0.12.0)", "momento (>=1.5.0,<2.0.0)", "nebula3-python (>=3.4.0,<4.0.0)", "neo4j (>=5.8.1,<6.0.0)", "networkx (>=2.6.3,<3.0.0)", "nlpcloud (>=1,<2)", "nltk 
(>=3,<4)", "nomic (>=1.0.43,<2.0.0)", "octoai-sdk (>=0.1.1,<0.2.0)", "openai (>=0,<1)", "openlm (>=0.0.5,<0.0.6)", "opensearch-py (>=2.0.0,<3.0.0)", "pdfminer-six (>=20221105,<20221106)", "pexpect (>=4.8.0,<5.0.0)", "pgvector (>=0.1.6,<0.2.0)", "pinecone-client (>=2,<3)", "pinecone-text (>=0.4.2,<0.5.0)", "psycopg2-binary (>=2.9.5,<3.0.0)", "pymongo (>=4.3.3,<5.0.0)", "pyowm (>=3.3.0,<4.0.0)", "pypdf (>=3.4.0,<4.0.0)", "pytesseract (>=0.3.10,<0.4.0)", "python-arango (>=7.5.9,<8.0.0)", "pyvespa (>=0.33.0,<0.34.0)", "qdrant-client (>=1.3.1,<2.0.0)", "rdflib (>=6.3.2,<7.0.0)", "redis (>=4,<5)", "requests-toolbelt (>=1.0.0,<2.0.0)", "sentence-transformers (>=2,<3)", "singlestoredb (>=0.7.1,<0.8.0)", "spacy (>=3,<4)", "steamship (>=2.16.9,<3.0.0)", "tensorflow-text (>=2.11.0,<3.0.0)", "tigrisdb (>=1.0.0b6,<2.0.0)", "tiktoken (>=0.3.2,<0.4.0)", "torch (>=1,<3)", "transformers (>=4,<5)", "weaviate-client (>=3,<4)", "wikipedia (>=1,<2)", "wolframalpha (==5.0.0)", "xinference (>=0.0.6,<0.0.7)"]
azure = ["azure-ai-formrecognizer (>=3.2.1,<4.0.0)", "azure-ai-vision (>=0.11.1b1,<0.12.0)", "azure-cognitiveservices-speech (>=1.28.0,<2.0.0)", "azure-core (>=1.26.4,<2.0.0)", "azure-cosmos (>=4.4.0b1,<5.0.0)", "azure-identity (>=1.12.0,<2.0.0)", "azure-search-documents (==11.4.0a20230509004)", "openai (>=0,<1)"] azure = ["azure-ai-formrecognizer (>=3.2.1,<4.0.0)", "azure-ai-vision (>=0.11.1b1,<0.12.0)", "azure-cognitiveservices-speech (>=1.28.0,<2.0.0)", "azure-core (>=1.26.4,<2.0.0)", "azure-cosmos (>=4.4.0b1,<5.0.0)", "azure-identity (>=1.12.0,<2.0.0)", "azure-search-documents (==11.4.0b6)", "openai (>=0,<1)"]
clarifai = ["clarifai (>=9.1.0)"] clarifai = ["clarifai (>=9.1.0)"]
cohere = ["cohere (>=3,<4)"] cohere = ["cohere (>=4,<5)"]
docarray = ["docarray[hnswlib] (>=0.32.0,<0.33.0)"] docarray = ["docarray[hnswlib] (>=0.32.0,<0.33.0)"]
embeddings = ["sentence-transformers (>=2,<3)"] embeddings = ["sentence-transformers (>=2,<3)"]
extended-testing = ["atlassian-python-api (>=3.36.0,<4.0.0)", "beautifulsoup4 (>=4,<5)", "bibtexparser (>=1.4.0,<2.0.0)", "cassio (>=0.0.7,<0.0.8)", "chardet (>=5.1.0,<6.0.0)", "esprima (>=4.0.1,<5.0.0)", "geopandas (>=0.13.1,<0.14.0)", "gql (>=3.4.1,<4.0.0)", "html2text (>=2020.1.16,<2021.0.0)", "jinja2 (>=3,<4)", "jq (>=1.4.1,<2.0.0)", "lxml (>=4.9.2,<5.0.0)", "mwparserfromhell (>=0.6.4,<0.7.0)", "mwxml (>=0.3.3,<0.4.0)", "openai (>=0,<1)", "openai (>=0,<1)", "pandas (>=2.0.1,<3.0.0)", "pdfminer-six (>=20221105,<20221106)", "pgvector (>=0.1.6,<0.2.0)", "psychicapi (>=0.8.0,<0.9.0)", "py-trello (>=0.19.0,<0.20.0)", "pymupdf (>=1.22.3,<2.0.0)", "pypdf (>=3.4.0,<4.0.0)", "pypdfium2 (>=4.10.0,<5.0.0)", "pyspark (>=3.4.0,<4.0.0)", "rank-bm25 (>=0.2.2,<0.3.0)", "rapidfuzz (>=3.1.1,<4.0.0)", "requests-toolbelt (>=1.0.0,<2.0.0)", "scikit-learn (>=1.2.2,<2.0.0)", "streamlit (>=1.18.0,<2.0.0)", "sympy (>=1.12,<2.0)", "telethon (>=1.28.5,<2.0.0)", "tqdm (>=4.48.0)", "zep-python (>=0.32)"] extended-testing = ["atlassian-python-api (>=3.36.0,<4.0.0)", "beautifulsoup4 (>=4,<5)", "bibtexparser (>=1.4.0,<2.0.0)", "cassio (>=0.0.7,<0.0.8)", "chardet (>=5.1.0,<6.0.0)", "esprima (>=4.0.1,<5.0.0)", "geopandas (>=0.13.1,<0.14.0)", "gql (>=3.4.1,<4.0.0)", "html2text (>=2020.1.16,<2021.0.0)", "jinja2 (>=3,<4)", "jq (>=1.4.1,<2.0.0)", "lxml (>=4.9.2,<5.0.0)", "mwparserfromhell (>=0.6.4,<0.7.0)", "mwxml (>=0.3.3,<0.4.0)", "openai (>=0,<1)", "openai (>=0,<1)", "pandas (>=2.0.1,<3.0.0)", "pdfminer-six (>=20221105,<20221106)", "pgvector (>=0.1.6,<0.2.0)", "psychicapi (>=0.8.0,<0.9.0)", "py-trello (>=0.19.0,<0.20.0)", "pymupdf (>=1.22.3,<2.0.0)", "pypdf (>=3.4.0,<4.0.0)", "pypdfium2 (>=4.10.0,<5.0.0)", "pyspark (>=3.4.0,<4.0.0)", "rank-bm25 (>=0.2.2,<0.3.0)", "rapidfuzz (>=3.1.1,<4.0.0)", "requests-toolbelt (>=1.0.0,<2.0.0)", "scikit-learn (>=1.2.2,<2.0.0)", "streamlit (>=1.18.0,<2.0.0)", "sympy (>=1.12,<2.0)", "telethon (>=1.28.5,<2.0.0)", "tqdm (>=4.48.0)", "xinference (>=0.0.6,<0.0.7)", "zep-python (>=0.32)"]
javascript = ["esprima (>=4.0.1,<5.0.0)"] javascript = ["esprima (>=4.0.1,<5.0.0)"]
llms = ["anthropic (>=0.3,<0.4)", "clarifai (>=9.1.0)", "cohere (>=3,<4)", "huggingface_hub (>=0,<1)", "manifest-ml (>=0.0.1,<0.0.2)", "nlpcloud (>=1,<2)", "openai (>=0,<1)", "openllm (>=0.1.19)", "openlm (>=0.0.5,<0.0.6)", "torch (>=1,<3)", "transformers (>=4,<5)"] llms = ["anthropic (>=0.3,<0.4)", "clarifai (>=9.1.0)", "cohere (>=4,<5)", "huggingface_hub (>=0,<1)", "manifest-ml (>=0.0.1,<0.0.2)", "nlpcloud (>=1,<2)", "openai (>=0,<1)", "openllm (>=0.1.19)", "openlm (>=0.0.5,<0.0.6)", "torch (>=1,<3)", "transformers (>=4,<5)", "xinference (>=0.0.6,<0.0.7)"]
openai = ["openai (>=0,<1)", "tiktoken (>=0.3.2,<0.4.0)"] openai = ["openai (>=0,<1)", "tiktoken (>=0.3.2,<0.4.0)"]
qdrant = ["qdrant-client (>=1.3.1,<2.0.0)"] qdrant = ["qdrant-client (>=1.3.1,<2.0.0)"]
text-helpers = ["chardet (>=5.1.0,<6.0.0)"] text-helpers = ["chardet (>=5.1.0,<6.0.0)"]