FEAT: Integrate Xinference LLMs and Embeddings (#8171)

- [Xorbits
Inference (Xinference)](https://github.com/xorbitsai/inference) is a
powerful and versatile library designed to serve language, speech
recognition, and multimodal models. Xinference supports a variety of
GGML-compatible models including chatglm, whisper, and vicuna, and
utilizes heterogeneous hardware and a distributed architecture for
seamless cross-device and cross-server model deployment.
- This PR integrates Xinference models and Xinference embeddings into
LangChain.
- Dependencies: To install the dependencies for this integration, run
    
    `pip install "xinference[all]"`
    
- Example Usage:

To start a local instance of Xinference, run `xinference`.

To deploy Xinference in a distributed cluster, first start an Xinference
supervisor using `xinference-supervisor`:

`xinference-supervisor -H "${supervisor_host}"`

Then, start the Xinference workers using `xinference-worker` on each
server you want to run them on.

`xinference-worker -e "http://${supervisor_host}:9997"`

To use Xinference with LangChain, you also need to launch a model. You
can use the command line interface (CLI) to do so. For example: `xinference
launch -n vicuna-v1.3 -f ggmlv3 -q q4_0` launches a model named
vicuna-v1.3 with `model_format="ggmlv3"` and `quantization="q4_0"`. A
model UID is returned for you to use.

Now you can use Xinference with LangChain:

```python
from langchain.llms import Xinference

llm = Xinference(
    server_url="http://0.0.0.0:9997", # suppose the supervisor_host is "0.0.0.0"
    model_uid = {model_uid} # model UID returned from launching a model
)

llm(
    prompt="Q: where can we visit in the capital of France? A:",
    generate_config={"max_tokens": 1024},
)
```
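
Streaming generation goes through the same `generate_config`; tokens are forwarded to LangChain callbacks as they arrive. A minimal sketch reusing the `llm` above; the `StreamingStdOutCallbackHandler` here is only an illustrative choice and not part of this PR:

```python
from langchain.callbacks import StreamingStdOutCallbackHandler

# With "stream": True the wrapper iterates over the token stream, forwards each
# token to the callback handler, and still returns the full completion string.
llm(
    prompt="Q: where can we visit in the capital of France? A:",
    callbacks=[StreamingStdOutCallbackHandler()],
    generate_config={"max_tokens": 1024, "stream": True},
)
```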

You can also use the RESTful client to launch a model:
```python
from xinference.client import RESTfulClient

client = RESTfulClient("http://0.0.0.0:9997")

model_uid = client.launch_model(model_name="vicuna-v1.3", model_size_in_billions=7, quantization="q4_0")
```
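
When you no longer need the model, it can be released again from the CLI (as the example notebooks do):

```bash
xinference terminate --model-uid "${model_uid}"
```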

The following code block demonstrates how to use Xinference embeddings
with LangChain:
```python
from langchain.embeddings import XinferenceEmbeddings

xinference = XinferenceEmbeddings(
    server_url="http://0.0.0.0:9997",
    model_uid=model_uid,
)
```

```python
query_result = xinference.embed_query("This is a test query")
```

```python
doc_result = xinference.embed_documents(["text A", "text B"])
```
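
The embeddings also plug into LangChain vector stores. A minimal sketch, assuming `faiss-cpu` is installed; FAISS is only an illustrative choice and not part of this integration:

```python
from langchain.vectorstores import FAISS

# Index a couple of texts with the Xinference embeddings defined above.
db = FAISS.from_texts(["text A", "text B"], embedding=xinference)

# Retrieve the closest document for a query.
docs = db.similarity_search("This is a test query", k=1)
print(docs[0].page_content)
```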

Xinference is still under rapid development. Feel free to [join our
Slack
community](https://xorbitsio.slack.com/join/shared_invite/zt-1z3zsm9ep-87yI9YZ_B79HLB2ccTq4WA)
to get the latest updates!

- Request for review: @hwchase17, @baskaryan
- Twitter handle: https://twitter.com/Xorbitsio

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Jiayi Ni, 2023-07-28 12:23:19 +08:00, committed by GitHub
parent 877d384bc9
commit 1efb9bae5f
13 changed files with 1556 additions and 541 deletions


@ -0,0 +1,176 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Xorbits Inference (Xinference)\n",
"\n",
"[Xinference](https://github.com/xorbitsai/inference) is a powerful and versatile library designed to serve LLMs, \n",
"speech recognition models, and multimodal models, even on your laptop. It supports a variety of models compatible with GGML, such as chatglm, baichuan, whisper, vicuna, orca, and many others. This notebook demonstrates how to use Xinference with LangChain."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installation\n",
"\n",
"Install `Xinference` through PyPI:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install \"xinference[all]\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy Xinference Locally or in a Distributed Cluster.\n",
"\n",
"For local deployment, run `xinference`. \n",
"\n",
"To deploy Xinference in a cluster, first start an Xinference supervisor using the `xinference-supervisor`. You can also use the option -p to specify the port and -H to specify the host. The default port is 9997.\n",
"\n",
"Then, start the Xinference workers using `xinference-worker` on each server you want to run them on. \n",
"\n",
"You can consult the README file from [Xinference](https://github.com/xorbitsai/inference) for more information.\n",
"## Wrapper\n",
"\n",
"To use Xinference with LangChain, you need to first launch a model. You can use command line interface (CLI) to do so:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model uid: 7167b2b0-2a04-11ee-83f0-d29396a3f064\n"
]
}
],
"source": [
"!xinference launch -n vicuna-v1.3 -f ggmlv3 -q q4_0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A model UID is returned for you to use. Now you can use Xinference with LangChain:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' You can visit the Eiffel Tower, Notre-Dame Cathedral, the Louvre Museum, and many other historical sites in Paris, the capital of France.'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.llms import Xinference\n",
"\n",
"llm = Xinference(\n",
" server_url=\"http://0.0.0.0:9997\",\n",
" model_uid = \"7167b2b0-2a04-11ee-83f0-d29396a3f064\"\n",
")\n",
"\n",
"llm(\n",
" prompt=\"Q: where can we visit in the capital of France? A:\",\n",
" generate_config={\"max_tokens\": 1024, \"stream\": True},\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Integrate with a LLMChain"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"A: You can visit many places in Paris, such as the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, the Champs-Elysées, Montmartre, Sacré-Cœur, and the Palace of Versailles.\n"
]
}
],
"source": [
"from langchain import PromptTemplate, LLMChain\n",
"\n",
"template = \"Where can we visit in the capital of {country}?\"\n",
"\n",
"prompt = PromptTemplate(template=template, input_variables=[\"country\"])\n",
"\n",
"llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
"\n",
"generated = llm_chain.run(country=\"France\")\n",
"print(generated)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, terminate the model when you do not need to use it:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"!xinference terminate --model-uid \"7167b2b0-2a04-11ee-83f0-d29396a3f064\""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "myenv3.9",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}


@ -0,0 +1,102 @@
# Xorbits Inference (Xinference)
This page demonstrates how to use [Xinference](https://github.com/xorbitsai/inference)
with LangChain.
`Xinference` is a powerful and versatile library designed to serve LLMs,
speech recognition models, and multimodal models, even on your laptop.
With Xorbits Inference, you can effortlessly deploy and serve your own or
state-of-the-art built-in models using just a single command.
## Installation and Setup
Xinference can be installed via pip from PyPI:
```bash
pip install "xinference[all]"
```
## LLM
Xinference supports various models compatible with GGML, including chatglm, baichuan, whisper,
vicuna, and orca. To view the builtin models, run the command:
```bash
xinference list --all
```
### Wrapper for Xinference
You can start a local instance of Xinference by running:
```bash
xinference
```
You can also deploy Xinference in a distributed cluster. To do so, first start an Xinference supervisor
on the server where you want to run it:
```bash
xinference-supervisor -H "${supervisor_host}"
```
Then, start the Xinference workers on each of the other servers you want to run them on:
```bash
xinference-worker -e "http://${supervisor_host}:9997"
```
Once Xinference is running, an endpoint will be accessible for model management via CLI or
Xinference client.
For local deployment, the endpoint will be http://localhost:9997.
For cluster deployment, the endpoint will be http://${supervisor_host}:9997.
Then, you need to launch a model. You can specify the model name and other attributes,
including `model_size_in_billions` and `quantization`, using the command line interface (CLI).
For example,
```bash
xinference launch -n orca -s 3 -q q4_0
```
A model UID will be returned.
Example usage:
```python
from langchain.llms import Xinference

llm = Xinference(
    server_url="http://0.0.0.0:9997",
    model_uid={model_uid}  # replace model_uid with the model UID returned from launching the model
)

llm(
    prompt="Q: where can we visit in the capital of France? A:",
    generate_config={"max_tokens": 1024, "stream": True},
)
```
### Usage
For more information and detailed examples, refer to the
[example notebook for xinference](../modules/models/llms/integrations/xinference.ipynb)
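Beyond direct calls, the wrapper can be dropped into a standard `LLMChain`; a brief sketch mirroring the example notebook, reusing the `llm` defined above:
```python
from langchain import PromptTemplate, LLMChain

template = "Where can we visit in the capital of {country}?"
prompt = PromptTemplate(template=template, input_variables=["country"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

# Runs the formatted prompt through the Xinference-served model.
generated = llm_chain.run(country="France")
print(generated)
```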
### Embeddings
Xinference also supports embedding queries and documents. See
[example notebook for xinference embeddings](../modules/data_connection/text_embedding/integrations/xinference.ipynb)
for a more detailed demo.
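The embeddings wrapper follows the same pattern; a minimal sketch, reusing the endpoint above and a model UID from `xinference launch`:
```python
from langchain.embeddings import XinferenceEmbeddings

xinference = XinferenceEmbeddings(
    server_url="http://0.0.0.0:9997",
    model_uid={model_uid}  # replace model_uid with the model UID returned from launching the model
)

query_result = xinference.embed_query("This is a test query")
doc_result = xinference.embed_documents(["text A", "text B"])
```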


@ -0,0 +1,144 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Xorbits inference (Xinference)\n",
"\n",
"This notebook goes over how to use Xinference embeddings within LangChain"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installation\n",
"\n",
"Install `Xinference` through PyPI:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install \"xinference[all]\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy Xinference Locally or in a Distributed Cluster.\n",
"\n",
"For local deployment, run `xinference`. \n",
"\n",
"To deploy Xinference in a cluster, first start an Xinference supervisor using the `xinference-supervisor`. You can also use the option -p to specify the port and -H to specify the host. The default port is 9997.\n",
"\n",
"Then, start the Xinference workers using `xinference-worker` on each server you want to run them on. \n",
"\n",
"You can consult the README file from [Xinference](https://github.com/xorbitsai/inference) for more information.\n",
"\n",
"## Wrapper\n",
"\n",
"To use Xinference with LangChain, you need to first launch a model. You can use command line interface (CLI) to do so:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model uid: 915845ee-2a04-11ee-8ed4-d29396a3f064\n"
]
}
],
"source": [
"!xinference launch -n vicuna-v1.3 -f ggmlv3 -q q4_0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A model UID is returned for you to use. Now you can use Xinference embeddings with LangChain:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings import XinferenceEmbeddings\n",
"\n",
"xinference = XinferenceEmbeddings(\n",
" server_url=\"http://0.0.0.0:9997\",\n",
" model_uid = \"915845ee-2a04-11ee-8ed4-d29396a3f064\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"query_result = xinference.embed_query(\"This is a test query\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"doc_result = xinference.embed_documents([\"text A\", \"text B\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, terminate the model when you do not need to use it:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"!xinference terminate --model-uid \"915845ee-2a04-11ee-8ed4-d29396a3f064\""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}


@ -42,6 +42,7 @@ from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddi
from langchain.embeddings.spacy_embeddings import SpacyEmbeddings
from langchain.embeddings.tensorflow_hub import TensorflowHubEmbeddings
from langchain.embeddings.vertexai import VertexAIEmbeddings
from langchain.embeddings.xinference import XinferenceEmbeddings
logger = logging.getLogger(__name__)
@ -78,6 +79,7 @@ __all__ = [
"SpacyEmbeddings",
"NLPCloudEmbeddings",
"GPT4AllEmbeddings",
"XinferenceEmbeddings",
"LocalAIEmbeddings",
"AwaEmbeddings",
]


@ -0,0 +1,113 @@
"""Wrapper around Xinference embedding models."""
from typing import Any, List, Optional
from langchain.embeddings.base import Embeddings
class XinferenceEmbeddings(Embeddings):
"""Wrapper around xinference embedding models.
To use, you should have the xinference library installed:
.. code-block:: bash
pip install xinference
Check out: https://github.com/xorbitsai/inference
To run, you need to start a Xinference supervisor on one server and Xinference workers on the other servers
Example:
To start a local instance of Xinference, run
.. code-block:: bash
$ xinference
You can also deploy Xinference in a distributed cluster. Here are the steps:
Starting the supervisor:
.. code-block:: bash
$ xinference-supervisor
Starting the worker:
.. code-block:: bash
$ xinference-worker
Then, launch a model using command line interface (CLI).
Example:
.. code-block:: bash
$ xinference launch -n orca -s 3 -q q4_0
It will return a model UID. Then you can use Xinference Embedding with LangChain.
Example:
.. code-block:: python
from langchain.embeddings import XinferenceEmbeddings
xinference = XinferenceEmbeddings(
server_url="http://0.0.0.0:9997",
model_uid = {model_uid} # replace model_uid with the model UID return from launching the model
)
""" # noqa: E501
client: Any
server_url: Optional[str]
"""URL of the xinference server"""
model_uid: Optional[str]
"""UID of the launched model"""
def __init__(
self, server_url: Optional[str] = None, model_uid: Optional[str] = None
):
try:
from xinference.client import RESTfulClient
except ImportError as e:
raise ImportError(
"Could not import RESTfulClient from xinference. Please install it"
" with `pip install xinference`."
) from e
super().__init__()
if server_url is None:
raise ValueError("Please provide server URL")
if model_uid is None:
raise ValueError("Please provide the model UID")
self.server_url = server_url
self.model_uid = model_uid
self.client = RESTfulClient(server_url)
def embed_documents(self, texts: List[str]) -> List[List[float]]:
"""Embed a list of documents using Xinference.
Args:
texts: The list of texts to embed.
Returns:
List of embeddings, one for each text.
"""
model = self.client.get_model(self.model_uid)
embeddings = [
model.create_embedding(text)["data"][0]["embedding"] for text in texts
]
return [list(map(float, e)) for e in embeddings]
def embed_query(self, text: str) -> List[float]:
"""Embed a query of documents using Xinference.
Args:
text: The text to embed.
Returns:
Embeddings for the text.
"""
model = self.client.get_model(self.model_uid)
embedding_res = model.create_embedding(text)
embedding = embedding_res["data"][0]["embedding"]
return list(map(float, embedding))


@ -56,6 +56,7 @@ from langchain.llms.textgen import TextGen
from langchain.llms.tongyi import Tongyi
from langchain.llms.vertexai import VertexAI
from langchain.llms.writer import Writer
from langchain.llms.xinference import Xinference
__all__ = [
"AI21",
@ -115,6 +116,7 @@ __all__ = [
"VertexAI",
"Writer",
"OctoAIEndpoint",
"Xinference",
]
type_to_cls_dict: Dict[str, Type[BaseLLM]] = {
@ -170,4 +172,5 @@ type_to_cls_dict: Dict[str, Type[BaseLLM]] = {
"openllm": OpenLLM,
"openllm_client": OpenLLM,
"writer": Writer,
"xinference": Xinference,
}


@ -0,0 +1,185 @@
from typing import TYPE_CHECKING, Any, Generator, List, Mapping, Optional, Union

from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import LLM

if TYPE_CHECKING:
    from xinference.client import RESTfulChatModelHandle, RESTfulGenerateModelHandle
    from xinference.model.llm.core import LlamaCppGenerateConfig


class Xinference(LLM):
    """Wrapper for accessing Xinference's large-scale model inference service.

    To use, you should have the xinference library installed:

    .. code-block:: bash

        pip install "xinference[all]"

    Check out: https://github.com/xorbitsai/inference
    To run, you need to start a Xinference supervisor on one server and Xinference
    workers on the other servers.

    Example:
        To start a local instance of Xinference, run

        .. code-block:: bash

            $ xinference

        You can also deploy Xinference in a distributed cluster. Here are the steps:

        Starting the supervisor:

        .. code-block:: bash

            $ xinference-supervisor

        Starting the worker:

        .. code-block:: bash

            $ xinference-worker

    Then, launch a model using the command line interface (CLI).

    Example:

    .. code-block:: bash

        $ xinference launch -n orca -s 3 -q q4_0

    It will return a model UID. Then, you can use Xinference with LangChain.

    Example:

    .. code-block:: python

        from langchain.llms import Xinference

        llm = Xinference(
            server_url="http://0.0.0.0:9997",
            model_uid={model_uid}  # replace model_uid with the model UID returned from launching the model
        )

        llm(
            prompt="Q: where can we visit in the capital of France? A:",
            generate_config={"max_tokens": 1024, "stream": True},
        )

    To view all the supported builtin models, run:

    .. code-block:: bash

        $ xinference list --all

    """  # noqa: E501

    client: Any
    server_url: Optional[str]
    """URL of the xinference server"""
    model_uid: Optional[str]
    """UID of the launched model"""

    def __init__(
        self, server_url: Optional[str] = None, model_uid: Optional[str] = None
    ):
        try:
            from xinference.client import RESTfulClient
        except ImportError as e:
            raise ImportError(
                "Could not import RESTfulClient from xinference. Please install it"
                " with `pip install xinference`."
            ) from e

        super().__init__(
            **{
                "server_url": server_url,
                "model_uid": model_uid,
            }
        )

        if self.server_url is None:
            raise ValueError("Please provide server URL")

        if self.model_uid is None:
            raise ValueError("Please provide the model UID")

        self.client = RESTfulClient(server_url)

    @property
    def _llm_type(self) -> str:
        """Return type of llm."""
        return "xinference"

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        """Get the identifying parameters."""
        return {
            **{"server_url": self.server_url},
            **{"model_uid": self.model_uid},
        }

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        """Call the xinference model and return the output.

        Args:
            prompt: The prompt to use for generation.
            stop: Optional list of stop words to use when generating.
            generate_config: Optional dictionary for the configuration used for
                generation.

        Returns:
            The generated string by the model.
        """
        model = self.client.get_model(self.model_uid)

        generate_config: "LlamaCppGenerateConfig" = kwargs.get("generate_config", {})

        if stop:
            generate_config["stop"] = stop

        if generate_config and generate_config.get("stream"):
            combined_text_output = ""
            for token in self._stream_generate(
                model=model,
                prompt=prompt,
                run_manager=run_manager,
                generate_config=generate_config,
            ):
                combined_text_output += token
            return combined_text_output
        else:
            completion = model.generate(prompt=prompt, generate_config=generate_config)
            return completion["choices"][0]["text"]

    def _stream_generate(
        self,
        model: Union["RESTfulGenerateModelHandle", "RESTfulChatModelHandle"],
        prompt: str,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        generate_config: Optional["LlamaCppGenerateConfig"] = None,
    ) -> Generator[str, None, None]:
        """
        Args:
            prompt: The prompt to use for generation.
            model: The model used for generation.
            stop: Optional list of stop words to use when generating.
            generate_config: Optional dictionary for the configuration used for
                generation.

        Yields:
            A string token.
        """
        streaming_response = model.generate(
            prompt=prompt, generate_config=generate_config
        )
        for chunk in streaming_response:
            if isinstance(chunk, dict):
                choices = chunk.get("choices", [])
                if choices:
                    choice = choices[0]
                    if isinstance(choice, dict):
                        token = choice.get("text", "")
                        log_probs = choice.get("logprobs")
                        if run_manager:
                            run_manager.on_llm_new_token(
                                token=token, verbose=self.verbose, log_probs=log_probs
                            )
                        yield token

File diff suppressed because it is too large.


@ -124,6 +124,7 @@ langsmith = "~0.0.11"
rank-bm25 = {version = "^0.2.2", optional = true} rank-bm25 = {version = "^0.2.2", optional = true}
amadeus = {version = ">=8.1.0", optional = true} amadeus = {version = ">=8.1.0", optional = true}
geopandas = {version = "^0.13.1", optional = true} geopandas = {version = "^0.13.1", optional = true}
xinference = {version = "^0.0.6", optional = true}
python-arango = {version = "^7.5.9", optional = true} python-arango = {version = "^7.5.9", optional = true}
[tool.poetry.group.test.dependencies] [tool.poetry.group.test.dependencies]
@ -219,7 +220,7 @@ playwright = "^1.28.0"
setuptools = "^67.6.1" setuptools = "^67.6.1"
[tool.poetry.extras] [tool.poetry.extras]
llms = ["anthropic", "clarifai", "cohere", "openai", "openllm", "openlm", "nlpcloud", "huggingface_hub", "manifest-ml", "torch", "transformers"] llms = ["anthropic", "clarifai", "cohere", "openai", "openllm", "openlm", "nlpcloud", "huggingface_hub", "manifest-ml", "torch", "transformers", "xinference"]
qdrant = ["qdrant-client"] qdrant = ["qdrant-client"]
openai = ["openai", "tiktoken"] openai = ["openai", "tiktoken"]
text_helpers = ["chardet"] text_helpers = ["chardet"]
@ -315,6 +316,7 @@ all = [
"octoai-sdk", "octoai-sdk",
"rdflib", "rdflib",
"amadeus", "amadeus",
"xinference",
"python-arango", "python-arango",
] ]
@ -356,6 +358,7 @@ extended_testing = [
"rank_bm25", "rank_bm25",
"geopandas", "geopandas",
"jinja2", "jinja2",
"xinference",
] ]
[tool.ruff] [tool.ruff]


@ -0,0 +1,74 @@
"""Test Xinference embeddings."""
import time
from typing import AsyncGenerator, Tuple
import pytest_asyncio
from langchain.embeddings import XinferenceEmbeddings
@pytest_asyncio.fixture
async def setup() -> AsyncGenerator[Tuple[str, str], None]:
import xoscar as xo
from xinference.deploy.supervisor import start_supervisor_components
from xinference.deploy.utils import create_worker_actor_pool
from xinference.deploy.worker import start_worker_components
pool = await create_worker_actor_pool(
f"test://127.0.0.1:{xo.utils.get_next_port()}"
)
print(f"Pool running on localhost:{pool.external_address}")
endpoint = await start_supervisor_components(
pool.external_address, "127.0.0.1", xo.utils.get_next_port()
)
await start_worker_components(
address=pool.external_address, supervisor_address=pool.external_address
)
# wait for the api.
time.sleep(3)
async with pool:
yield endpoint, pool.external_address
def test_xinference_embedding_documents(setup: Tuple[str, str]) -> None:
"""Test xinference embeddings for documents."""
from xinference.client import RESTfulClient
endpoint, _ = setup
client = RESTfulClient(endpoint)
model_uid = client.launch_model(
model_name="vicuna-v1.3",
model_size_in_billions=7,
model_format="ggmlv3",
quantization="q4_0",
)
xinference = XinferenceEmbeddings(server_url=endpoint, model_uid=model_uid)
documents = ["foo bar", "bar foo"]
output = xinference.embed_documents(documents)
assert len(output) == 2
assert len(output[0]) == 4096
def test_xinference_embedding_query(setup: Tuple[str, str]) -> None:
"""Test xinference embeddings for query."""
from xinference.client import RESTfulClient
endpoint, _ = setup
client = RESTfulClient(endpoint)
model_uid = client.launch_model(
model_name="vicuna-v1.3", model_size_in_billions=7, quantization="q4_0"
)
xinference = XinferenceEmbeddings(server_url=endpoint, model_uid=model_uid)
document = "foo bar"
output = xinference.embed_query(document)
assert len(output) == 4096


@ -0,0 +1,57 @@
"""Test Xinference wrapper."""
import time
from typing import AsyncGenerator, Tuple
import pytest_asyncio
from langchain.llms import Xinference
@pytest_asyncio.fixture
async def setup() -> AsyncGenerator[Tuple[str, str], None]:
import xoscar as xo
from xinference.deploy.supervisor import start_supervisor_components
from xinference.deploy.utils import create_worker_actor_pool
from xinference.deploy.worker import start_worker_components
pool = await create_worker_actor_pool(
f"test://127.0.0.1:{xo.utils.get_next_port()}"
)
print(f"Pool running on localhost:{pool.external_address}")
endpoint = await start_supervisor_components(
pool.external_address, "127.0.0.1", xo.utils.get_next_port()
)
await start_worker_components(
address=pool.external_address, supervisor_address=pool.external_address
)
# wait for the api.
time.sleep(3)
async with pool:
yield endpoint, pool.external_address
def test_xinference_llm_(setup: Tuple[str, str]) -> None:
from xinference.client import RESTfulClient
endpoint, _ = setup
client = RESTfulClient(endpoint)
model_uid = client.launch_model(
model_name="vicuna-v1.3", model_size_in_billions=7, quantization="q4_0"
)
llm = Xinference(server_url=endpoint, model_uid=model_uid)
answer = llm(prompt="Q: What food can we try in the capital of France? A:")
assert isinstance(answer, str)
answer = llm(
prompt="Q: where can we visit in the capital of France? A:",
generate_config={"max_tokens": 1024, "stream": True},
)
assert isinstance(answer, str)

poetry.lock (generated)

@ -1629,7 +1629,7 @@ files = [
[[package]] [[package]]
name = "langchain" name = "langchain"
version = "0.0.240" version = "0.0.244"
description = "Building applications with LLMs through composability" description = "Building applications with LLMs through composability"
optional = false optional = false
python-versions = ">=3.8.1,<4.0" python-versions = ">=3.8.1,<4.0"
@ -1651,15 +1651,15 @@ SQLAlchemy = ">=1.4,<3"
tenacity = "^8.1.0" tenacity = "^8.1.0"
[package.extras] [package.extras]
all = ["O365 (>=2.0.26,<3.0.0)", "aleph-alpha-client (>=2.15.0,<3.0.0)", "amadeus (>=8.1.0)", "anthropic (>=0.3,<0.4)", "arxiv (>=1.4,<2.0)", "atlassian-python-api (>=3.36.0,<4.0.0)", "awadb (>=0.3.3,<0.4.0)", "azure-ai-formrecognizer (>=3.2.1,<4.0.0)", "azure-ai-vision (>=0.11.1b1,<0.12.0)", "azure-cognitiveservices-speech (>=1.28.0,<2.0.0)", "azure-cosmos (>=4.4.0b1,<5.0.0)", "azure-identity (>=1.12.0,<2.0.0)", "beautifulsoup4 (>=4,<5)", "clarifai (>=9.1.0)", "clickhouse-connect (>=0.5.14,<0.6.0)", "cohere (>=3,<4)", "deeplake (>=3.6.8,<4.0.0)", "docarray[hnswlib] (>=0.32.0,<0.33.0)", "duckduckgo-search (>=3.8.3,<4.0.0)", "elasticsearch (>=8,<9)", "esprima (>=4.0.1,<5.0.0)", "faiss-cpu (>=1,<2)", "google-api-python-client (==2.70.0)", "google-auth (>=2.18.1,<3.0.0)", "google-search-results (>=2,<3)", "gptcache (>=0.1.7)", "html2text (>=2020.1.16,<2021.0.0)", "huggingface_hub (>=0,<1)", "jina (>=3.14,<4.0)", "jinja2 (>=3,<4)", "jq (>=1.4.1,<2.0.0)", "lancedb (>=0.1,<0.2)", "langkit (>=0.0.6,<0.1.0)", "lark (>=1.1.5,<2.0.0)", "libdeeplake (>=0.0.60,<0.0.61)", "lxml (>=4.9.2,<5.0.0)", "manifest-ml (>=0.0.1,<0.0.2)", "marqo (>=0.11.0,<0.12.0)", "momento (>=1.5.0,<2.0.0)", "nebula3-python (>=3.4.0,<4.0.0)", "neo4j (>=5.8.1,<6.0.0)", "networkx (>=2.6.3,<3.0.0)", "nlpcloud (>=1,<2)", "nltk (>=3,<4)", "nomic (>=1.0.43,<2.0.0)", "octoai-sdk (>=0.1.1,<0.2.0)", "openai (>=0,<1)", "openlm (>=0.0.5,<0.0.6)", "opensearch-py (>=2.0.0,<3.0.0)", "pdfminer-six (>=20221105,<20221106)", "pexpect (>=4.8.0,<5.0.0)", "pgvector (>=0.1.6,<0.2.0)", "pinecone-client (>=2,<3)", "pinecone-text (>=0.4.2,<0.5.0)", "psycopg2-binary (>=2.9.5,<3.0.0)", "pymongo (>=4.3.3,<5.0.0)", "pyowm (>=3.3.0,<4.0.0)", "pypdf (>=3.4.0,<4.0.0)", "pytesseract (>=0.3.10,<0.4.0)", "pyvespa (>=0.33.0,<0.34.0)", "qdrant-client (>=1.3.1,<2.0.0)", "rdflib (>=6.3.2,<7.0.0)", "redis (>=4,<5)", "requests-toolbelt (>=1.0.0,<2.0.0)", "sentence-transformers (>=2,<3)", "singlestoredb (>=0.7.1,<0.8.0)", "spacy (>=3,<4)", "steamship (>=2.16.9,<3.0.0)", "tensorflow-text (>=2.11.0,<3.0.0)", "tigrisdb (>=1.0.0b6,<2.0.0)", "tiktoken (>=0.3.2,<0.4.0)", "torch (>=1,<3)", "transformers (>=4,<5)", "weaviate-client (>=3,<4)", "wikipedia (>=1,<2)", "wolframalpha (==5.0.0)"] all = ["O365 (>=2.0.26,<3.0.0)", "aleph-alpha-client (>=2.15.0,<3.0.0)", "amadeus (>=8.1.0)", "anthropic (>=0.3,<0.4)", "arxiv (>=1.4,<2.0)", "atlassian-python-api (>=3.36.0,<4.0.0)", "awadb (>=0.3.3,<0.4.0)", "azure-ai-formrecognizer (>=3.2.1,<4.0.0)", "azure-ai-vision (>=0.11.1b1,<0.12.0)", "azure-cognitiveservices-speech (>=1.28.0,<2.0.0)", "azure-cosmos (>=4.4.0b1,<5.0.0)", "azure-identity (>=1.12.0,<2.0.0)", "beautifulsoup4 (>=4,<5)", "clarifai (>=9.1.0)", "clickhouse-connect (>=0.5.14,<0.6.0)", "cohere (>=4,<5)", "deeplake (>=3.6.8,<4.0.0)", "docarray[hnswlib] (>=0.32.0,<0.33.0)", "duckduckgo-search (>=3.8.3,<4.0.0)", "elasticsearch (>=8,<9)", "esprima (>=4.0.1,<5.0.0)", "faiss-cpu (>=1,<2)", "google-api-python-client (==2.70.0)", "google-auth (>=2.18.1,<3.0.0)", "google-search-results (>=2,<3)", "gptcache (>=0.1.7)", "html2text (>=2020.1.16,<2021.0.0)", "huggingface_hub (>=0,<1)", "jina (>=3.14,<4.0)", "jinja2 (>=3,<4)", "jq (>=1.4.1,<2.0.0)", "lancedb (>=0.1,<0.2)", "langkit (>=0.0.6,<0.1.0)", "lark (>=1.1.5,<2.0.0)", "libdeeplake (>=0.0.60,<0.0.61)", "lxml (>=4.9.2,<5.0.0)", "manifest-ml (>=0.0.1,<0.0.2)", "marqo (>=0.11.0,<0.12.0)", "momento (>=1.5.0,<2.0.0)", "nebula3-python (>=3.4.0,<4.0.0)", "neo4j (>=5.8.1,<6.0.0)", "networkx (>=2.6.3,<3.0.0)", "nlpcloud (>=1,<2)", "nltk 
(>=3,<4)", "nomic (>=1.0.43,<2.0.0)", "octoai-sdk (>=0.1.1,<0.2.0)", "openai (>=0,<1)", "openlm (>=0.0.5,<0.0.6)", "opensearch-py (>=2.0.0,<3.0.0)", "pdfminer-six (>=20221105,<20221106)", "pexpect (>=4.8.0,<5.0.0)", "pgvector (>=0.1.6,<0.2.0)", "pinecone-client (>=2,<3)", "pinecone-text (>=0.4.2,<0.5.0)", "psycopg2-binary (>=2.9.5,<3.0.0)", "pymongo (>=4.3.3,<5.0.0)", "pyowm (>=3.3.0,<4.0.0)", "pypdf (>=3.4.0,<4.0.0)", "pytesseract (>=0.3.10,<0.4.0)", "python-arango (>=7.5.9,<8.0.0)", "pyvespa (>=0.33.0,<0.34.0)", "qdrant-client (>=1.3.1,<2.0.0)", "rdflib (>=6.3.2,<7.0.0)", "redis (>=4,<5)", "requests-toolbelt (>=1.0.0,<2.0.0)", "sentence-transformers (>=2,<3)", "singlestoredb (>=0.7.1,<0.8.0)", "spacy (>=3,<4)", "steamship (>=2.16.9,<3.0.0)", "tensorflow-text (>=2.11.0,<3.0.0)", "tigrisdb (>=1.0.0b6,<2.0.0)", "tiktoken (>=0.3.2,<0.4.0)", "torch (>=1,<3)", "transformers (>=4,<5)", "weaviate-client (>=3,<4)", "wikipedia (>=1,<2)", "wolframalpha (==5.0.0)", "xinference (>=0.0.6,<0.0.7)"]
azure = ["azure-ai-formrecognizer (>=3.2.1,<4.0.0)", "azure-ai-vision (>=0.11.1b1,<0.12.0)", "azure-cognitiveservices-speech (>=1.28.0,<2.0.0)", "azure-core (>=1.26.4,<2.0.0)", "azure-cosmos (>=4.4.0b1,<5.0.0)", "azure-identity (>=1.12.0,<2.0.0)", "azure-search-documents (==11.4.0a20230509004)", "openai (>=0,<1)"] azure = ["azure-ai-formrecognizer (>=3.2.1,<4.0.0)", "azure-ai-vision (>=0.11.1b1,<0.12.0)", "azure-cognitiveservices-speech (>=1.28.0,<2.0.0)", "azure-core (>=1.26.4,<2.0.0)", "azure-cosmos (>=4.4.0b1,<5.0.0)", "azure-identity (>=1.12.0,<2.0.0)", "azure-search-documents (==11.4.0b6)", "openai (>=0,<1)"]
clarifai = ["clarifai (>=9.1.0)"] clarifai = ["clarifai (>=9.1.0)"]
cohere = ["cohere (>=3,<4)"] cohere = ["cohere (>=4,<5)"]
docarray = ["docarray[hnswlib] (>=0.32.0,<0.33.0)"] docarray = ["docarray[hnswlib] (>=0.32.0,<0.33.0)"]
embeddings = ["sentence-transformers (>=2,<3)"] embeddings = ["sentence-transformers (>=2,<3)"]
extended-testing = ["atlassian-python-api (>=3.36.0,<4.0.0)", "beautifulsoup4 (>=4,<5)", "bibtexparser (>=1.4.0,<2.0.0)", "cassio (>=0.0.7,<0.0.8)", "chardet (>=5.1.0,<6.0.0)", "esprima (>=4.0.1,<5.0.0)", "geopandas (>=0.13.1,<0.14.0)", "gql (>=3.4.1,<4.0.0)", "html2text (>=2020.1.16,<2021.0.0)", "jinja2 (>=3,<4)", "jq (>=1.4.1,<2.0.0)", "lxml (>=4.9.2,<5.0.0)", "mwparserfromhell (>=0.6.4,<0.7.0)", "mwxml (>=0.3.3,<0.4.0)", "openai (>=0,<1)", "openai (>=0,<1)", "pandas (>=2.0.1,<3.0.0)", "pdfminer-six (>=20221105,<20221106)", "pgvector (>=0.1.6,<0.2.0)", "psychicapi (>=0.8.0,<0.9.0)", "py-trello (>=0.19.0,<0.20.0)", "pymupdf (>=1.22.3,<2.0.0)", "pypdf (>=3.4.0,<4.0.0)", "pypdfium2 (>=4.10.0,<5.0.0)", "pyspark (>=3.4.0,<4.0.0)", "rank-bm25 (>=0.2.2,<0.3.0)", "rapidfuzz (>=3.1.1,<4.0.0)", "requests-toolbelt (>=1.0.0,<2.0.0)", "scikit-learn (>=1.2.2,<2.0.0)", "streamlit (>=1.18.0,<2.0.0)", "sympy (>=1.12,<2.0)", "telethon (>=1.28.5,<2.0.0)", "tqdm (>=4.48.0)", "zep-python (>=0.32)"] extended-testing = ["atlassian-python-api (>=3.36.0,<4.0.0)", "beautifulsoup4 (>=4,<5)", "bibtexparser (>=1.4.0,<2.0.0)", "cassio (>=0.0.7,<0.0.8)", "chardet (>=5.1.0,<6.0.0)", "esprima (>=4.0.1,<5.0.0)", "geopandas (>=0.13.1,<0.14.0)", "gql (>=3.4.1,<4.0.0)", "html2text (>=2020.1.16,<2021.0.0)", "jinja2 (>=3,<4)", "jq (>=1.4.1,<2.0.0)", "lxml (>=4.9.2,<5.0.0)", "mwparserfromhell (>=0.6.4,<0.7.0)", "mwxml (>=0.3.3,<0.4.0)", "openai (>=0,<1)", "openai (>=0,<1)", "pandas (>=2.0.1,<3.0.0)", "pdfminer-six (>=20221105,<20221106)", "pgvector (>=0.1.6,<0.2.0)", "psychicapi (>=0.8.0,<0.9.0)", "py-trello (>=0.19.0,<0.20.0)", "pymupdf (>=1.22.3,<2.0.0)", "pypdf (>=3.4.0,<4.0.0)", "pypdfium2 (>=4.10.0,<5.0.0)", "pyspark (>=3.4.0,<4.0.0)", "rank-bm25 (>=0.2.2,<0.3.0)", "rapidfuzz (>=3.1.1,<4.0.0)", "requests-toolbelt (>=1.0.0,<2.0.0)", "scikit-learn (>=1.2.2,<2.0.0)", "streamlit (>=1.18.0,<2.0.0)", "sympy (>=1.12,<2.0)", "telethon (>=1.28.5,<2.0.0)", "tqdm (>=4.48.0)", "xinference (>=0.0.6,<0.0.7)", "zep-python (>=0.32)"]
javascript = ["esprima (>=4.0.1,<5.0.0)"] javascript = ["esprima (>=4.0.1,<5.0.0)"]
llms = ["anthropic (>=0.3,<0.4)", "clarifai (>=9.1.0)", "cohere (>=3,<4)", "huggingface_hub (>=0,<1)", "manifest-ml (>=0.0.1,<0.0.2)", "nlpcloud (>=1,<2)", "openai (>=0,<1)", "openllm (>=0.1.19)", "openlm (>=0.0.5,<0.0.6)", "torch (>=1,<3)", "transformers (>=4,<5)"] llms = ["anthropic (>=0.3,<0.4)", "clarifai (>=9.1.0)", "cohere (>=4,<5)", "huggingface_hub (>=0,<1)", "manifest-ml (>=0.0.1,<0.0.2)", "nlpcloud (>=1,<2)", "openai (>=0,<1)", "openllm (>=0.1.19)", "openlm (>=0.0.5,<0.0.6)", "torch (>=1,<3)", "transformers (>=4,<5)", "xinference (>=0.0.6,<0.0.7)"]
openai = ["openai (>=0,<1)", "tiktoken (>=0.3.2,<0.4.0)"] openai = ["openai (>=0,<1)", "tiktoken (>=0.3.2,<0.4.0)"]
qdrant = ["qdrant-client (>=1.3.1,<2.0.0)"] qdrant = ["qdrant-client (>=1.3.1,<2.0.0)"]
text-helpers = ["chardet (>=5.1.0,<6.0.0)"] text-helpers = ["chardet (>=5.1.0,<6.0.0)"]