mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-10 15:33:11 +00:00
community[minor]: Adding support for Infinispan as VectorStore (#17861)
**Description:** This integrates Infinispan as a vectorstore. Infinispan is an open-source key-value data grid that can run as a single node or distributed. Vector search is supported since release 15.x. For more: [Infinispan Home](https://infinispan.org). Integration tests are provided, as well as a demo notebook.
committed by
GitHub
parent
cca0167917
commit
51f3902bc4
17
docs/docs/integrations/providers/infinispanvs.mdx
Normal file
@@ -0,0 +1,17 @@
# Infinispan VS

> [Infinispan](https://infinispan.org) is an open-source in-memory data grid that provides
> a key/value data store able to hold all types of data, from Java objects to plain text.
> Since version 15, Infinispan supports vector search over caches.

## Installation and Setup
See [Get Started](https://infinispan.org/get-started/) to run an Infinispan server. You may want to disable authentication,
which is not supported at the moment.

## Vector Store

See a [usage example](/docs/integrations/vectorstores/infinispanvs).

```python
from langchain_community.vectorstores import InfinispanVS
```
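
The usage example deploys a protobuf schema whose `@Vector(dimension=...)` annotation has to match the size of the vectors produced by the embedding model (the demo declares 384). A minimal sketch of generating such a schema is shown below. Note that `build_schema` is a hypothetical helper written for illustration and is not part of `InfinispanVS`; the field layout (vector, text, title) simply mirrors the demo entity.

```python
def build_schema(entity_name: str, dimension: int) -> str:
    """Return a protobuf schema for an indexed entity with a vector field.

    Hypothetical helper: the fields mirror the demo entity
    (vector = 1, text = 2, title = 3).
    """
    if dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    # Doubled braces are literal braces inside an f-string.
    return f"""
/**
 * @Indexed
 */
message {entity_name} {{
/**
 * @Vector(dimension={dimension})
 */
repeated float vector = 1;
optional string text = 2;
optional string title = 3;
}}
"""


# 384 is the dimension declared in the demo schema.
schema = build_schema("demo_entity", 384)
print(schema)
```

The resulting string could then be deployed the way the usage example does it, by passing it to `schema_create` on the vector store instance.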
408
docs/docs/integrations/vectorstores/infinispanvs.ipynb
Normal file
@@ -0,0 +1,408 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cffb482c-bbd8-4829-b185-0d930a5fe0bc",
   "metadata": {},
   "source": [
    "# Infinispan\n",
    "\n",
    "Infinispan is an open-source key-value data grid that can run as a single node or distributed.\n",
    "\n",
    "Vector search is supported since release 15.x.\n",
    "For more: [Infinispan Home](https://infinispan.org)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "03ec8f9a-7641-47ea-9fa0-f43ee9fc79a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ensure that all we need is installed\n",
    "# You may want to skip this\n",
    "%pip install sentence-transformers\n",
    "%pip install langchain\n",
    "%pip install langchain_core\n",
    "%pip install langchain_community"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "180d172e-cca1-481c-87d5-c4f14684604d",
   "metadata": {},
   "source": [
    "# Setup\n",
    "\n",
    "To run this demo we need a running Infinispan instance without authentication and a data file.\n",
    "In the next three cells we're going to:\n",
    "- create the configuration\n",
    "- download the data file\n",
    "- run Infinispan in Docker"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b251e66e-f056-4e81-a6b4-5f4d95b6537d",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "#create infinispan configuration file\n",
    "echo 'infinispan:\n",
    "  cache-container:\n",
    "    name: default\n",
    "    transport:\n",
    "      cluster: cluster\n",
    "      stack: tcp\n",
    "  server:\n",
    "    interfaces:\n",
    "      interface:\n",
    "        name: public\n",
    "        inet-address:\n",
    "          value: 0.0.0.0\n",
    "    socket-bindings:\n",
    "      default-interface: public\n",
    "      port-offset: 0\n",
    "      socket-binding:\n",
    "        name: default\n",
    "        port: 11222\n",
    "    endpoints:\n",
    "      endpoint:\n",
    "        socket-binding: default\n",
    "        rest-connector:\n",
    "' > infinispan-noauth.yaml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9678d5ce-894c-4e28-bf68-20d45507122f",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "#get an archive of news\n",
    "wget https://raw.githubusercontent.com/rigazilla/infinispan-vector/main/bbc_news.csv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "009da6d1-9d1a-4392-90f1-5c654dd12654",
   "metadata": {},
   "outputs": [],
   "source": [
    "!docker run -d --name infinispanvs-demo -v $(pwd):/user-config -p 11222:11222 infinispan/server:15.0.0.Dev09 -c /user-config/infinispan-noauth.yaml "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b575cde9-4c62-47b3-af89-109ed39f56b6",
   "metadata": {},
   "source": [
    "# The Code\n",
    "\n",
    "## Pick an embedding model\n",
    "\n",
    "In this demo we're using\n",
    "a HuggingFace embedding model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2c9f46f-3c78-4865-810b-52408dff5fb7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.embeddings import HuggingFaceEmbeddings\n",
    "\n",
    "model_name = \"sentence-transformers/all-MiniLM-L12-v2\"\n",
    "hf = HuggingFaceEmbeddings(model_name=model_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61ce7e1f-51ee-4d3d-ad3c-97088b1120f6",
   "metadata": {},
   "source": [
    "## Set up the Infinispan cache\n",
    "\n",
    "Infinispan is a very flexible key-value store that can hold raw bits as well as complex data types.\n",
    "We need to configure it to store data containing embedded vectors.\n",
    "\n",
    "In the next cells we're going to:\n",
    "- create an empty Infinispan VectorStore\n",
    "- deploy a protobuf definition of our data\n",
    "- create a cache"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49668bf1-778b-466d-86fb-41747ed52b74",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Creating a langchain_core.VectorStore\n",
    "from langchain_community.vectorstores import InfinispanVS\n",
    "\n",
    "# Start from an empty list of texts: the store is populated later\n",
    "ispnvs = InfinispanVS.from_texts(\n",
    "    texts=[], embedding=hf, cache_name=\"demo_cache\", entity_name=\"demo_entity\"\n",
    ")\n",
    "ispn = ispnvs.ispn"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0cedf066-aaab-4185-b049-93eea9b48329",
   "metadata": {},
   "source": [
    "### Protobuf definition\n",
    "\n",
    "Below is the protobuf definition of our data type, which contains:\n",
    "- embedded vector (field 1)\n",
    "- text of the news (2)\n",
    "- title of the news (3)\n",
    "\n",
    "As you can see, there are additional annotations in the comments that tell Infinispan that:\n",
    "- data type must be indexed (`@Indexed`)\n",
    "- field 1 is an embedded vector (`@Vector`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1fa0add0-8317-4667-9b8c-5d91c47f752a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "# Infinispan supports protobuf schemas\n",
    "schema_vector = \"\"\"\n",
    "/**\n",
    " * @Indexed\n",
    " */\n",
    "message demo_entity {\n",
    "/**\n",
    " * @Vector(dimension=384)\n",
    " */\n",
    "repeated float vector = 1;\n",
    "optional string text = 2;\n",
    "optional string title = 3;\n",
    "}\n",
    "\"\"\"\n",
    "# Clean up before deploying a new schema\n",
    "ispnvs.schema_delete()\n",
    "output = ispnvs.schema_create(schema_vector)\n",
    "assert output.status_code == 200\n",
    "assert json.loads(output.text)[\"error\"] is None\n",
    "# Create the cache\n",
    "ispnvs.cache_create()\n",
    "# Clean up old data and index\n",
    "ispnvs.cache_clear()\n",
    "ispnvs.cache_index_reindex()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "456da9e7-baf4-472a-a9ee-8473aed8cabd",
   "metadata": {},
   "source": [
    "## Prepare the data\n",
    "\n",
    "In this demo we choose to store text, vector and metadata in the same cache, but other options\n",
    "are possible: e.g. the content can be stored somewhere else and the vector store can contain only a reference to the actual content."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f6a42d3-c5ec-44ec-9b57-ebe5ca8c301a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import gzip\n",
    "\n",
    "# Open the news file and process it as a csv\n",
    "with gzip.open(\"bbc_news.csv.gz\", \"rt\", newline=\"\") as csvfile:\n",
    "    spamreader = csv.reader(csvfile, delimiter=\",\", quotechar='\"')\n",
    "    i = 0\n",
    "    texts = []\n",
    "    metas = []\n",
    "    for row in spamreader:\n",
    "        # first and fifth value are joined to form the content\n",
    "        # to be processed\n",
    "        text = row[0] + \".\" + row[4]\n",
    "        texts.append(text)\n",
    "        # Store text and title as metadata\n",
    "        meta = {}\n",
    "        meta[\"text\"] = row[4]\n",
    "        meta[\"title\"] = row[0]\n",
    "        metas.append(meta)\n",
    "        i = i + 1\n",
    "        # Change this to change the number of news you want to load\n",
    "        if i >= 5000:\n",
    "            break"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6b00299-94db-43ca-9da3-45d12cdf2db1",
   "metadata": {},
   "source": [
    "# Populate the vector store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75e135a6-1b38-48eb-96ca-379b6f4a653f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# add texts and fill vector db\n",
    "keys = ispnvs.add_texts(texts, metas)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bb6f053-208d-407e-b8b7-c6c6443522d8",
   "metadata": {},
   "source": [
    "# A helper function that prints the result documents\n",
    "\n",
    "By default InfinispanVS returns the protobuf `text` field in the `Document.page_content`\n",
    "and all the remaining protobuf fields (except the vector) in the `metadata`. This behaviour is\n",
    "configurable via lambda functions at setup."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "594fad38-37f0-4dd4-9785-a99a2f009ae5",
   "metadata": {},
   "outputs": [],
   "source": [
    "def print_docs(docs):\n",
    "    # Number the results starting from 1\n",
    "    for i, res in enumerate(docs, start=1):\n",
    "        print(\"----\" + str(i) + \"----\")\n",
    "        print(\"TITLE: \" + res.metadata[\"title\"])\n",
    "        print(res.page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cfa517c7-e741-4f64-9736-6db7a6bd259a",
   "metadata": {},
   "source": [
    "# Try it!!!\n",
    "\n",
    "Below are some sample queries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "86e782b3-5a74-4ca1-a5d1-c0ee935a659e",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = ispnvs.similarity_search(\"European nations\", 5)\n",
    "print_docs(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b60847f9-ef34-4c79-b276-ac62170e2d6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "print_docs(ispnvs.similarity_search(\"Milan fashion week begins\", 2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6cbb5607-da55-4879-92cf-79ac690cc0c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "print_docs(ispnvs.similarity_search(\"Stock market is rising today\", 4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3bb94ca1-7b1e-41ed-9d8f-b845775d11c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "print_docs(ispnvs.similarity_search(\"Why cats are so viral?\", 2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a4fca208-b580-483d-9be0-786b6b63a31d",
   "metadata": {},
   "outputs": [],
   "source": [
    "print_docs(ispnvs.similarity_search(\"How to stay young\", 5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "862e4af2-9f8a-4985-90cb-997477901b1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clean up\n",
    "ispnvs.schema_delete()\n",
    "ispnvs.cache_delete()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4a460b8-f0c8-4ae9-a7ff-cf550c3195f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "!docker rm --force infinispanvs-demo"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}