community[minor]: Adding support for Infinispan as VectorStore (#17861)

**Description:**
This integrates Infinispan as a vectorstore.
Infinispan is an open-source key-value data grid that can run on a single
node or in distributed mode.

Vector search is supported since release 15.x.

For more: [Infinispan Home](https://infinispan.org)

Integration tests are provided, as well as a demo notebook.
This commit is contained in:
Vittorio Rigamonti
2024-03-07 00:11:02 +01:00
committed by GitHub
parent cca0167917
commit 51f3902bc4
6 changed files with 1076 additions and 0 deletions


@@ -0,0 +1,17 @@
# Infinispan VS
> [Infinispan](https://infinispan.org) is an open-source in-memory data grid that provides
> a key/value data store able to hold all types of data, from Java objects to plain text.
> Since version 15 Infinispan supports vector search over caches.
## Installation and Setup
See [Get Started](https://infinispan.org/get-started/) to run an Infinispan server. You may want to disable authentication,
which is not supported by this integration at the moment.
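
Before creating the vector store, it can help to confirm that the server answers without credentials. The sketch below uses only the Python standard library. It assumes the server listens on `localhost:11222` (the port used in the demo notebook) and that the REST endpoint `/rest/v2/caches` returns the list of cache names. The helper names `caches_url` and `list_caches` are hypothetical and are not part of the integration.

```python
import json
import urllib.error
import urllib.request


def caches_url(host: str = "localhost", port: int = 11222) -> str:
    """Build the URL of the REST endpoint that lists caches.

    The path /rest/v2/caches is an assumption about the server's REST API.
    """
    return f"http://{host}:{port}/rest/v2/caches"


def list_caches(host: str = "localhost", port: int = 11222, timeout: float = 3.0):
    """Return the cache names, or None if the request fails.

    A failure covers an unreachable server, a rejected request
    (for example when authentication is still enabled) and a
    response body that is not valid JSON.
    """
    try:
        with urllib.request.urlopen(caches_url(host, port), timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        return None
```

Calling `list_caches()` once the server is up should return a list. A result of `None` suggests that the server is not running or that authentication is still enabled.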
## Vector Store
See a [usage example](/docs/integrations/vectorstores/infinispanvs).
```python
from langchain_community.vectorstores import InfinispanVS
```
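
The demo notebook deploys a protobuf schema whose `@Vector` dimension is 384, which is the output size of the `all-MiniLM-L12-v2` embedding model it uses. As a minimal sketch, the schema text can be generated for a different entity name or dimension. The helper `build_vector_schema` is hypothetical, and the field layout (vector, text, title) is copied from the notebook rather than required by Infinispan.

```python
def build_vector_schema(entity_name: str, dimension: int) -> str:
    """Return a protobuf schema with an indexed entity and a vector field.

    The annotations live inside protobuf comments, as in the demo notebook:
    @Indexed marks the message as indexed, @Vector marks the embedding field.
    """
    if dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    return (
        "/**\n"
        " * @Indexed\n"
        " */\n"
        f"message {entity_name} {{\n"
        "/**\n"
        f" * @Vector(dimension={dimension})\n"
        " */\n"
        "repeated float vector = 1;\n"
        "optional string text = 2;\n"
        "optional string title = 3;\n"
        "}\n"
    )
```

The resulting string can then be passed to `schema_create`, as the notebook does with its hand-written schema.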


@@ -0,0 +1,408 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "cffb482c-bbd8-4829-b185-0d930a5fe0bc",
"metadata": {},
"source": [
"# Infinispan\n",
"\n",
"Infinispan is an open-source key-value data grid that can run on a single node or in distributed mode.\n",
"\n",
"Vector search is supported since release 15.x.\n",
"\n",
"For more: [Infinispan Home](https://infinispan.org)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "03ec8f9a-7641-47ea-9fa0-f43ee9fc79a3",
"metadata": {},
"outputs": [],
"source": [
"# Ensure that all we need is installed\n",
"# You may want to skip this\n",
"%pip install sentence-transformers\n",
"%pip install langchain\n",
"%pip install langchain_core\n",
"%pip install langchain_community"
]
},
{
"cell_type": "markdown",
"id": "180d172e-cca1-481c-87d5-c4f14684604d",
"metadata": {},
"source": [
"# Setup\n",
"\n",
"To run this demo we need a running Infinispan instance without authentication and a data file.\n",
"In the next three cells we're going to:\n",
"- create the configuration\n",
"- download the data file\n",
"- run Infinispan in Docker"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b251e66e-f056-4e81-a6b4-5f4d95b6537d",
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"#create infinispan configuration file\n",
"echo 'infinispan:\n",
"  cache-container:\n",
"    name: default\n",
"    transport:\n",
"      cluster: cluster\n",
"      stack: tcp\n",
"  server:\n",
"    interfaces:\n",
"      interface:\n",
"        name: public\n",
"        inet-address:\n",
"          value: 0.0.0.0\n",
"    socket-bindings:\n",
"      default-interface: public\n",
"      port-offset: 0\n",
"      socket-binding:\n",
"        name: default\n",
"        port: 11222\n",
"    endpoints:\n",
"      endpoint:\n",
"        socket-binding: default\n",
"        rest-connector:\n",
"' > infinispan-noauth.yaml"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9678d5ce-894c-4e28-bf68-20d45507122f",
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"#get an archive of news\n",
"wget https://raw.githubusercontent.com/rigazilla/infinispan-vector/main/bbc_news.csv.gz"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "009da6d1-9d1a-4392-90f1-5c654dd12654",
"metadata": {},
"outputs": [],
"source": [
"!docker run -d --name infinispanvs-demo -v $(pwd):/user-config -p 11222:11222 infinispan/server:15.0.0.Dev09 -c /user-config/infinispan-noauth.yaml "
]
},
{
"cell_type": "markdown",
"id": "b575cde9-4c62-47b3-af89-109ed39f56b6",
"metadata": {},
"source": [
"# The Code\n",
"\n",
"## Pick an embedding model\n",
"\n",
"In this demo we're using a HuggingFace embedding model."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d2c9f46f-3c78-4865-810b-52408dff5fb7",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.embeddings import HuggingFaceEmbeddings\n",
"\n",
"model_name = \"sentence-transformers/all-MiniLM-L12-v2\"\n",
"hf = HuggingFaceEmbeddings(model_name=model_name)"
]
},
{
"cell_type": "markdown",
"id": "61ce7e1f-51ee-4d3d-ad3c-97088b1120f6",
"metadata": {},
"source": [
"## Set up the Infinispan cache\n",
"\n",
"Infinispan is a very flexible key-value store that can hold raw bits as well as complex data types.\n",
"We need to configure it to store data containing embedded vectors.\n",
"\n",
"In the next cells we're going to:\n",
"- create an empty Infinispan VectorStore\n",
"- deploy a protobuf definition of our data\n",
"- create a cache"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49668bf1-778b-466d-86fb-41747ed52b74",
"metadata": {},
"outputs": [],
"source": [
"# Creating a langchain_core.VectorStore\n",
"from langchain_community.vectorstores import InfinispanVS\n",
"\n",
"ispnvs = InfinispanVS.from_texts(\n",
"    texts=[], embedding=hf, cache_name=\"demo_cache\", entity_name=\"demo_entity\"\n",
")\n",
"ispn = ispnvs.ispn"
]
},
{
"cell_type": "markdown",
"id": "0cedf066-aaab-4185-b049-93eea9b48329",
"metadata": {},
"source": [
"### Protobuf definition\n",
"\n",
"Below is the protobuf definition of our data type, which contains:\n",
"- embedded vector (field 1)\n",
"- text of the news (2)\n",
"- title of the news (3)\n",
"\n",
"As you can see, there are additional annotations in the comments that tell Infinispan that:\n",
"- the data type must be indexed (`@Indexed`)\n",
"- field 1 is an embedded vector (`@Vector`)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1fa0add0-8317-4667-9b8c-5d91c47f752a",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Infinispan supports protobuf schemas\n",
"schema_vector = \"\"\"\n",
"/**\n",
" * @Indexed\n",
" */\n",
"message demo_entity {\n",
"/**\n",
" * @Vector(dimension=384)\n",
" */\n",
"repeated float vector = 1;\n",
"optional string text = 2;\n",
"optional string title = 3;\n",
"}\n",
"\"\"\"\n",
"# Clean up before deploying a new schema\n",
"ispnvs.schema_delete()\n",
"output = ispnvs.schema_create(schema_vector)\n",
"assert output.status_code == 200\n",
"assert json.loads(output.text)[\"error\"] is None\n",
"# Create the cache\n",
"ispnvs.cache_create()\n",
"# Cleanup old data and index\n",
"ispnvs.cache_clear()\n",
"ispnvs.cache_index_reindex()"
]
},
{
"cell_type": "markdown",
"id": "456da9e7-baf4-472a-a9ee-8473aed8cabd",
"metadata": {},
"source": [
"## Prepare the data\n",
"\n",
"In this demo we choose to store text, vector and metadata in the same cache, but other options\n",
"are possible: e.g. the content can be stored somewhere else and the vector store can contain only a reference to the actual content."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0f6a42d3-c5ec-44ec-9b57-ebe5ca8c301a",
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"import gzip\n",
"\n",
"# Open the news file and process it as a csv\n",
"with gzip.open(\"bbc_news.csv.gz\", \"rt\", newline=\"\") as csvfile:\n",
"    reader = csv.reader(csvfile, delimiter=\",\", quotechar='\"')\n",
"    i = 0\n",
"    texts = []\n",
"    metas = []\n",
"    for row in reader:\n",
"        # The first and fifth values (title and body) are joined to form\n",
"        # the content to be embedded\n",
"        text = row[0] + \".\" + row[4]\n",
"        texts.append(text)\n",
"        # Store text and title as metadata\n",
"        meta = {}\n",
"        meta[\"text\"] = row[4]\n",
"        meta[\"title\"] = row[0]\n",
"        metas.append(meta)\n",
"        i = i + 1\n",
"        # Change this to change the number of news items you want to load\n",
"        if i >= 5000:\n",
"            break"
]
},
{
"cell_type": "markdown",
"id": "a6b00299-94db-43ca-9da3-45d12cdf2db1",
"metadata": {},
"source": [
"# Populate the vector store"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75e135a6-1b38-48eb-96ca-379b6f4a653f",
"metadata": {},
"outputs": [],
"source": [
"# add texts and fill vector db\n",
"keys = ispnvs.add_texts(texts, metas)"
]
},
{
"cell_type": "markdown",
"id": "2bb6f053-208d-407e-b8b7-c6c6443522d8",
"metadata": {},
"source": [
"# A helper function that prints the result documents\n",
"\n",
"By default InfinispanVS returns the protobuf `text` field in the `Document.page_content`\n",
"and all the remaining protobuf fields (except the vector) in the `metadata`. This behaviour is\n",
"configurable via lambda functions at setup."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "594fad38-37f0-4dd4-9785-a99a2f009ae5",
"metadata": {},
"outputs": [],
"source": [
"def print_docs(docs):\n",
"    # Print each result with a 1-based counter, its title and its content\n",
"    for i, res in enumerate(docs, start=1):\n",
"        print(\"----\" + str(i) + \"----\")\n",
"        print(\"TITLE: \" + res.metadata[\"title\"])\n",
"        print(res.page_content)"
]
},
{
"cell_type": "markdown",
"id": "cfa517c7-e741-4f64-9736-6db7a6bd259a",
"metadata": {},
"source": [
"# Try it!!!\n",
"\n",
"Below are some sample queries."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "86e782b3-5a74-4ca1-a5d1-c0ee935a659e",
"metadata": {},
"outputs": [],
"source": [
"docs = ispnvs.similarity_search(\"European nations\", 5)\n",
"print_docs(docs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b60847f9-ef34-4c79-b276-ac62170e2d6a",
"metadata": {},
"outputs": [],
"source": [
"print_docs(ispnvs.similarity_search(\"Milan fashion week begins\", 2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6cbb5607-da55-4879-92cf-79ac690cc0c5",
"metadata": {},
"outputs": [],
"source": [
"print_docs(ispnvs.similarity_search(\"Stock market is rising today\", 4))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3bb94ca1-7b1e-41ed-9d8f-b845775d11c1",
"metadata": {},
"outputs": [],
"source": [
"print_docs(ispnvs.similarity_search(\"Why cats are so viral?\", 2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a4fca208-b580-483d-9be0-786b6b63a31d",
"metadata": {},
"outputs": [],
"source": [
"print_docs(ispnvs.similarity_search(\"How to stay young\", 5))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "862e4af2-9f8a-4985-90cb-997477901b1e",
"metadata": {},
"outputs": [],
"source": [
"# Clean up\n",
"ispnvs.schema_delete()\n",
"ispnvs.cache_delete()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4a460b8-f0c8-4ae9-a7ff-cf550c3195f1",
"metadata": {},
"outputs": [],
"source": [
"!docker rm --force infinispanvs-demo"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.18"
}
},
"nbformat": 4,
"nbformat_minor": 5
}