langchain/docs/versioned_docs/version-0.2.x/integrations/vectorstores/infinispanvs.ipynb
Jacob Lee aff771923a Jacob/new docs (#20570)
Use docusaurus versioning with a callout, merged master as well

2024-04-18 11:10:55 -07:00


{
"cells": [
{
"cell_type": "markdown",
"id": "cffb482c-bbd8-4829-b185-0d930a5fe0bc",
"metadata": {},
"source": [
"# Infinispan\n",
"\n",
"Infinispan is an open-source key-value data grid that can run as a single node or as a distributed cluster.\n",
"\n",
"Vector search has been supported since release 15.x.\n",
"For more information, see [Infinispan Home](https://infinispan.org)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "03ec8f9a-7641-47ea-9fa0-f43ee9fc79a3",
"metadata": {},
"outputs": [],
"source": [
"# Install the packages this notebook needs\n",
"# Skip this cell if they are already installed\n",
"%pip install sentence-transformers\n",
"%pip install langchain\n",
"%pip install langchain_core\n",
"%pip install langchain_community"
]
},
{
"cell_type": "markdown",
"id": "180d172e-cca1-481c-87d5-c4f14684604d",
"metadata": {},
"source": [
"# Setup\n",
"\n",
"To run this demo we need a data file and a running Infinispan instance with authentication disabled.\n",
"In the next three cells we're going to:\n",
"\n",
"- download the data file\n",
"- create the configuration\n",
"- run Infinispan in Docker"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9678d5ce-894c-4e28-bf68-20d45507122f",
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"# Download an archive of news articles\n",
"wget https://raw.githubusercontent.com/rigazilla/infinispan-vector/main/bbc_news.csv.gz"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b251e66e-f056-4e81-a6b4-5f4d95b6537d",
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"# Create the Infinispan configuration file\n",
"echo 'infinispan:\n",
"  cache-container:\n",
"    name: default\n",
"    transport:\n",
"      cluster: cluster\n",
"      stack: tcp\n",
"  server:\n",
"    interfaces:\n",
"      interface:\n",
"        name: public\n",
"        inet-address:\n",
"          value: 0.0.0.0\n",
"    socket-bindings:\n",
"      default-interface: public\n",
"      port-offset: 0\n",
"      socket-binding:\n",
"        name: default\n",
"        port: 11222\n",
"    endpoints:\n",
"      endpoint:\n",
"        socket-binding: default\n",
"        rest-connector:\n",
"' > infinispan-noauth.yaml"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "009da6d1-9d1a-4392-90f1-5c654dd12654",
"metadata": {},
"outputs": [],
"source": [
"!docker rm --force infinispanvs-demo\n",
"!docker run -d --name infinispanvs-demo -v $(pwd):/user-config -p 11222:11222 infinispan/server:15.0 -c /user-config/infinispan-noauth.yaml"
]
},
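{
"cell_type": "markdown",
"id": "sketch-wait-for-server-md",
"metadata": {},
"source": [
"The container may need a few seconds before it accepts connections. The cell below is a minimal sketch of a readiness check, assuming the REST endpoint is reachable at `http://localhost:11222` (the port published by the `docker run` command above) and that any HTTP response, even an error status, means the server is up. The helper name `wait_for_http` is hypothetical and not part of Infinispan or LangChain."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-wait-for-server",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"import urllib.error\n",
"import urllib.request\n",
"\n",
"\n",
"def wait_for_http(url, timeout=60.0, interval=1.0):\n",
"    # Poll the URL until something answers or the timeout expires\n",
"    deadline = time.monotonic() + timeout\n",
"    while time.monotonic() < deadline:\n",
"        try:\n",
"            urllib.request.urlopen(url, timeout=interval).close()\n",
"            return True\n",
"        except urllib.error.HTTPError:\n",
"            # The server answered, even if with an error status\n",
"            return True\n",
"        except OSError:\n",
"            # Connection refused or reset: the server is not ready yet\n",
"            time.sleep(interval)\n",
"    return False\n",
"\n",
"\n",
"# Assumed address: the port mapping used when starting the container\n",
"print(\"server ready:\", wait_for_http(\"http://localhost:11222\"))"
]
},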
{
"cell_type": "markdown",
"id": "b575cde9-4c62-47b3-af89-109ed39f56b6",
"metadata": {},
"source": [
"# The Code\n",
"\n",
"## Pick an embedding model\n",
"\n",
"In this demo we're using\n",
"a HuggingFace embedding model."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d2c9f46f-3c78-4865-810b-52408dff5fb7",
"metadata": {},
"outputs": [],
"source": [
"# Import from langchain_community, which is installed above\n",
"from langchain_community.embeddings import HuggingFaceEmbeddings\n",
"\n",
"model_name = \"sentence-transformers/all-MiniLM-L12-v2\"\n",
"hf = HuggingFaceEmbeddings(model_name=model_name)"
]
},
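{
"cell_type": "markdown",
"id": "sketch-embedding-check-md",
"metadata": {},
"source": [
"As a quick sanity check, the model can embed a sample sentence before any data is loaded. The sketch below reuses the `hf` object created in the previous cell and calls `embed_query`, which is part of the LangChain `Embeddings` interface. The sample sentence is arbitrary."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-embedding-check",
"metadata": {},
"outputs": [],
"source": [
"# Embed one arbitrary sentence and inspect the resulting vector\n",
"vector = hf.embed_query(\"Infinispan supports vector search\")\n",
"print(\"vector length:\", len(vector))\n",
"print(\"first values:\", vector[:3])"
]
},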
{
"cell_type": "markdown",
"id": "61ce7e1f-51ee-4d3d-ad3c-97088b1120f6",
"metadata": {},
"source": [
"## Set up the Infinispan cache\n",
"\n",
"Infinispan is a very flexible key-value store: it can store raw bits as well as complex data types.\n",
"Users have complete freedom in configuring the data grid, but for simple data types everything is configured\n",
"automatically by the Python layer. We take advantage of this feature so we can focus on our application."
]
},
{
"cell_type": "markdown",
"id": "456da9e7-baf4-472a-a9ee-8473aed8cabd",
"metadata": {},
"source": [
"## Prepare the data\n",
"\n",
"In this demo we rely on the default configuration, so texts, metadata, and vectors are stored in the same cache. Other options are possible: for example, the content can be stored somewhere else and the vector store can contain only a reference to the actual content."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0f6a42d3-c5ec-44ec-9b57-ebe5ca8c301a",
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"import gzip\n",
"\n",
"# Open the news file and process it as a CSV\n",
"with gzip.open(\"bbc_news.csv.gz\", \"rt\", newline=\"\") as csvfile:\n",
"    reader = csv.reader(csvfile, delimiter=\",\", quotechar='\"')\n",
"    i = 0\n",
"    texts = []\n",
"    metas = []\n",
"    for row in reader:\n",
"        # The first and fifth values are joined to form the content\n",
"        # to be processed\n",
"        text = row[0] + \".\" + row[4]\n",
"        texts.append(text)\n",
"        # Store text and title as metadata\n",
"        meta = {\"text\": row[4], \"title\": row[0]}\n",
"        metas.append(meta)\n",
"        i = i + 1\n",
"        # Change this to change the number of news items you want to load\n",
"        if i >= 5000:\n",
"            break"
]
},
{
"cell_type": "markdown",
"id": "a6b00299-94db-43ca-9da3-45d12cdf2db1",
"metadata": {},
"source": [
"# Populate the vector store"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75e135a6-1b38-48eb-96ca-379b6f4a653f",
"metadata": {},
"outputs": [],
"source": [
"# add texts and fill vector db\n",
"\n",
"from langchain_community.vectorstores import InfinispanVS\n",
"\n",
"ispnvs = InfinispanVS.from_texts(texts, hf, metas)"
]
},
{
"cell_type": "markdown",
"id": "2bb6f053-208d-407e-b8b7-c6c6443522d8",
"metadata": {},
"source": [
"# A helper function that prints the result documents\n",
"\n",
"By default InfinispanVS returns the protobuf `text` field in the `Document.page_content`\n",
"and all the remaining protobuf fields (except the vector) in the `metadata`. This behaviour is\n",
"configurable via lambda functions at setup."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "594fad38-37f0-4dd4-9785-a99a2f009ae5",
"metadata": {},
"outputs": [],
"source": [
"def print_docs(docs):\n",
"    # Print each result with a 1-based counter, its title, and its content\n",
"    for i, res in enumerate(docs, start=1):\n",
"        print(\"----\" + str(i) + \"----\")\n",
"        print(\"TITLE: \" + res.metadata[\"title\"])\n",
"        print(res.page_content)"
]
},
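{
"cell_type": "markdown",
"id": "sketch-default-mapping-md",
"metadata": {},
"source": [
"To make the default mapping described above concrete, here is a plain-Python illustration that does not call InfinispanVS. It assumes a stored record is a dictionary with a `text` field, a `vector` field, and arbitrary extra fields. The field name `vector`, the `SimpleDoc` class, and the `record_to_doc` function are hypothetical stand-ins, not the library's implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-default-mapping",
"metadata": {},
"outputs": [],
"source": [
"from dataclasses import dataclass, field\n",
"\n",
"\n",
"@dataclass\n",
"class SimpleDoc:\n",
"    # Stand-in for a LangChain Document (hypothetical, for illustration only)\n",
"    page_content: str\n",
"    metadata: dict = field(default_factory=dict)\n",
"\n",
"\n",
"def default_metadata(record):\n",
"    # Keep every field except the text and the vector\n",
"    return {k: v for k, v in record.items() if k not in (\"text\", \"vector\")}\n",
"\n",
"\n",
"def record_to_doc(record, content_fn=lambda r: r[\"text\"], metadata_fn=default_metadata):\n",
"    # content_fn and metadata_fn play the role of the configurable lambdas\n",
"    return SimpleDoc(page_content=content_fn(record), metadata=metadata_fn(record))\n",
"\n",
"\n",
"record = {\"text\": \"Some body\", \"title\": \"Some title\", \"vector\": [0.1, 0.2]}\n",
"print(record_to_doc(record))\n",
"# Override the content function, for example to return the title instead\n",
"print(record_to_doc(record, content_fn=lambda r: r[\"title\"]))"
]
},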
{
"cell_type": "markdown",
"id": "cfa517c7-e741-4f64-9736-6db7a6bd259a",
"metadata": {},
"source": [
"# Try it!!!\n",
"\n",
"Below are some sample queries."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "86e782b3-5a74-4ca1-a5d1-c0ee935a659e",
"metadata": {},
"outputs": [],
"source": [
"docs = ispnvs.similarity_search(\"European nations\", 5)\n",
"print_docs(docs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b60847f9-ef34-4c79-b276-ac62170e2d6a",
"metadata": {},
"outputs": [],
"source": [
"print_docs(ispnvs.similarity_search(\"Milan fashion week begins\", 2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6cbb5607-da55-4879-92cf-79ac690cc0c5",
"metadata": {},
"outputs": [],
"source": [
"print_docs(ispnvs.similarity_search(\"Stock market is rising today\", 4))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3bb94ca1-7b1e-41ed-9d8f-b845775d11c1",
"metadata": {},
"outputs": [],
"source": [
"print_docs(ispnvs.similarity_search(\"Why cats are so viral?\", 2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a4fca208-b580-483d-9be0-786b6b63a31d",
"metadata": {},
"outputs": [],
"source": [
"print_docs(ispnvs.similarity_search(\"How to stay young\", 5))"
]
},
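{
"cell_type": "markdown",
"id": "sketch-retriever-md",
"metadata": {},
"source": [
"The same store can also be queried through LangChain's generic retriever interface. The sketch below assumes the `ispnvs` store and the `print_docs` helper defined above, and uses `as_retriever`, which every LangChain `VectorStore` inherits. The value `k=3` and the query text are arbitrary."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-retriever",
"metadata": {},
"outputs": [],
"source": [
"# Wrap the vector store as a retriever that returns the top 3 matches\n",
"retriever = ispnvs.as_retriever(search_kwargs={\"k\": 3})\n",
"print_docs(retriever.invoke(\"European nations\"))"
]
},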
{
"cell_type": "code",
"execution_count": null,
"id": "d4a460b8-f0c8-4ae9-a7ff-cf550c3195f1",
"metadata": {},
"outputs": [],
"source": [
"!docker rm --force infinispanvs-demo"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
}
},
"nbformat": 4,
"nbformat_minor": 5
}