mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-09 06:53:59 +00:00
Adding Marqo to vectorstore ecosystem (#7068)
This PR brings in a vectorstore interface for [Marqo](https://www.marqo.ai/). The Marqo vectorstore exposes some of Marqo's functionality in addition the the VectorStore base class. The Marqo vectorstore also makes the embedding parameter optional because inference for embeddings is an inherent part of Marqo. Docs, notebook examples and integration tests included. Related PR: https://github.com/hwchase17/langchain/pull/2807 --------- Co-authored-by: Tom Hamer <tom@marqo.ai> Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
This commit is contained in:
@@ -0,0 +1,442 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "683953b3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Marqo\n",
|
||||
"\n",
|
||||
"This notebook shows how to use functionality related to the Marqo database."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "aac9563e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.text_splitter import CharacterTextSplitter\n",
|
||||
"from langchain.vectorstores import Marqo\n",
|
||||
"from langchain.document_loaders import TextLoader"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "a3c3999a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders import TextLoader\n",
|
||||
"loader = TextLoader('../../../state_of_the_union.txt')\n",
|
||||
"documents = loader.load()\n",
|
||||
"text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
|
||||
"docs = text_splitter.split_documents(documents)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "6e104aee",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Index langchain-demo exists.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import marqo \n",
|
||||
"\n",
|
||||
"# initialize marqo\n",
|
||||
"marqo_url = \"http://localhost:8882\" # if using marqo cloud replace with your endpoint (console.marqo.ai)\n",
|
||||
"marqo_api_key = \"\" # if using marqo cloud replace with your api key (console.marqo.ai)\n",
|
||||
"\n",
|
||||
"client = marqo.Client(url=marqo_url, api_key=marqo_api_key)\n",
|
||||
"\n",
|
||||
"index_name = \"langchain-demo\"\n",
|
||||
"\n",
|
||||
"docsearch = Marqo.from_documents(docs, index_name=index_name)\n",
|
||||
"\n",
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"result_docs = docsearch.similarity_search(query)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "9c608226",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
|
||||
"\n",
|
||||
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(result_docs[0].page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "98704b27",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
|
||||
"\n",
|
||||
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n",
|
||||
"0.68647254\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"result_docs = docsearch.similarity_search_with_score(query)\n",
|
||||
"print(result_docs[0][0].page_content, result_docs[0][1], sep=\"\\n\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "eb3395b6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Additional features\n",
|
||||
"\n",
|
||||
"One of the powerful features of Marqo as a vectorstore is that you can use indexes created externally. For example:\n",
|
||||
"\n",
|
||||
"+ If you had a database of image and text pairs from another application, you can simply just use it in langchain with the Marqo vectorstore. Note that bringing your own multimodal indexes will disable the `add_texts` method.\n",
|
||||
"\n",
|
||||
"+ If you had a database of text documents, you can bring it into the langchain framework and add more texts through `add_texts`.\n",
|
||||
"\n",
|
||||
"The documents that are returned are customised by passing your own function to the `page_content_builder` callback in the search methods."
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "35b99fef",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Multimodal Example"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "a359ed74",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'errors': False,\n",
|
||||
" 'processingTimeMs': 4675.6921890009835,\n",
|
||||
" 'index_name': 'langchain-multimodal-demo',\n",
|
||||
" 'items': [{'_id': '7af25f35-5d41-4ff5-95fa-ab6bd6755176',\n",
|
||||
" 'result': 'created',\n",
|
||||
" 'status': 201},\n",
|
||||
" {'_id': '70434d17-2680-4e33-b060-a37b9b8b6959',\n",
|
||||
" 'result': 'created',\n",
|
||||
" 'status': 201}]}"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"\n",
|
||||
"# use a new index\n",
|
||||
"index_name = \"langchain-multimodal-demo\"\n",
|
||||
"\n",
|
||||
"# incase the demo is re-run\n",
|
||||
"try:\n",
|
||||
" client.delete_index(index_name)\n",
|
||||
"except Exception:\n",
|
||||
" print(f\"Creating {index_name}\")\n",
|
||||
" \n",
|
||||
"# This index could have been created by another system\n",
|
||||
"settings = {\"treat_urls_and_pointers_as_images\": True, \"model\": \"ViT-L/14\"}\n",
|
||||
"client.create_index(index_name, **settings)\n",
|
||||
"client.index(index_name).add_documents(\n",
|
||||
" [ \n",
|
||||
" # image of a bus\n",
|
||||
" {\n",
|
||||
" \"caption\": \"Bus\",\n",
|
||||
" \"image\": \"https://raw.githubusercontent.com/marqo-ai/marqo/mainline/examples/ImageSearchGuide/data/image4.jpg\"\n",
|
||||
" },\n",
|
||||
" # image of a plane\n",
|
||||
" { \n",
|
||||
" \"caption\": \"Plane\", \n",
|
||||
" \"image\": \"https://raw.githubusercontent.com/marqo-ai/marqo/mainline/examples/ImageSearchGuide/data/image2.jpg\"\n",
|
||||
" }\n",
|
||||
" ],\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "368d1fab",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def get_content(res):\n",
|
||||
" \"\"\"Helper to format Marqo's documents into text to be used as page_content\"\"\"\n",
|
||||
" return f\"{res['caption']}: {res['image']}\"\n",
|
||||
"\n",
|
||||
"docsearch = Marqo(client, index_name, page_content_builder=get_content)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"query = \"vehicles that fly\"\n",
|
||||
"doc_results = docsearch.similarity_search(query)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "eef4edf9",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Plane: https://raw.githubusercontent.com/marqo-ai/marqo/mainline/examples/ImageSearchGuide/data/image2.jpg\n",
|
||||
"Bus: https://raw.githubusercontent.com/marqo-ai/marqo/mainline/examples/ImageSearchGuide/data/image4.jpg\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for doc in doc_results:\n",
|
||||
" print(doc.page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "c255f603",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Text only example"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "9e9a2b20",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'errors': False,\n",
|
||||
" 'processingTimeMs': 500.1302719992964,\n",
|
||||
" 'index_name': 'langchain-byo-index-demo',\n",
|
||||
" 'items': [{'_id': 'cbad6f9e-a4ea-45c6-9a85-1b9c0a59827c',\n",
|
||||
" 'result': 'created',\n",
|
||||
" 'status': 201},\n",
|
||||
" {'_id': 'c0be68cb-8847-4e95-a4c9-4791b54f772c',\n",
|
||||
" 'result': 'created',\n",
|
||||
" 'status': 201}]}"
|
||||
]
|
||||
},
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"\n",
|
||||
"# use a new index\n",
|
||||
"index_name = \"langchain-byo-index-demo\"\n",
|
||||
"\n",
|
||||
"# incase the demo is re-run\n",
|
||||
"try:\n",
|
||||
" client.delete_index(index_name)\n",
|
||||
"except Exception:\n",
|
||||
" print(f\"Creating {index_name}\")\n",
|
||||
"\n",
|
||||
"# This index could have been created by another system\n",
|
||||
"client.index(index_name).add_documents(\n",
|
||||
" [ \n",
|
||||
" {\n",
|
||||
" \"Title\": \"Smartphone\",\n",
|
||||
" \"Description\": \"A smartphone is a portable computer device that combines mobile telephone \"\n",
|
||||
" \"functions and computing functions into one unit.\",\n",
|
||||
" },\n",
|
||||
" { \n",
|
||||
" \"Title\": \"Telephone\",\n",
|
||||
" \"Description\": \"A telephone is a telecommunications device that permits two or more users to\"\n",
|
||||
" \"conduct a conversation when they are too far apart to be easily heard directly.\",\n",
|
||||
" }\n",
|
||||
" ],\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "b2943ea9",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['484d8436-cb09-49f2-8f9d-39671c7ebfaa']"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Note text indexes retain the ability to use add_texts despite different field names in documents\n",
|
||||
"# this is because the page_content_builder callback lets you handle these document fields as required\n",
|
||||
"\n",
|
||||
"def get_content(res):\n",
|
||||
" \"\"\"Helper to format Marqo's documents into text to be used as page_content\"\"\"\n",
|
||||
" if 'text' in res:\n",
|
||||
" return res['text']\n",
|
||||
" return res['Description']\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"docsearch = Marqo(client, index_name, page_content_builder=get_content)\n",
|
||||
"\n",
|
||||
"docsearch.add_texts([\"This is a document that is about elephants\"])\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "851450e9",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"A smartphone is a portable computer device that combines mobile telephone functions and computing functions into one unit.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query = \"modern communications devices\"\n",
|
||||
"doc_results = docsearch.similarity_search(query)\n",
|
||||
"\n",
|
||||
"print(doc_results[0].page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"id": "9a438fec",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"This is a document that is about elephants\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query = \"elephants\"\n",
|
||||
"doc_results = docsearch.similarity_search(query, page_content_builder=get_content)\n",
|
||||
"\n",
|
||||
"print(doc_results[0].page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "0d04c9d4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Weighted Queries\n",
|
||||
"\n",
|
||||
"We also expose marqos weighted queries which are a powerful way to compose complex semantic searches."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"id": "d42ba0d6",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"A smartphone is a portable computer device that combines mobile telephone functions and computing functions into one unit.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query = {\"communications devices\" : 1.0}\n",
|
||||
"doc_results = docsearch.similarity_search(query)\n",
|
||||
"print(doc_results[0].page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"id": "b5918a16",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"A telephone is a telecommunications device that permits two or more users toconduct a conversation when they are too far apart to be easily heard directly.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query = {\"communications devices\" : 1.0, \"technology post 2000\": -1.0}\n",
|
||||
"doc_results = docsearch.similarity_search(query)\n",
|
||||
"print(doc_results[0].page_content)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.9"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
Reference in New Issue
Block a user