Cassandra Vector Store, add metadata filtering + improvements (#9280)
This PR addresses a few minor issues with the Cassandra vector store implementation and extends the store to support metadata search.

Thanks to the latest cassIO library (>=0.1.0), metadata filtering is available in the store. Further:

- the "relevance" score is prevented from being flipped in the [0,1] interval, thus ensuring that 1 corresponds to the closest vector (this is related to how the underlying cassIO class returns the cosine difference);
- bumped the cassIO package version both in the notebooks and the pyproject.toml;
- adjusted the textfile location for the vector-store example after the reshuffling of the Langchain repo dir structure;
- added demonstration of metadata filtering in the Cassandra vector store notebook;
- better docstring for the Cassandra vector store class;
- fixed test flakiness and removed offending out-of-place escape chars from a test module docstring.

To my knowledge all relevant tests pass and mypy+black+ruff don't complain. (mypy gives unrelated errors in other modules, which clearly don't depend on the content of this PR).

Thank you!
Stefano

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
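As a rough illustration of the two behaviours this commit message describes (equality filtering on metadata, and a relevance score in [0,1] where 1 means the closest vector), here is a minimal, self-contained sketch. It does not use cassIO or LangChain: the in-memory `store`, the `relevance` helper, the toy `similarity_search` function and the `0.5 + 0.5 * cosine` mapping are all assumptions made for illustration, not the library's actual implementation.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical record layout: (text, embedding vector, metadata).
Doc = Tuple[str, List[float], Dict[str, str]]


def relevance(a: List[float], b: List[float]) -> float:
    """Map cosine similarity from [-1, 1] to [0, 1], so that 1 means closest."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.5 + 0.5 * (dot / norm)


def similarity_search(
    store: List[Doc],
    query_vec: List[float],
    k: int = 4,
    filter: Optional[Dict[str, str]] = None,
) -> List[Tuple[str, float]]:
    """Keep documents whose metadata matches every filter entry, then rank them."""
    wanted = filter or {}
    candidates = [
        doc for doc in store
        if all(doc[2].get(key) == value for key, value in wanted.items())
    ]
    scored = [(text, relevance(query_vec, vec)) for text, vec, _ in candidates]
    # Highest relevance first, truncated to the k best matches.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data, assumed for the example only.
store: List[Doc] = [
    ("alpha", [1.0, 0.0], {"source": "a.txt"}),
    ("beta", [0.0, 1.0], {"source": "a.txt"}),
    ("gamma", [1.0, 0.0], {"source": "b.txt"}),
]

hits = similarity_search(store, [1.0, 0.0], k=2, filter={"source": "a.txt"})
misses = similarity_search(store, [1.0, 0.0], filter={"source": "nonexisting_file.txt"})
```

In this sketch a filter that matches no metadata yields an empty list, which mirrors the intent of the notebook's second filtering cell (the one using `"nonexisting_file.txt"`).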
@@ -23,7 +23,7 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "!pip install \"cassio>=0.0.7\""
+ "!pip install \"cassio>=0.1.0\""
  ]
  },
  {
@@ -155,7 +155,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.10.6"
+ "version": "3.10.12"
  }
  },
  "nbformat": 4,

@@ -23,7 +23,7 @@
  },
  "outputs": [],
  "source": [
- "!pip install \"cassio>=0.0.7\""
+ "!pip install \"cassio>=0.1.0\""
  ]
  },
  {
@@ -152,7 +152,9 @@
  "source": [
  "from langchain.document_loaders import TextLoader\n",
  "\n",
- "loader = TextLoader(\"../../../state_of_the_union.txt\")\n",
+ "SOURCE_FILE_NAME = \"../../modules/state_of_the_union.txt\"\n",
+ "\n",
+ "loader = TextLoader(SOURCE_FILE_NAME)\n",
  "documents = loader.load()\n",
  "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
  "docs = text_splitter.split_documents(documents)\n",
@@ -197,7 +199,7 @@
  "# table_name=table_name,\n",
  "# )\n",
  "\n",
- "# docsearch_preexisting.similarity_search(query, k=2)"
+ "# docs = docsearch_preexisting.similarity_search(query, k=2)"
  ]
  },
  {
@@ -253,6 +255,51 @@
  "for i, doc in enumerate(found_docs):\n",
  " print(f\"{i + 1}.\", doc.page_content, \"\\n\")"
  ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "da791c5f",
+ "metadata": {},
+ "source": [
+ "### Metadata filtering\n",
+ "\n",
+ "You can specify filtering on metadata when running searches in the vector store. By default, when inserting documents, the only metadata is the `\"source\"` (but you can customize the metadata at insertion time).\n",
+ "\n",
+ "Since only one file was inserted, this is just a demonstration of how filters are passed:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "93f132fa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "filter = {\"source\": SOURCE_FILE_NAME}\n",
+ "filtered_docs = docsearch.similarity_search(query, filter=filter, k=5)\n",
+ "print(f\"{len(filtered_docs)} documents retrieved.\")\n",
+ "print(f\"{filtered_docs[0].page_content[:64]} ...\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1b413ec4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "filter = {\"source\": \"nonexisting_file.txt\"}\n",
+ "filtered_docs2 = docsearch.similarity_search(query, filter=filter)\n",
+ "print(f\"{len(filtered_docs2)} documents retrieved.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a0fea764",
+ "metadata": {},
+ "source": [
+ "Please visit the [cassIO documentation](https://cassio.org/frameworks/langchain/about/) for more on using vector stores with Langchain."
+ ]
  }
  ],
  "metadata": {
@@ -271,7 +318,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.10.6"
+ "version": "3.10.12"
  }
  },
  "nbformat": 4,