Add "Astra DB" vector store integration (#12966)

# Astra DB Vector store integration

- **Description:** This PR adds a `VectorStore` implementation for
DataStax Astra DB using its HTTP API
  - **Issue:** (no related issue)
- **Dependencies:** The integration requires a new dependency, `astrapy`
(`>=0.5.3`), which was added to pyproject.toml as optional, as per guidelines
- **Tag maintainer:** I recently mentioned to @baskaryan this
integration was coming
  - **Twitter handle:** `@rsprrs` if you want to mention me

This PR introduces the `AstraDB` vector store class, extensive
integration test coverage, a reworking of the documentation which
brings Cassandra and Astra DB together on a single "provider" page and a
new, completely reworked vector-store example notebook (common to the
Cassandra store, since parts of the flow are shared by the two APIs). I
also took care to ensure that the docs (and redirects therein) behave
correctly.

All style, linting, typechecks and tests pass as far as the `AstraDB`
integration is concerned.

I could build the documentation and check it all right (but ran into
trouble with the `api_docs_build` makefile target which I could not
verify: `Error: Unable to import module
'plan_and_execute.agent_executor' with error: No module named
'langchain_experimental'` was the first of many similar errors)

Thank you for a review!
Stefano

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Stefano Lottini
2023-11-07 23:45:33 +01:00
committed by GitHub
parent 13bd83bd61
commit 4f4b020582
21 changed files with 4376 additions and 376 deletions


@@ -0,0 +1,85 @@
# Astra DB
This page lists the integrations available with [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) and [Apache Cassandra®](https://cassandra.apache.org/).
### Setup
Install the following Python package:
```bash
pip install "astrapy>=0.5.3"
```
## Astra DB
> DataStax [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) is a serverless vector-capable database built on Cassandra and made conveniently available
> through an easy-to-use JSON API.
### Vector Store
```python
from langchain.vectorstores import AstraDB
vector_store = AstraDB(
    embedding=my_embedding,
    collection_name="my_store",
    api_endpoint="...",
    token="...",
)
```
Learn more in the [example notebook](/docs/integrations/vectorstores/astradb).
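
To make the idea behind a vector store concrete, the following is a toy, in-memory sketch of a metadata-filtered similarity search. It is not how the `AstraDB` class works (that class delegates storage and search to Astra DB over its API); `toy_embed` and `TinyStore` are hypothetical names introduced here for illustration, and the bag-of-words "embedding" merely stands in for a real embedding model such as `my_embedding`.

```python
import math
import re
from typing import Dict, List, Optional, Tuple


def toy_embed(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for an embedding model: sparse word counts."""
    counts: Dict[str, float] = {}
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        counts[word] = counts.get(word, 0.0) + 1.0
    return counts


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


class TinyStore:
    """Toy in-memory store (illustrative only, not the AstraDB class)."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, dict, Dict[str, float]]] = []

    def add_texts(self, texts: List[str], metadatas: List[dict]) -> None:
        for text, metadata in zip(texts, metadatas):
            self._rows.append((text, metadata, toy_embed(text)))

    def similarity_search(
        self, query: str, k: int = 3, filter: Optional[dict] = None
    ) -> List[Tuple[str, dict, float]]:
        query_vec = toy_embed(query)
        # Keep only rows whose metadata matches every key/value in the filter.
        candidates = [
            row
            for row in self._rows
            if all(row[1].get(key) == value for key, value in (filter or {}).items())
        ]
        scored = [(text, meta, cosine(query_vec, vec)) for text, meta, vec in candidates]
        # Highest similarity first, truncated to the k best matches.
        scored.sort(key=lambda item: item[2], reverse=True)
        return scored[:k]


store = TinyStore()
store.add_texts(
    ["I think, therefore I am.", "To the things themselves!"],
    [{"author": "descartes"}, {"author": "husserl"}],
)
hits = store.similarity_search("I think", k=1)
filtered = store.similarity_search("I think", k=3, filter={"author": "husserl"})
print(hits[0][0])
print([meta["author"] for _, meta, _ in filtered])
```

With a real store the ranking comes from the database and a proper embedding model; the sketch only shows that the filter narrows the candidate set before the top `k` results are returned.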
## Apache Cassandra and Astra DB through CQL
> [Cassandra](https://cassandra.apache.org/) is a NoSQL, row-oriented, highly scalable and highly available database.
> Starting with version 5.0, the database ships with [vector search capabilities](https://cassandra.apache.org/doc/trunk/cassandra/vector-search/overview.html).
> DataStax [Astra DB through CQL](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html) is a managed serverless database built on Cassandra, offering the same interface and strengths.
These databases use the CQL protocol (Cassandra Query Language).
Hence, they require a different set of connectors, outlined below.
### Vector Store
```python
from langchain.vectorstores import Cassandra
vector_store = Cassandra(
    embedding=my_embedding,
    table_name="my_store",
)
```
Learn more in the [example notebook](/docs/integrations/vectorstores/astradb) (scroll down to the CQL-specific section).
### Memory
```python
from langchain.memory import CassandraChatMessageHistory
message_history = CassandraChatMessageHistory(session_id="my-session")
```
Learn more in the [example notebook](/docs/integrations/memory/cassandra_chat_message_history).
### LLM Cache
```python
import langchain
from langchain.cache import CassandraCache
langchain.llm_cache = CassandraCache()
```
Learn more in the [example notebook](/docs/integrations/llms/llm_caching) (scroll to the Cassandra section).
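
Conceptually, an LLM cache of this kind stores each response under the exact prompt (plus a string identifying the model and its settings) and returns it when the same pair is seen again. The sketch below illustrates that idea with a plain dictionary; `ToyExactCache` is a hypothetical name, and nothing here reflects how `CassandraCache` stores data in the database.

```python
from typing import Dict, Optional, Tuple


class ToyExactCache:
    """Illustrative exact-match cache keyed on (prompt, llm_string)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], str] = {}

    def lookup(self, prompt: str, llm_string: str) -> Optional[str]:
        # A hit requires the prompt and the model description to match exactly.
        return self._entries.get((prompt, llm_string))

    def update(self, prompt: str, llm_string: str, response: str) -> None:
        self._entries[(prompt, llm_string)] = response


cache = ToyExactCache()
miss = cache.lookup("Tell me a joke", "model-a")
cache.update("Tell me a joke", "model-a", "Why did the chicken cross the road?")
hit = cache.lookup("Tell me a joke", "model-a")
other_model = cache.lookup("Tell me a joke", "model-b")
print(miss, hit, other_model)
```

Keying on the model description as well as the prompt keeps a response produced by one model from being served for another.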
### Semantic LLM Cache
```python
from langchain.cache import CassandraSemanticCache
cassSemanticCache = CassandraSemanticCache(
    embedding=my_embedding,
    table_name="my_store",
)
```
Learn more in the [example notebook](/docs/integrations/llms/llm_caching) (scroll to the appropriate section).
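
A semantic cache differs from an exact-match cache in that a lookup succeeds when a stored prompt is merely similar enough to the new one, which is why it needs an embedding. The toy sketch below shows the idea; `toy_embed`, `ToySemanticCache` and the `0.85` threshold are assumptions made for illustration and do not describe the internals or defaults of `CassandraSemanticCache`.

```python
import math
import re
from typing import Dict, List, Optional, Tuple


def toy_embed(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for an embedding model: sparse word counts."""
    counts: Dict[str, float] = {}
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        counts[word] = counts.get(word, 0.0) + 1.0
    return counts


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


class ToySemanticCache:
    """Illustrative cache that matches prompts by similarity, not equality."""

    def __init__(self, threshold: float = 0.85) -> None:
        self._threshold = threshold  # assumed value, chosen for the demo
        self._entries: List[Tuple[Dict[str, float], str]] = []

    def update(self, prompt: str, response: str) -> None:
        self._entries.append((toy_embed(prompt), response))

    def lookup(self, prompt: str) -> Optional[str]:
        query_vec = toy_embed(prompt)
        best_score, best_response = 0.0, None
        for vec, response in self._entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        # Only return the closest entry if it is similar enough.
        return best_response if best_score >= self._threshold else None


cache = ToySemanticCache()
cache.update("What is the capital of France?", "Paris")
near_hit = cache.lookup("what is the capital of france")
miss = cache.lookup("How do I bake bread?")
print(near_hit, miss)
```

The threshold trades hit rate against the risk of returning an answer to a question that only looks similar, so in practice it needs tuning for the embedding model in use.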


@@ -1,35 +0,0 @@
# Cassandra
>[Apache Cassandra®](https://cassandra.apache.org/) is a free and open-source, distributed, wide-column
> store, NoSQL database management system designed to handle large amounts of data across many commodity servers,
> providing high availability with no single point of failure. Cassandra offers support for clusters spanning
> multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.
> Cassandra was designed to implement a combination of _Amazon's Dynamo_ distributed storage and replication
> techniques combined with _Google's Bigtable_ data and storage engine model.
## Installation and Setup
```bash
pip install cassandra-driver
pip install cassio
```
## Vector Store
See a [usage example](/docs/integrations/vectorstores/cassandra).
```python
from langchain.vectorstores import Cassandra
```
## Memory
See a [usage example](/docs/integrations/memory/cassandra_chat_message_history).
```python
from langchain.memory import CassandraChatMessageHistory
```


@@ -0,0 +1,749 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d2d6ca14-fb7e-4172-9aa0-a3119a064b96",
"metadata": {},
"source": [
"# Astra DB\n",
"\n",
"This page provides a quickstart for using [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) and [Apache Cassandra®](https://cassandra.apache.org/) as a Vector Store.\n",
"\n",
"_Note: in addition to access to the database, an OpenAI API Key is required to run the full example._"
]
},
{
"cell_type": "markdown",
"id": "bb9be7ce-8c70-4d46-9f11-71c42a36e928",
"metadata": {},
"source": [
"### Setup and general dependencies"
]
},
{
"cell_type": "markdown",
"id": "dbe7c156-0413-47e3-9237-4769c4248869",
"metadata": {},
"source": [
"Use of the integration requires the following Python package.\n",
"\n",
"_Note: depending on your LangChain setup, you may need to install other dependencies needed for this demo._"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d00fcf4-9798-4289-9214-d9734690adfc",
"metadata": {},
"outputs": [],
"source": [
"!pip install --quiet \"astrapy>=0.5.3\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b06619af-fea2-4863-8149-7f239a8c9c82",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"from datasets import load_dataset # if not present yet, run: pip install \"datasets==2.14.6\"\n",
"\n",
"from langchain.schema import Document\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.document_loaders import PyPDFLoader\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.prompts import ChatPromptTemplate\n",
"from langchain.schema.runnable import RunnablePassthrough\n",
"from langchain.schema.output_parser import StrOutputParser"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1983f1da-0ae7-4a9b-bf4c-4ade328f7a3a",
"metadata": {},
"outputs": [],
"source": [
"os.environ[\"OPENAI_API_KEY\"] = getpass(\"OPENAI_API_KEY = \")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c656df06-e938-4bc5-b570-440b8b7a0189",
"metadata": {},
"outputs": [],
"source": [
"embe = OpenAIEmbeddings()"
]
},
{
"cell_type": "markdown",
"id": "dd8caa76-bc41-429e-a93b-989ba13aff01",
"metadata": {},
"source": [
"_Keep reading to connect with Astra DB. For usage with Apache Cassandra and Astra DB through CQL, scroll to the section below._"
]
},
{
"cell_type": "markdown",
"id": "22866f09-e10d-4f05-a24b-b9420129462e",
"metadata": {},
"source": [
"## Astra DB"
]
},
{
"cell_type": "markdown",
"id": "5fba47cc-3533-42fc-84b7-9dc14cd68b2b",
"metadata": {},
"source": [
"DataStax [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) is a serverless vector-capable database built on Cassandra and made conveniently available through an easy-to-use JSON API."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b32730d-176e-414c-9d91-fd3644c54211",
"metadata": {},
"outputs": [],
"source": [
"from langchain.vectorstores import AstraDB"
]
},
{
"cell_type": "markdown",
"id": "68f61b01-3e09-47c1-9d67-5d6915c86626",
"metadata": {},
"source": [
"### Astra DB connection parameters\n",
"\n",
"- the API Endpoint looks like `https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com`\n",
"- the Token looks like `AstraCS:6gBhNmsk135....`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d78af8ed-cff9-4f14-aa5d-016f99ab547c",
"metadata": {},
"outputs": [],
"source": [
"ASTRA_DB_API_ENDPOINT = input(\"ASTRA_DB_API_ENDPOINT = \")\n",
"ASTRA_DB_TOKEN = getpass(\"ASTRA_DB_TOKEN = \")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b77553b-8bb5-4949-b87b-8c6abac56a26",
"metadata": {},
"outputs": [],
"source": [
"vstore = AstraDB(\n",
" embedding=embe,\n",
" collection_name=\"astra_vector_demo\",\n",
" api_endpoint=ASTRA_DB_API_ENDPOINT,\n",
" token=ASTRA_DB_TOKEN,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "9a348678-b2f6-46ca-9a0d-2eb4cc6b66b1",
"metadata": {},
"source": [
"### Load a dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a1f532f-ad63-4256-9730-a183841bd8e9",
"metadata": {},
"outputs": [],
"source": [
"philo_dataset = load_dataset(\"datastax/philosopher-quotes\")[\"train\"]\n",
"\n",
"docs = []\n",
"for entry in philo_dataset:\n",
" metadata = {\"author\": entry[\"author\"]}\n",
" doc = Document(page_content=entry[\"quote\"], metadata=metadata)\n",
" docs.append(doc)\n",
"\n",
"inserted_ids = vstore.add_documents(docs)\n",
"print(f\"\\nInserted {len(inserted_ids)} documents.\")"
]
},
{
"cell_type": "markdown",
"id": "084d8802-ab39-4262-9a87-42eafb746f92",
"metadata": {},
"source": [
"Add some more entries, this time with `add_texts`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6b157f5-eb31-4907-a78e-2e2b06893936",
"metadata": {},
"outputs": [],
"source": [
"texts = [\"I think, therefore I am.\", \"To the things themselves!\"]\n",
"metadatas = [{\"author\": \"descartes\"}, {\"author\": \"husserl\"}]\n",
"ids = [\"desc_01\", \"huss_xy\"]\n",
"\n",
"inserted_ids_2 = vstore.add_texts(texts=texts, metadatas=metadatas, ids=ids)\n",
"print(f\"\\nInserted {len(inserted_ids_2)} documents.\")"
]
},
{
"cell_type": "markdown",
"id": "c031760a-1fc5-4855-adf2-02ed52fe2181",
"metadata": {},
"source": [
"### Run simple searches"
]
},
{
"cell_type": "markdown",
"id": "02a77d8e-1aae-4054-8805-01c77947c49f",
"metadata": {},
"source": [
"This section demonstrates metadata filtering and getting the similarity scores back:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1761806a-1afd-4491-867c-25a80d92b9fe",
"metadata": {},
"outputs": [],
"source": [
"results = vstore.similarity_search(\"Our life is what we make of it\", k=3)\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eebc4f7c-f61a-438e-b3c8-17e6888d8a0b",
"metadata": {},
"outputs": [],
"source": [
"results_filtered = vstore.similarity_search(\n",
" \"Our life is what we make of it\",\n",
" k=3,\n",
" filter={\"author\": \"plato\"},\n",
")\n",
"for res in results_filtered:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11bbfe64-c0cd-40c6-866a-a5786538450e",
"metadata": {},
"outputs": [],
"source": [
"results = vstore.similarity_search_with_score(\"Our life is what we make of it\", k=3)\n",
"for res, score in results:\n",
" print(f\"* [SIM={score:3f}] {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "b14ea558-bfbe-41ce-807e-d70670060ada",
"metadata": {},
"source": [
"### MMR (Maximal-marginal-relevance) search"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76381ce8-780a-4e3b-97b1-056d6782d7d5",
"metadata": {},
"outputs": [],
"source": [
"results = vstore.max_marginal_relevance_search(\n",
" \"Our life is what we make of it\",\n",
" k=3,\n",
" filter={\"author\": \"aristotle\"},\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "1cc86edd-692b-4495-906c-ccfd13b03c23",
"metadata": {},
"source": [
"### Deleting stored documents"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "38a70ec4-b522-4d32-9ead-c642864fca37",
"metadata": {},
"outputs": [],
"source": [
"delete_1 = vstore.delete(inserted_ids[:3])\n",
"print(f\"all_succeed={delete_1}\") # True, all documents deleted"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4cf49ed-9d29-4ed9-bdab-51a308c41b8e",
"metadata": {},
"outputs": [],
"source": [
"delete_2 = vstore.delete(inserted_ids[2:5])\n",
"print(f\"some_succeeds={delete_2}\") # True, though some IDs were gone already"
]
},
{
"cell_type": "markdown",
"id": "847181ba-77d1-4a17-b7f9-9e2c3d8efd13",
"metadata": {},
"source": [
"### A minimal RAG chain"
]
},
{
"cell_type": "markdown",
"id": "cd64b844-846f-43c5-a7dd-c26b9ed417d0",
"metadata": {},
"source": [
"The next cells will implement a simple RAG pipeline:\n",
"- download a sample PDF file and load it onto the store;\n",
"- create a RAG chain with LCEL (LangChain Expression Language), with the vector store at its heart;\n",
"- run the question-answering chain."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5cbc4dba-0d5e-4038-8fc5-de6cadd1c2a9",
"metadata": {},
"outputs": [],
"source": [
"!curl -L \\\n",
" \"https://github.com/awesome-astra/datasets/blob/main/demo-resources/what-is-philosophy/what-is-philosophy.pdf?raw=true\" \\\n",
" -o \"what-is-philosophy.pdf\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "459385be-5e9c-47ff-ba53-2b7ae6166b09",
"metadata": {},
"outputs": [],
"source": [
"pdf_loader = PyPDFLoader(\"what-is-philosophy.pdf\")\n",
"splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)\n",
"docs_from_pdf = pdf_loader.load_and_split(text_splitter=splitter)\n",
"\n",
"print(f\"Documents from PDF: {len(docs_from_pdf)}.\")\n",
"inserted_ids_from_pdf = vstore.add_documents(docs_from_pdf)\n",
"print(f\"Inserted {len(inserted_ids_from_pdf)} documents.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5010a66c-4298-4e32-82b5-2da0d36a5c70",
"metadata": {},
"outputs": [],
"source": [
"retriever = vstore.as_retriever(search_kwargs={'k': 3})\n",
"\n",
"philo_template = \"\"\"\n",
"You are a philosopher that draws inspiration from great thinkers of the past\n",
"to craft well-thought answers to user questions. Use the provided context as the basis\n",
"for your answers and do not make up new reasoning paths - just mix-and-match what you are given.\n",
"Your answers must be concise and to the point, and refrain from answering about other topics than philosophy.\n",
"\n",
"CONTEXT:\n",
"{context}\n",
"\n",
"QUESTION: {question}\n",
"\n",
"YOUR ANSWER:\"\"\"\n",
"\n",
"philo_prompt = ChatPromptTemplate.from_template(philo_template)\n",
"\n",
"llm = ChatOpenAI()\n",
"\n",
"chain = (\n",
" {\"context\": retriever, \"question\": RunnablePassthrough()} \n",
" | philo_prompt \n",
" | llm \n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fcbc1296-6c7c-478b-b55b-533ba4e54ddb",
"metadata": {},
"outputs": [],
"source": [
"chain.invoke(\"How does Russel elaborate on Peirce's idea of the security blanket?\")"
]
},
{
"cell_type": "markdown",
"id": "869ab448-a029-4692-aefc-26b85513314d",
"metadata": {},
"source": [
"For more, check out a complete RAG template using Astra DB [here](https://github.com/langchain-ai/langchain/tree/master/templates/rag-astradb)."
]
},
{
"cell_type": "markdown",
"id": "177610c7-50d0-4b7b-8634-b03338054c8e",
"metadata": {},
"source": [
"### Cleanup"
]
},
{
"cell_type": "markdown",
"id": "0da4d19f-9878-4d3d-82c9-09cafca20322",
"metadata": {},
"source": [
"If you want to completely delete the collection from your Astra DB instance, run this.\n",
"\n",
"_(You will lose the data you stored in it.)_"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fd405a13-6f71-46fa-87e6-167238e9c25e",
"metadata": {},
"outputs": [],
"source": [
"vstore.delete_collection()"
]
},
{
"cell_type": "markdown",
"id": "94ebaab1-7cbf-4144-a147-7b0e32c43069",
"metadata": {},
"source": [
"## Apache Cassandra and Astra DB through CQL"
]
},
{
"cell_type": "markdown",
"id": "bc3931b4-211d-4f84-bcc0-51c127e3027c",
"metadata": {},
"source": [
"[Cassandra](https://cassandra.apache.org/) is a NoSQL, row-oriented, highly scalable and highly available database.Starting with version 5.0, the database ships with [vector search capabilities](https://cassandra.apache.org/doc/trunk/cassandra/vector-search/overview.html).\n",
"\n",
"DataStax [Astra DB through CQL](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html) is a managed serverless database built on Cassandra, offering the same interface and strengths."
]
},
{
"cell_type": "markdown",
"id": "a0055fbf-448d-4e46-9c40-28d43df25ca3",
"metadata": {},
"source": [
"#### What sets this case apart from \"Astra DB\" above?\n",
"\n",
"Thanks to LangChain having a standardized `VectorStore` interface, most of the \"Astra DB\" section above applies to this case as well. However, this time the database uses the CQL protocol, which means you'll use a _different_ class this time and instantiate it in another way.\n",
"\n",
"The cells below show how you should get your `vstore` object in this case and how you can clean up the database resources at the end: for the rest, i.e. the actual usage of the vector store, you will be able to run the very code that was shown above.\n",
"\n",
"In other words, running this demo in full with Cassandra or Astra DB through CQL means:\n",
"\n",
"- **initialization as shown below**\n",
"- \"Load a dataset\", _see above section_\n",
"- \"Run simple searches\", _see above section_\n",
"- \"MMR search\", _see above section_\n",
"- \"Deleting stored documents\", _see above section_\n",
"- \"A minimal RAG chain\", _see above section_\n",
"- **cleanup as shown below**"
]
},
{
"cell_type": "markdown",
"id": "23d12be2-745f-4e72-a82c-334a887bc7cd",
"metadata": {},
"source": [
"### Initialization"
]
},
{
"cell_type": "markdown",
"id": "e3212542-79be-423e-8e1f-b8d725e3cda8",
"metadata": {},
"source": [
"The class to use is the following:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "941af73e-a090-4fba-b23c-595757d470eb",
"metadata": {},
"outputs": [],
"source": [
"from langchain.vectorstores import Cassandra"
]
},
{
"cell_type": "markdown",
"id": "414d1e72-f7c9-4b6d-bf6f-16075712c7e3",
"metadata": {},
"source": [
"Now, depending on whether you connect to a Cassandra cluster or to Astra DB through CQL, you will provide different parameters when creating the vector store object."
]
},
{
"cell_type": "markdown",
"id": "48ecca56-71a4-4a91-b198-29384c44ce27",
"metadata": {},
"source": [
"#### Initialization (Cassandra cluster)"
]
},
{
"cell_type": "markdown",
"id": "55ebe958-5654-43e0-9aed-d607ffd3fa48",
"metadata": {},
"source": [
"In this case, you first need to create a `cassandra.cluster.Session` object, as described in the [Cassandra driver documentation](https://docs.datastax.com/en/developer/python-driver/latest/api/cassandra/cluster/#module-cassandra.cluster). The details vary (e.g. with network settings and authentication), but this might be something like:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4642dafb-a065-4063-b58c-3d276f5ad07e",
"metadata": {},
"outputs": [],
"source": [
"from cassandra.cluster import Cluster\n",
"\n",
"cluster = Cluster([\"127.0.0.1\"])\n",
"session = cluster.connect()"
]
},
{
"cell_type": "markdown",
"id": "624c93bf-fb46-4350-bcfa-09ca09dc068f",
"metadata": {},
"source": [
"You can now set the session, along with your desired keyspace name, as a global CassIO parameter:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "92a4ab28-1c4f-4dad-9671-d47e0b1dde7b",
"metadata": {},
"outputs": [],
"source": [
"import cassio\n",
"\n",
"CASSANDRA_KEYSPACE = input(\"CASSANDRA_KEYSPACE = \")\n",
"\n",
"cassio.init(session=session, keyspace=CASSANDRA_KEYSPACE)"
]
},
{
"cell_type": "markdown",
"id": "3b87a824-36f1-45b4-b54c-efec2a2de216",
"metadata": {},
"source": [
"Now you can create the vector store:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "853a2a88-a565-4e24-8789-d78c213954a6",
"metadata": {},
"outputs": [],
"source": [
"vstore = Cassandra(\n",
" embedding=embe,\n",
" table_name=\"cassandra_vector_demo\",\n",
" # session=None, keyspace=None # Uncomment on older versions of LangChain\n",
")"
]
},
{
"cell_type": "markdown",
"id": "768ddf7a-0c3e-4134-ad38-25ac53c3da7a",
"metadata": {},
"source": [
"#### Initialization (Astra DB through CQL)"
]
},
{
"cell_type": "markdown",
"id": "4ed4269a-b7e7-4503-9e66-5a11335c7681",
"metadata": {},
"source": [
"In this case you initialize CassIO with the following connection parameters:\n",
"\n",
"- the Database ID, e.g. `01234567-89ab-cdef-0123-456789abcdef`\n",
"- the Token, e.g. `AstraCS:6gBhNmsk135....` (it must be a \"Database Administrator\" token)\n",
"- Optionally a Keyspace name (if omitted, the default one for the database will be used)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5fa6bd74-d4b2-45c5-9757-96dddc6242fb",
"metadata": {},
"outputs": [],
"source": [
"ASTRA_DB_ID = input(\"ASTRA_DB_ID = \")\n",
"ASTRA_DB_TOKEN = getpass(\"ASTRA_DB_TOKEN = \")\n",
"\n",
"desired_keyspace = input(\"ASTRA_DB_KEYSPACE (optional, can be left empty) = \")\n",
"if desired_keyspace:\n",
" ASTRA_DB_KEYSPACE = desired_keyspace\n",
"else:\n",
" ASTRA_DB_KEYSPACE = None"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "add6e585-17ff-452e-8ef6-7e485ead0b06",
"metadata": {},
"outputs": [],
"source": [
"import cassio\n",
"\n",
"cassio.init(\n",
" database_id=ASTRA_DB_ID,\n",
" token=ASTRA_DB_TOKEN,\n",
" keyspace=ASTRA_DB_KEYSPACE,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b305823c-bc98-4f3d-aabb-d7eb663ea421",
"metadata": {},
"source": [
"Now you can create the vector store:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f45f3038-9d59-41cc-8b43-774c6aa80295",
"metadata": {},
"outputs": [],
"source": [
"vstore = Cassandra(\n",
" embedding=embe,\n",
" table_name=\"cassandra_vector_demo\",\n",
" # session=None, keyspace=None # Uncomment on older versions of LangChain\n",
")"
]
},
{
"cell_type": "markdown",
"id": "39284918-cf8a-49bb-a2d3-aef285bb2ffa",
"metadata": {},
"source": [
"### Usage of the vector store"
]
},
{
"cell_type": "markdown",
"id": "3cc1aead-d6ec-48a3-affe-1d0cffa955a9",
"metadata": {},
"source": [
"_See the sections \"Load a dataset\" through \"A minimal RAG chain\" above._\n",
"\n",
"Speaking of the latter, you can check out a full RAG template for Astra DB through CQL [here](https://github.com/langchain-ai/langchain/tree/master/templates/cassandra-entomology-rag)."
]
},
{
"cell_type": "markdown",
"id": "096397d8-6622-4685-9f9d-7e238beca467",
"metadata": {},
"source": [
"### Cleanup"
]
},
{
"cell_type": "markdown",
"id": "cc1e74f9-5500-41aa-836f-235b1ed5f20c",
"metadata": {},
"source": [
"the following essentially retrieves the `Session` object from CassIO and runs a CQL `DROP TABLE` statement with it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b5b82c33-0e77-4a37-852c-8d50edbdd991",
"metadata": {},
"outputs": [],
"source": [
"cassio.config.resolve_session().execute(\n",
" f\"DROP TABLE {cassio.config.resolve_keyspace()}.cassandra_vector_demo;\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "c10ece4d-ae06-42ab-baf4-4d0ac2051743",
"metadata": {},
"source": [
"### Learn more"
]
},
{
"cell_type": "markdown",
"id": "51ea8b69-7e15-458f-85aa-9fa199f95f9c",
"metadata": {},
"source": [
"For more information, extended quickstarts and additional usage examples, please visit the [CassIO documentation](https://cassio.org/frameworks/langchain/about/) for more on using the LangChain `Cassandra` vector store."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -1,326 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "683953b3",
"metadata": {},
"source": [
"# Cassandra\n",
"\n",
">[Apache Cassandra®](https://cassandra.apache.org) is a NoSQL, row-oriented, highly scalable and highly available database.\n",
"\n",
"Newest Cassandra releases natively [support](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor(ANN)+Vector+Search+via+Storage-Attached+Indexes) Vector Similarity Search.\n",
"\n",
"To run this notebook you need either a running Cassandra cluster equipped with Vector Search capabilities (in pre-release at the time of writing) or a DataStax Astra DB instance running in the cloud (you can get one for free at [datastax.com](https://astra.datastax.com)). Check [cassio.org](https://cassio.org/start_here/) for more information."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b4c41cad-08ef-4f72-a545-2151e4598efe",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install \"cassio>=0.1.0\""
]
},
{
"cell_type": "markdown",
"id": "b7e46bb0",
"metadata": {},
"source": [
"### Please provide database connection parameters and secrets:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36128a32",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import getpass\n",
"\n",
"database_mode = (input(\"\\n(C)assandra or (A)stra DB? \")).upper()\n",
"\n",
"keyspace_name = input(\"\\nKeyspace name? \")\n",
"\n",
"if database_mode == \"A\":\n",
" ASTRA_DB_APPLICATION_TOKEN = getpass.getpass('\\nAstra DB Token (\"AstraCS:...\") ')\n",
" #\n",
" ASTRA_DB_SECURE_BUNDLE_PATH = input(\"Full path to your Secure Connect Bundle? \")\n",
"elif database_mode == \"C\":\n",
" CASSANDRA_CONTACT_POINTS = input(\n",
" \"Contact points? (comma-separated, empty for localhost) \"\n",
" ).strip()"
]
},
{
"cell_type": "markdown",
"id": "4f22aac2",
"metadata": {},
"source": [
"#### depending on whether local or cloud-based Astra DB, create the corresponding database connection \"Session\" object"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "677f8576",
"metadata": {},
"outputs": [],
"source": [
"from cassandra.cluster import Cluster\n",
"from cassandra.auth import PlainTextAuthProvider\n",
"\n",
"if database_mode == \"C\":\n",
" if CASSANDRA_CONTACT_POINTS:\n",
" cluster = Cluster(\n",
" [cp.strip() for cp in CASSANDRA_CONTACT_POINTS.split(\",\") if cp.strip()]\n",
" )\n",
" else:\n",
" cluster = Cluster()\n",
" session = cluster.connect()\n",
"elif database_mode == \"A\":\n",
" ASTRA_DB_CLIENT_ID = \"token\"\n",
" cluster = Cluster(\n",
" cloud={\n",
" \"secure_connect_bundle\": ASTRA_DB_SECURE_BUNDLE_PATH,\n",
" },\n",
" auth_provider=PlainTextAuthProvider(\n",
" ASTRA_DB_CLIENT_ID,\n",
" ASTRA_DB_APPLICATION_TOKEN,\n",
" ),\n",
" )\n",
" session = cluster.connect()\n",
"else:\n",
" raise NotImplementedError"
]
},
{
"cell_type": "markdown",
"id": "320af802-9271-46ee-948f-d2453933d44b",
"metadata": {},
"source": [
"### Please provide OpenAI access key\n",
"\n",
"We want to use `OpenAIEmbeddings` so we have to get the OpenAI API Key."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ffea66e4-bc23-46a9-9580-b348dfe7b7a7",
"metadata": {},
"outputs": [],
"source": [
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
]
},
{
"cell_type": "markdown",
"id": "e98a139b",
"metadata": {},
"source": [
"### Creation and usage of the Vector Store"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aac9563e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.vectorstores import Cassandra\n",
"from langchain.document_loaders import TextLoader"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3c3999a",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"\n",
"SOURCE_FILE_NAME = \"../../modules/state_of_the_union.txt\"\n",
"\n",
"loader = TextLoader(SOURCE_FILE_NAME)\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embedding_function = OpenAIEmbeddings()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e104aee",
"metadata": {},
"outputs": [],
"source": [
"table_name = \"my_vector_db_table\"\n",
"\n",
"docsearch = Cassandra.from_documents(\n",
" documents=docs,\n",
" embedding=embedding_function,\n",
" session=session,\n",
" keyspace=keyspace_name,\n",
" table_name=table_name,\n",
")\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = docsearch.similarity_search(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f509ee02",
"metadata": {},
"outputs": [],
"source": [
"## if you already have an index, you can load it and use it like this:\n",
"\n",
"# docsearch_preexisting = Cassandra(\n",
"# embedding=embedding_function,\n",
"# session=session,\n",
"# keyspace=keyspace_name,\n",
"# table_name=table_name,\n",
"# )\n",
"\n",
"# docs = docsearch_preexisting.similarity_search(query, k=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c608226",
"metadata": {},
"outputs": [],
"source": [
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "d46d1452",
"metadata": {},
"source": [
"### Maximal Marginal Relevance Searches\n",
"\n",
"In addition to using similarity search in the retriever object, you can also use `mmr` as retriever.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a359ed74",
"metadata": {},
"outputs": [],
"source": [
"retriever = docsearch.as_retriever(search_type=\"mmr\")\n",
"matched_docs = retriever.get_relevant_documents(query)\n",
"for i, d in enumerate(matched_docs):\n",
" print(f\"\\n## Document {i}\\n\")\n",
" print(d.page_content)"
]
},
{
"cell_type": "markdown",
"id": "7c477287",
"metadata": {},
"source": [
"Or use `max_marginal_relevance_search` directly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9ca82740",
"metadata": {},
"outputs": [],
"source": [
"found_docs = docsearch.max_marginal_relevance_search(query, k=2, fetch_k=10)\n",
"for i, doc in enumerate(found_docs):\n",
" print(f\"{i + 1}.\", doc.page_content, \"\\n\")"
]
},
{
"cell_type": "markdown",
"id": "da791c5f",
"metadata": {},
"source": [
"### Metadata filtering\n",
"\n",
"You can specify filtering on metadata when running searches in the vector store. By default, when inserting documents, the only metadata is the `\"source\"` (but you can customize the metadata at insertion time).\n",
"\n",
"Since only one files was inserted, this is just a demonstration of how filters are passed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93f132fa",
"metadata": {},
"outputs": [],
"source": [
"filter = {\"source\": SOURCE_FILE_NAME}\n",
"filtered_docs = docsearch.similarity_search(query, filter=filter, k=5)\n",
"print(f\"{len(filtered_docs)} documents retrieved.\")\n",
"print(f\"{filtered_docs[0].page_content[:64]} ...\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b413ec4",
"metadata": {},
"outputs": [],
"source": [
"filter = {\"source\": \"nonexisting_file.txt\"}\n",
"filtered_docs2 = docsearch.similarity_search(query, filter=filter)\n",
"print(f\"{len(filtered_docs2)} documents retrieved.\")"
]
},
{
"cell_type": "markdown",
"id": "a0fea764",
"metadata": {},
"source": [
"Please visit the [cassIO documentation](https://cassio.org/frameworks/langchain/about/) for more on using vector stores with Langchain."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -58,9 +58,9 @@
"1. Do not use with a store that has been pre-populated with content independently of the indexing API, as the record manager will not know that records have been inserted previously.\n",
"2. Only works with LangChain `vectorstore`'s that support:\n",
" * document addition by id (`add_documents` method with `ids` argument)\n",
" * delete by id (`delete` method with)\n",
" * delete by id (`delete` method with `ids` argument)\n",
"\n",
"Compatible Vectorstores: `AnalyticDB`, `AwaDB`, `Bagel`, `Cassandra`, `Chroma`, `DashVector`, `DeepLake`, `Dingo`, `ElasticVectorSearch`, `ElasticsearchStore`, `FAISS`, `MyScale`, `PGVector`, `Pinecone`, `Qdrant`, `Redis`, `ScaNN`, `SupabaseVectorStore`, `TimescaleVector`, `Vald`, `Vearch`, `VespaStore`, `Weaviate`, `ZepVectorStore`.\n",
"Compatible Vectorstores: `AnalyticDB`, `AstraDB`, `AwaDB`, `Bagel`, `Cassandra`, `Chroma`, `DashVector`, `DeepLake`, `Dingo`, `ElasticVectorSearch`, `ElasticsearchStore`, `FAISS`, `MyScale`, `PGVector`, `Pinecone`, `Qdrant`, `Redis`, `ScaNN`, `SupabaseVectorStore`, `TimescaleVector`, `Vald`, `Vearch`, `VespaStore`, `Weaviate`, `ZepVectorStore`.\n",
" \n",
"## Caution\n",
"\n",


@@ -414,7 +414,15 @@
},
{
"source": "/docs/integrations/cassandra",
"destination": "/docs/integrations/providers/cassandra"
"destination": "/docs/integrations/providers/astradb"
},
{
"source": "/docs/integrations/providers/cassandra",
"destination": "/docs/integrations/providers/astradb"
},
{
"source": "/docs/integrations/vectorstores/cassandra",
"destination": "/docs/integrations/vectorstores/astradb"
},
{
"source": "/docs/integrations/cerebriumai",