{ "cells": [ { "cell_type": "markdown", "id": "66d0270a-b74f-4110-901e-7960b00297af", "metadata": {}, "source": [ "# Astra DB\n", "\n", "This page provides a quickstart for using [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) as a Vector Store." ] }, { "cell_type": "markdown", "id": "ab8cd64f-3bb2-4f16-a0a9-12d7b1789bf6", "metadata": {}, "source": [ "> DataStax [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) is a serverless vector-capable database built on Apache Cassandra® and made conveniently available through an easy-to-use JSON API." ] }, { "cell_type": "markdown", "id": "d2d6ca14-fb7e-4172-9aa0-a3119a064b96", "metadata": {}, "source": [ "_Note: in addition to access to the database, an OpenAI API Key is required to run the full example._" ] }, { "cell_type": "markdown", "id": "bb9be7ce-8c70-4d46-9f11-71c42a36e928", "metadata": {}, "source": [ "## Setup and general dependencies" ] }, { "cell_type": "markdown", "id": "dbe7c156-0413-47e3-9237-4769c4248869", "metadata": {}, "source": [ "Use of the integration requires the corresponding Python package:" ] }, { "cell_type": "code", "execution_count": null, "id": "8d00fcf4-9798-4289-9214-d9734690adfc", "metadata": {}, "outputs": [], "source": [ "pip install --upgrade langchain-astradb" ] }, { "cell_type": "markdown", "id": "2453d83a-bc8f-41e1-a692-befe4dd90156", "metadata": {}, "source": [ "_**Note.** the following are all packages required to run the full demo on this page. 
Depending on your LangChain setup, some of them may need to be installed:_" ] }, { "cell_type": "code", "execution_count": null, "id": "56c1f86e-5921-4976-ac8f-1d62e5a512b0", "metadata": {}, "outputs": [], "source": [ "pip install langchain langchain-community langchain-openai datasets pypdf" ] }, { "cell_type": "markdown", "id": "c2910035-e61f-48d9-a110-d68c401b62aa", "metadata": {}, "source": [ "### Import dependencies" ] }, { "cell_type": "code", "execution_count": null, "id": "b06619af-fea2-4863-8149-7f239a8c9c82", "metadata": {}, "outputs": [], "source": [ "import os\n", "from getpass import getpass\n", "\n", "from datasets import load_dataset\n", "from langchain_community.document_loaders import PyPDFLoader\n", "from langchain_core.documents import Document\n", "from langchain_core.output_parsers import StrOutputParser\n", "from langchain_core.prompts import ChatPromptTemplate\n", "from langchain_core.runnables import RunnablePassthrough\n", "from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n", "from langchain_text_splitters import RecursiveCharacterTextSplitter" ] }, { "cell_type": "code", "execution_count": null, "id": "1983f1da-0ae7-4a9b-bf4c-4ade328f7a3a", "metadata": {}, "outputs": [], "source": [ "os.environ[\"OPENAI_API_KEY\"] = getpass(\"OPENAI_API_KEY = \")" ] }, { "cell_type": "code", "execution_count": null, "id": "c656df06-e938-4bc5-b570-440b8b7a0189", "metadata": {}, "outputs": [], "source": [ "embe = OpenAIEmbeddings()" ] }, { "cell_type": "markdown", "id": "22866f09-e10d-4f05-a24b-b9420129462e", "metadata": {}, "source": [ "## Import the Vector Store" ] }, { "cell_type": "code", "execution_count": null, "id": "0b32730d-176e-414c-9d91-fd3644c54211", "metadata": {}, "outputs": [], "source": [ "from langchain_astradb import AstraDBVectorStore" ] }, { "cell_type": "markdown", "id": "68f61b01-3e09-47c1-9d67-5d6915c86626", "metadata": {}, "source": [ "## Connection parameters\n", "\n", "These are found on your Astra DB dashboard:\n", "\n", "- 
the API Endpoint looks like `https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com`\n", "- the Token looks like `AstraCS:6gBhNmsk135....`\n", "- you may optionally provide a _Namespace_ (also known as a keyspace) such as `my_namespace`" ] }, { "cell_type": "code", "execution_count": null, "id": "d78af8ed-cff9-4f14-aa5d-016f99ab547c", "metadata": {}, "outputs": [], "source": [ "ASTRA_DB_API_ENDPOINT = input(\"ASTRA_DB_API_ENDPOINT = \")\n", "ASTRA_DB_APPLICATION_TOKEN = getpass(\"ASTRA_DB_APPLICATION_TOKEN = \")\n", "\n", "desired_namespace = input(\"(optional) Namespace = \")\n", "# Treat an empty answer as no namespace specified (None)\n", "ASTRA_DB_KEYSPACE = desired_namespace or None" ] }, { "cell_type": "markdown", "id": "196268bd-a950-41c3-bede-f5b55f6a0804", "metadata": {}, "source": [ "Now you can create the vector store:" ] }, { "cell_type": "code", "execution_count": null, "id": "8b77553b-8bb5-4949-b87b-8c6abac56a26", "metadata": {}, "outputs": [], "source": [ "vstore = AstraDBVectorStore(\n", "    embedding=embe,\n", "    collection_name=\"astra_vector_demo\",\n", "    api_endpoint=ASTRA_DB_API_ENDPOINT,\n", "    token=ASTRA_DB_APPLICATION_TOKEN,\n", "    namespace=ASTRA_DB_KEYSPACE,\n", ")" ] }, { "cell_type": "markdown", "id": "9a348678-b2f6-46ca-9a0d-2eb4cc6b66b1", "metadata": {}, "source": [ "## Load a dataset" ] }, { "cell_type": "markdown", "id": "552e56b0-301a-4b06-99c7-57ba6faa966f", "metadata": {}, "source": [ "Convert each entry in the source dataset into a `Document`, then write them into the vector store:" ] }, { "cell_type": "code", "execution_count": null, "id": "3a1f532f-ad63-4256-9730-a183841bd8e9", "metadata": {}, "outputs": [], "source": [ "philo_dataset = load_dataset(\"datastax/philosopher-quotes\")[\"train\"]\n", "\n", "docs = []\n", "for entry in philo_dataset:\n", "    metadata = {\"author\": entry[\"author\"]}\n", "    doc = Document(page_content=entry[\"quote\"], metadata=metadata)\n", "    docs.append(doc)\n", "\n", "inserted_ids = 
vstore.add_documents(docs)\n", "print(f\"\\nInserted {len(inserted_ids)} documents.\")" ] }, { "cell_type": "markdown", "id": "79d4f436-ef04-4288-8f79-97c9abb983ed", "metadata": {}, "source": [ "In the above, `metadata` dictionaries are created from the source data and are part of the `Document`.\n", "\n", "_Note: check the [Astra DB API Docs](https://docs.datastax.com/en/astra-serverless/docs/develop/dev-with-json.html#_json_api_limits) for the valid metadata field names: some characters are reserved and cannot be used._" ] }, { "cell_type": "markdown", "id": "084d8802-ab39-4262-9a87-42eafb746f92", "metadata": {}, "source": [ "Add some more entries, this time with `add_texts`:" ] }, { "cell_type": "code", "execution_count": null, "id": "b6b157f5-eb31-4907-a78e-2e2b06893936", "metadata": {}, "outputs": [], "source": [ "texts = [\"I think, therefore I am.\", \"To the things themselves!\"]\n", "metadatas = [{\"author\": \"descartes\"}, {\"author\": \"husserl\"}]\n", "ids = [\"desc_01\", \"huss_xy\"]\n", "\n", "inserted_ids_2 = vstore.add_texts(texts=texts, metadatas=metadatas, ids=ids)\n", "print(f\"\\nInserted {len(inserted_ids_2)} documents.\")" ] }, { "cell_type": "markdown", "id": "63840eb3-8b29-4017-bc2f-301bf5001f28", "metadata": {}, "source": [ "_Note: you may want to speed up the execution of `add_texts` and `add_documents` by increasing the concurrency level for_\n", "_these bulk operations - check out the `*_concurrency` parameters in the class constructor and the `add_texts` docstrings_\n", "_for more details. 
Depending on the network and the client machine specifications, your best-performing choice of parameters may vary._" ] }, { "cell_type": "markdown", "id": "c031760a-1fc5-4855-adf2-02ed52fe2181", "metadata": {}, "source": [ "## Run searches" ] }, { "cell_type": "markdown", "id": "02a77d8e-1aae-4054-8805-01c77947c49f", "metadata": {}, "source": [ "This section demonstrates plain similarity search, metadata filtering, and retrieving the similarity scores:" ] }, { "cell_type": "code", "execution_count": null, "id": "1761806a-1afd-4491-867c-25a80d92b9fe", "metadata": {}, "outputs": [], "source": [ "results = vstore.similarity_search(\"Our life is what we make of it\", k=3)\n", "for res in results:\n", "    print(f\"* {res.page_content} [{res.metadata}]\")" ] }, { "cell_type": "code", "execution_count": null, "id": "eebc4f7c-f61a-438e-b3c8-17e6888d8a0b", "metadata": {}, "outputs": [], "source": [ "results_filtered = vstore.similarity_search(\n", "    \"Our life is what we make of it\",\n", "    k=3,\n", "    filter={\"author\": \"plato\"},\n", ")\n", "for res in results_filtered:\n", "    print(f\"* {res.page_content} [{res.metadata}]\")" ] }, { "cell_type": "code", "execution_count": null, "id": "11bbfe64-c0cd-40c6-866a-a5786538450e", "metadata": {}, "outputs": [], "source": [ "results = vstore.similarity_search_with_score(\"Our life is what we make of it\", k=3)\n", "for res, score in results:\n", "    print(f\"* [SIM={score:.3f}] {res.page_content} [{res.metadata}]\")" ] }, { "cell_type": "markdown", "id": "b14ea558-bfbe-41ce-807e-d70670060ada", "metadata": {}, "source": [ "### MMR (maximal marginal relevance) search" ] }, { "cell_type": "code", "execution_count": null, "id": "76381ce8-780a-4e3b-97b1-056d6782d7d5", "metadata": {}, "outputs": [], "source": [ "results = vstore.max_marginal_relevance_search(\n", "    \"Our life is what we make of it\",\n", "    k=3,\n", "    filter={\"author\": \"aristotle\"},\n", ")\n", "for res in results:\n", "    print(f\"* {res.page_content} [{res.metadata}]\")" ] }, { "cell_type": 
"markdown", "id": "60fda5df-14e4-4fb0-bd17-65a393fab8a9", "metadata": {}, "source": [ "### Async\n", "\n", "Note that the Astra DB vector store supports all fully async methods (`asimilarity_search`, `afrom_texts`, `adelete` and so on) natively, i.e. without thread wrapping involved." ] }, { "cell_type": "markdown", "id": "1cc86edd-692b-4495-906c-ccfd13b03c23", "metadata": {}, "source": [ "## Deleting stored documents" ] }, { "cell_type": "code", "execution_count": null, "id": "38a70ec4-b522-4d32-9ead-c642864fca37", "metadata": {}, "outputs": [], "source": [ "delete_1 = vstore.delete(inserted_ids[:3])\n", "print(f\"all_succeed={delete_1}\") # True, all documents deleted" ] }, { "cell_type": "code", "execution_count": null, "id": "d4cf49ed-9d29-4ed9-bdab-51a308c41b8e", "metadata": {}, "outputs": [], "source": [ "delete_2 = vstore.delete(inserted_ids[2:5])\n", "print(f\"some_succeeds={delete_2}\") # True, though some IDs were gone already" ] }, { "cell_type": "markdown", "id": "847181ba-77d1-4a17-b7f9-9e2c3d8efd13", "metadata": {}, "source": [ "## A minimal RAG chain" ] }, { "cell_type": "markdown", "id": "cd64b844-846f-43c5-a7dd-c26b9ed417d0", "metadata": {}, "source": [ "The next cells will implement a simple RAG pipeline:\n", "- download a sample PDF file and load it onto the store;\n", "- create a RAG chain with LCEL (LangChain Expression Language), with the vector store at its heart;\n", "- run the question-answering chain." 
] }, { "cell_type": "code", "execution_count": null, "id": "5cbc4dba-0d5e-4038-8fc5-de6cadd1c2a9", "metadata": {}, "outputs": [], "source": [ "!curl -L \\\n", " \"https://github.com/awesome-astra/datasets/blob/main/demo-resources/what-is-philosophy/what-is-philosophy.pdf?raw=true\" \\\n", " -o \"what-is-philosophy.pdf\"" ] }, { "cell_type": "code", "execution_count": null, "id": "459385be-5e9c-47ff-ba53-2b7ae6166b09", "metadata": {}, "outputs": [], "source": [ "pdf_loader = PyPDFLoader(\"what-is-philosophy.pdf\")\n", "splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)\n", "docs_from_pdf = pdf_loader.load_and_split(text_splitter=splitter)\n", "\n", "print(f\"Documents from PDF: {len(docs_from_pdf)}.\")\n", "inserted_ids_from_pdf = vstore.add_documents(docs_from_pdf)\n", "print(f\"Inserted {len(inserted_ids_from_pdf)} documents.\")" ] }, { "cell_type": "code", "execution_count": null, "id": "5010a66c-4298-4e32-82b5-2da0d36a5c70", "metadata": {}, "outputs": [], "source": [ "retriever = vstore.as_retriever(search_kwargs={\"k\": 3})\n", "\n", "philo_template = \"\"\"\n", "You are a philosopher that draws inspiration from great thinkers of the past\n", "to craft well-thought answers to user questions. 
Use the provided context as the basis\n", "for your answers and do not make up new reasoning paths - just mix-and-match what you are given.\n", "Your answers must be concise and to the point, and refrain from answering about topics other than philosophy.\n", "\n", "CONTEXT:\n", "{context}\n", "\n", "QUESTION: {question}\n", "\n", "YOUR ANSWER:\"\"\"\n", "\n", "philo_prompt = ChatPromptTemplate.from_template(philo_template)\n", "\n", "llm = ChatOpenAI()\n", "\n", "# The retriever fills {context}; the question is passed through unchanged\n", "chain = (\n", "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n", "    | philo_prompt\n", "    | llm\n", "    | StrOutputParser()\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "fcbc1296-6c7c-478b-b55b-533ba4e54ddb", "metadata": {}, "outputs": [], "source": [ "chain.invoke(\"How does Russell elaborate on Peirce's idea of the security blanket?\")" ] }, { "cell_type": "markdown", "id": "869ab448-a029-4692-aefc-26b85513314d", "metadata": {}, "source": [ "For more, check out a complete RAG template using Astra DB [here](https://github.com/langchain-ai/langchain/tree/master/templates/rag-astradb)." 
] }, { "cell_type": "markdown", "id": "177610c7-50d0-4b7b-8634-b03338054c8e", "metadata": {}, "source": [ "## Cleanup" ] }, { "cell_type": "markdown", "id": "0da4d19f-9878-4d3d-82c9-09cafca20322", "metadata": {}, "source": [ "If you want to completely delete the collection from your Astra DB instance, run this.\n", "\n", "_(You will lose the data you stored in it.)_" ] }, { "cell_type": "code", "execution_count": null, "id": "fd405a13-6f71-46fa-87e6-167238e9c25e", "metadata": {}, "outputs": [], "source": [ "vstore.delete_collection()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 5 }