{ "cells": [ { "cell_type": "markdown", "id": "fe1cf4b8-4fee-49d9-aad5-18adabaca692", "metadata": {}, "source": [ "# SemaDB\n", "\n", "> [SemaDB](https://www.semafind.com/products/semadb) from [SemaFind](https://www.semafind.com) is a no fuss vector similarity database for building AI applications. The hosted `SemaDB Cloud` offers a no fuss developer experience to get started.\n", "\n", "The full documentation of the API along with examples and an interactive playground is available on [RapidAPI](https://rapidapi.com/semafind-semadb/api/semadb).\n", "\n", "This notebook demonstrates usage of the `SemaDB Cloud` vector store." ] }, { "cell_type": "markdown", "id": "aa8c1970-52f0-4834-8f06-3ca8f7fac857", "metadata": {}, "source": [ "## Load document embeddings\n", "\n", "To run things locally, we are using [Sentence Transformers](https://www.sbert.net/) which are commonly used for embedding sentences. You can use any embedding model LangChain offers." ] }, { "cell_type": "code", "execution_count": null, "id": "386a6b49-edee-45f2-9c0e-ebc125507ece", "metadata": {}, "outputs": [], "source": [ "%pip install --upgrade --quiet sentence_transformers" ] }, { "cell_type": "code", "execution_count": 2, "id": "5bd07a44-34fd-4318-8033-4c8dbd327559", "metadata": {}, "outputs": [], "source": [ "from langchain_community.embeddings import HuggingFaceEmbeddings\n", "\n", "embeddings = HuggingFaceEmbeddings()" ] }, { "cell_type": "code", "execution_count": 3, "id": "b0079bdf-b3cd-4856-85d5-f7787f5d93d5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "114\n" ] } ], "source": [ "from langchain_community.document_loaders import TextLoader\n", "from langchain_text_splitters import CharacterTextSplitter\n", "\n", "loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n", "documents = loader.load()\n", "text_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=0)\n", "docs = text_splitter.split_documents(documents)\n", "print(len(docs))" ] }, { "cell_type": "markdown", "id": "92ed5523-330d-4697-9008-c910044ac45a", "metadata": {}, "source": [ "## Connect to SemaDB\n", "\n", "SemaDB Cloud uses [RapidAPI keys](https://rapidapi.com/semafind-semadb/api/semadb) to authenticate. You can obtain yours by creating a free RapidAPI account." ] }, { "cell_type": "code", "execution_count": 4, "id": "c4ffeeef-e6f5-4bcc-8c97-0e4222ca8282", "metadata": {}, "outputs": [ { "name": "stdin", "output_type": "stream", "text": [ "SemaDB API Key: ········\n" ] } ], "source": [ "import getpass\n", "import os\n", "\n", "os.environ[\"SEMADB_API_KEY\"] = getpass.getpass(\"SemaDB API Key:\")" ] }, { "cell_type": "code", "execution_count": 5, "id": "ba5f7a81-0f59-448a-93a8-5d8bf3bfc0f9", "metadata": {}, "outputs": [], "source": [ "from langchain_community.vectorstores import SemaDB\n", "from langchain_community.vectorstores.utils import DistanceStrategy" ] }, { "cell_type": "markdown", "id": "320f743c-39ae-456c-8c20-0683196358a4", "metadata": {}, "source": [ "The parameters to the SemaDB vector store reflect the API directly:\n", "\n", "- \"mycollection\": is the collection name in which we will store these vectors.\n", "- 768: is dimensions of the vectors. In our case, the sentence transformer embeddings yield 768 dimensional vectors.\n", "- API_KEY: is your RapidAPI key.\n", "- embeddings: correspond to how the embeddings of documents, texts and queries will be generated.\n", "- DistanceStrategy: is the distance metric used. The wrapper automatically normalises vectors if COSINE is used." ] }, { "cell_type": "code", "execution_count": 6, "id": "c1cb1f78-c25e-41a7-8001-6c84d51514ea", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db = SemaDB(\"mycollection\", 768, embeddings, DistanceStrategy.COSINE)\n", "\n", "# Create collection if running for the first time. If the collection\n", "# already exists this will fail.\n", "db.create_collection()" ] }, { "cell_type": "markdown", "id": "44348469-1d1f-4f3e-9af3-a955aec3dd71", "metadata": {}, "source": [ "The SemaDB vector store wrapper adds the document text as point metadata to collect later. Storing large chunks of text is *not recommended*. If you are indexing a large collection, we instead recommend storing references to the documents such as external Ids." ] }, { "cell_type": "code", "execution_count": 7, "id": "9adca5d3-e534-4fd2-aace-f436de4630ed", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['813c7ef3-9797-466b-8afa-587115592c6c',\n", " 'fc392f7f-082b-4932-bfcc-06800db5e017']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db.add_documents(docs)[:2]" ] }, { "cell_type": "markdown", "id": "fb177b0d-148b-4cbc-86cc-b62dff135a9d", "metadata": {}, "source": [ "## Similarity Search\n", "\n", "We use the default LangChain similarity search interface to search for the most similar sentences." ] }, { "cell_type": "code", "execution_count": 8, "id": "7536aba2-a757-4a3f-beda-79cfee5c34cf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n" ] } ], "source": [ "query = \"What did the president say about Ketanji Brown Jackson\"\n", "docs = db.similarity_search(query)\n", "print(docs[0].page_content)" ] }, { "cell_type": "code", "execution_count": 9, "id": "a51e940e-487e-484d-9dc4-1aa1a6371660", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Document(page_content='And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../modules/state_of_the_union.txt', 'text': 'And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.'}),\n", " 0.42369342)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs = db.similarity_search_with_score(query)\n", "docs[0]" ] }, { "cell_type": "markdown", "id": "79aec3f4-d4d8-4c51-b4b2-074b6c22c3c0", "metadata": {}, "source": [ "## Clean up\n", "\n", "You can delete the collection to remove all data." ] }, { "cell_type": "code", "execution_count": 10, "id": "b00afad5-8ec1-4c19-be6b-1c2ae2d5fead", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db.delete_collection()" ] }, { "cell_type": "code", "execution_count": null, "id": "239a0bca-5c88-401f-9828-1cb0b652e7d0", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }