{
"cells": [
{
"cell_type": "markdown",
"id": "fe1cf4b8-4fee-49d9-aad5-18adabaca692",
"metadata": {},
"source": [
"# SemaDB\n",
"\n",
"> [SemaDB](https://www.semafind.com/products/semadb) from [SemaFind](https://www.semafind.com) is a no-fuss vector similarity database for building AI applications. The hosted `SemaDB Cloud` offers a no-fuss developer experience for getting started.\n",
"\n",
"The full API documentation, along with examples and an interactive playground, is available on [RapidAPI](https://rapidapi.com/semafind-semadb/api/semadb).\n",
"\n",
"This notebook demonstrates usage of the `SemaDB Cloud` vector store."
]
},
{
"cell_type": "markdown",
"id": "aa8c1970-52f0-4834-8f06-3ca8f7fac857",
"metadata": {},
"source": [
"## Load document embeddings\n",
"\n",
"To compute embeddings locally, we use [Sentence Transformers](https://www.sbert.net/), a library commonly used for embedding sentences. You can substitute any embedding model LangChain offers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "386a6b49-edee-45f2-9c0e-ebc125507ece",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet sentence_transformers"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5bd07a44-34fd-4318-8033-4c8dbd327559",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.embeddings import HuggingFaceEmbeddings\n",
"\n",
"embeddings = HuggingFaceEmbeddings()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "b0079bdf-b3cd-4856-85d5-f7787f5d93d5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"114\n"
]
}
],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_text_splitters import CharacterTextSplitter\n",
"\n",
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"print(len(docs))"
]
},
{
"cell_type": "markdown",
"id": "92ed5523-330d-4697-9008-c910044ac45a",
"metadata": {},
"source": [
"## Connect to SemaDB\n",
"\n",
"SemaDB Cloud uses [RapidAPI keys](https://rapidapi.com/semafind-semadb/api/semadb) to authenticate. You can obtain yours by creating a free RapidAPI account."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c4ffeeef-e6f5-4bcc-8c97-0e4222ca8282",
"metadata": {},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
"SemaDB API Key: ········\n"
]
}
],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"SEMADB_API_KEY\"] = getpass.getpass(\"SemaDB API Key:\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ba5f7a81-0f59-448a-93a8-5d8bf3bfc0f9",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.vectorstores import SemaDB\n",
"from langchain_community.vectorstores.utils import DistanceStrategy"
]
},
{
"cell_type": "markdown",
"id": "320f743c-39ae-456c-8c20-0683196358a4",
"metadata": {},
"source": [
"The parameters of the SemaDB vector store mirror the API directly:\n",
"\n",
"- \"mycollection\": the name of the collection in which these vectors are stored.\n",
"- 768: the dimensionality of the vectors. In our case, the sentence transformer embeddings are 768-dimensional.\n",
"- API_KEY: your RapidAPI key. In this notebook it is supplied through the `SEMADB_API_KEY` environment variable set above rather than passed explicitly.\n",
"- embeddings: the embedding model used to generate embeddings for documents, texts and queries.\n",
"- DistanceStrategy: the distance metric used. The wrapper automatically normalises vectors if COSINE is used."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "c1cb1f78-c25e-41a7-8001-6c84d51514ea",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db = SemaDB(\"mycollection\", 768, embeddings, DistanceStrategy.COSINE)\n",
"\n",
"# Create collection if running for the first time. If the collection\n",
"# already exists this will fail.\n",
"db.create_collection()"
]
},
{
"cell_type": "markdown",
"id": "44348469-1d1f-4f3e-9af3-a955aec3dd71",
"metadata": {},
"source": [
"The SemaDB vector store wrapper stores the document text as point metadata so that it can be retrieved later. Storing large chunks of text is *not recommended*. If you are indexing a large collection, we recommend storing references to the documents instead, such as external IDs."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "9adca5d3-e534-4fd2-aace-f436de4630ed",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['813c7ef3-9797-466b-8afa-587115592c6c',\n",
" 'fc392f7f-082b-4932-bfcc-06800db5e017']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db.add_documents(docs)[:2]"
]
},
{
"cell_type": "markdown",
"id": "fb177b0d-148b-4cbc-86cc-b62dff135a9d",
"metadata": {},
"source": [
"## Similarity Search\n",
"\n",
"We use the default LangChain similarity search interface to find the most similar sentences."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7536aba2-a757-4a3f-beda-79cfee5c34cf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n"
]
}
],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = db.similarity_search(query)\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "a51e940e-487e-484d-9dc4-1aa1a6371660",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(Document(page_content='And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../modules/state_of_the_union.txt', 'text': 'And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.'}),\n",
" 0.42369342)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = db.similarity_search_with_score(query)\n",
"docs[0]"
]
},
{
"cell_type": "markdown",
"id": "79aec3f4-d4d8-4c51-b4b2-074b6c22c3c0",
"metadata": {},
"source": [
"## Clean up\n",
"\n",
"You can delete the collection to remove all data."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "b00afad5-8ec1-4c19-be6b-1c2ae2d5fead",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db.delete_collection()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "239a0bca-5c88-401f-9828-1cb0b652e7d0",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}