mirror of
https://github.com/hwchase17/langchain.git
synced 2026-04-03 19:04:23 +00:00
Use docusaurus versioning with a callout, merged master as well @hwchase17 @baskaryan --------- Signed-off-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Rahul Tripathi <rauhl.psit.ec@gmail.com> Co-authored-by: Leonid Ganeline <leo.gan.57@gmail.com> Co-authored-by: Leonid Kuligin <lkuligin@yandex.ru> Co-authored-by: Averi Kitsch <akitsch@google.com> Co-authored-by: Erick Friis <erick@langchain.dev> Co-authored-by: Nuno Campos <nuno@langchain.dev> Co-authored-by: Nuno Campos <nuno@boringbits.io> Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com> Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com> Co-authored-by: Martín Gotelli Ferenaz <martingotelliferenaz@gmail.com> Co-authored-by: Fayfox <admin@fayfox.com> Co-authored-by: Eugene Yurtsev <eugene@langchain.dev> Co-authored-by: Dawson Bauer <105886620+djbauer2@users.noreply.github.com> Co-authored-by: Ravindu Somawansa <ravindu.somawansa@gmail.com> Co-authored-by: Dhruv Chawla <43818888+Dominastorm@users.noreply.github.com> Co-authored-by: ccurme <chester.curme@gmail.com> Co-authored-by: Bagatur <baskaryan@gmail.com> Co-authored-by: WeichenXu <weichen.xu@databricks.com> Co-authored-by: Benito Geordie <89472452+benitoThree@users.noreply.github.com> Co-authored-by: kartikTAI <129414343+kartikTAI@users.noreply.github.com> Co-authored-by: Kartik Sarangmath <kartik@thirdai.com> Co-authored-by: Sevin F. Varoglu <sfvaroglu@octoml.ai> Co-authored-by: MacanPN <martin.triska@gmail.com> Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com> Co-authored-by: Hyeongchan Kim <kozistr@gmail.com> Co-authored-by: sdan <git@sdan.io> Co-authored-by: Guangdong Liu <liugddx@gmail.com> Co-authored-by: Rahul Triptahi <rahul.psit.ec@gmail.com> Co-authored-by: Rahul Tripathi <rauhl.psit.ec@gmail.com> Co-authored-by: pjb157 <84070455+pjb157@users.noreply.github.com> Co-authored-by: Eun Hye Kim <ehkim1440@gmail.com> Co-authored-by: kaijietti <43436010+kaijietti@users.noreply.github.com> Co-authored-by: Pengcheng Liu <pcliu.fd@gmail.com> Co-authored-by: Tomer Cagan <tomer@tomercagan.com> Co-authored-by: Christophe Bornet <cbornet@hotmail.com>
276 lines
8.7 KiB
Plaintext
276 lines
8.7 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "9787b308",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Rockset\n",
|
||
"\n",
|
||
">[Rockset](https://rockset.com/) is a real-time search and analytics database built for the cloud. Rockset uses a [Converged Index™](https://rockset.com/blog/converged-indexing-the-secret-sauce-behind-rocksets-fast-queries/) with an efficient store for vector embeddings to serve low latency, high concurrency search queries at scale. Rockset has full support for metadata filtering and handles real-time ingestion for constantly updating, streaming data.\n",
|
||
"\n",
|
||
"This notebook demonstrates how to use `Rockset` as a vector store in LangChain. Before getting started, make sure you have access to a `Rockset` account and an API key available. [Start your free trial today.](https://rockset.com/create/)\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b823d64a",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Setting Up Your Environment\n",
|
||
"\n",
|
||
"1. Leverage the `Rockset` console to create a [collection](https://rockset.com/docs/collections/) with the Write API as your source. In this walkthrough, we create a collection named `langchain_demo`. \n",
|
||
" \n",
|
||
" Configure the following [ingest transformation](https://rockset.com/docs/ingest-transformation/) to mark your embeddings field and take advantage of performance and storage optimizations:\n",
|
||
"\n",
|
||
"\n",
|
||
" (We used OpenAI `text-embedding-ada-002` for this examples, where #length_of_vector_embedding = 1536)"
|
||
]
|
||
},
|
||
{
|
||
"attachments": {},
|
||
"cell_type": "markdown",
|
||
"id": "382952d8-6bbd-447f-903c-4437ddbeec0c",
|
||
"metadata": {
|
||
"vscode": {
|
||
"languageId": "sql"
|
||
}
|
||
},
|
||
"source": [
|
||
"```\n",
|
||
"SELECT _input.* EXCEPT(_meta), \n",
|
||
"VECTOR_ENFORCE(_input.description_embedding, #length_of_vector_embedding, 'float') as description_embedding \n",
|
||
"FROM _input\n",
|
||
"```\n",
|
||
"\n",
|
||
"2. After creating your collection, use the console to retrieve an [API key](https://rockset.com/docs/iam/#users-api-keys-and-roles). For the purpose of this notebook, we assume you are using the `Oregon(us-west-2)` region.\n",
|
||
"\n",
|
||
"3. Install the [rockset-python-client](https://github.com/rockset/rockset-python-client) to enable LangChain to communicate directly with `Rockset`."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "00d16b83",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"%pip install --upgrade --quiet rockset"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "e79550eb",
|
||
"metadata": {},
|
||
"source": [
|
||
"## LangChain Tutorial\n",
|
||
"\n",
|
||
"Follow along in your own Python notebook to generate and store vector embeddings in Rockset.\n",
|
||
"Start using Rockset to search for documents similar to your search queries.\n",
|
||
"\n",
|
||
"### 1. Define Key Variables"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "29505c1e",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"import os\n",
|
||
"\n",
|
||
"import rockset\n",
|
||
"\n",
|
||
"ROCKSET_API_KEY = os.environ.get(\n",
|
||
" \"ROCKSET_API_KEY\"\n",
|
||
") # Verify ROCKSET_API_KEY environment variable\n",
|
||
"ROCKSET_API_SERVER = rockset.Regions.usw2a1 # Verify Rockset region\n",
|
||
"rockset_client = rockset.RocksetClient(ROCKSET_API_SERVER, ROCKSET_API_KEY)\n",
|
||
"\n",
|
||
"COLLECTION_NAME = \"langchain_demo\"\n",
|
||
"TEXT_KEY = \"description\"\n",
|
||
"EMBEDDING_KEY = \"description_embedding\""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "07625be2",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 2. Prepare Documents"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "9740d8c4",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"from langchain_community.document_loaders import TextLoader\n",
|
||
"from langchain_community.vectorstores import Rockset\n",
|
||
"from langchain_openai import OpenAIEmbeddings\n",
|
||
"from langchain_text_splitters import CharacterTextSplitter\n",
|
||
"\n",
|
||
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
|
||
"documents = loader.load()\n",
|
||
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
|
||
"docs = text_splitter.split_documents(documents)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "a068be18",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 3. Insert Documents"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "85b6a6c5",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"embeddings = OpenAIEmbeddings() # Verify OPENAI_API_KEY environment variable\n",
|
||
"\n",
|
||
"docsearch = Rockset(\n",
|
||
" client=rockset_client,\n",
|
||
" embeddings=embeddings,\n",
|
||
" collection_name=COLLECTION_NAME,\n",
|
||
" text_key=TEXT_KEY,\n",
|
||
" embedding_key=EMBEDDING_KEY,\n",
|
||
")\n",
|
||
"\n",
|
||
"ids = docsearch.add_texts(\n",
|
||
" texts=[d.page_content for d in docs],\n",
|
||
" metadatas=[d.metadata for d in docs],\n",
|
||
")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "56eef48d",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 4. Search for Similar Documents"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "0bbf3df0",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||
"output = docsearch.similarity_search_with_relevance_scores(\n",
|
||
" query, 4, Rockset.DistanceFunction.COSINE_SIM\n",
|
||
")\n",
|
||
"print(\"output length:\", len(output))\n",
|
||
"for d, dist in output:\n",
|
||
" print(dist, d.metadata, d.page_content[:20] + \"...\")\n",
|
||
"\n",
|
||
"##\n",
|
||
"# output length: 4\n",
|
||
"# 0.764990692109871 {'source': '../../../state_of_the_union.txt'} Madam Speaker, Madam...\n",
|
||
"# 0.7485416901622112 {'source': '../../../state_of_the_union.txt'} And I’m taking robus...\n",
|
||
"# 0.7468678973398306 {'source': '../../../state_of_the_union.txt'} And so many families...\n",
|
||
"# 0.7436231261419488 {'source': '../../../state_of_the_union.txt'} Groups of citizens b..."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "7037a22f",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 5. Search for Similar Documents with Filtering"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "b64a290f",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"output = docsearch.similarity_search_with_relevance_scores(\n",
|
||
" query,\n",
|
||
" 4,\n",
|
||
" Rockset.DistanceFunction.COSINE_SIM,\n",
|
||
" where_str=\"{} NOT LIKE '%citizens%'\".format(TEXT_KEY),\n",
|
||
")\n",
|
||
"print(\"output length:\", len(output))\n",
|
||
"for d, dist in output:\n",
|
||
" print(dist, d.metadata, d.page_content[:20] + \"...\")\n",
|
||
"\n",
|
||
"##\n",
|
||
"# output length: 4\n",
|
||
"# 0.7651359650263554 {'source': '../../../state_of_the_union.txt'} Madam Speaker, Madam...\n",
|
||
"# 0.7486265516824893 {'source': '../../../state_of_the_union.txt'} And I’m taking robus...\n",
|
||
"# 0.7469625542348115 {'source': '../../../state_of_the_union.txt'} And so many families...\n",
|
||
"# 0.7344177777547739 {'source': '../../../state_of_the_union.txt'} We see the unity amo..."
|
||
]
|
||
},
|
||
{
|
||
"attachments": {},
|
||
"cell_type": "markdown",
|
||
"id": "13a52b38",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 6. [Optional] Delete Inserted Documents\n",
|
||
"\n",
|
||
"You must have the unique ID associated with each document to delete them from your collection.\n",
|
||
"Define IDs when inserting documents with `Rockset.add_texts()`. Rockset will otherwise generate a unique ID for each document. Regardless, `Rockset.add_texts()` returns the IDs of inserted documents.\n",
|
||
"\n",
|
||
"To delete these docs, simply use the `Rockset.delete_texts()` function."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "1f755924",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"docsearch.delete_texts(ids)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "d468f431",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Summary\n",
|
||
"\n",
|
||
"In this tutorial, we successfully created a `Rockset` collection, `inserted` documents with OpenAI embeddings, and searched for similar documents with and without metadata filters.\n",
|
||
"\n",
|
||
"Keep an eye on https://rockset.com/ for future updates in this space."
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.9.1"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|