mirror of
https://github.com/hwchase17/langchain.git
synced 2025-05-11 01:56:12 +00:00
**Issue:** Added support for creating indexes in the SAP HANA Vector engine.

**Changes:**
1. Introduced a new function `create_hnsw_index` in `hanavector.py` that enables the creation of indexes for SAP HANA Vector.
2. Added integration tests for the index creation function to ensure functionality.
3. Updated the documentation to reflect the new index creation feature, including examples and output from the notebook.
4. Fixed the operator issue in the `_process_filter_object` function and changed the array argument to a placeholder in the similarity search SQL statement.

---------
Co-authored-by: Erick Friis <erick@langchain.dev>
1353 lines
42 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# SAP HANA Cloud Vector Engine\n",
|
||
"\n",
|
||
">[SAP HANA Cloud Vector Engine](https://www.sap.com/events/teched/news-guide/ai.html#article8) is a vector store fully integrated into the `SAP HANA Cloud` database.\n",
|
||
"\n",
"You'll need to install `langchain-community` with `pip install -qU langchain-community` to use this integration."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Setting up\n",
|
||
"\n",
"Install the HANA database driver."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {
|
||
"tags": []
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Pip install necessary package\n",
|
||
"%pip install --upgrade --quiet hdbcli"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"For `OpenAIEmbeddings` we use the OpenAI API key from the environment."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2023-09-09T08:02:16.802456Z",
|
||
"start_time": "2023-09-09T08:02:07.065604Z"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"import os\n",
|
||
"# Use OPENAI_API_KEY env variable\n",
|
||
"# os.environ[\"OPENAI_API_KEY\"] = \"Your OpenAI API key\""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Create a database connection to a HANA Cloud instance."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2023-09-09T08:02:28.174088Z",
|
||
"start_time": "2023-09-09T08:02:28.162698Z"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"from dotenv import load_dotenv\n",
|
||
"from hdbcli import dbapi\n",
|
||
"\n",
|
||
"load_dotenv()\n",
|
||
"# Use connection settings from the environment\n",
|
||
"connection = dbapi.connect(\n",
|
||
" address=os.environ.get(\"HANA_DB_ADDRESS\"),\n",
|
||
" port=os.environ.get(\"HANA_DB_PORT\"),\n",
|
||
" user=os.environ.get(\"HANA_DB_USER\"),\n",
|
||
" password=os.environ.get(\"HANA_DB_PASSWORD\"),\n",
|
||
" autocommit=True,\n",
|
||
" sslValidateCertificate=False,\n",
|
||
")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Example"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Load the sample document \"state_of_the_union.txt\" and create chunks from it."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2023-09-09T08:02:25.452472Z",
|
||
"start_time": "2023-09-09T08:02:25.441563Z"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Number of document chunks: 88\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"from langchain_community.document_loaders import TextLoader\n",
|
||
"from langchain_community.vectorstores.hanavector import HanaDB\n",
|
||
"from langchain_core.documents import Document\n",
|
||
"from langchain_openai import OpenAIEmbeddings\n",
|
||
"from langchain_text_splitters import CharacterTextSplitter\n",
|
||
"\n",
|
||
"text_documents = TextLoader(\"../../how_to/state_of_the_union.txt\").load()\n",
|
||
"text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
|
||
"text_chunks = text_splitter.split_documents(text_documents)\n",
|
||
"print(f\"Number of document chunks: {len(text_chunks)}\")\n",
|
||
"\n",
|
||
"embeddings = OpenAIEmbeddings()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Create a LangChain VectorStore interface for the HANA database and specify the table (collection) to use for accessing the vector embeddings."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2023-09-09T08:04:16.696625Z",
|
||
"start_time": "2023-09-09T08:02:31.817790Z"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"db = HanaDB(\n",
|
||
" embedding=embeddings, connection=connection, table_name=\"STATE_OF_THE_UNION\"\n",
|
||
")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Add the loaded document chunks to the table. For this example, we delete any previous content from the table which might exist from previous runs."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"[]"
|
||
]
|
||
},
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Delete already existing documents from the table\n",
|
||
"db.delete(filter={})\n",
|
||
"\n",
|
||
"# add the loaded document chunks\n",
|
||
"db.add_documents(text_chunks)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Perform a query to get the two best-matching document chunks from the ones that were added in the previous step.\n",
|
||
"By default \"Cosine Similarity\" is used for the search."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"--------------------------------------------------------------------------------\n",
|
||
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
|
||
"\n",
|
||
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n",
|
||
"--------------------------------------------------------------------------------\n",
|
||
"As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. \n",
|
||
"\n",
|
||
"While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||
"docs = db.similarity_search(query, k=2)\n",
|
||
"\n",
|
||
"for doc in docs:\n",
|
||
" print(\"-\" * 80)\n",
|
||
" print(doc.page_content)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Query the same content with \"Euclidean Distance\". The results should be the same as with \"Cosine Similarity\"."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"--------------------------------------------------------------------------------\n",
|
||
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
|
||
"\n",
|
||
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n",
|
||
"--------------------------------------------------------------------------------\n",
|
||
"As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. \n",
|
||
"\n",
|
||
"While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"from langchain_community.vectorstores.utils import DistanceStrategy\n",
|
||
"\n",
|
||
"db = HanaDB(\n",
|
||
" embedding=embeddings,\n",
|
||
" connection=connection,\n",
|
||
" distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,\n",
|
||
" table_name=\"STATE_OF_THE_UNION\",\n",
|
||
")\n",
|
||
"\n",
|
||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||
"docs = db.similarity_search(query, k=2)\n",
|
||
"for doc in docs:\n",
|
||
" print(\"-\" * 80)\n",
|
||
" print(doc.page_content)"
|
||
]
|
||
},
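{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two strategies agree because, for unit-length vectors, the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so sorting by ascending distance and sorting by descending similarity give the same order. The cell below is a minimal sketch of that identity on random toy vectors. It assumes NumPy is installed and that the embeddings are normalized, which is worth verifying for your embedding model. The names `toy_vectors`, `toy_q`, `cosine_order`, and `l2_order` are illustrative and not part of `HanaDB`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Toy stand-ins for embeddings: 5 random vectors scaled to unit length\n",
"rng = np.random.default_rng(0)\n",
"toy_vectors = rng.normal(size=(5, 8))\n",
"toy_vectors /= np.linalg.norm(toy_vectors, axis=1, keepdims=True)\n",
"\n",
"toy_q = toy_vectors[0]  # plays the role of the query embedding\n",
"others = toy_vectors[1:]  # plays the role of the stored chunks\n",
"\n",
"# Rank by descending cosine similarity and by ascending Euclidean distance\n",
"cosine_order = np.argsort(-(others @ toy_q))\n",
"l2_order = np.argsort(np.linalg.norm(others - toy_q, axis=1))\n",
"\n",
"# For unit-length vectors both rankings are expected to coincide\n",
"print(cosine_order.tolist() == l2_order.tolist())"
]
},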
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"collapsed": false,
|
||
"jupyter": {
|
||
"outputs_hidden": false
|
||
}
|
||
},
|
||
"source": [
|
||
"## Maximal Marginal Relevance Search (MMR)\n",
|
||
"\n",
"Maximal marginal relevance (MMR) optimizes for similarity to the query and for diversity among the selected documents. First, the 20 (`fetch_k`) most similar items are retrieved from the database. The MMR algorithm then selects the best 2 (`k`) matches from these candidates."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 15,
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2023-09-09T08:05:23.276819Z",
|
||
"start_time": "2023-09-09T08:05:21.972256Z"
|
||
},
|
||
"collapsed": false,
|
||
"jupyter": {
|
||
"outputs_hidden": false
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"--------------------------------------------------------------------------------\n",
|
||
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
|
||
"\n",
|
||
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n",
|
||
"--------------------------------------------------------------------------------\n",
|
||
"Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. \n",
|
||
"\n",
|
||
"In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. \n",
|
||
"\n",
|
||
"Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world.\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"docs = db.max_marginal_relevance_search(query, k=2, fetch_k=20)\n",
|
||
"for doc in docs:\n",
|
||
" print(\"-\" * 80)\n",
|
||
" print(doc.page_content)"
|
||
]
|
||
},
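{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the two stages described above concrete, the cell below sketches the MMR re-ranking step in plain NumPy. It is an illustration under stated assumptions, not the code `HanaDB` executes: `mmr_select` is a hypothetical helper, the toy vectors stand in for the `fetch_k` candidates already fetched from the database, and `lambda_mult=0.5` is an assumed weight that trades relevance against diversity. With two identical candidates and one different candidate, the second pick skips the duplicate."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.5):\n",
"    # Hypothetical helper: returns the indices of the k candidates chosen by MMR\n",
"    # Normalize so that dot products are cosine similarities\n",
"    q = query_vec / np.linalg.norm(query_vec)\n",
"    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)\n",
"    relevance = c @ q\n",
"    # Start with the candidate that is most similar to the query\n",
"    selected = [int(np.argmax(relevance))]\n",
"    while len(selected) < min(k, len(c)):\n",
"        # Highest similarity of each candidate to anything already selected\n",
"        redundancy = (c @ c[selected].T).max(axis=1)\n",
"        score = lambda_mult * relevance - (1 - lambda_mult) * redundancy\n",
"        score[selected] = -np.inf  # never pick the same candidate twice\n",
"        selected.append(int(np.argmax(score)))\n",
"    return selected\n",
"\n",
"\n",
"# Toy data: candidates 0 and 1 are identical, candidate 2 points elsewhere\n",
"toy_query = np.array([0.8, 0.6])\n",
"toy_candidates = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])\n",
"print(mmr_select(toy_query, toy_candidates, k=2))"
]
},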
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Creating an HNSW Vector Index\n",
|
||
"\n",
|
||
"A vector index can significantly speed up top-k nearest neighbor queries for vectors. Users can create a Hierarchical Navigable Small World (HNSW) vector index using the `create_hnsw_index` function.\n",
|
||
"\n",
|
||
"For more information about creating an index at the database level, please refer to the [official documentation](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-vector-engine-guide/create-vector-index-statement-data-definition).\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 18,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"--------------------------------------------------------------------------------\n",
|
||
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
|
||
"\n",
|
||
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n",
|
||
"--------------------------------------------------------------------------------\n",
|
||
"Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. \n",
|
||
"\n",
|
||
"In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. \n",
|
||
"\n",
|
||
"Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world.\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# HanaDB instance uses cosine similarity as default:\n",
|
||
"db_cosine = HanaDB(\n",
|
||
" embedding=embeddings, connection=connection, table_name=\"STATE_OF_THE_UNION\"\n",
|
||
")\n",
|
||
"\n",
|
||
"# Attempting to create the HNSW index with default parameters\n",
|
||
"db_cosine.create_hnsw_index() # If no other parameters are specified, the default values will be used\n",
|
||
"# Default values: m=64, ef_construction=128, ef_search=200\n",
|
||
"# The default index name will be: STATE_OF_THE_UNION_COSINE_SIMILARITY_IDX (verify this naming pattern in HanaDB class)\n",
|
||
"\n",
|
||
"\n",
|
||
"# Creating a HanaDB instance with L2 distance as the similarity function and defined values\n",
|
||
"db_l2 = HanaDB(\n",
|
||
" embedding=embeddings,\n",
|
||
" connection=connection,\n",
|
||
" table_name=\"STATE_OF_THE_UNION\",\n",
|
||
" distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE, # Specify L2 distance\n",
|
||
")\n",
|
||
"\n",
|
||
"# This will create an index based on L2 distance strategy.\n",
|
||
"db_l2.create_hnsw_index(\n",
|
||
" index_name=\"STATE_OF_THE_UNION_L2_index\",\n",
|
||
" m=100, # Max number of neighbors per graph node (valid range: 4 to 1000)\n",
|
||
" ef_construction=200, # Max number of candidates during graph construction (valid range: 1 to 100000)\n",
|
||
" ef_search=500, # Min number of candidates during the search (valid range: 1 to 100000)\n",
|
||
")\n",
|
||
"\n",
|
||
"# Use L2 index to perform MMR\n",
|
||
"docs = db_l2.max_marginal_relevance_search(query, k=2, fetch_k=20)\n",
|
||
"for doc in docs:\n",
|
||
" print(\"-\" * 80)\n",
|
||
" print(doc.page_content)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"\n",
|
||
"**Key Points**:\n",
|
||
"- **Similarity Function**: The similarity function for the index is **cosine similarity** by default. If you want to use a different similarity function (e.g., `L2` distance), you need to specify it when initializing the `HanaDB` instance.\n",
|
||
"- **Default Parameters**: In the `create_hnsw_index` function, if the user does not provide custom values for parameters like `m`, `ef_construction`, or `ef_search`, the default values (e.g., `m=64`, `ef_construction=128`, `ef_search=200`) will be used automatically. These values ensure the index is created with reasonable performance without requiring user intervention.\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Basic Vectorstore Operations"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 19,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"True"
|
||
]
|
||
},
|
||
"execution_count": 19,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"db = HanaDB(\n",
|
||
" connection=connection, embedding=embeddings, table_name=\"LANGCHAIN_DEMO_BASIC\"\n",
|
||
")\n",
|
||
"\n",
|
||
"# Delete already existing documents from the table\n",
|
||
"db.delete(filter={})"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"We can add simple text documents to the existing table."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 20,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"[]"
|
||
]
|
||
},
|
||
"execution_count": 20,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"docs = [Document(page_content=\"Some text\"), Document(page_content=\"Other docs\")]\n",
|
||
"db.add_documents(docs)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Add documents with metadata."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 21,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"[]"
|
||
]
|
||
},
|
||
"execution_count": 21,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"docs = [\n",
|
||
" Document(\n",
|
||
" page_content=\"foo\",\n",
|
||
" metadata={\"start\": 100, \"end\": 150, \"doc_name\": \"foo.txt\", \"quality\": \"bad\"},\n",
|
||
" ),\n",
|
||
" Document(\n",
|
||
" page_content=\"bar\",\n",
|
||
" metadata={\"start\": 200, \"end\": 250, \"doc_name\": \"bar.txt\", \"quality\": \"good\"},\n",
|
||
" ),\n",
|
||
"]\n",
|
||
"db.add_documents(docs)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Query documents with specific metadata."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 22,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"--------------------------------------------------------------------------------\n",
|
||
"foo\n",
|
||
"{'start': 100, 'end': 150, 'doc_name': 'foo.txt', 'quality': 'bad'}\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"docs = db.similarity_search(\"foobar\", k=2, filter={\"quality\": \"bad\"})\n",
|
||
"# With filtering on \"quality\"==\"bad\", only one document should be returned\n",
|
||
"for doc in docs:\n",
|
||
" print(\"-\" * 80)\n",
|
||
" print(doc.page_content)\n",
|
||
" print(doc.metadata)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Delete documents with specific metadata."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 23,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"0\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"db.delete(filter={\"quality\": \"bad\"})\n",
|
||
"\n",
|
||
"# Now the similarity search with the same filter will return no results\n",
|
||
"docs = db.similarity_search(\"foobar\", k=2, filter={\"quality\": \"bad\"})\n",
|
||
"print(len(docs))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Advanced filtering\n",
|
||
"In addition to the basic value-based filtering capabilities, it is possible to use more advanced filtering.\n",
|
||
"The table below shows the available filter operators.\n",
|
||
"\n",
"| Operator | Semantics |\n",
"|----------|-----------|\n",
"| `$eq` | Equality (==) |\n",
"| `$ne` | Inequality (!=) |\n",
"| `$lt` | Less than (<) |\n",
"| `$lte` | Less than or equal (<=) |\n",
"| `$gt` | Greater than (>) |\n",
"| `$gte` | Greater than or equal (>=) |\n",
"| `$in` | Contained in a set of given values (in) |\n",
"| `$nin` | Not contained in a set of given values (not in) |\n",
"| `$between` | Between the range of two boundary values |\n",
"| `$like` | Text matching based on the \"LIKE\" semantics in SQL (using \"%\" as wildcard) |\n",
"| `$and` | Logical \"and\", supporting 2 or more operands |\n",
"| `$or` | Logical \"or\", supporting 2 or more operands |"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 24,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"# Prepare some test documents\n",
"docs = [\n",
"    Document(\n",
"        page_content=\"First\",\n",
"        metadata={\"name\": \"adam\", \"is_active\": True, \"id\": 1, \"height\": 10.0},\n",
"    ),\n",
"    Document(\n",
"        page_content=\"Second\",\n",
"        metadata={\"name\": \"bob\", \"is_active\": False, \"id\": 2, \"height\": 5.7},\n",
"    ),\n",
"    Document(\n",
"        page_content=\"Third\",\n",
"        metadata={\"name\": \"jane\", \"is_active\": True, \"id\": 3, \"height\": 2.4},\n",
"    ),\n",
"]\n",
"\n",
"db = HanaDB(\n",
"    connection=connection,\n",
"    embedding=embeddings,\n",
"    table_name=\"LANGCHAIN_DEMO_ADVANCED_FILTER\",\n",
")\n",
"\n",
"# Delete already existing documents from the table\n",
"db.delete(filter={})\n",
"db.add_documents(docs)\n",
"\n",
"\n",
"# Helper function for printing filter results\n",
"def print_filter_result(result):\n",
"    if len(result) == 0:\n",
"        print(\"<empty result>\")\n",
"    for doc in result:\n",
"        print(doc.metadata)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Filtering with `$ne`, `$gt`, `$gte`, `$lt`, `$lte`"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 25,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Filter: {'id': {'$ne': 1}}\n",
|
||
"{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}\n",
|
||
"{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}\n",
|
||
"Filter: {'id': {'$gt': 1}}\n",
|
||
"{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}\n",
|
||
"{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}\n",
|
||
"Filter: {'id': {'$gte': 1}}\n",
|
||
"{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}\n",
|
||
"{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}\n",
|
||
"{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}\n",
|
||
"Filter: {'id': {'$lt': 1}}\n",
|
||
"<empty result>\n",
|
||
"Filter: {'id': {'$lte': 1}}\n",
|
||
"{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"advanced_filter = {\"id\": {\"$ne\": 1}}\n",
|
||
"print(f\"Filter: {advanced_filter}\")\n",
|
||
"print_filter_result(db.similarity_search(\"just testing\", k=5, filter=advanced_filter))\n",
|
||
"\n",
|
||
"advanced_filter = {\"id\": {\"$gt\": 1}}\n",
|
||
"print(f\"Filter: {advanced_filter}\")\n",
|
||
"print_filter_result(db.similarity_search(\"just testing\", k=5, filter=advanced_filter))\n",
|
||
"\n",
|
||
"advanced_filter = {\"id\": {\"$gte\": 1}}\n",
|
||
"print(f\"Filter: {advanced_filter}\")\n",
|
||
"print_filter_result(db.similarity_search(\"just testing\", k=5, filter=advanced_filter))\n",
|
||
"\n",
|
||
"advanced_filter = {\"id\": {\"$lt\": 1}}\n",
|
||
"print(f\"Filter: {advanced_filter}\")\n",
|
||
"print_filter_result(db.similarity_search(\"just testing\", k=5, filter=advanced_filter))\n",
|
||
"\n",
|
||
"advanced_filter = {\"id\": {\"$lte\": 1}}\n",
|
||
"print(f\"Filter: {advanced_filter}\")\n",
|
||
"print_filter_result(db.similarity_search(\"just testing\", k=5, filter=advanced_filter))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Filtering with `$between`, `$in`, `$nin`"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 26,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Filter: {'id': {'$between': (1, 2)}}\n",
|
||
"{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}\n",
|
||
"{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}\n",
|
||
"Filter: {'name': {'$in': ['adam', 'bob']}}\n",
|
||
"{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}\n",
|
||
"{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}\n",
|
||
"Filter: {'name': {'$nin': ['adam', 'bob']}}\n",
|
||
"{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"advanced_filter = {\"id\": {\"$between\": (1, 2)}}\n",
|
||
"print(f\"Filter: {advanced_filter}\")\n",
|
||
"print_filter_result(db.similarity_search(\"just testing\", k=5, filter=advanced_filter))\n",
|
||
"\n",
|
||
"advanced_filter = {\"name\": {\"$in\": [\"adam\", \"bob\"]}}\n",
|
||
"print(f\"Filter: {advanced_filter}\")\n",
|
||
"print_filter_result(db.similarity_search(\"just testing\", k=5, filter=advanced_filter))\n",
|
||
"\n",
|
||
"advanced_filter = {\"name\": {\"$nin\": [\"adam\", \"bob\"]}}\n",
|
||
"print(f\"Filter: {advanced_filter}\")\n",
|
||
"print_filter_result(db.similarity_search(\"just testing\", k=5, filter=advanced_filter))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Text filtering with `$like`"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 27,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Filter: {'name': {'$like': 'a%'}}\n",
|
||
"{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}\n",
|
||
"Filter: {'name': {'$like': '%a%'}}\n",
|
||
"{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}\n",
|
||
"{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"advanced_filter = {\"name\": {\"$like\": \"a%\"}}\n",
|
||
"print(f\"Filter: {advanced_filter}\")\n",
|
||
"print_filter_result(db.similarity_search(\"just testing\", k=5, filter=advanced_filter))\n",
|
||
"\n",
|
||
"advanced_filter = {\"name\": {\"$like\": \"%a%\"}}\n",
|
||
"print(f\"Filter: {advanced_filter}\")\n",
|
||
"print_filter_result(db.similarity_search(\"just testing\", k=5, filter=advanced_filter))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Combined filtering with `$and`, `$or`"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 28,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Filter: {'$or': [{'id': 1}, {'name': 'bob'}]}\n",
|
||
"{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}\n",
|
||
"{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}\n",
|
||
"Filter: {'$and': [{'id': 1}, {'id': 2}]}\n",
|
||
"<empty result>\n",
|
||
"Filter: {'$or': [{'id': 1}, {'id': 2}, {'id': 3}]}\n",
|
||
"{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}\n",
|
||
"{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}\n",
|
||
"{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"advanced_filter = {\"$or\": [{\"id\": 1}, {\"name\": \"bob\"}]}\n",
|
||
"print(f\"Filter: {advanced_filter}\")\n",
|
||
"print_filter_result(db.similarity_search(\"just testing\", k=5, filter=advanced_filter))\n",
|
||
"\n",
|
||
"advanced_filter = {\"$and\": [{\"id\": 1}, {\"id\": 2}]}\n",
|
||
"print(f\"Filter: {advanced_filter}\")\n",
|
||
"print_filter_result(db.similarity_search(\"just testing\", k=5, filter=advanced_filter))\n",
|
||
"\n",
|
||
"advanced_filter = {\"$or\": [{\"id\": 1}, {\"id\": 2}, {\"id\": 3}]}\n",
|
||
"print(f\"Filter: {advanced_filter}\")\n",
|
||
"print_filter_result(db.similarity_search(\"just testing\", k=5, filter=advanced_filter))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Using a VectorStore as a retriever in chains for retrieval augmented generation (RAG)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 29,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"from langchain.memory import ConversationBufferMemory\n",
|
||
"from langchain_openai import ChatOpenAI\n",
|
||
"\n",
|
||
"# Access the vector DB with a new table\n",
|
||
"db = HanaDB(\n",
|
||
" connection=connection,\n",
|
||
" embedding=embeddings,\n",
|
||
" table_name=\"LANGCHAIN_DEMO_RETRIEVAL_CHAIN\",\n",
|
||
")\n",
|
||
"\n",
|
||
"# Delete already existing entries from the table\n",
|
||
"db.delete(filter={})\n",
|
||
"\n",
|
||
"# add the loaded document chunks from the \"State Of The Union\" file\n",
|
||
"db.add_documents(text_chunks)\n",
|
||
"\n",
|
||
"# Create a retriever instance of the vector store\n",
|
||
"retriever = db.as_retriever()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Define the prompt."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 30,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"from langchain_core.prompts import PromptTemplate\n",
|
||
"\n",
|
||
"prompt_template = \"\"\"\n",
|
||
"You are an expert in state of the union topics. You are provided multiple context items that are related to the prompt you have to answer.\n",
|
||
"Use the following pieces of context to answer the question at the end.\n",
|
||
"\n",
|
||
"'''\n",
|
||
"{context}\n",
|
||
"'''\n",
|
||
"\n",
|
||
"Question: {question}\n",
|
||
"\"\"\"\n",
|
||
"\n",
|
||
"PROMPT = PromptTemplate(\n",
|
||
" template=prompt_template, input_variables=[\"context\", \"question\"]\n",
|
||
")\n",
|
||
"chain_type_kwargs = {\"prompt\": PROMPT}"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Create the ConversationalRetrievalChain, which handles the chat history and the retrieval of similar document chunks to be added to the prompt."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"from langchain.chains import ConversationalRetrievalChain\n",
|
||
"\n",
|
||
"llm = ChatOpenAI(model=\"gpt-3.5-turbo\")\n",
|
||
"memory = ConversationBufferMemory(\n",
|
||
" memory_key=\"chat_history\", output_key=\"answer\", return_messages=True\n",
|
||
")\n",
|
||
"qa_chain = ConversationalRetrievalChain.from_llm(\n",
|
||
" llm,\n",
|
||
" db.as_retriever(search_kwargs={\"k\": 5}),\n",
|
||
" return_source_documents=True,\n",
|
||
" memory=memory,\n",
|
||
" verbose=False,\n",
|
||
" combine_docs_chain_kwargs={\"prompt\": PROMPT},\n",
|
||
")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Ask the first question (and verify how many text chunks have been used)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 32,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Answer from LLM:\n",
|
||
"================\n",
|
||
"The United States has set up joint patrols with Mexico and Guatemala to catch more human traffickers. This collaboration is part of the efforts to address immigration issues and secure the borders in the region.\n",
|
||
"================\n",
|
||
"Number of used source document chunks: 5\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"question = \"What about Mexico and Guatemala?\"\n",
|
||
"\n",
|
||
"result = qa_chain.invoke({\"question\": question})\n",
|
||
"print(\"Answer from LLM:\")\n",
|
||
"print(\"================\")\n",
|
||
"print(result[\"answer\"])\n",
|
||
"\n",
|
||
"source_docs = result[\"source_documents\"]\n",
|
||
"print(\"================\")\n",
|
||
"print(f\"Number of used source document chunks: {len(source_docs)}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Examine the chunks used by the chain in detail. Check whether the top-ranked chunk contains information about \"Mexico and Guatemala\", as mentioned in the question."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"for doc in source_docs:\n",
|
||
" print(\"-\" * 80)\n",
|
||
" print(doc.page_content)\n",
|
||
" print(doc.metadata)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Ask another question on the same conversational chain. Because the chain keeps the chat history, the answer should relate to the previous answer."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 34,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Answer from LLM:\n",
|
||
"================\n",
|
||
"Mexico and Guatemala are involved in joint patrols to catch human traffickers.\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"question = \"What about other countries?\"\n",
|
||
"\n",
|
||
"result = qa_chain.invoke({\"question\": question})\n",
|
||
"print(\"Answer from LLM:\")\n",
|
||
"print(\"================\")\n",
|
||
"print(result[\"answer\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Standard tables vs. \"custom\" tables with vector data"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"By default, the table for the embeddings is created with three columns:\n",
"\n",
"- A column `VEC_TEXT`, which contains the text of the Document\n",
"- A column `VEC_META`, which contains the metadata of the Document\n",
"- A column `VEC_VECTOR`, which contains the embedding vector of the Document's text"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 35,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"[]"
|
||
]
|
||
},
|
||
"execution_count": 35,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Access the vector DB with a new table\n",
|
||
"db = HanaDB(\n",
|
||
" connection=connection, embedding=embeddings, table_name=\"LANGCHAIN_DEMO_NEW_TABLE\"\n",
|
||
")\n",
|
||
"\n",
|
||
"# Delete already existing entries from the table\n",
|
||
"db.delete(filter={})\n",
|
||
"\n",
|
||
"# Add a simple document with some metadata\n",
|
||
"docs = [\n",
|
||
" Document(\n",
|
||
" page_content=\"A simple document\",\n",
|
||
" metadata={\"start\": 100, \"end\": 150, \"doc_name\": \"simple.txt\"},\n",
|
||
" )\n",
|
||
"]\n",
|
||
"db.add_documents(docs)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Show the columns of the table `LANGCHAIN_DEMO_NEW_TABLE`."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 36,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"('VEC_META', 'NCLOB')\n",
|
||
"('VEC_TEXT', 'NCLOB')\n",
|
||
"('VEC_VECTOR', 'REAL_VECTOR')\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"cur = connection.cursor()\n",
|
||
"cur.execute(\n",
|
||
" \"SELECT COLUMN_NAME, DATA_TYPE_NAME FROM SYS.TABLE_COLUMNS WHERE SCHEMA_NAME = CURRENT_SCHEMA AND TABLE_NAME = 'LANGCHAIN_DEMO_NEW_TABLE'\"\n",
|
||
")\n",
|
||
"rows = cur.fetchall()\n",
|
||
"for row in rows:\n",
|
||
" print(row)\n",
|
||
"cur.close()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Show the values of the inserted document in the three columns."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"cur = connection.cursor()\n",
|
||
"cur.execute(\n",
|
||
" \"SELECT VEC_TEXT, VEC_META, TO_NVARCHAR(VEC_VECTOR) FROM LANGCHAIN_DEMO_NEW_TABLE LIMIT 1\"\n",
|
||
")\n",
|
||
"rows = cur.fetchall()\n",
|
||
"print(rows[0][0]) # The text\n",
|
||
"print(rows[0][1]) # The metadata\n",
|
||
"print(rows[0][2]) # The vector\n",
|
||
"cur.close()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Custom tables must have at least three columns that match the semantics of a standard table:\n",
"\n",
"- A column with type `NCLOB` or `NVARCHAR` for the text/context of the embeddings\n",
"- A column with type `NCLOB` or `NVARCHAR` for the metadata\n",
"- A column with type `REAL_VECTOR` for the embedding vector\n",
"\n",
"The table can contain additional columns. When new Documents are inserted into the table, these additional columns must allow NULL values."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 39,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"None\n",
|
||
"Some other text\n",
|
||
"{\"start\": 400, \"end\": 450, \"doc_name\": \"other.txt\"}\n",
|
||
"<memory at 0x7f5edcb18d00>\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Create a new table \"MY_OWN_TABLE_ADD\" with three \"standard\" columns and one additional column\n",
|
||
"my_own_table_name = \"MY_OWN_TABLE_ADD\"\n",
|
||
"cur = connection.cursor()\n",
|
||
"cur.execute(\n",
|
||
" (\n",
|
||
" f\"CREATE TABLE {my_own_table_name} (\"\n",
|
||
" \"SOME_OTHER_COLUMN NVARCHAR(42), \"\n",
|
||
" \"MY_TEXT NVARCHAR(2048), \"\n",
|
||
" \"MY_METADATA NVARCHAR(1024), \"\n",
|
||
" \"MY_VECTOR REAL_VECTOR )\"\n",
|
||
" )\n",
|
||
")\n",
|
||
"\n",
"# Create a HanaDB instance with the custom table\n",
|
||
"db = HanaDB(\n",
|
||
" connection=connection,\n",
|
||
" embedding=embeddings,\n",
|
||
" table_name=my_own_table_name,\n",
|
||
" content_column=\"MY_TEXT\",\n",
|
||
" metadata_column=\"MY_METADATA\",\n",
|
||
" vector_column=\"MY_VECTOR\",\n",
|
||
")\n",
|
||
"\n",
|
||
"# Add a simple document with some metadata\n",
|
||
"docs = [\n",
|
||
" Document(\n",
|
||
" page_content=\"Some other text\",\n",
|
||
" metadata={\"start\": 400, \"end\": 450, \"doc_name\": \"other.txt\"},\n",
|
||
" )\n",
|
||
"]\n",
|
||
"db.add_documents(docs)\n",
|
||
"\n",
|
||
"# Check if data has been inserted into our own table\n",
|
||
"cur.execute(f\"SELECT * FROM {my_own_table_name} LIMIT 1\")\n",
|
||
"rows = cur.fetchall()\n",
"print(rows[0][0]) # Value of column \"SOME_OTHER_COLUMN\". Should be NULL/None\n",
|
||
"print(rows[0][1]) # The text\n",
|
||
"print(rows[0][2]) # The metadata\n",
|
||
"print(rows[0][3]) # The vector\n",
|
||
"\n",
|
||
"cur.close()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Add another document and perform a similarity search on the custom table."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 40,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"--------------------------------------------------------------------------------\n",
|
||
"Some other text\n",
|
||
"--------------------------------------------------------------------------------\n",
|
||
"Some more text\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"docs = [\n",
|
||
" Document(\n",
|
||
" page_content=\"Some more text\",\n",
|
||
" metadata={\"start\": 800, \"end\": 950, \"doc_name\": \"more.txt\"},\n",
|
||
" )\n",
|
||
"]\n",
|
||
"db.add_documents(docs)\n",
|
||
"\n",
|
||
"query = \"What's up?\"\n",
|
||
"docs = db.similarity_search(query, k=2)\n",
|
||
"for doc in docs:\n",
|
||
" print(\"-\" * 80)\n",
|
||
" print(doc.page_content)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Filter Performance Optimization with Custom Columns"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"To allow flexible metadata values, all metadata is stored as JSON in the metadata column by default. If some of the metadata keys and their value types are known in advance, they can be stored in additional columns instead: create the target table with the key names as column names and pass those names to the `HanaDB` constructor via the `specific_metadata_columns` list. During insert, metadata values whose keys match an entry in that list are copied into the corresponding column. Filters on keys in the `specific_metadata_columns` list then use these columns instead of the metadata JSON column."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 41,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Filters on this value are very performant\n",
|
||
"Some other text\n",
|
||
"{\"start\": 400, \"end\": 450, \"doc_name\": \"other.txt\", \"CUSTOMTEXT\": \"Filters on this value are very performant\"}\n",
|
||
"<memory at 0x7f5edcb193c0>\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Create a new table \"PERFORMANT_CUSTOMTEXT_FILTER\" with three \"standard\" columns and one additional column\n",
|
||
"my_own_table_name = \"PERFORMANT_CUSTOMTEXT_FILTER\"\n",
|
||
"cur = connection.cursor()\n",
|
||
"cur.execute(\n",
|
||
" (\n",
|
||
" f\"CREATE TABLE {my_own_table_name} (\"\n",
|
||
" \"CUSTOMTEXT NVARCHAR(500), \"\n",
|
||
" \"MY_TEXT NVARCHAR(2048), \"\n",
|
||
" \"MY_METADATA NVARCHAR(1024), \"\n",
|
||
" \"MY_VECTOR REAL_VECTOR )\"\n",
|
||
" )\n",
|
||
")\n",
|
||
"\n",
"# Create a HanaDB instance with the custom table\n",
|
||
"db = HanaDB(\n",
|
||
" connection=connection,\n",
|
||
" embedding=embeddings,\n",
|
||
" table_name=my_own_table_name,\n",
|
||
" content_column=\"MY_TEXT\",\n",
|
||
" metadata_column=\"MY_METADATA\",\n",
|
||
" vector_column=\"MY_VECTOR\",\n",
|
||
" specific_metadata_columns=[\"CUSTOMTEXT\"],\n",
|
||
")\n",
|
||
"\n",
|
||
"# Add a simple document with some metadata\n",
|
||
"docs = [\n",
|
||
" Document(\n",
|
||
" page_content=\"Some other text\",\n",
|
||
" metadata={\n",
|
||
" \"start\": 400,\n",
|
||
" \"end\": 450,\n",
|
||
" \"doc_name\": \"other.txt\",\n",
|
||
" \"CUSTOMTEXT\": \"Filters on this value are very performant\",\n",
|
||
" },\n",
|
||
" )\n",
|
||
"]\n",
|
||
"db.add_documents(docs)\n",
|
||
"\n",
|
||
"# Check if data has been inserted into our own table\n",
|
||
"cur.execute(f\"SELECT * FROM {my_own_table_name} LIMIT 1\")\n",
|
||
"rows = cur.fetchall()\n",
|
||
"print(\n",
|
||
" rows[0][0]\n",
|
||
") # Value of column \"CUSTOMTEXT\". Should be \"Filters on this value are very performant\"\n",
|
||
"print(rows[0][1]) # The text\n",
|
||
"print(\n",
|
||
" rows[0][2]\n",
") # The metadata. It still contains \"CUSTOMTEXT\", because the value is copied into the separate column\n",
|
||
"print(rows[0][3]) # The vector\n",
|
||
"\n",
|
||
"cur.close()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"The special columns are completely transparent to the rest of the LangChain interface. Everything works as before, but filters on these columns perform better."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 42,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"--------------------------------------------------------------------------------\n",
|
||
"Some other text\n",
|
||
"--------------------------------------------------------------------------------\n",
|
||
"Some more text\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"docs = [\n",
|
||
" Document(\n",
|
||
" page_content=\"Some more text\",\n",
|
||
" metadata={\n",
|
||
" \"start\": 800,\n",
|
||
" \"end\": 950,\n",
|
||
" \"doc_name\": \"more.txt\",\n",
|
||
" \"CUSTOMTEXT\": \"Another customtext value\",\n",
|
||
" },\n",
|
||
" )\n",
|
||
"]\n",
|
||
"db.add_documents(docs)\n",
|
||
"\n",
|
||
"advanced_filter = {\"CUSTOMTEXT\": {\"$like\": \"%value%\"}}\n",
|
||
"query = \"What's up?\"\n",
|
||
"docs = db.similarity_search(query, k=2, filter=advanced_filter)\n",
|
||
"for doc in docs:\n",
|
||
" print(\"-\" * 80)\n",
|
||
" print(doc.page_content)"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.10.14"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 4
|
||
}
|