community[patch]: Performant filter columns option for Hanavector (#21971)

**Description:** Backwards-compatible extension of the HanaDB
initialisation interface. The user can now pass
specific_metadata_columns, a list of metadata keys that are stored in
their own table columns, which speeds up filtering on those keys. Any
metadata keys not listed remain in the general metadata column as part
of a JSON string. In addition, batch inserts into HanaDB now use
executemany.
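
The split described above can be sketched roughly as follows. This is an illustration only, not the code added in this commit: `split_metadata` is a hypothetical helper, and an in-memory SQLite database stands in for HANA so that the snippet runs without a HANA connection. The column names mirror the notebook example.

```python
import json
import sqlite3


def split_metadata(metadata, specific_columns):
    """Hypothetical helper: separate the keys that have their own column
    from the remaining metadata, which is serialised to a JSON string."""
    special = [metadata.get(col) for col in specific_columns]
    rest = {k: v for k, v in metadata.items() if k not in specific_columns}
    return special, json.dumps(rest)


# SQLite is an assumption made for the sake of a runnable sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (CUSTOMTEXT TEXT, MY_TEXT TEXT, MY_METADATA TEXT)"
)

specific_columns = ["CUSTOMTEXT"]
documents = [
    ("Some other text", {"start": 400, "doc_name": "other.txt",
                         "CUSTOMTEXT": "Filters on this value are very performant"}),
    ("Some more text", {"start": 800, "doc_name": "more.txt",
                        "CUSTOMTEXT": "Another customtext value"}),
]

# Build all parameter tuples first, then insert them in one executemany call.
rows = []
for text, metadata in documents:
    special, rest_json = split_metadata(metadata, specific_columns)
    rows.append((*special, text, rest_json))

conn.executemany(
    "INSERT INTO docs (CUSTOMTEXT, MY_TEXT, MY_METADATA) VALUES (?, ?, ?)",
    rows,
)

# A filter on a listed key can use the plain column; no JSON parsing is needed.
matches = conn.execute(
    "SELECT MY_TEXT FROM docs WHERE CUSTOMTEXT LIKE ? ORDER BY MY_TEXT",
    ("%value%",),
).fetchall()
print(matches)
```

Collecting the parameter tuples and sending them through a single `executemany` call avoids one statement execution per document, which is the motivation for the batch-insert change mentioned above.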

**Issue:** N/A

**Dependencies:** no new dependencies added

**Twitter handle:** @sapopensource

---------

Co-authored-by: Martin Kolb <martin.kolb@sap.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
SaschaStoll
2024-05-22 22:21:21 +02:00
committed by GitHub
parent 16b55b0704
commit 709664a079
3 changed files with 551 additions and 49 deletions


@@ -41,7 +41,7 @@
},
{
"cell_type": "code",
"execution_count": 28,
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2023-09-09T08:02:16.802456Z",
@@ -64,7 +64,7 @@
},
{
"cell_type": "code",
"execution_count": 30,
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2023-09-09T08:02:28.174088Z",
@@ -102,7 +102,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 20,
"metadata": {
"ExecuteTime": {
"end_time": "2023-09-09T08:02:25.452472Z",
@@ -134,7 +134,7 @@
},
{
"cell_type": "code",
"execution_count": 31,
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2023-09-09T08:04:16.696625Z",
@@ -541,7 +541,7 @@
},
{
"cell_type": "code",
"execution_count": 36,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -574,7 +574,7 @@
},
{
"cell_type": "code",
"execution_count": 37,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -606,7 +606,7 @@
},
{
"cell_type": "code",
"execution_count": 38,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -869,6 +869,113 @@
" print(\"-\" * 80)\n",
" print(doc.page_content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filter Performance Optimization with Custom Columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To allow flexible metadata values, all metadata is stored as JSON in the metadata column by default. If some of the metadata keys and their value types are known in advance, they can be stored in additional columns instead: create the target table with the key names as column names and pass those names to the HanaDB constructor via the specific_metadata_columns list. During insert, metadata keys that match those names are copied into the special columns. Filters on keys in the specific_metadata_columns list then use the special columns instead of the metadata JSON column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a new table \"PERFORMANT_CUSTOMTEXT_FILTER\" with three \"standard\" columns and one additional column\n",
"my_own_table_name = \"PERFORMANT_CUSTOMTEXT_FILTER\"\n",
"cur = connection.cursor()\n",
"cur.execute(\n",
" (\n",
" f\"CREATE TABLE {my_own_table_name} (\"\n",
" \"CUSTOMTEXT NVARCHAR(500), \"\n",
" \"MY_TEXT NVARCHAR(2048), \"\n",
" \"MY_METADATA NVARCHAR(1024), \"\n",
" \"MY_VECTOR REAL_VECTOR )\"\n",
" )\n",
")\n",
"\n",
"# Create a HanaDB instance that uses our own table\n",
"db = HanaDB(\n",
" connection=connection,\n",
" embedding=embeddings,\n",
" table_name=my_own_table_name,\n",
" content_column=\"MY_TEXT\",\n",
" metadata_column=\"MY_METADATA\",\n",
" vector_column=\"MY_VECTOR\",\n",
" specific_metadata_columns=[\"CUSTOMTEXT\"],\n",
")\n",
"\n",
"# Add a simple document with some metadata\n",
"docs = [\n",
" Document(\n",
" page_content=\"Some other text\",\n",
" metadata={\n",
" \"start\": 400,\n",
" \"end\": 450,\n",
" \"doc_name\": \"other.txt\",\n",
" \"CUSTOMTEXT\": \"Filters on this value are very performant\",\n",
" },\n",
" )\n",
"]\n",
"db.add_documents(docs)\n",
"\n",
"# Check if data has been inserted into our own table\n",
"cur.execute(f\"SELECT * FROM {my_own_table_name} LIMIT 1\")\n",
"rows = cur.fetchall()\n",
"print(\n",
" rows[0][0]\n",
") # Value of column \"CUSTOMTEXT\". Should be \"Filters on this value are very performant\"\n",
"print(rows[0][1]) # The text\n",
"print(\n",
" rows[0][2]\n",
") # The metadata without the \"CUSTOMTEXT\" data, as this is extracted into a separate column\n",
"print(rows[0][3]) # The vector\n",
"\n",
"cur.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The special columns are completely transparent to the rest of the LangChain interface. Everything works as it did before, only with faster filtering on those keys."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"docs = [\n",
" Document(\n",
" page_content=\"Some more text\",\n",
" metadata={\n",
" \"start\": 800,\n",
" \"end\": 950,\n",
" \"doc_name\": \"more.txt\",\n",
" \"CUSTOMTEXT\": \"Another customtext value\",\n",
" },\n",
" )\n",
"]\n",
"db.add_documents(docs)\n",
"\n",
"advanced_filter = {\"CUSTOMTEXT\": {\"$like\": \"%value%\"}}\n",
"query = \"What's up?\"\n",
"docs = db.similarity_search(query, k=2, filter=advanced_filter)\n",
"for doc in docs:\n",
" print(\"-\" * 80)\n",
" print(doc.page_content)"
]
}
],
"metadata": {
@@ -887,7 +994,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.11.9"
}
},
"nbformat": 4,