mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-08 14:31:55 +00:00
[ElasticsearchStore] Enable custom Bulk Args (#11065)
This enables bulk args like `chunk_size` to be passed down from the ingest methods (`from_texts`, `from_documents`) to the bulk API, which helps alleviate issues where bulk-importing a large number of documents into Elasticsearch resulted in a timeout. Contribution shoutout: @elastic. - [x] Updated integration tests. --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
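The pattern this change adds can be sketched in plain Python: the ingest helper accepts an optional `bulk_kwargs` dict and forwards it to the bulk call. This is an illustrative stand-in, not LangChain's actual implementation — `bulk_index` here is a hypothetical placeholder for `elasticsearch.helpers.bulk`, with the defaults the FAQ below names (`chunk_size=500`, `max_chunk_bytes` of 100MB).

```python
# Illustrative sketch of the bulk_kwargs forwarding this PR enables.
# bulk_index stands in for elasticsearch.helpers.bulk; defaults mirror
# the ones documented in the FAQ (chunk_size=500, max_chunk_bytes=100MB).

def bulk_index(actions, chunk_size=500, max_chunk_bytes=100 * 1024 * 1024):
    """Stand-in for the bulk helper; returns the settings it received."""
    return {"chunk_size": chunk_size, "max_chunk_bytes": max_chunk_bytes, "count": len(actions)}

def add_texts(texts, bulk_kwargs=None):
    """Stand-in for ElasticsearchStore.add_texts: builds actions, forwards bulk_kwargs."""
    actions = [{"_source": {"text": t}} for t in texts]
    return bulk_index(actions, **(bulk_kwargs or {}))

result = add_texts(["a", "b", "c"], bulk_kwargs={"chunk_size": 50})
# result["chunk_size"] == 50, result["count"] == 3
```

Without `bulk_kwargs`, the helper's own defaults apply; with it, the caller controls the request sizing end to end.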
@@ -18,7 +18,7 @@ Example: Run a single-node Elasticsearch instance with security disabled. This i
 
 #### Deploy Elasticsearch on Elastic Cloud
 
-Elastic Cloud is a managed Elasticsearch service. Signup for a [free trial](https://cloud.elastic.co/registration?storm=langchain-notebook).
+Elastic Cloud is a managed Elasticsearch service. Signup for a [free trial](https://cloud.elastic.co/registration?utm_source=langchain&utm_content=documentation).
 
 ### Install Client
 
@@ -44,7 +44,7 @@
 "source": [
 "There are two main ways to setup an Elasticsearch instance for use with:\n",
 "\n",
-"1. Elastic Cloud: Elastic Cloud is a managed Elasticsearch service. Signup for a [free trial](https://cloud.elastic.co/registration?storm=langchain-notebook).\n",
+"1. Elastic Cloud: Elastic Cloud is a managed Elasticsearch service. Signup for a [free trial](https://cloud.elastic.co/registration?utm_source=langchain&utm_content=documentation).\n",
 "\n",
 "To connect to an Elasticsearch instance that does not require\n",
 "login credentials (starting the docker instance with security enabled), pass the Elasticsearch URL and index name along with the\n",
@@ -662,7 +662,7 @@
 "id": "0960fa0a",
 "metadata": {},
 "source": [
-"# Customise the Query\n",
+"## Customise the Query\n",
 "With `custom_query` parameter at search, you are able to adjust the query that is used to retrieve documents from Elasticsearch. This is useful if you want to use a more complex query, to support linear boosting of fields."
 ]
 },
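The `custom_query` hook mentioned in this hunk is a callable that receives the query body the store built plus the raw query string, and returns the body to send. A minimal sketch of such a callable, assuming that `(query_body, query)` signature; the function name, field names, and boost factor are hypothetical, and the commented-out search call assumes an existing `vector_store`:

```python
# Sketch of a custom_query callable for ElasticsearchStore searches.
# It rewrites the query into a multi_match with a linear boost on a
# hypothetical "title" field; the callable shape is (query_body, query) -> dict.

def boost_fields_query(query_body: dict, query: str) -> dict:
    """Replace the default query with a multi_match boosting title 2x."""
    query_body["query"] = {
        "multi_match": {
            "query": query,
            "fields": ["text", "title^2"],  # ^2 is a linear field boost
        }
    }
    return query_body

# results = vector_store.similarity_search("my query", custom_query=boost_fields_query)
```

Because the callable gets the full body, it can also adjust filters, size, or scoring, not just the query clause.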
@@ -720,6 +720,35 @@
 "print(results[0])"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "3242fd42",
+"metadata": {},
+"source": [
+"# FAQ\n",
+"\n",
+"## Question: I'm getting timeout errors when indexing documents into Elasticsearch. How do I fix this?\n",
+"One possible issue is your documents might take longer to index into Elasticsearch. ElasticsearchStore uses the Elasticsearch bulk API which has a few defaults that you can adjust to reduce the chance of timeout errors.\n",
+"\n",
+"This is also a good idea when you're using SparseVectorRetrievalStrategy.\n",
+"\n",
+"The defaults are:\n",
+"- `chunk_size`: 500\n",
+"- `max_chunk_bytes`: 100MB\n",
+"\n",
+"To adjust these, you can pass in the `chunk_size` and `max_chunk_bytes` parameters to the ElasticsearchStore `add_texts` method.\n",
+"\n",
+"```python\n",
+"    vector_store.add_texts(\n",
+"        texts,\n",
+"        bulk_kwargs={\n",
+"            \"chunk_size\": 50,\n",
+"            \"max_chunk_bytes\": 200000000\n",
+"        }\n",
+"    )\n",
+"```"
+]
+},
 {
 "cell_type": "markdown",
 "id": "604c66ea",
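The FAQ added above works because the bulk helper splits the document list into per-request batches of at most `chunk_size` items, so lowering it shrinks each request and the chance of a server-side timeout. A minimal sketch of that batching (not the elasticsearch client's actual implementation):

```python
# Minimal sketch of how chunk_size partitions documents into separate
# bulk requests; smaller chunks mean smaller, faster individual requests.

def chunk(texts, chunk_size=500):
    """Yield successive batches of at most chunk_size items."""
    for i in range(0, len(texts), chunk_size):
        yield texts[i : i + chunk_size]

batches = list(chunk([f"doc-{i}" for i in range(120)], chunk_size=50))
# 120 docs with chunk_size=50 -> 3 bulk requests of 50, 50, and 20 docs
```

The real helper additionally flushes a batch early once it would exceed `max_chunk_bytes`, which is why both knobs matter for large documents.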