mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-08 14:31:55 +00:00
[ElasticsearchStore] Enable custom Bulk Args (#11065)
This enables bulk args like `chunk_size` to be passed down from the ingest methods (`from_texts`, `from_documents`) to the bulk API, which helps alleviate issues where bulk-importing a large number of documents into Elasticsearch resulted in a timeout. Contribution shoutout: @elastic. - [x] Updated integration tests. --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
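The pattern this change adds can be sketched in plain Python: the ingest helper accepts an optional `bulk_kwargs` dict and forwards it to the bulk call. This is an illustrative stand-in, not LangChain's actual implementation — `bulk_index` here is a hypothetical placeholder for `elasticsearch.helpers.bulk`, with the defaults the FAQ below names (`chunk_size=500`, `max_chunk_bytes` of 100MB).

```python
# Illustrative sketch of the bulk_kwargs forwarding this PR enables.
# bulk_index stands in for elasticsearch.helpers.bulk; defaults mirror
# the ones documented in the FAQ (chunk_size=500, max_chunk_bytes=100MB).

def bulk_index(actions, chunk_size=500, max_chunk_bytes=100 * 1024 * 1024):
    """Stand-in for the bulk helper; returns the settings it received."""
    return {"chunk_size": chunk_size, "max_chunk_bytes": max_chunk_bytes, "count": len(actions)}

def add_texts(texts, bulk_kwargs=None):
    """Stand-in for ElasticsearchStore.add_texts: builds actions, forwards bulk_kwargs."""
    actions = [{"_source": {"text": t}} for t in texts]
    return bulk_index(actions, **(bulk_kwargs or {}))

result = add_texts(["a", "b", "c"], bulk_kwargs={"chunk_size": 50})
# result["chunk_size"] == 50, result["count"] == 3
```

Without `bulk_kwargs`, the helper's own defaults apply; with it, the caller controls the request sizing end to end.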
@@ -18,7 +18,7 @@ Example: Run a single-node Elasticsearch instance with security disabled. This i
 
 #### Deploy Elasticsearch on Elastic Cloud
 
-Elastic Cloud is a managed Elasticsearch service. Signup for a [free trial](https://cloud.elastic.co/registration?storm=langchain-notebook).
+Elastic Cloud is a managed Elasticsearch service. Signup for a [free trial](https://cloud.elastic.co/registration?utm_source=langchain&utm_content=documentation).
 
 ### Install Client
 
@@ -44,7 +44,7 @@
 "source": [
 "There are two main ways to setup an Elasticsearch instance for use with:\n",
 "\n",
-"1. Elastic Cloud: Elastic Cloud is a managed Elasticsearch service. Signup for a [free trial](https://cloud.elastic.co/registration?storm=langchain-notebook).\n",
+"1. Elastic Cloud: Elastic Cloud is a managed Elasticsearch service. Signup for a [free trial](https://cloud.elastic.co/registration?utm_source=langchain&utm_content=documentation).\n",
 "\n",
 "To connect to an Elasticsearch instance that does not require\n",
 "login credentials (starting the docker instance with security enabled), pass the Elasticsearch URL and index name along with the\n",
@@ -662,7 +662,7 @@
 "id": "0960fa0a",
 "metadata": {},
 "source": [
-"# Customise the Query\n",
+"## Customise the Query\n",
 "With `custom_query` parameter at search, you are able to adjust the query that is used to retrieve documents from Elasticsearch. This is useful if you want to use a more complex query, to support linear boosting of fields."
 ]
 },
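The `custom_query` hook mentioned in this hunk is a callable that receives the query body the store built plus the raw query string, and returns the body to send. A minimal sketch of such a callable, assuming that `(query_body, query)` signature; the function name, field names, and boost factor are hypothetical, and the commented-out search call assumes an existing `vector_store`:

```python
# Sketch of a custom_query callable for ElasticsearchStore searches.
# It rewrites the query into a multi_match with a linear boost on a
# hypothetical "title" field; the callable shape is (query_body, query) -> dict.

def boost_fields_query(query_body: dict, query: str) -> dict:
    """Replace the default query with a multi_match boosting title 2x."""
    query_body["query"] = {
        "multi_match": {
            "query": query,
            "fields": ["text", "title^2"],  # ^2 is a linear field boost
        }
    }
    return query_body

# results = vector_store.similarity_search("my query", custom_query=boost_fields_query)
```

Because the callable gets the full body, it can also adjust filters, size, or scoring, not just the query clause.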
@@ -720,6 +720,35 @@
 "print(results[0])"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "3242fd42",
+"metadata": {},
+"source": [
+"# FAQ\n",
+"\n",
+"## Question: I'm getting timeout errors when indexing documents into Elasticsearch. How do I fix this?\n",
+"One possible issue is your documents might take longer to index into Elasticsearch. ElasticsearchStore uses the Elasticsearch bulk API which has a few defaults that you can adjust to reduce the chance of timeout errors.\n",
+"\n",
+"This is also a good idea when you're using SparseVectorRetrievalStrategy.\n",
+"\n",
+"The defaults are:\n",
+"- `chunk_size`: 500\n",
+"- `max_chunk_bytes`: 100MB\n",
+"\n",
+"To adjust these, you can pass in the `chunk_size` and `max_chunk_bytes` parameters to the ElasticsearchStore `add_texts` method.\n",
+"\n",
+"```python\n",
+"    vector_store.add_texts(\n",
+"        texts,\n",
+"        bulk_kwargs={\n",
+"            \"chunk_size\": 50,\n",
+"            \"max_chunk_bytes\": 200000000\n",
+"        }\n",
+"    )\n",
+"```"
+]
+},
 {
 "cell_type": "markdown",
 "id": "604c66ea",
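The FAQ added above works because the bulk helper splits the document list into per-request batches of at most `chunk_size` items, so lowering it shrinks each request and the chance of a server-side timeout. A minimal sketch of that batching (not the elasticsearch client's actual implementation):

```python
# Minimal sketch of how chunk_size partitions documents into separate
# bulk requests; smaller chunks mean smaller, faster individual requests.

def chunk(texts, chunk_size=500):
    """Yield successive batches of at most chunk_size items."""
    for i in range(0, len(texts), chunk_size):
        yield texts[i : i + chunk_size]

batches = list(chunk([f"doc-{i}" for i in range(120)], chunk_size=50))
# 120 docs with chunk_size=50 -> 3 bulk requests of 50, 50, and 20 docs
```

The real helper additionally flushes a batch early once it would exceed `max_chunk_bytes`, which is why both knobs matter for large documents.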