docs: restore some content to Elasticsearch integration page (#30522)

https://github.com/langchain-ai/langchain/pull/24858 standardized vector
store integration pages, but deleted some content.

Here we merge some of the old content back in. We use this version as a
reference:
2c798622cd/docs/docs/integrations/vectorstores/elasticsearch.ipynb
ccurme 2025-03-27 11:07:19 -04:00 committed by GitHub
parent 956b09f468
commit 80064893c1

@@ -391,6 +391,147 @@
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "3f1d79c6",
"metadata": {},
"source": [
"#### Metadata filtering\n",
"\n",
"`ElasticsearchStore` supports metadata to stored along with the document. This metadata dict object is stored in a metadata object field in the Elasticsearch document. Based on the metadata value, Elasticsearch will automatically setup the mapping by infering the data type of the metadata value. For example, if the metadata value is a string, Elasticsearch will setup the mapping for the metadata object field as a string type.\n",
"\n",
"You can filter by exact keyword, as above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8cc5db5",
"metadata": {},
"outputs": [],
"source": [
"results = vector_store.similarity_search(\n",
" query=\"LangChain provides abstractions to make working with LLMs easy\",\n",
" k=2,\n",
" filter=[{\"term\": {\"metadata.source.keyword\": \"tweet\"}}],\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "a2f03ab8",
"metadata": {},
"source": [
"By partial match:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b371da9f",
"metadata": {},
"outputs": [],
"source": [
"results = vector_store.similarity_search(\n",
" query=\"LangChain provides abstractions to make working with LLMs easy\",\n",
" k=2,\n",
" filter=[{\"match\": {\"metadata.source\": {\"query\": \"tweet\", \"fuzziness\": \"AUTO\"}}}],\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "d70d8cd7",
"metadata": {},
"source": [
"By date range (if a date field exists):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "72ddc0eb",
"metadata": {},
"outputs": [],
"source": [
"results = vector_store.similarity_search(\n",
" query=\"LangChain provides abstractions to make working with LLMs easy\",\n",
" k=2,\n",
" filter=[{\"range\": {\"metadata.date\": {\"gte\": \"2010-01-01\"}}}],\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "82759079",
"metadata": {},
"source": [
"By numeric range (if a numeric field exists):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7cbf8255",
"metadata": {},
"outputs": [],
"source": [
"results = vector_store.similarity_search(\n",
" query=\"LangChain provides abstractions to make working with LLMs easy\",\n",
" k=2,\n",
" filter=[{\"range\": {\"metadata.a_numeric_field\": {\"gte\": 2}}}],\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "0ad5f8da",
"metadata": {},
"source": [
"By geo distance (Requires an index with a geo_point mapping to be declared for `metadata.geo_location`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e0efa827",
"metadata": {},
"outputs": [],
"source": [
"results = vector_store.similarity_search(\n",
" query=\"LangChain provides abstractions to make working with LLMs easy\",\n",
" k=2,\n",
" filter=[\n",
" {\n",
" \"geo_distance\": {\n",
" \"distance\": \"200km\",\n",
" \"metadata.geo_location\": {\"lat\": 40, \"lon\": -70},\n",
" }\n",
" }\n",
" ],\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "a883e5b0",
"metadata": {},
"source": [
"Filter supports many more types of queries than above. \n",
"\n",
"Read more about them in the [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html)."
]
},
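{
"cell_type": "markdown",
"id": "f3a8b2c1",
"metadata": {},
"source": [
"For example, several of the conditions above can be combined in a single `bool` query and passed via the same `filter` parameter. This is a sketch; the `metadata.source` and `metadata.date` fields are assumed to exist in your index:\n",
"\n",
"```json\n",
"[\n",
" {\n",
" \"bool\": {\n",
" \"must\": [\n",
" {\"term\": {\"metadata.source.keyword\": \"tweet\"}},\n",
" {\"range\": {\"metadata.date\": {\"gte\": \"2010-01-01\"}}}\n",
" ]\n",
" }\n",
" }\n",
"]\n",
"```"
]
},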
{
"cell_type": "markdown",
"id": "a0fda72e",
@@ -462,6 +603,412 @@
"retriever.invoke(\"Stealing from the bank is a crime\")"
]
},
{
"cell_type": "markdown",
"id": "8ec8694f",
"metadata": {},
"source": [
"## Distance Similarity Algorithm\n",
"Elasticsearch supports the following vector distance similarity algorithms:\n",
"\n",
"- cosine\n",
"- euclidean\n",
"- dot_product\n",
"\n",
"The cosine similarity algorithm is the default.\n",
"\n",
"You can specify the similarity Algorithm needed via the similarity parameter.\n",
"\n",
"**NOTE**\n",
"Depending on the retrieval strategy, the similarity algorithm cannot be changed at query time. It is needed to be set when creating the index mapping for field. If you need to change the similarity algorithm, you need to delete the index and recreate it with the correct distance_strategy.\n",
"\n",
"```python\n",
"\n",
"db = ElasticsearchStore.from_documents(\n",
" docs, \n",
" embeddings, \n",
" es_url=\"http://localhost:9200\", \n",
" index_name=\"test\",\n",
" distance_strategy=\"COSINE\"\n",
" # distance_strategy=\"EUCLIDEAN_DISTANCE\"\n",
" # distance_strategy=\"DOT_PRODUCT\"\n",
")\n",
"\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "67115d26",
"metadata": {},
"source": [
"## Retrieval Strategies\n",
"Elasticsearch has big advantages over other vector only databases from its ability to support a wide range of retrieval strategies. In this notebook we will configure `ElasticsearchStore` to support some of the most common retrieval strategies. \n",
"\n",
"By default, `ElasticsearchStore` uses the `DenseVectorStrategy` (was called `ApproxRetrievalStrategy` prior to version 0.2.0).\n",
"\n",
"### DenseVectorStrategy\n",
"This will return the top `k` most similar vectors to the query vector. The `k` parameter is set when the `ElasticsearchStore` is initialized. The default value is `10`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "946c12c4",
"metadata": {},
"outputs": [],
"source": [
"from langchain_elasticsearch import DenseVectorStrategy\n",
"\n",
"db = ElasticsearchStore.from_documents(\n",
" docs,\n",
" embeddings,\n",
" es_url=\"http://localhost:9200\",\n",
" index_name=\"test\",\n",
" strategy=DenseVectorStrategy(),\n",
")\n",
"\n",
"docs = db.similarity_search(query=\"...\", k=10)"
]
},
{
"cell_type": "markdown",
"id": "8183cb02",
"metadata": {},
"source": [
"### Example: Hybrid retrieval with dense vector and keyword search\n",
"This example will show how to configure `ElasticsearchStore` to perform a hybrid retrieval, using a combination of approximate semantic search and keyword based search. \n",
"\n",
"We use RRF to balance the two scores from different retrieval methods.\n",
"\n",
"To enable hybrid retrieval, we need to set `hybrid=True` in the `DenseVectorStrategy` constructor.\n",
"\n",
"```python\n",
"\n",
"db = ElasticsearchStore.from_documents(\n",
" docs, \n",
" embeddings, \n",
" es_url=\"http://localhost:9200\", \n",
" index_name=\"test\",\n",
" strategy=DenseVectorStrategy(hybrid=True)\n",
")\n",
"```\n",
"\n",
"When `hybrid` is enabled, the query performed will be a combination of approximate semantic search and keyword based search. \n",
"\n",
"It will use `rrf` (Reciprocal Rank Fusion) to balance the two scores from different retrieval methods.\n",
"\n",
"**Note** RRF requires Elasticsearch 8.9.0 or above.\n",
"\n",
"```json\n",
"{\n",
" \"knn\": {\n",
" \"field\": \"vector\",\n",
" \"filter\": [],\n",
" \"k\": 1,\n",
" \"num_candidates\": 50,\n",
" \"query_vector\": [1.0, ..., 0.0],\n",
" },\n",
" \"query\": {\n",
" \"bool\": {\n",
" \"filter\": [],\n",
" \"must\": [{\"match\": {\"text\": {\"query\": \"foo\"}}}],\n",
" }\n",
" },\n",
" \"rank\": {\"rrf\": {}},\n",
"}\n",
"```\n",
"\n",
"### Example: Dense vector search with Embedding Model in Elasticsearch\n",
"This example will show how to configure `ElasticsearchStore` to use the embedding model deployed in Elasticsearch for dense vector retrieval.\n",
"\n",
"To use this, specify the model_id in `DenseVectorStrategy` constructor via the `query_model_id` argument.\n",
"\n",
"**NOTE** This requires the model to be deployed and running in Elasticsearch ml node. See [notebook example](https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/integrations/hugging-face/loading-model-from-hugging-face.ipynb) on how to deploy the model with eland.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "993ab653",
"metadata": {},
"outputs": [],
"source": [
"DENSE_SELF_DEPLOYED_INDEX_NAME = \"test-dense-self-deployed\"\n",
"\n",
"# Note: This does not have an embedding function specified\n",
"# Instead, we will use the embedding model deployed in Elasticsearch\n",
"db = ElasticsearchStore(\n",
" es_cloud_id=\"<your cloud id>\",\n",
" es_user=\"elastic\",\n",
" es_password=\"<your password>\",\n",
" index_name=DENSE_SELF_DEPLOYED_INDEX_NAME,\n",
" query_field=\"text_field\",\n",
" vector_query_field=\"vector_query_field.predicted_value\",\n",
" strategy=DenseVectorStrategy(model_id=\"sentence-transformers__all-minilm-l6-v2\"),\n",
")\n",
"\n",
"# Setup a Ingest Pipeline to perform the embedding\n",
"# of the text field\n",
"db.client.ingest.put_pipeline(\n",
" id=\"test_pipeline\",\n",
" processors=[\n",
" {\n",
" \"inference\": {\n",
" \"model_id\": \"sentence-transformers__all-minilm-l6-v2\",\n",
" \"field_map\": {\"query_field\": \"text_field\"},\n",
" \"target_field\": \"vector_query_field\",\n",
" }\n",
" }\n",
" ],\n",
")\n",
"\n",
"# creating a new index with the pipeline,\n",
"# not relying on langchain to create the index\n",
"db.client.indices.create(\n",
" index=DENSE_SELF_DEPLOYED_INDEX_NAME,\n",
" mappings={\n",
" \"properties\": {\n",
" \"text_field\": {\"type\": \"text\"},\n",
" \"vector_query_field\": {\n",
" \"properties\": {\n",
" \"predicted_value\": {\n",
" \"type\": \"dense_vector\",\n",
" \"dims\": 384,\n",
" \"index\": True,\n",
" \"similarity\": \"l2_norm\",\n",
" }\n",
" }\n",
" },\n",
" }\n",
" },\n",
" settings={\"index\": {\"default_pipeline\": \"test_pipeline\"}},\n",
")\n",
"\n",
"db.from_texts(\n",
" [\"hello world\"],\n",
" es_cloud_id=\"<cloud id>\",\n",
" es_user=\"elastic\",\n",
" es_password=\"<cloud password>\",\n",
" index_name=DENSE_SELF_DEPLOYED_INDEX_NAME,\n",
" query_field=\"text_field\",\n",
" vector_query_field=\"vector_query_field.predicted_value\",\n",
" strategy=DenseVectorStrategy(model_id=\"sentence-transformers__all-minilm-l6-v2\"),\n",
")\n",
"\n",
"# Perform search\n",
"db.similarity_search(\"hello world\", k=10)"
]
},
{
"cell_type": "markdown",
"id": "24646cf3",
"metadata": {},
"source": [
"### SparseVectorStrategy (ELSER)\n",
"This strategy uses Elasticsearch's sparse vector retrieval to retrieve the top-k results. We only support our own \"ELSER\" embedding model for now.\n",
"\n",
"**NOTE** This requires the ELSER model to be deployed and running in Elasticsearch ml node. \n",
"\n",
"To use this, specify `SparseVectorStrategy` (was called `SparseVectorRetrievalStrategy` prior to version 0.2.0) in the `ElasticsearchStore` constructor. You will need to provide a model ID."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d295c424",
"metadata": {},
"outputs": [],
"source": [
"from langchain_elasticsearch import SparseVectorStrategy\n",
"\n",
"# Note that this example doesn't have an embedding function. This is because we infer the tokens at index time and at query time within Elasticsearch.\n",
"# This requires the ELSER model to be loaded and running in Elasticsearch.\n",
"db = ElasticsearchStore.from_documents(\n",
" docs,\n",
" es_cloud_id=\"<cloud id>\",\n",
" es_user=\"elastic\",\n",
" es_password=\"<cloud password>\",\n",
" index_name=\"test-elser\",\n",
" strategy=SparseVectorStrategy(model_id=\".elser_model_2\"),\n",
")\n",
"\n",
"db.client.indices.refresh(index=\"test-elser\")\n",
"\n",
"results = db.similarity_search(\"...\", k=4)\n",
"print(results[0])"
]
},
{
"cell_type": "markdown",
"id": "f6c7c17b",
"metadata": {},
"source": [
"### DenseVectorScriptScoreStrategy\n",
"This strategy uses Elasticsearch's script score query to perform exact vector retrieval (also known as brute force) to retrieve the top-k results. (This strategy was called `ExactRetrievalStrategy` prior to version 0.2.0.)\n",
"\n",
"To use this, specify `DenseVectorScriptScoreStrategy` in `ElasticsearchStore` constructor.\n",
"\n",
"```python\n",
"from langchain_elasticsearch import SparseVectorStrategy\n",
"\n",
"db = ElasticsearchStore.from_documents(\n",
" docs, \n",
" embeddings, \n",
" es_url=\"http://localhost:9200\", \n",
" index_name=\"test\",\n",
" strategy=DenseVectorScriptScoreStrategy(),\n",
")\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "df4d584b",
"metadata": {},
"source": [
"### BM25Strategy\n",
"Finally, you can use full-text keyword search.\n",
"\n",
"To use this, specify `BM25Strategy` in `ElasticsearchStore` constructor.\n",
"\n",
"```python\n",
"from langchain_elasticsearch import BM25Strategy\n",
"\n",
"db = ElasticsearchStore.from_documents(\n",
" docs, \n",
" es_url=\"http://localhost:9200\", \n",
" index_name=\"test\",\n",
" strategy=BM25Strategy(),\n",
")\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "79d35f68",
"metadata": {},
"source": [
"### BM25RetrievalStrategy\n",
"This strategy allows the user to perform searches using pure BM25 without vector search.\n",
"\n",
"To use this, specify `BM25RetrievalStrategy` in `ElasticsearchStore` constructor.\n",
"\n",
"Note that in the example below, the embedding option is not specified, indicating that the search is conducted without using embeddings."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "883b5f42",
"metadata": {},
"outputs": [],
"source": [
"from langchain_elasticsearch import ElasticsearchStore\n",
"\n",
"db = ElasticsearchStore(\n",
" es_url=\"http://localhost:9200\",\n",
" index_name=\"test_index\",\n",
" strategy=ElasticsearchStore.BM25RetrievalStrategy(),\n",
")\n",
"\n",
"db.add_texts(\n",
" [\"foo\", \"foo bar\", \"foo bar baz\", \"bar\", \"bar baz\", \"baz\"],\n",
")\n",
"\n",
"results = db.similarity_search(query=\"foo\", k=10)\n",
"print(results)"
]
},
{
"cell_type": "markdown",
"id": "ee285657",
"metadata": {},
"source": [
"### Customise the Query\n",
"With `custom_query` parameter at search, you are able to adjust the query that is used to retrieve documents from Elasticsearch. This is useful if you want to use a more complex query, to support linear boosting of fields."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e275baf8",
"metadata": {},
"outputs": [],
"source": [
"# Example of a custom query thats just doing a BM25 search on the text field.\n",
"def custom_query(query_body: dict, query: str):\n",
" \"\"\"Custom query to be used in Elasticsearch.\n",
" Args:\n",
" query_body (dict): Elasticsearch query body.\n",
" query (str): Query string.\n",
" Returns:\n",
" dict: Elasticsearch query body.\n",
" \"\"\"\n",
" print(\"Query Retriever created by the retrieval strategy:\")\n",
" print(query_body)\n",
" print()\n",
"\n",
" new_query_body = {\"query\": {\"match\": {\"text\": query}}}\n",
"\n",
" print(\"Query thats actually used in Elasticsearch:\")\n",
" print(new_query_body)\n",
" print()\n",
"\n",
" return new_query_body\n",
"\n",
"\n",
"results = db.similarity_search(\n",
" \"...\",\n",
" k=4,\n",
" custom_query=custom_query,\n",
")\n",
"print(\"Results:\")\n",
"print(results[0])"
]
},
{
"cell_type": "markdown",
"id": "32ef65d4",
"metadata": {},
"source": [
"### Customize the Document Builder\n",
"\n",
"With ```doc_builder``` parameter at search, you are able to adjust how a Document is being built using data retrieved from Elasticsearch. This is especially useful if you have indices which were not created using Langchain."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "09a441d4",
"metadata": {},
"outputs": [],
"source": [
"from typing import Dict\n",
"\n",
"from langchain_core.documents import Document\n",
"\n",
"\n",
"def custom_document_builder(hit: Dict) -> Document:\n",
" src = hit.get(\"_source\", {})\n",
" return Document(\n",
" page_content=src.get(\"content\", \"Missing content!\"),\n",
" metadata={\n",
" \"page_number\": src.get(\"page_number\", -1),\n",
" \"original_filename\": src.get(\"original_filename\", \"Missing filename!\"),\n",
" },\n",
" )\n",
"\n",
"\n",
"results = db.similarity_search(\n",
" \"...\",\n",
" k=4,\n",
" doc_builder=custom_document_builder,\n",
")\n",
"print(\"Results:\")\n",
"print(results[0])"
]
},
{
"cell_type": "markdown",
"id": "17b509ae",