docs: Restore accidentally deleted docs on Elasticsearch strategies (#30521)

**Description:** Adds back a section of the Elasticsearch vectorstore documentation that was deleted in commit a72fddbf8d (diff-4988344c6ccc08191f89ac1ebf1caab5185e13698d7567fde5352038cd950d77). The only change is an update to the example RRF request, which was out of date.
2025-09-19 00:58:32 +00:00 · 2025-03-27 15:27:20 +00:00
parent 0b2244ea88
commit 14b7d790c1
1 changed files with 469 additions and 0 deletions
--- a/docs/docs/integrations/vectorstores/elasticsearch.ipynb
+++ b/docs/docs/integrations/vectorstores/elasticsearch.ipynb
@@ -462,6 +462,475 @@
    "retriever.invoke(\"Stealing from the bank is a crime\")"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "5828dda5",
+   "metadata": {},
+   "source": [
+    "## Distance Similarity Algorithm\n",
+    "\n",
+    "Elasticsearch supports the following vector distance similarity algorithms:\n",
+    "\n",
+    "- cosine\n",
+    "- euclidean\n",
+    "- dot_product\n",
+    "\n",
+    "The cosine similarity algorithm is the default.\n",
+    "\n",
+    "You can specify the similarity algorithm via the `distance_strategy` parameter.\n",
+    "\n",
+    "**NOTE**: Depending on the retrieval strategy, the similarity algorithm cannot be changed at query time. It must be set when the index mapping for the field is created. To change the similarity algorithm, delete the index and recreate it with the correct `distance_strategy`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cec8b2ac",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "db = ElasticsearchStore.from_documents(\n",
+    "    docs,\n",
+    "    embeddings,\n",
+    "    es_url=\"http://localhost:9200\",\n",
+    "    index_name=\"test\",\n",
+    "    distance_strategy=\"COSINE\",\n",
+    "    # distance_strategy=\"EUCLIDEAN_DISTANCE\"\n",
+    "    # distance_strategy=\"DOT_PRODUCT\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0c9fb8a0",
+   "metadata": {},
+   "source": [
+    "## Retrieval Strategies\n",
+    "\n",
+    "A major advantage of Elasticsearch over vector-only databases is its support for a wide range of retrieval strategies. In this notebook we will configure `ElasticsearchStore` to support some of the most common retrieval strategies.\n",
+    "\n",
+    "By default, `ElasticsearchStore` uses the `DenseVectorStrategy` (called `ApproxRetrievalStrategy` prior to version 0.2.0).\n",
+    "\n",
+    "### DenseVectorStrategy\n",
+    "\n",
+    "This will return the top k most similar vectors to the query vector. The `k` parameter is set when the `ElasticsearchStore` is initialized. The default value is 10."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4d59a493",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_elasticsearch import DenseVectorStrategy\n",
+    "\n",
+    "db = ElasticsearchStore.from_documents(\n",
+    "    docs,\n",
+    "    embeddings,\n",
+    "    es_url=\"http://localhost:9200\",\n",
+    "    index_name=\"test\",\n",
+    "    strategy=DenseVectorStrategy(),\n",
+    ")\n",
+    "\n",
+    "results = db.similarity_search(\n",
+    "    query=\"What did the president say about Ketanji Brown Jackson?\", k=10\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0cf5d3d2",
+   "metadata": {},
+   "source": [
+    "#### Example: Hybrid retrieval with dense vector and keyword search\n",
+    "\n",
+    "This example shows how to configure `ElasticsearchStore` to perform hybrid retrieval, combining approximate semantic search with keyword-based search.\n",
+    "\n",
+    "We use RRF (Reciprocal Rank Fusion) to balance the scores from the two retrieval methods.\n",
+    "\n",
+    "To enable hybrid retrieval, set `hybrid=True` in the `DenseVectorStrategy` constructor."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "109f992a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "db = ElasticsearchStore.from_documents(\n",
+    "    docs,\n",
+    "    embeddings,\n",
+    "    es_url=\"http://localhost:9200\",\n",
+    "    index_name=\"test\",\n",
+    "    strategy=DenseVectorStrategy(hybrid=True),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b6e62ef0",
+   "metadata": {},
+   "source": [
+    "When hybrid is enabled, the query combines approximate semantic search with keyword-based search.\n",
+    "\n",
+    "It uses RRF (Reciprocal Rank Fusion) to balance the scores from the two retrieval methods. The following cell shows an example of the resulting RRF request.\n",
+    "\n",
+    "**Note**: RRF requires Elasticsearch 8.9.0 or above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9c07444e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "{\n",
+    "    \"retriever\": {\n",
+    "        \"rrf\": {\n",
+    "            \"retrievers\": [\n",
+    "                {\n",
+    "                    \"standard\": {\n",
+    "                        \"query\": {\n",
+    "                            \"bool\": {\n",
+    "                                \"filter\": [],\n",
+    "                                \"must\": [{\"match\": {\"text\": {\"query\": \"foo\"}}}],\n",
+    "                            }\n",
+    "                        },\n",
+    "                    },\n",
+    "                },\n",
+    "                {\n",
+    "                    \"knn\": {\n",
+    "                        \"field\": \"vector\",\n",
+    "                        \"filter\": [],\n",
+    "                        \"k\": 1,\n",
+    "                        \"num_candidates\": 50,\n",
+    "                        \"query_vector\": [1.0, ..., 0.0],\n",
+    "                    },\n",
+    "                },\n",
+    "            ]\n",
+    "        }\n",
+    "    }\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2822fbf7",
+   "metadata": {},
+   "source": [
+    "#### Example: Dense vector search with Embedding Model in Elasticsearch\n",
+    "\n",
+    "This example shows how to configure `ElasticsearchStore` to use an embedding model deployed in Elasticsearch for dense vector retrieval.\n",
+    "\n",
+    "To use this, specify the ID of the deployed model in the `DenseVectorStrategy` constructor via the `model_id` argument.\n",
+    "\n",
+    "**NOTE**: This requires the model to be deployed and running on an Elasticsearch ML node. See this [notebook example](https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/integrations/hugging-face/loading-model-from-hugging-face.ipynb) for how to deploy the model with `eland`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d97d9db4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "DENSE_SELF_DEPLOYED_INDEX_NAME = \"test-dense-self-deployed\"\n",
+    "\n",
+    "# Note: This does not have an embedding function specified\n",
+    "# Instead, we will use the embedding model deployed in Elasticsearch\n",
+    "db = ElasticsearchStore(\n",
+    "    es_cloud_id=\"<your cloud id>\",\n",
+    "    es_user=\"elastic\",\n",
+    "    es_password=\"<your password>\",\n",
+    "    index_name=DENSE_SELF_DEPLOYED_INDEX_NAME,\n",
+    "    query_field=\"text_field\",\n",
+    "    vector_query_field=\"vector_query_field.predicted_value\",\n",
+    "    strategy=DenseVectorStrategy(model_id=\"sentence-transformers__all-minilm-l6-v2\"),\n",
+    ")\n",
+    "\n",
+    "# Set up an ingest pipeline to perform the embedding\n",
+    "# of the text field\n",
+    "db.client.ingest.put_pipeline(\n",
+    "    id=\"test_pipeline\",\n",
+    "    processors=[\n",
+    "        {\n",
+    "            \"inference\": {\n",
+    "                \"model_id\": \"sentence-transformers__all-minilm-l6-v2\",\n",
+    "                \"field_map\": {\"query_field\": \"text_field\"},\n",
+    "                \"target_field\": \"vector_query_field\",\n",
+    "            }\n",
+    "        }\n",
+    "    ],\n",
+    ")\n",
+    "\n",
+    "# Create a new index that uses the pipeline,\n",
+    "# rather than relying on LangChain to create the index\n",
+    "db.client.indices.create(\n",
+    "    index=DENSE_SELF_DEPLOYED_INDEX_NAME,\n",
+    "    mappings={\n",
+    "        \"properties\": {\n",
+    "            \"text_field\": {\"type\": \"text\"},\n",
+    "            \"vector_query_field\": {\n",
+    "                \"properties\": {\n",
+    "                    \"predicted_value\": {\n",
+    "                        \"type\": \"dense_vector\",\n",
+    "                        \"dims\": 384,\n",
+    "                        \"index\": True,\n",
+    "                        \"similarity\": \"l2_norm\",\n",
+    "                    }\n",
+    "                }\n",
+    "            },\n",
+    "        }\n",
+    "    },\n",
+    "    settings={\"index\": {\"default_pipeline\": \"test_pipeline\"}},\n",
+    ")\n",
+    "\n",
+    "db = ElasticsearchStore.from_texts(\n",
+    "    [\"hello world\"],\n",
+    "    es_cloud_id=\"<cloud id>\",\n",
+    "    es_user=\"elastic\",\n",
+    "    es_password=\"<cloud password>\",\n",
+    "    index_name=DENSE_SELF_DEPLOYED_INDEX_NAME,\n",
+    "    query_field=\"text_field\",\n",
+    "    vector_query_field=\"vector_query_field.predicted_value\",\n",
+    "    strategy=DenseVectorStrategy(model_id=\"sentence-transformers__all-minilm-l6-v2\"),\n",
+    ")\n",
+    "\n",
+    "# Perform search\n",
+    "db.similarity_search(\"hello world\", k=10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b9651b01",
+   "metadata": {},
+   "source": [
+    "### SparseVectorStrategy (ELSER)\n",
+    "\n",
+    "This strategy uses Elasticsearch's sparse vector retrieval to retrieve the top-k results. We only support our own \"ELSER\" embedding model for now.\n",
+    "\n",
+    "**NOTE**: This requires the ELSER model to be deployed and running on an Elasticsearch ML node.\n",
+    "\n",
+    "To use this, specify `SparseVectorStrategy` (was called `SparseVectorRetrievalStrategy` prior to version 0.2.0) in the `ElasticsearchStore` constructor. You will need to provide a model ID."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c750ff57",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_elasticsearch import SparseVectorStrategy\n",
+    "\n",
+    "# Note that this example doesn't have an embedding function. This is because we infer the tokens at index time and at query time within Elasticsearch.\n",
+    "# This requires the ELSER model to be loaded and running in Elasticsearch.\n",
+    "db = ElasticsearchStore.from_documents(\n",
+    "    docs,\n",
+    "    es_cloud_id=\"<cloud id>\",\n",
+    "    es_user=\"elastic\",\n",
+    "    es_password=\"<cloud password>\",\n",
+    "    index_name=\"test-elser\",\n",
+    "    strategy=SparseVectorStrategy(model_id=\".elser_model_2\"),\n",
+    ")\n",
+    "\n",
+    "db.client.indices.refresh(index=\"test-elser\")\n",
+    "\n",
+    "results = db.similarity_search(\n",
+    "    \"What did the president say about Ketanji Brown Jackson\", k=4\n",
+    ")\n",
+    "print(results[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "416e224e",
+   "metadata": {},
+   "source": [
+    "### DenseVectorScriptScoreStrategy\n",
+    "\n",
+    "This strategy uses Elasticsearch's script score query to perform exact vector retrieval (also known as brute force) to retrieve the top-k results. (This strategy was called `ExactRetrievalStrategy` prior to version 0.2.0.)\n",
+    "\n",
+    "To use this, specify `DenseVectorScriptScoreStrategy` in the `ElasticsearchStore` constructor."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ced32701",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_elasticsearch import DenseVectorScriptScoreStrategy\n",
+    "\n",
+    "db = ElasticsearchStore.from_documents(\n",
+    "    docs,\n",
+    "    embeddings,\n",
+    "    es_url=\"http://localhost:9200\",\n",
+    "    index_name=\"test\",\n",
+    "    strategy=DenseVectorScriptScoreStrategy(),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "92c9cc33",
+   "metadata": {},
+   "source": [
+    "### BM25Strategy\n",
+    "\n",
+    "Finally, you can use full-text keyword search.\n",
+    "\n",
+    "To use this, specify `BM25Strategy` in the `ElasticsearchStore` constructor."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9fd59f69",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_elasticsearch import BM25Strategy\n",
+    "\n",
+    "db = ElasticsearchStore.from_documents(\n",
+    "    docs,\n",
+    "    es_url=\"http://localhost:9200\",\n",
+    "    index_name=\"test\",\n",
+    "    strategy=BM25Strategy(),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6556d3c6",
+   "metadata": {},
+   "source": [
+    "### BM25RetrievalStrategy\n",
+    "\n",
+    "This strategy allows the user to perform searches using pure BM25 without vector search.\n",
+    "\n",
+    "To use this, specify `BM25RetrievalStrategy` in the `ElasticsearchStore` constructor.\n",
+    "\n",
+    "Note that in the example below, the embedding option is not specified, indicating that the search is conducted without using embeddings.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "478af4bd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_elasticsearch import ElasticsearchStore\n",
+    "\n",
+    "db = ElasticsearchStore(\n",
+    "    es_url=\"http://localhost:9200\",\n",
+    "    index_name=\"test_index\",\n",
+    "    strategy=ElasticsearchStore.BM25RetrievalStrategy(),\n",
+    ")\n",
+    "\n",
+    "db.add_texts(\n",
+    "    [\"foo\", \"foo bar\", \"foo bar baz\", \"bar\", \"bar baz\", \"baz\"],\n",
+    ")\n",
+    "\n",
+    "results = db.similarity_search(query=\"foo\", k=10)\n",
+    "print(results)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ed899034",
+   "metadata": {},
+   "source": [
+    "## Customize the Query\n",
+    "\n",
+    "With the `custom_query` parameter at search time, you can adjust the query that is used to retrieve documents from Elasticsearch. This is useful if you want to use a more complex query, for example to support linear boosting of fields.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e0ab7c94",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Example of a custom query that just does a BM25 search on the text field.\n",
+    "def custom_query(query_body: dict, query: str):\n",
+    "    \"\"\"Custom query to be used in Elasticsearch.\n",
+    "    Args:\n",
+    "        query_body (dict): Elasticsearch query body.\n",
+    "        query (str): Query string.\n",
+    "    Returns:\n",
+    "        dict: Elasticsearch query body.\n",
+    "    \"\"\"\n",
+    "    print(\"Query Retriever created by the retrieval strategy:\")\n",
+    "    print(query_body)\n",
+    "    print()\n",
+    "\n",
+    "    new_query_body = {\"query\": {\"match\": {\"text\": query}}}\n",
+    "\n",
+    "    print(\"Query that is actually used in Elasticsearch:\")\n",
+    "    print(new_query_body)\n",
+    "    print()\n",
+    "\n",
+    "    return new_query_body\n",
+    "\n",
+    "\n",
+    "results = db.similarity_search(\n",
+    "    \"What did the president say about Ketanji Brown Jackson\",\n",
+    "    k=4,\n",
+    "    custom_query=custom_query,\n",
+    ")\n",
+    "print(\"Results:\")\n",
+    "print(results[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "15ebbe22",
+   "metadata": {},
+   "source": [
+    "## Customize the Document Builder\n",
+    "\n",
+    "With the `doc_builder` parameter at search time, you can adjust how a Document is built from the data retrieved from Elasticsearch. This is especially useful if you have indices that were not created using LangChain.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4cf81750",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from typing import Dict\n",
+    "\n",
+    "from langchain_core.documents import Document\n",
+    "\n",
+    "\n",
+    "def custom_document_builder(hit: Dict) -> Document:\n",
+    "    src = hit.get(\"_source\", {})\n",
+    "    return Document(\n",
+    "        page_content=src.get(\"content\", \"Missing content!\"),\n",
+    "        metadata={\n",
+    "            \"page_number\": src.get(\"page_number\", -1),\n",
+    "            \"original_filename\": src.get(\"original_filename\", \"Missing filename!\"),\n",
+    "        },\n",
+    "    )\n",
+    "\n",
+    "\n",
+    "results = db.similarity_search(\n",
+    "    \"What did the president say about Ketanji Brown Jackson\",\n",
+    "    k=4,\n",
+    "    doc_builder=custom_document_builder,\n",
+    ")\n",
+    "print(\"Results:\")\n",
+    "print(results[0])"
+   ]
+  },
  {
   "cell_type": "markdown",
   "id": "17b509ae",