langchain/docs/docs/integrations/vectorstores/elasticsearch.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "683953b3",
   "metadata": {
    "id": "683953b3"
   },
   "source": [
    "# Elasticsearch\n",
    "\n",
    ">[Elasticsearch](https://www.elastic.co/elasticsearch/) is a distributed, RESTful search and analytics engine, capable of performing both vector and lexical search. It is built on top of the Apache Lucene library. \n",
    "\n",
    "This notebook shows how to use functionality related to the `Elasticsearch` vector store.\n",
    "\n",
    "## Setup\n",
    "\n",
    "In order to use the `Elasticsearch` vector search you must install the `langchain-elasticsearch` package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5bbffe2",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -qU langchain-elasticsearch"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b66c12b2-2a07-4136-ac77-ce1c9fa7a409",
   "metadata": {
    "id": "b66c12b2-2a07-4136-ac77-ce1c9fa7a409",
    "tags": []
   },
   "source": [
    "### Credentials"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81f43794-f002-477c-9b68-4975df30e718",
   "metadata": {
    "id": "81f43794-f002-477c-9b68-4975df30e718"
   },
   "source": [
    "There are two main ways to setup an Elasticsearch instance for use with:\n",
    "\n",
    "1. Elastic Cloud: Elastic Cloud is a managed Elasticsearch service. Signup for a [free trial](https://cloud.elastic.co/registration?utm_source=langchain&utm_content=documentation).\n",
    "\n",
    "To connect to an Elasticsearch instance that does not require\n",
    "login credentials (starting the docker instance with security enabled), pass the Elasticsearch URL and index name along with the\n",
    "embedding object to the constructor.\n",
    "\n",
    "2. Local Install Elasticsearch: Get started with Elasticsearch by running it locally. The easiest way is to use the official Elasticsearch Docker image. See the [Elasticsearch Docker documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html) for more information.\n",
    "\n",
    "\n",
    "### Running Elasticsearch via Docker \n",
    "Example: Run a single-node Elasticsearch instance with security disabled. This is not recommended for production use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22942b0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "%docker run -p 9200:9200 -e \"discovery.type=single-node\" -e \"xpack.security.enabled=false\" -e \"xpack.security.http.ssl.enabled=false\" docker.elastic.co/elasticsearch/elasticsearch:8.12.1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42e3846d",
   "metadata": {},
   "source": [
    "\n",
    "### Running with Authentication\n",
    "For production, we recommend you run with security enabled. To connect with login credentials, you can use the parameters `es_api_key` or `es_user` and `es_password`.\n",
    "\n",
    "```{=mdx}\n",
    "import EmbeddingTabs from \"@theme/EmbeddingTabs\";\n",
    "\n",
    "<EmbeddingTabs/>\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "91399482",
   "metadata": {},
   "outputs": [],
   "source": [
    "# | output: false\n",
    "# | echo: false\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "\n",
    "embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "3fa3cffa",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_elasticsearch import ElasticsearchStore\n",
    "\n",
    "elastic_vector_search = ElasticsearchStore(\n",
    "    es_url=\"http://localhost:9200\",\n",
    "    index_name=\"langchain_index\",\n",
    "    embedding=embeddings,\n",
    "    es_user=\"elastic\",\n",
    "    es_password=\"changeme\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d1fa432",
   "metadata": {},
   "source": [
    "#### How to obtain a password for the default \"elastic\" user?\n",
    "\n",
    "To obtain your Elastic Cloud password for the default \"elastic\" user:\n",
    "1. Log in to the Elastic Cloud console at https://cloud.elastic.co\n",
    "2. Go to \"Security\" > \"Users\"\n",
    "3. Locate the \"elastic\" user and click \"Edit\"\n",
    "4. Click \"Reset password\"\n",
    "5. Follow the prompts to reset the password\n",
    "\n",
    "#### How to obtain an API key?\n",
    "\n",
    "To obtain an API key:\n",
    "1. Log in to the Elastic Cloud console at https://cloud.elastic.co\n",
    "2. Open Kibana and go to Stack Management > API Keys\n",
    "3. Click \"Create API key\"\n",
    "4. Enter a name for the API key and click \"Create\"\n",
    "5. Copy the API key and paste it into the `api_key` parameter\n",
    "\n",
    "### Elastic Cloud\n",
    "\n",
    "To connect to an Elasticsearch instance on Elastic Cloud, you can use either the `es_cloud_id` parameter or `es_url`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9a0cb623",
   "metadata": {},
   "outputs": [],
   "source": [
    "elastic_vector_search = ElasticsearchStore(\n",
    "    es_cloud_id=\"<cloud_id>\",\n",
    "    index_name=\"test_index\",\n",
    "    embedding=embeddings,\n",
    "    es_user=\"elastic\",\n",
    "    es_password=\"changeme\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "406eefe8",
   "metadata": {},
   "source": [
    "If you want to get best in-class automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b109bf1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
    "# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6030187-0bd7-4798-8372-a265036af5e0",
   "metadata": {
    "id": "f6030187-0bd7-4798-8372-a265036af5e0",
    "tags": []
   },
   "source": [
    "## Initialization\n",
    "\n",
    "Elasticsearch is running locally on localhost:9200 with [docker](#running-elasticsearch-via-docker). For more details on how to connect to Elasticsearch from Elastic Cloud, see [connecting with authentication](#running-with-authentication) above.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "aac9563e",
   "metadata": {
    "id": "aac9563e",
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain_elasticsearch import ElasticsearchStore\n",
    "\n",
    "vector_store = ElasticsearchStore(\n",
    "    \"langchain-demo\", embedding=embeddings, es_url=\"http://localhost:9201\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eede246c",
   "metadata": {},
   "source": [
    "## Manage vector store\n",
    "\n",
    "### Add items to vector store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "12eb86d8",
   "metadata": {
    "id": "12eb86d8",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['21cca03c-9089-42d2-b41c-3d156be2b519',\n",
       " 'a6ceb967-b552-4802-bb06-c0e95fce386e',\n",
       " '3a35fac4-e5f0-493b-bee0-9143b41aedae',\n",
       " '176da099-66b1-4d6a-811b-dfdfe0808d30',\n",
       " 'ecfa1a30-3c97-408b-80c0-5c43d68bf5ff',\n",
       " 'c0f08baa-e70b-4f83-b387-c6e0a0f36f73',\n",
       " '489b2c9c-1925-43e1-bcf0-0fa94cf1cbc4',\n",
       " '408c6503-9ba4-49fd-b1cc-95584cd914c5',\n",
       " '5248c899-16d5-4377-a9e9-736ca443ad4f',\n",
       " 'ca182769-c4fc-4e25-8f0a-8dd0a525955c']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from uuid import uuid4\n",
    "\n",
    "from langchain_core.documents import Document\n",
    "\n",
    "document_1 = Document(\n",
    "    page_content=\"I had chocalate chip pancakes and scrambled eggs for breakfast this morning.\",\n",
    "    metadata={\"source\": \"tweet\"},\n",
    ")\n",
    "\n",
    "document_2 = Document(\n",
    "    page_content=\"The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.\",\n",
    "    metadata={\"source\": \"news\"},\n",
    ")\n",
    "\n",
    "document_3 = Document(\n",
    "    page_content=\"Building an exciting new project with LangChain - come check it out!\",\n",
    "    metadata={\"source\": \"tweet\"},\n",
    ")\n",
    "\n",
    "document_4 = Document(\n",
    "    page_content=\"Robbers broke into the city bank and stole $1 million in cash.\",\n",
    "    metadata={\"source\": \"news\"},\n",
    ")\n",
    "\n",
    "document_5 = Document(\n",
    "    page_content=\"Wow! That was an amazing movie. I can't wait to see it again.\",\n",
    "    metadata={\"source\": \"tweet\"},\n",
    ")\n",
    "\n",
    "document_6 = Document(\n",
    "    page_content=\"Is the new iPhone worth the price? Read this review to find out.\",\n",
    "    metadata={\"source\": \"website\"},\n",
    ")\n",
    "\n",
    "document_7 = Document(\n",
    "    page_content=\"The top 10 soccer players in the world right now.\",\n",
    "    metadata={\"source\": \"website\"},\n",
    ")\n",
    "\n",
    "document_8 = Document(\n",
    "    page_content=\"LangGraph is the best framework for building stateful, agentic applications!\",\n",
    "    metadata={\"source\": \"tweet\"},\n",
    ")\n",
    "\n",
    "document_9 = Document(\n",
    "    page_content=\"The stock market is down 500 points today due to fears of a recession.\",\n",
    "    metadata={\"source\": \"news\"},\n",
    ")\n",
    "\n",
    "document_10 = Document(\n",
    "    page_content=\"I have a bad feeling I am going to get deleted :(\",\n",
    "    metadata={\"source\": \"tweet\"},\n",
    ")\n",
    "\n",
    "documents = [\n",
    "    document_1,\n",
    "    document_2,\n",
    "    document_3,\n",
    "    document_4,\n",
    "    document_5,\n",
    "    document_6,\n",
    "    document_7,\n",
    "    document_8,\n",
    "    document_9,\n",
    "    document_10,\n",
    "]\n",
    "uuids = [str(uuid4()) for _ in range(len(documents))]\n",
    "\n",
    "vector_store.add_documents(documents=documents, ids=uuids)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a549e3d",
   "metadata": {},
   "source": [
    "### Delete items from vector store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "31c3b785",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vector_store.delete(ids=[uuids[-1]])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "674bcab2",
   "metadata": {},
   "source": [
    "## Query vector store\n",
    "\n",
    "Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent. These examples also show how to use filtering when searching.\n",
    "\n",
    "### Query directly\n",
    "\n",
    "#### Similarity search\n",
    "\n",
    "Performing a simple similarity search with filtering on metadata can be done as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "da079ceb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]\n",
      "* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]\n"
     ]
    }
   ],
   "source": [
    "results = vector_store.similarity_search(\n",
    "    query=\"LangChain provides abstractions to make working with LLMs easy\",\n",
    "    k=2,\n",
    "    filter=[{\"term\": {\"metadata.source.keyword\": \"tweet\"}}],\n",
    ")\n",
    "for res in results:\n",
    "    print(f\"* {res.page_content} [{res.metadata}]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0fda72e",
   "metadata": {},
   "source": [
    "#### Similarity search with score\n",
    "\n",
    "If you want to execute a similarity search and receive the corresponding scores you can run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "1013c9e8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "* [SIM=0.765887] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]\n"
     ]
    }
   ],
   "source": [
    "results = vector_store.similarity_search_with_score(\n",
    "    query=\"Will it be hot tomorrow\",\n",
    "    k=1,\n",
    "    filter=[{\"term\": {\"metadata.source.keyword\": \"news\"}}],\n",
    ")\n",
    "for doc, score in results:\n",
    "    print(f\"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f2c7b5c",
   "metadata": {},
   "source": [
    "### Query by turning into retriever\n",
    "\n",
    "You can also transform the vector store into a retriever for easier usage in your chains. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "2db8b6a5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.'),\n",
       " Document(metadata={'source': 'news'}, page_content='The stock market is down 500 points today due to fears of a recession.'),\n",
       " Document(metadata={'source': 'website'}, page_content='Is the new iPhone worth the price? Read this review to find out.'),\n",
       " Document(metadata={'source': 'tweet'}, page_content='Building an exciting new project with LangChain - come check it out!')]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retriever = vector_store.as_retriever(\n",
    "    search_type=\"similarity_score_threshold\", search_kwargs={\"score_threshold\": 0.2}\n",
    ")\n",
    "retriever.invoke(\"Stealing from the bank is a crime\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17b509ae",
   "metadata": {},
   "source": [
    "## Chain usage\n",
    "\n",
    "The code below shows how to use the vector store as a retriever in a simple RAG chain:\n",
    "\n",
    "```{=mdx}\n",
    "import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
    "\n",
    "<ChatModelTabs customVarName=\"llm\" />\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "58e17804",
   "metadata": {},
   "outputs": [],
   "source": [
    "# | output: false\n",
    "# | echo: false\n",
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "llm = ChatOpenAI(model=\"gpt-4o-mini\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "01dac420",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'LanGraph is used for building stateful, agentic applications. It serves as a framework to facilitate the development of exciting new projects.'"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain import hub\n",
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.runnables import RunnablePassthrough\n",
    "\n",
    "prompt = hub.pull(\"rlm/rag-prompt\")\n",
    "\n",
    "\n",
    "def format_docs(docs):\n",
    "    return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
    "\n",
    "\n",
    "rag_chain = (\n",
    "    {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
    "    | prompt\n",
    "    | llm\n",
    "    | StrOutputParser()\n",
    ")\n",
    "\n",
    "rag_chain.invoke(\"What is LanGraph used for?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3242fd42",
   "metadata": {},
   "source": [
    "# FAQ\n",
    "\n",
    "## Question: Im getting timeout errors when indexing documents into Elasticsearch. How do I fix this?\n",
    "One possible issue is your documents might take longer to index into Elasticsearch. ElasticsearchStore uses the Elasticsearch bulk API which has a few defaults that you can adjust to reduce the chance of timeout errors.\n",
    "\n",
    "This is also a good idea when you're using SparseVectorRetrievalStrategy.\n",
    "\n",
    "The defaults are:\n",
    "- `chunk_size`: 500\n",
    "- `max_chunk_bytes`: 100MB\n",
    "\n",
    "To adjust these, you can pass in the `chunk_size` and `max_chunk_bytes` parameters to the ElasticsearchStore `add_texts` method.\n",
    "\n",
    "```python\n",
    "    vector_store.add_texts(\n",
    "        texts,\n",
    "        bulk_kwargs={\n",
    "            \"chunk_size\": 50,\n",
    "            \"max_chunk_bytes\": 200000000\n",
    "        }\n",
    "    )\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "604c66ea",
   "metadata": {},
   "source": [
    "# Upgrading to ElasticsearchStore\n",
    "\n",
    "If you're already using Elasticsearch in your langchain based project, you may be using the old implementations: `ElasticVectorSearch` and `ElasticKNNSearch` which are now deprecated. We've introduced a new implementation called `ElasticsearchStore` which is more flexible and easier to use. This notebook will guide you through the process of upgrading to the new implementation.\n",
    "\n",
    "## What's new?\n",
    "\n",
    "The new implementation is now one class called `ElasticsearchStore` which can be used for approximate dense vector, exact dense vector, sparse vector (ELSER), BM25 retrieval and hybrid retrieval, via strategies.\n",
    "\n",
    "## I am using ElasticKNNSearch\n",
    "\n",
    "Old implementation:\n",
    "\n",
    "```python\n",
    "\n",
    "from langchain_community.vectorstores.elastic_vector_search import ElasticKNNSearch\n",
    "\n",
    "db = ElasticKNNSearch(\n",
    "  elasticsearch_url=\"http://localhost:9200\",\n",
    "  index_name=\"test_index\",\n",
    "  embedding=embedding\n",
    ")\n",
    "\n",
    "```\n",
    "\n",
    "New implementation:\n",
    "\n",
    "```python\n",
    "\n",
    "from langchain_elasticsearch import ElasticsearchStore, DenseVectorStrategy\n",
    "\n",
    "db = ElasticsearchStore(\n",
    "  es_url=\"http://localhost:9200\",\n",
    "  index_name=\"test_index\",\n",
    "  embedding=embedding,\n",
    "  # if you use the model_id\n",
    "  # strategy=DenseVectorStrategy(model_id=\"test_model\")\n",
    "  # if you use hybrid search\n",
    "  # strategy=DenseVectorStrategy(hybrid=True)\n",
    ")\n",
    "\n",
    "```\n",
    "\n",
    "## I am using ElasticVectorSearch\n",
    "\n",
    "Old implementation:\n",
    "\n",
    "```python\n",
    "\n",
    "from langchain_community.vectorstores.elastic_vector_search import ElasticVectorSearch\n",
    "\n",
    "db = ElasticVectorSearch(\n",
    "  elasticsearch_url=\"http://localhost:9200\",\n",
    "  index_name=\"test_index\",\n",
    "  embedding=embedding\n",
    ")\n",
    "\n",
    "```\n",
    "\n",
    "New implementation:\n",
    "\n",
    "```python\n",
    "\n",
    "from langchain_elasticsearch import ElasticsearchStore, DenseVectorScriptScoreStrategy\n",
    "\n",
    "db = ElasticsearchStore(\n",
    "  es_url=\"http://localhost:9200\",\n",
    "  index_name=\"test_index\",\n",
    "  embedding=embedding,\n",
    "  strategy=DenseVectorScriptScoreStrategy()\n",
    ")\n",
    "\n",
    "```\n",
    "\n",
    "```python\n",
    "db.client.indices.delete(\n",
    "    index=\"test-metadata, test-elser, test-basic\",\n",
    "    ignore_unavailable=True,\n",
    "    allow_no_indices=True,\n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33388871",
   "metadata": {},
   "source": [
    "## API reference\n",
    "\n",
    "For detailed documentation of all `ElasticSearchStore` features and configurations head to the API reference: https://api.python.langchain.com/en/latest/vectorstores/langchain_elasticsearch.vectorstores.ElasticsearchStore.html"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}