mirror of
				https://github.com/hwchase17/langchain.git
				synced 2025-10-31 07:41:40 +00:00 
			
		
		
		
	For most queries it's the `size` parameter that determines final number of documents to return. Since our abstractions refer to this as `k`, set this to be `k` everywhere instead of expecting a separate param. Would be great to have someone more familiar with OpenSearch validate that this is reasonable (e.g. that having `size` and what OpenSearch calls `k` be the same won't lead to any strange behavior). cc @naveentatikonda Closes #5212
		
			
				
	
	
		
			318 lines
		
	
	
		
			8.6 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			318 lines
		
	
	
		
			8.6 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| {
 | |
|  "cells": [
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "683953b3",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "# OpenSearch\n",
 | |
|     "\n",
 | |
|     "> [OpenSearch](https://opensearch.org/) is a scalable, flexible, and extensible open-source software suite for search, analytics, and observability applications licensed under Apache 2.0. `OpenSearch` is a distributed search and analytics engine based on `Apache Lucene`.\n",
 | |
|     "\n",
 | |
|     "\n",
 | |
|     "This notebook shows how to use functionality related to the `OpenSearch` database.\n",
 | |
|     "\n",
 | |
|     "To run, you should have an OpenSearch instance up and running: [see here for an easy Docker installation](https://hub.docker.com/r/opensearchproject/opensearch).\n",
 | |
|     "\n",
 | |
|     "`similarity_search` by default performs the Approximate k-NN Search which uses one of the several algorithms like lucene, nmslib, faiss recommended for\n",
 | |
|     "large datasets. To perform brute force search we have other search methods known as Script Scoring and Painless Scripting.\n",
 | |
|     "Check [this](https://opensearch.org/docs/latest/search-plugins/knn/index/) for more details."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "94963977-9dfc-48b7-872a-53f2947f46c6",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Installation\n",
 | |
|     "Install the Python client."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "6e606066-9386-4427-8a87-1b93f435c57e",
 | |
|    "metadata": {
 | |
|     "tags": []
 | |
|    },
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "!pip install opensearch-py"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "b1fa637e-4fbf-4d5a-9188-2cad826a193e",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "We want to use OpenAIEmbeddings so we have to get the OpenAI API Key."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "28e5455e-322d-4010-9e3b-491d522ef5db",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "import os\n",
 | |
|     "import getpass\n",
 | |
|     "\n",
 | |
|     "os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "aac9563e",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "from langchain.embeddings.openai import OpenAIEmbeddings\n",
 | |
|     "from langchain.text_splitter import CharacterTextSplitter\n",
 | |
|     "from langchain.vectorstores import OpenSearchVectorSearch\n",
 | |
|     "from langchain.document_loaders import TextLoader"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "a3c3999a",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "from langchain.document_loaders import TextLoader\n",
 | |
|     "loader = TextLoader('../../../state_of_the_union.txt')\n",
 | |
|     "documents = loader.load()\n",
 | |
|     "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
 | |
|     "docs = text_splitter.split_documents(documents)\n",
 | |
|     "\n",
 | |
|     "embeddings = OpenAIEmbeddings()"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "01a9a035",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### similarity_search using Approximate k-NN\n",
 | |
|     "\n",
 | |
|     "`similarity_search` using `Approximate k-NN` Search with Custom Parameters"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "803fe12b",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "docsearch = OpenSearchVectorSearch.from_documents(\n",
 | |
|     "    docs, \n",
 | |
|     "    embeddings, \n",
 | |
|     "    opensearch_url=\"http://localhost:9200\"\n",
 | |
|     ")\n",
 | |
|     "\n",
 | |
|     "# If using the default Docker installation, use this instantiation instead:\n",
 | |
|     "# docsearch = OpenSearchVectorSearch.from_documents(\n",
 | |
|     "#     docs, \n",
 | |
|     "#     embeddings, \n",
 | |
|     "#     opensearch_url=\"https://localhost:9200\", \n",
 | |
|     "#     http_auth=(\"admin\", \"admin\"),     \n",
 | |
|     "#     use_ssl = False,\n",
 | |
|     "#     verify_certs = False,\n",
 | |
|     "#     ssl_assert_hostname = False,\n",
 | |
|     "#     ssl_show_warn = False,\n",
 | |
|     "# )"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "db3fa309",
 | |
|    "metadata": {
 | |
|     "pycharm": {
 | |
|      "name": "#%%\n"
 | |
|     }
 | |
|    },
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "query = \"What did the president say about Ketanji Brown Jackson\"\n",
 | |
|     "docs = docsearch.similarity_search(query, k=10)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "c160d5bb",
 | |
|    "metadata": {
 | |
|     "pycharm": {
 | |
|      "name": "#%%\n"
 | |
|     }
 | |
|    },
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "print(docs[0].page_content)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "96215c90",
 | |
|    "metadata": {
 | |
|     "pycharm": {
 | |
|      "name": "#%%\n"
 | |
|     }
 | |
|    },
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "docsearch = OpenSearchVectorSearch.from_documents(docs, embeddings, opensearch_url=\"http://localhost:9200\", engine=\"faiss\", space_type=\"innerproduct\", ef_construction=256, m=48)\n",
 | |
|     "\n",
 | |
|     "query = \"What did the president say about Ketanji Brown Jackson\"\n",
 | |
|     "docs = docsearch.similarity_search(query)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "62a7cea0",
 | |
|    "metadata": {
 | |
|     "pycharm": {
 | |
|      "name": "#%%\n"
 | |
|     }
 | |
|    },
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "print(docs[0].page_content)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "0d0cd877",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### similarity_search using Script Scoring\n",
 | |
|     "\n",
 | |
|     "`similarity_search` using `Script Scoring` with Custom Parameters"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "0a8e3c0e",
 | |
|    "metadata": {
 | |
|     "pycharm": {
 | |
|      "name": "#%%\n"
 | |
|     }
 | |
|    },
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "docsearch = OpenSearchVectorSearch.from_documents(docs, embeddings, opensearch_url=\"http://localhost:9200\", is_appx_search=False)\n",
 | |
|     "\n",
 | |
|     "query = \"What did the president say about Ketanji Brown Jackson\"\n",
 | |
|     "docs = docsearch.similarity_search(\"What did the president say about Ketanji Brown Jackson\", k=1, search_type=\"script_scoring\")"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "92bc40db",
 | |
|    "metadata": {
 | |
|     "pycharm": {
 | |
|      "name": "#%%\n"
 | |
|     }
 | |
|    },
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "print(docs[0].page_content)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "a4af96cc",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### similarity_search using Painless Scripting\n",
 | |
|     "\n",
 | |
|     "`similarity_search` using `Painless Scripting` with Custom Parameters"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "6d9f436e",
 | |
|    "metadata": {
 | |
|     "pycharm": {
 | |
|      "name": "#%%\n"
 | |
|     }
 | |
|    },
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "docsearch = OpenSearchVectorSearch.from_documents(docs, embeddings, opensearch_url=\"http://localhost:9200\", is_appx_search=False)\n",
 | |
|     "filter = {\"bool\": {\"filter\": {\"term\": {\"text\": \"smuggling\"}}}}\n",
 | |
|     "query = \"What did the president say about Ketanji Brown Jackson\"\n",
 | |
|     "docs = docsearch.similarity_search(\"What did the president say about Ketanji Brown Jackson\", search_type=\"painless_scripting\", space_type=\"cosineSimilarity\", pre_filter=filter)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "8ca50bce",
 | |
|    "metadata": {
 | |
|     "pycharm": {
 | |
|      "name": "#%%\n"
 | |
|     }
 | |
|    },
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "print(docs[0].page_content)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "id": "73264864",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### Using a preexisting OpenSearch instance\n",
 | |
|     "\n",
 | |
|     "It's also possible to use a preexisting OpenSearch instance with documents that already have vectors present."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "id": "82a23440",
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "# this is just an example, you would need to change these values to point to another opensearch instance\n",
 | |
|     "docsearch = OpenSearchVectorSearch(index_name=\"index-*\", embedding_function=embeddings, opensearch_url=\"http://localhost:9200\")\n",
 | |
|     "\n",
 | |
|     "# you can specify custom field names to match the fields you're using to store your embedding, document text value, and metadata\n",
 | |
|     "docs = docsearch.similarity_search(\"Who was asking about getting lunch today?\", search_type=\"script_scoring\", space_type=\"cosinesimil\", vector_field=\"message_embedding\", text_field=\"message\", metadata_field=\"message_metadata\")"
 | |
|    ]
 | |
|   }
 | |
|  ],
 | |
|  "metadata": {
 | |
|   "kernelspec": {
 | |
|    "display_name": "Python 3 (ipykernel)",
 | |
|    "language": "python",
 | |
|    "name": "python3"
 | |
|   },
 | |
|   "language_info": {
 | |
|    "codemirror_mode": {
 | |
|     "name": "ipython",
 | |
|     "version": 3
 | |
|    },
 | |
|    "file_extension": ".py",
 | |
|    "mimetype": "text/x-python",
 | |
|    "name": "python",
 | |
|    "nbconvert_exporter": "python",
 | |
|    "pygments_lexer": "ipython3",
 | |
|    "version": "3.11.3"
 | |
|   }
 | |
|  },
 | |
|  "nbformat": 4,
 | |
|  "nbformat_minor": 5
 | |
| }
 |