Add Support for OpenSearch Vector database (#1191)

### Description
This PR adds a wrapper which adds support for the OpenSearch vector
database. Using opensearch-py client we are ingesting the embeddings of
given text into opensearch cluster using Bulk API. We can perform the
`similarity_search` on the index using the 3 popular searching methods
of OpenSearch k-NN plugin:

- `Approximate k-NN Search` use approximate nearest neighbor (ANN)
algorithms from the [nmslib](https://github.com/nmslib/nmslib),
[faiss](https://github.com/facebookresearch/faiss), and
[Lucene](https://lucene.apache.org/) libraries to power k-NN search.
- `Script Scoring` extends OpenSearch’s script scoring functionality to
execute a brute force, exact k-NN search.
- `Painless Scripting` adds the distance functions as painless
extensions that can be used in more complex combinations. Also, supports
brute force, exact k-NN search like Script Scoring.

### Issues Resolved 
https://github.com/hwchase17/langchain/issues/1054

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
This commit is contained in:
Naveen Tatikonda
2023-02-20 20:39:34 -06:00
committed by GitHub
parent c5015d77e2
commit 0118706fd6
10 changed files with 786 additions and 5 deletions

View File

@@ -0,0 +1,21 @@
# OpenSearch
This page covers how to use the OpenSearch ecosystem within LangChain.
It is broken into two parts: installation and setup, and then references to specific OpenSearch wrappers.
## Installation and Setup
- Install the Python package with `pip install opensearch-py`
## Wrappers
### VectorStore
There exists a wrapper around OpenSearch vector databases, allowing you to use it as a vectorstore
for semantic search using approximate vector search powered by lucene, nmslib and faiss engines
or using painless scripting and script scoring functions for bruteforce vector search.
To import this vectorstore:
```python
from langchain.vectorstores import OpenSearchVectorSearch
```
For a more detailed walkthrough of the OpenSearch wrapper, see [this notebook](../modules/indexes/vectorstore_examples/opensearch.ipynb)

View File

@@ -732,4 +732,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}

View File

@@ -215,4 +215,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}

View File

@@ -0,0 +1,220 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "683953b3",
"metadata": {},
"source": [
"# OpenSearch\n",
"\n",
"This notebook shows how to use functionality related to the OpenSearch database.\n",
"\n",
"To run, you should have the opensearch instance up and running: [here](https://opensearch.org/docs/latest/install-and-configure/install-opensearch/index/)\n",
"`similarity_search` by default performs the Approximate k-NN Search which uses one of the several algorithms like lucene, nmslib, faiss recommended for\n",
"large datasets. To perform brute force search we have other search methods known as Script Scoring and Painless Scripting.\n",
"Check [this](https://opensearch.org/docs/latest/search-plugins/knn/index/) for more details."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "aac9563e",
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.vectorstores import OpenSearchVectorSearch\n",
"from langchain.document_loaders import TextLoader"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a3c3999a",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"loader = TextLoader('../../state_of_the_union.txt')\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"docsearch = OpenSearchVectorSearch.from_texts(texts, embeddings, opensearch_url=\"http://localhost:9200\")\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = docsearch.similarity_search(query)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"print(docs[0].page_content)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"#### similarity_search using Approximate k-NN Search with Custom Parameters"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"docsearch = OpenSearchVectorSearch.from_texts(texts, embeddings, opensearch_url=\"http://localhost:9200\", engine=\"faiss\", space_type=\"innerproduct\", ef_construction=256, m=48)\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = docsearch.similarity_search(query)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"print(docs[0].page_content)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"#### similarity_search using Script Scoring with Custom Parameters"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"docsearch = OpenSearchVectorSearch.from_texts(texts, embeddings, opensearch_url=\"http://localhost:9200\", is_appx_search=False)\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = docsearch.similarity_search(\"What did the president say about Ketanji Brown Jackson\", k=1, search_type=\"script_scoring\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"print(docs[0].page_content)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"#### similarity_search using Painless Scripting with Custom Parameters"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"docsearch = OpenSearchVectorSearch.from_texts(texts, embeddings, opensearch_url=\"http://localhost:9200\", is_appx_search=False)\n",
"filter = {\"bool\": {\"filter\": {\"term\": {\"text\": \"smuggling\"}}}}\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = docsearch.similarity_search(\"What did the president say about Ketanji Brown Jackson\", search_type=\"painless_scripting\", space_type=\"cosineSimilarity\", pre_filter=filter)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"print(docs[0].page_content)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -47,5 +47,9 @@ The following use cases require specific installs and api keys:
- Install requirements with `pip install faiss` for Python 3.7 and `pip install faiss-cpu` for Python 3.10+.
- _Manifest_:
- Install requirements with `pip install manifest-ml` (Note: this is only available in Python 3.8+ currently).
- _OpenSearch_:
- Install requirements with `pip install opensearch-py`
- If you want to set up OpenSearch on your local, [here](https://opensearch.org/docs/latest/)
If you are using the `NLTKTextSplitter` or the `SpacyTextSplitter`, you will also need to install the appropriate models. For example, if you want to use the `SpacyTextSplitter`, you will need to install the `en_core_web_sm` model with `python -m spacy download en_core_web_sm`. Similarly, if you want to use the `NLTKTextSplitter`, you will need to install the `punkt` model with `python -m nltk.downloader punkt`.