Files
langchain/docs/modules/indexes/vectorstores/examples/elasticsearch.ipynb
Jeff Vestal d1f65d8dc1 Es knn index search 5346 (#5569)
# Create elastic_vector_search.ElasticKnnSearch class

This extends `langchain/vectorstores/elastic_vector_search.py` by adding
a new class `ElasticKnnSearch`

Features:
- Allow creating an index with the `dense_vector` mapping compataible
with kNN search
- Store embeddings in index for use with kNN search (correct mapping
creates HNSW data structure)
- Perform approximate kNN search
- Perform hybrid BM25 (`query{}`) + kNN (`knn{}`) search
- perform knn search by either providing a `query_vector` or passing a
hosted `model_id` to use query_vector_builder to automatically generate
a query_vector at search time

Connection options
- Using `cloud_id` from Elastic Cloud
- Passing elasticsearch client object

search options
- query
- k
- query_vector
- model_id
- size
- source
- knn_boost (hybrid search)
- query_boost (hybrid search)
- fields


This also adds examples to
`docs/modules/indexes/vectorstores/examples/elasticsearch.ipynb`


Fixes # [5346](https://github.com/hwchase17/langchain/issues/5346)

cc: @dev2049

 -->

---------

Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
2023-06-02 08:40:35 -07:00

580 lines
18 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "markdown",
"id": "683953b3",
"metadata": {
"id": "683953b3"
},
"source": [
"# ElasticSearch\n",
"\n",
">[Elasticsearch](https://www.elastic.co/elasticsearch/) is a distributed, RESTful search and analytics engine. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.\n",
"\n",
"This notebook shows how to use functionality related to the `Elasticsearch` database."
]
},
{
"cell_type": "markdown",
"source": [
"# ElasticVectorSearch class"
],
"metadata": {
"id": "tKSYjyTBtSLc"
},
"id": "tKSYjyTBtSLc"
},
{
"cell_type": "markdown",
"id": "b66c12b2-2a07-4136-ac77-ce1c9fa7a409",
"metadata": {
"tags": [],
"id": "b66c12b2-2a07-4136-ac77-ce1c9fa7a409"
},
"source": [
"## Installation"
]
},
{
"cell_type": "markdown",
"id": "81f43794-f002-477c-9b68-4975df30e718",
"metadata": {
"id": "81f43794-f002-477c-9b68-4975df30e718"
},
"source": [
"Check out [Elasticsearch installation instructions](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html).\n",
"\n",
"To connect to an Elasticsearch instance that does not require\n",
"login credentials, pass the Elasticsearch URL and index name along with the\n",
"embedding object to the constructor.\n",
"\n",
"Example:\n",
"```python\n",
" from langchain import ElasticVectorSearch\n",
" from langchain.embeddings import OpenAIEmbeddings\n",
"\n",
" embedding = OpenAIEmbeddings()\n",
" elastic_vector_search = ElasticVectorSearch(\n",
" elasticsearch_url=\"http://localhost:9200\",\n",
" index_name=\"test_index\",\n",
" embedding=embedding\n",
" )\n",
"```\n",
"\n",
"To connect to an Elasticsearch instance that requires login credentials,\n",
"including Elastic Cloud, use the Elasticsearch URL format\n",
"https://username:password@es_host:9243. For example, to connect to Elastic\n",
"Cloud, create the Elasticsearch URL with the required authentication details and\n",
"pass it to the ElasticVectorSearch constructor as the named parameter\n",
"elasticsearch_url.\n",
"\n",
"You can obtain your Elastic Cloud URL and login credentials by logging in to the\n",
"Elastic Cloud console at https://cloud.elastic.co, selecting your deployment, and\n",
"navigating to the \"Deployments\" page.\n",
"\n",
"To obtain your Elastic Cloud password for the default \"elastic\" user:\n",
"1. Log in to the Elastic Cloud console at https://cloud.elastic.co\n",
"2. Go to \"Security\" > \"Users\"\n",
"3. Locate the \"elastic\" user and click \"Edit\"\n",
"4. Click \"Reset password\"\n",
"5. Follow the prompts to reset the password\n",
"\n",
"Format for Elastic Cloud URLs is\n",
"https://username:password@cluster_id.region_id.gcp.cloud.es.io:9243.\n",
"\n",
"Example:\n",
"```python\n",
" from langchain import ElasticVectorSearch\n",
" from langchain.embeddings import OpenAIEmbeddings\n",
"\n",
" embedding = OpenAIEmbeddings()\n",
"\n",
" elastic_host = \"cluster_id.region_id.gcp.cloud.es.io\"\n",
" elasticsearch_url = f\"https://username:password@{elastic_host}:9243\"\n",
" elastic_vector_search = ElasticVectorSearch(\n",
" elasticsearch_url=elasticsearch_url,\n",
" index_name=\"test_index\",\n",
" embedding=embedding\n",
" )\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d6197931-cbe5-460c-a5e6-b5eedb83887c",
"metadata": {
"tags": [],
"id": "d6197931-cbe5-460c-a5e6-b5eedb83887c"
},
"outputs": [],
"source": [
"!pip install elasticsearch"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "67ab8afa-f7c6-4fbf-b596-cb512da949da",
"metadata": {
"tags": [],
"id": "67ab8afa-f7c6-4fbf-b596-cb512da949da",
"outputId": "fd16b37f-cb76-40a9-b83f-eab58dd0d912"
},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
"OpenAI API Key: ········\n"
]
}
],
"source": [
"import os\n",
"import getpass\n",
"\n",
"os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')"
]
},
{
"cell_type": "markdown",
"id": "f6030187-0bd7-4798-8372-a265036af5e0",
"metadata": {
"tags": [],
"id": "f6030187-0bd7-4798-8372-a265036af5e0"
},
"source": [
"## Example"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aac9563e",
"metadata": {
"tags": [],
"id": "aac9563e"
},
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.vectorstores import ElasticVectorSearch\n",
"from langchain.document_loaders import TextLoader"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3c3999a",
"metadata": {
"tags": [],
"id": "a3c3999a"
},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"loader = TextLoader('../../../state_of_the_union.txt')\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "12eb86d8",
"metadata": {
"tags": [],
"id": "12eb86d8"
},
"outputs": [],
"source": [
"db = ElasticVectorSearch.from_documents(docs, embeddings, elasticsearch_url=\"http://localhost:9200\")\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = db.similarity_search(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4b172de8",
"metadata": {
"id": "4b172de8",
"outputId": "ca05a209-4514-4b5c-f6cb-2348f58c19a2"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n",
"\n",
"We cannot let this happen. \n",
"\n",
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"source": [
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"source": [
"# ElasticKnnSearch Class\n",
"The `ElasticKnnSearch` implements features allowing storing vectors and documents in Elasticsearch for use with approximate [kNN search](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html)"
],
"metadata": {
"id": "FheGPztJsrRB"
},
"id": "FheGPztJsrRB"
},
{
"cell_type": "code",
"source": [
"!pip install langchain elasticsearch"
],
"metadata": {
"id": "gRVcbh5zqCJQ"
},
"execution_count": null,
"outputs": [],
"id": "gRVcbh5zqCJQ"
},
{
"cell_type": "code",
"source": [
"from langchain.vectorstores.elastic_vector_search import ElasticKnnSearch\n",
"from langchain.embeddings import ElasticsearchEmbeddings\n",
"import elasticsearch"
],
"metadata": {
"id": "TJtqiw5AqBp8"
},
"execution_count": null,
"outputs": [],
"id": "TJtqiw5AqBp8"
},
{
"cell_type": "code",
"source": [
"# Initialize ElasticsearchEmbeddings\n",
"model_id = \"<model_id_from_es>\" \n",
"dims = dim_count\n",
"es_cloud_id = \"ESS_CLOUD_ID\"\n",
"es_user = \"es_user\"\n",
"es_password = \"es_pass\"\n",
"test_index = \"<index_name>\"\n",
"#input_field = \"your_input_field\" # if different from 'text_field'"
],
"metadata": {
"id": "XHfC0As6qN3T"
},
"execution_count": null,
"outputs": [],
"id": "XHfC0As6qN3T"
},
{
"cell_type": "code",
"source": [
"# Generate embedding object\n",
"embeddings = ElasticsearchEmbeddings.from_credentials(\n",
" model_id,\n",
" #input_field=input_field,\n",
" es_cloud_id=es_cloud_id,\n",
" es_user=es_user,\n",
" es_password=es_password,\n",
")"
],
"metadata": {
"id": "UkTipx1lqc3h"
},
"execution_count": null,
"outputs": [],
"id": "UkTipx1lqc3h"
},
{
"cell_type": "code",
"source": [
"# Initialize ElasticKnnSearch\n",
"knn_search = ElasticKnnSearch(\n",
"\tes_cloud_id=es_cloud_id, \n",
"\tes_user=es_user, \n",
"\tes_password=es_password, \n",
"\tindex_name= test_index, \n",
"\tembedding= embeddings\n",
")"
],
"metadata": {
"id": "74psgD0oqjYK"
},
"execution_count": null,
"outputs": [],
"id": "74psgD0oqjYK"
},
{
"cell_type": "markdown",
"source": [
"## Test adding vectors"
],
"metadata": {
"id": "7AfgIKLWqnQl"
},
"id": "7AfgIKLWqnQl"
},
{
"cell_type": "code",
"source": [
"# Test `add_texts` method\n",
"texts = [\"Hello, world!\", \"Machine learning is fun.\", \"I love Python.\"]\n",
"knn_search.add_texts(texts)\n",
"\n",
"# Test `from_texts` method\n",
"new_texts = [\"This is a new text.\", \"Elasticsearch is powerful.\", \"Python is great for data analysis.\"]\n",
"knn_search.from_texts(new_texts, dims=dims)"
],
"metadata": {
"id": "yNUUIaL9qmze"
},
"execution_count": null,
"outputs": [],
"id": "yNUUIaL9qmze"
},
{
"cell_type": "markdown",
"source": [
"## Test knn search using query vector builder "
],
"metadata": {
"id": "0zdR-Iubquov"
},
"id": "0zdR-Iubquov"
},
{
"cell_type": "code",
"source": [
"# Test `knn_search` method with model_id and query_text\n",
"query = \"Hello\"\n",
"knn_result = knn_search.knn_search(query = query, model_id= model_id, k=2)\n",
"print(f\"kNN search results for query '{query}': {knn_result}\")\n",
"print(f\"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'\")\n",
"\n",
"# Test `hybrid_search` method\n",
"query = \"Hello\"\n",
"hybrid_result = knn_search.knn_hybrid_search(query = query, model_id= model_id, k=2)\n",
"print(f\"Hybrid search results for query '{query}': {hybrid_result}\")\n",
"print(f\"The 'text' field value from the top hit is: '{hybrid_result['hits']['hits'][0]['_source']['text']}'\")"
],
"metadata": {
"id": "bwR4jYvqqxTo"
},
"execution_count": null,
"outputs": [],
"id": "bwR4jYvqqxTo"
},
{
"cell_type": "markdown",
"source": [
"## Test knn search using pre generated vector \n"
],
"metadata": {
"id": "ltXYqp0qqz7R"
},
"id": "ltXYqp0qqz7R"
},
{
"cell_type": "code",
"source": [
"# Generate embedding for tests\n",
"query_text = 'Hello'\n",
"query_embedding = embeddings.embed_query(query_text)\n",
"print(f\"Length of embedding: {len(query_embedding)}\\nFirst two items in embedding: {query_embedding[:2]}\")\n",
"\n",
"# Test knn Search\n",
"knn_result = knn_search.knn_search(query_vector = query_embedding, k=2)\n",
"print(f\"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'\")\n",
"\n",
"# Test hybrid search - Requires both query_text and query_vector\n",
"knn_result = knn_search.knn_hybrid_search(query_vector = query_embedding, query=query_text, k=2)\n",
"print(f\"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'\")"
],
"metadata": {
"id": "O5COtpTqq23t"
},
"execution_count": null,
"outputs": [],
"id": "O5COtpTqq23t"
},
{
"cell_type": "markdown",
"source": [
"## Test source option"
],
"metadata": {
"id": "0dnmimcJq42C"
},
"id": "0dnmimcJq42C"
},
{
"cell_type": "code",
"source": [
"# Test `knn_search` method with model_id and query_text\n",
"query = \"Hello\"\n",
"knn_result = knn_search.knn_search(query = query, model_id= model_id, k=2, source=False)\n",
"assert not '_source' in knn_result['hits']['hits'][0].keys()\n",
"\n",
"# Test `hybrid_search` method\n",
"query = \"Hello\"\n",
"hybrid_result = knn_search.knn_hybrid_search(query = query, model_id= model_id, k=2, source=False)\n",
"assert not '_source' in hybrid_result['hits']['hits'][0].keys()"
],
"metadata": {
"id": "v4_B72nHq7g1"
},
"execution_count": null,
"outputs": [],
"id": "v4_B72nHq7g1"
},
{
"cell_type": "markdown",
"source": [
"## Test fields option "
],
"metadata": {
"id": "teHgJgrlq-Jb"
},
"id": "teHgJgrlq-Jb"
},
{
"cell_type": "code",
"source": [
"# Test `knn_search` method with model_id and query_text\n",
"query = \"Hello\"\n",
"knn_result = knn_search.knn_search(query = query, model_id= model_id, k=2, fields=['text'])\n",
"assert 'text' in knn_result['hits']['hits'][0]['fields'].keys()\n",
"\n",
"# Test `hybrid_search` method\n",
"query = \"Hello\"\n",
"hybrid_result = knn_search.knn_hybrid_search(query = query, model_id= model_id, k=2, fields=['text'])\n",
"assert 'text' in hybrid_result['hits']['hits'][0]['fields'].keys()"
],
"metadata": {
"id": "utNBbpZYrAYW"
},
"execution_count": null,
"outputs": [],
"id": "utNBbpZYrAYW"
},
{
"cell_type": "markdown",
"source": [
"### Test with es client connection rather than cloud_id "
],
"metadata": {
"id": "hddsIFferBy1"
},
"id": "hddsIFferBy1"
},
{
"cell_type": "code",
"source": [
"# Create Elasticsearch connection\n",
"es_connection = Elasticsearch(\n",
" hosts=['https://es_cluster_url:port'], \n",
" basic_auth=('user', 'password')\n",
")"
],
"metadata": {
"id": "bXqrUnoirFia"
},
"execution_count": null,
"outputs": [],
"id": "bXqrUnoirFia"
},
{
"cell_type": "code",
"source": [
"# Instantiate ElasticsearchEmbeddings using es_connection\n",
"embeddings = ElasticsearchEmbeddings.from_es_connection(\n",
" model_id,\n",
" es_connection,\n",
")"
],
"metadata": {
"id": "TIM__Hm8rSEW"
},
"execution_count": null,
"outputs": [],
"id": "TIM__Hm8rSEW"
},
{
"cell_type": "code",
"source": [
"# Initialize ElasticKnnSearch\n",
"knn_search = ElasticKnnSearch(\n",
"\tes_connection = es_connection,\n",
"\tindex_name= test_index, \n",
"\tembedding= embeddings\n",
")"
],
"metadata": {
"id": "1-CdnOrArVc_"
},
"execution_count": null,
"outputs": [],
"id": "1-CdnOrArVc_"
},
{
"cell_type": "code",
"source": [
"# Test `knn_search` method with model_id and query_text\n",
"query = \"Hello\"\n",
"knn_result = knn_search.knn_search(query = query, model_id= model_id, k=2)\n",
"print(f\"kNN search results for query '{query}': {knn_result}\")\n",
"print(f\"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'\")\n"
],
"metadata": {
"id": "0kgyaL6QrYVF"
},
"execution_count": null,
"outputs": [],
"id": "0kgyaL6QrYVF"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"colab": {
"provenance": []
}
},
"nbformat": 4,
"nbformat_minor": 5
}