Building applications with LLMs through composability
Go to file
Karthik Bharadhwaj 498f0249e2
community[minor]: Opensearch hybridsearch implementation (#25375)
community: add hybrid search in opensearch

# Langchain OpenSearch Hybrid Search Implementation

## Implementation of Hybrid Search: 

I have taken LangChain's OpenSearch integration to the next level by
adding hybrid search capabilities. Building on the existing
OpenSearchVectorSearch class, I have implemented Hybrid Search
functionality (which combines the best of both keyword and semantic
search). This new functionality allows users to harness the power of
OpenSearch's advanced hybrid search features without leaving the
familiar LangChain ecosystem. By blending traditional text matching with
vector-based similarity, the enhanced class delivers more accurate and
contextually relevant results. It's designed to seamlessly fit into
existing LangChain workflows, making it easy for developers to upgrade
their search capabilities.

In implementing the hybrid search for OpenSearch within the LangChain
framework, I also incorporated filtering capabilities. It's important to
note that according to the OpenSearch hybrid search documentation, only
post-filtering is supported for hybrid queries. This means that the
filtering is applied after the hybrid search results are obtained,
rather than during the initial search process.

**Note:** For the implementation of hybrid search, I strictly followed
the official OpenSearch Hybrid search documentation and I took
inspiration from
https://github.com/AndreasThinks/langchain/tree/feature/opensearch_hybrid_search
Thanks Mate!  

### Experiments

I conducted few experiments to verify that the hybrid search
implementation is accurate and capable of reproducing the results of
both plain keyword search and vector search.

Experiment - 1
Hybrid Search
Keyword_weight: 1, vector_weight: 0

I conducted an experiment to verify the accuracy of my hybrid search
implementation by comparing it to a plain keyword search. For this test,
I set the keyword_weight to 1 and the vector_weight to 0 in the hybrid
search, effectively giving full weightage to the keyword component. The
results from this hybrid search configuration matched those of a plain
keyword search, confirming that my implementation can accurately
reproduce keyword-only search results when needed. It's important to
note that while the results were the same, the scores differed between
the two methods. This difference is expected because the plain keyword
search in OpenSearch uses the BM25 algorithm for scoring, whereas the
hybrid search still performs both keyword and vector searches before
normalizing the scores, even when the vector component is given zero
weight. This experiment validates that my hybrid search solution
correctly handles the keyword search component and properly applies the
weighting system, demonstrating its accuracy and flexibility in
emulating different search scenarios.


Experiment - 2
Hybrid Search
keyword_weight = 0.0, vector_weight = 1.0

For experiment-2, I took the inverse approach to further validate my
hybrid search implementation. I set the keyword_weight to 0 and the
vector_weight to 1, effectively giving full weightage to the vector
search component (KNN search). I then compared these results with a pure
vector search. The outcome was consistent with my expectations: the
results from the hybrid search with these settings exactly matched those
from a standalone vector search. This confirms that my implementation
accurately reproduces vector search results when configured to do so. As
with the first experiment, I observed that while the results were
identical, the scores differed between the two methods. This difference
in scoring is expected and can be attributed to the normalization
process in hybrid search, which still considers both components even
when one is given zero weight. This experiment further validates the
accuracy and flexibility of my hybrid search solution, demonstrating its
ability to effectively emulate pure vector search when needed while
maintaining the underlying hybrid search structure.



Experiment - 3
Hybrid Search - balanced

keyword_weight = 0.5, vector_weight = 0.5

For experiment-3, I adopted a balanced approach to further evaluate the
effectiveness of my hybrid search implementation. In this test, I set
both the keyword_weight and vector_weight to 0.5, giving equal
importance to keyword-based and vector-based search components. This
configuration aims to leverage the strengths of both search methods
simultaneously. By setting both weights to 0.5, I intended to create a
scenario where the hybrid search would consider lexical matches and
semantic similarity equally. This balanced approach is often ideal for
many real-world applications, as it can capture both exact keyword
matches and contextually relevant results that might not contain the
exact search terms.

Kindly verify the notebook for the experiments conducted!  

**Notebook:**
https://github.com/karthikbharadhwajKB/Langchain_OpenSearch_Hybrid_search/blob/main/Opensearch_Hybridsearch.ipynb

### Instructions to follow for Performing Hybrid Search:

**Step-1: Instantiating OpenSearchVectorSearch Class:**
```python
opensearch_vectorstore = OpenSearchVectorSearch(
    index_name=os.getenv("INDEX_NAME"),
    embedding_function=embedding_model,
    opensearch_url=os.getenv("OPENSEARCH_URL"),
    http_auth=(os.getenv("OPENSEARCH_USERNAME"),os.getenv("OPENSEARCH_PASSWORD")),
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False
)
```

**Parameters:**
1. **index_name:** The name of the OpenSearch index to use.
2. **embedding_function:** The function or model used to generate
embeddings for the documents. It's assumed that embedding_model is
defined elsewhere in the code.
3. **opensearch_url:** The URL of the OpenSearch instance.
4. **http_auth:** A tuple containing the username and password for
authentication.
5. **use_ssl:** Set to False, indicating that the connection to
OpenSearch is not using SSL/TLS encryption.
6. **verify_certs:** Set to False, which means the SSL certificates are
not being verified. This is often used in development environments but
is not recommended for production.
7. **ssl_assert_hostname:** Set to False, disabling hostname
verification in SSL certificates.
8. **ssl_show_warn:** Set to False, suppressing SSL-related warnings.

**Step-2: Configure Search Pipeline:**

To initiate hybrid search functionality, you need to configures a search
pipeline first.

**Implementation Details:**

This method configures a search pipeline in OpenSearch that:
1. Normalizes the scores from both keyword and vector searches using the
min-max technique.
2. Applies the specified weights to the normalized scores.
3. Calculates the final score using an arithmetic mean of the weighted,
normalized scores.


**Parameters:**

* **pipeline_name (str):** A unique identifier for the search pipeline.
It's recommended to use a descriptive name that indicates the weights
used for keyword and vector searches.
* **keyword_weight (float):** The weight assigned to the keyword search
component. This should be a float value between 0 and 1. In this
example, 0.3 gives 30% importance to traditional text matching.
* **vector_weight (float):** The weight assigned to the vector search
component. This should be a float value between 0 and 1. In this
example, 0.7 gives 70% importance to semantic similarity.

```python
opensearch_vectorstore.configure_search_pipelines(
    pipeline_name="search_pipeline_keyword_0.3_vector_0.7",
    keyword_weight=0.3,
    vector_weight=0.7,
)
```

**Step-3: Performing Hybrid Search:**

After creating the search pipeline, you can perform a hybrid search
using the `similarity_search()` method (or) any methods that are
supported by `langchain`. This method combines both `keyword-based and
semantic similarity` searches on your OpenSearch index, leveraging the
strengths of both traditional information retrieval and vector embedding
techniques.

**parameters:**
* **query:** The search query string.
* **k:** The number of top results to return (in this case, 3).
* **search_type:** Set to `hybrid_search` to use both keyword and vector
search capabilities.
* **search_pipeline:** The name of the previously created search
pipeline.

```python
query = "what are the country named in our database?"

top_k = 3

pipeline_name = "search_pipeline_keyword_0.3_vector_0.7"

matched_docs = opensearch_vectorstore.similarity_search_with_score(
                query=query,
                k=top_k,
                search_type="hybrid_search",
                search_pipeline = pipeline_name
            )

matched_docs
```

twitter handle: @iamkarthik98

---------

Co-authored-by: Karthik Kolluri <karthik.kolluri@eidosmedia.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2024-12-13 16:34:12 -05:00
.devcontainer community[minor]: Add ApertureDB as a vectorstore (#24088) 2024-07-16 09:32:59 -07:00
.github infra: merge queue allowed (#28641) 2024-12-09 17:11:15 -08:00
cookbook Building RAG agents locally using open source LLMs on Intel CPU (#28302) 2024-11-27 15:40:09 +00:00
docs core[minor]: add new clean up strategy "scoped_full" to indexing (#28505) 2024-12-13 20:35:25 +00:00
libs community[minor]: Opensearch hybridsearch implementation (#25375) 2024-12-13 16:34:12 -05:00
scripts infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
.gitattributes
.gitignore infra: gitignore api_ref mds (#25705) 2024-08-23 09:50:30 -07:00
.readthedocs.yaml
CITATION.cff
LICENSE
Makefile docs: more api ref links, add linting step to prevent more (#28495) 2024-12-04 04:19:42 +00:00
MIGRATE.md Proofreading and Editing Report for Migration Guide (#28084) 2024-11-13 11:03:09 -05:00
poetry.lock docs: update tutorials (#28219) 2024-11-26 10:43:12 -05:00
poetry.toml multiple: use modern installer in poetry (#23998) 2024-07-08 18:50:48 -07:00
pyproject.toml docs: run how-to guides in CI (#27615) 2024-10-30 12:35:38 -04:00
README.md docs: update readme (#28631) 2024-12-09 13:50:12 -05:00
SECURITY.md docs: single security doc (#28515) 2024-12-04 18:15:34 +00:00
yarn.lock box: add langchain box package and DocumentLoader (#25506) 2024-08-21 02:23:43 +00:00

🦜🔗 LangChain

Build context-aware reasoning applications

Release Notes CI PyPI - License PyPI - Downloads GitHub star chart Open Issues Open in Dev Containers Open in GitHub Codespaces Twitter

Looking for the JS/TS library? Check out LangChain.js.

To help you ship LangChain apps to production faster, check out LangSmith. LangSmith is a unified developer platform for building, testing, and monitoring LLM applications. Fill out this form to speak with our sales team.

Quick Install

With pip:

pip install langchain

With conda:

conda install langchain -c conda-forge

🤔 What is LangChain?

LangChain is a framework for developing applications powered by large language models (LLMs).

For these applications, LangChain simplifies the entire application lifecycle:

  • Open-source libraries: Build your applications using LangChain's open-source components and third-party integrations. Use LangGraph to build stateful agents with first-class streaming and human-in-the-loop support.
  • Productionization: Inspect, monitor, and evaluate your apps with LangSmith so that you can constantly optimize and deploy with confidence.
  • Deployment: Turn your LangGraph applications into production-ready APIs and Assistants with LangGraph Platform.

Open-source libraries

  • langchain-core: Base abstractions.
  • Integration packages (e.g. langchain-openai, langchain-anthropic, etc.): Important integrations have been split into lightweight packages that are co-maintained by the LangChain team and the integration developers.
  • langchain: Chains, agents, and retrieval strategies that make up an application's cognitive architecture.
  • langchain-community: Third-party integrations that are community maintained.
  • LangGraph: Build robust and stateful multi-actor applications with LLMs by modeling steps as edges and nodes in a graph. Integrates smoothly with LangChain, but can be used without it. To learn more about LangGraph, check out our first LangChain Academy course, Introduction to LangGraph, available here.

Productionization:

  • LangSmith: A developer platform that lets you debug, test, evaluate, and monitor chains built on any LLM framework and seamlessly integrates with LangChain.

Deployment:

  • LangGraph Platform: Turn your LangGraph applications into production-ready APIs and Assistants.

Diagram outlining the hierarchical organization of the LangChain framework, displaying the interconnected parts across multiple layers. Diagram outlining the hierarchical organization of the LangChain framework, displaying the interconnected parts across multiple layers.

🧱 What can you build with LangChain?

Question answering with RAG

🧱 Extracting structured output

🤖 Chatbots

And much more! Head to the Tutorials section of the docs for more.

🚀 How does LangChain help?

The main value props of the LangChain libraries are:

  1. Components: composable building blocks, tools and integrations for working with language models. Components are modular and easy-to-use, whether you are using the rest of the LangChain framework or not.
  2. Easy orchestration with LangGraph: LangGraph, built on top of langchain-core, has built-in support for messages, tools, and other LangChain abstractions. This makes it easy to combine components into production-ready applications with persistence, streaming, and other key features. Check out the LangChain tutorials page for examples.

Components

Components fall into the following modules:

📃 Model I/O

This includes prompt management and a generic interface for chat models, including a consistent interface for tool-calling and structured output across model providers.

📚 Retrieval

Retrieval Augmented Generation involves loading data from a variety of sources, preparing it, then searching over (a.k.a. retrieving from) it for use in the generation step.

🤖 Agents

Agents allow an LLM autonomy over how a task is accomplished. Agents make decisions about which Actions to take, then take that Action, observe the result, and repeat until the task is complete. LangGraph makes it easy to use LangChain components to build both custom and built-in LLM agents.

📖 Documentation

Please see here for full documentation, which includes:

  • Introduction: Overview of the framework and the structure of the docs.
  • Tutorials: If you're looking to build something specific or are more of a hands-on learner, check out our tutorials. This is the best place to get started.
  • How-to guides: Answers to “How do I….?” type questions. These guides are goal-oriented and concrete; they're meant to help you complete a specific task.
  • Conceptual guide: Conceptual explanations of the key parts of the framework.
  • API Reference: Thorough documentation of every class and method.

🌐 Ecosystem

  • 🦜🛠️ LangSmith: Trace and evaluate your language model applications and intelligent agents to help you move from prototype to production.
  • 🦜🕸️ LangGraph: Create stateful, multi-actor applications with LLMs. Integrates smoothly with LangChain, but can be used without it.
  • 🦜🕸️ LangGraph Platform: Deploy LLM applications built with LangGraph into production.

💁 Contributing

As an open-source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.

For detailed information on how to contribute, see here.

🌟 Contributors

langchain contributors