community: retrievers: added capability for using Product Quantization as one of the retriever. (#22424)

- [ ] **Community**: "Retrievers: Product Quantization"
- [X] This PR adds Product Quantization feature to the retrievers to the
Langchain Community. PQ is one of the fastest retrieval methods if the
embeddings are rich enough in context due to the concepts of
quantization and representation through centroids
    - **Description:** Adding PQ as one of the retrievers
    - **Dependencies:** using the package nanopq for this PR
    - **Twitter handle:** vishnunkumar_


- [X] **Add tests and docs**: If you're adding a new integration, please
include
   - [X] Added unit tests for the same in the retrievers.
   - [] Will add an example notebook subsequently

- [X] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/ -
done the same

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
This commit is contained in:
Vishnu Nandakumar
2024-07-24 19:22:15 +05:30
committed by GitHub
parent b9bea36dd4
commit e271965d1e
6 changed files with 306 additions and 0 deletions

View File

@@ -0,0 +1,135 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "661d5123-8ed2-4504-a846-7df0984e79f9",
"metadata": {},
"source": [
"# NanoPQ (Product Quantization)\n",
"\n",
">[Product Quantization algorithm (k-NN)](https://towardsdatascience.com/similarity-search-product-quantization-b2a1a6397701) in brief is a quantization algorithm that helps in compression of database vectors which helps in semantic search when large datasets are involved. In a nutshell, the embedding is split into M subspaces which further goes through clustering. Upon clustering the vectors the centroid vector gets mapped to the vectors present in the each of the clusters of the subspace. \n",
"\n",
"This notebook goes over how to use a retriever that under the hood uses a Product Quantization which has been implemented by the [nanopq](https://github.com/matsui528/nanopq) package."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "68794637-c13b-4145-944f-3b0c2f1258f9",
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-community langchain-openai nanopq"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "39ecbf50-4623-4ee6-9c8e-fea5da21767e",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.embeddings.spacy_embeddings import SpacyEmbeddings\n",
"from langchain_community.retrievers import NanoPQRetriever"
]
},
{
"cell_type": "markdown",
"id": "c1ce742a-5085-408a-a2c2-4bae0f605880",
"metadata": {},
"source": [
"## Create New Retriever with Texts"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6c80020e-bc9e-49e8-8f93-5f75fd823738",
"metadata": {},
"outputs": [],
"source": [
"retriever = NanoPQRetriever.from_texts(\n",
" [\"Great world\", \"great words\", \"world\", \"planets of the world\"],\n",
" SpacyEmbeddings(model_name=\"en_core_web_sm\"),\n",
" clusters=2,\n",
" subspace=2,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "743c26c1-0072-4e46-b41b-c28b3f1737c8",
"metadata": {},
"source": [
"## Use Retriever\n",
"\n",
"We can now use the retriever!"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f496de2d-9b8f-4f8b-a30f-279ef199259a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"M: 2, Ks: 2, metric : <class 'numpy.uint8'>, code_dtype: l2\n",
"iter: 20, seed: 123\n",
"Training the subspace: 0 / 2\n",
"Training the subspace: 1 / 2\n",
"Encoding the subspace: 0 / 2\n",
"Encoding the subspace: 1 / 2\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='world'),\n",
" Document(page_content='Great world'),\n",
" Document(page_content='great words'),\n",
" Document(page_content='planets of the world')]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever.invoke(\"earth\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "617202a7-e3a6-49a8-b807-4b4d771159d5",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}