{ "cells": [ { "cell_type": "markdown", "id": "ab66dd43", "metadata": {}, "source": [ "# Pinecone Hybrid Search\n", "\n", "This notebook goes over how to use a retriever that under the hood uses Pinecone and Hybrid Search.\n", "\n", "The logic of this retriever is largely taken from [this blog post](https://www.pinecone.io/learn/hybrid-search-intro/)" ] }, { "cell_type": "code", "execution_count": 1, "id": "393ac030", "metadata": {}, "outputs": [], "source": [ "from langchain.retrievers import PineconeHybridSearchRetriever" ] }, { "cell_type": "markdown", "id": "aaf80e7f", "metadata": {}, "source": [ "## Setup Pinecone" ] }, { "cell_type": "code", "execution_count": 3, "id": "15390796", "metadata": {}, "outputs": [], "source": [ "import pinecone # !pip install pinecone-client\n", "\n", "pinecone.init(\n", " api_key=\"...\", # API key here\n", " environment=\"...\" # find next to api key in console\n", ")\n", "# choose a name for your index\n", "index_name = \"...\"" ] }, { "cell_type": "markdown", "id": "95d5d7f9", "metadata": {}, "source": [ "You should only have to do this part once." ] }, { "cell_type": "code", "execution_count": null, "id": "cfa3a8d8", "metadata": {}, "outputs": [], "source": [ "# create the index\n", "pinecone.create_index(\n", " name = index_name,\n", " dimension = 1536, # dimensionality of dense model\n", " metric = \"dotproduct\",\n", " pod_type = \"s1\"\n", ")" ] }, { "cell_type": "markdown", "id": "e01549af", "metadata": {}, "source": [ "Now that its created, we can use it" ] }, { "cell_type": "code", "execution_count": 4, "id": "bcb3c8c2", "metadata": {}, "outputs": [], "source": [ "index = pinecone.Index(index_name)" ] }, { "cell_type": "markdown", "id": "dbc025d6", "metadata": {}, "source": [ "## Get embeddings and tokenizers\n", "\n", "Embeddings are used for the dense vectors, tokenizer is used for the sparse vector" ] }, { "cell_type": "code", "execution_count": 5, "id": "2f63c911", "metadata": {}, "outputs": [], "source": [ "from langchain.embeddings import OpenAIEmbeddings\n", "embeddings = OpenAIEmbeddings()" ] }, { "cell_type": "code", "execution_count": 6, "id": "c3f030e5", "metadata": {}, "outputs": [], "source": [ "from transformers import BertTokenizerFast # !pip install transformers\n", "\n", "# load bert tokenizer from huggingface\n", "tokenizer = BertTokenizerFast.from_pretrained(\n", " 'bert-base-uncased'\n", ")" ] }, { "cell_type": "markdown", "id": "5462801e", "metadata": {}, "source": [ "## Load Retriever\n", "\n", "We can now construct the retriever!" ] }, { "cell_type": "code", "execution_count": 7, "id": "ac77d835", "metadata": {}, "outputs": [], "source": [ "retriever = PineconeHybridSearchRetriever(embeddings=embeddings, index=index, tokenizer=tokenizer)" ] }, { "cell_type": "markdown", "id": "1c518c42", "metadata": {}, "source": [ "## Add texts (if necessary)\n", "\n", "We can optionally add texts to the retriever (if they aren't already in there)" ] }, { "cell_type": "code", "execution_count": 8, "id": "98b1c017", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "4d6f3ee7ca754d07a1a18d100d99e0cd", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/1 [00:00