{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Activeloop Deep Memory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">[Activeloop Deep Memory](https://docs.activeloop.ai/performance-features/deep-memory) is a suite of tools that enables you to optimize your Vector Store for your use-case and achieve higher accuracy in your LLM apps.\n",
    "\n",
    "`Retrieval-Augmented Generatation` (`RAG`) has recently gained significant attention. As advanced RAG techniques and agents emerge, they expand the potential of what RAGs can accomplish. However, several challenges may limit the integration of RAGs into production. The primary factors to consider when implementing RAGs in production settings are accuracy (recall), cost, and latency. For basic use cases, OpenAI's Ada model paired with a naive similarity search can produce satisfactory results. Yet, for higher accuracy or recall during searches, one might need to employ advanced retrieval techniques. These methods might involve varying data chunk sizes, rewriting queries multiple times, and more, potentially increasing latency and costs.  Activeloop's [Deep Memory](https://www.activeloop.ai/resources/use-deep-memory-to-boost-rag-apps-accuracy-by-up-to-22/) a feature available to `Activeloop Deep Lake` users, addresses these issuea by introducing a tiny neural network layer trained to match user queries with relevant data from a corpus. While this addition incurs minimal latency during search, it can boost retrieval accuracy by up to 27\n",
    "% and remains cost-effective and simple to use, without requiring any additional advanced rag techniques.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this tutorial we will parse `DeepLake` documentation, and create a RAG system that could answer the question from the docs. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Dataset Creation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will parse activeloop's docs for this tutorial using `BeautifulSoup` library and LangChain's document parsers like `Html2TextTransformer`, `AsyncHtmlLoader`. So we will need to install the following libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install --upgrade --quiet  tiktoken langchain-openai python-dotenv datasets langchain deeplake beautifulsoup4 html2text ragas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also you'll need to create a [Activeloop](https://activeloop.ai) account."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ORG_ID = \"...\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-03-08T04:01:35.753257Z",
     "iopub.status.busy": "2024-03-08T04:01:35.752712Z",
     "iopub.status.idle": "2024-03-08T04:01:35.756716Z",
     "shell.execute_reply": "2024-03-08T04:01:35.756265Z",
     "shell.execute_reply.started": "2024-03-08T04:01:35.753220Z"
    }
   },
   "outputs": [],
   "source": [
    "from langchain.chains import RetrievalQA\n",
    "from langchain_community.vectorstores import DeepLake\n",
    "from langchain_openai import ChatOpenAI, OpenAIEmbeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import getpass\n",
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter your OpenAI API token: \")\n",
    "# # activeloop token is needed if you are not signed in using CLI: `activeloop login -u <USERNAME> -p <PASSWORD>`\n",
    "os.environ[\"ACTIVELOOP_TOKEN\"] = getpass.getpass(\n",
    "    \"Enter your ActiveLoop API token: \"\n",
    ")  # Get your API token from https://app.activeloop.ai, click on your profile picture in the top right corner, and select \"API Tokens\"\n",
    "\n",
    "token = os.getenv(\"ACTIVELOOP_TOKEN\")\n",
    "openai_embeddings = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "db = DeepLake(\n",
    "    dataset_path=f\"hub://{ORG_ID}/deeplake-docs-deepmemory\",  # org_id stands for your username or organization from activeloop\n",
    "    embedding=openai_embeddings,\n",
    "    runtime={\"tensor_db\": True},\n",
    "    token=token,\n",
    "    # overwrite=True, # user overwrite flag if you want to overwrite the full dataset\n",
    "    read_only=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "parsing all links in the webpage using `BeautifulSoup`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from urllib.parse import urljoin\n",
    "\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "\n",
    "def get_all_links(url):\n",
    "    response = requests.get(url)\n",
    "    if response.status_code != 200:\n",
    "        print(f\"Failed to retrieve the page: {url}\")\n",
    "        return []\n",
    "\n",
    "    soup = BeautifulSoup(response.content, \"html.parser\")\n",
    "\n",
    "    # Finding all 'a' tags which typically contain href attribute for links\n",
    "    links = [\n",
    "        urljoin(url, a[\"href\"]) for a in soup.find_all(\"a\", href=True) if a[\"href\"]\n",
    "    ]\n",
    "\n",
    "    return links\n",
    "\n",
    "\n",
    "base_url = \"https://docs.deeplake.ai/en/latest/\"\n",
    "all_links = get_all_links(base_url)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Loading data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-03-08T04:02:37.919739Z",
     "iopub.status.busy": "2024-03-08T04:02:37.919328Z",
     "iopub.status.idle": "2024-03-08T04:02:37.933457Z",
     "shell.execute_reply": "2024-03-08T04:02:37.932716Z",
     "shell.execute_reply.started": "2024-03-08T04:02:37.919707Z"
    }
   },
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders.async_html import AsyncHtmlLoader\n",
    "\n",
    "loader = AsyncHtmlLoader(all_links)\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Converting data into user readable format:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_transformers import Html2TextTransformer\n",
    "\n",
    "html2text = Html2TextTransformer()\n",
    "docs_transformed = html2text.transform_documents(docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let us chunk further the documents as some of the contain too much text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "chunk_size = 4096\n",
    "docs_new = []\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=chunk_size,\n",
    ")\n",
    "\n",
    "for doc in docs_transformed:\n",
    "    if len(doc.page_content) < chunk_size:\n",
    "        docs_new.append(doc)\n",
    "    else:\n",
    "        docs = text_splitter.create_documents([doc.page_content])\n",
    "        docs_new.extend(docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Populating VectorStore:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = db.add_documents(docs_new)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Generating synthetic queries and training Deep Memory "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next step would be to train a deep_memory model that will align your users queries with the dataset that you already have. If you don't have any user queries yet, no worries, we will generate them using LLM!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### TODO: Add image"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here above we showed the overall schema how deep_memory works. So as you can see, in order to train it you need relevance, queries together with corpus data (data that we want to query). Corpus data was already populated in the previous section, here we will be generating questions and relevance. \n",
    "\n",
    "1. `questions` - is a text of strings, where each string represents a query\n",
    "2. `relevance` - contains links to the ground truth for each question. There might be several docs that contain answer to the given question. Because of this relevenve is `List[List[tuple[str, float]]]`, where outer list represents queries and inner list relevant documents. Tuple contains str, float pair where string represent the id of the source doc (corresponds to the `id` tensor in the dataset), while float corresponds to how much current document is related to the question.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let us generate synthetic questions and relevance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List\n",
    "\n",
    "from langchain.chains.openai_functions import (\n",
    "    create_structured_output_chain,\n",
    ")\n",
    "from langchain_core.messages import HumanMessage, SystemMessage\n",
    "from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate\n",
    "from langchain_openai import ChatOpenAI\n",
    "from pydantic import BaseModel, Field"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# fetch dataset docs and ids if they exist (optional you can also ingest)\n",
    "docs = db.vectorstore.dataset.text.data(fetch_chunks=True, aslist=True)[\"value\"]\n",
    "ids = db.vectorstore.dataset.id.data(fetch_chunks=True, aslist=True)[\"value\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# If we pass in a model explicitly, we need to make sure it supports the OpenAI function-calling API.\n",
    "llm = ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0)\n",
    "\n",
    "\n",
    "class Questions(BaseModel):\n",
    "    \"\"\"Identifying information about a person.\"\"\"\n",
    "\n",
    "    question: str = Field(..., description=\"Questions about text\")\n",
    "\n",
    "\n",
    "prompt_msgs = [\n",
    "    SystemMessage(\n",
    "        content=\"You are a world class expert for generating questions based on provided context. \\\n",
    "                You make sure the question can be answered by the text.\"\n",
    "    ),\n",
    "    HumanMessagePromptTemplate.from_template(\n",
    "        \"Use the given text to generate a question from the following input: {input}\"\n",
    "    ),\n",
    "    HumanMessage(content=\"Tips: Make sure to answer in the correct format\"),\n",
    "]\n",
    "prompt = ChatPromptTemplate(messages=prompt_msgs)\n",
    "chain = create_structured_output_chain(Questions, llm, prompt, verbose=True)\n",
    "\n",
    "text = \"# Understanding Hallucinations and Bias ## **Introduction** In this lesson, we'll cover the concept of **hallucinations** in LLMs, highlighting their influence on AI applications and demonstrating how to mitigate them using techniques like the retriever's architectures. We'll also explore **bias** within LLMs with examples.\"\n",
    "questions = chain.run(input=text)\n",
    "print(questions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "from tqdm import tqdm\n",
    "\n",
    "\n",
    "def generate_queries(docs: List[str], ids: List[str], n: int = 100):\n",
    "    questions = []\n",
    "    relevances = []\n",
    "    pbar = tqdm(total=n)\n",
    "    while len(questions) < n:\n",
    "        # 1. randomly draw a piece of text and relevance id\n",
    "        r = random.randint(0, len(docs) - 1)\n",
    "        text, label = docs[r], ids[r]\n",
    "\n",
    "        # 2. generate queries and assign and relevance id\n",
    "        generated_qs = [chain.run(input=text).question]\n",
    "        questions.extend(generated_qs)\n",
    "        relevances.extend([[(label, 1)] for _ in generated_qs])\n",
    "        pbar.update(len(generated_qs))\n",
    "        if len(questions) % 10 == 0:\n",
    "            print(f\"q: {len(questions)}\")\n",
    "    return questions[:n], relevances[:n]\n",
    "\n",
    "\n",
    "chain = create_structured_output_chain(Questions, llm, prompt, verbose=False)\n",
    "questions, relevances = generate_queries(docs, ids, n=200)\n",
    "\n",
    "train_questions, train_relevances = questions[:100], relevances[:100]\n",
    "test_questions, test_relevances = questions[100:], relevances[100:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we created 100 training queries as well as 100 queries for testing. Now let us train the deep_memory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "job_id = db.vectorstore.deep_memory.train(\n",
    "    queries=train_questions,\n",
    "    relevance=train_relevances,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us track the training progress:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "--------------------------------------------------------------\n",
      "|                  6538e02ecda4691033a51c5b                  |\n",
      "--------------------------------------------------------------\n",
      "| status                     | completed                     |\n",
      "--------------------------------------------------------------\n",
      "| progress                   | eta: 1.4 seconds              |\n",
      "|                            | recall@10: 79.00% (+34.00%)   |\n",
      "--------------------------------------------------------------\n",
      "| results                    | recall@10: 79.00% (+34.00%)   |\n",
      "--------------------------------------------------------------\n",
      "\n"
     ]
    }
   ],
   "source": [
    "db.vectorstore.deep_memory.status(\"6538939ca0b69a9ca45c528c\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Evaluating Deep Memory performance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great we've trained the model! It's showing some substantial improvement in recall, but how can we use it now and evaluate on unseen new data? In this section we will delve into model evaluation and inference part and see how it can be used with LangChain in order to increase retrieval accuracy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 Deep Memory evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the beginning we can use deep_memory's builtin evaluation method. \n",
    "It calculates several `recall` metrics.\n",
    "It can be done easily in a few lines of code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Embedding queries took 0.81 seconds\n",
      "---- Evaluating without model ---- \n",
      "Recall@1:\t  9.0%\n",
      "Recall@3:\t  19.0%\n",
      "Recall@5:\t  24.0%\n",
      "Recall@10:\t  42.0%\n",
      "Recall@50:\t  93.0%\n",
      "Recall@100:\t  98.0%\n",
      "---- Evaluating with model ---- \n",
      "Recall@1:\t  19.0%\n",
      "Recall@3:\t  42.0%\n",
      "Recall@5:\t  49.0%\n",
      "Recall@10:\t  69.0%\n",
      "Recall@50:\t  97.0%\n",
      "Recall@100:\t  97.0%\n",
      "\n"
     ]
    }
   ],
   "source": [
    "recall = db.vectorstore.deep_memory.evaluate(\n",
    "    queries=test_questions,\n",
    "    relevance=test_relevances,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is showing quite substatntial improvement on an unseen test dataset too!!!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 Deep Memory + RAGas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ragas.langchain import RagasEvaluatorChain\n",
    "from ragas.metrics import (\n",
    "    context_recall,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us convert recall into ground truths:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def convert_relevance_to_ground_truth(docs, relevance):\n",
    "    ground_truths = []\n",
    "\n",
    "    for rel in relevance:\n",
    "        ground_truth = []\n",
    "        for doc_id, _ in rel:\n",
    "            ground_truth.append(docs[doc_id])\n",
    "        ground_truths.append(ground_truth)\n",
    "    return ground_truths"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Evaluating with deep_memory = False\n",
      "===================================\n",
      "context_recall_score = 0.3763423145\n",
      "===================================\n",
      "\n",
      "Evaluating with deep_memory = True\n",
      "===================================\n",
      "context_recall_score = 0.5634545323\n",
      "===================================\n",
      "\n"
     ]
    }
   ],
   "source": [
    "ground_truths = convert_relevance_to_ground_truth(docs, test_relevances)\n",
    "\n",
    "for deep_memory in [False, True]:\n",
    "    print(\"\\nEvaluating with deep_memory =\", deep_memory)\n",
    "    print(\"===================================\")\n",
    "\n",
    "    retriever = db.as_retriever()\n",
    "    retriever.search_kwargs[\"deep_memory\"] = deep_memory\n",
    "\n",
    "    qa_chain = RetrievalQA.from_chain_type(\n",
    "        llm=ChatOpenAI(model=\"gpt-3.5-turbo\"),\n",
    "        chain_type=\"stuff\",\n",
    "        retriever=retriever,\n",
    "        return_source_documents=True,\n",
    "    )\n",
    "\n",
    "    metrics = {\n",
    "        \"context_recall_score\": 0,\n",
    "    }\n",
    "\n",
    "    eval_chains = {m.name: RagasEvaluatorChain(metric=m) for m in [context_recall]}\n",
    "\n",
    "    for question, ground_truth in zip(test_questions, ground_truths):\n",
    "        result = qa_chain({\"query\": question})\n",
    "        result[\"ground_truths\"] = ground_truth\n",
    "        for name, eval_chain in eval_chains.items():\n",
    "            score_name = f\"{name}_score\"\n",
    "            metrics[score_name] += eval_chain(result)[score_name]\n",
    "\n",
    "    for metric in metrics:\n",
    "        metrics[metric] /= len(test_questions)\n",
    "        print(f\"{metric}: {metrics[metric]}\")\n",
    "    print(\"===================================\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3 Deep Memory Inference"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### TODO: Add image\n",
    "\n",
    "with deep_memory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The base htype of the 'video_seq' tensor is 'video'.\n"
     ]
    }
   ],
   "source": [
    "retriever = db.as_retriever()\n",
    "retriever.search_kwargs[\"deep_memory\"] = True\n",
    "retriever.search_kwargs[\"k\"] = 10\n",
    "\n",
    "query = \"Deamination of cytidine to uridine on the minus strand of viral DNA results in catastrophic G-to-A mutations in the viral genome.\"\n",
    "qa = RetrievalQA.from_chain_type(\n",
    "    llm=ChatOpenAI(model=\"gpt-4\"), chain_type=\"stuff\", retriever=retriever\n",
    ")\n",
    "print(qa.run(query))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "without deep_memory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The text does not provide information on the base htype of the 'video_seq' tensor.\n"
     ]
    }
   ],
   "source": [
    "retriever = db.as_retriever()\n",
    "retriever.search_kwargs[\"deep_memory\"] = False\n",
    "retriever.search_kwargs[\"k\"] = 10\n",
    "\n",
    "query = \"Deamination of cytidine to uridine on the minus strand of viral DNA results in catastrophic G-to-A mutations in the viral genome.\"\n",
    "qa = RetrievalQA.from_chain_type(\n",
    "    llm=ChatOpenAI(model=\"gpt-4\"), chain_type=\"stuff\", retriever=retriever\n",
    ")\n",
    "qa.run(query)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.4 Deep Memory cost savings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Deep Memory increases retrieval accuracy without altering your existing workflow. Additionally, by reducing the top_k input into the LLM, you can significantly cut inference costs via lower token usage."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}