From eca8a5e5b8c12d668eb046f5b9bef9eda41c0543 Mon Sep 17 00:00:00 2001
From: Lance Martin <122662504+rlancemartin@users.noreply.github.com>
Date: Mon, 16 Oct 2023 20:50:15 -0700
Subject: [PATCH] Flesh out semi-structured cookbook (#11904)

---
 cookbook/Semi_Structured_RAG.ipynb | 86 +++++++++++++++++++++++++++---
 1 file changed, 78 insertions(+), 8 deletions(-)

diff --git a/cookbook/Semi_Structured_RAG.ipynb b/cookbook/Semi_Structured_RAG.ipynb
index 985d8036edd..59d0244bfab 100644
--- a/cookbook/Semi_Structured_RAG.ipynb
+++ b/cookbook/Semi_Structured_RAG.ipynb
@@ -14,12 +14,19 @@
     "\n",
     "Many documents contain a mixture of content types, including text and tables. \n",
     "\n",
+    "Semi-structured data can be challenging for conventional RAG for at least two reasons: \n",
+    "\n",
+    "* Text splitting may break up tables, corrupting the data in retrieval\n",
+    "* Embedding tables may pose challenges for semantic similarity search \n",
+    "\n",
     "This cookbook shows how to perform RAG on documents with semi-structured data: \n",
     "\n",
     "* We will use [Unstructured](https://unstructured.io/) to parse both text and tables from documents (PDFs).\n",
-    "* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables, text along with summaries for retrieval.\n",
+    "* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables and text, along with table summaries that are better suited for retrieval.\n",
     "* We will use [LCEL](https://python.langchain.com/docs/expression_language/) to implement the chains used.\n",
     "\n",
+    "The overall flow is shown below:\n",
+    "\n",
     "![MVR.png](attachment:7b5c5a30-393c-4b27-8fa1-688306ef2aef.png)\n",
     "\n",
     "## Packages"
@@ -32,7 +39,29 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "! pip install langchain unstructured[all-docs] pydantic lxml"
+    "! pip install langchain unstructured[all-docs] pydantic lxml langchainhub"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "44349a83-e1dc-4eed-ba75-587f309d8c88",
+   "metadata": {},
+   "source": [
+    "The PDF partitioning performed by Unstructured uses: \n",
+    "\n",
+    "* `tesseract` for Optical Character Recognition (OCR)\n",
+    "* `poppler` for PDF rendering and processing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f7880871-4949-4ea2-aed8-540a09188a41",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! brew install tesseract \n",
+    "! brew install poppler"
+   ]
+  },
   {
@@ -44,8 +73,16 @@
     "\n",
     "### Partition PDF tables and text\n",
     "\n",
-    "* `LLaMA2` Paper: https://arxiv.org/pdf/2307.09288.pdf\n",
-    "* Use `Unstructured` to partition elements"
+    "We apply this to the [`LLaMA2`](https://arxiv.org/pdf/2307.09288.pdf) paper. \n",
+    "\n",
+    "We use Unstructured's [`partition_pdf`](https://unstructured-io.github.io/unstructured/bricks/partition.html#partition-pdf), which segments a PDF document using a layout model. \n",
+    "\n",
+    "This layout model makes it possible to extract elements, such as tables, from PDFs. \n",
+    "\n",
+    "We can also use `Unstructured` chunking, which:\n",
+    "\n",
+    "* Tries to identify document sections (e.g., Introduction)\n",
+    "* Then builds text blocks that maintain section boundaries while also honoring user-defined chunk sizes"
    ]
   },
   {
@@ -72,7 +109,7 @@
     "\n",
     "# Get elements\n",
     "raw_pdf_elements = partition_pdf(filename=path+\"LLaMA2.pdf\",\n",
-    "                                 # Using pdf format to find embedded image blocks\n",
+    "                                 # Unstructured first finds embedded image blocks\n",
     "                                 extract_images_in_pdf=False,\n",
     "                                 # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles\n",
    "                                 # Titles are any sub-section of the document \n",
@@ -82,13 +119,22 @@
     "                                 infer_table_structure=True, \n",
     "                                 # Post processing to aggregate text once we have the title \n",
     "                                 chunking_strategy=\"by_title\",\n",
     "                                 # Chunking params to aggregate text blocks\n",
     "                                 # Attempt to create a new chunk 3800 chars\n",
     "                                 # Attempt to keep chunks > 2000 chars \n",
-    "                                 # Hard max on chunks\n",
     "                                 max_characters=4000, \n",
     "                                 new_after_n_chars=3800, \n",
     "                                 combine_text_under_n_chars=2000,\n",
     "                                 image_output_dir_path=path)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "b09cd727-aeab-49af-8a51-0dc377321e7c",
+   "metadata": {},
+   "source": [
+    "We can examine the elements extracted by `partition_pdf`.\n",
+    "\n",
+    "`CompositeElement` objects are aggregated chunks.\n",
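+    "\n",
+    "For example, here is a minimal sketch of separating tables from aggregated text chunks (the `tables`/`texts` names are illustrative; `raw_pdf_elements` comes from the cell above):\n",
+    "\n",
+    "```python\n",
+    "from unstructured.documents.elements import CompositeElement, Table\n",
+    "\n",
+    "# Tables survive chunking as distinct elements; narrative text is\n",
+    "# aggregated into CompositeElement chunks by the by_title strategy.\n",
+    "tables = [str(el) for el in raw_pdf_elements if isinstance(el, Table)]\n",
+    "texts = [str(el) for el in raw_pdf_elements if isinstance(el, CompositeElement)]\n",
+    "```"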
\n", + "\n", + "We also can use `Unstructured` chunking, which:\n", + "\n", + "* Tries to identify document sections (e.g., Introduction, etc)\n", + "* Then, builds text blocks that maintain sections while also honoring user-defined chunk sizes" ] }, { @@ -72,7 +109,7 @@ "\n", "# Get elements\n", "raw_pdf_elements = partition_pdf(filename=path+\"LLaMA2.pdf\",\n", - " # Using pdf format to find embedded image blocks\n", + " # Unstructured first finds embedded image blocks\n", " extract_images_in_pdf=False,\n", " # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles\n", " # Titles are any sub-section of the document \n", @@ -82,13 +119,22 @@ " # Chunking params to aggregate text blocks\n", " # Attempt to create a new chunk 3800 chars\n", " # Attempt to keep chunks > 2000 chars \n", - " # Hard max on chunks\n", " max_characters=4000, \n", " new_after_n_chars=3800, \n", " combine_text_under_n_chars=2000,\n", " image_output_dir_path=path)" ] }, + { + "cell_type": "markdown", + "id": "b09cd727-aeab-49af-8a51-0dc377321e7c", + "metadata": {}, + "source": [ + "We can examine the elements extracted by `partition_pdf`.\n", + "\n", + "`CompositeElement` are aggregated chunks." + ] + }, { "cell_type": "code", "execution_count": 13, @@ -168,7 +214,13 @@ "source": [ "## Multi-vector retriever\n", "\n", - "Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary).\n", + "Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to produce summaries of tables and, optionally, text. \n", + "\n", + "With the summary, we will also store the raw table elements.\n", + "\n", + "The summaries are used to improve the quality of retrieval, [as explained in the multi vector retriever docs](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector).\n", + "\n", + "The raw tables are passed to the LLM, providing the full table context for the LLM to generate the answer. \n", "\n", "### Summaries" ] @@ -185,6 +237,21 @@ "from langchain.schema.output_parser import StrOutputParser" ] }, + { + "cell_type": "markdown", + "id": "37b65677-aeb4-44fd-b06d-4539341ede97", + "metadata": {}, + "source": [ + "We create a simple summarize chain for each element.\n", + "\n", + "You can also see, re-use, or modify the prompt in the Hub [here](https://smith.langchain.com/hub/rlm/multi-vector-retriever-summarization).\n", + "\n", + "```\n", + "from langchain import hub\n", + "obj = hub.pull(\"rlm/multi-vector-retriever-summarization\")\n", + "```" + ] + }, { "cell_type": "code", "execution_count": 17, @@ -233,7 +300,10 @@ "source": [ "### Add to vectorstore\n", "\n", - "Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries." + "Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries: \n", + "\n", + "* `InMemoryStore` stores the raw text, tables\n", + "* `vectorstore` stores the embedded summaries" ] }, {