Flesh out semi-structured cookbook (#11904)
@@ -14,12 +14,19 @@
     "\n",
     "Many documents contain a mixture of content types, including text and tables. \n",
     "\n",
     "Semi-structured data can be challenging for conventional RAG for at least two reasons: \n",
     "\n",
     "* Text splitting may break up tables, corrupting the data in retrieval\n",
     "* Embedding tables may pose challenges for semantic similarity search \n",
     "\n",
     "This cookbook shows how to perform RAG on documents with semi-structured data: \n",
     "\n",
     "* We will use [Unstructured](https://unstructured.io/) to parse both text and tables from documents (PDFs).\n",
-    "* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables, text along with summaries for retrieval.\n",
+    "* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables and text, along with table summaries that are better suited for retrieval.\n",
+    "* We will use [LCEL](https://python.langchain.com/docs/expression_language/) to implement the chains used.\n",
+    "\n",
+    "The overall flow is shown here:\n",
+    "\n",
+    "\n",
+    "\n",
     "## Packages"
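For readers new to LCEL: it composes prompts, models, and parsers with the `|` operator. A minimal primer sketch follows; the prompt wording and model choice here are illustrative assumptions, not part of the cookbook.

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

# LCEL pipes each runnable's output into the next: prompt -> model -> parser
chain = (
    ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
)

print(chain.invoke({"text": "Tables often break naive text splitting in RAG."}))
```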
@@ -32,7 +39,29 @@
     "metadata": {},
     "outputs": [],
     "source": [
-    "! pip install langchain unstructured[all-docs] pydantic lxml"
+    "! pip install langchain unstructured[all-docs] pydantic lxml langchainhub"
     ]
    },
+   {
+    "cell_type": "markdown",
+    "id": "44349a83-e1dc-4eed-ba75-587f309d8c88",
+    "metadata": {},
+    "source": [
+     "The PDF partitioning used by Unstructured will use: \n",
+     "\n",
+     "* `tesseract` for Optical Character Recognition (OCR)\n",
+     "* `poppler` for PDF rendering and processing"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "f7880871-4949-4ea2-aed8-540a09188a41",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "! brew install tesseract \n",
+     "! brew install poppler"
+    ]
+   },
    {
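If the installs above succeed, both system dependencies should be on the `PATH`. A quick sanity check from Python, using only the standard library, might look like this sketch:

```python
import shutil

# poppler ships the pdfinfo/pdftoppm binaries; tesseract is a single binary
for tool in ("tesseract", "pdfinfo"):
    found = shutil.which(tool)
    print(f"{tool}: {'found at ' + found if found else 'NOT FOUND'}")
```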
@@ -44,8 +73,16 @@
     "\n",
     "### Partition PDF tables and text\n",
     "\n",
-    "* `LLaMA2` Paper: https://arxiv.org/pdf/2307.09288.pdf\n",
-    "* Use `Unstructured` to partition elements"
+    "Apply this to the [`LLaMA2`](https://arxiv.org/pdf/2307.09288.pdf) paper. \n",
+    "\n",
+    "We use the Unstructured [`partition_pdf`](https://unstructured-io.github.io/unstructured/bricks/partition.html#partition-pdf) function, which segments a PDF document using a layout model. \n",
+    "\n",
+    "This layout model makes it possible to extract elements, such as tables, from PDFs. \n",
+    "\n",
+    "We can also use `Unstructured` chunking, which:\n",
+    "\n",
+    "* Tries to identify document sections (e.g., Introduction)\n",
+    "* Then builds text blocks that keep sections together while also honoring user-defined chunk sizes"
@@ -72,7 +109,7 @@
     "\n",
     "# Get elements\n",
     "raw_pdf_elements = partition_pdf(filename=path+\"LLaMA2.pdf\",\n",
-    "                                 # Using pdf format to find embedded image blocks\n",
+    "                                 # Unstructured first finds embedded image blocks\n",
     "                                 extract_images_in_pdf=False,\n",
     "                                 # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles\n",
     "                                 # Titles are any sub-section of the document \n",
@@ -82,13 +119,22 @@
     "                                 # Chunking params to aggregate text blocks\n",
     "                                 # Attempt to create a new chunk at 3800 chars\n",
     "                                 # Attempt to keep chunks > 2000 chars \n",
     "                                 # Hard max on chunk size\n",
     "                                 max_characters=4000, \n",
     "                                 new_after_n_chars=3800, \n",
     "                                 combine_text_under_n_chars=2000,\n",
     "                                 image_output_dir_path=path)"
     ]
    },
+   {
+    "cell_type": "markdown",
+    "id": "b09cd727-aeab-49af-8a51-0dc377321e7c",
+    "metadata": {},
+    "source": [
+     "We can examine the elements extracted by `partition_pdf`.\n",
+     "\n",
+     "`CompositeElement` instances are aggregated chunks."
+    ]
+   },
    {
     "cell_type": "code",
     "execution_count": 13,
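The diff shows the partitioning call in fragments, with some arguments elided between hunks. Pieced together, a complete call might look like the sketch below; the `infer_table_structure` and `chunking_strategy` arguments are assumptions based on typical `partition_pdf` usage, as are the placeholder path and the element-type tally at the end.

```python
from collections import Counter

from unstructured.partition.pdf import partition_pdf

path = "/path/to/docs/"  # hypothetical directory containing LLaMA2.pdf

raw_pdf_elements = partition_pdf(
    filename=path + "LLaMA2.pdf",
    # Skip embedded image extraction; we only want text and tables
    extract_images_in_pdf=False,
    # Assumed: ask the layout model for table structure as well
    infer_table_structure=True,
    # Assumed: chunk by title so document sections stay together
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)

# Tally element types; CompositeElement instances are the aggregated chunks
print(Counter(type(el).__name__ for el in raw_pdf_elements))
```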
@@ -168,7 +214,13 @@
    "source": [
     "## Multi-vector retriever\n",
     "\n",
-    "Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary).\n",
+    "Use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to produce summaries of tables and, optionally, text. \n",
+    "\n",
+    "Alongside each summary, we will also store the raw table elements.\n",
+    "\n",
+    "The summaries are used to improve the quality of retrieval, [as explained in the multi-vector retriever docs](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector).\n",
+    "\n",
+    "The raw tables are passed to the LLM, providing the full table context for the LLM to generate the answer. \n",
     "\n",
     "### Summaries"
    ]
@@ -185,6 +237,21 @@
     "from langchain.schema.output_parser import StrOutputParser"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "37b65677-aeb4-44fd-b06d-4539341ede97",
+   "metadata": {},
+   "source": [
+    "We create a simple summarize chain for each element.\n",
+    "\n",
+    "You can also view, reuse, or modify the prompt in the Hub [here](https://smith.langchain.com/hub/rlm/multi-vector-retriever-summarization).\n",
+    "\n",
+    "```\n",
+    "from langchain import hub\n",
+    "obj = hub.pull(\"rlm/multi-vector-retriever-summarization\")\n",
+    "```"
+   ]
+  },
    {
     "cell_type": "code",
     "execution_count": 17,
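The summarization cell itself is elided by the hunk boundary. A minimal sketch of such a summarize chain follows, using an inlined prompt in place of the Hub pull shown above; the prompt wording, model choice, and the way `tables` is derived from `raw_pdf_elements` are all assumptions.

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

# Assumed prompt text standing in for hub.pull("rlm/multi-vector-retriever-summarization")
prompt = ChatPromptTemplate.from_template(
    "You are an assistant tasked with summarizing tables and text. "
    "Give a concise summary of the following:\n\n{element}"
)
model = ChatOpenAI(temperature=0, model="gpt-4")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

# Assumed: collect the table elements extracted by partition_pdf as strings
tables = [str(el) for el in raw_pdf_elements if type(el).__name__ == "Table"]
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})
```

The same chain can be batched over the text chunks as well if text summaries are wanted for retrieval.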
@@ -233,7 +300,10 @@
    "source": [
     "### Add to vectorstore\n",
     "\n",
-    "Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries."
+    "Use the [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries: \n",
+    "\n",
+    "* `InMemoryStore` stores the raw text and tables\n",
+    "* `vectorstore` stores the embedded summaries"
     ]
    },
    {
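A minimal sketch of that setup, assuming Chroma as the vectorstore, OpenAI embeddings, and the `tables` and `table_summaries` lists from the previous step:

```python
import uuid

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

# The vectorstore indexes the embedded summaries; the docstore keeps raw content
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key=id_key)

# A shared id links each embedded summary back to its raw table
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_docs = [
    Document(page_content=summary, metadata={id_key: table_ids[i]})
    for i, summary in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(table_ids, tables)))

# At query time the summary embeddings do the matching,
# but the retriever returns the raw table for the LLM to read
docs = retriever.get_relevant_documents("How many tokens was LLaMA-2 trained on?")
```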