From eca8a5e5b8c12d668eb046f5b9bef9eda41c0543 Mon Sep 17 00:00:00 2001
From: Lance Martin <122662504+rlancemartin@users.noreply.github.com>
Date: Mon, 16 Oct 2023 20:50:15 -0700
Subject: [PATCH] Flesh out semi-structured cookbook (#11904)

---
 cookbook/Semi_Structured_RAG.ipynb | 86 +++++++++++++++++++++++++++---
 1 file changed, 78 insertions(+), 8 deletions(-)

diff --git a/cookbook/Semi_Structured_RAG.ipynb b/cookbook/Semi_Structured_RAG.ipynb
index 985d8036edd..59d0244bfab 100644
--- a/cookbook/Semi_Structured_RAG.ipynb
+++ b/cookbook/Semi_Structured_RAG.ipynb
@@ -14,12 +14,19 @@
     "\n",
     "Many documents contain a mixture of content types, including text and tables. \n",
     "\n",
+    "Semi-structured data can be challenging for conventional RAG for at least two reasons: \n",
+    "\n",
+    "* Text splitting may break up tables, corrupting the data in retrieval\n",
+    "* Embedding tables may pose challenges for semantic similarity search \n",
+    "\n",
     "This cookbook shows how to perform RAG on documents with semi-structured data: \n",
     "\n",
     "* We will use [Unstructured](https://unstructured.io/) to parse both text and tables from documents (PDFs).\n",
-    "* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables, text along with summaries for retrieval.\n",
+    "* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables and text, along with table summaries that are better suited for retrieval.\n",
     "* We will use [LCEL](https://python.langchain.com/docs/expression_language/) to implement the chains used.\n",
     "\n",
+    "The overall flow is shown below:\n",
+    "\n",
     "![MVR.png](attachment:7b5c5a30-393c-4b27-8fa1-688306ef2aef.png)\n",
     "\n",
     "## Packages"
@@ -32,7 +39,29 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "! pip install langchain unstructured[all-docs] pydantic lxml"
+    "! pip install langchain unstructured[all-docs] pydantic lxml langchainhub"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "44349a83-e1dc-4eed-ba75-587f309d8c88",
+   "metadata": {},
+   "source": [
+    "The PDF partitioning performed by Unstructured uses: \n",
+    "\n",
+    "* `tesseract` for Optical Character Recognition (OCR)\n",
+    "* `poppler` for PDF rendering and processing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f7880871-4949-4ea2-aed8-540a09188a41",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! brew install tesseract \n",
+    "! brew install poppler"
+   ]
+  },
   {
@@ -44,8 +73,16 @@
     "\n",
     "### Partition PDF tables and text\n",
     "\n",
-    "* `LLaMA2` Paper: https://arxiv.org/pdf/2307.09288.pdf\n",
-    "* Use `Unstructured` to partition elements"
+    "We apply this to the [`LLaMA2`](https://arxiv.org/pdf/2307.09288.pdf) paper. \n",
+    "\n",
+    "We use Unstructured's [`partition_pdf`](https://unstructured-io.github.io/unstructured/bricks/partition.html#partition-pdf), which segments a PDF document using a layout model. \n",
+    "\n",
+    "This layout model makes it possible to extract elements, such as tables, from PDFs. \n",
+    "\n",
+    "We can also use `Unstructured` chunking, which:\n",
+    "\n",
+    "* Tries to identify document sections (e.g., Introduction)\n",
+    "* Then builds text blocks that maintain section boundaries while also honoring user-defined chunk sizes"
    ]
   },
   {
@@ -72,7 +109,7 @@
     "\n",
     "# Get elements\n",
     "raw_pdf_elements = partition_pdf(filename=path+\"LLaMA2.pdf\",\n",
-    "                                 # Using pdf format to find embedded image blocks\n",
+    "                                 # Unstructured first finds embedded image blocks\n",
     "                                 extract_images_in_pdf=False,\n",
     "                                 # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles\n",
    "                                 # Titles are any sub-section of the document \n",
@@ -82,13 +119,22 @@
     "                                 infer_table_structure=True, \n",
     "                                 # Post processing to aggregate text once we have the title \n",
     "                                 chunking_strategy=\"by_title\",\n",
     "                                 # Chunking params to aggregate text blocks\n",
     "                                 # Attempt to create a new chunk 3800 chars\n",
     "                                 # Attempt to keep chunks > 2000 chars \n",
-    "                                 # Hard max on chunks\n",
     "                                 max_characters=4000, \n",
     "                                 new_after_n_chars=3800, \n",
     "                                 combine_text_under_n_chars=2000,\n",
     "                                 image_output_dir_path=path)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "b09cd727-aeab-49af-8a51-0dc377321e7c",
+   "metadata": {},
+   "source": [
+    "We can examine the elements extracted by `partition_pdf`.\n",
+    "\n",
+    "`CompositeElement` objects are aggregated chunks.\n",
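+    "\n",
+    "For example, here is a minimal sketch of separating tables from aggregated text chunks (the `tables`/`texts` names are illustrative; `raw_pdf_elements` comes from the cell above):\n",
+    "\n",
+    "```python\n",
+    "from unstructured.documents.elements import CompositeElement, Table\n",
+    "\n",
+    "# Tables survive chunking as distinct elements; narrative text is\n",
+    "# aggregated into CompositeElement chunks by the by_title strategy.\n",
+    "tables = [str(el) for el in raw_pdf_elements if isinstance(el, Table)]\n",
+    "texts = [str(el) for el in raw_pdf_elements if isinstance(el, CompositeElement)]\n",
+    "```"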
\n", + "\n", + "We also can use `Unstructured` chunking, which:\n", + "\n", + "* Tries to identify document sections (e.g., Introduction, etc)\n", + "* Then, builds text blocks that maintain sections while also honoring user-defined chunk sizes" ] }, { @@ -72,7 +109,7 @@ "\n", "# Get elements\n", "raw_pdf_elements = partition_pdf(filename=path+\"LLaMA2.pdf\",\n", - " # Using pdf format to find embedded image blocks\n", + " # Unstructured first finds embedded image blocks\n", " extract_images_in_pdf=False,\n", " # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles\n", " # Titles are any sub-section of the document \n", @@ -82,13 +119,22 @@ " # Chunking params to aggregate text blocks\n", " # Attempt to create a new chunk 3800 chars\n", " # Attempt to keep chunks > 2000 chars \n", - " # Hard max on chunks\n", " max_characters=4000, \n", " new_after_n_chars=3800, \n", " combine_text_under_n_chars=2000,\n", " image_output_dir_path=path)" ] }, + { + "cell_type": "markdown", + "id": "b09cd727-aeab-49af-8a51-0dc377321e7c", + "metadata": {}, + "source": [ + "We can examine the elements extracted by `partition_pdf`.\n", + "\n", + "`CompositeElement` are aggregated chunks." + ] + }, { "cell_type": "code", "execution_count": 13, @@ -168,7 +214,13 @@ "source": [ "## Multi-vector retriever\n", "\n", - "Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary).\n", + "Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to produce summaries of tables and, optionally, text. \n", + "\n", + "With the summary, we will also store the raw table elements.\n", + "\n", + "The summaries are used to improve the quality of retrieval, [as explained in the multi vector retriever docs](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector).\n", + "\n", + "The raw tables are passed to the LLM, providing the full table context for the LLM to generate the answer. \n", "\n", "### Summaries" ] @@ -185,6 +237,21 @@ "from langchain.schema.output_parser import StrOutputParser" ] }, + { + "cell_type": "markdown", + "id": "37b65677-aeb4-44fd-b06d-4539341ede97", + "metadata": {}, + "source": [ + "We create a simple summarize chain for each element.\n", + "\n", + "You can also see, re-use, or modify the prompt in the Hub [here](https://smith.langchain.com/hub/rlm/multi-vector-retriever-summarization).\n", + "\n", + "```\n", + "from langchain import hub\n", + "obj = hub.pull(\"rlm/multi-vector-retriever-summarization\")\n", + "```" + ] + }, { "cell_type": "code", "execution_count": 17, @@ -233,7 +300,10 @@ "source": [ "### Add to vectorstore\n", "\n", - "Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries." + "Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries: \n", + "\n", + "* `InMemoryStore` stores the raw text, tables\n", + "* `vectorstore` stores the embedded summaries" ] }, {