Flesh out semi-structured cookbook (#11904)
@@ -14,12 +14,19 @@
     "\n",
     "Many documents contain a mixture of content types, including text and tables. \n",
     "\n",
     "Semi-structured data can be challenging for conventional RAG for at least two reasons: \n",
     "\n",
     "* Text splitting may break up tables, corrupting the data in retrieval\n",
     "* Embedding tables may pose challenges for semantic similarity search \n",
     "\n",
     "This cookbook shows how to perform RAG on documents with semi-structured data: \n",
     "\n",
     "* We will use [Unstructured](https://unstructured.io/) to parse both text and tables from documents (PDFs).\n",
-    "* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables, text along with summaries for retrieval.\n",
+    "* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables and text, along with table summaries that are better suited for retrieval.\n",
+    "* We will use [LCEL](https://python.langchain.com/docs/expression_language/) to implement the chains used.\n",
+    "\n",
+    "The overall flow is shown here:\n",
+    "\n",
+    "\n",
+    "\n",
     "## Packages"
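For readers new to LCEL: it composes prompts, models, and parsers with the `|` operator. A minimal primer sketch follows; the prompt wording and model choice here are illustrative assumptions, not part of the cookbook.

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

# LCEL pipes each runnable's output into the next: prompt -> model -> parser
chain = (
    ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
)

print(chain.invoke({"text": "Tables often break naive text splitting in RAG."}))
```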
@@ -32,7 +39,29 @@
     "metadata": {},
     "outputs": [],
     "source": [
-    "! pip install langchain unstructured[all-docs] pydantic lxml"
+    "! pip install langchain unstructured[all-docs] pydantic lxml langchainhub"
     ]
    },
+   {
+    "cell_type": "markdown",
+    "id": "44349a83-e1dc-4eed-ba75-587f309d8c88",
+    "metadata": {},
+    "source": [
+     "The PDF partitioning used by Unstructured will use: \n",
+     "\n",
+     "* `tesseract` for Optical Character Recognition (OCR)\n",
+     "* `poppler` for PDF rendering and processing"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "f7880871-4949-4ea2-aed8-540a09188a41",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "! brew install tesseract \n",
+     "! brew install poppler"
+    ]
+   },
    {
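If the installs above succeed, both system dependencies should be on the `PATH`. A quick sanity check from Python, using only the standard library, might look like this sketch:

```python
import shutil

# poppler ships the pdfinfo/pdftoppm binaries; tesseract is a single binary
for tool in ("tesseract", "pdfinfo"):
    found = shutil.which(tool)
    print(f"{tool}: {'found at ' + found if found else 'NOT FOUND'}")
```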
@@ -44,8 +73,16 @@
     "\n",
     "### Partition PDF tables and text\n",
     "\n",
-    "* `LLaMA2` Paper: https://arxiv.org/pdf/2307.09288.pdf\n",
-    "* Use `Unstructured` to partition elements"
+    "Apply this to the [`LLaMA2`](https://arxiv.org/pdf/2307.09288.pdf) paper. \n",
+    "\n",
+    "We use the Unstructured [`partition_pdf`](https://unstructured-io.github.io/unstructured/bricks/partition.html#partition-pdf) function, which segments a PDF document using a layout model. \n",
+    "\n",
+    "This layout model makes it possible to extract elements, such as tables, from PDFs. \n",
+    "\n",
+    "We can also use `Unstructured` chunking, which:\n",
+    "\n",
+    "* Tries to identify document sections (e.g., Introduction)\n",
+    "* Then builds text blocks that keep sections together while also honoring user-defined chunk sizes"
@@ -72,7 +109,7 @@
     "\n",
     "# Get elements\n",
     "raw_pdf_elements = partition_pdf(filename=path+\"LLaMA2.pdf\",\n",
-    "                                 # Using pdf format to find embedded image blocks\n",
+    "                                 # Unstructured first finds embedded image blocks\n",
     "                                 extract_images_in_pdf=False,\n",
     "                                 # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles\n",
     "                                 # Titles are any sub-section of the document \n",
@@ -82,13 +119,22 @@
     "                                 # Chunking params to aggregate text blocks\n",
     "                                 # Attempt to create a new chunk at 3800 chars\n",
     "                                 # Attempt to keep chunks > 2000 chars \n",
     "                                 # Hard max on chunk size\n",
     "                                 max_characters=4000, \n",
     "                                 new_after_n_chars=3800, \n",
     "                                 combine_text_under_n_chars=2000,\n",
     "                                 image_output_dir_path=path)"
     ]
    },
+   {
+    "cell_type": "markdown",
+    "id": "b09cd727-aeab-49af-8a51-0dc377321e7c",
+    "metadata": {},
+    "source": [
+     "We can examine the elements extracted by `partition_pdf`.\n",
+     "\n",
+     "`CompositeElement` instances are aggregated chunks."
+    ]
+   },
    {
     "cell_type": "code",
     "execution_count": 13,
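The diff shows the partitioning call in fragments, with some arguments elided between hunks. Pieced together, a complete call might look like the sketch below; the `infer_table_structure` and `chunking_strategy` arguments are assumptions based on typical `partition_pdf` usage, as are the placeholder path and the element-type tally at the end.

```python
from collections import Counter

from unstructured.partition.pdf import partition_pdf

path = "/path/to/docs/"  # hypothetical directory containing LLaMA2.pdf

raw_pdf_elements = partition_pdf(
    filename=path + "LLaMA2.pdf",
    # Skip embedded image extraction; we only want text and tables
    extract_images_in_pdf=False,
    # Assumed: ask the layout model for table structure as well
    infer_table_structure=True,
    # Assumed: chunk by title so document sections stay together
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)

# Tally element types; CompositeElement instances are the aggregated chunks
print(Counter(type(el).__name__ for el in raw_pdf_elements))
```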
@@ -168,7 +214,13 @@
    "source": [
     "## Multi-vector retriever\n",
     "\n",
-    "Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary).\n",
+    "Use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to produce summaries of tables and, optionally, text. \n",
+    "\n",
+    "Alongside each summary, we will also store the raw table elements.\n",
+    "\n",
+    "The summaries are used to improve the quality of retrieval, [as explained in the multi-vector retriever docs](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector).\n",
+    "\n",
+    "The raw tables are passed to the LLM, providing the full table context for the LLM to generate the answer. \n",
     "\n",
     "### Summaries"
    ]
@@ -185,6 +237,21 @@
     "from langchain.schema.output_parser import StrOutputParser"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "37b65677-aeb4-44fd-b06d-4539341ede97",
+   "metadata": {},
+   "source": [
+    "We create a simple summarize chain for each element.\n",
+    "\n",
+    "You can also view, reuse, or modify the prompt in the Hub [here](https://smith.langchain.com/hub/rlm/multi-vector-retriever-summarization).\n",
+    "\n",
+    "```\n",
+    "from langchain import hub\n",
+    "obj = hub.pull(\"rlm/multi-vector-retriever-summarization\")\n",
+    "```"
+   ]
+  },
    {
     "cell_type": "code",
     "execution_count": 17,
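The summarization cell itself is elided by the hunk boundary. A minimal sketch of such a summarize chain follows, using an inlined prompt in place of the Hub pull shown above; the prompt wording, model choice, and the way `tables` is derived from `raw_pdf_elements` are all assumptions.

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

# Assumed prompt text standing in for hub.pull("rlm/multi-vector-retriever-summarization")
prompt = ChatPromptTemplate.from_template(
    "You are an assistant tasked with summarizing tables and text. "
    "Give a concise summary of the following:\n\n{element}"
)
model = ChatOpenAI(temperature=0, model="gpt-4")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

# Assumed: collect the table elements extracted by partition_pdf as strings
tables = [str(el) for el in raw_pdf_elements if type(el).__name__ == "Table"]
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})
```

The same chain can be batched over the text chunks as well if text summaries are wanted for retrieval.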
@@ -233,7 +300,10 @@
    "source": [
     "### Add to vectorstore\n",
     "\n",
-    "Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries."
+    "Use the [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries: \n",
+    "\n",
+    "* `InMemoryStore` stores the raw text and tables\n",
+    "* `vectorstore` stores the embedded summaries"
     ]
    },
    {
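A minimal sketch of that setup, assuming Chroma as the vectorstore, OpenAI embeddings, and the `tables` and `table_summaries` lists from the previous step:

```python
import uuid

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

# The vectorstore indexes the embedded summaries; the docstore keeps raw content
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key=id_key)

# A shared id links each embedded summary back to its raw table
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_docs = [
    Document(page_content=summary, metadata={id_key: table_ids[i]})
    for i, summary in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(table_ids, tables)))

# At query time the summary embeddings do the matching,
# but the retriever returns the raw table for the LLM to read
docs = retriever.get_relevant_documents("How many tokens was LLaMA-2 trained on?")
```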