{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# PyMuPDFLoader\n",
"\n",
"This notebook provides a quick overview for getting started with the `PyMuPDF` [document loader](https://python.langchain.com/docs/concepts/document_loaders). For detailed documentation of all `PyMuPDFLoader` features and configurations, head to the [API reference](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyMuPDFLoader.html).\n",
"\n",
" \n",
"\n",
"## Overview\n",
"### Integration details\n",
"\n",
"| Class | Package | Local | Serializable | JS support|\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [PyMuPDFLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyMuPDFLoader.html) | [langchain_community](https://python.langchain.com/api_reference/community/index.html) | ✅ | ❌ | ❌ | \n",
"\n",
"--------- \n",
"\n",
"### Loader features\n",
"\n",
"| Source | Document Lazy Loading | Native Async Support | Extract Images | Extract Tables |\n",
"| :---: | :---: | :---: | :---: | :---: |\n",
"| PyMuPDFLoader | ✅ | ❌ | ✅ | ✅ |\n",
"\n",
" \n",
"\n",
"## Setup\n",
"\n",
"### Credentials\n",
"\n",
"No credentials are required to use PyMuPDFLoader."
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "If you want to get automated best-in-class tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-16T09:48:32.087401Z",
"start_time": "2025-01-16T09:48:32.084843Z"
}
},
"cell_type": "code",
"source": [
"# import getpass\n",
"# import os\n",
"\n",
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
],
"outputs": [],
"execution_count": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"Install **langchain_community** and **pymupdf**."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-16T09:48:34.720803Z",
"start_time": "2025-01-16T09:48:33.057015Z"
}
},
"source": "%pip install -qU langchain_community pymupdf",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"execution_count": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization\n",
"\n",
"Now we can instantiate our loader and load documents:"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-16T09:48:36.618850Z",
"start_time": "2025-01-16T09:48:35.787958Z"
}
},
"source": [
"from langchain_community.document_loaders import PyMuPDFLoader\n",
"\n",
"file_path = \"./example_data/layout-parser-paper.pdf\"\n",
"loader = PyMuPDFLoader(file_path)"
],
"outputs": [],
"execution_count": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-16T09:48:37.650774Z",
"start_time": "2025-01-16T09:48:37.492137Z"
}
},
"source": [
"docs = loader.load()\n",
"docs[0]"
],
"outputs": [
{
"data": {
"text/plain": [
"Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-06-22T01:27:10+00:00', 'source': './example_data/layout-parser-paper.pdf', 'file_path': './example_data/layout-parser-paper.pdf', 'total_pages': 16, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'trapped': '', 'page': 0}, page_content='LayoutParser: A Unified Toolkit for Deep\\nLearning Based Document Image Analysis\\nZejiang Shen1 (\\x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\\nLee4, Jacob Carlson3, and Weining Li5\\n1 Allen Institute for AI\\nshannons@allenai.org\\n2 Brown University\\nruochen zhang@brown.edu\\n3 Harvard University\\n{melissadell,jacob carlson}@fas.harvard.edu\\n4 University of Washington\\nbcgl@cs.washington.edu\\n5 University of Waterloo\\nw422li@uwaterloo.ca\\nAbstract. Recent advances in document image analysis (DIA) have been\\nprimarily driven by the application of neural networks. Ideally, research\\noutcomes could be easily deployed in production and extended for further\\ninvestigation. However, various factors like loosely organized codebases\\nand sophisticated model configurations complicate the easy reuse of im-\\nportant innovations by a wide audience. Though there have been on-going\\nefforts to improve reusability and simplify deep learning (DL) model\\ndevelopment in disciplines like natural language processing and computer\\nvision, none of them are optimized for challenges in the domain of DIA.\\nThis represents a major gap in the existing toolkit, as DIA is central to\\nacademic research across a wide range of disciplines in the social sciences\\nand humanities. This paper introduces LayoutParser, an open-source\\nlibrary for streamlining the usage of DL in DIA research and applica-\\ntions. The core LayoutParser library comes with a set of simple and\\nintuitive interfaces for applying and customizing DL models for layout de-\\ntection, character recognition, and many other document processing tasks.\\nTo promote extensibility, LayoutParser also incorporates a community\\nplatform for sharing both pre-trained models and full document digiti-\\nzation pipelines. We demonstrate that LayoutParser is helpful for both\\nlightweight and large-scale digitization pipelines in real-word use cases.\\nThe library is publicly available at https://layout-parser.github.io.\\nKeywords: Document Image Analysis · Deep Learning · Layout Analysis\\n· Character Recognition · Open Source library · Toolkit.\\n1\\nIntroduction\\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of\\ndocument image analysis (DIA) tasks including document image classification [11,\\narXiv:2103.15348v2 [cs.CV] 21 Jun 2021')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 4
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-16T09:48:38.072178Z",
"start_time": "2025-01-16T09:48:38.069508Z"
}
},
"source": [
"import pprint\n",
"\n",
"pprint.pp(docs[0].metadata)"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'producer': 'pdfTeX-1.40.21',\n",
" 'creator': 'LaTeX with hyperref',\n",
" 'creationdate': '2021-06-22T01:27:10+00:00',\n",
" 'source': './example_data/layout-parser-paper.pdf',\n",
" 'file_path': './example_data/layout-parser-paper.pdf',\n",
" 'total_pages': 16,\n",
" 'format': 'PDF 1.5',\n",
" 'title': '',\n",
" 'author': '',\n",
" 'subject': '',\n",
" 'keywords': '',\n",
" 'moddate': '2021-06-22T01:27:10+00:00',\n",
" 'trapped': '',\n",
" 'page': 0}\n"
]
}
],
"execution_count": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lazy Load\n"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-16T09:48:39.349546Z",
"start_time": "2025-01-16T09:48:39.295384Z"
}
},
"source": [
"pages = []\n",
"for doc in loader.lazy_load():\n",
"    pages.append(doc)\n",
"    if len(pages) >= 10:\n",
"        # do some paged operation, e.g.\n",
"        # index.upsert(pages)\n",
"\n",
"        pages = []\n",
"len(pages)"
],
"outputs": [
{
"data": {
"text/plain": [
"6"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 6
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-16T09:48:39.991257Z",
"start_time": "2025-01-16T09:48:39.987732Z"
}
},
"source": [
"print(pages[0].page_content[:100])\n",
"pprint.pp(pages[0].metadata)"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LayoutParser: A Unified Toolkit for DL-Based DIA\n",
"11\n",
"focuses on precision, efficiency, and robustness. T\n",
"{'producer': 'pdfTeX-1.40.21',\n",
" 'creator': 'LaTeX with hyperref',\n",
" 'creationdate': '2021-06-22T01:27:10+00:00',\n",
" 'source': './example_data/layout-parser-paper.pdf',\n",
" 'file_path': './example_data/layout-parser-paper.pdf',\n",
" 'total_pages': 16,\n",
" 'format': 'PDF 1.5',\n",
" 'title': '',\n",
" 'author': '',\n",
" 'subject': '',\n",
" 'keywords': '',\n",
" 'moddate': '2021-06-22T01:27:10+00:00',\n",
" 'trapped': '',\n",
" 'page': 10}\n"
]
}
],
"execution_count": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The metadata attribute contains at least the following keys:\n",
"- source\n",
"- page (if in mode *page*)\n",
"- total_pages\n",
"- creationdate\n",
"- creator\n",
"- producer\n",
"\n",
"Additional metadata is specific to each parser.\n",
"These pieces of information can be helpful (for example, to categorize your PDFs, as sketched below)."
]
},
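{
"cell_type": "markdown",
"metadata": {},
"source": "For instance, a minimal sketch (assuming the `docs` loaded above) that buckets pages by their `producer` metadata:"
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"from collections import defaultdict\n",
"\n",
"# A minimal sketch: bucket the loaded pages by the PDF \"producer\"\n",
"# metadata key, e.g. to route PDFs produced by different toolchains\n",
"# to different processing pipelines.\n",
"by_producer = defaultdict(list)\n",
"for doc in docs:\n",
"    by_producer[doc.metadata[\"producer\"]].append(doc)\n",
"\n",
"for producer, group in by_producer.items():\n",
"    print(producer, len(group))"
]
},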
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Splitting mode & custom pages delimiter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When loading the PDF file you can split it in two different ways:\n",
"- By page\n",
"- As a single text flow\n",
"\n",
"By default, PyMuPDFLoader splits the PDF by page."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract the PDF by page. Each page is extracted as a LangChain Document object:"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-16T09:48:43.180738Z",
"start_time": "2025-01-16T09:48:43.132909Z"
}
},
"source": [
"loader = PyMuPDFLoader(\n",
"    \"./example_data/layout-parser-paper.pdf\",\n",
"    mode=\"page\",\n",
")\n",
"docs = loader.load()\n",
"print(len(docs))\n",
"pprint.pp(docs[0].metadata)"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"16\n",
"{'producer': 'pdfTeX-1.40.21',\n",
" 'creator': 'LaTeX with hyperref',\n",
" 'creationdate': '2021-06-22T01:27:10+00:00',\n",
" 'source': './example_data/layout-parser-paper.pdf',\n",
" 'file_path': './example_data/layout-parser-paper.pdf',\n",
" 'total_pages': 16,\n",
" 'format': 'PDF 1.5',\n",
" 'title': '',\n",
" 'author': '',\n",
" 'subject': '',\n",
" 'keywords': '',\n",
" 'moddate': '2021-06-22T01:27:10+00:00',\n",
" 'trapped': '',\n",
" 'page': 0}\n"
]
}
],
"execution_count": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this mode the PDF is split by page and the resulting Document metadata contains the page number. But in some cases you may want to process the PDF as a single text flow (so that paragraphs are not cut in half). In that case, you can use the *single* mode:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract the whole PDF as a single LangChain Document object:"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-16T09:48:45.358999Z",
"start_time": "2025-01-16T09:48:45.305168Z"
}
},
"source": [
"loader = PyMuPDFLoader(\n",
"    \"./example_data/layout-parser-paper.pdf\",\n",
"    mode=\"single\",\n",
")\n",
"docs = loader.load()\n",
"print(len(docs))\n",
"pprint.pp(docs[0].metadata)"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1\n",
"{'producer': 'pdfTeX-1.40.21',\n",
" 'creator': 'LaTeX with hyperref',\n",
" 'creationdate': '2021-06-22T01:27:10+00:00',\n",
" 'source': './example_data/layout-parser-paper.pdf',\n",
" 'file_path': './example_data/layout-parser-paper.pdf',\n",
" 'total_pages': 16,\n",
" 'format': 'PDF 1.5',\n",
" 'title': '',\n",
" 'author': '',\n",
" 'subject': '',\n",
" 'keywords': '',\n",
" 'moddate': '2021-06-22T01:27:10+00:00',\n",
" 'trapped': ''}\n"
]
}
],
"execution_count": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Logically, in this mode, the `page` metadata disappears. Here's how to clearly identify where pages end in the text flow:"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Add a custom *pages_delimiter* to identify where pages end in *single* mode:"
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"loader = PyMuPDFLoader(\n",
"    \"./example_data/layout-parser-paper.pdf\",\n",
"    mode=\"single\",\n",
"    pages_delimiter=\"\\n-------THIS IS A CUSTOM END OF PAGE-------\\n\",\n",
")\n",
"docs = loader.load()\n",
"print(docs[0].page_content[:5780])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This could simply be `\\n`, or `\\f` to clearly indicate a page change, or `<!-- PAGE BREAK -->` for seamless injection into a Markdown viewer without a visual effect."
]
},
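{
"cell_type": "markdown",
"metadata": {},
"source": "The delimiter also makes it easy to recover per-page chunks from the single text flow. A minimal sketch, assuming the custom delimiter set above:"
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"# A minimal sketch: split the single text flow back into per-page\n",
"# chunks on the custom delimiter defined above.\n",
"chunks = docs[0].page_content.split(\n",
"    \"\\n-------THIS IS A CUSTOM END OF PAGE-------\\n\"\n",
")\n",
"print(len(chunks))  # expect one chunk per page"
]
},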
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extract images from the PDF"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can extract images from your PDFs with a choice of three different solutions:\n",
"- RapidOCR (lightweight Optical Character Recognition tool)\n",
"- Tesseract (OCR tool with high precision)\n",
"- Multimodal language model\n",
"\n",
"You can tune these functions to choose the output format of the extracted images among *html*, *markdown* and *text*.\n",
"\n",
"The result is inserted between the last and the second-to-last paragraphs of text of the page."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract images from the PDF with RapidOCR:"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-16T09:48:52.381641Z",
"start_time": "2025-01-16T09:48:50.979344Z"
}
},
"source": [
"%pip install -qU rapidocr-onnxruntime"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"execution_count": 11
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"from langchain_community.document_loaders.parsers import RapidOCRBlobParser\n",
"\n",
"loader = PyMuPDFLoader(\n",
"    \"./example_data/layout-parser-paper.pdf\",\n",
"    mode=\"page\",\n",
"    images_inner_format=\"markdown-img\",\n",
"    images_parser=RapidOCRBlobParser(),\n",
")\n",
"docs = loader.load()\n",
"\n",
"print(docs[5].page_content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Be careful: RapidOCR is designed to work with Chinese and English, not other languages."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract images from the PDF with Tesseract:"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-15T09:03:07.148968Z",
"start_time": "2025-01-15T09:03:05.580316Z"
}
},
"source": [
"%pip install -qU pytesseract"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"execution_count": 53
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"from langchain_community.document_loaders.parsers import TesseractBlobParser\n",
"\n",
"loader = PyMuPDFLoader(\n",
"    \"./example_data/layout-parser-paper.pdf\",\n",
"    mode=\"page\",\n",
"    images_inner_format=\"html-img\",\n",
"    images_parser=TesseractBlobParser(),\n",
")\n",
"docs = loader.load()\n",
"print(docs[5].page_content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract images from the PDF with a multimodal model:"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-16T07:14:05.692081Z",
"start_time": "2025-01-16T07:14:04.251598Z"
}
},
"source": [
"%pip install -qU langchain_openai python-dotenv"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"execution_count": 15
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-16T07:14:06.534821Z",
"start_time": "2025-01-16T07:14:06.511587Z"
}
},
"source": [
"import os\n",
"\n",
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv()"
],
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 16
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-16T07:14:07.080591Z",
"start_time": "2025-01-16T07:14:07.077724Z"
}
},
"source": [
"from getpass import getpass\n",
"\n",
"if not os.environ.get(\"OPENAI_API_KEY\"):\n",
"    os.environ[\"OPENAI_API_KEY\"] = getpass(\"OpenAI API key =\")"
],
"outputs": [],
"execution_count": 17
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"from langchain_community.document_loaders.parsers import LLMImageBlobParser\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"loader = PyMuPDFLoader(\n",
"    \"./example_data/layout-parser-paper.pdf\",\n",
"    mode=\"page\",\n",
"    images_inner_format=\"markdown-img\",\n",
"    images_parser=LLMImageBlobParser(model=ChatOpenAI(model=\"gpt-4o\", max_tokens=1024)),\n",
")\n",
"docs = loader.load()\n",
"print(docs[5].page_content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extract tables from the PDF"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With PyMuPDF you can extract tables from your PDFs in *html*, *markdown* or *csv* format:"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-16T07:14:13.080261Z",
"start_time": "2025-01-16T07:14:11.825744Z"
}
},
"source": [
"loader = PyMuPDFLoader(\n",
"    \"./example_data/layout-parser-paper.pdf\",\n",
"    mode=\"page\",\n",
"    extract_tables=\"markdown\",\n",
")\n",
"docs = loader.load()\n",
"print(docs[4].page_content)"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LayoutParser: A Unified Toolkit for DL-Based DIA\n",
"5\n",
"Table 1: Current layout detection models in the LayoutParser model zoo\n",
"Dataset\n",
"Base Model1 Large Model\n",
"Notes\n",
"PubLayNet [38]\n",
"F / M\n",
"M\n",
"Layouts of modern scientific documents\n",
"PRImA [3]\n",
"M\n",
"-\n",
"Layouts of scanned modern magazines and scientific reports\n",
"Newspaper [17]\n",
"F\n",
"-\n",
"Layouts of scanned US newspapers from the 20th century\n",
"TableBank [18]\n",
"F\n",
"F\n",
"Table region on modern scientific and business document\n",
"HJDataset [31]\n",
"F / M\n",
"-\n",
"Layouts of history Japanese documents\n",
"1 For each dataset, we train several models of different sizes for different needs (the trade-offbetween accuracy\n",
"vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101\n",
"backbones [13], respectively. One can train models of different architectures, like Faster R-CNN [28] (F) and Mask\n",
"R-CNN [12] (M). For example, an F in the Large Model column indicates it has a Faster R-CNN model trained\n",
"using the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model\n",
"zoo in coming months.\n",
"layout data structures, which are optimized for efficiency and versatility. 3) When\n",
"necessary, users can employ existing or customized OCR models via the unified\n",
"API provided in the OCR module. 4) LayoutParser comes with a set of utility\n",
"functions for the visualization and storage of the layout data. 5) LayoutParser\n",
"is also highly customizable, via its integration with functions for layout data\n",
"annotation and model training. We now provide detailed descriptions for each\n",
"component.\n",
"3.1\n",
"Layout Detection Models\n",
"In LayoutParser, a layout model takes a document image as an input and\n",
"generates a list of rectangular boxes for the target content regions. Different\n",
"from traditional methods, it relies on deep convolutional neural networks rather\n",
"than manually curated rules to identify content regions. It is formulated as an\n",
"object detection problem and state-of-the-art models like Faster R-CNN [28] and\n",
"Mask R-CNN [12] are used. This yields prediction results of high accuracy and\n",
"makes it possible to build a concise, generalized interface for layout detection.\n",
"LayoutParser, built upon Detectron2 [35], provides a minimal API that can\n",
"perform layout detection with only four lines of code in Python:\n",
"1 import\n",
"layoutparser as lp\n",
"2 image = cv2.imread(\"image_file\") # load\n",
"images\n",
"3 model = lp. Detectron2LayoutModel (\n",
"4\n",
"\"lp:// PubLayNet/ faster_rcnn_R_50_FPN_3x /config\")\n",
"5 layout = model.detect(image)\n",
"LayoutParser provides a wealth of pre-trained model weights using various\n",
"datasets covering different languages, time periods, and document types. Due to\n",
"domain shift [7], the prediction performance can notably drop when models are ap-\n",
"plied to target samples that are significantly different from the training dataset. As\n",
"document structures and layouts vary greatly in different domains, it is important\n",
"to select models trained on a dataset similar to the test samples. A semantic syntax\n",
"is used for initializing the model weights in LayoutParser, using both the dataset\n",
"name and model name lp://<dataset-name>/<model-architecture-name>.\n",
"\n",
"\n",
"|Dataset|Base Model1|Large Model|Notes|\n",
"|---|---|---|---|\n",
"|PubLayNet [38] PRImA [3] Newspaper [17] TableBank [18] HJDataset [31]|F / M M F F F / M|M &#45; &#45; F &#45;|Layouts of modern scientific documents Layouts of scanned modern magazines and scientific reports Layouts of scanned US newspapers from the 20th century Table region on modern scientific and business document Layouts of history Japanese documents|\n"
]
}
],
"execution_count": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Working with Files\n",
"\n",
"Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed, rather than how the file is loaded. For example, you can use `open` to read the binary content of either a PDF or a markdown file, but you need different parsing logic to convert that binary data into text.\n",
"\n",
"As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to re-use a given parser regardless of how the data was loaded.\n",
"You can use this strategy to analyze different files, with the same parsing parameters."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2025-01-16T07:14:13.355746Z",
"start_time": "2025-01-16T07:14:13.304456Z"
}
},
"source": [
"from langchain_community.document_loaders import FileSystemBlobLoader\n",
"from langchain_community.document_loaders.generic import GenericLoader\n",
"from langchain_community.document_loaders.parsers import PyMuPDFParser\n",
"\n",
"loader = GenericLoader(\n",
"    blob_loader=FileSystemBlobLoader(\n",
"        path=\"./example_data/\",\n",
"        glob=\"*.pdf\",\n",
"    ),\n",
"    blob_parser=PyMuPDFParser(),\n",
")\n",
"docs = loader.load()\n",
"print(docs[0].page_content)\n",
"pprint.pp(docs[0].metadata)"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LayoutParser: A Unified Toolkit for Deep\n",
"Learning Based Document Image Analysis\n",
"Zejiang Shen1 (\u0000), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\n",
"Lee4, Jacob Carlson3, and Weining Li5\n",
"1 Allen Institute for AI\n",
"shannons@allenai.org\n",
"2 Brown University\n",
"ruochen zhang@brown.edu\n",
"3 Harvard University\n",
"{melissadell,jacob carlson}@fas.harvard.edu\n",
"4 University of Washington\n",
"bcgl@cs.washington.edu\n",
"5 University of Waterloo\n",
"w422li@uwaterloo.ca\n",
"Abstract. Recent advances in document image analysis (DIA) have been\n",
"primarily driven by the application of neural networks. Ideally, research\n",
"outcomes could be easily deployed in production and extended for further\n",
"investigation. However, various factors like loosely organized codebases\n",
"and sophisticated model configurations complicate the easy reuse of im-\n",
"portant innovations by a wide audience. Though there have been on-going\n",
"efforts to improve reusability and simplify deep learning (DL) model\n",
"development in disciplines like natural language processing and computer\n",
"vision, none of them are optimized for challenges in the domain of DIA.\n",
"This represents a major gap in the existing toolkit, as DIA is central to\n",
"academic research across a wide range of disciplines in the social sciences\n",
"and humanities. This paper introduces LayoutParser, an open-source\n",
"library for streamlining the usage of DL in DIA research and applica-\n",
"tions. The core LayoutParser library comes with a set of simple and\n",
"intuitive interfaces for applying and customizing DL models for layout de-\n",
"tection, character recognition, and many other document processing tasks.\n",
"To promote extensibility, LayoutParser also incorporates a community\n",
"platform for sharing both pre-trained models and full document digiti-\n",
"zation pipelines. We demonstrate that LayoutParser is helpful for both\n",
"lightweight and large-scale digitization pipelines in real-word use cases.\n",
"The library is publicly available at https://layout-parser.github.io.\n",
"Keywords: Document Image Analysis · Deep Learning · Layout Analysis\n",
"· Character Recognition · Open Source library · Toolkit.\n",
"1\n",
"Introduction\n",
"Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of\n",
"document image analysis (DIA) tasks including document image classification [11,\n",
"arXiv:2103.15348v2 [cs.CV] 21 Jun 2021\n",
"{'source': 'example_data/layout-parser-paper.pdf',\n",
" 'file_path': 'example_data/layout-parser-paper.pdf',\n",
" 'total_pages': 16,\n",
" 'format': 'PDF 1.5',\n",
" 'title': '',\n",
" 'author': '',\n",
" 'subject': '',\n",
" 'keywords': '',\n",
" 'creator': 'LaTeX with hyperref',\n",
" 'producer': 'pdfTeX-1.40.21',\n",
" 'creationdate': '2021-06-22T01:27:10+00:00',\n",
" 'moddate': '2021-06-22T01:27:10+00:00',\n",
" 'trapped': '',\n",
" 'page': 0}\n"
]
}
],
"execution_count": 20
},
{
"cell_type": "markdown",
"metadata": {},
"source": "It is possible to work with files from cloud storage."
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from langchain_community.document_loaders import CloudBlobLoader\n",
"from langchain_community.document_loaders.generic import GenericLoader\n",
"\n",
"loader = GenericLoader(\n",
"    blob_loader=CloudBlobLoader(\n",
"        url=\"s3://mybucket\",  # Supports s3://, az://, gs://, file:// schemes.\n",
"        glob=\"*.pdf\",\n",
"    ),\n",
"    blob_parser=PyMuPDFParser(),\n",
")\n",
"docs = loader.load()\n",
"print(docs[0].page_content)\n",
"pprint.pp(docs[0].metadata)"
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all `PyMuPDFLoader` features and configurations head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyMuPDFLoader.html"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}