docs: add Docling loader docs (#29104)

### Description
This adds the docs for the Docling document loader.
[Docling](https://github.com/DS4SD/docling) parses PDF, DOCX, PPTX,
HTML, and other formats into a rich unified representation including
document layout, tables etc., making them ready for generative AI
workflows like RAG.

Some references:
- https://research.ibm.com/blog/docling-generative-AI
- https://www.redhat.com/en/blog/docling-missing-document-processing-companion-generative-ai
- [Docling Technical Report](https://arxiv.org/abs/2408.09869)

The introduced `DoclingLoader` enables users to:
- use various document types in their LLM applications with ease and
speed, and
- leverage Docling's rich representation for advanced, document-native
grounding.

### Issue
Replacing PR #27987 as discussed with @efriis
[here](https://github.com/langchain-ai/langchain/pull/27987#issuecomment-2489354930).

### Dependencies
None

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
This commit is contained in:
Panos Vagenas 2025-01-09 16:15:35 +01:00 committed by GitHub
parent cc55e32924
commit 858f655a25
4 changed files with 621 additions and 0 deletions

@ -0,0 +1,555 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Docling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Docling](https://github.com/DS4SD/docling) parses PDF, DOCX, PPTX, HTML, and other formats into a rich unified representation including document layout, tables etc., making them ready for generative AI workflows like RAG.\n",
"\n",
"This integration provides Docling's capabilities via the `DoclingLoader` document loader."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"<!-- \n",
"### Integration details\n",
"\n",
"| Class | Package | Local | Serializable | JS support|\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| langchain_docling.DoclingLoader | langchain-docling | ✅ | ❌ | ❌ | \n",
"\n",
"### Loader features\n",
"| Source | Document Lazy Loading | Native Async Support\n",
"| :---: | :---: | :---: | \n",
"| DoclingLoader | ✅ | ❌ | \n",
" -->\n",
"\n",
"The presented `DoclingLoader` component enables you to:\n",
"- use various document types in your LLM applications with ease and speed, and\n",
"- leverage Docling's rich format for advanced, document-native grounding.\n",
"\n",
"`DoclingLoader` supports two different export modes:\n",
"- `ExportType.DOC_CHUNKS` (default): if you want to have each input document chunked and\n",
" to then capture each individual chunk as a separate LangChain Document downstream, or\n",
"- `ExportType.MARKDOWN`: if you want to capture each input document as a separate\n",
" LangChain Document\n",
"\n",
"The example allows exploring both modes via parameter `EXPORT_TYPE`; depending on the\n",
"value set, the example pipeline is then set up accordingly."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install -qU langchain-docling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use a GPU-enabled runtime."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Basic initialization looks as follows:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from langchain_docling import DoclingLoader\n",
"\n",
"FILE_PATH = \"https://arxiv.org/pdf/2408.09869\"\n",
"\n",
"loader = DoclingLoader(file_path=FILE_PATH)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For advanced usage, `DoclingLoader` has the following parameters:\n",
"- `file_path`: source as single str (URL or local file) or iterable thereof\n",
"- `converter` (optional): any specific Docling converter instance to use\n",
"- `convert_kwargs` (optional): any specific kwargs for conversion execution\n",
"- `export_type` (optional): export mode to use: `ExportType.DOC_CHUNKS` (default) or\n",
" `ExportType.MARKDOWN`\n",
"- `md_export_kwargs` (optional): any specific Markdown export kwargs (for Markdown mode)\n",
"- `chunker` (optional): any specific Docling chunker instance to use (for doc-chunk\n",
" mode)\n",
"- `meta_extractor` (optional): any specific metadata extractor to use\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors\n"
]
}
],
"source": [
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> Note: a message saying `\"Token indices sequence length is longer than the specified\n",
"maximum sequence length...\"` can be ignored in this case — more details\n",
"[here](https://github.com/DS4SD/docling-core/issues/119#issuecomment-2577418826)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Inspecting some sample docs:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"- d.page_content='arXiv:2408.09869v5 [cs.CL] 9 Dec 2024'\n",
"- d.page_content='Docling Technical Report\\nVersion 1.0\\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\\nAI4K Group, IBM Research R¨uschlikon, Switzerland'\n",
"- d.page_content='Abstract\\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'\n"
]
}
],
"source": [
"for d in docs[:3]:\n",
"    print(f\"- {d.page_content=}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lazy Load\n",
"\n",
"Documents can also be loaded in a lazy fashion:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"doc_iter = loader.lazy_load()\n",
"for doc in doc_iter:\n",
"    pass  # you can operate on `doc` here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## End-to-end Example\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# https://github.com/huggingface/transformers/issues/5486:\n",
"os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"> - The following example pipeline uses Hugging Face's Inference API; for increased LLM quota, a token can be provided via the `HF_TOKEN` environment variable.\n",
"> - Dependencies for this pipeline can be installed as shown below (`--no-warn-conflicts` is meant for Colab's pre-populated Python environment; feel free to remove it for stricter usage):"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install -q --progress-bar off --no-warn-conflicts langchain-core langchain-huggingface langchain_milvus langchain python-dotenv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Defining the pipeline parameters:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"from tempfile import mkdtemp\n",
"\n",
"from dotenv import load_dotenv\n",
"from langchain_core.prompts import PromptTemplate\n",
"from langchain_docling.loader import ExportType\n",
"\n",
"\n",
"def _get_env_from_colab_or_os(key):\n",
" try:\n",
" from google.colab import userdata\n",
"\n",
" try:\n",
" return userdata.get(key)\n",
" except userdata.SecretNotFoundError:\n",
" pass\n",
" except ImportError:\n",
" pass\n",
" return os.getenv(key)\n",
"\n",
"\n",
"load_dotenv()\n",
"\n",
"HF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\n",
"FILE_PATH = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\n",
"EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n",
"GEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\n",
"EXPORT_TYPE = ExportType.DOC_CHUNKS\n",
"QUESTION = \"Which are the main AI models in Docling?\"\n",
"PROMPT = PromptTemplate.from_template(\n",
" \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\",\n",
")\n",
"TOP_K = 3\n",
"MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can instantiate our loader and load documents:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors\n"
]
}
],
"source": [
"from docling.chunking import HybridChunker\n",
"from langchain_docling import DoclingLoader\n",
"\n",
"loader = DoclingLoader(\n",
"    file_path=FILE_PATH,\n",
"    export_type=EXPORT_TYPE,\n",
"    chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),\n",
")\n",
"\n",
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Determining the splits:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"if EXPORT_TYPE == ExportType.DOC_CHUNKS:\n",
"    splits = docs\n",
"elif EXPORT_TYPE == ExportType.MARKDOWN:\n",
"    from langchain_text_splitters import MarkdownHeaderTextSplitter\n",
"\n",
"    splitter = MarkdownHeaderTextSplitter(\n",
"        headers_to_split_on=[\n",
"            (\"#\", \"Header_1\"),\n",
"            (\"##\", \"Header_2\"),\n",
"            (\"###\", \"Header_3\"),\n",
"        ],\n",
"    )\n",
"    splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]\n",
"else:\n",
"    raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Inspecting some sample splits:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"- d.page_content='arXiv:2408.09869v5 [cs.CL] 9 Dec 2024'\n",
"- d.page_content='Docling Technical Report\\nVersion 1.0\\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\\nAI4K Group, IBM Research R¨uschlikon, Switzerland'\n",
"- d.page_content='Abstract\\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'\n",
"...\n"
]
}
],
"source": [
"for d in splits[:3]:\n",
"    print(f\"- {d.page_content=}\")\n",
"print(\"...\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ingestion"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"from langchain_huggingface.embeddings import HuggingFaceEmbeddings\n",
"from langchain_milvus import Milvus\n",
"\n",
"embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)\n",
"\n",
"# MILVUS_URI is defined with the pipeline parameters; set it as needed\n",
"vectorstore = Milvus.from_documents(\n",
"    documents=splits,\n",
"    embedding=embedding,\n",
"    collection_name=\"docling_demo\",\n",
"    connection_args={\"uri\": MILVUS_URI},\n",
"    index_params={\"index_type\": \"FLAT\"},\n",
"    drop_old=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### RAG"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import create_retrieval_chain\n",
"from langchain.chains.combine_documents import create_stuff_documents_chain\n",
"from langchain_huggingface import HuggingFaceEndpoint\n",
"\n",
"retriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K})\n",
"llm = HuggingFaceEndpoint(\n",
"    repo_id=GEN_MODEL_ID,\n",
"    huggingfacehub_api_token=HF_TOKEN,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def clip_text(text, threshold=100):\n",
"    return f\"{text[:threshold]}...\" if len(text) > threshold else text"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Question:\n",
"Which are the main AI models in Docling?\n",
"\n",
"Answer:\n",
"The main AI models in Docling are a layout analysis model, which is an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model.\n",
"\n",
"Source 1:\n",
" text: \"3.2 AI models\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure re...\"\n",
" dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/50', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 108.0, 't': 405.1419982910156, 'r': 504.00299072265625, 'b': 330.7799987792969, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n",
" source: https://arxiv.org/pdf/2408.09869\n",
"\n",
"Source 2:\n",
" text: \"3 Processing pipeline\\nDocling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support ...\"\n",
" dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/26', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 2, 'bbox': {'l': 108.0, 't': 273.01800537109375, 'r': 504.00299072265625, 'b': 176.83799743652344, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 796]}]}], 'headings': ['3 Processing pipeline'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n",
" source: https://arxiv.org/pdf/2408.09869\n",
"\n",
"Source 3:\n",
" text: \"6 Future work and contributions\\nDocling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of ...\"\n",
" dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/76', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 322.468994140625, 'r': 504.00299072265625, 'b': 259.0169982910156, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}, {'self_ref': '#/texts/77', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 251.6540069580078, 'r': 504.00299072265625, 'b': 198.99200439453125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 402]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n",
" source: https://arxiv.org/pdf/2408.09869\n"
]
}
],
"source": [
"question_answer_chain = create_stuff_documents_chain(llm, PROMPT)\n",
"rag_chain = create_retrieval_chain(retriever, question_answer_chain)\n",
"resp_dict = rag_chain.invoke({\"input\": QUESTION})\n",
"\n",
"clipped_answer = clip_text(resp_dict[\"answer\"], threshold=350)\n",
"print(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\")\n",
"for i, doc in enumerate(resp_dict[\"context\"]):\n",
"    print()\n",
"    print(f\"Source {i+1}:\")\n",
"    print(f\" text: {json.dumps(clip_text(doc.page_content, threshold=350))}\")\n",
"    for key in doc.metadata:\n",
"        if key != \"pk\":\n",
"            val = doc.metadata.get(key)\n",
"            clipped_val = clip_text(val) if isinstance(val, str) else val\n",
"            print(f\" {key}: {clipped_val}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the sources contain rich grounding information, including the passage\n",
"headings (i.e. section), page, and precise bounding box."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"- [LangChain Docling integration GitHub](https://github.com/DS4SD/docling-langchain)\n",
"- [Docling GitHub](https://github.com/DS4SD/docling)\n",
"- [Docling docs](https://ds4sd.github.io/docling/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -0,0 +1,42 @@
# Docling
> [Docling](https://github.com/DS4SD/docling) parses PDF, DOCX, PPTX, HTML, and other formats into a rich unified representation including document layout, tables etc., making them ready for generative AI workflows like RAG.
>
> This integration provides Docling's capabilities via the `DoclingLoader` document loader.
## Installation and Setup
Install `langchain-docling` with your package manager, for example pip:
```shell
pip install langchain-docling
```
## Document Loader
The `DoclingLoader` class in `langchain-docling` integrates Docling into
LangChain, enabling you to:
- use various document types in your LLM applications with ease and speed, and
- leverage Docling's rich representation for advanced, document-native grounding.
Basic usage looks as follows:
```python
from langchain_docling import DoclingLoader
FILE_PATH = ["https://arxiv.org/pdf/2408.09869"] # Docling Technical Report
loader = DoclingLoader(file_path=FILE_PATH)
docs = loader.load()
```
For end-to-end usage, see
[this example](/docs/integrations/document_loaders/docling).
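The grounding information mentioned above travels with each chunk as metadata. As a minimal sketch, assuming the `dl_meta` entry of a chunk's metadata has the shape printed in the end-to-end example (a dict with `doc_items`, `headings`, and `origin`), the headings, page numbers, and bounding boxes can be collected with plain dictionary access. The helper name `summarize_grounding` and the abbreviated, rounded sample values are hypothetical and only serve the illustration:
```python
from typing import Any


def summarize_grounding(dl_meta: dict[str, Any]) -> dict[str, Any]:
    """Collect headings, pages, and bounding boxes from a dl_meta-shaped dict.

    Hypothetical helper: it only assumes the key layout shown in the
    end-to-end example output and is not part of langchain-docling.
    """
    pages = []
    bboxes = []
    for item in dl_meta.get("doc_items", []):
        for prov in item.get("prov", []):
            page_no = prov.get("page_no")
            # Record each page once, in order of first appearance
            if page_no is not None and page_no not in pages:
                pages.append(page_no)
            bbox = prov.get("bbox")
            if bbox is not None:
                bboxes.append((page_no, bbox["l"], bbox["t"], bbox["r"], bbox["b"]))
    return {
        "headings": dl_meta.get("headings", []),
        "filename": dl_meta.get("origin", {}).get("filename"),
        "pages": pages,
        "bboxes": bboxes,
    }


# Abbreviated sample modeled on the metadata printed in the end-to-end example
sample_dl_meta = {
    "schema_name": "docling_core.transforms.chunker.DocMeta",
    "version": "1.0.0",
    "doc_items": [
        {
            "self_ref": "#/texts/50",
            "label": "text",
            "prov": [
                {
                    "page_no": 3,
                    "bbox": {
                        "l": 108.0,
                        "t": 405.14,
                        "r": 504.0,
                        "b": 330.78,
                        "coord_origin": "BOTTOMLEFT",
                    },
                    "charspan": [0, 608],
                }
            ],
        }
    ],
    "headings": ["3.2 AI models"],
    "origin": {"mimetype": "application/pdf", "filename": "2408.09869v5.pdf"},
}

grounding = summarize_grounding(sample_dl_meta)
print(grounding["headings"], grounding["pages"])
```
With documents loaded in the default doc-chunk mode, the same helper would presumably be applied as `summarize_grounding(doc.metadata["dl_meta"])`; the exact metadata layout may differ between versions, so treat the key names as assumptions to verify against your own output.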
## Additional Resources
- [LangChain Docling integration GitHub](https://github.com/DS4SD/docling-langchain)
- [LangChain Docling integration PyPI package](https://pypi.org/project/langchain-docling/)
- [Docling GitHub](https://github.com/DS4SD/docling)
- [Docling docs](https://ds4sd.github.io/docling/)

@ -808,6 +808,13 @@ const FEATURE_TABLES = {
source: "API service that can be deployed locally, hosted version has free credits.",
api: "API",
apiLink: "https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.firecrawl.FireCrawlLoader.html"
},
{
name: "Docling",
link: "docling",
source: "Uses Docling to load and parse web pages",
api: "Package",
apiLink: "https://python.langchain.com/docs/integrations/document_loaders/docling/"
}
]
},
@ -890,6 +897,13 @@ const FEATURE_TABLES = {
source: "Load PDF files using UpstageDocumentParseLoader",
api: "Package",
apiLink: "https://python.langchain.com/api_reference/upstage/document_parse/langchain_upstage.document_parse.UpstageDocumentParseLoader.html"
},
{
name: "Docling",
link: "docling",
source: "Load PDF files using Docling",
api: "Package",
apiLink: "https://python.langchain.com/docs/integrations/document_loaders/docling/"
}
]
},
@ -932,6 +946,12 @@ const FEATURE_TABLES = {
source: "HTML files",
apiLink: "https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.html_bs.BSHTMLLoader.html"
},
{
name: "DoclingLoader",
link: "../../integrations/document_loaders/docling",
source: "Various file types (see https://ds4sd.github.io/docling/)",
apiLink: "https://python.langchain.com/docs/integrations/document_loaders/docling/"
},
]
},
vectorstores: {

@ -329,3 +329,7 @@ packages:
path: .
repo: kuzudb/langchain-kuzu
downloads: 0
- name: langchain-docling
path: .
repo: DS4SD/docling-langchain
downloads: 0