doc-loader: retain Azure Doc Intelligence API metadata in Document parser (#28382)

**Description**: This PR modifies the doc_intelligence.py parser in the community package to include all metadata returned by the Azure Doc Intelligence API in the Document object. Previously, only the parsed content (markdown) was retained, while other important metadata such as bounding boxes (bboxes) for images and tables was discarded. These image bboxes are crucial for supporting use cases like multi-modal RAG workflows when using Azure Doc Intelligence. The change ensures that all information returned by the Azure Doc Intelligence API is preserved by setting the metadata attribute of the Document object to the entire result returned by the API, rather than an empty dictionary. This extends the parser's utility for complex use cases without breaking existing functionality.

**Issue**: This change does not address a specific issue number, but it resolves a critical limitation in supporting multimodal workflows when using the LangChain wrapper for the Azure API.

**Dependencies**: No additional dependencies are required for this change.

---------

Co-authored-by: jmohren <johannes.mohren@aol.de>
2025-06-26 16:43:35 +00:00 · 2024-12-10 17:22:58 +01:00 · 2024-12-10 17:22:58 +01:00 · c1d348e95d
commit c1d348e95d
parent 0d20c314dd
1 changed file with 1 addition and 1 deletion
--- a/libs/community/langchain_community/document_loaders/parsers/doc_intelligence.py
+++ b/libs/community/langchain_community/document_loaders/parsers/doc_intelligence.py
@@ -71,7 +71,7 @@ class AzureAIDocumentIntelligenceParser(BaseBlobParser):
            yield d

    def _generate_docs_single(self, result: Any) -> Iterator[Document]:
-        yield Document(page_content=result.content, metadata={})
+        yield Document(page_content=result.content, metadata=result.as_dict())

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Lazily parse the blob."""
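
The effect of the one-line change can be sketched without an Azure account. The example below is a minimal illustration, not the parser itself: `Document` is a stand-in dataclass for the LangChain class, `FakeAnalyzeResult` is a hypothetical stub exposing only `content` and `as_dict()`, and the `figures` / `boundingRegions` / `polygon` keys are assumed to match the shape of the dictionary the Azure SDK returns. `figure_polygons` is a hypothetical helper showing how a downstream multimodal RAG step might read image bounding boxes from the retained metadata.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Document:
    """Stand-in for the LangChain Document class (illustration only)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class FakeAnalyzeResult:
    """Hypothetical stub with the two members the patched line relies on."""

    def __init__(self, content: str, figures: List[Dict[str, Any]]) -> None:
        self.content = content
        self._figures = figures

    def as_dict(self) -> Dict[str, Any]:
        # Assumed key names; the real API response may contain many more fields.
        return {"content": self.content, "figures": self._figures}


def generate_docs_single(result: Any) -> Iterator[Document]:
    # Mirrors the patched line: metadata now carries the whole result dict.
    yield Document(page_content=result.content, metadata=result.as_dict())


def figure_polygons(doc: Document) -> List[List[float]]:
    """Collect the bounding polygon of every figure found in the metadata."""
    polygons: List[List[float]] = []
    for figure in doc.metadata.get("figures", []):
        for region in figure.get("boundingRegions", []):
            polygons.append(region["polygon"])
    return polygons


result = FakeAnalyzeResult(
    content="# Report\n\nSee figure 1.",
    figures=[
        {
            "boundingRegions": [
                {"pageNumber": 1, "polygon": [1.0, 1.0, 3.0, 1.0, 3.0, 2.0, 1.0, 2.0]}
            ]
        }
    ],
)
doc = next(generate_docs_single(result))
print(figure_polygons(doc))
```

With the previous `metadata={}` behavior, `figure_polygons` would always return an empty list, because the figure data never reached the `Document`.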