doc-loader: retain Azure Doc Intelligence API metadata in Document parser (#28382)

**Description**:
This PR modifies the doc_intelligence.py parser in the community package
to include all metadata returned by the Azure Doc Intelligence API in
the Document object. Previously, only the parsed content (markdown) was
retained, while other important metadata such as bounding boxes (bboxes)
for images and tables was discarded. These image bboxes are crucial for
supporting use cases like multi-modal RAG workflows when using Azure Doc
Intelligence.

The change ensures that all information returned by the Azure Doc
Intelligence API is preserved by setting the metadata attribute of the
Document object to the entire result returned by the API, rather than an
empty dictionary. This extends the parser's utility for complex use
cases without breaking existing functionality.

**Issue**:
This change does not address a specific issue number, but it resolves a
critical limitation in supporting multimodal workflows when using the
LangChain wrapper for the Azure API.

**Dependencies**:
No additional dependencies are required for this change.

---------

Co-authored-by: jmohren <johannes.mohren@aol.de>
This commit is contained in:
Johannes Mohren 2024-12-10 17:22:58 +01:00 committed by GitHub
parent 0d20c314dd
commit c1d348e95d
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

View File

@ -71,7 +71,7 @@ class AzureAIDocumentIntelligenceParser(BaseBlobParser):
yield d
def _generate_docs_single(self, result: Any) -> Iterator[Document]:
yield Document(page_content=result.content, metadata={})
yield Document(page_content=result.content, metadata=result.as_dict())
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
"""Lazily parse the blob."""