community[minor]: Azure DocumentIntelligenceLoader/Parser support update with latest SDK (#14389)

- **Description:**
Add DocumentIntelligenceLoader & DocumentIntelligenceParser
implementation using the latest Azure Document Intelligence SDK with
markdown support.
The core logic resides in DocumentIntelligenceParser and
DocumentIntelligenceLoader is a mere wrapper of the parser.
The parser will takes api_endpoint and api_key and creates
DocumentIntelligenceClient for the user. 4 parsing modes are supported:
1. Markdown (default)
2. Single
3. Page 
4. Object

UT and notebook are also updated accordingly.

- **Dependencies:** Azure Document Intelligence SDK:
azure-ai-documentintelligence
[azure-sdk-for-python/sdk/documentintelligence/azure-ai-documentintelligence
at 7c42462ac662522a6fd21b17d2a20f4cd40d0356 · Azure/azure-sdk-for-python
(github.com)](https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAzure%2Fazure-sdk-for-python%2Ftree%2F7c42462ac662522a6fd21b17d2a20f4cd40d0356%2Fsdk%2Fdocumentintelligence%2Fazure-ai-documentintelligence&data=05%7C01%7CZifei.Qian%40microsoft.com%7C298225aa3e31468a863108dbf07374ff%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638368150928704292%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=oE0Sl4HERnMKdbkV9KgBV46Z2xytcQAShdTWf7ZNl%2Bs%3D&reserved=0).

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
This commit is contained in:
QIAN Zifei
2023-12-22 08:40:27 +08:00
committed by GitHub
parent 129a929d69
commit 2460f977c5
11 changed files with 367 additions and 45 deletions

View File

@@ -542,9 +542,17 @@ class AmazonTextractPDFParser(BaseBlobParser):
class DocumentIntelligenceParser(BaseBlobParser):
"""Loads a PDF with Azure Document Intelligence
(formerly Forms Recognizer) and chunks at character level."""
(formerly Form Recognizer) and chunks at character level."""
def __init__(self, client: Any, model: str):
warnings.warn(
"langchain.document_loaders.parsers.pdf.DocumentIntelligenceParser"
"and langchain.document_loaders.pdf.DocumentIntelligenceLoader"
" are deprecated. Please upgrade to "
"langchain.document_loaders.DocumentIntelligenceLoader "
"for any file parsing purpose using Azure Document Intelligence "
"service."
)
self.client = client
self.model = model