community[minor]: Refactoring PyMuPDF parser, loader and add image blob parsers (#29063)

mirror of https://github.com/hwchase17/langchain.git synced 2025-09-04 20:46:45 +00:00

* Adds BlobParsers for images. These implementations take an image
and produce one or more documents per image. This interface can be used
to expose OCR capabilities.
* Updates PyMuPDFParser and Loader to standardize metadata, handle
images, improve table extraction, etc.
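
The image-parser idea above can be sketched as follows. This is a minimal, self-contained illustration and not the code added by this commit: `Blob`, `Document`, `ImageBlobParser`, `analyze_image` and `FakeOcrParser` are hypothetical stand-ins defined here so the example runs without LangChain installed, and the "OCR" step is a placeholder.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Blob:
    """Hypothetical stand-in: raw bytes plus where they came from."""
    data: bytes
    source: str = ""


@dataclass
class Document:
    """Hypothetical stand-in: extracted text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class ImageBlobParser(ABC):
    """Hypothetical base class: turns one image blob into documents."""

    @abstractmethod
    def analyze_image(self, data: bytes) -> str:
        """Return the text extracted from the image bytes (e.g. by OCR)."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # One document per image here; a subclass could yield several.
        text = self.analyze_image(blob.data)
        yield Document(page_content=text, metadata={"source": blob.source})


class FakeOcrParser(ImageBlobParser):
    """Placeholder 'OCR' that only reports the payload size."""

    def analyze_image(self, data: bytes) -> str:
        return f"<image: {len(data)} bytes>"


docs = list(FakeOcrParser().lazy_parse(Blob(b"\x89PNG", source="scan.png")))
```

Keeping the image analysis in a single overridable method means a real OCR backend could be swapped in without touching the blob-to-document plumbing.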

- **Twitter handle:** pprados

This is one part of a larger pull request (PR) that is too large to be
submitted all at once.
This part prepares for the update of all parsers.

For more details, see [PR
28970](https://github.com/langchain-ai/langchain/pull/28970).

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>

Author: Philippe PRADOS, 2025-01-20 21:15:43 +01:00 (committed by GitHub)
Parent: f175319303
Commit: 4efc5093c1

16 changed files with 2389 additions and 190 deletions

1188 docs/docs/integrations/document_loaders/pymupdf.ipynb

File diff suppressed because it is too large
