community[minor]: Refactoring PyMuPDF parser, loader and add image blob parsers (#29063)

* Adds BlobParsers for images. These implementations can take an image
and produce one or more documents per image. This interface can be used
for exposing OCR capabilities.
* Updates PyMuPDFParser and PyMuPDFLoader to standardize metadata, handle
images, improve table extraction, etc.
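
As a rough illustration of the image blob parser idea described above (one image in, one or more documents out), here is a minimal standard-library sketch. It does not reproduce the code in this commit: `Blob`, `Document`, `ImageBlobParser`, `FakeOCRParser`, and the `analyze_image` hook are hypothetical stand-ins, and the "OCR" step simply decodes the bytes as text so that the example runs without any OCR engine.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Blob:
    """Hypothetical stand-in for a raw binary payload such as image bytes."""

    data: bytes
    source: str = ""


@dataclass
class Document:
    """Hypothetical stand-in for a parsed document."""

    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


class ImageBlobParser(ABC):
    """Turns one image blob into one or more documents."""

    @abstractmethod
    def analyze_image(self, data: bytes) -> List[str]:
        """Return one text fragment per region found in the image."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # Emit one document per fragment, tagged with its origin and position.
        for index, text in enumerate(self.analyze_image(blob.data)):
            yield Document(
                page_content=text,
                metadata={"source": blob.source, "region": index},
            )


class FakeOCRParser(ImageBlobParser):
    """Placeholder OCR: decodes the bytes as UTF-8 and splits on blank lines."""

    def analyze_image(self, data: bytes) -> List[str]:
        text = data.decode("utf-8")
        return [part.strip() for part in text.split("\n\n") if part.strip()]


# The "image" here is plain text bytes, purely for demonstration.
blob = Blob(data=b"Invoice 42\n\nTotal: 10 EUR", source="scan.png")
docs = list(FakeOCRParser().lazy_parse(blob))
```

A real OCR-backed parser would override only the image analysis step, so the document-building logic stays shared across OCR engines.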

- **Twitter handle:** pprados

This is one part of a larger Pull Request (PR) that is too large to be
submitted all at once.
This specific part focuses on preparing the update of all parsers.

For more details, see [PR
28970](https://github.com/langchain-ai/langchain/pull/28970).

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Philippe PRADOS
2025-01-20 21:15:43 +01:00
committed by GitHub
parent f175319303
commit 4efc5093c1
16 changed files with 2389 additions and 190 deletions

File diff suppressed because it is too large.