community: Fix the problem of error reporting when OCR extracts text from PDF. (#29378)

- **Description:** Fixes image recognition in the
```_extract_images_from_page()``` method when
```xObject[obj]["/Filter"]``` is a list of strings rather than a single
string (the value can be either). This also resolves the bug where
vectorization with Faiss fails with ```IndexError: list index out of range```
because image extraction failed on a PDF containing only images.

![69a60f3f6bd474641b9126d74bb18f7e](https://github.com/user-attachments/assets/dc9e098d-2862-49f7-93b0-00f1056727dc)

- **Issue:**
    Fixes the following issues:
    [#15227](https://github.com/langchain-ai/langchain/issues/15227)
    [#22892](https://github.com/langchain-ai/langchain/issues/22892)
    [#26652](https://github.com/langchain-ai/langchain/issues/26652)
    [#27153](https://github.com/langchain-ai/langchain/issues/27153)
    Related issues:
    [#7067](https://github.com/langchain-ai/langchain/issues/7067)

- **Dependencies:** None
- **Twitter handle:** None

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
江同学呀 2025-01-23 23:01:52 +08:00 committed by GitHub
parent a13faab6b7
commit a1e62070d0

@@ -311,6 +311,12 @@ class PyPDFParser(BaseBlobParser):
                     )
                 elif xObject[obj]["/Filter"][1:] in _PDF_FILTER_WITH_LOSS:
                     images.append(xObject[obj].get_data())
+                elif (
+                    isinstance(xObject[obj]["/Filter"], list)
+                    and xObject[obj]["/Filter"]
+                    and xObject[obj]["/Filter"][0][1:] in _PDF_FILTER_WITH_LOSS
+                ):
+                    images.append(xObject[obj].get_data())
                 else:
                     warnings.warn("Unknown PDF Filter!")
         return extract_from_images_with_rapidocr(images)
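
For readers who want to see the string-or-list handling in isolation, here is a minimal sketch that is not part of this commit. It normalizes a `/Filter` value to a list of names and classifies the image by its first entry, mirroring the check the new branch performs. The helper names (`filter_names`, `classify_filter`) and the two constant lists are hypothetical stand-ins chosen for illustration; the real `_PDF_FILTER_WITH_LOSS` and `_PDF_FILTER_WITHOUT_LOSS` values in the parser module may differ. It assumes the filter entries behave like plain strings with a leading `/`.

```python
from typing import List, Sequence, Union

# Hypothetical stand-ins for the parser's filter constants (assumed values).
LOSSY_FILTERS = ["DCTDecode", "JPXDecode"]
LOSSLESS_FILTERS = ["FlateDecode", "LZWDecode"]


def filter_names(raw_filter: Union[str, Sequence[str], None]) -> List[str]:
    """Return filter names without the leading '/', for a str or a list of str."""
    if raw_filter is None:
        return []
    # A single name is wrapped in a list so both shapes share one code path.
    entries = [raw_filter] if isinstance(raw_filter, str) else list(raw_filter)
    return [e[1:] if e.startswith("/") else e for e in entries]


def classify_filter(raw_filter: Union[str, Sequence[str], None]) -> str:
    """Classify by the first filter name: 'lossy', 'lossless', or 'unknown'."""
    names = filter_names(raw_filter)
    if not names:
        # An empty list or missing filter cannot be classified.
        return "unknown"
    if names[0] in LOSSY_FILTERS:
        return "lossy"
    if names[0] in LOSSLESS_FILTERS:
        return "lossless"
    return "unknown"


if __name__ == "__main__":
    print(classify_filter("/DCTDecode"))
    print(classify_filter(["/DCTDecode"]))
    print(classify_filter([]))
```

Normalizing first keeps a single membership test for both shapes. Unlike the diff above, this sketch also classifies a list whose first entry is a lossless filter, which the committed branch does not attempt.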