community: Fix the problem of error reporting when OCR extracts text from PDF. (#29378)

- **Description:** Fixes image extraction in the ```_extract_images_from_page()``` method when ```xObject[obj]["/Filter"]``` is a list of strings rather than a single string (the value can be either). Previously such images were not recognized, so extracting images from a PDF containing only images failed and vectorization by Faiss then raised ```IndexError: list index out of range```.

  ![69a60f3f6bd474641b9126d74bb18f7e](https://github.com/user-attachments/assets/dc9e098d-2862-49f7-93b0-00f1056727dc)
- **Issue:** Fixes the following issues: [#15227](https://github.com/langchain-ai/langchain/issues/15227), [#22892](https://github.com/langchain-ai/langchain/issues/22892), [#26652](https://github.com/langchain-ai/langchain/issues/26652), [#27153](https://github.com/langchain-ai/langchain/issues/27153). Related issue: [#7067](https://github.com/langchain-ai/langchain/issues/7067)
- **Dependencies:** None
- **Twitter handle:** None

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2025-07-04 04:07:54 +00:00 · 2025-01-23 23:01:52 +08:00 · 2025-01-23 23:01:52 +08:00 · a1e62070d0
commit a1e62070d0
parent a13faab6b7
1 changed files with 6 additions and 0 deletions
--- a/libs/community/langchain_community/document_loaders/parsers/pdf.py
+++ b/libs/community/langchain_community/document_loaders/parsers/pdf.py
@@ -311,6 +311,12 @@ class PyPDFParser(BaseBlobParser):
                     )
                 elif xObject[obj]["/Filter"][1:] in _PDF_FILTER_WITH_LOSS:
                     images.append(xObject[obj].get_data())
+                elif (
+                    isinstance(xObject[obj]["/Filter"], list)
+                    and xObject[obj]["/Filter"]
+                    and xObject[obj]["/Filter"][0][1:] in _PDF_FILTER_WITH_LOSS
+                ):
+                    images.append(xObject[obj].get_data())
                 else:
                     warnings.warn("Unknown PDF Filter!")
         return extract_from_images_with_rapidocr(images)
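
The new branch handles a `/Filter` value that is a non-empty list by inspecting its first entry. As a rough, standalone illustration of that idea (not the code in this commit), the sketch below normalizes a filter value that may be a string or a list of strings before checking it. The names `first_filter_name`, `is_lossy_filter`, and `LOSSY_FILTERS` are hypothetical, and the contents of `LOSSY_FILTERS` are an assumption standing in for `_PDF_FILTER_WITH_LOSS`.

```python
from typing import List, Optional, Union

# Assumption: a stand-in for _PDF_FILTER_WITH_LOSS; the real list lives in pdf.py.
LOSSY_FILTERS = ["DCTDecode", "DCT", "JPXDecode"]


def first_filter_name(filter_value: Union[str, List[str], None]) -> Optional[str]:
    """Return the first filter name without its leading slash, or None.

    Accepts either a single name such as "/DCTDecode" or a list of names
    such as ["/DCTDecode"]; an empty list or any other type yields None.
    """
    if isinstance(filter_value, str):
        names = [filter_value]
    elif isinstance(filter_value, (list, tuple)):
        names = [str(name) for name in filter_value]
    else:
        return None
    if not names:
        return None
    # Drop the leading "/" of a PDF name object, mirroring the [1:] slice in the diff.
    first = names[0]
    return first[1:] if first.startswith("/") else first


def is_lossy_filter(filter_value: Union[str, List[str], None]) -> bool:
    """Report whether the first filter is one of the (assumed) lossy filters."""
    return first_filter_name(filter_value) in LOSSY_FILTERS
```

With a helper along these lines, the string and list cases could share one check, and an empty list would fall through to the "Unknown PDF Filter!" warning instead of raising an error.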