mirror of
https://github.com/hwchase17/langchain.git
synced 2025-07-04 04:07:54 +00:00
community: Fix the problem of error reporting when OCR extracts text from PDF. (#29378)
- **Description:** Fixes `_extract_images_from_page()`, which could not recognize images when `xObject[obj]["/Filter"]` held a list of strings instead of a single string (the value can be either). This also resolves the `IndexError: list index out of range` raised during Faiss vectorization when image extraction fails on a PDF that contains only images.
- **Issue:** Fixes [#15227](https://github.com/langchain-ai/langchain/issues/15227), [#22892](https://github.com/langchain-ai/langchain/issues/22892), [#26652](https://github.com/langchain-ai/langchain/issues/26652), and [#27153](https://github.com/langchain-ai/langchain/issues/27153). Related issue: [#7067](https://github.com/langchain-ai/langchain/issues/7067).
- **Dependencies:** None
- **Twitter handle:** None

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
parent
a13faab6b7
commit
a1e62070d0
@@ -311,6 +311,12 @@ class PyPDFParser(BaseBlobParser):
                     )
                 elif xObject[obj]["/Filter"][1:] in _PDF_FILTER_WITH_LOSS:
                     images.append(xObject[obj].get_data())
+                elif (
+                    isinstance(xObject[obj]["/Filter"], list)
+                    and xObject[obj]["/Filter"]
+                    and xObject[obj]["/Filter"][0][1:] in _PDF_FILTER_WITH_LOSS
+                ):
+                    images.append(xObject[obj].get_data())
                 else:
                     warnings.warn("Unknown PDF Filter!")
         return extract_from_images_with_rapidocr(images)
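The check added in this commit can be illustrated in isolation. The sketch below is not part of the commit: `first_filter_name`, `is_lossy`, and `LOSSY_FILTERS` are hypothetical names, and the filter set is an assumed, illustrative subset, since the contents of `_PDF_FILTER_WITH_LOSS` are not shown in this diff. It only demonstrates how a `/Filter` value that may be a string or a list of strings could be normalized before the membership test.

```python
from typing import List, Optional, Union

# Assumption: an illustrative subset of lossy filter names. The real contents
# of _PDF_FILTER_WITH_LOSS are not visible in this diff.
LOSSY_FILTERS = frozenset({"DCTDecode", "JPXDecode"})


def first_filter_name(
    filter_value: Union[str, List[str], None],
) -> Optional[str]:
    """Return the first filter name without its leading "/", or None.

    Hypothetical helper: accepts either a single name such as "/DCTDecode"
    or a list of names such as ["/DCTDecode"].
    """
    if isinstance(filter_value, list):
        if not filter_value:  # empty list: nothing to inspect
            return None
        filter_value = filter_value[0]
    if not isinstance(filter_value, str):
        return None
    # Strip the leading "/" the same way the diff does with [1:].
    return filter_value[1:] if filter_value.startswith("/") else filter_value


def is_lossy(filter_value: Union[str, List[str], None]) -> bool:
    """True when the (first) filter name is in the assumed lossy set."""
    name = first_filter_name(filter_value)
    return name is not None and name in LOSSY_FILTERS


if __name__ == "__main__":
    for value in ("/DCTDecode", ["/DCTDecode"], [], "/FlateDecode"):
        print(repr(value), is_lossy(value))
```

Like the committed `elif` branch, this sketch inspects only the first entry of a list-valued `/Filter`, and it treats an empty list as unrecognized rather than indexing into it.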