mirror of
https://github.com/hwchase17/langchain.git
synced 2025-07-07 21:50:25 +00:00
Add feature for extracting images from pdf and recognizing text from images. (#10653)
**Description**

This addresses #10423: extracting images from PDFs and recognizing the text in them is a useful feature. I have implemented it for `PyPDFLoader`, `PyPDFium2Loader`, `PyPDFDirectoryLoader`, `PyMuPDFLoader`, `PDFMinerLoader`, and `PDFPlumberLoader`.

[RapidOCR](https://github.com/RapidAI/RapidOCR.git) is used to recognize text in the extracted images. Because OCR is time-consuming, a boolean parameter `extract_images` controls whether images are extracted and recognized.

I measured the time taken by each parser with the unit tests on my own laptop (ThinkBook 14+ with an AMD R7-6800H):

| extract_images | PyPDFParser | PDFMinerParser | PyMuPDFParser | PyPDFium2Parser | PDFPlumberParser |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| False | 0.27s | 0.39s | 0.06s | 0.08s | 1.01s |
| True | 17.01s | 20.67s | 20.32s | 19.75s | 20.55s |

**Issue** #10423

**Dependencies** `rapidocr_onnxruntime` from [RapidOCR](https://github.com/RapidAI/RapidOCR/tree/main)

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
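The `extract_images` switch described above can be illustrated with a small, self-contained sketch. It is not the LangChain implementation: `SketchParser`, `parse_page`, and `fake_ocr` are hypothetical names, and the OCR engine is replaced by a stub so the example runs without `rapidocr_onnxruntime`. It only shows why the flag defaults to `False`: the expensive OCR pass is skipped unless the caller asks for it.

```python
from typing import Callable, List

# Hypothetical stand-in for an OCR engine: maps one image (raw bytes) to text.
OcrFn = Callable[[bytes], str]


class SketchParser:
    """Illustrative parser, not the LangChain implementation."""

    def __init__(self, ocr: OcrFn, extract_images: bool = False) -> None:
        self.ocr = ocr
        self.extract_images = extract_images

    def parse_page(self, text: str, images: List[bytes]) -> str:
        # OCR is the expensive step, so skip it unless explicitly enabled.
        if not self.extract_images:
            return text
        ocr_text = "\n".join(self.ocr(img) for img in images)
        return text + ocr_text


def fake_ocr(image: bytes) -> str:
    # Stub: pretend the image bytes are the recognized text.
    return image.decode("utf-8")


page_text = "body text\n"
page_images = [b"caption one", b"caption two"]

fast = SketchParser(fake_ocr).parse_page(page_text, page_images)
slow = SketchParser(fake_ocr, extract_images=True).parse_page(page_text, page_images)
print(repr(fast))
print(repr(slow))
```

With the flag off, the page text is returned untouched. With it on, the recognized text is appended to the page text, which mirrors how the parsers in this commit build `page_content`.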
This commit is contained in:
parent
8e3fbc97ca
commit
35297ca0d3
@ -1,4 +1,4 @@
|
|||||||
# Using PyPDF
|
## Using PyPDF
|
||||||
|
|
||||||
Load PDF using `pypdf` into array of documents, where each document contains the page content and metadata with `page` number.
|
Load PDF using `pypdf` into array of documents, where each document contains the page content and metadata with `page` number.
|
||||||
|
|
||||||
@ -74,6 +74,30 @@ for doc in docs:
|
|||||||
|
|
||||||
</CodeOutputBlock>
|
</CodeOutputBlock>
|
||||||
|
|
||||||
### Extracting images

With the `rapidocr-onnxruntime` package installed, we can also extract text from images embedded in the PDF:

```bash
pip install rapidocr-onnxruntime
```

```python
loader = PyPDFLoader("https://arxiv.org/pdf/2103.15348.pdf", extract_images=True)
pages = loader.load()
pages[4].page_content
```

<CodeOutputBlock lang="python">

```
'LayoutParser : A Unified Toolkit for DL-Based DIA 5\nTable 1: Current layout detection models in the LayoutParser model zoo\nDataset Base Model1Large Model Notes\nPubLayNet [38] F / M M Layouts of modern scientific documents\nPRImA [3] M - Layouts of scanned modern magazines and scientific reports\nNewspaper [17] F - Layouts of scanned US newspapers from the 20th century\nTableBank [18] F F Table region on modern scientific and business document\nHJDataset [31] F / M - Layouts of history Japanese documents\n1For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy\nvs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101\nbackbones [ 13], respectively. One can train models of different architectures, like Faster R-CNN [ 28] (F) and Mask\nR-CNN [ 12] (M). For example, an F in the Large Model column indicates it has a Faster R-CNN model trained\nusing the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model\nzoo in coming months.\nlayout data structures , which are optimized for efficiency and versatility. 3) When\nnecessary, users can employ existing or customized OCR models via the unified\nAPI provided in the OCR module . 4)LayoutParser comes with a set of utility\nfunctions for the visualization and storage of the layout data. 5) LayoutParser\nis also highly customizable, via its integration with functions for layout data\nannotation and model training . We now provide detailed descriptions for each\ncomponent.\n3.1 Layout Detection Models\nInLayoutParser , a layout model takes a document image as an input and\ngenerates a list of rectangular boxes for the target content regions. Different\nfrom traditional methods, it relies on deep convolutional neural networks rather\nthan manually curated rules to identify content regions. 
It is formulated as an\nobject detection problem and state-of-the-art models like Faster R-CNN [ 28] and\nMask R-CNN [ 12] are used. This yields prediction results of high accuracy and\nmakes it possible to build a concise, generalized interface for layout detection.\nLayoutParser , built upon Detectron2 [ 35], provides a minimal API that can\nperform layout detection with only four lines of code in Python:\n1import layoutparser as lp\n2image = cv2. imread (" image_file ") # load images\n3model = lp. Detectron2LayoutModel (\n4 "lp :// PubLayNet / faster_rcnn_R_50_FPN_3x / config ")\n5layout = model . detect ( image )\nLayoutParser provides a wealth of pre-trained model weights using various\ndatasets covering different languages, time periods, and document types. Due to\ndomain shift [ 7], the prediction performance can notably drop when models are ap-\nplied to target samples that are significantly different from the training dataset. As\ndocument structures and layouts vary greatly in different domains, it is important\nto select models trained on a dataset similar to the test samples. A semantic syntax\nis used for initializing the model weights in LayoutParser , using both the dataset\nname and model name lp://<dataset-name>/<model-architecture-name> .'
```
</CodeOutputBlock>
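
To see how recognized text ends up in `page_content`, here is a minimal sketch of the joining step. It assumes each OCR entry carries the recognized string at index 1, as the parser code in this change reads it. The helper name `join_ocr_results` and the sample data are hypothetical, and no real OCR engine is called.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed entry layout: (bounding box, recognized text, confidence score).
OcrEntry = Tuple[Sequence[float], str, float]


def join_ocr_results(per_image: Sequence[Optional[List[OcrEntry]]]) -> str:
    """Concatenate recognized strings, one line per detected text region."""
    text = ""
    for result in per_image:
        # An image with no detected text yields None (or an empty list).
        if result:
            text += "\n".join(entry[1] for entry in result)
    return text


# Hand-written sample data standing in for real OCR output.
sample = [
    [([0, 0, 10, 10], "Figure 1", 0.98), ([0, 12, 10, 20], "Layout model", 0.91)],
    None,
]
page_text = "3.1 Layout Detection Models\n"
content = page_text + join_ocr_results(sample)
print(content)
```

The OCR text follows the page text directly, so each page still produces a single document.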
|
||||||
|
|
||||||
|
|
||||||
## Using MathPix
|
## Using MathPix
|
||||||
|
|
||||||
Inspired by Daniel Gross's [https://gist.github.com/danielgross/3ab4104e14faccc12b49200843adab21](https://gist.github.com/danielgross/3ab4104e14faccc12b49200843adab21)
|
Inspired by Daniel Gross's [https://gist.github.com/danielgross/3ab4104e14faccc12b49200843adab21](https://gist.github.com/danielgross/3ab4104e14faccc12b49200843adab21)
|
||||||
|
@ -1,22 +1,79 @@
|
|||||||
"""Module contains common parsers for PDFs."""
|
"""Module contains common parsers for PDFs."""
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
from typing import TYPE_CHECKING, Any, Iterator, Mapping, Optional, Sequence, Union
|
import warnings
|
||||||
|
from typing import (
|
||||||
|
TYPE_CHECKING,
|
||||||
|
Any,
|
||||||
|
Iterable,
|
||||||
|
Iterator,
|
||||||
|
Mapping,
|
||||||
|
Optional,
|
||||||
|
Sequence,
|
||||||
|
Union,
|
||||||
|
)
|
||||||
from urllib.parse import urlparse
|
from urllib.parse import urlparse
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
from langchain.document_loaders.base import BaseBlobParser
|
from langchain.document_loaders.base import BaseBlobParser
|
||||||
from langchain.document_loaders.blob_loaders import Blob
|
from langchain.document_loaders.blob_loaders import Blob
|
||||||
from langchain.schema import Document
|
from langchain.schema import Document
|
||||||
|
|
||||||
if TYPE_CHECKING:
|
if TYPE_CHECKING:
|
||||||
|
import fitz.fitz
|
||||||
|
import pdfminer.layout
|
||||||
import pdfplumber.page
|
import pdfplumber.page
|
||||||
|
import pypdf._page
|
||||||
|
import pypdfium2._helpers.page
|
||||||
|
|
||||||
|
|
||||||
_PDF_FILTER_WITH_LOSS = ["DCTDecode", "DCT", "JPXDecode"]
_PDF_FILTER_WITHOUT_LOSS = [
    "LZWDecode",
    "LZW",
    "FlateDecode",
    "Fl",
    "ASCII85Decode",
    "A85",
    "ASCIIHexDecode",
    "AHx",
    "RunLengthDecode",
    "RL",
    "CCITTFaxDecode",
    "CCF",
    "JBIG2Decode",
]


def extract_from_images_with_rapidocr(
    images: Sequence[Union[Iterable[np.ndarray], bytes]]
) -> str:
    try:
        from rapidocr_onnxruntime import RapidOCR
    except ImportError:
        raise ImportError(
            "`rapidocr-onnxruntime` package not found, please install it with "
            "`pip install rapidocr-onnxruntime`"
        )
    ocr = RapidOCR()
    text = ""
    for img in images:
        result, _ = ocr(img)
        if result:
            result = [text[1] for text in result]
            text += "\n".join(result)
    return text
|
|
||||||
|
|
||||||
class PyPDFParser(BaseBlobParser):
|
class PyPDFParser(BaseBlobParser):
|
||||||
"""Load `PDF` using `pypdf`"""
|
"""Load `PDF` using `pypdf`"""
|
||||||
|
|
||||||
def __init__(self, password: Optional[Union[str, bytes]] = None):
|
def __init__(
|
||||||
|
self, password: Optional[Union[str, bytes]] = None, extract_images: bool = False
|
||||||
|
):
|
||||||
self.password = password
|
self.password = password
|
||||||
|
self.extract_images = extract_images
|
||||||
|
|
||||||
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
|
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
|
||||||
"""Lazily parse the blob."""
|
"""Lazily parse the blob."""
|
||||||
@ -26,36 +83,123 @@ class PyPDFParser(BaseBlobParser):
|
|||||||
pdf_reader = pypdf.PdfReader(pdf_file_obj, password=self.password)
|
pdf_reader = pypdf.PdfReader(pdf_file_obj, password=self.password)
|
||||||
yield from [
|
yield from [
|
||||||
Document(
|
Document(
|
||||||
page_content=page.extract_text(),
|
page_content=page.extract_text()
|
||||||
|
+ self._extract_images_from_page(page),
|
||||||
metadata={"source": blob.source, "page": page_number},
|
metadata={"source": blob.source, "page": page_number},
|
||||||
)
|
)
|
||||||
for page_number, page in enumerate(pdf_reader.pages)
|
for page_number, page in enumerate(pdf_reader.pages)
|
||||||
]
|
]
|
||||||
|
|
||||||
    def _extract_images_from_page(self, page: pypdf._page.PageObject) -> str:
        """Extract images from page and get the text with RapidOCR."""
        if not self.extract_images or "/XObject" not in page["/Resources"].keys():
            return ""

        xObject = page["/Resources"]["/XObject"].get_object()
        images = []
        for obj in xObject:
            if xObject[obj]["/Subtype"] == "/Image":
                if xObject[obj]["/Filter"][1:] in _PDF_FILTER_WITHOUT_LOSS:
                    height, width = xObject[obj]["/Height"], xObject[obj]["/Width"]

                    images.append(
                        np.frombuffer(xObject[obj].get_data(), dtype=np.uint8).reshape(
                            height, width, -1
                        )
                    )
                elif xObject[obj]["/Filter"][1:] in _PDF_FILTER_WITH_LOSS:
                    images.append(xObject[obj].get_data())
                else:
                    warnings.warn("Unknown PDF Filter!")
        return extract_from_images_with_rapidocr(images)
|
|
||||||
|
|
||||||
class PDFMinerParser(BaseBlobParser):
|
class PDFMinerParser(BaseBlobParser):
|
||||||
"""Parse `PDF` using `PDFMiner`."""
|
"""Parse `PDF` using `PDFMiner`."""
|
||||||
|
|
||||||
|
def __init__(self, extract_images: bool = False):
|
||||||
|
self.extract_images = extract_images
|
||||||
|
|
||||||
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
|
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
|
||||||
"""Lazily parse the blob."""
|
"""Lazily parse the blob."""
|
||||||
|
if not self.extract_images:
|
||||||
from pdfminer.high_level import extract_text
|
from pdfminer.high_level import extract_text
|
||||||
|
|
||||||
with blob.as_bytes_io() as pdf_file_obj:
|
with blob.as_bytes_io() as pdf_file_obj:
|
||||||
text = extract_text(pdf_file_obj)
|
text = extract_text(pdf_file_obj)
|
||||||
metadata = {"source": blob.source}
|
metadata = {"source": blob.source}
|
||||||
yield Document(page_content=text, metadata=metadata)
|
yield Document(page_content=text, metadata=metadata)
|
||||||
|
else:
|
||||||
|
import io
|
||||||
|
|
||||||
|
from pdfminer.converter import PDFPageAggregator, TextConverter
|
||||||
|
from pdfminer.layout import LAParams
|
||||||
|
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
|
||||||
|
from pdfminer.pdfpage import PDFPage
|
||||||
|
|
||||||
|
text_io = io.StringIO()
|
||||||
|
with blob.as_bytes_io() as pdf_file_obj:
|
||||||
|
pages = PDFPage.get_pages(pdf_file_obj)
|
||||||
|
rsrcmgr = PDFResourceManager()
|
||||||
|
device_for_text = TextConverter(rsrcmgr, text_io, laparams=LAParams())
|
||||||
|
device_for_image = PDFPageAggregator(rsrcmgr, laparams=LAParams())
|
||||||
|
interpreter_for_text = PDFPageInterpreter(rsrcmgr, device_for_text)
|
||||||
|
interpreter_for_image = PDFPageInterpreter(rsrcmgr, device_for_image)
|
||||||
|
for i, page in enumerate(pages):
|
||||||
|
interpreter_for_text.process_page(page)
|
||||||
|
interpreter_for_image.process_page(page)
|
||||||
|
content = text_io.getvalue() + self._extract_images_from_page(
|
||||||
|
device_for_image.get_result()
|
||||||
|
)
|
||||||
|
text_io.truncate(0)
|
||||||
|
text_io.seek(0)
|
||||||
|
metadata = {"source": blob.source, "page": str(i)}
|
||||||
|
yield Document(page_content=content, metadata=metadata)
|
||||||
|
|
||||||
    def _extract_images_from_page(self, page: pdfminer.layout.LTPage) -> str:
        """Extract images from page and get the text with RapidOCR."""
        import pdfminer

        def get_image(layout_object: Any) -> Any:
            if isinstance(layout_object, pdfminer.layout.LTImage):
                return layout_object
            if isinstance(layout_object, pdfminer.layout.LTContainer):
                for child in layout_object:
                    return get_image(child)
            else:
                return None

        images = []

        for img in list(filter(bool, map(get_image, page))):
            if img.stream["Filter"].name in _PDF_FILTER_WITHOUT_LOSS:
                images.append(
                    np.frombuffer(img.stream.get_data(), dtype=np.uint8).reshape(
                        img.stream["Height"], img.stream["Width"], -1
                    )
                )
            elif img.stream["Filter"].name in _PDF_FILTER_WITH_LOSS:
                images.append(img.stream.get_data())
            else:
                warnings.warn("Unknown PDF Filter!")
        return extract_from_images_with_rapidocr(images)
|
|
||||||
|
|
||||||
class PyMuPDFParser(BaseBlobParser):
|
class PyMuPDFParser(BaseBlobParser):
|
||||||
"""Parse `PDF` using `PyMuPDF`."""
|
"""Parse `PDF` using `PyMuPDF`."""
|
||||||
|
|
||||||
def __init__(self, text_kwargs: Optional[Mapping[str, Any]] = None) -> None:
|
def __init__(
|
||||||
|
self,
|
||||||
|
text_kwargs: Optional[Mapping[str, Any]] = None,
|
||||||
|
extract_images: bool = False,
|
||||||
|
) -> None:
|
||||||
"""Initialize the parser.
|
"""Initialize the parser.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
text_kwargs: Keyword arguments to pass to ``fitz.Page.get_text()``.
|
text_kwargs: Keyword arguments to pass to ``fitz.Page.get_text()``.
|
||||||
"""
|
"""
|
||||||
self.text_kwargs = text_kwargs or {}
|
self.text_kwargs = text_kwargs or {}
|
||||||
|
self.extract_images = extract_images
|
||||||
|
|
||||||
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
|
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
|
||||||
"""Lazily parse the blob."""
|
"""Lazily parse the blob."""
|
||||||
@ -66,7 +210,8 @@ class PyMuPDFParser(BaseBlobParser):
|
|||||||
|
|
||||||
yield from [
|
yield from [
|
||||||
Document(
|
Document(
|
||||||
page_content=page.get_text(**self.text_kwargs),
|
page_content=page.get_text(**self.text_kwargs)
|
||||||
|
+ self._extract_images_from_page(doc, page),
|
||||||
metadata=dict(
|
metadata=dict(
|
||||||
{
|
{
|
||||||
"source": blob.source,
|
"source": blob.source,
|
||||||
@ -84,11 +229,31 @@ class PyMuPDFParser(BaseBlobParser):
|
|||||||
for page in doc
|
for page in doc
|
||||||
]
|
]
|
||||||
|
|
||||||
    def _extract_images_from_page(
        self, doc: fitz.fitz.Document, page: fitz.fitz.Page
    ) -> str:
        """Extract images from page and get the text with RapidOCR."""
        if not self.extract_images:
            return ""
        import fitz

        img_list = page.get_images()
        imgs = []
        for img in img_list:
            xref = img[0]
            pix = fitz.Pixmap(doc, xref)
            imgs.append(
                np.frombuffer(pix.samples, dtype=np.uint8).reshape(
                    pix.height, pix.width, -1
                )
            )
        return extract_from_images_with_rapidocr(imgs)
|
|
||||||
|
|
||||||
class PyPDFium2Parser(BaseBlobParser):
|
class PyPDFium2Parser(BaseBlobParser):
|
||||||
"""Parse `PDF` with `PyPDFium2`."""
|
"""Parse `PDF` with `PyPDFium2`."""
|
||||||
|
|
||||||
def __init__(self) -> None:
|
def __init__(self, extract_images: bool = False) -> None:
|
||||||
"""Initialize the parser."""
|
"""Initialize the parser."""
|
||||||
try:
|
try:
|
||||||
import pypdfium2 # noqa:F401
|
import pypdfium2 # noqa:F401
|
||||||
@ -97,6 +262,7 @@ class PyPDFium2Parser(BaseBlobParser):
|
|||||||
"pypdfium2 package not found, please install it with"
|
"pypdfium2 package not found, please install it with"
|
||||||
" `pip install pypdfium2`"
|
" `pip install pypdfium2`"
|
||||||
)
|
)
|
||||||
|
self.extract_images = extract_images
|
||||||
|
|
||||||
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
|
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
|
||||||
"""Lazily parse the blob."""
|
"""Lazily parse the blob."""
|
||||||
@ -111,18 +277,34 @@ class PyPDFium2Parser(BaseBlobParser):
|
|||||||
text_page = page.get_textpage()
|
text_page = page.get_textpage()
|
||||||
content = text_page.get_text_range()
|
content = text_page.get_text_range()
|
||||||
text_page.close()
|
text_page.close()
|
||||||
|
content += "\n" + self._extract_images_from_page(page)
|
||||||
page.close()
|
page.close()
|
||||||
metadata = {"source": blob.source, "page": page_number}
|
metadata = {"source": blob.source, "page": page_number}
|
||||||
yield Document(page_content=content, metadata=metadata)
|
yield Document(page_content=content, metadata=metadata)
|
||||||
finally:
|
finally:
|
||||||
pdf_reader.close()
|
pdf_reader.close()
|
||||||
|
|
||||||
    def _extract_images_from_page(self, page: pypdfium2._helpers.page.PdfPage) -> str:
        """Extract images from page and get the text with RapidOCR."""
        if not self.extract_images:
            return ""

        import pypdfium2.raw as pdfium_c

        images = list(page.get_objects(filter=(pdfium_c.FPDF_PAGEOBJ_IMAGE,)))

        images = list(map(lambda x: x.get_bitmap().to_numpy(), images))
        return extract_from_images_with_rapidocr(images)
|
|
||||||
|
|
||||||
class PDFPlumberParser(BaseBlobParser):
|
class PDFPlumberParser(BaseBlobParser):
|
||||||
"""Parse `PDF` with `PDFPlumber`."""
|
"""Parse `PDF` with `PDFPlumber`."""
|
||||||
|
|
||||||
def __init__(
|
def __init__(
|
||||||
self, text_kwargs: Optional[Mapping[str, Any]] = None, dedupe: bool = False
|
self,
|
||||||
|
text_kwargs: Optional[Mapping[str, Any]] = None,
|
||||||
|
dedupe: bool = False,
|
||||||
|
extract_images: bool = False,
|
||||||
) -> None:
|
) -> None:
|
||||||
"""Initialize the parser.
|
"""Initialize the parser.
|
||||||
|
|
||||||
@ -132,6 +314,7 @@ class PDFPlumberParser(BaseBlobParser):
|
|||||||
"""
|
"""
|
||||||
self.text_kwargs = text_kwargs or {}
|
self.text_kwargs = text_kwargs or {}
|
||||||
self.dedupe = dedupe
|
self.dedupe = dedupe
|
||||||
|
self.extract_images = extract_images
|
||||||
|
|
||||||
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
|
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
|
||||||
"""Lazily parse the blob."""
|
"""Lazily parse the blob."""
|
||||||
@ -142,12 +325,14 @@ class PDFPlumberParser(BaseBlobParser):
|
|||||||
|
|
||||||
yield from [
|
yield from [
|
||||||
Document(
|
Document(
|
||||||
page_content=self._process_page_content(page),
|
page_content=self._process_page_content(page)
|
||||||
|
+ "\n"
|
||||||
|
+ self._extract_images_from_page(page),
|
||||||
metadata=dict(
|
metadata=dict(
|
||||||
{
|
{
|
||||||
"source": blob.source,
|
"source": blob.source,
|
||||||
"file_path": blob.source,
|
"file_path": blob.source,
|
||||||
"page": page.page_number,
|
"page": page.page_number - 1,
|
||||||
"total_pages": len(doc.pages),
|
"total_pages": len(doc.pages),
|
||||||
},
|
},
|
||||||
**{
|
**{
|
||||||
@ -166,6 +351,26 @@ class PDFPlumberParser(BaseBlobParser):
|
|||||||
return page.dedupe_chars().extract_text(**self.text_kwargs)
|
return page.dedupe_chars().extract_text(**self.text_kwargs)
|
||||||
return page.extract_text(**self.text_kwargs)
|
return page.extract_text(**self.text_kwargs)
|
||||||
|
|
||||||
    def _extract_images_from_page(self, page: pdfplumber.page.Page) -> str:
        """Extract images from page and get the text with RapidOCR."""
        if not self.extract_images:
            return ""

        images = []
        for img in page.images:
            if img["stream"]["Filter"].name in _PDF_FILTER_WITHOUT_LOSS:
                images.append(
                    np.frombuffer(img["stream"].get_data(), dtype=np.uint8).reshape(
                        img["stream"]["Height"], img["stream"]["Width"], -1
                    )
                )
            elif img["stream"]["Filter"].name in _PDF_FILTER_WITH_LOSS:
                images.append(img["stream"].get_data())
            else:
                warnings.warn("Unknown PDF Filter!")

        return extract_from_images_with_rapidocr(images)
|
|
||||||
|
|
||||||
class AmazonTextractPDFParser(BaseBlobParser):
|
class AmazonTextractPDFParser(BaseBlobParser):
|
||||||
"""Send `PDF` files to `Amazon Textract` and parse them.
|
"""Send `PDF` files to `Amazon Textract` and parse them.
|
||||||
|
@ -145,6 +145,7 @@ class PyPDFLoader(BasePDFLoader):
|
|||||||
file_path: str,
|
file_path: str,
|
||||||
password: Optional[Union[str, bytes]] = None,
|
password: Optional[Union[str, bytes]] = None,
|
||||||
headers: Optional[Dict] = None,
|
headers: Optional[Dict] = None,
|
||||||
|
extract_images: bool = False,
|
||||||
) -> None:
|
) -> None:
|
||||||
"""Initialize with a file path."""
|
"""Initialize with a file path."""
|
||||||
try:
|
try:
|
||||||
@ -153,7 +154,7 @@ class PyPDFLoader(BasePDFLoader):
|
|||||||
raise ImportError(
|
raise ImportError(
|
||||||
"pypdf package not found, please install it with " "`pip install pypdf`"
|
"pypdf package not found, please install it with " "`pip install pypdf`"
|
||||||
)
|
)
|
||||||
self.parser = PyPDFParser(password=password)
|
self.parser = PyPDFParser(password=password, extract_images=extract_images)
|
||||||
super().__init__(file_path, headers=headers)
|
super().__init__(file_path, headers=headers)
|
||||||
|
|
||||||
def load(self) -> List[Document]:
|
def load(self) -> List[Document]:
|
||||||
@ -171,10 +172,16 @@ class PyPDFLoader(BasePDFLoader):
|
|||||||
class PyPDFium2Loader(BasePDFLoader):
|
class PyPDFium2Loader(BasePDFLoader):
|
||||||
"""Load `PDF` using `pypdfium2` and chunks at character level."""
|
"""Load `PDF` using `pypdfium2` and chunks at character level."""
|
||||||
|
|
||||||
def __init__(self, file_path: str, *, headers: Optional[Dict] = None):
|
def __init__(
|
||||||
|
self,
|
||||||
|
file_path: str,
|
||||||
|
*,
|
||||||
|
headers: Optional[Dict] = None,
|
||||||
|
extract_images: bool = False,
|
||||||
|
):
|
||||||
"""Initialize with a file path."""
|
"""Initialize with a file path."""
|
||||||
super().__init__(file_path, headers=headers)
|
super().__init__(file_path, headers=headers)
|
||||||
self.parser = PyPDFium2Parser()
|
self.parser = PyPDFium2Parser(extract_images=extract_images)
|
||||||
|
|
||||||
def load(self) -> List[Document]:
|
def load(self) -> List[Document]:
|
||||||
"""Load given path as pages."""
|
"""Load given path as pages."""
|
||||||
@ -201,12 +208,14 @@ class PyPDFDirectoryLoader(BaseLoader):
|
|||||||
silent_errors: bool = False,
|
silent_errors: bool = False,
|
||||||
load_hidden: bool = False,
|
load_hidden: bool = False,
|
||||||
recursive: bool = False,
|
recursive: bool = False,
|
||||||
|
extract_images: bool = False,
|
||||||
):
|
):
|
||||||
self.path = path
|
self.path = path
|
||||||
self.glob = glob
|
self.glob = glob
|
||||||
self.load_hidden = load_hidden
|
self.load_hidden = load_hidden
|
||||||
self.recursive = recursive
|
self.recursive = recursive
|
||||||
self.silent_errors = silent_errors
|
self.silent_errors = silent_errors
|
||||||
|
self.extract_images = extract_images
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def _is_visible(path: Path) -> bool:
|
def _is_visible(path: Path) -> bool:
|
||||||
@ -220,7 +229,7 @@ class PyPDFDirectoryLoader(BaseLoader):
|
|||||||
if i.is_file():
|
if i.is_file():
|
||||||
if self._is_visible(i.relative_to(p)) or self.load_hidden:
|
if self._is_visible(i.relative_to(p)) or self.load_hidden:
|
||||||
try:
|
try:
|
||||||
loader = PyPDFLoader(str(i))
|
loader = PyPDFLoader(str(i), extract_images=self.extract_images)
|
||||||
sub_docs = loader.load()
|
sub_docs = loader.load()
|
||||||
for doc in sub_docs:
|
for doc in sub_docs:
|
||||||
doc.metadata["source"] = str(i)
|
doc.metadata["source"] = str(i)
|
||||||
@ -236,7 +245,13 @@ class PyPDFDirectoryLoader(BaseLoader):
|
|||||||
class PDFMinerLoader(BasePDFLoader):
|
class PDFMinerLoader(BasePDFLoader):
|
||||||
"""Load `PDF` files using `PDFMiner`."""
|
"""Load `PDF` files using `PDFMiner`."""
|
||||||
|
|
||||||
def __init__(self, file_path: str, *, headers: Optional[Dict] = None) -> None:
|
def __init__(
|
||||||
|
self,
|
||||||
|
file_path: str,
|
||||||
|
*,
|
||||||
|
headers: Optional[Dict] = None,
|
||||||
|
extract_images: bool = False,
|
||||||
|
) -> None:
|
||||||
"""Initialize with file path."""
|
"""Initialize with file path."""
|
||||||
try:
|
try:
|
||||||
from pdfminer.high_level import extract_text # noqa:F401
|
from pdfminer.high_level import extract_text # noqa:F401
|
||||||
@ -247,7 +262,7 @@ class PDFMinerLoader(BasePDFLoader):
|
|||||||
)
|
)
|
||||||
|
|
||||||
super().__init__(file_path, headers=headers)
|
super().__init__(file_path, headers=headers)
|
||||||
self.parser = PDFMinerParser()
|
self.parser = PDFMinerParser(extract_images=extract_images)
|
||||||
|
|
||||||
def load(self) -> List[Document]:
|
def load(self) -> List[Document]:
|
||||||
"""Eagerly load the content."""
|
"""Eagerly load the content."""
|
||||||
@ -299,7 +314,12 @@ class PyMuPDFLoader(BasePDFLoader):
|
|||||||
"""Load `PDF` files using `PyMuPDF`."""
|
"""Load `PDF` files using `PyMuPDF`."""
|
||||||
|
|
||||||
def __init__(
|
def __init__(
|
||||||
self, file_path: str, *, headers: Optional[Dict] = None, **kwargs: Any
|
self,
|
||||||
|
file_path: str,
|
||||||
|
*,
|
||||||
|
headers: Optional[Dict] = None,
|
||||||
|
extract_images: bool = False,
|
||||||
|
**kwargs: Any,
|
||||||
) -> None:
|
) -> None:
|
||||||
"""Initialize with a file path."""
|
"""Initialize with a file path."""
|
||||||
try:
|
try:
|
||||||
@ -310,6 +330,7 @@ class PyMuPDFLoader(BasePDFLoader):
|
|||||||
"`pip install pymupdf`"
|
"`pip install pymupdf`"
|
||||||
)
|
)
|
||||||
super().__init__(file_path, headers=headers)
|
super().__init__(file_path, headers=headers)
|
||||||
|
self.extract_images = extract_images
|
||||||
self.text_kwargs = kwargs
|
self.text_kwargs = kwargs
|
||||||
|
|
||||||
def load(self, **kwargs: Any) -> List[Document]:
|
def load(self, **kwargs: Any) -> List[Document]:
|
||||||
@ -321,7 +342,9 @@ class PyMuPDFLoader(BasePDFLoader):
|
|||||||
)
|
)
|
||||||
|
|
||||||
text_kwargs = {**self.text_kwargs, **kwargs}
|
text_kwargs = {**self.text_kwargs, **kwargs}
|
||||||
parser = PyMuPDFParser(text_kwargs=text_kwargs)
|
parser = PyMuPDFParser(
|
||||||
|
text_kwargs=text_kwargs, extract_images=self.extract_images
|
||||||
|
)
|
||||||
blob = Blob.from_path(self.file_path)
|
blob = Blob.from_path(self.file_path)
|
||||||
return parser.parse(blob)
|
return parser.parse(blob)
|
||||||
|
|
||||||
@ -456,6 +479,7 @@ class PDFPlumberLoader(BasePDFLoader):
|
|||||||
text_kwargs: Optional[Mapping[str, Any]] = None,
|
text_kwargs: Optional[Mapping[str, Any]] = None,
|
||||||
dedupe: bool = False,
|
dedupe: bool = False,
|
||||||
headers: Optional[Dict] = None,
|
headers: Optional[Dict] = None,
|
||||||
|
extract_images: bool = False,
|
||||||
) -> None:
|
) -> None:
|
||||||
"""Initialize with a file path."""
|
"""Initialize with a file path."""
|
||||||
try:
|
try:
|
||||||
@ -469,11 +493,16 @@ class PDFPlumberLoader(BasePDFLoader):
|
|||||||
super().__init__(file_path, headers=headers)
|
super().__init__(file_path, headers=headers)
|
||||||
self.text_kwargs = text_kwargs or {}
|
self.text_kwargs = text_kwargs or {}
|
||||||
self.dedupe = dedupe
|
self.dedupe = dedupe
|
||||||
|
self.extract_images = extract_images
|
||||||
|
|
||||||
def load(self) -> List[Document]:
|
def load(self) -> List[Document]:
|
||||||
"""Load file."""
|
"""Load file."""
|
||||||
|
|
||||||
parser = PDFPlumberParser(text_kwargs=self.text_kwargs, dedupe=self.dedupe)
|
parser = PDFPlumberParser(
|
||||||
|
text_kwargs=self.text_kwargs,
|
||||||
|
dedupe=self.dedupe,
|
||||||
|
extract_images=self.extract_images,
|
||||||
|
)
|
||||||
blob = Blob.from_path(self.file_path)
|
blob = Blob.from_path(self.file_path)
|
||||||
return parser.parse(blob)
|
return parser.parse(blob)
|
||||||
|
|
||||||
|
198
libs/langchain/poetry.lock
generated
@ -1,4 +1,4 @@
|
|||||||
# This file is automatically @generated by Poetry 1.6.1 and should not be changed by hand.
|
# This file is automatically @generated by Poetry 1.5.1 and should not be changed by hand.
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "absl-py"
|
name = "absl-py"
|
||||||
@ -1658,6 +1658,23 @@ files = [
|
|||||||
{file = "colorama-0.4.6.tar.gz", hash = "sha256:08695f5cb7ed6e0531a20572697297273c47b8cae5a63ffc6d6ed5c201be6e44"},
|
{file = "colorama-0.4.6.tar.gz", hash = "sha256:08695f5cb7ed6e0531a20572697297273c47b8cae5a63ffc6d6ed5c201be6e44"},
|
||||||
]
|
]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "coloredlogs"
|
||||||
|
version = "15.0.1"
|
||||||
|
description = "Colored terminal output for Python's logging module"
|
||||||
|
optional = true
|
||||||
|
python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*"
|
||||||
|
files = [
|
||||||
|
{file = "coloredlogs-15.0.1-py2.py3-none-any.whl", hash = "sha256:612ee75c546f53e92e70049c9dbfcc18c935a2b9a53b66085ce9ef6a6e5c0934"},
|
||||||
|
{file = "coloredlogs-15.0.1.tar.gz", hash = "sha256:7c991aa71a4577af2f82600d8f8f3a89f936baeaf9b50a9c197da014e5bf16b0"},
|
||||||
|
]
|
||||||
|
|
||||||
|
[package.dependencies]
|
||||||
|
humanfriendly = ">=9.1"
|
||||||
|
|
||||||
|
[package.extras]
|
||||||
|
cron = ["capturer (>=2.4)"]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "comm"
|
name = "comm"
|
||||||
version = "0.1.4"
|
version = "0.1.4"
|
@@ -3232,6 +3249,20 @@ testing = ["InquirerPy (==0.3.4)", "Jinja2", "Pillow", "aiohttp", "gradio", "jed
 torch = ["torch"]
 typing = ["pydantic (<2.0)", "types-PyYAML", "types-requests", "types-simplejson", "types-toml", "types-tqdm", "types-urllib3"]

+[[package]]
+name = "humanfriendly"
+version = "10.0"
+description = "Human friendly output for text interfaces using Python"
+optional = true
+python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*"
+files = [
+    {file = "humanfriendly-10.0-py2.py3-none-any.whl", hash = "sha256:1697e1a8a8f550fd43c2865cd84542fc175a61dcb779b6fee18cf6b6ccba1477"},
+    {file = "humanfriendly-10.0.tar.gz", hash = "sha256:6b0b831ce8f15f7300721aa49829fc4e83921a9a301cc7f606be6686a2288ddc"},
+]
+
+[package.dependencies]
+pyreadline3 = {version = "*", markers = "sys_platform == \"win32\" and python_version >= \"3.8\""}
+
 [[package]]
 name = "humbug"
 version = "0.3.2"
@@ -5642,6 +5673,47 @@ rsa = ["cryptography (>=3.0.0)"]
 signals = ["blinker (>=1.4.0)"]
 signedtoken = ["cryptography (>=3.0.0)", "pyjwt (>=2.0.0,<3)"]

+[[package]]
+name = "onnxruntime"
+version = "1.16.0"
+description = "ONNX Runtime is a runtime accelerator for Machine Learning models"
+optional = true
+python-versions = "*"
+files = [
+    {file = "onnxruntime-1.16.0-cp310-cp310-macosx_10_15_x86_64.whl", hash = "sha256:69c86ba3d90c166944c4a3c8a5b2a24a7bc45e68ae5997d83279af21ffd0f5f3"},
+    {file = "onnxruntime-1.16.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:604a46aa2ad6a51f2fc4df1a984ea571a43aa02424aea93464c32ce02d23b3bb"},
+    {file = "onnxruntime-1.16.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a40660516b382031279fb690fc3d068ad004173c2bd12bbdc0bd0fe01ef8b7c3"},
+    {file = "onnxruntime-1.16.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:349fd9c7875c1a76609d45b079484f8059adfb1fb87a30506934fb667ceab249"},
+    {file = "onnxruntime-1.16.0-cp310-cp310-win32.whl", hash = "sha256:22c9e2f1a1f15b41b01195cd2520c013c22228efc4795ae4118048ea4118aad2"},
+    {file = "onnxruntime-1.16.0-cp310-cp310-win_amd64.whl", hash = "sha256:b9667a131abfd226a728cc1c1ecf5cc5afa4fff37422f95a84bc22f7c175b57f"},
+    {file = "onnxruntime-1.16.0-cp311-cp311-macosx_10_15_x86_64.whl", hash = "sha256:f7b292726a1f3fa4a483d7e902da083a5889a86a860dbc3a6479988cad342578"},
+    {file = "onnxruntime-1.16.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:61eaf288a2482c5561f620fb686c80c32709e92724bbb59a5e4a0d349429e205"},
+    {file = "onnxruntime-1.16.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5fe2239d5821d5501eecccfe5c408485591b5d73eb76a61491a8f78179c2e65a"},
+    {file = "onnxruntime-1.16.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5a4924604fcdf1704b7f7e087b4c0b0e181c58367a687da55b1aec2705631943"},
+    {file = "onnxruntime-1.16.0-cp311-cp311-win32.whl", hash = "sha256:55d8456f1ab28c32aec9c478b7638ed145102b03bb9b719b79e065ffc5de9c72"},
+    {file = "onnxruntime-1.16.0-cp311-cp311-win_amd64.whl", hash = "sha256:c2a53ffd456187028c841ac7ed0d83b4c2b7e48bd2b1cf2a42d253ecf1e97cb3"},
+    {file = "onnxruntime-1.16.0-cp38-cp38-macosx_10_15_x86_64.whl", hash = "sha256:bf5769aa4095cfe2503307867fa95b5f73732909ee21b67fe24da443af445925"},
+    {file = "onnxruntime-1.16.0-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:c0974deadf11ddab201d915a10517be00fa9d6816def56fa374e4c1a0008985a"},
+    {file = "onnxruntime-1.16.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:99dccf1d2eba5ecd7b6c0e8e80d92d0030291f3506726c156e018a4d7a187c6f"},
+    {file = "onnxruntime-1.16.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0170ed05d3a8a7c24fe01fc262a6bc603837751f3bb273df7006a2da73f37fff"},
+    {file = "onnxruntime-1.16.0-cp38-cp38-win32.whl", hash = "sha256:5ecd38e98ccdcbbaa7e529e96852f4c1c136559802354b76378d9a19532018ee"},
+    {file = "onnxruntime-1.16.0-cp38-cp38-win_amd64.whl", hash = "sha256:1c585c60e9541a9bd4fb319ba9a3ef6122a28dcf4f3dbcdf014df44570cad6f8"},
+    {file = "onnxruntime-1.16.0-cp39-cp39-macosx_10_15_x86_64.whl", hash = "sha256:efe59c1e51ad647fb18860233f5971e309961d09ca10697170ef9b7d9fa728f4"},
+    {file = "onnxruntime-1.16.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:e3c9a9cccab8f6512a0c0207b2816dd8864f2f720f6e9df5cf01e30c4f80194f"},
+    {file = "onnxruntime-1.16.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:dcf16a252308ec6e0737db7028b63fed0ac28fbad134f86216c0dfb051a31f38"},
+    {file = "onnxruntime-1.16.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f533aa90ee7189e88b6b612d6adae7d290971090598cfd47ce034ab0d106fc9c"},
+    {file = "onnxruntime-1.16.0-cp39-cp39-win32.whl", hash = "sha256:306c7f5d8a0c24c65afb34f7deb0bc526defde2249e53538f1dce083945a2d6e"},
+    {file = "onnxruntime-1.16.0-cp39-cp39-win_amd64.whl", hash = "sha256:df8a00a7b057ba497e2822175cc68731d84b89a6d50a3a2a3ec51e98e9c91125"},
+]
+
+[package.dependencies]
+coloredlogs = "*"
+flatbuffers = "*"
+numpy = ">=1.21.6"
+packaging = "*"
+protobuf = "*"
+sympy = "*"
+
 [[package]]
 name = "openai"
 version = "0.27.10"
@@ -5678,6 +5750,33 @@ files = [
 [package.dependencies]
 pydantic = ">=1.8.2"

+[[package]]
+name = "opencv-python"
+version = "4.8.1.78"
+description = "Wrapper package for OpenCV python bindings."
+optional = true
+python-versions = ">=3.6"
+files = [
+    {file = "opencv-python-4.8.1.78.tar.gz", hash = "sha256:cc7adbbcd1112877a39274106cb2752e04984bc01a031162952e97450d6117f6"},
+    {file = "opencv_python-4.8.1.78-cp37-abi3-macosx_10_16_x86_64.whl", hash = "sha256:91d5f6f5209dc2635d496f6b8ca6573ecdad051a09e6b5de4c399b8e673c60da"},
+    {file = "opencv_python-4.8.1.78-cp37-abi3-macosx_11_0_arm64.whl", hash = "sha256:bc31f47e05447da8b3089faa0a07ffe80e114c91ce0b171e6424f9badbd1c5cd"},
+    {file = "opencv_python-4.8.1.78-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:9814beca408d3a0eca1bae7e3e5be68b07c17ecceb392b94170881216e09b319"},
+    {file = "opencv_python-4.8.1.78-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c4c406bdb41eb21ea51b4e90dfbc989c002786c3f601c236a99c59a54670a394"},
+    {file = "opencv_python-4.8.1.78-cp37-abi3-win32.whl", hash = "sha256:a7aac3900fbacf55b551e7b53626c3dad4c71ce85643645c43e91fcb19045e47"},
+    {file = "opencv_python-4.8.1.78-cp37-abi3-win_amd64.whl", hash = "sha256:b983197f97cfa6fcb74e1da1802c7497a6f94ed561aba6980f1f33123f904956"},
+]
+
+[package.dependencies]
+numpy = [
+    {version = ">=1.21.0", markers = "python_version <= \"3.9\" and platform_system == \"Darwin\" and platform_machine == \"arm64\""},
+    {version = ">=1.19.3", markers = "python_version >= \"3.6\" and platform_system == \"Linux\" and platform_machine == \"aarch64\" or python_version >= \"3.9\""},
+    {version = ">=1.17.0", markers = "python_version >= \"3.7\""},
+    {version = ">=1.17.3", markers = "python_version >= \"3.8\""},
+    {version = ">=1.21.2", markers = "python_version >= \"3.10\""},
+    {version = ">=1.21.4", markers = "python_version >= \"3.10\" and platform_system == \"Darwin\""},
+    {version = ">=1.23.5", markers = "python_version >= \"3.11\""},
+]
+
 [[package]]
 name = "openlm"
 version = "0.0.5"
@@ -5862,7 +5961,7 @@ files = [
 [package.dependencies]
 numpy = [
     {version = ">=1.20.3", markers = "python_version < \"3.10\""},
-    {version = ">=1.21.0", markers = "python_version >= \"3.10\" and python_version < \"3.11\""},
+    {version = ">=1.21.0", markers = "python_version >= \"3.10\""},
     {version = ">=1.23.2", markers = "python_version >= \"3.11\""},
 ]
 python-dateutil = ">=2.8.2"
@@ -6659,6 +6758,59 @@ cffi = ">=1.5.0"
 [package.extras]
 idna = ["idna (>=2.1)"]

+[[package]]
+name = "pyclipper"
+version = "1.3.0.post5"
+description = "Cython wrapper for the C++ translation of the Angus Johnson's Clipper library (ver. 6.4.2)"
+optional = true
+python-versions = "*"
+files = [
+    {file = "pyclipper-1.3.0.post5-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:3c45f99b8180dd4df4c86642657ca92b7d5289a5e3724521822e0f9461961fe2"},
+    {file = "pyclipper-1.3.0.post5-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:567ffd419a0bdc3727fa4562cfa1f18484691817a2bc0bc675750aa28ed98bd4"},
+    {file = "pyclipper-1.3.0.post5-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:59c8c75661a6d87e98b1655851578a2917d3c8859912c9a4f1956b9830940fd9"},
+    {file = "pyclipper-1.3.0.post5-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a496efa146d2d88b59350021739e4685e439dc569b6654e9e6d5e42e9a0b1666"},
+    {file = "pyclipper-1.3.0.post5-cp310-cp310-win32.whl", hash = "sha256:02a98d09af9b60bcf8e9480d153c0839e20b92689f5602f87242a4933842fecd"},
+    {file = "pyclipper-1.3.0.post5-cp310-cp310-win_amd64.whl", hash = "sha256:847f1e2fc3994bb498fe675f55c98129b95dc26a5c92304ba4cf0ab40721ea3d"},
+    {file = "pyclipper-1.3.0.post5-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:b7a983ae019932bfa0a1971a2dc8c856704add5f3d567bed8fac02dbc0e7f0bf"},
+    {file = "pyclipper-1.3.0.post5-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:d8760075c395b924f894aa16ee06e8c040c6f9b63e0903e49de3cc8d82d9e637"},
+    {file = "pyclipper-1.3.0.post5-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:e4ea61ca5899d3346c614951342c506f119601ed0a1f4889a9cc236558afec6b"},
+    {file = "pyclipper-1.3.0.post5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:46499b361ae067662b22578401d83d57716f3cc0071d592feb07d504b439fea7"},
+    {file = "pyclipper-1.3.0.post5-cp311-cp311-win32.whl", hash = "sha256:d5c77e39ab05a6cf277c819639968b21e6959e996ea1a074afc24236541708ff"},
+    {file = "pyclipper-1.3.0.post5-cp311-cp311-win_amd64.whl", hash = "sha256:0f78a1c18ff4f9276f78d9353d6ed4309c3886a9d0172437e48328aef499165e"},
+    {file = "pyclipper-1.3.0.post5-cp312-cp312-macosx_10_9_universal2.whl", hash = "sha256:5237282f906049c307e6c90333c7d56f6b8712bf087ef97b141830c40b09ca0a"},
+    {file = "pyclipper-1.3.0.post5-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:aca8635573646b65c054399433fb3493637f1445db942de8a52fca9ef493ba3d"},
+    {file = "pyclipper-1.3.0.post5-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1158a2b13d59bdfab33d1d928f7b72c8c7fb8a76e7d2283839cb45d7c0ff2140"},
+    {file = "pyclipper-1.3.0.post5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5a041f1a7982b17cf92fd3be349ec41ff1901792149c166bf283f469567b52d6"},
+    {file = "pyclipper-1.3.0.post5-cp312-cp312-win32.whl", hash = "sha256:bf3a2ccd6e4e078250b0a31a12c519b0be6d1bc160acfceee62407dbd68558f6"},
+    {file = "pyclipper-1.3.0.post5-cp312-cp312-win_amd64.whl", hash = "sha256:2ce6e0a6ab32182c26537965cf521822cd11a28a7ffcef48635a94c6ca8559ef"},
+    {file = "pyclipper-1.3.0.post5-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:010ee13d40d924341cc41b6d9901d763175040c68753939f140bc0cc714f18bb"},
+    {file = "pyclipper-1.3.0.post5-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ee1c4797b1dc982ae9d60333269536ea03ddc0baa1c3383a6d5b741dbbb12675"},
+    {file = "pyclipper-1.3.0.post5-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.whl", hash = "sha256:ba692cf11873886085a0445dcfc362b24ca35bcb997ad9e9b5685854a290d8ff"},
+    {file = "pyclipper-1.3.0.post5-cp36-cp36m-win32.whl", hash = "sha256:f0b84fcf5230aca2de06ddb7920459daa858853835f8774739ca30dd516e7d37"},
+    {file = "pyclipper-1.3.0.post5-cp36-cp36m-win_amd64.whl", hash = "sha256:741910bfd7b0bd40f027869f4bf86bdd9678ae7f74e8dabcf62d170269f6191d"},
+    {file = "pyclipper-1.3.0.post5-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:5f3484b4dffa64f0e3a43b63165a5c0f507c5850e70b9cc2eaa82474d7746393"},
+    {file = "pyclipper-1.3.0.post5-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:87efec9795744cef786f2f8cab17d6dc07f57dfce5e3b7f3be96eb79a4ce5794"},
+    {file = "pyclipper-1.3.0.post5-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl", hash = "sha256:5f445a2d03690faa23a1b90e32dfb4352a60b23437323de87388c6c611d3d1e3"},
+    {file = "pyclipper-1.3.0.post5-cp37-cp37m-win32.whl", hash = "sha256:eb9d1cb2999bc1ea8ad1c3a031ba33b0a89a5ace25d33df7529d3ff18c16604c"},
+    {file = "pyclipper-1.3.0.post5-cp37-cp37m-win_amd64.whl", hash = "sha256:ead0f3ecd1961005f61d50c896e33442138b4e7c9e0c035784d3525068dd2b10"},
+    {file = "pyclipper-1.3.0.post5-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:39ccd920b192a4f8096589a2a1f8faaf6aaaadb7a163b5ce913d03faac2449bb"},
+    {file = "pyclipper-1.3.0.post5-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:e346e7adba43e40f5f5f293b6b6a45de5a6a3bdc74e437dedd948c5d74de9405"},
+    {file = "pyclipper-1.3.0.post5-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:bb2fb22927c3ac3191e555efd335c6efa819aa1ff4d0901979673ab5a18eb740"},
+    {file = "pyclipper-1.3.0.post5-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl", hash = "sha256:a678999d728023f1f3988a14a2e6d89d6f1ed4d0786d5992c1bffb4c1ab30318"},
+    {file = "pyclipper-1.3.0.post5-cp38-cp38-win32.whl", hash = "sha256:36d456fdf32a6410a87bd7af8ebc4c01f19b4e3b839104b3072558cad0d8bf4c"},
+    {file = "pyclipper-1.3.0.post5-cp38-cp38-win_amd64.whl", hash = "sha256:c9c1fdf4ecae6b55033ede3f4e931156ffc969334300f44f8bf1b356ec0a3d63"},
+    {file = "pyclipper-1.3.0.post5-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:8bb9cd95fd4bd88fb1590d1763a52e3ea6a1095e11b3e885ff164da1313aae79"},
+    {file = "pyclipper-1.3.0.post5-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:0f516fd69aa61a9698a3ce3ba2f7edda5ac6aafc8d964ee3bc60897906947fcb"},
+    {file = "pyclipper-1.3.0.post5-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:e36f018303656ea4a629d2fba0d0d4c74960eacec7119fe2ab3c658ce84c494b"},
+    {file = "pyclipper-1.3.0.post5-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl", hash = "sha256:dd3c4b312a931e668a7a291d4bd5b10bacb0687bd163220a9f0418c7e23169e2"},
+    {file = "pyclipper-1.3.0.post5-cp39-cp39-win32.whl", hash = "sha256:cfea42972e90954b3c89da9216993373a2270a5103d4916fd543a1109528ed4c"},
+    {file = "pyclipper-1.3.0.post5-cp39-cp39-win_amd64.whl", hash = "sha256:85ca06f382f999903d809380e4c01ec127d3eb26431402e9b3f01facaec68b80"},
+    {file = "pyclipper-1.3.0.post5-pp37-pypy37_pp73-manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:da30e59c684eea198f6e19244e9a41e855a23a416cc708821fd4eb8f5f18626c"},
+    {file = "pyclipper-1.3.0.post5-pp38-pypy38_pp73-manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:d8a9e3e46aa50e4c3667db9a816d59ae4f9c62b05f997abb8a9b3f3afe6d94a4"},
+    {file = "pyclipper-1.3.0.post5-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0589b80f2da1ad322345a93c053b5d46dc692def5a188351be01f34bcf041218"},
+    {file = "pyclipper-1.3.0.post5.tar.gz", hash = "sha256:c0239f928e0bf78a3efc2f2f615a10bfcdb9f33012d46d64c8d1225b4bde7096"},
+]
+
 [[package]]
 name = "pycparser"
 version = "2.21"
@@ -7125,6 +7277,17 @@ files = [
 [package.dependencies]
 tomli = {version = ">=1.1.0", markers = "python_version < \"3.11\""}

+[[package]]
+name = "pyreadline3"
+version = "3.4.1"
+description = "A python implementation of GNU readline."
+optional = true
+python-versions = "*"
+files = [
+    {file = "pyreadline3-3.4.1-py3-none-any.whl", hash = "sha256:b0efb6516fd4fb07b45949053826a62fa4cb353db5be2bbb4a7aa1fdd1e345fb"},
+    {file = "pyreadline3-3.4.1.tar.gz", hash = "sha256:6f3d1f7b8a31ba32b73917cefc1f28cc660562f39aea8646d30bd6eff21f7bae"},
+]
+
 [[package]]
 name = "pysocks"
 version = "1.7.1"
@@ -7821,6 +7984,26 @@ files = [
 [package.extras]
 full = ["numpy"]

+[[package]]
+name = "rapidocr-onnxruntime"
+version = "1.3.7"
+description = "A cross platform OCR Library based on OnnxRuntime."
+optional = true
+python-versions = ">=3.6,<3.12"
+files = [
+    {file = "rapidocr_onnxruntime-1.3.7-py3-none-any.whl", hash = "sha256:9d061786f6255c57a98f04a2f7624eacabc1d0dede2a69707c99a6dd9024e6fa"},
+]
+
+[package.dependencies]
+numpy = ">=1.19.5"
+onnxruntime = ">=1.7.0"
+opencv-python = ">=4.5.1.48"
+Pillow = "*"
+pyclipper = ">=1.2.0"
+PyYAML = "*"
+Shapely = ">=1.7.1"
+six = ">=1.15.0"
+
 [[package]]
 name = "ratelimiter"
 version = "1.2.0.post0"
@@ -8394,6 +8577,11 @@ files = [
     {file = "scikit_learn-1.3.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f66eddfda9d45dd6cadcd706b65669ce1df84b8549875691b1f403730bdef217"},
     {file = "scikit_learn-1.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c6448c37741145b241eeac617028ba6ec2119e1339b1385c9720dae31367f2be"},
     {file = "scikit_learn-1.3.1-cp311-cp311-win_amd64.whl", hash = "sha256:c413c2c850241998168bbb3bd1bb59ff03b1195a53864f0b80ab092071af6028"},
+    {file = "scikit_learn-1.3.1-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:ef540e09873e31569bc8b02c8a9f745ee04d8e1263255a15c9969f6f5caa627f"},
+    {file = "scikit_learn-1.3.1-cp312-cp312-macosx_12_0_arm64.whl", hash = "sha256:9147a3a4df4d401e618713880be023e36109c85d8569b3bf5377e6cd3fecdeac"},
+    {file = "scikit_learn-1.3.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d2cd3634695ad192bf71645702b3df498bd1e246fc2d529effdb45a06ab028b4"},
+    {file = "scikit_learn-1.3.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0c275a06c5190c5ce00af0acbb61c06374087949f643ef32d355ece12c4db043"},
+    {file = "scikit_learn-1.3.1-cp312-cp312-win_amd64.whl", hash = "sha256:0e1aa8f206d0de814b81b41d60c1ce31f7f2c7354597af38fae46d9c47c45122"},
     {file = "scikit_learn-1.3.1-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:52b77cc08bd555969ec5150788ed50276f5ef83abb72e6f469c5b91a0009bbca"},
     {file = "scikit_learn-1.3.1-cp38-cp38-macosx_12_0_arm64.whl", hash = "sha256:a683394bc3f80b7c312c27f9b14ebea7766b1f0a34faf1a2e9158d80e860ec26"},
     {file = "scikit_learn-1.3.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a15d964d9eb181c79c190d3dbc2fff7338786bf017e9039571418a1d53dab236"},
@@ -8841,7 +9029,7 @@ files = [
 ]

 [package.dependencies]
-greenlet = {version = "!=0.4.17", markers = "platform_machine == \"aarch64\" or platform_machine == \"ppc64le\" or platform_machine == \"x86_64\" or platform_machine == \"amd64\" or platform_machine == \"AMD64\" or platform_machine == \"win32\" or platform_machine == \"WIN32\""}
+greenlet = {version = "!=0.4.17", markers = "platform_machine == \"win32\" or platform_machine == \"WIN32\" or platform_machine == \"AMD64\" or platform_machine == \"amd64\" or platform_machine == \"x86_64\" or platform_machine == \"ppc64le\" or platform_machine == \"aarch64\""}
 typing-extensions = ">=4.2.0"

 [package.extras]
@@ -10666,7 +10854,7 @@ cli = ["typer"]
 cohere = ["cohere"]
 docarray = ["docarray"]
 embeddings = ["sentence-transformers"]
-extended-testing = ["amazon-textract-caller", "anthropic", "arxiv", "assemblyai", "atlassian-python-api", "beautifulsoup4", "bibtexparser", "cassio", "chardet", "dashvector", "esprima", "faiss-cpu", "feedparser", "geopandas", "gitpython", "gql", "html2text", "jinja2", "jq", "lxml", "markdownify", "motor", "mwparserfromhell", "mwxml", "newspaper3k", "numexpr", "openai", "openai", "openapi-schema-pydantic", "pandas", "pdfminer-six", "pgvector", "psychicapi", "py-trello", "pymupdf", "pypdf", "pypdfium2", "pyspark", "rank-bm25", "rapidfuzz", "requests-toolbelt", "scikit-learn", "sqlite-vss", "streamlit", "sympy", "telethon", "timescale-vector", "tqdm", "xata", "xmltodict"]
+extended-testing = ["amazon-textract-caller", "anthropic", "arxiv", "assemblyai", "atlassian-python-api", "beautifulsoup4", "bibtexparser", "cassio", "chardet", "dashvector", "esprima", "faiss-cpu", "feedparser", "geopandas", "gitpython", "gql", "html2text", "jinja2", "jq", "lxml", "markdownify", "motor", "mwparserfromhell", "mwxml", "newspaper3k", "numexpr", "openai", "openai", "openapi-schema-pydantic", "pandas", "pdfminer-six", "pgvector", "psychicapi", "py-trello", "pymupdf", "pypdf", "pypdfium2", "pyspark", "rank-bm25", "rapidfuzz", "rapidocr-onnxruntime", "requests-toolbelt", "scikit-learn", "sqlite-vss", "streamlit", "sympy", "telethon", "timescale-vector", "tqdm", "xata", "xmltodict"]
 javascript = ["esprima"]
 llms = ["clarifai", "cohere", "huggingface_hub", "manifest-ml", "nlpcloud", "openai", "openlm", "torch", "transformers"]
 openai = ["openai", "tiktoken"]
@@ -10676,4 +10864,4 @@ text-helpers = ["chardet"]
 [metadata]
 lock-version = "2.0"
 python-versions = ">=3.8.1,<4.0"
-content-hash = "498a5510e617012122596bf4e947f7466d7f574e7c7f1bb69e264ff0990f2277"
+content-hash = "7fbe9a5144717db54413735663870168b00e34deb4f37559e38d62843488adae"
@@ -82,6 +82,7 @@ pdfminer-six = {version = "^20221105", optional = true}
 docarray = {version="^0.32.0", extras=["hnswlib"], optional=true}
 lxml = {version = "^4.9.2", optional = true}
 pymupdf = {version = "^1.22.3", optional = true}
+rapidocr-onnxruntime = {version = "^1.3.2", optional = true, python = ">=3.8.1,<3.12"}
 pypdfium2 = {version = "^4.10.0", optional = true}
 gql = {version = "^3.4.1", optional = true}
 pandas = {version = "^2.0.1", optional = true}
@@ -359,6 +360,7 @@ extended_testing = [
 "arxiv",
 "dashvector",
 "sqlite-vss",
+"rapidocr-onnxruntime",
 "motor",
 "timescale-vector",
 "anthropic",
@@ -50,7 +50,7 @@ def _assert_with_parser(parser: BaseBlobParser, splits_by_page: bool = True) ->
     assert metadata["source"] == str(LAYOUT_PARSER_PAPER_PDF)

     if splits_by_page:
-        assert metadata["page"] == 0
+        assert int(metadata["page"]) == 0


 @pytest.mark.requires("pypdf")
@@ -77,3 +77,12 @@ def test_pypdfium2_parser() -> None:
     """Test PyPDFium2 parser."""
     # Does not follow defaults to split by page.
     _assert_with_parser(PyPDFium2Parser())
+
+
+@pytest.mark.requires("rapidocr_onnxruntime")
+def test_extract_images_text_from_pdf() -> None:
+    """Test extract image from pdf and recognize text with rapid ocr"""
+    _assert_with_parser(PyPDFParser(extract_images=True))
+    _assert_with_parser(PDFMinerParser(extract_images=True))
+    _assert_with_parser(PyMuPDFParser(extract_images=True))
+    _assert_with_parser(PyPDFium2Parser(extract_images=True))