upstage[minor]: Update few codes and add upstage loader in pdf section (#21085)
**Description:** Update UpstageLayoutAnalysisParser and Loader and add upstage loader example in pdf section

**Dependencies:** langchain_community

**Twitter handle:** [@upstageai](https://twitter.com/upstageai)

- [x] **Add tests and docs**: If you're adding a new integration, please include
  1. a test for the integration, preferably unit tests that do not rely on network access,
  2. an example notebook showing its use. It lives in `docs/docs/integrations` directory.
- [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in langchain.

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, hwchase17.
@@ -475,4 +475,31 @@ loader = AzureAIDocumentIntelligenceLoader(
)

documents = loader.load()
```

## Using UpstageLayoutAnalysisLoader

The `UpstageLayoutAnalysisLoader` calls the [Upstage Layout Analysis API](https://developers.upstage.ai/docs/apis/layout-analyzer) to detect document elements, including tables and figures, in a range of document formats. The loader uses OCR to extract text and detect elements in `JPEG`, `PNG`, `BMP`, `PDF`, `TIFF`, and `HEIC` files. For born-digital PDF documents, you can skip OCR and use the text embedded in the file by setting `use_ocr=False`, which is the default. Both single-page and multi-page documents are supported. With `use_ocr=True`, documents are limited to 100 pages and 50 MB per file; with `use_ocr=False` (PDF files only), these limits do not apply.

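The limits described above can be checked locally before a file is sent to the API. The following is a minimal sketch and not part of `langchain_upstage`: the helper name `check_ocr_limits` and both constants are hypothetical, the `.jpg` and `.tif` aliases are an assumption, "50 MB" is assumed to mean 50 * 1024 * 1024 bytes, and the page-count limit is not checked.

```python
from pathlib import Path
from typing import List

# Assumed values derived from the limits described above; not read from the API.
SUPPORTED_SUFFIXES = {".jpeg", ".jpg", ".png", ".bmp", ".pdf", ".tiff", ".tif", ".heic"}
MAX_OCR_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" interpreted as 50 MiB


def check_ocr_limits(file_path: str, use_ocr: bool = False) -> List[str]:
    """Return the problems found; an empty list means none were detected."""
    path = Path(file_path)
    suffix = path.suffix.lower()
    problems = []
    if suffix not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported file type: {suffix or '(none)'}")
    if not use_ocr and suffix != ".pdf":
        problems.append("use_ocr=False applies to PDF files only")
    # The size limit is only documented for use_ocr=True.
    if use_ocr and path.is_file() and path.stat().st_size > MAX_OCR_BYTES:
        problems.append("file exceeds the 50 MB limit for use_ocr=True")
    return problems
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue with a file at once.
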
### Prerequisite

To use the Upstage Layout Analysis API, you need an API access token. See the [quick start guide](https://developers.upstage.ai/docs/getting-started/quick-start) to obtain one, then install the `langchain_upstage` package.

```bash
pip install langchain_upstage
```

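Because the API needs an access token, it can help to fail early with a clear message when the token is missing. A minimal sketch, assuming the key is supplied through the `UPSTAGE_DOCUMENT_AI_API_KEY` environment variable used in this section's example; the helper `require_api_key` is hypothetical and not part of `langchain_upstage`.

```python
import os

# Environment variable name taken from this section's example.
API_KEY_ENV = "UPSTAGE_DOCUMENT_AI_API_KEY"


def require_api_key(env_name: str = API_KEY_ENV) -> str:
    """Return the API key from the environment, or raise a descriptive error."""
    key = os.environ.get(env_name, "").strip()
    # Treat the documentation placeholder as "not set".
    if not key or key == "YOUR_API_KEY":
        raise RuntimeError(
            f"Set the {env_name} environment variable to your Upstage API key "
            "before creating the loader."
        )
    return key
```
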
### Example

```python
import os

# Set the API key before creating the loader.
os.environ["UPSTAGE_DOCUMENT_AI_API_KEY"] = "YOUR_API_KEY"

from langchain_upstage import UpstageLayoutAnalysisLoader

file_path = "/PATH/TO/FILE.pdf"

# use_ocr defaults to False, so the text embedded in the PDF is used.
loader = UpstageLayoutAnalysisLoader(file_path)
data = loader.load()
```
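
In LangChain, `load()` returns a list of `Document` objects, each with `page_content` and `metadata` attributes. The sketch below shows one way to preview such a list. The helper `preview_documents` is hypothetical, and the sample uses stand-in objects with illustrative metadata so that it runs without an API key; the metadata actually returned by the loader may differ.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def preview_documents(docs: Iterable[Any], width: int = 60) -> List[str]:
    """Build one summary line per document: index, metadata, and a text preview."""
    lines = []
    for index, doc in enumerate(docs):
        # Collapse runs of whitespace and newlines into single spaces.
        text = " ".join(doc.page_content.split())
        if len(text) > width:
            text = text[:width] + "..."
        lines.append(f"[{index}] {doc.metadata} {text}")
    return lines


# Stand-in objects with the same two attributes as a LangChain Document.
sample = [SimpleNamespace(page_content="Quarterly  report\nTable 1", metadata={"page": 1})]
for line in preview_documents(sample):
    print(line)
```

With a real loader, `preview_documents(data)` could be called on the list returned by `loader.load()`.
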