upstage[minor]: Update few codes and add upstage loader in pdf section (#21085)

**Description:** Update UpstageLayoutAnalysisParser and Loader and add
upstage loader example in pdf section
**Dependencies:** langchain_community
**Twitter handle:** [@upstageai](https://twitter.com/upstageai)

- [x] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.


- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.
This commit is contained in:
junkeon
2024-05-01 09:15:49 +09:00
committed by GitHub
parent bef50ded63
commit 8d2909ee25
5 changed files with 173 additions and 134 deletions


@@ -475,4 +475,31 @@ loader = AzureAIDocumentIntelligenceLoader(
)
documents = loader.load()
```
## Using UpstageLayoutAnalysisLoader
The `UpstageLayoutAnalysisLoader` calls the [Upstage Layout Analysis API](https://developers.upstage.ai/docs/apis/layout-analyzer) to detect document elements, including tables and figures, in `JPEG`, `PNG`, `BMP`, `PDF`, `TIFF`, and `HEIC` files. The loader uses OCR to extract text and detect elements. For born-digital PDF documents, you can skip OCR and use the text embedded in the file by setting `use_ocr=False`, which is the default. Both single-page and multi-page documents are supported. With `use_ocr=True`, documents are limited to 100 pages and a file size of 50 MB; with `use_ocr=False` (PDF files only), these limits do not apply.
### Prerequisite
To access the Upstage Layout Analysis API, you need an API access token. See the [quick start guide](https://developers.upstage.ai/docs/getting-started/quick-start) to obtain one.
```bash
pip install langchain_upstage
```
### Example
```python
import os

from langchain_upstage import UpstageLayoutAnalysisLoader

# Set the API key before creating the loader.
os.environ["UPSTAGE_DOCUMENT_AI_API_KEY"] = "YOUR_API_KEY"

file_path = "/PATH/TO/FILE.pdf"
loader = UpstageLayoutAnalysisLoader(file_path)
# load() returns the parsed content as a list of Document objects.
data = loader.load()
```
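Because the limits described above apply when `use_ocr=True`, it can help to check a file locally before sending it to the API. The following is a minimal sketch using only the standard library, not part of `langchain_upstage`. The helper name `check_ocr_input`, the mapping from format names to file extensions, and the reading of "50 MB" as 50 * 1024 * 1024 bytes are all assumptions made for illustration. The 100-page limit is not checked here.

```python
from pathlib import Path
from typing import List

# Assumption: extensions corresponding to the formats listed in this section.
SUPPORTED_EXTENSIONS = {
    ".jpg", ".jpeg", ".png", ".bmp", ".pdf", ".tif", ".tiff", ".heic",
}
# Assumption: "50 MB" is interpreted as 50 * 1024 * 1024 bytes.
MAX_OCR_BYTES = 50 * 1024 * 1024


def check_ocr_input(file_path: str) -> List[str]:
    """Hypothetical pre-flight check; returns a list of problems (empty if none)."""
    path = Path(file_path)
    if not path.is_file():
        return [f"{file_path} does not exist or is not a file"]

    problems = []
    # Compare the extension case-insensitively against the assumed list.
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file extension: {path.suffix!r}")
    # Compare the on-disk size against the assumed byte limit.
    size = path.stat().st_size
    if size > MAX_OCR_BYTES:
        problems.append(f"file is {size} bytes, above the {MAX_OCR_BYTES}-byte limit")
    return problems
```

If the returned list is empty, the file passes these local checks and can be handed to `UpstageLayoutAnalysisLoader(file_path, use_ocr=True)`. The API may still reject the file for reasons this sketch does not cover, such as page count.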