upstage[minor]: Update few codes and add upstage loader in pdf section (#21085)

**Description:** Update UpstageLayoutAnalysisParser and Loader and add upstage loader example in pdf section

**Dependencies:** langchain_community

**Twitter handle:** [@upstageai](https://twitter.com/upstageai)

- [x] **Add tests and docs**: If you're adding a new integration, please include
  1. a test for the integration, preferably unit tests that do not rely on network access,
  2. an example notebook showing its use. It lives in `docs/docs/integrations` directory.
- [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in langchain.

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, hwchase17.
2025-09-05 04:55:14 +00:00 · 2024-05-01 09:15:49 +09:00
parent bef50ded63
commit 8d2909ee25
5 changed files with 173 additions and 134 deletions
--- a/docs/docs/modules/data_connection/document_loaders/pdf.mdx
+++ b/docs/docs/modules/data_connection/document_loaders/pdf.mdx
@@ -475,4 +475,31 @@ loader = AzureAIDocumentIntelligenceLoader(
 )

 documents = loader.load()
-```
+```
+
+## Using UpstageLayoutAnalysisLoader
+
+The UpstageLayoutAnalysisLoader calls the [Upstage Layout Analysis API](https://developers.upstage.ai/docs/apis/layout-analyzer) to detect document elements, including tables and figures, in a range of document formats. The loader uses OCR to extract text and detect elements in `JPEG`, `PNG`, `BMP`, `PDF`, `TIFF`, and `HEIC` files. For born-digital PDF documents, users can skip OCR and use the text embedded in the file by setting `use_ocr=False`, which is the default value. Both single-page and multi-page documents are supported. When `use_ocr=True`, documents are limited to 100 pages and a file size of 50 MB; when `use_ocr=False` (applicable to PDF files only), no such restrictions apply.
+
+### Prerequisite
+
+To access the Upstage Layout Analysis API, you need an API access token. See the [quick start guide](https://developers.upstage.ai/docs/getting-started/quick-start) to obtain a token and start using the API.
+
+```bash
+pip install langchain_upstage
+```
+
+### Example
+
+```python
+import os
+
+os.environ["UPSTAGE_DOCUMENT_AI_API_KEY"] = "YOUR_API_KEY"
+
+from langchain_upstage import UpstageLayoutAnalysisLoader
+
+file_path = "/PATH/TO/FILE.pdf"
+
+loader = UpstageLayoutAnalysisLoader(file_path)
+data = loader.load()
+```
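
The limits described in the added section (supported formats, and the 50 MB cap when `use_ocr=True`) can be checked locally before any API call is made. The following is a minimal sketch, not part of this commit or of `langchain_upstage`: the helper `check_ocr_input`, the constant names, the `.jpg`/`.tif` suffix aliases, and reading "50 MB" as 50 × 1024 × 1024 bytes are all assumptions made for illustration. It does not check the 100-page limit, which would require a PDF or image library.

```python
from pathlib import Path
from typing import List

# Values taken from the documentation text above; they apply when use_ocr=True.
# Assumption: "50 MB" is interpreted as 50 * 1024 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024

# Formats listed in the documentation; ".jpg" and ".tif" are assumed aliases.
SUPPORTED_SUFFIXES = {".jpeg", ".jpg", ".png", ".bmp", ".pdf", ".tiff", ".tif", ".heic"}


def check_ocr_input(file_path: str) -> List[str]:
    """Hypothetical helper: return a list of problems (empty if none found).

    Only the file suffix, existence, and size are checked. The 100-page
    limit is not checked here.
    """
    path = Path(file_path)
    problems: List[str] = []

    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported file type: {path.suffix or '(none)'}")

    if not path.is_file():
        problems.append("file does not exist")
    elif path.stat().st_size > MAX_FILE_SIZE_BYTES:
        problems.append("file exceeds the 50 MB limit")

    return problems
```

A caller could run `check_ocr_input(file_path)` and construct `UpstageLayoutAnalysisLoader(file_path)` only when the returned list is empty. With the default `use_ocr=False`, the documentation states these limits do not apply.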