mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-19 00:58:32 +00:00
unstructured, community, initialize langchain-unstructured package (#22779)
#### Update (2):

A single `UnstructuredLoader` is added to handle both local and API partitioning. This loader also handles single or multiple documents.

#### Changes in `community`:

Changes here do not affect users. In the initial process of using the SDK for the API loaders, the loaders in community were refactored. Other changes include:

- `UnstructuredBaseLoader` has a new check to see if both `mode="paged"` and `chunking_strategy="by_page"` are set. It also now adds `Element.element_id` to `Document.metadata`.
- `UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader` now both directly inherit from `UnstructuredBaseLoader`, initialize their `file_path`/`file` attributes respectively, and implement their own `_post_process_elements` methods.

--------

#### Update:

New SDK loaders in a [partner package](https://python.langchain.com/v0.1/docs/contributing/integrations/#partner-package-in-langchain-repo) are introduced to prevent breaking changes for users (see discussion below).

##### TODO:

- [x] Test docstring examples

--------

- **Description:** Calls from `UnstructuredAPIFileIOLoader` and `UnstructuredAPIFileLoader` to the Unstructured API are now made using the `unstructured-client` SDK.
- **New Dependencies:** unstructured-client

- [x] **Add tests and docs**: If you're adding a new integration, please include
  - [x] a test for the integration, preferably unit tests that do not rely on network access,
  - [x] update the description in `docs/docs/integrations/providers/unstructured.mdx`
- [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:

- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in langchain.

TODO:

- [x] Update https://python.langchain.com/v0.1/docs/integrations/document_loaders/unstructured_file/#unstructured-api
  - `langchain/docs/docs/integrations/document_loaders/unstructured_file.ipynb`
  - The description here needs to indicate that users should install `unstructured-client` instead of `unstructured`. Read over closely to look for any other changes that need to be made.
- [x] Update the `lazy_load` method in `UnstructuredBaseLoader` to handle JSON responses from the API instead of just lists of elements.
  - This method may need to be overridden by the API loaders instead of changing it in the `UnstructuredBaseLoader`.
- [x] Update the documentation links in the class docstrings (the Unstructured documents have moved)
- [x] Update Document.metadata to include `element_id` (see thread [here](https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1718187499818419))

---------

Signed-off-by: ChengZi <chen.zhang@zilliz.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Isaac Francisco <78627776+isahers1@users.noreply.github.com>
Co-authored-by: ChengZi <chen.zhang@zilliz.com>
71
libs/partners/unstructured/README.md
Normal file
@@ -0,0 +1,71 @@
# langchain-unstructured

This package contains the LangChain integration with Unstructured.

## Installation

```bash
pip install -U langchain-unstructured
```

Then configure credentials by setting the following environment variable:

```bash
export UNSTRUCTURED_API_KEY="your-api-key"
```
## Loaders

Partition and load files either through the Unstructured API, using the
`unstructured-client` SDK, or locally, using the `unstructured` library.

API:
To partition via the Unstructured API, `pip install unstructured-client`, set
`partition_via_api=True`, and define `api_key`. If you are running the Unstructured API
locally, you can change the API URL by defining `url` when you initialize the
loader. The hosted Unstructured API requires an API key. See the links below to
learn more about our API offerings and get an API key.

Local:
By default, the file loader uses the Unstructured `partition` function and will
automatically detect the file type.

In addition to document-specific partition parameters, Unstructured has a rich set
of "chunking" parameters for post-processing elements into more useful text segments
for use cases such as Retrieval Augmented Generation (RAG). You can pass additional
Unstructured kwargs to the loader to configure different Unstructured settings.

Setup:
```bash
pip install -U langchain-unstructured
pip install -U unstructured-client
export UNSTRUCTURED_API_KEY="your-api-key"
```
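The setup step exports `UNSTRUCTURED_API_KEY` in the shell. As a minimal sketch, the key can be read from Python and checked early, so a missing variable fails with a clear message rather than at request time. `require_api_key` is a hypothetical helper written for this example and is not part of `langchain-unstructured`.

```python
import os


def require_api_key(env_var: str = "UNSTRUCTURED_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise a clear error.

    Hypothetical helper for illustration only.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before using the hosted API."
        )
    return key


# Typical use (only needed when partitioning via the hosted API):
#     api_key = require_api_key()
```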
Instantiate:

```python
import os

from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(
    file_path=["example.pdf", "fake.pdf"],
    # Read the key exported in the setup step.
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    partition_via_api=True,
    chunking_strategy="by_title",
    strategy="fast",
)
```
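The loader options described in this section (`partition_via_api`, `api_key`, `url`, and extra Unstructured kwargs such as `chunking_strategy`) can be assembled in one place. The sketch below is one possible way to do that, not the package's own API: `build_loader_kwargs` and `make_loader` are hypothetical helpers, the `http://localhost:8000` address is a placeholder for a locally running API, and `max_characters` is assumed to be forwarded to Unstructured's chunking step like any other extra kwarg.

```python
import os
from typing import Any, Dict, List, Optional, Union


def build_loader_kwargs(
    file_path: Union[str, List[str]],
    *,
    via_api: bool = False,
    url: Optional[str] = None,
    **unstructured_kwargs: Any,
) -> Dict[str, Any]:
    """Assemble keyword arguments for UnstructuredLoader (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"file_path": file_path, "partition_via_api": via_api}
    if via_api:
        # The hosted API requires a key; here it is read from the environment.
        kwargs["api_key"] = os.getenv("UNSTRUCTURED_API_KEY")
        if url is not None:
            # Only needed when pointing at a non-default (e.g. local) API.
            kwargs["url"] = url
    # Extra partitioning/chunking options are passed through unchanged.
    kwargs.update(unstructured_kwargs)
    return kwargs


def make_loader(**kwargs: Any):
    """Create the loader; the import is lazy so this sketch runs without the package."""
    from langchain_unstructured import UnstructuredLoader

    return UnstructuredLoader(**kwargs)


# Local partitioning with chunking options (assumed to be forwarded as kwargs).
local_kwargs = build_loader_kwargs(
    "example.pdf", chunking_strategy="by_title", max_characters=1000
)

# API partitioning against a placeholder local server address.
api_kwargs = build_loader_kwargs(
    ["example.pdf", "fake.pdf"],
    via_api=True,
    url="http://localhost:8000",
    strategy="fast",
)

# loader = make_loader(**local_kwargs)  # requires langchain-unstructured
```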
Load:

```python
docs = loader.load()

print(docs[0].page_content[:100])
print(docs[0].metadata)
```
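When several files are loaded at once, it can help to summarize the returned documents by their metadata. The commit message notes that `element_id` is added to `Document.metadata`; the `source` key used below is an assumption for illustration. The sketch uses simple stand-in objects instead of real loader output so that it runs without any Unstructured dependency, and `summarize` is a hypothetical helper.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(docs):
    """Return (documents per source file, list of element ids)."""
    per_source = Counter(doc.metadata.get("source", "<unknown>") for doc in docs)
    element_ids = [doc.metadata.get("element_id") for doc in docs]
    return per_source, element_ids


# Stand-ins for the objects returned by loader.load(); the metadata keys
# "source" and "element_id" are assumed here for illustration.
sample_docs = [
    SimpleNamespace(
        page_content="Intro",
        metadata={"source": "example.pdf", "element_id": "a1"},
    ),
    SimpleNamespace(
        page_content="Body",
        metadata={"source": "example.pdf", "element_id": "a2"},
    ),
    SimpleNamespace(
        page_content="Other",
        metadata={"source": "fake.pdf", "element_id": "b1"},
    ),
]

per_source, element_ids = summarize(sample_docs)
print(dict(per_source))
print(element_ids)
```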
References
----------
https://docs.unstructured.io/api-reference/api-services/sdk
https://docs.unstructured.io/api-reference/api-services/overview
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking