langchain/libs/partners/unstructured/README.md
John d59c656ea5
unstructured, community, initialize langchain-unstructured package (#22779)
#### Update (2): 
A single `UnstructuredLoader` is added to handle both local and api
partitioning. This loader also handles single or multiple documents.

#### Changes in `community`:
Changes here do not affect users. In the initial process of using the
SDK for the API Loaders, the Loaders in community were refactored.
Other changes include:
The `UnstructuredBaseLoader` has a new check to see if both
`mode="paged"` and `chunking_strategy="by_page"`. It also now has
`Element.element_id` added to the `Document.metadata`.
`UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader`. As such,
now both directly inherit from `UnstructuredBaseLoader` and initialize
their `file_path`/`file` attributes respectively and implement their own
`_post_process_elements` methods.

--------
#### Update:
New SDK Loaders in a [partner
package](https://python.langchain.com/v0.1/docs/contributing/integrations/#partner-package-in-langchain-repo)
are introduced to prevent breaking changes for users (see discussion
below).

##### TODO:
- [x] Test docstring examples
--------
- **Description:** UnstructuredAPIFileIOLoader and
UnstructuredAPIFileLoader calls to the unstructured api are now made
using the unstructured-client sdk.
- **New Dependencies:** unstructured-client

- [x] **Add tests and docs**: If you're adding a new integration, please
include
- [x] a test for the integration, preferably unit tests that do not rely
on network access,
- [x] update the description in
`docs/docs/integrations/providers/unstructured.mdx`
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

TODO:
- [x] Update
https://python.langchain.com/v0.1/docs/integrations/document_loaders/unstructured_file/#unstructured-api
-
`langchain/docs/docs/integrations/document_loaders/unstructured_file.ipynb`
- The description here needs to indicate that users should install
`unstructured-client` instead of `unstructured`. Read over closely to
look for any other changes that need to be made.
- [x] Update the `lazy_load` method in `UnstructuredBaseLoader` to
handle json responses from the API instead of just lists of elements.
- This method may need to be overwritten by the API loaders instead of
changing it in the `UnstructuredBaseLoader`.
- [x] Update the documentation links in the class docstrings (the
Unstructured documents have moved)
- [x] Update Document.metadata to include `element_id` (see thread
[here](https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1718187499818419))

---------

Signed-off-by: ChengZi <chen.zhang@zilliz.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Isaac Francisco <78627776+isahers1@users.noreply.github.com>
Co-authored-by: ChengZi <chen.zhang@zilliz.com>
2024-07-24 23:21:20 +00:00

2.1 KiB

langchain-unstructured

This package contains the LangChain integration with Unstructured

Installation

pip install -U langchain-unstructured

And you should configure credentials by setting the following environment variables:

export UNSTRUCTURED_API_KEY="your-api-key"

Loaders

Partition and load files using either the unstructured-client sdk and the Unstructured API or locally using the unstructured library.

API: To partition via the Unstructured API pip install unstructured-client and set partition_via_api=True and define api_key. If you are running the unstructured API locally, you can change the API rule by defining url when you initialize the loader. The hosted Unstructured API requires an API key. See the links below to learn more about our API offerings and get an API key.

Local: By default the file loader uses the Unstructured partition function and will automatically detect the file type.

In addition to document specific partition parameters, Unstructured has a rich set of "chunking" parameters for post-processing elements into more useful text segments for uses cases such as Retrieval Augmented Generation (RAG). You can pass additional Unstructured kwargs to the loader to configure different unstructured settings.

Setup:

    pip install -U langchain-unstructured
    pip install -U unstructured-client
    export UNSTRUCTURED_API_KEY="your-api-key"

Instantiate:

from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(
    file_path = ["example.pdf", "fake.pdf"],
    api_key=UNSTRUCTURED_API_KEY,
    partition_via_api=True,
    chunking_strategy="by_title",
    strategy="fast",
)

Load:

docs = loader.load()

print(docs[0].page_content[:100])
print(docs[0].metadata)

References

https://docs.unstructured.io/api-reference/api-services/sdk https://docs.unstructured.io/api-reference/api-services/overview https://docs.unstructured.io/open-source/core-functionality/partitioning https://docs.unstructured.io/open-source/core-functionality/chunking