langchain/docs
John d59c656ea5
unstructured, community, initialize langchain-unstructured package (#22779)
#### Update (2): 
A single `UnstructuredLoader` is added to handle both local and api
partitioning. This loader also handles single or multiple documents.

#### Changes in `community`:
Changes here do not affect users. In the initial process of using the
SDK for the API Loaders, the Loaders in community were refactored.
Other changes include:
- The `UnstructuredBaseLoader` has a new check for whether both
`mode="paged"` and `chunking_strategy="by_page"` are set. It also now
adds `Element.element_id` to the `Document.metadata`.
- `UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader` now both
directly inherit from `UnstructuredBaseLoader`, initialize their
`file_path`/`file` attributes respectively, and implement their own
`_post_process_elements` methods.
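
As a rough illustration of the two `UnstructuredBaseLoader` changes described above, here is a self-contained sketch. It does not use the real LangChain or Unstructured classes: `Element`, `Document`, `check_mode_and_strategy`, and `to_document` are hypothetical stand-ins. Emitting a warning is also an assumption, since the commit message does not say how the loader reacts when both options are set.

```python
import warnings
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element."""
    element_id: str
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def check_mode_and_strategy(mode: str, chunking_strategy: Optional[str]) -> bool:
    """Return True (and warn) when both paging options are set together.

    Warning here is an assumption; the real loader may behave differently.
    """
    both_set = mode == "paged" and chunking_strategy == "by_page"
    if both_set:
        warnings.warn('Both mode="paged" and chunking_strategy="by_page" are set.')
    return both_set


def to_document(element: Element) -> Document:
    """Copy the element metadata and add its element_id to it."""
    metadata = dict(element.metadata)
    metadata["element_id"] = element.element_id
    return Document(page_content=element.text, metadata=metadata)
```

Copying the metadata dict before adding `element_id` keeps the original element untouched, which is a design choice of this sketch rather than something stated in the commit.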

--------
#### Update:
New SDK Loaders in a [partner
package](https://python.langchain.com/v0.1/docs/contributing/integrations/#partner-package-in-langchain-repo)
are introduced to prevent breaking changes for users (see discussion
below).

##### TODO:
- [x] Test docstring examples
--------
- **Description:** `UnstructuredAPIFileIOLoader` and
`UnstructuredAPIFileLoader` calls to the Unstructured API are now made
using the `unstructured-client` SDK.
- **New Dependencies:** `unstructured-client`

- [x] **Add tests and docs**: If you're adding a new integration, please
include
- [x] a test for the integration, preferably unit tests that do not rely
on network access,
- [x] update the description in
`docs/docs/integrations/providers/unstructured.mdx`
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

TODO:
- [x] Update
https://python.langchain.com/v0.1/docs/integrations/document_loaders/unstructured_file/#unstructured-api
  - `langchain/docs/docs/integrations/document_loaders/unstructured_file.ipynb`
  - The description here needs to indicate that users should install
    `unstructured-client` instead of `unstructured`. Read over closely to
    look for any other changes that need to be made.
- [x] Update the `lazy_load` method in `UnstructuredBaseLoader` to
handle JSON responses from the API instead of just lists of elements.
  - This method may need to be overridden by the API loaders instead of
    changing it in the `UnstructuredBaseLoader`.
- [x] Update the documentation links in the class docstrings (the
Unstructured documentation has moved)
- [x] Update Document.metadata to include `element_id` (see thread
[here](https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1718187499818419))
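
The `lazy_load` item in the TODO list above concerns consuming a JSON response rather than a list of element objects. A minimal sketch of that idea follows, assuming the response body is a JSON array of objects with `type`, `element_id`, `text`, and `metadata` keys. The function name `iter_documents_from_json` and the `(text, metadata)` tuple output are hypothetical and are not the loader's actual interface.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple


def iter_documents_from_json(payload: str) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Lazily yield (text, metadata) pairs from a JSON array of element dicts.

    The key names used here are assumptions about the response shape.
    """
    elements: List[Dict[str, Any]] = json.loads(payload)
    for element in elements:
        # Copy the nested metadata so the parsed payload is not mutated.
        metadata = dict(element.get("metadata", {}))
        metadata["category"] = element.get("type")
        metadata["element_id"] = element.get("element_id")
        yield element.get("text", ""), metadata
```

Because the function is a generator, callers can stop early without building every pair, which mirrors the lazy behavior the method name implies.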

---------

Signed-off-by: ChengZi <chen.zhang@zilliz.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Isaac Francisco <78627776+isahers1@users.noreply.github.com>
Co-authored-by: ChengZi <chen.zhang@zilliz.com>
2024-07-24 23:21:20 +00:00
| Name | Last commit | Date |
| --- | --- | --- |
| api_reference | docs: readthedocs deprecation fix (#24321) | 2024-07-16 20:32:51 +00:00 |
| data | 👥 Update LangChain people data (#23697) | 2024-07-01 17:42:55 +00:00 |
| docs | unstructured, community, initialize langchain-unstructured package (#22779) | 2024-07-24 23:21:20 +00:00 |
| scripts | docs: add tables for search and code interpreter tools (#24586) | 2024-07-24 10:51:39 -07:00 |
| src | docs: update ChatModelTabs defaults (#24583) | 2024-07-23 21:56:30 +00:00 |
| static | docs[patch]: Update intro diagram (#24290) | 2024-07-15 22:04:42 -07:00 |
| .gitignore | infra: cleanup docs build (#21134) | 2024-05-01 17:34:05 -07:00 |
| .yarnrc.yml | docs[minor]: Add thumbs up/down to all docs pages (#18526) | 2024-03-04 15:14:28 -08:00 |
| babel.config.js | Restructure docs (#11620) | 2023-10-10 12:55:19 -07:00 |
| docusaurus.config.js | docs: rm discord (#23985) | 2024-07-08 14:27:58 -07:00 |
| ignore-step.sh | infra: docs ignore step in script (#24090) | 2024-07-10 15:18:00 -07:00 |
| Makefile | docs: add tables for search and code interpreter tools (#24586) | 2024-07-24 10:51:39 -07:00 |
| package.json | docs[patch]: Adds feedback input after thumbs up/down (#23141) | 2024-06-18 16:08:22 -07:00 |
| README.md | docs: developer docs (#14776) | 2023-12-17 12:55:49 -08:00 |
| sidebars.js | docs: add tables for search and code interpreter tools (#24586) | 2024-07-24 10:51:39 -07:00 |
| vercel_build.sh | infra: use nbconvert for docs build (#21135) | 2024-05-07 12:30:17 -07:00 |
| vercel_requirements.txt | package: security update urllib3 to @1.26.19 (#23366) | 2024-06-24 19:44:39 +00:00 |
| vercel.json | docs[patch]: Remove very old document comparison notebook (#24587) | 2024-07-23 22:25:35 -07:00 |
| yarn.lock | docs[patch]: Adds feedback input after thumbs up/down (#23141) | 2024-06-18 16:08:22 -07:00 |

LangChain Documentation

For more information on contributing to our documentation, see the Documentation Contributing Guide.