mirror of
https://github.com/hwchase17/langchain.git
synced 2025-08-28 14:05:02 +00:00
#### Update (2): A single `UnstructuredLoader` is added to handle both local and api partitioning. This loader also handles single or multiple documents. #### Changes in `community`: Changes here do not affect users. In the initial process of using the SDK for the API Loaders, the Loaders in community were refactored. Other changes include: The `UnstructuredBaseLoader` has a new check to see if both `mode="paged"` and `chunking_strategy="by_page"`. It also now has `Element.element_id` added to the `Document.metadata`. `UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader`. As such, now both directly inherit from `UnstructuredBaseLoader` and initialize their `file_path`/`file` attributes respectively and implement their own `_post_process_elements` methods. -------- #### Update: New SDK Loaders in a [partner package](https://python.langchain.com/v0.1/docs/contributing/integrations/#partner-package-in-langchain-repo) are introduced to prevent breaking changes for users (see discussion below). ##### TODO: - [x] Test docstring examples -------- - **Description:** UnstructuredAPIFileIOLoader and UnstructuredAPIFileLoader calls to the unstructured api are now made using the unstructured-client sdk. - **New Dependencies:** unstructured-client - [x] **Add tests and docs**: If you're adding a new integration, please include - [x] a test for the integration, preferably unit tests that do not rely on network access, - [x] update the description in `docs/docs/integrations/providers/unstructured.mdx` - [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/ Additional guidelines: - Make sure optional dependencies are imported within a function. - Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests. - Most PRs should not touch more than one package. - Changes should be backwards compatible. - If you are adding something to community, do not re-import it in langchain. TODO: - [x] Update https://python.langchain.com/v0.1/docs/integrations/document_loaders/unstructured_file/#unstructured-api - `langchain/docs/docs/integrations/document_loaders/unstructured_file.ipynb` - The description here needs to indicate that users should install `unstructured-client` instead of `unstructured`. Read over closely to look for any other changes that need to be made. - [x] Update the `lazy_load` method in `UnstructuredBaseLoader` to handle json responses from the API instead of just lists of elements. - This method may need to be overwritten by the API loaders instead of changing it in the `UnstructuredBaseLoader`. - [x] Update the documentation links in the class docstrings (the Unstructured documents have moved) - [x] Update Document.metadata to include `element_id` (see thread [here](https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1718187499818419)) --------- Signed-off-by: ChengZi <chen.zhang@zilliz.com> Co-authored-by: Erick Friis <erick@langchain.dev> Co-authored-by: Isaac Francisco <78627776+isahers1@users.noreply.github.com> Co-authored-by: ChengZi <chen.zhang@zilliz.com>
67 lines
2.1 KiB
Makefile
67 lines
2.1 KiB
Makefile
.PHONY: all format lint test tests integration_tests docker_tests help extended_tests
|
|
|
|
# Default target executed when no arguments are given to make.
|
|
all: help
|
|
|
|
# Define a variable for the test file path.
|
|
TEST_FILE ?= tests/unit_tests/
|
|
integration_test integration_tests: TEST_FILE = tests/integration_tests/
|
|
|
|
|
|
# unit tests are run with the --disable-socket flag to prevent network calls
|
|
test tests:
|
|
poetry run pytest --disable-socket --allow-unix-socket $(TEST_FILE)
|
|
|
|
# integration tests are run without the --disable-socket flag to allow network calls
|
|
integration_test:
|
|
poetry run pytest $(TEST_FILE)
|
|
|
|
# skip tests marked as local in CI
|
|
integration_tests:
|
|
poetry run pytest $(TEST_FILE) -m "not local"
|
|
|
|
######################
|
|
# LINTING AND FORMATTING
|
|
######################
|
|
|
|
# Define a variable for Python and notebook files.
|
|
PYTHON_FILES=.
|
|
MYPY_CACHE=.mypy_cache
|
|
lint format: PYTHON_FILES=.
|
|
lint_diff format_diff: PYTHON_FILES=$(shell git diff --relative=libs/partners/unstructured --name-only --diff-filter=d master | grep -E '\.py$$|\.ipynb$$')
|
|
lint_package: PYTHON_FILES=langchain_unstructured
|
|
lint_tests: PYTHON_FILES=tests
|
|
lint_tests: MYPY_CACHE=.mypy_cache_test
|
|
|
|
lint lint_diff lint_package lint_tests:
|
|
poetry run ruff .
|
|
poetry run ruff format $(PYTHON_FILES) --diff
|
|
poetry run ruff --select I $(PYTHON_FILES)
|
|
mkdir -p $(MYPY_CACHE); poetry run mypy $(PYTHON_FILES) --cache-dir $(MYPY_CACHE)
|
|
|
|
format format_diff:
|
|
poetry run ruff format $(PYTHON_FILES)
|
|
poetry run ruff --select I --fix $(PYTHON_FILES)
|
|
|
|
spell_check:
|
|
poetry run codespell --toml pyproject.toml
|
|
|
|
spell_fix:
|
|
poetry run codespell --toml pyproject.toml -w
|
|
|
|
check_imports: $(shell find langchain_unstructured -name '*.py')
|
|
poetry run python ./scripts/check_imports.py $^
|
|
|
|
######################
|
|
# HELP
|
|
######################
|
|
|
|
help:
|
|
@echo '----'
|
|
@echo 'check_imports - check imports'
|
|
@echo 'format - run code formatters'
|
|
@echo 'lint - run linters'
|
|
@echo 'test - run unit tests'
|
|
@echo 'tests - run unit tests'
|
|
@echo 'test TEST_FILE=<test_file> - run all tests in file'
|