langchain

mirror of https://github.com/hwchase17/langchain.git synced 2026-02-21 22:56:05 +00:00

Author	SHA1	Message	Date
Mason Daugherty	ae16392ada	release(text-splitters): 1.0.0a1 (#33214 )	2025-10-02 13:56:10 -04:00
Mason Daugherty	5e8cb58e6a	refactor(text-splitters): drop python 3.9 (#33212 )	2025-10-02 13:51:10 -04:00
Mason Daugherty	eaa6dcce9e	release: v1.0.0 (#32567 ) Co-authored-by: Mohammad Mohtashim <45242107+keenborder786@users.noreply.github.com> Co-authored-by: Caspar Broekhuizen <caspar@langchain.dev> Co-authored-by: ccurme <chester.curme@gmail.com> Co-authored-by: Christophe Bornet <cbornet@hotmail.com> Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com> Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com> Co-authored-by: Vadym Barda <vadim.barda@gmail.com>	2025-10-02 10:49:42 -04:00
Mason Daugherty	986302322f	docs: more standardization (#33124 )	2025-09-25 20:46:20 -04:00
Christophe Bornet	eaf8dce7c2	chore: bump ruff version to 0.13 (#33043 ) Co-authored-by: Mason Daugherty <mason@langchain.dev>	2025-09-25 12:27:39 -04:00
Mason Daugherty	e3efd1e891	test(text-splitters): capture beta warnings (#33113 )	2025-09-25 01:30:20 -04:00
Mason Daugherty	d6769cf032	test(text-splitters): resolve pytest marker warning (#33112 ) #33111	2025-09-25 01:29:42 -04:00
Mason Daugherty	781db9d892	chore: update `pyproject.toml` files, remove codespell (#33028 ) - Removes Codespell from deps, docs, and `Makefile`s - Python version requirements in all `pyproject.toml` files now use the `~=` (compatible release) specifier - All dependency groups and main dependencies now use explicit lower and upper bounds, reducing potential for breaking changes	2025-09-20 22:09:33 -04:00
Christophe Bornet	cbaf97ada4	chore: bump mypy version to 1.18 (#32914 )	2025-09-12 09:19:23 -04:00
Hyunjoon Jeong	9cc85387d1	fix(text-splitters): add validation to prevent infinite loop and prevent empty token splitter (#32205 ) ### Description 1) Add validation to prevent infinite loop condition when ```tokenizer.tokens_per_chunk > tokenizer.chunk_overlap``` 2) Avoid empty decoded chunk when splitter appends tokens --------- Co-authored-by: Eugene Yurtsev <eugene@langchain.dev> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>	2025-09-11 16:55:32 -04:00
Mason Daugherty	7a158c7f1c	revert: "chore: remove ruff target-version" (#32895 ) Reverts langchain-ai/langchain#32880 Not needed at the moment, will do when finishing v1	2025-09-10 20:56:48 -04:00
Christophe Bornet	b274416441	chore: remove ruff target-version (#32880 ) This is not needed anymore since `requires-python` was added when moving to `uv`.	2025-09-10 11:12:30 -04:00
Mason Daugherty	c124e67325	chore(docs): update package `README`s (#32869 ) - Fix badges - Focus on agents - Cut down fluff	2025-09-09 14:50:32 +00:00
Christophe Bornet	8b90eae455	chore(text-splitters): enable ruff docstring-code-format (#32854 )	2025-09-08 16:40:11 -04:00
Christophe Bornet	0c3e8ccd0e	chore(text-splitters): select ALL rules with exclusions (#32325 ) Co-authored-by: Mason Daugherty <mason@langchain.dev>	2025-09-08 14:46:09 +00:00
Mason Daugherty	6b5fdfb804	release(text-splitters): 0.3.11 (#32770 ) Fixes #32747 SpaCy integration test fixture was trying to use pip to download the SpaCy language model (`en_core_web_sm`), but uv-managed environments don't include pip by default. Fail test if not installed as opposed to downloading.	2025-08-31 23:00:05 +00:00
Christophe Bornet	e0a4af8d8b	docs(text-splitters): fix some docstrings (#32767 )	2025-08-31 13:46:11 -05:00
Sydney Runkle	b26e52aa4d	chore(text-splitters): bump version of core (#32740 )	2025-08-28 13:14:57 -04:00
Sydney Runkle	38cdd7a2ec	chore(text-splitters): relax max bound for langchain-core (#32739 )	2025-08-28 13:05:47 -04:00
Mason Daugherty	3d08b6bd11	chore: address pytest-asyncio deprecation warnings + other nits (#32696 ) amongst some linting incompatible rules	2025-08-26 15:51:38 -04:00
Maitrey Talware	622337a297	docs(docs): fixed typos in documentations (#32661 ) Minor typo fixes. (Not linked to current open issues)	2025-08-25 10:02:53 -04:00
Christophe Bornet	73a7de63aa	chore(text-splitters): add mypy pydantic plugin (#32611 )	2025-08-19 16:58:12 -04:00
Keyu Chen	03138f41a0	feat(text-splitters): add optional custom header pattern support (#31887 ) ## Description This PR adds support for custom header patterns in `MarkdownHeaderTextSplitter`, allowing users to define non-standard Markdown header formats (like `**Header**`) and specify their hierarchy levels. Issue: Fixes #22738 Dependencies: None - this change has no new dependencies Key Changes: - Added optional `custom_header_patterns` parameter to support non-standard header formats - Enable splitting on patterns like `**Header**` and `***Header***` - Maintain full backward compatibility with existing usage - Added comprehensive tests for custom and mixed header scenarios ## Example Usage ```python from langchain_text_splitters import MarkdownHeaderTextSplitter headers_to_split_on = [ ("**", "Chapter"), ("***", "Section"), ] custom_header_patterns = { "**": 1, # Level 1 headers "***": 2, # Level 2 headers } splitter = MarkdownHeaderTextSplitter( headers_to_split_on=headers_to_split_on, custom_header_patterns=custom_header_patterns, ) # Now **Chapter 1** is treated as a level 1 header # And ***Section 1.1*** is treated as a level 2 header ``` ## Testing - ✅ Added unit tests for custom header patterns - ✅ Added tests for mixed standard and custom headers - ✅ All existing tests pass (backward compatibility maintained) - ✅ Linting and formatting checks pass --- The implementation provides a flexible solution while maintaining the simplicity of the existing API. Users can continue using the splitter exactly as before, with the new functionality being entirely opt-in through the `custom_header_patterns` parameter. --------- Co-authored-by: Mason Daugherty <mason@langchain.dev> Co-authored-by: Claude <noreply@anthropic.com>	2025-08-18 10:10:49 -04:00
Christophe Bornet	4656f727da	chore(text-splitters): add mypy `warn_unreachable` (#32558 )	2025-08-15 09:45:20 -04:00
Mason Daugherty	397cd89988	docs: update outdated `README.md` content (#32540 )	2025-08-13 22:19:38 +00:00
Christophe Bornet	8b663ed6c6	chore(text-splitters): bump mypy version to 1.17 (#32387 ) Co-authored-by: Mason Daugherty <mason@langchain.dev>	2025-08-11 22:24:49 +00:00
Mason Daugherty	457ce9c4b0	feat(text-splitters): ruff fixes and rules (#32502 )	2025-08-11 13:28:22 -04:00
Mason Daugherty	c31236264e	chore: formatting across codebase (#32466 )	2025-08-08 10:20:10 -04:00
Mason Daugherty	96cbd90cba	fix: formatting issues in docstrings (#32265 ) Ensures proper reStructuredText formatting by adding the required blank line before closing docstring quotes, which resolves the "Block quote ends without a blank line; unexpected unindent" warning.	2025-07-27 23:37:47 -04:00
Mason Daugherty	f624ad489a	feat(docs): improve devx, fix `Makefile` targets (#32237 ) TL;DR much of the provided `Makefile` targets were broken, and any time I wanted to preview changes locally I either had to refer to a command Chester gave me or try waiting on a Vercel preview deployment. With this PR, everything should behave like normal. Significant updates to the `Makefile` and documentation files, focusing on improving usability, adding clear messaging, and fixing/enhancing documentation workflows. ### Updates to `Makefile`: #### Enhanced build and cleaning processes: - Added informative messages (e.g., "📚 Building LangChain documentation...") to makefile targets like `docs_build`, `docs_clean`, and `api_docs_build` for better user feedback during execution. - Introduced a `clean-cache` target to the `docs` `Makefile` to clear cached dependencies and ensure clean builds. #### Improved dependency handling: - Modified `install-py-deps` to create a `.venv/deps_installed` marker, preventing redundant/duplicate dependency installations and improving efficiency. #### Streamlined file generation and infrastructure setup: - Added caching for the LangServe README download and parallelized feature table generation - Added user-friendly completion messages for targets like `copy-infra` and `render`. #### Documentation server updates: - Enhanced the `start` target with messages indicating server start and URL for local documentation viewing. --- ### Documentation Improvements: #### Content clarity and consistency: - Standardized section titles for consistency across documentation files. - Refined phrasing and formatting in sections like "Dependency management" and "Formatting and linting" for better readability. #### Enhanced workflows: - Updated instructions for building and viewing documentation locally, including tips for specifying server ports and handling API reference previews. - Expanded guidance on cleaning documentation artifacts and using linting tools effectively. #### API reference documentation: - Improved instructions for generating and formatting in-code documentation, highlighting best practices for docstring writing. --- ### Minor Changes: - Added support for a new package name (`langchain_v1`) in the API documentation generation script. - Fixed minor capitalization and formatting issues in documentation files. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-07-25 14:49:03 -04:00
Mason Daugherty	77c981999e	fix(text-splitters): update langchain-core version to 0.3.72	2025-07-24 10:35:07 -04:00
Mason Daugherty	7f015b6f14	fix(text-splitters): update lock for release	2025-07-24 10:32:04 -04:00
tanwirahmad	622bb05751	fix(langchain): class HTMLSemanticPreservingSplitter ignores the text inside the div tag (#32213 ) Description: We collect the text from the "html", "body", "div", and "main" nodes, if they have any. Issue: Fixes #32206.	2025-07-24 10:09:03 -04:00
Fabio Fontana	fd168e1c11	feat(text-splitters): add Visual Basic 6 support (#31173 ) ### Description Add Visual Basic 6 support. --- ### Issue No specific issue addressed. --- ### Dependencies No additional dependencies required. --------- Co-authored-by: Mason Daugherty <mason@langchain.dev>	2025-07-14 13:51:16 +00:00
Christophe Bornet	060fc0e3c9	text-splitters: Add ruff rules FBT (#31935 ) See [flake8-boolean-trap (FBT)](https://docs.astral.sh/ruff/rules/#flake8-boolean-trap-fbt)	2025-07-09 18:36:58 -04:00
Michael Li	5b3e29f809	text splitters: add chunk_size and chunk_overlap validations (#31916 )	2025-07-08 12:22:33 -04:00
Mason Daugherty	706a66eccd	fix: automatically fix issues with ruff (#31897 ) * Perform safe automatic fixes instead of only selecting [isort](https://docs.astral.sh/ruff/rules/#isort-i)	2025-07-07 14:13:10 -04:00
Christophe Bornet	451c90fefa	text-splitters: Ruff autofixes (#31858 ) Auto-fixes from ruff with rule `ALL`	2025-07-07 10:06:08 -04:00
Christophe Bornet	a15c3e0856	text-splitters: Bump ruff version to 0.12 (#31866 )	2025-07-05 17:13:08 -04:00
Christophe Bornet	ee3709535d	text-splitters: bump spacy version to 3.8.7 (#31834 ) This allows to use spacy with Python 3.13	2025-07-03 10:13:25 -04:00
Christophe Bornet	802d2bf249	text-splitters: Add ruff rule UP (pyupgrade) (#31841 ) See https://docs.astral.sh/ruff/rules/#pyupgrade-up All auto-fixed except `typing.AbstractSet` -> `collections.abc.Set`	2025-07-03 10:11:35 -04:00
Eugene Yurtsev	10ec5c8f02	text-splitters: 0.3.9 (#31844 ) Release langchain-text-splitters 0.3.9	2025-07-03 10:02:35 -04:00
Cole Murray	43eef43550	security: Remove xslt_path and harden XML parsers in HTMLSectionSplitter: package: langchain-text-splitters (#31819 ) ## Summary - Removes the `xslt_path` parameter from HTMLSectionSplitter to eliminate XXE attack vector - Hardens XML/HTML parsers with secure configurations to prevent XXE attacks - Adds comprehensive security tests to ensure the vulnerability is fixed ## Context This PR addresses a critical XXE vulnerability discovered in the HTMLSectionSplitter component. The vulnerability allowed attackers to: - Read sensitive local files (SSH keys, passwords, configuration files) - Perform Server-Side Request Forgery (SSRF) attacks - Exfiltrate data to attacker-controlled servers ## Changes Made 1. Removed `xslt_path` parameter - This eliminates the primary attack vector where users could supply malicious XSLT files 2. Hardened XML parsers - Added security configurations to prevent XXE attacks even with the default XSLT: - `no_network=True` - Blocks network access - `resolve_entities=False` - Prevents entity expansion - `load_dtd=False` - Disables DTD processing - `XSLTAccessControl.DENY_ALL` - Blocks all file/network I/O in XSLT transformations 3. Added security tests - New test file `test_html_security.py` with comprehensive tests for various XXE attack vectors 4. Updated existing tests - Modified tests that were using the removed `xslt_path` parameter ## Test Plan - [x] All existing tests pass - [x] New security tests verify XXE attacks are blocked - [x] Code passes linting and formatting checks - [x] Tested with both old and new versions of lxml Twitter handle: @_colemurray	2025-07-02 15:24:08 -04:00
Raghu Kapur	2c9859956a	text-splitters: fix stale header metadata in ExperimentalMarkdownSyntaxTextSplitter (#31622 ) Description: Previously, when transitioning from a deeper Markdown header (e.g., ###) to a shallower one (e.g., ##), the ExperimentalMarkdownSyntaxTextSplitter retained the deeper header in the metadata. This commit updates the `_resolve_header_stack` method to remove headers at the same or deeper levels before appending the current header. As a result, each chunk now reflects only the active header context. Fixes unexpected metadata leakage across sections in nested Markdown documents. Additionally, test cases have been updated to: - Validate correct header resolution and metadata assignment. - Cover edge cases with nested headers and horizontal rules. Issue: Fixes [#31596](https://github.com/langchain-ai/langchain/issues/31596) Dependencies: None Twitter handle: -> [_RaghuKapur](https://twitter.com/_RaghuKapur) LinkedIn: -> [https://www.linkedin.com/in/raghukapur/](https://www.linkedin.com/in/raghukapur/)	2025-06-20 15:52:17 -04:00
Xin Jin	e979cd106a	chore: Bump langsmith in splitter uv (#31626 ) `uv lock --upgrade-package langsmith ` Original issue: The lock file (uv.lock) was constraining langsmith>=0.1.125,<0.4, preventing LangSmith 0.4.1 installation. Even though the pyproject.toml wasn't restricting langchain core. Issue: https://langchain.slack.com/archives/C050X0VTN56/p1750107176007629	2025-06-16 16:58:46 -07:00
Tom-Trumper	532e6455e9	text-splitters: Add keep_separator arg to HTMLSemanticPreservingSplitter (#31588 ) ### Description Add keep_separator arg to HTMLSemanticPreservingSplitter and pass value to instance of RecursiveCharacterTextSplitter used under the hood. ### Issue Documents returned by `HTMLSemanticPreservingSplitter.split_text(text)` are defaulted to use separators at beginning of page_content. [See third and fourth document in example output from how-to guide](https://python.langchain.com/docs/how_to/split_html/#using-htmlsemanticpreservingsplitter): ``` [Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'), Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic'), Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='. Below is a list: First item Second item Third item with bold text and a link Subsection 1.1: Details This subsection provides additional details'), Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content=". Here's a table: Header 1 Header 2 Header 3 Row 1, Cell 1 Row 1, Cell 2 Row 1, Cell 3 Row 2, Cell 1 Row 2, Cell 2 Row 2, Cell 3"), Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video: ![image:example_image_link.mp4](example_image_link.mp4) ![video:example_video_link.mp4](example_video_link.mp4)'), Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: <code:html> <div> <p>This is a paragraph inside a div.</p> </div> </code>'), Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')] ``` ### Dependencies None @ttrumper3	2025-06-14 17:56:14 -04:00
Christophe Bornet	eab8484a80	text-splitters[patch]: fix some import-untyped errors (#31030 )	2025-05-15 11:34:22 -04:00
Sumin Shin	683da2c9e9	text-splitters: Fix regex separator merge bug in CharacterTextSplitter (#31137 ) Description: Fix the merge logic in `CharacterTextSplitter.split_text` so that when using a regex lookahead separator (`is_separator_regex=True`) with `keep_separator=False`, the raw pattern is not re-inserted between chunks. Issue: Fixes #31136 Dependencies: None Twitter handle: None Since this is my first open-source PR, please feel free to point out any mistakes, and I'll be eager to make corrections.	2025-05-10 15:42:03 -04:00
Sydney Runkle	7e926520d5	packaging: remove Python upper bound for langchain and co libs (#31025 ) Follow up to https://github.com/langchain-ai/langsmith-sdk/pull/1696, I've bumped the `langsmith` version where applicable in `uv.lock`. Type checking problems here because deps have been updated in `pyproject.toml` and `uv lock` hasn't been run - we should enforce that in the future - goes with the other dependabot todos :).	2025-04-28 14:44:28 -04:00
Christophe Bornet	8c5ae108dd	text-splitters: Set strict mypy rules (#30900 ) * Add strict mypy rules * Fix mypy violations * Add error codes to all type ignores * Add ruff rule PGH003 * Bump mypy version to 1.15	2025-04-22 20:41:24 -07:00
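
The stale-header fix in 2c9859956a (#31622) comes down to a stack discipline: before the current header is appended, every header at the same or a deeper level is removed. The following is a minimal sketch of that idea, not the library's actual `_resolve_header_stack` code; the function names and the `(depth, title)` tuple layout are assumptions made for illustration.

```python
def resolve_header_stack(
    stack: list[tuple[int, str]], depth: int, title: str
) -> list[tuple[int, str]]:
    """Return a new stack with `title` pushed at `depth`.

    Hypothetical helper: entries are (depth, title) pairs, where depth is
    the number of leading '#' characters in the Markdown header.
    """
    # Drop headers at the same or a deeper level so they cannot leak
    # into the metadata of later, shallower sections.
    kept = [(d, t) for d, t in stack if d < depth]
    kept.append((depth, title))
    return kept


def header_metadata(stack: list[tuple[int, str]]) -> dict[str, str]:
    """Build chunk metadata from the currently active headers."""
    return {f"Header {d}": t for d, t in stack}


stack: list[tuple[int, str]] = []
for depth, title in [(1, "Guide"), (2, "Setup"), (3, "Linux"), (2, "Usage")]:
    stack = resolve_header_stack(stack, depth, title)

# After moving from "### Linux" back to "## Usage", only the active
# context remains: {'Header 1': 'Guide', 'Header 2': 'Usage'}
print(header_metadata(stack))
```

Returning a new list instead of mutating the stack in place keeps the helper easy to test; an in-place pop loop would behave the same way.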

118 Commits
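
For 9cc85387d1 (#32205), the underlying issue is that a sliding token window advances by `tokens_per_chunk - chunk_overlap` on each step, so it only makes progress when `tokens_per_chunk` is greater than `chunk_overlap`. Below is a rough sketch of that guard and of the empty-chunk check, using a toy whitespace tokenizer. `split_on_tokens` and its parameters are illustrative names, not the package's API.

```python
def split_on_tokens(text: str, tokens_per_chunk: int, chunk_overlap: int) -> list[str]:
    # Guard: a zero or negative stride would never advance the window.
    if tokens_per_chunk <= chunk_overlap:
        raise ValueError("tokens_per_chunk must be greater than chunk_overlap")

    tokens = text.split()  # toy tokenizer: whitespace-separated words
    stride = tokens_per_chunk - chunk_overlap
    chunks: list[str] = []
    start = 0
    while start < len(tokens):
        end = min(start + tokens_per_chunk, len(tokens))
        decoded = " ".join(tokens[start:end])
        if decoded:  # skip chunks that decode to an empty string
            chunks.append(decoded)
        if end == len(tokens):
            break  # the last window reached the end of the input
        start += stride
    return chunks


# Windows of 3 tokens overlapping by 1: ['a b c', 'c d e']
print(split_on_tokens("a b c d e", tokens_per_chunk=3, chunk_overlap=1))
```

Raising early keeps the failure visible at call time instead of letting the loop spin on a window that never moves.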