Commit Graph

151 Commits

Author SHA1 Message Date
Julia (Juli) Huang
cd5b36456a fix(text-splitters): HTMLSemanticPreservingSplitter nested preserved … (#34587)
Summary
Fixes an issue where HTMLSemanticPreservingSplitter failed to preserve
elements nested inside non-container tags. With these changes, preserved
elements are now correctly detected and handled at any nesting depth.

Root Cause
`_process_element()` only recursed into a small set of hard-coded
container tags (`html`, `body`, `div`, `main`). For other tags, the
subtree was flattened into text, preventing nested preserved elements
(inside `<p>`, `<section>`, `<article>`, etc.) from being detected.


Fix
- Updated traversal logic in _process_element (html.py) to recursively
process child elements for any tag that contains nested elements
- Avoided duplicate text extraction
- Preserved correct placeholder ordering
- Treated leaf nodes as text only

Tests
Adds regression tests covering preserved elements nested inside
non-container tags, including:
- table inside section
- nested divs
- code inside paragraph

All existing tests pass (make lint, format, test, etc).

Breaking changes
None.

Fixes
Fixes #31569

Disclaimer
GitHub Copilot was used to assist with test case design in
test_text_splitters.py and documentation comments; all code logic was
manually implemented and reviewed.

---------

Co-authored-by: julih <julih@julihs-MacBook-Pro.local>
Co-authored-by: Mason Daugherty <github@mdrxy.com>
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2026-01-05 10:28:27 -05:00
Christophe Bornet
e03d6b80d5 chore(deps): bump mypy to v1.19 and ruff to v1.14 (#34521)
* Set mypy to >=1.19.1,<1.20
* Set ruff to >=0.14.10,<0.15
2025-12-29 18:07:55 -06:00
Christophe Bornet
ea25f5ebdd chore(text-splitters): bump dependency locks for python 3.14 (#34522)
* Support sentence-transformers optional dep on python 3.14
* Bump some dep locks to use pre-built wheels instead of building them
(murmurhash, cymem, preshed, thinc, srsly, blis)
* Still not possible to use spacy: even though there are wheels
available, spacy depends on Pydantic v1 which doesn't work on Python
3.14.
* Speeds up installation and CI.

Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-12-29 17:55:34 -06:00
Christophe Bornet
5884fb9523 style(text-splitters,standard-tests,cli): add ruff TC and RUF012 rules (#34495)
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-12-27 01:41:33 -06:00
Christophe Bornet
d46187201d style: add ruff ISC001 rule (#34493)
ISC001 doesn't conflict anymore with the formatter. See
https://github.com/astral-sh/ruff/issues/8272

Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-12-26 21:39:56 -06:00
ccurme
795e746ca7 release(core): 1.2.3 (#34421) 2025-12-18 15:06:32 -05:00
Mason Daugherty
71778cb721 feat(infra): add CI check for out of date lockfiles (#34397) 2025-12-16 22:23:25 -05:00
Mason Daugherty
0cd72b50fb release(text-splitters): 1.1.0 (#34346) 2025-12-13 20:13:03 -05:00
rari404
0f940d74b2 feat(text-splitters): add R programming language support (#34241) 2025-12-12 13:34:22 -05:00
William FH
1867521d1a feat: Use uuid7 for run ids (#34172)
Co-authored-by: Sydney Runkle <54324534+sydney-runkle@users.noreply.github.com>
Co-authored-by: Sydney Runkle <sydneymarierunkle@gmail.com>
2025-12-03 10:09:10 -08:00
Christophe Bornet
ef79c26f18 chore(cli,standard-tests,text-splitters): fix some ruff TC rules (#33934)
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-11-12 14:06:31 -05:00
Mason Daugherty
e023201d42 style: some cleanup (#33857) 2025-11-06 23:50:46 -05:00
Mason Daugherty
d40e340479 chore: attribute package change versions (#33854)
Needed to disambiguate for within inherited docs
2025-11-06 16:57:30 -05:00
Mason Daugherty
123e29dc26 style: more refs fixes (#33730) 2025-10-29 16:34:46 -04:00
Mason Daugherty
f15391f4fc chore(text-splitters): API reference link in README (#33713) 2025-10-28 23:28:48 -04:00
Mason Daugherty
e5e1d6c705 style: more refs work (#33707) 2025-10-28 14:43:28 -04:00
Mason Daugherty
db7f2db1ae feat(infra): langchain docs MCP (#33636) 2025-10-22 11:50:35 -04:00
Mason Daugherty
e731ba1e47 style: more refs work (#33616) 2025-10-20 18:40:19 -04:00
Mason Daugherty
64e6798a39 chore: update pyproject.toml url entries (#33587) 2025-10-17 17:16:55 -04:00
ccurme
3152d25811 fix: support python 3.14 in various projects (#33575)
Co-authored-by: cbornet <cbornet@hotmail.com>
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-10-17 11:06:23 -04:00
ccurme
3b8cb3d4b6 release(text-splitters): 1.0.0 (#33565) 2025-10-17 10:30:42 -04:00
Mason Daugherty
26e0a00c4c style: more work for refs (#33508)
Largely:
- Remove explicit `"Default is x"` since new refs show default inferred
from sig
- Inline code (useful for eventual parsing)
- Fix code block rendering (indentations)
2025-10-15 18:46:55 -04:00
Mason Daugherty
79200cf3c2 docs: update package READMEs (#33488) 2025-10-15 10:49:35 -04:00
Christophe Bornet
83901b30e3 chore(text-splitters): remove arg types from docstrings (#33406)
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-10-10 11:37:53 -04:00
Mason Daugherty
6fc21afbc9 style: .. code-block:: admonition translations (#33400)
biiiiiiiiiiiiiiiigggggggg pass
2025-10-09 16:52:58 -04:00
Mason Daugherty
d8a680ee57 style: address Sphinx double-backtick snippet syntax (#33389) 2025-10-09 13:35:51 -04:00
Mason Daugherty
b6132fc23e style: remove more Optional syntax (#33371) 2025-10-08 23:28:43 -04:00
Mason Daugherty
d13823043d style: monorepo pass for refs (#33359)
* Delete some double backticks previously used by Sphinx (not done
everywhere yet)
* Fix some code blocks / dropdowns

Ignoring CLI CI for now
2025-10-08 18:41:39 -04:00
Mason Daugherty
cda336295f chore: enrich pyproject.toml files with links to new references, others (#33343) 2025-10-07 16:17:14 -04:00
Mason Daugherty
8bcdfbb24e chore: clean up pyproject.toml files, use core a7 (#33334) 2025-10-07 10:49:04 -04:00
Christophe Bornet
20e04fc3dd chore(text-splitters): cleanup ruff config (#33247)
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-10-06 17:02:31 -04:00
Mason Daugherty
90e4d944ac chore(infra): pdm -> hatchling (#33289) 2025-10-05 23:52:52 -04:00
Mason Daugherty
ae5b105d11 docs: v1 docs updates (#33173)
Co-authored-by: Mohammad Mohtashim <45242107+keenborder786@users.noreply.github.com>
Co-authored-by: Caspar Broekhuizen <caspar@langchain.dev>
Co-authored-by: ccurme <chester.curme@gmail.com>
Co-authored-by: Christophe Bornet <cbornet@hotmail.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
Co-authored-by: Vadym Barda <vadim.barda@gmail.com>
2025-10-02 18:46:26 -04:00
Mason Daugherty
ae16392ada release(text-splitters): 1.0.0a1 (#33214) 2025-10-02 13:56:10 -04:00
Mason Daugherty
5e8cb58e6a refactor(text-splitters): drop python 3.9 (#33212) 2025-10-02 13:51:10 -04:00
Mason Daugherty
eaa6dcce9e release: v1.0.0 (#32567)
Co-authored-by: Mohammad Mohtashim <45242107+keenborder786@users.noreply.github.com>
Co-authored-by: Caspar Broekhuizen <caspar@langchain.dev>
Co-authored-by: ccurme <chester.curme@gmail.com>
Co-authored-by: Christophe Bornet <cbornet@hotmail.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
Co-authored-by: Vadym Barda <vadim.barda@gmail.com>
2025-10-02 10:49:42 -04:00
Mason Daugherty
986302322f docs: more standardization (#33124) 2025-09-25 20:46:20 -04:00
Christophe Bornet
eaf8dce7c2 chore: bump ruff version to 0.13 (#33043)
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-09-25 12:27:39 -04:00
Mason Daugherty
e3efd1e891 test(text-splitters): capture beta warnings (#33113) 2025-09-25 01:30:20 -04:00
Mason Daugherty
d6769cf032 test(text-splitters): resolve pytest marker warning (#33112)
#33111
2025-09-25 01:29:42 -04:00
Mason Daugherty
781db9d892 chore: update pyproject.toml files, remove codespell (#33028)
- Removes Codespell from deps, docs, and `Makefile`s
- Python version requirements in all `pyproject.toml` files now use the
`~=` (compatible release) specifier
- All dependency groups and main dependencies now use explicit lower and
upper bounds, reducing potential for breaking changes
2025-09-20 22:09:33 -04:00
Christophe Bornet
cbaf97ada4 chore: bump mypy version to 1.18 (#32914) 2025-09-12 09:19:23 -04:00
Hyunjoon Jeong
9cc85387d1 fix(text-splitters): add validation to prevent infinite loop and prevent empty token splitter (#32205)
### Description
1) Add validation to prevent infinite loop condition when
```tokenizer.tokens_per_chunk > tokenizer.chunk_overlap```
2) Avoid empty decoded chunk when splitter appends tokens

---------

Co-authored-by: Eugene Yurtsev <eugene@langchain.dev>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2025-09-11 16:55:32 -04:00
Mason Daugherty
7a158c7f1c revert: "chore: remove ruff target-version" (#32895)
Reverts langchain-ai/langchain#32880

Not needed at the moment, will do when finishing v1
2025-09-10 20:56:48 -04:00
Christophe Bornet
b274416441 chore: remove ruff target-version (#32880)
This is not needed anymore since `requires-python` was added when moving
to `uv`.
2025-09-10 11:12:30 -04:00
Mason Daugherty
c124e67325 chore(docs): update package READMEs (#32869)
- Fix badges
- Focus on agents
- Cut down fluff
2025-09-09 14:50:32 +00:00
Christophe Bornet
8b90eae455 chore(text-splitters): enable ruff docstring-code-format (#32854) 2025-09-08 16:40:11 -04:00
Christophe Bornet
0c3e8ccd0e chore(text-splitters): select ALL rules with exclusions (#32325)
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-09-08 14:46:09 +00:00
Mason Daugherty
6b5fdfb804 release(text-splitters): 0.3.11 (#32770)
Fixes #32747

SpaCy integration test fixture was trying to use pip to download the
SpaCy language model (`en_core_web_sm`), but uv-managed environments
don't include pip by default. Fail test if not installed as opposed to
downloading.
2025-08-31 23:00:05 +00:00
Christophe Bornet
e0a4af8d8b docs(text-splitters): fix some docstrings (#32767) 2025-08-31 13:46:11 -05:00