Commit Graph

144 Commits

Author SHA1 Message Date
Mason Daugherty
0cd72b50fb release(text-splitters): 1.1.0 (#34346) 2025-12-13 20:13:03 -05:00
rari404
0f940d74b2 feat(text-splitters): add R programming language support (#34241) 2025-12-12 13:34:22 -05:00
William FH
1867521d1a feat: Use uuid7 for run ids (#34172)
Co-authored-by: Sydney Runkle <54324534+sydney-runkle@users.noreply.github.com>
Co-authored-by: Sydney Runkle <sydneymarierunkle@gmail.com>
2025-12-03 10:09:10 -08:00
Christophe Bornet
ef79c26f18 chore(cli,standard-tests,text-splitters): fix some ruff TC rules (#33934)
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-11-12 14:06:31 -05:00
Mason Daugherty
e023201d42 style: some cleanup (#33857) 2025-11-06 23:50:46 -05:00
Mason Daugherty
d40e340479 chore: attribute package change versions (#33854)
Needed to disambiguate for within inherited docs
2025-11-06 16:57:30 -05:00
Mason Daugherty
123e29dc26 style: more refs fixes (#33730) 2025-10-29 16:34:46 -04:00
Mason Daugherty
f15391f4fc chore(text-splitters): API reference link in README (#33713) 2025-10-28 23:28:48 -04:00
Mason Daugherty
e5e1d6c705 style: more refs work (#33707) 2025-10-28 14:43:28 -04:00
Mason Daugherty
db7f2db1ae feat(infra): langchain docs MCP (#33636) 2025-10-22 11:50:35 -04:00
Mason Daugherty
e731ba1e47 style: more refs work (#33616) 2025-10-20 18:40:19 -04:00
Mason Daugherty
64e6798a39 chore: update pyproject.toml url entries (#33587) 2025-10-17 17:16:55 -04:00
ccurme
3152d25811 fix: support python 3.14 in various projects (#33575)
Co-authored-by: cbornet <cbornet@hotmail.com>
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-10-17 11:06:23 -04:00
ccurme
3b8cb3d4b6 release(text-splitters): 1.0.0 (#33565) 2025-10-17 10:30:42 -04:00
Mason Daugherty
26e0a00c4c style: more work for refs (#33508)
Largely:
- Remove explicit `"Default is x"` since new refs show default inferred
from sig
- Inline code (useful for eventual parsing)
- Fix code block rendering (indentations)
2025-10-15 18:46:55 -04:00
Mason Daugherty
79200cf3c2 docs: update package READMEs (#33488) 2025-10-15 10:49:35 -04:00
Christophe Bornet
83901b30e3 chore(text-splitters): remove arg types from docstrings (#33406)
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-10-10 11:37:53 -04:00
Mason Daugherty
6fc21afbc9 style: .. code-block:: admonition translations (#33400)
biiiiiiiiiiiiiiiigggggggg pass
2025-10-09 16:52:58 -04:00
Mason Daugherty
d8a680ee57 style: address Sphinx double-backtick snippet syntax (#33389) 2025-10-09 13:35:51 -04:00
Mason Daugherty
b6132fc23e style: remove more Optional syntax (#33371) 2025-10-08 23:28:43 -04:00
Mason Daugherty
d13823043d style: monorepo pass for refs (#33359)
* Delete some double backticks previously used by Sphinx (not done
everywhere yet)
* Fix some code blocks / dropdowns

Ignoring CLI CI for now
2025-10-08 18:41:39 -04:00
Mason Daugherty
cda336295f chore: enrich pyproject.toml files with links to new references, others (#33343) 2025-10-07 16:17:14 -04:00
Mason Daugherty
8bcdfbb24e chore: clean up pyproject.toml files, use core a7 (#33334) 2025-10-07 10:49:04 -04:00
Christophe Bornet
20e04fc3dd chore(text-splitters): cleanup ruff config (#33247)
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-10-06 17:02:31 -04:00
Mason Daugherty
90e4d944ac chore(infra): pdm -> hatchling (#33289) 2025-10-05 23:52:52 -04:00
Mason Daugherty
ae5b105d11 docs: v1 docs updates (#33173)
Co-authored-by: Mohammad Mohtashim <45242107+keenborder786@users.noreply.github.com>
Co-authored-by: Caspar Broekhuizen <caspar@langchain.dev>
Co-authored-by: ccurme <chester.curme@gmail.com>
Co-authored-by: Christophe Bornet <cbornet@hotmail.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
Co-authored-by: Vadym Barda <vadim.barda@gmail.com>
2025-10-02 18:46:26 -04:00
Mason Daugherty
ae16392ada release(text-splitters): 1.0.0a1 (#33214) 2025-10-02 13:56:10 -04:00
Mason Daugherty
5e8cb58e6a refactor(text-splitters): drop python 3.9 (#33212) 2025-10-02 13:51:10 -04:00
Mason Daugherty
eaa6dcce9e release: v1.0.0 (#32567)
Co-authored-by: Mohammad Mohtashim <45242107+keenborder786@users.noreply.github.com>
Co-authored-by: Caspar Broekhuizen <caspar@langchain.dev>
Co-authored-by: ccurme <chester.curme@gmail.com>
Co-authored-by: Christophe Bornet <cbornet@hotmail.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
Co-authored-by: Vadym Barda <vadim.barda@gmail.com>
2025-10-02 10:49:42 -04:00
Mason Daugherty
986302322f docs: more standardization (#33124) 2025-09-25 20:46:20 -04:00
Christophe Bornet
eaf8dce7c2 chore: bump ruff version to 0.13 (#33043)
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-09-25 12:27:39 -04:00
Mason Daugherty
e3efd1e891 test(text-splitters): capture beta warnings (#33113) 2025-09-25 01:30:20 -04:00
Mason Daugherty
d6769cf032 test(text-splitters): resolve pytest marker warning (#33112)
#33111
2025-09-25 01:29:42 -04:00
Mason Daugherty
781db9d892 chore: update pyproject.toml files, remove codespell (#33028)
- Removes Codespell from deps, docs, and `Makefile`s
- Python version requirements in all `pyproject.toml` files now use the
`~=` (compatible release) specifier
- All dependency groups and main dependencies now use explicit lower and
upper bounds, reducing potential for breaking changes
2025-09-20 22:09:33 -04:00
Christophe Bornet
cbaf97ada4 chore: bump mypy version to 1.18 (#32914) 2025-09-12 09:19:23 -04:00
Hyunjoon Jeong
9cc85387d1 fix(text-splitters): add validation to prevent infinite loop and prevent empty token splitter (#32205)
### Description
1) Add validation to prevent infinite loop condition when
```tokenizer.tokens_per_chunk > tokenizer.chunk_overlap```
2) Avoid empty decoded chunk when splitter appends tokens

---------

Co-authored-by: Eugene Yurtsev <eugene@langchain.dev>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2025-09-11 16:55:32 -04:00
Mason Daugherty
7a158c7f1c revert: "chore: remove ruff target-version" (#32895)
Reverts langchain-ai/langchain#32880

Not needed at the moment, will do when finishing v1
2025-09-10 20:56:48 -04:00
Christophe Bornet
b274416441 chore: remove ruff target-version (#32880)
This is not needed anymore since `requires-python` was added when moving
to `uv`.
2025-09-10 11:12:30 -04:00
Mason Daugherty
c124e67325 chore(docs): update package READMEs (#32869)
- Fix badges
- Focus on agents
- Cut down fluff
2025-09-09 14:50:32 +00:00
Christophe Bornet
8b90eae455 chore(text-splitters): enable ruff docstring-code-format (#32854) 2025-09-08 16:40:11 -04:00
Christophe Bornet
0c3e8ccd0e chore(text-splitters): select ALL rules with exclusions (#32325)
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-09-08 14:46:09 +00:00
Mason Daugherty
6b5fdfb804 release(text-splitters): 0.3.11 (#32770)
Fixes #32747

SpaCy integration test fixture was trying to use pip to download the
SpaCy language model (`en_core_web_sm`), but uv-managed environments
don't include pip by default. Fail test if not installed as opposed to
downloading.
2025-08-31 23:00:05 +00:00
Christophe Bornet
e0a4af8d8b docs(text-splitters): fix some docstrings (#32767) 2025-08-31 13:46:11 -05:00
Sydney Runkle
b26e52aa4d chore(text-splitters): bump version of core (#32740) 2025-08-28 13:14:57 -04:00
Sydney Runkle
38cdd7a2ec chore(text-splitters): relax max bound for langchain-core (#32739) 2025-08-28 13:05:47 -04:00
Mason Daugherty
3d08b6bd11 chore: adress pytest-asyncio deprecation warnings + other nits (#32696)
amongst some linting imcompatible rules
2025-08-26 15:51:38 -04:00
Maitrey Talware
622337a297 docs(docs): fixed typos in documentations (#32661)
Minor typo fixes. (Not linked to current open issues)
2025-08-25 10:02:53 -04:00
Christophe Bornet
73a7de63aa chore(text-splitters): add mypy pydantic plugin (#32611) 2025-08-19 16:58:12 -04:00
Keyu Chen
03138f41a0 feat(text-splitters): add optional custom header pattern support (#31887)
## Description

This PR adds support for custom header patterns in
`MarkdownHeaderTextSplitter`, allowing users to define non-standard
Markdown header formats (like `**Header**`) and specify their hierarchy
levels.

**Issue:** Fixes #22738

**Dependencies:** None - this change has no new dependencies

**Key Changes:**
- Added optional `custom_header_patterns` parameter to support
non-standard header formats
- Enable splitting on patterns like `**Header**` and `***Header***`
- Maintain full backward compatibility with existing usage
- Added comprehensive tests for custom and mixed header scenarios

## Example Usage

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("**", "Chapter"),
    ("***", "Section"),
]

custom_header_patterns = {
    "**": 1,   # Level 1 headers
    "***": 2,  # Level 2 headers
}

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    custom_header_patterns=custom_header_patterns,
)

# Now **Chapter 1** is treated as a level 1 header
# And ***Section 1.1*** is treated as a level 2 header
```

## Testing

-  Added unit tests for custom header patterns
-  Added tests for mixed standard and custom headers
-  All existing tests pass (backward compatibility maintained)
-  Linting and formatting checks pass

---

The implementation provides a flexible solution while maintaining the
simplicity of the existing API. Users can continue using the splitter
exactly as before, with the new functionality being entirely opt-in
through the `custom_header_patterns` parameter.

---------

Co-authored-by: Mason Daugherty <mason@langchain.dev>
Co-authored-by: Claude <noreply@anthropic.com>
2025-08-18 10:10:49 -04:00
Christophe Bornet
4656f727da chore(text-splitters): add mypy warn_unreachable (#32558) 2025-08-15 09:45:20 -04:00