Sweep classic deprecations so every removal lands on `2.0.0`, runtime
warnings carry the auto-generated since/removal/alternative line, and
replacements point to `langchain.agents.create_agent` and
`with_structured_output(...)` instead of pre-v1 LangGraph and
`python.langchain.com` links.
## Changes
- **Bump removal targets from `1.0` / `1.0.0` to `2.0.0`** across
agents, chains, memory, retrievers, structured-output, vectorstore
toolkits, and the `langchain_classic._api.module_import` shim — gives
users a real runway now that v1 has shipped.
- **Move bespoke `message=` strings onto `addendum=`** (or split into
`alternative=` + `addendum=`). `warn_deprecated` skips the
auto-generated since/removal/alternative line whenever `message=` is
set, so the prior pattern silently dropped that info from the runtime
`LangChainDeprecationWarning`. Matches the pattern already used in
`HTMLHeaderTextSplitter.split_text_from_url`, which is updated for
consistency.
- **Repoint `alternative=` at v1 replacements**: chains/memory/agent
toolkits → `langchain.agents.create_agent` (with checkpointer or
retrieval-tool guidance in the addendum); `openai_functions` and
`chains/structured_output` → `ChatModel.with_structured_output(...)`;
`openapi` chains → `ChatModel.bind_tools(...)` + HTTP client.
`ConversationChain` no longer points at `RunnableWithMessageHistory`.
- **Refresh `AGENT_DEPRECATION_WARNING`** in
`langchain_classic._api.deprecation` — drop stale LangGraph and
`python.langchain.com` links in favor of `langchain.agents.create_agent`
and the `docs.langchain.com/oss/python/migrate/langchain-v1` guide.
Propagates to all 13 caller sites in `agents/`.
- **Newly deprecate `langchain_classic.chat_models.init_chat_model` and
`langchain_classic.embeddings.init_embeddings`** with the framing
*"maintained in `langchain`; `langchain-classic` retains this entry
point for import-compatibility only"*. The classic docstring examples
and the warning admonition both point at `langchain.chat_models`.
- **Improve `init_chat_model` docstrings** in both `langchain_v1` and
the classic copy: clarify `provider:model` prefix vs. `model_provider=`,
recommend pinned IDs over moving aliases, add the `upstage` provider
row, and refresh examples to GA models (`gpt-5.5`, `claude-opus-4-7`).
- **Standardize partner Anthropic deprecations**: replace
`AnthropicLLM`'s `model_validator(raise_warning)` with
`@deprecated(since="0.1.0", removal="2.0.0",
alternative="ChatAnthropic")`, and pin the `ChatAnthropic`
`output_format` runtime warning at `langchain-anthropic 2.0.0` instead
of "a future version".
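The `message=` versus `addendum=` behavior described in the list above can be modeled in a few lines. This is a simplified, self-contained sketch and not the `langchain_core` implementation: `build_deprecation_message`, its wording, and the `since` value are hypothetical; only the parameter names and the `2.0.0` removal target come from the description.

```python
def build_deprecation_message(
    name: str,
    *,
    since: str,
    removal: str = "",
    alternative: str = "",
    addendum: str = "",
    message: str = "",
) -> str:
    """Hypothetical model of how a deprecation warning text is assembled."""
    if message:
        # A bespoke message replaces the generated line entirely, so the
        # since/removal/alternative details never reach the user.
        return message
    text = f"`{name}` was deprecated in {since}"
    if removal:
        text += f" and will be removed in {removal}"
    text += "."
    if alternative:
        text += f" Use {alternative} instead."
    if addendum:
        # The addendum is appended after the generated line, not in place of it.
        text += f" {addendum}"
    return text


# Illustrative values; the `since` version is made up for this sketch.
old_style = build_deprecation_message(
    "ConversationChain",
    since="0.1.0",
    removal="2.0.0",
    message="See the migration guide.",
)
new_style = build_deprecation_message(
    "ConversationChain",
    since="0.1.0",
    removal="2.0.0",
    alternative="langchain.agents.create_agent",
    addendum="See the migration guide.",
)
```

In this model `old_style` contains only the bespoke text, while `new_style` keeps the generated since/removal/alternative line and appends the guidance after it.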
## Summary
Fixes four issues in `get_separators_for_language()` in `character.py`:
- **Kotlin**: removed `"\ncase "` — `case` is not a Kotlin keyword.
Kotlin uses `when` expressions (already present in the list). This was
copied from Java/Swift.
- **Rust**: removed duplicate `"\nconst "` — appeared twice, once under
function definitions and again under control flow statements.
- **Haskell**: removed duplicate `"\n:: "` — appeared under function
definitions and again under type declarations.
- **Haskell**: removed duplicate `"\ndata "` — appeared under type
declarations and again under record field declarations.
All four are dead separators that never match or produce redundant
splits.
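Duplicates like the Rust and Haskell entries above are easy to catch mechanically. The following is a minimal sketch, assuming a hypothetical helper `find_duplicate_separators` and an illustrative excerpt rather than the real list in `character.py`.

```python
from collections import Counter


def find_duplicate_separators(separators: list[str]) -> list[str]:
    """Return separators that appear more than once, in first-seen order."""
    counts = Counter(separators)
    seen: set[str] = set()
    duplicates: list[str] = []
    for sep in separators:
        if counts[sep] > 1 and sep not in seen:
            seen.add(sep)
            duplicates.append(sep)
    return duplicates


# Illustrative excerpt only; not the full Rust list from `character.py`.
rust_like = ["\nfn ", "\nconst ", "\nlet ", "\nif ", "\nconst ", "\n\n", "\n", " ", ""]
duplicates = find_duplicate_separators(rust_like)
```

A check of this kind could run over every language's list in a unit test, so a repeated entry fails fast instead of lingering unnoticed.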
## Issue
Closes #37038
## Types of changes
- [x] Bug fix
## Checklist
- [x] I have read the CONTRIBUTING doc
- [x] Lint and unit tests pass locally with my changes
## Summary
Removes two incorrect separators from `get_separators_for_language()` in
`RecursiveCharacterTextSplitter`:
- **C#**: `"\nimplements "` is a Java keyword. C# uses `:` for interface
implementation. This separator never matches valid C# source code.
- **Elixir**: `"\nwhile "` does not exist in Elixir. The language uses
recursion and `Enum.reduce_while/3` instead of while loops.
Both are dead separators that silently degrade chunking quality by
occupying positions in the separator priority list without contributing
useful split points.
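One way to spot dead separators of this kind is to check which entries never occur in representative source code. The sketch below is an assumption-laden illustration: `unused_separators`, the C# sample, and the shortened separator list are all hypothetical, and it uses literal substring matching rather than the splitter's own matching logic.

```python
def unused_separators(separators: list[str], samples: list[str]) -> list[str]:
    """Return non-empty separators that occur in none of the sample sources."""
    return [
        sep
        for sep in separators
        if sep and not any(sep in sample for sample in samples)
    ]


# Hypothetical C# sample: interface implementation uses `:`, not `implements`.
csharp_sample = (
    "\nnamespace Demo\n{\n"
    "\npublic class Repo : IRepo\n{\n"
    "\n    public void Save() { }\n}\n}\n"
)
# Illustrative excerpt of a separator list, not the real one.
csharp_like = ["\nnamespace ", "\npublic ", "\nimplements ", "\n"]
dead = unused_separators(csharp_like, [csharp_sample])
```

Here only `"\nimplements "` is reported, since every other entry appears somewhere in the sample.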
## Tests
Added two targeted tests:
- `test_csharp_separators_no_java_keywords`: verifies `"\nimplements "`
is not in the C# separator list
- `test_elixir_separators_no_while`: verifies `"\nwhile "` is not in the
Elixir separator list
Existing `test_csharp_code_splitter` continues to pass (no change to
expected output since `implements` never matched valid C# code).
Full suite: 129 passed, 0 failed.
Fixes #37030
CI lint jobs use `uv run --all-groups` for all tools, but ruff doesn't
need dependency resolution — only mypy does. By splitting into
`UV_RUN_LINT` (ruff) and `UV_RUN_TYPE` (mypy), the CI-facing targets run
ruff with `--group lint` only, giving fast-fail feedback before mypy
triggers the full environment sync.
For packages where source code only conditionally imports heavy deps
(text-splitters, huggingface), `lint_package` also overrides
`UV_RUN_TYPE` to `--group lint --group typing`, skipping the ~3.5GB
`test_integration` download entirely. `lint_tests` keeps `--all-groups`
since test code legitimately imports those deps.
Additionally, `lint_imports.sh` was inconsistently wired — most packages
had the script but weren't calling it.
## Changes
**Makefile optimization**
- Introduce `UV_RUN_LINT` and `UV_RUN_TYPE` Make variables, both
defaulting to `uv run --all-groups`. For `lint_package` and
`lint_tests`, `UV_RUN_LINT` is overridden to `uv run --group lint` so
ruff runs instantly without syncing heavy deps
- For `text-splitters` and `huggingface`, override `UV_RUN_TYPE` on
`lint_package` to `uv run --group lint --group typing` — mypy runs
without downloading torch, CUDA, spacy, etc.
**mypy config for lean groups**
- Add `transformers` and `transformers.*` to `ignore_missing_imports` in
`text-splitters` pyproject.toml (conditional `try/except` import, same
treatment as existing `konlpy`/`nltk` entries)
- Add `torch`, `torch.*`, `langchain_community`, `langchain_community.*`
to `ignore_missing_imports` in `huggingface` pyproject.toml
- Add dual `# type: ignore[unreachable, unused-ignore]` in
`text-splitters/base.py` to handle the `PreTrainedTokenizerBase`
isinstance check that behaves differently depending on whether
transformers is installed
**lint_imports.sh consistency**
- Add `./scripts/lint_imports.sh` to the lint recipe in every package
that wasn't calling it (standard-tests, model-profiles, all 15
partners), and create the script for the two packages missing it
entirely (`model-profiles`, `openrouter`)
- Update all `lint_imports.sh` scripts to allow `from langchain.agents`
and `from langchain.tools` imports (legitimate v1 middleware
dependencies used by `langchain-anthropic` and `langchain-openai`)
An automated code review of `.github/scripts/get_min_versions.py`
identified that its HTTP calls were made without a timeout. This change
sets one, because a network call without a timeout can hang a worker
indefinitely. The patch is intentionally small, and syntax checks were
re-run after applying it.
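The pattern is sketched below with the standard library only. This is an illustration under assumptions, not the patched script: `fetch_json`, the 10-second default, and the injectable `opener` are hypothetical, and the real script may use a different HTTP client.

```python
import io
import json
from typing import Any, Callable
from urllib.request import urlopen

DEFAULT_TIMEOUT_SECONDS = 10.0  # assumed value; the PR does not state one


def fetch_json(
    url: str,
    timeout: float = DEFAULT_TIMEOUT_SECONDS,
    opener: Callable[..., Any] = urlopen,
) -> Any:
    """Fetch and decode JSON, failing instead of hanging on a stalled server."""
    # `timeout` bounds blocking socket operations such as connecting and
    # reading; without it the call can wait indefinitely.
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Offline demonstration with a stand-in opener (no network access needed).
def _fake_opener(url: str, timeout: float) -> io.BytesIO:
    assert timeout > 0  # the timeout is always forwarded to the opener
    return io.BytesIO(b'{"releases": {"1.0.0": []}}')


result = fetch_json("https://example.invalid/pkg.json", opener=_fake_opener)
```

Making the timeout a keyword with a default keeps call sites unchanged while ensuring no request is issued without a bound.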
Summary
Fixes an issue where `HTMLSemanticPreservingSplitter` failed to preserve
elements nested inside non-container tags. With these changes, preserved
elements are now correctly detected and handled at any nesting depth.
Root Cause
`_process_element()` only recursed into a small set of hard-coded
container tags (`html`, `body`, `div`, `main`). For other tags, the
subtree was flattened into text, preventing nested preserved elements
(inside `<p>`, `<section>`, `<article>`, etc.) from being detected.
Fix
- Updated traversal logic in `_process_element` (`html.py`) to recursively
process child elements for any tag that contains nested elements
- Avoided duplicate text extraction
- Preserved correct placeholder ordering
- Treated leaf nodes as text only
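The traversal change can be illustrated with a small standalone walker. This is a sketch of the idea only, not the code in `html.py`: it uses the standard library's `xml.etree.ElementTree` on a well-formed snippet, and `collect_preserved` and `PRESERVED_TAGS` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import Optional

PRESERVED_TAGS = {"table", "code"}  # hypothetical set of tags to preserve


def collect_preserved(
    element: ET.Element, found: Optional[list[str]] = None
) -> list[str]:
    """Walk the tree depth-first and record preserved tags at any depth."""
    if found is None:
        found = []
    if element.tag in PRESERVED_TAGS:
        # Preserved elements are kept whole, so do not descend further.
        found.append(element.tag)
        return found
    # Recurse into every element that has children, not just a fixed
    # allow-list of container tags such as `div` or `body`.
    for child in element:
        collect_preserved(child, found)
    return found


doc = ET.fromstring(
    "<body>"
    "<section><table><tr><td>1</td></tr></table></section>"
    "<div><div><p>Run <code>make test</code> first.</p></div></div>"
    "</body>"
)
preserved = collect_preserved(doc)
```

Both the table inside `<section>` and the code inside `<p>` are found, which mirrors the regression cases listed under Tests.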
Tests
Adds regression tests covering preserved elements nested inside
non-container tags, including:
- table inside section
- nested divs
- code inside paragraph
All existing tests pass (`make lint`, `make format`, `make test`, etc.).
Breaking changes
None.
Fixes
Fixes #31569
Disclaimer
GitHub Copilot was used to assist with test case design in
`test_text_splitters.py` and with documentation comments; all code logic
was manually implemented and reviewed.
---------
Co-authored-by: julih <julih@julihs-MacBook-Pro.local>
Co-authored-by: Mason Daugherty <github@mdrxy.com>
Co-authored-by: Mason Daugherty <mason@langchain.dev>
Largely:
- Remove explicit `"Default is x"` notes, since the new reference docs
show the default inferred from the signature
- Inline code (useful for eventual parsing)
- Fix code block rendering (indentations)
## Description
This PR adds support for custom header patterns in
`MarkdownHeaderTextSplitter`, allowing users to define non-standard
Markdown header formats (like `**Header**`) and specify their hierarchy
levels.
**Issue:** Fixes #22738
**Dependencies:** None - this change has no new dependencies
**Key Changes:**
- Added optional `custom_header_patterns` parameter to support
non-standard header formats
- Enable splitting on patterns like `**Header**` and `***Header***`
- Maintain full backward compatibility with existing usage
- Added comprehensive tests for custom and mixed header scenarios
## Example Usage
```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("**", "Chapter"),
    ("***", "Section"),
]

custom_header_patterns = {
    "**": 1,  # Level 1 headers
    "***": 2,  # Level 2 headers
}

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    custom_header_patterns=custom_header_patterns,
)

# Now **Chapter 1** is treated as a level 1 header
# And ***Section 1.1*** is treated as a level 2 header
```
## Testing
- ✅ Added unit tests for custom header patterns
- ✅ Added tests for mixed standard and custom headers
- ✅ All existing tests pass (backward compatibility maintained)
- ✅ Linting and formatting checks pass
---
The implementation provides a flexible solution while maintaining the
simplicity of the existing API. Users can continue using the splitter
exactly as before, with the new functionality being entirely opt-in
through the `custom_header_patterns` parameter.
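To make the level mapping concrete, here is a rough sketch of how a line wrapped in a custom delimiter could be resolved to a header level. It is not the splitter's actual implementation: `match_custom_header` is a hypothetical helper, and it assumes asterisk-style delimiters and titles that contain no `*`.

```python
import re
from typing import Optional

# Same mapping as in the usage example: delimiter -> header level.
CUSTOM_HEADER_PATTERNS = {"**": 1, "***": 2}


def match_custom_header(
    line: str, patterns: dict[str, int]
) -> Optional[tuple[int, str]]:
    """Return (level, title) if the whole line is wrapped in a custom delimiter."""
    stripped = line.strip()
    # Try longer delimiters first so `***Title***` is not read as `**`.
    for delim in sorted(patterns, key=len, reverse=True):
        escaped = re.escape(delim)
        match = re.fullmatch(rf"{escaped}(?P<title>[^*]+?){escaped}", stripped)
        if match:
            return patterns[delim], match.group("title").strip()
    return None


chapter = match_custom_header("**Chapter 1**", CUSTOM_HEADER_PATTERNS)
section = match_custom_header("***Section 1.1***", CUSTOM_HEADER_PATTERNS)
inline_bold = match_custom_header("Some **bold** text", CUSTOM_HEADER_PATTERNS)
```

Requiring a full-line match keeps ordinary inline bold text from being mistaken for a header.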
---------
Co-authored-by: Mason Daugherty <mason@langchain.dev>
Co-authored-by: Claude <noreply@anthropic.com>
Ensures proper reStructuredText formatting by adding the required blank
line before closing docstring quotes, which resolves the "Block quote
ends without a blank line; unexpected unindent" warning.
## Summary
- Removes the `xslt_path` parameter from `HTMLSectionSplitter` to
eliminate an XXE attack vector
- Hardens XML/HTML parsers with secure configurations to prevent XXE
attacks
- Adds comprehensive security tests to ensure the vulnerability is fixed
## Context
This PR addresses a critical XXE vulnerability discovered in the
HTMLSectionSplitter component. The vulnerability allowed attackers to:
- Read sensitive local files (SSH keys, passwords, configuration files)
- Perform Server-Side Request Forgery (SSRF) attacks
- Exfiltrate data to attacker-controlled servers
## Changes Made
1. **Removed `xslt_path` parameter** - This eliminates the primary
attack vector where users could supply malicious XSLT files
2. **Hardened XML parsers** - Added security configurations to prevent
XXE attacks even with the default XSLT:
   - `no_network=True` - Blocks network access
   - `resolve_entities=False` - Prevents entity expansion
   - `load_dtd=False` - Disables DTD processing
   - `XSLTAccessControl.DENY_ALL` - Blocks all file/network I/O in XSLT
     transformations
3. **Added security tests** - New test file `test_html_security.py` with
comprehensive tests for various XXE attack vectors
4. **Updated existing tests** - Modified tests that were using the
removed `xslt_path` parameter
## Test Plan
- [x] All existing tests pass
- [x] New security tests verify XXE attacks are blocked
- [x] Code passes linting and formatting checks
- [x] Tested with both old and new versions of lxml
Twitter handle: @_colemurray