Files
langchain/libs
Julia (Juli) Huang cd5b36456a fix(text-splitters): HTMLSemanticPreservingSplitter nested preserved … (#34587)
Summary
Fixes an issue where HTMLSemanticPreservingSplitter failed to preserve
elements nested inside non-container tags. With these changes, preserved
elements are now correctly detected and handled at any nesting depth.

Root Cause
`_process_element()` only recursed into a small set of hard-coded
container tags (`html`, `body`, `div`, `main`). For other tags, the
subtree was flattened into text, preventing nested preserved elements
(inside `<p>`, `<section>`, `<article>`, etc.) from being detected.


Fix
- Updated traversal logic in _process_element (html.py) to recursively
process child elements for any tag that contains nested elements
- Avoided duplicate text extraction
- Preserved correct placeholder ordering
- Treated leaf nodes as text only

Tests
Adds regression tests covering preserved elements nested inside
non-container tags, including:
- table inside section
- nested divs
- code inside paragraph

All existing tests pass (make lint, format, test, etc).

Breaking changes
None.

Fixes
Fixes #31569

Disclaimer
GitHub Copilot was used to assist with test case design in
test_text_splitters.py and documentation comments; all code logic was
manually implemented and reviewed.

---------

Co-authored-by: julih <julih@julihs-MacBook-Pro.local>
Co-authored-by: Mason Daugherty <github@mdrxy.com>
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2026-01-05 10:28:27 -05:00
..

Packages

Important

View all LangChain integrations packages

This repository is structured as a monorepo, with various packages located in this libs/ directory. Packages to note in this directory include:

core/             # Core primitives and abstractions for langchain
langchain/        # langchain-classic
langchain_v1/     # langchain
partners/         # Certain third-party providers integrations (see below)
standard-tests/   # Standardized tests for integrations
text-splitters/   # Text splitter utilities

(Each package contains its own README.md file with specific details about that package.)

Integrations (partners/)

The partners/ directory contains a small subset of third-party provider integrations that are maintained directly by the LangChain team. These include, but are not limited to:

Most integrations have been moved to their own repositories for improved versioning, dependency management, collaboration, and testing. This includes packages from popular providers such as Google and AWS. Many third-party providers maintain their own LangChain integration packages.

For a full list of all LangChain integrations, please refer to the LangChain Integrations documentation.