mirror of
https://github.com/hwchase17/langchain.git
synced 2026-03-18 11:07:36 +00:00
Summary

Fixes an issue where `HTMLSemanticPreservingSplitter` failed to preserve elements nested inside non-container tags. With these changes, preserved elements are correctly detected and handled at any nesting depth.

Root Cause

`_process_element()` only recursed into a small set of hard-coded container tags (`html`, `body`, `div`, `main`). For other tags, the subtree was flattened into text, preventing nested preserved elements (inside `<p>`, `<section>`, `<article>`, etc.) from being detected.

Fix

- Updated traversal logic in `_process_element` (`html.py`) to recursively process child elements for any tag that contains nested elements
- Avoided duplicate text extraction
- Preserved correct placeholder ordering
- Treated leaf nodes as text only

Tests

Adds regression tests covering preserved elements nested inside non-container tags, including:

- table inside section
- nested divs
- code inside paragraph

All existing tests pass (make lint, format, test, etc.).

Breaking changes

None.

Fixes

Fixes #31569

Disclaimer

GitHub Copilot was used to assist with test case design in `test_text_splitters.py` and documentation comments; all code logic was manually implemented and reviewed.

---------

Co-authored-by: julih <julih@julihs-MacBook-Pro.local>
Co-authored-by: Mason Daugherty <github@mdrxy.com>
Co-authored-by: Mason Daugherty <mason@langchain.dev>
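
Illustration

The traversal change described under Root Cause and Fix can be sketched in isolation. This is a minimal, hedged illustration and not the code in `html.py`: it uses the standard library's `xml.etree.ElementTree` on well-formed markup instead of the splitter's real parser, and the names `PRESERVED`, `OLD_CONTAINERS`, `collect_old`, and `collect_new` are hypothetical. It only shows why recursing into a fixed set of container tags misses preserved elements nested under `<section>` or `<p>`, while recursing into any tag that has child elements finds them.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration for this sketch only.
PRESERVED = {"table", "code"}                      # tags to keep intact
OLD_CONTAINERS = {"html", "body", "div", "main"}   # container list named in the PR


def collect_old(elem, found):
    """Pre-fix behaviour (sketch): recurse only into hard-coded containers."""
    for child in elem:
        if child.tag in PRESERVED:
            found.append(child.tag)
        elif child.tag in OLD_CONTAINERS:
            collect_old(child, found)
        # Any other tag is treated as flat text, so preserved
        # elements nested inside it are never seen.
    return found


def collect_new(elem, found):
    """Post-fix behaviour (sketch): recurse into any tag with child elements."""
    for child in elem:
        if child.tag in PRESERVED:
            found.append(child.tag)
        elif len(child) > 0:  # has nested elements, whatever the tag name
            collect_new(child, found)
        # Leaf nodes fall through and would be handled as text only.
    return found


DOC = (
    "<body>"
    "<section><table><tr><td>1</td></tr></table></section>"
    "<p>Run <code>ls</code> now</p>"
    "<div><div><code>x</code></div></div>"
    "</body>"
)

root = ET.fromstring(DOC)
old_found = collect_old(root, [])
new_found = collect_new(root, [])
print("old:", old_found)
print("new:", new_found)
```

The sample document mirrors the three regression cases listed under Tests (table inside section, code inside paragraph, nested divs). In this sketch the old traversal only reaches the `code` element under the nested `div`s, while the new traversal reaches all three preserved elements in document order.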