langchain/libs/text-splitters/tests
Keyu Chen 03138f41a0 feat(text-splitters): add optional custom header pattern support (#31887)
## Description

This PR adds support for custom header patterns in
`MarkdownHeaderTextSplitter`, allowing users to define non-standard
Markdown header formats (like `**Header**`) and specify their hierarchy
levels.

**Issue:** Fixes #22738

**Dependencies:** None; this change adds no new dependencies.

**Key Changes:**
- Added optional `custom_header_patterns` parameter to support
non-standard header formats
- Enabled splitting on patterns like `**Header**` and `***Header***`
- Maintained full backward compatibility with existing usage
- Added comprehensive tests for custom and mixed header scenarios

## Example Usage

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("**", "Chapter"),
    ("***", "Section"),
]

custom_header_patterns = {
    "**": 1,   # Level 1 headers
    "***": 2,  # Level 2 headers
}

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    custom_header_patterns=custom_header_patterns,
)

# Now **Chapter 1** is treated as a level 1 header
# And ***Section 1.1*** is treated as a level 2 header
```
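
To make the level mapping concrete, here is a minimal standalone sketch of how
a line could be matched against custom patterns and how the levels could drive
the metadata hierarchy. It is not the code from this PR: `match_header` and
`split_by_headers` are hypothetical helpers written only for illustration, and
the rule that a custom header must occupy the whole line is an assumption.

```python
import re
from typing import Optional


def match_header(
    line: str,
    headers_to_split_on: list[tuple[str, str]],
    custom_header_patterns: Optional[dict[str, int]] = None,
) -> Optional[tuple[int, str, str]]:
    """Return (level, metadata_key, title) if `line` is a header, else None.

    Hypothetical helper: illustrates the idea, not the library's implementation.
    """
    custom = custom_header_patterns or {}
    stripped = line.strip()
    # Try longer markers first so "***" is not mistaken for "**".
    for marker, name in sorted(headers_to_split_on, key=lambda h: -len(h[0])):
        if marker in custom:
            # Assumption: a custom header wraps the entire line in the marker.
            esc = re.escape(marker)
            m = re.fullmatch(esc + r"(.+?)" + esc, stripped)
            if m is None:
                continue
            inner = m.group(1)
            # Reject inline emphasis such as "**a** and **b**" or "***x***".
            if marker in inner or inner[0] == marker[0] or inner[-1] == marker[-1]:
                continue
            return custom[marker], name, inner.strip()
        if stripped.startswith(marker + " "):
            # Standard "#"-style header: the level is the marker length.
            return len(marker), name, stripped[len(marker):].strip()
    return None


def split_by_headers(
    text: str,
    headers_to_split_on: list[tuple[str, str]],
    custom_header_patterns: Optional[dict[str, int]] = None,
) -> list[dict]:
    """Split `text` into chunks, each tagged with its active headers."""
    chunks: list[dict] = []
    active: dict[int, tuple[str, str]] = {}  # level -> (metadata_key, title)
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            metadata = {key: title for _, (key, title) in sorted(active.items())}
            chunks.append({"content": content, "metadata": metadata})
        body.clear()

    for line in text.splitlines():
        header = match_header(line, headers_to_split_on, custom_header_patterns)
        if header is None:
            body.append(line)
            continue
        flush()
        level, key, title = header
        # A new header closes every header at the same or a deeper level.
        for lvl in [l for l in active if l >= level]:
            del active[lvl]
        active[level] = (key, title)
    flush()
    return chunks
```

Calling `split_by_headers` with the `headers_to_split_on` and
`custom_header_patterns` values from the example above yields one chunk per
header-delimited span, with `Chapter` and `Section` keys in each chunk's
metadata. Omitting `custom_header_patterns` leaves only the standard `#`
handling active, which mirrors the opt-in design described in this PR.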

## Testing

- Added unit tests for custom header patterns
- Added tests for mixed standard and custom headers
- All existing tests pass (backward compatibility maintained)
- Linting and formatting checks pass

---

The new functionality is entirely opt-in through the
`custom_header_patterns` parameter. Callers that omit it can keep using
the splitter exactly as before, and the existing API is otherwise
unchanged.

---------

Co-authored-by: Mason Daugherty <mason@langchain.dev>
Co-authored-by: Claude <noreply@anthropic.com>
2025-08-18 10:10:49 -04:00