langchain/libs/text-splitters/langchain_text_splitters
Tom-Trumper 532e6455e9
text-splitters: Add keep_separator arg to HTMLSemanticPreservingSplitter (#31588)
### Description
Add a `keep_separator` argument to `HTMLSemanticPreservingSplitter` and pass
its value to the `RecursiveCharacterTextSplitter` instance used under the hood.
### Issue
Documents returned by `HTMLSemanticPreservingSplitter.split_text(text)`
place separators at the beginning of `page_content` by default. [See the third
and fourth documents in the example output from the how-to
guide](https://python.langchain.com/docs/how_to/split_html/#using-htmlsemanticpreservingsplitter):
```
[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='. Below is a list: First item Second item Third item with bold text and a link Subsection 1.1: Details This subsection provides additional details'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content=". Here's a table: Header 1 Header 2 Header 3 Row 1, Cell 1 Row 1, Cell 2 Row 1, Cell 3 Row 2, Cell 1 Row 2, Cell 2 Row 2, Cell 3"),
 Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video: ![image:example_image_link.mp4](example_image_link.mp4) ![video:example_video_link.mp4](example_video_link.mp4)'),
 Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: <code:html> <div> <p>This is a paragraph inside a div.</p> </div> </code>'),
 Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]
```
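The leading `. ` in the third and fourth documents is what results when a separator is attached to the start of the chunk that follows it. The difference between the two placements can be sketched with the standard library alone. This is an illustration only, not the library's implementation: `split_keep` is a hypothetical helper, and the `"start"` / `"end"` values are assumed to mirror the options `RecursiveCharacterTextSplitter` accepts for `keep_separator`.

```python
import re


def split_keep(text: str, separator: str, keep: str = "start") -> list[str]:
    """Split on a literal separator and keep it on one side of each piece.

    Hypothetical helper for illustration; not part of langchain_text_splitters.
    """
    # A capturing group makes re.split return the separators too:
    # [text, sep, text, sep, ..., text]
    parts = re.split(f"({re.escape(separator)})", text)
    pieces: list[str] = []
    if keep == "start":
        # The first piece has no preceding separator.
        pieces.append(parts[0])
        for i in range(1, len(parts) - 1, 2):
            pieces.append(parts[i] + parts[i + 1])  # separator + following text
    elif keep == "end":
        for i in range(0, len(parts) - 1, 2):
            pieces.append(parts[i] + parts[i + 1])  # text + trailing separator
        # The last piece has no trailing separator.
        pieces.append(parts[-1])
    else:
        raise ValueError("keep must be 'start' or 'end'")
    # Drop empty pieces produced by leading or trailing separators.
    return [p for p in pieces if p]


text = "This section introduces the topic. Below is a list. Here's a table"
at_start = split_keep(text, ". ", keep="start")
at_end = split_keep(text, ". ", keep="end")
print(at_start)
print(at_end)
```

With `keep="start"`, every piece after the first begins with `. `, matching the output above; with `keep="end"`, the separator stays with the sentence it terminates.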
### Dependencies
None

@ttrumper3
2025-06-14 17:56:14 -04:00
xsl
__init__.py text-splitters: Add JSFrameworkTextSplitter for Handling JavaScript Framework Code (#28972) 2025-03-17 23:32:33 +00:00
base.py text-splitters[patch]: fix some import-untyped errors (#31030) 2025-05-15 11:34:22 -04:00
character.py text-splitters: Fix regex separator merge bug in CharacterTextSplitter (#31137) 2025-05-10 15:42:03 -04:00
html.py text-splitters: Add keep_separator arg to HTMLSemanticPreservingSplitter (#31588) 2025-06-14 17:56:14 -04:00
json.py text-splitters: Set strict mypy rules (#30900) 2025-04-22 20:41:24 -07:00
jsx.py text-splitters: Add JSFrameworkTextSplitter for Handling JavaScript Framework Code (#28972) 2025-03-17 23:32:33 +00:00
konlpy.py text-splitters[patch]: fix some import-untyped errors (#31030) 2025-05-15 11:34:22 -04:00
latex.py
markdown.py text-splitters: Set strict mypy rules (#30900) 2025-04-22 20:41:24 -07:00
nltk.py text-splitters[patch]: fix some import-untyped errors (#31030) 2025-05-15 11:34:22 -04:00
py.typed
python.py
sentence_transformers.py text-splitters: Set strict mypy rules (#30900) 2025-04-22 20:41:24 -07:00
spacy.py text-splitters[patch]: fix some import-untyped errors (#31030) 2025-05-15 11:34:22 -04:00