text-splitters: Fix regex separator merge bug in CharacterTextSplitter (#31137)

**Description:**
Fix the merge logic in `CharacterTextSplitter.split_text` so that when
a zero-width regex lookaround separator (`is_separator_regex=True`) is
used with `keep_separator=False`, the raw pattern text is no longer
re-inserted between merged chunks.
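
To illustrate the symptom, here is a minimal standalone sketch that uses only the standard `re` module rather than the library itself. The helper `merge_separator` and the sample `pattern` and `text` are hypothetical names chosen for this example. The helper mirrors the prefix check in the diff, and the plain `str.join` stands in for `_merge_splits`, which additionally respects the chunk size.

```python
import re

# Prefixes that mark a zero-width lookaround, as listed in the diff.
LOOKAROUND_PREFIXES = ("(?=", "(?<!", "(?<=", "(?!")


def merge_separator(separator: str, is_separator_regex: bool, keep_separator: bool) -> str:
    """Hypothetical helper: pick the string used to re-join adjacent splits."""
    is_lookaround = is_separator_regex and separator.startswith(LOOKAROUND_PREFIXES)
    if keep_separator or is_lookaround:
        return ""  # nothing was consumed by the split, so nothing is re-inserted
    return separator


pattern = r"(?=\n## )"  # zero-width lookahead: split before each "\n## "
text = "## A\nalpha\n## B\nbeta"

# A zero-width split consumes no characters; empty pieces are dropped.
pieces = [p for p in re.split(pattern, text) if p]

# Old behaviour: the raw pattern string is used as glue between pieces.
old_join = pattern.join(pieces)

# Fixed behaviour: lookaround separators are joined with an empty string.
new_join = merge_separator(pattern, True, False).join(pieces)

print(repr(old_join))
print(repr(new_join))
```

With the old glue, the literal text `(?=\n## )` ends up inside the merged chunk. With the fix, the pieces join back into the original text. Because the check only looks at the start of the pattern, a regex that merely contains a lookaround later in the expression is still treated as an ordinary separator.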

**Issue:**
Fixes #31136 

**Dependencies:**
None

**Twitter handle:**
None

Since this is my first open-source PR, please feel free to point out any
mistakes, and I'll be eager to make corrections.
Sumin Shin
2025-05-11 04:42:03 +09:00
committed by GitHub
parent 0ef4ac75b7
commit 683da2c9e9
2 changed files with 70 additions and 6 deletions

@@ -18,14 +18,30 @@ class CharacterTextSplitter(TextSplitter):
         self._is_separator_regex = is_separator_regex
     def split_text(self, text: str) -> List[str]:
-        """Split incoming text and return chunks."""
-        # First we naively split the large input into a bunch of smaller ones.
-        separator = (
+        """Split into chunks without re-inserting lookaround separators."""
+        # 1. Determine split pattern: raw regex or escaped literal
+        sep_pattern = (
             self._separator if self._is_separator_regex else re.escape(self._separator)
         )
-        splits = _split_text_with_regex(text, separator, self._keep_separator)
-        _separator = "" if self._keep_separator else self._separator
-        return self._merge_splits(splits, _separator)
+        # 2. Initial split (keep separator if requested)
+        splits = _split_text_with_regex(text, sep_pattern, self._keep_separator)
+        # 3. Detect zero-width lookaround so we never re-insert it
+        lookaround_prefixes = ("(?=", "(?<!", "(?<=", "(?!")
+        is_lookaround = self._is_separator_regex and any(
+            self._separator.startswith(p) for p in lookaround_prefixes
+        )
+        # 4. Decide merge separator:
+        #    - if keep_separator or lookaround → don't re-insert
+        #    - else → re-insert literal separator
+        merge_sep = ""
+        if not (self._keep_separator or is_lookaround):
+            merge_sep = self._separator
+        # 5. Merge adjacent splits and return
+        return self._merge_splits(splits, merge_sep)
 def _split_text_with_regex(