fix(text-splitters): remove incorrect C# and Elixir separator keywords (#37037)

## Summary

Removes two incorrect separators from `get_separators_for_language()` in
`RecursiveCharacterTextSplitter`:

- **C#**: `"\nimplements "` targets `implements`, which is a Java keyword.
C# declares interface implementation with `:`, so this separator never
matches idiomatic C# source code.
- **Elixir**: `"\nwhile "` targets a `while` construct that Elixir does
not have. The language uses recursion and `Enum.reduce_while/3` instead
of while loops.

Both are dead separators that silently degrade chunking quality by
occupying positions in the separator priority list without contributing
useful split points.
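
As a rough illustration of why such an entry is dead weight, the sketch below models only the "pick the first separator that occurs in the text" step. It is a simplified stand-in and not the library's actual implementation; `pick_separator` and the shortened `csharp_like` list are hypothetical names introduced for this example.

```python
def pick_separator(text: str, separators: list[str]) -> str:
    """Return the first separator found in text, or "" if none is found.

    Simplified model of the selection step; not the library's code.
    """
    for sep in separators:
        # The empty string is treated as the final character-level fallback.
        if sep == "" or sep in text:
            return sep
    return ""


# Hypothetical, shortened separator list for illustration only.
csharp_like = ["\nimplements ", "\nclass ", "\n", " ", ""]

# C# expresses interface implementation with ":", not "implements".
source = "interface IShape {}\nclass Circle : IShape {}\nclass Square : IShape {}"

chosen = pick_separator(source, csharp_like)

# Separators that never occur in the input are skipped on every call.
dead = [sep for sep in csharp_like if sep and sep not in source]

print(repr(chosen))
print(dead)
```

In this model the `"\nimplements "` entry is checked first and skipped each time, so removing it leaves the chosen split point unchanged.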

## Tests

Added two targeted tests:
- `test_csharp_separators_no_java_keywords`: verifies `"\nimplements "`
is not in the C# separator list
- `test_elixir_separators_no_while`: verifies `"\nwhile "` is not in the
Elixir separator list

Existing `test_csharp_code_splitter` continues to pass (no change to
expected output since `implements` never matched valid C# code).

Full suite: 129 passed, 0 failed.

Fixes #37030
Dayna Blackwell
2026-04-27 10:48:19 -07:00
committed by GitHub
parent 3b945d02d9
commit 3b9750f0a4
2 changed files with 17 additions and 2 deletions


@@ -1011,6 +1011,23 @@ class Program
]
def test_csharp_separators_no_java_keywords() -> None:
    """C# separators should not contain Java-only keywords."""
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.CSHARP, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    # "implements" is a Java keyword; C# uses ":" for interface implementation
    assert "\nimplements " not in splitter._separators


def test_elixir_separators_no_while() -> None:
    """Elixir has no while loop; the separator should not be present."""
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.ELIXIR, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    assert "\nwhile " not in splitter._separators


def test_cpp_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.CPP, chunk_size=CHUNK_SIZE, chunk_overlap=0