mirror of
https://github.com/hwchase17/langchain.git
synced 2026-06-09 10:17:00 +00:00
fix(text-splitters): remove incorrect C# and Elixir separator keywords (#37037)
## Summary

Removes two incorrect separators from `get_separators_for_language()` in `RecursiveCharacterTextSplitter`:

- **C#**: `"\nimplements "` is a Java keyword. C# uses `:` for interface implementation. This separator never matches valid C# source code.
- **Elixir**: `"\nwhile "` does not exist in Elixir. The language uses recursion and `Enum.reduce_while/3` instead of while loops.

Both are dead separators that silently degrade chunking quality by occupying positions in the separator priority list without contributing useful split points.

## Tests

Added two targeted tests:

- `test_csharp_separators_no_java_keywords`: verifies `"\nimplements "` is not in the C# separator list
- `test_elixir_separators_no_while`: verifies `"\nwhile "` is not in the Elixir separator list

Existing `test_csharp_code_splitter` continues to pass (no change to expected output since `implements` never matched valid C# code).

Full suite: 129 passed, 0 failed.

Fixes #37030
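The "dead separator" claim above can be checked with a minimal sketch. It does not use the library: `find_dead_separators`, the abbreviated `CSHARP_SEPARATORS` list, and `CSHARP_SAMPLE` are hypothetical names made up for illustration, and the list is not the actual C# separator list in `RecursiveCharacterTextSplitter`.

```python
def find_dead_separators(separators, source):
    """Return the non-empty separators that never occur in `source`."""
    return [sep for sep in separators if sep and sep not in source]


# Hypothetical, abbreviated priority list (NOT the library's real C# list).
CSHARP_SEPARATORS = [
    "\ninterface ",
    "\nclass ",
    "\nimplements ",  # Java keyword, included here to show it never matches
    "\npublic ",
    "\n\n",
    "\n",
    " ",
    "",
]

# Small but valid C# sample: interfaces are implemented with ":".
CSHARP_SAMPLE = """using System;

interface IShape
{
    double Area();
}

class Circle : IShape
{
    public double Area() => 3.14;
}

public class Program
{
    static void Main() { }
}
"""

dead = find_dead_separators(CSHARP_SEPARATORS, CSHARP_SAMPLE)
print(dead)
```

On this sample only the Java-style keyword is reported as unused. A single sample proves nothing about every C# file, so this is a sanity check rather than a substitute for the tests in the diff.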
@@ -1011,6 +1011,23 @@ class Program
     ]
 
 
+def test_csharp_separators_no_java_keywords() -> None:
+    """C# separators should not contain Java-only keywords."""
+    splitter = RecursiveCharacterTextSplitter.from_language(
+        Language.CSHARP, chunk_size=CHUNK_SIZE, chunk_overlap=0
+    )
+    # "implements" is a Java keyword; C# uses ":" for interface implementation
+    assert "\nimplements " not in splitter._separators
+
+
+def test_elixir_separators_no_while() -> None:
+    """Elixir has no while loop; the separator should not be present."""
+    splitter = RecursiveCharacterTextSplitter.from_language(
+        Language.ELIXIR, chunk_size=CHUNK_SIZE, chunk_overlap=0
+    )
+    assert "\nwhile " not in splitter._separators
+
+
 def test_cpp_code_splitter() -> None:
     splitter = RecursiveCharacterTextSplitter.from_language(
         Language.CPP, chunk_size=CHUNK_SIZE, chunk_overlap=0