mirror of
https://github.com/hwchase17/langchain.git
synced 2026-06-09 10:17:00 +00:00
fix(text-splitters): remove incorrect C# and Elixir separator keywords (#37037)
## Summary

Removes two incorrect separators from `get_separators_for_language()` in `RecursiveCharacterTextSplitter`:

- **C#**: `"\nimplements "` is a Java keyword. C# uses `:` for interface implementation. This separator never matches valid C# source code.
- **Elixir**: `"\nwhile "` does not exist in Elixir. The language uses recursion and `Enum.reduce_while/3` instead of while loops.

Both are dead separators that silently degrade chunking quality by occupying positions in the separator priority list without contributing useful split points.

## Tests

Added two targeted tests:

- `test_csharp_separators_no_java_keywords`: verifies `"\nimplements "` is not in the C# separator list
- `test_elixir_separators_no_while`: verifies `"\nwhile "` is not in the Elixir separator list

Existing `test_csharp_code_splitter` continues to pass (no change to expected output since `implements` never matched valid C# code).

Full suite: 129 passed, 0 failed.

Fixes #37030
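## Illustration

The claim that `"\nimplements "` never contributes a split point in C# can be sketched without the library. The snippet below is a minimal, self-contained illustration and not the project's test code: the separator list is only the excerpt visible in the diff context, and `matching_separators` and `CSHARP_SOURCE` are hypothetical names introduced here.

```python
# Hypothetical excerpt of the C# separator list before the fix.
# Only the entries visible in the diff context are reproduced.
CSHARP_SEPARATORS_BEFORE = [
    "\ninterface ",
    "\nenum ",
    "\nimplements ",
    "\ndelegate ",
    "\nevent ",
]

# The same excerpt with the Java-only keyword removed.
CSHARP_SEPARATORS_AFTER = [
    sep for sep in CSHARP_SEPARATORS_BEFORE if sep != "\nimplements "
]

# A small sample of valid C#: interface implementation uses ":".
CSHARP_SOURCE = """
interface IShape { double Area(); }

enum Color { Red, Green }

class Circle : IShape
{
    public double Area() { return 3.14; }
}
"""


def matching_separators(text: str, separators: list[str]) -> list[str]:
    """Return the separators that occur literally in text, in list order."""
    return [sep for sep in separators if sep in text]


before = matching_separators(CSHARP_SOURCE, CSHARP_SEPARATORS_BEFORE)
after = matching_separators(CSHARP_SOURCE, CSHARP_SEPARATORS_AFTER)
print(before == after)  # the removed entry matched nothing in this sample
```

On this sample the set of matching separators is identical before and after the removal, which is consistent with the existing `test_csharp_code_splitter` output staying unchanged.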
@@ -440,7 +440,6 @@ class RecursiveCharacterTextSplitter(TextSplitter):
 # Split along control flow statements
 "\nif ",
 "\nunless ",
-"\nwhile ",
 "\ncase ",
 "\ncond ",
 "\nwith ",
@@ -593,7 +592,6 @@ class RecursiveCharacterTextSplitter(TextSplitter):
 return [
 "\ninterface ",
 "\nenum ",
-"\nimplements ",
 "\ndelegate ",
 "\nevent ",
 # Split along class definitions