mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-02 11:39:18 +00:00
text-splitters: fix state persistence issue in ExperimentalMarkdownSyntaxTextSplitter (#28373)
- **Description:** This PR fixes a bug in the `ExperimentalMarkdownSyntaxTextSplitter` class, which retained internal state across multiple calls to the `split_text` method. Chunks accumulated in attributes stored on `self`, producing incorrect output when multiple Markdown files were processed sequentially with the same instance.
- Modified `libs\text-splitters\langchain_text_splitters\markdown.py` to reset the relevant internal attributes at the start of each `split_text` invocation. This ensures each call processes its input independently.
- Added unit tests in `libs\text-splitters\tests\unit_tests\test_text_splitters.py` to verify the fix and confirm that state does not persist across calls.
- **Issue:** Fixes [#26440](https://github.com/langchain-ai/langchain/issues/26440).
- **Dependencies:** No additional dependencies are introduced with this change.

- [x] Unit tests were added to verify the changes.
- [x] Updated documentation where necessary.
- [x] Ran `make format`, `make lint`, and `make test` to ensure compliance with project standards.

---------

Co-authored-by: Angel Chen <angelchen396@gmail.com>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
@@ -324,6 +324,11 @@ class ExperimentalMarkdownSyntaxTextSplitter:
             chunks of the input text. If `return_each_line` is enabled, each line
             is returned as a separate `Document`.
         """
+        # Reset the state for each new file processed
+        self.chunks.clear()
+        self.current_chunk = Document(page_content="")
+        self.current_header_stack.clear()
+
         raw_lines = text.splitlines(keepends=True)
 
         while raw_lines:
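To illustrate why the reset matters, here is a minimal sketch of the same pattern using a toy class. `ToyDocument` and `ToySplitter` are hypothetical stand-ins written for this example, not the real `Document` or `ExperimentalMarkdownSyntaxTextSplitter`, and the `reset_on_split` flag is an assumption added only to compare the two behaviours side by side.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a Document; not the langchain class."""

    page_content: str = ""


class ToySplitter:
    """Hypothetical splitter that keeps its working state on `self`."""

    def __init__(self, reset_on_split: bool) -> None:
        # Assumption for illustration: a flag to toggle the fix on and off.
        self.reset_on_split = reset_on_split
        self.chunks: List[ToyDocument] = []

    def split_text(self, text: str) -> List[ToyDocument]:
        if self.reset_on_split:
            # Same idea as the change in this commit: discard whatever
            # the previous call left behind before processing new input.
            self.chunks.clear()
        for line in text.splitlines():
            if line.strip():
                self.chunks.append(ToyDocument(page_content=line))
        # Return a copy so callers do not share the internal list.
        return list(self.chunks)


if __name__ == "__main__":
    buggy = ToySplitter(reset_on_split=False)
    buggy.split_text("# A\nalpha")
    print(len(buggy.split_text("# B\nbeta")))  # 4: chunks from the first call leak in

    fixed = ToySplitter(reset_on_split=True)
    fixed.split_text("# A\nalpha")
    print(len(fixed.split_text("# B\nbeta")))  # 2: only the second input
```

Without the reset, reusing one instance for a second input returns the first input's chunks as well, which is the accumulation the commit message describes.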