text-splitters: Introduce Experimental Markdown Syntax Splitter (#22257)

#### Description
This MR defines a `ExperimentalMarkdownSyntaxTextSplitter` class. The
main goal is to replicate the functionality of the original
`MarkdownHeaderTextSplitter` which extracts the header stack as metadata
but with one critical difference: it keeps the whitespace of the
original text intact.

This draft reimplements the `MarkdownHeaderTextSplitter` with a very
different algorithmic approach. Instead of marking up each line of the
text individually and aggregating them back together into chunks, this
method builds each chunk sequentially and applies the metadata to each
chunk. This makes the implementation simpler. However, since it's
designed to keep white space intact its not a full drop in replacement
for the original. Since it is a radical implementation change to the
original code and I would like to get feedback to see if this is a
worthwhile replacement, should be it's own class, or is not a good idea
at all.

Note: I implemented the `return_each_line` parameter but I don't think
it's a necessary feature. I'd prefer to remove it.

This implementation also adds the following additional features:
- Splits out code blocks and includes the language in the `"Code"`
metadata key
- Splits text on the horizontal rule `---` as well
- The `headers_to_split_on` parameter is now optional - with sensible
defaults that can be overridden.

#### Issue
Keeping the whitespace keeps the paragraphs structure and the formatting
of the code blocks intact which allows the caller much more flexibility
in how they want to further split the individuals sections of the
resulting documents. This addresses the issues brought up by the
community in the following issues:
- https://github.com/langchain-ai/langchain/issues/20823
- https://github.com/langchain-ai/langchain/issues/19436
- https://github.com/langchain-ai/langchain/issues/22256

#### Dependencies
N/A

#### Twitter handle
@RyanElston

---------

Co-authored-by: isaac hershenson <ihershenson@hmc.edu>
This commit is contained in:
Ryan Elston
2024-06-18 20:44:00 -06:00
committed by GitHub
parent 93d0ad97fe
commit 86ee4f0daa
2 changed files with 368 additions and 2 deletions

View File

@@ -19,7 +19,10 @@ from langchain_text_splitters.base import split_text_on_tokens
from langchain_text_splitters.character import CharacterTextSplitter
from langchain_text_splitters.html import HTMLHeaderTextSplitter, HTMLSectionSplitter
from langchain_text_splitters.json import RecursiveJsonSplitter
from langchain_text_splitters.markdown import MarkdownHeaderTextSplitter
from langchain_text_splitters.markdown import (
ExperimentalMarkdownSyntaxTextSplitter,
MarkdownHeaderTextSplitter,
)
from langchain_text_splitters.python import PythonCodeTextSplitter
FAKE_PYTHON_TEXT = """
@@ -1296,6 +1299,210 @@ def test_md_header_text_splitter_with_invisible_characters(characters: str) -> N
assert output == expected_output
EXPERIMENTAL_MARKDOWN_DOCUMENT = (
"# My Header 1\n"
"Content for header 1\n"
"## Header 2\n"
"Content for header 2\n"
"```python\n"
"def func_definition():\n"
" print('Keep the whitespace consistent')\n"
"```\n"
"# Header 1 again\n"
"We should also split on the horizontal line\n"
"----\n"
"This will be a new doc but with the same header metadata\n\n"
"And it includes a new paragraph"
)
def test_experimental_markdown_syntax_text_splitter() -> None:
"""Test experimental markdown syntax splitter."""
markdown_splitter = ExperimentalMarkdownSyntaxTextSplitter()
output = markdown_splitter.split_text(EXPERIMENTAL_MARKDOWN_DOCUMENT)
expected_output = [
Document(
page_content="Content for header 1\n",
metadata={"Header 1": "My Header 1"},
),
Document(
page_content="Content for header 2\n",
metadata={"Header 1": "My Header 1", "Header 2": "Header 2"},
),
Document(
page_content=(
"```python\ndef func_definition():\n "
"print('Keep the whitespace consistent')\n```\n"
),
metadata={
"Code": "python",
"Header 1": "My Header 1",
"Header 2": "Header 2",
},
),
Document(
page_content="We should also split on the horizontal line\n",
metadata={"Header 1": "Header 1 again"},
),
Document(
page_content=(
"This will be a new doc but with the same header metadata\n\n"
"And it includes a new paragraph"
),
metadata={"Header 1": "Header 1 again"},
),
]
assert output == expected_output
def test_experimental_markdown_syntax_text_splitter_header_configuration() -> None:
"""Test experimental markdown syntax splitter."""
headers_to_split_on = [("#", "Encabezamiento 1")]
markdown_splitter = ExperimentalMarkdownSyntaxTextSplitter(
headers_to_split_on=headers_to_split_on
)
output = markdown_splitter.split_text(EXPERIMENTAL_MARKDOWN_DOCUMENT)
expected_output = [
Document(
page_content="Content for header 1\n## Header 2\nContent for header 2\n",
metadata={"Encabezamiento 1": "My Header 1"},
),
Document(
page_content=(
"```python\ndef func_definition():\n "
"print('Keep the whitespace consistent')\n```\n"
),
metadata={"Code": "python", "Encabezamiento 1": "My Header 1"},
),
Document(
page_content="We should also split on the horizontal line\n",
metadata={"Encabezamiento 1": "Header 1 again"},
),
Document(
page_content=(
"This will be a new doc but with the same header metadata\n\n"
"And it includes a new paragraph"
),
metadata={"Encabezamiento 1": "Header 1 again"},
),
]
assert output == expected_output
def test_experimental_markdown_syntax_text_splitter_with_headers() -> None:
"""Test experimental markdown syntax splitter."""
markdown_splitter = ExperimentalMarkdownSyntaxTextSplitter(strip_headers=False)
output = markdown_splitter.split_text(EXPERIMENTAL_MARKDOWN_DOCUMENT)
expected_output = [
Document(
page_content="# My Header 1\nContent for header 1\n",
metadata={"Header 1": "My Header 1"},
),
Document(
page_content="## Header 2\nContent for header 2\n",
metadata={"Header 1": "My Header 1", "Header 2": "Header 2"},
),
Document(
page_content=(
"```python\ndef func_definition():\n "
"print('Keep the whitespace consistent')\n```\n"
),
metadata={
"Code": "python",
"Header 1": "My Header 1",
"Header 2": "Header 2",
},
),
Document(
page_content=(
"# Header 1 again\nWe should also split on the horizontal line\n"
),
metadata={"Header 1": "Header 1 again"},
),
Document(
page_content=(
"This will be a new doc but with the same header metadata\n\n"
"And it includes a new paragraph"
),
metadata={"Header 1": "Header 1 again"},
),
]
assert output == expected_output
def test_experimental_markdown_syntax_text_splitter_split_lines() -> None:
"""Test experimental markdown syntax splitter."""
markdown_splitter = ExperimentalMarkdownSyntaxTextSplitter(return_each_line=True)
output = markdown_splitter.split_text(EXPERIMENTAL_MARKDOWN_DOCUMENT)
expected_output = [
Document(
page_content="Content for header 1", metadata={"Header 1": "My Header 1"}
),
Document(
page_content="Content for header 2",
metadata={"Header 1": "My Header 1", "Header 2": "Header 2"},
),
Document(
page_content="```python",
metadata={
"Code": "python",
"Header 1": "My Header 1",
"Header 2": "Header 2",
},
),
Document(
page_content="def func_definition():",
metadata={
"Code": "python",
"Header 1": "My Header 1",
"Header 2": "Header 2",
},
),
Document(
page_content=" print('Keep the whitespace consistent')",
metadata={
"Code": "python",
"Header 1": "My Header 1",
"Header 2": "Header 2",
},
),
Document(
page_content="```",
metadata={
"Code": "python",
"Header 1": "My Header 1",
"Header 2": "Header 2",
},
),
Document(
page_content="We should also split on the horizontal line",
metadata={"Header 1": "Header 1 again"},
),
Document(
page_content="This will be a new doc but with the same header metadata",
metadata={"Header 1": "Header 1 again"},
),
Document(
page_content="And it includes a new paragraph",
metadata={"Header 1": "Header 1 again"},
),
]
assert output == expected_output
def test_solidity_code_splitter() -> None:
splitter = RecursiveCharacterTextSplitter.from_language(
Language.SOL, chunk_size=CHUNK_SIZE, chunk_overlap=0