text_splitters: Add HTMLSemanticPreservingSplitter (#25911)

**Description:** 

The current HTML splitters rely on a secondary pass through
`RecursiveCharacterTextSplitter` to chunk the document into manageable
pieces. The problem is that this fails to keep important structures,
such as tables and lists, intact within the HTML.

This implementation of an HTML splitter lets the user define a maximum
chunk size, HTML elements to preserve in full, whether to preserve
`<a>` href links in the output, and custom handlers.

The core splitting begins with headers, similar to `HTMLHeaderSplitter`.
If a section exceeds `max_chunk_size`, further recursive splitting is
triggered. During this splitting, elements listed for preservation are
excluded from the splitting process. This can produce chunks slightly
larger than the max size, depending on the length of the preserved
content, but the full contextual relevance of each preserved item
remains intact.
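A toy sketch of that packing logic (not the actual implementation; the segment representation and function name are ours): preserved segments are treated as atomic, so a chunk may overflow `max_chunk_size` by the length of a preserved element.

```python
def split_preserving(segments, max_chunk_size):
    """Greedily pack (text, is_preserved) segments, in document order,
    into chunks of roughly max_chunk_size characters.

    Preserved segments (tables, lists, ...) are atomic: they are never
    cut, so a chunk may exceed max_chunk_size by their length.
    """
    chunks, current = [], ""
    for text, is_preserved in segments:
        # Flush only when adding plain text would overflow; a preserved
        # segment stays attached to its surrounding context instead.
        if current and not is_preserved and len(current) + len(text) > max_chunk_size:
            chunks.append(current)
            current = ""
        current += text
    if current:
        chunks.append(current)
    return chunks
```

For example, `split_preserving([("aaaa", False), ("bbbb", False), ("TTTTTTTTTT", True)], 6)` yields `["aaaa", "bbbbTTTTTTTTTT"]`: the preserved run rides along with its context even though that chunk overflows the limit.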

**Custom Handlers**: Companies such as Atlassian sometimes ship custom
HTML elements that `BeautifulSoup` does not parse by default. Custom
handlers let the user supply a function to be run whenever a specific
HTML tag is encountered, so information inside custom tags that `bs4`
would otherwise miss during extraction can be preserved and gathered.
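As a rough illustration of the idea (the real splitter hands the handler a BeautifulSoup `Tag`; here a plain dict stands in so the sketch runs without `bs4`, and all names are hypothetical):

```python
# Hypothetical handler for an Atlassian-style custom tag. In the real
# splitter the argument would be a BeautifulSoup Tag; a dict stands in here.
def handle_structured_macro(tag):
    name = tag.get("ac:name", "macro")
    body = tag.get("text", "")
    return f"[{name}]: {body}"

# Map tag names to handler functions; the splitter invokes the matching
# handler whenever it encounters that tag during traversal.
custom_handlers = {"ac:structured-macro": handle_structured_macro}

tag = {"ac:name": "code", "text": "print('hello')"}
print(custom_handlers["ac:structured-macro"](tag))  # [code]: print('hello')
```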

**Dependencies:** Users will need to install `bs4` in their project to
utilise this class.

I have also added a `how_to` guide and unit tests, which require `bs4`
to run; otherwise they are skipped.

Flowchart of process:


![HTMLSemanticPreservingSplitter](https://github.com/user-attachments/assets/20873c36-22ed-4c80-884b-d3c6f433f5a7)

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
Luke 2024-12-20 04:09:22 +11:00, committed by GitHub
parent 24bfa062bf · commit f69695069d
13 changed files with 1835 additions and 574 deletions

File diff suppressed because one or more lines are too long



@@ -110,7 +110,7 @@ Examples of structure-based splitting:
* See the how-to guide for [Markdown splitting](/docs/how_to/markdown_header_metadata_splitter/).
* See the how-to guide for [Recursive JSON splitting](/docs/how_to/recursive_json_splitter/).
* See the how-to guide for [Code splitting](/docs/how_to/code_splitter/).
-* See the how-to guide for [HTML splitting](/docs/how_to/HTML_header_metadata_splitter/).
+* See the how-to guide for [HTML splitting](/docs/how_to/split_html/).
:::


@@ -1,359 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c95fcd15cd52c944",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"# How to split by HTML header \n",
"## Description and motivation\n",
"\n",
"[HTMLHeaderTextSplitter](https://python.langchain.com/api_reference/text_splitters/html/langchain_text_splitters.html.HTMLHeaderTextSplitter.html) is a \"structure-aware\" [text splitter](/docs/concepts/text_splitters/) that splits text at the HTML element level and adds metadata for each header \"relevant\" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.\n",
"\n",
"It is analogous to the [MarkdownHeaderTextSplitter](/docs/how_to/markdown_header_metadata_splitter) for markdown files.\n",
"\n",
"To specify what headers to split on, specify `headers_to_split_on` when instantiating `HTMLHeaderTextSplitter` as shown below.\n",
"\n",
"## Usage examples\n",
"### 1) How to split HTML strings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2e55d44c-1fff-449a-bf52-0d6df488323f",
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-text-splitters"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "initial_id",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-02T18:57:49.208965400Z",
"start_time": "2023-10-02T18:57:48.899756Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Foo'),\n",
" Document(page_content='Some intro text about Foo. \\nBar main section Bar subsection 1 Bar subsection 2', metadata={'Header 1': 'Foo'}),\n",
" Document(page_content='Some intro text about Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}),\n",
" Document(page_content='Some text about the first subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}),\n",
" Document(page_content='Some text about the second subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}),\n",
" Document(page_content='Baz', metadata={'Header 1': 'Foo'}),\n",
" Document(page_content='Some text about Baz', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}),\n",
" Document(page_content='Some concluding text about Foo', metadata={'Header 1': 'Foo'})]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_text_splitters import HTMLHeaderTextSplitter\n",
"\n",
"html_string = \"\"\"\n",
"<!DOCTYPE html>\n",
"<html>\n",
"<body>\n",
" <div>\n",
" <h1>Foo</h1>\n",
" <p>Some intro text about Foo.</p>\n",
" <div>\n",
" <h2>Bar main section</h2>\n",
" <p>Some intro text about Bar.</p>\n",
" <h3>Bar subsection 1</h3>\n",
" <p>Some text about the first subtopic of Bar.</p>\n",
" <h3>Bar subsection 2</h3>\n",
" <p>Some text about the second subtopic of Bar.</p>\n",
" </div>\n",
" <div>\n",
" <h2>Baz</h2>\n",
" <p>Some text about Baz</p>\n",
" </div>\n",
" <br>\n",
" <p>Some concluding text about Foo</p>\n",
" </div>\n",
"</body>\n",
"</html>\n",
"\"\"\"\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
" (\"h3\", \"Header 3\"),\n",
"]\n",
"\n",
"html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)\n",
"html_header_splits = html_splitter.split_text(html_string)\n",
"html_header_splits"
]
},
{
"cell_type": "markdown",
"id": "7126f179-f4d0-4b5d-8bef-44e83b59262c",
"metadata": {},
"source": [
"To return each element together with their associated headers, specify `return_each_element=True` when instantiating `HTMLHeaderTextSplitter`:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "90c23088-804c-4c89-bd09-b820587ceeef",
"metadata": {},
"outputs": [],
"source": [
"html_splitter = HTMLHeaderTextSplitter(\n",
" headers_to_split_on,\n",
" return_each_element=True,\n",
")\n",
"html_header_splits_elements = html_splitter.split_text(html_string)"
]
},
{
"cell_type": "markdown",
"id": "b776c54e-9159-4d88-9d6c-3a1d0b639dfe",
"metadata": {},
"source": [
"Comparing with the above, where elements are aggregated by their headers:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "711abc74-a7b0-4dc5-a4bb-af3cafe4e0f4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Foo'\n",
"page_content='Some intro text about Foo. \\nBar main section Bar subsection 1 Bar subsection 2' metadata={'Header 1': 'Foo'}\n"
]
}
],
"source": [
"for element in html_header_splits[:2]:\n",
" print(element)"
]
},
{
"cell_type": "markdown",
"id": "fe5528db-187c-418a-9480-fc0267645d42",
"metadata": {},
"source": [
"Now each element is returned as a distinct `Document`:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "24722d8e-d073-46a8-a821-6b722412f1be",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Foo'\n",
"page_content='Some intro text about Foo.' metadata={'Header 1': 'Foo'}\n",
"page_content='Bar main section Bar subsection 1 Bar subsection 2' metadata={'Header 1': 'Foo'}\n"
]
}
],
"source": [
"for element in html_header_splits_elements[:3]:\n",
" print(element)"
]
},
{
"cell_type": "markdown",
"id": "e29b4aade2a0070c",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"#### 2) How to split from a URL or HTML file:\n",
"\n",
"To read directly from a URL, pass the URL string into the `split_text_from_url` method.\n",
"\n",
"Similarly, a local HTML file can be passed to the `split_text_from_file` method."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "6ecb9fb2-32ff-4249-a4b4-d5e5e191f013",
"metadata": {},
"outputs": [],
"source": [
"url = \"https://plato.stanford.edu/entries/goedel/\"\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
" (\"h3\", \"Header 3\"),\n",
" (\"h4\", \"Header 4\"),\n",
"]\n",
"\n",
"html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)\n",
"\n",
"# for local file use html_splitter.split_text_from_file(<path_to_file>)\n",
"html_header_splits = html_splitter.split_text_from_url(url)"
]
},
{
"cell_type": "markdown",
"id": "c6e3dd41-0c57-472a-a3d4-4e7e8ea6914f",
"metadata": {},
"source": [
"### 2) How to constrain chunk sizes:\n",
"\n",
"`HTMLHeaderTextSplitter`, which splits based on HTML headers, can be composed with another splitter which constrains splits based on character lengths, such as `RecursiveCharacterTextSplitter`.\n",
"\n",
"This can be done using the `.split_documents` method of the second splitter:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "6ada8ea093ea0475",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-02T18:57:51.016141300Z",
"start_time": "2023-10-02T18:57:50.647495400Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='We see that Gödel first tried to reduce the consistency problem for analysis to that of arithmetic. This seemed to require a truth definition for arithmetic, which in turn led to paradoxes, such as the Liar paradox (“This sentence is false”) and Berrys paradox (“The least number not defined by an expression consisting of just fourteen English words”). Gödel then noticed that such paradoxes would not necessarily arise if truth were replaced by provability. But this means that arithmetic truth', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),\n",
" Document(page_content='means that arithmetic truth and arithmetic provability are not co-extensive — whence the First Incompleteness Theorem.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),\n",
" Document(page_content='This account of Gödels discovery was told to Hao Wang very much after the fact; but in Gödels contemporary correspondence with Bernays and Zermelo, essentially the same description of his path to the theorems is given. (See Gödel 2003a and Gödel 2003b respectively.) From those accounts we see that the undefinability of truth in arithmetic, a result credited to Tarski, was likely obtained in some form by Gödel by 1931. But he neither publicized nor published the result; the biases logicians', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),\n",
" Document(page_content='result; the biases logicians had expressed at the time concerning the notion of truth, biases which came vehemently to the fore when Tarski announced his results on the undefinability of truth in formal systems 1935, may have served as a deterrent to Gödels publication of that theorem.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),\n",
" Document(page_content='We now describe the proof of the two theorems, formulating Gödels results in Peano arithmetic. Gödel himself used a system related to that defined in Principia Mathematica, but containing Peano arithmetic. In our presentation of the First and Second Incompleteness Theorems we refer to Peano arithmetic as P, following Gödels notation.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.2 The proof of the First Incompleteness Theorem'})]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
"\n",
"chunk_size = 500\n",
"chunk_overlap = 30\n",
"text_splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size=chunk_size, chunk_overlap=chunk_overlap\n",
")\n",
"\n",
"# Split\n",
"splits = text_splitter.split_documents(html_header_splits)\n",
"splits[80:85]"
]
},
{
"cell_type": "markdown",
"id": "ac0930371d79554a",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"## Limitations\n",
"\n",
"There can be quite a bit of structural variation from one HTML document to another, and while `HTMLHeaderTextSplitter` will attempt to attach all \"relevant\" headers to any given chunk, it can sometimes miss certain headers. For example, the algorithm assumes an informational hierarchy in which headers are always at nodes \"above\" associated text, i.e. prior siblings, ancestors, and combinations thereof. In the following news article (as of the writing of this document), the document is structured such that the text of the top-level headline, while tagged \"h1\", is in a *distinct* subtree from the text elements that we'd expect it to be *\"above\"*&mdash;so we can observe that the \"h1\" element and its associated text do not show up in the chunk metadata (but, where applicable, we do see \"h2\" and its associated text): \n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "5a5ec1482171b119",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-02T19:03:25.943524300Z",
"start_time": "2023-10-02T19:03:25.691641Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No two El Niño winters are the same, but many have temperature and precipitation trends in common. \n",
"Average conditions during an El Niño winter across the continental US. \n",
"One of the major reasons is the position of the jet stream, which often shifts south during an El Niño winter. This shift typically brings wetter and cooler weather to the South while the North becomes drier and warmer, according to NOAA. \n",
"Because the jet stream is essentially a river of air that storms flow through, they c\n"
]
}
],
"source": [
"url = \"https://www.cnn.com/2023/09/25/weather/el-nino-winter-us-climate/index.html\"\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
"]\n",
"\n",
"html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)\n",
"html_header_splits = html_splitter.split_text_from_url(url)\n",
"print(html_header_splits[1].page_content[:500])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -1,207 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c95fcd15cd52c944",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"# How to split by HTML sections\n",
"## Description and motivation\n",
"Similar in concept to the [HTMLHeaderTextSplitter](/docs/how_to/HTML_header_metadata_splitter), the `HTMLSectionSplitter` is a \"structure-aware\" [text splitter](/docs/concepts/text_splitters/) that splits text at the element level and adds metadata for each header \"relevant\" to any given chunk.\n",
"\n",
"It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures.\n",
"\n",
"Use `xslt_path` to provide an absolute path to transform the HTML so that it can detect sections based on provided tags. The default is to use the `converting_to_header.xslt` file in the `data_connection/document_transformers` directory. This is for converting the html to a format/layout that is easier to detect sections. For example, `span` based on their font size can be converted to header tags to be detected as a section.\n",
"\n",
"## Usage examples\n",
"### 1) How to split HTML strings:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "initial_id",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-02T18:57:49.208965400Z",
"start_time": "2023-10-02T18:57:48.899756Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Foo \\n Some intro text about Foo.', metadata={'Header 1': 'Foo'}),\n",
" Document(page_content='Bar main section \\n Some intro text about Bar. \\n Bar subsection 1 \\n Some text about the first subtopic of Bar. \\n Bar subsection 2 \\n Some text about the second subtopic of Bar.', metadata={'Header 2': 'Bar main section'}),\n",
" Document(page_content='Baz \\n Some text about Baz \\n \\n \\n Some concluding text about Foo', metadata={'Header 2': 'Baz'})]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_text_splitters import HTMLSectionSplitter\n",
"\n",
"html_string = \"\"\"\n",
" <!DOCTYPE html>\n",
" <html>\n",
" <body>\n",
" <div>\n",
" <h1>Foo</h1>\n",
" <p>Some intro text about Foo.</p>\n",
" <div>\n",
" <h2>Bar main section</h2>\n",
" <p>Some intro text about Bar.</p>\n",
" <h3>Bar subsection 1</h3>\n",
" <p>Some text about the first subtopic of Bar.</p>\n",
" <h3>Bar subsection 2</h3>\n",
" <p>Some text about the second subtopic of Bar.</p>\n",
" </div>\n",
" <div>\n",
" <h2>Baz</h2>\n",
" <p>Some text about Baz</p>\n",
" </div>\n",
" <br>\n",
" <p>Some concluding text about Foo</p>\n",
" </div>\n",
" </body>\n",
" </html>\n",
"\"\"\"\n",
"\n",
"headers_to_split_on = [(\"h1\", \"Header 1\"), (\"h2\", \"Header 2\")]\n",
"\n",
"html_splitter = HTMLSectionSplitter(headers_to_split_on)\n",
"html_header_splits = html_splitter.split_text(html_string)\n",
"html_header_splits"
]
},
{
"cell_type": "markdown",
"id": "e29b4aade2a0070c",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"### 2) How to constrain chunk sizes:\n",
"\n",
"`HTMLSectionSplitter` can be used with other text splitters as part of a chunking pipeline. Internally, it uses the `RecursiveCharacterTextSplitter` when the section size is larger than the chunk size. It also considers the font size of the text to determine whether it is a section or not based on the determined font size threshold."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6ada8ea093ea0475",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-02T18:57:51.016141300Z",
"start_time": "2023-10-02T18:57:50.647495400Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Foo \\n Some intro text about Foo.', metadata={'Header 1': 'Foo'}),\n",
" Document(page_content='Bar main section \\n Some intro text about Bar.', metadata={'Header 2': 'Bar main section'}),\n",
" Document(page_content='Bar subsection 1 \\n Some text about the first subtopic of Bar.', metadata={'Header 3': 'Bar subsection 1'}),\n",
" Document(page_content='Bar subsection 2 \\n Some text about the second subtopic of Bar.', metadata={'Header 3': 'Bar subsection 2'}),\n",
" Document(page_content='Baz \\n Some text about Baz \\n \\n \\n Some concluding text about Foo', metadata={'Header 2': 'Baz'})]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
"\n",
"html_string = \"\"\"\n",
" <!DOCTYPE html>\n",
" <html>\n",
" <body>\n",
" <div>\n",
" <h1>Foo</h1>\n",
" <p>Some intro text about Foo.</p>\n",
" <div>\n",
" <h2>Bar main section</h2>\n",
" <p>Some intro text about Bar.</p>\n",
" <h3>Bar subsection 1</h3>\n",
" <p>Some text about the first subtopic of Bar.</p>\n",
" <h3>Bar subsection 2</h3>\n",
" <p>Some text about the second subtopic of Bar.</p>\n",
" </div>\n",
" <div>\n",
" <h2>Baz</h2>\n",
" <p>Some text about Baz</p>\n",
" </div>\n",
" <br>\n",
" <p>Some concluding text about Foo</p>\n",
" </div>\n",
" </body>\n",
" </html>\n",
"\"\"\"\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
" (\"h3\", \"Header 3\"),\n",
" (\"h4\", \"Header 4\"),\n",
"]\n",
"\n",
"html_splitter = HTMLSectionSplitter(headers_to_split_on)\n",
"\n",
"html_header_splits = html_splitter.split_text(html_string)\n",
"\n",
"chunk_size = 500\n",
"chunk_overlap = 30\n",
"text_splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size=chunk_size, chunk_overlap=chunk_overlap\n",
")\n",
"\n",
"# Split\n",
"splits = text_splitter.split_documents(html_header_splits)\n",
"splits"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -143,8 +143,7 @@ What LangChain calls [LLMs](/docs/concepts/text_llms) are older forms of languag
[Text Splitters](/docs/concepts/text_splitters) take a document and split into chunks that can be used for retrieval.
- [How to: recursively split text](/docs/how_to/recursive_text_splitter)
-- [How to: split by HTML headers](/docs/how_to/HTML_header_metadata_splitter)
-- [How to: split by HTML sections](/docs/how_to/HTML_section_aware_splitter)
+- [How to: split HTML](/docs/how_to/split_html)
- [How to: split by character](/docs/how_to/character_text_splitter)
- [How to: split code](/docs/how_to/code_splitter)
- [How to: split Markdown by headers](/docs/how_to/markdown_header_metadata_splitter)


@@ -0,0 +1,963 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "7fb27b941602401d91542211134fc71a",
"metadata": {},
"source": [
"# How to split HTML"
]
},
{
"cell_type": "markdown",
"id": "acae54e37e7d407bbb7b55eff062a284",
"metadata": {},
"source": [
"Splitting HTML documents into manageable chunks is essential for various text processing tasks such as natural language processing, search indexing, and more. In this guide, we will explore three different text splitters provided by LangChain that you can use to split HTML content effectively:\n",
"\n",
"- [**HTMLHeaderTextSplitter**](#using-htmlheadertextsplitter)\n",
"- [**HTMLSectionSplitter**](#using-htmlsectionsplitter)\n",
"- [**HTMLSemanticPreservingSplitter**](#using-htmlsemanticpreservingsplitter)\n",
"\n",
"Each of these splitters has unique features and use cases. This guide will help you understand the differences between them, why you might choose one over the others, and how to use them effectively."
]
},
{
"cell_type": "markdown",
"id": "e48a7480-5ec3-47e6-9e6a-f36c74d16f4f",
"metadata": {},
"source": [
"```\n",
"%pip install -qU langchain-text-splitters\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "9a63283cbaf04dbcab1f6479b197f3a8",
"metadata": {},
"source": [
"## Overview of the Splitters"
]
},
{
"cell_type": "markdown",
"id": "8dd0d8092fe74a7c96281538738b07e2",
"metadata": {},
"source": [
"### [HTMLHeaderTextSplitter](#using-htmlheadertextsplitter)\n",
"\n",
":::info\n",
"Useful when you want to preserve the hierarchical structure of a document based on its headings.\n",
":::\n",
"\n",
"**Description**: Splits HTML text based on header tags (e.g., `<h1>`, `<h2>`, `<h3>`, etc.), and adds metadata for each header relevant to any given chunk.\n",
"\n",
"**Capabilities**:\n",
"- Splits text at the HTML element level.\n",
"- Preserves context-rich information encoded in document structures.\n",
"- Can return chunks element by element or combine elements with the same metadata.\n",
"\n",
"___\n"
]
},
{
"cell_type": "markdown",
"id": "72eea5119410473aa328ad9291626812",
"metadata": {},
"source": [
"### [HTMLSectionSplitter](#using-htmlsectionsplitter)\n",
"\n",
":::info \n",
"Useful when you want to split HTML documents into larger sections, such as `<section>`, `<div>`, or custom-defined sections. \n",
":::\n",
"\n",
"**Description**: Similar to HTMLHeaderTextSplitter but focuses on splitting HTML into sections based on specified tags.\n",
"\n",
"**Capabilities**:\n",
"- Uses XSLT transformations to detect and split sections.\n",
"- Internally uses `RecursiveCharacterTextSplitter` for large sections.\n",
"- Considers font sizes to determine sections.\n",
"___"
]
},
{
"cell_type": "markdown",
"id": "8edb47106e1a46a883d545849b8ab81b",
"metadata": {},
"source": [
"### [HTMLSemanticPreservingSplitter](#using-htmlsemanticpreservingsplitter)\n",
"\n",
":::info \n",
"Ideal when you need to ensure that structured elements are not split across chunks, preserving contextual relevancy. \n",
":::\n",
"\n",
"**Description**: Splits HTML content into manageable chunks while preserving the semantic structure of important elements like tables, lists, and other HTML components.\n",
"\n",
"**Capabilities**:\n",
"- Preserves tables, lists, and other specified HTML elements.\n",
"- Allows custom handlers for specific HTML tags.\n",
"- Ensures that the semantic meaning of the document is maintained.\n",
"- Built in normalization & stopword removal\n",
"\n",
"___"
]
},
{
"cell_type": "markdown",
"id": "10185d26023b46108eb7d9f57d49d2b3",
"metadata": {},
"source": [
"### Choosing the Right Splitter\n",
"\n",
"- **Use `HTMLHeaderTextSplitter` when**: You need to split an HTML document based on its header hierarchy and maintain metadata about the headers.\n",
"- **Use `HTMLSectionSplitter` when**: You need to split the document into larger, more general sections, possibly based on custom tags or font sizes.\n",
"- **Use `HTMLSemanticPreservingSplitter` when**: You need to split the document into chunks while preserving semantic elements like tables and lists, ensuring that they are not split and that their context is maintained."
]
},
{
"cell_type": "markdown",
"id": "19d42e40-015c-4bfa-b9ce-783e8377af2b",
"metadata": {},
"source": [
"| Feature | HTMLHeaderTextSplitter | HTMLSectionSplitter | HTMLSemanticPreservingSplitter |\n",
"|--------------------------------------------|------------------------|---------------------|-------------------------------|\n",
"| Splits based on headers | Yes | Yes | Yes |\n",
"| Preserves semantic elements (tables, lists) | No | No | Yes |\n",
"| Adds metadata for headers | Yes | Yes | Yes |\n",
"| Custom handlers for HTML tags | No | No | Yes |\n",
"| Preserves media (images, videos) | No | No | Yes |\n",
"| Considers font sizes | No | Yes | No |\n",
"| Uses XSLT transformations | No | Yes | No |"
]
},
{
"cell_type": "markdown",
"id": "8763a12b2bbd4a93a75aff182afb95dc",
"metadata": {},
"source": [
"## Example HTML Document\n",
"\n",
"Let's use the following HTML document as an example:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c9ca2682",
"metadata": {},
"outputs": [],
"source": [
"html_string = \"\"\"\n",
"<!DOCTYPE html>\n",
" <html lang='en'>\n",
" <head>\n",
" <meta charset='UTF-8'>\n",
" <meta name='viewport' content='width=device-width, initial-scale=1.0'>\n",
" <title>Fancy Example HTML Page</title>\n",
" </head>\n",
" <body>\n",
" <h1>Main Title</h1>\n",
" <p>This is an introductory paragraph with some basic content.</p>\n",
" \n",
" <h2>Section 1: Introduction</h2>\n",
" <p>This section introduces the topic. Below is a list:</p>\n",
" <ul>\n",
" <li>First item</li>\n",
" <li>Second item</li>\n",
" <li>Third item with <strong>bold text</strong> and <a href='#'>a link</a></li>\n",
" </ul>\n",
" \n",
" <h3>Subsection 1.1: Details</h3>\n",
" <p>This subsection provides additional details. Here's a table:</p>\n",
" <table border='1'>\n",
" <thead>\n",
" <tr>\n",
" <th>Header 1</th>\n",
" <th>Header 2</th>\n",
" <th>Header 3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>Row 1, Cell 1</td>\n",
" <td>Row 1, Cell 2</td>\n",
" <td>Row 1, Cell 3</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Row 2, Cell 1</td>\n",
" <td>Row 2, Cell 2</td>\n",
" <td>Row 2, Cell 3</td>\n",
" </tr>\n",
" </tbody>\n",
" </table>\n",
" \n",
" <h2>Section 2: Media Content</h2>\n",
" <p>This section contains an image and a video:</p>\n",
" <img src='example_image_link.mp4' alt='Example Image'>\n",
" <video controls width='250' src='example_video_link.mp4' type='video/mp4'>\n",
" Your browser does not support the video tag.\n",
" </video>\n",
"\n",
" <h2>Section 3: Code Example</h2>\n",
" <p>This section contains a code block:</p>\n",
" <pre><code data-lang=\"html\">\n",
" &lt;div&gt;\n",
" &lt;p&gt;This is a paragraph inside a div.&lt;/p&gt;\n",
" &lt;/div&gt;\n",
" </code></pre>\n",
"\n",
" <h2>Conclusion</h2>\n",
" <p>This is the conclusion of the document.</p>\n",
" </body>\n",
" </html>\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"id": "86c25b86",
"metadata": {},
"source": [
"## Using HTMLHeaderTextSplitter\n",
"\n",
"[HTMLHeaderTextSplitter](https://python.langchain.com/api_reference/text_splitters/html/langchain_text_splitters.html.HTMLHeaderTextSplitter.html) is a \"structure-aware\" [text splitter](/docs/concepts/text_splitters/) that splits text at the HTML element level and adds metadata for each header \"relevant\" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.\n",
"\n",
"It is analogous to the [MarkdownHeaderTextSplitter](/docs/how_to/markdown_header_metadata_splitter) for markdown files.\n",
"\n",
"To specify what headers to split on, specify `headers_to_split_on` when instantiating `HTMLHeaderTextSplitter` as shown below."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "23361e55",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),\n",
" Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic. Below is a list: \\nFirst item Second item Third item with bold text and a link'),\n",
" Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction', 'Header 3': 'Subsection 1.1: Details'}, page_content=\"This subsection provides additional details. Here's a table:\"),\n",
" Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:'),\n",
" Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block:'),\n",
" Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_text_splitters import HTMLHeaderTextSplitter\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
" (\"h3\", \"Header 3\"),\n",
"]\n",
"\n",
"html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)\n",
"html_header_splits = html_splitter.split_text(html_string)\n",
"html_header_splits"
]
},
{
"cell_type": "markdown",
"id": "f7e9cfef-5387-4ffe-b1a8-4b9214b9debd",
"metadata": {},
"source": [
"To return each element together with their associated headers, specify `return_each_element=True` when instantiating `HTMLHeaderTextSplitter`:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "adb0da17-d1d5-4913-aacd-dc5d70db66cf",
"metadata": {},
"outputs": [],
"source": [
"html_splitter = HTMLHeaderTextSplitter(\n",
" headers_to_split_on,\n",
" return_each_element=True,\n",
")\n",
"html_header_splits_elements = html_splitter.split_text(html_string)"
]
},
{
"cell_type": "markdown",
"id": "fa781f86-d04f-4c09-a4a1-26aac88d4fc3",
"metadata": {},
"source": [
"Comparing with the above, where elements are aggregated by their headers:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "d9d7b61f-f927-49fc-a592-1b0f049d1a2f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='This is an introductory paragraph with some basic content.' metadata={'Header 1': 'Main Title'}\n",
"page_content='This section introduces the topic. Below is a list: \n",
"First item Second item Third item with bold text and a link' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}\n"
]
}
],
"source": [
"for element in html_header_splits[:2]:\n",
" print(element)"
]
},
{
"cell_type": "markdown",
"id": "4b5869e0-3a9b-4fa2-82ec-dbde33464a52",
"metadata": {},
"source": [
"Now each element is returned as a distinct `Document`:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "e2677a27-875a-4455-8e3f-6a6c7706be20",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='This is an introductory paragraph with some basic content.' metadata={'Header 1': 'Main Title'}\n",
"page_content='This section introduces the topic. Below is a list:' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}\n",
"page_content='First item Second item Third item with bold text and a link' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}\n"
]
}
],
"source": [
"for element in html_header_splits_elements[:3]:\n",
" print(element)"
]
},
{
"cell_type": "markdown",
"id": "24200c53-6a52-436e-ba4d-8d0514cfa87c",
"metadata": {},
"source": [
"### How to split from a URL or HTML file:\n",
"\n",
"To read directly from a URL, pass the URL string into the `split_text_from_url` method.\n",
"\n",
"Similarly, a local HTML file can be passed to the `split_text_from_file` method."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "581eeddf-7e88-48a7-999d-da56304e3522",
"metadata": {},
"outputs": [],
"source": [
"url = \"https://plato.stanford.edu/entries/goedel/\"\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
" (\"h3\", \"Header 3\"),\n",
" (\"h4\", \"Header 4\"),\n",
"]\n",
"\n",
"html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)\n",
"\n",
"# for local file use html_splitter.split_text_from_file(<path_to_file>)\n",
"html_header_splits = html_splitter.split_text_from_url(url)"
]
},
{
"cell_type": "markdown",
"id": "88ba9878-068e-434f-bdac-17972b2aa9ad",
"metadata": {},
"source": [
"### How to constrain chunk sizes:\n",
"\n",
"`HTMLHeaderTextSplitter`, which splits based on HTML headers, can be composed with another splitter which constrains splits based on character lengths, such as `RecursiveCharacterTextSplitter`.\n",
"\n",
"This can be done using the `.split_documents` method of the second splitter:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "3efc1fb8-f264-4ae6-883d-694a4d252f86",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='We see that Gödel first tried to reduce the consistency problem for analysis to that of arithmetic. This seemed to require a truth definition for arithmetic, which in turn led to paradoxes, such as the Liar paradox (“This sentence is false”) and Berrys paradox (“The least number not defined by an expression consisting of just fourteen English words”). Gödel then noticed that such paradoxes would not necessarily arise if truth were replaced by provability. But this means that arithmetic truth'),\n",
" Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='means that arithmetic truth and arithmetic provability are not co-extensive — whence the First Incompleteness Theorem.'),\n",
" Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='This account of Gödels discovery was told to Hao Wang very much after the fact; but in Gödels contemporary correspondence with Bernays and Zermelo, essentially the same description of his path to the theorems is given. (See Gödel 2003a and Gödel 2003b respectively.) From those accounts we see that the undefinability of truth in arithmetic, a result credited to Tarski, was likely obtained in some form by Gödel by 1931. But he neither publicized nor published the result; the biases logicians'),\n",
" Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='result; the biases logicians had expressed at the time concerning the notion of truth, biases which came vehemently to the fore when Tarski announced his results on the undefinability of truth in formal systems 1935, may have served as a deterrent to Gödels publication of that theorem.'),\n",
" Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.2 The proof of the First Incompleteness Theorem'}, page_content='We now describe the proof of the two theorems, formulating Gödels results in Peano arithmetic. Gödel himself used a system related to that defined in Principia Mathematica, but containing Peano arithmetic. In our presentation of the First and Second Incompleteness Theorems we refer to Peano arithmetic as P, following Gödels notation.')]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
"\n",
"chunk_size = 500\n",
"chunk_overlap = 30\n",
"text_splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size=chunk_size, chunk_overlap=chunk_overlap\n",
")\n",
"\n",
"# Split\n",
"splits = text_splitter.split_documents(html_header_splits)\n",
"splits[80:85]"
]
},
{
"cell_type": "markdown",
"id": "27363092-c6b2-4290-9e12-f6aece503bfb",
"metadata": {},
"source": [
"### Limitations\n",
"\n",
"There can be quite a bit of structural variation from one HTML document to another, and while `HTMLHeaderTextSplitter` will attempt to attach all \"relevant\" headers to any given chunk, it can sometimes miss certain headers. For example, the algorithm assumes an informational hierarchy in which headers are always at nodes \"above\" associated text, i.e. prior siblings, ancestors, and combinations thereof. In the following news article (as of the writing of this document), the document is structured such that the text of the top-level headline, while tagged \"h1\", is in a *distinct* subtree from the text elements that we'd expect it to be *\"above\"*&mdash;so we can observe that the \"h1\" element and its associated text do not show up in the chunk metadata (but, where applicable, we do see \"h2\" and its associated text): \n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "332043f6-ec39-4fcd-aa57-4dec5252684b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No two El Niño winters are the same, but many have temperature and precipitation trends in common. \n",
"Average conditions during an El Niño winter across the continental US. \n",
"One of the major reasons is the position of the jet stream, which often shifts south during an El Niño winter. This shift typically brings wetter and cooler weather to the South while the North becomes drier and warmer, according to NOAA. \n",
"Because the jet stream is essentially a river of air that storms flow through, they c\n"
]
}
],
"source": [
"url = \"https://www.cnn.com/2023/09/25/weather/el-nino-winter-us-climate/index.html\"\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
"]\n",
"\n",
"html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)\n",
"html_header_splits = html_splitter.split_text_from_url(url)\n",
"print(html_header_splits[1].page_content[:500])"
]
},
{
"cell_type": "markdown",
"id": "aa8ba422",
"metadata": {},
"source": [
"## Using HTMLSectionSplitter\n",
"\n",
"Similar in concept to the [HTMLHeaderTextSplitter](#using-htmlheadertextsplitter), the `HTMLSectionSplitter` is a \"structure-aware\" [text splitter](/docs/concepts/text_splitters/) that splits text at the element level and adds metadata for each header \"relevant\" to any given chunk. It lets you split HTML by sections.\n",
"\n",
"It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures.\n",
"\n",
"Use `xslt_path` to provide an absolute path to an XSLT file that transforms the HTML so that sections can be detected based on the provided tags. The default is the `converting_to_header.xslt` file in the `data_connection/document_transformers` directory, which converts the HTML into a format/layout in which sections are easier to detect. For example, `span` elements can be converted to header tags based on their font size, so that they are detected as sections.\n",
"\n",
"### How to split HTML strings:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "65376c86",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title \\n This is an introductory paragraph with some basic content.'),\n",
" Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content=\"Section 1: Introduction \\n This section introduces the topic. Below is a list: \\n \\n First item \\n Second item \\n Third item with bold text and a link \\n \\n \\n Subsection 1.1: Details \\n This subsection provides additional details. Here's a table: \\n \\n \\n \\n Header 1 \\n Header 2 \\n Header 3 \\n \\n \\n \\n \\n Row 1, Cell 1 \\n Row 1, Cell 2 \\n Row 1, Cell 3 \\n \\n \\n Row 2, Cell 1 \\n Row 2, Cell 2 \\n Row 2, Cell 3\"),\n",
" Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Section 2: Media Content \\n This section contains an image and a video: \\n \\n \\n Your browser does not support the video tag.'),\n",
" Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='Section 3: Code Example \\n This section contains a code block: \\n \\n <div>\\n <p>This is a paragraph inside a div.</p>\\n </div>'),\n",
" Document(metadata={'Header 2': 'Conclusion'}, page_content='Conclusion \\n This is the conclusion of the document.')]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_text_splitters import HTMLSectionSplitter\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
"]\n",
"\n",
"html_splitter = HTMLSectionSplitter(headers_to_split_on)\n",
"html_header_splits = html_splitter.split_text(html_string)\n",
"html_header_splits"
]
},
{
"cell_type": "markdown",
"id": "5aa8627f-0af3-48c2-b9ed-bfbc46f8030d",
"metadata": {},
"source": [
"### How to constrain chunk sizes:\n",
"\n",
"`HTMLSectionSplitter` can be used with other text splitters as part of a chunking pipeline. Internally, it uses `RecursiveCharacterTextSplitter` when a section is larger than the chunk size. It also uses a font size threshold to determine whether a piece of text should be treated as a section."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "5a9759b2-f6ff-413e-b42d-9d8967f3f3e6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title'),\n",
" Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some'),\n",
" Document(metadata={'Header 1': 'Main Title'}, page_content='some basic content.'),\n",
" Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='Section 1: Introduction'),\n",
" Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic. Below is a'),\n",
" Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='is a list:'),\n",
" Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='First item \\n Second item'),\n",
" Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='Third item with bold text and a link'),\n",
" Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Subsection 1.1: Details'),\n",
" Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='This subsection provides additional details.'),\n",
" Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content=\"Here's a table:\"),\n",
" Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Header 1 \\n Header 2 \\n Header 3'),\n",
" Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Row 1, Cell 1 \\n Row 1, Cell 2'),\n",
" Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Row 1, Cell 3 \\n \\n \\n Row 2, Cell 1'),\n",
" Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Row 2, Cell 2 \\n Row 2, Cell 3'),\n",
" Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Section 2: Media Content'),\n",
" Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:'),\n",
" Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Your browser does not support the video'),\n",
" Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='tag.'),\n",
" Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='Section 3: Code Example'),\n",
" Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: \\n \\n <div>'),\n",
" Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='<p>This is a paragraph inside a div.</p>'),\n",
" Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='</div>'),\n",
" Document(metadata={'Header 2': 'Conclusion'}, page_content='Conclusion'),\n",
" Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
" (\"h3\", \"Header 3\"),\n",
"]\n",
"\n",
"html_splitter = HTMLSectionSplitter(headers_to_split_on)\n",
"\n",
"html_header_splits = html_splitter.split_text(html_string)\n",
"\n",
"chunk_size = 50\n",
"chunk_overlap = 5\n",
"text_splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size=chunk_size, chunk_overlap=chunk_overlap\n",
")\n",
"\n",
"# Split\n",
"splits = text_splitter.split_documents(html_header_splits)\n",
"splits"
]
},
{
"cell_type": "markdown",
"id": "6fb9f81a",
"metadata": {},
"source": [
"## Using HTMLSemanticPreservingSplitter\n",
"\n",
"The `HTMLSemanticPreservingSplitter` is designed to split HTML content into manageable chunks while preserving the semantic structure of important elements like tables, lists, and other HTML components. This ensures that such elements are not split across chunks, which would otherwise lose contextual relevance (for example, table headers separated from their rows, or list items from their list).\n",
"\n",
"At its heart, this splitter is designed to create contextually relevant chunks. Generic recursive splitting with `HTMLHeaderTextSplitter` can cut tables, lists, and other structured elements in the middle, losing significant context and producing poor chunks.\n",
"\n",
"The `HTMLSemanticPreservingSplitter` is essential for splitting HTML content that includes structured elements like tables and lists, especially when it's critical to preserve these elements intact. Additionally, its ability to define custom handlers for specific HTML tags makes it a versatile tool for processing complex HTML documents.\n",
"\n",
"**IMPORTANT**: `max_chunk_size` is not a hard maximum. Chunk sizes are calculated while preserved content is excluded from the chunk, to guarantee that content is never split. When the preserved data is re-inserted, the final chunk may therefore exceed `max_chunk_size`. This trade-off is crucial for maintaining the structure of the original document.\n",
"\n",
":::info \n",
"\n",
"Notes:\n",
"\n",
"1. We define a custom handler to re-format the contents of code blocks.\n",
"2. We define a denylist of specific HTML elements, so that they and their contents are decomposed (removed) during pre-processing.\n",
"3. We intentionally set a small chunk size to demonstrate that preserved elements are not split.\n",
"\n",
":::"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "6cd119a8-c3d1-48a8-b569-469cd1607169",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),\n",
" Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic'),\n",
" Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='. Below is a list: First item Second item Third item with bold text and a link Subsection 1.1: Details This subsection provides additional details'),\n",
" Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content=\". Here's a table: Header 1 Header 2 Header 3 Row 1, Cell 1 Row 1, Cell 2 Row 1, Cell 3 Row 2, Cell 1 Row 2, Cell 2 Row 2, Cell 3\"),\n",
" Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video: ![image:example_image_link.mp4](example_image_link.mp4) ![video:example_video_link.mp4](example_video_link.mp4)'),\n",
" Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: <code:html> <div> <p>This is a paragraph inside a div.</p> </div> </code>'),\n",
" Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# BeautifulSoup is required to use the custom handlers\n",
"from bs4 import Tag\n",
"from langchain_text_splitters import HTMLSemanticPreservingSplitter\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
"]\n",
"\n",
"\n",
"def code_handler(element: Tag) -> str:\n",
" data_lang = element.get(\"data-lang\")\n",
" code_format = f\"<code:{data_lang}>{element.get_text()}</code>\"\n",
"\n",
" return code_format\n",
"\n",
"\n",
"splitter = HTMLSemanticPreservingSplitter(\n",
" headers_to_split_on=headers_to_split_on,\n",
" separators=[\"\\n\\n\", \"\\n\", \". \", \"! \", \"? \"],\n",
" max_chunk_size=50,\n",
" preserve_images=True,\n",
" preserve_videos=True,\n",
" elements_to_preserve=[\"table\", \"ul\", \"ol\", \"code\"],\n",
" denylist_tags=[\"script\", \"style\", \"head\"],\n",
" custom_handlers={\"code\": code_handler},\n",
")\n",
"\n",
"documents = splitter.split_text(html_string)\n",
"documents"
]
},
{
"cell_type": "markdown",
"id": "8c5413e2-3c50-435a-bda7-11e574a3fbab",
"metadata": {},
"source": [
"### Preserving Tables and Lists\n",
"In this example, we will demonstrate how the `HTMLSemanticPreservingSplitter` can preserve a table and a large list within an HTML document. The chunk size will be set to 50 characters to illustrate how the splitter ensures that these elements are not split, even when they exceed the maximum defined chunk size."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "8d97a2a6-9c73-4396-a922-f7a4eebe47f8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Document(metadata={'Header 1': 'Section 1'}, page_content='This section contains an important table and list'), Document(metadata={'Header 1': 'Section 1'}, page_content='that should not be split across chunks.'), Document(metadata={'Header 1': 'Section 1'}, page_content='Item Quantity Price Apples 10 $1.00 Oranges 5 $0.50 Bananas 50 $1.50'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='Additional text in subsection 1.1 that is'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='separated from the table and list. Here is a'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content=\"detailed list: Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.\")]\n"
]
}
],
"source": [
"from langchain_text_splitters import HTMLSemanticPreservingSplitter\n",
"\n",
"html_string = \"\"\"\n",
"<!DOCTYPE html>\n",
"<html>\n",
" <body>\n",
" <div>\n",
" <h1>Section 1</h1>\n",
" <p>This section contains an important table and list that should not be split across chunks.</p>\n",
" <table>\n",
" <tr>\n",
" <th>Item</th>\n",
" <th>Quantity</th>\n",
" <th>Price</th>\n",
" </tr>\n",
" <tr>\n",
" <td>Apples</td>\n",
" <td>10</td>\n",
" <td>$1.00</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Oranges</td>\n",
" <td>5</td>\n",
" <td>$0.50</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Bananas</td>\n",
" <td>50</td>\n",
" <td>$1.50</td>\n",
" </tr>\n",
" </table>\n",
" <h2>Subsection 1.1</h2>\n",
" <p>Additional text in subsection 1.1 that is separated from the table and list.</p>\n",
" <p>Here is a detailed list:</p>\n",
" <ul>\n",
" <li>Item 1: Description of item 1, which is quite detailed and important.</li>\n",
" <li>Item 2: Description of item 2, which also contains significant information.</li>\n",
" <li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>\n",
" </ul>\n",
" </div>\n",
" </body>\n",
"</html>\n",
"\"\"\"\n",
"\n",
"headers_to_split_on = [(\"h1\", \"Header 1\"), (\"h2\", \"Header 2\")]\n",
"\n",
"splitter = HTMLSemanticPreservingSplitter(\n",
" headers_to_split_on=headers_to_split_on,\n",
" max_chunk_size=50,\n",
" elements_to_preserve=[\"table\", \"ul\"],\n",
")\n",
"\n",
"documents = splitter.split_text(html_string)\n",
"print(documents)"
]
},
{
"cell_type": "markdown",
"id": "e44bab7f-5a81-4ec4-a7d8-0413c8e15b33",
"metadata": {},
"source": [
"#### Explanation\n",
"In this example, the `HTMLSemanticPreservingSplitter` ensures that the entire table and the unordered list (`<ul>`) are preserved within their respective chunks. Even though the chunk size is set to 50 characters, the splitter recognizes that these elements should not be split and keeps them intact.\n",
"\n",
"This is particularly important when dealing with data tables or lists, where splitting the content could lead to loss of context or confusion. The resulting `Document` objects retain the full structure of these elements, ensuring that the contextual relevance of the information is maintained."
]
},
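{
"cell_type": "markdown",
"id": "max-chunk-size-sketch",
"metadata": {},
"source": [
"The size bookkeeping described above can be illustrated with a small, hypothetical pure-Python sketch (the numbers and placeholder strings are invented for illustration; this is not the library's actual implementation):\n",
"\n",
"```python\n",
"max_chunk_size = 50\n",
"prose = 'Some surrounding prose that fits within the limit.'\n",
"preserved = '<table>' + 'x' * 120 + '</table>'  # stands in for a preserved table\n",
"\n",
"# The size check is applied to the prose alone, with preserved content excluded...\n",
"assert len(prose) <= max_chunk_size\n",
"\n",
"# ...so once the preserved element is re-inserted, the chunk exceeds the limit.\n",
"chunk = prose + ' ' + preserved\n",
"print(len(chunk))  # well above max_chunk_size, by design\n",
"```"
]
},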
{
"cell_type": "markdown",
"id": "0c0a83c9-1177-4f64-9f94-688b5ec4fafd",
"metadata": {},
"source": [
"### Using a Custom Handler\n",
"The `HTMLSemanticPreservingSplitter` allows you to define custom handlers for specific HTML elements. Some platforms use custom HTML tags that `BeautifulSoup` does not parse natively; in those cases, custom handlers let you add the formatting logic easily.\n",
"\n",
"This can be particularly useful for elements that require special processing, such as `<iframe>` tags or specific 'data-' elements. In this example, we'll create a custom handler for `iframe` tags that converts them into Markdown-like links."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "7179e5f4-cb00-4ec9-8b87-46e0061544b9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Document(metadata={'Header 1': 'Section with Iframe'}, page_content='[iframe:https://example.com/embed](https://example.com/embed) Some text after the iframe'), Document(metadata={'Header 1': 'Section with Iframe'}, page_content=\". Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.\")]\n"
]
}
],
"source": [
"def custom_iframe_extractor(iframe_tag):\n",
" iframe_src = iframe_tag.get(\"src\", \"\")\n",
" return f\"[iframe:{iframe_src}]({iframe_src})\"\n",
"\n",
"\n",
"splitter = HTMLSemanticPreservingSplitter(\n",
" headers_to_split_on=headers_to_split_on,\n",
" max_chunk_size=50,\n",
" separators=[\"\\n\\n\", \"\\n\", \". \"],\n",
" elements_to_preserve=[\"table\", \"ul\", \"ol\"],\n",
" custom_handlers={\"iframe\": custom_iframe_extractor},\n",
")\n",
"\n",
"html_string = \"\"\"\n",
"<!DOCTYPE html>\n",
"<html>\n",
" <body>\n",
" <div>\n",
" <h1>Section with Iframe</h1>\n",
" <iframe src=\"https://example.com/embed\"></iframe>\n",
" <p>Some text after the iframe.</p>\n",
" <ul>\n",
" <li>Item 1: Description of item 1, which is quite detailed and important.</li>\n",
" <li>Item 2: Description of item 2, which also contains significant information.</li>\n",
" <li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>\n",
" </ul>\n",
" </div>\n",
" </body>\n",
"</html>\n",
"\"\"\"\n",
"\n",
"documents = splitter.split_text(html_string)\n",
"print(documents)"
]
},
{
"cell_type": "markdown",
"id": "f2f62dc1-89ff-4bac-8f79-97cff3d1af3a",
"metadata": {},
"source": [
"#### Explanation\n",
"In this example, we defined a custom handler for `iframe` tags that converts them into Markdown-like links. When the splitter processes the HTML content, it uses this custom handler to transform the `iframe` tags while preserving other elements like tables and lists. The resulting `Document` objects show how the iframe is handled according to the custom logic you provided.\n",
"\n",
"**Important**: When preserving items such as links, be mindful not to include `.` in your separators, and do not leave `separators` unset. `RecursiveCharacterTextSplitter` splits on full stops, which will cut links in half. Provide a separator list containing `. ` instead."
]
},
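{
"cell_type": "markdown",
"id": "separator-link-sketch",
"metadata": {},
"source": [
"Why `.` as a separator is dangerous for links can be sketched with plain `str.split` (a simplification of how `RecursiveCharacterTextSplitter` breaks text on its separators; the example text is invented):\n",
"\n",
"```python\n",
"text = 'See [docs](https://example.com/page.html) for details. More text follows.'\n",
"\n",
"on_dot = text.split('.')  # splitting on '.' cuts inside the URL\n",
"on_dot_space = text.split('. ')  # splitting on '. ' keeps the link intact\n",
"\n",
"print(on_dot[0])  # the markdown link is destroyed mid-URL\n",
"print(on_dot_space[0])  # the full link survives in one piece\n",
"```"
]
},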
{
"cell_type": "markdown",
"id": "138ee7a2-200e-41a7-9e45-e15658a7b2e9",
"metadata": {},
"source": [
"### Using a custom handler to analyze an image with an LLM\n",
"\n",
"With custom handlers, we can also override the default processing for any element. A great example of this is inserting the semantic analysis of an image directly into the chunking flow.\n",
"\n",
"Since our handler is called whenever the tag is encountered, we can override the `<img>` tag (with `preserve_images` turned off) and insert whatever content we would like embedded in our chunks."
]
},
{
"cell_type": "markdown",
"id": "18f8bd11-770c-4b88-a2a7-a7cf42e2f481",
"metadata": {},
"source": [
"```python\n",
"\"\"\"This example assumes you have helper methods `load_image_from_url` and an LLM agent `llm` that can process image data.\"\"\"\n",
"\n",
"from langchain.agents import AgentExecutor\n",
"\n",
"# This example needs to be replaced with your own agent\n",
"llm = AgentExecutor(...)\n",
"\n",
"\n",
"# This method is a placeholder for loading image data from a URL and is not implemented here\n",
"def load_image_from_url(image_url: str) -> bytes:\n",
" # Assuming this method fetches the image data from the URL\n",
" return b\"image_data\"\n",
"\n",
"\n",
"html_string = \"\"\"\n",
"<!DOCTYPE html>\n",
"<html>\n",
" <body>\n",
" <div>\n",
" <h1>Section with Image and Link</h1>\n",
" <p>\n",
" <img src=\"https://example.com/image.jpg\" alt=\"An example image\" />\n",
" Some text after the image.\n",
" </p>\n",
" <ul>\n",
" <li>Item 1: Description of item 1, which is quite detailed and important.</li>\n",
" <li>Item 2: Description of item 2, which also contains significant information.</li>\n",
" <li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>\n",
" </ul>\n",
" </div>\n",
" </body>\n",
"</html>\n",
"\"\"\"\n",
"\n",
"\n",
"def custom_image_handler(img_tag) -> str:\n",
" img_src = img_tag.get(\"src\", \"\")\n",
" img_alt = img_tag.get(\"alt\", \"No alt text provided\")\n",
"\n",
" image_data = load_image_from_url(img_src)\n",
" semantic_meaning = llm.invoke(image_data)\n",
"\n",
" markdown_text = f\"[Image Alt Text: {img_alt} | Image Source: {img_src} | Image Semantic Meaning: {semantic_meaning}]\"\n",
"\n",
" return markdown_text\n",
"\n",
"\n",
"splitter = HTMLSemanticPreservingSplitter(\n",
" headers_to_split_on=headers_to_split_on,\n",
" max_chunk_size=50,\n",
" separators=[\"\\n\\n\", \"\\n\", \". \"],\n",
" elements_to_preserve=[\"ul\"],\n",
" preserve_images=False,\n",
" custom_handlers={\"img\": custom_image_handler},\n",
")\n",
"\n",
"documents = splitter.split_text(html_string)\n",
"\n",
"print(documents)\n",
"```\n",
"\n",
"```\n",
"[Document(metadata={'Header 1': 'Section with Image and Link'}, page_content='[Image Alt Text: An example image | Image Source: https://example.com/image.jpg | Image Semantic Meaning: semantic-meaning] Some text after the image'), \n",
"Document(metadata={'Header 1': 'Section with Image and Link'}, page_content=\". Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.\")]\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "f07de062-29da-4f30-82c2-ec48242f2a6e",
"metadata": {},
"source": [
"#### Explanation:\n",
"\n",
"With a custom handler written to extract the relevant fields from an `<img>` element, we can process the image data with our agent and insert the result directly into our chunk. It is important to set `preserve_images` to `False`; otherwise, the default processing of `<img>` tags will take place.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "728e0cfe-d7fd-4cc0-9c46-b2c02d335953",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@ -70,6 +70,14 @@
"source": "/docs/how_to/graph_prompting(/?)",
"destination": "/docs/tutorials/graph#few-shot-prompting"
},
{
"source": "/docs/how_to/HTML_header_metadata_splitter(/?)",
"destination": "/docs/how_to/split_html#using-htmlheadertextsplitter"
},
{
"source": "/docs/how_to/HTML_section_aware_splitter(/?)",
"destination": "/docs/how_to/split_html#using-htmlsectionsplitter"
},
{
"source": "/docs/tutorials/data_generation",
"destination": "https://python.langchain.com/v0.2/docs/tutorials/data_generation/"


@ -33,6 +33,7 @@ from langchain_text_splitters.html import (
ElementType,
HTMLHeaderTextSplitter,
HTMLSectionSplitter,
HTMLSemanticPreservingSplitter,
)
from langchain_text_splitters.json import RecursiveJsonSplitter
from langchain_text_splitters.konlpy import KonlpyTextSplitter
@ -70,6 +71,7 @@ __all__ = [
"LineType",
"HTMLHeaderTextSplitter",
"HTMLSectionSplitter",
"HTMLSemanticPreservingSplitter",
"MarkdownHeaderTextSplitter",
"MarkdownTextSplitter",
"CharacterTextSplitter",


@ -2,11 +2,24 @@ from __future__ import annotations
import copy
import pathlib
import re
from io import BytesIO, StringIO
from typing import Any, Dict, Iterable, List, Optional, Tuple, TypedDict, cast
from typing import (
Any,
Callable,
Dict,
Iterable,
List,
Optional,
Sequence,
Tuple,
TypedDict,
cast,
)
import requests
from langchain_core.documents import Document
from langchain_core._api import beta
from langchain_core.documents import BaseDocumentTransformer, Document
from langchain_text_splitters.character import RecursiveCharacterTextSplitter
@ -350,3 +363,484 @@ class HTMLSectionSplitter:
)
for section in sections
]
@beta()
class HTMLSemanticPreservingSplitter(BaseDocumentTransformer):
"""Split HTML content preserving semantic structure.
Splits HTML content by headers into generalized chunks, preserving semantic
structure. If chunks exceed the maximum chunk size, it uses
RecursiveCharacterTextSplitter for further splitting.
The splitter preserves full HTML elements (e.g., <table>, <ul>) and converts
links to Markdown-like links. It can also preserve images, videos, and audio
elements by converting them into Markdown format. Note that some chunks may
exceed the maximum size to maintain semantic integrity.
.. versionadded:: 0.3.5
Args:
headers_to_split_on (List[Tuple[str, str]]): HTML headers (e.g., "h1", "h2")
that define content sections.
max_chunk_size (int): Maximum size for each chunk, with allowance for
exceeding this limit to preserve semantics.
chunk_overlap (int): Number of characters to overlap between chunks to ensure
contextual continuity.
separators (List[str]): Delimiters used by RecursiveCharacterTextSplitter for
further splitting.
elements_to_preserve (List[str]): HTML tags (e.g., <table>, <ul>) to remain
intact during splitting.
preserve_links (bool): Converts <a> tags to Markdown links ([text](url)).
preserve_images (bool): Converts <img> tags to Markdown images (![alt](src)).
preserve_videos (bool): Converts <video> tags to Markdown
video links (![video](src)).
preserve_audio (bool): Converts <audio> tags to Markdown
audio links (![audio](src)).
custom_handlers (Dict[str, Callable[[Any], str]]): Optional custom handlers for
specific HTML tags, allowing tailored extraction or processing.
stopword_removal (bool): Optionally remove stopwords from the text.
stopword_lang (str): The language of stopwords to remove.
normalize_text (bool): Optionally normalize text
(e.g., lowercasing, removing punctuation).
external_metadata (Optional[Dict[str, str]]): Additional metadata to attach to
the Document objects.
allowlist_tags (Optional[List[str]]): Only these tags will be retained in
the HTML.
denylist_tags (Optional[List[str]]): These tags will be removed from the HTML.
preserve_parent_metadata (bool): Whether to pass through parent document
metadata to split documents when calling
``transform_documents/atransform_documents()``.
Example:
.. code-block:: python
from langchain_text_splitters.html import HTMLSemanticPreservingSplitter
def custom_iframe_extractor(iframe_tag):
"""
Custom handler function to extract the 'src' attribute from an <iframe> tag.
Converts the iframe to a Markdown-like link: [iframe:<src>](src).
Args:
    iframe_tag (bs4.element.Tag): The <iframe> tag to be processed.
Returns:
    str: A formatted string representing the iframe in Markdown-like format.
"""
iframe_src = iframe_tag.get('src', '')
return f"[iframe:{iframe_src}]({iframe_src})"
text_splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
max_chunk_size=500,
preserve_links=True,
preserve_images=True,
custom_handlers={"iframe": custom_iframe_extractor}
)
""" # noqa: E501, D214
def __init__(
self,
headers_to_split_on: List[Tuple[str, str]],
*,
max_chunk_size: int = 1000,
chunk_overlap: int = 0,
separators: Optional[List[str]] = None,
elements_to_preserve: Optional[List[str]] = None,
preserve_links: bool = False,
preserve_images: bool = False,
preserve_videos: bool = False,
preserve_audio: bool = False,
custom_handlers: Optional[Dict[str, Callable[[Any], str]]] = None,
stopword_removal: bool = False,
stopword_lang: str = "english",
normalize_text: bool = False,
external_metadata: Optional[Dict[str, str]] = None,
allowlist_tags: Optional[List[str]] = None,
denylist_tags: Optional[List[str]] = None,
preserve_parent_metadata: bool = False,
):
"""Initialize splitter."""
try:
from bs4 import BeautifulSoup, Tag
self._BeautifulSoup = BeautifulSoup
self._Tag = Tag
except ImportError:
raise ImportError(
"Could not import BeautifulSoup. "
"Please install it with 'pip install bs4'."
)
self._headers_to_split_on = sorted(headers_to_split_on)
self._max_chunk_size = max_chunk_size
self._elements_to_preserve = elements_to_preserve or []
self._preserve_links = preserve_links
self._preserve_images = preserve_images
self._preserve_videos = preserve_videos
self._preserve_audio = preserve_audio
self._custom_handlers = custom_handlers or {}
self._stopword_removal = stopword_removal
self._stopword_lang = stopword_lang
self._normalize_text = normalize_text
self._external_metadata = external_metadata or {}
self._allowlist_tags = allowlist_tags
self._preserve_parent_metadata = preserve_parent_metadata
if allowlist_tags:
self._allowlist_tags = list(
set(allowlist_tags + [header[0] for header in headers_to_split_on])
)
self._denylist_tags = denylist_tags
if denylist_tags:
self._denylist_tags = [
tag
for tag in denylist_tags
if tag not in [header[0] for header in headers_to_split_on]
]
if separators:
self._recursive_splitter = RecursiveCharacterTextSplitter(
separators=separators,
chunk_size=max_chunk_size,
chunk_overlap=chunk_overlap,
)
else:
self._recursive_splitter = RecursiveCharacterTextSplitter(
chunk_size=max_chunk_size, chunk_overlap=chunk_overlap
)
if self._stopword_removal:
try:
import nltk # type: ignore
from nltk.corpus import stopwords # type: ignore
nltk.download("stopwords")
self._stopwords = set(stopwords.words(self._stopword_lang))
except ImportError:
raise ImportError(
"Could not import nltk. Please install it with 'pip install nltk'."
)
def split_text(self, text: str) -> List[Document]:
"""Splits the provided HTML text into smaller chunks based on the configuration.
Args:
text (str): The HTML content to be split.
Returns:
List[Document]: A list of Document objects containing the split content.
"""
soup = self._BeautifulSoup(text, "html.parser")
self._process_media(soup)
if self._preserve_links:
self._process_links(soup)
if self._allowlist_tags or self._denylist_tags:
self._filter_tags(soup)
return self._process_html(soup)
def transform_documents(
self, documents: Sequence[Document], **kwargs: Any
) -> List[Document]:
"""Transform sequence of documents by splitting them."""
transformed = []
for doc in documents:
splits = self.split_text(doc.page_content)
if self._preserve_parent_metadata:
splits = [
Document(
page_content=split_doc.page_content,
metadata={**doc.metadata, **split_doc.metadata},
)
for split_doc in splits
]
transformed.extend(splits)
return transformed
def _process_media(self, soup: Any) -> None:
"""Processes the media elements.
Process elements in the HTML content by wrapping them in a <media-wrapper> tag
and converting them to Markdown format.
Args:
soup (Any): Parsed HTML content using BeautifulSoup.
"""
if self._preserve_images:
for img_tag in soup.find_all("img"):
img_src = img_tag.get("src", "")
markdown_img = f"![image:{img_src}]({img_src})"
wrapper = soup.new_tag("media-wrapper")
wrapper.string = markdown_img
img_tag.replace_with(wrapper)
if self._preserve_videos:
for video_tag in soup.find_all("video"):
video_src = video_tag.get("src", "")
markdown_video = f"![video:{video_src}]({video_src})"
wrapper = soup.new_tag("media-wrapper")
wrapper.string = markdown_video
video_tag.replace_with(wrapper)
if self._preserve_audio:
for audio_tag in soup.find_all("audio"):
audio_src = audio_tag.get("src", "")
markdown_audio = f"![audio:{audio_src}]({audio_src})"
wrapper = soup.new_tag("media-wrapper")
wrapper.string = markdown_audio
audio_tag.replace_with(wrapper)
def _process_links(self, soup: Any) -> None:
"""Processes the links in the HTML content.
Args:
soup (Any): Parsed HTML content using BeautifulSoup.
"""
for a_tag in soup.find_all("a"):
a_href = a_tag.get("href", "")
a_text = a_tag.get_text(strip=True)
markdown_link = f"[{a_text}]({a_href})"
wrapper = soup.new_tag("link-wrapper")
wrapper.string = markdown_link
a_tag.replace_with(wrapper)
def _filter_tags(self, soup: Any) -> None:
"""Filters the HTML content based on the allowlist and denylist tags.
Args:
soup (Any): Parsed HTML content using BeautifulSoup.
"""
if self._allowlist_tags:
for tag in soup.find_all(True):
if tag.name not in self._allowlist_tags:
tag.decompose()
if self._denylist_tags:
for tag in soup.find_all(self._denylist_tags):
tag.decompose()
def _normalize_and_clean_text(self, text: str) -> str:
"""Normalizes the text by removing extra spaces and newlines.
Args:
text (str): The text to be normalized.
Returns:
str: The normalized text.
"""
if self._normalize_text:
text = text.lower()
text = re.sub(r"[^\w\s]", "", text)
text = re.sub(r"\s+", " ", text).strip()
if self._stopword_removal:
text = " ".join(
[word for word in text.split() if word not in self._stopwords]
)
return text
def _process_html(self, soup: Any) -> List[Document]:
"""Processes the HTML content using BeautifulSoup and splits it using headers.
Args:
soup (Any): Parsed HTML content using BeautifulSoup.
Returns:
List[Document]: A list of Document objects containing the split content.
"""
documents: List[Document] = []
current_headers: Dict[str, str] = {}
current_content: List[str] = []
preserved_elements: Dict[str, str] = {}
placeholder_count: int = 0
def _get_element_text(element: Any) -> str:
"""Recursively extracts and processes the text of an element.
Applies custom handlers where applicable, and ensures correct spacing.
Args:
element (Any): The HTML element to process.
Returns:
str: The processed text of the element.
"""
if element.name in self._custom_handlers:
return self._custom_handlers[element.name](element)
text = ""
if element.name is not None:
for child in element.children:
child_text = _get_element_text(child).strip()
if text and child_text:
text += " "
text += child_text
elif element.string:
text += element.string
return self._normalize_and_clean_text(text)
elements = soup.find_all(recursive=False)
def _process_element(
element: List[Any],
documents: List[Document],
current_headers: Dict[str, str],
current_content: List[str],
preserved_elements: Dict[str, str],
placeholder_count: int,
) -> Tuple[List[Document], Dict[str, str], List[str], Dict[str, str], int]:
for elem in element:
if elem.name.lower() in ["html", "body", "div"]:
children = elem.find_all(recursive=False)
(
documents,
current_headers,
current_content,
preserved_elements,
placeholder_count,
) = _process_element(
children,
documents,
current_headers,
current_content,
preserved_elements,
placeholder_count,
)
continue
if elem.name in [h[0] for h in self._headers_to_split_on]:
if current_content:
documents.extend(
self._create_documents(
current_headers,
" ".join(current_content),
preserved_elements,
)
)
current_content.clear()
preserved_elements.clear()
header_name = elem.get_text(strip=True)
current_headers = {
dict(self._headers_to_split_on)[elem.name]: header_name
}
elif elem.name in self._elements_to_preserve:
placeholder = f"PRESERVED_{placeholder_count}"
preserved_elements[placeholder] = _get_element_text(elem)
current_content.append(placeholder)
placeholder_count += 1
else:
content = _get_element_text(elem)
if content:
current_content.append(content)
return (
documents,
current_headers,
current_content,
preserved_elements,
placeholder_count,
)
# Process the elements
(
documents,
current_headers,
current_content,
preserved_elements,
placeholder_count,
) = _process_element(
elements,
documents,
current_headers,
current_content,
preserved_elements,
placeholder_count,
)
# Handle any remaining content
if current_content:
documents.extend(
self._create_documents(
current_headers, " ".join(current_content), preserved_elements
)
)
return documents
def _create_documents(
self, headers: dict, content: str, preserved_elements: dict
) -> List[Document]:
"""Creates Document objects from the provided headers, content, and elements.
Args:
headers (dict): The headers to attach as metadata to the Document.
content (str): The content of the Document.
preserved_elements (dict): Preserved elements to be reinserted
into the content.
Returns:
List[Document]: A list of Document objects.
"""
content = re.sub(r"\s+", " ", content).strip()
metadata = {**headers, **self._external_metadata}
if len(content) <= self._max_chunk_size:
page_content = self._reinsert_preserved_elements(
content, preserved_elements
)
return [Document(page_content=page_content, metadata=metadata)]
else:
return self._further_split_chunk(content, metadata, preserved_elements)
def _further_split_chunk(
self, content: str, metadata: dict, preserved_elements: dict
) -> List[Document]:
"""Further splits the content into smaller chunks.
Args:
content (str): The content to be split.
metadata (dict): Metadata to attach to each chunk.
preserved_elements (dict): Preserved elements
to be reinserted into each chunk.
Returns:
List[Document]: A list of Document objects containing the split content.
"""
splits = self._recursive_splitter.split_text(content)
result = []
for split in splits:
split_with_preserved = self._reinsert_preserved_elements(
split, preserved_elements
)
if split_with_preserved.strip():
result.append(
Document(
page_content=split_with_preserved.strip(), metadata=metadata
)
)
return result
def _reinsert_preserved_elements(
self, content: str, preserved_elements: dict
) -> str:
"""Reinserts preserved elements into the content at their original positions.
Args:
content (str): The content where placeholders need to be replaced.
preserved_elements (dict): Preserved elements to be reinserted.
Returns:
str: The content with placeholders replaced by preserved elements.
"""
for placeholder, preserved_content in preserved_elements.items():
content = content.replace(placeholder, preserved_content.strip())
return content
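The preservation mechanism above boils down to swapping each protected element for a placeholder token before character splitting, then substituting the element text back into each chunk afterwards. A minimal stdlib-only sketch of the reinsertion step (function name and sample strings are illustrative, not taken from the library):

```python
def reinsert_preserved(content: str, preserved: dict) -> str:
    """Replace each placeholder with its stored element text,
    mirroring the _reinsert_preserved_elements step above."""
    for placeholder, element_text in preserved.items():
        content = content.replace(placeholder, element_text.strip())
    return content


# A <ul> was swapped for a placeholder before splitting; the chunk
# containing the placeholder gets the full list text back intact.
preserved = {"PRESERVED_0": "Item 1 Item 2 Item 3"}
chunk = "Shopping list: PRESERVED_0 End of section."
print(reinsert_preserved(chunk, preserved))
# → Shopping list: Item 1 Item 2 Item 3 End of section.
```

Because the placeholder is a single token, the recursive character splitter never cuts through the middle of the preserved element; the trade-off, as the class docstring notes, is that a chunk can exceed `max_chunk_size` after reinsertion.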

View File

@ -17,7 +17,11 @@ from langchain_text_splitters import (
)
from langchain_text_splitters.base import split_text_on_tokens
from langchain_text_splitters.character import CharacterTextSplitter
from langchain_text_splitters.html import HTMLHeaderTextSplitter, HTMLSectionSplitter
from langchain_text_splitters.html import (
HTMLHeaderTextSplitter,
HTMLSectionSplitter,
HTMLSemanticPreservingSplitter,
)
from langchain_text_splitters.json import RecursiveJsonSplitter
from langchain_text_splitters.markdown import (
ExperimentalMarkdownSyntaxTextSplitter,
@ -2452,3 +2456,360 @@ $csvContent | ForEach-Object {
"$csvContent | ForEach-Object {\n $_.ProcessName\n}",
"# End of script",
]
def custom_iframe_extractor(iframe_tag: Any) -> str:
iframe_src = iframe_tag.get("src", "")
return f"[iframe:{iframe_src}]({iframe_src})"
@pytest.mark.requires("bs4")
def test_html_splitter_with_custom_extractor() -> None:
"""Test HTML splitting with a custom extractor."""
html_content = """
<h1>Section 1</h1>
<p>This is an iframe:</p>
<iframe src="http://example.com"></iframe>
"""
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=[("h1", "Header 1")],
custom_handlers={"iframe": custom_iframe_extractor},
max_chunk_size=1000,
)
documents = splitter.split_text(html_content)
expected = [
Document(
page_content="This is an iframe: [iframe:http://example.com](http://example.com)",
metadata={"Header 1": "Section 1"},
),
]
assert documents == expected
@pytest.mark.requires("bs4")
def test_html_splitter_with_href_links() -> None:
"""Test HTML splitting with href links."""
html_content = """
<h1>Section 1</h1>
<p>This is a link to <a href="http://example.com">example.com</a></p>
"""
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=[("h1", "Header 1")],
preserve_links=True,
max_chunk_size=1000,
)
documents = splitter.split_text(html_content)
expected = [
Document(
page_content="This is a link to [example.com](http://example.com)",
metadata={"Header 1": "Section 1"},
),
]
assert documents == expected
@pytest.mark.requires("bs4")
def test_html_splitter_with_nested_elements() -> None:
"""Test HTML splitting with nested elements."""
html_content = """
<h1>Main Section</h1>
<div>
<p>Some text here.</p>
<div>
<p>Nested content.</p>
</div>
</div>
"""
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=[("h1", "Header 1")], max_chunk_size=1000
)
documents = splitter.split_text(html_content)
expected = [
Document(
page_content="Some text here. Nested content.",
metadata={"Header 1": "Main Section"},
),
]
assert documents == expected
@pytest.mark.requires("bs4")
def test_html_splitter_with_preserved_elements() -> None:
"""Test HTML splitting with preserved elements like <table>, <ul> with low chunk
size."""
html_content = """
<h1>Section 1</h1>
<table>
<tr><td>Row 1</td></tr>
<tr><td>Row 2</td></tr>
</table>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
"""
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=[("h1", "Header 1")],
elements_to_preserve=["table", "ul"],
max_chunk_size=50, # Deliberately low to test preservation
)
documents = splitter.split_text(html_content)
expected = [
Document(
page_content="Row 1 Row 2 Item 1 Item 2",
metadata={"Header 1": "Section 1"},
),
]
assert documents == expected # Shouldn't split the table or ul
@pytest.mark.requires("bs4")
def test_html_splitter_with_no_further_splits() -> None:
"""Test HTML splitting that requires no further splits beyond sections."""
html_content = """
<h1>Section 1</h1>
<p>Some content here.</p>
<h1>Section 2</h1>
<p>More content here.</p>
"""
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=[("h1", "Header 1")], max_chunk_size=1000
)
documents = splitter.split_text(html_content)
expected = [
Document(page_content="Some content here.", metadata={"Header 1": "Section 1"}),
Document(page_content="More content here.", metadata={"Header 1": "Section 2"}),
]
assert documents == expected # No further splits, just sections
@pytest.mark.requires("bs4")
def test_html_splitter_with_small_chunk_size() -> None:
"""Test HTML splitting with a very small chunk size to validate chunking."""
html_content = """
<h1>Section 1</h1>
<p>This is some long text that should be split into multiple chunks due to the
small chunk size.</p>
"""
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=[("h1", "Header 1")], max_chunk_size=20, chunk_overlap=5
)
documents = splitter.split_text(html_content)
expected = [
Document(page_content="This is some long", metadata={"Header 1": "Section 1"}),
Document(page_content="long text that", metadata={"Header 1": "Section 1"}),
Document(page_content="that should be", metadata={"Header 1": "Section 1"}),
Document(page_content="be split into", metadata={"Header 1": "Section 1"}),
Document(page_content="into multiple", metadata={"Header 1": "Section 1"}),
Document(page_content="chunks due to the", metadata={"Header 1": "Section 1"}),
Document(page_content="the small chunk", metadata={"Header 1": "Section 1"}),
Document(page_content="size.", metadata={"Header 1": "Section 1"}),
]
assert documents == expected # Should split into multiple chunks
@pytest.mark.requires("bs4")
def test_html_splitter_with_denylist_tags() -> None:
"""Test HTML splitting with denylist tag filtering."""
html_content = """
<h1>Section 1</h1>
<p>This paragraph should be kept.</p>
<span>This span should be removed.</span>
"""
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=[("h1", "Header 1")],
denylist_tags=["span"],
max_chunk_size=1000,
)
documents = splitter.split_text(html_content)
expected = [
Document(
page_content="This paragraph should be kept.",
metadata={"Header 1": "Section 1"},
),
]
assert documents == expected
@pytest.mark.requires("bs4")
def test_html_splitter_with_external_metadata() -> None:
"""Test HTML splitting with external metadata integration."""
html_content = """
<h1>Section 1</h1>
<p>This is some content.</p>
"""
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=[("h1", "Header 1")],
external_metadata={"source": "example.com"},
max_chunk_size=1000,
)
documents = splitter.split_text(html_content)
expected = [
Document(
page_content="This is some content.",
metadata={"Header 1": "Section 1", "source": "example.com"},
),
]
assert documents == expected
@pytest.mark.requires("bs4")
def test_html_splitter_with_text_normalization() -> None:
"""Test HTML splitting with text normalization."""
html_content = """
<h1>Section 1</h1>
<p>This is some TEXT that should be normalized!</p>
"""
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=[("h1", "Header 1")],
normalize_text=True,
max_chunk_size=1000,
)
documents = splitter.split_text(html_content)
expected = [
Document(
page_content="this is some text that should be normalized",
metadata={"Header 1": "Section 1"},
),
]
assert documents == expected
@pytest.mark.requires("bs4")
def test_html_splitter_with_allowlist_tags() -> None:
"""Test HTML splitting with allowlist tag filtering."""
html_content = """
<h1>Section 1</h1>
<p>This paragraph should be kept.</p>
<span>This span should be kept.</span>
<div>This div should be removed.</div>
"""
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=[("h1", "Header 1")],
allowlist_tags=["p", "span"],
max_chunk_size=1000,
)
documents = splitter.split_text(html_content)
expected = [
Document(
page_content="This paragraph should be kept. This span should be kept.",
metadata={"Header 1": "Section 1"},
),
]
assert documents == expected
@pytest.mark.requires("bs4")
def test_html_splitter_with_mixed_preserve_and_filter() -> None:
"""Test HTML splitting with both preserved elements and denylist tags."""
html_content = """
<h1>Section 1</h1>
<table>
<tr>
<td>Keep this table</td>
<td>Cell contents kept, span removed
<span>This span should be removed.</span>
</td>
</tr>
</table>
<p>This paragraph should be kept.</p>
<span>This span should be removed.</span>
"""
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=[("h1", "Header 1")],
elements_to_preserve=["table"],
denylist_tags=["span"],
max_chunk_size=1000,
)
documents = splitter.split_text(html_content)
expected = [
Document(
page_content="Keep this table Cell contents kept, span removed"
" This paragraph should be kept.",
metadata={"Header 1": "Section 1"},
),
]
assert documents == expected
@pytest.mark.requires("bs4")
def test_html_splitter_with_no_headers() -> None:
"""Test HTML splitting when there are no headers to split on."""
html_content = """
<p>This is content without any headers.</p>
<p>It should still produce a valid document.</p>
"""
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=[],
max_chunk_size=1000,
)
documents = splitter.split_text(html_content)
expected = [
Document(
page_content="This is content without any headers. It should still produce"
" a valid document.",
metadata={},
),
]
assert documents == expected
@pytest.mark.requires("bs4")
def test_html_splitter_with_media_preservation() -> None:
"""Test HTML splitting with media elements preserved and converted to Markdown-like
links."""
html_content = """
<h1>Section 1</h1>
<p>This is an image:</p>
<img src="http://example.com/image.png" />
<p>This is a video:</p>
<video src="http://example.com/video.mp4"></video>
<p>This is audio:</p>
<audio src="http://example.com/audio.mp3"></audio>
"""
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=[("h1", "Header 1")],
preserve_images=True,
preserve_videos=True,
preserve_audio=True,
max_chunk_size=1000,
)
documents = splitter.split_text(html_content)
expected = [
Document(
page_content="This is an image: ![image:http://example.com/image.png]"
"(http://example.com/image.png) "
"This is a video: ![video:http://example.com/video.mp4]"
"(http://example.com/video.mp4) "
"This is audio: ![audio:http://example.com/audio.mp3]"
"(http://example.com/audio.mp3)",
metadata={"Header 1": "Section 1"},
),
]
assert documents == expected
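The link conversion asserted in `test_html_splitter_with_href_links` can be approximated without bs4. The sketch below uses the stdlib `html.parser` to rewrite `<a href>` spans as `[text](url)`; it is an illustration of the transformation, not the library's bs4-based implementation:

```python
from html.parser import HTMLParser


class LinkToMarkdown(HTMLParser):
    """Collect document text, converting each <a href> span to a
    Markdown link -- a stdlib approximation of _process_links."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self.href: str | None = None
        self.link_text: list[str] = []

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag == "a":
            self.href = dict(attrs).get("href", "")
            self.link_text = []

    def handle_endtag(self, tag: str) -> None:
        if tag == "a" and self.href is not None:
            text = "".join(self.link_text).strip()
            self.parts.append(f"[{text}]({self.href})")
            self.href = None

    def handle_data(self, data: str) -> None:
        # Text inside an open <a> belongs to the link label.
        if self.href is not None:
            self.link_text.append(data)
        else:
            self.parts.append(data)


parser = LinkToMarkdown()
parser.feed('This is a link to <a href="http://example.com">example.com</a>')
print("".join(parser.parts))
# → This is a link to [example.com](http://example.com)
```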