mirror of
https://github.com/hwchase17/langchain.git
synced 2026-02-21 06:33:41 +00:00
208 lines
7.3 KiB
Plaintext
{
"cells": [
{
"cell_type": "markdown",
"id": "c95fcd15cd52c944",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"# How to split by HTML sections\n",
"## Description and motivation\n",
"Similar in concept to the [HTMLHeaderTextSplitter](/docs/how_to/HTML_header_metadata_splitter), the `HTMLSectionSplitter` is a \"structure-aware\" chunker that splits text at the element level and adds metadata for each header \"relevant\" to any given chunk.\n",
"\n",
"It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures.\n",
"\n",
"Use `xslt_path` to provide an absolute path to an XSLT file that transforms the HTML so that sections can be detected based on the provided tags. By default, the `converting_to_header.xslt` file in the `data_connection/document_transformers` directory is used. The transformation converts the HTML to a format/layout in which sections are easier to detect. For example, `span` elements can be converted to header tags based on their font size, so that they are detected as sections.\n",
"\n",
"## Usage examples\n",
"### 1) How to split HTML strings:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "initial_id",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-02T18:57:49.208965400Z",
"start_time": "2023-10-02T18:57:48.899756Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Foo \\n Some intro text about Foo.', metadata={'Header 1': 'Foo'}),\n",
" Document(page_content='Bar main section \\n Some intro text about Bar. \\n Bar subsection 1 \\n Some text about the first subtopic of Bar. \\n Bar subsection 2 \\n Some text about the second subtopic of Bar.', metadata={'Header 2': 'Bar main section'}),\n",
" Document(page_content='Baz \\n Some text about Baz \\n \\n \\n Some concluding text about Foo', metadata={'Header 2': 'Baz'})]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_text_splitters import HTMLSectionSplitter\n",
"\n",
"html_string = \"\"\"\n",
" <!DOCTYPE html>\n",
" <html>\n",
" <body>\n",
" <div>\n",
" <h1>Foo</h1>\n",
" <p>Some intro text about Foo.</p>\n",
" <div>\n",
" <h2>Bar main section</h2>\n",
" <p>Some intro text about Bar.</p>\n",
" <h3>Bar subsection 1</h3>\n",
" <p>Some text about the first subtopic of Bar.</p>\n",
" <h3>Bar subsection 2</h3>\n",
" <p>Some text about the second subtopic of Bar.</p>\n",
" </div>\n",
" <div>\n",
" <h2>Baz</h2>\n",
" <p>Some text about Baz</p>\n",
" </div>\n",
" <br>\n",
" <p>Some concluding text about Foo</p>\n",
" </div>\n",
" </body>\n",
" </html>\n",
"\"\"\"\n",
"\n",
"headers_to_split_on = [(\"h1\", \"Header 1\"), (\"h2\", \"Header 2\")]\n",
"\n",
"html_splitter = HTMLSectionSplitter(headers_to_split_on)\n",
"html_header_splits = html_splitter.split_text(html_string)\n",
"html_header_splits"
]
},
{
"cell_type": "markdown",
"id": "e29b4aade2a0070c",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"### 2) How to constrain chunk sizes:\n",
"\n",
"`HTMLSectionSplitter` can be used with other text splitters as part of a chunking pipeline. Internally, it uses the `RecursiveCharacterTextSplitter` when the section size is larger than the chunk size. It also considers the font size of the text, compared against a font size threshold, to determine whether the text is a section."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6ada8ea093ea0475",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-02T18:57:51.016141300Z",
"start_time": "2023-10-02T18:57:50.647495400Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Foo \\n Some intro text about Foo.', metadata={'Header 1': 'Foo'}),\n",
" Document(page_content='Bar main section \\n Some intro text about Bar.', metadata={'Header 2': 'Bar main section'}),\n",
" Document(page_content='Bar subsection 1 \\n Some text about the first subtopic of Bar.', metadata={'Header 3': 'Bar subsection 1'}),\n",
" Document(page_content='Bar subsection 2 \\n Some text about the second subtopic of Bar.', metadata={'Header 3': 'Bar subsection 2'}),\n",
" Document(page_content='Baz \\n Some text about Baz \\n \\n \\n Some concluding text about Foo', metadata={'Header 2': 'Baz'})]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_text_splitters import HTMLSectionSplitter, RecursiveCharacterTextSplitter\n",
"\n",
"html_string = \"\"\"\n",
" <!DOCTYPE html>\n",
" <html>\n",
" <body>\n",
" <div>\n",
" <h1>Foo</h1>\n",
" <p>Some intro text about Foo.</p>\n",
" <div>\n",
" <h2>Bar main section</h2>\n",
" <p>Some intro text about Bar.</p>\n",
" <h3>Bar subsection 1</h3>\n",
" <p>Some text about the first subtopic of Bar.</p>\n",
" <h3>Bar subsection 2</h3>\n",
" <p>Some text about the second subtopic of Bar.</p>\n",
" </div>\n",
" <div>\n",
" <h2>Baz</h2>\n",
" <p>Some text about Baz</p>\n",
" </div>\n",
" <br>\n",
" <p>Some concluding text about Foo</p>\n",
" </div>\n",
" </body>\n",
" </html>\n",
"\"\"\"\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
" (\"h3\", \"Header 3\"),\n",
" (\"h4\", \"Header 4\"),\n",
"]\n",
"\n",
"html_splitter = HTMLSectionSplitter(headers_to_split_on)\n",
"\n",
"html_header_splits = html_splitter.split_text(html_string)\n",
"\n",
"chunk_size = 500\n",
"chunk_overlap = 30\n",
"text_splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size=chunk_size, chunk_overlap=chunk_overlap\n",
")\n",
"\n",
"# Further split the section documents to constrain chunk size\n",
"splits = text_splitter.split_documents(html_header_splits)\n",
"splits"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
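
The section grouping described in the notebook above can be illustrated without the library. The following is a minimal sketch that uses only Python's standard `html.parser` module. It is not how `HTMLSectionSplitter` is implemented (the notebook states that the class relies on an XSLT transformation), and `SectionCollector` and `split_by_sections` are hypothetical names introduced here for illustration only. The sketch assumes that a new section starts at every tag listed in `headers_to_split_on` and that the header text becomes that section's metadata.

```python
from html.parser import HTMLParser


class SectionCollector(HTMLParser):
    """Hypothetical helper: groups text under the most recent tracked header."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag name to metadata key, e.g. {"h1": "Header 1"}.
        self.header_names = dict(headers_to_split_on)
        self.sections = []       # each item: {"content": [...], "metadata": {...}}
        self._in_header = None   # tag name while inside a tracked header

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            # A tracked header tag opens a new section.
            self._in_header = tag
            self.sections.append({"content": [], "metadata": {}})

    def handle_endtag(self, tag):
        if tag == self._in_header:
            self._in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if not self.sections:
            # Text that appears before any tracked header gets its own section.
            self.sections.append({"content": [], "metadata": {}})
        section = self.sections[-1]
        if self._in_header:
            key = self.header_names[self._in_header]
            previous = section["metadata"].get(key, "")
            section["metadata"][key] = (previous + " " + text).strip()
        section["content"].append(text)


def split_by_sections(html, headers_to_split_on):
    """Return one dict per section with 'page_content' and 'metadata' keys."""
    parser = SectionCollector(headers_to_split_on)
    parser.feed(html)
    parser.close()
    return [
        {"page_content": " \n ".join(s["content"]), "metadata": s["metadata"]}
        for s in parser.sections
    ]


sample_html = """
<html><body>
  <h1>Foo</h1>
  <p>Some intro text about Foo.</p>
  <h2>Bar main section</h2>
  <p>Some intro text about Bar.</p>
  <h3>Bar subsection 1</h3>
  <p>Some text about the first subtopic of Bar.</p>
  <h2>Baz</h2>
  <p>Some text about Baz</p>
</body></html>
"""

sections = split_by_sections(sample_html, [("h1", "Header 1"), ("h2", "Header 2")])
for section in sections:
    print(section["metadata"], "->", repr(section["page_content"]))
```

With only `h1` and `h2` as split points, the `h3` subsection text stays inside its parent `h2` section, which mirrors the grouping shown in the first output cell of the notebook. Adding `("h3", "Header 3")` to the list would give the subsection a section of its own.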