{ "cells": [ { "cell_type": "markdown", "id": "70e9b619", "metadata": {}, "source": [ "# MarkdownHeaderTextSplitter\n", "\n", "`MarkdownHeaderTextSplitter` splits a Markdown document on a specified set of headers and attaches the enclosing headers to each chunk as metadata.\n", " \n", "**Given this example:**\n", "\n", "# Foo\n", "\n", "## Bar\n", "\n", "Hi this is Jim \n", "Hi this is Joe\n", "\n", "## Baz\n", "\n", "Hi this is Molly\n", " \n", "**Written as:**\n", "\n", "```\n", "md = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim \\nHi this is Joe\\n\\n ## Baz\\n\\n Hi this is Molly' \n", "```\n", "\n", "**If we want to split on specified headers:**\n", "```\n", "[(\"#\", \"Header 1\"),(\"##\", \"Header 2\")]\n", "```\n", "\n", "**Then we expect:** \n", "```\n", "{'content': 'Hi this is Jim \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", "{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n", "```\n", "\n", "**Options:**\n", " \n", "The splitter also accepts a `return_each_line` option, in case a user wants to perform other types of aggregation. \n", "\n", "If `return_each_line=False`, lines that share the same header metadata are aggregated into a single chunk, as in the expected output above. \n", "\n", "If `return_each_line=True`, each line is returned separately, together with its associated header metadata. 
" ] }, { "cell_type": "code", "execution_count": 1, "id": "19c044f0", "metadata": {}, "outputs": [], "source": [ "from langchain.text_splitter import MarkdownHeaderTextSplitter" ] }, { "cell_type": "markdown", "id": "ec8d8053", "metadata": {}, "source": [ "`Test case 1`" ] }, { "cell_type": "code", "execution_count": 2, "id": "5cd0a66c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'content': 'Hi this is Jim', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", "{'content': 'Hi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", "{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n" ] } ], "source": [ "# Doc\n", "markdown_document = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim\\n\\nHi this is Joe\\n\\n ## Baz\\n\\n Hi this is Molly' \n", " \n", "# Test case 1\n", "headers_to_split_on = [\n", " (\"#\", \"Header 1\"),\n", " (\"##\", \"Header 2\"),\n", "]\n", "\n", "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on,return_each_line=True)\n", "\n", "chunked_docs = markdown_splitter.split_text(markdown_document)\n", "for chunk in chunked_docs:\n", " print(chunk)" ] }, { "cell_type": "code", "execution_count": 4, "id": "67d25a1c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'content': 'Hi this is Jim \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", "{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n" ] } ], "source": [ "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on,return_each_line=False)\n", "chunked_docs = markdown_splitter.split_text(markdown_document)\n", "for chunk in chunked_docs:\n", " print(chunk)" ] }, { "cell_type": "markdown", "id": "f1f74dfa", "metadata": {}, "source": [ "`Test case 2`" ] }, { "cell_type": "code", "execution_count": 5, "id": "2183c96a", "metadata": {}, "outputs": [ { "name": 
"stdout", "output_type": "stream", "text": [ "{'content': 'Text under H3.', 'metadata': {'Header 1': 'H1', 'Header 2': 'H2', 'Header 3': 'H3'}}\n", "{'content': 'Text under H2_2.', 'metadata': {'Header 1': 'H1_2', 'Header 2': 'H2_2'}}\n" ] } ], "source": [ "headers_to_split_on = [\n", " (\"#\", \"Header 1\"),\n", " (\"##\", \"Header 2\"),\n", " (\"###\", \"Header 3\"),\n", "]\n", "markdown_document = '# H1\\n\\n## H2\\n\\n### H3\\n\\nText under H3.\\n\\n# H1_2\\n\\n## H2_2\\n\\nText under H2_2.'\n", "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on,return_each_line=False)\n", "chunked_docs = markdown_splitter.split_text(markdown_document)\n", "for chunk in chunked_docs:\n", " print(chunk)" ] }, { "cell_type": "markdown", "id": "add24254", "metadata": {}, "source": [ "`Test case 3`" ] }, { "cell_type": "code", "execution_count": 6, "id": "c3f4690f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'content': 'Hi this is Jim \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", "{'content': 'Hi this is Lance', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}}\n", "{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n" ] } ], "source": [ "markdown_document = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim\\n\\nHi this is Joe\\n\\n ### Boo \\n\\n Hi this is Lance \\n\\n ## Baz\\n\\n Hi this is Molly' \n", " \n", "headers_to_split_on = [\n", " (\"#\", \"Header 1\"),\n", " (\"##\", \"Header 2\"),\n", " (\"###\", \"Header 3\"),\n", "]\n", "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on,return_each_line=False)\n", "chunked_docs = markdown_splitter.split_text(markdown_document)\n", "for chunk in chunked_docs:\n", " print(chunk)" ] }, { "cell_type": "code", "execution_count": 7, "id": "20907fb7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'content': 'Hi 
this is Jim', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", "{'content': 'Hi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", "{'content': 'Hi this is Lance', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}}\n", "{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n" ] } ], "source": [ "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on,return_each_line=True)\n", "chunked_docs = markdown_splitter.split_text(markdown_document)\n", "for chunk in chunked_docs:\n", " print(chunk)" ] }, { "cell_type": "markdown", "id": "9c448431", "metadata": {}, "source": [ "`Test case 4`" ] }, { "cell_type": "code", "execution_count": 8, "id": "9858ea51", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'content': 'Hi this is Jim \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", "{'content': 'Hi this is Lance', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}}\n", "{'content': 'Hi this is John', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo', 'Header 4': 'Bim'}}\n", "{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n" ] } ], "source": [ "markdown_document = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim\\n\\nHi this is Joe\\n\\n ### Boo \\n\\n Hi this is Lance \\n\\n #### Bim \\n\\n Hi this is John \\n\\n ## Baz\\n\\n Hi this is Molly'\n", " \n", "headers_to_split_on = [\n", " (\"#\", \"Header 1\"),\n", " (\"##\", \"Header 2\"),\n", " (\"###\", \"Header 3\"),\n", " (\"####\", \"Header 4\"),\n", "]\n", " \n", "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on,return_each_line=False)\n", "chunked_docs = markdown_splitter.split_text(markdown_document)\n", "for chunk in chunked_docs:\n", " print(chunk)" ] }, { "cell_type": "markdown", "id": "bba6eb9e", "metadata": {}, "source": [ "`Test case 5`" ] }, { 
"cell_type": "code", "execution_count": 9, "id": "8af8f9a2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'content': 'Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \\nMarkdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'History'}}\n", "{'content': 'As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}}\n", "{'content': 'From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Rise and divergence', 'Header 4': 'Standardization'}}\n", "{'content': 'Implementations of Markdown are available for over a dozen programming languages.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Implementations'}}\n" ] } ], "source": [ "markdown_document = '# Intro \\n\\n ## History \\n\\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \\n\\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. 
\\n\\n ## Rise and divergence \\n\\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\n\\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \\n\\n #### Standardization \\n\\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \\n\\n ## Implementations \\n\\n Implementations of Markdown are available for over a dozen programming languages.'\n", " \n", "headers_to_split_on = [\n", " (\"#\", \"Header 1\"),\n", " (\"##\", \"Header 2\"),\n", " (\"###\", \"Header 3\"),\n", " (\"####\", \"Header 4\"),\n", "]\n", " \n", "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on,return_each_line=False)\n", "chunked_docs = markdown_splitter.split_text(markdown_document)\n", "for chunk in chunked_docs:\n", " print(chunk)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 5 }