(new docs): update text splitter how-to guides (#21087)

This commit is contained in:
ccurme
2024-04-30 11:34:42 -04:00
committed by GitHub
parent c3b7933d98
commit df8a2cdc96
11 changed files with 639 additions and 334 deletions

View File

@@ -4,6 +4,7 @@
"cell_type": "markdown",
"id": "c95fcd15cd52c944",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
@@ -11,10 +12,15 @@
"source": [
"# How to split by HTML header \n",
"## Description and motivation\n",
"Similar in concept to the <a href=\"https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/markdown_header_metadata\">`MarkdownHeaderTextSplitter`</a>, the `HTMLHeaderTextSplitter` is a \"structure-aware\" chunker that splits text at the element level and adds metadata for each header \"relevant\" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.\n",
"\n",
"[HTMLHeaderTextSplitter](https://api.python.langchain.com/en/latest/html/langchain_text_splitters.html.HTMLHeaderTextSplitter.html) is a \"structure-aware\" chunker that splits text at the HTML element level and adds metadata for each header \"relevant\" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.\n",
"\n",
"It is analogous to the [MarkdownHeaderTextSplitter](/docs/0.2.x/how_to/markdown_header_metadata_splitter) for markdown files.\n",
"\n",
"To specify what headers to split on, specify `headers_to_split_on` when instantiating `HTMLHeaderTextSplitter` as shown below.\n",
"\n",
"## Usage examples\n",
"#### 1) With an HTML string:"
"### 1) How to split HTML strings:"
]
},
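The structure-aware idea described above can be sketched with only the Python standard library: walk the HTML, start a new chunk at each header of interest, and carry the currently active headers along as metadata. This is a simplified illustration of the concept, not `HTMLHeaderTextSplitter`'s actual implementation.

```python
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    """Simplified sketch: split text at h1/h2 and tag chunks with active headers."""

    def __init__(self, headers=("h1", "h2")):
        super().__init__()
        self.headers = headers
        self.active = {}       # currently active headers, e.g. {"h1": "Foo"}
        self.current_tag = None
        self.chunks = []       # list of (text, metadata) pairs
        self.buffer = []

    def _flush(self):
        text = " ".join(self.buffer).strip()
        if text:
            self.chunks.append((text, dict(self.active)))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.headers:
            self._flush()  # close the previous chunk
            # a new header invalidates headers at this level or deeper
            for t in self.headers[self.headers.index(tag):]:
                self.active.pop(t, None)
            self.current_tag = tag

    def handle_endtag(self, tag):
        if tag in self.headers:
            self.current_tag = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self.current_tag:   # header text updates the active metadata
            self.active[self.current_tag] = data
        else:
            self.buffer.append(data)

    def split(self, html):
        self.feed(html)
        self._flush()
        return self.chunks


splitter = HeaderSplitter()
chunks = splitter.split("<h1>Foo</h1><p>Intro about Foo.</p><h2>Bar</h2><p>Details.</p>")
for text, meta in chunks:
    print(text, meta)
```

Each chunk carries the headers "relevant" to it, which is the behavior the real splitter provides (with much more robustness) for arbitrary HTML.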
{
@@ -36,6 +42,7 @@
"end_time": "2023-10-02T18:57:49.208965400Z",
"start_time": "2023-10-02T18:57:48.899756Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
@@ -94,32 +101,151 @@
" (\"h3\", \"Header 3\"),\n",
"]\n",
"\n",
"html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
"html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)\n",
"html_header_splits = html_splitter.split_text(html_string)\n",
"html_header_splits"
]
},
{
"cell_type": "markdown",
"id": "e29b4aade2a0070c",
"metadata": {
"jupyter": {
"outputs_hidden": false
}
},
"id": "7126f179-f4d0-4b5d-8bef-44e83b59262c",
"metadata": {},
"source": [
"#### 2) Pipelined to another splitter, with html loaded from a web URL:"
"To return each element together with their associated headers, specify `return_each_element=True` when instantiating `HTMLHeaderTextSplitter`:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "90c23088-804c-4c89-bd09-b820587ceeef",
"metadata": {},
"outputs": [],
"source": [
"html_splitter = HTMLHeaderTextSplitter(\n",
" headers_to_split_on,\n",
" return_each_element=True,\n",
")\n",
"html_header_splits_elements = html_splitter.split_text(html_string)"
]
},
{
"cell_type": "markdown",
"id": "b776c54e-9159-4d88-9d6c-3a1d0b639dfe",
"metadata": {},
"source": [
"Comparing with the above, where elements are aggregated by their headers:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "711abc74-a7b0-4dc5-a4bb-af3cafe4e0f4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Foo'\n",
"page_content='Some intro text about Foo. \\nBar main section Bar subsection 1 Bar subsection 2' metadata={'Header 1': 'Foo'}\n"
]
}
],
"source": [
"for element in html_header_splits[:2]:\n",
" print(element)"
]
},
{
"cell_type": "markdown",
"id": "fe5528db-187c-418a-9480-fc0267645d42",
"metadata": {},
"source": [
"Now each element is returned as a distinct `Document`:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "24722d8e-d073-46a8-a821-6b722412f1be",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Foo'\n",
"page_content='Some intro text about Foo.' metadata={'Header 1': 'Foo'}\n",
"page_content='Bar main section Bar subsection 1 Bar subsection 2' metadata={'Header 1': 'Foo'}\n"
]
}
],
"source": [
"for element in html_header_splits_elements[:3]:\n",
" print(element)"
]
},
{
"cell_type": "markdown",
"id": "e29b4aade2a0070c",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"### 2) How to split from a URL or HTML file:\n",
"\n",
"To read directly from a URL, pass the URL string into the `split_text_from_url` method.\n",
"\n",
"Similarly, a local HTML file can be passed to the `split_text_from_file` method."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "6ecb9fb2-32ff-4249-a4b4-d5e5e191f013",
"metadata": {},
"outputs": [],
"source": [
"url = \"https://plato.stanford.edu/entries/goedel/\"\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
" (\"h3\", \"Header 3\"),\n",
" (\"h4\", \"Header 4\"),\n",
"]\n",
"\n",
"html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)\n",
"\n",
"# for local file use html_splitter.split_text_from_file(<path_to_file>)\n",
"html_header_splits = html_splitter.split_text_from_url(url)"
]
},
{
"cell_type": "markdown",
"id": "c6e3dd41-0c57-472a-a3d4-4e7e8ea6914f",
"metadata": {},
"source": [
"### 3) How to constrain chunk sizes:\n",
"\n",
"`HTMLHeaderTextSplitter`, which splits based on HTML headers, can be composed with another splitter that constrains splits based on character length, such as `RecursiveCharacterTextSplitter`.\n",
"\n",
"This can be done using the `.split_documents` method of the second splitter:"
]
},
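The size-constraining step can be sketched with the standard library alone (a simplified version of the recursive idea, without the library's overlap handling): try the coarsest separator that appears in the text, recursively split any piece still over the limit, then greedily merge small pieces back together.

```python
def merge(pieces, sep, chunk_size):
    """Greedily merge adjacent pieces into chunks of at most chunk_size."""
    out, cur = [], ""
    for p in pieces:
        cand = p if not cur else cur + sep + p
        if len(cand) <= chunk_size:
            cur = cand
        else:
            if cur:
                out.append(cur)
            cur = p
    if cur:
        out.append(cur)
    return out


def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch of recursive character splitting."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep == "":
            # last resort: hard cut by characters
            return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        if sep in text:
            pieces = []
            for piece in text.split(sep):
                pieces.extend(recursive_split(piece, chunk_size, separators))
            return merge(pieces, sep, chunk_size)
    return [text]


doc = "Header section\n\n" + ("word " * 20).strip()
chunks = recursive_split(doc, chunk_size=40)
print([len(c) for c in chunks])
```

Composing this with a header-based split corresponds to running the second splitter over each header-derived chunk, which is what `.split_documents` does in the pipeline above.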
{
"cell_type": "code",
"execution_count": 6,
"id": "6ada8ea093ea0475",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-02T18:57:51.016141300Z",
"start_time": "2023-10-02T18:57:50.647495400Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
@@ -135,7 +261,7 @@
" Document(page_content='We now describe the proof of the two theorems, formulating Gödels results in Peano arithmetic. Gödel himself used a system related to that defined in Principia Mathematica, but containing Peano arithmetic. In our presentation of the First and Second Incompleteness Theorems we refer to Peano arithmetic as P, following Gödels notation.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.2 The proof of the First Incompleteness Theorem'})]"
]
},
"execution_count": 2,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
@@ -143,19 +269,6 @@
"source": [
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
"\n",
"url = \"https://plato.stanford.edu/entries/goedel/\"\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
" (\"h3\", \"Header 3\"),\n",
" (\"h4\", \"Header 4\"),\n",
"]\n",
"\n",
"html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
"\n",
"# for local file use html_splitter.split_text_from_file(<path_to_file>)\n",
"html_header_splits = html_splitter.split_text_from_url(url)\n",
"\n",
"chunk_size = 500\n",
"chunk_overlap = 30\n",
@@ -172,6 +285,7 @@
"cell_type": "markdown",
"id": "ac0930371d79554a",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
@@ -184,13 +298,14 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 6,
"id": "5a5ec1482171b119",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-02T19:03:25.943524300Z",
"start_time": "2023-10-02T19:03:25.691641Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
@@ -200,10 +315,10 @@
"name": "stdout",
"output_type": "stream",
"text": [
"No two El Niño winters are the same, but many have temperature and precipitation trends in common. \n",
"Average conditions during an El Niño winter across the continental US. \n",
"One of the major reasons is the position of the jet stream, which often shifts south during an El Niño winter. This shift typically brings wetter and cooler weather to the South while the North becomes drier and warmer, according to NOAA. \n",
"Because the jet stream is essentially a river of air that storms flow through, the\n"
"No two El Niño winters are the same, but many have temperature and precipitation trends in common. \n",
"Average conditions during an El Niño winter across the continental US. \n",
"One of the major reasons is the position of the jet stream, which often shifts south during an El Niño winter. This shift typically brings wetter and cooler weather to the South while the North becomes drier and warmer, according to NOAA. \n",
"Because the jet stream is essentially a river of air that storms flow through, they c\n"
]
}
],
@@ -215,7 +330,7 @@
" (\"h2\", \"Header 2\"),\n",
"]\n",
"\n",
"html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
"html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)\n",
"html_header_splits = html_splitter.split_text_from_url(url)\n",
"print(html_header_splits[1].page_content[:500])"
]
@@ -237,7 +352,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
"version": "3.10.4"
}
},
"nbformat": 4,

View File

@@ -4,6 +4,7 @@
"cell_type": "markdown",
"id": "c95fcd15cd52c944",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
@@ -11,26 +12,44 @@
"source": [
"# How to split by HTML sections\n",
"## Description and motivation\n",
"Similar in concept to the [HTMLHeaderTextSplitter](/docs/modules/data_connection/document_transformers/HTML_header_metadata), the `HTMLSectionSplitter` is a \"structure-aware\" chunker that splits text at the element level and adds metadata for each header \"relevant\" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline. Internally, it uses the `RecursiveCharacterTextSplitter` when the section size is larger than the chunk size. It also considers the font size of the text to determine whether it is a section or not based on the determined font size threshold. Use `xslt_path` to provide an absolute path to transform the HTML so that it can detect sections based on provided tags. The default is to use the `converting_to_header.xslt` file in the `data_connection/document_transformers` directory. This is for converting the html to a format/layout that is easier to detect sections. For example, `span` based on their font size can be converted to header tags to be detected as a section.\n",
"Similar in concept to the [HTMLHeaderTextSplitter](/docs/0.2.x/how_to/HTML_header_metadata_splitter), the `HTMLSectionSplitter` is a \"structure-aware\" chunker that splits text at the element level and adds metadata for each header \"relevant\" to any given chunk.\n",
"\n",
"It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures.\n",
"\n",
"Use `xslt_path` to provide an absolute path to an XSLT file that transforms the HTML so that sections can be detected based on the provided tags. The default is the `converting_to_header.xslt` file in the `data_connection/document_transformers` directory. This transformation converts the HTML to a format/layout in which sections are easier to detect; for example, `span` elements can be converted to header tags based on their font size, so that they are detected as sections.\n",
"\n",
"## Usage examples\n",
"#### 1) With an HTML string:"
"### 1) How to split HTML strings:"
]
},
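The idea behind the XSLT transformation can be illustrated in miniature with a regex (the real splitter uses a proper XSLT stylesheet; this sketch only handles inline pixel font sizes and is not how the library does it): promote any large-font `span` to a header tag so downstream header-based splitting can see it.

```python
import re


def spans_to_headers(html, threshold_px=18):
    """Rough sketch of the transform idea: promote large-font spans to <h1>.
    Assumes inline styles of the exact form font-size:<N>px; illustrative only."""
    def repl(match):
        size, text = int(match.group(1)), match.group(2)
        if size >= threshold_px:
            return f"<h1>{text}</h1>"
        return match.group(0)

    return re.sub(r'<span style="font-size:(\d+)px">(.*?)</span>', repl, html)


html = '<span style="font-size:24px">Title</span><span style="font-size:12px">body</span>'
print(spans_to_headers(html))
```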
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "initial_id",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-02T18:57:49.208965400Z",
"start_time": "2023-10-02T18:57:48.899756Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Foo \\n Some intro text about Foo.', metadata={'Header 1': 'Foo'}),\n",
" Document(page_content='Bar main section \\n Some intro text about Bar. \\n Bar subsection 1 \\n Some text about the first subtopic of Bar. \\n Bar subsection 2 \\n Some text about the second subtopic of Bar.', metadata={'Header 2': 'Bar main section'}),\n",
" Document(page_content='Baz \\n Some text about Baz \\n \\n \\n Some concluding text about Foo', metadata={'Header 2': 'Baz'})]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_text_splitters import HTMLSectionSplitter\n",
"\n",
@@ -62,7 +81,7 @@
"\n",
"headers_to_split_on = [(\"h1\", \"Header 1\"), (\"h2\", \"Header 2\")]\n",
"\n",
"html_splitter = HTMLSectionSplitter(headers_to_split_on=headers_to_split_on)\n",
"html_splitter = HTMLSectionSplitter(headers_to_split_on)\n",
"html_header_splits = html_splitter.split_text(html_string)\n",
"html_header_splits"
]
@@ -71,28 +90,47 @@
"cell_type": "markdown",
"id": "e29b4aade2a0070c",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"#### 2) Pipelined to another splitter, with html loaded from a html string content:"
"### 2) How to constrain chunk sizes:\n",
"\n",
"`HTMLSectionSplitter` can be used with other text splitters as part of a chunking pipeline. Internally, it uses the `RecursiveCharacterTextSplitter` when the section size is larger than the chunk size. It also uses a font size threshold to determine whether a piece of text constitutes a section."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"id": "6ada8ea093ea0475",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-02T18:57:51.016141300Z",
"start_time": "2023-10-02T18:57:50.647495400Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Foo \\n Some intro text about Foo.', metadata={'Header 1': 'Foo'}),\n",
" Document(page_content='Bar main section \\n Some intro text about Bar.', metadata={'Header 2': 'Bar main section'}),\n",
" Document(page_content='Bar subsection 1 \\n Some text about the first subtopic of Bar.', metadata={'Header 3': 'Bar subsection 1'}),\n",
" Document(page_content='Bar subsection 2 \\n Some text about the second subtopic of Bar.', metadata={'Header 3': 'Bar subsection 2'}),\n",
" Document(page_content='Baz \\n Some text about Baz \\n \\n \\n Some concluding text about Foo', metadata={'Header 2': 'Baz'})]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"\n",
@@ -129,7 +167,7 @@
" (\"h4\", \"Header 4\"),\n",
"]\n",
"\n",
"html_splitter = HTMLSectionSplitter(headers_to_split_on=headers_to_split_on)\n",
"html_splitter = HTMLSectionSplitter(headers_to_split_on)\n",
"\n",
"html_header_splits = html_splitter.split_text(html_string)\n",
"\n",
@@ -161,7 +199,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
"version": "3.10.4"
}
},
"nbformat": 4,

View File

@@ -7,10 +7,14 @@
"source": [
"# How to split by character\n",
"\n",
"This is the simplest method. This splits based on characters (by default \"\\n\\n\") and measure chunk length by number of characters.\n",
"This is the simplest method. This splits based on a given character sequence, which defaults to `\"\\n\\n\"`. Chunk length is measured by number of characters.\n",
"\n",
"1. How the text is split: by single character.\n",
"2. How the chunk size is measured: by number of characters."
"1. How the text is split: by single character separator.\n",
"2. How the chunk size is measured: by number of characters.\n",
"\n",
"To obtain the string content directly, use `.split_text`.\n",
"\n",
"To create LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects (e.g., for use in downstream tasks), use `.create_documents`."
]
},
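The mechanics of character splitting with overlap can be sketched in plain Python (a simplified illustration of the parameters above, not `CharacterTextSplitter`'s implementation): split on the separator, merge pieces greedily up to `chunk_size`, and seed each new chunk with a tail of the previous one.

```python
def char_split(text, separator="\n\n", chunk_size=1000, chunk_overlap=200):
    """Simplified sketch: split on a separator, merge pieces greedily into
    chunks of at most chunk_size characters, and carry over a tail of
    roughly chunk_overlap characters between consecutive chunks."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # seed the next chunk with an overlapping tail of the previous one
            tail = current[-chunk_overlap:] if chunk_overlap else ""
            current = (tail + separator + piece) if tail else piece
    if current:
        chunks.append(current)
    return chunks


chunks = char_split("a" * 30 + "\n\n" + "b" * 30 + "\n\n" + "c" * 30,
                    chunk_size=40, chunk_overlap=5)
print([len(c) for c in chunks])
```

Note how each chunk after the first begins with the tail of its predecessor; the overlap is what preserves context across chunk boundaries.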
{
@@ -25,39 +29,9 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 1,
"id": "313fb032",
"metadata": {},
"outputs": [],
"source": [
"# This is a long document we can split up.\n",
"with open(\"../../state_of_the_union.txt\") as f:\n",
" state_of_the_union = f.read()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a88ff70c",
"metadata": {},
"outputs": [],
"source": [
"from langchain_text_splitters import CharacterTextSplitter\n",
"\n",
"text_splitter = CharacterTextSplitter(\n",
" separator=\"\\n\\n\",\n",
" chunk_size=1000,\n",
" chunk_overlap=200,\n",
" length_function=len,\n",
" is_separator_regex=False,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "295ec095",
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -68,6 +42,20 @@
}
],
"source": [
"from langchain_text_splitters import CharacterTextSplitter\n",
"\n",
"\n",
"# Load an example document\n",
"with open(\"../../../docs/modules/state_of_the_union.txt\") as f:\n",
" state_of_the_union = f.read()\n",
"\n",
"text_splitter = CharacterTextSplitter(\n",
" separator=\"\\n\\n\",\n",
" chunk_size=1000,\n",
" chunk_overlap=200,\n",
" length_function=len,\n",
" is_separator_regex=False,\n",
")\n",
"texts = text_splitter.create_documents([state_of_the_union])\n",
"print(texts[0])"
]
@@ -77,12 +65,12 @@
"id": "dadcb9d6",
"metadata": {},
"source": [
"Here's an example of passing metadata along with the documents, notice that it is split along with the documents.\n"
"Use `.create_documents` to propagate metadata associated with each document to the output chunks:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 2,
"id": "1affda60",
"metadata": {},
"outputs": [
@@ -102,6 +90,14 @@
"print(documents[0])"
]
},
{
"cell_type": "markdown",
"id": "ee080e12-6f44-4311-b1ef-302520a41d66",
"metadata": {},
"source": [
"Use `.split_text` to obtain the string content directly:"
]
},
{
"cell_type": "code",
"execution_count": 7,
@@ -148,7 +144,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
"version": "3.10.4"
}
},
"nbformat": 4,

View File

@@ -7,7 +7,48 @@
"source": [
"# How to split code\n",
"\n",
"CodeTextSplitter allows you to split your code with multiple languages supported. Import enum `Language` and specify the language. \n"
"[RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html) includes pre-built lists of separators that are useful for splitting text in a specific programming language.\n",
"\n",
"Supported languages are stored in the `langchain_text_splitters.Language` enum. They include:\n",
"\n",
"```\n",
"\"cpp\",\n",
"\"go\",\n",
"\"java\",\n",
"\"kotlin\",\n",
"\"js\",\n",
"\"ts\",\n",
"\"php\",\n",
"\"proto\",\n",
"\"python\",\n",
"\"rst\",\n",
"\"ruby\",\n",
"\"rust\",\n",
"\"scala\",\n",
"\"swift\",\n",
"\"markdown\",\n",
"\"latex\",\n",
"\"html\",\n",
"\"sol\",\n",
"\"csharp\",\n",
"\"cobol\",\n",
"\"c\",\n",
"\"lua\",\n",
"\"perl\",\n",
"\"haskell\"\n",
"```\n",
"\n",
"To view the list of separators for a given language, pass a value from this enum into\n",
"```python\n",
"RecursiveCharacterTextSplitter.get_separators_for_language\n",
"```\n",
"\n",
"To instantiate a splitter that is tailored for a specific language, pass a value from the enum into\n",
"```python\n",
"RecursiveCharacterTextSplitter.from_language\n",
"```\n",
"\n",
"Below we demonstrate examples for the various languages."
]
},
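The language-specific behavior amounts to choosing a separator list and preferring coarse, structure-aligned cuts. A rough stdlib sketch (the separator lists below are illustrative stand-ins, not the library's exact lists, and this greedy strategy is simpler than the real recursive one):

```python
# Illustrative separator lists; the library's actual lists are more complete.
SEPARATORS = {
    "python": ["\nclass ", "\ndef ", "\n\n", "\n", " "],
    "js": ["\nfunction ", "\nconst ", "\n\n", "\n", " "],
}


def split_code(code, language, chunk_size):
    """Greedy sketch: within each chunk_size window, cut at the coarsest
    separator for the language that appears, falling back to a hard cut."""
    seps = SEPARATORS[language]
    chunks, rest = [], code
    while len(rest) > chunk_size:
        window = rest[:chunk_size]
        for sep in seps:  # coarsest separator first
            cut = window.rfind(sep)
            if cut > 0:
                break
        else:
            cut = chunk_size  # hard cut as a last resort
        chunks.append(rest[:cut])
        rest = rest[cut:].lstrip("\n")
    chunks.append(rest)
    return chunks


code = "def a():\n    return 1\n\ndef b():\n    return 2\n"
print(split_code(code, "python", chunk_size=30))
```

Because `"\ndef "` is tried before `"\n"` or `" "`, the cut lands on the function boundary, keeping each definition intact.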
{
@@ -22,7 +63,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 1,
"id": "a9e37aa1",
"metadata": {},
"outputs": [],
@@ -33,9 +74,17 @@
")"
]
},
{
"cell_type": "markdown",
"id": "082807cb-dfba-4495-af12-0441f63f30e1",
"metadata": {},
"source": [
"To view the full list of supported languages:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 3,
"id": "e21a2434",
"metadata": {},
"outputs": [
@@ -68,16 +117,23 @@
" 'haskell']"
]
},
"execution_count": 5,
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Full list of supported languages\n",
"[e.value for e in Language]"
]
},
{
"cell_type": "markdown",
"id": "56669f16-266a-4820-a7e7-d90ade9e642f",
"metadata": {},
"source": [
"You can also see the separators used for a given language:"
]
},
{
"cell_type": "code",
"execution_count": 3,
@@ -96,7 +152,6 @@
}
],
"source": [
"# You can also see the separators used for a given language\n",
"RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)"
]
},
@@ -687,7 +742,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
"version": "3.10.4"
}
},
"nbformat": 4,

View File

@@ -68,7 +68,7 @@
"\n",
"data = loader.load()\n",
"assert len(data) == 1\n",
"assert(isinstance(data[0], Document))\n",
"assert isinstance(data[0], Document)\n",
"readme_content = data[0].page_content\n",
"print(readme_content[:250])"
]

View File

@@ -17,7 +17,7 @@
"When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning and themes of the text.\n",
"```\n",
" \n",
"As mentioned, chunking often aims to keep text with common context together. With this in mind, we might want to specifically honor the structure of the document itself. For example, a markdown file is organized by headers. Creating chunks within specific header groups is an intuitive idea. To address this challenge, we can use `MarkdownHeaderTextSplitter`. This will split a markdown file by a specified set of headers. \n",
"As mentioned, chunking often aims to keep text with common context together. With this in mind, we might want to specifically honor the structure of the document itself. For example, a markdown file is organized by headers. Creating chunks within specific header groups is an intuitive idea. To address this challenge, we can use [MarkdownHeaderTextSplitter](https://api.python.langchain.com/en/latest/markdown/langchain_text_splitters.markdown.MarkdownHeaderTextSplitter.html). This will split a markdown file by a specified set of headers. \n",
"\n",
"For example, if we want to split this markdown:\n",
"```\n",
@@ -35,7 +35,9 @@
"{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n",
"```\n",
"\n",
"Let's have a look at some examples below."
"Let's have a look at some examples below.\n",
"\n",
"### Basic usage:"
]
},
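The header-grouping behavior sketched in the example above can be reproduced in a few lines of plain Python (a simplified illustration, not `MarkdownHeaderTextSplitter`'s implementation): track the active headers per level, and flush accumulated content whenever a new header appears.

```python
def split_md_by_headers(md, headers_to_split_on):
    """Simplified sketch of header-based markdown splitting."""
    header_map = dict(headers_to_split_on)  # e.g. {"#": "Header 1", "##": "Header 2"}
    active = {}                             # {header prefix: title}
    chunks, buffer = [], []

    def flush():
        if buffer:
            meta = {header_map[p]: t for p, t in active.items()}
            chunks.append({"content": "\n".join(buffer), "metadata": meta})
            buffer.clear()

    for line in md.splitlines():
        parts = line.split(" ", 1)
        if parts[0] in header_map and len(parts) == 2:
            flush()
            prefix = parts[0]
            # a new header invalidates headers at this level or deeper
            active = {p: t for p, t in active.items() if len(p) < len(prefix)}
            active[prefix] = parts[1].strip()
        elif line.strip():
            buffer.append(line)
    flush()
    return chunks


md = "# Foo\n\n## Bar\n\nHi this is Jim\n\n## Baz\n\nHi this is Molly"
for c in split_md_by_headers(md, [("#", "Header 1"), ("##", "Header 2")]):
    print(c)
```

Run on the example markdown above, this reproduces the grouped chunks shown earlier, with each chunk's header context in its metadata.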
{
@@ -96,7 +98,7 @@
" (\"###\", \"Header 3\"),\n",
"]\n",
"\n",
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)\n",
"md_header_splits = markdown_splitter.split_text(markdown_document)\n",
"md_header_splits"
]
@@ -115,7 +117,7 @@
{
"data": {
"text/plain": [
"langchain.schema.document.Document"
"langchain_core.documents.base.Document"
]
},
"execution_count": 3,
@@ -154,9 +156,46 @@
"output_type": "execute_result"
}
],
"source": [
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on, strip_headers=False)\n",
"md_header_splits = markdown_splitter.split_text(markdown_document)\n",
"md_header_splits"
]
},
{
"cell_type": "markdown",
"id": "aa67e0cc-d721-4536-9c7a-9fa3a7a69cbe",
"metadata": {},
"source": [
"### How to return Markdown lines as separate documents\n",
"\n",
"By default, `MarkdownHeaderTextSplitter` aggregates lines based on the headers specified in `headers_to_split_on`. We can disable this by specifying `return_each_line`:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "940bb609-c9c3-4593-ac2d-d825c80ceb44",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Hi this is Jim', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),\n",
" Document(page_content='Hi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),\n",
" Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),\n",
" Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"markdown_splitter = MarkdownHeaderTextSplitter(\n",
" headers_to_split_on=headers_to_split_on, strip_headers=False\n",
" headers_to_split_on,\n",
" return_each_line=True,\n",
")\n",
"md_header_splits = markdown_splitter.split_text(markdown_document)\n",
"md_header_splits"
@@ -167,19 +206,18 @@
"id": "9bd8977a",
"metadata": {},
"source": [
"Within each markdown group we can then apply any text splitter we want. "
"Note that here header information is retained in the `metadata` for each document.\n",
"\n",
"### How to constrain chunk size:\n",
"\n",
"Within each markdown group we can then apply any text splitter we want, such as `RecursiveCharacterTextSplitter`, which allows for further control of the chunk size."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "480e0e3a",
"metadata": {
"ExecuteTime": {
"end_time": "2023-09-25T19:12:41.337249Z",
"start_time": "2023-09-25T19:12:41.326099200Z"
}
},
"execution_count": 6,
"id": "6f1f62bf-2653-4361-9bb0-964d86cb14db",
"metadata": {},
"outputs": [
{
"data": {
@@ -191,7 +229,7 @@
" Document(page_content='## Implementations \\nImplementations of Markdown are available for over a dozen programming languages.', metadata={'Header 1': 'Intro', 'Header 2': 'Implementations'})]"
]
},
"execution_count": 5,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
@@ -223,14 +261,6 @@
"splits = text_splitter.split_documents(md_header_splits)\n",
"splits"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4017f148d414a45c",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@@ -249,7 +279,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
"version": "3.10.4"
}
},
"nbformat": 4,

View File

@@ -5,9 +5,11 @@
"id": "a678d550",
"metadata": {},
"source": [
"# How to recursively split JSON\n",
"# How to split JSON data\n",
"\n",
"This json splitter traverses json data depth first and builds smaller json chunks. It attempts to keep nested json objects whole but will split them if needed to keep chunks between a min_chunk_size and the max_chunk_size. If the value is not a nested json, but rather a very large string the string will not be split. If you need a hard cap on the chunk size considder following this with a Recursive Text splitter on those chunks. There is an optional pre-processing step to split lists, by first converting them to json (dict) and then splitting them as such.\n",
"This json splitter splits json data while allowing control over chunk sizes. It traverses json data depth first and builds smaller json chunks. It attempts to keep nested json objects whole but will split them if needed to keep chunks between a min_chunk_size and the max_chunk_size.\n",
"\n",
"If the value is not a nested json, but rather a very large string, the string will not be split. If you need a hard cap on the chunk size, consider composing this with a recursive text splitter on those chunks. There is an optional pre-processing step to split lists, by first converting them to json (dict) and then splitting them as such.\n",
"\n",
"1. How the text is split: json value.\n",
"2. How the chunk size is measured: by number of characters."
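The depth-first strategy described above can be sketched in plain Python (a simplified illustration of the idea, without the real splitter's merging of small neighboring chunks): descend into any dict whose serialized size exceeds the limit, and re-nest each emitted chunk under its key path so every chunk remains a valid, navigable json object.

```python
import json


def split_json(data, max_chunk_size, path=()):
    """Simplified depth-first sketch of json splitting. Oversized non-dict
    values (e.g. very long strings) are emitted as-is, mirroring the
    behavior described above."""
    def nest(value):
        # re-wrap the value under its key path, innermost key last
        for key in reversed(path):
            value = {key: value}
        return value

    if len(json.dumps(data)) <= max_chunk_size or not isinstance(data, dict):
        return [nest(data)]
    chunks = []
    for key, value in data.items():
        chunks.extend(split_json(value, max_chunk_size, path + (key,)))
    return chunks


data = {"a": {"x": 1, "y": "long" * 10}, "b": 2}
for chunk in split_json(data, max_chunk_size=30):
    print(json.dumps(chunk))
```

Each chunk retains its full key path, so a consumer can always tell where in the original object the fragment came from.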
@@ -23,65 +25,94 @@
"%pip install -qU langchain-text-splitters"
]
},
{
"cell_type": "markdown",
"id": "a2b3fe87-d230-4cbd-b3ae-01559c5351a3",
"metadata": {},
"source": [
"First we load some json data:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a504e1e7",
"id": "3390ae1d",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"import requests"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3390ae1d",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"# This is a large nested json object and will be loaded as a python dict\n",
"json_data = requests.get(\"https://api.smith.langchain.com/openapi.json\").json()"
]
},
{
"cell_type": "markdown",
"id": "3cdc725d-f4b8-4725-9084-cb395d8ef48b",
"metadata": {},
"source": [
"## Basic usage\n",
"\n",
"Specify `max_chunk_size` to constrain chunk sizes:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 2,
"id": "7bfe2c1e",
"metadata": {},
"outputs": [],
"source": [
"from langchain_text_splitters import RecursiveJsonSplitter"
"from langchain_text_splitters import RecursiveJsonSplitter\n",
"\n",
"splitter = RecursiveJsonSplitter(max_chunk_size=300)"
]
},
{
"cell_type": "markdown",
"id": "e03b79fb-b1c6-4324-a409-86cd3e40cb92",
"metadata": {},
"source": [
"To obtain json chunks, use the `.split_json` method:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "69250bc6-c0f5-40d0-b8ba-7a349236bfd2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'servers': [{'url': 'https://api.smith.langchain.com', 'description': 'LangSmith API endpoint.'}]}\n",
"{'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.', 'operationId': 'read_tracer_session_api_v1_sessions__session_id__get'}}}}\n",
"{'paths': {'/api/v1/sessions/{session_id}': {'get': {'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}\n"
]
}
],
"source": [
"# Recursively split json data. Use this if you need to access or manipulate the smaller json chunks\n",
"json_chunks = splitter.split_json(json_data=json_data)\n",
"\n",
"for chunk in json_chunks[:3]:\n",
" print(chunk)"
]
},
{
"cell_type": "markdown",
"id": "3f05bc21-227e-4d2c-af51-16d69ad3cd7b",
"metadata": {},
"source": [
"To obtain LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects, use the `.create_documents` method:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2833c409",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"splitter = RecursiveJsonSplitter(max_chunk_size=300)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f941aa56",
"metadata": {},
"outputs": [],
"source": [
"# Recursively split json data - If you need to access/manipulate the smaller json chunks\n",
"json_chunks = splitter.split_json(json_data=json_data)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "0839f4f0",
"metadata": {},
"outputs": [
@@ -89,8 +120,9 @@
"name": "stdout",
"output_type": "stream",
"text": [
"{\"openapi\": \"3.0.2\", \"info\": {\"title\": \"LangChainPlus\", \"version\": \"0.1.0\"}, \"paths\": {\"/sessions/{session_id}\": {\"get\": {\"tags\": [\"tracer-sessions\"], \"summary\": \"Read Tracer Session\", \"description\": \"Get a specific session.\", \"operationId\": \"read_tracer_session_sessions__session_id__get\"}}}}\n",
"{\"paths\": {\"/sessions/{session_id}\": {\"get\": {\"parameters\": [{\"required\": true, \"schema\": {\"title\": \"Session Id\", \"type\": \"string\", \"format\": \"uuid\"}, \"name\": \"session_id\", \"in\": \"path\"}, {\"required\": false, \"schema\": {\"title\": \"Include Stats\", \"type\": \"boolean\", \"default\": false}, \"name\": \"include_stats\", \"in\": \"query\"}, {\"required\": false, \"schema\": {\"title\": \"Accept\", \"type\": \"string\"}, \"name\": \"accept\", \"in\": \"header\"}]}}}}\n"
"page_content='{\"openapi\": \"3.1.0\", \"info\": {\"title\": \"LangSmith\", \"version\": \"0.1.0\"}, \"servers\": [{\"url\": \"https://api.smith.langchain.com\", \"description\": \"LangSmith API endpoint.\"}]}'\n",
"page_content='{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"tags\": [\"tracer-sessions\"], \"summary\": \"Read Tracer Session\", \"description\": \"Get a specific session.\", \"operationId\": \"read_tracer_session_api_v1_sessions__session_id__get\"}}}}'\n",
"page_content='{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"security\": [{\"API Key\": []}, {\"Tenant ID\": []}, {\"Bearer Auth\": []}]}}}}'\n"
]
}
],
@@ -98,70 +130,129 @@
"# The splitter can also output documents\n",
"docs = splitter.create_documents(texts=[json_data])\n",
"\n",
"for doc in docs[:3]:\n",
" print(doc)"
]
},
{
"cell_type": "markdown",
"id": "677c3dd0-afc7-488a-a58d-b7943814f85d",
"metadata": {},
"source": [
"Or use `.split_text` to obtain string content directly:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "fa0a4d66-b470-404e-918b-6728df3b88b0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\"openapi\": \"3.1.0\", \"info\": {\"title\": \"LangSmith\", \"version\": \"0.1.0\"}, \"servers\": [{\"url\": \"https://api.smith.langchain.com\", \"description\": \"LangSmith API endpoint.\"}]}\n",
"{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"tags\": [\"tracer-sessions\"], \"summary\": \"Read Tracer Session\", \"description\": \"Get a specific session.\", \"operationId\": \"read_tracer_session_api_v1_sessions__session_id__get\"}}}}\n"
]
}
],
"source": [
"texts = splitter.split_text(json_data=json_data)\n",
"\n",
"print(texts[0])\n",
"print(texts[1])"
]
},
{
"cell_type": "markdown",
"id": "7070bf45-b885-4949-b8e0-7d1ea5205d2a",
"metadata": {},
"source": [
"## How to manage chunk sizes from list content\n",
"\n",
"Note that one of the chunks in this example is larger than the specified `max_chunk_size` of 300. Reviewing one of these larger chunks, we see that it contains a list object:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c34b1f7f",
"execution_count": 6,
"id": "86ef3195-375b-4db2-9804-f3fa5a249417",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[293, 431, 203, 277, 230, 194, 162, 280, 223, 193]\n",
"{\"paths\": {\"/sessions/{session_id}\": {\"get\": {\"parameters\": [{\"required\": true, \"schema\": {\"title\": \"Session Id\", \"type\": \"string\", \"format\": \"uuid\"}, \"name\": \"session_id\", \"in\": \"path\"}, {\"required\": false, \"schema\": {\"title\": \"Include Stats\", \"type\": \"boolean\", \"default\": false}, \"name\": \"include_stats\", \"in\": \"query\"}, {\"required\": false, \"schema\": {\"title\": \"Accept\", \"type\": \"string\"}, \"name\": \"accept\", \"in\": \"header\"}]}}}}\n"
"[171, 231, 126, 469, 210, 213, 237, 271, 191, 232]\n",
"\n",
"{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"parameters\": [{\"name\": \"session_id\", \"in\": \"path\", \"required\": true, \"schema\": {\"type\": \"string\", \"format\": \"uuid\", \"title\": \"Session Id\"}}, {\"name\": \"include_stats\", \"in\": \"query\", \"required\": false, \"schema\": {\"type\": \"boolean\", \"default\": false, \"title\": \"Include Stats\"}}, {\"name\": \"accept\", \"in\": \"header\", \"required\": false, \"schema\": {\"anyOf\": [{\"type\": \"string\"}, {\"type\": \"null\"}], \"title\": \"Accept\"}}]}}}}\n"
]
}
],
"source": [
"# Let's look at the size of the chunks\n",
"print([len(text) for text in texts][:10])\n",
"print()\n",
"print(texts[3])"
]
},
{
"cell_type": "markdown",
"id": "ddc98a1d-05df-48ab-8d17-6e4ee0d9d0cb",
"metadata": {},
"source": [
"The json splitter by default does not split lists.\n",
"\n",
"# Reviewing one of these chunks that was bigger we see there is a list object there\n",
"print(texts[1])"
"Specify `convert_lists=True` to preprocess the json, converting list content to dicts with `index:item` as `key:val` pairs:"
]
},
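{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `index:item` conversion can be illustrated with a small pure-Python sketch (a toy illustration of the idea, not the library's implementation; the function name is hypothetical):"
   ]
  },

```python
def convert_lists_to_dicts(data):
    # Toy sketch of the convert_lists idea (not the library implementation):
    # replace every list with a dict keyed by the stringified item index,
    # so list content can be split like any other nested dict.
    if isinstance(data, list):
        return {str(i): convert_lists_to_dicts(item) for i, item in enumerate(data)}
    if isinstance(data, dict):
        return {key: convert_lists_to_dicts(value) for key, value in data.items()}
    return data

example = {"get": {"tags": ["tracer-sessions"], "parameters": [{"name": "session_id"}]}}
converted = convert_lists_to_dicts(example)
# converted == {"get": {"tags": {"0": "tracer-sessions"},
#                       "parameters": {"0": {"name": "session_id"}}}}
```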
{
"cell_type": "code",
"execution_count": 7,
"id": "992477c2",
"metadata": {},
"outputs": [],
"source": [
"texts = splitter.split_text(json_data=json_data, convert_lists=True)"
]
},
{
"cell_type": "markdown",
"id": "912c20c2-8d05-47a6-bc03-f5c866761dff",
"metadata": {},
"source": [
"Let's look at the size of the chunks. Now they are all under the max of 300:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7abd43f6-78ab-4a73-853a-a777ab268efc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[176, 236, 141, 203, 212, 221, 210, 213, 242, 291]\n"
]
}
],
"source": [
"print([len(text) for text in texts][:10])"
]
},
{
"cell_type": "markdown",
"id": "3e5753bf-cede-4751-a1c0-c42aca56b88a",
"metadata": {},
"source": [
"The list has been converted to a dict, but retains all the needed contextual information even if split into many chunks:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "992477c2",
"metadata": {},
"outputs": [],
"source": [
"# The json splitter by default does not split lists\n",
"# the following will preprocess the json and convert list to dict with index:item as key:val pairs\n",
"texts = splitter.split_text(json_data=json_data, convert_lists=True)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "2d23b3aa",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[293, 431, 203, 277, 230, 194, 162, 280, 223, 193]\n"
]
}
],
"source": [
"# Let's look at the size of the chunks. Now they are all under the max\n",
"print([len(text) for text in texts][:10])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "d2c2773e",
"metadata": {},
"outputs": [
@@ -169,28 +260,27 @@
"name": "stdout",
"output_type": "stream",
"text": [
"{\"paths\": {\"/sessions/{session_id}\": {\"get\": {\"parameters\": [{\"required\": true, \"schema\": {\"title\": \"Session Id\", \"type\": \"string\", \"format\": \"uuid\"}, \"name\": \"session_id\", \"in\": \"path\"}, {\"required\": false, \"schema\": {\"title\": \"Include Stats\", \"type\": \"boolean\", \"default\": false}, \"name\": \"include_stats\", \"in\": \"query\"}, {\"required\": false, \"schema\": {\"title\": \"Accept\", \"type\": \"string\"}, \"name\": \"accept\", \"in\": \"header\"}]}}}}\n"
"{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"tags\": {\"0\": \"tracer-sessions\"}, \"summary\": \"Read Tracer Session\", \"description\": \"Get a specific session.\", \"operationId\": \"read_tracer_session_api_v1_sessions__session_id__get\"}}}}\n"
]
}
],
"source": [
"# The list has been converted to a dict, but retains all the needed contextual information even if split into many chunks\n",
"print(texts[1])"
]
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 10,
"id": "8963b01a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='{\"paths\": {\"/sessions/{session_id}\": {\"get\": {\"parameters\": [{\"required\": true, \"schema\": {\"title\": \"Session Id\", \"type\": \"string\", \"format\": \"uuid\"}, \"name\": \"session_id\", \"in\": \"path\"}, {\"required\": false, \"schema\": {\"title\": \"Include Stats\", \"type\": \"boolean\", \"default\": false}, \"name\": \"include_stats\", \"in\": \"query\"}, {\"required\": false, \"schema\": {\"title\": \"Accept\", \"type\": \"string\"}, \"name\": \"accept\", \"in\": \"header\"}]}}}}')"
"Document(page_content='{\"paths\": {\"/api/v1/sessions/{session_id}\": {\"get\": {\"tags\": [\"tracer-sessions\"], \"summary\": \"Read Tracer Session\", \"description\": \"Get a specific session.\", \"operationId\": \"read_tracer_session_api_v1_sessions__session_id__get\"}}}}')"
]
},
"execution_count": 13,
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
@@ -199,14 +289,6 @@
"# We can also look at the documents\n",
"docs[1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "168da4f0",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@@ -225,7 +307,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
"version": "3.10.4"
}
},
"nbformat": 4,

View File

@@ -10,7 +10,13 @@
"This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is `[\"\\n\\n\", \"\\n\", \" \", \"\"]`. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.\n",
"\n",
"1. How the text is split: by list of characters.\n",
"2. How the chunk size is measured: by number of characters."
"2. How the chunk size is measured: by number of characters.\n",
"\n",
"Below we show example usage.\n",
"\n",
"To obtain the string content directly, use `.split_text`.\n",
"\n",
"To create LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects (e.g., for use in downstream tasks), use `.create_documents`."
]
},
{
@@ -28,44 +34,6 @@
"execution_count": 1,
"id": "3390ae1d",
"metadata": {},
"outputs": [],
"source": [
"# This is a long document we can split up.\n",
"with open(\"../../state_of_the_union.txt\") as f:\n",
" state_of_the_union = f.read()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "7bfe2c1e",
"metadata": {},
"outputs": [],
"source": [
"from langchain_text_splitters import RecursiveCharacterTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2833c409",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = RecursiveCharacterTextSplitter(\n",
" # Set a really small chunk size, just to show.\n",
" chunk_size=100,\n",
" chunk_overlap=20,\n",
" length_function=len,\n",
" is_separator_regex=False,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "f63902f0",
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -77,6 +45,20 @@
}
],
"source": [
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
"\n",
"\n",
"# Load example document\n",
"with open(\"../../../docs/modules/state_of_the_union.txt\") as f:\n",
" state_of_the_union = f.read()\n",
"\n",
"text_splitter = RecursiveCharacterTextSplitter(\n",
" # Set a really small chunk size, just to show.\n",
" chunk_size=100,\n",
" chunk_overlap=20,\n",
" length_function=len,\n",
" is_separator_regex=False,\n",
")\n",
"texts = text_splitter.create_documents([state_of_the_union])\n",
"print(texts[0])\n",
"print(texts[1])"
@@ -84,7 +66,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 2,
"id": "0839f4f0",
"metadata": {},
"outputs": [
@@ -95,7 +77,7 @@
" 'of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.']"
]
},
"execution_count": 5,
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
@@ -105,12 +87,16 @@
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c34b1f7f",
"cell_type": "markdown",
"id": "60336622-b9d0-4172-816a-6cd1bb9ec481",
"metadata": {},
"outputs": [],
"source": []
"source": [
"Let's go through the parameters set above for `RecursiveCharacterTextSplitter`:\n",
"- `chunk_size`: The maximum size of a chunk, where size is determined by the `length_function`.\n",
"- `chunk_overlap`: Target overlap between chunks. Overlapping chunks helps to mitigate loss of information when context is divided between chunks.\n",
"- `length_function`: Function determining the chunk size.\n",
"- `is_separator_regex`: Whether the separator list (defaulting to `[\"\\n\\n\", \"\\n\", \" \", \"\"]`) should be interpreted as regex."
]
},
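{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The recursive strategy itself can be sketched in a few lines of plain Python. This is a simplified sketch assuming character-based lengths; the real splitter also merges small adjacent splits and applies `chunk_overlap`:"
   ]
  },

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=100):
    # Toy sketch of the recursive strategy (illustration only): try separators
    # in order; any piece still over chunk_size is split again with the
    # remaining separators. The real RecursiveCharacterTextSplitter also
    # merges small adjacent splits back together and applies chunk_overlap.
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return [c for c in chunks if c]

chunks = recursive_split(
    "First paragraph.\n\nSecond paragraph with more words.", chunk_size=25
)
# The first paragraph fits in one chunk; the second falls through to the
# word-level separator because it exceeds chunk_size.
```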
{
"cell_type": "markdown",
@@ -150,14 +136,6 @@
" # Existing args\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1177ee4f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@@ -176,7 +154,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
"version": "3.10.4"
}
},
"nbformat": 4,

View File

@@ -12,6 +12,8 @@
"\n",
"All credit to him.\n",
"\n",
"This guide covers how to split chunks based on their semantic similarity. If embeddings are sufficiently far apart, chunks are split.\n",
"\n",
"At a high level, this splits the text into sentences, then groups them into groups of 3\n",
"sentences, and then merges ones that are similar in the embedding space."
]
@@ -50,7 +52,7 @@
"outputs": [],
"source": [
"# This is a long document we can split up.\n",
"with open(\"../../state_of_the_union.txt\") as f:\n",
"with open(\"../../../docs/modules/state_of_the_union.txt\") as f:\n",
" state_of_the_union = f.read()"
]
},
@@ -62,24 +64,24 @@
"## Create Text Splitter"
]
},
{
"cell_type": "markdown",
"id": "774a5199-c2ff-43bc-bf07-87573e0b8db4",
"metadata": {},
"source": [
"To instantiate a [SemanticChunker](https://api.python.langchain.com/en/latest/text_splitter/langchain_experimental.text_splitter.SemanticChunker.html), we must specify an embedding model. Below we will use [OpenAIEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.openai.OpenAIEmbeddings.html). "
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 4,
"id": "a88ff70c",
"metadata": {},
"outputs": [],
"source": [
"from langchain_experimental.text_splitter import SemanticChunker\n",
"from langchain_openai.embeddings import OpenAIEmbeddings"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "613d4a3b",
"metadata": {},
"outputs": [],
"source": [
"from langchain_openai.embeddings import OpenAIEmbeddings\n",
"\n",
"text_splitter = SemanticChunker(OpenAIEmbeddings())"
]
},
@@ -88,12 +90,14 @@
"id": "91b14834",
"metadata": {},
"source": [
"## Split Text"
"## Split Text\n",
"\n",
"We split text in the usual way, e.g., by invoking `.create_documents` to create LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 5,
"id": "295ec095",
"metadata": {},
"outputs": [
@@ -119,7 +123,7 @@
"\n",
"This chunker works by determining when to \"break\" apart sentences. This is done by looking for differences in embeddings between any two sentences. When that difference is past some threshold, then they are split.\n",
"\n",
"There are a few ways to determine what that threshold is.\n",
"There are a few ways to determine what that threshold is, which are controlled by the `breakpoint_threshold_type` kwarg.\n",
"\n",
"### Percentile\n",
"\n",
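"The percentile method can be sketched in plain Python (a toy illustration with hand-made 2-d \"embeddings\" and hypothetical helper names, not the `SemanticChunker` implementation):\n",

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1 - dot / norm

def percentile(values, pct):
    # Linear-interpolation percentile, as in numpy's default method.
    ordered = sorted(values)
    idx = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(idx), math.ceil(idx)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (idx - lo)

def breakpoints(embeddings, pct=95):
    # Toy sketch of the percentile method (illustration only): split wherever
    # the distance between consecutive sentence embeddings exceeds the
    # pct-th percentile of all consecutive distances.
    distances = [cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])]
    threshold = percentile(distances, pct)
    return [i for i, d in enumerate(distances) if d > threshold]

# Three nearly identical "sentences" followed by an unrelated one:
embs = [[1.0, 0.0], [0.99, 0.01], [1.0, 0.02], [0.0, 1.0]]
splits = breakpoints(embs)
# splits == [2]: the only breakpoint is where the topic changes
```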
@@ -318,7 +322,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
"version": "3.10.4"
}
},
"nbformat": 4,

View File

@@ -17,13 +17,17 @@
"source": [
"## tiktoken\n",
"\n",
">[tiktoken](https://github.com/openai/tiktoken) is a fast `BPE` tokenizer created by `OpenAI`.\n",
":::{.callout-note}\n",
"[tiktoken](https://github.com/openai/tiktoken) is a fast `BPE` tokenizer created by `OpenAI`.\n",
":::\n",
"\n",
"\n",
"We can use it to estimate tokens used. It will probably be more accurate for the OpenAI models.\n",
"We can use `tiktoken` to estimate the number of tokens used. It will probably be more accurate for the OpenAI models.\n",
"\n",
"1. How the text is split: by character passed in.\n",
"2. How the chunk size is measured: by `tiktoken` tokenizer."
"2. How the chunk size is measured: by `tiktoken` tokenizer.\n",
"\n",
"[CharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.CharacterTextSplitter.html), [RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html), and [TokenTextSplitter](https://api.python.langchain.com/en/latest/base/langchain_text_splitters.base.TokenTextSplitter.html) can be used with `tiktoken` directly."
]
},
{
@@ -43,10 +47,12 @@
"metadata": {},
"outputs": [],
"source": [
"from langchain_text_splitters import CharacterTextSplitter\n",
"\n",
"\n",
"# This is a long document we can split up.\n",
"with open(\"../../state_of_the_union.txt\") as f:\n",
" state_of_the_union = f.read()\n",
"from langchain_text_splitters import CharacterTextSplitter"
"with open(\"../../../docs/modules/state_of_the_union.txt\") as f:\n",
" state_of_the_union = f.read()"
]
},
{
@@ -54,18 +60,20 @@
"id": "a3ba1d8a",
"metadata": {},
"source": [
"The `.from_tiktoken_encoder()` method takes either `encoding` as an argument (e.g. `cl100k_base`), or the `model_name` (e.g. `gpt-4`). All additional arguments like `chunk_size`, `chunk_overlap`, and `separators` are used to instantiate `CharacterTextSplitter`:"
"To split with a [CharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.CharacterTextSplitter.html) and then merge chunks with `tiktoken`, use its `.from_tiktoken_encoder()` method. Note that splits from this method can be larger than the chunk size measured by the `tiktoken` tokenizer.\n",
"\n",
"The `.from_tiktoken_encoder()` method takes either `encoding_name` as an argument (e.g. `cl100k_base`), or the `model_name` (e.g. `gpt-4`). All additional arguments like `chunk_size`, `chunk_overlap`, and `separators` are used to instantiate `CharacterTextSplitter`:"
]
},
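{
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The split-then-merge behaviour can be sketched in plain Python. This toy uses a whitespace word count as a stand-in length function (with `.from_tiktoken_encoder()`, the `tiktoken` tokenizer plays this role); the helper names are illustrative only:"
   ]
  },

```python
def token_len(text):
    # Stand-in "tokenizer": whitespace word count. With
    # .from_tiktoken_encoder(), the tiktoken tokenizer is used instead.
    return len(text.split())

def split_then_merge(text, separator="\n\n", chunk_size=10):
    # Toy sketch (illustration only): split on the separator, then greedily
    # merge consecutive splits while the merged "token" count stays under
    # chunk_size.
    splits = [s for s in text.split(separator) if s]
    chunks, current = [], []
    for split in splits:
        candidate = current + [split]
        if current and token_len(separator.join(candidate)) > chunk_size:
            chunks.append(separator.join(current))
            current = [split]
        else:
            current = candidate
    if current:
        chunks.append(separator.join(current))
    # Note: a single split longer than chunk_size is kept whole, which is why
    # chunks from CharacterTextSplitter.from_tiktoken_encoder can exceed the
    # limit.
    return chunks

chunks = split_then_merge(
    "one two three\n\nfour five\n\nsix seven eight nine ten eleven", chunk_size=5
)
# The first two splits merge into one chunk of 5 "tokens"; the oversized
# final split is kept whole even though it exceeds chunk_size.
```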
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 6,
"id": "825f7c0a",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = CharacterTextSplitter.from_tiktoken_encoder(\n",
" encoding=\"cl100k_base\", chunk_size=100, chunk_overlap=0\n",
" encoding_name=\"cl100k_base\", chunk_size=100, chunk_overlap=0\n",
")\n",
"texts = text_splitter.split_text(state_of_the_union)"
]
@@ -99,12 +107,12 @@
"id": "de5b6a6e",
"metadata": {},
"source": [
"Note that if we use `CharacterTextSplitter.from_tiktoken_encoder`, text is only split by `CharacterTextSplitter` and `tiktoken` tokenizer is used to merge splits. It means that split can be larger than chunk size measured by `tiktoken` tokenizer. We can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` to make sure splits are not larger than chunk size of tokens allowed by the language model, where each split will be recursively split if it has a larger size:"
"To implement a hard constraint on the chunk size, we can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder`, where each split will be recursively split if it has a larger size:"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"id": "0262a991",
"metadata": {},
"outputs": [],
@@ -123,15 +131,23 @@
"id": "04457e3a",
"metadata": {},
"source": [
"We can also load a tiktoken splitter directly, which will ensure each split is smaller than chunk size."
"We can also use `TokenTextSplitter`, which works with `tiktoken` directly and will ensure each split is smaller than the chunk size."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 8,
"id": "4454c70e",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our\n"
]
}
],
"source": [
"from langchain_text_splitters import TokenTextSplitter\n",
"\n",
@@ -156,9 +172,11 @@
"source": [
"## spaCy\n",
"\n",
">[spaCy](https://spacy.io/) is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.\n",
":::{.callout-note}\n",
"[spaCy](https://spacy.io/) is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.\n",
":::\n",
"\n",
"Another alternative to `NLTK` is to use [spaCy tokenizer](https://spacy.io/api/tokenizer).\n",
"LangChain implements splitters based on the [spaCy tokenizer](https://spacy.io/api/tokenizer).\n",
"\n",
"1. How the text is split: by `spaCy` tokenizer.\n",
"2. How the chunk size is measured: by number of characters."
@@ -182,22 +200,10 @@
"outputs": [],
"source": [
"# This is a long document we can split up.\n",
"with open(\"../../state_of_the_union.txt\") as f:\n",
"with open(\"../../../docs/modules/state_of_the_union.txt\") as f:\n",
" state_of_the_union = f.read()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f4ec9b90",
"metadata": {},
"outputs": [],
"source": [
"from langchain_text_splitters import SpacyTextSplitter\n",
"\n",
"text_splitter = SpacyTextSplitter(chunk_size=1000)"
]
},
{
"cell_type": "code",
"execution_count": 4,
@@ -259,6 +265,11 @@
}
],
"source": [
"from langchain_text_splitters import SpacyTextSplitter\n",
"\n",
"\n",
"text_splitter = SpacyTextSplitter(chunk_size=1000)\n",
"\n",
"texts = text_splitter.split_text(state_of_the_union)\n",
"print(texts[0])"
]
@@ -270,34 +281,19 @@
"source": [
"## SentenceTransformers\n",
"\n",
"The `SentenceTransformersTokenTextSplitter` is a specialized text splitter for use with the sentence-transformer models. The default behaviour is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9dd5419e",
"metadata": {},
"outputs": [],
"source": [
"from langchain_text_splitters import SentenceTransformersTokenTextSplitter"
"The [SentenceTransformersTokenTextSplitter](https://api.python.langchain.com/en/latest/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html) is a specialized text splitter for use with the sentence-transformer models. The default behaviour is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use.\n",
"\n",
"To split text and constrain token counts according to the sentence-transformers tokenizer, instantiate a `SentenceTransformersTokenTextSplitter`. You can optionally specify:\n",
"\n",
"- `chunk_overlap`: integer count of token overlap;\n",
"- `model_name`: sentence-transformer model name, defaulting to `\"sentence-transformers/all-mpnet-base-v2\"`;\n",
"- `tokens_per_chunk`: desired token count per chunk."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b43e5d54",
"metadata": {},
"outputs": [],
"source": [
"splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)\n",
"text = \"Lorem \""
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "1df84cb4",
"id": "9dd5419e",
"metadata": {},
"outputs": [
{
@@ -309,6 +305,11 @@
}
],
"source": [
"from langchain_text_splitters import SentenceTransformersTokenTextSplitter\n",
"\n",
"splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)\n",
"text = \"Lorem \"\n",
"\n",
"count_start_and_stop_tokens = 2\n",
"text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens\n",
"print(text_token_count)"
@@ -364,7 +365,10 @@
"source": [
"## NLTK\n",
"\n",
">[The Natural Language Toolkit](https://en.wikipedia.org/wiki/Natural_Language_Toolkit), or more commonly [NLTK](https://www.nltk.org/), is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language.\n",
":::{.callout-note}\n",
"[The Natural Language Toolkit](https://en.wikipedia.org/wiki/Natural_Language_Toolkit), or more commonly [NLTK](https://www.nltk.org/), is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language.\n",
":::\n",
"\n",
"\n",
"Rather than just splitting on \"\\n\\n\", we can use `NLTK` to split based on [NLTK tokenizers](https://www.nltk.org/api/nltk.tokenize.html).\n",
"\n",
@@ -390,7 +394,7 @@
"outputs": [],
"source": [
"# This is a long document we can split up.\n",
"with open(\"../../state_of_the_union.txt\") as f:\n",
"with open(\"../../../docs/modules/state_of_the_union.txt\") as f:\n",
" state_of_the_union = f.read()"
]
},
@@ -463,7 +467,10 @@
"metadata": {},
"source": [
"## KoNLPy\n",
"> [KoNLPy: Korean NLP in Python](https://konlpy.org/en/latest/) is a Python package for natural language processing (NLP) of the Korean language.\n",
"\n",
":::{.callout-note}\n",
"[KoNLPy: Korean NLP in Python](https://konlpy.org/en/latest/) is a Python package for natural language processing (NLP) of the Korean language.\n",
":::\n",
"\n",
"Token splitting involves the segmentation of text into smaller, more manageable units called tokens. These tokens are often words, phrases, symbols, or other meaningful elements crucial for further processing and analysis. In languages like English, token splitting typically involves separating words by spaces and punctuation marks. The effectiveness of token splitting largely depends on the tokenizer's understanding of the language structure, ensuring the generation of meaningful tokens. Since tokenizers designed for the English language are not equipped to understand the unique semantic structures of other languages, such as Korean, they cannot be effectively used for Korean language processing.\n",
"\n",
@@ -563,12 +570,12 @@
"source": [
"## Hugging Face tokenizer\n",
"\n",
">[Hugging Face](https://huggingface.co/docs/tokenizers/index) has many tokenizers.\n",
"[Hugging Face](https://huggingface.co/docs/tokenizers/index) has many tokenizers.\n",
"\n",
"We can use the Hugging Face [GPT2TokenizerFast](https://huggingface.co/Ransaka/gpt2-tokenizer-fast) tokenizer to count the text length in tokens.\n",
"\n",
"1. How the text is split: by character passed in.\n",
"2. How the chunk size is measured: by number of tokens calculated by the `Hugging Face` tokenizer.\n"
"2. How the chunk size is measured: by number of tokens calculated by the `Hugging Face` tokenizer."
]
},
{
@@ -658,7 +665,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
"version": "3.10.4"
},
"vscode": {
"interpreter": {

View File

@@ -1045,7 +1045,7 @@
"- We used chains to build a predictable application that generates search queries for each user input;\n",
"- We used agents to build an application that \"decides\" when and how to generate search queries.\n",
"\n",
"To explore different types of retrievers and retrieval strategies, visit the [retrievers](docs/0.2.x/how_to/#retrievers) section of the how-to guides.\n",
"To explore different types of retrievers and retrieval strategies, visit the [retrievers](/docs/0.2.x/how_to/#retrievers) section of the how-to guides.\n",
"\n",
"For a detailed walkthrough of LangChain's conversation memory abstractions, visit the [How to add message history (memory)](/docs/how_to/message_history) LCEL page.\n",
"\n",
@@ -1077,7 +1077,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.1"
"version": "3.10.4"
}
},
"nbformat": 4,