langchain/docs/docs/how_to/recursive_text_splitter.ipynb
2024-06-03 16:30:17 -07:00

176 lines
6.0 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "raw",
"id": "52976910",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"---\n",
"keywords: [recursivecharactertextsplitter]\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "a678d550",
"metadata": {},
"source": [
"# How to recursively split text by characters\n",
"\n",
"This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is `[\"\\n\\n\", \"\\n\", \" \", \"\"]`. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.\n",
"\n",
"1. How the text is split: by list of characters.\n",
"2. How the chunk size is measured: by number of characters.\n",
"\n",
"Below we show example usage.\n",
"\n",
"To obtain the string content directly, use `.split_text`.\n",
"\n",
"To create LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects (e.g., for use in downstream tasks), use `.create_documents`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c16167c-1e56-4e11-9b8b-60f93044498e",
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-text-splitters"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "3390ae1d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and'\n",
"page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.'\n"
]
}
],
"source": [
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
"\n",
"# Load example document\n",
"with open(\"state_of_the_union.txt\") as f:\n",
" state_of_the_union = f.read()\n",
"\n",
"text_splitter = RecursiveCharacterTextSplitter(\n",
" # Set a really small chunk size, just to show.\n",
" chunk_size=100,\n",
" chunk_overlap=20,\n",
" length_function=len,\n",
" is_separator_regex=False,\n",
")\n",
"texts = text_splitter.create_documents([state_of_the_union])\n",
"print(texts[0])\n",
"print(texts[1])"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "0839f4f0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and',\n",
" 'of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.']"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_splitter.split_text(state_of_the_union)[:2]"
]
},
{
"cell_type": "markdown",
"id": "60336622-b9d0-4172-816a-6cd1bb9ec481",
"metadata": {},
"source": [
"Let's go through the parameters set above for `RecursiveCharacterTextSplitter`:\n",
"- `chunk_size`: The maximum size of a chunk, where size is determined by the `length_function`.\n",
"- `chunk_overlap`: Target overlap between chunks. Overlapping chunks helps to mitigate loss of information when context is divided between chunks.\n",
"- `length_function`: Function determining the chunk size.\n",
"- `is_separator_regex`: Whether the separator list (defaulting to `[\"\\n\\n\", \"\\n\", \" \", \"\"]`) should be interpreted as regex."
]
},
{
"cell_type": "markdown",
"id": "2b74939c",
"metadata": {},
"source": [
"## Splitting text from languages without word boundaries\n",
"\n",
"Some writing systems do not have [word boundaries](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries), for example Chinese, Japanese, and Thai. Splitting text with the default separator list of `[\"\\n\\n\", \"\\n\", \" \", \"\"]` can cause words to be split between chunks. To keep words together, you can override the list of separators to include additional punctuation:\n",
"\n",
"* Add ASCII full-stop \"`.`\", [Unicode fullwidth](https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)) full stop \"``\" (used in Chinese text), and [ideographic full stop](https://en.wikipedia.org/wiki/CJK_Symbols_and_Punctuation) \"`。`\" (used in Japanese and Chinese)\n",
"* Add [Zero-width space](https://en.wikipedia.org/wiki/Zero-width_space) used in Thai, Myanmar, Kmer, and Japanese.\n",
"* Add ASCII comma \"`,`\", Unicode fullwidth comma \"``\", and Unicode ideographic comma \"`、`\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d48a8ef",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = RecursiveCharacterTextSplitter(\n",
" separators=[\n",
" \"\\n\\n\",\n",
" \"\\n\",\n",
" \" \",\n",
" \".\",\n",
" \",\",\n",
" \"\\u200b\", # Zero-width space\n",
" \"\\uff0c\", # Fullwidth comma\n",
" \"\\u3001\", # Ideographic comma\n",
" \"\\uff0e\", # Fullwidth full stop\n",
" \"\\u3002\", # Ideographic full stop\n",
" \"\",\n",
" ],\n",
" # Existing args\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}