smart text splitter (#530)

Add a smart text splitter that iteratively tries different separators
until the resulting chunks fit under the size limit.
Harrison Chase
2023-01-08 15:11:10 -08:00
committed by GitHub
parent 8dfad874a2
commit 1192cc0767
3 changed files with 172 additions and 21 deletions
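
The idea behind the splitter can be illustrated with a minimal sketch. This is not the code from this commit: the function name `recursive_split` and the default separator list are assumptions made for illustration, and `chunk_overlap` is left out to keep the example short. The sketch tries the first separator that occurs in the text, greedily merges pieces back together while they fit, and recurses with the remaining separators on any piece that is still too long.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),  # assumed defaults
    chunk_size: int = 100,
) -> List[str]:
    """Split text into chunks of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # Use the first separator that appears in the text; "" means "hard cut".
    sep = next((s for s in separators if s == "" or s in text), "")
    if sep == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    # Separators that come after the chosen one are used for oversized pieces.
    remaining = separators[separators.index(sep) + 1:]

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits into the chunk being built.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


example = "aaa bbb ccc\n\nddd eee fff ggg"
print(recursive_split(example, chunk_size=11))
```

Paragraph breaks are tried first, then line breaks, then spaces, so chunks tend to follow the coarsest natural boundary that still satisfies the limit.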

@@ -90,6 +90,61 @@
"print(texts[0])"
]
},
{
"cell_type": "markdown",
"id": "1be00b73",
"metadata": {},
"source": [
"## Recursive Character Text Splitting\n",
"Sometimes, it's not enough to split on just one character. This text splitter takes a list of separator characters and recursively splits the text on them until every chunk is under the size limit."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1ac6376d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import RecursiveCharacterTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6787b13b",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = RecursiveCharacterTextSplitter(\n",
" # Use a very small chunk size so the splitting is easy to see.\n",
" chunk_size = 100,\n",
" chunk_overlap = 20,\n",
" length_function = len,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4f0e7d9b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet.\n",
"and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n"
]
}
],
"source": [
"texts = text_splitter.split_text(state_of_the_union)\n",
"print(texts[0])\n",
"print(texts[1])"
]
},
{
"cell_type": "markdown",
"id": "87a71115",