smart text splitter (#530)

smart text splitter that iteratively tries different separators until it works!
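
The idea of trying separators in turn can be sketched as follows. This is a minimal illustration, not the implementation in this commit: `recursive_split` is a hypothetical helper, it ignores chunk overlap, and it assumes length is measured with `len`.

```python
from typing import List, Sequence


def recursive_split(text: str, separators: Sequence[str], chunk_size: int) -> List[str]:
    """Hypothetical helper: split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left to try: fall back to fixed-width cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep == "" or sep not in text:
        # This separator cannot help; try the next one.
        return recursive_split(text, rest, chunk_size)

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        # Greedily merge neighbouring pieces while they still fit.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with the finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee fff ggg"
parts = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=8)
print(parts)
```

Ordering the separators from coarse (paragraph breaks) to fine (spaces) keeps related text together for as long as the size limit allows.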
2025-09-06 05:25:04 +00:00 · 2023-01-08 15:11:10 -08:00
parent 8dfad874a2
commit 1192cc0767
3 changed files with 172 additions and 21 deletions
--- a/docs/modules/utils/combine_docs_examples/textsplitter.ipynb
+++ b/docs/modules/utils/combine_docs_examples/textsplitter.ipynb
@@ -90,6 +90,61 @@
    "print(texts[0])"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "1be00b73",
+   "metadata": {},
+   "source": [
+    "## Recursive Character Text Splitting\n",
+    "Sometimes, it's not enough to split on just one character. This text splitter takes a list of separator characters and recursively splits the text on them until every chunk is under the size limit."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "1ac6376d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.text_splitter import RecursiveCharacterTextSplitter"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "6787b13b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text_splitter = RecursiveCharacterTextSplitter(\n",
+    "    # Use a deliberately small chunk size so the splitting is easy to see.\n",
+    "    chunk_size=100,\n",
+    "    chunk_overlap=20,\n",
+    "    length_function=len,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "4f0e7d9b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet.\n",
+      "and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n"
+     ]
+    }
+   ],
+   "source": [
+    "texts = text_splitter.split_text(state_of_the_union)\n",
+    "print(texts[0])\n",
+    "print(texts[1])"
+   ]
+  },
  {
   "cell_type": "markdown",
   "id": "87a71115",