docs: Add guidance for splitting Chinese, Japanese, and Thai (#19295)

The default separator list for the `RecursiveCharacterTextSplitter` assumes that spaces mark word boundaries. Some languages [don't use spaces between words](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries) (Chinese, Japanese, Thai, Burmese). This PR extends the documentation to explain how to cater for those languages by adding extra punctuation to the separators, along with the zero-width space, which some typesetters insert between words and which helps the splitter avoid splitting inside words. Ideally, **these separators could be a constant in the module**, but for now, defining them in the documentation is a start.
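
As a rough illustration of the constant idea, here is a minimal sketch in plain Python. The name `CJK_THAI_SEPARATORS` and the helper `split_recursively` are hypothetical and are not part of the library. The helper is a simplified stand-in for a recursive splitter: it drops the separators and does not merge small pieces, so it only shows why the extra separators keep words intact.

```python
# Hypothetical constant: the module does not define this name as of this PR.
CJK_THAI_SEPARATORS = [
    "\n\n",
    "\n",
    " ",
    ".",
    ",",
    "\u200b",  # Zero-width space
    "\uff0c",  # Fullwidth comma
    "\u3001",  # Ideographic comma
    "\uff0e",  # Fullwidth full stop
    "\u3002",  # Ideographic full stop
    "",
]

DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]


def split_recursively(text, separators, chunk_size):
    """Simplified illustration only, not the library's implementation.

    Split on the first separator that occurs in the text, then recurse
    into any piece that is still longer than chunk_size.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep == "":
            # Last resort: cut at fixed width, possibly inside a word.
            return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
        if sep in text:
            chunks = []
            for piece in text.split(sep):
                chunks.extend(split_recursively(piece, separators[i + 1:], chunk_size))
            return chunks
    return [text]


sample = "今日は晴れです。明日は雨です。"
with_defaults = split_recursively(sample, DEFAULT_SEPARATORS, chunk_size=9)
with_extras = split_recursively(sample, CJK_THAI_SEPARATORS, chunk_size=9)
```

With the default list, the sample contains no newline or space, so the sketch falls through to the fixed-width cut and separates the two characters of 明日. With the extended list, it splits at the ideographic full stop instead and each sentence stays whole.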
2025-06-22 14:49:29 +00:00 · 2024-03-26 11:34:00 +11:00 · 2024-03-26 11:34:00 +11:00 · 6c9b0f96f3
commit 6c9b0f96f3
parent 441a8012b3
1 changed files with 47 additions and 0 deletions
--- a/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb
+++ b/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb
@@ -111,6 +111,53 @@
   "metadata": {},
   "outputs": [],
   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2b74939c",
+   "metadata": {},
+   "source": [
+    "## Splitting text from languages without word boundaries\n",
+    "\n",
+    "Some writing systems do not have [word boundaries](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries), for example Chinese, Japanese, and Thai. Splitting text with the default separator list of `[\"\\n\\n\", \"\\n\", \" \", \"\"]` can cause words to be split between chunks. To keep words together, you can override the list of separators to include additional punctuation:\n",
+    "\n",
+    "* Add ASCII full-stop \"`.`\", [Unicode fullwidth](https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)) full stop \"`．`\" (used in Chinese text), and [ideographic full stop](https://en.wikipedia.org/wiki/CJK_Symbols_and_Punctuation) \"`。`\" (used in Japanese and Chinese)\n",
+    "* Add [Zero-width space](https://en.wikipedia.org/wiki/Zero-width_space) used in Thai, Myanmar, Khmer, and Japanese.\n",
+    "* Add ASCII comma \"`,`\", Unicode fullwidth comma \"`，`\", and Unicode ideographic comma \"`、`\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6d48a8ef",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text_splitter = RecursiveCharacterTextSplitter(\n",
+    "    separators=[\n",
+    "        \"\\n\\n\",\n",
+    "        \"\\n\",\n",
+    "        \" \",\n",
+    "        \".\",\n",
+    "        \",\",\n",
+    "        \"\\u200B\",  # Zero-width space\n",
+    "        \"\\uff0c\",  # Fullwidth comma\n",
+    "        \"\\u3001\",  # Ideographic comma\n",
+    "        \"\\uff0e\",  # Fullwidth full stop\n",
+    "        \"\\u3002\",  # Ideographic full stop\n",
+    "        \"\",\n",
+    "    ],\n",
+    "    # Existing args\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1177ee4f",
+   "metadata": {},
+   "outputs": [],
+   "source": []
  }
 ],
 "metadata": {