From 6c9b0f96f364bea58828398b801afde565a72aad Mon Sep 17 00:00:00 2001
From: Anthony Shaw <anthony.p.shaw@gmail.com>
Date: Tue, 26 Mar 2024 11:34:00 +1100
Subject: [PATCH] docs: Add guidance for splitting Chinese, Japanese, and Thai
 (#19295)

The existing default list of separators for the
`RecursiveCharacterTextSplitter` assumes spaces are word boundaries.
Some languages (Chinese, Japanese, Thai, Burmese)
[don't use spaces between words](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries).

This PR extends the documentation to explain how to cater for those
languages by adding extra punctuation to the separators, along with
zero-width spaces, which some typesetters insert between words. Both
help the splitter avoid splitting within words.

Ideally, **these separators could be a constant in the module** but for
now, defining them in the documentation is a start.
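
As a rough sketch of what such a constant might look like, and of why the
extra separators matter: the name `CJK_THAI_SEPARATORS` and the helper
`first_matching_separator` below are hypothetical and do not exist in the
library. The helper is only a simplified illustration of the idea that the
splitter works through the separator list in order and uses the first one
found in the text, with `""` as the character-level fallback.

```python
# Default separators, as quoted in the documentation.
DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]

# Hypothetical constant (not part of the library): the list this PR documents.
CJK_THAI_SEPARATORS = [
    "\n\n",
    "\n",
    " ",
    ".",
    ",",
    "\u200b",  # Zero-width space
    "\uff0c",  # Fullwidth comma
    "\u3001",  # Ideographic comma
    "\uff0e",  # Fullwidth full stop
    "\u3002",  # Ideographic full stop
    "",
]


def first_matching_separator(text, separators):
    """Return the first separator in the list that occurs in the text.

    Simplified illustration only: the empty string always matches, so it
    acts as the last resort (splitting between any two characters).
    """
    for sep in separators:
        if sep == "" or sep in text:
            return sep
    return ""


# A Japanese sample with no spaces or newlines.
sample = "今日は晴れです。明日は雨です。"
print(repr(first_matching_separator(sample, DEFAULT_SEPARATORS)))
print(repr(first_matching_separator(sample, CJK_THAI_SEPARATORS)))
```

With the default list, nothing matches before the empty-string fallback,
so the sample can be cut between any two characters. With the extended
list, the ideographic full stop is found first, so cuts land on sentence
boundaries.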
---
 .../recursive_text_splitter.ipynb             | 47 +++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb b/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb
index 1808db78d6a..9d10763895c 100644
--- a/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb
+++ b/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb
@@ -111,6 +111,53 @@
    "metadata": {},
    "outputs": [],
    "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2b74939c",
+   "metadata": {},
+   "source": [
+    "## Splitting text from languages without word boundaries\n",
+    "\n",
+    "Some writing systems, such as Chinese, Japanese, and Thai, do not mark [word boundaries](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries) with spaces. Splitting such text with the default separator list of `[\"\\n\\n\", \"\\n\", \" \", \"\"]` can cause words to be split between chunks. To keep words together, you can override the list of separators to include additional punctuation:\n",
+    "\n",
+    "* Add ASCII full stop \"`.`\", [Unicode fullwidth](https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)) full stop \"`．`\" (used in Chinese text), and [ideographic full stop](https://en.wikipedia.org/wiki/CJK_Symbols_and_Punctuation) \"`。`\" (used in Japanese and Chinese).\n",
+    "* Add [zero-width space](https://en.wikipedia.org/wiki/Zero-width_space), used in Thai, Myanmar, Khmer, and Japanese.\n",
+    "* Add ASCII comma \"`,`\", Unicode fullwidth comma \"`，`\", and Unicode ideographic comma \"`、`\"."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6d48a8ef",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text_splitter = RecursiveCharacterTextSplitter(\n",
+    "    separators=[\n",
+    "        \"\\n\\n\",\n",
+    "        \"\\n\",\n",
+    "        \" \",\n",
+    "        \".\",\n",
+    "        \",\",\n",
+    "        \"\\u200b\",  # Zero-width space\n",
+    "        \"\\uff0c\",  # Fullwidth comma\n",
+    "        \"\\u3001\",  # Ideographic comma\n",
+    "        \"\\uff0e\",  # Fullwidth full stop\n",
+    "        \"\\u3002\",  # Ideographic full stop\n",
+    "        \"\",\n",
+    "    ],\n",
+    "    # Existing args\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1177ee4f",
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {