diff --git a/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb b/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb
index 1808db78d6a..9d10763895c 100644
--- a/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb
+++ b/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb
@@ -111,6 +111,53 @@
    "metadata": {},
    "outputs": [],
    "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2b74939c",
+   "metadata": {},
+   "source": [
+    "## Splitting text from languages without word boundaries\n",
+    "\n",
+    "Some writing systems do not have [word boundaries](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries), for example Chinese, Japanese, and Thai. Splitting text with the default separator list of `[\"\\n\\n\", \"\\n\", \" \", \"\"]` can cause words to be split between chunks. To keep words together, you can override the list of separators to include additional punctuation:\n",
+    "\n",
+    "* Add ASCII full stop \"`.`\", [Unicode fullwidth](https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)) full stop \"`．`\" (used in Chinese text), and [ideographic full stop](https://en.wikipedia.org/wiki/CJK_Symbols_and_Punctuation) \"`。`\" (used in Japanese and Chinese)\n",
+    "* Add [zero-width space](https://en.wikipedia.org/wiki/Zero-width_space) used in Thai, Myanmar, Khmer, and Japanese.\n",
+    "* Add ASCII comma \"`,`\", Unicode fullwidth comma \"`，`\", and Unicode ideographic comma \"`、`\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6d48a8ef",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text_splitter = RecursiveCharacterTextSplitter(\n",
+    "    separators=[\n",
+    "        \"\\n\\n\",\n",
+    "        \"\\n\",\n",
+    "        \" \",\n",
+    "        \".\",\n",
+    "        \",\",\n",
+    "        \"\\u200B\",  # Zero-width space\n",
+    "        \"\\uff0c\",  # Fullwidth comma\n",
+    "        \"\\u3001\",  # Ideographic comma\n",
+    "        \"\\uff0e\",  # Fullwidth full stop\n",
+    "        \"\\u3002\",  # Ideographic full stop\n",
+    "        \"\",\n",
+    "    ],\n",
+    "    # Existing args\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1177ee4f",
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {