From 6c9b0f96f364bea58828398b801afde565a72aad Mon Sep 17 00:00:00 2001
From: Anthony Shaw
Date: Tue, 26 Mar 2024 11:34:00 +1100
Subject: [PATCH] docs: Add guidance for splitting Chinese, Japanese, and Thai (#19295)

The existing default list of separators for the `RecursiveCharacterTextSplitter` assumes spaces are word boundaries. Some languages [don't use spaces between words](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries) (Chinese, Japanese, Thai, Burmese). This PR extends the documentation to explain how to cater for those languages by adding punctuation marks and zero-width spaces to the separators. Some typesetters insert zero-width spaces between words, which helps the splitter avoid splitting inside a word.

Ideally, **these separators could be a constant in the module** but for now, defining them in the documentation is a start.
---
 .../recursive_text_splitter.ipynb | 47 +++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb b/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb
index 1808db78d6a..9d10763895c 100644
--- a/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb
+++ b/docs/docs/modules/data_connection/document_transformers/recursive_text_splitter.ipynb
@@ -111,6 +111,53 @@
    "metadata": {},
    "outputs": [],
    "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2b74939c",
+   "metadata": {},
+   "source": [
+    "## Splitting text from languages without word boundaries\n",
+    "\n",
+    "Some writing systems do not mark [word boundaries](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries) with spaces, for example Chinese, Japanese, and Thai. Splitting text with the default separator list of `[\"\\n\\n\", \"\\n\", \" \", \"\"]` can cause words to be split between chunks. To keep words together, you can override the list of separators to include additional punctuation:\n",
+    "\n",
+    "* Add ASCII full stop \"`.`\", [Unicode fullwidth](https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)) full stop \"`．`\" (used in Chinese text), and [ideographic full stop](https://en.wikipedia.org/wiki/CJK_Symbols_and_Punctuation) \"`。`\" (used in Japanese and Chinese)\n",
+    "* Add the [zero-width space](https://en.wikipedia.org/wiki/Zero-width_space) used in Thai, Myanmar, Khmer, and Japanese.\n",
+    "* Add ASCII comma \"`,`\", Unicode fullwidth comma \"`，`\", and Unicode ideographic comma \"`、`\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6d48a8ef",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text_splitter = RecursiveCharacterTextSplitter(\n",
+    "    separators=[\n",
+    "        \"\\n\\n\",\n",
+    "        \"\\n\",\n",
+    "        \" \",\n",
+    "        \".\",\n",
+    "        \",\",\n",
+    "        \"\u200B\",  # Zero-width space\n",
+    "        \"\uff0c\",  # Fullwidth comma\n",
+    "        \"\u3001\",  # Ideographic comma\n",
+    "        \"\uff0e\",  # Fullwidth full stop\n",
+    "        \"\u3002\",  # Ideographic full stop\n",
+    "        \"\",\n",
+    "    ],\n",
+    "    # Existing args\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1177ee4f",
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {