mirror of
https://github.com/hwchase17/langchain.git
synced 2025-06-22 14:49:29 +00:00
docs: Add guidance for splitting Chinese, Japanese, and Thai (#19295)
The default separator list for the `RecursiveCharacterTextSplitter` assumes that spaces mark word boundaries. Some languages [don't use spaces between words](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries) (Chinese, Japanese, Thai, Burmese). This PR extends the documentation to explain how to cater for those languages by adding extra punctuation to the separators, along with zero-width spaces, which some typesetters insert and which help the splitter avoid breaking inside words. Ideally, **these separators could be a constant in the module**, but for now, defining them in the documentation is a start.
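As a rough illustration of why the extra separators matter, the sketch below mimics only the separator-selection step in plain Python: it walks the list in order and picks the first separator that occurs in the text. The constant name `CJK_THAI_SEPARATORS` and the helper `first_matching_separator` are hypothetical names introduced here, not part of LangChain, and the helper is a simplified stand-in rather than the library's implementation.

```python
# Default list quoted in the notebook text of this commit.
DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]

# Hypothetical constant name; the values are the ones from the notebook cell.
CJK_THAI_SEPARATORS = [
    "\n\n",
    "\n",
    " ",
    ".",
    ",",
    "\u200b",  # Zero-width space
    "\uff0c",  # Fullwidth comma
    "\u3001",  # Ideographic comma
    "\uff0e",  # Fullwidth full stop
    "\u3002",  # Ideographic full stop
    "",
]


def first_matching_separator(text, separators):
    """Return the first separator found in text (simplified stand-in)."""
    for sep in separators:
        # The empty string always matches and means "split between characters".
        if sep == "" or sep in text:
            return sep
    return ""


# Japanese sample with no spaces, newlines, or ASCII punctuation.
sample = "今日は晴れです。明日は雨です。"

default_choice = first_matching_separator(sample, DEFAULT_SEPARATORS)
extended_choice = first_matching_separator(sample, CJK_THAI_SEPARATORS)
```

With the default list, the only entry that matches this sample is the empty string, so splitting falls back to arbitrary character positions. With the extended list, the ideographic full stop is available as a split point instead.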
parent 441a8012b3
commit 6c9b0f96f3
@@ -111,6 +111,53 @@
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "2b74939c",
"metadata": {},
"source": [
"## Splitting text from languages without word boundaries\n",
"\n",
"Some writing systems do not have [word boundaries](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries), for example Chinese, Japanese, and Thai. Splitting text with the default separator list of `[\"\\n\\n\", \"\\n\", \" \", \"\"]` can cause words to be split between chunks. To keep words together, you can override the list of separators to include additional punctuation:\n",
"\n",
"* Add ASCII full stop \"`.`\", [Unicode fullwidth](https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)) full stop \"`.`\" (used in Chinese text), and [ideographic full stop](https://en.wikipedia.org/wiki/CJK_Symbols_and_Punctuation) \"`。`\" (used in Japanese and Chinese)\n",
"* Add [zero-width space](https://en.wikipedia.org/wiki/Zero-width_space), used in Thai, Myanmar, Khmer, and Japanese.\n",
"* Add ASCII comma \"`,`\", Unicode fullwidth comma \"`,`\", and Unicode ideographic comma \"`、`\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d48a8ef",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = RecursiveCharacterTextSplitter(\n",
" separators=[\n",
" \"\\n\\n\",\n",
" \"\\n\",\n",
" \" \",\n",
" \".\",\n",
" \",\",\n",
" \"\\u200B\", # Zero-width space\n",
" \"\\uff0c\", # Fullwidth comma\n",
" \"\\u3001\", # Ideographic comma\n",
" \"\\uff0e\", # Fullwidth full stop\n",
" \"\\u3002\", # Ideographic full stop\n",
" \"\",\n",
" ],\n",
" # Existing args\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1177ee4f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {