Anthony Shaw aa0523b1ee docs: Add guidance for splitting Chinese, Japanese, and Thai (#19295)
The existing default list of separators for the `RecursiveCharacterTextSplitter`
assumes that spaces mark word boundaries. Some languages (Chinese, Japanese,
Thai, Burmese) [don't use spaces between words](https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries).

This PR extends the documentation to explain how to cater for those
languages by adding extra punctuation to the separators, along with
zero-width spaces, which some typesetters insert between words and which
help the splitter avoid splitting inside a word.
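
As a rough illustration of the idea above, the sketch below defines an extended separator list and a simplified one-level splitter. The constant name `CJK_AWARE_SEPARATORS` and both helper functions are hypothetical and exist only for this example. They are not part of LangChain, and the real splitter also recurses and enforces chunk sizes.

```python
# Hypothetical constant: the separators a space-based default would miss.
CJK_AWARE_SEPARATORS = [
    "\n\n",
    "\n",
    " ",
    ".",
    ",",
    "\u200b",  # zero-width space
    "\uff0c",  # fullwidth comma
    "\u3001",  # ideographic comma
    "\uff0e",  # fullwidth full stop
    "\u3002",  # ideographic full stop
    "",        # last resort: split between any two characters
]


def first_matching_separator(text, separators):
    """Return the first separator, in priority order, that occurs in text."""
    for sep in separators:
        if sep == "" or sep in text:
            return sep
    return ""


def split_once(text, separators):
    """Split text on the highest-priority separator found, dropping empty pieces."""
    sep = first_matching_separator(text, separators)
    pieces = list(text) if sep == "" else text.split(sep)
    return [p for p in pieces if p]


sample = "今日は晴れです。明日は雨です。"

# With space-based separators only, the text falls through to per-character splits.
print(split_once(sample, ["\n\n", "\n", " ", ""]))

# With the extended list, the split lands on the ideographic full stop.
print(split_once(sample, CJK_AWARE_SEPARATORS))
```

Assuming the `langchain_text_splitters` package is installed, the same list could be passed to the real splitter as `RecursiveCharacterTextSplitter(separators=CJK_AWARE_SEPARATORS)`.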

Ideally, **these separators could be a constant in the module**, but for
now, defining them in the documentation is a start.
2024-04-25 17:39:31 -07:00

LangChain Documentation

For more information on contributing to our documentation, see the Documentation Contributing Guide.