langchain/libs/text-splitters/langchain_text_splitters
Antonio Lanza b2102b8cc4
text-splitters: Inconsistent results with NLTKTextSplitter's add_start_index=True (#27782)
This PR closes #27781

# Problem
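The current implementation of `NLTKTextSplitter` uses `sent_tokenize`.
However, `sent_tokenize` discards the characters between two tokenized
sentences, so the resulting chunks can no longer be matched against the
original text. This causes errors when `add_start_index=True` is used, as
described in issue #27781. In particular, the two inputs below differ only
in the whitespace between the first two sentences, yet they produce
identical output: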
```python
# Requires NLTK's Punkt sentence tokenizer data to be installed.
from nltk.tokenize import sent_tokenize

# Single space between the first two sentences.
output1 = sent_tokenize("Innovation drives our success. Collaboration fosters creative solutions. Efficiency enhances data management.", language="english")
print(output1)
# ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.']

# Several spaces between the first two sentences: the extra whitespace is lost.
output2 = sent_tokenize("Innovation drives our success.        Collaboration fosters creative solutions. Efficiency enhances data management.", language="english")
print(output2)
# ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.']
```

# Solution
With the new `use_span_tokenize` parameter, the splitter still uses NLTK
to find sentence boundaries (via `span_tokenize`), but it also keeps the
characters between sentences, so every chunk can still be mapped back to
the original text.
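
The idea can be illustrated with a minimal sketch. This is not the code from this PR: `naive_sentence_spans` is a hypothetical regex stand-in for a span tokenizer such as NLTK's `span_tokenize` (used here so the example runs without NLTK data), and `sentences_with_gaps` is an illustrative helper name. It assumes the tokenizer yields `(start, end)` character offsets in order.

```python
import re
from typing import Iterable, List, Tuple

Span = Tuple[int, int]


def naive_sentence_spans(text: str) -> List[Span]:
    """Hypothetical stand-in for a span tokenizer: (start, end) offsets of
    each run of text that ends in '.', '!' or '?'."""
    return [m.span() for m in re.finditer(r"[^\s.!?][^.!?]*[.!?]", text)]


def sentences_with_gaps(text: str, spans: Iterable[Span]) -> List[str]:
    """Return one piece per span, each prefixed with the characters that sit
    between the previous span and this one. Text before the first span or
    after the last span is not included in this sketch."""
    pieces: List[str] = []
    prev_end = None
    for start, end in spans:
        piece_start = start if prev_end is None else prev_end
        pieces.append(text[piece_start:end])
        prev_end = end
    return pieces


text = (
    "Innovation drives our success.        "
    "Collaboration fosters creative solutions. "
    "Efficiency enhances data management."
)
spans = naive_sentence_spans(text)

# Sentences without the gaps, re-joined with a single space, no longer
# occur in the original text, so their start index cannot be found.
stripped_joined = " ".join(text[start:end] for start, end in spans)
print(stripped_joined in text)

# Pieces that carry the gap characters concatenate back to the original,
# so each one can be located again.
pieces = sentences_with_gaps(text, spans)
offsets: List[int] = []
cursor = 0
for piece in pieces:
    index = text.find(piece, cursor)
    offsets.append(index)
    cursor = index + len(piece)
print(offsets)
```

Attaching the gap to the following sentence is one possible design choice. Attaching it to the preceding sentence would preserve the text equally well; what matters is that no characters between sentences are dropped.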

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Erick Friis <erickfriis@gmail.com>
2024-12-16 19:53:15 +00:00
xsl text-splitters[minor]: Adding a new section aware splitter to langchain (#16526) 2024-04-01 20:32:26 +00:00
__init__.py text-splitters: add pydocstyle linting (#28127) 2024-12-09 06:01:03 +00:00
base.py text-splitters: add pydocstyle linting (#28127) 2024-12-09 06:01:03 +00:00
character.py text-splitters: add pydocstyle linting (#28127) 2024-12-09 06:01:03 +00:00
html.py text-splitters: add pydocstyle linting (#28127) 2024-12-09 06:01:03 +00:00
json.py text-splitters: add pydocstyle linting (#28127) 2024-12-09 06:01:03 +00:00
konlpy.py text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346) 2024-02-29 18:33:21 -08:00
latex.py text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346) 2024-02-29 18:33:21 -08:00
markdown.py text-splitters: add pydocstyle linting (#28127) 2024-12-09 06:01:03 +00:00
nltk.py text-splitters: Inconsistent results with NLTKTextSplitter's add_start_index=True (#27782) 2024-12-16 19:53:15 +00:00
py.typed text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346) 2024-02-29 18:33:21 -08:00
python.py text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346) 2024-02-29 18:33:21 -08:00
sentence_transformers.py text-splitters: Inconsistent results with NLTKTextSplitter's add_start_index=True (#27782) 2024-12-16 19:53:15 +00:00
spacy.py text-splitters: add pydocstyle linting (#28127) 2024-12-09 06:01:03 +00:00