This PR closes #27781.

# Problem

The current implementation of `NLTKTextSplitter` uses `sent_tokenize`. However, `sent_tokenize` discards the characters that sit between two tokenized sentences, so the resulting chunks can no longer be mapped back to their exact positions in the original text. This causes errors when `add_start_index=True` is used, as described in issue #27781. In particular, note the double space between the first and second sentences in the first input:

```python
from nltk.tokenize import sent_tokenize

# Two spaces separate the first and second sentences here.
output1 = sent_tokenize(
    "Innovation drives our success.  Collaboration fosters creative solutions. Efficiency enhances data management.",
    language="english",
)
print(output1)

# A single space separates every sentence here.
output2 = sent_tokenize(
    "Innovation drives our success. Collaboration fosters creative solutions. Efficiency enhances data management.",
    language="english",
)
print(output2)

>>> ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.']
>>> ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.']
```

Both calls produce identical output even though the inputs differ in inter-sentence whitespace: the separating characters are simply dropped, so start indices computed from the chunks alone do not reliably match the source text.

# Solution

With the new `use_span_tokenize` parameter, the splitter uses NLTK's `span_tokenize` to create sentences instead. Because `span_tokenize` returns the start/end offsets of each sentence within the original string, the extra characters between sentences can be re-attached to the chunks, so the chunks still map exactly onto the original text (two short sketches follow at the end of this description).

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Erick Friis <erickfriis@gmail.com>
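To make the mechanism concrete, here is a minimal sketch of the idea, not the code merged in this PR: `span_tokenize` yields character offsets, and each span is extended up to the start of the next one so the inter-sentence characters stay attached to a chunk.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

text = (
    "Innovation drives our success.  "
    "Collaboration fosters creative solutions."
)

tokenizer = PunktSentenceTokenizer()
spans = list(tokenizer.span_tokenize(text))
print(spans)  # e.g. [(0, 30), (32, 73)] -- the gap 30..32 holds the two spaces

# Extend each span up to the start of the next one so the characters
# between sentences are kept with the preceding chunk.
chunks = []
for i, (start, _end) in enumerate(spans):
    next_start = spans[i + 1][0] if i + 1 < len(spans) else len(text)
    chunks.append(text[start:next_start])

# The chunks now concatenate back to the original text, so start
# indices can be recovered exactly.
assert "".join(chunks) == text
```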
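And a hypothetical usage sketch of the new parameter together with `add_start_index`; the exact constructor requirements (e.g. whether an empty `separator` is mandatory when `use_span_tokenize=True`) should be checked against the merged code.

```python
from langchain_text_splitters import NLTKTextSplitter

text = (
    "Innovation drives our success.  "
    "Collaboration fosters creative solutions. "
    "Efficiency enhances data management."
)

# separator="" is an assumption here, so that the re-attached
# inter-sentence characters are not mixed with an artificial separator.
splitter = NLTKTextSplitter(
    separator="",
    use_span_tokenize=True,
    add_start_index=True,
)

for doc in splitter.create_documents([text]):
    start = doc.metadata["start_index"]
    # Each chunk maps back onto the original text at its start index.
    assert text[start : start + len(doc.page_content)] == doc.page_content
```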