langchain

mirror of https://github.com/hwchase17/langchain.git synced 2025-12-15 03:46:41 +00:00

Files

Antonio Lanza b2102b8cc4 text-splitters: Inconsistent results with NLTKTextSplitter's add_start_index=True (#27782)

This PR closes #27781

# Problem
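The current implementation of `NLTKTextSplitter` uses `sent_tokenize`,
which returns only the sentences and drops the characters between two
tokenized sentences (extra whitespace, for example). Chunks rebuilt from
those sentences may no longer match the original text, which causes the
errors with `add_start_index=True` described in issue #27781. In
particular, the two inputs below differ only in the spacing after the
first sentence, yet they produce identical output: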
```python
# Needs NLTK's Punkt sentence tokenizer data to be installed.
from nltk.tokenize import sent_tokenize

# Single space between sentences.
output1 = sent_tokenize("Innovation drives our success. Collaboration fosters creative solutions. Efficiency enhances data management.", language="english")
print(output1)
# ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.']

# Several spaces after the first sentence; they do not appear in the output.
output2 = sent_tokenize("Innovation drives our success.        Collaboration fosters creative solutions. Efficiency enhances data management.", language="english")
print(output2)
# ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.']
```
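A minimal sketch of why this matters for start indices, assuming the start index is recovered by searching the original text for the chunk string. The `" "` separator and the merge step are illustrative assumptions, not the splitter's actual code; the sentences are hard-coded from the output above so the snippet runs without NLTK.

```python
# Original text with eight spaces after the first sentence.
text = (
    "Innovation drives our success."
    + " " * 8
    + "Collaboration fosters creative solutions."
)

# Sentences as sent_tokenize returns them: the run of spaces is gone.
sentences = [
    "Innovation drives our success.",
    "Collaboration fosters creative solutions.",
]

# Hypothetical merge step: join the sentences with a fixed separator.
chunk = " ".join(sentences)

# The merged chunk is not a substring of the original text,
# so a substring search cannot locate it.
start_index = text.find(chunk)
print(start_index)  # -1

# Each sentence on its own is still found at its true offset.
print(text.find(sentences[1]))  # 38
```

Once sentences are merged into a chunk, the original spacing cannot be reconstructed, so the lookup fails.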

# Solution
The new `use_span_tokenize` parameter lets the splitter use NLTK's
`span_tokenize` to find the sentences and then add back the characters
that sit between them, so the chunks can still be mapped to the
original text.
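The idea can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: `find_spans` is a hypothetical regex stand-in for NLTK's `span_tokenize` (used so the snippet runs without NLTK data), and `sentences_with_gaps` is a hypothetical helper name.

```python
import re
from typing import List, Tuple


def find_spans(text: str) -> List[Tuple[int, int]]:
    """Hypothetical stand-in for span_tokenize: (start, end) offsets per sentence."""
    return [m.span() for m in re.finditer(r"\S[^.!?]*[.!?]", text)]


def sentences_with_gaps(text: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Return each sentence prefixed with the characters since the previous one."""
    pieces = []
    prev_end = 0
    for i, (start, end) in enumerate(spans):
        # The first sentence starts at its own offset; later ones start where
        # the previous sentence ended, so the gap characters are kept.
        begin = start if i == 0 else prev_end
        pieces.append(text[begin:end])
        prev_end = end
    return pieces


text = (
    "Innovation drives our success."
    + " " * 8
    + "Collaboration fosters creative solutions."
    + " Efficiency enhances data management."
)
pieces = sentences_with_gaps(text, find_spans(text))
print(pieces)

# Every piece is an exact slice of the original, so offsets can be recovered.
for piece in pieces:
    print(text.find(piece))
```

Because each piece is a contiguous slice of the input, concatenating the pieces reproduces this text exactly, and a substring search finds every chunk.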

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Erick Friis <erickfriis@gmail.com>

2024-12-16 19:53:15 +00:00

| Name | Last commit | Date |
| --- | --- | --- |
| integration_tests | text-splitters: Inconsistent results with NLTKTextSplitter's add_start_index=True (#27782) | 2024-12-16 19:53:15 +00:00 |
| test_data | text-splitters[patch]: fix HTMLSectionSplitter parsing of xslt paths (#22176) | 2024-06-03 20:26:59 +00:00 |
| unit_tests | text-splitters[patch]: fix typing for keep_separator (#25706) | 2024-08-23 17:22:02 +00:00 |
| __init__.py | text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346) | 2024-02-29 18:33:21 -08:00 |