langchain/libs/text-splitters
Antonio Lanza b2102b8cc4
text-splitters: Inconsistent results with NLTKTextSplitter's add_start_index=True (#27782)
This PR closes #27781

# Problem
The current implementation of `NLTKTextSplitter` is using
`sent_tokenize`. However, this `sent_tokenize` doesn't handle chars
between 2 tokenized sentences... hence, this behavior throws errors when
we are using `add_start_index=True`, as described in issue #27781. In
particular:
```python
from nltk.tokenize import sent_tokenize

output1 = sent_tokenize("Innovation drives our success. Collaboration fosters creative solutions. Efficiency enhances data management.", language="english")
print(output1)
output2 = sent_tokenize("Innovation drives our success.        Collaboration fosters creative solutions. Efficiency enhances data management.", language="english")
print(output2)
>>> ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.']
>>> ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.']
```

# Solution
With this new `use_span_tokenize` parameter, we can use NLTK to create
sentences (with `span_tokenize`), but also add extra chars to be sure
that we still can map the chunks to the original text.

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Erick Friis <erickfriis@gmail.com>
2024-12-16 19:53:15 +00:00
..
langchain_text_splitters text-splitters: Inconsistent results with NLTKTextSplitter's add_start_index=True (#27782) 2024-12-16 19:53:15 +00:00
scripts multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
tests text-splitters: Inconsistent results with NLTKTextSplitter's add_start_index=True (#27782) 2024-12-16 19:53:15 +00:00
extended_testing_deps.txt multiple: get rid of pyproject extras (#22581) 2024-06-06 15:45:22 -07:00
Makefile text-splitters: Inconsistent results with NLTKTextSplitter's add_start_index=True (#27782) 2024-12-16 19:53:15 +00:00
poetry.lock text-splitters: Inconsistent results with NLTKTextSplitter's add_start_index=True (#27782) 2024-12-16 19:53:15 +00:00
pyproject.toml text-splitters: Inconsistent results with NLTKTextSplitter's add_start_index=True (#27782) 2024-12-16 19:53:15 +00:00
README.md docs: more api ref links, add linting step to prevent more (#28495) 2024-12-04 04:19:42 +00:00

🦜✂️ LangChain Text Splitters

Downloads License: MIT

Quick Install

pip install langchain-text-splitters

What is it?

LangChain Text Splitters contains utilities for splitting into chunks a wide variety of text documents.

For full documentation see the API reference and the Text Splitters module in the main docs.

📕 Releases & Versioning

langchain-text-splitters is currently on version 0.0.x.

Minor version increases will occur for:

  • Breaking changes for any public interfaces NOT marked beta

Patch version increases will occur for:

  • Bug fixes
  • New features
  • Any changes to private interfaces
  • Any changes to beta features

💁 Contributing

As an open-source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.

For detailed information on how to contribute, see the Contributing Guide.