langchain

mirror of https://github.com/hwchase17/langchain.git synced 2025-08-04 02:33:05 +00:00

Antonio Lanza b2102b8cc4 text-splitters: Inconsistent results with `NLTKTextSplitter`'s `add_start_index=True` (#27782)	2024-12-16 19:53:15 +00:00

This PR closes #27781

# Problem

The current implementation of `NLTKTextSplitter` uses `sent_tokenize`. However, `sent_tokenize` returns only the sentences and does not keep the characters between two tokenized sentences, so the chunks can no longer be located in the original text. This causes the errors with `add_start_index=True` described in issue #27781.

In particular:

```python
from nltk.tokenize import sent_tokenize

# sent_tokenize returns the sentences only; the characters that separate
# them in the input are not part of the output.
output1 = sent_tokenize("Innovation drives our success. Collaboration fosters creative solutions. Efficiency enhances data management.", language="english")
print(output1)
output2 = sent_tokenize("Innovation drives our success. Collaboration fosters creative solutions. Efficiency enhances data management.", language="english")
print(output2)

# Output:
# ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.']
# ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.']
```

# Solution

With the new `use_span_tokenize` parameter, we can still use NLTK to create sentences (with `span_tokenize`), but also add back the extra characters between them, so that the chunks can always be mapped to the original text.

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Erick Friis <erickfriis@gmail.com>
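The idea behind the span-based fix can be illustrated without NLTK. The sketch below is not the code from the PR: `sentence_spans` is a hypothetical regex-based stand-in for a Punkt tokenizer's `span_tokenize`, and `split_stripped` / `split_keeping_gaps` are illustrative names. It only shows why keeping the characters between sentence spans lets a merged chunk be found in the original text, which is what a start-index lookup needs.

```python
import re

# Hypothetical stand-in for a sentence tokenizer that reports (start, end)
# offsets. A sentence starts at a non-space character and ends at
# terminal punctuation followed by whitespace, or at the end of the text.
_SENTENCE = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", re.DOTALL)


def sentence_spans(text):
    """Return (start, end) character offsets of each sentence."""
    return [m.span() for m in _SENTENCE.finditer(text)]


def split_stripped(text):
    """Sentences only: the characters between sentences are lost."""
    return [text[start:end] for start, end in sentence_spans(text)]


def split_keeping_gaps(text):
    """Prefix each sentence with the characters since the previous span."""
    pieces = []
    prev_end = 0
    for i, (start, end) in enumerate(sentence_spans(text)):
        begin = start if i == 0 else prev_end
        pieces.append(text[begin:end])
        prev_end = end
    return pieces


text = (
    "Innovation drives our success.    "
    "Collaboration fosters creative solutions.\n\n"
    "Efficiency enhances data management."
)

# Joining stripped sentences with a single space does not reproduce the
# original spacing, so the merged chunk cannot be located in the text.
stripped = split_stripped(text)
merged = " ".join(stripped[:2])
print(text.find(merged))  # -1

# With the gaps kept, plain concatenation reproduces the original text,
# so the merged chunk is found at its true offset.
kept = split_keeping_gaps(text)
merged_kept = "".join(kept[:2])
print(text.find(merged_kept))  # 0
```

In this sketch leading whitespace before the first sentence and trailing whitespace after the last one are still dropped, so the concatenation equals the input only when the input has neither.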
..
xsl
__init__.py	text-splitters: add pydocstyle linting (#28127)	2024-12-09 06:01:03 +00:00
base.py	text-splitters: add pydocstyle linting (#28127)	2024-12-09 06:01:03 +00:00
character.py	text-splitters: add pydocstyle linting (#28127)	2024-12-09 06:01:03 +00:00
html.py	text-splitters: add pydocstyle linting (#28127)	2024-12-09 06:01:03 +00:00
json.py	text-splitters: add pydocstyle linting (#28127)	2024-12-09 06:01:03 +00:00
konlpy.py
latex.py
markdown.py	text-splitters: add pydocstyle linting (#28127)	2024-12-09 06:01:03 +00:00
nltk.py	text-splitters: Inconsistent results with `NLTKTextSplitter`'s `add_start_index=True` (#27782)	2024-12-16 19:53:15 +00:00
py.typed
python.py
sentence_transformers.py	text-splitters: Inconsistent results with `NLTKTextSplitter`'s `add_start_index=True` (#27782)	2024-12-16 19:53:15 +00:00
spacy.py	text-splitters: add pydocstyle linting (#28127)	2024-12-09 06:01:03 +00:00