langchain

mirror of https://github.com/hwchase17/langchain.git synced 2025-07-13 16:36:06 +00:00

Author	SHA1	Message	Date
Antonio Lanza	b2102b8cc4	text-splitters: Inconsistent results with `NLTKTextSplitter`'s `add_start_index=True` (#27782)

This PR closes #27781

# Problem

The current implementation of `NLTKTextSplitter` is using `sent_tokenize`. However, this `sent_tokenize` doesn't handle chars between 2 tokenized sentences... hence, this behavior throws errors when we are using `add_start_index=True`, as described in issue #27781. In particular:

```python
from nltk.tokenize import sent_tokenize

output1 = sent_tokenize("Innovation drives our success. Collaboration fosters creative solutions. Efficiency enhances data management.", language="english")
print(output1)
output2 = sent_tokenize("Innovation drives our success. Collaboration fosters creative solutions. Efficiency enhances data management.", language="english")
print(output2)

>>> ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.']
>>> ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.']
```

# Solution

With this new `use_span_tokenize` parameter, we can use NLTK to create sentences (with `span_tokenize`), but also add extra chars to be sure that we still can map the chunks to the original text.

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Erick Friis <erickfriis@gmail.com>	2024-12-16 19:53:15 +00:00
Matthew DeGenaro	66828f4ecc	text-splitters[patch]: Modified SpacyTextSplitter to fully keep whitespace when strip_whitespace is false (#23272) Previously, regardless of whether strip_whitespace was set to true or false, the strip text method in the SpacyTextSplitter class used `sent.text` to get the sentence. I modified this to include a ternary such that if strip_whitespace is false, it uses `sent.text_with_ws`. I also modified the project.toml to include the spacy pipeline package and to lock the numpy version, as higher versions break spacy. - Issue: N/A - Dependencies: None	2024-09-02 21:15:56 +00:00
Guangdong Liu	e5d7e455dc	splitters: Add ensure_ascii parameter (#18485) - Description: Add ensure_ascii parameter	2024-03-19 12:51:16 -07:00
Bagatur	5efb5c099f	text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346)	2024-02-29 18:33:21 -08:00

4 Commits
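
The problem described in commit b2102b8cc4 can be illustrated without NLTK. The sketch below is not the `NLTKTextSplitter` implementation and does not call `sent_tokenize` or `span_tokenize`; it uses a deliberately naive regex sentence rule, and the helper names `sentence_spans`, `split_dropping_gaps`, and `split_keeping_gaps` are hypothetical. It only shows why sentences returned without the characters between them cannot always be located in the original text, and how working from character spans avoids that.

```python
import re
from typing import List, Tuple


def sentence_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of sentences, excluding the gaps between them.

    Naive stand-in for a real sentence tokenizer: a sentence starts at a
    non-whitespace character and runs through the next '.', '!' or '?'.
    """
    return [m.span() for m in re.finditer(r"\S[^.!?]*[.!?]*", text)]


def split_dropping_gaps(text: str) -> List[str]:
    """Sentences only; whitespace between sentences is lost."""
    return [text[start:end] for start, end in sentence_spans(text)]


def split_keeping_gaps(text: str) -> List[str]:
    """Sentences with the following gap attached, so the pieces rebuild the text."""
    spans = sentence_spans(text)
    if not spans:
        return [text] if text else []
    pieces = []
    for i, (start, _end) in enumerate(spans):
        begin = 0 if i == 0 else start  # keep any leading whitespace
        # Extend each piece up to where the next sentence starts.
        next_start = spans[i + 1][0] if i + 1 < len(spans) else len(text)
        pieces.append(text[begin:next_start])
    return pieces


if __name__ == "__main__":
    text = (
        "Innovation drives our success.   Collaboration fosters creative "
        "solutions.\n\nEfficiency enhances data management."
    )
    dropped = split_dropping_gaps(text)
    # Re-joining with a single space no longer matches the source text.
    print(text.find(" ".join(dropped[:2])))
    kept = split_keeping_gaps(text)
    # Pieces that keep their gaps can be located in the source text.
    print(text.find(kept[0] + kept[1]))
```

Because the gap-preserving pieces concatenate back to the original string, a chunk built from consecutive pieces is always a substring of the source, which is the property a start-index lookup relies on.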