Commit Graph

55 Commits

Author SHA1 Message Date
Mason Daugherty
7a158c7f1c revert: "chore: remove ruff target-version" (#32895)
Reverts langchain-ai/langchain#32880

Not needed at the moment, will do when finishing v1
2025-09-10 20:56:48 -04:00
Christophe Bornet
b274416441 chore: remove ruff target-version (#32880)
This is not needed anymore since `requires-python` was added when moving
to `uv`.
2025-09-10 11:12:30 -04:00
Christophe Bornet
8b90eae455 chore(text-splitters): enable ruff docstring-code-format (#32854) 2025-09-08 16:40:11 -04:00
Christophe Bornet
0c3e8ccd0e chore(text-splitters): select ALL rules with exclusions (#32325)
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-09-08 14:46:09 +00:00
Mason Daugherty
6b5fdfb804 release(text-splitters): 0.3.11 (#32770)
Fixes #32747

SpaCy integration test fixture was trying to use pip to download the
SpaCy language model (`en_core_web_sm`), but uv-managed environments
don't include pip by default. Fail test if not installed as opposed to
downloading.
2025-08-31 23:00:05 +00:00
Sydney Runkle
b26e52aa4d chore(text-splitters): bump version of core (#32740) 2025-08-28 13:14:57 -04:00
Sydney Runkle
38cdd7a2ec chore(text-splitters): relax max bound for langchain-core (#32739) 2025-08-28 13:05:47 -04:00
Christophe Bornet
73a7de63aa chore(text-splitters): add mypy pydantic plugin (#32611) 2025-08-19 16:58:12 -04:00
Christophe Bornet
4656f727da chore(text-splitters): add mypy warn_unreachable (#32558) 2025-08-15 09:45:20 -04:00
Christophe Bornet
8b663ed6c6 chore(text-splitters): bump mypy version to 1.17 (#32387)
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-08-11 22:24:49 +00:00
Mason Daugherty
457ce9c4b0 feat(text-splitters): ruff fixes and rules (#32502) 2025-08-11 13:28:22 -04:00
Mason Daugherty
77c981999e fix(text-splitters): update langchain-core version to 0.3.72 2025-07-24 10:35:07 -04:00
Christophe Bornet
060fc0e3c9 text-splitters: Add ruff rules FBT (#31935)
See [flake8-boolean-trap
(FBT)](https://docs.astral.sh/ruff/rules/#flake8-boolean-trap-fbt)
2025-07-09 18:36:58 -04:00
Christophe Bornet
a15c3e0856 text-splitters: Bump ruff version to 0.12 (#31866) 2025-07-05 17:13:08 -04:00
Christophe Bornet
ee3709535d text-splitters: bump spacy version to 3.8.7 (#31834)
This allows to use spacy with Python 3.13
2025-07-03 10:13:25 -04:00
Christophe Bornet
802d2bf249 text-splitters: Add ruff rule UP (pyupgrade) (#31841)
See https://docs.astral.sh/ruff/rules/#pyupgrade-up
All auto-fixed except `typing.AbstractSet` -> `collections.abc.Set`
2025-07-03 10:11:35 -04:00
Eugene Yurtsev
10ec5c8f02 text-splitters: 0.3.9 (#31844)
Release langchain-text-splitters 0.3.9
2025-07-03 10:02:35 -04:00
Christophe Bornet
eab8484a80 text-splitters[patch]: fix some import-untyped errors (#31030) 2025-05-15 11:34:22 -04:00
Sydney Runkle
7e926520d5 packaging: remove Python upper bound for langchain and co libs (#31025)
Follow up to https://github.com/langchain-ai/langsmith-sdk/pull/1696,
I've bumped the `langsmith` version where applicable in `uv.lock`.

Type checking problems here because deps have been updated in
`pyproject.toml` and `uv lock` hasn't been run - we should enforce that
in the future - goes with the other dependabot todos :).
2025-04-28 14:44:28 -04:00
Christophe Bornet
8c5ae108dd text-splitters: Set strict mypy rules (#30900)
* Add strict mypy rules
* Fix mypy violations
* Add error codes to all type ignores
* Add ruff rule PGH003
* Bump mypy version to 1.15
2025-04-22 20:41:24 -07:00
ccurme
0c2c8c36c1 text-splitters: release 0.3.8 (#30671) 2025-04-04 09:58:45 -04:00
ccurme
958f85d541 text-splitters: release 0.3.7 (#30347) 2025-03-18 19:11:37 +00:00
Erick Friis
1a225fad03 multiple: fix uv path deps (#29790)
file:// format wasn't working with updates - it doesn't install as an
editable dep

move to tool.uv.sources with path= instead
2025-02-13 21:32:34 +00:00
ccurme
e1b593ae77 text-splitters[patch]: release 0.3.6 (#29647) 2025-02-06 16:16:05 -05:00
ccurme
a91e58bc10 core: release 0.3.34 (#29644) 2025-02-06 15:53:56 -05:00
ccurme
d172984c91 infra: migrate to uv (#29566) 2025-02-06 13:36:26 -05:00
Christophe Bornet
836c791829 text-splitters: Bump ruff version to 0.9 (#29231)
Co-authored-by: Erick Friis <erick@langchain.dev>
2025-01-22 00:27:58 +00:00
ccurme
55677e31f7 text-splitters[patch]: release 0.3.5 (#29054)
Resolves https://github.com/langchain-ai/langchain/issues/29053
2025-01-07 09:48:26 -05:00
Bagatur
1c797ac68f infra: speed up unit tests (#28974)
Co-authored-by: Erick Friis <erick@langchain.dev>
2025-01-02 04:13:08 +00:00
Erick Friis
9b024d00c9 text-splitters: release 0.3.4 (#28795) 2024-12-18 09:44:36 -08:00
Antonio Lanza
b2102b8cc4 text-splitters: Inconsistent results with NLTKTextSplitter's add_start_index=True (#27782)
This PR closes #27781

# Problem
The current implementation of `NLTKTextSplitter` is using
`sent_tokenize`. However, this `sent_tokenize` doesn't handle chars
between 2 tokenized sentences... hence, this behavior throws errors when
we are using `add_start_index=True`, as described in issue #27781. In
particular:
```python
from nltk.tokenize import sent_tokenize

output1 = sent_tokenize("Innovation drives our success. Collaboration fosters creative solutions. Efficiency enhances data management.", language="english")
print(output1)
output2 = sent_tokenize("Innovation drives our success.        Collaboration fosters creative solutions. Efficiency enhances data management.", language="english")
print(output2)
>>> ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.']
>>> ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.']
```

# Solution
With this new `use_span_tokenize` parameter, we can use NLTK to create
sentences (with `span_tokenize`), but also add extra chars to be sure
that we still can map the chunks to the original text.

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Erick Friis <erickfriis@gmail.com>
2024-12-16 19:53:15 +00:00
Erick Friis
8ec1c72e03 text-splitters: test without socket (#28732) 2024-12-15 22:10:35 +00:00
Bagatur
679e3a9970 text-splitters[patch]: Release 0.3.3 (#28723) 2024-12-14 19:20:22 +00:00
Ankit Dangi
90f162efb6 text-splitters: add pydocstyle linting (#28127)
As seen in #23188, turned on Google-style docstrings by enabling
`pydocstyle` linting in the `text-splitters` package. Each resulting
linting error was addressed differently: ignored, resolved, suppressed,
and missing docstrings were added.

Fixes one of the checklist items from #25154, similar to #25939 in
`core` package. Ran `make format`, `make lint` and `make test` from the
root of the package `text-splitters` to ensure no issues were found.

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
2024-12-09 06:01:03 +00:00
Bagatur
ee63d21915 many: use core 0.3.15 (#27834) 2024-11-01 20:35:55 +00:00
Bagatur
8f4423e042 text-splitters[patch]: Release 0.3.1 (#27726) 2024-10-30 00:04:48 +00:00
Erick Friis
600b7bdd61 all: test 3.13 ci (#27197)
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
2024-10-25 12:56:58 -07:00
Erick Friis
92ae61bcc8 multiple: rely on asyncio_mode auto in tests (#27200) 2024-10-15 16:26:38 +00:00
ccurme
d1462badaf text-splitters: release 0.3 (#26460)
Co-authored-by: Erick Friis <erick@langchain.dev>
2024-09-13 22:31:06 +00:00
Erick Friis
c2a3021bb0 multiple: pydantic 2 compatibility, v0.3 (#26443)
Signed-off-by: ChengZi <chen.zhang@zilliz.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Dan O'Donovan <dan.odonovan@gmail.com>
Co-authored-by: Tom Daniel Grande <tomdgrande@gmail.com>
Co-authored-by: Grande <Tom.Daniel.Grande@statsbygg.no>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: ccurme <chester.curme@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Co-authored-by: Tomaz Bratanic <bratanic.tomaz@gmail.com>
Co-authored-by: ZhangShenao <15201440436@163.com>
Co-authored-by: Friso H. Kingma <fhkingma@gmail.com>
Co-authored-by: ChengZi <chen.zhang@zilliz.com>
Co-authored-by: Nuno Campos <nuno@langchain.dev>
Co-authored-by: Morgante Pell <morgantep@google.com>
2024-09-13 14:38:45 -07:00
Bagatur
fb642e1e27 text-splitters[patch]: Release 0.2.4 (#25979) 2024-09-03 18:09:43 +00:00
Bagatur
dd8e4cd020 text-splitters[patch]: Release 0.2.3 (#24998) 2024-08-02 20:27:22 +00:00
Erick Friis
3dce2e1d35 all: add release notes to pypi (#24519) 2024-07-22 13:59:13 -07:00
Bagatur
a0c2281540 infra: update mypy 1.10, ruff 0.5 (#23721)
```python
"""python scripts/update_mypy_ruff.py"""
import glob
import tomllib
from pathlib import Path

import toml
import subprocess
import re

ROOT_DIR = Path(__file__).parents[1]


def main():
    for path in glob.glob(str(ROOT_DIR / "libs/**/pyproject.toml"), recursive=True):
        print(path)
        with open(path, "rb") as f:
            pyproject = tomllib.load(f)
        try:
            pyproject["tool"]["poetry"]["group"]["typing"]["dependencies"]["mypy"] = (
                "^1.10"
            )
            pyproject["tool"]["poetry"]["group"]["lint"]["dependencies"]["ruff"] = (
                "^0.5"
            )
        except KeyError:
            continue
        with open(path, "w") as f:
            toml.dump(pyproject, f)
        cwd = "/".join(path.split("/")[:-1])
        completed = subprocess.run(
            "poetry lock --no-update; poetry install --with typing; poetry run mypy . --no-color",
            cwd=cwd,
            shell=True,
            capture_output=True,
            text=True,
        )
        logs = completed.stdout.split("\n")

        to_ignore = {}
        for l in logs:
            if re.match("^(.*)\:(\d+)\: error:.*\[(.*)\]", l):
                path, line_no, error_type = re.match(
                    "^(.*)\:(\d+)\: error:.*\[(.*)\]", l
                ).groups()
                if (path, line_no) in to_ignore:
                    to_ignore[(path, line_no)].append(error_type)
                else:
                    to_ignore[(path, line_no)] = [error_type]
        print(len(to_ignore))
        for (error_path, line_no), error_types in to_ignore.items():
            all_errors = ", ".join(error_types)
            full_path = f"{cwd}/{error_path}"
            try:
                with open(full_path, "r") as f:
                    file_lines = f.readlines()
            except FileNotFoundError:
                continue
            file_lines[int(line_no) - 1] = (
                file_lines[int(line_no) - 1][:-1] + f"  # type: ignore[{all_errors}]\n"
            )
            with open(full_path, "w") as f:
                f.write("".join(file_lines))

        subprocess.run(
            "poetry run ruff format .; poetry run ruff --select I --fix .",
            cwd=cwd,
            shell=True,
            capture_output=True,
            text=True,
        )


if __name__ == "__main__":
    main()

```
2024-07-03 10:33:27 -07:00
ccurme
6f7fe82830 text-splitters: release 0.2.2 (#23508) 2024-06-25 18:26:05 -04:00
Erick Friis
a24a9c6427 multiple: get rid of pyproject extras (#22581)
They cause `poetry lock` to take a ton of time, and `uv pip install` can
resolve the constraints from these toml files in trivial time
(addressing problem with #19153)

This allows us to properly upgrade lockfile dependencies moving forward,
which revealed some issues that were either fixed or type-ignored (see
file comments)
2024-06-06 15:45:22 -07:00
Bagatur
99a3cad258 text-splitters[patch]: Release 0.2.1 (#22490) 2024-06-04 11:19:21 -07:00
ccurme
0e72ed39a0 infra: fix CI on text-splitters (#21935) 2024-05-20 14:03:42 -07:00
Erick Friis
1b555021f7 text-splitters: release 0.2.0 (#21832) 2024-05-17 13:30:54 -07:00
Erick Friis
c77d2f2b06 multiple: core 0.2 nonbreaking dep, check_diff community->langchain dep (#21646)
0.2 is not a breaking release for core (but it is for langchain and
community)

To keep the core+langchain+community packages in sync at 0.2, we will
relax deps throughout the ecosystem to tolerate `langchain-core` 0.2
2024-05-13 19:50:36 -07:00