langchain

mirror of https://github.com/hwchase17/langchain.git synced 2025-10-10 20:20:00 +00:00

Author	SHA1	Message	Date
ccurme	55677e31f7	text-splitters[patch]: release 0.3.5 (#29054 ) Resolves https://github.com/langchain-ai/langchain/issues/29053	2025-01-07 09:48:26 -05:00
Bagatur	1c797ac68f	infra: speed up unit tests (#28974 ) Co-authored-by: Erick Friis <erick@langchain.dev>	2025-01-02 04:13:08 +00:00
Luke	f69695069d	text_splitters: Add HTMLSemanticPreservingSplitter (#25911 ) Description: With current HTML splitters, they rely on secondary use of the `RecursiveCharacterSplitter` to further chunk the document into manageable chunks. The issue with this is it fails to maintain important structures such as tables, lists, etc within HTML. This Implementation of a HTML splitter, allows the user to define a maximum chunk size, HTML elements to preserve in full, options to preserve `<a>` href links in the output and custom handlers. The core splitting begins with headers, similar to `HTMLHeaderSplitter`. If these sections exceed the length of the `max_chunk_size` further recursive splitting is triggered. During this splitting, elements listed to preserve, will be excluded from the splitting process. This can cause chunks to be slightly larger then the max size, depending on preserved length. However, all contextual relevance of the preserved item remains intact. Custom Handlers: Sometimes, companies such as Atlassian have custom HTML elements, that are not parsed by default with `BeautifulSoup`. Custom handlers allows a user to provide a function to be ran whenever a specific html tag is encountered. This allows the user to preserve and gather information within custom html tags that `bs4` will potentially miss during extraction. Dependencies: User will need to install `bs4` in their project to utilise this class I have also added in `how_to` and unit tests, which require `bs4` to run, otherwise they will be skipped. Flowchart of process: ![HTMLSemanticPreservingSplitter](https://github.com/user-attachments/assets/20873c36-22ed-4c80-884b-d3c6f433f5a7) --------- Co-authored-by: Bagatur <baskaryan@gmail.com> Co-authored-by: Chester Curme <chester.curme@gmail.com>	2024-12-19 12:09:22 -05:00
Yuxin Chen	3256b5d6ae	text-splitters: fix state persistence issue in ExperimentalMarkdownSyntaxTextSplitter (#28373 ) - Description: This PR resolves an issue with the `ExperimentalMarkdownSyntaxTextSplitter` class, which retains the internal state across multiple calls to the `split_text` method. This behaviour caused an unintended accumulation of chunks in `self` variables, leading to incorrect outputs when processing multiple Markdown files sequentially. - Modified `libs\text-splitters\langchain_text_splitters\markdown.py` to reset the relevant internal attributes at the start of each `split_text` invocation. This ensures each call processes the input independently. - Added unit tests in `libs\text-splitters\tests\unit_tests\test_text_splitters.py` to verify the fix and ensure the state does not persist across calls. - Issue: Fixes [#26440](https://github.com/langchain-ai/langchain/issues/26440). - Dependencies: No additional dependencies are introduced with this change. - [x] Unit tests were added to verify the changes. - [x] Updated documentation where necessary. - [x] Ran `make format`, `make lint`, and `make test` to ensure compliance with project standards. --------- Co-authored-by: Angel Chen <angelchen396@gmail.com> Co-authored-by: Chester Curme <chester.curme@gmail.com>	2024-12-18 20:27:59 +00:00
Erick Friis	9b024d00c9	text-splitters: release 0.3.4 (#28795 )	2024-12-18 09:44:36 -08:00
Antonio Lanza	b2102b8cc4	text-splitters: Inconsistent results with `NLTKTextSplitter`'s `add_start_index=True` (#27782 ) This PR closes #27781 # Problem The current implementation of `NLTKTextSplitter` is using `sent_tokenize`. However, this `sent_tokenize` doesn't handle chars between 2 tokenized sentences... hence, this behavior throws errors when we are using `add_start_index=True`, as described in issue #27781. In particular: ```python from nltk.tokenize import sent_tokenize output1 = sent_tokenize("Innovation drives our success. Collaboration fosters creative solutions. Efficiency enhances data management.", language="english") print(output1) output2 = sent_tokenize("Innovation drives our success. Collaboration fosters creative solutions. Efficiency enhances data management.", language="english") print(output2) >>> ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.'] >>> ['Innovation drives our success.', 'Collaboration fosters creative solutions.', 'Efficiency enhances data management.'] ``` # Solution With this new `use_span_tokenize` parameter, we can use NLTK to create sentences (with `span_tokenize`), but also add extra chars to be sure that we still can map the chunks to the original text. --------- Co-authored-by: Erick Friis <erick@langchain.dev> Co-authored-by: Erick Friis <erickfriis@gmail.com>	2024-12-16 19:53:15 +00:00
Erick Friis	4f6ccb7080	text-splitters: extended-tests without socket (#28736 )	2024-12-16 05:19:50 +00:00
Erick Friis	8ec1c72e03	text-splitters: test without socket (#28732 )	2024-12-15 22:10:35 +00:00
Bagatur	679e3a9970	text-splitters[patch]: Release 0.3.3 (#28723 )	2024-12-14 19:20:22 +00:00
Ankit Dangi	90f162efb6	text-splitters: add pydocstyle linting (#28127 ) As seen in #23188, turned on Google-style docstrings by enabling `pydocstyle` linting in the `text-splitters` package. Each resulting linting error was addressed differently: ignored, resolved, suppressed, and missing docstrings were added. Fixes one of the checklist items from #25154, similar to #25939 in `core` package. Ran `make format`, `make lint` and `make test` from the root of the package `text-splitters` to ensure no issues were found. --------- Co-authored-by: Erick Friis <erick@langchain.dev>	2024-12-09 06:01:03 +00:00
Erick Friis	e6a08355a3	docs: more api ref links, add linting step to prevent more (#28495 )	2024-12-04 04:19:42 +00:00
Bagatur	ee63d21915	many: use core 0.3.15 (#27834 )	2024-11-01 20:35:55 +00:00
Bagatur	8f4423e042	text-splitters[patch]: Release 0.3.1 (#27726 )	2024-10-30 00:04:48 +00:00
Erick Friis	600b7bdd61	all: test 3.13 ci (#27197 ) Co-authored-by: Bagatur <baskaryan@gmail.com> Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>	2024-10-25 12:56:58 -07:00
ZhangShenao	455ab7d714	Improvement[Community] Improve Document Loaders and Splitters (#27568 ) - Fix word spelling error - Add static method decorator - Fix language splitter Co-authored-by: Erick Friis <erick@langchain.dev>	2024-10-24 21:42:16 +00:00
Erick Friis	92ae61bcc8	multiple: rely on asyncio_mode auto in tests (#27200 )	2024-10-15 16:26:38 +00:00
ZhangShenao	baef7639fd	Improvement[text-splitter] Fix import of `ExperimentalMarkdownSyntaxTextSplitter` (#26703 ) #26028 Export `ExperimentalMarkdownSyntaxTextSplitter` in __init__ Co-authored-by: Erick Friis <erick@langchain.dev>	2024-09-20 17:04:31 +00:00
ccurme	d1462badaf	text-splitters: release 0.3 (#26460 ) Co-authored-by: Erick Friis <erick@langchain.dev>	2024-09-13 22:31:06 +00:00
Erick Friis	c2a3021bb0	multiple: pydantic 2 compatibility, v0.3 (#26443 ) Signed-off-by: ChengZi <chen.zhang@zilliz.com> Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com> Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com> Co-authored-by: Dan O'Donovan <dan.odonovan@gmail.com> Co-authored-by: Tom Daniel Grande <tomdgrande@gmail.com> Co-authored-by: Grande <Tom.Daniel.Grande@statsbygg.no> Co-authored-by: Bagatur <baskaryan@gmail.com> Co-authored-by: ccurme <chester.curme@gmail.com> Co-authored-by: Harrison Chase <hw.chase.17@gmail.com> Co-authored-by: Tomaz Bratanic <bratanic.tomaz@gmail.com> Co-authored-by: ZhangShenao <15201440436@163.com> Co-authored-by: Friso H. Kingma <fhkingma@gmail.com> Co-authored-by: ChengZi <chen.zhang@zilliz.com> Co-authored-by: Nuno Campos <nuno@langchain.dev> Co-authored-by: Morgante Pell <morgantep@google.com>	2024-09-13 14:38:45 -07:00
Bagatur	fb642e1e27	text-splitters[patch]: Release 0.2.4 (#25979 )	2024-09-03 18:09:43 +00:00
Matthew DeGenaro	66828f4ecc	text-splitters[patch]: Modified SpacyTextSplitter to fully keep whitespace when strip_whitespace is false (#23272 ) Previously, regardless of whether or not strip_whitespace was set to true or false, the strip text method in the SpacyTextSplitter class used `sent.text` to get the sentence. I modified this to include a ternary such that if strip_whitespace is false, it uses `sent.text_with_ws` I also modified the project.toml to include the spacy pipeline package and to lock the numpy version, as higher versions break spacy. - Issue: N/a - Dependencies: None	2024-09-02 21:15:56 +00:00
Christophe Bornet	038c287b3a	all: Improve make lint command (#25344 ) * Removed `ruff check --select I` as `I` is already selected and checked in the main `ruff check` command * Added checks for non-empty `PYTHON_FILES` * Run `ruff check` only on `PYTHON_FILES` Co-authored-by: Erick Friis <erick@langchain.dev>	2024-08-23 18:23:52 -07:00
ccurme	bc557a5663	text-splitters[patch]: fix typing for `keep_separator` (#25706 )	2024-08-23 17:22:02 +00:00
Bagatur	dd8e4cd020	text-splitters[patch]: Release 0.2.3 (#24998 )	2024-08-02 20:27:22 +00:00
Tamir Zitman	b3e1378f2b	langchain : text_splitters Added PowerShell (#24582 ) - Description: Added PowerShell support for text splitters language include docs relevant update - Issue: None - Dependencies: None --------- Co-authored-by: tzitman <tamir.zitman@intel.com> Co-authored-by: Chester Curme <chester.curme@gmail.com>	2024-07-30 16:13:52 +00:00
Erick Friis	3dce2e1d35	all: add release notes to pypi (#24519 )	2024-07-22 13:59:13 -07:00
blueoom	d895614d19	text_splitters: add request parameters for function HTMLHeaderTextSplitter.split_text… (#24178 ) Description: The `split_text_from_url` method of `HTMLHeaderTextSplitter` does not include parameters like `timeout` when using `requests` to send a request. Therefore, I suggest adding a `kwargs` parameter to the function, which can be passed as arguments to `requests.get()` internally, allowing control over the `get` request. --------- Co-authored-by: Chester Curme <chester.curme@gmail.com>	2024-07-15 16:43:56 +00:00
Harold Martin	ccdaf14eff	docs: Spell check fixes (#24217 ) Description: Spell check fixes for docs, comments, and a couple of strings. No code change e.g. variable names. Issue: none Dependencies: none Twitter handle: hmartin	2024-07-15 15:51:43 +00:00
Harley Gross	ea3cd1ebba	community[minor]: added support for C in RecursiveCharacterTextSplitter (#24091 ) Description: Added support for C in RecursiveCharacterTextSplitter by reusing the separators for C++	2024-07-11 16:47:48 +00:00
Bagatur	a0c2281540	infra: update mypy 1.10, ruff 0.5 (#23721 ) ```python """python scripts/update_mypy_ruff.py""" import glob import tomllib from pathlib import Path import toml import subprocess import re ROOT_DIR = Path(__file__).parents[1] def main(): for path in glob.glob(str(ROOT_DIR / "libs/*/pyproject.toml"), recursive=True): print(path) with open(path, "rb") as f: pyproject = tomllib.load(f) try: pyproject["tool"]["poetry"]["group"]["typing"]["dependencies"]["mypy"] = ( "^1.10" ) pyproject["tool"]["poetry"]["group"]["lint"]["dependencies"]["ruff"] = ( "^0.5" ) except KeyError: continue with open(path, "w") as f: toml.dump(pyproject, f) cwd = "/".join(path.split("/")[:-1]) completed = subprocess.run( "poetry lock --no-update; poetry install --with typing; poetry run mypy . --no-color", cwd=cwd, shell=True, capture_output=True, text=True, ) logs = completed.stdout.split("\n") to_ignore = {} for l in logs: if re.match("^(.)\:(\d+)\: error:.\[(.)\]", l): path, line_no, error_type = re.match( "^(.)\:(\d+)\: error:.\[(.*)\]", l ).groups() if (path, line_no) in to_ignore: to_ignore[(path, line_no)].append(error_type) else: to_ignore[(path, line_no)] = [error_type] print(len(to_ignore)) for (error_path, line_no), error_types in to_ignore.items(): all_errors = ", ".join(error_types) full_path = f"{cwd}/{error_path}" try: with open(full_path, "r") as f: file_lines = f.readlines() except FileNotFoundError: continue file_lines[int(line_no) - 1] = ( file_lines[int(line_no) - 1][:-1] + f" # type: ignore[{all_errors}]\n" ) with open(full_path, "w") as f: f.write("".join(file_lines)) subprocess.run( "poetry run ruff format .; poetry run ruff --select I --fix .", cwd=cwd, shell=True, capture_output=True, text=True, ) if __name__ == "__main__": main() ```	2024-07-03 10:33:27 -07:00
ccurme	6f7fe82830	text-splitters: release 0.2.2 (#23508 )	2024-06-25 18:26:05 -04:00
bilk0h	3d54784e6d	text-splitters: Fix/recursive json splitter data persistence issue (#21529 ) Thank you for contributing to LangChain! Description: Noticed an issue with when I was calling `RecursiveJsonSplitter().split_json()` multiple times that I was getting weird results. I found an issue where `chunks` list in the `_json_split` method. If chunks is not provided when _json_split (which is the case when split_json calls _json_split) then the same list is used for subsequent calls to `_json_split`. You can see this in the test case i also added to this commit. Output should be: ``` [{'a': 1, 'b': 2}] [{'c': 3, 'd': 4}] ``` Instead you get: ``` [{'a': 1, 'b': 2}] [{'a': 1, 'b': 2, 'c': 3, 'd': 4}] ``` --------- Co-authored-by: Nuno Campos <nuno@langchain.dev> Co-authored-by: isaac hershenson <ihershenson@hmc.edu> Co-authored-by: Isaac Francisco <78627776+isahers1@users.noreply.github.com>	2024-06-18 20:21:55 -07:00
Ryan Elston	86ee4f0daa	text-splitters: Introduce Experimental Markdown Syntax Splitter (#22257 ) #### Description This MR defines a `ExperimentalMarkdownSyntaxTextSplitter` class. The main goal is to replicate the functionality of the original `MarkdownHeaderTextSplitter` which extracts the header stack as metadata but with one critical difference: it keeps the whitespace of the original text intact. This draft reimplements the `MarkdownHeaderTextSplitter` with a very different algorithmic approach. Instead of marking up each line of the text individually and aggregating them back together into chunks, this method builds each chunk sequentially and applies the metadata to each chunk. This makes the implementation simpler. However, since it's designed to keep white space intact its not a full drop in replacement for the original. Since it is a radical implementation change to the original code and I would like to get feedback to see if this is a worthwhile replacement, should be it's own class, or is not a good idea at all. Note: I implemented the `return_each_line` parameter but I don't think it's a necessary feature. I'd prefer to remove it. This implementation also adds the following additional features: - Splits out code blocks and includes the language in the `"Code"` metadata key - Splits text on the horizontal rule `---` as well - The `headers_to_split_on` parameter is now optional - with sensible defaults that can be overridden. #### Issue Keeping the whitespace keeps the paragraphs structure and the formatting of the code blocks intact which allows the caller much more flexibility in how they want to further split the individuals sections of the resulting documents. This addresses the issues brought up by the community in the following issues: - https://github.com/langchain-ai/langchain/issues/20823 - https://github.com/langchain-ai/langchain/issues/19436 - https://github.com/langchain-ai/langchain/issues/22256 #### Dependencies N/A #### Twitter handle @RyanElston --------- Co-authored-by: isaac hershenson <ihershenson@hmc.edu>	2024-06-18 19:44:00 -07:00
Bagatur	093ae04d58	core[patch]: Pin pydantic in py3.12.4 (#23130 )	2024-06-18 12:00:02 -07:00
Jiejun Tan	c8c67dde6f	text-splitters[patch]: Fix HTMLSectionSplitter (#22812 ) Update former pull request: https://github.com/langchain-ai/langchain/pull/22654. Modified `langchain_text_splitters.HTMLSectionSplitter`, where in the latest version `dict` data structure is used to store sections from a html document, in function `split_html_by_headers`. The header/section element names serve as dict keys. This can be a problem when duplicate header/section element names are present in a single html document. Latter ones can replace former ones with the same name. Therefore some contents can be miss after html text splitting is conducted. Using a list to store sections can hopefully solve the problem. A Unit test considering duplicate header names has been added. --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2024-06-14 22:40:39 +00:00
Max Mulatz	058a64c563	Community[minor]: Add language parser for Elixir (#22742 ) Hi 👋 First off, thanks a ton for your work on this 💚 Really appreciate what you're providing here for the community. ## Description This PR adds a basic language parser for the [Elixir](https://elixir-lang.org/) programming language. The parser code is based upon the approach outlined in https://github.com/langchain-ai/langchain/pull/13318: it's using `tree-sitter` under the hood and aligns with all the other `tree-sitter` based parses added that PR. The `CHUNK_QUERY` I'm using here is probably not the most sophisticated one, but it worked for my application. It's a starting point to provide "core" parsing support for Elixir in LangChain. It enables people to use the language parser out in real world applications which may then lead to further tweaking of the queries. I consider this PR just the ground work. - Dependencies: requires `tree-sitter` and `tree-sitter-languages` from the extended dependencies - Twitter handle:`@bitcrowd` ## Checklist - [x] PR title: "package: description" - [x] Add tests and docs - [x] Lint and test: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. <!-- If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17. -->	2024-06-10 15:56:57 +00:00
Erick Friis	a24a9c6427	multiple: get rid of pyproject extras (#22581 ) They cause `poetry lock` to take a ton of time, and `uv pip install` can resolve the constraints from these toml files in trivial time (addressing problem with #19153) This allows us to properly upgrade lockfile dependencies moving forward, which revealed some issues that were either fixed or type-ignored (see file comments)	2024-06-06 15:45:22 -07:00
Bagatur	99a3cad258	text-splitters[patch]: Release 0.2.1 (#22490 )	2024-06-04 11:19:21 -07:00
Tom Clelford	c599732e1a	text-splitters[patch]: fix HTMLSectionSplitter parsing of xslt paths (#22176 ) ## Description This PR allows passing the HTMLSectionSplitter paths to xslt files. It does so by fixing two trivial bugs with how passed paths were being handled. It also changes the default value of the param `xslt_path` to `None` so the special case where the file was part of the langchain package could be handled. ## Issue #22175	2024-06-03 20:26:59 +00:00
Harrison Chase	8fad2e209a	fix error message (#22437 ) Was confusing when language is in Enum but not implemented	2024-06-03 15:48:26 +00:00
Christos Boulmpasakos	c3bcfad66d	text-splitters[patch]: Extend TextSplitter:keep_separator functionality (#21130 ) Description: Added extra functionality to `CharacterTextSplitter`, `TextSplitter` classes. The user can select whether to append the separator to the previous chunk with `keep_separator='end' ` or else prepend to the next chunk. Previous functionality prepended by default to next chunk. Issue: Fixes #20908 --------- Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>	2024-05-22 13:17:45 -07:00
ccurme	0e72ed39a0	infra: fix CI on text-splitters (#21935 )	2024-05-20 14:03:42 -07:00
Erick Friis	1b555021f7	text-splitters: release 0.2.0 (#21832 )	2024-05-17 13:30:54 -07:00
Erick Friis	c77d2f2b06	multiple: core 0.2 nonbreaking dep, check_diff community->langchain dep (#21646 ) 0.2 is not a breaking release for core (but it is for langchain and community) To keep the core+langchain+community packages in sync at 0.2, we will relax deps throughout the ecosystem to tolerate `langchain-core` 0.2	2024-05-13 19:50:36 -07:00
Lei Zhang	2cd907ad7e	text-splitters[patch]: fix MarkdownHeaderTextSplitter fails to parse headers with non-printable characters (#20645 ) Description: MarkdownHeaderTextSplitter Fails to Parse Headers with non-printable characters. more #20643 The following is the official test case. Just replacing `# Foo\n\n` with `\ufeff# Foo\n\n` will cause the test case to fail. chunk metadata is empty ```python def test_md_header_text_splitter_1() -> None: """Test markdown splitter by header: Case 1.""" markdown_document = ( "\ufeff# Foo\n\n" " ## Bar\n\n" "Hi this is Jim\n\n" "Hi this is Joe\n\n" " ## Baz\n\n" " Hi this is Molly" ) headers_to_split_on = [ ("#", "Header 1"), ("##", "Header 2"), ] markdown_splitter = MarkdownHeaderTextSplitter( headers_to_split_on=headers_to_split_on, ) output = markdown_splitter.split_text(markdown_document) expected_output = [ Document( page_content="Hi this is Jim \nHi this is Joe", metadata={"Header 1": "Foo", "Header 2": "Bar"}, ), Document( page_content="Hi this is Molly", metadata={"Header 1": "Foo", "Header 2": "Baz"}, ), ] assert output == expected_output ``` twitter: @coolbeevip Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>	2024-04-25 00:07:42 +00:00
saberuster	160bcaeb93	text-splitters[minor]: Add lua code splitting (#20421 ) - Description: Complete the support for Lua code in langchain.text_splitter module. - Dependencies: No - Twitter handle: @saberuster If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, hwchase17. --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2024-04-13 22:42:51 +00:00
Mahdi Setayesh	c28efb878c	text-splitters[minor]: Adding a new section aware splitter to langchain (#16526 ) - Description: the layout of html pages can be variant based on the bootstrap framework or the styles of the pages. So we need to have a splitter to transform the html tags to a proper layout and then split the html content based on the provided list of tags to determine its html sections. We are using BS4 library along with xslt structure to split the html content using an section aware approach. - Dependencies: No new dependencies - Twitter handle: @m_setayesh Please make sure your PR is passing linting and testing before submitting. Run `make format`, `make lint` and `make test` from the root of the package you've modified to check this locally. See contribution guidelines for more information on how to write/run tests, lint, etc: https://python.langchain.com/docs/contributing/ If you're adding a new integration, please include: 1. a test for the integration, preferably unit tests that do not rely on network access, 2. an example notebook showing its use. It lives in `docs/docs/integrations` directory. If no one reviews your PR within a few days, please @-mention one of @baskaryan, @eyurtsev, @hwchase17. --> --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2024-04-01 20:32:26 +00:00
Nisarg Trivedi	1252ccce6f	text-splitters[minor]: Added Haskell support in langchain.text_splitter module (#16191 ) - Description: Haskell language support added in text_splitter module - Dependencies: No - Twitter handle: @nisargtr If no one reviews your PR within a few days, please @-mention one of @baskaryan, @eyurtsev, @hwchase17. --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2024-03-29 20:17:50 +00:00
Evgenii Zheltonozhskii	5b1f9c6d3a	infra: Consistent lxml requirements (#19520 ) Update the dependency for lxml to be consistent among different packages; should fix https://github.com/langchain-ai/langchain/issues/19040	2024-03-27 20:27:59 +00:00
Guangdong Liu	e5d7e455dc	splitters: Add ensure_ascii parameter (#18485 ) - Description: Add ensure_ascii parameter	2024-03-19 12:51:16 -07:00

1 2

55 Commits