mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-06 21:43:44 +00:00
Add regex control over separators in character text splitter (#7933)
<!-- Thank you for contributing to LangChain! Replace this comment with: - Description: a description of the change, - Issue: the issue # it fixes (if applicable), - Dependencies: any dependencies required for this change, - Tag maintainer: for a quicker response, tag the relevant maintainer (see below), - Twitter handle: we announce bigger features on Twitter. If your PR gets announced and you'd like a mention, we'll gladly shout you out! Please make sure you're PR is passing linting and testing before submitting. Run `make format`, `make lint` and `make test` to check this locally. If you're adding a new integration, please include: 1. a test for the integration, preferably unit tests that do not rely on network access, 2. an example notebook showing its use. Maintainer responsibilities: - General / Misc / if you don't know who to tag: @baskaryan - DataLoaders / VectorStores / Retrievers: @rlancemartin, @eyurtsev - Models / Prompts: @hwchase17, @baskaryan - Memory: @hwchase17 - Agents / Tools / Toolkits: @hinthornw - Tracing / Callbacks: @agola11 - Async: @agola11 If no one reviews your PR within a few days, feel free to @-mention the same people again. See contribution guidelines for more information on how to write/run tests, lint, etc: https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md --> #7854 Added the ability to use the `separator` ase a regex or a simple character. Fixed a bug where `start_index` was incorrectly counting from -1. Who can review? @eyurtsev @hwchase17 @mmz-001
This commit is contained in:
@@ -12,6 +12,7 @@ text_splitter = CharacterTextSplitter(
|
||||
chunk_size = 1000,
|
||||
chunk_overlap = 200,
|
||||
length_function = len,
|
||||
is_separator_regex = False,
|
||||
)
|
||||
```
|
||||
|
||||
|
@@ -16,6 +16,7 @@ text_splitter = RecursiveCharacterTextSplitter(
|
||||
chunk_size = 100,
|
||||
chunk_overlap = 20,
|
||||
length_function = len,
|
||||
is_separator_regex = False,
|
||||
)
|
||||
```
|
||||
|
||||
|
Reference in New Issue
Block a user