30 Commits

Author SHA1 Message Date
Christophe Bornet
12921a94c5 test(core): reactivate commented tests in test_indexing (#32882)
* These tests now pass
* Commenting them is a [ruff
ERA](https://docs.astral.sh/ruff/rules/commented-out-code/) violation
2025-09-10 11:14:14 -04:00
Christophe Bornet
01fdeede50 chore(core): fix some ruff preview rules (#32785)
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-09-08 15:55:20 +00:00
Aleksandr Filippov
f0b6baa0ef fix(core): track within-batch deduplication in indexing num_skipped count (#32273)
**Description:** Fixes incorrect `num_skipped` count in the LangChain
indexing API. The current implementation only counts documents that
already exist in RecordManager (cross-batch duplicates) but fails to
count documents removed during within-batch deduplication via
`_deduplicate_in_order()`.

This PR adds tracking of the original batch size before deduplication
and includes the difference in `num_skipped`, ensuring that `num_added +
num_skipped` equals the total number of input documents.

**Issue:** Fixes incorrect document count reporting in indexing
statistics

**Dependencies:** None

Fixes #32272

---------

Co-authored-by: Alex Feel <afilippov@spotware.com>
2025-07-28 09:58:51 -04:00
Eugene Yurtsev
02d0a9af6c chore(core): unpin packaging dependency (#32032)
Unpin packaging dependency

---------

Co-authored-by: ntjohnson1 <24689722+ntjohnson1@users.noreply.github.com>
2025-07-14 21:42:32 +00:00
Christophe Bornet
d57216c295 feat(core): add ruff rules D to tests except D1 (#32000)
Docs are not required for tests but when there are docstrings, they
shall be correctly formatted.
See https://docs.astral.sh/ruff/rules/#pydocstyle-d
2025-07-14 10:42:03 -04:00
Christophe Bornet
03e8327e01 core: Ruff preview fixes (#31877)
Auto-fixes from `uv run ruff check --fix --unsafe-fixes --preview`

---------

Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-07-07 13:02:40 -04:00
Eugene Yurtsev
9164e6f906 core[patch]: Add additional hashing options to indexing API, warn on SHA-1 (#31649)
Add additional hashing options to the indexing API, warn on SHA-1

Requires:

- Bumping langchain-core version
- bumping min langchain-core in langchain

---------

Co-authored-by: ccurme <chester.curme@gmail.com>
2025-06-24 14:44:06 -04:00
Christophe Bornet
a8f2ddee31 core: Add ruff rules RUF (#29353)
See https://docs.astral.sh/ruff/rules/#ruff-specific-rules-ruf
Mostly:
* [RUF022](https://docs.astral.sh/ruff/rules/unsorted-dunder-all/)
(unsorted `__all__`)
* [RUF100](https://docs.astral.sh/ruff/rules/unused-noqa/) (unused noqa)
*
[RUF021](https://docs.astral.sh/ruff/rules/parenthesize-chained-operators/)
(parenthesize-chained-operators)
*
[RUF015](https://docs.astral.sh/ruff/rules/unnecessary-iterable-allocation-for-first-element/)
(unnecessary-iterable-allocation-for-first-element)
*
[RUF005](https://docs.astral.sh/ruff/rules/collection-literal-concatenation/)
(collection-literal-concatenation)
* [RUF046](https://docs.astral.sh/ruff/rules/unnecessary-cast-to-int/)
(unnecessary-cast-to-int)

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2025-05-15 15:43:57 -04:00
Lope Ramos
b8ae2de169 langchain-core[patch]: Incremental record manager deletion should be batched (#31206)
**Description:** Before this commit, if one record is batched in more
than 32k rows for sqlite3 >= 3.32 or more than 999 rows for sqlite3 <
3.31, the `record_manager.delete_keys()` will fail, as we are creating a
query with too many variables.

This commit ensures that we are batching the delete operation leveraging
the `cleanup_batch_size` as it is already done for `full` cleanup.

Added unit tests for incremental mode as well on different deleting
batch size.
2025-05-14 11:38:21 -04:00
Sydney Runkle
75e50a3efd core[patch]: Raise AttributeError (instead of ModuleNotFoundError) in custom __getattr__ (#30905)
Follow up to https://github.com/langchain-ai/langchain/pull/30769,
fixing the regression reported
[here](https://github.com/langchain-ai/langchain/pull/30769#issuecomment-2807483610),
thanks @krassowski for the report!

Fix inspired by https://github.com/PrefectHQ/prefect/pull/16172/files

Other changes:
* Using tuples for `__all__`, except in `output_parsers` bc of a list
namespace conflict
* Using a helper function for imports due to repeated logic across
`__init__.py` files becoming hard to maintain.

Co-authored-by: Michał Krassowski < krassowski 5832902+krassowski@users.noreply.github.com>"
2025-04-17 14:15:28 -04:00
Christophe Bornet
dc19d42d37 core: Specify code when ignoring type issue (ruff PGH003) (#30675)
See https://docs.astral.sh/ruff/rules/blanket-type-ignore/
2025-04-10 22:23:52 -04:00
Christophe Bornet
150ac0cb79 core: Add ruff rules DTZ (#30657)
Add ruff rules DTZ:
https://docs.astral.sh/ruff/rules/#flake8-datetimez-dtz
2025-04-04 13:43:47 -04:00
Christophe Bornet
8a33402016 core: Add ruff rules PT (pytest) (#29381)
See https://docs.astral.sh/ruff/rules/#flake8-pytest-style-pt
2025-04-01 13:31:07 -04:00
Keiichi Hirobe
956b09f468 core[patch]: stop deleting records with "scoped_full" when doc is empty (#30520)
Fix a bug that causes `scoped_full` in index to delete records when there are no input docs.
2025-03-27 11:04:34 -04:00
Christophe Bornet
1c4ce7b42b core: Auto-fix some docstrings (#29337) 2025-01-21 13:29:53 -05:00
Keiichi Hirobe
67fd554512 core[patch]: throw exception indexing code if deletion fails in vectorstore (#28103)
The delete methods in the VectorStore and DocumentIndex interfaces
return a status indicating the result. Therefore, we can assume that
their implementations don't throw exceptions but instead return a result
indicating whether the delete operations have failed. The current
implementation doesn't check the returned value, so I modified it to
throw an exception when the operation fails.

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2024-12-13 16:14:27 -05:00
Keiichi Hirobe
258b3be5ec core[minor]: add new clean up strategy "scoped_full" to indexing (#28505)
~Note that this PR is now Draft, so I didn't add change to `aindex`
function and didn't add test codes for my change.
After we have an agreement on the direction, I will add commits.~

`batch_size` is very difficult to decide because setting a large number
like >10000 will impact VectorDB and RecordManager, while setting a
small number will delete records unnecessarily, leading to redundant
work, as the `IMPORTANT` section says.
On the other hand, we can't use `full` because the loader returns just a
subset of the dataset in our use case.

I guess many people are in the same situation as us.

So, as one of the possible solutions for it, I would like to introduce a
new argument, `scoped_full_cleanup`.
This argument will be valid only when `claneup` is Full. If True, Full
cleanup deletes all documents that haven't been updated AND that are
associated with source ids that were seen during indexing. Default is
False.

This change keeps backward compatibility.

---------

Co-authored-by: Eugene Yurtsev <eugene@langchain.dev>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2024-12-13 20:35:25 +00:00
Keiichi Hirobe
da28cf1f54 core[patch]: Reverts PR #25754 and add unit tests (#28702)
I reported the bug 2 weeks ago here:
https://github.com/langchain-ai/langchain/issues/28447

I believe this is a critical bug for the indexer, so I submitted a PR to
revert the change and added unit tests to prevent similar bugs from
being introduced in the future.

@eyurtsev Could you check this?
2024-12-13 15:13:06 -05:00
Erick Friis
0dbaf05bb7 standard-tests: rename langchain_standard_tests to langchain_tests, release 0.3.2 (#28203) 2024-11-18 19:10:39 -08:00
Eric Pinzur
eadc2f6a90 core: added DeleteResponse to the module (#28069)
Description:
* added `DeleteResponse` to the `langchain_core.indexing` module, for
implementing DocumentIndex classes.
2024-11-13 11:08:08 -05:00
João Carlos Ferra de Almeida
780ce00dea core[minor]: add **kwargs to index and aindex functions for custom vector_field support (#26998)
Added `**kwargs` parameters to the `index` and `aindex` functions in
`libs/core/langchain_core/indexing/api.py`. This allows users to pass
additional arguments to the `add_documents` and `aadd_documents`
methods, enabling the specification of a custom `vector_field`. For
example, users can now use `vector_field="embedding"` when indexing
documents in `OpenSearchVectorStore`

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2024-10-07 14:52:50 -04:00
federico-pisanu
2538963945 core[patch]: improve index/aindex api when batch_size<n_docs (#25754)
- **Description:** prevent index function to re-index entire source
document even if nothing has changed.
- **Issue:** #22135

I worked on a solution to this issue that is a compromise between being
cheap and being fast.
In the previous code, when batch_size is greater than the number of docs
from a certain source almost the entire source is deleted (all documents
from that source except for the documents in the first batch)
My solution deletes documents from vector store and record manager only
if at least one document has changed for that source.

Hope this can help!

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2024-09-30 20:57:41 +00:00
Christophe Bornet
3a1b9259a7 core: Add ruff rules for comprehensions (C4) (#26829) 2024-09-25 09:34:17 -04:00
Christophe Bornet
a47b332841 core: Put Python version as a project requirement so it is considered by ruff (#26608)
Ruff doesn't know about the python version in
`[tool.poetry.dependencies]`. It can get it from
`project.requires-python`.

Notes:
* poetry seems to have issues getting the python constraints from
`requires-python` and using `python` in per dependency constraints. So I
had to duplicate the info. I will open an issue on poetry.
* `inspect.isclass()` doesn't work correctly with `GenericAlias`
(`list[...]`, `dict[..., ...]`) on Python <3.11 so I added some `not
isinstance(type, GenericAlias)` checks:

Python 3.11
```pycon
>>> import inspect
>>> inspect.isclass(list)
True
>>> inspect.isclass(list[str])
False
```

Python 3.9
```pycon
>>> import inspect
>>> inspect.isclass(list)
True
>>> inspect.isclass(list[str])
True
```

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2024-09-18 14:37:57 +00:00
Eugene Yurtsev
d283f452cc core[minor]: Add support for DocumentIndex in the index api (#25100)
Support document index in the index api.
2024-08-06 12:30:49 -07:00
Eugene Yurtsev
41dfad5104 core[minor]: Introduce DocumentIndex abstraction (#25062)
This PR adds a minimal document indexer abstraction.

The goal of this abstraction is to allow developers to create custom
retrievers that also have a standard indexing API and allow updating the
document content in them.

The abstraction comes with a test suite that can verify that the indexer
implements the correct semantics.

This is an iteration over a previous PRs
(https://github.com/langchain-ai/langchain/pull/24364). The main
difference is that we're sub-classing from BaseRetriever in this
iteration and as so have consolidated the sync and async interfaces.

The main problem with the current design is that runt time search
configuration has to be specified at init rather than provided at run
time.

We will likely resolve this issue in one of the two ways:

(1) Define a method (`get_retriever`) that will allow creating a
retriever at run time with a specific configuration.. If we do this, we
will likely break the subclass on BaseRetriever
(2) Generalize base retriever so it can support structured queries

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
2024-08-05 18:06:33 +00:00
Eugene Yurtsev
4ba14adec6 core[patch]: Clean up indexing test code (#24139)
Refactor the code to use the existing InMemroyVectorStore.

This change is needed for another PR that moves some of the imports
around (and messes up the mock.patch in this file)
2024-07-11 18:54:46 +00:00
Eugene Yurtsev
6f08e11d7c core[minor]: add upsert, streaming_upsert, aupsert, astreaming_upsert methods to the VectorStore abstraction (#23774)
This PR rolls out part of the new proposed interface for vectorstores
(https://github.com/langchain-ai/langchain/pull/23544) to existing store
implementations.

The PR makes the following changes:

1. Adds standard upsert, streaming_upsert, aupsert, astreaming_upsert
methods to the vectorstore.
2. Updates `add_texts` and `aadd_texts` to be non required with a
default implementation that delegates to `upsert` and `aupsert` if those
have been implemented. The original `add_texts` and `aadd_texts` methods
are problematic as they spread object specific information across
document and **kwargs. (e.g., ids are not a part of the document)
3. Adds a default implementation to `add_documents` and `aadd_documents`
that delegates to `upsert` and `aupsert` respectively.
4. Adds standard unit tests to verify that a given vectorstore
implements a correct read/write API.

A downside of this implementation is that it creates `upsert` with a
very similar signature to `add_documents`.
The reason for introducing `upsert` is to:
* Remove any ambiguities about what information is allowed in `kwargs`.
Specifically kwargs should only be used for information common to all
indexed data. (e.g., indexing timeout).
*Allow inheriting from an anticipated generalized interface for indexing
that will allow indexing `BaseMedia` (i.e., allow making a vectorstore
for images/audio etc.)
 
`add_documents` can be deprecated in the future in favor of `upsert` to
make sure that users have a single correct way of indexing content.

---------

Co-authored-by: ccurme <chester.curme@gmail.com>
2024-07-05 12:21:40 -04:00
Philippe PRADOS
8711c61298 core[minor]: Adds an in-memory implementation of RecordManager (#13200)
**Description:**
langchain offers three technologies to save data:
-
[vectorstore](https://python.langchain.com/docs/modules/data_connection/vectorstores/)
- [docstore](https://js.langchain.com/docs/api/schema/classes/Docstore)
- [record
manager](https://python.langchain.com/docs/modules/data_connection/indexing)

If you want to combine these technologies in a sample persistence
stategy you need a common implementation for each. `DocStore` propose
`InMemoryDocstore`.

We propose the class `MemoryRecordManager` to complete the system.

This is the prelude to another full-request, which needs a consistent
combination of persistence components.

**Tag maintainer:**
@baskaryan

**Twitter handle:**
@pprados

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2024-06-20 12:19:10 -04:00
Eugene Yurtsev
d8aa72f51d core[minor],langchain[patch]: Move base indexing interface and logic to core (#20667)
This PR moves the interface and the logic to core.

The following changes to namespaces:


`indexes` -> `indexing`
`indexes._api` -> `indexing.api`


Testing code is intentionally duplicated for now since it's testing
different
implementations of the record manager (in-memory vs. SQL).

Common logic will need to be pulled out into the test client.


A follow up PR will move the SQL based implementation outside of
LangChain.
2024-04-24 13:18:42 -04:00