core[minor]: add upsert, streaming_upsert, aupsert, astreaming_upsert methods to the VectorStore abstraction (#23774)

This PR rolls out part of the newly proposed interface for vectorstores
(https://github.com/langchain-ai/langchain/pull/23544) to existing store
implementations.

The PR makes the following changes:

1. Adds standard upsert, streaming_upsert, aupsert, astreaming_upsert
methods to the vectorstore.
2. Makes `add_texts` and `aadd_texts` optional, with default
implementations that delegate to `upsert` and `aupsert` when those have
been implemented. The original `add_texts` and `aadd_texts` methods are
problematic because they spread object-specific information across the
document and `**kwargs` (e.g., ids are not part of the document).
3. Adds a default implementation to `add_documents` and `aadd_documents`
that delegates to `upsert` and `aupsert` respectively.
4. Adds standard unit tests to verify that a given vectorstore
implements a correct read/write API.
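
The delegation described in items 2 and 3 can be illustrated with a minimal,
self-contained sketch. It does not use the real `VectorStore` base class:
`Doc`, `SketchStore`, and the list-of-ids return value are hypothetical
stand-ins chosen for illustration, and the actual signatures in this PR may
differ.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Optional, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for a document that carries its own id."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None


class SketchStore:
    """Toy store (an assumption for illustration, not the real base class)."""

    def __init__(self) -> None:
        self.data: Dict[str, Doc] = {}

    def upsert(self, items: Sequence[Doc], **kwargs: Any) -> List[str]:
        # Single write path: insert new items, overwrite items whose id exists.
        ids: List[str] = []
        for item in items:
            doc_id = item.id or str(uuid.uuid4())
            self.data[doc_id] = Doc(item.page_content, item.metadata, doc_id)
            ids.append(doc_id)
        return ids

    def add_texts(
        self,
        texts: Iterable[str],
        metadatas: Optional[List[Dict[str, Any]]] = None,
        ids: Optional[List[Optional[str]]] = None,
        **kwargs: Any,
    ) -> List[str]:
        # Default implementation: fold per-item data (metadata, ids) into
        # documents, then delegate to upsert.
        texts = list(texts)
        metadatas = metadatas or [{} for _ in texts]
        ids = ids or [None] * len(texts)
        docs = [Doc(t, m, i) for t, m, i in zip(texts, metadatas, ids)]
        return self.upsert(docs, **kwargs)

    def add_documents(self, documents: Sequence[Doc], **kwargs: Any) -> List[str]:
        # Default implementation: documents already carry their ids.
        return self.upsert(documents, **kwargs)
```

In this sketch, a subclass would only need to override `upsert`, and writing
the same id twice replaces the stored document instead of duplicating it.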

A downside of this implementation is that it creates `upsert` with a
very similar signature to `add_documents`.
The reasons for introducing `upsert` are to:
* Remove any ambiguity about what information is allowed in `kwargs`.
Specifically, `kwargs` should only be used for information common to all
indexed data (e.g., indexing timeout).
* Allow inheriting from an anticipated generalized interface for indexing
that will allow indexing `BaseMedia` (i.e., allow making a vectorstore
for images, audio, etc.).
 
`add_documents` can be deprecated in the future in favor of `upsert` to
make sure that users have a single correct way of indexing content.

---------

Co-authored-by: ccurme <chester.curme@gmail.com>
This commit is contained in:
Eugene Yurtsev
2024-07-05 12:21:40 -04:00
committed by GitHub
parent 3c752238c5
commit 6f08e11d7c
14 changed files with 667 additions and 83 deletions

@@ -1,4 +1,5 @@
 from pathlib import Path
+from typing import Any
 
 import pytest
 from langchain_core.documents import Document
@@ -13,6 +14,11 @@ from tests.integration_tests.vectorstores.fake_embeddings import (
 )
 
 
+class AnyStr(str):
+    def __eq__(self, other: Any) -> bool:
+        return isinstance(other, str)
+
+
 class TestInMemoryReadWriteTestSuite(ReadWriteTestSuite):
     @pytest.fixture
     def vectorstore(self) -> InMemoryVectorStore:
@@ -31,10 +37,13 @@ async def test_inmemory() -> None:
         ["foo", "bar", "baz"], ConsistentFakeEmbeddings()
     )
     output = await store.asimilarity_search("foo", k=1)
-    assert output == [Document(page_content="foo")]
+    assert output == [Document(page_content="foo", id=AnyStr())]
 
     output = await store.asimilarity_search("bar", k=2)
-    assert output == [Document(page_content="bar"), Document(page_content="baz")]
+    assert output == [
+        Document(page_content="bar", id=AnyStr()),
+        Document(page_content="baz", id=AnyStr()),
+    ]
 
     output2 = await store.asimilarity_search_with_score("bar", k=2)
     assert output2[0][1] > output2[1][1]
@@ -61,8 +70,8 @@ async def test_inmemory_mmr() -> None:
         "foo", k=10, lambda_mult=0.1
     )
     assert len(output) == len(texts)
-    assert output[0] == Document(page_content="foo")
-    assert output[1] == Document(page_content="foy")
+    assert output[0] == Document(page_content="foo", id=AnyStr())
+    assert output[1] == Document(page_content="foy", id=AnyStr())
 
 
 async def test_inmemory_dump_load(tmp_path: Path) -> None:
@@ -90,4 +99,4 @@ async def test_inmemory_filter() -> None:
     output = await store.asimilarity_search(
         "baz", filter=lambda doc: doc.metadata["id"] == 1
     )
-    assert output == [Document(page_content="foo", metadata={"id": 1})]
+    assert output == [Document(page_content="foo", metadata={"id": 1}, id=AnyStr())]
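
The updated assertions rely on the `AnyStr` helper acting as a wildcard for
ids, presumably because the store now assigns an id to each document and the
tests cannot know its value in advance. A minimal sketch of why the
comparison works, using a hypothetical `Doc` dataclass instead of the real
`Document` class:

```python
from dataclasses import dataclass
from typing import Any, Optional


class AnyStr(str):
    """Compares equal to any str, mirroring the helper in the diff above."""

    def __eq__(self, other: Any) -> bool:
        return isinstance(other, str)


@dataclass
class Doc:
    """Hypothetical stand-in for Document, compared field by field."""

    page_content: str
    id: Optional[str] = None


# Stands in for a document returned by a store with a generated id.
generated = Doc(page_content="foo", id="generated-id-123")

# The id matches any string, but the content must still match exactly.
matches_any_id = generated == Doc(page_content="foo", id=AnyStr())
differs_on_content = generated != Doc(page_content="bar", id=AnyStr())
# A missing id (None) is not a str, so the wildcard does not match it.
differs_on_missing_id = Doc(page_content="foo") != Doc(page_content="foo", id=AnyStr())
```

Because `AnyStr` subclasses `str` and overrides `__eq__`, Python tries its
comparison first even when it appears on the right-hand side of `==`.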