Reuse the existing collection if configured properly in Qdrant.from_texts (#7530)

This PR changes the behavior of `Qdrant.from_texts` so the collection is
reused if not requested to recreate it. Previously, calling
`Qdrant.from_texts` or `Qdrant.from_documents` resulted in removing the
old data which was confusing for many.
This commit is contained in:
Kacper Łukawski 2023-07-11 22:24:35 +02:00 committed by GitHub
parent 6674b33cf5
commit 1f83b5f47e
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
4 changed files with 989 additions and 804 deletions

View File

@ -54,7 +54,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"OpenAI API Key: \u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\n"
"OpenAI API Key: ········\n"
]
}
],
@ -249,24 +249,9 @@
"id": "93540013",
"metadata": {},
"source": [
"## Reusing the same collection\n",
"## Recreating the collection\n",
"\n",
"Both `Qdrant.from_texts` and `Qdrant.from_documents` methods are great to start using Qdrant with LangChain, but **they are going to destroy the collection and create it from scratch**! If you want to reuse the existing collection, you can always create an instance of `Qdrant` on your own and pass the `QdrantClient` instance with the connection details."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "b7b432d7",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:24.843090Z",
"start_time": "2023-04-04T10:51:24.840041Z"
}
},
"outputs": [],
"source": [
"del qdrant"
"Both `Qdrant.from_texts` and `Qdrant.from_documents` methods are great to start using Qdrant with Langchain. In the previous versions the collection was recreated every time you called any of them. That behaviour has changed. Currently, the collection is going to be reused if it already exists. Setting `force_recreate` to `True` allows to remove the old collection and start from scratch."
]
},
{
@ -281,10 +266,15 @@
},
"outputs": [],
"source": [
"import qdrant_client\n",
"\n",
"client = qdrant_client.QdrantClient(path=\"/tmp/local_qdrant\", prefer_grpc=True)\n",
"qdrant = Qdrant(client=client, collection_name=\"my_documents\", embeddings=embeddings)"
"url = \"<---qdrant url here --->\"\n",
"qdrant = Qdrant.from_documents(\n",
" docs,\n",
" embeddings,\n",
" url,\n",
" prefer_grpc=True,\n",
" collection_name=\"my_documents\",\n",
" force_recreate=True,\n",
")"
]
},
{
@ -336,13 +326,13 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you\u2019re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, I\u2019d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer\u2014an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation\u2019s top legal minds, who will continue Justice Breyer\u2019s legacy of excellence.\n"
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
@ -393,13 +383,13 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you\u2019re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, I\u2019d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer\u2014an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation\u2019s top legal minds, who will continue Justice Breyer\u2019s legacy of excellence.\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"\n",
"Score: 0.8153784913324512\n"
]
@ -483,15 +473,15 @@
"name": "stdout",
"output_type": "stream",
"text": [
"1. Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you\u2019re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"1. Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, I\u2019d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer\u2014an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation\u2019s top legal minds, who will continue Justice Breyer\u2019s legacy of excellence. \n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence. \n",
"\n",
"2. We can\u2019t change how divided we\u2019ve been. But we can change how we move forward\u2014on COVID-19 and other issues we must face together. \n",
"2. We cant change how divided weve been. But we can change how we move forward—on COVID-19 and other issues we must face together. \n",
"\n",
"I recently visited the New York City Police Department days after the funerals of Officer Wilbert Mora and his partner, Officer Jason Rivera. \n",
"\n",
@ -501,13 +491,13 @@
"\n",
"Officer Rivera was 22. \n",
"\n",
"Both Dominican Americans who\u2019d grown up on the same streets they later chose to patrol as police officers. \n",
"Both Dominican Americans whod grown up on the same streets they later chose to patrol as police officers. \n",
"\n",
"I spoke with their families and told them that we are forever in debt for their sacrifice, and we will carry on their mission to restore the trust and safety every community deserves. \n",
"\n",
"I\u2019ve worked on these issues a long time. \n",
"Ive worked on these issues a long time. \n",
"\n",
"I know what works: Investing in crime preventionand community police officers who\u2019ll walk the beat, who\u2019ll know the neighborhood, and who can restore trust and safety. \n",
"I know what works: Investing in crime preventionand community police officers wholl walk the beat, wholl know the neighborhood, and who can restore trust and safety. \n",
"\n"
]
}
@ -605,7 +595,7 @@
{
"data": {
"text/plain": [
"Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you\u2019re at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, I\u2019d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer\u2014an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation\u2019s top legal minds, who will continue Justice Breyer\u2019s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})"
"Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})"
]
},
"execution_count": 17,
@ -648,8 +638,7 @@
],
"metadata": {
"collapsed": false
},
"id": "3166ff99"
}
},
{
"cell_type": "markdown",
@ -658,8 +647,7 @@
],
"metadata": {
"collapsed": false
},
"id": "79848e80"
}
},
{
"cell_type": "markdown",
@ -683,8 +671,7 @@
],
"metadata": {
"collapsed": false
},
"id": "daeafd6d"
}
},
{
"cell_type": "code",

View File

@ -34,6 +34,10 @@ if TYPE_CHECKING:
MetadataFilter = Union[DictFilter, common_types.Filter]
class QdrantException(Exception):
"""Base class for all the Qdrant related exceptions"""
class Qdrant(VectorStore):
"""Wrapper around Qdrant vector database.
@ -552,6 +556,7 @@ class Qdrant(VectorStore):
wal_config: Optional[common_types.WalConfigDiff] = None,
quantization_config: Optional[common_types.QuantizationConfig] = None,
init_from: Optional[common_types.InitFrom] = None,
force_recreate: bool = False,
**kwargs: Any,
) -> Qdrant:
"""Construct Qdrant wrapper from a list of texts.
@ -636,6 +641,8 @@ class Qdrant(VectorStore):
Params for quantization, if None - quantization will be disabled
init_from:
Use data stored in another collection to initialize this collection
force_recreate:
Force recreating the collection
**kwargs:
Additional arguments passed directly into REST client initialization
@ -663,7 +670,9 @@ class Qdrant(VectorStore):
"Please install it with `pip install qdrant-client`."
)
from grpc import RpcError
from qdrant_client.http import models as rest
from qdrant_client.http.exceptions import UnexpectedResponse
# Just do a single quick embedding to get vector size
partial_embeddings = embedding.embed_documents(texts[:1])
@ -687,6 +696,70 @@ class Qdrant(VectorStore):
**kwargs,
)
try:
# Skip any validation in case of forced collection recreate.
if force_recreate:
raise ValueError
# Get the vector configuration of the existing collection and vector, if it
# was specified. If the old configuration does not match the current one,
# an exception is being thrown.
collection_info = client.get_collection(collection_name=collection_name)
current_vector_config = collection_info.config.params.vectors
if isinstance(current_vector_config, dict) and vector_name is not None:
if vector_name not in current_vector_config:
raise QdrantException(
f"Existing Qdrant collection {collection_name} does not "
f"contain vector named {vector_name}. Did you mean one of the "
f"existing vectors: {', '.join(current_vector_config.keys())}? "
f"If you want to recreate the collection, set `force_recreate` "
f"parameter to `True`."
)
current_vector_config = current_vector_config.get(
vector_name
) # type: ignore[assignment]
elif isinstance(current_vector_config, dict) and vector_name is None:
raise QdrantException(
f"Existing Qdrant collection {collection_name} uses named vectors. "
f"If you want to reuse it, please set `vector_name` to any of the "
f"existing named vectors: "
f"{', '.join(current_vector_config.keys())}." # noqa
f"If you want to recreate the collection, set `force_recreate` "
f"parameter to `True`."
)
elif (
not isinstance(current_vector_config, dict) and vector_name is not None
):
raise QdrantException(
f"Existing Qdrant collection {collection_name} doesn't use named "
f"vectors. If you want to reuse it, please set `vector_name` to "
f"`None`. If you want to recreate the collection, set "
f"`force_recreate` parameter to `True`."
)
# Check if the vector configuration has the same dimensionality.
if current_vector_config.size != vector_size: # type: ignore[union-attr]
raise QdrantException(
f"Existing Qdrant collection is configured for vectors with "
f"{current_vector_config.size} " # type: ignore[union-attr]
f"dimensions. Selected embeddings are {vector_size}-dimensional. "
f"If you want to recreate the collection, set `force_recreate` "
f"parameter to `True`."
)
current_distance_func = (
current_vector_config.distance.name.upper() # type: ignore[union-attr]
)
if current_distance_func != distance_func:
raise QdrantException(
f"Existing Qdrant collection is configured for "
f"{current_vector_config.distance} " # type: ignore[union-attr]
f"similarity. Please set `distance_func` parameter to "
f"`{distance_func}` if you want to reuse it. If you want to "
f"recreate the collection, set `force_recreate` parameter to "
f"`True`."
)
except (UnexpectedResponse, RpcError, ValueError):
vectors_config = rest.VectorParams(
size=vector_size,
distance=rest.Distance[distance_func],
@ -714,35 +787,7 @@ class Qdrant(VectorStore):
timeout=timeout, # type: ignore[arg-type]
)
texts_iterator = iter(texts)
metadatas_iterator = iter(metadatas or [])
ids_iterator = iter(ids or [uuid.uuid4().hex for _ in iter(texts)])
while batch_texts := list(islice(texts_iterator, batch_size)):
# Take the corresponding metadata and id for each text in a batch
batch_metadatas = list(islice(metadatas_iterator, batch_size)) or None
batch_ids = list(islice(ids_iterator, batch_size))
# Generate the embeddings for all the texts in a batch
batch_embeddings = embedding.embed_documents(batch_texts)
if vector_name is not None:
batch_embeddings = { # type: ignore[assignment]
vector_name: batch_embeddings
}
points = rest.Batch.construct(
ids=batch_ids,
vectors=batch_embeddings,
payloads=cls._build_payloads(
batch_texts,
batch_metadatas,
content_payload_key,
metadata_payload_key,
),
)
client.upsert(collection_name=collection_name, points=points)
return cls(
qdrant = cls(
client=client,
collection_name=collection_name,
embeddings=embedding,
@ -751,6 +796,8 @@ class Qdrant(VectorStore):
distance_strategy=distance_func,
vector_name=vector_name,
)
qdrant.add_texts(texts, metadatas, ids, batch_size)
return qdrant
@classmethod
def _build_payloads(

View File

@ -27,8 +27,9 @@ class ConsistentFakeEmbeddings(FakeEmbeddings):
"""Fake embeddings which remember all the texts seen so far to return consistent
vectors for the same texts."""
def __init__(self) -> None:
def __init__(self, dimensionality: int = 10) -> None:
self.known_texts: List[str] = []
self.dimensionality = dimensionality
def embed_documents(self, texts: List[str]) -> List[List[float]]:
"""Return consistent embeddings for each text seen so far."""
@ -36,7 +37,9 @@ class ConsistentFakeEmbeddings(FakeEmbeddings):
for text in texts:
if text not in self.known_texts:
self.known_texts.append(text)
vector = [float(1.0)] * 9 + [float(self.known_texts.index(text))]
vector = [float(1.0)] * (self.dimensionality - 1) + [
float(self.known_texts.index(text))
]
out_vectors.append(vector)
return out_vectors
@ -44,8 +47,10 @@ class ConsistentFakeEmbeddings(FakeEmbeddings):
"""Return consistent embeddings for the text, if seen before, or a constant
one if the text is unknown."""
if text not in self.known_texts:
return [float(1.0)] * 9 + [float(0.0)]
return [float(1.0)] * 9 + [float(self.known_texts.index(text))]
return [float(1.0)] * (self.dimensionality - 1) + [float(0.0)]
return [float(1.0)] * (self.dimensionality - 1) + [
float(self.known_texts.index(text))
]
class AngularTwoDimensionalEmbeddings(Embeddings):

View File

@ -9,6 +9,7 @@ from qdrant_client.http import models as rest
from langchain.docstore.document import Document
from langchain.embeddings.base import Embeddings
from langchain.vectorstores import Qdrant
from langchain.vectorstores.qdrant import QdrantException
from tests.integration_tests.vectorstores.fake_embeddings import (
ConsistentFakeEmbeddings,
)
@ -537,3 +538,148 @@ def test_qdrant_similarity_search_with_relevance_scores(
assert all(
(1 >= score or np.isclose(score, 1)) and score >= 0 for _, score in output
)
@pytest.mark.parametrize("vector_name", [None, "custom-vector"])
def test_qdrant_from_texts_reuses_same_collection(vector_name: Optional[str]) -> None:
"""Test if Qdrant.from_texts reuses the same collection"""
from qdrant_client import QdrantClient
collection_name = "test"
embeddings = ConsistentFakeEmbeddings()
with tempfile.TemporaryDirectory() as tmpdir:
vec_store = Qdrant.from_texts(
["lorem", "ipsum", "dolor", "sit", "amet"],
embeddings,
collection_name=collection_name,
path=str(tmpdir),
vector_name=vector_name,
)
del vec_store
vec_store = Qdrant.from_texts(
["foo", "bar"],
embeddings,
collection_name=collection_name,
path=str(tmpdir),
vector_name=vector_name,
)
del vec_store
client = QdrantClient(path=str(tmpdir))
assert 7 == client.count(collection_name).count
@pytest.mark.parametrize("vector_name", [None, "custom-vector"])
def test_qdrant_from_texts_raises_error_on_different_dimensionality(
vector_name: Optional[str],
) -> None:
"""Test if Qdrant.from_texts raises an exception if dimensionality does not match"""
collection_name = "test"
with tempfile.TemporaryDirectory() as tmpdir:
vec_store = Qdrant.from_texts(
["lorem", "ipsum", "dolor", "sit", "amet"],
ConsistentFakeEmbeddings(dimensionality=10),
collection_name=collection_name,
path=str(tmpdir),
vector_name=vector_name,
)
del vec_store
with pytest.raises(QdrantException):
Qdrant.from_texts(
["foo", "bar"],
ConsistentFakeEmbeddings(dimensionality=5),
collection_name=collection_name,
path=str(tmpdir),
vector_name=vector_name,
)
@pytest.mark.parametrize(
["first_vector_name", "second_vector_name"],
[
(None, "custom-vector"),
("custom-vector", None),
("my-first-vector", "my-second_vector"),
],
)
def test_qdrant_from_texts_raises_error_on_different_vector_name(
first_vector_name: Optional[str],
second_vector_name: Optional[str],
) -> None:
"""Test if Qdrant.from_texts raises an exception if vector name does not match"""
collection_name = "test"
with tempfile.TemporaryDirectory() as tmpdir:
vec_store = Qdrant.from_texts(
["lorem", "ipsum", "dolor", "sit", "amet"],
ConsistentFakeEmbeddings(dimensionality=10),
collection_name=collection_name,
path=str(tmpdir),
vector_name=first_vector_name,
)
del vec_store
with pytest.raises(QdrantException):
Qdrant.from_texts(
["foo", "bar"],
ConsistentFakeEmbeddings(dimensionality=5),
collection_name=collection_name,
path=str(tmpdir),
vector_name=second_vector_name,
)
def test_qdrant_from_texts_raises_error_on_different_distance() -> None:
"""Test if Qdrant.from_texts raises an exception if distance does not match"""
collection_name = "test"
with tempfile.TemporaryDirectory() as tmpdir:
vec_store = Qdrant.from_texts(
["lorem", "ipsum", "dolor", "sit", "amet"],
ConsistentFakeEmbeddings(dimensionality=10),
collection_name=collection_name,
path=str(tmpdir),
distance_func="Cosine",
)
del vec_store
with pytest.raises(QdrantException):
Qdrant.from_texts(
["foo", "bar"],
ConsistentFakeEmbeddings(dimensionality=5),
collection_name=collection_name,
path=str(tmpdir),
distance_func="Euclid",
)
@pytest.mark.parametrize("vector_name", [None, "custom-vector"])
def test_qdrant_from_texts_recreates_collection_on_force_recreate(
vector_name: Optional[str],
) -> None:
"""Test if Qdrant.from_texts recreates the collection even if config mismatches"""
from qdrant_client import QdrantClient
collection_name = "test"
with tempfile.TemporaryDirectory() as tmpdir:
vec_store = Qdrant.from_texts(
["lorem", "ipsum", "dolor", "sit", "amet"],
ConsistentFakeEmbeddings(dimensionality=10),
collection_name=collection_name,
path=str(tmpdir),
vector_name=vector_name,
)
del vec_store
vec_store = Qdrant.from_texts(
["foo", "bar"],
ConsistentFakeEmbeddings(dimensionality=5),
collection_name=collection_name,
path=str(tmpdir),
vector_name=vector_name,
force_recreate=True,
)
del vec_store
client = QdrantClient(path=str(tmpdir))
assert 2 == client.count(collection_name).count