core: implement a batch_size parameter for CacheBackedEmbeddings (#18070)

**Description:**

Currently, `CacheBackedEmbeddings` computes vectors for *all* uncached
documents before updating the store. This pull request adds a
`batch_size` parameter and updates the embedding computation loop to
compute embeddings in batches, updating the store after each batch.
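A minimal usage sketch of the new parameter (not from the PR itself; the embedding model, cache path, and batch size of 256 are placeholder choices):

```python
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_openai import OpenAIEmbeddings

underlying = OpenAIEmbeddings()  # placeholder embedding model
store = LocalFileStore("./embedding_cache/")  # placeholder on-disk byte store

# With batch_size=256, the store is updated after every 256 documents
# instead of only once after all embeddings have been computed.
embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying,
    store,
    namespace=underlying.model,
    batch_size=256,
)

vectors = embedder.embed_documents(["first document", "second document"])
```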

I noticed this when I tried `CacheBackedEmbeddings` on our 30k-document
set and the cache directory still hadn't appeared on disk after 30 minutes.

The motivation is to minimize compute/data loss when problems occur (see
the sketch after this list):

* If there is a transient embedding failure (e.g. a network outage at
the embedding endpoint triggers an exception), at least the completed
vectors are written to the store instead of being discarded.
* If there is an issue with the store (e.g. no write permissions), the
condition is detected early without computing (and discarding!) all the
vectors.
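
For illustration, a simplified sketch of the batched loop this change introduces. The method and attribute names follow the real class, but this is an approximation, not the exact implementation:

```python
from typing import List

from langchain_core.utils.iter import batch_iterate


def embed_documents(self, texts: List[str]) -> List[List[float]]:
    # Fetch whatever is already cached; None marks a cache miss.
    vectors = self.document_embedding_store.mget(texts)
    missing_indices = [i for i, v in enumerate(vectors) if v is None]

    # Embed the misses batch by batch, writing each batch to the store as
    # soon as it is computed. A failure partway through loses at most the
    # current batch instead of all computed vectors, and a broken store
    # (e.g. no write permissions) fails on the first batch.
    for batch in batch_iterate(self.batch_size, missing_indices):
        batch_texts = [texts[i] for i in batch]
        batch_vectors = self.underlying_embeddings.embed_documents(batch_texts)
        self.document_embedding_store.mset(list(zip(batch_texts, batch_vectors)))
        for index, vector in zip(batch, batch_vectors):
            vectors[index] = vector

    return vectors
```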

**Issue:**
Implements enhancement #18026.

**Testing:**
I was unable to run unit tests; details in [this
post](https://github.com/langchain-ai/langchain/discussions/15019#discussioncomment-8576684).

---------

Signed-off-by: chrispy <chrispy@synopsys.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Author: Chris Papademetrious
Date: 2024-03-19 14:55:43 -04:00
Committed by: GitHub
Parent: 89af30807b
Commit: 305d74c67a
4 changed files with 74 additions and 11 deletions


```diff
@@ -165,8 +165,16 @@ class Tee(Generic[T]):
 safetee = Tee
 
 
-def batch_iterate(size: int, iterable: Iterable[T]) -> Iterator[List[T]]:
-    """Utility batching function."""
+def batch_iterate(size: Optional[int], iterable: Iterable[T]) -> Iterator[List[T]]:
+    """Utility batching function.
+
+    Args:
+        size: The size of the batch. If None, returns a single batch.
+        iterable: The iterable to batch.
+
+    Returns:
+        An iterator over the batches.
+    """
     it = iter(iterable)
     while True:
         chunk = list(islice(it, size))
```
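
The new `size=None` behavior falls out of `islice`: `islice(it, None)` drains the whole iterator, so a single batch is yielded. A quick illustration (not part of the diff):

```python
from langchain_core.utils.iter import batch_iterate

items = [0, 1, 2, 3, 4]

print(list(batch_iterate(3, items)))     # [[0, 1, 2], [3, 4]]
print(list(batch_iterate(None, items)))  # [[0, 1, 2, 3, 4]]
```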