Mirror of https://github.com/hwchase17/langchain.git (synced 2025-09-03 03:59:42 +00:00)
core: implement a batch_size parameter for CacheBackedEmbeddings (#18070)
**Description:** Currently, `CacheBackedEmbeddings` computes vectors for *all* uncached documents before updating the store. This pull request updates the embedding computation loop to compute embeddings in batches, updating the store after each batch. I noticed this when I tried `CacheBackedEmbeddings` on our 30k document set and the cache directory hadn't appeared on disk after 30 minutes.

The motivation is to minimize compute/data loss when problems occur:

* If there is a transient embedding failure (e.g. a network outage at the embedding endpoint triggers an exception), at least the completed vectors are written to the store instead of being discarded.
* If there is an issue with the store (e.g. no write permissions), the condition is detected early without computing (and discarding!) all the vectors.

**Issue:** Implements enhancement #18026.

**Testing:** I was unable to run unit tests; details in [this post](https://github.com/langchain-ai/langchain/discussions/15019#discussioncomment-8576684).

---------

Signed-off-by: chrispy <chrispy@synopsys.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
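For context, a minimal usage sketch of the behavior this PR enables. The `batch_size` keyword on `CacheBackedEmbeddings.from_bytes_store` is the parameter introduced here; `OpenAIEmbeddings` and the cache path are illustrative stand-ins, not part of the change:

```python
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_openai import OpenAIEmbeddings  # illustrative; any Embeddings works

underlying = OpenAIEmbeddings()
store = LocalFileStore("./embedding_cache/")  # illustrative cache location

cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying,
    store,
    namespace=underlying.model,  # keep caches from different models separate
    batch_size=100,  # flush computed vectors to the store every 100 documents
)

# If embedding fails partway through (e.g. a network outage), the batches
# already completed are persisted in ./embedding_cache/ rather than discarded.
vectors = cached_embedder.embed_documents(["first document", "second document"])
```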
This commit is contained in: committed by GitHub · parent 89af30807b · commit 305d74c67a
@@ -165,8 +165,16 @@ class Tee(Generic[T]):
 safetee = Tee
 
 
-def batch_iterate(size: int, iterable: Iterable[T]) -> Iterator[List[T]]:
-    """Utility batching function."""
+def batch_iterate(size: Optional[int], iterable: Iterable[T]) -> Iterator[List[T]]:
+    """Utility batching function.
+
+    Args:
+        size: The size of the batch. If None, returns a single batch.
+        iterable: The iterable to batch.
+
+    Returns:
+        An iterator over the batches.
+    """
     it = iter(iterable)
     while True:
         chunk = list(islice(it, size))
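To make the `size=None` behavior concrete, here is a small sketch of how the patched function behaves, assuming the import path `langchain_core.utils.iter` (where the diffed `Tee` class lives). Since `islice(it, None)` consumes the whole iterator, passing `None` yields everything as a single batch:

```python
from langchain_core.utils.iter import batch_iterate

print(list(batch_iterate(2, range(5))))     # [[0, 1], [2, 3], [4]]
print(list(batch_iterate(None, range(5))))  # [[0, 1, 2, 3, 4]] -- one batch
```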