core: implement a batch_size parameter for CacheBackedEmbeddings (#18070)

**Description:**

Currently, `CacheBackedEmbeddings` computes vectors for *all* uncached
documents before updating the store. This pull request adds a
`batch_size` parameter and updates the embedding computation loop to
compute embeddings in batches, updating the store after each batch.
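A minimal usage sketch of the new parameter (not from the PR itself; the embedding model, cache path, and batch size of 256 are placeholder choices):

```python
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_openai import OpenAIEmbeddings

underlying = OpenAIEmbeddings()  # placeholder embedding model
store = LocalFileStore("./embedding_cache/")  # placeholder on-disk byte store

# With batch_size=256, the store is updated after every 256 documents
# instead of only once after all embeddings have been computed.
embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying,
    store,
    namespace=underlying.model,
    batch_size=256,
)

vectors = embedder.embed_documents(["first document", "second document"])
```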

I noticed this when I tried `CacheBackedEmbeddings` on our 30k-document
set and the cache directory still hadn't appeared on disk after 30 minutes.

The motivation is to minimize compute/data loss when problems occur (see
the sketch after this list):

* If there is a transient embedding failure (e.g. a network outage at
the embedding endpoint triggers an exception), at least the completed
vectors are written to the store instead of being discarded.
* If there is an issue with the store (e.g. no write permissions), the
condition is detected early without computing (and discarding!) all the
vectors.
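
For illustration, a simplified sketch of the batched loop this change introduces. The method and attribute names follow the real class, but this is an approximation, not the exact implementation:

```python
from typing import List

from langchain_core.utils.iter import batch_iterate


def embed_documents(self, texts: List[str]) -> List[List[float]]:
    # Fetch whatever is already cached; None marks a cache miss.
    vectors = self.document_embedding_store.mget(texts)
    missing_indices = [i for i, v in enumerate(vectors) if v is None]

    # Embed the misses batch by batch, writing each batch to the store as
    # soon as it is computed. A failure partway through loses at most the
    # current batch instead of all computed vectors, and a broken store
    # (e.g. no write permissions) fails on the first batch.
    for batch in batch_iterate(self.batch_size, missing_indices):
        batch_texts = [texts[i] for i in batch]
        batch_vectors = self.underlying_embeddings.embed_documents(batch_texts)
        self.document_embedding_store.mset(list(zip(batch_texts, batch_vectors)))
        for index, vector in zip(batch, batch_vectors):
            vectors[index] = vector

    return vectors
```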

**Issue:**
Implements enhancement #18026.

**Testing:**
I was unable to run unit tests; details in [this
post](https://github.com/langchain-ai/langchain/discussions/15019#discussioncomment-8576684).

---------

Signed-off-by: chrispy <chrispy@synopsys.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Author: Chris Papademetrious
Date: 2024-03-19 14:55:43 -04:00
Committed by: GitHub
Parent: 89af30807b
Commit: 305d74c67a
4 changed files with 74 additions and 11 deletions


```diff
@@ -165,8 +165,16 @@ class Tee(Generic[T]):
 safetee = Tee
 
 
-def batch_iterate(size: int, iterable: Iterable[T]) -> Iterator[List[T]]:
-    """Utility batching function."""
+def batch_iterate(size: Optional[int], iterable: Iterable[T]) -> Iterator[List[T]]:
+    """Utility batching function.
+
+    Args:
+        size: The size of the batch. If None, returns a single batch.
+        iterable: The iterable to batch.
+
+    Returns:
+        An iterator over the batches.
+    """
     it = iter(iterable)
     while True:
         chunk = list(islice(it, size))
```
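
The new `size=None` behavior falls out of `islice`: `islice(it, None)` drains the whole iterator, so a single batch is yielded. A quick illustration (not part of the diff):

```python
from langchain_core.utils.iter import batch_iterate

items = [0, 1, 2, 3, 4]

print(list(batch_iterate(3, items)))     # [[0, 1, 2], [3, 4]]
print(list(batch_iterate(None, items)))  # [[0, 1, 2, 3, 4]]
```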