langchain/libs/core/langchain_core/utils
Chris Papademetrious 305d74c67a
core: implement a batch_size parameter for CacheBackedEmbeddings (#18070)
**Description:**

Currently, `CacheBackedEmbeddings` computes vectors for *all* uncached
documents before updating the store. This pull request updates the
embedding computation loop to compute embeddings in batches, updating
the store after each batch.

I noticed the problem when I tried `CacheBackedEmbeddings` on our
30k-document set and the cache directory still hadn't appeared on disk
after 30 minutes.

The motivation is to minimize compute/data loss when problems occur:

* If there is a transient embedding failure (e.g. a network outage at
the embedding endpoint triggers an exception), at least the completed
vectors are written to the store instead of being discarded.
* If there is an issue with the store (e.g. no write permissions), the
condition is detected early without computing (and discarding!) all the
vectors.
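
The batching behavior and the first failure case above can be illustrated with a minimal, self-contained sketch. This is not the code from the pull request: `embed_with_cache`, `flaky_embed`, the local `batch_iterate` helper, and the use of a plain `dict` as the store are all assumptions made for illustration.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Vector = List[float]


def batch_iterate(size: Optional[int], iterable: Iterable[str]) -> Iterator[List[str]]:
    """Yield lists of up to `size` items; a size of None yields a single batch."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def embed_with_cache(
    texts: List[str],
    embed: Callable[[List[str]], List[Vector]],
    store: Dict[str, Vector],
    batch_size: Optional[int] = None,
) -> List[Vector]:
    """Embed only uncached texts, writing to the store after every batch."""
    # De-duplicate while preserving order, then keep only the cache misses.
    missing = [t for t in dict.fromkeys(texts) if t not in store]
    for batch in batch_iterate(batch_size, missing):
        vectors = embed(batch)             # may raise on a transient failure
        store.update(zip(batch, vectors))  # completed work is persisted here
    return [store[t] for t in texts]


# Hypothetical embedder that fails on its third call to simulate an outage.
calls = {"n": 0}


def flaky_embed(batch: List[str]) -> List[Vector]:
    calls["n"] += 1
    if calls["n"] == 3:
        raise ConnectionError("simulated outage at the embedding endpoint")
    return [[float(len(t))] for t in batch]


store: Dict[str, Vector] = {}
texts = [f"doc {i}" for i in range(10)]

try:
    embed_with_cache(texts, flaky_embed, store, batch_size=4)
except ConnectionError:
    pass

# The two batches that completed before the failure are already in the store.
cached_after_failure = len(store)

# A retry only has to embed the texts that are still missing.
result = embed_with_cache(texts, flaky_embed, store, batch_size=4)
```

With `batch_size=None` the sketch degenerates to the old behavior: one large batch and a single store update at the end, so a failure loses everything computed so far.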

**Issue:**
Implements enhancement #18026.

**Testing:**
I was unable to run unit tests; details in [this
post](https://github.com/langchain-ai/langchain/discussions/15019#discussioncomment-8576684).

---------

Signed-off-by: chrispy <chrispy@synopsys.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
2024-03-19 18:55:43 +00:00

| File | Last commit | Date |
| --- | --- | --- |
| `__init__.py` | core[minor]: Image prompt template (#14263) | 2024-01-27 17:04:29 -08:00 |
| `_merge.py` | core[minor]: generation info on msg (#18592) | 2024-03-12 04:43:17 +00:00 |
| `aiter.py` | Separate out langchain_core package (#13577) | 2023-11-20 13:09:30 -08:00 |
| `env.py` | Improve: remove extra spaces in get_from_env error (#15064) | 2023-12-22 11:50:03 -08:00 |
| `formatting.py` | core[patch]: docstring update (#16813) | 2024-02-09 12:47:41 -08:00 |
| `function_calling.py` | core: update _rm_titles to account for title argument name bug (#19036) | 2024-03-18 21:25:06 -07:00 |
| `html.py` | core[patch], community[patch]: link extraction continue on failure (#17200) | 2024-02-07 14:15:30 -08:00 |
| `image.py` | core[minor]: Image prompt template (#14263) | 2024-01-27 17:04:29 -08:00 |
| `input.py` | infra: add print rule to ruff (#16221) | 2024-02-09 16:13:30 -08:00 |
| `interactive_env.py` | core[patch]: simple prompt pretty printing (#15968) | 2024-01-12 21:08:51 -05:00 |
| `iter.py` | core: implement a batch_size parameter for CacheBackedEmbeddings (#18070) | 2024-03-19 18:55:43 +00:00 |
| `json_schema.py` | core[patch]: fixed circular dependency with json schema (#18657) | 2024-03-12 05:42:45 +00:00 |
| `loading.py` | core[patch]: deprecate hwchase17/langchain-hub, address path traversal (#18600) | 2024-03-05 12:49:38 -08:00 |
| `pydantic.py` | Separate out langchain_core package (#13577) | 2023-11-20 13:09:30 -08:00 |
| `strings.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 2023-12-11 13:53:30 -08:00 |
| `utils.py` | Separate out langchain_core package (#13577) | 2023-11-20 13:09:30 -08:00 |