partners/openai: OpenAIEmbeddings not respecting chunk_size argument (#30757)

When calling `embed_documents` and providing a `chunk_size` argument,
that argument is ignored when `OpenAIEmbeddings` is instantiated with
its default configuration (where `check_embedding_ctx_length=True`).

`_get_len_safe_embeddings` accepts a `chunk_size` parameter, but
`embed_documents`, its only caller, does not pass it through.
This appears to be an oversight, especially given that the
`_get_len_safe_embeddings` docstring states it should respect "the set
embedding context length and chunk size."
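
For context, the chunk size controls how many tokenized sub-chunks are sent to the embeddings API per request. A minimal sketch of that batching behavior (the function name and shapes here are illustrative, not the library's internals):

```python
from typing import Iterator


def batch_by_chunk_size(
    tokens: list[list[int]], chunk_size: int
) -> Iterator[list[list[int]]]:
    """Yield successive batches of at most `chunk_size` tokenized chunks,
    mirroring how a length-safe embedding helper slices its API requests."""
    for i in range(0, len(tokens), chunk_size):
        yield tokens[i : i + chunk_size]


# With chunk_size=2, five chunks should produce three requests of sizes 2, 2, 1.
sizes = [len(b) for b in batch_by_chunk_size([[1], [2], [3], [4], [5]], chunk_size=2)]
```

If the `chunk_size` argument is dropped on the way to the length-safe helper, the helper falls back to the instance default, which is what produces the oversized requests described above.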

Developers typically expect an explicitly provided method parameter to
take effect, and to take precedence over the instance default, especially
when the object was instantiated with its default configuration. I was
confused as to why my API calls were being rejected regardless of the
chunk size I provided.

This bug also exists in the `langchain_community` package. I can add that
fix to this PR if requested; otherwise, I will open a separate PR once
this one passes.
Author: Aubrey Ford
Date: 2025-04-18 12:27:27 -07:00 (committed via GitHub)
Parent: 017c8079e1
Commit: b344f34635
2 changed files with 26 additions and 2 deletions

@@ -573,7 +573,9 @@ class OpenAIEmbeddings(BaseModel, Embeddings):
         # NOTE: to keep things simple, we assume the list may contain texts longer
         # than the maximum context and use length-safe embedding function.
         engine = cast(str, self.deployment)
-        return self._get_len_safe_embeddings(texts, engine=engine)
+        return self._get_len_safe_embeddings(
+            texts, engine=engine, chunk_size=chunk_size
+        )
 
     async def aembed_documents(
         self, texts: list[str], chunk_size: int | None = None
@@ -603,7 +605,9 @@ class OpenAIEmbeddings(BaseModel, Embeddings):
         # NOTE: to keep things simple, we assume the list may contain texts longer
         # than the maximum context and use length-safe embedding function.
         engine = cast(str, self.deployment)
-        return await self._aget_len_safe_embeddings(texts, engine=engine)
+        return await self._aget_len_safe_embeddings(
+            texts, engine=engine, chunk_size=chunk_size
+        )
 
     def embed_query(self, text: str) -> list[float]:
         """Call out to OpenAI's embedding endpoint for embedding query text.