Building applications with LLMs through composability
Rafal Wojdyla 039b672f46
Fixup OpenAI Embeddings - fix the weighted mean (#3778)
Re: https://github.com/hwchase17/langchain/issues/3777

Copy pasting from the issue:

While working on https://github.com/hwchase17/langchain/issues/3722 I
have noticed that there might be a bug in the current implementation of
the OpenAI length safe embeddings in `_get_len_safe_embeddings`, which
before https://github.com/hwchase17/langchain/issues/3722 was actually
the **default implementation** regardless of the length of the context
(via https://github.com/hwchase17/langchain/pull/2330).

It appears that the weights used are constant and equal to the length of
the embedding vector (1536), NOT the number of tokens in each batch as in
the reference implementation at
https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
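
To illustrate why constant weights matter, here is a minimal sketch in plain Python. The two toy 2-dimensional "embeddings", the token counts, and the helper names `weighted_mean` and `normalize` are assumptions made up for this example; they are not taken from the langchain code. With every weight equal to the same constant, the weighted mean collapses to a plain mean, so a short trailing chunk counts as much as a full one.

```python
import math

def weighted_mean(vectors, weights):
    """Component-wise weighted mean of equally sized vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Hypothetical chunk embeddings (real ones have 1536 dimensions).
chunk_embeddings = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical token counts: one full chunk and one short remainder.
chunk_token_counts = [8191, 1809]
# Stand-in for the constant weight (the embedding length) described above.
embedding_dim = len(chunk_embeddings[0])

# Constant weights: every chunk contributes equally.
constant_weighted = weighted_mean(chunk_embeddings, [embedding_dim] * 2)
# Token-count weights: longer chunks contribute proportionally more.
token_weighted = weighted_mean(chunk_embeddings, chunk_token_counts)

print(normalize(constant_weighted))
print(normalize(token_weighted))
```

The two results differ whenever the chunks have unequal token counts, which is the typical case for the last chunk of a long input.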

<hr>

Here's some debug info:

<img width="1094" alt="image"
src="https://user-images.githubusercontent.com/1419010/235286595-a8b55298-7830-45df-b9f7-d2a2ad0356e0.png">

<hr>

We can also validate this against the reference implementation:

<details>

<summary>Reference implementation (click to unroll)</summary>

This implementation is copy pasted from
https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb

```py
import openai
import tiktoken  # required by chunked_tokens below
from itertools import islice
import numpy as np
from tenacity import retry, wait_random_exponential, stop_after_attempt, retry_if_not_exception_type


EMBEDDING_MODEL = 'text-embedding-ada-002'
EMBEDDING_CTX_LENGTH = 8191
EMBEDDING_ENCODING = 'cl100k_base'

# let's make sure to not retry on an invalid request, because that is what we want to demonstrate
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6), retry=retry_if_not_exception_type(openai.InvalidRequestError))
def get_embedding(text_or_tokens, model=EMBEDDING_MODEL):
    return openai.Embedding.create(input=text_or_tokens, model=model)["data"][0]["embedding"]

def batched(iterable, n):
    """Batch data into tuples of length n. The last batch may be shorter."""
    # batched('ABCDEFG', 3) --> ABC DEF G
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while (batch := tuple(islice(it, n))):
        yield batch
        
def chunked_tokens(text, encoding_name, chunk_length):
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    chunks_iterator = batched(tokens, chunk_length)
    yield from chunks_iterator


def reference_safe_get_embedding(text, model=EMBEDDING_MODEL, max_tokens=EMBEDDING_CTX_LENGTH, encoding_name=EMBEDDING_ENCODING, average=True):
    chunk_embeddings = []
    chunk_lens = []
    for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_length=max_tokens):
        chunk_embeddings.append(get_embedding(chunk, model=model))
        chunk_lens.append(len(chunk))

    if average:
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)  # normalizes length to 1
        chunk_embeddings = chunk_embeddings.tolist()
    return chunk_embeddings
```

</details>
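
The chunking step of the reference implementation can be exercised offline, without an API key. The sketch below reuses the `batched` helper shown above; the list of 10000 integer token ids is a made-up stand-in for the output of a tokenizer, not the real token count of any text.

```python
from itertools import islice

def batched(iterable, n):
    """Batch data into tuples of length n. The last batch may be shorter."""
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while (batch := tuple(islice(it, n))):
        yield batch

# Hypothetical token ids standing in for encoding.encode(text).
tokens = list(range(10000))

# These lengths are what the reference implementation passes as weights.
chunk_lens = [len(chunk) for chunk in batched(tokens, 8191)]
print(chunk_lens)
```

Because the last chunk is usually shorter than the others, the per-chunk token counts are rarely all equal, which is why they are used as weights.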

```py
long_text = 'foo bar' * 5000

reference_safe_get_embedding(long_text, average=True)[:10]

# Here's the first 10 floats from the reference embeddings:
[0.004407593824276758,
 0.0017611146161865465,
 -0.019824815970984996,
 -0.02177626039794025,
 -0.012060967454897886,
 0.0017955296329155309,
 -0.015609168983609643,
 -0.012059823076681351,
 -0.016990468527792825,
 -0.004970484452089445]


# and now langchain implementation
from langchain.embeddings.openai import OpenAIEmbeddings
OpenAIEmbeddings().embed_query(long_text)[:10]

[0.003791506184693747,
 0.0025310066579390025,
 -0.019282322699514628,
 -0.021492679249899803,
 -0.012598522213242891,
 0.0022181168611315662,
 -0.015858940621301307,
 -0.011754004130791204,
 -0.016402944319627515,
 -0.004125287485127554]

# clearly they are different ^
```
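
Rather than comparing the floats by eye, the difference can be checked numerically. This is a small sketch using the two 10-element slices printed above; the helper name `max_abs_diff` and the tolerance of `1e-6` are arbitrary choices for illustration, and the check covers only the first 10 of the 1536 components.

```python
import math

# First 10 floats from the reference implementation (copied from above).
reference = [
    0.004407593824276758, 0.0017611146161865465, -0.019824815970984996,
    -0.02177626039794025, -0.012060967454897886, 0.0017955296329155309,
    -0.015609168983609643, -0.012059823076681351, -0.016990468527792825,
    -0.004970484452089445,
]

# First 10 floats from the langchain implementation (copied from above).
langchain_result = [
    0.003791506184693747, 0.0025310066579390025, -0.019282322699514628,
    -0.021492679249899803, -0.012598522213242891, 0.0022181168611315662,
    -0.015858940621301307, -0.011754004130791204, -0.016402944319627515,
    -0.004125287485127554,
]

def max_abs_diff(a, b):
    """Largest absolute component-wise difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

diff = max_abs_diff(reference, langchain_result)
matches = all(
    math.isclose(x, y, abs_tol=1e-6)
    for x, y in zip(reference, langchain_result)
)
print(f"max abs diff: {diff:.6f}, matches within tolerance: {matches}")
```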
2023-05-01 15:47:38 -07:00
| Name | Last commit | Date |
| --- | --- | --- |
| .github | Use a consistent poetry version everywhere (#3250) | 2023-04-24 18:19:51 -07:00 |
| docs | [docs]: updates connecting_to_a_feature_store.ipynb (#3776) | 2023-05-01 15:45:59 -07:00 |
| langchain | Fixup OpenAI Embeddings - fix the weighted mean (#3778) | 2023-05-01 15:47:38 -07:00 |
| tests | Add SQLiteChatMessageHistory (#3534) | 2023-05-01 15:40:00 -07:00 |
| .dockerignore | fix: tests with Dockerfile (#2382) | 2023-04-04 06:47:19 -07:00 |
| .flake8 | change run to use args and kwargs (#367) | 2022-12-18 15:54:56 -05:00 |
| .gitignore | Allow clearing cache and fix gptcache (#3493) | 2023-04-26 22:03:50 -07:00 |
| CITATION.cff | bump version to 0069 (#710) | 2023-01-24 00:24:54 -08:00 |
| Dockerfile | feat: add pytest-vcr for recording HTTP interactions in integration tests (#2445) | 2023-04-07 07:28:57 -07:00 |
| LICENSE | add license (#50) | 2022-11-01 21:12:02 -07:00 |
| Makefile | Add lint_diff command (#2449) | 2023-04-05 09:34:24 -07:00 |
| poetry.lock | Harrison/csv loader (#3771) | 2023-04-28 21:54:24 -07:00 |
| poetry.toml | fix Poetry 1.4.0+ installation (#1935) | 2023-03-27 08:27:54 -07:00 |
| pyproject.toml | bump version to 154 (#3846) | 2023-04-30 17:49:58 -07:00 |
| README.md | Update README.md (#3643) | 2023-04-27 08:14:09 -07:00 |
| readthedocs.yml | update rtd config (#1664) | 2023-03-14 10:40:06 -07:00 |

🦜🔗 LangChain

Building applications with LLMs through composability


Looking for the JS/TS version? Check out LangChain.js.

Production Support: As you move your LangChains into production, we'd love to offer more comprehensive support. Please fill out this form and we'll set up a dedicated support Slack channel.

Quick Install

`pip install langchain` or `conda install langchain -c conda-forge`
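
After installing, one way to confirm the package is visible to the current interpreter is to query its installed version through the standard library. This is only a sketch; the helper name `installed_version` is made up for this example.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

version = installed_version("langchain")
if version is None:
    print("langchain is not installed in this environment")
else:
    print(f"langchain {version} is installed")
```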

🤔 What is this?

Large language models (LLMs) are emerging as a transformative technology, enabling developers to build applications that they previously could not. However, using these LLMs in isolation is often insufficient for creating a truly powerful app - the real power comes when you can combine them with other sources of computation or knowledge.

This library aims to assist in the development of those types of applications. Common examples of these applications include:

❓ Question Answering over specific documents

💬 Chatbots

🤖 Agents

📖 Documentation

Please see here for full documentation on:

  • Getting started (installation, setting up the environment, simple examples)
  • How-To examples (demos, integrations, helper functions)
  • Reference (full API docs)
  • Resources (high-level explanation of core concepts)

🚀 What can this help with?

There are six main areas that LangChain is designed to help with. These are, in increasing order of complexity:

📃 LLMs and Prompts:

This includes prompt management, prompt optimization, a generic interface for all LLMs, and common utilities for working with LLMs.

🔗 Chains:

Chains go beyond a single LLM call and involve sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.
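
As a rough illustration of the idea, and not LangChain's actual interface, a chain can be thought of as a sequence of steps where each step's output feeds the next. Everything in this sketch (`fake_llm`, `make_chain`, `build_prompt`) is a hypothetical stand-in written in plain Python; the "LLM" here simply upper-cases its prompt.

```python
from typing import Callable, List

def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; it just upper-cases the prompt."""
    return prompt.upper()

def build_prompt(topic: str) -> str:
    """Turn a user-supplied topic into a full prompt."""
    return f"Write a tagline about {topic}"

def make_chain(steps: List[Callable[[str], str]]) -> Callable[[str], str]:
    """Compose steps so that each one receives the previous step's output."""
    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return run

# Prompt construction, then the (fake) model call, then post-processing.
chain = make_chain([build_prompt, fake_llm, str.strip])
result = chain("parrots")
print(result)
```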

📚 Data Augmented Generation:

Data Augmented Generation involves specific types of chains that first interact with an external data source to fetch data for use in the generation step. Examples include summarization of long pieces of text and question answering over specific data sources.
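
The fetch-then-generate pattern can be sketched in plain Python as follows. This is a toy under stated assumptions, not LangChain's API: `retrieve` ranks documents by naive word overlap (a real system would typically use embeddings or a search index), and `build_prompt` only assembles the text that would be sent to a model.

```python
def retrieve(question, documents, k=1):
    """Return the k documents sharing the most lowercase words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, context_docs):
    """Place the retrieved context ahead of the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Hypothetical external data source.
documents = [
    "LangChain is released under the MIT license.",
    "Parrots are colorful birds.",
]
question = "Which license is LangChain released under?"

top_docs = retrieve(question, documents)
prompt = build_prompt(question, top_docs)
print(prompt)
```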

🤖 Agents:

Agents involve an LLM making decisions about which Actions to take, taking that Action, seeing an Observation, and repeating that until done. LangChain provides a standard interface for agents, a selection of agents to choose from, and examples of end-to-end agents.
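
The decide, act, observe loop described above can be sketched without any LLM at all. In this hypothetical example, `scripted_policy` plays the role of the model's decision step and `add_tool` is the only available tool; none of these names come from LangChain.

```python
def add_tool(arg):
    """Toy tool: add two integers written as 'a+b'."""
    a, b = arg.split("+")
    return str(int(a) + int(b))

TOOLS = {"add": add_tool}

def scripted_policy(question, history):
    """Stand-in for the LLM: pick the next action from what has been observed."""
    if not history:
        return ("add", question)          # first step: call the tool
    return ("finish", history[-1][2])     # then: answer with the observation

def run_agent(question, policy, tools, max_steps=5):
    """Repeat decide -> act -> observe until the policy chooses to finish."""
    history = []
    for _ in range(max_steps):
        action, action_input = policy(question, history)
        if action == "finish":
            return action_input
        observation = tools[action](action_input)
        history.append((action, action_input, observation))
    raise RuntimeError("agent did not finish within max_steps")

answer = run_agent("2+3", scripted_policy, TOOLS)
print(answer)
```

The `max_steps` cap is a design choice for the sketch: it keeps a policy that never finishes from looping forever.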

🧠 Memory:

Memory refers to persisting state between calls of a chain/agent. LangChain provides a standard interface for memory, a collection of memory implementations, and examples of chains/agents that use memory.
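
A minimal way to picture this, assuming a simple buffer that stores every turn: earlier exchanges are saved and prepended to the next prompt. The `BufferMemory` class and `fake_chain` function below are hypothetical names for illustration and do not reflect LangChain's own memory classes.

```python
class BufferMemory:
    """Keeps every (human, ai) turn and renders them as prompt context."""

    def __init__(self):
        self.turns = []

    def save(self, user_input, ai_reply):
        self.turns.append((user_input, ai_reply))

    def as_context(self):
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self.turns)

def fake_chain(user_input, memory):
    """Build a prompt that includes past turns, then record the new turn."""
    context = memory.as_context()
    prompt = f"{context}\nHuman: {user_input}\nAI:".lstrip()
    reply = f"(reply #{len(memory.turns) + 1})"  # stand-in for a model reply
    memory.save(user_input, reply)
    return prompt

memory = BufferMemory()
first_prompt = fake_chain("My name is Ada.", memory)
second_prompt = fake_chain("What is my name?", memory)
print(second_prompt)
```

The second prompt contains the first exchange, which is what lets a model answer questions that depend on earlier turns.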

🧐 Evaluation:

[BETA] Generative models are notoriously hard to evaluate with traditional metrics. One new way of evaluating them is using language models themselves to do the evaluation. LangChain provides some prompts/chains for assisting in this.

For more information on these concepts, please see our full documentation.

💁 Contributing

As an open-source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.

For detailed information on how to contribute, see here.