Files
langchain/tests
William FH c5edbea34a Load Run Evaluator (#7101)
Current problems:
1. Evaluating LLMs or Chat models isn't smooth. Even specifying
'generations' as the output inserts a redundant list into the eval
template
2. Configuring input / prediction / reference keys in the
`get_qa_evaluator` function is confusing. Unless you are using a chain
with the default keys, you have to specify all the variables and need to
reason about whether the key corresponds to the traced run's inputs,
outputs or the examples inputs or outputs.


Proposal:
- Configure the run evaluator according to a model. Use the model type
and input/output keys to assert compatibility where possible. Only need
to specify a reference_key for certain evaluators (which is less
confusing than specifying input keys)


When does this work:
- If you have your langchain model available (assumed always for
run_on_dataset flow)
- If you are evaluating an LLM, Chat model, or chain
- If the LLM or chat models are traced by langchain (wouldn't work if
you add an incompatible schema via the REST API)

When would this fail:
- Currently if you directly create an example from an LLM run, the
outputs are generations with all the extra metadata present. A simple
`example_key` and dumping all to the template could make the evaluations
unreliable
- Doesn't help if you're not using the low level API
- If you want to instantiate the evaluator without instantiating your
chain or LLM (maybe common for monitoring, for instance) -> could also
load from run or run type though

What's ugly:
- Personally think it's better to load evaluators one by one since
passing a config down is pretty confusing.
- Lots of testing needs to be added
- Inconsistent in that it makes a separate run and example input mapper
instead of the original `RunEvaluatorInputMapper`, which maps a run and
example to a single input.

Example usage running the for an LLM, Chat Model, and Agent.

```
# Test running for the string evaluators
evaluator_names = ["qa", "criteria"]

model = ChatOpenAI()
configured_evaluators = load_run_evaluators_for_model(evaluator_names, model=model, reference_key="answer")
run_on_dataset(ds_name, model, run_evaluators=configured_evaluators)
```


<details>
  <summary>Full code with dataset upload</summary>
```
## Create dataset
from langchain.evaluation.run_evaluators.loading import load_run_evaluators_for_model
from langchain.evaluation import load_dataset
import pandas as pd

lcds = load_dataset("llm-math")
df = pd.DataFrame(lcds)

from uuid import uuid4
from langsmith import Client
client = Client()
ds_name = "llm-math - " + str(uuid4())[0:8]
ds = client.upload_dataframe(df, name=ds_name, input_keys=["question"], output_keys=["answer"])



## Define the models we'll test over
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, AgentType

from langchain.tools import tool

llm = OpenAI(temperature=0)
chat_model = ChatOpenAI(temperature=0)

@tool
    def sum(a: float, b: float) -> float:
        """Add two numbers"""
        return a + b
    
def construct_agent():
    return initialize_agent(
        llm=chat_model,
        tools=[sum],
        agent=AgentType.OPENAI_MULTI_FUNCTIONS,
    )

agent = construct_agent()

# Test running for the string evaluators
evaluator_names = ["qa", "criteria"]

models = [llm, chat_model, agent]
run_evaluators = []
for model in models:
    run_evaluators.append(load_run_evaluators_for_model(evaluator_names, model=model, reference_key="answer"))
    

# Run on LLM, Chat Model, and Agent
from langchain.client.runner_utils import run_on_dataset

to_test = [llm, chat_model, construct_agent]

for model, configured_evaluators in zip(to_test, run_evaluators):
    run_on_dataset(ds_name, model, run_evaluators=configured_evaluators, verbose=True)
```
</details>

---------

Co-authored-by: Nuno Campos <nuno@boringbits.io>
2023-07-07 19:57:59 -07:00
..
2023-04-05 10:35:46 -07:00
2023-07-07 19:57:59 -07:00
2022-10-24 14:51:15 -07:00
2023-04-13 21:49:31 -07:00

Readme tests(draft)

Integrations Tests

Prepare

This repository contains functional tests for several search engines and databases. The tests aim to verify the correct behavior of the engines and databases according to their specifications and requirements.

To run some integration tests, such as tests located in tests/integration_tests/vectorstores/, you will need to install the following software:

  • Docker
  • Python 3.8.1 or later

We have optional group test_integration in the pyproject.toml file. This group should contain dependencies for the integration tests and can be installed using the command:

poetry install --with test_integration

Any new dependencies should be added by running:

# add package and install it after adding:
poetry add tiktoken@latest --group "test_integration" && poetry install --with test_integration

Before running any tests, you should start a specific Docker container that has all the necessary dependencies installed. For instance, we use the elasticsearch.yml container for test_elasticsearch.py:

cd tests/integration_tests/vectorstores/docker-compose
docker-compose -f elasticsearch.yml up

Prepare environment variables for local testing:

  • copy tests/.env.example to tests/.env
  • set variables in tests/.env file, e.g OPENAI_API_KEY

Additionally, it's important to note that some integration tests may require certain environment variables to be set, such as OPENAI_API_KEY. Be sure to set any required environment variables before running the tests to ensure they run correctly.

Recording HTTP interactions with pytest-vcr

Some of the integration tests in this repository involve making HTTP requests to external services. To prevent these requests from being made every time the tests are run, we use pytest-vcr to record and replay HTTP interactions.

When running tests in a CI/CD pipeline, you may not want to modify the existing cassettes. You can use the --vcr-record=none command-line option to disable recording new cassettes. Here's an example:

pytest --log-cli-level=10 tests/integration_tests/vectorstores/test_pinecone.py --vcr-record=none
pytest tests/integration_tests/vectorstores/test_elasticsearch.py --vcr-record=none

Run some tests with coverage:

pytest tests/integration_tests/vectorstores/test_elasticsearch.py --cov=langchain --cov-report=html
start "" htmlcov/index.html || open htmlcov/index.html