Continuing with the Tolkien-inspired series of langchain tools, I bring to you **The Fellowship of the Vectors**, AKA EmbeddingsClusteringFilter. This document filter uses embeddings to group document vectors into clusters, then lets you pick an arbitrary number of documents per cluster based on their proximity to the cluster center. The result is a representative sample of each cluster.

The original idea is from [Greg Kamradt](https://github.com/gkamradt), from this video (Level 4): https://www.youtube.com/watch?v=qaPMdcCqtWk&t=365s

I added a few tricks to make it a bit more versatile. You can parametrize what to do with duplicate documents in case of cluster overlap: replace the duplicate with the next closest document, or remove it. This allows you to use it as a special kind of redundancy filter too. Additionally, you can choose between two orderings: grouped by cluster, or respecting the original retriever scores.

In my use case, I used the docs grouped by cluster to run refine chains per cluster, generating summaries over a large corpus of documents.

Let me know if you want to change anything! @rlancemartin, @eyurtsev, @hwchase17

---------

Co-authored-by: rlm <pexpresss31@gmail.com>
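The selection step described in the commit message can be illustrated with a minimal pure-Python sketch. This is not the EmbeddingsClusteringFilter implementation: the function name `closest_to_centers` and its parameters are hypothetical, and the cluster centers are assumed to be already computed by some clustering step. It only shows how picking the closest documents per cluster, handling duplicates on overlap, and choosing an output order could fit together.

```python
import math
from typing import List, Sequence


def closest_to_centers(
    vectors: Sequence[Sequence[float]],
    centers: Sequence[Sequence[float]],
    num_closest: int = 1,
    remove_duplicates: bool = False,
    keep_original_order: bool = False,
) -> List[int]:
    """Return indices of the vectors closest to each (precomputed) center.

    Hypothetical helper for illustration only. Indices are grouped by
    cluster unless keep_original_order is True.
    """
    picked: List[int] = []
    for center in centers:
        # Rank every vector by Euclidean distance to this center.
        ranked = sorted(
            range(len(vectors)),
            key=lambda i: math.dist(vectors[i], center),
        )
        taken = 0
        for idx in ranked:
            if taken == num_closest:
                break
            if idx in picked:
                if remove_duplicates:
                    # Drop the duplicate: it uses up a slot, no replacement.
                    taken += 1
                # Otherwise skip it and try the next closest vector.
                continue
            picked.append(idx)
            taken += 1
    # Sorting the indices restores the order in which documents arrived.
    return sorted(picked) if keep_original_order else picked
```

With two overlapping centers, the default call replaces a duplicate with the next closest vector, while `remove_duplicates=True` simply returns fewer indices.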
Readme tests (draft)
Integration Tests
Prepare
This repository contains functional tests for several search engines and databases. The tests aim to verify the correct behavior of the engines and databases according to their specifications and requirements.
To run some integration tests, such as tests located in
tests/integration_tests/vectorstores/, you will need to install the following
software:
- Docker
- Python 3.8.1 or later
We have an optional dependency group, test_integration, in the pyproject.toml file. This
group should contain the dependencies for the integration tests and can be installed
using the command:
poetry install --with test_integration
Any new dependencies should be added by running:
# add package and install it after adding:
poetry add tiktoken@latest --group "test_integration" && poetry install --with test_integration
Before running any tests, you should start a specific Docker container that has all the
necessary dependencies installed. For instance, we use the elasticsearch.yml container
for test_elasticsearch.py:
cd tests/integration_tests/vectorstores/docker-compose
docker-compose -f elasticsearch.yml up
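Because the container takes a moment to start, it can help to wait until the service accepts connections before launching the tests. The following is a minimal sketch of such a check; the helper name `wait_for_port` is hypothetical and not part of this repository, and the host and port to pass in depend on what the compose file actually exposes.

```python
import socket
import time


def wait_for_port(
    host: str, port: int, timeout: float = 30.0, interval: float = 0.5
) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made before the timeout.
    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 9200)` would wait for a service on port 9200, assuming the Elasticsearch container is published on that port locally. An open port only shows that the process is listening, not that the service is fully initialized.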
Prepare environment variables for local testing:
- copy tests/.env.example to tests/.env
- set variables in the tests/.env file, e.g. OPENAI_API_KEY
Some integration tests require certain environment variables to be set, such as
OPENAI_API_KEY. Set any required environment variables before running the tests so
that they run correctly.
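A quick way to catch a missing variable before a long test run is to check for it up front. The sketch below is illustrative only: `parse_env_file` and `missing_variables` are hypothetical helpers, not part of this repository, and the parser assumes the .env file contains plain KEY=VALUE lines.

```python
from typing import Dict, Iterable, List, Mapping


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_variables(
    required: Iterable[str], environ: Mapping[str, str]
) -> List[str]:
    """Return the required names that are unset or empty in environ."""
    return [name for name in required if not environ.get(name)]
```

For instance, `missing_variables(["OPENAI_API_KEY"], os.environ)` (after `import os`) returns an empty list when the key is set, and `parse_env_file` can be applied to the contents of tests/.env to check the file itself.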
Recording HTTP interactions with pytest-vcr
Some of the integration tests in this repository make HTTP requests to external services. To avoid making these requests every time the tests run, we use pytest-vcr to record HTTP interactions into files called cassettes and replay them on later runs.
When running tests in a CI/CD pipeline, you may not want to modify the existing cassettes. You can use the --vcr-record=none command-line option to disable recording new cassettes. Here's an example:
pytest --log-cli-level=10 tests/integration_tests/vectorstores/test_pinecone.py --vcr-record=none
pytest tests/integration_tests/vectorstores/test_elasticsearch.py --vcr-record=none
Run some tests with coverage:
pytest tests/integration_tests/vectorstores/test_elasticsearch.py --cov=langchain --cov-report=html
start "" htmlcov/index.html || open htmlcov/index.html