Commit df4e0e6d81 (parent 0ab8e5cfe0), mirror of https://github.com/hwchase17/langchain.git, synced 2025-04-27

docs/docs/contributing/how_to/integrations/retriever_guide.md (new file, 206 lines)

---
pagination_prev: contributing/how_to/integrations/index
pagination_next: contributing/how_to/integrations/publish
---

# How to implement and test a retriever integration

In this guide, we'll implement and test a custom [retriever](/docs/concepts/retrievers) that you have integrated with LangChain.

For testing, we will rely on the `langchain-tests` dependency we added in the previous [package creation guide](/docs/contributing/how_to/integrations/package).

## Implementation

Let's say you're building a simple integration package that provides a `ToyRetriever` retriever integration for LangChain. Here's what your project structure might look like:

```plaintext
langchain-parrot-link/
├── langchain_parrot_link/
│   ├── __init__.py
│   └── retrievers.py
├── tests/
│   └── integration_tests
│       ├── __init__.py
│       └── test_retrievers.py
├── pyproject.toml
└── README.md
```

In this first step, we will implement the `retrievers.py` file.

import CustomRetrieverIntro from '/docs/how_to/_custom_retriever_intro.mdx';

<CustomRetrieverIntro />

<details>
<summary>retrievers.py</summary>

```python title="langchain_parrot_link/retrievers.py"
from typing import Any

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever


class ParrotRetriever(BaseRetriever):
    parrot_name: str
    k: int = 3

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun, **kwargs: Any
    ) -> list[Document]:
        k = kwargs.get("k", self.k)
        return [Document(page_content=f"{self.parrot_name} says: {query}")] * k
```

</details>

:::tip

The `ParrotRetriever` from this guide is run against the standard unit and integration tests in the LangChain GitHub repository, so you can use [its test suite](https://github.com/langchain-ai/langchain/blob/master/libs/standard-tests/tests/unit_tests/test_basic_retriever.py) as a starting point.

:::

## Testing

### 1. Create Your Retriever Class

```python
from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever


class MyCustomRetriever(BaseRetriever):
    """Custom retriever implementation."""

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Core implementation of retrieving relevant documents."""
        # Your implementation here
        pass

    async def _aget_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Async implementation of retrieving relevant documents."""
        # Your async implementation here
        pass
```

### 2. Required Testing

All retrievers must include the following tests:

#### Basic Functionality Tests

```python
import pytest

from langchain_core.documents import Document


def test_get_relevant_documents():
    retriever = MyCustomRetriever()
    docs = retriever.get_relevant_documents("test query")
    assert isinstance(docs, list)
    assert all(isinstance(doc, Document) for doc in docs)


@pytest.mark.asyncio
async def test_aget_relevant_documents():
    retriever = MyCustomRetriever()
    docs = await retriever.aget_relevant_documents("test query")
    assert isinstance(docs, list)
    assert all(isinstance(doc, Document) for doc in docs)
```

#### Edge Cases

- Empty query handling
- Special character handling
- Long query handling
- Rate limiting (if applicable)
- Error handling
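
These edge cases can be exercised with plain assertions. The sketch below uses a minimal stand-in retriever with no LangChain dependency (`KeywordRetriever` is an illustrative name, not an API) so the checks are easy to see in isolation:

```python
class KeywordRetriever:
    """Minimal stand-in for a retriever, used only to illustrate edge-case checks."""

    def __init__(self, docs):
        self.docs = docs

    def get_relevant_documents(self, query):
        q = query.lower()
        # An empty query matches nothing rather than raising
        return [d for d in self.docs if q and q in d.lower()]


retriever = KeywordRetriever(["Parrots can talk", "Crows use tools"])

# Empty query handling: returns an empty list, does not raise
assert retriever.get_relevant_documents("") == []

# Special character handling: no crash, still returns a list
assert isinstance(retriever.get_relevant_documents("!@#$%\n中文"), list)

# Long query handling: degrades gracefully instead of failing
assert isinstance(retriever.get_relevant_documents("parrot " * 1000), list)

# Normal retrieval still works
assert retriever.get_relevant_documents("parrot") == ["Parrots can talk"]
```

For a real integration, the same assertions would target your retriever class and likely live in parametrized pytest functions.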

### 3. Documentation Requirements

Your retriever should include:

1. Class docstring with:
   - General description
   - Required dependencies
   - Example usage
   - Parameters explanation

2. Integration documentation file:
   - Installation instructions
   - Basic usage example
   - Advanced configuration
   - Common issues and solutions

### 4. Best Practices

1. **Error Handling**
   - Implement proper error handling for API calls
   - Provide meaningful error messages
   - Handle rate limits gracefully

2. **Performance**
   - Implement caching when appropriate
   - Use batch operations where possible
   - Consider implementing both sync and async methods

3. **Configuration**
   - Use environment variables for sensitive data
   - Provide sensible defaults
   - Allow for customization of key parameters

4. **Type Hints**
   - Use proper type hints throughout your code
   - Document expected types in docstrings
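
Two of the practices above, meaningful errors and environment-based configuration with sensible defaults, can be sketched in plain Python. All names here (`RetrieverError`, `ConfiguredRetriever`, `MY_RETRIEVER_K`) are illustrative, not part of any LangChain API:

```python
import os


class RetrieverError(RuntimeError):
    """Hypothetical error type that adds retriever-level context to low-level failures."""


class ConfiguredRetriever:
    """Illustrative retriever shell: explicit arguments win, environment fills the gaps."""

    def __init__(self, k=None, backend=None):
        # Prefer an explicit argument; fall back to an env var, then a sensible default
        self.k = k if k is not None else int(os.environ.get("MY_RETRIEVER_K", "3"))
        self.backend = backend

    def get_relevant_documents(self, query):
        try:
            return self.backend(query)[: self.k]
        except ConnectionError as exc:
            # Wrap the raw failure in a meaningful, retriever-level error
            raise RetrieverError(f"backend unreachable for query {query!r}") from exc


os.environ["MY_RETRIEVER_K"] = "2"
r = ConfiguredRetriever(backend=lambda q: [q, q, q])
assert len(r.get_relevant_documents("hello")) == 2  # env var applied


def down(query):
    raise ConnectionError("connection refused")


try:
    ConfiguredRetriever(backend=down).get_relevant_documents("hello")
except RetrieverError as exc:
    assert "unreachable" in str(exc)
else:
    raise AssertionError("expected RetrieverError")
```

In a real `BaseRetriever` subclass the same pattern applies, typically via a pydantic field with a `default_factory` that reads the environment.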

## Example Implementation

Here's a minimal example of a custom retriever:

```python
from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever


class SimpleKeywordRetriever(BaseRetriever):
    """A simple retriever that matches documents based on keywords."""

    documents: List[Document]  # Store your documents here

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Return documents that contain the query string."""
        return [
            doc for doc in self.documents
            if query.lower() in doc.page_content.lower()
        ]

    async def _aget_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Async version of `_get_relevant_documents`."""
        return self._get_relevant_documents(query, run_manager=run_manager)
```

## Submission Checklist

- [ ] Implemented base retriever interface
- [ ] Added comprehensive tests
- [ ] Included proper documentation
- [ ] Added type hints
- [ ] Handled error cases
- [ ] Implemented both sync and async methods
- [ ] Added example usage
- [ ] Followed code style guidelines
- [ ] Updated dependency metadata (`pyproject.toml`)

## Getting Help

If you need help while implementing your retriever:

1. Check existing retriever implementations for reference
2. Open a discussion in the GitHub repository
3. Ask in the LangChain Discord community

Remember to follow the existing patterns in the codebase and maintain consistency with other retrievers.

docs/docs/contributing/how_to/integrations/retriever_tests.md (new file, 207 lines)

# Standard Tests for LangChain Retrievers

This guide outlines the standard tests that should be implemented for all LangChain retrievers.

## Test Structure

### 1. Basic Functionality Tests

```python
import pytest

from langchain_core.documents import Document

from your_retriever import YourRetriever


def test_basic_retrieval():
    """Test basic document retrieval functionality."""
    retriever = YourRetriever()
    query = "test query"
    docs = retriever.get_relevant_documents(query)

    assert isinstance(docs, list)
    assert all(isinstance(doc, Document) for doc in docs)
    assert len(docs) > 0  # Adjust if your retriever might return empty results


@pytest.mark.asyncio
async def test_async_retrieval():
    """Test async document retrieval functionality."""
    retriever = YourRetriever()
    query = "test query"
    docs = await retriever.aget_relevant_documents(query)

    assert isinstance(docs, list)
    assert all(isinstance(doc, Document) for doc in docs)
```

### 2. Edge Cases

```python
def test_empty_query():
    """Test behavior with empty query."""
    retriever = YourRetriever()
    docs = retriever.get_relevant_documents("")
    assert isinstance(docs, list)


def test_special_characters():
    """Test handling of special characters."""
    retriever = YourRetriever()
    special_queries = [
        "test!@#$%^&*()",
        "múltiple áccents",
        "中文测试",
        "test\nwith\nnewlines",
    ]
    for query in special_queries:
        docs = retriever.get_relevant_documents(query)
        assert isinstance(docs, list)


def test_long_query():
    """Test handling of very long queries."""
    retriever = YourRetriever()
    long_query = "test " * 1000
    docs = retriever.get_relevant_documents(long_query)
    assert isinstance(docs, list)
```

### 3. Error Handling

```python
def test_invalid_configuration():
    """Test behavior with invalid configuration."""
    with pytest.raises(ValueError):
        YourRetriever(invalid_param="invalid")


def test_connection_error():
    """Test behavior when connection fails (if applicable)."""
    retriever = YourRetriever()
    # Simulate the connection failure here, e.g. by patching your API client,
    # before asserting that the error propagates
    with pytest.raises(ConnectionError):
        retriever.get_relevant_documents("test")
```

### 4. Performance Tests (Optional)

```python
@pytest.mark.slow
def test_large_scale_retrieval():
    """Test retrieval with a large number of documents."""
    retriever = YourRetriever()
    # Test with a significant number of documents
    docs = retriever.get_relevant_documents("test")
    assert len(docs) <= YOUR_MAX_LIMIT  # If applicable


@pytest.mark.slow
def test_concurrent_requests():
    """Test handling of concurrent requests."""
    import asyncio

    async def run_concurrent_requests():
        retriever = YourRetriever()
        tasks = [
            retriever.aget_relevant_documents("test")
            for _ in range(5)
        ]
        results = await asyncio.gather(*tasks)
        return results

    results = asyncio.run(run_concurrent_requests())
    assert len(results) == 5
```

### 5. Integration Tests

```python
def test_chain_integration():
    """Test integration with LangChain chains."""
    from langchain.chains import RetrievalQA
    from langchain_community.llms.fake import FakeListLLM

    retriever = YourRetriever()
    llm = FakeListLLM(responses=["test answer"])
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=retriever,
        chain_type="stuff",
    )
    result = qa_chain.run("test query")
    assert isinstance(result, str)
```

## Test Configuration

```python
# conftest.py
import pytest

from langchain_core.documents import Document

from your_retriever import YourRetriever


def pytest_configure(config):
    config.addinivalue_line(
        "markers", "slow: marks tests as slow (deselect with '-m \"not slow\"')"
    )


@pytest.fixture
def sample_documents():
    """Fixture providing sample documents for testing."""
    return [
        Document(page_content="test document 1", metadata={"source": "test1"}),
        Document(page_content="test document 2", metadata={"source": "test2"}),
    ]


@pytest.fixture
def mock_retriever(sample_documents):
    """Fixture providing a retriever with sample documents."""
    retriever = YourRetriever()
    # Set up the retriever with sample_documents here
    return retriever
```

## Running Tests

To run the tests:

```bash
# Run all tests
pytest tests/retrievers/test_your_retriever.py

# Run only fast tests
pytest tests/retrievers/test_your_retriever.py -m "not slow"

# Run with coverage
pytest tests/retrievers/test_your_retriever.py --cov=your_retriever
```

## Best Practices

1. **Isolation**: Each test should be independent and not rely on state from other tests.

2. **Mocking**: Use mocks for external services to avoid actual API calls during testing:
   ```python
   @pytest.fixture
   def mock_api(mocker):
       return mocker.patch("your_retriever.api_client")
   ```

3. **Parametrization**: Use `pytest.mark.parametrize` for testing multiple scenarios:
   ```python
   @pytest.mark.parametrize("query,expected_count", [
       ("test", 1),
       ("invalid", 0),
       ("multiple words", 2),
   ])
   def test_retrieval_counts(query, expected_count):
       retriever = YourRetriever()
       docs = retriever.get_relevant_documents(query)
       assert len(docs) == expected_count
   ```

4. **Documentation**: Include docstrings in test functions explaining what they test.

5. **Coverage**: Aim for high test coverage, especially for core functionality.

## Common Pitfalls

1. Not testing error cases
2. Not testing async functionality
3. Not handling rate limits in tests
4. Missing edge cases
5. Relying on external services in unit tests
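
Pitfall 5 is worth a concrete illustration: inject the external dependency so unit tests never touch the network. This sketch uses only the standard library; `search_backend` and `retrieve` are hypothetical names, not LangChain APIs:

```python
from unittest.mock import MagicMock


def search_backend(query):
    """Stand-in for a real network call; unit tests should never reach it."""
    raise ConnectionError("no network access in unit tests")


def retrieve(query, backend=search_backend):
    """Retrieve hit texts from whichever backend is injected."""
    return [hit["text"] for hit in backend(query)]


# Inject a mock backend so the test is offline and deterministic
mock_backend = MagicMock(return_value=[{"text": "doc 1"}, {"text": "doc 2"}])
assert retrieve("test", backend=mock_backend) == ["doc 1", "doc 2"]
mock_backend.assert_called_once_with("test")
```

The same idea underlies the `mocker.patch` fixture shown under Best Practices: the test controls the backend's responses instead of depending on a live service.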

Remember to adapt these tests based on your retriever's specific functionality and requirements.

docs/docs/how_to/_custom_retriever_intro.mdx (new file, 23 lines)

To create your own retriever, you need to extend the `BaseRetriever` class and implement the following methods:

| Method                     | Description                                | Required/Optional |
|----------------------------|--------------------------------------------|-------------------|
| `_get_relevant_documents`  | Get documents relevant to a query.         | Required          |
| `_aget_relevant_documents` | Implement to provide async native support. | Optional          |

The logic inside of `_get_relevant_documents` can involve arbitrary calls to a database or to the web using requests.

:::tip
By inheriting from `BaseRetriever`, your retriever automatically becomes a LangChain [Runnable](/docs/concepts/runnables) and will gain the standard `Runnable` functionality out of the box!
:::

:::info
You can also use a `RunnableLambda` or `RunnableGenerator` to implement a retriever.

The main benefit of implementing a retriever as a `BaseRetriever` vs. a `RunnableLambda` (a custom [runnable function](/docs/how_to/functions)) is that a `BaseRetriever` is a well-known LangChain entity, so some tooling for monitoring may implement specialized behavior for retrievers. Another difference is that a `BaseRetriever` will behave slightly differently from `RunnableLambda` in some APIs; e.g., the `start` event in the `astream_events` API will be `on_retriever_start` instead of `on_chain_start`.
:::

Also changed in this commit (hunk `@@ -27,29 +27,9 @@`): the existing custom-retriever how-to page drops its inlined "Interface" section, which duplicated the method table and callouts above verbatim, and instead imports the shared snippet with `import CustomRetrieverIntro from './_custom_retriever_intro.mdx';` and renders `<CustomRetrieverIntro />`.