**issue**
In Langchain, the original content is generally stored under the `text`
key. However, the `PineconeHybridSearchRetriever` searches the `context`
field in the metadata and cannot change this key. To address this, I
have modified the code to allow changing the key to something other than
context.
In my opinion, following Langchain's conventions, the `text` key seems
more appropriate than `context`. However, since I wasn't sure about the
author's intent, I have left the default value as `context`.
- Description: Adding getattr methods and set default value 500 to
cls.bulk_size, it can prevent the error below:
Error: type object 'OpenSearchVectorSearch' has no attribute 'bulk_size'
- Issue: https://github.com/langchain-ai/langchain/issues/29071
This is one part of a larger Pull Request (PR) that is too large to be
submitted all at once. This specific part focuses on updating the
PyPDFium2 parser.
For more details, see
https://github.com/langchain-ai/langchain/pull/28970.
- This pull request includes various changes to add a `user_agent`
parameter to Azure OpenAI, Azure Search and Whisper in the Community and
Partner packages. This helps in identifying the source of API requests
so we can better track usage and help support the community better. I
will also be adding the user_agent to the new `langchain-azure` repo as
well.
- No issue connected or updated dependencies.
- Utilises existing tests and docs
---------
Co-authored-by: Erick Friis <erick@langchain.dev>
Thank you for contributing to LangChain!
- [x] **PR title**: "package: description"
- Where "package" is whichever of langchain, community, core, etc. is
being modified. Use "docs: ..." for purely docs changes, "infra: ..."
for CI changes.
- Example: "community: add foobar LLM"
- [ ] **PR message**: ***Delete this entire checklist*** and replace
with
- **Description:** a description of the change
- **Issue:** the issue # it fixes, if applicable
- **Dependencies:** any dependencies required for this change
- **Twitter handle:** if your PR gets announced, and you'd like a
mention, we'll gladly shout you out!
- [ ] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.
- [ ] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
I made a change to how was implemented the support for GPU in
`FastEmbedEmbeddings` to be more consistent with the existing
implementation `langchain-qdrant` sparse embeddings implementation
It is directly enabling to provide the list of ONNX execution providers:
https://github.com/langchain-ai/langchain/blob/master/libs/partners/qdrant/langchain_qdrant/fastembed_sparse.py#L15
It is a bit less clear to a user that just wants to enable GPU, but
gives more capabilities to work with other execution providers that are
not the `CUDAExecutionProvider`, and is more future proof
Sorry for the disturbance @ccurme
> Nice to see you just moved to `uv`! It is so much nicer to run
format/lint/test! No need to manually rerun the `poetry install` with
all required extras now
Deep Lake recently released version 4, which introduces significant
architectural changes, including a new on-disk storage format, enhanced
indexing mechanisms, and improved concurrency. However, LangChain's
vector store integration currently does not support Deep Lake v4 due to
breaking API changes.
Previously, the installation command was:
`pip install deeplake[enterprise]`
This installs the latest available version, which now defaults to Deep
Lake v4. Since LangChain's vector store integration is still dependent
on v3, this can lead to compatibility issues when using Deep Lake as a
vector database within LangChain.
To ensure compatibility, the installation command has been updated to:
`pip install deeplake[enterprise]<4.0.0`
This constraint ensures that pip installs the latest available version
of Deep Lake within the v3 series while avoiding the incompatible v4
update.
- **Description:** add a `gpu: bool = False` field to the
`FastEmbedEmbeddings` class which enables to use GPU (through ONNX CUDA
provider) when generating embeddings with any fastembed model. It just
requires the user to install a different dependency and we use a
different provider when instantiating `fastembed.TextEmbedding`
- **Issue:** when generating embeddings for a really large amount of
documents this drastically increase performance (honestly that is a must
have in some situations, you can't just use CPU it is way too slow)
- **Dependencies:** no direct change to dependencies, but internally the
users will need to install `fastembed-gpu` instead of `fastembed`, I
made all the changes to the init function to properly let the user know
which dependency they should install depending on if they enabled `gpu`
or not
cf. fastembed docs about GPU for more details:
https://qdrant.github.io/fastembed/examples/FastEmbed_GPU/
I did not added test because it would require access to a GPU in the
testing environment
### PR Title:
**community: add latest OpenAI models pricing**
### Description:
This PR updates the OpenAI model cost calculation mapping by adding the
latest OpenAI models, **o1 (non-preview)** and **o3-mini**, based on the
pricing listed on the [OpenAI pricing
page](https://platform.openai.com/docs/pricing).
### Changes:
- Added pricing for `o1`, `o1-2024-12-17`, `o1-cached`, and
`o1-2024-12-17-cached` for input tokens.
- Added pricing for `o1-completion` and `o1-2024-12-17-completion` for
output tokens.
- Added pricing for `o3-mini`, `o3-mini-2025-01-31`, `o3-mini-cached`,
and `o3-mini-2025-01-31-cached` for input tokens.
- Added pricing for `o3-mini-completion` and
`o3-mini-2025-01-31-completion` for output tokens.
### Issue:
N/A
### Dependencies:
None
### Testing & Validation:
- No functional changes outside of updating the cost mapping.
- No tests were added or modified.
Description: Fixes PreFilter value handling in Azure Cosmos DB NoSQL
vectorstore. The current implementation fails to handle numeric values
in filter conditions, causing an undefined value variable error. This PR
adds support for numeric, boolean, and NULL values while maintaining the
existing string and list handling.
Changes:
Added handling for numeric types (int/float)
Added boolean value support
Added NULL value handling
Added type validation for unsupported values
Fixed scope of value variable initialization
Issue:
Fixes#29610
Implementation Notes:
No changes to public API
Backwards compatible
Maintains consistent behavior with existing MongoDB-style filtering
Preserves SQL injection prevention through proper value handling
---------
Co-authored-by: Chester Curme <chester.curme@gmail.com>
This is one part of a larger Pull Request (PR) that is too large to be
submitted all at once. This specific part focuses on updating the XXX
parser.
For more details, see [PR
28970](https://github.com/langchain-ai/langchain/pull/28970).
---------
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
**PR title**: "community: Option to pass auth_file_location for
oci_generative_ai"
**Description:** Option to pass auth_file_location, to overwrite config
file default location "~/.oci/config" where profile name configs
present. This is not fixing any issues. Just added optional parameter
called "auth_file_location", which internally supported by any OCI
client including GenerativeAiInferenceClient.
## Description:
This PR addresses issue #29429 by fixing the _wrap_query method in
langchain_community/graphs/age_graph.py. The method now correctly
handles Cypher queries with UNION and EXCEPT operators, ensuring that
the fields in the SQL query are ordered as they appear in the Cypher
query. Additionally, the method now properly handles cases where RETURN
* is not supported.
### Issue: #29429
### Dependencies: None
### Add tests and docs:
Added unit tests in tests/unit_tests/graphs/test_age_graph.py to
validate the changes.
No new integrations were added, so no example notebook is necessary.
Lint and test:
Ran make format, make lint, and make test to ensure code quality and
functionality.
This is one part of a larger Pull Request (PR) that is too large to be
submitted all at once.
This specific part focuses on updating the PyPDF parser.
For more details, see [PR
28970](https://github.com/langchain-ai/langchain/pull/28970).
*Description:**
Updates the YahooFinanceNewsTool to handle the current yfinance news
data structure. The tool was failing with a KeyError due to changes in
the yfinance API's response format. This PR updates the code to
correctly extract news URLs from the new structure.
**Issue:** #29495
**Dependencies:**
No new dependencies required. Works with existing yfinance package.
The changes maintain backwards compatibility while fixing the KeyError
that users were experiencing.
The modified code properly handles the new data structure where:
- News type is now at `content.contentType`
- News URL is now at `content.canonicalUrl.url`
---------
Co-authored-by: Chester Curme <chester.curme@gmail.com>
Description
This PR adds support for MongoDB-style $in operator filtering in the
Supabase vectorstore implementation. Currently, filtering with $in
operators returns no results, even when matching documents exist. This
change properly translates MongoDB-style filters to PostgreSQL syntax,
enabling efficient multi-document filtering.
Changes
Modified similarity_search_by_vector_with_relevance_scores to handle
MongoDB-style $in operators
Added automatic conversion of $in filters to PostgreSQL IN clauses
Preserved original vector type handling and numpy array conversion
Maintained compatibility with existing postgrest filters
Added support for the same filtering in
similarity_search_by_vector_returning_embeddings
Issue
Closes#27932
Implementation Notes
No changes to public API or function signatures
Backwards compatible - behavior unchanged for non-$in filters
More efficient than multiple individual queries for multi-ID searches
Preserves all existing functionality including numpy array conversion
for vector types
Dependencies
None
Additional Notes
The implementation handles proper SQL escaping for filter values
Maintains consistent behavior with other vectorstore implementations
that support MongoDB-style operators
Future extensions could support additional MongoDB-style operators ($gt,
$lt, etc.)
---------
Co-authored-by: Chester Curme <chester.curme@gmail.com>
allow any credential type in AzureAIDocumentInteligence, not only
`api_key`.
This allows to use any of the credentials types integrated with AD.
---------
Co-authored-by: Chester Curme <chester.curme@gmail.com>
Description: Fix TypeError in AzureSearch similarity_search_with_score
by removing search_type from kwargs before passing to underlying
requests.
This resolves issue #29407 where search_type was being incorrectly
passed through to Session.request().
Issue: #29407
---------
Co-authored-by: Chester Curme <chester.curme@gmail.com>
- **Description:** Add to check pad_token_id and eos_token_id of model
config. It seems that this is the same bug as the HuggingFace TGI bug.
In addition, the source code of
libs/partners/huggingface/langchain_huggingface/llms/huggingface_pipeline.py
also requires similar changes.
- **Issue:** #29431
- **Dependencies:** none
- **Twitter handle:** tell14
- **Description:** the `delete` function of
AzureCosmosDBNoSqlVectorSearch is using
`self._container.delete_item(document_id)` which miss a mandatory
parameter `partition_key`
We use the class function `delete_document_by_id` to provide a default
`partition_key`
- **Issue:** #29372
- **Dependencies:** None
- **Twitter handle:** None
Co-authored-by: Loris Alexandre <loris.alexandre@boursorama.fr>
This pr, expands the gitlab url so it can also be set in a constructor,
instead of only through env variables.
This allows to do something like this.
```
# Create the GitLab API wrapper
gitlab_api = GitLabAPIWrapper(
gitlab_url=self.gitlab_url,
gitlab_personal_access_token=self.gitlab_personal_access_token,
gitlab_repository=self.gitlab_repository,
gitlab_branch=self.gitlab_branch,
gitlab_base_branch=self.gitlab_base_branch,
)
```
Where before you could not set the url in the constructor.
Co-authored-by: Tim Mallezie <tim.mallezie@dropsolid.com>
- [ *] **PR message**: ***Delete this entire checklist*** and replace
with
- **Description:** Fix for pedantic model validator for GoogleApiHandler
- **Issue:** the issue #29165
- [ *] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified.
---------
Signed-off-by: Bhav Sardana <sardana.bhav@gmail.com>
* Adds BlobParsers for images. These implementations can take an image
and produce one or more documents per image. This interface can be used
for exposing OCR capabilities.
* Update PyMuPDFParser and Loader to standardize metadata, handle
images, improve table extraction etc.
- **Twitter handle:** pprados
This is one part of a larger Pull Request (PR) that is too large to be
submitted all at once.
This specific part focuses to prepare the update of all parsers.
For more details, see [PR
28970](https://github.com/langchain-ai/langchain/pull/28970).
---------
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
## Description
- Responding to `NCP API Key` changes.
- To fix `ChatClovaX` `astream` function to raise `SSEError` when an
error event occurs.
- To add `token length` and `ai_filter` to ChatClovaX's
`response_metadata`.
- To update document for apply NCP API Key changes.
cc. @efriis @vbarda
- **Description:** Changed the Base Default Model and Base URL to
correct versions. Plus added a more explicit exception if user provides
an invalid API Key
- **Issue:** #29278
- [ ] **PR title**:[langchain_community.llms.xinference]: Rewrite
_stream() method and support stream() method in xinference.py
- [ ] **PR message**: Rewrite the _stream method so that the
chain.stream() can be used to return data streams.
chain = prompt | llm
chain.stream(input=user_input)
- [ ] **tests**:
from langchain_community.llms import Xinference
from langchain.prompts import PromptTemplate
llm = Xinference(
server_url="http://0.0.0.0:9997", # replace your xinference server url
model_uid={model_uid} # replace model_uid with the model UID return from
launching the model
stream = True
)
prompt = PromptTemplate(input=['country'], template="Q: where can we
visit in the capital of {country}? A:")
chain = prompt | llm
chain.stream(input={'country': 'France'})
### Description
- Since there is no cost per 1k input tokens for a fine-tuned cached
version of `gpt-4o-mini-2024-07-18` is not available when using the
`OpenAICallbackHandler`, it raises an error when trying to make calls
with such model.
- To add the price in the `MODEL_COST_PER_1K_TOKENS` dictionary
cc. @efriis
# Description
## Summary
This PR adds support for handling multi-labeled page numbers in the
**PyPDFLoader**. Some PDFs use complex page numbering systems where the
actual content may begin after multiple introductory pages. The
page_label field helps accurately reflect the document’s page structure,
making it easier to handle such cases during document parsing.
## Motivation
This feature improves document parsing accuracy by allowing users to
access the actual page labels instead of relying only on the physical
page numbers. This is particularly useful for documents where the first
few pages have roman numerals or other non-standard page labels.
## Use Case
This feature is especially useful for **Retrieval-Augmented Generation**
(RAG) systems where users may reference page numbers when asking
questions. Some PDFs have both labeled page numbers (like roman numerals
for introductory sections) and index-based page numbers.
For example, a user might ask:
"What is mentioned on page 5?"
The system can now check both:
• **Index-based page number** (page)
• **Labeled page number** (page_label)
This dual-check helps improve retrieval accuracy. Additionally, the
results can be validated with an **agent or tool** to ensure the
retrieved pages match the user’s query contextually.
## Code Changes
- Added a page_label field to the metadata of the Document class in
**PyPDFLoader**.
- Implemented support for retrieving page_label from the
pdf_reader.page_labels.
- Created a test case (test_pypdf_loader_with_multi_label_page_numbers)
with a sample PDF containing multi-labeled pages
(geotopo-komprimiert.pdf) [[Source of
pdf](https://github.com/py-pdf/sample-files/blob/main/009-pdflatex-geotopo/GeoTopo-komprimiert.pdf)].
- Updated existing tests to ensure compatibility and verify page_label
extraction.
## Tests Added
- Added a new test case for a PDF with multi-labeled pages.
- Verified both page and page_label metadata fields are correctly
extracted.
## Screenshots
<img width="549" alt="image"
src="https://github.com/user-attachments/assets/65db9f5c-032e-4592-926f-824777c28f33"
/>
- [ ] **PR title**: [langchain_community.llms.xinference]: fix error in
xinference.py
- [ ] **PR message**:
- The old code raised an ValidationError:
pydantic_core._pydantic_core.ValidationError: 1 validation error for
Xinference when import Xinference from xinference.py. This issue has
been resolved by adjusting it's type and default value.
File "/media/vdc/python/lib/python3.10/site-packages/pydantic/main.py",
line 212, in __init__
validated_self = self.__pydantic_validator__.validate_python(data,
self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for
Xinference
client
Field required [type=missing, input_value={'server_url':
'http://10...t4', 'model_kwargs': {}}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.9/v/missing
- [ ] **tests**:
from langchain_community.llms import Xinference
llm = Xinference(
server_url="http://0.0.0.0:9997", # replace your xinference server url
model_uid={model_uid} # replace model_uid with the model UID return from
launching the model
)
- [langchain_community.utilities.SQLDatabase] **[fix] Convert table
names to list for compatibility in SQLDatabase**:
- The issue #29227 is being fixed here
- The "package" modified is community
- The issue lied in this block of code:
44b41b699c/libs/community/langchain_community/utilities/sql_database.py (L72-L77)
- [langchain_community.utilities.SQLDatabase] **[fix] Convert table
names to list for compatibility in SQLDatabase**:
- **Description:** When the SQLDatabase is initialized, it runs a code
`self._inspector.get_table_names(schema=schema)` which expects an output
of list. However, with some connectors (such as snowflake) the data type
returned could be another iterable. This results in a type error when
concatenating the table_names to view_names. I have added explicit type
casting to prevent this.
- **Issue:** The issue #29227 is being fixed here
- **Dependencies:** None
- **Twitter handle:** @BaqarAbbas2001
## Additional Information
When the following method is called for a Snowflake database:
44b41b699c/libs/community/langchain_community/utilities/sql_database.py (L75)
Snowflake under the hood calls:
```python
from snowflake.sqlalchemy.snowdialect import SnowflakeDialect
SnowflakeDialect.get_table_names
```
This method returns a `dict_keys()` object which is incompatible to
concatenate with a list and results in a `TypeError`
### Relevant Library Versions
- **snowflake-sqlalchemy**: 1.7.2
- **snowflake-connector-python**: 3.12.4
- **sqlalchemy**: 2.0.20
- **langchain_community**: 0.3.14