The tokens I get are:
```
['', '\n\n', 'The', ' sun', ' was', ' setting', ' over', ' the', ' horizon', ',', ' casting', '']
```
so possibly an extra empty token is included in the output.
lmk @efriis if we should look into this further.
- [ ] **PR title**:[langchain_community.llms.xinference]: Rewrite
_stream() method and support stream() method in xinference.py
- [ ] **PR message**: Rewrite the _stream method so that the
chain.stream() can be used to return data streams.
chain = prompt | llm
chain.stream(input=user_input)
- [ ] **tests**:
from langchain_community.llms import Xinference
from langchain.prompts import PromptTemplate
llm = Xinference(
server_url="http://0.0.0.0:9997", # replace your xinference server url
model_uid={model_uid} # replace model_uid with the model UID return from
launching the model
stream = True
)
prompt = PromptTemplate(input=['country'], template="Q: where can we
visit in the capital of {country}? A:")
chain = prompt | llm
chain.stream(input={'country': 'France'})
Thank you for contributing to LangChain!
- [ ] **PR title**: "package: description"
- Where "package" is whichever of langchain, community, core, etc. is
being modified. Use "docs: ..." for purely docs changes, "infra: ..."
for CI changes.
- Example: "community: add foobar LLM"
- [ ] **PR message**: ***Delete this entire checklist*** and replace
with
- **Description:** a description of the change
- **Issue:** the issue # it fixes, if applicable
- **Dependencies:** any dependencies required for this change
- **Twitter handle:** if your PR gets announced, and you'd like a
mention, we'll gladly shout you out!
- [ ] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.
- [ ] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
### Description
- Since there is no cost per 1k input tokens for a fine-tuned cached
version of `gpt-4o-mini-2024-07-18` is not available when using the
`OpenAICallbackHandler`, it raises an error when trying to make calls
with such model.
- To add the price in the `MODEL_COST_PER_1K_TOKENS` dictionary
cc. @efriis
# Description
## Summary
This PR adds support for handling multi-labeled page numbers in the
**PyPDFLoader**. Some PDFs use complex page numbering systems where the
actual content may begin after multiple introductory pages. The
page_label field helps accurately reflect the document’s page structure,
making it easier to handle such cases during document parsing.
## Motivation
This feature improves document parsing accuracy by allowing users to
access the actual page labels instead of relying only on the physical
page numbers. This is particularly useful for documents where the first
few pages have roman numerals or other non-standard page labels.
## Use Case
This feature is especially useful for **Retrieval-Augmented Generation**
(RAG) systems where users may reference page numbers when asking
questions. Some PDFs have both labeled page numbers (like roman numerals
for introductory sections) and index-based page numbers.
For example, a user might ask:
"What is mentioned on page 5?"
The system can now check both:
• **Index-based page number** (page)
• **Labeled page number** (page_label)
This dual-check helps improve retrieval accuracy. Additionally, the
results can be validated with an **agent or tool** to ensure the
retrieved pages match the user’s query contextually.
## Code Changes
- Added a page_label field to the metadata of the Document class in
**PyPDFLoader**.
- Implemented support for retrieving page_label from the
pdf_reader.page_labels.
- Created a test case (test_pypdf_loader_with_multi_label_page_numbers)
with a sample PDF containing multi-labeled pages
(geotopo-komprimiert.pdf) [[Source of
pdf](https://github.com/py-pdf/sample-files/blob/main/009-pdflatex-geotopo/GeoTopo-komprimiert.pdf)].
- Updated existing tests to ensure compatibility and verify page_label
extraction.
## Tests Added
- Added a new test case for a PDF with multi-labeled pages.
- Verified both page and page_label metadata fields are correctly
extracted.
## Screenshots
<img width="549" alt="image"
src="https://github.com/user-attachments/assets/65db9f5c-032e-4592-926f-824777c28f33"
/>
Title: community: add Financial Modeling Prep (FMP) API integration
Description: Adding LangChain integration for Financial Modeling Prep
(FMP) API to enable semantic search and structured tool creation for
financial data endpoints. This integration provides semantic endpoint
search using vector stores and automatic tool creation with proper
typing and error handling. Users can discover relevant financial
endpoints using natural language queries and get properly typed
LangChain tools for discovered endpoints.
Issue: N/A
Dependencies:
fmp-data>=0.3.1
langchain-core>=0.1.0
faiss-cpu
tiktoken
Twitter handle: @mehdizarem
Unit tests and example notebook have been added:
Tests are in tests/integration_tests/est_tools.py and
tests/unit_tests/test_tools.py
Example notebook is in docs/tools.ipynb
All format, lint and test checks pass:
pytest
mypy .
Dependencies are imported within functions and not added to
pyproject.toml. The changes are backwards compatible and only affect the
community package.
---------
Co-authored-by: mehdizare <mehdizare@users.noreply.github.com>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
- [ ] **PR title**: [langchain_community.llms.xinference]: fix error in
xinference.py
- [ ] **PR message**:
- The old code raised an ValidationError:
pydantic_core._pydantic_core.ValidationError: 1 validation error for
Xinference when import Xinference from xinference.py. This issue has
been resolved by adjusting it's type and default value.
File "/media/vdc/python/lib/python3.10/site-packages/pydantic/main.py",
line 212, in __init__
validated_self = self.__pydantic_validator__.validate_python(data,
self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for
Xinference
client
Field required [type=missing, input_value={'server_url':
'http://10...t4', 'model_kwargs': {}}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.9/v/missing
- [ ] **tests**:
from langchain_community.llms import Xinference
llm = Xinference(
server_url="http://0.0.0.0:9997", # replace your xinference server url
model_uid={model_uid} # replace model_uid with the model UID return from
launching the model
)
- [langchain_community.utilities.SQLDatabase] **[fix] Convert table
names to list for compatibility in SQLDatabase**:
- The issue #29227 is being fixed here
- The "package" modified is community
- The issue lied in this block of code:
44b41b699c/libs/community/langchain_community/utilities/sql_database.py (L72-L77)
- [langchain_community.utilities.SQLDatabase] **[fix] Convert table
names to list for compatibility in SQLDatabase**:
- **Description:** When the SQLDatabase is initialized, it runs a code
`self._inspector.get_table_names(schema=schema)` which expects an output
of list. However, with some connectors (such as snowflake) the data type
returned could be another iterable. This results in a type error when
concatenating the table_names to view_names. I have added explicit type
casting to prevent this.
- **Issue:** The issue #29227 is being fixed here
- **Dependencies:** None
- **Twitter handle:** @BaqarAbbas2001
## Additional Information
When the following method is called for a Snowflake database:
44b41b699c/libs/community/langchain_community/utilities/sql_database.py (L75)
Snowflake under the hood calls:
```python
from snowflake.sqlalchemy.snowdialect import SnowflakeDialect
SnowflakeDialect.get_table_names
```
This method returns a `dict_keys()` object which is incompatible to
concatenate with a list and results in a `TypeError`
### Relevant Library Versions
- **snowflake-sqlalchemy**: 1.7.2
- **snowflake-connector-python**: 3.12.4
- **sqlalchemy**: 2.0.20
- **langchain_community**: 0.3.14
## Description
This PR modifies the is_public_page function in ConfluenceLoader to
prevent exceptions caused by deleted pages during the execution of
ConfluenceLoader.process_pages().
**Example scenario:**
Consider the following usage of ConfluenceLoader:
```python
import os
from langchain_community.document_loaders import ConfluenceLoader
loader = ConfluenceLoader(
url=os.getenv("BASE_URL"),
token=os.getenv("TOKEN"),
max_pages=1000,
cql=f'type=page and lastmodified >= "2020-01-01 00:00"',
include_restricted_content=False,
)
# Raised Exception : HTTPError: Outdated version/old_draft/trashed? Cannot find content Please provide valid ContentId.
documents = loader.load()
```
If a deleted page exists within the query result, the is_public_page
function would previously raise an exception when calling
get_all_restrictions_for_content, causing the loader.load() process to
fail for all pages.
By adding a pre-check for the page's "current" status, unnecessary API
calls to get_all_restrictions_for_content for non-current pages are
avoided.
This fix ensures that such pages are skipped without affecting the rest
of the loading process.
## Issue
N/A (No specific issue number)
## Dependencies
No new dependencies are introduced with this change.
## Twitter handle
[@zenoengine](https://x.com/zenoengine)
**Description:** Added Additional parameters that could be useful for
usage of OpenAIAssistantV2Runnable.
This change is thought to allow langchain users to set parameters that
cannot be set using assistants UI
(max_completion_tokens,max_prompt_tokens,parallel_tool_calls) and
parameters that could be useful for experimenting like top_p and
temperature.
This PR originated from the need of using parallel_tool_calls in
langchain, this parameter is very important in openAI assistants because
without this parameter set to False strict mode is not respected by
OpenAI Assistants
(https://platform.openai.com/docs/guides/function-calling#parallel-function-calling).
> Note: Currently, if the model calls multiple functions in one turn
then strict mode will be disabled for those calls.
**Issue:** None
**Dependencies:** openai
Related: https://github.com/langchain-ai/langchain-aws/pull/322
The legacy `NeptuneOpenCypherQAChain` and `NeptuneSparqlQAChain` classes
are being replaced by the new LCEL format chains
`create_neptune_opencypher_qa_chain` and
`create_neptune_sparql_qa_chain`, respectively, in the `langchain_aws`
package.
This PR adds deprecation warnings to all Neptune classes and functions
that have been migrated to `langchain_aws`. All relevant documentation
has also been updated to replace `langchain_community` usage with the
new `langchain_aws` implementations.
---------
Co-authored-by: Chester Curme <chester.curme@gmail.com>
**Description:** Makes it possible to instantiate
`OpenAIModerationChain` with an `openai_api_key` argument only and no
`OPENAI_API_KEY` environment variable defined.
**Issue:** https://github.com/langchain-ai/langchain/issues/25176
**Dependencies:** `openai`
---------
Co-authored-by: ccurme <chester.curme@gmail.com>
### Description
This PR adds docs for the
[langchain-hyperbrowser](https://pypi.org/project/langchain-hyperbrowser/)
package. It includes a document loader that uses Hyperbrowser to scrape
or crawl any urls and return formatted markdown or html content as well
as relevant metadata.
[Hyperbrowser](https://hyperbrowser.ai) is a platform for running and
scaling headless browsers. It lets you launch and manage browser
sessions at scale and provides easy to use solutions for any webscraping
needs, such as scraping a single page or crawling an entire site.
### Issue
None
### Dependencies
None
### Twitter Handle
`@hyperbrowser`
# **PR title**: "community: Fix rank-llm import paths for new 0.20.3
version"
- The "community" package is being modified to handle updated import
paths for the new `rank-llm` version.
---
## Description
This PR updates the import paths for the `rank-llm` package to account
for changes introduced in version `0.20.3`. The changes ensure
compatibility with both pre- and post-revamp versions of `rank-llm`,
specifically version `0.12.8`. Conditional imports are introduced based
on the detected version of `rank-llm` to handle different path
structures for `VicunaReranker`, `ZephyrReranker`, and `SafeOpenai`.
## Issue
RankLLMRerank usage throws an error when used GPT (not only) when
rank-llm version is > 0.12.8 - #29156
## Dependencies
This change relies on the `packaging` and `pkg_resources` libraries to
handle version checks.
## Twitter handle
@tymzar
- **Description**: In the functions `_create_run` and `_acreate_run`,
the parameters passed to the creation of
`openai.resources.beta.threads.runs` were limited.
Source:
```
def _create_run(self, input: dict) -> Any:
params = {
k: v
for k, v in input.items()
if k in ("instructions", "model", "tools", "run_metadata")
}
return self.client.beta.threads.runs.create(
input["thread_id"],
assistant_id=self.assistant_id,
**params,
)
```
- OpenAI Documentation
([createRun](https://platform.openai.com/docs/api-reference/runs/createRun))
- Full list of parameters `openai.resources.beta.threads.runs` ([source
code](https://github.com/openai/openai-python/blob/main/src/openai/resources/beta/threads/runs/runs.py#L91))
- **Issue:** Fix#17574
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
Co-authored-by: ccurme <chester.curme@gmail.com>
Thank you for contributing to LangChain!
- [x] **PR title**: "package: description"
- Where "package" is whichever of langchain, community, core, etc. is
being modified. Use "docs: ..." for purely docs changes, "infra: ..."
for CI changes.
- Example: "community: add foobar LLM"
- [x] **PR message**
- **Description:** Add a missing format specifier in an an error log in
`langchain_community.document_loaders.CubeSemanticLoader`
- **Issue:** raises `TypeError: not all arguments converted during
string formatting`
- [ ] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.
- [ ] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.
If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
- **partner**: "Update Aiohttp for resolving vulnerability issue"
- **Description:** I have updated the upper limit of aiohttp from `3.10`
to `3.10.5` in the pyproject.toml file of langchain-pinecone. Hopefully
this will resolve#28771 . Please review this as I'm quite unsure.
---------
Co-authored-by: = <=>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
## Goal
Solve the following problems with `langchain-openai`:
- Structured output with `o1` [breaks out of the
box](https://langchain.slack.com/archives/C050X0VTN56/p1735232400232099).
- `with_structured_output` by default does not use OpenAI’s [structured
output
feature](https://platform.openai.com/docs/guides/structured-outputs).
- We override API defaults for temperature and other parameters.
## Breaking changes:
- Default method for structured output is changing to OpenAI’s dedicated
[structured output
feature](https://platform.openai.com/docs/guides/structured-outputs).
For schemas specified via TypedDict or JSON schema, strict schema
validation is disabled by default but can be enabled by specifying
`strict=True`.
- To recover previous default, pass `method="function_calling"` into
`with_structured_output`.
- Models that don’t support `method="json_schema"` (e.g., `gpt-4` and
`gpt-3.5-turbo`, currently the default model for ChatOpenAI) will raise
an error unless `method` is explicitly specified.
- To recover previous default, pass `method="function_calling"` into
`with_structured_output`.
- Schemas specified via Pydantic `BaseModel` that have fields with
non-null defaults or metadata (like min/max constraints) will raise an
error.
- To recover previous default, pass `method="function_calling"` into
`with_structured_output`.
- `strict` now defaults to False for `method="json_schema"` when schemas
are specified via TypedDict or JSON schema.
- To recover previous behavior, use `with_structured_output(schema,
strict=True)`
- Schemas specified via Pydantic V1 will raise a warning (and use
`method="function_calling"`) unless `method` is explicitly specified.
- To remove the warning, pass `method="function_calling"` into
`with_structured_output`.
- Streaming with default structured output method / Pydantic schema no
longer generates intermediate streamed chunks.
- To recover previous behavior, pass `method="function_calling"` into
`with_structured_output`.
- We no longer override default temperature (was 0.7 in LangChain, now
will follow OpenAI, currently 1.0).
- To recover previous behavior, initialize `ChatOpenAI` or
`AzureChatOpenAI` with `temperature=0.7`.
- Note: conceptually there is a difference between forcing a tool call
and forcing a response format. Tool calls may have more concise
arguments vs. generating content adhering to a schema. Prompts may need
to be adjusted to recover desired behavior.
---------
Co-authored-by: Jacob Lee <jacoblee93@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
- **Description:** Adds backticks to generate_schema function in the
arango graph client
- **Issue:** We experienced an issue with the generate schema function
when talking to our arango database where these backticks were missing
- **Dependencies:** none
- **Twitter handle:** @anangelofgrace
## Description
Add `__init__` for `UnstructuredHTMLLoader` to restrict the input type
to `str` or `Path`, and transfer the `self.file_path` to `str` just like
`UnstructuredXMLLoader` does.
## Issue
Fix#29090
## Dependencies
No changes.
## Description
This PR enables label inclusion for documents loaded via CQL in the
confluence-loader.
- Updated _lazy_load to pass the include_labels parameter instead of
False in process_pages calls for documents loaded via CQL.
- Ensured that labels can now be fetched and added to the metadata for
documents queried with cql.
## Related Modification History
This PR builds on the previous functionality introduced in
[#28259](https://github.com/langchain-ai/langchain/pull/28259), which
added support for including labels with the include_labels option.
However, this functionality did not work as expected for CQL queries,
and this PR fixes that issue.
If the False handling was intentional due to another issue, please let
me know. I have verified with our Confluence instance that this change
allows labels to be correctly fetched for documents loaded via CQL.
## Issue
Fixes#29088
## Dependencies
No changes.
## Twitter Handle
[@zenoengine](https://x.com/zenoengine)