72 Commits

Author SHA1 Message Date
Mason Daugherty
2f64d80cc6 fix(core,model-profiles): add missing ModelProfile fields, warn on schema drift (#36129)
PR #35788 added 7 new fields to the `langchain-profiles` CLI output
(`name`, `status`, `release_date`, `last_updated`, `open_weights`,
`attachment`, `temperature`) but didn't update `ModelProfile` in
`langchain-core`. Partner packages like `langchain-aws` that set
`extra="forbid"` on their Pydantic models hit `extra_forbidden`
validation errors when Pydantic encountered undeclared TypedDict keys at
construction time. This adds the missing fields, makes `ModelProfile`
forward-compatible, provides a base-class hook so partners can stop
duplicating model-profile validator boilerplate, migrates all in-repo
partners to the new hook, and adds runtime + CI-time warnings for schema
drift.

## Changes

### `langchain-core`
- Add `__pydantic_config__ = ConfigDict(extra="allow")` to
`ModelProfile` so unknown profile keys pass Pydantic validation even on
models with `extra="forbid"` — forward-compatibility for when the CLI
schema evolves ahead of core
- Declare the 7 missing fields on `ModelProfile`: `name`, `status`,
`release_date`, `last_updated`, `open_weights` (metadata) and
`attachment`, `temperature` (capabilities)
- Add `_warn_unknown_profile_keys()` in `model_profile.py` — emits a
`UserWarning` when a profile dict contains keys not in `ModelProfile`,
suggesting a core upgrade. Wrapped in a bare `except` so introspection
failures never crash model construction
- Add `BaseChatModel._resolve_model_profile()` hook that returns `None`
by default. Partners can override this single method instead of
redefining the full `_set_model_profile` validator — the base validator
calls it automatically
- Add `BaseChatModel._check_profile_keys` as a separate
`model_validator` that calls `_warn_unknown_profile_keys`. Uses a
distinct method name so partner overrides of `_set_model_profile` don't
inadvertently suppress the check
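
The unknown-key warning described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `model_profile.py`: `ToyModelProfile` and `warn_unknown_profile_keys` are hypothetical stand-ins, the warning text is invented, and the sketch catches `Exception` rather than using a bare `except`.

```python
import warnings
from typing import TypedDict


class ToyModelProfile(TypedDict, total=False):
    """Hypothetical stand-in for ModelProfile with a few declared fields."""

    name: str
    status: str
    max_input_tokens: int


def warn_unknown_profile_keys(profile: dict) -> list[str]:
    """Warn about keys the TypedDict does not declare and return them."""
    try:
        declared = set(ToyModelProfile.__annotations__)
        unknown = sorted(set(profile) - declared)
    except Exception:
        # Introspection problems must never break model construction.
        return []
    if unknown:
        warnings.warn(
            f"Unrecognized profile keys: {unknown}. "
            "Consider upgrading langchain-core.",
            UserWarning,
            stacklevel=2,
        )
    return unknown


# Record warnings so the behavior can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    unknown_keys = warn_unknown_profile_keys({"name": "toy", "open_weights": True})
```

Here `open_weights` is reported because the toy TypedDict does not declare it; the profile itself is left untouched.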

### `langchain-profiles` CLI
- Add `_warn_undeclared_profile_keys()` to the CLI (`cli.py`), called
after merging augmentations in `refresh()` — warns at profile-generation
time (not just runtime) when emitted keys aren't declared in
`ModelProfile`. Gracefully skips if `langchain-core` isn't installed
- Add guard test
`test_model_data_to_profile_keys_subset_of_model_profile` in
model-profiles — feeds a fully-populated model dict to
`_model_data_to_profile()` and asserts every emitted key exists in
`ModelProfile.__annotations__`. CI fails before any release if someone
adds a CLI field without updating the TypedDict
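
The guard test can be pictured with the sketch below. Everything in it is hypothetical: `ToyModelProfile`, `toy_model_data_to_profile`, and the shape of the input dict are assumptions made for illustration and do not reproduce `_model_data_to_profile()` or the models.dev schema.

```python
from typing import TypedDict


class ToyModelProfile(TypedDict, total=False):
    """Hypothetical stand-in for ModelProfile."""

    name: str
    max_input_tokens: int
    text_inputs: bool


def toy_model_data_to_profile(model_data: dict) -> dict:
    """Hypothetical converter from raw model data to profile keys."""
    limit = model_data.get("limit", {})
    modalities = model_data.get("modalities", {})
    return {
        "name": model_data.get("name"),
        "max_input_tokens": limit.get("context"),
        "text_inputs": "text" in modalities.get("input", []),
    }


def test_profile_keys_subset_of_model_profile() -> None:
    # Feed a fully populated record so every key the converter can emit appears.
    fully_populated = {
        "name": "toy-model",
        "limit": {"context": 8192},
        "modalities": {"input": ["text"]},
    }
    emitted = set(toy_model_data_to_profile(fully_populated))
    declared = set(ToyModelProfile.__annotations__)
    assert emitted <= declared, f"undeclared keys: {sorted(emitted - declared)}"


test_profile_keys_subset_of_model_profile()
```

Adding a key to the converter without declaring it on the TypedDict makes the assertion fail, which is the drift the guard is meant to catch.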

### Partner packages
- Migrate all 10 in-repo partners to the `_resolve_model_profile()`
hook, replacing duplicated `@model_validator` / `_set_model_profile`
overrides: anthropic, deepseek, fireworks, groq, huggingface, mistralai,
openai (base + azure), openrouter, perplexity, xai
- Anthropic retains custom logic (context-1m beta → `max_input_tokens`
override); all others reduce to a one-liner
- Add `pr_lint.yml` scope for the new `model-profiles` package
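
The hook pattern behind this migration can be sketched with plain classes. This is not the `BaseChatModel` implementation: the real hook is called from a Pydantic validator, while the sketch uses `__init__`, and `ToyBaseChatModel`, `ToyPartnerChatModel`, and `_TOY_PROFILES` are hypothetical names. That an explicitly passed profile takes precedence is also an assumption.

```python
from typing import Optional


class ToyBaseChatModel:
    """Hypothetical base class that owns the profile-setting logic."""

    def __init__(self, model: str, profile: Optional[dict] = None) -> None:
        self.model = model
        self.profile = profile
        self._set_model_profile()

    def _resolve_model_profile(self) -> Optional[dict]:
        # Default hook: the base class knows no profiles.
        return None

    def _set_model_profile(self) -> None:
        # Assumption: an explicitly passed profile wins over the hook.
        if self.profile is None:
            self.profile = self._resolve_model_profile()


_TOY_PROFILES = {"toy-large": {"max_input_tokens": 128_000}}


class ToyPartnerChatModel(ToyBaseChatModel):
    """Hypothetical partner class: overrides only the hook."""

    def _resolve_model_profile(self) -> Optional[dict]:
        return _TOY_PROFILES.get(self.model)


resolved = ToyPartnerChatModel("toy-large").profile
```

The partner class never redefines `_set_model_profile`, so logic added to the base class later still runs for every partner.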
2026-03-23 00:44:27 -04:00
Mason Daugherty
5d9568b5f5 feat(model-profiles): new fields + Makefile target (#35788)
Extract additional fields from models.dev into `_model_data_to_profile`:
`name`, `status`, `release_date`, `last_updated`, `open_weights`,
`attachment`, `temperature`

Move the model profile refresh logic from an inline bash script in the
GitHub Actions workflow into a `make refresh-profiles` target in
`libs/model-profiles/Makefile`. This makes it runnable locally with a
single command and keeps the provider map in one place instead of
duplicated between CI and developer docs.
2026-03-12 13:56:25 +00:00
Mason Daugherty
70192690b1 fix(model-profiles): sort generated profiles by model ID for stable diffs (#35344)
- Sort model profiles alphabetically by model ID (the top-level
`_PROFILES` dictionary keys, e.g. `claude-3-5-haiku-20241022`,
`gpt-4o-mini`) before writing `_profiles.py`, so that regenerating
profiles only shows actual data changes in diffs — not random reordering
from the models.dev API response order
- Regenerate all 10 partner profile files with the new sorted ordering
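
A minimal sketch of the sorting step, assuming the profiles are held in a plain dict keyed by model ID before being written out. `sort_profiles` is a hypothetical helper and the profile values are invented.

```python
def sort_profiles(profiles: dict[str, dict]) -> dict[str, dict]:
    """Return a new dict whose keys are in alphabetical order."""
    # Python dicts preserve insertion order, so rebuilding from sorted
    # items yields a stable key order for the generated file.
    return dict(sorted(profiles.items()))


# Invented values; the order mimics an arbitrary API response order.
unsorted_profiles = {
    "gpt-4o-mini": {"max_input_tokens": 1},
    "claude-3-5-haiku-20241022": {"max_input_tokens": 2},
}
sorted_profiles = sort_profiles(unsorted_profiles)
```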
2026-02-19 23:11:22 -05:00
Mason Daugherty
82ae4fb6fa chore: bump model profiles (#35294) 2026-02-17 20:22:07 -05:00
Mason Daugherty
4ca586b322 feat(model-profiles): add text_inputs and text_outputs (#35084)
- Add `text_inputs` and `text_outputs` fields to `ModelProfile`
- Regenerate `_profiles.py` for all providers

## Why

models.dev data includes `'text'` as both an input and output modality,
but we didn't capture it.

models.dev broadly contains models without text input (Whisper/ASR) and
without text output (image generators, TTS).

Without this, downstream consumers can't filter on model text support
(e.g. preventing users from passing text input to an audio-only model).
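
As an illustration of the filtering this enables, the sketch below selects models that accept text input. The model names and profile values are invented, `models_accepting_text` is a hypothetical helper, and treating a missing `text_inputs` key as "not supported" is an assumption.

```python
def models_accepting_text(profiles: dict[str, dict]) -> list[str]:
    """Return the IDs of models whose profile declares text input support."""
    return sorted(
        model_id
        for model_id, profile in profiles.items()
        if profile.get("text_inputs", False)
    )


# Invented profiles: a chat model, a speech-to-text model, a text-to-speech model.
toy_profiles = {
    "toy-chat": {"text_inputs": True, "text_outputs": True},
    "toy-asr": {"text_inputs": False, "text_outputs": True},
    "toy-tts": {"text_inputs": True, "text_outputs": False},
}
text_capable = models_accepting_text(toy_profiles)
```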

---

We'd need to also run for Google, AWS and cut releases for all to
propagate
2026-02-09 14:50:09 -05:00
XXt
689ce96016 docs: add missing module-level docstrings to partner integrations (#34838)

Added module-level docstrings to 6 partner integration `__init__.py` files
that were missing documentation:
2026-01-22 12:05:59 -05:00
Georgey
16c984ef0a fix(langchain-classic): fix init_chat_model for HuggingFace models (#33943) 2025-12-12 11:05:48 -05:00
Paul
bf6a5eb122 fix(huggingface): Helper logic for init_chat_model with HuggingFace backend (#34259) 2025-12-12 10:05:16 -05:00
Mason Daugherty
ff6e3558d7 docs(fireworks,groq,huggingface,mistralai,ollama,openai): x-ref convert_to_openai_tool (#34276) 2025-12-09 19:51:04 -05:00
ccurme
33e5d01f7c feat(model-profiles): distribute data across packages (#34024) 2025-11-21 15:47:05 -05:00
Azibek
d8b94007c1 fix(huggingface): pass llm params to ChatHuggingFace (#32368)
This PR fixes #32234 and improves HuggingFace chat model integration by:

- Ensuring `ChatHuggingFace` inherits key parameters (`temperature`,
`max_tokens`, `top_p`, `streaming`, etc.) from the underlying LLM when not
explicitly set.
- Adding and updating unit tests to verify property inheritance.

No breaking changes; these updates enhance reliability and
maintainability.

---------

Co-authored-by: Mason Daugherty <mason@langchain.dev>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Mason Daugherty <github@mdrxy.com>
2025-11-07 14:29:15 -05:00
Mason Daugherty
e023201d42 style: some cleanup (#33857) 2025-11-06 23:50:46 -05:00
Hyejeong Jo
0e36185933 fix(huggingface): add stream_usage support for ChatHuggingFace invoke/stream (#32708) 2025-11-03 14:44:32 -05:00
Mason Daugherty
123e29dc26 style: more refs fixes (#33730) 2025-10-29 16:34:46 -04:00
Mason Daugherty
1d2273597a docs: more fixes for refs (#33554) 2025-10-16 22:54:16 -04:00
Mason Daugherty
15db024811 chore: more sweeping (#33533)
more fixes for refs
2025-10-16 15:44:56 -04:00
Mason Daugherty
291a9fcea1 style: llm -> model (#33423) 2025-10-10 13:19:13 -04:00
Mason Daugherty
6fc21afbc9 style: .. code-block:: admonition translations (#33400)
biiiiiiiiiiiiiiiigggggggg pass
2025-10-09 16:52:58 -04:00
Mason Daugherty
d8a680ee57 style: address Sphinx double-backtick snippet syntax (#33389) 2025-10-09 13:35:51 -04:00
Mason Daugherty
b6132fc23e style: remove more Optional syntax (#33371) 2025-10-08 23:28:43 -04:00
Mason Daugherty
31eeb50ce0 chore: drop UP045 (#33362)
Python 3.9 EOL
2025-10-08 21:17:53 -04:00
Mason Daugherty
d13823043d style: monorepo pass for refs (#33359)
* Delete some double backticks previously used by Sphinx (not done
everywhere yet)
* Fix some code blocks / dropdowns

Ignoring CLI CI for now
2025-10-08 18:41:39 -04:00
Mason Daugherty
ae5b105d11 docs: v1 docs updates (#33173)
Co-authored-by: Mohammad Mohtashim <45242107+keenborder786@users.noreply.github.com>
Co-authored-by: Caspar Broekhuizen <caspar@langchain.dev>
Co-authored-by: ccurme <chester.curme@gmail.com>
Co-authored-by: Christophe Bornet <cbornet@hotmail.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
Co-authored-by: Vadym Barda <vadim.barda@gmail.com>
2025-10-02 18:46:26 -04:00
Mason Daugherty
eaa6dcce9e release: v1.0.0 (#32567)
Co-authored-by: Mohammad Mohtashim <45242107+keenborder786@users.noreply.github.com>
Co-authored-by: Caspar Broekhuizen <caspar@langchain.dev>
Co-authored-by: ccurme <chester.curme@gmail.com>
Co-authored-by: Christophe Bornet <cbornet@hotmail.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
Co-authored-by: Vadym Barda <vadim.barda@gmail.com>
2025-10-02 10:49:42 -04:00
Mason Daugherty
986302322f docs: more standardization (#33124) 2025-09-25 20:46:20 -04:00
Mason Daugherty
b92b394804 style: repo linting pass (#33089)
enable docstring-code-format
2025-09-24 15:25:55 -04:00
Mason Daugherty
96cbd90cba fix: formatting issues in docstrings (#32265)
Ensures proper reStructuredText formatting by adding the required blank
line before closing docstring quotes, which resolves the "Block quote
ends without a blank line; unexpected unindent" warning.
2025-07-27 23:37:47 -04:00
niceg
0d6f915442 fix: LLM mimicking Unicode responses due to forced Unicode conversion of non-ASCII characters. (#32222)
fix: Fix LLM mimicking Unicode responses due to forced Unicode
conversion of non-ASCII characters.

- **Description:** This PR fixes an issue where the LLM would mimic
Unicode responses due to forced Unicode conversion of non-ASCII
characters in tool calls. The fix involves disabling the `ensure_ascii`
flag in `json.dumps()` when converting tool calls to OpenAI format.
- **Issue:** Fixes ↓↓↓
input:
```json
{'role': 'assistant', 'tool_calls': [{'type': 'function', 'id': 'call_nv9trcehdpihr21zj9po19vq', 'function': {'name': 'create_customer', 'arguments': '{"customer_name": "你好啊集团"}'}}]}
```
output:
```json
{'role': 'assistant', 'tool_calls': [{'type': 'function', 'id': 'call_nv9trcehdpihr21zj9po19vq', 'function': {'name': 'create_customer', 'arguments': '{"customer_name": "\\u4f60\\u597d\\u554a\\u96c6\\u56e2"}'}}]}
```
Result:
the LLM mimics the escaped output and emits `\uXXXX` escape sequences
itself. Each escape sequence is longer than the character it replaces,
which can lengthen LLM responses and slow generation.
<img width="686" height="277" alt="image"
src="https://github.com/user-attachments/assets/28f3b007-3964-4455-bee2-68f86ac1906d"
/>
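
The effect of the flag can be reproduced with the standard library alone. The sketch below is not the conversion code touched by this PR; it only shows how `json.dumps` treats the same arguments with and without `ensure_ascii=False`.

```python
import json

arguments = {"customer_name": "你好啊集团"}

# Default behavior: non-ASCII characters become \uXXXX escape sequences.
escaped = json.dumps(arguments)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(arguments, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same object, so the change affects only what the model sees in its context, not the parsed arguments.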

---------

Co-authored-by: Mason Daugherty <github@mdrxy.com>
Co-authored-by: Mason Daugherty <mason@langchain.dev>
2025-07-24 17:01:31 -04:00
Mason Daugherty
d53ebf367e fix(docs): capitalization, codeblock formatting, and hyperlinks, note blocks (#32235)
widespread cleanup attempt
2025-07-24 16:55:04 -04:00
Mason Daugherty
4d9eefecab fix: bump lockfiles (#31923)
* bump lockfiles after upgrading ruff
* resolve resulting linting fixes
2025-07-08 13:27:55 -04:00
Mason Daugherty
ae210c1590 ruff: add bugbear across packages (#31917)
WIP, other packages will get in next PRs
2025-07-08 12:22:55 -04:00
Mason Daugherty
750721b4c3 huggingface[patch]: ruff fixes and rules (#31912)
* bump ruff deps
* add more thorough ruff rules
* fix said rules
2025-07-08 10:07:57 -04:00
m27315
013ce2c47f huggingface: fix HuggingFaceEndpoint._astream() got multiple values for argument 'stop' (#31385) 2025-07-06 15:18:53 +00:00
Peter Schneider
cecfec5efa huggingface: handle image-text-to-text pipeline task (#31611)
**Description:** Allows for HuggingFacePipeline to handle
image-text-to-text pipeline
2025-06-14 16:41:11 -04:00
अंkur गोswami
729526ff7c huggingface: Undefined model_id fix (#31358)
**Description:** This change fixes the undefined model_id issue when
instantiating
[ChatHuggingFace](https://github.com/langchain-ai/langchain/blob/master/libs/partners/huggingface/langchain_huggingface/chat_models/huggingface.py#L306)
**Issue:** Fixes https://github.com/langchain-ai/langchain/issues/31357


@baskaryan @hwchase17
2025-05-29 15:59:35 -04:00
ccurme
bdb7c4a8b3 huggingface: fix embeddings return type (#31072)
Integration tests failing

cc @hanouticelina
2025-04-29 18:45:04 +00:00
célina
868f07f8f4 partners: (langchain-huggingface) Chat Models - Integrate Hugging Face Inference Providers and remove deprecated code (#30733)
Hi there, I'm Célina from 🤗,
This PR introduces support for Hugging Face's serverless Inference
Providers (documentation
[here](https://huggingface.co/docs/inference-providers/index)), allowing
users to specify different providers for chat completion and text
generation tasks.

This PR also removes the usage of `InferenceClient.post()` method in
`HuggingFaceEndpoint`, in favor of the task-specific `text_generation`
method. `InferenceClient.post()` is deprecated and will be removed in
`huggingface_hub v0.31.0`.

---
## Changes made
- bumped the minimum required version of the `huggingface-hub` package
to ensure compatibility with the latest API usage.
- added a `provider` field to `HuggingFaceEndpoint`, enabling users to
select the inference provider (e.g., 'cerebras', 'together',
'fireworks-ai'). Defaults to `hf-inference` (HF Inference API).
- replaced the deprecated `InferenceClient.post()` call in
`HuggingFaceEndpoint` with the task-specific `text_generation` method
for future-proofing; `post()` will be removed in huggingface-hub
v0.31.0.
- updated the `ChatHuggingFace` component:
    - added async and streaming support.
    - added support for tool calling.
- exposed underlying chat completion parameters for more granular
control.
- Added integration tests for `ChatHuggingFace` and updated the
corresponding unit tests.

  All changes are backward compatible.

---------

Co-authored-by: ccurme <chester.curme@gmail.com>
2025-04-29 09:53:14 -04:00
Sydney Runkle
7e926520d5 packaging: remove Python upper bound for langchain and co libs (#31025)
Follow up to https://github.com/langchain-ai/langsmith-sdk/pull/1696,
I've bumped the `langsmith` version where applicable in `uv.lock`.

Type checking problems here because deps have been updated in
`pyproject.toml` and `uv lock` hasn't been run - we should enforce that
in the future - goes with the other dependabot todos :).
2025-04-28 14:44:28 -04:00
Sydney Runkle
8c6734325b partners[lint]: run pyupgrade to get code in line with 3.9 standards (#30781)
Using `pyupgrade` to get all `partners` code up to 3.9 standards
(mostly, fixing old `typing` imports).
2025-04-11 07:18:44 -04:00
célina
68361f9c2d partners: (langchain-huggingface) Embeddings - Integrate Inference Providers and remove deprecated code (#30735)
Hi there, This is a complementary PR to #30733.
This PR introduces support for Hugging Face's serverless Inference
Providers (documentation
[here](https://huggingface.co/docs/inference-providers/index)), allowing
users to specify different providers.

This PR also removes the usage of `InferenceClient.post()` method in
`HuggingFaceEndpointEmbeddings`, in favor of the task-specific
`feature_extraction` method. `InferenceClient.post()` is deprecated and
will be removed in `huggingface_hub` v0.31.0.

## Changes made

- bumped the minimum required version of the `huggingface_hub` package
to ensure compatibility with the latest API usage.
- added a provider field to `HuggingFaceEndpointEmbeddings`, enabling
users to select the inference provider.
- replaced the deprecated `InferenceClient.post()` call in
`HuggingFaceEndpointEmbeddings` with the task-specific
`feature_extraction` method for future-proofing; `post()` will be
removed in `huggingface-hub` v0.31.0.

 All changes are backward compatible.

---------

Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: ccurme <chester.curme@gmail.com>
2025-04-09 19:05:43 +00:00
Ella Charlaix
c401254770 huggingface: Add ipex support to HuggingFaceEmbeddings (#29386)
ONNX and OpenVINO models are available by specifying the `backend`
argument (the model is loaded using `optimum`
https://github.com/huggingface/optimum)

```python
from langchain_huggingface import HuggingFaceEmbeddings

embedding = HuggingFaceEmbeddings(
    model_name=model_id,
    model_kwargs={"backend": "onnx"},
)
```

This PR also enables the IPEX backend:

```python
from langchain_huggingface import HuggingFaceEmbeddings

embedding = HuggingFaceEmbeddings(
    model_name=model_id,
    model_kwargs={"backend": "ipex"},
)
```
2025-02-07 15:21:09 -08:00
Teruaki Ishizaki
aeb42dc900 partners: Fixed the procedure of initializing pad_token_id (#29500)
- **Description:** Adds a check of `pad_token_id` and `eos_token_id` in the
model config. This appears to be the same bug as the HuggingFace TGI bug.
It is the same bug as #29434.
- **Issue:** #29431
- **Dependencies:** none
- **Twitter handle:** tell14

Example code:
```python
from langchain_huggingface.llms import HuggingFacePipeline

hf = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 10},
)

from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

chain = prompt | hf

question = "What is electroencephalography?"

print(chain.invoke({"question": question}))
```
2025-02-03 21:40:33 -05:00
Ella Charlaix
6f95db81b7 huggingface: Add IPEX models support (#29179)
Co-authored-by: Erick Friis <erick@langchain.dev>
2025-01-22 00:16:44 +00:00
Mohammad Mohtashim
8cf5f20bb5 required tool_choice added for ChatHuggingFace (#28851)
- **Description:** HuggingFace Inference Client V3 now supports
`required` as a `tool_choice` value; this PR adds it to `ChatHuggingFace`.
- **Issue:** #28842
2024-12-20 12:06:04 -05:00
Manuel
af2e0a7ede partners: add 'model' alias for consistency in embedding classes (#28374)
**Description:** This PR introduces a `model` alias for the embedding
classes that contain the attribute `model_name`, to ensure consistency
across the codebase, as suggested by a moderator in a previous PR. The
change aligns the usage of attribute names across the project (see for
example
[here](65deeddd5d/libs/partners/groq/langchain_groq/chat_models.py (L304))).
**Issue:** This PR addresses the suggestion from the review of issue
#28269.
**Dependencies:**  None

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
2024-12-13 22:30:00 +00:00
Wang, Yi
d834c6b618 huggingface: fix tool argument serialization in _convert_TGI_message_to_LC_message (#26075)
Currently `_convert_TGI_message_to_LC_message` replaces `'` in the tool
arguments, so an argument like "It's" will be converted to `It"s` and
could cause a json parser to fail.
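
The failure is easy to reproduce with the standard library. The sketch below does not use `_convert_TGI_message_to_LC_message`; it applies the same kind of blanket quote replacement to a JSON string and checks whether the result still parses. `is_valid_json` is a hypothetical helper.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if the string parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Valid JSON whose string value contains an apostrophe.
arguments = json.dumps({"text": "It's"})

# Blanket replacement of ' with " turns the apostrophe into a stray quote.
mangled = arguments.replace("'", '"')

print(arguments, is_valid_json(arguments))
print(mangled, is_valid_json(mangled))
```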

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Vadym Barda <vadym@langchain.dev>
2024-12-11 18:34:32 -08:00
af su
7c7ee07d30 huggingface[fix]: HuggingFaceEndpointEmbeddings model parameter passing error when async embed (#27953)
This change refines the handling of `_model_kwargs` in POST requests.
Instead of nesting `_model_kwargs` as a dictionary under the `parameters`
key, it is now unpacked and merged directly into the request's JSON
payload. This ensures that the model parameters are passed correctly and
avoids unnecessary nesting. For example:

```python
import asyncio

from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings

embedding_input = ["This input will get multiplied" * 10000]

embeddings = HuggingFaceEndpointEmbeddings(
    model="http://127.0.0.1:8081/embed",
    model_kwargs={"truncate": True},
)

# The truncate parameter is handled correctly in the synchronous method.
embeddings.embed_documents(texts=embedding_input)
# Before this fix, the asynchronous method did not pass truncate correctly,
# and the server returned 413 Request Entity Too Large.
asyncio.run(embeddings.aembed_documents(texts=embedding_input))
```
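
The difference between the two payload shapes can be sketched as follows. The `parameters` key comes from the description above; the `inputs` key and both helper names are assumptions for illustration, not the actual request-building code.

```python
def build_nested_payload(texts: list[str], model_kwargs: dict) -> dict:
    """Old shape: model kwargs nested under a 'parameters' key."""
    return {"inputs": texts, "parameters": model_kwargs}


def build_merged_payload(texts: list[str], model_kwargs: dict) -> dict:
    """New shape: model kwargs unpacked into the top-level payload."""
    return {"inputs": texts, **model_kwargs}


nested = build_nested_payload(["some text"], {"truncate": True})
merged = build_merged_payload(["some text"], {"truncate": True})
```

A server that looks for `truncate` at the top level of the JSON body only finds it in the merged shape.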

Co-authored-by: af su <saf@zjuici.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
2024-11-20 19:08:56 +00:00
Roman Solomatin
0f85dea8c8 langchain-huggingface: use separate kwargs for queries and docs (#27857)
Currently, `encode_kwargs` is used for both documents and queries, which
leads to wrong embeddings. For example:
```python
    model_kwargs = {"device": "cuda", "trust_remote_code": True}
    encode_kwargs = {"normalize_embeddings": False, "prompt_name": "s2p_query"}

    model = HuggingFaceEmbeddings(
        model_name="dunzhang/stella_en_400M_v5",
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
    )

    query_embedding = np.array(
        model.embed_query("What are some ways to reduce stress?",)
    )
    document_embedding = np.array(
        model.embed_documents(
            [
                "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
                "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
            ]
        )
    )
    print(model._client.similarity(query_embedding, document_embedding)) # output: tensor([[0.8421, 0.3317]], dtype=torch.float64)
```
But according to the [model card](https://huggingface.co/dunzhang/stella_en_400M_v5#sentence-transformers),
the expected usage is:
```python
    model_kwargs = {"device": "cuda", "trust_remote_code": True}
    encode_kwargs = {"normalize_embeddings": False}
    query_encode_kwargs = {"normalize_embeddings": False, "prompt_name": "s2p_query"}

    model = HuggingFaceEmbeddings(
        model_name="dunzhang/stella_en_400M_v5",
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
        query_encode_kwargs=query_encode_kwargs,
    )

    query_embedding = np.array(
        model.embed_query("What are some ways to reduce stress?", )
    )
    document_embedding = np.array(
        model.embed_documents(
            [
                "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
                "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
            ]
        )
    )
    print(model._client.similarity(query_embedding, document_embedding)) # tensor([[0.8398, 0.2990]], dtype=torch.float64)
```
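
One way to picture the split is a small selection helper, sketched below. `select_encode_kwargs` is hypothetical, and the fallback to `encode_kwargs` when no query-specific kwargs are given is an assumption about the intended behavior rather than a description of the merged code.

```python
from typing import Any


def select_encode_kwargs(
    encode_kwargs: dict[str, Any],
    query_encode_kwargs: dict[str, Any],
    *,
    is_query: bool,
) -> dict[str, Any]:
    """Pick the kwargs to use for a document or a query embedding call."""
    # Assumption: queries fall back to the shared kwargs when no
    # query-specific kwargs were configured.
    if is_query and query_encode_kwargs:
        return query_encode_kwargs
    return encode_kwargs


doc_kwargs = {"normalize_embeddings": False}
query_kwargs = {"normalize_embeddings": False, "prompt_name": "s2p_query"}

for_documents = select_encode_kwargs(doc_kwargs, query_kwargs, is_query=False)
for_queries = select_encode_kwargs(doc_kwargs, query_kwargs, is_query=True)
```

With this split, `prompt_name` reaches only the query path, so documents are no longer encoded with the query prompt.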
2024-11-06 17:35:39 -05:00
Andrew Effendi
49517cc1e7 partners/huggingface[patch]: fix HuggingFacePipeline model_id parameter (#27514)
**Description:** Fixes issue with model parameter not getting
initialized correctly when passing transformers pipeline
**Issue:** https://github.com/langchain-ai/langchain/issues/25915
2024-10-29 14:34:46 +00:00
Hyejun An
6227396e20 partners/HuggingFacePipeline[stream]: Change to use pipeline instead of pipeline.model.generate in stream() (#26531)
## Description

I encountered an error while using the `gemma-2-2b-it` model with the
`HuggingFacePipeline` class and have implemented a fix to resolve this
issue.

### What is the problem

```python
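import torch
from langchain_huggingface import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "google/gemma-2-2b-it"
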
gemma_2_model = AutoModelForCausalLM.from_pretrained(model_id)
gemma_2_tokenizer = AutoTokenizer.from_pretrained(model_id)

gen = pipeline( 
    task='text-generation',
    model=gemma_2_model,
    tokenizer=gemma_2_tokenizer,
    max_new_tokens=1024,
    device=0 if torch.cuda.is_available() else -1,
    temperature=.5,
    top_p=0.7,
    repetition_penalty=1.1,
    do_sample=True,
    )

llm = HuggingFacePipeline(pipeline=gen)

for chunk in llm.stream("Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World."):
    print(chunk, end="", flush=True)
```

This code outputs the following error message:

```
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1258: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
Exception in thread Thread-19 (generate):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1874, in generate
    self._validate_generated_length(generation_config, input_ids_length, has_default_max_length)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1266, in _validate_generated_length
    raise ValueError(
ValueError: Input length of input_ids is 31, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`.
```

A different error occurs when the prompt is shortened:

```python
for chunk in llm.stream("Hello World"):
    print(chunk, end="", flush=True)
```

```
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1258: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1885: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(
Exception in thread Thread-20 (generate):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2982, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 994, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 803, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 164, in forward
    return F.embedding(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2267, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
```

By contrast, `invoke()` produces normal output:

```python
llm.invoke("Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World.")
```
```
'Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World.\n\nThis is a simple program that prints the phrase "Hello World" to the console. \n\n**Here\'s how it works:**\n\n* **`print("Hello World")`**: This line of code uses the `print()` function, which is a built-in function in most programming languages (like Python). The `print()` function takes whatever you put inside its parentheses and displays it on the screen.\n* **`"Hello World"`**:  The text within the double quotes (`"`) is called a string. It represents the message we want to print.\n\n\nLet me know if you\'d like to explore other programming concepts or see more examples! \n'
```

### Problem Analysis

- The kwargs passed when creating the pipeline are applied in `invoke()`,
but not in `stream()`.
- In `stream()`, `inputs = self.pipeline.tokenizer(prompt,
return_tensors="pt")` creates the input tensors on the CPU.
  - This can crash when the model is on the GPU.

### Solution

Just use `self.pipeline` instead of `self.pipeline.model.generate`.

- **Original Code**

```python
stopping_criteria = StoppingCriteriaList([StopOnTokens()])

inputs = self.pipeline.tokenizer(prompt, return_tensors="pt")
streamer = TextIteratorStreamer(
    self.pipeline.tokenizer,
    timeout=60.0,
    skip_prompt=skip_prompt,
    skip_special_tokens=True,
)
generation_kwargs = dict(
    inputs,
    streamer=streamer,
    stopping_criteria=stopping_criteria,
    **pipeline_kwargs,
)
t1 = Thread(target=self.pipeline.model.generate, kwargs=generation_kwargs)
t1.start()
```

- **Updated Code**

```python
stopping_criteria = StoppingCriteriaList([StopOnTokens()])

streamer = TextIteratorStreamer(
    self.pipeline.tokenizer,
    timeout=60.0,
    skip_prompt=skip_prompt,
    skip_special_tokens=True,
)
generation_kwargs = dict(
    text_inputs=prompt,
    streamer=streamer,
    stopping_criteria=stopping_criteria,
    **pipeline_kwargs,
)
t1 = Thread(target=self.pipeline, kwargs=generation_kwargs)
t1.start()
```

By using the `pipeline` directly, the `kwargs` of the pipeline are
applied, and there is no need to consider the `device` of the `tensor`
made with the `tokenizer`.

> Because `pipeline` is now called directly, `text_inputs=prompt` is
passed in `generation_kwargs`.
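
The threading pattern in the updated code hands the whole call to a worker thread and reads chunks from the streamer on the calling thread. The sketch below shows only that pattern with standard-library stand-ins: `ToyStreamer` and `toy_pipeline` are hypothetical, and no `transformers` API is used.

```python
import queue
import threading
from typing import Iterator


class ToyStreamer:
    """Hypothetical stand-in for an iterator-style text streamer."""

    _END = object()  # sentinel that marks the end of the stream

    def __init__(self, timeout: float = 5.0) -> None:
        self._queue: queue.Queue = queue.Queue()
        self._timeout = timeout

    def put(self, text: str) -> None:
        self._queue.put(text)

    def end(self) -> None:
        self._queue.put(self._END)

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        # Blocks until the worker produces a chunk or the timeout expires.
        item = self._queue.get(timeout=self._timeout)
        if item is self._END:
            raise StopIteration
        return item


def toy_pipeline(text_inputs: str, streamer: ToyStreamer, **kwargs: object) -> None:
    """Hypothetical callable: emits one chunk per word, then signals the end."""
    for word in text_inputs.split():
        streamer.put(word + " ")
    streamer.end()


streamer = ToyStreamer()
generation_kwargs = dict(text_inputs="Hello World", streamer=streamer)

# The worker runs the whole call; the main thread only consumes chunks.
worker = threading.Thread(target=toy_pipeline, kwargs=generation_kwargs)
worker.start()
chunks = list(streamer)
worker.join()
```

Because the callable receives the raw prompt, the caller never creates tensors itself, which is the property the fix relies on.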

## Issue

None

## Dependencies

None

## Twitter handle

None

---------

Co-authored-by: Vadym Barda <vadym@langchain.dev>
2024-10-24 16:49:43 -04:00