Building applications with LLMs through composability
Go to file
Hyejun An 6227396e20
partners/HuggingFacePipeline[stream]: Change to use pipeline instead of pipeline.model.generate in stream() (#26531)
## Description

I encountered an error while using the` gemma-2-2b-it model` with the
`HuggingFacePipeline` class and have implemented a fix to resolve this
issue.

### What is Problem

```python
model_id="google/gemma-2-2b-it"


gemma_2_model = AutoModelForCausalLM.from_pretrained(model_id)
gemma_2_tokenizer = AutoTokenizer.from_pretrained(model_id)

gen = pipeline( 
    task='text-generation',
    model=gemma_2_model,
    tokenizer=gemma_2_tokenizer,
    max_new_tokens=1024,
    device=0 if torch.cuda.is_available() else -1,
    temperature=.5,
    top_p=0.7,
    repetition_penalty=1.1,
    do_sample=True,
    )

llm = HuggingFacePipeline(pipeline=gen)

for chunk in llm.stream("Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World."):
    print(chunk, end="", flush=True)
```

This code outputs the following error message:

```
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1258: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
Exception in thread Thread-19 (generate):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1874, in generate
    self._validate_generated_length(generation_config, input_ids_length, has_default_max_length)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1266, in _validate_generated_length
    raise ValueError(
ValueError: Input length of input_ids is 31, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`.
```

In addition, the following error occurs when the number of tokens is
reduced.

```python
for chunk in llm.stream("Hello World"):
    print(chunk, end="", flush=True)
```

```
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1258: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1885: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(
Exception in thread Thread-20 (generate):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2982, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 994, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 803, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 164, in forward
    return F.embedding(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2267, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
```

On the other hand, in the case of invoke, the output is normal:

```
llm.invoke("Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World.")
```
```
'Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World.\n\nThis is a simple program that prints the phrase "Hello World" to the console. \n\n**Here\'s how it works:**\n\n* **`print("Hello World")`**: This line of code uses the `print()` function, which is a built-in function in most programming languages (like Python). The `print()` function takes whatever you put inside its parentheses and displays it on the screen.\n* **`"Hello World"`**:  The text within the double quotes (`"`) is called a string. It represents the message we want to print.\n\n\nLet me know if you\'d like to explore other programming concepts or see more examples! \n'
```

### Problem Analysis

- Apparently, I put kwargs in while generating pipelines and it applied
to `invoke()`, but it's not applied in the `stream()`.
- When using the stream, `inputs = self.pipeline.tokenizer (prompt,
return_tensors = "pt")` enters cpu.
  - This can crash when the model is in gpu.

### Solution

Just use `self.pipeline` instead of `self.pipeline.model.generate`.

- **Original Code**

```python
stopping_criteria = StoppingCriteriaList([StopOnTokens()])

inputs = self.pipeline.tokenizer(prompt, return_tensors="pt")
streamer = TextIteratorStreamer(
    self.pipeline.tokenizer,
    timeout=60.0,
    skip_prompt=skip_prompt,
    skip_special_tokens=True,
)
generation_kwargs = dict(
    inputs,
    streamer=streamer,
    stopping_criteria=stopping_criteria,
    **pipeline_kwargs,
)
t1 = Thread(target=self.pipeline.model.generate, kwargs=generation_kwargs)
t1.start()
```

- **Updated Code**

```python
stopping_criteria = StoppingCriteriaList([StopOnTokens()])

streamer = TextIteratorStreamer(
    self.pipeline.tokenizer,
    timeout=60.0,
    skip_prompt=skip_prompt,
    skip_special_tokens=True,
)
generation_kwargs = dict(
    text_inputs= prompt,
    streamer=streamer,
    stopping_criteria=stopping_criteria,
    **pipeline_kwargs,
)
t1 = Thread(target=self.pipeline, kwargs=generation_kwargs)
t1.start()
```

By using the `pipeline` directly, the `kwargs` of the pipeline are
applied, and there is no need to consider the `device` of the `tensor`
made with the `tokenizer`.

> According to the change to use `pipeline`, it was modified to put
`text_inputs=prompts` directly into `generation_kwargs`.

## Issue

None

## Dependencies

None

## Twitter handle

None

---------

Co-authored-by: Vadym Barda <vadym@langchain.dev>
2024-10-24 16:49:43 -04:00
.devcontainer community[minor]: Add ApertureDB as a vectorstore (#24088) 2024-07-16 09:32:59 -07:00
.github infra: azure/mongo api docs build (#27512) 2024-10-21 08:27:46 -07:00
cookbook multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
docker
docs docs: Update structured_outputs.mdx (#27613) 2024-10-24 15:13:28 +00:00
libs partners/HuggingFacePipeline[stream]: Change to use pipeline instead of pipeline.model.generate in stream() (#26531) 2024-10-24 16:49:43 -04:00
scripts infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
templates community[patch]: update the default hf bge embeddings (#22627) 2024-09-02 22:10:21 +00:00
.gitattributes
.gitignore infra: gitignore api_ref mds (#25705) 2024-08-23 09:50:30 -07:00
.readthedocs.yaml
CITATION.cff
LICENSE
Makefile multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
MIGRATE.md Update MIGRATE.md (#27169) 2024-10-07 14:53:40 -04:00
poetry.lock monorepo: add script for updating notebook cassettes (#27399) 2024-10-16 13:46:49 -04:00
poetry.toml multiple: use modern installer in poetry (#23998) 2024-07-08 18:50:48 -07:00
pyproject.toml monorepo: add script for updating notebook cassettes (#27399) 2024-10-16 13:46:49 -04:00
README.md docs: platforms -> providers (#27285) 2024-10-16 18:27:07 +00:00
SECURITY.md
yarn.lock box: add langchain box package and DocumentLoader (#25506) 2024-08-21 02:23:43 +00:00

🦜🔗 LangChain

Build context-aware reasoning applications

Release Notes CI PyPI - License PyPI - Downloads GitHub star chart Open Issues Open in Dev Containers Open in GitHub Codespaces Twitter

Looking for the JS/TS library? Check out LangChain.js.

To help you ship LangChain apps to production faster, check out LangSmith. LangSmith is a unified developer platform for building, testing, and monitoring LLM applications. Fill out this form to speak with our sales team.

Quick Install

With pip:

pip install langchain

With conda:

conda install langchain -c conda-forge

🤔 What is LangChain?

LangChain is a framework for developing applications powered by large language models (LLMs).

For these applications, LangChain simplifies the entire application lifecycle:

  • Open-source libraries: Build your applications using LangChain's open-source building blocks, components, and third-party integrations. Use LangGraph to build stateful agents with first-class streaming and human-in-the-loop support.
  • Productionization: Inspect, monitor, and evaluate your apps with LangSmith so that you can constantly optimize and deploy with confidence.
  • Deployment: Turn your LangGraph applications into production-ready APIs and Assistants with LangGraph Cloud.

Open-source libraries

  • langchain-core: Base abstractions and LangChain Expression Language.
  • langchain-community: Third party integrations.
    • Some integrations have been further split into partner packages that only rely on langchain-core. Examples include langchain_openai and langchain_anthropic.
  • langchain: Chains, agents, and retrieval strategies that make up an application's cognitive architecture.
  • LangGraph: A library for building robust and stateful multi-actor applications with LLMs by modeling steps as edges and nodes in a graph. Integrates smoothly with LangChain, but can be used without it. To learn more about LangGraph, check out our first LangChain Academy course, Introduction to LangGraph, available here.

Productionization:

  • LangSmith: A developer platform that lets you debug, test, evaluate, and monitor chains built on any LLM framework and seamlessly integrates with LangChain.

Deployment:

  • LangGraph Cloud: Turn your LangGraph applications into production-ready APIs and Assistants.

Diagram outlining the hierarchical organization of the LangChain framework, displaying the interconnected parts across multiple layers.

🧱 What can you build with LangChain?

Question answering with RAG

🧱 Extracting structured output

🤖 Chatbots

And much more! Head to the Tutorials section of the docs for more.

🚀 How does LangChain help?

The main value props of the LangChain libraries are:

  1. Components: composable building blocks, tools and integrations for working with language models. Components are modular and easy-to-use, whether you are using the rest of the LangChain framework or not
  2. Off-the-shelf chains: built-in assemblages of components for accomplishing higher-level tasks

Off-the-shelf chains make it easy to get started. Components make it easy to customize existing chains and build new ones.

LangChain Expression Language (LCEL)

LCEL is a key part of LangChain, allowing you to build and organize chains of processes in a straightforward, declarative manner. It was designed to support taking prototypes directly into production without needing to alter any code. This means you can use LCEL to set up everything from basic "prompt + LLM" setups to intricate, multi-step workflows.

  • Overview: LCEL and its benefits
  • Interface: The standard Runnable interface for LCEL objects
  • Primitives: More on the primitives LCEL includes
  • Cheatsheet: Quick overview of the most common usage patterns

Components

Components fall into the following modules:

📃 Model I/O

This includes prompt management, prompt optimization, a generic interface for chat models and LLMs, and common utilities for working with model outputs.

📚 Retrieval

Retrieval Augmented Generation involves loading data from a variety of sources, preparing it, then searching over (a.k.a. retrieving from) it for use in the generation step.

🤖 Agents

Agents allow an LLM autonomy over how a task is accomplished. Agents make decisions about which Actions to take, then take that Action, observe the result, and repeat until the task is complete. LangChain provides a standard interface for agents, along with LangGraph for building custom agents.

📖 Documentation

Please see here for full documentation, which includes:

  • Introduction: Overview of the framework and the structure of the docs.
  • Tutorials: If you're looking to build something specific or are more of a hands-on learner, check out our tutorials. This is the best place to get started.
  • How-to guides: Answers to “How do I….?” type questions. These guides are goal-oriented and concrete; they're meant to help you complete a specific task.
  • Conceptual guide: Conceptual explanations of the key parts of the framework.
  • API Reference: Thorough documentation of every class and method.

🌐 Ecosystem

  • 🦜🛠️ LangSmith: Trace and evaluate your language model applications and intelligent agents to help you move from prototype to production.
  • 🦜🕸️ LangGraph: Create stateful, multi-actor applications with LLMs. Integrates smoothly with LangChain, but can be used without it.
  • 🦜🏓 LangServe: Deploy LangChain runnables and chains as REST APIs.

💁 Contributing

As an open-source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.

For detailed information on how to contribute, see here.

🌟 Contributors

langchain contributors