Commit Graph

925 Commits

Author SHA1 Message Date
Erick Friis
29f8a79ebe
groq,openai,mistralai: fix unit tests (#28279) 2024-11-22 04:54:01 +00:00
ccurme
56499cf58b
openai[patch]: unskip test and relax tolerance in embeddings comparison (#28262)
From what I can tell response using SDK is not deterministic:
```python
import numpy as np
import openai

documents = ["disallowed special token '<|endoftext|>'"]
model = "text-embedding-ada-002"

direct_output_1 = (
    openai.OpenAI()
    .embeddings.create(input=documents, model=model)
    .data[0]
    .embedding
)

for i in range(10):
    direct_output_2 = (
        openai.OpenAI()
        .embeddings.create(input=documents, model=model)
        .data[0]
        .embedding
    )
    print(f"{i}: {np.isclose(direct_output_1, direct_output_2).all()}")
```
```
0: True
1: True
2: True
3: True
4: False
5: True
6: True
7: True
8: True
9: True
```

See related discussion here:
https://community.openai.com/t/can-text-embedding-ada-002-be-made-deterministic/318054

Found the same result using `"text-embedding-3-small"`.
2024-11-21 10:23:10 -08:00
Erick Friis
d1108607f4
multiple: push deprecation removals to 1.0 (#28236) 2024-11-20 19:56:29 -08:00
Eugene Yurtsev
2acc83f146
mistralai[patch]: 0.2.2 release (#28240)
mistralai 0.2.2 release
2024-11-20 22:18:15 +00:00
Eugene Yurtsev
1a66175e38
mistral[patch]: Propagate tool call id (#28238)
mistralai-large-2411 requires tool call id

Older models accept tool call id if its provided

mistral-large-2407 
mistral-large-2402
2024-11-20 17:02:30 -05:00
af su
7c7ee07d30
huggingface[fix]: HuggingFaceEndpointEmbeddings model parameter passing error when async embed (#27953)
This change refines the handling of _model_kwargs in POST requests.
Instead of nesting _model_kwargs as a dictionary under the parameters
key, it is now directly unpacked and merged into the request's JSON
payload. This ensures that the model parameters are passed correctly and
avoids unnecessary nesting.E. g.:

```python
import asyncio

from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings

embedding_input = ["This input will get multiplied" * 10000]

embeddings = HuggingFaceEndpointEmbeddings(
    model="http://127.0.0.1:8081/embed",
    model_kwargs={"truncate": True},
)

# Truncated parameters in synchronized methods are handled correctly
embeddings.embed_documents(texts=embedding_input)
# The truncate parameter is not handled correctly in the asynchronous method,
# and 413 Request Entity Too Large is returned.
asyncio.run(embeddings.aembed_documents(texts=embedding_input))
```

Co-authored-by: af su <saf@zjuici.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
2024-11-20 19:08:56 +00:00
Eric Pinzur
923ef85105
langchain_chroma: fixed integration tests (#27968)
Description:
* I'm planning to add `Document.id` support to the Chroma VectorStore,
but first I wanted to make sure all the integration tests were passing
first. They weren't. This PR fixes the broken tests.
* I found 2 issues:
* This change (from a year ago, exactly :) ) for supporting multi-modal
embeddings:
https://docs.trychroma.com/deployment/migration#migration-to-0.4.16---november-7,-2023
* This change https://github.com/langchain-ai/langchain/pull/27827 due
to an update in the chroma client.
  
Also ran `format` and `lint` on the changes.

Note: I am not a member of the Chroma team.
2024-11-20 11:05:02 -08:00
Erick Friis
0dbaf05bb7
standard-tests: rename langchain_standard_tests to langchain_tests, release 0.3.2 (#28203) 2024-11-18 19:10:39 -08:00
Erick Friis
d9d689572a
openai: release 0.2.9, o1 streaming (#28197) 2024-11-18 23:54:38 +00:00
Erick Friis
6d2004ee7d
multiple: langchain-standard-tests -> langchain-tests (#28139) 2024-11-15 11:32:04 -08:00
Elham Badri
d696728278
partners/ollama: Enabled Token Level Streaming when Using Bind Tools for ChatOllama (#27689)
**Description:** The issue concerns the unexpected behavior observed
using the bind_tools method in LangChain's ChatOllama. When tools are
not bound, the llm.stream() method works as expected, returning
incremental chunks of content, which is crucial for real-time
applications such as conversational agents and live feedback systems.
However, when bind_tools([]) is used, the streaming behavior changes,
causing the output to be delivered in full chunks rather than
incrementally. This change negatively impacts the user experience by
breaking the real-time nature of the streaming mechanism.
**Issue:** #26971

---------

Co-authored-by: 4meyDam1e <amey.damle@mail.utoronto.ca>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-11-15 11:36:27 -05:00
ccurme
776e3271e3
standard-tests[patch]: add test for async tool calling (#28133) 2024-11-15 16:09:50 +00:00
Vadym Barda
6ec688cf2b
xai[patch]: update core (#28092) 2024-11-13 17:51:51 +00:00
Vadym Barda
09e85c7c4b
xai[patch]: update dependencies (#28067) 2024-11-12 16:15:17 -05:00
ccurme
00e7b2dada
anthropic[patch]: add examples to API ref (#28065) 2024-11-12 20:17:02 +00:00
Vadym Barda
48ee322a78
partners: add xAI chat integration (#28032) 2024-11-12 15:11:29 -05:00
ccurme
2898b95ca7
anthropic[major]: release 0.3.0 (#28063) 2024-11-12 14:58:00 -05:00
ccurme
5eaa0e8c45
openai[patch]: release 0.2.8 (#28062) 2024-11-12 14:57:11 -05:00
ccurme
1538ee17f9
anthropic[major]: support python 3.13 (#27916)
Last week Anthropic released version 0.39.0 of its python sdk, which
enabled support for Python 3.13. This release deleted a legacy
`client.count_tokens` method, which we currently access during init of
the `Anthropic` LLM. Anthropic has replaced this functionality with the
[client.beta.messages.count_tokens()
API](https://github.com/anthropics/anthropic-sdk-python/pull/726).

To enable support for `anthropic >= 0.39.0` and Python 3.13, here we
drop support for the legacy token counting method, and add support for
the new method via `ChatAnthropic.get_num_tokens_from_messages`.

To fully support the token counting API, we update the signature of
`get_num_tokens_from_message` to accept tools everywhere.

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
2024-11-12 14:31:07 -05:00
Bagatur
139881b108
openai[patch]: fix azure oai stream check (#28048) 2024-11-12 15:42:06 +00:00
Bagatur
9611f0b55d
openai[patch]: Release 0.2.7 (#28047) 2024-11-12 15:16:15 +00:00
Bagatur
33dbfba08b
openai[patch]: default to invoke on o1 stream() (#27983) 2024-11-08 19:12:59 -08:00
Erick Friis
8a5b9bf2ad
box: migrate to repo (#27969) 2024-11-07 10:19:22 -08:00
ccurme
a747dbd24b
anthropic[patch]: remove retired model from tests (#27965)
`claude-instant` was [retired
yesterday](https://docs.anthropic.com/en/docs/resources/model-deprecations).
2024-11-07 16:16:29 +00:00
ZhangShenao
c2072d909a
Improvement[Partner] Improve qdrant vector store (#27251)
- Add static method decorator
- Add args for api doc
- Fix word spelling

Co-authored-by: Erick Friis <erick@langchain.dev>
2024-11-07 02:42:41 +00:00
Roman Solomatin
0f85dea8c8
langchain-huggingface: use separate kwargs for queries and docs (#27857)
Now `encode_kwargs` used for both for documents and queries and this
leads to wrong embeddings. E. g.:
```python
    model_kwargs = {"device": "cuda", "trust_remote_code": True}
    encode_kwargs = {"normalize_embeddings": False, "prompt_name": "s2p_query"}

    model = HuggingFaceEmbeddings(
        model_name="dunzhang/stella_en_400M_v5",
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
    )

    query_embedding = np.array(
        model.embed_query("What are some ways to reduce stress?",)
    )
    document_embedding = np.array(
        model.embed_documents(
            [
                "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
                "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
            ]
        )
    )
    print(model._client.similarity(query_embedding, document_embedding)) # output: tensor([[0.8421, 0.3317]], dtype=torch.float64)
```
But from the [model
card](https://huggingface.co/dunzhang/stella_en_400M_v5#sentence-transformers)
expexted like this:
```python
    model_kwargs = {"device": "cuda", "trust_remote_code": True}
    encode_kwargs = {"normalize_embeddings": False}
    query_encode_kwargs = {"normalize_embeddings": False, "prompt_name": "s2p_query"}

    model = HuggingFaceEmbeddings(
        model_name="dunzhang/stella_en_400M_v5",
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
        query_encode_kwargs=query_encode_kwargs,
    )

    query_embedding = np.array(
        model.embed_query("What are some ways to reduce stress?", )
    )
    document_embedding = np.array(
        model.embed_documents(
            [
                "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
                "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
            ]
        )
    )
    print(model._client.similarity(query_embedding, document_embedding)) # tensor([[0.8398, 0.2990]], dtype=torch.float64)
```
2024-11-06 17:35:39 -05:00
ccurme
66966a6e72
openai[patch]: release 0.2.6 (#27924)
Some additions in support of [predicted
outputs](https://platform.openai.com/docs/guides/latency-optimization#use-predicted-outputs)
feature:
- Bump openai sdk version
- Add integration test
- Add example to integration docs

The `prediction` kwarg is already plumbed through model invocation.
2024-11-05 23:02:24 +00:00
SHJUN
f6b2f82099
community: chroma error patch(attribute changed on chroma) (#27827)
There was a change of attribute name which was "max_batch_size". It's
now "get_max_batch_size" method.
I want to use "create_batches" which is right down below.

Please check this PR link.
reference: https://github.com/chroma-core/chroma/pull/2305

---------

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Co-authored-by: Prithvi Kannan <46332835+prithvikannan@users.noreply.github.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Jun Yamog <jkyamog@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: ono-hiroki <86904208+ono-hiroki@users.noreply.github.com>
Co-authored-by: Dobiichi-Origami <56953648+Dobiichi-Origami@users.noreply.github.com>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
Co-authored-by: Duy Huynh <vndee.huynh@gmail.com>
Co-authored-by: Rashmi Pawar <168514198+raspawar@users.noreply.github.com>
Co-authored-by: sifatj <26035630+sifatj@users.noreply.github.com>
Co-authored-by: Eric Pinzur <2641606+epinzur@users.noreply.github.com>
Co-authored-by: Daniel Vu Dao <danielvdao@users.noreply.github.com>
Co-authored-by: Ofer Mendelevitch <ofermend@gmail.com>
Co-authored-by: Stéphane Philippart <wildagsx@gmail.com>
2024-11-05 19:43:11 +00:00
Bagatur
dfa83531ad
qdrant,nomic[minor]: bump core deps (#27849) 2024-11-04 20:19:50 +00:00
Bagatur
3b0b7cfb74
chroma[minor]: release 0.2.0 (#27840) 2024-11-01 18:12:00 -07:00
Bagatur
002e1c9055
airbyte: remove from master (#27837) 2024-11-01 13:59:34 -07:00
Bagatur
ee63d21915
many: use core 0.3.15 (#27834) 2024-11-01 20:35:55 +00:00
Bagatur
06420de2e7
integrations[patch]: bump core to 0.3.15 (#27805) 2024-10-31 11:27:05 -07:00
JiaranI
3952ee31b8
ollama: add pydocstyle linting for ollama (#27686)
Description: add lint docstrings for ollama module
Issue: the issue https://github.com/langchain-ai/langchain/issues/23188
@baskaryan

test: ruff check passed.
<img width="311" alt="e94c68ffa93dd518297a95a93de5217"
src="https://github.com/user-attachments/assets/e96bf721-e0e3-44de-a50e-206603de398e">

Co-authored-by: Erick Friis <erick@langchain.dev>
2024-10-31 03:06:55 +00:00
Bagatur
6691202998
anthropic[patch]: allow multiple sys not at start (#27725) 2024-10-30 23:56:47 +00:00
ccurme
88bfd60b03
infra: specify python max version of 3.12 for some integration packages (#27740) 2024-10-30 12:24:48 -04:00
ccurme
bd5ea18a6c
groq[patch]: update standard tests (#27744)
- Add xfail on integration test (fails [> 50% of the
time](https://github.com/langchain-ai/langchain/actions/workflows/scheduled_test.yml));
- Remove xfail on passing unit test.
2024-10-30 15:50:51 +00:00
Andrew Effendi
49517cc1e7
partners/huggingface[patch]: fix HuggingFacePipeline model_id parameter (#27514)
**Description:** Fixes issue with model parameter not getting
initialized correctly when passing transformers pipeline
**Issue:** https://github.com/langchain-ai/langchain/issues/25915
2024-10-29 14:34:46 +00:00
Erick Friis
583808a7b8
partners/huggingface: release 0.1.1 (#27691) 2024-10-28 13:39:38 -07:00
Erick Friis
6d524e9566
partners/box: release 0.2.2 (#27690) 2024-10-28 12:54:20 -07:00
yahya-mouman
6803cb4f34
openai[patch]: add check for none values when summing token usage (#27585)
**Description:** Fixes None addition issues when an empty value is
passed on

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
2024-10-28 12:49:43 -07:00
Bagatur
ede953d617
openai[patch]: fix schema formatting util (#27685) 2024-10-28 15:46:47 +00:00
ccurme
fe87e411f2
groq: fix unit test (#27660) 2024-10-26 14:57:23 -04:00
Bagatur
d5306899d3
openai[patch]: Release 0.2.4 (#27652) 2024-10-25 20:26:21 +00:00
Nithish Raghunandanan
0623c74560
couchbase: Add document id to vector search results (#27622)
**Description:** Returns the document id along with the Vector Search
results

**Issue:** Fixes https://github.com/langchain-ai/langchain/issues/26860
for CouchbaseVectorStore


- [x] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.


- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified.

Co-authored-by: Erick Friis <erick@langchain.dev>
2024-10-24 21:47:36 +00:00
Hyejun An
6227396e20
partners/HuggingFacePipeline[stream]: Change to use pipeline instead of pipeline.model.generate in stream() (#26531)
## Description

I encountered an error while using the` gemma-2-2b-it model` with the
`HuggingFacePipeline` class and have implemented a fix to resolve this
issue.

### What is Problem

```python
model_id="google/gemma-2-2b-it"


gemma_2_model = AutoModelForCausalLM.from_pretrained(model_id)
gemma_2_tokenizer = AutoTokenizer.from_pretrained(model_id)

gen = pipeline( 
    task='text-generation',
    model=gemma_2_model,
    tokenizer=gemma_2_tokenizer,
    max_new_tokens=1024,
    device=0 if torch.cuda.is_available() else -1,
    temperature=.5,
    top_p=0.7,
    repetition_penalty=1.1,
    do_sample=True,
    )

llm = HuggingFacePipeline(pipeline=gen)

for chunk in llm.stream("Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World."):
    print(chunk, end="", flush=True)
```

This code outputs the following error message:

```
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1258: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
Exception in thread Thread-19 (generate):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1874, in generate
    self._validate_generated_length(generation_config, input_ids_length, has_default_max_length)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1266, in _validate_generated_length
    raise ValueError(
ValueError: Input length of input_ids is 31, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`.
```

In addition, the following error occurs when the number of tokens is
reduced.

```python
for chunk in llm.stream("Hello World"):
    print(chunk, end="", flush=True)
```

```
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1258: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1885: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(
Exception in thread Thread-20 (generate):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2982, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 994, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 803, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 164, in forward
    return F.embedding(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2267, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
```

On the other hand, in the case of invoke, the output is normal:

```
llm.invoke("Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World.")
```
```
'Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World.\n\nThis is a simple program that prints the phrase "Hello World" to the console. \n\n**Here\'s how it works:**\n\n* **`print("Hello World")`**: This line of code uses the `print()` function, which is a built-in function in most programming languages (like Python). The `print()` function takes whatever you put inside its parentheses and displays it on the screen.\n* **`"Hello World"`**:  The text within the double quotes (`"`) is called a string. It represents the message we want to print.\n\n\nLet me know if you\'d like to explore other programming concepts or see more examples! \n'
```

### Problem Analysis

- Apparently, I put kwargs in while generating pipelines and it applied
to `invoke()`, but it's not applied in the `stream()`.
- When using the stream, `inputs = self.pipeline.tokenizer (prompt,
return_tensors = "pt")` enters cpu.
  - This can crash when the model is in gpu.

### Solution

Just use `self.pipeline` instead of `self.pipeline.model.generate`.

- **Original Code**

```python
stopping_criteria = StoppingCriteriaList([StopOnTokens()])

inputs = self.pipeline.tokenizer(prompt, return_tensors="pt")
streamer = TextIteratorStreamer(
    self.pipeline.tokenizer,
    timeout=60.0,
    skip_prompt=skip_prompt,
    skip_special_tokens=True,
)
generation_kwargs = dict(
    inputs,
    streamer=streamer,
    stopping_criteria=stopping_criteria,
    **pipeline_kwargs,
)
t1 = Thread(target=self.pipeline.model.generate, kwargs=generation_kwargs)
t1.start()
```

- **Updated Code**

```python
stopping_criteria = StoppingCriteriaList([StopOnTokens()])

streamer = TextIteratorStreamer(
    self.pipeline.tokenizer,
    timeout=60.0,
    skip_prompt=skip_prompt,
    skip_special_tokens=True,
)
generation_kwargs = dict(
    text_inputs= prompt,
    streamer=streamer,
    stopping_criteria=stopping_criteria,
    **pipeline_kwargs,
)
t1 = Thread(target=self.pipeline, kwargs=generation_kwargs)
t1.start()
```

By using the `pipeline` directly, the `kwargs` of the pipeline are
applied, and there is no need to consider the `device` of the `tensor`
made with the `tokenizer`.

> According to the change to use `pipeline`, it was modified to put
`text_inputs=prompts` directly into `generation_kwargs`.

## Issue

None

## Dependencies

None

## Twitter handle

None

---------

Co-authored-by: Vadym Barda <vadym@langchain.dev>
2024-10-24 16:49:43 -04:00
Bagatur
655ced84d7
openai[patch]: accept json schema response format directly (#27623)
fix #25460

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
2024-10-24 18:19:15 +00:00
Kwan Kin Chan
6d2a76ac05
langchain_huggingface: Fix multiple GPU usage bug in from_model_id function (#23628)
- [ ]  **Description:**   
   - pass the device_map into model_kwargs 
- removing the unused device_map variable in the hf_pipeline function
call
- [ ] **Issue:** issue #13128 
When using the from_model_id function to load a Hugging Face model for
text generation across multiple GPUs, the model defaults to loading on
the CPU despite multiple GPUs being available using the expected format
``` python
llm = HuggingFacePipeline.from_model_id(
    model_id="model-id",
    task="text-generation",
    device_map="auto",
)
```
Currently, to enable multiple GPU , we have to pass in variable in this
format instead
``` python
llm = HuggingFacePipeline.from_model_id(
    model_id="model-id",
    task="text-generation",
    device=None,
    model_kwargs={
        "device_map": "auto",
    }
)
```
This issue arises due to improper handling of the device and device_map
parameters.

- [ ] **Explanation:**
1. In from_model_id, the model is created using model_kwargs and passed
as the model variable of the pipeline function. So at this moment, to
load the model with multiple GPUs, "device_map" needs to be set to
"auto" within model_kwargs. Otherwise, the model defaults to loading on
the CPU.
2. The device_map variable in from_model_id is not utilized correctly.
In the pipeline function's source code of tnansformer:
- The device_map variable is stored in the model_kwargs dictionary
(lines 867-878 of transformers/src/transformers/pipelines/\__init__.py).
```python
    if device_map is not None:
        ......
        model_kwargs["device_map"] = device_map
```
- The model is constructed with model_kwargs containing the device_map
value ONLY IF it is a string (lines 893-903 of
transformers/src/transformers/pipelines/\__init__.py).
```python
    if isinstance(model, str) or framework is None:
        model_classes = {"tf": targeted_task["tf"], "pt": targeted_task["pt"]}
        framework, model = infer_framework_load_model( ... , **model_kwargs, )
```
- Consequently, since a model object is already passed to the pipeline
function, the device_map variable from from_model_id is never used.

3. The device_map variable in from_model_id not only appears unused but
also causes errors. Without explicitly setting device=None, attempting
to load the model on multiple GPUs may result in the following error:
 ```
Device has 2 GPUs available. Provide device={deviceId} to
`from_model_id` to use available GPUs for execution. deviceId is -1
(default) for CPU and can be a positive integer associated with CUDA
device id.
  Traceback (most recent call last):
    File "foo.py", line 15, in <module>
      llm = HuggingFacePipeline.from_model_id(
File
"foo\site-packages\langchain_huggingface\llms\huggingface_pipeline.py",
line 217, in from_model_id
      pipeline = hf_pipeline(
File "foo\lib\site-packages\transformers\pipelines\__init__.py", line
1108, in pipeline
return pipeline_class(model=model, framework=framework, task=task,
**kwargs)
File "foo\lib\site-packages\transformers\pipelines\text_generation.py",
line 96, in __init__
      super().__init__(*args, **kwargs)
File "foo\lib\site-packages\transformers\pipelines\base.py", line 835,
in __init__
      raise ValueError(
ValueError: The model has been loaded with `accelerate` and therefore
cannot be moved to a specific device. Please discard the `device`
argument when creating your pipeline object.
```
This error occurs because, in from_model_id, the default values in from_model_id for device and device_map are -1 and None, respectively. It would passes the statement (`device_map is not None and device < 0`) and keep the device as -1 so the pipeline function later raises an error when trying to move a GPU-loaded model back to the CPU. 
19eb82e68b/libs/community/langchain_community/llms/huggingface_pipeline.py (L204-L213)




If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

---------

Co-authored-by: William FH <13333726+hinthornw@users.noreply.github.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: vbarda <vadym@langchain.dev>
2024-10-22 21:41:47 -04:00
Fernando de Oliveira
ab205e7389
partners/openai + community: Async Azure AD token provider support for Azure OpenAI (#27488)
This PR introduces a new `azure_ad_async_token_provider` attribute to
the `AzureOpenAI` and `AzureChatOpenAI` classes in `partners/openai` and
`community` packages, given it's currently supported on `openai` package
as
[AsyncAzureADTokenProvider](https://github.com/openai/openai-python/blob/main/src/openai/lib/azure.py#L33)
type.

The reason for creating a new attribute is to avoid breaking changes.
Let's say you have an existing code that uses a `AzureOpenAI` or
`AzureChatOpenAI` instance to perform both sync and async operations.
The `azure_ad_token_provider` will work exactly as it is today, while
`azure_ad_async_token_provider` will override it for async requests.


If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
2024-10-22 21:43:06 +00:00
Vadym Barda
0640cbf2f1
huggingface[patch]: hide client field in HuggingFaceEmbeddings (#27522) 2024-10-21 17:37:07 -04:00