docs: update multi-modal docs (#30880)

Co-authored-by: Sydney Runkle <54324534+sydney-runkle@users.noreply.github.com>
ccurme 2025-04-17 16:03:05 -04:00 committed by GitHub
parent 98c357b3d7
commit f14bcee525
4 changed files with 778 additions and 243 deletions

View File

@ -15,7 +15,10 @@
* [Messages](/docs/concepts/messages)
:::
Multimodal support is still relatively new and less common; model providers have not yet standardized on the "best" way to define the API. As such, LangChain's multimodal abstractions are lightweight and flexible, designed to accommodate different model providers' APIs and interaction patterns, but they are **not** standardized across models.
LangChain supports multimodal data as input to chat models:
1. Following provider-specific formats
2. Adhering to a cross-provider standard (see [how-to guides](/docs/how_to/#multimodal) for detail)
### How to use multimodal models
@ -26,38 +29,85 @@ Multimodal support is still relatively new and less common, model providers have
#### Inputs
Some models can accept multimodal inputs, such as images, audio, video, or files.
The types of multimodal inputs supported depend on the model provider. For instance,
[OpenAI](/docs/integrations/chat/openai/),
[Anthropic](/docs/integrations/chat/anthropic/), and
[Google Gemini](/docs/integrations/chat/google_generative_ai/)
support documents like PDFs as inputs.
The gist of passing multimodal inputs to a chat model is to use content blocks that
specify a type and corresponding data. For example, to pass an image to a chat model
as URL:
```python
from langchain_core.messages import HumanMessage

message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the weather in this image:"},
        {
            "type": "image",
            "source_type": "url",
            "url": "https://...",
        },
    ],
)
response = model.invoke([message])
```
We can also pass the image as in-line data:
```python
from langchain_core.messages import HumanMessage

message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the weather in this image:"},
        {
            "type": "image",
            "source_type": "base64",
            "data": "<base64 string>",
            "mime_type": "image/jpeg",
        },
    ],
)
response = model.invoke([message])
```
To pass a PDF file as in-line data (or URL, as supported by providers such as
Anthropic), just change `"type"` to `"file"` and `"mime_type"` to `"application/pdf"`.
See the [how-to guides](/docs/how_to/#multimodal) for more detail.
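
For example, an in-line PDF can be passed like this (the base64 payload below is a placeholder):

```python
from langchain_core.messages import HumanMessage

message = HumanMessage(
    content=[
        {"type": "text", "text": "Summarize this document:"},
        {
            "type": "file",
            "source_type": "base64",
            "data": "<base64 string>",
            "mime_type": "application/pdf",
        },
    ],
)
response = model.invoke([message])
```
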
Most chat models that support multimodal **image** inputs also accept those values in
OpenAI's [Chat Completions format](https://platform.openai.com/docs/guides/images?api-mode=chat):
```python
from langchain_core.messages import HumanMessage

message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the weather in this image:"},
        {"type": "image_url", "image_url": {"url": image_url}},
    ],
)
response = model.invoke([message])
```
:::caution
The exact format of the content blocks may vary depending on the model provider. Please refer to the chat model's
integration documentation for the correct format. Find the integration in the [chat model integration table](/docs/integrations/chat/).
:::
Otherwise, chat models will typically accept the native, provider-specific content
block format. See [chat model integrations](/docs/integrations/chat/) for detail
on specific providers.
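
For example, [ChatAnthropic](/docs/integrations/chat/anthropic/) also accepts images in Anthropic's native block format. A minimal sketch (refer to the Anthropic integration page for the authoritative shape):

```python
from langchain_core.messages import HumanMessage

message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the weather in this image:"},
        {
            # Anthropic's native image block
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": "<base64 string>",
            },
        },
    ],
)
response = model.invoke([message])
```
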
#### Outputs
Some chat models support multimodal outputs, such as images and audio. Multimodal
outputs will appear as part of the [AIMessage](/docs/concepts/messages/#aimessage)
response object. See for example:
- Generating [audio outputs](/docs/integrations/chat/openai/#audio-generation-preview) with OpenAI;
- Generating [image outputs](/docs/integrations/chat/google_generative_ai/#image-generation) with Google Gemini.
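
As a minimal sketch, audio generation with [ChatOpenAI](/docs/integrations/chat/openai/) looks roughly like the following; the `modalities` and `audio` parameters are OpenAI API fields passed through via `model_kwargs`, so check the integration page for current options:

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-audio-preview",
    model_kwargs={
        "modalities": ["text", "audio"],
        "audio": {"voice": "alloy", "format": "wav"},
    },
)

response = llm.invoke("Please tell me a joke.")
# The base64-encoded audio is surfaced on the message's additional_kwargs
audio_data = response.additional_kwargs["audio"]["data"]
```
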
#### Tools

View File

@ -50,6 +50,7 @@ See [supported integrations](/docs/integrations/chat/) for details on getting st
- [How to: force a specific tool call](/docs/how_to/tool_choice)
- [How to: work with local models](/docs/how_to/local_llms)
- [How to: init any model in one line](/docs/how_to/chat_models_universal_init/)
- [How to: pass multimodal data directly to models](/docs/how_to/multimodal_inputs/)
### Messages
@ -67,6 +68,7 @@ See [supported integrations](/docs/integrations/chat/) for details on getting st
- [How to: use few shot examples in chat models](/docs/how_to/few_shot_examples_chat/)
- [How to: partially format prompt templates](/docs/how_to/prompts_partial)
- [How to: compose prompts together](/docs/how_to/prompts_composition)
- [How to: use multimodal prompts](/docs/how_to/multimodal_prompts/)
### Example selectors

View File

@ -5,120 +5,165 @@
"id": "4facdf7f-680e-4d28-908b-2b8408e2a741",
"metadata": {},
"source": [
"# How to pass multimodal data directly to models\n",
"# How to pass multimodal data to models\n",
"\n",
"Here we demonstrate how to pass [multimodal](/docs/concepts/multimodality/) input directly to models. \n",
"We currently expect all input to be passed in the same format as [OpenAI expects](https://platform.openai.com/docs/guides/vision).\n",
"For other model providers that support multimodal input, we have added logic inside the class to convert to the expected format.\n",
"Here we demonstrate how to pass [multimodal](/docs/concepts/multimodality/) input directly to models.\n",
"\n",
"In this example we will ask a [model](/docs/concepts/chat_models/#multimodality) to describe an image."
"LangChain supports multimodal data as input to chat models:\n",
"\n",
"1. Following provider-specific formats\n",
"2. Adhering to a cross-provider standard\n",
"\n",
"Below, we demonstrate the cross-provider standard. See [chat model integrations](/docs/integrations/chat/) for detail\n",
"on native formats for specific providers.\n",
"\n",
":::note\n",
"\n",
"Most chat models that support multimodal **image** inputs also accept those values in\n",
"OpenAI's [Chat Completions format](https://platform.openai.com/docs/guides/images?api-mode=chat):\n",
"\n",
"```python\n",
"{\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\"url\": image_url},\n",
"}\n",
"```\n",
":::"
]
},
{
"cell_type": "markdown",
"id": "e30a4ff0-ab38-41a7-858c-a93f99bb2f1b",
"metadata": {},
"source": [
"## Images\n",
"\n",
"Many providers will accept images passed in-line as base64 data. Some will additionally accept an image from a URL directly.\n",
"\n",
"### Images from base64 data\n",
"\n",
"To pass images in-line, format them as content blocks of the following form:\n",
"\n",
"```python\n",
"{\n",
" \"type\": \"image\",\n",
" \"source_type\": \"base64\",\n",
" \"mime_type\": \"image/jpeg\", # or image/png, etc.\n",
" \"data\": \"<base64 data string>\",\n",
"}\n",
"```\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "0d9fd81a-b7f0-445a-8e3d-cfc2d31fdd59",
"execution_count": 10,
"id": "1fcf7b27-1cc3-420a-b920-0420b5892e20",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The image shows a beautiful clear day with bright blue skies and wispy cirrus clouds stretching across the horizon. The clouds are thin and streaky, creating elegant patterns against the blue backdrop. The lighting suggests it's during the day, possibly late afternoon given the warm, golden quality of the light on the grass. The weather appears calm with no signs of wind (the grass looks relatively still) and no indication of rain. It's the kind of perfect, mild weather that's ideal for walking along the wooden boardwalk through the marsh grass.\n"
]
}
],
"source": [
"image_url = \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\""
"import base64\n",
"\n",
"import httpx\n",
"from langchain.chat_models import init_chat_model\n",
"\n",
"# Fetch image data\n",
"image_url = \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n",
"image_data = base64.b64encode(httpx.get(image_url).content).decode(\"utf-8\")\n",
"\n",
"\n",
"# Pass to LLM\n",
"llm = init_chat_model(\"anthropic:claude-3-5-sonnet-latest\")\n",
"\n",
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Describe the weather in this image:\",\n",
" },\n",
" # highlight-start\n",
" {\n",
" \"type\": \"image\",\n",
" \"source_type\": \"base64\",\n",
" \"data\": image_data,\n",
" \"mime_type\": \"image/jpeg\",\n",
" },\n",
" # highlight-end\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())"
]
},
{
"cell_type": "markdown",
"id": "ee2b678a-01dd-40c1-81ff-ddac22be21b7",
"metadata": {},
"source": [
"See [LangSmith trace](https://smith.langchain.com/public/eab05a31-54e8-4fc9-911f-56805da67bef/r) for more detail.\n",
"\n",
"### Images from a URL\n",
"\n",
"Some providers (including [OpenAI](/docs/integrations/chat/openai/),\n",
"[Anthropic](/docs/integrations/chat/anthropic/), and\n",
"[Google Gemini](/docs/integrations/chat/google_generative_ai/)) will also accept images from URLs directly.\n",
"\n",
"To pass images as URLs, format them as content blocks of the following form:\n",
"\n",
"```python\n",
"{\n",
" \"type\": \"image\",\n",
" \"source_type\": \"url\",\n",
" \"url\": \"https://...\",\n",
"}\n",
"```\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ec680b6b",
"id": "99d27f8f-ae78-48bc-9bf2-3cef35213ec7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The weather in the image appears to be clear and pleasant. The sky is mostly blue with scattered, light clouds, suggesting a sunny day with minimal cloud cover. There is no indication of rain or strong winds, and the overall scene looks bright and calm. The lush green grass and clear visibility further indicate good weather conditions.\n"
"The weather in this image appears to be pleasant and clear. The sky is mostly blue with a few scattered, light clouds, and there is bright sunlight illuminating the green grass and plants. There are no signs of rain or stormy conditions, suggesting it is a calm, likely warm day—typical of spring or summer.\n"
]
}
],
"source": [
"message = HumanMessage(\n",
" content=[\n",
" {\"type\": \"text\", \"text\": \"describe the weather in this image\"},\n",
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\"url\": f\"data:image/jpeg;base64,{image_data}\"},\n",
" \"type\": \"text\",\n",
" \"text\": \"Describe the weather in this image:\",\n",
" },\n",
" {\n",
" \"type\": \"image\",\n",
" # highlight-start\n",
" \"source_type\": \"url\",\n",
" \"url\": image_url,\n",
" # highlight-end\n",
" },\n",
" ],\n",
")\n",
"response = model.invoke([message])\n",
"print(response.content)"
]
},
{
"cell_type": "markdown",
"id": "8656018e-c56d-47d2-b2be-71e87827f90a",
"metadata": {},
"source": [
"We can feed the image URL directly in a content block of type \"image_url\". Note that only some model providers support this."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a8819cf3-5ddc-44f0-889a-19ca7b7fe77e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The weather in the image appears to be clear and sunny. The sky is mostly blue with a few scattered clouds, suggesting good visibility and a likely pleasant temperature. The bright sunlight is casting distinct shadows on the grass and vegetation, indicating it is likely daytime, possibly late morning or early afternoon. The overall ambiance suggests a warm and inviting day, suitable for outdoor activities.\n"
]
}
],
"source": [
"message = HumanMessage(\n",
" content=[\n",
" {\"type\": \"text\", \"text\": \"describe the weather in this image\"},\n",
" {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
" ],\n",
")\n",
"response = model.invoke([message])\n",
"print(response.content)"
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())"
]
},
{
@ -126,12 +171,12 @@
"id": "1c470309",
"metadata": {},
"source": [
"We can also pass in multiple images."
"We can also pass in multiple images:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 4,
"id": "325fb4ca",
"metadata": {},
"outputs": [
@ -139,20 +184,460 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Yes, the two images are the same. They both depict a wooden boardwalk extending through a grassy field under a blue sky with light clouds. The scenery, lighting, and composition are identical.\n"
"Yes, these two images are the same. They depict a wooden boardwalk going through a grassy field under a blue sky with some clouds. The colors, composition, and elements in both images are identical.\n"
]
}
],
"source": [
"message = HumanMessage(\n",
" content=[\n",
" {\"type\": \"text\", \"text\": \"are these two images the same?\"},\n",
" {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
" {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"text\", \"text\": \"Are these two images the same?\"},\n",
" {\"type\": \"image\", \"source_type\": \"url\", \"url\": image_url},\n",
" {\"type\": \"image\", \"source_type\": \"url\", \"url\": image_url},\n",
" ],\n",
")\n",
"response = model.invoke([message])\n",
"print(response.content)"
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())"
]
},
{
"cell_type": "markdown",
"id": "d72b83e6-8d21-448e-b5df-d5b556c3ccc8",
"metadata": {},
"source": [
"## Documents (PDF)\n",
"\n",
"Some providers (including [OpenAI](/docs/integrations/chat/openai/),\n",
"[Anthropic](/docs/integrations/chat/anthropic/), and\n",
"[Google Gemini](/docs/integrations/chat/google_generative_ai/)) will accept PDF documents.\n",
"\n",
"### Documents from base64 data\n",
"\n",
"To pass documents in-line, format them as content blocks of the following form:\n",
"\n",
"```python\n",
"{\n",
" \"type\": \"file\",\n",
" \"source_type\": \"base64\",\n",
" \"mime_type\": \"application/pdf\",\n",
" \"data\": \"<base64 data string>\",\n",
"}\n",
"```\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6c1455a9-699a-4702-a7e0-7f6eaec76a21",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This document appears to be a sample PDF file that contains Lorem ipsum placeholder text. It begins with a title \"Sample PDF\" followed by the subtitle \"This is a simple PDF file. Fun fun fun.\"\n",
"\n",
"The rest of the document consists of several paragraphs of Lorem ipsum text, which is a commonly used placeholder text in design and publishing. The text is formatted in a clean, readable layout with consistent paragraph spacing. The document appears to be a single page containing four main paragraphs of this placeholder text.\n",
"\n",
"The Lorem ipsum text, while appearing to be Latin, is actually scrambled Latin-like text that is used primarily to demonstrate the visual form of a document or typeface without the distraction of meaningful content. It's commonly used in publishing and graphic design when the actual content is not yet available but the layout needs to be demonstrated.\n",
"\n",
"The document has a professional, simple layout with generous margins and clear paragraph separation, making it an effective example of basic PDF formatting and structure.\n"
]
}
],
"source": [
"import base64\n",
"\n",
"import httpx\n",
"from langchain.chat_models import init_chat_model\n",
"\n",
"# Fetch PDF data\n",
"pdf_url = \"https://pdfobject.com/pdf/sample.pdf\"\n",
"pdf_data = base64.b64encode(httpx.get(pdf_url).content).decode(\"utf-8\")\n",
"\n",
"\n",
"# Pass to LLM\n",
"llm = init_chat_model(\"anthropic:claude-3-5-sonnet-latest\")\n",
"\n",
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Describe the document:\",\n",
" },\n",
" # highlight-start\n",
" {\n",
" \"type\": \"file\",\n",
" \"source_type\": \"base64\",\n",
" \"data\": pdf_data,\n",
" \"mime_type\": \"application/pdf\",\n",
" },\n",
" # highlight-end\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())"
]
},
{
"cell_type": "markdown",
"id": "efb271da-8fdd-41b5-9f29-be6f8c76f49b",
"metadata": {},
"source": [
"### Documents from a URL\n",
"\n",
"Some providers (specifically [Anthropic](/docs/integrations/chat/anthropic/))\n",
"will also accept documents from URLs directly.\n",
"\n",
"To pass documents as URLs, format them as content blocks of the following form:\n",
"\n",
"```python\n",
"{\n",
" \"type\": \"file\",\n",
" \"source_type\": \"url\",\n",
" \"url\": \"https://...\",\n",
"}\n",
"```\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "55e1d937-3b22-4deb-b9f0-9e688f0609dc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This document appears to be a sample PDF file with both text and an image. It begins with a title \"Sample PDF\" followed by the text \"This is a simple PDF file. Fun fun fun.\" The rest of the document contains Lorem ipsum placeholder text arranged in several paragraphs. The content is shown both as text and as an image of the formatted PDF, with the same content displayed in a clean, formatted layout with consistent spacing and typography. The document consists of a single page containing this sample text.\n"
]
}
],
"source": [
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Describe the document:\",\n",
" },\n",
" {\n",
" \"type\": \"file\",\n",
" # highlight-start\n",
" \"source_type\": \"url\",\n",
" \"url\": pdf_url,\n",
" # highlight-end\n",
" },\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())"
]
},
{
"cell_type": "markdown",
"id": "1e661c26-e537-4721-8268-42c0861cb1e6",
"metadata": {},
"source": [
"## Audio\n",
"\n",
"Some providers (including [OpenAI](/docs/integrations/chat/openai/) and\n",
"[Google Gemini](/docs/integrations/chat/google_generative_ai/)) will accept audio inputs.\n",
"\n",
"### Audio from base64 data\n",
"\n",
"To pass audio in-line, format them as content blocks of the following form:\n",
"\n",
"```python\n",
"{\n",
" \"type\": \"audio\",\n",
" \"source_type\": \"base64\",\n",
" \"mime_type\": \"audio/wav\", # or appropriate mime-type\n",
" \"data\": \"<base64 data string>\",\n",
"}\n",
"```\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a0b91b29-dbd6-4c94-8f24-05471adc7598",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The audio appears to consist primarily of bird sounds, specifically bird vocalizations like chirping and possibly other bird songs.\n"
]
}
],
"source": [
"import base64\n",
"\n",
"import httpx\n",
"from langchain.chat_models import init_chat_model\n",
"\n",
"# Fetch audio data\n",
"audio_url = \"https://upload.wikimedia.org/wikipedia/commons/3/3d/Alcal%C3%A1_de_Henares_%28RPS_13-04-2024%29_canto_de_ruise%C3%B1or_%28Luscinia_megarhynchos%29_en_el_Soto_del_Henares.wav\"\n",
"audio_data = base64.b64encode(httpx.get(audio_url).content).decode(\"utf-8\")\n",
"\n",
"\n",
"# Pass to LLM\n",
"llm = init_chat_model(\"google_genai:gemini-2.0-flash-001\")\n",
"\n",
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Describe this audio:\",\n",
" },\n",
" # highlight-start\n",
" {\n",
" \"type\": \"audio\",\n",
" \"source_type\": \"base64\",\n",
" \"data\": audio_data,\n",
" \"mime_type\": \"audio/wav\",\n",
" },\n",
" # highlight-end\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())"
]
},
{
"cell_type": "markdown",
"id": "92f55a6c-2e4a-4175-8444-8b9aacd6a13e",
"metadata": {},
"source": [
"## Provider-specific parameters\n",
"\n",
"Some providers will support or require additional fields on content blocks containing multimodal data.\n",
"For example, Anthropic lets you specify [caching](/docs/integrations/chat/anthropic/#prompt-caching) of\n",
"specific content to reduce token consumption.\n",
"\n",
"To use these fields, you can:\n",
"\n",
"1. Store them on directly on the content block; or\n",
"2. Use the native format supported by each provider (see [chat model integrations](/docs/integrations/chat/) for detail).\n",
"\n",
"We show three examples below.\n",
"\n",
"### Example: Anthropic prompt caching"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "83593b9d-a8d3-4c99-9dac-64e0a9d397cb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The image shows a beautiful, clear day with partly cloudy skies. The sky is a vibrant blue with wispy, white cirrus clouds stretching across it. The lighting suggests it's during daylight hours, possibly late afternoon or early evening given the warm, golden quality of the light on the grass. The weather appears calm with no signs of wind (the grass looks relatively still) and no threatening weather conditions. It's the kind of perfect weather you'd want for a walk along this wooden boardwalk through the marshland or grassland area.\n"
]
},
{
"data": {
"text/plain": [
"{'input_tokens': 1586,\n",
" 'output_tokens': 117,\n",
" 'total_tokens': 1703,\n",
" 'input_token_details': {'cache_read': 0, 'cache_creation': 1582}}"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm = init_chat_model(\"anthropic:claude-3-5-sonnet-latest\")\n",
"\n",
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Describe the weather in this image:\",\n",
" },\n",
" {\n",
" \"type\": \"image\",\n",
" \"source_type\": \"url\",\n",
" \"url\": image_url,\n",
" # highlight-next-line\n",
" \"cache_control\": {\"type\": \"ephemeral\"},\n",
" },\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())\n",
"response.usage_metadata"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9bbf578e-794a-4dc0-a469-78c876ccd4a3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Clear blue skies, wispy clouds.\n"
]
},
{
"data": {
"text/plain": [
"{'input_tokens': 1716,\n",
" 'output_tokens': 12,\n",
" 'total_tokens': 1728,\n",
" 'input_token_details': {'cache_read': 1582, 'cache_creation': 0}}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"next_message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Summarize that in 5 words.\",\n",
" }\n",
" ],\n",
"}\n",
"response = llm.invoke([message, response, next_message])\n",
"print(response.text())\n",
"response.usage_metadata"
]
},
{
"cell_type": "markdown",
"id": "915b9443-5964-43b8-bb08-691c1ba59065",
"metadata": {},
"source": [
"### Example: Anthropic citations"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ea7707a1-5660-40a1-a10f-0df48a028689",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'citations': [{'cited_text': 'Sample PDF\\r\\nThis is a simple PDF file. Fun fun fun.\\r\\n',\n",
" 'document_index': 0,\n",
" 'document_title': None,\n",
" 'end_page_number': 2,\n",
" 'start_page_number': 1,\n",
" 'type': 'page_location'}],\n",
" 'text': 'Simple PDF file: fun fun',\n",
" 'type': 'text'}]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Generate a 5 word summary of this document.\",\n",
" },\n",
" {\n",
" \"type\": \"file\",\n",
" \"source_type\": \"base64\",\n",
" \"data\": pdf_data,\n",
" \"mime_type\": \"application/pdf\",\n",
" # highlight-next-line\n",
" \"citations\": {\"enabled\": True},\n",
" },\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"response.content"
]
},
{
"cell_type": "markdown",
"id": "e26991eb-e769-41f4-b6e0-63d81f2c7d67",
"metadata": {},
"source": [
"### Example: OpenAI file names\n",
"\n",
"OpenAI requires that PDF documents be associated with file names:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ae076c9b-ff8f-461d-9349-250f396c9a25",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The document is a sample PDF file containing placeholder text. It consists of one page, titled \"Sample PDF\". The content is a mixture of English and the commonly used filler text \"Lorem ipsum dolor sit amet...\" and its extensions, which are often used in publishing and web design as generic text to demonstrate font, layout, and other visual elements.\n",
"\n",
"**Key points about the document:**\n",
"- Length: 1 page\n",
"- Purpose: Demonstrative/sample content\n",
"- Content: No substantive or meaningful information, just demonstration text in paragraph form\n",
"- Language: English (with the Latin-like \"Lorem Ipsum\" text used for layout purposes)\n",
"\n",
"There are no charts, tables, diagrams, or images on the page—only plain text. The document serves as an example of what a PDF file looks like rather than providing actual, useful content.\n"
]
}
],
"source": [
"llm = init_chat_model(\"openai:gpt-4.1\")\n",
"\n",
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Describe the document:\",\n",
" },\n",
" {\n",
" \"type\": \"file\",\n",
" \"source_type\": \"base64\",\n",
" \"data\": pdf_data,\n",
" \"mime_type\": \"application/pdf\",\n",
" # highlight-next-line\n",
" \"filename\": \"my-file\",\n",
" },\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())"
]
},
{
@ -167,16 +652,22 @@
},
{
"cell_type": "code",
"execution_count": 8,
"id": "cd22ea82-2f93-46f9-9f7a-6aaf479fcaa9",
"execution_count": 4,
"id": "0f68cce7-350b-4cde-bc40-d3a169551fc3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'name': 'weather_tool', 'args': {'weather': 'sunny'}, 'id': 'call_BSX4oq4SKnLlp2WlzDhToHBr'}]\n"
]
"data": {
"text/plain": [
"[{'name': 'weather_tool',\n",
" 'args': {'weather': 'sunny'},\n",
" 'id': 'toolu_01G6JgdkhwggKcQKfhXZQPjf',\n",
" 'type': 'tool_call'}]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
@ -191,16 +682,17 @@
" pass\n",
"\n",
"\n",
"model_with_tools = model.bind_tools([weather_tool])\n",
"llm_with_tools = llm.bind_tools([weather_tool])\n",
"\n",
"message = HumanMessage(\n",
" content=[\n",
" {\"type\": \"text\", \"text\": \"describe the weather in this image\"},\n",
" {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"text\", \"text\": \"Describe the weather in this image:\"},\n",
" {\"type\": \"image\", \"source_type\": \"url\", \"url\": image_url},\n",
" ],\n",
")\n",
"response = model_with_tools.invoke([message])\n",
"print(response.tool_calls)"
"}\n",
"response = llm_with_tools.invoke([message])\n",
"response.tool_calls"
]
}
],
@ -220,7 +712,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.4"
}
},
"nbformat": 4,

View File

@ -9,157 +9,148 @@
"\n",
"Here we demonstrate how to use prompt templates to format [multimodal](/docs/concepts/multimodality/) inputs to models. \n",
"\n",
"In this example we will ask a [model](/docs/concepts/chat_models/#multimodality) to describe an image."
"To use prompt templates in the context of multimodal data, we can templatize elements of the corresponding content block.\n",
"For example, below we define a prompt that takes a URL for an image as a parameter:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 1,
"id": "2671f995",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"model = ChatOpenAI(model=\"gpt-4o\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "4ee35e4f",
"metadata": {},
"outputs": [],
"source": [
"prompt = ChatPromptTemplate.from_messages(\n",
"# Define prompt\n",
"prompt = ChatPromptTemplate(\n",
" [\n",
" (\"system\", \"Describe the image provided\"),\n",
" (\n",
" \"user\",\n",
" [\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": \"Describe the image provided.\",\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\"url\": \"data:image/jpeg;base64,{image_data}\"},\n",
" }\n",
" \"type\": \"image\",\n",
" \"source_type\": \"url\",\n",
" # highlight-next-line\n",
" \"url\": \"{image_url}\",\n",
" },\n",
" ],\n",
" ),\n",
" },\n",
" ]\n",
")"
]
},
{
"cell_type": "markdown",
"id": "e9b9ebf6",
"id": "f75d2e26-5b9a-4d5f-94a7-7f98f5666f6d",
"metadata": {},
"source": [
"We can also pass in multiple images."
"Let's use this prompt to pass an image to a [chat model](/docs/concepts/chat_models/#multimodality):"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "513abe00",
"execution_count": 2,
"id": "5df2e558-321d-4cf7-994e-2815ac37e704",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The two images provided are identical. Both images feature a wooden boardwalk path extending through a lush green field under a bright blue sky with some clouds. The perspective, colors, and elements in both images are exactly the same.\n"
"This image shows a beautiful wooden boardwalk cutting through a lush green wetland or marsh area. The boardwalk extends straight ahead toward the horizon, creating a strong leading line through the composition. On either side, tall green grasses sway in what appears to be a summer or late spring setting. The sky is particularly striking, with wispy cirrus clouds streaking across a vibrant blue background. In the distance, you can see a tree line bordering the wetland area. The lighting suggests this may be during \"golden hour\" - either early morning or late afternoon - as there's a warm, gentle quality to the light that's illuminating the scene. The wooden planks of the boardwalk appear well-maintained and provide safe passage through what would otherwise be difficult terrain to traverse. It's the kind of scene you might find in a nature preserve or wildlife refuge designed to give visitors access to observe wetland ecosystems while protecting the natural environment.\n"
]
}
],
"source": [
"response = chain.invoke({\"image_data1\": image_data, \"image_data2\": image_data})\n",
"print(response.content)"
"from langchain.chat_models import init_chat_model\n",
"\n",
"llm = init_chat_model(\"anthropic:claude-3-5-sonnet-latest\")\n",
"\n",
"url = \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n",
"\n",
"chain = prompt | llm\n",
"response = chain.invoke({\"image_url\": url})\n",
"print(response.text())"
]
},
{
"cell_type": "markdown",
"id": "f4cfdc50-4a9f-4888-93b4-af697366b0f3",
"metadata": {},
"source": [
"Note that we can templatize arbitrary elements of the content block:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "53c88ebb-dd57-40c8-8542-b2c916706653",
"metadata": {},
"outputs": [],
"source": [
"prompt = ChatPromptTemplate(\n",
" [\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": \"Describe the image provided.\",\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"image\",\n",
" \"source_type\": \"base64\",\n",
" \"mime_type\": \"{image_mime_type}\",\n",
" \"data\": \"{image_data}\",\n",
" \"cache_control\": {\"type\": \"{cache_type}\"},\n",
" },\n",
" ],\n",
" },\n",
" ]\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "25e4829e-0073-49a8-9669-9f43e5778383",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This image shows a beautiful wooden boardwalk cutting through a lush green marsh or wetland area. The boardwalk extends straight ahead toward the horizon, creating a strong leading line in the composition. The surrounding vegetation consists of tall grass and reeds in vibrant green hues, with some bushes and trees visible in the background. The sky is particularly striking, featuring a bright blue color with wispy white clouds streaked across it. The lighting suggests this photo was taken during the \"golden hour\" - either early morning or late afternoon - giving the scene a warm, peaceful quality. The raised wooden path provides accessible access through what would otherwise be difficult terrain to traverse, allowing visitors to experience and appreciate this natural environment.\n"
]
}
],
"source": [
"import base64\n",
"\n",
"import httpx\n",
"\n",
"image_data = base64.b64encode(httpx.get(url).content).decode(\"utf-8\")\n",
"\n",
"chain = prompt | llm\n",
"response = chain.invoke(\n",
" {\n",
" \"image_data\": image_data,\n",
" \"image_mime_type\": \"image/jpeg\",\n",
" \"cache_type\": \"ephemeral\",\n",
" }\n",
")\n",
"print(response.text())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea8152c3",
"id": "424defe8-d85c-4e45-a88d-bf6f910d5ebb",
"metadata": {},
"outputs": [],
"source": []
@ -181,7 +172,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
"version": "3.10.4"
}
},
"nbformat": 4,