{
"cells": [
{
"cell_type": "markdown",
"id": "4facdf7f-680e-4d28-908b-2b8408e2a741",
"metadata": {},
"source": [
"# How to pass multimodal data to models\n",
"\n",
"Here we demonstrate how to pass [multimodal](/docs/concepts/multimodality/) input directly to models.\n",
"\n",
"LangChain supports multimodal data as input to chat models:\n",
"\n",
"1. Following provider-specific formats\n",
"2. Adhering to a cross-provider standard\n",
"\n",
"Below, we demonstrate the cross-provider standard. See [chat model integrations](/docs/integrations/chat/) for detail\n",
"on native formats for specific providers.\n",
"\n",
":::note\n",
"\n",
"Most chat models that support multimodal **image** inputs also accept those values in\n",
"OpenAI's [Chat Completions format](https://platform.openai.com/docs/guides/images?api-mode=chat):\n",
"\n",
"```python\n",
"{\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\"url\": image_url},\n",
"}\n",
"```\n",
":::"
]
},
{
"cell_type": "markdown",
"id": "e30a4ff0-ab38-41a7-858c-a93f99bb2f1b",
"metadata": {},
"source": [
"## Images\n",
"\n",
"Many providers will accept images passed in-line as base64 data. Some will additionally accept an image from a URL directly.\n",
"\n",
"### Images from base64 data\n",
"\n",
"To pass images in-line, format them as content blocks of the following form:\n",
"\n",
"```python\n",
"{\n",
" \"type\": \"image\",\n",
" \"source_type\": \"base64\",\n",
" \"mime_type\": \"image/jpeg\", # or image/png, etc.\n",
" \"data\": \"<base64 data string>\",\n",
"}\n",
"```\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "1fcf7b27-1cc3-420a-b920-0420b5892e20",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The image shows a beautiful clear day with bright blue skies and wispy cirrus clouds stretching across the horizon. The clouds are thin and streaky, creating elegant patterns against the blue backdrop. The lighting suggests it's during the day, possibly late afternoon given the warm, golden quality of the light on the grass. The weather appears calm with no signs of wind (the grass looks relatively still) and no indication of rain. It's the kind of perfect, mild weather that's ideal for walking along the wooden boardwalk through the marsh grass.\n"
]
}
],
"source": [
"import base64\n",
"\n",
"import httpx\n",
"from langchain.chat_models import init_chat_model\n",
"\n",
"# Fetch image data\n",
"image_url = \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n",
"image_data = base64.b64encode(httpx.get(image_url).content).decode(\"utf-8\")\n",
"\n",
"\n",
"# Pass to LLM\n",
"llm = init_chat_model(\"anthropic:claude-3-5-sonnet-latest\")\n",
"\n",
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Describe the weather in this image:\",\n",
" },\n",
" # highlight-start\n",
" {\n",
" \"type\": \"image\",\n",
" \"source_type\": \"base64\",\n",
" \"data\": image_data,\n",
" \"mime_type\": \"image/jpeg\",\n",
" },\n",
" # highlight-end\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())"
]
},
{
"cell_type": "markdown",
"id": "ee2b678a-01dd-40c1-81ff-ddac22be21b7",
"metadata": {},
"source": [
"See [LangSmith trace](https://smith.langchain.com/public/eab05a31-54e8-4fc9-911f-56805da67bef/r) for more detail.\n",
"\n",
"### Images from a URL\n",
"\n",
"Some providers (including [OpenAI](/docs/integrations/chat/openai/),\n",
"[Anthropic](/docs/integrations/chat/anthropic/), and\n",
"[Google Gemini](/docs/integrations/chat/google_generative_ai/)) will also accept images from URLs directly.\n",
"\n",
"To pass images as URLs, format them as content blocks of the following form:\n",
"\n",
"```python\n",
"{\n",
" \"type\": \"image\",\n",
" \"source_type\": \"url\",\n",
" \"url\": \"https://...\",\n",
"}\n",
"```\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "99d27f8f-ae78-48bc-9bf2-3cef35213ec7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The weather in this image appears to be pleasant and clear. The sky is mostly blue with a few scattered, light clouds, and there is bright sunlight illuminating the green grass and plants. There are no signs of rain or stormy conditions, suggesting it is a calm, likely warm day—typical of spring or summer.\n"
]
}
],
"source": [
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Describe the weather in this image:\",\n",
" },\n",
" {\n",
" \"type\": \"image\",\n",
" # highlight-start\n",
" \"source_type\": \"url\",\n",
" \"url\": image_url,\n",
" # highlight-end\n",
" },\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())"
]
},
{
"cell_type": "markdown",
"id": "1c470309",
"metadata": {},
"source": [
"We can also pass in multiple images:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "325fb4ca",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Yes, these two images are the same. They depict a wooden boardwalk going through a grassy field under a blue sky with some clouds. The colors, composition, and elements in both images are identical.\n"
]
}
],
"source": [
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"text\", \"text\": \"Are these two images the same?\"},\n",
" {\"type\": \"image\", \"source_type\": \"url\", \"url\": image_url},\n",
" {\"type\": \"image\", \"source_type\": \"url\", \"url\": image_url},\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())"
]
},
{
"cell_type": "markdown",
"id": "d72b83e6-8d21-448e-b5df-d5b556c3ccc8",
"metadata": {},
"source": [
"## Documents (PDF)\n",
"\n",
"Some providers (including [OpenAI](/docs/integrations/chat/openai/),\n",
"[Anthropic](/docs/integrations/chat/anthropic/), and\n",
"[Google Gemini](/docs/integrations/chat/google_generative_ai/)) will accept PDF documents.\n",
"\n",
"### Documents from base64 data\n",
"\n",
"To pass documents in-line, format them as content blocks of the following form:\n",
"\n",
"```python\n",
"{\n",
" \"type\": \"file\",\n",
" \"source_type\": \"base64\",\n",
" \"mime_type\": \"application/pdf\",\n",
" \"data\": \"<base64 data string>\",\n",
"}\n",
"```\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6c1455a9-699a-4702-a7e0-7f6eaec76a21",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This document appears to be a sample PDF file that contains Lorem ipsum placeholder text. It begins with a title \"Sample PDF\" followed by the subtitle \"This is a simple PDF file. Fun fun fun.\"\n",
"\n",
"The rest of the document consists of several paragraphs of Lorem ipsum text, which is a commonly used placeholder text in design and publishing. The text is formatted in a clean, readable layout with consistent paragraph spacing. The document appears to be a single page containing four main paragraphs of this placeholder text.\n",
"\n",
"The Lorem ipsum text, while appearing to be Latin, is actually scrambled Latin-like text that is used primarily to demonstrate the visual form of a document or typeface without the distraction of meaningful content. It's commonly used in publishing and graphic design when the actual content is not yet available but the layout needs to be demonstrated.\n",
"\n",
"The document has a professional, simple layout with generous margins and clear paragraph separation, making it an effective example of basic PDF formatting and structure.\n"
]
}
],
"source": [
"import base64\n",
"\n",
"import httpx\n",
"from langchain.chat_models import init_chat_model\n",
"\n",
"# Fetch PDF data\n",
"pdf_url = \"https://pdfobject.com/pdf/sample.pdf\"\n",
"pdf_data = base64.b64encode(httpx.get(pdf_url).content).decode(\"utf-8\")\n",
"\n",
"\n",
"# Pass to LLM\n",
"llm = init_chat_model(\"anthropic:claude-3-5-sonnet-latest\")\n",
"\n",
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Describe the document:\",\n",
" },\n",
" # highlight-start\n",
" {\n",
" \"type\": \"file\",\n",
" \"source_type\": \"base64\",\n",
" \"data\": pdf_data,\n",
" \"mime_type\": \"application/pdf\",\n",
" },\n",
" # highlight-end\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())"
]
},
{
"cell_type": "markdown",
"id": "efb271da-8fdd-41b5-9f29-be6f8c76f49b",
"metadata": {},
"source": [
"### Documents from a URL\n",
"\n",
"Some providers (specifically [Anthropic](/docs/integrations/chat/anthropic/))\n",
"will also accept documents from URLs directly.\n",
"\n",
"To pass documents as URLs, format them as content blocks of the following form:\n",
"\n",
"```python\n",
"{\n",
" \"type\": \"file\",\n",
" \"source_type\": \"url\",\n",
" \"url\": \"https://...\",\n",
"}\n",
"```\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "55e1d937-3b22-4deb-b9f0-9e688f0609dc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This document appears to be a sample PDF file with both text and an image. It begins with a title \"Sample PDF\" followed by the text \"This is a simple PDF file. Fun fun fun.\" The rest of the document contains Lorem ipsum placeholder text arranged in several paragraphs. The content is shown both as text and as an image of the formatted PDF, with the same content displayed in a clean, formatted layout with consistent spacing and typography. The document consists of a single page containing this sample text.\n"
]
}
],
"source": [
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Describe the document:\",\n",
" },\n",
" {\n",
" \"type\": \"file\",\n",
" # highlight-start\n",
" \"source_type\": \"url\",\n",
" \"url\": pdf_url,\n",
" # highlight-end\n",
" },\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())"
]
},
{
"cell_type": "markdown",
"id": "1e661c26-e537-4721-8268-42c0861cb1e6",
"metadata": {},
"source": [
"## Audio\n",
"\n",
"Some providers (including [OpenAI](/docs/integrations/chat/openai/) and\n",
"[Google Gemini](/docs/integrations/chat/google_generative_ai/)) will accept audio inputs.\n",
"\n",
"### Audio from base64 data\n",
"\n",
"To pass audio in-line, format them as content blocks of the following form:\n",
"\n",
"```python\n",
"{\n",
" \"type\": \"audio\",\n",
" \"source_type\": \"base64\",\n",
" \"mime_type\": \"audio/wav\", # or appropriate mime-type\n",
" \"data\": \"<base64 data string>\",\n",
"}\n",
"```\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a0b91b29-dbd6-4c94-8f24-05471adc7598",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The audio appears to consist primarily of bird sounds, specifically bird vocalizations like chirping and possibly other bird songs.\n"
]
}
],
"source": [
"import base64\n",
"\n",
"import httpx\n",
"from langchain.chat_models import init_chat_model\n",
"\n",
"# Fetch audio data\n",
"audio_url = \"https://upload.wikimedia.org/wikipedia/commons/3/3d/Alcal%C3%A1_de_Henares_%28RPS_13-04-2024%29_canto_de_ruise%C3%B1or_%28Luscinia_megarhynchos%29_en_el_Soto_del_Henares.wav\"\n",
"audio_data = base64.b64encode(httpx.get(audio_url).content).decode(\"utf-8\")\n",
"\n",
"\n",
"# Pass to LLM\n",
"llm = init_chat_model(\"google_genai:gemini-2.0-flash-001\")\n",
"\n",
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Describe this audio:\",\n",
" },\n",
" # highlight-start\n",
" {\n",
" \"type\": \"audio\",\n",
" \"source_type\": \"base64\",\n",
" \"data\": audio_data,\n",
" \"mime_type\": \"audio/wav\",\n",
" },\n",
" # highlight-end\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())"
]
},
{
"cell_type": "markdown",
"id": "92f55a6c-2e4a-4175-8444-8b9aacd6a13e",
"metadata": {},
"source": [
"## Provider-specific parameters\n",
"\n",
"Some providers will support or require additional fields on content blocks containing multimodal data.\n",
"For example, Anthropic lets you specify [caching](/docs/integrations/chat/anthropic/#prompt-caching) of\n",
"specific content to reduce token consumption.\n",
"\n",
"To use these fields, you can:\n",
"\n",
"1. Store them on directly on the content block; or\n",
"2. Use the native format supported by each provider (see [chat model integrations](/docs/integrations/chat/) for detail).\n",
"\n",
"We show three examples below.\n",
"\n",
"### Example: Anthropic prompt caching"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "83593b9d-a8d3-4c99-9dac-64e0a9d397cb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The image shows a beautiful, clear day with partly cloudy skies. The sky is a vibrant blue with wispy, white cirrus clouds stretching across it. The lighting suggests it's during daylight hours, possibly late afternoon or early evening given the warm, golden quality of the light on the grass. The weather appears calm with no signs of wind (the grass looks relatively still) and no threatening weather conditions. It's the kind of perfect weather you'd want for a walk along this wooden boardwalk through the marshland or grassland area.\n"
]
},
{
"data": {
"text/plain": [
"{'input_tokens': 1586,\n",
" 'output_tokens': 117,\n",
" 'total_tokens': 1703,\n",
" 'input_token_details': {'cache_read': 0, 'cache_creation': 1582}}"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm = init_chat_model(\"anthropic:claude-3-5-sonnet-latest\")\n",
"\n",
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Describe the weather in this image:\",\n",
" },\n",
" {\n",
" \"type\": \"image\",\n",
" \"source_type\": \"url\",\n",
" \"url\": image_url,\n",
" # highlight-next-line\n",
" \"cache_control\": {\"type\": \"ephemeral\"},\n",
" },\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())\n",
"response.usage_metadata"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9bbf578e-794a-4dc0-a469-78c876ccd4a3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Clear blue skies, wispy clouds.\n"
]
},
{
"data": {
"text/plain": [
"{'input_tokens': 1716,\n",
" 'output_tokens': 12,\n",
" 'total_tokens': 1728,\n",
" 'input_token_details': {'cache_read': 1582, 'cache_creation': 0}}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"next_message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Summarize that in 5 words.\",\n",
" }\n",
" ],\n",
"}\n",
"response = llm.invoke([message, response, next_message])\n",
"print(response.text())\n",
"response.usage_metadata"
]
},
{
"cell_type": "markdown",
"id": "915b9443-5964-43b8-bb08-691c1ba59065",
"metadata": {},
"source": [
"### Example: Anthropic citations"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ea7707a1-5660-40a1-a10f-0df48a028689",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'citations': [{'cited_text': 'Sample PDF\\r\\nThis is a simple PDF file. Fun fun fun.\\r\\n',\n",
" 'document_index': 0,\n",
" 'document_title': None,\n",
" 'end_page_number': 2,\n",
" 'start_page_number': 1,\n",
" 'type': 'page_location'}],\n",
" 'text': 'Simple PDF file: fun fun',\n",
" 'type': 'text'}]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Generate a 5 word summary of this document.\",\n",
" },\n",
" {\n",
" \"type\": \"file\",\n",
" \"source_type\": \"base64\",\n",
" \"data\": pdf_data,\n",
" \"mime_type\": \"application/pdf\",\n",
" # highlight-next-line\n",
" \"citations\": {\"enabled\": True},\n",
" },\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"response.content"
]
},
{
"cell_type": "markdown",
"id": "e26991eb-e769-41f4-b6e0-63d81f2c7d67",
"metadata": {},
"source": [
"### Example: OpenAI file names\n",
"\n",
"OpenAI requires that PDF documents be associated with file names:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ae076c9b-ff8f-461d-9349-250f396c9a25",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The document is a sample PDF file containing placeholder text. It consists of one page, titled \"Sample PDF\". The content is a mixture of English and the commonly used filler text \"Lorem ipsum dolor sit amet...\" and its extensions, which are often used in publishing and web design as generic text to demonstrate font, layout, and other visual elements.\n",
"\n",
"**Key points about the document:**\n",
"- Length: 1 page\n",
"- Purpose: Demonstrative/sample content\n",
"- Content: No substantive or meaningful information, just demonstration text in paragraph form\n",
"- Language: English (with the Latin-like \"Lorem Ipsum\" text used for layout purposes)\n",
"\n",
"There are no charts, tables, diagrams, or images on the page—only plain text. The document serves as an example of what a PDF file looks like rather than providing actual, useful content.\n"
]
}
],
"source": [
"llm = init_chat_model(\"openai:gpt-4.1\")\n",
"\n",
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Describe the document:\",\n",
" },\n",
" {\n",
" \"type\": \"file\",\n",
" \"source_type\": \"base64\",\n",
" \"data\": pdf_data,\n",
" \"mime_type\": \"application/pdf\",\n",
" # highlight-next-line\n",
" \"filename\": \"my-file\",\n",
" },\n",
" ],\n",
"}\n",
"response = llm.invoke([message])\n",
"print(response.text())"
]
},
{
"cell_type": "markdown",
"id": "71bd28cf-d76c-44e2-a55e-c5f265db986e",
"metadata": {},
"source": [
"## Tool calls\n",
"\n",
"Some multimodal models support [tool calling](/docs/concepts/tool_calling) features as well. To call tools using such models, simply bind tools to them in the [usual way](/docs/how_to/tool_calling), and invoke the model using content blocks of the desired type (e.g., containing image data)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "0f68cce7-350b-4cde-bc40-d3a169551fc3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'name': 'weather_tool',\n",
" 'args': {'weather': 'sunny'},\n",
" 'id': 'toolu_01G6JgdkhwggKcQKfhXZQPjf',\n",
" 'type': 'tool_call'}]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from typing import Literal\n",
"\n",
"from langchain_core.tools import tool\n",
"\n",
"\n",
"@tool\n",
"def weather_tool(weather: Literal[\"sunny\", \"cloudy\", \"rainy\"]) -> None:\n",
" \"\"\"Describe the weather\"\"\"\n",
" pass\n",
"\n",
"\n",
"llm_with_tools = llm.bind_tools([weather_tool])\n",
"\n",
"message = {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"text\", \"text\": \"Describe the weather in this image:\"},\n",
" {\"type\": \"image\", \"source_type\": \"url\", \"url\": image_url},\n",
" ],\n",
"}\n",
"response = llm_with_tools.invoke([message])\n",
"response.tool_calls"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}