Mirror of https://github.com/hwchase17/langchain.git, synced 2025-06-22 14:49:29 +00:00
docs: add how-to on multi-modal tool calling (#21667)
Can move this to a dedicated multi-modal section if desired.
parent: 5c64c004cc
commit: 12b599c47f
@@ -172,6 +172,7 @@ LangChain Tools contain a description of the tool (to pass to the language model
 - [How to: add a human in the loop to tool usage](/docs/how_to/tools_human)
 - [How to: do parallel tool use](/docs/how_to/tools_parallel)
 - [How to: handle errors when calling tools](/docs/how_to/tools_error)
+- [How to: call tools using multi-modal data](/docs/how_to/tool_calls_multi_modal)

 ### Agents
docs/docs/how_to/tool_calls_multi_modal.ipynb · 160 lines · new file
@@ -0,0 +1,160 @@
# How to call tools with multi-modal data

Here we demonstrate how to call tools with multi-modal data, such as images.

Some multi-modal models, such as those that can reason over images or audio, support [tool calling](/docs/concepts/#functiontool-calling) features as well.

To call tools using such models, simply bind tools to them in the [usual way](/docs/how_to/tool_calling), and invoke the model using content blocks of the desired type (e.g., blocks containing image data).

Below, we demonstrate examples using [OpenAI](/docs/integrations/platforms/openai) and [Anthropic](/docs/integrations/platforms/anthropic). We will use the same image and tool in all cases. Let's first select an image, and build a placeholder tool that expects as input one of the strings "sunny", "cloudy", or "rainy". We will ask the models to describe the weather in the image.

```python
from typing import Literal

from langchain_core.tools import tool

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"


# Placeholder tool: the model fills in its weather classification;
# the body itself is a no-op.
@tool
def weather_tool(weather: Literal["sunny", "cloudy", "rainy"]) -> None:
    """Describe the weather"""
    pass
```
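As a quick, illustrative aside (our addition, not part of the original notebook), you can inspect what binding will expose to the model; `name`, `description`, and `args` are standard LangChain tool properties:

```python
# Sanity check: inspect the schema the model will see for this tool.
print(weather_tool.name)  # -> "weather_tool"
print(weather_tool.description)  # -> "Describe the weather"
print(weather_tool.args)  # arg schema with the "sunny"/"cloudy"/"rainy" enum
```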
## OpenAI

For OpenAI, we can feed the image URL directly in a content block of type "image_url":

```python
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o").bind_tools([weather_tool])

message = HumanMessage(
    content=[
        {"type": "text", "text": "describe the weather in this image"},
        {"type": "image_url", "image_url": {"url": image_url}},
    ],
)
response = model.invoke([message])
print(response.tool_calls)
```

```
[{'name': 'weather_tool', 'args': {'weather': 'sunny'}, 'id': 'call_mRYL50MtHdeNuNIjSCm5UPmB'}]
```

Note that we recover tool calls with parsed arguments in LangChain's [standard format](/docs/how_to/tool_calling) in the model response.
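Because the arguments arrive already parsed into a dict, the tool call can be executed directly. A minimal sketch (our addition, not in the original notebook), assuming `response` is the result from the cell above:

```python
# Execute each parsed tool call against our placeholder tool.
for tool_call in response.tool_calls:
    result = weather_tool.invoke(tool_call["args"])  # e.g. {"weather": "sunny"}
    print(tool_call["name"], "->", result)
```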
## Anthropic

For Anthropic, we can format a base64-encoded image into a content block of type "image", as below:

```python
import base64

import httpx
from langchain_anthropic import ChatAnthropic

# Download the image and base64-encode it for the Anthropic content block.
image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")

model = ChatAnthropic(model="claude-3-sonnet-20240229").bind_tools([weather_tool])

message = HumanMessage(
    content=[
        {"type": "text", "text": "describe the weather in this image"},
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": image_data,
            },
        },
    ],
)
response = model.invoke([message])
print(response.tool_calls)
```

```
[{'name': 'weather_tool', 'args': {'weather': 'sunny'}, 'id': 'toolu_016m9KfknJqx5fVRYk4tkF6s'}]
```
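As an aside (our addition, not in the original notebook): OpenAI's chat API also accepts base64 image data when wrapped in a data URL inside an "image_url" block, so the same `image_data` can be reused there. A sketch, assuming `image_data`, `weather_tool`, `HumanMessage`, and `ChatOpenAI` from the cells above:

```python
# Reuse the base64 payload with OpenAI via a data URL.
openai_message = HumanMessage(
    content=[
        {"type": "text", "text": "describe the weather in this image"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        },
    ],
)
response = ChatOpenAI(model="gpt-4o").bind_tools([weather_tool]).invoke(
    [openai_message]
)
print(response.tool_calls)
```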