langchain/docs/versioned_docs/version-0.2.x/how_to/structured_output.ipynb

{
 "cells": [
  {
   "cell_type": "raw",
   "id": "27598444",
   "metadata": {},
   "source": [
    "---\n",
    "sidebar_position: 3\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e3f0f72",
   "metadata": {},
   "source": [
    "# How to return structured data from a model\n",
    "\n",
    "It is often useful to have a model return output that matches some specific schema. One common use-case is extracting data from arbitrary text to insert into a traditional database or use with some other downstrem system. This guide will show you a few different strategies you can use to do this.\n",
    "\n",
    "```{=mdx}\n",
    "import PrerequisiteLinks from \"@theme/PrerequisiteLinks\";\n",
    "\n",
    "<PrerequisiteLinks content={`\n",
    "- [Chat models](/docs/concepts/#chat-models)\n",
    "`}/>\n",
    "```\n",
    "\n",
    "## The `.with_structured_output()` method\n",
    "\n",
    "There are several strategies that models can use under the hood. For some of the most popular model providers, including [OpenAI](/docs/integrations/platforms/openai/), [Anthropic](/docs/integrations/platforms/anthropic/), and [Mistral](/docs/integrations/providers/mistralai/), LangChain implements a common interface that abstracts away these strategies called `.with_structured_output`.\n",
    "\n",
    "By invoking this method (and passing in [JSON schema](https://json-schema.org/) or a [Pydantic](https://docs.pydantic.dev/latest/) model) the model will add whatever model parameters + output parsers are necessary to get back structured output matching the requested schema. If the model supports more than one way to do this (e.g., function calling vs JSON mode) - you can configure which method to use by passing into that method.\n",
    "\n",
    "You can find the [current list of models that support this method here](/docs/integrations/chat/).\n",
    "\n",
    "Let's look at some examples of this in action! We'll use Pydantic to create a simple response schema.\n",
    "\n",
    "```{=mdx}\n",
    "import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
    "\n",
    "<ChatModelTabs\n",
    "  customVarName=\"model\"\n",
    "/>\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6d55008f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# | output: false\n",
    "# | echo: false\n",
    "\n",
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "model = ChatOpenAI(\n",
    "    model=\"gpt-4-0125-preview\",\n",
    "    temperature=0,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "070bf702",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Joke(setup='Why was the cat sitting on the computer?', punchline='Because it wanted to keep an eye on the mouse!', rating=None)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from typing import Optional\n",
    "\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "\n",
    "\n",
    "class Joke(BaseModel):\n",
    "    setup: str = Field(description=\"The setup of the joke\")\n",
    "    punchline: str = Field(description=\"The punchline to the joke\")\n",
    "    rating: Optional[int] = Field(description=\"How funny the joke is, from 1 to 10\")\n",
    "\n",
    "\n",
    "structured_llm = model.with_structured_output(Joke)\n",
    "\n",
    "structured_llm.invoke(\"Tell me a joke about cats\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "deddb6d3",
   "metadata": {},
   "source": [
    "The result is a Pydantic model. Note that name of the model and the names and provided descriptions of parameters are very important, as they help guide the model's output.\n",
    "\n",
    "We can also pass in an OpenAI-style JSON schema dict if you prefer not to use Pydantic. This dict should contain three properties:\n",
    "\n",
    "- `name`: The name of the schema to output.\n",
    "- `description`: A high level description of the schema to output.\n",
    "- `parameters`: The nested details of the schema you want to extract, formatted as a [JSON schema](https://json-schema.org/) dict.\n",
    "\n",
    "In this case, the response is also a dict:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6700994a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'setup': 'Why was the cat sitting on the computer?',\n",
       " 'punchline': 'To keep an eye on the mouse!'}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "structured_llm = model.with_structured_output(\n",
    "    {\n",
    "        \"name\": \"joke\",\n",
    "        \"description\": \"Joke to tell user.\",\n",
    "        \"parameters\": {\n",
    "            \"title\": \"Joke\",\n",
    "            \"type\": \"object\",\n",
    "            \"properties\": {\n",
    "                \"setup\": {\"type\": \"string\", \"description\": \"The setup for the joke\"},\n",
    "                \"punchline\": {\"type\": \"string\", \"description\": \"The joke's punchline\"},\n",
    "            },\n",
    "            \"required\": [\"setup\", \"punchline\"],\n",
    "        },\n",
    "    }\n",
    ")\n",
    "\n",
    "structured_llm.invoke(\"Tell me a joke about cats\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3da57988",
   "metadata": {},
   "source": [
    "### Choosing between multiple schemas\n",
    "\n",
    "If you have multiple schemas that are valid outputs for the model, you can use Pydantic's `Union` type:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "9194bcf2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Response(output=Joke(setup='Why was the cat sitting on the computer?', punchline='Because it wanted to keep an eye on the mouse!'))"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from typing import Union\n",
    "\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "\n",
    "\n",
    "class Joke(BaseModel):\n",
    "    setup: str = Field(description=\"The setup of the joke\")\n",
    "    punchline: str = Field(description=\"The punchline to the joke\")\n",
    "\n",
    "\n",
    "class ConversationalResponse(BaseModel):\n",
    "    response: str = Field(description=\"A conversational response to the user's query\")\n",
    "\n",
    "\n",
    "class Response(BaseModel):\n",
    "    output: Union[Joke, ConversationalResponse]\n",
    "\n",
    "\n",
    "structured_llm = model.with_structured_output(Response)\n",
    "\n",
    "structured_llm.invoke(\"Tell me a joke about cats\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "84d86132",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Response(output=ConversationalResponse(response=\"I'm just a collection of code, so I don't have feelings, but thanks for asking! How can I assist you today?\"))"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "structured_llm.invoke(\"How are you today?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e28c14d3",
   "metadata": {},
   "source": [
    "If you are using JSON Schema, you can take advantage of other more complex schema descriptions to create a similar effect.\n",
    "\n",
    "You can also use tool calling directly to allow the model to choose between options, if your chosen model supports it. This involves a bit more parsing and setup. See [this how-to guide](/docs/how_to/tool_calling/) for more details."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39d7a555",
   "metadata": {},
   "source": [
    "### Specifying the output method (Advanced)\n",
    "\n",
    "For models that support more than one means of outputting data, you can specify the preferred one like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "df0370e3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Joke(setup='Why was the cat sitting on the computer?', punchline='Because it wanted to keep an eye on the mouse!')"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "structured_llm = model.with_structured_output(Joke, method=\"json_mode\")\n",
    "\n",
    "structured_llm.invoke(\n",
    "    \"Tell me a joke about cats, respond in JSON with `setup` and `punchline` keys\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e92a98a",
   "metadata": {},
   "source": [
    "In the above example, we use OpenAI's alternate JSON mode capability along with a more specific prompt.\n",
    "\n",
    "For specifics about the model you choose, peruse its entry in the [API reference pages](https://api.python.langchain.com/en/latest/langchain_api_reference.html).\n",
    "\n",
    "## Prompting techniques\n",
    "\n",
    "You can also prompt models to outputting information in a given format. This approach relies on designing good prompts and then parsing the output of the models. This is the only option for models that don't support `.with_structured_output()` or other built-in approaches.\n",
    "\n",
    "### Using `PydanticOutputParser`\n",
    "\n",
    "The following example uses the built-in [`PydanticOutputParser`](https://api.python.langchain.com/en/latest/output_parsers/langchain_core.output_parsers.pydantic.PydanticOutputParser.html) to parse the output of a chat model prompted to match a the given Pydantic schema. Note that we are adding `format_instructions` directly to the prompt from a method on the parser:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "6e514455",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List\n",
    "\n",
    "from langchain.output_parsers import PydanticOutputParser\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "\n",
    "\n",
    "class Person(BaseModel):\n",
    "    \"\"\"Information about a person.\"\"\"\n",
    "\n",
    "    name: str = Field(..., description=\"The name of the person\")\n",
    "    height_in_meters: float = Field(\n",
    "        ..., description=\"The height of the person expressed in meters.\"\n",
    "    )\n",
    "\n",
    "\n",
    "class People(BaseModel):\n",
    "    \"\"\"Identifying information about all people in a text.\"\"\"\n",
    "\n",
    "    people: List[Person]\n",
    "\n",
    "\n",
    "# Set up a parser\n",
    "parser = PydanticOutputParser(pydantic_object=People)\n",
    "\n",
    "# Prompt\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\n",
    "            \"system\",\n",
    "            \"Answer the user query. Wrap the output in `json` tags\\n{format_instructions}\",\n",
    "        ),\n",
    "        (\"human\", \"{query}\"),\n",
    "    ]\n",
    ").partial(format_instructions=parser.get_format_instructions())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "082fa166",
   "metadata": {},
   "source": [
    "Let’s take a look at what information is sent to the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "3d73d33d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System: Answer the user query. Wrap the output in `json` tags\n",
      "The output should be formatted as a JSON instance that conforms to the JSON schema below.\n",
      "\n",
      "As an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\n",
      "the object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\n",
      "\n",
      "Here is the output schema:\n",
      "```\n",
      "{\"description\": \"Identifying information about all people in a text.\", \"properties\": {\"people\": {\"title\": \"People\", \"type\": \"array\", \"items\": {\"$ref\": \"#/definitions/Person\"}}}, \"required\": [\"people\"], \"definitions\": {\"Person\": {\"title\": \"Person\", \"description\": \"Information about a person.\", \"type\": \"object\", \"properties\": {\"name\": {\"title\": \"Name\", \"description\": \"The name of the person\", \"type\": \"string\"}, \"height_in_meters\": {\"title\": \"Height In Meters\", \"description\": \"The height of the person expressed in meters.\", \"type\": \"number\"}}, \"required\": [\"name\", \"height_in_meters\"]}}}\n",
      "```\n",
      "Human: Anna is 23 years old and she is 6 feet tall\n"
     ]
    }
   ],
   "source": [
    "query = \"Anna is 23 years old and she is 6 feet tall\"\n",
    "\n",
    "print(prompt.format_prompt(query=query).to_string())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "081956b9",
   "metadata": {},
   "source": [
    "And now let's invoke it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "8d6b3d17",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "People(people=[Person(name='Anna', height_in_meters=1.8288)])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain = prompt | model | parser\n",
    "\n",
    "chain.invoke({\"query\": query})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6732dd87",
   "metadata": {},
   "source": [
    "For a deeper dive into using output parsers with prompting techniques for structured output, see [this guide](/docs/how_to/output_parser_structured).\n",
    "\n",
    "### Custom Parsing\n",
    "\n",
    "You can also create a custom prompt and parser with [LangChain Expression Language (LCEL)](/docs/concepts/#langchain-expression-language), using a plain function to parse the output from the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "e8d37e15",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import re\n",
    "from typing import List\n",
    "\n",
    "from langchain_core.messages import AIMessage\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "\n",
    "\n",
    "class Person(BaseModel):\n",
    "    \"\"\"Information about a person.\"\"\"\n",
    "\n",
    "    name: str = Field(..., description=\"The name of the person\")\n",
    "    height_in_meters: float = Field(\n",
    "        ..., description=\"The height of the person expressed in meters.\"\n",
    "    )\n",
    "\n",
    "\n",
    "class People(BaseModel):\n",
    "    \"\"\"Identifying information about all people in a text.\"\"\"\n",
    "\n",
    "    people: List[Person]\n",
    "\n",
    "\n",
    "# Prompt\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\n",
    "            \"system\",\n",
    "            \"Answer the user query. Output your answer as JSON that  \"\n",
    "            \"matches the given schema: ```json\\n{schema}\\n```. \"\n",
    "            \"Make sure to wrap the answer in ```json and ``` tags\",\n",
    "        ),\n",
    "        (\"human\", \"{query}\"),\n",
    "    ]\n",
    ").partial(schema=People.schema())\n",
    "\n",
    "\n",
    "# Custom parser\n",
    "def extract_json(message: AIMessage) -> List[dict]:\n",
    "    \"\"\"Extracts JSON content from a string where JSON is embedded between ```json and ``` tags.\n",
    "\n",
    "    Parameters:\n",
    "        text (str): The text containing the JSON content.\n",
    "\n",
    "    Returns:\n",
    "        list: A list of extracted JSON strings.\n",
    "    \"\"\"\n",
    "    text = message.content\n",
    "    # Define the regular expression pattern to match JSON blocks\n",
    "    pattern = r\"```json(.*?)```\"\n",
    "\n",
    "    # Find all non-overlapping matches of the pattern in the string\n",
    "    matches = re.findall(pattern, text, re.DOTALL)\n",
    "\n",
    "    # Return the list of matched JSON strings, stripping any leading or trailing whitespace\n",
    "    try:\n",
    "        return [json.loads(match.strip()) for match in matches]\n",
    "    except Exception:\n",
    "        raise ValueError(f\"Failed to parse: {message}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f1bc8f7",
   "metadata": {},
   "source": [
    "Here is the prompt sent to the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "c8a30d0e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System: Answer the user query. Output your answer as JSON that  matches the given schema: ```json\n",
      "{'title': 'People', 'description': 'Identifying information about all people in a text.', 'type': 'object', 'properties': {'people': {'title': 'People', 'type': 'array', 'items': {'$ref': '#/definitions/Person'}}}, 'required': ['people'], 'definitions': {'Person': {'title': 'Person', 'description': 'Information about a person.', 'type': 'object', 'properties': {'name': {'title': 'Name', 'description': 'The name of the person', 'type': 'string'}, 'height_in_meters': {'title': 'Height In Meters', 'description': 'The height of the person expressed in meters.', 'type': 'number'}}, 'required': ['name', 'height_in_meters']}}}\n",
      "```. Make sure to wrap the answer in ```json and ``` tags\n",
      "Human: Anna is 23 years old and she is 6 feet tall\n"
     ]
    }
   ],
   "source": [
    "query = \"Anna is 23 years old and she is 6 feet tall\"\n",
    "\n",
    "print(prompt.format_prompt(query=query).to_string())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec018893",
   "metadata": {},
   "source": [
    "And here's what it looks like when we invoke it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "e1e7baf6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'people': [{'name': 'Anna', 'height_in_meters': 1.8288}]}]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain = prompt | model | extract_json\n",
    "\n",
    "chain.invoke({\"query\": query})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a39221a",
   "metadata": {},
   "source": [
    "## Next steps\n",
    "\n",
    "Now you've learned a few methods to make a model output structured data.\n",
    "\n",
    "To learn more, check out the other how-to guides in this section, or the conceptual guide on tool calling."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e3759e2",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}