{
 "cells": [
  {
   "cell_type": "raw",
   "id": "27598444",
   "metadata": {},
   "source": [
    "---\n",
    "sidebar_position: 3\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e3f0f72",
   "metadata": {},
   "source": [
    "# How to return structured data from a model\n",
    "\n",
    "It is often useful to have a model return output that matches a specific schema. One common use case is extracting data from arbitrary text to insert into a traditional database or use with some other downstream system. This guide will show you a few different strategies you can use to do this.\n",
    "\n",
    "```{=mdx}\n",
    "import PrerequisiteLinks from \"@theme/PrerequisiteLinks\";\n",
    "\n",
    "<PrerequisiteLinks content={`\n",
    "- [Chat models](/docs/concepts/#chat-models)\n",
    "`}/>\n",
    "```\n",
    "\n",
    "## The `.with_structured_output()` method\n",
    "\n",
    "There are several strategies that models can use under the hood. For some of the most popular model providers, including [OpenAI](/docs/integrations/platforms/openai/), [Anthropic](/docs/integrations/platforms/anthropic/), and [Mistral](/docs/integrations/providers/mistralai/), LangChain implements a common interface, `.with_structured_output`, that abstracts away these strategies.\n",
    "\n",
    "By invoking this method (and passing in a [JSON schema](https://json-schema.org/) or a [Pydantic](https://docs.pydantic.dev/latest/) model), you configure the model to add whatever model parameters + output parsers are necessary to get back structured output matching the requested schema. If the model supports more than one way to do this (e.g., function calling vs JSON mode), you can specify which one to use with the method's `method` argument.\n",
    "\n",
    "You can find the [current list of models that support this method here](/docs/integrations/chat/).\n",
    "\n",
    "Let's look at some examples of this in action! We'll use Pydantic to create a simple response schema.\n",
    "\n",
    "```{=mdx}\n",
    "import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
    "\n",
    "<ChatModelTabs\n",
    "  customVarName=\"model\"\n",
    "/>\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6d55008f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# | output: false\n",
    "# | echo: false\n",
    "\n",
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "model = ChatOpenAI(\n",
    "    model=\"gpt-4-0125-preview\",\n",
    "    temperature=0,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "070bf702",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Joke(setup='Why was the cat sitting on the computer?', punchline='Because it wanted to keep an eye on the mouse!', rating=None)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from typing import Optional\n",
    "\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "\n",
    "\n",
    "class Joke(BaseModel):\n",
    "    setup: str = Field(description=\"The setup of the joke\")\n",
    "    punchline: str = Field(description=\"The punchline to the joke\")\n",
    "    rating: Optional[int] = Field(description=\"How funny the joke is, from 1 to 10\")\n",
    "\n",
    "\n",
    "structured_llm = model.with_structured_output(Joke)\n",
    "\n",
    "structured_llm.invoke(\"Tell me a joke about cats\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "deddb6d3",
   "metadata": {},
   "source": [
    "The result is a Pydantic model. Note that the name of the model and the names and descriptions provided for its parameters are very important, as they help guide the model's output.\n",
    "\n",
    "You can also pass in an OpenAI-style JSON schema dict if you prefer not to use Pydantic. This dict should contain three properties:\n",
    "\n",
    "- `name`: The name of the schema to output.\n",
    "- `description`: A high-level description of the schema to output.\n",
    "- `parameters`: The nested details of the schema you want to extract, formatted as a [JSON schema](https://json-schema.org/) dict.\n",
    "\n",
    "In this case, the response is also a dict:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6700994a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'setup': 'Why was the cat sitting on the computer?',\n",
       " 'punchline': 'To keep an eye on the mouse!'}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "structured_llm = model.with_structured_output(\n",
    "    {\n",
    "        \"name\": \"joke\",\n",
    "        \"description\": \"Joke to tell user.\",\n",
    "        \"parameters\": {\n",
    "            \"title\": \"Joke\",\n",
    "            \"type\": \"object\",\n",
    "            \"properties\": {\n",
    "                \"setup\": {\"type\": \"string\", \"description\": \"The setup for the joke\"},\n",
    "                \"punchline\": {\"type\": \"string\", \"description\": \"The joke's punchline\"},\n",
    "            },\n",
    "            \"required\": [\"setup\", \"punchline\"],\n",
    "        },\n",
    "    }\n",
    ")\n",
    "\n",
    "structured_llm.invoke(\"Tell me a joke about cats\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3da57988",
   "metadata": {},
   "source": [
    "### Choosing between multiple schemas\n",
    "\n",
    "If you have multiple schemas that are valid outputs for the model, you can use Pydantic's `Union` type:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "9194bcf2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Response(output=Joke(setup='Why was the cat sitting on the computer?', punchline='Because it wanted to keep an eye on the mouse!'))"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from typing import Union\n",
    "\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "\n",
    "\n",
    "class Joke(BaseModel):\n",
    "    setup: str = Field(description=\"The setup of the joke\")\n",
    "    punchline: str = Field(description=\"The punchline to the joke\")\n",
    "\n",
    "\n",
    "class ConversationalResponse(BaseModel):\n",
    "    response: str = Field(description=\"A conversational response to the user's query\")\n",
    "\n",
    "\n",
    "class Response(BaseModel):\n",
    "    output: Union[Joke, ConversationalResponse]\n",
    "\n",
    "\n",
    "structured_llm = model.with_structured_output(Response)\n",
    "\n",
    "structured_llm.invoke(\"Tell me a joke about cats\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "84d86132",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Response(output=ConversationalResponse(response=\"I'm just a collection of code, so I don't have feelings, but thanks for asking! How can I assist you today?\"))"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "structured_llm.invoke(\"How are you today?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e28c14d3",
   "metadata": {},
   "source": [
    "If you are using JSON Schema, you can take advantage of other, more complex schema descriptions (such as `anyOf`) to create a similar effect, as sketched below.\n",
    "\n",
    "You can also use tool calling directly to allow the model to choose between options, if your chosen model supports it. This involves a bit more parsing and setup. See [this how-to guide](/docs/how_to/tool_calling/) for more details."
   ]
  },
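  {
   "cell_type": "markdown",
   "id": "anyof-sketch-md",
   "metadata": {},
   "source": [
    "For illustration, here is a minimal, hypothetical sketch of that JSON Schema approach: nesting an `anyOf` inside `parameters` lets the model pick whichever sub-schema fits. The `respond` schema below is made up for this example, and provider support for `anyOf` varies, so treat it as a starting point rather than a guaranteed recipe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "anyof-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical sketch, not an official recipe: assumes your provider accepts\n",
    "# `anyOf` inside a function-calling parameters schema.\n",
    "structured_llm = model.with_structured_output(\n",
    "    {\n",
    "        \"name\": \"respond\",\n",
    "        \"description\": \"Either a joke or a conversational response to the user.\",\n",
    "        \"parameters\": {\n",
    "            \"title\": \"Response\",\n",
    "            \"type\": \"object\",\n",
    "            \"properties\": {\n",
    "                \"output\": {\n",
    "                    \"anyOf\": [\n",
    "                        {\n",
    "                            \"type\": \"object\",\n",
    "                            \"description\": \"Joke to tell user.\",\n",
    "                            \"properties\": {\n",
    "                                \"setup\": {\"type\": \"string\"},\n",
    "                                \"punchline\": {\"type\": \"string\"},\n",
    "                            },\n",
    "                            \"required\": [\"setup\", \"punchline\"],\n",
    "                        },\n",
    "                        {\n",
    "                            \"type\": \"object\",\n",
    "                            \"description\": \"A conversational response.\",\n",
    "                            \"properties\": {\"response\": {\"type\": \"string\"}},\n",
    "                            \"required\": [\"response\"],\n",
    "                        },\n",
    "                    ]\n",
    "                }\n",
    "            },\n",
    "            \"required\": [\"output\"],\n",
    "        },\n",
    "    }\n",
    ")\n",
    "\n",
    "structured_llm.invoke(\"Tell me a joke about cats\")"
   ]
  },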
  {
   "cell_type": "markdown",
   "id": "39d7a555",
   "metadata": {},
   "source": [
    "### Specifying the output method (Advanced)\n",
    "\n",
    "For models that support more than one means of outputting data, you can specify the preferred one like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "df0370e3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Joke(setup='Why was the cat sitting on the computer?', punchline='Because it wanted to keep an eye on the mouse!')"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "structured_llm = model.with_structured_output(Joke, method=\"json_mode\")\n",
    "\n",
    "structured_llm.invoke(\n",
    "    \"Tell me a joke about cats, respond in JSON with `setup` and `punchline` keys\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e92a98a",
   "metadata": {},
   "source": [
    "In the above example, we use OpenAI's alternate JSON mode capability along with a more specific prompt.\n",
    "\n",
    "For specifics about the model you choose, peruse its entry in the [API reference pages](https://api.python.langchain.com/en/latest/langchain_api_reference.html).\n",
    "\n",
    "## Prompting techniques\n",
    "\n",
    "You can also prompt models to output information in a given format. This approach relies on designing good prompts and then parsing the output of the models. This is the only option for models that don't support `.with_structured_output()` or other built-in approaches.\n",
    "\n",
    "### Using `PydanticOutputParser`\n",
    "\n",
    "The following example uses the built-in [`PydanticOutputParser`](https://api.python.langchain.com/en/latest/output_parsers/langchain_core.output_parsers.pydantic.PydanticOutputParser.html) to parse the output of a chat model prompted to match the given Pydantic schema. Note that we are adding `format_instructions` directly to the prompt from a method on the parser:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "6e514455",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List\n",
    "\n",
    "from langchain.output_parsers import PydanticOutputParser\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "\n",
    "\n",
    "class Person(BaseModel):\n",
    "    \"\"\"Information about a person.\"\"\"\n",
    "\n",
    "    name: str = Field(..., description=\"The name of the person\")\n",
    "    height_in_meters: float = Field(\n",
    "        ..., description=\"The height of the person expressed in meters.\"\n",
    "    )\n",
    "\n",
    "\n",
    "class People(BaseModel):\n",
    "    \"\"\"Identifying information about all people in a text.\"\"\"\n",
    "\n",
    "    people: List[Person]\n",
    "\n",
    "\n",
    "# Set up a parser\n",
    "parser = PydanticOutputParser(pydantic_object=People)\n",
    "\n",
    "# Prompt\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\n",
    "            \"system\",\n",
    "            \"Answer the user query. Wrap the output in `json` tags\\n{format_instructions}\",\n",
    "        ),\n",
    "        (\"human\", \"{query}\"),\n",
    "    ]\n",
    ").partial(format_instructions=parser.get_format_instructions())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "082fa166",
   "metadata": {},
   "source": [
    "Let's take a look at what information is sent to the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "3d73d33d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System: Answer the user query. Wrap the output in `json` tags\n",
      "The output should be formatted as a JSON instance that conforms to the JSON schema below.\n",
      "\n",
      "As an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\n",
      "the object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\n",
      "\n",
      "Here is the output schema:\n",
      "```\n",
      "{\"description\": \"Identifying information about all people in a text.\", \"properties\": {\"people\": {\"title\": \"People\", \"type\": \"array\", \"items\": {\"$ref\": \"#/definitions/Person\"}}}, \"required\": [\"people\"], \"definitions\": {\"Person\": {\"title\": \"Person\", \"description\": \"Information about a person.\", \"type\": \"object\", \"properties\": {\"name\": {\"title\": \"Name\", \"description\": \"The name of the person\", \"type\": \"string\"}, \"height_in_meters\": {\"title\": \"Height In Meters\", \"description\": \"The height of the person expressed in meters.\", \"type\": \"number\"}}, \"required\": [\"name\", \"height_in_meters\"]}}}\n",
      "```\n",
      "Human: Anna is 23 years old and she is 6 feet tall\n"
     ]
    }
   ],
   "source": [
    "query = \"Anna is 23 years old and she is 6 feet tall\"\n",
    "\n",
    "print(prompt.format_prompt(query=query).to_string())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "081956b9",
   "metadata": {},
   "source": [
    "And now let's invoke it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "8d6b3d17",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "People(people=[Person(name='Anna', height_in_meters=1.8288)])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain = prompt | model | parser\n",
    "\n",
    "chain.invoke({\"query\": query})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6732dd87",
   "metadata": {},
   "source": [
    "For a deeper dive into using output parsers with prompting techniques for structured output, see [this guide](/docs/how_to/output_parser_structured).\n",
    "\n",
    "### Custom parsing\n",
    "\n",
    "You can also create a custom prompt and parser with [LangChain Expression Language (LCEL)](/docs/concepts/#langchain-expression-language), using a plain function to parse the output from the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "e8d37e15",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import re\n",
    "from typing import List\n",
    "\n",
    "from langchain_core.messages import AIMessage\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "\n",
    "\n",
    "class Person(BaseModel):\n",
    "    \"\"\"Information about a person.\"\"\"\n",
    "\n",
    "    name: str = Field(..., description=\"The name of the person\")\n",
    "    height_in_meters: float = Field(\n",
    "        ..., description=\"The height of the person expressed in meters.\"\n",
    "    )\n",
    "\n",
    "\n",
    "class People(BaseModel):\n",
    "    \"\"\"Identifying information about all people in a text.\"\"\"\n",
    "\n",
    "    people: List[Person]\n",
    "\n",
    "\n",
    "# Prompt\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\n",
    "            \"system\",\n",
    "            \"Answer the user query. Output your answer as JSON that \"\n",
    "            \"matches the given schema: ```json\\n{schema}\\n```. \"\n",
    "            \"Make sure to wrap the answer in ```json and ``` tags\",\n",
    "        ),\n",
    "        (\"human\", \"{query}\"),\n",
    "    ]\n",
    ").partial(schema=People.schema())\n",
    "\n",
    "\n",
    "# Custom parser\n",
    "def extract_json(message: AIMessage) -> List[dict]:\n",
    "    \"\"\"Extracts JSON content from a string where JSON is embedded between ```json and ``` tags.\n",
    "\n",
    "    Parameters:\n",
    "        message (AIMessage): The message containing the JSON content.\n",
    "\n",
    "    Returns:\n",
    "        list: A list of parsed JSON objects.\n",
    "    \"\"\"\n",
    "    text = message.content\n",
    "    # Define the regular expression pattern to match JSON blocks\n",
    "    pattern = r\"```json(.*?)```\"\n",
    "\n",
    "    # Find all non-overlapping matches of the pattern in the string\n",
    "    matches = re.findall(pattern, text, re.DOTALL)\n",
    "\n",
    "    # Parse each matched JSON string, stripping any leading or trailing whitespace\n",
    "    try:\n",
    "        return [json.loads(match.strip()) for match in matches]\n",
    "    except Exception:\n",
    "        raise ValueError(f\"Failed to parse: {message}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f1bc8f7",
   "metadata": {},
   "source": [
    "Here is the prompt sent to the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "c8a30d0e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System: Answer the user query. Output your answer as JSON that matches the given schema: ```json\n",
      "{'title': 'People', 'description': 'Identifying information about all people in a text.', 'type': 'object', 'properties': {'people': {'title': 'People', 'type': 'array', 'items': {'$ref': '#/definitions/Person'}}}, 'required': ['people'], 'definitions': {'Person': {'title': 'Person', 'description': 'Information about a person.', 'type': 'object', 'properties': {'name': {'title': 'Name', 'description': 'The name of the person', 'type': 'string'}, 'height_in_meters': {'title': 'Height In Meters', 'description': 'The height of the person expressed in meters.', 'type': 'number'}}, 'required': ['name', 'height_in_meters']}}}\n",
      "```. Make sure to wrap the answer in ```json and ``` tags\n",
      "Human: Anna is 23 years old and she is 6 feet tall\n"
     ]
    }
   ],
   "source": [
    "query = \"Anna is 23 years old and she is 6 feet tall\"\n",
    "\n",
    "print(prompt.format_prompt(query=query).to_string())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec018893",
   "metadata": {},
   "source": [
    "And here's what it looks like when we invoke it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "e1e7baf6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'people': [{'name': 'Anna', 'height_in_meters': 1.8288}]}]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain = prompt | model | extract_json\n",
    "\n",
    "chain.invoke({\"query\": query})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a39221a",
   "metadata": {},
   "source": [
    "## Next steps\n",
    "\n",
    "Now you've learned a few methods to make a model output structured data.\n",
    "\n",
    "To learn more, check out the other how-to guides in this section, or the conceptual guide on tool calling."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e3759e2",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}