mirror of
https://github.com/hwchase17/langchain.git
synced 2026-02-21 06:33:41 +00:00
422 lines
15 KiB
Plaintext
422 lines
15 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ea37db49-d389-4291-be73-885d06c1fb7e",
|
|
"metadata": {},
|
|
"source": [
|
|
"# How to use prompting alone (no tool calling) to do extraction\n",
|
|
"\n",
|
|
"Tool calling features are not required for generating structured output from LLMs. LLMs that are able to follow prompt instructions well can be tasked with outputting information in a given format.\n",
|
|
"\n",
|
|
"This approach relies on designing good prompts and then parsing the output of the LLMs to make them extract information well.\n",
|
|
"\n",
|
|
"To extract data without tool-calling features: \n",
|
|
"\n",
|
|
"1. Instruct the LLM to generate text following an expected format (e.g., JSON with a certain schema);\n",
|
|
"2. Use [output parsers](/docs/concepts#output-parsers) to structure the model response into a desired Python object.\n",
|
|
"\n",
|
|
"First we select a LLM:\n",
|
|
"\n",
|
|
"```{=mdx}\n",
|
|
"import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
|
|
"\n",
|
|
"<ChatModelTabs customVarName=\"model\" />\n",
|
|
"```"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "25487939-8713-4ec7-b774-e4a761ac8298",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-09-10T20:35:44.442501Z",
|
|
"iopub.status.busy": "2024-09-10T20:35:44.442044Z",
|
|
"iopub.status.idle": "2024-09-10T20:35:44.872217Z",
|
|
"shell.execute_reply": "2024-09-10T20:35:44.871897Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# | output: false\n",
|
|
"# | echo: false\n",
|
|
"\n",
|
|
"from langchain_anthropic.chat_models import ChatAnthropic\n",
|
|
"\n",
|
|
"model = ChatAnthropic(model_name=\"claude-3-sonnet-20240229\", temperature=0)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "3e412374-3beb-4bbf-966b-400c1f66a258",
|
|
"metadata": {},
|
|
"source": [
|
|
":::{.callout-tip}\n",
|
|
"This tutorial is meant to be simple, but generally should really include reference examples to squeeze out performance!\n",
|
|
":::"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "abc1a945-0f80-4953-add4-cd572b6f2a51",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Using PydanticOutputParser\n",
|
|
"\n",
|
|
"The following example uses the built-in `PydanticOutputParser` to parse the output of a chat model."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "497eb023-c043-443d-ac62-2d4ea85fe1b0",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-09-10T20:35:44.873979Z",
|
|
"iopub.status.busy": "2024-09-10T20:35:44.873840Z",
|
|
"iopub.status.idle": "2024-09-10T20:35:44.878966Z",
|
|
"shell.execute_reply": "2024-09-10T20:35:44.878718Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from typing import List, Optional\n",
|
|
"\n",
|
|
"from langchain_core.output_parsers import PydanticOutputParser\n",
|
|
"from langchain_core.prompts import ChatPromptTemplate\n",
|
|
"from pydantic import BaseModel, Field, validator\n",
|
|
"\n",
|
|
"\n",
|
|
"class Person(BaseModel):\n",
|
|
" \"\"\"Information about a person.\"\"\"\n",
|
|
"\n",
|
|
" name: str = Field(..., description=\"The name of the person\")\n",
|
|
" height_in_meters: float = Field(\n",
|
|
" ..., description=\"The height of the person expressed in meters.\"\n",
|
|
" )\n",
|
|
"\n",
|
|
"\n",
|
|
"class People(BaseModel):\n",
|
|
" \"\"\"Identifying information about all people in a text.\"\"\"\n",
|
|
"\n",
|
|
" people: List[Person]\n",
|
|
"\n",
|
|
"\n",
|
|
"# Set up a parser\n",
|
|
"parser = PydanticOutputParser(pydantic_object=People)\n",
|
|
"\n",
|
|
"# Prompt\n",
|
|
"prompt = ChatPromptTemplate.from_messages(\n",
|
|
" [\n",
|
|
" (\n",
|
|
" \"system\",\n",
|
|
" \"Answer the user query. Wrap the output in `json` tags\\n{format_instructions}\",\n",
|
|
" ),\n",
|
|
" (\"human\", \"{query}\"),\n",
|
|
" ]\n",
|
|
").partial(format_instructions=parser.get_format_instructions())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "c31aa2c8-05a9-4a12-80c5-ea1250dea0ae",
|
|
"metadata": {},
|
|
"source": [
|
|
"Let's take a look at what information is sent to the model"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "20b99ffb-a114-49a9-a7be-154c525f8ada",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-09-10T20:35:44.880355Z",
|
|
"iopub.status.busy": "2024-09-10T20:35:44.880277Z",
|
|
"iopub.status.idle": "2024-09-10T20:35:44.881834Z",
|
|
"shell.execute_reply": "2024-09-10T20:35:44.881601Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"query = \"Anna is 23 years old and she is 6 feet tall\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"id": "4f3a66ce-de19-4571-9e54-67504ae3fba7",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-09-10T20:35:44.883138Z",
|
|
"iopub.status.busy": "2024-09-10T20:35:44.883049Z",
|
|
"iopub.status.idle": "2024-09-10T20:35:44.885139Z",
|
|
"shell.execute_reply": "2024-09-10T20:35:44.884801Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"System: Answer the user query. Wrap the output in `json` tags\n",
|
|
"The output should be formatted as a JSON instance that conforms to the JSON schema below.\n",
|
|
"\n",
|
|
"As an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\n",
|
|
"the object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\n",
|
|
"\n",
|
|
"Here is the output schema:\n",
|
|
"```\n",
|
|
"{\"$defs\": {\"Person\": {\"description\": \"Information about a person.\", \"properties\": {\"name\": {\"description\": \"The name of the person\", \"title\": \"Name\", \"type\": \"string\"}, \"height_in_meters\": {\"description\": \"The height of the person expressed in meters.\", \"title\": \"Height In Meters\", \"type\": \"number\"}}, \"required\": [\"name\", \"height_in_meters\"], \"title\": \"Person\", \"type\": \"object\"}}, \"description\": \"Identifying information about all people in a text.\", \"properties\": {\"people\": {\"items\": {\"$ref\": \"#/$defs/Person\"}, \"title\": \"People\", \"type\": \"array\"}}, \"required\": [\"people\"]}\n",
|
|
"```\n",
|
|
"Human: Anna is 23 years old and she is 6 feet tall\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"print(prompt.format_prompt(query=query).to_string())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "6f1048e0-1bfd-49f9-b697-74389a5ce69c",
|
|
"metadata": {},
|
|
"source": [
|
|
"Having defined our prompt, we simply chain together the prompt, model and output parser:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"id": "7e0041eb-37dc-4384-9fe3-6dd8c356371e",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-09-10T20:35:44.886765Z",
|
|
"iopub.status.busy": "2024-09-10T20:35:44.886675Z",
|
|
"iopub.status.idle": "2024-09-10T20:35:46.835960Z",
|
|
"shell.execute_reply": "2024-09-10T20:35:46.835282Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"People(people=[Person(name='Anna', height_in_meters=1.83)])"
|
|
]
|
|
},
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"chain = prompt | model | parser\n",
|
|
"chain.invoke({\"query\": query})"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "dd492fe4-110a-4b83-a191-79fffbc1055a",
|
|
"metadata": {},
|
|
"source": [
|
|
"Check out the associated [Langsmith trace](https://smith.langchain.com/public/92ed52a3-92b9-45af-a663-0a9c00e5e396/r).\n",
|
|
"\n",
|
|
"Note that the schema shows up in two places: \n",
|
|
"\n",
|
|
"1. In the prompt, via `parser.get_format_instructions()`;\n",
|
|
"2. In the chain, to receive the formatted output and structure it into a Python object (in this case, the Pydantic object `People`)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "815b3b87-3bc6-4b56-835e-c6b6703cef5d",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Custom Parsing\n",
|
|
"\n",
|
|
"If desired, it's easy to create a custom prompt and parser with `LangChain` and `LCEL`.\n",
|
|
"\n",
|
|
"To create a custom parser, define a function to parse the output from the model (typically an [AIMessage](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.ai.AIMessage.html)) into an object of your choice.\n",
|
|
"\n",
|
|
"See below for a simple implementation of a JSON parser."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"id": "b1f11912-c1bb-4a2a-a482-79bf3996961f",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-09-10T20:35:46.839577Z",
|
|
"iopub.status.busy": "2024-09-10T20:35:46.839233Z",
|
|
"iopub.status.idle": "2024-09-10T20:35:46.849663Z",
|
|
"shell.execute_reply": "2024-09-10T20:35:46.849177Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import json\n",
|
|
"import re\n",
|
|
"from typing import List, Optional\n",
|
|
"\n",
|
|
"from langchain_anthropic.chat_models import ChatAnthropic\n",
|
|
"from langchain_core.messages import AIMessage\n",
|
|
"from langchain_core.prompts import ChatPromptTemplate\n",
|
|
"from pydantic import BaseModel, Field, validator\n",
|
|
"\n",
|
|
"\n",
|
|
"class Person(BaseModel):\n",
|
|
" \"\"\"Information about a person.\"\"\"\n",
|
|
"\n",
|
|
" name: str = Field(..., description=\"The name of the person\")\n",
|
|
" height_in_meters: float = Field(\n",
|
|
" ..., description=\"The height of the person expressed in meters.\"\n",
|
|
" )\n",
|
|
"\n",
|
|
"\n",
|
|
"class People(BaseModel):\n",
|
|
" \"\"\"Identifying information about all people in a text.\"\"\"\n",
|
|
"\n",
|
|
" people: List[Person]\n",
|
|
"\n",
|
|
"\n",
|
|
"# Prompt\n",
|
|
"prompt = ChatPromptTemplate.from_messages(\n",
|
|
" [\n",
|
|
" (\n",
|
|
" \"system\",\n",
|
|
" \"Answer the user query. Output your answer as JSON that \"\n",
|
|
" \"matches the given schema: ```json\\n{schema}\\n```. \"\n",
|
|
" \"Make sure to wrap the answer in ```json and ``` tags\",\n",
|
|
" ),\n",
|
|
" (\"human\", \"{query}\"),\n",
|
|
" ]\n",
|
|
").partial(schema=People.schema())\n",
|
|
"\n",
|
|
"\n",
|
|
"# Custom parser\n",
|
|
"def extract_json(message: AIMessage) -> List[dict]:\n",
|
|
" \"\"\"Extracts JSON content from a string where JSON is embedded between ```json and ``` tags.\n",
|
|
"\n",
|
|
" Parameters:\n",
|
|
" text (str): The text containing the JSON content.\n",
|
|
"\n",
|
|
" Returns:\n",
|
|
" list: A list of extracted JSON strings.\n",
|
|
" \"\"\"\n",
|
|
" text = message.content\n",
|
|
" # Define the regular expression pattern to match JSON blocks\n",
|
|
" pattern = r\"```json(.*?)```\"\n",
|
|
"\n",
|
|
" # Find all non-overlapping matches of the pattern in the string\n",
|
|
" matches = re.findall(pattern, text, re.DOTALL)\n",
|
|
"\n",
|
|
" # Return the list of matched JSON strings, stripping any leading or trailing whitespace\n",
|
|
" try:\n",
|
|
" return [json.loads(match.strip()) for match in matches]\n",
|
|
" except Exception:\n",
|
|
" raise ValueError(f\"Failed to parse: {message}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"id": "9260d5e8-3b6c-4639-9f3b-fb2f90239e4b",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-09-10T20:35:46.851870Z",
|
|
"iopub.status.busy": "2024-09-10T20:35:46.851698Z",
|
|
"iopub.status.idle": "2024-09-10T20:35:46.854786Z",
|
|
"shell.execute_reply": "2024-09-10T20:35:46.854424Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"System: Answer the user query. Output your answer as JSON that matches the given schema: ```json\n",
|
|
"{'$defs': {'Person': {'description': 'Information about a person.', 'properties': {'name': {'description': 'The name of the person', 'title': 'Name', 'type': 'string'}, 'height_in_meters': {'description': 'The height of the person expressed in meters.', 'title': 'Height In Meters', 'type': 'number'}}, 'required': ['name', 'height_in_meters'], 'title': 'Person', 'type': 'object'}}, 'description': 'Identifying information about all people in a text.', 'properties': {'people': {'items': {'$ref': '#/$defs/Person'}, 'title': 'People', 'type': 'array'}}, 'required': ['people'], 'title': 'People', 'type': 'object'}\n",
|
|
"```. Make sure to wrap the answer in ```json and ``` tags\n",
|
|
"Human: Anna is 23 years old and she is 6 feet tall\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"query = \"Anna is 23 years old and she is 6 feet tall\"\n",
|
|
"print(prompt.format_prompt(query=query).to_string())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"id": "c523301d-ae0e-45e3-b195-7fd28c67a5c4",
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-09-10T20:35:46.856945Z",
|
|
"iopub.status.busy": "2024-09-10T20:35:46.856769Z",
|
|
"iopub.status.idle": "2024-09-10T20:35:48.373728Z",
|
|
"shell.execute_reply": "2024-09-10T20:35:48.373079Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"/Users/bagatur/langchain/.venv/lib/python3.11/site-packages/pydantic/_internal/_fields.py:201: UserWarning: Field name \"schema\" in \"PromptInput\" shadows an attribute in parent \"BaseModel\"\n",
|
|
" warnings.warn(\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[{'people': [{'name': 'Anna', 'height_in_meters': 1.83}]}]"
|
|
]
|
|
},
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"chain = prompt | model | extract_json\n",
|
|
"chain.invoke({\"query\": query})"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "d3601bde",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Other Libraries\n",
|
|
"\n",
|
|
"If you're looking at extracting using a parsing approach, check out the [Kor](https://eyurtsev.github.io/kor/) library. It's written by one of the `LangChain` maintainers and it\n",
|
|
"helps to craft a prompt that takes examples into account, allows controlling formats (e.g., JSON or CSV) and expresses the schema in TypeScript. It seems to work pretty!"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.11.9"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|