mirror of
https://github.com/hwchase17/langchain.git
synced 2025-05-11 18:16:12 +00:00
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ea37db49-d389-4291-be73-885d06c1fb7e",
   "metadata": {},
   "source": [
    "# How to use prompting alone (no tool calling) to do extraction\n",
    "\n",
    "Tool calling features are not required for generating structured output from LLMs. LLMs that are able to follow prompt instructions well can be tasked with outputting information in a given format.\n",
    "\n",
    "This approach relies on designing good prompts and then parsing the LLM's output so that the information is extracted reliably.\n",
    "\n",
    "To extract data without tool-calling features:\n",
    "\n",
    "1. Instruct the LLM to generate text following an expected format (e.g., JSON with a certain schema);\n",
    "2. Use [output parsers](/docs/concepts#output-parsers) to structure the model response into a desired Python object.\n",
    "\n",
    "First, we select an LLM:\n",
    "\n",
    "```{=mdx}\n",
    "import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
    "\n",
    "<ChatModelTabs customVarName=\"model\" />\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "25487939-8713-4ec7-b774-e4a761ac8298",
   "metadata": {},
   "outputs": [],
   "source": [
    "# | output: false\n",
    "# | echo: false\n",
    "\n",
    "from langchain_anthropic.chat_models import ChatAnthropic\n",
    "\n",
    "model = ChatAnthropic(model_name=\"claude-3-sonnet-20240229\", temperature=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e412374-3beb-4bbf-966b-400c1f66a258",
   "metadata": {},
   "source": [
    ":::{.callout-tip}\n",
    "This tutorial is meant to be simple, but in practice you should generally include reference examples to squeeze out performance!\n",
    ":::"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "abc1a945-0f80-4953-add4-cd572b6f2a51",
   "metadata": {},
   "source": [
    "## Using PydanticOutputParser\n",
    "\n",
    "The following example uses the built-in `PydanticOutputParser` to parse the output of a chat model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "497eb023-c043-443d-ac62-2d4ea85fe1b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List\n",
    "\n",
    "from langchain_core.output_parsers import PydanticOutputParser\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "\n",
    "\n",
    "class Person(BaseModel):\n",
    "    \"\"\"Information about a person.\"\"\"\n",
    "\n",
    "    name: str = Field(..., description=\"The name of the person\")\n",
    "    height_in_meters: float = Field(\n",
    "        ..., description=\"The height of the person expressed in meters.\"\n",
    "    )\n",
    "\n",
    "\n",
    "class People(BaseModel):\n",
    "    \"\"\"Identifying information about all people in a text.\"\"\"\n",
    "\n",
    "    people: List[Person]\n",
    "\n",
    "\n",
    "# Set up a parser\n",
    "parser = PydanticOutputParser(pydantic_object=People)\n",
    "\n",
    "# Prompt\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\n",
    "            \"system\",\n",
    "            \"Answer the user query. Wrap the output in `json` tags\\n{format_instructions}\",\n",
    "        ),\n",
    "        (\"human\", \"{query}\"),\n",
    "    ]\n",
    ").partial(format_instructions=parser.get_format_instructions())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c31aa2c8-05a9-4a12-80c5-ea1250dea0ae",
   "metadata": {},
   "source": [
    "Let's take a look at what information is sent to the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "20b99ffb-a114-49a9-a7be-154c525f8ada",
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"Anna is 23 years old and she is 6 feet tall\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "4f3a66ce-de19-4571-9e54-67504ae3fba7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System: Answer the user query. Wrap the output in `json` tags\n",
      "The output should be formatted as a JSON instance that conforms to the JSON schema below.\n",
      "\n",
      "As an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\n",
      "the object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\n",
      "\n",
      "Here is the output schema:\n",
      "```\n",
      "{\"description\": \"Identifying information about all people in a text.\", \"properties\": {\"people\": {\"title\": \"People\", \"type\": \"array\", \"items\": {\"$ref\": \"#/definitions/Person\"}}}, \"required\": [\"people\"], \"definitions\": {\"Person\": {\"title\": \"Person\", \"description\": \"Information about a person.\", \"type\": \"object\", \"properties\": {\"name\": {\"title\": \"Name\", \"description\": \"The name of the person\", \"type\": \"string\"}, \"height_in_meters\": {\"title\": \"Height In Meters\", \"description\": \"The height of the person expressed in meters.\", \"type\": \"number\"}}, \"required\": [\"name\", \"height_in_meters\"]}}}\n",
      "```\n",
      "Human: Anna is 23 years old and she is 6 feet tall\n"
     ]
    }
   ],
   "source": [
    "print(prompt.format_prompt(query=query).to_string())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f1048e0-1bfd-49f9-b697-74389a5ce69c",
   "metadata": {},
   "source": [
    "Having defined our prompt, we simply chain together the prompt, model and output parser:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "7e0041eb-37dc-4384-9fe3-6dd8c356371e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "People(people=[Person(name='Anna', height_in_meters=1.83)])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain = prompt | model | parser\n",
    "chain.invoke({\"query\": query})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd492fe4-110a-4b83-a191-79fffbc1055a",
   "metadata": {},
   "source": [
    "Check out the associated [LangSmith trace](https://smith.langchain.com/public/92ed52a3-92b9-45af-a663-0a9c00e5e396/r).\n",
    "\n",
    "Note that the schema shows up in two places:\n",
    "\n",
    "1. In the prompt, via `parser.get_format_instructions()`;\n",
    "2. In the chain, to receive the formatted output and structure it into a Python object (in this case, the Pydantic object `People`)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "815b3b87-3bc6-4b56-835e-c6b6703cef5d",
   "metadata": {},
   "source": [
    "## Custom Parsing\n",
    "\n",
    "If desired, it's easy to create a custom prompt and parser with `LangChain` and `LCEL`.\n",
    "\n",
    "To create a custom parser, define a function to parse the output from the model (typically an [AIMessage](https://python.langchain.com/v0.2/api_reference/core/messages/langchain_core.messages.ai.AIMessage.html)) into an object of your choice.\n",
    "\n",
    "See below for a simple implementation of a JSON parser."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "b1f11912-c1bb-4a2a-a482-79bf3996961f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import re\n",
    "from typing import List\n",
    "\n",
    "from langchain_anthropic.chat_models import ChatAnthropic\n",
    "from langchain_core.messages import AIMessage\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "\n",
    "\n",
    "class Person(BaseModel):\n",
    "    \"\"\"Information about a person.\"\"\"\n",
    "\n",
    "    name: str = Field(..., description=\"The name of the person\")\n",
    "    height_in_meters: float = Field(\n",
    "        ..., description=\"The height of the person expressed in meters.\"\n",
    "    )\n",
    "\n",
    "\n",
    "class People(BaseModel):\n",
    "    \"\"\"Identifying information about all people in a text.\"\"\"\n",
    "\n",
    "    people: List[Person]\n",
    "\n",
    "\n",
    "# Prompt\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\n",
    "            \"system\",\n",
    "            \"Answer the user query. Output your answer as JSON that \"\n",
    "            \"matches the given schema: ```json\\n{schema}\\n```. \"\n",
    "            \"Make sure to wrap the answer in ```json and ``` tags\",\n",
    "        ),\n",
    "        (\"human\", \"{query}\"),\n",
    "    ]\n",
    ").partial(schema=People.schema())\n",
    "\n",
    "\n",
    "# Custom parser\n",
    "def extract_json(message: AIMessage) -> List[dict]:\n",
    "    \"\"\"Extracts JSON content from a message where JSON is embedded between ```json and ``` tags.\n",
    "\n",
    "    Parameters:\n",
    "        message (AIMessage): The model message whose content contains the JSON.\n",
    "\n",
    "    Returns:\n",
    "        list: A list of parsed JSON objects, one per matched block.\n",
    "    \"\"\"\n",
    "    text = message.content\n",
    "    # Define the regular expression pattern to match JSON blocks\n",
    "    pattern = r\"```json(.*?)```\"\n",
    "\n",
    "    # Find all non-overlapping matches of the pattern in the string\n",
    "    matches = re.findall(pattern, text, re.DOTALL)\n",
    "\n",
    "    # Parse each match as JSON, stripping any leading or trailing whitespace\n",
    "    try:\n",
    "        return [json.loads(match.strip()) for match in matches]\n",
    "    except Exception:\n",
    "        raise ValueError(f\"Failed to parse: {message}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "9260d5e8-3b6c-4639-9f3b-fb2f90239e4b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System: Answer the user query. Output your answer as JSON that matches the given schema: ```json\n",
      "{'title': 'People', 'description': 'Identifying information about all people in a text.', 'type': 'object', 'properties': {'people': {'title': 'People', 'type': 'array', 'items': {'$ref': '#/definitions/Person'}}}, 'required': ['people'], 'definitions': {'Person': {'title': 'Person', 'description': 'Information about a person.', 'type': 'object', 'properties': {'name': {'title': 'Name', 'description': 'The name of the person', 'type': 'string'}, 'height_in_meters': {'title': 'Height In Meters', 'description': 'The height of the person expressed in meters.', 'type': 'number'}}, 'required': ['name', 'height_in_meters']}}}\n",
      "```. Make sure to wrap the answer in ```json and ``` tags\n",
      "Human: Anna is 23 years old and she is 6 feet tall\n"
     ]
    }
   ],
   "source": [
    "query = \"Anna is 23 years old and she is 6 feet tall\"\n",
    "print(prompt.format_prompt(query=query).to_string())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "c523301d-ae0e-45e3-b195-7fd28c67a5c4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'people': [{'name': 'Anna', 'height_in_meters': 1.83}]}]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain = prompt | model | extract_json\n",
    "chain.invoke({\"query\": query})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3601bde",
   "metadata": {},
   "source": [
    "## Other Libraries\n",
    "\n",
    "If you're looking to extract data using a parsing approach, check out the [Kor](https://eyurtsev.github.io/kor/) library. It's written by one of the `LangChain` maintainers, and it\n",
    "helps to craft a prompt that takes examples into account, allows controlling formats (e.g., JSON or CSV), and expresses the schema in TypeScript. It seems to work pretty well!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}