langchain/docs/extras/modules/model_io/output_parsers/xml.ipynb
Mateusz Wosinski 720f6dbaac
Add XMLOutputParser (#10051)
**Description**
Adds new output parser, this time enabling the output of LLM to be of an
XML format. Seems to be particularly useful together with Claude model.
Addresses [issue
9820](https://github.com/langchain-ai/langchain/issues/9820).

**Twitter handle**
@deepsense_ai @matt_wosinski
2023-09-19 16:17:33 -07:00

359 lines
10 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "181b5b6d",
"metadata": {},
"source": [
"# XML parser\n",
"This output parser allows users to obtain results from LLM in the popular XML format. \n",
"\n",
"Keep in mind that large language models are leaky abstractions! You'll have to use an LLM with sufficient capacity to generate well-formed XML. \n",
"\n",
"In the following example we use Claude model (https://docs.anthropic.com/claude/docs) which works really well with XML tags."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3b10fc55",
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts import PromptTemplate\n",
"from langchain.llms import Anthropic\n",
"from langchain.output_parsers import XMLOutputParser"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "909161d1",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/mateusz/Documents/Projects/langchain/libs/langchain/langchain/llms/anthropic.py:170: UserWarning: This Anthropic LLM is deprecated. Please use `from langchain.chat_models import ChatAnthropic` instead\n",
" warnings.warn(\n"
]
}
],
"source": [
"model = Anthropic(model=\"claude-2\", max_tokens_to_sample=512, temperature=0.1)"
]
},
{
"cell_type": "markdown",
"id": "da312f86-0d2a-4aef-a09d-1e72bd0ea9b1",
"metadata": {},
"source": [
"Let's start with the simple request to the model."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "b03785af-69fc-40a1-a1be-c04ed6fade70",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Here is the shortened filmography for Tom Hanks enclosed in <movie> tags:\n",
"\n",
"<movie>Splash (1984)</movie>\n",
"<movie>Big (1988)</movie> \n",
"<movie>A League of Their Own (1992)</movie>\n",
"<movie>Sleepless in Seattle (1993)</movie> \n",
"<movie>Forrest Gump (1994)</movie>\n",
"<movie>Apollo 13 (1995)</movie>\n",
"<movie>Toy Story (1995)</movie>\n",
"<movie>Saving Private Ryan (1998)</movie>\n",
"<movie>Cast Away (2000)</movie>\n",
"<movie>The Da Vinci Code (2006)</movie>\n",
"<movie>Toy Story 3 (2010)</movie>\n",
"<movie>Captain Phillips (2013)</movie>\n",
"<movie>Bridge of Spies (2015)</movie>\n",
"<movie>Toy Story 4 (2019)</movie>\n"
]
}
],
"source": [
"actor_query = \"Generate the shortened filmography for Tom Hanks.\"\n",
"output = model(\n",
" f\"\"\"\n",
"\n",
"Human:\n",
"{actor_query}\n",
"Please enclose the movies in <movie></movie> tags\n",
"Assistant:\n",
"\"\"\"\n",
")\n",
"print(output)"
]
},
{
"cell_type": "markdown",
"id": "4db65781-3d54-4ba6-ae26-5b4ead47a4c8",
"metadata": {},
"source": [
"Now we will use the XMLOutputParser in order to get the structured output."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "87ba8d11",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"<filmography>\n",
" <movie>\n",
" <title>Splash</title>\n",
" <year>1984</year>\n",
" </movie>\n",
" \n",
" <movie>\n",
" <title>Big</title> \n",
" <year>1988</year>\n",
" </movie>\n",
"\n",
" <movie>\n",
" <title>A League of Their Own</title>\n",
" <year>1992</year>\n",
" </movie>\n",
"\n",
" <movie>\n",
" <title>Sleepless in Seattle</title>\n",
" <year>1993</year>\n",
" </movie>\n",
"\n",
" <movie>\n",
" <title>Forrest Gump</title>\n",
" <year>1994</year>\n",
" </movie>\n",
"\n",
" <movie>\n",
" <title>Toy Story</title>\n",
" <year>1995</year>\n",
" </movie>\n",
"\n",
" <movie>\n",
" <title>Apollo 13</title>\n",
" <year>1995</year>\n",
" </movie>\n",
"\n",
" <movie>\n",
" <title>Saving Private Ryan</title>\n",
" <year>1998</year>\n",
" </movie>\n",
"\n",
" <movie>\n",
" <title>Cast Away</title>\n",
" <year>2000</year>\n",
" </movie>\n",
"\n",
" <movie>\n",
" <title>Catch Me If You Can</title>\n",
" <year>2002</year>\n",
" </movie>\n",
"\n",
" <movie>\n",
" <title>The Polar Express</title>\n",
" <year>2004</year>\n",
" </movie>\n",
"\n",
" <movie>\n",
" <title>Charlie Wilson's War</title>\n",
" <year>2007</year>\n",
" </movie>\n",
"\n",
" <movie>\n",
" <title>Toy Story 3</title>\n",
" <year>2010</year>\n",
" </movie>\n",
"\n",
" <movie>\n",
" <title>Captain Phillips</title>\n",
" <year>2013</year>\n",
" </movie>\n",
"\n",
" <movie>\n",
" <title>Bridge of Spies</title>\n",
" <year>2015</year>\n",
" </movie>\n",
"\n",
" <movie>\n",
" <title>The Post</title>\n",
" <year>2017</year>\n",
" </movie>\n",
"\n",
" <movie>\n",
" <title>A Beautiful Day in the Neighborhood</title> \n",
" <year>2019</year>\n",
" </movie>\n",
"</filmography>\n"
]
}
],
"source": [
"parser = XMLOutputParser()\n",
"\n",
"prompt = PromptTemplate(\n",
" template=\"\"\"\n",
" \n",
" Human:\n",
" {query}\n",
" {format_instructions}\n",
" Assistant:\"\"\",\n",
" input_variables=[\"query\"],\n",
" partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
")\n",
"\n",
"_input = prompt.format_prompt(query=actor_query)\n",
"\n",
"output = model(_input.to_string())\n",
"print(output)"
]
},
{
"cell_type": "markdown",
"id": "1c4c47ee",
"metadata": {},
"source": [
"And here parsed output is shown:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "4c864dc9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'filmography': [{'movie': [{'title': 'Splash'}, {'year': '1984'}]},\n",
" {'movie': [{'title': 'Big'}, {'year': '1988'}]},\n",
" {'movie': [{'title': 'A League of Their Own'}, {'year': '1992'}]},\n",
" {'movie': [{'title': 'Sleepless in Seattle'}, {'year': '1993'}]},\n",
" {'movie': [{'title': 'Forrest Gump'}, {'year': '1994'}]},\n",
" {'movie': [{'title': 'Toy Story'}, {'year': '1995'}]},\n",
" {'movie': [{'title': 'Apollo 13'}, {'year': '1995'}]},\n",
" {'movie': [{'title': 'Saving Private Ryan'}, {'year': '1998'}]},\n",
" {'movie': [{'title': 'Cast Away'}, {'year': '2000'}]},\n",
" {'movie': [{'title': 'Catch Me If You Can'}, {'year': '2002'}]},\n",
" {'movie': [{'title': 'The Polar Express'}, {'year': '2004'}]},\n",
" {'movie': [{'title': \"Charlie Wilson's War\"}, {'year': '2007'}]},\n",
" {'movie': [{'title': 'Toy Story 3'}, {'year': '2010'}]},\n",
" {'movie': [{'title': 'Captain Phillips'}, {'year': '2013'}]},\n",
" {'movie': [{'title': 'Bridge of Spies'}, {'year': '2015'}]},\n",
" {'movie': [{'title': 'The Post'}, {'year': '2017'}]},\n",
" {'movie': [{'title': 'A Beautiful Day in the Neighborhood'},\n",
" {'year': '2019'}]}]}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"parser.parse(output)"
]
},
{
"cell_type": "markdown",
"id": "327f5479-77e0-4549-8393-2cd7a286d491",
"metadata": {},
"source": [
"Finally, let's add some tags to tailor the output to our needs."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "b722a235",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'movies': [{'actor': [{'name': 'Tom Hanks'},\n",
" {'film': [{'name': 'Splash'}, {'genre': 'Comedy'}]},\n",
" {'film': [{'name': 'Big'}, {'genre': 'Comedy'}]},\n",
" {'film': [{'name': 'A League of Their Own'}, {'genre': 'Drama'}]},\n",
" {'film': [{'name': 'Sleepless in Seattle'}, {'genre': 'Romance'}]},\n",
" {'film': [{'name': 'Forrest Gump'}, {'genre': 'Drama'}]},\n",
" {'film': [{'name': 'Toy Story'}, {'genre': 'Animation'}]},\n",
" {'film': [{'name': 'Apollo 13'}, {'genre': 'Drama'}]},\n",
" {'film': [{'name': 'Saving Private Ryan'}, {'genre': 'War'}]},\n",
" {'film': [{'name': 'Cast Away'}, {'genre': 'Adventure'}]},\n",
" {'film': [{'name': 'The Green Mile'}, {'genre': 'Drama'}]}]}]}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"parser = XMLOutputParser(tags=[\"movies\", \"actor\", \"film\", \"name\", \"genre\"])\n",
"prompt = PromptTemplate(\n",
" template=\"\"\"\n",
" \n",
" Human:\n",
" {query}\n",
" {format_instructions}\n",
" Assistant:\"\"\",\n",
" input_variables=[\"query\"],\n",
" partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
")\n",
"\n",
"\n",
"_input = prompt.format_prompt(query=actor_query)\n",
"\n",
"output = model(_input.to_string())\n",
"\n",
"parser.parse(output)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "808a5df5-b11e-42a0-bd7a-6b95ca0c3eba",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}