langchain/docs/docs/how_to/output_parser_xml.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "181b5b6d",
   "metadata": {},
   "source": [
    "# How to parse XML output\n",
    "\n",
    ":::info Prerequisites\n",
    "\n",
    "This guide assumes familiarity with the following concepts:\n",
    "- [Chat models](/docs/concepts/#chat-models)\n",
    "- [Output parsers](/docs/concepts/#output-parsers)\n",
    "- [Prompt templates](/docs/concepts/#prompt-templates)\n",
    "- [Structured output](/docs/how_to/structured_output)\n",
    "- [Chaining runnables together](/docs/how_to/sequence/)\n",
    "\n",
    ":::\n",
    "\n",
    "LLMs from different providers often have different strengths depending on the specific data they are trianed on. This also means that some may be \"better\" and more reliable at generating output in formats other than JSON.\n",
    "\n",
    "This guide shows you how to use the [`XMLOutputParser`](https://python.langchain.com/v0.2/api_reference/core/output_parsers/langchain_core.output_parsers.xml.XMLOutputParser.html) to prompt models for XML output, then and parse that output into a usable format.\n",
    "\n",
    ":::{.callout-note}\n",
    "Keep in mind that large language models are leaky abstractions! You'll have to use an LLM with sufficient capacity to generate well-formed XML.\n",
    ":::\n",
    "\n",
    "In the following examples, we use Anthropic's Claude-2 model (https://docs.anthropic.com/claude/docs), which is one such model that is optimized for XML tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aee0c52e",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -qU langchain langchain-anthropic\n",
    "\n",
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "os.environ[\"ANTHROPIC_API_KEY\"] = getpass()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da312f86-0d2a-4aef-a09d-1e72bd0ea9b1",
   "metadata": {},
   "source": [
    "Let's start with a simple request to the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "b03785af-69fc-40a1-a1be-c04ed6fade70",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Here is the shortened filmography for Tom Hanks, with movies enclosed in XML tags:\n",
      "\n",
      "<movie>Splash</movie>\n",
      "<movie>Big</movie>\n",
      "<movie>A League of Their Own</movie>\n",
      "<movie>Sleepless in Seattle</movie>\n",
      "<movie>Forrest Gump</movie>\n",
      "<movie>Toy Story</movie>\n",
      "<movie>Apollo 13</movie>\n",
      "<movie>Saving Private Ryan</movie>\n",
      "<movie>Cast Away</movie>\n",
      "<movie>The Da Vinci Code</movie>\n"
     ]
    }
   ],
   "source": [
    "from langchain_anthropic import ChatAnthropic\n",
    "from langchain_core.output_parsers import XMLOutputParser\n",
    "from langchain_core.prompts import PromptTemplate\n",
    "\n",
    "model = ChatAnthropic(model=\"claude-2.1\", max_tokens_to_sample=512, temperature=0.1)\n",
    "\n",
    "actor_query = \"Generate the shortened filmography for Tom Hanks.\"\n",
    "\n",
    "output = model.invoke(\n",
    "    f\"\"\"{actor_query}\n",
    "Please enclose the movies in <movie></movie> tags\"\"\"\n",
    ")\n",
    "\n",
    "print(output.content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4db65781-3d54-4ba6-ae26-5b4ead47a4c8",
   "metadata": {},
   "source": [
    "This actually worked pretty well! But it would be nice to parse that XML into a more easily usable format. We can use the `XMLOutputParser` to both add default format instructions to the prompt and parse outputted XML into a dict:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6917e057",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The output should be formatted as a XML file.\\n1. Output should conform to the tags below. \\n2. If tags are not given, make them on your own.\\n3. Remember to always open and close all the tags.\\n\\nAs an example, for the tags [\"foo\", \"bar\", \"baz\"]:\\n1. String \"<foo>\\n   <bar>\\n      <baz></baz>\\n   </bar>\\n</foo>\" is a well-formatted instance of the schema. \\n2. String \"<foo>\\n   <bar>\\n   </foo>\" is a badly-formatted instance.\\n3. String \"<foo>\\n   <tag>\\n   </tag>\\n</foo>\" is a badly-formatted instance.\\n\\nHere are the output tags:\\n```\\nNone\\n```'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "parser = XMLOutputParser()\n",
    "\n",
    "# We will add these instructions to the prompt below\n",
    "parser.get_format_instructions()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "87ba8d11",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'filmography': [{'movie': [{'title': 'Big'}, {'year': '1988'}]}, {'movie': [{'title': 'Forrest Gump'}, {'year': '1994'}]}, {'movie': [{'title': 'Toy Story'}, {'year': '1995'}]}, {'movie': [{'title': 'Saving Private Ryan'}, {'year': '1998'}]}, {'movie': [{'title': 'Cast Away'}, {'year': '2000'}]}]}\n"
     ]
    }
   ],
   "source": [
    "prompt = PromptTemplate(\n",
    "    template=\"\"\"{query}\\n{format_instructions}\"\"\",\n",
    "    input_variables=[\"query\"],\n",
    "    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
    ")\n",
    "\n",
    "chain = prompt | model | parser\n",
    "\n",
    "output = chain.invoke({\"query\": actor_query})\n",
    "print(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "327f5479-77e0-4549-8393-2cd7a286d491",
   "metadata": {},
   "source": [
    "We can also add some tags to tailor the output to our needs. You can and should experiment with adding your own formatting hints in the other parts of your prompt to either augment or replace the default instructions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "4af50494",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The output should be formatted as a XML file.\\n1. Output should conform to the tags below. \\n2. If tags are not given, make them on your own.\\n3. Remember to always open and close all the tags.\\n\\nAs an example, for the tags [\"foo\", \"bar\", \"baz\"]:\\n1. String \"<foo>\\n   <bar>\\n      <baz></baz>\\n   </bar>\\n</foo>\" is a well-formatted instance of the schema. \\n2. String \"<foo>\\n   <bar>\\n   </foo>\" is a badly-formatted instance.\\n3. String \"<foo>\\n   <tag>\\n   </tag>\\n</foo>\" is a badly-formatted instance.\\n\\nHere are the output tags:\\n```\\n[\\'movies\\', \\'actor\\', \\'film\\', \\'name\\', \\'genre\\']\\n```'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "parser = XMLOutputParser(tags=[\"movies\", \"actor\", \"film\", \"name\", \"genre\"])\n",
    "\n",
    "# We will add these instructions to the prompt below\n",
    "parser.get_format_instructions()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b722a235",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'movies': [{'actor': [{'name': 'Tom Hanks'}, {'film': [{'name': 'Forrest Gump'}, {'genre': 'Drama'}]}, {'film': [{'name': 'Cast Away'}, {'genre': 'Adventure'}]}, {'film': [{'name': 'Saving Private Ryan'}, {'genre': 'War'}]}]}]}\n"
     ]
    }
   ],
   "source": [
    "prompt = PromptTemplate(\n",
    "    template=\"\"\"{query}\\n{format_instructions}\"\"\",\n",
    "    input_variables=[\"query\"],\n",
    "    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
    ")\n",
    "\n",
    "\n",
    "chain = prompt | model | parser\n",
    "\n",
    "output = chain.invoke({\"query\": actor_query})\n",
    "\n",
    "print(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61ab269a",
   "metadata": {},
   "source": [
    "This output parser also supports streaming of partial chunks. Here's an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "808a5df5-b11e-42a0-bd7a-6b95ca0c3eba",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'movies': [{'actor': [{'name': 'Tom Hanks'}]}]}\n",
      "{'movies': [{'actor': [{'film': [{'name': 'Forrest Gump'}]}]}]}\n",
      "{'movies': [{'actor': [{'film': [{'genre': 'Drama'}]}]}]}\n",
      "{'movies': [{'actor': [{'film': [{'name': 'Cast Away'}]}]}]}\n",
      "{'movies': [{'actor': [{'film': [{'genre': 'Adventure'}]}]}]}\n",
      "{'movies': [{'actor': [{'film': [{'name': 'Saving Private Ryan'}]}]}]}\n",
      "{'movies': [{'actor': [{'film': [{'genre': 'War'}]}]}]}\n"
     ]
    }
   ],
   "source": [
    "for s in chain.stream({\"query\": actor_query}):\n",
    "    print(s)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6902fe6f",
   "metadata": {},
   "source": [
    "## Next steps\n",
    "\n",
    "You've now learned how to prompt a model to return XML. Next, check out the [broader guide on obtaining structured output](/docs/how_to/structured_output) for other related techniques."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}