langchain/docs/docs/use_cases/extraction/how_to/parse.ipynb

{
 "cells": [
  {
   "cell_type": "raw",
   "id": "df29b30a-fd27-4e08-8269-870df5631f9e",
   "metadata": {},
   "source": [
    "---\n",
    "title: Parsing\n",
    "sidebar_position: 4\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea37db49-d389-4291-be73-885d06c1fb7e",
   "metadata": {},
   "source": [
    "LLMs that are able to follow prompt instructions well can be tasked with outputting information in a given format.\n",
    "\n",
    "This approach relies on designing good prompts and then parsing the output of the LLMs to make them extract information well.\n",
    "\n",
    "Here, we'll use Claude which is great at following instructions! See [Anthropic models](https://www.anthropic.com/api)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "d71b32de-a6b4-45ed-83a9-ba1925f9470c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_anthropic.chat_models import ChatAnthropic\n",
    "\n",
    "model = ChatAnthropic(model_name=\"claude-3-sonnet-20240229\", temperature=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e412374-3beb-4bbf-966b-400c1f66a258",
   "metadata": {},
   "source": [
    ":::{.callout-tip}\n",
    "All the same considerations for extraction quality apply for parsing approach. Review the [guidelines](/docs/use_cases/extraction/guidelines) for extraction quality.\n",
    "\n",
    "This tutorial is meant to be simple, but generally should really include reference examples to squeeze out performance!\n",
    ":::"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "abc1a945-0f80-4953-add4-cd572b6f2a51",
   "metadata": {},
   "source": [
    "## Using PydanticOutputParser\n",
    "\n",
    "The following example uses the built-in `PydanticOutputParser` to parse the output of a chat model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "497eb023-c043-443d-ac62-2d4ea85fe1b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List, Optional\n",
    "\n",
    "from langchain.output_parsers import PydanticOutputParser\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field, validator\n",
    "\n",
    "\n",
    "class Person(BaseModel):\n",
    "    \"\"\"Information about a person.\"\"\"\n",
    "\n",
    "    name: str = Field(..., description=\"The name of the person\")\n",
    "    height_in_meters: float = Field(\n",
    "        ..., description=\"The height of the person expressed in meters.\"\n",
    "    )\n",
    "\n",
    "\n",
    "class People(BaseModel):\n",
    "    \"\"\"Identifying information about all people in a text.\"\"\"\n",
    "\n",
    "    people: List[Person]\n",
    "\n",
    "\n",
    "# Set up a parser\n",
    "parser = PydanticOutputParser(pydantic_object=People)\n",
    "\n",
    "# Prompt\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\n",
    "            \"system\",\n",
    "            \"Answer the user query. Wrap the output in `json` tags\\n{format_instructions}\",\n",
    "        ),\n",
    "        (\"human\", \"{query}\"),\n",
    "    ]\n",
    ").partial(format_instructions=parser.get_format_instructions())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c31aa2c8-05a9-4a12-80c5-ea1250dea0ae",
   "metadata": {},
   "source": [
    "Let's take a look at what information is sent to the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "20b99ffb-a114-49a9-a7be-154c525f8ada",
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"Anna is 23 years old and she is 6 feet tall\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "4f3a66ce-de19-4571-9e54-67504ae3fba7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System: Answer the user query. Wrap the output in `json` tags\n",
      "The output should be formatted as a JSON instance that conforms to the JSON schema below.\n",
      "\n",
      "As an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\n",
      "the object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\n",
      "\n",
      "Here is the output schema:\n",
      "```\n",
      "{\"description\": \"Identifying information about all people in a text.\", \"properties\": {\"people\": {\"title\": \"People\", \"type\": \"array\", \"items\": {\"$ref\": \"#/definitions/Person\"}}}, \"required\": [\"people\"], \"definitions\": {\"Person\": {\"title\": \"Person\", \"description\": \"Information about a person.\", \"type\": \"object\", \"properties\": {\"name\": {\"title\": \"Name\", \"description\": \"The name of the person\", \"type\": \"string\"}, \"height_in_meters\": {\"title\": \"Height In Meters\", \"description\": \"The height of the person expressed in meters.\", \"type\": \"number\"}}, \"required\": [\"name\", \"height_in_meters\"]}}}\n",
      "```\n",
      "Human: Anna is 23 years old and she is 6 feet tall\n"
     ]
    }
   ],
   "source": [
    "print(prompt.format_prompt(query=query).to_string())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "3a46b5fd-9242-4b8c-a4e2-3f04fc19b3a4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "People(people=[Person(name='Anna', height_in_meters=1.83)])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain = prompt | model | parser\n",
    "chain.invoke({\"query\": query})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "815b3b87-3bc6-4b56-835e-c6b6703cef5d",
   "metadata": {},
   "source": [
    "## Custom Parsing\n",
    "\n",
    "It's easy to create a custom prompt and parser with `LangChain` and `LCEL`.\n",
    "\n",
    "You can use a simple function to parse the output from the model!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "b1f11912-c1bb-4a2a-a482-79bf3996961f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import re\n",
    "from typing import List, Optional\n",
    "\n",
    "from langchain_anthropic.chat_models import ChatAnthropic\n",
    "from langchain_core.messages import AIMessage\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field, validator\n",
    "\n",
    "\n",
    "class Person(BaseModel):\n",
    "    \"\"\"Information about a person.\"\"\"\n",
    "\n",
    "    name: str = Field(..., description=\"The name of the person\")\n",
    "    height_in_meters: float = Field(\n",
    "        ..., description=\"The height of the person expressed in meters.\"\n",
    "    )\n",
    "\n",
    "\n",
    "class People(BaseModel):\n",
    "    \"\"\"Identifying information about all people in a text.\"\"\"\n",
    "\n",
    "    people: List[Person]\n",
    "\n",
    "\n",
    "# Prompt\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\n",
    "            \"system\",\n",
    "            \"Answer the user query. Output your answer as JSON that  \"\n",
    "            \"matches the given schema: ```json\\n{schema}\\n```. \"\n",
    "            \"Make sure to wrap the answer in ```json and ``` tags\",\n",
    "        ),\n",
    "        (\"human\", \"{query}\"),\n",
    "    ]\n",
    ").partial(schema=People.schema())\n",
    "\n",
    "\n",
    "# Custom parser\n",
    "def extract_json(message: AIMessage) -> List[dict]:\n",
    "    \"\"\"Extracts JSON content from a string where JSON is embedded between ```json and ``` tags.\n",
    "\n",
    "    Parameters:\n",
    "        text (str): The text containing the JSON content.\n",
    "\n",
    "    Returns:\n",
    "        list: A list of extracted JSON strings.\n",
    "    \"\"\"\n",
    "    text = message.content\n",
    "    # Define the regular expression pattern to match JSON blocks\n",
    "    pattern = r\"```json(.*?)```\"\n",
    "\n",
    "    # Find all non-overlapping matches of the pattern in the string\n",
    "    matches = re.findall(pattern, text, re.DOTALL)\n",
    "\n",
    "    # Return the list of matched JSON strings, stripping any leading or trailing whitespace\n",
    "    try:\n",
    "        return [json.loads(match.strip()) for match in matches]\n",
    "    except Exception:\n",
    "        raise ValueError(f\"Failed to parse: {message}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "cda52ef5-a354-47a7-9c25-45153c2389e2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System: Answer the user query. Output your answer as JSON that  matches the given schema: ```json\n",
      "{'title': 'People', 'description': 'Identifying information about all people in a text.', 'type': 'object', 'properties': {'people': {'title': 'People', 'type': 'array', 'items': {'$ref': '#/definitions/Person'}}}, 'required': ['people'], 'definitions': {'Person': {'title': 'Person', 'description': 'Information about a person.', 'type': 'object', 'properties': {'name': {'title': 'Name', 'description': 'The name of the person', 'type': 'string'}, 'height_in_meters': {'title': 'Height In Meters', 'description': 'The height of the person expressed in meters.', 'type': 'number'}}, 'required': ['name', 'height_in_meters']}}}\n",
      "```. Make sure to wrap the answer in ```json and ``` tags\n",
      "Human: Anna is 23 years old and she is 6 feet tall\n"
     ]
    }
   ],
   "source": [
    "query = \"Anna is 23 years old and she is 6 feet tall\"\n",
    "print(prompt.format_prompt(query=query).to_string())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "993dc61a-229d-4795-a746-0d17df86b5c0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'people': [{'name': 'Anna', 'height_in_meters': 1.83}]}]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain = prompt | model | extract_json\n",
    "chain.invoke({\"query\": query})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3601bde",
   "metadata": {},
   "source": [
    "## Other Libraries\n",
    "\n",
    "If you're looking at extracting using a parsing approach, check out the [Kor](https://eyurtsev.github.io/kor/) library. It's written by one of the `LangChain` maintainers and it\n",
    "helps to craft a prompt that takes examples into account, allows controlling formats (e.g., JSON or CSV) and expresses the schema in TypeScript. It seems to work pretty!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}