langchain/docs/docs/how_to/extraction_parse.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ea37db49-d389-4291-be73-885d06c1fb7e",
   "metadata": {},
   "source": [
    "# How to use prompting alone (no tool calling) to do extraction\n",
    "\n",
    "Tool calling features are not required for generating structured output from LLMs. LLMs that are able to follow prompt instructions well can be tasked with outputting information in a given format.\n",
    "\n",
    "This approach relies on designing good prompts and then parsing the output of the LLMs to make them extract information well.\n",
    "\n",
    "To extract data without tool-calling features: \n",
    "\n",
    "1. Instruct the LLM to generate text following an expected format (e.g., JSON with a certain schema);\n",
    "2. Use [output parsers](/docs/concepts#output-parsers) to structure the model response into a desired Python object.\n",
    "\n",
    "First we select a LLM:\n",
    "\n",
    "```{=mdx}\n",
    "import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
    "\n",
    "<ChatModelTabs customVarName=\"model\" />\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "25487939-8713-4ec7-b774-e4a761ac8298",
   "metadata": {},
   "outputs": [],
   "source": [
    "# | output: false\n",
    "# | echo: false\n",
    "\n",
    "from langchain_anthropic.chat_models import ChatAnthropic\n",
    "\n",
    "model = ChatAnthropic(model_name=\"claude-3-sonnet-20240229\", temperature=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e412374-3beb-4bbf-966b-400c1f66a258",
   "metadata": {},
   "source": [
    ":::{.callout-tip}\n",
    "This tutorial is meant to be simple, but generally should really include reference examples to squeeze out performance!\n",
    ":::"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "abc1a945-0f80-4953-add4-cd572b6f2a51",
   "metadata": {},
   "source": [
    "## Using PydanticOutputParser\n",
    "\n",
    "The following example uses the built-in `PydanticOutputParser` to parse the output of a chat model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "497eb023-c043-443d-ac62-2d4ea85fe1b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List, Optional\n",
    "\n",
    "from langchain_core.output_parsers import PydanticOutputParser\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field, validator\n",
    "\n",
    "\n",
    "class Person(BaseModel):\n",
    "    \"\"\"Information about a person.\"\"\"\n",
    "\n",
    "    name: str = Field(..., description=\"The name of the person\")\n",
    "    height_in_meters: float = Field(\n",
    "        ..., description=\"The height of the person expressed in meters.\"\n",
    "    )\n",
    "\n",
    "\n",
    "class People(BaseModel):\n",
    "    \"\"\"Identifying information about all people in a text.\"\"\"\n",
    "\n",
    "    people: List[Person]\n",
    "\n",
    "\n",
    "# Set up a parser\n",
    "parser = PydanticOutputParser(pydantic_object=People)\n",
    "\n",
    "# Prompt\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\n",
    "            \"system\",\n",
    "            \"Answer the user query. Wrap the output in `json` tags\\n{format_instructions}\",\n",
    "        ),\n",
    "        (\"human\", \"{query}\"),\n",
    "    ]\n",
    ").partial(format_instructions=parser.get_format_instructions())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c31aa2c8-05a9-4a12-80c5-ea1250dea0ae",
   "metadata": {},
   "source": [
    "Let's take a look at what information is sent to the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "20b99ffb-a114-49a9-a7be-154c525f8ada",
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"Anna is 23 years old and she is 6 feet tall\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "4f3a66ce-de19-4571-9e54-67504ae3fba7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System: Answer the user query. Wrap the output in `json` tags\n",
      "The output should be formatted as a JSON instance that conforms to the JSON schema below.\n",
      "\n",
      "As an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\n",
      "the object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\n",
      "\n",
      "Here is the output schema:\n",
      "```\n",
      "{\"description\": \"Identifying information about all people in a text.\", \"properties\": {\"people\": {\"title\": \"People\", \"type\": \"array\", \"items\": {\"$ref\": \"#/definitions/Person\"}}}, \"required\": [\"people\"], \"definitions\": {\"Person\": {\"title\": \"Person\", \"description\": \"Information about a person.\", \"type\": \"object\", \"properties\": {\"name\": {\"title\": \"Name\", \"description\": \"The name of the person\", \"type\": \"string\"}, \"height_in_meters\": {\"title\": \"Height In Meters\", \"description\": \"The height of the person expressed in meters.\", \"type\": \"number\"}}, \"required\": [\"name\", \"height_in_meters\"]}}}\n",
      "```\n",
      "Human: Anna is 23 years old and she is 6 feet tall\n"
     ]
    }
   ],
   "source": [
    "print(prompt.format_prompt(query=query).to_string())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f1048e0-1bfd-49f9-b697-74389a5ce69c",
   "metadata": {},
   "source": [
    "Having defined our prompt, we simply chain together the prompt, model and output parser:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "7e0041eb-37dc-4384-9fe3-6dd8c356371e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "People(people=[Person(name='Anna', height_in_meters=1.83)])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain = prompt | model | parser\n",
    "chain.invoke({\"query\": query})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd492fe4-110a-4b83-a191-79fffbc1055a",
   "metadata": {},
   "source": [
    "Check out the associated [Langsmith trace](https://smith.langchain.com/public/92ed52a3-92b9-45af-a663-0a9c00e5e396/r).\n",
    "\n",
    "Note that the schema shows up in two places: \n",
    "\n",
    "1. In the prompt, via `parser.get_format_instructions()`;\n",
    "2. In the chain, to receive the formatted output and structure it into a Python object (in this case, the Pydantic object `People`)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "815b3b87-3bc6-4b56-835e-c6b6703cef5d",
   "metadata": {},
   "source": [
    "## Custom Parsing\n",
    "\n",
    "If desired, it's easy to create a custom prompt and parser with `LangChain` and `LCEL`.\n",
    "\n",
    "To create a custom parser, define a function to parse the output from the model (typically an [AIMessage](https://python.langchain.com/v0.2/api_reference/core/messages/langchain_core.messages.ai.AIMessage.html)) into an object of your choice.\n",
    "\n",
    "See below for a simple implementation of a JSON parser."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "b1f11912-c1bb-4a2a-a482-79bf3996961f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import re\n",
    "from typing import List, Optional\n",
    "\n",
    "from langchain_anthropic.chat_models import ChatAnthropic\n",
    "from langchain_core.messages import AIMessage\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field, validator\n",
    "\n",
    "\n",
    "class Person(BaseModel):\n",
    "    \"\"\"Information about a person.\"\"\"\n",
    "\n",
    "    name: str = Field(..., description=\"The name of the person\")\n",
    "    height_in_meters: float = Field(\n",
    "        ..., description=\"The height of the person expressed in meters.\"\n",
    "    )\n",
    "\n",
    "\n",
    "class People(BaseModel):\n",
    "    \"\"\"Identifying information about all people in a text.\"\"\"\n",
    "\n",
    "    people: List[Person]\n",
    "\n",
    "\n",
    "# Prompt\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\n",
    "            \"system\",\n",
    "            \"Answer the user query. Output your answer as JSON that  \"\n",
    "            \"matches the given schema: ```json\\n{schema}\\n```. \"\n",
    "            \"Make sure to wrap the answer in ```json and ``` tags\",\n",
    "        ),\n",
    "        (\"human\", \"{query}\"),\n",
    "    ]\n",
    ").partial(schema=People.schema())\n",
    "\n",
    "\n",
    "# Custom parser\n",
    "def extract_json(message: AIMessage) -> List[dict]:\n",
    "    \"\"\"Extracts JSON content from a string where JSON is embedded between ```json and ``` tags.\n",
    "\n",
    "    Parameters:\n",
    "        text (str): The text containing the JSON content.\n",
    "\n",
    "    Returns:\n",
    "        list: A list of extracted JSON strings.\n",
    "    \"\"\"\n",
    "    text = message.content\n",
    "    # Define the regular expression pattern to match JSON blocks\n",
    "    pattern = r\"```json(.*?)```\"\n",
    "\n",
    "    # Find all non-overlapping matches of the pattern in the string\n",
    "    matches = re.findall(pattern, text, re.DOTALL)\n",
    "\n",
    "    # Return the list of matched JSON strings, stripping any leading or trailing whitespace\n",
    "    try:\n",
    "        return [json.loads(match.strip()) for match in matches]\n",
    "    except Exception:\n",
    "        raise ValueError(f\"Failed to parse: {message}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "9260d5e8-3b6c-4639-9f3b-fb2f90239e4b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System: Answer the user query. Output your answer as JSON that  matches the given schema: ```json\n",
      "{'title': 'People', 'description': 'Identifying information about all people in a text.', 'type': 'object', 'properties': {'people': {'title': 'People', 'type': 'array', 'items': {'$ref': '#/definitions/Person'}}}, 'required': ['people'], 'definitions': {'Person': {'title': 'Person', 'description': 'Information about a person.', 'type': 'object', 'properties': {'name': {'title': 'Name', 'description': 'The name of the person', 'type': 'string'}, 'height_in_meters': {'title': 'Height In Meters', 'description': 'The height of the person expressed in meters.', 'type': 'number'}}, 'required': ['name', 'height_in_meters']}}}\n",
      "```. Make sure to wrap the answer in ```json and ``` tags\n",
      "Human: Anna is 23 years old and she is 6 feet tall\n"
     ]
    }
   ],
   "source": [
    "query = \"Anna is 23 years old and she is 6 feet tall\"\n",
    "print(prompt.format_prompt(query=query).to_string())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "c523301d-ae0e-45e3-b195-7fd28c67a5c4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'people': [{'name': 'Anna', 'height_in_meters': 1.83}]}]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain = prompt | model | extract_json\n",
    "chain.invoke({\"query\": query})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3601bde",
   "metadata": {},
   "source": [
    "## Other Libraries\n",
    "\n",
    "If you're looking at extracting using a parsing approach, check out the [Kor](https://eyurtsev.github.io/kor/) library. It's written by one of the `LangChain` maintainers and it\n",
    "helps to craft a prompt that takes examples into account, allows controlling formats (e.g., JSON or CSV) and expresses the schema in TypeScript. It seems to work pretty!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}