{
"cells": [
{
"cell_type": "raw",
"id": "df29b30a-fd27-4e08-8269-870df5631f9e",
"metadata": {},
"source": [
"---\n",
"sidebar_position: 4\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "d28530a6-ddfd-49c0-85dc-b723551f6614",
"metadata": {},
"source": [
"# Build an Extraction Chain\n",
"\n",
"In this tutorial, we will use [tool-calling](/docs/concepts/tool_calling) features of [chat models](/docs/concepts/chat_models) to extract structured information from unstructured text. We will also demonstrate how to use [few-shot prompting](/docs/concepts/few_shot_prompting/) in this context to improve performance.\n",
"\n",
":::important\n",
"This tutorial requires `langchain-core>=0.3.20` and will only work with models that support **tool calling**.\n",
":::"
]
},
{
"cell_type": "markdown",
"id": "4412def2-38e3-4bd0-bbf0-fb09ff9e5985",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"### Jupyter Notebook\n",
"\n",
"This and other tutorials are perhaps most conveniently run in a [Jupyter notebooks](https://jupyter.org/). Going through guides in an interactive environment is a great way to better understand them. See [here](https://jupyter.org/install) for instructions on how to install.\n",
"\n",
"### Installation\n",
"\n",
"To install LangChain run:\n",
"\n",
"import Tabs from '@theme/Tabs';\n",
"import TabItem from '@theme/TabItem';\n",
"import CodeBlock from \"@theme/CodeBlock\";\n",
"\n",
"<Tabs>\n",
" <TabItem value=\"pip\" label=\"Pip\" default>\n",
" <CodeBlock language=\"bash\">pip install --upgrade langchain-core</CodeBlock>\n",
" </TabItem>\n",
" <TabItem value=\"conda\" label=\"Conda\">\n",
" <CodeBlock language=\"bash\">conda install langchain-core -c conda-forge</CodeBlock>\n",
" </TabItem>\n",
"</Tabs>\n",
"\n",
"\n",
"\n",
"For more details, see our [Installation guide](/docs/how_to/installation).\n",
"\n",
"### LangSmith\n",
"\n",
"Many of the applications you build with LangChain will contain multiple steps with multiple invocations of LLM calls.\n",
"As these applications get more and more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent.\n",
"The best way to do this is with [LangSmith](https://smith.langchain.com).\n",
"\n",
"After you sign up at the link above, make sure to set your environment variables to start logging traces:\n",
"\n",
"```shell\n",
"export LANGSMITH_TRACING=\"true\"\n",
"export LANGSMITH_API_KEY=\"...\"\n",
"```\n",
"\n",
"Or, if in a notebook, you can set them with:\n",
"\n",
"```python\n",
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"LANGSMITH_TRACING\"] = \"true\"\n",
"os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass()\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "54d6b970-2ea3-4192-951e-21237212b359",
"metadata": {},
"source": [
"## The Schema\n",
"\n",
"First, we need to describe what information we want to extract from the text.\n",
"\n",
"We'll use Pydantic to define an example schema to extract personal information."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c141084c-fb94-4093-8d6a-81175d688e40",
"metadata": {},
"outputs": [],
"source": [
"from typing import Optional\n",
"\n",
"from pydantic import BaseModel, Field\n",
"\n",
"\n",
"class Person(BaseModel):\n",
" \"\"\"Information about a person.\"\"\"\n",
"\n",
" # ^ Doc-string for the entity Person.\n",
" # This doc-string is sent to the LLM as the description of the schema Person,\n",
" # and it can help to improve extraction results.\n",
"\n",
" # Note that:\n",
" # 1. Each field is an `optional` -- this allows the model to decline to extract it!\n",
" # 2. Each field has a `description` -- this description is used by the LLM.\n",
" # Having a good description can help improve extraction results.\n",
" name: Optional[str] = Field(default=None, description=\"The name of the person\")\n",
" hair_color: Optional[str] = Field(\n",
" default=None, description=\"The color of the person's hair if known\"\n",
" )\n",
" height_in_meters: Optional[str] = Field(\n",
" default=None, description=\"Height measured in meters\"\n",
" )"
]
},
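{
"cell_type": "markdown",
"id": "schema-json-preview",
"metadata": {},
"source": [
"The doc-string and field descriptions above are not mere comments: they are included in the JSON schema that Pydantic generates from the class, which is the basis of what gets sent to the model. As a quick optional check, you can inspect that schema directly:\n",
"\n",
"```python\n",
"from pprint import pprint\n",
"\n",
"# The class doc-string and the per-field descriptions both appear in the generated schema.\n",
"pprint(Person.model_json_schema())\n",
"```"
]
},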
{
"cell_type": "markdown",
"id": "f248dd54-e36d-435a-b154-394ab4ed6792",
"metadata": {},
"source": [
"There are two best practices when defining schema:\n",
"\n",
"1. Document the **attributes** and the **schema** itself: This information is sent to the LLM and is used to improve the quality of information extraction.\n",
"2. Do not force the LLM to make up information! Above we used `Optional` for the attributes allowing the LLM to output `None` if it doesn't know the answer.\n",
"\n",
":::important\n",
"For best performance, document the schema well and make sure the model isn't force to return results if there's no information to be extracted in the text.\n",
":::\n",
"\n",
"## The Extractor\n",
"\n",
"Let's create an information extractor using the schema we defined above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5e490f6-35ad-455e-8ae4-2bae021583ff",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
"\n",
"# Define a custom prompt to provide instructions and any additional context.\n",
"# 1) You can add examples into the prompt template to improve extraction quality\n",
"# 2) Introduce additional parameters to take context into account (e.g., include metadata\n",
"# about the document from which the text was extracted.)\n",
"prompt_template = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\n",
" \"system\",\n",
" \"You are an expert extraction algorithm. \"\n",
" \"Only extract relevant information from the text. \"\n",
" \"If you do not know the value of an attribute asked to extract, \"\n",
" \"return null for the attribute's value.\",\n",
" ),\n",
" # Please see the how-to about improving performance with\n",
" # reference examples.\n",
" # MessagesPlaceholder('examples'),\n",
" (\"human\", \"{text}\"),\n",
" ]\n",
")"
]
},
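{
"cell_type": "markdown",
"id": "prompt-template-preview",
"metadata": {},
"source": [
"To see exactly what this template produces, we can format it with some sample text; the text below is arbitrary and just for illustration:\n",
"\n",
"```python\n",
"# Format the template with sample text and inspect the resulting messages:\n",
"# a system message with the instructions, then a human message with the text.\n",
"preview = prompt_template.invoke({\"text\": \"Alan Smith is 6 feet tall and has blond hair.\"})\n",
"for message in preview.to_messages():\n",
"    message.pretty_print()\n",
"```"
]
},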
{
"cell_type": "markdown",
"id": "832bf6a1-8e0c-4b6a-aa37-12fe9c42a6d9",
"metadata": {},
"source": [
"We need to use a model that supports function/tool calling.\n",
"\n",
"Please review [the documentation](/docs/concepts/tool_calling) for all models that can be used with this API.\n",
"\n",
"import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
"\n",
"<ChatModelTabs customVarName=\"llm\" />"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "77c1311c-5252-41d6-83e6-fdb40b172e47",
"metadata": {},
"outputs": [],
"source": [
"# | output: false\n",
"# | echo: false\n",
"\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "04d846a6-d5cb-4009-ac19-61e3aac0177e",
"metadata": {},
"outputs": [],
"source": [
"structured_llm = llm.with_structured_output(schema=Person)"
]
},
{
"cell_type": "markdown",
"id": "23582c0b-00ed-403f-a10e-3aeabf921f12",
"metadata": {},
"source": [
"Let's test it out:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "dd42a935-022f-4860-b9e0-84268f55b22a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Person(name='Alan Smith', hair_color='blond', height_in_meters='1.83')"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text = \"Alan Smith is 6 feet tall and has blond hair.\"\n",
"prompt = prompt_template.invoke({\"text\": text})\n",
"structured_llm.invoke(prompt)"
]
},
{
"cell_type": "markdown",
"id": "bd1c493d-f9dc-4236-8da9-50f6919f5710",
"metadata": {},
"source": [
":::important \n",
"\n",
"Extraction is Generative 🤯\n",
"\n",
"LLMs are generative models, so they can do some pretty cool things like correctly extract the height of the person in meters\n",
"even though it was provided in feet!\n",
":::\n",
"\n",
"We can see the LangSmith trace [here](https://smith.langchain.com/public/44b69a63-3b3b-47b8-8a6d-61b46533f015/r). Note that the [chat model portion of the trace](https://smith.langchain.com/public/44b69a63-3b3b-47b8-8a6d-61b46533f015/r/dd1f6305-f1e9-4919-bd8f-339d03a12d01) reveals the exact sequence of messages sent to the model, tools invoked, and other metadata."
]
},
{
"cell_type": "markdown",
"id": "28c5ef0c-b8d1-4e12-bd0e-e2528de87fcc",
"metadata": {},
"source": [
"## Multiple Entities\n",
"\n",
"In **most cases**, you should be extracting a list of entities rather than a single entity.\n",
"\n",
"This can be easily achieved using pydantic by nesting models inside one another."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "591a0c16-7a17-4883-91ee-0d6d2fdb265c",
"metadata": {},
"outputs": [],
"source": [
"from typing import List, Optional\n",
"\n",
"from pydantic import BaseModel, Field\n",
"\n",
"\n",
"class Person(BaseModel):\n",
" \"\"\"Information about a person.\"\"\"\n",
"\n",
" # ^ Doc-string for the entity Person.\n",
" # This doc-string is sent to the LLM as the description of the schema Person,\n",
" # and it can help to improve extraction results.\n",
"\n",
" # Note that:\n",
" # 1. Each field is an `optional` -- this allows the model to decline to extract it!\n",
" # 2. Each field has a `description` -- this description is used by the LLM.\n",
" # Having a good description can help improve extraction results.\n",
" name: Optional[str] = Field(default=None, description=\"The name of the person\")\n",
" hair_color: Optional[str] = Field(\n",
" default=None, description=\"The color of the person's hair if known\"\n",
" )\n",
" height_in_meters: Optional[str] = Field(\n",
" default=None, description=\"Height measured in meters\"\n",
" )\n",
"\n",
"\n",
"class Data(BaseModel):\n",
" \"\"\"Extracted data about people.\"\"\"\n",
"\n",
" # Creates a model so that we can extract multiple entities.\n",
" people: List[Person]"
]
},
{
"cell_type": "markdown",
"id": "5f5cda33-fd7b-481e-956a-703f45e40e1d",
"metadata": {},
"source": [
":::important\n",
"Extraction results might not be perfect here. Read on to see how to use **Reference Examples** to improve the quality of extraction, and check out our extraction [how-to](/docs/how_to/#extraction) guides for more detail.\n",
":::"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "83ecf0db-757b-4ae3-a9d2-eb1c9f6b2631",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Data(people=[Person(name='Jeff', hair_color='black', height_in_meters='1.83'), Person(name='Anna', hair_color='black', height_in_meters=None)])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"structured_llm = llm.with_structured_output(schema=Data)\n",
"text = \"My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me.\"\n",
"prompt = prompt_template.invoke({\"text\": text})\n",
"structured_llm.invoke(prompt)"
]
},
{
"cell_type": "markdown",
"id": "fba1d770-bf4d-4de4-9e4f-7384872ef0dc",
"metadata": {},
"source": [
":::tip\n",
"When the schema accommodates the extraction of **multiple entities**, it also allows the model to extract **no entities** if no relevant information\n",
"is in the text by providing an empty list. \n",
"\n",
"This is usually a **good** thing! It allows specifying **required** attributes on an entity without necessarily forcing the model to detect this entity.\n",
":::\n",
"\n",
"We can see the LangSmith trace [here](https://smith.langchain.com/public/7173764d-5e76-45fe-8496-84460bd9cdef/r)."
]
},
{
"cell_type": "markdown",
"id": "c590f366-050a-43d4-8c78-acf84ccfbf9b",
"metadata": {},
"source": [
"## Reference examples\n",
"\n",
"The behavior of LLM applications can be steered using [few-shot prompting](/docs/concepts/few_shot_prompting/). For [chat models](/docs/concepts/chat_models/), this can take the form of a sequence of pairs of input and response messages demonstrating desired behaviors.\n",
"\n",
"For example, we can convey the meaning of a symbol with alternating `user` and `assistant` [messages](/docs/concepts/messages/#role):"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "0bb138d7-116e-4542-aa5f-bebf0c301ec6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"7\n"
]
}
],
"source": [
"messages = [\n",
" {\"role\": \"user\", \"content\": \"2 🦜 2\"},\n",
" {\"role\": \"assistant\", \"content\": \"4\"},\n",
" {\"role\": \"user\", \"content\": \"2 🦜 3\"},\n",
" {\"role\": \"assistant\", \"content\": \"5\"},\n",
" {\"role\": \"user\", \"content\": \"3 🦜 4\"},\n",
"]\n",
"\n",
"response = llm.invoke(messages)\n",
"print(response.content)"
]
},
{
"cell_type": "markdown",
"id": "b5691d07-e2b8-4ab3-a943-9b0b503e2549",
"metadata": {},
"source": [
"[Structured output](/docs/concepts/structured_outputs/) often uses [tool calling](/docs/concepts/tool_calling/) under-the-hood. This typically involves the generation of [AI messages](/docs/concepts/messages/#aimessage) containing tool calls, as well as [tool messages](/docs/concepts/messages/#toolmessage) containing the results of tool calls. What should a sequence of messages look like in this case?\n",
"\n",
"Different [chat model providers](/docs/integrations/chat/) impose different requirements for valid message sequences. Some will accept a (repeating) message sequence of the form:\n",
"\n",
"- User message\n",
"- AI message with tool call\n",
"- Tool message with result\n",
"\n",
"Others require a final AI message containing some sort of response.\n",
"\n",
"LangChain includes a utility function [tool_example_to_messages](https://python.langchain.com/api_reference/core/utils/langchain_core.utils.function_calling.tool_example_to_messages.html) that will generate a valid sequence for most model providers. It simplifies the generation of structured few-shot examples by just requiring Pydantic representations of the corresponding tool calls.\n",
"\n",
"Let's try this out. We can convert pairs of input strings and desired Pydantic objects to a sequence of messages that can be provided to a chat model. Under the hood, LangChain will format the tool calls to each provider's required format.\n",
"\n",
"Note: this version of `tool_example_to_messages` requires `langchain-core>=0.3.20`."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "c604e476-a2be-4eda-b128-71399e280732",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.utils.function_calling import tool_example_to_messages\n",
"\n",
"examples = [\n",
" (\n",
" \"The ocean is vast and blue. It's more than 20,000 feet deep.\",\n",
" Data(people=[]),\n",
" ),\n",
" (\n",
" \"Fiona traveled far from France to Spain.\",\n",
" Data(people=[Person(name=\"Fiona\", height_in_meters=None, hair_color=None)]),\n",
" ),\n",
"]\n",
"\n",
"\n",
"messages = []\n",
"\n",
"for txt, tool_call in examples:\n",
" if tool_call.people:\n",
" # This final message is optional for some providers\n",
" ai_response = \"Detected people.\"\n",
" else:\n",
" ai_response = \"Detected no people.\"\n",
" messages.extend(tool_example_to_messages(txt, [tool_call], ai_response=ai_response))"
]
},
{
"cell_type": "markdown",
"id": "beecc7a6-e423-4ca1-82b7-c2a751362fd6",
"metadata": {},
"source": [
"Inspecting the result, we see these two example pairs generated eight messages:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "628f67dd-aee0-4200-ac38-24a9fb16f1d1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"================================\u001b[1m Human Message \u001b[0m=================================\n",
"\n",
"The ocean is vast and blue. It's more than 20,000 feet deep.\n",
"==================================\u001b[1m Ai Message \u001b[0m==================================\n",
"Tool Calls:\n",
" Data (d8f2e054-7fb9-417f-b28f-0447a775b2c3)\n",
" Call ID: d8f2e054-7fb9-417f-b28f-0447a775b2c3\n",
" Args:\n",
" people: []\n",
"=================================\u001b[1m Tool Message \u001b[0m=================================\n",
"\n",
"You have correctly called this tool.\n",
"==================================\u001b[1m Ai Message \u001b[0m==================================\n",
"\n",
"Detected no people.\n",
"================================\u001b[1m Human Message \u001b[0m=================================\n",
"\n",
"Fiona traveled far from France to Spain.\n",
"==================================\u001b[1m Ai Message \u001b[0m==================================\n",
"Tool Calls:\n",
" Data (0178939e-a4b1-4d2a-a93e-b87f665cdfd6)\n",
" Call ID: 0178939e-a4b1-4d2a-a93e-b87f665cdfd6\n",
" Args:\n",
" people: [{'name': 'Fiona', 'hair_color': None, 'height_in_meters': None}]\n",
"=================================\u001b[1m Tool Message \u001b[0m=================================\n",
"\n",
"You have correctly called this tool.\n",
"==================================\u001b[1m Ai Message \u001b[0m==================================\n",
"\n",
"Detected people.\n"
]
}
],
"source": [
"for message in messages:\n",
" message.pretty_print()"
]
},
{
"cell_type": "markdown",
"id": "dc8846f0-8bd1-48e1-bc4d-a62fbfa6a9f4",
"metadata": {},
"source": [
"Let's compare performance with and without these messages. For example, let's pass a message for which we intend no people to be extracted:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "6b73d4e2-d18d-4d47-89ec-99b5eb6b234f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Data(people=[Person(name='Earth', hair_color='None', height_in_meters='0.00')])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"message_no_extraction = {\n",
" \"role\": \"user\",\n",
" \"content\": \"The solar system is large, but earth has only 1 moon.\",\n",
"}\n",
"\n",
"structured_llm = llm.with_structured_output(schema=Data)\n",
"structured_llm.invoke([message_no_extraction])"
]
},
{
"cell_type": "markdown",
"id": "350e1298-14f1-48e4-b11c-534af643e3a6",
"metadata": {},
"source": [
"In this example, the model is liable to erroneously generate records of people.\n",
"\n",
"Because our few-shot examples contain examples of \"negatives\", we encourage the model to behave correctly in this case:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "eb1b3a99-4750-45bc-ad28-5d12751ed9f8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Data(people=[])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"structured_llm.invoke(messages + [message_no_extraction])"
]
},
{
"cell_type": "markdown",
"id": "4d1ae320-14bc-45ee-aeeb-8a986f3e6808",
"metadata": {},
"source": [
":::tip\n",
"\n",
"The [LangSmith](https://smith.langchain.com/public/b3433f57-7905-4430-923c-fed214525bf1/r) trace for the run reveals the exact sequence of messages sent to the chat model, tool calls generated, latency, token counts, and other metadata.\n",
"\n",
":::\n",
"\n",
"See [this guide](/docs/how_to/extraction_examples/) for more detail on extraction workflows with reference examples, including how to incorporate prompt templates and customize the generation of example messages."
]
},
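{
"cell_type": "markdown",
"id": "examples-placeholder-sketch",
"metadata": {},
"source": [
"As a minimal sketch of that pattern, the example messages can be injected through the `MessagesPlaceholder` that was left commented out in the prompt template above (the `\"examples\"` variable name below is our own choice):\n",
"\n",
"```python\n",
"few_shot_prompt = ChatPromptTemplate.from_messages(\n",
"    [\n",
"        (\n",
"            \"system\",\n",
"            \"You are an expert extraction algorithm. \"\n",
"            \"Only extract relevant information from the text. \"\n",
"            \"If you do not know the value of an attribute asked to extract, \"\n",
"            \"return null for the attribute's value.\",\n",
"        ),\n",
"        # The few-shot messages are supplied at invocation time.\n",
"        MessagesPlaceholder(\"examples\"),\n",
"        (\"human\", \"{text}\"),\n",
"    ]\n",
")\n",
"\n",
"prompt = few_shot_prompt.invoke(\n",
"    {\"text\": \"The solar system is large, but earth has only 1 moon.\", \"examples\": messages}\n",
")\n",
"structured_llm.invoke(prompt)\n",
"```"
]
},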
{
"cell_type": "markdown",
"id": "f07a7455-7de6-4a6f-9772-0477ef65e3dc",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"Now that you understand the basics of extraction with LangChain, you're ready to proceed to the rest of the how-to guides:\n",
"\n",
"- [Add Examples](/docs/how_to/extraction_examples): More detail on using **reference examples** to improve performance.\n",
"- [Handle Long Text](/docs/how_to/extraction_long_text): What should you do if the text does not fit into the context window of the LLM?\n",
"- [Use a Parsing Approach](/docs/how_to/extraction_parse): Use a prompt based approach to extract with models that do not support **tool/function calling**."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3deb47ba",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}