langchain/docs/versioned_docs/version-0.2.x/tutorials/extraction.ipynb
Harrison Chase c6d0993c93 cr
2024-04-20 14:22:29 -07:00

{
"cells": [
{
"cell_type": "raw",
"id": "df29b30a-fd27-4e08-8269-870df5631f9e",
"metadata": {},
"source": [
"---\n",
"title: Quickstart\n",
"sidebar_position: 0\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "d28530a6-ddfd-49c0-85dc-b723551f6614",
"metadata": {},
"source": [
"In this quick start, we will use [chat models](/docs/modules/model_io/chat/) that are capable of **function/tool calling** to extract information from text.\n",
"\n",
":::{.callout-important}\n",
"Extraction using **function/tool calling** only works with [models that support **function/tool calling**](/docs/modules/model_io/chat/function_calling).\n",
":::"
]
},
{
"cell_type": "markdown",
"id": "4412def2-38e3-4bd0-bbf0-fb09ff9e5985",
"metadata": {},
"source": [
"## Set up\n",
"\n",
"We will use the [structured output](/docs/modules/model_io/chat/structured_output) method available on LLMs that are capable of **function/tool calling**. \n",
"\n",
"Select a model, install its dependencies, and set up API keys."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "380c0425-6062-4837-8630-c220240c83b9",
"metadata": {},
"outputs": [],
"source": [
"!pip install langchain\n",
"\n",
"# Install an integration package for a model capable of tool calling:\n",
"# !pip install langchain-openai\n",
"# !pip install langchain-mistralai\n",
"# !pip install langchain-fireworks\n",
"\n",
"# Set env vars for the relevant model or load from a .env file:\n",
"# import dotenv\n",
"# dotenv.load_dotenv()"
]
},
{
"cell_type": "markdown",
"id": "54d6b970-2ea3-4192-951e-21237212b359",
"metadata": {},
"source": [
"## The Schema\n",
"\n",
"First, we need to describe what information we want to extract from the text.\n",
"\n",
"We'll use Pydantic to define an example schema to extract personal information."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "c141084c-fb94-4093-8d6a-81175d688e40",
"metadata": {},
"outputs": [],
"source": [
"from typing import Optional\n",
"\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"\n",
"\n",
"class Person(BaseModel):\n",
" \"\"\"Information about a person.\"\"\"\n",
"\n",
" # ^ Doc-string for the entity Person.\n",
" # This doc-string is sent to the LLM as the description of the schema Person,\n",
" # and it can help to improve extraction results.\n",
"\n",
" # Note that:\n",
" # 1. Each field is `Optional` -- this allows the model to decline to extract it!\n",
" # 2. Each field has a `description` -- this description is used by the LLM.\n",
" # Having a good description can help improve extraction results.\n",
" name: Optional[str] = Field(default=None, description=\"The name of the person\")\n",
" hair_color: Optional[str] = Field(\n",
" default=None, description=\"The color of the person's hair if known\"\n",
" )\n",
" height_in_meters: Optional[str] = Field(\n",
" default=None, description=\"Height measured in meters\"\n",
" )"
]
},
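{
"cell_type": "markdown",
"id": "schema-sanity-check-md",
"metadata": {},
"source": [
"As an optional sanity check (added for illustration), you can inspect the JSON schema that Pydantic generates from the class above -- the doc-string and field descriptions are the documentation the model will see:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "schema-sanity-check-code",
"metadata": {},
"outputs": [],
"source": [
"# Pydantic v1 exposes the generated JSON schema via `.schema()`.\n",
"# The class doc-string becomes the schema description, and each\n",
"# `Field(description=...)` becomes the matching property description.\n",
"schema = Person.schema()\n",
"print(schema[\"description\"])  # Information about a person.\n",
"print(schema[\"properties\"][\"name\"][\"description\"])  # The name of the person"
]
},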
{
"cell_type": "markdown",
"id": "f248dd54-e36d-435a-b154-394ab4ed6792",
"metadata": {},
"source": [
"There are two best practices when defining a schema:\n",
"\n",
"1. Document the **attributes** and the **schema** itself: This information is sent to the LLM and is used to improve the quality of information extraction.\n",
"2. Do not force the LLM to make up information! Above we used `Optional` for the attributes, allowing the LLM to output `None` if it doesn't know the answer.\n",
"\n",
":::{.callout-important}\n",
"For best performance, document the schema well and make sure the model isn't forced to return results if there's no information to be extracted from the text.\n",
":::\n",
"\n",
"## The Extractor\n",
"\n",
"Let's create an information extractor using the schema we defined above."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a5e490f6-35ad-455e-8ae4-2bae021583ff",
"metadata": {},
"outputs": [],
"source": [
"from typing import Optional\n",
"\n",
"from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"# Define a custom prompt to provide instructions and any additional context.\n",
"# 1) You can add examples into the prompt template to improve extraction quality\n",
"# 2) Introduce additional parameters to take context into account (e.g., include metadata\n",
"# about the document from which the text was extracted.)\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\n",
" \"system\",\n",
" \"You are an expert extraction algorithm. \"\n",
" \"Only extract relevant information from the text. \"\n",
" \"If you do not know the value of an attribute asked to extract, \"\n",
" \"return null for the attribute's value.\",\n",
" ),\n",
" # Please see the how-to about improving performance with\n",
" # reference examples.\n",
" # MessagesPlaceholder('examples'),\n",
" (\"human\", \"{text}\"),\n",
" ]\n",
")"
]
},
{
"cell_type": "markdown",
"id": "832bf6a1-8e0c-4b6a-aa37-12fe9c42a6d9",
"metadata": {},
"source": [
"We need to use a model that supports function/tool calling.\n",
"\n",
"Please review [structured output](/docs/modules/model_io/chat/structured_output) for a list of models that can be used with this API."
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "04d846a6-d5cb-4009-ac19-61e3aac0177e",
"metadata": {},
"outputs": [],
"source": [
"from langchain_mistralai import ChatMistralAI\n",
"\n",
"llm = ChatMistralAI(model=\"mistral-large-latest\", temperature=0)\n",
"\n",
"runnable = prompt | llm.with_structured_output(schema=Person)"
]
},
{
"cell_type": "markdown",
"id": "23582c0b-00ed-403f-a10e-3aeabf921f12",
"metadata": {},
"source": [
"Let's test it out!"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "13165ac8-a1dc-44ce-a6ed-f52b577473e4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Person(name='Alan Smith', hair_color='blond', height_in_meters='1.8288')"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text = \"Alan Smith is 6 feet tall and has blond hair.\"\n",
"runnable.invoke({\"text\": text})"
]
},
{
"cell_type": "markdown",
"id": "bd1c493d-f9dc-4236-8da9-50f6919f5710",
"metadata": {},
"source": [
":::{.callout-important} \n",
"\n",
"Extraction is Generative 🤯\n",
"\n",
"LLMs are generative models, so they can do some pretty cool things like correctly extract the height of the person in meters\n",
"even though it was provided in feet!\n",
":::"
]
},
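{
"cell_type": "markdown",
"id": "unit-conversion-check-md",
"metadata": {},
"source": [
"A quick hand check of that conversion (illustrative arithmetic only, not part of the extraction pipeline): 6 feet at 0.3048 meters per foot is 1.8288 meters, which matches the value the model returned."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "unit-conversion-check-code",
"metadata": {},
"outputs": [],
"source": [
"# Verify the model's feet-to-meters conversion by hand.\n",
"feet = 6\n",
"meters = round(feet * 0.3048, 4)  # 1 ft = 0.3048 m\n",
"print(meters)  # 1.8288"
]
},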
{
"cell_type": "markdown",
"id": "28c5ef0c-b8d1-4e12-bd0e-e2528de87fcc",
"metadata": {},
"source": [
"## Multiple Entities\n",
"\n",
"In **most cases**, you should be extracting a list of entities rather than a single entity.\n",
"\n",
"This can be achieved easily using Pydantic by nesting models inside one another."
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "591a0c16-7a17-4883-91ee-0d6d2fdb265c",
"metadata": {},
"outputs": [],
"source": [
"from typing import List, Optional\n",
"\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"\n",
"\n",
"class Person(BaseModel):\n",
" \"\"\"Information about a person.\"\"\"\n",
"\n",
" # ^ Doc-string for the entity Person.\n",
" # This doc-string is sent to the LLM as the description of the schema Person,\n",
" # and it can help to improve extraction results.\n",
"\n",
" # Note that:\n",
" # 1. Each field is `Optional` -- this allows the model to decline to extract it!\n",
" # 2. Each field has a `description` -- this description is used by the LLM.\n",
" # Having a good description can help improve extraction results.\n",
" name: Optional[str] = Field(default=None, description=\"The name of the person\")\n",
" hair_color: Optional[str] = Field(\n",
" default=None, description=\"The color of the person's hair if known\"\n",
" )\n",
" height_in_meters: Optional[str] = Field(\n",
" default=None, description=\"Height measured in meters\"\n",
" )\n",
"\n",
"\n",
"class Data(BaseModel):\n",
" \"\"\"Extracted data about people.\"\"\"\n",
"\n",
" # Creates a model so that we can extract multiple entities.\n",
" people: List[Person]"
]
},
{
"cell_type": "markdown",
"id": "5f5cda33-fd7b-481e-956a-703f45e40e1d",
"metadata": {},
"source": [
":::{.callout-important}\n",
"Extraction might not be perfect here. Read on to see how to use **Reference Examples** to improve the quality of extraction, and see the **guidelines** section!\n",
":::"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "cf7062cc-1d1d-4a37-9122-509d1b87f0a6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Data(people=[Person(name='Jeff', hair_color=None, height_in_meters=None), Person(name='Anna', hair_color=None, height_in_meters=None)])"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"runnable = prompt | llm.with_structured_output(schema=Data)\n",
"text = \"My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me.\"\n",
"runnable.invoke({\"text\": text})"
]
},
{
"cell_type": "markdown",
"id": "fba1d770-bf4d-4de4-9e4f-7384872ef0dc",
"metadata": {},
"source": [
":::{.callout-tip}\n",
"When the schema accommodates the extraction of **multiple entities**, it also allows the model to extract **no entities** when no relevant information\n",
"is in the text, by returning an empty list.\n",
"\n",
"This is usually a **good** thing! It allows specifying **required** attributes on an entity without necessarily forcing the model to detect this entity.\n",
":::"
]
},
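{
"cell_type": "markdown",
"id": "empty-extraction-md",
"metadata": {},
"source": [
"For instance (a schema-level illustration, no LLM call involved), an empty extraction result is perfectly valid under the `Data` schema defined above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "empty-extraction-code",
"metadata": {},
"outputs": [],
"source": [
"# `people` is a required field, but an empty list is a valid value for it,\n",
"# so the model can signal \"no entities found\" without failing validation.\n",
"empty = Data(people=[])\n",
"print(empty)  # people=[]"
]
},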
{
"cell_type": "markdown",
"id": "f07a7455-7de6-4a6f-9772-0477ef65e3dc",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"Now that you understand the basics of extraction with LangChain, you're ready to proceed to the rest of the how-to guides:\n",
"\n",
"- [Add Examples](/docs/use_cases/extraction/how_to/examples): Learn how to use **reference examples** to improve performance.\n",
"- [Handle Long Text](/docs/use_cases/extraction/how_to/handle_long_text): What should you do if the text does not fit into the context window of the LLM?\n",
"- [Handle Files](/docs/use_cases/extraction/how_to/handle_files): Examples of using LangChain document loaders and parsers to extract from files like PDFs.\n",
"- [Use a Parsing Approach](/docs/use_cases/extraction/how_to/parse): Use a prompt-based approach to extract with models that do not support **tool/function calling**.\n",
"- [Guidelines](/docs/use_cases/extraction/guidelines): Guidelines for getting good performance on extraction tasks."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}