langchain/docs/versioned_docs/version-0.2.x/tutorials/extraction.ipynb
Harrison Chase c6d0993c93 cr
2024-04-20 14:22:29 -07:00

{
"cells": [
{
"cell_type": "raw",
"id": "df29b30a-fd27-4e08-8269-870df5631f9e",
"metadata": {},
"source": [
"---\n",
"title: Quickstart\n",
"sidebar_position: 0\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "d28530a6-ddfd-49c0-85dc-b723551f6614",
"metadata": {},
"source": [
"In this quick start, we will use [chat models](/docs/modules/model_io/chat/) that are capable of **function/tool calling** to extract information from text.\n",
"\n",
":::{.callout-important}\n",
"Extraction using **function/tool calling** only works with [models that support **function/tool calling**](/docs/modules/model_io/chat/function_calling).\n",
":::"
]
},
{
"cell_type": "markdown",
"id": "4412def2-38e3-4bd0-bbf0-fb09ff9e5985",
"metadata": {},
"source": [
"## Set up\n",
"\n",
"We will use the [structured output](/docs/modules/model_io/chat/structured_output) method available on LLMs that are capable of **function/tool calling**. \n",
"\n",
"Select a model, install its dependencies, and set up API keys."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "380c0425-6062-4837-8630-c220240c83b9",
"metadata": {},
"outputs": [],
"source": [
"!pip install langchain\n",
"\n",
"# Install an integration package for a model capable of tool calling:\n",
"# !pip install langchain-openai\n",
"# !pip install langchain-mistralai\n",
"# !pip install langchain-fireworks\n",
"\n",
"# Set env vars for the relevant model or load from a .env file:\n",
"# import dotenv\n",
"# dotenv.load_dotenv()"
]
},
{
"cell_type": "markdown",
"id": "54d6b970-2ea3-4192-951e-21237212b359",
"metadata": {},
"source": [
"## The Schema\n",
"\n",
"First, we need to describe what information we want to extract from the text.\n",
"\n",
"We'll use Pydantic to define an example schema to extract personal information."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "c141084c-fb94-4093-8d6a-81175d688e40",
"metadata": {},
"outputs": [],
"source": [
"from typing import Optional\n",
"\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"\n",
"\n",
"class Person(BaseModel):\n",
" \"\"\"Information about a person.\"\"\"\n",
"\n",
" # ^ Doc-string for the entity Person.\n",
" # This doc-string is sent to the LLM as the description of the schema Person,\n",
" # and it can help to improve extraction results.\n",
"\n",
" # Note that:\n",
" # 1. Each field is `Optional` -- this allows the model to decline to extract it!\n",
" # 2. Each field has a `description` -- this description is used by the LLM.\n",
" # Having a good description can help improve extraction results.\n",
" name: Optional[str] = Field(default=None, description=\"The name of the person\")\n",
" hair_color: Optional[str] = Field(\n",
" default=None, description=\"The color of the person's hair if known\"\n",
" )\n",
" height_in_meters: Optional[str] = Field(\n",
" default=None, description=\"Height measured in meters\"\n",
" )"
]
},
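{
"cell_type": "markdown",
"id": "schema-sanity-check-md",
"metadata": {},
"source": [
"As an optional sanity check (added for illustration), you can inspect the JSON schema that Pydantic generates from the class above -- the doc-string and field descriptions are the documentation the model will see:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "schema-sanity-check-code",
"metadata": {},
"outputs": [],
"source": [
"# Pydantic v1 exposes the generated JSON schema via `.schema()`.\n",
"# The class doc-string becomes the schema description, and each\n",
"# `Field(description=...)` becomes the matching property description.\n",
"schema = Person.schema()\n",
"print(schema[\"description\"])  # Information about a person.\n",
"print(schema[\"properties\"][\"name\"][\"description\"])  # The name of the person"
]
},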
{
"cell_type": "markdown",
"id": "f248dd54-e36d-435a-b154-394ab4ed6792",
"metadata": {},
"source": [
"There are two best practices when defining a schema:\n",
"\n",
"1. Document the **attributes** and the **schema** itself: This information is sent to the LLM and is used to improve the quality of information extraction.\n",
"2. Do not force the LLM to make up information! Above we used `Optional` for the attributes, allowing the LLM to output `None` if it doesn't know the answer.\n",
"\n",
":::{.callout-important}\n",
"For best performance, document the schema well and make sure the model isn't forced to return results if there's no information to be extracted from the text.\n",
":::\n",
"\n",
"## The Extractor\n",
"\n",
"Let's create an information extractor using the schema we defined above."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a5e490f6-35ad-455e-8ae4-2bae021583ff",
"metadata": {},
"outputs": [],
"source": [
"from typing import Optional\n",
"\n",
"from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"# Define a custom prompt to provide instructions and any additional context.\n",
"# 1) You can add examples into the prompt template to improve extraction quality\n",
"# 2) Introduce additional parameters to take context into account (e.g., include metadata\n",
"# about the document from which the text was extracted.)\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\n",
" \"system\",\n",
" \"You are an expert extraction algorithm. \"\n",
" \"Only extract relevant information from the text. \"\n",
" \"If you do not know the value of an attribute asked to extract, \"\n",
" \"return null for the attribute's value.\",\n",
" ),\n",
" # Please see the how-to about improving performance with\n",
" # reference examples.\n",
" # MessagesPlaceholder('examples'),\n",
" (\"human\", \"{text}\"),\n",
" ]\n",
")"
]
},
{
"cell_type": "markdown",
"id": "832bf6a1-8e0c-4b6a-aa37-12fe9c42a6d9",
"metadata": {},
"source": [
"We need to use a model that supports function/tool calling.\n",
"\n",
"Please review [structured output](/docs/modules/model_io/chat/structured_output) for a list of models that can be used with this API."
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "04d846a6-d5cb-4009-ac19-61e3aac0177e",
"metadata": {},
"outputs": [],
"source": [
"from langchain_mistralai import ChatMistralAI\n",
"\n",
"llm = ChatMistralAI(model=\"mistral-large-latest\", temperature=0)\n",
"\n",
"runnable = prompt | llm.with_structured_output(schema=Person)"
]
},
{
"cell_type": "markdown",
"id": "23582c0b-00ed-403f-a10e-3aeabf921f12",
"metadata": {},
"source": [
"Let's test it out!"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "13165ac8-a1dc-44ce-a6ed-f52b577473e4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Person(name='Alan Smith', hair_color='blond', height_in_meters='1.8288')"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text = \"Alan Smith is 6 feet tall and has blond hair.\"\n",
"runnable.invoke({\"text\": text})"
]
},
{
"cell_type": "markdown",
"id": "bd1c493d-f9dc-4236-8da9-50f6919f5710",
"metadata": {},
"source": [
":::{.callout-important} \n",
"\n",
"Extraction is Generative 🤯\n",
"\n",
"LLMs are generative models, so they can do some pretty cool things like correctly extract the height of the person in meters\n",
"even though it was provided in feet!\n",
":::"
]
},
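{
"cell_type": "markdown",
"id": "unit-conversion-check-md",
"metadata": {},
"source": [
"A quick hand check of that conversion (illustrative arithmetic only, not part of the extraction pipeline): 6 feet at 0.3048 meters per foot is 1.8288 meters, which matches the value the model returned."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "unit-conversion-check-code",
"metadata": {},
"outputs": [],
"source": [
"# Verify the model's feet-to-meters conversion by hand.\n",
"feet = 6\n",
"meters = round(feet * 0.3048, 4)  # 1 ft = 0.3048 m\n",
"print(meters)  # 1.8288"
]
},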
{
"cell_type": "markdown",
"id": "28c5ef0c-b8d1-4e12-bd0e-e2528de87fcc",
"metadata": {},
"source": [
"## Multiple Entities\n",
"\n",
"In **most cases**, you should be extracting a list of entities rather than a single entity.\n",
"\n",
"This can be achieved easily using Pydantic by nesting models inside one another."
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "591a0c16-7a17-4883-91ee-0d6d2fdb265c",
"metadata": {},
"outputs": [],
"source": [
"from typing import List, Optional\n",
"\n",
"from langchain_core.pydantic_v1 import BaseModel, Field\n",
"\n",
"\n",
"class Person(BaseModel):\n",
" \"\"\"Information about a person.\"\"\"\n",
"\n",
" # ^ Doc-string for the entity Person.\n",
" # This doc-string is sent to the LLM as the description of the schema Person,\n",
" # and it can help to improve extraction results.\n",
"\n",
" # Note that:\n",
" # 1. Each field is `Optional` -- this allows the model to decline to extract it!\n",
" # 2. Each field has a `description` -- this description is used by the LLM.\n",
" # Having a good description can help improve extraction results.\n",
" name: Optional[str] = Field(default=None, description=\"The name of the person\")\n",
" hair_color: Optional[str] = Field(\n",
" default=None, description=\"The color of the person's hair if known\"\n",
" )\n",
" height_in_meters: Optional[str] = Field(\n",
" default=None, description=\"Height measured in meters\"\n",
" )\n",
"\n",
"\n",
"class Data(BaseModel):\n",
" \"\"\"Extracted data about people.\"\"\"\n",
"\n",
" # Creates a model so that we can extract multiple entities.\n",
" people: List[Person]"
]
},
{
"cell_type": "markdown",
"id": "5f5cda33-fd7b-481e-956a-703f45e40e1d",
"metadata": {},
"source": [
":::{.callout-important}\n",
"Extraction might not be perfect here. Read on to see how to use **Reference Examples** to improve the quality of extraction, and see the **guidelines** section!\n",
":::"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "cf7062cc-1d1d-4a37-9122-509d1b87f0a6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Data(people=[Person(name='Jeff', hair_color=None, height_in_meters=None), Person(name='Anna', hair_color=None, height_in_meters=None)])"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"runnable = prompt | llm.with_structured_output(schema=Data)\n",
"text = \"My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me.\"\n",
"runnable.invoke({\"text\": text})"
]
},
{
"cell_type": "markdown",
"id": "fba1d770-bf4d-4de4-9e4f-7384872ef0dc",
"metadata": {},
"source": [
":::{.callout-tip}\n",
"When the schema accommodates the extraction of **multiple entities**, it also allows the model to extract **no entities** when no relevant information\n",
"is in the text, by returning an empty list.\n",
"\n",
"This is usually a **good** thing! It allows specifying **required** attributes on an entity without necessarily forcing the model to detect this entity.\n",
":::"
]
},
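{
"cell_type": "markdown",
"id": "empty-extraction-md",
"metadata": {},
"source": [
"For instance (a schema-level illustration, no LLM call involved), an empty extraction result is perfectly valid under the `Data` schema defined above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "empty-extraction-code",
"metadata": {},
"outputs": [],
"source": [
"# `people` is a required field, but an empty list is a valid value for it,\n",
"# so the model can signal \"no entities found\" without failing validation.\n",
"empty = Data(people=[])\n",
"print(empty)  # people=[]"
]
},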
{
"cell_type": "markdown",
"id": "f07a7455-7de6-4a6f-9772-0477ef65e3dc",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"Now that you understand the basics of extraction with LangChain, you're ready to proceed to the rest of the how-to guides:\n",
"\n",
"- [Add Examples](/docs/use_cases/extraction/how_to/examples): Learn how to use **reference examples** to improve performance.\n",
"- [Handle Long Text](/docs/use_cases/extraction/how_to/handle_long_text): What should you do if the text does not fit into the context window of the LLM?\n",
"- [Handle Files](/docs/use_cases/extraction/how_to/handle_files): Examples of using LangChain document loaders and parsers to extract from files like PDFs.\n",
"- [Use a Parsing Approach](/docs/use_cases/extraction/how_to/parse): Use a prompt-based approach to extract with models that do not support **tool/function calling**.\n",
"- [Guidelines](/docs/use_cases/extraction/guidelines): Guidelines for getting good performance on extraction tasks."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}