mirror of
https://github.com/hwchase17/langchain.git
synced 2025-06-15 11:28:58 +00:00
333 lines
27 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "c47f5b2f-e14c-43e7-a0ab-d71562636624",
|
|
"metadata": {},
|
|
"source": [
|
|
"---\n",
|
|
"sidebar_position: 3\n",
|
|
"keywords: [summarize, summarization, refine]\n",
|
|
"---"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "682a4f53-27db-43ef-a909-dd9ded76051b",
|
|
"metadata": {},
|
|
"source": [
|
|
"# How to summarize text through iterative refinement\n",
|
|
"\n",
|
|
"LLMs can summarize and otherwise distill desired information from text, including large volumes of text. In many cases, especially when the amount of text is large compared to the size of the model's context window, it can be helpful (or necessary) to break up the summarization task into smaller components.\n",
|
|
"\n",
|
|
"Iterative refinement represents one strategy for summarizing long texts. The strategy is as follows:\n",
|
|
"\n",
|
|
"- Split a text into smaller documents;\n",
|
|
"- Summarize the first document;\n",
|
|
"- Refine or update the result based on the next document;\n",
|
|
"- Repeat through the sequence of documents until finished.\n",
|
|
"\n",
|
|
"Note that this strategy is not parallelized: each step needs the summary produced by the previous one. It is especially effective when understanding a sub-document depends on prior context, for instance when summarizing a novel or another body of text with an inherent sequence.\n",
|
|
"\n",
|
|
"[LangGraph](https://langchain-ai.github.io/langgraph/), built on top of `langchain-core`, is well-suited to this problem:\n",
|
|
"\n",
|
|
"- LangGraph lets individual steps (such as successive summarizations) be streamed, giving finer control over execution;\n",
|
|
"- LangGraph's [checkpointing](https://langchain-ai.github.io/langgraph/how-tos/persistence/) supports error recovery, human-in-the-loop workflows, and easier incorporation into conversational applications;\n",
|
|
"- Because the application is assembled from modular components, it is also simple to extend or modify (e.g., to incorporate [tool calling](/docs/concepts/tool_calling) or other behavior).\n",
|
|
"\n",
|
|
"Below, we demonstrate how to summarize text via iterative refinement."
|
|
]
|
|
},
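{
"cell_type": "markdown",
"id": "refine-loop-sketch-text",
"metadata": {},
"source": [
"As a rough illustration of the strategy above, the control flow can be sketched in plain Python before any LangChain or LangGraph code is involved. The helpers `toy_summarize` and `toy_refine` are hypothetical stand-ins for LLM calls (they are not part of any library), so this shows only the sequential loop, not real summarization:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "refine-loop-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"from typing import List\n",
"\n",
"\n",
"# Hypothetical stand-ins for LLM calls; not LangChain or LangGraph APIs.\n",
"def toy_summarize(text: str) -> str:\n",
"    return text.strip()\n",
"\n",
"\n",
"def toy_refine(summary: str, text: str) -> str:\n",
"    return f\"{summary}; {text.strip()}\"\n",
"\n",
"\n",
"def refine_over(chunks: List[str]) -> str:\n",
"    # Summarize the first chunk, then fold each later chunk into the\n",
"    # running summary, strictly in order (no parallelism).\n",
"    if not chunks:\n",
"        return \"\"\n",
"    summary = toy_summarize(chunks[0])\n",
"    for chunk in chunks[1:]:\n",
"        summary = toy_refine(summary, chunk)\n",
"    return summary\n",
"\n",
"\n",
"print(refine_over([\"Apples are red\", \"Blueberries are blue\"]))"
]
},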
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "4aa52e84-d1b5-4b33-b4c4-541156686ef3",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Load chat model\n",
|
|
"\n",
|
|
"Let's first load a chat model:\n",
|
|
"\n",
|
|
"import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
|
|
"\n",
|
|
"<ChatModelTabs\n",
|
|
" customVarName=\"llm\"\n",
|
|
"/>\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "e5f426fc-cea6-4351-8931-1e422d3c8b69",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# | output: false\n",
|
|
"# | echo: false\n",
|
|
"\n",
|
|
"from langchain_openai import ChatOpenAI\n",
|
|
"\n",
|
|
"llm = ChatOpenAI(model=\"gpt-4o-mini\", temperature=0)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "b137fe82-0a53-4910-b53e-b87a297f329d",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Load documents"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "a81dc91d-ae72-4996-b809-d4a9050e815e",
|
|
"metadata": {},
|
|
"source": [
|
|
"Next, we need some documents to summarize. Below, we generate some toy documents for illustrative purposes. See the document loader [how-to guides](/docs/how_to/#document-loaders) and [integration pages](/docs/integrations/document_loaders/) for additional sources of data. The [summarization tutorial](/docs/tutorials/summarization) also includes an example summarizing a blog post."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "27c8fed0-b2d7-4549-a086-f5ee657efc41",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain_core.documents import Document\n",
|
|
"\n",
|
|
"documents = [\n",
|
|
" Document(page_content=\"Apples are red\", metadata={\"title\": \"apple_book\"}),\n",
|
|
" Document(page_content=\"Blueberries are blue\", metadata={\"title\": \"blueberry_book\"}),\n",
|
|
" Document(page_content=\"Bananas are yellow\", metadata={\"title\": \"banana_book\"}),\n",
|
|
"]"
|
|
]
|
|
},
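{
"cell_type": "markdown",
"id": "split-long-text-sketch-text",
"metadata": {},
"source": [
"The toy documents above are already split. For a single long text, the splitting step could look roughly like the following sketch. It assumes the `langchain-text-splitters` package is installed; `long_text` is a hypothetical placeholder string, and the `chunk_size` and `chunk_overlap` values are arbitrary choices for illustration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "split-long-text-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
"\n",
"# Hypothetical placeholder; substitute the text you want to summarize.\n",
"long_text = \"Apples are red. \" * 200\n",
"\n",
"# Arbitrary sizes for illustration; tune them to your model's context window.\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"\n",
"# Kept under a separate name so the toy `documents` list is not overwritten.\n",
"split_documents = text_splitter.create_documents([long_text])"
]
},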
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "84216044-6f1e-4b90-b4fa-29ec305abf51",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Create graph\n",
|
|
"\n",
|
|
"Below we show a LangGraph implementation of this process:\n",
|
|
"\n",
|
|
"- We generate a simple chain for the initial summary that plucks out the first document, formats it into a prompt and runs inference with our LLM.\n",
|
|
"- We generate a second chain, `refine_summary_chain`, that operates on each successive document and refines the summary accumulated so far.\n",
|
|
"\n",
|
|
"We will need to install `langgraph`:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "bf7acdb7-19ca-43ba-98f4-91f5b804da21",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%pip install -qU langgraph"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "669afa40-2708-4fa1-841e-c74a67bd9175",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from typing import List, Literal, TypedDict\n",
|
|
"\n",
|
|
"from langchain_core.output_parsers import StrOutputParser\n",
|
|
"from langchain_core.prompts import ChatPromptTemplate\n",
|
|
"from langchain_core.runnables import RunnableConfig\n",
|
|
"from langgraph.graph import END, START, StateGraph\n",
|
|
"\n",
|
|
"# Initial summary\n",
|
|
"summarize_prompt = ChatPromptTemplate(\n",
|
|
" [\n",
|
|
" (\"human\", \"Write a concise summary of the following: {context}\"),\n",
|
|
" ]\n",
|
|
")\n",
|
|
"initial_summary_chain = summarize_prompt | llm | StrOutputParser()\n",
|
|
"\n",
|
|
"# Refining the summary with new docs\n",
|
|
"refine_template = \"\"\"\n",
|
|
"Produce a final summary.\n",
|
|
"\n",
|
|
"Existing summary up to this point:\n",
|
|
"{existing_answer}\n",
|
|
"\n",
|
|
"New context:\n",
|
|
"------------\n",
|
|
"{context}\n",
|
|
"------------\n",
|
|
"\n",
|
|
"Given the new context, refine the original summary.\n",
|
|
"\"\"\"\n",
|
|
"refine_prompt = ChatPromptTemplate([(\"human\", refine_template)])\n",
|
|
"\n",
|
|
"refine_summary_chain = refine_prompt | llm | StrOutputParser()\n",
|
|
"\n",
|
|
"\n",
|
|
"# We will define the state of the graph to hold the document\n",
|
|
"# contents and summary. We also include an index to keep track\n",
|
|
"# of our position in the sequence of documents.\n",
|
|
"class State(TypedDict):\n",
|
|
" contents: List[str]\n",
|
|
" index: int\n",
|
|
" summary: str\n",
|
|
"\n",
|
|
"\n",
|
|
"# We define functions for each node, including a node that generates\n",
|
|
"# the initial summary:\n",
|
|
"async def generate_initial_summary(state: State, config: RunnableConfig):\n",
|
|
" summary = await initial_summary_chain.ainvoke(\n",
|
|
" {\"context\": state[\"contents\"][0]},\n",
|
|
" config,\n",
|
|
" )\n",
|
|
" return {\"summary\": summary, \"index\": 1}\n",
|
|
"\n",
|
|
"\n",
|
|
"# And a node that refines the summary based on the next document\n",
|
|
"async def refine_summary(state: State, config: RunnableConfig):\n",
|
|
" content = state[\"contents\"][state[\"index\"]]\n",
|
|
" summary = await refine_summary_chain.ainvoke(\n",
|
|
" {\"existing_answer\": state[\"summary\"], \"context\": content},\n",
|
|
" config,\n",
|
|
" )\n",
|
|
"\n",
|
|
" return {\"summary\": summary, \"index\": state[\"index\"] + 1}\n",
|
|
"\n",
|
|
"\n",
|
|
"# Here we implement logic to either exit the application or refine\n",
|
|
"# the summary.\n",
|
|
"def should_refine(state: State) -> Literal[\"refine_summary\", END]:\n",
|
|
" if state[\"index\"] >= len(state[\"contents\"]):\n",
|
|
" return END\n",
|
|
" else:\n",
|
|
" return \"refine_summary\"\n",
|
|
"\n",
|
|
"\n",
|
|
"graph = StateGraph(State)\n",
|
|
"graph.add_node(\"generate_initial_summary\", generate_initial_summary)\n",
|
|
"graph.add_node(\"refine_summary\", refine_summary)\n",
|
|
"\n",
|
|
"graph.add_edge(START, \"generate_initial_summary\")\n",
|
|
"graph.add_conditional_edges(\"generate_initial_summary\", should_refine)\n",
|
|
"graph.add_conditional_edges(\"refine_summary\", should_refine)\n",
|
|
"app = graph.compile()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "cdc11401-8640-4cf8-a713-4031df690cf7",
|
|
"metadata": {},
|
|
"source": [
|
|
"LangGraph allows the graph structure to be plotted to help visualize its function:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"id": "21711ff5-4e06-4843-9109-e7d89e679449",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"image/jpeg": "",
|
|
"text/plain": [
|
|
"<IPython.core.display.Image object>"
|
|
]
|
|
},
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"from IPython.display import Image\n",
|
|
"\n",
|
|
"Image(app.get_graph().draw_mermaid_png())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "74f3e276-f003-4112-ba14-c6952076c4f8",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Invoke graph\n",
|
|
"\n",
|
|
"We can step through the execution as follows, printing out the summary as it is refined:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"id": "0701bb7d-fbc6-497e-a577-25d56e6e43c6",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Apples are characterized by their red color.\n",
|
|
"Apples are characterized by their red color, while blueberries are known for their blue hue.\n",
|
|
"Apples are characterized by their red color, blueberries are known for their blue hue, and bananas are recognized for their yellow color.\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"async for step in app.astream(\n",
|
|
" {\"contents\": [doc.page_content for doc in documents]},\n",
|
|
" stream_mode=\"values\",\n",
|
|
"):\n",
|
|
" if summary := step.get(\"summary\"):\n",
|
|
" print(summary)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "49147724-de8b-44fd-bf13-5ef3432c7c6b",
|
|
"metadata": {},
|
|
"source": [
|
|
"The final `step` contains the summary as synthesized from the entire set of documents."
|
|
]
|
|
},
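{
"cell_type": "markdown",
"id": "final-summary-sketch-text",
"metadata": {},
"source": [
"If only the final summary is needed, one option is to run the graph to completion with `ainvoke` instead of streaming. This is a minimal sketch that assumes the `app` and `documents` objects defined above, and that the notebook supports top-level `await`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "final-summary-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"# Assumes `app` and `documents` from the cells above.\n",
"result = await app.ainvoke(\n",
"    {\"contents\": [doc.page_content for doc in documents]}\n",
")\n",
"\n",
"# The returned state carries the last refined summary.\n",
"print(result[\"summary\"])"
]
},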
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "f15c225a-db1d-48cf-b135-f588e7d615e6",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Next steps\n",
|
|
"\n",
|
|
"Check out the summarization [how-to guides](/docs/how_to/#summarization) for additional summarization strategies, including those designed for larger volumes of text.\n",
|
|
"\n",
|
|
"See [this tutorial](/docs/tutorials/summarization) for more detail on summarization.\n",
|
|
"\n",
|
|
"See also the [LangGraph documentation](https://langchain-ai.github.io/langgraph/) for detail on building with LangGraph."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.4"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|