mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-01 19:12:42 +00:00
Harrison/base combine doc chain (#264)
This commit is contained in:
@@ -52,15 +52,32 @@ With these primitives in mind, the following chains exist:
|
||||
**Vector Database Question-Answering**
|
||||
|
||||
- **Links Used**: Vectorstore, LLMChain
|
||||
- **Notes**: This chain takes user input (a question), uses the Vectorstore and semantic search to find relevant documents, and then passes the documents plus to the original question to another LLM to generate a final answer.
|
||||
- **Notes**: This chain takes user input (a question), uses the Vectorstore and semantic search to find relevant documents, and then passes the documents plus the original question to another LLM to generate a final answer.
|
||||
- `Example Notebook <chains/vector_db_qa.ipynb>`_
|
||||
|
||||
**Vector Database Question-Answering With Sources**
|
||||
|
||||
- **Links Used**: Vectorstore, LLMChain
|
||||
- **Notes**: This chain takes user input (a question), uses the Vectorstore and semantic search to find relevant documents, and then passes the documents plus the original question to another LLM to generate a final answer with sources.
|
||||
- `Example Notebook <chains/vector_db_qa_with_sources.ipynb>`_
|
||||
|
||||
**Question-Answering With Sources**
|
||||
|
||||
- **Links Used**: LLMChain
|
||||
- **Notes**: This chain takes a question and multiple documents as input. It then runs a first LLMChain over all documents attempting to answer the provided question. It then runs a second LLMChain over the results of the first pass, combining the answers from documents into a single response that is returned.
|
||||
- `Example Notebook <chains/combine_documents.ipynb>`_
|
||||
- **Notes**: These types of chains take a question and multiple documents as input, and return an answer plus sources for where that answer came from. There are multiple underlying types of chains to do this, for more information see TODO.
|
||||
- `Example Notebook <chains/qa_with_sources.ipynb>`_
|
||||
|
||||
**Question-Answering**
|
||||
|
||||
- **Links Used**: LLMChain
|
||||
- **Notes**: These types of chains take a question and multiple documents as input, and return an answer. There are multiple underlying types of chains to do this, for more information see TODO.
|
||||
- `Example Notebook <chains/question_answering.ipynb>`_
|
||||
|
||||
**Summarization**
|
||||
|
||||
- **Links Used**: LLMChain
|
||||
- **Notes**: These types of chains take multiple documents as input, and return a summary of all documents. There are multiple underlying types of chains to do this, for more information see TODO.
|
||||
- `Example Notebook <chains/summarize.ipynb>`_
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
@@ -1,93 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d9a0131f",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Map Reduce\n",
|
||||
"\n",
|
||||
"This notebok showcases an example of map-reduce chains: recursive summarization."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "e9db25f3",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain import OpenAI, PromptTemplate, LLMChain\n",
|
||||
"from langchain.text_splitter import CharacterTextSplitter\n",
|
||||
"from langchain.chains.mapreduce import MapReduceChain\n",
|
||||
"\n",
|
||||
"llm = OpenAI(temperature=0)\n",
|
||||
"\n",
|
||||
"_prompt = \"\"\"Write a concise summary of the following:\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"{text}\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"CONCISE SUMMARY:\"\"\"\n",
|
||||
"prompt = PromptTemplate(template=_prompt, input_variables=[\"text\"])\n",
|
||||
"\n",
|
||||
"text_splitter = CharacterTextSplitter()\n",
|
||||
"\n",
|
||||
"mp_chain = MapReduceChain.from_params(llm, prompt, text_splitter)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "99bbe19b",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"\"\\n\\nThe President discusses the recent aggression by Russia, and the response by the United States and its allies. He announces new sanctions against Russia, and says that the free world is united in holding Putin accountable. The President also discusses the American Rescue Plan, the Bipartisan Infrastructure Law, and the Bipartisan Innovation Act. Finally, the President addresses the need for women's rights and equality for LGBTQ+ Americans.\""
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"with open('../state_of_the_union.txt') as f:\n",
|
||||
" state_of_the_union = f.read()\n",
|
||||
"mp_chain.run(state_of_the_union)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "baa6e808",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.8"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
250
docs/examples/chains/qa_with_sources.ipynb
Normal file
250
docs/examples/chains/qa_with_sources.ipynb
Normal file
@@ -0,0 +1,250 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "74148cee",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Question Answering with Sources\n",
|
||||
"\n",
|
||||
"This notebook walks through how to use LangChain for question answering with sources over a list of documents. It covers three different chain types: `stuff`, `map_reduce`, and `refine`. For a more in depth explanation of what these chain types are, see [here](../../explanation/combine_docs.md)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ca2f0efc",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Prepare Data\n",
|
||||
"First we prepare the data. For this example we do similarity search over a vector database, but these documents could be fetched in any manner (the point of this notebook to highlight what to do AFTER you fetch the documents)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "78f28130",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
||||
"from langchain.embeddings.cohere import CohereEmbeddings\n",
|
||||
"from langchain.text_splitter import CharacterTextSplitter\n",
|
||||
"from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch\n",
|
||||
"from langchain.vectorstores.faiss import FAISS\n",
|
||||
"from langchain.docstore.document import Document"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "4da195a3",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"with open('../state_of_the_union.txt') as f:\n",
|
||||
" state_of_the_union = f.read()\n",
|
||||
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
|
||||
"texts = text_splitter.split_text(state_of_the_union)\n",
|
||||
"\n",
|
||||
"embeddings = OpenAIEmbeddings()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"id": "5ec2b55b",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"docsearch = FAISS.from_texts(texts, embeddings, metadatas=[{\"source\": i} for i in range(len(texts))])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"id": "5286f58f",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query = \"What did the president say about Justice Breyer\"\n",
|
||||
"docs = docsearch.similarity_search(query)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "005a47e9",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.chains.qa_with_sources import load_qa_with_sources_chain\n",
|
||||
"from langchain.llms import OpenAI"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d82f899a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### The `stuff` Chain\n",
|
||||
"\n",
|
||||
"This sections shows results of using the `stuff` Chain to do question answering with sources."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "fc1a5ed6",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type=\"stuff\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "e239964b",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"docs = [Document(page_content=t, metadata={\"source\": i}) for i, t in enumerate(texts[:3])]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "7d766417",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'output_text': ' The president did not mention Justice Breyer.\\nSOURCES: 0-pl, 1-pl, 2-pl'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query = \"What did the president say about Justice Breyer\"\n",
|
||||
"chain({\"input_documents\": docs, \"question\": query}, return_only_outputs=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "c5dbb304",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### The `map_reduce` Chain\n",
|
||||
"\n",
|
||||
"This sections shows results of using the `map_reduce` Chain to do question answering with sources."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "921db0a4",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type=\"map_reduce\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "e417926a",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'output_text': ' The president did not mention Justice Breyer.\\nSOURCES: 0, 1, 2'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query = \"What did the president say about Justice Breyer\"\n",
|
||||
"chain({\"input_documents\": docs, \"question\": query}, return_only_outputs=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5bf0e1ab",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### The `refine` Chain\n",
|
||||
"\n",
|
||||
"This sections shows results of using the `refine` Chain to do question answering with sources."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "904835c8",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type=\"refine\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"id": "f60875c6",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'output_text': \"\\n\\nThe president said that Justice Breyer has dedicated his life to serve the country and has left a legacy of excellence. He also thanked Justice Breyer for his service and for his commitment to advancing liberty and justice, including protecting the rights of women and the constitutional right affirmed in Roe v. Wade, preserving access to health care and a woman's right to choose, and advancing the bipartisan Equality Act to protect LGBTQ+ Americans. The president also noted that the State of the Union is strong because of the courage and determination of the American people, and that the nation will meet and overcome the challenges of our time as one people, just as the Ukrainian people have done in the face of adversity. Source: 0, 29, 35\"}"
|
||||
]
|
||||
},
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query = \"What did the president say about Justice Breyer\"\n",
|
||||
"chain({\"input_documents\": docs, \"query_str\": query}, return_only_outputs=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "929620d0",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.8"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
248
docs/examples/chains/question_answering.ipynb
Normal file
248
docs/examples/chains/question_answering.ipynb
Normal file
@@ -0,0 +1,248 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "05859721",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Question Answering\n",
|
||||
"\n",
|
||||
"This notebook walks through how to use LangChain for question answering over a list of documents. It covers three different types of chaings: `stuff`, `map_reduce`, and `refine`. For a more in depth explanation of what these chain types are, see [here](../../explanation/combine_docs.md)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "726f4996",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Prepare Data\n",
|
||||
"First we prepare the data. For this example we do similarity search over a vector database, but these documents could be fetched in any manner (the point of this notebook to highlight what to do AFTER you fetch the documents)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "17fcbc0f",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
||||
"from langchain.text_splitter import CharacterTextSplitter\n",
|
||||
"from langchain.vectorstores.faiss import FAISS\n",
|
||||
"from langchain.docstore.document import Document"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "291f0117",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"with open('../state_of_the_union.txt') as f:\n",
|
||||
" state_of_the_union = f.read()\n",
|
||||
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
|
||||
"texts = text_splitter.split_text(state_of_the_union)\n",
|
||||
"\n",
|
||||
"embeddings = OpenAIEmbeddings()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "fd9666a9",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"docsearch = FAISS.from_texts(texts, embeddings)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "d1eaf6e6",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query = \"What did the president say about Justice Breyer\"\n",
|
||||
"docs = docsearch.similarity_search(query)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "a16e3453",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.chains.question_answering import load_qa_chain\n",
|
||||
"from langchain.llms import OpenAI"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f78787a0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### The `stuff` Chain\n",
|
||||
"\n",
|
||||
"This sections shows results of using the `stuff` Chain to do question answering."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "180fd4c1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain = load_qa_chain(OpenAI(temperature=0), chain_type=\"stuff\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "d145ae31",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"docs = [Document(page_content=t) for t in texts[:3]]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "77fdf1aa",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'output_text': ' The president did not mention Justice Breyer.'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query = \"What did the president say about Justice Breyer\"\n",
|
||||
"chain({\"input_documents\": docs, \"question\": query}, return_only_outputs=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "91522e29",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### The `map_reduce` Chain\n",
|
||||
"\n",
|
||||
"This sections shows results of using the `map_reduce` Chain to do question answering."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "b0060f51",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain = load_qa_chain(OpenAI(temperature=0), chain_type=\"map_reduce\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "fbdb9137",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'output_text': ' The president did not mention Justice Breyer.'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query = \"What did the president say about Justice Breyer\"\n",
|
||||
"chain({\"input_documents\": docs, \"question\": query}, return_only_outputs=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "6ea50ad0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### The `refine` Chain\n",
|
||||
"\n",
|
||||
"This sections shows results of using the `refine` Chain to do question answering."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "fb167057",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain = load_qa_chain(OpenAI(temperature=0), chain_type=\"refine\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "d8b5286e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'output_text': '\\n\\nThe president did not mention Justice Breyer in the given page content.'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query = \"What did the president say about Justice Breyer\"\n",
|
||||
"chain({\"input_documents\": docs, \"query_str\": query}, return_only_outputs=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "49e9c6d7",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.8"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
234
docs/examples/chains/summarize.ipynb
Normal file
234
docs/examples/chains/summarize.ipynb
Normal file
@@ -0,0 +1,234 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d9a0131f",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Summarization\n",
|
||||
"\n",
|
||||
"This notebook walks through how to use LangChain for summarization over a list of documents. It covers three different chain types: `stuff`, `map_reduce`, and `refine`. For a more in depth explanation of what these chain types are, see [here](../../explanation/combine_docs.md)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0b5660bf",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Prepare Data\n",
|
||||
"First we prepare the data. For this example we create multiple documents from one long one, but these documents could be fetched in any manner (the point of this notebook to highlight what to do AFTER you fetch the documents)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "e9db25f3",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain import OpenAI, PromptTemplate, LLMChain\n",
|
||||
"from langchain.text_splitter import CharacterTextSplitter\n",
|
||||
"from langchain.chains.mapreduce import MapReduceChain\n",
|
||||
"\n",
|
||||
"llm = OpenAI(temperature=0)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"text_splitter = CharacterTextSplitter()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "99bbe19b",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"with open('../state_of_the_union.txt') as f:\n",
|
||||
" state_of_the_union = f.read()\n",
|
||||
"texts = text_splitter.split_text(state_of_the_union)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "baa6e808",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.docstore.document import Document"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "8dff4f43",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"docs = [Document(page_content=t) for t in texts[:3]]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "27989fc4",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.chains.summarize import load_summarize_chain"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ea2d5c99",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### The `stuff` Chain\n",
|
||||
"\n",
|
||||
"This sections shows results of using the `stuff` Chain to do summarization."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "f01f3196",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain = load_summarize_chain(llm, chain_type=\"stuff\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "da4d9801",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"' In his speech, President Biden addressed the ongoing conflict between Russia and Ukraine, and the need for the United States and its allies to stand with Ukraine. He also discussed the American Rescue Plan, the Bipartisan Infrastructure Law, and the Bipartisan Innovation Act, which will help to create jobs, modernize infrastructure, and level the playing field with China. He also emphasized the importance of buying American products to support American jobs.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"chain.run(docs)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "9c868e86",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### The `map_reduce` Chain\n",
|
||||
"\n",
|
||||
"This sections shows results of using the `map_reduce` Chain to do summarization."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "ef28e1d4",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain = load_summarize_chain(llm, chain_type=\"map_reduce\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "f82c5f9f",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"' In response to Russian aggression in Ukraine, the US and its allies have imposed economic sanctions, cut off access to technology, seized assets of Russian oligarchs, and closed American airspace to Russian flights. The US is also providing military, economic, and humanitarian assistance to Ukraine, mobilizing ground forces, air squadrons, and ship deployments, and releasing 30 million barrels of oil from its Strategic Petroleum Reserve. President Biden has also passed the American Rescue Plan, Bipartisan Infrastructure Law, and Bipartisan Innovation Act to provide economic relief and rebuild America.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"chain.run(docs)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f61350f9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### The `refine` Chain\n",
|
||||
"\n",
|
||||
"This sections shows results of using the `refine` Chain to do summarization."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "3bcbe31e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain = load_summarize_chain(llm, chain_type=\"refine\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "c8cad866",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"\"\\nIn this speech, the speaker addresses the American people and their allies, discussing the recent aggression of Russia's Vladimir Putin in Ukraine. The speaker outlines the actions taken by the United States and its allies to hold Putin accountable, including economic sanctions, cutting off access to technology, and seizing the assets of Russian oligarchs. The speaker also announces the closing of American airspace to Russian flights, further isolating Russia and adding an additional squeeze on their economy. The Russian stock market has lost 40% of its value and trading remains suspended. Together with our allies, the United States is providing military, economic, and humanitarian assistance to Ukraine, and has mobilized forces to protect NATO countries. The speaker also announces the release of 60 million barrels of oil from reserves around the world, with the United States releasing 30 million barrels from its own Strategic Petroleum Reserve. The speaker emphasizes that the United States and its allies will defend every inch of NATO territory and that Putin will pay a high price for his aggression. The speaker also acknowledges the hardships faced by the American people due to the pandemic and the American Rescue Plan, which has provided immediate economic relief for tens of millions of Americans, helped put food on their table, keep a roof over their heads, and cut the cost of health insurance. The speaker\""
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"chain.run(docs)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "0da92750",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.8"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
@@ -41,27 +41,27 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"execution_count": 5,
|
||||
"id": "3018f865",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"qa = VectorDBQA(llm=OpenAI(), vectorstore=docsearch)"
|
||||
"qa = VectorDBQA.from_llm(llm=OpenAI(), vectorstore=docsearch)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"execution_count": 4,
|
||||
"id": "032a47f8",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"' The President said that Ketanji Brown Jackson is a consensus builder and has received a broad range of support since she was nominated.'"
|
||||
"\" The president said that Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator and federal public defender, and from a family of public school educators and police officers. He also said that she has received a broad range of support since she was nominated, from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.\""
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -74,7 +74,7 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f0f20b92",
|
||||
"id": "f056f6fd",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
@@ -96,7 +96,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.6"
|
||||
"version": "3.10.8"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
@@ -61,72 +61,6 @@
|
||||
" d.metadata = {'source': f\"{i}-pl\"}"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "aa1c1b60",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### QAWithSourcesChain\n",
|
||||
"This shows how to use the `QAWithSourcesChain`, which takes in document objects and uses them directly."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "61bce191",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query = \"What did the president say about Justice Breyer\"\n",
|
||||
"docs = docsearch.similarity_search(query)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "57ddf8c7",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.chains import QAWithSourcesChain\n",
|
||||
"from langchain.llms import OpenAI, Cohere\n",
|
||||
"from langchain.docstore.document import Document"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "f908a92a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain = QAWithSourcesChain.from_llm(OpenAI(temperature=0))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "a505ac89",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'answer': ' The president thanked Justice Breyer for his service.',\n",
|
||||
" 'sources': '27-pl'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"chain({\"docs\": docs, \"question\": query}, return_only_outputs=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e6fc81de",
|
||||
@@ -159,10 +93,22 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 11,
|
||||
"id": "8ba36fa7",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'answer': ' The president thanked Justice Breyer for his service.',\n",
|
||||
" 'sources': '27-pl'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"chain({\"question\": \"What did the president say about Justice Breyer\"}, return_only_outputs=True)"
|
||||
]
|
||||
@@ -192,7 +138,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.7"
|
||||
"version": "3.9.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
@@ -1,180 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b118c9dc",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# HuggingFace Tokenizers\n",
|
||||
"\n",
|
||||
"This notebook show cases how to use HuggingFace tokenizers to split text."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "e82c4685",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.text_splitter import CharacterTextSplitter"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "a8ce51d5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from transformers import GPT2TokenizerFast\n",
|
||||
"\n",
|
||||
"tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "ca5e72c0",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"with open('../state_of_the_union.txt') as f:\n",
|
||||
" state_of_the_union = f.read()\n",
|
||||
"text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=1000, chunk_overlap=0)\n",
|
||||
"texts = text_splitter.split_text(state_of_the_union)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "37cdfbeb",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n",
|
||||
"\n",
|
||||
"Last year COVID-19 kept us apart. This year we are finally together again. \n",
|
||||
"\n",
|
||||
"Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n",
|
||||
"\n",
|
||||
"With a duty to one another to the American people to the Constitution. \n",
|
||||
"\n",
|
||||
"And with an unwavering resolve that freedom will always triumph over tyranny. \n",
|
||||
"\n",
|
||||
"Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n",
|
||||
"\n",
|
||||
"He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n",
|
||||
"\n",
|
||||
"He met the Ukrainian people. \n",
|
||||
"\n",
|
||||
"From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \n",
|
||||
"\n",
|
||||
"Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. \n",
|
||||
"\n",
|
||||
"In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. \n",
|
||||
"\n",
|
||||
"Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. \n",
|
||||
"\n",
|
||||
"Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. \n",
|
||||
"\n",
|
||||
"Throughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos. \n",
|
||||
"\n",
|
||||
"They keep moving. \n",
|
||||
"\n",
|
||||
"And the costs and the threats to America and the world keep rising. \n",
|
||||
"\n",
|
||||
"That’s why the NATO Alliance was created to secure peace and stability in Europe after World War 2. \n",
|
||||
"\n",
|
||||
"The United States is a member along with 29 other nations. \n",
|
||||
"\n",
|
||||
"It matters. American diplomacy matters. American resolve matters. \n",
|
||||
"\n",
|
||||
"Putin’s latest attack on Ukraine was premeditated and unprovoked. \n",
|
||||
"\n",
|
||||
"He rejected repeated efforts at diplomacy. \n",
|
||||
"\n",
|
||||
"He thought the West and NATO wouldn’t respond. And he thought he could divide us at home. Putin was wrong. We were ready. Here is what we did. \n",
|
||||
"\n",
|
||||
"We prepared extensively and carefully. \n",
|
||||
"\n",
|
||||
"We spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin. \n",
|
||||
"\n",
|
||||
"I spent countless hours unifying our European allies. We shared with the world in advance what we knew Putin was planning and precisely how he would try to falsely justify his aggression. \n",
|
||||
"\n",
|
||||
"We countered Russia’s lies with truth. \n",
|
||||
"\n",
|
||||
"And now that he has acted the free world is holding him accountable. \n",
|
||||
"\n",
|
||||
"Along with twenty-seven members of the European Union including France, Germany, Italy, as well as countries like the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland. \n",
|
||||
"\n",
|
||||
"We are inflicting pain on Russia and supporting the people of Ukraine. Putin is now isolated from the world more than ever. \n",
|
||||
"\n",
|
||||
"Together with our allies –we are right now enforcing powerful economic sanctions. \n",
|
||||
"\n",
|
||||
"We are cutting off Russia’s largest banks from the international financial system. \n",
|
||||
"\n",
|
||||
"Preventing Russia’s central bank from defending the Russian Ruble making Putin’s $630 Billion “war fund” worthless. \n",
|
||||
"\n",
|
||||
"We are choking off Russia’s access to technology that will sap its economic strength and weaken its military for years to come. \n",
|
||||
"\n",
|
||||
"Tonight I say to the Russian oligarchs and corrupt leaders who have bilked billions of dollars off this violent regime no more. \n",
|
||||
"\n",
|
||||
"The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs. \n",
|
||||
"\n",
|
||||
"We are joining with our European allies to find and seize your yachts your luxury apartments your private jets. We are coming for your ill-begotten gains. \n",
|
||||
"\n",
|
||||
"And tonight I am announcing that we will join our allies in closing off American air space to all Russian flights – further isolating Russia – and adding an additional squeeze –on their economy. The Ruble has lost 30% of its value. \n",
|
||||
"\n",
|
||||
"The Russian stock market has lost 40% of its value and trading remains suspended. Russia’s economy is reeling and Putin alone is to blame. \n",
|
||||
"\n",
|
||||
"Together with our allies we are providing support to the Ukrainians in their fight for freedom. Military assistance. Economic assistance. Humanitarian assistance. \n",
|
||||
"\n",
|
||||
"We are giving more than $1 Billion in direct assistance to Ukraine. \n",
|
||||
"\n",
|
||||
"And we will continue to aid the Ukrainian people as they defend their country and to help ease their suffering. \n",
|
||||
"\n",
|
||||
"Let me be clear, our forces are not engaged and will not engage in conflict with Russian forces in Ukraine. \n",
|
||||
"\n",
|
||||
"Our forces are not going to Europe to fight in Ukraine, but to defend our NATO Allies – in the event that Putin decides to keep moving west. \n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(texts[0])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "d214aec2",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
258
docs/examples/integrations/textsplitter.ipynb
Normal file
258
docs/examples/integrations/textsplitter.ipynb
Normal file
@@ -0,0 +1,258 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b118c9dc",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Text Splitter\n",
|
||||
"\n",
|
||||
"When you want to deal wit long pieces of text, it is necessary to split up that text into chunks.\n",
|
||||
"This notebook showcases several ways to do that.\n",
|
||||
"\n",
|
||||
"At a high level, text splitters work as following:\n",
|
||||
"\n",
|
||||
"1. Split the text up into small, semantically meaningful chunks (often sentences).\n",
|
||||
"2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).\n",
|
||||
"3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "e82c4685",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.text_splitter import CharacterTextSplitter, NLTKTextSplitter, SpacyTextSplitter\n",
|
||||
"# This is a long document we can split up.\n",
|
||||
"with open('../state_of_the_union.txt') as f:\n",
|
||||
" state_of_the_union = f.read()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5c461b26",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Character Text Splitting\n",
|
||||
"\n",
|
||||
"Let's start with the most simple method: let's split based on characters (by default \"\\n\\n\") and measure chunk length by number of characters."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "79ff6737",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"text_splitter = CharacterTextSplitter( \n",
|
||||
" separator = \"\\n\\n\",\n",
|
||||
" chunk_size = 1000,\n",
|
||||
" chunk_overlap = 200,\n",
|
||||
" length_function = len,\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "38547666",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \\n\\nGroups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. '"
|
||||
]
|
||||
},
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"texts = text_splitter.split_text(state_of_the_union)\n",
|
||||
"texts[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "13dc0983",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## HuggingFace Length Function\n",
|
||||
"Most LLMs are constrained by the number of tokens that you can pass in, which is not the same as the number of characters. In order to get a more accurate estimate, we can use HuggingFace tokenizers to count the text length."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "a8ce51d5",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from transformers import GPT2TokenizerFast\n",
|
||||
"\n",
|
||||
"tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "ca5e72c0",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=100, chunk_overlap=0)\n",
|
||||
"texts = text_splitter.split_text(state_of_the_union)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "37cdfbeb",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n",
|
||||
"\n",
|
||||
"Last year COVID-19 kept us apart. This year we are finally together again. \n",
|
||||
"\n",
|
||||
"Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n",
|
||||
"\n",
|
||||
"With a duty to one another to the American people to the Constitution. \n",
|
||||
"\n",
|
||||
"And with an unwavering resolve that freedom will always triumph over tyranny. \n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(texts[0])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ea2973ac",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## NLTK Text Splitter\n",
|
||||
"Rather than just splitting on \"\\n\\n\", we can use NLTK to split based on tokenizers."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"id": "20fa9c23",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"text_splitter = NLTKTextSplitter(chunk_size=1000)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"id": "5ea10835",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.\\n\\nMembers of Congress and the Cabinet.\\n\\nJustices of the Supreme Court.\\n\\nMy fellow Americans.\\n\\nLast year COVID-19 kept us apart.\\n\\nThis year we are finally together again.\\n\\nTonight, we meet as Democrats Republicans and Independents.\\n\\nBut most importantly as Americans.\\n\\nWith a duty to one another to the American people to the Constitution.\\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny.\\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.\\n\\nBut he badly miscalculated.\\n\\nHe thought he could roll into Ukraine and the world would roll over.\\n\\nInstead he met a wall of strength he never imagined.\\n\\nHe met the Ukrainian people.\\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.\\n\\nGroups of citizens blocking tanks with their bodies.\\n\\nEveryone from students to retirees teachers turned soldiers defending their homeland.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"texts = text_splitter.split_text(state_of_the_union)\n",
|
||||
"texts[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "dab86b60",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Spacy Text Splitter\n",
|
||||
"Another alternative to NLTK is to use Spacy."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"id": "f9cc9dfc",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"text_splitter = SpacyTextSplitter(chunk_size=1000)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"id": "cef2b29e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.\\n\\nMembers of Congress and the Cabinet.\\n\\nJustices of the Supreme Court.\\n\\nMy fellow Americans. \\n\\n\\n\\nLast year COVID-19 kept us apart.\\n\\nThis year we are finally together again.\\n\\n\\n\\n\\n\\nTonight, we meet as Democrats Republicans and Independents.\\n\\nBut most importantly as Americans.\\n\\n\\n\\n\\n\\nWith a duty to one another to the American people to the Constitution. \\n\\n\\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny.\\n\\n\\n\\n\\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.\\n\\nBut he badly miscalculated.\\n\\n\\n\\n\\n\\nHe thought he could roll into Ukraine and the world would roll over.\\n\\nInstead he met a wall of strength he never imagined.\\n\\n\\n\\n\\n\\nHe met the Ukrainian people.\\n\\n\\n\\n\\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.\\n\\n\\n\\n\\n\\nGroups of citizens blocking tanks with their bodies.\\n\\nEveryone from students to retirees teachers turned soldiers defending their homeland.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 19,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"texts = text_splitter.split_text(state_of_the_union)\n",
|
||||
"texts[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "a1a118b1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.8"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
128
docs/explanation/combine_docs.md
Normal file
128
docs/explanation/combine_docs.md
Normal file
@@ -0,0 +1,128 @@
|
||||
# Data Augmented Generation
|
||||
|
||||
## Overview
|
||||
|
||||
Language models are trained on large amounts of unstructured data, which makes them really good at general purpose text generation. However, there are many instances where you may want the language model to generate text based not on generic data but rather on specific data. Some common examples of this include:
|
||||
|
||||
- Summarization of a specific piece of text (a website, a private document, etc)
|
||||
- Question answering over a specific piece of text (a website, a private document, etc)
|
||||
- Question answering over multiple pieces of text (multiple websites, multiple private documents, etc)
|
||||
- Using the results of some external call to an API (results from a SQL query, etc)
|
||||
|
||||
All of these examples are instances when you do not want the LLM to generate text based solely on the data it was trained over, but rather you want it to incorporate other external data in some way. At a high level, this process can be broken down into two steps:
|
||||
|
||||
1. Fetching: Fetching the relevant data to include.
|
||||
2. Augmenting: Passing the data in as context to the LLM.
|
||||
|
||||
This guide is intended to provide an overview of how to do this. This includes an overview of the literature, as well as common tools, abstractions and chains for doing this.
|
||||
|
||||
## Related Literature
|
||||
There are a lot of related papers in this area. Most of them are focused on end-to-end methods that optimize the fetching of the relevant data as well as passing it in as context. These are a few of the papers that are particularly relevant:
|
||||
|
||||
**[RAG](https://arxiv.org/abs/2005.11401):** Retrieval Augmented Generation.
|
||||
This paper introduces RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever.
|
||||
|
||||
**[REALM](https://arxiv.org/abs/2002.08909):** Retrieval-Augmented Language Model Pre-Training.
|
||||
To capture knowledge in a more modular and interpretable way, this paper augments language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia, used during pre-training, fine-tuning and inference.
|
||||
|
||||
**[HayStack](https://haystack.deepset.ai/):** This is not a paper, but rather an open source library aimed at semantic search, question answering, summarization, and document ranking for a wide range of NLP applications. The underpinnings of this library are focused on the same `fetching` and `augmenting` concepts discussed here, and incorporate some of the methods in the above papers.
|
||||
|
||||
These papers/open-source projects are centered around retrieval of documents, which is important for question-answering tasks over a large corpus of documents (which is how they are evaluated). However, we use the terminology of `Data Augmented Generation` to highlight that retrieval from some document store is only one possible way of fetching relevant data to include. Other methods to fetch relevant data could involve hitting an API, querying a database, or just working with user provided data (eg a specific document that they want to summarize).
|
||||
|
||||
Let's now deep dive on the two steps involved: fetching and augmenting.
|
||||
|
||||
## Fetching
|
||||
There are many ways to fetch relevant data to pass in as context to a LM, and these methods largely depend
|
||||
on the use case.
|
||||
|
||||
**User provided:** In some cases, the user may provide the relevant data, and no algorithm for fetching is needed.
|
||||
An example of this is for summarization of specific documents: the user will provide the document to be summarized,
|
||||
and task the language model with summarizing it.
|
||||
|
||||
**Document Retrieval:** One of the more common use cases involves fetching relevant documents or pieces of text from
|
||||
a large corpus of data. A common example of this is question answering over a private collection of documents.
|
||||
|
||||
**API Querying:** Another common way to fetch data is from an API query. One example of this is WebGPT like system,
|
||||
where you first query Google (or another search API) for relevant information, and then those results are used in
|
||||
the generation step. Another example could be querying a structured database (like SQL) and then using a language model
|
||||
to synthesize those results.
|
||||
|
||||
There are two big issues to deal with in fetching:
|
||||
|
||||
1. Fetching small enough pieces of information
|
||||
2. Not fetching too many pieces of information (eg fetching only the most relevant pieces)
|
||||
|
||||
### Text Splitting
|
||||
One big issue with all of these methods is how to make sure you are working with pieces of text that are not too large.
|
||||
This is important because most language models have a context length, and so you cannot (yet) just pass a
|
||||
large document in as context. Therefor, it is important to not only fetch relevant data but also make sure it is
|
||||
small enough chunks.
|
||||
|
||||
LangChain provides some utilities to help with splitting up larger pieces of data. This comes in the form of the TextSplitter class.
|
||||
The class takes in a document and splits it up into chunks, with several parameters that control the
|
||||
size of the chunks as well as the overlap in the chunks (important for maintaining context).
|
||||
See [this walkthrough](../examples/integrations/textsplitter.ipynb) for more information.
|
||||
|
||||
### Relevant Documents
|
||||
A second large issue related fetching data is to make sure you are not fetching too many documents, and are only fetching
|
||||
the documents that are relevant to the query/question at hand. There are a few ways to deal with this.
|
||||
|
||||
One concrete example of this is vector stores for document retrieval, often used for semantic search or question answering.
|
||||
With this method, larger documents are split up into
|
||||
smaller chunks and then each chunk of text is passed to an embedding function which creates an embedding for that piece of text.
|
||||
Those are embeddings are then stored in a database. When a new search query or question comes in, an embedding is
|
||||
created for that query/question and then documents with embeddings most similar to that embedding are fetched.
|
||||
Examples of vector database companies include [Pinecone](https://www.pinecone.io/) and [Weaviate](https://weaviate.io/).
|
||||
|
||||
Although this is perhaps the most common way of document retrieval, people are starting to think about alternative
|
||||
data structures and indexing techniques specifically for working with language models. For a leading example of this,
|
||||
check out [GPT Index](https://github.com/jerryjliu/gpt_index) - a collection of data structures created by and optimized
|
||||
for language models.
|
||||
|
||||
## Augmenting
|
||||
So you've fetched your relevant data - now what? How do you pass them to the language model in a format it can understand?
|
||||
There are a few different methods, or chains, for doing so. LangChain supports three of the more common ones - and
|
||||
we are actively looking to include more, so if you have any ideas please reach out! Note that there is not
|
||||
one best method - the decision of which one to use is often very context specific. In order from simplest to
|
||||
most complex:
|
||||
|
||||
### Stuffing
|
||||
Stuffing is the simplest method, whereby you simply stuff all the related data into the prompt as context
|
||||
to pass to the language model. This is implemented in LangChain as the `StuffDocumentsChain`.
|
||||
|
||||
**Pros:** Only makes a single call to the LLM. When generating text, the LLM has access to all the data at once.
|
||||
|
||||
**Cons:** Most LLMs have a context length, and for large documents (or many documents) this will not work as it will result in a prompt larger than the context length.
|
||||
|
||||
The main downside of this method is that it only works one smaller pieces of data. Once you are working
|
||||
with many pieces of data, this approach is no longer feasible. The next two approaches are designed to help deal with that.
|
||||
|
||||
### Map Reduce
|
||||
This method involves an initial prompt on each chunk of data (for summarization tasks, this
|
||||
could be a summary of that chunk; for question-answering tasks, it could be an answer based solely on that chunk).
|
||||
Then a different prompt is run to combine all the initial outputs. This is implemented in the LangChain as the `MapReduceDocumentsChain`.
|
||||
|
||||
**Pros:** Can scale to larger documents (and more documents) than `StuffDocumentsChain`. The calls to the LLM on individual documents are independent and can therefore be parallelized.
|
||||
|
||||
**Cons:** Requires many more calls to the LLM than `StuffDocumentsChain`. Loses some information during the final combining call.
|
||||
|
||||
### Refine
|
||||
This method involves an initial prompt on the first chunk of data, generating some output.
|
||||
For the remaining documents, that output is passed in, along with the next document,
|
||||
asking the LLM to refine the output based on the new document.
|
||||
|
||||
**Pros:** Can pull in more relevant context, and may be less lossy than `RefineDocumentsChain`.
|
||||
|
||||
**Cons:** Requires many more calls to the LLM than `StuffDocumentsChain`. The calls are also NOT independent, meaning they cannot be paralleled like `RefineDocumentsChain`. There is also some potential dependencies on the ordering of the documents.
|
||||
|
||||
## Use Cases
|
||||
LangChain supports the above three methods of augmenting LLMs with external data.
|
||||
These methods can be used to underpin several common use cases and they are discussed below.
|
||||
For all three of these use cases, all three methods are supported.
|
||||
It is important to note that a large part of these implementations is the prompts
|
||||
that are used. We provide default prompts for all three use cases, but these can be configured.
|
||||
This is in case you discover a prompt that works better for your specific application.
|
||||
|
||||
- [Question-Answering With Sources](../examples/chains/qa_with_sources.ipynb)
|
||||
- [Question-Answering](../examples/chains/question_answering.ipynb)
|
||||
- [Summarization](../examples/chains/summarize.ipynb)
|
@@ -159,6 +159,7 @@ see detailed information about the various classes, methods, and APIs.
|
||||
:name: resources
|
||||
|
||||
explanation/core_concepts.md
|
||||
explanation/combine_docs.md
|
||||
explanation/agents.md
|
||||
explanation/glossary.md
|
||||
explanation/cool_demos.md
|
||||
|
Reference in New Issue
Block a user