{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5151afed",
   "metadata": {},
   "source": [
    "# Question Answering\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/question_answering/qa.ipynb)\n",
    "\n",
    "## Use case\n",
    "Suppose you have some text documents (PDF, blog, Notion pages, etc.) and want to ask questions related to the contents of those documents.\n",
    "\n",
    "LLMs, given their proficiency in understanding text, are a great tool for this.\n",
    "\n",
    "In this walkthrough we'll go over how to build a question-answering application over documents using LLMs.\n",
    "\n",
    "Two closely related use cases that we cover elsewhere are:\n",
    "- [QA over structured data](/docs/use_cases/qa_structured/sql) (e.g., SQL)\n",
    "- [QA over code](/docs/use_cases/code_understanding) (e.g., Python)\n",
    "\n",
    "## Overview\n",
    "The pipeline for converting raw unstructured data into a QA chain looks like this:\n",
    "1. `Loading`: First we need to load our data. Use the [LangChain integration hub](https://integrations.langchain.com/) to browse the full set of loaders.\n",
    "2. `Splitting`: [Text splitters](/docs/modules/data_connection/document_transformers/) break `Documents` into splits of a specified size\n",
    "3. `Storage`: A storage layer (often a [vectorstore](/docs/modules/data_connection/vectorstores/)) houses [and often embeds](https://www.pinecone.io/learn/vector-embeddings/) the splits\n",
    "4. `Retrieval`: The app retrieves splits from storage (often those [with embeddings similar](https://www.pinecone.io/learn/k-nearest-neighbor/) to the input question)\n",
    "5. `Generation`: An [LLM](/docs/modules/model_io/models/llms/) produces an answer using a prompt that includes the question and the retrieved data\n",
    "\n",
    "## Quickstart\n",
    "\n",
    "Suppose we want a QA app over this [blog post](https://lilianweng.github.io/posts/2023-06-23-agent/).\n",
    "\n",
    "We can create this in a few lines of code.\n",
    "\n",
    "First set environment variables and install packages:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e14b744b",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install langchain openai chromadb langchainhub\n",
    "\n",
    "# Set env var OPENAI_API_KEY or load from a .env file:\n",
    "# import dotenv\n",
    "# dotenv.load_dotenv()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "820244ae-74b4-4593-b392-822979dd91b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load documents\n",
    "\n",
    "from langchain.document_loaders import WebBaseLoader\n",
    "\n",
    "loader = WebBaseLoader(\"https://lilianweng.github.io/posts/2023-06-23-agent/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c89a0aa7-1e7e-4557-90e5-a7ea87db00e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split documents\n",
    "\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
    "splits = text_splitter.split_documents(loader.load())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "000e46f6-dafc-4a43-8417-463d0614fd30",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Embed and store splits\n",
    "\n",
    "from langchain.vectorstores import Chroma\n",
    "from langchain.embeddings import OpenAIEmbeddings\n",
    "\n",
    "vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())\n",
    "retriever = vectorstore.as_retriever()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "dacbde0b-7d45-4a2c-931d-81bb094aec94",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prompt\n",
    "# https://smith.langchain.com/hub/rlm/rag-prompt\n",
    "\n",
    "from langchain import hub\n",
    "\n",
    "rag_prompt = hub.pull(\"rlm/rag-prompt\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "79b9fdae-c2bf-4cf6-884f-c19aa07dd975",
   "metadata": {},
   "outputs": [],
   "source": [
    "# LLM\n",
    "\n",
    "from langchain.chat_models import ChatOpenAI\n",
    "\n",
    "llm = ChatOpenAI(model_name=\"gpt-3.5-turbo\", temperature=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "92c0f3ae-6ab2-4d04-9b22-1963b96b9db5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# RAG chain\n",
    "\n",
    "from langchain.schema.runnable import RunnablePassthrough\n",
    "\n",
    "rag_chain = (\n",
    "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
    "    | rag_prompt\n",
    "    | llm\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "0d3b0f36-7b56-49c0-8e40-a1aa9ebcbf24",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AIMessage(content='Task decomposition is the process of breaking down a task into smaller subgoals or steps. It can be done using simple prompting, task-specific instructions, or human inputs.')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rag_chain.invoke(\"What is Task Decomposition?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "639dc31a-7f16-40f6-ba2a-20e7c2ecfe60",
   "metadata": {},
   "source": [
    "[Here](https://smith.langchain.com/public/2270a675-74de-47ac-b111-b232d8340a64/r) is the LangSmith trace for this chain.\n",
    "\n",
    "Below we will explain each step in more detail."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba5daed6",
   "metadata": {},
   "source": [
    "## Step 1. Load\n",
    "\n",
    "Specify a `DocumentLoader` to load in your unstructured data as `Documents`.\n",
    "\n",
    "A `Document` is an object holding the text of the data (`page_content`) and associated `metadata`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "cf4d5c72",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import WebBaseLoader\n",
    "\n",
    "loader = WebBaseLoader(\"https://lilianweng.github.io/posts/2023-06-23-agent/\")\n",
    "data = loader.load()"
   ]
  },
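  {
   "cell_type": "markdown",
   "id": "doc-inspect-md",
   "metadata": {},
   "source": [
    "To make the `Document` structure concrete, here is a minimal inspection of what the loader returned (illustrative only; the exact metadata and text depend on the page at load time):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "doc-inspect-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Each Document exposes its text and metadata as attributes\n",
    "print(len(data))  # number of Documents loaded (one per page here)\n",
    "print(data[0].metadata)  # e.g., the source URL and page title\n",
    "print(data[0].page_content[:200])  # first 200 characters of the text"
   ]
  },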
  {
   "cell_type": "markdown",
   "id": "fd2cc9a7",
   "metadata": {},
   "source": [
    "### Go deeper\n",
    "- Browse the > 160 data loader integrations [here](https://integrations.langchain.com/).\n",
    "- See further documentation on loaders [here](/docs/modules/data_connection/document_loaders/).\n",
    "\n",
    "## Step 2. Split\n",
    "\n",
    "Split the `Document` into chunks for embedding and vector storage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "4b11c01d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
    "all_splits = text_splitter.split_documents(data)"
   ]
  },
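  {
   "cell_type": "markdown",
   "id": "split-check-md",
   "metadata": {},
   "source": [
    "As a quick sanity check (the exact count depends on the fetched page), we can confirm how many splits were produced and that each respects the 500-character `chunk_size`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "split-check-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Number of chunks, and the length of the longest one (should be <= 500)\n",
    "print(len(all_splits))\n",
    "print(max(len(split.page_content) for split in all_splits))"
   ]
  },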
  {
   "cell_type": "markdown",
   "id": "0a33bd4d",
   "metadata": {},
   "source": [
    "### Go deeper\n",
    "\n",
    "- `DocumentSplitters` are just one type of the more generic `DocumentTransformers`.\n",
    "- See further documentation on transformers [here](/docs/modules/data_connection/document_transformers/).\n",
    "- `Context-aware splitters` keep the location (\"context\") of each split in the original `Document`:\n",
    "  - [Markdown files](/docs/use_cases/question_answering/how_to/document-context-aware-QA)\n",
    "  - [Code (py or js)](/docs/integrations/document_loaders/source_code)\n",
    "  - [Documents](/docs/integrations/document_loaders/grobid)\n",
    "\n",
    "## Step 3. Store\n",
    "\n",
    "To look up our document splits at answer time, we first need to store them in a searchable form.\n",
    "\n",
    "The most common way to do this is to embed the contents of each document split.\n",
    "\n",
    "We then store the embeddings and splits in a vectorstore."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "e9c302c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import OpenAIEmbeddings\n",
    "from langchain.vectorstores import Chroma\n",
    "\n",
    "vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc6f22b0",
   "metadata": {},
   "source": [
    "### Go deeper\n",
    "- Browse the > 40 vectorstore integrations [here](https://integrations.langchain.com/).\n",
    "- See further documentation on vectorstores [here](/docs/modules/data_connection/vectorstores/).\n",
    "- Browse the > 30 text embedding integrations [here](https://integrations.langchain.com/).\n",
    "- See further documentation on embedding models [here](/docs/modules/data_connection/text_embedding/).\n",
    "\n",
    "## Step 4. Retrieve\n",
    "\n",
    "Retrieve relevant splits for any question using [similarity search](https://www.pinecone.io/learn/what-is-similarity-search/).\n",
    "\n",
    "This is simply \"top K\" retrieval, where we select documents based on embedding similarity to the query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "e2c26b7d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "question = \"What are the approaches to Task Decomposition?\"\n",
    "docs = vectorstore.similarity_search(question)\n",
    "len(docs)"
   ]
  },
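  {
   "cell_type": "markdown",
   "id": "sim-score-md",
   "metadata": {},
   "source": [
    "To see why these splits were selected, Chroma also exposes the distance scores via `similarity_search_with_score` (a minimal sketch; for the default distance metric, lower scores mean closer embeddings):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sim-score-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Retrieve splits together with their distance scores\n",
    "docs_and_scores = vectorstore.similarity_search_with_score(question)\n",
    "for doc, score in docs_and_scores:\n",
    "    print(round(score, 3), doc.page_content[:80])"
   ]
  },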
  {
   "cell_type": "markdown",
   "id": "5d5a113b",
   "metadata": {},
   "source": [
    "### Go deeper\n",
    "\n",
    "Vectorstores are commonly used for retrieval, but they are not the only option. For example, SVMs (see thread [here](https://twitter.com/karpathy/status/1647025230546886658?s=20)) can also be used.\n",
    "\n",
    "LangChain [has many retrievers](/docs/modules/data_connection/retrievers/) including, but not limited to, vectorstores.\n",
    "\n",
    "All retrievers implement a common method `get_relevant_documents()` (and its asynchronous variant `aget_relevant_documents()`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "c901eaee",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.retrievers import SVMRetriever\n",
    "\n",
    "svm_retriever = SVMRetriever.from_documents(all_splits, OpenAIEmbeddings())\n",
    "docs_svm = svm_retriever.get_relevant_documents(question)\n",
    "len(docs_svm)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69de3d54",
   "metadata": {},
   "source": [
    "Some common ways to improve on vector similarity search include:\n",
    "- `MultiQueryRetriever` [generates variants of the input question](/docs/modules/data_connection/retrievers/MultiQueryRetriever) to improve retrieval, as shown below.\n",
    "- `Max marginal relevance` selects for [relevance and diversity](https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf) among the retrieved documents; a sketch follows the `MultiQueryRetriever` example.\n",
    "- Documents can be filtered during retrieval using [`metadata` filters](/docs/use_cases/question_answering/how_to/document-context-aware-QA)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9cfe3270-4e89-4c60-a2e5-9026b021bf76",
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "\n",
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.retrievers.multi_query import MultiQueryRetriever\n",
    "\n",
    "logging.basicConfig()\n",
    "logging.getLogger(\"langchain.retrievers.multi_query\").setLevel(logging.INFO)\n",
    "\n",
    "retriever_from_llm = MultiQueryRetriever.from_llm(\n",
    "    retriever=vectorstore.as_retriever(), llm=ChatOpenAI(temperature=0)\n",
    ")\n",
    "unique_docs = retriever_from_llm.get_relevant_documents(query=question)\n",
    "len(unique_docs)"
   ]
  },
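  {
   "cell_type": "markdown",
   "id": "mmr-md",
   "metadata": {},
   "source": [
    "Max marginal relevance can be used directly through the vectorstore, or via a retriever configured with `search_type=\"mmr\"` (a minimal sketch; `k` and `fetch_k` below are illustrative values):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "mmr-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fetch a larger candidate pool, then select k splits balancing relevance and diversity\n",
    "mmr_docs = vectorstore.max_marginal_relevance_search(question, k=4, fetch_k=20)\n",
    "print(len(mmr_docs))\n",
    "\n",
    "# Equivalently, as a retriever for use in chains\n",
    "mmr_retriever = vectorstore.as_retriever(search_type=\"mmr\")"
   ]
  },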
  {
   "cell_type": "markdown",
   "id": "ee8420e6-73a6-411b-a84d-74b096bddad7",
   "metadata": {},
   "source": [
    "In addition, a useful concept for improving retrieval is decoupling the documents from the key that is embedded for search.\n",
    "\n",
    "For example, we can embed a document summary or questions that are likely to lead to the document being retrieved.\n",
    "\n",
    "See details [here](/docs/modules/data_connection/retrievers/multi_vector) on the multi-vector retriever designed for this purpose."
   ]
  },
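  {
   "cell_type": "markdown",
   "id": "multi-vector-md",
   "metadata": {},
   "source": [
    "As a minimal sketch of the idea (the `doc_id` key is an arbitrary choice, and the truncated text stands in for a real summary, which in practice an LLM would generate):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "multi-vector-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import uuid\n",
    "\n",
    "from langchain.retrievers.multi_vector import MultiVectorRetriever\n",
    "from langchain.schema import Document\n",
    "from langchain.storage import InMemoryStore\n",
    "\n",
    "# Embed a stand-in \"summary\" of each split, keyed back to the full split by id\n",
    "id_key = \"doc_id\"\n",
    "doc_ids = [str(uuid.uuid4()) for _ in all_splits]\n",
    "summaries = [\n",
    "    Document(page_content=split.page_content[:100], metadata={id_key: doc_ids[i]})\n",
    "    for i, split in enumerate(all_splits)\n",
    "]\n",
    "\n",
    "# Search runs against the summaries; retrieval returns the full splits\n",
    "multi_vector_retriever = MultiVectorRetriever(\n",
    "    vectorstore=Chroma.from_documents(summaries, OpenAIEmbeddings()),\n",
    "    docstore=InMemoryStore(),\n",
    "    id_key=id_key,\n",
    ")\n",
    "multi_vector_retriever.docstore.mset(list(zip(doc_ids, all_splits)))"
   ]
  },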
  {
   "cell_type": "markdown",
   "id": "415d6824",
   "metadata": {},
   "source": [
    "## Step 5. Generate\n",
    "\n",
    "Distill the retrieved documents into an answer using an LLM/Chat model (e.g., `gpt-3.5-turbo`).\n",
    "\n",
    "We use the [Runnable](https://python.langchain.com/docs/expression_language/interface) protocol to define the chain.\n",
    "\n",
    "The Runnable protocol pipes components together in a transparent way.\n",
    "\n",
    "We use a RAG prompt that is checked into the LangChain prompt hub ([here](https://smith.langchain.com/hub/rlm/rag-prompt))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "99fa1aec",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AIMessage(content='Task decomposition is the process of breaking down a task into smaller subgoals or steps. It can be done using simple prompting, task-specific instructions, or human inputs.')"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.schema.runnable import RunnablePassthrough\n",
    "\n",
    "llm = ChatOpenAI(model_name=\"gpt-3.5-turbo\", temperature=0)\n",
    "\n",
    "rag_chain = (\n",
    "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
    "    | rag_prompt\n",
    "    | llm\n",
    ")\n",
    "\n",
    "rag_chain.invoke(\"What is Task Decomposition?\")"
   ]
  },
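  {
   "cell_type": "markdown",
   "id": "str-output-md",
   "metadata": {},
   "source": [
    "The chain above returns an `AIMessage`. If you want a plain string instead, you can append an output parser to the same chain:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "str-output-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.schema.output_parser import StrOutputParser\n",
    "\n",
    "# Same chain as above, with the AIMessage parsed down to its string content\n",
    "rag_chain_str = (\n",
    "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
    "    | rag_prompt\n",
    "    | llm\n",
    "    | StrOutputParser()\n",
    ")\n",
    "\n",
    "rag_chain_str.invoke(\"What is Task Decomposition?\")"
   ]
  },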
  {
   "cell_type": "markdown",
   "id": "f7d52c84",
   "metadata": {},
   "source": [
    "### Go deeper\n",
    "\n",
    "#### Choosing LLMs\n",
    "- Browse the > 90 LLM and chat model integrations [here](https://integrations.langchain.com/).\n",
    "- See further documentation on LLMs and chat models [here](/docs/modules/model_io/models/).\n",
    "- See a guide on local LLMs [here](/docs/use_cases/question_answering/how_to/local_retrieval_qa)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa82f437",
   "metadata": {},
   "source": [
    "#### Customizing the prompt\n",
    "\n",
    "As shown above, we can load prompts (e.g., [this RAG prompt](https://smith.langchain.com/hub/rlm/rag-prompt)) from the prompt hub.\n",
    "\n",
    "The prompt can also be easily customized, as shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "e4fee704",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AIMessage(content='Task decomposition is the process of breaking down a complicated task into smaller, more manageable subtasks or steps. It can be done using prompts, task-specific instructions, or human inputs. Thanks for asking!')"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "template = \"\"\"Use the following pieces of context to answer the question at the end.\n",
    "If you don't know the answer, just say that you don't know, don't try to make up an answer.\n",
    "Use three sentences maximum and keep the answer as concise as possible.\n",
    "Always say \"thanks for asking!\" at the end of the answer.\n",
    "{context}\n",
    "Question: {question}\n",
    "Helpful Answer:\"\"\"\n",
    "rag_prompt_custom = PromptTemplate.from_template(template)\n",
    "\n",
    "rag_chain = (\n",
    "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
    "    | rag_prompt_custom\n",
    "    | llm\n",
    ")\n",
    "\n",
    "rag_chain.invoke(\"What is Task Decomposition?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f5b6297-715a-444e-b3ef-a6d27382b435",
   "metadata": {},
   "source": [
    "We can use [LangSmith](https://smith.langchain.com/public/129cac54-44d5-453a-9807-3bd4835e5f96/r) to see the trace."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}