Add research example

This commit is contained in:
Eugene Yurtsev
2023-06-02 12:56:26 -04:00
parent 203cb5e307
commit f79582c548

View File

@@ -0,0 +1,448 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d0187afb-f460-431f-aee3-50d68bc33446",
"metadata": {},
"source": [
"# Research Chain\n",
"\n",
"This is an experimental research chain to attempt to answer a research question using information on the web.\n",
"\n",
"A research chain is composed of the following components:\n",
"\n",
"1. A searcher that searches for documents using a search engine.\n",
" - The searcher is responsible to return a list of URLs of documents that\n",
" may be relevant to read to be able to answer the question.\n",
"2. A downloader that downloads the documents.\n",
"3. An HTML to markdown parser (hard coded) that converts the HTML to markdown.\n",
" * Conversion to markdown is lossy\n",
" * However, it can significantly reduce the token count of the document\n",
" * Markdown helps to preserve some styling information\n",
" (e.g., bold, italics, links, headers) which is expected to help the reader\n",
" to answer certain kinds of questions correctly.\n",
"4. A reader that reads the documents and produces an answer.\n",
"\n",
"## Limitations\n",
"\n",
"* Chain can be potentially long running (Use initialization parameters to control how many options are eplored).\n",
"* This research chain only implements a single hop at the moment; i.e.,\n",
" it goes from the questions to a list of URLs to documents to compiling answers.\n",
" Without continuing the crawl, web-sites that require pagnation will not be explored fully.\n",
"* The reader chain must match the type of question. For example, the QA refine chain \n",
" isn't good at extracting a list of entries from a long document.\n",
"* Content downloader may get blocked (since it looks like a bot). \n",
" \n",
"## Extending\n",
"\n",
"* Continue crawling documents to discover more relevant pages that were not surfaced by the search engine.\n",
"* Adapt reading strategy based on nature of question.\n",
"* Analyze the query and determine whether the query is a multi-hop query and change search/crawling strategy based on that.\n",
"* Provide smaller pieces to an agent. :)\n",
"* Add cheaper strategies for selecting which links should be explored further (e.g., based on tf-idf similarity instead of gpt-4)\n",
"* Add a summarization chain on top of the individually collected answers"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "7eb466b8-24fa-4acc-b0ce-06fcfa2fa9c4",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.chains.research.api import Research\n",
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.llms import OpenAI\n",
"from langchain.chains.question_answering import load_qa_chain\n",
"from langchain.chains.research.fetch import PlaywrightDownloadHandler"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "70474885-0acd-41b2-8050-15dd54f44f1e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"question = \"\"\"\\\n",
"Compile information about Harrison Chase from langchain. \n",
"Ignore if it's a different Harrison Chase. \n",
"Only include information you're certain about.\n",
"\n",
"Include:\n",
"* education history\n",
"* verbal skills\n",
"* work history \n",
"* favorite pets\n",
"* phone number\n",
"* hobbies\n",
"* does he like sports\n",
"* is he a honest person?\n",
"* living location\n",
"* social profiles (e.g., github, twitter, etc.)\n",
"* smoking habits\n",
"* which languages can he code in?\n",
"* does he play sports?\n",
"\n",
"Format your answer in a bullet point format for each sub-question.\n",
"\n",
"If the context text claims that a security verification is required or a different web-browser is \n",
"required to download the page, then please output: \"Download May Have Failed.\"\n",
"\"\"\".strip()"
]
},
{
"cell_type": "markdown",
"id": "6613da1c-3349-45f4-9770-19986750d548",
"metadata": {},
"source": [
"Instantiate LLMs"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "e74a44b4-2075-4cc6-933e-c769bf3f6002",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"llm = OpenAI(temperature=0, model='text-davinci-003') # Used for the readers and the query generator\n",
"selector_llm = ChatOpenAI(temperature=0, model='gpt-4') # Used for selecting which links to explore"
]
},
{
"cell_type": "markdown",
"id": "b243364a-e79b-432d-8035-3de8caf554a8",
"metadata": {},
"source": [
"Create a chain that can be used to extract the answer to the question above from a given document.\n",
"\n",
"This chain must be tailored to the task."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "2f538062-14e3-49ab-9b25-bc470eb5869c",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"qa_chain = load_qa_chain(llm, chain_type='refine')"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "a96f3ed3-10de-4a85-9e93-a8b78d8bfbb6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"research = Research.from_llms(\n",
" query_generation_llm=llm, \n",
" link_selection_llm=selector_llm, \n",
" underlying_reader_chain=qa_chain, \n",
" top_k_per_search=3, \n",
" max_num_pages_per_doc=30, \n",
" download_handler=PlaywrightDownloadHandler()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "3207c696-a72c-4378-b427-7d285f5fdd1c",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n",
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n",
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n",
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n",
"Retrying langchain.llms.openai.acompletion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.\n"
]
}
],
"source": [
"results = await research.acall(inputs={'question': question})"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "843616b3-32d7-49c7-a42b-b0272d71f3ed",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"----------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
"https://www.linkedin.com/in/harrison-chase-961287118\n",
"\n",
"Download May Have Failed.\n",
"----------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
"https://github.com/hwchase17\n",
"\n",
"\n",
"* Education history: Unknown\n",
"* Verbal skills: Unknown\n",
"* Work history: GitHub user since 2015, 3,548 contributions in the last year\n",
"* Favorite pets: Unknown\n",
"* Phone number: Unknown\n",
"* Hobbies: Unknown\n",
"* Does he like sports: Unknown\n",
"* Is he a honest person?: Unknown\n",
"* Living location: Unknown\n",
"* Social profiles: GitHub (https://github.com/hwchase17)\n",
"* Smoking habits: Unknown\n",
"* Which languages can he code in?: Python, TypeScript\n",
"* Does he play sports?: Unknown\n",
"----------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
"https://twitter.com/hwchase17?lang=en\n",
"\n",
"\n",
"* Education history: Unknown\n",
"* Verbal skills: Unknown\n",
"* Work history: CEO of LangChainAI, previously RobustHQ and Kensho\n",
"* Favorite pets: Unknown\n",
"* Phone number: Unknown\n",
"* Hobbies: Unknown\n",
"* Does he like sports: Unknown\n",
"* Is he a honest person: Unknown\n",
"* Living location: Unknown\n",
"* Social profiles: Twitter (@hwchase17)\n",
"* Smoking habits: Unknown\n",
"* Which languages can he code in: Unknown\n",
"* Does he play sports: Unknown\n"
]
}
],
"source": [
"for doc in results['docs']:\n",
" print('--'*80)\n",
" print(doc.metadata['source'])\n",
" print(doc.page_content)"
]
},
{
"cell_type": "markdown",
"id": "edcaf496-3679-4b58-9baa-6124e1cc3435",
"metadata": {},
"source": [
"If useful we can produce another summary!"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "76136bff-b7df-4539-9bcb-760fc4449390",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"qa_chain = load_qa_chain(llm, chain_type='stuff')"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "b352f3f3-6777-4795-acbb-ed26ecac137d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"summary = await qa_chain.acall(inputs={'input_documents': results['docs'], 'question': question})"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "1c90c305-a89d-42e2-b975-dda039e816b6",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
"* Education history: Unknown\n",
"* Verbal skills: Unknown\n",
"* Work history: CEO of LangChainAI, previously RobustHQ and Kensho\n",
"* Favorite pets: Unknown\n",
"* Phone number: Unknown\n",
"* Hobbies: Unknown\n",
"* Does he like sports: Unknown\n",
"* Is he a honest person: Unknown\n",
"* Living location: Unknown\n",
"* Social profiles: GitHub (https://github.com/hwchase17), Twitter (@hwchase17)\n",
"* Smoking habits: Unknown\n",
"* Which languages can he code in: Unknown\n",
"* Does he play sports: Unknown\n"
]
}
],
"source": [
"print(summary['output_text'])"
]
},
{
"cell_type": "markdown",
"id": "af2adfee-85d3-41af-900a-c594dc01ce16",
"metadata": {},
"source": [
"## Under the hood"
]
},
{
"cell_type": "markdown",
"id": "c307aa60-7e75-48e8-ba72-b45507ed3fe0",
"metadata": {},
"source": [
"A searcher is invoked first to find URLs that are good to explore"
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "4e2c369c-8763-458e-9ab8-684466395890",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'question': \"Compile information about Harrison Chase from langchain. \\nIgnore if it's a different harrison chase. \\nOnly include information you're certain about.\\n\\nInclude:\\n* education history\\n* verbal skills\\n* work history \\n* favorite pets\\n* phone number\\n* hobbies\\n* does he like sports\\n* is he a honest person?\\n* living location\\n* social profiles (e.g., github, twitter, etc.)\\n* smoking habits\\n* which languages can he code in?\\n* does he play sports?\",\n",
" 'urls': ['https://www.linkedin.com/in/harrison-chase-961287118',\n",
" 'https://www.youtube.com/watch?v=zaYTXQFR0_s',\n",
" 'https://twitter.com/hwchase17?lang=en']}"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = await research.searcher.acall(inputs={'question': question})\n",
"results"
]
},
{
"cell_type": "markdown",
"id": "24099529-2d3d-4eca-8a1f-45a2539c8842",
"metadata": {},
"source": [
"The webpages are downloaded"
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "c6f9d1b5-e513-4d8d-b325-32eacbee92b4",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"blobs = await research.downloader.adownload(results['urls'])"
]
},
{
"cell_type": "code",
"execution_count": 65,
"id": "964aae51-6a1f-4f63-a6fe-11c83557888a",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'<!DOCTYPE html><html lang=\"en-US\" class=\"artdeco \"'"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"blobs[0].as_string()[:50]"
]
},
{
"cell_type": "code",
"execution_count": 66,
"id": "7faf045a-e9e6-4ebb-b0ee-618b958a9790",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'https://www.linkedin.com/in/harrison-chase-961287118'"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"blobs[0].source"
]
},
{
"cell_type": "markdown",
"id": "58909ce0-fd1c-4e09-9d13-b43135ee8038",
"metadata": {},
"source": [
"The blobs are parsed with an HTML parser and read by the reader chain (not shown) -- see underlying code for details"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}