docs: Add question answering over a website to web scraping (#10637)

**Description:** I've added a new use-case to the Web scraping docs. I also fixed some typos in the existing text.

---------

Co-authored-by: davidjohnbarton <41335923+davidjohnbarton@users.noreply.github.com>
2025-06-28 17:38:36 +00:00 · 2023-09-15 21:53:51 +02:00
commit 75c04f0833
parent 976a18c1d5
1 changed file with 61 additions and 4 deletions
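
The code cell added in this commit passes a `dataset_mapping_function` that turns each crawled dataset item into a `Document`, using `item["text"] or ""` so that a missing text value does not end up as `None` in `page_content`. The snippet below is a minimal sketch of that mapping step only, so it can be run without `langchain` or an Apify token. The `Doc` dataclass, the `map_item` helper, and the sample items are hypothetical stand-ins and are not part of the commit.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain's Document (not the real class)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def map_item(item: dict) -> Doc:
    # Same expression as the lambda in the added cell:
    # a None or empty "text" value falls back to an empty string.
    return Doc(page_content=item["text"] or "", metadata={"source": item["url"]})


# Made-up dataset items for illustration only.
items = [
    {"url": "https://example.com/a", "text": "Hello"},
    {"url": "https://example.com/b", "text": None},
]

docs = [map_item(item) for item in items]
for doc in docs:
    print(repr(doc.page_content), doc.metadata["source"])
```

In the notebook itself, the same lambda is passed to `apify.call_actor(...)`, which applies it to the items produced by the Actor run.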
--- a/docs/extras/use_cases/web_scraping.ipynb
+++ b/docs/extras/use_cases/web_scraping.ipynb
@@ -453,11 +453,11 @@
    "\n",
    "Related to scraping, we may want to answer specific questions using searched content.\n",
    "\n",
-    "We can automate the process of [web research](https://blog.langchain.dev/automating-web-research/) using a retriver, such as the `WebResearchRetriever` ([docs](https://python.langchain.com/docs/modules/data_connection/retrievers/web_research)).\n",
+    "We can automate the process of [web research](https://blog.langchain.dev/automating-web-research/) using a retriever, such as the `WebResearchRetriever` ([docs](https://python.langchain.com/docs/modules/data_connection/retrievers/web_research)).\n",
    "\n",
    "![Image description](/img/web_research.png)\n",
    "\n",
-    "Copy requirments [from here](https://github.com/langchain-ai/web-explorer/blob/main/requirements.txt):\n",
+    "Copy requirements [from here](https://github.com/langchain-ai/web-explorer/blob/main/requirements.txt):\n",
    "\n",
    "`pip install -r requirements.txt`\n",
    " \n",
@@ -573,13 +573,70 @@
  },
  {
   "cell_type": "markdown",
-   "id": "ff62e5f5",
   "metadata": {},
   "source": [
    "### Going deeper \n",
    "\n",
    "* Here's a [app](https://github.com/langchain-ai/web-explorer/tree/main) that wraps this retriver with a lighweight UI."
   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "312c399e",
+   "metadata": {},
+   "source": [
+    "## Question answering over a website\n",
+    "\n",
+    "To answer questions over a specific website, you can use Apify's [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor, which can deeply crawl websites such as documentation, knowledge bases, help centers, or blogs,\n",
+    "and extract text content from the web pages.\n",
+    "\n",
+    "In the example below, we will deeply crawl the Python documentation of LangChain's Chat LLM models and answer a question over it.\n",
+    "\n",
+    "First, install the requirements\n",
+    "`pip install apify-client openai langchain chromadb tiktoken`\n",
+    " \n",
+    "Next, set `OPENAI_API_KEY` and `APIFY_API_TOKEN` in your environment variables.\n",
+    "\n",
+    "The full code follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "9b08da5e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " Yes, LangChain offers integration with OpenAI chat models. You can use the ChatOpenAI class to interact with OpenAI models.\n"
+     ]
+    }
+   ],
+   "source": [
+    "from langchain.docstore.document import Document\n",
+    "from langchain.indexes import VectorstoreIndexCreator\n",
+    "from langchain.utilities import ApifyWrapper\n",
+    "\n",
+    "apify = ApifyWrapper()\n",
+    "# Call the Actor to obtain text from the crawled webpages\n",
+    "loader = apify.call_actor(\n",
+    "    actor_id=\"apify/website-content-crawler\",\n",
+    "    run_input={\"startUrls\": [{\"url\": \"https://python.langchain.com/docs/integrations/chat/\"}]},\n",
+    "    dataset_mapping_function=lambda item: Document(\n",
+    "        page_content=item[\"text\"] or \"\", metadata={\"source\": item[\"url\"]}\n",
+    "    ),\n",
+    ")\n",
+    "\n",
+    "# Create a vector store based on the crawled data\n",
+    "index = VectorstoreIndexCreator().from_loaders([loader])\n",
+    "\n",
+    "# Query the vector store\n",
+    "query = \"Are any OpenAI chat models integrated in LangChain?\"\n",
+    "result = index.query(query)\n",
+    "print(result)"
+   ]
  }
 ],
 "metadata": {
@@ -598,7 +655,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.9.1"
+   "version": "3.9.16"
  }
 },
 "nbformat": 4,