docs: update ScrapeGraphAI tools (#32026)

The ScrapeGraphAI tools documentation was outdated; this refreshes the tool listing and the example notebook.

---------

Co-authored-by: Mason Daugherty <github@mdrxy.com>
Marco Vinciguerra 2025-07-14 18:38:55 +02:00 committed by GitHub
parent d96b75f9d3
commit 26c2c8f70a
2 changed files with 92 additions and 29 deletions


@@ -27,8 +27,8 @@ There are four tools available:
```python
from langchain_scrapegraph.tools import (
SmartScraperTool, # Extract structured data from websites
SmartCrawlerTool, # Extract data from multiple pages with crawling
MarkdownifyTool, # Convert webpages to markdown
LocalScraperTool, # Process local HTML content
GetCreditsTool, # Check remaining API credits
)
```
@@ -36,6 +36,6 @@ from langchain_scrapegraph.tools import (
Each tool serves a specific purpose:
- `SmartScraperTool`: Extract structured data from websites given a URL, prompt and optional output schema
- `SmartCrawlerTool`: Extract data from multiple pages with advanced crawling options like depth control, page limits, and domain restrictions
- `MarkdownifyTool`: Convert any webpage to clean markdown format
- `LocalScraperTool`: Extract structured data from a local HTML file given a prompt and optional output schema
- `GetCreditsTool`: Check your remaining ScrapeGraph AI credits
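
As a quick illustration of the pattern the tools above share, here is a minimal, hedged sketch of a `SmartScraperTool` call. It assumes the `SGAI_API_KEY` environment variable holds your ScrapeGraph AI key and uses the `website_url`/`user_prompt` argument names that appear in the integration notebook; adapt it to your own setup.

```python
import os

from langchain_scrapegraph.tools import SmartScraperTool

# Assumption: the tool picks up the ScrapeGraph AI key from SGAI_API_KEY.
os.environ.setdefault("SGAI_API_KEY", "sgai-your-api-key")  # placeholder value

smartscraper = SmartScraperTool()

# Extract structured data from a single page; the result comes back as JSON-like data.
result = smartscraper.invoke(
    {
        "website_url": "https://scrapegraphai.com",
        "user_prompt": "Extract the company name and a one-sentence description",
    }
)
print(result)
```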


@@ -30,8 +30,8 @@
"| Class | Package | Serializable | JS support | Package latest |\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [SmartScraperTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapegraph?style=flat-square&label=%20) |\n",
"| [SmartCrawlerTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapegraph?style=flat-square&label=%20) |\n",
"| [MarkdownifyTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapegraph?style=flat-square&label=%20) |\n",
"| [LocalScraperTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapegraph?style=flat-square&label=%20) |\n",
"| [GetCreditsTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapegraph?style=flat-square&label=%20) |\n",
"\n",
"### Tool features\n",
@@ -39,8 +39,8 @@
"| Tool | Purpose | Input | Output |\n",
"| :--- | :--- | :--- | :--- |\n",
"| SmartScraperTool | Extract structured data from websites | URL + prompt | JSON |\n",
"| SmartCrawlerTool | Extract data from multiple pages with crawling | URL + prompt + crawl options | JSON |\n",
"| MarkdownifyTool | Convert webpages to markdown | URL | Markdown text |\n",
"| LocalScraperTool | Extract data from HTML content | HTML + prompt | JSON |\n",
"| GetCreditsTool | Check API credits | None | Credit info |\n",
"\n",
"\n",
@@ -122,21 +122,26 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"id": "8b3ddfe9",
"metadata": {},
"outputs": [],
"source": [
"from scrapegraph_py.logger import sgai_logger\n",
"import json\n",
"\n",
"from langchain_scrapegraph.tools import (\n",
" GetCreditsTool,\n",
" LocalScraperTool,\n",
" MarkdownifyTool,\n",
" SmartCrawlerTool,\n",
" SmartScraperTool,\n",
")\n",
"\n",
"sgai_logger.set_logging(level=\"INFO\")\n",
"\n",
"smartscraper = SmartScraperTool()\n",
"smartcrawler = SmartCrawlerTool()\n",
"markdownify = MarkdownifyTool()\n",
"localscraper = LocalScraperTool()\n",
"credits = GetCreditsTool()"
]
},
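
The instantiation cell above relies on the key already being exported (the crawler cell later in this diff notes that the tool will "automatically get SGAI_API_KEY from environment"). A hedged sketch of setting it interactively beforehand:

```python
import getpass
import os

# Prompt for the ScrapeGraph AI key only if it is not already exported;
# SGAI_API_KEY is the variable the tools are expected to read.
if not os.environ.get("SGAI_API_KEY"):
    os.environ["SGAI_API_KEY"] = getpass.getpass("ScrapeGraph AI API key: ")
```
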
@@ -152,9 +157,23 @@
"Let's try each tool individually:"
]
},
{
"cell_type": "markdown",
"id": "d5a88cf2",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"### SmartCrawler Tool\n",
"\n",
"The SmartCrawlerTool allows you to crawl multiple pages from a website and extract structured data with advanced crawling options like depth control, page limits, and domain restrictions.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"id": "65310a8b",
"metadata": {},
"outputs": [
@@ -189,33 +208,71 @@
"markdown = markdownify.invoke({\"website_url\": \"https://scrapegraphai.com\"})\n",
"print(\"\\nMarkdownify Result (first 200 chars):\", markdown[:200])\n",
"\n",
"local_html = \"\"\"\n",
"<html>\n",
" <body>\n",
" <h1>Company Name</h1>\n",
" <p>We are a technology company focused on AI solutions.</p>\n",
" <div class=\"contact\">\n",
" <p>Email: contact@example.com</p>\n",
" <p>Phone: (555) 123-4567</p>\n",
" </div>\n",
" </body>\n",
"</html>\n",
"\"\"\"\n",
"# SmartCrawler\n",
"url = \"https://scrapegraphai.com/\"\n",
"prompt = (\n",
" \"What does the company do? and I need text content from their privacy and terms\"\n",
")\n",
"\n",
"# LocalScraper\n",
"result_local = localscraper.invoke(\n",
"# Use the tool with crawling parameters\n",
"result_crawler = smartcrawler.invoke(\n",
" {\n",
" \"user_prompt\": \"Make a summary of the webpage and extract the email and phone number\",\n",
" \"website_html\": local_html,\n",
" \"url\": url,\n",
" \"prompt\": prompt,\n",
" \"cache_website\": True,\n",
" \"depth\": 2,\n",
" \"max_pages\": 2,\n",
" \"same_domain_only\": True,\n",
" }\n",
")\n",
"print(\"LocalScraper Result:\", result_local)\n",
"\n",
"print(\"\\nSmartCrawler Result:\")\n",
"print(json.dumps(result_crawler, indent=2))\n",
"\n",
"# Check credits\n",
"credits_info = credits.invoke({})\n",
"print(\"\\nCredits Info:\", credits_info)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f13fb466",
"metadata": {},
"outputs": [],
"source": [
"# SmartCrawler example\n",
"from scrapegraph_py.logger import sgai_logger\n",
"import json\n",
"\n",
"from langchain_scrapegraph.tools import SmartCrawlerTool\n",
"\n",
"sgai_logger.set_logging(level=\"INFO\")\n",
"\n",
"# Will automatically get SGAI_API_KEY from environment\n",
"tool = SmartCrawlerTool()\n",
"\n",
"# Example based on the provided code snippet\n",
"url = \"https://scrapegraphai.com/\"\n",
"prompt = (\n",
" \"What does the company do? and I need text content from their privacy and terms\"\n",
")\n",
"\n",
"# Use the tool with crawling parameters\n",
"result = tool.invoke(\n",
" {\n",
" \"url\": url,\n",
" \"prompt\": prompt,\n",
" \"cache_website\": True,\n",
" \"depth\": 2,\n",
" \"max_pages\": 2,\n",
" \"same_domain_only\": True,\n",
" }\n",
")\n",
"\n",
"print(json.dumps(result, indent=2))"
]
},
{
"cell_type": "markdown",
"id": "d6e73897",
@@ -350,15 +407,21 @@
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all ScrapeGraph features and configurations head to the Langchain API reference: https://python.langchain.com/docs/integrations/tools/scrapegraph\n",
"For detailed documentation of all ScrapeGraph features and configurations head to [the Langchain API reference](https://python.langchain.com/docs/integrations/tools/scrapegraph).\n",
"\n",
"Or to the official SDK repo: https://github.com/ScrapeGraphAI/langchain-scrapegraph"
"Or to [the official SDK repo](https://github.com/ScrapeGraphAI/langchain-scrapegraph)."
]
},
{
"cell_type": "markdown",
"id": "d710dad8",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "langchain",
"language": "python",
"name": "python3"
},
@@ -372,7 +435,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.10.16"
}
},
"nbformat": 4,