diff --git a/docs/docs/integrations/providers/scrapegraph.mdx b/docs/docs/integrations/providers/scrapegraph.mdx new file mode 100644 index 00000000000..93507ef3a88 --- /dev/null +++ b/docs/docs/integrations/providers/scrapegraph.mdx @@ -0,0 +1,41 @@ +# ScrapeGraph AI + +>[ScrapeGraph AI](https://scrapegraphai.com) is a service that provides AI-powered web scraping capabilities. +>It offers tools for extracting structured data, converting webpages to markdown, and processing local HTML content +>using natural language prompts. + +## Installation and Setup + +Install the required packages: + +```bash +pip install langchain-scrapegraph +``` + +Set up your API key: + +```bash +export SGAI_API_KEY="your-scrapegraph-api-key" +``` + +## Tools + +See a [usage example](/docs/integrations/tools/scrapegraph). + +There are four tools available: + +```python +from langchain_scrapegraph.tools import ( + SmartScraperTool, # Extract structured data from websites + MarkdownifyTool, # Convert webpages to markdown + LocalScraperTool, # Process local HTML content + GetCreditsTool, # Check remaining API credits +) +``` + +Each tool serves a specific purpose: + +- `SmartScraperTool`: Extract structured data from websites given a URL, prompt and optional output schema +- `MarkdownifyTool`: Convert any webpage to clean markdown format +- `LocalScraperTool`: Extract structured data from a local HTML file given a prompt and optional output schema +- `GetCreditsTool`: Check your remaining ScrapeGraph AI credits diff --git a/docs/docs/integrations/tools/scrapegraph.ipynb b/docs/docs/integrations/tools/scrapegraph.ipynb new file mode 100644 index 00000000000..ed9b37d97bd --- /dev/null +++ b/docs/docs/integrations/tools/scrapegraph.ipynb @@ -0,0 +1,380 @@ +{ + "cells": [ + { + "cell_type": "raw", + "id": "10238e62-3465-4973-9279-606cbb7ccf16", + "metadata": {}, + "source": [ + "---\n", + "sidebar_label: ScrapeGraph\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "a6f91f20", + 
"metadata": {}, + "source": [ + "# ScrapeGraph\n", + "\n", + "This notebook provides a quick overview for getting started with ScrapeGraph [tools](/docs/integrations/tools/). For detailed documentation of all ScrapeGraph features and configurations head to the [API reference](https://python.langchain.com/docs/integrations/tools/scrapegraph).\n", + "\n", + "For more information about ScrapeGraph AI:\n", + "- [ScrapeGraph AI Website](https://scrapegraphai.com)\n", + "- [Open Source Project](https://github.com/ScrapeGraphAI/Scrapegraph-ai)\n", + "\n", + "## Overview\n", + "\n", + "### Integration details\n", + "\n", + "| Class | Package | Serializable | JS support | Package latest |\n", + "| :--- | :--- | :---: | :---: | :---: |\n", + "| [SmartScraperTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ |  |\n", + "| [MarkdownifyTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ |  |\n", + "| [LocalScraperTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ |  |\n", + "| [GetCreditsTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ |  |\n", + "\n", + "### Tool features\n", + "\n", + "| Tool | Purpose | Input | Output |\n", + "| :--- | :--- | :--- | :--- |\n", + "| SmartScraperTool | Extract structured data from websites | URL + prompt | JSON |\n", + "| MarkdownifyTool | Convert webpages to markdown | URL | Markdown text |\n", + "| LocalScraperTool | Extract data from HTML content | HTML + prompt | JSON |\n", + "| GetCreditsTool | Check API credits | None | Credit info |\n", + "\n", + "\n", + "## Setup\n", + "\n", + "The integration requires the following packages:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "f85b4089", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to 
restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install --quiet -U langchain-scrapegraph" + ] + }, + { + "cell_type": "markdown", + "id": "b15e9266", + "metadata": {}, + "source": [ + "### Credentials\n", + "\n", + "You'll need a ScrapeGraph AI API key to use these tools. Get one at [scrapegraphai.com](https://scrapegraphai.com)." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "e0b178a2", + "metadata": {}, + "outputs": [], + "source": [ + "import getpass\n", + "import os\n", + "\n", + "if not os.environ.get(\"SGAI_API_KEY\"):\n", + "    os.environ[\"SGAI_API_KEY\"] = getpass.getpass(\"ScrapeGraph AI API key:\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "bc5ab717", + "metadata": {}, + "source": [ + "It's also helpful (but not required) to set up [LangSmith](https://smith.langchain.com/) for best-in-class observability:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a6c2f136", + "metadata": {}, + "outputs": [], + "source": [ + "os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n", + "os.environ[\"LANGCHAIN_API_KEY\"] = getpass.getpass()" + ] + }, + { + "cell_type": "markdown", + "id": "1c97218f", + "metadata": {}, + "source": [ + "## Instantiation\n", + "\n", + "Instantiate each of the ScrapeGraph tools:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "8b3ddfe9", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_scrapegraph.tools import (\n", + "    GetCreditsTool,\n", + "    LocalScraperTool,\n", + "    MarkdownifyTool,\n", + "    SmartScraperTool,\n", + ")\n", + "\n", + "smartscraper = SmartScraperTool()\n", + "markdownify = MarkdownifyTool()\n", + "localscraper = LocalScraperTool()\n", + "credits = GetCreditsTool()" + ] + }, + { + "cell_type": "markdown", + "id": "74147a1a", + "metadata": {}, + "source": [ + "## Invocation\n", + "\n", + "### [Invoke directly with args](/docs/concepts/tools)\n", + "\n", + "Let's try each tool
individually:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "65310a8b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "SmartScraper Result: {'company_name': 'ScrapeGraphAI', 'description': \"ScrapeGraphAI is a powerful AI web scraping tool that turns entire websites into clean, structured data through a simple API. It's designed to help developers and AI companies extract valuable data from websites efficiently and transform it into formats that are ready for use in LLM applications and data analysis.\"}\n", + "\n", + "Markdownify Result (first 200 chars): [ScrapeGraphAI](https://scrapegraphai.com/)\n", + "\n", + "PartnersPricingFAQ[Blog](https://scrapegraphai.com/blog)DocsLog inSign up\n", + "\n", + "Op\n", + "LocalScraper Result: {'company_name': 'Company Name', 'description': 'We are a technology company focused on AI solutions.', 'contact': {'email': 'contact@example.com', 'phone': '(555) 123-4567'}}\n", + "\n", + "Credits Info: {'remaining_credits': 49679, 'total_credits_used': 914}\n" + ] + } + ], + "source": [ + "# SmartScraper\n", + "result = smartscraper.invoke(\n", + "    {\n", + "        \"user_prompt\": \"Extract the company name and description\",\n", + "        \"website_url\": \"https://scrapegraphai.com\",\n", + "    }\n", + ")\n", + "print(\"SmartScraper Result:\", result)\n", + "\n", + "# Markdownify\n", + "markdown = markdownify.invoke({\"website_url\": \"https://scrapegraphai.com\"})\n", + "print(\"\\nMarkdownify Result (first 200 chars):\", markdown[:200])\n", + "\n", + "local_html = \"\"\"\n", + "\n", + "
\n", + "We are a technology company focused on AI solutions.
\n", + "Email: contact@example.com
\n", + "Phone: (555) 123-4567
\n", + "