partners: 🕷️🦜 ScrapeGraph API Integration (#28559)

Hi LangChain team!

I'm the co-founder and maintainer of
[ScrapeGraphAI](https://scrapegraphai.com/).
Following the integration
[guide](https://python.langchain.com/docs/contributing/how_to/integrations/publish/)
on your site, I have created a new library called
[langchain-scrapegraph](https://github.com/ScrapeGraphAI/langchain-scrapegraph).

With this PR I would like to integrate ScrapeGraph as a provider in
LangChain, adding the required documentation files.
Let me know if any changes are needed, in either the library or the
documentation, for the integration to be accepted.

Thank you 🕷️🦜


---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Marco Perini 2024-12-09 03:38:21 +01:00 committed by GitHub
parent 317a38b83e
commit 2354bb7bfa
3 changed files with 424 additions and 0 deletions


@ -0,0 +1,41 @@
# ScrapeGraph AI
>[ScrapeGraph AI](https://scrapegraphai.com) is a service that provides AI-powered web scraping capabilities.
>It offers tools for extracting structured data, converting webpages to markdown, and processing local HTML content
>using natural language prompts.
## Installation and Setup
Install the required packages:
```bash
pip install langchain-scrapegraph
```
Set up your API key:
```bash
export SGAI_API_KEY="your-scrapegraph-api-key"
```
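If you prefer not to export the key in your shell, the same check can be done from Python. The following is a minimal sketch that mirrors the credentials cell of the usage notebook added in this commit; `ensure_sgai_api_key` is a hypothetical helper name, not part of `langchain-scrapegraph`:

```python
import getpass
import os


def ensure_sgai_api_key(prompt=getpass.getpass) -> str:
    """Return SGAI_API_KEY, prompting for it only when it is not already set.

    `prompt` is injectable (an assumption made for this sketch) so the helper
    can be exercised without an interactive terminal.
    """
    key = os.environ.get("SGAI_API_KEY")
    if not key:
        key = prompt("ScrapeGraph AI API key:\n")
        os.environ["SGAI_API_KEY"] = key
    return key
```

Call `ensure_sgai_api_key()` once before instantiating any of the tools.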
## Tools
See a [usage example](/docs/integrations/tools/scrapegraph).
There are four tools available:
```python
from langchain_scrapegraph.tools import (
    SmartScraperTool,  # Extract structured data from websites
    MarkdownifyTool,  # Convert webpages to markdown
    LocalScraperTool,  # Process local HTML content
    GetCreditsTool,  # Check remaining API credits
)
```
Each tool serves a specific purpose:
- `SmartScraperTool`: Extract structured data from websites given a URL, prompt and optional output schema
- `MarkdownifyTool`: Convert any webpage to clean markdown format
- `LocalScraperTool`: Extract structured data from a local HTML file given a prompt and optional output schema
- `GetCreditsTool`: Check your remaining ScrapeGraph AI credits
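As a rough sketch of how the four tools are called, the snippet below assembles the input dictionaries used in the usage notebook added in this same commit (`website_url`, `user_prompt`, `website_html`). The helper names are hypothetical and exist only for illustration; the optional output schema is left out because its parameter name is not shown here.

```python
# Hypothetical helpers: they only build the input dicts that the tools are
# invoked with in the usage notebook; they are not part of the package.
def smartscraper_input(url: str, prompt: str) -> dict:
    return {"website_url": url, "user_prompt": prompt}


def markdownify_input(url: str) -> dict:
    return {"website_url": url}


def localscraper_input(html: str, prompt: str) -> dict:
    return {"website_html": html, "user_prompt": prompt}


def run_examples() -> None:
    """Call each tool once. Needs SGAI_API_KEY, network access, and credits."""
    # Imported lazily so the helpers above work without the package installed.
    from langchain_scrapegraph.tools import (
        GetCreditsTool,
        LocalScraperTool,
        MarkdownifyTool,
        SmartScraperTool,
    )

    url = "https://scrapegraphai.com"
    prompt = "Extract the company name and description"
    print(SmartScraperTool().invoke(smartscraper_input(url, prompt)))
    # The notebook slices this result, so it is treated as a string here.
    print(MarkdownifyTool().invoke(markdownify_input(url))[:200])
    html = "<html><body><h1>Company Name</h1></body></html>"
    print(LocalScraperTool().invoke(localscraper_input(html, "Extract the heading")))
    print(GetCreditsTool().invoke({}))
```

Run `run_examples()` only after `SGAI_API_KEY` is set, since every call except the input helpers reaches the ScrapeGraph AI service.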


@ -0,0 +1,380 @@
{
"cells": [
{
"cell_type": "raw",
"id": "10238e62-3465-4973-9279-606cbb7ccf16",
"metadata": {},
"source": [
"---\n",
"sidebar_label: ScrapeGraph\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "a6f91f20",
"metadata": {},
"source": [
"# ScrapeGraph\n",
"\n",
"This notebook provides a quick overview for getting started with ScrapeGraph [tools](/docs/integrations/tools/). For detailed documentation of all ScrapeGraph features and configurations, head to the [API reference](https://python.langchain.com/docs/integrations/tools/scrapegraph).\n",
"\n",
"For more information about ScrapeGraph AI:\n",
"- [ScrapeGraph AI Website](https://scrapegraphai.com)\n",
"- [Open Source Project](https://github.com/ScrapeGraphAI/Scrapegraph-ai)\n",
"\n",
"## Overview\n",
"\n",
"### Integration details\n",
"\n",
"| Class | Package | Serializable | JS support | Package latest |\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [SmartScraperTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapegraph?style=flat-square&label=%20) |\n",
"| [MarkdownifyTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapegraph?style=flat-square&label=%20) |\n",
"| [LocalScraperTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapegraph?style=flat-square&label=%20) |\n",
"| [GetCreditsTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapegraph?style=flat-square&label=%20) |\n",
"\n",
"### Tool features\n",
"\n",
"| Tool | Purpose | Input | Output |\n",
"| :--- | :--- | :--- | :--- |\n",
"| SmartScraperTool | Extract structured data from websites | URL + prompt | JSON |\n",
"| MarkdownifyTool | Convert webpages to markdown | URL | Markdown text |\n",
"| LocalScraperTool | Extract data from HTML content | HTML + prompt | JSON |\n",
"| GetCreditsTool | Check API credits | None | Credit info |\n",
"\n",
"\n",
"## Setup\n",
"\n",
"The integration requires the following packages:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f85b4089",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install --quiet -U langchain-scrapegraph"
]
},
{
"cell_type": "markdown",
"id": "b15e9266",
"metadata": {},
"source": [
"### Credentials\n",
"\n",
"You'll need a ScrapeGraph AI API key to use these tools. Get one at [scrapegraphai.com](https://scrapegraphai.com)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e0b178a2",
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"if not os.environ.get(\"SGAI_API_KEY\"):\n",
"    os.environ[\"SGAI_API_KEY\"] = getpass.getpass(\"ScrapeGraph AI API key:\\n\")"
]
},
{
"cell_type": "markdown",
"id": "bc5ab717",
"metadata": {},
"source": [
"It's also helpful (but not needed) to set up [LangSmith](https://smith.langchain.com/) for best-in-class observability:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a6c2f136",
"metadata": {},
"outputs": [],
"source": [
"os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n",
"os.environ[\"LANGCHAIN_API_KEY\"] = getpass.getpass()"
]
},
{
"cell_type": "markdown",
"id": "1c97218f",
"metadata": {},
"source": [
"## Instantiation\n",
"\n",
"Instantiate the ScrapeGraph tools:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "8b3ddfe9",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scrapegraph.tools import (\n",
"    GetCreditsTool,\n",
"    LocalScraperTool,\n",
"    MarkdownifyTool,\n",
"    SmartScraperTool,\n",
")\n",
"\n",
"smartscraper = SmartScraperTool()\n",
"markdownify = MarkdownifyTool()\n",
"localscraper = LocalScraperTool()\n",
"credits = GetCreditsTool()"
]
},
{
"cell_type": "markdown",
"id": "74147a1a",
"metadata": {},
"source": [
"## Invocation\n",
"\n",
"### [Invoke directly with args](/docs/concepts/tools)\n",
"\n",
"Let's try each tool individually:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "65310a8b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SmartScraper Result: {'company_name': 'ScrapeGraphAI', 'description': \"ScrapeGraphAI is a powerful AI web scraping tool that turns entire websites into clean, structured data through a simple API. It's designed to help developers and AI companies extract valuable data from websites efficiently and transform it into formats that are ready for use in LLM applications and data analysis.\"}\n",
"\n",
"Markdownify Result (first 200 chars): [![ScrapeGraphAI Logo](https://scrapegraphai.com/images/scrapegraphai_logo.svg)ScrapeGraphAI](https://scrapegraphai.com/)\n",
"\n",
"PartnersPricingFAQ[Blog](https://scrapegraphai.com/blog)DocsLog inSign up\n",
"\n",
"Op\n",
"LocalScraper Result: {'company_name': 'Company Name', 'description': 'We are a technology company focused on AI solutions.', 'contact': {'email': 'contact@example.com', 'phone': '(555) 123-4567'}}\n",
"\n",
"Credits Info: {'remaining_credits': 49679, 'total_credits_used': 914}\n"
]
}
],
"source": [
"# SmartScraper\n",
"result = smartscraper.invoke(\n",
"    {\n",
"        \"user_prompt\": \"Extract the company name and description\",\n",
"        \"website_url\": \"https://scrapegraphai.com\",\n",
"    }\n",
")\n",
"print(\"SmartScraper Result:\", result)\n",
"\n",
"# Markdownify\n",
"markdown = markdownify.invoke({\"website_url\": \"https://scrapegraphai.com\"})\n",
"print(\"\\nMarkdownify Result (first 200 chars):\", markdown[:200])\n",
"\n",
"local_html = \"\"\"\n",
"<html>\n",
"    <body>\n",
"        <h1>Company Name</h1>\n",
"        <p>We are a technology company focused on AI solutions.</p>\n",
"        <div class=\"contact\">\n",
"            <p>Email: contact@example.com</p>\n",
"            <p>Phone: (555) 123-4567</p>\n",
"        </div>\n",
"    </body>\n",
"</html>\n",
"\"\"\"\n",
"\n",
"# LocalScraper\n",
"result_local = localscraper.invoke(\n",
"    {\n",
"        \"user_prompt\": \"Make a summary of the webpage and extract the email and phone number\",\n",
"        \"website_html\": local_html,\n",
"    }\n",
")\n",
"print(\"LocalScraper Result:\", result_local)\n",
"\n",
"# Check credits\n",
"credits_info = credits.invoke({})\n",
"print(\"\\nCredits Info:\", credits_info)"
]
},
{
"cell_type": "markdown",
"id": "d6e73897",
"metadata": {},
"source": [
"### [Invoke with ToolCall](/docs/concepts/tools)\n",
"\n",
"We can also invoke the tool with a model-generated ToolCall:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "f90e33a7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ToolMessage(content='{\"main_heading\": \"Get the data you need from any website\", \"description\": \"Easily extract and gather information with just a few lines of code with a simple api. Turn websites into clean and usable structured data.\"}', name='SmartScraper', tool_call_id='1')"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model_generated_tool_call = {\n",
"    \"args\": {\n",
"        \"user_prompt\": \"Extract the main heading and description\",\n",
"        \"website_url\": \"https://scrapegraphai.com\",\n",
"    },\n",
"    \"id\": \"1\",\n",
"    \"name\": smartscraper.name,\n",
"    \"type\": \"tool_call\",\n",
"}\n",
"smartscraper.invoke(model_generated_tool_call)"
]
},
{
"cell_type": "markdown",
"id": "659f9fbd",
"metadata": {},
"source": [
"## Chaining\n",
"\n",
"Let's use our tools with an LLM to analyze a website:\n",
"\n",
"import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
"\n",
"<ChatModelTabs customVarName=\"llm\" />"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "af3123ad",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"# | output: false\n",
"# | echo: false\n",
"\n",
"# %pip install -qU langchain langchain-openai\n",
"from langchain.chat_models import init_chat_model\n",
"\n",
"llm = init_chat_model(model=\"gpt-4o\", model_provider=\"openai\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "fdbf35b5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"AIMessage(content='ScrapeGraph AI is an AI-powered web scraping tool that efficiently extracts and converts website data into structured formats via a simple API. It caters to developers, data scientists, and AI researchers, offering features like easy integration, support for dynamic content, and scalability for large projects. It supports various website types, including business, e-commerce, and educational sites. Contact: contact@scrapegraphai.com.', additional_kwargs={'tool_calls': [{'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'function': {'arguments': '{\"user_prompt\":\"Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.\",\"website_url\":\"https://scrapegraphai.com\"}', 'name': 'SmartScraper'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 47, 'prompt_tokens': 480, 'total_tokens': 527, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_c7ca0ebaca', 'finish_reason': 'stop', 'logprobs': None}, id='run-45a12c86-d499-4273-8c59-0db926799bc7-0', tool_calls=[{'name': 'SmartScraper', 'args': {'user_prompt': 'Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.', 'website_url': 'https://scrapegraphai.com'}, 'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'type': 'tool_call'}], usage_metadata={'input_tokens': 480, 'output_tokens': 47, 'total_tokens': 527, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_core.runnables import RunnableConfig, chain\n",
"\n",
"prompt = ChatPromptTemplate(\n",
"    [\n",
"        (\n",
"            \"system\",\n",
"            \"You are a helpful assistant that can use tools to extract structured information from websites.\",\n",
"        ),\n",
"        (\"human\", \"{user_input}\"),\n",
"        (\"placeholder\", \"{messages}\"),\n",
"    ]\n",
")\n",
"\n",
"llm_with_tools = llm.bind_tools([smartscraper], tool_choice=smartscraper.name)\n",
"llm_chain = prompt | llm_with_tools\n",
"\n",
"\n",
"@chain\n",
"def tool_chain(user_input: str, config: RunnableConfig):\n",
"    input_ = {\"user_input\": user_input}\n",
"    ai_msg = llm_chain.invoke(input_, config=config)\n",
"    tool_msgs = smartscraper.batch(ai_msg.tool_calls, config=config)\n",
"    return llm_chain.invoke({**input_, \"messages\": [ai_msg, *tool_msgs]}, config=config)\n",
"\n",
"\n",
"tool_chain.invoke(\n",
"    \"What does ScrapeGraph AI do? Extract this information from their website https://scrapegraphai.com\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "4ac8146c",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all ScrapeGraph features and configurations, head to the LangChain API reference: https://python.langchain.com/docs/integrations/tools/scrapegraph\n",
"\n",
"Or see the official SDK repo: https://github.com/ScrapeGraphAI/langchain-scrapegraph"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@ -68,6 +68,9 @@ packages:
  - name: langchain-qdrant
    repo: langchain-ai/langchain
    path: libs/partners/qdrant
  - name: langchain-scrapegraph
    repo: ScrapeGraphAI/langchain-scrapegraph
    path: .
  - name: langchain-sema4
    repo: langchain-ai/langchain-sema4
    path: libs/sema4