mirror of
https://github.com/hwchase17/langchain.git
synced 2025-08-13 14:50:00 +00:00
docs: add scrapeless integration documentation (#32081)
Thank you for contributing to LangChain!

- [x] **PR title**: "package: description"
  - Where "package" is whichever of langchain, core, etc. is being modified. Use "docs: ..." for purely docs changes, "infra: ..." for CI changes.
  - Example: "core: add foobar LLM"
- **Description:** Integrated the Scrapeless package to enable LangChain users to seamlessly incorporate Scrapeless into their agents.
- **Dependencies:** None
- **Twitter handle:** [Scrapelessteam](https://x.com/Scrapelessteam)
- [x] **Add tests and docs**: If you're adding a new integration, you must include:
  1. A test for the integration, preferably unit tests that do not rely on network access,
  2. An example notebook showing its use. It lives in `docs/docs/integrations` directory.
- [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. See [contribution guidelines](https://python.langchain.com/docs/contributing/) for more.

Additional guidelines:

- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to `pyproject.toml` files (even optional ones) unless they are **required** for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.

---------

Co-authored-by: Mason Daugherty <mason@langchain.dev>
Co-authored-by: Mason Daugherty <github@mdrxy.com>
parent
4a2a3fcd43
commit
166c027434
26
docs/docs/integrations/providers/scrapeless.mdx
Normal file
@@ -0,0 +1,26 @@
# Scrapeless

[Scrapeless](https://scrapeless.com) offers flexible and feature-rich data acquisition services with extensive parameter customization and multi-format export support.

## Installation and Setup

```bash
pip install langchain-scrapeless
```

You'll need to set up your Scrapeless API key:

```python
import os
os.environ["SCRAPELESS_API_KEY"] = "your-api-key"
```
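
Hard-coding the key as in the snippet above is fine for a quick test, but it is easy to commit by accident. A minimal sketch of an alternative, assuming a hypothetical helper named `ensure_scrapeless_api_key` (it is not part of `langchain-scrapeless`): it reuses the key if `SCRAPELESS_API_KEY` is already exported and otherwise prompts for it once without echoing the input.

```python
import getpass
import os
from typing import Callable


def ensure_scrapeless_api_key(
    prompt: Callable[[str], str] = getpass.getpass,
    env_var: str = "SCRAPELESS_API_KEY",
) -> str:
    """Hypothetical helper: return the key from the environment, prompting only if unset."""
    key = os.environ.get(env_var)
    if not key:
        # No key exported yet: ask once and keep it for this process only.
        key = prompt(f"Enter your {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling `ensure_scrapeless_api_key()` before creating any of the tools keeps the key out of the notebook or script itself. The `prompt` argument exists only so the prompt can be replaced in non-interactive runs.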

## Tools

The Scrapeless integration provides several tools:

- [ScrapelessDeepSerpGoogleSearchTool](/docs/integrations/tools/scrapeless_scraping_api) - Enables comprehensive extraction of Google SERP data across all result types.
- [ScrapelessDeepSerpGoogleTrendsTool](/docs/integrations/tools/scrapeless_scraping_api) - Retrieves keyword trend data from Google, including popularity over time, regional interest, and related searches.
- [ScrapelessUniversalScrapingTool](/docs/integrations/tools/scrapeless_universal_scraping) - Accesses and extracts data from JavaScript-rendered websites that typically block bots.
- [ScrapelessCrawlerCrawlTool](/docs/integrations/tools/scrapeless_crawl) - Crawls a website and its linked pages to extract comprehensive data.
- [ScrapelessCrawlerScrapeTool](/docs/integrations/tools/scrapeless_crawl) - Extracts information from a single webpage.
446
docs/docs/integrations/tools/scrapeless_crawl.ipynb
Normal file
File diff suppressed because one or more lines are too long
474
docs/docs/integrations/tools/scrapeless_scraping_api.ipynb
Normal file
File diff suppressed because one or more lines are too long
339
docs/docs/integrations/tools/scrapeless_universal_scraping.ipynb
Normal file
@@ -0,0 +1,339 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a6f91f20",
   "metadata": {},
   "source": [
    "# Scrapeless\n",
    "\n",
    "**Scrapeless** offers flexible and feature-rich data acquisition services with extensive parameter customization and multi-format export support. These capabilities empower LangChain to integrate and leverage external data more effectively. The core functional modules include:\n",
    "\n",
    "**DeepSerp**\n",
    "- **Google Search**: Enables comprehensive extraction of Google SERP data across all result types.\n",
    "  - Supports selection of localized Google domains (e.g., `google.com`, `google.ad`) to retrieve region-specific search results.\n",
    "  - Pagination supported for retrieving results beyond the first page.\n",
    "  - Supports a search result filtering toggle to control whether to exclude duplicate or similar content.\n",
    "- **Google Trends**: Retrieves keyword trend data from Google, including popularity over time, regional interest, and related searches.\n",
    "  - Supports multi-keyword comparison.\n",
    "  - Supports multiple data types: `interest_over_time`, `interest_by_region`, `related_queries`, and `related_topics`.\n",
    "  - Allows filtering by specific Google properties (Web, YouTube, News, Shopping) for source-specific trend analysis.\n",
    "\n",
    "**Universal Scraping**\n",
    "- Designed for modern, JavaScript-heavy websites, allowing dynamic content extraction.\n",
    "  - Global premium proxy support for bypassing geo-restrictions and improving reliability.\n",
    "\n",
    "**Crawler**\n",
    "- **Crawl**: Recursively crawl a website and its linked pages to extract site-wide content.\n",
    "  - Supports configurable crawl depth and scoped URL targeting.\n",
    "- **Scrape**: Extract content from a single webpage with high precision.\n",
    "  - Supports \"main content only\" extraction to exclude ads, footers, and other non-essential elements.\n",
    "  - Allows batch scraping of multiple standalone URLs.\n",
    "\n",
    "## Overview\n",
    "\n",
    "### Integration details\n",
    "\n",
    "| Class | Package | Serializable | JS support | Package latest |\n",
    "| :--- | :--- | :---: | :---: | :---: |\n",
    "| [ScrapelessUniversalScrapingTool](https://pypi.org/project/langchain-scrapeless/) | [langchain-scrapeless](https://pypi.org/project/langchain-scrapeless/) | ✅ | ❌ |  |\n",
    "\n",
    "### Tool features\n",
    "\n",
    "|Native async|Returns artifact|Return data|\n",
    "|:-:|:-:|:-:|\n",
    "|✅|✅|html, markdown, links, metadata, structured content|\n",
    "\n",
    "\n",
    "## Setup\n",
    "\n",
    "The integration lives in the `langchain-scrapeless` package."
   ]
  },
  {
   "cell_type": "raw",
   "id": "ca676665",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "!pip install langchain-scrapeless"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b15e9266",
   "metadata": {},
   "source": [
    "### Credentials\n",
    "\n",
    "You'll need a Scrapeless API key to use this tool. You can set it as an environment variable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e0b178a2-8816-40ca-b57c-ccdd86dde9c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"SCRAPELESS_API_KEY\"] = \"your-api-key\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c97218f-f366-479d-8bf7-fe9f2f6df73f",
   "metadata": {},
   "source": [
    "## Instantiation\n",
    "\n",
    "Here we show how to instantiate the Scrapeless Universal Scraping Tool. This tool allows you to scrape any website using a headless browser with JavaScript rendering capabilities, customizable output types, and geo-specific proxy support.\n",
    "\n",
    "The tool accepts the following parameters, which are passed when the tool is invoked (see the Invocation examples):\n",
    "- `url` (required, str): The URL of the website to scrape.\n",
    "- `headless` (optional, bool): Whether to use a headless browser. Default is `True`.\n",
    "- `js_render` (optional, bool): Whether to enable JavaScript rendering. Default is `True`.\n",
    "- `js_wait_until` (optional, str): Defines when to consider the JavaScript-rendered page ready. Default is `'domcontentloaded'`. Options include:\n",
    "  - `load`: Wait until the page is fully loaded.\n",
    "  - `domcontentloaded`: Wait until the DOM is fully loaded.\n",
    "  - `networkidle0`: Wait until the network is idle.\n",
    "  - `networkidle2`: Wait until the network is idle for 2 seconds.\n",
    "- `outputs` (optional, str): The specific type of data to extract from the page. Options include:\n",
    "  - `phone_numbers`\n",
    "  - `headings`\n",
    "  - `images`\n",
    "  - `audios`\n",
    "  - `videos`\n",
    "  - `links`\n",
    "  - `menus`\n",
    "  - `hashtags`\n",
    "  - `emails`\n",
    "  - `metadata`\n",
    "  - `tables`\n",
    "  - `favicon`\n",
    "- `response_type` (optional, str): Defines the format of the response. Default is `'html'`. Options include:\n",
    "  - `html`: Return the raw HTML of the page.\n",
    "  - `plaintext`: Return the plain text content.\n",
    "  - `markdown`: Return a Markdown version of the page.\n",
    "  - `png`: Return a PNG screenshot.\n",
    "  - `jpeg`: Return a JPEG screenshot.\n",
    "- `response_image_full_page` (optional, bool): Whether to capture and return a full-page image when using screenshot output (`png` or `jpeg`). Default is `False`.\n",
    "- `selector` (optional, str): A specific CSS selector to scope scraping within a part of the page. Default is `None`.\n",
    "- `proxy_country` (optional, str): Two-letter country code for geo-specific proxy access (e.g., `'us'`, `'gb'`, `'de'`, `'jp'`). Default is `'ANY'`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74147a1a",
   "metadata": {},
   "source": [
    "## Invocation\n",
    "\n",
    "### Basic Usage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65310a8b-eb0c-4d9e-a618-4f4abe2414fc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<!DOCTYPE html><html><head>\n",
      " <title>Example Domain</title>\n",
      "\n",
      " <meta charset=\"utf-8\">\n",
      " <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\">\n",
      " <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n",
      " <style type=\"text/css\">\n",
      " body {\n",
      " background-color: #f0f0f2;\n",
      " margin: 0;\n",
      " padding: 0;\n",
      " font-family: -apple-system, system-ui, BlinkMacSystemFont, \"Segoe UI\", \"Open Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;\n",
      " \n",
      " }\n",
      " div {\n",
      " width: 600px;\n",
      " margin: 5em auto;\n",
      " padding: 2em;\n",
      " background-color: #fdfdff;\n",
      " border-radius: 0.5em;\n",
      " box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n",
      " }\n",
      " a:link, a:visited {\n",
      " color: #38488f;\n",
      " text-decoration: none;\n",
      " }\n",
      " @media (max-width: 700px) {\n",
      " div {\n",
      " margin: 0 auto;\n",
      " width: auto;\n",
      " }\n",
      " }\n",
      " </style> \n",
      "</head>\n",
      "\n",
      "<body>\n",
      "<div>\n",
      " <h1>Example Domain</h1>\n",
      " <p>This domain is for use in illustrative examples in documents. You may use this\n",
      " domain in literature without prior coordination or asking for permission.</p>\n",
      " <p><a href=\"https://www.iana.org/domains/example\">More information...</a></p>\n",
      "</div>\n",
      "\n",
      "\n",
      "</body></html>\n"
     ]
    }
   ],
   "source": [
    "from langchain_scrapeless import ScrapelessUniversalScrapingTool\n",
    "\n",
    "tool = ScrapelessUniversalScrapingTool()\n",
    "\n",
    "# Basic usage\n",
    "result = tool.invoke(\"https://example.com\")\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6e73897",
   "metadata": {},
   "source": [
    "### Advanced Usage with Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f90e33a7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# Well hello there.\n",
      "\n",
      "Welcome to exmaple.com.\n",
      "Chances are you got here by mistake (example.com, anyone?)\n"
     ]
    }
   ],
   "source": [
    "from langchain_scrapeless import ScrapelessUniversalScrapingTool\n",
    "\n",
    "tool = ScrapelessUniversalScrapingTool()\n",
    "\n",
    "result = tool.invoke({\"url\": \"https://exmaple.com\", \"response_type\": \"markdown\"})\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "659f9fbd-6fcf-445f-aa8c-72d8e60154bd",
   "metadata": {},
   "source": [
    "### Use within an agent"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af3123ad-7a02-40e5-b58e-7d56e23e5830",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================\u001b[1m Human Message \u001b[0m=================================\n",
      "\n",
      "Use the scrapeless scraping tool to fetch https://www.scrapeless.com/en and extract the h1 tag.\n",
      "==================================\u001b[1m Ai Message \u001b[0m==================================\n",
      "Tool Calls:\n",
      " scrapeless_universal_scraping (call_jBrvMVL2ixhvf6gklhi7Gqtb)\n",
      " Call ID: call_jBrvMVL2ixhvf6gklhi7Gqtb\n",
      " Args:\n",
      " url: https://www.scrapeless.com/en\n",
      " outputs: headings\n",
      "=================================\u001b[1m Tool Message \u001b[0m=================================\n",
      "Name: scrapeless_universal_scraping\n",
      "\n",
      "{\"headings\":[\"Effortless Web Scraping Toolkitfor Business and Developers\",\"4.8\",\"4.5\",\"8.5\",\"A Flexible Toolkit for Accessing Public Web Data\",\"Deep SerpApi\",\"Scraping Browser\",\"Universal Scraping API\",\"Customized Services\",\"From Simple Data Scraping to Complex Anti-Bot Challenges, Scrapeless Has You Covered.\",\"Fully Compatible with Key Programming Languages and Tools\",\"Enterprise-level Data Scraping Solution\",\"Customized Data Scraping Solutions\",\"High Concurrency and High-Performance Scraping\",\"Data Cleaning and Transformation\",\"Real-Time Data Push and API Integration\",\"Data Security and Privacy Protection\",\"Enterprise-level SLA\",\"Why Scrapeless: Simplify Your Data Flow Effortlessly.\",\"Articles\",\"Organized Fresh Data\",\"Prices\",\"No need to hassle with browser maintenance\",\"Reviews\",\"Only pay for successful requests\",\"Products\",\"Fully scalable\",\"Unleash Your Competitive Edgein Data within the Industry\",\"Regulate Compliance for All Users\",\"Web Scraping Blog\",\"Scrapeless MCP Server Is Officially Live! Build Your Ultimate AI-Web Connector\",\"Product Updates | New Profile Feature\",\"How to Track Your Ranking on ChatGPT?\",\"For Scraping\",\"For Data\",\"For AI\",\"Top Scraper API\",\"Learning Center\",\"Legal\"]}\n",
      "==================================\u001b[1m Ai Message \u001b[0m==================================\n",
      "\n",
      "The h1 tag extracted from the website https://www.scrapeless.com/en is \"Effortless Web Scraping Toolkit for Business and Developers\".\n"
     ]
    }
   ],
   "source": [
    "from langchain_openai import ChatOpenAI\n",
    "from langchain_scrapeless import ScrapelessUniversalScrapingTool\n",
    "from langgraph.prebuilt import create_react_agent\n",
    "\n",
    "llm = ChatOpenAI()\n",
    "\n",
    "tool = ScrapelessUniversalScrapingTool()\n",
    "\n",
    "# Use the tool with an agent\n",
    "tools = [tool]\n",
    "agent = create_react_agent(llm, tools)\n",
    "\n",
    "for chunk in agent.stream(\n",
    "    {\n",
    "        \"messages\": [\n",
    "            (\n",
    "                \"human\",\n",
    "                \"Use the scrapeless scraping tool to fetch https://www.scrapeless.com/en and extract the h1 tag.\",\n",
    "            )\n",
    "        ]\n",
    "    },\n",
    "    stream_mode=\"values\",\n",
    "):\n",
    "    chunk[\"messages\"][-1].pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ac8146c",
   "metadata": {},
   "source": [
    "## API reference\n",
    "\n",
    "- [Scrapeless Documentation](https://docs.scrapeless.com/en/universal-scraping-api/quickstart/introduction/)\n",
    "- [Scrapeless API Reference](https://apidocs.scrapeless.com/api-12948840)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
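
The notebook above lists a fixed set of accepted values for `js_wait_until`, `outputs`, and `response_type`, so a typo in one of them can be caught locally before a request is sent. A minimal sketch, assuming a hypothetical helper named `build_scrape_args` (it is not part of `langchain-scrapeless`, and whether the tool performs its own validation is not stated in the notebook): it checks the values against the documented options and returns a dict of the same shape the notebook passes to `tool.invoke`.

```python
from typing import Any, Dict, Optional

# Option sets copied from the parameter list in the notebook above.
JS_WAIT_UNTIL = {"load", "domcontentloaded", "networkidle0", "networkidle2"}
RESPONSE_TYPES = {"html", "plaintext", "markdown", "png", "jpeg"}
OUTPUTS = {
    "phone_numbers", "headings", "images", "audios", "videos", "links",
    "menus", "hashtags", "emails", "metadata", "tables", "favicon",
}


def build_scrape_args(
    url: str,
    response_type: str = "html",
    js_wait_until: str = "domcontentloaded",
    outputs: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate options and build the dict for tool.invoke()."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"url must start with http:// or https://, got {url!r}")
    if response_type not in RESPONSE_TYPES:
        raise ValueError(f"unknown response_type: {response_type!r}")
    if js_wait_until not in JS_WAIT_UNTIL:
        raise ValueError(f"unknown js_wait_until: {js_wait_until!r}")

    args: Dict[str, Any] = {
        "url": url,
        "response_type": response_type,
        "js_wait_until": js_wait_until,
    }
    # Only include `outputs` when the caller asked for a specific data type.
    if outputs is not None:
        if outputs not in OUTPUTS:
            raise ValueError(f"unknown outputs value: {outputs!r}")
        args["outputs"] = outputs
    return args
```

With a tool instance created as in the notebook, the result could be passed straight through, for example `tool.invoke(build_scrape_args("https://example.com", response_type="markdown"))`. Keeping the option sets in one place also makes it easy to update them if the Scrapeless API adds values.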
@@ -716,4 +716,7 @@ packages:
- name: toolbox-langchain
  repo: googleapis/mcp-toolbox-sdk-python
  path: packages/toolbox-langchain
- name: langchain-scrapeless
  repo: scrapeless-ai/langchain-scrapeless
  path: .