Anderson 2025-07-29 08:55:31 +08:00 committed by GitHub
commit 2b18806561
5 changed files with 1015 additions and 0 deletions


@@ -0,0 +1,27 @@
# Scrapeless
[Scrapeless](https://scrapeless.com) offers flexible and feature-rich data acquisition services with extensive parameter customization and multi-format export support.
## Installation and Setup
```bash
pip install langchain-scrapeless
```
You'll need to set up your Scrapeless API key:
```python
import os
os.environ["SCRAPELESS_API_KEY"] = "your-api-key"
```
## Tools
The Scrapeless integration provides several tools:
- [ScrapelessDeepSerpGoogleSearchTool](/docs/integrations/tools/scrapeless_scraping_api) - Enables comprehensive extraction of Google SERP data across all result types.
- [ScrapelessDeepSerpGoogleTrendsTool](/docs/integrations/tools/scrapeless_scraping_api) - Retrieves keyword trend data from Google, including popularity over time, regional interest, and related searches.
- [ScrapelessUniversalScrapingTool](/docs/integrations/tools/scrapeless_universal_scraping) - Access and extract data from JavaScript-rendered websites that typically block bots.
- [ScrapelessCrawlerCrawlTool](/docs/integrations/tools/scrapeless_crawl) - Crawl a website and its linked pages to extract comprehensive data.
- [ScrapelessCrawlerScrapeTool](/docs/integrations/tools/scrapeless_crawl) - Extract information from a single webpage.
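## Quick Start
Once the API key is set, any of the tools can be used directly or within an agent. A minimal sketch, mirroring the basic-usage example in the tool notebooks, using `ScrapelessUniversalScrapingTool`:
```python
from langchain_scrapeless import ScrapelessUniversalScrapingTool

tool = ScrapelessUniversalScrapingTool()

# Fetch a page through Scrapeless's headless browser
result = tool.invoke("https://example.com")
print(result)
```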


@@ -0,0 +1,349 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a6f91f20",
"metadata": {},
"source": [
"# Scrapeless\n",
"\n",
"**Scrapeless** offers flexible and feature-rich data acquisition services with extensive parameter customization and multi-format export support. These capabilities empower LangChain to integrate and leverage external data more effectively. The core functional modules include:\n",
"\n",
"**DeepSerp**\n",
"- **Google Search**: Enables comprehensive extraction of Google SERP data across all result types.\n",
" - Supports selection of localized Google domains (e.g., google.com, google.ad) to retrieve region-specific search results.\n",
" - Pagination supported for retrieving results beyond the first page.\n",
" - Supports a search result filtering toggle to control whether to exclude duplicate or similar content.\n",
"- **Google Trends**: Retrieves keyword trend data from Google, including popularity over time, regional interest, and related searches.\n",
" - Supports multi-keyword comparison.\n",
" - Supports multiple data types: interest_over_time, interest_by_region, related_queries, and related_topics.\n",
" - Allows filtering by specific Google properties (Web, YouTube, News, Shopping) for source-specific trend analysis.\n",
"\n",
"**Universal Scraping**\n",
"- Designed for modern, JavaScript-heavy websites, allowing dynamic content extraction.\n",
" - Global premium proxy support for bypassing geo-restrictions and improving reliability.\n",
"\n",
"**Crawler**\n",
"- **Crawl**: Recursively crawl a website and its linked pages to extract site-wide content.\n",
" - Supports configurable crawl depth and scoped URL targeting.\n",
"- **Scrape**: Extract content from a single webpage with high precision.\n",
" - Supports \"main content only\" extraction to exclude ads, footers, and other non-essential elements.\n",
" - Allows batch scraping of multiple standalone URLs.\n",
"\n",
"## Overview\n",
"\n",
"### Integration details\n",
"\n",
"| Class | Package | Serializable | JS support | Package latest |\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [ScrapelessCrawlerScrapeTool](https://pypi.org/project/langchain-scrapeless/) | [langchain-scrapeless](https://pypi.org/project/langchain-scrapeless/) | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapeless?style=flat-square&label=%20) |\n",
"| [ScrapelessCrawlerCrawlTool](https://pypi.org/project/langchain-scrapeless/) | [langchain-scrapeless](https://pypi.org/project/langchain-scrapeless/) | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapeless?style=flat-square&label=%20) |\n",
"\n",
"### Tool features\n",
"\n",
"|Native async|Returns artifact|Return data|\n",
"|:-:|:-:|:-:|\n",
"|✅|✅|markdown, rawHtml, screenshot@fullPage, json, links, screenshot, html|\n",
"\n",
"\n",
"## Setup\n",
"\n",
"The integration lives in the `langchain-scrapeless` package."
]
},
{
"cell_type": "raw",
"id": "ca676665",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"pip install langchain-scrapeless"
]
},
{
"cell_type": "markdown",
"id": "b15e9266",
"metadata": {},
"source": [
"### Credentials\n",
"\n",
"You'll need a Scrapeless API key to use this tool. You can set it as an environment variable:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e0b178a2-8816-40ca-b57c-ccdd86dde9c9",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"SCRAPELESS_API_KEY\"] = \"your-api-key\""
]
},
{
"cell_type": "markdown",
"id": "1c97218f-f366-479d-8bf7-fe9f2f6df73f",
"metadata": {},
"source": [
"## Instantiation\n",
"\n",
"### ScrapelessCrawlerScrapeTool\n",
"\n",
"The ScrapelessCrawlerScrapeTool allows you to scrape content from one or multiple websites using Scrapelesss Crawler Scrape API. You can extract the main content, control formatting, headers, wait times, and output types.\n",
"\n",
"\n",
"The tool accepts the following parameters:\n",
"- `urls` (required, List[str]): One or more URLs of websites you want to scrape.\n",
"- `formats` (optional, List[str]): Defines the format(s) of the scraped output. Default is [\"markdown\"]. Options include:\n",
" - \"markdown\"\n",
" - \"rawHtml\"\n",
" - \"screenshot@fullPage\"\n",
" - \"json\"\n",
" - \"links\"\n",
" - \"screenshot\"\n",
" - \"html\"\n",
"- `only_main_content` (optional, bool): Whether to return only the main page content, excluding headers, navs, footers, etc. Default is True.\n",
"- `include_tags` (optional, List[str]): A list of HTML tags to include in the output (e.g., [\"h1\", \"p\"]). If set to None, no tags are explicitly included.\n",
"- `exclude_tags` (optional, List[str]): A list of HTML tags to exclude from the output. If set to None, no tags are explicitly excluded.\n",
"- `headers` (optional, Dict[str, str]): Custom headers to send with the request (e.g., for cookies or user-agent). Default is None.\n",
"- `wait_for` (optional, int): Time to wait in milliseconds before scraping. Useful for giving the page time to fully load. Default is 0.\n",
"- `timeout` (optional, int): Request timeout in milliseconds. Default is 30000.\n",
"\n",
"### ScrapelessCrawlerCrawlTool\n",
"\n",
"The ScrapelessCrawlerCrawlTool allows you to crawl a website starting from a base URL using Scrapelesss Crawler Crawl API. It supports advanced filtering of URLs, crawl depth control, content scraping options, headers customization, and more.\n",
"\n",
"The tool accepts the following parameters:\n",
"- `url` (required, str): The base URL to start crawling from.\n",
"\n",
"- `limit` (optional, int): Maximum number of pages to crawl. Default is 10000.\n",
"- `include_paths` (optional, List[str]): URL pathname regex patterns to include matching URLs in the crawl. Only URLs matching these patterns will be included. For example, setting [\"blog/.*\"] will only include URLs under the /blog/ path. Default is None.\n",
"- `exclude_paths` (optional, List[str]): URL pathname regex patterns to exclude matching URLs from the crawl. For example, setting [\"blog/.*\"] will exclude URLs under the /blog/ path. Default is None.\n",
"- `max_depth` (optional, int): Maximum crawl depth relative to the base URL, measured by the number of slashes in the URL path. Default is 10.\n",
"- `max_discovery_depth` (optional, int): Maximum crawl depth based on discovery order. Root and sitemapped pages have depth 0. For example, setting to 1 and ignoring sitemap will crawl only the entered URL and its immediate links. Default is None.\n",
"- `ignore_sitemap` (optional, bool): Whether to ignore the website sitemap during crawling. Default is False.\n",
"- `ignore_query_params` (optional, bool): Whether to ignore query parameter differences to avoid re-scraping similar URLs. Default is False.\n",
"- `deduplicate_similar_urls` (optional, bool): Whether to deduplicate similar URLs. Default is True.\n",
"- `regex_on_full_url` (optional, bool): Whether regex matching applies to the full URL instead of just the path. Default is True.\n",
"- `allow_backward_links` (optional, bool): Whether to allow crawling backlinks outside the URL hierarchy. Default is False.\n",
"- `allow_external_links` (optional, bool): Whether to allow crawling links to external websites. Default is False.\n",
"- `delay` (optional, int): Delay in seconds between page scrapes to respect rate limits. Default is 1.\n",
"- `formats` (optional, List[str]): The format(s) of the scraped content. Default is [\"markdown\"]. Options include:\n",
" - \"markdown\"\n",
" - \"rawHtml\"\n",
" - \"screenshot@fullPage\"\n",
" - \"json\"\n",
" - \"links\"\n",
" - \"screenshot\"\n",
" - \"html\"\n",
"- `only_main_content` (optional, bool): Whether to return only the main content, excluding headers, navigation bars, footers, etc. Default is True.\n",
"- `include_tags` (optional, List[str]): List of HTML tags to include in the output (e.g., [\"h1\", \"p\"]). Default is None (no explicit include filter).\n",
"- `exclude_tags` (optional, List[str]): List of HTML tags to exclude from the output. Default is None (no explicit exclude filter).\n",
"- `headers` (optional, Dict[str, str]): Custom HTTP headers to send with the requests, such as cookies or user-agent strings. Default is None.\n",
"- `wait_for` (optional, int): Time in milliseconds to wait before scraping the content, allowing the page to load fully. Default is 0.\n",
"- `timeout` (optional, int):Request timeout in milliseconds. Default is 30000."
]
},
{
"cell_type": "markdown",
"id": "74147a1a",
"metadata": {},
"source": [
"## Invocation\n",
"\n",
"### ScrapelessCrawlerCrawlTool\n",
"\n",
"#### Usage with Parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "65310a8b-eb0c-4d9e-a618-4f4abe2414fc",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scrapeless import ScrapelessCrawlerCrawlTool\n",
"\n",
"tool = ScrapelessCrawlerCrawlTool()\n",
"\n",
"# Advanced usage\n",
"result = tool.invoke({\"url\": \"https://exmaple.com\", \"limit\": 4})\n",
"print(result)"
]
},
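{
"cell_type": "markdown",
"id": "b7d2f310",
"metadata": {},
"source": [
"#### Advanced Usage with Parameters\n",
"\n",
"A minimal sketch combining several of the documented crawl parameters (`limit`, `include_paths`, `max_depth`, `formats`). The URL and path pattern are illustrative; what the crawl returns depends on the target site's structure."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7d2f311",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scrapeless import ScrapelessCrawlerCrawlTool\n",
"\n",
"tool = ScrapelessCrawlerCrawlTool()\n",
"\n",
"# Crawl at most 10 pages under /blog/, no more than two levels deep,\n",
"# returning markdown plus the links discovered on each page\n",
"result = tool.invoke(\n",
"    {\n",
"        \"url\": \"https://example.com\",\n",
"        \"limit\": 10,\n",
"        \"include_paths\": [\"blog/.*\"],\n",
"        \"max_depth\": 2,\n",
"        \"formats\": [\"markdown\", \"links\"],\n",
"    }\n",
")\n",
"print(result)"
]
},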
{
"cell_type": "markdown",
"id": "659f9fbd-6fcf-445f-aa8c-72d8e60154bd",
"metadata": {},
"source": [
"#### Use within an agent"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af3123ad-7a02-40e5-b58e-7d56e23e5830",
"metadata": {},
"outputs": [],
"source": [
"from langchain_openai import ChatOpenAI\n",
"from langchain_scrapeless import ScrapelessCrawlerCrawlTool\n",
"from langgraph.prebuilt import create_react_agent\n",
"\n",
"llm = ChatOpenAI()\n",
"\n",
"tool = ScrapelessCrawlerCrawlTool()\n",
"\n",
"# Use the tool with an agent\n",
"tools = [tool]\n",
"agent = create_react_agent(llm, tools)\n",
"\n",
"for chunk in agent.stream(\n",
" {\n",
" \"messages\": [\n",
" (\n",
" \"human\",\n",
" \"Use the scrapeless crawler crawl tool to crawl the website https://example.com and output the markdown content as a string.\",\n",
" )\n",
" ]\n",
" },\n",
" stream_mode=\"values\",\n",
"):\n",
" chunk[\"messages\"][-1].pretty_print()"
]
},
{
"cell_type": "markdown",
"id": "769b2246",
"metadata": {},
"source": [
"### ScrapelessCrawlerScrapeTool\n",
"\n",
"#### Usage with Parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ca993de7",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scrapeless import ScrapelessDeepSerpGoogleTrendsTool\n",
"\n",
"tool = ScrapelessDeepSerpGoogleTrendsTool()\n",
"\n",
"# Basic usage\n",
"result = tool.invoke(\"Funny 2048,negamon monster trainer\")\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"id": "1a4db36f",
"metadata": {},
"source": [
"#### Advanced Usage with Parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "42c83c46",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scrapeless import ScrapelessCrawlerScrapeTool\n",
"\n",
"tool = ScrapelessCrawlerScrapeTool()\n",
"\n",
"result = tool.invoke(\n",
" {\n",
" \"urls\": [\"https://exmaple.com\", \"https://www.scrapeless.com/en\"],\n",
" \"formats\": [\"markdown\"],\n",
" }\n",
")\n",
"print(result)"
]
},
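{
"cell_type": "markdown",
"id": "b7d2f312",
"metadata": {},
"source": [
"#### Filtering Page Content\n",
"\n",
"A sketch of the documented filtering options (`only_main_content`, `exclude_tags`, `wait_for`). The tag list and wait time are illustrative values for a typical article page, not required settings."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7d2f313",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scrapeless import ScrapelessCrawlerScrapeTool\n",
"\n",
"tool = ScrapelessCrawlerScrapeTool()\n",
"\n",
"# Keep only the main article body, drop navigation and footer markup,\n",
"# and give the page two seconds to finish loading before scraping\n",
"result = tool.invoke(\n",
"    {\n",
"        \"urls\": [\"https://www.scrapeless.com/en\"],\n",
"        \"formats\": [\"markdown\"],\n",
"        \"only_main_content\": True,\n",
"        \"exclude_tags\": [\"nav\", \"footer\"],\n",
"        \"wait_for\": 2000,\n",
"    }\n",
")\n",
"print(result)"
]
},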
{
"cell_type": "markdown",
"id": "7dde00ff",
"metadata": {},
"source": [
"#### Use within an agent"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6ca1aff",
"metadata": {},
"outputs": [],
"source": [
"from langchain_openai import ChatOpenAI\n",
"from langchain_scrapeless import ScrapelessCrawlerScrapeTool\n",
"from langgraph.prebuilt import create_react_agent\n",
"\n",
"llm = ChatOpenAI()\n",
"\n",
"tool = ScrapelessCrawlerScrapeTool()\n",
"\n",
"# Use the tool with an agent\n",
"tools = [tool]\n",
"agent = create_react_agent(llm, tools)\n",
"\n",
"for chunk in agent.stream(\n",
" {\n",
" \"messages\": [\n",
" (\n",
" \"human\",\n",
" \"Use the scrapeless crawler scrape tool to get the website content of https://example.com and output the html content as a string.\",\n",
" )\n",
" ]\n",
" },\n",
" stream_mode=\"values\",\n",
"):\n",
" chunk[\"messages\"][-1].pretty_print()"
]
},
{
"cell_type": "markdown",
"id": "4ac8146c",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"- [Scrapeless Documentation](https://docs.scrapeless.com/en/crawl/quickstart/introduction/)\n",
"- [Scrapeless API Reference](https://apidocs.scrapeless.com/api-17509003)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,384 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a6f91f20",
"metadata": {},
"source": [
"# Scrapeless\n",
"\n",
"**Scrapeless** offers flexible and feature-rich data acquisition services with extensive parameter customization and multi-format export support. These capabilities empower LangChain to integrate and leverage external data more effectively. The core functional modules include:\n",
"\n",
"**DeepSerp**\n",
"- **Google Search**: Enables comprehensive extraction of Google SERP data across all result types.\n",
" - Supports selection of localized Google domains (e.g., google.com, google.ad) to retrieve region-specific search results.\n",
" - Pagination supported for retrieving results beyond the first page.\n",
" - Supports a search result filtering toggle to control whether to exclude duplicate or similar content.\n",
"- **Google Trends**: Retrieves keyword trend data from Google, including popularity over time, regional interest, and related searches.\n",
" - Supports multi-keyword comparison.\n",
" - Supports multiple data types: interest_over_time, interest_by_region, related_queries, and related_topics.\n",
" - Allows filtering by specific Google properties (Web, YouTube, News, Shopping) for source-specific trend analysis.\n",
"\n",
"**Universal Scraping**\n",
"- Designed for modern, JavaScript-heavy websites, allowing dynamic content extraction.\n",
" - Global premium proxy support for bypassing geo-restrictions and improving reliability.\n",
"\n",
"**Crawler**\n",
"- **Crawl**: Recursively crawl a website and its linked pages to extract site-wide content.\n",
" - Supports configurable crawl depth and scoped URL targeting.\n",
"- **Scrape**: Extract content from a single webpage with high precision.\n",
" - Supports \"main content only\" extraction to exclude ads, footers, and other non-essential elements.\n",
" - Allows batch scraping of multiple standalone URLs.\n",
"\n",
"## Overview\n",
"\n",
"### Integration details\n",
"\n",
"| Class | Package | Serializable | JS support | Package latest |\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [ScrapelessDeepSerpGoogleSearchTool](https://pypi.org/project/langchain-scrapeless/) | [langchain-scrapeless](https://pypi.org/project/langchain-scrapeless/) | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapeless?style=flat-square&label=%20) |\n",
"| [ScrapelessDeepSerpGoogleTrendsTool](https://pypi.org/project/langchain-scrapeless/) | [langchain-scrapeless](https://pypi.org/project/langchain-scrapeless/) | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapeless?style=flat-square&label=%20) |\n",
"\n",
"### Tool features\n",
"\n",
"|Native async|Returns artifact|Return data|\n",
"|:-:|:-:|:-:|\n",
"|✅|❌|Search Results Based on Tool|\n",
"\n",
"\n",
"## Setup\n",
"\n",
"The integration lives in the `langchain-scrapeless` package."
]
},
{
"cell_type": "raw",
"id": "ca676665",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"pip install langchain-scrapeless"
]
},
{
"cell_type": "markdown",
"id": "b15e9266",
"metadata": {},
"source": [
"### Credentials\n",
"\n",
"You'll need a Scrapeless API key to use this tool. You can set it as an environment variable:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e0b178a2-8816-40ca-b57c-ccdd86dde9c9",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"SCRAPELESS_API_KEY\"] = \"your-api-key\""
]
},
{
"cell_type": "markdown",
"id": "1c97218f-f366-479d-8bf7-fe9f2f6df73f",
"metadata": {},
"source": [
"## Instantiation\n",
"\n",
"### ScrapelessDeepSerpGoogleSearchTool\n",
"\n",
"Here we show how to instantiate an instance of the ScrapelessDeepSerpGoogleSearchTool. The universal Information Search Engine allows you to retrieve any data information.\n",
"- Retrieves any data information.\n",
"- Handles explanatory queries (e.g., \"why\", \"how\").\n",
"- Supports comparative analysis requests.\n",
"\n",
"The tool accepts the following parameters:\n",
"- `q`: (str) The search query string. Supports advanced Google syntax like inurl:, site:, intitle:, as_eq, etc.\n",
"- `hl`: (str) Language code for result content, e.g., en, es, fr. Default: \"en\".\n",
"- `gl`: (str) Country code for geo-specific result targeting, e.g., us, uk, de. Default: \"us\".\n",
"- `google_domain`: (str) Which Google domain to use (e.g., google.com, google.co.jp). Default: \"google.com\".\n",
"- `start`: (int) Defines the result offset. It skips the given number of results. Used for pagination. Examples:\n",
" - 0 (default): the first page of results\n",
" - 10: the second page\n",
" - 20: the third page\n",
"- `num`: (int) Defines the maximum number of results to return. Examples:\n",
" - 10 (default): returns 10 results\n",
" - 40: returns 40 results\n",
" - 100: returns 100 results\n",
"- `ludocid`: (str) Defines the ID (CID) of the Google My Business listing you want to scrape. Also known as Google Place ID.\n",
"- `kgmid`: (str) Defines the ID (KGMID) of the Google Knowledge Graph listing you want to scrape. Also known as Google Knowledge Graph ID. Searches with the kgmid parameter will return results for the originally encrypted search parameters. For some searches, kgmid may override all other parameters except start and num.\n",
"- `ibp`: (str) Responsible for rendering layouts and expansions for some elements. Example: gwp;0,7 to expand searches with ludocid for expanded knowledge graph.\n",
"- `cr`: (str) Defines one or multiple countries to limit the search to. Uses format country{two-letter country code}, separated by |. Example:\n",
" - countryFR|countryDE only searches French and German pages.\n",
"- `lr`: (str) Defines one or multiple languages to limit the search to. Uses format lang_{two-letter language code}, separated by |. Example:\n",
" - lang_fr|lang_de only searches French and German pages.\n",
"- `tbs`: (str) Defines advanced search parameters not possible in the regular query field. Examples include advanced search for:\n",
" - patents\n",
" - dates\n",
" - news\n",
" - videos\n",
" - images\n",
" - apps\n",
" - text contents\n",
"- `safe`: (str) Defines the level of filtering for adult content. Values:\n",
" - active: blur explicit content\n",
" - off: no filtering\n",
"- `nfpr`: (str) Defines exclusion of results from auto-corrected queries when the original query is misspelled. Values:\n",
" - 1: exclude these results\n",
" - 0 (default): include them\n",
" - Note: This may not prevent Google from returning auto-corrected results if no other results are available.\n",
"- `filter`: (str) Defines if “Similar Results” and “Omitted Results” filters are on or off. Values:\n",
" - 1 (default): enable filters\n",
" - 0: disable filters\n",
"- `tbm`: (str) Defines the type of search to perform. Values:\n",
" - none: regular Google Search\n",
" - isch: Google Images\n",
" - lcl: Google Local\n",
" - vid: Google Videos\n",
" - nws: Google News\n",
" - shop: Google Shopping\n",
" - pts: Google Patents\n",
" - jobs: Google Jobs\n",
"\n",
"\n",
"### ScrapelessDeepSerpGoogleTrendsTool\n",
"\n",
"Here we show how to instantiate an instance of the ScrapelessDeepSerpGoogleTrendsTool. This tool allows you to query real-time or historical trend data from Google Trends with fine control over locale, category, and result type, using the Scrapeless API.\n",
"\n",
"The tool accepts the following parameters:\n",
"- `q` (required, str): Parameter defines the query or queries you want to search. You can use anything that you would use in a regular Google Trends search. The maximum number of queries per search is **5**. (This only applies to `interest_over_time` and `compared_breakdown_by_region` data types.) Other types of data will only accept **1 query** per search.\n",
"- `data_type` (optional, str): The type of data to retrieve. Default is \"interest_over_time\". Options include:\n",
" - \"autocomplete\"\n",
" - \"interest_over_time\"\n",
" - \"compared_breakdown_by_region\"\n",
" - \"interest_by_subregion\"\n",
" - \"related_queries\"\n",
" - \"related_topics\"\n",
"- `date` (optional, str): Defines the date range to fetch data for. Default is \"today 1-m\". Supported formats:\n",
" - Relative: \"now 1-H\", \"now 7-d\", \"today 12-m\", \"today 5-y\", \"all\" \n",
" - Custom date ranges: \"2023-01-01 2023-12-31\"\n",
" - With hours: \"2023-07-01T10 2023-07-03T22\"\n",
"- `hl` (optional, str): Language code to use in the search. Default is \"en\". Examples:\n",
" - \"es\" (Spanish)\n",
" - \"fr\" (French)\n",
"- `tz` (optional, str):Time zone offset. Default is \"420\" (PST).\n",
"- `geo` (optional, str): Two-letter country code to define the geographic origin of the search. Examples include:\n",
" - \"US\" (United States)\n",
" - \"GB\" (United Kingdom)\n",
" - \"JP\" (Japan)\n",
" - Leave empty or None for worldwide search.\n",
"- `cat` (optional, CategoryEnum): Category ID to narrow down the search context. Default is \"all_categories\" (0). Categories can include:\n",
" - \"0\" All categories\n",
" - Others like \"3\" News, \"29\" Sports, etc."
]
},
{
"cell_type": "markdown",
"id": "74147a1a",
"metadata": {},
"source": [
"## Invocation\n",
"\n",
"### ScrapelessDeepSerpGoogleSearchTool\n",
"\n",
"#### Basic Usage"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "65310a8b-eb0c-4d9e-a618-4f4abe2414fc",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scrapeless import ScrapelessDeepSerpGoogleSearchTool\n",
"\n",
"tool = ScrapelessDeepSerpGoogleSearchTool()\n",
"\n",
"# Basic usage\n",
"result = tool.invoke(\"I want to know Scrapeless\")\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"id": "d6e73897",
"metadata": {},
"source": [
"#### Advanced Usage with Parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f90e33a7",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scrapeless import ScrapelessDeepSerpGoogleSearchTool\n",
"\n",
"tool = ScrapelessDeepSerpGoogleSearchTool()\n",
"\n",
"# Advanced usage\n",
"result = tool.invoke({\"q\": \"Scrapeless\", \"hl\": \"en\", \"google_domain\": \"google.com\"})\n",
"print(result)"
]
},
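{
"cell_type": "markdown",
"id": "c8e4a120",
"metadata": {},
"source": [
"#### Pagination and Search Types\n",
"\n",
"A sketch of the documented pagination and search-type parameters (`start`, `num`, `tbm`, `gl`). The query is illustrative; `start=10` requests the second page of results."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8e4a121",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scrapeless import ScrapelessDeepSerpGoogleSearchTool\n",
"\n",
"tool = ScrapelessDeepSerpGoogleSearchTool()\n",
"\n",
"# Second page of Google News results, targeted at a UK audience\n",
"result = tool.invoke(\n",
"    {\n",
"        \"q\": \"web scraping\",\n",
"        \"tbm\": \"nws\",\n",
"        \"gl\": \"uk\",\n",
"        \"start\": 10,\n",
"        \"num\": 10,\n",
"    }\n",
")\n",
"print(result)"
]
},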
{
"cell_type": "markdown",
"id": "659f9fbd-6fcf-445f-aa8c-72d8e60154bd",
"metadata": {},
"source": [
"#### Use within an agent"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af3123ad-7a02-40e5-b58e-7d56e23e5830",
"metadata": {},
"outputs": [],
"source": [
"from langchain_openai import ChatOpenAI\n",
"from langchain_scrapeless import ScrapelessDeepSerpGoogleSearchTool\n",
"from langgraph.prebuilt import create_react_agent\n",
"\n",
"llm = ChatOpenAI()\n",
"\n",
"tool = ScrapelessDeepSerpGoogleSearchTool()\n",
"\n",
"# Use the tool with an agent\n",
"tools = [tool]\n",
"agent = create_react_agent(llm, tools)\n",
"\n",
"for chunk in agent.stream(\n",
" {\"messages\": [(\"human\", \"I want to what is Scrapeless\")]}, stream_mode=\"values\"\n",
"):\n",
" chunk[\"messages\"][-1].pretty_print()"
]
},
{
"cell_type": "markdown",
"id": "769b2246",
"metadata": {},
"source": [
"### ScrapelessDeepSerpGoogleTrendsTool\n",
"\n",
"#### Basic Usage"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ca993de7",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scrapeless import ScrapelessDeepSerpGoogleTrendsTool\n",
"\n",
"tool = ScrapelessDeepSerpGoogleTrendsTool()\n",
"\n",
"# Basic usage\n",
"result = tool.invoke(\"Funny 2048,negamon monster trainer\")\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"id": "1a4db36f",
"metadata": {},
"source": [
"#### Advanced Usage with Parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "42c83c46",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scrapeless import ScrapelessDeepSerpGoogleTrendsTool\n",
"\n",
"tool = ScrapelessDeepSerpGoogleTrendsTool()\n",
"\n",
"# Advanced usage\n",
"result = tool.invoke({\"q\": \"Scrapeless\", \"data_type\": \"related_topics\", \"hl\": \"en\"})\n",
"print(result)"
]
},
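{
"cell_type": "markdown",
"id": "c8e4a122",
"metadata": {},
"source": [
"#### Comparing Keywords Over Time\n",
"\n",
"A sketch using the documented `data_type`, `date`, and `geo` parameters to compare two keywords over the past year in the United States. The keywords are illustrative; remember that multi-keyword queries are only supported for `interest_over_time` and `compared_breakdown_by_region`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8e4a123",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scrapeless import ScrapelessDeepSerpGoogleTrendsTool\n",
"\n",
"tool = ScrapelessDeepSerpGoogleTrendsTool()\n",
"\n",
"# Interest over the past 12 months in the US, comparing two keywords\n",
"result = tool.invoke(\n",
"    {\n",
"        \"q\": \"chatgpt,claude\",\n",
"        \"data_type\": \"interest_over_time\",\n",
"        \"date\": \"today 12-m\",\n",
"        \"geo\": \"US\",\n",
"    }\n",
")\n",
"print(result)"
]
},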
{
"cell_type": "markdown",
"id": "7dde00ff",
"metadata": {},
"source": [
"#### Use within an agent"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6ca1aff",
"metadata": {},
"outputs": [],
"source": [
"from langchain_openai import ChatOpenAI\n",
"from langchain_scrapeless import ScrapelessDeepSerpGoogleTrendsTool\n",
"from langgraph.prebuilt import create_react_agent\n",
"\n",
"llm = ChatOpenAI()\n",
"\n",
"tool = ScrapelessDeepSerpGoogleTrendsTool()\n",
"\n",
"# Use the tool with an agent\n",
"tools = [tool]\n",
"agent = create_react_agent(llm, tools)\n",
"\n",
"for chunk in agent.stream(\n",
" {\"messages\": [(\"human\", \"I want to know the iphone keyword trends\")]},\n",
" stream_mode=\"values\",\n",
"):\n",
" chunk[\"messages\"][-1].pretty_print()"
]
},
{
"cell_type": "markdown",
"id": "4ac8146c",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"- [Scrapeless Documentation](https://docs.scrapeless.com/en/deep-serp-api/quickstart/introduction/)\n",
"- [Scrapeless API Reference](https://apidocs.scrapeless.com/doc-800321)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,252 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a6f91f20",
"metadata": {},
"source": [
"# Scrapeless\n",
"\n",
"**Scrapeless** offers flexible and feature-rich data acquisition services with extensive parameter customization and multi-format export support. These capabilities empower LangChain to integrate and leverage external data more effectively. The core functional modules include:\n",
"\n",
"**DeepSerp**\n",
"- **Google Search**: Enables comprehensive extraction of Google SERP data across all result types.\n",
" - Supports selection of localized Google domains (e.g., google.com, google.ad) to retrieve region-specific search results.\n",
" - Pagination supported for retrieving results beyond the first page.\n",
" - Supports a search result filtering toggle to control whether to exclude duplicate or similar content.\n",
"- **Google Trends**: Retrieves keyword trend data from Google, including popularity over time, regional interest, and related searches.\n",
" - Supports multi-keyword comparison.\n",
" - Supports multiple data types: interest_over_time, interest_by_region, related_queries, and related_topics.\n",
" - Allows filtering by specific Google properties (Web, YouTube, News, Shopping) for source-specific trend analysis.\n",
"\n",
"**Universal Scraping**\n",
"- Designed for modern, JavaScript-heavy websites, allowing dynamic content extraction.\n",
" - Global premium proxy support for bypassing geo-restrictions and improving reliability.\n",
"\n",
"**Crawler**\n",
"- **Crawl**: Recursively crawl a website and its linked pages to extract site-wide content.\n",
" - Supports configurable crawl depth and scoped URL targeting.\n",
"- **Scrape**: Extract content from a single webpage with high precision.\n",
" - Supports \"main content only\" extraction to exclude ads, footers, and other non-essential elements.\n",
" - Allows batch scraping of multiple standalone URLs.\n",
"\n",
"## Overview\n",
"\n",
"### Integration details\n",
"\n",
"| Class | Package | Serializable | JS support | Package latest |\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [ScrapelessUniversalScrapingTool](https://pypi.org/project/langchain-scrapeless/) | [langchain-scrapeless](https://pypi.org/project/langchain-scrapeless/) | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapeless?style=flat-square&label=%20) |\n",
"\n",
"### Tool features\n",
"\n",
"|Native async|Returns artifact|Return data|\n",
"|:-:|:-:|:-:|\n",
"|✅|✅|html, markdown, links, metadata, structured content|\n",
"\n",
"\n",
"## Setup\n",
"\n",
"The integration lives in the `langchain-scrapeless` package."
]
},
{
"cell_type": "raw",
"id": "ca676665",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"pip install langchain-scrapeless"
]
},
{
"cell_type": "markdown",
"id": "b15e9266",
"metadata": {},
"source": [
"### Credentials\n",
"\n",
"You'll need a Scrapeless API key to use this tool. You can set it as an environment variable:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e0b178a2-8816-40ca-b57c-ccdd86dde9c9",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"SCRAPELESS_API_KEY\"] = \"your-api-key\""
]
},
{
"cell_type": "markdown",
"id": "1c97218f-f366-479d-8bf7-fe9f2f6df73f",
"metadata": {},
"source": [
"## Instantiation\n",
"\n",
"Here we show how to instantiate an instance of the Scrapeless Universal Scraping Tool. This tool allows you to scrape any website using a headless browser with JavaScript rendering capabilities, customizable output types, and geo-specific proxy support.\n",
"\n",
"The tool accepts the following parameters during instantiation:\n",
"- `url` (required, str): The URL of the website to scrape.\n",
"- `headless` (optional, bool): Whether to use a headless browser. Default is True.\n",
"- `js_render` (optional, bool): Whether to enable JavaScript rendering. Default is True.\n",
"- `js_wait_until` (optional, str): Defines when to consider the JavaScript-rendered page ready. Default is \"domcontentloaded\". Options include:\n",
" - `load`: Wait until the page is fully loaded.\n",
" - `domcontentloaded`: Wait until the DOM is fully loaded.\n",
" - `networkidle0`: Wait until the network is idle.\n",
" - `networkidle2`: Wait until the network is idle for 2 seconds.\n",
"- `outputs` (optional, str): The specific type of data to extract from the page. Options include:\n",
" - `phone_numbers`\n",
" - `headings`\n",
" - `images`\n",
" - `audios`\n",
" - `videos`\n",
" - `links`\n",
" - `menus`\n",
" - `hashtags`\n",
" - `emails`\n",
" - `metadata`\n",
" - `tables`\n",
" - `favicon`\n",
"- `response_type` (optional, str): Defines the format of the response. Default is \"html\". Options include:\n",
" - `html`: Return the raw HTML of the page.\n",
" - `plaintext`: Return the plain text content.\n",
" - `markdown`: Return a Markdown version of the page.\n",
" - `png`: Return a PNG screenshot.\n",
" - `jpeg`: Return a JPEG screenshot.\n",
"- `response_image_full_page` (optional, bool): Whether to capture and return a full-page image when using screenshot output (png or jpeg). Default is False.\n",
"- `selector` (optional, str): A specific CSS selector to scope scraping within a part of the page. Default is None.\n",
"- `proxy_country` (optional, str): Two-letter country code for geo-specific proxy access (e.g., \"us\", \"gb\", \"de\", \"jp\"). Default is \"ANY\"."
]
},
{
"cell_type": "markdown",
"id": "74147a1a",
"metadata": {},
"source": [
"## Invocation\n",
"\n",
"### Basic Usage"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "65310a8b-eb0c-4d9e-a618-4f4abe2414fc",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scrapeless import ScrapelessUniversalScrapingTool\n",
"\n",
"tool = ScrapelessUniversalScrapingTool()\n",
"\n",
"# Basic usage\n",
"result = tool.invoke(\"https://example.com\")\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"id": "d6e73897",
"metadata": {},
"source": [
"### Advanced Usage with Parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f90e33a7",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scrapeless import ScrapelessUniversalScrapingTool\n",
"\n",
"tool = ScrapelessUniversalScrapingTool()\n",
"\n",
"result = tool.invoke({\"url\": \"https://exmaple.com\", \"response_type\": \"markdown\"})\n",
"print(result)"
]
},
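{
"cell_type": "markdown",
"id": "d9f5b230",
"metadata": {},
"source": [
"### Targeted Extraction\n",
"\n",
"A sketch of the documented targeted-extraction options (`outputs`, `selector`, `proxy_country`). The selector and proxy country are illustrative; `outputs` accepts any of the values listed above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9f5b231",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scrapeless import ScrapelessUniversalScrapingTool\n",
"\n",
"tool = ScrapelessUniversalScrapingTool()\n",
"\n",
"# Extract only the headings from the page body, routed through a US proxy\n",
"result = tool.invoke(\n",
"    {\n",
"        \"url\": \"https://www.scrapeless.com/en\",\n",
"        \"outputs\": \"headings\",\n",
"        \"selector\": \"body\",\n",
"        \"proxy_country\": \"us\",\n",
"    }\n",
")\n",
"print(result)"
]
},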
{
"cell_type": "markdown",
"id": "659f9fbd-6fcf-445f-aa8c-72d8e60154bd",
"metadata": {},
"source": [
"### Use within an agent"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af3123ad-7a02-40e5-b58e-7d56e23e5830",
"metadata": {},
"outputs": [],
"source": [
"from langchain_openai import ChatOpenAI\n",
"from langchain_scrapeless import ScrapelessUniversalScrapingTool\n",
"from langgraph.prebuilt import create_react_agent\n",
"\n",
"llm = ChatOpenAI()\n",
"\n",
"tool = ScrapelessUniversalScrapingTool()\n",
"\n",
"# Use the tool with an agent\n",
"tools = [tool]\n",
"agent = create_react_agent(llm, tools)\n",
"\n",
"for chunk in agent.stream(\n",
" {\n",
" \"messages\": [\n",
" (\n",
" \"human\",\n",
" \"Use the scrapeless scraping tool to fetch https://www.scrapeless.com/en and extract the h1 tag.\",\n",
" )\n",
" ]\n",
" },\n",
" stream_mode=\"values\",\n",
"):\n",
" chunk[\"messages\"][-1].pretty_print()"
]
},
{
"cell_type": "markdown",
"id": "4ac8146c",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"- [Scrapeless Documentation](https://docs.scrapeless.com/en/universal-scraping-api/quickstart/introduction/)\n",
"- [Scrapeless API Reference](https://apidocs.scrapeless.com/api-12948840)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -693,6 +693,9 @@ packages:
- name: langchain-greennode
path: libs/greennode
repo: greennode-ai/langchain-greennode
- name: langchain-scrapeless
path: .
repo: scrapeless-ai/langchain-scrapeless
- name: langchain-tensorlake
path: .
repo: tensorlakeai/langchain-tensorlake