docs: add langchain-scraperapi (#31973)

Adds documentation for the integration langchain-scraperapi, which
contains 3 tools using the ScraperAPI service.

The tools give AI agents the ability to:

- Scrape the web and return HTML/text/markdown
- Perform a Google search and return JSON output
- Perform an Amazon search and return JSON output

For reference, here is the official repo for langchain_scraperapi:
https://github.com/scraperapi/langchain-scraperapi
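
As a rough sketch of how the first tool is called, assuming the `url`, `output_format`, and `render` arguments documented in the notebook added by this PR: `build_scrape_args` and `ALLOWED_FORMATS` are hypothetical names introduced here for illustration and are not part of `langchain-scraperapi`. The helper only assembles and validates the argument dict; no network request is made.

```python
from typing import Any, Dict, Optional

# Assumption: these are the two `output_format` values listed in the notebook;
# leaving the key out is documented there as returning raw HTML.
ALLOWED_FORMATS = {"text", "markdown"}


def build_scrape_args(
    url: str,
    output_format: Optional[str] = None,
    render: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: build the dict passed to ScraperAPITool.invoke()."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"expected an absolute http(s) URL, got {url!r}")
    if output_format is not None and output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    args: Dict[str, Any] = {"url": url}
    # Only include optional keys when they are set, so defaults stay untouched.
    if output_format is not None:
        args["output_format"] = output_format
    if render:
        args["render"] = True
    return args


print(build_scrape_args("https://example.com", "markdown"))
# → {'url': 'https://example.com', 'output_format': 'markdown'}
```

With the package installed and `SCRAPERAPI_API_KEY` set, the resulting dict would be passed on as `ScraperAPITool().invoke(build_scrape_args(...))`, mirroring the invocation shown in the notebook.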
This commit is contained in:
Chase Lean
2025-09-17 04:46:20 +03:00
committed by GitHub
parent f8640630d8
commit 543d90e108
3 changed files with 404 additions and 0 deletions


@@ -0,0 +1,72 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ScraperAPI\n",
"\n",
"[ScraperAPI](https://www.scraperapi.com/) enables data collection from any public website with its web scraping API, without worrying about proxies, browsers, or CAPTCHA handling. [langchain-scraperapi](https://github.com/scraperapi/langchain-scraperapi) wraps this service, making it easy for AI agents to browse the web and scrape data from it.\n",
"\n",
"## Installation and Setup\n",
"\n",
"- Install the Python package with `pip install langchain-scraperapi`.\n",
"- Obtain an API key from [ScraperAPI](https://www.scraperapi.com/) and set the environment variable `SCRAPERAPI_API_KEY`.\n",
"\n",
"### Tools\n",
"\n",
"The package offers 3 tools to scrape any website, get structured Google search results, and get structured Amazon search results respectively.\n",
"\n",
"To import them:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install langchain_scraperapi\n",
"\n",
"from langchain_scraperapi.tools import (\n",
" ScraperAPIAmazonSearchTool,\n",
" ScraperAPIGoogleSearchTool,\n",
" ScraperAPITool,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Example use:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tool = ScraperAPITool()\n",
"\n",
"result = tool.invoke({\"url\": \"https://example.com\", \"output_format\": \"markdown\"})\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a more detailed walkthrough of how to use these tools, visit the [official repository](https://github.com/scraperapi/langchain-scraperapi)."
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}


@@ -0,0 +1,329 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d3a12ba8",
"metadata": {},
"source": [
"# LangChain ScraperAPI\n",
"\n",
"Give your AI agent the ability to browse websites, search Google and Amazon in just two lines of code.\n",
"\n",
"The `langchain-scraperapi` package adds three ready-to-use LangChain tools backed by the [ScraperAPI](https://www.scraperapi.com/) service:\n",
"\n",
"| Tool class | Use it to |\n",
"|------------|------------------|\n",
"| `ScraperAPITool` | Grab the HTML/text/markdown of any web page |\n",
"| `ScraperAPIGoogleSearchTool` | Get structured Google Search SERP data |\n",
"| `ScraperAPIAmazonSearchTool` | Get structured Amazon product-search data |\n",
"\n",
"## Overview\n",
"\n",
"### Integration details\n",
"\n",
"| Package | Serializable | [JS support](https://js.langchain.com/docs/integrations/tools/__module_name__) | Package latest |\n",
"| :--- | :---: | :---: | :---: |\n",
"| [langchain-scraperapi](https://pypi.org/project/langchain-scraperapi/) | ❌ | ❌ | v0.1.1 |"
]
},
{
"cell_type": "markdown",
"id": "d1f7c70f",
"metadata": {},
"source": [
"### Setup\n",
"\n",
"Install the `langchain-scraperapi` package."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "494ecbc3",
"metadata": {},
"outputs": [],
"source": [
"%pip install -U langchain-scraperapi"
]
},
{
"cell_type": "markdown",
"id": "c111d2fb",
"metadata": {},
"source": [
"### Credentials\n",
"\n",
"Create an account at https://www.scraperapi.com/ and get an API key."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d315465",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"SCRAPERAPI_API_KEY\"] = \"your-api-key\""
]
},
{
"cell_type": "markdown",
"id": "e06ffe48",
"metadata": {},
"source": [
"## Instantiation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27ae5612",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scraperapi.tools import ScraperAPITool\n",
"\n",
"tool = ScraperAPITool()"
]
},
{
"cell_type": "markdown",
"id": "9ff46136",
"metadata": {},
"source": [
"## Invocation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e1a4c7f",
"metadata": {},
"outputs": [],
"source": [
"output = tool.invoke(\n",
" {\n",
" \"url\": \"https://langchain.com\",\n",
" \"output_format\": \"markdown\",\n",
" \"render\": True,\n",
" }\n",
")\n",
"print(output)"
]
},
{
"cell_type": "markdown",
"id": "051ef7b1",
"metadata": {},
"source": [
"## Features\n",
"\n",
"### 1. `ScraperAPITool` — browse any website\n",
"\n",
"Invoke the *raw* ScraperAPI endpoint and get HTML, rendered DOM, text, or markdown.\n",
"\n",
"**Invocation arguments**\n",
"\n",
"* **`url`** **(required)** target page URL \n",
"* **Optional (mirror ScraperAPI query params)** \n",
" * `output_format`: `\"text\"` | `\"markdown\"` (default returns raw HTML) \n",
" * `country_code`: e.g. `\"us\"`, `\"de\"` \n",
" * `device_type`: `\"desktop\"` | `\"mobile\"` \n",
" * `premium`: `bool` use premium proxies \n",
" * `render`: `bool` run JS before returning HTML \n",
" * `keep_headers`: `bool` include response headers \n",
" \n",
"For the complete set of modifiers see the [ScraperAPI request-customisation docs](https://docs.scraperapi.com/python/making-requests/customizing-requests)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1a0c7cc2",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scraperapi.tools import ScraperAPITool\n",
"\n",
"tool = ScraperAPITool()\n",
"\n",
"html_text = tool.invoke(\n",
" {\n",
" \"url\": \"https://langchain.com\",\n",
" \"output_format\": \"markdown\",\n",
" \"render\": True,\n",
" }\n",
")\n",
"print(html_text[:300], \"…\")"
]
},
{
"cell_type": "markdown",
"id": "9f2947dd",
"metadata": {},
"source": [
"### 2. `ScraperAPIGoogleSearchTool` — structured Google Search\n",
"\n",
"Structured SERP data via `/structured/google/search`.\n",
"\n",
"**Invocation arguments**\n",
"\n",
"* **`query`** **(required)** natural-language search string \n",
"* **Optional** — `country_code`, `tld`, `uule`, `hl`, `gl`, `ie`, `oe`, `start`, `num` \n",
"* `output_format`: `\"json\"` (default) or `\"csv\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aeac1195",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scraperapi.tools import ScraperAPIGoogleSearchTool\n",
"\n",
"google_search = ScraperAPIGoogleSearchTool()\n",
"\n",
"results = google_search.invoke(\n",
" {\n",
" \"query\": \"what is langchain\",\n",
" \"num\": 20,\n",
" \"output_format\": \"json\",\n",
" }\n",
")\n",
"print(results)"
]
},
{
"cell_type": "markdown",
"id": "3dc2f845",
"metadata": {},
"source": [
"### 3. `ScraperAPIAmazonSearchTool` — structured Amazon Search\n",
"\n",
"Structured product results via `/structured/amazon/search`.\n",
"\n",
"**Invocation arguments**\n",
"\n",
"* **`query`** **(required)** product search terms \n",
"* **Optional** — `country_code`, `tld`, `page` \n",
"* `output_format`: `\"json\"` (default) or `\"csv\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "05a4a6ed",
"metadata": {},
"outputs": [],
"source": [
"from langchain_scraperapi.tools import ScraperAPIAmazonSearchTool\n",
"\n",
"amazon_search = ScraperAPIAmazonSearchTool()\n",
"\n",
"products = amazon_search.invoke(\n",
" {\n",
" \"query\": \"noise cancelling headphones\",\n",
" \"tld\": \"co.uk\",\n",
" \"page\": 2,\n",
" }\n",
")\n",
"print(products)"
]
},
{
"cell_type": "markdown",
"id": "607eb8c8",
"metadata": {},
"source": [
"## Use within an agent\n",
"\n",
"Here is an example of using the tools in an AI agent. The `ScraperAPITool` gives the AI the ability to browse any website, summarize articles, and click on links to navigate between pages."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6541b286",
"metadata": {},
"outputs": [],
"source": [
"%pip install -U langchain-openai"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cb62e921",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"from langchain.agents import AgentExecutor, create_tool_calling_agent\n",
"from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
"from langchain_openai import ChatOpenAI\n",
"from langchain_scraperapi.tools import ScraperAPITool\n",
"\n",
"os.environ[\"SCRAPERAPI_API_KEY\"] = \"your-api-key\"\n",
"os.environ[\"OPENAI_API_KEY\"] = \"your-api-key\"\n",
"\n",
"tools = [ScraperAPITool(output_format=\"markdown\")]\n",
"llm = ChatOpenAI(model_name=\"gpt-4o\", temperature=0)\n",
"\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\n",
" \"system\",\n",
" \"You are a helpful assistant that can browse websites for users. When asked to browse a website or a link, do so with the ScraperAPITool, then provide information based on the website based on the user's needs.\",\n",
" ),\n",
" (\"human\", \"{input}\"),\n",
" MessagesPlaceholder(variable_name=\"agent_scratchpad\"),\n",
" ]\n",
")\n",
"\n",
"agent = create_tool_calling_agent(llm, tools, prompt)\n",
"agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)\n",
"response = agent_executor.invoke(\n",
" {\"input\": \"can you browse hacker news and summarize the first website\"}\n",
")"
]
},
{
"cell_type": "markdown",
"id": "4e90c894",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"Below you can find more information on additional parameters to the tools to customize your requests.\n",
"\n",
"* [ScraperAPITool](https://docs.scraperapi.com/python/making-requests/customizing-requests)\n",
"* [ScraperAPIGoogleSearchTool](https://docs.scraperapi.com/python/make-requests-with-scraperapi-in-python/scraperapi-structured-data-collection-in-python/google-serp-api-structured-data-in-python)\n",
"* [ScraperAPIAmazonSearchTool](https://docs.scraperapi.com/python/make-requests-with-scraperapi-in-python/scraperapi-structured-data-collection-in-python/amazon-search-api-structured-data-in-python)\n",
"\n",
"The LangChain wrappers surface these parameters directly."
]
}
],
"metadata": {
"jupytext": {
"cell_metadata_filter": "-all",
"main_language": "python",
"notebook_metadata_filter": "-all"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -749,3 +749,6 @@ packages:
- name: langchain-zeusdb
repo: zeusdb/langchain-zeusdb
path: libs/zeusdb
- name: langchain-scraperapi
path: .
repo: scraperapi/langchain-scraperapi