Mirror of https://github.com/hwchase17/langchain.git (synced 2025-09-17 15:35:14 +00:00)
docs: add langchain-scraperapi (#31973)
Adds documentation for the langchain-scraperapi integration, which provides three tools built on the ScraperAPI service. The tools give AI agents the ability to:

- Scrape the web and return HTML, text, or markdown
- Perform a Google search and return JSON output
- Perform an Amazon search and return JSON output

For reference, here is the official repo for langchain_scraperapi: https://github.com/scraperapi/langchain-scraperapi
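
The call shapes behind these three tools can be sketched roughly as follows. This is an unofficial, minimal sketch based on the example notebooks added in this commit: `scrape_args`, `search_args`, and `run_examples` are hypothetical helpers, not part of `langchain-scraperapi`, and the accepted `output_format` values are assumed from the notebook's argument list. Only `run_examples` touches the network; it assumes the package is installed and `SCRAPERAPI_API_KEY` is set.

```python
from typing import Any, Dict, Optional


def scrape_args(
    url: str, output_format: Optional[str] = None, render: bool = False
) -> Dict[str, Any]:
    """Build the argument dict for ScraperAPITool.invoke (hypothetical helper)."""
    # Assumption: the notebook lists only "text" and "markdown";
    # leaving the key out is documented to return raw HTML.
    if output_format not in (None, "text", "markdown"):
        raise ValueError(f"unsupported output_format: {output_format!r}")
    args: Dict[str, Any] = {"url": url}
    if output_format is not None:
        args["output_format"] = output_format
    if render:
        args["render"] = True
    return args


def search_args(query: str, **optional: Any) -> Dict[str, Any]:
    """Build the argument dict for either search tool (hypothetical helper)."""
    if not query.strip():
        raise ValueError("query must not be empty")
    # Drop unset options so only explicit parameters are sent.
    extras = {key: value for key, value in optional.items() if value is not None}
    return {"query": query, **extras}


def run_examples() -> None:
    """Call all three tools once; requires the package, an API key, and network."""
    from langchain_scraperapi.tools import (
        ScraperAPIAmazonSearchTool,
        ScraperAPIGoogleSearchTool,
        ScraperAPITool,
    )

    print(ScraperAPITool().invoke(scrape_args("https://example.com", "markdown")))
    print(ScraperAPIGoogleSearchTool().invoke(search_args("what is langchain", num=20)))
    print(
        ScraperAPIAmazonSearchTool().invoke(
            search_args("noise cancelling headphones", tld="co.uk", page=2)
        )
    )
```

Keeping the argument building separate from the network call makes the payloads easy to inspect or unit-test without spending ScraperAPI credits.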
72
docs/docs/integrations/providers/scraperapi.ipynb
Normal file
@@ -0,0 +1,72 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ScraperAPI\n",
    "\n",
    "[ScraperAPI](https://www.scraperapi.com/) lets you collect data from any public website through its web scraping API, without having to manage proxies, browsers, or CAPTCHA handling. [langchain-scraperapi](https://github.com/scraperapi/langchain-scraperapi) wraps this service, making it easy for AI agents to browse the web and scrape data from it.\n",
    "\n",
    "## Installation and Setup\n",
    "\n",
    "- Install the Python package with `pip install langchain-scraperapi`.\n",
    "- Obtain an API key from [ScraperAPI](https://www.scraperapi.com/) and set the environment variable `SCRAPERAPI_API_KEY`.\n",
    "\n",
    "## Tools\n",
    "\n",
    "The package offers three tools: one scrapes any website, one returns structured Google search results, and one returns structured Amazon search results.\n",
    "\n",
    "To import them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -U langchain-scraperapi\n",
    "\n",
    "from langchain_scraperapi.tools import (\n",
    "    ScraperAPIAmazonSearchTool,\n",
    "    ScraperAPIGoogleSearchTool,\n",
    "    ScraperAPITool,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Example use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tool = ScraperAPITool()\n",
    "\n",
    "result = tool.invoke({\"url\": \"https://example.com\", \"output_format\": \"markdown\"})\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a more detailed walkthrough of how to use these tools, visit the [official repository](https://github.com/scraperapi/langchain-scraperapi)."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
329
docs/docs/integrations/tools/scraperapi.ipynb
Normal file
@@ -0,0 +1,329 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d3a12ba8",
   "metadata": {},
   "source": [
    "# LangChain – ScraperAPI\n",
    "\n",
    "Give your AI agent the ability to browse websites and search Google and Amazon in just two lines of code.\n",
    "\n",
    "The `langchain-scraperapi` package adds three ready-to-use LangChain tools backed by the [ScraperAPI](https://www.scraperapi.com/) service:\n",
    "\n",
    "| Tool class | Use it to |\n",
    "|------------|------------------|\n",
    "| `ScraperAPITool` | Grab the HTML/text/markdown of any web page |\n",
    "| `ScraperAPIGoogleSearchTool` | Get structured Google SERP data |\n",
    "| `ScraperAPIAmazonSearchTool` | Get structured Amazon product-search data |\n",
    "\n",
    "## Overview\n",
    "\n",
    "### Integration details\n",
    "\n",
    "| Package | Serializable | JS support | Package latest |\n",
    "| :--- | :---: | :---: | :---: |\n",
    "| [langchain-scraperapi](https://pypi.org/project/langchain-scraperapi/) | ❌ | ❌ | v0.1.1 |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1f7c70f",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Install the `langchain-scraperapi` package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "494ecbc3",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -U langchain-scraperapi"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c111d2fb",
   "metadata": {},
   "source": [
    "### Credentials\n",
    "\n",
    "Create an account at https://www.scraperapi.com/, get an API key, and set it as the `SCRAPERAPI_API_KEY` environment variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d315465",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"SCRAPERAPI_API_KEY\"] = \"your-api-key\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e06ffe48",
   "metadata": {},
   "source": [
    "## Instantiation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27ae5612",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_scraperapi.tools import ScraperAPITool\n",
    "\n",
    "tool = ScraperAPITool()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ff46136",
   "metadata": {},
   "source": [
    "## Invocation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e1a4c7f",
   "metadata": {},
   "outputs": [],
   "source": [
    "output = tool.invoke(\n",
    "    {\n",
    "        \"url\": \"https://langchain.com\",\n",
    "        \"output_format\": \"markdown\",\n",
    "        \"render\": True,\n",
    "    }\n",
    ")\n",
    "print(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "051ef7b1",
   "metadata": {},
   "source": [
    "## Features\n",
    "\n",
    "### 1. `ScraperAPITool` – browse any website\n",
    "\n",
    "Invoke the *raw* ScraperAPI endpoint and get HTML, rendered DOM, text, or markdown.\n",
    "\n",
    "**Invocation arguments**\n",
    "\n",
    "* **`url`** **(required)** – target page URL\n",
    "* **Optional** (these mirror the ScraperAPI query parameters)\n",
    "  * `output_format`: `\"text\"` | `\"markdown\"` (omit to get raw HTML)\n",
    "  * `country_code`: e.g. `\"us\"`, `\"de\"`\n",
    "  * `device_type`: `\"desktop\"` | `\"mobile\"`\n",
    "  * `premium`: `bool` – use premium proxies\n",
    "  * `render`: `bool` – run JS before returning HTML\n",
    "  * `keep_headers`: `bool` – include response headers\n",
    "\n",
    "For the complete set of modifiers, see the [ScraperAPI request-customisation docs](https://docs.scraperapi.com/python/making-requests/customizing-requests).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a0c7cc2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_scraperapi.tools import ScraperAPITool\n",
    "\n",
    "tool = ScraperAPITool()\n",
    "\n",
    "page_markdown = tool.invoke(\n",
    "    {\n",
    "        \"url\": \"https://langchain.com\",\n",
    "        \"output_format\": \"markdown\",\n",
    "        \"render\": True,\n",
    "    }\n",
    ")\n",
    "print(page_markdown[:300], \"…\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f2947dd",
   "metadata": {},
   "source": [
    "### 2. `ScraperAPIGoogleSearchTool` – structured Google Search\n",
    "\n",
    "Returns structured SERP data from the `/structured/google/search` endpoint.\n",
    "\n",
    "**Invocation arguments**\n",
    "\n",
    "* **`query`** **(required)** – natural-language search string\n",
    "* **Optional** – `country_code`, `tld`, `uule`, `hl`, `gl`, `ie`, `oe`, `start`, `num`\n",
    "* `output_format`: `\"json\"` (default) or `\"csv\"`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aeac1195",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_scraperapi.tools import ScraperAPIGoogleSearchTool\n",
    "\n",
    "google_search = ScraperAPIGoogleSearchTool()\n",
    "\n",
    "results = google_search.invoke(\n",
    "    {\n",
    "        \"query\": \"what is langchain\",\n",
    "        \"num\": 20,\n",
    "        \"output_format\": \"json\",\n",
    "    }\n",
    ")\n",
    "print(results)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3dc2f845",
   "metadata": {},
   "source": [
    "### 3. `ScraperAPIAmazonSearchTool` – structured Amazon Search\n",
    "\n",
    "Returns structured product results from the `/structured/amazon/search` endpoint.\n",
    "\n",
    "**Invocation arguments**\n",
    "\n",
    "* **`query`** **(required)** – product search terms\n",
    "* **Optional** – `country_code`, `tld`, `page`\n",
    "* `output_format`: `\"json\"` (default) or `\"csv\"`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05a4a6ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_scraperapi.tools import ScraperAPIAmazonSearchTool\n",
    "\n",
    "amazon_search = ScraperAPIAmazonSearchTool()\n",
    "\n",
    "products = amazon_search.invoke(\n",
    "    {\n",
    "        \"query\": \"noise cancelling headphones\",\n",
    "        \"tld\": \"co.uk\",\n",
    "        \"page\": 2,\n",
    "    }\n",
    ")\n",
    "print(products)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "607eb8c8",
   "metadata": {},
   "source": [
    "## Use within an agent\n",
    "\n",
    "Here is an example of using the tools in an AI agent. The `ScraperAPITool` lets the agent fetch any web page, summarize articles, and follow links to navigate between pages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6541b286",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -U langchain-openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb62e921",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from langchain.agents import AgentExecutor, create_tool_calling_agent\n",
    "from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
    "from langchain_openai import ChatOpenAI\n",
    "from langchain_scraperapi.tools import ScraperAPITool\n",
    "\n",
    "os.environ[\"SCRAPERAPI_API_KEY\"] = \"your-api-key\"\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"your-openai-api-key\"\n",
    "\n",
    "tools = [ScraperAPITool(output_format=\"markdown\")]\n",
    "llm = ChatOpenAI(model=\"gpt-4o\", temperature=0)\n",
    "\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\n",
    "            \"system\",\n",
    "            \"You are a helpful assistant that can browse websites for users. When asked to browse a website or a link, do so with the ScraperAPITool, then provide information from the website according to the user's needs.\",\n",
    "        ),\n",
    "        (\"human\", \"{input}\"),\n",
    "        MessagesPlaceholder(variable_name=\"agent_scratchpad\"),\n",
    "    ]\n",
    ")\n",
    "\n",
    "agent = create_tool_calling_agent(llm, tools, prompt)\n",
    "agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)\n",
    "response = agent_executor.invoke(\n",
    "    {\"input\": \"can you browse hacker news and summarize the first website\"}\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e90c894",
   "metadata": {},
   "source": [
    "## API reference\n",
    "\n",
    "The pages below describe the additional parameters each tool accepts for customizing requests.\n",
    "\n",
    "* [ScraperAPITool](https://docs.scraperapi.com/python/making-requests/customizing-requests)\n",
    "* [ScraperAPIGoogleSearchTool](https://docs.scraperapi.com/python/make-requests-with-scraperapi-in-python/scraperapi-structured-data-collection-in-python/google-serp-api-structured-data-in-python)\n",
    "* [ScraperAPIAmazonSearchTool](https://docs.scraperapi.com/python/make-requests-with-scraperapi-in-python/scraperapi-structured-data-collection-in-python/amazon-search-api-structured-data-in-python)\n",
    "\n",
    "The LangChain wrappers surface these parameters directly."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "main_language": "python",
   "notebook_metadata_filter": "-all"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
@@ -749,3 +749,6 @@ packages:
- name: langchain-zeusdb
  repo: zeusdb/langchain-zeusdb
  path: libs/zeusdb
- name: langchain-scraperapi
  path: .
  repo: scraperapi/langchain-scraperapi