diff --git a/docs/docs/integrations/providers/scraperapi.ipynb b/docs/docs/integrations/providers/scraperapi.ipynb new file mode 100644 index 00000000000..e83e5c09a56 --- /dev/null +++ b/docs/docs/integrations/providers/scraperapi.ipynb @@ -0,0 +1,72 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# ScraperAPI\n", + "\n", + "[ScraperAPI](https://www.scraperapi.com/) enables data collection from any public website with its web scraping API, without worrying about proxies, browsers, or CAPTCHA handling. [langchain-scraperapi](https://github.com/scraperapi/langchain-scraperapi) wraps this service, making it easy for AI agents to browse the web and scrape data from it.\n", + "\n", + "## Installation and Setup\n", + "\n", + "- Install the Python package with `pip install langchain-scraperapi`.\n", + "- Obtain an API key from [ScraperAPI](https://www.scraperapi.com/) and set the environment variable `SCRAPERAPI_API_KEY`.\n", + "\n", + "### Tools\n", + "\n", + "The package offers 3 tools to scrape any website, get structured Google search results, and get structured Amazon search results respectively.\n", + "\n", + "To import them:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install langchain_scraperapi\n", + "\n", + "from langchain_scraperapi.tools import (\n", + " ScraperAPIAmazonSearchTool,\n", + " ScraperAPIGoogleSearchTool,\n", + " ScraperAPITool,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Example use:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tool = ScraperAPITool()\n", + "\n", + "result = tool.invoke({\"url\": \"https://example.com\", \"output_format\": \"markdown\"})\n", + "print(result)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For a more detailed walkthrough of how to use these tools, visit the [official repository](https://github.com/scraperapi/langchain-scraperapi)." + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs/docs/integrations/tools/scraperapi.ipynb b/docs/docs/integrations/tools/scraperapi.ipynb new file mode 100644 index 00000000000..566ca442710 --- /dev/null +++ b/docs/docs/integrations/tools/scraperapi.ipynb @@ -0,0 +1,329 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d3a12ba8", + "metadata": {}, + "source": [ + "# LangChain – ScraperAPI\n", + "\n", + "Give your AI agent the ability to browse websites, search Google and Amazon in just two lines of code.\n", + "\n", + "The `langchain-scraperapi` package adds three ready-to-use LangChain tools backed by the [ScraperAPI](https://www.scraperapi.com/) service:\n", + "\n", + "| Tool class | Use it to |\n", + "|------------|------------------|\n", + "| `ScraperAPITool` | Grab the HTML/text/markdown of any web page |\n", + "| `ScraperAPIGoogleSearchTool` | Get structured Google Search SERP data |\n", + "| `ScraperAPIAmazonSearchTool` | Get structured Amazon product-search data |\n", + "\n", + "## Overview\n", + "\n", + "### Integration details\n", + "\n", + "| Package | Serializable | [JS support](https://js.langchain.com/docs/integrations/tools/__module_name__) | Package latest |\n", + "| :--- | :---: | :---: | :---: |\n", + "| [langchain-scraperapi](https://pypi.org/project/langchain-scraperapi/) | ❌ | ❌ | v0.1.1 |" + ] + }, + { + "cell_type": "markdown", + "id": "d1f7c70f", + "metadata": {}, + "source": [ + "### Setup\n", + "\n", + "Install the `langchain-scraperapi` package." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "494ecbc3", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -U langchain-scraperapi" + ] + }, + { + "cell_type": "markdown", + "id": "c111d2fb", + "metadata": {}, + "source": [ + "### Credentials\n", + "\n", + "Create an account at https://www.scraperapi.com/ and get an API key." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d315465", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "os.environ[\"SCRAPERAPI_API_KEY\"] = \"your-api-key\"" + ] + }, + { + "cell_type": "markdown", + "id": "e06ffe48", + "metadata": {}, + "source": [ + "## Instantiation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "27ae5612", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_scraperapi.tools import ScraperAPITool\n", + "\n", + "tool = ScraperAPITool()" + ] + }, + { + "cell_type": "markdown", + "id": "9ff46136", + "metadata": {}, + "source": [ + "## Invocation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e1a4c7f", + "metadata": {}, + "outputs": [], + "source": [ + "output = tool.invoke(\n", + " {\n", + " \"url\": \"https://langchain.com\",\n", + " \"output_format\": \"markdown\",\n", + " \"render\": True,\n", + " }\n", + ")\n", + "print(output)" + ] + }, + { + "cell_type": "markdown", + "id": "051ef7b1", + "metadata": {}, + "source": [ + "## Features\n", + "\n", + "### 1. `ScraperAPITool` — browse any website\n", + "\n", + "Invoke the *raw* ScraperAPI endpoint and get HTML, rendered DOM, text, or markdown.\n", + "\n", + "**Invocation arguments**\n", + "\n", + "* **`url`** **(required)** – target page URL \n", + "* **Optional (mirror ScraperAPI query params)** \n", + " * `output_format`: `\"text\"` | `\"markdown\"` (default returns raw HTML) \n", + " * `country_code`: e.g. `\"us\"`, `\"de\"` \n", + " * `device_type`: `\"desktop\"` | `\"mobile\"` \n", + " * `premium`: `bool` – use premium proxies \n", + " * `render`: `bool` – run JS before returning HTML \n", + " * `keep_headers`: `bool` – include response headers \n", + " \n", + "For the complete set of modifiers see the [ScraperAPI request-customisation docs](https://docs.scraperapi.com/python/making-requests/customizing-requests)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a0c7cc2", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_scraperapi.tools import ScraperAPITool\n", + "\n", + "tool = ScraperAPITool()\n", + "\n", + "html_text = tool.invoke(\n", + " {\n", + " \"url\": \"https://langchain.com\",\n", + " \"output_format\": \"markdown\",\n", + " \"render\": True,\n", + " }\n", + ")\n", + "print(html_text[:300], \"…\")" + ] + }, + { + "cell_type": "markdown", + "id": "9f2947dd", + "metadata": {}, + "source": [ + "### 2. `ScraperAPIGoogleSearchTool` — structured Google Search\n", + "\n", + "Structured SERP data via `/structured/google/search`.\n", + "\n", + "**Invocation arguments**\n", + "\n", + "* **`query`** **(required)** – natural-language search string \n", + "* **Optional** — `country_code`, `tld`, `uule`, `hl`, `gl`, `ie`, `oe`, `start`, `num` \n", + "* `output_format`: `\"json\"` (default) or `\"csv\"`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aeac1195", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_scraperapi.tools import ScraperAPIGoogleSearchTool\n", + "\n", + "google_search = ScraperAPIGoogleSearchTool()\n", + "\n", + "results = google_search.invoke(\n", + " {\n", + " \"query\": \"what is langchain\",\n", + " \"num\": 20,\n", + " \"output_format\": \"json\",\n", + " }\n", + ")\n", + "print(results)" + ] + }, + { + "cell_type": "markdown", + "id": "3dc2f845", + "metadata": {}, + "source": [ + "### 3. `ScraperAPIAmazonSearchTool` — structured Amazon Search\n", + "\n", + "Structured product results via `/structured/amazon/search`.\n", + "\n", + "**Invocation arguments**\n", + "\n", + "* **`query`** **(required)** – product search terms \n", + "* **Optional** — `country_code`, `tld`, `page` \n", + "* `output_format`: `\"json\"` (default) or `\"csv\"`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "05a4a6ed", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_scraperapi.tools import ScraperAPIAmazonSearchTool\n", + "\n", + "amazon_search = ScraperAPIAmazonSearchTool()\n", + "\n", + "products = amazon_search.invoke(\n", + " {\n", + " \"query\": \"noise cancelling headphones\",\n", + " \"tld\": \"co.uk\",\n", + " \"page\": 2,\n", + " }\n", + ")\n", + "print(products)" + ] + }, + { + "cell_type": "markdown", + "id": "607eb8c8", + "metadata": {}, + "source": [ + "## Use within an agent\n", + "\n", + "Here is an example of using the tools in an AI agent. The `ScraperAPITool` gives the AI the ability to browse any website, summarize articles, and click on links to navigate between pages." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6541b286", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -U langchain-openai" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb62e921", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "from langchain.agents import AgentExecutor, create_tool_calling_agent\n", + "from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n", + "from langchain_openai import ChatOpenAI\n", + "from langchain_scraperapi.tools import ScraperAPITool\n", + "\n", + "os.environ[\"SCRAPERAPI_API_KEY\"] = \"your-api-key\"\n", + "os.environ[\"OPENAI_API_KEY\"] = \"your-api-key\"\n", + "\n", + "tools = [ScraperAPITool(output_format=\"markdown\")]\n", + "llm = ChatOpenAI(model_name=\"gpt-4o\", temperature=0)\n", + "\n", + "prompt = ChatPromptTemplate.from_messages(\n", + " [\n", + " (\n", + " \"system\",\n", + " \"You are a helpful assistant that can browse websites for users. When asked to browse a website or a link, do so with the ScraperAPITool, then provide information based on the website based on the user's needs.\",\n", + " ),\n", + " (\"human\", \"{input}\"),\n", + " MessagesPlaceholder(variable_name=\"agent_scratchpad\"),\n", + " ]\n", + ")\n", + "\n", + "agent = create_tool_calling_agent(llm, tools, prompt)\n", + "agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)\n", + "response = agent_executor.invoke(\n", + " {\"input\": \"can you browse hacker news and summarize the first website\"}\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "4e90c894", + "metadata": {}, + "source": [ + "## API reference\n", + "\n", + "Below you can find more information on additional parameters to the tools to customize your requests.\n", + "\n", + "* [ScraperAPITool](https://docs.scraperapi.com/python/making-requests/customizing-requests)\n", + "* [ScraperAPIGoogleSearchTool](https://docs.scraperapi.com/python/make-requests-with-scraperapi-in-python/scraperapi-structured-data-collection-in-python/google-serp-api-structured-data-in-python)\n", + "* [ScraperAPIAmazonSearchTool](https://docs.scraperapi.com/python/make-requests-with-scraperapi-in-python/scraperapi-structured-data-collection-in-python/amazon-search-api-structured-data-in-python)\n", + "\n", + "The LangChain wrappers surface these parameters directly." + ] + } + ], + "metadata": { + "jupytext": { + "cell_metadata_filter": "-all", + "main_language": "python", + "notebook_metadata_filter": "-all" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10.2" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/libs/packages.yml b/libs/packages.yml index 1faf4cbf268..534a5aa460c 100644 --- a/libs/packages.yml +++ b/libs/packages.yml @@ -749,3 +749,6 @@ packages: - name: langchain-zeusdb repo: zeusdb/langchain-zeusdb path: libs/zeusdb +- name: langchain-scraperapi + path: . + repo: scraperapi/langchain-scraperapi \ No newline at end of file