community: Spider integration (#20937)

Added the [Spider.cloud](https://spider.cloud) document loader.
[Spider](https://github.com/spider-rs/spider) is the
[fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md)
and cheapest crawler that returns LLM-ready data.

```
- **Description:** Adds Spider data loader
- **Dependencies:** spider-client
- **Twitter handle:** @WilliamEspegren 
```

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Chester Curme <chester.curme@gmail.com>

Authored by WilliamEspegren on 2024-04-27 23:45:03 +02:00, committed by GitHub
commit 804390ba4b (parent 6342217b93)
6 changed files with 223 additions and 1 deletions

@@ -0,0 +1,95 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Spider\n",
"[Spider](https://spider.cloud/) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md) and most affordable crawler and scraper that returns LLM-ready data.\n",
"\n",
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pip install spider-client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage\n",
"To use spider you need to have an API key from [spider.cloud](https://spider.cloud/)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Document(page_content='Spider - Fastest Web Crawler built for AI Agents and Large Language Models[Spider v1 Logo Spider ](/)The World\\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl\\', headers=headers, json=json_data)print(response.json())```Example ResponseScrape with no headaches----------* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM ResponsesThe Fastest Web Crawler----------* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* Cost effectiveScrape Anything with AI----------* Custom scripting browser* Custom data extraction* Data pipelines* Detailed insights* Advanced labeling[API](/docs/api) [Price](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', metadata={'description': 'Collect data rapidly from any website. Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 33743, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler built for AI Agents and Large Language Models', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'})]\n"
]
}
],
"source": [
"from langchain_community.document_loaders import SpiderLoader\n",
"\n",
"loader = SpiderLoader(\n",
" api_key=\"YOUR_API_KEY\",\n",
" url=\"https://spider.cloud\",\n",
" mode=\"scrape\", # if no API key is provided it looks for SPIDER_API_KEY in env\n",
")\n",
"\n",
"data = loader.load()\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modes\n",
"- `scrape`: Default mode that scrapes a single URL\n",
"- `crawl`: Crawl all subpages of the domain url provided"
]
},
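{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is an illustrative sketch (not part of the original notebook run): it shows how `crawl` mode might be combined with an optional `params` dictionary. The `limit` and `metadata` keys are assumptions; check the Spider docs for the exact options."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical crawl-mode example; the `params` keys are assumptions,\n",
"# see https://spider.cloud/docs/api for the supported options.\n",
"crawl_loader = SpiderLoader(\n",
"    api_key=\"YOUR_API_KEY\",\n",
"    url=\"https://spider.cloud\",\n",
"    mode=\"crawl\",\n",
"    params={\"limit\": 5, \"metadata\": True},\n",
")\n",
"\n",
"crawled_docs = crawl_loader.load()\n",
"print(len(crawled_docs))"
]
},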
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Crawler options\n",
"The `params` parameter is a dictionary that can be passed to the loader. See the [Spider documentation](https://spider.cloud/docs/api) to see all available parameters"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@@ -9,7 +9,7 @@
"\n",
"This covers how to use `WebBaseLoader` to load all text from `HTML` webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as `IMSDbLoader`, `AZLyricsLoader`, and `CollegeConfidentialLoader`. \n",
"\n",
"If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader`.\n"
"If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader` or the faster option `SpiderLoader`.\n"
]
},
{

@@ -55,6 +55,32 @@ data
</CodeOutputBlock>
## Loading HTML with SpiderLoader
[Spider](https://spider.cloud/?ref=langchain) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md#benchmark-results) crawler. It converts any website into pure HTML, markdown, metadata, or text, and lets you crawl with custom actions using AI.
It supports high-performance proxies to avoid detection, cached AI actions, webhooks for crawl status, scheduled crawls, and more.
## Prerequisite
You need a Spider API key to use this loader. You can get one at [spider.cloud](https://spider.cloud).
```python
%pip install --upgrade --quiet langchain langchain-community spider-client
```
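The key can also be supplied via the environment instead of the `api_key` argument; when `api_key` is not passed, the loader falls back to `SPIDER_API_KEY`. A minimal sketch:
```python
import os

# Optional: supply the key via the environment; SpiderLoader falls back to
# SPIDER_API_KEY when api_key is not passed explicitly.
os.environ["SPIDER_API_KEY"] = "YOUR_API_KEY"
```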
```python
from langchain_community.document_loaders import SpiderLoader
loader = SpiderLoader(
api_key="YOUR_API_KEY", url="https://spider.cloud", mode="crawl"
)
data = loader.load()
```
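Each returned `Document` carries the page content plus metadata. As a small illustrative sketch continuing from the snippet above (metadata keys such as `title` and `url` reflect an example response and may vary):
```python
# Illustrative only: inspect the documents returned by SpiderLoader.
for doc in data:
    print(doc.metadata.get("title"), doc.metadata.get("url"))
    print(doc.page_content[:200])  # first 200 characters of the page text
```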
For guides and documentation, visit the [Spider docs](https://spider.cloud/docs/api).
## Loading HTML with FireCrawlLoader
[FireCrawl](https://firecrawl.dev/?ref=langchain) crawls and converts any website into markdown. It crawls all accessible subpages and gives you clean markdown and metadata for each.