community: Spider integration (#20937)

Added the [Spider.cloud](https://spider.cloud) document loader.
[Spider](https://github.com/spider-rs/spider) is the
[fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md)
and cheapest crawler that returns LLM-ready data.

```
- **Description:** Adds Spider data loader
- **Dependencies:** spider-client
- **Twitter handle:** @WilliamEspegren 
```

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Chester Curme <chester.curme@gmail.com>

Authored by WilliamEspegren on 2024-04-27 23:45:03 +02:00, committed by GitHub
commit 804390ba4b (parent 6342217b93)
6 changed files with 223 additions and 1 deletions

@@ -0,0 +1,95 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Spider\n",
"[Spider](https://spider.cloud/) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md) and most affordable crawler and scraper that returns LLM-ready data.\n",
"\n",
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pip install spider-client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage\n",
"To use spider you need to have an API key from [spider.cloud](https://spider.cloud/)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Document(page_content='Spider - Fastest Web Crawler built for AI Agents and Large Language Models[Spider v1 Logo Spider ](/)The World\\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl\\', headers=headers, json=json_data)print(response.json())```Example ResponseScrape with no headaches----------* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM ResponsesThe Fastest Web Crawler----------* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* Cost effectiveScrape Anything with AI----------* Custom scripting browser* Custom data extraction* Data pipelines* Detailed insights* Advanced labeling[API](/docs/api) [Price](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', metadata={'description': 'Collect data rapidly from any website. Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 33743, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler built for AI Agents and Large Language Models', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'})]\n"
]
}
],
"source": [
"from langchain_community.document_loaders import SpiderLoader\n",
"\n",
"loader = SpiderLoader(\n",
" api_key=\"YOUR_API_KEY\",\n",
" url=\"https://spider.cloud\",\n",
" mode=\"scrape\", # if no API key is provided it looks for SPIDER_API_KEY in env\n",
")\n",
"\n",
"data = loader.load()\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modes\n",
"- `scrape`: Default mode that scrapes a single URL\n",
"- `crawl`: Crawl all subpages of the domain url provided"
]
},
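{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is an illustrative sketch (not part of the original notebook run): it shows how `crawl` mode might be combined with an optional `params` dictionary. The `limit` and `metadata` keys are assumptions; check the Spider docs for the exact options."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical crawl-mode example; the `params` keys are assumptions,\n",
"# see https://spider.cloud/docs/api for the supported options.\n",
"crawl_loader = SpiderLoader(\n",
"    api_key=\"YOUR_API_KEY\",\n",
"    url=\"https://spider.cloud\",\n",
"    mode=\"crawl\",\n",
"    params={\"limit\": 5, \"metadata\": True},\n",
")\n",
"\n",
"crawled_docs = crawl_loader.load()\n",
"print(len(crawled_docs))"
]
},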
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Crawler options\n",
"The `params` parameter is a dictionary that can be passed to the loader. See the [Spider documentation](https://spider.cloud/docs/api) to see all available parameters"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@@ -9,7 +9,7 @@
"\n",
"This covers how to use `WebBaseLoader` to load all text from `HTML` webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as `IMSDbLoader`, `AZLyricsLoader`, and `CollegeConfidentialLoader`. \n",
"\n",
"If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader`.\n"
"If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader` or the faster option `SpiderLoader`.\n"
]
},
{

@@ -55,6 +55,32 @@ data
</CodeOutputBlock>
## Loading HTML with SpiderLoader
[Spider](https://spider.cloud/?ref=langchain) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md#benchmark-results) crawler. It converts any website into pure HTML, markdown, metadata, or text, and lets you crawl with custom actions using AI.
It supports high-performance proxies to avoid detection, cached AI actions, webhooks for crawl status, scheduled crawls, and more.
## Prerequisite
You need a Spider API key to use this loader. You can get one at [spider.cloud](https://spider.cloud).
```python
%pip install --upgrade --quiet langchain langchain-community spider-client
```
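The key can also be supplied via the environment instead of the `api_key` argument; when `api_key` is not passed, the loader falls back to `SPIDER_API_KEY`. A minimal sketch:
```python
import os

# Optional: supply the key via the environment; SpiderLoader falls back to
# SPIDER_API_KEY when api_key is not passed explicitly.
os.environ["SPIDER_API_KEY"] = "YOUR_API_KEY"
```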
```python
from langchain_community.document_loaders import SpiderLoader
loader = SpiderLoader(
api_key="YOUR_API_KEY", url="https://spider.cloud", mode="crawl"
)
data = loader.load()
```
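Each returned `Document` carries the page content plus metadata. As a small illustrative sketch continuing from the snippet above (metadata keys such as `title` and `url` reflect an example response and may vary):
```python
# Illustrative only: inspect the documents returned by SpiderLoader.
for doc in data:
    print(doc.metadata.get("title"), doc.metadata.get("url"))
    print(doc.page_content[:200])  # first 200 characters of the page text
```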
For guides and documentation, visit the [Spider docs](https://spider.cloud/docs/api).
## Loading HTML with FireCrawlLoader
[FireCrawl](https://firecrawl.dev/?ref=langchain) crawls and converts any website into markdown. It crawls all accessible subpages and gives you clean markdown and metadata for each.