mirror of
https://github.com/hwchase17/langchain.git
synced 2025-06-26 08:33:49 +00:00
docs: add HyperbrowserLoader docs (#29143)
### Description

This PR adds docs for the [langchain-hyperbrowser](https://pypi.org/project/langchain-hyperbrowser/) package. It includes a document loader that uses Hyperbrowser to scrape or crawl any URLs and return formatted markdown or HTML content along with relevant metadata.

[Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for web scraping needs, such as scraping a single page or crawling an entire site.

### Issue

None

### Dependencies

None

### Twitter Handle

`@hyperbrowser`
parent 4c0217681a
commit 335ca3a606
221
docs/docs/integrations/document_loaders/hyperbrowser.ipynb
Normal file
@ -0,0 +1,221 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# HyperbrowserLoader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for web scraping needs, such as scraping a single page or crawling an entire site.\n",
    "\n",
    "Key Features:\n",
    "- Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches\n",
    "- Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright\n",
    "- Powerful APIs - Easy-to-use APIs for scraping/crawling any site, and much more\n",
    "- Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies\n",
    "\n",
    "This notebook provides a quick overview for getting started with the Hyperbrowser [document loader](https://python.langchain.com/docs/concepts/#document-loaders).\n",
    "\n",
    "For more information about Hyperbrowser, visit the [Hyperbrowser website](https://hyperbrowser.ai) or the [Hyperbrowser docs](https://docs.hyperbrowser.ai).\n",
    "\n",
    "## Overview\n",
    "### Integration details\n",
    "\n",
    "| Class | Package | Local | Serializable | JS support |\n",
    "| :--- | :--- | :---: | :---: | :---: |\n",
    "| HyperbrowserLoader | langchain-hyperbrowser | ❌ | ❌ | ❌ |\n",
    "\n",
    "### Loader features\n",
    "| Source | Document Lazy Loading | Native Async Support |\n",
    "| :---: | :---: | :---: |\n",
    "| HyperbrowserLoader | ✅ | ✅ |\n",
    "\n",
    "## Setup\n",
    "\n",
    "To access the Hyperbrowser document loader, you'll need to install the `langchain-hyperbrowser` integration package, create a Hyperbrowser account, and get an API key.\n",
    "\n",
    "### Credentials\n",
    "\n",
    "Head to [Hyperbrowser](https://app.hyperbrowser.ai/) to sign up and generate an API key. Once you've done this, set the `HYPERBROWSER_API_KEY` environment variable, or pass the key directly via the `api_key` argument as in the Initialization section.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Installation\n",
    "\n",
    "Install **langchain-hyperbrowser**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -qU langchain-hyperbrowser"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialization\n",
    "\n",
    "Now we can instantiate our loader object and load documents:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_hyperbrowser import HyperbrowserLoader\n",
    "\n",
    "loader = HyperbrowserLoader(\n",
    "    urls=\"https://example.com\",\n",
    "    api_key=\"YOUR_API_KEY\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(metadata={'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, page_content='Example Domain\\n\\n# Example Domain\\n\\nThis domain is for use in illustrative examples in documents. You may use this\\ndomain in literature without prior coordination or asking for permission.\\n\\n[More information...](https://www.iana.org/domains/example)')"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs = loader.load()\n",
    "docs[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(docs[0].metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lazy Load"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "page = []\n",
    "for doc in loader.lazy_load():\n",
    "    page.append(doc)\n",
    "    if len(page) >= 10:\n",
    "        # do some paged operation, e.g.\n",
    "        # index.upsert(page)\n",
    "\n",
    "        page = []"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced Usage\n",
    "\n",
    "You can specify the operation to be performed by the loader. The default operation is `scrape`. For `scrape`, you can provide a single URL or a list of URLs to be scraped. For `crawl`, you can only provide a single URL. The `crawl` operation will crawl the provided page and subpages and return a document for each page."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = HyperbrowserLoader(\n",
    "    urls=\"https://hyperbrowser.ai\", api_key=\"YOUR_API_KEY\", operation=\"crawl\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optional params for the loader can also be provided in the `params` argument. For more information on the supported params, visit https://docs.hyperbrowser.ai/reference/sdks/python/scrape#start-scrape-job-and-wait or https://docs.hyperbrowser.ai/reference/sdks/python/crawl#start-crawl-job-and-wait."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = HyperbrowserLoader(\n",
    "    urls=\"https://example.com\",\n",
    "    api_key=\"YOUR_API_KEY\",\n",
    "    operation=\"scrape\",\n",
    "    params={\"scrape_options\": {\"include_tags\": [\"h1\", \"h2\", \"p\"]}},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## API reference\n",
    "\n",
    "- [GitHub](https://github.com/hyperbrowserai/langchain-hyperbrowser/)\n",
    "- [PyPI](https://pypi.org/project/langchain-hyperbrowser/)\n",
    "- [Hyperbrowser Docs](https://docs.hyperbrowser.ai/)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
67
docs/docs/integrations/providers/hyperbrowser.mdx
Normal file
@ -0,0 +1,67 @@
# Hyperbrowser

> [Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for web scraping needs, such as scraping a single page or crawling an entire site.
>
> Key Features:
>
> - Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches
> - Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright
> - Powerful APIs - Easy-to-use APIs for scraping/crawling any site, and much more
> - Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies

For more information about Hyperbrowser, visit the [Hyperbrowser website](https://hyperbrowser.ai) or the [Hyperbrowser docs](https://docs.hyperbrowser.ai).

## Installation and Setup

To get started with `langchain-hyperbrowser`, install the package using pip:

```bash
pip install langchain-hyperbrowser
```

Then configure credentials by setting the following environment variable:

`HYPERBROWSER_API_KEY=<your-api-key>`

You can get your API key from https://app.hyperbrowser.ai/
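
One way to set the variable from Python, for example at the top of a notebook, is sketched below. The helper name `ensure_hyperbrowser_api_key` is hypothetical and not part of `langchain-hyperbrowser`; it uses only the standard library and prompts only when the variable is not already set.

```python
import getpass
import os


def ensure_hyperbrowser_api_key(prompt=getpass.getpass) -> str:
    """Return the API key, prompting only if HYPERBROWSER_API_KEY is unset.

    Hypothetical helper for illustration; not part of langchain-hyperbrowser.
    """
    key = os.environ.get("HYPERBROWSER_API_KEY")
    if not key:
        # Ask without echoing the key, then store it for the current process.
        key = prompt("Enter your Hyperbrowser API key: ")
        os.environ["HYPERBROWSER_API_KEY"] = key
    return key
```

Calling `ensure_hyperbrowser_api_key()` before constructing the loader keeps the key out of the source file.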

## Document Loader

The `HyperbrowserLoader` class in `langchain-hyperbrowser` can load content from a single page or multiple pages, or crawl an entire site.
The content can be loaded as markdown or HTML.

```python
from langchain_hyperbrowser import HyperbrowserLoader

loader = HyperbrowserLoader(urls="https://example.com")
docs = loader.load()

print(docs[0])
```
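
The loader also supports lazy loading through `lazy_load()`, as shown in the Lazy Load section of the notebook added in this commit. A minimal sketch of grouping lazily loaded documents into fixed-size batches follows. `batched_documents` is a hypothetical helper, not part of the package; it accepts any iterable, so it can be tried without an API key, and it also yields the final partial batch.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched_documents(docs: Iterable[T], batch_size: int = 10) -> Iterator[List[T]]:
    """Yield lists of up to `batch_size` items, including a final partial batch.

    Hypothetical helper for illustration; not part of langchain-hyperbrowser.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(docs)
    while True:
        # Pull the next slice without materializing the whole iterable.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

With a loader, this could be used as `for batch in batched_documents(loader.lazy_load()): ...`, passing each batch to a paged operation such as an index upsert.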

## Advanced Usage

You can specify the operation to be performed by the loader. The default operation is `scrape`. For `scrape`, you can provide a single URL or a list of URLs to be scraped. For `crawl`, you can only provide a single URL. The `crawl` operation will crawl the provided page and subpages and return a document for each page.

```python
loader = HyperbrowserLoader(
    urls="https://hyperbrowser.ai", api_key="YOUR_API_KEY", operation="crawl"
)
```
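
The URL rule described above (one URL or a list for `scrape`, exactly one URL for `crawl`) can be checked before constructing the loader. The sketch below is an assumption-laden illustration: `check_urls` is a hypothetical helper, not the package's own validation, and it assumes `scrape` and `crawl` are the only operations.

```python
from typing import List, Sequence, Union


def check_urls(urls: Union[str, Sequence[str]], operation: str = "scrape") -> List[str]:
    """Return the URLs as a list, enforcing the one-URL rule for `crawl`.

    Hypothetical helper for illustration; not part of langchain-hyperbrowser.
    """
    if operation not in ("scrape", "crawl"):
        raise ValueError(f"Unsupported operation: {operation!r}")
    # A bare string is treated as a single URL, not as a sequence of characters.
    url_list = [urls] if isinstance(urls, str) else list(urls)
    if not url_list:
        raise ValueError("At least one URL is required")
    if operation == "crawl" and len(url_list) != 1:
        raise ValueError("`crawl` accepts exactly one URL")
    return url_list
```

Failing early like this gives a clearer error than discovering the problem after a remote job has been submitted.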

Optional params for the loader can also be provided in the `params` argument. For more information on the supported params, visit https://docs.hyperbrowser.ai/reference/sdks/python/scrape#start-scrape-job-and-wait or https://docs.hyperbrowser.ai/reference/sdks/python/crawl#start-crawl-job-and-wait.

```python
loader = HyperbrowserLoader(
    urls="https://example.com",
    api_key="YOUR_API_KEY",
    operation="scrape",
    params={"scrape_options": {"include_tags": ["h1", "h2", "p"]}}
)
```

## Additional Resources

- [Hyperbrowser Docs](https://docs.hyperbrowser.ai/)
- [GitHub](https://github.com/hyperbrowserai/langchain-hyperbrowser/)
- [PyPI](https://pypi.org/project/langchain-hyperbrowser/)
@ -815,6 +815,13 @@ const FEATURE_TABLES = {
                source: "Uses Docling to load and parse web pages",
                api: "Package",
                apiLink: "https://python.langchain.com/docs/integrations/document_loaders/docling/"
            },
            {
                name: "Hyperbrowser",
                link: "hyperbrowser",
                source: "Platform for running and scaling headless browsers, can be used to scrape/crawl any site",
                api: "API",
                apiLink: "https://python.langchain.com/docs/integrations/document_loaders/hyperbrowser/"
            }
        ]
    },
@ -337,3 +337,7 @@ packages:
  path: .
  repo: AlwaysBluer/langchain-lindorm-integration
  downloads: 0
- name: langchain-hyperbrowser
  path: .
  repo: hyperbrowserai/langchain-hyperbrowser
  downloads: 0