diff --git a/docs/docs/integrations/document_loaders/hyperbrowser.ipynb b/docs/docs/integrations/document_loaders/hyperbrowser.ipynb new file mode 100644 index 00000000000..723d92afe31 --- /dev/null +++ b/docs/docs/integrations/document_loaders/hyperbrowser.ipynb @@ -0,0 +1,221 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# HyperbrowserLoader" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for web scraping needs, such as scraping a single page or crawling an entire site.\n", + "\n", + "Key Features:\n", + "- Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches\n", + "- Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright\n", + "- Powerful APIs - Easy-to-use APIs for scraping/crawling any site, and much more\n", + "- Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies\n", + "\n", + "This notebook provides a quick overview for getting started with the Hyperbrowser [document loader](https://python.langchain.com/docs/concepts/#document-loaders).\n", + "\n", + "For more information about Hyperbrowser, visit the [Hyperbrowser website](https://hyperbrowser.ai) or the [Hyperbrowser docs](https://docs.hyperbrowser.ai).\n", + "\n", + "## Overview\n", + "### Integration details\n", + "\n", + "| Class | Package | Local | Serializable | JS support |\n", + "| :--- | :--- | :---: | :---: | :---: |\n", + "| HyperbrowserLoader | langchain-hyperbrowser | ❌ | ❌ | ❌ | \n", + "### Loader features\n", + "| Source | Document Lazy Loading | Native Async Support |\n", + "| :---: | :---: | :---: | \n", + "| HyperbrowserLoader | ✅ | ✅ | \n", + 
"\n", + "## Setup\n", + "\n", + "To access the Hyperbrowser document loader, you'll need to install the `langchain-hyperbrowser` integration package, create a Hyperbrowser account, and get an API key.\n", + "\n", + "### Credentials\n", + "\n", + "Head to [Hyperbrowser](https://app.hyperbrowser.ai/) to sign up and generate an API key. Once you've done this, set the `HYPERBROWSER_API_KEY` environment variable.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Installation\n", + "\n", + "Install **langchain-hyperbrowser**." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -qU langchain-hyperbrowser" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialization\n", + "\n", + "Now we can instantiate our loader object and load documents:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_hyperbrowser import HyperbrowserLoader\n", + "\n", + "loader = HyperbrowserLoader(\n", + " urls=\"https://example.com\",\n", + " api_key=\"YOUR_API_KEY\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Document(metadata={'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, page_content='Example Domain\\n\\n# Example Domain\\n\\nThis domain is for use in illustrative examples in documents. 
You may use this\\ndomain in literature without prior coordination or asking for permission.\\n\\n[More information...](https://www.iana.org/domains/example)')" + ] + }, + "execution_count": null, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "docs = loader.load()\n", + "docs[0]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(docs[0].metadata)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Lazy Load" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "page = []\n", + "for doc in loader.lazy_load():\n", + " page.append(doc)\n", + " if len(page) >= 10:\n", + " # do some paged operation, e.g.\n", + " # index.upsert(page)\n", + "\n", + " page = []" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Advanced Usage\n", + "\n", + "You can specify the operation to be performed by the loader. The default operation is `scrape`. For `scrape`, you can provide a single URL or a list of URLs to be scraped. For `crawl`, you can only provide a single URL. The `crawl` operation will crawl the provided page and subpages and return a document for each page." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "loader = HyperbrowserLoader(\n", + " urls=\"https://hyperbrowser.ai\", api_key=\"YOUR_API_KEY\", operation=\"crawl\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Optional params for the loader can also be provided in the `params` argument. For more information on the supported params, visit https://docs.hyperbrowser.ai/reference/sdks/python/scrape#start-scrape-job-and-wait or https://docs.hyperbrowser.ai/reference/sdks/python/crawl#start-crawl-job-and-wait." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "loader = HyperbrowserLoader(\n", + " urls=\"https://example.com\",\n", + " api_key=\"YOUR_API_KEY\",\n", + " operation=\"scrape\",\n", + " params={\"scrape_options\": {\"include_tags\": [\"h1\", \"h2\", \"p\"]}},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## API reference\n", + "\n", + "- [GitHub](https://github.com/hyperbrowserai/langchain-hyperbrowser/)\n", + "- [PyPI](https://pypi.org/project/langchain-hyperbrowser/)\n", + "- [Hyperbrowser Docs](https://docs.hyperbrowser.ai/)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.16" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/docs/integrations/providers/hyperbrowser.mdx b/docs/docs/integrations/providers/hyperbrowser.mdx new file mode 100644 index 00000000000..6efbd47a2eb --- /dev/null +++ b/docs/docs/integrations/providers/hyperbrowser.mdx @@ -0,0 +1,67 @@ +# Hyperbrowser + +> [Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for web scraping needs, such as scraping a single page or crawling an entire site. 
+> +> Key Features: +> +> - Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches +> - Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright +> - Powerful APIs - Easy-to-use APIs for scraping/crawling any site, and much more +> - Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies + +For more information about Hyperbrowser, visit the [Hyperbrowser website](https://hyperbrowser.ai) or the [Hyperbrowser docs](https://docs.hyperbrowser.ai). + +## Installation and Setup + +To get started with `langchain-hyperbrowser`, install the package using pip: + +```bash +pip install langchain-hyperbrowser +``` + +Then configure credentials by setting the following environment variable: + +`HYPERBROWSER_API_KEY=` + +You can get your API key from https://app.hyperbrowser.ai/. + +## Document Loader + +The `HyperbrowserLoader` class in `langchain-hyperbrowser` loads content from a single page or multiple pages, and can also crawl an entire site. +The content can be loaded as Markdown or HTML. + +```python +from langchain_hyperbrowser import HyperbrowserLoader + +loader = HyperbrowserLoader(urls="https://example.com") +docs = loader.load() + +print(docs[0]) +``` + +## Advanced Usage + +You can specify the operation to be performed by the loader. The default operation is `scrape`. For `scrape`, you can provide a single URL or a list of URLs to be scraped. For `crawl`, you can only provide a single URL. The `crawl` operation will crawl the provided page and subpages and return a document for each page. + +```python +loader = HyperbrowserLoader( + urls="https://hyperbrowser.ai", api_key="YOUR_API_KEY", operation="crawl" +) +``` + +Optional params for the loader can also be provided in the `params` argument. 
For more information on the supported params, visit https://docs.hyperbrowser.ai/reference/sdks/python/scrape#start-scrape-job-and-wait or https://docs.hyperbrowser.ai/reference/sdks/python/crawl#start-crawl-job-and-wait. + +```python +loader = HyperbrowserLoader( + urls="https://example.com", + api_key="YOUR_API_KEY", + operation="scrape", + params={"scrape_options": {"include_tags": ["h1", "h2", "p"]}} +) +``` + +## Additional Resources + +- [Hyperbrowser Docs](https://docs.hyperbrowser.ai/) +- [GitHub](https://github.com/hyperbrowserai/langchain-hyperbrowser/) +- [PyPI](https://pypi.org/project/langchain-hyperbrowser/) diff --git a/docs/src/theme/FeatureTables.js b/docs/src/theme/FeatureTables.js index bebe1739e6e..d56d5a80c15 100644 --- a/docs/src/theme/FeatureTables.js +++ b/docs/src/theme/FeatureTables.js @@ -815,6 +815,13 @@ const FEATURE_TABLES = { source: "Uses Docling to load and parse web pages", api: "Package", apiLink: "https://python.langchain.com/docs/integrations/document_loaders/docling/" + }, + { + name: "Hyperbrowser", + link: "hyperbrowser", + source: "Platform for running and scaling headless browsers, can be used to scrape/crawl any site", + api: "API", + apiLink: "https://python.langchain.com/docs/integrations/document_loaders/hyperbrowser/" } ] }, diff --git a/libs/packages.yml b/libs/packages.yml index 7c8970e7ecc..e95f8c37c1b 100644 --- a/libs/packages.yml +++ b/libs/packages.yml @@ -337,3 +337,7 @@ packages: path: . repo: AlwaysBluer/langchain-lindorm-integration downloads: 0 +- name: langchain-hyperbrowser + path: . + repo: hyperbrowserai/langchain-hyperbrowser + downloads: 0