community[minor]: Firecrawl.dev integration (#20364)

Added the [FireCrawl](https://firecrawl.dev) document loader. FireCrawl crawls any website and converts it into LLM-ready data: it crawls all accessible subpages and gives you clean markdown for each.

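A minimal usage sketch of the loader added here (the URL and API key are placeholders):

```python
from langchain_community.document_loaders import FireCrawlLoader

# Crawl a site and get each accessible page back as a Document of clean markdown.
docs = FireCrawlLoader(
    url="https://firecrawl.dev",
    api_key="YOUR_API_KEY",
    mode="crawl",
).load()
```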
    - **Description:** Adds FireCrawl data loader
    - **Dependencies:** firecrawl-py
    - **Twitter handle:** @mendableai 

ccing contributors: (@ericciarla @nickscamara)

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Commit ad04585e30 (parent a1b105ac00)
Nicolas authored 2024-04-12 15:13:48 -04:00, committed by GitHub
6 changed files with 295 additions and 2 deletions

(One file diff is not shown because its lines are too long.)

---
@@ -7,7 +7,9 @@
"source": [
"# WebBaseLoader\n",
"\n",
"This covers how to use `WebBaseLoader` to load all text from `HTML` webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as `IMSDbLoader`, `AZLyricsLoader`, and `CollegeConfidentialLoader`"
"This covers how to use `WebBaseLoader` to load all text from `HTML` webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as `IMSDbLoader`, `AZLyricsLoader`, and `CollegeConfidentialLoader`. \n",
"\n",
"If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader`.\n"
]
},
{
@@ -277,4 +279,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}

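For contrast with the note added above, a rough sketch of the plain `WebBaseLoader` path (the URL is a placeholder): it fetches raw HTML and extracts text, with no subpage crawling, JS handling, or cleaning.

```python
from langchain_community.document_loaders import WebBaseLoader

# Loads the text of a single page; crawling, JS-heavy sites, and data cleaning
# are left to the caller (or to FireCrawlLoader).
loader = WebBaseLoader("https://example.com")
docs = loader.load()
```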
---

@@ -55,6 +55,32 @@ data
</CodeOutputBlock>

## Loading HTML with FireCrawlLoader

[FireCrawl](https://firecrawl.dev/?ref=langchain) crawls any website and converts it into markdown. It crawls all accessible subpages and gives you clean markdown and metadata for each.

FireCrawl handles complex tasks such as reverse proxies, caching, rate limits, and content blocked by JavaScript.

### Prerequisite

You need a FireCrawl API key to use this loader. You can get one by signing up at [FireCrawl](https://firecrawl.dev/?ref=langchainpy).

```python
%pip install --upgrade --quiet langchain langchain-community firecrawl-py

from langchain_community.document_loaders import FireCrawlLoader

loader = FireCrawlLoader(
    api_key="YOUR_API_KEY", url="https://firecrawl.dev", mode="crawl"
)

data = loader.load()
```

For more information on how to use FireCrawl, visit [FireCrawl](https://firecrawl.dev/?ref=langchainpy).
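The loader also supports `mode="scrape"` for a single page, and additional options can be forwarded through `params`. A rough sketch, assuming a `crawlerOptions` payload as described in the loader docstring (the `limit` key is illustrative; check the firecrawl-py docs for the supported options):

```python
from langchain_community.document_loaders import FireCrawlLoader

# Scrape a single page instead of crawling all accessible subpages.
single_page = FireCrawlLoader(
    api_key="YOUR_API_KEY", url="https://firecrawl.dev", mode="scrape"
).load()

# Forward crawler options to the FireCrawl API. The `limit` key below is
# illustrative; see https://github.com/mendableai/firecrawl-py for details.
limited_crawl = FireCrawlLoader(
    api_key="YOUR_API_KEY",
    url="https://firecrawl.dev",
    mode="crawl",
    params={"crawlerOptions": {"limit": 5}},
).load()
```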
## Loading HTML with AzureAIDocumentIntelligenceLoader

[Azure AI Document Intelligence](https://aka.ms/doc-intelligence) (formerly known as `Azure Form Recognizer`) is machine-learning

---

@@ -187,6 +187,9 @@ if TYPE_CHECKING:
    from langchain_community.document_loaders.figma import (
        FigmaFileLoader,  # noqa: F401
    )
    from langchain_community.document_loaders.firecrawl import (
        FireCrawlLoader,  # noqa: F401
    )
    from langchain_community.document_loaders.gcs_directory import (
        GCSDirectoryLoader,  # noqa: F401
    )
@@ -560,6 +563,7 @@ __all__ = [
    "FacebookChatLoader",
    "FaunaLoader",
    "FigmaFileLoader",
    "FireCrawlLoader",
    "FileSystemBlobLoader",
    "GCSDirectoryLoader",
    "GCSFileLoader",
@@ -745,6 +749,7 @@ _module_lookup = {
    "FacebookChatLoader": "langchain_community.document_loaders.facebook_chat",
    "FaunaLoader": "langchain_community.document_loaders.fauna",
    "FigmaFileLoader": "langchain_community.document_loaders.figma",
    "FireCrawlLoader": "langchain_community.document_loaders.firecrawl",
    "FileSystemBlobLoader": "langchain_community.document_loaders.blob_loaders",
    "GCSDirectoryLoader": "langchain_community.document_loaders.gcs_directory",
    "GCSFileLoader": "langchain_community.document_loaders.gcs_file",

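The `_module_lookup` entry is what lets `from langchain_community.document_loaders import FireCrawlLoader` resolve without importing the loader module eagerly. A minimal, standalone sketch of this lazy-import pattern (PEP 562); the real `__getattr__` in `langchain_community` may differ in detail:

```python
import importlib
from typing import Any

# Maps exported names to the module that actually defines them.
_module_lookup = {
    "FireCrawlLoader": "langchain_community.document_loaders.firecrawl",
}


def __getattr__(name: str) -> Any:
    # Called only when `name` is not found as a regular module attribute,
    # so the heavy import happens on first access rather than at import time.
    if name in _module_lookup:
        module = importlib.import_module(_module_lookup[name])
        return getattr(module, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```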
---

@@ -0,0 +1,66 @@
from typing import Iterator, Literal, Optional

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document
from langchain_core.utils import get_from_env


class FireCrawlLoader(BaseLoader):
    """Load web pages as Documents using FireCrawl.

    Must have the Python package `firecrawl` installed and a FireCrawl API key. See
    https://www.firecrawl.dev/ for more.
    """

    def __init__(
        self,
        url: str,
        *,
        api_key: Optional[str] = None,
        mode: Literal["crawl", "scrape"] = "crawl",
        params: Optional[dict] = None,
    ):
        """Initialize with API key and url.

        Args:
            url: The url to be crawled.
            api_key: The FireCrawl API key. If not specified, it will be read from
                the env var FIRECRAWL_API_KEY. Get an API key at
                https://firecrawl.dev.
            mode: The mode to run the loader in. Default is "crawl".
                Options include "scrape" (single url) and
                "crawl" (all accessible sub pages).
            params: The parameters to pass to the FireCrawl API.
                Examples include crawlerOptions.
                For more details, visit: https://github.com/mendableai/firecrawl-py
        """
        try:
            from firecrawl import FirecrawlApp  # noqa: F401
        except ImportError:
            raise ImportError(
                "`firecrawl` package not found, please run `pip install firecrawl-py`"
            )
        if mode not in ("crawl", "scrape"):
            raise ValueError(
                f"Unrecognized mode '{mode}'. Expected one of 'crawl', 'scrape'."
            )
        api_key = api_key or get_from_env("api_key", "FIRECRAWL_API_KEY")
        self.firecrawl = FirecrawlApp(api_key=api_key)
        self.url = url
        self.mode = mode
        self.params = params

    def lazy_load(self) -> Iterator[Document]:
        # "scrape" returns a single page dict; "crawl" returns a list of them.
        if self.mode == "scrape":
            firecrawl_docs = [self.firecrawl.scrape_url(self.url, params=self.params)]
        elif self.mode == "crawl":
            firecrawl_docs = self.firecrawl.crawl_url(self.url, params=self.params)
        else:
            raise ValueError(
                f"Unrecognized mode '{self.mode}'. Expected one of 'crawl', 'scrape'."
            )
        for doc in firecrawl_docs:
            yield Document(
                page_content=doc.get("markdown", ""),
                metadata=doc.get("metadata", {}),
            )

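A short usage sketch for the new loader (URL and API key are placeholders): `lazy_load` yields `Document`s one at a time, while `load()` collects them into a list.

```python
from langchain_community.document_loaders import FireCrawlLoader

loader = FireCrawlLoader(
    url="https://firecrawl.dev",
    api_key="YOUR_API_KEY",  # or set FIRECRAWL_API_KEY in the environment
    mode="crawl",
)

# Iterate instead of materializing every crawled page as a list up front.
for doc in loader.lazy_load():
    print(len(doc.page_content), sorted(doc.metadata))
```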
---

@@ -65,6 +65,7 @@ EXPECTED_ALL = [
    "FaunaLoader",
    "FigmaFileLoader",
    "FileSystemBlobLoader",
    "FireCrawlLoader",
    "GCSDirectoryLoader",
    "GCSFileLoader",
    "GeoDataFrameLoader",