community[minor]: Firecrawl.dev integration (#20364)

Added the [FireCrawl](https://firecrawl.dev) document loader. Firecrawl
crawls and converts any website into LLM-ready data. It crawls all
accessible subpages and gives you clean markdown for each.

    - **Description:** Adds FireCrawl data loader
    - **Dependencies:** firecrawl-py
    - **Twitter handle:** @mendableai 

ccing contributors: (@ericciarla @nickscamara)

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Nicolas
2024-04-12 15:13:48 -04:00
committed by GitHub
parent a1b105ac00
commit ad04585e30
6 changed files with 295 additions and 2 deletions

File diff suppressed because one or more lines are too long


@@ -7,7 +7,9 @@
"source": [
"# WebBaseLoader\n",
"\n",
"This covers how to use `WebBaseLoader` to load all text from `HTML` webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as `IMSDbLoader`, `AZLyricsLoader`, and `CollegeConfidentialLoader`"
"This covers how to use `WebBaseLoader` to load all text from `HTML` webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as `IMSDbLoader`, `AZLyricsLoader`, and `CollegeConfidentialLoader`. \n",
"\n",
"If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader`.\n"
]
},
{
@@ -277,4 +279,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}


@@ -55,6 +55,32 @@ data
</CodeOutputBlock>
## Loading HTML with FireCrawlLoader
[FireCrawl](https://firecrawl.dev/?ref=langchain) crawls and converts any website into markdown. It crawls all accessible subpages and gives you clean markdown and metadata for each.
FireCrawl handles complex tasks such as reverse proxies, caching, rate limits, and content blocked by JavaScript.
### Prerequisite
You need to have a FireCrawl API key to use this loader. You can get one by signing up at [FireCrawl](https://firecrawl.dev/?ref=langchainpy).
```python
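# Notebook magic: installs the packages this example needs
%pip install --upgrade --quiet langchain langchain-community firecrawl-py
from langchain_community.document_loaders import FireCrawlLoader
# mode="crawl" follows accessible subpages starting from the given URL
loader = FireCrawlLoader(
    api_key="YOUR_API_KEY", url="https://firecrawl.dev", mode="crawl"
)
data = loader.load()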
```
For more information on how to use FireCrawl, visit [FireCrawl](https://firecrawl.dev/?ref=langchainpy).
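To avoid hardcoding the API key as in the example above, one option is to read it from the environment and build the loader arguments from that. The sketch below is only an illustration: the helper name `firecrawl_loader_kwargs` and the `VALID_MODES` set are hypothetical, and the `FIRECRAWL_API_KEY` variable name and the `"scrape"` mode are assumptions about the loader rather than something this change states.

```python
import os

# Assumed set of modes; "crawl" appears in the example above, "scrape" is an assumption.
VALID_MODES = {"crawl", "scrape"}


def firecrawl_loader_kwargs(
    url: str, mode: str = "crawl", env_var: str = "FIRECRAWL_API_KEY"
) -> dict:
    """Hypothetical helper: build keyword arguments for FireCrawlLoader.

    Reads the API key from the environment variable `env_var` so the key
    does not have to be written into source code.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"api_key": api_key, "url": url, "mode": mode}


# Usage (needs firecrawl-py installed, a valid key, and network access):
# from langchain_community.document_loaders import FireCrawlLoader
# loader = FireCrawlLoader(**firecrawl_loader_kwargs("https://firecrawl.dev"))
# data = loader.load()
```

Failing early on a missing key or an unknown mode keeps configuration mistakes separate from errors raised during the crawl itself.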
## Loading HTML with AzureAIDocumentIntelligenceLoader
[Azure AI Document Intelligence](https://aka.ms/doc-intelligence) (formerly known as `Azure Form Recognizer`) is machine-learning