Files
langchain/docs/versioned_docs/version-0.2.x/how_to/document_loader_html.mdx
2024-04-18 20:41:49 -07:00

110 lines
3.9 KiB
Plaintext

# HTML
>[The HyperText Markup Language or HTML](https://en.wikipedia.org/wiki/HTML) is the standard markup language for documents designed to be displayed in a web browser.
This covers how to load `HTML` documents into a document format that we can use downstream.
```python
from langchain_community.document_loaders import UnstructuredHTMLLoader
```
```python
loader = UnstructuredHTMLLoader("example_data/fake-content.html")
```
```python
data = loader.load()
```
```python
data
```
<CodeOutputBlock lang="python">
```
[Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]
```
</CodeOutputBlock>
## Loading HTML with BeautifulSoup4
We can also use `BeautifulSoup4` to load HTML documents using the `BSHTMLLoader`. This will extract the text from the HTML into `page_content`, and the page title as `title` into `metadata`.
```python
from langchain_community.document_loaders import BSHTMLLoader
```
```python
loader = BSHTMLLoader("example_data/fake-content.html")
data = loader.load()
data
```
<CodeOutputBlock lang="python">
```
[Document(page_content='\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'})]
```
</CodeOutputBlock>
## Loading HTML with FireCrawlLoader
[FireCrawl](https://firecrawl.dev/?ref=langchain) crawls and convert any website into markdown. It crawls all accessible subpages and give you clean markdown and metadata for each.
FireCrawl handles complex tasks such as reverse proxies, caching, rate limits, and content blocked by JavaScript.
### Prerequisite
You need to have a FireCrawl API key to use this loader. You can get one by signing up at [FireCrawl](https://firecrawl.dev/?ref=langchainpy).
```python
%pip install --upgrade --quiet langchain langchain-community firecrawl-py
from langchain_community.document_loaders import FireCrawlLoader
loader = FireCrawlLoader(
api_key="YOUR_API_KEY", url="https://firecrawl.dev", mode="crawl"
)
data = loader.load()
```
For more information on how to use FireCrawl, visit [FireCrawl](https://firecrawl.dev/?ref=langchainpy).
## Loading HTML with AzureAIDocumentIntelligenceLoader
[Azure AI Document Intelligence](https://aka.ms/doc-intelligence) (formerly known as `Azure Form Recognizer`) is machine-learning
based service that extracts texts (including handwriting), tables, document structures (e.g., titles, section headings, etc.) and key-value-pairs from
digital or scanned PDFs, images, Office and HTML files. Document Intelligence supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.
This [current implementation](https://aka.ms/di-langchain) of a loader using `Document Intelligence` can incorporate content page-wise and turn it into LangChain documents. The default output format is markdown, which can be easily chained with `MarkdownHeaderTextSplitter` for semantic document chunking. You can also use `mode="single"` or `mode="page"` to return pure texts in a single page or document split by page.
### Prerequisite
An Azure AI Document Intelligence resource in one of the 3 preview regions: **East US**, **West US2**, **West Europe** - follow [this document](https://learn.microsoft.com/azure/ai-services/document-intelligence/create-document-intelligence-resource?view=doc-intel-4.0.0) to create one if you don't have. You will be passing `<endpoint>` and `<key>` as parameters to the loader.
```python
%pip install --upgrade --quiet langchain langchain-community azure-ai-documentintelligence
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
file_path = "<filepath>"
endpoint = "<endpoint>"
key = "<key>"
loader = AzureAIDocumentIntelligenceLoader(
api_endpoint=endpoint, api_key=key, file_path=file_path, api_model="prebuilt-layout"
)
documents = loader.load()
```