fix import error of bs4 (#1952)

The build broke when bs4 wasn't installed in the project.

Minor tweak to follow the other doc loaders' optional package-loading
conventions.
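
For illustration, a minimal sketch of what an optional package-loading convention can look like, assuming the idea is to defer the import until the loader is actually used and to raise an error with an install hint. `import_optional` and `LazyHTMLLoader` are hypothetical names, not the identifiers used in this commit.

```python
import importlib


def import_optional(module_name: str, pip_name: str):
    """Import a module on demand; raise a helpful error if it is missing.

    Hypothetical helper: the name and the error type are assumptions.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ValueError(
            f"Could not import `{module_name}`. "
            f"Install it with `pip install {pip_name}`."
        ) from exc


class LazyHTMLLoader:
    """Hypothetical loader: bs4 is only required once a loader is created."""

    def __init__(self, file_path: str):
        # Importing here (not at module top level) keeps the package
        # importable even when bs4 is not installed.
        self._bs4 = import_optional("bs4", "beautifulsoup4")
        self.file_path = file_path
```

With the import inside `__init__`, importing the module that defines the loader no longer fails when bs4 is absent; only constructing the loader does.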

Also updated the HTML docs to include a reference to this new HTML loader.

Side note: should there be two different HTML-to-text document loaders?
This new one only handles local files, while the existing unstructured
HTML loader handles HTML from both local and remote sources. So it seems
the main improvement was adding the title to the metadata, which is useful
but could also be added to `html.py`.
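
To make the side note concrete, here is a rough sketch of pulling the page title into the metadata dict. It uses the standard library's `html.parser` as a stand-in so it runs without bs4 or unstructured; `TitleParser` and `build_metadata` are hypothetical names and are not part of `html.py`.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first-level <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_metadata(source: str, html: str) -> dict:
    """Return loader-style metadata, adding `title` only when one exists."""
    parser = TitleParser()
    parser.feed(html)
    metadata = {"source": source}
    title = parser.title.strip()
    if title:
        metadata["title"] = title
    return metadata
```

Any loader that already has the raw HTML in hand could attach a title this way, whichever parser it uses.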
Tim Asp
2023-03-23 21:56:13 -07:00
committed by GitHub
parent 8990122d5d
commit 030ce9f506
3 changed files with 59 additions and 7 deletions

View File

@@ -1,5 +1,8 @@
<!DOCTYPE html>
<html>
<head>
<title>Test Title</title>
</head>
<body>
<h1>My First Heading</h1>

View File

@@ -48,9 +48,7 @@
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='My First Heading\\n\\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]"
]
"text/plain": "[Document(page_content='My First Heading\\n\\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]"
},
"execution_count": 4,
"metadata": {},
@@ -61,13 +59,57 @@
"data"
]
},
{
"cell_type": "markdown",
"source": [
"## Loading HTML with BeautifulSoup4\n",
"\n",
"We can also use BeautifulSoup4 to load HTML documents using the `BSHTMLLoader`. This will extract the text from the html into `page_content`, and the page title as `title` into `metadata`."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 16,
"id": "79b1bce4",
"metadata": {},
"outputs": [],
"source": []
"source": [
"from langchain.document_loaders import BSHTMLLoader"
]
},
{
"cell_type": "code",
"execution_count": 17,
"outputs": [
{
"data": {
"text/plain": "[Document(page_content='\\n\\nTest Title\\n\\n\\nMy First Heading\\nMy first paragraph.\\n\\n\\n', lookup_str='', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'}, lookup_index=0)]"
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"loader = BSHTMLLoader(\"example_data/fake-content.html\")\n",
"data = loader.load()\n",
"data"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false
}
}
],
"metadata": {