mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-11 07:50:47 +00:00
Add HTML document_loader that includes page title metadata (#1720)
This `BSHTMLLoader` document_loader loads an HTML document, extracts its text, and adds the page title to the returned Document's metadata. The loader uses the already installed bs4 package to extract both the text content and the page title. This PR also includes an example HTML file and an integration test that runs against that file.

---------

Co-authored-by: Daniel Chalef <daniel.chalef@private.org>
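The behaviour described above (pull out the text, record the page title and the source path as metadata) can be sketched roughly as follows. This is not the code from this commit: to stay dependency-free it swaps bs4 for the standard library's `html.parser`, and the names `TitleAndTextParser` and `load_html` are hypothetical, introduced only for illustration. It also keeps the title out of the body text, which is an assumption of the sketch and not a statement about what `BSHTMLLoader` does.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple


class TitleAndTextParser(HTMLParser):
    """Hypothetical helper: collects <title> text and other text separately."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title_parts: List[str] = []
        self.text_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Title text goes to metadata only (an assumption of this sketch).
            self.title_parts.append(data)
            return
        # Collapse runs of whitespace and skip whitespace-only chunks.
        normalized = " ".join(data.split())
        if normalized:
            self.text_parts.append(normalized)


def load_html(html: str, source: str) -> Tuple[str, Dict[str, str]]:
    """Return (text, metadata), where metadata holds "source" and "title"."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()
    text = "\n".join(parser.text_parts)
    metadata = {"source": source, "title": "".join(parser.title_parts).strip()}
    return text, metadata


sample = (
    "<html><head><title>Chew dad's slippers</title></head>"
    "<body><h2>Chase the red dot</h2></body></html>"
)
text, metadata = load_html(sample, "example.html")
```

The metadata keys `title` and `source` mirror the ones the integration test in this commit asserts on; everything else in the sketch is illustrative.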
17
tests/integration_tests/document_loaders/test_bshtml.py
Normal file
@@ -0,0 +1,17 @@
from pathlib import Path

from langchain.document_loaders.html_bs import BSHTMLLoader


def test_bs_html_loader() -> None:
    """Test the BSHTMLLoader document loader."""
    file_path = Path(__file__).parent.parent / "examples/example.html"
    loader = BSHTMLLoader(str(file_path))
    docs = loader.load()

    assert len(docs) == 1

    metadata = docs[0].metadata

    assert metadata["title"] == "Chew dad's slippers"
    assert metadata["source"] == str(file_path)
25
tests/integration_tests/examples/example.html
Normal file
@@ -0,0 +1,25 @@
<html>
  <head>
    <title>Chew dad's slippers</title>
  </head>
  <body>
    <h1>
      Instead of drinking water from the cat bowl, make sure to steal water from
      the toilet
    </h1>
    <h2>Chase the red dot</h2>
    <p>
      Munch, munch, chomp, chomp hate dogs. Spill litter box, scratch at owner,
      destroy all furniture, especially couch get scared by sudden appearance of
      cucumber cat is love, cat is life fat baby cat best buddy little guy for
      catch eat throw up catch eat throw up bad birds jump on fridge. Purr like
      a car engine oh yes, there is my human woman she does best pats ever that
      all i like about her hiss meow .
    </p>
    <p>
      Dead stare with ears cocked when owners are asleep, cry for no apparent
      reason meow all night. Plop down in the middle where everybody walks favor
      packaging over toy. Sit on the laptop kitty pounce, trip, faceplant.
    </p>
  </body>
</html>