community[patch]: add HTTP response headers Content-Type to metadata of RecursiveUrlLoader document (#20875)

**Description:** 
The RecursiveUrlLoader loader offers a link_regex parameter that can
filter out URLs. However, this filtering capability is limited, and if
the internal links of the website change, unexpected resources may be
loaded. These resources, such as font files, can cause problems in
subsequent embedding processing.

> https://blog.langchain.dev/assets/fonts/source-sans-pro-v21-latin-ext_latin-regular.woff2?v=0312715cbf

We can add the Content-Type from the HTTP response headers to the
document metadata, so that developers can decide for themselves which
resources to keep.

For example, the following content types are likely good choices for text knowledge:

- text/plain - simple text file
- text/html - HTML web page
- text/xml - XML format file
- application/json - JSON format data
- application/pdf - PDF file
- application/msword - Word document

while the following can usually be ignored:

- text/css - CSS stylesheet
- text/javascript - JavaScript script
- application/octet-stream - binary data
- image/jpeg - JPEG image
- image/png - PNG image
- image/gif - GIF image
- image/svg+xml - SVG image
- audio/mpeg - MPEG audio files
- video/mp4 - MP4 video file
- application/font-woff - WOFF font file
- application/font-ttf - TTF font file
- application/zip - ZIP compressed file
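
As a rough illustration of the filtering described above, the sketch below keeps only documents whose `content_type` metadata is on an allow-list. It is a minimal sketch under stated assumptions, not part of this change: `Doc` is a stand-in for the LangChain `Document` class so the snippet runs offline, `TEXT_CONTENT_TYPES`, `base_content_type`, and `keep_text_docs` are hypothetical names, and it assumes the stored value may carry parameters such as `; charset=utf-8`.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set


@dataclass
class Doc:
    """Stand-in for a loaded document (assumption: only .metadata is needed)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


# Hypothetical allow-list, drawn from the "good choice" list above.
TEXT_CONTENT_TYPES: Set[str] = {
    "text/plain",
    "text/html",
    "text/xml",
    "application/pdf",
    "application/msword",
}


def base_content_type(value: Optional[str]) -> str:
    """Drop parameters such as '; charset=utf-8' and lower-case the media type."""
    return (value or "").split(";", 1)[0].strip().lower()


def keep_text_docs(
    docs: Iterable[Doc], allowed: Set[str] = TEXT_CONTENT_TYPES
) -> List[Doc]:
    """Return only the documents whose content type is on the allow-list."""
    return [
        doc
        for doc in docs
        if base_content_type(doc.metadata.get("content_type")) in allowed
    ]


docs = [
    Doc("<html>...</html>", {"source": "https://example.com/",
                             "content_type": "text/html; charset=utf-8"}),
    Doc("binary font data", {"source": "https://example.com/a.woff2",
                             "content_type": "application/font-woff"}),
    Doc("no header seen", {"source": "https://example.com/unknown"}),
]
kept = keep_text_docs(docs)
```

Documents without a `content_type` entry are dropped here; whether to keep or drop them is a design choice left to the caller.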

**Twitter handle:** @coolbeevip

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Lei Zhang
2024-04-26 02:29:41 +08:00
committed by GitHub
parent 37cbbc00a9
commit 748a6ae609
2 changed files with 69 additions and 16 deletions


@@ -35,7 +35,7 @@ def test_sync_recursive_url_loader() -> None:
         url, extractor=lambda _: "placeholder", use_async=False, max_depth=2
     )
     docs = loader.load()
-    assert len(docs) == 25
+    assert len(docs) == 24
     assert docs[0].page_content == "placeholder"
@@ -55,3 +55,17 @@ def test_loading_invalid_url() -> None:
     )
     docs = loader.load()
     assert len(docs) == 0
+
+
+def test_sync_async_metadata_necessary_properties() -> None:
+    url = "https://docs.python.org/3.9/"
+    loader = RecursiveUrlLoader(url, use_async=False, max_depth=2)
+    async_loader = RecursiveUrlLoader(url, use_async=True, max_depth=2)
+    docs = loader.load()
+    async_docs = async_loader.load()
+    for doc in docs:
+        assert "source" in doc.metadata
+        assert "content_type" in doc.metadata
+    for doc in async_docs:
+        assert "source" in doc.metadata
+        assert "content_type" in doc.metadata