community[patch]: add HTTP response headers Content-Type to metadata of RecursiveUrlLoader document (#20875)

**Description:**
The RecursiveUrlLoader loader offers a link_regex parameter that can filter out URLs. However, this filtering capability is limited, and if the internal links of the website change, unexpected resources may be loaded. These resources, such as font files, can cause problems in subsequent embedding processing.

> https://blog.langchain.dev/assets/fonts/source-sans-pro-v21-latin-ext_latin-regular.woff2?v=0312715cbf

We can add the Content-Type in the HTTP response headers to the document metadata so developers can choose which resources to use. This allows developers to make their own choices.

For example, the following may be a good choice for text knowledge:

- text/plain - simple text file
- text/html - HTML web page
- text/xml - XML format file
- text/json - JSON format data
- application/pdf - PDF file
- application/msword - Word document

and ignore the following:

- text/css - CSS stylesheet
- text/javascript - JavaScript script
- application/octet-stream - binary data
- image/jpeg - JPEG image
- image/png - PNG image
- image/gif - GIF image
- image/svg+xml - SVG image
- audio/mpeg - MPEG audio files
- video/mp4 - MP4 video file
- application/font-woff - WOFF font file
- application/font-ttf - TTF font file
- application/zip - ZIP compressed file

**Twitter handle:** @coolbeevip

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2025-09-19 00:58:32 +00:00 · 2024-04-26 02:29:41 +08:00
parent 37cbbc00a9
commit 748a6ae609
2 changed files with 69 additions and 16 deletions
--- a/libs/community/tests/integration_tests/document_loaders/test_recursive_url_loader.py
+++ b/libs/community/tests/integration_tests/document_loaders/test_recursive_url_loader.py
@@ -35,7 +35,7 @@ def test_sync_recursive_url_loader() -> None:
        url, extractor=lambda _: "placeholder", use_async=False, max_depth=2
    )
    docs = loader.load()
-    assert len(docs) == 25
+    assert len(docs) == 24
    assert docs[0].page_content == "placeholder"


@@ -55,3 +55,17 @@ def test_loading_invalid_url() -> None:
    )
    docs = loader.load()
    assert len(docs) == 0
+
+
+def test_sync_async_metadata_necessary_properties() -> None:
+    url = "https://docs.python.org/3.9/"
+    loader = RecursiveUrlLoader(url, use_async=False, max_depth=2)
+    async_loader = RecursiveUrlLoader(url, use_async=True, max_depth=2)
+    docs = loader.load()
+    async_docs = async_loader.load()
+    for doc in docs:
+        assert "source" in doc.metadata
+        assert "content_type" in doc.metadata
+    for doc in async_docs:
+        assert "source" in doc.metadata
+        assert "content_type" in doc.metadata
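
The commit message suggests that developers use the new `content_type` metadata to decide which loaded resources to keep. A minimal sketch of such a filter follows. It assumes the metadata value is the raw `Content-Type` header (so it may carry parameters such as `; charset=utf-8`, or be missing entirely). The `Doc` class, the `ALLOWED_TYPES` allowlist, and the helper names are hypothetical stand-ins for illustration and are not part of this change; real `Document` objects returned by `RecursiveUrlLoader.load()` should work the same way as long as they expose a `metadata` dict.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Set

# Hypothetical allowlist, drawn from the text-like types named in the commit message.
ALLOWED_TYPES: Set[str] = {"text/plain", "text/html", "text/xml", "application/pdf"}


@dataclass
class Doc:
    """Stand-in for a LangChain Document so the sketch runs without dependencies."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def media_type(content_type: Any) -> str:
    """Return the bare media type, e.g. 'text/html' from 'text/html; charset=utf-8'."""
    if not isinstance(content_type, str):
        return ""  # header missing or not a string
    return content_type.split(";", 1)[0].strip().lower()


def keep_text_docs(docs: Iterable[Any], allowed: Set[str] = ALLOWED_TYPES) -> List[Any]:
    """Keep only documents whose metadata['content_type'] is in the allowlist."""
    return [d for d in docs if media_type(d.metadata.get("content_type")) in allowed]


docs = [
    Doc("<html>...</html>", {"source": "https://example.com/", "content_type": "text/html; charset=utf-8"}),
    Doc("binary", {"source": "https://example.com/font.woff2", "content_type": "font/woff2"}),
    Doc("no header", {"source": "https://example.com/unknown"}),
]
kept = keep_text_docs(docs)
print([d.metadata["source"] for d in kept])
```

An allowlist is used here rather than a blocklist so that unexpected or missing content types are dropped by default; whether that is the right default depends on the site being crawled.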