Add how to use a custom scraping function with the sitemap loader. (#5847)

Hi! I just added an example of how to use a custom scraping function with the sitemap loader. I recently used this feature and had to dig in the source code to find it. I thought it might be useful to other devs to have an example in the Jupyter Notebook directly. I only added the example to the documentation page. @eyurtsev I was not able to run the lint. Please let me know if I have to do anything else. I know this is a very small contribution, but I hope it will be valuable. My Twitter handle is @web3Dav3.
2025-08-18 09:01:03 +00:00 · 2023-06-07 23:16:51 -03:00 · 2023-06-07 23:16:51 -03:00 · 0b4a51930c
commit 0b4a51930c
parent c66755b661
1 changed files with 67 additions and 0 deletions
--- a/docs/modules/indexes/document_loaders/examples/sitemap.ipynb
+++ b/docs/modules/indexes/document_loaders/examples/sitemap.ipynb
@ -146,6 +146,73 @@
    "documents[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add custom scraping rules\n",
    "\n",
    "The `SitemapLoader` uses `beautifulsoup4` for the scraping process, and it scrapes every element on the page by default. The `SitemapLoader` constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements.\n",
    "\n",
    " The following example shows how to develop and use a custom function to avoid navigation and header elements."
   ]
  },
    {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import the `beautifulsoup4` library and define the custom function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pip install beautifulsoup4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "def remove_nav_and_header_elements(content: BeautifulSoup) -> str:\n",
    "    # Find all 'nav' and 'header' elements in the BeautifulSoup object\n",
    "    nav_elements = content.find_all('nav')\n",
    "    header_elements = content.find_all('header')\n",
    "\n",
    "    # Remove each 'nav' and 'header' element from the BeautifulSoup object\n",
    "    for element in nav_elements + header_elements:\n",
    "        element.decompose()\n",
    "\n",
    "    return str(content.get_text())"
   ]
 },    
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Add your custom function to the `SitemapLoader` object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = SitemapLoader(\n",
    "    \"https://langchain.readthedocs.io/sitemap.xml\",\n",
    "    filter_urls=[\"https://python.langchain.com/en/latest/\"],\n",
    "    parsing_function=remove_nav_and_header_elements\n",
    ")"
   ]
 },
  {
   "cell_type": "markdown",
   "metadata": {},