mirror of
https://github.com/hwchase17/langchain.git
synced 2025-07-01 19:03:25 +00:00
Add how to use a custom scraping function with the sitemap loader. (#5847)
Hi! I just added an example of how to use a custom scraping function with the sitemap loader. I recently used this feature and had to dig in the source code to find it. I thought it might be useful to other devs to have an example in the Jupyter Notebook directly. I only added the example to the documentation page. @eyurtsev I was not able to run the lint. Please let me know if I have to do anything else. I know this is a very small contribution, but I hope it will be valuable. My Twitter handle is @web3Dav3. <!-- For a quicker response, figure out the right person to tag with @ @hwchase17 - project lead Tracing / Callbacks - @agola11 Async - @agola11 DataLoaders - @eyurtsev Models - @hwchase17 - @agola11 Agents / Tools / Toolkits - @vowelparrot VectorStores / Retrievers / Memory - @dev2049 -->
This commit is contained in:
parent
c66755b661
commit
0b4a51930c
@ -146,6 +146,73 @@
|
||||
"documents[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Add custom scraping rules\n",
|
||||
"\n",
|
||||
"The `SitemapLoader` uses `beautifulsoup4` for the scraping process, and it scrapes every element on the page by default. The `SitemapLoader` constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements.\n",
|
||||
"\n",
|
||||
" The following example shows how to develop and use a custom function to avoid navigation and header elements."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Import the `beautifulsoup4` library and define the custom function."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"pip install beautifulsoup4"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from bs4 import BeautifulSoup\n",
|
||||
"\n",
|
||||
"def remove_nav_and_header_elements(content: BeautifulSoup) -> str:\n",
|
||||
" # Find all 'nav' and 'header' elements in the BeautifulSoup object\n",
|
||||
" nav_elements = content.find_all('nav')\n",
|
||||
" header_elements = content.find_all('header')\n",
|
||||
"\n",
|
||||
" # Remove each 'nav' and 'header' element from the BeautifulSoup object\n",
|
||||
" for element in nav_elements + header_elements:\n",
|
||||
" element.decompose()\n",
|
||||
"\n",
|
||||
" return str(content.get_text())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Add your custom function to the `SitemapLoader` object."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"loader = SitemapLoader(\n",
|
||||
" \"https://langchain.readthedocs.io/sitemap.xml\",\n",
|
||||
" filter_urls=[\"https://python.langchain.com/en/latest/\"],\n",
|
||||
" parsing_function=remove_nav_and_header_elements\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
|
Loading…
Reference in New Issue
Block a user