Mirror of https://github.com/hwchase17/langchain.git (synced 2025-07-03 03:38:06 +00:00)
Add how to use a custom scraping function with the sitemap loader. (#5847)
Hi! I just added an example of how to use a custom scraping function with the sitemap loader. I recently used this feature and had to dig into the source code to find it, so I thought other devs might find it useful to have an example directly in the Jupyter notebook.

I only added the example to the documentation page.

@eyurtsev I was not able to run the lint. Please let me know if I have to do anything else.

I know this is a very small contribution, but I hope it will be valuable. My Twitter handle is @web3Dav3.
parent c66755b661
commit 0b4a51930c
@@ -146,6 +146,73 @@
     "documents[0]"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Add custom scraping rules\n",
+    "\n",
+    "The `SitemapLoader` uses `beautifulsoup4` for the scraping process, and it scrapes every element on the page by default. The `SitemapLoader` constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements.\n",
+    "\n",
+    " The following example shows how to develop and use a custom function to avoid navigation and header elements."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Import the `beautifulsoup4` library and define the custom function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pip install beautifulsoup4"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from bs4 import BeautifulSoup\n",
+    "\n",
+    "def remove_nav_and_header_elements(content: BeautifulSoup) -> str:\n",
+    "    # Find all 'nav' and 'header' elements in the BeautifulSoup object\n",
+    "    nav_elements = content.find_all('nav')\n",
+    "    header_elements = content.find_all('header')\n",
+    "\n",
+    "    # Remove each 'nav' and 'header' element from the BeautifulSoup object\n",
+    "    for element in nav_elements + header_elements:\n",
+    "        element.decompose()\n",
+    "\n",
+    "    return str(content.get_text())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Add your custom function to the `SitemapLoader` object."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = SitemapLoader(\n",
+    "    \"https://langchain.readthedocs.io/sitemap.xml\",\n",
+    "    filter_urls=[\"https://python.langchain.com/en/latest/\"],\n",
+    "    parsing_function=remove_nav_and_header_elements\n",
+    ")"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
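
For anyone reviewing without network access, the function added in this diff can be checked offline. The snippet below is only a minimal sketch: the sample HTML is made up for illustration, and `html.parser` is an assumed parser choice, since the diff does not show how `SitemapLoader` builds the `BeautifulSoup` object it passes in.

```python
from bs4 import BeautifulSoup


def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Same logic as the notebook cell added in this commit:
    # collect every 'nav' and 'header' element, drop them, return the text.
    nav_elements = content.find_all("nav")
    header_elements = content.find_all("header")

    for element in nav_elements + header_elements:
        element.decompose()

    return str(content.get_text())


# Hypothetical page, invented for this check (not taken from the LangChain docs).
SAMPLE_HTML = """
<html><body>
<header><h1>Site title</h1></header>
<nav><a href="/">Home</a></nav>
<main><p>Body text worth keeping.</p></main>
</body></html>
"""

# Assumption: the stdlib-backed "html.parser" is used here only to avoid extra dependencies.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
text = remove_nav_and_header_elements(soup)
print(text.strip())
```

If this behaves as expected, passing the same function as `parsing_function=` to `SitemapLoader`, as in the last added cell, should apply it to each page the loader fetches.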