Compare commits

...

2 Commits

Author SHA1 Message Date
isaac hershenson
0466dc90e1 edited based on recursiveurl comments 2024-06-06 18:04:14 -07:00
isaac hershenson
a824281c45 first draft of sitemaploader docs 2024-06-06 13:31:23 -07:00
2 changed files with 334 additions and 93 deletions


@@ -6,9 +6,27 @@
"source": [
"# Sitemap\n",
"\n",
"Extends from the `WebBaseLoader`, `SitemapLoader` loads a sitemap from a given URL, and then scrape and load all pages in the sitemap, returning each page as a Document.\n",
"Extending `WebBaseLoader`, `SitemapLoader` loads a sitemap from a given URL, then scrapes and loads all pages in the sitemap, returning each page as a Document. This is helpful when you are provided with a sitemap that contains all the pages you wish to use as Documents.\n",
"\n",
"The scraping is done concurrently. There are reasonable limits to concurrent requests, defaulting to 2 per second. If you aren't concerned about being a good citizen, or you control the scrapped server, or don't care about load. Note, while this will speed up the scraping process, but it may cause the server to block you. Be careful!"
"The scraping is done concurrently, with a reasonable default limit of 2 concurrent requests per second. You can increase this limit if you aren't concerned about being a good citizen, you control the server being scraped, or you don't care about load. Note that while this will speed up the scraping process, it may cause the server to block you. Be careful!"
]
},
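Before diving in, it may help to see what a sitemap actually contains. Below is a minimal, standard-library-only sketch (using a made-up `example.com` sitemap, not the real Semrush one) of the URL extraction that a sitemap loader performs internally:

```python
# Illustration only: a sitemap is an XML file listing URLs; the loader
# extracts each <loc> entry and then scrapes the corresponding page.
import xml.etree.ElementTree as ET

sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><changefreq>daily</changefreq></url>
  <url><loc>https://example.com/features/</loc><changefreq>daily</changefreq></url>
</urlset>"""

# Sitemaps use a namespace, so we must register it for findall().
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)  # ['https://example.com/', 'https://example.com/features/']
```

`SitemapLoader` handles all of this (plus the concurrent scraping) for you; this sketch is just to show the shape of the input.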
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic Example\n",
"\n",
"Let's run through a basic example of how to use the `SitemapLoader` with the [Semrush Features Sitemap](https://www.semrush.com/features/sitemap/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Library Installation\n",
"\n",
"Before starting, let's make sure we have installed the proper libraries needed to run our code examples."
]
},
{
@@ -17,16 +35,24 @@
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet nest_asyncio"
"%pip install --upgrade --quiet nest_asyncio langchain_community"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Asyncio Bug Fix\n",
"\n",
"The code block below should always be run first, as it fixes a known issue with asyncio and Jupyter notebooks."
]
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# fixes a bug with asyncio and jupyter\n",
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()"
@@ -34,123 +60,107 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING:root:USER_AGENT environment variable not set, consider setting it to identify your requests.\n"
]
}
],
"source": [
"from langchain_community.document_loaders.sitemap import SitemapLoader"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 12,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Fetching pages: 100%|###########################| 53/53 [00:06<00:00, 8.79it/s]\n"
]
}
],
"source": [
"sitemap_loader = SitemapLoader(web_path=\"https://api.python.langchain.com/sitemap.xml\")\n",
"sitemap_loader = SitemapLoader(web_path=\"https://www.semrush.com/features/sitemap/\")\n",
"\n",
"docs = sitemap_loader.load()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can change the `requests_per_second` parameter to increase the max concurrent requests. and use `requests_kwargs` to pass kwargs when send requests."
"Let's examine the first document we loaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sitemap_loader.requests_per_second = 2\n",
"# Optional: avoid `[SSL: CERTIFICATE_VERIFY_FAILED]` issue\n",
"sitemap_loader.requests_kwargs = {\"verify\": False}"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nLangChain Python API Reference Documentation.\\n\\n\\nYou will be automatically redirected to the new location of this page.\\n\\n', metadata={'source': 'https://api.python.langchain.com/en/stable/', 'loc': 'https://api.python.langchain.com/en/stable/', 'lastmod': '2024-02-09T01:10:49.422114+00:00', 'changefreq': 'weekly', 'priority': '1'})"
"{'source': 'https://www.semrush.com/features/',\n",
" 'loc': 'https://www.semrush.com/features/',\n",
" 'changefreq': 'daily'}"
]
},
"execution_count": 6,
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering sitemap URLs\n",
"\n",
"Sitemaps can be massive files, with thousands of URLs. Often you don't need every single one of them. You can filter the URLs by passing a list of strings or regex patterns to the `filter_urls` parameter. Only URLs that match one of the patterns will be loaded."
"docs[0].metadata"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"loader = SitemapLoader(\n",
" web_path=\"https://api.python.langchain.com/sitemap.xml\",\n",
" filter_urls=[\"https://api.python.langchain.com/en/latest\"],\n",
")\n",
"documents = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nLangChain Python API Reference Documentation.\\n\\n\\nYou will be automatically redirected to the new location of this page.\\n\\n', metadata={'source': 'https://api.python.langchain.com/en/latest/', 'loc': 'https://api.python.langchain.com/en/latest/', 'lastmod': '2024-02-12T05:26:10.971077+00:00', 'changefreq': 'daily', 'priority': '0.9'})"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
"name": "stdout",
"output_type": "stream",
"text": [
" Features | Semrush Skip to content Your browser is out of date. The site mi\n"
]
}
],
"source": [
"documents[0]"
"print(docs[0].page_content[:200].replace('\\n',''))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Add custom scraping rules\n",
"\n",
"The `SitemapLoader` uses `beautifulsoup4` for the scraping process, and it scrapes every element on the page by default. The `SitemapLoader` constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements.\n",
"\n",
" The following example shows how to develop and use a custom function to avoid navigation and header elements."
"Great! That matches the first page in the sitemap, and we are receiving the proper metadata and page content in a parsed format. Now let's look at some variations we can make to our basic example."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import the `beautifulsoup4` library and define the custom function."
"## More Examples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Adding a Parsing Function\n",
"\n",
"In the basic example we see that our loader returns raw HTML, which in most cases is not what we want. To address this we can pass in the `parsing_function` parameter, which allows us to control how the returned HTML is parsed. In the example below we define a parser that removes all `title` elements and returns the content of the remaining elements."
]
},
{
@@ -159,25 +169,23 @@
"metadata": {},
"outputs": [],
"source": [
"pip install beautifulsoup4"
"%pip install beautifulsoup4"
]
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"def remove_title_elements(content: BeautifulSoup) -> str:\n",
" # Find all 'title' elements in the BeautifulSoup object\n",
" title_elements = content.find_all(\"title\")\n",
"\n",
"def remove_nav_and_header_elements(content: BeautifulSoup) -> str:\n",
" # Find all 'nav' and 'header' elements in the BeautifulSoup object\n",
" nav_elements = content.find_all(\"nav\")\n",
" header_elements = content.find_all(\"header\")\n",
"\n",
" # Remove each 'nav' and 'header' element from the BeautifulSoup object\n",
" for element in nav_elements + header_elements:\n",
" # Remove each 'title' element from the BeautifulSoup object\n",
" for element in title_elements:\n",
" element.decompose()\n",
"\n",
" return str(content.get_text())"
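As a quick sanity check outside the loader, the parsing function above can be exercised directly on a small hand-written HTML snippet (assuming `beautifulsoup4` is installed):

```python
from bs4 import BeautifulSoup

def remove_title_elements(content: BeautifulSoup) -> str:
    # Remove each 'title' element, then return the remaining text
    for element in content.find_all("title"):
        element.decompose()
    return str(content.get_text())

html = "<html><head><title>Features | Semrush</title></head><body>Hello</body></html>"
result = remove_title_elements(BeautifulSoup(html, "html.parser"))
print(result)  # "Hello" — the title text is gone
```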
@@ -187,36 +195,142 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Add your custom function to the `SitemapLoader` object."
"Let's add our custom parsing function to the `SitemapLoader` object."
]
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 15,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Fetching pages: 100%|###########################| 53/53 [00:05<00:00, 9.39it/s]\n"
]
}
],
"source": [
"loader = SitemapLoader(\n",
" \"https://api.python.langchain.com/sitemap.xml\",\n",
" filter_urls=[\"https://api.python.langchain.com/en/latest/\"],\n",
" parsing_function=remove_nav_and_header_elements,\n",
")"
"sitemap_loader = SitemapLoader(\n",
" \"https://www.semrush.com/features/sitemap/\",\n",
" parsing_function=remove_title_elements,\n",
")\n",
"docs = sitemap_loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Skip to content Your browser is out of date. The site might not be displayed correctly. Please\n"
]
}
],
"source": [
"print(docs[0].page_content[:200].replace('\\n',''))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Local Sitemap\n",
"As we can see, the title element containing the string \"Features | Semrush\" has been removed, and our loader now returns only the text of the other elements."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering sitemap URLs\n",
"\n",
"The sitemap loader can also be used to load local files."
"Sitemaps can be massive files, with thousands of URLs. Often you don't need every single one of them. You can filter the URLs by passing a list of strings or regex patterns to the `filter_urls` parameter. Only URLs that match one of the patterns will be loaded. In this case, let's find URLs that contain the string \"ppc\"."
]
},
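Conceptually, the filtering works like the following simplified sketch (a sketch only; the loader's exact regex-matching semantics are an assumption here): a URL from the sitemap is kept if it matches at least one of the supplied patterns.

```python
import re

# Patterns as they would be passed to `filter_urls`.
filter_urls = [".*ppc.*"]

candidates = [
    "https://www.semrush.com/features/ppc-keyword-research-tools/",
    "https://www.semrush.com/features/local-seo-tools/",
]

# Keep a URL if any pattern matches it.
kept = [url for url in candidates if any(re.match(p, url) for p in filter_urls)]
print(kept)  # only the "ppc" URL remains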
{
"cell_type": "code",
"execution_count": null,
"execution_count": 47,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Fetching pages: 100%|#############################| 3/3 [00:00<00:00, 7.16it/s]\n"
]
}
],
"source": [
"sitemap_loader = SitemapLoader(\n",
" web_path=\"https://www.semrush.com/features/sitemap/\",\n",
" filter_urls=[\".*ppc.*\"],\n",
")\n",
"docs = sitemap_loader.load()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, we only pulled 3 documents instead of 53 - let's take a look at the metadata of the first document to ensure it actually pulled the URLs we wanted."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'source': 'https://www.semrush.com/features/ppc-keyword-research-tools/',\n",
" 'loc': 'https://www.semrush.com/features/ppc-keyword-research-tools/',\n",
" 'changefreq': 'daily'}"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0].metadata"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, this URL does ideed contain the string \"ppc\" which is exactly what we expected."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Local Sitemap\n",
"\n",
"The sitemap loader can also be used to load local files, as show in the code example below."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Fetching pages: 100%|#############################| 3/3 [00:00<00:00, 16.53it/s]\n"
]
}
],
"source": [
"sitemap_loader = SitemapLoader(web_path=\"example_data/sitemap.xml\", is_local=True)\n",
"\n",
@@ -224,11 +338,13 @@
]
},
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": []
"source": [
"## More Topics\n",
"\n",
"There are a varity of other changes you cna make to the functionality of the base `SiteMapLoader`. For example you can change the `requests_per_second` parameter to increase the max concurrent requests, and use `requests_kwargs` to pass kwargs when sending requests. To read about all the possible modifications that can be made, read the API reference."
]
}
],
"metadata": {
@@ -247,7 +363,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
"version": "3.11.9"
}
},
"nbformat": 4,

View File

@@ -62,6 +62,127 @@ class SitemapLoader(WebBaseLoader):
Use the filter URLs argument to limit which URLs can be loaded.
See https://python.langchain.com/docs/security
Instantiate:
.. code-block:: python
from langchain_community.document_loaders.sitemap import SitemapLoader
url = "https://www.semrush.com/features/sitemap/"
sitemap_loader = SitemapLoader(
web_path=url,
# filter_urls=None,
# parsing_function=None,
# blocksize=None,
# blocknum=0,
# meta_function=None,
# is_local=False,
# continue_on_failure=False,
# restrict_to_same_domain=True,
# ...
)
Load:
Use ``.load()`` to synchronously load into memory all Documents, with one
Document per URL in the site map.
.. code-block:: python
docs = sitemap_loader.load()
print(docs[0].page_content.replace('\n','')[:100])
print(docs[0].metadata)
.. code-block:: python
Features | Semrush Skip to content Your browser is out of date. The site m
{'source': 'https://www.semrush.com/features/', 'loc': 'https://www.semrush.com/features/', 'changefreq': 'daily'}
Async load:
.. code-block:: python
docs = await sitemap_loader.aload()
print(docs[0].page_content.replace('\n','')[:100])
print(docs[0].metadata)
.. code-block:: python
Features | Semrush Skip to content Your browser is out of date. The site m
{'source': 'https://www.semrush.com/features/', 'loc': 'https://www.semrush.com/features/', 'changefreq': 'daily'}
Lazy load:
.. code-block:: python
docs = []
docs_lazy = sitemap_loader.lazy_load()
# async variant:
# docs_lazy = await loader.alazy_load()
for doc in docs_lazy:
docs.append(doc)
print(docs[0].page_content.replace('\n','')[:100])
print(docs[0].metadata)
.. code-block:: python
Features | Semrush Skip to content Your browser is out of date. The site m
{'source': 'https://www.semrush.com/features/', 'loc': 'https://www.semrush.com/features/', 'changefreq': 'daily'}
Content parsing:
By default the loader sets the Document page content as all the text contained
on the page. To fine tune our parser we can use the ``parsing_function`` parameter.
.. code-block:: python
from bs4 import BeautifulSoup
def remove_title_elements(content: BeautifulSoup) -> str:
# Find all 'title' elements in the BeautifulSoup object
title_elements = content.find_all("title")
# Remove each 'title' element from the BeautifulSoup object
for element in title_elements:
element.decompose()
return str(content.get_text())
sitemap_loader = SitemapLoader(
"https://www.semrush.com/features/sitemap/",
parsing_function=remove_title_elements,
)
docs = sitemap_loader.load()
print(docs[0].page_content[:200].replace('\n',''))
.. code-block:: python
Skip to content Your browser is out of date. The site might not be displayed correctly. Please
Filtering URLs:
By default our loader loads every single website on the site map. We can use four
parameters to filter what URLs it pulls. We can pass a list of regexes to the
``filter_urls`` parameter to match specific URLs. We can also use the ``blocksize``
and ``blocknum`` params to select a specific block of URLs out of the entire site
map. We can also set the ``restrict_to_same_domain`` parameter to further restrict
what URLs get pulled.
.. code-block:: python
sitemap_loader = SitemapLoader(
"https://www.semrush.com/features/sitemap/",
restrict_to_same_domain=True,
filter_urls=['.*tools.*'],
blocksize=5,
blocknum=1
)
docs = sitemap_loader.load()
[doc.metadata['source'] for doc in docs]
.. code-block:: python
['https://www.semrush.com/features/local-seo-tools/',
'https://www.semrush.com/features/link-building-and-prospecting-tools/',
'https://www.semrush.com/features/technical-seo-tools/',
'https://www.semrush.com/features/pr-monitoring-tools/',
'https://www.semrush.com/features/serp-tracking-tools/']
"""
def __init__(
@@ -216,3 +337,7 @@ class SitemapLoader(WebBaseLoader):
page_content=self.parsing_function(result),
metadata=self.meta_function(els[i], result),
)
async def aload(self) -> List[Document]:
"""Load data into Document objects."""
return [document async for document in self.alazy_load()]