Vwp/docs improved document loaders (#4006)

Huge thanks to @leo-gan for improving the document loaders notebooks.

Co-authored-by: Leonid Ganeline <leo.gan.57@gmail.com>
2025-10-07 05:07:26 +00:00 · 2023-05-02 15:24:53 -07:00
parent 1c68cbdb28
commit aa38355999
57 changed files with 1227 additions and 779 deletions
--- a/docs/modules/indexes/document_loaders/examples/sitemap.ipynb
+++ b/docs/modules/indexes/document_loaders/examples/sitemap.ipynb
@@ -4,9 +4,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "# Sitemap Loader\n",
+    "# Sitemap\n",
    "\n",
-    "Extends from the [WebBaseLoader](), this will load a sitemap from a given URL, and then scrape and load all the pages in the sitemap, returning each page as a document.\n",
+    "Extending `WebBaseLoader`, this loader fetches a sitemap from a given URL, then scrapes and loads all pages in the sitemap, returning each page as a Document.\n",
    "\n",
    "The scraping is done concurrently, using `WebBaseLoader`.  Concurrent requests are limited to a reasonable rate, defaulting to 2 per second.  If you aren't concerned about being a good citizen, or you control the server you are scraping and don't care about load, you can raise the `requests_per_second` parameter to increase the max concurrent requests.  Note that while this will speed up the scraping process, it may cause the server to block you.  Be careful!"
   ]
@@ -20,10 +20,10 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "Requirement already satisfied: nest_asyncio in /Users/tasp/Code/projects/langchain/.venv/lib/python3.10/site-packages (1.5.6)\r\n",
-      "\r\n",
-      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\r\n",
-      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\r\n"
+      "Requirement already satisfied: nest_asyncio in /Users/tasp/Code/projects/langchain/.venv/lib/python3.10/site-packages (1.5.6)\n",
+      "\n",
+      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n",
+      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
     ]
    }
   ],
@@ -39,6 +39,7 @@
   "source": [
    "# fixes a bug with asyncio and jupyter\n",
    "import nest_asyncio\n",
+    "\n",
    "nest_asyncio.apply()"
   ]
  },
@@ -88,7 +89,7 @@
   "source": [
    "## Filtering sitemap URLs\n",
    "\n",
-    "Sitemaps can be massive files, with thousands of urls.  Often you don't need every single one of them.  You can filter the urls by passing a list of strings or regex patterns to the `url_filter` parameter.  Only urls that match one of the patterns will be loaded."
+    "Sitemaps can be massive files, with thousands of URLs.  Often you don't need every single one of them.  You can filter the URLs by passing a list of strings or regex patterns to the `url_filter` parameter.  Only URLs that match one of the patterns will be loaded."
   ]
  },
  {
@@ -148,9 +149,9 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.9.1"
+   "version": "3.10.6"
  }
 },
 "nbformat": 4,
- "nbformat_minor": 1
+ "nbformat_minor": 4
 }
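
For readers of the "Filtering sitemap URLs" cell touched above, here is a rough, standard-library-only sketch of what pattern-based filtering of sitemap entries can look like. It is not the loader's implementation: the helper name `filter_sitemap_urls`, the sample XML, and the choice of `re.match` (anchored at the start of each URL) are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def filter_sitemap_urls(sitemap_xml, patterns):
    """Hypothetical helper: return the <loc> URLs matching any pattern.

    Each pattern is treated as a regex matched from the start of the URL,
    so a plain URL prefix also works as a pattern.
    """
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
    compiled = [re.compile(p) for p in patterns]
    return [u for u in urls if any(rx.match(u) for rx in compiled)]


# Made-up sitemap used only for this illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
  <url><loc>https://example.com/docs/api</loc></url>
</urlset>"""

docs_only = filter_sitemap_urls(SAMPLE_SITEMAP, [r"https://example\.com/docs/"])
print(docs_only)
```

Filtering before any page is fetched keeps the number of requests proportional to the pages actually needed, which matters for sitemaps with thousands of entries.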