Vwp/docs improved document loaders (#4006)

Huge thanks to @leo-gan for improving the document loaders notebooks

---------

Co-authored-by: Leonid Ganeline <leo.gan.57@gmail.com>
2025-10-04 11:49:23 +00:00 · 2023-05-02 15:24:53 -07:00
parent 1c68cbdb28
commit aa38355999
57 changed files with 1227 additions and 779 deletions
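For context on the Diffbot notebook touched by this commit: its intro says it extracts HTML documents from a list of URLs via the Diffbot extract API into a document format usable downstream. Below is a minimal, offline sketch of that flow, not the notebook's or LangChain's actual implementation. The helper names `build_diffbot_request_urls` and `to_documents` are hypothetical. The `v3/article` endpoint and the `objects`/`text` response fields are assumptions about Diffbot's Article API and should be checked against the Diffbot docs.

```python
from urllib.parse import urlencode

# Assumption: Diffbot's v3 Article endpoint; verify against the Diffbot docs.
DIFFBOT_ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_diffbot_request_urls(urls, api_token):
    """Return one Diffbot request URL per page URL (hypothetical helper)."""
    return [
        f"{DIFFBOT_ARTICLE_ENDPOINT}?{urlencode({'token': api_token, 'url': page_url})}"
        for page_url in urls
    ]


def to_documents(diffbot_response, source_url):
    """Flatten a Diffbot-style JSON response into simple document dicts.

    Assumes the response carries an "objects" list whose items have a "text" field.
    """
    return [
        {"page_content": obj.get("text", ""), "metadata": {"source": source_url}}
        for obj in diffbot_response.get("objects", [])
    ]
```

Fetching each request URL with an HTTP client and passing the parsed JSON to `to_documents` would yield one document per extracted object, with the originating URL kept in the metadata.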
--- a/docs/modules/indexes/document_loaders/examples/diffbot.ipynb
+++ b/docs/modules/indexes/document_loaders/examples/diffbot.ipynb
@@ -1,13 +1,16 @@
 {
 "cells": [
  {
-   "attachments": {},
   "cell_type": "markdown",
   "id": "2dfc4698",
   "metadata": {},
   "source": [
    "# Diffbot\n",
    "\n",
+    ">Unlike traditional web scraping tools, [Diffbot](https://docs.diffbot.com/docs) doesn't require any rules to read the content on a page.\n",
+    ">It starts with computer vision, which classifies a page into one of 20 possible types. Content is then interpreted by a machine learning model trained to identify the key attributes on a page based on its type.\n",
+    ">The result is a website transformed into clean structured data (like JSON or CSV), ready for your application.\n",
+    "\n",
    "This covers how to extract HTML documents from a list of URLs using the [Diffbot extract API](https://www.diffbot.com/products/extract/), into a document format that we can use downstream."
   ]
  },
@@ -24,7 +27,6 @@
   ]
  },
  {
-   "attachments": {},
   "cell_type": "markdown",
   "id": "6fffec88",
   "metadata": {},
@@ -45,7 +47,6 @@
   ]
  },
  {
-   "attachments": {},
   "cell_type": "markdown",
   "id": "e0ce8c05",
   "metadata": {},