Vwp/docs improved document loaders (#4006)

Huge thanks to @leo-gan for improving the document loaders notebooks

---------

Co-authored-by: Leonid Ganeline <leo.gan.57@gmail.com>
Zander Chase
2023-05-02 15:24:53 -07:00
committed by GitHub
parent 1c68cbdb28
commit aa38355999
57 changed files with 1227 additions and 779 deletions

@@ -1,13 +1,16 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "2dfc4698",
"metadata": {},
"source": [
"# Diffbot\n",
"\n",
">Unlike traditional web scraping tools, [Diffbot](https://docs.diffbot.com/docs) doesn't require any rules to read the content on a page.\n",
">It starts with computer vision, which classifies a page into one of 20 possible types. Content is then interpreted by a machine learning model trained to identify the key attributes on a page based on its type.\n",
">The result is a website transformed into clean structured data (like JSON or CSV), ready for your application.\n",
"\n",
"This covers how to extract HTML documents from a list of URLs using the [Diffbot extract API](https://www.diffbot.com/products/extract/), into a document format that we can use downstream."
]
},
@@ -24,7 +27,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6fffec88",
"metadata": {},
@@ -45,7 +47,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e0ce8c05",
"metadata": {},