mirror of
https://github.com/hwchase17/langchain.git
synced 2025-10-04 11:49:23 +00:00
Vwp/docs improved document loaders (#4006)
Huge thanks to @leo-gan for improving the document loaders notebooks

Co-authored-by: Leonid Ganeline <leo.gan.57@gmail.com>
@@ -1,13 +1,16 @@
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "2dfc4698",
   "metadata": {},
   "source": [
    "# Diffbot\n",
    "\n",
    ">Unlike traditional web scraping tools, [Diffbot](https://docs.diffbot.com/docs) doesn't require any rules to read the content on a page.\n",
    ">It starts with computer vision, which classifies a page into one of 20 possible types. Content is then interpreted by a machine learning model trained to identify the key attributes on a page based on its type.\n",
    ">The result is a website transformed into clean structured data (like JSON or CSV), ready for your application.\n",
    "\n",
    "This covers how to extract HTML documents from a list of URLs using the [Diffbot extract API](https://www.diffbot.com/products/extract/), into a document format that we can use downstream."
   ]
  },
@@ -24,7 +27,6 @@
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "6fffec88",
   "metadata": {},
@@ -45,7 +47,6 @@
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e0ce8c05",
   "metadata": {},