Added new use case docs for Web Scraping, Chromium loader, BS4 transformer (#8732)

- Description: Added a new use case category called "Web Scraping", and a tutorial to scrape websites using OpenAI Functions Extraction chain to the docs.
- Tag maintainer: @baskaryan @hwchase17
- Twitter handle: https://www.linkedin.com/in/haiphunghiem/ (I'm on LinkedIn mostly)

---------

Co-authored-by: Lance Martin <lance@langchain.dev>
2025-09-04 04:28:58 +00:00 · 2023-08-11 14:46:59 -04:00
parent 6cb763507c
commit e4418d1b7e
11 changed files with 1045 additions and 0 deletions
--- a/docs/extras/integrations/document_loaders/async_chromium.ipynb
+++ b/docs/extras/integrations/document_loaders/async_chromium.ipynb
@@ -0,0 +1,101 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ad553e51",
+   "metadata": {},
+   "source": [
+    "# Async Chromium\n",
+    "\n",
+    "Chromium is one of the browsers supported by Playwright, a library for automating browsers. \n",
+    "\n",
+    "By running `p.chromium.launch(headless=True)`, we are launching a headless instance of Chromium. \n",
+    "\n",
+    "Headless mode means that the browser is running without a graphical user interface.\n",
+    "\n",
+    "`AsyncChromiumLoader` loads the page, and then we use `Html2TextTransformer` to transform the HTML to text."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1c3a4c19",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! pip install -q playwright beautifulsoup4 html2text\n",
+    "! playwright install"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "dd2cdea7",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'<!DOCTYPE html><html lang=\"en\"><head><script src=\"https://s0.2mdn.net/instream/video/client.js\" asyn'"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from langchain.document_loaders import AsyncChromiumLoader\n",
+    "urls = [\"https://www.wsj.com\"]\n",
+    "loader = AsyncChromiumLoader(urls)\n",
+    "docs = loader.load()\n",
+    "docs[0].page_content[0:100]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "013caa7e",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "\"Skip to Main ContentSkip to SearchSkip to... Select * Top News * What's News *\\nFeatured Stories * Retirement * Life & Arts * Hip-Hop * Sports * Video *\\nEconomy * Real Estate * Sports * CMO * CIO * CFO * Risk & Compliance *\\nLogistics Report * Sustainable Business * Heard on the Street * Barron’s *\\nMarketWatch * Mansion Global * Penta * Opinion * Journal Reports * Sponsored\\nOffers Explore Our Brands * WSJ * * * * * Barron's * * * * * MarketWatch * * *\\n* * IBD # The Wall Street Journal SubscribeSig\""
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from langchain.document_transformers import Html2TextTransformer\n",
+    "html2text = Html2TextTransformer()\n",
+    "docs_transformed = html2text.transform_documents(docs)\n",
+    "docs_transformed[0].page_content[0:500]"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
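The notebook above loads raw HTML and then reduces it to plain text. As a rough illustration of what that second step involves, here is a minimal standard-library sketch. It is not how `Html2TextTransformer` is implemented (that class appears to delegate to the `html2text` package); the names `TextOnlyParser` and `html_to_text` and the sample HTML string are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the text content of an HTML string, joined by single spaces."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical input standing in for docs[0].page_content
sample = (
    "<html><head><script>var x = 1;</script></head>"
    "<body><h1>Top News</h1><p>Markets fell.</p></body></html>"
)
print(html_to_text(sample))
```

A real page such as the one loaded in the notebook has far messier markup, which is why a dedicated transformer is used there rather than a hand-rolled parser like this one.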
--- a/docs/extras/integrations/document_transformers/beautiful_soup.ipynb
+++ b/docs/extras/integrations/document_transformers/beautiful_soup.ipynb
@@ -0,0 +1,95 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "2ed9a4c2",
+   "metadata": {},
+   "source": [
+    "# Beautiful Soup\n",
+    "\n",
+    "Beautiful Soup offers fine-grained control over HTML content, enabling specific tag extraction, removal, and content cleaning. \n",
+    "\n",
+    "It's suited for cases where you want to extract specific information and clean up the HTML content according to your needs.\n",
+    "\n",
+    "For example, we can scrape text content within `<p>`, `<li>`, `<div>`, and `<a>` tags from the HTML content:\n",
+    "\n",
+    "* `<p>`: The paragraph tag. It defines a paragraph in HTML and is used to group together related sentences and/or phrases.\n",
+    " \n",
+    "* `<li>`: The list item tag. It is used within ordered (`<ol>`) and unordered (`<ul>`) lists to define individual items within the list.\n",
+    " \n",
+    "* `<div>`: The division tag. It is a block-level element used to group other inline or block-level elements.\n",
+    " \n",
+    "* `<a>`: The anchor tag. It is used to define hyperlinks."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "dd710e5b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders import AsyncChromiumLoader\n",
+    "from langchain.document_transformers import BeautifulSoupTransformer\n",
+    "\n",
+    "# Load HTML\n",
+    "loader = AsyncChromiumLoader([\"https://www.wsj.com\"])\n",
+    "html = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "052b64dd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Transform\n",
+    "bs_transformer = BeautifulSoupTransformer()\n",
+    "docs_transformed = bs_transformer.transform_documents(html, tags_to_extract=[\"p\", \"li\", \"div\", \"a\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "b53a5307",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'Conservative legal activists are challenging Amazon, Comcast and others using many of the same tools that helped kill affirmative-action programs in colleges.1,2099 min read U.S. stock indexes fell and government-bond prices climbed, after Moody’s lowered credit ratings for 10 smaller U.S. banks and said it was reviewing ratings for six larger ones. The Dow industrials dropped more than 150 points.3 min read Penn Entertainment’s Barstool Sportsbook app will be rebranded as ESPN Bet this fall as '"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "docs_transformed[0].page_content[0:500]"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
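To make the idea behind `tags_to_extract` concrete, the sketch below keeps only the text found inside a chosen set of tags. It uses the standard library rather than Beautiful Soup and is not the `BeautifulSoupTransformer` implementation; `TagTextExtractor`, `extract_tag_text`, and the sample HTML are hypothetical names and data made up for this illustration.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the requested tags."""

    def __init__(self, tags_to_extract):
        super().__init__()
        self.tags = set(tags_to_extract)
        self._depth = 0  # number of requested tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._depth > 0:
            self.chunks.append(text)


def extract_tag_text(html, tags_to_extract):
    """Return text inside the requested tags, joined by single spaces."""
    parser = TagTextExtractor(tags_to_extract)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical input; real pages are much larger and less tidy
sample = (
    "<h1>Headline</h1><p>First paragraph.</p>"
    "<ul><li>Item one</li></ul><span>Footer</span>"
)
print(extract_tag_text(sample, ["p", "li"]))
```

This sketch assumes every requested tag is explicitly closed; HTML in the wild often omits closing `</p>` or `</li>` tags, which is one reason to rely on a tolerant parser such as Beautiful Soup in practice.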