Added new use case docs for Web Scraping, Chromium loader, BS4 transformer (#8732)

- Description: Added a new use case category called "Web Scraping", and a tutorial to scrape websites using OpenAI Functions Extraction chain to the docs.
- Tag maintainer: @baskaryan @hwchase17
- Twitter handle: https://www.linkedin.com/in/haiphunghiem/ (I'm on LinkedIn mostly)

---------

Co-authored-by: Lance Martin <lance@langchain.dev>
2025-09-03 20:16:52 +00:00 · 2023-08-11 14:46:59 -04:00
parent 6cb763507c
commit e4418d1b7e
11 changed files with 1045 additions and 0 deletions
--- a/docs/docs_skeleton/docs/use_cases/web_scraping/index.mdx
+++ b/docs/docs_skeleton/docs/use_cases/web_scraping/index.mdx
@@ -0,0 +1,9 @@
+---
+sidebar_position: 3
+---
+
+# Web Scraping
+
+Web scraping has historically been challenging because website structures change frequently, which makes scraping scripts tedious to maintain. Traditional methods often rely on specific HTML tags and patterns, so any change to them can break data extraction.
+
+Enter the LLM-based method for parsing HTML. By leveraging LLMs, and especially OpenAI Functions in LangChain's extraction chain, developers can instruct the model to extract only the desired data in a specified format. This streamlines extraction and can significantly reduce the time spent on manual debugging and script modifications. Because the model does not depend on specific tags or patterns, extraction tends to stay consistent even when a website undergoes significant design changes. That resilience translates into reduced maintenance effort, lower costs, and higher-quality extracted data. Compared with its predecessors, the LLM-based approach turns a historically cumbersome task into a more automated and efficient process.
--- a/docs/docs_skeleton/static/img/web_research.png
+++ b/docs/docs_skeleton/static/img/web_research.png
--- a/docs/docs_skeleton/static/img/web_scraping.png
+++ b/docs/docs_skeleton/static/img/web_scraping.png
--- a/docs/docs_skeleton/static/img/wsj_page.png
+++ b/docs/docs_skeleton/static/img/wsj_page.png