community[minor]: add user agent for web scraping loaders (#22480)

**Description:** This PR adds a `USER_AGENT` env variable that is to be
used for web scraping. It creates a util to get that user agent and uses
it in the classes used for scraping in [this piece of
doc](https://python.langchain.com/v0.1/docs/use_cases/web_scraping/).
Identifying your scraper is considered a good politeness practice, this
PR aims at easing it.
**Issue:** `None`
**Dependencies:** `None`
**Twitter handle:** `None`
This commit is contained in:
Emilien Chauvet
2024-06-05 17:20:34 +02:00
committed by GitHub
parent 8250c177de
commit c3d4126eb1
5 changed files with 34 additions and 6 deletions

View File

@@ -48,7 +48,7 @@
"from langchain_community.document_loaders import AsyncChromiumLoader\n",
"\n",
"urls = [\"https://www.wsj.com\"]\n",
"loader = AsyncChromiumLoader(urls)\n",
"loader = AsyncChromiumLoader(urls, user_agent=\"MyAppUserAgent\")\n",
"docs = loader.load()\n",
"docs[0].page_content[0:100]"
]