langchain/libs/community/langchain_community/document_loaders
sByteman 31e7664afd
community[minor]: add proxy support to RecursiveUrlLoader (#27364)
**Description**
This PR adds a `proxies` parameter to the `RecursiveUrlLoader` class so
that users can specify proxy servers for the loader's requests. With it,
a crawl can be routed through a proxy server when the network
configuration requires one.
The key changes include:
1. Added an optional `proxies` parameter to the constructor (`__init__`).
2. Updated the documentation to explain the `proxies` parameter usage with
   an example.
3. Modified the `_get_child_links_recursive` method to pass the `proxies`
   parameter to the `requests.get` function.
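
The forwarding described in the third change can be illustrated with a minimal sketch. This is not the loader's actual code: `fetch_page`, `fake_get`, and the `timeout` default are hypothetical names and values chosen for illustration, and `http_get` stands in for `requests.get` so the snippet runs without network access.

```python
from typing import Any, Callable, Dict, Optional


def fetch_page(
    url: str,
    http_get: Callable[..., Any],
    proxies: Optional[Dict[str, str]] = None,
    timeout: Optional[int] = 10,
) -> Any:
    """Fetch `url`, forwarding the optional proxies mapping to the HTTP call.

    `http_get` is a stand-in for `requests.get` (an assumption made so the
    sketch needs no network). When `proxies` is None, nothing is overridden
    and the HTTP client would fall back to its default behavior.
    """
    return http_get(url, timeout=timeout, proxies=proxies)


def fake_get(url: str, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical replacement for `requests.get` that records its arguments."""
    return {"url": url, **kwargs}


proxies = {
    "http": "http://localhost:1080",
    "https": "http://localhost:1080",
}

with_proxy = fetch_page("https://example.com", fake_get, proxies=proxies)
without_proxy = fetch_page("https://example.com", fake_get)
```

Defaulting `proxies` to `None` keeps existing callers unaffected, which is presumably why the parameter is optional in the constructor as well.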



**Sample Usage**

```python
from bs4 import BeautifulSoup as Soup
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader

proxies = {
    "http": "http://localhost:1080",
    "https": "http://localhost:1080",
}
url = "https://python.langchain.com/docs/concepts/#langchain-expression-language-lcel"
loader = RecursiveUrlLoader(
    url=url,
    max_depth=1,
    extractor=lambda x: Soup(x, "html.parser").text,
    proxies=proxies,
)
docs = loader.load()
```

---------

Co-authored-by: root <root@thb>
2024-10-16 16:29:59 +00:00
blob_loaders infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
parsers community: Add warning when page_content is empty (#25955) 2024-09-19 05:22:09 +00:00
__init__.py [community] Added PebbloTextLoader for loading text data in PebbloSafeLoader (#26582) 2024-09-19 09:59:04 -04:00
acreom.py community[patch]: Add missing annotations (#24890) 2024-07-31 18:13:44 +00:00
airbyte_json.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
airbyte.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
airtable.py docs: fix kwargs docstring (#25010) 2024-08-02 19:54:54 -07:00
apify_dataset.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
arcgis_loader.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
arxiv.py docs: Arxiv docs update (#23871) 2024-07-05 11:43:51 -04:00
assemblyai.py community[patch]: docstrings update (#20301) 2024-04-11 16:23:27 -04:00
astradb.py multiple: update removal targets (#25361) 2024-08-14 09:50:39 -04:00
async_html.py community[patch]: Release 0.2.11 (#24989) 2024-08-02 20:08:44 +00:00
athena.py community: make AthenaLoader profile_name optional and fix type hint (#24958) 2024-08-05 14:28:58 +00:00
azlyrics.py
azure_ai_data.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
azure_blob_storage_container.py community[patch]: type ignore fixes (#18395) 2024-03-01 11:21:02 -08:00
azure_blob_storage_file.py
baiducloud_bos_directory.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
baiducloud_bos_file.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
base_o365.py [community] [Bugfix] base_o365 document loader metadata needs to be JSON serializable (#26322) 2024-10-14 12:48:31 -04:00
base.py core: Move document loader interfaces to core (#17723) 2024-03-06 13:59:00 -05:00
bibtex.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
bigquery.py multiple: update removal targets (#25361) 2024-08-14 09:50:39 -04:00
bilibili.py community[patch]: docstrings update (#20301) 2024-04-11 16:23:27 -04:00
blackboard.py community: add flag to toggle progress bar (#24463) 2024-07-20 13:18:02 +00:00
blockchain.py community: add supported blockchains to Blockchain Document Loader (#25428) 2024-08-23 14:39:42 +00:00
brave_search.py
browserbase.py community: updated Browserbase loader (#21757) 2024-05-16 08:21:23 -07:00
browserless.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
cassandra.py community[minor]: Add Cassandra ByteStore (#22064) 2024-05-23 10:46:23 -04:00
chatgpt.py
chm.py community[patch]: docstrings (#16810) 2024-02-09 12:48:57 -08:00
chromium.py community[minor]: add user agent for web scraping loaders (#22480) 2024-06-05 15:20:34 +00:00
college_confidential.py
concurrent.py community[patch]: import flattening fix (#20110) 2024-04-10 13:01:19 -04:00
confluence.py community[minor]: Fix missing 'keep_newlines' parameter forward-pass to 'process_pages' function in confluence loader (#20086) (#20087) 2024-08-23 12:59:38 +00:00
conllu.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
couchbase.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
csv_loader.py community[patch]: added content_columns option to CSVLoader (#23809) 2024-09-02 20:25:53 +00:00
cube_semantic.py community[patch]: Implement lazy_load() for CubeSemanticLoader (#18535) 2024-03-05 17:32:31 -08:00
datadog_logs.py
dataframe.py community[patch]: support modin document loader (#18866) 2024-03-10 18:40:04 -07:00
dedoc.py community[minor]: added new document loaders based on dedoc library (#24303) 2024-07-23 02:04:53 +00:00
diffbot.py
directory.py community: glob multiple patterns when using DirectoryLoader (#22852) 2024-06-18 09:24:50 -07:00
discord.py
doc_intelligence.py docs: community docstring updates (#21040) 2024-04-29 17:40:23 -04:00
docugami.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
docusaurus.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
dropbox.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
duckdb_loader.py
email.py community[patch]: Small Fix in OutlookMessageLoader (Close the Message once Open) (#22744) 2024-06-10 13:08:39 -07:00
epub.py
etherscan.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
evernote.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
excel.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
facebook_chat.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
fauna.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
figma.py
firecrawl.py Community: Updated Firecrawl Document Loader to v1 (#26548) 2024-10-15 13:13:28 +00:00
gcs_directory.py multiple: update removal targets (#25361) 2024-08-14 09:50:39 -04:00
gcs_file.py multiple: update removal targets (#25361) 2024-08-14 09:50:39 -04:00
generic.py community[patch]: import flattening fix (#20110) 2024-04-10 13:01:19 -04:00
geodataframe.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
git.py Merge pull request #18539 2024-03-06 13:25:14 -05:00
gitbook.py community: add flag to toggle progress bar (#24463) 2024-07-20 13:18:02 +00:00
github.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
glue_catalog.py community[minor]: Add glue catalog loader (#20220) 2024-04-16 11:39:23 -04:00
google_speech_to_text.py multiple: update removal targets (#25361) 2024-08-14 09:50:39 -04:00
googledrive.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
gutenberg.py
helpers.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
hn.py
html_bs.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
html.py
hugging_face_dataset.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
hugging_face_model.py community[patch]: Add missing annotations (#24890) 2024-07-31 18:13:44 +00:00
ifixit.py
image_captions.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
image.py
imsdb.py
iugu.py
joplin.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
json_loader.py docs: Standardize DocumentLoader docstrings (#22932) 2024-06-18 03:26:36 +00:00
kinetica_loader.py community[patch]: Kinetica Integrations handled error in querying; quotes in table names; updated gpudb API (#22724) 2024-06-11 10:01:26 -04:00
lakefs.py
larksuite.py community[minor]: Add LarkSuite wiki document loader. (#21016) 2024-04-29 10:37:50 -04:00
llmsherpa.py community[minor]: add support for llmsherpa (#19741) 2024-03-29 16:04:57 -07:00
markdown.py [docs]: doc loader changes (#25417) 2024-08-14 19:46:33 -07:00
mastodon.py Merge pull request #18671 2024-03-06 13:23:14 -05:00
max_compute.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
mediawikidump.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
merge.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
mhtml.py community[patch]: upgrade to recent version of mypy (#21616) 2024-05-13 14:55:07 -04:00
mintbase.py community[minor]: add mintbase loader to langchain (#20089) 2024-04-30 04:11:56 +00:00
modern_treasury.py
mongodb.py community: Enhance MongoDBLoader with flexible metadata and optimized field extraction (#23376) 2024-09-17 10:23:17 -04:00
news.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
notebook.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
notion.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
notiondb.py community: Fix KeyError in NotionDB loader when 'name' is missing (#24224) 2024-08-01 13:55:40 +00:00
nuclia.py infra: add print rule to ruff (#16221) 2024-02-09 16:13:30 -08:00
obs_directory.py
obs_file.py
obsidian.py community[patch]: Add missing annotations (#24890) 2024-07-31 18:13:44 +00:00
odt.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
onedrive_file.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
onedrive.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
onenote.py community[patch]: Fix validation error in SettingsConfigDict across multiple Langchain modules (#26852) 2024-09-25 10:02:14 -04:00
open_city_data.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
oracleadb_loader.py community: Add support for clob datatype in oracle database (#27330) 2024-10-16 02:19:20 +00:00
oracleai.py community[minor]: Oraclevs integration (#21123) 2024-05-04 03:15:35 +00:00
org_mode.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
pdf.py docs minor fix (#25794) 2024-08-28 04:14:36 +00:00
pebblo.py community[minor]: [Pebblo] Enhance PebbloSafeLoader to take anonymize flag (#26812) 2024-09-25 09:33:06 -04:00
polars_dataframe.py
powerpoint.py
psychic.py multiple: Remove unnecessary Ruff suppression comments (#21050) 2024-04-30 17:13:48 +00:00
pubmed.py community[patch]: upgrade to recent version of mypy (#21616) 2024-05-13 14:55:07 -04:00
pyspark_dataframe.py
python.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
quip.py community[major]: lint for usage of xml library (#22132) 2024-05-24 15:23:53 +00:00
readthedocs.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
recursive_url_loader.py community[minor]: add proxy support to RecursiveUrlLoader (#27364) 2024-10-16 16:29:59 +00:00
reddit.py
roam.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
rocksetdb.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
rspace.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
rss.py multiple: Remove unnecessary Ruff suppression comments (#21050) 2024-04-30 17:13:48 +00:00
rst.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
rtf.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
s3_directory.py community[patch]: Skip nested directories when using S3DirectoryLoader (#17829) 2024-03-08 16:50:58 -08:00
s3_file.py community[patch]: support unstructured_kwargs for s3 loader (#15473) 2024-03-27 22:03:48 +00:00
scrapfly.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
scrapingant.py community[minor]: Add ScrapingAnt Loader Community Integration (#24514) 2024-07-24 21:11:43 -04:00
sharepoint.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
sitemap.py community[patch]: SitemapLoader restrict depth of parsing sitemap (CVE-2024-2965) (#22903) 2024-06-14 13:04:40 -04:00
slack_directory.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
snowflake_loader.py community[patch]: upgrade to recent version of mypy (#21616) 2024-05-13 14:55:07 -04:00
spider.py doc list not empty (#21208) 2024-05-20 08:24:06 -07:00
spreedly.py
sql_database.py community[patch]: restore compatibility with SQLAlchemy 1.x (#22546) 2024-06-19 17:58:57 +00:00
srt.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
stripe.py
surrealdb.py community[patch]: SurrealDB fix for asyncio (#16092) 2024-01-23 19:46:19 -08:00
telegram.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
tencent_cos_directory.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
tencent_cos_file.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
tensorflow_datasets.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
text.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
tidb.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
tomarkdown.py community[patch]: Update URL to the 2markdown API (#24546) 2024-07-23 14:27:55 +00:00
toml.py community: Use default load() implementation in doc loaders (#18385) 2024-03-01 14:46:52 -05:00
trello.py community: Implement lazy_load() for TrelloLoader (#18658) 2024-03-06 13:04:36 -05:00
tsv.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
twitter.py
unstructured.py multiple: update removal targets (#25361) 2024-08-14 09:50:39 -04:00
url_playwright.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
url_selenium.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
url.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
vsdx.py community[patch]: import flattening fix (#20110) 2024-04-10 13:01:19 -04:00
weather.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
web_base.py [docs]: standardize doc loader doc strings (#25325) 2024-08-13 23:18:56 +00:00
whatsapp_chat.py community: Implement lazy_load() for WhatsAppChatLoader (#18677) 2024-03-06 13:03:46 -05:00
wikipedia.py community[patch]: upgrade to recent version of mypy (#21616) 2024-05-13 14:55:07 -04:00
word_document.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
xml.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
xorbits.py
youtube.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
yuque.py community[minor]: add Yuque document loader (#17924) 2024-03-05 15:54:07 -08:00