langchain/libs/community/tests/integration_tests/document_loaders
Andras L Ferenczi 64df60e690
community[minor]: Add custom sitemap URL parameter to GitbookLoader (#30549)
## Description
This PR adds a new `sitemap_url` parameter to the `GitbookLoader` class
that allows users to specify a custom sitemap URL when loading content
from a GitBook site. This is particularly useful for GitBook sites that
use non-standard sitemap file names like `sitemap-pages.xml` instead of
the default `sitemap.xml`.
The standard `GitbookLoader` assumes that the sitemap is located at
`/sitemap.xml`, but some GitBook instances (including GitBook's own
documentation) use different paths for their sitemaps. This parameter
makes the loader more flexible and helps users extract content from a
wider range of GitBook sites.
## Issue
Fixes bug
[30473](https://github.com/langchain-ai/langchain/issues/30473) where
the `GitbookLoader` would fail to find pages on GitBook sites that use
custom sitemap URLs.
## Dependencies
No new dependencies required.
*I've added*:
* Unit tests to verify the parameter works correctly
* Integration tests to confirm the parameter is properly used with real
GitBook sites
* Updated docstrings with parameter documentation
The changes are fully backward compatible, as the parameter is optional
with a sensible default.

---------

Co-authored-by: andrasfe <andrasf94@gmail.com>
Co-authored-by: Eugene Yurtsev <eugene@langchain.dev>
2025-04-01 16:17:21 +00:00
..
parsers community[patch]: Handle gray scale images in ImageBlobParser (Fixes 30261 and 29586) (#30493) 2025-03-28 10:15:40 -04:00
__init__.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_arxiv.py community: Bump ruff version to 0.9 (#29206) 2025-02-08 01:21:10 +00:00
test_astradb.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
test_bigquery.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_bilibili.py community[patch]: fix bilibili loader handling of multi-page content (#30283) 2025-03-14 14:53:03 -04:00
test_blockchain.py infra: add print rule to ruff (#16221) 2024-02-09 16:13:30 -08:00
test_cassandra.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
test_confluence.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_couchbase.py infra: add print rule to ruff (#16221) 2024-02-09 16:13:30 -08:00
test_csv_loader.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_dataframe.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_dedoc.py community[minor]: added new document loaders based on dedoc library (#24303) 2024-07-23 02:04:53 +00:00
test_docusaurus.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_duckdb.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_email.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_etherscan.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_excel.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_facebook_chat.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_fauna.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_figma.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_geodataframe.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_gitbook.py community[minor]: Add custom sitemap URL parameter to GitbookLoader (#30549) 2025-04-01 16:17:21 +00:00
test_github.py community[patch]: Add Pagination to GitHubIssuesLoader for Efficient GitHub Issues Retrieval (#16934) 2024-02-12 18:30:36 -08:00
test_google_speech_to_text.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_ifixit.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_joplin.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_json_loader.py community[minor]: Fix json._validate_metadata_func() (#22842) 2024-12-13 21:24:20 +00:00
test_lakefs.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_language.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_larksuite.py community[minor]: Add LarkSuite wiki document loader. (#21016) 2024-04-29 10:37:50 -04:00
test_llmsherpa.py community[minor]: add support for llmsherpa (#19741) 2024-03-29 16:04:57 -07:00
test_mastodon.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
test_max_compute.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_modern_treasury.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_news.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_nuclia.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_odt.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_oracleds.py community[minor]: Oraclevs integration (#21123) 2024-05-04 03:15:35 +00:00
test_org_mode.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_pdf.py community[minor]: 05 - Refactoring PyPDFium2 parser (#29625) 2025-02-07 21:31:12 -05:00
test_polars_dataframe.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_pubmed.py infra: add print rule to ruff (#16221) 2024-02-09 16:13:30 -08:00
test_pyspark_dataframe_loader.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_python.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_quip.py community[patch]: Add missing annotations (#24890) 2024-07-31 18:13:44 +00:00
test_recursive_url_loader.py community[patch]: fix integrated test case test_recursive_url_loader.py assertions (issue-20919) (#20920) 2024-04-26 10:00:08 -04:00
test_rocksetdb.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_rss.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_rst.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_sitemap.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_slack.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
test_spreedly.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_sql_database.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
test_stripe.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_telegram.py infra: add print rule to ruff (#16221) 2024-02-09 16:13:30 -08:00
test_tensorflow_datasets.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
test_tidb.py community[minor]: Add tidb loader support (#17788) 2024-02-21 16:42:33 -08:00
test_tsv.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_unstructured.py community[patch]: Load list of files using UnstructuredFileLoader (#16216) 2024-01-23 19:37:37 -08:00
test_url_playwright.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
test_url.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_whatsapp_chat.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_wikipedia.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
test_xml.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00
test_xorbits.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00