Mirror of https://github.com/hwchase17/langchain.git (synced 2026-03-18 11:07:36 +00:00)
# feat(core): add more file extensions to ignore in HTML link extraction

## Description

This PR enhances the HTML link extraction utility in `libs/core/langchain_core/utils/html.py` by expanding the `SUFFIXES_TO_IGNORE` list to include additional common binary file extensions:

- `.webp`
- `.pdf`
- `.docx`
- `.xlsx`
- `.pptx`
- `.pptm`

These file types are non-HTML, non-crawlable resources. Ignoring them prevents `find_all_links` and `extract_sub_links` from mistakenly treating such binary assets as navigable links. This improves link filtering, reduces unnecessary crawling, and aligns behavior with typical web scraping expectations.

## Summary of Changes

- **Updated** `libs/core/langchain_core/utils/html.py`: Added `.webp`, `.pdf`, `.docx`, `.xlsx`, `.pptx`, `.pptm` to `SUFFIXES_TO_IGNORE`.

## Related Issues

N/A

## Verification

- `ruff check libs/core/langchain_core/utils/html.py`: **Passed**
- `mypy libs/core/langchain_core/utils/html.py`: **Passed**
- `pytest libs/core/tests/unit_tests/utils/test_html.py`: **Passed** (11 tests)

---------

Co-authored-by: Mason Daugherty <mason@langchain.dev>
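
## Illustration

To make the effect of the expanded suffix list concrete, here is a minimal, self-contained sketch of suffix-based link filtering. It is not the code in `html.py`: it checks the URL path with `str.endswith` rather than reproducing the library's own matching logic, and the names `ADDED_SUFFIXES`, `is_ignored_link`, and `filter_links` are hypothetical, introduced only for this example.

```python
from urllib.parse import urlparse

# Hypothetical stand-in holding only the extensions added by this PR.
# The real SUFFIXES_TO_IGNORE list contains further entries not shown here.
ADDED_SUFFIXES = (".webp", ".pdf", ".docx", ".xlsx", ".pptx", ".pptm")


def is_ignored_link(url: str, suffixes: tuple[str, ...] = ADDED_SUFFIXES) -> bool:
    """Return True if the URL's path ends with one of the ignored suffixes."""
    # Compare against the path only, so query strings and fragments
    # do not hide the file extension. Lowercase for case-insensitivity.
    path = urlparse(url).path.lower()
    return path.endswith(suffixes)


def filter_links(links: list[str]) -> list[str]:
    """Drop links that point at binary, non-crawlable assets."""
    return [link for link in links if not is_ignored_link(link)]
```

Under these assumptions, a link such as `https://example.com/report.pdf` would be dropped, while `https://example.com/docs/index.html` would be kept for crawling.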