langchain

mirror of https://github.com/hwchase17/langchain.git synced 2026-03-18 11:07:36 +00:00

Files

Aman Gupta 2847814c70 feat(core): add more file extensions to ignore in HTML link extraction (#34552)

# feat(core): add more file extensions to ignore in HTML link extraction

## Description
This PR enhances the HTML link extraction utility in  
`libs/core/langchain_core/utils/html.py` by expanding the
`SUFFIXES_TO_IGNORE` list to include additional common binary file
extensions:

- `.webp`
- `.pdf`
- `.docx`
- `.xlsx`
- `.pptx`
- `.pptm`

These file types are non-HTML, non-crawlable resources. Ignoring them
prevents `find_all_links` and `extract_sub_links` from mistakenly
treating such binary assets as navigable links. This improves link
filtering, reduces unnecessary crawling, and aligns behavior with
typical web scraping expectations.
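
As a rough illustration of the filtering idea described above, here is a minimal, self-contained sketch. It does not reproduce the code in `libs/core/langchain_core/utils/html.py`: the names `IGNORED_SUFFIXES`, `HREF_PATTERN`, `is_ignored`, and `crawlable_links` are hypothetical, the suffix tuple holds only the six extensions added in this PR, and the `href` regex is deliberately naive.

```python
import re
from urllib.parse import urlparse

# Hypothetical subset for illustration: only the six suffixes named in this PR.
# The real SUFFIXES_TO_IGNORE list contains additional entries.
IGNORED_SUFFIXES = (".webp", ".pdf", ".docx", ".xlsx", ".pptx", ".pptm")

# Naive href matcher, sufficient for a sketch but not for arbitrary HTML.
HREF_PATTERN = re.compile(r'href=["\']([^"\'#]+)["\']')


def is_ignored(url: str) -> bool:
    """Return True if the URL path ends with one of the ignored suffixes."""
    # Compare against the path only, so query strings such as "?v=2" do not
    # hide the extension; lowercase it so ".PDF" is treated like ".pdf".
    path = urlparse(url).path.lower()
    return path.endswith(IGNORED_SUFFIXES)


def crawlable_links(html: str) -> list[str]:
    """Collect unique href values, in order, skipping binary-asset links."""
    links: list[str] = []
    for href in HREF_PATTERN.findall(html):
        if not is_ignored(href) and href not in links:
            links.append(href)
    return links


if __name__ == "__main__":
    page = (
        '<a href="/docs/intro">Intro</a>'
        '<a href="/files/report.pdf">Report</a>'
        '<a href="/img/logo.webp?v=2">Logo</a>'
    )
    print(crawlable_links(page))
```

In this sketch the suffix check is applied to the parsed URL path rather than the raw string, which is one possible design choice; whether the library handles query strings the same way is not stated in this PR.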

## Summary of Changes
- **Updated** `libs/core/langchain_core/utils/html.py`: Added `.webp`,
`.pdf`, `.docx`, `.xlsx`, `.pptx`, `.pptm` to `SUFFIXES_TO_IGNORE`.

## Related Issues
N/A

## Verification
- `ruff check libs/core/langchain_core/utils/html.py`: **Passed**  
- `mypy libs/core/langchain_core/utils/html.py`: **Passed**  
- `pytest libs/core/tests/unit_tests/utils/test_html.py`: **Passed** (11
tests)

---------

Co-authored-by: Mason Daugherty <mason@langchain.dev>

2026-01-08 14:40:22 -05:00

__init__.py

style: more work for refs (#33508)

2025-10-15 18:46:55 -04:00

_merge.py

test(core): add tests for formatting utils and merge functions (#34511)

2026-01-05 14:20:11 -05:00

aiter.py

chore(core): fix some ruff preview rules (#34425)

2025-12-19 14:33:42 -06:00

env.py

style(core): fix mypy no-any-return violations (#34204)