Aman Gupta 2847814c70 feat(core): add more file extensions to ignore in HTML link extraction (#34552)
# feat(core): add more file extensions to ignore in HTML link extraction

## Description
This PR enhances the HTML link extraction utility in
`libs/core/langchain_core/utils/html.py` by expanding the
`SUFFIXES_TO_IGNORE` list to include additional common binary file
extensions:

- `.webp`
- `.pdf`
- `.docx`
- `.xlsx`
- `.pptx`
- `.pptm`

These file types are non-HTML, non-crawlable resources. Ignoring them
prevents `find_all_links` and `extract_sub_links` from mistakenly
treating such binary assets as navigable links. This improves link
filtering, reduces unnecessary crawling, and aligns behavior with
typical web scraping expectations.
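
The filtering idea can be illustrated with a minimal, self-contained sketch. This is not the code in `langchain_core/utils/html.py`: the names `IGNORED_SUFFIXES`, `HREF_PATTERN`, and `find_crawlable_links` are hypothetical, the suffix tuple contains only the six extensions added by this PR, and matching on the lowercased URL path (so query strings do not hide the extension) is an assumption made for the example.

```python
import re
from urllib.parse import urlparse

# Hypothetical, abbreviated list: only the suffixes added in this PR.
IGNORED_SUFFIXES = (".webp", ".pdf", ".docx", ".xlsx", ".pptx", ".pptm")

# Simplified href matcher used for illustration; it skips fragment-only links.
HREF_PATTERN = re.compile(r'href=["\']([^"\'#]+)["\']', re.IGNORECASE)


def find_crawlable_links(raw_html: str) -> list[str]:
    """Return unique hrefs whose path does not end in an ignored suffix."""
    links: list[str] = []
    for href in HREF_PATTERN.findall(raw_html):
        # Assumption: compare against the URL path, ignoring case and query string.
        path = urlparse(href).path.lower()
        if path.endswith(IGNORED_SUFFIXES):
            continue  # binary asset, not a navigable page
        if href not in links:
            links.append(href)
    return links


sample_html = (
    '<a href="/docs/intro">Intro</a>'
    '<a href="/files/report.pdf">Report</a>'
    '<a href="/img/logo.webp?v=2">Logo</a>'
    '<a href="/docs/intro">Intro again</a>'
)
crawlable = find_crawlable_links(sample_html)
```

Under these assumptions, only `/docs/intro` survives: the `.pdf` and `.webp` links are dropped and the duplicate is collapsed.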

## Summary of Changes
- **Updated** `libs/core/langchain_core/utils/html.py`: Added `.webp`,
`.pdf`, `.docx`, `.xlsx`, `.pptx`, `.pptm` to `SUFFIXES_TO_IGNORE`.

## Related Issues
N/A

## Verification
- `ruff check libs/core/langchain_core/utils/html.py`: **Passed**  
- `mypy libs/core/langchain_core/utils/html.py`: **Passed**  
- `pytest libs/core/tests/unit_tests/utils/test_html.py`: **Passed** (11
tests)

---------

Co-authored-by: Mason Daugherty <mason@langchain.dev>
2026-01-08 14:40:22 -05:00

Packages

Important: View all LangChain integrations packages.

This repository is structured as a monorepo, with various packages located in this libs/ directory. Packages to note in this directory include:

core/             # Core primitives and abstractions for langchain
langchain/        # langchain-classic
langchain_v1/     # langchain
partners/         # Selected third-party provider integrations (see below)
standard-tests/   # Standardized tests for integrations
text-splitters/   # Text splitter utilities

(Each package contains its own README.md file with specific details about that package.)

Integrations (partners/)

The partners/ directory contains a small subset of third-party provider integrations that are maintained directly by the LangChain team.

Most integrations have been moved to their own repositories for improved versioning, dependency management, collaboration, and testing. This includes packages from popular providers such as Google and AWS. Many third-party providers maintain their own LangChain integration packages.

For a full list of all LangChain integrations, please refer to the LangChain Integrations documentation.