# feat(core): add more file extensions to ignore in HTML link extraction

## Description

This PR enhances the HTML link extraction utility in `libs/core/langchain_core/utils/html.py` by expanding the `SUFFIXES_TO_IGNORE` list to include additional common binary file extensions:

- `.webp`
- `.pdf`
- `.docx`
- `.xlsx`
- `.pptx`
- `.pptm`

These file types are non-HTML, non-crawlable resources. Ignoring them prevents `find_all_links` and `extract_sub_links` from mistakenly treating such binary assets as navigable links. This improves link filtering, reduces unnecessary crawling, and aligns behavior with typical web scraping expectations.

## Summary of Changes

- **Updated** `libs/core/langchain_core/utils/html.py`: Added `.webp`, `.pdf`, `.docx`, `.xlsx`, `.pptx`, `.pptm` to `SUFFIXES_TO_IGNORE`.

## Related Issues

N/A

## Verification

- `ruff check libs/core/langchain_core/utils/html.py`: **Passed**
- `mypy libs/core/langchain_core/utils/html.py`: **Passed**
- `pytest libs/core/tests/unit_tests/utils/test_html.py`: **Passed** (11 tests)

---------

Co-authored-by: Mason Daugherty <mason@langchain.dev>
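To illustrate the kind of suffix-based filtering described above, here is a minimal standalone sketch. It is not the code in `langchain_core/utils/html.py`: the names `NEW_SUFFIXES`, `has_ignored_suffix`, and `drop_binary_links` are hypothetical, and the tuple holds only the six extensions named in this PR rather than the full `SUFFIXES_TO_IGNORE` list.

```python
from urllib.parse import urlparse

# Hypothetical tuple: only the six suffixes this PR adds,
# not the full SUFFIXES_TO_IGNORE list in langchain_core.
NEW_SUFFIXES = (".webp", ".pdf", ".docx", ".xlsx", ".pptx", ".pptm")


def has_ignored_suffix(url: str, suffixes: tuple = NEW_SUFFIXES) -> bool:
    """Return True if the URL's path ends with one of the given suffixes."""
    # Compare only the path so query strings and fragments do not
    # hide the extension; lowercase it for a case-insensitive match.
    path = urlparse(url).path.lower()
    return path.endswith(suffixes)


def drop_binary_links(links: list) -> list:
    """Keep only links that do not point at an ignored file type."""
    return [link for link in links if not has_ignored_suffix(link)]


if __name__ == "__main__":
    candidates = [
        "https://example.com/docs/",
        "https://example.com/report.PDF?download=1",
        "https://example.com/pdf-guide.html",
        "https://example.com/slides.pptx",
    ]
    for link in drop_binary_links(candidates):
        print(link)
```

This sketch matches on the parsed URL path, so `report.PDF?download=1` is filtered while `pdf-guide.html` is kept. The actual utility may match suffixes differently, so treat this as an approximation of the idea and not a description of its behavior.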
Packages
Important
This repository is structured as a monorepo, with various packages located in this libs/ directory. Packages to note in this directory include:
core/ # Core primitives and abstractions for langchain
langchain/ # langchain-classic
langchain_v1/ # langchain
partners/ # Certain third-party provider integrations (see below)
standard-tests/ # Standardized tests for integrations
text-splitters/ # Text splitter utilities
(Each package contains its own README.md file with specific details about that package.)
Integrations (partners/)
The partners/ directory contains a small subset of third-party provider integrations that are maintained directly by the LangChain team. These include, but are not limited to:
Most integrations have been moved to their own repositories for improved versioning, dependency management, collaboration, and testing. This includes packages from popular providers such as Google and AWS. Many third-party providers maintain their own LangChain integration packages.
For a full list of LangChain integrations, refer to the LangChain Integrations documentation.