refactor(core): improve docstrings for HTML link extraction utilities (#34550)


## Description
This PR updates and clarifies the docstrings for `find_all_links` and
`extract_sub_links` in
`libs/core/langchain_core/utils/html.py`.

The previous return-value descriptions were vague (e.g., "all links",
"sub links"). They have now been revised to clearly describe the
behavior and output of each function:

- **find_all_links** → “A list of all links found in the HTML.”
- **extract_sub_links** → “A list of absolute paths to sub links.”

These changes make the return values of both utilities easier to
understand. Only docstrings are touched; functionality is unchanged.
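
For readers unfamiliar with these helpers, the following is a minimal sketch of the behavior the revised docstrings describe, not the library's actual implementation. The names `SKETCH_LINK_REGEX`, `sketch_find_all_links`, and `sketch_extract_sub_links` are hypothetical, the regex is a simplified stand-in for `DEFAULT_LINK_REGEX`, and the sketch assumes that "absolute paths" means links resolved against the page URL. The real `extract_sub_links` accepts more options than shown here.

```python
import re
from typing import List, Optional
from urllib.parse import urljoin, urlparse

# Hypothetical stand-in for DEFAULT_LINK_REGEX (assumption: the real pattern differs).
SKETCH_LINK_REGEX = r'href=["\'](.*?)["\']'


def sketch_find_all_links(raw_html: str, pattern: Optional[str] = None) -> List[str]:
    """Return a list of all links found in the HTML (deduplicated, unordered)."""
    pattern = pattern or SKETCH_LINK_REGEX
    # Same shape as the body shown in the diff: findall, then dedupe through a set.
    return list(set(re.findall(pattern, raw_html)))


def sketch_extract_sub_links(raw_html: str, url: str) -> List[str]:
    """Return absolute links that sit under `url` (simplified illustration)."""
    base = urlparse(url)
    # Resolve every link against the page URL so relative links become absolute.
    absolute = {urljoin(url, link) for link in sketch_find_all_links(raw_html)}
    # Keep only links on the same host that start with the base URL.
    return sorted(
        link
        for link in absolute
        if urlparse(link).netloc == base.netloc and link.startswith(url)
    )


html = (
    '<a href="/docs/a">A</a> '
    '<a href="/docs/a">A again</a> '
    '<a href="https://other.example/x">X</a>'
)
links = sketch_find_all_links(html)
sub_links = sketch_extract_sub_links(html, "https://example.com/docs/")
```

In this sketch, `links` holds each distinct `href` value once in no guaranteed order, while `sub_links` holds only the resolved link under `https://example.com/docs/`.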

## Verification
- `ruff check libs/core/langchain_core/utils/html.py`: **Passed**  
- `pytest libs/core/tests/unit_tests/utils/test_html.py`: **Passed**

## Checklist
- PR title follows the required format: `TYPE(SCOPE): DESCRIPTION`  
- Changes are limited to the `langchain-core` package  
- `make format`, `make lint`, and `make test` pass

Author: Aman Gupta
Date: 2026-01-08 20:51:17 +05:30
Committed by: GitHub
Parent: 2b6911d9af
Commit: 50c5bb5607

@@ -43,7 +43,7 @@ def find_all_links(
         pattern: Regex to use for extracting links from raw HTML.
 
     Returns:
-        all links
+        A list of all links found in the HTML.
     """
     pattern = pattern or DEFAULT_LINK_REGEX
     return list(set(re.findall(pattern, raw_html)))
@@ -73,7 +73,7 @@ def extract_sub_links(
             exception. Otherwise, raise the exception.
 
     Returns:
-        sub links.
+        A list of absolute paths to sub links.
     """
     base_url_to_use = base_url if base_url is not None else url
     parsed_base_url = urlparse(base_url_to_use)