mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-08 14:31:55 +00:00
security: Remove xslt_path and harden XML parsers in HTMLSectionSplitter: package: langchain-text-splitters (#31819)
## Summary

- Removes the `xslt_path` parameter from HTMLSectionSplitter to eliminate XXE attack vector
- Hardens XML/HTML parsers with secure configurations to prevent XXE attacks
- Adds comprehensive security tests to ensure the vulnerability is fixed

## Context

This PR addresses a critical XXE vulnerability discovered in the HTMLSectionSplitter component. The vulnerability allowed attackers to:

- Read sensitive local files (SSH keys, passwords, configuration files)
- Perform Server-Side Request Forgery (SSRF) attacks
- Exfiltrate data to attacker-controlled servers

## Changes Made

1. **Removed `xslt_path` parameter** - This eliminates the primary attack vector where users could supply malicious XSLT files
2. **Hardened XML parsers** - Added security configurations to prevent XXE attacks even with the default XSLT:
   - `no_network=True` - Blocks network access
   - `resolve_entities=False` - Prevents entity expansion
   - `load_dtd=False` - Disables DTD processing
   - `XSLTAccessControl.DENY_ALL` - Blocks all file/network I/O in XSLT transformations
3. **Added security tests** - New test file `test_html_security.py` with comprehensive tests for various XXE attack vectors
4. **Updated existing tests** - Modified tests that were using the removed `xslt_path` parameter

## Test Plan

- [x] All existing tests pass
- [x] New security tests verify XXE attacks are blocked
- [x] Code passes linting and formatting checks
- [x] Tested with both old and new versions of lxml

Twitter handle: @_colemurray
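The parser options listed under "Hardened XML parsers" can be tried in isolation. The following is a minimal sketch, not code from this PR: `PAYLOAD`, `make_hardened_parser`, and `serialize_root` are hypothetical names, and a harmless internal entity stands in for the `file://` or `http://` entity a real XXE payload would declare. It assumes `lxml` is installed.

```python
from lxml import etree

# Hypothetical payload for illustration only. A harmless internal entity stands
# in for the file:// or http:// entity that a real XXE payload would declare.
PAYLOAD = b"""<?xml version="1.0"?>
<!DOCTYPE root [<!ENTITY secret "SENSITIVE">]>
<root>&secret;</root>"""


def make_hardened_parser() -> etree.XMLParser:
    """Build an XML parser with the three parser options named in the PR."""
    return etree.XMLParser(
        resolve_entities=False,  # leave entity references unexpanded
        no_network=True,         # refuse network lookups while parsing
        load_dtd=False,          # do not load external DTDs
    )


def serialize_root(payload: bytes) -> str:
    """Parse with the hardened parser and serialize the root element."""
    root = etree.fromstring(payload, make_hardened_parser())
    return etree.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # The entity's replacement text should not appear in the printed element.
    print(serialize_root(PAYLOAD))
```

With `resolve_entities=False` the entity reference is kept as a reference node instead of being replaced by its value, so a declared external entity is never substituted into the parsed document.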
@@ -309,7 +309,6 @@ class HTMLSectionSplitter:
     def __init__(
         self,
         headers_to_split_on: List[Tuple[str, str]],
-        xslt_path: Optional[str] = None,
         **kwargs: Any,
     ) -> None:
         """Create a new HTMLSectionSplitter.
@@ -318,20 +317,13 @@ class HTMLSectionSplitter:
             headers_to_split_on: list of tuples of headers we want to track mapped to
                 (arbitrary) keys for metadata. Allowed header values: h1, h2, h3, h4,
                 h5, h6 e.g. [("h1", "Header 1"), ("h2", "Header 2"].
-            xslt_path: path to xslt file for document transformation.
-                Uses a default if not passed.
-                Needed for html contents that using different format and layouts.
             **kwargs (Any): Additional optional arguments for customizations.

         """
         self.headers_to_split_on = dict(headers_to_split_on)
-
-        if xslt_path is None:
-            self.xslt_path = (
-                pathlib.Path(__file__).parent / "xsl/converting_to_header.xslt"
-            ).absolute()
-        else:
-            self.xslt_path = pathlib.Path(xslt_path).absolute()
+        self.xslt_path = (
+            pathlib.Path(__file__).parent / "xsl/converting_to_header.xslt"
+        ).absolute()
         self.kwargs = kwargs

     def split_documents(self, documents: Iterable[Document]) -> List[Document]:
@@ -457,11 +449,20 @@ class HTMLSectionSplitter:
                 "Unable to import lxml, please install with `pip install lxml`."
             ) from e
         # use lxml library to parse html document and return xml ElementTree
-        parser = etree.HTMLParser()
-        tree = etree.parse(StringIO(html_content), parser)
+        # Create secure parsers to prevent XXE attacks
+        html_parser = etree.HTMLParser(no_network=True)
+        xslt_parser = etree.XMLParser(
+            resolve_entities=False, no_network=True, load_dtd=False
+        )

-        xslt_tree = etree.parse(self.xslt_path)
-        transform = etree.XSLT(xslt_tree)
+        # Apply XSLT access control to prevent file/network access
+        # DENY_ALL is a predefined access control that blocks all file/network access
+        # Type ignore needed due to incomplete lxml type stubs
+        ac = etree.XSLTAccessControl.DENY_ALL  # type: ignore[attr-defined]
+
+        tree = etree.parse(StringIO(html_content), html_parser)
+        xslt_tree = etree.parse(self.xslt_path, xslt_parser)
+        transform = etree.XSLT(xslt_tree, access_control=ac)
         result = transform(tree)
         return str(result)
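The hardened transformation flow from the last hunk can also be exercised outside the class. This is a rough sketch under stated assumptions rather than the library's implementation: `IDENTITY_XSLT` is a hypothetical in-memory identity stylesheet standing in for the bundled `xsl/converting_to_header.xslt`, and `convert_html` is an illustrative name. It assumes `lxml` is installed.

```python
from io import StringIO

from lxml import etree

# Hypothetical stand-in for the bundled stylesheet: an identity transform
# that copies every node and attribute unchanged.
IDENTITY_XSLT = """\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>
"""


def convert_html(html_content: str) -> str:
    """Parse HTML and apply an XSLT using the hardened settings from the diff."""
    # Parsers configured as in the patched code.
    html_parser = etree.HTMLParser(no_network=True)
    xslt_parser = etree.XMLParser(
        resolve_entities=False, no_network=True, load_dtd=False
    )
    # Predefined access control that denies file and network I/O in the XSLT.
    ac = etree.XSLTAccessControl.DENY_ALL

    tree = etree.parse(StringIO(html_content), html_parser)
    xslt_tree = etree.parse(StringIO(IDENTITY_XSLT), xslt_parser)
    transform = etree.XSLT(xslt_tree, access_control=ac)
    return str(transform(tree))


if __name__ == "__main__":
    print(convert_html("<h1>Title</h1><p>Body</p>"))
```

A stylesheet that does no file or network I/O of its own, such as this identity transform, still runs normally under `DENY_ALL`; the access control only matters when a stylesheet tries to read or write external resources.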