Commit Graph

1 Commits

Author SHA1 Message Date
Cole Murray
43eef43550
security: Remove xslt_path and harden XML parsers in HTMLSectionSplitter: package: langchain-text-splitters (#31819)
## Summary
- Removes the `xslt_path` parameter from HTMLSectionSplitter to
eliminate XXE attack vector
- Hardens XML/HTML parsers with secure configurations to prevent XXE
attacks
- Adds comprehensive security tests to ensure the vulnerability is fixed

  ## Context
This PR addresses a critical XXE vulnerability discovered in the
HTMLSectionSplitter component. The vulnerability allowed attackers to:
- Read sensitive local files (SSH keys, passwords, configuration files)
  - Perform Server-Side Request Forgery (SSRF) attacks
  - Exfiltrate data to attacker-controlled servers

  ## Changes Made
1. **Removed `xslt_path` parameter** - This eliminates the primary
attack vector where users could supply malicious XSLT files
2. **Hardened XML parsers** - Added security configurations to prevent
XXE attacks even with the default XSLT:
     - `no_network=True` - Blocks network access
- `resolve_entities=False` - Prevents entity expansion -
`load_dtd=False` - Disables DTD processing -
`XSLTAccessControl.DENY_ALL` - Blocks all file/network I/O in XSLT
transformations

3. **Added security tests** - New test file `test_html_security.py` with
comprehensive tests for various XXE attack vectors
4. **Updated existing tests** - Modified tests that were using the
removed `xslt_path` parameter

  ## Test Plan
  - [x] All existing tests pass
  - [x] New security tests verify XXE attacks are blocked
  - [x] Code passes linting and formatting checks
  - [x] Tested with both old and new versions of lxml


Twitter handle: @_colemurray
2025-07-02 15:24:08 -04:00