langchain[patch]: In HTMLHeaderTextSplitter set default encoding to utf-8 (#16372)

- **Description:** The HTMLHeaderTextSplitter class now explicitly specifies utf-8 encoding in the part of the split_text_from_file method that calls the HTMLParser.
- **Issue:** Prevents garbled characters caused by differences in the encoding of HTML files. English text is unaffected; I noticed the problem with Japanese in particular.
- **Dependencies:** None.
- **Twitter handle:** @i_w__a
2025-06-26 16:43:35 +00:00 · 2024-01-24 11:20:29 +09:00 · 2024-01-24 11:20:29 +09:00 · 95ee69a301
commit 95ee69a301
parent e135e5257c
1 changed file with 3 additions and 1 deletion
--- a/libs/langchain/langchain/text_splitter.py
+++ b/libs/langchain/langchain/text_splitter.py
@@ -598,7 +598,9 @@ class HTMLHeaderTextSplitter:
                "Unable to import lxml, please install with `pip install lxml`."
            ) from e
        # use lxml library to parse html document and return xml ElementTree
-        parser = etree.HTMLParser()
+        # Explicitly encoding in utf-8 allows non-English
+        # html files to be processed without garbled characters
+        parser = etree.HTMLParser(encoding="utf-8")
        tree = etree.parse(file, parser)

        # document transformation for "structure-aware" chunking is handled with xsl.
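
To illustrate the change, here is a minimal sketch of parsing UTF-8 HTML that carries no charset declaration. The sample document, the `first_heading` helper, and the `garbled`/`correct` names are hypothetical and not part of this commit. Only `etree.HTMLParser(encoding="utf-8")` and `etree.parse` mirror the patched code. The Latin-1 decode is a standard-library stand-in for a wrong encoding guess, not a claim about what lxml picks by default.

```python
import io
from typing import Optional

# Hypothetical sample: UTF-8 encoded HTML with no <meta charset> declaration.
HTML_BYTES = "<html><body><h1>こんにちは</h1><p>本文</p></body></html>".encode("utf-8")


def first_heading(data: bytes, encoding: Optional[str] = "utf-8") -> Optional[str]:
    """Return the text of the first <h1>, parsing the bytes with `encoding`."""
    # Imported lazily, in the same spirit as the splitter's guarded lxml import.
    from lxml import etree

    # Same call as in the patch: state the encoding instead of letting
    # the parser guess it from the byte stream.
    parser = etree.HTMLParser(encoding=encoding)
    tree = etree.parse(io.BytesIO(data), parser)
    return tree.findtext(".//h1")


# The failure mode, shown with the standard library only: the same bytes
# interpreted as Latin-1 instead of UTF-8 no longer contain the original text.
garbled = HTML_BYTES.decode("latin-1")
correct = HTML_BYTES.decode("utf-8")
```

With lxml installed, `first_heading(HTML_BYTES)` returns the Japanese heading intact. Passing `encoding=None` falls back to the parser's own detection, whose result can depend on the underlying libxml2 version.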