langchain[patch]: In HTMLHeaderTextSplitter set default encoding to utf-8 (#16372)

- **Description:** The `HTMLHeaderTextSplitter` class now explicitly
specifies UTF-8 encoding in the part of the `split_text_from_file` method
that creates the lxml `HTMLParser`.
- **Issue:** Prevents garbled characters caused by differences in the
encoding of HTML files. English text is not affected; I noticed the
problem with Japanese in particular.
- **Dependencies:** None
- **Twitter handle:** @i_w__a
i-w-a 2024-01-24 11:20:29 +09:00 committed by GitHub
parent e135e5257c
commit 95ee69a301

@@ -598,7 +598,9 @@ class HTMLHeaderTextSplitter:
 "Unable to import lxml, please install with `pip install lxml`."
 ) from e
 # use lxml library to parse html document and return xml ElementTree
-parser = etree.HTMLParser()
+# Explicitly encoding in utf-8 allows non-English
+# html files to be processed without garbled characters
+parser = etree.HTMLParser(encoding="utf-8")
 tree = etree.parse(file, parser)
 # document transformation for "structure-aware" chunking is handled with xsl.
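
The effect of the change can be checked in isolation with a minimal sketch like the one below. It assumes `lxml` is installed and uses a made-up in-memory Japanese document in place of a real file. The sample markup and the variable names are illustrative only and are not part of the commit.

```python
import io

from lxml import etree

# Hypothetical sample document: UTF-8 bytes with no <meta charset> declaration,
# so the parser has no encoding hint inside the markup itself.
html_bytes = (
    "<html><body><h1>こんにちは</h1><p>日本語の本文</p></body></html>"
).encode("utf-8")

# Same call as in the patched line: tell the parser the input is UTF-8.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(io.BytesIO(html_bytes), parser)

# Read the text back out of the parsed tree.
heading = tree.findtext(".//h1")
paragraph = tree.findtext(".//p")
print(heading, paragraph)
```

With the encoding stated explicitly, the text read back from the tree should match the original Japanese strings. How the parser decodes the same bytes when no encoding is given depends on the underlying libxml2 version, so this sketch does not assert anything about that case.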