mirror of https://github.com/hwchase17/langchain.git (synced 2025-06-26 16:43:35 +00:00)
langchain[patch]: In HTMLHeaderTextSplitter set default encoding to utf-8 (#16372)
- **Description:** The `HTMLHeaderTextSplitter` class now explicitly specifies utf-8 encoding in the part of the `split_text_from_file` method that calls the `HTMLParser`.
- **Issue:** Prevents garbled characters caused by differences in the encoding of HTML files. The problem affects non-English text; I noticed it with Japanese in particular.
- **Dependencies:** None.
- **Twitter handle:** @i_w__a
parent e135e5257c
commit 95ee69a301
@@ -598,7 +598,9 @@ class HTMLHeaderTextSplitter:
 "Unable to import lxml, please install with `pip install lxml`."
 ) from e
 # use lxml library to parse html document and return xml ElementTree
-parser = etree.HTMLParser()
+# Explicitly encoding in utf-8 allows non-English
+# html files to be processed without garbled characters
+parser = etree.HTMLParser(encoding="utf-8")
 tree = etree.parse(file, parser)
 
 # document transformation for "structure-aware" chunking is handled with xsl.
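
The garbling this change guards against can be illustrated without lxml. The sketch below is not taken from the commit: the sample string is hypothetical, and it assumes that a parser given no explicit encoding may fall back to a single-byte codec such as Latin-1, which may not match what lxml does in every environment. It only shows why UTF-8 bytes read with the wrong codec come out as unreadable text.

```python
# Hypothetical sample: a short Japanese HTML fragment, stored as UTF-8 bytes.
html_text = "<h1>日本語の見出し</h1>"
raw_bytes = html_text.encode("utf-8")

# Assumed failure mode: the bytes are interpreted with a single-byte codec.
# Latin-1 maps every byte to a character, so this never raises; it silently
# produces mojibake instead.
garbled = raw_bytes.decode("latin-1")

# Interpreting the same bytes as UTF-8 recovers the original text.
correct = raw_bytes.decode("utf-8")

print(garbled)               # unreadable multi-character sequences
print(correct == html_text)  # the round trip through UTF-8 is lossless
```

Passing `encoding="utf-8"` to the parser, as the diff above does, plays the role of the second decode: it names the codec up front instead of leaving the choice to a fallback.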