mirror of https://github.com/hwchase17/langchain.git (synced 2025-06-30 18:33:40 +00:00)
langchain[patch]: In HTMLHeaderTextSplitter set default encoding to utf-8 (#16372)
- **Description:** The `HTMLHeaderTextSplitter` class now explicitly specifies utf-8 encoding in the part of the `split_text_from_file` method that calls the `HTMLParser`.
- **Issue:** Prevents garbled characters caused by differences in the encoding of HTML files. The problem affects files in languages other than English; I noticed it with Japanese in particular.
- **Dependencies:** None.
- **Twitter handle:** @i_w__a
parent e135e5257c
commit 95ee69a301
@@ -598,7 +598,9 @@ class HTMLHeaderTextSplitter:
                 "Unable to import lxml, please install with `pip install lxml`."
             ) from e
         # use lxml library to parse html document and return xml ElementTree
-        parser = etree.HTMLParser()
+        # Explicitly encoding in utf-8 allows non-English
+        # html files to be processed without garbled characters
+        parser = etree.HTMLParser(encoding="utf-8")
         tree = etree.parse(file, parser)
 
         # document transformation for "structure-aware" chunking is handled with xsl.
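The garbling this change targets can be reproduced with the standard library alone. The sketch below does not call lxml; it only illustrates the failure mode, assuming the wrong fallback is a single-byte codec such as ISO-8859-1. The sample heading text and the variable names are made up for illustration.

```python
# Illustration only: shows the failure mode, not lxml's internal behavior.
html_text = "<html><body><h1>日本語の見出し</h1></body></html>"

# Bytes as they would be stored in a UTF-8 encoded HTML file.
raw = html_text.encode("utf-8")

# Assumed wrong fallback: every byte is read as one ISO-8859-1 character,
# so each multi-byte Japanese character turns into several junk characters.
garbled = raw.decode("iso-8859-1")

# Explicit UTF-8, analogous to passing encoding="utf-8" to the parser.
correct = raw.decode("utf-8")

print("garbled matches source:", garbled == html_text)
print("correct matches source:", correct == html_text)
```

The ASCII markup survives both decodings, which is why English-only files would not show the problem; only the non-ASCII heading text is affected.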