(text-splitters): Small Fix in _process_html for HTMLSemanticPreservingSplitter to properly extract the metadata. (#29215)

- **Description:** Include `main` in the list of elements whose child
elements need to be processed when splitting the HTML.
- **Issue:** #29184
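
To illustrate why the container list matters, here is a minimal sketch and not the splitter's actual code. It assumes a simplified traversal that only descends into tags named in a container list and records `h1`/`h2` text. It uses the standard library's `xml.etree.ElementTree` in place of BeautifulSoup, and `collect_headers`, `CONTAINERS_BEFORE`, and `CONTAINERS_AFTER` are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the container list before and after this commit.
CONTAINERS_BEFORE = ["html", "body", "div"]
CONTAINERS_AFTER = ["html", "body", "div", "main"]


def collect_headers(elements, containers):
    """Return the text of <h1>/<h2> elements reachable through container tags only."""
    found = []
    for elem in elements:
        name = elem.tag.lower()
        if name in containers:
            # Descend into direct children only, mirroring find_all(recursive=False).
            found.extend(collect_headers(list(elem), containers))
        elif name in ("h1", "h2"):
            found.append((elem.text or "").strip())
        # Any other tag is ignored in this simplified model.
    return found


doc = ET.fromstring(
    "<html><body><main><h1>Title</h1><p>Intro</p><h2>Section</h2></main></body></html>"
)

headers_before = collect_headers([doc], CONTAINERS_BEFORE)
headers_after = collect_headers([doc], CONTAINERS_AFTER)
print(headers_before)
print(headers_after)
```

In this simplified model, headers nested under `<main>` are never visited unless `main` is treated as a container, which is the kind of gap the one-line change addresses.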
Mohammad Mohtashim 2025-01-15 20:18:06 +05:00 committed by GitHub
parent 4867fe7ac8
commit 288613d361

@@ -696,7 +696,7 @@ class HTMLSemanticPreservingSplitter(BaseDocumentTransformer):
         placeholder_count: int,
     ) -> Tuple[List[Document], Dict[str, str], List[str], Dict[str, str], int]:
         for elem in element:
-            if elem.name.lower() in ["html", "body", "div"]:
+            if elem.name.lower() in ["html", "body", "div", "main"]:
                 children = elem.find_all(recursive=False)
                 (
                     documents,