Luke
f69695069d
text_splitters: Add HTMLSemanticPreservingSplitter (#25911)
**Description:**
Current HTML splitters rely on a secondary pass through
`RecursiveCharacterTextSplitter` to break the document into
manageable chunks. The problem is that this pass fails to preserve
important structures within the HTML, such as tables and lists.
This implementation of an HTML splitter lets the user define a
maximum chunk size, HTML elements to preserve in full, an option to
preserve `<a>` href links in the output, and custom handlers.
The core splitting begins with headers, similar to `HTMLHeaderTextSplitter`.
If a section exceeds `max_chunk_size`, further recursive splitting is
triggered. During this splitting, elements listed for preservation are
excluded from the splitting process. This can cause chunks to be
slightly larger than the maximum size, depending on the length of the
preserved element. However, the preserved element and its context
remain intact.
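The preserve-then-split idea can be illustrated with a minimal, standard-library-only sketch. This is not the code added in this commit: the function name `split_preserving`, the placeholder format, and the regex-based element matching are all assumptions made for illustration, and the regex does not handle nested elements of the same tag.

```python
import re


def split_preserving(html, preserve_tags, max_chunk_size):
    """Hypothetical sketch: split text into chunks, keeping listed elements whole."""
    placeholders = {}

    def stash(match):
        # Swap the whole element for a short token so it cannot be split.
        key = f"__PRESERVED_{len(placeholders)}__"
        placeholders[key] = match.group(0)
        return f" {key} "

    for tag in preserve_tags:
        pattern = re.compile(
            rf"<{tag}\b[^>]*>.*?</{tag}>", re.DOTALL | re.IGNORECASE
        )
        html = pattern.sub(stash, html)

    # Greedy word-based chunking on the text with placeholders in place.
    chunks, current = [], ""
    for word in html.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)

    # Restore the preserved elements; a restored chunk may exceed the limit.
    restored = []
    for chunk in chunks:
        for key, original in placeholders.items():
            chunk = chunk.replace(key, original)
        restored.append(chunk)
    return restored


table = "<table><tr><td>A</td><td>B</td></tr></table>"
chunks = split_preserving(
    f"Intro text here. {table} Closing words follow.",
    preserve_tags=["table"],
    max_chunk_size=20,
)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

In this sketch the table ends up in a chunk of its own that is longer than the 20-character limit, which mirrors the trade-off described above: size limits are relaxed so that the preserved element is never cut.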
**Custom Handlers**: Some companies, such as Atlassian, use custom
HTML elements that are not parsed by default with `BeautifulSoup`.
Custom handlers allow a user to provide a function to be run whenever a
specific HTML tag is encountered. This lets the user preserve and
gather information within custom HTML tags that `bs4` would potentially
miss during extraction.
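The handler mechanism amounts to a dispatch table from tag name to function. The sketch below shows that pattern using only the standard library's `html.parser` rather than `bs4`. The class `HandlerParser`, the function `extract_text`, the `(tag, attrs)` handler signature, and the `x-video` tag are hypothetical and are not the interface added in this commit.

```python
from html.parser import HTMLParser


class HandlerParser(HTMLParser):
    """Hypothetical sketch: collect text, delegating chosen tags to handlers."""

    def __init__(self, custom_handlers):
        super().__init__()
        self.custom_handlers = custom_handlers
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Call the user-supplied function when a registered tag is seen.
        handler = self.custom_handlers.get(tag)
        if handler is not None:
            self.parts.append(handler(tag, dict(attrs)))

    def handle_data(self, data):
        # Keep ordinary text, ignoring whitespace-only runs.
        if data.strip():
            self.parts.append(data.strip())


def extract_text(html, custom_handlers):
    parser = HandlerParser(custom_handlers)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def video_handler(tag, attrs):
    # Assumed handler: surface the source URL that plain text extraction drops.
    return f"[video: {attrs.get('src', 'unknown')}]"


html = '<p>Watch this:</p><x-video src="clip.mp4"></x-video><p>Done.</p>'
print(extract_text(html, {"x-video": video_handler}))
```

Without the handler, the `x-video` element contributes nothing to the extracted text because it has no text content; with it, the attribute value survives into the output.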
**Dependencies:** Users will need to install `bs4` in their project to
use this class.
I have also added a `how_to` guide and unit tests, which require `bs4`
to run; otherwise they will be skipped.
Flowchart of process:

---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
2024-12-19 12:09:22 -05:00