New tokenizer implementation for MPT and GPT-J

Improves output quality by making these tokenizers more closely match the behavior of the Hugging Face `tokenizers`-based BPE tokenizers these models were trained with.

Featuring:
 * Fixed Unicode handling (via ICU)
 * Fixed BPE token merge handling
 * Complete added vocabulary handling
2025-09-02 00:57:09 +00:00 · 2023-05-21 05:18:42 -07:00
parent 6ed9c1a8d8
commit bbcee1ced5
13 changed files with 47162 additions and 239 deletions
--- a/.codespellrc
+++ b/.codespellrc
@@ -1,4 +1,4 @@
 [codespell]
-skip = .git,*.pdf,*.svg
+skip = .git,*.pdf,*.svg,*_tokenizer_config.h
 #
 # ignore-words-list =