Understanding GPT tokenizers

𝕊𝕚𝕤𝕪𝕡𝕙𝕖𝕒𝕟 · edit-2 2 years ago

ShrimpsIsBugs · 2 years ago

So English is tokenized most efficiently because the tokenizer was trained mostly on English text, which made tokenizing English words efficiently the top priority. Do I understand that correctly?

𝕊𝕚𝕤𝕪𝕡𝕙𝕖𝕒𝕟 · edit-2 2 years ago

AFAIK the way it works is that the more frequent a long sequence is, the more likely it is to get a single token. I’m not sure whether English is tokenized most efficiently because they explicitly made the tokenizer prefer English, or whether it is just a result of the corpus containing mostly English text.
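
To illustrate the "frequent sequences get a single token" part, here is a toy byte-pair-encoding (BPE) style trainer. This is only a sketch under my own assumptions: the names `train_bpe`, `merge_pair` and `encode` are made up for this example, the corpus is invented, and real GPT tokenizers work on bytes with extra pre-splitting rules that are not shown here. The idea is just that the most frequent adjacent pair gets merged first, so sequences that dominate the corpus collapse into one token without anyone explicitly preferring a language.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy, character-level)."""
    # Each distinct word starts as a tuple of single characters, weighted by its count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # The most frequent pair becomes a new, longer token.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Tokenize `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical corpus: one word is far more frequent than the other.
corpus = ["the"] * 50 + ["der"] * 2
merges = train_bpe(corpus, num_merges=2)
print(encode("the", merges))  # the frequent word ends up as a single token
print(encode("der", merges))  # the rare word stays split into characters
```

With only two merges available, both go to the frequent word, so the rare one never gets a token of its own. That matches the "result of the corpus" explanation, although it does not rule out that the training data was also chosen deliberately.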
