Understanding GPT tokenizers

𝕊𝕚𝕤𝕪𝕡𝕙𝕖𝕒𝕟 · edit-2 2 years ago

Here is an example of tokenization being biased toward English (using the author’s Observable notebook):

This is the same sentence in English and my native Hungarian. I understand that this is due to the difference in the amount of text available in the two languages in the training corpus. But it’s still a bit annoying that using the API for Hungarian text is more expensive :)

ShrimpsIsBugs · 2 years ago

So English is tokenized most efficiently because the tokenizer was trained mostly on English text, which made tokenizing English words efficiently the priority. Do I understand that correctly?

𝕊𝕚𝕤𝕪𝕡𝕙𝕖𝕒𝕟 · edit-2 2 years ago

AFAIK the way it works is that the more frequent a long sequence is, the more likely it is to get its own single token. I’m not sure whether English is tokenized most efficiently because they explicitly made it prefer English, or whether it is just a result of the corpus containing mostly English text.
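To make the frequency effect concrete, here is a minimal sketch of byte-pair-encoding style merging on a toy corpus. It is not the real GPT tokenizer (that one works on bytes and is trained on a huge corpus); the corpus, the word choice ("mondat" is Hungarian for "sentence"), the merge budget, and the function names `train_bpe`, `merge_pair`, and `tokenize` are all made up for illustration. The only point is that a word seen often absorbs the merges, while a rare word stays split into small pieces.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words, starting from single characters."""
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself to stay deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in the order they were learned, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: the English word is 20 times more frequent than the Hungarian one.
corpus = ["sentence"] * 20 + ["mondat"] * 1
merges = train_bpe(corpus, num_merges=6)

english_tokens = tokenize("sentence", merges)
hungarian_tokens = tokenize("mondat", merges)
print(len(english_tokens), english_tokens)      # 1 ['sentence']
print(len(hungarian_tokens), hungarian_tokens)  # 6 ['m', 'o', 'n', 'd', 'a', 't']
```

Nothing in this sketch prefers English explicitly. The whole merge budget goes to the frequent word simply because its pairs have the highest counts, which fits the "it's just the corpus" explanation.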
