Just over a year ago, I wrote about how integer tokenization using the GPT-3 and GPT-3.5 tokenizers was insane: it failed to create a coherent number representation in token space. A large number of integers were assigned their own single unique token, and even multi-token integers were not split in a consistent way. For instance, the number 2249 is tokenized as ‘2’ and ‘249’ (1-3), the number 2250 is tokenized as ‘22’ and ‘50’ (2-2), and the number 2251 is tokenized as ‘225’ and ‘1’ (3-1). This makes it very hard for models to learn arithmetic: instead of using the consistent decimal algorithms that computers and humans use, the model has to learn special-cased algorithms in an inconsistent token space, including memorizing the outcomes of calculations involving the uniquely tokenized integers.
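The inconsistency is easy to see by grouping these observed splits by their digit-count pattern. A minimal sketch, using the splits reported above as hard-coded data (a real check would query the tokenizer directly, e.g. via OpenAI's tiktoken library):

```python
# Observed GPT-3.5 tokenizations of three consecutive integers
# (splits taken from the examples above).
observed = {
    "2249": ["2", "249"],
    "2250": ["22", "50"],
    "2251": ["225", "1"],
}

def split_pattern(tokens):
    """Digit-count pattern of a tokenization, e.g. ['22', '50'] -> (2, 2)."""
    return tuple(len(t) for t in tokens)

patterns = {n: split_pattern(toks) for n, toks in observed.items()}
print(patterns)  # {'2249': (1, 3), '2250': (2, 2), '2251': (3, 1)}
```

Three consecutive numbers, three different split patterns: there is no single decimal algorithm the model can apply uniformly across them.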

When that post was written, pretty much the only available model was GPT-3.5. Now, just over a year later, there has been an absolute explosion of new models with new tokenizers, so I thought I'd go back and look at what has changed. Largely, it seems that the problem has been solved. Newer tokenizers, such as those of Mistral, Llama, Gemma, and GPT-4, have a consistent integer-token relationship. This change has probably contributed significantly to the improved mathematical skills of recent models, beyond other improvements in scale, architecture, data, etc., and it means that using a newer tokenizer likely provides a significant math benefit over old ones such as GPT-NeoX's.

Having experimented with the tokenizers of the main open models (and GPT-4), the current situation is that there are basically two strategies in use. The first, used by Llama, Mistral, Gemma, Deepseek, and Yi, is to do the obvious thing and match the tokenization to our decimal system: each digit receives its own unique token, so a multi-digit number becomes one token per digit. I.e. the number 2164 -> [2,1,6,4]. This lets models perform arithmetic by learning the decimal algorithms that we understand well. Interestingly, GPT-4 and Llama3 take a different approach. They instead give each of the numbers 0-999 its own unique token and split longer integers into chunks of up to three digits, starting from the left. I.e. 2164 -> [216,4], 21645 -> [216,45], and 216456 -> [216,456]. It is unclear why they chose this strategy since, while it is consistent, it seems much harder for the model to learn: in effect, it has to learn a three-digit decimal representation with 1000 primitives instead of 10. Nevertheless, the model can likely learn such algorithms, especially larger models with the requisite capacity for this level of memorization, and perhaps it is most efficient in some sense.
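Both schemes are simple enough to sketch directly. A minimal illustration (the function names are mine; real tokenizers implement these splits via their vocabularies and merge rules rather than explicit code like this):

```python
def digit_tokenize(n: int) -> list[str]:
    """One token per digit (Llama/Mistral/Gemma-style)."""
    return list(str(n))

def chunk_tokenize(n: int, width: int = 3) -> list[str]:
    """Chunks of up to `width` digits, split from the left (GPT-4/Llama3-style)."""
    s = str(n)
    return [s[i:i + width] for i in range(0, len(s), width)]

print(digit_tokenize(2164))    # ['2', '1', '6', '4']
print(chunk_tokenize(2164))    # ['216', '4']
print(chunk_tokenize(21645))   # ['216', '45']
print(chunk_tokenize(216456))  # ['216', '456']
```

Note that both are deterministic functions of the digit string alone, which is exactly the consistency property the old GPT-3 tokenizer lacked.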