Right to Left (R2L) Integer Tokenization

This is a guest post by Max Buckley, a software engineer at Google and fellow AI researcher. Contributions: Max wrote a draft of this post and ran the experiments; Beren provided editorial review. [Read More]
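The core idea of right-to-left integer tokenization is to group digits starting from the least-significant end, so chunks align with place value (as thousands separators do). A minimal sketch, assuming 3-digit groups and a plain digit string (the function name and interface here are illustrative, not from the post):

```python
def chunk_r2l(number: str, size: int = 3) -> list[str]:
    """Split a digit string into chunks of `size` digits, grouping from the right.

    E.g. "1234567" -> ["1", "234", "567"], so "234" always sits in a
    consistent place-value position, unlike left-to-right chunking
    which would yield ["123", "456", "7"].
    """
    chunks = []
    while number:
        chunks.append(number[-size:])  # take the rightmost `size` digits
        number = number[:-size]        # drop them and repeat
    return chunks[::-1]                # restore left-to-right order

print(chunk_r2l("1234567"))  # ['1', '234', '567']
print(chunk_r2l("123456"))   # ['123', '456']
```

With left-to-right chunking, the same suffix of a number can land in different chunks depending on total length; grouping from the right makes chunk boundaries consistent with magnitude.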

The Unconditioned Distribution of Current Open LLMs

Last year, I wrote a quick post investigating the ‘unconditioned’ distribution of LLMs in the OpenAI API, where the ‘unconditioned distribution’ is simply the distribution of LLM outputs following the empty string – or beginning-of-sequence token. My intuition here was that this gives some idea of what the... [Read More]

Capital Ownership Will Not Prevent Human Disempowerment

When discussing the future of AI, I semi-often hear an argument along the lines that in a slow takeoff world, despite AIs automating increasingly more of the economy, humanity will remain in the driving seat because of its ownership of capital. This argument posits a world where humanity effectively becomes a... [Read More]

Integer tokenization is now much less insane

Just over a year ago, I wrote about how integer tokenization using the GPT2 and GPT3 tokenizer was insane. This was because it failed to create a coherent number representation in token space: many integers were each assigned a single unique token, and even multi-token integers were... [Read More]