Epistemic Status: Fairly sure about this from experience but could be missing crucial considerations. I don’t present any super detailed evidence here so it is theoretically just vibes.

When forecasting AI progress, forecasters and modellers often break AI progress down into two components: increased compute, and ‘algorithmic progress’. My argument here is that using the term ‘algorithmic progress’ for ‘the remainder after compute’ is misleading, and that we should really think about and model AI progress as three terms – compute, algorithms, and data. My claim is that a large fraction (but certainly not all) of the AI progress currently conceived of as ‘algorithmic progress’ is actually ‘data progress’, and that the term ‘algorithmic’ gives a false impression of the key forces and improvements that have driven AI progress over the past three years or so.
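
To make this framing concrete, here is a minimal sketch (in Python) of the kind of three-term decomposition I have in mind. The function name, multiplier names, and toy numbers are purely illustrative assumptions of mine, not estimates from any published model:

```python
# Toy sketch of decomposing 'effective training compute' into three factors.
# All names and numbers here are illustrative assumptions, not real estimates.

def effective_compute(physical_flops: float,
                      algo_multiplier: float,
                      data_multiplier: float) -> float:
    """Effective compute under a three-term model: raw flops scaled by an
    algorithmic-efficiency multiplier and a data-quality multiplier."""
    return physical_flops * algo_multiplier * data_multiplier

# The standard two-term view folds all data progress into 'algorithmic progress':
two_term = effective_compute(1e25, algo_multiplier=8.0, data_multiplier=1.0)

# The three-term view splits that same 8x gain into a (smaller) genuinely
# algorithmic part and a (larger) data-quality part.
three_term = effective_compute(1e25, algo_multiplier=2.0, data_multiplier=4.0)

assert two_term == three_term  # same total progress, different attribution
```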

From experience in the field, there have not been that many truly ‘algorithmic’ improvements with massive impact. The primary ones, of course, are the switch to RLVR and figuring out how to do mid-training (although both of these depend vitally on the datasets). Other, more minor ones include qk-norm, fine-grained experts and improvements to expert balancing, and perhaps the Muon optimizer. The impact of most of these is utterly dwarfed by ‘better’ data, however, and this is something that pure scaling and flop-based analyses miss.
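
For a sense of what these ‘minor’ algorithmic tweaks look like, here is a minimal sketch of qk-norm: normalize the queries and keys (here with a gain-free RMSNorm, for simplicity) before computing attention scores. This is my own simplified illustration, not any particular lab’s implementation:

```python
import torch
import torch.nn.functional as F

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Normalize over the head dimension (no learned gain, for simplicity).
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def qk_norm_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention with qk-norm: queries and keys are
    normalized before the attention scores, which stabilizes the logits."""
    q, k = rms_norm(q), rms_norm(k)
    return F.scaled_dot_product_attention(q, k, v)

# Shapes: (batch, heads, sequence, head_dim)
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out = qk_norm_attention(q, k, v)  # (1, 8, 128, 64)
```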

Models today are certainly trained using vastly more flops than previously, but they are also trained on significantly ‘higher quality’ data, where ‘high quality’ means aligned with the specific tasks we care about the models being able to perform (cynically: the evals). The models are not getting so good by scale alone. A GPT-4-scale model trained on the GPT-3 dataset would be substantially worse across all benchmarks, even if we somehow replicated the GPT-3 dataset up to the size of GPT-4’s dataset. However, such a model was never released (and probably never trained), so improvements in data are easily hidden and misattributed to scale or other progress. An easy way to see this is to look at model improvements for a fixed flop count and model size. These improvements have been substantial, as projects like the Phi series of models show.

It is very noticeable that e.g. Qwen3 uses an architecture and training setup that is practically identical to Llama2 and yet achieves vastly greater performance – performance that would require many additional OOMs of flops to match if you could only train on an infinite Llama2-style dataset. This is almost entirely because the Qwen3 datasets are not just bigger but, crucially, much more closely aligned with the capabilities we care about the models having – i.e. the capabilities that we measure and benchmark.

My opinion here is that we have essentially been seeing a very strong Flynn effect for the models, which explains a large proportion of recent gains as we switch from almost totally uncurated web data to highly specialized synthetic data which perfectly (and exhaustively) targets the tasks we want the models to learn. It’s like the difference between giving an exam to some kid who wandered in from the jungle vs one who has been obsessively tiger-parented from birth to do well at this exam. Clearly the tiger-parented kid will do vastly better with the same innate aptitude, because their entire existence has been constructed to make them good at answering things similar to the exam questions, even if they have never seen the exact exam questions themselves before. Conversely, the jungle kid probably destroys the tiger-parented kid at various miscellaneous jungle-related skills, but nobody measures or cares about these because they are irrelevant for the vast, vast majority of tasks people want the jungle kid to do.

Translating this metaphor back to LLM-land: Qwen3 has seen vast amounts of synthetic math, code, and knowledge-based multiple-choice questions, all designed to make it as good as possible on benchmarks; Llama2 has seen mostly random web pages which incidentally, occasionally contain some math and code, but with very little quality filtering. Llama2 probably destroys Qwen3 at knowing about obscure internet forum posts from 2008, precisely understanding the distribution of internet spam at different points throughout history, and knowing all the ways in which poor common-crawl parsing can create broken-seeming documents, but nobody (quite rightly) thinks that these skills are important, worth measuring, or relevant for AGI.

One way to track this is the sheer amount of spending on data-labelling companies by the big labs. ScaleAI’s and SurgeAI’s revenues each sit around $1B, and most of this, as far as I can tell, comes from data labelling for big AI labs. This spend is significantly less than compute spend, it is true, but it nevertheless must make up a significant fraction of a lab’s total spending. I don’t have enough data to claim this firmly, but it seems at least plausible that this spend is increasing at a similar rate to compute spend (e.g. 3-4x per year), albeit from a much lower base.

When we see frontier models improving on various benchmarks, we should think not just of increased scale and clever ML research ideas but of billions of dollars spent paying PhDs, MDs, and other experts to write questions and provide example answers and reasoning targeting these precise capabilities. With the advent of outcome-based RL and the move towards more ‘agentic’ use-cases, this data also includes custom RL environments which are often pixel-perfect replications of commonly used environments – specific websites like Airbnb or Amazon, browsers, terminals, computer file-systems, and so on – alongside large numbers of human trajectories exhaustively covering the most common use-cases of these systems.

In a way, this is like a large-scale reprise of the expert-systems era, except that instead of paying experts to directly program their thinking as code, we pay them to provide numerous examples of their reasoning and process, formalized and tracked, and then distill these into models through behavioural cloning. This has updated me slightly towards longer AI timelines, since the fact that frontier systems need this much effort put into designing extremely high-quality human trajectories and environments implies that they still lack the critical core of learning that an actual AGI must possess. Simply grinding to AGI by getting experts to exhaustively cover every possible bit of human knowledge and skill, and hand-coding (albeit with AI assistance) every single possible task into an RL gym, seems likely to be inordinately expensive, to take a very long time, and to be unlikely to suddenly bootstrap to superintelligence.
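
To be concrete about what ‘distilling expert trajectories through behavioural cloning’ cashes out to in practice, here is a minimal sketch: supervised next-token prediction on an expert demonstration, with the loss masked so that only the expert’s response tokens are imitated. The tiny stand-in model and random token ids below are placeholders for a real LM and real expert data:

```python
import torch
import torch.nn.functional as F

def behavioural_cloning_loss(model, prompt_ids: torch.Tensor,
                             response_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy next-token loss on an expert demonstration, masked so the
    model is only trained to imitate the expert's response, not the prompt."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    logits = model(input_ids)                  # (batch, seq, vocab)
    shift_logits = logits[:, :-1, :]           # predict token t+1 from tokens <= t
    shift_labels = input_ids[:, 1:].clone()
    prompt_len = prompt_ids.shape[1]
    shift_labels[:, : prompt_len - 1] = -100   # ignore prompt positions
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.shape[-1]),
                           shift_labels.reshape(-1),
                           ignore_index=-100)

vocab, dim = 1000, 64
model = torch.nn.Sequential(          # stand-in for a real language model:
    torch.nn.Embedding(vocab, dim),   # any callable ids -> (batch, seq, vocab)
    torch.nn.Linear(dim, vocab),      # logits would work here
)
prompt_ids = torch.randint(0, vocab, (1, 16))    # expert-written question
response_ids = torch.randint(0, vocab, (1, 48))  # expert reasoning + answer
loss = behavioural_cloning_loss(model, prompt_ids, response_ids)
loss.backward()
```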

There is some intriguing evidence that actual algorithmic progress is beginning to contribute more than it has in the past few years. Clearly there have been algorithmic breakthroughs enabling RL to start working (although this is also substantially a data breakthrough, in that the default policies of LLMs became good enough to get nontrivial reward, so there is no longer an exploration problem in the RL training). We have also started to see bigger changes to architecture embraced by big labs than previously, such as Deepseek’s MLA and Google’s recent Gemma3n release. Finally, Muon is starting to gain traction as an optimizer to displace AdamW. There have also been improvements in mid-training recipes, although again these are heavily entangled with the data. This is in contrast to the 2022-2024 era, which largely consisted of simply scaling up model size and data size and increasing data quality, while the actual core training methods and architectures remained essentially unchanged. If this is a real shift, it is possible that the trend lines will continue and that we will simply move towards greater actual algorithmic progress as the cheap improvements from data progress slow.

One way this could be quantified relatively straightforwardly is to run ablation experiments at fixed compute: train either a 2022 or a 2025 frontier architecture and training recipe on either 2022 data (The Pile?) or 2025 data (the Qwen3 training set?) and see where the gains actually come from. My money would be very substantially on the datasets, but I could be wrong here and could be missing some key factors.
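
Here is a rough sketch of that 2x2 grid. The recipe and dataset names are placeholders, and `train_and_evaluate` is a stub standing in for an actual fixed-flop training-plus-evaluation run:

```python
import itertools

# Hypothetical 2x2 ablation at fixed training compute: architecture/recipe
# vintage vs. data vintage. Names are placeholders for whatever open
# approximations are available (e.g. The Pile vs. a Qwen3-style mixture).
RECIPES = ["frontier_2022", "frontier_2025"]
DATASETS = ["web_data_2022", "curated_synthetic_2025"]
FIXED_FLOPS = 1e21  # same budget for every cell

def train_and_evaluate(recipe: str, dataset: str, flops: float) -> float:
    """Placeholder: train a model with `recipe` on `dataset` under a fixed
    flop budget and return an aggregate benchmark score (0.0 as a stand-in)."""
    return 0.0

results = {}
for recipe, dataset in itertools.product(RECIPES, DATASETS):
    results[(recipe, dataset)] = train_and_evaluate(recipe, dataset, FIXED_FLOPS)

# Attribution: how much of the 2022 -> 2025 gain comes from each axis?
baseline = results[("frontier_2022", "web_data_2022")]
data_gain = results[("frontier_2022", "curated_synthetic_2025")] - baseline
algo_gain = results[("frontier_2025", "web_data_2022")] - baseline
```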