Epistemic status: Mostly re-litigating old debates, I believe. Hopefully still somewhat interesting. This is just a short post about a small point which took me a worryingly long time to realize.
For a while people were claiming that the pretraining next-token prediction objective could directly lead to superintelligence in the limit, because the limit of prediction is being able to predict everything. For instance, if asked to predict ‘5 + 4 = ‘ the next token should indeed be ‘9’, and hence the model has to learn mathematics solely from the prediction task. Similarly, it was posited that we should be able to write things like ‘The solution to alignment is: ‘ and get good alignment solutions out. This was core to the ‘pure scaling is all that matters’ camp, whose position was that scaling pretraining is all you need for AGI. The goalposts here have since shifted implicitly from scaling pretraining to scaling RL, or scaling something else (data?). Perhaps the most direct statement of this view is from Janus in Simulators: “Something which can predict everything all the time is more formidable than any demonstrator it predicts: the upper bound of what can be learned from a dataset is not the most capable trajectory, but the conditional structure of the universe implicated by their sum”. This may be correct, but it is unclear that it leads directly to superintelligence.
Nowadays, I think most people have implicitly updated away from the strong form of this view, given the direction of the field recently and the new breakthroughs in RL. But it is still worth making the update explicit and understanding why certain things happened, not just noting that they did and that we were all right all along.
The reason scaling pretraining (i.e. unsupervised learning on a fixed corpus of, say, web data) does not scale to AGI or imply omniscience is pretty simple. The object that pretraining optimizes towards is an approximation to the true distribution over token sequences taken from the internet. Better scaling means better approximating the distribution of common-crawl. The distribution of common-crawl does not contain superintelligent behaviour, and hence scaling alone will not reach it. Ironically, in the extreme, scaling might even make superintelligence less likely: perhaps, for whatever reason, the model overgeneralizes beyond the dataset in some intermediate regime and then collapses back as it approximates the true distribution ever more closely, exactly like children learning language overgeneralize with words like ‘goed’ before they eventually learn all the weird epicycles of grammar.
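One standard way to make this precise: the pretraining loss is the expected negative log-likelihood under the data distribution, which decomposes into a KL divergence plus a constant entropy term, so the global optimum of ever-better prediction is exactly the data distribution and nothing beyond it.

```latex
% Pretraining loss as a function of the model distribution p_\theta
\mathcal{L}(\theta)
  = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[-\log p_\theta(x)\right]
  = D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\|\, p_\theta\right) + H(p_{\text{data}})
% The entropy term is constant in \theta, so the unique global minimum is
% p_\theta = p_{\text{data}}: perfect prediction recovers common-crawl, nothing more.
```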
To see this more clearly, let’s go back to the example of prompting the pretrained model with ‘The solution to alignment is: ‘. Suppose the model somehow magically generalized deeply and actually knew the solution to alignment. Even then, completing this prompt with the solution to alignment would be an incredibly unlikely continuation for the model, and as a prediction of internet text it would be objectively incorrect. The model knows a huge amount about our world. It knows that alignment is not solved, and that people talk all the time about it not being solved.
Well then, isn’t the solution just clever prompting? Say we write something like: ‘The year is 2050 and this is an alignment textbook showing how alignment can be solved …. ‘. This is still not sufficient. Now think: on today’s internet, where would text like this most likely come from? A couple of places: one is discussions of prompting, the other is very weird schizophrenic forums. There are probably other cases, but in no case is the actual solution to alignment likely to be found with such a prefix on today’s internet. Certainly clever prompting and context management can introduce many bits of selection into the LLM’s responses, but these bits are typically used either to push the predictor towards a specific sub-distribution of the data (hard to do when trying to derive new knowledge, where the sub-distribution does not actually exist!) or to essentially treat the LLM as a source of randomness by applying selection post-hoc – i.e. curation.
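As a toy sketch of the selection-versus-capability point (the `generate` and `score` functions below are hypothetical stand-ins, not a real LLM pipeline): picking the best of N independent samples injects at most log2(N) bits of optimization pressure, so post-hoc curation cannot bridge a gap that is hundreds of bits wide.

```python
import math
import random

def best_of_n(generate, score, n):
    """Post-hoc selection: draw n samples from the model and keep the best
    one according to an external judge. This is curation, not new capability."""
    samples = [generate() for _ in range(n)]
    return max(samples, key=score)

# Selecting the best of n independent samples applies at most log2(n) bits of
# optimization pressure on top of the model's own distribution.
for n in (10, 1_000, 1_000_000):
    print(f"best-of-{n}: at most {math.log2(n):.1f} bits of selection")

# Hypothetical stand-ins for an LLM sampler and a curator's judgement:
# a 'model' emitting random strings and a 'judge' counting matching characters.
target = "the solution to alignment is ..."
alphabet = "abcdefghijklmnopqrstuvwxyz ."
generate = lambda: "".join(random.choice(alphabet) for _ in range(len(target)))
score = lambda s: sum(a == b for a, b in zip(s, target))
print(best_of_n(generate, score, 1_000))
```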
What about instruction following? We train the AI to follow instructions by presenting it with a data distribution containing lots of examples of question-and-answer formats. This creates a new aspect of the world for the LLM, one in which questions are always followed by answers. But which questions, and which answers?
Instruction following and SFT datasets are not magic. Underneath it all there is still a predictor; only the distribution of data underlying the predictor has changed. Instead of trying to predict what a miscellaneous internet person would say, the model has to predict what the mysterious ‘question-answerer’ would say. But this question-answerer is not omniscient. Any question like ‘What is the solution to alignment?’ would probably be followed by something like ‘Alignment is a challenging problem and there is no globally recognized solution…’.
The fundamental difficulty is that it is very hard to point prediction at truth in the abstract sense. Prediction is prediction. Truth is truth. If untrue things are likely to appear in the text, then the model will predict them. If a question is typically followed by a non-sequitur, it will predict the non-sequitur. Another way to think about this is that prediction and truth overlap and are highly correlated at the start, but then, as always, eventually the tails come apart and further improvements in prediction do not lead to further improvements in truth, but instead simply to better modelling of the epicycles of falsehood.
The real question is whether scaling pretraining can realize something substantially greater than the pretraining dataset: ‘the conditional structure of the universe implicated by their sum’. Another way to think of this is as some kind of ‘inductive closure’ over the dataset – i.e. the dataset implies the existence of (and gives lots of circumstantial evidence about) humans and their mental processes, at least those that lead people to write texts that end up on common-crawl. Presumably this set of mental processes is (human-level) AGI-complete in some sense, hence pretraining on web data can lead to AGI.
This, I think, is still uncertain. It should certainly be possible to derive novel knowledge from the combination and synthesis of huge amounts of existing knowledge, although LLMs seem surprisingly poor at this compared to what a human with all of human knowledge downloaded into their head could probably do. LLMs probably do have a whole bunch of ‘discoveries’ locked up somewhere in their latent space already. The problem is accessing this space. Even if a predictor ‘knows’ the answer[1], it will still not answer using that information unless doing so is one of the most likely completions. This means that to actually extract this information via prompting (although other methods like internal probing could work), you would need to input many bits of information into the prompt – potentially enough information to almost derive the answer yourself.
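To put a rough number on ‘many bits’ (a back-of-the-envelope framing, not anything precise): if the model assigns conditional probability p to the correct answer, then the prompt must supply on the order of -log2(p) bits of selection before that answer becomes the dominant completion.

```latex
% Back-of-the-envelope: raising an answer a from conditional probability p to a
% majority completion requires a likelihood-ratio update of roughly
\log_2 \frac{P(a \mid \text{prompt})}{P(a)} \;\gtrsim\; \log_2 \frac{1/2}{p} \;\approx\; -\log_2 p \ \text{bits},
% i.e. when p is astronomically small, nearly as many bits as it would take
% to specify the answer directly.
```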
Another issue that becomes obvious once you think about things from the Bayesian posterior perspective is the underdetermination of inferences about the causes of the data. We assume that, e.g., a model can obtain a perfect representation of various aspects of human psychology from studying internet text, but can it in reality? There are a very large number of possible hypotheses which are all consistent with the data, especially when the data has many authors and is corrupted by all kinds of noise. Perfect inductive inference of the true generative process behind some dataset is not always possible: perfect Bayesian inference will find the correct weighting of possible hypotheses, but this distribution over hypotheses can still be high entropy. A lot of information can sometimes be derived from pure observation, but other times it is extremely hard to firmly verify or falsify hypotheses this way. This is why science places a fundamental focus on experiments: when done correctly, they allow causal inferences to be made and hypotheses explicitly ruled in or out by controlling all other variables. LLMs, given whatever data happens to exist, do not have this luxury, and so their posteriors could end up very fuzzy. This is true even of perfect Bayesian inference, and even the limit of scaling cannot exceed perfect Bayesian inference.
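As a toy illustration of this underdetermination (with made-up hypotheses and numbers): if two generative hypotheses assign identical likelihoods to everything that can be passively observed, their posterior odds never move away from the prior odds, no matter how much data arrives; only an intervention that makes them predict differently could separate them.

```python
# Two hypothetical generative hypotheses about the process behind some text.
# They assign identical probabilities to every observable token, but differ in
# the latent mechanism (and in what they would predict under an intervention).
likelihood = {
    "h1": {"token_a": 0.7, "token_b": 0.3},
    "h2": {"token_a": 0.7, "token_b": 0.3},  # observationally equivalent to h1
}

posterior = {"h1": 0.5, "h2": 0.5}  # start from an even prior
observations = ["token_a", "token_b", "token_a"] * 10_000

for obs in observations:
    # Bayesian update: multiply by each hypothesis's likelihood, then renormalise.
    unnorm = {h: posterior[h] * likelihood[h][obs] for h in posterior}
    z = sum(unnorm.values())
    posterior = {h: p / z for h, p in unnorm.items()}

# No amount of passive observation separates the hypotheses: the posterior
# stays at 0.5 / 0.5, exactly as high-entropy as the prior.
print(posterior)
```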
While so far we have just discussed the initial core claim that pretraining scaling will lead to AGI, from today’s perspective this whole discussion completely ignores the vital importance of the data. Questions about whether something like predictive loss can scale to superintelligence are themselves meaningless without considering the dataset. Predictive loss on some datasets can likely lead to superintelligence (those datasets containing huge numbers of samples of superintelligences acting and thinking); datasets without this do not. At the trivial level, a dataset comprised purely of random numbers will not lead to superintelligence, nor to any kind of meaningful intelligence at all. There must be signal in the data, and that signal must contain enough bits to describe superintelligence.
Obviously the Bayesian posterior over a dataset full of superintelligent behaviour across all possible tasks will give rise to superintelligence. But this does not solve the bootstrapping problem of how we get the superintelligent behaviour data in the first place. Reinforcement learning partially solves this, since RL is just iterative predictive learning on continually updated data. However, the limits of RL are the limits of our critics: we cannot get signal on capabilities we cannot judge, and so the bootstrapping must stall there. In some (many?) situations this is no problem, as we can judge perfectly, and hence RL can bootstrap to superintelligence in those domains. One bet is to hope that superintelligence in these domains will either generalize or can be used to invent ways to bootstrap further. The alternative is to find some other way to bootstrap.
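To make that loop concrete, here is a schematic sketch (all interfaces hypothetical, closer to rejection sampling plus fine-tuning than to any specific RL algorithm): the model only ever receives training signal on trajectories the critic can score, so the loop amplifies what the critic can already recognise and stalls where it cannot.

```python
def bootstrap(model, critic, tasks, rounds=5, samples_per_task=8, threshold=0.9):
    """Iterated sample-judge-retrain loop: generate candidate trajectories,
    keep only those the critic can verify as good, and do further predictive
    learning on the kept data.

    `model.sample`, `model.finetune`, and `critic.score` are hypothetical
    interfaces standing in for a real RL or rejection-sampling pipeline.
    """
    for _ in range(rounds):
        accepted = []
        for task in tasks:
            candidates = [model.sample(task) for _ in range(samples_per_task)]
            # The only training signal is the critic's judgement: trajectories
            # the critic cannot evaluate (score() returns None) contribute
            # nothing, which is exactly where the bootstrapping stalls.
            scored = [(c, critic.score(task, c)) for c in candidates]
            accepted += [c for c, s in scored if s is not None and s >= threshold]
        if not accepted:
            break  # nothing the critic can verify: no signal, no improvement
        model.finetune(accepted)  # predictive learning on the updated dataset
    return model
```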
-

[1] Meaning that the answer can be extracted somehow from the model.