Creating worlds where iterative alignment succeeds

A major theorized difficulty of the alignment problem is its zero-shot nature. The idea is that any AGI system we build will rapidly be able to outcompete its creators (us) in accumulating power. Hence, if it is not aligned right from the beginning, we will be able neither to safely iterate on it by containing unaligned versions, nor to intervene on it later, after deployment, and apply fixes to its misaligned value system (the ability to do this is corrigibility). Additionally, it is often argued that there is some kind of phase transition around AGI – a sharp left turn – whereby capabilities will improve rapidly but alignment will not. Because of this, it is often argued that we cannot meaningfully test alignment techniques on pre-AGI agents which we can safely iterate upon.

If we grant all of these arguments, then our p(doom) ends up exceptionally high by default. This is essentially because solving major intellectual problems zero-shot is basically impossible. In the history of science, major scientific problems have hardly ever been solved with no empirical interaction or iteration with the underlying phenomena¹. Humanity does not have a good track record of success at this, and arguably this is because it is impossible. A priori reasoning can only go so far, even with infinite time and reasoning ability. In the real world, as opposed to in formal systems, errors and uncertainty always creep into the reasoning process, and thus reasoning power eventually decays to a maximum entropy floor on sufficiently long reasoning chains without empirical grounding.

A similar point was also explained by John in this post. However, the argument in that post is that, since we will probably succeed in worlds where iterative design is possible (by no means guaranteed in my view), most of the p(doom) lies in worlds where iterative design fails. Thus, we should focus primarily on reducing risks in the worlds where iterative design fails – i.e. try to solve zero-shot alignment.

However, this logic completely ignores the fact that through our actions today and leading up to AGI, we can potentially strongly affect which world we end up in. Specifically, if we spend a lot of time thinking about and trying to design increasingly safe ways to contain and iterate on AGI-ish models, as well as trying to extract as many empirical bits as we can from pre-AGI models, then we move ourselves towards worlds where iterative design is increasingly likely to succeed.

John’s post is actually good at this because it highlights additional problems that could occur with iterative design schemes, such as various kinds of Goodharting through iteration, and the implicit optimization power applied by the iteration process causing problems to be hidden and then suddenly recur. However, the better we understand these kinds of issues, the more likely it is that we can design safer iterative design protocols which either don’t suffer from these issues or take countermeasures against them. We know such countermeasures are possible. For instance, iterating and tweaking until you succeed, which applies optimization power against your own signal, is a generalization of the well-known issue of multiple comparisons in statistics, which can be addressed through super simple measures such as Bonferroni corrections. There is probably a principled generalization of this approach to handle the more general case – for instance, models which have gone through additional finetuning iterations receive proportionally greater scrutiny, or we keep a validation set of held-out finetuning data or interpretability tools which we never train against, and if any of them are tripped we abort the whole process. Thinking about these problems and designing sensible programs to handle them thus constitutes a potentially high-impact intervention that can substantially reduce our p(doom). This is especially important to realize given that actually reducing p(doom) in the zero-shot case is probably exceptionally difficult, while increasing the probability of iterative design succeeding is likely much easier and has direct paths of attack we can attempt now. This means that on the margin, we should focus more effort on creating worlds where iterative design succeeds rather than on trying to solve the much harder problem of zero-shot alignment.
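To make the multiple-comparisons analogy concrete, here is a minimal sketch of a Bonferroni-style correction applied to repeated evaluation rounds. The function names and the framing of each iteration as one statistical test are my own illustrative assumptions, not a proposed protocol, and the family-wise error formula additionally assumes the tests are independent.

```python
def bonferroni_threshold(alpha: float, num_comparisons: int) -> float:
    """Per-test significance threshold after a Bonferroni correction."""
    if num_comparisons < 1:
        raise ValueError("num_comparisons must be at least 1")
    return alpha / num_comparisons


def family_wise_error_rate(per_test_alpha: float, num_comparisons: int) -> float:
    """Chance of at least one false positive across all tests.

    Assumes the tests are independent (an illustrative simplification).
    """
    return 1.0 - (1.0 - per_test_alpha) ** num_comparisons


def passes_after_iterations(p_value: float, alpha: float, num_iterations: int) -> bool:
    """Hypothetical check: the more rounds of tweak-and-retest we have run,
    the stricter the bar a single 'looks fine' result must clear."""
    return p_value <= bonferroni_threshold(alpha, num_iterations)


if __name__ == "__main__":
    alpha, rounds = 0.05, 10
    # Uncorrected: ten looks at the data inflate the false-positive rate a lot.
    print(f"uncorrected FWER: {family_wise_error_rate(alpha, rounds):.3f}")
    # Corrected: the per-test threshold shrinks so the overall rate stays below alpha.
    corrected = bonferroni_threshold(alpha, rounds)
    print(f"corrected FWER:   {family_wise_error_rate(corrected, rounds):.3f}")
    # The same result passes on the first try but not after ten rounds of tweaking.
    print(passes_after_iterations(0.01, alpha, 1), passes_after_iterations(0.01, alpha, rounds))
```

The point of the sketch is only the scaling: the required level of evidence grows with the number of iterations, which is the simplest version of "models which have gone through more finetuning iterations receive proportionally greater scrutiny".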

Of course, this approach heavily depends upon us having the capability to materially affect the success or failure probabilities of iterative alignment. The counterargument would be that whether we exist in a world where iterative alignment succeeds or fails is a fact about the world that we cannot change, and hence it is useless to try to do anything to affect this. This is a fair point, except that we are pretty uncertain about whether this is a true claim about the world or not. The primary arguments against iteration typically require relatively strong forms of FOOM or sharp left turns. Importantly, both of these are empirical claims without any evidence at present (of course we shouldn’t necessarily expect evidence here either), meaning that our probabilities of these scenarios are strongly affected by priors and should be highly uncertain. Thus, a secondary important goal is trying to get as many bits of information about these questions as we can, to figure out whether we are in an iterated-success world, a necessarily zero-shot world, or somewhere in the middle.

A priori, a world where we can do literally nothing to increase the chances of iterative design is highly unlikely, because it requires a universal quantifier over all possible approaches to safe iteration: every one of them must fail to work at all. Indeed, a large number of approaches have been proposed which could make iteration safer, including preventing foom, myopia, satisficing, quantilizing, testing the AGI in a simbox, interpretability to detect deception/adversarial plans, etc. In general, effort invested in understanding / interpreting models, building secure simboxes, sanitizing/shielding potentially dangerous data from being picked up in training sets, and collecting good datasets of aligned behaviour all seems likely to make iteration safer and more secure. Additionally, thinking about better iteration protocols which make it less likely that we fool ourselves and optimize against ourselves should also improve our chances. It is worth noting that, due to the incredibly bleak chances of the zero-shot world, even in worlds where we know that iterative design is probably unsafe, most of our p(survival) still resides in doing iterative alignment.

¹ Indeed, the only examples which might pattern-match to this – for instance the problem of whether black holes destroy information, where we clearly cannot iteratively experiment on black holes in the far future – are better understood as mathematical results within an existing formal system (General Relativity + Information Theory), where these formal mathematical systems were extensively empirically validated in other circumstances. This would imply that any zero-shot solution to alignment almost certainly goes through a similar process: we develop a formal mathematical theory of intelligence, or value learning, or some kind of alignment-relevant thing, and then solve alignment through this formal system. The challenges here are that (a) developing such grounded formal theories is hard, and if it is a theory of intelligence then we basically already have AGI; (b) it is not clear if alignment can even be solved through a formal mathematical system at all; and (c) sharp-left-turn arguments would also attack this approach, since any formal theory must be based on pre-AGI models, which would therefore not be expected to generalize to AGI.