There are many facets to the alignment problem, but one of them is a computer security problem: we want to design a secure system in which to test our AGIs to check that they are aligned, and which they cannot ‘break out of’. Having such a secure AGI box is a prerequisite for any safe testing, evaluation, or empirical study of AGIs. If we were confident we could construct such a containment box, it would be highly beneficial for AI safety and alignment, since we could actually study the AGI we seek to align in depth.

The fundamental problem here is that we need to make a computer system secure against a determined and highly intelligent adversary. This is, at its core, the problem of computer security.

In computer security, there is the crucial concept of the threat model: the adversary your system is designed to be secure against. In many cases, there is a hierarchy of threat models for systems which need to be more or less secure. Many systems only need to be secure against casual hackers and random script kiddies. Others need to be secure against seasoned and determined criminal hackers. Finally, some systems may need to be secure against well-resourced nation-state adversaries, which can dedicate hundreds of extremely smart and well-trained personnel, with substantial resources, over decades, to cracking your security.

This is true even in cryptography, where we can obtain mathematical guarantees of security (at least in theory). Here, the threat model is expressed in terms of the computational power needed to brute-force the encryption key; essentially, it is a measure of how computationally hard the underlying one-way function is to invert.
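As a back-of-the-envelope illustration of this kind of cost accounting (the attacker's guess rate below is an assumed figure for illustration, not a real capability estimate), a minimal sketch might look like:

```python
# Rough cost of brute-forcing an n-bit symmetric key by exhaustive search.
# The guess rate used below is a purely hypothetical nation-state-scale figure.

def brute_force_years(key_bits: int, guesses_per_second: float) -> float:
    """Expected years to find the key (on average, half the keyspace is searched)."""
    expected_guesses = 2 ** (key_bits - 1)
    seconds = expected_guesses / guesses_per_second
    return seconds / (60 * 60 * 24 * 365)

# A hypothetical attacker testing 10^12 keys per second against a 128-bit key:
print(f"~{brute_force_years(128, 1e12):.1e} years")  # roughly 5.4e18 years
```

The point is simply that the security claim is made relative to an explicit, quantitative threat model, rather than being an absolute guarantee.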

In AI safety, work is often pretty vague about the explicit threat model our alignment techniques are meant to work against. Often, it is either assumed that the threat is some potential near-term AGI built from essentially scaling up current ML techniques, or alternatively, that it is a post-FOOM superintelligence which is tacitly assumed to be essentially omniscient and to potentially possess unbounded computational power. Of course, in an ideal world, we would design all systems to be secure against the worst-case scenario of an AIXI-like superintelligence, but this ignores the fundamental tradeoffs that come with such an approach. Security intrinsically has costs, and by ignoring all approaches that work against near-term or resource-limited AGIs in favour of only those which might hypothetically work against AIXI, we could be cutting ourselves off from a solution to the problem we will actually face.

This is similar to the way that most people don’t go out of their way to secure all their communication and computer systems from nation-state actors like the NSA or Mossad, even if it is theoretically possible to do so.

To make this concrete for alignment, we need to formalize a set of probable threat models for AGIs at different levels of capability. This is my first stab at such a hierarchy (sketched as a rough data structure after the list below); of course there is fundamental uncertainty here, and we will need much more accurate estimates in practice:

1.) A near-term AGI built from scaling up current ML techniques, probably comprising an unsupervised multimodal world model with an RL module on top which is trained to optimize some objective. It is composed entirely of neural networks, so it primarily learns by amortizing and extrapolating from data and is extremely data hungry. The AGI is probably trained on most of the internet, so we should assume it ‘knows’ at least most of what we know (though perhaps not, if we are smart and careful about its training data). It is not omniscient and primarily learns by trial and error, with some planning capabilities within its learnt world model. Its out-of-distribution capabilities are not amazing. It can only implement explicit algorithms that are straightforward to express in the forward pass of the network; all its computation is implemented in a fixed feedforward pass.

2.) A near-ish-term AGI based around deep learning methods but with significant advances beyond the current SOTA (maybe built on explicit Bayesian inference/reasoning or some kind of symbolic engine, in addition to more powerful neural networks at larger scale and probably with a better architecture). Importantly, this AGI has powerful and flexible program-synthesis capabilities and can run arbitrary code. This means it can read algorithms in human research papers, figure out that they apply to a situation, emit code that implements them, and then run that code in a special CPU module. Such an AGI probably also has significantly improved probabilistic estimation, so it is extremely well calibrated and approximates the Bayesian posterior well in most cases, except over intractably large state-spaces. It has dynamic computation capabilities, meaning it can sequentially direct attention to a task and improve with more compute time.

3.) A post-FOOM superintelligence which essentially functions as a computationally bounded AIXI. It knows all extant human knowledge and can rapidly derive all possible deductions from it. It has extremely strong future-prediction abilities and can model human actions and intentions with extremely high accuracy. It still needs some novel data to invent new technologies, such as nanobots, which are far beyond current human knowledge. It can essentially match a whole human civilization in terms of scientific output. Nevertheless, it is still subject to known computational bounds; for instance, it cannot solve NP-hard problems in polynomial time. It is also not omniscient. It may be approximated as performing perfect Bayesian inference over incredibly large state-spaces, but without enough data its posterior will still be highly entropic.

4.) A hypothetical computationally unbounded AIXI which has effectively unlimited computational resources (either because P vs NP is resolved in favour of P = NP, or through some unknown magic). It can perform unbounded search over all possible Turing machines for the one that foils our defenses.
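To make this hierarchy slightly more concrete, here is a minimal sketch of how one might encode it as a data structure. The attribute names and the values I have filled in are just my rough summary of the four levels above, not an established taxonomy:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ThreatModel:
    """One level of the (tentative) AGI threat-model hierarchy described above."""
    level: int
    name: str
    runs_arbitrary_code: bool       # can it synthesize and execute arbitrary programs?
    well_calibrated: bool           # does it approximate Bayesian posteriors well?
    computationally_bounded: bool   # is it subject to known complexity limits?

THREAT_HIERARCHY = [
    ThreatModel(1, "scaled-up current ML", False, False, True),
    ThreatModel(2, "DL plus program synthesis", True, True, True),
    ThreatModel(3, "post-FOOM bounded AIXI-like", True, True, True),
    ThreatModel(4, "unbounded AIXI", True, True, False),
]
```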

If we can figure out a likely hierarchy of threat models, we can estimate which security techniques are likely to hold against which level of AGI capability. This seems like a much more reasoned approach than always assuming the worst.
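As a toy illustration of that kind of reasoning (the mapping below is placeholder data, not a claim about which containment techniques actually work against which level), one could record the highest threat level each security technique is believed to withstand and compare it against the estimated capability level of the AGI being boxed:

```python
# Purely illustrative placeholders: the highest threat level (from the
# hierarchy above) that each hypothetical technique is assumed to withstand.
ASSUMED_GUARANTEES = {
    "air-gapped sandbox": 2,
    "formally verified hypervisor": 3,
}

def expected_to_hold(technique: str, agi_level: int) -> bool:
    """A technique is expected to hold if its assumed level meets or exceeds the AGI's level."""
    return ASSUMED_GUARANTEES.get(technique, 0) >= agi_level

print(expected_to_hold("air-gapped sandbox", 3))  # False under these assumptions
```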