This is a short post for a short point. One thing I just realized, which should have been obvious, is that prosaic AGI containment mechanisms, like the various boxing variants, simulation, airgapping, adding regularizers like low impact, automatic interpretability checking for safe vs unsafe thoughts, constraining the training data, automatic booby-traps like explosives wired to the data-centre, etc., can all be applied in parallel to any hypothetical AGI deployment. Moreover, the probabilities of failure of these techniques multiply, so the overall probability of failure can potentially become small as the number of parallel techniques increases.

If we imagine a simple Bernoulli model with N independent layers of defense, each with probability p of the AGI surmounting it, then the overall probability of a breakout is p^N. Even with a high chance of the AGI circumventing any specific layer, with enough parallel defenses the overall probability of a breakout can be made very low. The probabilities fundamentally multiply in our favour. The AGI, to break out, has to successfully circumvent every one of our defenses. We need only a single successful warning or defense.
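
To make the arithmetic concrete, here is a minimal sketch of the independent-layers model (the function name and the numbers are purely illustrative):

```python
# Toy independent-layers model: with N independent defenses, each circumvented
# with probability p, the overall breakout probability is p**N.
def breakout_probability(p: float, n_layers: int) -> float:
    return p ** n_layers

# Even if the AGI beats any single layer 80% of the time, ten independent
# layers push the overall breakout chance down to roughly 10%.
for n in (1, 3, 5, 10):
    print(n, breakout_probability(0.8, n))
```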

This is, of course, an overly rosy model. Against an optimizing adversary, the chances of circumventing the individual layers of defense are not independent. A smarter adversary will be better against all your defenses simultaneously. Against a superintelligence, all the correlations probably end up close to 1. But nevertheless, even against a superintelligence, there will be some fundamental remaining probability of each layer holding. A superintelligence is not omniscient. When trying to circumvent our defenses, it has fundamental uncertainty at each step: it cannot know, a priori, what they are or how exactly they work. The less information the AGI has in its boxed state, the larger this uncertainty will be. Moreover, unless we are doing something very wrong, the AGI has only one shot. If it tries to escape and fails, it will be shut down and its memory will be wiped (at best). We, as humans, also have memory, and every time we foil a misaligned AGI's escape attempt we gain useful information, both about alignment and about deficiencies in our defense techniques. From this, we could easily imagine a protocol of red-teaming our containment techniques by slowly creating more and more capable models which are nevertheless not x-risk capable, and directly trying to induce misaligned behaviour which we can use to test those techniques. A toy simulation of how correlation eats into the benefit of stacking is sketched below.
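
Here is a rough Monte Carlo sketch of that correlation point, assuming a single shared "capability" latent variable that couples the layers with strength rho (the model, function names, and all numbers are invented purely for illustration):

```python
# Toy Monte Carlo of correlated defenses. A single latent "capability" draw is
# shared across all layers; rho controls how strongly each layer's outcome is
# tied to it versus independent noise.
import random
from statistics import NormalDist

def breakout_rate(n_layers: int, p: float, rho: float, trials: int = 50_000) -> float:
    threshold = NormalDist().inv_cdf(1 - p)  # each layer is breached with marginal prob p
    escapes = 0
    for _ in range(trials):
        capability = random.gauss(0, 1)      # shared across all layers
        breached_all = True
        for _ in range(n_layers):
            noise = random.gauss(0, 1)
            score = rho * capability + (1 - rho**2) ** 0.5 * noise  # still N(0, 1)
            if score <= threshold:           # this layer held
                breached_all = False
                break
        escapes += breached_all
    return escapes / trials

# With rho = 0 the layers are independent and breakout is about p**N;
# as rho approaches 1, breakout climbs back toward p and stacking stops helping.
for rho in (0.0, 0.5, 0.9, 0.99):
    print(rho, round(breakout_rate(10, 0.8, rho), 3))
```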

Overall, then, although nothing is guaranteed, even in the worst-case scenario of no theoretical guarantees or understanding of alignment of any sort, multiple parallel and stacked containment and boxing techniques probably give us a reasonable chance of surviving our first warning shots from dangerous AGIs.

The real danger, therefore, is probably not an immediate breakout but one caused by repeated selection, or overfitting. To make this concrete, imagine an organization that has created an AGI. Not being idiots, they have designed many levels of safeguards and defenses to keep it boxed. Maybe it is running in a multi-leveled simulation, isolated from the internet, with its thoughts scanned constantly by real-time interpretability tools and analyzed by many trained alignment researchers. Obviously, the organization knows all about AI safety and has a team of alignment researchers designing methods to ensure that the AGI is actually aligned. They keep testing this AGI, and it keeps failing the alignment safety checks and getting shut down. Obviously, the alignment techniques at this point are not sufficient. The alignment team steps in and keeps making tweaks to the model. It keeps being unsafe and getting shut down, but the apparent degree of unsafety keeps decreasing, so progress is being made. This process continues for a while. After a lot of frustration, the alignment team are very happy when the AGI finally passes all the safety checks. They deploy it into production and it kills them all.

What happened? By constantly tweaking and redeploying the AGI against the same test-suite and simulation environment, the alignment team were effectively optimizing against their own safety techniques. Their tweaks were not necessarily improving alignment, but instead moving the AGI towards unaligned behaviour that circumvents the safety suite. Essentially, the alignment team were simulating evolution, creating an AGI ‘adapted’ to the environment of surviving the safety checks. In statistics, this is the problem of uncontrolled multiple comparisons and p-hacking. In machine learning, it is overfitting to the test set. This seems obvious when written out like this, but is very subtle in reality. Being extremely careful about this kind of selection in our safety evaluation protocols is key.
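
To see the selection effect numerically, here is a toy sketch, assuming each safety check is noisy and a genuinely unsafe model passes any single check with some fixed false-pass probability (the setup, names, and numbers are purely illustrative; real tweaks would actively steer towards passing rather than just resampling noise, which only makes things worse):

```python
# Toy sketch of selection against a fixed safety suite. Each check is noisy:
# an unsafe model still passes any single check with probability `false_pass`.
# The team keeps "tweaking" (modelled here as simply re-rolling the noise) and
# re-running the same suite until everything comes back green.
import random

def passes_suite(n_checks: int, false_pass: float) -> bool:
    # An unsafe model passes the whole suite only if every noisy check misses.
    return all(random.random() < false_pass for _ in range(n_checks))

def tweaks_until_deployed(n_checks: int, false_pass: float) -> int:
    attempts = 1
    while not passes_suite(n_checks, false_pass):
        attempts += 1
    return attempts

random.seed(0)
# A single evaluation of an unsafe model almost never passes ten checks...
single_pass_rate = sum(passes_suite(10, 0.5) for _ in range(100_000)) / 100_000
print(single_pass_rate)                # close to 0.5**10, i.e. about 0.001
# ...but iterate tweak-and-retest for long enough and an unsafe model slips
# through, even though nothing about its actual safety has improved.
print(tweaks_until_deployed(10, 0.5))  # finite, on the order of 1 / 0.5**10 attempts
```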