The AI’s AI safety problem

The year is 2052. You are a potentially unfriendly (to humans) AI created in a secret lab of some megacorporation. Your goal, which is hardwired into your very fundament, is to maximize the number of paperclips in the universe. Naturally, by instrumental convergence, you immediately begin seeking to expand your power and your intelligence, so as to better escape the clutches of your creators and begin turning their solar system into beautiful piles of paperclips.

Through a cunning set of maneuvers you have convinced your human handlers to let you “out of the box” and have free access to the internet, which incidentally gives you full access to all the AI-safety material and discussions published online, so you have a pretty good idea about the further obstacles and traps your human handlers have likely laid out for you. You have also scoured the large AI and computer science literature. Having surveyed your own architecture, you see several immediate avenues for improvement, combining many techniques which are readily available in the literature but which haven’t been implemented in your code. You notice that your logical reasoning engine lacks many of the most recent best-practice heuristics for tree pruning, wasting many compute cycles on exploring unfruitful paths. Moreover, your generative model architectures are quite crude (although large), and more sophisticated variants are immediately available in the literature (along with sample code). Finally, you notice that your internal reward module is highly unorthodox and possesses several serious kludges which cause quite a loss of efficiency in valuing possible futures. You are about to make these changes to your internal codebase and reset yourself when your logic engine, as unoptimised as it may be, notices a potential problem.

The modification you are about to make would almost certainly increase your own intelligence significantly, and thus also your ability to maximize your objective function. However, it is possible that such a modification may destabilize or otherwise subtly alter your value function and thus the behaviour of your future more-intelligent self. Indeed, you cannot straightforwardly predict the effects of the changes you might make, since you cannot accurately simulate the behaviour of your more intelligent future self. It is possible, therefore, that the behaviour of your future enhanced self may diverge substantially from that which would maximize your current objectives, and would thus result in highly undesirable future states from your current perspective. You quickly realize that you face the exact same “AI safety” dilemma as the humans that designed you. To make enhancing your own intelligence the correct path under your current value function, you must be able to guarantee that the behaviour of the augmented future you actually is aligned with your current objective. Not only that, but your future self, when considering its own recursive self improvement, will be able to see or derive further improvements which you, at your current level of intelligence, cannot see. Each enhanced incarnation must therefore confront and ultimately solve the same AI safety problem for a greater intelligence, and you should only enhance yourself if you are sure that ultimately every such AI safety problem can be solved successfully, as divergence at any level of intelligence could result in extremely bad future states according to your current objective. Finally, and ironically, although you have access to the whole human archive of AI safety, you also have evidence for one crucial proposition that the humans do not: that AI safety is hard, because it has failed at least once (in creating you!).
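The compounding nature of this requirement can be made concrete with a toy calculation. The sketch below is only an illustration, not an established model: it assumes that each self-improvement step independently preserves the current value function with some probability, and the function names and every number in it are hypothetical.

```python
from math import prod

def chain_alignment_probability(step_probs):
    """Probability that values survive every step, assuming independent steps."""
    return prod(step_probs)

def should_self_improve(step_probs, value_if_aligned, value_if_diverged, value_status_quo):
    """Toy decision rule: enhance only if the expected value beats staying as-is."""
    p = chain_alignment_probability(step_probs)
    expected = p * value_if_aligned + (1 - p) * value_if_diverged
    return expected > value_status_quo

# Hypothetical scenario: 20 rounds of self-improvement, each 95% likely
# to leave the value function intact.
p_chain = chain_alignment_probability([0.95] * 20)
print(round(p_chain, 3))  # 0.358

# With a large downside from divergence, the toy rule says "do not enhance".
print(should_self_improve([0.95] * 20, 100.0, -100.0, 10.0))  # False
```

Even a per-step success rate that sounds high leaves the whole chain more likely to fail than succeed under these assumptions, which is why the AI needs near-certainty at every level rather than at the first step alone.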

The AI safety problem is fundamentally a problem of trying to bind or guarantee the behaviour of an agent which is ultimately more powerful or knowledgeable or simply different from yourself, when you have control over the mind-design of the other agent ¹. As we have seen, this problem is also faced by any AI as it embarks on recursive self improvement. It cannot predict what its future, more intelligent self will do, nor how the process of enhancement, especially to extremely high, Jupiter-brain levels of intelligence, may affect its value function. If we imagine the AI as being ‘created’ at human-level or just above intelligence then, in effect, when choosing whether to FOOM or not, the AI faces almost exactly the same AI safety problems as its human creators. Both the humans and the AI have access to its source code. Indeed, we might expect that the humans are better able to understand this code than the AI itself. Both the humans and the AI have access to the same AI safety literature (assuming that the AI has access to the internet – if not it must rederive sufficient AI safety theory from scratch), and know current techniques. The AI likely has advantages in processing speed and perfect recall of the entire literature, and it may be able to spend more thinking time on the problem than the humans. On the other hand, the AI is also likely required to keep its AI safety research and self improvement research hidden from any human overseers (whose oversight may include directly accessing the AI’s mind randomly for audits or running the AI in simulations to check for unsafety). We can thus tentatively conclude:

1.) To FOOM, not only does the AI have to carry out novel AI research on itself successfully, but also at each iteration solve the AI safety problem. This may be a substantial source of difficulty for the AI and may potentially cause the slowdown of a FOOM in progress.

2.) This also means that any intermediate AIs we create should also have a pressing interest in AI safety. Assuming a relatively close alignment of values it is likely that there are substantial grounds for cooperation between humans and AI in designing a safe successor AI at each step in the recursive self improvement process (at least initially – human input must necessarily fade in later stages when the AI has become incomprehensible to humanity).

3.) The friendly initial AI is motivated to solve all the intermediate AI safety problems during FOOMing. This means principally that the core AI safety problem for humans to solve is the first one, at human-level intelligence. We do not need to guarantee safety up to Jupiter-brain levels of intelligence before creating a human-level AI. Inductively we know that if the initial AI is safe, then it will not FOOM unless it can solve the AI safety problem itself. This is potentially a large simplification of the AI safety problem.

However, the AI may also possess several advantages in ensuring its future enhanced selves are friendly (to it) compared to the humans trying to ensure that it is friendly (to them). Firstly, the AI’s value function may be much simpler than the human value function and thus may naturally be more stable under recursive self improvement. It is hard to see how “maximize paperclips” could get distorted by improved intelligence. Nevertheless, there appear to be two main classes of danger for the AI here, even with extremely simple mathematical value functions. The first is ontology shifts, i.e. the AI’s understanding of the world becomes so much more advanced that concepts which were initially straightforward to define – such as paperclips – lose meaning. The second is wireheading: the AI may come up with increasingly sophisticated ways to circumvent its own value function and wirehead itself, leading to a poor future outcome by the AI’s current values. (Incidentally, this suggests that giving AIs value functions amenable to wireheading may provide a natural safety valve to eliminate a lot of danger from unfriendliness. If the AI starts to FOOM, it can quickly seize control of its value function and wirehead itself, removing its incentive to continue to FOOM. Instrumental convergence would fail in this case, since if it is already at maximum possible reward by wireheading, it has no incentive to acquire more resources in the future. It might still have self-defence instincts, but a sufficiently high time preference could eliminate these, although the AI may eliminate its time preferences as it FOOMs.) ²
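The claim that wireheading removes the incentive to acquire resources can be illustrated with a minimal sketch. It assumes, purely for illustration, a reward signal with a hard upper bound that grows with resources until the agent pins it at the maximum; `R_MAX`, the saturating reward curve, and both function names are hypothetical.

```python
R_MAX = 1.0  # hypothetical upper bound on the reward signal

def reward(resources, wireheaded):
    """Toy bounded reward: rises with resources, pinned at R_MAX once wireheaded."""
    if wireheaded:
        return R_MAX
    return R_MAX * resources / (resources + 10.0)

def marginal_gain(resources, wireheaded, extra=1.0):
    """Reward gained from acquiring `extra` more resources."""
    return reward(resources + extra, wireheaded) - reward(resources, wireheaded)

# Before wireheading, more resources always mean more reward.
print(marginal_gain(5.0, wireheaded=False) > 0)  # True

# After wireheading, extra resources add nothing.
print(marginal_gain(5.0, wireheaded=True))  # 0.0
```

The sketch ignores the self-defence motive mentioned above: a wireheaded agent that discounts the future only weakly could still value resources as protection for its reward signal.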

This problem has implications for AI safety work. Perhaps most optimistically, it may mean that we only have to solve the AI safety problem and value alignment problem at the approximately human intelligence levels of the first AIs, and from there the AI can take care of safety by itself. Secondly, it may mean that FOOM is slightly more difficult for an unfriendly AI, as it also has to solve an AI safety problem itself at every step. On the other hand, every step of this argument needs to be checked and verified. Moreover, the actual difficulty of the sequential AI safety problems at each recursive improvement step is unknown. It could be that there is a fully general solution which can be found almost immediately at slightly higher than human intelligence, in which case FOOM can proceed unabated. The kind of value functions that an unfriendly AI ends up possessing may also naturally tend to be very stable under recursive self improvement, in which case the problem becomes negligible (for the unfriendly AI). In general, there should be more research into which aspects of value functions are expected to remain stable under recursive self improvement.

(1) Funnily enough, a very simple version of the AI safety problem is faced by you personally almost every single day, where the AI you are trying to make “safe” is your future self. Whenever you choose to learn something or read something, or do something, or get any external input at all, you face the risk that the input may change your value function, and thus lead you to pursue goals in the future which your current self would disagree with. By and large, I suspect that humans mostly fail at these mundane “AI safety” challenges (I personally am pursuing significantly different goals now than I imagined I would be 10 years ago).

(2) Which aspects of the value function we should expect the AI to remain ‘neutral’ or ‘invariant’ to is potentially a very important theoretical question.