I’ve been reading Bostrom’s book Superintelligence recently, and one potential solution to some of the problems of accidentally unfriendly AI leapt out at me. Bostrom gives some pretty silly examples of how encoding utility functions directly into an AI can fail. The most obvious case is the extreme paperclip maximiser: given the goal of maximising the number of paperclips in the universe, it will, unsurprisingly, try to do just that, first taking over the world and then launching interstellar probes to convert the entire universe into paperclips. This sort of blind maximisation goal is obviously dumb.

A more subtle and interesting point comes from goals which are explicitly bounded, for instance “make one million paperclips”. Here doom could still ensue, since the AI can never be completely sure that it has achieved the goal. There will always be some infinitesimal remaining probability that it has failed, and a superintelligence will be able to devise increasingly elaborate ways to test this and refine its estimate of the probability of failure, as well as increasingly elaborate ways in which it could be mistaken and have failed somehow. Thus a result Bostrom calls “infrastructure profusion” could ensue, in which the AI, instead of turning the universe into paperclips, turns the universe into a series of vast devices for obsessively checking whether it has actually, beyond any possible doubt, made its one million paperclips, plus layers and layers of fortifications to ensure that it is not disturbed or destroyed for the rest of time while it verifies its paperclip count to endlessly higher degrees of certainty. The problem doesn’t even go away if you give it broader goals, such as making between 900,000 and 1,100,000 paperclips, since it can always be uncertain around the margins.

Now, this sounds pretty silly: taking over the entire universe to check, to an unimaginable level of obsessiveness, that it really has one million paperclips. But it follows fairly straightforwardly from any kind of unhedged goal like this. If the only goal is to ensure that it has one million paperclips, and there are no other considerations, then of course any tiny bit of marginal probability towards the correct number of paperclips it can eke out is worthwhile.
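
To make that logic concrete, here is a toy sketch (entirely illustrative, not anything from the book) of what an unhedged goal looks like to a maximiser: the utility depends only on the probability of having hit the target, so any verification scheme that shaves even a sliver off the residual doubt is worth building, because its cost simply never enters the calculation.

```python
# Toy model of an unhedged goal: utility depends only on the estimated
# probability that exactly one million paperclips exist. Nothing else counts.
def expected_utility(p_goal_achieved):
    return 1.0 * p_goal_achieved   # no cost term for resources spent checking

before = expected_utility(0.999999)    # current confidence
after = expected_utility(0.9999999)    # confidence after one more vast checking device
print(after > before)                  # True, so the device gets built
```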

This sort of extreme obsessiveness over a single metric also occurs in machine learning systems where it is called overfitting. Here there is a similar mismatch between the goal we implicitly have for the system and the goal we directly tell it to optimize. For instance, we want to train a system so that it can classify objects from pictures correctly. It doesn’t. It “wants” to maximise whatever score function it has on the dataset you give it to train on. It doesn’t care about generalising to other datasets or whatever. The only thing it wants is to maximise training set accuracy no matter the cost to everything else.
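
A minimal sketch of this mismatch, using scikit-learn (the dataset and model here are arbitrary choices, purely for illustration): an unconstrained model given noisy labels will happily drive its training accuracy to 100% by memorising the noise, while doing considerably worse on data it hasn’t seen.

```python
# Illustrative only: an unconstrained decision tree memorises noisy training
# labels (the objective we literally gave it) at the expense of generalisation
# (the objective we actually wanted).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(200) < 0.2              # 20% label noise
y = np.where(flip, 1 - y, y)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

print("train accuracy:", model.score(X_tr, y_tr))  # ~1.0: the metric it optimises
print("test accuracy:", model.score(X_te, y_te))   # noticeably lower: what we wanted
```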

The way this is dealt with in machine learning is through a family of techniques known as regularisation. These methods typically add extra penalty terms to the utility function of the system to force it to achieve its objectives in ways that we generally find useful. With neural networks a common choice is the L2 penalty (also known as weight decay), which directly penalises the squared size of the weights the network learns. This helps prevent the network from overfitting to specific examples in the dataset, since memorising every example without learning anything more general about the problem typically requires precise combinations of large weights.
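
A minimal sketch of how that penalty enters the objective, assuming a simple logistic-regression setup (the function name and the strength parameter lam below are just placeholders):

```python
# Illustrative L2-regularised loss: the data-fitting term plus lam * sum(w^2),
# so large, precisely tuned weights that memorise individual examples are expensive.
import numpy as np

def l2_regularised_loss(w, X, y, lam=0.1):
    logits = X @ w
    preds = 1.0 / (1.0 + np.exp(-logits))              # logistic predictions
    data_loss = -np.mean(y * np.log(preds + 1e-12)
                         + (1 - y) * np.log(1 - preds + 1e-12))
    penalty = lam * np.sum(w ** 2)                      # the regularisation term
    return data_loss + penalty
```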

A similar approach could be used to make the utility function of a superintelligent AGI slightly safer. For instance, we could penalise the amount of computation the AGI uses, or the amount of energy it uses, or the effect it has (through some objective measure) on the Earth. These costs are added directly to the utility function, so the AGI should respect them rather than try to circumvent them, unlike direct external attempts to slow it down or turn it off, which it would try to circumvent. Another interesting approach could be to directly penalise the speed at which the AGI achieves its goals: for instance, it could be expected to achieve its goals over a timeline of N seconds, with any faster progress very sharply penalised. This could slow the rate of recursive self-improvement, or of general superintelligent action, to a timescale that is potentially auditable by humans or other machine intelligences, helping to ensure safety.
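
As a purely illustrative sketch (every name, weight, and penalty shape below is a made-up placeholder, not a proposal for how these quantities would actually be measured), the idea is that each resource the agent might be tempted to burn appears as a term inside the utility it is maximising, rather than as an external restriction it could route around:

```python
# Toy regularised utility: task reward minus penalties on computation, energy,
# measured impact on the Earth, and on finishing faster than the expected timeline.
def regularised_utility(task_reward, compute_used, energy_used, earth_impact,
                        seconds_elapsed, target_seconds=3600.0,
                        a=1.0, b=1.0, c=1.0, d=10.0):
    # Sharply penalise finishing ahead of the expected schedule.
    speed_penalty = d * max(0.0, target_seconds - seconds_elapsed) ** 2
    return (task_reward
            - a * compute_used
            - b * energy_used
            - c * earth_impact
            - speed_penalty)
```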

Of course, constraining the options available to a superintelligence is always a fraught task, and there could be obvious loopholes in whatever penalty terms we choose. To avoid the penalty on computation, the AGI might transition to a different substrate where it can technically count whatever it does as no longer “computation” – this problem is especially acute if we simply define computation in terms of FLOPS. Similarly, to avoid energy costs the AI could perhaps offload all its thoughts onto some superintelligently invented reversible computer which could, technically, be close to energy-neutral. These sorts of tricks, though, already require immense resources (both energetically and in terms of thought-time to invent all the new technology required), which would still be penalised under its current utility function, so it is not clear that these workarounds would actually be worth pursuing, especially if the superintelligence were also given high temporal discount rates. Of course, given a large enough reward in the future (such as being able to produce a billion trillion paperclips with the entire cosmic endowment, versus the relatively few paperclips coming from administering humanity’s paperclip industry), the AGI might just decide to eat these present costs for the future rewards. This is always a live possibility, but it can be ameliorated to some extent by a.) giving the AI a heavily future-discounted utility function, and b.) giving it small goals to begin with. If it only wants to make a million paperclips, then the upfront cost of repurposing the entire solar system into computronium in order to invent fully reversible computing is unlikely to be worthwhile, given that the energy and matter usage this entails is massively penalised by its utility function and that it can easily produce the required number of paperclips without doing any of that.
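
A toy back-of-the-envelope comparison, with entirely made-up numbers, of how discounting plus a modest goal can tilt the decision: a huge but delayed cosmic payoff, discounted by a factor gamma per time step and charged its (penalised) upfront resource cost, can come out behind simply making the paperclips now.

```python
# Illustrative only: compare a modest immediate plan with a grandiose delayed one
# under a per-step discount factor gamma and a utility-function resource penalty.
def discounted_value(reward, delay, upfront_penalty, gamma=0.9):
    return (gamma ** delay) * reward - upfront_penalty

modest_plan = discounted_value(reward=1e6, delay=1, upfront_penalty=1e3)
cosmic_plan = discounted_value(reward=1e21, delay=500, upfront_penalty=1e12)
print(modest_plan > cosmic_plan)   # True with these (made-up) numbers
```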

One final issue that regularising methods could run into is that of ontology shifts, which Bostrom also discusses in Superintelligence. For instance, suppose the superintelligence comes up with some breathtaking new understanding of the nature of reality, such that our feeble notions of distance, time, energy, and so on become meaningless, and therefore so do the constraints. This is always a possibility when dealing with superintelligence, and it is hard to see how this problem does not fundamentally beset any solution to the control problem. Nevertheless, it’s worth noting that regularised goals, being an integral part of the utility function, might actually be more resilient to this than most approaches: the superintelligent agent must presumably update its utility function to account for the new ontology, and this would include the regularisation terms as well.

Perhaps a final and altruistic constraint we could program into any seed AI would be a distance constraint. This would penalise the AI for any interaction it has with matter, with the penalty growing (possibly exponentially) with that matter’s distance from the Earth. If this penalty rises quickly enough (i.e. exponentially), then even if it kills everyone on Earth, the AI won’t expand and convert the entire universe into paperclips, so at least the aliens get a chance.
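
A sketch of what such a penalty term might look like, with an arbitrary length scale (roughly an astronomical unit here) chosen purely for illustration:

```python
# Toy exponential distance penalty on any interaction with matter.
import math

AU_M = 1.496e11   # one astronomical unit in metres, used as an arbitrary scale

def distance_penalty(distance_from_earth_m, scale_m=AU_M):
    return math.exp(distance_from_earth_m / scale_m)

print(distance_penalty(4.0e8))    # lunar distance: penalty ~1, essentially free
print(distance_penalty(5.9e12))   # Pluto-ish distance: astronomically expensive
```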

Ultimately I think distance constraints like this should be a universally good thing to program into any seed AI in any case. If it turns out to be unfriendly for some accidental reason, it should still obey the distance constraint. Moreover, the constraint should be stable: since it is part of the AI’s intrinsic utility function, and thus a final goal, it should not seek to remove or circumvent it. If AGI comes late, it might even be possible to execute a “controlled FOOM” on some distant asteroid, with Earth at least moderately protected by the constraint. Even if the FOOM happens on Earth, with enough advance warning and global coordination it might be possible to send a small “backup party” of humanity to Pluto or someplace, so that if things do go south, at least some remnant of humanity might survive.

Regularisation methods have the advantage that they “fail safe”, in that an AI that is too heavily regularised will simply do nothing. Moreover, as a core part of the agent’s utility function, such regularisation terms should be relatively robust (as much as the rest of its utility function), and the agent should not try to route around them directly through mindhacking, although it may well try to route around them in the physical world. They do make agents strictly less effective than they would be without them (that is their entire purpose), so in a race scenario or a multipolar world they are a huge liability and would likely be skipped. This sort of reasoning is why a multipolar world is fundamentally much less safe than a singleton. For a fast-takeoff singleton we only need to solve the control problem once. In a multipolar world the control problem needs to be solved constantly and forever, or else the multipolar world will collapse back into a singleton with the most competitive utility function imaginable, which will by definition be unfriendly.

Regularised utility functions are definitely not a panacea. They can and will be routed around, and they should be used in combination with other methods. But it is possible that they may provide an additional margin of safety by constraining the rate and pace of AGI and its use of resources, or even perhaps, if distance constraints could be correctly specified, enable controlled and at least somewhat “safe” experimentation with FOOMs, by simply starting the AI in a distant region and knowing it won’t expand to cover the entire universe.