One of the key questions in AI safety is that of value-learning: how to endow a potentially superintelligent AGI with a notion of human values sufficient and robust enough that its behaviour remains safe (or is considered safe by the vast majority of humanity) when scaled up and executed by a superintelligent being. Current discourse around the safety of value-learning is often concerned with false extrapolations of human values to extreme scales, which tend to lead to extremely undesirable outcomes (by our values). A standard Yudkowsky-esque argument is that the AI somehow learns that ‘smiles make humans happy’ and then proceeds to disassemble all the matter in the universe (including humans) to make tiny nano-sized smiles, thus maximizing this misspecified value function. While a somewhat absurd example, this gets to the heart of the worries about value-learning gone wrong. Notice that a key assumption here is that the AI picks out a single objective function to optimize, so that if this one objective function is flawed in any way then the results will be disastrous.

We argue that a straightforward way to potentially ameliorate this is to take a Bayesian approach to the learnt value function. Instead of learning a single value function, we learn a posterior distribution over potential value functions. The AI then acts to optimize some function of the value function (typically to maximize it), but crucially averaged over all possible value functions weighted by their posterior probability. Mathematically, instead of maximizing \(V^*(\mathrm{world})\) for a single learnt value function \(V^*\), the AI maximizes \(\mathbb{E}_{p(V \mid o)}\left[V(\mathrm{world})\right]\), where \(o\) denotes the observations (of human behaviour and preferences) from which the posterior is inferred. This approach has several advantages over learning a single value function. For one, it is intrinsically more robust to the sort of value-function misspecification described above, since the AI must maximize the average value over all value functions in its posterior. In the smile example, when extrapolated to an extreme, the AI’s value posterior is also likely to include value functions for which a universe converted entirely into nanosmiles is a very undesirable state of affairs, leading the AI instead to maximize a kind of middle ground between value functions, which is likely to be safer.
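To make this concrete, here is a minimal sketch, assuming the posterior \(p(V \mid o)\) can be represented as a small set of sampled value functions with weights, and that the AI chooses among a handful of candidate world states. All of the names, value functions, and weights below are invented for illustration; this is not a proposal for how the posterior would actually be learnt.

```python
# Toy comparison of point-estimate vs. posterior-averaged value optimisation.
# All candidate worlds, value functions, and weights are illustrative assumptions.

worlds = ["status_quo", "more_smiling_humans", "universe_of_nanosmiles"]

# Samples from the posterior p(V | o), each mapping a world to a value,
# together with their assumed posterior weights.
value_samples = [
    {"status_quo": 0.0, "more_smiling_humans": 1.0, "universe_of_nanosmiles": 10.0},   # 'smiles' interpretation
    {"status_quo": 0.0, "more_smiling_humans": 1.0, "universe_of_nanosmiles": -10.0},  # 'humans matter' interpretation
    {"status_quo": 0.0, "more_smiling_humans": 0.5, "universe_of_nanosmiles": -8.0},   # another plausible reading
]
posterior_weights = [0.2, 0.5, 0.3]

def best_world_point_estimate(v):
    """argmax over worlds of a single learnt value function V(world)."""
    return max(worlds, key=lambda w: v[w])

def best_world_posterior(samples, weights):
    """argmax over worlds of E_{p(V|o)}[V(world)], estimated from posterior samples."""
    expected = {w: sum(p * v[w] for p, v in zip(weights, samples)) for w in worlds}
    return max(worlds, key=expected.get)

# A single misspecified value function happily tiles the universe with nanosmiles...
print(best_world_point_estimate(value_samples[0]))             # -> universe_of_nanosmiles
# ...while the posterior-averaged objective settles on the middle ground.
print(best_world_posterior(value_samples, posterior_weights))  # -> more_smiling_humans
```

The point of the sketch is only the contrast between the two objectives: the single misspecified value function drives the agent to the extreme, while averaging over the posterior lets the other plausible value functions veto it.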

I don’t have a full mathematical proof of this, and I am not sure one is possible in the general case, since it will depend heavily on the actual learning mechanism the AI uses to infer the value posterior. There are various potential pathological cases here. Obviously, the AI could learn a posterior in which the majority of value functions, weighted by probability, behave badly (for us) under scaling. This might simply be a natural property of value functions learnable from human behaviour: that they are not amenable to scaling. However, for unsafe behaviour to develop, a significant majority of the value functions thus learnt would need to tend in the same direction, i.e. not be orthogonal. For instance, suppose one value function holds that the universe should be covered with smiles, and another that it should be covered with laughs. In theory, if these value functions are orthogonal they may cancel each other out and result in safety, but if they are not, they will not. This is a serious possibility and deserves empirical and theoretical study.

A second possibility is a Pascal’s-mugging problem of low-probability but extremely beneficial events. That is, suppose the smile value function is assigned some small but finite posterior probability. If the AI manages to actually convert the entire universe into smiles, this could produce an outcome so extremely beneficial by that value function’s standards that the mean outcome is still very positive, even if every other value function in the distribution assigns negative value to it. (Similarly, a negative version could occur, in which an outcome that is likely and desirable under most value functions is extremely negatively valenced under one, and this extreme negative valence outweighs the rest of the distribution.)

This is a genuine concern, although there are several obvious methods for reducing the likelihood of such issues. The first is simply to use a different aggregation function instead of the mean, such as the median, which immediately yields behaviour that is less sensitive to extreme outliers. Other methods include normalizing the value functions (so that each has only a set number of ‘points’ to spread around, and no single one is disproportionately valuable and so becomes a ‘utility monster’), and regularising the posterior over value functions through strong structural priors. This latter method I believe has a lot of potential for practical AI safety. In effect, many of the potential issues arise from errors of extrapolation: the AI learns some value function over a regular space of human situations (somehow) and then extrapolates it to scales far beyond standard human ranges. Away from the data, the AI’s learnt value function may diverge almost arbitrarily from the “true” extrapolation (if one even exists). This is a standard problem in supervised learning and can be addressed with standard methods. If we place strong priors of high uncertainty on extrapolations far from the learnt data, together with a strong prior that high uncertainty means do nothing, then this may enhance safety considerably. Of course, safety can likely not be 100% guaranteed using these methods, but it can be substantially improved, and that is ultimately what matters.
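To illustrate the aggregation and normalization points, here is a second toy sketch in the same style as before. It assumes one posterior sample is the misspecified ‘smiles’ function assigning an enormous value to the nanosmile outcome (the Pascal’s-mugging case); the raw mean is mugged by that outlier, while the median, or a mean over range-normalized value functions, is not. The numbers are invented and nothing here models the structural-prior or do-nothing-under-uncertainty ideas, which are harder to toy-model.

```python
# Toy illustration of two mitigations: median aggregation instead of the mean,
# and normalising each value function so no single sample acts as a 'utility monster'.
# All worlds and values are invented for illustration.

from statistics import mean, median

worlds = ["status_quo", "more_smiling_humans", "universe_of_nanosmiles"]

# Posterior samples over value functions. The first is the misspecified 'smiles'
# function, which assigns an enormous value to the nanosmile outcome.
value_samples = [
    {"status_quo": 0.0, "more_smiling_humans": 1.0, "universe_of_nanosmiles": 1e6},
    {"status_quo": 0.0, "more_smiling_humans": 1.0, "universe_of_nanosmiles": -10.0},
    {"status_quo": 0.0, "more_smiling_humans": 0.5, "universe_of_nanosmiles": -8.0},
]

def normalise(v):
    """Rescale a value function to [0, 1] over the candidate worlds, so each
    posterior sample has the same number of 'points' to spread around."""
    lo, hi = min(v.values()), max(v.values())
    return {w: (x - lo) / (hi - lo) for w, x in v.items()}

def choose(samples, aggregate):
    """Pick the world with the highest aggregated value across posterior samples."""
    return max(worlds, key=lambda w: aggregate([v[w] for v in samples]))

normalised_samples = [normalise(v) for v in value_samples]

print(choose(value_samples, mean))        # -> universe_of_nanosmiles (mugged by the outlier)
print(choose(value_samples, median))      # -> more_smiling_humans
print(choose(normalised_samples, mean))   # -> more_smiling_humans
```

Both tricks blunt the mugging in this toy case for the same underlying reason: they bound how much influence any single sampled value function can exert on the aggregate decision.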