AGI will have learnt reward models.

There has been a lot of debate and discussion recently in the AI safety community about whether AGI will likely optimize for fixed goals, that is, whether it will be a wrapper mind. The term wrapper mind is largely a restatement of the old idea of a utility maximizer, with AIXI as a canonical example. The fundamental idea is that there is an agent with some fixed utility function which it maximizes without any kind of feedback which can change its utility function. Rather, the optimization process is assumed to be ‘wrapped around’ some core and unchanging utility function. The capabilities core of the agent is also totally modular and disjoint from the utility function such that arbitrary planners and utility functions can be composed so long as they have the right I/O interfaces ¹. The core ‘code’ of an AIXI-like agent is incredibly simple and, for instance, could be implemented in this Python pseudocode:

A note on terminology. Throughout this post we equivocate between reward and utility functions, since they are in practice the same: for any finite utility function (i.e., a preference ordering over states) you can design a reward function that represents it, and vice versa. There are potentially some subtle differences in the infinite horizon case but we ignore them. When we say value function we mean the long-term expected reward for a given state, as in standard RL ².

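class WrapperMind():
    ...

    def action_perception_loop(self):
        while True:
            # Sense the world and fold the observation into the state estimate.
            observation = self.sensors.get_observation()
            state = self.update_state(self.current_state, observation)
            self.current_state = state
            # Enumerate candidate plans and simulate the future under each one.
            all_action_plans = self.generate_action_plans(state)
            self.action_plans = all_action_plans  # kept so a chosen index can be mapped back to a plan
            all_trajectories = self.world_model.generate_all_trajectories(all_action_plans, state)
            # Score the simulated futures and act on the best plan.
            optimal_plan, optimal_utility = self.evaluate_trajectories(all_trajectories)
            self.execute(optimal_plan)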

There are a few central elements to this architecture which must be included in any AIXI-like agent. The AGI needs some sensorimotor equipment to both sense the world and execute its action plans. It needs a Bayesian filtering component to update its representation of the world state given new observations and its current state. It needs a planner that can generate sets of action plans, and a world model that can generate ‘rollouts’, which are simulations of likely futures given an action plan. Finally, it needs a utility function that can calculate the utility of different simulated trajectories into the future, so that the best one can be picked. Let’s zoom in on this component a little more and see how the evaluate_trajectories function might look inside. It might look like this:

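def evaluate_trajectories(self, all_trajectories):
    # Score every simulated trajectory with the built-in utility function.
    all_utilities = []
    for trajectory in all_trajectories:
        utility = self.utility_function(trajectory)
        all_utilities.append(utility)
    # Pick the highest-utility trajectory; trajectories are assumed to be in
    # the same order as the action plans that generated them.
    optimal_index = max(range(len(all_utilities)), key=all_utilities.__getitem__)
    optimal_utility = all_utilities[optimal_index]
    optimal_plan = self.action_plans[optimal_index]
    return optimal_plan, optimal_utility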

Essentially, the AIXI agent just takes all trajectories and ranks them according to its utility function and then picks the best one to execute. The fundamental problem with such an architecture, which is severely underappreciated, is that it implicitly assumes a utility oracle. That is, there exists some function self.utility_function() which is built into the agent from the beginning which can assign a consistent utility value to arbitrary world-states.

While such an oracle is conceptually simple, my argument is that actually designing one and building it into an agent is incredibly difficult or impossible once the goals are sufficiently complex and the environment is sufficiently complex. This includes almost all goals humans are likely to want to program an AGI with. This means that in practice we cannot construct AIXI-like agents that optimize for arbitrary goals in the real world, and that any agent we do build must utilize some kind of learned utility model. Specifically, this is a utility (or reward) function $u_\theta(x)$ where $\theta$ is some set of parameters and $x$ is some kind of state, where the utility function is learned by some learning process (typically supervised learning) against a dataset of (state, utility) pairs that are provided either by the environment or by human designers. What this means is that, unlike a wrapper mind, the agent’s utility function can be influenced by its own experiences – for good or for ill.
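To make the idea of a learned utility model concrete, here is a minimal sketch. It assumes a linear $u_\theta(x)$ over a two-number state and plain stochastic gradient descent on squared error; the class name `LearnedUtility`, its `fit` method, and the toy dataset are all hypothetical and chosen only for illustration. A real system would presumably use a deep network rather than a linear model.

```python
import random


class LearnedUtility:
    """Linear utility model u_theta(x) = theta . x + bias (an illustrative assumption)."""

    def __init__(self, n_features):
        self.theta = [0.0] * n_features
        self.bias = 0.0

    def __call__(self, x):
        return sum(t * xi for t, xi in zip(self.theta, x)) + self.bias

    def fit(self, dataset, lr=0.05, epochs=500, seed=0):
        """Supervised learning on (state, utility) pairs via SGD on squared error."""
        rng = random.Random(seed)
        data = list(dataset)
        for _ in range(epochs):
            rng.shuffle(data)
            for x, target in data:
                error = self(x) - target
                self.theta = [t - lr * error * xi for t, xi in zip(self.theta, x)]
                self.bias -= lr * error


# Toy 'utility data': state = (paperclips, staples), labelled utility = paperclips.
utility_data = [((p, s), float(p)) for p in range(4) for s in range(4)]

u = LearnedUtility(n_features=2)
u.fit(utility_data)
print(round(u((2, 3)), 2))  # should be close to 2.0 on in-distribution states
```

The point of the sketch is only that $\theta$ is determined by the data the agent is given, so different utility data yields a different utility function.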

This seems like a strong claim, and I cannot definitively prove this impossibility result, but I will try to give a sense of the difficulty by the thought experiment of sitting down and trying to design a utility oracle for a paperclipper.

A paperclipper is a straightforward wrapper mind. We can describe its utility function in words: we want to maximize the number of paperclips present in the lightcone in the far future. First, let’s consider the type signature such a function must have. We know its codomain: a scalar value representing the number of paperclips in the lightcone (assuming a linear paperclip utility). However, what is the domain of this function? There are essentially two choices for the domain.

First, we could have the utility function take in the agent’s observations of the world. I.e. if the agent is ‘seeing’ a lightcone full of paperclips then this is good, otherwise bad. Alternatively, we could build a utility function which takes as inputs the agent’s estimation of its own world-state – i.e. the internal state variable the agent maintains.

Let’s take the case of the utility function taking in the agent’s observations of the world. There are several issues with this. The first is that, even in the best case, it requires us, the utility function designers, to know what a lightcone of paperclips looks like, which we probably don’t. The second is the issue of partial observability. This agent can only calculate the utility of things it can observe, which means that it effectively lacks object permanence. If it looks at some paperclips one moment, and then somewhere else the next, the paperclips may as well have disappeared. Essentially this paperclipper would just spend all its time staring at paperclips and would be penalized for looking away to do other things (such as make more paperclips!). A final, slightly more subtle, issue that can arise here is ‘perceptual shift’. This utility function assumes essentially a non-embedded agent with a fixed perceptual interface with the world, but this is not true for embedded agents. They might add new sensory modalities, or improve their existing sensors, or some of their sensory equipment might go offline for maintenance – and this would totally break their utility function!

If direct mappings from observations are challenging, we have the second option: to build a utility function which takes as inputs the agent’s estimation of its own world-state – i.e. the internal state variable the agent maintains. This solves some of the problems with the observational case. For one, the agent now has object permanence in that its state estimation can include the number of total paperclips, even if it is not looking directly at them. Using the internal state also reduces the dependency upon the agent’s perceptual interface which can shift arbitrarily, as long as it also learns a correct mapping from its new sensations to its internal state.
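The difference between the two domains can be sketched in a few lines. This is a toy illustration under assumed names (`utility_from_observation`, `utility_from_state`, `update_state`) and a deliberately crude state representation, not a proposal for how a real agent would track paperclips.

```python
def utility_from_observation(observation):
    """Observation-domain utility: counts only the paperclips currently in view."""
    return sum(1 for item in observation if item == "paperclip")


def utility_from_state(state):
    """State-domain utility: reads the agent's running estimate of all paperclips."""
    return state["estimated_paperclips"]


def update_state(state, observation, location):
    """Toy filter: remember the paperclip count last seen at each location."""
    seen = dict(state["seen_by_location"])
    seen[location] = utility_from_observation(observation)
    return {"seen_by_location": seen, "estimated_paperclips": sum(seen.values())}


state = {"seen_by_location": {}, "estimated_paperclips": 0}

# The agent looks at the desk (3 paperclips), then turns to the workshop (none).
desk_view = ["paperclip", "paperclip", "paperclip", "stapler"]
workshop_view = ["wire", "pliers"]

state = update_state(state, desk_view, "desk")
state = update_state(state, workshop_view, "workshop")

print(utility_from_observation(workshop_view))  # 0: the paperclips 'vanished'
print(utility_from_state(state))                # 3: the state still remembers them
```

The observation-domain utility drops to zero the moment the agent looks away, while the state-domain utility persists because the count lives in the agent's own estimate.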

Again, however, there are two major issues with this approach. Firstly, it is not robust to ontological shifts in the agent’s own internal state, which may occur during online learning. If the agent learns some new information or develops a new paradigm which radically changes its internal state representations, then the fixed utility function won’t be able to follow this transformation and will get mangled. Secondly, a much bigger issue is that even if we assume that the paperclipper eventually converges onto some fixed internal ontology, having a fixed utility function hardcoded in at the beginning would require us to understand the internal representations of a future superintelligence ³. I would argue that this is also probably impossible to get right in all the details ahead of time.
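A toy version of the ontological-shift problem, assuming the internal state is a plain dictionary and that the hardcoded utility reads a single key from it. The key names and the `translate` mapping are hypothetical; they stand in for whatever representation a real agent would learn.

```python
def fixed_utility(internal_state):
    """Hardcoded at design time against the original ontology (a flat count)."""
    return internal_state.get("paperclips", 0)


# Original ontology: the world model tracks one number.
old_state = {"paperclips": 10}

# After online learning the agent re-represents the same world differently,
# e.g. it now tracks steel mass and clip geometry instead of a count.
new_state = {"steel_grams_in_clip_shape": 5.0, "grams_per_clip": 0.5}


def translate(new_internal_state):
    """The mapping the fixed utility would need but was never given."""
    s = new_internal_state
    return {"paperclips": int(s["steel_grams_in_clip_shape"] / s["grams_per_clip"])}


print(fixed_utility(old_state))             # 10
print(fixed_utility(new_state))             # 0: same world, utility silently mangled
print(fixed_utility(translate(new_state)))  # 10, but only with a mapping nobody hardcoded
```

The world is the same in both cases; only the representation changed. The fixed function has no way to follow that change unless someone supplies (or the agent learns) the translation.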

Due to these barriers, it seems likely that we cannot successfully build AIXI-like wrapper minds with a fixed, hardcoded utility function that optimizes for arbitrary goals ⁴ without drift. It is of course possible to try to build wrapper minds which end up with badly specified and inconsistent utility functions due to the above, and these could end up taking over the lightcone. Whether they would strongly depends upon the extent to which having a mangled utility function harms capabilities/power-seeking in practice. This in turn likely depends substantially on the degree of myopia or discounting built into the reward function, since without discounting the general power-seeking attractor becomes extremely large.

Due to these difficulties, we are likely to build agents which are not wrapper minds, but which instead have learnt utility functions. That is, they take some ‘utility data’ of observations/examples and associated utility values and then try to learn a generalizable utility function from this. Broadly, it seems likely that such a learnt utility function will be necessary for any model-based planning kind of agent which must evaluate hypothetical future trajectories, or indeed for learning in any situation in which there is not direct access to the ground truth reward. Shard theory has a fair appreciation of this point, although it mixes it up with amortization when in fact it is orthogonal. Both direct and amortized agents must utilize learnt reward functions, and in some sense direct optimizers are harder hit by this, since they must possess a reward model to evaluate arbitrary hypotheticals dreamed up by their world model during planning, as opposed to simply evaluating the currently experienced world-state. However, even an amortized agent, if it is undergoing some kind of recursive self-improvement (RSI), must continue to scale and grow the reward model to keep it in sync with the capacity of the world model representations.

If it is the case that agents with a hardcoded utility function are not really feasible for AGI, then this has both positive and negative implications for our alignment chances. One crucial component of the argument that alignment is difficult is the explicit assumption that our alignment utility function will not generalize, in a FOOM scenario, to either capability growth or out-of-distribution situations. This is followed by the implicit assumption that some kind of true utility function that an unaligned AI will follow will generalize arbitrarily, which itself relies on the implicit assumption of the existence of a utility oracle. However, if the reward/utility function is learnt using the same kind of architectures that the AGI already uses internally (i.e. supervised learning with deep neural networks), then it seems likely that the utility function will generalize to approximately the same extent as capabilities during scaling and/or RSI, since both are based on the same architecture with the same fundamental constraints ⁵. Moreover, going too far off distribution might break any utility function and hence the agent’s behaviour might become incoherent.

On the negative side, in some sense this means that the super naive alignment idea of ‘find the perfect aligned utility function’ is doomed to fail because there may well be no such utility function and, even if there is, there is no way to reliably encode this into the AGI ⁶. Having a learnt utility function also opens up another avenue for misalignment – in the learning of the utility function the AGI might misgeneralize from its limited set of ‘utility examples’. Even in the case where the AGI appears to have found a sensible utility function, there might be all kinds of adversarial examples which could be exploited by the action-selection components. Additionally, since AGI will have to learn its utility function from data, if we use continual or online learning there is also the problem of drift, where the AGI might encounter some set of experiences which render it misaligned over time, or else there might be random drift due to various other noise sources.

On the plus side, learnt utility functions mean firstly that the agent’s utility function could potentially be corrected after deployment by just giving it the right data – i.e. to clarify misunderstandings / misgeneralizations. Secondly, this means that the AGI’s utility function, if it is learnt from data, should have roughly the same generalization performance/guarantees as other supervised learning problems. This is because we are assuming we are training the reward model with approximately the same DL architecture and training method as we use in, for instance, self-supervised learning for sensory modalities. From recent ML advances, we know that learning in this way has many predictable properties including highly regular scaling laws. Moreover, we know that DL is actually pretty good at learning fuzzily bounded concepts; for instance GPT-3 appears to have a pretty decent appreciation of common-sense human morality. This means that perhaps we can get an approximately aligned learnt utility function by just using some supervised learning on some set of illustrative examples we have carefully designed. The real question is whether the generalization capabilities of the reward model can handle the adversarial optimization pressure of the planner. However, if the planner wins and argmaxes highly Goodharted plans, this will translate into incoherent actions which are overfitted to noise in the reward model rather than some coherent alien utility. This means that it is effectively impossible for the AGI to truly argmax in a coherent way, meaning that the optimization power the AGI can deploy is effectively bounded by the fidelity and generalization capacity of its utility model ⁷.

While it seems that having to learn a reward model makes AGI harder to align, it is not clear that this is actually the case. Such an architecture would make AGI much more like humans, who also do not optimize for fixed goals and must learn a reward model and form values from noisy and highly underspecified data. Indeed, it is possible given the likely efficiency of the brain that the overall architecture used in the brain will end up very similar to the AGI designs used in practice ⁸. This allows us to better use our intuitions of handling humans and evidence from human value formation to help alignment more than in the case of a totally alien utility maximizer.

This also provides some evidence that at least some degree of alignment is possible, since humans tend to be reasonably well aligned with each other (most people have some degree of empathy and care for others and would not destroy humanity for some strange paperclip-like goal). Humans also appear moderately aligned with evolution’s goal of inclusive genetic fitness (IGF). Humanity as a whole has been one of the most reproductively successful species in history; most humans, even in low birthrate countries, often desire more children than they have; and humans genuinely tend to care about their children. Moreover, evolution managed to create this degree of alignment while operating solely through a very low dimensional information channel (the genome), working blindly with a random selection algorithm, and without being able to control the data the agents receive to learn this alignment from in the first place.

However, we have vast advantages over evolution in aligning our AGI systems. We can specify every aspect of the architecture and optimizer, which evolution instead has to control through a genomic bottleneck. If we ‘raise’ and test our AGIs in simulated environments, we can control every aspect of their training data ⁹. With sufficiently powerful interpretability tools, we would have real-time and complete coverage of all the representations the AGI is learning, and we could determine whether they are aligned, detect whether misgeneralization, mesaoptimization, deception, etc. are occurring, and remove them ¹⁰. Moreover, with comprehensive alignment test suites we would be able to get far more signal about the degree of alignment than evolution, which only gets a few bits of information about IGF every generation at best. All together, this suggests to me that a high degree of alignment, even with learnt reward models, should be possible in theory, but obviously whether we can practically determine the solution is unknown. This also suggests that there may be important things to learn about alignment from studying the neuroscientific details of how humans (and other animals) end up aligned to both each other and their IGF.

How does this work in current RL systems?

A natural question to ask is how this is covered in current RL methods. The answer is: either by humans designing reward functions using various proxy measures which, if scaled up to superintelligence, will inevitably fail to generalize, or alternatively by using simple enough environments that a ‘true’ utility function is possible to figure out. In many simple games such as Chess and Go, this fundamental ontology identification problem does not arise. Firstly, there is a known and straightforwardly computable utility function: win the game. Humans may design surrogate rewards on top of this to perform reward shaping since the actual utility is too sparse to allow rapid learning, but the fundamental utility function is known. Secondly, the ontology is known and fixed. The basic structure of the game is defined by known and fixed rules, and hence the utility function can be programmed in directly from the beginning ¹¹. AIXI-like agents are very straightforward to implement in such an environment and, unsurprisingly, are highly successful. Even in simple board games, however, the issue of lacking a utility oracle still arises for model-based planning, where the agent must be able to evaluate hypothetical board states it reaches when performing MCTS (Monte Carlo tree search). This is typically tackled in a few ways. In original chess engines like Deep Blue, valuation of hypothetical games was done by a human handcrafted utility function which was a bunch of heuristics that took into account the number of pieces of each side and some more qualitative positional information. In AlphaGo, and other NN-based methods, this is replaced with a learnt (amortized) value function output by a neural network as a function of game-state. Importantly, this is not an oracle since it is only a learnt approximation to the ‘true value’ of a state, and is not guaranteed to have the consistency properties of a true utility function.
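As a rough illustration of what a handcrafted evaluation heuristic looks like, here is a minimal material-count sketch. The piece values are the conventional textbook ones and are not Deep Blue's actual weights, and the string encoding of a position is a hypothetical simplification; a real engine's evaluation also includes many positional terms.

```python
# Conventional textbook piece values; an assumption, not Deep Blue's weights.
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}


def material_score(pieces):
    """Handcrafted heuristic: white material minus black material.

    `pieces` is a string of piece letters, uppercase for white and
    lowercase for black (a hypothetical encoding for this sketch).
    """
    score = 0
    for piece in pieces:
        value = PIECE_VALUES[piece.lower()]
        score += value if piece.isupper() else -value
    return score


# White: king, queen, rook, 3 pawns. Black: king, rook, bishop, 4 pawns.
position = "KQRPPP" + "krbpppp"
print(material_score(position))  # 5
```

A learnt value function plays the same role in the search, but its numbers come from training rather than from a table a human wrote down.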

In more general RL, this problem is effectively brushed under the rug and delegated to the human designers of the environment. In the mathematical formalism of RL, the reward function is assumed to be a part of the environment, i.e. the MDP (Markov decision process) specification contains a state-space, a transition function, and a reward function. This is quite clearly an incorrect philosophical conceptualization of the reward function, since properly the reward function is a property of the agent, and experiences in the real world do not come with helpful ‘reward values’ attached. In practice, this means that the designers of the environment (i.e. us) implicitly specify the reward function, which is usually some proxy of the behaviours we want to encourage. After what is often a fair bit of iteration, we can usually design a proxy that works quite well for the capabilities of the agents we train (although it almost certainly will not scale arbitrarily), and there are also amusing instances in which it fails.
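The way the formalism bundles the reward into the environment can be seen in a minimal sketch of an MDP specification. The `MDP` dataclass, the `step` helper, and the toy two-state example are hypothetical, and the transition function is made deterministic purely for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MDP:
    """Environment specification: note that the reward function lives here."""
    states: List[str]
    actions: List[str]
    transition: Callable[[str, str], str]     # deterministic here for brevity
    reward: Callable[[str, str, str], float]  # proxy chosen by the designer


def step(mdp, state, action):
    """The environment, not the agent, computes and hands back the reward."""
    next_state = mdp.transition(state, action)
    return next_state, mdp.reward(state, action, next_state)


toy = MDP(
    states=["no_clip", "clip"],
    actions=["wait", "bend_wire"],
    transition=lambda s, a: "clip" if a == "bend_wire" else s,
    reward=lambda s, a, s2: 1.0 if (s == "no_clip" and s2 == "clip") else 0.0,
)

print(step(toy, "no_clip", "bend_wire"))  # ('clip', 1.0)
print(step(toy, "clip", "wait"))          # ('clip', 0.0)
```

Nothing in the agent has to know what a paperclip is; the designer's `reward` lambda carries that knowledge, which is exactly the delegation described above.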

In model-free RL, this problem is typically not considered at all since the reward is conceptualized as part of the environment and agents just learn an amortized policy or value function from a dataset of environmental experiences which include the reward. In model-based RL with explicit planning, this problem does arise sometimes, since you are planning based on world-model rollouts which do not come with attached rewards. The most common solution is to essentially do rollouts in the observation space, or do rollouts in the latent space and then also learn a decoder to decode to the observation space, and then query the environmental reward function.
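The 'roll out in latent space, decode, then query the environmental reward' pattern can be sketched as follows. All three functions (`latent_step`, `decode`, `env_reward`) are hypothetical stand-ins for learned or environment-provided components, reduced to one-line arithmetic so the control flow is visible.

```python
def latent_step(z, action):
    """Stand-in for learned dynamics in a 1-D latent space."""
    return z + (1.0 if action == "forward" else -1.0)


def decode(z):
    """Stand-in for a learned decoder from latent space to observation space."""
    return {"position": 2.0 * z}


def env_reward(observation):
    """Reward function borrowed from the environment, defined on observations."""
    return -abs(observation["position"] - 6.0)


def plan_return(z0, plan):
    """Roll the plan forward in latent space, decoding each step to score it."""
    z, total = z0, 0.0
    for action in plan:
        z = latent_step(z, action)
        total += env_reward(decode(z))
    return total


plans = [
    ("forward", "forward", "forward"),
    ("forward", "back", "forward"),
    ("back", "back", "back"),
]
best = max(plans, key=lambda p: plan_return(0.0, p))
print(best, plan_return(0.0, best))  # ('forward', 'forward', 'forward') -6.0
```

This only works because `env_reward` is available to query on imagined observations; once it is not, the planner needs a learnt reward model in its place.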

Interestingly, learning a reward model for use in planning has a subtle and pernicious effect we will have to deal with in AGI systems, which AIXI sweeps under the rug: with an imperfect world or reward model, the planner effectively acts as an adversary to the reward model. The planner will try very hard to push the reward model off distribution so as to get it to move into regions where it misgeneralizes and predicts incorrect high reward. It will often attempt this in preference to actually getting reward in the standard way, since it is often easier to push the reward model off distribution by doing weird things than by actually solving the problem. This is effectively a subtler form of wireheading. The planner will also often act adversarially to a learnt and imperfect world model and push it off distribution towards states with high reward – and this can occur even with a perfect utility oracle. An even subtler issue that can arise with a learnt reward and world model is reward misgeneralization. If the agent is initially bad at the task, it usually gets very little reward from the environment. With a learnt reward model, this can be overgeneralized into ‘everything is always bad’ even if this is not actually the case, resulting in the agent failing to try potentially good actions and instead sticking with its known set of bad actions. A similar thing occurs if an agent is doing well consistently: the reward model will misgeneralize that ‘everything is awesome’ and start predicting high reward even for bad states, leading ultimately to a decline in performance. There are a variety of ways to address this in practice, such as enforcing sufficient exploration and somewhat randomizing moves so the agent can at least get some learning signal ¹², as well as penalizing divergence from a prior policy which does not suffer from these issues.
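A toy version of the planner pushing a reward model off distribution, under strong simplifying assumptions: a one-dimensional action, a hypothetical `true_reward` that peaks at 1, reward data that only covers the interval [0, 1], and a linear reward model fit by ordinary least squares.

```python
def true_reward(x):
    """Ground truth the designers care about: best at x = 1, bad far away."""
    return -(x - 1.0) ** 2


def fit_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Reward data only covers the region the agent has actually visited: [0, 1].
train_x = [i / 10 for i in range(11)]
slope, intercept = fit_linear(train_x, [true_reward(x) for x in train_x])


def reward_model(x):
    """Learnt reward: a fine approximation on [0, 1], wrong outside it."""
    return slope * x + intercept


# The planner argmaxes the learnt model over a much wider set of actions.
candidates = [i * 0.5 for i in range(21)]  # 0.0, 0.5, ..., 10.0
planner_choice = max(candidates, key=reward_model)
honest_choice = max(candidates, key=true_reward)

print(planner_choice, true_reward(planner_choice))  # 10.0 -81.0
print(honest_choice)                                # 1.0
```

On the training interval the reward is increasing, so the fitted line slopes upward and the planner runs to the far edge of the action range, exactly where the model is least trustworthy.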

What about selection theories and coherence theories?

The view that utility maximizers are inevitable is supported by a number of coherence theorems developed early on in game theory which show that any agent without a consistent utility function is exploitable in some sense. That is, an adversary can design some kind of Dutch-book game to ensure that the agent will consistently take negative EV gambles. The most classic example is non-transitive preferences. Suppose you prefer A to B and B to C but also prefer C to A. Then an adversary can offer to trade your C for B at a cost, then your new B for A at a cost, and then you will also accept trading A for C at a cost, at which point the cycle will repeat. Such an agent can be exploited arbitrarily and drained of resources indefinitely. Hence, in any setting where there are adversaries like this, all agents must have consistent utility functions.
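The money pump described above can be sketched directly. The names `PREFERS`, `accepts_trade`, and `money_pump` are hypothetical, and the fee and wealth are arbitrary integers chosen for illustration.

```python
# Cyclic (non-transitive) preferences as (preferred, over) pairs:
# A over B, B over C, and C over A.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}


def accepts_trade(holding, offered):
    """The agent pays a fee to swap whenever it prefers the offered item."""
    return (offered, holding) in PREFERS


def money_pump(holding, wealth, fee, rounds):
    """Adversary repeatedly offers whichever item the agent prefers to its holding."""
    offer_for = {held: wanted for wanted, held in PREFERS}
    for _ in range(rounds):
        offered = offer_for[holding]
        if accepts_trade(holding, offered) and wealth >= fee:
            holding, wealth = offered, wealth - fee
    return holding, wealth


print(money_pump("C", wealth=10, fee=1, rounds=6))  # ('C', 4)
```

After two full cycles the agent holds exactly what it started with and is six units poorer, and nothing in its preferences ever tells it to stop.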

However, these coherence theorems overlook a fundamental real world constraint: computational cost, which is a significant constraint for any embedded agent. It is often too computationally expensive to design a coherent utility function, given the almost exponential explosion in the number of scenarios needing to be considered as worlds grow in complexity. Similarly, as utility functions become more complex, so do the computational requirements to design a Dutch-booking series of trades with the agent.

To me, these coherence theorems are similar to the logical omniscience assumption in classical logic: given a set of propositions of assumed truthfulness, you can instantly derive all consequences of these propositions. If you don’t have logical omniscience, then an adversary can necessarily present arguments to make you ‘prove’ almost arbitrary falsities. Of course, in reality, logical omniscience is an incredibly strong and computationally intractable assumption for real reasoners, and such exploitation only rarely crops up. I would argue that something similar is true of the ‘utility omniscience’ which we assume that utility maximizers have. Maintaining a truly consistent utility function in a complex world becomes computationally intractable in practice, and a complex and mostly consistent utility function also requires a significant amount of computational resources to ‘hack’. There are also simple ways to defend against most adversaries behaving this way, such that a lack of utility omniscience never becomes an issue in practice. These defenses include simply ceasing to trade with a counterparty to which you consistently lose resources.
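The 'stop trading with whoever keeps costing you' defense can be sketched as a cheap guard on top of otherwise incoherent preferences. The class `CautiousTrader`, its `max_loss` threshold, and the choice to count every fee paid as a loss are all simplifying assumptions made for illustration.

```python
from collections import defaultdict


class CautiousTrader:
    """Cyclic preferences plus a guard: refuse any counterparty whose trades
    would push our total fees paid to them above `max_loss`."""

    PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}  # (preferred, over) pairs

    def __init__(self, holding, wealth, max_loss):
        self.holding = holding
        self.wealth = wealth
        self.max_loss = max_loss
        self.losses = defaultdict(int)  # fees paid, per counterparty

    def consider(self, counterparty, offered, fee):
        if self.losses[counterparty] + fee > self.max_loss:
            return False  # guard: this counterparty has cost us enough
        if (offered, self.holding) not in self.PREFERS:
            return False  # we do not prefer the offered item
        self.holding = offered
        self.wealth -= fee
        self.losses[counterparty] += fee
        return True


trader = CautiousTrader(holding="C", wealth=10, max_loss=3)
offers = ["B", "A", "C", "B", "A", "C"]
accepted = [trader.consider("adversary", item, fee=1) for item in offers]
print(accepted)       # [True, True, True, False, False, False]
print(trader.wealth)  # 7
```

The preferences are still non-transitive, so the agent is still exploitable in principle, but the total loss to any one adversary is bounded by a check that costs almost nothing to run.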

1. This property is essentially the orthogonality thesis.
2. We also somewhat equivocate between reward/utility functions and probability distributions, since this equivalence also exists and is well-known in optimal control theory.
3. Another possibility is to create a utility function in human-understandable concept space and then use some kind of mapping from that function to the internal state-space of the AGI. If this mapping is learnt, for instance with (reward, internal-state) pairs, then this is essentially identical to the learnt utility function approach but with slightly more complexity. However, there might be cleverer ways to make this more consistent than directly learning from data, for instance by applying some kind of consistency regularization. It may also be easier to generate more training data, if the human specified utility function can be assumed to generalize perfectly.
4. In fact, I would claim that much of the difficulty people find in alignment comes from implicitly running into this exact issue of building a utility function. Examples include issues relating to the pointers problem, the human values problem, ontology identification and ontology shifts. All of these challenges are not unique to the agent but implicit in giving the AGI any utility function at all.
5. It is possible that this is not the case and that for whatever reason (insufficient ‘utility data’ or inherent difficulty of modelling utility functions) utilities generalize or scale much worse than generic capabilities, but I think there is no clear evidence for this, and the prior from the shared architecture argues against it.
6. A more interesting, but more speculative, case here is that we can use this argument to strike whole classes of utility metafunctions from possible alignment solutions. As an example, we have to define a process that would actually allow the utility function to grow in capacity, or the agent will necessarily hit a utility modelling wall, and thus eventually a capabilities wall, at some point. And this opens another question of whether there is some FOOMy or even just a drifting limit that is reached where the utility function cannot be stably held at some maximal capacity.
7. As a further point, if we are at all sensible as a civilization, we would make sure the AGI also learnt the uncertainty of its reward model and did some kind of distribution matching or quantilization instead of assuming certainty and argmaxing against it. In this case, when very far off distribution, a correctly calibrated reward model becomes radically uncertain and hence the AGI ends up with an essentially uniform preference over actions, which is almost certainly safe. This might be one such method by which we try to bound AGI behaviour in practice to allow for testing and iteration.
8. I would argue that this prediction is being borne out in practice, where ML systems are looking increasingly brain-like. For instance, we use neural networks instead of GOFAI systems or some other strange model. We train these networks with self-supervised learning on naturalistic settings (like predictive processing). We will ultimately wrap these self-supervised systems in RL like the brain wraps cortex with basal ganglia. Etc. It is possible that there is a sudden shift in ML-system construction (for instance perhaps created by extremely powerful meta-learning over architecture search) but this is not the current regime.
9. To perform this reliably, it is necessary to have a very good theory of how training data and environments interact to produce certain generalization properties in the learned model. It is possible that such a theory is alignment complete or otherwise harder than alignment via some other route.
10. This isn’t necessarily alignment complete because we can often verify properties more easily than we can produce them. Even in the best case, such monitoring is potentially vulnerable to some failure modes. It is highly probable, however, in my opinion, that having these capabilities would nevertheless improve our chances dramatically.
11. Partial observability makes this slightly more complex, but not especially, since it is perfectly possible to define the utility function in terms of belief states, which are just weighted mixtures of states with known utilities.
12. Applications of various pitfalls of RL to the human condition left as an exercise to the reader.