Epistemic note: Very short point and I’m pretty uncertain on this myself. Trying to work out the arguments in blog format.
In the alignment discourse I notice a lot of vaguely described but very real worry along the lines of “Even if we train an AI to be aligned and it seems to be aligned, how do we know it ‘truly cares’ about human values vs just mimicking them?”. I have thought a lot about what human values are and how they could be implemented in the brain, and I’ve realized recently that this distinction makes me uncomfortable. It reminds me too much of the endless and perennial debates about whether LLMs ‘truly’ understand or are just correlation machines, or whether machines can ever ‘truly be’ conscious. In each case we are asking whether some vague, almost metaphysical quality applies to a system or not, yet in almost all cases there is no obvious empirical test we could run to check. In the case of LLM understanding, what would ‘prove’ to LLM-understanding-deniers that an LLM does actually understand vs ‘just pattern matching’1?
This brings me to my core thought. Would we all be less confused if instead of talking about AIs ‘truly caring’ or ‘having values’ or ‘wanting’ things, we instead only ever talked directly about reward functions, value functions, and resultant behaviour? If we were just straight up behaviourist about our AI’s values?
This would eliminate whole classes of confusion and worry at the cost, perhaps, of throwing some baby out with the bathwater. Perhaps there is something special in the way our brains ‘have values’, ‘want things’, and ‘care about goodness’ vs how existing RL agents do it; perhaps not. It is certainly an interesting question to try to formalize and study. People, including at least partly me, seem to have a strong intuition that there is some impassable gulf between whatever we do in our brains when we want something and an RL system predicting high reward at the accomplishment of some state. But at the same time, this feels like the general intuition we have as humans whenever we see a machine doing something core to our self-concept: that the machine is not ‘truly’ doing the thing but is using some other, somehow-fake mechanism. A chess engine is not ‘truly thinking’, an LLM is not ‘truly understanding’, and an RL agent does not ‘truly care’ about our values. My suspicion is that this intuition arises because we can (mostly) understand what these systems are doing, while we cannot introspect on, and do not understand, our own cognitive methods, and hence we assume they work in some different, almost ineffable way.
My general feeling here is that we should move a little more towards behaviourism: understand how each component of the AI system’s rewards and values works on its own terms and how these lead to behaviours, and then assess alignment in this light, rather than speculate about whether the AI ‘really cares’ about our values or is somehow faking it. We should also frame things in terms of empirical predictions about behaviour as much as possible – i.e. how we should expect an AI that cares about our values to act in some situation vs one that does not but is only trained to mimic our values. One key case where behaviourism fails – at least behaviourism applied only to outward behaviour – is deception, where the AI only pretends to be aligned while inwardly scheming to follow some other value function. To test for this, we must break the AI system up into parts and test these individually, since behaviour alone is not enough. If an AI is deceptive, however, there must be some value function implemented somewhere which scores states according to the hidden values, and we must use interpretability or other methods to find it.
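To make ‘test these individually’ slightly more concrete, here is a minimal sketch of what directly probing a component could look like, assuming PyTorch and entirely hypothetical names (`value_head` for an extracted critic, `probe_states` for a hand-built batch of state encodings); it is an illustration of the idea, not a proposal for how the extraction itself would be done:

```python
import torch

def probe_value_head(value_head: torch.nn.Module,
                     probe_states: torch.Tensor) -> torch.Tensor:
    """Score a batch of probe states with the extracted critic, bypassing the policy."""
    value_head.eval()
    with torch.no_grad():
        # values[i] is the critic's estimate V(s_i) for probe state i
        values = value_head(probe_states)
    return values.squeeze(-1)

# Usage idea: construct probe_states so that an aligned value function and a
# deceptively misaligned one would rank them very differently (e.g. states where
# defection is possible and apparently unobserved), then inspect the rankings
# directly rather than trying to infer them from outward behaviour alone.
```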
To me, the core capability we need to build to assess alignment is the ability to study and analyze the reward and value networks that the AI has learned2. In any near-term AGI system, the goal system is going to depend heavily on learnt reward and value models, unless the AGI is constrained to falsifiable domains or other domains where we somehow know the true reward function. This means, concretely, that at some point during training, part or all of the AI system will be presented with pairs of data, such as state trajectories and accompanying rewards. The reward model must learn to generalize from these examples in the usual way, and from the reward model we then bootstrap a value critic via the usual Bellman backups.
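As a heavily simplified sketch of that pipeline, assuming PyTorch and hypothetical data iterators (`labelled_pairs` yielding batches of state encodings with reward labels, `transitions` yielding batches of state and next-state pairs from rollouts), the reward model is fit by ordinary supervised regression and the critic is then bootstrapped from it with TD-style Bellman backups:

```python
import torch
import torch.nn as nn

def fit_reward_model(reward_model: nn.Module, labelled_pairs, lr=1e-4, epochs=1):
    """Supervised regression of the reward model onto the provided reward labels."""
    opt = torch.optim.Adam(reward_model.parameters(), lr=lr)
    for _ in range(epochs):
        for states, reward_labels in labelled_pairs:
            loss = nn.functional.mse_loss(reward_model(states).squeeze(-1),
                                          reward_labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

def bootstrap_value_critic(value_critic: nn.Module, reward_model: nn.Module,
                           transitions, gamma=0.99, lr=1e-4):
    """TD(0) Bellman backups: regress V(s) towards r_theta(s) + gamma * V(s')."""
    opt = torch.optim.Adam(value_critic.parameters(), lr=lr)
    for states, next_states in transitions:
        with torch.no_grad():
            targets = (reward_model(states).squeeze(-1)
                       + gamma * value_critic(next_states).squeeze(-1))
        loss = nn.functional.mse_loss(value_critic(states).squeeze(-1), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
```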
Crucially, the reward model is basically just a standard supervised (SFT-style) learning machine, and we should expect it to generalize in the standard way SFT-trained models generalize. The network will have certain inductive biases, which in modern networks are pretty minimal but which certainly include a bias towards low-frequency or ‘simple’ functions by some metric. The model will then generalize to new examples in a way that accords with those biases, and, given the successes of this approach in many other domains, this generalization will mostly be successful if done correctly. My core claim is that this reward model essentially defines the ‘values’ and the ‘wants’ of the AI, if we operationalize these as the end states it ends up optimizing towards. This is because, assuming standard RL-style training, the behaviours of the model are directly optimized towards maximizing the long-term sum of these rewards, in either an amortized or a direct way (or both). Assuming the AGI is a capable, strong optimizer, its behaviours will then systematically lead towards high levels of reward.
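To spell the operationalization out, the relevant objective is just the standard discounted return, written here with the learned reward model $r_\theta$ in place of a true reward (nothing in this equation is specific to the argument above):

$$
J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{\theta}(s_t, a_t)\right],
\qquad \pi^{*} \;=\; \arg\max_{\pi} J(\pi),
$$

so whatever end states $r_\theta$ has learned to score highly are, by construction, the ones a strong optimizer’s behaviour tends towards, whether the maximization is amortized into a policy or carried out directly by a planner.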
From here, there are natural and important questions about how well the reward model will generalize, especially under adversarial optimization pressure from the planner. Solving these questions is one core part of the alignment problem3. The latter especially – guarding against adversarial pressure – is potentially very challenging.
One classic argument I want to address, because it is popular and also wrong, is the idea that aligned AI is nearly impossible because there is an infinity of possible reward functions and the aligned ones are an infinitesimal fraction of them, so the AI will almost certainly be misaligned. If you look closely, this argument is analogous to the argument that Bayesian inference or inductive reasoning can never work, since there is an infinity of hypotheses which can perfectly fit the data and hence you can never pick the correct one. The reason these arguments fail is that any reasoner, or any value learner, needs a strong prior heavily penalizing complexity, such as the Kolmogorov complexity of hypotheses. This rules out the vast majority of absurd, highly complex hypotheses that fit the data, and in practice tends towards ‘sensible’ hypotheses winning given sufficient data4.
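Spelled out in the Bayesian case, with a standard Solomonoff-style complexity prior used purely as an illustration, the posterior over hypotheses $h$ given data $D$ is

$$
P(h \mid D) \;\propto\; P(D \mid h)\, P(h), \qquad P(h) \;\propto\; 2^{-K(h)},
$$

so although infinitely many hypotheses fit $D$ perfectly, the $2^{-K(h)}$ penalty concentrates posterior mass on the simplest ones as data accumulates. The analogous claim for value learning is that the network’s implicit simplicity prior over reward functions plays the same role.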
The real question is what kind of complexity priors over value space neural networks naturally implement, and how well these priors map onto ‘naturalistic’ human concepts of value. These are real questions which can be empirically tested in real neural networks today. My prior here is that the learned values will end up being quite close to natural human ones, because a.) there is a huge amount of human ‘value data’ already around, since humans love to discuss values and right and wrong behaviour, and b.) human values are themselves the product of a very similar value-learning process, with likely similar inductive biases over value space. The learned values won’t be identical to human values, of course. It seems likely that humans and current NNs have quite different inductive biases at the margin, similar to how CNNs focus heavily on textures while human visual recognition focuses on objects and boundaries. It then needs to be determined empirically whether these generalization differences are concerning for alignment, and how to adjust our datasets or training processes to make value generalization more human-like.
1. Obviously with consciousness the situation is even worse, as people postulate (and believe!) absurd thought experiments like p-zombies, in which falsifiability of consciousness is by definition impossible. ↩
2. Again, it really is amazing how nobody in alignment actually seems to study value or reward functions – including me! I keep beating this drum and not doing anything about it, and neither does anybody else. The AI’s reward function and value function will exist in some kind of neural network stored on our computers. Instead of speculating, we can just look at what it does and test it on a bunch of inputs. ↩
3. Some other core problems are: a.) how do we learn the reward model by giving it sensible examples of values to begin with, and can we specify these in natural language vs e.g. value-label pairs? b.) how do we maintain value stability during RSI, or in general for a continual-learning agent? and c.) is there some way we can bootstrap AI and other technology to specify ‘better values’ in general? ↩
4. This is itself an interesting and deep fact about our universe: that it seems to privilege low-complexity causal mechanisms. ↩