My path to prosaic alignment and open questions

One of the biggest updates I have made in the past six months is strongly towards the belief that alignment for current LLM-like agents is not only solvable, but actually fairly straightforward, with a good chance of being solved by standard research progress over the next ten years.

A short summary of this view would basically be that near-term AGI systems will probably look like extremely large DL models trained on huge amounts of multimodal unsupervised data with myopic unsupervised learning similar to today’s LLMs. These LLMs are then likely to be trained with amortized model-free RL on a very wide set of tasks so they become generally capable. Crucially, this myopic pretraining both appears to be resistant to inner alignment failures and does not appear to create mesaoptimizers or gradient hackers ¹. Instead, myopic pretraining appears to lead to exactly the outcome that would be expected from the unsupervised loss – a close approximation of the Bayesian posterior over the training dataset.
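As a toy illustration of the last claim, here is a minimal sketch, assuming a tabular model that conditions only on the previous token (a drastic simplification of an LLM). In this setting, the model that minimizes the myopic next-token cross-entropy is just the empirical conditional distribution of the training data, with nothing else hiding in the optimum. The function names and the tiny corpus are made up for this example, and it shows the maximum-likelihood case only, not a full Bayesian posterior.

```python
from collections import Counter, defaultdict
import math


def fit_next_token_model(corpus):
    """Tabular next-token model: empirical p(next | previous) from counts."""
    counts = defaultdict(Counter)
    for seq in corpus:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for prev, ctr in counts.items()
    }


def cross_entropy(model, corpus):
    """Average next-token negative log-likelihood (nats) of the corpus."""
    total, n = 0.0, 0
    for seq in corpus:
        for prev, nxt in zip(seq, seq[1:]):
            total -= math.log(model[prev][nxt])
            n += 1
    return total / n


# Hypothetical toy corpus of three short "documents".
corpus = [list("abab"), list("abac"), list("acab")]
model = fit_next_token_model(corpus)

# Any other conditional distribution scores worse on the myopic loss.
perturbed = {"a": {"b": 0.5, "c": 0.5}, "b": {"a": 1.0}, "c": {"a": 1.0}}
print(model["a"])
print(cross_entropy(model, corpus), cross_entropy(perturbed, corpus))
```

The count-based model is the exact minimizer here only because the model class is tabular; whether large networks trained by SGD land close to the analogous optimum is precisely the empirical question at stake.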

Secondly, given a sufficiently large dataset (such as the entirety of all human text ever written), it seems plausible that to correctly predict and understand this data, the model needs to build extremely general latent state representations of almost all human concepts including, naturally, human concepts of ‘morality’, ‘alignment’ and so on. By the natural abstraction hypothesis, we should expect these representations to be extremely close to the human equivalents of these concepts. This is necessarily so functionally – in terms of predictions following from these concepts – but also likely in terms of internal structural representations and hence their generalization to novel situations. Moreover, because this technique harnesses the power of ML towards learning these key concepts, the internal representations should improve with model scale, dataset size, and general capabilities progress. If this is true, then simply by pretraining large multimodal models with sufficiently rich datasets, with an unsupervised myopic objective, we create extremely capable but non-agentic ‘mirrors’ of all the required human concepts.

With this base pretrained ‘mirror’, you can then create general agentic systems by training with reinforcement learning on top, at scale and across a large range of tasks. Crucially, this is not an unnatural or more difficult path to AGI, but rather the most straightforward and direct one. This is supported by studies of human and animal intelligence, which appears to comprise a huge neocortex trained on mostly unsupervised objectives and controlled by much smaller RL subsystems. It is also supported by more general theoretical considerations: the information bandwidth you can get from myopic unsupervised prediction is so vastly greater than from pure reward information that it seems this should be a dominant strategy, both in terms of sample efficiency and in actually being able to learn useful representations at all in complex environments.
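To put rough numbers on the bandwidth argument, here is a back-of-the-envelope sketch. All the figures (vocabulary size, episode length, a single binary reward per episode) are assumptions chosen for illustration, and the quantities are upper bounds on the information carried by the training signal, not measurements of what a model actually extracts.

```python
import math


def max_bits_from_prediction(vocab_size, tokens_per_episode):
    """Upper bound: each target token carries at most log2(vocab_size) bits."""
    return tokens_per_episode * math.log2(vocab_size)


def max_bits_from_reward(reward_levels, rewards_per_episode=1):
    """Upper bound: each reward carries at most log2(reward_levels) bits."""
    return rewards_per_episode * math.log2(reward_levels)


# Assumed, illustrative numbers only.
prediction_bits = max_bits_from_prediction(vocab_size=50_000, tokens_per_episode=1_000)
reward_bits = max_bits_from_reward(reward_levels=2)  # one success/failure signal

print(f"prediction: up to {prediction_bits:.0f} bits per episode")
print(f"reward:     up to {reward_bits:.0f} bit per episode")
print(f"ratio:      {prediction_bits / reward_bits:.0f}x")
```

Real text is far more predictable than this bound suggests, so the true gap is smaller, but even a few bits per token over a thousand tokens dwarfs a single sparse reward.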

Since the RL takes place upon a latent space which contains extremely robust and accurate representations of human concepts such as morality, kindness, empathy, and alignment, the fundamental RL components such as reward models, value functions, etc., will ultimately take this latent space to be their base ontology. This means that it is, in fact, extremely easy to specify reward functions, and for the model to learn reward functions and policies which take into account alignment and other human moral concepts as fundamental units. This can be done in a number of ways, including direct amplification of the concepts within the latent space to construct a human-designed and controlled reward model, or artificial critique and feedback (as in constitutional AI) to design aligned finetuning datasets or to directly construct an aligned reward module. The fact that the latent concepts needing to be referenced are ‘already there’ in some sense may also make it easier for cruder techniques such as RLHF, where human raters provide somewhat noisy feedback, to generalize correctly, since the model already possesses internally very similar concepts to those of the human raters, and Occam’s razor will bias the model towards solutions that are compositions of the correct human concepts and which generalize well. At best, this could even lead to alignment essentially by accident, where all that is necessary is to train the model in instruction-following and then just ask it for alignment.
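One very crude way to picture ‘direct amplification of a concept within the latent space’ is a reward defined as the projection of a latent state onto a concept direction. The sketch below assumes we already have latent vectors for examples that do and do not express a concept; the three-dimensional vectors, the difference-of-means direction, and all names are hypothetical stand-ins, not a description of any existing system.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def concept_direction(positive, negative):
    """Unit vector pointing from the mean negative latent to the mean positive latent."""
    diff = [p - q for p, q in zip(mean(positive), mean(negative))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def concept_reward(latent, direction):
    """Reward = projection of a latent state onto the concept direction."""
    return sum(x * d for x, d in zip(latent, direction))


# Hypothetical latents for text that does / does not express 'kindness'.
kind = [[0.9, 0.1, 0.3], [0.8, 0.2, 0.4]]
unkind = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.4]]
direction = concept_direction(kind, unkind)

# A latent further along the direction receives a higher reward.
print(concept_reward([0.7, 0.5, 0.5], direction))
print(concept_reward([0.3, 0.5, 0.5], direction))
```

The point of the sketch is only that the reward is written in terms of the model’s own ontology; whether real concepts are this linearly accessible, and whether such a reward survives optimization pressure, are among the open questions listed later.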

To me, this pathway seems to be roughly the default pathway to highly aligned systems based on current technologies in the near-term. Moreover, it seems like it could be highly robust. Many of the fundamental problems considered in the early alignment literature were about how to specify what we want to the machine, which was usually assumed to take in only hand-written computer programs or mathematical symbols. The key advance in deep learning is that we don’t have to do this: we have built machines that can understand fuzzy human natural language and can ‘understand what we mean’ in a sensible way. This part of the problem is thus solved by default.

The second part relates to how to bind the goal-directed cognitions and actions of the AI to the objectives we desire. While this is less clear philosophically, empirically current RL techniques seem extremely good at this in practice. Amortized model-free RL techniques simply compute the posterior action distribution given the state of the world and the goal. Model-based planners compute such a distribution too, but at runtime instead of over a dataset. These systems are all extremely controllable when humans can specify the goal well. The basic solution then simply looks like hooking up known and working RL algorithms to a natural language system which can interpret what we mean by ‘alignment’ or whatever goal system we want to train the AGI with. I still feel like there is some missing philosophical clarity about whether this is ‘really’ instilling values or ‘intrinsic goals’ or not, but algorithmically this appears to work well in practice, with failures in RL seemingly mostly related to a lack of capacity either to achieve the goal or to understand what the goal really is (such as basic reward hacking where humans have misspecified it).
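The distinction between the amortized and the model-based route can be made concrete with a toy sketch. It assumes a one-dimensional world with a known transition model, where a planner picks actions at runtime by evaluating the model, and an ‘amortized’ policy is just the empirical distribution p(action | state, goal) fitted over a dataset of the planner’s choices. Everything here (the world, the function names, the one-step lookahead) is a hypothetical stand-in for real RL algorithms.

```python
from collections import Counter, defaultdict

ACTIONS = (-1, 1)  # move left or right on an integer line


def step(state, action):
    """Known transition model of the toy world (an assumption of this sketch)."""
    return state + action


def plan(state, goal):
    """Model-based: at runtime, pick the action whose outcome is closest to the goal."""
    return min(ACTIONS, key=lambda a: abs(step(state, a) - goal))


def fit_amortized_policy(dataset):
    """Amortized: empirical p(action | state, goal) computed over a dataset."""
    counts = defaultdict(Counter)
    for state, goal, action in dataset:
        counts[(state, goal)][action] += 1
    return {
        key: {a: c / sum(ctr.values()) for a, c in ctr.items()}
        for key, ctr in counts.items()
    }


# Dataset of goal-directed behaviour, labelled here by the planner itself.
dataset = [(s, g, plan(s, g)) for s in range(-3, 4) for g in range(-3, 4) if s != g]
policy = fit_amortized_policy(dataset)

print(plan(0, 3))       # computed by search at runtime
print(policy[(0, 3)])   # looked up from the fitted distribution
```

In both cases the behaviour is steered entirely by the goal argument, which is the sense of ‘controllable’ used above: change the goal and the action distribution changes with it.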

This model of alignment still has a number of important risks. It presupposes a high level of reliability in the accuracy of the unsupervised latent concepts relating to human concepts of alignment and morality. It also assumes that RL techniques can be developed that reliably pick out and amplify these concepts correctly and generalize well instead of overfitting, Goodharting, or being reward hacked. More generally, it assumes that AGI systems will look very similar to today’s large DL models and that they will not undergo very sudden changes in structure or internal representations, as might occur due to recursive self-improvement (RSI) or some other means.

Given this model for how alignment gets solved in the near-term, the following open questions remain about the approach and about making the pieces work well individually:

1.) To what extent do natural abstractions hold, and how faithfully are ‘human values’ represented inside of the LLM? Where do these representations differ from normal human values, especially in unusual scenarios that are likely to be encountered by a near-term AGI?

2.) How can we ‘bind’ to these values and create agents which have these values through an RL process? How do RL agents come to have ‘values’, if they do? If so, what are the computational underpinnings of these values?

3.) Given an RL agent optimizing values specified fuzzily inside its own latent space, how do we prevent overfitting or Goodharting from warping its behaviour far away from what we would consider aligned?

4.) Similarly, how can we prevent reward hacking or other wireheading as well as instrumental convergence which removes corrigibility and ultimately human influence over the future?

5.) How can we figure out a set of values to install in the AGI which are acceptable to most humans and which will ultimately produce a universe that humans will find significant moral value in?

6.) How can we prevent value drift over time, especially as the AGI learns and develops, and how can we prevent uncontrolled intelligence explosions from RSI?

7.) How can we ensure that during pretraining or the RL phase the model actually stays aligned to the desired task and does not become inner misaligned and does not learn to strategically deceive us?

All of these questions are huge and important, but they are much more specific and seem much more tractable to research than more abstract ones about the foundations of agency and optimization. Moreover, they are the kind of questions that can be productively investigated right now with current AI systems, producing methods which can ultimately be refined to be highly robust and effective.

¹ In an ideal world, we would actually be sure of this and have strong empirical and theoretical arguments one way or another.