Everybody knows about the hedonic treadmill. Your hedonic state adjusts to your circumstances over time and quickly reverts to a mostly stable baseline. This is true of basic physiological needs – you feel hungry; you seek out food; you eat; you feel sated; you no longer seek food. It also applies to more subjective psychological states. You cannot, normally, feel extremely happy, or extremely sad, forever. If life is getting better, you feel some brief initial happiness, but it quickly reverts to baseline. The same happens, to a surprising extent, on the downward side: badness can quickly become normalized and accepted. This cycle of valuation and devaluation makes human behaviour difficult 1 to interpret as that of a utility maximizer with a fixed reward or utility function. It is also one of the reasons why human experience is, in many ways, very unlike that of a utility maximizer.

This kind of homeostatic regulation is also a good fit to many biological problems, and indeed to many other real-world problems, which are satisficing rather than maximizing objectives: usually you want to eat enough food to have sufficient energy to go about your other objectives, not to maximize your total food consumption. From an AI safety perspective, satisficing objectives, if we can figure out how to robustly encode them and make them stable, are likely to be much safer than maximization objectives 2, especially if the satisfiable region is broad. This is because the objective is bounded and, if it is relatively easily achievable, the incentive towards instrumental convergence to generic power-seeking behaviour is much weaker. Satisficing objectives in the brain do not appear to be implemented by simply satisficing on a reward function; instead, there is a slightly more complex PID control loop in which the salience and importance of various objectives are flexibly increased and decreased in line with physiological needs to maintain homeostasis. I wrote previously about the benefits of dynamic reward functions and homeostatic control for alignment, and specifically for preventing the many pathologies of pure maximization.
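
To make the shape of such a loop concrete, here is a minimal toy sketch (with purely illustrative names, gains, and numbers – not a model of any specific circuit) of a PID-style controller that converts the deviation of a physiological variable from its set-point into a non-negative ‘drive’, which can then serve as the weight on the corresponding reward channel:

```python
class HomeostaticDrive:
    """Toy PID-style controller: maps the deviation of a physiological
    variable (e.g. blood glucose) from its set-point to a non-negative
    'drive' that can be used as the weight on the matching reward channel."""

    def __init__(self, setpoint, kp=1.0, ki=0.1, kd=0.5):
        self.setpoint = setpoint
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, level, dt=1.0):
        error = self.setpoint - level              # below set-point -> positive error
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, raw)                       # drives (reward weights) are non-negative


# Toy usage: 'glucose' falls while fasting and recovers after a meal;
# the drive on the food reward channel waxes and wanes accordingly.
drive = HomeostaticDrive(setpoint=1.0)
glucose = 1.0
for t in range(20):
    glucose += -0.05 if t < 10 else 0.15           # fast for 10 steps, then eat
    glucose = min(glucose, 1.2)
    print(f"t={t:2d}  glucose={glucose:.2f}  food_weight={drive.update(glucose):.2f}")
```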

However, there is a more important angle. Beyond just satisficing, hedonic treadmills are an existence proof of dynamic, flexible, and corrigible value changes occurring regularly in intelligent creatures, including humans. Moreover, both reinforcement learning and these homeostatic control mechanisms are evolutionarily ancient (much older than the neocortex), so it is likely that they are very simple algorithms at their core. Importantly, the existence and ubiquity of such loops show that policies and values learnt by reinforcement learning algorithms can be dynamically controlled in a flexible and (almost entirely)3 corrigible way. This means that there must exist RL algorithms that give an outside system very powerful levers into their internal objectives, and allow those objectives to be flexibly changed while maintaining performance and coherent behaviour. In the brain, these levers are mostly pulled by fairly simple homeostatic loops controlled by the hypothalamus, but this doesn’t have to be the case. The level of flexibility is such that we could create very complex and robust dynamic ‘value programs’ scaffolded around our general RL algorithms. Exactly what the best ‘value programs’ are is an open question that needs to be experimented with, but the primary issue is simply creating RL algorithms that allow for dynamic revaluation in the first place.

Building agents with such algorithms, wrapped in various homeostatic control loops, would let us use the power of RL to optimize over non-differentiable multi-step environments, while also providing us with a huge amount of dynamic control over the objective(s) pursued by the resulting agent. In the ideal case, this control could let us largely automate away various pathologies of strong maximization such as Goodharting and unintended consequences, allow us to dynamically update the reward function if it turns out to be misspecified, and let us encode a high degree of conservatism in the AGI’s actions – i.e. limiting the maximum amount of optimization pressure employed by the agent, preventing the agent from acting when it has a high degree of uncertainty over its reward function, etc.
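
As a sketch of how those knobs might be exposed in practice (all names here are hypothetical, and random numbers stand in for a real agent and environment), the outer loop owns the weights over the reward channels, and a crude uncertainty gate enforces conservatism:

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_reward(reward_channels, weights):
    """The inner RL agent is trained on a weighted sum of reward channels;
    the weights belong to the outer control loop, not to the agent."""
    return float(np.dot(weights, reward_channels))

def conservative_action(q_ensemble, noop_action=0, max_disagreement=0.5):
    """Crude conservatism gate: an ensemble of Q-estimates stands in for the
    agent's uncertainty; if the heads disagree too much about the greedy
    action's value, fall back to a no-op."""
    greedy = int(np.argmax(q_ensemble.mean(axis=0)))
    if q_ensemble.std(axis=0)[greedy] > max_disagreement:
        return noop_action
    return greedy

# Toy loop: random numbers stand in for the homeostatic controllers,
# the per-channel rewards, and the agent's Q-ensemble.
for step in range(5):
    weights = rng.uniform(0.0, 1.0, size=3)    # e.g. set by food/water/safety controllers
    rewards = rng.uniform(-1.0, 1.0, size=3)   # per-channel reward this step
    q_ensemble = rng.normal(size=(4, 3))       # 4 Q-heads over 3 actions
    action = conservative_action(q_ensemble)
    print(step, action, round(effective_reward(rewards, weights), 3))
```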

Hedonic treadmills in the brain

While the actual neuroscientific implementation is largely irrelevant to the existence proof, it is quite likely that understanding how such loops are implemented in the brain would provide insight into how to implement them in ML systems. Unfortunately, we are very far from understanding how even basic hedonic loops, like those involved in food consumption, work at a mechanistic level. In my opinion, how such dynamic control is achieved algorithmically is actually one of the most important and fundamental unsolved questions in both RL theory and neuroscience.

From the neuroscientific perspective, taking feeding as an exemplar case, the start of the loop is moderately well characterized. It is controlled by the hypothalamus, which receives and releases various feeding inducers and inhibitors that monitor and control food and glucose levels in the blood and are released when food is tasted or detected in the stomach. This processing is handled by a few specialized nuclei in the hypothalamus and has relatively simple, slow control-loop dynamics, which essentially apply inverse control to a prediction error outside the ideal range. These nuclei then project to various regions in the brain, including the dopaminergic neurons in the VTA that are central to behavioural selection and reinforcement learning in the basal ganglia and cortex. Presumably, this signalling carries information that essentially tells the dopamine neurons to modulate their reward and reward prediction error firing – making food more or less desired, as appropriate. However, here is the big puzzle: how exactly does modulating the reward signal lead to rapid and flexible changes of behaviour?
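
To make the hand-off between the two systems explicit, here is a toy sketch (my own illustration, with made-up values, not a claim about the actual circuit) of a drive signal multiplicatively scaling the primary reward for food before it enters a standard temporal-difference reward prediction error:

```python
def rpe(reward, value_next, value_current, gamma=0.9):
    """Standard temporal-difference reward prediction error -- the quantity
    dopamine firing is commonly modelled as encoding."""
    return reward + gamma * value_next - value_current

# Toy state values for a cue that predicts food, and for the food state itself.
V = {"cue": 0.5, "food": 0.0}
base_food_value = 1.0

# The same bite of food produces a larger (or even negative) teaching signal
# depending on the current drive, which is roughly what the hypothalamic
# input would need to achieve by modulating dopamine firing.
for drive in (1.5, 1.0, 0.2):            # hungry -> neutral -> sated
    modulated_reward = drive * base_food_value
    delta = rpe(modulated_reward, V["food"], V["cue"])
    print(f"drive={drive:.1f}  reward={modulated_reward:.2f}  RPE={delta:+.2f}")
```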

The standard model of the basal ganglia is as a model-free policy trained with RPE firing from dopamine neurons using temporal difference or actor-critic learning. This learning approach causes the network to learn an amortized policy that simply maps from states to good actions. Importantly, this policy is naturally highly specialized to a specific reward function. Naively, you can’t change the reward function and expect the policy to instantly adapt; instead, you would have to retrain the network from scratch. This is because both the policy (if it is a learnt neural network) and the value function amortize and compress huge amounts of information into a relatively small object. This would be expected to come at a cost to generality due to the nonlinear compression mapping. If the reward function changes, the value function and optimal policy change in a nonlinear way which is hard to characterize in general.
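
A tiny tabular example makes the problem concrete: a Q-learner trained on one reward function keeps recommending the stale action the moment the reward function changes, until it is retrained (a toy two-armed bandit, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_q(reward_means, q=None, steps=2000, alpha=0.1, eps=0.1):
    """Tabular Q-learning on a two-armed bandit: the learnt values amortize
    one specific reward function into a small table."""
    q = np.zeros(2) if q is None else q
    for _ in range(steps):
        a = rng.integers(2) if rng.random() < eps else int(np.argmax(q))
        r = rng.normal(reward_means[a], 0.1)
        q[a] += alpha * (r - q[a])
    return q

q = train_q(reward_means=[1.0, 0.0])                 # arm 0 is good under the old reward
print("greedy arm under old reward:  ", int(np.argmax(q)))   # 0

# The 'reward function' changes (the organism now needs arm 1), but the
# amortized policy is stale until it has been retrained on fresh experience.
print("greedy arm right after change:", int(np.argmax(q)))   # still 0
q = train_q(reward_means=[0.0, 1.0], q=q)
print("greedy arm after retraining:  ", int(np.argmax(q)))   # 1
```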

It is possible to argue that in humans and higher mammals this is dealt with by model-based planning in cortical substrate. However, this argument is insufficient on two grounds. Firstly, hedonic loops are not some kind of weird, rare exception: they are fundamental to human decision-making and the human condition in general. They occur all day, every day. If this argument were true, it would essentially mean that, because your reward function changes so rapidly, the entire model-free RL portion of your brain was useless (it would actually be harmful, since it would keep pushing for old and obsolete policies!). Given the expense of maintaining the basal ganglia and a model-free RL system, it seems extremely unlikely that this would be maintained as a spandrel or vestigial brain region. Secondly, and more conclusively, hedonic control loops are an incredibly evolutionarily ancient invention which does not depend on the cortex – for the obvious reason that fixed utility maximization is bad in almost any biological context: almost all biological variables need to be satisficed and kept in a healthy range rather than maximized. Such loops exist in insects, which have complex behaviours but no cortex and operate entirely by model-free RL. Indeed, we are starting to understand the neural bases of these circuits in fruit flies. For hedonic loops to work at all, there must be some model-free RL algorithm which allows flexible goal changes to result in effective changes to the learnt policy without retraining. Moreover, this algorithm must be simple – it was discovered by evolution in ancient prehistory and does not require highly developed brains with advanced unsupervised cortices. Beyond this, the algorithm must allow some kind of linear interpolation of policies – i.e. you can smoothly adjust the ‘weightings’ between different objectives and the policies update seamlessly and coherently with the new weighted reward function.

Before getting to some proposed solutions, let’s think about the larger picture. What does this mean? Essentially, that flexible, compositional policies are possible, and that policies can be smoothly interpolated between reward functions 4. This is a huge amount of control to have over an RL agent. If we have the right algorithm and access to the right levers, we can tweak the reward function as we go (including quite drastically) to modify and control behaviour over time. In other words, there is some form of model-free RL that is intrinsically highly corrigible and tameable, where complex learnt policies can be flexibly controlled by relatively simple ‘outer loops’. Evolution uses these loops to control a wide variety of behaviours, including maintaining homeostasis in many different domains simultaneously, as well as implementing our general hedonic loop. These loops appear to be relatively simple hardcoded PID controllers which, for simple drives such as hunger, are likely implemented directly in the relevant hypothalamic nuclei.

Essentially, what this shows is that it is possible to tame RL algorithms and obtain levers to update and control, at runtime and without retraining, the effective reward function that the learnt policy optimizes. The question is what such algorithms are and how to build them, since current model-free RL does not appear to have these nice properties.

I puzzled over this for a while in 2021, and eventually ended up writing a paper which proposed the reward basis model. Essentially, I showed that if you learn a set of different reward functions at once, and assume a fixed policy, then the value function of a linear combination of those reward functions can be expressed as the same linear combination of the value functions for each reward. What this means is that if you learn a set of reward function ‘bases’, and learn a value function for each reward basis, then you can instantly generalize to any value function in the span of the value bases. Given a value (or Q) function, it is then trivial to locally argmax it in a discrete action space to obtain an optimal policy. This is related to successor representations but is more memory efficient (at the cost of some flexibility) and is, in my opinion, a nicer decomposition overall. It is one of those things that is so simple I was amazed nobody had discovered it sooner. I argued that this is probably what the basal ganglia are doing and how they achieve their behavioural flexibility in the face of changing reward functions, and there is a fair bit of supporting, if circumstantial, evidence for this hypothesis, both from the heterogeneity of dopamine firing (different dopamine neurons represent different combinations of reward bases) and from circuit-level evidence in fruit flies, where the mushroom body, the region that coordinates model-free RL, actually appears to implement a very similar algorithm.
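
A minimal sketch of the decomposition (random numbers stand in for per-basis Q-functions that would in practice be learnt by TD under a shared policy; this is not the paper’s actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_bases = 5, 3, 2

# Pretend these Q-functions, one per reward basis (say, food and water),
# have been learnt by TD-learning under a shared behaviour policy.
Q_bases = rng.normal(size=(n_bases, n_states, n_actions))

def q_for_weights(weights):
    """Under a fixed policy, the value of the composite reward
    r_w = sum_i w_i * r_i is the same weighted sum of the per-basis values."""
    return np.tensordot(weights, Q_bases, axes=1)     # shape (n_states, n_actions)

def greedy_policy(weights):
    """Zero-shot revaluation: changing the weights (hungry vs. thirsty)
    changes the greedy policy instantly, with no retraining."""
    return np.argmax(q_for_weights(weights), axis=1)

print("hungry :", greedy_policy(np.array([1.0, 0.1])))
print("thirsty:", greedy_policy(np.array([0.1, 1.0])))
```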

Taming RL with linear reinforcement learning

Having read more of the field in the last two years, I eventually stumbled upon Emo Todorov’s work 5 and realized that what I was doing was essentially rederiving (badly) the rudiments of a field called linear RL, which Todorov and collaborators invented between 2007 and 2012 and brought to a high degree of theoretical sophistication. I think this field has been surprisingly understudied – almost nobody, even within RL, knows about it – despite the power of its results.

Essentially, what they showed is that it is possible to derive RL algorithms that solve a subclass of MDPs – which they call linear MDPs (LMDPs). These algorithms have a number of nice properties. Firstly, parts of the Bellman recursion, such as optimal action selection, can be solved analytically. Secondly, the resulting policies and value functions have incredibly nice properties, the most important of which is essentially linearity – both policies and value functions can be linearly composed. If you have two policies, you can construct a linear combination of them, and this is the optimal policy for a reward function which is the corresponding linear combination of the reward functions used to train each policy individually. This means that it is straightforward to decompose a complex RL task into composable ‘skills’ which can be reused and recycled as needed. It also allows extremely powerful compositional generalization from a set of base policies to all policies in their span. From an alignment perspective, this would give us a set of useful and powerful control ‘knobs’ over the behaviour of a model, which we could dynamically adjust to control its behaviour. This could be done both autonomously, with meta-level control systems such as those regulating homeostatic loops like feeding in the brain, and directly.
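
As a concrete toy illustration of this linearity (the chain, costs, and dynamics are made up for the example, not taken from any particular paper), here is a first-exit LMDP in which the desirability function z = exp(-v) of a weighted combination of tasks is exactly the same weighted combination of the component solutions:

```python
import numpy as np

# Minimal first-exit linearly-solvable MDP (LMDP) on a 1-D chain with two
# absorbing goal states and uniform passive dynamics.
n = 7                      # states 0..6; states 0 and 6 are terminal
interior = np.arange(1, 6)
q_cost = 0.1               # per-step state cost in the interior

# Passive dynamics: from an interior state, drift left/right with equal prob.
P = np.zeros((n, n))
for s in interior:
    P[s, s - 1] = P[s, s + 1] = 0.5

def solve_z(final_cost, iters=500):
    """Desirability z = exp(-v). Interior: z = exp(-q) * (P @ z);
    boundary: z = exp(-final_cost). Solved by simple fixed-point iteration."""
    z = np.exp(-final_cost).astype(float)
    for _ in range(iters):
        z[interior] = np.exp(-q_cost) * (P @ z)[interior]
    return z

# Two component tasks: reach the left goal cheaply, or reach the right goal.
g_left = np.array([0.0, 0, 0, 0, 0, 0, 10.0])   # low final cost at state 0
g_right = np.array([10.0, 0, 0, 0, 0, 0, 0.0])  # low final cost at state 6
z_left, z_right = solve_z(g_left), solve_z(g_right)

# Compositionality: if exp(-g) = w1*exp(-g_left) + w2*exp(-g_right), then the
# composite desirability is the same linear combination of the solutions.
w1, w2 = 0.3, 0.7
g_mix = -np.log(w1 * np.exp(-g_left) + w2 * np.exp(-g_right))
z_direct = solve_z(g_mix)
z_composed = w1 * z_left + w2 * z_right
print(np.allclose(z_direct, z_composed, atol=1e-6))   # True (up to iteration error)
```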

While having very nice properties, linear MDPs are, of course, a highly restricted subset of all possible MDPs, and it was perhaps unclear how much of this would transfer to more complex behaviours. However, recent results are showing that many of these properties are also present, at least to some extent, in deep RL networks. For instance, Haarnoja showed empirically that a simple linear combination of entropy-weighted policies works well in practice, and a number of important recent theoretical papers have since proven that such weighted combinations work reasonably well for entropically weighted policies. Indeed, people have also worked out how to build a logic of policies: it is possible to take unions (OR), intersections (AND), and negations (NOT) of policies. This has been generalized to the concept of skill machines for policies, which appear to allow a high degree of compositional generalization even in deep learning systems trained with RL.
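
A toy sketch of what this kind of composition looks like in Q-space (the Q-values over four actions are hand-written for illustration, and the max/min renderings of OR/AND are a crude shorthand for the actual results in that literature):

```python
import numpy as np

def soft_policy(q_values, temperature=1.0):
    """Boltzmann / maximum-entropy policy induced by a Q-function."""
    logits = q_values / temperature
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

# Two hypothetical Q-functions over the same 4 actions, e.g. trained on
# 'reach object A' and 'reach object B' rewards respectively.
q_a = np.array([2.0, 0.5, -1.0, 0.0])
q_b = np.array([-1.0, 0.5, 2.0, 0.0])

# Weighted composition in Q-space ('both matter, B more').
q_weighted = 0.3 * q_a + 0.7 * q_b

# Crude Boolean-style compositions on Q-functions: OR ~ max, AND ~ min.
q_or = np.maximum(q_a, q_b)     # 'A or B is rewarding'
q_and = np.minimum(q_a, q_b)    # 'only where both are rewarding'

for name, q in [("A", q_a), ("B", q_b), ("0.3A+0.7B", q_weighted),
                ("A OR B", q_or), ("A AND B", q_and)]:
    print(f"{name:10s} -> policy {np.round(soft_policy(q), 2)}")
```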

My suspicion is that the surprising success of this approach stems from the fact that DL networks appear to be secretly almost linear. This is supported by recent findings that some RL-trained networks appear to have naturally learnt a linear world model. Suppose the latent space is good enough that policies can become a relatively simple (ideally linear) function of the latent states of this world model. In such a world, essentially all RL is linear RL, since the MDP ‘state’ that the RL algorithms operate on is in fact the linear latent state. The policies trained on it would then have all the nice properties of linear RL algorithms and be, in fact, highly controllable. Moreover, this could also be exploited directly: if you are training a new policy on a given latent state, it may be possible to simply initialize the algorithm with an easily computable linear RL policy.
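
If that picture is right, then fitting value functions (and hence greedy policies) on top of a frozen latent space reduces to linear regression; here is a toy sketch with synthetic latents and returns (no real world model involved):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for latent states from a frozen, pre-trained world
# model, and for Monte Carlo returns collected under some behaviour policy.
n_samples, latent_dim = 1000, 16
latents = rng.normal(size=(n_samples, latent_dim))
true_w = rng.normal(size=latent_dim)
returns = latents @ true_w + 0.1 * rng.normal(size=n_samples)  # ~linear in latents

# If the value function really is (close to) linear in the latent state,
# a single least-squares fit recovers it -- 'all RL is linear RL' over this
# state representation -- and the fit could initialize a new policy cheaply.
w_hat, *_ = np.linalg.lstsq(latents, returns, rcond=None)
print("value-head weight error:", float(np.linalg.norm(w_hat - true_w)))
```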

While still very speculative, the picture that may be emerging is that the learnt policies in RL are not intrinsically inscrutable objects, but in fact have a rich and mostly linear internal structure which gives us many levers to compose and control them from the outside. This would allow us to build agents that generalize better and can more flexibly adapt to changing reward functions, as well as being much more controllable and ‘corrigible’ to us than a pure fixed-reward maximizer would be. It is important not to get carried away, however. The evidence for this at large scales is speculative, and there are definitely many potential obstacles on the path to reliable methods for controlling and composing RL policies at scale. However, the goal no longer seems completely insurmountable. We may just live in a world where RL turns out to be relatively controllable. If this is the case, then we would be able to get a significant amount of alignment mileage out of building value/reward control systems for our RL agents, and could likely control even very powerful RL agents this way. This is how it happens in the brain, where a very large and powerful unsupervised cortical system is controlled very well by a few simple loops operating out of the hypothalamus. Designing and understanding the properties and failure modes of various outer value programs will be extremely important in making such systems robust enough to actually be safe. Additionally, the current niceness and linearity of deep networks seems to be coincidental, and is also imperfect. It is likely that we can improve the intrinsic linearity of the internal representations through a variety of means – including explicit regularization and designing architectures with inductive biases that encourage this structure in the latent space. Beyond this, a greater theoretical understanding of linear RL, including extensions from the discrete domain on which it is defined to the continuous latent spaces of deep neural networks, is sorely needed.

  1. Specifically, the utility function can no longer be defined only over the current external state. Either the current state must include the physiological state of the hedonic loop itself – i.e. there is explicitly a different reward function for each state of the hedonic loop – or, alternatively, the utility function must be defined over whole histories of states. In either case, this significantly expands the dimensionality of the problem. 

  2. A small subfield of AI safety work has explored simply not encoding or implementing maximizing agents directly. This is the idea behind quantilization, distribution matching, and regularization by impact measures.

  3. Our hedonic treadmills are surprisingly corrigible. Very rarely do we choose to fight them directly, and when we do it is by hacking the outer control loops rather than attacking the corrigibility of the objective itself. With food especially, there are a number of small counterexamples. Bulimics throw up food after eating, sating some of their hunger impulses via the satiety signalling triggered by eating (e.g. the suppression of ghrelin) while making sure they absorb minimal calories from their food. Conversely (and perhaps apocryphally), the ancient Romans would throw up food to make room for more food during banquets. Some psychological disorders, especially severe depression, also seem to short-circuit these hedonic loops and prevent a return from extreme sadness to the normal baseline. However, it is not clear whether this corrigibility would be maintained under arbitrary self-modification and reflection. For instance, it is likely that a lot of people would remove their homeostatic control loops to effectively wirehead themselves – i.e. obtain pleasure without building any ‘tolerance’ to it. Indeed, these loops are one of the main mechanisms the brain uses to prevent itself from wireheading, with surprising but partial success. 

  4. One trivial possibility is simply that the brain learns a separate policy for each reward function weighting. While very computationally and memory expensive, this is potentially a solution. However, we have fairly definitive experimental evidence that this is not the case in mammals. Specifically, the experiments by Morrison and Berridge demonstrated that, by intervening on the hypothalamic valuation circuits, it is possible to adjust policies zero-shot, such that the animal seeks out a previously repulsive stimulus which it has never experienced as pleasurable. This implies that if separate policies were learnt for each internal state, they could not be learnt in a purely model-free manner, which would require actually experiencing the stimulus in order to update the value function or policy. 

  5. As a minor aside, I would also highly recommend Todorov’s paper as an accessible and intuitive introduction to the core ideas of control as inference, from a primarily neuroscience perspective.