A Retrospective on Active Inference

Active inference is a theory of adaptive action selection for agents, initially proposed by Karl Friston and since expanded upon by many authors, which now forms a small academic subfield of research. The core claims of the theory are that action selection and decision-making can be usefully understood as inference problems (hence the name ‘active’ inference) and that adaptive action can be derived from first principles via the free energy principle, which essentially states that the existence of an entity presupposes that it performs some approximation to Bayesian inference, including inference over ‘active states’. Active inference claims to hold promise both as a descriptive theory, which can explain the processes behind action selection in entities (including the brain), and as a prescriptive theory, which can be used in the design of effective decision-making agents for various tasks.

When I started my PhD I was very excited by the potential of this theory as a way to address some shortcomings of standard reinforcement learning theory, including a few things that I found intellectually unsatisfying. I spent a very large amount of time trying to understand the convoluted maths of active inference and, in my humble opinion, reached the frontier of understanding here and made some decently well-recognised contributions to the field. My research focused on two main aspects of active inference theory: firstly, figuring out how to scale active inference using ‘modern’ (as of the 2018 era) deep neural network methods and how active inference and deep reinforcement learning are related; and secondly, figuring out where the putative exploration benefits proposed for active inference over deep RL come from. I think I broadly succeeded at both of these questions during my PhD, and the results have shaped my opinion of active inference and set it quite apart from what many inside the field think.

While none of the discussion and conclusions in this blog post are new, they are scattered across many different papers ¹, so after several discussions about active inference with various people, I decided to summarize my thoughts here in one accessible place.

As described in my post on my PhD experience, early in my PhD I became interested in active inference and its seeming potential to provide advancements over traditional RL methods. While it seemed promising from a theoretical perspective, at the time active inference was constrained to solving fairly trivial grid-world MDP problems. My initial thinking was to try scaling it to small RL tasks by using the tools and techniques of deep reinforcement learning. This ultimately succeeded, producing results similar to deep RL.

However, I ultimately realised that my initial thoughts about scaling active inference were fundamentally misguided. Active inference, at its most abstract level, can of course be scaled, since at a high level it is just the idea of treating reward functions as probability distributions and applying techniques from approximate Bayesian inference to estimate the posterior over the policy or action space. What may or may not be scalable is the generative model that is assumed for the agent, where the scalability is determined by the degree of flexibility and expressivity in its parametrisation. Early active inference papers differed from existing RL methods primarily by assuming discrete categorical distributions for the transition and observation matrices of the environment, thus restricting their range to discrete grid-world environments and enabling something close to analytical solvability, which reduced the need for more general RL approaches.
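To make the discrete setting concrete, here is a minimal sketch of a categorical generative model with an exact Bayesian belief update. The names `A` (observation matrix) and `B` (per-action transition matrices) follow a convention common in the active inference literature; the three-state world, the numbers, and the helper functions `predict` and `update` are illustrative assumptions, not taken from any particular paper.

```python
import numpy as np

# Hypothetical 3-state, 3-observation world; all numbers are illustrative.
# A[o, s] = p(o | s): observation (likelihood) matrix, each column sums to 1.
A = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

# B[a, s_next, s] = p(s_next | s, a): one transition matrix per action.
B = np.stack([
    np.eye(3),                      # action 0: stay in the current state
    np.roll(np.eye(3), 1, axis=0),  # action 1: move to the next state (cyclic)
])

def predict(belief, action):
    """Push the current state belief through the transition model."""
    return B[action] @ belief

def update(prior, obs):
    """Exact Bayesian update of the state belief given a discrete observation."""
    unnormalised = A[obs] * prior
    return unnormalised / unnormalised.sum()

belief = np.full(3, 1.0 / 3.0)      # start from a uniform belief
prior = predict(belief, action=0)   # staying put leaves the belief uniform
posterior = update(prior, obs=1)    # observing outcome 1 favours state 1
```

Because everything is a small table, inference here is a couple of matrix operations, which is what makes these models close to analytically solvable and also what ties them to small discrete environments.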

Our approach to scaling active inference simply used deep neural networks to parametrise these distributions instead, as is implicitly done in deep RL. As such, the question of scalability is not inherent to the paradigm as a whole but to the specific modelling choices used within the paradigm. Unsurprisingly, our results were generally successful, but they led to the immediate question of how active inference differs from standard RL.

The answer to this, it turns out, is that they are extremely similar. Both active inference and RL can be derived as consequences of a general mapping of decision-making problems onto a specific Bayesian inference problem. It has been known for a relatively long time that RL can be represented in this way through the ‘control as inference’ framework. The question then became how active inference and control as inference are related. The difference turns out to be a relatively technical one: the two are largely the same, but make slightly different assumptions about how the notion of reward is encoded into the probabilistic graphical model that they optimise over. This difference in encoding gives rise to the slightly different objectives that the two frameworks optimise. This means that active inference is essentially isomorphic to RL or, more grandiosely but accurately, you could think of RL methods as a subset of active inference. While theoretically interesting, this result has little practical importance for designing action selection algorithms, and indeed algorithms from RL can be ported over directly to active inference and vice versa. I wrote up my results on this in this paper, which provides a decent overview of the connection ².
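As a rough illustration of what "encoding reward into the graphical model" means, the following sketch uses notation that is an assumption here, borrowed from common presentations of the two frameworks rather than from this post. Control as inference typically attaches a binary optimality variable to each state-action pair, while active inference typically places a reward-shaped prior directly on observations:

```latex
\begin{align}
\text{Control as inference:} \quad
  & p(\mathcal{O}_t = 1 \mid s_t, a_t) \propto \exp\big(r(s_t, a_t)\big) \\
\text{Active inference:} \quad
  & \tilde{p}(o_t) \propto \exp\big(r(o_t)\big)
\end{align}
```

Under the first encoding, with a uniform action prior and a variational policy $q(a_t \mid s_t)$, variational inference recovers the familiar maximum-entropy RL objective $\mathbb{E}_{q}\big[\sum_t r(s_t, a_t) + \mathcal{H}[q(\cdot \mid s_t)]\big]$. The second encoding leads to an objective of the same general shape with somewhat different entropy and information terms.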

This meant, however, that since active inference and RL were so close, there was, and is, relatively little that active inference could bring to the table above standard RL methods. It provides theoretical insights, maybe, and a decent amount of understanding, but no particular secret sauce that would bring about improvements on practical tasks.

The one place where active inference claims some kind of advantage is in the superior quality of its exploration. This is due to the expected free energy (EFE) term which is optimised by active inference. The EFE is claimed to be naturally derivable via first-principles probabilistic arguments, and it can be decomposed into a reward-optimising term and an exploration (information-gain) optimising term.
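For reference, the decomposition usually quoted in the literature looks roughly as follows. The notation is an assumption here: $q$ denotes the agent's predictive beliefs under a policy $\pi$, $\tilde{p}$ denotes the generative model biased towards preferred observations, and the second line uses the common approximation $\tilde{p}(o, s) \approx \tilde{p}(o)\, q(s \mid o, \pi)$.

```latex
\begin{align}
G(\pi) &= \mathbb{E}_{q(o, s \mid \pi)}\big[\ln q(s \mid \pi) - \ln \tilde{p}(o, s)\big] \\
       &\approx -\underbrace{\mathbb{E}_{q(o \mid \pi)}\big[\ln \tilde{p}(o)\big]}_{\text{reward (extrinsic value)}}
          \;-\; \underbrace{\mathbb{E}_{q(o \mid \pi)}\Big[D_{\mathrm{KL}}\big[q(s \mid o, \pi) \,\|\, q(s \mid \pi)\big]\Big]}_{\text{information gain (epistemic value)}}
\end{align}
```

Minimising $G(\pi)$ therefore favours policies that are expected both to produce preferred observations and to produce observations that change the agent's beliefs about the hidden state.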

This claim, and the lack of clear derivations in the existing literature, led me to try to figure out my own way of deriving the EFE from first principles. This ultimately led me down the rabbit-hole of figuring out whether and how exploration bonuses via optimising information gain can be derived as a consequence of performing Bayesian inference to infer the optimal actions. I first figured out that the EFE is not directly derivable from a standard Bayesian treatment of the problem, and that existing derivations of the EFE were either incorrect or not contentful, in that they largely assumed their conclusion. I then continued tracing down the answer to where information-maximising terms in action selection come from. This ends up being a relatively simple mathematical relationship and leads to a distinction between what I call divergence objectives and evidence objectives. Divergence objectives require you to minimise a divergence between desired and actual states of the world, and result in information-maximising exploratory behaviour, while evidence objectives correspond to standard utility maximisation. I have written about some possible relevance of this distinction for alignment here, but again I think it does not have particular practical advantages over simply using exploration bonuses as an ad-hoc tool to improve RL algorithms where they are needed.
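A minimal numerical sketch of the distinction, under simplifying assumptions: outcomes are discrete, the policy is summarised by the outcome distribution `q` it induces, and the `desired` distribution and function names are hypothetical. It relies only on the identity KL[q || desired] = -E_q[ln desired] - H[q], so minimising the divergence adds an entropy-maximising term on top of expected utility.

```python
import numpy as np

# Hypothetical desired distribution over three outcomes (illustrative numbers).
desired = np.array([0.7, 0.2, 0.1])

def evidence_objective(q):
    """Expected log-probability of outcomes under `desired` (to be maximised)."""
    return float(np.sum(q * np.log(desired)))

def entropy(q):
    """Shannon entropy of q in nats, treating 0 * log(0) as 0."""
    nz = q > 0
    return float(-np.sum(q[nz] * np.log(q[nz])))

def divergence_objective(q):
    """KL[q || desired] (to be minimised)."""
    nz = q > 0
    return float(np.sum(q[nz] * (np.log(q[nz]) - np.log(desired[nz]))))

greedy = np.array([1.0, 0.0, 0.0])   # always produce the most-desired outcome
matched = desired.copy()             # reproduce the desired distribution

for name, q in [("greedy", greedy), ("matched", matched)]:
    print(name, evidence_objective(q), entropy(q), divergence_objective(q))
```

In this toy case the evidence objective prefers the greedy, zero-entropy outcome distribution, while the divergence objective is minimised by matching the desired distribution, which requires keeping outcome entropy high.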

These are pretty much my primary engagements with, and contributions to, the field of active inference over the course of my PhD. Now, while I feel my results have been mostly negative regarding it, I definitely think it was interesting and somewhat useful to have studied it in depth. Active inference contains and is linked to many beautiful ideas about probabilistic interpretations of decision-making, which are certainly interesting to understand and may have some theoretical implications, although, as in much of current machine learning, theory is a poor guide to empirical success.

To me, the most appealing and interesting thing about active inference is that it sheds light on understanding action selection as a Bayesian inference problem. This lets us both derive new objectives and gain a better understanding of existing algorithms. However, while theoretically illuminating, from a practical perspective the Bayesian lens is much less valuable. This is because, while there exists a huge number of methods for solving Bayesian inference problems in the academic literature, in practice the best thing to do is to parametrise your densities with a neural network and perform black-box variational inference. This boils your density estimation problem down to one of optimising the parameters of some neural network to maximise some surrogate of the log-likelihood, eschewing sampling, MCMC, and a bewildering number of special-case algorithms that do not appear to scale well. While this may change in the future, for the moment this very simple but effective recipe reigns supreme, and hence a close understanding of the Bayesian intricacies that lie behind the simple methods is not that important.

In terms of alignment, this has led me indirectly to some interesting (at least in my own opinion) distinctions between evidence and divergence objectives as well as between amortised and direct optimization, both of which come from understanding the deep foundations of RL as applied Bayesian inference.

From an alignment perspective, I also believe that there are some interesting insights and approaches here which can be applied productively. Many putative alignment concerns can be boiled down to the AI neglecting or poorly estimating uncertainties about crucial components such as its utility function, or similarly not being correctly regularised, as occurs naturally in Bayesian settings with sensible priors. For instance, problems of misspecified utility functions, such as unintentional squiggle maximisers or extreme goodharting, are all ultimately pathologies arising from a lack of uncertainty specification or poor calibration of the uncertainty, and can be resolved through sensible uncertainty calibration. Similarly, methods for regularised or soft optimisation, such as quantilisation, are best expressed within a Bayesian framework and can be more easily understood and formalised in terms of the entropy of the desired distribution using a divergence objective. More generally, modelling the preference aggregation required for some kind of CEV-like solution as Bayesian inference seems to me to be a promising direction.

More generally, I think that a positive aspect of active inference is that it focuses on a set of generative models for discrete systems which are somewhat novel for reinforcement learning, and in these situations it can perform well. Additionally, it provides a nice modelling language for understanding and predicting the behaviour of agents in these kinds of environments. Generally, trying to understand and fit the generative models underlying a particular behaviour or set of observations is a good approach, although not one unique to active inference.

The focus on Bayesian inference and applying more complex Bayesian inference methods than the brute force of SGD is also to be applauded, and I am happy that active inference drives at least some additional research in this direction, as ultimately, I think it is possible that more principled or advanced Bayesian algorithms may scale and ultimately be useful for learning beyond just SGD and related first order optimisers, at least in certain circumstances. In my opinion, work done on trying to scale up such approaches to be competitive with modern machine learning is commendable and important.

While the theory of active inference claims to be closely related to neuroscience, and often argues that various phenomena at either the neuronal or the behavioural level can be well explained by an active inference account, I have come to consider the evidence for this quite weak. Certainly, attempting to understand and model brain function through a normative Bayesian lens is an interesting way to do things, since the brain certainly performs tasks that are isomorphic in some sense to Bayesian inference. However, I now tend to prefer a more bottom-up empirical approach to understanding the algorithmic level of brain function. I think that most advances will happen empirically, by understanding brain algorithms on their own terms and then trying to understand their Bayesian nature, rather than by taking existing Bayesian algorithms and attempting to match them to neural data. Time will tell which ends up being the successful strategy.

Overall, during my PhD I feel like I managed to get to the forefront of the active inference field and truly understand it, at least as the literature stood in 2020-2021 and previously. My assessment is that the empirical neuroscientific backing of the theory is relatively weak, although interesting, and that from a theoretical perspective it differs relatively little from classical reinforcement learning. However, its focus is different, and it provides some interesting perspectives and insights, especially around the probabilistic and Bayesian formulation of reinforcement learning and general decision-making problems. Such a formulation allows the general application of Bayesian methods (a huge field) to the problem and unifies the problems of perception and action, which is nice theoretically, although thus far it has lacked a clear practical benefit. I feel that the most promising directions that follow from the theory are further developing and understanding the Bayesian formulation of action, which has led to some interesting RL progress, and trying to scale more structured generative models for agents which are amenable to more efficient and advanced inference algorithms than simple SGD on unstructured neural networks. Finally, I feel that the Bayesian perspective on action and decision-making is insufficiently appreciated in thinking about alignment, and that it is likely to yield some interesting insights and methods.

¹ Perhaps the cleanest description is in the discussion in my PhD thesis but that is a heavy document to wade through and synthesise.
² I also go into this in more detail in chapter 5 of my PhD thesis.