I have recently had several meetings and discussions with various people involved in predictive coding (PC) about whether it is an interesting and viable direction of study. Naturally, I have a bunch of thoughts on this which are not always fully expressed in published papers, and which I thought it would be a service to the field to write down. Of course, these are just my thoughts and hunches and are not entirely justified by evidence. My primary audience here is people who are relatively new to PC or thinking about getting involved, although people who already know the literature deeply may also be interested in my takes on things.

I have been involved in the predictive coding literature for about 4 years now, during my PhD and postdoc, and worked actively on it for about 2.5 years (2020-2022). One of my main motivations during this time was to try to understand how the cortex performs credit assignment. This is perhaps the most fundamental challenge of neuroscience at a computational level. We know that the cortex is highly uniform and must learn all its representations from scratch with online learning. It seems likely that the cortex uses simple autoregressive unsupervised learning to do this, based on predicting sensory stimuli. This is supported by a wide variety of converging evidence in both neuroscience and machine learning that such approaches are highly scalable and highly successful, and also learn similar representations (down to having similar individual neurons). However, the cortex contains neurons that are many synaptic layers deep from sensory input, and the only way we thus far know how to train deep neural networks is with backprop. There are a number of reasons it seems unlikely that the cortex performs backprop directly (see Tim Lillicrap's review paper), so we need some other algorithm that either approximates backprop or performs credit assignment successfully in a different way. PC has been proposed as one such algorithm. One of the primary appeals of PC has been its supposed relationship to backpropagation, which made people think that PC could potentially be a way for the brain to perform backprop and hence do credit assignment successfully. I now no longer think that this relationship is useful in practice, for a number of reasons which I will get into later.

But first, here is a brief history of the PC/BP relationship for people who are relatively new to PC:

1.) The first exciting results in this vein came from Whittington and Bogacz, who showed that under a specific condition, where the strength of prediction errors decreases exponentially with depth, a predictive coding network can approximate backprop. In the same year (2017), Scellier et al. demonstrated a similar result for a related algorithm called Equilibrium Propagation (EP).

2.) Song et al. demonstrated that PC can approximate backprop on the very first step of the inference phase.

3.) Contemporaneously, I (Millidge et al.) demonstrated that PC can approximate backprop for arbitrary computation graphs (not just MLPs) under the 'fixed prediction assumption', where during inference the predictions between layers are fixed to their feedforward-pass values.

4.) Salvatori et al. extended the original Song et al. derivation to arbitrary graphs as well.

5.) I (Millidge et al.) unified all of these backprop approximations in PC (as well as equilibrium propagation and contrastive Hebbian learning) under a single theoretical framework, based on moving infinitesimally away from the initialization of an energy-based model.

In general, I am not very happy with the PC = backprop literature, despite being a major contributor to it. The reason is that the assumptions we have to make to get an approximation to backprop massively reduce the biological plausibility of the scheme, to the point where it is not really any better than doing backprop directly. While perhaps interesting at a theoretical level, this means it is not a useful advance in practice, nor is it particularly relevant to the learning algorithms implemented in the cortex.

Basically, what I believe happened is that most of the PC = backprop literature (including myself) got nerdsniped into what turned out to be a fairly trivial relationship, which can ultimately be expressed in a few lines of simple math. My writeup of this is in this paper. This is progress, of a sort. In my opinion, we figured out that this approach is largely a dead end, including for most of the related algorithms in the literature, such as equilibrium propagation and contrastive Hebbian learning, which existed independently of and before PC. Unfortunately, this means that I think the field has made almost no progress on actually figuring out what credit assignment algorithm the brain uses, and has instead just been exploring the sterile arcana of weird approximations to backprop.
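For concreteness, here is roughly what those few lines look like. This is a minimal sketch in a standard discriminative-PC notation (layer l predicts layer l+1 through weights W_l and nonlinearity f; the epsilon_l are prediction errors, the delta_l are backprop's error signals), not the full derivation from any one paper:

```latex
\begin{align}
\text{PC equilibrium condition:}\quad
  \epsilon_l &= W_l^{\top}\left(f'(W_l x_l) \odot \epsilon_{l+1}\right) \\
\text{Backprop recursion:}\quad
  \delta_l &= W_l^{\top}\left(f'(W_l x_l) \odot \delta_{l+1}\right)
\end{align}
```

If the activities inside f'(.) are held at their feedforward values (the fixed prediction assumption), the two recursions coincide, so the equilibrium prediction errors equal the backprop deltas and the local PC weight update reproduces the backprop gradient.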

In some sense, we reached an unsurprising conclusion: to approximate backprop sufficiently well, you end up having to make assumptions that turn your algorithm into what is essentially a bad version of backprop, with all the same restrictions and the same biological implausibilities. There appears, as yet, to be no free lunch whereby we can derive a biologically plausible version of backprop with better computational properties.

Instead, what I believe is that we need to go beyond this and find algorithms that do not approximate backprop but can nevertheless perform credit assignment well enough to train deep neural networks. It is far from proven that backprop is the only way to train deep neural networks; it is just that we do not currently have such an alternative algorithm. To me, however, the brain is basically an existence proof that such an algorithm exists (and quite possibly many more) and can be implemented in neuromorphic hardware / spiking neural networks.

Some directions that I think are promising:

1.) Our recent work on 'inference learning', encompassing the approximately simultaneous work of Yuhang on 'Prospective Configuration' and Nick Alonso on 'Inference Learning'. This is essentially what you get if you do PC without making any of the assumptions needed to approximate backprop. It therefore differs from backprop, appears able to train deep neural networks (albeit with no particular advantage over BP at large batch sizes), and seems to have some advantages at low batch size and in continual learning. I made some progress towards a theoretical analysis of this algorithm in this paper, which shows its links to target propagation. The scalability and robustness of this algorithm still need to be investigated, as do the computational issues with the inference phase, but it is at least moving away from the approximates-backprop attractor. (For a concrete picture of the algorithm, see the sketch after this list.)

2.) Target propagation and variants. These do not approximate backprop, although as far as I know they haven't been shown to scale super well to large networks. Target propagation has interesting theoretical relationships to Gauss-Newton optimization, which are understood in the literature. Because of this it may perform slightly better optimization, but at greater computational cost. It is also not super biologically plausible itself.

3.) Recent work from Joao Sacramento's group and Alexander Meulemans. Their recent stuff on direct feedback control is interesting, although I suspect it can be related to existing PC and other algorithms.

4.) Generally, my current high-level theory of the learning algorithm in the brain is mostly aligned with the general idea of inference learning, but not specialized per se to Gaussian distributions like PC. More broadly, I believe the brain learns via an incremental EM algorithm implemented through some kind of message passing (at least for the E-step; the M-step is probably Hebbian). I don't have exact specifics of this yet, but this is my overall feeling. It is, of course, highly general, given that most things, including backprop, can be expressed as message passing. Exactly how this can perform temporal credit assignment a la BPTT is also unclear, but general filtering has a very nice message passing description, so there seems to be hope there. If I wasn't switching to AI alignment, this is what I would be focusing on personally. Please reach out to me if you are interested in similar directions.

5.) More broadly, if I were pursuing this directly, I would at this point step back and try to think about things more from first principles. What is the evidence from neuroscience on how spiking networks actually learn? How does STDP actually fit into things (and is it a correct or a misleading description of real neural learning)? How can we do credit assignment through time? In retrospect, I personally was too strongly focused on PC and not broad-minded enough. In academia there are naturally strong incentives towards this kind of local optimization within a field, and it is hard to fight against them.
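To make the inference-learning idea in point 1 concrete, here is a minimal numpy sketch of one PC training step without any backprop-matching assumptions: relax the activities toward the energy minimum, then make a local Hebbian-style weight update from the equilibrium prediction errors. All names and hyperparameters here are illustrative, not taken from any particular paper:

```python
import numpy as np

def f(x):
    """Activation function."""
    return np.tanh(x)

def df(x):
    """Derivative of the activation function."""
    return 1.0 - np.tanh(x) ** 2

def pc_inference_learning(Ws, x_in, x_out, n_steps=50, lr_x=0.1, lr_w=0.01):
    """One PC training step: relax activities to (approximately) minimize the
    energy F = sum_l ||x_{l+1} - f(W_l x_l)||^2 / 2, then update the weights."""
    L = len(Ws)
    # Initialize activities with a feedforward pass, then clamp the output.
    xs = [x_in]
    for W in Ws:
        xs.append(f(W @ xs[-1]))
    xs[-1] = x_out

    # Inference phase: gradient descent on F w.r.t. the hidden activities.
    for _ in range(n_steps):
        eps = [xs[l + 1] - f(Ws[l] @ xs[l]) for l in range(L)]
        for l in range(1, L):  # input (l=0) and output (l=L) stay clamped
            grad = eps[l - 1] - Ws[l].T @ (df(Ws[l] @ xs[l]) * eps[l])
            xs[l] = xs[l] - lr_x * grad

    # Learning phase: local, Hebbian-style update from equilibrium errors.
    eps = [xs[l + 1] - f(Ws[l] @ xs[l]) for l in range(L)]
    for l in range(L):
        Ws[l] = Ws[l] + lr_w * np.outer(df(Ws[l] @ xs[l]) * eps[l], xs[l])
    return Ws
```

For instance, with Ws = [np.random.randn(32, 10) * 0.1, np.random.randn(5, 32) * 0.1] this trains a 10-32-5 network on a single (x_in, x_out) pair; batching and proper convergence checks are omitted for brevity.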

But where does this leave PC as an algorithm and field of study? I think, obviously, we need to move away from the PC = BP correspondence and instead focus on the inference learning / prospective configuration case where PC is distinct from BP.

Secondly, instead of focusing on trying to match and beat backprop where its core strengths are (learning with large batch sizes on massive i.i.d. datasets), I feel the future of PC research has to focus mostly on where backprop is weak and where PC can be stronger. As I see it, there are basically two strengths of PC vs BP. The first is low-data and online learning; there is at least some preliminary evidence that PC is better in such regimes. The second is flexibility in terms of network topology: since it performs dynamical inference at 'runtime', PC can perform arbitrary conditioning on any network state, including inputs and outputs. For instance, PC can trivially be run 'in reverse' and infer inputs from outputs, or map from any subset of inputs to any subset of outputs. This is a major potential advantage over backprop, which can only learn a single feedforward mapping. This property is a direct consequence of PC being an inference algorithm for a probabilistic energy-based model, which learns a joint distribution rather than a conditional distribution.
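Schematically, and using the same layer convention as the sketches above (this is one illustrative way to write it, not the only one), the model PC fits is the joint distribution

```latex
\begin{align}
p(x_0, \dots, x_L) &= p(x_0) \prod_{l=0}^{L-1}
    \mathcal{N}\left(x_{l+1};\; f(W_l x_l),\; \sigma_{l+1}^2 I\right), \\
F &= \sum_{l=0}^{L-1} \frac{1}{2\sigma_{l+1}^2}
    \left\| x_{l+1} - f(W_l x_l) \right\|^2
    = -\log p(x_0, \dots, x_L) + \text{const.}
% "const." absorbs the Gaussian normalizers and the prior over x_0.
\end{align}
```

Inference just means clamping any subset of the x_l to observed values and descending F over the rest: clamp x_0 and you get the usual input-to-output mapping, clamp x_L and the network runs in reverse, clamp both and you are in the training regime.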

To make a full list, here are some areas for future work where I think PC can differentiate itself from BP:

1.) In small batch (or batch size of 1) and online learning. We showed some interesting preliminary results of this here and it would be interesting to see if these results hold more generally.

2.) Extending and using PC on networks with more complex topologies than standard feedforward nets. Can we find good network architectures for more complex tasks than feedforward classification which can utilize this flexibility?

3.) For causal inference problems. As well as conditional inference, PC can also simulate and perform correct causal inference about interventions on the nodes of its generative model. Can this capability be used to sensibly simulate interventions vs conditioning? I have a super short post / proof of concept for this to follow shortly.

4.) Internal conditioning. Because every neuron in the PC network is a latent variable in the generative model, it is possible to condition on any arbitrary internal subset, fix those neurons to desired values, and then infer the rest of the network's activity (see the sketch after this list). This could theoretically be used to get much greater control over the processing of a PC network and may allow greater steerability.

5.) Handling incomplete data or flexible input/output mappings. Because PC is a generative model, it can straightforwardly handle incomplete or missing input (or output) data, while BP struggles with this. This is because PC can simply infer what the expected or missing inputs or outputs are.

6.) For dynamic filtering across time. The problems of the inference phase are much less acute if the PC system is constantly running and assimilating information, since it will always be close to the 'equilibrium' of the inference phase. Making PC actually work for this is a big challenge, but it would be super cool if it works.

7.) Using PC networks as memory. This has been mostly shown by my collaborator Tommaso, but it is super interesting that predictive coding networks, due to their generative model nature, can implicitly serve as associative memories and recover inputs from noisy samples if run recurrently. Perhaps this is similar to how the brain does associative memory lookup?
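As a concrete illustration of points 4, 5, and 7, here is a minimal sketch of arbitrary conditioning in a hypothetically already trained PC network, using the same energy convention as the earlier sketch: clamp whichever units you know, and let the rest relax down the energy. All names here are illustrative:

```python
import numpy as np

def f(x):  # same activation and derivative as in the earlier sketch
    return np.tanh(x)

def df(x):
    return 1.0 - np.tanh(x) ** 2

def pc_complete(Ws, xs, clamped, n_steps=200, lr_x=0.05):
    """Fill in unknown activity values by energy minimization.

    Ws:      list of (hypothetically trained) weight matrices.
    xs:      list of activity vectors for every layer (initial guesses).
    clamped: list of boolean masks, True where a unit's value is known.
    Clamped units stay fixed; all other units descend the energy F."""
    L = len(Ws)
    for _ in range(n_steps):
        eps = [xs[l + 1] - f(Ws[l] @ xs[l]) for l in range(L)]
        for l in range(L + 1):
            grad = np.zeros_like(xs[l])
            if l > 0:   # error between this layer and its own prediction
                grad += eps[l - 1]
            if l < L:   # error caused by this layer's prediction of the next
                grad -= Ws[l].T @ (df(Ws[l] @ xs[l]) * eps[l])
            xs[l] = xs[l] - lr_x * grad * ~clamped[l]  # only move free units
    return xs
```

Clamping part of the input mask gives data completion (point 5), clamping internal units gives internal conditioning (point 4), and initializing the visible layer with a corrupted pattern and relaxing recovers the stored pattern, i.e. associative memory (point 7).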

Is PC likely to be competitive with backprop on GPUs?

I have thought about this a fair bit and my direct answer is: no. Fundamentally, doing PC on a GPU requires simulating the inference phase, which is always going to be slower than direct backprop. This is one reason why I think the PC-BP approximation results are not that useful (the other being that they make assumptions that sacrifice biological plausibility). PC and its variants do not provide a way to do backprop faster on current hardware and are not competitive.

It may be possible to design digital hardware for which they are more competitive. Naively, I do not think this is possible either, since the PC inference phase requires at least as much memory transfer as BP simply to move prediction errors from one end of the network to the other. However, depending on how well the PC inference phase can be partitioned or use 'stale' information, you might be able to get it roughly competitive on some topologies. The place where PC could shine, theoretically, is if it turns out that the inference phase results in fewer update steps being needed than in BP (for which there is some minor preliminary evidence in our new paper). In this case, PC would present a tradeoff: more expensive per step (due to simulating the inference phase) but requiring fewer steps than BP. Depending on the hardware architecture, this may be a competitive or superior choice, although it would likely require a fair bit of finetuning, and it seems unlikely to be directly competitive on GPUs, which have implicitly been heavily optimized for BP by NVIDIA.

Where PC may be useful is in neuromorphic and especially analog hardware. This is because if you can get the inference phase to be 'computed' by physics, it can happen effectively instantaneously, and therefore be much faster than BP on GPUs. The reason this may be possible is that the inference phase in PC simply consists in finding the minimum of an energy-based model, and if you can set up an equivalent physical energy potential, then physics will find the minimum for you. PC has a very nice natural analogy as a system of springs, for instance, which is also explored in Yuhang's paper. If you could design the electrical equivalent of this, it could potentially lead to a much faster implementation of PC. Having studied analog hardware a fair bit, I think this could potentially work, but there are numerous difficulties involved in analog hardware design that I am not particularly equipped to wrestle with personally.
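Schematically, the spring analogy is just that each prediction-error term in the PC energy has the same form as a spring potential, with the prediction acting as the spring's rest position:

```latex
\begin{equation}
\underbrace{\tfrac{1}{2} k \, (x - x_0)^2}_{\text{spring potential}}
\quad\longleftrightarrow\quad
\underbrace{\tfrac{1}{2\sigma_{l+1}^2}
  \left\| x_{l+1} - f(W_l x_l) \right\|^2}_{\text{PC error term}}
\end{equation}
```

A physical network of such springs relaxes to the energy minimum on its own, which is exactly what the simulated inference phase has to grind out step by step on a GPU.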

Despite being fairly negative about PC here, it is also important to recognize its many positive aspects. I think trying to understand credit assignment and learning as a process of variational inference is fundamentally correct. I think PC is fundamentally correct about the unsupervised, predictive, and generative nature of learning in the cortex. I think it is correct to stress the importance of top-down connections for credit assignment (and here I think our hybrid inference paper is insightful and correct about the dual nature of inference in the brain). I think credit assignment performed by spreading prediction errors is a useful frame for thinking about learning. And I think the theoretical and mathematical framework of energy-based models may be a useful frame to think in.

I think predictive coding theory (and Friston) wins some Bayes points for essentially predicting the modern autoregressive and unsupervised learning paradigm. Back in 2005-2012 when Friston did most of his PC work, current ML did not exist at all, and even back when I started in 2017 most people were primarily thinking about supervised learning on ImageNet. Simply the idea that you can learn everything from unsupervised autoregressive/predictive learning of stimuli was a highly nontrivial insight at the time and proved to be entirely correct.

Ironically, I also think that most of the prevailing criticisms of PC are off-base. For instance, all of the following common criticisms are wrong:

1.) PC is unfalsifiable / not proven. A lot of philosophers seem to love this one, and it is almost entirely wrong. PC is entirely falsifiable. It is not yet proven, but there is a variety of vaguely supportive neural evidence. I am generally skeptical that most of the direct neuroscience experiments are conclusive, because there are many isomorphic ways of structuring approximately the same computations, and such experiments typically only test for one such way. For instance, you can rewrite (linear) PC to not have explicit prediction errors at all. In general, I find PC to be the global neuroscience theory with the most evidence behind it currently.

2.) PC assumes Gaussian distributions so cannot handle multimodal distributions. This is another weirdly common viewpoint (voiced even by Yoshua Bengio), and it is also just directly wrong. PC models the latent representations of the network as Gaussians, but then allows nonlinear mappings between the layers, thus achieving an arbitrary degree of flexibility. This criticism is akin to saying that VAEs cannot handle multimodality because they assume a Gaussian latent space.

3.) PC depends on / is derived from the free energy principle. Historically this is true, insofar as Friston invented both, but the logical status of PC does not depend on the FEP being true. PC is straightforwardly derivable as a variational inference algorithm for unsupervised learning (see the sketch below) and is thus sensible on standard Bayesian grounds.
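The derivation, schematically (this is a compressed sketch glossing over the choice of variational posterior): take the hierarchical Gaussian model from earlier, choose a variational posterior q concentrated at a point x (a MAP / delta-style approximation), and the variational free energy collapses to the PC energy up to constants:

```latex
\begin{equation}
\mathcal{F} = \mathbb{E}_{q(x)}\left[\log q(x) - \log p(x, o)\right]
\approx -\log p(x, o) + \text{const}
= \sum_{l} \frac{1}{2\sigma_{l+1}^2}
  \left\| x_{l+1} - f(W_l x_l) \right\|^2 + \text{const},
\end{equation}
```

with the observations o clamped at the appropriate layers. Inference is then gradient descent on this energy over the unclamped activities, and learning is gradient descent over the weights: no FEP required beyond ordinary variational Bayes.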

Why am I not currently working on PC?

Basically, AGI seems to be arriving soon, and AI safety seems much more pressing. I used to think that understanding the brain and its credit assignment algorithms would be either necessary or highly useful for getting to AGI. I no longer think this. Given recent progress, I think it is fairly clear that DL approaches can pass the finish line with no particular help from new insights from neuroscience. Scaling, simple objectives, and backprop seem to be largely all you need. In such a world, we reach AGI very soon. GPUs already exceed the brain in FLOP count, and large clusters are approaching or exceeding its memory bandwidth. In such worlds, PC and neuroscience are not super relevant. The main thing is figuring out how to align the AGI that we build to ensure a safe future.

Beyond this EA-adjacent concern with AI safety, on a purely personal level, I feel like I understand the PC / FEP / active inference landscape pretty well at this point and am looking for something intellectually new.

Will PC be useful for neuromorphic hardware and will neuromorphic hardware be relevant before AGI?

My answers to this are yes and no, respectively. I think PC-like algorithms running on neuromorphic hardware will eventually become the future of AI due to their greater efficiency vs GPUs. However, eventually is a long way away and AGI is not. I think neuromorphic hardware is basically a 'post-singularity' technology and so not super relevant. I think we will reach AGI on GPUs before we develop any competitive neuromorphic hardware. There are at least 2 OOMs of scaling left in the current paradigm without any hardware advances, and GPUs are still improving rapidly. Given the power of GPT-4 and the models before it, I think it would be pretty surprising if scaling up by another 2 OOMs was insufficient for AGI.