Originally written in early 2021 while writing up my thesis. Never got around to publishing it on my blog then. I think it might be interesting to people wanting to see what a PhD looks like from the ‘inside’. Note that everyone’s PhD experience is highly personal to them and mine was unusual in several aspects, especially in having very weak and hands-off supervision for most of the time except for my year in Sussex. Also, I feel like I was much more undirected and unsure about what I wanted to do than most other PhD students I have interacted with, especially recently. Recent PhD students seem a lot more goal driven and also have much clearer paths and mentorship laid out now than existed when I started my PhD.
As my PhD comes to an end, it is time to reflect back on three and a half years of work, struggle, failure, and some minor successes eventually culminating in the dreaded PhD thesis. While there’s a huge amount of writing on the subject of the PhD experience, including lots of people writing up their own experiences, it’s important to note that everybody’s experience differs significantly from everyone else’s and because of this I think it’s worth putting forward my own experience, in as unvarnished an approach as possible so people know what’s going on. In general, looking back, I did my PhD far from optimally and basically only had any success at all due to my luck in meeting and chatting with Chris Buckley and going to Sussex where I got proper mentorship and became heavily involved in the active inference and predictive coding communities. I was also ridiculously undirected in my first year and did basically nothing of productive value, improved somewhat in my second year, and then was very productive in the third and final year once I got to Sussex.
Ultimately, because I’ve kept a fairly consistent and detailed diary for the last five years, I can quite precisely state what I was working on in my PhD at all times, including tracking all the dead ends and irrelevant side-quests and odd subjects I learnt during this time, as well as my thinking at each stage. I hope this provides a good window into what the process of doing a PhD is like, as well as some of the more generalizable things I have learnt over time. Of course, I am also doing this implicitly in the hope that the last three and a half years of my life haven’t been wasted going down some incredibly strange side-alley, which is always the risk of academia, and I now think there is a good chance this has been the case. Well, here we go.
One important thing to note is a key difference between US and UK PhDs. UK PhDs are only three to three and a half years long (I got funding from the UK government), and the standard for obtaining one is about 1-2.5 publishable pieces of work. Crucially, this means publishable anywhere respectable, not necessarily in a top-tier journal. My understanding of US machine learning PhDs (having never done one myself and having purely osmosed this through the internet) is that they effectively require publication in a top-tier machine learning conference, which I tried (but failed) to achieve in my PhD (I would only succeed at this in my postdoc). Also, in the UK PhD students don’t teach or take classes; the job is full-time research. Finally, you are assigned a supervisor before starting and begin with a research project already – there is no period of supervisor ‘matching’ (although some UK universities run combined master’s/PhD programmes where the first year, out of four, is a master’s with rotations through different labs to match with potential supervisors there).
Undergrad at Oxford
My background is that I did my undergrad (2013-2016) in Psychology, Philosophy, and Linguistics, where I started out interested mainly in linguistics – primarily theoretical syntax. I did this tripartite degree because all three sounded interesting and Oxford doesn’t offer a pure linguistics programme (only modern languages and linguistics, which I wasn’t interested in). I’m very glad I didn’t go for a straight linguistics program in the end (or history, which was my alternative) because I discovered that I really enjoyed the neuroscience and coding aspects of my degree and preferred that to theoretical linguistics. Throughout my undergrad I did not learn to code beyond blindly copy-pasting and running other people’s MATLAB scripts for my undergraduate psychology project.
In undergrad, I got more interested in psychology, and especially neuroscience. I took modules in cognitive and computational neuroscience. I also heard about machine learning for the first time with first the Deepmind Atari results and secondly the AlphaGo match in my final year of undergrad (2016). I was also getting worried that I had studied a very impractical subject with no real-world utility and had no real chance of getting a job. So I decided to apply for a Masters in Artificial Intelligence (and machine learning) at the University of Edinburgh, and got in. At that point, I had no coding background and no maths experience beyond A-level (high-school) other than a psychology statistics course, which was awful and almost entirely centered on the magical incantations you use to do an ANOVA analysis in SPSS, which is literally the worst piece of software ever invented. I have always been interested in the nature of intelligence and thought that I would be able to do well applying what I have learned about the human brain and cognition to artificial systems, which ironically turned out to be fairly close to what I did in my PhD, although it was an exceptionally naive belief.
Master’s at Edinburgh
When I got into Edinburgh, I knew almost nothing about artificial intelligence or machine learning, other than as science fiction. I had also read some lesswrong at this point, and thought that AI sounded cool and important but potentially dangerous. I googled the prerequisites for machine learning and discovered I needed to learn lots of maths and to code. During the summer, I bought myself a calculus and linear algebra textbook and did some reading and some exercises but didn’t get far with it particularly, since I was totally unused to studying – surprisingly I had never at all had to study in high-school or even undergrad with any particular intensity. Lessons had always been super easy, and I could pretty much dash off a reasonable essay for undergrad in a couple of hours the night before. I did see Gilbert Strang’s linear algebra youtube lectures recommended widely online, and I think I watched about half of them, although I didn’t do any exercises and only picked up a superficial familiarity with linear algebra. I also went through some introductory python tutorials since I thought that was a good idea and sort of got into coding, although never used it for anything other than the exercises in the tutorials. I also spent a lot of time lurking on reddit’s machine learning and so got to know the lingo and in-group terminology of a lot of ML.
Upon arriving in Edinburgh, you are sent a list of courses to choose from, with descriptions of each course on the website. There were two compulsory courses for an AI masters: a mathematical course (either Introduction to Applied Machine Learning (IAML) or Machine Learning and Pattern Recognition (MLPR)) as well as a technical “Machine Learning Practical” (MLP) where you actually implement ML algorithms and test them. In general, as I had learnt from high-school and undergrad, you always try to take the hardest courses, which are thus the coolest, and avoid anything with ‘applied’ or ‘studies’ in the name as much as possible, so I chose to do MLPR. I also chose to take Accelerated Natural Language Processing (ANLP) since it was “accelerated” and therefore sounded tough, and I also wanted to use my linguistics background for something. If you didn’t know how to program upon entry you had to take an introductory programming course in Java for some reason (even though pretty much all the ML modules used Python (or MATLAB)). At that point I had started reading hackernews so knew that Java was terminally uncool and so pretended I already knew how to program even though I had only done like one free online Python tutorial in my life. For MLPR they made a big deal about how it was mathematically intense, and sent out a list of questions beforehand saying “if you can’t do these questions then don’t take the course or you will fail”. The questions were high-school calculus and linear algebra – matrices, derivatives, gradients etc. I had forgotten all my high-school maths and scored 0, but was somehow undeterred by this and decided it must be a good course if they were trying to convince people not to take it.
Term started and I slammed right into MLPR and MLP. MLPR was my first problem. Looking back on it, it is impressive as courses go. It starts out with the most basic elements of statistics – linear regression, and continues through logistic regression, artificial neural networks, Gaussian processes, PCA, ICA, up to variational inference and MCMC sampling. I went back to look at the course notes (which are very well done) and although they seem trivial now I remember staring for a very long time at the same notes being intensely confused. It really is a testament to the power of expertise built over long periods of time to make things comprehensible and also how deep expertise can go. There were homework sheets once a week and also coding assignments in python. The homework sheets were my bane as they usually involved proving various linear algebra identities and asking questions about how to implement or extend or prove the correctness of various learning methods, and I had never written a proof in my life nor really understood linear algebra at all. I learned an awful lot very quickly about linear algebra, eigenvalues and eigenvectors, matrices, statistics, learning theory, and bayesian inference.
MLP was also tough. They had us building ANNs from scratch (in numpy), including CNNs, and implementing backprop by ourselves. This was, let’s say, a challenge to somebody who had no coding experience but somehow I managed to get this figured out relatively quickly. At that point, despite implementing it personally, it still seemed magic how backprop worked. We also constructed networks in tensorflow (2016 vintage!) and I was completely confused about the notion of a computation graph and what static vs dynamic construction of the graph entailed. Also, I could never figure out what the “sess.run” of tensorflow actually did and what the hell the “feed_dict” was.
ANLP wasn’t that bad overall. Funnily enough (even in 2016) it barely covered any machine learning methods. I think they have revamped the course significantly now to cover them. Back then we studied HMMs for speech recognition, and various parsers for parse-trees. I spent more time than I would have liked debugging CKY parsers. I learnt a fair bit about automata, the Chomsky hierarchy, and finite state machines.
I think that the term September-December 2016 was probably the academically most intense period of my life. I was exposed to the foundations of machine learning (although I definitely didn’t fully understand them then!) while simultaneously having to learn a lot of maths (as I had no maths background) and programming (as I had no programming background). Although I understood most of the ANLP course and I could do MLP and write up neural networks in tensorflow for MNIST and CIFAR, I had huge uncertainty and stumbling blocks over pretty much all the MLPR maths. Luckily, the exam for MLPR wasn’t until the end of the year; I would undoubtedly have failed if I had had to take it over Christmas.
That term I had also met Mycah, my future wife, at a Freshers’ Week coffee crawl. Over Christmas we went on holiday and stayed a few days in Snowdonia in Wales and touristed around.
The second term was less intense than the first and was mostly consolidating my machine learning knowledge. We had to choose modules for the second term. Although I didn’t understand much of MLPR and my maths was very shaky, I decided to take the most mathematically intense course in the AI masters – Probabilistic Modelling and Reasoning (PMR). This covered Bayesian graphical models, and variational and MCMC approaches to inference in depth. I also learnt a large number of technical algorithms related to factor graphs (junction graphs) which I promptly forgot and never used again. I also took courses on RL (amazing how formidable the Bellman recursion etc seemed then and how simple it is now) and “Natural Language Understanding”, which covered ML methods in NLP. Interestingly, then (early 2017) was just before transformers burst onto the scene and so we never learnt about them – only LSTMs and RNN-based models as well as vintage techniques like topic modelling and latent Dirichlet allocation. We implemented RNNs and LSTMs from scratch. In RL I had fun implementing Q-learning from scratch on a toy car-racing game – which proved to be surprisingly challenging in the end.
Finally, for the third term we had to come up with a project proposal for our masters dissertation. The way this worked was that potential supervisors posted up short descriptions of their projects and you had to pick the five that were most interesting to you. Then, you met with the supervisors and talked to them about your suitability for the project, usually in terms of marks in the various courses you took. I chose four machine learning ones and was rejected from all of them and ended up with my fifth choice, Richard Shillcock, who wanted me to study models of autistic savantism using neural networks.
Throughout I largely avoided the main purpose of the project and instead meandered through various interesting parts of machine learning, including MCMC algorithms, siamese neural networks, and different normalization methods (batch-norm, layer-norm). At this time I also found predictive coding and the free energy principle, which were, in the end, to form the foundation of my PhD thesis, although I didn’t really understand anything at the time. Nevertheless, the ideas were sufficiently interesting that I pivoted my masters thesis towards them, ending up with a mashup of my original project with predictive coding: a predictive processing model of autism and autistic savantism, supported by neural network experiments. Somehow, despite all logic, this thesis ended up getting a distinction and indeed winning the prize for the best dissertation of the year – which I do not understand at all. Take a look here for yourself. Looking back on it now it is incredibly cringeworthy and awkward.
Progression to PhD
Nevertheless, despite my rather dilettantish work, my supervisor Richard was sufficiently impressed to offer me a PhD position. I was initially uncertain as to whether to take it, and briefly tried looking for jobs. After spending about two weeks applying for jobs, talking to recruiters and failing to get to interviews (I was not invited to a single interview – I still don’t know why. I guess I must be really bad at phone screens!) I decided to just go for the PhD, which came with a £12000/year tax-free government stipend. Considering that entry level jobs in computer science/AI paid only around £25000 a year in salary in Edinburgh, UK, I decided that this was not actually that great an opportunity cost and went for it.
Over the summer when I was meant to be working on my masters dissertation, I also discovered the Julia programming language – which sounded amazing, and I spent a lot of time learning it. I would eventually do a machine learning paper in Julia although I never really got into working on it full time.
My original PhD research project which Richard intended me to work on was developing further ideas from my master’s project. The idea was that if we split the input into separate modalities, as in the brain, and then used an autoencoder to predict from one modality to another then this would be useful for the brain and would give rise to better learning and representations in some (pretty undefined) sense. At this point, I didn’t really know what was going on and only had a fairly slippery hold on ML concepts. So I set about trying to do this. I also discovered Keras which (in late 2017) was a billion times better than raw tensorflow so I rapidly started using that for everything. I hadn’t discovered PyTorch yet. While working on the startup, I also put some time into working on this paper and in a couple of months, by December 2017 had worked up a decent set of experiments and a basic paper on this. I wasn’t very keen on it at all and thought it was pretty rubbish, but wrote it up and put it on psyarxiv. I was still very bad at writing papers, as you can see.
Also at this time (February 2018), I proposed to Mycah and we got engaged. We took a three week working holiday to Prague since we both worked remotely (well, I didn’t but Richard didn’t mind me not being in the office for those three weeks). I didn’t get much work done in that time.
March 2018: (6 months in) By March 2018 I decided that my startup was doomed and I needed to focus on my PhD. In April, I got more seriously into predictive coding and studied the Bogacz tutorial, and implemented my own super basic versions in Python.
April 2018: I also came up with the idea of relating predictive coding to the phenomenon of retinal stabilization which is where if an image has a fixed position on the retina it rapidly fades from conscious perception, sometimes in less than a second. I tied this in with predictive coding ideas of prediction error minimization where the basic idea was that if the image is stabilized on the retina, it quickly becomes perfectly predicted and thus there are no prediction errors to subserve conscious awareness. I also had the idea that small jittery eye movements known as fixational eye movements may serve the purpose of implicit data augmentation for the brain by simply presenting the same scene at multiple distinct angles. While I think the ideas behind this were interesting, the implementation and testing of them wasn’t and I produced another rubbish paper (and code).
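The intuition in this paragraph can be sketched as a toy simulation. This is a minimal illustration and not the model from the paper: it assumes a single scalar "pixel", a prediction updated by a simple delta rule, and Gaussian jitter standing in for fixational eye movements. The function name `prediction_errors`, the learning rate, and the noise level are all hypothetical choices made for the example.

```python
import random


def prediction_errors(inputs, learning_rate=0.2):
    """Track a scalar input with a running prediction; return |error| per step."""
    prediction = 0.0
    errors = []
    for x in inputs:
        error = x - prediction                # prediction error at this time step
        prediction += learning_rate * error   # nudge the prediction towards the input
        errors.append(abs(error))
    return errors


random.seed(0)
steps = 200

# Stabilized image: the same value arrives at the "retina" on every step.
stabilized = [1.0] * steps
# Jittered image: small random displacements change the input on every step.
jittered = [1.0 + random.gauss(0.0, 0.5) for _ in range(steps)]

stabilized_errors = prediction_errors(stabilized)
jittered_errors = prediction_errors(jittered)

# Average error over the last 50 steps, once the prediction has settled.
late_stabilized = sum(stabilized_errors[-50:]) / 50
late_jittered = sum(jittered_errors[-50:]) / 50

print(f"late error, stabilized input: {late_stabilized:.2e}")
print(f"late error, jittered input:   {late_jittered:.2e}")
```

Under these assumptions the error for the stabilized input shrinks geometrically towards zero, while the jittered input keeps producing prediction errors, which is the qualitative pattern the fading argument relies on.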
May 2018: I spent most of May finishing up the retinal stabilization paper. I also got briefly into eye tracking (foreshadowing other work) and became interested in optimal foraging theory – specifically the idea that eye movements can be considered within the optimal foraging framework – i.e. they ‘forage for information’. I conducted a quick analysis of an eye tracking dataset to see if eye movements followed the characteristic Lévy flight distribution used in optimal foraging, which is a heavy-tailed distribution empirically characterised in animal foraging. This paper ended up being so rubbish that I didn’t even put it on psyarxiv (but it can be found here for posterity and the code)
This month Richard also had the idea of studying vocal imitation in seals (bats) where seals will tend to imitate other seals and thus make their calls more similar to their surrounding seals. Together we chatted about this and thought that it could give rise to ‘gradients’ in the soundscape whereby seals imitate other seals which would allow mothers to find their pups in a crowded colony. I thought this was a cool idea and quickly coded up some basic simulations of this, which eventually found their way into this paper (code).
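For readers wondering what checking for a Lévy flight distribution involves, here is a minimal sketch. It is not the analysis from the paper: it assumes the step lengths (for example saccade amplitudes) follow a pure power law above a known cutoff, and it uses synthetic data in place of a real eye tracking dataset. The function name `power_law_exponent` and the values of `x_min` and `true_alpha` are hypothetical.

```python
import math
import random


def power_law_exponent(step_lengths, x_min):
    """Maximum-likelihood estimate of alpha for p(x) ~ x^(-alpha), x >= x_min."""
    tail = [x for x in step_lengths if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum


random.seed(1)
true_alpha = 2.0   # assumed exponent used to generate the synthetic data
x_min = 1.0        # assumed lower cutoff of the power-law tail

# Inverse-transform sampling from a pure power law with exponent true_alpha.
# 1 - random.random() lies in (0, 1], so the base is never zero.
steps = [
    x_min * (1.0 - random.random()) ** (-1.0 / (true_alpha - 1.0))
    for _ in range(20000)
]

alpha_hat = power_law_exponent(steps, x_min)
print(f"estimated exponent: {alpha_hat:.3f}")
```

Lévy flights are usually described as having power-law step lengths with an exponent between 1 and 3, so an estimate in that range is consistent with the hypothesis. On real data a proper test would also need to choose `x_min` carefully and compare against alternatives such as an exponential or log-normal distribution.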
June 2018: By this time I was feeling pretty adrift in my PhD. I had been at it for almost a year and still had produced nothing particularly useful. I was still interested in predictive coding and had been reading a lot of papers about that and the free energy principle although I didn’t really understand it at all at any level of mathematical detail. I was casting around for ideas to latch onto. One thing I did was try to start implementing predictive processing models in Python and started, rather grandiosely to create a large predictive coding library, implementing all the algorithms from the literature. I worked on this off and on for the next three months. Remnants of this project can be found here.
I also decided to try and help out Richard with his eye tracking work, given that I had no other real ideas of what to do. Although he initially warned me against it, he eventually accepted me in and I started reading about LMER analyses and started analysing the huge tract of eye tracking data that he and previous collaborators had recorded. I studied a big textbook on LMER analyses, started learning R, and got good at various R statistics packages. He was right in his advice though, in that I rapidly discovered that doing statistical analyses of eye tracking experiments was not something I particularly enjoyed, and I found it hard to have a lot of motivation for the questions we were investigating.
July 2018: I continued working on my predictive coding library and on the eye tracking. For the eye tracking my question shifted several times and I ended up looking at whether, empirically, we can understand eye movements as maximizing information gain. I was also looking at whether dyslexics and non-dyslexics had different hemispheric patterns in their eye movements. By the end of July 2018 I had become disillusioned with this and basically stopped. Code from this time can be found here.
I also continued working on the predictive coding library. I also started getting into Rust and went through a whole bunch of tutorials for the language. I also started working on the nand2tetris tutorial series which teaches you how to build a computer from the ground up including logic gates.
August 2018: I continued working on my predictive coding library and also did a fair bit of Rust coding. I also got more into eye tracking on the computer and played around with various libraries for controlling your computer with eye tracking, including from webcams, which was cool. I didn’t make much paper progress at this time.
September 2018: I pretty much stopped working on PhD stuff to pursue my interests in computer eye tracking. I had the idea that I would use this as the basis for some startup idea combining it with machine learning, although I didn’t do much with it at the time. Also, I was getting married to Mycah on 7 October 2018 and wedding planning took up a lot of time.
October 2018: I got married to Mycah and took a two week honeymoon which took up most of the month. Minimal PhD work was done.
November 2018: After the honeymoon was over, I returned to a very disappointing PhD situation. I had been doing a PhD for more than a year and had produced no publishable work and didn’t even know what I was to work on for my main thesis. Richard, my supervisor, said I had nothing to worry about, but of course I was worried. I decided I needed to focus on a single topic, as a lack of focus had been my major problem previously. I decided to focus entirely on predictive coding and the free energy principle, since that was the only thing that had continuously held my interest for more than a year although I had never studied it formally. After a few weeks of fooling around and vaguely reading papers, I sat down and exhaustively studied the Buckley tutorial, and the Friston 2003 and 2005 papers in depth and tried to understand everything. It took me about two weeks to understand just those two papers.
December 2018: By early December I had a reasonably good understanding of what was going on. I spent the rest of December implementing and experimenting with basic predictive coding networks, including dynamical predictive coding networks and DEM (although I didn’t fully understand it mathematically). Overall I wasn’t super productive in a directed way over Christmas but read a lot of papers.
January 2019: I started seriously experimenting with the predictive coding networks with the aim of writing up a paper and by the end of January I had written up this “paper”, or really a report (paper and code) which goes through my progress in implementing these models. I was aware, however, that this was not really a contribution but instead just copying what other people had done and actually implementing their maths. I wanted to do something new.
Also at this time I decided that I was too far behind in various mathematical and computer science prerequisites and started a habit I maintained throughout 2019 and some of 2020 of spending two hours or more a day watching lecture videos to increase my understanding of key topics. At this point, this helped a great deal and gave me a lot of intellectual breadth, although it ate into direct paper productivity by a surprising amount (much more than the supposed 2 hours / day). See the intellectual development 2019 and 2020 for more.
February 2019: I wanted to do something original and decided to combine the predictive coding and active inference models I was working on to see if I could produce anything useful. I spent most of February working on this on the simple cart-pole environment. My approach was immensely simple and did not really work very well. I spent an awful lot of time tuning this awful code to no avail.
March 2019: I focused on getting these cart-pole results out, which I did about the middle of March. This resulted in this paper and code which is still terrible, although I think you can discern some degree of improvement over my previous papers at this point – it at least seems somewhat more professional. I submitted it to the journal Cognitive Science and eventually got at least neutral responses and revisions instead of a pure rejection.
Towards the end of March I also started getting into deep reinforcement learning, and especially control as inference.
April 2019: I continued studying deep reinforcement learning and control as inference. I really struggled, especially with control as inference and understanding how everything fit together. I spent many days in the library staring at the Levine paper without much understanding. But I knew that fundamentally what I wanted to do was to ‘scale up’ my current methods in active inference using deep reinforcement learning. In January I had also read Kai Ueltzhoffer’s deep active inference paper and David Ha’s world models paper, and decided that combining the two would be my next goal after the cart-pole paper. I also learnt Julia and implemented some simple reinforcement learning models. At this time I was also trying to understand DEM and emailed Chris and Manuel about it, which turned out to be a huge turning point later.
I also was deeply trying to understand the discrete-state-space active inference models which had eluded me for many months. I finally understood these in late April 2019.
At the end of April and first week of May, Mycah and I spent a week and a half in Paris on holiday which was a nice reset.
May 2019: This month I had a pretty good idea of what I was trying to show – replicate the David Ha paper but using active inference to create a good model. Stupidly, I decided to first try to replicate the David Ha paper entirely in Julia using Flux, which meant I spent more than a month futilely fighting with Flux and attempting to get the model working without success. This was a huge mistake and I never got this working.
June 2019: I was now almost two years into my PhD and had no papers. I was getting worried. I focused completely on the deep active inference paper and eventually gave up on trying to replicate the David Ha paper, instead just building a super simple actor-critic agent which I ported from PyTorch code to Julia. This eventually worked and by the end of June I had written up my first decent paper, Deep active inference as variational policy gradients (paper and code). This is the first paper I think was of publishable quality and in fact was my first published paper – after TWO YEARS of PhD work. It took me a surprisingly long time to get to this level, which has consistently surprised me, as has how easy it is to keep pushing out papers once you have arrived at that level.
July 2019: I started working on a followup for the paper but with a variational autoencoder perception model to allow it to deal with pixels and partially observed state. I never got around to finishing this but should have, since it would have been another relatively straightforward paper to get from where I currently was. However it would have been very incremental and not super exciting intellectually.
I had emailed Chris and Manuel about my new paper and they invited me down to Sussex to talk about it. This ended up being a crucial conversation which would eventually lead to them inviting me down to Brighton for a research visit which would last all of 2020, where I effectively joined their team, and which ultimately made my PhD. I gave the talk in the last week of July. After giving the talk I started chatting to Alec Tschantz, another PhD student at Sussex who was working on similar ideas; we agreed to collaborate on extending our deep active inference work after I returned.
Other things also started moving. I was contacted by Giovanni Pezzulo who was impressed with my work and wanted to recruit me for a postdoc in Rome, Italy after my PhD.
August 2019: I collaborated mostly with Alec Tschantz and we were working on ideas of optimization in the latent space of deep reinforcement learning models, which never went anywhere, although very similar ideas ended up being executed much better than my attempt by Danijar in Dreamer.
September 2019: Similarly, I spent a lot of time working on these deep reinforcement learning ideas which did not end up going anywhere. I also got seriously into stochastic differential equations and stochastic optimal control. I studied a lot of control theory and started developing my ideas relating kalman filtering to predictive coding. I kept trying (without success) to understand the DEM algorithms and had a vague idea that colored (smooth) noise could improve inference algorithms. In these months, I also joined the University of Edinburgh Hyperloop club, which makes a functioning hyperloop pod for the SpaceX hyperloop challenge. I helped out with the Kalman filtering and navigation and implemented a Kalman filter for the pod.
October 2019: Mycah and I visited her parents in Tennessee for three weeks, so I was completely on holiday and did no PhD work. At the end of the month, I did, however, have a mathematical breakthrough: I understood how Kalman filtering relates to predictive coding and finally understood Kalman filtering properly.
November 2019: I worked on my Kalman filtering and predictive coding paper, which eventually became the Neural Kalman Filter paper (and code), which I sat on for a year until I eventually threw it on arxiv. I also became very interested in nonlinear filtering theory, and spent a week studying Kutschireiter’s tutorial in depth and went through a lot of lectures in this topic. I started trying to understand how to build nonlinear filtering algorithms which worked under the assumption of smooth noise and whether that would work. In general, I had stopped work on my many deep reinforcement learning projects which were still open.
At the beginning of November I also attended a conference in Edinburgh on the free energy principle and predictive processing. I chatted a fair bit with Pezzulo, who wanted to offer me a postdoc. Crucially, I also chatted with Christopher Buckley, who offered a research placement in his lab in 2020, which I accepted. We agreed I would start in early January.
December 2019: I continued working on my ideas for control theory with smooth (analytic) noise, which largely went nowhere. I studied a lot of stochastic differential equations. I also had a relaxed Christmas in Edinburgh with Mycah.
Note at this point I was 2 and a quarter years through my PhD and had only had one paper published. Things were looking very precarious and uncertain. I had, however, mostly imperceptibly at the time, managed to build up a lot of knowledge of the literature across a number of fields and my general intellectual development was going well due to my lecture course habit.
January 2020: On the 2nd of January 2020, Mycah and I loaded everything into a van and drove down from Edinburgh to Brighton to take up the research fellowship at the University of Sussex. I started officially on Jan 4th. For the first month there, Alec and I worked very closely on our ideas of merging active inference with deep reinforcement learning. We especially focused on understanding the nature of exploration in deep reinforcement learning, and the ability of information-directed exploration to enable superior exploration in sparse-reward environments. While we were originally intending to submit to ICML 2020, we missed the deadline and ended up submitting to an ICLR workshop instead. Here is the paper and code. While doing this Alec and I worked very closely together and I learned a lot from the mentorship of Chris and Alec, especially a lot of the tricks to make model-based RL work in practice. Alec especially impressed me with his incredible grasp of the deep RL literature and coding skills, and Chris with his mathematical expertise. I had to up my game to impress them. While we were working on the deep model-based reinforcement learning method, Chris and I were also focused on how to derive the expected free energy from the log model evidence. I spent many bus rides home (a 40 minute bus ride to and from the office) scribbling maths in my notepad. After about two weeks we discovered that it could not be derived in this way, and therefore I spent a lot of time understanding how the expected free energy relates to variational inference.
February 2020: I spent a lot of time honing our model-based reinforcement learning paper (Reinforcement learning through active inference) and submitting that to the workshop. Also, I focused heavily on writing up my thoughts on the expected free energy, resulting in this paper, which is my second paper to be published in a journal, and one which I am very proud of. I think these two papers of early 2020 represent me finally breaking through the research barrier and making real, if small, contributions to the state of the art in various fields. I also spent some time working on predictive coding models, and started implementing these again, and did some brief work on smooth noises that never went anywhere.
March 2020: Chris and Alec had ideas about adding a ‘feedforward sweep’ to predictive coding models using amortized inference, like variational autoencoders. They had only a fairly vague idea and by working closely with Chris I managed to make this mathematically precise and coded up the first implementation. I then worked closely with Alec converting this from numpy to pytorch and scaling it up to handle better tasks much faster than before. While coding up these initial predictive coding networks, I also had ideas about biological plausibility (I had been studying the biologically plausible backpropagation literature, and especially James Whittington’s equivalence between predictive coding and backprop). I came up with relaxations of the predictive coding algorithm, which would eventually become this paper (and code), and implemented the first versions of these relaxations to show that they worked. By the beginning of April I had written up a first draft of the relaxed predictive coding paper to show them.
We were also working closely together on extending our deep reinforcement learning work. I spent some time trying to replicate Danijar’s work on Dreamer, and also spent a fair bit of time implementing and experimenting with additional model-based planning algorithms which never really went anywhere. I started to really grok the distinction between iterative and amortized inference and its relationship to model-free and model-based RL at this time.
April 2020: I continued to work on the predictive coding with a feedforward sweep, which we called hybrid predictive coding, and would eventually (two years later!) result in this paper and code. I also started working with Conor Heins on his active inference model of schooling and collective behaviour (still not out yet in 2023!) and helped out a lot with the maths there. I also continued working on the deep reinforcement learning, especially on seeing whether reward information gain, as opposed to just state information gain, helped exploration for deep reinforcement learning agents. Once again this work has largely fallen by the wayside.
May 2020: I became heavily interested in biologically plausible backprop and read a ton of papers in this area. This month I also heavily focused on working on my proof and paper showing that predictive coding approximates backprop on arbitrary computation graphs. Specifically, I spent a lot of time implementing backprop in CNNs and LSTMs from scratch to demonstrate that predictive coding can learn in these models and in doing so I obtained a really good understanding of backprop. This resulted in this paper and code which got rejected from NeurIPS, although it got shared widely on Twitter and has become my most popular paper.
June 2020: After the diversion of the predictive coding backprop paper I continued work on the hybrid inference paper in predictive coding. I also started getting seriously interested in continuing the work from the ‘Whence The Expected Free Energy’ paper about the mathematical origins of exploration, and the relationship between the expected free energy and other approaches, including control as inference. I wrote up my thoughts in this workshop paper which demonstrates the relationship between active inference and control as inference in terms of their objectives, and focuses again on the relationship to the expected free energy. Really, from a personal perspective, the important thing about this work is that in writing it I properly grokked the control as inference mathematical framework.
Additionally, Alec and I thought about how to port the ideas in the hybrid paper over to the deep reinforcement learning literature, and we came up with the idea of hybrid inference. Here essentially we can conceptualize model-free RL as amortized inference and model-based RL as iterative inference, with hybrid inference combining the two. This immediately led to this paper, and we also worked strongly together on an actual implementation of this model, which was eventually published here.
July 2020: I spent a lot of time working on and thinking about the nature of exploration gain and information-seeking exploration in machine learning, and eventually had a mathematical breakthrough and realized where it comes from (which eventually made it into Chapter 6 of my PhD thesis although it is not published elsewhere). I also did a lot of reading into model-based planning algorithms and tried to do a literature survey on biologically plausible approximations to backprop. GPT3 also came out and I spent a fair bit of time reading Gwern and thinking about its implications.
August 2020: I invented a new algorithm for biologically plausible (ish) backprop in the brain – which I called Activation Relaxation – did all the experiments and wrote this paper (and code). Also, I looped back to active inference and worked with Alec and Axel Constant to create this paper which was quick and relatively painless. I also spent a fair bit of time working on a direct comparison of a bunch of model-based planning methods, which largely went nowhere. In retrospect this was a pretty bad idea which would have sucked up a lot of time for relatively little gain.
September 2020: I spent a lot of time continuing to work on the exploration and information gain stuff as well as the relaxing predictive coding paper. I was spending a lot of time trying (without a lot of success) to scale up those results to larger networks and get them working more reliably. I also spent a fair amount of time trying to get what I called ‘continual predictive coding’ – i.e. where you can run inference and weight updates simultaneously – working, but I never did. Eventually, years later, Tommaso would get this working (turns out what you mostly need is a high inference learning rate) and call it incremental predictive coding.
I also started to get seriously worried about finding a job to do after my PhD. My PhD funding was meant to run out in September (3 years from September 2017) but was extended by the UK government for another 6 months due to Covid. Due to only earning £12000 / year I had only a few months’ savings and could not live for a long time without any additional money. I thought about applying to some jobs but didn’t find any machine learning jobs I was interested in. I also thought about applying for postdocs and here I found Rafal Bogacz’ postdoc in Oxford. He had reached out to Chris and me a few months back about the predictive coding paper and I saw that he had a postdoc available on his website, so I restarted the conversation and applied.
Another thing I was considering doing was another startup. I didn’t have any direct ideas yet but GPT3 had been released and I was pretty sure I could do something interesting here. I spent a lot of time brainstorming ideas for potential startups and thinking about them. I also applied to Entrepreneur First (EF) and got through to their first stage and would ultimately get an offer which I rejected to instead do the postdoc. I also discovered EleutherAI and started following along with their discord although I was always too intimidated by the social aspects to actually get involved and post – which was a major mistake.
October 2020: I started to get more serious about my second startup idea and continued in the same doomed path by creating a potential product but still without any investor or customer validation. I also went through interviews for Rafal’s postdoc and EF and got acceptances for both and spent a lot of time agonizing about what I should actually do. I also continued trying to derive information gain and exploration from a variational free energy objective. I was however mostly playing with GPT3 and starting to read about transformers. I also got very interested in the paper ‘Hopfield Networks is All You Need’ and wrote up a blog post showing my understanding of them and a walkthrough. I ultimately decided to accept the postdoc with Rafal instead of going for a startup at EF. This was for a couple of reasons. Firstly, I was unsure of actually wanting or needing a cofounder and already thought I had an idea, whereas EF is set up for the opposite. Secondly, the investment terms of £80k for 10% are exceptionally bad for founders (an £800k valuation!?). Thirdly, I still had a lot of hope in the idea of PC and many ideas to explore here which seemed extremely important and potentially the future of neuroscience and credit assignment in the brain. Postponing startup stuff for a few years while this got cracked seemed like a reasonable decision.
November 2020: I spent most of this month continuing to try to understand information gain in active inference models. I also spent a lot of time trying to figure out continual predictive coding. I also became heavily interested in spiking networks and trying to understand credit assignment through time, and implemented SuperSpike and e-prop. I spent a lot of time trying to design fully general temporal credit assignment algorithms without success. I also read a lot about equilibrium propagation and target propagation and toyed with the idea of writing up a review of biologically plausible credit assignment algorithms as an update to this paper by James Whittington. In retrospect, this would have been a good idea to do in about a year’s time or later and would have been valuable. Unfortunately, I am too far outside of the field now and don’t really have the time or inclination to write such a thing up, although I probably have most of the information to and could do it in a few months.
By this point, I had also come to a plan for my thesis, which needed to be done by March 2021 before my funding for my PhD ran out. I would basically start on writing up the thesis after Christmas so that by the end of December I needed to be finished with all my current research projects. Then, before my postdoc started on the first of April, I would have three months in Oxford, hopefully free of distractions to actually write the thesis. If we assume a thesis has 6 chapters, then this is approximately 2 chapters a month or a chapter every two weeks, which is not a prohibitively massive amount. This was an ambitious, but actually doable plan and it was necessary to get the PhD on time.
December 2020: I did not do much over December and Christmas. I finished writing up the information gain paper and had a finished draft which eventually became this paper (and code) and presented it at a TNB meeting to Karl Friston. I also spent some time doing startup stuff and a lot of time just relaxing around Christmas. After Christmas, I had the idea of writing a review of predictive coding methods. I wrote up about half of this paper in the final week of December and during the new year’s break.
January 2021: Early January was spent in moving from Brighton to Oxford and getting settled in. In the first two weeks, I also finished up the predictive coding review paper and sent it to Chris and Anil for feedback. I then began about 2 or 3 weeks late on the actual thesis. With a week or two at the end scheduled as slack for review, and having realized that I was actually going to write 8 chapters, this put me at the limit of a chapter a week. In the second half of January, I actually beat this pace and managed to write three main chapters – the introduction, and chapters 3 and 4 of the thesis on predictive coding and deep active inference.
February 2021: February was spent entirely in thesis writing mode. The only paper I did was to adapt my second chapter, which was an introduction to the free energy principle, into a standalone paper which was never very successful. I wrote up chapter 2 of the thesis as well as most of chapters 5 and 6.
March 2021: By March I was actually doing well and ahead of schedule with the writing. The thesis was feeling eminently finishable, and indeed it was. By the second week of March, I had finished off a full draft of the thesis and emailed it out to Chris and Anil for comments. I also spent a little bit of time reading about the basal ganglia to get ready to start my postdoc, and started having virtual (we were still in covid lockdown) meetings with Rafal and Yuhang and Tommaso, with whom I would be collaborating closely during the postdoc. But, by and large, the PhD thesis and thus the PhD was done! I had actually done it and completed a PhD. You can find the final PhD thesis here as well as all the LaTeX code for it here.
On 1st of April I started the postdoc. I was slightly delayed in the official submission due to waiting on comments from Chris and Anil and Richard, my official supervisor. I submitted it around mid April. In mid June I had the official viva with Karl Friston as my external examiner and passed with no corrections. I was now officially a Dr. After a long 3 and a half years of work and mostly aimless wandering, I was done.
It is really fascinating to look back and reflect on my intellectual development over the course of my PhD. In hindsight, it appears that almost all development has taken place in two large spurts, with slower growth in between. Crucially, these growth spurts emerged whenever I was thrown into a very challenging environment. The first was my masters degree, which is perhaps the period of most rapid intellectual growth I have ever undergone, where I learned to become proficient in coding and machine learning from scratch while having no background in programming and basic maths and having to learn them all simultaneously. The second was my time in Sussex as a visiting scholar, which has been absolutely instrumental to my progress, since I think having the competition with other PhD students and good mentorship has been vital for my continued intellectual development.
Another thing which I think is really hard to appreciate without a PhD, and which you can only see when looking back, is just how deep expertise can go and how sticky it is – both how hard everything is when you don’t have it but simultaneously how easy everything is when you do have the expertise. For instance in machine learning, when I was just starting, I remember taking weeks and being completely befuddled by relatively straightforward things which I can now breeze by – this also means it becomes increasingly difficult to remember how hard it was to do them in the first place.
Another skill which I have improved immensely over the course of my PhD is how to write decent academic prose extremely fluently and rapidly. When I was starting out, writing papers was incredibly difficult and very jarring and slow, and they still somehow sounded amateurish. If you compare them to some of my more recent work now, the difference is astounding. What’s more, my more recent work, for instance my predictive coding review, was composed over just two weeks of research and writing – while it’s over 50 pages long and took almost a month for the supervisors to read it. Additionally, much of my PhD thesis was written in about a month of real time and probably about 1-2 weeks of full time work, which is astounding. This contrasts with my first ever accepted paper (Deep Active Inference as Variational Policy Gradients) which took me about two weeks to write up the first draft, and it is much worse than the predictive coding review.
In general, it is very easy to underestimate the amount of skill that goes into making an artifact – whether it is a scientific paper or a product or some code or whatever. It looks easy from the outside, and indeed, in theory, it often is easy, but the practice is where the difficulty lies. There is almost always a vast amount of tacit knowledge which must be learned with difficulty and with trial and error before you can really become productive in a new field. This is why the research bar is so high – or at least I found it to be so. Usually, I would say it takes about 2 years of research practice before you can start seriously contributing to scientific research in a meaningful way. This is approximately how long it took me. With good mentorship it is probably less, about 1 year. This good mentorship is a big if. This is probably why PhD programmes are as long as they are. Nevertheless, when I look at the CVs of people, especially in the US, many have a considerable amount of research experience from undergrad, and may even have papers. It lets me understand how that works as essentially an apprenticeship period so people can be productive immediately upon starting the PhD. I feel that this has increased in prevalence a lot over the last five years, as things have become so much more competitive than they were even when I started my PhD in 2017 (and it felt plenty competitive then).
Another thing I have learnt deeply is the sheer importance of mentorship and feedback – before I went down to Sussex and found a group of people in my field who could give me feedback, I was totally lost and immensely unproductive in comparison to my final PhD year of 2020. It is not the unproductivity of not doing work, but rather that the work is much less impactful and never succeeds as well. One of the primary functions of feedback is to reorient you when you are going down unproductive paths, which allows you to spend much more of your time (although never all of it – never more than half in my experience) on things that will end up being worthwhile. Almost all experiments you run and ideas you have will never make it into the final paper. Additionally, feedback is necessary to develop your nose for which problems are interesting to tackle and which are likely to be uninteresting dead ends. It is all too easy while you are by yourself to focus overmuch on some narrow question and never look outside and ask whether the question is interesting, useful, or tractable at all. I definitely fell into this trap a large number of times, especially in my second year where almost everything I worked on didn’t turn out very useful. In the first year, I was so undirected that I didn’t even get far enough to go in the wrong direction. Only in the final year did I have enough feedback and a correctly trained ‘nose’ to be able to have even 50% of my work be vaguely useful. Even then, in the macro picture, it is unclear how useful any of this actually was. Predictive coding and active inference, while contributing some theoretical insights, did not turn out to be vital insights for ML progress or achieving AGI as I had initially hoped.
On the other hand, once you have passed the research barrier, you are now in an amazing position atop the commanding heights of the field. I feel like over the last 6 months or so I have begun to really appreciate that feeling of power and mastery of a field. Of course, you cannot stay there for long; you must find new fields to master. But the key point is that once you have mastered one field, and reached the top of one mountain, it is always easier to reach the peak of the next and the next. The real barrier is climbing that first peak. After that it is still a lot of work, but you know the destination you are struggling towards, as well as what it is like to climb a peak, and this knowledge makes it vastly easier to find your way in the future.
Also the nature of inspiration and discovery is different from how I always imagined it. While I definitely cannot claim to be an Einstein, I feel like I have made several discoveries which enhance human knowledge in a small way. These are principally: 1.) the general relationship between active inference and control as inference, 2.) the origin of information-seeking exploratory terms, 3.) the activation relaxation algorithm, and 4.) a general understanding of predictive coding and its relationship to backpropagation. In none of these cases was there a ‘eureka moment’. In all cases, there was just a slow progression towards the solution, meandering back and forwards around it, until suddenly stumbling upon it at just the right time. This stumbling was never really sudden and often did not feel profound. It just felt like things slowly slipped into place. Sometimes it wasn’t even obvious that you had figured it out for a while afterwards, as you check your answer and have a bit of a longer think about it. The closest was the AR algorithm which came to me one evening as I was thinking about why there need to be prediction errors in predictive coding, and it just fell out quite naturally. It was a while before I even realized what I had done then and what I had figured out. In all the other cases, there was the slow accumulation of knowledge, often completely imperceptible, and you just slowly tie something together which makes perfect sense and seems extremely obvious to you, but is much less obvious to other people and they seem surprised by it or interested in it.
Another thing which is worth reflecting on is just how much better you slowly get at working over time. It has been a very slow struggle, but I have come on immensely in this regard during my PhD and hope that I will continue to do so. I never had to learn to intensely focus on anything in undergraduate education and definitely not in high school, so the first time I had to do any serious and intense work was in my masters degree and, unsurprisingly, I was really bad at it. When I started out my PhD I was almost entirely incapable of doing more than a few hours of focused work a day. By the end, I managed to reliably get up to about 5-6 hours from 2-3 hours a day – roughly doubling my effective productivity. Hopefully this trend will continue although it gets increasingly harder to squeeze out useful hours. My primary problem is that now I no longer have huge stretches of unstructured time but instead a large number of meetings with collaborators and people I am supervising and so forth. I am slowly transitioning to being a manager. Figuring out how to be productive in the interstitial times between meetings seems to be the next big challenge of time management. I especially am finding myself limited on time for unstructured deep thought, which is necessary for creative progress. This is something I need to carve out time explicitly to address. This is challenging, though, because a key part of insight is spontaneity – you get insights at unusual times when you are not expecting them. It seems unrealistic to carve out an ‘insight hour’ to have deep thoughts in. Exactly how to do well with this remains unknown to me thus far.
I have also found it useful and interesting to peel back your day and think about how much time you actually spend doing something serious. Like fully focused, intense work. For me, it has always been much less than I think – definitely less than half the time I have spent ‘working’ and probably close to one quarter of that time if that. In undergrad I could probably only do about an hour of real work a day, and about 3-4 hours of distracted ‘work’. I was really really bad. I spent the rest of the day on social media or reddit and was absurdly ineffective. By the time I started my masters I could probably manage about 2 hours of real work out of a total of about 8 hours of work a day. Throughout my PhD I’ve been slowly increasing that time to about 12 hours of ‘work’ and about 4 hours of real deep work every day, which is getting to the maximum of what the day supports – and I am likely a workaholic. It is clear that the key is maximizing the amount of ‘real’ work you do, although the meme that you only really do four hours of work a day is there for a reason. I firmly believe that it takes a lot of practice, and potentially maturation and development, but you can develop the capability for more real or deep work over time. It’s like a muscle. One that develops very slowly, but can under sufficient pressure be adjusted upwards, allowing you to become much more effective over time. I hope this process continues and I can, if I set up my life right, achieve much higher levels of real work over time.
Also, this I think gets to one of the key advantages of tiger parenting schemes for children and why child prodigies do so well. In effect, more than anything they actually learn during that work, they have been slowly exercising their focus and work muscles over a formative period of development. This means that when they compete with morons like me in undergrad, they can pull 4-8 hours of deep work a day while I struggle to pull 1. It is no surprise that against that I am doomed. They have about 10-15 years of work practice on me – which is absolutely enormous and will take many decades to catch up on. It’s amazing how much early work and early practice does overall to generate incredible skill. Another reason that starting work early is so effective is that you have so much time and so little responsibility. In undergrad and even in the first two years of my PhD I had entire days where I had literally nothing to do except read and think. Now most days I have at least one meeting or other necessary event as well as a bunch of admin which eats up precious time. So you can never truly focus or get into the obsessive depths you could without these tethers to reality. Of course having all this unstructured time, without discipline, which I did not have, inevitably just leads to massive procrastination and waste rather than deep productivity. But the potential is there.
Further reflections in 2023
The previous set of reflections was written in mid 2021 while I was still deep in academia. I am now older and hopefully have some more ideas as well as experience in industry and AI safety.
In retrospect, basically, there were a couple of major mistakes I made in my PhD not listed here. Firstly, I should have tried much more to get internships at top AI labs. I did not realize it so much then but there is a massive difference in quality between average academia, where I was, and elite AI labs in terms of productivity and exposure to the technology. It would have been highly valuable experience doing this at any time in my PhD and especially before I received mentorship in Sussex. This could potentially have propelled my PhD in a totally different and probably better direction and also given me the option of actually leaving to join one of the labs at what would have turned out to have been an incredibly opportune time. Imagine joining OpenAI in 2018 – that would have been quite a journey! I basically made no effort to do so and this was primarily out of a sense of inferiority and that ‘I stood no chance’. Having interacted with the interns and PhD students who actually get these internships I now realize that I would have been fine and was probably pretty average for this demographic and not vastly inferior. This kind of impostor syndrome is highly limiting primarily because it stops you from even seeking out these kinds of opportunities in the first place and hence implicitly confirms itself.
My startup ideas were doomed primarily because I had zero connections to any kind of VC or funders and made no attempt to make them. The entire VC game revolves around these social connections and trying to build a product by yourself and then get VC traction is an exceptionally bad way to move forward and just wastes so much time. The correct way is to find cofounders, pitch yourselves to VCs based on charisma, get money, and then use that money to hire people to actually build the thing and get more money and repeat. Also the way this works in industry is very different from how I imagined in academia and I just had no idea how different it actually was. Despite this, I definitely made 100% the correct decision going into academia for a PhD instead of industry in Edinburgh since all of those jobs that I could have got there were entirely dead ends which would have taken me further and further away from the actual core of AI while the PhD preserved my optionality and gave me some experience reading papers and getting up to the frontier, as well as three years of extremely wide reading and study which I feel has been invaluable and would have been almost impossible to do with a real job.
I think that overall I probably spent too long in academia in general and was far past the point of marginal returns. However, it was unclear how exactly to leave academia earlier and I did not have any good offers on the table. I should also have got deeply involved with EleutherAI since that was a fantastic opportunity to make a big impact and was very cool. I actually did recognize this as it was starting and I started to follow it in late 2020. However, I always had various PhD work etc to do instead and was slightly intimidated to contribute so I procrastinated and this cost me a lot of potential value. This was also a big mistake. If there is a very cool scene forming working on things that are clearly going to be important – you should join! THis should be high priority. Even if it does not look immediately promising, the scope for upward unknown unknowns is very high. A similar story happened with Lesswrong which I have been following off and on since 2013 (!) but was always too intimidated to contribute. I would have been a lot better off had I started thinking and posting my blog ideas there as soon as I had joined or even during my PhD. Only having it as my official ‘job’ to make LW posts has broken through that psychological barrier for me.
Finally, it is surprising how little most of the stuff I did mattered, which is sad. I was constantly thinking I was on the edge of some big breakthrough in neuroscience, but that never really happened and it seems unlikely that PC can provide this at this point. I also spent so much time working and doing stuff which went nowhere, when I could have spent that time having much more fun with my wife, reading, or doing some hobbies. It should be a counter to workaholism to realize that almost all of what you do during work is actually completely wasted and you are just objectively better off spending that time enjoying yourself. You only live once!
I should have had more metacognition and explored further alternatives. However, the actual number of big breakthroughs in the field is exceptionally small and, on the outside view, it seems highly unlikely that I would have found one even if I had done something different. Essentially, there were no massive breakthroughs in neuroscience in this period, and I do not feel we are appreciably closer to cracking fundamental issues such as credit assignment in the brain. I think my work is maybe slightly helpful on net but is unlikely to lead to massive impact. In machine learning, of course, there have been fundamental breakthroughs, but it turns out that by and large academia is not the place to actually make them. Despite academia having many, many more researchers, most breakthroughs have been clustered around a few elite tech companies with the resources to actually push out the frontier. Academia has contributed a lot around the edges but not to any fundamental huge breakthrough. Given my position at the start of my PhD, outside of joining DeepMind or OpenAI, my chance of making a fundamental AI breakthrough was also low. Nevertheless, I did have a good time in academia and it was a fundamentally good environment in which to mature intellectually. I wrote some papers and made what I feel were some decent intellectual contributions. I got to spend years doing almost nothing but studying, attacking problems, and trying to contribute to human knowledge, which is an exceptionally noble goal. I read a huge number of machine learning and neuroscience papers and think I have a reasonable understanding of the forefront of both fields. I learnt a huge amount and now feel like I have very good models which I can bring to AI safety and other pressing problems. I am, and was, however, ready to leave. You cannot stay in academia forever if you want to make an impact.