Now that deep learning has proven fully victorious, and the reasons for its victory have started to be understood, it is interesting to revisit some old debates and return, with a new perspective, to the question of why GOFAI failed.

What is GOFAI?

GOFAI is hard to pin down precisely; it is more of a ‘know it when you see it’ phenomenon, and today the term is often used primarily as a smear. However, I think a more principled definition is possible. Expert systems are definitely GOFAI, for instance, while early connectionist work like Hopfield networks is not. One potential distinction is between statistical and non-statistical methods. However, this fails to separate connectionism from a large number of other statistics-based approaches which generally do not scale, and the distinction is murky in any case, since in some sense connectionism is just statistical learning in the large. After all, a neural network is nothing more than a very deep nonlinear regression model.

Perhaps the clearest definition I can find, germane to my point, is that GOFAI systems are built with human labour and human optimization power rather than machine optimization power. This includes direct hardcoding of the system, as with expert systems and a wide variety of early ‘AI’ programs; human curation of large datasets of ‘facts’ or ‘propositions’, as in large projects such as Cyc; and even just large amounts of human effort in feature engineering or in scaffolding systems together from a set of hand-built components, which is how pre-DL computer vision and NLP pipelines largely functioned.

When phrased like that, the reason for the failure is obvious. As Moore’s law continues and compute becomes exponentially cheaper, the balance between machine and human optimization power shifts radically, as indeed it has. The bitter lesson is basically that compute is cheap and humans are not. This matters especially in AI, where what we are trying to model – ultimately reality itself – is irreducibly complex.

Perhaps the most important recent lesson from DL theory is that scaling laws live in the data and not the model. A fundamental-seeming fact about large naturalistic datasets, and indeed almost any task which involves the natural world, is an apparently infinite power-law decay of features to be learnt. There is almost always some rare and complex case lurking just beyond what you have already learnt, waiting to mess up your model. There is likely a sensible maximum-entropy reason for this, although exactly why is not yet clear; that natural datasets are structured in this way is, however, an empirical fact. It is this structure that creates scaling laws. As the model increases in capacity, it can learn more features. As the model sees more data, it has more examples of each feature to learn from, and increasingly rare features eventually pass some approximate ‘learnability’ threshold. To make progress on a naturalistic dataset, you fundamentally need an amount of data and optimization power which grows as a power law in the loss you want to achieve.

The fundamental underlying issue with GOFAI, then, is the relative cost of optimization power. To reach a given level of performance you must exert a corresponding amount of optimization power to move down the scaling curve, and that power can come from humans, from machines, or from some combination of the two. In the 80s, FLOPs were so expensive that machines barely worked at all; humans were really your only option for large parts of the system, and so GOFAI systems reached SOTA. As Moore’s law drove down the cost of FLOPs, the optimal tradeoff shifted more and more towards machine optimization power. DL is just the continuation, and really the apotheosis, of this trend, with human optimization power fading towards zero and machine optimization power approaching 100%.
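As a toy illustration of the feature-tail structure described above, here is a minimal Python sketch. The Zipfian feature distribution, its exponent, and the ‘at least K examples to learn a feature’ threshold are all assumptions chosen purely for illustration, not measurements of any real dataset.

```python
import numpy as np

# Assume (for illustration only) that "features" in natural data follow a
# Zipfian frequency distribution: feature i appears with probability ~ 1/i^ALPHA.
NUM_FEATURES = 100_000
ALPHA = 1.1
freqs = 1.0 / np.arange(1, NUM_FEATURES + 1) ** ALPHA
freqs /= freqs.sum()

# Assume a feature becomes "learnable" once the dataset is expected to contain
# at least K examples of it.
K = 10

for dataset_size in [10_000, 100_000, 1_000_000, 10_000_000]:
    expected_counts = freqs * dataset_size
    learnable = int(np.sum(expected_counts >= K))
    print(f"N = {dataset_size:>10,}: ~{learnable:,} features past the learnability threshold")
```

The number of features clearing the threshold grows only as a power of the dataset size, so each constant improvement in loss demands a multiplicative increase in data and optimization power – whoever ends up supplying it.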

There are a number of other failure modes of GOFAI, such as brittleness and the lack of symbol grounding, but all of them stem from this fundamental fact: human optimization power is not competitive with machine optimization power. Reality is fundamentally complex, and a GOFAI system is walking down the same (endless) power-law scaling curve as a DL system; it is just that while DL spends GPU cycles and parameters, you are spending human effort, and gradient descent is a much better optimizer than you.

Successes of GOFAI

Finally, it is important to note that, for its time, GOFAI did not fail. GOFAI systems were the state of the art in the 1980s, and arguably for many decades until the recent DL revolution. For instance, in computer vision, GOFAI-ish systems based on hand-engineered features were dominant until at least the late 2000s. In other fields such as NLP, however, statistical methods such as HMMs and LDA were often state of the art, although NLP pipelines typically included a great deal of GOFAI scaffolding as well. This is exactly what we should expect given our model of the marginal cost of machine vs human optimization power. Of course GOFAI was state of the art in the 80s – compute was then so expensive that getting humans to do almost everything was the correct tradeoff.

Connectionism at this time either did not work or appeared to scale worse and be more finicky than the equivalent GOFAI systems, and it suffered from apparently serious issues such as vanishing gradients, which often meant that DNN-based approaches just did not work at all. It turned out that all of these issues could be solved with careful initialization schemes and architectural choices, but this was not clear at the time.

Nor did all of connectionism succeed. Only a tiny subset of connectionism passed through the scaling filter to become modern ML, and much of modern ML looks completely different to the connectionism of the 80s. Basically, three ideas made it through: 1.) Using hierarchical multi-layer neural networks with nonlinearities, although almost no specific architectures made it through except CNNs and, arguably, LSTMs. 2.) Using backprop as the credit-assignment algorithm – this is by far the biggest win. 3.) Training these networks on naturalistic data and ‘getting out of the way’. However, a vast amount of connectionism turned out not to scale either: a huge number of different architectures that did not work, a great deal of work on alternative learning and inference algorithms, and alternative NN approaches such as spiking networks, which have never worked to this day.
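To make ideas 1.) and 2.) concrete, here is a deliberately tiny sketch: a two-layer network with a nonlinearity, trained by hand-written backprop and gradient descent. The toy sin(x) dataset, layer sizes, and hyperparameters are all made up for illustration, standing in (very loosely) for the naturalistic data of idea 3.).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = sin(x), which a purely linear model cannot fit.
X = rng.uniform(-3, 3, size=(256, 1))
Y = np.sin(X)

# Idea 1: a hierarchical multi-layer network with a nonlinearity (1 -> 32 -> 1, tanh).
W1 = rng.normal(0, 0.5, size=(1, 32)); b1 = np.zeros((1, 32))
W2 = rng.normal(0, 0.5, size=(32, 1)); b2 = np.zeros((1, 1))
lr = 0.05

for step in range(2000):
    # Forward pass.
    h_pre = X @ W1 + b1
    h = np.tanh(h_pre)
    pred = h @ W2 + b2
    loss = np.mean((pred - Y) ** 2)

    # Idea 2: backprop, i.e. the chain rule applied layer by layer.
    d_pred = 2 * (pred - Y) / len(X)
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0, keepdims=True)
    d_h = d_pred @ W2.T
    d_h_pre = d_h * (1 - np.tanh(h_pre) ** 2)
    dW1 = X.T @ d_h_pre
    db1 = d_h_pre.sum(axis=0, keepdims=True)

    # Gradient descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    if step % 500 == 0:
        print(f"step {step}: mse = {loss:.4f}")
```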

Finally, many current ML techniques which do scale were invented within the GOFAI paradigm. Perhaps the clearest example of this is Monte-Carlo Tree Search, which comes out of the same game-tree-search tradition as Deep Blue’s alpha-beta search for chess and now sits at the heart of systems like AlphaGo. Human-designed MCTS is unusual in that it can leverage compute directly, since it is simply an encoding of optimization which takes advantage of the structure of MDPs. GOFAI also spun off a large number of generically useful software-engineering ideas and practices which are now no longer considered ‘AI’, such as basically everything that came from the Lisp language.
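For concreteness, here is a minimal, self-contained sketch of vanilla UCT-style MCTS on a made-up toy MDP (walk left or right for 8 steps, reward equal to the final position). The environment, constants, and interface are illustrative assumptions, not anything from Deep Blue or AlphaGo; the point is only to show how the algorithm turns raw compute (more iterations) directly into better decisions.

```python
import math
import random

HORIZON = 8
ACTIONS = (-1, +1)

def step(state, action):
    pos, t = state
    return (pos + action, t + 1)

def is_terminal(state):
    return state[1] >= HORIZON

def reward(state):
    return state[0]  # terminal reward only

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}          # action -> Node
        self.visits = 0
        self.total_return = 0.0

    def ucb_child(self, c=1.4):
        # UCT: exploit high average return, explore rarely-visited children.
        def score(item):
            _, child = item
            if child.visits == 0:
                return float("inf")
            return (child.total_return / child.visits
                    + c * math.sqrt(math.log(self.visits) / child.visits))
        return max(self.children.items(), key=score)

def rollout(state):
    # Default policy: random actions until the episode ends.
    while not is_terminal(state):
        state = step(state, random.choice(ACTIONS))
    return reward(state)

def mcts(root_state, iterations=2000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend via UCT while the node is fully expanded.
        while not is_terminal(node.state) and len(node.children) == len(ACTIONS):
            _, node = node.ucb_child()
        # 2. Expansion: add one untried child.
        if not is_terminal(node.state):
            action = random.choice([a for a in ACTIONS if a not in node.children])
            node.children[action] = Node(step(node.state, action), parent=node)
            node = node.children[action]
        # 3. Simulation: random rollout to a terminal state.
        ret = rollout(node.state)
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.total_return += ret
            node = node.parent
    # Recommend the most-visited root action.
    best_action, _ = max(root.children.items(), key=lambda kv: kv[1].visits)
    return best_action

print("MCTS recommends action:", mcts((0, 0)))  # expected: +1
```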

Why Connectionism?

A related question, of current rather than just historical interest, is why did the specific aspects of connectionism that did succeed, succeed? This is the question that machine learning theory tries to answer, and which I aim to slowly figure out on this blog. Basically, there seem to be a few key factors.

The main one is having sufficiently powerful and flexible neural network architectures, with sufficient scale and data, to learn useful, generalizable functions. Both scale and data bring large blessings. Strangely, deep learning is a field where bigger is easier. This stands in stark contrast with GOFAI systems, where larger is always harder.

The reason for this lies in an unintuitive property of the optimization landscape of neural networks and of their inductive biases at scale. Large neural networks appear to have better conditioning (such that SGD works without exploding or vanishing gradients) and a much smoother, more linear loss landscape with larger, easier-to-find minima of high quality. Similarly, larger datasets force models to find generalizing algorithms rather than memorizing specific features, and this manifests as an implicit regularization towards broad and useful minima. Larger datasets also help speed up optimization by enabling much larger minibatches, which reduce gradient noise until it is no longer the primary bottleneck on optimization. Exactly why all of this happens remains an open question in ML theory.
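As a small illustration of the minibatch point, here is a toy sketch showing that the noise in a minibatch gradient estimate shrinks roughly as one over the square root of the batch size. The linear-regression setup and all the numbers are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression: per-example gradients of 0.5 * (w*x - y)^2 at w = 0.
N = 100_000
x = rng.normal(size=N)
y = 3.0 * x + rng.normal(scale=0.5, size=N)
w = 0.0
per_example_grads = (w * x - y) * x   # d/dw of the squared error, per example

full_batch_grad = per_example_grads.mean()

for batch_size in [1, 16, 256, 4096]:
    # Spread of the minibatch gradient estimate across many random batches.
    estimates = [rng.choice(per_example_grads, size=batch_size, replace=False).mean()
                 for _ in range(300)]
    print(f"batch {batch_size:>5}: mean grad {np.mean(estimates):+.3f} "
          f"(full-batch {full_batch_grad:+.3f}), std {np.std(estimates):.3f}")
```

Once the gradient noise is small relative to the true gradient, making batches even bigger stops helping, which is one way of seeing why data and compute trade off against each other along the scaling curve rather than substituting freely.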