Epistemic status: I owe a lot of my thoughts to Jacob Cannell’s work on both brains and deep learning. My thinking here comes from my experience in large-scale ML, as well as a background in neuroscience and, specifically, experience with analog hardware for deep learning.
The lesson of connectionism is the unification of brains and machines. Both brains and current DL systems implement the core of intelligence: (amortized) Bayesian inference on an extremely flexible learning substrate with massive optimization power and data. We are consistently surprised by the similarities between our DL systems and human intelligence. This holds at the macro scale, where it is becoming clear that our sensory systems operate on very similar principles to state-of-the-art deep learning models – unsupervised learning with predictive/autoregressive objectives – as well as at the representational level. Indeed, our AI systems often appear to share many of our own limitations. At a macro level, they struggle with mental math, have human-like cognitive biases, and subvert all pre-DL AI stereotypes: they are best at emotions and creativity and worst at logic and math problems. At a micro level, we see strong convergence between representations in the brain and representations in DL systems across a range of different architectures. Finally, at a functional level, the general strategy of the cortex and of DL is very similar – both use huge, relatively unstructured neural networks trained with some kind of unsupervised predictive learning: autoregressive objectives and backprop in DL; predictive coding (which is just autoregressive prediction with a slightly different credit assignment algorithm) in the cortex.
My primary contention in this post, which I am stating more strongly than I believe for rhetorical effect, is that these similarities are not coincidental: the primary cause of the differences between DL systems and the brain is hardware, not software. That is, the major reasons DL-based systems look different from the brain are not different ‘software’ choices such as learning objectives (autoregressive prediction vs temporal predictive coding), credit assignment algorithms (backprop vs whatever the brain does), or low-level details such as neuron implementation (ReLU linear neurons vs actual cortical pyramidal cells). Rather, the differences we see mostly stem from the fundamental constraints and tradeoffs of the underlying hardware architecture (GPUs for DL vs biological NNs for the brain). This drives the differences in ‘architecture’ at the software level – the primitives that are effective and efficient in deep learning differ from those that work in the brain, and thus state-of-the-art deep learning nets and the brain look different, even though both ‘work’.
In a way, the point of this post is to argue against the picture presented in The Hardware Lottery. The techniques that have succeeded in ML are not arbitrary, nor are they arbitrarily dictated by existing hardware. Instead, successful ML techniques arise from adapting the fundamental principles of intelligence, which are shared between brains and machines, to existing hardware architectures. This adaptation requires hardware close enough to the algorithms implementing these principles to run them efficiently – which is why all ML runs on GPUs and not CPUs, even though almost all research prior to 2012 was done on CPUs. However, there are also other ways of implementing these same fundamental primitives, adapted to other hardware architectures, especially neuromorphic analog hardware like the brain. These form a ‘parallel’ tech tree which we have explored far less than the DL-based strain, but which we will begin to explore as better neuromorphic hardware comes online. Secondly, the hardware architectures we have converged upon – GPUs and brains – are themselves not arbitrary. There are fundamental physical reasons why GPUs have to be the way they are, and likewise why neuromorphic architectures have their own sets of properties, and the way intelligent systems can be constructed on each substrate is downstream of these fundamental reasons.
To start understanding why brains and current DL systems differ, therefore, we need to understand the primary constraints they face as a consequence of their hardware architectures. Specifically, the key constraints on brains are:
1.) Energy usage
2.) Size / volume
3.) Continual single-batch learning in real time
While the key constraints on GPU-based DL systems are:
1.) Dense SIMD operations only
2.) Limited memory bandwidth relative to FLOP count
The primary constraint on brains is the sheer energy cost of maintaining the brain and powering its operation. Unlike GPUs, which can draw 700-1000 watts at peak, even human brains have a far smaller energy budget (about 10-20 watts, almost 2 OOMs lower) due to both the metabolic cost of supporting the brain and the difficulty of cooling it. This very limited energy budget means that slow, highly sparse connectivity and firing are optimal: neural firing rates cannot be very high, and large amounts of the brain are not fully utilized at any given moment (dark neurons to complement dark silicon). Related to this is size. The brain’s size is primarily constrained by a.) the need to fit all the neurons within the skull and b.) the energy upkeep of a larger brain. As an analog circuit, expanding the number of parameters means not just taking longer to compute but also expanding the physical space of the ‘chip’ to accommodate the extra neurons. This forms a fairly taut constraint on brain size, although a looser one than energy upkeep, since neurons can be packed at least moderately more densely, and in evolutionary terms the brain and skull have expanded dramatically in humans compared to other primates. Indeed, in the quest to increase intelligence and neuron count, primate brains appear to have evolved much denser neural tissue (in neurons per volume) than those of other mammals. An additional major constraint on brains, which may actually result in significant algorithmic rather than hardware differences, is that brains are forced to learn in a continual, single-batch setting: we are fed data as a sequential stream which can be, and usually is, highly autocorrelated, and we must learn with a ‘batch size’ of 1.
While this learning regime brings some advantages – a natural curriculum to our learning (at least sometimes), and the ability to interact with and intervene on the world, which makes causal discovery easier – it also comes with major costs. Because we do not learn from i.i.d. data, brains have a serious problem with catastrophic forgetting and deploy a range of algorithmic mechanisms to cope with it, including quite an elaborate memory hierarchy distributed throughout the brain as well as synaptic fast weights. Single-batch learning means that gradient noise – in general, noisy individual datapoints – can have a major impact on learning, and it is likely that the brain has to spend more computation per datapoint than DL systems to counteract this. Finally, another constraint on the brain (and on analog computing in general) is its intrinsic two-dimensionality. Since computation is mostly analog and confined to the neural sheet, two-dimensional matrix-vector operations are trivial to represent and compute. However, ‘three-dimensional’ matrix-matrix and tensor operations are much less natural and must be handled in other ways – typically recurrence and sequential processing. This inability to handle tensor operations is another reason our brains cannot effectively operate on minibatches like DL systems (neurons have no natural batch dimension), and why we must use recurrence to handle time rather than simply treating it as another tensor dimension, as transformers do.
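The batch-dimension point can be made concrete: on a GPU, a minibatch is just one more tensor dimension handled by a single dense matmul, while a 2-D analog substrate must stream the same work as one matrix-vector product per sample. A minimal numpy sketch (shapes are arbitrary, chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))   # one weight "sheet": natural for an analog substrate
X = rng.standard_normal((8, 32))    # a minibatch of 8 inputs

# GPU-style: the batch is just another tensor dimension, consumed by one
# dense matrix-matrix multiply.
batched = X @ W.T                   # shape (8, 64)

# Brain-style: batch size 1, so the same computation arrives as a
# sequential stream of matrix-vector products, one per sample (or timestep).
streamed = np.stack([W @ x for x in X])

assert np.allclose(batched, streamed)
```

Both routes compute identical outputs; the difference is purely in how the hardware can schedule the work.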
These fundamental constraints and limitations are not faced by DL-based systems. We have essentially no limit on the energy usage or the size and volume of our GPU clusters. GPUs can handle tensor operations of arbitrary dimensionality – although this flexibility naturally comes at a cost for matrix-vector operations compared to ideal analog coprocessors – and because they can store and operate on extremely large matrices natively, we can also perform i.i.d. learning with very large batch sizes straightforwardly, unlike the brain. Broadly, this means that GPU-based DL systems get a huge amount more raw statistical and computational power than the brain. However, the architectural constraints of GPUs also mean that this power tends to be used in a brute-force and less efficient way than in the brain, so DL systems with much higher FLOP counts and dataset sizes than the brain are still outperformed by it in a number of areas. While some of this gap is certainly algorithmic, a large part is also due to the nature of the architecture.
Specifically, turning to the hardware constraints of GPUs, a fundamental one is their specialization for dense SIMD operations. GPUs are designed and extremely highly optimized for large, dense matmuls and other elementary linear algebra ops. But while GPUs are fantastic at crunching extremely large matrices, extremely large dense matrices are not the optimal hammer for every task. A large fraction of the weights in existing DL systems are likely unnecessary – and can be shown to be unnecessary by the success of pruning methods. The true latent variables that a DL system tries to model are likely highly sparse, with some power-law, small-world structure. The brain, after a pruning step early in life, naturally adapts its connectivity statistics to these features of the latent space it is trying to model. GPUs cannot, and so keep fitting a dense graph to a sparse one. Of course, this can achieve a highly successful fit – sparse graphs are subsets of a fully connected dense graph – but at significantly greater computational cost.
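A toy calculation makes the cost of this mismatch concrete. Suppose pruning reveals that only ~2% of a layer's weights matter (the exact fraction here is an assumption for illustration); a dense engine still pays for every entry, zero or not:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
mask = rng.random((d, d)) < 0.02        # ~2% of connections survive pruning
W = rng.standard_normal((d, d)) * mask  # pruned weight matrix, mostly zeros
x = rng.standard_normal(d)

# A dense engine performs a multiply-add for every entry, zero or not:
dense_macs = d * d

# Hardware that can exploit sparsity would pay only for live connections:
sparse_macs = int(np.count_nonzero(W))

y = W @ x                               # the result is identical either way
print(f"dense pays ~{dense_macs / sparse_macs:.0f}x more multiply-adds")
```

At 2% density the dense hardware does roughly 50x more arithmetic than the underlying connectivity requires, which is exactly the "dense graph fit to a sparse one" overhead described above.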
In general, this is a pattern we see again and again when comparing the brain with GPU-based DL systems. The brain has various clever adaptive-connectivity and fast-weight mechanisms to store memories and represent data efficiently; DL systems power through with massive dense matrices. The brain uses recurrent iterative inference to refine representations over time with a much smaller parameter count; DL systems just train massively deep architectures which brute-force an approximation of that iterative inference. The brain has clever attention mechanisms throughout the cortex so that only important information is attended to and everything else is suppressed, conserving computation and focusing it where it matters most; DL systems just attend to everything by default.
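The iterative-inference tradeoff can be seen in miniature: a recurrent network reusing one small weight matrix is exactly a weight-tied deep stack when unrolled, while the usual untied deep net that approximates the same refinement must spend a fresh parameter set per step. A hedged sketch (dimensions, step count, and the 0.1 scaling are arbitrary choices to keep the iteration stable):

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps = 64, 20
W = rng.standard_normal((d, d)) * 0.1   # scaled down so the recurrence is contractive
b = rng.standard_normal(d)

# Brain-style: one small weight matrix applied recurrently, refining the
# representation a little more on each pass.
h = np.zeros(d)
for _ in range(steps):
    h = np.tanh(W @ h + b)

# DL-style: the same computation unrolled into a feedforward stack. With
# tied weights it is identical; with untied weights (the usual deep net),
# every layer needs its own parameter set.
recurrent_params = d * d
unrolled_params = steps * d * d
print(f"unrolled depth costs {unrolled_params // recurrent_params}x the parameters")
```

The brute-force cost of depth over recurrence is just this multiplier: parameters scale with the number of refinement steps instead of being amortized across them.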
A second major limitation of GPU-based architectures compared to the brain is their memory bandwidth. Unlike the specialization for dense SIMD operations, this is not a design choice but a fundamental flaw of the von Neumann architecture. Specifically, the separation of processing and memory units is highly unsuited to any kind of ‘shallow’ computation, where a relatively small amount of compute is spread over a large amount of data. Unfortunately for GPUs, this describes most of the computation that appears to be necessary for intelligence. Each ‘weight’ in a forward pass is involved in only a few ops – a multiplication and an addition, maybe a nonlinearity if it is lucky. The cost of these operations is dwarfed by the cost of shuttling the stored weight back and forth from memory.[1]
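This ratio of compute to data movement is usually quantified as arithmetic intensity: FLOPs per byte moved. Assuming fp16 weights and ignoring activation traffic (both simplifying assumptions), a batch-1 forward pass gets exactly one multiply-add per weight loaded:

```python
def arithmetic_intensity(n, m, batch, bytes_per_weight=2):
    """FLOPs per byte of weight traffic for an (n, m) layer, fp16 weights assumed."""
    flops = 2 * n * m * batch                # one multiply-add per weight per sample
    weight_bytes = n * m * bytes_per_weight  # each weight streamed from memory once
    return flops / weight_bytes

print(arithmetic_intensity(4096, 4096, batch=1))    # 1.0 FLOP/byte: hopelessly memory-bound
print(arithmetic_intensity(4096, 4096, batch=256))  # 256.0: batching amortizes each weight load
```

A modern accelerator needs on the order of hundreds of FLOPs per byte to stay compute-bound, which is precisely why DL systems lean so hard on large batches: batching is the von Neumann workaround for shallow per-weight computation.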
Due to the von Neumann bottleneck, as well as basic chip physics (memory takes up a lot of die area you would rather use for computational logic), having large amounts of memory very close to compute is fundamentally expensive. In computer architecture, this problem is addressed with a memory hierarchy, running from the in-CPU caches – extremely fast but with very small capacity, used for registers and other vital state – through slower caches, to RAM, and ultimately to disk. What this means in practice is that computers end up with fairly small ‘rapid working memories’ relative to their computational power and, for ML workloads, end up bottlenecked by memory accesses. This is an absolutely massive problem for large-scale DL systems compared to the brain, and it spans orders of magnitude. For instance, an H100 easily matches the brain’s FLOP count, yet it has only 80GB of RAM compared to the brain’s 1-10T synapses, all of which can be active simultaneously. This is why brain-scale ML systems must currently be trained on large GPU clusters, even though on a FLOP-for-FLOP basis a single modern GPU card is competitive with the brain. This memory poverty is further exacerbated by the restriction to dense matrix ops: the GPU cannot even slightly save memory by sparsifying its computations and representing only the vital elements.
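A back-of-envelope roofline makes both gaps visible at once. The hardware numbers below are rough public figures for an H100, not exact specs, and the synapse count is the mid-range of the 1-10T estimate above:

```python
peak_flops = 1e15        # ~1000 TFLOP/s dense fp16 (rough figure)
mem_bw = 3.35e12         # ~3.35 TB/s HBM3 (rough figure)
hbm_bytes = 80e9         # 80 GB of on-card memory
params = hbm_bytes / 2   # the most fp16 weights that fit on one card: 40B

compute_time = 2 * params / peak_flops  # one multiply-add per weight
memory_time = 2 * params / mem_bw       # stream every fp16 weight from HBM once

# Bandwidth gap: at batch size 1, the card waits on memory ~300x longer
# than it spends computing.
print(memory_time / compute_time)

# Capacity gap: one card holds ~2 OOM fewer weights than the brain has synapses.
print(5e12 / params)
```

Both ratios are crude, but they show why brain-scale models end up sharded across clusters: the bottlenecks are bandwidth and capacity, not FLOPs.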
These differences in fundamental architectural constraints result in differences in software architecture, even while both implement approximately the same algorithms and the same functionality. This, rather than major algorithmic divergence, is why brains and current DL systems look so different at the surface level. Interestingly, despite the fairly significant architectural differences, we often see representational similarities, where DL-based representations can be matched to those in the brain. The reasons for these similarities are at present mostly unclear, but I suspect they occur for fundamental reasons we do not yet fully understand.
[1] This bottleneck suggests two solutions. First, we could try to reduce the distinction between memory and compute – or at least co-locate memory and compute units on the chip. At the far end of this path lie the brain, and neuromorphic computing in general, where we build systems composed of neuron-like units combining both memory and computational functions. Nearer to the present are specialized AI chips such as Cerebras and, indeed, GPUs themselves, which exploit the highly parallel nature of neural networks to spread memory and compute across a large number of local co-processors and then combine the results. The other approach is to keep the von Neumann architecture but increase the computation density of each weight – i.e. to design neural network architectures which do more computation with each weight. This approach diverges from the brain and heads toward more alien systems of intelligence, but there is no a priori reason why it would not work. However, research here seems to be either more neglected than the alternative or to work less well in practice.