Epistemic Status: Just some quick thoughts written without a super deep knowledge of SLT so caveat emptor.
Recently, I happened to run into Jesse Hoogland at the Post-AGI workshop and we got onto discussing his work on SLT. SLT had been vaguely in the air back when I was at Conjecture and more involved in interpretability, but it had always seemed very arcane and mathematically difficult and I had never really gotten into it at that time. In any case, he kindly gave me some pointers and papers to read, which I finally got around to reading over the Christmas break, so here are some of my initial thoughts. This is mostly based on the (great!) sequence on distilling SLT as well as some of the other Timaeus papers I read. I haven’t yet attempted to read any of Watanabe’s original work so I’m probably misunderstanding a lot, but here goes anyway.
The basic idea of SLT is super simple. Existing neural networks have various kinds of symmetries in their weights, which means that multiple weight configurations produce the same function and hence get the same loss. When this happens the model is called singular, which is equivalent to the Fisher information matrix having zero eigenvalues (making the Fisher positive semi-definite rather than strictly positive definite). This has two very closely related effects:
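To make this concrete, here is a minimal toy example of my own (not taken from the SLT literature): a two-parameter model f(x) = a·b·x fit with squared error to data whose true function is zero. Every point on the two axes {ab = 0} is a global minimum, and at the crossing point a = b = 0 the Hessian (standing in for the Fisher here) is completely degenerate, which is exactly the kind of point where BIC-style quadratic approximations break down.

```python
import numpy as np

# Toy "singular" model: f(x) = a * b * x, fit by squared error to y = 0.
# The population loss is L(a, b) = 0.5 * E[x^2] * (a*b)^2, so every point
# on the set {a*b = 0} (the two coordinate axes) is a global minimum.

def loss(params, xs):
    a, b = params
    return 0.5 * np.mean((a * b * xs) ** 2)

def hessian(params, xs, eps=1e-3):
    # Crude central finite-difference Hessian of the loss at `params`.
    params = np.asarray(params, dtype=float)
    n = len(params)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pp = params.copy(); pp[i] += eps; pp[j] += eps
            pm = params.copy(); pm[i] += eps; pm[j] -= eps
            mp = params.copy(); mp[i] -= eps; mp[j] += eps
            mm = params.copy(); mm[i] -= eps; mm[j] -= eps
            H[i, j] = (loss(pp, xs) - loss(pm, xs)
                       - loss(mp, xs) + loss(mm, xs)) / (4 * eps ** 2)
    return H

xs = np.random.randn(10_000)

# At a generic minimum (a=1, b=0) one eigenvalue is ~0: a single flat direction.
print(np.linalg.eigvalsh(hessian([1.0, 0.0], xs)))
# At the singularity (a=0, b=0), where the two branches of the minimum cross,
# both eigenvalues are ~0: the quadratic approximation collapses entirely.
print(np.linalg.eigvalsh(hessian([0.0, 0.0], xs)))
```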
Firstly, the raw number of parameters, as used in e.g. the BIC, is no longer a good measure of the ‘effective’ number of parameters, and hence of the statistical complexity of the model. The model may have a lot of parameters, but many of them are effectively useless because they are ‘eaten up’ by symmetries, and so the class of functions actually computed by the model is smaller than it would be if all these parameters did separate things. SLT provides a way to quantify this via the ‘Widely Applicable’ BIC (WBIC) (lol), which essentially replaces the parameter count in the complexity term with the learning coefficient: a smaller, ‘effective’ number of parameters that takes the symmetries into account.
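For reference, my rough understanding of the headline result (in my own notation, and glossing over the fact that WBIC is technically an estimator of this free energy rather than the asymptotic formula itself) is that the d/2 of the BIC gets replaced by the learning coefficient λ, which satisfies λ ≤ d/2 and is strictly smaller when the model is singular:

```latex
% Regular-model (BIC-style) penalty: half the raw parameter count d
\mathrm{BIC} \;=\; n L_n(\hat{w}) + \frac{d}{2}\log n
% Singular case: the Bayesian free energy expands (up to lower-order terms)
% with the learning coefficient \lambda instead of d/2
F_n \;=\; -\log \int e^{-n L_n(w)}\,\varphi(w)\,dw \;\approx\; n L_n(w^{\ast}) + \lambda \log n
```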
Secondly, because there are now singularities/symmetries in the model, the minima of the loss landscape change. Instead of isolated optimal points we get optimal lines, planes, or higher-dimensional structures which all achieve the same minimal loss. In the Bayesian setting this raises the question of what the asymptotically optimal posterior should be. If the optimum is just a point then the posterior is a delta function at that point, but if the optimum is some kind of extended geometric structure, then we have to define a posterior over this structure, and how should we do that?
The key insight here is that we can get a handle on this by looking at the geometric structure of the optimal set, and by relating the volume of an epsilon-neighbourhood around the optimal set (which is governed by the local sharpness of the Fisher) to the free energy (which balances entropy and accuracy) of that region of the minimal set. Specifically, SLT shows that regions of the optimal set with greater ‘volume’ are ‘preferred’ by the posterior and have lower free energy than regions without much volume. This can be measured by what SLT calls the ‘local learning coefficient’.
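The way I currently picture the (local) learning coefficient is through how the volume of nearly-optimal parameters scales as the loss tolerance is tightened; roughly (my paraphrase, ignoring log factors):

```latex
% Volume of parameters within loss tolerance \epsilon of the minimum,
% measured in a neighbourhood B(w^\ast) of a point w^\ast on the optimal set:
V(\epsilon) \;=\; \mathrm{Vol}\{\, w \in B(w^{\ast}) : L(w) - L(w^{\ast}) < \epsilon \,\}
\;\sim\; c\,\epsilon^{\lambda(w^{\ast})} \quad \text{as } \epsilon \to 0
% A smaller local learning coefficient \lambda(w^\ast) means more volume at a
% given tolerance, and hence lower free energy and more posterior mass.
```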
Interestingly, I’ve had similar ideas and musings myself, especially in trying to relate the generalization and ‘findability’ of minima to their parameter-space volume and to Hessian (not Fisher, but closely related) sharpness. Obviously SLT is vastly more mathematically sophisticated (and actually proven!) than my musings, but it’s interesting that I was naturally thinking along similar lines, and it’s really edifying to see similar ideas actually proven and formalized properly.
However, this made me think that perhaps SLT-like arguments are more general than SLT actually proves them to be[1]. My heuristic arguments made no mention of singularities in weight space specifically, and indeed most of the math of SLT doesn’t actually seem to rely heavily on the existence of singularities. The primary reason SLT needs singularities is that it is concerned with the asymptotic limit of infinite data/optimization power, where it assumes the global optimum can be reached; if the model is not singular, then the optimum is a point and nothing interesting happens. So the primary reason singularities are needed is simply to spread the optimal posterior out over some nontrivial volume with interesting geometric structure rather than collapsing it to a point.
However, singularities in parameter space are not the only way to get such a result. There are two other ways I can think of, which I think actually crop up much more often in practice. Firstly, we can simply not assume the asymptotic limit of infinite data/optimization steps, in which case we do not end up with a delta-function posterior even for nonsingular models. In practice, all models are trained for a finite number of steps and, especially for large models, are often pretty far from saturation. This is usually true in practice but seems very hard to model mathematically, whereas SLT makes the modelling easier (and probably tractable at all) by essentially assuming equilibrium at the minima. Although I don’t know if we can tractably show much, I think this nevertheless provides some additional intuition for why early stopping acts as a de facto regularizer: simply because it decreases the effective number of parameters of the network! The argument is basically that, due to early stopping, the posterior hasn’t reached equilibrium. Many possible input-output functions can achieve the same loss at a given number of training steps, so the effective dimensionality of the network is reduced: there are now ‘pseudo-singularities’ caused by insufficient optimization, which means the network is likely to have lower effective complexity and hence generalize better than a network trained for longer. Directly modelling these learning dynamics mathematically in nontrivial networks seems very challenging, although you could potentially prove results about simple cases such as a quadratic convex loss plus SGD from some uniform distribution of initial conditions for N steps.
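As a quick sanity check of this intuition (a toy sketch of my own, nothing to do with the actual SLT machinery): run gradient descent on a quadratic loss with a wide spread of curvatures and count how many eigen-directions have actually converged after t steps. Directions with curvature much smaller than 1/(learning rate × t) are essentially untouched, so early in training they behave like flat, ‘pseudo-singular’ directions.

```python
import numpy as np

# Quadratic loss L(w) = 0.5 * sum_i h_i * w_i^2 with curvatures h_i spread
# over many orders of magnitude. Gradient descent from w_0 shrinks each
# coordinate as w_i(t) = (1 - lr * h_i)^t * w_i(0), so low-curvature
# directions barely move for small t: they look flat to the optimizer.

rng = np.random.default_rng(0)
d = 1000
curvatures = 10.0 ** rng.uniform(-6, 0, size=d)  # h_i between 1e-6 and 1
lr = 0.5                                         # stable since max h_i is 1

def effective_dim(t, threshold=0.5):
    # Count directions whose initial displacement has decayed by at least
    # `threshold` after t steps, a crude "number of directions actually fit".
    decay = (1 - lr * curvatures) ** t
    return int(np.sum(decay < threshold))

for t in [10, 100, 1000, 10_000, 100_000]:
    print(t, effective_dim(t))
# The effective dimension grows roughly logarithmically with training time:
# stopping early really does behave like fitting a smaller model.
```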
Secondly, and likely more fruitfully, in practice we use stochastic gradient descent, which operates on a randomly selected minibatch rather than the full dataset at each step. This introduces an irreducible level of noise which cannot be overcome even in the limit of infinite data and training time. Effectively, stochastic gradient descent introduces a noise variance, or effective temperature, proportional to learning rate / batch size, which controls the SNR of the gradients ‘at equilibrium’ and hence determines the irreducible loss floor below which the optimizer simply lacks the ‘resolution’ to look.
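The standard continuous-time heuristic here (my notation; the exact structure of the noise covariance is a debated modelling choice in the literature) is to approximate SGD with learning rate η and batch size B as a stochastic differential equation:

```latex
% SGD as an SDE: minibatch gradient noise enters with a scale set by the
% ratio of learning rate to batch size.
d\theta_t \;=\; -\nabla L(\theta_t)\,dt \;+\; \sqrt{\tfrac{\eta}{B}}\;\Sigma(\theta_t)^{1/2}\,dW_t
% The effective temperature T \propto \eta / B sets how small a loss
% difference the optimizer can resolve against the gradient noise floor.
```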
This is interesting because it introduces another set of ‘pseudo-singularities’: combinations of parameters that make no difference to the loss on average across many noisy training minibatches, simply because the differences between the parameters in this region are too small to impact the loss at a level that rises above the SNR floor imposed by the stochasticity of the gradients. Effectively, many regions of parameter space ‘appear singular’ to the optimizer, since in practice, once inside such a region, the optimizer cannot decrease the loss any further on average, even though with theoretically infinite resolution it could optimize further towards an actual singularity.
What this means is that, from the perspective of SGD with some fixed noise SNR set by the learning rate and batch size (and potentially momentum coefficients), the network reaches the conditions for SLT-like arguments to apply as soon as it hits the noise floor, rather than when it actually hits the optimum. The posterior still has to spread out to cover many possible parameter combinations which are roughly equivalent in loss due to noise, rather than due to analytically computing the same function, such that free energy differences between them will depend primarily on the complexity term and hence their local geometry. To me, this is a possible way of actually bridging the SLT math to the regimes existing in real-world large-scale pretraining, where, empirically, the loss after training for a long time seems determined primarily by the noise floor of the optimizer (i.e. batch size and learning rate) rather than by actually hitting some kind of local minimum. Moreover, the actual math should be much simpler in this case, since the gradient noise we observe in SGD is typically approximated as Gaussian, and the asymptotics of gradient-flow SDEs with Gaussian noise are an extremely well understood area. My feeling is thus that most of the insights of SLT should apply in this more realistic noisy regime (perhaps with some adjustments), but I feel that practitioners of SLT should study the noisy stochastic optimization regime much more closely[2].
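To illustrate why I think the math should carry over: in the simplest case of constant isotropic gradient noise, the stationary distribution of such an SDE is the familiar Gibbs form, which looks exactly like a tempered Bayesian posterior (this is a textbook Langevin result, not something specific to SLT, and the isotropy assumption is a big simplification):

```latex
% Stationary distribution of Langevin-type dynamics with constant isotropic noise:
p_\infty(\theta) \;\propto\; \exp\!\left(-\frac{L(\theta)}{T}\right), \qquad T \propto \frac{\eta}{B}
% i.e. a tempered "posterior" whose mass concentrates on regions of parameter
% space with both low loss and large volume, the same accuracy/entropy
% trade-off that the SLT free energy formalizes.
```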
[1] SLT argues, correctly, that its assumptions hold because neural networks are almost always singular and hence will exhibit this kind of structure and behaviour in the limit. This is of course true; however, my counterargument is that, empirically, we never actually get close to a minimum during training anyway, so any kind of argument involving an infinitesimal Taylor expansion in some epsilon-ball around the optimal set is not really applicable.
[2] And perhaps they already do and I’ve just missed it, given my very cursory look at the literature.