There has recently been a lot of discussion on LessWrong about whether alignment is a uniquely hard problem because of its intrinsic lack of empirical evidence. Once we have an AGI, it seems unlikely we could safely experiment on it for a long time (potentially decades) until we crack alignment. The argument goes that we instead need to solve alignment ‘in advance’, using primarily a-priori reasoning and theory construction. My feeling is that attempting to do this is extremely hard and very unlikely to succeed. I also think this is supported by intellectual history, which shows the dismal failure of most fields of study to make substantive intellectual progress without a constant stream of new empirical evidence. On outside-view grounds, I think this is very likely to happen to alignment as well, absent a strong push for empiricism.

However, the argument that direct experimentation on AGIs is going to be tough is still valid. This does not mean that we should throw up our hands in despair. Instead, as a field, we need to be inventive in coming up with new methodologies for getting empirical evidence. This could mean studying existing ML models, building deliberate toy models of the failure modes we are concerned about, developing explicit ways to measure and quantify alignment-relevant properties, empirically studying the scaling relationships of current alignment techniques, and red-teaming models to elicit misaligned behaviours at non-threatening capability levels that we can then study. These approaches are just off the top of my head, and if there is one thing we know from the history of science, it is that lateral and inventive thinking to come up with a new way to measure or quantify a phenomenon is a big step towards developing a rigorous science around it.
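To make one of these concrete: ‘studying the scaling relationships of current alignment techniques’ could start as simply as fitting a curve to how some alignment-relevant metric changes with model scale. Here is a minimal sketch of the shape of such an analysis; the numbers and the power-law form are purely hypothetical placeholders, not real measurements.

```python
import numpy as np

# Hypothetical measurements: model size (parameters) vs. some scalar
# "misalignment score" from a fixed evaluation suite. Placeholder numbers only.
params = np.array([1e7, 1e8, 1e9, 1e10, 1e11])
misalignment_score = np.array([0.42, 0.31, 0.24, 0.18, 0.14])

# Fit a power law score ~ a * N^b by linear regression in log-log space.
b, log_a = np.polyfit(np.log(params), np.log(misalignment_score), deg=1)
a = np.exp(log_a)
print(f"fitted exponent b = {b:.3f}, prefactor a = {a:.3f}")

# Extrapolate (cautiously!) to a larger model to see what the trend would predict.
n_future = 1e13
print(f"predicted score at 1e13 params: {a * n_future**b:.3f}")
```

Whether any real alignment metric actually follows such a clean trend is, of course, exactly the kind of empirical question I am arguing we should be asking.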

One approach which I think isn’t getting enough attention is deliberately inducing misaligned behaviour in current ML models, which are not yet at a dangerous level of capability. For instance, we should be able to experimentally create things like mesaoptimizers, deception, and gradient hacking in current ML models and then learn to quantify and study their properties in a safe and contained way. This should provide a rich vein of empirical evidence which we can study to try to derive a more general theory, one which can scale to larger models and ultimately to AGI.
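As a deliberately simplified sketch of the kind of contained experiment I have in mind (this toy is closer to shortcut learning or goal misgeneralization than to a genuine mesaoptimizer, and every detail is illustrative): train a small model on data where a spurious feature perfectly tracks the intended objective, then break that correlation and measure how far the learned behaviour diverges from what we intended.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, spurious_matches_label):
    """The intended objective depends on a 'core' feature; a 'spurious'
    feature copies the label during training but is flipped at test time."""
    core = rng.normal(size=n)
    label = (core > 0).astype(float)
    spurious = label if spurious_matches_label else 1.0 - label
    # Give the spurious feature a large scale so it is the easiest shortcut.
    X = np.stack([core, 3.0 * (2 * spurious - 1)], axis=1)
    return X, label

def train_logreg(X, y, lr=0.1, steps=2000):
    """Plain logistic regression trained by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    return float((((X @ w) > 0) == (y > 0.5)).mean())

X_train, y_train = make_data(2000, spurious_matches_label=True)
X_test, y_test = make_data(2000, spurious_matches_label=False)

w = train_logreg(X_train, y_train)
print("weights [core, spurious]:", w)
print("train accuracy:", accuracy(w, X_train, y_train))
# If the model latched onto the shortcut, accuracy collapses once the
# correlation is broken, even though the intended objective is unchanged.
print("test accuracy (correlation broken):", accuracy(w, X_test, y_test))
```

The point is not that this toy tells us anything about AGI directly, but that the failure was constructed on purpose, in a box, where we can measure it.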

Theory brings generality, which is exactly what we need to ensure that our techniques will scale to AGIs before we build them, but theory cannot be built from nothing. Empirically, if we study the history of science, it seems theory needs a huge base of empirical evidence to work with, as well as constant reciprocal interaction between theory and practice to really progress.

Why theorizing alone cannot go far

Empirically, throughout history, just thinking about things really hard without going out and gathering new empirical evidence has been extremely unproductive for us as a civilization and has rarely led to new advances. Almost all of today’s sciences became true sciences when they figured out a methodology for getting reliable empirical evidence about their phenomenon of study. The development of new measurement devices reliably ushers in new periods of advancement.

The fundamental reason why a-priori reasoning cannot advance far alone is the exponential state space of logical arguments, combined with a diminishing signal-to-noise ratio.

Even in an ideal world without uncertainty, at any stage of a logical argument there is a vast number of possible further moves to make, and the space of possible derivations grows exponentially with depth. Even if each move is logically valid, searching this space is completely intractable. This is why we, as a species, can fail for a long time to discover arguments which are, in theory, a-priori deducible from known premises (such as evolution 1), even once all the prerequisites are there.

Secondly, in the real world we very rarely have the luxury of complete certainty in our premises or the infallibility of our logical deductions. As such, without the in-built error correction provided by empirical evidence, it is almost inevitable that we diverge from the true path very quickly and find ourselves wandering in the infinite wastes of falsehood, with no way to get back on track except by returning all the way to our original premises.

Another way to think of this is from the machine learning / statistics perspective. We can think of the scientific process as society slowly performing some kind of approximate Bayesian inference over the empirical data it has received. When we have very little data, we have two problems. The minor one is that our (society’s) ‘priors’ matter more, and if they are wrong this can completely stymie a field of scientific advance indefinitely. The second, more fundamental problem is that with little data there is a huge optimal manifold of hypotheses with effectively equal likelihood, and nothing to prefer one hypothesis over another. A stochastic optimization process like science then essentially performs a random walk over this manifold. Absent new information, this equilibrium can persist indefinitely.
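A toy numerical illustration of this point (an entirely generic coin-bias inference, nothing alignment-specific): with only a handful of observations the likelihood barely distinguishes the hypotheses, so the posterior mostly echoes the prior, here a confidently wrong one, and only vastly more data drags it toward the truth.

```python
import numpy as np

rng = np.random.default_rng(0)
true_bias = 0.7
grid = np.linspace(0.01, 0.99, 99)   # hypothesis space: possible coin biases

def posterior(n_obs, prior):
    """Grid-based posterior over coin bias after observing n_obs flips."""
    flips = rng.random(n_obs) < true_bias
    k = flips.sum()
    # Binomial log-likelihood over the hypothesis grid (up to a constant).
    log_like = k * np.log(grid) + (n_obs - k) * np.log(1 - grid)
    post = prior * np.exp(log_like - log_like.max())
    return post / post.sum()

# A confidently wrong prior, sharply peaked near bias = 0.2.
prior = np.exp(-0.5 * ((grid - 0.2) / 0.05) ** 2)
prior /= prior.sum()

for n in [3, 30, 3000]:
    post = posterior(n, prior)
    mean = (grid * post).sum()
    spread = np.sqrt(((grid - mean) ** 2 * post).sum())
    print(f"n={n:5d}  posterior mean={mean:.2f}  spread={spread:.2f}")
```

Science as a whole is of course not a coin-flip experiment, but the underdetermination dynamic is the same.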

What this looks like in practice is that the field has a number of schools which disagree about fundamental questions. Each school is usually founded by a charismatic intellectual founder who makes most of the initial arguments, with followers then tending to merely fill out relatively small details. There are constant debates and controversies, but they all seem intractable and little progress is ever made 2. Usually these schools entrench themselves for a long time, developing ever more elaborate arguments against their rivals which can diverge arbitrarily far from the truth. These schools usually take up most of the ‘intellectual space’ available on the optimal manifold, but occasionally an intellectual entrepreneur can find a new and unexplored region and create a new school around their thought. This creates a lot of drama and realignment in the field but eventually settles down to a new equilibrium. The same pattern is observable in many fields throughout history, even those that are now scientific: astronomy before telescopes, much of medicine essentially up until the 20th century, chemistry before the 17th century, all theology, essentially all psychology/psychiatry up until at least the late 19th century (much is still like this today), and so on.

I would argue that this pattern also describes many fields of study today: many subdisciplines of philosophy, much of academic sociology and anthropology, much of the humanities, parts of linguistics such as theoretical syntax of the Chomskyan variety which largely disdains empirical evidence, psychiatry, some parts of economics (especially macro and development), programming language theory in CS, basically all of political science, as well as things like business and military strategy. The same thing, I would tentatively argue, has happened to high energy particle physics, where new experimental evidence is extremely hard to come by. This also shows that extremely high average IQ and mathematical aptitude among participants is no defense against these dynamics. The key thing is that there is only some fixed (or sometimes no) base of empirical evidence, gathering any further evidence is very hard, and nobody in the field has worked out a reliable methodology for doing so.

AI alignment is still a very young field, so we probably still have a lot of free energy left in the school-building stage, and we also get constant injections of at least some empirical evidence from machine learning; but the same dynamics could be, and probably are, setting in already. As a species, we really cannot afford for AI alignment to fall into such an unproductive trap while ML progress accelerates, so we need to focus on developing sound and robust methodologies for generating useful experimental data for alignment.

Case studies of theory and practice

In general, our current experience of ML, where experiment and engineering run far in advance of theory, is also extremely common historically (and perhaps is a common element of any technological revolution). We saw this happen again and again in the first and second industrial revolutions, where industrial advances happened primarily as a result of tinkerers and engineers experimenting practically, while a theoretical understanding of the phenomena they were exploiting emerged only decades later. The first steam engines were invented by Thomas Savery in the closing years of the 17th century and Thomas Newcomen in 1712, and were steadily improved. Then, in 1765, James Watt invented the separate condenser, which dramatically improved their efficiency 3. A theoretical understanding of why steam engines worked, and of their fundamental limits, only started to emerge in the early 19th century with Carnot, and the rigorous field of thermodynamics only started moving towards maturity in the mid 19th century with Clausius, approaching its modern form with Boltzmann, Gibbs, and Planck over the following decades. This is, at best, a roughly 60 year gap between Watt and Carnot, and at worst a 150+ year gap between Newcomen and Gibbs. We probably can’t be so lax with alignment, as even a 50 year gap between the development of the first superintelligences and a full theoretical understanding of them would probably spell our doom.

A similar pattern occurred with the advances in metallurgy in the 18th and 19th centuries, such as the Bessemer process for making steel, which was developed by purely empirical means, while the materials science and chemistry needed to really understand these processes only emerged in the early-to-mid 20th century. A hybrid course of development happened with electricity, where initial experiments in the late 18th and early 19th centuries outran theory, but theory then slowly caught up and a full theory of electromagnetism was propounded by Maxwell in the 1860s. The later development of technologies like radio, decades afterwards, was thus entirely informed by the theory. In the early 20th century, theory in physics seems to have primarily outrun experiment, or followed very closely on its heels, such that technologies like nuclear weapons were developed with a (mostly) known theory already in place. I don’t have a fantastic grasp of the history of chemistry, but I am pretty sure that there was a huge amount of empirical work that occurred first, including rigorous scientific measurement and study from the 17th century onwards, with a full theoretical understanding of what was going on only really developing in the 19th century (the periodic table, acids and bases) up to the early 20th century for atomic orbital and bonding theory.

In the early development of statistics in the late 19th and early 20th centuries, the pioneering work was often theoretical, with practical applications following later or at best contemporaneously. Pioneers such as Galton and Fisher tended to invent statistical methods to match the problems they were facing in their empirical work (usually in biology), but they also had strong mathematical and theoretical inclinations which ensured that both sides developed contemporaneously.

My feeling is that, in intellectual history, pretty much the only case of significant advances being made solely as a result of a-priori reasoning is in physics, specifically Einstein’s development of special 4 and then general relativity. This has a large cultural cachet but is a very rare occurrence. In general, practice typically precedes theory, often by a substantial margin. This makes sense, since a phenomenon has to be discovered and investigated before somebody can construct a theory to explain it. What is interesting is that we often see substantial gaps between the phenomenon and the theory being developed. Some of this stems from the intrinsic difficulty of the theory, but oftentimes there is also a significant and fundamental intellectual gap, where the theory explaining a phenomenon sits much higher up the tech-tree than the ability to gain a pretty good empirical mastery of it.

What does this mean for alignment? Fundamentally, I think that the way alignment will make progress is primarily through direct dialectical engagement with the empirical data we can get about it. If we can build test models of things like mesaoptimizers and deception that we can study in depth, then we will be doing well. Interpretability is highly blessed here, since we have essentially full access to anything we could ever want to measure in our neural networks – so much easier than neuroscience! To make rapid progress, we need the interplay of empiricism and theory 5.
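To illustrate the ‘full access’ point, here is a minimal PyTorch sketch (the tiny model is just a stand-in for whatever network we actually care about): recording every intermediate activation and weight exactly takes a few lines of code, where neuroscience has to make do with sparse and noisy probes.

```python
import torch
import torch.nn as nn

# Stand-in model: in practice this would be whatever network we are studying.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

activations = {}

def save_activation(name):
    # A forward hook records the exact output of a submodule on every forward pass.
    def hook(module, inputs, output):
        activations[name] = output.detach().clone()
    return hook

for name, module in model.named_modules():
    if name:  # skip the top-level container itself
        module.register_forward_hook(save_activation(name))

x = torch.randn(8, 16)   # a batch of dummy inputs
_ = model(x)

for name, act in activations.items():
    print(f"layer {name}: shape={tuple(act.shape)}, mean activation={act.mean().item():.3f}")

# The weights are just as transparent as the activations.
print("first-layer weight matrix shape:", tuple(model[0].weight.shape))
```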

If it turns out that alignment techniques and theoretical understanding grounded in simpler models do not generalize to full AGI, then we are in trouble. Our best bet then, I suspect, would be to figure out protocols and containment techniques such as boxing that would allow us to contain the AGI long enough to experiment with it and work out why our alignment techniques from smaller models fail to generalize. I am more optimistic about boxing than most and think this could work, but it would definitely involve more risk to humanity than would be ideal.

If reality turns out such that we cannot learn from studying alignment in less capable models, and we cannot safely box a superintelligence for a reasonable amount of time – i.e. the sharp left turn model – then I share Eliezer’s pessimism. Our primary hope then would be to study all the empirical evidence we have from ML to deduce a general theory of intelligence, and all we know from neuroscience and psychology to work out a general theory of values and what human values are, and then try to synthesize these into an alignment method for which we get one shot. I am not super bullish on this, but it is a shot after all.

  1. In general, as a civilization, we appear to have gone down the tech tree of evolutionary biology remarkably late. The theory of evolution and much that it entails is derivable a-priori from a few very simple postulates (that children inherit characteristics from their parents, that not all organisms rear the same number of children, and that differential reproductive success is at least somewhat related to those inherited characteristics). Empirical evidence for these postulates was widely available for pretty much all of human history, and yet it took until Darwin and Wallace for anyone to come up with the theory. This case speaks to the paucity of our ability to reason a-priori to novel conclusions even when the required reasoning is extremely simple. Similarly, there was no technological constraint on experimentation here. Mendel’s breeding experiments on peas (and even many that take place today) required no technology unique to the 19th century and could easily have been done by the ancient Greeks. Similarly, in medicine, randomized controlled trials require no technology unknown to the first civilizations. 

  2. Importantly, this is how it looks from the outside, looking back many years later. From the inside, such a period is probably one of intellectual excitement, constant argument and debate, substantial refinement of arguments, and a great deal of academic drama. It is just that, over a long horizon, nothing really seems to change. 

  3. My personal feeling is that contemporary ML actually has substantial similarities to the development of the early steam engines: primarily and initially driven by tinkerers, guided by their intuition, adding various additional components and tweaking existing designs to improve efficiency and performance. Nevertheless, lurking in the wings there exists a beautiful and simple theory which reveals fundamental facts about the universe, and which will eventually be found thanks to our attempts to understand what we have built. Steam engines led directly to thermodynamics and ultimately to our understanding of entropy, work, and the ultimate energetic limits on civilization. Neural networks, I suspect, will lead us to a unified theory of intelligence, information, and the kinds of minds that can exist, and hence the ultimate limits of intelligence. 

  4. Actually, I don’t think special relativity counts, since a) there were empirical problems known to science at the time, such as the presence or absence of the aether, which were known motivators, and b) Poincaré and Lorentz appear to have invented much of it essentially simultaneously with Einstein. 

  5. It is also important to think hard about what ‘theory’ and ‘practice’ might mean in terms of alignment. It is very possible that we might actually solve alignment primarily through practical methods developed by empirical iteration on aligning less capable models. A theoretical understanding of how these methods work would then develop later, post-singularity. This isn’t an ideal scenario, but I am more optimistic than most, I think, that it could work. Of course, there is nothing stopping reality from hating us, such that all alignment techniques necessarily fail to generalize to high-capability regimes, we are also in a world of extremely rapid and easy FOOM and uncontrollable AI proliferation, and human values are a set of measure 0 in mind-space. But I think the current ML paradigm actually gives us hints to the contrary: from the scaling laws we see declining returns to intelligence (as opposed to the increasing or linear returns needed for FOOM), and smooth and predictable scaling instead of sharp discontinuous jumps (see the toy calculation below). Large-scale ML models are extremely computationally costly, which seems to limit proliferation in the near term. And if things like shard theory and empathy emerging naturally are true, then human values might be some kind of natural attractor.
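To spell out the ‘declining returns’ point in footnote 5 with a toy calculation: if loss falls as a power law in compute, each extra order of magnitude of compute buys a smaller absolute improvement. The power-law form is the familiar scaling-law shape, but the constant and exponent below are illustrative placeholders, not fitted values.

```python
import numpy as np

# Illustrative scaling law: loss(C) = a * C**(-alpha).
# The power-law *form* mirrors published scaling-law results; these particular
# numbers are placeholders chosen only to show the shape of the curve.
a, alpha = 10.0, 0.05

compute = np.logspace(20, 27, 8)   # FLOPs, spanning seven orders of magnitude
loss = a * compute ** (-alpha)

for c, l_prev, l_next in zip(compute[1:], loss[:-1], loss[1:]):
    print(f"compute={c:.0e}  loss={l_next:.3f}  "
          f"gain from the last 10x of compute={l_prev - l_next:.3f}")
```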