Obligatory disclaimer: This post is meant to argue against overuse of infohazard norms in the AI safety community and to demonstrate failure modes that I have personally observed. It is not an argument that infohazards should never be invoked anywhere, or that true infohazards do not exist. None of this is meant to be an absolute statement. Caveat emptor. Use common sense. If you have actually stumbled across some completely new technique which speeds up training by 100x or whatever then, in general, you should not share it. Most of the concerns here are raised from a mistake theory frame – it is assumed that these pathologies arise not due to any specific bad intentions or actions but due to natural social and epistemic dynamics.

When entering the alignment community I was initially pretty sold on the idea of infohazards and the importance of trying to keep potentially infohazardous ideas secret. Having now obtained much more experience with how infohazard norms work in practice, I have updated fairly strongly against them. I generally do not think they should be used except in exceptional circumstances. I also think that strong infohazard norms lead to a number of fairly predictable failure modes when they intersect with standard human social dynamics and status games, and lead people who practice them into a number of epistemic traps. Moreover, strong infohazard norms can often be weaponized and wielded as additional tools of power by cynical bad actors.

Specifically, I think it should only be acceptable to claim something is infohazardous when you have strong empirical evidence that 1) it substantially advances capabilities (i.e. more than the median NeurIPS paper), 2) it empirically works on actual ML systems and at scale, 3) it is not already reasonably well known within the ML community, and 4) there is no reason to expect a differential impact on safety vs capabilities – i.e. the idea has no safety implications and is pure capabilities.

My basic critique of infohazards and infohazard norms as they work in practice is threefold: claiming something is infohazardous prevents obtaining critical feedback, which leads to miscalibration; claiming something is infohazardous makes you seem high status, since you claim both to have powerful ideas and to be safety-conscious; and there is no real way to push back against infohazard claims, either to disprove them or to decrease the status of the claimant. What this means in practice is that there are only positive incentives to claim infohazards and no feedback from reality, leading to people being miscalibrated and claiming ideas are infohazardous for genuine reasons, or as a cynical status move, or some combination of both.

Specifically, a number of failure modes I have observed from infohazard norms are as follows:

Infohazards prevent seeking feedback and critical evaluation of work

Perhaps the most obvious point. Not sharing some research or idea because it is infohazardous means people cannot give feedback on it and highlight either that it is already known and has been tried and experimented with, or that it is obviously wrong for some reason that a critical outsider can see. Nobody will ever be able to see that the emperor has no clothes if it is too ‘infohazardous’ for the emperor ever to leave the palace in the first place.

Keeping things secret prevents this and leads to poor epistemics, since people get much less feedback on their ideas and so cannot easily learn which ideas are good and which are not. An additional issue in practice is that it is often difficult to tell whether a given potentially infohazardous idea will work just from hearing about it, without empirical testing. The history of machine learning is littered with clever ideas that do not work in practice or do not scale. Conversely, there are many ideas which have worked out extremely well that almost everybody who heard them initially would have thought stupid or impractical. People in the alignment community often have neither the desire nor the capability to actually implement and test these potentially infohazardous ideas; pragmatically, it can also just be fiddly and annoying, and take a lot of time and effort, to set up rigorous experiments to test whether your infohazardous idea actually works in practice. Not testing your infohazardous ideas means that you never get any feedback on whether or not they are true (and so actually infohazardous), which prevents learning.

Thinking your work is infohazardous leads to overrating its novelty or power

This is an extension of the previous point, but many times I have observed that once somebody has an idea they think is infohazardous, they rarely do a proper literature search to see if it has already been discovered and tested. From their perspective, this may even be rational: if you think an idea is infohazardous, you do not want to pursue it, and hence time spent checking novelty (often a fairly grueling literature search) is basically wasted. However, by selection effects, this misleads people into thinking that having infohazardous ideas is super easy, and therefore that infohazards are much easier to think of, and more dangerous, than they actually are.

Most of these ‘infohazard’ ideas are of the form ‘big, if true’. Usually, however, as expected of most random ideas, they are not true. Treating them as if they were leads to predictably incorrect epistemics.

Many times I have heard people talk about ideas they thought up that are ‘super infohazardous’ and ‘may substantially advance capabilities’, and then later, when I was made privy to the idea, realized that they had, in fact, reinvented an idea that had been publicly available in the ML literature for several years with very mixed evidence for its success – hence why it was not widely used and not known to the person coming up with the idea.

In general, while there are definitely some gaps and low-hanging fruit, capabilities research is much more of an efficient market than alignment research – as should be expected from the relative totals of cumulative effort that have been poured into the respective fields. Most random ideas you come up with have already been tested and found not to work, to have very mixed success, or not to be worth it when weighed against all the other tradeoffs.

Another failure mode of academia which encourages this is the difficulty of publishing negative results. You cannot publish a paper saying: ‘we tried some random idea and it didn’t work’. What this means is that what appear to be gaps in idea-space are actually dead-zones of failed PhD projects. Simply identifying something that nobody appears to have done before, especially if it is relatively simple and easy to test, is no indication that it is actually novel. Rather, it is likely that some random PhD student (or a large number of random PhD students) tried it, it failed, and they moved on without a trace.

Infohazards assume an incorrect model of scientific progress

One issue I have with the culture of AI safety and alignment in general is that it often presupposes too much of a “great man” theory of progress[1] – the idea that there will be a single ‘genius’ who solves ‘The Problem’ of alignment and that everything else has a relatively small impact. This is not how scientific fields develop in real life. While there are certainly very large individual differences in performance, and a log-normal distribution of impact with outliers having vastly more impact than the median, nevertheless in almost all scientific fields progress is highly distributed – single individuals very rarely solve entire fields by themselves.

Alignment seems unlikely to be different a priori, and is likely to require a deep and broad understanding of how deep learning and neural networks function and generalize, as well as significant progress in understanding their internal representations and learned goals. In addition, there will likely need to be large code infrastructures built up around monitoring and testing of powerful AI systems, and a sensible system of multilateral AI regulation between countries. This is not the kind of thing that can be invented by a lone genius from scratch in a cave. This is a problem that requires a large number of very smart people building on each other’s ideas and outputs over a long period of time, like any normal scientific or technological endeavor. This is why widespread adoption of the ideas and problems of alignment, as well as dissemination of technical work, is crucial.

This is also why some of the ideas proposed to fix the issues caused by infohazard norms fall flat. For instance, to get feedback, it is often proposed to have a group of trusted insiders who have access to all the infohazardous information and can build on it themselves. However, not only is such a group likely to get overloaded with adjudicating infohazard requests, we should not expect the vast majority of insights to come from a small recognizable group of people at the beginning of the field. The existing set of ‘trusted alignment people’ is strongly unlikely to generate all, or even a majority, of the insights required to successfully align superhuman AI systems in the real world. Even Einstein – the archetypal lone genius, at the time a random patent clerk in Switzerland far from the center of the action – would not have been able to make any discoveries if all theoretical physics research of the era had been held to be ‘infohazardous’ and only circulated privately among the physics professors of a few elite universities. Indeed, it is highly unlikely that in such a scenario much theoretical physics would have been done at all.

Similarly, consider the case of ML. The vast majority of advancements in current ML come from a widely distributed network of contributors in academia and industry. If knowledge of all advancements had been restricted to the set of ML experts in 2012, when AlexNet was published, this would have prevented almost everybody who has since contributed to ML from entering the field and slowed progress immeasurably. Of course there is naturally a power-law distribution of impact, where a few individuals show outlier productivity; nevertheless, progress in almost all scientific fields is extremely distributed and not confined to a few geniuses who originate the vast majority of the inventions.

Another way to think about this is that the AI capabilities research ‘market’ is currently much more efficient than the AI safety market. There are a lot more capabilities researchers across industry and academia than safety researchers. Capabilities researchers have zero problem sharing their work and building off the work of others – ML academia directly incentivises this and, until recently it seems, so did the promotion practices of most industry labs. Capabilities researchers also tend to get significantly stronger empirical feedback loops than a lot of alignment researchers and, generally, better mentorship and experience in actually conducting science. This naturally leads to much faster capabilities progress than alignment progress. Having strict infohazard norms and locking down knowledge of new advances to tiny groups of people currently at the top of the alignment status hierarchy further weakens the epistemics of the alignment community and significantly raises the barriers to entry – which is exactly the opposite of what we want. We need to be making the alignment research market more efficient, with fewer barriers to research dissemination and access than capabilities, if we want to out-progress them. Strict infohazard norms move things in the wrong direction.

Infohazards prevent results from becoming common knowledge and impose significant frictions

This is related to the incorrect model of science, but having norms around various ideas being infohazardous naturally inhibits their spread and imposes friction on people who want to learn or understand more about them and work on them. What this ultimately means is that many ideas, which should be widely spread so people can build on them, end up secluded and eventually ignored due to their supposedly infohazardous nature. Research that could be published and built upon is instead kept silent and thus eventually loses all relevance. Concerns about infohazards are sometimes even used to inhibit transmission of publicly available ML research such as arxiv papers, which simply means that AI alignment researchers become less informed about the state of the art in capabilities than they otherwise would be, since the normal social transmission of ideas is inhibited.

In organizations, infohazard policies can often impose substantial friction when there are internal restrictions on who can see what and which specific projects are deemed infohazardous. Not being allowed to tell colleagues what you are working on imposes a lot of friction, creates isolation, prevents rubberducking and seeking feedback, and generally creates a lot of distracting status games.

Infohazards imply a lack of trust, but any solution will require trust

A minor point, but one that is especially apparent in social and organizational settings. If you claim you have infohazardous information but do not share it, you are effectively telling people that you do not trust them not to misuse this information, e.g. for capabilities advancements. For a random interaction this might make sense, but in many cases it feels unwarranted. For instance, a fellow employee of an alignment-related org should probably be trusted by default – otherwise, why have they been hired? Similarly, fellow alignment researchers should probably be trusted as unlikely to just turn around and advance capabilities. Ultimately, if people are to coordinate to solve alignment, they have to trust one another, and having strict infohazard norms strongly implies that there is no such trust.

Infohazards amplify in-group and social status dynamics

While not related to epistemics, and not intrinsic to the idea of infohazards, a very common failure mode I have observed in practice is that infohazards and access to ‘infohazardous information’ get inextricably tied up with social status in both organizations and the general social scene. Having an ‘infohazardous’ idea is high status, since it signals that you are in possession of knowledge so powerful and dangerous that it must be kept secret. This naturally encourages people to claim their ideas are infohazardous by default, even if it is unclear whether they actually are, or indeed even if they have no ideas at all.

Secondly, access to infohazards often becomes a marker and gatekeeper of status within organizations. People higher up in the organization get access to more ‘infohazardous’ material and ideas. If somebody has an idea that could potentially be infohazardous, the expectation is that they report it to their manager, who can assess the degree of infohazard. Working on an ‘infohazardous’ project naturally becomes a sign of status, since it shows you are more trusted than those who are not working on such projects. To the extent that we expect the ability to assess infohazards to correlate with status, this is fine, but in practice status dynamics usually win out over correct assignment of who handles infohazardous information, and the result is simply an additional axis of status dynamics and a lot of interpersonal friction.

Finally, infohazard norms can be used to distort debate and to create and sustain groupthink bubbles. Axioms and claims can be made that are defended with the argument that there are infohazardous reasons, which cannot be shared, to believe in them. Leaders can claim their actions are driven by infohazard considerations such that they cannot share their full reasoning, which renders them impervious to questioning. This leads to an environment where it can be very hard to critique ideas and rationales for action, and can lead to a high level of groupthink and cultishness.

Infohazards can be abused as tools of power

While the previous points have originated from a mistake theory frame, where it is assumed that everybody is operating in good faith and is just miscalibrated, infohazard norms can also be purposefully abused by bad or simply cynical actors. Claims that specific information, or the reasoning behind actions, is too infohazardous to share can be used to isolate individuals in an organization and prevent them from coordinating to recognize patterns of negative or malicious behaviour. They can also be used to license and compel obedience to actions which otherwise appear ill-intentioned or misinformed, by claiming that there is secretly a good but infohazardous justification for the action. Infohazard concerns can also be weaponized to create a general culture of fear and distrust, which makes divide-and-rule strategies easier, and norms around not sharing infohazards can be wielded against whistleblowers and used to justify threats of termination or social shaming against the targets of power.

These risks are magnified by the typical conflation of status with access to infohazardous material within an organization. Typically, higher-status individuals are assumed to have more infohazardous information than those of lower status, and are often accepted as the arbiters of what information is potentially infohazardous in the first place.

Infohazards fail the ‘skin in the game’ test

Perhaps the final, and most meta, issue with infohazards in practice is that they fail the Talebian test of ‘skin in the game’. People claiming an idea is infohazardous detach the idea from reality and receive no feedback on the idea, nor on their claim. There is no downside to overclaiming an idea as infohazardous, and such claims rarely touch reality, since the idea is never empirically tested to see whether it is indeed infohazardous. Indeed, it is often looked down upon to perform the ‘capabilities’ work that would be necessary to put an infohazardous idea to an empirical test. With status upsides and no feedback from reality, all the incentives are skewed and, a priori, such a setup should naturally lead to a great profusion of claims of infohazardousness.

To make a highly uncharitable analogy, the situation is like a party full of socialist intellectuals who all claim they have billion-dollar startup ideas but who, of course, being the moral people they are, would not sully themselves by actually building these startups and becoming billionaires, because that would be exploiting the working classes. Their claims of having billion-dollar startup ideas might be true, but it seems highly unlikely on priors. Now, it is possible to argue that AI safety folk, by thinking hard about how AI might be dangerous and could exploit humanity, might be able to tap into unusual insights into the capabilities required for AGI. Analogously, the socialist intellectual might argue that their deep study of the evils of capitalism makes them uniquely positioned to understand how best to exploit the proletariat and become a billionaire. Again, this is possible, but without any strong evidence in favour, and in consideration of the status dynamics at play, such arguments should generally be dismissed.

The generalized problem here is twofold. First, the lack of feedback is crucial. It is easy to come up with plausible ideas that could work, and much harder to come up with great ideas that do work. If you never get any feedback on your ideas, you cannot learn to distinguish the plausible-but-bad ideas from the actually good ones, and will instead be heavily optimistically biased towards all your ideas being good. Secondly, there is the general failure mode in which people in a less efficient market attempt to point out flaws and failure modes in a more efficient market. In some specific cases, where there are obviously perverse equilibria, this can work, but usually the consensus view of the less efficient market is markedly miscalibrated and misinformed about the actual reasons behind the more efficient market's equilibrium.

  1. Overall, while I think the ‘great man’ theory is largely empirically incorrect, it has some useful qualities. It encourages people to be ambitious and actually try to solve ‘the problem’, rather than incentivising marginal incremental contributions as academia does. It also encourages independent thinking. On the bad side, it leads to severe miscalibration about the likely shape of the solution, as well as prescribing a very narrow ideal – the math prodigy – as necessary to contribute to the problem at all.