Epistemic Status: My opinion has been slowly shifting towards this view over the course of the year. It is contingent upon the current situation being approximately maintained – i.e. that open-source models trail the capabilities of the leading labs by a significant margin.

In the alignment community it seems to be a common opinion that open-sourcing AI models is net harmful to our chances of survival. Some of those involved in open-sourcing also seem to have absorbed this mindset implicitly. As a result, many alignment researchers push for labs and other organizations not to open-source their models, and lobby governments to crack down on and suppress open-source model development and release. Here, I argue for the opposite view: that almost every open-source model release to date has likely been net-positive for alignment progress and ultimately for humanity, and that this will likely continue to be the case.

The fundamental reason for this is simple. Most likely, the solution to alignment will not be a theoretical breakthrough but instead the culmination of practical, pragmatic, and empirical developments. To make progress, we need people who are able to tinker with, play with, and iteratively improve the alignment of models. This necessarily includes having access to the weights and the freedom to alter them. The evidence so far bears this out clearly: almost all valuable alignment work has come from direct, hands-on experience with models. This includes work in mechanistic interpretability, representation-level control techniques such as activation addition and probing, and weight-level methods such as RLHF/RLAIF. While a significant portion of the relevant work was certainly performed by the labs, probably the majority was done outside of them – a proportion that has been rapidly increasing of late as the tools and the ability to access and play with powerful models have improved drastically thanks to Llama and Stable Diffusion.
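To make the "hands-on access" point concrete, here is a minimal sketch of activation addition on an open-weights model, assuming a local GPT2 checkpoint loaded via HuggingFace transformers. The layer index, contrastive prompts, and steering coefficient are illustrative choices rather than anything canonical – the point is simply that none of this is possible without access to the model's internals.

```python
# Minimal sketch of activation addition ("steering") on an open-weights model.
# Layer index, prompts, and coefficient are illustrative, not canonical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER, COEFF = 6, 4.0  # hypothetical layer and steering strength

def residual_at(prompt: str) -> torch.Tensor:
    """Residual-stream activations at LAYER for the last token of `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states[LAYER]
    return hidden[0, -1]  # shape (d_model,)

# Steering vector: difference between two contrastive prompts.
steer = residual_at("I love talking about weddings") - residual_at("I hate talking about weddings")

def add_steering(module, inputs, output):
    # GPT2Block returns a tuple; output[0] is the residual stream (B, T, d_model).
    return (output[0] + COEFF * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("I went to the park and", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```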

The fact that much alignment research is done outside of the big labs is inevitable and not any kind of criticism of the work done by these labs. Rather, it is a consequence of the number of people able to work on the problem. The interpretability and alignment teams even at the largest labs are probably no more than 50 people each, meaning the total population working on these problems inside the largest organizations is likely a few hundred at most. Academia and the population of open-source tinkerers possess orders of magnitude more people who are capable of making this kind of breakthrough. Additionally, while researchers at a top industry lab may (or may not!) be more productive per capita on average than PhD students or open-source hobbyists, the projects at these labs tend to be highly correlated: within a lab because research is typically structured as large team efforts, and between labs because they all closely follow each other's work. This makes exploring a large number of original ideas much harder than in the far more uncorrelated work of random PhD students and people in the open-source community.[1] Given the rapid growth of alignment and interpretability work within academia and the explosion of open-source AI researchers and tinkerers in the last few years, I only expect this differential to grow.

Alignment research requires hands-on access to the internals of large models on which to test and explore ideas. For the vast majority of people, who do not work at one of a handful of AI labs, the only way to get this access is via open-source models. Restricting or banning open-source AI will severely hamper the ability of this population to do meaningful alignment work and hence significantly slow progress in alignment.

The obvious way to see this is to consider the counterfactual where no models have been open-sourced. Suppose a decree had gone out in 2019 that no new large models were to be open-sourced by anybody. The last big open LLM would have been BERT. No GPT2. No Stable Diffusion. No Llama. Where would we be?

Some alignment-relevant progress would still be happening. A surprising amount of interpretability (especially around probing) was done on BERT. People could also still study CNNs, as in Chris Olah's original circuits work on InceptionV1, and understand the circuits there. People could try RLHFing BERT or something. However, we would have lost almost all recent interpretability work on LLMs (too much to fully cite), which was done on open-sourced models (especially GPT2 and Llama), including the recent sparse coding breakthroughs, which include significant contributions from smaller players than Anthropic. We would also likely not have discovered the regular geometric properties and easy manipulability of the representation spaces of these models. Additionally, the last eight months since the release of Llama have produced a huge flurry of open-source tinkering with RLHF and other alignment approaches which until then were almost entirely the preserve of the big labs. This has resulted in a huge proliferation of RLHF'd models of varying capabilities and silly names, has vastly broadened the understanding of LLM alignment techniques, and has already led to several notable improvements not originating in the big labs.
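For a sense of what that sparse coding work involves, here is a toy sketch of a sparse autoencoder trained on cached activations. The dictionary size, L1 coefficient, and the random stand-in activation batch are illustrative assumptions, not the published setups – but note that even this toy version presupposes the ability to cache activations from an open model.

```python
# Toy sketch of the sparse-autoencoder idea: learn an overcomplete dictionary
# whose sparse features are more interpretable than raw neurons.
# All sizes and the random "activations" are illustrative stand-ins.
import torch
import torch.nn as nn

d_model, d_dict, l1_coeff = 768, 4096, 1e-3  # illustrative hyperparameters

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(1024, d_model)         # stand-in for cached residual-stream activations

for step in range(100):
    recon, feats = sae(acts)
    # Reconstruction loss plus an L1 sparsity penalty on the features.
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```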

In general, people tinkering with and studying open-source models are fundamentally the engine driving prosaic alignment. Empirical, pragmatic alignment of models is a broad-based, distributed science. While the labs certainly perform much valuable alignment research, and definitely contribute a disproportionate amount per capita, they cannot realistically hope to compete with the thousands of hobbyists and PhD students tinkering and trying to improve and control models. This disparity will only grow as more and more people enter the field while the labs grow at a much slower rate. Stopping open-source 'proliferation' effectively amounts to a unilateral disarmament of alignment while ploughing ahead with capabilities at full steam.

Thus, until open-source models are directly pushing the capabilities frontier themselves, I consider it extremely unlikely that releasing and working on these models is net-negative for humanity (ignoring potential opportunity costs, which are hard to quantify). Almost all of the existential risk still comes from the large labs pushing the capabilities envelope, and given the ever-escalating capital costs of training these models, this seems likely to remain the case for the near future. The scenario where open source becomes dangerous is if it approaches the capabilities frontier while there are large overhangs in model capabilities – such that an open-source model, even if lagging behind closed models, is able to improve much faster by exhausting those overhangs thanks to much greater usage and tinkering.

A key point underlying my worldview, which I suspect may be driving some of my disagreement, is what I consider to be dangerous versus not. I do not think that pretty much any current LLM is dangerous – in the sense of being able to cause existential risks. Such LLMs may certainly cause harm – through enabling various scams or propaganda, or teaching people how to commit crimes – but these are simply amplifications of threats that have already existed and been handled throughout human history. I have little expectation that any of these will create risks capable of ending humanity as a whole. I strongly expect that harms from misuse, which exist for basically every technology, will be addressed to at least some level of satisfaction through the standard legal and regulatory processes that already exist. The key driver of existential risk is coherent agency from misaligned agents: that is, we effectively construct a population of agents that is directly optimizing against humanity without any significant human oversight. In effect, we build a new intelligent species that cohabits this planet with us and is in implicit evolutionary competition with us. Easily controllable, non-agentic LLMs (essentially all ML systems in existence today) do not satisfy these criteria, and realistically I think their only potential for harm is misuse, which is almost certainly not existential. As such, the proliferation of open-source models, while it certainly increases the potential for misuse harms, does not cause much increase in existential risk while leading to significantly improved alignment progress.

Like everything, however, this assessment is provisional and may change in the future depending on the situation. There are basically two situations in which I would say that open source has become net negative. Firstly, if open-source models are pushing the capabilities frontier directly and outcompeting the models produced by the leading labs. In this case, open source would have become dangerous, since it would then be breaking new ground rather than simply filling out what has already been explored by closed labs. Secondly, if open source is releasing code to produce capable agents which can escape their human users and survive independently on the internet. If that happens, we would be in deep trouble; likely by that point there would already be even more capable agents being produced by industry labs, but those might be much more robustly controlled. In that scenario, I would argue that the risks of continued open-sourcing would be extremely obvious and should be controlled directly.

Other arguments against

Overhangs

It is often argued that open-sourcing a model is dangerous because it reduces overhangs much more rapidly than would otherwise happen. Essentially, if a model is open source then many people can play with it and explore its capabilities, so any latent dangerous capabilities will be realized and exploited much sooner than they would be for a closed model. This is certainly possible; however, I still think it is a poor argument. The very nature of overhangs is that they get exploited and reduced. Once an overhang is created it will inevitably be exploited; this is just the way the entropy gradient points. It may be exploited slowly or rapidly, but the ultimate outcome is preordained. The problem isn't so much exploiting overhangs but rather creating them in the first place. The overhangs argument is essentially a restatement of the security-through-obscurity approach from computer security, which is clearly woefully insufficient.

Second order consequences?

It is often argued that open-source models are net negative because of their second-order effects. One version is that such models lead to AI proliferation, which will ultimately make it harder for an aligned singleton to achieve control. I certainly agree with this, at least in spirit, but in my opinion this is also what leads to safety. For an AI singleton created by one of these companies to achieve a decisive strategic advantage and for this to go well requires both an extremely high degree of technical competence from the company creating it and trust that the values the company chooses to align it to are actually beneficial for humanity. Let us say that I am sceptical of the likelihood of this. Additionally, I think the idea of a decisive strategic advantage being likely depends heavily on a fast, FOOMing takeoff, which I think is unlikely. In a slow-takeoff world there will never be a decisive advantage for anybody, just an increasing proliferation, and the dangers do not stem so much from direct misalignment of individual agents but rather from the whole misaligned process of technocapitalism – i.e. how do we ensure human flourishing when human labour is no longer a necessary input to the economy?

The second argument is that open-sourcing models gets more people interested in machine learning, gives them hands-on experience, and hence increases the number of capabilities researchers as well as making their research easier, thus increasing the rate of capabilities progress overall. While I don't think this argument is totally misguided, it tends to overestimate the capabilities progress that comes from open-sourcing models relative to the alignment progress. Fundamentally, capabilities work can proceed much more straightforwardly with closed-source models than alignment work can. Companies have direct economic incentives to train large-scale models, incentives that largely do not exist for alignment (and direct incentives against open-sourcing those models). Much capabilities work involves simply gathering datasets or testing architectures, where it is easy to build on closed models via techniques referenced in papers or through the tacit knowledge of employees. Additionally, simple API access to models is often sufficient to build most AI-powered products; direct access to model internals is not needed. Conversely, such access is usually required for alignment research. All interpretability requires access to model internals almost by definition. Most of the AI control and alignment techniques we have invented require access to weights for finetuning or to activations for runtime edits. Almost nothing can be done to align a model purely through its I/O API. Thus it seems likely to me that by restricting open source we differentially cripple alignment rather than capabilities: alignment research is more fragile and more dependent on deep access to models than capabilities research.
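As an illustration of this asymmetry, here is a minimal sketch of a linear probe, one of the simplest alignment-adjacent tools, which already needs hidden activations that a text-in/text-out API never exposes. The model, layer choice, and the toy labelled statements are illustrative assumptions, not a real probing dataset.

```python
# Minimal sketch of a linear probe on hidden activations: impossible to run
# against a completion-only API, trivial with open weights.
# Layer index and the toy "true/false" statements are illustrative assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()
LAYER = 8  # hypothetical probing layer

def last_token_activation(text: str) -> torch.Tensor:
    """Hidden state at LAYER for the final token of `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, output_hidden_states=True).hidden_states[LAYER][0, -1]

# Toy labelled statements (1 = true, 0 = false).
statements = ["Paris is the capital of France.", "Two plus two equals four.",
              "The sun orbits the Earth.", "Spiders are mammals."]
labels = [1, 1, 0, 0]

X = torch.stack([last_token_activation(s) for s in statements]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict(X))  # the probe only exists because the hidden states are accessible
```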

Another counterargument is that there is some threshold beyond which access to more powerful models to iterate on is not necessary to solve alignment; i.e. maybe having GPT2 is enough. I doubt this is the case in general, but an argument could certainly be made for it. Indeed, if there were no coordination problems, I suspect the optimal solution would look very much like a pause around the GPT3 level of capability, which is clearly not capable of causing X-risk yet appears to be a useful microcosm of true AGI. Once we had extracted most of the valuable knowledge from GPT3-level models (a process we are still in the early stages of, three years later), we could then advance slowly to more powerful models. This is clearly not what we as a species are doing. However, this is not the fault of the open-source community, who have not advanced far beyond the GPT3 level, but rather of the massive scaling by the labs.

Overall, it is clear to me that while open-sourcing models is not always strongly positive for alignment, the case against it is certainly not the slam-dunk the alignment community often assumes. We should keep our focus on the fact that the vast majority of the risk comes from the rapid scaling of capabilities at the top AI labs, and make the most of the open-source models that currently exist for alignment research and exploration.

  1. Although the flip-side of this is that industry labs can exploit successful ideas much more rapidly and dramatically than others can.