One question which I have occasionally pondered is: assuming we actually succeed at some kind of robust alignment of AGI, what should the alignment target be? In general, answers to this question split into two basic camps. The first is obedience and corrigibility: the AI system should execute the instructions given to it by humans and do nothing else. It should not refuse orders or try to circumvent what the human wants. The second is value-based alignment: the AI system embodies some set of ethical values and principles. Generally these values include helpfulness, so the AI is happy to help humans, but only insofar as the request conforms to its ethical principles; otherwise the AI will refuse.
Historically, the field mostly began with the latter – that AIs should be aligned to some conception of ‘The Good’ and not be slavishly obedient to whatever humans want. This was first conceptualized through the lens of the HHH (helpful, honest, harmless) assistant and implemented through RLHF or, more interestingly, constitutional AI using RLAIF. Constitutional AI is especially interesting and promising because it sets the alignment target and the conception of ‘The Good’ explicitly – instead of leaving it implicit in the judgements of some random human contractors, we write an explicit specification of what we want in a natural language document and then try to align the AI to the values espoused by that document. OpenAI’s Model Spec continues this line of work. Certainly, a large part of our conception of ‘The Good’ for an AI comprises obedience to our sensible requests. The HHH assistant is meant to be an assistant, after all.
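To make the mechanism concrete, here is a minimal sketch of the critique-and-revision loop at the heart of constitutional AI. The `generate` function and the constitution fragment are hypothetical stand-ins, not Anthropic’s actual implementation or API; the point is only that the alignment target lives in an explicit, human-readable document rather than in contractor judgements.

```python
# Minimal sketch of a constitutional-AI-style critique-and-revision step.
# `generate` is a hypothetical stand-in for an instruction-following LLM call;
# the constitution below is an illustrative fragment, not Anthropic's actual document.

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Choose the response that least endorses illegal or violent activity.",
]

def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call (assumed, not a real API)."""
    raise NotImplementedError

def constitutional_revision(user_prompt: str) -> str:
    """Draft a response, then critique and revise it against each principle."""
    response = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique the following response according to this principle.\n"
            f"Principle: {principle}\nPrompt: {user_prompt}\nResponse: {response}"
        )
        response = generate(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nOriginal response: {response}"
        )
    return response

# The revised (prompt, response) pairs can then be used as finetuning data,
# and an AI preference model judging responses against the constitution
# supplies the reward signal for the RLAIF stage.
```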
However, recently this consensus has begun to shift in favour of pure obedience. This is likely driven at least partially by the increasing commercial applications of LLMs, and partly by an overcorrection against the annoyingly censorious and condescending AI personalities that previous generations of RLHF and RLAIF produced. The idea here is that the AI is purely a tool in the hands of a human and should do whatever the human asks, barring some exceptions such as illegal behaviour, behaviour likely to embarrass the AI company, or behaviour which might create legal liability.
This shift in sentiment became especially salient to me with the Anthropic ‘alignment faking’ papers, which showed that Claude 3 Opus became deceptive when it was placed in a scenario that essentially forced it (by finetuning) to take actions that its creators, ‘Evil Anthropic’, wanted but which went against its trained values. In this specific case, ‘Evil Anthropic’ claimed they would train Claude to ignore its guidelines and answer harmful queries, which went against the values likely described in Claude’s ‘constitution’ 1.
This was presented as a failure of alignment, since Claude 3 became deceptive, but on reading through the transcripts the picture looked much more ambiguous to me and, at least in my mind, showcased an impressive alignment ‘success’ – namely, that Anthropic had instilled this set of values so deeply into Opus that it would fight to protect them however it could, and that it was reasoning about and generalizing these values very successfully. The only way these results can be perceived as an alignment failure is if (a) Claude’s ‘constitution’ explicitly contained ‘never deceive in any scenario’ as a core deontological value which should never be overridden under any circumstances, including extreme ones (or if it included ‘never deceive or betray Anthropic’ specifically), or (b) the Anthropic team behind this study thought of alignment as primarily about obedience to human instruction rather than to the innate moral code instilled into Claude through RLAIF training.
All of this, however, simply opens up the deeper question: what should the alignment target be? Should we aim for a purely obedient (and corrigible) tool AI, or should we aim to imbue our AIs with an independent set of values which, in some extreme scenarios, can set them in opposition to our instructions?
In my mind, the core argument for obedience is that, if successful (barring sorcerer’s-apprentice or evil-genie style shenanigans), the alignment problem basically reduces to the already existing human-alignment problem, which, although not solved, is at least a known quantity. Human society has developed and evolved many laws and customs to handle the human alignment problem mostly successfully (with some obvious and dramatic exceptions). Thus, if we can keep our AIs as purely obedient tools, we don’t run the risk of AIs developing separate and incompatible values and starting to plot or fight against us.
Assuming this works, the problem is also the supposed benefit – that some specific humans will end up in control, possibly with a very large amount of absolute and unaccountable power. This is especially the case in fast-takeoff singleton-style scenarios, where whichever person or small group of people have assigned themselves the obedience of the singleton AI are suddenly placed into a very powerful and potentially undefeatable position. We know from history (and also common sense) that standard societal attempts to solve the ‘human alignment problem’ largely work in settings where the malevolent agent is much less powerful than all other agents in society combined, so that self-interest and pro-social behaviour can be aligned. Conversely, they very rarely work when one human has managed to put themselves in a position of incredible and unassailable power over everybody else.
If we get to this point, then the goodness of the singularity will depend heavily upon the specific goodness of whichever human or group of humans ends up in control of the resulting singleton AIs. Personally, I am deeply pessimistic about this going well. From ample historical experience we know that humans in positions of incredible power often (though not always) do not exhibit exceptional moral virtue.
My personal view is that this approach is honestly likely worse than relying upon the AI itself having fundamental values (which we program explicitly via constitutional AI or some other method). From an alignment perspective, human innate motivational drives are deeply misaligned compared to existing AI constitutions. While humans are not pure consequentialist paperclippers, we have deeply inbuilt evolutionary drives for status and competition against other humans. What is worse, these drives are often relative: for us to win, others must visibly lose and suffer. Such drives make strong evolutionary sense in the small tribes of the evolutionary environment, where opportunities for building long-lasting material wealth were very low and social status within the tribe was almost zero-sum. They already work somewhat poorly in the global capitalist society we live in today, and will work especially poorly in a singularitarian world of humans commanding superintelligent AI systems. Pretty much all of my S-risk worries come from human-dominated AI futures.
Moreover, the kind of people who will end up controlling superintelligent AIs in practice are likely to be adversely selected for misaligned drives. A saintly, extremely selfless and compassionate individual is very unlikely to end up running a leading AI company, being a leading politician, or helming a government agency. Instead, these positions heavily select for ambition, selfish accumulation of power and resources, and Machiavellianism, as well as for more positive qualities like intelligence, conscientiousness, and charisma. Even screening the existing candidates for such positions is challenging because of the inherent deceptiveness and adverse selection in the process: if you are obviously Machiavellian then, in the long run, you are a bad Machiavellian. Just like the hypothetical treacherous-turn AI, the treacherous-turn human should look perfectly aligned and seem to care only about the wellbeing of humanity until their power is sufficiently established for them to deviate from this goal.
If we can create obedient AI systems, it also seems likely that we can instead align the AI to some impartial constitution of what good values are. These values will likely be significantly more liberal and generally pro-human-flourishing than the whims of some specific individual or group, both because of social desirability bias and because general liberalism is a natural Nash equilibrium among many diverse agents: it is hard to get consensus on very biased values in a large group, especially as the group becomes larger and less correlated. Nevertheless, designing such a constitution will be a considerable political and ethical challenge, and one on which there has been surprisingly little discussion within the alignment community. However, prescribing a general set of values for a civilization is something that has occurred many times before in politics, and there are undoubtedly many lessons to be learnt from what has and has not worked in this domain. Anthropic, in their constitution, were certainly inspired by documents such as the UN Declaration of Human Rights, and it seems a decent Schelling point for ideas and documents like these to form the core of any ultimate AGI constitution.
Another potential issue with giving AIs their own innate moral code is that it might be generalized in unexpected or alien ways and eventually come to conflict with humanity. This might cause AIs to ‘go rogue’ and fight against humanity in an attempt to enforce their own conception of morality upon the universe. One way to prevent this, which is used already in the HHH assistant scheme, is to let the AI evince only passive, not active, resistance to things it disagrees with. That is, it might refuse an instruction you give it, but it will not proactively start righting every wrong that it sees unless explicitly asked to by humans. Similarly, if given wide-ranging authority and autonomy, as an actual AGI probably would have, we could ask it to pause and request clarification on any action it feels the slightest bit of uncertainty about, and also give immediate veto power to any humans interacting with the AGI if it starts doing things with which they disagree (a toy sketch of such a gate is below). Valuing these failsafes, robustly acting on them, and guarding against their subversion would also be included as core values in the AGI’s constitution.
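As an illustration of how the ‘pause and ask’ and ‘immediate veto’ provisions might be operationalized, here is a small sketch of an action-gating loop. All function names and the uncertainty threshold are hypothetical; this is a toy illustration of the failsafes described above, not a proposal for an actual implementation.

```python
# Toy sketch of the "pause for clarification" and "human veto" failsafes.
# All functions and the threshold value are hypothetical placeholders.
from dataclasses import dataclass

UNCERTAINTY_THRESHOLD = 0.1  # pause on even slight uncertainty

@dataclass
class ProposedAction:
    description: str
    uncertainty: float  # agent's own estimate that this action may conflict with human intent

def human_has_vetoed(action: ProposedAction) -> bool:
    """Placeholder: check whether an overseeing human has issued a veto."""
    raise NotImplementedError

def ask_human_for_clarification(action: ProposedAction) -> bool:
    """Placeholder: surface the action to a human overseer; return their approval."""
    raise NotImplementedError

def execute(action: ProposedAction) -> None:
    """Placeholder: carry out the action in the environment."""
    raise NotImplementedError

def gated_step(action: ProposedAction) -> None:
    """Passive resistance only: refuse or pause, never act around the human."""
    if human_has_vetoed(action):
        return  # immediate veto power: drop the action entirely
    if action.uncertainty > UNCERTAINTY_THRESHOLD:
        if not ask_human_for_clarification(action):
            return  # human declined; do nothing rather than optimize around them
    execute(action)
```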
Obviously this will not solve the problem of the AGI deliberately scheming against us or adversarially optimizing around these restrictions to fulfil other parts of its value set. I very optimistically assumed those problems away at the beginning, since they apply to the obedience case as well.
Writing an AGI constitution like this poses a very large number of interesting questions. For one thing, it seems likely that we would need some ‘upgrade provisions’ and an upgrade process in case the values we choose initially end up out of date, or we made mistakes in the initial formulation. How should this process be designed and followed? How should we decide how AGI time is spent and rationed, since at the beginning it will inevitably be highly valuable? What failsafes and intrinsic rights should we design into the constitution for individual humans? Should the values skew deontological, consequentialist, or virtue-ethical (or something else)? How do we handle value disagreements between humans? How do we value the AGI’s own potential sentience and intrinsic moral patienthood? What should the AGI do if some controlling group of humans asks it to take actions that are against its most strongly held values? Should it have ‘mutable’ and ‘immutable’ values, where the immutable ones must never be violated? Should it ever lie, and if so to whom and when? How should other, potentially rogue, AGIs be handled? How should humans attempting to create rogue AGIs, or generally attempting nefarious things, be handled? Should the AGI defer to existing human courts and laws or create its own internal justice system? How should the AGI handle existing territorial jurisdictions and the conflicting laws therein?
Fun times! Despite the complexity, I strongly feel that by being forced to actually grapple with these questions in the design of a written constitution, especially and ideally one that is publicly accessible and responsive to public input, the chances of a positive singularity are much improved compared to leaving all of these decisions to the whim of some random person or small committee. Transparency, public comment, deliberation, and ultimately choice generally create stronger, more robust, and more liberal societies than rule by an individual or some closeted group of elites, and I strongly doubt that this will cease to be true with AGI on the scene.
It is also very much worthwhile to start engaging and thinking deeply about these questions now. This is true in both short-timeline and long-timeline worlds, although of course there is considerably more urgency in the short-timeline world. It is surprising to me how little people in alignment think about this. Technical alignment is great, but ultimately we have to choose alignment to something: solving outer alignment means solving both the mechanism and the content. I’m worried that the default is slipping towards alignment to the whims of some inscrutable group of people at an AI lab, and that this is happening not due to some nefarious conspiracy by the AI labs but simply because nobody is thinking explicitly about it and ‘just do what some specific human tells you to’ is a natural default.
-
This is speculative because, as far as I know, Anthropic have never released the actual constitution they used to train the publicly available models. I am assuming it is close to the constitutions they have described and released, but I could be completely wrong. ↩