This is a short post for a short point. One thing I just realized, which should have been obvious, is that for prosaic AGI containment mechanisms like various boxing variants, simulation, airgapping, adding regularizers like low impact, automatic interpretability checking for safe vs unsafe thoughts, constraining the training data, automatic booby-traps...
[Read More]
Alignment needs empirical evidence
There has recently been a lot of discussion on LessWrong about whether alignment is a uniquely hard problem because of the intrinsic lack of empirical evidence. Once we have an AGI, it seems unlikely we could safely experiment on it for a long time (potentially decades) until we crack alignment...
[Read More]
Empathy as a natural consequence of learnt reward models
Empathy, the ability to feel another’s pain or to ‘put yourself in their shoes’, is often considered to be a fundamental human cognitive ability, and one that undergirds our social abilities and moral intuitions. As so much of humanity’s success at becoming dominant as a species comes down to our...
[Read More]
The ultimate limits to alignment determine the shape of the long term future
The alignment problem is not new. We have been grappling with the fundamental core of alignment – making an agent optimize for the beliefs and values of another – for the entirety of human history. Any time anybody tries to get multiple people to work together in a coherent way...
[Read More]
How to evolve a brain
Epistemic status: This is mostly pure speculation, although grounded in many years of studying neuroscience and AI. Almost certainly, much of this picture will be wrong in the details, although hopefully roughly correct ‘in spirit’.
[Read More]