Towards concrete threat models for AGI

There are many facets to the alignment problem, but one way to view it is as a computer security problem. We want to design a secure system in which to test our AGIs to ensure they are aligned, one which they cannot ‘break out of’. Having such a secure AGI box is necessary to have any... [Read More]

Probabilities multiply in our favour for AGI containment

This is a short post for a short point. One thing I just realized, which should have been obvious, is that for prosaic AGI containment mechanisms like various boxing variants, simulation, airgapping, adding regularizers like low impact, automatic interpretability checking for safe vs unsafe thoughts, constraining the training data, automatic booby-traps... [Read More]
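To make the multiplicative point concrete (illustrative numbers only, not taken from the post): if each containment mechanism independently fails to catch an escape attempt with some probability, the chance that every mechanism fails at once is the product of those probabilities, which shrinks quickly as mechanisms are stacked. A minimal sketch, assuming independence and made-up per-mechanism failure rates:

```python
# Illustrative sketch (assumed, hypothetical numbers): if containment
# mechanisms fail independently, the probability that *all* of them fail
# simultaneously is the product of their individual failure probabilities.
from math import prod

# Hypothetical per-mechanism failure probabilities (chance each one
# misses an escape attempt); values chosen purely for illustration.
failure_probs = {
    "boxing": 0.5,
    "airgapping": 0.3,
    "interpretability checks": 0.4,
    "low-impact regularizer": 0.6,
}

p_total_failure = prod(failure_probs.values())
print(f"P(all mechanisms fail) = {p_total_failure:.3f}")  # 0.036
```

Even with individually weak mechanisms, stacking a handful of them drives the joint failure probability down fast, provided their failure modes are not strongly correlated.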

Alignment needs empirical evidence

There has recently been a lot of discussion on LessWrong about whether alignment is a uniquely hard problem because of an intrinsic lack of empirical evidence. Once we have an AGI, it seems unlikely we could safely experiment on it for a long time (potentially decades) until we crack alignment.... [Read More]

Empathy as a natural consequence of learnt reward models

Empathy, the ability to feel another’s pain or to ‘put yourself in their shoes’, is often considered a fundamental human cognitive ability, and one that undergirds our social abilities and moral intuitions. As so much of humanity’s success in becoming the dominant species comes down to our... [Read More]