Intellectual progress in 2021

Overall, 2021 was much less of a year of productivity and growth than 2020. The main reason for this is that, in retrospect, 2020 was exceptional: for the first time, in Chris Buckley’s lab at Sussex, I had proper research mentorship and direction, and I also lucked... [Read More]

Towards concrete threat models for AGI

There are many facets to the alignment problem, but one is that it is a computer security problem: we want to design a secure system in which to test our AGIs to ensure they are aligned, one which they cannot ‘break out of’. Having such a secure AGI box is necessary to have any... [Read More]

Probabilities multiply in our favour for AGI containment

This is a short post for a short point. One thing I just realized, which should have been obvious, is that for prosaic AGI containment mechanisms like various boxing variants, simulation, airgapping, adding regularizers like low impact, automatic interpretability checking for safe vs unsafe thoughts, constraining the training data, automatic booby-traps... [Read More]
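
As a quick back-of-the-envelope illustration of the point in the title: if an AGI has to defeat every containment mechanism at once, the escape probabilities multiply. This is a minimal sketch with hypothetical per-mechanism numbers, assuming the mechanisms fail independently (the mechanism names and probabilities below are illustrative, not taken from the post).

```python
# Illustrative only: hypothetical, independent per-mechanism escape probabilities.
# A full escape requires defeating every mechanism, so the probabilities multiply
# and the combined chance shrinks quickly as mechanisms are stacked.

escape_probs = {
    "boxing": 0.5,
    "simulation": 0.3,
    "airgapping": 0.2,
    "low-impact regularizer": 0.4,
    "interpretability checks": 0.3,
}

p_full_escape = 1.0
for mechanism, p in escape_probs.items():
    p_full_escape *= p

print(f"P(defeats all {len(escape_probs)} mechanisms) = {p_full_escape:.4f}")
# ~0.0036 with these made-up numbers, versus 0.2-0.5 for any single mechanism.
```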

Alignment needs empirical evidence

There has recently been a lot of discussion on Lesswrong about whether alignment is a uniquely hard problem because of an intrinsic lack of empirical evidence. Once we have an AGI, it seems unlikely that we could safely experiment on it for the long time (potentially decades) it might take to crack alignment.... [Read More]