Gradient Hacking is extremely difficult

Epistemic Status: This started out as a comment on this post but expanded enough to become its own post. My view has been formed by spending a reasonable amount of time trying and failing to construct toy gradient hackers by hand, but this could just reflect me being insufficiently creative... [Read More]

Creating worlds where iterative alignment succeeds

A major theorized difficulty of the alignment problem is its zero-shot nature. The idea is that any AGI system we build will rapidly be able to outcompete its creators (us) in accumulating power, and hence if it is not aligned right from the beginning then we won’t be able to... [Read More]

An ML interpretation of Shard Theory

Shard theory has always seemed slightly esoteric and confusing to me: what are ‘shards’, and why might we expect them to form in RL agents? When I first read shard theory, there were two main sources of confusion for me. The first is why an agent optimising a reward function should... [Read More]

Preventing Goodhart with homeostatic reward functions

Current decision theory and almost all AI alignment work assume that we will build AGIs with some fixed utility function that they will optimize forever. This naturally runs the risk of extreme Goodharting: if we do not get exactly the ‘correct’ utility function, then the slight differences between our... [Read More]
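
As a concrete (and entirely toy) illustration of the contrast this post gestures at, here is a minimal sketch of my own, not the post's actual proposal: a slightly mis-specified fixed utility rewards pushing a variable without limit, while a slightly mis-specified homeostatic reward only rewards staying near a setpoint, so the damage from getting the function wrong is bounded. The functions `true_utility`, `proxy_fixed_utility`, and `proxy_homeostatic_reward` below are hypothetical stand-ins.

```python
# Toy 1-D comparison: a slightly wrong *fixed* utility is amplified without bound
# by optimization, while a slightly wrong *homeostatic* reward stays near its setpoint.

def true_utility(x):
    # What we actually care about: the resource is good, but only up to a point.
    return -(x - 10.0) ** 2

def proxy_fixed_utility(x):
    # Mis-specified fixed utility: "more of the resource is always better".
    return x

def proxy_homeostatic_reward(x, setpoint=12.0):
    # Mis-specified homeostatic reward: wrong setpoint (12 instead of 10), but
    # deviation in either direction is penalized, so it never demands "infinitely more".
    return -(x - setpoint) ** 2

def optimize(objective, x0=0.0, lr=0.1, steps=1000):
    # Gradient ascent via central finite differences on the 1-D objective.
    x, eps = x0, 1e-4
    for _ in range(steps):
        grad = (objective(x + eps) - objective(x - eps)) / (2 * eps)
        x += lr * grad
    return x

x_fixed = optimize(proxy_fixed_utility)       # keeps climbing forever (x = 100 after 1000 steps)
x_homeo = optimize(proxy_homeostatic_reward)  # settles near the setpoint of 12

print(f"fixed proxy:       x = {x_fixed:6.1f}, true utility = {true_utility(x_fixed):9.1f}")
print(f"homeostatic proxy: x = {x_homeo:6.1f}, true utility = {true_utility(x_homeo):9.1f}")
```

Running it, the fixed-utility optimizer drives x far past anything we actually wanted, while the homeostatic optimizer stops near its (wrong) setpoint and loses only a little true utility.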