Beren's Blog

An ML interpretation of Shard Theory

Posted on December 5, 2022

Shard theory has always seemed slightly esoteric and confusing to me: what are ‘shards’, and why might we expect them to form in RL agents? When I first read about shard theory, there were two main sources of confusion for me. The first is why an agent optimising a reward function should... [Read More]

Preventing Goodhart with homeostatic reward functions.

Posted on November 29, 2022

Current decision theory and almost all AI alignment work assume that we will build AGIs with some fixed utility function that they will optimize forever. This naturally runs the risk of extreme Goodharting: if we do not get exactly the ‘correct’ utility function, then the slight differences between our... [Read More]

The development of human sexuality as an example of alignment

Posted on November 27, 2022

Here I want to bring attention to what I think is an extremely impressive case of evolution’s ability to ‘align’ humans in the wild: the development of human sexuality. [Read More]

Don't argmax; Distribution match

Posted on November 27, 2022

I mentioned this briefly in a previous post, but thought I should expand on it a little. Basically, argmax objectives, as used in AIXI and many RL systems, are intrinsically exceptionally bad from an alignment perspective due to the standard, well-known issues of Goodharting, ignoring uncertainty, etc. There have... [Read More]

AGI will have learnt reward models.

Posted on November 26, 2022

There has been a lot of debate and discussion recently in the AI safety community about whether AGI will likely optimize for fixed goals or be a wrapper mind. The term wrapper mind is largely a restatement of the old idea of a utility maximizer, with AIXI as a canonical... [Read More]