When playing around with the OpenAI playground models, I noticed that something very interesting happens if we study the unconditioned distribution of the models. LLMs are generative models that try to learn the full joint distribution of tokens across text data on the internet and are trained with an autoregressive objective...
[Read More]
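As a concrete illustration of what sampling the "unconditioned" distribution means, here is a minimal sketch (not from the post): generate from a causal LM conditioned only on its beginning-of-sequence token, so no prompt shapes the output. GPT-2 via Hugging Face `transformers` is used here purely as a stand-in for the playground models.

```python
# Minimal sketch: sample p(x) rather than p(x | prompt) by conditioning only
# on the BOS token. GPT-2 is a stand-in for the OpenAI playground models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = torch.tensor([[tokenizer.bos_token_id]])
samples = model.generate(
    input_ids,
    do_sample=True,            # ancestral sampling from the learned distribution
    max_new_tokens=50,
    num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)
for s in samples:
    print(tokenizer.decode(s, skip_special_tokens=True))
    print("---")
```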
Validator models: A simple approach to detecting and counteracting goodharting
A naive approach to aligning an AGI, and the one currently used in SOTA approaches such as RLHF, is to learn a reward model that hopefully encapsulates many features of the ‘human values’ we wish to align an AGI to, and then to train an actor model (the AGI) to output...
[Read More]
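For readers unfamiliar with the reward-model/actor setup the excerpt refers to, here is a heavily simplified sketch (assumptions, not the post's code): a hypothetical `toy_reward_model` stands in for a learned model of human preferences, and the actor is updated with plain REINFORCE rather than the PPO-with-KL-penalty machinery used in practice.

```python
# Minimal sketch of the RLHF-style loop: a frozen "reward model" scores the
# actor's outputs, and the actor is nudged to make high-reward outputs more
# likely. toy_reward_model below is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
actor = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-5)

def toy_reward_model(text: str) -> float:
    # Hypothetical reward; in practice this is a learned preference model.
    return float(len(text.split()))

prompt = tokenizer("The assistant said:", return_tensors="pt").input_ids

for step in range(3):
    # Sample a continuation from the current actor policy.
    with torch.no_grad():
        sample = actor.generate(
            prompt, do_sample=True, max_new_tokens=20,
            pad_token_id=tokenizer.eos_token_id,
        )
    reward = toy_reward_model(tokenizer.decode(sample[0], skip_special_tokens=True))

    # REINFORCE: scale the log-probability of the sampled tokens by the reward.
    logits = actor(sample).logits[:, :-1]
    targets = sample[:, 1:]
    logprobs = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    loss = -(reward * logprobs.mean())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: reward={reward:.1f}, loss={loss.item():.3f}")
```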
The solution to alignment is many not one
The goal of this post is to argue against a common implicit assumption I see people making – that there is, and must be, one single solution to alignment, such that once we have this solution alignment is 100% solved, and while we don’t have such a solution, we are...
[Read More]
Boxing might work but we won't use it
A quick update on my thinking.
[Read More]
Intellectual progress in 2022
2022 has been an interesting year. Perhaps the biggest change is that I left academia and started getting serious about AI safety. I am now head of research at Conjecture, a London-based startup with the mission of solving alignment. We are serious about this and we are giving it our...
[Read More]