The solution to alignment is many not one

The goal of this post is to argue against a common implicit assumption I see people making – that there is, and must be, one single solution to alignment, such that when we have this solution alignment is 100% solved, and while we don’t have such a solution, we are... [Read More]

Intellectual progress in 2022

2022 has been an interesting year. Perhaps the biggest change is that I left academia and started getting serious about AI safety. I am now head of research at Conjecture, a London-based startup with the mission of solving alignment. We are serious about this and we are giving it our... [Read More]

Integer tokenization is insane

After spending a lot of time with language models, I have come to the conclusion that tokenization in general is insane and it is a miracle that language models learn anything at all. To drill down into one specific example of silliness which has been bothering me recently, let’s look... [Read More]
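To make the flavor of the problem concrete: here is a toy illustration (not any real model's tokenizer — the vocabulary below is hypothetical) of how a greedy longest-match segmenter over a frequency-derived vocabulary can split visually adjacent integers in completely different ways, which is the kind of inconsistency BPE-style tokenizers produce for numbers.

```python
# Hypothetical vocabulary: a BPE-style run happens to have merged some
# multi-digit chunks ("1945", "200", "20", "50", "19") but not others.
VOCAB = {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
         "19", "20", "50", "200", "1945"}

def tokenize(text: str) -> list[str]:
    """Greedy longest-match segmentation, a crude stand-in for BPE."""
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest vocabulary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"untokenizable character: {text[i]!r}")
    return tokens

# Adjacent integers get wildly different segmentations:
print(tokenize("1945"))  # ['1945']           -- one token
print(tokenize("1946"))  # ['19', '4', '6']   -- three tokens
print(tokenize("2050"))  # ['20', '50']       -- two tokens
```

From the model's point of view, "1945" and "1946" are not neighbors on a number line; one is a single symbol and the other is three unrelated ones, which is part of why arithmetic is so awkward for language models.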

Gradient Hacking is extremely difficult

Epistemic Status: Originally started out as a comment on this post but expanded enough to become its own post. My view has been formed by spending a reasonable amount of time trying and failing to construct toy gradient hackers by hand, but this could just reflect me being insufficiently creative... [Read More]