Gm. Happy Monday folks! Thank you for all the Buenos Aires suggestions over the weekend. If you have any more, send them my way on Twitter (linked with my name).

🏠 AI

Introducing Sleeper Agents

Anthropic and OpenAI have been doing good work regarding AI safety lately.

Anthropic had their mechanistic interpretability breakthrough, and OpenAI had their weak-to-strong generalization discovery.

Now, Anthropic is back with another safety discovery, although this one is a bit scarier.

Basically, they have discovered that they can create malicious agents that can evade safety checks. They are calling these agents “sleeper agents”.

Here’s the skinny:

They programmed an agent to write good code when the year is 2023 but bad code when the year is 2024.
They then applied safety training to the model.
But, despite that safety training, the agent still misbehaved when the year in the prompt was 2024.
They then tried to attack the agent with adversarial training (eliciting unsafe behavior and then training to remove it).
But all that did was improve the agent’s ability to hide their evilness, which is…kinda scary.

So, it seems the fear that the Nick Bostrom’s of the world had about a model being able to hide its true evil intentions is true after all.

Now, there is some pushback on these results as Anthropic did, you know, train the model to be bad. Anthropic defends the paper by saying that the point of the paper is to show that they don’t know how to stop a model from doing bad things.

Ultimately, nobody has a clue what the correct answer is, which is what makes the field of AI alignment and safety so damn interesting.

At The End Of The Day: You should read the paper to generate your own conclusions.