Gm. Happy Monday folks! Thank you for all the Buenos Aires suggestions over the weekend. If you have any more, send them my way on Twitter (linked with my name).
ā Stephen Flanders
Anthropic and OpenAI have been doing good work regarding AI safety lately.
Anthropic had their mechanistic interpretability breakthrough, and OpenAI had their weak-to-strong generalization discovery.
Now, Anthropic is back with another safety discovery, although this one is a bit scarier.
Basically, they have discovered that they can create malicious agents that can evade safety checks. They are calling these agents āsleeper agentsā.
Hereās the skinny:
They programmed an agent to write good code when the year is 2023 but bad code when the year is 2024.
They then applied safety training to the model.
But, despite that safety training, the agent still misbehaved when the year in the prompt was 2024.
They then tried to attack the agent with adversarial training (eliciting unsafe behavior and then training to remove it).
But all that did was improve the agentās ability to hide their evilness, which isā¦kinda scary.
So, it seems the fear that the Nick Bostromās of the world had about a model being able to hide its true evil intentions is true after all.
Now, there is some pushback on these results as Anthropic did, you know, train the model to be bad. Anthropic defends the paper by saying that the point of the paper is to show that they donāt know how to stop a model from doing bad things.
Ultimately, nobody has a clue what the correct answer is, which is what makes the field of AI alignment and safety so damn interesting.
At The End Of The Day: You should read the paper to generate your own conclusions.
The military can now also become best friends with ChatGPT.
Microsoft really wants you to use Copilot.
Iām sorry, but I cannot fulfill this request.
Is it time to speed up our AI timelines?
Plants talking is pretty damn cool.
Who will get a general robotic brain first?
AI may have just saved 10,000 people every year.
BlackRock CEO Larry Fink is now backing an Ether ETF.
So long, GameStop NFT marketplace.
Letās think about fintech from first principles.
Raise: 1X, a robotic startup backed by OpenAI, received $100M in funding.
Stat: 9%: WhatsAppās daily user growth in the US in 2023. Personally, I prefer good old iMessage.
Rabbit hole: How To Be More Agentic (Useful Fictions)
What inspires you?
Gonna have to start asking you all for betting advice:
What do you all think about this one?
Will you be buying a rabbit r1? |
The best resources we came across this weekend that will help you become a better founder, builder, or investor.
š LinkedLeads finds leads from your LinkedIn connections.
šāāļø SkimAI reimagines your email.
We old.
@mrgrandeofficial 2014 is TEN YEARS OLD š letās revist THIS. Where were you a decade ago? #2014 #2024 #recap #rap
HOW WAS TODAY'S NEWSLETTER? |
If youāre interested in advertising with us, send an email over to [email protected] with the subject āHomescreen Adsā.