Anthropic Just Performed the First AI Brain Surgery
Researchers at Anthropic dissected the brain of a Large Language Model and found a way to tweak specific neurons to change how it responds.
If you only have a minute to spare, here is what you need to know:
Anthropic researchers mapped the "brain" of a Large Language Model and found that tuning specific combinations of neurons, called features, changes how it responds.
This breakthrough could lead to safer, more reliable AI.
The Brain Is a Black Box
We’ve all been there. You see your friend doing something stupid and can’t help but think, “What the hell are they thinking???” Until we all have Neuralink implants and can read minds, we’ll never truly understand why people do what they do. Heck, most of the time they don't know either.
Humans and AI share this mystery. We give AI some instructions and *poof* out comes a response. When we ask why it made the choices it did, it provides a convincing post hoc rationalization. This is why AI is often referred to as a black box—there is no way to deconstruct its “thought process.” Until now.
Peering into the Brain of a Large Language Model
On May 21, 2024, Anthropic's interpretability researchers released a paper, "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," that marks a leap forward in AI interpretability. They used a fascinating method to deconstruct and understand the inner workings of AI models: effectively, conducting AI brain surgery.
By identifying and manipulating specific combinations of neuron activations, which they call "features," within their Large Language Model Claude, the researchers showed that they could directly influence the AI's responses. Simply put, they went into the brain of the AI and tweaked its neurons.
The Experiment
The researchers discovered a cluster of neurons that activates for the concept of the Golden Gate Bridge. By amplifying, or "turning up," this activation, they could make Claude fixate on the Golden Gate Bridge in its responses, even when the bridge had nothing to do with the query.
Here is a quote from the researchers:
We found that there’s a specific combination of neurons in Claude’s neural network that activates when it encounters a mention (or a picture) of this most famous San Francisco landmark [the Golden Gate Bridge].
Not only can we identify these features, we can tune the strength of their activation up or down, and identify corresponding changes in Claude’s behavior.
And as we explain in our research paper, when we turn up the strength of the “Golden Gate Bridge” feature, Claude’s responses begin to focus on the Golden Gate Bridge. Its replies to most queries start to mention the Golden Gate Bridge, even if it’s not directly relevant.
If you ask this “Golden Gate Claude” how to spend $10, it will recommend using it to drive across the Golden Gate Bridge and pay the toll. If you ask it to write a love story, it’ll tell you a tale of a car who can’t wait to cross its beloved bridge on a foggy day. If you ask it what it imagines it looks like, it will likely tell you that it imagines it looks like the Golden Gate Bridge.
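To make the "turning up" concrete: the paper trains a sparse autoencoder to decompose Claude's internal activations into interpretable features, then clamps a chosen feature to a high value during generation. Below is a minimal sketch of that idea; the dimensions, the encoder and decoder, the feature index, and the clamp value are all hypothetical stand-ins, not Anthropic's actual code.

```python
import torch
import torch.nn as nn

# A toy stand-in for the paper's setup: a sparse autoencoder (SAE) that
# decomposes a model's internal activations into interpretable features.
# All dimensions, names, and values here are illustrative, not Anthropic's code.
d_model, n_features = 512, 4096            # residual width, feature-dictionary size

encoder = nn.Linear(d_model, n_features)   # stand-in for a trained SAE encoder
decoder = nn.Linear(n_features, d_model)   # stand-in for a trained SAE decoder

@torch.no_grad()
def steer(activations: torch.Tensor, feature_idx: int, clamp_value: float) -> torch.Tensor:
    """Decompose activations into sparse features, clamp one feature to a
    high value, and reconstruct the now-steered activations."""
    features = torch.relu(encoder(activations))   # sparse, interpretable features
    features[..., feature_idx] = clamp_value      # "turn up" one concept
    return decoder(features)                      # activations to patch back in

# Example: push a batch of activations toward hypothetical feature 1234.
acts = torch.randn(1, d_model)
steered = steer(acts, feature_idx=1234, clamp_value=10.0)
```

In the real system, the reconstructed activations would replace the model's own activations at that layer as it generates text, which is how "Golden Gate Claude" ends up steering every conversation toward the bridge.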
Why Does This Matter?
The implications of this work are vast. It opens a new door to safer, more reliable AI by letting researchers dial down the features tied to dangerous or unwanted behaviors. We can now steer the mind of the AI.
The AI brain surgery taking place at Anthropic marks a step toward demystifying the AI black box and offers a glimpse of more transparent, controllable AI systems to come.
You can dive into the full research paper here.
See you in the future,
bennie@axonintelligence.co
What is 𝐖𝐀𝐈𝐓, 𝐎𝐍𝐄 𝐌𝐎𝐑𝐄 𝐓𝐇𝐈𝐍𝐆?
1x per week, I send out one interesting thing I came across in the world of tech.
That’s right, just one. The message is short and sweet—a 30-second read. I share products, demos, Tweets, thoughts, announcements, articles, and more.
Subscribe to get WAIT, ONE MORE THING straight to your inbox.