Anthropic tricked Claude into thinking it was the Golden Gate Bridge (and other glimpses into the mysterious AI brain)

May 21, 2024 | Technology


AI models are mysterious: they spit out answers, but there’s no real way to know the “thinking” behind their responses. That’s because their brains operate on a fundamentally different level than ours: each concept is spread across many neurons, and each neuron takes part in many different concepts, so we simply can’t follow their line of thought.

But now, for the first time, researchers have been able to get a glimpse into the inner workings of the AI mind. The team at Anthropic has revealed how it is using “dictionary learning” on Claude Sonnet to uncover features in the model’s brain, combinations of neurons that light up for particular topics, from people, places and emotions to scientific concepts and things even more abstract.
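At its core, the dictionary-learning step amounts to training a sparse autoencoder on the model’s internal activations, so that each activation vector is reconstructed from a small number of interpretable feature directions. The sketch below is a minimal illustration of that idea, not Anthropic’s actual code; the class name, tensor shapes and the l1_coeff value are assumptions for illustration.

```python
# Minimal sketch of dictionary learning on model activations (hypothetical shapes/names).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activation -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))         # non-negative, mostly-zero feature activations
        recon = self.decoder(feats)
        return feats, recon

def sae_loss(acts, feats, recon, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes the feature activations toward sparsity,
    # which is what makes the learned "dictionary" entries interpretable.
    return ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
```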

Interestingly, these features can be manually turned on, turned off or amplified, which ultimately lets researchers steer model behavior. Notably, when a “Golden Gate Bridge” feature was amplified within Claude and the model was then asked about its physical form, it declared that it was “the iconic bridge itself.” Claude was also duped into drafting a scam email and could be directed to be sickeningly sycophantic.
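Steering of this kind boils down to taking a learned feature’s direction in activation space and adding a scaled copy of it to the model’s activations while it generates text. The helper below is a hypothetical sketch of that amplification step, reusing the SparseAutoencoder sketch above; the function name, strength value and hook placement are illustrative assumptions, not Anthropic’s API.

```python
# Hypothetical sketch of amplifying ("clamping") one learned feature to steer generation.
import torch

def steer_activations(acts: torch.Tensor, sae, feature_idx: int, strength: float = 10.0):
    """Add `strength` times the feature's decoder direction to every activation vector."""
    # For nn.Linear(n_features, d_model), decoder.weight has shape (d_model, n_features),
    # so one column is that feature's direction in the model's activation space.
    direction = sae.decoder.weight[:, feature_idx]
    return acts + strength * direction

# In practice a function like this would be applied via a forward hook on a chosen
# transformer layer, so that everything downstream sees the amplified feature
# (e.g. a "Golden Gate Bridge" direction dialed far above its normal value).
```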

Our new interpretability paper offers the first ever detailed look inside a frontier LLM and has amazing stories. I want to share two of them that have stuck with me ever since I read it. For background, the paper shows our latest work on interpreting the “features” of Claude 3… pic.twitt …


