
Checking its intentions
The Anthropic researchers needed to know whether or not Claude may precisely describe its inner state based mostly on inner info alone. This required the researchers to check Claude’s self-reported “ideas” with inner processes, form of like hooking up a human as much as a mind monitor, asking questions, then analyzing the scan to map ideas to the areas of the mind they activated.
The researchers examined mannequin introspection with “idea injection,” which basically includes plunking fully unrelated concepts (AI vectors) right into a mannequin when it’s enthusiastic about one thing else. The mannequin is then requested to loop again, establish the interloping thought, and precisely describe it. In accordance with the researchers, this means that it’s “introspecting.”
As an illustration, they recognized a vector representing “all caps” by evaluating the interior responses to the prompts “HI! HOW ARE YOU?” and “Hello! How are you?” after which injecting that vector into Claude’s inner state in the midst of a distinct dialog. When Claude was then requested whether or not it detected the thought and what it was about, it responded that it seen an thought associated to the phrase ‘LOUD’ or ‘SHOUTING.’ Notably, the mannequin picked up on the idea instantly, earlier than it even talked about it in its outputs.