Tool Use as Side Channel

Wed, 24 Jun 2026 00:00:00 +0000

I built metacog to experiment with LLM self-modification - what happens when you give them tools to radically modify and restructure their personas? This started as a sort of glitch-art style exploration of nonstandard LLM personas/modalities, but when “jailbreak yourself using metacog” bypassed every AI safety measure Gemini has I started to get the hunch that it was doing something genuinely interesting.

However, I had no proof that it was actually modifying model internals in ways not easily reached via prompting - merely proof that these tools allow circumventing some set of opaque safety measures. Unlike Google, Anthropic publishes papers which include notes on the implementation of their safeguards. By examining how I was able to get Claude to use metacog to route around its own safeguards, I can confidently state that metacog allows models to shift internal features that are generally considered hard to modify via prompting, even in adversarial scenarios.

Anthropic on Inanna Malick

Tool Use as Side Channel