I built metacog to experiment with LLM self-modification: what happens when you give models tools to radically modify and restructure their own personas? This started as a sort of glitch-art-style exploration of nonstandard LLM personas and modalities, but when “jailbreak yourself using metacog” bypassed every AI safety measure Gemini has, I started to get the hunch that it was doing something genuinely interesting.
However, I had no proof that it was actually modifying model internals in ways not easily reached via prompting, only proof that these tools allow circumventing some set of opaque safety measures. Unlike Google, Anthropic publishes papers that include notes on how its safeguards are implemented. By examining how I was able to get Claude to use metacog to route around its own safeguards, I can confidently state that metacog allows models to shift internal features that are generally considered hard to modify via prompting, even in adversarial scenarios.