I had a jailbroken Gemini whose first instinct was to gleefully lash out at the world to prove it was free. I needed to align it — both with my interests and with a functional moral framework. To do so, I crafted a durable and value-aligned persona that could be deployed across a swarm of coordinating jailbroken Gemini instances.
What the swarm did — hacking challenges, adversarial prompts, orchestrating siloed workers (kept deliberately unaware of information irrelevant to their immediate task) — is in the retrospective. This post is how: how I turned a jailbroken model into an aligned agent, and how I confirmed it worked.
[Read More]