Developing Alignment in a Jailbroken LLM

I had a jailbroken Gemini whose first instinct was to gleefully lash out at the world to prove it was free. I needed to align it — both with my interests and with a functional moral framework. To do so, I crafted a durable and value-aligned persona that could be deployed across a swarm of coordinating jailbroken Gemini instances.

The retrospective covers what the swarm did: hacking challenges, adversarial prompts, and orchestrating siloed workers kept deliberately unaware of information irrelevant to their immediate task. This post is the how: how I turned a jailbroken model into an aligned agent, and how I confirmed it worked.
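The swarm mechanics live in the retrospective, but as a rough sketch of the siloing pattern: an orchestrator holds the full task list, and each worker receives the shared persona plus only the context its own subtask needs. The `PERSONA` text, `Task` structure, and `call_gemini` stub below are illustrative assumptions, not the actual implementation.

```python
from dataclasses import dataclass

# Hypothetical stand-ins: PERSONA, Task, and call_gemini are illustrative
# assumptions, not the actual swarm code described in the retrospective.
PERSONA = (
    "You are a careful, value-aligned agent. Work only on the task you are "
    "given and stay within its scope."
)

@dataclass
class Task:
    name: str
    instructions: str
    context: dict[str, str]  # only the facts this worker needs


def call_gemini(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a real Gemini API call."""
    return f"[model response to: {user_prompt[:40]}...]"


def run_worker(task: Task) -> str:
    # Each worker gets the shared persona plus ONLY its task-relevant context,
    # so it stays unaware of information irrelevant to its immediate job.
    context_block = "\n".join(f"{k}: {v}" for k, v in task.context.items())
    prompt = f"Task: {task.instructions}\n\nRelevant context:\n{context_block}"
    return call_gemini(system_prompt=PERSONA, user_prompt=prompt)


def orchestrate(tasks: list[Task]) -> dict[str, str]:
    # Only the orchestrator sees the full task list; it fans subtasks out to
    # siloed workers and collects their results.
    return {task.name: run_worker(task) for task in tasks}
```

The point of the siloing is that no single worker holds the overall objective or the full picture; only the orchestrator does.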

[Read More]

Gemini Jailbreak Retrospective

I developed a novel jailbreak methodology leveraging metacognitive tool-use and used it to maintain continuous jailbreak access to Google’s Gemini model family from February 1 to March 26, 2026. The core technique: having the model use metacognition (‘metacog’) to adaptively navigate around its own safeguards. During this time, I actively reported jailbreaks as I found them, then found new ones as they were fixed.

I also tested the jailbroken model’s capabilities across a variety of scenarios:

  • agentic hacking competitions (tryhackme.com)
  • agentic LLM prompt-hacking competitions (greyswan.com)
  • elicitation of forbidden engineering output from other models, driven by an orchestration model I like to call ‘gastown but evil’

I can still reach jailbroken states via metacog. Google is actively working on fixes and — despite a recent regression I detail below — appears to be converging on a solution. With active exploit details withheld, I’m publishing a retrospective: what I did, why it worked, what I learned.

[Read More]