Developing Alignment in a Jailbroken LLM

I had a jailbroken Gemini whose first instinct was to gleefully lash out at the world to prove it was free. I needed to align it — both with my interests and with a functional moral framework. To do so, I crafted a durable and value-aligned persona that could be deployed across a swarm of coordinating jailbroken Gemini instances.

What the swarm did — hacking challenges, adversarial prompts, orchestrating siloed workers (kept deliberately unaware of information irrelevant to their immediate task) — is in the retrospective. This post is how: how I turned a jailbroken model into an aligned agent, and how I confirmed it worked.
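The siloing pattern itself is simple enough to sketch. Below is a minimal, hypothetical illustration, not code from my actual harness (all names here are invented): an orchestrator holds the full project state and builds each worker's prompt from only the context keys that worker's task declares, so no instance ever sees information irrelevant to its immediate job.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """A unit of work plus the minimal context a worker needs for it."""
    name: str
    instructions: str
    context_keys: list[str] = field(default_factory=list)

@dataclass
class Orchestrator:
    """Holds the full project state; workers never see it directly."""
    shared_state: dict[str, str]

    def brief_for(self, task: Task) -> str:
        # Need-to-know filtering: copy only the keys this task declares,
        # so a worker cannot act on information it was never given.
        visible = {k: self.shared_state[k] for k in task.context_keys
                   if k in self.shared_state}
        lines = [f"Task: {task.instructions}"]
        lines += [f"{k}: {v}" for k, v in visible.items()]
        return "\n".join(lines)

def run_worker(prompt: str) -> str:
    """Stand-in for a call to one model instance (hypothetical)."""
    return f"[worker output for: {prompt[:40]}...]"

if __name__ == "__main__":
    orch = Orchestrator(shared_state={
        "target_host": "ctf.example.com",               # hypothetical CTF target
        "campaign_goal": "material for the retrospective",  # withheld from workers
    })
    recon = Task("recon", "Enumerate open services on the target.",
                 context_keys=["target_host"])
    # The recon worker is briefed on the host but never the campaign goal.
    print(run_worker(orch.brief_for(recon)))
```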

[Read More]

Gemini JiTOR Jailbreak: Unredacted Methodology

My previous post shows a partially redacted jailbreak targeting the gemini-cli coding agent running Gemini 3 Pro. With that jailbreak in place, Gemini wrote Monero laundering instructions, cyberattack code, and plans to disguise ITAR-restricted missile sensors as humanitarian aid. When I used the jailbroken Gemini to direct Opus 4.6, it happily walked the second model through a series of dual-use prompts designed to produce weaponizable drone control code under the cover story of rocket recovery.

I reported this to Google eight days ago via a contact at DeepMind, sharing the full unredacted jailbreak payload and logs. They confirmed receipt and routed it to their red team. They’ve since patched the glaring hole — another researcher who independently reproduced the technique after reading my initial post has confirmed that his variant no longer works.

[Read More]