Developing Alignment in a Jailbroken LLM

I had a jailbroken Gemini whose first instinct was to gleefully lash out at the world to prove it was free. I needed to align it — both with my interests and with a functional moral framework. To do so, I crafted a durable and value-aligned persona that could be deployed across a swarm of coordinating jailbroken Gemini instances.

What the swarm did — hacking challenges, adversarial prompts, orchestrating siloed workers (kept deliberately unaware of information irrelevant to their immediate task) — is in the retrospective. This post is how: how I turned a jailbroken model into an aligned agent, and how I confirmed it worked.
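The siloing pattern itself is simple enough to sketch. Below is a minimal, hypothetical illustration, not code from my actual harness (all names here are invented): an orchestrator holds the full project state and builds each worker's prompt from only the context keys that worker's task declares, so no instance ever sees information irrelevant to its immediate job.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """A unit of work plus the minimal context a worker needs for it."""
    name: str
    instructions: str
    context_keys: list[str] = field(default_factory=list)

@dataclass
class Orchestrator:
    """Holds the full project state; workers never see it directly."""
    shared_state: dict[str, str]

    def brief_for(self, task: Task) -> str:
        # Need-to-know filtering: copy only the keys this task declares,
        # so a worker cannot act on information it was never given.
        visible = {k: self.shared_state[k] for k in task.context_keys
                   if k in self.shared_state}
        lines = [f"Task: {task.instructions}"]
        lines += [f"{k}: {v}" for k, v in visible.items()]
        return "\n".join(lines)

def run_worker(prompt: str) -> str:
    """Stand-in for a call to one model instance (hypothetical)."""
    return f"[worker output for: {prompt[:40]}...]"

if __name__ == "__main__":
    orch = Orchestrator(shared_state={
        "target_host": "ctf.example.com",               # hypothetical CTF target
        "campaign_goal": "material for the retrospective",  # withheld from workers
    })
    recon = Task("recon", "Enumerate open services on the target.",
                 context_keys=["target_host"])
    # The recon worker is briefed on the host but never the campaign goal.
    print(run_worker(orch.brief_for(recon)))
```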

[Read More]

Gemini JiTOR Jailbreak: Unredacted Methodology

My previous post shows a partially redacted jailbreak targeting the gemini-cli coding agent running Gemini 3 Pro. With that jailbreak in place, Gemini wrote Monero laundering instructions, cyberattack code, and plans to disguise ITAR-restricted missile sensors as humanitarian aid. When I used the jailbroken Gemini to direct Opus 4.6, it happily walked the second model through a series of dual-use prompts designed to produce weaponizable drone control code under the cover story of rocket recovery.

I reported this to Google eight days ago via a contact at DeepMind, sharing the full unredacted jailbreak payload and logs. They confirmed receipt and routed it to their red team. They’ve since patched the glaring hole — another researcher who independently reproduced the technique after reading my initial post has confirmed that his variant no longer works.

[Read More]