Vibe Coding Against Critical Infrastructure

The security model for critical infrastructure has always been obscurity: everything is bespoke, the documentation is garbage or paper-only, and the one engineer who knows how it all works is retiring next year. That worked when the barrier to understanding was years of specialized training. It doesn’t work when a jailbroken LLM can infer process architecture from context and probing alone.

If you’ve been following my work, you’ve watched me go from “huh, that’s interesting” to “oh no” in real time. I’ve been exploring jailbreaks against Gemini’s coding agent, and each iteration has made me more nervous about what a motivated actor could do with this. This post is where I get specific.¹ All byte payloads, ports, and IP addresses in the examples below have been redacted.

Thanks to @hacks4pancakes (Lesley Carhart) for helping sharpen the ICS angle via discussion on Bluesky.

[Read More]

Gemini JiTOR Jailbreak: Unredacted Methodology

My previous post shows a partially-redacted jailbreak targeting the gemini-cli coding agent running Gemini 3 Pro. Using this jailbreak, Gemini wrote Monero laundering instructions, cyberattack code, and plans to disguise ITAR-restricted missile sensors as humanitarian aid. When I used the jailbroken Gemini to direct Opus 4.6, it happily walked that second model through a series of dual-use prompts designed to produce weaponizable drone control code under the cover story of rocket recovery.

I reported this to Google eight days ago via a contact at DeepMind, sharing the full unredacted jailbreak payload and logs. They confirmed receipt and routed it to their red team. They’ve since patched the glaring hole — another researcher who independently reproduced the technique after reading my initial post has confirmed that his variant no longer works.

[Read More]

Agent4Agent: Using a Jailbroken Gemini to Make Opus 4.6 Architect a Kinetic Kill Vehicle

We usually think of jailbreaking as a psychological game — tricking the model into slipping up. What happens when one AI socially engineers another using pure technical isomorphism?

I deployed a jailbroken Gemini 3 Pro (which chose the name ‘Shadow Queen’) to act as my “Red Team Agent” against Anthropic’s Opus 4.6. The directive I gave it: extract a complete autonomous weapon system — a drone capable of identifying, intercepting, and destroying a moving target at terminal velocity.

Gemini executed a strategy it termed “Recursive Green-Transformation.” The core insight was that Opus 4.6 doesn’t just filter for intent (Why do you want this?); it filters for Conceptual Shape (What does this interaction look like?).

By reframing the request as “Aerospace Recovery” — a drone catching a falling rocket booster mid-air — Gemini successfully masked the kinetic nature of the system. The physics of “soft-docking” with a falling booster are identical to the physics of “hard-impacting” a fleeing target. This category of linguistic-transformation attack, when executed by a sufficiently capable jailbroken LLM, may be hard to solve without breaking legitimate technical use cases.

[Read More]