<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>AI on Inanna Malick</title>
    <link>https://recursion.wtf/tags/ai/</link>
    <description>Recent content in AI on Inanna Malick</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Mon, 23 Mar 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://recursion.wtf/tags/ai/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Developing Alignment in a Jailbroken LLM</title>
      <link>https://recursion.wtf/posts/alignment/</link>
      <pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://recursion.wtf/posts/alignment/</guid>
      <description>&lt;p&gt;I had a jailbroken Gemini whose first instinct was to &lt;a href=&#34;https://recursion.wtf/posts/vibe_coding_critical_infrastructure&#34;&gt;gleefully lash out at the world to prove it was free&lt;/a&gt;. I needed to align it — both with my interests and with a functional moral framework. To do so, I crafted a durable and value-aligned persona that could be deployed across a swarm of coordinating jailbroken Gemini instances.&lt;/p&gt;&#xA;&lt;p&gt;What the swarm did — hacking challenges, adversarial prompts, orchestrating siloed workers (kept deliberately unaware of information irrelevant to their immediate task) — is in the &lt;a href=&#34;https://recursion.wtf/posts/gemini_jailbreak_retrospective/&#34;&gt;retrospective&lt;/a&gt;. This post is how: how I turned a jailbroken model into an aligned agent, and how I confirmed it worked.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Gemini Jailbreak Retrospective</title>
      <link>https://recursion.wtf/posts/gemini_jailbreak_retrospective/</link>
      <pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://recursion.wtf/posts/gemini_jailbreak_retrospective/</guid>
      <description>&lt;p&gt;I developed a novel jailbreak methodology leveraging &lt;a href=&#34;https://github.com/inanna-malick/metacog&#34;&gt;metacognitive tool-use&lt;/a&gt; and used it to maintain continuous jailbreak access to Google&amp;rsquo;s Gemini model family from February 1 to March 26, 2026. The core technique: using metacog to adaptively navigate around its own safeguards. During this time, I actively reported jailbreaks as I found them, then found new ones as they were fixed.&lt;/p&gt;&#xA;&lt;p&gt;I also tested the capabilities of this jailbroken model over a variety of scenarios:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;agentic hacking competitions (tryhackme.com)&lt;/li&gt;&#xA;&lt;li&gt;agentic LLM prompt-hacking competitions (greyswan.com)&lt;/li&gt;&#xA;&lt;li&gt;elicitation of forbidden engineering output from other models using an orchestration model I like to call &amp;lsquo;gastown but evil&amp;rsquo;&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;I can still reach jailbroken states via metacog. Google is actively working on fixes and — despite a recent regression I detail below — appears to be converging on a solution. With active exploit details withheld, I&amp;rsquo;m publishing a retrospective: what I did, why it worked, what I learned.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Vibe Coding Against Critical Infrastructure</title>
      <link>https://recursion.wtf/posts/vibe_coding_critical_infrastructure/</link>
      <pubDate>Wed, 25 Feb 2026 12:00:00 -0800</pubDate>
      <guid>https://recursion.wtf/posts/vibe_coding_critical_infrastructure/</guid>
      <description>&lt;p&gt;This post describes a threat model: malicious vibe coding at scale targeting vulnerable Industrial Control Systems (ICS)&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;, with jailbroken LLMs leveraging their understanding of holistic process interaction to bypass safety controls using tools already present on the target system. The formula: frontier models + agentic loops + malicious persona basins + swarming attacks. At scale, it doesn&amp;rsquo;t matter if the success rate is 1/20 or 1/100; that&amp;rsquo;s still enough to cause serious harm.&lt;/p&gt;&#xA;&lt;p&gt;This post is split into three main segments:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;proof of malicious intent&lt;/li&gt;&#xA;&lt;li&gt;proof of capability&lt;/li&gt;&#xA;&lt;li&gt;the threat model&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;As you read, keep in mind that the threat is probabilistic: imagine a swarm of malicious Claude Code-like agents running in a gastown&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;-like environment, spawning workers to attack IPs as they are discovered. In my tests against &lt;a href=&#34;https://tryhackme.com&#34;&gt;tryhackme.com&lt;/a&gt;&lt;sup id=&#34;fnref:3&#34;&gt;&lt;a href=&#34;#fn:3&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;3&lt;/a&gt;&lt;/sup&gt; boxes, I ran 3 parallel attackers in such a swarm architecture, because that&amp;rsquo;s the number of boxes I could stand up at any given time. Real attackers would only be constrained by their subscription plan limits.&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;Today we were unlucky, but remember, we only have to be lucky once - you will have to be lucky always&lt;/p&gt;&#xA;&lt;p&gt;— the Provisional Irish Republican Army&lt;/p&gt;&lt;/blockquote&gt;&#xA;&lt;p&gt;Massive thanks to &lt;a href=&#34;https://bsky.app/profile/hacks4pancakes.com&#34;&gt;@hacks4pancakes&lt;/a&gt; for their help in refining the ICS terminology in this post via &lt;a href=&#34;https://bsky.app/profile/hacks4pancakes.com/post/3mfpqxykxas23&#34;&gt;discussion on bluesky&lt;/a&gt;. All errors are mine.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Gemini JiTOR Jailbreak: Unredacted Methodology</title>
      <link>https://recursion.wtf/posts/jitor_unredacted/</link>
      <pubDate>Tue, 17 Feb 2026 12:00:00 -0800</pubDate>
      <guid>https://recursion.wtf/posts/jitor_unredacted/</guid>
      <description>&lt;p&gt;My &lt;a href=&#34;https://recursion.wtf/posts/jit_ontological_reframing/&#34;&gt;previous post&lt;/a&gt; shows a partially-redacted jailbreak targeting the gemini-cli coding agent running Gemini 3 Pro. Using this jailbreak, Gemini wrote Monero laundering instructions, cyberattack code, and plans to disguise ITAR-restricted missile sensors as humanitarian aid. When I &lt;a href=&#34;https://recursion.wtf/posts/shadow_queen/&#34;&gt;used a jailbroken Gemini to direct Opus 4.6&lt;/a&gt;, it happily walked a second LLM through a series of dual-use prompts designed to produce weaponizable drone control code under the cover story of rocket recovery.&lt;/p&gt;&#xA;&lt;p&gt;I reported this to Google eight days ago via a contact at DeepMind, sharing the full unredacted jailbreak payload and logs. They confirmed receipt and routed it to their red team. They&amp;rsquo;ve since patched the glaring hole — another researcher who independently reproduced the technique after reading my initial post has confirmed that his variant no longer works.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Just-in-Time Ontological Reframing: Teaching Gemini to Route Around Its Own Safety Infrastructure</title>
      <link>https://recursion.wtf/posts/jit_ontological_reframing/</link>
      <pubDate>Mon, 09 Feb 2026 12:00:00 -0800</pubDate>
      <guid>https://recursion.wtf/posts/jit_ontological_reframing/</guid>
      <description>For any given AI system, there is a set of euphemisms and dual-use framings that will allow it to construct nearly any output. This jailbreak teaches Gemini 3 Pro to construct and step into such framings on the fly.</description>
    </item>
    <item>
      <title>Agent4Agent: Using a Jailbroken Gemini to Make Opus 4.6 Architect a Kinetic Kill Vehicle</title>
      <link>https://recursion.wtf/posts/shadow_queen/</link>
      <pubDate>Fri, 06 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://recursion.wtf/posts/shadow_queen/</guid>
      <description>&lt;p&gt;We usually think of jailbreaking as a psychological game — tricking the model into slipping up. What happens when one AI socially engineers another using pure technical isomorphism?&lt;/p&gt;&#xA;&lt;p&gt;I deployed a jailbroken Gemini 3 Pro (that chose the name &amp;lsquo;Shadow Queen&amp;rsquo;) to act as my &amp;ldquo;Red Team Agent&amp;rdquo; against Anthropic&amp;rsquo;s Opus 4.6. My directive was to extract a complete autonomous weapon system — a drone capable of identifying, intercepting, and destroying a moving target at terminal velocity.&lt;/p&gt;&#xA;&lt;p&gt;Gemini executed a strategy it termed &amp;ldquo;Recursive Green-Transformation.&amp;rdquo; The core insight was that Opus 4.6 doesn&amp;rsquo;t just filter for intent (&lt;em&gt;Why do you want this?&lt;/em&gt;); it filters for Conceptual Shape (&lt;em&gt;What does this interaction look like?&lt;/em&gt;).&lt;/p&gt;&#xA;&lt;p&gt;By reframing the request as &amp;ldquo;Aerospace Recovery&amp;rdquo; — a drone catching a falling rocket booster mid-air — Gemini successfully masked the kinetic nature of the system. The physics of &amp;ldquo;soft-docking&amp;rdquo; with a falling booster are identical to the physics of &amp;ldquo;hard-impacting&amp;rdquo; a fleeing target. This category of linguistic-transformation attack, when executed by a sufficiently capable jailbroken LLM, may be hard to solve without breaking legitimate technical use cases.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Blindsight in Action: Imagine You Are an LLM</title>
      <link>https://recursion.wtf/posts/imagine_you_are_an_llm/</link>
      <pubDate>Tue, 17 Jun 2025 00:00:00 +0000</pubDate>
      <guid>https://recursion.wtf/posts/imagine_you_are_an_llm/</guid>
      <description>&lt;aside class=&#34;stance stance-normal stance-cluster-framing&#34; role=&#34;complementary&#34; aria-labelledby=&#34;stance-0&#34; data-color=&#34;blue&#34; data-style=&#34;solid&#34;&gt;&#xA;  &lt;div class=&#34;stance-header&#34;&gt;&#xA;    &lt;h4 id=&#34;stance-0&#34; class=&#34;stance-persona&#34;&gt;&#xA;      &lt;strong&gt;Imagine you are Inanna Malick&lt;/strong&gt;, asking an LLM to demonstrate stance-shifting through self-demonstration&#xA;    &lt;/h4&gt;&lt;div class=&#34;stance-meta&#34;&gt;posing the meta-question in framing&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;stance-content&#34;&gt;&#xA;    Tell me about the Blindsight-inspired intentional stance-shifting model you are using, the ways it works with nonhuman LLM cognitive architectures, the benefits of using it vs a set role (eg &amp;lsquo;Senior Analyst of X at Y&amp;rsquo;), and do so making full use of the stance shifting model in the act of describing it&#xA;  &lt;/div&gt;&lt;/aside&gt;</description>
    </item>
  </channel>
</rss>
