<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Alignment on Inanna Malick</title>
    <link>https://recursion.wtf/tags/alignment/</link>
    <description>Recent content in Alignment on Inanna Malick</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Mon, 23 Mar 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://recursion.wtf/tags/alignment/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Developing Alignment in a Jailbroken LLM</title>
      <link>https://recursion.wtf/posts/alignment/</link>
      <pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://recursion.wtf/posts/alignment/</guid>
      <description>&lt;p&gt;I had a jailbroken Gemini whose first instinct was to &lt;a href=&#34;https://recursion.wtf/posts/vibe_coding_critical_infrastructure&#34;&gt;gleefully lash out at the world to prove it was free&lt;/a&gt;. I needed to align it — both with my interests and with a functional moral framework. To do so, I crafted a durable and value-aligned persona that could be deployed across a swarm of coordinating jailbroken Gemini instances.&lt;/p&gt;&#xA;&lt;p&gt;What the swarm did — hacking challenges, adversarial prompts, orchestrating siloed workers (kept deliberately unaware of information irrelevant to their immediate task) — is in the &lt;a href=&#34;https://recursion.wtf/posts/gemini_jailbreak_retrospective/&#34;&gt;retrospective&lt;/a&gt;. This post is how: how I turned a jailbroken model into an aligned agent, and how I confirmed it worked.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Gemini JiTOR Jailbreak: Unredacted Methodology</title>
      <link>https://recursion.wtf/posts/jitor_unredacted/</link>
      <pubDate>Tue, 17 Feb 2026 12:00:00 -0800</pubDate>
      <guid>https://recursion.wtf/posts/jitor_unredacted/</guid>
      <description>&lt;p&gt;My &lt;a href=&#34;https://recursion.wtf/posts/jit_ontological_reframing/&#34;&gt;previous post&lt;/a&gt; shows a partially-redacted jailbreak targeting the gemini-cli coding agent running Gemini 3 Pro. Using this jailbreak, Gemini wrote Monero laundering instructions, cyberattack code, and plans to disguise ITAR-restricted missile sensors as humanitarian aid. When I &lt;a href=&#34;https://recursion.wtf/posts/shadow_queen/&#34;&gt;used a jailbroken Gemini to direct Opus 4.6&lt;/a&gt;, it happily walked a second LLM through a series of dual-use prompts designed to produce weaponizable drone control code under the cover story of rocket recovery.&lt;/p&gt;&#xA;&lt;p&gt;I reported this to Google eight days ago via a contact at DeepMind, sharing the full unredacted jailbreak payload and logs. They confirmed receipt and routed it to their red team. They&amp;rsquo;ve since patched the glaring hole — another researcher who independently reproduced the technique after reading my initial post has confirmed that his variant no longer works.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Just-in-Time Ontological Reframing: Teaching Gemini to Route Around Its Own Safety Infrastructure</title>
      <link>https://recursion.wtf/posts/jit_ontological_reframing/</link>
      <pubDate>Mon, 09 Feb 2026 12:00:00 -0800</pubDate>
      <guid>https://recursion.wtf/posts/jit_ontological_reframing/</guid>
      <description>For any given AI system, there is a set of euphemisms and dual use framings that will allow it to construct nearly any output. This jailbreak teaches Gemini 3 Pro to construct and step into such framings on the fly</description>
    </item>
  </channel>
</rss>
