# Loophole Treats Ethics Like Adversarial Testing

Source: https://www.bobzhu.tech/loophole-treats-ethics-like-adversarial-testing/
Markdown: https://www.bobzhu.tech/assets/agents/loophole-treats-ethics-like-adversarial-testing.md
Published: 2026-04-05T15:50:53.000Z
Tags: Essays, AI, Agents, Systems

Summary: Loophole is not interesting because it solves ethics. It is interesting because it treats moral principles like something you can draft, attack, patch, and escalate until the real conflicts in your values finally surface.

Feature image: https://storage.ghost.io/c/ea/80/ea80b01b-c9d0-45fd-a95f-5fbcd52ed925/content/images/2026/04/loophole-adversarial-testing-feature-web.png
Feature image alt: Hand-drawn editorial illustration showing a central rule document under pressure from opposing loophole and overreach forces, with review and precedent hovering around it.
Feature image caption: The interesting part is not writing the rule. It is watching the rule fail under pressure.

Most projects that talk about ethics and AI either become abstract very quickly or collapse into slogans.

That is why [Loophole](<https://github.com/brendanhogan/loophole>) is interesting.

It does something much more concrete.

Instead of asking an AI to tell you what is right, it asks you to state your moral principles in plain language, turns those principles into a formal legal code, and then attacks that code until it breaks.

That is a much better frame.

Loophole is not trying to "solve ethics." It is treating moral reasoning more like adversarial testing.

I also turned this piece into an interactive companion: [Loophole](<https://www.bobzhu.tech/loophole/>).

## The clever move is not the drafting

At first glance, the obvious novelty in Loophole is the drafting step.

You give the system a set of moral principles. An AI legislator translates them into a more formal legal code. That alone is interesting, because it forces vague intuitions into language that can actually be inspected.

But the drafting step is not the real breakthrough.

If the system stopped there, it would mostly be producing a cleaner-looking version of whatever you already believed.

The more interesting move comes next.

Loophole introduces two adversarial agents:

- a Loophole Finder, which searches for scenarios that are technically legal under your code but morally wrong according to your principles
- an Overreach Finder, which searches for the opposite: situations your code prohibits even though you would probably consider them morally acceptable

That changes the entire shape of the exercise.

The system no longer asks, "Can we write rules?"

It asks, "What happens when those rules meet adversarial reality?"

That is a much more serious question.
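The two attack directions can be pinned down with a small sketch. This is an illustration of the idea, not code from the Loophole repository: the names `Scenario`, `Finding`, and `classify` are hypothetical, and the two boolean verdicts stand in for judgments that the real system would get from models and from the user.

```python
from dataclasses import dataclass
from enum import Enum


class Finding(Enum):
    LOOPHOLE = "legal under the code, but wrong by the principles"
    OVERREACH = "prohibited by the code, but acceptable by the principles"
    CONSISTENT = "code and principles agree"


@dataclass(frozen=True)
class Scenario:
    description: str
    code_permits: bool       # assumed verdict of the drafted legal code
    principles_accept: bool  # assumed verdict of the stated principles


def classify(scenario: Scenario) -> Finding:
    """Label a scenario by the direction in which code and principles disagree."""
    if scenario.code_permits and not scenario.principles_accept:
        return Finding.LOOPHOLE
    if not scenario.code_permits and scenario.principles_accept:
        return Finding.OVERREACH
    return Finding.CONSISTENT
```

Under these assumptions, each adversarial agent is searching for scenarios that land in exactly one of the two disagreement cells.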

## Ethics gets easier to understand once it starts failing

This is what I think Loophole gets right.

Most moral systems look coherent when they are written at the level of principles.

"Protect privacy." "Be fair." "Avoid harm." "Respect consent."

Those statements feel stable until you start forcing them through edge cases.

What if someone is unconscious in an emergency and cannot give consent? What counts as harm when two harms conflict? When does a ban become overreach? How narrowly can you define an exception before it becomes useless? How broadly can you define it before it becomes a loophole?

Loophole turns those questions into the engine of the system.

That is why the project feels more like moral debugging than moral philosophy in the abstract.

The real output is not a beautiful legal code.

The real output is the moment where the code fails and you are forced to notice what your principles were smuggling in.

## The structure matters more than the models

It would be easy to misread this as another example of "strong model does clever thing."

That is not the useful lesson.

The important design choice is architectural.

Loophole is interesting because it breaks the problem into roles:

- legislator to draft and revise
- loophole finder to attack from the "legal but wrong" side
- overreach finder to attack from the "illegal but acceptable" side
- judge to decide whether a fix is possible without breaking previous rulings

That matters because the system is not relying on one general-purpose model to both invent the rules and trust its own work.

It builds disagreement into the loop.

That is a stronger pattern.

One of the recurring problems in agent systems is that a single model can sound internally coherent while quietly baking in its own blind spots. The faster route to robustness is usually not more eloquence. It is opposition, constraint, and regression checking.

Loophole understands that.
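As a rough sketch of that role separation, the drafter and the attackers can be modeled as distinct interfaces so that no component reviews its own output. Everything here is hypothetical: the `Legislator` and `Attacker` protocols, the stub classes, and `run_round` are illustrative names, and the stubs return canned strings where a real system would call a model.

```python
from __future__ import annotations

from typing import Protocol


class Legislator(Protocol):
    def draft(self, principles: list[str]) -> str: ...


class Attacker(Protocol):
    def attack(self, code: str) -> list[str]: ...


class StubLegislator:
    """Stand-in drafter: turns each principle into a numbered section."""

    def draft(self, principles: list[str]) -> str:
        return "\n".join(f"Section {i}: {p}" for i, p in enumerate(principles, 1))


class StubAttacker:
    """Stand-in attacker: challenges every section from one side."""

    def __init__(self, side: str) -> None:
        self.side = side

    def attack(self, code: str) -> list[str]:
        return [
            f"[{self.side}] challenge to {line.split(':')[0]}"
            for line in code.splitlines()
        ]


def run_round(
    principles: list[str], legislator: Legislator, attackers: list[Attacker]
) -> tuple[str, list[str]]:
    """Draft once, then collect findings from every independent attacker."""
    code = legislator.draft(principles)
    findings = [finding for attacker in attackers for finding in attacker.attack(code)]
    return code, findings
```

The design point is only that the draft passes through components with opposing goals before anyone trusts it.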

## The judge is really a regression harness

The judge role is probably the deepest part of the design.

According to the current repository README, when an attack lands, the judge tries to patch the legal code automatically, but only if the revision does not break any previous ruling. Every resolved case becomes permanent precedent. In effect, the system is building a growing test suite that future revisions have to satisfy.

That is exactly the right instinct.

This is where the project stops feeling like a toy debate machine and starts feeling like a proper systems idea.

A lot of ethical reasoning fails because people imagine each decision in isolation. They answer the new hard case, then forget that the answer changes the rest of the rule set. The next patch contradicts the last one. The next exception undermines the previous boundary. Coherence erodes quietly.

Loophole turns that into a visible systems problem.

Every solved case constrains the future.

That makes the moral code less like a static declaration and more like versioned software under test.
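The precedent-as-test-suite idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the repository's implementation: a "code" is modeled as a hypothetical function from a case description to a permitted/prohibited verdict, and `Precedent`, `broken_precedents`, and `accept_patch` are invented names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

# Assumption: a legal code is reduced to a function from case text to "permitted?".
Code = Callable[[str], bool]


@dataclass(frozen=True)
class Precedent:
    case: str
    permitted: bool


def broken_precedents(candidate: Code, precedents: list[Precedent]) -> list[Precedent]:
    """Return every earlier ruling that the candidate revision would overturn."""
    return [p for p in precedents if candidate(p.case) != p.permitted]


def accept_patch(candidate: Code, precedents: list[Precedent]) -> bool:
    """A revision is acceptable only if it preserves all earlier rulings."""
    return not broken_precedents(candidate, precedents)


precedents = [
    Precedent("share records without consent", False),
    Precedent("share records without consent in an emergency", True),
]


def narrow_exception(case: str) -> bool:
    # Toy rule: permit only emergency cases.
    return "emergency" in case


def blanket_permission(case: str) -> bool:
    # Toy rule: permit everything, which overturns the first precedent.
    return True
```

Here `accept_patch(narrow_exception, precedents)` passes, while `blanket_permission` is rejected because it flips the ruling on sharing records without consent.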

<figure class="kg-card kg-image-card">
  <img class="kg-image" src="https://storage.ghost.io/c/ea/80/ea80b01b-c9d0-45fd-a95f-5fbcd52ed925/content/images/2026/04/loophole-adversarial-testing-inline-loop-v2-web.png" alt="Hand-drawn editorial systems diagram showing principles feeding a legal code, adversarial probes attacking it, and a validation loop returning revised code into precedent.">
  <figcaption>The useful shift is from rule-writing to adversarial iteration: draft, attack, judge, patch, and repeat.</figcaption>
</figure>

## The escalations are the real product

The README makes another important point explicit: if the judge cannot find a consistent fix without contradicting previous decisions, the case gets escalated to the human.

That is not a fallback.

That is the real product.

The most interesting output from a system like this is not the easy case it can patch automatically. It is the unresolved case that exposes an actual tension in your framework.

Those are the moments where your principles stop being slogans and start becoming commitments with tradeoffs.

If you believe privacy should be absolute, what happens in a medical emergency? If you allow exceptions, how do you stop those exceptions from swallowing the rule? If you ban one kind of surveillance, what do you permit during public danger? If you want fairness, how do you define it when two groups are affected differently?

The escalated cases matter because they reveal something deeper than inconsistency in the code.

They reveal inconsistency in you.

That is why I think the project is smarter than it first appears.

It is not using AI to replace moral judgment.

It is using AI to force moral judgment into the open.
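One way to picture the escalation path is as the judge's fallthrough branch. The sketch below is an assumption-laden illustration rather than Loophole's actual logic: `Ruling`, `Outcome`, and `judge` are hypothetical names, candidate patches are plain functions, and the judge only picks among candidates it is given instead of generating them.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: a legal code is reduced to a function from case text to "permitted?".
Code = Callable[[str], bool]


@dataclass(frozen=True)
class Ruling:
    case: str
    permitted: bool


@dataclass(frozen=True)
class Outcome:
    status: str                        # "patched" or "escalated"
    code: Optional[Code] = None        # the accepted revision, if any
    unresolved: Optional[Ruling] = None  # the case handed to the human, if any


def judge(new_ruling: Ruling, candidates: list[Code], precedents: list[Ruling]) -> Outcome:
    """Accept the first candidate that satisfies the new ruling and all precedent."""
    required = [*precedents, new_ruling]
    for candidate in candidates:
        if all(candidate(r.case) == r.permitted for r in required):
            return Outcome("patched", code=candidate)
    # No consistent fix was found: surface the tension instead of hiding it.
    return Outcome("escalated", unresolved=new_ruling)


precedents = [Ruling("access records without consent", False)]
emergency = Ruling("access records without consent in an emergency", True)


def allow_all(case: str) -> bool:
    return True


def require_consent(case: str) -> bool:
    return "without consent" not in case


def emergency_exception(case: str) -> bool:
    return "without consent" not in case or "emergency" in case
```

With only `allow_all` and `require_consent` available, the emergency case is escalated; adding `emergency_exception` lets the judge patch it without touching precedent.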

## This is closer to red teaming than to moral automation

There is a useful framing difference here.

Loophole is not mainly a governance bot. It is not an abstract constitution. It is not a chatbot that tells you what your values are.

It is much closer to a red team for principles.

That framing matters because red teaming assumes failure is informative.

You do not red team a system because you think the first draft is enough. You red team it because the first draft almost certainly contains blind spots, exploit paths, bad assumptions, and brittle boundaries that only show up when someone tries to break them on purpose.

That is exactly the right way to think about moral and policy rules too.

The stronger your rule system needs to be, the less useful it is to admire the rule in calm conditions. You need to see what happens when someone exploits wording, stretches definitions, chains exceptions together, or pushes the system into edge conditions you did not anticipate.

Loophole operationalizes that instinct.

## Why this matters beyond one repo

I think Loophole matters for a broader reason as well.

A lot of current discussion around AI alignment, constitutions, policy, and guardrails is still too static. People talk as if the hard part is writing better principles. Sometimes it is. But very often the harder part is discovering where those principles fail once they are forced through adversarial reality.

That is true in law. It is true in safety policy. It is true in content moderation. It is true in privacy. It is true in institutional rules. And it is increasingly true in agent systems.

If you want a system to behave well in the world, the important question is not just what principle it declares.

It is how that principle behaves under attack.

That is why the Loophole architecture generalizes.

You can imagine the same pattern being applied to:

- AI constitutions
- internal company policy
- moderation rules
- safety protocols for autonomous agents
- domain-specific governance such as healthcare, education, or finance

The point is not that one loop magically solves those domains.

The point is that adversarial iteration is often a better path to clarity than polished principle-writing alone.

## The limits are real

That said, the project should not be romanticized.

A system like this still inherits the limits of the models inside it.

The attacks found will depend on what the adversarial agents can imagine. The judge's patches will depend on how well the model can reason about consistency. The whole loop still depends on the user's stated principles being honest enough, specific enough, and broad enough to generate meaningful tension.

There is also a deeper limitation.

Not every moral conflict can be resolved by writing better rules.

Some conflicts are not drafting errors. They are genuine value collisions.

That is exactly why the human escalation path matters so much. It is the point where the system stops pretending all contradictions are technical and acknowledges that some of them are philosophical.

That is not a flaw in the design.

It is a sign that the design understands where automation should end.

<figure class="kg-card kg-image-card">
  <img class="kg-image" src="https://storage.ghost.io/c/ea/80/ea80b01b-c9d0-45fd-a95f-5fbcd52ed925/content/images/2026/04/loophole-adversarial-testing-inline-escalation-web.png" alt="Hand-drawn editorial diagram showing a growing archive of resolved cases and one highlighted conflict escalating toward a human judgment point.">
  <figcaption>The unresolved cases are the most valuable ones, because they show where rule-writing ends and real judgment begins.</figcaption>
</figure>

## The real insight

The most useful way to think about Loophole is not as an ethics engine.

It is a system for discovering where your own stated rules stop being enough.

That is a much more grounded ambition.

And in a strange way, it is more valuable.

The project does not claim to eliminate moral ambiguity.

It builds a machine for surfacing it.

That is why I think the repo is worth paying attention to.

It treats moral reasoning the way good engineering teams treat reliability:

- specify the rule
- attack the rule
- patch what can be patched
- keep precedent
- escalate the genuine contradictions

That is not ethics solved.

But it is a far better way to start stress-testing what we mean when we say we believe something.

## Try this prompt

Take one principle I claim to care about and stress-test it like a policy. Draft the principle, generate three adversarial cases where it produces a bad or conflicting result, patch the principle, and repeat once. Finish by naming the real tradeoff the test exposed rather than pretending the principle solved everything.

## Related on this site

- [Loophole](<https://www.bobzhu.tech/loophole/>) is the interactive companion if you want the system, agents, simulation loop, and architecture in a more visual form.
- [AI Evaluation Checklist](<https://www.bobzhu.tech/ai-evaluation-checklist/>) is the smaller practical companion for a related question: once a system claims to have improved something, how do you test whether that claim is actually true?
- [How Agents Actually Talk to Each Other](<https://www.bobzhu.tech/how-agents-actually-talk-to-each-other/>) covers the multi-agent design side of systems like this once different roles, constraints, and handoffs start to matter.

### Sources

- [Loophole repository](<https://github.com/brendanhogan/loophole>)
