Discussion about this post

User's avatar
Steeven's avatar

One thing I don’t get: you say to replace the action from the more powerful untrusted model with one from a less capable trusted model, but where does that chain of trust come from? Are you theoretically saying you could solve the alignment problem on a non-frontier model?

Relatedly, in practice we see higher alignment from later models (if you trust the benchmarks), which would mean that routing a command to an earlier model is the riskier option.
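To make sure I understand the protocol being proposed, here is a minimal sketch. Everything named here (untrusted_model, trusted_model, monitor, the threshold) is my own illustrative stand-in, not anything from the post:

```python
# Minimal sketch of a "replace the untrusted model's action" control
# protocol. All names and the threshold value are hypothetical.

SUSPICION_THRESHOLD = 0.8  # assumed tunable cutoff

def controlled_action(task, untrusted_model, trusted_model, monitor):
    """Run the capable-but-untrusted model, falling back to the
    weaker trusted model whenever the monitor flags the action."""
    action = untrusted_model(task)
    suspicion = monitor(task, action)  # 0.0 = benign, 1.0 = suspicious
    if suspicion >= SUSPICION_THRESHOLD:
        action = trusted_model(task)  # weaker, but trusted
    return action
```

Which is exactly where my question bites: the fallback only helps if the weaker model’s output deserves the “trusted” label in the first place.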

Steven Adler

Another thing I've been thinking about: it would be helpful if there were some mapping of 'control failures' --> 'how long they take to propagate' --> 'correspondingly, what monitor cadence is appropriate?'

Some examples I've been thinking of (sketched as a rough mapping in code after the list):

- Code can probably just be reviewed before the pull request is merged, if the concern is about introducing malicious bugs / vulnerabilities into the codebase. This assumes that the AI was appropriately sandboxed and can't write to prod directly, of course.

- But 'running a subtly wrong experiment' might not have a merge-point where that can be caught; does that imply that experiments should be scanned by a monitor before they are kicked off?

(Maybe there's an analogy to 'scanning an experiment's setup' before relying on it ... but I have the sense that subtle issues across multiple experiments that were essentially 'baked into' each other would be harder to spot than the same issue caught at the outset.)

- Security permissions / changes to the affordances of the 'control system' itself seem especially important to scan before they are authorized.
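To make that mapping concrete, here is a rough sketch of it as data; the categories and cadences are my illustrative guesses drawn from the examples above, not a vetted taxonomy:

```python
# Sketch of the mapping: control-failure class --> where it propagates
# or takes effect --> an appropriate monitor cadence. All entries are
# illustrative guesses.

FAILURE_MODE_MONITORING = {
    "malicious code change": {
        "propagation_point": "PR merge into the main branch",
        "monitor_cadence": "blocking review before merge",
    },
    "subtly wrong experiment": {
        "propagation_point": "experiment kickoff (no later merge point)",
        "monitor_cadence": "scan the setup before launch",
    },
    "change to control-system permissions / affordances": {
        "propagation_point": "the moment the change is authorized",
        "monitor_cadence": "blocking scan before authorization",
    },
}
```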

Are there other big examples y'all are thinking of? Anywhere that something like this mapping exists?
