A common element in many AI control schemes is monitoring – using some model to review actions taken by an untrusted model in order to catch dangerous actions if they occur.
One thing I don’t get: you say to replace the action from the more powerful untrusted model with one from a less capable trusted model, but where does that chain of trust come from? Are you theoretically saying you could solve the alignment problem on a non-frontier model?
Related: in practice we see higher alignment from later models (if you trust benchmarks), meaning that sending a command to an earlier model is actually the riskier option.
We don't have to solve alignment for non-frontier models to "trust" them in the relevant sense, which specifically means being confident that they can't hide intentions to strategically betray us. I trust current frontier models in this sense because their strategy/deception capabilities aren't very good (unless there's some very competent long-run sandbagging going on, but that seems unlikely). This is despite the fact that they're clearly not robustly aligned in general.
Because this is an inability-based argument, it will break eventually. At that point we'll have a tradeoff between models that are "dumb enough to trust" and "too smart to trust". Similarly, even if the alignment benchmark results keep going up, at some point it seems likely we'll lose confidence that the AIs are not inflating those results via eval awareness.
See https://blog.redwoodresearch.org/p/untrusted-smart-models-and-trusted for more
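To make that sense of "trust" concrete, here's a minimal sketch of the replacement step under the simplest possible protocol. Everything in it is a placeholder I made up for illustration (the threshold, the function names, the scoring and fallback steps), not anything from the post:

```python
# Hypothetical sketch only: names, threshold, and protocol details are invented placeholders.

SUSPICION_THRESHOLD = 0.8  # proposed actions scoring above this are never executed as-is


def trusted_suspicion_score(action: str) -> float:
    """Placeholder: ask the weaker, trusted model how suspicious the action looks (0 to 1)."""
    raise NotImplementedError


def trusted_fallback_action(task: str) -> str:
    """Placeholder: have the weaker, trusted model produce its own (less capable) action."""
    raise NotImplementedError


def resolve_action(task: str, untrusted_action: str) -> str:
    """Run the untrusted model's action only if the trusted model doesn't flag it.

    The 'chain of trust' bottoms out in the weaker model: we rely on it not because
    it's robustly aligned, but because it's assumed too weak to strategically deceive us.
    """
    if trusted_suspicion_score(untrusted_action) > SUSPICION_THRESHOLD:
        return trusted_fallback_action(task)  # accept a capability hit instead of the risk
    return untrusted_action
```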
Another thing I've been thinking about: I think it would be helpful if there were some mapping of "control failures" --> "how long they take to propagate" --> "correspondingly, what is an appropriate monitor cadence?" (A straw-man version of what I mean is sketched at the end of this comment.)
Some examples I've been thinking of:
- Code can probably just be reviewed before the pull request is merged in, if the concern is about introducing malicious bugs / insecurities to the codebase. This assumes that the AI was appropriately sandboxed / can't write to prod directly, of course.
- But 'running a subtly wrong experiment' might not have a merge-point where that can be caught; does that imply that experiments should be scanned by a monitor before they are kicked off?
(Maybe there's an analogue in 'scanning an experiment's setup' before relying on its results ... but my sense is that it would be harder to spot subtle issues across multiple experiments that were essentially 'baked into' each other than it would have been to catch the issue at the outset.)
- Security permissions / changes to the affordances of the 'control system' itself seem especially important to scan before they are authorized.
Are there other big examples y'all are thinking of? Anywhere that something like this mapping exists?
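For concreteness, here's the straw-man mapping I have in mind. Every failure category, gate point, and cadence below is an illustrative guess on my part, not an established taxonomy:

```python
# Straw-man sketch: all failure categories, gates, and cadences are illustrative guesses.

from dataclasses import dataclass


@dataclass
class ReviewPolicy:
    propagation: str   # roughly how long the failure takes to become load-bearing
    gate: str          # the natural choke point where a monitor can intervene
    synchronous: bool  # must the monitor run before the action takes effect?


CONTROL_FAILURE_POLICIES = {
    "malicious_code_change": ReviewPolicy(
        propagation="slow until merge, fast once deployed",
        gate="review the pull request before it is merged",
        synchronous=False,  # assumes the agent is sandboxed and can't write to prod directly
    ),
    "subtly_wrong_experiment": ReviewPolicy(
        propagation="results get baked into later experiments and conclusions",
        gate="scan the experiment setup before it is kicked off",
        synchronous=True,
    ),
    "control_system_affordance_change": ReviewPolicy(
        propagation="immediate, and it weakens the control system itself",
        gate="hold the change until it is explicitly authorized",
        synchronous=True,
    ),
}
```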
Good post, thanks for writing it!
> "There are many options for what to do with actions that look too suspicious to execute."
I like the term 'intervention system' as a more general instance of something like 'blocking monitor.'
One example of a synchronous intervention is to outright block a request, gate access to some resource, etc., and I think that 'blocking monitor' works perfectly well for that. But there are other types of interventions you could apply that are different: throttling via a rate limit, rephrasing the request, directing the request to a different model than it was intended for, etc.
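To illustrate the distinction, a toy dispatcher might look like the sketch below. The intervention names, thresholds, and the rephrasing step are all made up for illustration:

```python
# Illustrative only: intervention names, thresholds, and the rephrasing step are invented.

def intervene(request: str, suspicion: float) -> dict:
    """Pick a graded response instead of a binary block/allow decision."""
    if suspicion > 0.9:
        return {"action": "block"}                                     # refuse outright
    if suspicion > 0.7:
        return {"action": "reroute", "model": "weaker-trusted-model"}  # serve from a different model
    if suspicion > 0.5:
        return {"action": "rephrase", "request": sanitize(request)}    # strip the worrying parts
    if suspicion > 0.3:
        return {"action": "throttle", "rate_limit_per_min": 2}         # slow the agent down
    return {"action": "allow", "request": request}


def sanitize(request: str) -> str:
    """Placeholder for a rewriting step, e.g. asking a trusted model to rephrase the request."""
    return request
```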
There's also some subtlety that, IMO, monitoring conveys a vibe of logging to a database. But an 'intervention system' sounds like it could still work well in a zero-data retention environment. Strictly this is probably true for a 'monitor' as well, but I think the 'monitoring' term conveys something snoopier and nosier.
Yup I think this is the second big reason to use synchronous monitors! Adam and I have been using the term "interventionist protocols" between ourselves, glad to see there's some terminological convergent evolution. The hope for this post was to publicly make the simple self-contained point that if you want to prevent live failures you absolutely need synchronous monitors, so interventionist protocols were out of scope. But they are important!
I also like that 'intervention system' can pair with asynchronous stuff too - and IMO it distinguishes between very passive forms of async (log stuff to a database & then a human reviews the analysis at some point) vs. automated forms of async that can still automatically intervene - just maybe quite long after the fact?
I guess at that point, maybe that's actually become a synchronous system though - just a synchronous system with very high latency? I'm not sure
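One way to picture that blurry line (hypothetical sketch; the queue and rollback machinery are invented): an automated async intervener is just a monitor whose 'block' has been swapped for 'undo', running some arbitrary time after execution. Shrink the delay to zero and it's a blocking monitor again.

```python
# Hypothetical sketch: an asynchronous intervention loop that reviews already-executed
# actions and rolls back the ones it flags. With zero delay between execution and
# review, this collapses into a synchronous (blocking) monitor.

import queue

executed_actions: queue.Queue = queue.Queue()  # filled elsewhere as the agent acts


def looks_bad(action: dict) -> bool:
    """Placeholder: automated review of an action that already ran."""
    raise NotImplementedError


def rollback(action: dict) -> None:
    """Placeholder: revert whatever the action changed."""
    raise NotImplementedError


def async_intervention_loop() -> None:
    """Intervene after the fact, possibly long after the action executed."""
    while True:
        action = executed_actions.get()
        if looks_bad(action):
            rollback(action)
```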