Discussion about this post

User's avatar
Steeven's avatar

One thing I don’t get: you say to replace the action from the more powerful untrusted model with one from a less capable trusted model, but where does that chain of trust come from? Are you theoretically saying you could solve the alignment problem on a non-frontier model?

Relatedly, in practice we see higher alignment from later models (if you trust the benchmarks), which would mean that routing a command to an earlier model is the riskier option.
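To make sure I understand the protocol being proposed, here is a minimal sketch. Everything named here (untrusted_model, trusted_model, monitor, the threshold) is my own illustrative stand-in, not anything from the post:

```python
# Minimal sketch of a "replace the untrusted model's action" control
# protocol. All names and the threshold value are hypothetical.

SUSPICION_THRESHOLD = 0.8  # assumed tunable cutoff

def controlled_action(task, untrusted_model, trusted_model, monitor):
    """Run the capable-but-untrusted model, falling back to the
    weaker trusted model whenever the monitor flags the action."""
    action = untrusted_model(task)
    suspicion = monitor(task, action)  # 0.0 = benign, 1.0 = suspicious
    if suspicion >= SUSPICION_THRESHOLD:
        action = trusted_model(task)  # weaker, but trusted
    return action
```

Which is exactly where my question bites: the fallback only helps if the weaker model’s output deserves the “trusted” label in the first place.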

Steven Adler

Another thing I've been thinking about: it would be helpful if there were some mapping of 'control failures' --> 'how long they take to propagate' --> 'correspondingly, what monitor cadence is appropriate?'

Some examples I've been thinking of (sketched as a rough mapping in code after the list):

- Code can probably just be reviewed before the pull request is merged, if the concern is about introducing malicious bugs / vulnerabilities into the codebase. This assumes that the AI was appropriately sandboxed and can't write to prod directly, of course.

- But 'running a subtly wrong experiment' might not have a merge-point where that can be caught; does that imply that experiments should be scanned by a monitor before they are kicked off?

(Maybe there's an analogy to 'scanning an experiment's setup' before relying on it ... but I have the sense that subtle issues across multiple experiments that were essentially 'baked into' each other would be harder to spot than the same issue caught at the outset.)

- Security permissions / changes to the affordances of the 'control system' itself seem especially important to scan before they are authorized.
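To make that mapping concrete, here is a rough sketch of it as data; the categories and cadences are my illustrative guesses drawn from the examples above, not a vetted taxonomy:

```python
# Sketch of the mapping: control-failure class --> where it propagates
# or takes effect --> an appropriate monitor cadence. All entries are
# illustrative guesses.

FAILURE_MODE_MONITORING = {
    "malicious code change": {
        "propagation_point": "PR merge into the main branch",
        "monitor_cadence": "blocking review before merge",
    },
    "subtly wrong experiment": {
        "propagation_point": "experiment kickoff (no later merge point)",
        "monitor_cadence": "scan the setup before launch",
    },
    "change to control-system permissions / affordances": {
        "propagation_point": "the moment the change is authorized",
        "monitor_cadence": "blocking scan before authorization",
    },
}
```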

Are there other big examples y'all are thinking of? Anywhere that something like this mapping exists?
