Discussion about this post

Steeven:

One thing I don’t get: you say to replace the action from the more powerful untrusted model with a less capable untrusted model, but where does this chain of trust come from? Are you theoretically saying you could solve the alignment problem on a non-frontier model?

Relatedly, in practice we see higher alignment from later models (if you trust benchmarks), meaning that sending a command to an earlier model is riskier.

Steven Adler:

Good post, thanks for writing it!

> "There are many options for what to do with actions that look too suspicious to execute."

I like the term 'intervention system' as a more general version of something like a 'blocking monitor.'

One example of a synchronous intervention is to outright block a request, gate access to some resource, etc., and I think that 'blocking monitor' works perfectly well for that. But there are other types of interventions you could apply that are different: imposing a rate limit, rephrasing the request, directing it to a different model than the one it was intended for, etc.
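To make the distinction concrete, here is a minimal sketch of such a dispatcher: rather than a binary block/allow monitor, it maps a suspicion score to one of several graded interventions. All names and thresholds here are illustrative assumptions, not anything from the post.

```python
from enum import Enum

class Intervention(Enum):
    ALLOW = "allow"          # execute the action as-is
    THROTTLE = "throttle"    # execute, but under a rate limit
    REPHRASE = "rephrase"    # rewrite the action before executing
    REDIRECT = "redirect"    # route to a different model
    BLOCK = "block"          # refuse to execute outright

def choose_intervention(suspicion: float) -> Intervention:
    """Map a monitor's suspicion score in [0, 1] to a graded intervention.

    The thresholds are hypothetical; in practice they would be
    calibrated against the monitor's score distribution.
    """
    if suspicion < 0.2:
        return Intervention.ALLOW
    if suspicion < 0.5:
        return Intervention.THROTTLE
    if suspicion < 0.8:
        return Intervention.REPHRASE
    if suspicion < 0.95:
        return Intervention.REDIRECT
    return Intervention.BLOCK
```

The point of the enum is that a blocking monitor is just one arm of this dispatch; the other arms are interventions that a plain 'blocking' framing doesn't capture.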

There's also some subtlety in that, IMO, 'monitoring' conveys a vibe of logging to a database, whereas an 'intervention system' sounds like it could still work well in a zero-data-retention environment. Strictly speaking this is probably true of a 'monitor' as well, but I think the 'monitoring' term conveys something snoopier and nosier.
