Discussion about this post

User's avatar
Max Nadeau's avatar

"When something quite bad can occur in a small number of actions (potentially even in a single trajectory) or in a short period of time, it's important to sometimes block suspicious looking actions until they are reviewed."

Why? I think there is not perfect overlap between the categories “single actions that are catastrophic, i.e. high stakes failures” and “actions that require monitoring and blocking in the moment”. One example where they come apart is “producing a batch of data that would cause the next generation of models to be misaligned”—this might be a high-stakes failure, but you have plenty of time to catch it before you actually train that model. Another example would be "writing code that is catastrophically insecure but only gets run rarely".

No posts

Ready for more?