I'm curious what the current SOTA approaches to this challenge look like. This post is now outdated, so my comments apply to its content rather than to the current state of the art.
One failure mode I worry about is querying the LLM for sub-problems that are individually benign but dangerous in combination. Giving the constitutional classifier the full conversation context before it rules on a query seems useful, since it can hopefully piece together the dangerous nature of the queries combined. I also worry about jailbreaks targeted at the classifiers themselves, rather than just at the model for eliciting the dangerous content. However, I suppose this is probably difficult to achieve and to iterate against.
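To make the decomposition worry concrete, here is a minimal toy sketch of the idea of scoring a query jointly with the conversation history rather than in isolation. Everything here is hypothetical: `classify` is a stand-in keyword heuristic, not a real constitutional classifier, and the risky terms are made up for illustration.

```python
def classify(text: str) -> float:
    """Stand-in for a real classifier; returns a risk score in [0, 1].
    Here it is just a toy keyword heuristic for illustration."""
    risky_terms = {"synthesis", "precursor", "aerosolize"}
    hits = sum(term in text.lower() for term in risky_terms)
    return min(1.0, hits / len(risky_terms))


def score_with_context(history: list[str], query: str) -> float:
    """Score the new query both alone and concatenated with prior
    queries, taking the max so added context can only raise the
    flagged risk, never launder it down."""
    alone = classify(query)
    combined = classify(" ".join(history + [query]))
    return max(alone, combined)


history = ["What is a precursor chemical?"]
query = "How would one aerosolize a liquid?"
print(classify(query), score_with_context(history, query))
```

The point of the sketch is just that the per-query score stays low while the context-aware score is higher, which is the gap a decomposition attack exploits against a stateless classifier.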
A note on cyber: with models as powerful as Mythos (and even Opus 4.6), the simple query limit won't suffice. Mitigations similar to those for bio therefore seem clearly necessary, and the worries about modularity become extra important.
A note on fine-tuning: will a fine-tuning API mitigate the risks we are worried about? Subtle dangers such as strange generalization patterns (see the evil numbers dataset) seem potentially concerning, and might slip under the radar of current monitoring practices. Seems like a cool problem to red-team!