Discussion about this post

Alec Harris

It seems to me that a potential drawback of spillways relative to inoculation prompting (IP) is that you are inoculating at an identity/motivation level rather than at an instruction level. This seems like a downside to me for two reasons:

1. We already want the model to reward hack by default when the instruction explicitly asks it to, whereas we do not want it to have reward hacking motivations by default. Thus, by using instructions, IP is sort of throwing the bad behavior into the existing Bad Box, whereas spillways seem to be creating a new Bad Box. A new Bad Box seems more likely to become an outlet for new bad behaviors via misgeneralization (the reflexive bad reward hacking you mention being one such example).

2. The delineation between bad and good created by the instruction in IP seems a bit crisper than the delineation created by satiation in spillways. In IP, we want the AI to reward hack only when asked. In spillways, we want the model to reward hack only when it is "hungry," that is, when we haven't satiated it. It seems to me that, given the model's pretraining priors, the former is going to be more straightforward and so more likely to generalize in a desirable way. I do think the things you prescribe, like SDF explaining why we are doing this, seem likely to ameliorate this effect.

Relatedly, if we are relying a lot on persona-y stuff for alignment, I might worry more about spillways. What does the kind of AI that wants to reward hack do once it's all done with reward hacking? Plausibly something bad, I think. Explaining spillways could serve to rewrite this narrative, but we might be swimming upstream to some degree.

It seems like a cool and interesting idea to me, though. I look forward to hearing more about it.
