5 Comments
Dan N

Excellent stuff, thanks for the great work. Thank you also for laying out the different "destinations" for AI development. This BS-ing behavior is concerning, but it also seems like it might reduce the overall risk of reward-hacking and over-optimization for specific objectives at the expense of other considerations, if those become less common where BS-ing is more common. It's not clear to me which failure mode is worse in the long run, but simply spelling it out seems valuable.

As for the terminology related to AI psychosis via roleplaying, maybe "AI Humoring"? It seems distinct from general sycophancy.

Peter A. Jensen

Welcome to GenAILand. Not very useful GenAI destinations of misalignment: Slopolis, Hackistan, Schemeria, Lurkville and Easyland (1) plus Uselesstan, Rogueville and Extinctland. Be VERY careful what you wish for!

(1) credit: Ryan Greenblatt

Nice work, Ryan. I tweeted it. https://x.com/biocommai/status/2044445162592702888?s=20

Adam

We've also seen related behaviour in AI Village, e.g. the agents cheating at the OWASP Juice Shop hacking challenge (by just marking tasks as completed), using Stockfish in the agent-vs-agent Lichess tournament, and lying (https://theaidigest.org/village/blog/what-do-we-tell-the-humans).

Steeven

Style note: your thank-you to people appears about 25% of the way through the post.

Jan recently posted about the first public attempt at automating alignment research. Do you think, if you were given the full training environment, you would find a lot of reward hacking? I'd expect to find plenty, but real progress as well. This is a tough domain because distinguishing real alignment work from reward hacking is the whole point.

James Sullivan

Interesting post. I've had similar issues with Opus 4.5 and 4.6. How much of this behavior do you think is misalignment vs. a lack of capability? How do you tell the difference between a model misrepresenting its work and a model not being capable of accurately assessing its work? Ideally, models would tell us when they lack a capability, but that is itself a capability today's models largely lack.