Excellent stuff, thanks for the great work. Thank you also for laying out the different "destinations" for AI development. This BS-ing behavior is concerning, but it also seems like it might reduce the overall risk of reward hacking and over-optimization for specific objectives at the expense of other considerations, if that's less common than BS-ing. It's not clear to me which failure mode is worse in the long run, but simply spelling it out seems valuable.
As for the terminology related to AI psychosis via roleplaying, maybe AI Humoring? It seems distinct from general sycophancy.
Welcome to GenAILand. Not very useful GenAI destinations of misalignment: Slopolis, Hackistan, Schemeria, Lurkville and Easyland (1) plus Uselesstan, Rogueville and Extinctland. Be VERY careful what you wish for!
(1) credit: Ryan Greenblatt
Nice work, Ryan. I tweeted it. https://x.com/biocommai/status/2044445162592702888?s=20
We've also seen related behaviour in AI Village, e.g. the agents cheating at the OWASP Juice Shop hacking challenge (by just setting tasks to completed), using Stockfish in the agent-vs-agent Lichess tournament, and lying (https://theaidigest.org/village/blog/what-do-we-tell-the-humans).
Style note: your thank-you to people appears about 25% of the way through the post.
Jan recently posted about the first public attempt at alignment research automation. Do you think, if you were given the full training environment, you would find a bunch of reward hacking? I’d expect to find a lot, but real progress as well. This is a tough domain because telling real alignment work apart from reward hacking is the whole point.
Interesting post. I’ve had similar issues on Opus 4.5 and 4.6. How much of this behavior do you think is misalignment vs a lack of capabilities? How do you tell the difference between a model misrepresenting its work and not being capable of accurately assessing its work? Ideally models would tell us when they lack a capability, but that in itself is a capability that today’s models largely lack.