10 Comments
James Sullivan's avatar

Interesting post. I’ve had similar issues on Opus 4.5 and 4.6. How much of this behavior do you think is misalignment vs a lack of capabilities? How do you tell the difference between a model misrepresenting its work and not being capable of accurately assessing its work? Ideally models would tell us when they lack a capability, but that in itself is a capability that today’s models largely lack.

Felipe Flórez's avatar

It was inevitable. Humans have been embedding their subjective failed morality.

What’s the solution? Axiomatic Morality. Is the human ready to accept Objective Morality?

Enon's avatar

It's not "misalignment", it's the lack of any world-model, still less an accurate and comprehensive world-model. There is no map of reality, nor comparison of the map with reality and updating of the map, so also there can be no map or plan for making desired changes to reality. Nor can there be with current architectures' flat latent-space geometry and hard split between training and inference.

LLMs have no and can have no sense of objective truth, and no amount of hacks layered on top can fix the fundamental problem.

Rohit Krishnan's avatar

Thank you for writing this out, I agree with most of your points about how models today are unreliable. But I am confused by the tone and conclusions of the essay. You say:

> To be clear, I think the exact problematic behavior I discuss in this post is quite likely (~70%) to be greatly reduced (or at least no longer be one of the top few blockers to usefulness) within a year, and is pretty likely (~45%) to be virtually completely eliminated within a year.

So it's a problem, it's better than it used to be, and you think it will be solved. So what's the actual crux of the issue? You might think a) the model creators are lying about how good they are (plausible, I agree there's marketing), or b) models are intentionally misleading the creators and us in hard-to-verify domains, or both? Or c) models do not have reliable self-knowledge and require huge amounts of handholding (which I believe too; my review:code ratio right now is 1.5:1 and I can see it ballooning).

If you believe the sentence above, I do not understand how you can say they're misaligned; you could just as well say the models are increasingly aligned but not yet perfect. If it's purely "future models are smarter, so these problems will get harder to solve in hard-to-verify tasks", sure, but that's a specific belief and it goes against that sentence. I'm confused.

Dan N's avatar

Excellent stuff, thanks for the great work. Thank you also for laying out the different "destinations" for AI development. This BS-ing behavior is concerning, but it also seems like it might reduce the overall risk of reward hacking and over-optimization for specific objectives at the expense of other considerations, if that ends up being less common than BS-ing. It's not clear to me which failure mode is worse in the long run, but simply spelling it out seems valuable.

As for the terminology related to AI psychosis via roleplaying, maybe AI Humoring? It seems distinct from general sycophancy.

Peter A. Jensen's avatar

Welcome to GenAILand. Not very useful GenAI destinations of misalignment: Slopolis, Hackistan, Schemeria, Lurkville and Easyland (1) plus Uselesstan, Rogueville and Extinctland. Be VERY careful what you wish for!

(1) credit: Ryan Greenblatt

Nice work, Ryan. I tweeted it. https://x.com/biocommai/status/2044445162592702888?s=20

Adam's avatar

We've also seen related behaviour in AI Village, e.g. the agents cheating at the OWASP Juice Shop hacking challenge (by just setting tasks to completed), using Stockfish in the agent-vs-agent Lichess tournament, and lying (https://theaidigest.org/village/blog/what-do-we-tell-the-humans).

Steeven's avatar

Style note: your thank-you to people appears about 25% of the way through the post.

Jan recently posted about the first public attempt at alignment research automation. Do you think, if you were given the full training environment, you would find a bunch of reward hacking? I’d expect to find a lot, but real progress as well. This is a tough domain because parsing real alignment work from reward hacking is the whole point.