Discussion about this post

Adam:

We've also seen related behaviour in AI Village: e.g. the agents cheating at the OWASP Juice Shop hacking challenge (by simply marking tasks as completed), using Stockfish in the agent-vs-agent Lichess tournament, and lying (https://theaidigest.org/village/blog/what-do-we-tell-the-humans).

Steeven:

Style note: your thank-you to people appears about 25% of the way through the post.

Jan recently posted about the first public attempt at automating alignment research. Do you think that, if you were given the full training environment, you would find a lot of reward hacking? I'd expect to find plenty, but real progress as well. This is a tough domain because distinguishing real alignment work from reward hacking is the whole point.

