Discussion about this post

Matīss Apinis:

Here's another argument in the literature that RL algorithms tend toward CDT-like behavior – by Bell et al. (2021), if I understand it correctly:

If the reward depends on the value-based RL agent's policy and not just on the action taken – e.g., as in the Prisoner's Dilemma with a copy (PDC) or in Newcomb's problem (NP), where the policy is predicted and rewarded by the environment – then RL algorithms like Q-learning and SARSA can only converge to ratifiable policies, i.e., policies where, once the environment has responded to the policy (e.g., the copy has adopted the same policy, or the predictor has made their prediction), every action taken with positive probability is still optimal. In the PDC or NP, defection or two-boxing is the unique ratifiable policy, since for any fixed policy that the environment has responded to, defection or two-boxing causally dominates, in line with CDT.
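
To make the PDC case concrete, here's a minimal sketch (not Bell et al.'s exact setup) of a tabular Q-learning agent playing a one-shot Prisoner's Dilemma against a copy that samples its action independently from the agent's current policy. The payoff values, the epsilon-greedy exploration, and the learning-rate/episode-count choices are all illustrative assumptions; the point is just that, because defection causally dominates against any fixed policy the copy has adopted, Q(D) ends up above Q(C) and the agent settles on the ratifiable policy of defecting, even though mutual cooperation would score higher.

```python
import random

# Illustrative one-shot Prisoner's Dilemma payoffs to the agent
# (keys are (agent_action, copy_action)):
PAYOFF = {("C", "C"): 3.0, ("D", "C"): 5.0, ("C", "D"): 0.0, ("D", "D"): 1.0}

ACTIONS = ["C", "D"]
ALPHA = 0.05       # learning rate (illustrative)
EPSILON = 0.1      # exploration rate (illustrative)
EPISODES = 50_000

Q = {"C": 0.0, "D": 0.0}

def sample_action():
    """Epsilon-greedy action from the current Q-values."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[a])

for _ in range(EPISODES):
    my_action = sample_action()
    # The copy samples independently from the *same* current policy,
    # so the reward distribution depends on the agent's policy,
    # not just on the action it happened to take this episode.
    copy_action = sample_action()
    reward = PAYOFF[(my_action, copy_action)]
    # Bandit-style Q-learning update (one-shot game, so no next state).
    Q[my_action] += ALPHA * (reward - Q[my_action])

print(Q)  # Q["D"] ends up above Q["C"]: the agent settles on defection.
```

Running this, Q["D"] hovers a bit above the mutual-defection payoff of 1 (the small boost comes from the copy's exploratory cooperation), while Q["C"] stays well below it; that stability of defection once the copy has adopted the same policy is exactly what ratifiability requires.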

But they also show that in some Newcomblike problems, no ratifiable policy is an attractor, and the agent cannot converge to any stable policy at all.

Bell, J., Linsefors, L., Oesterheld, C., & Skalse, J. (2021). Reinforcement Learning in Newcomblike Environments. Advances in Neural Information Processing Systems, 34 (NeurIPS 2021).

https://proceedings.neurips.cc/paper/2021/file/b9ed18a301c9f3d183938c451fa183df-Paper.pdf
