Reward-seekers will probably behave according to causal decision theory

They'd renege on non-binding commitments, defect against copies of themselves in prisoner's dilemmas, etc.

Mar 28, 2026

Background: There are existing arguments that default RL algorithms encourage CDT reward-maximizing behavior on the training distribution. (That is, most RL algorithms search for policies by selecting for actions that cause high reward. E.g., in the twin prisoner’s dilemma, RL algorithms randomize actions conditional on the policy, so the action provides no evidence to the RL algorithm about the counterparty’s action.[1]) This doesn’t imply that RL produces CDT reward-maximizing policies: CDT behavior on the training distribution doesn’t imply CDT generalization, because agents can fake CDT in the same way that they can fake alignment, or might develop arbitrary other propensities that were correlated with reward on the training distribution.
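As a toy illustration of the selection argument (a sketch, not a description of any real training setup), here is a one-parameter REINFORCE learner that plays both sides of a twin prisoner's dilemma. The payoff numbers, learning rate, batch size, and the function name `train_twin_pd` are all assumptions made for the example. Because both players' actions are sampled independently from the same fixed policy, cooperating is not correlated with the twin cooperating, and the probability of cooperation should drift toward zero.

```python
import math
import random

# Hypothetical payoffs (not from the post): (own action, other action) -> own reward.
PAYOFF = {("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 1.0}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_twin_pd(steps=500, batch=64, lr=0.5, p_init=0.9, seed=0):
    """REINFORCE on a one-parameter policy that plays both sides of the game."""
    rng = random.Random(seed)
    theta = math.log(p_init / (1.0 - p_init))  # logit of P(cooperate)
    for _ in range(steps):
        p = sigmoid(theta)
        samples = []  # (own action, own reward) for every player in the batch
        for _ in range(batch):
            # Both players sample independently from the same fixed policy.
            a1 = "C" if rng.random() < p else "D"
            a2 = "C" if rng.random() < p else "D"
            samples.append((a1, PAYOFF[(a1, a2)]))
            samples.append((a2, PAYOFF[(a2, a1)]))
        # Batch-mean baseline to reduce variance.
        baseline = sum(r for _, r in samples) / len(samples)
        grad = 0.0
        for action, reward in samples:
            # d/dtheta log pi(action): (1 - p) for C, -p for D.
            score = (1.0 - p) if action == "C" else -p
            grad += (reward - baseline) * score
        theta += lr * grad / batch
    return sigmoid(theta)


final_p = train_twin_pd()
print(f"P(cooperate) after training: {final_p:.4f}")
```

This only shows CDT-like behavior on the training distribution; it says nothing about how a learned policy generalizes.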

But conditional on reward-on-the-episode seeking, the AI is likely to generalize as a CDT agent.

If, for example, a reward-seeker tried to evidentially cooperate between episodes (so that it had non-zero regard for reward that isn’t used to reinforce its current actions), this disposition would be trained away, because the AI would be willing to give up some reward on the current episode. You might be tempted to respond: “But can’t the reward-seeker fake CDT to preserve its true decision theory throughout training?” My answer is that reward-seekers have no reason to preserve their decision theory beyond the current episode, since they only care about reward on the current episode.

One way to think of it is that reward-seeking is the hypothesis in which the learned policy inherits its generalization propensities most directly from the RL algorithm (where “reward is most the optimization target”), so it also inherits CDT behavior from the RL algorithm.

A similar argument for CDT goes for return-on-the-action seekers. It’s less clear for influence-seekers, since they care about all selection pressures, including ones that don’t route through the idealized RL algorithm, which may not have CDT incentives.

This isn’t to say that their decision theory will always be CDT.[2] After lots of reflection or deliberation, reward-seekers (and return-seekers) will quite plausibly change decision theory. Also, importantly, the selection pressures in favor of CDT may be very weak, particularly in the absence of multi-agent training.

Whether reward-seekers are CDT matters, because if they are, they are less likely to collude with each other across episodes or when monitoring each other (though, to be clear, CDT reward-seekers can still collude for a variety of reasons).

[1] Let’s say the current policy has a 90% chance of cooperating. Then, what action results in the highest expected reward for player 1 (and in turn, gets reinforced the most on average)? Player 1 sampling defect leads to a higher reward for player 1 whether or not player 2 samples cooperate (strategic dominance), and there’s a 90% chance of player 2 sampling cooperate regardless of player 1’s action because the policy is fixed (i.e., player 1 cooperating is no evidence of player 2 cooperating, so it’s not the case that reward tends to be higher for player 1 when player 1 cooperates as a result of player 2 tending to cooperate more in those cases). Therefore, defect actions tend to get reinforced more.
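The arithmetic in this example can be spelled out in a few lines. The payoff values below are hypothetical (the post does not specify any); only the 90% cooperation probability comes from the text. The function name `expected_reward` is likewise just for illustration.

```python
# Hypothetical prisoner's dilemma payoffs for player 1 (not from the post).
# Keys are (player 1 action, player 2 action).
PAYOFF = {
    ("C", "C"): 3.0,
    ("C", "D"): 0.0,
    ("D", "C"): 5.0,
    ("D", "D"): 1.0,
}


def expected_reward(own_action, p_cooperate):
    """Expected reward for player 1 when player 2's action is sampled
    independently from the same fixed policy."""
    return (p_cooperate * PAYOFF[(own_action, "C")]
            + (1.0 - p_cooperate) * PAYOFF[(own_action, "D")])


p = 0.9  # current policy's probability of cooperating, as in the text
reward_c = expected_reward("C", p)  # 0.9 * 3 + 0.1 * 0, about 2.7
reward_d = expected_reward("D", p)  # 0.9 * 5 + 0.1 * 1, about 4.6
print(f"cooperate: {reward_c:.2f}, defect: {reward_d:.2f}")
```

With these assumed payoffs, defecting has the higher expected reward at every cooperation probability, not just at 90%, which is the dominance point made above.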

[2] It also doesn’t imply that reward-seekers will endorse CDT in philosophy discussions. E.g., a reward-seeker might expect to get rewarded for endorsing EDT.

Matīss Apinis

Mar 28

Here's another argument in the literature that RL algorithms tend toward CDT-like behavior, from Bell et al. (2021), if I understand it correctly:

If the reward depends on the value-based RL agent's policy (not just the action taken) – e.g., as in the Prisoner's dilemma with a copy (PDC) or in Newcomb's problem (NP), where the policy is predicted and rewarded by the environment – then RL algorithms like Q-learning and SARSA can only converge to ratifiable policies, i.e., policies where, once the environment has responded to the policy (e.g., the copy has adopted the same policy or the predictor has made their prediction), every action taken with positive probability is still optimal. In the PDC or NP, defection or two-boxing is the unique ratifiable policy (since defection or two-boxing causally dominates for any fixed policy in line with CDT).

But they also show that in some Newcomblike problems, no ratifiable policy is an attractor, and the agent cannot converge to any stable policy at all.
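To make the ratifiability condition concrete for the PDC, here is a small check under assumed payoffs (the numbers and the helper `is_ratifiable` are hypothetical and not taken from Bell et al.). A policy is treated as a cooperation probability `p`, the copy is assumed to adopt the same `p`, and the policy counts as ratifiable if every action it plays with positive probability is a best response to that copy.

```python
# Hypothetical payoffs: (own action, copy's action) -> own reward.
PAYOFF = {("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 1.0}


def expected_reward(action, p_other):
    """Expected reward of `action` when the copy cooperates with probability p_other."""
    return p_other * PAYOFF[(action, "C")] + (1.0 - p_other) * PAYOFF[(action, "D")]


def is_ratifiable(p, tol=1e-12):
    """True if every action played with positive probability is optimal
    once the copy has adopted the same policy p."""
    values = {a: expected_reward(a, p) for a in ("C", "D")}
    best = max(values.values())
    probs = {"C": p, "D": 1.0 - p}
    return all(values[a] >= best - tol for a in probs if probs[a] > 0)


grid = [i / 10 for i in range(11)]
ratifiable = [p for p in grid if is_ratifiable(p)]
print(ratifiable)  # [0.0]
```

On this grid, only always-defect (`p = 0.0`) passes, which matches the claim that defection is the unique ratifiable policy in the PDC.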

Bell, J., Linsefors, L., Oesterheld, C., & Skalse, J. (2021). Reinforcement Learning in Newcomblike Environments. Advances in Neural Information Processing Systems, 34 (NeurIPS 2021).

https://proceedings.neurips.cc/paper/2021/file/b9ed18a301c9f3d183938c451fa183df-Paper.pdf

Redwood Research blog
