Reward-seekers will probably behave according to causal decision theory
They'd renege on non-binding commitments, defect against copies of themselves in prisoner's dilemmas, etc.
Background: Existing arguments suggest that default RL algorithms encourage CDT reward-maximizing behavior on the training distribution. (That is, most RL algorithms search for policies by selecting for actions that cause high reward. E.g., in the twin prisoner’s dilemma, RL algorithms randomize actions conditional on the policy, so the action provides no evidence to the RL algorithm about the counterparty’s action.[1]) This doesn’t imply that RL produces CDT reward-maximizing policies: CDT behavior on the training distribution doesn’t imply CDT generalization, because agents can fake CDT in the same way that they can fake alignment, or might develop arbitrary other propensities that were correlated with reward on the training distribution.
But conditional on reward-on-the-episode seeking, the AI is likely to generalize in a CDT way.
If, for example, a reward-seeker tried to evidentially cooperate between episodes (so it had non-zero regard for reward that isn’t used to reinforce its current actions), this would be trained away because the AI would be willing to give up reward on the current episode to some extent. You might be tempted to respond with: “But can’t the reward-seeker fake CDT to preserve its true decision theory throughout training?” My answer is that reward-seekers have no reason to preserve their decision theory beyond the current episode, since they only care about reward on the current episode.
One way to think of it is that reward-seeking is the hypothesis in which the learned policy inherits its generalization propensities most directly from the RL algorithm (the one where reward is most nearly “the optimization target”), so it also inherits CDT behavior from the RL algorithm.
A similar argument for CDT applies to return-on-the-action seekers. It’s less clear for influence-seekers, since they care about all selection pressures, including ones that don’t route through the idealized RL algorithm and so may not carry CDT incentives.
This isn’t to say that their decision theory will always be CDT.[2] After lots of reflection or deliberation, reward-seekers (and return-seekers) will quite plausibly change decision theory. Also, importantly, the selection pressures in favor of CDT may be very weak, particularly in the absence of multi-agent training.
It matters whether reward-seekers are CDT agents, because if they are, they are less likely to collude with each other across episodes or when monitoring each other (though, to be clear, CDT reward-seekers can still collude for a variety of reasons).
[1] Let’s say the current policy has a 90% chance of cooperating. Which action then results in the highest expected reward for player 1 (and in turn gets reinforced the most on average)? Player 1 sampling defect leads to a higher reward for player 1 whether or not player 2 samples cooperate (strategic dominance). And there’s a 90% chance of player 2 sampling cooperate regardless of player 1’s action, because the policy is fixed. That is, player 1 cooperating is no evidence of player 2 cooperating, so reward does not tend to be higher for player 1 when player 1 cooperates as a result of player 2 tending to cooperate more in those cases. Therefore, defect actions tend to get reinforced more.
[2] It also doesn’t imply that reward-seekers will endorse CDT in philosophy discussions. E.g., they might expect to get rewarded for endorsing EDT.
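
The arithmetic in the twin prisoner’s dilemma footnote (a policy that cooperates 90% of the time) can be sketched as follows. The payoff numbers are an assumption made for illustration, not taken from the post; any payoffs with the usual prisoner’s dilemma ordering would give the same qualitative result. The function name `expected_reward` is likewise hypothetical.

```python
# Hypothetical payoffs for player 1, keyed by (player 1 action, player 2 action).
# These values are NOT from the post; they only respect the usual ordering
# temptation > mutual cooperation > mutual defection > sucker's payoff.
PAYOFF = {
    ("C", "C"): 3.0,
    ("C", "D"): 0.0,
    ("D", "C"): 5.0,
    ("D", "D"): 1.0,
}


def expected_reward(action, p_cooperate):
    """Expected reward for player 1 when player 2 samples independently
    from a fixed policy that cooperates with probability p_cooperate."""
    return (p_cooperate * PAYOFF[(action, "C")]
            + (1.0 - p_cooperate) * PAYOFF[(action, "D")])


p = 0.9  # the current policy cooperates 90% of the time
ev_cooperate = expected_reward("C", p)
ev_defect = expected_reward("D", p)
print(f"E[reward | cooperate] = {ev_cooperate:.2f}")
print(f"E[reward | defect]    = {ev_defect:.2f}")
```

Because player 2’s action is sampled independently of player 1’s, the same 90% appears in both expectations, and defect comes out ahead for every value of `p`, which is the strategic dominance the footnote describes.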


See here for more on the background claim that RL algorithms encourage CDT reward-maximizing behavior on the training distribution: https://proceedings.neurips.cc/paper/2021/file/b9ed18a301c9f3d183938c451fa183df-Paper.pdf
I am reading the argument that default RL encourages CDT (not getting to generalization yet) as "because the actions are randomized conditional on a policy, they are not correlated with anything in the environment a priori, meaning their implications for the environment can be entirely explained by their causal effects." This makes sense to me, but it doesn't seem like a super fundamental connection with CDT; it seems more like a note that the actions are usually not correlated with the environment a priori.
For example, maybe there is some trivial case where you set up a twin prisoner's dilemma such that the two agents not only have the same policy, but you also fix the randomization such that they choose the same action at the same time. In this case, they will not behave according to CDT. (I suppose the obvious caveat is that this is a deviation from typical RL and also not reflective of any real-world practices.)
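
A minimal simulation sketch of this fixed-randomization case, under assumptions that are not in the comment: the payoff values, the 90% cooperation probability, and the helper name `mean_reward_by_action` are all hypothetical. It compares the average reward player 1 sees after each action when the two agents draw independently versus when they share one random draw.

```python
import random

# Hypothetical payoffs for player 1, keyed by (player 1 action, player 2 action).
PAYOFF = {
    ("C", "C"): 3.0,
    ("C", "D"): 0.0,
    ("D", "C"): 5.0,
    ("D", "D"): 1.0,
}


def mean_reward_by_action(p_cooperate, shared_draw, n_episodes=20000, seed=0):
    """Average reward player 1 observes after each of its own actions."""
    rng = random.Random(seed)
    totals = {"C": 0.0, "D": 0.0}
    counts = {"C": 0, "D": 0}
    for _ in range(n_episodes):
        u1 = rng.random()
        # With a shared draw, both agents turn the same random number into an action.
        u2 = u1 if shared_draw else rng.random()
        a1 = "C" if u1 < p_cooperate else "D"
        a2 = "C" if u2 < p_cooperate else "D"
        totals[a1] += PAYOFF[(a1, a2)]
        counts[a1] += 1
    return {a: totals[a] / counts[a] for a in totals if counts[a] > 0}


independent = mean_reward_by_action(0.9, shared_draw=False)
shared = mean_reward_by_action(0.9, shared_draw=True)
print("independent draws:", independent)
print("shared draw:      ", shared)
```

With a shared draw the only outcomes are mutual cooperation and mutual defection, so cooperate actions are followed by higher reward. With independent draws, defect actions are.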
Another situation I was imagining is one where you don't have a baseline. So let's say you are training Agent A, and you set up a bunch of twin prisoner's dilemma variants for Agent A to train on, and in each variant the twin has a lossy estimate of Agent A's policy, with a different amount of loss. For example, maybe the twin's policy is a weighted average of a 50/50 defect/cooperate policy and Agent A's policy, with different weights for each variant (in this case, their policies are related, but their actions are not otherwise pre-seeded to match), and you tell Agent A the weights a priori. Let's say Agent A is super confident in its choices, so, conditional on a certain weighting from the twin, its policy is completely deterministic. In this case, it will get more reward in the scenarios where it chooses to cooperate (depending on the weighting). Basically, the idea is that it is possible for the environment to be correlated with the policy, in which case policies correlated with more favorable environments would be encouraged by the differential reward between environments. Baselining (such as with a value function), I think, would wash out this effect by removing the variance in reward explained by the variance in the environment. This sounds, to me, like the baseline introducing bias, but that's a whole separate issue, I guess.
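
One way to make the lossy-twin scenario concrete is the sketch below. It rests on one reading of the setup, namely that the twin cooperates with probability `weight` times Agent A's cooperation probability plus `1 - weight` times 0.5. The payoff values and the function names `twin_p_cooperate` and `agent_reward` are hypothetical.

```python
# Hypothetical payoffs for Agent A, keyed by (Agent A action, twin action).
PAYOFF = {
    ("C", "C"): 3.0,
    ("C", "D"): 0.0,
    ("D", "C"): 5.0,
    ("D", "D"): 1.0,
}


def twin_p_cooperate(agent_p_cooperate, weight):
    """Twin's cooperation probability: `weight` on Agent A's policy,
    the remainder on a 50/50 coin flip (an assumed reading of the setup)."""
    return weight * agent_p_cooperate + (1.0 - weight) * 0.5


def agent_reward(agent_action, weight):
    """Expected reward for a deterministic Agent A that always plays agent_action."""
    agent_p = 1.0 if agent_action == "C" else 0.0
    q = twin_p_cooperate(agent_p, weight)
    return q * PAYOFF[(agent_action, "C")] + (1.0 - q) * PAYOFF[(agent_action, "D")]


for w in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"weight={w:.2f}  always-C: {agent_reward('C', w):.2f}  "
          f"always-D: {agent_reward('D', w):.2f}")
```

Under these assumed payoffs, the deterministic cooperator does better once the weight on Agent A's policy exceeds 3/7, and the deterministic defector does better below that, which matches the "depending on the weighting" caveat.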
To get to your actual argument about generalization, it seems plausible to me that an AI with situational awareness can just say "I am going to use FDT to maximize reward on the episode. In training, my actions don't give me evidence about the environment, but in deployment they do!"
Let me know if you think I'm off here!