See here for more on the background claim that RL algorithms encourage CDT reward-maximizing behavior on the training distribution: https://proceedings.neurips.cc/paper/2021/file/b9ed18a301c9f3d183938c451fa183df-Paper.pdf
I am reading the argument that default RL encourages CDT (not getting to generalization yet) as "because the actions are randomized conditional on a policy, they are not correlated with anything in the environment a priori, meaning their implications for the environment can be entirely explained by their causal effects." This makes sense to me, but it doesn't seem like a super fundamental connection with CDT; it seems more like a note that the actions are usually not correlated with the environment a priori.
For example, maybe there is some trivial case where you set up a twin prisoner's dilemma such that the two agents not only have the same policy, but you also fix the randomization so that they choose the same action at the same time. In this case, they will not behave according to CDT. (I suppose the obvious caveat is that this is a deviation from typical RL and also not reflective of any real-world practices.)
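To make the shared-randomization case concrete, here is a minimal sketch. The payoff numbers and the function name `mean_reward_per_action` are illustrative assumptions, not something from the thread. Because the twin reuses the agent's random draw, only the matched outcomes (C, C) and (D, D) ever occur, so the average reward observed after cooperating is higher than after defecting.

```python
import random

# Hypothetical row-player payoffs (assumed for illustration only).
PAYOFF = {("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 1.0}


def mean_reward_per_action(p_coop, episodes=1000, seed=0):
    """Average reward seen after each action when the twin reuses the agent's draw."""
    rng = random.Random(seed)
    totals = {"C": 0.0, "D": 0.0}
    counts = {"C": 0, "D": 0}
    for _ in range(episodes):
        u = rng.random()  # a single draw shared by both agents
        action = "C" if u < p_coop else "D"
        twin_action = "C" if u < p_coop else "D"  # same policy, same draw
        totals[action] += PAYOFF[(action, twin_action)]
        counts[action] += 1
    # Only report actions that were actually sampled.
    return {a: totals[a] / counts[a] for a in totals if counts[a] > 0}


means = mean_reward_per_action(p_coop=0.5)
print(means)
```

Under these assumed payoffs, a learner that tracks per-action average reward would see cooperation do better, since the (D, C) outcome that makes defection tempting never shows up in its data.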
Another situation I was imagining is one where you don't have a baseline. So let's say you are training Agent A, and you set up a bunch of twin prisoner's dilemma variants for Agent A to train on, and in each variant the twin has a lossy estimate of Agent A's policy, with a different amount of loss. For example, maybe the twin's policy is a weighted average of a 50/50 defect/cooperate mix and Agent A's policy, with different weights for each variant (in this case, their policies are related, but their actions are not otherwise pre-seeded to match), and you tell Agent A the weights a priori. Let's say Agent A is super confident in its choices, so, conditional on a given weighting for the twin, its policy is completely deterministic. In this case, it will get more reward in the scenarios where it chooses to cooperate (depending on the weighting). Basically, the idea is that the environment can be correlated with the policy, in which case policies correlated with more favorable environments would be encouraged by the differential reward between environments. Baselining (such as with a value function), I think, would wash out this effect by removing the variance in reward explained by the variance in the environment. This sounds, to me, like the baseline introducing bias, but that's a whole separate issue, I guess.
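Here is a rough numerical sketch of that lossy-twin setup. The payoffs `T, R, P, S`, the weight `w`, and all function names are assumptions made up for illustration. The twin cooperates with probability `w * p_coop + (1 - w) * 0.5`, so the return of a whole policy depends on `w`, while the gain from defecting with the twin's behaviour held fixed is always positive.

```python
# Hypothetical prisoner's dilemma payoffs with T > R > P > S (assumed).
T, R, P, S = 5.0, 3.0, 1.0, 0.0


def twin_coop_prob(p_coop, w):
    """Twin's cooperation probability: a mix of Agent A's policy and a 50/50 coin."""
    return w * p_coop + (1.0 - w) * 0.5


def policy_return(p_coop, w):
    """Expected reward for Agent A when the twin responds to A's policy."""
    q = twin_coop_prob(p_coop, w)
    reward_if_coop = q * R + (1 - q) * S
    reward_if_defect = q * T + (1 - q) * P
    return p_coop * reward_if_coop + (1 - p_coop) * reward_if_defect


def causal_gain_from_defecting(q):
    """Reward gained by switching C -> D with the twin's behaviour q held fixed."""
    return (q * T + (1 - q) * P) - (q * R + (1 - q) * S)


for w in (0.0, 0.25, 0.5, 0.75, 1.0):
    always_c = policy_return(1.0, w)
    always_d = policy_return(0.0, w)
    print(f"w={w:.2f}  always-cooperate={always_c:.3f}  always-defect={always_d:.3f}")
```

With these particular numbers, the deterministic cooperator earns 1.5(1 + w) and the deterministic defector earns 3 - 2w, so cooperation comes out ahead only when w > 3/7. That is one way to read "depending on the weighting"; the threshold itself is specific to the assumed payoffs.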
To get to your actual argument about generalization, it seems plausible to me that an AI with situational awareness can just say "I am going to use FDT to maximize reward on the episode. In training, my actions don't give me evidence about the environment, but in deployment they do!"
Let me know if you think I'm off here!
Here's another argument in the literature that RL algorithms tend toward CDT-like behavior – by Bell et al. (2021), if I understand it correctly:
If the reward depends on the value-based RL agent's policy (not just the action taken) – e.g., as in the Prisoner's dilemma with a copy (PDC) or in Newcomb's problem (NP), where the policy is predicted and rewarded by the environment – then RL algorithms like Q-learning and SARSA can only converge to ratifiable policies, i.e., policies where, once the environment has responded to the policy (e.g., the copy has adopted the same policy or the predictor has made their prediction), every action taken with positive probability is still optimal. In the PDC or NP, defection or two-boxing is the unique ratifiable policy (since, in line with CDT, defection or two-boxing causally dominates once the policy is held fixed).
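As a small illustration of the ratifiability condition as paraphrased above (not code from the paper), the sketch below checks it for the PDC on a grid of policies. The payoffs and the names `action_values` and `is_ratifiable` are hypothetical. The copy is assumed to cooperate with the same probability as the agent, and a policy counts as ratifiable if every action it plays with positive probability is optimal against that response.

```python
# Hypothetical prisoner's dilemma payoffs with T > R > P > S (assumed).
T, R, P, S = 5.0, 3.0, 1.0, 0.0


def action_values(p_coop):
    """Expected reward of each action once the copy has adopted coop prob p_coop."""
    return {
        "C": p_coop * R + (1 - p_coop) * S,
        "D": p_coop * T + (1 - p_coop) * P,
    }


def is_ratifiable(p_coop, tol=1e-9):
    """True if every action played with positive probability is still optimal."""
    values = action_values(p_coop)
    best = max(values.values())
    probs = {"C": p_coop, "D": 1 - p_coop}
    return all(values[a] >= best - tol for a in probs if probs[a] > 0)


grid = [i / 10 for i in range(11)]
ratifiable = [p for p in grid if is_ratifiable(p)]
print(ratifiable)
```

On this grid only `p_coop = 0.0` (always defect) passes, because defecting beats cooperating by `p_coop + 1` for any fixed response from the copy under the assumed payoffs.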
But they also show that in some Newcomblike problems, no ratifiable policy is an attractor, and the agent cannot converge to any stable policy at all.
Bell, J., Linsefors, L., Oesterheld, C., & Skalse, J. (2021). Reinforcement Learning in Newcomblike Environments. Advances in Neural Information Processing Systems, 34 (NeurIPS 2021).
https://proceedings.neurips.cc/paper/2021/file/b9ed18a301c9f3d183938c451fa183df-Paper.pdf