Discussion about this post

Alex Mallen

See here for more on the background claim that RL algorithms encourage CDT reward-maximizing behavior on the training distribution: https://proceedings.neurips.cc/paper/2021/file/b9ed18a301c9f3d183938c451fa183df-Paper.pdf

Alec Harris

I am reading the argument that default RL encourages CDT (not getting to generalization yet) as "because the actions are randomized conditional on a policy, they are not correlated with anything in the environment a priori, meaning their implications for the environment can be entirely explained by their causal effects." This makes sense to me, but it doesn't seem like a super fundamental connection with CDT; it seems more like a note that the actions are usually not correlated with the environment a priori.

For example, maybe there is some trivial case where you set up a twin prisoner's dilemma such that the two agents not only have the same policy, but you also fix the randomization so that they always choose the same action at the same time. In this case, training will not push them to behave according to CDT. (I suppose the obvious caveat is that this is a deviation from typical RL and doesn't reflect any real-world practice.)
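To make the contrast concrete, here is a minimal sketch of that setup. It is an illustration under stated assumptions, not anything from the post or the linked paper: the payoff values are the textbook prisoner's dilemma numbers, the policy is a single-parameter sigmoid, and the update is the expected REINFORCE gradient computed exactly rather than sampled. The names `expected_reinforce_grad` and `train` are hypothetical. The only difference between the two runs is whether the twin's action is sampled independently or forced to match.

```python
import math

# Assumed payoffs for (my_action, twin_action); illustrative values only.
PAYOFF = {("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 1.0}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_reinforce_grad(theta, shared_randomness):
    """Exact expectation of reward * d/dtheta log pi(my_action).

    The twin uses the same policy, but its action is treated as part of
    the environment: no gradient flows through the twin's probabilities.
    """
    p = sigmoid(theta)                      # probability of cooperating
    prob = {"C": p, "D": 1.0 - p}
    score = {"C": 1.0 - p, "D": -p}         # d/dtheta log pi(a)
    grad = 0.0
    for a in "CD":
        for b in "CD":
            if shared_randomness:
                # Same seed: the twin's action always equals mine.
                joint = prob[a] if a == b else 0.0
            else:
                # Typical RL: the two actions are sampled independently.
                joint = prob[a] * prob[b]
            grad += joint * PAYOFF[(a, b)] * score[a]
    return grad


def train(shared_randomness, steps=2000, lr=0.5):
    """Gradient ascent from a 50/50 policy; returns final P(cooperate)."""
    theta = 0.0
    for _ in range(steps):
        theta += lr * expected_reinforce_grad(theta, shared_randomness)
    return sigmoid(theta)


if __name__ == "__main__":
    print("independent sampling:", train(shared_randomness=False))
    print("shared randomization:", train(shared_randomness=True))
```

Under these assumptions, the independently sampled run drifts toward defection and the shared-randomization run drifts toward cooperation, even though the policy and the update rule are identical in both.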

Another situation I was imagining is one where you don't have a baseline. So let's say you are training Agent A, and you set up a bunch of twin prisoner's dilemma variants for Agent A to train on, and in each variant the twin has a lossy estimate of Agent A's policy, with a different amount of loss per variant. For example, maybe the twin's policy is a weighted average of a 50/50 defect/cooperate policy and Agent A's policy, with a different weight for each variant (in this case, their policies are related, but their actions are not otherwise pre-seeded to match), and you tell Agent A the weight a priori. Let's say Agent A is super confident in its choices, so, conditional on a given weight, its policy is completely deterministic. In this case, it will get more reward in the scenarios where it chooses to cooperate (depending on the weighting). Basically, the idea is that the environment can be correlated with the policy, in which case policies correlated with more favorable environments would be encouraged by the differential reward between environments. Baselining (such as with a value function), I think, would wash out this effect by removing the variance in reward explained by the variance in the environment. This sounds, to me, like the baseline introducing bias, but that's a whole separate issue, I guess.
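As a rough check on the "depending on the weighting" part, here is a small sketch that only tabulates expected reward for a deterministic Agent A against the mixture twin. The payoffs are assumed textbook values, the mixture is read as `(1 - weight) * 0.5 + weight * agent_policy`, and the function names are hypothetical. It says nothing about what a particular RL algorithm, with or without a baseline, would actually learn.

```python
# Assumed payoffs for (agent_action, twin_action); illustrative values only.
PAYOFF = {("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 1.0}


def twin_coop_prob(weight, agent_coop_prob):
    """Twin's P(cooperate): a mix of a 50/50 policy and Agent A's policy."""
    return (1.0 - weight) * 0.5 + weight * agent_coop_prob


def expected_reward(weight, agent_coop_prob):
    """Agent A's expected reward when both actions are sampled independently."""
    p = agent_coop_prob
    q = twin_coop_prob(weight, p)
    return (
        p * q * PAYOFF[("C", "C")]
        + p * (1.0 - q) * PAYOFF[("C", "D")]
        + (1.0 - p) * q * PAYOFF[("D", "C")]
        + (1.0 - p) * (1.0 - q) * PAYOFF[("D", "D")]
    )


def better_deterministic_choice(weight):
    """Which deterministic policy earns more for a given mixture weight."""
    coop = expected_reward(weight, 1.0)
    defect = expected_reward(weight, 0.0)
    return "C" if coop > defect else "D"


if __name__ == "__main__":
    for w in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(w, expected_reward(w, 1.0), expected_reward(w, 0.0),
              better_deterministic_choice(w))
```

With these particular payoffs, always cooperating earns 1.5 + 1.5w and always defecting earns 3 - 2w, so cooperation comes out ahead once the weight exceeds 3/7; other payoff choices would move that threshold.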

To get to your actual argument about generalization, it seems plausible to me that an AI with situational awareness can just say "I am going to use FDT to maximize reward on the episode. In training, my actions don't give me evidence about the environment, but in deployment they do!"

Let me know if you think I'm off here!
