4 Comments
Johannes Treutlein

> One possible reason here is that being the kind of civilization that doesn’t uphold their promises would decorrelate us from civilisations that do uphold their promises, and generally deter such civilizations from wanting to trade with us (in the context of ECL).

I'm a bit confused by this. In general, it's not clear how one could "decorrelate" oneself from another civilization in one's reference class. Rather, your actions give you evidence about how other civilizations in your reference class are likely to behave. I.e., if you don't keep your promises to early AIs, that is an update that other civilizations in similar situations will also not keep their promises to early AIs. This in itself would only be a negative update if keeping promises across universes somehow made the early AIs more likely to engage in trade, which I'm not sure is true.

There is a general argument for cooperating with early AIs if you believe that in some universes, agents with different values are in power and the early AIs of those universes share your own values. But you would want to do that regardless of how the early AIs behave.

Johannes Treutlein

> We expect that many potential misaligned AIs will either have clearly self-regarding preferences or have significant uncertainty about what they value on reflection, which induces additional uncertainty about whether other AIs will have the same goals on reflection.

Small nitpick: I don't think having uncertainty about one's own values is an update towards being more uncertain about whether other AIs have the same values. An AI could also believe that misaligned AIs' values converge to similar values upon reflection (similarly to how some humans believe this).

Aansh Samyani

Hey, thanks for the awesome blog post; it was a great read. I had a couple of questions, though:

1) Can we, at any point in time, evaluate whether the next model will have the potential to automate AI R&D? If we cannot, it is quite possible for early schemers to deceive us and lie about their successors' potential threat, to get us to negotiate a deal that gives them more than they deserve and makes us halt progress unnecessarily.

To me it clearly isn't trivial to determine at what point we should start taking a model's negotiations seriously. Any thoughts on this?

2) Also, I am a bit skeptical about this:

“Two misaligned AIs should also be expected to be unlikely to have the same goals if their goals are determined by a random draw from a wide distribution of goals consistent with good performance.”

Is this obvious to you? If so, could you explain why?
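For what it's worth, one way to make the quoted claim concrete (an illustrative sketch of my own, not something from the post): if each misaligned AI's goal is an independent draw from the same discrete distribution over candidate goals, the chance that two AIs land on the same goal is the collision probability, the sum of the squared goal probabilities. For a uniform distribution over K goals this is 1/K, which shrinks as the distribution widens.

```python
from fractions import Fraction

def match_probability(weights):
    """Probability that two independent draws from the same
    discrete goal distribution land on the same goal:
    the collision probability, sum over i of p_i squared."""
    total = sum(weights)
    probs = [Fraction(w, total) for w in weights]
    return sum(p * p for p in probs)

# Uniform over K goals gives collision probability 1/K,
# so a wider goal distribution makes shared goals rarer.
print(match_probability([1] * 10))    # 1/10
print(match_probability([1] * 1000))  # 1/1000
```

Of course, whether the "random draw from a wide distribution" model actually describes how training selects goals is exactly what's in question.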

A. Rabbitude

Has anyone ever trained an LLM using a corpus of Buddhist material?