Notes on fatalities from AI takeover
Large fractions of people will die, but literal human extinction seems unlikely
Suppose misaligned AIs take over. What fraction of people will die? I'll discuss my thoughts on this question and my basic framework for thinking about it. These are some pretty low-effort notes, the topic is very speculative, and I don't get into all the specifics, so be warned.
I don't think moderate disagreements here are very action-guiding or cruxy on typical worldviews: it probably shouldn't alter your actions much if you end up thinking 25% of people die in expectation from misaligned AI takeover rather than 90% or end up thinking that misaligned AI takeover causing literal human extinction is 10% likely rather than 90% likely (or vice versa). (And the possibility that we're in a simulation poses a huge complication that I won't elaborate on here.) Note that even if misaligned AI takeover doesn't cause human extinction, it would still result in humans being disempowered and would likely result in the future being much less good (e.g. much worse things happen with the cosmic endowment). But regardless I thought it might be useful for me to quickly write something up on the topic.1
(By "takeover", I mean that misaligned AIs (as in, AIs which end up egregiously going against the intentions of their developer and which aren't controlled by some other human group) end up seizing effectively all power2 in a way which doesn't involve that power being acquired through payment/trade with humans where the deals are largely upheld on the side of the AI. I'm not including outcomes where an appreciable fraction of the power is held by digital post-humans, e.g., emulated minds.)
While I don't think this is very action-guiding or cruxy, I do find the view that "if misaligned AIs take over, it's overwhelmingly likely everyone dies" implausible. I think there are quite strong counterarguments to this and I find the responses I've heard to these counterarguments not very compelling.
My guess is that, conditional on AI takeover, around 50% of currently living people die3 in expectation and literal human extinction4 is around 25% likely.5
The basic situation is: if (misaligned) AIs aren't motivated to keep humans alive (as in, they don't care and there aren't external incentives) and these AIs end up with effectively all power, then all humans would die (probably as a byproduct of industrial expansion, but there are other potential causes). But even if AIs are misaligned enough to take over, they might still care a bit about keeping humans alive, either because they intrinsically care at least a bit or because other entities with power would compensate the AI for keeping humans alive (either aliens the AI encounters in this universe or acausal compensation). However, there are some other reasons why AIs might kill people: takeover strategies which involve killing large numbers of people might be more effective, and the AI might actively want to kill people or do something to people that is effectively killing them.
Ultimately, we have to get into the details of these considerations and make some guesses about how they compare.
There is a variety of prior discussion on this topic; see here, here, and here for some discussion that seems reasonable and relevant to me. This linked content is roughly as relevant as the notes here, and I'm not sure these notes add that much value over those links.
Now I'll get into the details. Expanding on the basic situation, humans might die due to takeover for three main reasons:
Takeover strategies which involve killing large numbers of people (or possibly everyone) are expedient (increasing the chance of successful takeover), or killing large numbers of people helps the AI retain power. Failed takeover attempts (or just attempts to acquire power) along the way might also kill many people.
The industrial expansion (done by these AIs or their successors) would kill humans by default (perhaps by boiling the oceans on earth, disassembling the biosphere for energy/material, killing humans to the extent they mildly get in the way, or disassembling earth in general), and the cost of keeping humans alive through this outweighs how motivated the AI (or AIs) are to keep humans alive.
The AI actively wants to kill people or do something to them that is effectively killing them (e.g., modifying them in some way which effectively makes them totally different without their informed consent).6 Misaligned AIs might end up with specific preferences about humans because humans were salient in their training. As I'll discuss below, I think this is unlikely to be a large factor, but it's non-negligible.
Industrial expansion and small motivations to avoid human fatalities
Reason (2), industrial expansion (combined with the AI maintaining full power), would cause human extinction in the absence of the AI being motivated to keep humans alive for whatever reason. However, the costs of keeping physical humans alive are extremely low: doing so probably only slows the industrial expansion by a small amount (and this slowdown is probably the dominant effect from a resource perspective), probably by less than 1 month and very likely by less than 1 year. Delaying this long costs a pretty tiny fraction of long-run cosmic resources, probably well less than one billionth of the resources that will ultimately be accessible to the AIs in aggregate.
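To make the "well less than one billionth" claim concrete, here is a minimal back-of-envelope sketch. The parameters are my own illustrative assumptions (accessible resources scaling roughly with the cube of elapsed time, i.e., a reachable volume expanding at around lightspeed, over a roughly 10-billion-year acquisition horizon), not figures from elsewhere in these notes.

```python
# Back-of-envelope: cost of a one-month delay as a fraction of long-run resources.
# Assumptions (mine, for illustration): resources ~ t**3 over a ~10-billion-year horizon.

HORIZON_YEARS = 10e9   # assumed total resource-acquisition horizon
delay_years = 1 / 12   # a one-month delay to keep humans alive

# For resources ~ t**3, a small delay d at the start costs roughly 3 * d / horizon.
fraction_lost = 3 * delay_years / HORIZON_YEARS
print(f"approx. fraction of long-run resources lost: {fraction_lost:.1e}")
# ~2.5e-11, i.e. well under one billionth, consistent with the claim above.
```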
Thus, extremely small amounts of motivation to keep humans alive could suffice for avoiding large fractions of humans dying due to industrial expansion (depending on the exact preferences of the AI). But, extremely small amounts of motivation to keep humans alive are unlikely to suffice for substantially reducing fatalities due to (1) and (3), though large amounts of motivation (e.g., it's one of the AI's top desires/priorities/aims) could suffice for greatly reducing expected fatalities from (1) or (3).
Note that an extremely small amount of motivation wouldn't necessarily stop the AI from (e.g.) boiling the oceans and destroying the biosphere while keeping humans alive in a shelter (or potentially scanning their brains and uploading them, especially if they would consent or would consent on reflection). Preserving earth (as in, not causing catastrophic environmental damage due to industrial expansion) is more expensive than keeping physical humans alive, which is in turn more expensive than keeping humans alive only as uploads. Preserving earth still seems like it would probably cost less than one billionth of resources by default. Keeping humans alive just as uploads might cost much less, e.g., more like one trillionth of resources (and less than this is plausible depending on the details of industrial expansion).
Extremely small amounts of motivation to keep humans alive wouldn't suffice for avoiding large fractions of humans dying due to industrial expansion if:
The AI is effectively "impatient" such that delaying for (e.g.) a month now costs much more than 1 part in a million from its perspective (see the short sketch after this list for how small a discount rate already suffices for this). There are probably several different preferences which could result in AIs effectively exhibiting high discount rates which aren't necessarily well described as impatience. It's worth noting that most versions of impatience make takeover less likely for two reasons: these AIs would care less about taking over (especially if they are impatient enough that much of the value has diminished by the time takeover finishes), and they are more likely to want to accept deals from humans.
Multiple different AIs are competing in a massive industrial race and are unable to effectively coordinate for some reason. And, the cost of keeping humans alive is sufficiently high that if any given AI tried to keep humans alive, doing so would cost it a substantial fraction of its resources.
The AI is irrational, makes a mistake, has (importantly) limited capacity to pursue small sources of value, or otherwise doesn't act like a reasonable agent given its motivations. While we might be able to argue against most types of obvious irrationality due to AIs being strongly selected for capabilities, keeping humans alive might be an extremely tiny priority, and doing random somewhat specific things which are extremely tiny priorities might not be the kind of thing that otherwise very smart agents necessarily do. (Due to transaction costs and other factors, humans and human organizations basically never do a good job fulfilling extremely tiny and very cheap priorities which differ from the other things they are doing, but AIs doing a massive industrial expansion might be better placed to do this.) Overall, it seems plausible but unlikely that AIs are well described as being very slightly motivated to keep humans alive (and they understand this), but never get around to actually doing so.
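To quantify the "impatience" condition in the first item of the list above, here is a small illustrative sketch; the exponential-discounting framing and the specific rates are my assumptions, not claims from the post.

```python
import math

def one_month_delay_cost(annual_discount_rate: float) -> float:
    """Fraction of total (discounted) value lost by delaying all payoffs by one month."""
    return 1 - math.exp(-annual_discount_rate / 12)

# Even a minuscule effective discount rate makes a one-month delay cost around
# one part in a million of value, which dwarfs the direct physical resource
# cost estimated earlier in this section.
for r in (1.2e-5, 1e-3, 0.05):
    print(f"annual discount rate {r:g}: one-month delay costs ~{one_month_delay_cost(r):.1e} of value")
```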
Why might we end up with small amounts of motivation to keep humans alive?
The AI itself just cares due to preferences it has (see discussion here).
Other entities that care trade with the AI (or with entities that the AI trades with) to keep humans alive. This includes acausal trade, ECL, simulations/anthropic capture (to be able to effectively acausally trade with decision-theory-naive AIs), and causal trade with aliens that the AI ends up encountering. It's unclear how simulations/anthropic capture works out; it seems plausible that this happens in a way which doesn't result in all relevant (acausal) trades happening. It also seems plausible that other entities (including humans in other Everett branches) pretty universally don't want to bail humans (in this branch) out because they have better things to spend their resources on, but even a tiny fraction of entities spending some non-trivial fraction of resources could suffice. This depends on there being some beings with power who care about things like human survival despite alignment difficulties, but this feels very likely to me (even if alignment is very hard, there may well be more competent aliens or AIs who care a small amount about this sort of thing). Note that this requires the AI to care about at least some of these mechanisms, which isn't obvious.
Note that these reasons could also result in large amounts of motivation to keep humans alive.
(I won't discuss these reasons in detail in my notes due to running out of time.)
Another factor is that AIs might not want to do a (fast) industrial expansion for whatever reason (e.g., they don't care about this, or they intrinsically dislike change). However, if there are multiple AI systems with some power and at least some subset want to do a fast industrial expansion, this would happen unless the AIs coordinate or other AIs actively fight the AIs doing the expansion. Even if AIs don't want to do a fast industrial expansion, humans might be killed as a side effect of their other activities, so I don't think this makes a huge difference to the bottom line.
My guess is that small amounts of motivation are substantially more likely than not (perhaps 85% likely?), but there are some reasons why small amounts of motivation don't suffice (given above) which are around 20% likely. This means that we overall end up with around a 70% chance that small motivations are there and suffice.
However, there is some chance (perhaps 30%) of large amounts of motivation to keep humans alive (>1/1000), and this would probably overwhelm the reasons why small amounts of motivation don't suffice. Moderate amounts of motivation could also suffice in some cases. I think this cuts the reasons why small amounts of motivation don't suffice by a bit, perhaps down to 15% rather than 20%, which makes a pretty small difference to the bottom line.
Then, the possibility that AIs don't want to do a fast industrial expansion increases the chance of humans remaining alive by a little bit more.
Thus, I overall think a high fraction of people dying due to industrial expansion is maybe around 25% likely and literal extinction due to industrial expansion is 15% likely. (Note that the 15% of worlds with extinction are included in the 25% of worlds where a high fraction of people die.) Perhaps around 30% of people die in expectation (15% from extinction, another 7% from non-extinction worlds where a high fraction die, and maybe another 7% or so from other worlds where only a smaller fraction die).
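As a sanity check on the bookkeeping in the last few paragraphs, here is a minimal sketch of how these rough numbers combine. The bucket probabilities are the guesses stated above; the average fatality fractions inside the non-extinction buckets are my own illustrative assumptions, chosen only to show that the "around 30% in expectation" figure is internally consistent.

```python
# Chance that at least a small motivation to keep humans alive exists and suffices,
# using the ~15% failure figure after the adjustment above.
p_small_motivation = 0.85
p_motivation_fails = 0.15
print(f"motivation present and suffices: ~{p_small_motivation * (1 - p_motivation_fails):.0%}")  # ~72%

# Expected fatalities from industrial expansion, split into the stated buckets.
p_extinction = 0.15     # literal extinction
p_high_fraction = 0.10  # high fraction die, but not everyone (25% - 15%)
p_other = 1 - p_extinction - p_high_fraction

avg_deaths_high = 0.70   # assumed average fatality fraction in "high fraction" worlds
avg_deaths_other = 0.09  # assumed average fatality fraction in the remaining worlds

expected = (p_extinction * 1.0
            + p_high_fraction * avg_deaths_high
            + p_other * avg_deaths_other)
print(f"expected fatalities from industrial expansion: ~{expected:.0%}")  # ~29%, i.e. roughly 30%
```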
How likely is it that AIs will have active motivations to kill (most/many) humans?
This could arise from:
Proxies from training / latching onto things which are salient (humans are highly salient in training)
Acausal motivations? (Or causal trade with aliens?) It's unclear why anyone would care about killing this powerless group of humans or doing specific things to this group which are as bad as death, but it is possible. One mechanism is using this as a threat (issued by this AI, or by some other system that pays this AI to execute it); this would likely result in worse-than-death outcomes.
Overall, active motivations to kill humans seem pretty unlikely, but not impossible. I think the "proxies from training" story is made less likely by the fact that, on reflection, AIs would probably endorse something less specific than caring in this way about the state of (most/many) currently living humans. Note that the proxies-from-training story could result in various types of somewhat net-negative universes from the longtermist perspective (though this is probably much less bad than optimized suffering / maximal s-risk).
I think this contributes a small fraction of fatalities. Perhaps this contributes an 8% chance of many fatalities and a 4% chance of extinction.
Death due to takeover itself
It's unclear what fraction of people die due to takeover because this is expedient for the AI, but it seems like it could be the majority of people and could also be almost no one. If AIs are less powerful, this is more likely (because AIs would have a harder time securing a very high chance of takeover without killing more humans).
Note that the AI's odds of retaining power after executing some takeover that doesn't result in full power might be slightly or significantly increased by exterminating most people (because the AI doesn't have total dominance and is at least slightly threatened by humans), and this could be another source of fatalities. (I'm including this in the "death due to takeover" category.)
Failed takeover attempts along the path to takeover could also kill many people. If there are a large number of different rogue AIs, it becomes more likely that one of them would benefit from massive fatalities (e.g., due to a pandemic), which makes such fatalities substantially more likely. Interventions which don't stop AI takeover in the long run could still reduce these fatalities.
It's plausible that killing fewer people is actively useful in some way to the AI (e.g., not killing people helps retain human allies for some transitional period).
Conditional on not seeing extinction due to industrial expansion, I'd guess this kills around 25% of people in expectation with a 5% chance of extinction.
Combining these numbers
We have a ~25% chance of extinction. In the ~75% of worlds without extinction, around 35% of people die due to all the factors given above. So, we have an additional ~25% expected fatalities, for a total of around 50% expected fatalities.
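Here is the same combination written out explicitly, as a sketch of the arithmetic above. Treating the three extinction contributions from earlier sections as roughly additive is my simplification (they mostly come from distinct scenarios); the input numbers are the rough guesses stated earlier.

```python
# Overall chance of extinction, treating the three sources above as (approximately)
# additive: industrial expansion, active motivations to kill humans, and takeover itself.
p_ext_industrial = 0.15
p_ext_active = 0.04
p_ext_takeover = 0.05
p_extinction = p_ext_industrial + p_ext_active + p_ext_takeover
print(f"overall extinction: ~{p_extinction:.0%}")  # ~24%, i.e. roughly 25%

# Expected fatalities: everyone dies in extinction worlds, and around 35% of people
# die across the remaining worlds.
frac_dead_non_extinction = 0.35
expected_fatalities = p_extinction + (1 - p_extinction) * frac_dead_non_extinction
print(f"total expected fatalities: ~{expected_fatalities:.0%}")  # ~51%, i.e. roughly 50%
```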
1. I originally wrote this post to articulate why I thought the chance of literal extinction and the number of expected fatalities were much higher than someone else thought, but it's also pretty relevant to ongoing discussion of the book "If Anyone Builds It, Everyone Dies".
2. As in, all power of earth-originating civilizations.
Within "death", I'm including outcomes like "some sort of modification happens to a person without their informed consent that they would consider similarly bad (or worse than death) or which effectively changes them into being an extremely different person, e.g., they are modified to have wildly different preferences such that they do very different things than they would otherwise do". This is to include (unlikely) outcomes like "the AI rewires everyone's brain to be highly approving of it" or similarly strange things that might happen if the AI(s) have strong preferences over the state of existing humans. Death also includes non-consensual uploads (digitizing and emulating someone's brain) insofar as the person wouldn't be fine with this on reflection (including if they aren't fine with it because they strongly dislike what happens to them after being uploaded). Consensual uploads, or uploads people are fine with on reflection, don't count as death.
4. Concretely, literally every human in this universe is dead (under the definition of dead included in the prior footnote, so consensual uploads don't count). And, this happens within 300 years of AI takeover and is caused by AI takeover. I'll put aside outcomes where the AI later ends up simulating causally separate humans or otherwise instantiating humans (or human-like beings) which aren't really downstream of currently living humans. I won't consider it extinction if humans decide (in an informed way) to cease while they could have persisted or decide to modify themselves into very inhuman beings (again with informed consent etc.).
5. As in, what would happen if we were in base reality or, if we were in a simulation, what would happen within this simulation if it were continued in a faithful way.
6. Outcomes for currently living humans which aren't death but which are similarly bad or worse than death are also possible.