Redwood Research blog
Blocking live failures with synchronous monitors
A common element in many AI control schemes is monitoring – using some model to review actions taken by an untrusted model in order to catch dangerous…
Mar 30 • James Lucassen and Adam Kaufman
Reward-seekers will probably behave according to causal decision theory
They'd renege on non-binding commitments, defect against copies of themselves in prisoner's dilemmas, etc.
Mar 28 • Alex Mallen
AI's capability improvements haven't come from it getting less affordable
AI inference is still cheap relative to human labor
Mar 27 • Anders Cairns Woodruff
Are AIs more likely to pursue on-episode or beyond-episode reward?
RL would encourage on-episode reward seeking, but beyond-episode reward seekers may learn to goal-guard.
Mar 12 • Anders Cairns Woodruff and Alex Mallen
The case for satiating cheaply-satisfied AI preferences
Some unintended preferences are cheap to satisfy, and failing to satisfy them needlessly turns a cooperative situation into an adversarial one.
Mar 10 • Alex Mallen
February 2026
Frontier AI companies probably can't leave the US
The executive branch can, and probably would, block a frontier AI company's departure.
Feb 26 • Anders Cairns Woodruff
Announcing ControlConf 2026
Berkeley, April 18-19
Feb 26 • Buck Shlegeris
Will reward-seekers respond to distant incentives?
Reward-seekers are supposed to be safer because they respond to incentives under developer control. But what if they also respond to incentives that…
Feb 16 • Alex Mallen
How do we (more) safely defer to AIs?
How can we make AIs aligned and well-elicited on extremely hard-to-check, open-ended tasks?
Feb 12 • Ryan Greenblatt and Julian Stastny
Distinguish between inference scaling and "larger tasks use more compute"
Most recent progress probably isn't from unsustainable inference scaling
Feb 11 • Ryan Greenblatt
January 2026
Fitness-Seekers: Generalizing the Reward-Seeking Threat Model
If you think reward-seekers are plausible, you should also think “fitness-seekers” are plausible. But their risks aren’t the same.
Jan 29 • Alex Mallen
The inaugural Redwood Research podcast
With Buck Shlegeris and Ryan Greenblatt
Jan 4 • Buck Shlegeris • 3:52:01