Redwood Research blog
February 2026
Will reward-seekers respond to distant incentives?
Reward-seekers are supposed to be safer because they respond to incentives under developer control. But what if they also respond to incentives that…
Feb 16 • Alex Mallen
How do we (more) safely defer to AIs?
How can we make AIs aligned and well-elicited on extremely hard-to-check, open-ended tasks?
Feb 12 • Ryan Greenblatt and Julian Stastny
Distinguish between inference scaling and "larger tasks use more compute"
Most recent progress probably isn't from unsustainable inference scaling
Feb 11 • Ryan Greenblatt
January 2026
Fitness-Seekers: Generalizing the Reward-Seeking Threat Model
If you think reward-seekers are plausible, you should also think “fitness-seekers” are plausible. But their risks aren’t the same.
Jan 29 • Alex Mallen
The inaugural Redwood Research podcast
With Buck Shlegeris and Ryan Greenblatt
Jan 4 • Buck Shlegeris • 3:52:01
Recent LLMs can do 2-hop and 3-hop latent (no CoT) reasoning on natural facts
Recent AIs are much better at chaining together knowledge in a single forward pass
Jan 1 • Ryan Greenblatt
December 2025
Measuring no CoT math time horizon (single forward pass)
Opus 4.5 has around a 3.5 minute 50%-reliability time horizon
Dec 26, 2025 • Ryan Greenblatt
Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance
AI can sometimes distribute cognition over many extra tokens
Dec 22, 2025 • Ryan Greenblatt
BashArena and Control Setting Design
We’ve just released BashArena, a new high-stakes control setting we think is a major improvement over the settings we’ve used in the past.
Dec 18, 2025 • Adam Kaufman and James Lucassen
The behavioral selection model for predicting AI motivations
The basic arguments about AI motivations in one causal graph
Dec 4, 2025 • Alex Mallen and Buck Shlegeris
November 2025
Will AI systems drift into misalignment?
A reason alignment could be hard
Nov 15, 2025 • Josh Clymer
What's up with Anthropic predicting AGI by early 2027?
I operationalize Anthropic's prediction of "powerful AI" and explain why I'm skeptical
Nov 3, 2025 • Ryan Greenblatt