Archive - Redwood Research blog

Will reward-seekers respond to distant incentives?

Reward-seekers are supposed to be safer because they respond to incentives under developer control. But what if they also respond to incentives that…

Feb 16 • Alex Mallen

How do we (more) safely defer to AIs?

How can we make AIs aligned and well-elicited on extremely hard-to-check, open-ended tasks?

Feb 12 • Ryan Greenblatt and Julian Stastny

Distinguish between inference scaling and "larger tasks use more compute"

Most recent progress probably isn't from unsustainable inference scaling

Feb 11 • Ryan Greenblatt

January 2026

Fitness-Seekers: Generalizing the Reward-Seeking Threat Model

If you think reward-seekers are plausible, you should also think “fitness-seekers” are plausible. But their risks aren’t the same.

Jan 29 • Alex Mallen

The inaugural Redwood Research podcast

With Buck Shlegeris and Ryan Greenblatt

Jan 4 • Buck Shlegeris

3:52:01

Recent LLMs can do 2-hop and 3-hop latent (no CoT) reasoning on natural facts

Recent AIs are much better at chaining together knowledge in a single forward pass

Jan 1 • Ryan Greenblatt

December 2025

Measuring no CoT math time horizon (single forward pass)

Opus 4.5 has around a 3.5 minute 50%-reliability time horizon

Dec 26, 2025 • Ryan Greenblatt

Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance

AI can sometimes distribute cognition over many extra tokens

Dec 22, 2025 • Ryan Greenblatt

BashArena and Control Setting Design

We’ve just released BashArena, a new high-stakes control setting we think is a major improvement over the settings we’ve used in the past.

Dec 18, 2025 • Adam Kaufman and James Lucassen

The behavioral selection model for predicting AI motivations

The basic arguments about AI motivations in one causal graph

Dec 4, 2025 • Alex Mallen and Buck Shlegeris

November 2025

Will AI systems drift into misalignment?

A reason alignment could be hard

Nov 15, 2025 • Josh Clymer

What's up with Anthropic predicting AGI by early 2027?

I operationalize Anthropic's prediction of "powerful AI" and explain why I'm skeptical

Nov 3, 2025 • Ryan Greenblatt
