Sitemap - 2025 - Redwood Research blog

Measuring no-CoT math time horizon (single forward pass)

Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance

BashArena and Control Setting Design

The behavioral selection model for predicting AI motivations

Will AI systems drift into misalignment?

What's up with Anthropic predicting AGI by early 2027?

Sonnet 4.5's eval gaming seriously undermines alignment evals

Should AI Developers Remove Discussion of AI Misalignment from AI Training Data?

Is 90% of code at Anthropic being written by AIs?

Reducing risk from scheming by studying trained-in scheming behavior

Iterated Development and Study of Schemers (IDSS)

The Thinking Machines Tinker API is good news for AI control and security

Plans A, B, C, and D for misalignment risk

Notes on fatalities from AI takeover

Focus transparency on risk reports, not safety cases

Prospects for studying actual schemers

What training data should developers filter to reduce risk from misaligned AI?

AIs will greatly change engineering in AI companies well before AGI

Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro

Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)

Notes on cooperating with unaligned AIs

Being honest with AIs

My AGI timeline updates from GPT-5 (and 2025 so far)

Four places where you can put LLM monitoring

Should we update against seeing relatively fast AI progress in 2025 and 2026?

Why it's hard to make settings for high-stakes control research

Recent Redwood Research project proposals

Reading List

What's worse, spies or schemers?

Ryan on the 80,000 Hours podcast

How much novel security-critical infrastructure do you need during the singularity?

Two proposed projects on abstract analogies for scheming

There are two fundamentally different constraints on schemers

Jankily controlling superintelligence

What does 10x-ing effective compute get you?

Comparing risk from internally-deployed AI to insider and outsider threats from humans

Making deals with early schemers

Prefix cache untrusted monitors: a method to apply after you catch your AI

AI safety techniques leveraging distillation

When does training a model change its goals?

The case for countermeasures to memetic spread of misaligned values

AIs at the current capability level may be important for future safety work

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

Training-time schemers vs behavioral schemers

What's going on with AI progress and trends? (As of 5/2025)

How can we solve diffuse threats like research sabotage with AI control?

7+ tractable directions in AI control

Clarifying AI R&D threat models

How training-gamers might function (and win)

Handling schemers if shutdown is not an option

Ctrl-Z: Controlling AI Agents via Resampling

To be legible, evidence of misalignment probably has to be behavioral

Why do misalignment risks increase as AIs get more capable?

An overview of areas of control work

An overview of control measures

Buck on the 80,000 Hours podcast

Notes on countermeasures for exploration hacking (aka sandbagging)

Notes on handling non-concentrated failures with AI control: high level methods and different regimes

Prioritizing threats for AI control

How might we safely pass the buck to AI?

Takeaways from sketching a control safety case

Planning for Extreme AI Risks

Ten people on the inside

When does capability elicitation bound risk?

How will we update about scheming?

Thoughts on the conservative assumptions in AI control

Extending control evaluations to non-scheming threats