Sitemap - 2025 - Redwood Research blog
Measuring no-CoT math time horizon (single forward pass)
Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance
BashArena and Control Setting Design
The behavioral selection model for predicting AI motivations
Will AI systems drift into misalignment?
What's up with Anthropic predicting AGI by early 2027?
Sonnet 4.5's eval gaming seriously undermines alignment evals
Should AI Developers Remove Discussion of AI Misalignment from AI Training Data?
Is 90% of code at Anthropic being written by AIs?
Reducing risk from scheming by studying trained-in scheming behavior
Iterated Development and Study of Schemers (IDSS)
The Thinking Machines Tinker API is good news for AI control and security
Plans A, B, C, and D for misalignment risk
Notes on fatalities from AI takeover
Focus transparency on risk reports, not safety cases
Prospects for studying actual schemers
What training data should developers filter to reduce risk from misaligned AI?
AIs will greatly change engineering in AI companies well before AGI
Notes on cooperating with unaligned AIs
My AGI timeline updates from GPT-5 (and 2025 so far)
Four places where you can put LLM monitoring
Should we update against seeing relatively fast AI progress in 2025 and 2026?
Why it's hard to make settings for high-stakes control research
Recent Redwood Research project proposals
What's worse, spies or schemers?
Ryan on the 80,000 Hours podcast
How much novel security-critical infrastructure do you need during the singularity?
Two proposed projects on abstract analogies for scheming
There are two fundamentally different constraints on schemers
Jankily controlling superintelligence
What does 10x-ing effective compute get you?
Comparing risk from internally-deployed AI to insider and outsider threats from humans
Making deals with early schemers
Prefix cache untrusted monitors: a method to apply after you catch your AI
AI safety techniques leveraging distillation
When does training a model change its goals?
The case for countermeasures to memetic spread of misaligned values
AIs at the current capability level may be important for future safety work
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
Training-time schemers vs behavioral schemers
What's going on with AI progress and trends? (As of 5/2025)
How can we solve diffuse threats like research sabotage with AI control?
7+ tractable directions in AI control
Clarifying AI R&D threat models
How training-gamers might function (and win)
Handling schemers if shutdown is not an option
Ctrl-Z: Controlling AI Agents via Resampling
To be legible, evidence of misalignment probably has to be behavioral
Why do misalignment risks increase as AIs get more capable?
An overview of areas of control work
An overview of control measures
Buck on the 80,000 Hours podcast
Notes on countermeasures for exploration hacking (aka sandbagging)
Prioritizing threats for AI control
How might we safely pass the buck to AI?
Takeaways from sketching a control safety case
When does capability elicitation bound risk?
How will we update about scheming?
