Redwood Research blog
Incriminating misaligned AI models via distillation
Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model.
May 18 • Alek Westover, Sebastian Prasanna, Alex Mallen, Alexa Pan, Julian Stastny, and Arun Jose
Risk reports need to address deployment-time spread of misalignment
Deployment-time spread is the most plausible near-term route to consistent adversarial misalignment
May 15 • Alex Mallen
How useful is the information you get from working inside an AI company?
My median guess: it's as good as a crystal ball that sees 2.5 months into the future.
May 11 • Buck Shlegeris and Anders Cairns Woodruff
A review of “Investigating the consequences of accidentally grading CoT during RL”
Last week, OpenAI staff shared an early draft of Investigating the consequences of accidentally grading CoT during RL with Redwood Research staff.
May 7 • Buck Shlegeris
Risk from fitness-seeking AIs: mechanisms and mitigations
Fitness-seeking is increasingly what misalignment looks like in practice—how should we respond?
May 1 • Alex Mallen
April 2026
Research Sabotage in ML Codebases
One of the main hopes for AI safety is using AIs to automate AI safety research. However, if models are misaligned, then they may sabotage the safety…
Apr 29 • Eric Gan
Recursive forecasting
Eliciting long-term forecasts from myopic fitness-seekers
Apr 28 • Arun Jose and Alex Mallen
Fail safe(r) at alignment by channeling reward-hacking into a "spillway" motivation
A controlled reward-seeking motivation could make AI safer and more useful
Apr 27 • Anders Cairns Woodruff and Alex Mallen
AI companies should publish security assessments
Third-party experts should assess defenses against tampering and theft — and publish high-level findings
Apr 27 • Ryan Greenblatt
A taxonomy of barriers to trading with early misaligned AIs
Deals with early misaligned AIs could reduce takeover risk and increase our chances of reaching a better future. When are such deals feasible?
Apr 21 • Alexa Pan
Introducing LinuxArena
A new control setting for more realistic software engineering deployments
Apr 20 • Tyler
Current AIs seem pretty misaligned to me
In my experience, AIs often oversell their work, downplay problems, and cheat
Apr 15 • Ryan Greenblatt