Discussion about this post

asher

"It is known that OpenAI models are sometimes trained via SFT on outputs from previous models, which may lead later models to inherit unmonitorable CoTs from prior ones (especially given that these prior models might have had effects on their chains of thought)"

I think SFT on CoT traces could be bad for monitorability regardless of whether the traces come from an unmonitorable model. For instance, SFT might impose a prior that the model's CoT should look nice, even if the model would otherwise learn to reason about misalignment in its CoT. I'm currently uncertain how much either this or the quoted point from the post matters in practice.

(E.g., it's possible that the alternative to the previous-model-CoT prior is the regular-model-output prior, which isn't much better for faithfulness. My current guess is that as long as the SFT happens before training and doesn't induce EM-like effects, it doesn't matter too much either way.)

(Sorry if this was already discussed somewhere, I skimmed)
