We should study methods to train away deeply ingrained behaviors in LLMs that are structurally similar to scheming.
Two proposed projects on abstract analogies…