Discussion about this post

User's avatar
H. Floyd's avatar

The same leak runs both ways. A retry protocol tells the agent its tool call was flagged. The agent learns something was flagged from the retry itself.

Low-information retry is safer for the same underlying reason: less feedback means less for the attacker to exploit. Full monitor rationale gives them a map. Less information starves the map, even if a bit still leaks through.

The aggregation choice extends it. Min across resamples fails because the model can submit one clean action and nine suspicious ones. The clean one clears the threshold. Max changes the game: the attacker cannot rely on any single sample passing, so they cannot clean one action to hide the rest.

The practical question for anyone using Claude Code or Codex today comes down to how much feedback your tool gives the agent after a flagged action. Every piece of feedback is a signal the attacker can use.

No posts

Ready for more?