3 Comments
Avi (Firecrystal Scribe):

I'm not sure I understand. Are you saying that the model somehow encodes problem-relevant information into the filler tokens? I could imagine it using some sort of binary encoding with alternating whitespace characters. But is there evidence of that, or is it simply that having filler tokens improves performance for some unknown reason?

Theo Reinsberg:

I think the filler tokens are deterministically generated and part of the prompt.

The way this works: because attention is causal, each filler token can attend to all earlier tokens, so the hidden states at the filler positions can perform computation over the prompt. The forward pass(es) that generate the final answer token(s) can then read off the results of that computation by attending to the K/V values at the filler positions.

Since tokens cannot attend to later tokens, filler tokens placed at the beginning of the prompt (before the question) would not improve performance.
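Here's a minimal sketch of that causal-mask argument (NumPy, a single attention head with random weights; the "question"/"filler"/"answer" labels are just illustrative, not anything from a real tokenizer):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size

# Toy sequence: 3 question tokens, 2 filler tokens, 1 answer position.
labels = ["q1", "q2", "q3", "fill", "fill", "ans"]
n = len(labels)
x = rng.normal(size=(n, d))          # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)

# Causal mask: position i may only attend to positions j <= i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

out = attn @ V  # each row mixes the V vectors of the positions it can see

# The filler positions (rows 3-4) place nonzero attention on the question
# tokens, so their hidden states can compute over the question; the answer
# position (row 5) in turn attends to the fillers' K/V. If the fillers were
# placed *before* the question, the mask would zero out their view of it.
print(np.round(attn, 2))
```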

your_friend:

Very interesting (and a bit concerning?)!

We tried DeepSeek v3 and didn't see a statistically significant gain from filler tokens, though we did not account for contamination of the test dataset.
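For reference, one way to check for a significant gain, assuming each prompt variant is scored correct/incorrect on a shared test set (the counts below are placeholders, not real results):

```python
from scipy.stats import fisher_exact

n_items = 500                      # hypothetical test-set size
correct_plain = 312                # hypothetical: prompts without fillers
correct_filler = 327               # hypothetical: prompts with fillers

# 2x2 contingency table: [correct, incorrect] per prompt variant.
table = [
    [correct_filler, n_items - correct_filler],
    [correct_plain, n_items - correct_plain],
]
stat, p = fisher_exact(table, alternative="greater")
print(f"one-sided p-value for 'fillers help': {p:.3f}")
```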