Discussion about this post

User's avatar
Avi (Firecrystal Scribe)'s avatar

I'm not sure I understand. Are you saying that the model somehow encodes problem relevant information into the filler tokens? I could imagine it using some sort of binary encoding with alternating white space characters. But is there evidence of that or is it just that somehow having filler tokens makes it better for an unknown reason?

your_friend's avatar

very interesting (and a bit concerning?)!

we have tried using DeepSeek v3, and it didn't show statistically significant gain from filler tokens, but we didn't care about contamination on the test dataset.

1 more comment...

No posts

Ready for more?