Discussion about this post

Nikola Jurkovic:

Very cool! I did a very shallow version of a similar analysis inside METR, with similar results, but never found the time to publish it. I'm glad I can link to yours.

William Cuozzo:

Great work! This nontrivially updated me toward faster timelines. That said, I think there might be a little more to the eval-overfitting argument.

Overfitting plausibly reduces costs disproportionately at longer time horizons. The cost of applying "memorized solutions" or, more realistically, "techniques that look similar to training data" may grow more slowly with task difficulty than the cost of more novel reasoning, so overfitting gives a bigger relative cost discount on harder tasks. This might be especially true if companies are specifically targeting the current eval frontier.
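
To make the shape of that argument concrete, here is a toy sketch with entirely made-up cost curves: if pattern-matched solutions scale sub-linearly with task horizon while novel reasoning scales super-linearly, the relative discount from overfitting grows with difficulty. The exponents, coefficients, and horizons below are assumptions for illustration only, not estimates from the post's data.

```python
# Toy illustration (all numbers invented) of the claim that overfitting gives a
# bigger *relative* cost discount on harder tasks: if the cost of reusing
# near-training-data techniques grows more slowly with task difficulty than the
# cost of genuinely novel reasoning, the gap between the two widens.

def cost_pattern_matched(horizon_hours: float) -> float:
    """Hypothetical cost when a task sits close to the training manifold."""
    return 2.0 * horizon_hours ** 0.5  # sub-linear growth (assumed)

def cost_novel_reasoning(horizon_hours: float) -> float:
    """Hypothetical cost when a task requires mostly novel reasoning."""
    return 2.0 * horizon_hours ** 1.5  # super-linear growth (assumed)

for h in [0.25, 1, 4, 16, 64]:  # task time horizons in human-hours
    ratio = cost_novel_reasoning(h) / cost_pattern_matched(h)
    print(f"{h:>6} h horizon: novel / pattern-matched cost ratio = {ratio:.1f}x")

# Under these made-up exponents the ratio scales linearly with the horizon, so
# measured cost ratios on benchmarks near the training distribution would look
# increasingly flattering at longer horizons.
```
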

This connects to a broader bearish argument: that current impressive performance is largely explained by how far the training manifold stretches, not how far models reason off of it. I personally come down closer to the bull side. Most economically and scientifically relevant tasks are probably close enough to some kind of training data, and humans don't generalize that far from their own "training manifold" either.

Various evals have tried to get at this (the different ARC-AGI versions come to mind), but the fact that they still eventually get hill-climbed by these "pattern-matching" machines is evidence that a lot of "off-distribution" tasks can be solved by rearranging on-distribution building blocks (especially with reasoning models). That said, I do think that if you put a lot of interleaved novel concepts into the context window, you can throw current systems off relative to how impressive they look working with concepts that appear a lot on the internet.

Where I think the bear case is stronger is software-only intelligence-explosion dynamics specifically. The tasks that matter for recursive AI R&D improvement might start close to some kind of training data but require increasingly novel reasoning as low-hanging fruit is picked and getting more AI-R&D-relevant data from experiments becomes compute-constrained. Maybe models then just fail rather than succeed expensively, and those failures wouldn't show up in cost-ratio data at all.
