Discussion about this post

Christian:

I do wonder why the SWE-Bench and METR benchmarks are taken as THE best indicators of progress. SWE-Bench is a very specific benchmark that captures only a small fraction of real-world software engineering, and METR themselves have published work showing that their benchmark captures only narrow algorithmic work, not software engineering holistically. Benchmarks tell a very limited story, so extrapolating predictions from them is a good example of Goodhart's law. Real-world impact on software engineering is, again, much more limited than a benchmark such as SWE-Bench suggests.

posters on my wall:

Thanks for the post! Can you please provide some commentary on why this may not be over-updating on the release of a single model?

For example, an alternative theory of why GPT-5 is not all that impressive is that the main motives behind this release were not to ship the most impressive model, but instead to:

* match competitors' performance while serving the smallest (expected) model possible

* draw attention to OA while lowering token prices to pressure Anthropic's inference profits while shifting to ads

* not alarm (or even anti-alarm) the public about AI progress while continuing to develop better models

