I do wonder why the SWE-Bench and METR benchmarks are taken as THE best indicators of progress. SWE-Bench is a very specific benchmark that captures only a small fraction of real-world software engineering. METR themselves have published work showing that the benchmark captures only narrow algorithmic work, not software engineering holistically. Benchmarks tell a very limited story, so extrapolating predictions from limited benchmarks is a good example of Goodhart's law. Real-world impact on software engineering is, again, much more limited than a benchmark such as SWE-Bench suggests.
Thanks for the post! Can you please provide some commentary on why this may not be over-updating on the release of a single model?
For example, an alternative theory of why GPT-5 is not all that impressive is that the main motives behind this release were not to release the most impressive model, but instead to:
* match competitors' performance while serving the smallest (expected) model possible
* draw attention to OA while lowering token prices to pressure Anthropic's inference profits as OA shifts to ads
* not alarm (or even anti-alarm) the public about AI progress while continuing to develop better models
This update isn't mostly from GPT-5; I just used the GPT-5 release as an impetus to write up my updates for the year. See https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=6ue8BPWrcoa2eGJdP
Maybe "my timeline updates from 2025 as of GPT-5" would have been more accurate.