3 Comments
posters on my wall

Thanks for the post! Can you provide some commentary on why this may not be over-updating on the release of a single model?

For example, an alternative theory for why GPT-5 is not all that impressive is that the main motives behind the release were not to ship the most impressive model possible, but instead to:

* match competitors' performance while serving the smallest (expected) model possible

* draw attention to OA while lowering token prices to pressure Anthropic's inference profits and shifting toward ads

* not alarm (or even anti-alarm) the public about AI progress while continuing to develop better models

Ryan Greenblatt

This update isn't mostly from GPT-5; I just used the GPT-5 release as an impetus to write up my updates for the year. See https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=6ue8BPWrcoa2eGJdP

Maybe "my timeline updates from 2025 as of GPT-5" would have been more accurate.

Christian

I do wonder why the SWE-Bench and METR benchmarks are taken as THE best indicators of progress. SWE-Bench is a very specific benchmark that captures only a small fraction of real-world software engineering. METR themselves have published work showing that the benchmark captures only very narrow algorithmic work, not software engineering holistically. Benchmarks tell a very limited story, so extrapolating predictions from limited benchmarks is a good example of Goodhart's law. Real-world impact on software engineering is, again, much more limited than a benchmark such as SWE-Bench suggests.
