Should we update against seeing relatively fast AI progress in 2025 and 2026?
Maybe we should (re)assess the case for relatively fast progress after the GPT-5 release.
Around the time of the early o3 announcement (and maybe somewhat before that?), I felt there were some reasonably compelling arguments for putting a decent amount of weight on relatively fast AI progress in 2025 (and maybe in 2026):
Maybe AI companies will be able to rapidly scale up RL further because RL compute is still pretty low (so there is a bunch of overhang here); this could cause fast progress if companies can do RL directly on useful tasks, or if RL transfers well even from more arbitrary tasks (e.g., competition programming).
Maybe OpenAI hasn't really tried hard to scale up RL on agentic software engineering and has instead focused on scaling up single-turn RL. So, when people (either OpenAI themselves or others like Anthropic) scale up RL on agentic software engineering, we might see rapid progress.
It seems plausible that larger pretraining runs are still pretty helpful, but that prior runs have gone wrong for somewhat random reasons. So, maybe we'll see some more successful large pretraining runs (with new, improved algorithms) in 2025.
I updated against this perspective somewhat because:
The releases of 3.7 Sonnet and 4 Opus were somewhat below expectations on this perspective. It looks like there wasn't some easy way to just do a bunch of RL on agentic software engineering (with reasoning?) that makes a massive difference and wasn't already in the process of being scaled up. Or, at least, Anthropic wasn't able to pull this off; it seems plausible that Anthropic is substantially worse at RL than OpenAI (at least at some aspects of RL, like effectively scaling up RL on more narrow tasks). Interestingly, reasoning doesn't seem to help Anthropic models on agentic software engineering tasks, but does help OpenAI models.
We haven't yet seen much better models due to more (or algorithmically improved) pretraining AFAICT.
We haven't seen OpenAI releases that perform substantially better than o3 at software engineering yet despite o3 being announced 7 months ago. (That said, o3 was actually released only 3 months ago.)
I updated towards thinking that the training of o3 (at least the final release version) was more focused on software engineering than I previously thought, and that the returns weren't that big. (This is based on rumors, on seeing that OpenAI was training on software engineering tasks here, and on OpenAI releases and communication like Codex.)
I updated a bit against this perspective due to xAI seemingly scaling things up a bunch, but I don't put as much weight on this because it seems pretty plausible they just did a bad job scaling things up. (E.g., maybe they didn't actually scale up RL to pretraining scale, or if they did, maybe this RL was mostly compute-inefficient RL on lower-quality environments. xAI might also just generally be algorithmically behind.)
GPT-5 is expected to be released in 0.5-3 weeks, and rumors indicate that it is substantially more focused on practical (agentic) software engineering. This is (arguably) the first major model release from OpenAI since o3, and it should resolve some of our uncertainties (particularly about whether there was/is a bunch of low-hanging fruit at OpenAI due to them not being very focused on software engineering).
My expectation is that GPT-5 will be a decent amount better than o3 at agentic software engineering (both on benchmarks and in practice), but won't be substantially above trend. In particular, my median is that it will have a 2.75 hour time horizon[1] on METR's evaluation suite[2]. This prediction was produced by extrapolating out the faster 2024-2025 agentic software engineering time horizon trend from o3 and expecting GPT-5 to land slightly below trend.[3]
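To make the arithmetic concrete, here's a minimal sketch of the extrapolation. The numbers are the ones given in footnote [3]; the function and constant names are just mine:

```python
# Back-of-envelope version of the extrapolation in footnote [3]. The numbers
# come from this post; the names are my own, not from METR's paper.

O3_HORIZON_HOURS = 1.5      # o3's time horizon on METR's evaluation suite
DOUBLING_TIME_MONTHS = 4.0  # faster 2024-2025 trend (figure 19 in METR's paper)
MONTHS_O3_TO_GPT5 = 4.0     # o3 release to expected GPT-5 release

def extrapolate_horizon(base_hours: float, months_elapsed: float) -> float:
    """Project a time horizon forward along an exponential doubling trend."""
    return base_hours * 2 ** (months_elapsed / DOUBLING_TIME_MONTHS)

on_trend = extrapolate_horizon(O3_HORIZON_HOURS, MONTHS_O3_TO_GPT5)
print(f"on-trend GPT-5 horizon: {on_trend:.2f} hours")  # -> 3.00 hours
# I then shade this down to a 2.75 hour median, expecting GPT-5 to land
# slightly below trend.
```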
If GPT-5 is actually a large (way above trend) jump in agentic software engineering with (e.g.) a >6 hour time horizon[4] (which seems plausible but unlikely to me), then we'll have seen relatively fast (and possibly very fast) software progress in 2025, and we'd naively expect this to continue.[5] If GPT-5 is below trend[6], then the case against expecting relatively faster AI progress in 2025/2026 from scaling up RL focused on agentic software engineering seems pretty strong.
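Relatedly, here's one hypothetical way to read whatever horizon we actually observe back against the trend, in terms of doublings above o3 and the implied doubling time (again, this framing and these names are mine, not anything from METR):

```python
# How many doublings above o3 does an observed horizon represent, and what
# doubling time would that imply, given ~4 months between o3 and GPT-5?

import math

O3_HORIZON_HOURS = 1.5
MONTHS_O3_TO_GPT5 = 4.0

def doublings_above_o3(observed_hours: float) -> float:
    """Number of doublings of o3's horizon that an observed horizon represents."""
    return math.log2(observed_hours / O3_HORIZON_HOURS)

def implied_doubling_time(observed_hours: float) -> float:
    """Doubling time (in months) implied by reaching observed_hours in 4 months."""
    return MONTHS_O3_TO_GPT5 / doublings_above_o3(observed_hours)

for horizon in (2.75, 3.0, 6.0):
    print(f"{horizon:.2f} h -> {doublings_above_o3(horizon):.2f} doublings, "
          f"implied ~{implied_doubling_time(horizon):.1f} month doubling time")
# 6 hours is exactly 2 doublings above o3's 1.5 hours (an implied ~2 month
# doubling time), which is where the ">6 hour" bar for "way above trend" comes from.
```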
Overall, I wonder if I have (thus far) insufficiently updated my overall timelines picture based on the observations we've had so far in 2025. I'm a bit worried that I'm still operating on cached beliefs when these observations should have pushed away a bunch of the shorter-timelines probability mass. Regardless, I think that the release of GPT-5 (or really, 2-6 weeks after the release, so that we have a better picture of GPT-5's capabilities) will be a good point to (re)assess and consider stronger updates.
[1] Edit: An earlier version of this post said "3.5 hours", but this was a mistake: I thought o3 had a 2 hour time horizon when it actually has a 1.5 hour time horizon. I also edited ">8" to ">6" at a later point in this post, since ">8 hours" was meant to refer to 2 doublings from o3, which is actually ">6 hours".
[2] I do worry that METR's evaluation suite will start being less meaningful and noisier for longer time horizons as the evaluation suite was built a while ago. We could instead look at 80% reliability time horizons if we have concerns about the harder/longer tasks.
[3] The faster 2024-2025 agentic software engineering time horizon trend (see figure 19 in METR's paper) has a 4 month doubling time. o3 was released 4 months before GPT-5 is expected to be released, and o3 has a 1.5 hour time horizon (edit: this used to say 2 hours, which was a mistake), so this yields a 3 hour time horizon for GPT-5. I think GPT-5 is more likely than not to be below trend (on at least METR's specific evaluation), so I round this down a bit to 2.75 hours, though I have a pretty wide confidence interval. I expect below trend rather than above trend due to some early reports about GPT-5, the trend being pretty fast, Opus 4 having lower-than-expected results, and thinking that the METR evaluation suite might have issues with longer time horizons that result in misleadingly lower numbers.
[4] Again, I'd want to look at multiple metrics. I'm referring to seeing agentic software engineering performance that looks analogous to a >6 hour time horizon on METR's evaluation suite when aggregating over multiple relevant metrics.
[5] It seems more likely to be a massive jump if OpenAI actually wasn't yet very focused on agentic software engineering when training o3, but is more focused on this now. This article claims that something like this is the case.
[6] It's harder to confidently notice that GPT-5 is below trend than to tell that it's way above trend. We should expect it to be some amount better than o3, and the difference between a 2 hour and a 3 hour time horizon is legitimately hard to measure.
> Interestingly, reasoning doesn't seem to help Anthropic models on agentic software engineering tasks, but does help OpenAI models.
Source?
Note also that Sama has said:
"we achieved gold medal level performance on the 2025 IMO competition with a general-purpose reasoning system! to emphasize, this is an LLM doing math and not a specific formal math system; it is part of our main push towards general intelligence...
we are releasing GPT-5 soon but want to set accurate expectations: this is an experimental model that incorporates new research techniques we will use in future models. we think you will love GPT-5, but we don't plan to release a model with IMO gold level of capability for many months."
This unfortunately implies that the GPT-5 release may be significantly behind what they have internally (though it's still an update if GPT-5 itself, the first model incorporating the new research techniques, is unexpectedly good or bad).