
expert reaction to OpenAI’s announcement of ChatGPT-5

Scientists comment on OpenAI’s announcement of the launch of GPT-5.

 

Dr Edoardo Ponti, Assistant Professor in Natural Language Processing, University of Edinburgh, said:

“Most advancements (e.g. in coding skills) are compelling but far from the dramatic leaps observed in some of the previous releases. The presentation was partly weakened by flaws in the result reports and a hallucinated demo. Moreover, it remained somewhat unclear where GPT-5 stands relative to models from OpenAI’s competitors. Particularly noteworthy are the “minimal effort” design, automatically switching between model variants and reasoning depths, as well as strong long-context abilities, which may reveal extensive usage of synthetic data during model training.”

 

Dr Junade Ali, Software Engineer and Computer Scientist, the Institution of Engineering and Technology, said:

“The most significant development is that OpenAI are looking to reduce the number of models available to just the new GPT-5 suite. This means those on premium subscriptions will not be expected to select a particular model for a particular purpose. In essence, GPT-5 will select the optimal model for them. 

“This moves OpenAI’s ChatGPT in line with the approach adopted in Google’s Gemini and Anthropic’s Claude. Of course, the real-world impact will need to be assessed, but this approach does have the potential to improve user experience and reduce energy waste from powerful models being used unnecessarily.
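OpenAI has not published how this automatic selection works; purely as an illustration, here is a minimal sketch of such a prompt router, with hypothetical model names and a toy difficulty heuristic standing in for whatever classifier the production system actually uses:

```python
# Illustrative sketch only: OpenAI has not published GPT-5's routing logic.
# The model names and the difficulty heuristic below are hypothetical.

FAST_MODEL = "gpt-5-mini"      # assumed cheap, low-latency variant
DEEP_MODEL = "gpt-5-thinking"  # assumed slower, reasoning-heavy variant

REASONING_CUES = ("prove", "step by step", "debug", "optimise", "why")

def estimate_difficulty(prompt: str) -> float:
    """Toy heuristic: longer prompts and reasoning cue-words score higher."""
    length_score = min(len(prompt) / 2000, 1.0)
    cue_score = sum(cue in prompt.lower() for cue in REASONING_CUES) / len(REASONING_CUES)
    return 0.5 * length_score + 0.5 * cue_score

def route(prompt: str, threshold: float = 0.2) -> str:
    """Send easy prompts to the fast variant, hard ones to the deep one."""
    return DEEP_MODEL if estimate_difficulty(prompt) >= threshold else FAST_MODEL

print(route("What is the capital of France?"))                  # -> gpt-5-mini
print(route("Prove step by step that sqrt(2) is irrational."))  # -> gpt-5-thinking
```

A router of this kind is also what underpins the energy argument: if the cheap variant handles most routine prompts, the expensive reasoning variant runs only when it is likely to be needed.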

“Nevertheless, the environmental cost of the energy usage of Large Language Models remains a key concern, as tech giants continue to look towards nuclear power for carbon-free electricity. This highlights the importance of society adopting environmentally friendly energy generation and storage technologies to address the demands of an ever more technological society.”

 

Prof Anthony Cohn, Professor of Automated Reasoning, University of Leeds and The Alan Turing Institute, said:

“We have waited a long time for the release of GPT-5. Was it worth the wait? It’s too early to say – access to the ChatGPT-5 interface is still rolling out, so I have not been able to test it personally yet. However, given all the hype in advance, I was underwhelmed by the livestream presentation on August 7th – GPT-5 does seem to have improved performance and better functionality than its predecessors, but it’s a long way from “AGI” (however that is defined). I was never expecting GPT-5 to be close to “AGI”: although I believe our brains are just (biological) machines and that machine intelligence is therefore possible in principle, I think it requires more than just LLM technology – for example, reliable symbolic reasoning will also be important, as will the ability to learn from the physical world – LLMs are not embodied, even if supplied with video data in their training.

“One important omission, as Altman admitted, is its inability to learn “on the go” – this has always been a fundamental problem with deep-learning-based systems, which require huge amounts of data and huge amounts of training, effectively precluding on-the-fly learning. The ability to generate the software for a complete website, as demonstrated in the livestream, is impressive if it works in general. The claimed improvements in safety and reliability are also welcome.

“But the well-known weakness of LLMs – that they “hallucinate” – has only been “mitigated”, according to the presentation, and it’s hard to imagine that this “hallucination” problem will ever be solved completely given the way that LLMs operate (statistically predicting future text based on their training data). I also noted that one of the presenters at one point said GPT-5 “seems to understand” – this is the core of the issue: do LLMs really understand, or are they, as Professor Emily Bender (a linguist at the University of Washington) once wrote, ultimately merely “stochastic parrots”?

“The livestream showed some leading-edge performance on some well-known benchmarks, but as François Chollet has pointed out, GPT-5 still trails the Grok-4 LLM by some margin on the ARC-AGI-2 benchmark he proposed as an AGI test – and Grok-4 only scores 15.9% on this very challenging test. A positive point is that GPT-5 pricing on its API is slightly cheaper than some of OpenAI’s other recent models, in particular o3.

“When DeepSeek’s R1 model came out earlier this year, much was made of the fact that it was allegedly very cheap to train; OpenAI did not say how much it cost to train GPT-5, but it’s a safe bet it was a lot. Whether the cost (and the carbon) was worth it remains to be seen. Meanwhile, early indications are that while GPT-5 is an improvement on earlier models, it is not a real step change – certainly no more than the difference between, say, GPT-3 and GPT-4. But since there is no real new data available for training, and existing data on the web is contaminated by LLM generations, this is probably not surprising, notwithstanding improved training and the advent of “reasoning” models.”
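Cohn’s parenthetical point about statistical prediction can be illustrated with a toy example: because an LLM samples each next token from a learned probability distribution, a fluent but false continuation can carry substantial probability mass. The numbers below are invented purely for illustration:

```python
# Toy illustration: an LLM samples each next token from a probability
# distribution learned from its training data, so a fluent but false
# continuation can be highly probable. The numbers here are invented.
import random

# Hypothetical next-token distribution after "The capital of Australia is"
next_token_probs = {
    "Canberra":  0.55,  # correct
    "Sydney":    0.40,  # fluent and plausible, but wrong: a "hallucination"
    "Melbourne": 0.05,
}

tokens = list(next_token_probs)
weights = list(next_token_probs.values())
print(random.choices(tokens, weights=weights)[0])  # wrong ~45% of the time here
```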

 

Dr Jeff Dalton, Chancellor’s Fellow in the School of Informatics, University of Edinburgh, said:

How big a step is GPT-5?

“OpenAI likes the student-level comparison: GPT-3 the high-schooler, GPT-4 the college student, GPT-5 the first-year PhD.  Analogies aside, day-to-day reliability appears to be improved.  Where GPT-4 makes mistakes on roughly one response in five, GPT-5 is more reliable and competent, with misses closer to one in twenty.  It feels more like a specialist you can keep in your pocket because it uses a unified “reasoning” model that chooses in real time whether a prompt needs a quick reply or deeper thought.

 

What changed in the training recipe?

“OpenAI still keeps the parameter count to itself, but it has sketched out the syllabus.  The model first grinds through broadly curated content, then tougher, highly curated datasets augmented with synthetic problem sets produced by earlier GPT-4 reasoning models, and finally graduates to real PhD-level questions written by domain experts.  That staged curriculum aims to build reasoning skills, not just imitate web content.

 

Where will people notice the upgrade?

“People will feel the improvement in how easy it is to make software.  GPT-5 enables almost anyone to sketch an idea and watch it transform into a functioning app.  Tell it, ‘I need a daily Spanish vocabulary quiz that emails my score,’ and it quietly handles the heavy lifting – planning the steps, writing and testing the code, picking colours and layouts – then hands back something that looks ready to use.  What used to take a development team weeks can now be done in an afternoon by a non-expert, opening the door for teachers, small-business owners and hobbyists to build their own software instead of needing a developer.
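As an illustration of this prompt-to-app workflow, the sketch below sends such a request to the model via the OpenAI Python SDK. The model name “gpt-5” and the single-shot prompt are assumptions; real agentic coding tools iterate on the output, run the tests, and repair failures rather than answering in one pass:

```python
# Minimal sketch of the "prompt to app" flow using the OpenAI Python SDK.
# The model name "gpt-5" is assumed; real agentic coding tools would
# iterate on the output, run the tests, and repair failures.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

spec = "I need a daily Spanish vocabulary quiz that emails my score."

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system",
         "content": "You are a software engineer. Plan the steps, then return "
                    "complete, runnable code for the app the user describes."},
        {"role": "user", "content": spec},
    ],
)

print(response.choices[0].message.content)  # the plan plus generated source code
```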

 

Does GPT-5 solve hallucinations or safety worries?

“Solve is too strong, but progress is real.  In everyday chat, the error rate falls from about twenty percent to roughly five percent.  In a tough healthcare test, errors on the hardest questions dropped from ten in a hundred to fewer than two.  The model also says ‘I’m not sure’ more readily when it hits its limits, exactly the behaviour you want.  We still have to tackle a lot of the safety issues around AI agents, and this doesn’t solve that.

 

Any remaining concerns?

“Part of the training data is synthetic, so GPT-5 can inherit biases and accuracy issues from earlier GPT-4 models.  Outside testing and auditing are still important.  And while it now feels like a junior researcher, it can’t set its own goals or design its own experiments.  GPT-5 continues the path forward, making AI more reliable and useful.  OpenAI just made this model available to all workers in the US government – we need something similar for the NHS and UK government workers or risk falling behind.


My take:

Subject-matter expert in your pocket

“OpenAI likes the classroom metaphor: GPT-3 was a bright high-school student, GPT-4 a solid college student, GPT-5 a first-year PhD.  Labels aside, day-to-day reliability is where people will feel the difference.  Clear factual slips drop from roughly one in five answers to about one in twenty.  Thanks to a new, simplified, unified reasoning architecture, it decides in real time whether your question needs a quick response or a deeper evaluation.  The result is that people get both speed and depth without fiddling with complex AI settings or guessing at which model to use.  It’s the closest yet to keeping a subject-matter expert – a lawyer, financial analyst, or doctor – in your pocket, though it is still a long way from true AGI.

 

A leap for software development

“GPT-5 is tuned not just to write code but to ship working software.  Ask for ‘a daily Spanish vocabulary quiz that emails my score’ and it will break the job into steps, write and test the back-end, choose colours and layouts, and hand back something that looks ready for users.  Tasks that filled a developer’s week with GPT-4 can now be sketched in an afternoon.  Think of it as a senior pair programmer that handles autonomy, collaboration, context, and testing in one go.

 

More factual, less agreeable

“OpenAI’s numbers show conversation-level hallucinations falling from over 20% to under 5%, and the model is about three times less likely to go along with things when a user is wrong.  On hard medical evaluations, errors on the toughest questions dropped from around ten in a hundred to fewer than two.  That’s not perfect, but a sizable step toward safe deployment in high-stakes fields.

 

How was it trained?

  • Curriculum learning on synthetic data – earlier “o-series” models generated graduated problem sets, and human PhDs added real, complex, expert-level tasks (see the sketch after this list).
  • Reinforcement learning for depth – the model is rewarded for solving hard multi-step problems, not for pattern-matching.
  • Meta-prompting and tool use – the model can help rewrite vague instructions before acting and hooks into external APIs with minimal wrapper code.
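OpenAI has not published the actual recipe, but the core idea behind the first bullet, curriculum learning, is simply presenting training examples in increasing order of difficulty, stage by stage. A toy sketch under that assumption:

```python
# Toy sketch of curriculum learning: present training examples in increasing
# order of difficulty, stage by stage. Purely illustrative; OpenAI has not
# published GPT-5's actual training recipe.

def train_step(model, example):
    """Placeholder for one gradient update on a single example."""
    ...

def curriculum_train(model, web_text, synthetic_problems, expert_questions):
    # Stage 1: broad curated content for language and general knowledge.
    # Stage 2: graduated synthetic problem sets from earlier reasoning models.
    # Stage 3: hard, expert-written PhD-level questions.
    for stage in (web_text, synthetic_problems, expert_questions):
        for example in sorted(stage, key=lambda ex: ex["difficulty"]):
            train_step(model, example)
```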

 

Bottom line

“GPT-5 is a significant incremental upgrade, smaller than the GPT-3 to GPT-4 leap, but big enough in accuracy, domain depth, and software-building skill to change everyday workflows.  You can now prototype an app, or check a niche technical point, with much more confidence that the answer will be right and provided in a form you can use.”

 

Dr Mike Cook, Senior Lecturer in Computer Science, King’s College London, said:

How significant is this announcement? Compared to GPT-4?

“It felt quite muted, if I’m being honest. OpenAI have a difficult job selling new models because there often aren’t new features or tangible things you can point to and say “this is new” – most of the time all they can say is that something is better, which is nice but not as exciting. However, even by those standards it felt a bit downbeat: a lot of the announcements were just general claims that it did things better, followed by a few specific examples. GPT-5 looks better than GPT-4 at some of the tasks it was good at, but the examples they gave were extremely narrow and I think show the company’s focus – mainly on programming.

 

What major improvement does GPT-5 have over GPT-4?

“One significant change is that GPT-5 will now try to partially answer questions that in the past it would have just rejected. For example, if we ask a question about a sensitive topic, in the past it might have simply refused to engage at all. Now it will try to answer as much as it can without violating its safety constraints. This feels like a huge vulnerability to me, but I assume OpenAI have tested it as thoroughly as the rest of their toolset.

“They spoke a lot about its ability to code better than previous models but in my opinion this is not that relevant for most users (except those working in the software industry). There are a lot of issues surrounding using AI as a programmer, especially if you don’t understand how to program yourself.

“There were also some minor improvements to other things – you can now connect your calendar to ChatGPT, the thought of which honestly terrifies me, but they very cheerily showed how they all use it. You can pick a colour scheme for your chats now. And you can change the personality of ChatGPT’s voice mode, for example to become a bit more sarcastic. This is very clearly a response to Elon Musk’s Grok, which has very strongly expressed personalities. They didn’t even demonstrate it, though, so it felt like more of a sideshow for them.

 

Does GPT-5 provide a “fix” to some of the issues with GPT-3 and GPT-4 (e.g. hallucinations)?

“It’s impossible to fix hallucinations for good, but OpenAI do claim that GPT-5 has reduced hallucinations. However, we know from experience that as a model is trained and upgraded this tendency to hallucinate can change significantly, so it remains to be seen how consistent this is.

 

Are there any other potential issues to consider? (e.g. how much closer does it take us to AGI? Are there issues with the training data – could some of it be artificially generated?)

“My professional opinion is that AGI as we understand it is not possible. However, even using OpenAI’s own definitions of AGI, I don’t think this particularly takes us closer. It appears to perform better than the previous models – but a big issue throughout the entire presentation was that there was very little evidence or assessment of what it can do. There were a lot of people saying that it was good at this, or helped with that, but in my view there was no actual evidence of whether it was performing correctly or accurately (with the exception of a few graphs showing benchmarks). This was a presentation based mostly on vibes in my view.

 

Are there more significant energy costs with larger models?

“Energy costs will always be higher for bigger models, and there’s no exception here – on top of that, they’re integrating it into more and more apps and software, and encouraging people to use it more, so it’s not just the individual cost of one user but also the fact that they’re ramping up the number of users everywhere. Incredibly, in one of their demos they showed a coding problem and recommended opening “several” tabs with the same prompt, having GPT-5 solve the problem in multiple ways and picking the best one afterwards. This kind of parallel processing requires correspondingly more energy.
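The “several tabs” trick Cook describes amounts to best-of-n sampling: the same prompt is answered n times and the best candidate kept, at roughly n times the compute. A minimal sketch, assuming the model name “gpt-5” and a hypothetical scoring function (e.g. running generated code against unit tests):

```python
# Sketch of the "several tabs" trick as best-of-n sampling: the same prompt is
# answered n times and the best candidate kept. The model name and the scoring
# function are assumptions; n samples cost roughly n times the energy.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def solve(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-5",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def score(solution: str) -> float:
    """Hypothetical judge: e.g. run generated code against unit tests."""
    raise NotImplementedError

def best_of_n(prompt: str, n: int = 4) -> str:
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(solve, [prompt] * n))
    return max(candidates, key=score)
```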

 

Any other comments/considerations?

“OpenAI feel like they are being led by their users now – they know that the big use-cases for ChatGPT are coding and schoolwork, and so they made sure to put those front and centre in the presentation. But there wasn’t any significant evidence of what was changing – it was just vibes. I suspect it’s because they really just need to reassure the people who already use ChatGPT in this way, but I nevertheless found it very odd to watch.”

 

 

 

Declared interests

Dr Edoardo Ponti: None

Dr Junade Ali: None

Prof Anthony Cohn: None

Dr Jeff Dalton: “Past funding from Amazon, Google, Bloomberg, and Apple research awards.

Past employment at Twitter and Google as a software developer.

Recent past employment at Bloomberg, London, as a consultant/contractor.

Current employment: Head of AI and Chief Scientist at Valence, an enterprise AI company focused on AI coaching chatbots. https://www.valence.co/authors/jeff-dalton”

Dr Mike Cook: None
