A study published in Science evaluates the performance of large language models (LLMs) on the reasoning tasks of a physician.
Prof Gustavo Carneiro, Professor of AI and Machine Learning, University of Surrey, said:
“The first sentence of the press release is too optimistic. It says that the method “outperformed human doctors in emergency room decisions,” but the task was blinded second-opinion differential diagnosis generation, not real-time decision-making or patient management.
“The paper is of excellent quality, clearly showing that modern LLMs can excel in traditional benchmarks of text-based clinical reasoning, outperforming clinicians in constrained reasoning tasks. This conclusion needs to be taken carefully because the paper does not claim that these LLMs are clinically competent or safe in real health care settings. The authors and press release are careful on this point.
“This paper demonstrates that LLMs can rival clinicians on certain real-world clinical reasoning tasks. Related studies have reported similar findings, though with weaker results; this paper appears to be the first to show such performance convincingly.
“Regarding confounders, one important issue, which the paper addresses, is model contamination. This refers to the possibility that the LLM was trained on data that also appears in the evaluation set, which is also called data leakage. Since it’s difficult to guarantee that this never occurred, the authors tested for it indirectly by comparing model performance on examples from before and after the pretraining cutoff date (page 1, first paragraph of the Results section). They found no statistically significant difference between the two. This suggests that contamination is unlikely, although it cannot be completely ruled out. The paper is cautious about this, but I believe the only way around it is with a prospective study.
“It is important to say that the benchmark results can measure reasoning quality but not system safety. The paper is very careful about this.
“It is also important to say that AI is not ready to replace doctors in the emergency department. This paper shows that AI may outperform humans on a narrow set of tasks (generating differential diagnoses and suggesting next diagnostic steps from text), but not on the broader tasks of emergency care, which include physical examination, real-time judgment under uncertainty, team coordination, and responsibility for patient outcomes.
“On self-diagnosis: AI can support medical reasoning, but only within clinical systems that provide human oversight, safeguards, and accountability. Outside such settings, I think it is premature to use it.”
Dr Joseph Alderman, NIHR clinical lecturer, AI researcher and NHS anaesthetist at the University of Birmingham, said:
“This study by Brodeur et al is the latest to show that large language models (LLMs) can perform well at medical tasks. The researchers tested the tool against a demanding and varied set of cases, including NEJM Clinical Pathological Conferences, which are amongst the most challenging diagnostic puzzles in medicine, as well as real emergency department cases from a major academic hospital in the USA. Whilst research like this shows that LLMs can make accurate diagnoses and treatment plans based on written information, this is only a small part of the job most doctors do. Doctors in the emergency department need to provide comfort and reassurance to patients and their loved ones on what may be the worst day of their lives. They need to listen carefully to what patients tell them, make a clinical examination, and suggest investigations and tests. The right course of treatment for one patient may be inappropriate for another, even if the medical facts in each case are very similar. Being a good doctor requires judgement, compassion and experience, as well as raw medical knowledge.
“Increasingly, members of the public are turning to online AI chatbots to ask questions about their health. This could open new opportunities, enabling patients to understand any conditions they live with and make choices to improve their health. On the other hand, these systems are not perfect. They can be inaccurate and unreliable, and could give advice which is unhelpful or harmful. It is important we all think carefully about these risks, and double check important decisions with trained healthcare professionals.”
Professor Ewen Harrison, Professor of Surgery and Data Science and Co-Director of the Centre for Medical Informatics, University of Edinburgh, said:
“This is an important study showing that modern AI systems can be good at one of the central tasks of doctors and nurses: taking the information available about a patient and suggesting which diagnoses should be considered.
“This matters – these systems are no longer just passing medical exams or solving artificial test cases. They are starting to look like useful second-opinion tools for clinicians, particularly when it is important to consider a wider range of possible diagnoses and avoid missing something important.
“But this does not mean AI should be quickly ushered into clinical care without limits. Producing a good list of possible diagnoses is not the same as improving patient care. We still need studies showing that these tools help doctors and nurses make better decisions, reduce harm, avoid unnecessary tests, and work safely in busy hospitals and GP practices.
“This study moves the field forward, but it does not by itself change clinical practice. The responsible route is not to ban these systems, but also not to let them drift into casual use. They should be tested in real clinical settings, used as second-opinion tools rather than replacements for clinicians, and monitored against the outcomes that actually matter to patients: better, safer, quicker care.”
Dr Wei Xing, Assistant Professor in the University of Sheffield’s School of Mathematical and Physical Sciences, said:
“This is one of the largest evaluations of LLMs in clinical reasoning to date, and the inclusion of real emergency department data is a genuine step forward. Two findings in the paper, however, deserve more scrutiny than they received. In one management reasoning experiment, physicians using GPT-4 scored 41%, no better than GPT-4 alone at 42% and well above physicians without AI at 34%, suggesting that doctors may unconsciously defer to the AI’s answer rather than thinking independently. This tendency could grow more significant as AI becomes more routinely used in clinical settings.
“The real-world data from 76 patients at a single elite academic centre tells a more nuanced story than the headline implies: o1 identified the correct diagnosis in 67% of triage cases against 55% and 50% for the two attending physicians, a genuine gap, but one with no accompanying analysis of where or for whom the model fails. Whether errors concentrate among elderly patients, non-English speakers, or those with atypical presentations remains entirely unknown, and without that analysis a strong average accuracy offers limited reassurance. What this study demonstrates is that an LLM can outperform physicians on structured, text-based reasoning tasks under controlled conditions. It does not demonstrate that AI is safe for routine clinical use, nor that the public should turn to freely available AI tools as a substitute for medical advice.”
Prof Aldo Faisal, Professor of AI & Neuroscience, Imperial College London, said:
Is this good quality research?
“Methodologically, from the evaluation perspective, this is exemplary: many physician baselines, blinded comparisons on real ER cases, validated rubrics. This is how clinical AI evaluation should be done.”
What are the implications? Is there overspeculation?
“A model that performs well on Boston case vignettes tells you little about a 78-year-old in a London emergency department with a head injury. That’s why we need sovereign, open health foundation models trained on UK and European health data. The UK and Europe cannot safely deploy clinical AI for their own patients using only closed commercial American models. That is why we are building Nightingale AI.”
What does this paper actually show us?
“The question is no longer whether these systems can reason about a vignette, but whether they can reason about a patient and their multimodal data, not just text, which is precisely the gap Nightingale AI is built to close.”
How does this fit with existing evidence?
“The trajectory is unambiguous — each generation of frontier models outperforms the last. The question now is whether we’ve saturated these benchmarks – I think we have. The frontier has moved from ‘can the model get the diagnosis’ to ‘can it help a clinician make a better decision in a real workflow.’”
Have the authors accounted for limitations?
“Three limitations matter. It’s text only — no imaging, no ECG, no patient in front of you. The cases were curated for teaching; real data are messy and contain many modalities. And the model is a closed US commercial system whose training data is a trade secret — we cannot fully audit what we cannot see inside.”
Are there risks of hallucinations and over-reliance?
“Both risks are real and this paper doesn’t address them. LLMs still confabulate confidently, and the more fluent the output, the more dangerous a wrong answer becomes.
“They used a US closed commercial model – we cannot fully audit what we cannot see inside.
“The answer is open, inspectable models with proper monitoring — which is precisely what Nightingale AI is being built to provide.”
Is AI ready to overtake doctors in the emergency room?
“No. Emergency medicine isn’t a diagnostic puzzle on text-based patient descriptions — it’s triage, resuscitation, judgement under uncertainty, talking to frightened families. Testing on scores of text vignettes doesn’t measure any of that. An AI second opinion at triage could be valuable, but only after prospective trials show real benefit. We are not there yet.”
What is your message for the public who might want to diagnose themselves with public/consumer AI?
“Don’t. A consumer chatbot is not a medical device. It has no regulatory status and no liability when it’s wrong. Use these tools to prepare better questions for your doctor — not to replace one. … The gap between a paper benchmark and real-world medicine is enormous.”
‘Performance of a large language model on the reasoning tasks of a physician’ by Peter G. Brodeur et al. was published in Science at 18:00 UK time on Thursday 30 April 2026.
DOI: 10.1126/science.adz4433
Declared interests
Prof Aldo Faisal: “Note I lead Nightingale-AI, a European/UK academic, open and sovereign health foundation model (nightingale-ai.org).”
Professor Ewen Harrison: “The senior authors and I are editors at NEJM AI.”
Prof Gustavo Carneiro: “I don’t have any COIs.”
Dr Joseph Alderman: Dr Alderman is leading a team to build “The Health Chatbot Users’ Guide” – guidance for the public who may want to use AI chatbots to ask health questions. This is a project which is funded by a research grant. He declares that he is not in receipt of any industry funding or support for this project, or for any of his other work. https://healthchatbotguide.org/
For all other experts, no reply to our request for DOIs was received.