Scientists comment on two studies, published in Nature, describing medical AI models for patient management.
Comments from our friends at SMC Spain:
Ignacio Miranda Gómez, head of the Breast Imaging Unit at the International Breast Cancer Centre (IBCC) and at the Teknon Medical Centre in Barcelona, said:
“The latest advances in medical AI show that the most advanced systems can now achieve levels of performance comparable to, and even superior to, those of doctors in specific clinical tasks such as diagnosis, test selection, prescribing treatments and patient follow-up.
“Two recent studies, focusing on the AMIE and MIRA systems, represent a qualitative leap forward compared with previous generations of medical AI. Whilst AMIE stands out for its ability to conduct complex clinical conversations and manage patients across multiple visits, MIRA goes a step further by integrating with an electronic health record and carrying out clinical actions such as requesting diagnostic tests, prescribing medication or recommending hospital admissions.
“The results show that both systems were able to match or exceed the performance of doctors in simulated settings, particularly in areas such as adherence to clinical guidelines, the accuracy of recommendations and medication safety.
“However, the researchers themselves emphasise that these technologies are not yet ready for autonomous use in clinical practice. The studies were conducted in controlled environments with simulated patients, so their efficacy and safety still need to be demonstrated in real hospitals and clinics.
“Current evidence points towards a model of collaboration between healthcare professionals and AI, rather than the replacement of doctors. In this scenario, AI would take on analytical, administrative and decision-support tasks, whilst professionals would remain responsible for clinical supervision, communication with patients, managing uncertainty and final decisions regarding healthcare. These advances suggest that, in the coming years, artificial intelligence could become a key ally in improving the quality of care, reducing the administrative burden and facilitating more consistent, evidence-based care – always under human supervision.
“Neither AMIE nor MIRA is unique in its field: an advanced AI model capable of outperforming doctors at diagnosis in a controlled environment was recently presented in Science.
“The three studies represent three distinct generations of medical AI, and comparing them helps us understand the direction in which the field is heading. Whilst AMIE demonstrates that an AI can conduct a consultation like a doctor and the model featured in Science shows that it can reason like a doctor, MIRA aims to demonstrate that it can work like a doctor within a hospital. The most disruptive advance is not that MIRA diagnoses slightly better than other models, but that it translates that reasoning into structured clinical actions (ordering tests, prescribing treatment, scheduling procedures and admissions). Therefore, from the perspective of transforming the healthcare system, MIRA probably represents the closest step yet towards a true ‘clinical co-pilot’ integrated into hospital practice.
“The three studies convey a common message: artificial intelligence is achieving levels of performance comparable to, or even superior to, those of many professionals in specific diagnostic and decision-making tasks. However, the researchers emphasise that all the results come from controlled or simulated environments and that prospective studies involving real patients are still needed to confirm their safety, efficacy and impact on clinical outcomes.
“Far from suggesting the replacement of healthcare professionals, the authors believe that the most promising role for these technologies will be to support doctors. In this model, AI would take on repetitive, administrative or information-analysis tasks, whilst professionals would remain responsible for clinical supervision, final decision-making and the human relationship with patients.”
Alfonso Valencia, ICREA professor and Director of Life Sciences at the Barcelona Supercomputing Centre (BSC), said:
“These two independent studies present AI systems for the clinical management of patients. Both pieces of work represent significant technical advances, which must be interpreted in context, and do not represent systems implemented in real hospitals.
“MIRA is an autonomous agent operating in a simulated electronic health record environment, capable of conducting interviews, requesting diagnostic tests and proposing treatments. Evaluated across hundreds of real-world A&E cases, it matched or exceeded doctors’ performance in many, but not all, of the conditions assessed. The second system, AMIE, is a conversational system optimised for clinical reasoning across multiple visits. This system also proved to be as effective as a panel of primary care doctors, whilst being more closely aligned with clinical recommendations and guidelines – an orthodox approach that may or may not be beneficial in real-world settings where adaptation to specific cases and flexibility are so important.
“These developments can be seen as a technical advance with the potential to improve hospital processes, but they are not yet systems deployed in the real world.
“From a technical perspective, given the complexity of such systems, we must wait for other researchers to use them, beyond the evaluation carried out prior to publication, before we can be certain of the validity of the results. One concern, for example, is potential contamination between training and application data, a typical and serious problem in systems that use such massive amounts of data that it is very difficult to assess their quality and origin. In this regard, it is essential that the systems are open-source (i.e. that they can be used by others). Whilst MIRA is open-source, AMIE is not, which makes it impossible to evaluate independently and, therefore, it is not something we can ultimately rely on.
“In any case, it is important to emphasise that we are operating in the realm of development and not yet in that of implementation within complex, regulated systems such as hospitals. In this respect, the limitations are substantial. These systems are not yet ready to interact with the complexity of real patients, doctors and systems, including the many interactions that are not purely text-based, and which are decisive in real-world practice.
“In summary, these are relevant scientific publications that make it clear that AI applications in medical decision-making environments are advancing at great speed, largely driven by large companies, but not exclusively so (fortunately). Before they can be implemented in real-world systems, prospective studies involving real patients and ethical oversight are still required, following the standard – and legally mandated – process for any application in medicine.”
Dr Dominic Oliver, Postdoctoral Researcher, Department of Psychiatry, University of Oxford, said:
“Advancements in AI are increasingly providing opportunities to support healthcare but there is limited evidence that this can be done safely and effectively to benefit patients. These two studies showcase AI agents that aim to help clinicians at multiple stages across a patient’s journey through the healthcare system.
“Both studies provide rigorous but preliminary evidence that these tools could be useful to aid healthcare management. Both AI agents showed similar or better performance for diagnosing and managing clinical cases compared to doctors, particularly in adhering to clinical guidelines.
“There are a few limitations to this work that suggest that these tools are not yet ready to be used in the clinic.
“Healthcare services are stretched and these studies represent a promising and safe first step to show that AI tools like AMIE and MIRA may eventually help clinicians deliver care more efficiently and consistently. Due to the pace of academic publishing, the AI field has continued to advance rapidly since this work was performed, roughly two years ago. This means that versions of AMIE and MIRA running today with newer models may be even more capable than what is seen in these studies. However, before AI agents can be used routinely, further research is needed to demonstrate that they are safe, effective and fair when used with real patients in real clinical settings.”
Prof Julie Jacko, Chaired Professor of Health Informatics and Data Science, The University of Edinburgh, said:
AMIE (Liévin et al.)
“This is a carefully constructed study with some real methodological strengths, particularly the randomized, blinded OSCE design and the attempt to formalize ‘management reasoning’ using a structured rubric across multi‑visit cases. That gives more weight to the findings than many previous AI evaluations in this area.
“However, the evaluation is tightly linked to guideline concordance, and the system is explicitly designed to retrieve and structure its outputs around those guidelines. That makes the comparison to clinicians, who are not constrained to follow guidelines in the same way, somewhat asymmetric. In addition, many of the reported gains are in the precision and completeness of plans, rather than clear differences in clinical correctness.
“The press release reflects the results, but the ‘as good as physicians’ claim is best interpreted within this specific evaluation framework. Overall, this is a strong experimental study and a meaningful step forward, but it demonstrates performance against a structured standard rather than fully capturing the complexity of real clinical decision-making.”
MIRA (Ferber et al.)
“This is an ambitious and technically rigorous study that evaluates an AI agent across an unusually broad clinical workflow, and it stands out for forcing structured decisions within an electronic health record–like environment rather than relying on free‑text outputs. The matched comparison with clinicians and the range of metrics (diagnosis, test selection, procedures, prescribing and admission decisions) are clear strengths.
“That said, several of the key outcomes are defined relative to what was documented in the underlying dataset, meaning the model is being rewarded for reproducing recorded clinical behaviour rather than necessarily demonstrating optimal care. Related to that, some metrics favour more comprehensive or higher‑recall decisions, such as ordering more tests or identifying more procedures, which is not the same as better clinical judgement. The small physician comparator group also limits how firmly we can interpret performance differences.
“The press release accurately describes the findings, but the claim of physician‑level performance should be seen in light of these design choices. Overall, this is a high‑quality study that provides a compelling demonstration of feasibility, while leaving open important questions about how these systems would perform under broader, real‑world conditions.”
Prof Catherine Pope, Professor of Medical Sociology, University of Oxford, said:
“The Ferber et al. and Liévin et al. papers provide welcome evidence about potential clinical uses of large language models (LLMs). It is easy to be captivated by headline claims that these kinds of LLMs ‘outperform doctors’ but the devil, as always, is in the detail. Both studies are based on simulation – Ferber et al. on simulated chat created from patient notes, the second on exam formats that use actors to replicate medical scenarios for the purpose of training and assessing doctors. This is at some remove from the messy, complex, human world of everyday healthcare.
“Both studies demonstrate that LLMs can mimic some aspects of experienced physician performance, but crucially both concede that while there may be promise here, much more research is needed before these LLMs can, or should, be deployed in the real world. The point – made well – in the Ferber et al. piece is that use in the real world will need to be in partnership with clinicians: these technologies are unlikely to replace doctors, and many will contend that they crucially do not and cannot substitute for the essential human aspects of care.”
Dr Midhun Parakkal Unni, Academic Fellow in AI for Health, University of Sheffield, said:
“The authors developed conversational agents for disease management, evaluated their performance in simulated scenarios, and benchmarked them against clinicians under identical conditions.
“Generalising beyond the distribution on which they are trained is generally hard for machine learning systems, and it is not obvious how LLM-type foundational models behave in scenarios they haven’t seen before. This makes large-scale real-world testing an absolute necessity before claiming the usefulness of the LLM-integrated systems for clinical practice.
“That said, the papers are responsibly written and demonstrate an outstanding engineering achievement. Conclusions are backed by data, as long as we don’t extend them inadvertently to the real world. One of the main limitations of the papers is their reliance on a simulated LLM-based patient agent. Also, there is potential for LLMs to have seen papers published using the MIMIC-IV dataset (at least in the case of the MIRA agent) and therefore perform better. As many real-world cases are repeats of previous cases, this may not be a problem in practice. However, one must note that this is not always the case.
“In terms of the current state of the art, this is clearly a step beyond expert-level question-answering by LLMs and is the necessary step before one can have confidence to test it in the real world. These studies are of great significance for the development of engineering pipelines for future real-world evaluations. However, given the performance we see for current LLMs, the result is not unexpected. One has to wait and see what the real-world challenges would be when these are put into practice, as patients interact with an agent in a life-critical situation, since simulations may not capture the full breadth of human behaviour.”
Dr Wei Xing, Assistant Professor in the University of Sheffield’s School of Mathematical and Physical Sciences, said:
On the AMIE paper:
“This is a methodologically careful study. The design is randomised and blinded, and the statistical corrections for multiple comparisons are done properly. But this result needs context. This is the third major paper from this group on AMIE. The most recent prior study tested AMIE with real patients. In that study, doctors produced more practical and more cost-effective care plans than AMIE did. This new paper goes back to a fully simulated setting, and it does not address that earlier finding. Its strong results here should be read against that background. There is also a question about where AMIE’s advantage actually comes from. On one of the benchmarks in this paper, general-purpose AI models with no special clinical training scored similarly to AMIE. This suggests AMIE’s edge may reflect the rapid general progress of AI models, more than the specific system built around it. AMIE is tested on scripted patient actors, communicating only through text. The authors are clear that it is not ready for clinical use, and this setup is quite different from how doctors actually work with patients.”
On the MIRA paper:
“This study is also careful, and a strength compared to the AMIE paper is that it uses real historical patient records rather than scripted scenarios, with extensive additional safety checks. But the headline figure, that the AI beat doctors on diagnostic accuracy, is mostly driven by conditions with clear test results, like appendicitis and pancreatitis. For pneumonia and urinary tract infections, two of the most common reasons people go to emergency departments, both the AI and the doctors did worst, and the gap between them was smallest. The AI also ordered roughly twice as many blood tests as the doctors did. More information could itself explain higher accuracy, so this is not quite a level comparison. This is a retrospective simulation using old patient records. It did not involve real patients, real-time clinical settings, or interaction with practising doctors. It cannot tell us yet how this would perform in an actual hospital.”
‘Towards Autonomous Medical Artificial Intelligence Agents’ by Dyke Ferber et al. was published in Nature at 16:00 UK time on Wednesday the 17th of June 2026.
DOI: 10.1038/s41586-026-10675-5
‘Towards Conversational AI for Disease’ by Liévin, V et al. was published in Nature at 16:00 UK time on Wednesday the 17th of June 2026.
DOI: 10.1038/s41586-026-10764-5
Declared interests
Dr Dominic Oliver: “I have received consultancy fees from Google DeepMind outside the work discussed here. I am also the Principal Investigator of a Wellcome Trust-funded study investigating conversational AI agents in psychiatry, in which Google DeepMind is an industry partner.”
Prof Julie Jacko: “no conflicts of interest to declare”
Prof Catherine Pope: “I conduct research about organisation and delivery of healthcare, and am interested in digital health care/technologies. I also co-lead the MSc Applied Digital Health. Current funded projects include a project on AI scribes (ambient voice technologies – AVTs) in general practice consultations (led by Abi Eccles and John Powell) which will involve a number of different commercial providers of AVTs and a project looking at AI assisted ‘intelligent navigation’ for same day primary care access (NIHR503515) working in partnership with Visiba Group. Previously I have explored the deployment of digital triage systems in 999 and 111 NHS services.
“I am a trustee for the Foundation for Sociology of Health and Illness, Green Templeton College and the Society for Studies of Organising in healthcare.
“I am an NIHR Senior Investigator and chair of the NIHR Senior Investigator Award Committee. I have served on various other NIHR funding panels and review research proposals and final reports for these.
“I receive royalty payments from Wiley, Macmillan and McGrawHill (and ALCS who collect royalties on behalf of authors).”
Dr Midhun Parakkal Unni: “I have previously worked in the following companies: Tata Consultancy Services Limited (India), HCL Technologies Limited (India), Gaitq Limited (UK)”
Dr Wei Xing: “no interests to declare”