A study published in PLOS Digital Health looks at the performance of ChatGPT on the US Medical Licensing Exam (USMLE).
Prof Nello Cristianini, Professor of Artificial Intelligence at the University of Bath, said:
What was in the article.
“The article describes how ChatGPT was applied to generate answers to the USMLE, a three-part test. In the US, physicians with a Doctor of Medicine (MD) degree are required to pass the USMLE for medical licensure. The minimum passing accuracy is 60% (and the pass rate appears to be well above 90%: https://www.usmle.org/performance-data).”
“The software ChatGPT achieved an accuracy “close to” (that is, short of) the passing threshold in most settings, though it was within the passing range for some tasks (see Figure 2a in the paper).
“This does not remotely suggest that ChatGPT has knowledge comparable to a human’s, since the test may be a good predictor of performance ONLY for those who already have an MD and have completed a residency, that is, for a very pre-selected population. GPT would not be part of it.
“Care was taken to ensure that the test questions were not part of the training set. The assessment could have been improved by introducing more “blindness” for the adjudicators, for example by mixing GPT answers with human answers in an anonymised setting, but this does not seem to have been done.”
Why this may be interesting.
“On the one hand, we are in the presence of a statistical mechanism trained to generate text (new, but ‘similar’ to the text it was trained on) in the right context and manner, so we should not talk about understanding or related concepts. On the other hand, once this is refined to the point of actually passing an exam, we may want to reconsider how we assess new doctors. It could also be useful for training students.
“Still, it is part of an exciting series of new developments in AI.
“More importantly for me as a scientist: this approach can greatly help us develop better ways for researchers to process large amounts of literature. I can imagine tools such as this one summarising information and answering questions, rather than actually practising medicine.”
Dr Stuart Armstrong, Co-Founder and Chief Researcher at Aligned AI, said:
“This is an impressive performance, and we should expect to see more such successes in AI in the future. One caveat, though, is that the US Medical Licensing Exam is designed to be hard for humans, not for machines; there are many areas where humans are much more effective than AIs (such as moving about in cluttered spaces or interpreting social cues). This human superiority won’t last forever, though; one day, AIs will be better than us at almost every task.”
Prof Peter Bannister, Biomedical Engineer & Executive Chair, Institution of Engineering and Technology (IET), said:
“While ChatGPT continues to demonstrate an impressive ability to generate logical content in numerous settings, these results serve to highlight the limitations of written tests as the only way of assessing performance in complex and multi-disciplinary professions such as medicine. More generally this research underlines the need to base technology solutions on the full scope of the challenge, in this case providing comprehensive, in-person clinical care to patients from a wide range of populations.”
The following comments are provided by our colleagues at SMC Spain:
Prof Alfonso Valencia, ICREA professor and director of Life Sciences at the Barcelona National Supercomputing Centre (BSC), said:
“ChatGPT is a computational natural language processing system built by OpenAI on top of GPT-3.5 (Generative Pretrained Transformer). The model has been trained on large amounts of text to correlate words in context, using about 175 billion parameters. ChatGPT has been further refined to answer questions by stringing words together, following the internal correlation model.
“ChatGPT neither “reasons” nor “thinks”; it simply produces text based on a huge and very sophisticated probability model.
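To make the idea of a “probability model over text” concrete, here is a minimal sketch of next-word sampling. The word probabilities and names below are invented for illustration; a model such as GPT-3.5 computes these distributions with its roughly 175 billion parameters rather than from a lookup table.

```python
# Toy illustration (not OpenAI's actual code) of the core idea behind a
# language model: given the text so far, sample the next word from a
# probability distribution learned from training data.
import random

# Hypothetical learned conditional probabilities P(next word | context).
NEXT_WORD_PROBS = {
    ("the", "patient"): {"presents": 0.5, "was": 0.3, "reports": 0.2},
    ("patient", "presents"): {"with": 0.9, "to": 0.1},
}

def sample_next_word(context):
    """Sample the next word given the last two words of context."""
    probs = NEXT_WORD_PROBS.get(tuple(context[-2:]), {"...": 1.0})
    words = list(probs)
    weights = [probs[w] for w in words]
    return random.choices(words, weights=weights)[0]

def generate(prompt_words, length=3):
    """Repeatedly append sampled words: generation without 'reasoning'."""
    words = list(prompt_words)
    for _ in range(length):
        words.append(sample_next_word(words))
    return " ".join(words)

print(generate(["the", "patient"]))  # e.g. "the patient presents with ..."
```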
“The test has three levels, taken respectively by: a) second-year medical students who have done about 300 hours of study; b) fourth-year medical students with about two years of clinical rotations behind them; and c) students who have completed more than half a year of postgraduate education.
“The test included three types of questions adapted for submission to the ChatGPT system (a sketch of one possible prompt encoding follows the list):
– Open-ended questions, e.g. “In your opinion, what is the reason for the patient’s pupillary asymmetry?”
– Multiple-choice questions without justification. A typical case would be a question such as: “The patient’s condition is mostly caused by which of the following pathogens?”
– Multiple-choice questions with justification, such as: “Which of the following is the most likely reason for the patient’s nocturnal symptoms? Explain your rationale for each choice.”
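The paper does not reproduce its exact prompt templates, but the three formats above might be encoded along the following lines. The vignette, answer options and the `format_prompt` helper are all hypothetical.

```python
# Hypothetical prompt templates for the three USMLE question formats
# described above; the encodings actually used in the study may differ.
def format_prompt(vignette, question, options=None, justify=False):
    """Build a plain-text prompt for a chat-style language model."""
    prompt = f"{vignette}\n\n{question}"
    if options:  # multiple-choice variants
        prompt += "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
        if justify:
            prompt += "\nExplain your rationale for each choice."
    return prompt

vignette = "A 54-year-old man presents with ptosis and a dilated left pupil."
options = {"A": "Horner syndrome", "B": "Third nerve palsy", "C": "Adie pupil"}

open_ended = format_prompt(
    vignette, "What is the reason for the patient's pupillary asymmetry?")
mc_plain = format_prompt(
    vignette, "Which of the following is the most likely diagnosis?", options)
mc_justified = format_prompt(
    vignette, "Which of the following is the most likely diagnosis?", options,
    justify=True)
```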
“The results were evaluated by two experienced doctors, with discrepancies adjudicated by a third expert.
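A minimal sketch of that review protocol, with invented names, might look like this: two independent gradings per answer, and a third expert consulted only when they disagree.

```python
# Hypothetical sketch of the two-reviewer-plus-adjudicator protocol.
def adjudicate(score_a, score_b, tiebreaker):
    """Return the agreed score, consulting a third expert on disagreement."""
    if score_a == score_b:
        return score_a
    return tiebreaker()  # third expert resolves the discrepancy

# Example: the two reviewers disagree, so the tiebreaker's judgement stands.
final = adjudicate("accurate", "inaccurate", tiebreaker=lambda: "accurate")
```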
“Summing up the results, we can say that the answers were accurate to a degree roughly equivalent to the minimum level of the human learners who passed that year.
“There are a number of interesting observations:
– It is striking that, in just a few months, the system has improved significantly, partly because the model itself has got better and partly because the amount of biomedical data has increased considerably.
– The system is better than others trained on scientific texts alone. The reason must be that its statistical model is more comprehensive.
– There is an interesting correlation between the quality of the results (accuracy), the quality of the explanations (concordance) and the ability to produce non-trivial explanations (insight). The likely explanation is that, when the system works on a case for which it has plenty of data, its correlation model is better, producing more accurate and more coherent answers. This gives some insight into the inner workings of the system and the importance of the structure of the data it relies on.
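The item-level analysis described here can be illustrated with a short sketch; the per-question scores below are fabricated placeholders, not the study’s data.

```python
# Illustrative correlation between per-question accuracy, concordance and
# insight scores. The numbers are made up; the study's real data differ.
import statistics

accuracy    = [1, 0, 1, 1, 0, 1]  # 1 = answer judged accurate
concordance = [1, 0, 1, 1, 1, 1]  # 1 = explanation coherent with answer
insight     = [2, 0, 1, 3, 0, 2]  # count of non-trivial insights

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(pearson(accuracy, concordance))  # accuracy vs concordance
print(pearson(accuracy, insight))      # accuracy vs insight
```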
“The study is careful in key areas, such as checking that the questions and answers were not openly available on the web and could not have been used to train the system, or that it did not retain the memory of previous answers. It also has limitations, such as a limited sample size (with 350 questions: 119, 102 and 122 for levels 1, 2 and 3, respectively). The study also represents a limited scenario as it only works with text. In fact, 26 questions containing images or other non-textual information were removed.
“What does this tell us?
– Exams should not be in written form, since it is possible to answer them without “understanding” either the questions or the answers. In other words, such written exams are useful neither for assessing the knowledge of a student (be it a machine or a human being) nor for measuring their ability to respond to a real case (which is nil in the case of the machine).
– Natural language processing systems based on “Transformers” are reaching very impressive levels of writing, basically comparable to human writing.
– Humans are still exploring how to use these new tools.”
Lucía Ortiz de Zárate, pre-doctoral researcher in Ethics and Governance of Artificial Intelligence in the department of Political Science and International Relations at the Autonomous University of Madrid, said:
“The study addresses, experimentally, the potential of ChatGPT (OpenAI) to pass the United States Medical Licensing Exam (USMLE). Passing this exam is a prerequisite for acquiring a licence to practise medicine in the United States, and it tests the ability of medical specialists to apply knowledge, concepts and principles that are essential for providing the necessary care to patients.
“The novelty of the paper lies not only in the fact that it is the first experiment of its kind, but also in its results. According to the researchers, ChatGPT is very close to passing the USMLE test, which requires at least a 60% success rate. The test used in the study contains three types of questions (open response, multiple-choice without justification and multiple-choice with justification). Currently, ChatGPT has achieved an average of between 52.4% and 75% correct answers, well above the 36.7% score achieved only a few months ago by previous models. These rapid improvements in just a few months make researchers optimistic about the possibilities of this AI.
“While the results may be of great interest, the study has important limitations that call for caution. ChatGPT was tested on 375 questions from the June 2022 edition of the exam, published on the official website responsible for the exam. We will have to wait and see what results are obtained when ChatGPT is applied to a larger number of questions and, in turn, is trained with a larger volume of data and more specialised content. In addition, the results of the ChatGPT test were evaluated by only two doctors, so further studies with a larger number of qualified evaluators are needed to corroborate the results of this AI.
“This type of study demonstrates, on the one hand, the potential of AI for medical applications and, on the other, the need to rethink methods of evaluating knowledge. In terms of medical practice, AI technologies can be a very significant help to doctors when making diagnoses, prescribing treatments and medicines, and so on. These changes push us to rethink the relationship between AI, doctors and patients. As for evaluation systems, and not only in medicine, the progressive improvement of AI systems such as ChatGPT shows that we need to rethink our methods for evaluating the knowledge and skills (and content) that future professionals need.”
The following comments are provided by our colleagues at the New Zealand SMC:
Dr Simon McCallum, Senior Lecturer in Software Engineering, Te Herenga Waka Victoria University of Wellington, said:
“This particular study was conducted in the first few weeks of ChatGPT becoming available. There have been three updates since November with the latest on January 30th. These updates have improved the ability of the AI to answer the sorts of questions in the medical exam.
“Google has developed a Large Language Model (the broad category of tools like ChatGPT) called Med-PaLM, which ‘performs encouragingly on the axes of our pilot human evaluation framework.’ Med-PaLM is a specialisation of Flan-PaLM, a system released by Google that is similar to ChatGPT, trained on general instructions. Med-PaLM focused its learning on medical text and conversations. ‘For example, a panel of clinicians judged only 61.9% of Flan-PaLM long-form answers to be aligned with scientific consensus, compared to 92.6% for Med-PaLM answers, on par with clinician-generated answers (92.9%). Similarly, 29.7% of Flan-PaLM answers were rated as potentially leading to harmful outcomes, in contrast with 5.8% for Med-PaLM, comparable with clinician-generated answers (6.5%).’
“Thus, ChatGPT may pass the exam, but Med-PaLM is able to give advice to patients that is as good as a professional GP’s. And both of these systems are improving.
“ChatGPT is also good at simplifying content so that individuals can understand medical jargon or complex instructions. Asking the AI to simplify until the language fits the needs of the patient will change people’s ability to understand medical advice and remove the potential embarrassment of admitting you do not understand.
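One way such a simplification loop might be wired up is sketched below; `query_llm` is a hypothetical stand-in for a real chat-model API call, and average word length is used as a crude readability proxy.

```python
# Hypothetical loop that asks a model to keep simplifying medical advice
# until it reads easily. query_llm is a stand-in for a real chat API call.
def query_llm(prompt):
    raise NotImplementedError("replace with a call to your chat model API")

def avg_word_length(text):
    """Crude readability proxy: mean word length in characters."""
    words = text.split()
    return sum(len(w) for w in words) / max(len(words), 1)

def simplify_until_readable(advice, max_rounds=5, target=5.0):
    """Repeatedly ask the model to simplify until the proxy target is met."""
    text = advice
    for _ in range(max_rounds):
        if avg_word_length(text) <= target:
            break
        text = query_llm("Rewrite this for a patient with no medical "
                         "background, using short everyday words:\n\n" + text)
    return text
```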
“Within university education we are having to pivot almost as fast as at the start of the pandemic to account for the ability of AI to perform tasks which were traditionally a sign of understanding. There is also a massive cultural shift when everybody has access to a tool that can assist in written communication. Careers and jobs which were seen as difficult may be automated by these AI tools. Microsoft has announced that ChatGPT is now integrated into Microsoft Teams Premium and will act as a meeting secretary, summarising meetings and creating action items. Bing will also include an advanced version of ChatGPT, linking version 4 of the model with up-to-date search information.”
“Society is about to change, and instead of warning about the hypochondria of randomly searching the internet for symptoms, we may soon get our medical advice from Doctor Google or Nurse Bing.”
Dr Collin Bjork, Senior Lecturer in Science Communication and Podcasting, Massey University, said:
“The claim that ChatGPT can pass US medical exams is overblown and should come with a lengthy series of asterisks. Like ChatGPT itself, this research article is a dog and pony show designed to generate more hype than substance.
“OpenAI had much to gain by releasing a free open-access version of ChatGPT in late 2022 and fomenting a media fervour around the world. Now, OpenAI is predicting $1 billion in revenue in 2024, even as a ‘capped-profit’ company.
“Similarly, the authors of this article have much to gain by releasing a free open-access version of their article claiming that ChatGPT can pass the US Medical Licensing Exams. All of the authors but one work for Ansible Health, ‘an early stage venture-backed healthcare startup’ based in Silicon Valley. At two years old, this tiny company will likely need to go back to its venture capital investors soon to ask for more money, and the media splash from this well-timed journal article will certainly help fund its next round of growth. After all, a pre-print of this article already went viral on social media because the researchers listed ChatGPT as an author. But the removal of ChatGPT from the list of authors in the final article indicates that this too was just a publicity stunt.
“As for the article itself, the findings are not as straightforward as the press release indicates. Here’s one example:
“The authors claim that ‘ChatGPT produced at least one significant insight in 88.9% of all responses’ (8). But their definition of ‘insight’ as ‘novelty, non-obviousness, and validity’ (7) is too vague to be useful. Furthermore, the authors insist that these ‘insights’ indicate that ChatGPT ‘possesses the partial ability to teach medicine by surfacing novel and nonobvious concepts that may not be in the learner’s sphere of awareness’ (10). But how can an unaware learner distinguish between true and false insights, especially when ChatGPT only offers ‘accurate’ answers on the USMLE a little more than half the time?
“The authors’ claims about ChatGPT’s insights and teaching potential are misleading and naive.”
‘Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models’ by Tiffany H. Kung et al. was published in PLOS Digital Health at 19:00 UK time Thursday 9 February 2023.
DOI: 10.1371/journal.pdig.0000198
Declared interests
Dr Stuart Armstrong: “I have no direct interests in OpenAI, ChatGPT, and USMLE. As a human, I am interested in AI safety in general, and as a co-founder of an AI safety startup, I am interested in tools that increase AI safety (though we have no commercial relationships with OpenAI or any of its rivals).”
Prof Alfonso Valencia: “member of the advisory board of SMC Spain.”
Lucía Ortiz de Zárate: “No conflicts to report.”
Dr Simon McCallum: “I am an active member of the Labour Party (Taieri LEC Chair). I am leading Te Herenga Waka Victoria University of Wellington’s response to AI tools.” Expertise and background: “I have a PhD in Computer Science (in neural networks like those used in ChatGPT) from the University of Otago. I taught using GitHub Copilot last year; Copilot uses the same GPT model as ChatGPT but was focused on programming languages rather than human languages. My research has been in Games for Health and Games for Education, where AIs in games have been part of the tools integrated into research. I have also applied ChatGPT to many of our courses; as of December it passed our first-year courses and some of our second-year courses, and may do even better now.”
Dr Collin Bjork: “No COI.”
For all other experts, no reply to our request for DOIs was received.