A study published in JAMA Internal Medicine compares physician and artificial intelligence chatbot responses to patient questions.
Prof Martyn Thomas, Professor of IT, Gresham College, London, and Director and Principal Consultant, Martyn Thomas Associates Limited, said:
“As the authors explicitly recognise, they looked at a very small sample of medical questions submitted to a public online forum and compared replies from doctors with ChatGPT’s responses. Neither the doctors nor ChatGPT had access to the patient’s medical history or further context. Their results should not be assumed to apply to other questions, asked differently or evaluated differently. This was not a randomised controlled trial.
“From the examples of answers shown in the paper, the doctors gave succinct advice, whereas ChatGPT’s answers were similar to what you would find from a search engine selection of websites, but without the quality control that you would get by selecting (say) an NHS website.
“ChatGPT has no medical quality control or accountability and LLMs are known to invent convincing answers that are untrue. Doctors are trained to spot rare conditions that might need urgent medical attention. Whilst most medical conditions get better without medical intervention, it would be foolish for a patient to prefer ChatGPT’s advice rather than seeking something authoritative.
“The results in the paper are unsurprising. I hope they do not lead to any change in the existing recommendation that patients should call NHS 111, see their doctor or a pharmacist, or look at a reputable medical website for advice.
“If patients just want sympathy and general information, ChatGPT would seem to offer that.”
Prof Maria Liakata, Professor in Natural Language Processing (NLP) at Queen Mary, University of London, said:
“When comparing physician responses against AI-generated responses, the question “Which response is better?” is rather vague. For example, when we evaluate the quality of text in NLP generation tasks, we ask questions about different aspects of the text, such as fluency, meaning, informativeness and grammaticality, as well as an overall score. If some of the physicians answering were non-English speakers, this could have influenced the score assigned to their answers. In the same vein, empathy could also be influenced by someone’s language proficiency. While the physicians answering questions were verified, that doesn’t necessarily mean we know much about their background, experience or proficiency in English, all of which would affect the fluency of their answers and could result in their replies being perceived as of lesser quality or even less empathetic.
“The authors mention several limitations, such as that question-answer pairs involved questions answered in isolation, out of context rather than in a dialogue, which is a more natural form of interaction; the need to evaluate in clinical settings; and the lack of evaluation of answers and empathy by actual patients. All this would need to be addressed before the uptake of AI assistants in clinical practice. The suggestion that AI assistants could help physicians formulate their responses, increasing their productivity and improving their communication skills, is valid and could potentially bring benefits to patients and reduce the workload of clinicians. However, many ethical questions remain open, and it is important that physicians using such tools do not over-rely on them and exercise their own critical thinking and expert knowledge in reviewing and approving AI system outputs.”
Prof Nello Cristianini, Professor of Artificial Intelligence at the University of Bath, said:
“This study was in a position to fairly compare ChatGPT with human doctors under the same conditions, by focusing on the specific situation of answering online questions with no other information about the patient. The questions were posed by online patients (on a Reddit forum), answered by verified human doctors, and then also answered by ChatGPT. These were compared in a blind setting by a group of human evaluators, who graded them for accuracy and empathy, finding that the answers of the machine were preferable to those of the humans. The article suggests that this technology could lead to AI assistants that might “improve responses, lower clinician burnout, and improve patient outcomes”.
“This is an impressive study, with many limitations that are discussed below. The ratings provided by the human assessors could easily become part of the training, following a standard procedure invented by OpenAI called reinforcement learning from human feedback. We can expect improvements to be forthcoming.
“However, if the relationship between patient and doctor were limited only to providing information based on a textual prompt, then ChatGPT might have shown that it can perform as well as – or better than – human doctors.
“However, this “text only” interaction is the natural mode in which GPT is trained, and not the natural setting for human doctors. Information retrieval is only one of the reasons why patients engage with doctors, and the instructions received are only one of the benefits of this interaction. Framing the comparison in terms of textual prompt and textual answer means missing a series of important points about human doctors. I would prefer to see these tools used by a doctor, when addressing a patient, in a human-to-human relation which is part of the therapy.
“My father was a doctor and would sometimes take phone calls in the middle of the night. In the morning I would ask him what the call was about, and he would say that sometimes people become scared of their own mortality, and they need to hear the doctor. I am not sure if the chatbot could – or should – ever try to play that important role.”
Dr Mhairi Aitken, Ethics Research Fellow at The Alan Turing Institute, said:
“It is important to note some significant limitations of this study. Firstly, the patient queries and clinician responses come from an online forum rather than actual care settings. This is very different from the kinds of advice or responses that may be given by clinicians in actual care settings. It is likely that comparing ChatGPT’s responses with physician responses from actual care settings would lead to different outcomes. This comparison would be necessary before drawing any conclusions about the value of potential applications of ChatGPT in the delivery of healthcare.
“Secondly, the evaluators in this study were licensed healthcare professionals who assessed the accuracy and perceived empathy of the responses. If we are to consider using ChatGPT to provide responses to patients it is important to consider the perspectives of patients not just professionals. In particular, perceptions of empathy may vary considerably among different patient groups. There may be different perceptions of empathy or appropriateness of a chatbot from people with different demographic or cultural backgrounds or in relation to different health conditions and circumstances. Any assessments of perceived empathy should engage with diverse interests and perspectives, including from vulnerable or minoritized patient groups and taking account of particular sensitivities around different health conditions.
“It’s important to note that while some people may feel comfortable receiving medical advice from a chatbot, or for a chatbot to assist in a physician’s advice, for many patients the human relationship and care-giving is a vital part of the healthcare process and something which cannot be automated or replaced by chatbots such as ChatGPT. A human doctor is able to adjust their language, manner and approach in response to social cues and interactions, whereas a chatbot will produce more generic language without awareness of social contexts.”
Prof Anthony Cohn, Professor of Automated Reasoning, University of Leeds, said:
“Given ChatGPT’s well-known ability to “write in the style of”, it is not surprising that a chatbot is able to write text that is generally regarded as empathetic. Large Language Model (LLM) chatbots tend to produce quite verbose responses, so it is not surprising that they are longer than the physicians’ responses; the physicians will typically be under time pressure to respond quickly. The authors are careful to note that a chatbot should only be used as a tool to draft a response to a patient query – given LLMs’ propensity to invent “facts” and hallucinate, it would be dangerous to rely on any factual information given by such a chatbot response – it is essential that any responses are carefully checked by a medical professional. However, humans have been shown to overly trust machine responses, particularly when they are often right, and a human may not always be sufficiently vigilant to properly check a chatbot’s response; this would need guarding against (perhaps using random synthetic wrong responses to test vigilance). But this kind of use of LLMs, i.e. as a tool to draft email responses to user queries, is a reasonable use case for early adoption of LLMs and is indeed already in use in some company customer service departments – though the consequences of a bad response there would likely be less severe than in a medical setting. The authors note a number of limitations of their study, and further research to investigate these, and possible further adoption, would be a good idea.”
Dr Heba Sailem, Head of the Biomedical AI and Data Science group at King’s College London, said:
“These results are highly encouraging and motivate the training of more specialised Large Language Models based on medical knowledge to improve communication channels between patients and healthcare professionals. Chatbots could also serve as educational tools to help professionals identify opportunities for responding more empathetically to their patients.”
Prof James Davenport, Hebron and Medlock Professor of Information Technology, University of Bath, said:
“The questions were posted on Reddit, as were the physicians’ answers. We are given (quite reasonably) censored versions of the questions, but we are not told exactly what ChatGPT was asked. The readers are given 6 questions (out of the experiment’s 195) and their corresponding answers. Both here and in the whole database, the ChatGPT answers were, on average, four times the length of the physicians’. It is stated that the evaluators (all physicians) were given the two responses blind, not knowing which was the physician’s and which was ChatGPT’s. This was probably formally true, but length and style surely made it obvious in practice. At least in the 6 given, the doctors made no attempt to be empathetic, knowing that their answers were public, while ChatGPT is aimed at a 1:1 conversation. Hence in terms of empathy, this is far from being a level comparison. This could have been made more explicit.
“The evaluators (three for each question) were apparently asked two questions, ‘the quality of information provided’ and ‘the empathy or bedside manner provided’. We are not told of any additional explanations. One might think that an empathetic answer was higher quality, and indeed there’s substantial correlation. In the case of the first example (swallowed toothpick), the physician’s answer (chances are they’ve passed into your intestines) is more accurate than ChatGPT’s (if you are not experiencing any symptoms, it is safe to assume that the toothpick has passed through your digestive system), but all evaluators preferred the ChatGPT answer.
“One might think that the longer answer might be more empathetic. This can’t be directly measured, but overall the evaluators preferred the ChatGPT response 78.6% of the time. This dropped to 71.4% for the longer half of physician comments, and 60.2% for the longest 25%. Hence it is not clear that an equivalent-length comparison would favour ChatGPT.
“The paper does not say that ChatGPT can replace doctors (and a very good reason why not is given in https://inflecthealth.medium.com/im-an-er-doctor-here-s-what-i-found-when-i-asked-chatgpt-to-diagnose-my-patients-7829c375a9da ), but does, quite legitimately, call for further research into whether and how ChatGPT can assist physicians in response generation. As it points out, “teams of clinicians often rely on canned responses”, and a stochastic parrot like ChatGPT has a much wider range of responses than even the largest library of canned responses.”
Prof Mirella Lapata, professor of natural language processing, University of Edinburgh, said:
“The study assesses ChatGPT’s ability to provide responses to patient questions and compares these to answers written by physicians. Perhaps unsurprisingly, it finds that healthcare professionals prefer ChatGPT’s responses to the physicians’. ChatGPT is more empathetic and overall chattier (no pun intended!). Without controlling for the length of the response, we cannot know for sure whether the raters judged for style (e.g., verbose and flowery discourse) rather than content.”
‘Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum’ by John W. Ayers et al. was published in JAMA Internal Medicine at 16:00 UK Time Friday 28 April 2023.
Dr Mhairi Aitken: “I confirm I have no conflicts of interest to declare.”
Prof Nello Cristianini: Disclosure: author of “The Shortcut – why intelligent machines do not think like us”
Prof Anthony Cohn: “No COI.”
Dr Heba Sailem: “No conflicts of interest.”
Prof James Davenport: “I confirm I have no conflicts of interest with the research or the underlying technology.”
Prof Mirella Lapata is a professor of Natural Language Processing at the University of Edinburgh; she has not in any way participated in the development of ChatGPT or collaborated with the researchers who conducted the study.
For all other experts, no reply to our request for declarations of interest (DOIs) was received.