According to two new research studies published in Radiology, a journal of the Radiological Society of North America (RSNA), the latest version of ChatGPT passed a radiology board-style exam, showcasing the potential of large language models but also revealing limitations that hinder reliability.
ChatGPT is an artificial intelligence (AI) chatbot that uses a deep learning model to recognize patterns and relationships between words in its massive training data in order to generate human-like responses to prompts. However, because its training data contains no reliable source of truth, the tool may produce factually inaccurate responses.
“The use of large language models like ChatGPT is exploding and only going to increase,” said senior author Rajesh Bhayana, M.D., FRCPC, an abdominal radiologist and technology lead at University Medical Imaging Toronto, Toronto General Hospital in Toronto, Canada. “Our research provides insight into ChatGPT’s performance in a radiology context, highlighting the tremendous potential of large language models, along with the current limitations that make it unreliable.”
Bhayana noted that chatbots like ChatGPT are being incorporated into popular search engines like Google and Bing that physicians and patients use to search for medical information. ChatGPT was recently named the fastest-growing consumer application in history.
To assess its performance on radiology board exam questions and explore its strengths and limitations, Bhayana and colleagues first tested ChatGPT based on GPT-3.5, the most commonly used version. The researchers used 150 multiple-choice questions designed to match the style, content, and difficulty of the Canadian Royal College and American Board of Radiology exams.
To gain insight into performance, the questions were grouped into lower-order (knowledge recall, basic understanding) and higher-order (apply, analyse, synthesise) thinking categories. The higher-order thinking questions were further subdivided by type (description of imaging findings, clinical management, calculation and classification, and disease associations).
ChatGPT’s performance was assessed overall as well as by question type and topic. The confidence of the language used in its responses was also assessed.
The researchers found that ChatGPT based on GPT-3.5 answered 69% of questions correctly (104 of 150), near the passing grade of 70% used by the Royal College in Canada. The model performed relatively well on questions requiring lower-order thinking (84%, 51 of 61) but struggled with questions involving higher-order thinking (60%, 53 of 89). It had particular difficulty with higher-order questions involving description of imaging findings (61%, 28 of 46), calculation and classification (25%, 2 of 8), and application of concepts (30%, 3 of 10). Given its lack of radiology-specific pretraining, its poor performance on higher-order thinking questions was not surprising.
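The grading scheme described above, overall accuracy plus per-category breakdowns compared against a 70% pass mark, can be sketched in a few lines of Python. This is a minimal illustration only, not the researchers’ actual grading code; the function name and the tiny sample answer sheet are hypothetical stand-ins:

```python
from collections import defaultdict

PASS_MARK = 70.0  # passing grade used by the Royal College, per the study


def score_by_category(results):
    """results: list of (category, is_correct) pairs, one per exam question.

    Returns overall accuracy (%), a per-category accuracy dict, and
    whether the overall score meets the pass mark.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    overall = 100.0 * sum(correct.values()) / sum(totals.values())
    breakdown = {c: 100.0 * correct[c] / totals[c] for c in totals}
    return overall, breakdown, overall >= PASS_MARK


# Hypothetical mini-sample, not the study's real answer sheet.
sample = [
    ("lower-order", True),
    ("lower-order", True),
    ("higher-order", True),
    ("higher-order", False),
]
overall, by_cat, passed = score_by_category(sample)
print(round(overall, 1), by_cat["higher-order"], passed)  # → 75.0 50.0 True
```

In the study itself, the same arithmetic applies at scale: 104 of 150 correct gives 69%, just under the 70% threshold.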
GPT-4 was released in limited form to paid users in March 2023, with claimed improvements in advanced reasoning capabilities over GPT-3.5.
In a follow-up study, GPT-4 answered 81% (121 of 150) of the same questions correctly, outperforming GPT-3.5 and exceeding the passing threshold of 70%. GPT-4 performed much better than GPT-3.5 on higher-order thinking questions (81%), particularly those involving description of imaging findings (85%) and application of concepts (90%).
The findings suggest that GPT-4’s claimed improvements in advanced reasoning translate to better performance in a radiology context. They also suggest improved contextual understanding of radiology-specific terminology, including imaging descriptions, which is critical to enable future downstream applications.
“Our study demonstrates an impressive improvement in ChatGPT’s performance in radiology over a short period, highlighting the growing potential of large language models in this context,” Bhayana added.
GPT-4 showed no improvement on questions requiring lower-order thinking (80% vs. 84%) and answered 12 questions incorrectly that GPT-3.5 had answered correctly, raising questions about its reliability as a source of information.
“We were initially surprised by ChatGPT’s accurate and confident answers to some challenging radiology questions, but then equally surprised by some very illogical and inaccurate assertions,” Bhayana added. Of course, given how these models work, the inaccurate responses should not be particularly surprising.
ChatGPT’s dangerous tendency to produce inaccurate responses, termed hallucination, is less frequent in GPT-4, but it still limits the tool’s usability in medical education and practice at present.
Both studies showed that ChatGPT used confident language consistently, even when its answers were incorrect. Bhayana notes that this is particularly dangerous if the tool is relied on as a sole source of information, especially for novices who may not be able to recognize confident responses as inaccurate.
“This is its biggest limitation. Currently, ChatGPT is best used to spark ideas, help start the medical writing process, and summarise data. If used for quick information recall, it always needs to be fact-checked,” Bhayana said.