TL;DR:
- ChatGPT-4, an AI language model, performed well on a radiology board-type exam, scoring 81%.
- However, concerns were raised due to the model providing inaccurate and illogical answers alongside accurate ones.
- ChatGPT-3.5 was tested prior to ChatGPT-4 and achieved a 69% accuracy rate.
- ChatGPT-4 showed improvement in answering higher-order thinking questions (81% accuracy) compared to ChatGPT-3.5 (60% accuracy).
- The researchers highlighted the growing potential of large-language models in radiology but emphasized the need for further evaluation.
- ChatGPT’s tendency to produce inaccurate responses, referred to as “hallucinations,” remains a limitation in medical education and practice.
- Confidence in ChatGPT’s answers, even when incorrect, raises concerns about its reliability as an information source.
- The rapid advancement of language models like ChatGPT is considered exciting, and researchers suggest exploring radiology-specific fine-tuning using ChatGPT-4.
Main AI News:
The remarkable achievements of ChatGPT in the field of radiology have sparked both excitement and concerns about its reliability. A recent study published in Radiology on May 16 shed light on the language model’s performance, highlighting its potential while raising questions regarding its accuracy.
Dr. Rajesh Bhayana, together with his colleagues from the University of Toronto, conducted an examination using ChatGPT-4, the latest paid version of the artificial intelligence (AI) large-language model (LLM). The test used multiple-choice, text-only questions mirroring the style, content, and difficulty of the Canadian Royal College and American Board of Radiology exams. Impressively, ChatGPT-4 achieved an overall score of 81%. However, it was the chatbot’s erroneous responses that gave rise to concerns.
In an official statement released by the Radiological Society of North America (RSNA), Dr. Bhayana expressed initial surprise at ChatGPT’s accurate and confident answers to challenging radiology questions. Yet, equally surprising were the instances where the model made illogical and inaccurate assertions. These discrepancies highlight the need for further evaluation and scrutiny.
While ChatGPT exhibits promise as a valuable tool in medical practice and education, its performance in radiology remains uncertain, as stated by the authors of the study. OpenAI introduced ChatGPT-3.5 in November 2022, and the more recent ChatGPT-4 was released in March 2023.
Testing began with ChatGPT-3.5, and the results of this initial examination were also published in Radiology on May 16. The test involved 150 text-only questions, assessing the chatbot’s proficiency in “lower-order thinking,” encompassing knowledge recall and basic understanding, as well as “higher-order thinking,” involving descriptions of imaging findings and application of concepts.
ChatGPT-3.5 answered 69% of the questions correctly (with the passing score set at 70%), performing relatively well on questions involving lower-order thinking but comparatively weaker on those requiring higher-order thinking. ChatGPT-4, in contrast, exhibited improvements: while its performance on lower-order questions remained unchanged, it showed a significant boost on higher-order thinking questions (81% compared to 60%), according to the findings of the study.
“The impressive improvement in ChatGPT’s radiology performance within such a short period underscores the expanding potential of large-language models in this domain,” stated the researchers in their report.
Dr. Bhayana speculated that ChatGPT-4 might have undergone training with additional data, enhancing its advanced reasoning capabilities, though the exact details have not been publicly disclosed by its developers. Notably, both ChatGPT-3.5 and ChatGPT-4 consistently employed confident language, even when providing incorrect answers. This characteristic raises concerns about the reliability of ChatGPT as an information source, particularly for novices who may not be able to recognize confident yet inaccurate responses.
The researchers expressed their concern regarding ChatGPT’s tendency to produce inaccurate responses, which they referred to as “hallucinations.” While this occurrence was less frequent in ChatGPT-4, it still limits the model’s usability in medical education and practice at present, according to Dr. Bhayana and his colleagues.
Conclusion:
The performance of ChatGPT-4 in the radiology field has significant implications for the market. While the model achieved a commendable score on a board-type exam, its inaccurate and illogical responses raise questions about its reliability as a tool in medical practice and education. This presents an opportunity for further development and refinement to enhance the model’s accuracy and usability.
As large-language models like ChatGPT continue to evolve and demonstrate potential, there is a growing market demand for sophisticated AI solutions tailored specifically to the field of radiology. Fine-tuning ChatGPT-4 for radiology applications could lead to the creation of advanced tools that aid radiologists in their decision-making processes, ultimately transforming the market landscape and offering new possibilities for improved patient care and diagnostic accuracy.