GPT-3.5 and GPT-4, advanced AI language models, show promise in clinical diagnostic reasoning

TL;DR:

  • GPT-3.5 and GPT-4, advanced language models, show promise in clinical diagnostic reasoning.
  • Research focuses on open-ended clinical questions and effective prompts for LLMs.
  • The study uses MedQA USMLE dataset and NEJM case series for evaluation.
  • GPT-4 outperforms GPT-3.5 in accuracy and demonstrates the potential to mimic clinical reasoning.
  • GPT-4 excels in intuitive-type reasoning prompts.
  • Challenges were observed with analytical reasoning and differential diagnosis prompts.
  • Implications for improving AI trustworthiness in healthcare.

Main AI News:

In the rapidly evolving landscape of healthcare, artificial intelligence (AI) is making significant strides, with models like GPT-3.5 and GPT-4 leading the charge in clinical reasoning. These Large Language Models (LLMs), trained on vast textual data, have already demonstrated human-like capabilities in tasks such as composing clinical notes and passing rigorous medical exams. However, their proficiency in clinical diagnostic reasoning remains a key open question for their safe integration into the healthcare system.

Recent research has homed in on assessing these LLMs, particularly on open-ended clinical questions, showcasing the potential of models like GPT-4 to work through intricate patient cases. To address the variance in LLM performance, researchers have turned to prompt engineering, recognizing that the choice of prompt and question format strongly influences the results.

The Study: Unveiling Diagnostic Reasoning Skills

In a groundbreaking study, researchers evaluated the diagnostic reasoning capabilities of GPT-3.5 and GPT-4 on open-ended clinical questions. Their hypothesis was that GPT models given diagnostic reasoning prompts would outperform the same models given conventional chain-of-thought (CoT) prompts.

To conduct this investigation, they leveraged the revised MedQA United States Medical Licensing Exam (USMLE) dataset and the case series from the New England Journal of Medicine (NEJM). The researchers meticulously designed various diagnostic logic prompts, drawing inspiration from cognitive procedures such as forming differential diagnoses, analytical reasoning, Bayesian inferences, and intuitive reasoning.
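
To make these strategies concrete, the sketch below shows, in Python, what such reasoning prompts might look like. The wording is purely illustrative and is not taken from the study's actual prompt set.

```python
# Hypothetical prompt stubs for each reasoning strategy; the exact wording used
# in the study is not reproduced here.
REASONING_PROMPTS = {
    "classic_cot": "Let's think step by step.",
    "intuitive": ("Use intuitive reasoning: identify the pattern this presentation "
                  "most closely resembles, then state the single most likely diagnosis."),
    "analytical": ("Use analytical reasoning: list the key findings, explain the "
                   "underlying pathophysiology, and derive the diagnosis step by step."),
    "differential": ("Form a differential diagnosis: enumerate plausible diagnoses, "
                     "weigh the evidence for and against each, and select the most likely one."),
    "bayesian": ("Use Bayesian reasoning: start from prior probabilities for each "
                 "candidate diagnosis and update them as each finding is considered."),
}

def build_prompt(case_text: str, strategy: str) -> str:
    """Append the chosen reasoning instruction to a free-response clinical vignette."""
    return f"{case_text}\n\n{REASONING_PROMPTS[strategy]}"
```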

Prompt engineering was central to building the diagnostic reasoning prompts: questions were converted into free-response format by removing the multiple-choice options. The questions chosen for this evaluation were confined to Step 2 and Step 3 items from the USMLE dataset, all related to patient diagnosis.
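
As a rough illustration of that conversion step, the snippet below strips the answer choices from a MedQA-style item. The field names ("question", "options") and the example vignette are assumptions made for illustration, not the study's actual data schema.

```python
def to_free_response(item: dict) -> str:
    """Drop the multiple-choice options so the model must produce an
    open-ended diagnosis instead of picking a letter."""
    # Only the question stem is kept; item["options"] is deliberately ignored.
    return f"{item['question']}\n\nWhat is the most likely diagnosis?"

example = {
    "question": ("A 45-year-old man presents with crushing substernal chest pain "
                 "radiating to the left arm, diaphoresis, and ST elevations in "
                 "leads II, III, and aVF."),
    "options": {"A": "Pulmonary embolism", "B": "Inferior myocardial infarction",
                "C": "Aortic dissection", "D": "Acute pericarditis"},
}
print(to_free_response(example))
```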

To assess GPT-3.5's accuracy, each iteration of prompt engineering was evaluated against the MedQA training split of 95 questions, with 518 questions reserved for testing. For GPT-4, the evaluation centered on 310 cases recently published in the NEJM; to maintain rigor, 10 cases lacking a definitive diagnosis or exceeding GPT-4's context length were excluded from the analysis.
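
The sketch below illustrates that train/test discipline under stated assumptions: prompt wording is tuned only on the small training split, and the held-out questions are scored once with the frozen prompt. The helper names (grade, match_fn) are hypothetical placeholders for calling the model and judging its free-text answer.

```python
def accuracy(predictions, references, match_fn):
    """Fraction of model answers judged equivalent to the reference diagnosis."""
    return sum(match_fn(p, r) for p, r in zip(predictions, references)) / len(references)

def select_prompt(candidate_prompts, train_items, grade):
    """Tune prompt wording on the 95-question training split only."""
    return max(candidate_prompts, key=lambda prompt: grade(prompt, train_items))

# The 518 held-out questions are then scored exactly once with the chosen prompt:
#   final_accuracy = grade(best_prompt, test_items)
```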

Each evaluation prompt included two exemplar questions, each accompanied by a rationale written in the target reasoning style (few-shot learning). The study design enabled a comprehensive, rigorous comparison of the prompting strategies using free-response questions from both the USMLE and NEJM datasets.
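
A minimal sketch of how such a few-shot prompt might be assembled is shown below; the exemplar structure and field names are illustrative assumptions rather than the study's exact format.

```python
def few_shot_prompt(exemplars, new_case, instruction):
    """Two worked examples, each with a rationale in the target reasoning style,
    followed by the unanswered case."""
    parts = [instruction, ""]
    for ex in exemplars:
        parts += [f"Case: {ex['case']}",
                  f"Reasoning: {ex['rationale']}",
                  f"Diagnosis: {ex['diagnosis']}",
                  ""]
    parts += [f"Case: {new_case}", "Reasoning:"]
    return "\n".join(parts)
```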

A Multifaceted Evaluation

The assessment of language model responses was conducted by esteemed physicians, including physician authors, attending physicians, and an internal medicine resident. Each question underwent evaluation by two independent, blinded physicians, with any discrepancies resolved by a third researcher. Additionally, software verification was employed when necessary to ascertain the accuracy of responses.
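
A toy sketch of that grading workflow, assuming a simple list of per-question grades from each physician, might look like the following; it is not the study's actual tooling.

```python
def adjudicate(grades_a, grades_b, third_rater):
    """Two blinded physicians grade each response; a third researcher
    resolves any disagreement. Returns final grades and percent agreement."""
    final, agreements = [], 0
    for a, b in zip(grades_a, grades_b):
        if a == b:
            final.append(a)
            agreements += 1
        else:
            final.append(third_rater(a, b))  # tie-break by the third researcher
    # e.g. 0.97 agreement was reported for the GPT-3.5 MedQA evaluations
    return final, agreements / len(grades_a)
```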

Key Findings and Implications

The study findings unveil a promising aspect of GPT-4, as it demonstrates the ability to mimic the clinical reasoning of healthcare professionals without compromising diagnostic accuracy. This breakthrough addresses one of the critical challenges faced by LLMs, shedding light on their black box limitations and bringing them closer to safe and effective utilization in the field of medicine.

While GPT-3.5 exhibited 46% accuracy with standard CoT prompts and 31% with zero-shot, non-chain-of-thought prompts, it excelled with intuitive-type reasoning, reaching 48%. It performed less impressively with analytical reasoning prompts (40%) and those involving differential diagnoses (38%), while Bayesian inference prompts showed marginal improvement at 42%. Inter-rater consensus for the GPT-3.5 MedQA evaluations was a remarkable 97%.

On the other hand, GPT-4 outshone its predecessor, showcasing higher accuracy rates across the board. With classical chain-of-thought prompts, intuitive-type reasoning, differential diagnostic reasoning, analytical reasoning prompts, and Bayesian inferences, GPT-4 achieved accuracy rates of 76%, 77%, 78%, 78%, and 72%, respectively. The inter-rater consensus soared to an impressive 99% for GPT-4 MedQA evaluations.

In the evaluation of the NEJM dataset, GPT-4 exhibited a 38% accuracy rate with conventional CoT prompts, surpassing the performance of prompts focused on formulating differential diagnoses, which achieved 34% accuracy (a 4.2% difference). The inter-rater consensus for GPT-4’s assessment of NEJM cases reached 97%, showcasing the robustness of its responses and rationales. Prompts emphasizing step-by-step reasoning and a singular diagnostic reasoning strategy demonstrated superior results compared to those combining multiple approaches.

In summary, this study shows that GPT-3.5 and GPT-4 have made substantial advances in their reasoning abilities while keeping overall accuracy consistent across prompting strategies. GPT-4 particularly shines with conventional and intuitive-type reasoning chain-of-thought prompts, while analytical and differential diagnosis prompts proved more challenging; Bayesian inference prompts likewise showed a slight decline relative to classical CoT. The authors put forward three plausible explanations for these variations: differences between GPT-4's reasoning mechanisms and those of human providers, post-hoc construction of diagnostic rationales in the preferred reasoning formats, and optimization of precision given the available vignette data.

Conclusion:

These findings underscore the growing significance of AI, particularly GPT-4, in enhancing clinical diagnostic reasoning. The higher accuracy and consistency demonstrated by GPT-4 suggest a brighter future for AI integration in healthcare. However, there is still room for improvement, particularly in handling analytical and differential diagnosis prompts. The market for AI in healthcare is poised for growth, as these advancements bring us closer to leveraging AI effectively in patient care and diagnostics.

Source