Innovative Studies at Stanford Uncover Mid-Sized Language Models’ Potential in Clinical QA Tasks

  • Recent advancements show large language models (LLMs) like Med-PaLM 2 and GPT-4 excelling in clinical QA tasks.
  • However, challenges such as high costs, ecological concerns, and restricted access hinder widespread adoption.
  • On-device AI presents a promising alternative, utilizing local devices for running language models in biomedicine.
  • Evaluation of smaller domain-specific models (<3B parameters) versus larger 7B parameter models reveals varied performance in clinical QA tasks.
  • Stanford-led research conducts thorough assessment using popular tasks like MedQA and MultiMedQA Long Form Answering.
  • Mistral 7B emerges as top performer, highlighting potential for clinical question-answering tasks.
  • Findings underscore importance of expert medical review before deploying language models in clinical settings.

Main AI News:

The recent strides in clinical question-answering (QA) tasks achieved by large language models (LLMs) such as Med-PaLM 2 and GPT-4 are notable. Med-PaLM 2, for instance, rivaled human doctors in answering consumer health inquiries, while a GPT-4-based system attained an impressive 90.2% on the MedQA task. However, these LLMs come with their share of challenges. Their training and operational costs are exorbitant, and their ecological impact is concerning: their massive parameter counts necessitate dedicated computing clusters. Moreover, access to these large models is restricted to those with paid API access, hindering in-depth analysis and research into improving them.

A novel approach known as on-device AI or edge AI leverages local devices like smartphones or tablets to execute language models. This technology holds significant promise in biomedicine, offering solutions such as disseminating medical information in disaster-stricken areas or regions with limited internet connectivity. Because the size and closed nature of models like GPT-4 and Med-PaLM 2 make them impractical for on-device deployment, smaller open models open up new research avenues and practical applications in the medical field.

Within the biomedical context, two categories of models are relevant. Smaller domain-specific models (<3B parameters) like BioGPT-large and BioMedLM were trained exclusively on biomedical text from PubMed. Conversely, larger 7B-parameter models like LLaMA 2 and Mistral 7B, while more powerful, were trained on broad English text without a biomedical focus. The effectiveness of these models and their suitability for clinical QA applications remain uncertain.

To ensure comprehensive and credible insights, a consortium of researchers from Stanford University, University College London, and the University of Cambridge conducted an exhaustive evaluation of all four models in the clinical QA domain. They employed two popular tasks, MedQA (similar to USMLE questions) and MultiMedQA Long Form Answering (providing open responses to consumer health queries), to gauge the models’ capacity to comprehend and reason about medical scenarios and craft informative responses to health-related inquiries.

The MedQA evaluation, featuring a four-option format akin to the USMLE, assesses a language model’s ability to leverage medical information and navigate clinical situations effectively. The test encompasses questions seeking specific medical details, such as symptoms of schizophrenia, as well as scenarios requiring diagnosis or the next course of action. The dataset comprises 1,273 test cases, 10,178 training instances, and 1,272 development examples, each accompanied by a prompt and an expected response.
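To make the format concrete, the sketch below shows one common way a four-option MedQA-style item can be posed to a causal language model and scored by comparing the likelihood the model assigns to each option. The question, model name, and prompt template here are illustrative assumptions, not the study’s exact setup.

```python
# Hypothetical sketch: score a four-option MedQA-style item by the log-likelihood
# a causal LM assigns to each option. Question, model, and prompt are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumption: any causal LM could be substituted
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

question = ("A patient reports auditory hallucinations and disorganized speech. "
            "Which class of medication is the most appropriate first-line treatment?")
options = {"A": "Antipsychotics", "B": "SSRIs", "C": "Benzodiazepines", "D": "Beta blockers"}

def option_logprob(question: str, answer: str) -> float:
    """Sum the log-probabilities of the answer tokens conditioned on the question."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    answer_ids = tokenizer(" " + answer, return_tensors="pt",
                           add_special_tokens=False).input_ids.to(model.device)
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i+1, so slice out only the answer positions.
    answer_logits = logits[0, prompt_ids.shape[1] - 1 : -1]
    logprobs = torch.log_softmax(answer_logits.float(), dim=-1)
    return logprobs.gather(1, answer_ids[0].unsqueeze(-1)).sum().item()

scores = {letter: option_logprob(question, text) for letter, text in options.items()}
print("Predicted option:", max(scores, key=scores.get))
```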

To standardize the evaluation, all four models were fine-tuned using the same training data (10,178 instances) and code, ensuring equitable comparisons. The Hugging Face package facilitated the fine-tuning process. Furthermore, the researchers amalgamated the MedQA training data with the larger MedMCQA training set, enriching the training data by 182,822 additional examples. This augmentation enhanced the performance of Mistral 7B, the top-performing model, which was then subjected to a more intricate training process aiming to produce both the correct letter and the complete response text.
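As a rough illustration of what such Hugging Face-based fine-tuning can look like, the minimal sketch below trains a causal language model on question-answer pairs flattened into plain text. The model choice, toy examples, and hyperparameters are assumptions; the study’s exact fine-tuning recipe is not reproduced here.

```python
# A minimal supervised fine-tuning sketch with the Hugging Face Trainer, assuming
# MedQA/MedMCQA-style items have already been flattened into plain-text records.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"  # assumption: any of the four evaluated models could be swapped in
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy stand-ins for the combined MedQA + MedMCQA training examples.
examples = [
    {"text": "Question: Which vitamin deficiency causes scurvy?\n"
             "Options: (A) Vitamin A (B) Vitamin B12 (C) Vitamin C (D) Vitamin D\n"
             "Answer: (C) Vitamin C"},
    {"text": "Question: Which electrolyte disturbance is most associated with peaked T waves?\n"
             "Options: (A) Hypokalemia (B) Hyperkalemia (C) Hyponatremia (D) Hypercalcemia\n"
             "Answer: (B) Hyperkalemia"},
]
dataset = Dataset.from_list(examples)

def tokenize(batch):
    # Truncate long items; padding is handled dynamically by the data collator.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal LM objective

args = TrainingArguments(
    output_dir="medqa-finetune",        # illustrative output directory
    per_device_train_batch_size=1,
    num_train_epochs=1,
    learning_rate=2e-5,                 # arbitrary placeholder hyperparameters
    logging_steps=10,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```

In practice, the same script can be pointed at each of the four models in turn, which is what keeping the training data and code fixed across models amounts to.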

For the MultiMedQA Long Form Question Answering task, the researchers provided health-related queries sourced from search engines, encompassing three datasets: LiveQA, MedicationQA, and HealthSearchQA, totaling four thousand questions. These questions span a wide array of health topics, necessitating detailed responses resembling those found on health-related FAQ pages.
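For this long-form task, evaluation rests on free-text generation rather than option selection. The brief sketch below, using an assumed instruction-tuned checkpoint, an illustrative question, and arbitrary decoding settings, shows how such an open-ended answer might be produced.

```python
# Hypothetical sketch of generating a long-form answer to a consumer health question
# of the kind found in LiveQA, MedicationQA, and HealthSearchQA. Model, prompt wording,
# and decoding parameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed instruction-tuned variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

question = "What are common side effects of metformin?"
prompt = ("Answer the following consumer health question in a few clear paragraphs.\n"
          f"Question: {question}\nAnswer:")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=300, do_sample=False)

# Print only the newly generated answer, not the echoed prompt.
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```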

The implications of these findings are significant for the biomedical field. Mistral 7B emerged as the standout performer on both tests, showcasing its prowess in clinical question-answering and making it a strong choice where sufficient computational resources are available. BioMedLM exhibited respectable performance despite being far smaller than the 7B models, and BioGPT-large also proved adequate. Nevertheless, the domain-specific models trained on PubMed underperformed relative to the larger models trained on general English text, though incorporating the PubMed corpus into the training of such larger models could yield further gains. Whether a bigger biomedical specialty model could surpass Mistral 7B remains an open question; in the meantime, the findings underscore the importance of expert medical review before deploying these models in clinical settings.

Conclusion:

The research demonstrates the evolving landscape of language models in clinical question-answering tasks. While larger models exhibit impressive performance, concerns regarding cost, sustainability, and access persist. The emergence of on-device AI offers a viable solution, but careful consideration of model performance and expert medical oversight are imperative for meaningful deployment in the market.
