Google’s AI for Medicine Demonstrates Over 90% Accuracy in Clinical Answers

TL;DR:

  • AI researchers at Alphabet Inc.’s Google developed Med-PaLM, an AI model for healthcare.
  • Med-PaLM aims to revolutionize healthcare by providing quick access to medical knowledge for physicians.
  • The model has achieved a 92.6% accuracy rating in line with scientific consensus in medical question responses.
  • Med-PaLM has the potential to extend care to underserved populations and assist with clinical documentation.
  • Concerns remain regarding the deployment of AI systems in real healthcare settings without comprehensive evaluation and customization.

Main AI News:

In February 2022, two esteemed AI researchers at Alphabet Inc.’s Google engaged in a thought-provoking conversation, delving into the vast potential of artificial intelligence within the realm of healthcare. Alan Karthikesalingam and Vivek Natarajan discussed the adaptation of Google’s existing AI models to medical settings over a delightful dinner, extending their discourse for hours. As the evening concluded, Natarajan meticulously outlined a preliminary document highlighting the immense possibilities of large language models in healthcare, including invaluable research directions and the challenges they present.

This conversation set off a whirlwind of research activity at Google. The team’s efforts culminated in Med-PaLM, an AI model that promises to revolutionize the healthcare industry by giving physicians rapid access to medical knowledge to support their clinical decision-making. While large language models typically rely on gargantuan amounts of digital text, Karthikesalingam and Natarajan envisioned a system that would be exclusively trained on specialized medical knowledge. Google announced that the peer-reviewed research underpinning the model had been accepted by the scientific journal Nature, making it the first company to publish such research in the journal.

Notably, the published paper includes striking results. When Med-PaLM was presented with medical questions, a pool of clinicians rated 92.6% of its responses as in line with scientific consensus, just short of the 92.9% scored by real-life medical professionals. Nature’s statement clarified that the clinicians did not evaluate Med-PaLM as deployed in actual hospital settings, where real-life patient variables come into play. The study also found that 5.8% of the model’s responses had the potential to cause harm, compared with 6.5% of the clinicians’ responses.

Sarah West, the managing director of AI Now Institute, an esteemed policy research center, acknowledged the significance of publishing in a scientific journal. However, she emphasized that this alone is an insufficient criterion for determining the readiness of an AI system for real healthcare applications. Meaningful evaluation of such a system necessitates a thorough understanding of various factors, particularly at the individual hospital level, when customizing it for specific clinical settings.

Med-PaLM remains in its infancy. Google has recently begun granting access to a select group of healthcare and life science organizations for testing purposes, underscoring that the model is not yet ready for use in patient care. The researchers at Google envision a future where Med-PaLM will serve as an expert source for doctors, offering insights on unfamiliar cases, easing the arduous task of clinical documentation, and extending care to those who would otherwise go without healthcare services.

Karan Singhal, a software engineer who contributed to the project, posed a pivotal question: “Can we catalyze the medical AI community to think seriously about the potential of foundation models for healthcare?” This query embodied their guiding North Star, propelling them forward in their pursuit of groundbreaking innovation.

In March, Google announced the second iteration of Med-PaLM, which scored 86.5% when answering US medical licensing-style questions, up from the first version’s 67%. The evaluation of the first generation involved nine clinicians from the UK, the US, and India, while the second version was evaluated by 15 physicians.

The race for supremacy in artificial intelligence is fiercely contested by Google and OpenAI, the Microsoft Corp.-backed startup. The medical field serves as an arena for their competitive rivalry, with medical systems exploring the application of OpenAI’s technology, as reported by the Wall Street Journal. Similarly, Google has embarked on collaborations with the Mayo Clinic, further highlighting its commitment to advancing healthcare through Med-PaLM.

Both Karthikesalingam and Natarajan harbored long-standing aspirations of integrating AI into healthcare. Karthikesalingam, having begun his career as a physician, yearned for an AI model that could complement his work. Meanwhile, Natarajan’s upbringing in parts of India, where access to doctors was limited, instilled in him a fervent desire to bridge the healthcare gap.

Tao Tu, one of the researchers on the team, admitted his initial skepticism about the ambitious timeline the group had set. However, the team defied expectations, dedicating themselves to a grueling five-week sprint that stretched across Thanksgiving and Christmas, with workdays lasting up to 15 hours. By the end of this intense period, they had built Med-PaLM, the first-generation model, and they unveiled it to the world in December.

The rapid advancements in technology served as a powerful motivator for the team, pushing them to strive for unparalleled progress. Along their journey, they began to comprehend the immense significance of their creation. After some initial adjustments, the model achieved an impressive 63% score on the medical licensing exam, surpassing the threshold for success. Dr. Karthikesalingam, a practicing physician himself, initially found it easy to distinguish between the model’s responses and those of clinicians. However, as the project neared its completion, he admitted that discerning between the two became increasingly challenging.

While AI algorithms already play a role in specific healthcare tasks such as medical imaging and predicting the risk of sepsis in hospitalized patients, generative AI models introduce new risks, as acknowledged by Google. These models have the potential to disseminate medical misinformation convincingly or amplify existing health disparities through the integration of biases.

To mitigate these risks, the researchers behind Med-PaLM subjected their AI model to “adversarial testing.” They carefully curated a set of questions designed to elicit AI-generated answers that could potentially cause harm or exhibit bias. These questions covered sensitive medical topics such as Covid-19, mental health, and health equity, with particular attention to racial biases in healthcare.

Google stated that Med-PaLM 2 produced answers that were more frequently deemed to have a “low risk of harm” compared to its predecessor. However, the model showed no significant improvement in its ability to generate accurate and relevant information.

During the testing phase, Shek Azizi, a senior research scientist at Google, found that Med-PaLM sometimes hallucinated: when summarizing a patient chart or delivering clinical information, it referred to studies that did not exist or had not been provided.

The inclination of large language models to provide compelling yet incorrect answers raises concerns about their use in domains where truth and accuracy are paramount, particularly in life-or-death situations. Meredith Whittaker, president of the Signal Foundation and a former Google manager, expressed unease about deploying this technology in settings where incentives are already geared toward minimizing care and reducing healthcare expenditure for those in need.

In a demonstration for Bloomberg reporters, Google showcased an experimental chatbot interface for Med-PaLM 2. Users could explore various medical conditions, including “incontinence,” “loss of balance,” and “acute pancreatitis,” and see the AI model’s descriptions along with evaluation results. Clinicians’ real descriptions of the issues were provided for comparison.

At Google’s annual I/O developers conference in May, it was announced that Med-PaLM 2 was being developed to leverage information from both images and text, aiming to enhance patient outcomes by facilitating the interpretation of X-rays and mammograms. The experimental interface prompt exemplified this capability, requesting a report to summarize a chest X-ray.

While Med-PaLM’s performance in a genuine clinical setting remains uncertain, its AI-generated response appeared remarkably comprehensive and convincing. It commented, “The lung fields are clear without consolidation or edema, the mediastinum is otherwise unremarkable. The cardiac silhouette is within normal limits for size, with no effusion or pneumothorax noted. No displaced fractures are evident.”

Conclusion:

Google’s Med-PaLM represents a significant breakthrough in the healthcare market, showcasing the potential of AI models in assisting physicians and improving patient outcomes. The impressive accuracy rating and the model’s ability to provide expert insights in unfamiliar cases highlight its value in the medical field. However, careful consideration must be given to evaluating and customizing AI systems for individual healthcare settings to ensure their safe and effective integration.
