- Georgia Tech researchers find chatbots are less accurate when answering health queries in Spanish, Chinese, and Hindi than in English.
- Their study suggests non-English speakers should be cautious about relying on chatbots for healthcare advice.
- The XLingEval framework emphasizes the need for better accuracy, correctness, consistency, and reliability in non-English languages.
- The XLingHealth dataset aims to improve chatbot performance by deepening the pool of multilingual source data.
- Testing reveals significant disparities in chatbot performance across languages, highlighting the need for improvement.
Main AI News:
Researchers from the Georgia Institute of Technology unveil concerning findings regarding the accuracy of chatbots when responding to health inquiries in languages other than English. According to a study led by Ph.D. students Mohit Chandra and Yiqiao (Ahren) Jin from the College of Computing at Georgia Tech, chatbots exhibit reduced accuracy in Spanish, Chinese, and Hindi compared to English when handling health-related questions.
The research, titled “Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries,” introduces a novel framework designed to evaluate the performance of large language models (LLMs) in diverse linguistic contexts. Available as a preprint on arXiv, the paper sheds light on the limitations and potential of LLMs in addressing health-related queries.
Chandra and Jin caution non-English speakers against relying on chatbots like ChatGPT for essential healthcare advice. Their XLingEval framework emphasizes the need for improved accuracy, correctness, consistency, and reliability in languages other than English. They propose deepening the data pool with multilingual sources and advocate adopting their XLingHealth benchmark to improve model performance.
The study reveals significant disparities in the performance of chatbots across languages:
- Correctness diminishes by 18% when questions are posed in Spanish, Chinese, or Hindi.
- Responses in non-English languages exhibit a 29% decrease in consistency compared to their English counterparts.
- Non-English responses are 13% less verifiable overall.
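To make the arithmetic behind figures like these concrete, here is a minimal Python sketch of two common ways to express a drop in a score. The article does not say whether the reported percentages are relative decreases or percentage-point differences, so both are shown. The function names and the example scores are hypothetical and are not taken from the study.

```python
def relative_drop(english_score: float, other_score: float) -> float:
    """Percentage decrease of other_score relative to english_score."""
    if english_score <= 0:
        raise ValueError("english_score must be positive")
    return (english_score - other_score) / english_score * 100.0


def point_drop(english_score: float, other_score: float) -> float:
    """Difference in percentage points between two scores given on a 0-1 scale."""
    return (english_score - other_score) * 100.0


# Hypothetical scores for illustration only (NOT results from the study).
english = 0.80
spanish = 0.60

print(f"Relative drop: {relative_drop(english, spanish):.1f}%")
print(f"Percentage-point drop: {point_drop(english, spanish):.1f} points")
```

The two measures can differ noticeably for the same pair of scores, which is why it helps to know which one a reported figure uses.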
To address these challenges, the researchers introduce XLingHealth, a dataset comprising question-answer pairs aimed at improving chatbot performance. This dataset includes health-related content sourced from reputable platforms such as Patient and the U.S. National Institutes of Health (NIH).
In extensive testing, the researchers posed over 2,000 medical queries to ChatGPT-3.5 and MedAlpaca, a healthcare-oriented chatbot trained on medical literature. Alarmingly, more than 67% of MedAlpaca’s responses to non-English questions were deemed irrelevant or contradictory. Chandra notes that while both ChatGPT and MedAlpaca faced challenges, the former outperformed the latter because of its exposure to training data in multiple languages.
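A share such as "more than 67% of responses were irrelevant or contradictory" comes down to tallying labeled responses. The sketch below shows one simple way to compute that kind of share in Python. The label names, the `problem_share` helper, and the sample data are assumptions made for illustration, and this is not the researchers' actual evaluation code.

```python
from collections import Counter


def problem_share(labels: list[str], problem_labels: set[str]) -> float:
    """Return the percentage of labels that fall into any of problem_labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    flagged = sum(counts[name] for name in problem_labels)
    return flagged / len(labels) * 100.0


# Hypothetical labels for five responses (NOT data from the study).
sample_labels = [
    "relevant",
    "irrelevant",
    "contradictory",
    "relevant",
    "irrelevant",
]

share = problem_share(sample_labels, {"irrelevant", "contradictory"})
print(f"{share:.1f}% of responses flagged as irrelevant or contradictory")
```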
The study focuses on Spanish, Chinese, and Hindi because they are the world’s most spoken languages after English, a choice that also reflects the researchers’ personal interests and backgrounds. Jin highlights observations made by non-native English speakers, which underscore the importance of addressing linguistic disparities in chatbot performance.
This research underscores the critical need for advancements in chatbot technology to ensure accurate and reliable healthcare information is accessible across linguistic boundaries. As the field progresses, initiatives like XLingHealth offer promising avenues for enhancing the effectiveness of chatbots in diverse language contexts.
Conclusion:
As the market for healthcare chatbots continues to expand globally, investment in accuracy and reliability across languages will be essential to ensure that diverse populations have equitable access to quality healthcare information.