Optimal LLM Selection: Galileo’s Hallucination Index Guides AI Applications

TL;DR:

  • Galileo introduces the Hallucination Index to help users select the most suitable Large Language Models (LLMs) for their applications.
  • LLMs are gaining prominence, but hallucinations remain a significant challenge in AI.
  • The index evaluates 11 LLMs, highlighting top performers for different AI task types.
  • OpenAI models dominate in hallucination resistance across various tasks.
  • Cost-saving opportunities exist by opting for lower-cost LLM versions or open-source models.
  • Galileo’s proprietary evaluation metrics achieve 87% accuracy in detecting hallucinations.

Main AI News:

In a pivotal stride toward mitigating the challenges of hallucinations in Large Language Models (LLMs), Galileo, a prominent machine learning (ML) company specializing in unstructured data, has unveiled the Hallucination Index. This pioneering tool, developed by Galileo Labs, serves as a compass for users seeking to identify the most suitable LLMs for their specific applications.

The year 2023 has unequivocally marked the ascent of LLMs, captivating the attention of individual developers and Fortune 50 enterprises alike. Yet, amidst this burgeoning enthusiasm, two fundamental truths have surfaced. Firstly, LLMs do not conform to a one-size-fits-all paradigm. Secondly, the challenge of hallucinations continues to cast a long shadow over the widespread adoption of LLM technology.

Atindriyo Sanyal, co-founder and CTO of Galileo, elaborates on this issue, asserting, “To facilitate the discernment of suitable LLMs for diverse applications, Galileo Labs has meticulously crafted a ranking system for popular LLMs. This ranking hinges on our proprietary hallucination assessment metrics: Correctness and Context Adherence. We anticipate that this effort will not only illuminate the LLM landscape but also assist teams in making informed decisions when selecting the ideal LLM for their unique use cases.”

As businesses of all sizes fervently explore LLM-based applications, the specter of hallucinations has emerged as a formidable obstacle to generating accurate and dependable responses. Hallucinations in AI are characterized by the creation of information that initially appears plausible but subsequently proves inaccurate or incongruent with the context.

In response to this pressing challenge, Galileo Labs has introduced the Hallucination Index. This innovative tool evaluates 11 LLMs from various sources, including OpenAI (GPT-4-0613, GPT-3.5-turbo-1106, GPT-3.5-turbo-0613, and GPT-3.5-turbo-instruct), Meta (Llama-2-70b, Llama-2-13b, and Llama-2-7b), TII UAE (Falcon-40b-instruct), MosaicML (MPT-7b-instruct), Mistral AI (Mistral-7b-instruct), and Hugging Face (Zephyr-7b-beta). Each LLM is meticulously assessed for its susceptibility to hallucinations across common generative AI task types.
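To make that setup concrete, the sketch below shows one way such a benchmark harness could be structured: loop over the listed models and task types, and average a judge’s score per pair. The `query_model` and `judge_correctness` functions are hypothetical stand-ins for real inference APIs and Galileo’s proprietary metrics, so this illustrates only the shape of the evaluation, not Galileo’s actual code.

```python
# Minimal sketch of a Hallucination Index-style evaluation harness.
# Model names come from the article; query_model and judge_correctness are
# hypothetical placeholders for real API calls and proprietary metrics.
from statistics import mean

MODELS = [
    "gpt-4-0613", "gpt-3.5-turbo-1106", "gpt-3.5-turbo-0613", "gpt-3.5-turbo-instruct",
    "llama-2-70b", "llama-2-13b", "llama-2-7b",
    "falcon-40b-instruct", "mpt-7b-instruct", "mistral-7b-instruct", "zephyr-7b-beta",
]
TASK_TYPES = ["qa_without_rag", "qa_with_rag", "long_form_generation"]


def query_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around whichever inference API serves `model`."""
    raise NotImplementedError


def judge_correctness(prompt: str, answer: str) -> float:
    """Hypothetical judge returning a 0-1 score (Galileo uses Correctness / Context Adherence)."""
    raise NotImplementedError


def score_model(model: str, prompts: list[str]) -> float:
    """Average the judge's score over a benchmark of prompts for one task type."""
    return mean(judge_correctness(p, query_model(model, p)) for p in prompts)


def build_index(prompts_by_task: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Score every (model, task type) pair, mirroring the index's per-task rankings."""
    return {
        (model, task): score_model(model, prompts)
        for model in MODELS
        for task, prompts in prompts_by_task.items()
    }
```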

The key revelations from this comprehensive assessment include:

  1. Question & Answer without Retrieval-Augmented Generation (RAG): In this evaluation, OpenAI’s GPT-4 takes the lead, boasting a remarkable Correctness Score of 0.77. This underscores its superiority in general knowledge applications and its minimal propensity for hallucinations. For this task type, GPT-4-0613 is the recommended choice for reliable and precise AI performance.
  2. Question & Answer with RAG: OpenAI’s GPT-4-0613 continues to shine, emerging as the top performer with a Context Adherence score of 0.76. Surprisingly, Hugging Face’s Zephyr-7b, an open-source model, outperforms Meta’s larger Llama-2-70b, challenging the conventional belief that bigger models equate to superior performance. For this task type, GPT-3.5-turbo-0613 is the recommended choice (an illustrative grounding check in the spirit of Context Adherence is sketched after this list).
  3. Long-form Text Generation: Once again, OpenAI’s GPT-4-0613 stands out, exhibiting a Correctness Score of 0.83 and a strong resistance to hallucinations. Meta’s open-source Llama-2-70b-chat presents itself as an efficient alternative for this task. For a balance of cost and performance in Long-form Text Generation, Llama-2-70b-chat is the recommended choice.
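Context Adherence asks, in essence, whether a RAG answer stays within the retrieved evidence. The snippet below is a deliberately crude, self-contained proxy for that idea: it measures what fraction of the answer’s content words also appear in the retrieved context. It is not Galileo’s metric (which is LLM-based), and the example strings are invented purely for illustration.

```python
# Crude lexical proxy for a "context adherence" style grounding check on RAG outputs.
# Real metrics (like Galileo's) use LLM-based judgments; this only shows the shape of the check.
import re


def content_words(text: str) -> set[str]:
    """Lowercase alphanumeric tokens longer than 3 characters."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) > 3}


def adherence_proxy(answer: str, context: str) -> float:
    """Share of the answer's content words that are grounded in the context (0-1)."""
    answer_terms = content_words(answer)
    if not answer_terms:
        return 1.0
    return len(answer_terms & content_words(context)) / len(answer_terms)


# Illustrative usage with invented strings: the first answer is grounded, the second drifts.
context = "The Hallucination Index evaluates 11 LLMs across three task types."
print(adherence_proxy("The index evaluates 11 LLMs.", context))                # 1.0
print(adherence_proxy("The index covers 200 proprietary datasets.", context))  # 0.25
```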

Furthermore, the dominance of OpenAI models across various task types is evident. However, this superiority comes at a price, as OpenAI’s API-based pricing model can lead to escalating costs in the development of Generative AI products. Organizations can explore cost-saving opportunities by opting for lower-cost versions of OpenAI’s models, such as GPT-3.5-turbo. The most significant cost savings, though, are attainable by embracing open-source models.

For Long-form Text Generation tasks, Meta’s open-source Llama-2-13b-chat emerges as a commendable alternative to OpenAI’s models. In the realm of Question & Answer with RAG tasks, users can confidently explore Hugging Face’s nimble yet potent Zephyr model, which offers a significantly lower inference cost compared to GPT-3.5 Turbo.

The analyses provided in the Hallucination Index are fortified by Galileo’s proprietary evaluation metrics, Correctness and Context Adherence, powered by ChainPoll, a hallucination detection methodology developed by Galileo Labs. These metrics boast an 87% accuracy rate in detecting hallucinations, providing teams with a dependable means of automatically identifying hallucination risks, thereby saving valuable time and resources typically expended on manual evaluations.
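ChainPoll itself is Galileo’s own methodology, but its name points to a general pattern: poll a judge model several times with chain-of-thought prompting and aggregate the votes into a hallucination score. The sketch below illustrates that pattern under those assumptions; `ask_judge`, the prompt wording, and the five-poll default are hypothetical placeholders, not Galileo’s implementation.

```python
# Sketch of a chain-of-thought-plus-polling hallucination check, in the spirit of the
# polling idea described above. ask_judge is a hypothetical stand-in for a real
# chat-completion call made at non-zero temperature so the polls can differ.
JUDGE_PROMPT = (
    "Think step by step, then answer YES or NO on the last line:\n"
    "Does the RESPONSE contain claims not supported by the CONTEXT?\n\n"
    "CONTEXT:\n{context}\n\nRESPONSE:\n{response}\n"
)


def ask_judge(prompt: str) -> str:
    """Hypothetical call to a judge LLM; returns its raw completion text."""
    raise NotImplementedError


def polling_hallucination_score(response: str, context: str, n_polls: int = 5) -> float:
    """Fraction of judge completions that flag unsupported claims (higher = riskier)."""
    prompt = JUDGE_PROMPT.format(context=context, response=response)
    votes = []
    for _ in range(n_polls):
        verdict = ask_judge(prompt).strip().splitlines()[-1].upper()
        votes.append(1.0 if "YES" in verdict else 0.0)
    return sum(votes) / n_polls
```

Averaging several independent judgments, rather than trusting a single one, is what trades extra inference cost for a more stable risk estimate.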

Conclusion:

The Hallucination Index by Galileo provides valuable insights into LLM selection, highlighting OpenAI’s models as strong performers. This signifies a growing demand for reliable and accurate AI solutions, with potential cost-saving opportunities in choosing the right LLM versions. As the AI market continues to expand, organizations should prioritize LLMs with low hallucination risks for enhanced performance and reliability.

Source