- Scale AI releases its first SEAL Leaderboards, evaluating large language model (LLM) performance.
- OpenAI’s GPT models excel in three out of four domains, with Anthropic PBC’s Claude 3 Opus leading in the fourth.
- Rankings aim to address AI performance opacity, leveraging exclusive datasets and rigorous evaluation methods.
- SEAL Leaderboards show GPT-4 Turbo Preview, GPT-4o, and Google’s Gemini 1.5 Pro sharing the lead in coding, with GPT-4o on top in instruction following.
- Anthropic’s Claude 3 Opus tops math abilities assessment, while some notable LLMs are absent from evaluations.
- Scale AI commits to updating rankings periodically, incorporating new models and expanding domain coverage.
Main AI News:
Scale AI Inc., a prominent provider of artificial intelligence training data to customers including OpenAI and Nvidia Corp., has released the results of its first SEAL Leaderboards.
The leaderboards introduce a new rating system for cutting-edge large language models, drawing on exclusive, carefully curated datasets to evaluate how well they handle common applications: generative AI coding, instruction following, math, and multilingual use.
In this first edition of the leaderboards, OpenAI’s GPT series of LLMs leads in three of the four domains, while Anthropic PBC’s Claude 3 Opus takes the top spot in the fourth. Google LLC’s Gemini models also perform strongly, sharing joint-leading positions with the GPT models in some domains.
According to Scale AI, the SEAL Leaderboards were created to counter the opacity around AI performance as LLMs proliferate for enterprise adoption. The rankings, built by Scale AI’s Safety, Evaluations, and Alignment Lab, aim to remain impartial and credible by withholding the specific evaluation prompts used to assess the LLMs.
Scale AI acknowledges existing efforts to rank LLMs, such as MLCommons’ benchmarks and Stanford HAI’s transparency index, but argues that its expertise in AI training data gives it a unique advantage in overcoming challenges that AI researchers face: a scarcity of high-quality evaluation datasets immune to contamination, inconsistent reporting practices, unverified evaluator credentials, and inadequate tools for interpreting evaluation results.
To protect the integrity of its rankings, SEAL built proprietary evaluation datasets and relies on assessments designed by vetted domain experts. Both the prompt selection and the resulting rankings are reviewed for reliability, and transparency is maintained by publishing a detailed methodology describing the evaluation process.
In the Scale Coding evaluation, each model is compared head-to-head against the others on randomly chosen prompts at least 50 times to ensure the results are reliable. The evaluation measures each model’s ability to generate computer code, and the leaderboard shows a joint lead shared by OpenAI’s GPT-4 Turbo Preview and GPT-4o and Google’s Gemini 1.5 Pro (Post I/O).
The shared top position reflects Scale AI’s 95% confidence intervals on the scores, which overlap for the leading models. Within those intervals, GPT-4 Turbo Preview holds a slight edge with a score of 1155, followed by GPT-4o at 1144 and Gemini 1.5 Pro (Post I/O) at 1112.
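Scale AI has not released the code behind this scoring, but the pairwise head-to-head format it describes resembles the Elo-style ratings used elsewhere for LLM comparisons. The sketch below is purely illustrative, assuming a simple Elo update and bootstrapped 95% confidence intervals over invented head-to-head results; the model names and win counts are hypothetical. It shows how overlapping intervals can lead a leaderboard to report a shared top position even when point scores differ.

```python
import random

# Illustrative sketch only: Scale AI has not published its scoring pipeline, so the
# Elo-style update, the bootstrap, and the head-to-head results below are assumptions
# made for demonstration, not SEAL's actual method.

def elo_ratings(matches, k=32, base=1000):
    """Compute Elo-style ratings from a list of (winner, loser) pairs."""
    ratings = {}
    for winner, loser in matches:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        expected_w = 1 / (1 + 10 ** ((rl - rw) / 400))  # expected win probability
        ratings[winner] = rw + k * (1 - expected_w)
        ratings[loser] = rl - k * (1 - expected_w)
    return ratings

def bootstrap_interval(matches, model, n_resamples=1000, level=0.95):
    """Bootstrap a confidence interval for one model's rating by resampling matches."""
    samples = []
    for _ in range(n_resamples):
        resample = [random.choice(matches) for _ in matches]
        samples.append(elo_ratings(resample).get(model, 1000))
    samples.sort()
    lo = samples[int(n_resamples * (1 - level) / 2)]
    hi = samples[int(n_resamples * (1 + level) / 2) - 1]
    return lo, hi

# Hypothetical pairwise judgments on coding prompts (SEAL uses at least 50 per pairing).
matches = (
    [("gpt-4-turbo-preview", "gemini-1.5-pro")] * 30
    + [("gemini-1.5-pro", "gpt-4-turbo-preview")] * 25
    + [("gpt-4o", "gemini-1.5-pro")] * 28
    + [("gemini-1.5-pro", "gpt-4o")] * 27
    + [("gpt-4-turbo-preview", "gpt-4o")] * 26
    + [("gpt-4o", "gpt-4-turbo-preview")] * 24
)

ratings = elo_ratings(matches)
for model, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    lo, hi = bootstrap_interval(matches, model)
    # Overlapping intervals are what justifies reporting a shared top position.
    print(f"{model}: {score:.0f} (95% CI ~ {lo:.0f}-{hi:.0f})")
```

In a setup like this, the point estimates can differ by tens of points while the bootstrapped intervals still overlap, which is the kind of reasoning the SEAL coding leaderboard appears to apply when it declares a tie at the top.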
In the Multilingual category, GPT-4o and Gemini 1.5 Pro (Post I/O) share the lead with scores of 1139 and 1129, respectively, while GPT-4 Turbo and Gemini 1.5 Pro (Pre I/O) share third place.
GPT-4o also takes first place in the Instruction Following domain with a score of 88.57, followed by GPT-4 Turbo Preview at 87.64. The results suggest Google has room for improvement here, with Meta Platforms Inc.’s Llama 3 70b Instruct and Mistral’s Mistral Large Latest trailing close behind.
Lastly, Scale AI assesses the models’ mathematical ability, where Anthropic’s Claude 3 Opus is the standout performer with a score of 95.19, taking first place narrowly ahead of GPT-4 Turbo Preview at 95.10 and GPT-4o at 94.85.
While these comparisons offer valuable insights, they are not a complete picture: several high-profile LLMs are absent from the evaluations, including AI21 Labs Inc.’s Jurassic and Jamba, Cohere Inc.’s Aya and Command LLMs, and xAI Corp.’s Grok models.
However, Scale AI says it will address the gaps in the SEAL Leaderboards, pledging periodic updates to keep the rankings current. The company plans to add new-generation models as they emerge and to expand the domains covered, with the goal of becoming the industry’s leading impartial evaluator of LLMs.
Conclusion:
Scale AI’s introduction of the SEAL Leaderboards marks a significant step towards transparency and benchmarking in the AI market. The dominance of OpenAI’s GPT models underscores their continued relevance and excellence in various domains, while the absence of certain high-profile LLMs suggests room for further evaluation and inclusion. With Scale AI’s commitment to ongoing updates and expansion, these leaderboards are poised to become a crucial reference point for companies navigating the complex landscape of AI model selection and deployment.