- Apple introduces KGLENS, a framework for evaluating alignment between Knowledge Graphs (KGs) and large language models (LLMs).
- KGLENS uses a Thompson sampling-inspired method with a parameterized knowledge graph (PKG) to efficiently identify LLMs’ knowledge gaps.
- A graph-guided question generator, powered by GPT-4, creates fact-checking and fact-QA questions, minimizing answer ambiguity.
- Human evaluations find 97.7% of generated questions to be clear and understandable.
- KGLENS updates the PKG iteratively, refining the probing process until convergence.
- The framework demonstrates effectiveness across various sampling methods and LLMs, highlighting the performance gap between models.
- GPT-4 family models outperform others, while older models like GPT-3.5-turbo lag in specific scenarios.
Main AI News:
Apple researchers have introduced KGLENS, a cutting-edge knowledge probing framework designed to assess the alignment between Knowledge Graphs (KGs) and Large Language Models (LLMs) and to pinpoint the knowledge gaps in LLMs. KGLENS leverages a Thompson sampling-inspired method, incorporating a parameterized knowledge graph (PKG) to probe these models efficiently. A standout feature is its graph-guided question generator, which transforms KGs into natural language queries using GPT-4, producing two distinct types of questions (fact-checking and fact-QA) to minimize response ambiguity. According to human evaluators, 97.7% of these generated questions are clear and understandable.
KGLENS employs a novel strategy to probe LLMs' knowledge efficiently, coupling a PKG with a Thompson sampling-inspired approach. The process begins by initializing a PKG in which each edge is augmented with a beta distribution that models how likely the LLM is to be deficient on that fact. The framework then samples edges in proportion to these estimated failure probabilities, formulates questions from the sampled edges, and evaluates the LLM through a question-answering task. With each iteration, the PKG is updated based on the LLM's successes and failures, refining the probing process until it converges (a sketch of this loop appears below). The graph-guided question generator is integral to this framework, converting KG edges into natural language questions via GPT-4. These questions are categorized into Yes/No questions and Wh-questions, with the type dictated by the graph's structure. Additionally, entity aliases are incorporated to ensure clarity and reduce potential ambiguity.
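To make the probing loop concrete, here is a minimal Python sketch of one Thompson sampling-style round over a toy PKG, assuming a simple edge representation; `PKGEdge`, `generate_question`, and `ask_llm` are hypothetical stand-ins for KGLENS's internals rather than the paper's actual code.

```python
import random
from dataclasses import dataclass

@dataclass
class PKGEdge:
    """One KG edge with beta-distribution parameters over LLM failure."""
    subject: str
    relation: str
    obj: str
    alpha: float = 1.0  # pseudo-count of observed LLM failures
    beta: float = 1.0   # pseudo-count of observed LLM successes

def generate_question(edge: PKGEdge) -> str:
    # Stand-in for the GPT-4-powered generator; the real system chooses
    # Yes/No vs. Wh-questions from graph structure and uses entity aliases.
    return f"Is it true that {edge.subject} {edge.relation} {edge.obj}?"

def ask_llm(question: str) -> bool:
    # Stand-in: query the LLM under test and grade its response.
    raise NotImplementedError

def probe(edges: list[PKGEdge], rounds: int, batch_size: int) -> None:
    for _ in range(rounds):
        # Thompson sampling: draw a failure probability for each edge from
        # its Beta(alpha, beta) posterior, then probe the edges that look
        # most likely to expose a knowledge gap.
        draws = [(random.betavariate(e.alpha, e.beta), e) for e in edges]
        draws.sort(key=lambda pair: pair[0], reverse=True)
        for _, edge in draws[:batch_size]:
            if ask_llm(generate_question(edge)):
                edge.beta += 1.0   # success: edge looks less like a gap
            else:
                edge.alpha += 1.0  # failure: edge looks more like a gap
```

Edges the model keeps failing accumulate larger alpha values and are therefore sampled more often in later rounds, which is what drives the iterative refinement toward convergence described above.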
To verify answers, KGLENS directs LLMs to generate responses in specific formats and uses GPT-4 to check the accuracy of Wh-question responses. The framework's effectiveness is validated across various sampling methods, demonstrating its ability to identify knowledge gaps in LLMs across multiple topics and relationships.
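As an illustration of the verification step, the sketch below uses GPT-4 as a judge for Wh-question responses via the openai Python client; the prompt wording and the `check_wh_answer` helper are assumptions for illustration, not KGLENS's published implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def check_wh_answer(question: str, gold_answers: list[str], response: str) -> bool:
    """Ask GPT-4 whether `response` matches any accepted answer or alias."""
    prompt = (
        f"Question: {question}\n"
        f"Accepted answers (including aliases): {', '.join(gold_answers)}\n"
        f"Candidate response: {response}\n"
        "Reply with exactly one word: correct or incorrect."
    )
    judgment = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return judgment.choices[0].message.content.strip().lower() == "correct"
```

Yes/No questions, by contrast, can presumably be graded by simple string matching on the constrained response format, which would explain why the GPT-4 judge is reserved for the open-ended Wh-questions.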
KGLENS’ evaluation across multiple LLMs reveals a consistent performance advantage for the GPT-4 family over other models. GPT-4, GPT-4o, and GPT-4-turbo exhibit similar performance levels, with GPT-4o demonstrating greater caution in handling personal information. A notable performance gap is observed between GPT-3.5-turbo and GPT-4, with GPT-3.5-turbo occasionally underperforming compared to older LLMs due to its conservative nature. Legacy models like Babbage-002 and Davinci-002 show only marginal improvements over random guessing, underscoring the significant advancements made in recent LLMs. The evaluation sheds light on various error types and model behaviors, highlighting the diverse capabilities of LLMs in navigating different knowledge domains and difficulty levels.
Conclusion:
The introduction of KGLENS marks a significant advancement in evaluating LLMs, providing a robust framework for identifying knowledge gaps and improving model alignment with KGs. For the market, this development underscores the rapid evolution of AI and the growing need for tools that can accurately assess and enhance LLM performance. Companies leveraging LLMs will find KGLENS particularly valuable in fine-tuning their models, ensuring they stay competitive in an increasingly data-driven landscape. As LLMs continue to be integrated into various industries, detecting and addressing knowledge deficiencies will be crucial for maintaining the reliability and trustworthiness of AI applications.