LEAN-GitHub Dataset: Transforming Automated Theorem Proving with Large-Scale Data

  • The complexity of theorem proving in mathematics is increasing, challenging traditional methods.
  • Large language models (LLMs) have potential but are limited by insufficient data.
  • Modern proof assistants like Coq, Isabelle, and Lean have advanced automated theorem proving (ATP).
  • Early ATP approaches used traditional methods; recent techniques leverage deep transformer-based models.
  • LEAN-GitHub is a new, extensive dataset developed by researchers from The Chinese University of Hong Kong.
  • The dataset includes Lean repositories from GitHub, enhancing data available for training theorem-proving models.
  • Key steps in dataset construction included repository selection, overcoming compilation challenges, and refining extraction processes.
  • LEAN-GitHub dataset features diverse mathematical topics and advanced problems.
  • InternLM2-StepProver, trained on LEAN-GitHub, shows state-of-the-art performance across various benchmarks.

Main AI News:

The complexity of theorem proving in mathematics continues to escalate, creating significant hurdles for traditional methods. Systems like Lean, Isabelle, and Coq facilitate computer-verifiable proofs, but their development is often labor-intensive. While large language models (LLMs) have shown potential in addressing high-school-level mathematical problems through proof assistants, their capabilities remain limited due to insufficient data. Formal proof languages require specialized expertise, resulting in a scarcity of comprehensive training corpora. Unlike conventional programming languages, formal proofs include nuanced intermediate data, rendering standard language corpora ineffective for training purposes. Despite the availability of valuable human-generated corpora, automated formalization processes cannot entirely replace the richness and diversity of manually crafted data.

Recent advancements in theorem-proving techniques have significantly evolved with the advent of modern proof assistants such as Coq, Isabelle, and Lean. These tools have extended formal systems beyond basic logic, fostering increased interest in automated theorem proving (ATP). The integration of large language models has further propelled this field forward. Early ATP approaches utilized traditional techniques like KNN and GNN, with some employing reinforcement learning. Contemporary methods leverage deep transformer-based models that treat theorems as text, incorporating systems such as GPT-f, PACT, and Llemma, which train models on proof states and tactics. Alternative strategies involve LLMs independently generating proofs or augmenting human-provided proofs. Data extraction tools are crucial for ATP, capturing intermediate states that are not visible in the code but are apparent during runtime. While existing tools support various proof assistants, Lean 4 tools face challenges due to their design limitations and the need for comprehensive extraction across multiple projects. Some methods also explore integrating informal proofs into formal proofs, expanding ATP research possibilities.

A team from The Chinese University of Hong Kong has introduced LEAN-GitHub, an extensive dataset designed to advance automated theorem proving. This innovative dataset complements the Mathlib dataset by providing a broad collection of Lean repositories on GitHub. The LEAN-GitHub dataset significantly enhances the data available for training theorem-proving models, addressing the limitations of previous datasets. The researchers implemented a scalable extraction pipeline to boost efficiency and parallelism, allowing for the effective utilization of previously uncompiled and unextracted Lean corpus. They also addressed the state duplication problem common in tree-proof search methods.

Key steps in the LEAN-GitHub dataset construction included:

  1. Repository Selection: The researchers identified 237 Lean 4 repositories on GitHub, with an estimated 48,091 theorems. After excluding 90 repositories with outdated Lean 4 versions, 147 repositories remained, of which 61 compiled without modifications.
  2. Compilation Challenges: Automated scripts were developed to find the nearest official releases for projects using non-standard Lean 4 versions. The team also addressed issues with isolated files in empty Lean projects.
  3. Source Code Compilation: The Leanc compiler was used directly, enabling the compilation of non-compliant Lean projects and isolated files. The team extended Lake’s import graph and created a custom compiling script to enhance parallelism.
  4. Extraction Process: Building on LeanDojo, the team improved data extraction for isolated files and restructured the process to increase parallelism, overcoming network and computational bottlenecks.
  5. Results: Out of 8,639 Lean source files, 6,352 and 42,000 theorems were extracted, resulting in a final dataset of 2,133 files and 28,000 theorems with valid tactic information.

The LEAN-GitHub dataset encompasses a wide range of mathematical areas, including logic, matroid theory, and arithmetic, and features cutting-edge topics and Olympiad-level problems. Compared to existing datasets, LEAN-GitHub offers a unique mix of human-written content, intermediate states, and varying complexity levels, making it an invaluable resource for advancing automated theorem proving.

InternLM2-StepProver, trained on the LEAN-GitHub dataset, demonstrates outstanding formal reasoning across various benchmarks. It achieves state-of-the-art results on miniF2F and ProofNet and solves advanced problems on PutnamBench. These achievements underscore the effectiveness of LEAN-GitHub in training sophisticated theorem-proving models.

Conclusion:

The introduction of the LEAN-GitHub dataset represents a significant advancement in the field of automated theorem proving. By providing a large-scale, diverse dataset of formal proofs, it addresses the limitations of existing data sources and enhances the training of advanced theorem-proving models. This development is likely to accelerate progress in mathematical reasoning and formal verification, creating opportunities for more effective and efficient problem-solving methods. The open-source nature of LEAN-GitHub also facilitates broader access to high-quality data, potentially driving innovation and further advancements in automated theorem proving across the academic and commercial sectors.

Source