AI over Large Formal Knowledge Bases: The First Decade
Josef Urban, Radboud University Nijmegen
Talk Structure
- AI for Formal Math or Formal Math for AI?
- The semantic AI Plan: Large KBs of formal thought + evolving AI tools
- Premise selection for Mizar/MML, Isabelle/HOL, HOL/Flyspeck, etc.
- ATP guidance: E, Prover9, BliStr
- Larger AI systems: MaLARea, MaLeCoP
- Some lessons learned so far
- Some challenges
The Plan
- Make large KBs of formal thought (Mizar/MML, HOL/Flyspeck, Isabelle) accessible to strong AI tools: DONE (or well under way)
- Test/Use/Evolve AI tools:
- deductive AI: first-order/higher-order/inductive ATPs, SMTs, decision procs.
- inductive AI: statistical learning tools (Bayesian, kernels, neural,...),
- inductive AI: semantic learning tools (ILP - Progol, Alef, ..; latent semantics - PCA; genetic methods, ... )
The Plan (continued)
- Build custom/combined inductive/deductive tools/metasystems:
- usually combining ATP/symbolic techniques with ML/statistical ideas
- (statistically governed, "greedy", ...) indexing, subsumption, abstraction, limited inference/induction, concept/lemma creation, strategy generation, etc.
- Continuously test performance
- Define harder AI tasks as the performance grows (prove theorems from more basic lemmas, develop whole theories, ...)
- GOTO 2
The Plan (continued)
- Corollary: It happens to be useful (Mizar, Isabelle, HOL, etc)
- Large-scale benchmarks crucial for feedback/progress (this is highly experimental science!)
- Define benchmarks: whole MML, MPTP Challenge, CASC LTB, MPTP2078, MZR@Turing
- Otherwise many seemingly "cool" ideas that subsume each other in unexpected ways
- Implementation matters and is the proof of the pudding
- Observation: This plan seems to have taken off in the last decade
- Enter the CASC LTB 2013 competition!! (Mizar, Isabelle, HOL, 1-day pre-training)
Corpora
- Mizar/MML: today about 50-100k formal math theorems/proofs (``Bourbaki'') written by humans (link to MML Query with the known theorems)
- An interesting AI repository, first translated for AI/ATP experiments in 2003
- Isabelle/HOL: about 20k theorems (but add AFP!), first translated for ATP experiments in ca. 2005
- Recent work to allow machine learning (2011-12: consistent naming & translation, proof export)
- HOL (Light): early translations (MESON, Metis) by Harrison and Hurd
- HOL/Flyspeck: about 20k theorems, translated for AI/ATPs recently (2012)
- More: SUMO, LogAnswer (full German Wikipedia in FOL), YAGO, ...
Premise Selection
- Early 2003: Can existing ATPs be used over the freshly translated Mizar library?
- About 80000 nontrivial math facts at that time.
- Is good premise selection possible at all?
- Or is it a mysterious power of mathematicians? (Penrose!)
- Today: Premise selection is not a mysterious property of mathematicians.
- Reasonably good algorithms started to appear (more below).
- Will extensive human (math) knowledge get obsolete?? (cf. Watson)
Example: Mizar Proof Advisor (2003)
- Train premise selection on all previous Mizar/MML proofs (50k)
- Recommend relevant premises when proving new conjectures
- About 70% coverage in the first 100 recommended premises
- Chain the recommendations with strong ATPs to get full proofs
- Used today also for Isabelle/Sledgehammer and HOL/Flyspeck
- Many interesting issues: features, labels, their utility and consistency
- Still easy to improve (waiting for you!)
Evaluation of methods on MPTP2078
ML evaluation (premise recall) 
Evaluation of methods on MPTP2078
ATP evaluation (problems solved) 
Combined (ensemble) methods
Combining with SInE improves the best method from 726 to 797 (10%)

Learning from ATP and ITP proofs
Learning from the shortest ATP proofs helps also by about 10%

Example of a simpler proof
- Mathematicians are good, but sometimes miss things
- Already in 2005: examples in Mizar where better premises found automatically
- Recent HOL Light example: FACE_OF_POLYHEDRON_POLYHEDRON
- The premise selection is good enough to rank the needed premises high
- And then the ATP is good enough to find the proof from them
The HOL(y)Hammer AI/ATP service
Examples of AI guidance in E and Prover9
- Proof-trace guidance:
- Prover9 hints, KB of abstracted lemmas from previous proofs in E
- nearest-neighbor guidance: ConjectureRelativeSymbolWeight in E
- further symbol weighting based on axiom relevance in E
- semantic guidance: Prover9, iProver, Vampire (since 2012)
BliStr: Blind Strategymaker, E-MaLeS
- Problem:
- Is conjecture-based guidance better than proof-trace guidance?
- In other words: how do we put all the sophisticated ATP techniques together?
- Grow a population of diverse strategies by iterative local search (and/or genetics)!
- Use clusters of similar solvable problems to train for unsolved problems
- Interleave low-time training with high-time evaluation
- Thus co-evolve the strategies and their training problems
- In the end, learn which strategy to use on which problem
BliStr on 1000 Mizar@Turing problems
- original E coverage: 597 problems
- after 30 hours of strategy growing: 22 strategies covering 670 problems
- A selection of 14 strategies improves E auto-mode by 25% on unseen problems
- Similar results for the Flyspeck problems
- Be lazy, don't do "hard" theory-driven ATP research (a.k.a: thinking)
- In 30 hours, your ATP can self-improve by data-driven methods while you lie on a beach!
Machine Learner for Automated Reasoning
- MaLARea: explore & exploit, reinforcement learning, etc.
- AI feedback loop interleaving deduction with induction on a large math. theory:
- The more problems you solve (and fail to solve), the more solutions (and failures) you can learn from
- The more you can learn from, the more you solve (now also in HH)
- Systematic concept addition (models, etc.), can be dangerous
- continuous update of the learning wrt. the new concepts
- good performance on AI/ATP benchmarks (MPTPChallenge, CASC LTB 2008,CASC Mizar 2012)
- The CASC performance curve flat for quite a while
Machine Learning Connection Prover
- MaLeCoP: put the AI methods inside a tableau ATP
- the learning/deduction feedback loop runs across problems and inside problems
- The more problems/branches you solve/close, the more solutions you can learn from
- The more solutions you can learn from, the more problems you solve
- Not just model avoidance, also ``dangerous pattern'' avoidance
- still quite a prototype (no CASC)
- already about 20-time proof search shortening on MPTP Challenge compared to leanCoP (see the paper)
Some lessons learned
- Don't push "expensive AI" where "normal AI" is better, do the obvious first:
- statistical ML vs. symbolic (ILP, etc): needs to be carefully merged: statistical substitution/subsumption trees, ATPs testing generalizations, etc.
- high-level combination of standard ATPs and learning vs. low-level new "cool" (often non-performing) systems (lots of work on MaLeCoP to make it efficient)
- make it useful (Sledgehammer, MizAR, HOL(y)Hammer) and have rigorous benchmarks (CASC LTB, etc.)
Some lessons learned
- one's own sense of "AI coolness" may be totally wrong on the large scale
- large-scale evaluations will show what is good and what to do next (analyze data often!)
- simple ML methods applied at the right place can be very powerful (recently e.g. BliStr)
- the formal domain is very fragile and takes a lot of work, but also very rewarding:
- new ATP proofs cannot be trusted/learned from if the translation to ATPs is unsound (funny story with Isabelle)
- once the translation is correct, new AI/ATP proofs really emerge
transfer ideas from SAT to malecop and E
- Alan Frisch claims that using dynamic features the saT/unsat could be guessed with 75% success on random sat problems
- dynamic features: random hill-climbing to get as much clauses true, etc
Thanks for your Attention!
- Don't forget to submit your coolest system to CASC-LTB 2013!
- Questions, Comments, ... ?