AI over Large Formal Knowledge Bases: The First Decade

Josef Urban, Radboud University Nijmegen

Talk Structure

AI for Formal Math or Formal Math for AI?
The semantic AI Plan: Large KBs of formal thought + evolving AI tools
Premise selection for Mizar/MML, Isabelle/HOL, HOL/Flyspeck, etc.
ATP guidance: E, Prover9, BliStr
Larger AI systems: MaLARea, MaLeCoP
Some lessons learned so far
Some challenges

AI for Formal Math or FM for AI?

AI/ATP for ITP getting popular (Sledgehammer), but it is only one way of looking at it
My motivation has been (also) opposite:
Get to strong AI via a large repository of computer-understandable complex human thought
Formal math proofs/theories: good (best?) approximation? Better than scanning books?
Gradually learn how math/science is done:
What are the components (inductive/deductive thinking)? How to combine them together?

The Plan

Make large KBs of formal thought (Mizar/MML, HOL/Flyspeck, Isabelle) accessible to strong AI tools: DONE (or well under way)
Test/Use/Evolve AI tools:
- deductive AI: first-order/higher-order/inductive ATPs, SMTs, decision procs.
- inductive AI: statistical learning tools (Bayesian, kernels, neural,...),
- inductive AI: semantic learning tools (ILP - Progol, Alef, ..; latent semantics - PCA; genetic methods, ... )

The Plan (continued)

Build custom/combined inductive/deductive tools/metasystems:
- usually combining ATP/symbolic techniques with ML/statistical ideas
- (statistically governed, "greedy", ...) indexing, subsumption, abstraction, limited inference/induction, concept/lemma creation, strategy generation, etc.
Continuously test performance
Define harder AI tasks as the performance grows (prove theorems from more basic lemmas, develop whole theories, ...)
GOTO 2

The Plan (continued)

Corollary: It happens to be useful (Mizar, Isabelle, HOL, etc)
Large-scale benchmarks crucial for feedback/progress (this is highly experimental science!)
Define benchmarks: whole MML, MPTP Challenge, CASC LTB, MPTP2078, MZR@Turing
Otherwise many seemingly "cool" ideas that subsume each other in unexpected ways
Implementation matters and is the proof of the pudding
Observation: This plan seems to have taken off in the last decade
Enter the CASC LTB 2013 competition!! (Mizar, Isabelle, HOL, 1-day pre-training)

Corpora

Mizar/MML: today about 50-100k formal math theorems/proofs (``Bourbaki'') written by humans (link to MML Query with the known theorems)
An interesting AI repository, first translated for AI/ATP experiments in 2003
Isabelle/HOL: about 20k theorems (but add AFP!), first translated for ATP experiments in ca. 2005
Recent work to allow machine learning (2011-12: consistent naming & translation, proof export)
HOL (Light): early translations (MESON, Metis) by Harrison and Hurd
HOL/Flyspeck: about 20k theorems, translated for AI/ATPs recently (2012)
More: SUMO, LogAnswer (full German Wikipedia in FOL), YAGO, ...

AI tools: Evolving technology stack

Try to stay on top of the stack and stimulate its progress
ATPs: a lot of progress in handling of large KBs (parsing, indexing, heuristics)
New/improved ATP calculi and strategies
- instantiation-based methods (iProver, SMT)
- (parameterized) preference of conjecture (E, leanCoP, Prover9)
- strategy languages (E), weight/ordering specifications
- extensive tuning for problems (Vampire, BliStr)
- more ITP-friendly extensions (many-sorted, polymorphic)
Great progress in statistical ML
- Netflix Challenge (latent semantics, unsurpervised learning)
- DARPA Challenge, Google, Yahoo, Amazon, ...
- Bayesian inference (biochemical causal chains, ...)

Premise Selection

Early 2003: Can existing ATPs be used over the freshly translated Mizar library?
About 80000 nontrivial math facts at that time.
Is good premise selection possible at all?
Or is it a mysterious power of mathematicians? (Penrose!)
Today: Premise selection is not a mysterious property of mathematicians.
Reasonably good algorithms started to appear (more below).
Will extensive human (math) knowledge get obsolete?? (cf. Watson)

Example: Mizar Proof Advisor (2003)

Train premise selection on all previous Mizar/MML proofs (50k)
Recommend relevant premises when proving new conjectures
About 70% coverage in the first 100 recommended premises
Chain the recommendations with strong ATPs to get full proofs
Used today also for Isabelle/Sledgehammer and HOL/Flyspeck
Many interesting issues: features, labels, their utility and consistency
Still easy to improve (waiting for you!)

Evaluation of methods on MPTP2078

ML evaluation (premise recall) ML evaluation: Recall

Evaluation of methods on MPTP2078

ATP evaluation (problems solved) ATP evaluation: Solved

Combined (ensemble) methods

Combining with SInE improves the best method from 726 to 797 (10%) ATP evaluation: Combined ATP evaluation: Solved

Learning from ATP and ITP proofs

Learning from the shortest ATP proofs helps also by about 10% ATP vs MML ATP and MML, combined with SInE

Evaluation on Flyspeck

628 AI/ATP combinations evaluated so far
they differ in the features, learning data, learning method, premise number, ATP
39% of top-level theorems proved by 14 AI/ATP methods - Nov 2012
45.5% of top-level theorems proved by recent 14 AI/ATP methods - Mar 2013

Example of a simpler proof

Mathematicians are good, but sometimes miss things
Already in 2005: examples in Mizar where better premises found automatically
Recent HOL Light example: FACE_OF_POLYHEDRON_POLYHEDRON
The premise selection is good enough to rank the needed premises high
And then the ATP is good enough to find the proof from them

The HOL(y)Hammer AI/ATP service

HH architecture

Examples of AI guidance in E and Prover9

Proof-trace guidance:
Prover9 hints, KB of abstracted lemmas from previous proofs in E
nearest-neighbor guidance: ConjectureRelativeSymbolWeight in E
further symbol weighting based on axiom relevance in E
semantic guidance: Prover9, iProver, Vampire (since 2012)

BliStr on 1000 Mizar@Turing problems

original E coverage: 597 problems
after 30 hours of strategy growing: 22 strategies covering 670 problems
A selection of 14 strategies improves E auto-mode by 25% on unseen problems
Similar results for the Flyspeck problems
Be lazy, don't do "hard" theory-driven ATP research (a.k.a: thinking)
In 30 hours, your ATP can self-improve by data-driven methods while you lie on a beach!

Examples of self-evolving metasystems

BliStr could evolve various parts of the ATP technology for long time
Machine Learner for Automated Reasoning (MaLARea)
Machine Learning Connection Prover (MaLeCoP)

Machine Learner for Automated Reasoning

MaLARea: explore & exploit, reinforcement learning, etc.
AI feedback loop interleaving deduction with induction on a large math. theory:
The more problems you solve (and fail to solve), the more solutions (and failures) you can learn from
The more you can learn from, the more you solve (now also in HH)
Systematic concept addition (models, etc.), can be dangerous
continuous update of the learning wrt. the new concepts
good performance on AI/ATP benchmarks (MPTPChallenge, CASC LTB 2008,CASC Mizar 2012)
The CASC performance curve flat for quite a while

MaLARea Architecture

MaLARea

Machine Learning Connection Prover

MaLeCoP: put the AI methods inside a tableau ATP
the learning/deduction feedback loop runs across problems and inside problems
The more problems/branches you solve/close, the more solutions you can learn from
The more solutions you can learn from, the more problems you solve
Not just model avoidance, also ``dangerous pattern'' avoidance
still quite a prototype (no CASC)
already about 20-time proof search shortening on MPTP Challenge compared to leanCoP (see the paper)

MaLeCoP Architecture

MaLeCoP

Some lessons learned

Don't push "expensive AI" where "normal AI" is better, do the obvious first:
statistical ML vs. symbolic (ILP, etc): needs to be carefully merged: statistical substitution/subsumption trees, ATPs testing generalizations, etc.
high-level combination of standard ATPs and learning vs. low-level new "cool" (often non-performing) systems (lots of work on MaLeCoP to make it efficient)
make it useful (Sledgehammer, MizAR, HOL(y)Hammer) and have rigorous benchmarks (CASC LTB, etc.)

Some lessons learned

one's own sense of "AI coolness" may be totally wrong on the large scale
large-scale evaluations will show what is good and what to do next (analyze data often!)
simple ML methods applied at the right place can be very powerful (recently e.g. BliStr)
the formal domain is very fragile and takes a lot of work, but also very rewarding:
- new ATP proofs cannot be trusted/learned from if the translation to ATPs is unsound (funny story with Isabelle)
- once the translation is correct, new AI/ATP proofs really emerge

Challenges in large formal KBs

selection of knowledge (for ATPs and humans)
useful translations between logics, useful ATP calculi and decision procedures
training internal guidance of ATPs, SMTs, automation inside ITPs
developing the corpora: lemmas, concepts, conjectures, (counter)examples

transfer ideas from SAT to malecop and E

Alan Frisch claims that using dynamic features the saT/unsat could be guessed with 75% success on random sat problems
dynamic features: random hill-climbing to get as much clauses true, etc

Challenges in large formal KBs

refactoring of long ATP proofs for human consumption (50k-long proof by David Stanovsky & Waldmeister)
using strong AI/ATP to help disambiguation/understanding of natural-language math
emulating the layer on which mathematicians think (related to natural language)
what will it take to prove Brouwer or Jordan fully automatically?

Acknowledgments

Prague Automated Reasoning Group
- Petr Stepanek, Jiri Vyskocil, Petr Pudlak, David Stanovsky, Krystof Hoder, Jan Jakubuv, Ondrej Kuncar, Martin Suda, ...
HOL(y)Hammer: Cezary Kaliszyk (Innsbruck)
ATP and ITP people
- Stephan Schulz, Geoff Sutcliffe, Larry Paulson, Tobias Nipkow, Jasmin Blanchette, John Harrison, Tom Hales, Andrzej Trybulec, Piotr Rudnicki, Ingo Dahn, Bill McCune, Adam Pease, Andrej Voronkov, Kostya Korovin, ...
Learning2Reason
- Evgeni Tsivtsivadze, Daniel Kuehlwein, Tom Heskes, Herman Geuvers ....
... and many more ...

Thanks for your Attention!

Don't forget to submit your coolest system to CASC-LTB 2013!
Questions, Comments, ... ?

Josef Urban, Radboud University Nijmegen

AI over Large Formal Knowledge Bases: The First Decade

AI over Large Formal Knowledge Bases: The First Decade

Josef Urban, Radboud University Nijmegen

Talk Structure

AI for Formal Math or FM for AI?

The Plan

The Plan (continued)

The Plan (continued)

Corpora

AI tools: Evolving technology stack

Premise Selection

Example: Mizar Proof Advisor (2003)

Evaluation of methods on MPTP2078

Evaluation of methods on MPTP2078

Combined (ensemble) methods

Learning from ATP and ITP proofs

Evaluation on Flyspeck

Example of a simpler proof

The HOL(y)Hammer AI/ATP service

Examples of AI guidance in E and Prover9

BliStr: Blind Strategymaker, E-MaLeS

BliStr on 1000 Mizar@Turing problems

Examples of self-evolving metasystems

Machine Learner for Automated Reasoning

MaLARea Architecture

Machine Learning Connection Prover

MaLeCoP Architecture

Some lessons learned

Some lessons learned

Challenges in large formal KBs

transfer ideas from SAT to malecop and E

Challenges in large formal KBs

Acknowledgments

Thanks for your Attention!