Proof Guidance in PVS with Sequential Pattern Mining

The recent introduction of the big data paradigm and advancements in machine learning and deep mining techniques have made proof guidance and automation in interactive theorem provers (ITPs) an important research topic. In this paper, we provide a learning approach based on sequential pattern mining (SPM) for proof guidance in the PVS proof assistant. Proofs in a PVS theory are first abstracted to a computer-processable corpus. SPM techniques are then used on the corpus to discover frequent proof steps and proof patterns, relationships of proof steps / patterns with each other, dependency of new conjectures on already proved facts and to predict the next proof step(s). Obtained results suggest that the integration of SPM in proof assistants can be used to guide the proof process and in the development of proof tactics / strategies.


Introduction
Theorem provers allow the formal development and verification of system properties that can be defined in appropriate logical formalisms. Automated (first-order) theorem provers (ATPs) deal with the development of computer programs that can automatically perform logical reasoning. However, first-order logic (FOL) lacks the expressive power required to define complex systems with an infinite domain. On the other hand, higher-order logic (HOL) allows quantification over predicates and functions. HOL-based theorem provers, also known as interactive theorem provers (ITPs), offer support for rich logical formalisms such as dependent and (co)inductive types as well as recursive functions, which enable ITPs to model complex systems. Today, these mechanical reasoning systems are used in verification projects that range from operating systems, compilers and hardware components to the correctness of large mathematical proofs such as the Feit-Thompson Theorem and the Kepler conjecture [22]. However, automatic reasoning in ITPs is still a hard problem due to undecidable algorithms and proof methods [20].
Unlike ATPs, where the proof process is generally automatic, ITPs follow a user-driven proof development process. The user guides the proof process by providing the proof goal and by applying proof commands and tactics to prove the goal. Generally, the user does a lot of repetitive work to prove a nontrivial theorem (goal), which is laborious and time consuming. Proof guidance and proof automation are therefore two extremely desirable features in ITPs. ITPs now have large corpora of computer-understandable formalized knowledge [5,19] in the form of proof libraries. In PVS, proof scripts for a particular theory are stored separately in a file that can also be considered as a proof corpus for the theorems and lemmas in that theory. Proof scripts of different theories can be combined to develop more complex corpora. These corpora play an important role in artificial intelligence based methods, such as concept matching, structure formation and theory exploration. The ongoing fast progress in machine learning and data mining has made it possible to use these learning techniques on such corpora for guiding the proof search process, for proof automation and for developing proof tactics / strategies, as indicated in recent works [8, 9, 15-17, 21, 26].
In this paper, the focus is on proof guidance and premise selection in ITPs from the perspective of sequential pattern mining (SPM) techniques. SPM techniques are used in data mining to find interesting and useful patterns (information) that are hidden in large corpora of sequential data [14]. A particular proof goal in PVS depends on the specifications inside the theory and it can be completed with different combinations of proof commands, inference rules and decision procedures [30]. This makes it difficult to infer useful proof tactics and strategies from specific examples that can be applied more generally. Moreover, a proof corpus contains too much information, which makes it hard to carry out the brute force approach for proof search. However, there is the potential to identify useful and interesting hidden proof patterns in these corpora and relationships of such proof patterns with each other. With such information, SPM techniques can be used to investigate the dependency of new conjectures on already proved facts and to predict the next proof step(s) or pattern(s) for guiding the proof of a novel non-trivial theorem / lemma.
We present an SPM-based proof process learning approach for the PVS proof assistant. The basic idea is to convert the PVS proofs for a theory into a proof corpus that is suitable for learning. SPM techniques are applied on the corpus to find frequent proof steps and patterns that are used in the proofs. Moreover, relationships of proof steps / patterns with each other are discovered through sequential rule mining. The learning approach is also used to find the relevance of new conjectures to the proofs, and the performance of some state-of-the-art prediction models is examined by training and testing them on the corpus to predict the next proof step(s). Besides PVS, the proposed approach can also be used to guide the proof process in other proof assistants. The ultimate goal is to develop proof tactics / strategies with useful patterns that can be invoked directly by the user in the proof development process.
The rest of this paper is organized as follows: Section 2 elaborates the SPM-based learning approach that is used to discover useful proof steps / patterns in the proof corpus, their relationship and prediction of next proof step(s). Evaluation of the proposed approach on a case study and obtained results are discussed in Section 3. Related work on using machine learning and data mining techniques for automated reasoning in ATPs and ITPs is presented in Section 4. Finally, the paper is concluded with some future directions in Section 5. PVS dump files and SPM related data for this work can be found at [31].

Proof Corpus Mining with SPM
The structure of the SPM-based learning approach is shown in Figure 1. It consists of two main parts:
1. Development of proof corpus: PVS proof steps for theorems and lemmas in a theory are converted to a proof corpus, where each complete proof is abstracted to sequences of proof commands.
2. Learning through SPM: SPM algorithms are used on the corpus to discover the common proof steps and patterns, relationships of proof steps / patterns with each other, dependency of new conjectures on already proved facts and prediction of next proof step(s).
Each part is further elaborated next. In general, data is assembled first, so that data mining algorithms can be used. To make the proof corpus suitable for learning, it should satisfy certain minimum requirements, such as:

- It is stored in a computational and electronic form.
- It contains many examples of proofs that offer diversity in kinds of proof steps. The corpus should have different proof steps so that useful proof patterns as well as the dependency of proof steps and prediction of next proof steps can be discovered.
- It is transformed into a suitable abstraction, so that no meaningful information from the proofs is left out. For this, we use the "proof sequences to integers" abstraction, where each proof command is converted to a distinct positive integer. Such abstraction allows wide diversity and makes the approach more general in nature.
Besides the dump file that contains the specifications for a particular theory with imported libraries and proof scripts (collection of proof steps) for theorems / lemmas, PVS also saves the proof scripts for a theory in a separate proof file. These files contain proof commands with some other information related to PVS. After removing the redundant information from the proof files, the complete proof is a sequence of proof steps. In the following we present some concepts related to sequences in the context of this work.
Let $PS = \{ps_1, ps_2, \ldots, ps_m\}$ represent the set of proof commands. A proof steps set $PSS$ is a set of proof commands, that is, $PSS \subseteq PS$. $|PSS|$ denotes the set cardinality. $PSS$ has length $k$ (called a $k$-PSS) if it contains $k$ proof commands, i.e., $|PSS| = k$. For example, consider the set of PVS proof commands $PS$ = {skolem, flatten, inst?, split, beta, iff, assert}. The set {skolem, flatten, assert} is a proof steps set that contains three proof commands. For the purpose of processing commands in some order, a total order relation on proof commands, denoted $\prec$, is assumed to exist (e.g., the lexicographical order).
A proof sequence is a list of proof steps sets $S = \langle PSS_1, PSS_2, \ldots, PSS_n \rangle$, such that $PSS_i \subseteq PS$ ($1 \le i \le n$). For example, $\langle$ {skolem, flatten}, {inst?}, {beta, iff}, {assert} $\rangle$ is a proof sequence with four proof steps sets used to prove a theorem. A proof corpus $PC$ is a list of proof sequences $PC = \langle S_1, S_2, \ldots, S_p \rangle$, where each sequence has an identifier (ID). For example, Table 1 shows a $PC$ that has five proof sequences with IDs 1, 2, 3, 4 and 5.

ID   Proof sequence
...
     {skeep, expand "Tle", typepred "<", expand "strict_order?", flatten, expand "transitive?", inst -2 "T(s1)" "T(s2)" "T(s3)", assert}
5    {induct n}, {expand "sum", propax}, {skosimp, expand "sum" +, assert}

The final step is to convert the proof sequences into sequences of integers to bring the corpus into a format more suitable for SPM techniques. In the final corpus, each line represents a proof sequence that was used for the proof of a theorem / lemma. Each proof command in the sequence is replaced by a positive integer. For example, the proof command skosimp is replaced by 1. Moreover, proof commands are separated with a single space followed by the negative integer -1. The negative integer -2 appears at the end of each line to indicate the end of a proof sequence. Note that the same integer is used for similar proof commands such as (inst?) and (inst fnum constants), and (skosimp) and (skosimp*). This makes the learning process more general in nature, so that it can be used for other PVS theories, in particular for the PVS library.
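The integer abstraction described above can be sketched as follows. This is only an illustration, not the tooling used in this work: the function names `normalize` and `encode_corpus`, the rule of keeping only the command head (so that variants such as `skosimp*` and `skosimp` share one integer), the order in which integers are assigned, and the toy corpus are all assumptions. Each proof steps set is closed by -1 and each sequence by -2.

```python
import re

def normalize(command):
    """Map a proof command to its head symbol, dropping arguments and
    trailing '?' or '*' markers (assumed normalisation rule)."""
    head = command.strip().strip("()").split()[0]
    return re.sub(r"[?*]+$", "", head)

def encode_corpus(corpus):
    """Encode proof sequences (lists of proof steps sets) as integer lines."""
    ids = {}      # command head -> positive integer, assigned on first sight
    lines = []
    for sequence in corpus:
        tokens = []
        for step_set in sequence:
            codes = set()
            for command in step_set:
                head = normalize(command)
                ids.setdefault(head, len(ids) + 1)
                codes.add(ids[head])
            tokens.extend(str(code) for code in sorted(codes))
            tokens.append("-1")   # end of one proof steps set
        tokens.append("-2")       # end of the proof sequence
        lines.append(" ".join(tokens))
    return lines, ids

# Toy corpus (hypothetical), with one proof per entry.
corpus = [
    [["skosimp*"], ["flatten"], ["assert"]],
    [["induct n"], ['expand "sum"', "propax"],
     ["skosimp", 'expand "sum" +', "assert"]],
]
lines, ids = encode_corpus(corpus)
for line in lines:
    print(line)
```

With this numbering, skosimp receives the integer 1 because it is the first command encountered, and both occurrences of expand map to the same integer regardless of their arguments.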
A proof sequence $S_\alpha = \langle \alpha_1, \alpha_2, \ldots, \alpha_n \rangle$ is present or contained in another proof sequence $S_\beta = \langle \beta_1, \beta_2, \ldots, \beta_m \rangle$ iff there exist integers $1 \le i_1 < i_2 < \cdots < i_n \le m$ such that $\alpha_1 \subseteq \beta_{i_1}, \alpha_2 \subseteq \beta_{i_2}, \ldots, \alpha_n \subseteq \beta_{i_n}$ (denoted $S_\alpha \sqsubseteq S_\beta$). In SPM, various measures are used to evaluate the importance and interestingness of a subsequence. The support measure is used by most SPM techniques. The support of $S_\alpha$ in $PC$ is the total number of sequences ($S$) that contain $S_\alpha$, and is represented by $sup(S_\alpha)$:

$$sup(S_\alpha) = |\{S \mid S_\alpha \sqsubseteq S \wedge S \in PC\}|$$

SPM is an enumeration problem that aims to find all the frequent subsequences in a sequential dataset. A sequence $S$ is a frequent sequence (also called a sequential pattern) iff $sup(S) \ge minsup$, where $minsup$ (minimum support) is a threshold set by the user. A sequence containing $n$ items (proof commands in this work) in a corpus can have up to $2^n - 1$ distinct subsequences. This makes the naive approach of calculating the support of all possible subsequences infeasible for most corpora. Several efficient algorithms have been developed in recent years that do not explore the whole search space of possible subsequences.
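The containment relation and the support measure can be made concrete with a short sketch. The function names and the toy corpus below are hypothetical; a proof sequence is modelled as a list of Python sets, and support is the absolute count used in the definition above.

```python
def contains(seq, pattern):
    """True if `pattern` is contained in `seq` (both are lists of sets):
    each pattern set must be a subset of a distinct set of `seq`, in order."""
    pos = 0
    for wanted in pattern:
        # Greedily take the earliest set of `seq` that includes `wanted`.
        while pos < len(seq) and not wanted <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def support(corpus, pattern):
    """Number of sequences in the corpus that contain the pattern."""
    return sum(contains(seq, pattern) for seq in corpus)

def is_frequent(corpus, pattern, minsup):
    return support(corpus, pattern) >= minsup

# Toy corpus (hypothetical).
corpus = [
    [{"skolem", "flatten"}, {"inst?"}, {"beta", "iff"}, {"assert"}],
    [{"skolem"}, {"split"}, {"assert"}],
    [{"flatten"}, {"assert"}],
]
print(support(corpus, [{"skolem"}, {"assert"}]))
print(is_frequent(corpus, [{"assert"}, {"skolem"}], minsup=1))
```

Counting support this way requires one pass over the corpus per candidate pattern, which is exactly the cost that the algorithms discussed next try to avoid.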
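All SPM algorithms investigate the pattern search space with two operations: s-extensions and i-extensions. A sequence $S_\alpha = \langle \alpha_1, \alpha_2, \ldots, \alpha_n \rangle$ is a prefix of a sequence $S_\beta = \langle \beta_1, \beta_2, \ldots, \beta_m \rangle$ if $\alpha_1 = \beta_1, \alpha_2 = \beta_2, \ldots, \alpha_{n-1} = \beta_{n-1}$, where $\alpha_n$ is equal to the first $|\alpha_n|$ items of $\beta_n$ according to the $\prec$ order. Note that SPM algorithms follow a specific order $\prec$ so that the same potential patterns are not considered twice, and the choice of the order $\prec$ does not affect the final result produced by SPM algorithms. A sequence $S_\beta$ is an s-extension of a sequence $S_\alpha$ for an item $x$ if $S_\beta = \langle \alpha_1, \alpha_2, \ldots, \alpha_n, \{x\} \rangle$. Similarly, for an item $x$, a sequence $S_\gamma$ is an i-extension of $S_\alpha$ if $S_\gamma = \langle \alpha_1, \alpha_2, \ldots, \alpha_n \cup \{x\} \rangle$. SPM algorithms employ either a breadth-first search or a depth-first search. In the following, a brief description of state-of-the-art SPM algorithms is presented.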
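The two extension operations can be illustrated with a few lines of code. This is a minimal sketch with hypothetical function names; patterns are lists of frozensets, and the restriction of i-extensions to items that come after every item of the last set in lexicographical order is the usual convention for avoiding duplicate candidates, assumed here rather than taken from this paper.

```python
def s_extension(pattern, item):
    """Append a new proof steps set {item} to the pattern."""
    return pattern + [frozenset([item])]

def i_extension(pattern, item):
    """Add item to the last proof steps set of the pattern."""
    return pattern[:-1] + [pattern[-1] | {item}]

def candidate_extensions(pattern, items):
    """Yield all one-item extensions of a pattern, using the lexicographical
    order so that each candidate is generated only once."""
    last_max = max(pattern[-1])
    for item in sorted(items):
        yield "s", s_extension(pattern, item)
        if item > last_max:
            yield "i", i_extension(pattern, item)

pattern = [frozenset({"flatten"})]
items = {"assert", "flatten", "split"}
for kind, candidate in candidate_extensions(pattern, items):
    print(kind, [sorted(step_set) for step_set in candidate])
```

A depth-first miner would call `candidate_extensions` recursively on every candidate whose support reaches the threshold, while a breadth-first miner would generate all candidates of one length before moving to the next.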
The TKS (Top-k Sequential) algorithm finds the top-k sequential patterns in a corpus, where k is set by the user and represents the number of sequential patterns to be discovered by the algorithm. TKS employs the basic candidate generation procedure of SPAM and a vertical database representation. With a vertical representation, the support of patterns can be calculated without performing costly database scans. This makes vertical algorithms perform better on dense or long sequences. TKS also uses several strategies for search space pruning and depends on the PMAP (Precedence Map) data structure to avoid costly bit vector intersection operations. Another SPM algorithm is CM-SPAM, which performs a depth-first search to discover frequent sequential patterns in a corpus. The CMAP (Co-occurrence MAP) data structure is used in CM-SPAM to store item co-occurrence information. A generic pruning mechanism based on CMAP is used to prune the search space with the vertical database representation, in order to discover sequential patterns efficiently. More detail on TKS and CM-SPAM can be found in [11] and [10], respectively.
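To illustrate why a vertical representation avoids repeated scans, the following sketch builds an index from each proof command to the positions where it occurs and then answers support queries from the index alone. It is not TKS or CM-SPAM: those algorithms use bit vectors together with the PMAP and CMAP structures, whereas the dictionary index, the function names and the toy corpus here are assumptions chosen for brevity.

```python
from collections import defaultdict
import heapq

def vertical(corpus):
    """Build item -> {sequence id -> positions of the sets containing item}."""
    index = defaultdict(lambda: defaultdict(list))
    for sid, seq in enumerate(corpus):
        for pos, step_set in enumerate(seq):
            for item in step_set:
                index[item][sid].append(pos)
    return index

def support_of_pair(index, a, b):
    """Support of <{a}, {b}>: sequences where a occurs strictly before b."""
    count = 0
    b_index = index.get(b, {})
    for sid, a_positions in index.get(a, {}).items():
        b_positions = b_index.get(sid)
        # Positions are stored in increasing order, so comparing the first
        # occurrence of a with the last occurrence of b is sufficient.
        if b_positions and a_positions[0] < b_positions[-1]:
            count += 1
    return count

def top_k_items(index, k):
    """The k single commands with the highest support."""
    return heapq.nlargest(k, ((len(occ), item) for item, occ in index.items()))

# Toy corpus (hypothetical).
corpus = [
    [{"skolem", "flatten"}, {"inst?"}, {"beta", "iff"}, {"assert"}],
    [{"skolem"}, {"split"}, {"assert"}],
    [{"flatten"}, {"assert"}],
]
index = vertical(corpus)
print(support_of_pair(index, "skolem", "assert"))
print(top_k_items(index, 1))
```

The corpus is read once to build the index; every later support query touches only the entries of the commands involved.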
Sequential patterns that appear frequently in a corpus with low confidence are worthless for decision making or prediction. Sequential rules discover patterns by considering not only their support but also their confidence. A sequential rule $X \to Y$ is a relationship between two PSSs $X, Y \subseteq PS$, such that $X \cap Y = \emptyset$ and $X, Y \neq \emptyset$. The rule $r : X \to Y$ means that if items of $X$ occur in a sequence, items of $Y$ will occur afterward in the same sequence. $X$ is contained in $S_\alpha$ (written as $X \sqsubseteq S_\alpha$) iff $X \subseteq \bigcup_{i=1}^{n} \alpha_i$. A rule $r : X \to Y$ is contained in $S_\alpha$ (written as $r \sqsubseteq S_\alpha$) iff there exists an integer $k$ such that $1 \le k < n$, $X \subseteq \bigcup_{i=1}^{k} \alpha_i$ and $Y \subseteq \bigcup_{i=k+1}^{n} \alpha_i$. The confidence of $r$ in $PC$ is defined as:

$$conf_{PC}(r) = \frac{|\{S \mid r \sqsubseteq S \wedge S \in PC\}|}{|\{S \mid X \sqsubseteq S \wedge S \in PC\}|}$$

The support of $r$ in $PC$ is defined as:

$$sup_{PC}(r) = \frac{|\{S \mid r \sqsubseteq S \wedge S \in PC\}|}{|PC|}$$

A rule $r$ is a frequent sequential rule iff $sup_{PC}(r) \ge minsup$, and $r$ is a valid sequential rule iff it is frequent and $conf_{PC}(r) \ge minconf$, where the thresholds $minsup, minconf \in [0, 1]$ are set by the user. Mining sequential rules in a corpus deals with finding all the valid sequential rules. For this, the ERMiner (Equivalence class based sequential Rule Miner) algorithm [12] is used. It relies on a vertical database representation and represents the search space of rules using equivalence classes of rules having the same antecedent or consequent. It employs two operations (left and right merges) to explore the search space of frequent sequential rules, where the search space is pruned with the Sparse Count Matrix (SCM) technique, which makes ERMiner more efficient than other sequential rule mining algorithms.
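The support and confidence of a single rule can be computed directly from their definitions, as in the sketch below. This is a brute-force illustration only, not ERMiner, which explores equivalence classes of rules instead of testing rules one by one; the function names and the toy corpus are hypothetical.

```python
def rule_occurs(seq, X, Y):
    """True if, for some split point k, every item of X occurs in the first
    k sets of seq and every item of Y occurs in the remaining sets."""
    for k in range(1, len(seq)):
        before = set().union(*seq[:k])
        after = set().union(*seq[k:])
        if X <= before and Y <= after:
            return True
    return False

def contains_items(seq, X):
    """True if every item of X occurs somewhere in seq."""
    return X <= set().union(*seq)

def rule_measures(corpus, X, Y):
    """Return (support, confidence) of the rule X -> Y, both in [0, 1]."""
    both = sum(rule_occurs(seq, X, Y) for seq in corpus)
    antecedent = sum(contains_items(seq, X) for seq in corpus)
    sup = both / len(corpus)
    conf = both / antecedent if antecedent else 0.0
    return sup, conf

# Toy corpus (hypothetical).
corpus = [
    [{"expand"}, {"assert"}],
    [{"expand"}, {"flatten"}, {"assert"}],
    [{"expand"}, {"split"}],
    [{"flatten"}, {"assert"}],
]
sup, conf = rule_measures(corpus, {"expand"}, {"assert"})
print(f"support={sup:.2f} confidence={conf:.2f}")
```

In this toy corpus the rule {expand} → {assert} holds in two of four sequences, and expand occurs in three, so the rule would be valid for minsup = 0.5 and any minconf up to 2/3.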
The statistical Naive Bayes (NB) classifier [32] is based on Bayes' theorem and is used to compute the probability of using the proof $p$ in the corpus to prove a new conjecture $c$. A conjecture is a proposition or statement that has not been proved yet, but is thought to be true. The probability is based on the fact that some $p$ have already been used in the proof of conjectures similar to $c$. As each $p$ contains a set of proof steps, the conditional probability $P(PSS \mid c)$ estimates the relevance of $PSS$ for $c$. The conditional probabilities are computed and multiplied to get the overall probability for $c$.
The Compact Prediction Tree+ (CPT+) model is used to predict the next proof step(s) [18]. CPT+ implements two compression strategies to reduce the CPT size and one strategy to reduce the prediction time. In the training phase, CPT+ takes a set of training sequences as input and generates three data structures: a prediction tree, a lookup table and an inverted index. These three structures are built incrementally by considering the sequences one by one during training. For a proof sequence $S_\alpha$ of $n$ elements, the suffix of $S_\alpha$ of size $y$, where $1 \le y \le n$, is defined as $P_y(S_\alpha) = \langle \alpha_{n-y+1}, \alpha_{n-y+2}, \ldots, \alpha_n \rangle$. Predicting the next proof steps of $S_\alpha$ is done by finding those sequences that are similar to $P_y(S_\alpha)$ in any order. For prediction, CPT+ uses the consequent of each sequence that is similar to $S_\alpha$. Let $S_\beta$ be another proof sequence similar to $S_\alpha$. The consequent of $S_\beta$ with respect to $S_\alpha$ is the longest subsequence $\langle \beta_v, \beta_{v+1}, \ldots, \beta_m \rangle$ of $S_\beta$ such that $\bigcup_{k=1}^{v-1} \{\beta_k\} \subseteq P_y(S_\alpha)$ and $1 \le v \le m$. Each proof command discovered in the consequent of a proof sequence similar to $S_\alpha$ is stored in the count table (CT) data structure. Finally, CPT+ returns the most supported proof step(s) in the CT as prediction(s).
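The prediction step can be approximated by a much simpler procedure that keeps only the count table idea. The sketch below is not CPT+: it omits the prediction tree, the lookup table, the inverted index and both compression strategies, it uses plain occurrence counts instead of the scores computed by the model, and it reads the consequent as everything that follows the last suffix command in a similar sequence. These simplifications, the function name and the toy training data are assumptions.

```python
from collections import Counter

def predict_next(training, prefix, y):
    """Rank candidate next commands for `prefix` using its last y commands."""
    wanted = set(prefix[-y:])
    counts = Counter()                    # simplified count table
    for seq in training:
        if not wanted <= set(seq):
            continue                      # not similar: a suffix command is missing
        # Consequent: commands after the last occurrence of a suffix command.
        last = max(i for i, cmd in enumerate(seq) if cmd in wanted)
        for cmd in seq[last + 1:]:
            if cmd not in wanted:
                counts[cmd] += 1
    return counts.most_common()           # empty list means "no match"

# Toy training sequences (hypothetical), one proof command per step.
training = [
    ["skosimp", "flatten", "split", "assert"],
    ["flatten", "split", "assert", "grind"],
    ["split", "flatten", "assert"],
    ["skosimp", "expand", "assert"],
]
print(predict_next(training, ["flatten", "split"], y=2))
```

An empty result corresponds to the case where the model cannot perform a prediction for the given prefix.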

Experiments
All the following experiments are performed on an HP laptop with a fifth generation Core i5 processor and 8 GB RAM. For the case study, we select our previous work [29], where PVS is used for the analysis and verification of Reo connectors composed of untimed and timed channels. The main reason for selecting the proofs in [29] is that we are extending the formalization framework to cover the probabilistic [3] and stochastic [4] behavior of Reo connectors. The approach not only enabled us to comprehend the proof process for probabilistic connectors, but can also be considered effective in providing the necessary guidance to attain the proofs of such connectors.
The SPMF data mining library, developed in Java, is used to analyze the proof corpus. It is an open-source and cross-platform framework that is specialized in pattern mining tasks. SPMF offers implementations of more than 150 data mining algorithms. More detail on SPMF can be found in [13].

Case Study
Reo [2] is a channel-based exogenous coordination language that allows the construction of complex connectors from primitive channels through compositional operators. Connectors in Reo provide the protocol for controlling and organizing the communication, synchronization and cooperation between components. Each channel in Reo has two channel ends, each of type source or sink. The connector behavior in PVS is formalized by means of data-flows on its sink and source nodes, which are essentially infinite sequences. In PVS, a record structure named TD is used to represent the timed-data sequences on sink and source nodes, where time is defined as a positive real number (R+) and data is defined as a positive type. Three main composition operators (flow through, replicate and merge) are used in Reo for connector construction. The flow through and replicate operators can be achieved explicitly in PVS, whereas the merge operator is defined inductively.
We omit the details of Reo connector modeling in PVS due to the length limitation. Interested readers can find more details in [29,31]. Here, one example is provided to show how connectors are modeled and how properties for connectors are proved in PVS.
Example 1. Figure 2 shows a connector which consists of one Synchronous (Sync) channel (AB) and one FIFO1 channel (BC), that accepts data items at source node A and stores the data items in the buffer, before dispensing them through the sink node C. The mixed node B allows the data items to move from the Sync channel to the FIFO1 channel without any change.
Proof. The PVS prover is based on sequent calculus and it can build a graphical proof tree for a proof goal. The nodes in the proof tree are sequents. PVS proof commands may divide the main goal into sub-goals (tree leaves). The proof is completed when all the sub-goals are proved. The proof steps for Theorem 1 are shown in Figure 3.

Results and Discussion
Results obtained by applying SPM algorithms on the proof corpus are discussed in this section.
The TKS algorithm is first applied on the corpus to find hidden proof steps and patterns. TKS takes a corpus and a user-specified parameter k as input and returns the top-k most frequent patterns as output. The parameter k is used in place of the minsup threshold for the following two reasons:
1. Selection of a proper minimum support to discover the desired amount of useful patterns has an effect on the performance of SPM algorithms.
2. The minimum support fine-tuning process is hard and time consuming.
To overcome these drawbacks, the parameter k puts a bound on the total number of patterns to be discovered by the algorithm. Some proof patterns of varying length discovered by the TKS algorithm are shown in Table 2. The column Sup indicates the occurrence count of each pattern in the corpus. Table 2 provides some useful information related to the frequent occurrence of proof steps and patterns that are used in the verification of Reo channels and connectors. Unlike TKS, the CM-SPAM algorithm offers the minsup threshold. Table 3 lists some of the most useful frequent proof patterns in the corpus that are extracted with the CM-SPAM algorithm. The first six proof patterns appear in at least 50% of the sequences in the corpus. The next six patterns appear in at least 40% of the sequences and the last two patterns appear in at least 10% of the sequences. The patterns discovered with the CM-SPAM algorithm are almost the same as the results obtained with the TKS algorithm. As the outputs of TKS and CM-SPAM are very similar, the performance of TKS is compared with that of CM-SPAM in terms of execution time and memory used. CM-SPAM is fine-tuned with the minsup threshold to generate the same k proof patterns. For optimal support, the TKS execution time is very similar to that of CM-SPAM. Similarly, TKS showed excellent scalability. These results, which are consistent with [11], are important because finding the top-k sequential proof patterns is a harder problem than mining all proof patterns, as the minsup threshold has to be raised dynamically, starting from 0.
Figure 4 shows the relationships between proof steps / patterns that are discovered through sequential rule mining with the ERMiner algorithm. The confidence (minconf) threshold is set to 70%, which means that rules have a confidence of at least 70% (a rule X → Y has a confidence of 70% if the set of proof commands in X is followed by the set of proof commands in Y at least 70% of the times when X appears in a proof sequence). The value above the arrow is the support and the value below the arrow indicates the confidence (probability). For example, the first rule in Figure 4 indicates that 94.7% of the time, the expand command is followed by the assert command. With the ERMiner algorithm, we found some interesting relationships and dependencies of proof steps / patterns with each other. Results obtained so far indicate that the total number of proof steps in each proof (abstraction simplicity) has a direct correlation with the efficiency of SPM algorithms.
In [7], common proof patterns are found in Isabelle proofs with a variable length Markov chain. Proofs are represented in a tree structure format and then linearized, such that the proofs are split into separate sequences and given weights accordingly. However, linearization means losing important connections (information) between different branches in the proofs, due to which interesting patterns may well be lost. In this work, the proof corpus contains all the information that is necessary for pattern discovery, and the SPM algorithms are more user-friendly and work efficiently on the corpus.
The NB classifier implemented in SPMF is used to check the dependency of new conjectures on already proved facts. For that, the classifier is trained on the proofs present in the corpus. We then provide new conjectures from our ongoing work on probabilistic Reo connectors. In the output, the classifier successfully classified the new conjectures, which shows that the proofs can be used in guiding the proof process of new conjectures. Moreover, the classifier was unable to classify conjectures taken from PVS libraries, which means that their proofs are not dependent on the facts present in the corpus. NB classifiers are also used in [23] for computing the proof dependencies for new conjectures from theorems taken from the Coq repository. There, the obtained results are presented with measures such as precision, recall and rank. In SPMF, the NB implementation only provides a binary output for classification and does not provide information on these measures. In future, we would like to enhance the implementation of NB to provide statistics about the measures.
Predicting the next proof steps for a new conjecture or an unproved theorem / lemma has gained increased importance in the last few years. The CPT+ model is used for predicting the next proof steps. The model is first trained on the proof sequences in the corpus. The prediction model is then used to predict the next proof step for a new proof sequence. Prediction of the next proof step is based on the scores calculated by the model for each proof command. For example, CPT+ predicted assert for the proof sequence <flatten, split>. The statistics and scores assigned by the model to each proof step for this example are listed in Table 4. Note that a higher score is considered better for CPT+. To check the efficiency of CPT+, we compared its performance with various other state-of-the-art prediction models such as Dependency Graph (DG), Transition Directed Acyclic Graph (TDAG), CPT (the predecessor of CPT+), AKOM (All-K-Order-Markov) and LZ78. Each model is trained and tested with 10-fold cross-validation. Cross-validation characterizes the performance of each model by evaluating how well its statistical results generalize to an independent data set. In k-fold cross-validation, the dataset is randomly partitioned into k sub-datasets. One sub-dataset is then selected as the validation set for model testing and the remaining k − 1 sub-datasets are used for model training. This process is repeated k times and each sub-dataset is used exactly once as the validation set. A single estimate of the result is obtained by taking the average of the k results. The main reason to use 10-fold cross-validation is to achieve low variance in each run. Obtained results for various prediction models are shown in Table 5. For evaluation of prediction models, three measures are used. The result of a prediction can be: a success if the model predicts accurately, a failure if the model predicts inaccurately and no match if the model cannot perform the prediction.
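The k-fold procedure described above can be written generically as follows. This sketch is independent of SPMF; the function names, the fixed random seed and the toy scorer that stands in for a real prediction model are assumptions, and the corpus is assumed to contain at least k sequences so that no fold is empty.

```python
import random
from collections import Counter

def k_fold_splits(n_items, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)        # random partition
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def cross_validate(sequences, k, train_and_score):
    """Average the score returned by train_and_score over the k folds."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(sequences), k):
        train = [sequences[i] for i in train_idx]
        test = [sequences[i] for i in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / k

def majority_last_command(train, test):
    """Toy scorer: always guess the most frequent final command in training."""
    guess = Counter(seq[-1] for seq in train).most_common(1)[0][0]
    return sum(seq[-1] == guess for seq in test) / len(test)

# Toy corpus (hypothetical): 20 short proofs that all end with assert.
sequences = [["skosimp", "flatten", "assert"]] * 20
print(cross_validate(sequences, 10, majority_last_command))
```

Any of the prediction models compared here could be plugged in by replacing `majority_last_command` with a function that trains the model on `train` and returns its accuracy on `test`.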
The overall performance of each model is measured through its accuracy, which is the total number of successful predictions performed by the model divided by the total number of test sequences. Two other measures, training time and testing time, are not included in the results here as all the models take almost the same time for training and testing. CPT+ achieved higher accuracy (79.412) than the other prediction models. CPT has a higher success rate than CPT+, but its higher no match rate makes its accuracy lower than that of CPT+. Among the Markov-based prediction models, DG achieved the lowest success rate and highest failure rate, while TDAG and AKOM have the same results for all four parameters.
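The difference between success rate and accuracy can be illustrated numerically. In the sketch below, accuracy follows the definition given above, while the success and failure rates are computed over the predictions that were actually made; that second definition is an assumption, chosen because it is one reading under which a model can have a higher success rate but a lower accuracy. The function name and the outcome lists are hypothetical and are not the results of Table 5.

```python
def summarize(outcomes):
    """Summarize a list of outcomes: 'success', 'failure' or 'no match'."""
    total = len(outcomes)
    success = outcomes.count("success")
    failure = outcomes.count("failure")
    no_match = outcomes.count("no match")
    matched = success + failure           # predictions that were actually made
    return {
        "accuracy": success / total,
        "success_rate": success / matched if matched else 0.0,
        "failure_rate": failure / matched if matched else 0.0,
        "no_match_rate": no_match / total,
    }

# Hypothetical outcomes for two models on ten test sequences.
model_a = ["success"] * 7 + ["failure"] * 1 + ["no match"] * 2
model_b = ["success"] * 8 + ["failure"] * 2
print(summarize(model_a))
print(summarize(model_b))
```

Here the first model is right more often when it does answer, but it answers less often, so its accuracy over all test sequences is lower.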

Related Work
Using machine learning and data mining in theorem provers is not a new idea, and these techniques are used mainly for three tasks: premise selection, strategy selection and internal guidance. Support vector machines (SVMs) and Gaussian processes (GPs) were used in [6] for the selection of good heuristics in the E theorem prover. In [27], kernel methods were applied to strategy scheduling and strategy finding problems in three ATPs: E, Satallax and LEO-II. Deep networks have been used in [28] for internal guidance in E, where deep learning based proof guidance increases the total number of theorems proved while reducing the average number of proof search steps. Moreover, internal proof guidance methods based on the watchlist technique were developed in [17] for the E prover. A proof search guidance technique based on leanCoP was presented in [24] to guide the tableaux proof search. In [33], GRU networks were used in MetaMath for guiding the proof search of a tableau style proof process. Monte-Carlo tree search methods combined with a connection tableau were studied and implemented in leanCoP in [9] for guiding the proof search. A new theorem proving algorithm (implemented in rlCoP) that uses Monte-Carlo simulations with reinforcement iterations was recently presented in [26] for proof guidance. rlCoP showed better performance than leanCoP in solving unseen problems when trained on a large corpus.
For HOL-based theorem provers, the variable length Markov models (VLMM) technique has been applied in [7] on a proof corpus of the Isabelle prover to identify sequences of proof steps, and these sequences were used to form tactics. Particle swarm optimization and NB based techniques were proposed in [8] to internally guide the given-clause algorithm in the Satallax prover. Premise selection techniques were developed in [23] for the Coq system, where machine learning methods are compared on Coq proofs taken from the CoRN repository. Recurrent and convolutional neural networks were used in [21] for premise selection in the Mizar prover. A corpus of proofs was constructed in [1] for training a kernelized classifier with bag-of-words features that show the term occurrences in a vocabulary. Premise selection based on machine learning and automated reasoning for HOL4 is provided in [15] by adapting HOL(y)Hammer [25]. A learning-based automation technique called TacticToe was recently developed in [16] on top of HOL4 for the automation of theorem proofs. The HolStep dataset, introduced in [22], consists of 10K conjectures and 2M statements to develop new machine learning based proof strategies.

Conclusion
The proof development process in ITPs requires heavy interaction between users and the proof assistants, where users are forced to do a lot of repetitive work, which makes the proving process time consuming. To make the proof process simpler and to provide proof guidance, an SPM-based learning approach is adopted in this work to find the frequent proof steps / patterns and their relationships in a PVS theory. An NB classifier is used to check the dependency of new conjectures on already proved facts. Moreover, the performance of some models for the prediction of next proof step(s) is compared, where CPT+ performs better than the other models. Some interesting proof patterns are found with SPM, and the obtained results show that the number of proof steps in each proof has a direct correlation with the efficiency of SPM algorithms.
There are several directions for future work. First, we would like to use the SPM algorithms on the corpora of proof steps for theories included in the PVS library, which contains thousands of theorems. This will enable us to develop a more general learning approach for the proofs of new conjectures. Another direction is to use evolutionary and heuristic techniques such as genetic programming and particle swarm optimization for the development of PVS strategies from frequently occurring proof patterns. Other future work includes the implementation of some well-known classifiers such as k-nearest neighbor in SPMF and enhancing the implementation of NB to provide statistics about common measures such as precision, recall and f-measure. Last but not least, using SPM algorithms on the dataset provided by [22] is in our future plan as well.

Fig. 1. SPM-based approach to learn the proof process

Fig. 2. A connector composed of a Sync and a FIFO1 channel

Table 1. A

Table 3. Frequent proof patterns extracted with CM-SPAM

Table 4. Results for CPT+ prediction model

Table 5. Accuracy of prediction models