Learning Automata with Side-Effects

Abstract. Automata learning has been successfully applied in the verification of hardware and software. The size of the learned automaton model is a bottleneck for scalability, and hence optimizations that enable learning of compact representations are important. This paper exploits monads, both as a mathematical structure and as a programming construct, to design and prove correct a wide class of such optimizations. Monads enable the development of a new learning algorithm and correctness proofs, building upon a general framework for automata learning based on category theory. The new algorithm is parametric on a monad, which provides a rich algebraic structure to capture non-determinism and other side-effects. We show that this allows us to uniformly capture existing algorithms, develop new ones, and add optimizations.


Introduction
The increasing complexity of software and hardware systems calls for new scalable methods to design, verify, and continuously improve systems. Black-box inference methods aim at building models of running systems by observing their response to certain queries. This reverse engineering process is very amenable to automation and allows for fine-tuning the precision of the model depending on the properties of interest, which is important for scalability.
One of the most successful instances of black-box inference is automata learning, which has been used in various verification tasks, ranging from finding bugs in implementations of network protocols [15] to rejuvenating legacy software [29]. Vaandrager [30] has written a comprehensive overview of the widespread use of automata learning in verification.
A limitation in automata learning is that the models of real systems can become too large to be handled by tools. This demands compositional methods and techniques that enable compact representation of behaviors.
In this paper, we show how monads can be used to add optimizations to learning algorithms in order to obtain compact representations. We will use as playground for our approach the well-known L* algorithm [2], which learns a minimal deterministic finite automaton (DFA) accepting a regular language by interacting with a teacher, i.e., an oracle that can reply to specific queries about the target language. Monads allow us to take an abstract approach, in which category theory is used to devise an optimized learning algorithm and a generic correctness proof for a broad class of compact models.
The inspiration for this work is quite concrete: it is a well-known fact that non-deterministic finite automata (NFAs) can be much smaller than deterministic ones for a regular language. The subtle point is that, given a regular language, there is a canonical deterministic automaton accepting it (the minimal one), but there might be many "minimal" non-deterministic automata accepting the same language. This raises a challenge for learning algorithms: which non-deterministic automaton should the algorithm learn? To overcome this, Bollig et al. [11] developed a version of Angluin's L* algorithm, which they called NL*, in which they use a particular class of NFAs, Residual Finite State Automata (RFSAs), which do admit minimal canonical representatives. Though NL* indeed is a first step towards incorporating a more compact representation of regular languages, several questions remain to be addressed. We tackle them in this paper.
DFAs and NFAs are formally connected by the subset construction. Underpinning this construction is the rich algebraic structure of languages and of the state space of the DFA obtained by determinizing an NFA. The state space of a determinized DFA, consisting of subsets of the state space of the original NFA, has a join-semilattice structure. Moreover, this structure is preserved in language acceptance: for subsets U and V, the language of U ∪ V is the union of the languages of U and V. Formally, the function that assigns to each state its language is a join-semilattice map, since languages themselves are just sets of words and have a lattice structure. And languages are even richer: they have the structure of complete atomic Boolean algebras. This leads to several questions: Can we exploit this structure to obtain even more compact representations? What if we slightly change the setting and look at weighted languages over a semiring, which have the structure of a semimodule (or a vector space, if the semiring is a field)?
The latter question is strongly motivated by the widespread use of weighted languages and corresponding weighted finite automata (WFAs) in verification, from the formal verification of quantitative properties [13,17,25], to probabilistic model-checking [5], to the verification of on-line algorithms [1].
Our key insight is that the algebraic structures mentioned above are in fact algebras for a monad T. In the case of join-semilattices this is the powerset monad, and in the case of vector spaces it is the free vector space monad. These monads can be used to define a notion of T-automaton, with states having the structure of an algebra for the monad T, which generalizes non-determinism as a side-effect. From a T-automaton we can derive a compact, equivalent version by taking as states a set of generators and transferring the algebraic structure of the original state space to the transition structure of the automaton.
This general perspective enables us to generalize L* to a new algorithm L*_T, which learns compact automata featuring non-determinism and other side-effects captured by a monad. Moreover, L*_T incorporates further optimizations arising from the monadic representation, which lead to more scalable algorithms.
We start by giving an overview of our approach, which also states our main contributions in greater detail and ends with a road map of the rest of the paper.

Overview and Contributions
In this section, we explain the original L* algorithm and discuss the challenges in adapting the algorithm to learn automata with side-effects, illustrating them through a concrete example: NFAs. We then highlight our main contributions.
The L* algorithm. This algorithm learns the minimal DFA accepting a language L ⊆ A* over a finite alphabet A. It assumes the existence of a minimally adequate teacher, which is an oracle that can answer two types of queries: 1. Membership queries: given a word w ∈ A*, does w belong to L? 2. Equivalence queries: given a hypothesis DFA H, does H accept L? If not, the teacher returns a counterexample, i.e., a word incorrectly classified by H. The algorithm incrementally builds an observation table made of two parts: a top part, with rows ranging over a finite set S ⊆ A*, and a bottom part, with rows ranging over S • A (where • is pointwise concatenation). Columns range over a finite E ⊆ A*. For each u ∈ S ∪ S • A and v ∈ E, the corresponding cell in the table contains 1 if and only if uv ∈ L. Intuitively, each row u contains enough information to fully identify the Myhill-Nerode equivalence class of u with respect to an approximation of the target language: rows with the same content are considered members of the same equivalence class. Cells are filled in using membership queries.
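The table-filling step described above can be sketched in a few lines of code. This is an illustrative sketch of ours, not the paper's implementation; the names ObservationTable, row_t, and row_b follow the text, while member stands in for the teacher's membership oracle.

```python
# A minimal sketch of the L* observation table (our own illustration).
# `member` plays the role of the teacher's membership oracle.

class ObservationTable:
    def __init__(self, alphabet, member):
        self.A = sorted(alphabet)
        self.member = member  # membership oracle: word -> 0 or 1
        self.S = {""}         # row labels (prefixes), with ε included
        self.E = [""]         # column labels (suffixes), with ε included

    def row_t(self, s):
        # top row: the content of row s over all columns, via queries
        return tuple(self.member(s + e) for e in self.E)

    def row_b(self, s, a):
        # bottom row: the row of the one-letter extension s·a
        return self.row_t(s + a)
```

Row contents are tuples so that they can be compared (and hashed) when rows are grouped into hypothesis states. For the language {aa} over {a, b}, `ObservationTable("ab", lambda w: int(w == "aa"))` reproduces the example table as E grows.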
As an example, and to set notation, consider the table below over A = {a, b}. It shows that L contains the word aa and does not contain the words ε (the empty word), a, b, ba, aaa, and baa.

        ε  a  aa
    ε   0  0  1
    a   0  1  0
    b   0  0  0

Here S = {ε} labels the top part and S • A = {a, b} the bottom part. We use functions row_t and row_b to describe the top and bottom parts of the table, respectively. Notice that S and S • A may intersect. For conciseness, when tables are depicted, elements in the intersection are only shown in the top part.
A key idea of the algorithm is to construct a hypothesis DFA from the different rows in the table. The construction is the same as that of the minimal DFA from the Myhill-Nerode equivalence, and exploits the correspondence between table rows and Myhill-Nerode equivalence classes. The state space of the hypothesis DFA is given by the set H = {row_t(s) | s ∈ S}. Note that there may be multiple rows with the same content, but they result in a single state, as they all belong to the same Myhill-Nerode equivalence class. The initial state is row_t(ε), and we use the ε column to determine whether a state is accepting: row_t(s) is accepting whenever row_t(s)(ε) = 1. The transition function maps the state row_t(s) on input a to row_b(s)(a). (Notice that the continuation is drawn from the bottom part of the table.) For the hypothesis automaton to be well-defined, ε must be in S and in E, and the table must satisfy two properties:
- Closedness states that each transition actually leads to a state of the hypothesis. That is, the table is closed if for all t ∈ S and a ∈ A there is s ∈ S such that row_t(s) = row_b(t)(a).
- Consistency states that there is no ambiguity in determining the transitions. That is, the table is consistent if for all s1, s2 ∈ S such that row_t(s1) = row_t(s2) we have row_b(s1) = row_b(s2).
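The two properties translate directly into code. The sketch below (our names, deterministic case) checks them for given row functions.

```python
# Closedness/consistency checks for the deterministic observation table.
# row_t(s) returns the top row of s; row_b(s, a) the bottom row of s·a.
# (Illustrative sketch; the helper names are ours.)

def is_closed(S, A, row_t, row_b):
    # every bottom row must already occur as some top row
    top = {row_t(s) for s in S}
    return all(row_b(t, a) in top for t in S for a in A)

def is_consistent(S, A, row_t, row_b):
    # labels with equal top rows must have equal one-letter extensions
    return all(
        row_b(s1, a) == row_b(s2, a)
        for s1 in S for s2 in S if row_t(s1) == row_t(s2)
        for a in A
    )
```

For instance, with member(w) = 1 iff |w| ≠ 1 over the alphabet {a}, S = {ε, a, aa, aaa}, and E = {ε}, the table is closed but not consistent (the rows of ε and aa agree while their extensions differ), mirroring the example run discussed in this section.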
The algorithm updates the sets S and E to satisfy these properties, constructs a hypothesis, submits it in an equivalence query, and, when given a counterexample, refines the hypothesis. This process continues until the hypothesis is correct. The algorithm is shown in Fig. 1.
Example Run. We now run the algorithm with the target language L = {w ∈ {a}* | |w| ≠ 1}. Initially, S = E = {ε}. We build the observation table given in Fig. 2a. This table is not closed, because the row with label a, having 0 in the only column, does not appear in the top part of the table: the only row ε has 1. To fix this, we add the word a to the set S. Now the table (Fig. 2b) is closed and consistent, so we construct the hypothesis shown in Fig. 2c and pose an equivalence query. The teacher replies no and informs us that the word aaa should have been accepted. L* handles a counterexample by adding all its prefixes to the set S; we only have to add aa and aaa in this case. The next table (Fig. 2d) is closed, but not consistent: the rows ε and aa both have value 1, but their extensions a and aaa differ. To fix this, we prepend the letter a to the column ε on which they differ and add a • ε = a to E. This distinguishes row_t(ε) from row_t(aa), as seen in the next table in Fig. 2e. The table is now closed and consistent, and the new hypothesis automaton is precisely the correct one, M.
As mentioned, the hypothesis construction approximates the theoretical construction of the minimal DFA, which is unique up to isomorphism. That is, for S = E = A*, the relation that identifies words of S having the same value in row_t is precisely the Myhill-Nerode right congruence.
Learning non-deterministic automata. As is well known, NFAs can be smaller than the minimal DFA for a given language. For example, the language L above is accepted by an NFA N that is smaller than the minimal DFA M. Though in this example, which we chose for simplicity, the state reduction is not massive, it is known that in general NFAs can be exponentially smaller than the minimal DFA [24]. This reduction of the state space is enabled by a side-effect: non-determinism, in this case.
Learning NFAs can lead to a substantial gain in space complexity, but it is challenging. The main difficulty is that NFAs do not have a canonical minimal representative: there may be several non-isomorphic state-minimal NFAs accepting the same language, which poses problems for the development of the learning algorithm. To overcome this, Bollig et al. [11] proposed to use a particular class of NFAs, namely RFSAs, which do admit minimal canonical representatives. However, their ad-hoc solution for NFAs does not extend to other automata, such as weighted or alternating ones. In this paper we present a solution that works for any side-effect, specified as a monad.
The crucial observation underlying our approach is that the language semantics of an NFA is defined in terms of its determinization, i.e., the DFA obtained by taking sets of states of the NFA as its state space. In other words, this DFA is defined over an algebraic structure induced by the powerset, namely a (complete) join-semilattice (JSL) whose join operation is set union. This automaton model does admit minimal representatives, which leads to the key idea for our algorithm: learning NFAs as automata over JSLs. In order to do so, we use an extended table where rows have a JSL structure, defined as follows. The join of two rows is given by an element-wise or, and the bottom element is the row containing only zeroes. More precisely, the new table consists of the two functions obtained by extending row_t and row_b from single words to sets of words: the row of a set is the join of the rows of its elements. Formally, these functions are JSL homomorphisms, and they induce general definitions of closedness and consistency for the extended table. We remark that our algorithm does not actually store the whole extended table.

Optimizations. In this paper we also present two optimizations to our algorithm. For the first one, note that the state space of the hypothesis constructed by the algorithm can be very large, since it encodes the entire algebraic structure. We show that we can extract a minimal set of generators from the table and compute a succinct hypothesis in the form of an automaton with side-effects, without any algebraic structure. For JSLs, this consists in only taking rows that are not the join of other rows, i.e., the join-irreducibles. By applying this optimization to this specific case, we essentially recover the learning algorithm of Bollig et al. [11].
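Concretely, for the powerset case the extended rows are computed by taking elementwise joins. A small sketch of ours, encoding rows as 0/1 tuples:

```python
# Extension of row_t to sets of labels in the JSL setting: the row of a
# set is the elementwise "or" of the rows of its elements; the empty set
# yields the bottom (all-zero) row. (Illustrative sketch, names ours.)

def row_of_set(U, row_t, width):
    acc = (0,) * width
    for u in U:
        acc = tuple(x | y for x, y in zip(acc, row_t(u)))
    return acc
```

For instance, if row_t(ε) = (1, 0) and row_t(a) = (0, 1), then the row of {ε, a} is (1, 1) and the row of ∅ is the bottom row (0, 0).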
The second optimization is a generalization of the optimized counterexample handling method of Rivest and Schapire [28], originally intended for L* and DFAs. It consists in processing counterexamples by adding a single suffix of the counterexample to E, instead of adding all prefixes of the counterexample to S. This can avoid the algorithm posing a large number of membership queries.

Example Revisited. We now run the new algorithm on the language L = {w ∈ {a}* | |w| ≠ 1} considered earlier. Starting from S = E = {ε}, the observation table (Fig. 3a) is immediately closed and consistent. (It is closed because row_t({a}) = row_t(∅).) This gives the JSL hypothesis shown in Fig. 3b, which leads to an NFA hypothesis having a single state that is initial, accepting, and has no transitions (Fig. 3c). The hypothesis is incorrect, and the teacher may supply us with the counterexample aa. Adding the prefixes a and aa to S leads to the table in Fig. 3d. The table is again closed, but not consistent: row_t({ε}) = row_t({aa}), while row_b({ε})(a) ≠ row_b({aa})(a). Thus, we add a to E. The resulting table (Fig. 3e) is closed and consistent. We note that the row of aa is the join of other rows: row_t({aa}) = row_t({ε, a}) (i.e., it is not a join-irreducible), and therefore it can be ignored when building the succinct hypothesis. This hypothesis has two states, ε and a, and indeed it is the correct one, N.
Contributions and road map of the paper. After some preliminary notions in Section 3, we present the main contributions:
- In Section 4, we develop a general algorithm L*_T, which generalizes the NFA one presented in Section 2 to an arbitrary monad T capturing side-effects, and we provide a general correctness proof for our algorithm.
- In Section 5, we describe the first optimization and prove its correctness.
- In Section 6, we describe the second optimization. We also show how it can be combined with the one of Section 5, and how it can lead to a further small optimization, where the consistency check on the table is dropped.
- Finally, in Section 7, we show how L*_T can be applied to several automata models, highlighting further case-specific optimizations when available.

Preliminaries
In this section we define a notion of T-automaton, a generalization of non-deterministic finite automata parametric in a monad T. We assume familiarity with basic notions of category theory: functors (in the category Set of sets and functions) and natural transformations.

Side-effects can be conveniently captured as a monad. A monad T = (T, η, µ) is a triple consisting of an endofunctor T on Set and two natural transformations, a unit η : Id ⇒ T and a multiplication µ : T² ⇒ T, satisfying the compatibility laws µ ∘ ηT = id = µ ∘ Tη and µ ∘ µT = µ ∘ Tµ.

Example 1 (Monads). An example of a monad is the triple (P, {−}, ⋃), where P denotes the powerset functor associating to a set the collection of its subsets, {−} is the singleton operation, and ⋃ is union of sets. Another example is the triple (V(−), e, m), where V(X) is the free semimodule (over a semiring S) over X, namely {ϕ | ϕ : X → S has finite support}. The support of a function ϕ : X → S is the set of x ∈ X such that ϕ(x) ≠ 0. The unit e : X → V(X) maps each x ∈ X to its characteristic function, and the multiplication m : V(V(X)) → V(X) is given by m(Φ)(x) = Σ_{ϕ ∈ V(X)} Φ(ϕ) · ϕ(x).

Given a monad T, a T-algebra is a pair (X, h) consisting of a carrier set X and a function h : TX → X such that h ∘ η_X = id and h ∘ Th = h ∘ µ_X. A T-algebra homomorphism between T-algebras (X, h) and (Y, k) is a function f : X → Y such that f ∘ h = k ∘ Tf. The abstract notion of T-algebra instantiates to expected notions, as illustrated in the following example.
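The powerset monad of Example 1 can be written down directly. The following sketch of ours checks the unit and multiplication on small sets, with fmap playing the role of the functor action of P on functions.

```python
# The powerset monad concretely: the unit forms singletons and the
# multiplication flattens a set of sets by union. (Illustrative sketch.)

def eta(x):
    return frozenset([x])

def mu(set_of_sets):
    # flatten one level of nesting by taking the union
    return frozenset(x for s in set_of_sets for x in s)

def fmap(f, s):
    # functor action of P: apply f elementwise
    return frozenset(f(x) for x in s)
```

The monad laws then say that flattening a singleton-of-a-set, or a set-of-singletons, gives back the original set, and that the two ways of flattening a doubly nested set agree.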
Example 2 (Algebras for a monad). The P-algebras are the (complete) join-semilattices, and their homomorphisms are join-preserving functions. If S is a field, V-algebras are vector spaces, and their homomorphisms are linear maps.
We will often refer to a T-algebra (X, h) as X if h is understood or if its specific definition is irrelevant. Given a set X, (TX, µ_X) is a T-algebra called the free T-algebra on X. One can build algebras pointwise for some operations. For instance, if Y is a set and (X, x) a T-algebra, then we have a T-algebra (X^Y, f), where f : T(X^Y) → X^Y is given by f(W)(y) = (x ∘ T(ev_y))(W) and ev_y : X^Y → X by ev_y(g) = g(y). If U and V are T-algebras and f : U → V is a T-algebra homomorphism, then the image img(f) of f is a T-algebra, with the T-algebra structure inherited from V. The following proposition connects algebra homomorphisms from the free T-algebra on a set U to an algebra V with functions U → V. We will make use of this later in the section.

Proposition 3. Given a set U and a T-algebra (V, v), there is a bijective correspondence between T-algebra homomorphisms TU → V and functions U → V: a T-algebra homomorphism f : TU → V restricts to f† = f ∘ η_U : U → V, and a function g : U → V extends to g‡ = v ∘ T(g) : TU → V. Then g‡ is a T-algebra homomorphism, called the free T-extension of g, and we have (f†)‡ = f and (g‡)† = g.
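For T = P and a join-semilattice carrier (here: frozensets under union), the free T-extension of Proposition 3 is simply "apply and join". A sketch with our own helper names:

```python
# Free P-extension: a function g : U -> V into a JSL V extends to the
# join-preserving map g‡ : P(U) -> V with g‡(W) = join of g(u), u in W.
# Here V is the JSL of frozensets under union. (Illustrative sketch.)

def free_ext(g):
    def g_ext(W):
        out = frozenset()
        for u in W:
            out |= g(u)
        return out
    return g_ext

def restrict(f):
    # the other direction of the bijection: f† = f ∘ η
    return lambda u: f(frozenset([u]))
```

The round-trip laws (f†)‡ = f and (g‡)† = g can be checked on small examples.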
We now have all the ingredients to define our notion of automaton with side-effects and its language semantics. We fix a monad (T, η, µ) with T preserving finite sets, as well as a T-algebra O that models outputs of automata.

Definition 4 (T-automaton). A T-automaton is a quadruple A = (Q, δ : Q → Q^A, out : Q → O, init ∈ Q), where the state space Q is a T-algebra, the transition map δ and output map out are T-algebra homomorphisms, and init is the initial state.

Example 5. DFAs are Id-automata when O = 2 = {0, 1} is used to distinguish accepting from rejecting states. For the more general case of O being any set, DFAs generalize into Moore automata.

Example 6. Recall that P-algebras are JSLs and their homomorphisms are join-preserving functions. In a P-automaton, Q is equipped with a join operation, and Q^A is a join-semilattice with pointwise join: (f ∨ g)(a) = f(a) ∨ g(a) for a ∈ A. Since the automaton maps preserve joins, we have, in particular, δ(q1 ∨ q2)(a) = δ(q1)(a) ∨ δ(q2)(a). One can represent an NFA over a set of states S as a P-automaton by taking Q = (P(S), ⋃) and O = 2, the Boolean join-semilattice with the or operation as its join. Let init ⊆ S be the set of initial states and out : P(S) → 2 and δ : P(S) → P(S)^A the respective free extensions (Proposition 3) of the NFA's output and transition functions. The resulting P-automaton is precisely the determinized version of the NFA.
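As an executable illustration of this representation (the two-state NFA below is our own example, not one from the paper): extending an NFA's transition and output maps to subsets is exactly the subset construction. This NFA accepts the words over {a} whose length is not 1.

```python
# An NFA viewed as a P-automaton: states are subsets, and the transition
# and output maps are the free P-extensions of the NFA's maps, i.e., the
# subset construction. (Illustrative example of ours.)

delta = {("x", "a"): frozenset("y"), ("y", "a"): frozenset("xy")}
accepting = {"x"}
init = frozenset("x")  # set of initial states

def delta_ext(U, a):
    # free extension of the transition map: union of successor sets
    return frozenset(q2 for q in U for q2 in delta.get((q, a), frozenset()))

def out_ext(U):
    # free extension of the output map: join ("or") of the outputs
    return int(any(q in accepting for q in U))

def accepts(word):
    U = init
    for a in word:
        U = delta_ext(U, a)
    return out_ext(U)
```

Running `accepts` on a^n for n = 0, 1, 2, ... yields 1, 0, 1, 1, ...: the determinized automaton tracks the set of states reachable on each input.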
More generally, an automaton with side-effects given by a monad T always represents a T-automaton with a free state space.

Proposition 7. A T-automaton of the form ((TX, µ_X), δ, out, init), for any set X, is completely defined by the set X together with the element init ∈ TX and the functions δ† : X → (TX)^A and out† : X → O. We call such a T-automaton a succinct automaton, and we sometimes identify it with the representation (X, δ†, out†, init). These automata are closely related to the ones studied in [18].
A (generalized) language is a function L : A* → O. For every T-automaton we have a reachability map and an observability map, telling respectively which state is reached by reading a given word and which language each state recognizes.

Definition 8 (Reachability/Observability Maps). The reachability map of a T-automaton A is the function r_A : A* → Q given by r_A(ε) = init and r_A(ua) = δ(r_A(u))(a). The observability map of A is the function o_A : Q → O^{A*} given by o_A(q)(ε) = out(q) and o_A(q)(av) = o_A(δ(q)(a))(v). The language accepted by A is the map L_A = o_A(init) : A* → O.

Example 9. For an NFA A represented as a P-automaton, as seen in Example 6, o_A(q) is the language of q in the traditional sense. Note that q, in general, is a set of states: o_A(q) takes the union of the languages of the singleton states. The language L_A is the language accepted by the initial states, i.e., the language of the NFA. The reachability map r_A(u) returns the set of states reached via all paths reading u.
Given a language L : A* → O, there exists a (unique) minimal T-automaton M_L accepting L, which is minimal in the number of states. Its existence follows from general facts; see for example [19].

In the following, we will also make use of the minimal Moore automaton accepting L. Although this always exists (by instantiating Definition 10 with T = Id), it need not be finite. The following property says that finiteness of Moore automata and of T-automata accepting the same language are related.

Proposition 11. The minimal Moore automaton accepting L is finite if and only if the minimal T-automaton accepting L is finite.

A General Algorithm
In this section we introduce our extension of L* to learn automata with side-effects. The algorithm is parametric in the notion of side-effect, represented as the monad T, and is therefore called L*_T. We fix a language L : A* → O that is to be learned, and we assume that there is a finite T-automaton accepting L. This assumption generalizes the requirement of L* that L is regular (i.e., accepted by a specific class of T-automata; see Example 5).
An observation table consists of a pair of functions row_t : S → O^E and row_b : S → (O^E)^A given by row_t(s)(e) = L(se) and row_b(s)(a)(e) = L(sae), where S, E ⊆ A* are finite sets with ε ∈ S ∩ E. For O = 2, we recover exactly the L* observation table. The key idea for L*_T is defining closedness and consistency over the free T-extensions of these functions; below we write row_t and row_b also for these extensions.
Definition 12 (Closedness and Consistency). The table is closed if for all U ∈ T(S) and a ∈ A there exists U′ ∈ T(S) such that row_t(U′) = row_b(U)(a). The table is consistent if for all U1, U2 ∈ T(S) such that row_t(U1) = row_t(U2) we have row_b(U1) = row_b(U2).

For closedness, we do not need to check all elements of T(S) × A against elements of T(S), but only those of S × A, thanks to the following result.

Lemma 13. If for all s ∈ S and a ∈ A there is U ∈ T(S) such that row_t(U) = row_b(s)(a), then the table is closed.

Example 14. For NFAs represented as P-automata, the properties are as presented in Section 2. Recall that for T = P and O = 2, the Boolean join-semilattice, row_t and row_b describe a table where rows are labeled by subsets of S. Then we have, for instance, row_t({s1, s2})(e) = row_t(s1)(e) ∨ row_t(s2)(e), i.e., row_t({s1, s2})(e) = 1 if and only if L(s1e) = 1 or L(s2e) = 1. Closedness amounts to checking whether each row in the bottom part of the table is the join of a set of rows in the top part. Consistency amounts to checking whether, for all sets of rows U1, U2 ⊆ S in the top part of the table whose joins are equal, the joins of the rows U1 • {a} and U2 • {a} in the bottom part are also equal, for all a ∈ A.

If closedness and consistency hold, we can define a hypothesis T-automaton H with state space H = img(row_t), initial state init = row_t(ε), output out(row_t(U)) = row_t(U)(ε), and transitions δ(row_t(U))(a) = row_b(U)(a). The correctness of this definition follows from the abstract treatment of [21], instantiated to the category of T-algebras and their homomorphisms.
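For T = P, the closedness check of Example 14 admits a standard shortcut: a row is a join of top rows exactly when it equals the join of all top rows below it (in the pointwise order), which avoids enumerating subsets. A sketch with our row encoding:

```python
# JSL-level closedness for the powerset case, using the pointwise order
# on 0/1 rows: a bottom row is a join of top rows iff it equals the join
# of all top rows below it. (Illustrative sketch, names ours.)

def leq(r1, r2):
    return all(x <= y for x, y in zip(r1, r2))

def join(rows, width):
    acc = (0,) * width
    for r in rows:
        acc = tuple(x | y for x, y in zip(acc, r))
    return acc

def jsl_closed(top_rows, bottom_rows):
    for r in bottom_rows:
        below = [t for t in top_rows if leq(t, r)]
        if join(below, len(r)) != r:
            return False
    return True
```

The shortcut is sound because any subset whose join equals r consists of rows below r, so the join over all rows below r is sandwiched between r and itself.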
We can now give the algorithm L*_T. Similarly to the example in Section 2, we only have to adjust lines 5 and 8 in Fig. 1. The resulting algorithm is shown in Fig. 4.

Correctness.
Correctness for L*_T amounts to proving that, for any target language L, the algorithm terminates and returns the minimal T-automaton M_L accepting L. As in the original L* algorithm, we only need to prove that the algorithm terminates, that is, that only finitely many hypotheses are produced. Correctness follows from termination, since line 13 causes the algorithm to terminate only if the hypothesis automaton coincides with M_L.
In order to show termination, we argue that the state space H of the hypothesis increases while the algorithm loops, and that H cannot be larger than M, the state space of M_L. In fact, when a closedness defect is resolved (line 6), a row that was not previously found in the image of row_t : T(S) → O^E is added, so the set H grows larger. When a consistency defect is resolved (line 9), two previously equal rows become distinguished, which also increases the size of H.
As for counterexamples, adding their prefixes to S (line 11) creates a consistency defect, which will be fixed during the next iteration, causing H to increase. This is due to the following result, which says that the counterexample z has a prefix that violates consistency. Note that the hypothesis H in the statement below is the hypothesis obtained before adding the prefixes of z to S.

Proposition 15. If z ∈ A* is such that L_H(z) ≠ L(z) and prefixes(z) ⊆ S, then there are a prefix ua of z, with u ∈ A* and a ∈ A, and U ∈ T(S) such that row_t(u) = row_t(U) and row_b(u)(a) ≠ row_b(U)(a).

Now, note that by increasing S or E, the hypothesis state space H never decreases in size. Moreover, for S = A* and E = A*, row_t = t_L. Therefore, since H and M are defined as the images of row_t and t_L, respectively, the size of H is bounded by that of M. Since H increases while the algorithm loops, the algorithm must terminate and is thus correct.
Note that the learning algorithm of Bollig et al. does not terminate using this counterexample processing method [10, Appendix F]. This is due to their notion of consistency being weaker than ours: we have shown that progress is guaranteed because a consistency defect, in our sense, is created by this method.
Query complexity. The complexity of automata learning algorithms is usually measured in terms of the number of both membership and equivalence queries asked, as it is common to assume that computations within the algorithm are insignificant compared to evaluating the system under analysis in applications. The cost of answering queries themselves is not considered, as it depends on the implementation of the teacher, from which the algorithm abstracts.
The table is a T-algebra homomorphism, so membership queries for rows labeled in S are enough to determine all other rows. We measure the query complexities in terms of the number of states n of the minimal Moore automaton, the number of states t of the minimal T-automaton, the size k of the alphabet, and the length m of the longest counterexample. Note that t cannot be smaller than n, but it can be much bigger. For example, when T = P, t may be in O(2^n).³ The maximum number of closedness defects fixed by the algorithm is n, as a closedness defect for the setting with algebraic structure is also a closedness defect for the setting without that structure. The maximum number of consistency defects fixed by the algorithm is t, as fixing a consistency defect distinguishes two rows that were previously identified. Since counterexamples lead to consistency defects, this also means that the algorithm will not pose more than t equivalence queries. A word is added to S when fixing a closedness defect, and O(m) words are added to S when processing a counterexample. The number of rows that we need to fill using queries is therefore in O(tmk). The number of columns added to the table is given by the number of times a consistency defect is fixed and is thus in O(t). Altogether, the number of membership queries is in O(t²mk).
³ Take the language {a^p}, for some p ∈ N and a singleton alphabet {a}. Its residual languages are ∅ and {a^i} for all 0 ≤ i ≤ p, thus the minimal DFA accepting the language has p + 2 states. However, the residual languages w.r.t. sets of words are all the subsets of {ε, a, aa, . . . , a^p}; hence, the minimal T-automaton has 2^(p+1) states.
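The footnote's count can be confirmed by brute force (a sketch of ours): encode the residual row of each word a^i over the suffixes ε, ..., a^p and count the distinct joins over all subsets of rows.

```python
# Brute-force check of the footnote: for L = {a^p}, the joins of
# single-word residual rows form all 2^(p+1) subsets of {ε, a, ..., a^p}.
# (Illustrative sketch, names ours.)
from itertools import combinations

def count_set_residuals(p):
    # row of a^i over columns a^j, j = 0..p: 1 iff a^i a^j is in L = {a^p}
    rows = [tuple(int(i + j == p) for j in range(p + 1))
            for i in range(p + 2)]
    joins = set()
    for k in range(len(rows) + 1):
        for combo in combinations(rows, k):
            acc = (0,) * (p + 1)
            for r in combo:
                acc = tuple(x | y for x, y in zip(acc, r))
            joins.add(acc)
    return len(joins)
```

For small p this returns exactly 2^(p+1), matching the footnote's count of states of the minimal T-automaton.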

Succinct Hypotheses
We now describe the first of two optimizations, which is enabled by the use of monads. Our algorithm produces hypotheses that can be quite large, as their state space is the image of row_t, which has the whole set T(S) as its domain. For instance, when T = P, T(S) is exponentially larger than S. We show how we can compute succinct hypotheses, whose state space is given by a subset of S. We start by defining sets of generators for the table.
Definition 16. A set S′ ⊆ S is a set of generators for the table whenever for all s ∈ S there is U ∈ T(S′) such that row_t(s) = row_t(U).

Intuitively, U is the decomposition of s into a "combination" of generators. When T = P, S′ generates the table whenever each row can be obtained as the join of a set of rows labeled by S′. Explicitly: for all s ∈ S there is {s1, . . . , sn} ⊆ S′ such that row_t(s) = row_t({s1, . . . , sn}).

Recall that H, with state space H, is the hypothesis automaton for the table. The existence of generators S′ allows us to compute a T-automaton with state space T(S′) equivalent to H. We call this the succinct hypothesis, although T(S′) may be larger than H. Proposition 7 tells us that the succinct hypothesis can be represented as an automaton with side-effects in T that has S′ as its state space. This results in a lower space complexity when storing the hypothesis.
We now show how the succinct hypothesis is computed. Observe that, if generators S′ exist, row_t factors through the restriction of itself to T(S′); denote this latter function row′_t. Since T(S′) ⊆ T(S), the image of row′_t coincides with img(row_t) = H, and therefore the surjection restricting row′_t to its image has the form e : T(S′) → H. Any right inverse i : H → T(S′) of the function e (that is, e ∘ i = id_H; note that whereas e is a T-algebra homomorphism, i need not be one) yields a succinct hypothesis as follows.
Definition 17 (Succinct Hypothesis). The succinct hypothesis is the T-automaton S = (T(S′), δ, out, init) given by init = i(row_t(ε)), out(U) = row′_t(U)(ε) (row′_t being the restriction of row_t to T(S′)), and δ the free T-extension of the function that maps s ∈ S′ and a ∈ A to i(row_b(s)(a)), which is well-defined by closedness. This definition is inspired by that of a scoop, due to Arbib and Manes [4].
Proposition 18. Any succinct hypothesis of H accepts the language of H.
We now give a simple procedure to compute a minimal set of generators, that is, a set S′ such that no proper subset of S′ is a set of generators. This generalizes a procedure defined by Angluin et al. [3] for non-deterministic, universal, and alternating automata.

Proposition 19. The following algorithm returns a minimal set of generators for the table:

    S′ ← S
    while there are s ∈ S′ and U ∈ T(S′ \ {s}) such that row_t(U) = row_t(s):
        S′ ← S′ \ {s}
    return S′

To determine whether such a U exists, one can always naively enumerate all possibilities, using that T preserves finite sets. This is what we call the basic algorithm. For specific algebraic structures, one may find more efficient methods, as we show in the following example.
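For T = P, the loop of Proposition 19 can be instantiated with an order-based test: a label is redundant iff its row equals the join of the remaining rows below it. A sketch of ours:

```python
# Generator reduction of Proposition 19 for the powerset case:
# repeatedly drop a label whose row is the join of the remaining rows
# below it (the polynomial, order-based test). Names are ours.

def minimize_generators(S, row_t):
    def leq(r1, r2):
        return all(x <= y for x, y in zip(r1, r2))

    gens = list(S)
    changed = True
    while changed:
        changed = False
        for s in gens:
            r = row_t(s)
            acc = (0,) * len(r)
            for t in gens:
                if t != s and leq(row_t(t), r):
                    acc = tuple(x | y for x, y in zip(acc, row_t(t)))
            if acc == r:  # the row of s is generated by the others
                gens.remove(s)
                changed = True
                break
    return gens
```

On the rows of the running example (row of ε = (1, 0), row of a = (0, 1), row of aa = (1, 1)), the label aa is dropped and the join-irreducible rows remain.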
Example 20. Consider the powerset monad T = P. We now exemplify two ways of computing succinct hypotheses, which are inspired by canonical RFSAs [16]. The basic idea is to start from a deterministic automaton and to remove states that are equivalent to a set of other states. The algorithm given in Proposition 19 computes a minimal S′ that only contains labels of rows that are not the join of other rows. (In case two rows are equal, only one of their labels is kept.) In other words, as mentioned in Section 2, S′ contains labels of join-irreducible rows. To concretize the algorithm efficiently, we use a method introduced by Bollig et al. [11], which essentially exploits the natural order on the JSL of table rows. In contrast to the basic exponential algorithm, this results in a polynomial one.⁵ Bollig et al. determine whether a row is a join of other rows by comparing the row just to the join of the rows below it. Like them, we make use of this idea also to compute right inverses of e, for which we will formalize the order.
The function e : P(S′) → H tells us which sets of rows are equivalent to a single state in H. We show two right inverses H → P(S′) for it. The first one, i₁(q) = {s ∈ S′ | row_t(s) ≤ q}, stems from the construction of the canonical RFSA of a language [16]. Here we use the order a ≤ b ⇐⇒ a ∨ b = b induced by the JSL structure. The resulting construction of a succinct hypothesis was first used by Bollig et al. [11]. This succinct hypothesis has a "maximal" transition function, meaning that no more transitions can be added without changing the language of the automaton.
The second inverse results in a more economical transition function, in which some redundancies are removed. This corresponds to the simplified canonical RFSA [16].
Example 21. Consider T = P, and recall the table in Fig. 3e. When S′ = S, the right inverse given by i_1 yields the succinct hypothesis shown below.
[NFA diagram omitted.] Note that i_1(row_t(aa)) = {ε, a, aa}. Taking i_2 instead, the succinct hypothesis is just the DFA (1), because i_2(row_t(aa)) = {aa}. Rather than constructing a succinct hypothesis directly, our algorithm first reduces the set S′. In this case, we have row_t(aa) = row_t({ε, a}), so we remove aa from S′. Now i_1 and i_2 coincide and produce the NFA (2). Minimizing the set S′ in this setting essentially comes down to determining what Bollig et al. [11] call the prime rows of the table.
Remark 22. The algorithm in Proposition 19 implicitly assumes an order in which the elements of S are checked. Although the algorithm is correct for any such order, different orders may give results that differ in size.

Optimized Counterexample Handling
The second optimization we present generalizes the counterexample processing method due to Rivest and Schapire [28], which improves the worst-case number of membership queries needed in L⋆. Maler and Pnueli [26] proposed to add all suffixes of the counterexample to the set E instead of adding all its prefixes to the set S. This eliminates the need for consistency checks in the deterministic setting. The method by Rivest and Schapire finds a single suffix of the counterexample and adds it to E. This suffix is chosen in such a way that it either distinguishes two existing rows or creates a closedness defect, both of which imply that the hypothesis automaton will grow.
The main idea is to find the distinguishing suffix via the hypothesis automaton H. Given u ∈ A*, let q_u be the state in H reached by reading u, i.e., q_u = r_H(u). For each q ∈ H, we pick any U_q ∈ T(S) that yields q according to the table, i.e., such that row_t(U_q) = q. Then for a counterexample z we have that the residual language w.r.t. U_{q_z} does not "agree" with the residual language w.r.t. z.
The above intuition can be formalized as follows. Let R : A* → O^{A*} be given by R(u) = t_L(U_{q_u}) for all u ∈ A*: the residual language computation. We have the following technical lemma, saying that a counterexample z distinguishes the residual languages t_L(z) and R(z).
We assume that U_{q_ε} = η(ε). While reading a counterexample z, the hypothesis automaton passes through a sequence of states q_{u_0}, q_{u_1}, q_{u_2}, ..., q_{u_n}, where u_0 = ε, u_n = z, and each u_{i+1} = u_i a for some a ∈ A is a prefix of z. If z were correctly classified by H, all residuals R(u_i) would classify the remaining suffix v of z (i.e., such that z = u_i v) in the same way. However, the previous lemma tells us that, for a counterexample z, this is not the case, meaning that for some suffix v we have R(ua)(v) ≠ R(u)(av). In short, this inequality is discovered along a transition on the path to z.
Corollary 24. If z ∈ A* is such that L_H(z) ≠ L(z), then there are u, v ∈ A* and a ∈ A such that uav = z and R(ua)(v) ≠ R(u)(av).
To find such a decomposition efficiently, Rivest and Schapire use a binary search algorithm. We conclude with the following result, which turns the above property into the elimination of a closedness witness. That is, given a counterexample z and the resulting decomposition uav from the above corollary, we show that, while currently row_t(U_{q_{ua}}) = row_b(U_{q_u})(a), after adding v to E we have row_t(U_{q_{ua}})(v) ≠ row_b(U_{q_u})(a)(v). (To see that the latter follows from the proposition below, note that for all U ∈ T(S) and e ∈ E, row_t(U)(e) = t_L(U)(e) and, for each a ∈ A, row_b(U)(a)(e) = t_L(U)(ae).) The inequality means that either we have a closedness defect, or there still exists some U ∈ T(S) such that row_t(U) = row_b(U_{q_u})(a). In the latter case, the rows row_t(U) and row_t(U_{q_{ua}}) have become distinguished by adding v, which means that the size of H has increased. A closedness defect also increases the size of H, so in any case we make progress.
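The binary search can be sketched as follows; we abstract the evaluation of R(z[:i])(z[i:]) into a callback beta, a hypothetical interface that the text does not fix:

```python
def find_split(z, beta):
    # beta(i) evaluates R(z[:i])(z[i:]); for a counterexample z we have
    # beta(0) = L(z) but beta(len(z)) = L_H(z), so beta(0) != beta(len(z)).
    lo, hi = 0, len(z)
    b0 = beta(lo)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if beta(mid) == b0:
            lo = mid   # a disagreement still lies to the right of mid
        else:
            hi = mid   # a disagreement lies at or to the left of mid
    # Now beta(lo) != beta(lo + 1): the decomposition z = u a v.
    return z[:lo], z[lo], z[lo + 1:]
```

The loop maintains beta(lo) = beta(0) ≠ beta(hi), so the returned triple satisfies the inequality of Corollary 24 after O(log |z|) evaluations of beta.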
We now show how to combine this optimized counterexample processing method with the succinct hypothesis optimization from Section 5. Recall that the succinct hypothesis is based on a right inverse i : H → T(S′) of e : T(S′) → H. Choosing such an i is equivalent to choosing U_q for each q ∈ H. We then redefine R using the reachability map of the succinct hypothesis. Unfortunately, there is one complication. We assumed earlier that U_{q_ε} = η(ε), or more specifically that R(ε)(z) = L(z). This may now fail to hold because we do not even necessarily have ε ∈ S′. We show next that if this equality does not hold, then there are two rows that we can distinguish by adding z to E. Thus, after testing whether R(ε)(z) = L(z), we either add z to E (if the test fails) or proceed with the original method.
To see that the original method still works, we prove the analogue of Proposition 25 for the new definition of R.
Recall the succinct hypothesis from Fig. 3c for the table in Fig. 2a. Note that S′ = S cannot be further reduced. The hypothesis is based on the right inverse i : H → P(S) of e : P(S) → H given by i(row_t(ε)) = {ε} and i(row_t(∅)) = ∅. This is the only possible right inverse because e is bijective. For the prefixes of the counterexample aa we have r_{S′}(ε) = {ε} and r_{S′}(a) = r_{S′}(aa). Adding a to E would indeed create a closedness defect.
Query complexity. Again, we measure the membership and equivalence query complexities in terms of the number n of states of the minimal Moore automaton, the number t of states of the minimal T-automaton, the size k of the alphabet, and the length m of the longest counterexample.
A counterexample now gives an additional column instead of a set of rows, and we have seen that this leads either to a closedness defect or to two rows being distinguished. Thus, the number of equivalence queries is still at most t, and the number of columns is still in O(t). However, the number of rows that we need to fill using membership queries is now in O(nk). This means that a total of O(tnk) membership queries is needed to fill the table.
Apart from filling the table, we also need queries to analyze counterexamples. The binary search algorithm mentioned after Corollary 24 requires, for each counterexample, O(log m) computations of R(x)(y) for varying words x and y. Let r be the maximum number of membership queries required for a single such computation. (Here the algebra structure α : T(O) → O on O is used to combine the outcomes of the individual membership queries into the value R(x)(y), also in the succinct hypothesis case.) For some examples (see for instance the writer automata in Section 7), we even have r = 1. The overall membership query complexity is O(tnk + tr log m).
Dropping Consistency. We described the counterexample processing method based on Proposition 25 in terms of the succinct hypothesis rather than the actual hypothesis H, by showing that R can be defined using the succinct hypothesis. Since the definition of the succinct hypothesis does not rely on consistency to be well-defined, this means we could drop the consistency check from the algorithm altogether. We can still measure progress in terms of the size of the set H, but it will not be the state space of an actual hypothesis during intermediate stages. This observation also explains why Bollig et al. [11] are able to use a weaker notion of consistency in their algorithm. Interestingly, they exploit the canonicity of their choice of succinct hypotheses to arrive at a polynomial membership query complexity that does not involve the factor t.

Examples
In this section we list several examples that can be seen as T-automata and hence learned via an instance of L⋆_T. We remark that, since our algorithm operates on finite structures (recall that T preserves finite sets), for each automaton type one can obtain a basic, correct-by-construction instance of L⋆_T for free, by plugging the concrete definition of the monad into the abstract algorithm. However, we note that this is not how L⋆_T is intended to be used in a real-world context; it should be seen as an abstract specification of the operations each concrete implementation needs to perform, or, in other words, as a template for real implementations.
For each instance below, we discuss whether certain operations admit a more efficient implementation than the basic one, based on the specific algebraic structure induced by the monad. Due to our general treatment, the optimizations of Sections 5 and 6 apply to all of these instances.
Non-deterministic automata. As discussed before, non-deterministic automata are P-automata with a free state space, provided that O = 2 is equipped with the "or" operation as its P-algebra structure. We also mentioned that, as Bollig et al. [11] showed, there is a polynomial-time algorithm to check whether a given row is the join of other rows. This gives an efficient method for handling closedness straight away. Moreover, as shown in Example 20, it allows for an efficient construction of the succinct hypothesis. Unfortunately, checking for consistency defects seems to require a number of computations exponential in the number of rows. However, as explained at the end of Section 6, we can in fact drop consistency altogether.
Universal automata. Just like non-deterministic automata, universal automata can be seen as P-automata with a free state space. The difference is that the P-algebra structure on O = 2 is dual: it is given by the "and" rather than the "or" operation. Universal automata accept a word when all paths reading that word are accepting. One can dualize the optimized specific algorithms for the case of non-deterministic automata. This is precisely what Angluin et al. [3] have done.
Partial automata. Consider the maybe monad Maybe(X) = 1 + X, with natural transformations having components η_X : X → 1 + X and µ_X : 1 + 1 + X → 1 + X defined in the standard way. Partial automata with states X can be represented as Maybe-automata with state space Maybe(X) = 1 + X, where there is an additional sink state, and output algebra O = Maybe(1) = 1 + 1. Here the left value is for rejecting states, including the sink one. The transition map δ : 1 + X → (1 + X)^A represents an undefined transition as one going to the sink state. The algorithm L⋆_Maybe is mostly like L⋆, except that implicitly the table has an additional row with zeroes in every column. Since the monad only adds a single element to each set, there is no need to optimize the basic algorithm for this specific case.
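As a minimal illustration of "defined in the standard way", the unit and multiplication can be written out with an explicit tagged encoding of 1 + X (the encoding is ours; it keeps the two failure points of 1 + (1 + X) distinguishable before multiplication merges them):

```python
NOTHING = ('nothing',)          # the extra point "1" -- the sink value

def eta(x):
    # Unit: X -> 1 + X, embedding x as a defined value.
    return ('just', x)

def mu(mmx):
    # Multiplication: 1 + (1 + X) -> 1 + X, merging the two failure points.
    return mmx[1] if mmx[0] == 'just' else NOTHING
```

The monad laws then hold by construction, e.g. mu(eta(eta(x))) = eta(x) and mu of either failure point is NOTHING.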
Weighted automata. Recall from Section 3 the free semimodule monad V, sending a set X to the free semimodule over a finite semiring S. Weighted automata over a set of states X can be represented as V-automata whose state space is the semimodule V(X), the output function out : V(X) → S assigns a weight to each state, and the transition map δ : V(X) → V(X)^A sends each state and each input symbol to a linear combination of states. The obvious semimodule structure on S extends to a pointwise structure on the potential rows of the table. The basic algorithm loops over all linear combinations of rows to check closedness and over all pairs of combinations of rows to check consistency, making them extremely expensive operations. If S is a field, a row can be decomposed into a linear combination of other rows in polynomial time using standard techniques from linear algebra. As a result, there are efficient procedures for checking closedness and constructing succinct hypotheses. It was shown by Van Heerdt et al. [21] that consistency in this setting is equivalent to closedness of the transpose of the table. This trick is due to Bergadano and Varricchio [7], who first studied learning of weighted automata.
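When S is a field, the closedness check reduces to solving a linear system. A sketch over the rationals (using Python's exact Fraction arithmetic; the name decompose is ours) could look like:

```python
from fractions import Fraction

def decompose(target, rows):
    # Try to write `target` as a linear combination of `rows` over Q by
    # Gaussian elimination; returns the coefficients, or None if impossible.
    m, n = len(rows), len(target)
    # One equation per column j of the table: sum_i coeffs[i]*rows[i][j] = target[j].
    A = [[Fraction(rows[i][j]) for i in range(m)] + [Fraction(target[j])]
         for j in range(n)]
    piv, r = [], 0
    for c in range(m):
        p = next((k for k in range(r, n) if A[k][c] != 0), None)
        if p is None:
            continue
        A[r], A[p] = A[p], A[r]
        A[r] = [v / A[r][c] for v in A[r]]       # normalize the pivot row
        for k in range(n):
            if k != r and A[k][c] != 0:          # eliminate column c elsewhere
                A[k] = [a - A[k][c] * b for a, b in zip(A[k], A[r])]
        piv.append(c)
        r += 1
    # Inconsistent system: a zero equation with a nonzero right-hand side.
    if any(A[k][m] != 0 for k in range(r, n)):
        return None
    coeffs = [Fraction(0)] * m                   # free variables set to zero
    for row_idx, c in enumerate(piv):
        coeffs[c] = A[row_idx][m]
    return coeffs
```

A None result witnesses a closedness defect; otherwise the coefficients define the required linear combination of rows.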
Alternating automata. We use the characterization of alternating automata due to Bertrand and Rot [9]. Recall that, given a partially ordered set (P, ≤), an upset is a subset U of P such that, if x ∈ U and x ≤ y, then y ∈ U. Given Q ⊆ P, we write ↑Q for the upward closure of Q, that is, the smallest upset of P containing Q. We consider the monad A that maps a set X to the set of all upsets of P(X). Its unit is given by η_X(x) = ↑{{x}}, and its multiplication flattens nested upsets. Algebras for the monad A are completely distributive lattices [27]. The sets of sets in A(X) can be seen as DNF formulae over elements of X, where the outer powerset is disjunctive and the inner one is conjunctive. Accordingly, we define an algebra structure β : A(2) → 2 on the output set 2 by letting β(U) = 1 if {1} ∈ U, and 0 otherwise. Alternating automata with states X can be represented as A-automata with state space A(X), output map out : A(X) → 2, and transition map δ : A(X) → A(X)^A, sending each state to a DNF formula over X. The only difference from the usual definition of alternating automata is that A(X) is not the full set PP(X), which is not a monad [23]. However, for each formula in PP(X) there is an equivalent one in A(X).
An adaptation of L⋆ for alternating automata was introduced by Angluin et al. [3] and further investigated by Berndt et al. [8]. The former found that, given a row r ∈ 2^E and a set of rows X ⊆ 2^E, r is equal to a DNF combination of rows from X (where logical operators are applied component-wise) if and only if it is equal to the combination defined by Y = {{x ∈ X | x(e) = 1} | e ∈ E ∧ r(e) = 1}. We can reuse this idea to efficiently find closedness defects and to construct the hypothesis. Even though the monad A formally requires the use of DNF formulae representing upsets, in the actual implementation we can use smaller formulae, e.g., Y above instead of its upward closure. In fact, it is easy to check that DNF combinations of rows are invariant under upward closure. As before, we do not know of an efficient way to ensure consistency, but we can drop it.
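The criterion of Angluin et al. can be sketched directly: build Y from the row r and test whether the DNF combination it defines reproduces r. Rows are bit-vectors indexed by positions in E, and the function name is ours:

```python
def is_dnf_combination(r, X, E):
    # Y collects, for every column e where r(e) = 1, the set of rows agreeing there.
    Y = [frozenset(x for x in X if x[e] == 1) for e in E if r[e] == 1]
    # Evaluate the DNF formula Y componentwise: disjunction over clauses,
    # conjunction within each clause.
    val = tuple(int(any(all(x[e] == 1 for x in clause) for clause in Y))
                for e in E)
    return val == r
```

If the check fails, no DNF combination of rows in X equals r, so r witnesses a closedness defect.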
Writer automata. The examples considered so far involve existing classes of automata. To further demonstrate the generality of our approach, we introduce a new (as far as we know) type of automaton, which we call writer automaton.
The writer monad Writer(X) = M × X for a finite monoid M has a unit η_X : X → M × X given by pairing with the unit e of the monoid, η_X(x) = (e, x), and a multiplication µ_X : M × M × X → M × X given by performing the monoid multiplication, µ_X(m_1, m_2, x) = (m_1 m_2, x). In Haskell, the writer monad is used for such tasks as collecting successive log messages, where the monoid is given by the set of sets or lists of possible messages and the multiplication adds a message.
The algebras for this monad are sets Q equipped with an M-action. One may take the output object to be the set M with the monoid multiplication as its action. Writer-automata with a free state space can be represented as deterministic automata that have an element of M associated with each transition. The semantics is as expected: M-elements multiply along paths and are finally multiplied with the output of the last state to produce the actual output.
The basic learning algorithm has polynomial time complexity. To determine whether a given row is a combination of rows in the table, i.e., whether it is given by a monoid value applied to one of the rows in the table, one simply tries all of these values. This allows us to check closedness, to minimize the generators, and to construct the succinct hypothesis, all in polynomial time. Consistency involves comparing all ways of applying monoid values to rows and, for each comparison, at most |A| further comparisons between one-letter extensions. The total number of comparisons is clearly polynomial in |M|, |S|, and |A|.
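The brute-force search over monoid values described above can be sketched as follows (the test instantiates M as the additive monoid Z/3Z; all names are ours):

```python
def find_monoid_combination(r, rows, M, mult):
    # Try every monoid value m and every labelled row s: does m act on s,
    # componentwise via mult, to give exactly r?
    for m in M:
        for label, s in rows.items():
            if all(mult(m, s_e) == r_e for s_e, r_e in zip(s, r)):
                return m, label
    return None
```

The search performs at most |M| * |S| row comparisons, which is the source of the polynomial bound stated above.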

Conclusion
We have presented L⋆_T, a general adaptation of L⋆ that uses monads to learn an automaton with algebraic structure, as well as a method for finding a succinct equivalent based on its generators. Furthermore, we adapted the optimized counterexample handling method of Rivest and Schapire [28] to this setting and discussed instantiations to non-deterministic, universal, partial, weighted, alternating, and writer automata.
Related Work. This paper builds on and extends the theoretical toolkit of Van Heerdt et al. [21,19], who are developing a categorical automata learning framework (CALF) in which learning algorithms can be understood and developed in a structured way.
An adaptation of L⋆ that produces NFAs was first developed by Bollig et al. [11]. Their algorithm learns a special subclass of NFAs consisting of RFSAs, which were introduced by Denis et al. [16]. Angluin et al. [3] unified algorithms for NFAs, universal automata, and alternating automata, the latter of which was further improved by Berndt et al. [8]. We are able to provide a more general framework, which encompasses and goes beyond those classes of automata. Moreover, we study optimized counterexample handling, which [3,11,8] do not consider.
The algorithm for weighted automata over an arbitrary field was studied in a category-theoretical context by Jacobs and Silva [22] and elaborated on by Van Heerdt et al. [21]. The algorithm itself was introduced by Bergadano and Varricchio [7]. The theory of succinct automata used for our hypotheses is based on the work of Arbib and Manes [4], revamped to more recent category theory.

Future Work. Whereas our general algorithm effortlessly instantiates to monads that preserve finite sets, a major challenge lies in investigating monads that do not enjoy this property. The algorithm for weighted automata generalizes to an infinite field [7,22,21] and even a principal ideal domain [20]. However, for an infinite semiring in general we cannot guarantee termination, because a finitely generated semimodule may have an infinite chain of strict submodules [20]. Intuitively, this means that while fixing closedness defects increases the size of the hypothesis state space semimodule, an infinite number of steps may be needed to resolve all closedness defects. In future work we would like to characterize more precisely for which semirings we can learn, and ideally formulate this characterization on the monad level.
As a result of the correspondence between learning and conformance testing [6,21], it should be possible to include in our framework the W-method [14], which is often used in case studies deploying L⋆ (e.g. [12,15]). We defer a thorough investigation of conformance testing to future work.

Appendix: Omitted Definitions and Proofs

The closedness condition in Definition 12 amounts to requiring the existence of a T-algebra homomorphism close making the following diagram commute: [diagram omitted]. It is easy to see that the hypothesis of this lemma corresponds to requiring the existence of a function close′ making the corresponding diagram in Set commute: [diagram omitted].

Proof (of Proposition 19, generator property). We prove that the returned set is a set of generators. For clarity, we denote by d_{S′} : S → T(S′) the function associated with a set of generators S′. The main idea is to incrementally build d_{S′} while building S′. In the first line, S′ = S is a set of generators, with d_S = η_S : S → T(S). For the loop, suppose S′ is a set of generators. If the loop guard is false, the algorithm returns the set of generators S′. Otherwise, suppose there are s ∈ S′ and U ∈ T(S′ \ {s}) such that row_t(U) = row_t(s). Then there is a function

f : S′ → T(S′ \ {s}), mapping s to U and every other s′ ∈ S′ to η(s′),
that satisfies row_t(s′) = row_t(f(s′)) for all s′ ∈ S′, from which it follows that row_t(U′) = row_t(f(U′)) for all U′ ∈ T(S′). Then we can set d_{S′\{s}} = f ∘ d_{S′} : S → T(S′ \ {s}), because row_t(s′) = row_t(d_{S′\{s}}(s′)) for all s′ ∈ S. Therefore, S′ \ {s} is a set of generators.

Definition 10 (Minimal T-Automaton for L). Let t_L : A* → O^{A*} be the function giving the residual languages of L, namely t_L(u) = λv. L(uv). The minimal T-automaton M_L accepting L has state space M = img(t_L), initial state init = t_L(ε), and T-algebra homomorphisms out : M → O and δ : M → M^A.

[Diagram of T-algebra homomorphisms omitted.] This diagram can be made into a diagram of T-algebra homomorphisms, where the compositions of the left and right legs give respectively close and row_b. It commutes because the top triangle commutes by functoriality of T, and the bottom square commutes because m^A is a T-algebra homomorphism. Therefore, (3) commutes for close = close′.

Proposition 15. If z ∈ A* is such that L_H(z) ≠ L(z) and prefixes(z) ⊆ S, then there are a prefix ua of z, with u ∈ A* and a ∈ A, and U ∈ T(S) such that row_t(u) = row_t(U) and row_b(u)(a) ≠ row_b(U)(a).

Proof (of Proposition 15).
Note that

row_t(z)(ε) = L(z)             (definition of row_t)
            ≠ L_H(z)           (assumption)
            = out_H(r_H(z))    (definition of L_H)
            = r_H(z)(ε)        (definition of out_H),

so row_t(z) ≠ r_H(z). Let p ∈ A* be the smallest prefix of z satisfying row_t(p) ≠ r_H(p). We have row_t(ε) = init_H = r_H(ε), so p ≠ ε and therefore p = ua for certain u ∈ A* and a ∈ A. Let S′ ⊂ S be the set from which H was constructed; recall that we added prefixes(z) to S after constructing H. Choose any U ∈ T(S′) such that row_t(U) = r_H(u), which is possible because H is the image of row_t restricted to the domain T(S′). By the minimality property of p we have row_t(u) = r_H(u) = row_t(U). Furthermore,

row_b(u)(a) = row_t(ua)         (definitions of row_t and row_b)
            ≠ r_H(ua)           (ua = p and row_t(p) ≠ r_H(p))
            = δ_H(r_H(u))(a)    (definition of r_H)
            = δ_H(row_t(U))(a)  (r_H(u) = row_t(U))
            = row_b(U)(a)       (definition of δ_H).

Proposition 18. Any succinct hypothesis of H accepts the language of H.

Proof. Assume a right inverse i : H → T(S′) of e : T(S′) → H. We first prove o_H ∘ e = o_{S′}, by induction on the length of words. For all U ∈ T(S′), we have

o_H(e(U))(ε) = out_H(e(U))      (definition of o_H)
             = out_H(row_t(U))  (definition of e)
             = row_t(U)(ε)      (definition of out_H)
             = out_{S′}(U)      (definition of out_{S′})
             = o_{S′}(U)(ε)     (definition of o_{S′}).

Now assume that for a given v ∈ A* and all U ∈ T(S′) we have o_H(e(U))(v) = o_{S′}(U)(v). Then, for all U ∈ T(S′) and a ∈ A,

o_H(e(U))(av) = o_H(δ_H(e(U))(a))(v)          (definition of o_H)
              = o_H(δ_H(row_t(U))(a))(v)      (definition of e)
              = o_H(row_b(U)(a))(v)           (definition of δ_H)
              = (o_H ∘ e ∘ i)(row_b(U)(a))(v) (e ∘ i = id_H)
              = (o_{S′} ∘ i)(row_b(U)(a))(v)  (induction hypothesis)
              = o_{S′}(δ_{S′}(U)(a))(v)       (definition of δ_{S′})
              = o_{S′}(U)(av)                 (definition of o_{S′}).
From this we see that

o_{S′}(init_{S′}) = (o_{S′} ∘ i ∘ row_t)(ε)  (definition of init_{S′})
                  = (o_H ∘ e ∘ i ∘ row_t)(ε) (o_H ∘ e = o_{S′})
                  = (o_H ∘ row_t)(ε)         (e ∘ i = id_H)
                  = o_H(init_H)              (definition of init_H).

Proposition 19. The following algorithm returns a minimal set of generators for the table:

S′ ← S
while there are s ∈ S′ and U ∈ T(S′ \ {s}) s.t. row_t(U) = row_t(s):
    S′ ← S′ \ {s}
return S′

Proof. Minimality is obvious, as S′ not being minimal would make the loop guard true.