On the Average Complexity of Strong Star Normal Form



Introduction
A regular expression α is in strong star normal form (ssnf) if, for any subexpression of the form β* or β + ε, the language represented by β does not include the empty word, ε. The star normal form was introduced by Brüggemann-Klein [5] as a step to improve the construction of the position automaton from a regular expression from cubic to quadratic time. Transforming a regular expression into this normal form can be achieved in linear time, and moreover the position automaton resulting from that normal form coincides with the one of the original expression. In the same paper, the star normal form was also used to characterize certain types of unambiguous expressions. The position automaton construction [9] is a basic conversion between regular expressions and ε-free nondeterministic finite automata (NFA), and several other constructions are known to be its quotients. This is the case for the partial derivative automaton [1,7] and the follow automaton [14]. Champarnaud et al. [6] showed that if a regular expression is in star normal form and is normalised modulo some regular expression equivalences, the partial derivative automaton is a quotient of the follow automaton.
Many conversions from regular expressions to equivalent NFAs consider automata with transitions labelled by the empty word (ε-NFA). Although the most used of these conversions is the Thompson construction (implemented in many UNIX-like string search commands) [18], an older construction that is more economical in its use of ε-transitions was presented by Ott and Feinstein in 1961 [16]. An improved version of this construction was redefined by Ilie and Yu, and called the ε-follow automaton. Gulan, Fernau and Gruber [12,10,11] studied the optimal (worst-case) size for all known constructions from regular expressions to ε-NFAs. It turns out that the optimal construction corresponds to the conversion of a regular expression in strong star normal form into an ε-follow automaton.
All this motivated us to study the average-case complexity of regular expressions in strong star normal form, as well as their conversions to NFAs. In previous work, we studied the asymptotic average complexity of some of the above-mentioned conversions from regular expressions using the framework of analytic combinatorics [2,3,4], which relates the enumeration of combinatorial objects to the algebraic and complex analytic properties of generating functions. In particular, generating functions can be seen as complex analytic functions, and the study of their behaviour around their dominant singularities gives access to the asymptotic form of their coefficients. Starting with an unambiguous grammar for the set of regular expressions over a given alphabet, and a non-negative measure, the symbolic method allows one to obtain a generating function associated with the sequence of the (finite) number of expressions of measure n. Multivariate generating functions can be used to analyse measures other than the size of combinatorial objects, e.g. the number of states of the automaton resulting from a given conversion method applied to a regular expression of given size, and thus allow one to obtain estimates for the average values of those measures.
While in previous work we were able to get explicit expressions for the generating functions involved, here that would be unmanageable. Using the existence of a Puiseux expansion at a singularity, we show how to get the required information for the asymptotic estimates from an algebraic equation satisfied by the generating function, without actually computing that expansion. We note that the technique presented here allows one to find, for the combinatorial classes considered, the form of the function without knowing beforehand the explicit value of the singularity. This provides a very useful method, at least for some combinatorial classes, that circumvents some of the more cumbersome steps of the Algebraic Coefficient Asymptotics algorithm presented by Flajolet and Sedgewick [8, pp. 504-505], as well as the need to know a priori the type of the singularity.
We use this method to derive the asymptotic estimates for the number of regular expressions in ssnf of a given size, as well as a parametric function of several related measures, which can give us, in particular, the alphabetic size or the size of the ε-follow automaton, on average. In the next section, we review some basics on regular expressions and ε-NFAs. In Section 3, we consider the transformation into strong star normal form and give some characterisations of expressions in this form. Section 4 describes a shortcut to obtain asymptotic estimates of the coefficients of generating functions. This is used in Section 5 to obtain the estimates mentioned before. Some experiments corroborating those estimates are presented in Section 6. Conclusions are drawn in Section 7.

Regular Expressions and ε-NFAs
We consider the grammar for regular expressions proposed by Gruber and Gulan in [10,11], which has the major advantage of avoiding many redundant expressions built with the symbols ε and ∅. Given an alphabet Σ = {σ_1, ..., σ_k} of size k, the set R_k of regular expressions, α, over Σ is defined by the following grammar, where the operator • (concatenation) is often omitted.

The language associated with α is denoted by L(α) and is defined as usual, with L(β?) = L(β) ∪ {ε}. It is clear that α? is equivalent to the standard regular expression α + ε.
For the size of a regular expression α, denoted by |α|, we will consider its length in reverse Polish notation, i.e., the number of symbols in α, not counting parentheses. The number of letters in α is denoted by |α|_Σ, and is usually called the alphabetic size. The number of occurrences of each operator c ∈ {+, •, *, ?} is denoted by |α|_c.
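A nondeterministic finite automaton (NFA) is a tuple N whose components include a finite set of states Q, a transition relation δ, and a set of final states F ⊆ Q. The size of an NFA N is |N| = |Q| + |δ|, the number of states is |N|_Q = |Q|, and the number of transitions is |N|_δ = |δ|. An NFA that has transitions labelled with ε is an ε-NFA. The language accepted by an automaton N is denoted by L(N) and is defined as usual, where δ is naturally extended to sets of states and words.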
Conversion of a regular expression into an equivalent NFA can be defined by induction on the structure of the regular expression. Let N_α denote the automaton corresponding to a regular expression α. In Figure 1 we present the construction of the ε-follow automaton, A_εf(α) [14]. The size of A_εf(α) for the atomic expressions ∅, ε, and σ ∈ Σ is 2, 3 and 3, respectively. For the remaining constructions, the size of the resulting automaton equals the sum of the sizes of its constituents plus some constant. For instance, for the operator + one has |A_εf(α + β)| = |A_εf(α)| + |A_εf(β)| − 2. These constants can be collected into tuples of parameters (c_∅, c_ε, c_σ, c_+, c_•, c_*, c_?) that define functions that can be used to compute several interesting measures. For example, using (2, 2, 2, −2, −1, 1, 0) one gets the number of states; the number of transitions is computed using (0, 1, 1, 0, 0, 2, 1), and the combined size corresponds to (2, 3, 3, −2, −1, 3, 1).
We note that the worst-case complexity for this conversion can be reached for expressions with only one letter and n − 1 stars. For such an expression of size n, the corresponding A_εf automaton has size 3n.
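The parameter tuples above can be read as a plain additive evaluation over the symbols of an expression. The following is a minimal Python sketch, assuming that expressions are given as token lists in reverse Polish notation; the names `OPS`, `WEIGHTS` and `measure` are hypothetical and come neither from the paper nor from any library.

```python
# Weights are listed in the order (∅, ε, σ, +, •, *, ?), as in the text.
OPS = ["∅", "ε", "σ", "+", "•", "*", "?"]
WEIGHTS = {
    "states":      (2, 2, 2, -2, -1, 1, 0),
    "transitions": (0, 1, 1, 0, 0, 2, 1),
    "size":        (2, 3, 3, -2, -1, 3, 1),
}

def measure(tokens, kind):
    """Sum the weight of every symbol of an expression given in RPN.

    Any token that is not one of ∅, ε, +, •, *, ? is treated as a letter.
    """
    weight = dict(zip(OPS, WEIGHTS[kind]))
    return sum(weight.get(tok, weight["σ"]) for tok in tokens)

# Worst-case family from the text: one letter followed by n - 1 stars.
n = 5
worst = ["a"] + ["*"] * (n - 1)
print(measure(worst, "states"), measure(worst, "transitions"), measure(worst, "size"))
# prints: 6 9 15
```

For this family the combined size is 3 + 3(n − 1) = 3n, in agreement with the worst-case value stated above.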

Strong Star Normal Form
A regular expression α is in star normal form if for any subexpression of the form β*, ε ∉ L(β) [5]. The original notion of star normal form makes use of two operators on regular expressions. Gulan and Gruber simplified that definition and adapted it so that subexpressions of the form β? with ε ∈ L(β) are also forbidden. The resulting form was called strong star normal form.
Definition 1. The operators ∘ and • are inductively defined as follows. Let

The expression α• is the strong star normal form (ssnf) of α.
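The condition that characterises ssnf can be checked in one pass over an expression. The sketch below is only an illustration of that condition, not the normalisation operators of Definition 1: it assumes expressions are token lists in reverse Polish notation, and the function name `is_ssnf` is hypothetical.

```python
def is_ssnf(tokens):
    """Check the ssnf condition on a well-formed expression given in RPN.

    Tokens: "∅", "ε", "+", "•", "*", "?"; anything else is a letter.
    The stack holds, for each pending subexpression, whether its
    language contains the empty word (nullability).
    """
    stack = []
    for tok in tokens:
        if tok in ("*", "?"):
            if stack.pop():           # operand already accepts ε: not in ssnf
                return False
            stack.append(True)        # β* and β? always accept ε
        elif tok in ("+", "•"):
            right, left = stack.pop(), stack.pop()
            stack.append((left or right) if tok == "+" else (left and right))
        else:
            stack.append(tok == "ε")  # ∅ and letters are not nullable
    return len(stack) == 1

print(is_ssnf(["a", "*", "b", "+"]))       # a* + b     -> True
print(is_ssnf(["a", "*", "b", "+", "*"]))  # (a* + b)*  -> False
print(is_ssnf(["a", "?", "*"]))            # (a?)*      -> False
```

Malformed input (too few operands for an operator) raises `IndexError` in this sketch rather than returning a value.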
For a regular expression α, L(α

Using this theorem it is possible to write a context-free grammar for regular expressions in ssnf, i.e., in which every subexpression of the form α* or α? satisfies ε ∉ L(α). The set S_k of regular expressions in ssnf over Σ is defined by:

where α_ε are regular expressions whose language includes ε, while for α_ε̄, ε ∉ L(α_ε̄). The following theorem summarizes the results by Gruber and Gulan [10, Theorems 4 and 6] (see also Gulan [11]).

Let A(z) = Σ_n a_n z^n be the generating function associated with some combinatorial class A (cf. [8]). Given some measure of the objects of the class, the coefficient a_n represents the sum of the values of this measure for all objects of size n. We will use the notation [z^n]A(z) for a_n. The generating function A(z) can be seen as a complex analytic function, and the study of its behaviour around its dominant singularity ρ (when unique) gives us access to the asymptotic form of its coefficients. In particular, if A(z) is analytic in some indented disc neighbourhood of ρ, then one has the following [8,3]:

Applying this result to the generating function R_k(z), corresponding to the number of expressions in R_k of size n, the following asymptotic values were obtained in Broda et al. [3]:

In the same paper, the average size of the ε-follow automaton construction was studied, and it was shown that, as the alphabet grows, the size of A_εf approaches 0.75n, asymptotically and on average.
Let us now give a generic description of the method used for the combinatorial classes that show up within the present paper. From a grammar one obtains, by the symbolic method expounded in [8], a set of polynomial equations involving the generating function whose coefficients we want to estimate asymptotically. Computing a Gröbner basis for the ideal generated by those polynomials, one gets an algebraic equation for that generating function w = w(z), i.e., an equation of the form G(z, w) = 0, where G(z, w) is a polynomial in ℤ[z][w], of which w(z) is a root. Since w(z) is the generating function of a combinatorial class, thus a series with non-negative integer coefficients, which is not a polynomial, it must have, by Pringsheim's Theorem [8, Thm IV.6], a real positive singularity, ρ, smaller than 1. At this singularity two cases may occur: either lim_{z→ρ} w(z) = a, a positive real number, or lim_{z→ρ} w(z) = +∞.
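The two situations lead to different coefficient asymptotics. The sketch below checks the standard square-root transfer estimates (Flajolet and Sedgewick [8]) numerically on two textbook generating functions that are not taken from this paper: the Catalan series, which has a finite limit at its singularity, and the central binomial series, which tends to infinity there. It is assumed, not confirmed by the text above, that these are the estimates referred to as (2) and (3); the function names are hypothetical.

```python
import math

def estimate_finite(b, rho, n):
    """f(z) ~ a - b*sqrt(1 - z/rho)  gives  [z^n] f ~ b/(2 sqrt(pi)) rho^-n n^-3/2."""
    return b / (2 * math.sqrt(math.pi)) * rho ** (-n) * n ** (-1.5)

def estimate_infinite(c, rho, n):
    """f(z) ~ c/sqrt(1 - z/rho)  gives  [z^n] f ~ c/sqrt(pi) rho^-n n^-1/2."""
    return c / math.sqrt(math.pi) * rho ** (-n) * n ** (-0.5)

n = 200
# Catalan numbers: C(z) = (1 - sqrt(1 - 4z)) / (2z) = 2 - 2 sqrt(1 - 4z) + ...
catalan = math.comb(2 * n, n) // (n + 1)
# Central binomial coefficients: 1 / sqrt(1 - 4z), unbounded at z = 1/4.
central = math.comb(2 * n, n)

ratio_finite = catalan / estimate_finite(2, 0.25, n)
ratio_infinite = central / estimate_infinite(1, 0.25, n)
print(ratio_finite, ratio_infinite)   # both ratios should be close to 1
```

In both cases only ρ and one constant (b or c) are needed, which is why the method described next concentrates on extracting exactly those values.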
In the first case, after making the change of variable s = 1 − z/ρ, one knows that w = w(s) has a Puiseux series expansion at the singularity s = 0, i.e., there exists a slit neighbourhood of that point in which w(s) has a representation as a power series with fractional powers [13, Chap. 12]. In particular, w must have the form

for some a ∈ ℝ, α ∈ ℚ⁺, the first positive exponent of that expansion, and g(s) such that g(s) = b + h(s)s^β, h(0) ≠ 0, β ∈ ℚ⁺, and b ∈ ℝ*. We will show that, under some generic conditions that happen to be satisfied in all the cases treated below, one has α = 1/2 or α = −1/2. One then needs to find the values of ρ and of b or c, depending on the case, to use either (2) or (3) to obtain the sought-after asymptotic estimates of the coefficients of w(z).
In the case under study, the curve defined by G has a shape similar to the one depicted in Fig. 2, and therefore

This, together with the fact that G(ρ, a) = 0, shows that ρ is a root of the discriminant polynomial of G with respect to the variable w, which is a polynomial in z (cf. [15, p. 204]). In all the cases studied here, this polynomial has only one root in ]0, 1[, a fact that allows one to obtain a numerical approximation of ρ. The minimal polynomial of ρ in ℚ[z] can be obtained by analysing the greatest common divisor of the polynomials G(z, w) and ∂G/∂w (z, w) with respect to w: gcd_w(G(z, w), ∂G/∂w (z, w)). We will denote this polynomial by m_ρ(z). Using now gcd_z(G(z, w), ∂G/∂w (z, w)), one can get a polynomial that has a as a root. One can then numerically compute all the real roots of that polynomial, and check which one approximates the value of a by means of a numerical study of the curve G(z, w). Using (7) in (6), and dividing it through by s^α, one gets

Now, in all cases studied in this paper, one has ∂G/∂z (ρ, a) ≠ 0, and

This was checked by computing gcd(p_1(z), m_ρ(z)) and gcd(p_2(z), m_ρ(z)), obtaining a constant depending only on k that is non-zero for all k ≠ 54 in all cases dealt with in this paper. The case k = 54 was dealt with separately: using the explicit value of ρ, the validity of (9) for this value of k was verified. It now follows from (8), by noticing that the first and third summands have the smallest degrees in s, that they must have the same degree and cancel each other. Dividing, then, by s^α and letting s → 0, one obtains α = 1/2 and $b = g(0) = \sqrt{2\rho\,\frac{\partial G}{\partial z}(\rho, a) \Big/ \frac{\partial^2 G}{\partial w^2}(\rho, a)}$.
In conclusion, for the case where lim_{z→ρ} w(z) = a, using (2), one has

For the case where lim_{z→ρ} w(z) = +∞, making v = 1/w one concludes as above that v = c s^α − g(s) s^(α+β), for some 0 < α < 1, β > 0, and for some Puiseux series g(s) with non-negative exponents. The polynomial satisfied by v is then

which is the reciprocal polynomial of G(z, w) with respect to the variable w.
Using the same procedure as above, one computes ρ, and checking that the corresponding derivatives are non-zero, i.e.
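The whole procedure for the finite-limit case can be tried on a small example. The sketch below uses the equation of the Motzkin numbers, G(z, w) = z²w² + (z − 1)w + 1, which is a textbook example and not one of the equations of this paper. It locates ρ as the root of the discriminant in ]0, 1[, gets a from ∂G/∂w (ρ, a) = 0, and takes b² = 2ρ · ∂G/∂z (ρ, a) / ∂²G/∂w² (ρ, a), the value that results from a second-order Taylor expansion of G around (ρ, a) when ∂G/∂w vanishes there and the other two derivatives do not. The helper names are hypothetical, and the derivatives are written by hand rather than computed symbolically.

```python
import math

# Illustrative equation: G(z, w) = z^2 w^2 + (z - 1) w + 1 = 0 (Motzkin numbers).
def G_z(z, w):
    return 2 * z * w * w + w           # dG/dz

def G_ww(z, w):
    return 2 * z * z                   # d^2G/dw^2

def disc(z):
    return (z - 1) ** 2 - 4 * z * z    # discriminant of G with respect to w

def bisect(f, lo, hi, steps=200):
    """Root of f in [lo, hi], assuming f changes sign exactly once there."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if (f(lo) > 0) == (f(mid) > 0):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

rho = bisect(disc, 0.0, 1.0)           # only root of the discriminant in ]0, 1[
a = (1 - rho) / (2 * rho ** 2)         # solves dG/dw(rho, a) = 2 rho^2 a + rho - 1 = 0
b = math.sqrt(2 * rho * G_z(rho, a) / G_ww(rho, a))

def motzkin(n):
    """Exact values via (m + 2) M_m = (2m + 1) M_{m-1} + (3m - 3) M_{m-2}."""
    prev, cur = 1, 1
    for m in range(2, n + 1):
        prev, cur = cur, ((2 * m + 1) * cur + (3 * m - 3) * prev) // (m + 2)
    return cur

n = 300
estimate = b / (2 * math.sqrt(math.pi)) * rho ** (-n) * n ** (-1.5)
ratio = motzkin(n) / estimate
print(rho, a, b, ratio)                # rho ≈ 1/3, a ≈ 3, b ≈ 3*sqrt(3), ratio near 1
```

Note that the equation G(z, w) = 0 is never solved for w: only the discriminant and three partial derivatives at (ρ, a) are used.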

Average Sizes: Concrete Results
Let A_k(z) and B_k(z) be the generating functions for α_ε and α_ε̄, respectively. They satisfy the following equations

From (13) one gets

and then substituting B_k(z) in (14) one obtains, after clearing denominators,

Using now (14) to get A_k(z) as a function of B_k(z), and then substituting that into (13), one easily sees that B_k(z) is a root of

Using the technique described in the previous section, one sees that A_k(z) and B_k(z) have the same singularity, namely the only root in the interval ]0, 1[ of the polynomial

Also one gets that α = 1/2, and that the values of a_A = A_k(ρ) and of a_B = B_k(ρ) are roots of the polynomials 8z³ − kz² + 2kz − k and 8z³ + 2kz² − k², respectively. With all this, and writing S_k(z) = A_k(z) + B_k(z), one then gets that

Using these results and the one mentioned in (4), the ratio of regular expressions in ssnf, r(k, n), can now be computed for any k and n. In particular, one finds that, for example, r(2, 1000) = 4.427117336 × 10^−59, r(10, 1000) = 2.562752010 × 10^−19, r(50, 1000) = 1.517513555 × 10^−4.
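For small sizes the same ratio can be computed exactly by counting. The sketch below is based on assumptions, since the grammars themselves are not reproduced above: ssnf expressions are split by nullability according to the system B = kz + 2zB² + 2zAB and A = 2zA² + 2zAB + 2zB, general expressions follow R = kz + 2zR² + 2zR, and the top-level expressions ∅ and ε are ignored. The assumed system implies A = 2B²/k, which is consistent with the two polynomials quoted for a_A and a_B. The function names are hypothetical.

```python
def ssnf_counts(k, n_max):
    """Return (A, B): A[n] (resp. B[n]) counts ssnf expressions of size n
    whose language contains (resp. does not contain) the empty word.

    Assumed system:
      B = k z + 2 z B^2 + 2 z A B      # letters, B+B, B•B, A•B, B•A
      A = 2 z A^2 + 2 z A B + 2 z B    # A+A, A•A, A+B, B+A, B*, B?
    """
    A = [0] * (n_max + 1)
    B = [0] * (n_max + 1)
    if n_max >= 1:
        B[1] = k
    for n in range(2, n_max + 1):
        # Operands of a binary operator of total size n have sizes i and n-1-i.
        aa = sum(A[i] * A[n - 1 - i] for i in range(n))
        ab = sum(A[i] * B[n - 1 - i] for i in range(n))
        bb = sum(B[i] * B[n - 1 - i] for i in range(n))
        B[n] = 2 * bb + 2 * ab
        A[n] = 2 * aa + 2 * ab + 2 * B[n - 1]
    return A, B

def all_counts(k, n_max):
    """Counts of all expressions built from letters with +, •, * and ?."""
    R = [0] * (n_max + 1)
    if n_max >= 1:
        R[1] = k
    for n in range(2, n_max + 1):
        R[n] = 2 * sum(R[i] * R[n - 1 - i] for i in range(n)) + 2 * R[n - 1]
    return R

k, n = 2, 50
A, B = ssnf_counts(k, n)
R = all_counts(k, n)
print((A[n] + B[n]) / R[n])   # exact fraction of ssnf expressions of size n
```

The ratio decreases geometrically with n because the two classes have different dominant singularities, which is what the very small values of r(k, 1000) above reflect.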

Counting Letters
To obtain the asymptotic average value of several measures for regular expressions of a given size, we consider bivariate generating functions parametrized by weights of the form c_o, with o ∈ {∅, ε, σ, +, •, *, ?}, associated to each regular expression element. Considering the grammar (1), let A_k(u, z) and B_k(u, z) be the bivariate generating functions associated to α_ε and α_ε̄, respectively. Then

Note that A and B depend on the parameters (c_∅, c_ε, c_σ, c_+, c_•, c_*, c_?), but for the sake of simplicity we choose to omit them. For computing the average number of letters those parameters are (0, 0, 1, 0, 0, 0, 0), and analogously for each operator.
The generating function L_k(z) for the number of letters is given by

Using Gröbner bases, as mentioned above, one gets the following polynomial for w = L_k:

It turns out that, from this, one can deduce that the singularity for this algebraic function w has the same minimal polynomial as in (15), and so it is the same as for the number of regular expressions there considered. One then finds that, in this case, α = −1/2, and that

where, for example,

From this one gets, for any given k, the density of letters in expressions of size n, which is independent of n since the singularities of L_k and S_k are the same. In particular, one finds that, for k = 2, 10 and 50, this density is 0.4172563448, 0.4432524170 and 0.4657465002, respectively.

Size of ε-Follow Automata
Considering the parameters (2, 3, 3, −2, −1, 3, 1), as defined in Section 2, the generating function F_k(z) for the size of the A_εf automaton is given by

Using the same abbreviations as above, one has:

Proceeding as above, one can verify that the singularity for F_k(z) still has the same minimal polynomial as in (15), that α = −1/2, and that

where, for example,

For the average ratio f_k between the size of A_εf and the size of the respective regular expression (also independent of n) one has, for example, f_2 = 0.9803566472, f_10 = 0.9034371711, f_50 = 0.8400260553.
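The cumulated values behind such averages can also be computed exactly for small sizes. The sketch below carries, next to each count, the sum of a weighted measure over all expressions of that size, which is the coefficient-level counterpart of differentiating a bivariate generating function in u at u = 1. It rests on an assumed split of ssnf expressions by nullability (nullable: A+A, A+B, B+A, A•A, B*, B?; non-nullable: letters, B+B, B•B, A•B, B•A), ignores top-level ∅ and ε, and uses hypothetical names throughout.

```python
def ssnf_totals(k, n_max, weights):
    """Return (count, total): count[n] is the number of ssnf expressions of
    size n, total[n] the sum of the weighted measure over all of them.

    weights = (c_∅, c_ε, c_σ, c_+, c_•, c_*, c_?); c_∅ and c_ε are unused here.
    """
    _, _, c_s, c_p, c_c, c_st, c_q = weights
    A = [0] * (n_max + 1)     # counts, ε in the language
    B = [0] * (n_max + 1)     # counts, ε not in the language
    WA = [0] * (n_max + 1)    # cumulated weights for A
    WB = [0] * (n_max + 1)    # cumulated weights for B
    if n_max >= 1:
        B[1], WB[1] = k, k * c_s

    def conv(X, Y, n):
        return sum(X[i] * Y[n - 1 - i] for i in range(n))

    def binary(X, WX, Y, WY, n, c):
        # All expressions "x op y" of size n: (count, cumulated weight).
        cnt = conv(X, Y, n)
        return cnt, conv(WX, Y, n) + conv(X, WY, n) + c * cnt

    def unary(n, c):
        # A unary operator applied to a non-nullable operand of size n - 1.
        return B[n - 1], WB[n - 1] + c * B[n - 1]

    for n in range(2, n_max + 1):
        to_A = [binary(A, WA, A, WA, n, c_p), binary(A, WA, B, WB, n, c_p),
                binary(B, WB, A, WA, n, c_p), binary(A, WA, A, WA, n, c_c),
                unary(n, c_st), unary(n, c_q)]
        to_B = [binary(B, WB, B, WB, n, c_p), binary(B, WB, B, WB, n, c_c),
                binary(A, WA, B, WB, n, c_c), binary(B, WB, A, WA, n, c_c)]
        A[n], WA[n] = (sum(col) for col in zip(*to_A))
        B[n], WB[n] = (sum(col) for col in zip(*to_B))
    count = [x + y for x, y in zip(A, B)]
    total = [x + y for x, y in zip(WA, WB)]
    return count, total

LETTERS = (0, 0, 1, 0, 0, 0, 0)
SIZE = (2, 3, 3, -2, -1, 3, 1)
k, n = 2, 100
count, letters = ssnf_totals(k, n, LETTERS)
_, sizes = ssnf_totals(k, n, SIZE)
print(letters[n] / (n * count[n]), sizes[n] / (n * count[n]))
```

If the assumed split matches the grammar used in the paper, the two printed averages should drift towards the asymptotic letter density and the ratio f_k as n grows; no specific values are claimed here.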

Experimental Results
We ran some experiments, using the FAdo package [17], to obtain average sizes of the measures studied above for small values of k and n. For the results to be statistically significant, regular expressions were generated uniformly at random using a version of the grammar for S_k in reverse Polish notation. For each size n ∈ {200, 500, 1000}, and alphabet size k ∈ {2, 10, 50}, samples of 10000 expressions were generated. This is sufficient to ensure a 95% confidence level within a 1% error margin. The results are presented in Table 1, together with the values of the letter density and of f_k calculated in the previous section. The last column, labeled wc, presents the worst-case size of A_εf as given in Theorem 2, for expressions of size n.
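One common way to arrive at a sample size of this order is the worst-case bound for estimating a proportion, n ≥ z² / (4e²), where z is the normal quantile for the confidence level and e the error margin. Whether this is the exact reasoning behind the figure of 10000 is an assumption; the function name below is hypothetical, and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction
import math

def worst_case_sample_size(z, margin):
    """Smallest n with z * sqrt(p (1 - p) / n) <= margin at the worst case p = 1/2."""
    z, margin = Fraction(z), Fraction(margin)
    return math.ceil(z * z / (4 * margin * margin))

# 95% confidence (z = 1.96) and a 1% margin.
print(worst_case_sample_size("1.96", "0.01"))   # 9604
```

Under this reading, 10000 samples per pair (k, n) exceed the required 9604.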

Conclusions
The average complexity results obtained for expressions in ssnf are only slightly smaller than the ones obtained for general regular expressions. Indeed, for the size of A_εf, and the same values of k, the asymptotic values obtained in [3] were f_2 = 1.2, f_10 = 1, and f_50 = 0.9. In that study, we got an explicit expression,

Table 1. Results for regular expressions in ssnf