Unfolding-based Dynamic Partial Order Reduction of Asynchronous Distributed Programs

Abstract. Unfolding-based Dynamic Partial Order Reduction (UDPOR) is a recent technique combining Dynamic Partial Order Reduction (DPOR) with concurrency concepts such as unfoldings to efficiently mitigate state space explosion in model-checking of concurrent programs. It is optimal in the sense that each Mazurkiewicz trace, i.e. each class of interleavings equivalent up to commuting independent actions, is explored exactly once. This paper shows that UDPOR can be extended to verify asynchronous distributed applications, where processes both communicate by messages and synchronize on shared resources. To do so, a general model of asynchronous distributed programs is formalized in TLA+, which allows us to define an independence relation, a main ingredient of the unfolding semantics. The adaptation of UDPOR, involving the construction of an unfolding, is then made efficient by a precise analysis of dependencies. A prototype implementation gives promising experimental results.


Introduction
Developing distributed applications that run on parallel computers and communicate by message passing is hard, due to their size, heterogeneity, asynchrony and dynamicity. Besides performance, their correctness is crucial, but it is very challenging to ensure due to the complex interactions of parallel components.
Model-checking (see e.g. [4]) is a set of techniques for verifying automatically and effectively properties of models of such systems. The principle is usually to explore all possible behaviors (states and transitions) of the system model. However, state spaces grow exponentially with the number of concurrent processes. Unfoldings and partial order reduction (POR) are two alternative techniques, born in the 1990s, that mitigate this state space explosion and scale to large applications.
We revise and extend this model with synchronization primitives and formally specify it in TLA+ [11]. A clear advantage of this model is its abstraction: it remains concise, but its generality allows e.g. the encoding of MPI primitives. Even defining a correct independence relation from this formal model is difficult, due to the variety and complex semantics of actions. In addition, making UDPOR, and in particular the computation of unfoldings and extensions, efficient cannot directly rely on the solutions of [13], which are tuned for concurrent programs with only mutexes; clever algorithms thus need to be designed. For now we prototyped our solutions in a simplified context, but we target the SimGrid tool, which allows running HPC code (in particular MPI) in a simulation environment [5]. The paper is organized as follows. Section 2 recalls notions of interleaving and concurrency semantics, and how a transition system is unfolded into an event structure with respect to an independence relation. In Section 3 the programming model is presented, together with a sketch of the independence relation. Section 4 presents the UDPOR algorithm, its adaptation to our programming model, and how to make it efficient. Finally, we present a prototype implementation and its experimental evaluation.

Interleaving and Unfolding Semantics
The behaviors of a distributed program can be described in an interleaving semantics by a labelled transition system (LTS), or in a true concurrency semantics by an event structure. An LTS equipped with an independence relation can be unfolded into an event structure [16]. This is a main step of UDPOR.
Definition 1 (Labelled transition system). A labelled transition system (LTS) is a tuple T = ⟨S, s₀, Σ, →⟩ where S is the set of states, s₀ ∈ S is the initial state, Σ is the alphabet of actions, and → ⊆ S × Σ × S is the transition relation.
We note s −a→ s′ when (s, a, s′) ∈ →, and extend the notation to execution sequences: s −a₁...aₙ→ s′. Independence is a key notion in both POR techniques and unfoldings, linked to the possibility of commuting actions:

Definition 2 (Commutation and independence). Two actions a₁, a₂ of an LTS T = ⟨S, s₀, Σ, →⟩ commute in a state s if they satisfy the two following conditions:
- executing one action does not enable nor disable the other one: if a₁ ∈ enabled(s) and s −a₁→ s′, then a₂ ∈ enabled(s) ⟺ a₂ ∈ enabled(s′), and symmetrically for a₂;
- their execution order does not change the overall result: if a₁, a₂ ∈ enabled(s), then s −a₁a₂→ s′ and s −a₂a₁→ s′ for the same state s′.

A relation I ⊆ Σ × Σ is a valid independence relation if it under-approximates commutation, i.e. ∀a₁, a₂, I(a₁, a₂) implies that a₁ and a₂ commute in all states. Conversely, a₁ and a₂ are dependent, noted D(a₁, a₂), when ¬I(a₁, a₂).
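As an illustration of Definition 2, the commutation conditions can be checked by brute force on a small explicit LTS. The following sketch is ours, not part of the paper, and all names are invented: an LTS is encoded as a successor map, and a candidate independence pair is validated by testing both conditions in every state.

```python
# A minimal sketch (not from the paper): brute-force check that two actions
# commute in every state of an explicitly given deterministic LTS.
# step[s][a] gives the successor of state s under action a (absent if disabled).

def commute_everywhere(step, a1, a2):
    """Return True iff a1 and a2 commute in all states of the LTS."""
    for s, moves in step.items():
        e1, e2 = a1 in moves, a2 in moves
        # Condition 1: firing one action must not enable nor disable the other.
        if e1 and (a2 in step[moves[a1]]) != e2:
            return False
        if e2 and (a1 in step[moves[a2]]) != e1:
            return False
        # Condition 2: both execution orders must reach the same state.
        if e1 and e2 and step[moves[a1]][a2] != step[moves[a2]][a1]:
            return False
    return True

# Two independent increments on disjoint variables; states are (x, y) pairs.
step = {}
for x in range(2):
    for y in range(2):
        moves = {}
        if x < 1: moves["incx"] = (x + 1, y)
        if y < 1: moves["incy"] = (x, y + 1)
        step[(x, y)] = moves

print(commute_everywhere(step, "incx", "incy"))  # True: a valid independence pair
```

A pair that fails either condition in some state must be declared dependent, which is always safe: over-approximating dependence only reduces the reduction, never its correctness.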
A Mazurkiewicz trace is an equivalence class of executions (or interleavings) of an LTS T, obtained by commuting adjacent independent actions. By the second condition of Definition 2, all these interleavings reach the same unique state. The principle of all DPOR approaches is precisely to reduce the state space exploration while covering at least one execution per Mazurkiewicz trace. If a deadlock exists, some Mazurkiewicz trace leads to it, and it will be discovered. More generally, safety properties are preserved.
The UDPOR technique that we consider also uses concurrency notions. A classical model of true concurrency is prime event structures:

Definition 3 (Prime event structure). Given an alphabet of actions Σ, a Σ-prime event structure (Σ-PES) is a tuple E = ⟨E, <, #, λ⟩ where E is a set of events, < is a partial order on E called the causality relation, λ : E → Σ is a function labelling each event e with an action λ(e), and # is an irreflexive and symmetric relation called the conflict relation, such that the set of causal predecessors, or history, of any event e, ⌈e⌉ = {e′ ∈ E : e′ < e}, is finite, and conflicts are inherited by causality: ∀e, e′, e″ ∈ E, e#e′ ∧ e′ < e″ ⟹ e#e″.
Intuitively, e < e′ means that e must happen before e′, and e#e′ that these two events cannot belong to the same execution. Two distinct events that are neither causally ordered nor in conflict are said to be concurrent. The set [e] := ⌈e⌉ ∪ {e} is called the local configuration of e. An event e can be characterized by a pair ⟨λ(e), H⟩ where λ(e) is its action and H = ⌈e⌉ its history.
We note conf(E) the set of configurations of E, where a configuration is a set of events C ⊆ E that is both causally closed (e ∈ C ⟹ ⌈e⌉ ⊆ C) and conflict-free (e, e′ ∈ C ⟹ ¬(e#e′)). A configuration C is characterized by its causally maximal events maxEvents(C) = {e ∈ C : ∄e′ ∈ C, e < e′}, since it is exactly the union of the local configurations of these events: C = ⋃_{e ∈ maxEvents(C)} [e]. Conversely, a conflict-free set K of events pairwise incomparable for < defines a configuration config(K) = ⋃_{e ∈ K} [e], and C = config(maxEvents(C)).
A configuration C, together with the causality and independence relations, defines a Mazurkiewicz trace: its interleavings are obtained by linearizing the events of C in any order compatible with causality, i.e. commuting concurrent events. The state of a configuration C, denoted state(C), is the state of T reached by any of these executions; it is unique, as discussed above. We write enab(C) = enabled(state(C)) ⊆ Σ for the set of actions enabled at state(C), while actions(C) denotes the set of actions labelling the events of C, i.e. actions(C) = {λ(e) : e ∈ C}.
The set of extensions of C is ex(C) = {e ∈ E \ C : ⌈e⌉ ⊆ C}, i.e. the set of events not in C whose causal predecessors are all in C. When appending an extension to C, only the resulting conflict-free sets of events are indeed configurations. The conflict-free extensions constitute the set of enabled events en(C) = {e ∈ ex(C) : ∄e′ ∈ C, e#e′}, while the other ones are the conflicting extensions cex(C) := ex(C) \ en(C).
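The notions above can be made concrete with a small executable sketch (ours, with invented names, not the paper's implementation): events carry explicit causality and direct conflict information, and we check causal closure, conflict-freeness (propagating conflicts along causality, per Definition 3), and maximal events.

```python
# Illustrative sketch of a prime event structure: configurations are causally
# closed, conflict-free event sets; conflicts are inherited by causality.

from itertools import combinations

class PES:
    def __init__(self, causes, conflicts):
        self.causes = causes                             # event -> direct causal predecessors
        self.conflicts = set(map(frozenset, conflicts))  # directly conflicting pairs

    def history(self, e):                                # strict causal past of e
        past, stack = set(), [e]
        while stack:
            for p in self.causes[stack.pop()]:
                if p not in past:
                    past.add(p)
                    stack.append(p)
        return past

    def in_conflict(self, e, f):                         # conflict, inherited by causality
        He, Hf = self.history(e) | {e}, self.history(f) | {f}
        return any((c1 in He and c2 in Hf) or (c2 in He and c1 in Hf)
                   for c1, c2 in map(tuple, self.conflicts))

    def is_configuration(self, C):                       # causally closed and conflict-free
        return (all(self.history(e) <= C for e in C)
                and not any(self.in_conflict(e, f) for e, f in combinations(C, 2)))

    def max_events(self, C):                             # events of C with no successor in C
        return {e for e in C if not any(e in self.history(f) for f in C)}

# Events 1..4 with 1 < 3, 2 < 3, and a direct conflict 2 # 4:
pes = PES({1: set(), 2: set(), 3: {1, 2}, 4: set()}, [(2, 4)])
print(pes.is_configuration({1, 2, 3}))   # True
print(pes.is_configuration({1, 3}))      # False: not causally closed
print(pes.max_events({1, 2, 3}))         # {3}
```

Here conflict inheritance makes {3, 4} conflicting even though only (2, 4) is stored, since 2 < 3.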
Parametric Unfolding Semantics. Given an LTS T and an independence relation I, one can build a prime event structure E such that each linearization of a maximal (for inclusion) configuration represents an execution of T and, conversely, each Mazurkiewicz trace of T corresponds to a configuration of E [13].
Definition 4 (Unfolding). The unfolding of an LTS T under an independence relation I is the Σ-PES E = ⟨E, <, #, λ⟩, incrementally constructed from the initial Σ-PES ⟨∅, ∅, ∅, ∅⟩ by the following rules until no new event can be created: for any configuration C ∈ conf(E) and any action a ∈ enabled(state(C)), if for every e′ ∈ maxEvents(C), ¬I(a, λ(e′)), add a new event e = ⟨a, C⟩ to E; for any such new event e = ⟨a, C⟩, update <, # and λ as follows: λ(e) := a, and for every e′ ∈ E \ {e} consider three cases: (i) if e′ ∈ C then e′ < e, (ii) if e′ ∉ C and ¬I(a, λ(e′)), then e′#e, (iii) otherwise, i.e. if e′ ∉ C and I(a, λ(e′)), then e and e′ are concurrent.
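The case analysis of Definition 4, relating a new event ⟨a, C⟩ to every existing event, can be sketched as follows (a toy illustration with an invented independence relation and invented labels, not the paper's implementation):

```python
# Hedged sketch of Definition 4's relation update: when a new event e = <a, C>
# is added, every existing event e2 is a cause, a conflict, or concurrent,
# depending only on membership in C and on the independence relation I.

def classify(e2, C, a, label, independent):
    """Relation between an existing event e2 and the new event <a, C>."""
    if e2 in C:
        return "cause"                      # case (i):   e2 < e
    if not independent(a, label[e2]):
        return "conflict"                   # case (ii):  e2 # e
    return "concurrent"                     # case (iii): e2 co e

label = {1: "send(m)", 2: "recv(m)", 3: "local"}
I = lambda x, y: "local" in (x, y)          # toy independence: local ops commute
print(classify(1, {1}, "send(m)", label, I))    # cause
print(classify(2, set(), "send(m)", label, I))  # conflict
print(classify(3, set(), "send(m)", label, I))  # concurrent
```

This is exactly why the unfolding construction is parametric: only membership in C and the relation I are consulted, never the concrete states.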

Programming Model and Independence Relation
In this section we introduce the abstract model of asynchronous distributed systems that we consider. While abstract, this model is sufficient to represent concrete MPI programs, as it encompasses all building blocks of the SMPI implementation of the standard [5]. We formalized this model in the specification language TLA+ [11], to later infer an independence relation. In our model, an asynchronous distributed system P consists of a set of n actors Actors = {A₁, A₂, ..., Aₙ} that perform local actions, communicate asynchronously with each other, and share some resources. We assume that the program is terminating, which implies that all actions terminate. All local actions are abstracted into a unique one, LocalComp. Communication actions are of four types: AsyncSend, AsyncReceive, TestAny, and WaitAny. Actions on shared resources, called synchronizations, are of four types: AsyncMutexLock, MutexUnlock, MutexTest and MutexWait.

Abstract Model
At the semantics level, P is a tuple P = ⟨Actors, Network, Synchronization⟩ where Network and Synchronization respectively describe the abstract objects, and the effects on them, of the communication and synchronization actions. The Network subsystem provides facilities for the Actors to communicate asynchronously with each other, while the Synchronization subsystem allows actors to synchronize on access to shared resources.

Synchronization subsystem. The Synchronization subsystem consists of a pair ⟨Mutexes, Requests⟩ where Mutexes is a set of asynchronous mutexes used to synchronize the actors, and Requests is a vector, indexed by actor ids, of sets of requested mutexes. Each mutex mⱼ is represented by a FIFO queue of the ids i of actors that declared their interest in mⱼ by executing the action AsyncMutexLock(mⱼ). A mutex mⱼ is free if its queue is empty, busy otherwise. Its owner is the actor whose id is first in the queue. In actor Aᵢ, the effect of the synchronization actions on Mutexes and Requests is as follows:
- AsyncMutexLock(mⱼ) requests the mutex mⱼ, with the effect of appending the actor id i to mⱼ's queue and adding j to Requests[i]. Aᵢ then waits until it owns mⱼ but, unlike with classical mutexes, waiting is not necessarily blocking.
- MutexUnlock(mⱼ) withdraws the interest in the mutex mⱼ by deleting the actor id i from mⱼ's queue and removing j from Requests[i].
- MutexTest (resp. MutexWait) are similar to TestAny (resp. WaitAny) and could be merged with them. We keep them separate here for simplicity of explanations.
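Our reading of these semantics can be sketched in a few lines of executable code (names are ours; this is not SimGrid code): each mutex is a FIFO queue of actor ids, and none of these actions ever blocks.

```python
# Executable sketch of the asynchronous-mutex actions: a mutex is a FIFO queue
# of actor ids, and Requests[i] records the mutexes actor i has asked for.

from collections import defaultdict, deque

class Synchronization:
    def __init__(self):
        self.mutexes = defaultdict(deque)   # mutex id -> FIFO of actor ids
        self.requests = defaultdict(set)    # actor id -> requested mutex ids

    def async_mutex_lock(self, i, m):       # never blocks: just queue up
        self.mutexes[m].append(i)
        self.requests[i].add(m)

    def mutex_unlock(self, i, m):           # drop interest, whatever the position
        self.mutexes[m].remove(i)
        self.requests[i].discard(m)

    def mutex_test(self, i, m):             # non-blocking ownership check
        q = self.mutexes[m]
        return bool(q) and q[0] == i

    def mutex_wait_enabled(self, i, m):     # MutexWait can fire only when i owns m
        return self.mutex_test(i, m)

sync = Synchronization()
sync.async_mutex_lock(1, "m")   # actor 1 queues first: it owns "m"
sync.async_mutex_lock(2, "m")   # actor 2 queues behind, without blocking
print(sync.mutex_test(1, "m"), sync.mutex_test(2, "m"))  # True False
sync.mutex_unlock(1, "m")
print(sync.mutex_test(2, "m"))  # True: ownership passes FIFO-wise
```

Note that the only action whose enabledness depends on other actors is MutexWait; all the others are enabled as soon as the actor reaches them, which is what Lemma 1 below exploits.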
Beside those actions, a program can perform local computations, named LocalComp actions. Such actions do not interact with shared objects (Mailboxes, Mutexes and Communications), and they can be responsible for I/O tasks.

We specified our model of asynchronous distributed systems in the formal language TLA+ [11]. Our TLA+ model focuses on how actions transform the global state of the system. An instance P of a program is described by a set of actors and their actions (representing their source code). Following the semantics of TLA+, and since programs are terminating, the interleaving semantics of a program P can be described by an acyclic LTS representing all its behaviors. Formally, the LTS of P is a tuple T_P = ⟨S, s₀, Σ, →⟩ where Σ represents the actions of P; a state s = ⟨l, g⟩ in S consists of the local states l of all actors (i.e. local variables, Requests) and the state g of all shared objects, including Mutexes, Mailboxes and Communications; in the initial state s₀ all actors are in their initial local state, and all sets and FIFO queues are empty; a transition s −a→ s′ is defined if, according to the TLA+ model, the action encoded by a is enabled at s and executing a transforms the state from s to s′.
Notice that when verifying a real program, we only observe its actions and assume that they respect the proposed TLA+ model and the independence relation discussed below. These assumptions are necessary for the LTS to correctly model the actual program behaviors.

Additional Property of the Model
The model presented in the previous section may appear unusual, because the lock action on mutexes is split into an AsyncMutexLock and a MutexWait, while most works in the literature consider atomic locks. Our model does not induce any loss of generality, since synchronous locks can trivially be simulated with asynchronous locks. One reason to introduce this specificity is that it entails the following lemma, which is the key to the efficiency of UDPOR in our model.

Lemma 1 (Persistence). Let u be a prefix of an execution v of a program in our model. If an action a is enabled after u, it is either executed in v or still enabled after v.
Intuitively, persistence says that once enabled, actions are never disabled by any subsequent action, and thus remain enabled until executed. Persistence does not hold for classical synchronous locks, as an enabled lock(m) action of one actor may become disabled by the lock(m) of another actor. The persistence property was introduced early on by Karp and Miller [9], and later studied for Petri nets [12]. It should not be confused with the notion of persistent set used in DPOR: persistent sets are linked to independence, while persistence is not.
Proof. When a is a LocalComp, AsyncSend, AsyncReceive, TestAny, AsyncMutexLock, MutexUnlock, or MutexTest action, a cannot be disabled by any new action. Indeed, these actions are never blocking (e.g. AsyncMutexLock amounts to appending an element to a FIFO queue, which is always possible) and only depend on the execution of the action right before them in the same actor.
WaitAny and MutexWait may seem more complex. If a is a WaitAny, being enabled after u means that one of the communications it refers to has been paired. Similarly, if a is a MutexWait, being enabled after u means that the corresponding actor is first in the FIFO queue of the mutex it refers to. In both cases, these facts cannot be modified by any subsequent action, so a remains enabled until executed.

Independence Theorems
In order to use DPOR algorithms for our model of distributed programs, and in particular UDPOR, which is based on the unfolding semantics, we need to define a valid independence relation for this model. Intuitively, two actions of distinct actors are independent when they do not compete on shared objects, namely Mailboxes, Communications, or Mutexes. This relation is formally expressed in TLA+ as so-called "independence theorems". We use the term "theorem" since the validity of the independence relation with respect to commutation must be proved; we proved these theorems manually and implemented them as rules in the model-checker. For instance, an AsyncSend on a mailbox m depends only on the preceding action of the same actor, on the AsyncSend actions of other actors targeting m, and on the WaitAny and TestAny actions concerning the same communication.

Adapting UDPOR
This section first recalls the UDPOR algorithm of [16] and then explains how it may be adapted to our context, in particular how the computation of extensions, a key operation, can be made efficient in our programming model.

The UDPOR Algorithm
Algorithm 1 presents the UDPOR exploration algorithm of [16]. Like other DPOR algorithms, it explores only a part of the LTS of a given terminating distributed program P, according to an independence relation I, while ensuring that the explored part is sufficient to detect all deadlocks. The particularity of UDPOR is to use the concurrency semantics explicitly, namely unfoldings, which makes it both complete and optimal: it explores exactly one interleaving per Mazurkiewicz trace and never explores any sleep-set blocked execution. The algorithm works as follows. Executions are represented by configurations, which stand for their Mazurkiewicz traces. The set U, initially empty, contains all events met so far in the exploration. The procedure Explore has three parameters: a configuration C encoding the current execution; a set D (for disabled) of events to avoid (playing a role similar to a sleep set in [8]), thus preventing revisits of configurations; and a set A (for add) of events conflicting with D, used to guide the search towards events in conflicting configurations of cex(C), so as to explore alternative executions.
First, all extensions of C are computed and added to U (line 4). The search backtracks (line 6) in two cases: when C is maximal (en(C) = ∅), i.e. a deadlock (or the program end) is reached, or when all events enabled in C should be avoided (en(C) ⊆ D), which corresponds to a redundant call, i.e. a sleep-set blocked execution. Otherwise, an enabled event e is chosen (lines 7-10), taken from A if this guiding information is non-empty (line 10), and a "left" recursive exploration Explore(C ∪ {e}, D, A \ {e}) is called (line 11): from the extended configuration C ∪ {e}, it keeps trying to avoid D, but e is removed from the guiding information A. When this call completes, all configurations containing C and e have been explored, so it remains to explore those that contain C but not e. To this aim, alternatives are computed (line 12) with the function call Alt(C, D ∪ {e}). Alternatives play a role similar to the "backtracking sets" of the original DPOR algorithm, i.e. sets of actions that must be explored from the current state. Formally, an alternative to D′ = D ∪ {e} after C in U is a subset J of U that does not intersect D′, forms a configuration C ∪ J, and is such that every event in D′ conflicts with some event in J. If an alternative J exists, a "right" recursive exploration Explore(C, D ∪ {e}, J \ C) is called: C is still the configuration to extend, but e is now also to be avoided, thus added to D, while the events in J \ C are used as guides. Upon completion (line 14), U is intersected with Q_{C,D}, which includes all events in C and D as well as every event in U conflicting with some event in C ∪ D.
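The control structure of this exploration can be illustrated on a toy, hand-built unfolding. The sketch below is ours: it reads extensions off a fixed event set instead of computing them on the fly, and uses a deliberately naive alternative search, but it exercises the left/right recursion scheme described above.

```python
# Compact, runnable sketch of the Explore skeleton of Algorithm 1, over a tiny
# hand-built unfolding (all names ours). Events are ints; 'CAUSES' and
# 'CONFLICTS' describe the finished unfolding, so ex(C) is read off directly.

CAUSES = {1: set(), 2: set(), 3: {1}, 4: {2}}  # two independent actors: 1<3, 2<4
CONFLICTS = set()                              # fully concurrent: one Mazurkiewicz trace
EVENTS = set(CAUSES)
traces = []

def conflict(e, f):
    return frozenset((e, f)) in CONFLICTS

def extensions(C):      # events outside C whose causes are all inside C
    return {e for e in EVENTS - C if CAUSES[e] <= C}

def enabled(C):         # extensions conflicting with no event of C
    return {e for e in extensions(C) if not any(conflict(e, f) for f in C)}

def alternative(C, D):  # naive Alt: one addable event conflicting with all of D
    for j in EVENTS - C - D:
        if CAUSES[j] <= C and all(conflict(j, d) for d in D):
            return {j}
    return None

def explore(C, D, A):
    en = enabled(C)
    if not en:
        traces.append(frozenset(C))            # maximal configuration reached
        return
    if en <= D:
        return                                 # sleep-set blocked: backtrack
    e = min(A & en) if A & en else min(en - D) # prefer the guiding set A
    explore(C | {e}, D, A - {e})               # "left" call: explore with e
    J = alternative(C, D | {e})                # then try to avoid e
    if J is not None:
        explore(C, D | {e}, J - C)             # "right" call: guided by J

explore(set(), set(), set())
print(len(traces))  # 1: both interleavings are Mazurkiewicz-equivalent
```

With four events and no conflicts there are several interleavings but a single trace, so exactly one maximal configuration is reported, illustrating optimality on this toy case.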
In order to avoid sleep-set blocked executions (SSBs) and obtain the optimality of DPOR, the function Alt(C, D ∪ {e}) has to solve an NP-complete problem [13]: find a subset J of U that can be used for backtracking and conflicts with all events in D ∪ {e}, thus necessarily leading to a configuration C ∪ J that is not already visited. In this case, en(C) ⊆ D can be replaced by en(C) = ∅ in line 5. Note that, with a different encoding, Optimal DPOR must solve the same problem [1], as explained in [13]. In [13], a variant of the algorithm is proposed in which the function Alt computes k-partial alternatives rather than alternatives, i.e. sets of events J conflicting with only k events of D, not necessarily all of them. Depending on k (e.g. k = ∞, or equivalently k = |D| + 1, for alternatives; k = 1 for the source sets of [1]), this variant allows tuning between an optimal and a quasi-optimal algorithm that may be more efficient.

Computing Extensions Efficiently
Computing the extensions ex(C) of a configuration C may be costly in general. It is for example an NP-complete problem for Petri nets, since all sub-configurations must be enumerated. Fortunately, this computation can be specialized for sub-classes of models. In particular, for the programming model of [16,13], it is tuned into an algorithm running in time O(n² log n), using the fact that events have at most two causal predecessors, which limits the subsets to consider.
This section tunes the algorithm to our more complex model, using the fact that the number of causal predecessors of events is also bounded. The next section shows how to compute ex(C) incrementally to avoid recomputations. Figure 2 illustrates some aspects of an extension. This section requires some additional notation. Given a configuration C and an extension with action a, let pre(a) denote the action right before a in the same actor, and preEvt(a, C) the event in C associated with pre(a) (formally, e = preEvt(a, C) ⟺ e ∈ C ∧ λ(e) = pre(a)). Given a set of events F ⊆ E, Depend(a, F) means that a depends on all actions labelling events in F.
The definition of ex(C) (the set of extensions of a configuration C), ex(C) = {e ∈ E \ C : ⌈e⌉ ⊆ C}, can be rewritten using the definitions of Section 2 as follows:

ex(C) = {⟨a, H⟩ ∉ C : H ∈ 2^C ∩ conf(E) ∧ a ∈ enab(H)}.

Fortunately, it is not necessary to enumerate all subsets H of C, which are exponentially many, to compute this set. According to the unfolding construction in Definition 4, an event e = ⟨a, H⟩ only exists in ex(C) if the action a is dependent with the actions of all maximal events of H. This gives:

S_{a,C} = {H ∈ 2^C ∩ conf(E) : a ∈ enab(H) ∧ Depend(a, maxEvents(H))}.

Now ex(C) can be simplified and decomposed by enumerating Σ, yielding:

ex(C) = ⋃_{a ∈ Σ} {⟨a, H⟩ : H ∈ S_{a,C}} \ C.

The above formulation of ex(C) iterates over all actions in Σ. However, interpreting the persistence property (Lemma 1) for configurations entails that for two configurations H and C with H ⊆ C, an action a in enab(H) is either in actions(C) or in enab(C). Therefore, ex(C) can be rewritten by restricting a to actions(C) ∪ enab(C):

ex(C) = ⋃_{a ∈ actions(C) ∪ enab(C)} {⟨a, H⟩ : H ∈ S_{a,C}} \ C.   (6)

Now, instead of enumerating the possible configurations H ∈ S_{a,C}, we can enumerate their sets of maximal events K = maxEvents(H). Hence, we work with S^max_{a,C} = {K ∈ 2^C : K is maximal ∧ a ∈ enab(config(K)) ∧ Depend(a, K)}, where K is maximal if ∄e, e′ ∈ K : e < e′ ∨ e#e′.
One can then specialize the computation of ex (C ) according to the type of action a. Due to space limitations, we only detail the computation for AsyncSend actions, the other ones being similar.
Computing extensions for AsyncSend actions. Let C be a configuration and a an action of type c = AsyncSend(m, ) of an actor Aᵢ. We want to compute the set S^max_{a,C} of sets K of maximal events on which a depends. According to the independence theorems (see Section 3.3), a only depends on the following actions: pre(a), all AsyncSend(m, ) actions of distinct actors Aⱼ concerning the same mailbox m, and all WaitAny (resp. TestAny) actions that wait for (resp. test) an AsyncReceive concerning the same communication c. Considering this, we now examine the composition of the maximal event sets K in S^max_{a,C}. First, two events labelled by AsyncSend(m, ) actions cannot coexist in K, formally ∄e, e′ ∈ K : λ(e), λ(e′) are AsyncSend(m, ) actions: indeed, if two such events exist in a configuration, they are dependent but cannot conflict, thus they are causally related and cannot both be maximal.
Second, if a WaitAny(Com) action concerns the communication c, there are two cases: (i) either c is not the first done communication in Com, and then WaitAny(Com) and the action a are independent; (ii) or c is the first done communication in Com, and then the WaitAny is enabled only after a. Thus the only possibility for a maximal event to be labelled by a WaitAny is when pre(a) is a WaitAny of the same actor. We can then write: ∄e ∈ K : λ(e) is a WaitAny ∧ λ(e) ≠ pre(a).
Third, all AsyncReceive events for the mailbox m are causally related in the configuration C, and c can only be paired with one of them, say c′. Thus a can only depend on actions TestAny(Com′) such that c′ ∈ Com′ and c and c′ form the first done communication in Com′, and all those TestAny events are causally ordered. Thus, there is at most one event e labelled by a TestAny in K such that λ(e) ≠ pre(a).
To conclude, K contains at most three events: preEvt(a, C), one event labelled with an AsyncSend action on the same mailbox, and one TestAny event for a matching AsyncReceive communication. There is thus only a cubic number of such sets, which is the worst case among the considered action types. Algorithm 2 generates all events in ex(C) labelled by an AsyncSend action a. In the example of Figure 2, there is only one AsyncSend event, e₃. Since ¬(e₂ < e₃) and ¬(e₃ < e₂), we form a first set K = {e₂, e₃} and add e₇ = ⟨AsyncSend, {e₂, e₃}⟩ to ex(C). Next, all TestAny events concerning the mailbox m are considered. Events e₂ and e₅ can be combined to form a new maximal event set K = {e₂, e₅}, but since a and λ(e₅) do not concern the same communication, D(a, λ(e₅)) is not satisfied and no event is created. Finally, combinations of e₂ with an AsyncSend event and a TestAny event are examined (lines 9-17). We then get K = {e₂, e₅, e₃}, and e₈ is added to ex(C) since D(a, λ(e₅)) holds in the configuration config({e₂, e₅, e₃}).
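The enumeration argument above can be sketched as follows (an illustration with invented event names, not Algorithm 2 itself): candidate sets K combine preEvt(a, C) with at most one AsyncSend event and at most one TestAny event, before the dependence and maximality checks filter them.

```python
# Illustrative sketch of the cubic enumeration for AsyncSend extensions:
# each candidate K = {pre} + (<= 1 same-mailbox send) + (<= 1 matching test).

from itertools import product

def candidate_sets(pre_event, sends_same_mailbox, testany_events):
    """All candidate K before dependence/maximality filtering (Section 4.2)."""
    Ks = []
    for s, t in product([None] + sends_same_mailbox, [None] + testany_events):
        K = {e for e in (pre_event, s, t) if e is not None}
        Ks.append(K)
    return Ks

# Hypothetical configuration with one competing send and two TestAny events:
Ks = candidate_sets("e2", ["e3"], ["e5", "e6"])
print(len(Ks))  # 6 candidates: (1 + 1 send) * (1 + 2 tests)
```

Since each of the three slots ranges over at most |C| events, the number of candidate sets is at most cubic in |C|, matching the bound stated above.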

Computing Extensions Incrementally
In the UDPOR exploration algorithm, after extending a configuration C by adding a new event e, one must compute the extensions of C′ = C ∪ {e}, which naively results in redundant computations of events. The next theorem improves on this by providing an incremental computation of extensions, where S_{a,C} = {H ∈ 2^C ∩ conf(E) : a ∈ enab(H) ∧ Depend(a, maxEvents(H))}.
Proof. With the definition of S_{a,C} as above, recall that

ex(C) = ⋃_{a ∈ actions(C) ∪ enab(C)} {⟨a, H⟩ : H ∈ S_{a,C}} \ C.   (6)

Applying the same equation (6) to C′, we get ex(C′). Now, exploring e from C leads to C′, which entails that λ(e) belongs to enab(C) and that actions(C) ∪ {λ(e)} = actions(C′); thus the range of a in ex(C′), which is actions(C′) ∪ enab(C′), can be rewritten as actions(C′) ∪ (enab(C′) \ {λ(e)}). First, separating actions(C′) from the rest in both ex(C) and ex(C′), we prove:

⋃_{a ∈ actions(C′)} {⟨a, H⟩ : H ∈ S_{a,C′}} = ⋃_{a ∈ actions(C′)} {⟨a, H⟩ : H ∈ S_{a,C}}.   (7)

(⊇) This inclusion is obvious since C′ ⊇ C, and thus S_{a,C′} ⊇ S_{a,C}. (⊆) Suppose there exists some event eₙ = ⟨a, H⟩ belonging to the left-hand set but not the right-hand one. If a = λ(eₙ) = λ(e), then e cannot appear in H (the action a cannot appear twice in the history of eₙ), so H ∈ S_{a,C′} ∩ S_{a,C} and eₙ is in both sets, resulting in a contradiction. If a = λ(eₙ) ≠ λ(e), there are two cases: (i) either e ∉ H, then H ∈ S_{a,C} and eₙ belongs to the right-hand set, a contradiction; (ii) or e ∈ H, then λ(eₙ) ∈ actions(C′) \ {λ(e)} = actions(C), thus there is another event e′ ∈ C such that λ(e′) = λ(eₙ), and e′ cannot belong to H (one action cannot appear twice in the history of eₙ). Besides, e is the last event explored in C′, thus a depends on λ(e) by Definition 4. Then eₙ conflicts with e′, contradicting their membership in a common configuration. This proves (7).

Experiments
We implemented the quasi-optimal version of UDPOR with k-partial alternatives [13] in a prototype adapted to the distributed programming model of Section 3, i.e. with its independence relation. The computation of k-partial alternatives is essentially inspired by [13]. Recall that the algorithm is optimal when k = |D| + 1, while k = 1 corresponds to Source DPOR [1]. The prototype is still limited and not connected to the SimGrid environment, so it can only be experimented on simple examples. We first compare optimal UDPOR with an exhaustive stateless search on several benchmarks (see Table 1). The first five benchmarks come from the Umpire test suite, while DTG and RMQ-receiving come from [10] and [17], respectively. The last benchmark is an implementation of a simple master-worker pattern. We expressed them in our programming model and explored their state spaces with our prototype. The experiments were performed on an HP computer with an Intel Core i7-6600U 2.60GHz processor, 16GB of RAM, and Ubuntu 18.04.1. Table 1 presents the number of explored traces and the running time for both the exhaustive search and optimal UDPOR. On all benchmarks UDPOR outperforms the exhaustive search. For example, for RMQ-receiving with 4 processes, the exhaustive search explores more than 20000 traces in around 8 seconds, while UDPOR explores only 6 traces in 0.2 seconds. Besides, UDPOR is optimal, exploring exactly one interleaving per Mazurkiewicz trace. For example, in RMQ-receiving with 5 processes, with only 4 AsyncSend actions concerning the same mailbox, UDPOR explores exactly 24 (= 4!) non-equivalent traces. Similarly, the DTG benchmark has only two dependent AsyncSend actions, thus two non-equivalent traces. Deadlocks are also detected by the prototype.
We also varied the value of k. When k decreases, one gains efficiency in computing alternatives, but loses optimality by producing more traces. It is then interesting to analyse whether this can be globally more efficient than optimal UDPOR. As in [13], we observed that in some cases fixing smaller values of k may improve efficiency. For example with RMQ-receiving, k = 7 is optimal, but reducing to k = 4 still produces 24 traces (thus remains optimal in practice) slightly more quickly (2.3 s), while for k = 3 the number of traces and the running time blow up. We plan to analyse this more precisely on more examples in the future.
Note that with our simple prototype we have not yet experimented with concrete programs (e.g. MPI programs), for which running times may grow considerably. We expect to do so in the coming months and then evaluate the algorithms in more depth. However, we believe that the results are already significant and that UDPOR is effective for asynchronous distributed programs.

Conclusion and Future Work
The paper adapts the unfolding-based dynamic partial order reduction (UDPOR) approach [16] to the verification of asynchronous distributed programs. The programming model we consider is generic enough to properly model a large class of asynchronous distributed systems, including e.g. MPI applications, while exhibiting some interesting properties. From a formal specification of this model in TLA+, an independence relation is built, which is used by UDPOR to partly build the unfolding semantics of programs. We show that, thanks to the properties of our model, some usually expensive operations of UDPOR can be made efficient. A prototype of UDPOR has been implemented and experimented on some benchmarks, yielding promising first results.
In the future, we aim at extending our model of asynchronous distributed systems while preserving its good properties, at deriving a more precise independence relation, and at implementing UDPOR in the SimGrid model-checker to verify real MPI applications. Once done, we will experiment with UDPOR more deeply: compare it with state-of-the-art tools on more significant benchmarks, analyse more precisely the efficiency of UDPOR compared to simpler DPOR approaches, and assess the impact of quasi-optimality on efficiency.