Highly-available and consistent group collaboration at the edge with Colony

Edge applications, such as gaming, cooperative engineering, or in-the-field information sharing, enjoy immediate response, autonomy and availability by distributing and replicating data at the edge. However, application developers and users demand the highest possible consistency guarantees, and specific support for group collaboration. To address this challenge, Colony guarantees Transactional Causal Plus Consistency (TCC+) globally, strengthened to Snapshot Isolation within edge groups. To help with scalability, fault tolerance and security, its logical communication topology is forest-like, with replicated roots in the core cloud, but with the flexibility to migrate a node or a group. Despite this hybrid approach, applications enjoy the same semantics everywhere in the topology. Our experiments show that local caching and peer groups improve throughput and response time significantly, performance is not affected in offline mode, and that migration is seamless.


INTRODUCTION
Internet-scale collaboration is a growing application area, as evidenced by games such as Overwatch or Ingress, shared editors such as Google Docs, Microsoft Office 365 or Apple iWork, or file-sharing systems such as Dropbox or Nextcloud. Mobile devices with Augmented Reality capabilities support location-aware games such as Pokémon Go and Harry Potter: Wizards Unite, or collaborative 3D modelling and manufacturing applications [47,54,64].
Existing systems are cloud-based, sometimes providing ad-hoc application-level caching. Consistency violations are common, baffling users and vexing application developers [11,19,20,55,68]. Support for offline operation is spotty. Users interact through the cloud only, even when direct communication would be possible. These systems lack collaboration features such as group management or versioning. This paper presents the Colony database and middleware designed to address these issues. A top requirement is an edge-first approach [25] that locates data on device, to provide availability, fast and seamless response independently of network connectivity and of location, and so that users have ownership of their data. However, this makes it challenging to satisfy consistency and freshness expectations.
To answer this challenge, Colony takes a hybrid approach, and provides the highest consistency guarantees compatible with availability (TCC+), strengthened further (to SI) in well-connected zones. CRDT data types ensure convergence without rollbacks. A related challenge is the overhead of concurrency metadata (fat vector clocks), which we limit thanks to a flexible forest topology and to SI zones.
To address the requirements of group collaboration, Colony versions data, enables an edge group to share without relying on the cloud, and supports collaborative security features. Our design provides uniform access to data across a spectrum spanning core cloud to the far edge.
Finally, this paper addresses a number of design and implementation challenges, including disconnected operation, consistency under migration, total-order consensus at the edge, and avoiding single points of failure despite the forest topology.
We claim the following contributions:
• A decentralised database architecture, designed for collaborative applications, that provides a continuum spanning from the core cloud to the far edge.
• A hybrid consistency model, based on causal and transactional guarantees globally, strengthened to total-order consistency in edge collaboration groups and in geographical proximity.
• A scalable metadata and topology design that bounds the overhead of the required consistency metadata, and that supports seamless disconnection or migration of a node or of a whole group.
• A novel approach to access control that leverages the consistency model.
• Efficient design and implementation of Colony and its experimental evaluation demonstrating the benefits of our approach.
Our experimental evaluation shows that: local and group caching improve throughput by 1.4× and 1.6× respectively, and response time by 8× and 20×, compared to a classical cloud configuration; performance in offline mode remains the same as online; and transitions between online and offline, as well as migration, are seamless.

Global Consistency Guarantee: TCC+
In the edge context, two different consistency models have been explored. Although they are incomparable, each has been proved to be a strongest possible model compatible with availability under partition. One is Monotonic Prefix Consistency (MPC), which combines the per-process orders into a global total order; however, a process is exposed to arbitrary rollbacks [18]. We argue that a client losing an unpredictable amount of work is an unacceptable user experience.
Our preferred alternative is Causal Consistency (CC) [5]. Intuitively, if a client observes some update, it also observes all preceding updates. Only concurrent updates may be observed in different orders.
CC can be enforced locally and does not require consensus; the drawback is that tracking the partial order in CC can have a heavy metadata cost, as we discuss later. On top of CC, atomic transactions and convergence guarantees can be supported without impacting availability [2,69], a model we call Transactional Causal Plus Consistency (TCC+). Section 3.1 formalises the TCC+ guarantees.

Local Strengthening: Data Centre
Nodes that are strongly connected to each other can provide even stronger guarantees, totally ordering updates upfront, for instance Snapshot Isolation (SI). SI is stronger than MPC, as it does not suffer rollbacks, and than CC, as it totally orders updates; the metadata cost is much reduced. We call a set of nodes that enjoy SI among themselves an SI zone.
One kind of SI zone is a data centre (DC). A DC has a large number of parallel servers, connected through a high-quality network. Colony executes transactions across multiple servers in the same DC under SI [2,13].
From the perspective of global TCC+, a DC behaves like a single sequential process, thus limiting metadata size.

Local strengthening: Peer groups
This section examines another kind of SI zone, the peer group. Inconsistency is especially problematic for users who communicate directly, outside of the database. For instance, in the enhanced-reality game Pokémon Go, there is an anomaly where two users in close proximity can both become the owner of the same game character, confusing them [31]. This and similar examples argue for groups with stronger consistency.
In Colony, edge nodes in network proximity can constitute an SI zone, called a peer group. According to SI, transactions become visible in some a priori total order with no rollback. Their mutual strong consistency improves user experience and metadata management. To avoid that these stronger guarantees be at the expense of availability, the system should support disconnected work and mobility within the topology, without losing the TCC+ guarantees.

Security Requirements
To support collaboration, Colony supports versioning and trust management. Because it is the edge device that executes and merges updates, data can remain encrypted end-to-end (end-to-end encryption is not yet implemented in the current prototype); the untrusted cloud serves merely for transport and persistence [28]. On the other hand, the edge use case poses new security challenges. Information is exposed on compromised edge nodes [62]; concurrent changes to the security policy and to data weaken security [65,67]; and decentralised key management is problematic [28]. We alleviate these difficulties by leveraging the cloud, e.g., for authentication and key management.
Our focus in the security area is support for group collaboration. Every data object comes with an Access Control List (ACL) that describes which updates users are allowed to perform. The system preventively enforces ACLs in edge devices. Because an edge device may be compromised, every node double-checks the updates it receives, and masks any update that is not allowed by the corresponding ACL, and transitively any update that depends on it. Thus a correct node never depends upon a state that violates the security policy.

PROTOCOL DESIGN
We turn now to a system design for satisfying the above requirements efficiently. Our design is an extension of the SwiftCloud approach [69].
Colony uses caching and replication to ensure that a client can execute locally. The system must remain safe at all times; specifically, the data observed by a client always satisfies the TCC+ and security invariants defined below. It should also remain available.
The trade-off is that, during some failures, liveness cannot be ensured. A client cannot make progress in two cases: if it requires data that cannot be retrieved; or if it runs out of storage. Furthermore, there are corner cases (described later) where a client commits updates, but they cannot become visible. The above situations are temporary, and last only until the problem is repaired.
Our system ensures convergence by using operation-based CRDTs, which merge concurrent conflicting operations deterministically [48]. As underlined in the Introduction, supporting causal consistency (CC) can have high metadata overhead; our design bounds metadata to a small size. Similarly to recent CC designs [2,9], Colony separates (internal) state management from (external) visibility: the backend layer transmits and stores states efficiently, without regard for correctness, whereas the visibility layer manages metadata and ensures that an application observes only those states that satisfy the TCC+ guarantees.

The TCC+ Guarantees
We now tersely specify the TCC+ guarantees. We use the following notations and definitions. Nodes (at any level of the topology) are noted n, n′. A node behaves sequentially, executing one transaction at a time. A node might fail, in which case it ceases executing (fail-stop); a node that does not fail is said to be correct. x, y designate data objects. Transactions are noted T, T′. A transaction consists of a sequence of reads and updates. A transaction is interactive, i.e., the objects it accesses are not known in advance. A read has no side effect; an update does not return a response value. We write o ∈ T when operation o (a read or an update) belongs to transaction T. A transaction executes at a single replica; if it commits, its updates are broadcast to be replayed by the other replicas. An operation is noted o, o′, . . . ; in more detail, an update of x is written u(x), and a read of x is written r(x). The response value of r(x) is res(r(x)).
Following Viotti and Vukolić [63], an abstract execution A = (H, →vis, →ar) consists of an interleaving of the operations executed by the nodes, or history (H); a visibility relation (→vis), a partial order that accounts for the propagation of updates in the system; and an arbitration relation (→ar), a per-object total order over H that helps to solve concurrency conflicts. The order in which nodes execute operations is called the program order. The happened-before relation (≺) is the transitive closure of the union of visibility and program order [63].
Hereafter, we consider only transactions that commit; we can safely ignore the operations of a transaction that aborts, since it has no effect.
The phrase "visible in node n" refers to an operation that is visible to some operation executed at node n. Each object starts in some known initial state. The return value of a read is computed according to the semantics of the prior updates to the object (including updates in the same transaction). That is, for each read operation r(x), res(r(x)) results from some linearization L(r(x)) of the updates visible to r(x), consistent with ≺. TCC+ is defined by the following invariants.
Causal Consistency (CC). Causal consistency requires that every update that happened-before an operation is visible to that operation, and that arbitration is consistent with happened-before. The above invariants constrain the behaviour of individual operations. Below, we formalise the fact that a transaction is atomic (i.e., all-or-nothing). We define the following equivalence relation, written ≡: if operations o and o′ are in the same transaction T, then o ≡ o′. For some relation R over the set of operations, we say that R is left-compatible with ≡ when, for any three operations o, o′ and o″, if o ≡ o′ and (o′, o″) ∈ R, then (o, o″) ∈ R. Right-compatibility is defined symmetrically, that is, o′ ≡ o″ ∧ (o, o′) ∈ R ⟹ (o, o″) ∈ R. Relation R is compatible with ≡ when it is both left- and right-compatible with it.
Atomicity. If two updates occur in the same transaction, then they are visible atomically, and arbitrated in the same way. Formally, visibility and arbitration are compatible with transactional ≡.

Snapshot. A transaction takes all its reads (independently of their order) from a same snapshot, which is sound both causally and for the atomicity relation. Formally, if r(x) ≡ r′(x) then L(r(x)) = L(r′(x)).

Additionally, the following liveness property should hold:

Eventual Visibility. If two correct nodes n and n′ are not permanently disconnected from one another, and u(x) is visible in n, then eventually u(x) is visible in n′.
TCC+ extends Transactional Causal Consistency, as defined by Zawirski et al. [69], with the Strong Convergence and Rollback-Freedom properties. This ensures that progress is monotonic at each node.
To illustrate the concepts in this section, consider the history in Figure 2, which depicts the evolution of a CRDT counter when nodes execute increment operations, and propagate such updates (depicted by arrows). (For now, ignore the version, commit and snapshot information, which will be detailed later.) The history in the figure is causally consistent. Indeed, every new increment updates the counter to a state also containing the preceding operations (e.g., after event ⑥, the counter value is 2). Similarly, there is no rollback at any node. Nodes that received the same increments (e.g., events ⑦ and ⑧) are in the same state; therefore this history satisfies strong convergence. Moreover, since every transaction contains a single operation, the history trivially ensures the atomicity and snapshot requirements.

Strengthening to SI
Colony strengthens the above TCC+ guarantees to strong consistency in an SI zone.
In an SI zone, Colony ensures Snapshot Isolation (SI). This means that →ar is gapless [49], i.e., for any operation o′ visible to o, every operation o″ such that o″ →ar o′ is also visible to o.

Bounding metadata
This and the following sections detail the logic to achieve the above consistency guarantees.
Supporting CC requires metadata, which can represent a substantial overhead; this section explains how Colony bounds metadata to a small size.
The CC invariant dictates that an update may become visible only if its dependencies (i.e., the updates that happened-before it) are themselves visible. To check this, when transmitting an update, Colony piggy-backs some associated visibility metadata, a vector timestamp (or version vector) that summarises its dependencies [15,37]. Vector timestamps support efficiently computing the set of missing dependencies [41].
A precise representation of the happened-before order among N concurrent writers requires a vector of size ≥ N [12]. As N grows, the overhead on every message quickly becomes unacceptable. The following sections describe some techniques that we use to keep the size small, at the cost of spuriously ordering some concurrent events.

Topology and metadata design
We first turn to the topology design (illustrated in Figure 1) and the metadata design.
Each DC forms an SI zone; therefore, the updates of a given DC are totally ordered; externally, it behaves as a single sequential node. On the other hand, DCs are connected in a full peer-to-peer mesh; their updates are partially ordered, which requires a vector.
Since each DC appears sequential, a timestamp vector with one entry per DC suffices to identify a point in the CC partial order between DCs. Component i of the vector numbers the (sequentially ordered) transactions committed at DC i.
The least upper bound (LUB) of two vectors is defined as their component-wise maximum. Each node maintains its state vector, which is the LUB of the commit timestamps (defined next) that it has observed.
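To make these vector operations concrete, here is a minimal TypeScript sketch (our own illustration; the type and function names are not from Colony's codebase) of component-wise comparison, the LUB, and the missing-dependency computation mentioned above:

```typescript
// Illustrative sketch only: one vector entry per DC; entry i counts the
// transactions committed at DC i.
type VectorClock = number[];

// Least upper bound (LUB): the component-wise maximum of two vectors.
function lub(a: VectorClock, b: VectorClock): VectorClock {
  const n = Math.max(a.length, b.length);
  return Array.from({ length: n }, (_, i) => Math.max(a[i] ?? 0, b[i] ?? 0));
}

// a ≤ b iff every component of a is at most the corresponding one of b.
function leq(a: VectorClock, b: VectorClock): boolean {
  return a.every((v, i) => v <= (b[i] ?? 0));
}

// Dependencies of an incoming update not yet covered by the local state:
// for each DC, the range of transaction numbers still missing locally.
function missingDeps(local: VectorClock, dep: VectorClock) {
  return dep
    .map((v, i) => ({ dc: i, from: (local[i] ?? 0) + 1, to: v }))
    .filter(r => r.from <= r.to);
}
```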
A transaction has a unique identifier called its dot [3].
Communication between DCs is a full mesh. Edge nodes (border or far-edge) are partitioned into distinct trees, forming a forest, as illustrated in Figure 1. Each tree is rooted at a specific DC, which we call its connected DC. A subtree may detach itself from its parent and migrate to a different tree, e.g., to accommodate mobility or a failure.

Transaction metadata
We now describe the metadata that Colony associates with a transaction : its snapshot and commit timestamp vectors, and its dot.
Transaction T's snapshot vector T.snap describes the (previous) transactions that it depends upon. T.snap forms a snapshot closed under causal consistency and atomicity. The meaning of T.snap[i] = c is the following: T reads from all the transactions T′ committed at DC i up to time c, and no later, i.e., such that T′.commit[i] ≤ c.
A read-only or aborted transaction terminates without side effects. The commit vector T.commit of an update transaction represents the point where it commits. It is greater than its snapshot vector; if the transaction commits at DC i, the two vectors differ only at index i.
Transaction T is before T′ if T.commit ≤ T′.snap. If neither is before the other, they are said to be concurrent.
Finally, a transaction has a unique timestamp called its dot, T.dot, which both serves as a unique identifier and provides the (total) arbitration order between concurrent transactions (as defined in Section 3.1).
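Assembled into code, the transaction metadata might look like the following sketch (reusing the VectorClock helpers above; all names are ours, not Colony's):

```typescript
// Illustrative sketch of per-transaction metadata.
interface Dot { time: number; origin: string } // unique id + arbitration order

interface TxMetadata {
  snap: VectorClock;    // causally-closed snapshot the transaction reads from
  commit?: VectorClock; // assigned at commit; undefined while still symbolic
  dot: Dot;
}

// T is before T' when T's commit point is covered by T''s snapshot.
function isBefore(t: TxMetadata, u: TxMetadata): boolean {
  return t.commit !== undefined && leq(t.commit, u.snap);
}

function areConcurrent(t: TxMetadata, u: TxMetadata): boolean {
  return !isBefore(t, u) && !isBefore(u, t);
}

// Concurrent transactions are arbitrated deterministically by their dots.
function arbitrate(a: Dot, b: Dot): number {
  return a.time !== b.time ? a.time - b.time : a.origin.localeCompare(b.origin);
}
```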

In-DC transaction protocol
Let us describe how the system computes metadata in the simple case of a transaction T that executes within some DC i. By default, T.snap is assigned the current state vector of DC i. The system checks that T.snap represents a consistent cut [2,45] such that T.snap[i] ≤ current_time. T's unique dot is T.dot := (current_time, i). The commit protocol is a standard two-phase commit among the servers of DC i (we use Clock-SI [13]). The commit vector is equal to the snapshot vector, except that T.commit[i] := current_time. Object versions created by the transaction are marked with version timestamp T.commit. As Colony objects are operation-based CRDTs, materialising a version may require applying multiple updates [10,48]. Conversely, concurrent transactions that update the same CRDT object are merged and by default do not abort; a transaction can still abort for semantic reasons, e.g., if it would violate some invariant. We assume a higher level of concurrency control to detect such violations [4,21,22], which is out of the scope of this paper.
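Under these assumptions, the metadata assignment for an in-DC transaction reduces to a few lines. The sketch below (our simplification, with current_time passed in explicitly) shows the snapshot check and the commit-vector update:

```typescript
// Sketch of in-DC metadata assignment for a transaction at DC i.
function startAtDC(i: number, dcState: VectorClock, currentTime: number): TxMetadata {
  const snap = [...dcState];
  // The snapshot must be a consistent cut: simplified here to the check
  // that the local component does not exceed the current time.
  if (snap[i] > currentTime) throw new Error("not a consistent cut");
  return { snap, dot: { time: currentTime, origin: `DC${i}` } };
}

// After two-phase commit succeeds: the commit vector equals the snapshot
// vector except at index i, which records the commit time.
function commitAtDC(t: TxMetadata, i: number, commitTime: number): TxMetadata {
  const commit = [...t.snap];
  commit[i] = commitTime;
  return { ...t, commit };
}
```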
We illustrate the in-DC transaction lifecycle in Figure 2, events ⓪ through ④. Focus on the three DCs, numbered 0, 1 and 2, and on the CRDT counter.

Basic Edge Transaction Protocol
A transaction may execute and commit in an edge node. In this case, commit is asynchronous, i.e., for availability, the edge node continues to execute further transactions without waiting for the DC to assign its commit vector.
Starting a transaction is similar to the in-DC case: the edge node assigns its snapshot, and a dot using the edge node's unique identifier. The transaction commits locally at the edge node, which can immediately start another, dependent transaction. Until it receives the DC's acknowledgement, the commit timestamp remains symbolic, i.e., indeterminate, subject only to the invariant T.snap < T.commit.

Node Migration and K-Stability
A fixed forest is inflexible, and a single fault may have a disproportionate impact. Therefore, Colony supports migrating a node and the subtree attached to it. Ideally, node migration should be seamless and transparent to applications, but unfortunately this is not completely possible.
Migration creates some extra complications to the edge transaction protocol, which we consider next. For simplicity, we focus on the case of a single migrating edge node. Hereafter, we focus on the migration mechanism, and ignore the policy decision of why or when to migrate, e.g., in response to a network failure.
Avoiding Duplicates. Migration can change the connected DC of the node. Consider the edge transaction protocol described above. Suppose that some edge node sends its transaction T to its connected DC i, loses the connection to DC i, then migrates to DC j ≠ i. As the edge node does not know whether DC i received T, it sends T again to DC j. Although T might now be received twice, via both DCs, a replica should replay it only once; the transaction's dot T.dot serves to filter out such duplicates. To this effect, every node keeps track of the highest dot assigned by each other node, and ignores a transaction whose dot is less than or equal to this value.

(Figure 2 caption: This system has three data centres, DC0, 1 and 2, and edge nodes A and B. Vector components refer to DC0, 1 and 2 respectively. Dots are omitted from the figure. The k-stability objective is 2.)
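A duplicate filter of this kind takes only a few lines of TypeScript (our illustration, relying on the Dot type above and on dots being assigned in increasing order per origin):

```typescript
// Sketch: each node remembers the highest dot applied from every origin and
// ignores anything not strictly newer, so replays via two DCs apply once.
class DuplicateFilter {
  private highest = new Map<string, number>();

  // Returns true iff the transaction is new and should be replayed.
  shouldApply(dot: Dot): boolean {
    const seen = this.highest.get(dot.origin);
    if (seen !== undefined && dot.time <= seen) return false; // duplicate
    this.highest.set(dot.origin, dot.time);
    return true;
  }
}
```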
K-Stability to Avoid Causal Incompatibility. Consider an edge node that migrates from DC i to a new connected DC j. If the state of DC j includes that of DC i, the edge node's dependencies remain satisfied, and migration is seamless. We say that the states are causally compatible. However, it might happen (for instance, because of a communication failure) that an edge transaction T′ depends on a transaction T that was visible at DC i but not yet at DC j. The snapshot of T′ does not satisfy the CC invariant at DC j, which cannot apply it and cannot assign its commit vector. The edge node remains effectively disconnected, and its transactions are not visible to the rest of the system. We say the edge node's state is incompatible with DC j.
If T was not visible to T′, the above dependency could not exist, and the nodes would remain compatible. Thus, one possible approach would be to let a transaction become visible at the edge only once it is known at all DCs. However, a single slow DC would delay edge visibility of all transactions.
Our solution, taken from SwiftCloud [69], is twofold. First, to ensure the Read-My-Writes session guarantee [57], an edge node's transactions are always visible to itself. Second, to decrease the probability of incompatibility, a transaction T becomes visible to edge nodes only after it is visible at K or more DCs, where K ranges between 1 and the total number of DCs [69]. The higher K, the higher the probability that the new DC is compatible with the old one, i.e., that its state includes the dependencies of T′. The exact value of K is a trade-off between two extremes. If K = 1, the probability of incompatibility is high. If K equals the number of DCs, a single slow DC could prevent all edge transactions from becoming visible.
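The visibility rule itself is simple. The following sketch (our names, not Colony's API) tracks per-transaction acknowledgements and combines K-stability with the read-my-writes exception:

```typescript
// Sketch of the K-stability rule: a transaction becomes visible at the edge
// once at least K DCs acknowledge it, except that a node's own transactions
// are always visible to itself (read-my-writes).
class StabilityTracker {
  private acks = new Map<string, Set<string>>(); // tx id -> acknowledging DCs

  recordAck(txId: string, dc: string): void {
    let s = this.acks.get(txId);
    if (s === undefined) { s = new Set(); this.acks.set(txId, s); }
    s.add(dc);
  }

  visibleAtEdge(txId: string, origin: string, self: string, K: number): boolean {
    return origin === self || (this.acks.get(txId)?.size ?? 0) >= K;
  }
}
```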
To illustrate K-stability, refer again to Figure 2, where the visibility limit is set to K = 2 and each transaction is annotated with the number of DCs where it is stable. Thus, a same transaction may carry up to K equivalent commit timestamps. We optimise their memory size as follows. Recall that a commit vector differs from the snapshot vector in a single component, that of the DC that accepted it; the others are not significant. Therefore, Colony stores multiple commit vectors into a single vector with one entry per DC, containing a significant value only for a DC that accepted the transaction. For simplicity, Figure 2 does not depict this optimisation.

Transaction Migration
Resource-hungry transactions should run in the core cloud rather than the edge. Examples include analytics or large queries. Colony supports migrating them to a trusted node in the core cloud for execution.
The migrated transaction must have the same effect as if it ran on the edge node; only performance should differ. Thanks to TCC+, it suffices to assign the same snapshot vector.
Thus, the client primes the snapshot with its own state vector and sends the transaction code. Before the transaction starts, the DC must have received the client's local transactions, which the new one depends upon (Section 5.1.3 explains how we accelerate this). This ensures that every read can be satisfied. The migrated transaction executes in the DC just like that of a standard local client, and its results are sent back to the requesting edge node.
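The handshake can be summarised as follows, a sketch under our own interface names rather than Colony's API:

```typescript
// Sketch of transaction migration to the cloud. The edge client primes the
// snapshot with its own state vector, so the DC executes the code against
// exactly the state the edge node would have read locally.
interface CloudRunner {
  ensureReceived(deps: VectorClock): Promise<void>; // client's local txs first
  run(snapshot: VectorClock, code: string): Promise<unknown>;
}

async function migrateTransaction(dc: CloudRunner, clientState: VectorClock,
                                  code: string): Promise<unknown> {
  await dc.ensureReceived(clientState); // the migrated tx depends on them
  return dc.run([...clientState], code); // executes like a local DC client
}
```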

DATA MANAGEMENT
As explained in Section 3, Colony ensures convergence with operation-based CRDTs [48], and separates (internal) state management from (external) visibility: the backend layer stores and transmits state efficiently, while the visibility layer ensures that applications observe only states that satisfy the TCC+ guarantees [2,9].

Versioning
Colony stores an object persistently as a base version and a journal of updates since the base version. To materialise an arbitrary object version, the cache first reads the base version from the store, and applies the missing updates from the journal. Occasionally, the system advances the base version.
A transaction reads from its snapshot, logs its updates to the journal, and materialises new versions in a private buffer. When the transaction commits, it updates the cache from the buffer. Both the updates recorded in the journal, and object versions that result from committed transaction , are labelled with vector . and dot . .
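Materialising a version from the base version and the journal can be sketched as follows (our illustration, reusing the vector helpers above; apply() stands in for the CRDT update semantics):

```typescript
// Sketch: a stored object is a base version plus a journal of updates,
// each labelled with the commit vector and dot of its transaction.
interface JournalEntry<S> {
  commit: VectorClock;
  dot: Dot;
  apply(state: S): S; // operation-based CRDT update
}

// Materialise the version identified by `target`: start from the base
// version and replay the journalled updates included in the target
// snapshot but not already folded into the base.
function materialise<S>(base: S, baseVersion: VectorClock,
                        journal: JournalEntry<S>[], target: VectorClock): S {
  return journal
    .filter(e => leq(e.commit, target) && !leq(e.commit, baseVersion))
    .reduce((state, e) => e.apply(state), base);
}
```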

Edge Caching
An edge node cannot replicate the whole database, but can only cache some small fraction of it. An edge client may declare interest in some object to add it to its node's cache. The connected DC regularly informs the client of updates to its interest set.
At any point in time, the state vector of an edge node is the LUB of the state received from its connected DC (itself ≤ the DC's current state vector) and the commit vectors of local transactions. Choosing a snapshot vector ≤ the node's state vector ensures that every read can be satisfied either from the local cache or from the connected DC. It may happen that the client requires an object version that cannot be retrieved (from the cache, the DC, or another node), in which case the transaction cannot proceed. This limitation of availability is inherent to the edge environment.

GROUPS
Colony supports two distinct group mechanisms: the peer group, an SI zone at the edge, and the collaboration group, nodes that update the same data. Peer groups are disjoint, whereas collaboration groups may overlap. All nodes in a peer group are in the same collaboration group.

Peer Groups
A peer group is a set of nodes with high-availability, low-latency connection to one another. It makes sense to provide SI within the group. This enhances the user experience, and simplifies metadata management. A peer group creates opportunities to improve performance, by pooling resources into a collaborative cache, and to decrease the network load to the cloud by collecting the updates from many clients. Conceptually, a peer group consists of four related components, with distinct roles: managing group membership, sharing content within the group, communicating with the outside, and enforcing the SI order. They are described in further detail below.

Membership.
Membership of a peer group is seeded and managed by a single node, called the group's parent. The parent maintains a connection to each of the group members, stores their list, and informs them of any membership change. The parent is fixed but arbitrary, possibly located in the DC or on a point-of-presence (PoP) server. A node may serve as a member and a parent at the same time. To join or leave a group, a node contacts the group's parent. The parent responds with the membership list, as well as the session security key (described shortly). When a node migrates between groups, it uses the migration protocol previously described (Section 3.8); the new group must be causally compatible with the node's state.

Content Sharing.
Using the membership list, the group members and their parent maintain point-to-point connections. Above these connections, they construct a collaborative cache using a simple peer-to-peer protocol, sketched below. Each member publishes its current interest set to all its neighbours (the other members and the parent). This subscribes the member to receive all updates to its interest set. When a member updates an object in a neighbour's interest set, it pushes that update in a best-effort manner. Conversely, if a member observes that it is missing an update to its interest set (by examining the visibility log described below), it pulls the transaction from some neighbour. Objects evicted from a cache are unsubscribed to save resources. The parent maintains an interest set that is the union of those of the group members. It subscribes for updates outside the peer group on behalf of its members, as detailed next.
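The push/pull exchange might look like this (an illustrative sketch; the interfaces are ours, not Colony's):

```typescript
// Sketch of the collaborative-cache exchange within a peer group.
interface PeerNeighbour {
  interestSet: Set<string>;                           // objects the peer cares about
  pushUpdate(objectId: string, tx: TxMetadata): void; // best-effort push
  pullTransaction(dot: Dot): Promise<TxMetadata>;     // fetch a missing tx
}

class GroupMember {
  constructor(private neighbours: PeerNeighbour[]) {}

  // Best-effort push of a committed update to every interested neighbour.
  onLocalCommit(objectId: string, tx: TxMetadata): void {
    for (const peer of this.neighbours) {
      if (peer.interestSet.has(objectId)) peer.pushUpdate(objectId, tx);
    }
  }

  // Pull a transaction observed missing from the visibility log.
  async repairMissing(dot: Dot): Promise<TxMetadata> {
    return this.neighbours[0].pullTransaction(dot); // any neighbour will do
  }
}
```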

Communicating Outside the Group. As depicted in Figure 1, a subtree communicates with another one via some common ancestor. For simplicity, the description below assumes this ancestor is its connected DC.
Let us call synchronisation point (sync point) a node within a group that communicates with the DC. In the common case, this is the parent, but any member may also unilaterally become a sync point (e.g., before migrating a transaction to the DC), thus avoiding any single point of failure. A sync point sends all missing updates to the DC, and symmetrically subscribes to updates in its interest set. Importantly, the sync point makes updates visible to the DC in the visibility order described in the next section. This ensures that different sync points send identical information.

Transaction Protocol for Peer Groups.
A group as a whole should behave like a single, sequential edge node, from the perspective of the rest of the system. To ensure sequential ordering, causality and progress within a peer group, Colony relies on EPaxos [38]. Compared to other consensus protocols, EPaxos improves availability and performance, by allowing any group member to become the leader for any transaction, and by minimising synchronisation between non-conflicting transactions.
In addition to improving the user experience, consensus is also essential to correct metadata management. Recall from Section 5.1.3 that possibly multiple sync points send transactions to the DC. Without consensus, conflicting transactions would be sent in different orders, breaking causality and causing unsafe commit vectors.
When a peer node commits a transaction, it submits it to EPaxos. EPaxos ensures consensus on the order in which versions become visible in the group, which we call the visibility order. To this end, every peer maintains the list of visible transactions in a visibility log. A transaction executes in isolation against the local cache. Its dependencies are the union of the state vector, the node's previous transactions, and the transactions in the node's visibility log.
Within a peer group, two different variants of commit exist. In the first, the node submits the transaction to EPaxos in the critical path of commit. This has the effect of ordering the commitment of conflicting transactions within the peer group, possibly leading to aborts; non-conflicting transactions commit in parallel. This variant maintains Parallel Snapshot Isolation (PSI) within a group [52], ensuring that the group behaves as an SI zone.
The second variant follows a similar approach to Section 3.7. It assumes that transactions never conflict. The transaction commits locally as soon as it reaches the commit statement, and a new transaction can follow immediately. The transaction is then submitted to EPaxos in the background.
As pointed out above, committed transactions become visible in the order assigned by EPaxos. A sync point sends visible transactions to the connected DC according to the visibility order, where they get assigned a commit timestamp.
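The two commit variants differ only in whether consensus sits on the critical path. A minimal sketch, with EPaxos abstracted behind a submit() interface of our own naming:

```typescript
// Sketch of the two peer-group commit variants.
interface GroupConsensus { submit(t: TxMetadata): Promise<void>; } // EPaxos role

// Variant 1: consensus on the critical path. Conflicting transactions are
// ordered before commit returns (PSI within the group); they may abort.
async function commitStrict(t: TxMetadata, c: GroupConsensus): Promise<void> {
  await c.submit(t);
}

// Variant 2: commit locally right away, assuming transactions never
// conflict; the visibility order is agreed upon in the background.
function commitFast(t: TxMetadata, c: GroupConsensus): void {
  void c.submit(t); // off the critical path
}
```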

Migration Between Peer Groups
Just as a node can migrate between DCs (Section 3.8), it may migrate between peer groups. Similar consistency issues occur here. In this case, the base version of cached objects on the migrating node must be compatible with that of the new group. If the client is not missing any dependencies, or can retrieve them, then migration is seamless.
However, if the client is missing dependencies and the new peer group is offline, migration cannot succeed. If the client waits, its pending commits remain logged until the communication problem is fixed and they can be merged into the DC. In the meantime, the client might start a session with the new group, but its pending updates in the old session become invisible. Alternatively, the client might attempt to migrate again.

Collaboration groups
The mechanisms related to collaboration groups are trust management and versioning.
Messages are protected using symmetric cryptography. The authentication service provides a client with a session key per shared object, which she uses to decrypt data and sign her updates. This ensures that only legitimate clients can read an object. The key remains valid through disconnection and reconnection.
To keep out untrusted or unwanted updates, we leverage the separation between state and visibility, previously discussed in Section 3. Recall that an update is visible only if it satisfies the TCC+ invariants. In addition, it is visible only if it satisfies collaboration constraints.
To manage trust, the security administrator sets ACLs. Furthermore, a collaboration group can, for instance, restrict visibility to include only versions produced within the group. An update that does not satisfy the corresponding ACL or group constraints remains invisible, and transitively so do the updates that depend upon it. Thus, security policies and groups can evolve dynamically. Technically, this violates the monotonicity invariant, but in a very restricted manner. The store remains TCC+, but security and group constraints expose only a variable-size window thereof.

SYSTEM API AND IMPLEMENTATION
The Colony middleware is designed to provide a simple API for developing and deploying collaborative applications. This section presents its implementation and programming interface. The code is open-source and available on Gitlab [60].

API and Programming Model
An application node connects to a session manager (currently implemented in the core cloud), which authenticates the node. With the session opened, the node may join a collaboration or peer group, and run transactions accessing database objects. The node is notified of group change events (e.g., a new peer joins).
The database stores CRDT objects, such as counters, registers, sets, maps, or sequence datatypes. An object is stored in a namespace called a bucket. Opening a bucket caches it in the node; optional parameters can specify cache policies (e.g., LRU, writeback, etc.). The application can subscribe to an object's update events, in order to implement reactive programming patterns. A transaction is atomic (all-or-nothing) against multiple updates, and reads a TCC+-consistent snapshot of its opened buckets. Colony supports both interactive and batch transactions.
The Typescript example in Figure 3 illustrates the API. This application opens a session (line 1). Then, it creates and increments a CRDT counter object (lines 2-3). Further, it connects to a peer group (line 4), and updates the grow-only map (gmap) "myMap" in a transaction (lines 5-10). This map contains references to a register object (key "a") and a set object (key "e"). The counter update and the commit are both asynchronous (lines 7 and 11), returning a promise. At line 13, the client waits for the promise, and displays the content of the set.
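Figure 3 itself is not reproduced here; the following sketch approximates the described program. Every API name below (openSession, joinPeerGroup, startTransaction, and so on) is our guess from the description, not Colony's documented interface:

```typescript
// Hypothetical reconstruction of the Figure 3 example; API names are guesses.
declare const colony: any;

async function example() {
  const session = await colony.openSession("alice");  // open a session
  const counter = session.counter("myCounter");       // create a counter
  const incDone = counter.increment(1);               // async update (promise)
  await session.joinPeerGroup("myGroup");             // connect to a peer group

  const tx = session.startTransaction();              // update "myMap" atomically
  const gmap = tx.gmap("myMap");
  gmap.set("a", tx.register("aRegister"));            // reference to a register
  gmap.set("e", tx.set("aSet"));                      // reference to a set
  const commitDone = tx.commit();                     // async commit (promise)

  await Promise.all([incDone, commitDone]);           // wait, then display
  console.log(await gmap.get("e").values());          // the content of the set
}
```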

Communication protocol
Edge nodes communicate over WebRTC. Opening a client session occurs in the signalling phase of WebRTC and currently relies on a server in the core cloud, to simplify authentication and trust management.
The session provides the networking information required to communicate with the system, i.e., the IP addresses and ports of nearby peers, and the keys required to establish secure point-to-point connections with them. To migrate to a different peer group, the node relies again on the session server.

Storage
Cloud nodes (DCs and PoPs) have secondary storage and persist their data to it. They also cache data in memory for performance. Data in a DC is sharded by consistent hashing across multiple server machines, leveraging riak_core [26].
We do not assume that a far-edge node has a disk, and store its data in browser memory. When a disconnected client reconnects, it repopulates its cache, either from its peer group's content-sharing network, or from its connected DC.

Security
The authentication keys received from the session server serve to encrypt communication between nodes, using symmetric encryption. This ensures that only authenticated clients are able to observe and update objects. End-to-end encryption and decentralised authentication [28] are left for future work.
A system administrator can set a security policy with the help of access-control lists (ACLs). An ACL is a tuple from the set objects × users × permissions. It defines that a given user is granted access to some object and the operations she is allowed to execute on that object. Right inheritance (RI) is modelled using two forests, atop objects and users. If user u inherits from user v, then u holds the same ACLs as v. Similarly, if an object o inherits from some object o′, then any ACL granted on o′ also holds for o. Checking an ACL evaluates a first-order-logic predicate over the RI and ACL relations following the above logic. For instance, (C1) (book, Alice, own) ∈ ACL, or (C2) (book, shelf) ∈ RI ∧ (shelf, Bob, read) ∈ ACL, specify respectively that Alice owns a book and that this book is on a shelf readable by Bob.
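An ACL check with right inheritance amounts to a walk up the two forests. A sketch, with our own encoding of the ACL and RI relations:

```typescript
// Sketch of the ACL check with right inheritance; relations are encoded as
// a set of "object|user|permission" triples and child -> parent maps.
type AclRelation = Set<string>;
type InheritForest = Map<string, string>; // child -> parent (no cycles)

function ancestors(x: string, ri: InheritForest): string[] {
  const out = [x];
  for (let p = ri.get(x); p !== undefined; p = ri.get(p)) out.push(p);
  return out;
}

// E.g. with (book -> shelf) in the object forest and (shelf, Bob, read) in
// the ACL relation, checkAcl("book", "Bob", "read", ...) succeeds, as in (C2).
function checkAcl(obj: string, user: string, perm: string, acl: AclRelation,
                  objRI: InheritForest, userRI: InheritForest): boolean {
  return ancestors(obj, objRI).some(o =>
    ancestors(user, userRI).some(u => acl.has(`${o}|${u}|${perm}`)));
}
```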
An ACL check must respect the order in which clients modify both data and the security policy, to avoid unexpected behaviour. More precisely, the system must ensure [39] that: (i) ACLs are applied in the order they were issued, and (ii) ACL checks are evaluated on a fresh copy of data and metadata. If data and security metadata are mutually consistent according to TCC+, the first constraint is trivially satisfied. Let us use an example to illustrate the problem with the second constraint. Consider predicate C2, and assume that Alice, Bob and Carl share the bookshelf. Suppose Alice removes a book from the shelf on her node, while Bob makes the shelf readable by everyone. The two are concurrent from the causality perspective, and thus Carl may observe them in any order.
However, by the second constraint, if Bob's update occurs later in real time, then Carl must never see Alice's book on the shelf. If Bob's node is disconnected or slow to transmit, this requirement is violated.
Colony alleviates the above problem as follows. First, object versions are made visible according to the local copy of the RI and ACL relations. Second, it defers ACL checks until after commit. A committed transaction that fails an ACL check is not visible. In the above example, Alice's book may appear briefly on Carl's node; but as soon as Bob's update is delivered, it will disappear.

EXPERIMENTAL EVALUATION
This section presents an empirical evaluation of Colony. We first demonstrate the implementation of a realistic collaborative application atop the middleware. With this as our main benchmark, we then evaluate the platform experimentally, comparing it to a classical client-server approach in the cloud, and to a simple caching approach. We consider both the online and the offline case. In the former, we evaluate transaction throughput vs. response time, and behaviour under load. In the offline case, we measure reconnection time, i.e., the time it takes for disconnected clients to be synchronised again. We also evaluate the performance benefit of peer groups. Finally, we study migration in mobile setups, measuring the time to return to normal performance.

ColonyChat benchmark application
Overview. ColonyChat emulates a team collaboration application modelled after the Slack and Mattermost communication platforms [36,51]. It consists of approximately 1500 lines of Typescript code.
ColonyChat represents its three main entities, users, workspaces and bots with the help of CRDT objects.
In detail, a user has a profile, a list of events, a set of friends, and a set of workspaces she is a member of. A workspace contains the users that collaborate through the application and a set of channels. It also maintains the status of the users within the workspace (e.g., owner, ordinary, invited, or deleted). A channel holds a brief description, and the list of messages posted to it. A bot is a special kind of user. It automatically triggers an action when it observes some event, or a specific message on a channel. For instance, a bot might monitor activities within a file-system tree, or display weather information. Bots play an important role in the benchmark, as they generate a large number of update transactions.
The TCC+ guarantees of Colony ensure that there are no ordering anomalies in the application. For instance, an answer is guaranteed to be visible in a chat after the corresponding question. Moreover, atomic transactions allow maintaining invariants such as "a user is in a workspace if and only if the workspace is in the user's profile." Finally, within an SI zone such as a peer group, users observe updates in the same order, greatly simplifying collaboration.

Workload. The workload replays a trace from a Mattermost server; workspaces contain … channels on average. A user can be in more than one workspace, and one of the workspaces contains 1,000 users. Around 10% of the users are bots that act randomly upon receiving a message on the channel they have subscribed to. A user's actions follow a 90/10 read/write ratio. A user refreshes her local copy of a channel every 5 transactions. The trace follows a Pareto distribution for the actions, where 20% of the users execute 80% of the operations. It contains 40 days of activity in total and exhibits a diurnal cycle. In the experiments, the trace is accelerated to execute in a few minutes only.
For each experiment, we indicate when users are scattered in peer groups, or directly connected to a remote DC. The experiments use the second variant of the peer group commit protocol, i.e., EPaxos is off the critical path of commitment (Section 5.1.4). The current version of our benchmark does not exercise transaction migration (Section 3.9); this will be added in future work. Each experiment is executed 10 times, and we report the average.

Experimental Setup
We deploy each Colony component (edge client, cloud server, peer group, etc.) as a Docker container, on a set of dedicated servers in a cluster. Each server has two Intel Xeon Gold CPUs with 16 cores each, 128 GB of memory, and a 2 TB (spinning) hard disk. Nodes are all connected through 10 Gb/s network switches. A monitoring server, deployed in a separate container, captures the performance metrics.
We measure an average 0.15 ms network response time within the cluster. We use the Linux traffic shaping tool (tc) to simulate larger network response time, with a mean of 50 ms for mobile cellular data and 10 ms for carrier Ethernet. DCs are connected in a mesh using RabbitMQ sockets above TCP; peer groups are connected using WebRTC.

Response time and throughput
In this first experiment, we evaluate system performance when scaling up, increasing the number of clients until performance saturates. We compare three approaches. One emulates AntidoteDB [59], a classical geo-replicated approach, where a client does not have a local cache, and must contact the DC for each operation. Another emulates SwiftCloud [69], where clients have a local cache but do not form peer groups. Finally, the Colony label indicates a system with peer groups enabled. In each case, we evaluate a deployment with a single DC and one with three DCs. Figure 4 reports throughput vs. response time. It uses a log-log scale; down and to the right is better. Load doubles from one point to the next, from 4 to 1024 clients. As expected, at the beginning of the curve, throughput improves and response time remains stable. At some point, throughput levels out and response time degrades, indicating saturation.
Observe that Colony's response time is approximately 5 times better than SwiftCloud's, which itself performs one order of magnitude better than AntidoteDB (both for throughput and response time). This difference is explained by the caching policy. AntidoteDB does not have a client-side cache. In the SwiftCloud configuration, 90% of transactions hit the local cache. The hit rate reaches 95% in the shared cache of the Colony peer-group configuration.
Adding more DCs spreads the load in the AntidoteDB configuration, improving the maximum throughput of the system by 40% from one DC to three. However, response time improves only marginally, since clients need to contact the cloud for each operation; it remains 8× slower than the SwiftCloud configuration. In contrast, the number of DCs has a minor impact in the SwiftCloud and Colony configurations.

Response time of offline collaboration
We now evaluate how the response time varies under offline collaboration, or when the sync point of a group fails to connect to a DC. To this end, we use a single ColonyChat workspace that contains 36 users. We pack 12 of these users in a peer group, whereas the others remain independent. All users start with a warmed-up cache. In Figure 5, we observe that the response time for local cache hits is near zero (in blue). Users that belong to the same peer group benefit from an average 2.3 ms response time when data is fetched from the collaborative cache (in green). This rises to around 82 ms when the user needs to perform a remote read from the DC (in red).
Approximately 25 s after the start of the experiment, the peer group goes offline, and collaborates only on its shared interest set of objects. After this event, we observe that both the local and the peer response times are unchanged: users in the group do not observe remote transactions due to the disconnection, but they continue their collaboration seamlessly. Around 45 s after the beginning of the experiment, the group reconnects to the DC. In Figure 5, we can observe a slight increase of the response time at reconnection, yet it has minimal impact on performance.
In Figure 6, we consider the same workload, but this time we disconnect a single user from its peer group. The disconnection occurs after 25 s and the user reconnects 20 s later. We observe that the response time of the ColonyChat application is slightly impacted by the reconnection to the peer group: upon reconnecting, the user notices a small increase (below one millisecond) in the response time of its transactions. This variation comes from the fact that the channels were updated with the new content published by the users in the peer group during the disconnection.

Migration Effect on Response Time
Mobile clients, especially in location-based collaborative applications, like games, frequently switch from one peer group to another. Our last experiment studies the synchronisation time for a client to connect to a group when its cache is invalid. This experiment exercises both the cache refreshing mechanism of Colony and the collaborative cache in a peer group. The results are presented in Figure 7.
In this figure, 45 s after the start of the experiment, a mobile client migrates and joins the peer group. The client has a completely invalid chat history; she thus needs to synchronise her cache before interacting with the peer group. The figure shows the response time observed by the connecting user (in blue), and by the rest of the group (in green).
As previously, each dot in the plot represents the response time of a transaction. We can observe that the first transactions of the connecting user have a higher response time (below 12 ms). This performance degradation is far lower than the cost of reconnecting to a DC and fetching data from it (as in Figure 5). Moreover, after only a few seconds, the response time returns to normal and matches that perceived by the other group users (in green).

RELATED WORK
Much previous work on data in edge computing [23,44,58] focuses on streaming and content delivery. Examples include sensor systems or propagating database views. We leverage this previous work by propagating shared state as a stream of update events. Sharing mutable state raises extra challenges, which are the focus of this paper.

Achieving low response time is an ongoing challenge for many web applications [1,27,30]. In order to deliver fast response and offline support, applications may cache data at the client side, e.g., in the browser, as in News Feed [35], or offline Google Docs and Google Maps [53]. Mobile operation requires on-device replicas of data under weak consistency [46,56], as in Bayou [58], Rover [23] or Coda [24]. Similarly, Cimbiosys supports decentralised Internet services [44].
The COPS system introduces Causal Plus Consistency (CC+), strengthening causal consistency with strong convergence [33]. ChainReaction augments CC+ with transactional reads, thanks to a sequencer per DC, and executes a write only after the versions read by the client are stable in the DC. TCC+ [2] extends the guarantees of CC+ to transactions. This model is closely related to Parallel Snapshot Isolation (PSI) [52]. TCC+ is the strongest model compatible with availability under partition. It does not order concurrent transactions, but only requires convergence.
Colony guarantees TCC+ globally; in zones with good connectivity, it enforces an SI zone to improve metadata overhead and user experience. Whereas TCC+ supports concurrent updates and arbitrary CRDT types, PSI restricts concurrency to a single data type (the cset). Its SI zones, the DCs, are fixed. PSI supports global transactions that are strongly consistent, but this impacts availability and performance.
Hybrid consistency models combine different consistency guarantees; see for instance References 6, 14 or 32. In Lazy Replication, operations are causally consistent by default, and optionally linearisable [29]. Unistore [8] supports causal and linearisable transactions over a geo-replicated store; for fault tolerance, the causal dependencies of a linearisable transaction must be stable before it commits. Fisheye Consistency [16] is a proximity-based hybrid model, such that close-by nodes are mutually strongly consistent, while consistency is weak between far-away nodes.
Depot [34] and PRACTI [7] pioneer highly available caching at the edge, under CC with one vector clock entry per replica; this severely limits scalability. Depot targets Byzantine fault tolerance, but not transactions. Simba [40] enables the edge application to select among eventual, causal or serialisable consistency.
PouchDB [42] is a client-side cache that replicates data from a CouchDB server; it supports offline operation and detects conflicts, but does not merge them.
SwiftCloud [69] introduces bounded-size vectors and migration. Legion [61] extends web applications with peer-to-peer interaction using CRDTs under CC. Colony extends the above designs with collaboration and peer groups, and seamless migration.

CONCLUSION
We presented the design, implementation and performance of Colony, a system that brings the strongest consistency guarantees (while bounding the cost of causality metadata) to applications at the edge. According to an edge-first design, edge applications enjoy data locality, fast response, and disconnected operation. Colony supports seamless migration of a device or a whole peer group. Furthermore, Colony supports collaboration, ensuring total-order consistency within an edge group, and relevant security guarantees.
Several aspects remain open for improvement. As an edge device has limited resources, applications with a large footprint would benefit from better caching heuristics and automatic transaction migration. Placing clients at different levels of the hierarchy, in particular in Content Delivery Network points of presence, might improve perceived response time even more. Extending peer-to-peer communication beyond edge groups would make the system less dependent on the cloud.