MR-RBAT: Anonymizing Large Transaction Datasets Using MapReduce

Privacy is a concern when publishing transaction data for applications such as marketing research and biomedical studies. While methods for anonymizing transaction data exist, they are designed to run on a single machine and hence do not scale to large datasets. Recently, MapReduce has emerged as a highly scalable platform for data-intensive applications. In this paper, we consider how MapReduce may be used to provide scalability in transaction anonymization. More specifically, we consider how RBAT may be parallelized using MapReduce. RBAT is a sequential method that has some desirable features for transaction anonymization, but its highly iterative nature makes its parallelization challenging. A direct implementation of RBAT on MapReduce using data partitioning alone can result in significant overhead, which can offset the gains from parallel processing. We propose MR-RBAT, which employs two parameters to control parallelization overhead. Our experimental results show that MR-RBAT can scale linearly to large datasets and can retain good data utility.


Introduction
Publishing transaction data is important to applications such as personalized web search or understanding purchasing trends. A transaction is a set of items associated with an individual, e.g. search query logs, diagnosis codes or shopping cart items. Such data often contain person-specific sensitive information which may be disclosed in the process. De-identification (i.e. removing personal identifiers) may not provide sufficient protection for individuals' privacy [3,14], and attacks on de-identified data can still lead to two forms of disclosure: identity disclosure, where an individual can be uniquely linked to their transactions in the dataset, and sensitive item disclosure, where sensitive information about an individual is learnt with or without identifying their transactions.
One approach to protecting privacy is anonymization [10]. Various methods have been proposed [5,11,13,15,17,19], and RBAT [13] is one of them that has some desirable features. RBAT is able to deal with both types of disclosure, and can retain high data utility by allowing fine-grained privacy requirements to be specified. However, RBAT is designed to work in a centralized setting and requires the whole dataset to be memory-resident on a single machine throughout the anonymization process. This prevents RBAT from scaling to large datasets. For instance, Walmart [18] handles more than one million customers every hour, collecting an estimated 2.5 petabytes of transaction data in the process. RBAT is unable to handle datasets of this scale.
Recently, MapReduce [7] has emerged as a scalable and cost-effective data-processing platform for data-intensive applications. In this paper, we study how MapReduce may be used to improve RBAT's scalability. This is not straightforward due to the highly iterative tasks that RBAT performs. MapReduce does not support iterative processing well [4,12], and the computation in each iteration must be configured separately. This can generate a significant overhead which, if not managed, can offset the gains from parallel processing. To achieve high scalability while ensuring reasonable response time, we propose MR-RBAT, a MapReduce version of RBAT, which implements the key operations of RBAT in parallel and employs two parameters to control the overhead generated by MapReduce. Our experimental results show that MR-RBAT can scale linearly to large datasets and can retain good data utility.
The rest of the paper is organized as follows. Section 2 discusses the related work. In Sect. 3, we introduce some notations and give an overview of RBAT. Section 4 describes MR-RBAT that we propose in this paper. The experimental results are given in Sect. 5. Finally, Sect. 6 concludes the paper.

Related Work
Different privacy models have been proposed to guard transaction datasets against disclosure attacks, for example, k^m-anonymity [15,17], l^m-diversity [16], complete k-anonymity [11], ρ-uncertainty [5], (h, k, p)-Coherence [19] and PS-rules [13]. These models differ in terms of how they assume that the data may be attacked, e.g. k^m-anonymity assumes that an attacker knows up to m items in a transaction and ρ-uncertainty does not distinguish between public and sensitive items, but their sanitization approach is largely similar, relying on some form of data distortion. In this paper, we do not propose yet another privacy model, but instead we focus on the parallelization of the data sanitization method adopted by RBAT.
Recently, there has been considerable interest in MapReduce as a scalable platform for data-intensive applications. One issue that has received much attention is how to handle iterations efficiently in MapReduce. One approach is to extend the standard MapReduce framework itself. Ekanayake et al. [8] and Bu et al. [4], for example, proposed to avoid reading unnecessary data repeatedly from distributed storage, by identifying invariant data and keeping them locally over iterations. However, such methods can only be leveraged when most of the data remain static between iterations. Iterations of RBAT do not satisfy this requirement. Furthermore, these methods need to limit some features of the standard MapReduce framework. For example, forcing the data to remain local means that tasks involving such data cannot be scheduled to be processed on multiple computing nodes. Such limitations can result in poor performance, especially over heterogeneous clusters.
Other works have proposed to deal with iterative computations in MapReduce algorithmically. Chierichetti et al. [6] efficiently implemented an existing greedy Max-k-cover algorithm using MapReduce and achieved a provable approximation to the sequential results. Bahmani et al. [2] obtained a parallel implementation of K-means++ [1] and showed empirically that it achieves similar results in a constant number of rounds. MapReduce solutions have also been proposed for anonymization [20,21,22], but were limited to achieving k-anonymity for relational data only. Our work is similar to these works in that we also address the iteration issue algorithmically, but the existing methods, including those designed to achieve k-anonymity, cannot trivially be adopted to parallelize the operations performed by RBAT.

Background
In this section, we first present some basic notations and concepts necessary to understand our proposed solution, then give a brief overview of RBAT.

Preliminaries
Let D be a collection of transactions. Each transaction t ∈ D is a non-empty subset of I = {i_1, ..., i_n}, where each i_j ∈ I, 1 ≤ j ≤ n, is called an item. Any λ ⊆ I is called an itemset.
We partition I into two disjoint subsets P and S such that P ∪ S = I and P ∩ S = ∅. S contains items that are sensitive for the associated individuals, and P contains all other items, called public items. We assume that S needs to be published intact and that an attacker may have knowledge about individuals in the form of P.
When a set of transactions is released in its original form, certain combinations of public items may not appear frequently enough. This allows an adversary to link an individual to a small set of transactions, thereby breaching privacy. To protect against this, PS-rules may be specified [13].

Definition 2 (PS-rule). Given two itemsets p ⊆ P and s ⊆ S, a PS-rule is an implication of the form p → s.
Each PS-rule captures an association between a public and a sensitive itemset. The antecedent and consequent of each rule can consist of any public and sensitive items respectively and many PS-rules can be specified by data publishers to capture detailed privacy requirements. A published transaction dataset is deemed to be protected if the specified PS-rules are protected.

Definition 3 (Protection of PS-rule). Given a dataset D and the parameters k and c, a PS-rule p → s is protected in D if (1) σ_D(p) ≥ k and (2) Conf(p → s) = σ_D(p ∪ s) / σ_D(p) ≤ c, where σ_D(λ) denotes the support of an itemset λ in D, i.e. the number of transactions in D that contain λ.
Condition 1 protects data against identity disclosure by ensuring that the probability of associating an individual to his or her transaction in D using the antecedent of any PS-rule is no more than 1/k. Condition 2 prevents sensitive item disclosure by ensuring that the probability of linking an individual to a set of sensitive items specified by the consequent of a PS-rule is at most c, given that the probability of associating an individual to his or her transaction using the rule's antecedent is no more than 1/k.
Given a set of transactions D and a set of PS-rules Θ, if any rule in Θ is not protected, then D must be sanitized. One sanitization approach is set-based generalization, which attempts to hide an original item by replacing it with a set of items. It has been shown that set-based generalization retains data utility better than other generalization methods [13]. Consider D given in Table 1, for example. Suppose that we require k = 3 and c = 0.6. PS-rule ac → h is not protected in D as ac has a support of 1 only. But ac → h is protected in D̃ given in Table 2, where items a, b and f are replaced by (a, b, f) and c, d and e by (c, d, e) following the generalization, since ac is now supported by 4 transactions and Conf(ac → h) = 0.5.
It is easy to see that there can be many possible generalizations of a dataset to protect a set of PS-rules. The one that incurs the least distortion (or has a minimum loss of information) is preferred. RBAT uses the following measure to capture the loss of information as a result of generalization.

Definition 4 (Utility Loss). Given a generalized dataset D̃, the utility loss of a single generalized item ĩ is denoted by UL(ĩ). The utility loss of the whole dataset D̃ is calculated as Σ_{ĩ ∈ P̃} UL(ĩ), where P̃ is the set of all generalized items in D̃.
The UL measure given in Definition 4 captures the loss of information in terms of the size of the generalized itemset, its significance (weight) and its support in D̃. The more items are generalized together, the more uncertain we are about their original representation, and hence the greater the utility loss. w(ĩ) assigns a penalty based on the importance of the items in ĩ. The support of the generalized item also affects the utility of anonymized data: the more frequently the generalized item occurs in D̃, the greater the distortion to the whole dataset.
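As an illustration of the protection conditions, the following minimal Python sketch checks whether a PS-rule is protected before and after set-based generalization. The toy dataset, the item labels and the helper names (support, is_protected, generalize) are hypothetical and are not taken from Tables 1 and 2. The sketch only assumes that support counts the transactions containing an itemset and that Conf(p → s) is the ratio of the supports of p ∪ s and p.

```python
from typing import Dict, FrozenSet, List

Transaction = FrozenSet[str]


def support(dataset: List[Transaction], itemset: FrozenSet[str]) -> int:
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in dataset if itemset <= t)


def is_protected(dataset, p, s, k, c) -> bool:
    """Condition 1: support(p) >= k. Condition 2: Conf(p -> s) <= c."""
    sup_p = support(dataset, p)
    if sup_p < k:
        return False
    return support(dataset, p | s) / sup_p <= c


def generalize(dataset, mapping: Dict[str, str]):
    """Set-based generalization: replace each mapped item by its generalized label."""
    return [frozenset(mapping.get(i, i) for i in t) for t in dataset]


# Hypothetical toy data (not the paper's Table 1); h and g act as sensitive items.
D = [frozenset("ach"), frozenset("bdh"), frozenset("feg"), frozenset("aeg")]
mapping = {i: "(a,b,f)" for i in "abf"}
mapping.update({i: "(c,d,e)" for i in "cde"})
D_gen = generalize(D, mapping)

before = is_protected(D, frozenset("ac"), frozenset("h"), k=3, c=0.6)
after = is_protected(
    D_gen, frozenset({"(a,b,f)", "(c,d,e)"}), frozenset("h"), k=3, c=0.6
)
```

On this toy data the rule is unprotected before generalization (the antecedent has support 1, below k = 3) and protected afterwards (support 4 and confidence 0.5).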

The RBAT Algorithm
RBAT [13] is a heuristic method for anonymizing transaction data. It is based on the PS-rule privacy model and uses set-based generalization in data sanitization.
The key steps of RBAT are given in Algorithm 1.
Algorithm 1. RBAT
Input: Original dataset D, a set of PS-rules Θ, the most generalized item ĩ, minimum support k and maximum confidence c.
...
8: end if
9: end while
10: return D̃
RBAT is iterative and works in a top-down fashion. Starting with all public items mapped to a single most generalized item ĩ and D generalized to D̃ according to ĩ (step 1), each iteration involves replacing a generalized item ĩ with two less generalized items ĩ_l and ĩ_r. RBAT does this greedily by using a two-step split phase (step 4). The first step finds a pair of items from ĩ incurring maximum UL when generalized together. The second step uses the pair as seeds to split ĩ into two disjoint subsets ĩ_l and ĩ_r.
To ensure that the anonymized data after replacing ĩ with ĩ_l and ĩ_r still offers the required privacy protection, each split is followed by an update phase (step 5) and a check phase (steps 6-8). The update step creates a temporary dataset D′ by copying D̃ and replacing ĩ with ĩ_l and ĩ_r. D′ is then checked to see if Θ is still protected. If it is, D′ becomes the new D̃ and ĩ_l, ĩ_r are queued for further splitting. The split-update-check cycle is repeated until the queue Q is empty (|Q| = 0), at which point D̃ is returned as the result. This top-down specialization process effectively constructs a binary Split Tree, with the root representing the most generalized item and the set of all leaf nodes forming a Split Cut representing the final generalization.
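The split-update-check cycle described above can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: the function names, the representation of a generalized item as a frozenset of original items, and the placeholder split and check functions used in the toy run are all assumptions made for the example.

```python
from collections import deque
from typing import FrozenSet, List

Group = FrozenSet[str]


def generalize(dataset, cut: List[Group]):
    """Replace each public item by the generalized item (group) of `cut` holding it."""
    lookup = {item: group for group in cut for item in group}
    return [frozenset(lookup.get(i, i) for i in t) for t in dataset]


def rbat_loop(dataset, public_items, split, is_protected) -> List[Group]:
    """Top-down specialization; `split` and `is_protected` are supplied by the caller."""
    cut = [frozenset(public_items)]      # start from the most generalized item
    queue = deque(cut)
    while queue:                         # repeat until |Q| = 0
        item = queue.popleft()
        if len(item) < 2:
            continue                     # a single item cannot be split further
        left, right = split(item)        # split phase
        candidate = [g for g in cut if g != item] + [left, right]
        if is_protected(generalize(dataset, candidate)):  # update and check phases
            cut = candidate              # accept the split and queue both halves
            queue.extend([left, right])
    return cut                           # the Split Cut


# Toy run with placeholder split and check functions (both hypothetical).
def halve(item):
    ordered = sorted(item)
    mid = len(ordered) // 2
    return frozenset(ordered[:mid]), frozenset(ordered[mid:])


def every_group_frequent(generalized, k=2):
    groups = {g for t in generalized for g in t if isinstance(g, frozenset)}
    return all(sum(1 for t in generalized if g in t) >= k for g in groups)


data = [frozenset("ah"), frozenset("bh"), frozenset("c"), frozenset("d")]
split_cut = rbat_loop(data, "abcd", halve, every_group_frequent)
```

In the toy run the first split into {a, b} and {c, d} is accepted, while the two further splits are rejected because each single item would be supported by only one transaction.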

MR-RBAT
In this section, we describe MR-RBAT. We assume that there are sufficient processing nodes to store all of the data across them and to run all map and reduce tasks in parallel. Algorithm 2 shows the overall structure of MR-RBAT, which mirrors RBAT (to retain its useful properties), but performs its key computations in parallel (to address scalability). Algorithm 2 is performed on a single processing node as a control, but the functions indicated by an MR subscript are performed in parallel using MapReduce. In the following sections, we explain the key steps of MR-RBAT.
Algorithm 2. MR-RBAT
Input: Original dataset D, a set of PS-rules Θ, the most generalized item ĩ, minimum support k and maximum confidence c.

Data Partitioning and Preparation
We partition D equally among M mappers using a horizontal file-based data partitioning strategy. That is, the first n transactions of D are assigned to the first mapper, the next n to the second mapper, and so on, where n = |D|/M. This strategy has been shown to be more efficient than other methods [9]. Note that our partitioning method is based on the number of transactions only and does not take into account mappers' memory usage. However, this can be trivially accounted for.
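A minimal Python sketch of this horizontal partitioning is given below. The function name and the use of a ceiling when |D| is not divisible by M are assumptions made for the example, since the text only states n = |D|/M.

```python
from typing import List, Sequence


def horizontal_partition(dataset: Sequence, num_mappers: int) -> List[list]:
    """Assign the first n transactions to mapper 0, the next n to mapper 1, and so on."""
    n = -(-len(dataset) // num_mappers)  # ceil(|D| / M), an assumption for uneven sizes
    return [list(dataset[m * n:(m + 1) * n]) for m in range(num_mappers)]


parts = horizontal_partition(list(range(10)), 4)
```

With 10 transactions and 4 mappers, the first three mappers receive 3 transactions each and the last one receives the remaining transaction.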
For efficiency, we prepare two datasets, P̂ and D̃, before anonymizing D. We first compute P̂ to contain the pairwise ULs of all public items in P (step 1) and then generalize D into D̃ according to the most generalized item ĩ. Both are performed using a single MapReduce round and are straightforward, so we will not discuss them further. The benefit of having these computed beforehand will become evident when we discuss the split and update functions later.

Split MR
This corresponds to the split phase of RBAT (step 4 of Algorithm 1) and is carried out in two steps. The first step uses a single MapReduce round with M mappers and a single reducer to find the pair of items which, when generalized together, incurs maximum UL (Algorithm 3). Each mapper reads a subset of the pairwise-UL dataset P̂ from the distributed file system (DFS), finds the pair with maximum UL locally, and sends it to the single reducer (steps 2-3), which finds the pair (i_x, i_y) with maximum UL globally (step 5).
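The first step can be simulated locally with the following Python sketch. It is not Hadoop code: the mapper and reducer are plain functions with hypothetical names, the partitions of pairwise ULs are small in-memory lists, and the UL values are made up for illustration.

```python
from typing import Iterable, List, Tuple

PairUL = Tuple[Tuple[str, str], float]  # ((item_x, item_y), pairwise UL)


def map_local_max(partition: List[PairUL]) -> List[PairUL]:
    """Mapper: emit the pair with maximum UL in its partition (nothing if empty)."""
    return [max(partition, key=lambda rec: rec[1])] if partition else []


def reduce_global_max(emitted: Iterable[PairUL]) -> PairUL:
    """Single reducer: pick the pair with maximum UL among all mapper outputs."""
    return max(emitted, key=lambda rec: rec[1])


# Hypothetical pairwise ULs, already split across three mappers.
partitions = [
    [(("a", "b"), 0.2), (("a", "c"), 0.7)],
    [(("b", "c"), 0.5)],
    [],
]
shuffled = [rec for part in partitions for rec in map_local_max(part)]
seeds, max_ul = reduce_global_max(shuffled)
```

Only one record per mapper is shuffled, which is why a single reducer suffices for this step.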
Algorithm 3.
1: Map(m, P̂_m)
2: P̂_m ← Load the m-th partition of P̂ from DFS
...
The second step uses (i_x, i_y) to split ĩ into two less generalized items ĩ_l and ĩ_r. RBAT does this by first assigning i_x and i_y to I_l and I_r, then considering each item i_q ∈ ĩ − {i_x, i_y} in turn and assigning it to either I_l or I_r based on UL(I_l ∪ i_q) and UL(I_r ∪ i_q). A direct parallelization of this heuristic requires |ĩ| − 2 MapReduce rounds, as the assignment of each item depends on the assignment of the items preceding it. In the worst case, when the most generalized item is split into single items, one per iteration, it requires a total of O(|P|^2) MapReduce rounds. This results in significant setup and data loading overhead.
Alternatively, one may split ĩ based on seeds only. That is, we decide whether an item i_q ∈ ĩ should be assigned to I_l or I_r based on UL(i_q, i_x) and UL(i_q, i_y). This would then require only a single MapReduce round to split ĩ. While this can cut the number of MapReduce rounds significantly, it may cause substantial data utility loss. Consider an extreme case where ĩ = {i_1, ..., i_|P|} is the most generalized item, i_1 and i_|P| are the seeds, σ_D(i_j) < k/4 for j < |P|, and k/2 < σ_D(i_|P|) < k. Assuming that a uniform weight of 1 is used in the UL calculation, it is easy to see that using this strategy all the items will be generalized with i_1, resulting in ĩ_l = (i_1, ..., i_(|P|−1)) and ĩ_r = (i_|P|). As σ_D(i_|P|) < k, ĩ cannot be split, and the data has to be generalized using the most generalized item ĩ, incurring a substantial utility loss.
Splitting ĩ by seeds and splitting ĩ by preceding items in fact represent two extreme strategies for parallelizing split: the latter has the potential to retain utility better, while the former incurs the least parallelization overhead. We propose a control that allows a specified number of MapReduce rounds to be run, thereby balancing efficiency and utility retention.

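Definition 5 (α-Split). Given a generalized item ĩ to be split, α-Split partitions ĩ into α disjoint buckets and splits ĩ in α iterations, one bucket per iteration. Items in each bucket are split based on seeds only, and the splits obtained from the previous iterations are used as the seeds in the current iteration.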
Algorithm 4 shows how α-Split works. ĩ is partitioned into α disjoint buckets (step 2), and α MapReduce rounds are used to split ĩ (step 3). Within each round, each mapper reads a copy of I_l, I_r, the bucket ĩ_h and a subset of D, computes the partial supports of I_l ∪ i_q and I_r ∪ i_q for each item i_q ∈ ĩ_h locally, and then shuffles the results to the reducers (steps 4-9). Each reducer aggregates the partial supports for i_q, assigns i_q to I_l or I_r based on their UL values, and emits the updated I_l and I_r as seeds for the next iteration (steps 11-16). Note that currently ĩ is partitioned randomly, i.e. the first (|ĩ|−2)/α items form the first bucket, the next (|ĩ|−2)/α items form the second bucket, and so on. Exploring how best to assign items to buckets is beyond the scope of this paper.
It is easy to see that α-Split is a generalization of RBAT Split: when α = |ĩ|, α-Split becomes RBAT Split. Any other setting of α represents a tradeoff between efficiency and potential utility loss. This gives us a control to balance performance and quality in anonymizing large transaction datasets, as we will show in Sect. 5.
We now analyse the overhead cost associated with Split MR. Let ĩ be the item to be split, s_m(M) and s_r(R) be the costs of setting up M mappers and R reducers, and ω be the average time that it takes to read a transaction (of average size in D) from the distributed file system. The overall map cost t_M of a single MapReduce round is given by Eq. (1). Assuming that each mapper has enough parallel connections to send data across the network to R reducers in parallel, the shuffle cost t_S of a single MapReduce round is given by Eq. (2), where ξ is a network efficiency constant.
Note that map output with the same key must be sent to the same reducer, so the number of reducers needed is determined by min(R, (|ĩ|−2)/α) in (2). The reduce cost t_R of a single MapReduce round is dominated by the cost of setting up the reducers and reading the shuffled data sent by the mappers, and is given by Eq. (3). The overall cost of Split MR using α iterations to split ĩ is therefore given by Eq. (4). Clearly, a large α, which a direct parallelization of RBAT would imply, can result in a significant overhead cost due to the setup and data loading requirements. We will show in Sect. 5 that it is possible to use a small α in split to control overhead while retaining good data utility.
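The bucketed split controlled by α can be sketched in Python as follows. The sketch runs sequentially and only simulates the rounds, with each bucket standing for one MapReduce round. The function names are hypothetical, and the ul function is a stand-in that assumes a uniform weight and a loss that grows with the size and support of the generalized item; it is not claimed to be the exact UL measure of Definition 4. Assigning each item to the side with the smaller resulting UL is likewise an assumption.

```python
from typing import FrozenSet, List, Tuple


def ul(itemset, dataset) -> float:
    """Hypothetical stand-in for UL (uniform weight): grows with size and support."""
    # A transaction supports a generalized item if it holds any of its members.
    sup = sum(1 for t in dataset if itemset & t)
    return (2 ** len(itemset) - 1) * sup / len(dataset)


def alpha_split(item: FrozenSet[str], seeds: Tuple[str, str],
                dataset: List[FrozenSet[str]], alpha: int):
    """Split `item` into two parts using at most `alpha` simulated rounds."""
    x, y = seeds
    left, right = {x}, {y}
    rest = sorted(item - {x, y})
    size = max(1, -(-len(rest) // alpha))        # ceil((|item| - 2) / alpha)
    buckets = [rest[i:i + size] for i in range(0, len(rest), size)]
    for bucket in buckets:                       # one simulated round per bucket
        # Seeds are fixed within a round; earlier rounds' splits act as seeds.
        l_seed, r_seed = frozenset(left), frozenset(right)
        for q in bucket:
            if ul(l_seed | {q}, dataset) <= ul(r_seed | {q}, dataset):
                left.add(q)
            else:
                right.add(q)
    return frozenset(left), frozenset(right)


# Hypothetical toy data: a and b co-occur, as do c and d.
data = [frozenset("ab"), frozenset("ab"), frozenset("cd"), frozenset("cd")]
one_round = alpha_split(frozenset("abcd"), ("a", "d"), data, alpha=1)
two_rounds = alpha_split(frozenset("abcd"), ("a", "d"), data, alpha=2)
```

With alpha=1 every remaining item is assigned against the original seeds only, whereas with alpha equal to the number of remaining items each item sees all earlier assignments. On this toy data both settings produce the same split.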

Check MR
Once ĩ is split and D̃ is updated (using a single MapReduce round), Θ must be checked to see if it is still protected. Parallelizing rule-checking while keeping overhead low is another issue to address. RBAT checks all PS-rules in sequence and stops if any rule is found unprotected. Implementing this directly in MapReduce could incur the cost of setting up O(|Θ|) rounds. We observe that when every rule in Θ is protected, it is more efficient to use a single MapReduce round: mappers check every rule locally and reducers check the overall protection. But when not every rule is protected, this is not efficient. For example, if the first rule in Θ is not protected, then no other rules need to be checked. However, the MapReduce architecture does not allow the nodes to communicate with each other until the whole round is finished, effectively requiring all rules to be checked. This increases the network cost of shuffling partial supports and incurs the extra, unnecessary computation of checking all the rules.
Again, we observe that checking all rules or only one rule in a single MapReduce round are two extremes of optimisation: checking one rule per round avoids checking unnecessary rules but can incur severe parallelization overhead, whereas checking all rules in a single round minimises parallelization overhead but can perform a large amount of unnecessary computation. To balance this, we propose another parameter γ to control the number of MapReduce rounds used to check rules.
Definition 6 (γ-Check). Given a set of PS-rules Θ, γ-Check, 1 ≤ γ ≤ |Θ|, checks Θ in γ iterations. Each iteration checks |Θ|/γ PS-rules in Θ.
Algorithm 5 shows how γ-Check works. It checks Θ in γ MapReduce rounds, each round checking |Θ|/γ rules. Each mapper checks every rule p → s ∈ Θ_j in a single round, by computing the partial supports of p̃ and p̃ ∪ s (steps 5-8, where φ(p) = p̃ generalizes p). The reducers aggregate the partial supports pertaining to each rule and check the protection conditions (steps 9-13). The algorithm does not undergo any subsequent rounds if any rule is found unprotected in the current round. Hence, γ-Check checks at most (|Θ|/γ − 1) more rules than RBAT does, but requires at most γ MapReduce rounds.
Algorithm 5. γ-Check
...
5: for each rule p → s ∈ Θ_j do
6: p̃ ← φ(p)
7: Emit(p → s, σ_{D_m}(p̃), σ_{D_m}(p̃ ∪ s))
8: end for
...
11: if ... σ_D(p̃) > c then
12: Emit(r, False)
13: end if
14: end for
The cost analysis of Check MR mirrors that of Split MR. That is, Eqs. (1)-(4) apply to Check MR if we replace α by γ and |ĩ|−2 by |Θ|, so no further analysis is given here. It is useful to note that γ-Check is a generalization of RBAT Check: when γ = |Θ|, one rule is checked per round and it becomes RBAT Check. However, unlike the α control, the γ value used in rule checking does not affect the quality of anonymization, only the performance.
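The batched rule checking with early stopping can be simulated with the following Python sketch. It is a local simulation rather than Hadoop code: the function names are hypothetical, the rules are assumed to be given with already generalized antecedents, and the protection test assumes the two conditions of support at least k and confidence at most c.

```python
from typing import FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str]]  # (antecedent p, consequent s)


def support(partition, itemset) -> int:
    """Partial support of `itemset` in one mapper's partition."""
    return sum(1 for t in partition if itemset <= t)


def gamma_check(rules: List[Rule], partitions, k: int, c: float, gamma: int):
    """Check the rules in at most `gamma` rounds; return (protected, rounds_used)."""
    size = -(-len(rules) // gamma)               # ceil(|rules| / gamma) rules per round
    rounds = 0
    for start in range(0, len(rules), size):
        batch = rules[start:start + size]
        rounds += 1
        # Map: every mapper emits partial supports for each rule in the batch.
        partial = [[(support(part, p), support(part, p | s)) for p, s in batch]
                   for part in partitions]
        # Reduce: aggregate partial supports and test the protection conditions.
        for j in range(len(batch)):
            sup_p = sum(row[j][0] for row in partial)
            sup_ps = sum(row[j][1] for row in partial)
            if sup_p < k or sup_ps / sup_p > c:
                return False, rounds             # stop: no further rounds are run
    return True, rounds


# Hypothetical toy data split across two mappers.
partitions = [[frozenset("ah"), frozenset("ah")], [frozenset("a"), frozenset("bh")]]
rules = [(frozenset("a"), frozenset("h")), (frozenset("b"), frozenset("h"))]

single_round = gamma_check(rules, partitions, k=2, c=0.7, gamma=1)
per_rule = gamma_check(rules, partitions, k=2, c=0.7, gamma=2)
early_stop = gamma_check(rules[::-1], partitions, k=2, c=0.7, gamma=2)
```

In the toy run the second rule is unprotected. Checking one rule per round stops after the first round when that rule happens to come first, while a single round always aggregates the supports of both rules.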

Experimental Evaluation
This section describes our experiments. The default settings of parameters are given in Sect. 5.1 and experimental results are analysed in Sect. 5.2.

Setup
All experiments were performed using Apache Hadoop, an open-source implementation of MapReduce. We conducted the experiments over a cloud of thirteen computing nodes, physically located within the same building and interconnected by a 100 Mbps Ethernet connection. One of the machines was allocated to run the MR-RBAT master program (Algorithm 2). All client machines were homogeneous in configuration, each with 2 GB of memory and 2 physical cores. Each machine was set to use one core only and was assigned to run a single mapper or reducer instance.
We used a real-world dataset BMS-POS [23] as D, containing several years of point-of-sale transaction records. We have |D| = 515597 and |I| = 1657, with the maximum and average transaction size being 164 and 6.5, respectively. The larger datasets were constructed by random selection of transactions from D, and we refer to these datasets as nX, where n ∈ [0.5, 16] is a blow-up factor. |S| was set to 0.1 × |I|. We used ARE (average relative error) [13] to quantify the utility of anonymized data. We used a workload of 1000 randomly generated queries, each consisting of a pair of public items and a single sensitive item. Default settings of parameters used in the experiments are given in Table 3.
This suggests the severity of the overhead caused by iteration using MapReduce: the gains from parallel processing have been almost entirely exhausted by the overhead. It is also interesting to observe the ARE results in these experiments. It is expected that setting a smaller α would help performance, but could potentially affect utility. However, Fig. 1(b) shows that almost identical utility can be achieved when α = 64. This confirms the feasibility and effectiveness of using α in the split phase of MR-RBAT to balance performance and utility retention.
We also evaluated the scalability of MR-RBAT by varying cluster size. Figure 2(a) and (b) show the runtime and speedup, respectively. The speedup is measured as the ratio of the runtime of a 1-node cluster to that of the cluster tested. MR-RBAT achieved a near linear speedup when a smaller cluster (relative to data size) was used. This is because MR-RBAT uses one reducer only in the first step of Split MR, and the others are used as mappers. Increasing the number of mappers causes more transactions to be fetched and processed by the single reducer, thereby degrading the speedup. Furthermore, we used a dataset of 4X in these experiments. With this data size, as the cluster size increased, the computation performed by each mapper became lighter, and the effect of overhead on runtime became more significant. So adopting a suitable ratio of data size to cluster size is important to achieving a good speedup.

Scalability and Performance
We then tested the effect of α on runtime and on data utility. As can be seen from Fig. 3(a), runtime scales largely linearly w.r.t. α. This suggests that dividing ĩ into different sizes of buckets had little impact on the outcome of split in each iteration. This is further confirmed by the ARE results in Fig. 3(b). When α is very small, many items of ĩ are put into one bucket and assigned to I_l or I_r based on seeds only. This caused items with high ULs to be generalized together. But once α is large enough, i.e. buckets are small enough, the α value only affected runtime, not ARE. This shows that the performance of splitting ĩ can be significantly improved by using a relatively small α without compromising data utility.
We also observed a relationship between α and split skewness during these experiments. Let S_α be the split tree constructed by MR-RBAT based on some α, and ĩ be a non-leaf node of S_α with ĩ_l and ĩ_r as its left and right children respectively. We denote the split skewness of S_α by ξ(S_α). It was observed that split skewness decreased as α was increased. This is because when α is small, a large bucket of items will be split in a single round based on the seeds only. If the data distribution is such that generalizing most of the items in the bucket with one of the seeds produces a larger UL than the other seed does, then most of the items will be generalized with one seed, resulting in a skewed split. Skewed splits are more likely to make Θ unprotected, resulting in an early stop in split and a higher ARE. This is confirmed by Fig. 3(b). Very small α values are therefore to be avoided.
Next, we varied domain size to see its effect on runtime. As shown in Fig. 4(a) and (b), MR-RBAT's runtime was more stable and grew much more slowly than RBAT's as the domain size increased. On the other hand, the difference in ARE between RBAT and MR-RBAT increased slightly as the domain size was increased. This is because MR-RBAT uses a fixed number of MapReduce rounds in split. Increasing domain size causes more items to be put in one bucket and generalized in one round. This contributed to an increased ARE.
Finally, we tested the effect of γ on runtime. Note that a γ setting will only affect runtime, not ARE, so only the runtime results are reported in Fig. 5. Observe that initially the runtime decreased when we increased γ (see Fig. 5(a)). This is because when γ is small, the inefficiency introduced by Check MR is mainly from the need to check extra, unnecessary rules. As γ increases, the number of unnecessary rules to check decreases, resulting in a better runtime. However, as we further increase γ, the reduction gained by checking fewer unnecessary rules diminishes, while the cost of setting up more iterations increases, leading to an increase in the overall runtime. Increasing the number of rules also caused the runtime of MR-RBAT to increase, but much more slowly than RBAT's, except for γ = 1, as shown in Fig. 5(b). When γ = 1, MR-RBAT checks every rule in Θ and is not efficient for the reason we gave above. All other settings of γ resulted in better runtime and scalability w.r.t. the number of rules to be enforced in anonymization.

Conclusions
In this paper, we have studied how RBAT may be made scalable using MapReduce. This is important to a range of transaction data anonymization solutions as most of them, like RBAT, are designed for a centralized setting and involve some iterative data distortion operations in data sanitization. We have shown that parallelizing RBAT in MapReduce by straightforward data partitioning can incur an overwhelmingly high parallelization cost due to the iterative operations to be performed. We proposed MR-RBAT, which employs two controls to limit the maximum number of MapReduce rounds to be used during data generalization, thereby reducing the overhead and computational cost. We have empirically studied the effect of different settings of these controls and have found that MR-RBAT can scale nearly linearly w.r.t. the size of data and can efficiently anonymize datasets consisting of millions of transactions, while retaining good data utility.