Link-Based Cluster Ensemble Method for Improved Meta-clustering Algorithm

Abstract: Ensemble clustering has become an active research area in intelligent information processing and machine learning. Although significant progress has been made in recent years, two challenging issues remain in current ensemble clustering research. First, most ensemble clustering algorithms tend to explore similarity at the object level but lack the ability to exploit information at the cluster level. Second, many ensemble clustering algorithms focus only on direct relationships, while ignoring indirect relationships between clusters. To address these two problems, a link-based meta-clustering algorithm (L-MCLA) is proposed in this paper. A series of experiments demonstrates that the proposed algorithm not only produces a better clustering effect but is also less influenced by the ensemble size.


Introduction
In the field of intelligent information processing and machine learning, clustering analysis is an important learning tool for unlabeled data. Generally speaking, clustering partitions a given dataset into clusters so that data objects within a cluster are highly similar, while data objects in different clusters differ markedly from each other [1]. Clustering has been used in various fields, such as image processing [2], cognitive computing [3], time series analysis [20] and medical diagnosis [17]. In the past few decades, a large number of clustering algorithms have been developed, among which the most representative are partitional clustering [18], hierarchical clustering [19], spectral clustering [4,5], density clustering [6,7], adaptive clustering [8,9] and semi-supervised clustering [1,21]. Nevertheless, current clustering algorithms still have some problems. For instance, the clustering result often depends heavily on parameters and initialization, so a single run is not robust enough. In order to solve these problems, ensemble clustering was proposed.
Unlike the traditional approach of using one algorithm to generate a single clustering result, ensemble clustering combines multiple different clustering results to generate a better one. Due to its effectiveness, ensemble clustering has attracted many researchers, who have proposed a variety of related algorithms. Despite significant advances in ensemble clustering research, most algorithms focus only on the direct connection between clusters, while ignoring the indirect connection. As shown in Fig 1(a), two objects that appear in the same cluster are directly connected. However, as in Fig 1(b) and (c), when two objects are in two different clusters we cannot conclude that there is no connection between them, because they may be related to each other indirectly. Such indirect connection information may affect the consensus result. In order to explore indirect connection information, we propose a link-based meta-clustering algorithm (L-MCLA) in this paper. The remainder of this paper is organized as follows. Section 2 reviews the background of this study. Section 3 details the proposed method. Section 4 reports the experimental results. Section 5 concludes the paper.

Ensemble Clustering
Ensemble clustering is an algorithm that improves the clustering effect by combining multiple base clusterings. It can be expressed as follows. Let $X = \{x_1, x_2, \ldots, x_n\}$ denote a dataset with $n$ objects. We use clustering algorithms to obtain $m$ clustering results $P = \{p_1, p_2, \ldots, p_m\}$ and call them base clusterings. Each base clustering contains several clusters, written as $p_i = \{C_1^i, C_2^i, \ldots, C_{n_i}^i\}$, where $n_i$ is the number of clusters in base clustering $p_i$. Ensemble clustering merges the set $P$ through a consensus function $T$ to obtain the final clustering result:

$$P^{*} = T(P) \qquad (1)$$

The specific process of ensemble clustering is shown in Fig 2.
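To make the notation concrete, the following minimal sketch shows one possible way to represent an ensemble in code. Python with NumPy is assumed, and the representation and the helper name clusters_as_sets are our own illustration, not part of the original formulation.

import numpy as np

# A base clustering is stored as a label vector: labels[i] is the cluster id
# assigned to object x_i. The ensemble P is a list of such vectors.
def clusters_as_sets(labels):
    """Split one base clustering p_i into its clusters C_1^i, ..., C_{n_i}^i,
    each represented as a set of object indices."""
    labels = np.asarray(labels)
    return [set(np.flatnonzero(labels == c)) for c in np.unique(labels)]

# Two toy base clusterings of n = 6 objects (m = 2).
P = [np.array([0, 0, 1, 1, 2, 2]),
     np.array([0, 0, 0, 1, 1, 1])]
all_clusters = [C for p in P for C in clusters_as_sets(p)]  # pool of all clusters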

Meta-Clustering Algorithm
The meta-clustering algorithm (MCLA) was proposed by Strehl and Ghosh and is an ensemble clustering algorithm that works at the cluster level. The Jaccard coefficient is used to calculate the similarity between clusters. The Jaccard coefficient between clusters $C_i$ and $C_j$ is calculated as:

$$J(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|} \qquad (2)$$

where $\cap$ denotes the intersection of two sets, $\cup$ denotes the union of two sets, and $|\cdot|$ denotes the number of objects in a set. Specifically, the meta-clustering algorithm consists of the following four steps:
1) Construct a similarity matrix by calculating the Jaccard coefficient between all clusters contained in the base clusterings (a code sketch of steps 1 and 2 follows this list).
2) Regard the similarity matrix of the previous step as a weighted undirected graph, called the meta-graph.
3) Use the graph partitioning package METIS [16] to divide the meta-graph into meta-clusters, each of which contains several clusters.
4) Assign each object to the corresponding meta-cluster to obtain the final clustering result.
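As an illustration of steps 1 and 2, the sketch below builds the Jaccard similarity matrix of equation (2) over the pooled clusters; treating this matrix as a weighted adjacency matrix yields the meta-graph. Step 3 (the METIS call) is not reproduced here, and the set-based cluster representation follows the sketch in section 2.

import numpy as np

def jaccard_matrix(clusters):
    """Equation (2): pairwise Jaccard coefficients between all clusters.
    `clusters` is a list of sets of object indices; the returned matrix Z
    is the weighted adjacency matrix of the meta-graph (steps 1 and 2)."""
    q = len(clusters)
    Z = np.zeros((q, q))
    for i in range(q):
        for j in range(i + 1, q):
            union = len(clusters[i] | clusters[j])
            if union > 0:
                Z[i, j] = Z[j, i] = len(clusters[i] & clusters[j]) / union
    np.fill_diagonal(Z, 1.0)  # a cluster is fully similar to itself
    return Z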

Construct similarity matrix
The meta-clustering algorithm performs well, but it still has a shortcoming: the similarity matrix constructed from the Jaccard coefficient can only reflect the direct relationship between clusters and lacks the capability to find indirect relationships. In 2011, the concept of the weighted connected triple (WCT) was proposed by Iam-on et al. [12], which makes it possible to explore the hidden indirect relationships between clusters.
In this section, connected triples are used to construct a refined cluster similarity matrix. A connected triple is shown in Fig 3.
Let $C_k$ be a cluster that has nonzero similarity with both $C_i$ and $C_j$. The weighted connected triple between $C_i$ and $C_j$ through $C_k$ is defined as:

$$WCT_{ij}^{k} = \min\bigl(Z(C_i, C_k),\, Z(C_j, C_k)\bigr) \qquad (3)$$

The indirect connection between clusters $C_i$ and $C_j$ is then accumulated over all such shared neighbours:

$$WCT_{ij} = \sum_{k \neq i, j} WCT_{ij}^{k} \qquad (4)$$

For any two clusters $C_i$ and $C_j$, their indirect similarity is defined as:

$$\mathrm{Sim}^{WCT}(C_i, C_j) = \frac{WCT_{ij}}{WCT_{\max}} \times DC \qquad (5)$$

where $WCT_{\max}$ is the maximum $WCT$ value over all cluster pairs and $DC$ is a constant decay factor, that is, the confidence level of accepting two non-identical clusters as being similar. The refined similarity matrix $S$ is constructed by keeping the direct similarity where it exists and falling back to the indirect similarity otherwise:

$$S_{ij} = \begin{cases} Z_{ij}, & Z_{ij} > 0 \\ \mathrm{Sim}^{WCT}(C_i, C_j), & \text{otherwise} \end{cases} \qquad (6)$$
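The refinement of equations (3)-(6) can be sketched as follows, assuming the Jaccard matrix Z from the previous section. The combination rule in the final step follows our reconstruction of equation (6) and should be read as a sketch rather than the definitive implementation.

import numpy as np

def wct_refine(Z, DC=0.9):
    """Refine the direct similarity matrix Z with weighted connected triples.
    Equations (3)-(4): WCT_ij sums min(Z_ik, Z_jk) over every third cluster k.
    Equation (5): normalise by the global maximum and scale by the decay DC.
    Equation (6): keep direct similarity where it exists, else the WCT value."""
    q = Z.shape[0]
    WCT = np.zeros((q, q))
    for i in range(q):
        for j in range(i + 1, q):
            mask = np.ones(q, dtype=bool)
            mask[[i, j]] = False                  # k ranges over third clusters only
            WCT[i, j] = WCT[j, i] = np.minimum(Z[i, mask], Z[j, mask]).sum()
    wct_max = WCT.max()
    sim_wct = DC * WCT / wct_max if wct_max > 0 else WCT   # equation (5)
    S = np.where(Z > 0, Z, sim_wct)                        # equation (6)
    np.fill_diagonal(S, 1.0)
    return S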

Graph division and object allocation
We regard the refined similarity matrix $S$ as the adjacency matrix of a graph $G$. A graph partitioning algorithm serves as our consensus function. Since the normalized cut (Ncut) is effective and robust, we select it in this study [13]. Normalized cut is a kind of spectral clustering; its basic idea is to define a cut criterion that considers both the total dissimilarity between different clusters and the total similarity within clusters. By normalized cut, $K$ meta-clusters are obtained:

$$\{MC_1, MC_2, \ldots, MC_K\} = \mathrm{Ncut}(G) \qquad (7)$$

We then use a voting method to assign objects. A given object $x_i$ may belong to zero or more of the clusters in a meta-cluster $MC_j$. Specifically, the voting score of $x_i$ for the meta-cluster $MC_j$ is defined as:

$$\mathrm{Score}(x_i, MC_j) = \frac{1}{|MC_j|} \sum_{C \in MC_j} \mathbb{1}(x_i \in C) \qquad (8)$$

where $|MC_j|$ denotes the number of clusters in $MC_j$ and $\mathbb{1}(\cdot)$ is the indicator function. We assign each object $x_i$ to the meta-cluster with the highest score, which yields the final clustering result.
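A minimal sketch of these two steps follows. The paper partitions the meta-graph with Ncut; scikit-learn's SpectralClustering on a precomputed affinity matrix is used below as a closely related stand-in rather than the exact Ncut implementation of [13]. The voting rule implements equation (8) as reconstructed above, and both function names are ours.

import numpy as np
from sklearn.cluster import SpectralClustering

def partition_metagraph(S, K, seed=0):
    """Cut the refined similarity graph into K meta-clusters (equation (7)).
    SpectralClustering on the precomputed affinity S stands in for Ncut."""
    sc = SpectralClustering(n_clusters=K, affinity="precomputed", random_state=seed)
    return sc.fit_predict(S)          # meta_labels[c] = meta-cluster of cluster c

def assign_objects(clusters, meta_labels, n_objects, K):
    """Equation (8): each object votes for the meta-cluster whose member
    clusters contain it most often, averaged over the meta-cluster size."""
    votes = np.zeros((n_objects, K))
    for c, cluster in enumerate(clusters):
        for x in cluster:
            votes[x, meta_labels[c]] += 1.0
    sizes = np.bincount(meta_labels, minlength=K)   # |MC_j| for each meta-cluster
    votes /= np.maximum(sizes, 1)                   # average association score
    return votes.argmax(axis=1)                     # highest-scoring meta-cluster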
For clarity, the algorithm of L-MCLA is described in Algorithm 1; an end-to-end code sketch follows it.

Algorithm 1: Link-Based Meta-Clustering Algorithm
Input: dataset X, number of clusters K
1) Use a clustering algorithm to generate m base clusterings P = {p_1, p_2, ..., p_m}.
2) Construct the inter-cluster similarity matrix Z from the Jaccard coefficients, calculated by equation (2).
3) Apply equations (3)-(6) to the similarity matrix Z to obtain the refined similarity matrix S.
4) Regard the similarity matrix S as a graph G and obtain K meta-clusters by segmenting this graph with the Ncut algorithm as in equation (7).
5) Obtain the clustering result Label by allocating each object to its corresponding meta-cluster by equation (8).
Output: final clustering result Label
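Composing the helper functions sketched in the previous sections (clusters_as_sets, jaccard_matrix, wct_refine, partition_metagraph, assign_objects; all names are our own illustration), the whole of Algorithm 1 can be sketched as:

def l_mcla(P, K, DC=0.9):
    """Sketch of Algorithm 1: base clusterings P -> consensus labels."""
    n_objects = len(P[0])
    clusters = [C for p in P for C in clusters_as_sets(p)]      # step 1
    Z = jaccard_matrix(clusters)                                # step 2, equation (2)
    S = wct_refine(Z, DC=DC)                                    # step 3, equations (3)-(6)
    meta_labels = partition_metagraph(S, K)                     # step 4, equation (7)
    return assign_objects(clusters, meta_labels, n_objects, K)  # step 5, equation (8)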

Experiments
In this section, we conduct experiments on multiple real-world datasets and compare the results with several existing ensemble clustering algorithms to evaluate the performance of the proposed algorithm. Moreover, the robustness of the algorithm is demonstrated by experiments with different ensemble sizes.

Datasets and evaluation measures
In our experiments, nine datasets from the UCI (University of California, Irvine) machine learning repository are used as experimental datasets [22]. Table 1 lists the details of each dataset. The adjusted Rand index (ARI) and normalized mutual information (NMI) are selected to evaluate the clustering results. The two measures are described as follows. ARI measures the similarity between two clustering results by counting the pairs of sample points that fall in the same cluster and in different clusters:

$$ARI = \frac{2(ad - bc)}{(a+b)(b+d) + (a+c)(c+d)} \qquad (9)$$

where $a$ denotes the number of point pairs that belong to the same cluster in both the real label and the experimental result, $b$ denotes the number of point pairs that belong to the same cluster in the real label but to different clusters in the experimental result, $c$ denotes the number of point pairs that belong to the same cluster in the experimental result but to different clusters in the real label, and $d$ denotes the number of point pairs that belong to different clusters in both. Its value ranges over $[-1, 1]$; the larger the value, the more consistent the result is with the real labels, namely the better the clustering effect.
NMI is a common external evaluation index for clustering that evaluates the similarity of two clustering results from the perspective of information theory. Let the experimental result be X and the real label be Y; then:

$$NMI(X, Y) = \frac{I(X, Y)}{\sqrt{H(X)\,H(Y)}} \qquad (10)$$

where $I(X, Y)$ represents the mutual information between X and Y, and $H(X)$ and $H(Y)$ represent the entropies of X and Y. Its value ranges over $[0, 1]$; a larger value indicates more shared information with the real label, that is, a better clustering result.
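Both measures are available off the shelf in scikit-learn. The short sketch below uses hypothetical labels for illustration; average_method="geometric" selects the geometric-mean normalization matching equation (10).

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]   # hypothetical real labels
y_pred = [0, 0, 1, 2, 2, 2]   # hypothetical experimental result
print(adjusted_rand_score(y_true, y_pred))               # ARI, in [-1, 1]
print(normalized_mutual_info_score(y_true, y_pred,
                                   average_method="geometric"))  # NMI, in [0, 1]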
In our experiments, k-means is used to generate the base clusterings, with the parameter k randomly selected in the range $[2, \sqrt{N}]$, where N is the number of objects. For the parameter DC, high values (i.e., 0.7 to 0.9) bring about data partitions of exceptionally good quality [12], so we set DC = 0.9 in our experiments. We refer to the number of base clusterings m as the ensemble size and set m = 50 to compare the L-MCLA algorithm with other ensemble clustering algorithms. Furthermore, we vary the ensemble size to test the robustness of the L-MCLA algorithm.
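A minimal sketch of this generation procedure, assuming scikit-learn's KMeans; drawing k uniformly from [2, sqrt(N)] follows our reading of the setup, and the function name is our own.

import numpy as np
from sklearn.cluster import KMeans

def generate_base_clusterings(X, m=50, seed=None):
    """Generate m k-means base clusterings with k drawn from [2, sqrt(N)]."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    k_max = max(2, int(np.sqrt(N)))
    P = []
    for _ in range(m):
        k = int(rng.integers(2, k_max + 1))        # random k in [2, sqrt(N)]
        km = KMeans(n_clusters=k, n_init=10,
                    random_state=int(rng.integers(2**31 - 1)))
        P.append(km.fit_predict(X))
    return P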

Comparison with other ensemble clustering methods
This section presents a comparison experiment. Each ensemble clustering algorithm is run 20 times on each dataset, and each run randomly generates base clusterings according to section 4.2. The average scores and standard deviations of ARI and NMI are recorded. The experimental results are shown in Table 2 and Table 3, with the highest score shown in bold. As shown in Table 2, the ARI scores of the L-MCLA algorithm are the highest on all nine datasets. As can be seen from Table 3, the L-MCLA algorithm achieves the highest NMI value on six datasets and is only slightly inferior on CTG, Ionosphere and Segmentation, where the differences are not significant. To summarize, the L-MCLA method exhibits overall better performance (with respect to ARI and NMI) than the other methods.

Robustness to ensemble size
In this section, we evaluate the performance of the L-MCLA algorithm under different ensemble sizes on the nine datasets. The ensemble size ranges over [10, 100], increasing in steps of 10. The generation settings for the base clusterings are the same as in section 4.2. We record the average scores of ARI and NMI; the change of the scores is shown in Fig 4 and Fig 5. Fig 5 shows the NMI values of the L-MCLA algorithm on the nine datasets under different ensemble sizes: the NMI values of most datasets tend to be stable, except for the Thyroid dataset, whose NMI value shows a clear upward trend. According to this experimental analysis, the ensemble size has little influence on the L-MCLA algorithm. On most datasets, the L-MCLA algorithm relies on few base clusterings to obtain robust results.

Conclusion
Ensemble clustering uses multiple clustering results to generate a better one. However, existing ensemble clustering algorithms often pay attention only to the direct inter-cluster connection and ignore the indirect connection. In this paper, we propose a link-based meta-clustering algorithm that uses connected triples to explore indirect connections; the link-based method enriches the similarity matrix and thereby generates better results. Our algorithm has the following advantages: 1. It considers information at both the cluster level and the object level. 2. It uses the link-based method to explore the indirect connections between clusters. A series of experiments demonstrates these advantages. In future work, we will further explore the hidden information in the base clusterings to improve the clustering results.

Fig 1. Relationship between two points: (a) in the same cluster; (b) belonging to two clusters with a common part; (c) belonging to two unrelated clusters, both of which are related to a third cluster.

Fig 2. The specific process of ensemble clustering: m base clusterings of the dataset X are merged by the consensus function T into the final result P*.

Fig 3. Connected-triple diagram. $P_1$, $P_2$ and $P_3$ are three base clusterings. $C_1^1$ and $C_2^1$ are unrelated clusters (i.e., they have no common part) and in the ordinary sense should have no similarity. $C_1^1$ and $C_1^2$ have a common point $x_1$, and $C_2^1$ and $C_1^2$ have a common point $x_2$; therefore $C_1^1$ is similar to $C_1^2$, and $C_2^1$ is similar to $C_1^2$. Because $C_1^1$ and $C_2^1$ are both similar to the third cluster $C_1^2$, they are indirectly connected to each other. In the same way, $C_1^1$ and $C_1^3$ have the common point $x_1$ while $C_2^1$ and $C_1^3$ have the common point $x_2$, so $C_1^1$ and $C_2^1$ are again indirectly connected. It can be seen that connected triples help find more connections between clusters, which is beneficial for reaching the consensus result later.



Fig 4. Average ARI scores of L-MCLA under different ensemble sizes.

Fig 5. Average NMI scores of L-MCLA under different ensemble sizes.


Table 2. Average ARI scores by different ensemble clustering methods. The highest score in each comparison is in bold.

Table 3. Average NMI scores by different ensemble clustering methods. The highest score in each comparison is in bold.