Improved Hierarchical K-means Clustering Algorithm without Iteration Based on Distance Measurement

Abstract: Hierarchical K-means (HK) has seen rapid development and wide application in recent years because it combines the high accuracy of hierarchical algorithms with the fast convergence of K-means. The traditional HK clustering algorithm first determines the initial cluster centers and the number of clusters with an agglomerative algorithm, but the agglomerative algorithm merges only the two data objects of minimum distance in the dataset at every step. Hence, its time complexity is not acceptable for analyzing huge datasets. In view of this problem of traditional HK, this paper proposes a new clustering algorithm, iHK. Its basic idea is that at each layer the $N$ input data objects are grouped into $\lceil N/2 \rceil$ clusters by running K-means, and the mean vector of each cluster is used as the input of the next layer. The iHK algorithm is tested on many different types of datasets and excellent experimental results are obtained.


Introduction
The traditional HK algorithm has the advantages of simplicity and easy convergence, but it also has some obvious deficiencies. For instance, HK has high computational complexity when the value of k is uncertain. The agglomerative algorithm merges only the two clusters with minimum distance at each step, which leads to high time complexity on high-dimensional and large datasets. To overcome the shortcomings of traditional HK, some researchers have improved the HK algorithm to different degrees [1][2][3][4]. However, most researchers have modified the HK algorithm for their own specific research fields [5][6][7][8][9][10][11][12][13].
Society is rapidly moving from the information era to the age of data, and it is important to accurately grasp the valuable information in a dataset. Clustering analysis has therefore become a hot research field. To analyze the information carried by a dataset precisely and quickly, this paper presents a new clustering algorithm, iHK, an easily convergent and quite accurate clustering algorithm that integrates features of K-means and hierarchical algorithms. Moreover, the iHK algorithm is not limited to a particular research field. In essence, iHK is an improved HK.
Section 1 gives a summary of the iHK algorithm. Section 2 briefly introduces some improved HK algorithms of recent years. Section 3 details the iHK algorithm proposed in this paper. Section 4 presents the experimental results, and Section 5 concludes.

Related work
The training set is divided into two parts: a hierarchical algorithm is run on one part to obtain the distribution information of the data, and K-means is then run on the other part. This hybrid algorithm was first put forward by Bernard Chen et al. [14] in 2005. Due to its accuracy, simplicity and convergence, the method attracted wide attention after it was proposed. He Ying et al. [15] presented an HK based on PCA; the general idea is that, on the whole dataset (rather than on two parts), it first uses PCA to reduce the dimension of the dataset, then determines the initial cluster centers with an agglomerative algorithm, and finally obtains the clustering result with K-means. The improved HK based on PCA is more accurate than traditional HK. To overcome the limitation of the binary tree constructed by hierarchical algorithms, a divisive clustering algorithm was proposed by Lamrous S, Taileb M et al. [16]. The algorithm generates a non-binary tree in which each node can split into more than two branches by employing K-means, where the k value of K-means is determined by the Silhouette index. Kohei Arai et al. [2] studied an integrated HK algorithm for clustering high-dimensional datasets. An improved HK algorithm was put forward by Yongxin Liu et al. [5] to solve the problem of clustering large, high-dimensional document collections. Li Zhang et al. [17] combined a divisive algorithm with an agglomerative algorithm to address the irreversibility of HK: the divisive algorithm obtains several clusters by executing K-means at each layer, and the agglomerative algorithm then merges the clusters. Bernard Chen et al. [18] added fuzzy theory to traditional HK to boost the precision of finding themes in protein sequences and to reduce the time complexity of the HK algorithm.
From the above related work, several ideas for significantly improving efficiency and accuracy can be drawn and applied to iHK. For example, the algorithm no longer relies on the binary-tree rule of agglomerative clustering, and the similarity measure between clusters adopts the mean distance rather than the minimum distance, so that noisy data do not seriously affect the precision of the algorithm. In the iHK algorithm, the number of clusters produced by K-means always varies with the layer (i.e., with the data size), so the cluster centers can represent the distribution of the data as closely as possible. A minimal sketch of the two inter-cluster measures mentioned here is given below.
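For illustration only, the following sketch contrasts the minimum-distance (single-link) and mean-distance measures between two clusters; the function names and the use of NumPy are our own assumptions and are not taken from the original paper.

```python
import numpy as np

def _pairwise_dists(cluster_a, cluster_b):
    """All pairwise Euclidean distances between members of two clusters."""
    return [np.linalg.norm(np.asarray(a) - np.asarray(b))
            for a in cluster_a for b in cluster_b]

def min_distance(cluster_a, cluster_b):
    """Minimum distance: a single noisy point near the other cluster can dominate it."""
    return min(_pairwise_dists(cluster_a, cluster_b))

def mean_distance(cluster_a, cluster_b):
    """Mean distance: the average of all pairwise distances, less sensitive to noise."""
    dists = _pairwise_dists(cluster_a, cluster_b)
    return sum(dists) / len(dists)
```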

Normalization of data
This paper gives a new algorithm, iHK, on the basis of studying a large number of improved HK clustering algorithms. iHK first standardizes the numeric attributes by formula (1); the data objects obtained in this step are then used as the clustering data.
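The exact form of formula (1) is not reproduced here. As a hedged sketch, assuming it is the usual min-max normalization of each numeric attribute into [0, 1], it could be implemented as follows (the function name is illustrative):

```python
import numpy as np

def normalize(data):
    """Min-max normalize each numeric attribute (column) of data into [0, 1].
    data: 2-D array of shape (n_objects, n_attributes)."""
    data = np.asarray(data, dtype=float)
    mins = data.min(axis=0)
    ranges = data.max(axis=0) - mins
    ranges[ranges == 0] = 1.0          # avoid division by zero for constant attributes
    return (data - mins) / ranges
```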
The iHK algorithm greatly improves execution efficiency by borrowing the idea of 2-way merging, because each pass of the algorithm halves the number of clusters. Unlike traditional HK, it uses K-means to cluster the data at each layer rather than simply merging the two clusters with minimum distance. The traditional hierarchical clustering algorithm and the iHK algorithm are briefly introduced next.

Traditional Hierarchical Algorithm
Assume that the dataset is $D$ and the total number of data objects is $N$. At first every data object is treated as a single cluster, so there are $N$ clusters. The similarity between two clusters $C_i$ and $C_j$ is measured by the minimum distance between their members,
$$d(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y). \qquad (2)$$
The number of clusters is reduced by exactly one at a time by merging the two clusters that minimize (2). This process is repeated until a given threshold is met or all data objects fall into one cluster. The time complexity of the hierarchical algorithm is at least $O(N^2)$, so it is not suitable for processing huge datasets.
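A minimal sketch of the agglomerative procedure described above, merging the two closest clusters until a target number of clusters remains; this naive implementation makes the quadratic-or-worse cost easy to see (names and structure are illustrative assumptions):

```python
import numpy as np

def agglomerate(points, target_clusters):
    """Naive agglomerative clustering: start with one cluster per point and
    repeatedly merge the pair of clusters with minimum distance, as in (2)."""
    clusters = [[np.asarray(p, dtype=float)] for p in points]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(np.linalg.norm(a - b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # merge the two closest clusters
    return clusters
```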

iHK Clustering Algorithm
iHK overcomes the limitation of merging only two clusters at a time in the hierarchical algorithm, so that hierarchical clustering can be better applied to big data. In addition, iHK no longer merges clusters by minimum distance at each layer; instead it uses a K-means step based on a mean-distance measurement. Suppose that the $h$-th layer has $L$ data objects, stored as a sequence. iHK selects the initial cluster centers by (3), taking every second object in storage order:
$$c_i = x_{2i-1}, \quad i = 1, 2, \dots, \lceil L/2 \rceil. \qquad (3)$$
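A minimal sketch of this center selection: with the data stored as a sequence, every second object is taken as an initial center, so a layer with L objects yields roughly L/2 centers. This is a hedged reading of the paper's description, and the function name is illustrative.

```python
def select_centers(layer_data):
    """Split a layer into initial centers and remaining objects with step length 2,
    following the storage order of the data, as in formula (3)."""
    centers = layer_data[::2]      # objects at positions 0, 2, 4, ... (about ceil(L/2) centers)
    remaining = layer_data[1::2]   # the other objects are later assigned to the nearest center
    return centers, remaining
```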
The remaining data objects are assigned to the nearest cluster center by comparing their distances to the centers. However, the attributes may be mixed, so a simple numeric distance cannot serve as the criterion. A new measure is therefore defined by (4): for a data object $x = (x_1, x_2, \dots, x_n)$ with $n$ attributes and a cluster center $y = (y_1, y_2, \dots, y_n)$,
$$d(x, y) = \sum_{i=1}^{n} \delta(x_i, y_i), \qquad (4)$$
where, if attribute $i$ is numeric, $\delta(x_i, y_i) = |x_i - y_i|$ (4.2); if attribute $i$ is discrete, $\delta(x_i, y_i) = 0$ when the two values are equal and $1$ otherwise.
Each cluster $C_k$ is represented by its mean vector, computed by (5). For a numeric attribute $i$,
$$\bar{x}_i = \frac{1}{num(C_k)} \sum_{x \in C_k} x_i, \qquad (5)$$
where $num(C_k)$ counts the number of data objects in cluster $C_k$. If attribute $i$ is discrete, $count(x_{ij})$ counts how many objects take the value $x_{ij}$, and $\bar{x}_i$ is set to the attribute value that appears most often in the cluster.
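A hedged sketch of the mixed-attribute distance (4) and the mean vector (5): numeric attributes contribute their absolute difference, discrete attributes contribute 0 when the values match and 1 otherwise, and the cluster representative takes the attribute mean for numeric attributes and the most frequent value for discrete ones. The exact numeric term and the function names are assumptions.

```python
from collections import Counter

def mixed_distance(x, y, is_numeric):
    """Distance (4) between object x and center y over n mixed attributes."""
    total = 0.0
    for xi, yi, numeric in zip(x, y, is_numeric):
        if numeric:
            total += abs(xi - yi)                 # numeric attribute: absolute difference
        else:
            total += 0.0 if xi == yi else 1.0     # discrete attribute: 0 if equal, else 1
    return total

def mean_vector(cluster, is_numeric):
    """Mean vector (5) of a cluster: column mean for numeric attributes,
    most frequent value (mode) for discrete attributes."""
    rep = []
    for i, numeric in enumerate(is_numeric):
        column = [obj[i] for obj in cluster]
        if numeric:
            rep.append(sum(column) / len(column))
        else:
            rep.append(Counter(column).most_common(1)[0][0])
    return rep
```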
$\lceil L/2 \rceil$ clusters are produced after the above process has been executed; the mean vectors obtained at the $h$-th layer are then used as the new input of the $(h+1)$-th layer.
The iHK algorithm is executed until a given condition or threshold is satisfied. In total, iHK performs the K-means step $\log_2 N$ times, and in each layer the K-means step only needs to compare half of that layer's input data against the centers. The overall description of the iHK algorithm is shown in Figure 1.
Input: dataset $D$, the total number of data objects $n$.
Output: data objects divided into different clusters.
Step1: standardizing the numeric attributes by (1).
Step2: selecting the cluster centers with step length 2, according to the order of data storage.
Step3: computing the distance by (4.2) if the attribute is numeric.
Step4: comparing whether the discrete attribute values are the same; the distance is 0 if and only if the values are equal, and 1 otherwise.
Step5: performing K-means on the basis of Steps 3-4.
Step6: treating the results of the K-means clustering as the new dataset.
Step7: computing the mean vector of each cluster by (5) to form the new dataset.
Step8: repeating Steps 2-7 until the given conditions are met.
In Figure 1, Step1 standardizes the numeric data so that attributes with different measurement units can be compared correctly. The distance calculation between data objects and cluster centers is completed by Steps 3-4. The mean vectors generated by (5) represent the clusters and are used as the input of the next iteration, as sketched below.
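Putting the steps of Figure 1 together, the following is a hedged end-to-end sketch of one reading of iHK: each layer selects every second object as a center, assigns the remaining objects to their nearest center by the mixed distance, replaces each cluster by its mean vector, and repeats until the desired number of clusters is reached. It reuses the illustrative helpers sketched above (select_centers, mixed_distance, mean_vector), which are our assumptions rather than the authors' code, and it assumes Step1 normalization has already been applied to the numeric attributes.

```python
def ihk(data, is_numeric, target_clusters):
    """One reading of iHK: repeatedly halve the dataset with a single K-means-style
    assignment per layer until at most target_clusters representatives remain."""
    layer = list(data)
    while len(layer) > target_clusters:
        centers, remaining = select_centers(layer)            # Step 2
        clusters = [[c] for c in centers]
        for obj in remaining:                                  # Steps 3-5: assign to nearest center
            dists = [mixed_distance(obj, c, is_numeric) for c in centers]
            clusters[dists.index(min(dists))].append(obj)
        layer = [mean_vector(c, is_numeric) for c in clusters] # Steps 6-7: new dataset
    return layer
```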
The pseudocode of the algorithm is given in Figure 2.

Experimental results
This paper tests the performance of iHK on several typical datasets of diverse types. In the iHK algorithm, K-means with the new distance metric is executed at each layer, and the value of k differs from layer to layer: it is half the size of that layer's input data. How the accuracy rate changes with different k values is shown in Figure 3. The accuracy of the iHK algorithm on five large datasets is shown at the top of Figure 3, and its accuracy on relatively small datasets is displayed at the bottom.
Figure 2 (pseudocode of iHK): input dataset $D$ and the total number of data objects $n$; repeat running K-means and forming the new dataset from the mean vector of each cluster until the given conditions are met.
The overall trend of all the line charts is that accuracy first rises gradually, reaches a maximum at some value of k, and then decreases. Moreover, on some datasets the accuracy varies markedly with the value of k, as Figure 3 shows. To demonstrate the superiority of the iHK clustering algorithm, this paper compares iHK with basic K-means and traditional HK on several indicators of algorithm performance. The accuracy rate is frequently used as an important indicator; the accuracy rates of the three algorithms on several datasets are shown in Table 1.
The experimental results in Table 1 show that the iHK clustering algorithm has higher accuracy than basic K-means and the HK algorithm on most datasets. Most data now come from the Web, and Web data are mainly big data. When such data are clustered, the time complexity of the algorithm is also considered a significant indicator of performance. Table 2 compares the time complexity of the HK algorithm, basic K-means and the iHK algorithm.
The time complexity of the HK algorithm is the sum of the time complexities of the hierarchical algorithm and of K-means, so it is greater than either of them alone. The time complexity of basic K-means is linear in the size of the data, where $m$ denotes the number of iterations. iHK is similar to K-means: its complexity is also linear, where $N$ is the total number of data objects. By comparison, the conclusion can be drawn that the time complexity of iHK is the lowest, basic K-means follows, and the HK algorithm's is the highest.
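As a rough worked check of the linearity claim, under the paper's description that each layer halves the data, the total number of data objects fed to the K-means steps across all $\log_2 N$ layers is bounded by a geometric series:
$$N + \frac{N}{2} + \frac{N}{4} + \dots + 1 \;=\; \sum_{h=0}^{\log_2 N} \frac{N}{2^h} \;<\; 2N,$$
which is $O(N)$ objects processed in total (this bound counts objects only, not the per-layer distance comparisons themselves).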
Figure 4 shows the running time of the three algorithms on the experimental datasets. The time complexity of the HK algorithm is significantly higher than that of iHK on most datasets, while iHK and basic K-means are roughly equal on all datasets, as can be clearly observed from Figure 4.

Conclusion
The HK algorithm has been widely used due to its advantages in clustering analysis. But HK also has shortcomings, such as high time complexity, so it cannot be applied to clustering big data. Therefore, improved HK algorithms have been studied by researchers to solve the problems of specific application areas. Some improved HK algorithms achieve good clustering results, but those results do not carry over well outside their application domain. The iHK algorithm generalizes well because it is not limited to a particular field, and it can be used for clustering big data since it has low time complexity.
From the accuracy, efficiency and time complexity obtained by comparing the three algorithms, the conclusion can be drawn that iHK has high accuracy and converges easily. In addition, its performance is distinctly superior to the other algorithms, while its time complexity is similar to that of basic K-means. Importantly, iHK is not tied to any special application area and is easy to integrate with other clustering algorithms. However, iHK still cannot solve the irreversibility of the HK algorithm; improving iHK further is the main task of future work.