Automated Determination of the Input Parameter of DBSCAN Based on Outlier Detection

Abstract. During the last two decades, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) has been one of the most common clustering algorithms and is highly cited in the scientific literature. However, despite its strengths, DBSCAN has a shortcoming in parameter determination, which is done in interaction with the user based on some graphical representation of the data. This paper introduces a simple and effective method for automatically determining the input parameter of DBSCAN. The idea is based on a statistical technique for outlier detection, namely the empirical rule. This work also suggests a more accurate method for detecting clusters that lie close to each other. Experimental results in comparison with the old method, together with the time complexity of the algorithm, which is the same as for the old algorithm, indicate that the proposed method is able to determine the input parameter of DBSCAN automatically, reliably and efficiently.


Introduction
Machine Learning (ML) is one of the core fields of Artificial Intelligence (AI) and is concerned with the question of how to construct computer programs that automatically improve with experience [1]. Depending on the nature of the learning data available to the learning system, machine learning methods are typically classified into three main categories [2, 3]: supervised, unsupervised and reinforcement learning. In supervised learning, example inputs and their desired outputs are given, and the goal is to learn a general rule that maps these inputs to their desired outputs. In unsupervised learning, on the other hand, no labels are given to the learning algorithm, leaving it on its own to find the hidden structure of the data, e.g. to look for similarities between the data instances (i.e. clustering [4]), or to discover dependencies between the variables in large databases (i.e. association rule mining [5]). In reinforcement learning, the desired input/output pairs are again not presented; however, the algorithm is able to estimate the optimal actions by interacting with a dynamic environment, based on the outcomes of the more recent actions, while ignoring experiences from the past that were not reinforced recently. This research focuses on the most common unsupervised learning method (i.e. cluster analysis [4, 6]), and more specifically on one of its successful algorithms, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [7]. As mentioned above, in unsupervised learning the learner processes the input data with the goal of coming up with some summary or compressed version of the data [4]. Clustering a dataset is a typical example of this type of learning. Clustering is the task of grouping a set of objects such that similar objects end up in the same group and dissimilar objects are diverted into different groups. Clearly, this description is quite imprecise and possibly ambiguous. However, quite surprisingly, it is not at all clear how to come up with a more rigorous definition [4], and since no definition of a cluster is widely accepted, many algorithms have been developed to suit specific domains [8], each of which uses a different induction principle [9].
Due to their diversity, clustering methods are classified into different categories in the scientific literature [9, 10, 11, 12]. However, despite the slight differences between these classifications, they all mention the DBSCAN algorithm as one of the eminent methods available. DBSCAN owes its popularity to the group of capabilities it offers [7]: (1) it does not require the specification of the number of clusters in the dataset beforehand, (2) it requires little domain knowledge to determine its input parameters, (3) it can find arbitrarily shaped clusters, (4) it has good efficiency on large datasets, (5) it has a notion of noise and is robust to outliers, (6) it is designed in a way that it can be supported efficiently by spatial access methods such as R*-trees [13], and so on.
The DBSCAN algorithm requires two input parameters, namely Eps and MinPts, which are considered to be the density parameters of the thinnest acceptable cluster, specifying the lowest density which is not considered to be noise. These parameters are hence respectively the radius and the minimum number of data objects of the least dense cluster possible. The algorithm supports the user in determining the appropriate values for these parameters by offering a heuristic method, which requires user interaction based on some graphical representation of the data (presented in Section 2.2). However, since DBSCAN is sensitive to its input parameters and the parameters have significant influence on the clustering result, an automated and more precise method for the determination of the input parameters is needed. Some notable algorithms targeting this problem are: (1) GRPDBSCAN, which combines the grid partition technique and the DBSCAN algorithm [14], (2) DBSCAN-GM, which combines the Gaussian-Means and DBSCAN algorithms [15], and (3) BDE-DBSCAN, which combines the Differential Evolution and DBSCAN algorithms [16]. As opposed to these methods, which all intend to solve the problem using some other techniques, this paper remains with the original idea of the DBSCAN algorithm and just tries to omit the user interaction needed, allowing the algorithm to detect the appropriate value itself. This is done using some basic statistical techniques for outlier detection. Two different approaches are mentioned in this paper, which apply the concept of standard deviation to the problem of outlier detection, namely the empirical rule for normal distributions and Chebyshev's inequality for non-normal distributions [17, 18]. This work, however, focuses mainly on the application of the empirical rule to outlier detection in normally distributed data, and addresses Chebyshev's inequality only as a possible solution for non-normal distributions.
The rest of the paper is organized as follows. Section 2 describes the DBSCAN algorithm and its supporting technique for the determination of its input parameters. In Section 3, the above-mentioned statistical techniques for outlier detection are presented (i.e. the empirical rule and Chebyshev's inequality). Section 4 describes the automated technique for the determination of the parameter Eps. Experimental results and the time complexity of the automated technique are then discussed in Section 5. Section 6 concludes with a summary and some directions for future research.

DBSCAN: Density-Based Spatial Clustering of Applications with Noise
According to [7], the key idea of the DBSCAN algorithm is that for each point of a cluster the neighborhood of a given radius has to contain at least a minimum number of points, i.e. the density in the neighborhood has to exceed some threshold. The following definitions support the realization of this idea.

Definition 1: (Eps-neighborhood of a point) The Eps-neighborhood of a point p, denoted by N_Eps(p), is defined by N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}.

Definition 2: (directly density-reachable) A point p is directly density-reachable from a point q, w.r.t. Eps and MinPts, if (1) p ∈ N_Eps(q), and (2) |N_Eps(q)| ≥ MinPts. The second condition is called the core point condition. (There are two kinds of points in a cluster: points inside the cluster, called core points, and points on the border of the cluster, called border points.)

Definition 3: (density-reachable) A point p is density-reachable from a point q, w.r.t. Eps and MinPts, if there is a chain of points p_1, ..., p_n, with p_1 = q and p_n = p, such that p_{i+1} is directly density-reachable from p_i.

Definition 4: (density-connected) A point p is density-connected to a point q, w.r.t. Eps and MinPts, if there is a point o such that both p and q are density-reachable from o, w.r.t. Eps and MinPts.

Definition 5: (cluster) Let D be a database of points. A cluster C, w.r.t. Eps and MinPts, is a non-empty subset of D satisfying the following conditions: 1. ∀ p, q: if p ∈ C and q is density-reachable from p, w.r.t. Eps and MinPts, then q ∈ C. (Maximality) 2. ∀ p, q ∈ C: p is density-connected to q, w.r.t. Eps and MinPts. (Connectivity)

Definition 6: (noise) Let C_1, ..., C_k be the clusters of the database D, w.r.t. parameters Eps_i and MinPts_i, i = 1, ..., k. Then the noise is defined as the set of points in the database D not belonging to any cluster C_i, i.e. noise = {p ∈ D | ∀ i: p ∉ C_i}.
The following lemmata are important for validating the correctness of the algorithm. Intuitively, they state that, given the parameters Eps and MinPts, a cluster can be discovered in a two-step approach. First, choose an arbitrary point from the database satisfying the core point condition as a seed. Second, retrieve all points that are density-reachable from the seed, obtaining the cluster containing the seed.
Algorithm 1: Sketch of the DBSCAN algorithm
1. While there is an unclassified point in the database D:
2. Select an arbitrary unclassified point p.
3. If p does not satisfy the core point condition, mark it as noise.
4. Else retrieve all points density-reachable from p, forming a cluster containing p, and mark all the members of this cluster as classified.
5. End While
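This loop can be sketched in Python as follows. This is a minimal brute-force illustration under names of my own choosing (`dbscan`, `region_query`), not the authors' implementation; a production version would answer the neighborhood queries with a spatial index such as an R*-tree.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: label -1 marks noise, labels >= 0 are clusters."""
    n = len(points)
    labels = [None] * n            # None = unclassified
    cluster_id = 0

    def region_query(i):
        # Eps-neighborhood of point i, including i itself (brute force, O(n))
        d = np.linalg.norm(points - points[i], axis=1)
        return [j for j in range(n) if d[j] <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = region_query(i)
        if len(seeds) < min_pts:   # core point condition fails
            labels[i] = -1         # tentatively noise (may become a border point)
            continue
        labels[i] = cluster_id     # grow the cluster from the seed point
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:    # border point previously marked as noise
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            neighbors = region_query(j)
            if len(neighbors) >= min_pts:  # j is itself a core point
                queue.extend(neighbors)
        cluster_id += 1
    return labels
```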

2.2 Determining the Parameters Eps and MinPts

DBSCAN offers a simple but effective heuristic method to determine the parameters Eps and MinPts of the thinnest cluster in the dataset. For a given k, the function k-dist is defined from the database D to the real numbers, mapping each point to the distance from its k-th nearest neighbor. When sorting the points of the dataset in descending order of their k-dist values, the graph of this function gives some hints concerning the density distribution in the dataset. This graph is called the sorted k-dist graph. The first point in the first valley of the sorted k-dist graph can serve as the threshold point with the maximal k-dist value in the thinnest cluster. All points with a larger k-dist value are considered to be noise, and all the other points are assigned to some cluster. DBSCAN states that, according to experiments, the k-dist graphs for k > 4 do not significantly differ from the 4-dist graph and, furthermore, need considerably more computation. Therefore, it eliminates the parameter MinPts by setting it to 4 for all (2-dimensional) datasets. The parameter determination method also explains that since, in general, it is very difficult to detect the first valley of the sorted k-dist graph automatically, but it is relatively simple for the user to see this valley in a graphical representation, an interactive approach for determining the threshold point is suggested.
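As an illustration, the sorted k-dist values underlying this heuristic can be computed with a few lines of NumPy (the function name `sorted_k_dist` is my own; the original method then plots these values and asks the user to locate the first valley):

```python
import numpy as np

def sorted_k_dist(points, k=4):
    """Distance from each point to its k-th nearest neighbor, in descending order."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diffs, axis=2)   # full pairwise distance matrix
    dist.sort(axis=1)                      # row i: distances from point i, ascending
    k_dist = dist[:, k]                    # column 0 is the point itself (distance 0)
    return np.sort(k_dist)[::-1]           # descending: the sorted k-dist graph
```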

Statistical Techniques for Outlier Detection
The term noise in the DBSCAN algorithm is equivalent to an outlier in statistics, which is an observation that is far removed from the rest of the observations [19]. One of the basic statistical techniques for outlier detection is called the empirical rule. The empirical rule is an important rule of thumb that states the approximate percentage of values that lie within a given number of standard deviations from the mean of a set of data, if the data are normally distributed. The empirical rule, also called the 68-95-99.7 rule or the three-sigma rule of thumb, states that 68.27%, 95.45% and 99.73% of the values in a normal distribution lie within one, two and three standard deviations of the mean, respectively [17]. One of the practical usages of the empirical rule is as a definition of outliers as the data that fall more than three standard deviations from the mean in normal distributions [20].
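This three-sigma definition of an outlier translates directly into code; the following is a minimal sketch (the function name is my own):

```python
import numpy as np

def three_sigma_outliers(values):
    """Flag values lying more than three standard deviations from the mean."""
    mu, sigma = np.mean(values), np.std(values)
    return [v for v in values if abs(v - mu) > 3 * sigma]
```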
Fig. 1. The Empirical Rule [21]

If there are many points that fall more than three standard deviations from the mean, then the distribution is most likely non-normal. In this case, Chebyshev's inequality, which applies to non-normal distributions, is applicable. Chebyshev's inequality states that in any probability distribution, at least 1 − 1/k² of the values lie within k standard deviations of the mean [17] (e.g. in non-normal distributions at least 99% of the values lie within 10 standard deviations of the mean). Hence, using Chebyshev's inequality, an outlier can also be defined as a data value that falls outside an appropriate number of standard deviations from the mean [22]².
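The Chebyshev bound itself is a one-line computation, shown here for illustration:

```python
def chebyshev_bound(k):
    """Lower bound on the fraction of values within k standard deviations of the mean,
    valid for any probability distribution (Chebyshev's inequality: 1 - 1/k^2)."""
    return 1 - 1 / k ** 2
```

For example, `chebyshev_bound(10)` gives the 99% figure quoted above.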

Automated Determination of the Parameter Eps
Setting the MinPts to 4, when determining the parameter Eps the algorithm aims at a radius that covers the majority of the 4-dist values and stands well as a threshold for the specification of the noise values. As mentioned above, the term noise in the DBSCAN algorithm is equivalent to an outlier in statistics, which is an observation that is far removed from the rest of the observations [19]. Thus, the idea here is to use statistical rules in order to find the threshold value between the accepted 4-dist values and the values considered for the noise points.
As mentioned above, one of the practical usages of the empirical rule is as a definition of outliers as the data that fall more than three standard deviations from the mean in normal distributions [20]. Thus, considering the 4-dist values, the value of the parameter Eps can be set to their mean plus three standard deviations. This would cover even more than 99.73% of the calculated 4-dist values, since the 4-dist values smaller than mean − 3 × σ are also covered here.
Border points, and in general points closer to the border of the clusters, usually have greater k-dist values, which lead to larger Eps values and thus might cause two close clusters to be detected as one cluster (since the parameter k, i.e. MinPts, is set to 4, this problem is caused mostly by the border points). These relatively greater k-dist values, however, do not have any positive effect on the process of cluster detection, as the k-dist values of the core points are actually the ones forming the right clusters and at the same time covering the border points. Figure 2 shows a case in which the 4-dist value of the border point p is much larger than the 4-dist value of the core point q, which can actually cover p in its 4-dist-neighborhood.

Fig. 2. 4 − 𝑑𝑖𝑠𝑡 values for example core (𝑞) and border point (𝑝)
² This work focuses solely on the empirical rule and normal distributions. However, the possibility of using Chebyshev's inequality is given here in order to show that the general idea of using outlier detection techniques for the purpose of parameter determination in DBSCAN is not limited to the distribution of the data.

In order to eliminate the negative effect of the k-dist values of the border points, the algorithm presented here considers, for each border point, the core point with the minimum k-dist value that covers the border point in its k-dist-neighborhood, and replaces the k-dist value of this border point with the k-dist value of this core point. Thus, for a given k, the function k-distˊ is defined from the dataset D to the real numbers, mapping each point to the minimum k-dist value of any core point covering this point in its k-dist-neighborhood. Following this technique, points are considered in ascending order of their 4-dist values; then, taking each point p, if the 4-distˊ value for any point in its four nearest neighbors is not set so far, this value is set to the 4-dist value of point p. Using this technique, for each point the k-dist value of the smallest cluster the point can join is considered. At the end, the mean and the standard deviation of the 4-distˊ values, which are saved for all points, are calculated, and the Epsˊ value is set to mean + 3 × σ. Algorithm 2 summarizes this method.
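A minimal NumPy sketch of this replacement scheme follows. The function name `automatic_eps` is my own, and the fallback for points never covered by any neighbor (they keep their own 4-dist value) is my assumption, not spelled out in the paper:

```python
import numpy as np

def automatic_eps(points, k=4):
    """Sketch of the automated Eps' = mean + 3*sigma rule over the k-dist' values."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    nn = np.argsort(dist, axis=1)[:, 1:k + 1]   # indices of the k nearest neighbors
    k_dist = dist[np.arange(n), nn[:, -1]]      # distance to the k-th neighbor
    k_dist_prime = np.full(n, np.nan)
    # Visit points in ascending order of k-dist; each point p donates its k-dist
    # value to any of its k nearest neighbors whose k-dist' is not yet set.
    for p in np.argsort(k_dist):
        for q in nn[p]:
            if np.isnan(k_dist_prime[q]):
                k_dist_prime[q] = k_dist[p]
    # Assumption: points never covered by a neighbor keep their own k-dist value.
    k_dist_prime = np.where(np.isnan(k_dist_prime), k_dist, k_dist_prime)
    return float(np.mean(k_dist_prime) + 3 * np.std(k_dist_prime))
```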

Experimental Results and Time Complexity
In this section, the experimental results and the time complexity of the automated technique proposed in Section 4 are discussed.

Experimental Results and Discussions
In this section, the algorithm presented in Section 4 is applied to some datasets. This makes the comparison between the old method and the new automated method possible. All the experiments were performed on an Intel(R) Celeron(R) CPU at 1.90 GHz with 2 GB RAM on the Microsoft Windows 8 platform. The algorithm and the datasets were implemented in Java on the Eclipse IDE (Mars). In order to illustrate the problem that may occur with the k-dist values of the border points (discussed in Section 4), dataset 5 is presented here (Figure 6). This dataset is defined in a way that nested and very close clusters are available in it.³ Note that the larger difference between Eps and Epsˊ for Dataset 3 is caused by the larger difference between the 4-distˊ values of those data instances considered as noise and the rest of the data instances. This difference has no effect on the clustering result: since Eps and Epsˊ are actually threshold values, and there are no data instances with 4-distˊ values between Eps and Epsˊ, the clustering result remains the same.

It should be pointed out that even though the experiments presented here were all on 2-dimensional datasets, the idea can be applied to high-dimensional datasets as well. This is clearly possible, since the calculation of the distance between the points and the application of the standard deviation remain the same for high-dimensional datasets.
The only point that must be considered is that DBSCAN suggested 4 as the MinPts value just for 2-dimensional datasets. However, as mentioned before, Eps and MinPts are the density parameters of the thinnest cluster; therefore it is always possible to determine the Eps by keeping the MinPts parameter small enough (or even just by setting it to one). The diversity of the density may always be described with different radii containing a predefined number of points (MinPts).

Time Complexity
Since the algorithm needs to find the four nearest neighbors of each point in the dataset, the time complexity of the algorithm cannot be less than O(n²). Of course, since these neighbors also have to be retrieved in the user-interaction technique, and the only difference here is the calculation of the mean and the standard deviation, which can be done in O(n), it is clear that the time complexity of the automated technique presented here is the same as for the old method. Thus, considering the automated abilities of this technique, it is obvious that the application of this approach to the determination of the Eps parameter is quite reasonable.

Conclusion
This paper proposes a simple and effective method to automatically determine the input parameter Eps of DBSCAN. The work remains with the original idea of the DBSCAN algorithm and just tries to omit the user interaction needed, allowing the algorithm to detect the appropriate value itself. This is done using some basic statistical techniques for outlier detection. Two different approaches are mentioned here, which apply the concept of standard deviation to the problem of outlier detection, namely the empirical rule for normal distributions and Chebyshev's inequality for non-normal distributions. One of the practical usages of the empirical rule is as a definition of outliers as the data that fall more than three standard deviations from the mean in normal distributions. Thus, the value of the parameter Eps can be set to the mean plus three standard deviations. This value covers the majority of the 4-distˊ values and stands well as a threshold for the specification of the noise values. This work also mentioned the problem which occurs with the k-dist values of the border points, and suggests a more accurate method for the determination of the values based on which Eps is calculated (i.e. the 4-distˊ values). Experimental results and the time complexity of the proposed algorithm suggest that the application of this technique to the determination of the Eps parameter is quite reasonable. The concentration of this research was mainly on the application of the empirical rule to outlier detection in normally distributed data. Future work will have to consider Chebyshev's inequality for possible non-normal distributions of the 4-distˊ values.


Table 2 .
Algorithm 2: Pseudo-code of the automated Eps determination (Input: dataset D)
1. For each point p, find the four nearest neighbors.
2. Sort the points in ascending order of their 4-dist values.
3. Following the ascending order, take each point p and, if the 4-distˊ value for any of its four nearest neighbors is not set so far, set this value to the 4-dist value of the point p.
4. Calculate the mean of the 4-distˊ values: μ.
5. Calculate the standard deviation of the 4-distˊ values: σ.
6. Set the Epsˊ value to μ + 3 × σ.