Optimization of Parallel GNG by Neurons Assigned to Processes

Abstract. The size, complexity and dimensionality of data collections have been ever increasing since the beginning of the computer era. Clustering is used to reveal structures and to reduce large amounts of raw data. There are two main issues when clustering based on unsupervised learning, such as Growing Neural Gas (GNG) [9], is performed on a vast high-dimensional data collection: the fast growth of computational complexity with respect to growing data dimensionality, and the specific similarity measurement in a high-dimensional space. These two factors reduce the effectiveness of clustering algorithms in many real applications. The growth of computational complexity can be partially solved using parallel computation facilities, such as a High Performance Computing (HPC) cluster with MPI. An effective parallel implementation of GNG is discussed in this paper, with the main focus on minimizing interprocess communication. The achieved speed-up is better than that of the previous approach, and the results of the standard and parallel versions of GNG are the same.


Introduction
The size and complexity of data collections have been ever increasing since the beginning of the computer era, and the dimensionality of data sets has been growing rapidly in recent years. Contemporary and especially future technologies allow us to acquire, store and process large high-dimensional data collections. Such collections are commonly available in areas like medicine, biology, information retrieval, web analysis, social network analysis, image processing, financial transaction analysis and many others.
Clustering, considered the most important unsupervised learning problem, is used to reveal structures, to identify "natural" groupings of the data and to reduce large amounts of raw data by categorizing them into smaller sets of similar items.
There are two main issues when clustering based on unsupervised learning, such as Growing Neural Gas (GNG) [9], is performed on a vast high-dimensional data collection: 1. The fast growth of computational complexity with respect to growing data dimensionality, and 2. The specific similarity measurement in a high-dimensional space, where the expected Euclidean distances from any given point to its closest and to its farthest point become nearly the same with growing dimensionality [2].
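The distance-concentration effect mentioned in the second issue can be illustrated with a short, self-contained sketch (not from the paper; the function name and parameters are ours). It measures the relative contrast (dmax − dmin)/dmin between the farthest and the closest of a set of random points, for a low and a high dimensionality:

```python
import math
import random

def relative_contrast(dim: int, n_points: int = 200, seed: int = 1) -> float:
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, p))))
    return (max(dists) - min(dists)) / min(dists)

print(f"dim=2:    contrast = {relative_contrast(2):.2f}")
print(f"dim=1000: contrast = {relative_contrast(1000):.2f}")
```

With growing dimensionality the contrast shrinks towards zero, which is exactly what makes nearest-neighbour-based similarity search problematic for high-dimensional data.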
These two factors reduce the effectiveness of clustering algorithms on the above-mentioned high-dimensional data collections in many real applications. The growth of computational complexity can be partially solved using parallel computation facilities, such as a High Performance Computing (HPC) cluster with MPI. Obviously, it is necessary to resolve the technical and implementation issues specific to this computing platform. An effective parallel implementation of GNG is discussed in this paper, with the main focus on minimizing interprocess communication.

Related Works
Methods based on Artificial Neural Networks (ANN) are highly computationally expensive, and there are different approaches to improving their effectiveness. One possibility is to improve the computation itself. The authors of [3] propose two optimization techniques aimed at an efficient implementation of the internal structure of the GNG algorithm. Their optimizations preserve all properties of the GNG algorithm. The first technique enhances the nearest neighbour search using a space partitioning by a grid of rectangular cells, and the second speeds up the handling of node errors using a lazy evaluation approach.
Another possibility for improving the effectiveness of ANN-based methods is parallelization. In [4] the authors combine the batch variant of the GNG algorithm with the MapReduce paradigm, resulting in a GNG variant suitable for processing large data sets in scalable, general cluster environments. The paper [16] focuses on the updates of neuron weights in the learning phase of a parallel implementation of SOM. There are two extreme update strategies: in the first, all necessary updates are done immediately after processing one input vector; in the other extreme, used in Batch SOM, the updates are processed at the end of a whole epoch. The authors study update strategies between these two extremes.
Graphics Processing Units (GPU) are often used for parallelization. In [1] the authors present the results of different parallelization approaches to the GNG clustering algorithm; they especially investigated the GPU and multi-core CPU architectures. The authors of [12] explore an alternative approach to the parallelization of growing self-organizing networks, based on an algorithm variant designed to match the features of the large-scale, fine-grained parallelism of GPUs, in which multiple input signals are processed simultaneously. The paper [15] describes the implementation and analysis of a network-agnostic and convergence-invariant coarse-grain parallelization of the deep neural network (DNN) training algorithm. The coarse-grain parallelization is achieved through the exploitation of batch-level parallelism. This strategy is independent of the support of specialized and optimized libraries, so the optimization is immediately available for accelerating DNN training. The proposal is compatible with multi-GPU execution without altering the algorithm's convergence rate.

Growing Neural Gas
The principle of this neural network is an undirected graph, which need not be connected. Generally, there are no restrictions on the topology. The graph is generated and continuously updated by competitive Hebbian learning [8,13]. According to pre-set conditions, new neurons are automatically added, and connections between neurons are subject to ageing and can be removed. GNG can be used for vector quantization by finding the code-vectors in clusters [7], for clustering data streams [6], for biologically influenced models [14] and for 3D model reconstruction [10]. GNG works by modifying the graph, where the operations are the addition and removal of neurons and of edges between neurons.
To understand the functioning of GNG, it is necessary to define the algorithm. The algorithm described in our previous article [17] is based on the original algorithm [5,7], but it is modified for better continuity with the SOM algorithm. The description of the algorithm has been divided for convenience into two parts.
Algorithm 1 describes one iteration.
Remark: The notation used in the paper is briefly listed in Table 1.

Parallelization
In the paper [17] we dealt with the parallelization of GNG. The following is a brief description of our parallelization algorithm.
After analysing the GNG learning algorithm, we identified the single most time-consuming area. This part was selected as a candidate for possible parallelization:

Finding BMU – this part of GNG learning can be significantly accelerated by dividing the GNG output layer into smaller pieces (a distribution of neurons for effective parallelization). Each piece is then assigned to an individual computation process. The crucial point of this part of GNG learning is the calculation of the Euclidean distance between the individual input vector and all the weight vectors to find the BMU in a given part of the GNG output layer. Each process finds its own partial BMU in its part of the GNG output layer. Each partial BMU is then compared with the BMUs obtained by the other processes. Information about the BMU of the whole network is then transmitted to all processes to perform the updates of the BMU neighbourhood.

Table 1. Notation used in the paper

  Nc1      The first Best Matching Unit (BMU1), winner of the learning competition
  Nc2      The second Best Matching Unit (BMU2), the second best matching neuron in the learning competition
  wc1(t)   Weight vector of BMU1
  wc2(t)   Weight vector of BMU2
  lc1      Learning factor of BMU1
  lnc1     Learning factor of BMU1's neighbours
  ei       Local error of output neuron Ni, i = 1, 2, ..., N
  α        Error ei reduction factor
  β        Neuron error reduction factor
  γ        Interval of input patterns to add a new neuron
  amax     Maximum edge age
  aij      Age of edge eij
  p        Number of processes
  Vi       Set of neurons assigned to Processi
  V        Set of all neurons in the GNG

Algorithm 1 One iteration of the Growing Neural Gas algorithm

3. Update the weight vector wc1 of neuron Nc1.
4. For all neurons Nk for which the edge ec1k exists (the Nc1 neighbourhood):
   (a) Update the weights wk using the lnc1 learning factor.
   (b) Increase the age akc1 of edge ec1k: akc1 = akc1 + 1.
5. If there is no edge between neurons Nc1 and Nc2, create such an edge. If the edge exists, its age is set to 0.
6. If any edge has reached the age amax, it is removed.
7. If there is a neuron without a connection to any edge, the neuron is removed.
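As an illustration only, one GNG iteration following steps 3–7 of Algorithm 1 can be sketched in Python. The data structures and names below are our assumptions, and the BMU search and local-error accumulation preceding step 3 follow the standard GNG formulation rather than the listing above:

```python
import math

def dist(a, b):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def gng_iteration(x, weights, error, edges, age, l_c1, l_nc1, a_max):
    """One GNG iteration on input vector x.

    weights: dict neuron_id -> weight vector (list of floats)
    error:   dict neuron_id -> local error e_i
    edges:   set of frozenset({i, j}) neuron pairs
    age:     dict frozenset({i, j}) -> edge age
    """
    # Find BMU1 (Nc1) and BMU2 (Nc2): the two neurons nearest to x.
    ranked = sorted(weights, key=lambda n: dist(x, weights[n]))
    c1, c2 = ranked[0], ranked[1]
    # Accumulate the local error of the winner (standard GNG step).
    error[c1] += dist(x, weights[c1]) ** 2
    # Step 3: update the weight vector w_c1 of neuron N_c1.
    weights[c1] = [w + l_c1 * (xi - w) for w, xi in zip(weights[c1], x)]
    # Step 4: update the neighbours of N_c1 and age their edges.
    for e in list(edges):
        if c1 in e:
            (k,) = e - {c1}
            weights[k] = [w + l_nc1 * (xi - w) for w, xi in zip(weights[k], x)]
            age[e] += 1                      # a_kc1 = a_kc1 + 1
    # Step 5: create the edge (c1, c2), or reset its age to 0.
    e12 = frozenset({c1, c2})
    edges.add(e12)
    age[e12] = 0
    # Step 6: remove edges that have reached the maximum age a_max.
    for e in [e for e in edges if age[e] >= a_max]:
        edges.discard(e)
        del age[e]
    # Step 7: remove neurons left without any edge.
    for n in [n for n in list(weights) if not any(n in e for e in edges)]:
        del weights[n], error[n]
    return c1, c2
```

Neuron insertion every γ input patterns and the error-reduction steps (α, β) belong to the second part of the algorithm description and are omitted here.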
Updates of weights – updating the weights of the neurons connected by an edge to Nc1 is fast when the nodes neighbouring Nc1 reside on the same process. This part could theoretically be accelerated by moving adjacent nodes to a single process. Unfortunately, transferring a node is very time-consuming for multidimensional data (the test data have a dimensionality of 8000+).
A detailed description of our parallelization process is given in Fig. 1. The parallelization of GNG learning was performed on an HPC cluster using Message Passing Interface (MPI) technology. MPI is based on effective communication between processes, which allows one application to run on many cores. The application uses MPI processes which run on individual cores; the processes are able to send and receive messages and data, communicate, etc. Detailed information about HPC and MPI technology is provided, for example, in [11].
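The partial-BMU search described above can be mimicked with a short sequential sketch (names and structure are ours). The loop over partitions stands in for the work the MPI processes do concurrently, and the final minimum over partial results plays the role of an allreduce with a minimum-location (MINLOC) reduction:

```python
def dist2(a, b):
    # Squared Euclidean distance is sufficient for ranking candidates.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def parallel_bmu(x, partitions, weights):
    """Simulate the distributed BMU search.

    partitions: list V_0 .. V_{p-1} of neuron ids owned by each process
    weights:    dict neuron_id -> weight vector
    Each 'process' finds its partial BMU; the global BMU is the minimum
    over the partial results, which is then known to all processes.
    """
    partial = []
    for proc_id, V_i in enumerate(partitions):   # concurrent under MPI
        if not V_i:
            continue
        best = min(V_i, key=lambda n: dist2(x, weights[n]))
        partial.append((dist2(x, weights[best]), best, proc_id))
    _, bmu, owner = min(partial)                 # the reduction step
    return bmu, owner
```

Only the (distance, id, process) triple crosses process boundaries in this scheme, not the high-dimensional weight vectors, which is what keeps the communication volume low.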

Distribution of Neurons for Effective Parallelization
In the paper [17] we used Method 1 (the equal distribution of neurons), where new neurons are allocated to the process with the lowest number of neurons (see Figure 2). The advantage of this distribution is a constant workload of the processes; the disadvantage is increased communication between processes. Our goal is to reduce interprocessor communication by using the following methods for adding a new neuron:

Method 2
The neurons are gradually added to the process which does not yet contain the predetermined number of neurons (Nmax/p). When one process is filled up, a new neuron is added to the next process (see Figure 3).
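For illustration, the two assignment rules described so far can be sketched as follows (hypothetical helper names): Method 1 picks the least-loaded process, while the fill-up rule respects the Nmax/p capacity per process:

```python
def assign_method1(counts):
    """Method 1: equal distribution; the new neuron goes to the
    process that currently owns the fewest neurons."""
    return min(range(len(counts)), key=lambda i: counts[i])

def assign_method2(counts, n_max, p):
    """Method 2: fill processes one after another; a process receives
    new neurons until it holds its share N_max / p, then the next
    process starts to fill up. Returns None when all are full."""
    cap = n_max / p
    for i, c in enumerate(counts):
        if c < cap:
            return i
    return None
```

With `counts[i] = |V_i|`, Method 2 keeps a whole contiguous block of the output layer on one process, which is what reduces the cross-process edges compared with Method 1.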

Method 3
The start is similar to Method 1. If |Vi| ≥ 2 ∀i ≤ p, then add Nnew to Processi where (Nc1 ∈ Vi and |Vi| ≤ Nmax/p), or add Nnew to Processk where (Nc2 ∈ Vk and |Vk| ≤ Nmax/p), or add Nnew to the next free Processj (see Figure 4).
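Method 3's preference order might be sketched like this (names and data structures are ours; the capacity test |Vi| ≤ Nmax/p follows the formula above):

```python
def assign_method3(counts, owner_of, c1, c2, n_max, p):
    """Method 3: place the new neuron near its 'parents'.

    counts:   number of neurons currently held by each process
    owner_of: dict neuron_id -> process id
    c1, c2:   ids of BMU1 and BMU2, between which the new neuron is born
    Prefer the process owning N_c1, then the one owning N_c2, provided
    |V_i| <= N_max / p; otherwise fall back to the next process with
    free capacity. Returns None when every process is over capacity.
    """
    cap = n_max / p
    for parent in (c1, c2):
        proc = owner_of[parent]
        if counts[proc] <= cap:
            return proc
    for j in range(p):
        if counts[j] <= cap:
            return j
    return None
```

Placing the new neuron on the process that already owns Nc1 (or Nc2) keeps the new edges local, so the neighbourhood updates of step 4 need no interprocess communication.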

The Experiment
The experiment was oriented towards a comparison of the parallel GNG algorithm and its modification by the assignment of neurons to processes. The Medlars dataset was used for the experiment. A parallel version of the learning algorithm was run using 2, 8, 16, 24, 32 and 64 MPI processes. The records with an asterisk (*) represent the results for only one process, i.e. the original serial learning algorithm with no network communication. The GNG parameters are the same for all experiments: γ = 200, ew = 0.05, en = 0.006, α = 0.5, β = 0.0005, amax = 88, M = 2021, δ = 1500. The achieved computing times are presented in Table 2.
As we can see from Table 2 and Fig. 5, the computing time depends on the number of cores used. With a growing number of processes, the computation effectiveness increases and the computational time is substantially reduced (in the paper [17] we used Method 1).

Conclusion
In this paper, a parallel implementation of the GNG neural network algorithm was presented. The achieved speed-up was better than that of our previous approach. However, the effectiveness of a parallel solution depends on the division of the output layer. We introduced three different methods of assigning neurons to processes, and better acceleration was achieved with the new approaches. These approaches reached the best times, although the speed-up is not very significant for the selected data set. Our methods focus on the different ways of assigning a new neuron in the GNG to processes for parallel computation.
In future work we intend to focus on sparse data, and to use combinations of neural networks to further improve the results and the acceleration.