Self Organizing Maps with Delay Actualization

The paper deals with Self Organizing Maps (SOM). The SOM is a standard tool for clustering and visualization of high-dimensional data. The learning phase of the SOM is time-consuming, especially for large datasets. There are two main bottlenecks in the learning phase of the SOM: finding the winner of the competitive learning process and updating the neurons' weights. The paper focuses on the second problem. There are two extremal update strategies. Using the first strategy, all necessary updates are done immediately after processing one input vector. The other extremal choice is used in Batch SOM: updates are processed at the end of the whole epoch. In this paper we study update strategies between these two extremes. Learning of the SOM with delayed updates is proposed, and the proposed strategies are experimentally evaluated.


Introduction
Recently, the issue of high-dimensional data clustering has arisen together with the development of information and communication technologies which support growing opportunities to process large data collections. High-dimensional data collections are commonly available in areas like medicine, biology, information retrieval, web analysis, social network analysis, image processing, financial transaction analysis and many others.
Two main challenges must be solved to process high-dimensional data collections. The first is the rapid growth of computational complexity with respect to data dimensionality. The second is the specifics of similarity measurement in a high-dimensional space. Beyer et al. showed in [1] that, for any point in a high-dimensional space, the relative difference between the Euclidean distances to its closest and farthest points shrinks with growing dimensionality. These two factors reduce the effectiveness of clustering algorithms on the above-mentioned high-dimensional data collections in many real applications.
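This effect can be illustrated numerically. The following short Python sketch is our illustration, not taken from [1]; it assumes NumPy and uniformly random data, and shows how the relative contrast between the farthest and closest Euclidean distances to a query point shrinks as the dimensionality grows.

import numpy as np

# Illustration of distance concentration in high dimensions: the relative
# contrast (d_max - d_min) / d_min of distances to a query point shrinks
# as the dimensionality grows.
rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000, 10000):
    points = rng.random((1000, dim))   # 1000 uniformly random points
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:6d}  relative contrast={contrast:.4f}")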
The paper is organized as follows. In Sect. 2 we describe Self Organizing Maps. Section 3 describes the parallel design of the SOM learning algorithm. The modification of the weights' update process is given in Sect. 4. Experimental results are presented in Sect. 5. The paper is summarized and conclusions are drawn in Sect. 6.

Self Organizing Maps
Self Organizing Maps (SOMs), also known as Kohonen maps, were proposed by Teuvo Kohonen in 1982 [3]. A SOM consists of two layers of neurons: an input layer that receives and transmits the input information, and an output layer that represents the output characteristics. The output layer is commonly organized as a two-dimensional rectangular grid of nodes, where each node corresponds to one neuron. The layers are feed-forward connected: each neuron in the input layer is connected to each neuron in the output layer, and a real number, a weight, is assigned to each of these connections; the weights of all connections to a given output neuron form its weight vector. The SOM is a kind of artificial neural network trained by unsupervised learning. Learning of the SOM is a competitive process in which the neurons compete for the right to respond to a training sample. The winner of the competition is called the Best Matching Unit (BMU).
Using SOM, the input space of training samples can be represented in a lower-dimensional (often two-dimensional) space [4], called a map. Such a model is efficient in structure visualization due to its feature of topological preservation using a neighbourhood function.
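For reference, the competitive learning just described can be condensed into a minimal sequential sketch. The Python function below is illustrative only: the name som_train, the linear decay of the learning rate and neighbourhood radius, and the Gaussian neighbourhood function are our assumptions, not the exact formulation used in this paper.

import numpy as np

def som_train(data, rows, cols, epochs, lr0=0.5, sigma0=None, seed=0):
    """Minimal sequential SOM learning loop (sketch)."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.random((rows * cols, dim))      # one weight vector per neuron
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1e-3      # shrinking neighbourhood radius
        for x in data:
            # competitive step: the neuron closest to x wins (BMU)
            bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
            # cooperative step: Gaussian neighbourhood around the BMU on the grid
            grid_dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
    return weights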

Parallel SOM Learning Algorithm
Network partitioning is the most suitable implementation of the parallelization of the SOM learning algorithm. It is an implementation of the learning algorithm where the neural network is partitioned among the processes, and it has been implemented by several authors [2,9]. The parallel implementation proposed in this work is derived from the standard sequential SOM learning algorithm. After analysing the serial SOM learning algorithm, we identified the two most processor-time-consuming parts, which were selected as candidates for possible parallelization:
Finding the BMU - this part of SOM learning can be significantly accelerated by dividing the SOM output layer into smaller pieces, each assigned to an individual computation process. The crucial point of this part is the calculation of the Euclidean distance between the input vector and all the weight vectors in a given part of the SOM output layer. Each process finds its own partial BMU in its part of the output layer. Each partial BMU is then compared with the BMUs obtained by the other processes, and information about the BMU of the whole network is transmitted to all the processes to perform the updates of the BMU neighbourhood.
Weight Actualization - the weight vectors of the neurons in the BMU neighbourhood are updated in this phase. The updating process can also be performed in parallel: each process can efficiently detect whether some of its neurons belong to the BMU neighbourhood and, if so, updates the selected neurons.
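A minimal sketch of the partial-BMU reduction is shown below, assuming mpi4py. The helper name find_global_bmu, the arrays local_weights and local_ids (the global indices of the neurons owned by the calling process), and the use of allgather to exchange the partial results are our assumptions for illustration, not the exact communication code of the implementation.

from mpi4py import MPI
import numpy as np

def find_global_bmu(comm, x, local_weights, local_ids):
    # competitive step restricted to the neurons owned by this process
    d = np.linalg.norm(local_weights - x, axis=1)
    i = int(np.argmin(d))
    # exchange the partial (distance, global neuron id) pairs; the pair
    # with the smallest distance identifies the BMU of the whole network
    pairs = comm.allgather((float(d[i]), int(local_ids[i])))
    best_dist, best_id = min(pairs)
    return best_id, best_dist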
A detailed description of our approach to the parallelization process is given in Fig. 2.

Fig. 2. Improved Parallel SOM Algorithm
Before any implementation of an experimental application began, we had to decide how the parallelization would be done. Initially, we supposed that the most effective approach is to divide the SOM into several parts, or blocks, where each block is assigned to an individual computational process. For example, suppose that an SOM with N = 20 neurons in the output layer is given. The output layer is formed as a rectangular grid with Nr = 4 rows and Nc = 5 columns. The output layer of the SOM is then divided into 3 contiguous blocks which are associated with three processes.
To remove the unbalanced load, the approach to the parallelization process was modified. The division of the SOM output layer was changed from a block distribution to a cyclic one: the individual neurons were assigned to the processes in a cyclic manner. A nearly uniform distribution of the output layer's neurons among the processes is the main advantage of this kind of parallelization. The uniform distribution of the neurons plays an important role in weight actualization, because there is a strong assumption that the neurons in the BMU neighbourhood will belong to different processes. An example of a cyclic division of the SOM output layer with a dimension of 4 × 5 neurons can be seen in Fig. 3, where each neuron is labeled with the color of its assigned process. A more detailed description of the parallelization can be found in our previous papers (including full notation) [6,8]. The two distribution schemes are contrasted in the sketch below.
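The sketch assumes that neurons are identified by their linear index in the grid; the function names are ours.

def block_partition(n_neurons, n_procs, rank):
    """Contiguous blocks: neuron i goes to process i // ceil(n/p)."""
    block = -(-n_neurons // n_procs)  # ceiling division
    return [i for i in range(n_neurons) if i // block == rank]

def cyclic_partition(n_neurons, n_procs, rank):
    """Cyclic assignment: neuron i goes to process i % p, spreading any
    BMU neighbourhood almost uniformly across the processes."""
    return list(range(rank, n_neurons, n_procs))

# For the N = 20, 3-process example above:
# block:  P0 -> 0..6,      P1 -> 7..13,     P2 -> 14..19
# cyclic: P0 -> 0,3,6,..., P1 -> 1,4,7,..., P2 -> 2,5,8,...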

Delay Actualization
In this modification of the SOM algorithm we focus on the part called finding the BMU. Only in the parallel version is it necessary to find the global BMU from the local BMUs in each iteration, and two issues arise here:
1. To find the global BMU we must transfer a lot of data between the processes.
2. This waiting mode (blocking communication), where the other processes and threads await the outcome, decreases the efficiency of the parallel computation.
The method described below is based on the observation that, for the same total amount of data, it is more efficient to send the data all at once than in small portions. Both problems mentioned above are solved this way. Evidence of the transfer behaviour is given in Table 1, where 1 to 64 processes are used and 50 thousand and 500 thousand numbers are transferred from all processes to a single process; in one case all of these numbers are sent together, and in the second case separately, two numbers at a time. For clarity, the amount of data transferred from each individual process is always the same; only the total amount of data finally placed on the target process changes. For example, if we have 6 processes and 50k numbers, then 6 × 50k numbers are stored on the target process. From these results it can be seen that, when the data are sent together, the final times for both amounts (50k and 500k) are very similar.
For the data transmission, the MPI function Gather [2] is used, and the processes run on separate computing nodes connected by an InfiniBand network. The second point mentioned above concerns the utilization of the individual processes or threads (both parallelizations operate on the same principle, see our previous article [6]). As mentioned earlier, we divide the SOM algorithm into two parts: the first part concerns the search for the BMU (the fast part) and the second part concerns updating the weights (the time-consuming part). A delay occurs in the situation where some processes (threads) must update more neurons than the others; an example can be seen in Fig. 4, where process number two must update three neurons while the other processes must update only two. If the update does not occur after each iteration, but only after a certain time, it is possible to reduce the impact of the blocking communication: individual processes do not have to wait for the other processes, and the utilization of the processes should be uniform. This is for two reasons: 1. The BMU is usually different in each iteration, and therefore the neurons which must be updated differ as well. 2. The number of neurons to be updated decreases over time, but at the beginning of the algorithm about 1/4 of all neurons are updated. An important factor that affects the distribution is the training data, and unfortunately this cannot be anticipated.

Following the above observations, our goal is to aggregate the transmitted data, and we propose the following approach to updating the weights. The basis is the parallel solution described in Sect. 3. At the beginning, a limit of delay L is set, specifying how many local BMUs can be kept in local memory by each process. Each process finds its local BMU and saves the result in local memory. If the limit is not reached, the next input vector is read and a new local BMU is found. If the limit is reached, all local BMUs are moved to the process with rank 0, which finds the global BMU for each buffered iteration and then sends the results to all processes. After this step, each process gradually updates the weights.
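The delayed actualization can be sketched as follows, again assuming mpi4py. The function train_with_delay, the callback update_fn performing the neighbourhood update, and the gather/bcast pair standing in for the actual communication calls are our illustrative assumptions, not the exact implementation.

from mpi4py import MPI
import numpy as np

def train_with_delay(comm, data, local_weights, local_ids, L, update_fn):
    """Buffer up to L local BMUs before one collective exchange (sketch)."""
    rank = comm.Get_rank()
    buffered_x, buffered_bmu = [], []
    for x in data:
        d = np.linalg.norm(local_weights - x, axis=1)
        i = int(np.argmin(d))
        buffered_x.append(x)
        buffered_bmu.append((float(d[i]), int(local_ids[i])))
        if len(buffered_bmu) < L:
            continue                      # limit not reached: read next vector
        # limit reached: ship all buffered local BMUs to rank 0 at once
        all_bmus = comm.gather(buffered_bmu, root=0)
        if rank == 0:
            # resolve the global BMU of every buffered iteration
            global_bmus = [min(step) for step in zip(*all_bmus)]
        else:
            global_bmus = None
        global_bmus = comm.bcast(global_bmus, root=0)
        for x_j, (_, bmu_j) in zip(buffered_x, global_bmus):
            update_fn(local_weights, local_ids, bmu_j, x_j)  # neighbourhood update
        buffered_x, buffered_bmu = [], []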
We worked with three variants of the above-described algorithm:
1. Constant delay (Cons) - the size of L stays the same throughout the calculation.
2. Decreasing delay (Dec) - the delay gradually decreases by a step ζ.
3. Increasing delay (Inc) - the delay gradually increases by a step ζ.
For a complete description of the algorithm, note that ζ is applied at the end of each epoch (only for the variants Dec and Inc). The setting of the value of ζ for the variants Dec and Inc largely depends on the number of input vectors M; therefore we work with percentages of M, which are used to set both L and ζ.
In the experiments section we attempt to show how much influence the value of ζ has. For example: L = 10% of M, ζ = 0.1% of M.
Here it is necessary to briefly recall the behaviour of the SOM neural network. Over time, the number of updated neurons changes: it decreases. At the beginning, most of the neurons are updated, but at the end only a few neurons, or a single neuron, are. If the variant Dec is used, the delay gradually decreases by ζ, and so does the number of neurons that must be updated. In the variant Inc it is the opposite: the number of updated neurons still decreases, but the delay increases. The three schedules are sketched below.
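The sketch assumes that L and ζ have already been converted from percentages of M to numbers of input vectors; the function name and the clamping bounds are our assumptions.

def next_delay(L, variant, zeta, L_min=1, L_max=None):
    """Apply the step zeta at the end of an epoch (Cons/Dec/Inc sketch)."""
    if variant == "Cons":
        return L                                  # constant throughout
    if variant == "Dec":
        return max(L_min, L - zeta)               # delay shrinks over time
    if variant == "Inc":
        return L + zeta if L_max is None else min(L_max, L + zeta)
    raise ValueError(f"unknown variant: {variant}")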

Experiments
We describe the datasets used and provide experiments with larger and smaller limits of delay L. The mean quantization error (MQE), described in [5], is used to compare the quality of the resulting neural networks.
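A common way to compute the MQE is the average Euclidean distance between each input vector and the weight vector of its BMU; the sketch below assumes this standard definition and NumPy, and is not taken verbatim from [5].

import numpy as np

def mean_quantization_error(data, weights):
    """Average distance between each input vector and its BMU's weights."""
    # pairwise distances: one row per input vector, one column per neuron
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())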

Experimental Datasets and Hardware
Weblogs Dataset. A Weblogs dataset was used to test the effectiveness of the learning algorithm on high-dimensional datasets. The Weblogs dataset contained web logs from an Apache server, namely records of two months' worth of requested activities (HTTP requests) from the NASA Kennedy Space Center WWW server in Florida. Standard data preprocessing methods were applied to the obtained dataset: the records from search engines and spiders were removed, and only the web site browsing activity was left (without downloads of pictures, icons, stylesheets, scripts, etc.). The final dataset (input vector space) had a dimension of 90,060 and consisted of 54,961 input vectors. For a detailed description, see our previous work [7], where web site community behaviour was analyzed.
On the basis of this dataset, 15,560 user profiles were extracted; the number of profile attributes is 28,894 (this number corresponds to the dimension of the input space) for the final dataset.
Experimental Hardware. The experiments were performed on a Linux HPC cluster named Anselm, with 209 computing nodes, where each node had 16 processors and 64 GB of memory. The processors in the nodes were Intel Sandy Bridge E5-2665. The compute network is InfiniBand QDR, fully non-blocking, fat-tree. Detailed information about the hardware can be found on the website of the Anselm HPC cluster.
In this section we describe experiments based on delay actualization. In the experiments, we examine the quality of the resulting neural networks and the time required for the calculation. The parallelization of the SOM is a combination of MPI and OpenMP.

First Part of the Experiment
The first part of the experiments was oriented towards an examination of the quality of the neural networks depending on the size of the delay. The dataset used is Weblogs. All the experiments in this section were carried out for 1000 epochs; the random initial values of the neuron weights in the first epoch were always set to the same values. The tests were performed for a SOM with a rectangular shape of 400 × 400 neurons. All three variants described in Sect. 4 are tested. If the variants Inc or Dec are used, the steps ζ are 0.1%, 0.01% and 0.005%. MQE values for limits of delay L equal to 5%, 10% and 20% are presented in Table 2.
The step size does not affect the variant Cons; therefore, this method has only one value instead of three in the above table.

Second Part of the Experiment
The second part of the experiments was oriented towards scalability. As in the previous test, the experiments were carried out for the three types of delay (increasing, decreasing and constant). The parallel version of the learning algorithm was run using 16, 32, 64, 128, 256, 512 and 1024 cores respectively. The achieved computing times are presented in Table 3 for step ζ = 0.1%, in Table 4 for step ζ = 0.01%, and in Table 5 for step ζ = 0.005%. The variant Cons is presented in all three tables (it is not affected by the step, so all three tables contain the same values for it) to allow a comparison of the resulting times. For comparison, the standard SOM algorithm (without any delay) takes 32:10:30 of computing time and its MQE is 0.4825.

Conclusion of Delay Experiments. In this section an evaluation of the above-described experiments can be found. The reason this evaluation is discussed in a separate part is that the overall evaluation of effectiveness cannot be based on the individual results alone; it is necessary to focus on a combination of the outcomes to find the optimal solution. From the first experiment, which was focused on the quality of the final neural network, we can deduce the following conclusions: 1. As we expected, with the increasing size of the local memory, the overall quality of the neural networks deteriorates. This behaviour is evident for all three types of delay.

2. The variant Cons was the worst in all three cases.
3. According to these results, the variants Inc and Dec fundamentally differ from each other. When we use the variant Dec, the subsequent decrease of the value of the delay deteriorates the quality of the neural networks; when we use the variant Inc, the quality of the neural network improves, although the change is not as significant as in Dec.
The second experiment was focused on the scalability and time consumption of the above variants. We summarize the results of the experiments as follows: 1. Even though the variant Cons is independent of the value of the step ζ, it still achieves the fastest computing time.
2. When the variant Cons or the variant Inc is used, the time difference between the delays (5%, 10% and 20%) is only a few percent, almost negligible. However, the variant Dec reaches time differences of up to 60%.
3. When 16 cores are used and the step is ζ = 0.1%, the variant Inc is much faster (more than twice) than the variant Dec. Again with 16 cores, at step ζ = 0.01% the times of the two variants are almost comparable. However, when step ζ = 0.005% is used, the variant Dec is slightly faster than the variant Inc.
If we look only at the individual results of the first experiment, the overall best results are obtained with the variant Dec (delay L = 5% and step ζ = 0.1%) and the worst results with the variant Cons. The second experiment shows that the variant Cons is the fastest and that the variant Inc is minimally affected by the delay amount. After comparing all the achieved results and the time required to compute them, we have identified as the best the variant Dec (delay L = 5% and step ζ = 0.01%), with a computing time of 0:35:07 and an MQE of 0.55741.
An example of the limit values reached for delay L = 5% can be seen in Fig. 5, which shows the methods Inc and Dec with steps ζ = 0.1%, 0.01% and 0.005%. The value of the step ζ has a major impact on the overall result, because it determines when the above methods reach the maximum permitted delay or, vice versa, when they reach the minimum delay.

Conclusion
The experiments have shown the possibility of speeding up the computation of the weight actualization while maintaining sufficient quality of the final neural network. The speed-up of the SOM algorithm is based on updating the weights only after several (delay L) input vectors; this is similar to Batch SOM, which updates the weights after a whole epoch. The actualization process of the variant Dec computes the weight values roughly at the beginning, and the subsequent computation becomes more accurate as the value of the delay decreases. Overall, the best results are achieved for the variant Dec with the smallest tested delay (L = 5%) and a medium step (ζ = 0.01%). This variant quickly approaches the standard SOM with weight actualization after each input vector. With the initial actualization for the smallest tested delay, the variant Dec is faster than the standard SOM (Dec for 16 cores, L = 5%, ζ = 0.01% takes 14:19:45, while the standard SOM for 16 cores takes 32:10:30). Further acceleration comes from massive parallelization, with the best time achieved for 512 cores (0:35:07). The variant with ζ = 0.005% is even faster, but its MQE is twice as big and therefore less accurate.