Improved Louvain Method for Directed Networks

Existing studies of community detection mainly focus on undirected networks; research on detecting community structure in directed networks is less extensive and less systematic. The Louvain Method is one of the best algorithms for community detection in undirected networks. In this study, an algorithm is proposed to detect community structure in large-scale directed networks. First, a definition of modularity for directed networks based on the community connection matrix is proposed. Second, equations for calculating the modularity gain in directed networks are derived. Finally, based on the idea of the Louvain Method, an algorithm for detecting communities in directed networks is proposed. Experiments show that the algorithm not only has clear advantages in both run time and accuracy of the detected communities, but can also obtain a multi-granularity community structure that reflects the self-similarity and hierarchical characteristics of complex networks. The experimental results indicate that the algorithm performs well in detecting community structure in large-scale directed networks.


Introduction
Many real systems can be modelled as complex networks, and community structure is common in such networks [1]. Community structure provides a mesoscale perspective for the study of complex networks and can be used to study the topology and dynamical behavior of networks. Community detection therefore has both theoretical significance and practical value.
There are two different types of networks: undirected and directed. The research of community detection was initially focused on undirected networks. Currently, there are many different methods to detect community structure in undirected networks [2]. In 2004, Newman et al. [3] proposed modularity, which was used originally to measure the accuracy of the results obtained by community detection algorithms and achieved great success in practice. Subsequently, the modularity optimization algorithms which take modularity as the objective function have also become one of the mainstream methods for community detection in undirected networks. Modularity optimization algorithms mainly include algorithms based on greedy strategy [4][5], extreme value optimization strategy [6], spectral clustering strategy [7] and algorithms which combine multiple strategies [8][9]. The Louvain Method (LM) [8] which integrates greedy strategy and hierarchical clustering strategy has been recognized by many scholars for its low time complexity and high-quality community detection results [10].
In contrast to research on community detection in undirected networks, there are few systematic studies on community detection in directed networks. Currently, there are three main methods of community detection in directed networks: The first method is to treat the directed network in the same way as an undirected network by ignoring the edge direction of the directed network. However, the direction of the edges in a directed network implies important information and ignoring the direction of the edges to detect community structure will cause inaccurate results [11].
The second method also converts directed networks into undirected networks, but the converted undirected network still contains the information of the direction of the edges. In [12], the authors transformed directed networks into bipartite undirected networks. Then they defined a new modularity for bipartite networks by modifying modularity of undirected networks. Finally, the partition with the largest modularity of bipartite networks is found as the community structure of the networks. This method is inefficient and unsuitable to deal with large-scale networks.
The third method is modularity optimization. In 2007, Leicht et al. [11] extended the modularity of undirected networks and proposed the modularity of directed networks, which makes it possible to design modularity optimization algorithms for directed networks based on the ideas of those for undirected networks. At present, the modularity optimization algorithms for directed networks mainly include the LN algorithm [11] and the LLQ algorithm [13]. LN and LLQ can handle directed networks directly, but both have high time complexity, and the results they obtain also lack accuracy.
This study proposes an algorithm called ILMDN (Improved Louvain Method for Directed Networks), based on the idea of the LM, to detect community structure in large-scale directed networks. ILMDN not only has low time complexity and high-precision community detection results, but can also obtain a multi-granularity community structure that reflects the self-similarity and hierarchical characteristics of complex networks.
The community structure of a network is a division of its node set. Based on this division, the network can be divided into several subgraphs such that nodes in the same subgraph are closely connected, while nodes in different subgraphs are sparsely connected. Each subgraph is called a community of the network.
Modularity is a mathematical definition of community structure. In a directed network, let $A = \{a_{vw}\}$ be its adjacency matrix, where $a_{vw}$ is the weight of the edge from node v to node w. $k_v^{out} = \sum_w a_{vw}$ is the sum of the weights of the edges starting from node v, simply referred to as the out-degree of node v. $k_w^{in} = \sum_v a_{vw}$ is the sum of the weights of the edges ending at node w, simply referred to as the in-degree of node w. With $m = \sum_{v,w} a_{vw}$ the total edge weight and $C_v$ the community to which node v belongs, the modularity of the directed network is defined as (1) [11]:

$$Q_d = \frac{1}{m} \sum_{v,w} \left[ a_{vw} - \frac{k_v^{out} k_w^{in}}{m} \right] \delta(C_v, C_w) \qquad (1)$$

where $\delta(C_v, C_w)$ equals 1 if nodes v and w belong to the same community and 0 otherwise.
High values of the modularity correspond to good divisions of a network into communities. A modularity optimization algorithm therefore takes the division with the highest modularity as the network's community structure, which makes it a feasible community detection technique. The LM is one of the best modularity optimization algorithms for undirected networks. Its main flow is shown in Fig. 1. The algorithm is iterative, and each iteration contains two sub-processes: modularity optimization and community aggregation. The LM has been recognized by many scholars and has been continuously improved. Many improvement strategies have been proposed to further reduce the running time of the LM and improve the accuracy of its results [14][15][16][17].
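As a minimal sketch of how the directed modularity of [11] can be evaluated, the following Python function computes it from a weighted edge list. The function name, the `(v, w, weight)` edge representation and the `community` dictionary are illustrative assumptions, not part of the original algorithm description.

```python
from collections import defaultdict

def directed_modularity(edges, community):
    """Directed modularity of a partition.

    edges:     list of (v, w, weight) tuples, one per directed edge v -> w
               (assumed representation).
    community: dict mapping every node to its community label.
    """
    edges = list(edges)
    m = sum(a for _, _, a in edges)            # total edge weight
    k_out = defaultdict(float)                 # weighted out-degree of each node
    k_in = defaultdict(float)                  # weighted in-degree of each node
    for v, w, a in edges:
        k_out[v] += a
        k_in[w] += a

    # Weight of edges whose two end points lie in the same community.
    internal = sum(a for v, w, a in edges if community[v] == community[w])

    # The sum of k_out[v] * k_in[w] over same-community pairs factorises
    # into (total out-degree) * (total in-degree) per community.
    out_c = defaultdict(float)
    in_c = defaultdict(float)
    for v, c in community.items():
        out_c[c] += k_out[v]
        in_c[c] += k_in[v]
    expected = sum(out_c[c] * in_c[c] for c in out_c)

    return internal / m - expected / (m * m)
```

Grouping the degree products by community avoids the quadratic loop over node pairs, so the cost is linear in the number of edges plus the number of nodes.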
It should be noted that modularity has some shortcomings in describing the community structure of the network [18][19]. Despite these deficiencies, modularity and modularity optimization algorithms are feasible in practical application.

Calculating Modularity Gain in Directed Networks
For a directed network, the modularity gain is the change in modularity caused by merging nodes or by removing a node from the community it belongs to. In this section, we first derive a definition of the directed network's modularity based on the community connection matrix of the network, and then derive the equations for calculating the modularity gain in directed networks.
Suppose that a directed network is divided into k communities. Its community connection matrix is the k × k matrix $E = \{e_{ij}\}$, where $e_{ij}$ is the sum of the weights of the edges that start from nodes in community i and end at nodes in community j. Fig. 2 shows a directed network with two communities together with its community connection matrix E. With $m = \sum_{i,j} e_{ij}$, (1) can be equivalently transformed into (2), which is the new definition of the directed network's modularity.

$$Q_d = \sum_{i=1}^{k} \left[ \frac{e_{ii}}{m} - \frac{\left(\sum_{l} e_{il}\right)\left(\sum_{l} e_{li}\right)}{m^2} \right] \qquad (2)$$

Suppose a directed network contains k communities and A is its community connection matrix. Merging community i and community j corresponds to adding the elements of the i-th row of A to the j-th row and the elements of the i-th column to the j-th column. After the merge, matrix A becomes matrix B. Using (2), calculate $Q_d^A$, the modularity of the network before merging, and $Q_d^B$, the modularity of the network after merging. Subtracting $Q_d^A$ from $Q_d^B$ gives the equation for the modularity gain after two communities are merged, as shown in (3).

$$\Delta Q_d = Q_d^B - Q_d^A = \frac{e_{ij} + e_{ji}}{m} - \frac{\left(\sum_{l} e_{il}\right)\left(\sum_{l} e_{lj}\right) + \left(\sum_{l} e_{jl}\right)\left(\sum_{l} e_{li}\right)}{m^2} \qquad (3)$$

Suppose the community to which node v belongs is $C_v$, and $C'_v$ is the community formed after removing node v and the edges connected to v from $C_v$. The transformation of matrix A into matrix B corresponds to the merging of nodes; conversely, the transformation of matrix B into matrix A corresponds to nodes leaving their community. Therefore, following the derivation of (3), the equation for the modularity gain after v is removed from $C_v$ can also be obtained, as shown in (4).

$$\Delta Q_d = -\frac{e_{v,C'_v} + e_{C'_v,v}}{m} + \frac{k_v^{out} \sum_{u \in C'_v} k_u^{in} + k_v^{in} \sum_{u \in C'_v} k_u^{out}}{m^2} \qquad (4)$$

Here $e_{v,C'_v}$ is the sum of the weights of the edges that start from node v and end at nodes within $C'_v$, and $e_{C'_v,v}$ is the sum of the weights of the edges that start from nodes within $C'_v$ and end at node v.
Suppose that community k is the community formed by merging community i and community j. Then the equations in (5) can be obtained from the transformation of matrix A into matrix B.

$$e_{kk} = e_{ii} + e_{ij} + e_{ji} + e_{jj} \qquad (5)$$

Suppose that community i is a sub-graph of community j, and community h is the community formed after removing community i from community j. Then the equations in (6) can be obtained from the transformation of matrix B into matrix A.
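The relation between the community connection matrix, the matrix form of the modularity in (2) and the merge gain in (3) can be checked numerically. The sketch below is an illustration under assumptions: the helper names (`modularity_from_matrix`, `merge`, `merge_gain`) and the list-of-lists matrix layout are hypothetical and chosen only for this example.

```python
def modularity_from_matrix(E):
    """Modularity computed from a community connection matrix E,
    where E[i][j] is the total weight of edges from community i to j."""
    k = len(E)
    m = sum(map(sum, E))
    q = 0.0
    for i in range(k):
        row = sum(E[i])                        # weight leaving community i
        col = sum(E[l][i] for l in range(k))   # weight entering community i
        q += E[i][i] / m - row * col / (m * m)
    return q

def merge(E, i, j):
    """Return the matrix obtained by adding row i to row j and column i
    to column j, then dropping row and column i."""
    k = len(E)
    B = [r[:] for r in E]
    for c in range(k):
        B[j][c] += B[i][c]
    for r in range(k):
        B[r][j] += B[r][i]
    return [[B[r][c] for c in range(k) if c != i] for r in range(k) if r != i]

def merge_gain(E, i, j):
    """Modularity gain of merging communities i and j, without building
    the merged matrix."""
    k = len(E)
    m = sum(map(sum, E))
    row = lambda c: sum(E[c])
    col = lambda c: sum(E[l][c] for l in range(k))
    return (E[i][j] + E[j][i]) / m - (row(i) * col(j) + row(j) * col(i)) / (m * m)

# Example matrix with three communities (made-up weights).
E = [[3.0, 1.0, 0.0],
     [0.0, 2.0, 1.0],
     [1.0, 0.0, 2.0]]
difference = modularity_from_matrix(merge(E, 0, 1)) - modularity_from_matrix(E)
print(difference, merge_gain(E, 0, 1))
```

The two printed values should agree up to floating-point error, which is the point of the closed-form gain: it needs only two off-diagonal entries and four row or column sums instead of a full recomputation of the modularity.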

The algorithm
The main procedure of ILMDN is shown in Algorithm 1. It is composed of two phases: the iteration phase and the refinement phase. The iteration phase corresponds to the LM algorithm and contains two sub-phases: modularity optimization (steps 3 to 15) and community combination (steps 18 to 21). The refinement phase corresponds to the improvement strategy proposed in [14] for undirected networks, which is extended here to make it suitable for directed networks. The flag Increase indicates whether any node moved during a traversal, and its initial value is true.
In the l-th iteration of the iteration phase, each node of $G_{l-1}$ is first assigned to its own community in steps 3~4, so the number of communities initially equals the number of nodes. Then $V_{l-1}$ is traversed repeatedly in steps 5~15 until the community of no node changes. In each traversal, there are three cases for the community ownership of node i. The first case corresponds to steps 10~11: removing node i from its community increases the value of $Q_d$, but merging node i into any neighboring community afterwards would reduce the value of $Q_d$. The second case corresponds to steps 12~13: removing node i from its community and then merging it into the community MaxCid increases the value of $Q_d$ the most. The third case corresponds to steps 14~15: removing node i from its community would decrease the value of $Q_d$, and then merging node i into any neighboring community would reduce the value of $Q_d$ further. In steps 18~21, the edge set $E_l$ and the mapping $w_l$ are generated by traversing $G_{l-1}$ once. Note that in each iteration no node moves during the last traversal of the node set of the input network. Therefore, if an iteration traverses the node set of the input network only once, the iteration phase is terminated and the refinement phase is performed (steps 16~17). The main flow of the refinement phase is shown in Function 1. The refinement phase is itself an iterative process.
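The iteration phase described above can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' Algorithm 1: the function names, the edge-list representation, the tolerance `EPS` and the handling of the three cases (leave the node isolated, move it to the best neighboring community, or keep it where it is) are assumptions, and the refinement phase is omitted.

```python
from collections import defaultdict

EPS = 1e-12  # assumed tolerance: only strict improvements trigger a move

def local_moving(nodes, edges):
    """Modularity optimization on one level; edges are (v, w, weight) for v -> w.
    Returns (node -> community label, whether any node moved)."""
    m = sum(a for _, _, a in edges)
    out_nb, in_nb = defaultdict(dict), defaultdict(dict)
    k_out, k_in = defaultdict(float), defaultdict(float)
    for v, w, a in edges:
        out_nb[v][w] = out_nb[v].get(w, 0.0) + a
        in_nb[w][v] = in_nb[w].get(v, 0.0) + a
        k_out[v] += a
        k_in[w] += a
    comm = {v: i for i, v in enumerate(nodes)}      # every node starts alone
    tot_out = defaultdict(float, {comm[v]: k_out[v] for v in nodes})
    tot_in = defaultdict(float, {comm[v]: k_in[v] for v in nodes})
    next_label, moved, increase = len(nodes), False, True
    while increase:
        increase = False
        for v in nodes:
            old = comm[v]
            link = defaultdict(float)   # weight between v and each community
            for w, a in out_nb[v].items():
                if w != v:
                    link[comm[w]] += a
            for u, a in in_nb[v].items():
                if u != v:
                    link[comm[u]] += a
            tot_out[old] -= k_out[v]    # take v out of its community
            tot_in[old] -= k_in[v]

            def gain(c):
                # gain of merging the isolated node v into community c
                expected = k_out[v] * tot_in[c] + k_in[v] * tot_out[c]
                return link.get(c, 0.0) / m - expected / (m * m)

            best, best_gain = old, gain(old)
            for c in list(link):
                g = gain(c)
                if g > best_gain + EPS:
                    best, best_gain = c, g
            if best_gain < -EPS:        # every merge lowers Q: keep v isolated
                best, next_label = next_label, next_label + 1
            tot_out[best] += k_out[v]
            tot_in[best] += k_in[v]
            comm[v] = best
            if best != old:
                increase = moved = True
    return comm, moved

def aggregate(edges, comm):
    """Community combination: communities become nodes, weights are summed."""
    weight = defaultdict(float)
    for u, v, a in edges:
        weight[(comm[u], comm[v])] += a
    new_nodes = sorted(set(comm.values()))
    return new_nodes, [(cu, cv, a) for (cu, cv), a in weight.items()]

def iteration_phase(nodes, edges):
    """Repeat optimization and combination until an iteration moves no node."""
    levels = []
    while True:
        comm, moved = local_moving(nodes, edges)
        if not moved:
            return levels
        levels.append(comm)
        nodes, edges = aggregate(edges, comm)
```

Each entry of the returned list maps the nodes of one level to their communities, so the list as a whole holds the intermediate partitions. Edges inside a community become self-loops of the aggregated node, which keeps the in-degree and out-degree totals unchanged across levels.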

The time complexity and space complexity of ILMDN
Let l denote the number of iterations in the iteration phase and t denote the number of traversals of the node set of the input network in each iteration. Numerous experiments show that l and t are constants independent of the scale of the input network [8][14][15][16][17], and their values are small. In addition, the variables needed when calculating the modularity gain can be updated in real time according to the equations in (5) and (6)

Experimental Comparison and Analysis
To verify the performance of ILMDN, comparison experiments between ILMDN, LN and LLQ were performed on the six directed networks in Table 1.
Table 1. The six directed networks used in the experiments.
Network | Description | Nodes | Edges
Directed LM Network | Generated by converting every undirected edge of the LM Network shown in Fig. 1 into two directed edges in opposite directions. | |
Directed Karate Clubs Network | Generated by converting every undirected edge of the Karate Club Network [20] into two directed edges in opposite directions. | |
Wikipedia Vote Network | Constructed by SNAP using Wikipedia user voting data [21]. | 7115 | 103689
Email Communication Network | Constructed by SNAP using Enron's Email Dataset [21]. | 265214 | 420045
Bank Customer Transaction Network | Generated using customer transaction records of a commercial bank in the first quarter of 2015. Nodes are customers' accounts, and the weight of edge A → B is the cumulative number of transactions from account A to account B. | |
Wikipedia Talk Network | Constructed by SNAP using Wikipedia page edit data [21]. | 2394385 | 5021410

First, ILMDN has the deficiency that it is sensitive to the input order of the nodes of the initial network; that is, different node input orders can result in different community structures. However, the differences between these community structures are insignificant, and most of the results obtained with random node input orders are close to, or equal to, the best community structure of the network. For example, when ILMDN was run 1000 times on the Directed LM Network, the three different results shown in Fig. 3 were obtained. To simplify the display, the two directed edges between two nodes in the network were replaced by one undirected edge in the figure. From left to right, the three results appeared 2, 84 and 914 times, respectively. There is little difference between the three results, and the rightmost result, which appeared most often, is the best community structure of the network.
Table 2 shows the performance of LN, LLQ and ILMDN for community detection in the six directed networks. For each algorithm and network, the table displays the modularity achieved and the computation time T (in seconds). Because ILMDN is sensitive to the input order of the nodes, ILMDN was run repeatedly on each network: 1000 times on each of the first three networks and 300 times on each of the latter three networks. For each network, the maximum, minimum and average computation times ($T_{max}$, $T_{min}$, $T_{ave}$, respectively) consumed by ILMDN were recorded, as were the maximum, minimum and average values of modularity ($Q_d^{max}$, $Q_d^{min}$, $Q_d^{ave}$, respectively) achieved by ILMDN. In the table, the time "0.0" indicates that the algorithm took less than one millisecond on a small network, and "-" indicates that the algorithm failed to detect the community structure of a large network within one hour.
As can be seen from the table, the computation time of ILMDN was far less than that of LN and LLQ. Even for the network with more than two million nodes, ILMDN detected the community structure in about a minute. In terms of modularity, first, for each network, the maximum modularity achieved by ILMDN was greater than or equal to the modularity achieved by LN and LLQ. Second, because the result that ILMDN obtains with a random node input order is close to, or equal to, the best community structure of the network, the average modularity ILMDN achieved on each network was close to the maximum value. Finally, on some networks, the smallest modularity ILMDN achieved was still greater than the modularity LLQ obtained. In summary, compared with LN and LLQ, ILMDN has clear advantages in terms of computation time and accuracy of the detected communities. ILMDN was also compared to the LM.
It is found that the community structure ILMDN detected in Directed LM Network is the same as the community structure the LM mined in LM Network, and the results are the same for Directed Karate Clubs Network and Karate Clubs Network. The result further verifies the accuracy of ILMDN.
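One reason the comparison with the LM is meaningful is that, for a directed network built by replacing every undirected edge with two opposite directed edges, the directed modularity of [11] coincides with the standard undirected modularity of the original network for any partition. The sketch below checks this numerically on a small made-up graph without self-loops. All function names and the example graph are illustrative assumptions.

```python
from collections import defaultdict

def to_directed(undirected_edges):
    """Replace each undirected edge (u, v, w) by the pair u -> v and v -> u."""
    directed = []
    for u, v, w in undirected_edges:
        directed.append((u, v, w))
        directed.append((v, u, w))
    return directed

def undirected_modularity(edges, community):
    """Undirected modularity: sum over communities of
    L_c / M - (D_c / 2M)^2, assuming no self-loops."""
    M = sum(w for _, _, w in edges)
    inside = defaultdict(float)   # L_c: edge weight inside community c
    degree = defaultdict(float)   # D_c: total degree of community c
    for u, v, w in edges:
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            inside[community[u]] += w
    return sum(inside[c] / M - (degree[c] / (2 * M)) ** 2 for c in degree)

def directed_modularity(edges, community):
    """Directed modularity for edges given as (v, w, weight) with v -> w."""
    m = sum(a for _, _, a in edges)
    inside = defaultdict(float)
    out_c = defaultdict(float)
    in_c = defaultdict(float)
    for v, w, a in edges:
        out_c[community[v]] += a
        in_c[community[w]] += a
        if community[v] == community[w]:
            inside[community[v]] += a
    labels = set(out_c) | set(in_c)
    return sum(inside[c] / m - out_c[c] * in_c[c] / (m * m) for c in labels)

# Two triangles joined by one edge (made-up example).
undirected = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
              (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 1.0)]
partition = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
q_undirected = undirected_modularity(undirected, partition)
q_directed = directed_modularity(to_directed(undirected), partition)
print(q_undirected, q_directed)
```

Doubling every edge doubles both the total weight and the internal weight of each community, and makes the in-degree and out-degree totals of each community equal to its undirected degree, so the two values agree for every partition.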
In addition, ILMDN can provide a multi-granularity community structure based on the intermediate partitions found at each iteration. The multi-granularity community structure of the largest isolated subgraph of the Bank Customer Transaction Network is shown in Fig. 4. There are 37,979 isolated subgraphs in the Bank Customer Transaction Network; the largest contains 179,759 nodes and the others are much smaller (the second largest contains only 121 nodes). To display the results more clearly, only the multi-granularity community structure of the largest isolated subgraph is shown in Fig. 4. ILMDN performed a total of 6 iterations on the largest isolated subgraph. The community structures detected from the first iteration to the fifth iteration are shown in Fig. 4(a) to Fig. 4(e). The result of the sixth iteration is the same as that of the fifth iteration and is not shown. Fig. 4(f) shows the community structure detected after the refinement phase. Each chart in Fig. 4 shows the modularity of the result, the total number of communities and the nine communities with the most nodes. In every chart, the circles indicate the communities, and the numbers in the circles indicate the numbers of nodes in the communities. Fig. 5 shows the internal structure of a community generated by the second iteration. There are five communities in the figure, all of which were generated by the first iteration. The multi-granularity community structure embodies the self-similarity and hierarchical characteristics of complex networks. As can also be seen from Fig. 4, the modularity gain obtained by the first iteration is significantly higher than that obtained by the other iterations.
The community structure detected after the refinement phase had the same number of communities as the community structure detected after the iteration phase, but the community ownership of some nodes changed, so that the final modularity achieved by ILMDN was improved.
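A multi-granularity view of this kind can be derived by composing the partitions found at successive iterations. The sketch below assumes that each iteration yields a dictionary mapping the nodes of that level (the communities of the previous level) to community labels; this data layout and the function names are illustrative assumptions rather than the paper's data structures.

```python
from collections import Counter

def flatten_levels(levels):
    """levels[0] maps original nodes to first-level communities;
    levels[l] maps communities of level l-1 to communities of level l.
    Returns, for every level, a dict from original node to its community."""
    result = []
    current = {v: v for v in levels[0]}          # each node maps to itself
    for mapping in levels:
        current = {v: mapping[c] for v, c in current.items()}
        result.append(current)
    return result

def largest_communities(membership, top=9):
    """Sizes of the `top` communities with the most original nodes."""
    return Counter(membership.values()).most_common(top)

# Made-up two-level hierarchy over five nodes.
levels = [
    {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2},    # first iteration
    {0: 0, 1: 0, 2: 1},                          # second iteration
]
for depth, membership in enumerate(flatten_levels(levels), start=1):
    print(depth, largest_communities(membership))
```

Keeping one small mapping per level, instead of a full node-to-community table per level, means the hierarchy costs little extra memory, while the flattened tables can be produced on demand for charts such as the per-iteration community sizes.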

Conclusion and Discussion
In this paper, we proposed an algorithm named ILMDN for detecting community structure in directed networks. ILMDN not only has linear time complexity, but also achieves higher accuracy in its community detection results. ILMDN can also obtain a multi-granularity community structure, which contains important information for subsequent