LFGCN: Levitating over Graphs with Levy Flights

We propose a new Lévy Flights Graph Convolutional Networks (LFGCN) method for semi-supervised learning, which casts Lévy Flights into random walks on graphs and, as a result, both accounts more accurately for the intrinsic graph topology and substantially improves classification performance, especially for heterogeneous graphs. Furthermore, we propose a new preferential P-DropEdge method based on the Girvan-Newman argument. That is, in contrast to the uniform removal of edges in DropEdge, we follow the Girvan-Newman algorithm: we detect network periphery structures using information on edge betweenness and then remove edges according to their betweenness centrality. Our experimental results on semi-supervised node classification tasks demonstrate that LFGCN coupled with P-DropEdge accelerates training, increases stability, and further improves the predictive accuracy of the learned graph topology structure. Finally, in our case studies we bring the machinery of LFGCN and other deep network tools to the analysis of power grid networks, an area where the utility of geometric deep learning (GDL) remains untapped.


I. INTRODUCTION
Adaptation of deep learning (DL) to graphs and other non-Euclidean objects has recently attracted ever increasing interest, leading to the new subfield of geometric deep learning (GDL). In particular, GDL is an emerging direction in machine learning which generalizes concepts of deep learning to data in non-Euclidean spaces, e.g., graphs and manifolds, by bridging the gap between graph theory and deep neural networks [1,2,3].
Many such DL approaches for non-Euclidean objects are based on the idea of a convolution operation in the spectral domain with a suitably chosen nonlinear trainable filter [see an overview by 4]. As a result, node features are mapped into some Euclidean space. Next, graph filters are approximated with various finite order polynomials, e.g., Chebyshev polynomials [i.e., the ChebNet model family of 5,6], the Cayley transform [i.e., CayleyNet of 7], or the generalization of polynomial filters in the form of Auto-Regressive Moving Average (ARMA) models [8]. However, deep learning approaches based on approximation with finite order polynomials tend to be non-robust to even minor changes in the graph structure and to largely disregard the local graph topology, which often plays a critical role for learning on heterogeneous graphs. In contrast, as noted by [8], one of the primary benefits of ARMA filters over polynomial ones is that ARMA filters are not computed in the Fourier space induced by a graph Laplacian; as a result, ARMA filters are local in the node space and can capture the underlying graph topology more flexibly and accurately.
Further advancing the localized approaches in GDL, we propose a new fractional generalized graph-based convolutional filter for semi-supervised learning which casts Lévy Flights into random walks on graphs. As a result, our new Lévy Flights Graph Convolutional Network (LFGCN) method accounts more accurately for the intrinsic local graph topology and substantially improves classification performance, especially for heterogeneous graphs.
To fly or not to fly, and if to fly, why take a Lévy Flight? A Lévy Flight is a random process with a scale-free, Lévy stable jump length distribution. Due to this scale-free character, throughout the graph exploration we move randomly according to a power-law distribution for hops, rather than with integer hops as in a standard random walk. As a result, Lévy Flights deliver more accurate and efficient search strategies, especially in sparse environments, compared to other types of random walks [9]. While the superiority of Lévy Flights as a primary search strategy has been proven in a broad range of settings, the utility of Lévy Flights in graph-based learning remains unexplored. Hence, Lévy Flights offer a new learning perspective with multiple advantages compared to the currently available architectures. First, due to their fractal character, Lévy Flights combine local graph exploration with long-range excursions, which reduces oversampling compared to a normal random walk (i.e., there is a lower probability of revisiting nodes we have already seen). Second, Lévy Flights can directly reach long-distance nodes without the intervention of intermediate nodes. Third, based on average global time, the average return probability of Lévy Flights follows a power law, i.e., $p_0^{\gamma}(t) \sim t^{-1/(2\gamma)}$, and is lower than the average return probability of a normal random walk, $p_0^{1}(t) \sim t^{-1/2}$, thereby leading to more efficient graph exploration. Fourth, Lévy Flights are known to exhibit particularly high utility for unbalanced and directed data, which can explain the higher LFGCN accuracy we have obtained in directed networks.
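As a rough illustration of the third point, the two scaling laws can be compared at a single time step. The value γ = 1/2 and the time t = 100 are chosen here purely for illustration, and constant factors hidden by the ∼ relation are ignored, so the numbers are indicative only.

```latex
\[
p_0^{\gamma}(t) \sim t^{-1/(2\gamma)}
\quad\Longrightarrow\quad
p_0^{1/2}(t) \sim t^{-1},
\qquad
p_0^{1}(t) \sim t^{-1/2}.
\]
\[
t = 100:\qquad t^{-1} = 0.01 \;<\; t^{-1/2} = 0.1 .
\]
```

Under these assumptions, a smaller γ makes the return probability decay faster, which is the sense in which the Lévy Flight revisits its starting node less often.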
In addition, to abate over-fitting and over-smoothing in GDL, we develop a new preferential P-DropEdge method based on censoring edge order statistics at each training epoch. Our P-DropEdge idea is inspired by the recent DropEdge algorithm of [10] and is rooted in nonparametric methods, specifically, various censoring schemes for statistical inference on order statistics [11]. In contrast to uniformly removing edges as in DropEdge, we follow the Girvan-Newman argument and target edges that tend to contribute more to the intrinsic graph topology. That is, we randomly remove edges with higher betweenness centrality, or the corresponding higher edge order statistics. The intuition is the following. In both P-DropEdge and DropEdge the goal is to introduce randomness in the network structure. If we are to learn international political networks with GDL, DropEdge largely tends to remove connections among individual citizens, while P-DropEdge randomly censors collaboration links among Presidents and Prime Ministers. Removal of such targeted connections is likely to lead to higher perturbation effects. We investigate the utility of the new P-DropEdge approach vs. DropEdge in conjunction with LFGCN and with GMNN of [12] (the best performing baseline).

To appear in the 2020 IEEE International Conference on Data Mining (ICDM). arXiv:2009.02365v1 [cs.LG], 4 Sep 2020.
Significance of our contributions can be summarized as follows:
• SOTAs on all 4 considered directed networks.
• The proposed architecture of LFGCN uses three state-of-the-art operations: gated max-average pooling, a residual block, and P-DropEdge. We provide an ablation study and investigate the contribution of each component to the resulting classification accuracy, as well as explore the sensitivity of the overall system architecture to (hyper)parameter settings.
• We provide theoretical foundations behind the proposed LFGCN architecture and show that it leads to significant gains in training convergence and model output stability.
• The developed preferential P-DropEdge, based on censoring of higher edge betweenness order statistics, is shown to exhibit utility in other GCN methods and, hence, might be applicable in broader GDL settings.
• Last but not least, while validating our LFGCN methodology, we bring GDL concepts to the analysis of power grid networks, i.e., an area of critical societal importance where, to the best of our knowledge, the GDL machinery has not yet been applied.
II. RELATED WORK
Many earlier semi-supervised learning approaches on graphs, e.g., Gaussian mixture models, co-training, harmonic functions, and label propagation, tend to employ only the label information (i.e., labeled instances) for training models based on the smoothness assumption over the labels [13] and to largely disregard the underlying graph structure. To enhance performance, several learning methods on graphs propose to incorporate intrinsic "graph-based" information by designing a classifying function via generalizing the normalized cut and adding a smooth function with respect to the intrinsic structure [14,15]. An optimization framework of [16] generalizes these approaches by considering the above two methods as particular cases. However, the major criticism of these graph-based semi-supervised learning methods is that important information contained in graph edges is largely disregarded.
To address these limitations, [5] propose ChebNet, a formulation of convolutional neural networks (CNN) based on spectral graph theory. ChebNet employs approximation via finite order polynomials and is based on the Chebyshev expansion for fast filtering instead of the expensive eigen-decomposition. Graph Convolutional Networks (GCN) of [6] simplify ChebNet while further addressing the vanishing gradient problem and reducing the number of parameters to optimize. Other related approaches to graph learning with deep neural networks include, for instance, mixture model networks (MoNet) [2], graph attention networks (GAT) [17], graph convolutional recurrent networks [18], dual graph convolutional networks [19], FastGCN [20], and a simplified version of GCN [21]. By directly powering the graph Laplacian, GCNs based on random walks, such as approximate personalized propagation of neural predictions (APPNP) [22], variable power network (VPN) [23], and MixHop [24], can learn the relationships among multi-hop neighborhoods.
To extend the success of GCN on undirected graphs to directed graphs, MotifNet of [25] replaces the normalized Laplacian with the motif Laplacian in a multivariate polynomial filter, where the motif information can help capture the network structure. Finally, the most recent approach of [8] provides more flexible responses than GCN by using parallel and periodic concatenations of the convolutional kernel via the ARMA filter. As a result, the ARMA approach, which is applicable to both directed and undirected networks, incorporates the underlying local graph structure into the graph learning process more accurately. For a recent comprehensive overview of GCNs see [4].

III. METHODOLOGY
Consider a graph structure $G = \{V, E, W\}$, where $V$ is a node set with cardinality $|V| = N$ and $E \subseteq V \times V$ is an edge set. An $N \times N$ matrix $W$ with entries $\{\omega_{ij}\}_{1 \le i,j \le N}$ represents the adjacency matrix of $G$, that is, $\omega_{ij} \neq 0$ for any $e_{ij} \in E$ and $\omega_{ij} = 0$ otherwise. For an undirected graph $G$, $W = W^\top$. In reality, however, undirected graphs are often simplified representations of complex directed networks. If $G$ is directed, we substitute $W$ with $(W + W^\top)/2$.
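The symmetrization step for directed graphs can be sketched in a few lines. This is a minimal illustration that assumes a dense NumPy adjacency matrix; the function name `symmetrize` and the 3-node example graph are hypothetical and are not taken from the paper.

```python
import numpy as np

def symmetrize(W):
    """Return (W + W^T) / 2 for a square adjacency matrix W."""
    W = np.asarray(W, dtype=float)
    if W.ndim != 2 or W.shape[0] != W.shape[1]:
        raise ValueError("W must be a square matrix")
    return (W + W.T) / 2.0

# Hypothetical directed graph: 0 -> 1 with weight 2, 1 -> 2 with weight 1.
W_directed = np.array([[0, 2, 0],
                       [0, 0, 1],
                       [0, 0, 0]])
W_sym = symmetrize(W_directed)
```

Each directed edge weight is split evenly between the two directions, so the total edge weight of the graph is preserved.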
Let $Q \in \mathbb{Z}_{>0}$ be the number of different node features associated with each node $v \in V$. Then, an $N \times Q$ feature matrix $X$ serves as an input to a semi-supervised learning algorithm. To classify $N$ data points into $K$ classes (communities), we define an $N \times K$ label matrix $Y$ such that $Y_{ik} = 1$ if vertex $i$ is labeled as class $k$, and 0 otherwise. Here we refer to each column $Y_{\bullet k}$ of matrix $Y$ as a labeling function. Finally, we define an $N \times K$ matrix $F$ whose columns $F_{\bullet k}$ are referred to as classification functions.
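The construction of the label matrix can be illustrated as follows. The sketch assumes that unlabeled nodes are encoded as `None` and therefore receive an all-zero row; that encoding and the helper name `label_matrix` are illustrative choices, not part of the paper.

```python
import numpy as np

def label_matrix(labels, num_classes):
    """Build the N x K matrix Y with Y[i, k] = 1 if node i is labeled as class k.

    labels[i] is the class index of node i, or None if node i is unlabeled
    (its row of Y then stays all zeros).
    """
    Y = np.zeros((len(labels), num_classes))
    for i, k in enumerate(labels):
        if k is not None:
            Y[i, k] = 1.0
    return Y

# Four nodes, two classes; nodes 1 and 3 are unlabeled.
Y = label_matrix([0, None, 1, None], num_classes=2)
```

Column `Y[:, k]` then plays the role of the labeling function for class k.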

A. Graph signal processing
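Given the adjacency matrix $W$ of $G$, let $D$ be the degree matrix with $d_{ii} = \sum_{j=1}^{N} w_{ij}$, and let $L = U \Lambda U^\top$ be the Standard Laplacian matrix. Here $\Lambda = \mathrm{diag}(\lambda_0, \dots, \lambda_{N-1})$ is the diagonal matrix of eigenvalues and $U = [u_0, \dots, u_{N-1}]$ is the matrix of eigenvectors.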
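These quantities can be computed directly for a small graph. The sketch below assumes the usual definition $L = D - W$ of the Standard Laplacian for a symmetric $W$ (the text above only gives its eigendecomposition); the function name and the 3-node path graph are illustrative.

```python
import numpy as np

def laplacian_spectrum(W):
    """Return (D, L, eigenvalues, U) for a symmetric adjacency matrix W."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))      # d_ii = sum_j w_ij
    L = D - W                       # assumption: Standard Laplacian L = D - W
    lam, U = np.linalg.eigh(L)      # ascending eigenvalues, orthonormal columns
    return D, L, lam, U

# Path graph on 3 nodes: 0 - 1 - 2.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
D, L, lam, U = laplacian_spectrum(W)
```

`np.linalg.eigh` is used because $L$ is symmetric, which guarantees real eigenvalues and an orthonormal $U$.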
In the following, we revisit three popular semi-supervised learning methods (graph-based semi-supervised learning, fractional graph-based semi-supervised learning, and graph convolutional networks) and gain new insights for improving their modeling capabilities.
Graph-based semi-supervised learning
Graph-based semi-supervised learning (G-SSL) has received much attention in recent years as an alternative approach to the popular paradigm of supervised learning. G-SSL develops a generalized optimization framework, which has three particular cases: (i) the Standard Laplacian (SL); (ii) the Normalized Laplacian (NL); and (iii) PageRank (PR). The general idea of G-SSL is based on two widely used optimization frameworks. The first, the SL-based formulation [15], is as follows:

$$\min_{F} \Big\{ \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} \, \| F_{i\bullet} - F_{j\bullet} \|^2 + \mu \sum_{i=1}^{N} d_{ii} \, \| F_{i\bullet} - Y_{i\bullet} \|^2 \Big\},$$

where $d_{ii}$ is the $(i, i)$-element of the degree matrix $D$ and $w_{ij}$ represents the weight of edge $e_{ij}$ in the adjacency matrix $W$.
The second, the NL-based formulation [14], is as follows:

$$\min_{F} \Big\{ \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} \, \Big\| \frac{F_{i\bullet}}{\sqrt{d_{ii}}} - \frac{F_{j\bullet}}{\sqrt{d_{jj}}} \Big\|^2 + \mu \sum_{i=1}^{N} \| F_{i\bullet} - Y_{i\bullet} \|^2 \Big\}.$$

The following lemma [16] characterizes the generalized optimization framework, i.e., G-SSL, which has the two above-mentioned formulations as particular cases.

Lemma 1. Let σ denote an alternative parameter on the power of the degree matrix $D$, whose entries are the degrees $d_{ii}$, and let $0 \le \sigma \le 1$. Then the classification functions for the generalized semi-supervised learning are given by

$$F_{\bullet k} = \frac{\mu}{2 + \mu} \Big( I - \frac{2}{2 + \mu} D^{-\sigma} W D^{\sigma - 1} \Big)^{-1} Y_{\bullet k}, \qquad k = 1, \dots, K.$$

Proof of Lemma 1 is in Appendix VII. The optimization formulation $S(F)$ is given by the following expression:

$$S(F) = \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} \, \| d_{ii}^{\sigma - 1} F_{i\bullet} - d_{jj}^{\sigma - 1} F_{j\bullet} \|^2 + \mu \sum_{i=1}^{N} d_{ii}^{2\sigma - 1} \, \| F_{i\bullet} - Y_{i\bullet} \|^2, \qquad (1)$$

where µ is a regularization parameter. Minimization of the first term in (1) corresponds to the idea that if two nodes are close in the graph with respect to some metric, they should belong to the same class; by minimizing the second term we aim to bring the classification function $F_{\bullet k}$ as close as possible to the labeling function $Y_{\bullet k}$. Eq. (1) allows us to obtain the Standard Laplacian formulation (σ = 1), the Normalized Laplacian formulation (σ = 0.5), and the PageRank formulation (σ = 0).
The objective of the generalized optimization framework for G-SSL is a convex function, and the corresponding classification function is

$$F = (1 - \alpha) \big( I - \alpha D^{-\sigma} W D^{\sigma - 1} \big)^{-1} Y, \qquad \alpha = \frac{2}{2 + \mu}. \qquad (2)$$

By tuning the parameter σ on the power of the degree matrix $D$, we obtain the three above-mentioned particular semi-supervised learning methods:

SL (σ = 1): $F = (1 - \alpha)(I - \alpha D^{-1} W)^{-1} Y$,
NL (σ = 0.5): $F = (1 - \alpha)(I - \alpha D^{-1/2} W D^{-1/2})^{-1} Y$,
PR (σ = 0): $F = (1 - \alpha)(I - \alpha W D^{-1})^{-1} Y$.

In these formulations, the classification function $F$ is a closed-form solution based on the theory of random walks on graphs, which in turn provides a connection to the probabilistic interpretation of G-SSL. Parameter α controls the strength of the ground truth label matrix $Y$ in the generalized optimization framework.
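A small numerical sketch may help make the closed-form classifier concrete. It assumes the closed form $F = (1 - \alpha)(I - \alpha D^{-\sigma} W D^{\sigma - 1})^{-1} Y$ with $\alpha = 2/(2 + \mu)$, as in the generalized framework of [16]; the function name `gssl_classify`, the default parameter values, and the 4-node path graph are illustrative assumptions.

```python
import numpy as np

def gssl_classify(W, Y, alpha=0.5, sigma=0.5):
    """F = (1 - alpha) (I - alpha D^{-sigma} W D^{sigma-1})^{-1} Y (assumed form)."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    if np.any(d <= 0):
        raise ValueError("every node needs positive degree")
    # D^{-sigma} W D^{sigma-1}: scale rows by d^{-sigma}, columns by d^{sigma-1}.
    A = (d ** -sigma)[:, None] * W * (d ** (sigma - 1.0))[None, :]
    n = W.shape[0]
    # Solve the linear system instead of forming the inverse explicitly.
    return (1.0 - alpha) * np.linalg.solve(np.eye(n) - alpha * A, Y)

# Path graph 0 - 1 - 2 - 3; node 0 is labeled class 0, node 3 is labeled class 1.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
Y = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)
F = gssl_classify(W, Y, alpha=0.5, sigma=1.0)   # sigma = 1: SL-based case
pred = F.argmax(axis=1)                          # predicted class per node
```

Each unlabeled node is assigned to the class of the labeled node that a discounted random walk reaches more easily, which is the probabilistic reading of the closed form.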
Fractional graph-based semi-supervised learning
To improve the classification performance of G-SSL (in particular, for fuzzy graphs and unbalanced labeled data), fractional graph-based semi-supervised learning [26] embeds Lévy Flights into random walks on graphs by constructing them from powers of the Laplacian matrix, i.e., the $L^\gamma$ operator. This operation can be used to generate different transition probabilities (i.e., corresponding to a stochastic adjacency matrix) for different γ values. Intuitively, embedding Lévy Flights into random walks allows for better capturing mixing properties (i.e., dependence) in the data. Based on a fractional Laplacian matrix with $0 < \gamma \le 1$, anomalous (fractional) diffusion processes on networks can be constructed from the spectrum and eigenvectors of the Laplacian matrix. The fractional powers of $L$ allow Lévy random walks with long-range navigation on a network. For example, a random walker can move directly from node $u$ to node $v$ with transition probability $m^{(\gamma)}_{u \to v}$, where $m^{(\gamma)}_{u \to v}$ is an element of the fractional transition matrix $M^{(\gamma)}$. The transition probability $m^{(\gamma)}_{u \to v}$ between any two nodes whose geodesic distance is not infinite can be summarized as follows:

$$m^{(\gamma)}_{u \to v} = \delta_{uv} - \frac{(L^\gamma)_{uv}}{k_u^{(\gamma)}}, \qquad (3)$$

where $\delta_{uv}$ is the Kronecker delta and $k_u^{(\gamma)} \equiv (L^\gamma)_{uu}$ denotes the fractional degree of node $u$. Eq. (3) provides the transition probabilities for the Lévy Flights. Unlike the standard random walk, the Lévy Flights can jump immediately over several hops in a graph. This feature makes Lévy Flights a very effective graph exploratory process. Lemma 2 makes this statement formal. Proof of Lemma 2 is in Appendix VIII. There is a price to pay for this: the typically sparse transition probability matrix becomes non-sparse. We can mitigate non-sparsity by taking a reasonable number of principal eigenvectors or by limiting the number of terms in the Taylor expansion. Replacing the operator $L$ with $L^\gamma = U \Lambda^\gamma U^\top$ yields the new optimization formulation $S^*(F)$, Eq. (4). For $0 < \gamma \le 1$, the closed-form solution of (4) is then obtained for each $k = 1, \dots, K$. As in G-SSL, this yields three particular fractional semi-supervised learning methods, the fractional counterparts of the SL-, NL-, and PR-based methods.
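The fractional construction can be sketched numerically. The first two functions implement $L^\gamma = U \Lambda^\gamma U^\top$ and the transition rule $m^{(\gamma)}_{u \to v} = \delta_{uv} - (L^\gamma)_{uv} / (L^\gamma)_{uu}$, assuming $L = D - W$. The third function is only an assumed analogue of the G-SSL closed form, obtained by substituting a fractional degree matrix $D_\gamma = \mathrm{diag}(L^\gamma)$ and a fractional adjacency $W_\gamma = D_\gamma - L^\gamma$; these two symbols and all function names are hypothetical and may differ from the notation of [26].

```python
import numpy as np

def fractional_laplacian(W, gamma):
    """L^gamma = U diag(lambda^gamma) U^T for L = D - W (assumed definition)."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W
    lam, U = np.linalg.eigh(L)
    lam = np.clip(lam, 0.0, None)        # remove tiny negative round-off
    return (U * lam ** gamma) @ U.T      # scales column j of U by lam_j^gamma

def levy_transition_matrix(W, gamma):
    """M[u, v] = delta_uv - (L^gamma)_uv / (L^gamma)_uu."""
    Lg = fractional_laplacian(W, gamma)
    k = np.diag(Lg)                      # fractional degrees k_u^(gamma)
    return np.eye(len(k)) - Lg / k[:, None]

def fractional_gssl_classify(W, Y, gamma, alpha=0.5, sigma=0.5):
    """Assumed fractional analogue of the G-SSL closed form (see lead-in)."""
    Lg = fractional_laplacian(W, gamma)
    d = np.diag(Lg)                      # hypothetical D_gamma
    Wg = np.diag(d) - Lg                 # hypothetical W_gamma
    A = (d ** -sigma)[:, None] * Wg * (d ** (sigma - 1.0))[None, :]
    return (1.0 - alpha) * np.linalg.solve(np.eye(len(d)) - alpha * A, Y)

# Path graph 0 - 1 - 2 - 3: nodes 0 and 3 are three hops apart.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)
M_rw = levy_transition_matrix(W, 1.0)    # gamma = 1: ordinary random walk
M_lf = levy_transition_matrix(W, 0.5)    # gamma = 0.5: Lévy flight
F_frac = fractional_gssl_classify(W, Y, gamma=0.5, alpha=0.5, sigma=1.0)
```

With γ = 1 the walker can only step to adjacent nodes, whereas with γ = 0.5 the entry `M_lf[0, 3]` is strictly positive, i.e., a single jump can cover three hops. The dense `M_lf` also shows the loss of sparsity mentioned above.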

B. Proposed Lévy Flights Graph Convolutional Network for semi-supervised node classification
Although both G-SSL and fractional G-SSL achieve comparable and consistent (low variance) performance on some datasets, e.g., Les Miserables, Wikipedia-math, and MNIST, these approaches consider only the given adjacency matrix W and the label matrix Y, without using the feature matrix X. This limitation is crucial, especially when dealing with datasets that not only exhibit a sophisticated topological graph structure but also provide node feature information, such as citation, biological, financial, and power grid networks. To address this limitation, many graph-based neural network methods have recently been proposed, e.g., graph convolutional networks (GCN), which use the feature matrix X instead of the label matrix Y and encode the graph structure within a neural network framework. Such graph-based neural networks have been shown to achieve impressive gains in semi-supervised learning performance on graphs. Next, we turn to discussing how the idea of Lévy Flights can be incorporated into GCN, leading to the new Lévy Flights Graph Convolutional Network (LFGCN) for semi-supervised node classification.
Lévy Flights Graph Convolutional Network (LFGCN)
The key idea behind our proposed method is the Fractional Generalized Sigma-based (FGS) filter. To avoid the inverse computations, we insert the Taylor series expansion into the FGS filter, which gives Eq. (5). Empirically, i = 4α is enough to get a good approximation. We then obtain the general classification function, Eq. (6), by multiplying (5) by the feature matrix X.

Convolutional layer
During LFGCN training, the convolutional model needs to train the parameters (W, b) of the graph filter, where the trainable graph filter scans the given input feature matrix into a series of feature maps with neurons. We therefore implement (6) as an FGS convolutional layer, Eq. (7), where $H^{(t+1)}$ is the hidden layer output matrix of activations in the t-th layer with $H^{(0)} = X$, σ(·) is the adopted activation function, and $W_t$ is the trainable weight in the t-th layer. Furthermore, we bring the concept of the parallel system (PS) from reliability theory to improve the consistency of our proposed method. A parallel system is a configuration in which the entire system functions as long as not all of its components fail. Hence, the parallel system structure is more robust against noisy inputs than a single system structure.
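The idea of replacing a matrix inverse by a truncated series can be illustrated with a generic Neumann (Taylor) expansion, $(I - \alpha A)^{-1} \approx \sum_{i=0}^{k} (\alpha A)^i$, which converges when the spectral radius of $\alpha A$ is below 1. This is a sketch of the general technique only: the matrix `A` below is an ordinary row-stochastic propagation matrix standing in for the FGS propagation operator, and the function name and truncation orders are illustrative assumptions rather than the paper's Eq. (5).

```python
import numpy as np

def truncated_inverse(A, alpha, num_terms):
    """Approximate (I - alpha A)^{-1} by sum_{i=0}^{num_terms} (alpha A)^i."""
    n = A.shape[0]
    term = np.eye(n)       # current power (alpha A)^i, starting at i = 0
    total = np.eye(n)      # running sum of the series
    for _ in range(num_terms):
        term = alpha * (term @ A)
        total += term
    return total

# Row-stochastic propagation matrix of the path graph 0 - 1 - 2 (stand-in operator).
A = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
alpha = 0.5
exact = np.linalg.inv(np.eye(3) - alpha * A)
err_2 = np.abs(truncated_inverse(A, alpha, 2) - exact).max()
err_10 = np.abs(truncated_inverse(A, alpha, 10) - exact).max()
```

Only matrix products are needed, so the filter can be applied without an explicit inverse, and the truncation error shrinks geometrically in the number of terms for this choice of α.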
Lemma 3. Let $\bar{X}_{FGS}$ be the output matrix $P_1$ from a pooling layer. Let $U = \{1, 2, \dots, N\}$ be a finite population such that each unit $i \in U$ is associated with an output matrix $X^{(i)}$. Suppose there are $n$ components in a parallel system, with probabilities of non-failure $P_R^{(i)}$ (where $i = 1, \dots, n$). Then the reliability of this parallel system, $P_R^{PS}$, can be obtained with the following expression:

$$P_R^{PS} = 1 - \prod_{i=1}^{n} \big( 1 - P_R^{(i)} \big).$$

Proof of Lemma 3 is in Appendix IX. According to Lemma 3, the introduced concept of a parallel system enhances stability and reduces the estimation variance by a factor of order $n$ (i.e., $\mathrm{Var}(\bar{X}_{FGS}) = O(S^2/n)$). In this way, the parallel implementation gives both theoretical and practical guarantees that our proposed model remains stable over a large set of hyperparameters, small datasets, and noisy labels.
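The reliability argument can be checked with a short calculation. The sketch uses the standard parallel-system formula $1 - \prod_i (1 - P_R^{(i)})$, which assumes that components fail independently; the function name and the component reliabilities of 0.9 are illustrative.

```python
import math

def parallel_reliability(p_non_failure):
    """Reliability 1 - prod_i (1 - p_i) of a parallel system.

    Assumes independent components; the system fails only if all of them fail.
    """
    for p in p_non_failure:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return 1.0 - math.prod(1.0 - p for p in p_non_failure)

r_single = parallel_reliability([0.9])            # one branch
r_triple = parallel_reliability([0.9, 0.9, 0.9])  # three parallel branches
```

With these illustrative numbers, three parallel branches raise the reliability from 0.9 to about 0.999, since all three would have to fail at once.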
Pooling layer
To aggregate information from the outputs of the parallel FGS convolutional layer, instead of using popular pooling functions such as max and average pooling, we apply the state-of-the-art gated max-average pooling operation [27], which captures the local and global information from all the nodes and the graph structure. The rationale behind gated max-average pooling is that it follows a "responsive" strategy (i.e., improving translation invariance and scale invariance via considering the input in each gating mask) based on the mixed max-average pooling equation. In this pooling operation, W is the trainable weight matrix and $X_{FGS}$ is the output matrix from the parallel FGS convolutional layer after the concatenation operation.

Residual building block
Inspired by the seminal works of [28,29] that implemented residual learning in a graph convolutional network, we apply a residual block (RB) by adding a skip connection after the pooling layer. One of the advantages of residual learning is the identity mapping, which provides a direct path for propagating information. When using the residual building block, we adopt a scheme similar to [30] to deal with the output of the pooling layer. Let $\mathcal{H}(x)$ be an underlying mapping, which we cast as $\mathcal{H}(x) = \mathcal{F}(x) + x$, where $\mathcal{F}(x) = \mathcal{H}(x) - x$ is the residual mapping. Optimizing the residual mapping $\mathcal{F}(x)$ is easier than optimizing the direct mapping $\mathcal{H}(x)$ and helps to avoid the vanishing gradient problem during training. We use an exponential linear unit (ELU) in the direct mapping and place a rectified linear unit (ReLU) after the addition in our model.

P-DropEdge
Motivated by the recent idea of message passing inference [i.e., DropEdge of 10], we develop a new preferential DropEdge approach called P-DropEdge, which is based on censoring higher edge betweenness order statistics. In particular, [10] recently proposed a flexible approach, the DropEdge algorithm, which uniformly at random removes a certain proportion of edges from the input graph at each training epoch and thereby helps to prevent over-fitting and to reduce the effect of over-smoothing. The rationale behind DropEdge of introducing more randomness and deformation into the data is intrinsically linked and complementary to the Dropout ideas of [31]. Our approach further advances DropEdge by targeting and randomly removing edges proportionally to their betweenness centrality, i.e., preferential edge dropout of higher edge betweenness order statistics. First, our idea is based on the Girvan-Newman argument of focusing on edges which tend to play a greater role in the underlying network topology [32]. Second, dropout of higher edge betweenness order statistics may be viewed as a variant of recently proposed non-uniform censoring schemes for generalized order statistics in reliability theory, which are shown to deliver more robust parameter estimates in heterogeneous probability distributions [33].

Definition 1 (Edge Order Statistics). Given an input graph $G = \{V, E, W\}$, the betweenness centrality of an edge $e \in E$ is defined as $C_{Be}(e) = \sum_{u \neq v \in V} \sigma_{uv}(e) / \sigma_{uv}$, where $\sigma_{uv}$ is the number of shortest paths connecting $u$ to $v$, and $\sigma_{uv}(e)$ is the number of shortest paths connecting $u$ to $v$ that pass through the edge $e$. We then arrange the edges in ascending order of their betweenness $\{C_{Be}(e)_i\}$, $i = 1, 2, \dots, |E|$, as $C_{Be}(e)_{(1)} \le C_{Be}(e)_{(2)} \le \dots \le C_{Be}(e)_{(|E|)}$. Here $C_{Be}(e)_{(i)}$ is said to be the $i$th-order edge betweenness score, or the $i$th edge betweenness order statistic.
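The edge betweenness order statistics can be computed for a small graph as follows. This is a minimal sketch for undirected, unweighted graphs using a Brandes-style breadth-first accumulation; it counts each unordered node pair once (an assumption about the normalization, which the definition leaves open), and the function names and the two-triangle example are illustrative.

```python
from collections import deque

def edge_betweenness(nodes, edges):
    """Unnormalized edge betweenness of an undirected, unweighted graph.

    Each unordered pair {u, v} is counted once. Returns {(a, b): score}, a < b.
    """
    adj = {v: [] for v in nodes}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    score = {tuple(sorted(e)): 0.0 for e in edges}
    for s in nodes:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = {v: 0.0 for v in nodes}
        sigma[s] = 1.0
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back towards s.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[tuple(sorted((v, w)))] += c
                delta[v] += c
    # Every pair was visited from both of its endpoints, so halve the totals.
    return {e: c / 2.0 for e, c in score.items()}

def edge_order_statistics(nodes, edges):
    """Edges in ascending order of betweenness: [(edge, score), ...]."""
    scores = edge_betweenness(nodes, edges)
    return sorted(scores.items(), key=lambda item: (item[1], item[0]))

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the bridge (2, 3).
nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
ranked = edge_order_statistics(nodes, edges)
```

In this example the bridge between the two triangles is the highest order statistic, which matches the Girvan-Newman intuition that high-betweenness edges connect communities.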
Note that the Girvan-Newman algorithm on edge betweenness infers the edges connecting communities, that is, the edges exhibiting a more profound role in the network organization. As a result, P-DropEdge offers multi-fold benefits: (i) it constrains the direction of a random walk and acts as a "self-avoiding" random walk, e.g., it reduces the chance of moving back to the already visited graph structure; (ii) it increases variability among randomly deformed copies of the original graph. Consider, e.g., an international political network. Randomly removing connections among Mr. and Mrs. Smith, or even US Senators from Texas and California, will tend to deliver a more similar resulting graph structure than randomly removing collaboration links between Trump, Macron, Putin and Johnson. Finally, we replace the fractional Laplacian $L$ in (7) with $L_{p_{\text{P-D.E.}}}$ for propagation and training. In validation and testing steps, P-DropEdge is not utilized.

Algorithm 1 P-DropEdge Algorithm
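Since the body of Algorithm 1 is not reproduced in the text, the following is only a hedged sketch of one plausible reading of preferential edge dropping. It assumes that τ is the fraction of highest-betweenness edges that form a candidate pool and that $p_{\text{P-D.E.}}$ is the fraction of all edges removed, sampled uniformly from that pool. Both roles, the function name `p_drop_edge`, and the toy scores are assumptions, not the authors' specification.

```python
import random

def p_drop_edge(edge_scores, p_drop, tau, rng=None):
    """Return the edges kept after one preferential drop (assumed semantics).

    edge_scores: {edge: betweenness score}.
    tau: fraction of top-betweenness edges forming the candidate pool.
    p_drop: fraction of all edges to remove, sampled uniformly from the pool.
    """
    if not 0.0 <= p_drop <= tau <= 1.0:
        raise ValueError("need 0 <= p_drop <= tau <= 1")
    rng = rng or random.Random()
    # Ascending order statistics of edge betweenness (ties broken by edge id).
    ranked = sorted(edge_scores, key=lambda e: (edge_scores[e], e))
    n_edges = len(ranked)
    pool_size = round(tau * n_edges)
    n_drop = min(round(p_drop * n_edges), pool_size)
    pool = ranked[n_edges - pool_size:]          # highest order statistics
    dropped = set(rng.sample(pool, n_drop))
    return [e for e in ranked if e not in dropped]

# Toy example: 10 edges with betweenness scores 1, ..., 10.
toy_scores = {(i, i + 1): float(i + 1) for i in range(10)}
kept = p_drop_edge(toy_scores, p_drop=0.2, tau=0.4, rng=random.Random(0))
```

Under this reading, the function would be called once per training epoch to produce a freshly perturbed edge set, while validation and testing would use the full graph, as stated above.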
Advantages of LFGCN vs. Higher-order methods
Recently there has been a surge of interest in higher-order methods, that is, algorithms based on the graph convolutional layer with higher-order information in graphs, such as APPNP [22], VPN [23], and MixHop [24]. In contrast to such higher-order graph architectures, LFGCN offers multi-fold benefits: (i) due to their fractal character, Lévy Flights integrate local graph exploration with long-range excursions, which reduces oversampling compared to standard random walks and allows for more efficient graph exploration; (ii) since high-order schemes [23,24] are based on integer powers of the Laplacian, when exploring 2-hops, 3-hops, ..., k-hops, the standard random walks employed in these higher-order methods can only describe larger scale graph structures, often resulting in very dense adjacency matrices and higher computation costs; (iii) the average return probability of Lévy Flights is lower than that of a normal random walk, implying more efficient graph exploration; (iv) Lévy Flights reinforce separability of clusters, enhance performance for unbalanced data, and are known to yield better search results in directional data. Clearly, these advantages are essential for learning graphs with higher heterogeneity, whereas for more homogeneous and balanced graphs, methods based on standard random walks may be a competitive alternative.

IV. EXPERIMENTAL SETTINGS
Directed and Undirected Datasets
Following the practice of previous work, we use three undirected citation network benchmark datasets for semi-supervised learning evaluation: Cora-ML (this Cora dataset consists of Machine Learning papers), CiteSeer, and PubMed. We also evaluate our method on four directed networks: Cora, the IEEE 118-bus system (IEEE bus), the Texas 2000-bus system (TX bus), and the South Carolina 500-bus system (SC bus). The dataset statistics are summarized in Table V (in Appendix X). Further details on the datasets are available on GitHub (see Appendix X).

Training Settings
Training is done with the Adam optimizer with learning rate $lr_1 = 0.01$ for undirected networks and $lr_2 \in \{0.1, 0.001\}$ for directed networks. To prevent our approach from over-fitting, we add both a dropout layer before the two graph convolutional layers and kernel regularizers ($\ell_2$) in each layer. For both undirected and directed networks, we follow the same experimental setup used in the baseline experiments to set the parameters of the baselines. Parameters $p_{\text{P-D.E.}}$ and τ largely depend on the distributional properties of a network and can be estimated, e.g., via cross-validation. As a rule of thumb, we recommend $p_{\text{P-D.E.}}$ and τ of 5% and 6%, respectively, in larger networks of more than 2,000 nodes, and $p_{\text{P-D.E.}}$ and τ of 1% and 2%, respectively, in smaller networks of less than 1,000 nodes. The best hyperparameter configurations of LFGCN for each dataset, found by a standard grid search, are available at the GitHub link in Appendix X.
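The rule of thumb for the P-DropEdge parameters can be written as a small helper. The function name is hypothetical, and returning `None` for networks with between 1,000 and 2,000 nodes is a choice made here because the text gives no recommendation for that range.

```python
def rule_of_thumb(num_nodes):
    """Return the suggested (p_drop, tau) for a network size, or None.

    Values follow the rule of thumb stated in the text; sizes between
    1,000 and 2,000 nodes are not covered there, hence None.
    """
    if num_nodes > 2000:
        return 0.05, 0.06
    if num_nodes < 1000:
        return 0.01, 0.02
    return None

large = rule_of_thumb(5000)   # hypothetical large network
small = rule_of_thumb(118)    # e.g., a network with 118 nodes
```

These values are starting points only; as noted above, both parameters can instead be estimated via cross-validation.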
V. RESULTS
1. Performance analysis
Tables I and II report the average accuracy delivered by LFGCN and competing methods for undirected and directed networks, respectively. The best performance for each dataset is marked in bold. We find that LFGCN outperforms all competing approaches on all datasets except PubMed, where LFGCN delivers the second best accuracy. The improvement gain of LFGCN over the next most accurate method ranges from 0.29% (for CiteSeer over GMNN) to 4.27% (for directed IEEE 118-Bus over GMNN). Remarkably, methods that are applicable both to undirected and directed networks (i.e., [5,6,8,12,22,23,24,36,37]) tend to deliver noticeably lower accuracy for directed networks (especially for weighted-directed networks), while the new LFGCN method yields a more stable performance across both directed and undirected networks. In turn, on PubMed (unweighted-undirected), GMNN outperforms LFGCN by up to 2.63%. Based on the obtained results, the new LFGCN approach tends to be the most competitive and, hence, preferred node classification method for sparser networks with higher label rates. Furthermore, the IEEE 118-Bus dataset is the smallest among the considered data, and we might expect to observe lower accuracy for this dataset due to its limited training set. However, the accuracy yielded by LFGCN on it is among the highest across all datasets. PubMed has the lightest tails of the degree distribution and weak structural information (i.e., very few links per node on average), and thus LFGCN is not the best exploration choice for it.
We provide the training time per epoch on all datasets in Appendix X (see Tables VI, VII, and VIII).
2. Ablation study by removing individual components in LFGCN
To discover the vital components in the success of our LFGCN, we investigate the contributions of the individual components proposed in Section III-B to the performance of LFGCN. We conduct experiments by removing each component separately (in the spirit of a leave-one-out operation) from our LFGCN architecture, leading to a network without P-DropEdge, parallel structure, residual block, or gated max-average pooling.
Table III compares LFGCN with its reduced versions without P-DropEdge, parallel structure, residual block, or gated max-average pooling. The results show that LFGCN consistently outperforms the reduced LFGCN baselines, reaching around 0.24% to 2.91% relative improvement on Cora-ML and the IEEE 118-bus system. These results demonstrate that all components contribute to the performance improvement.
Table I: Comparison of average accuracy (%) and standard deviation (%) in () of semi-supervised classification approaches for undirected networks.

Method         Cora-ML         CiteSeer        PubMed
LP [38]        68.70           46.32           65.92
DW [35]        67.20           43.27           65.33
ChebNet [5]    81.45           70.23           78.40
GCN [6]        81.50           71.11           79.00
ARMA [8]       82.80 (0.63)    72.30 (0.44)    78.80 (0.30)
GAT [17]       83.11 (0.70)    70.85 (0.70)    78.56 (0.31)
GMNN [12]      83.72 (0.90)    73.10 (0.79)    81.80 (0.53)
LGCNs [36]     83.35

Carolina 500-bus system. We find that while sufficient sampling-based edge removal is helpful for performance enhancement, regular random removal of edges does not always improve performance. Note that, in contrast to the regular DropEdge, both LFGCN and the baseline equipped with P-DropEdge achieve consistently better performance than the others. These findings support the effectiveness of employing the preferential approach of P-DropEdge before the learning task.

4. Evaluation of LFGCN-specific parameters
During the grid search over three parameters (i.e., α, σ, and γ), we find that: (i) the regularization parameter α, which is used to specify the relative importance of a graph in clustering, strongly relates to the probability of initial conditions for random walks when the self-refreshing process works, and it strongly influences the network's generalization ability and node classification performance for all datasets; (ii) the free unifying parameter σ provides enough flexibility to construct a canonical formulation of different graph-based semi-supervised methods; Table I and Table II indicate that the optimal value of σ depends on both the type of network (undirected or directed) and the label rate, and not on the size of the network; (iii) the fractional power parameter γ substantially impacts the accuracy of node classification for the small datasets (see, e.g., Figure 2), whereas no similarly strong influence is found in the larger datasets.

Hyperparameter sensitivity
In the sensitivity analysis, we analyze the sensitivity of the node classification accuracy to variation in three LFGCN-specific parameters: α ∈ {0.1, ..., 1}, σ ∈ {0, 0.1, ..., 1}, and γ ∈ {0.001, 0.01, 0.1, 1}. Here we only show the results of the sensitivity analysis for the LFGCN model on the IEEE 118-bus dataset. First, we perform the parameter learning experiments on four scenarios with a fixed parameter γ. Figure 2 shows that the accuracy decreases substantially when α is larger than 0.8, especially when γ equals 0.001 or 0.01 (see Figure 2(a), 2(b)). Setting γ ∈ {0.1, 1}, we observe that the classification accuracy decreases nearly monotonically as α increases. Additionally, LFGCN generally gives consistent and higher accuracy for γ ∈ {0.001, 0.01} when the α parameter is within {0.1, 0.2, 0.3, 0.4}. We then explore the variation of accuracy when tuning γ within the range [0.001, 0.002, ..., 0.01] (varying σ ∈ {0, 0.1, ..., 1} at the same time); however, it is hard to obtain the optimal (σ, γ) combination from a finite number of experimental results (100 runs), since some of the results are very close. Therefore, we run the following experiments to evaluate the impact of γ: Figure 3 shows that there is a more profound difference between the shapes of the approximate Gaussian distributions when fixing the parameter σ than when fixing the parameter γ. These findings imply that σ tends to be a more important factor in the LFGCN approach for small datasets.

Fig. 3: Generalized Gaussian density of accuracy of LFGCN (two blue dashed lines represent lower and upper bounds, respectively): (a) the red filled curve is the PR-based method (σ = 0), the blue filled curve is the NL-based method (σ = 0.5), and the green filled curve is the SL-based method (σ = 1); (b) the red, blue, and green filled curves represent scenarios with fractional parameters γ of 0.001, 0.005, and 0.010, respectively.
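The parameter grid of the sensitivity analysis can be enumerated as sketched below. The step of 0.1 for α and σ is an assumption read off the set notation above, and `evaluate` is a hypothetical callback standing in for a full train-and-validate run of the model; neither is specified in the text.

```python
from itertools import product

# Parameter grids as listed in the text (step 0.1 assumed for alpha and sigma).
alphas = [round(0.1 * i, 1) for i in range(1, 11)]   # 0.1, ..., 1.0
sigmas = [round(0.1 * i, 1) for i in range(0, 11)]   # 0.0, ..., 1.0
gammas = [0.001, 0.01, 0.1, 1.0]

def sensitivity_grid():
    """All (alpha, sigma, gamma) combinations to evaluate."""
    return list(product(alphas, sigmas, gammas))

def best_setting(evaluate):
    """Return the grid point maximizing evaluate(alpha, sigma, gamma).

    `evaluate` is a hypothetical callback returning a validation accuracy.
    """
    return max(sensitivity_grid(), key=lambda cfg: evaluate(*cfg))

grid = sensitivity_grid()

def toy_accuracy(alpha, sigma, gamma):
    """Toy stand-in for a model run, peaked at (0.3, 0.5, 0.01)."""
    return -((alpha - 0.3) ** 2 + (sigma - 0.5) ** 2 + (gamma - 0.01) ** 2)

best = best_setting(toy_accuracy)
```

Under these assumptions the full grid has 10 × 11 × 4 = 440 configurations, so repeated runs per configuration (such as the 100 runs mentioned above) quickly dominate the cost of the analysis.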

VI. CONCLUSION
We have proposed a new Lévy Flights Graph Convolutional Network (LFGCN) method for semi-supervised learning on graphs that makes it possible to better capture the intrinsic local graph topology. In addition, to further mitigate over-fitting and over-smoothing, we have proposed a new preferential P-DropEdge algorithm, based on censoring higher edge betweenness order statistics. We have investigated theoretical properties of LFGCN and have validated the utility of individual components of the LFGCN architecture.
Our numerical studies have indicated that the new LFGCN method tends to outperform all competing deep learning approaches on both unweighted-directed and unweighted-undirected graphs in all considered datasets, except for PubMed. The gain in learning accuracy of LFGCN over the next best competitor ranges from 0.29% to 4.27%, and the highest gain has been achieved for the IEEE 118-bus dataset, which is the smallest among the considered datasets. Furthermore, in contrast to the competing approaches, LFGCN tends to deliver a more stable performance across directed and undirected networks regardless of the label rate.
In the future, we plan to advance the proposed LFGCN technique to learning on multilayer networks, explore the utility of P-DropEdge combined with other order statistics on graphs, and enhance the graph learning process with topological information on the underlying deep neural network.

Lemma 2. The Lévy flight defined by the normalized Laplacian has a shorter relaxation time (measure of the transience) in comparison with the original random walk.
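As a rough illustration of the object Lemma 2 refers to, the sketch below builds Lévy flight transition probabilities from a fractional power of the graph Laplacian. It assumes the combinatorial Laplacian L = D − A of a connected undirected graph and the common fractional-Laplacian construction p_ij = −(L^γ)_ij / (L^γ)_ii for i ≠ j; the normalized-Laplacian variant used in the lemma may differ in detail, and the function names are hypothetical.

```python
import numpy as np

def fractional_laplacian(A, gamma):
    """Return L^gamma for L = D - A via the eigendecomposition of L."""
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)       # L is symmetric
    eigvals = np.clip(eigvals, 0.0, None)      # remove tiny negative round-off
    return (eigvecs * eigvals ** gamma) @ eigvecs.T

def levy_flight_transitions(A, gamma):
    """Row-stochastic matrix of a Levy flight with exponent 0 < gamma <= 1."""
    L_gamma = fractional_laplacian(A, gamma)
    fractional_degree = np.diag(L_gamma).copy()
    P = -L_gamma / fractional_degree[:, None]
    np.fill_diagonal(P, 0.0)                   # no self-transitions
    return P

# Path graph 0 - 1 - 2 - 3 - 4.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0

P_levy = levy_flight_transitions(A, gamma=0.5)  # long-range hops allowed
P_walk = levy_flight_transitions(A, gamma=1.0)  # ordinary random walk
```

With γ = 1 the construction reduces to the usual nearest-neighbour random walk, while for γ < 1 every pair of nodes in a connected graph receives a positive transition probability, so the walker can jump directly between the two ends of the path.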

Fig. 1: Illustration of the Lévy Flights Graph Convolutional Network model. The input is the feature matrix X, and the graph within the dotted circle represents embedding Lévy Flights into random walks on the graph (where L^γ is the Laplacian matrix L raised to the power γ). The LFGCN architecture consists of three main components: (i) an FGS convolutional layer with parallel structure; (ii) a gated max-average pooling layer; (iii) an activation block for residual learning.

invariance via considering input in each gating mask) based on the mixed max-average pooling equation. That is,

Table II: Comparison of average accuracy (%) and standard deviation (%, in parentheses) of semi-supervised classification approaches for directed networks.
Table IV presents comparison between LFGCN and GMNN with regular DropEdge and P-DropEdge on CiteSeer and South

Table IV: Comparison of GMNN and LFGCN with regular DropEdge (p) and P-DropEdge (p_{P-D.E.}) in terms of node classification accuracy (%) on the undirected CiteSeer network and the directed South Carolina bus system. The numbers in parentheses denote the optimal edge removal rate for the models with DropEdge and P-DropEdge.
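For readers who want to experiment with the preferential edge removal compared in Table IV, here is a minimal sketch of one possible reading of P-DropEdge. It assumes an unweighted, undirected graph and assumes that a fraction p of the edges with the largest betweenness is removed before training; the exact selection rule of P-DropEdge may differ, and the function names are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness (Brandes' algorithm) for an unweighted, undirected
    graph given as {node: set_of_neighbours}. Keys are frozenset({u, v})."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        dist = {source: 0}
        n_paths = {v: 0 for v in adj}
        n_paths[source] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([source])
        while queue:                           # breadth-first search
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:     # v precedes w on a shortest path
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        dependency = {v: 0.0 for v in adj}
        for w in reversed(order):              # accumulate from the farthest nodes
            for v in preds[w]:
                share = n_paths[v] / n_paths[w] * (1.0 + dependency[w])
                score[frozenset((v, w))] += share
                dependency[v] += share
    # Each unordered pair of endpoints was counted from both sides.
    return {edge: value / 2.0 for edge, value in score.items()}

def p_dropedge(adj, p):
    """Return the edges kept after dropping the int(p * |E|) edges with the
    highest betweenness (an assumed selection rule, see the lead-in)."""
    scores = edge_betweenness(adj)
    n_drop = int(p * len(scores))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[n_drop:])

# Two triangles joined by the bridge 2-3, which has the largest betweenness.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = edge_betweenness(adj)
kept = p_dropedge(adj, p=0.15)                 # drops int(0.15 * 7) = 1 edge
```

In contrast to regular DropEdge, which samples edges uniformly, this rule is deterministic given p. For larger graphs, a library routine such as NetworkX's `edge_betweenness_centrality` could replace the hand-written helper.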