Cyber-Typhon: An Online Multi-task Anomaly Detection Framework

According to Greek mythology, Typhon was a gigantic monster with one hundred dragon heads, taller than any mountain. His open hands extended from East to West, his head could reach the sky, and flames came out of his mouth. His body below the waist consisted of coiled snakes. This research effort introduces "Cyber-Typhon" (CYTY), an Online Multi-Task Anomaly Detection Framework. It aims to fully upgrade old passive infrastructure through an intelligent mechanism, using advanced Computational Intelligence (COIN) algorithms. More specifically, it proposes an intelligent Multi-Task Learning framework, which combines On-Line Sequential Extreme Learning Machines (OS-ELM) and Restricted Boltzmann Machines (RBMs) in order to control data flows. The final target of this model is the intelligent classification of Critical Infrastructures' network flow, resulting in the detection of anomalies due to Advanced Persistent Threat (APT) attacks.


Introduction
Information generated by complex environments such as the Internet of Things (IoT) ecosystem has increased exponentially. The result is the inefficient management and storage of the total volume of generated information. This requires the adoption of complex data mining and analysis architectures [1]. These architectures should incorporate specialized processing algorithms that dynamically adapt to new standards or data, or to data production that scales as a function of time [2]. Although mining Data Streams (DAST) is an emerging area, it poses enormous challenges to the data mining community. High transmission speed, changing data distributions, and high volume raise the following issues that should be addressed [3]:

➢ High Velocity: Online DAST arrive at very high speed. Thus, it has become practically infeasible to scan all of them. This is also the case for offline streams.

➢ Concept Drift: Frequent Patterns (FREP) keep changing, as data streams are time-varying in nature. During the mining process, as new incoming data are added to the existing ones, some FREP may change their status to become infrequent and vice versa. This issue is known as the Concept Drift (CDR) problem.

➢ Unbounded Size (UNS): DAST are unbounded in size. Unlike static data, their size is unknown to the user in advance.
➢ Enormous Space Requirement (ESR): Huge amounts of data are generated in both online and offline applications. There might not be enough space to store the data stream before processing.

➢ Unsteady Analysis Results (UAR): High speed as well as varying data distribution may affect the analyzed results, due to which the mining outcome may decline. In order to cope with this, DAST mining must be an incremental process.

Anomaly Detection [4] over multiple data streams is initially determined by the observation of a single (multi-variable) time series, which constitutes the system's quantitative performance parameters. A data stream $s_i$ may come from a system of sensors, and it consists of numerical values, where $s_i(t)$ stands for the data stream flow value at time $t$, and $t \in [0, +\infty)$. If $n$ sensor flows are synchronized periodically to report their values, the whole of the multi-variable information at each time $t$ is represented by the frame vector shown in equation 1 [5].
$$\Delta\Pi_t = (s_1(t), s_2(t), \ldots, s_n(t)) \in \mathbb{R}^n \quad (1)$$

In practice, each flow forms a one-dimensional time series, while the frame vector flow (FVF) represents a multi-variable time series. Event detection on DAST is intended to determine the values $s_i(t)$ which represent abrupt changes within an FVF. Particularly, each FVF of length $n$ is converted to a binary vector of the same length, where each value represents a possible change in the corresponding sensor flow. Such deviations from the normal behavior are called events, and the binary vectors are called event vectors. An event may be an observation that does not conform to an expected standard in the data set (an anomaly). Incidents may have been caused by a variety of reasons, like sensor failure or malfunction, or deviations and substantial changes that may affect the system's behavior, such as Cyber Attacks (CYA) [7]. Therefore, a vector of events at time $t$ is represented by equation 2 below [3], [5]:

$$E_t = (e_1(t), e_2(t), \ldots, e_n(t)) \in \{0,1\}^n \quad (2)$$
where $e_i = e_i(t)$ is the binary value which represents the occurrence of abnormal flow behavior, with $e_i(t) = 1$ at time $t$ when an event occurs. The values of $e_i(t)$ must lie in $\{0,1\}$. The error is calculated at each iteration, as data characteristics can change drastically and unpredictably, altering the typical, normal behavior. An object that may be considered abnormal can later be included in the set of normal observations due to rapid developments in the data stream. Since the data volume is unlimited, data mining is performed on a subset of the flow, called a sliding window, which contains a small but recent percentage of the observations. The goal of data flow processing algorithms is to minimize the cumulative error over all iterations, which can be calculated by equation 3 below [3], [5]:

$$L_n = \sum_{i=1}^{n} \left( \langle w_i, x_i \rangle - y_i \right)^2 \quad (3)$$
where $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$ and $w_i \in \mathbb{R}^d$. We consider that $X_{i \times d}$ is a data matrix and $Y_{i \times 1}$ is a matrix of target values after the arrival of the first $i$ data points. Assuming that the covariance matrix $A_i = X_i^T X_i$ is invertible, the optimal solution $f^*(x) = \langle w^*, x \rangle$ is given by equation 4 [3], [5]:

$$w^* = (X^T X)^{-1} X^T Y \quad (4)$$
First the covariance matrix is calculated by using equation 5:

$$A = X^T X \quad (5)$$
The initial time complexity (CM) of this calculation is of order $O(d^2)$, but inverting the $d \times d$ matrix $X^T X$ increases it to $O(d^3)$, while the rest of the required multiplications have complexity $O(d^2)$. This produces an overall CM of $O(d^2 + d^3)$ [3], [5], where $d$ is the size of the window.
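As an illustration of the cost just discussed, the window-level closed-form solution of equations 4 and 5 can be sketched in NumPy (a minimal sketch, not the paper's implementation; the small ridge term guarding against a singular covariance matrix is our own assumption):

```python
import numpy as np

def sliding_window_solution(X, Y, reg=1e-8):
    """Closed-form least squares over one sliding window.

    X : (i, d) data matrix of the last i observations in the window
    Y : (i, 1) target values
    Builds the d x d covariance matrix, inverts it (the O(d^3) step),
    and returns the optimal weights w* = (X^T X)^-1 X^T Y.
    """
    A = X.T @ X                        # covariance matrix, equation 5
    A += reg * np.eye(A.shape[0])      # tiny ridge term, in case A is singular
    return np.linalg.inv(A) @ X.T @ Y  # optimal solution, equation 4
```

Recomputing this solution from scratch at every arrival is what yields the cumulative $O(n^2 d^2 + n d^3)$ cost mentioned below.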
It is conceivable that robust systems ensuring reliability and high accuracy rates, without requiring high availability of resources, are needed to safely approach problems stemming from knowledge mining processes. The above argument is further supported as follows: suppose that the number of data points is equal to $n$ and that after the arrival of each new data point $i = 1, 2, \ldots, n$ the recalculation of the solution is required. In this case the total time complexity would be equal to $O(n^2 d^2 + n d^3)$ [5].

Multi-Task Learning (MTL) is a robust method for facing some of the most challenging aspects of Big Data Stream processing with Online Learning algorithms. MTL is a subfield of Machine Learning (ML) in which multiple learning tasks are solved at the same time, utilizing common elements or differences arising from the multiple tasks included in the case study [6], [7]. More generally, MTL is an inductive transfer method which generates many generalization features. The common features or differences that arise between distributed tasks during the training process are transferred or shared as guaranteed and unambiguous knowledge in subsequent relevant or unrelated tasks, maximizing the accuracy of the model. MTL is efficient because the regularization induced by requiring an algorithm to perform well on a related task can be superior to regularization that prevents overfitting by penalizing all complexity uniformly. The following approaches are characteristic cases of MTL [6], [7]:

➢ Task grouping and overlapping, where tasks are grouped or provided in an overlapping way so that there is relevance, capable of leading to the use of cognitive or learning relationships.

➢ Exploiting unrelated tasks, where the common learning of non-relevant tasks using the same input data can be beneficial for learning the main tasks of an application field.

➢ Transfer of knowledge, where relevant knowledge is transferred to achieve learning from correspondingly trained models.

➢ Group online adaptive learning, where transfer of previous experience or knowledge into continually changing environments takes place.

Literature Review
Qian and Sherif [8] have applied autonomous computing technology to monitor SCADA system performance. Their approach proactively estimates upcoming attacks for a given system model of a physical infrastructure. Soupionis et al. [9] have proposed a combinatorial method for automatic detection and classification of faults and cyber-attacks occurring on the power grid system when there is limited data from the power grid nodes due to cyber implications. In addition, Tao et al. [10] have described network attack knowledge based on the theory of the factor expression of knowledge. They have studied the formal knowledge theory of SCADA networks from the factor state space and equivalence partitioning. This approach utilizes the Factor Neural Network (FNN) theory, which contains high-level knowledge and quantitative reasoning, to establish a predictive model including analytic FNN and analogous FNN. This model abstracts and builds an equivalent and corresponding system of network attack and defense knowledge factors. Finally, [11] has introduced a new European Framework-7 project, CockpitCI (Critical Infrastructure), and the roles of intelligent machine learning methods in protecting SCADA systems from cyber-attacks. Also, existing multi-task learning methods can be categorized into two main categories: learning with feature covariance and learning with task relations [12]. Different from prior solutions to distributed multi-task learning, which are focused on the former category [13], our proposed multi-task learning falls into the latter category. On the other hand, distributed machine learning has attracted more and more interest recently [14].
Tremendous efforts have been made on different machine learning problems. Also, online multi-task learning assumes that instances from different tasks arrive in a sequence and an adversary chooses the task to learn. Cavallanti et al. [15] exploited online multi-task learning with a given task relationship encoded in a matrix, which is known beforehand. Parallel multi-task learning aims to develop parallel computing algorithms for multi-task learning in a shared-memory computing environment.
Recently, Zhang [7] proposed a parallel multi-task learning algorithm named PMTL. In this work, dual forms of three losses are presented and the accelerated proximal gradient method is applied to make the problem decomposable, and thus possible to solve in parallel. Finally, distributed multi-task learning is an area that has not been much explored.
Wang et al. [13] proposed a distributed algorithm for multi-task learning by assuming that different tasks are related through shared sparsity. In another work [6], an asynchronous distributed multi-task learning method is proposed for multi-task learning with shared subspace learning or shared feature subset learning. Different from the above-mentioned approaches, our method aims at solving multi-task learning by learning task relationships from data, which can be positive, negative, or unrelated, via a task-covariance matrix.

The proposed Cyber-Typhon Framework
This research proposes the Cyber-Typhon model, which combines the algorithmic power of OS-ELM and RBM in a hybrid mode. It is an innovative Multi-Task Learning approach [16] that performs control of network traffic in Critical Infrastructures [17], [18], [19]. The final target is the detection of vulnerabilities that are usually due to APT attacks [20], [21], [22]. More specifically, the Cyber-Typhon initially extracts features related to network traffic, which are used as input to an OS-ELM neural network. The OS-ELM has been trained with proper data, in order to be able either to classify traffic as normal or (in the opposite case) to identify the threat or the attack type. Obviously, it performs multiclass classification in order to identify one of eight (8) classes (normal traffic plus seven attack types). If the network traffic is normal, further communication is allowed. In the opposite case, the type of anomaly is determined and the data flow is redirected to an appropriately specialized, dedicated Restricted Boltzmann Machine. If the first RBM does not recognize the specific anomaly for which it is specialized, the data are redirected to the next RBM, responsible for the detection of another anomaly, and so on until successful identification is achieved. If detection cannot be achieved by any of the trained RBMs (which are as many as the types of the known anomalies), the network flow data return to the initial OS-ELM, which can perform online sequential learning. Thus, the classification effort can be re-adjusted. Figure 1 presents the architecture of the Cyber-Typhon.

The proposed ELM is a Single-Hidden Layer Feed Forward Neural Network (SHLF2N2) [23] with N hidden neurons, randomly selected input weights and random bias values in the hidden layer. The output weights are calculated with a single vector-matrix multiplication [13]. Hidden nodes or hidden-layer parameters can be randomly created before seeing the training data, and it is remarkable that non-differentiable activation functions
can be handled, while known Neural Network problems such as the stopping criterion, the learning rate and the number of learning epochs do not have to be addressed. Specifically, the input data are mapped to a random L-dimensional space, given a discrete training set of N samples $(x_i, t_i)$, $i \in [\![1, N]\!]$, with $x_i \in \mathbb{R}^D$ and $t_i \in \mathbb{R}^m$. The output of the network is given by equation 6 [23]:

$$f(x) = \sum_{i=1}^{L} \beta_i h_i(x) = h(x)\beta \quad (6)$$

The vector $\beta = [\beta_1, \ldots, \beta_L]^T$ is the weight vector connecting the hidden and output nodes. On the other hand, $h(x) = [h_1(x), \ldots, h_L(x)]$ is the output of the hidden nodes for input $x$, and $h_i(x)$ is the output of the ith hidden neuron. Based on a training set $\{(x_i, t_i)\}_{i=1}^{N}$, the ELM can solve the learning problem $H\beta = T$, where $T = [t_1, \ldots, t_N]^T$ are the target labels and the hidden-layer output matrix $H$ is given by equation 7 [23]:

$$H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix} = \begin{bmatrix} h_1(x_1) & \cdots & h_L(x_1) \\ \vdots & \ddots & \vdots \\ h_1(x_N) & \cdots & h_L(x_N) \end{bmatrix}_{N \times L} \quad (7)$$

The input weight matrix $\omega$ of the hidden layer (before training) and the bias vectors $b$ are created randomly in the interval $[-1, 1]$, by employing equations 8a and 8b.
The hidden-layer output matrix H is calculated by applying the activation function on the training dataset, based on equation 9a, and the output weights β are calculated based on equation 9b.
where $H = [h_1, \ldots, h_N]$ is the output matrix of the hidden layer and $x = [x_1, \ldots, x_N]$ is the input matrix of the hidden layer. Indeed, β can be calculated by equation 10:

$$\beta = H^{+} T \quad (10)$$

where $H^{+}$ is the Moore-Penrose generalized inverse of matrix H [23]. The Cyber-Typhon employs the ELM algorithm with the Gaussian Radial Basis Function kernel $K(u,v) = \exp(-\gamma \lVert u-v \rVert^2)$.
The number of hidden neurons k is equal to 20, with randomly assigned input weights $w_i$ and biases $b_i$, $i = 1, \ldots, N$. The calculation of the hidden-layer output matrix H has been done by employing equation 11 below [23].
where $h(x) = [h_1(x), \ldots, h_L(x)]$ is the output (row) vector of the hidden layer with respect to the input x. Also, h(x) maps the data from the D-dimensional input space to the L-dimensional hidden-layer feature space (ELM feature space) H; thus, h(x) is indeed a feature mapping. ELM aims to minimize the training error $\lVert H\beta - T \rVert^2$ as well as the norm $\lVert \beta \rVert$ of the output weights, where H is the hidden-layer output matrix of equation 11.
Minimization of the norm of the output weights $\lVert \beta \rVert$ actually corresponds to maximizing the separating margin $2 / \lVert \beta \rVert$ between the two different classes in the ELM feature space.
To calculate the output weights β, we used equation 12 [23]:

$$\beta = \left( \frac{I}{C} + H^T H \right)^{-1} H^T T \quad (12)$$
where C is a positive constant and the value of T is obtained from the function approximation of the SHLF2N2 with additive neurons (see equation 13 below).
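A minimal sketch of the ELM training scheme described above, under illustrative assumptions (sigmoid activation, toy dimensions): it draws the input weights and biases uniformly from [-1, 1] (equations 8a-8b) and computes the output weights in closed form (equation 12).

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, T, L=20, C=100.0):
    """Train a basic ELM: random input weights/biases in [-1, 1],
    closed-form output weights beta = (I/C + H^T H)^-1 H^T T."""
    D = X.shape[1]
    W = rng.uniform(-1, 1, size=(D, L))      # random input weights (eq. 8a)
    b = rng.uniform(-1, 1, size=L)           # random biases (eq. 8b)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden-layer output matrix (eq. 11)
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)  # eq. 12
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Equation 6: f(x) = h(x) beta."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Note that nothing is iterated: the only trained quantity is β, which is why ELM avoids stopping criteria, learning rates and epochs.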
The OS-ELM [16] is an alternative technique for large-scale computing and machine learning, used when data become available in sequential order, to determine a mapping from the data set to the corresponding labels.
The main difference between online learning and batch learning techniques is that in online learning the mapping is updated after the arrival of every new data point in a scalable fashion, whereas batch techniques are used when one has access to the entire training data set at once. OS-ELM is a versatile sequential learning algorithm, because the training observations are presented to the learning algorithm sequentially (one-by-one or chunk-by-chunk, with varying or fixed chunk length). At any time, only the newly arrived single observation or chunk of observations (instead of the entire past data) is seen and learned. A single or a chunk of training observations is discarded as soon as the learning procedure for that particular observation or chunk is completed. The learning algorithm has no prior knowledge of how many training observations will be presented. Unlike other sequential learning algorithms which have many control parameters to be tuned, OS-ELM only requires the number of hidden nodes to be specified [23].
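The two phases described next, an initial batch solution followed by a recursive update for each arriving chunk, can be sketched as follows. This is a simplified sketch of the standard OS-ELM recursion, not the authors' code; hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

class OSELM:
    """Sketch of OS-ELM with a boosting phase and recursive sequential updates."""

    def __init__(self, D, L=20):
        # Random hidden layer, fixed once and never retrained.
        self.W = rng.uniform(-1, 1, size=(D, L))
        self.b = rng.uniform(-1, 1, size=L)

    def _H(self, X):
        # Sigmoid hidden-layer output matrix (activation is an assumption).
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def boost(self, X0, T0):
        """Boosting phase: initial batch solution (needs at least L samples)."""
        H0 = self._H(X0)
        self.P = np.linalg.inv(H0.T @ H0)
        self.beta = self.P @ H0.T @ T0

    def update(self, X, T):
        """Sequential phase: recursive least-squares update for a new chunk;
        only the chunk is seen, past data are never revisited."""
        H = self._H(X)
        K = np.linalg.inv(np.eye(H.shape[0]) + H @ self.P @ H.T)
        self.P = self.P - self.P @ H.T @ K @ H @ self.P
        self.beta = self.beta + self.P @ H.T @ (T - H @ self.beta)

    def predict(self, X):
        return self._H(X) @ self.beta
```

After each `update`, the chunk can be discarded, which matches the streaming behavior described above.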
The proposed OS-ELM algorithm consists of two main phases, namely the Boosting Phase (BPh) and the Sequential Learning Phase (SLPh). Obviously, if the network flow is classified as normal, it is allowed to continue. In the opposite case, there are 7 RBMs, as many as the abnormal classes, where each one of them has been trained to perform One-Class Classification (OCC) in order to exclusively recognize one specific network attack [24]. OCC, also known as the Unary Classification Method (UCM), implements intelligent categorization of cases belonging to a specific class among an existing set of records. OCC learns from a training set that contains only records of one specific class. Usually in this method the negative class is absent, because it is not sampled. Thus, the decision boundaries are determined effectively with the knowledge of the positive class only. This process is much more difficult than traditional binary or multiclass classification [25], as the classifier is trained to accept target objects and to reject those that deviate significantly. Minimizing the errors is also a difficult process, because in this type of categorization cross-validation is unavailable, since there is no data from the other classes. Finally, it should be stressed that the one-class problem-solving technique is inverse to the generalization approaches pursued in other machine learning problems, as it tends to provide a fully defined configuration of parameters. This can exponentially increase the complexity of the classifier in trying to correctly classify the target data. The more complex the model, the smaller the accepted range within the target data, and the less likely it is for outliers to be categorized correctly. In practice, one can create a complex model by setting all its possible parameters without being at risk of overfitting.
In this sense, OCC is the most appropriate approach for detecting abnormalities and identifying patterns or trends in a set of data that displays behavior more divergent than expected. OCC achieves high levels of successful detection, while maintaining low false-alarm rates [24], [25].
The RBMs [26] belong to the family of energy-based models, where each configuration of the variables of interest corresponds to a finite scalar energy value used for training. The learning process is performed by modifying the energy function (ENF), so that its shape acquires desirable properties. Specifically, the RBM is a symmetric graphical model: the units of one layer are connected (and thus dependent) only with the units of the other layer. The proposed RBM ENF with V visible units and H hidden units is defined by equation 14 [26]:

$$E(v, h) = -\sum_{i=1}^{V} a_i v_i - \sum_{j=1}^{H} b_j h_j - \sum_{i=1}^{V} \sum_{j=1}^{H} v_i w_{ij} h_j \quad (14)$$

where $v$ and $h$ are binary vectors related to the states of the visible and hidden units respectively. Moreover, $v_i$ and $h_j$ correspond to the individual state of each visible unit (VU) $i$ and each hidden unit (HU) $j$ respectively, and $w_{ij}$ is the weight assigned to their connection. Finally, $a_i$ and $b_j$ are the biases of the VU $i$ and the HU $j$. The conditional probability $P(v \mid h)$ is given by equation 15:

$$P(v \mid h) = \prod_{i=1}^{V} P(v_i \mid h) \quad (15)$$

In the specific case of examining a unit $i$ of the visible layer, the assigned conditional probability distribution (if we know the state of the hidden layer $h$) is calculated by equation 16:

$$P(v_i = 1 \mid h) = \sigma\!\left(a_i + \sum_{j} w_{ij} h_j\right) \quad (16)$$

The conditional probabilities $P(h \mid v)$ and $P(h_j = 1 \mid v)$ are defined by equations 17 and 18:

$$P(h \mid v) = \prod_{j=1}^{H} P(h_j \mid v) \quad (17)$$

$$P(h_j = 1 \mid v) = \sigma\!\left(b_j + \sum_{i} v_i w_{ij}\right) \quad (18)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid. These relations express the independence of the units of each layer from the other units of the same layer. Training the RBM is the process of finding values for its parameters that maximize the mean log-probability of the occurrence of the training set C (equation 19). Equation 20 estimates the derivative of the cost function with respect to $w_{ij}$, which drives the renewal of the weights:

$$\frac{\partial \log P(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \quad (20)$$

The first term calculates the average value of $v_i h_j$ when the visible layer of the RBM is driven by the data, whereas the second term corresponds to the value of $v_i h_j$ when the data are "produced" by the model. An equivalent formulation would suggest that every weight $w_{ij}$ should change according to equation 21:

$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \right) \quad (21)$$

In this way the RBM model starts approaching the actual distribution of the data. The first term $\langle v_i h_j \rangle_{data}$ can be calculated easily, since knowing the values of the units at the
visible level, we can, by means of the above equations, calculate the conditional probability for each unit at the hidden level. Calculation of the second term, which presupposes the existence of samples from the model itself, is more difficult. In this paper we use optimization through Contrastive Divergence (CD) [27]. The steps that we used to produce a single sample can be summarized as follows:

1. Take a training sample v, compute the probabilities of the hidden units and sample a hidden activation vector h from this probability distribution.
2. Compute the outer product of v and h and call this the positive gradient.
3. Use h to sample a reconstruction v' of the visible units, and then resample the hidden activations h' (Gibbs sampling step).
4. Compute the outer product of v' and h' and call this the negative gradient.
5. Let the update to the weight matrix W be the positive gradient minus the negative gradient, times some learning rate: $\Delta W = \varepsilon (v h^T - v' h'^T)$.
6. Update the biases a and b analogously: $\Delta a = \varepsilon (v - v')$, $\Delta b = \varepsilon (h - h')$.

The CD method gives lower energy to the actual data and much higher energy to the "reconstructions" resulting from them. This helps the model approach the actual data distribution.

The Cyber-Typhon also employs the Multi-Task Grouping and Overlap Learning approach, combined with the optimal use of the OS-ELM and RBM methodologies. Sliding windows (SLIW) [28] are used to partition the data stream. Using the SLIW technique, the system estimates a table of indices after accepting the data flow vectors as input. The data are divided into partitions of 1,000 samples, with 400 of them overlapping between adjoining windows. This enables continuous, uninterrupted scanning of the data, which results in faster and more accurate control. This happens because a small SLIW is much more likely to be uniform than a larger one, and therefore it is more predictable. Each new sample is checked by the OS-ELM algorithm, which forwards it to the appropriate RBM. Once the optimal model has been created, it is applied to the control window of the SLIW. This process is followed until the input data of each window is exhausted. The total knowledge of the entire first window is stored in MM1. MM1 is transferred to window 2 as MLT, which means that the total knowledge of the first window is transferred to the second window. The process continues for all the data in window 2, and so on for all other windows, until the full analysis of the data flow is done. The full representation of the proposed process is presented in figure 2 below. When the OS-ELM classification detects an attack type, the network flow data are directed to the corresponding RBM. If it is a false positive case, the data are directed to the next RBM and so on. If the output is a false alarm, then the
traffic is redirected to the initial OS-ELM, which has online learning potential and re-examines the data flow from the beginning as if it were new. This architecture has been chosen due to the fact that in multi-parametric, high-complexity problems related to big data (like the one examined herein) the classification results are unstable, especially regarding the analysis and incorporation of the data flows. The introduced architecture is a serious hybrid method for the combined resolution of APT attacks. The system is flexible and effective; it not only offers a robust approach, but also a faster convergence of the overall model. Finally, the proposed architecture can exploit multi-task learning in order to enhance generalization. Each one of the developed RBMs is totally dedicated to a specific problem, which can result in bias and variance reduction. This can eliminate overfitting and offers a robust framework capable of coping with high-complexity problems.

Datasets
Appropriate datasets were chosen that closely simulate Industrial Control System (ICS) communication and transaction data. They were used in the development and evaluation of the proposed model. They contain preprocessed network transaction data, from which lower-layer transmission data (e.g. TCP, MAC) have been stripped [29]. Specifically, the Gas_Dataset that was chosen for the purpose of this research includes 26 [...]. The dataset is normalized in the interval [-1, 1] in order to face the problem of features with a wider range prevailing over those with a narrower range without being more important. Also, the outliers and extreme values spotted were removed based on the Inter-Quartile Range technique [30]. The reader can find details regarding the dataset and the data collection and assessment methodology in [29].
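A sketch of the preprocessing steps just described, under the common assumption of a 1.5×IQR fence for outlier removal (the paper does not state the exact fence factor):

```python
import numpy as np

def preprocess(X):
    """Illustrative sketch, not the paper's exact code: IQR-based outlier
    removal followed by min-max scaling of each feature to [-1, 1]."""
    # Inter-Quartile Range filter: drop rows with any value outside
    # [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the 1.5 factor is the conventional choice)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    mask = np.all((X >= q1 - 1.5 * iqr) & (X <= q3 + 1.5 * iqr), axis=1)
    X = X[mask]
    # Min-max normalization to [-1, 1], so wide-range features do not dominate
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2 * (X - lo) / np.where(hi > lo, hi - lo, 1) - 1, mask
```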

Training the RBMs
The RBMs were trained based on an innovative application of the One-Class Classification methodology. More specifically, according to this approach, each RBM is exclusively trained with data related to one vulnerability of the network, in order to be able to identify the specific behavior (anomaly) that has a potential relation to APT attacks. This is achieved by estimating the RBM's Reconstruction Error (RER), which is a basic criterion for the safe determination of the classes. The error is compared against a threshold which is unique for each class and which emerged after several trial-and-error attempts, aiming at the optimal output of the system. If the error is higher than this predefined limit (which is characteristic for each class), the sample is rejected as belonging to an unknown class and the data are redirected to the next RBM.
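The reconstruction-error routing described above can be sketched as follows; the deterministic up-down pass and the toy thresholds are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_error(v, W, a, b):
    """Mean squared reconstruction error of one deterministic
    up-down pass through a trained RBM (illustrative sketch)."""
    h = sigmoid(b + v @ W)          # hidden probabilities
    v_rec = sigmoid(a + W @ h)      # visible reconstruction
    return float(np.mean((v - v_rec) ** 2))

def occ_route(v, rbms, thresholds):
    """One-class routing as described above: each RBM accepts the sample
    only if its reconstruction error is below that class's threshold;
    otherwise the sample is passed to the next RBM. Returns the index of
    the accepting RBM, or None if every RBM rejects (back to the OS-ELM)."""
    for k, ((W, a, b), thr) in enumerate(zip(rbms, thresholds)):
        if reconstruction_error(v, W, a, b) <= thr:
            return k
    return None
```

The per-class thresholds play the role of the trial-and-error limits mentioned in the text.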

Results
When using a machine learning classifier, estimating the real error during training requires the full probability density of both categories to be known. The classification performance is estimated by the development of a Confusion Matrix (CM), where the main diagonal values (top left corner to bottom right) correspond to correct classifications and the rest of the numbers correspond to the very few cases that were misclassified. Table 1 presents the CM results of the OS-ELM, and Table 2 presents the analytical results of the proposed method.

Discussion and Conclusions
This research introduces a highly innovative, reliable and effective anomaly detection system, which is based on advanced computational intelligence approaches [32], [33], [34]. The Cyber-Typhon performs a series of sophisticated anomaly detection functions, using Multi-Task Learning and effectively combining On-Line Sequential Extreme Learning Machines with Restricted Boltzmann Machines. It ensures (in the most effective and intelligent way) safe network communication among the interacting devices in a critical infrastructure environment. The proposed system significantly enhances the security mechanisms of Critical Infrastructures, which are a constant target due to their high importance. Also, this architecture offers the potential of developing a safe platform that can control and integrate network transactions. This can be done without the need for human intervention or a central authority. It has also been shown that collective intelligence technologies offer a smart solution for the determination of digital security and for the monitoring of assets related to critical infrastructure.
The development of the model was based on the extremely effective OS-ELM algorithm, and on the multi-task learning method, which can perform transfer of knowledge between related processes that are executed in parallel.
The training process was based on the employment of the RBMs that were trained based on the Unary Classification method.This was done by using specific datasets, each one of them corresponding to the behavior of a certain attack, in order to ensure the absolute reliability of the classifier.
The performance of the proposed system was tested on a multidimensional dataset of high complexity. This dataset resulted from extensive research into the operation of critical infrastructure control systems and from comparisons, audits and tests. The target was the identification of the most appropriate limits, which realistically represent the respective classes. The high accuracy of the results has greatly enhanced the validity of the general methodology.
It is important to mention that this particular model is presented for the first time in the literature.
It constitutes an important guideline for the further exploitation of intelligent technologies in the automations that compose industrial networks. Proposals for the development and future improvement of this system should focus on further optimizing the parameters of the RBMs used, in order to achieve an even more efficient, accurate and quicker classification, capable of dividing even more precisely the boundaries between system states.
It would be important to study the functional extension of the proposed algorithm with meta-learning methods. This could further improve the anomaly detection process. Finally, the introduced model could employ adaptive learning in order to gain self-improvement potential. This would fully automate the whole process.

Fig. 3 Abnormal class distribution.

The Precision (PRE), the Recall (REC), the F-Score and the Total Accuracy (TA) indices are defined by equations 25, 26, 27 and 28 respectively [31]:

$$PRE = \frac{TP}{TP + FP} \quad (25)$$

$$REC = \frac{TP}{TP + FN} \quad (26)$$

$$F\text{-}Score = \frac{2 \cdot PRE \cdot REC}{PRE + REC} \quad (27)$$

$$TA = \frac{TP + TN}{TP + TN + FP + FN} \quad (28)$$

Table 1. Confusion Matrix of the OS-ELM.

All of the performance metrics in this testing were estimated based on the One-Versus-All approach, because this is a Multi-Class Classification case. The numbers of misclassifications are related to the False Positive (FP) and False Negative (FN) indices appearing in the Confusion Matrix. The True Positive Rate (TPR), also known as Sensitivity, the True Negative Rate (TNR), also known as Specificity, and the False Positive Rate (FPR) are defined by equations 22, 23 and 24 respectively [31]:

$$TPR = \frac{TP}{TP + FN} \quad (22)$$

$$TNR = \frac{TN}{TN + FP} \quad (23)$$

$$FPR = \frac{FP}{FP + TN} \quad (24)$$
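The One-Versus-All metrics of equations 22-28 can be computed directly from a multi-class confusion matrix; a minimal sketch:

```python
import numpy as np

def one_vs_all_metrics(cm):
    """Per-class One-Versus-All metrics from a multi-class confusion
    matrix cm (rows = true classes, columns = predicted classes),
    following equations 22-28."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # predicted as class k but truly another class
    fn = cm.sum(axis=1) - tp          # class k predicted as something else
    tn = cm.sum() - tp - fp - fn
    return {
        "TPR": tp / (tp + fn),               # sensitivity / recall (eq. 22, 26)
        "TNR": tn / (tn + fp),               # specificity (eq. 23)
        "FPR": fp / (fp + tn),               # (eq. 24)
        "PRE": tp / (tp + fp),               # precision (eq. 25)
        "F1":  2 * tp / (2 * tp + fp + fn),  # F-score (eq. 27)
        "TA":  tp.sum() / cm.sum(),          # total accuracy, scalar (eq. 28)
    }
```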

Table 2. Classification Accuracy and Performance Metrics.

10-fold cross-validation (10_FCV) is employed to obtain the performance indices. Cross-validation is a technique to evaluate predictive models by partitioning the original sample into a training set to train the model and a test set to evaluate it.
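A minimal sketch of the k-fold partitioning described above (the shuffling and fold sizes are conventional choices, not taken from the paper):

```python
import numpy as np

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the n sample indices, split them into k roughly equal folds,
    and yield (train_idx, test_idx) pairs, one per fold."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```

Each fold serves exactly once as the test set, so every sample contributes to both training and evaluation.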