Iliou Machine Learning Data Preprocessing Method for Stress Level Prediction

Data pre-processing is an important step in the data mining process. Data preparation and filtering steps can take a considerable amount of processing time. Data pre-processing includes cleaning, normalization, transformation, feature extraction and selection. In this paper, the Iliou and PCA data preprocessing methods were evaluated on a data set of 103 students, aged 18-25, who were experiencing anxiety problems. The performance of the two methods was evaluated using 10-fold cross validation with seven classification algorithms: IB1, J48, Random Forest, MLP, SMO, JRip and FURIA. The classification results indicate that the Iliou data preprocessing algorithm consistently and substantially outperforms the PCA data preprocessing method, achieving 98.6% against 92.2% classification performance, respectively.


Introduction
Anxiety disorders are among the most prevalent mental disorders in the general population; they can become chronic and are associated with significant mortality rates [3]. This paper is structured as follows: Section 2 describes the diagnostic criteria, Section 3 presents the cognitive models, Section 4 presents the Beck Anxiety Inventory (BAI), Section 5 describes our dataset, Section 6 presents the experimental results of this study, and Section 7 concludes the paper and describes future work.

Diagnostic Criteria
According to DSM-5, the category of anxiety disorders comprises separation anxiety disorder, selective mutism, specific phobia, social anxiety disorder (or social phobia), panic disorder, agoraphobia, generalized anxiety disorder, substance-induced or medication-induced anxiety disorder, anxiety disorder due to another medical condition, other specified anxiety disorder and unspecified anxiety disorder [10].
A panic attack is a sudden onset of intense fear or intense discomfort that peaks within minutes, during which at least four of the following symptoms occur (the abrupt onset may arise from a calm or a stressful state): palpitations, pounding heart, or accelerated heart rate; sweating; trembling or shaking; shortness of breath or a feeling of suffocation; a choking feeling; pain or discomfort in the chest; nausea or abdominal discomfort; dizziness, unsteadiness or faintness; chills or a feeling of warmth; paresthesias (numbness or tingling); derealization (feeling of unreality) or depersonalization (feeling detached from oneself); fear of losing control or of going mad; fear of death.
Generalized anxiety disorder is characterized by feelings of excessive anxiety and worry (fearful expectation), which are present on most days during a period of at least six months and concern a number of events or activities (such as work and school performance). The person finds it difficult to control the worry, and the stress and anxiety are associated with at least three of the following symptoms (with some of the symptoms present on most days during the last six months) [1] [2] [10]: nervousness, anxiety or a feeling of stress; a feeling of unusual upset; difficulty concentrating or a feeling that the mind goes blank; irritability; muscle tension.
Hoehn-Saric & McLeod in 1988 [11], and also Freeman in 1990, described stress as a universally known experience that functions like a warning safety mechanism. This mechanism emits warning signals during hazardous conditions associated with uncertainty, but it seems to malfunction in a number of cases, for example when stress: (a) is too intense, (b) persists after the exposure to the risk has ended, (c) occurs in situations in which risk or threat is absent, or (d) occurs without any particular reason.

Beck Anxiety Inventory
The Beck Anxiety Inventory (BAI), created by Aaron T. Beck and his colleagues, is a 21-item, multiple-choice, self-report inventory used to measure the severity of anxiety in children and adults [6]. BAI questions are associated with symptoms of anxiety that the subject experienced during the past week (including the day the BAI is taken), such as numbness and tingling, sweating, and fear that the worst is going to happen. It can be administered to individuals over 17 years old and takes around 5 to 10 minutes to complete. Several studies have found the BAI to be an accurate measure of anxiety symptoms in children and adults [17].
Somatic subscale is more emphasized on the BAI, as there are 15 out of 21 items measuring physiological symptoms, while the rest of the items concern cognitive, affective, and behavioral components of anxiety. Therefore, BAI functions more adequately in anxiety disorders with a high somatic component, like for example panic disorder. On the other hand, BAI is not appropriate for disorders, such as social phobia or obsessive-compulsive disorder, which have a stronger cognitive or behavioral component. Many questions of the Beck Anxiety Inventory include physiological symptoms, such as palpitations, indigestion, and trouble breathing. Because of this, it has been shown to elevate anxiety measures in those with physical illnesses like postural orthostatic tachycardia syndrome, when other measures did not [18].
The Beck Anxiety Inventory (BAI) and the Anxiety Disorders Interview Schedule (ADIS-IV) were administered to 193 adults recruited from an anxiety research and treatment center at a major Midwestern university, with a primary diagnosis of generalized anxiety disorder (GAD), specific or social phobia, panic disorder with or without agoraphobia, or obsessive-compulsive disorder (OCD), or with no psychiatric diagnosis. The results of this study support previous findings that the strongest quality of the BAI is its ability to assess panic symptomatology, and that it can be used as an efficient screening tool for distinguishing between individuals with and without panic disorder [16]. Steer, Kumar and Ranieri (1995) administered the Beck Anxiety Inventory to 105 outpatients between 13 and 17 years old who were diagnosed with various types of psychiatric disorders; their results support the use of the inventory for evaluating self-reported anxiety in outpatient adolescents [29].
The BAI is a reliable tool. It contains 21 questions, and each answer is scored on a scale from 0 (not at all) to 3 (severely).

Data Collection
Questionnaires were completed by 103 students, aged 18-25, who were experiencing anxiety problems and had visited the Office of the Special Consulting Health Services of Patras University between September 2014 and June 2016.
A result between 0 and 21 indicates low levels of stress. Although this is usually positive, one should consider whether the result is realistic or whether the person is in denial about his or her problems. A result between 22 and 35 indicates moderate stress levels, meaning that the body is trying to signal something. The person should look for patterns in the appearance of symptoms, for example whether symptoms appear before a specific activity (e.g., a business meeting), and should learn to manage stress before this activity. There is no need to panic, but some stress management techniques will certainly help. A result of 36 points or more shows high levels of stress and is certainly a reason to seek help in managing it. A high level of anxiety is not a sign of weakness or personal failure; it is something that should be addressed in order to prevent an impact on one's mental and physical health [9].
The questionnaires were administered by one person, the psychologist of the office. Although the BAI was developed to minimize its overlap with the depression scale as measured by the Beck Depression Inventory, a correlation of r = .66 (p < .01) between the BAI and BDI-II was seen among psychiatric outpatients, suggesting that the BAI and the BDI-II equally discriminate between anxiety and depression. Another study indicates that, in primary care patients with different anxiety disorders, including social phobia, panic disorder with or without agoraphobia, agoraphobia, or generalized anxiety disorder, the BAI seemed to measure the severity of depression. This suggests that perhaps the BAI cannot adequately differentiate between depression and anxiety in a primary care population. We have uploaded the original data as well.
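The scoring scheme described in this section (21 answers, each scored from 0 to 3, with total-score bands for low, moderate and high stress) can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, and a total of exactly 36 is assumed here to fall in the high band.

```python
def bai_total(answers):
    """Sum the 21 BAI item scores, each from 0 (not at all) to 3 (severely)."""
    if len(answers) != 21:
        raise ValueError("the BAI has exactly 21 items")
    if any(a not in (0, 1, 2, 3) for a in answers):
        raise ValueError("each answer must be 0, 1, 2 or 3")
    return sum(answers)


def stress_level(total):
    """Map a BAI total (0..63) to the bands given in the text.

    Assumption: a total of exactly 36 is treated as 'high'.
    """
    if not 0 <= total <= 63:
        raise ValueError("a BAI total lies between 0 and 63")
    if total <= 21:
        return "low"
    if total <= 35:
        return "moderate"
    return "high"


# Usage: a respondent who answers 1 ("mildly") on every item.
example_answers = [1] * 21
print(bai_total(example_answers), stress_level(bai_total(example_answers)))
```

In a classification setting such as the one studied here, the returned band would serve as the class label of each student record.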

Data preprocessing
The set of techniques applied before a data mining method is known as data preprocessing for data mining [19], and it is one of the most important issues within the Knowledge Discovery from Data process [20]. Real-world data are generally imperfect: incomplete (lacking attribute values, lacking certain attributes of interest, or containing only aggregate data), noisy (containing errors or outliers), inconsistent (containing discrepancies in codes or names) and redundant. Such data are therefore not directly suitable for starting a data mining process. We must also mention the fast growth of data generation rates and data size in business, industrial, academic and scientific applications. Larger amounts of collected data require more sophisticated mechanisms to analyze them. Data preprocessing adapts the data to the requirements posed by each data mining algorithm, making it possible to process data that would otherwise be infeasible to handle. Data preprocessing mainly includes:
(ii) Data integration: using multiple databases or files.
(iv) Data reduction: reducing the features of the initial dataset but producing the same or similar analytical results.
(v) Data discretization: part of data reduction, replacing numerical attributes with nominal ones.
Indicatively, we describe the following data preprocessing methods:
Principal component analysis (PCA) tries to find a rotation such that the set of possibly correlated features is transformed into a set of linearly uncorrelated features. The columns used in this orthogonal transformation are called principal components [22]. This method is also designed for matrices with a low number of features.
The Iliou preprocessing method transforms the initial dataset into a completely new dataset with 4*m attributes, where m is the number of classes of the dataset output (in big data, the Iliou method thus achieves feature reduction). In this paper we modified our method in order to achieve better results and lower complexity.
The Iliou method uses statistics and linear algebra, such as the dot (inner) product of dataset vectors, in conjunction with some descriptive statistics [24].
The new dataset achieves better classification performance compared to other preprocessing methods. We have used the Iliou method for classification or prediction problems [24,25]. A detailed description of the method follows.

Step 1
Assume that a dataset of a machine learning problem, named dataset1, is chosen, with n instances (rows), k variables (columns) and m classes. Then a new set (from now on called Basic-Set or BS) is created by randomly selecting 10% of the data from dataset1; it consists of d instances and m classes. The remaining 90% of dataset1 is called Rest-Set.

Step 2
Next, the inner (dot) product of every Basic-Set instance (row) with the remaining rows of the Basic-Set is computed. (An inner product is a generalization of the dot product: in a vector space, it is a way to multiply vectors together, with the result of this multiplication being a scalar.)

Step 3
Assuming that the Rest-Set has r instances (rows) and m classes, an approach similar to Step 2 follows. Specifically, the inner (dot) product of every Rest-Set row with every row of the Basic-Set is computed. Then the mean and median values of the inner products of every row are calculated for each class (RS_Mean_classm_rowj and RS_Median_classm_rowj, respectively), producing m + m = 2m new variables for every row of the Rest-Set. As a result, we have r values for RS_Mean_classm_rowj and r values for RS_Median_classm_rowj.
Similarly to Step 2, we compute mean and median values (RS_Mean_classm_rowx and RS_Median_classm_rowx, respectively) for every class.
In addition, the Final_Meanm_rowj and Final_Medianm_rowj values are calculated as shown in equation (5) and equation (6), respectively, where m is the name of the class and j (from 1 to r) is the row of the Rest-Set.
Finally, m Final_Meanm_rowj values and m Final_Medianm_rowj values result, one for every class m and every row j of the Rest-Set.

Step 4
The rows (variables) RS_Mean_classm_rowj, RS_Median_classm_rowj, Final_Meanm_rowj and Final_Medianm_rowj for every class are selected from the previous step and are then placed in a new table.
The method ends with the transposition of the
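The random split of Step 1 and the per-class mean and median of dot products of Step 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the fixed random seed and the rounding of the 10% share are hypothetical. The Basic-Set values of Step 2, the Final_Mean and Final_Median values of equations (5) and (6), and the table layout of Step 4 are deliberately left out, because their formulas are not given in the text above.

```python
import random
from statistics import mean, median


def dot(u, v):
    """Inner (dot) product of two equally long numeric vectors."""
    return sum(a * b for a, b in zip(u, v))


def split_basic_rest(n_rows, fraction=0.10, seed=0):
    """Step 1 (sketch): randomly pick about 10% of the row indices as Basic-Set.

    Assumption: the share is rounded to the nearest integer, at least 1 row.
    A plain random draw may miss a class; that case is not handled here.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    d = max(1, round(fraction * n_rows))
    return indices[:d], indices[d:]


def rest_set_features(rows, labels, basic_idx, rest_idx):
    """Step 3 (sketch): for every Rest-Set row, the mean and median of its
    dot products with the Basic-Set rows of each class (2m values per row)."""
    classes = sorted(set(labels[i] for i in basic_idx))
    features = []
    for j in rest_idx:
        row_features = []
        for c in classes:
            products = [dot(rows[j], rows[i]) for i in basic_idx if labels[i] == c]
            row_features += [mean(products), median(products)]
        features.append(row_features)
    return classes, features


# Usage on a toy dataset with two classes and a hand-picked Basic-Set.
toy_rows = [[1, 0], [2, 0], [6, 0], [0, 1], [0, 3], [1, 1]]
toy_labels = ["a", "a", "a", "b", "b", "a"]
print(rest_set_features(toy_rows, toy_labels, [0, 1, 2, 3, 4], [5]))
```

Each Rest-Set row is thus replaced by one (mean, median) pair per class, so its new representation no longer depends on the original number of variables k.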

Experimental results
The experiments were conducted using seven classification schemes (Table 1): IB1 (nearest-neighbour classifier), J48 (an implementation of the C4.5 algorithm), Random Forest, MLP (Multilayer Perceptron), SMO (Support Vector Machines), JRip (Repeated Incremental Pruning to Produce Error Reduction) and FURIA (Fuzzy Unordered Rule Induction Algorithm). In order to estimate how accurately these predictive models will perform in practice and to assess how the results will generalize to an independent data set, we used the repeated 10-fold cross validation technique [26]. The experiments were conducted using the WEKA 3.8 data mining software [27] [32]. For comparison, Purnendu Shekhar Panday (2017) tried to predict from heartbeat data whether a person is under stress, achieving 68% classification accuracy [33]. In [34], a real-life, unconstrained study was carried out with 30 employees in two organisations; the classification results using an ensemble approach increased the accuracy by ≈10%, to 71.58%, compared to not using any transfer learning technique.
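The idea behind repeated 10-fold cross validation can be sketched as follows. This is an illustrative sketch with hypothetical function names, not a reproduction of WEKA's procedure: in particular, the folds here are formed by a plain random shuffle without any stratification by class, and the seeds and the number of repeats are assumptions.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train, test) index lists; every instance is tested exactly once."""
    rng = random.Random(seed)
    indices = list(range(n_instances))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def repeated_k_fold(n_instances, k=10, repeats=3):
    """Repeat the k-fold split with a different shuffle in each repetition."""
    for r in range(repeats):
        yield from k_fold_indices(n_instances, k, seed=r)


# Usage: fold sizes for a dataset of 103 instances.
for train, test in k_fold_indices(103):
    print(len(train), len(test))
```

A classifier would be trained on each `train` list and evaluated on the matching `test` list, and the scores would then be averaged over all folds and repetitions.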
Equations 7 and 8 give the formulas for the Precision and Recall metrics. True Positives (TP) are the items correctly labeled as belonging to the positive class. Items correctly labeled as not belonging to the positive class are called True Negatives (TN). Items that were not labeled as belonging to the positive class by the classifier but should have been are called False Negatives (FN). Finally, items incorrectly labeled as belonging to the positive class are called False Positives (FP). Thus, the numbers of true positives, true negatives, false negatives and false positives add up to 100% of the set.
Precision = (true positives) / (true positives + false positives)    (Eq. 7)
Recall = (true positives) / (true positives + false negatives)    (Eq. 8)
As we can observe in Table 1, the initial (raw) data achieved their best classification result, 92%, with the Support Vector Machines (SMO) algorithm, while the PCA-transformed data achieved their best classification results, 90% and 89%, with the FURIA and JRip algorithms, respectively. Finally, the Iliou method achieved 98% classification performance with the SMO and IB1 algorithms.
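As a small worked illustration of equations 7 and 8, the following Python sketch counts TP, TN, FP and FN for one class treated as positive and derives Precision and Recall from them. The function names and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch only.

```python
def confusion_counts(true_labels, predicted_labels, positive):
    """Count TP, TN, FP and FN with `positive` treated as the positive class."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def precision_recall(true_labels, predicted_labels, positive):
    """Precision (Eq. 7) and Recall (Eq. 8); 0.0 if a denominator is zero."""
    tp, _, fp, fn = confusion_counts(true_labels, predicted_labels, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example: 8 instances, "high" stress treated as the positive class.
truth = ["high", "high", "high", "low", "low", "low", "low", "high"]
guess = ["high", "high", "low", "low", "low", "high", "high", "high"]
print(confusion_counts(truth, guess, "high"))  # (3, 2, 2, 1)
print(precision_recall(truth, guess, "high"))  # (0.6, 0.75)
```

In the toy example, 3 of the 5 instances predicted as "high" are correct (Precision 3/5), and 3 of the 4 truly "high" instances are found (Recall 3/4).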

Conclusions
In this paper we focused on stress level prediction using the initial dataset without preprocessing and datasets preprocessed by the PCA and Iliou data preprocessing methods, respectively. Our experimental evaluation has shown that the Iliou data preprocessing algorithm consistently and substantially outperforms the PCA data preprocessing method in classification performance. Moreover, Table 1 reveals that the Iliou method achieved better classification results than the initial dataset. In our view, the Iliou method can be used to significantly improve the performance of classification algorithms on stress level datasets.
In future work, it would be preferable to repeat the same experiments on larger datasets using more classifiers. In addition, the Iliou data preprocessing method could be extended into a classification algorithm.