A Byproduct of a Differentiable Neural Network: Data Weighting from an Implicit Form to an Explicit Form

Data weighting is important for data preservation and data mining. This paper presents a data weighting method, neural network data weighting, which obtains data weights by transforming the implicit weighting of a neural network into explicit weighting. The method includes two phases: in the first phase, choose a neural network whose transfer function is differentiable, and train it on the training samples; in the second phase, feed the training samples back into the trained network as test samples, calculate the partial derivatives of the outputs with respect to the inputs based on the differentiability of the network, and use a statistic of the partial derivatives with respect to each input data item to calculate the weight of that item. In this way, implicit weights stored in the neural network are converted to explicit weights. Experiments show that the method is more accurate than state-of-the-art methods. Furthermore, the method can be used in any field where a differentiable neural network can be used: the data types can be discrete, continuous, or labeled, and the number of output data items is unlimited.


Introduction
Data weighting plays a very significant role in data preservation and data mining. According to whether labeled information (output data) is needed, data weighting divides into two categories: unsupervised weighting and supervised weighting. Unsupervised weighting includes Maximum deviation [1], Standard deviation [2], Information entropy [3], Grey relational analysis [4], Laplacian score [5], Mutual information maximization [6], Clustering analysis (Weighting K-means) [7], etc. These methods obtain data weights through statistical analysis of the input data. Recently, weighting clustering has become a focus of research [7][8], but clustering analysis is usually used as a sample classification model, and the performance of clustering-based weighting has not been confirmed. The main problem of unsupervised data weighting is that the weights depend on the input data alone, not on the output data; when pseudo data or unrelated data are present, the accuracy of these methods is very low. Supervised weighting includes ReliefF [9], Fisher score [10], Trace ratio [11], Rough set [12], Simba [13], etc. These methods work on the correlations between input data and output data. Among them, ReliefF, Fisher score and Trace ratio are similar, aiming to make the intra-class difference smaller and the inter-class difference larger; their main differences lie in how data difference is defined. Their output data are generally labeled data, which limits their application scope. Rough set is a data dimension reduction method that can also be used to evaluate data weights, but it is sensitive to pseudo data and requires that both its input and output data be discrete. Compared with unsupervised weighting, supervised weighting usually achieves better performance, but there is still much room for improvement in accuracy, application scope, etc.
Some machine learning algorithms, such as neural networks, can be seen as weight allocation algorithms. Acting as a black box, the network implicitly stores data weighting learned from the training samples. The input-output mapping performance of the algorithm determines the accuracy of this implicit weighting, but implicit weighting cannot achieve knowledge transfer. It is therefore necessary to probe how data weighting can be transferred from an implicit form to an explicit form.
This paper presents a supervised weighting method, which can be taken as a byproduct of a differentiable neural network, based on the differentiability of the network. Because the BP neural network is a classic differentiable neural network, the paper chooses it to investigate neural network data weighting; the corresponding data weighting is named BP NN weighting, abbreviated as BPNN. The rest of this paper is organized as follows. Section 2 introduces the basic structure of a differentiable neural network. Section 3 investigates the relationship between data weighting and data partial derivatives, and then introduces the implementation of neural network data weighting. In Section 4, experiments are carried out to verify the performance of BPNN, and BPNN's advantages are discussed. Finally, conclusions including research prospects are presented in Section 5.

Structure of a differentiable neural network
In 1986, Rumelhart et al. proposed the error back propagation neural network [14], abbreviated as the BP (Back Propagation) network, which is a widely used differentiable neural network. All of the network's neurons are required to be differentiable. Here the BP network is chosen to introduce neural network data weighting. Based on the gradient descent method, it propagates the error backwards to each unit, revises the network weights, and saves the learned knowledge in the connection weights. That is to say, the trained neural network stores implicit data weighting information. As shown in Figure 1, suppose the neural network is an L-layer network: one input layer, L-2 hidden layers and one output layer. Each neuron accepts the outputs of the previous layer as inputs and propagates its own output to the next layer. Under these assumptions, the data relationship between adjacent layers can be represented as

$a_j^{(l)} = f\Big(\sum_k w_{kj}^{(l)} a_k^{(l-1)} + b_j^{(l)}\Big), \quad l = 2, \ldots, L,$

where $a_i^{(1)} = x_i$ are the network inputs, $a_j^{(L)} = y_j$ are the outputs, $w_{kj}^{(l)}$ and $b_j^{(l)}$ are the connection weights and thresholds of layer $l$, and $f$ is the differentiable transfer function.

Data weighting

Data weighting analysis
The chosen neural network is differentiable with respect to the input data, and after training it implicitly contains the data weights. For an output data item $y_j$, its relationship with the input data in the trained neural network is equivalent to a differentiable function

$y_j = f_j(x_1, x_2, \ldots, x_n).$

Its differential equation is

$\Delta y_j = \sum_{i=1}^{n} \frac{\partial y_j}{\partial x_i} \Delta x_i,$

where the coefficient $\partial y_j / \partial x_i$ implies the weight of $\Delta x_i$. From a statistical view, the greater the absolute value of the partial derivative is, the greater the weight of the data item $x_i$ is; the absolute value of the partial derivative shows the correlation between $x_i$ and $y_j$. Starting from the L-th layer (the output layer), the partial derivative of an output item $y_j = a_j^{(L)}$ with respect to $x_i$ is calculated by the chain rule:

$\frac{\partial a_j^{(l)}}{\partial x_i} = f'\Big(\sum_k w_{kj}^{(l)} a_k^{(l-1)} + b_j^{(l)}\Big) \sum_k w_{kj}^{(l)} \frac{\partial a_k^{(l-1)}}{\partial x_i}.$

The equation can be iteratively calculated until $l = 2$, where $\partial a_k^{(1)} / \partial x_i = \partial x_k / \partial x_i$ equals 1 if $k = i$ and 0 otherwise.
Assume that a trained neural network accurately reflects the relations between input data and output data. All the training samples are input into the trained network again as test samples, and the partial derivatives are calculated at each sample. The weight of an input item $x_i$ is then obtained from a statistic of these partial derivatives, e.g. the mean absolute value of $\partial y_j / \partial x_i$ over all samples and output items.
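The derivative recursion above can be sketched in NumPy. The sketch assumes a sigmoid transfer function and illustrative layer sizes (both are assumptions, not the paper's exact configuration); it propagates the Jacobian of the outputs with respect to the inputs backwards from the output layer, layer by layer, exactly as in the chain-rule equation.

```python
import numpy as np

def sigmoid(z):
    # assumed differentiable transfer function f
    return 1.0 / (1.0 + np.exp(-z))

def input_gradients(x, weights, biases):
    """Jacobian dy/dx of the network outputs w.r.t. the inputs.

    weights[l] has shape (size_{l+1}, size_l); the chain rule is applied
    from the output layer back to l = 2, as in the derivation above.
    """
    a, zs = x, []
    for W, b in zip(weights, biases):       # forward pass, keep pre-activations
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
    J = np.eye(len(a))                      # dy/da^(L) is the identity
    for W, z in zip(reversed(weights), reversed(zs)):
        fprime = sigmoid(z) * (1.0 - sigmoid(z))  # f'(net) for this layer
        J = (J * fprime) @ W                # dy/da^(l-1) = dy/da^(l) diag(f') W
    return J                                # shape (n_outputs, n_inputs)

def explicit_weights(X, weights, biases):
    """Mean |dy/dx| over samples and output items -> one weight per input."""
    grads = np.array([np.abs(input_gradients(x, weights, biases)) for x in X])
    w = grads.mean(axis=(0, 1))
    return w / w.sum()                      # normalize to sum to 1
```

A central-difference quotient $(y_j(x + \varepsilon e_i) - y_j(x - \varepsilon e_i)) / 2\varepsilon$ is a convenient sanity check on the returned Jacobian.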

Data weighting implementation
BPNN data weighting is based on a trained neural network. Consequently, its implementation consists of two phases.
(1) In the first phase, train a BP neural network. This phase establishes an accurate relationship between input data and output data. The network is trained on the training sample data; it learns from the training samples and implicitly stores the data weighting in the network connection weights and neuron thresholds.
(2) In the second phase, analyze the data weighting based on the trained neural network. Calculate the partial derivatives with respect to the input data for all the training samples. A statistic of the partial derivatives with respect to each input data item reflects its weight. In this way, implicit weights stored in the neural network are converted to explicit weights.
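The two phases can be sketched end to end. The sketch below assumes a one-hidden-layer sigmoid network trained with plain batch gradient descent on the mean squared error; the architecture, learning rate and epoch count are illustrative assumptions, not the paper's configuration. Phase 2 reuses the trained parameters to convert implicit weights into explicit ones via averaged absolute input derivatives.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bp(X, Y, hidden=8, lr=0.5, epochs=5000, seed=0):
    """Phase 1: train a one-hidden-layer BP network with batch gradient
    descent on the mean squared error (hyper-parameters are illustrative)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, Y.shape[1])); b2 = np.zeros(Y.shape[1])
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)
        O = sigmoid(H @ W2 + b2)
        dO = (O - Y) * O * (1 - O)          # error through the output layer
        dH = (dO @ W2.T) * H * (1 - H)      # ... and through the hidden layer
        W2 -= lr * H.T @ dO / len(X); b2 -= lr * dO.mean(axis=0)
        W1 -= lr * X.T @ dH / len(X); b1 -= lr * dH.mean(axis=0)
    return W1, b1, W2, b2

def phase2_weights(X, params):
    """Phase 2: mean |dy/dx| over the training samples, normalized,
    converts the implicit weights into explicit ones."""
    W1, b1, W2, b2 = params
    H = sigmoid(X @ W1 + b1)
    O = sigmoid(H @ W2 + b2)
    acc = np.zeros(X.shape[1])
    for n in range(len(X)):
        # Jacobian dy/dx at sample n via the chain rule
        J = (W2 * (O[n] * (1 - O[n]))).T @ (W1 * (H[n] * (1 - H[n]))).T
        acc += np.abs(J).sum(axis=0)        # sum over output items
    w = acc / len(X)
    return w / w.sum()
```

On toy data whose output depends only on one input item, the explicit weight of that item comes out largest, which is the behavior the method relies on.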

Experiments
The chosen weighting methods include Standard deviation (SD), Weighting K-means (WKmeans) [7], ReliefF [9], Simba [13] and BPNN. These data weighting methods are tested on object recognition with low dimensional and high dimensional data from different datasets. Furthermore, the methods are investigated in the condition that the outputs are continuous. Finally, the advantages of BPNN are discussed.

Low dimensional data weighting
Low dimensional data weighting experiments are carried out on the UCI repository [15]. Here we choose the Liver Disorders (Liver), Glass Identification (Glass), Iris and Wine datasets from the repository. As shown in Table 1, the dimensions of all the data in these datasets are less than 20. In the calculation of BPNN data weighting, the dropout factor of the neural network is 0.5 and the batch size is 20; the network has n input nodes and m output nodes, where n refers to the number of input data dimensions and m refers to the number of output labels. 75% of the samples are randomly chosen from each dataset as training samples, and the remaining are test samples. The performances of the chosen data weighting methods are presented directly through a k-NN [16] classifier with k equal to 3. Since k-NN classifies data according to data difference, accurate data weighting means a low classification error rate. Table 2 shows the results of the different data weighting methods, i.e. the average test error ratio followed by the standard deviation in parentheses, calculated over 20 runs. Among the weighting methods, BPNN has the lowest classification error rates on the Liver, Iris and Wine datasets, and the third lowest on the Glass dataset. Although ReliefF and Simba are supervised weighting methods, they are not significantly better than the unsupervised methods; the overall performance of ReliefF is not ideal and is even inferior to the unsupervised methods SD and WKmeans. Since the lowest classification error ratio means the highest performance, BPNN has the best performance among the five methods.
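The evaluation protocol can be made concrete with a small sketch: the obtained weights are plugged into the distance of a k-NN classifier with k = 3, and the test error rate measures the weighting quality. A feature-weighted Euclidean distance is assumed here; the paper does not spell out the exact distance form.

```python
import numpy as np
from collections import Counter

def weighted_knn_predict(X_train, y_train, x, w, k=3):
    """Label x by majority vote among the k nearest training samples under
    the feature-weighted Euclidean distance sqrt(sum_i w_i (x_i - t_i)^2)."""
    d = np.sqrt(((X_train - x) ** 2 * w).sum(axis=1))
    nearest = np.argsort(d)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

def knn_error_rate(X_train, y_train, X_test, y_test, w, k=3):
    """Average test error ratio, as reported in Table 2."""
    preds = [weighted_knn_predict(X_train, y_train, x, w, k) for x in X_test]
    return float(np.mean(np.array(preds) != y_test))
```

An accurate weighting drives the weights of noisy input items toward zero, which lowers the error rate under this distance.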

High dimensional data weighting
Experiments are carried out on the Mnist database [17] and face databases [18][19][20], which are all image databases. Through the k-NN classification algorithm with k equal to 3, a more accurate data weighting produces a higher recognition accuracy. The Mnist database is a large handwritten digit image database containing 60,000 training images and 10,000 test images; the handwritten digit images are of size 28×28. The results are calculated over 20 runs. Table 3 shows the results of the different data weighting methods for high dimensional data, i.e. the average test error ratio followed by the standard deviation in parentheses. It can be seen that BPNN is superior to the other methods: it has the lowest recognition error ratios on the Mnist, ORL and Indian Male databases, the second lowest on the Indian Female database and the third lowest on the AR database, and its error ratios on the Indian Female and AR databases are only a little greater than the lowest ones. As a whole, Simba is in second place, a little better than SD, WKmeans and ReliefF. ReliefF is not ideal and is even inferior to the unsupervised methods SD and WKmeans. On high-dimensional databases, WKmeans sometimes cannot work because its weighting optimization may produce values so small that they go beyond the range of computer representation and cause computation failures. The experimental results show that BPNN achieves better performance for high dimensional data.

Data weighting for continuous outputs
Traditional data weighting is used in data classification/recognition, where the output data are labeled. If the output data are continuous, they must be discretized into labels, and useful information is lost. BPNN data weighting can be used in more complex conditions; for example, it can be used when the output items are continuous, without transforming the continuous output data into labeled data. Here we verify BPNN in a condition with one continuous output data item. An indirect method is adopted to verify the validity, based on a hypothesis: data weighting helps to enhance useful contents and suppress useless contents. Because the purpose of PCA is to reduce data dimension while trying to preserve data information, PCA data extracted from data weighted with more accurate weight factors will contain more useful contents and can be mapped to more accurate outputs.
Experiments are carried out on two datasets. One is the Concrete dataset from the UCI repository; its data consist of nine items, including 1 output item and 8 input items, all continuous. The other is a self-built dataset whose data consist of 21 items, including 1 output item (tagged as y) and 20 input items x1 ~ x20. There are 400 samples in the self-built dataset. The output item is continuous, and the items x7 ~ x20 are irrelevant to the output item. In each experiment, 80% of the samples are randomly chosen for training, and the remaining samples are used as test samples.
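The self-built dataset is not published with the paper; a synthetic stand-in with the same shape (400 samples, 20 continuous inputs, one continuous output depending only on x1 ~ x6) can be generated as follows. The linear generating function, coefficients and noise level are purely illustrative assumptions.

```python
import numpy as np

def make_selfbuilt_like(n=400, seed=0):
    """400 samples, 20 inputs; y depends only on x1..x6, x7..x20 irrelevant.
    The linear form and coefficients below are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 20))
    # output uses only the first six inputs, plus small Gaussian noise
    y = X[:, :6] @ np.arange(1.0, 7.0) + 0.1 * rng.normal(size=n)
    return X, y
```

By construction, the correlation of y with any of x1 ~ x6 is clearly larger than with any of the irrelevant items, which is what a good weighting method should detect.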
Experiments are carried out as follows: determine the data weighting (the output item is uniformly discretized into three labels when labels are needed), apply the weighting to the original data, extract PCA data of the weighted data, perform data regression through SVM on the training PCA data and output data, and calculate the average regression errors on the test samples. A smaller error implies that the PCA data keep more useful information, so the average absolute regression errors indirectly show the performance of the data weightings. In the PCA extraction, the fixed eigenvalue ratio is set to 96%; the experimental results are shown in Table 4. According to Table 4, BPNN is the best method, with the smallest regression errors from the view of overall performance. ReliefF and Simba are not suitable for continuous output data because they require labeled outputs, and continuous data discretized into labels lose a lot of information. Simba may also fail in some conditions: for example, if the minimum intra-class difference is larger than the minimum inter-class difference, the Simba algorithm will not work.
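The weight-then-PCA-then-regress pipeline of this section can be sketched as below. PCA is implemented via SVD with the 96% eigenvalue-ratio cut-off from the text; ordinary least squares stands in for the SVM regressor purely to keep the sketch dependency-free, so absolute errors are not comparable to Table 4.

```python
import numpy as np

def pca_fit(X, ratio=0.96):
    """PCA via SVD on centered data, keeping enough components to reach
    the fixed eigenvalue (variance) ratio used in the text."""
    mu = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
    var = S ** 2
    k = int(np.searchsorted(np.cumsum(var) / var.sum(), ratio) + 1)
    return mu, Vt[:k]                       # mean and top-k components

def pipeline_error(X_tr, y_tr, X_te, y_te, w, ratio=0.96):
    """Weight -> PCA -> regression -> mean absolute test error.
    Ordinary least squares replaces the SVM regressor in this sketch."""
    Xw_tr, Xw_te = X_tr * w, X_te * w       # apply the data weighting
    mu, comps = pca_fit(Xw_tr, ratio)
    Z_tr = (Xw_tr - mu) @ comps.T           # project onto PCA components
    Z_te = (Xw_te - mu) @ comps.T
    A = np.column_stack([Z_tr, np.ones(len(Z_tr))])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    pred = np.column_stack([Z_te, np.ones(len(Z_te))]) @ coef
    return float(np.mean(np.abs(pred - y_te)))
```

A more accurate weighting keeps the informative directions inside the retained 96% of variance, which shows up as a smaller test error.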

BP algorithm advantages
According to the experimental results in Sections 4.1-4.3, BPNN maintains the best performance, better than state-of-the-art algorithms such as SD, WKmeans, ReliefF and Simba; especially when the output data are continuous, BPNN performs even better. It can be concluded that BPNN weighting has greater practical significance and can achieve better results. In actual applications, BPNN weighting can be used in more complicated conditions. Traditional data weighting analyses are mainly used in object classification/recognition, where the output data are labeled or regarded as labeled; some methods even require that the input data also be discrete or labeled. BPNN weighting effectively overcomes these limitations: the number of output data items is unlimited, and the output type can be discrete, continuous or labeled. That is to say, BPNN can be widely used in any condition where a BP neural network can work.

Conclusion
This paper presents a new data weighting method, abbreviated as BPNN, which is a byproduct of a differentiable neural network, the BP neural network. Based on the trained neural network, the method transforms implicit weights into explicit weights through partial derivatives. Experiments show that the method is more stable and accurate, and has a wide application scope. Since BPNN is closely tied to the neural network's performance, new developments in neural networks will directly provide more accurate data weighting. Neural networks have long been a hot research topic, and data preprocessing based on neural networks, such as data weighting, will be an exciting research direction.