Collective Interpretation and Potential Joint Information Maximization

The present paper proposes a new information-theoretic method called “potential joint information maximization”. Joint information maximization reduces the number of jointly fired neurons and thereby stabilizes the production of final representations. The final connection weights are then collectively interpreted by averaging the weights produced by different data sets. The method was applied to a data set on rebel participation among youths. The results show that the final weights could be collectively interpreted and that only one feature was extracted. In addition, generalization performance was improved.


Introduction
Information-theoretic methods have had much influence on many aspects of neural learning [1], [2], [3], [4], [5], [6], [7]. Though these methods aim to describe relations or dependencies between neurons or between layers, due attention has not been paid to those relations. Indeed, some methods have even tried to reduce the strength of relations between neurons [8], [9]. For example, they have tried to make individual neurons as independent as possible. In addition, they have tried to make the distribution of neurons' firing as uniform as possible. This is simply because it has been difficult to take neurons' relations or dependencies into account.
The present paper aims to describe one of the main relations between neurons, namely, the relation between input and hidden neurons, because it plays a critical role in improving the performance of neural networks, for example, generalization performance. However, there have been few efforts to describe relations between input and hidden neurons from an information-theoretic point of view. To examine these relations, we introduce the joint probability between input and hidden neurons. Then, the joint information contained between input and hidden neurons is also introduced. When this joint information increases, only a small number of joint input and hidden neurons fire strongly, while all the others cease to do so. However, one of the major problems in realizing the joint information lies in the difficulty of computation. As is well known, the majority of information-theoretic methods share this problem [7]. To overcome the problem, we have introduced potential learning [10], [11], [12], [13]. In this method, information maximization is translated into potentiality maximization, in which a specific neuron is forced to have the largest potentiality to deal with many different situations. When the potentiality is applied to joint neurons, potentiality maximization corresponds to a situation where a small number of joint neurons are forced to have larger potentiality.
In addition, the present paper proposes a new method to interpret final representations. As is well known, the black-box property of neural networks has prevented them from being applied to practical problems, because in practical applications the interpretation of final results can be more important than generalization performance. Usually, neural networks produce completely different types of connection weights, depending on the data sets and initial conditions. Joint information maximization can be used to explain the final representations clearly. When the joint information increases, the number of activated neurons diminishes, which severely constrains the production of many different types of weights. Thus, only a few typical connection weights are produced by joint information maximization. Then, we can interpret those connection weights by averaging them. This type of interpretation is called "collective interpretation" in the present paper. Just as generalization performance is evaluated in terms of average values, interpretation performance can be evaluated collectively by taking into account all the connection weights produced by different data sets and initial conditions.

Concept of Joint Information Maximization
Figure 1 shows the concept of joint information maximization. For one data set, when the joint information is maximized, only one joint hidden and input neuron fires strongly, with a strong connection weight, as in Figure 1(b). For another data set, another joint hidden and input neuron fires strongly, as in Figure 1(c). For interpretation, the connection weights produced by all data sets are taken into account by averaging them with due consideration for the hidden-output connection weights, as in Figure 1(e).

Potential Joint Information Maximization
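Potential joint information is based on the potentiality previously defined for hidden neurons [10], [11], [12], [13]. As shown in Figure 1(b), let $w^t_{jk}$ denote the connection weight from the $k$th input neuron to the $j$th hidden neuron for the $t$th data set. The potentiality $v^t_{jk}$ is then defined from $w^t_{jk}$ and the average weight $\bar{w}^t$, which is given by

$$\bar{w}^t = \frac{1}{ML} \sum_{j=1}^{M} \sum_{k=1}^{L} w^t_{jk},$$

where $M$ and $L$ denote the numbers of hidden and input neurons, respectively. Then, the potentiality is normalized as

$$p(j,k \mid t) = \frac{v^t_{jk}}{\sum_{m=1}^{M} \sum_{n=1}^{L} v^t_{mn}}.$$

Then, we have the potential joint information

$$I = -\sum_{j=1}^{M} \sum_{k=1}^{L} p(j,k) \log p(j,k) + \sum_{t=1}^{T} p(t) \sum_{j=1}^{M} \sum_{k=1}^{L} p(j,k \mid t) \log p(j,k \mid t),$$

where $T$ is the number of data sets, $p(t)$ is the probability with which the $t$th data set is given, and

$$p(j,k) = \sum_{t=1}^{T} p(t)\, p(j,k \mid t). \tag{5}$$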
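The quantities above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in the experiments: the potentiality is assumed here to be the squared deviation of each weight from the average weight, the data-set probability p(t) is assumed uniform unless given, and the function names are hypothetical.

```python
import math


def potentialities(weights):
    """Assumed form: squared deviation of each weight from the mean weight."""
    n_hidden, n_input = len(weights), len(weights[0])
    mean_w = sum(sum(row) for row in weights) / (n_hidden * n_input)
    return [[(w - mean_w) ** 2 for w in row] for row in weights]


def normalize(values):
    """Turn non-negative potentialities into probabilities p(j, k | t)."""
    total = sum(sum(row) for row in values)
    if total == 0:
        raise ValueError("all potentialities are zero; cannot normalize")
    return [[v / total for v in row] for row in values]


def plogp(p):
    """p * log(p), with the usual convention 0 * log(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0


def joint_information(cond_probs, p_t=None):
    """Joint information; cond_probs[t][j][k] holds p(j, k | t)."""
    n_sets = len(cond_probs)
    if p_t is None:
        p_t = [1.0 / n_sets] * n_sets  # assumption: uniform over data sets
    n_hidden, n_input = len(cond_probs[0]), len(cond_probs[0][0])
    # Marginal p(j, k) = sum_t p(t) p(j, k | t)
    marginal = [
        [sum(p_t[t] * cond_probs[t][j][k] for t in range(n_sets))
         for k in range(n_input)]
        for j in range(n_hidden)
    ]
    entropy_term = -sum(plogp(p) for row in marginal for p in row)
    conditional_term = sum(
        p_t[t] * sum(plogp(p) for row in cond_probs[t] for p in row)
        for t in range(n_sets)
    )
    return entropy_term + conditional_term
```

Under these assumptions, the information is largest when each data set concentrates its probability on a single input-hidden pair and different data sets choose different pairs, and it is zero when all data sets produce the same distribution.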

Computing Pseudo-Potential Joint Information Maximization
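It is possible to differentiate the joint information to obtain update rules, but much simpler methods have been developed under the name of potential learning. In this method, potentiality maximization is replaced by pseudo-potentiality maximization, which is achieved simply by changing a parameter. The pseudo-potentiality $\phi^{t,r}_{jk}$ is defined from the potentiality $v^t_{jk}$, where $r \geq 0$ denotes the potential parameter and $v_{\max}$ is the maximum potentiality. By normalizing this pseudo-potentiality, we have the pseudo-firing probability

$$p(j,k \mid t; r) = \frac{\phi^{t,r}_{jk}}{\sum_{m=1}^{M} \sum_{n=1}^{L} \phi^{t,r}_{mn}}.$$

Then, we have the pseudo-information

$$I(r) = -\sum_{j=1}^{M} \sum_{k=1}^{L} p(j,k; r) \log p(j,k; r) + \sum_{t=1}^{T} p(t) \sum_{j=1}^{M} \sum_{k=1}^{L} p(j,k \mid t; r) \log p(j,k \mid t; r),$$

where $p(j,k; r)$ is obtained from $p(j,k \mid t; r)$ in the same way as in (5).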
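To illustrate how the potential parameter sharpens the pseudo-firing probability, the following sketch assumes that the pseudo-potentiality takes the form (v / v_max) ** r. This form is an assumption made for illustration, chosen because it gives a uniform distribution at r = 0 and concentrates on the maximum potentiality as r grows; the function names are hypothetical.

```python
import math


def pseudo_potentiality(v, r):
    """Assumed form: (v / v_max) ** r for each potentiality v[j][k]."""
    v_max = max(max(row) for row in v)
    return [[(x / v_max) ** r for x in row] for row in v]


def pseudo_firing_probability(v, r):
    """Normalize the pseudo-potentialities so that they sum to one."""
    phi = pseudo_potentiality(v, r)
    total = sum(sum(row) for row in phi)
    return [[x / total for x in row] for row in phi]


def entropy(p):
    """Entropy of a joint distribution stored as a nested list."""
    return -sum(x * math.log(x) for row in p for x in row if x > 0)


# Hypothetical potentialities for two hidden and two input neurons.
v = [[4.0, 1.0], [0.0, 9.0]]
for r in (0.0, 0.5, 1.0, 2.0):
    print(r, entropy(pseudo_firing_probability(v, r)))
```

In this sketch the entropy of the pseudo-firing probability for one data set falls as r increases, which is what raises the second term of the pseudo-information.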
The pseudo-information can be increased simply by increasing the parameter $r$, and the joint information can be increased by repeatedly assimilating the pseudo-potentiality $\phi^{t,r}_{jk}$ while the potential parameter is gradually increased. The new weights ${}^{\mathrm{new}}w^t_{jk}$ are obtained by weighting the old weights ${}^{\mathrm{old}}w^t_{jk}$ by the pseudo-potentiality:

$${}^{\mathrm{new}}w^t_{jk} = {}^{\mathrm{old}}w^t_{jk}\, \phi^{t,r}_{jk}.$$

Then, new learning starts with those connection weights as initial ones. This process repeats itself for a fixed number of learning steps.
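The repeated assimilation can be sketched as a simple loop. The sketch below is a minimal illustration under stated assumptions: the potentiality is taken to be the squared deviation from the average weight, the pseudo-potentiality is taken to be (v / v_max) ** r, the parameter r is increased linearly, and `train` is a hypothetical callback standing in for whatever supervised learning is run in each step.

```python
def assimilate(weights, r):
    """Return old weights multiplied by the (assumed) pseudo-potentiality."""
    n_hidden, n_input = len(weights), len(weights[0])
    mean_w = sum(sum(row) for row in weights) / (n_hidden * n_input)
    # Assumed potentiality: squared deviation from the average weight.
    v = [[(w - mean_w) ** 2 for w in row] for row in weights]
    v_max = max(max(row) for row in v)
    if v_max == 0:
        return [row[:] for row in weights]  # all weights equal; nothing to do
    return [
        [w * (v[j][k] / v_max) ** r for k, w in enumerate(row)]
        for j, row in enumerate(weights)
    ]


def potential_learning(initial_weights, train, n_steps=10, r_max=1.0):
    """Alternate training and assimilation while r grows from 0 to r_max."""
    weights = initial_weights
    history = []
    for step in range(n_steps):
        r = r_max * step / (n_steps - 1) if n_steps > 1 else r_max
        weights = train(weights)          # hypothetical learning step
        weights = assimilate(weights, r)  # new_w = old_w * phi
        history.append((r, weights))
    return history
```

With r = 0 in the first step the weights are left unchanged, and as r approaches one the weights with small potentiality are pushed toward zero, so each new round of learning starts from a sparser set of initial weights.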

Experimental Outline
The data set was made to infer the probability of rebel participation among youths in the Niger Delta [14]. The number of input patterns was 1,340, and 19 input variables were used. Of these, 1,000 patterns were used for modeling the neural networks, and the remaining 340 were reserved exclusively for testing. With fewer than 1,000 patterns, improved generalization performance was not obtained by either the present or the conventional methods. From the 1,000 modeling patterns, 700 training patterns were randomly and repeatedly drawn to prepare ten training sets. The remaining 300 were used for early stopping and for checking the data sets. The potential parameter r was gradually increased from zero in the first learning step to one in the tenth (final) learning step.
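The partitioning described above can be sketched as follows. How the 1,000 modeling patterns were separated from the 340 testing patterns is not stated, so a random shuffle is assumed here; the seed and the function name are likewise hypothetical.

```python
import random


def make_splits(n_patterns=1340, n_model=1000, n_train=700, n_sets=10, seed=0):
    """Return ten (training, validation) index splits plus the test indices."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    indices = list(range(n_patterns))
    rng.shuffle(indices)  # assumption: modeling/testing split is random
    model_idx, test_idx = indices[:n_model], indices[n_model:]
    splits = []
    for _ in range(n_sets):
        # Draw 700 training patterns; the remaining 300 serve early stopping.
        train_idx = rng.sample(model_idx, n_train)
        chosen = set(train_idx)
        valid_idx = [i for i in model_idx if i not in chosen]
        splits.append((train_idx, valid_idx))
    return splits, test_idx
```

Each of the ten splits reuses the same 1,000 modeling patterns, so the 340 testing patterns never appear in any training or validation set.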

Mutual Information
Figure 2 shows the joint information as a function of the number of steps. The joint information was simplified by supposing a uniform distribution. The information increased gradually and came close to 0.6. Though the joint information could be increased further, generalization errors increased in direct proportion to the information beyond this point. The results show that the present method can increase the joint information sufficiently.

Connection Weights
Figure 3 shows connection weights for the rebel data set when the number of steps increased from one to ten. When the number of steps was one, almost random weights could be seen in Figure 3(a). When the number of steps was increased from two in Figure 3(b) to six in Figure 3(f), the number of strong connection weights gradually decreased. Then, when the number of steps was increased from seven in Figure 3(g) to ten in Figure 3(j), only one connection weight, from the eighth input neuron to the sixth hidden neuron, became the strongest, while all the other weights became close to zero.
Fig. 3. Connection weights from input to hidden neurons with 10 hidden neurons for the rebel data set. Green and red weights represent positive and negative ones.
Figure 4 shows adjusted connection weights for the maximum potential hidden neurons $j^*$ for ten different data sets randomly taken from the modeling data set. The adjusted weights for interpretation, $c^t_{j^*k}$, were computed using $\mathrm{sign}(W_{1j^*})$, the sign of the weight from the maximum potential hidden neuron to the first output neuron, which represents that the youths do not want to participate in the rebel force. As shown in the figure, five out of ten results showed that input neuron No. 8 had stronger weights than any other. Thus, input neuron No. 8 was collectively considered to be important by the present method.
Figure 5 shows the average connection weights, computed by averaging the adjusted weights over the ten data sets. As can be seen in the figure, input neuron No. 8 had the largest connection weight. Variable No. 8 represents the government's presence in the community in terms of the number of government establishments. Thus, when the government's presence becomes more visible, the youths are less willing to participate in the rebel force.
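The collective interpretation step can be sketched as below. Several details are assumptions made for illustration: the maximum potential hidden neuron is taken to be the one holding the weight that deviates most from the average weight, the adjusted weight is taken to be the input-hidden weight multiplied by the sign of the corresponding hidden-output weight, and all function names are hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def max_potential_hidden(weights):
    """Index j* of the hidden neuron with the largest (assumed) potentiality."""
    n_hidden, n_input = len(weights), len(weights[0])
    mean_w = sum(sum(row) for row in weights) / (n_hidden * n_input)
    return max(
        range(n_hidden),
        key=lambda j: max((w - mean_w) ** 2 for w in weights[j]),
    )


def adjusted_weights(weights, first_output_weights):
    """Assumed form: c[k] = sign(W[1][j*]) * w[j*][k]."""
    j_star = max_potential_hidden(weights)
    s = sign(first_output_weights[j_star])
    return [s * w for w in weights[j_star]]


def collective_weights(runs):
    """Average the adjusted weights over all data sets.

    runs is a list of (input_hidden_weights, first_output_weights) pairs,
    one pair per data set.
    """
    adjusted = [adjusted_weights(w, big_w) for w, big_w in runs]
    return [sum(column) / len(adjusted) for column in zip(*adjusted)]
```

Multiplying by the sign of the hidden-output weight aligns the direction of the weights across data sets before averaging, so that opposite-signed but equivalent solutions do not cancel each other.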
Figure 6 shows the regression coefficients obtained by the logistic regression analysis. The original data set contains a tricky pair of variables, namely, variable No. 16 (oil size) and variable No. 17 (squared oil size), which are naturally correlated because the two variables are essentially the same. Thus, they produced multicollinearity, in which the two variables responded completely differently to input patterns. On the other hand, the present method responded to the two variables almost evenly. The results show that the present method is good at dealing with this kind of data set with strong correlation between variables. Finally, it is interesting to note that, except for variables No. 8, No. 16 and No. 17, quite similar weights and coefficients were produced by both methods.
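The reason a variable and its square tend to be collinear can be checked with a short calculation. The values below are hypothetical stand-ins for oil size, not the actual data, and the helper is a plain Pearson correlation written out for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical positive "oil size" values and their squares.
oil_size = [float(x) for x in range(1, 11)]
oil_size_squared = [x ** 2 for x in oil_size]
print(pearson(oil_size, oil_size_squared))
```

For positive values such as these, the correlation between a variable and its square is close to one, which is the situation in which regression coefficients for the two variables can become unstable.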

Generalization Performance
The present method produced the best generalization performance, compared with the two conventional methods. Table 1 shows the generalization performance of the three methods. As can be seen in the table, the best average generalization error of 0.1662 was obtained by the present method. In addition, the best minimum and maximum errors of 0.1382 and 0.2 were also obtained by the present method. The second best performance was obtained by BP with early stopping. Finally, the worst performance was obtained by the logistic regression analysis.

Conclusion
The present paper proposed a new information-theoretic method called "joint information maximization". The joint information represents relations between input and hidden neurons. When the joint information increases, the number of strongly connected hidden and input neurons decreases gradually. The method was applied to the rebel participation data set. The results show that the joint information could be increased by the present method. The final results could be interpreted collectively by averaging the connection weights. Finally, generalization performance was improved by the present method. Because it relies on potential learning, the present method is much simpler than conventional information-theoretic methods. Thus, it can be applied to large-scale and practical problems.

Fig. 2. Potential joint information with 10 hidden neurons for the rebel data set.

Fig. 4. Adjusted connection weights for ten different data sets from input to hidden neurons with 10 hidden neurons for the rebel data set. Green and red weights denote positive and negative ones.

Fig. 5. Collective and average weights for the rebel data set.

Table 1. Summary of experimental results on generalization performance for the rebel data set. BP(ES) represents BP with early stopping. Boldface numbers show the best values.