An Entropy Based Algorithm for Credit Scoring

Abstract. The demand for effective credit scoring models has been rising in recent decades, due to the growth of consumer lending. Their objective is to divide loan applicants into two classes, reliable or unreliable, on the basis of the available information. Linear discriminant analysis is one of the most common techniques used to define these models, although this simple parametric statistical method does not overcome some problems, the most important of which is the imbalanced class distribution of the data: the number of default cases is much smaller than that of non-default ones, a scenario that reduces the effectiveness of machine learning approaches such as neural networks and random forests. The Difference in Maximum Entropy (DME) approach proposed in this paper leads to two interesting results: on the one hand, it evaluates new loan applications in terms of the maximum entropy difference between their features and those of the non-default past cases, using only these latter cases for model training and thus overcoming the imbalanced learning issue; on the other hand, it operates proactively, overcoming the cold-start problem. Our model has been evaluated on two real-world datasets with an imbalanced class distribution, comparing its performance to that of the best performing state-of-the-art approach: random forests.


Introduction
The processes taken into account in this paper typically start with a loan application (from now on referred to as an instance) and end with the repayment (or non-repayment) of the loan. Although retail lending represents one of the most profitable sources of income for financial operators, the increase in loans is directly related to the increase in the number of defaulted cases, i.e., fully or partially unpaid loans. In short, credit scoring is used to classify, on the basis of the available information, loan applicants into two classes, reliable or unreliable (or better, referring to their instances, accepted or rejected). Considering its capability to reduce monetary losses, it clearly represents an important tool, as stated in [1]. More formally, credit scoring techniques can be defined as a group of statistical methods used to infer the probability that an instance leads to a default [2,3].
Since these processes involve all the factors that contribute to determining the credit risk [4] (i.e., the probability of loss from a debtor's default), they allow financial operators to evaluate this aspect. Other advantages related to these techniques are the reduction of credit analysis costs, a quicker response time in credit decisions, and the possibility to accurately monitor credit activities [5]. The design of effective credit scoring approaches is not a simple task, due to a series of problems, the most important of which is the imbalanced distribution of the data [6] used to train the models (a small number of default cases, compared to the non-default ones), which reduces the effectiveness of machine learning strategies [7].
The idea behind this paper is to evaluate an instance in terms of its features' entropy, and to define a metric able to determine its level of reliability on the basis of this criterion. In more detail, we measure the difference, in terms of maximum Shannon entropy (from now on referred to simply as entropy), between the same instance features, before and after the instance to evaluate is added to the set of non-default past instances. In information theory, the entropy gives a measure of the uncertainty of a random variable: the larger it is, the less a-priori information one has on its value. The entropy therefore increases as the data become equally probable and decreases when their probabilities are unbalanced. It should be observed that, when all data have the same probability, we reach the maximum uncertainty in the prediction of future data.
On the basis of the previous considerations, we evaluate a new instance in terms of the uncertainty of its feature values, by comparing the entropy of the set of non-default past instances before and after the instance to evaluate is added to it. A larger entropy indicates that the new instance contains similar data (increasing the level of equiprobability); otherwise, it contains different data and thus represents a potential default case (in terms of non-similarity with the values of the non-default cases).
In our approach we choose to operate without taking into account the default cases. Such a strategy presents a twofold advantage: first, we can operate proactively, i.e., without the need to use default cases to train our model; second, we overcome the cold-start problem related to the scarcity (or total absence) of default cases, considering that in a real-world context they are far fewer than the non-default ones.
Given that in most of the cases reported in the literature [8,9,10] the Random Forests approach outperforms the other ones in this context, we compare the proposed approach only against it.
The main contributions of our work to the state of the art are listed below:
(i) calculation of the local maximum entropy by features (Λ), which gives us information about the entropy achieved by each feature in the set of non-default cases and allows us to measure the differences between instances in terms of single features;
(ii) calculation of the global maximum entropy (γ), a meta-feature based on the integral of the area under the curve of the Λ values, which allows us to measure the difference between instances in terms of all features;
(iii) formalization of the Difference in Maximum Entropy (DME) approach, used to classify the unevaluated instances as accepted or rejected by exploiting the Λ and γ information;
(iv) evaluation of the DME approach on two real-world datasets, comparing its performance with that of a state-of-the-art approach such as Random Forest (in our case, without using past default cases to train the model).
The remainder of the paper is organized as follows: Section 2 discusses the background and related work; Section 3 introduces the formal notation and defines the problem faced in this paper; Section 4 describes the implementation of the proposed approach; Section 5 provides details on the experimental environment, the adopted datasets and metrics, the strategy used, and the experimental results; some concluding remarks and future work are given in Section 6.

Background and Related Work
A large number of credit scoring classification techniques have been proposed in the literature [11], as well as many studies aimed at comparing their performance on several datasets, such as [8], where a large-scale benchmark of 41 classification methods was performed across eight credit scoring datasets.
The problem of how to choose the best classification approach and how to optimally tune its parameters was instead addressed in [12], which also reports some useful observations about the canonical performance metrics used in this field [13].

Credit Scoring Models
Most of the statistical and data mining techniques in the state of the art can be used to build credit scoring models [14,15], e.g., linear discriminant models [16], logistic regression models [3], neural network models [17,18], genetic programming models [19,20], k-nearest neighbor models [21], and decision tree models [22,23].
These techniques can also be combined in order to create hybrid credit scoring approaches, such as the one proposed in [24,25], based on a two-stage hybrid modeling procedure with artificial neural networks and multivariate adaptive regression splines, or the one presented in [26], based on neural networks and clustering methods.

Imbalanced Class Distribution
A complicating factor in the credit scoring process is the imbalanced class distribution of the data [27,7], caused by the fact that the number of default cases is much smaller than that of non-default ones. Such a distribution of data reduces the performance of classification techniques, as reported in the study carried out in [9].
The misclassification costs arising during scorecard construction and classification were studied in [28], where it is also proposed to preprocess the training dataset through over-sampling or under-sampling of the classes. The effect of such resampling on performance has been studied in depth in [29,30].
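As an illustration of the resampling idea mentioned above (not the specific procedure of [28]), a minimal sketch of random over-sampling of the minority class could look as follows; the helper name and the use of scikit-learn's resample utility are assumptions introduced only for illustration.

```python
# Illustrative sketch of over-sampling the minority (default) class before
# training, one of the rebalancing strategies mentioned above.
import numpy as np
from sklearn.utils import resample

def oversample_minority(X, y, random_state=42):
    """Duplicate minority-class rows until both classes have the same size."""
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    X_min, y_min = X[y == minority], y[y == minority]
    X_extra, y_extra = resample(X_min, y_min,
                                n_samples=counts.max() - counts.min(),
                                replace=True, random_state=random_state)
    return np.vstack([X, X_extra]), np.concatenate([y, y_extra])
```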

Cold Start
The cold start problem [31,32] arises when there is not enough information to train a reliable model for a domain [33,34,35].
In the credit scoring context, this happens when there are not many instances related to credit-worthy and non-credit-worthy customers [36,37]. Considering that, during the model definition, the proposed approach does not exploit the data about defaulted loans, it is able to reduce or overcome the aforementioned issue.

Random Forests
Random Forests are an ensemble learning method for classification and regression, based on the construction of a number of randomized decision trees during the training phase; conclusions are inferred by averaging their results.
Since its formalization [38], it has represented one of the most common techniques for data analysis, thanks to its better performance with respect to other state-of-the-art techniques. This technique allows us to face a wide range of prediction problems without performing any complex configuration, since it only requires the tuning of two parameters: the number of trees and the number of attributes used to grow each tree.
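The RF baseline used later in this paper was implemented in R with the randomForest package (Section 5.1); the following is only a minimal scikit-learn sketch, with synthetic data and illustrative parameter values, showing the two parameters mentioned above.

```python
# Minimal Random Forest sketch (scikit-learn stand-in for the R randomForest
# package used in the paper); data and parameter values are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 23))      # synthetic feature matrix (23 attributes, as in the DC dataset)
y = rng.integers(0, 2, size=500)    # synthetic binary class labels (accepted/rejected)

# The two parameters discussed above: the number of trees and the number of
# attributes considered when growing each tree.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=42)
rf.fit(X, y)
print(rf.predict(X[:5]))
```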

Shannon Entropy
Shannon entropy, formalized by Claude E. Shannon in [39], is one of the most important metrics used in information theory. It reports the uncertainty associated with a random variable, allowing us to evaluate the average minimum number of bits needed to encode a string of symbols, based on their frequency.
More formally, given a set of values v ∈ V, the entropy H(V) is defined as shown in Equation 1, i.e., H(V) = - ∑_{v ∈ V} P(v) log₂ P(v), where P(v) is the probability that the element v is present in the set V.
For instance, if we have a symbol set, the entropy H(V) (i.e., the average minimum number of bits needed to represent a symbol) is given by Equation 2. Rounding up the result, we need 2 bits per symbol; so, to represent a sequence of five characters optimally, we need 10 bits.
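As a practical illustration of Equations 1 and 2 (with an illustrative symbol sequence, not the one used in the paper), the entropy can be computed from the empirical symbol frequencies as follows.

```python
# Minimal sketch of Shannon entropy H(V) = -sum_v P(v) * log2 P(v),
# computed from the empirical frequencies of the symbols in a sequence.
from collections import Counter
from math import log2

def shannon_entropy(values):
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Illustrative example (not the symbol set used in the paper): three symbols
# with unequal frequencies give an entropy between 1 and 2 bits per symbol.
print(shannon_entropy("AABBC"))   # ~1.52 bits; rounded up, 2 bits per symbol
```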
In the context of classification methods, the use of entropy-based metrics is typically restricted to feature selection [40,41,42], i.e., the process where a subset of relevant features (variables, predictors) is selected and used for the definition of the classification model.
In this work, we instead use this metric to detect anomalous values in the features of a new instance, where anomalous stands for values different from those in the history of the non-default cases.

Notation and Problem Definition
This section introduces some notational conventions used in this paper and defines the problem we address.

Notation
Given a set of classified instances T = {t_1, t_2, ..., t_N}, and a set of features F = {f_1, f_2, ..., f_M} that compose each t, we denote as T⁺ ⊆ T the subset of non-default instances, and as T⁻ ⊆ T the subset of default ones.
We also denote as T̂ = {t̂_1, t̂_2, ..., t̂_U} a set of unclassified instances and as E = {e_1, e_2, ..., e_U} these instances after the classification process, so that |T̂| = |E|. Each instance can belong to only one class c ∈ C, where C = {accepted, rejected}.

Problem Definition
On the basis of the Λ and γ information (explained in Sections 4.1 and 4.2), calculated before and after each unclassified instance in the set T̂ is added (one by one) to the set T⁺, we classify each instance t̂ ∈ T̂ as accepted or rejected.
Given a function eval(t̂, Λ, γ), created to evaluate the correctness of the classification of t̂ made by exploiting the Λ and γ information, which returns a boolean value σ (0 = misclassification, 1 = correct classification), we formalize our objective as the maximization of the sum of its results over all instances in T̂, as shown in Equation 3.

Our Approach
The implementation of our approach is carried out through the following three steps:
1. Local Maximum Entropy by Features: calculation of the local maximum entropy by features Λ, aimed at obtaining information about the maximum level of entropy assumed by each feature in the set T⁺;
2. Global Maximum Entropy: calculation of the global maximum entropy γ, a meta-feature defined on the basis of the integral of the area under the curve of the maximum entropy by features Λ;
3. Difference in Maximum Entropy: formalization of the Difference in Maximum Entropy (DME) algorithm, able to classify a new instance as accepted or rejected on the basis of the Λ and γ information.
In the following, we provide a detailed description of each of these steps.

Local Maximum Entropy by Features
Denoting as H(f) the entropy measured on the values assumed by a feature f ∈ F in the set T⁺, we define the set Λ as shown in Equation 4. It contains the maximum entropy achieved by each f ∈ F, so we have that |Λ| = |F|. We use this information during the evaluation process explained in Section 4.3.
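Since Equation 4 is not reproduced here, the following sketch follows the textual description above: it computes one entropy value per feature over the non-default set T⁺, assuming discrete feature values (an assumption made only for illustration).

```python
# Sketch of the per-feature entropy set Lambda over the non-default set T+,
# following the textual description above: H(f) is computed on the values
# that feature f assumes in T+. Assumes discrete feature values.
from collections import Counter
from math import log2

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def local_max_entropy_by_features(T_plus):
    """T_plus: list of instances, each a list/tuple of M feature values.
    Returns Lambda, one entropy value per feature (|Lambda| == |F|)."""
    return [entropy(column) for column in zip(*T_plus)]
```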

Global Maximum Entropy
We denote as global maximum entropy γ the integral of the area under the curve of the maximum entropy by features Λ (previously defined in Section 4.1), as shown in Fig. 1.
Fig. 1: Global Maximum Entropy γ.
More formally, the value of γ is calculated by using the trapezium rule, as shown in Equation 5.
It is a meta-feature that gives us information about the maximum entropy achieved by all features in T⁺, before and after an unevaluated instance is added to it. We use this information during the evaluation process (Section 4.3), jointly with that given by Λ.
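As an illustration of the trapezium-rule computation described above (Equation 5 itself is not reproduced here), a minimal sketch assuming unit spacing between consecutive features is given below; the spacing and the example values are assumptions made for illustration.

```python
# Sketch of the global maximum entropy gamma: the area under the curve of the
# Lambda values, computed with the trapezium rule (unit spacing assumed).
def global_max_entropy(lambda_values):
    """lambda_values: per-feature entropies (the set Lambda), treated as a
    curve over the feature indices 0..M-1."""
    return sum((lambda_values[i] + lambda_values[i + 1]) / 2.0
               for i in range(len(lambda_values) - 1))

# Example with illustrative Lambda values.
print(global_max_entropy([1.2, 0.8, 1.5, 0.3]))
```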

Difference in Maximum Entropy
The Difference in Maximum Entropy (DME) algorithm (Algorithm 1) is aimed at evaluating and classifying a set of unevaluated instances.
It takes as input a set T⁺ of non-default instances that occurred in the past and a set T̂ of unevaluated instances, returning as output a set E containing all the instances in T̂, classified as accepted or rejected on the basis of the Λ and γ information.
In step 2 we calculate the Λ_a values by using the non-default instances in T⁺, as described in Section 4.1, while in step 3 we obtain the global maximum entropy γ (Section 4.2). The steps from 4 to 26 process all the instances t̂ ∈ T̂.
After the calculation of the Λ_b and γ_b values (steps 5 and 6), performed by adding the current instance t̂ to the set of non-default instances T⁺, in steps 7 to 13 we compare each λ_a ∈ Λ_a with the corresponding feature value λ_b ∈ Λ_b (steps 8 to 12), counting how many times the value of λ_b is greater than that of λ_a: we increase the value of b (step 9) when this happens, and that of a otherwise (step 11); in steps 14 to 18 we perform the same operation, but taking into account the global maximum entropy γ.
At the end of the previous sub-processes, in steps 19 to 23 we classify the current instance as accepted or rejected on the basis of the a and b values, and then we reset them to zero (steps 24 and 25). The resulting set E is returned at the end of the entire process (step 27).

Experiments
This section describes the experimental environment, the used datasets and metrics, the adopted strategy, and the results of the performed experiments.

Experimental Setup
The proposed approach was developed in Java, while the implementation of the state-of-the-art approach used to evaluate its performance was made in R, using the randomForest and ROCR packages.
The experiments have been performed by using two real-world datasets characterized by a strongly imbalanced distribution of data. For reasons of reproducibility of the RF experiments, the R function set.seed() has been used in order to fix the seed of the random number generator. The RF parameters have been tuned by searching for those that maximize the performance.
It should be further added that we verified the existence of a statistically significant difference between the results by using independent-samples two-tailed Student's t-tests (p < 0.05).

Datasets
The two real-world datasets used in the experiments (i.e., the Default of Credit Card Clients dataset and the German Credit dataset, both available at the UCI Repository of Machine Learning Databases) represent two benchmarks in this research field. In the following we provide a short description of their characteristics.

Default of Credit Card Clients (DC). It contains 30,000 instances: 23,364 of them are credit-worthy applicants (77.88%) and 6,636 are not credit-worthy (22.12%). Each instance contains 23 attributes and a binary class variable (accepted or rejected).

German Credit (GC). It contains 1,000 instances: 700 of them are credit-worthy applicants (70.00%) and 300 are not credit-worthy (30.00%). Each instance contains 21 attributes and a binary class variable (accepted or rejected).

Metrics
This section presents the metrics used in the experiments.
Accuracy. The Accuracy metric reports the number of instances correctly classified, compared to the total number of instances. More formally, given a set of instances X to be classified, it is calculated as shown in Equation 6, i.e., as the ratio |X^(+)|/|X|, where |X| stands for the total number of instances and |X^(+)| for the number of those correctly classified.
F-measure. The F-measure is the harmonic mean of the precision and recall metrics.
It is a widely used metric in the statistical analysis of binary classification and returns a value in the range [0, 1], where 0 represents the worst value and 1 the best one. More formally, given two sets X and Y, where X denotes the set of performed classifications of instances and Y the set containing their actual classifications, this metric is defined as shown in Equation 7.
AUC. The Area Under the Receiver Operating Characteristic curve (AUC) is a performance measure used to evaluate the effectiveness of a classification model [43,44]. Its result lies in the range [0, 1], where 1 indicates the best performance. More formally, according to the notation of Section 3, given the subset of non-default instances T⁺ and the subset of default ones T⁻, the formalization of the AUC metric is reported in Equation 8, where Θ indicates all possible comparisons between the instances of the two subsets T⁺ and T⁻. It should be noted that the result is obtained by averaging over these comparisons.
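As a practical illustration (not part of the original experimental pipeline, which relied on the R ROCR package), the three metrics can be computed with scikit-learn equivalents; the labels and scores below are purely illustrative values.

```python
# Sketch of the three evaluation metrics using scikit-learn equivalents.
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]    # actual classes (1 = accepted, 0 = rejected); illustrative
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]    # predicted classes
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.95, 0.3]  # scores used for the ROC curve

print("Accuracy :", accuracy_score(y_true, y_pred))
print("F-measure:", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))
```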

Strategy
The experiments have been performed by using the k-fold cross-validation criterion, with k = 10. Each dataset is randomly shuffled and then divided into k subsets; each subset is used once as the test set, while the other k-1 subsets are used as the training set, and the final result is the average of all the partial results. This approach allows us to reduce the impact of data dependency and improves the reliability of the results.
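A minimal sketch of this protocol is shown below, assuming a scikit-learn classifier and synthetic data; both the model and the data are placeholders, not the paper's actual setup.

```python
# Sketch of the k-fold cross-validation protocol described above (k = 10),
# averaging a metric over the folds. Model and data are placeholders.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=0)   # random shuffle, then k subsets
scores = []
for train_idx, test_idx in kf.split(X):
    model = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
print("mean accuracy over the 10 folds:", np.mean(scores))
```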

Experimental Results
As shown in Fig. 2 and Fig. 3, the performance of our DME approach is very similar to that of RF, both in terms of Accuracy and in terms of F-measure, and we achieve better performance than RF with the DC dataset. By examining the obtained results, the first observation that arises is that our approach reaches the same performance as RF, despite operating in a proactive manner (i.e., without using default cases during the training process). Another observation arises from the F-measure results, which show how the effectiveness of our approach increases with the number of non-default instances used in the training process (DC dataset). This does not happen with RF, although it uses both default and non-default instances during the model training.
We can observe interesting results also in terms of AUC: this metric evaluates the predictive capability of a classification model, and the results in Fig. 4 show that our performance is similar to that of RF, although we did not train our model with both classes of instances.
It should be noted that, as introduced in Section 1, the capability of the DME approach to operate proactively allows us to reduce or overcome the cold-start problem.

Conclusions and Future Work
Credit scoring techniques play a crucial role in many financial contexts (e.g., personal loans, insurance policies), since they are used by financial operators to evaluate the potential risks of lending, allowing them to reduce the losses due to defaults. This paper proposes a novel credit scoring approach that exploits an entropy-based criterion to classify a new instance as accepted or rejected.
Considering that it does not need to be trained with past default instances, it is able to operate in a proactive manner, also reducing or overcoming the cold-start and data imbalance problems that reduce the effectiveness of canonical machine learning approaches.
The experimental results presented in Section 5.5 show two important aspects of our approach: on the one hand, it performs similarly to one of the best performing approaches in the state of the art (i.e., RF), while operating proactively; on the other hand, it is able to outperform RF when a large number of non-default instances are involved in the training process, whereas the performance of RF does not improve further.
A possible follow-up of this paper could be a new series of experiments aimed at improving the non-proactive state-of-the-art approaches by adding the information related to the default cases, as well as the evaluation of the proposed approach in heterogeneous scenarios involving different types of financial data, such as those generated by an electronic commerce environment.

Algorithm 1: Difference in Maximum Entropy (DME)
Input: T⁺ = set of non-default instances, T̂ = set of instances to evaluate
Output: E = set of classified instances
1: procedure INSTANCESEVALUATION(T⁺, T̂)
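The following is a hedged Python sketch of the DME procedure as described in Section 4.3; the helper names are illustrative, ties (λ_b equal to λ_a) are counted for a as in steps 9-11, and b > a is used as the acceptance condition, following the listing fragment above. It is a sketch under these assumptions, not the paper's exact implementation.

```python
# Hedged sketch of the DME classification procedure described in Section 4.3;
# the tie-breaking rule and the acceptance condition are assumptions.
from collections import Counter
from math import log2

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def lambda_of(instances):
    """Per-feature entropies (the set Lambda) over a list of instances."""
    return [entropy(col) for col in zip(*instances)]

def gamma_of(lambdas):
    """Global maximum entropy: trapezium rule over the Lambda curve."""
    return sum((lambdas[i] + lambdas[i + 1]) / 2.0 for i in range(len(lambdas) - 1))

def dme(T_plus, T_hat):
    """Classify every unevaluated instance in T_hat as 'accepted' or 'rejected'."""
    lambda_a = lambda_of(T_plus)
    gamma_a = gamma_of(lambda_a)
    E = []
    for t in T_hat:
        lambda_b = lambda_of(T_plus + [t])       # Lambda after adding t to T+
        gamma_b = gamma_of(lambda_b)
        a = b = 0
        for la, lb in zip(lambda_a, lambda_b):   # per-feature comparison (steps 7-13)
            if lb > la:
                b += 1
            else:
                a += 1
        if gamma_b > gamma_a:                    # global comparison (steps 14-18)
            b += 1
        else:
            a += 1
        E.append((t, "accepted" if b > a else "rejected"))
    return E
```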