MODELING CREDIT APPROVAL DATA WITH NEURAL NETWORKS: AN EXPERIMENTAL INVESTIGATION AND OPTIMIZATION*

This study investigates and optimizes a Multi-Layer Perceptron (MLP) based artificial neural network (ANN) credit prediction model, combined with the effect of different ratios of training to testing instances, over five real-world credit databases. As an outcome of the alteration procedure, three hidden-layer sizes [K = 9 (ANN-1), K = 10 (ANN-2), K = 23 (ANN-3)] are chosen through pilot experiments, and 45 (5×3×3) unique neural models are therefore executed. Experimental results indicate that the neural architecture with ten hidden units is an optimal approach to classifying credit information. With these contributions, we complement previous evidence and modernize the methods of credit prediction modeling. This study, however


Introduction
Credit prediction is a key application of statistical modeling and plays an important role in contemporary financial risk management practice. It supports the key element of the credit approval process, which is to quantify precisely and effectively the degree of uncertainty associated with a creditor. The degree of credit risk of a creditor is connected with the probability of default, i.e. the event of not paying back the approved loan. Recently, Zhao et al. (2015) demonstrated that the classification performance of MLP based NN methods could improve significantly by changing the ratio of the sample composite mixture (SCM) of training and testing instances, the number of hidden neurons, and the number of training iterations. Khashman (2011) added that it also depends on the real-world databases chosen for training and validating the neural model. A careful reading of the earlier studies, however, reveals several weaknesses that can hinder NN performance: a lack of original databases (i.e., the databases used for experimental investigation were to some extent imperfect), fixed SCM ratios, and defective selection of hidden neurons. For example, Lee (2007), Min and Lee (2005), Kim and Ahn (2012) and Shin et al. (2005) applied MLP and SVM to Korean credit prediction and bankruptcy prediction. They concluded, in the first to fourth study respectively, that the best classification accuracy was found for MLP when the number of hidden nodes was 10; that MLP (88.16%) was better than SVM (88.01%); that MLP was again a better approach than ordinary statistical approaches; and that SVM was better at learning from a small set of data patterns. However, they used imbalanced training (80%) and testing (20%) sets and far smaller datasets with fewer features. Li et al. (2006) and Li et al. (2016) found that an MLP-trained model obtained a satisfactory accuracy rate for consumer credit and SME credit databases, respectively, but their results were also based on a small sample size with an imbalanced training-testing ratio.
In contrast with the above work, Khashman (2010) trained three MLP neural networks on the German credit dataset based on nine learning schemes with different training to testing ratios and different numbers of hidden neurons. That study concluded that the learning scheme with 40% training and 60% testing data and 23 hidden neurons performed best. More recently, however, Zhao et al. (2015) concluded that models with nine hidden units performed best out of 34 MLP neural network models on a similar database. Khashman (2009) investigated seven learning schemes with different training to testing ratios on the Australian credit dataset and concluded that the neural model with a 43.5% training to 56.5% testing split and 9 hidden neurons performed best. On the other hand, in Khashman (2011) an emotional neural network with ten hidden neurons outperformed the conventional neural network at the same training to testing ratio on an identical dataset. Similarly, in Jeong et al. (2012), the tuned NN model with four hidden units outperformed the non-tuned NN model for a Korean bankruptcy database.
One major constraint of existing studies is that the most relevant ones (Khashman 2009, 2010, 2011; Zhao et al. 2015) use only one dataset, with a small sample size and fewer features for system validation than a financial institution would use. Another problem in most cases when using neural networks is the use of either a balanced or an imbalanced training to testing ratio. Furthermore, with few exceptions, no experimental investigation is carried out to select the optimum number of hidden units. Therefore, for competitive performance of an NN model, versatile databases, different SCM ratios of training and testing examples, and the selection of the optimum number of hidden neurons during the model construction phase must be carefully refined.
In this context, this study proposes an investigation and optimization of an MLP based NN credit prediction model, combined with the effect of different ratios of training to testing datasets. We use three balanced/imbalanced mixtures of training and testing instances, 40%:60%, 50%:50% and 90%:10%, to determine the best one. The training data is used to train the network, while the test data is used to validate the network's performance upon completion of training. In addition, the number of hidden neurons (K) can have a large impact on the performance of the network architecture. The optimal number of hidden neurons was selected after several trials in which the number of hidden neurons was altered from one to fifty, subject to the following criteria: the lowest root mean square error (RMSE), the largest overall accuracy rate and the lowest type II error. As an outcome of this alteration procedure, three hidden-layer sizes [K = 9 (ANN-1), K = 10 (ANN-2), K = 23 (ANN-3)] are chosen through the pilot experiments (see, e.g., Supplementary information Tables A1-A5). We then compare 45 (5×3×3) unique neural models over the five databases, with different numbers of hidden neurons and different SCM ratios, to find the model with the best accuracy and effectiveness. Experimental results indicate that ANN-2, the neural architecture with ten hidden units, is an optimal approach to classifying credit information. With these contributions, we complement previous evidence and modernize the methods of credit prediction modeling.

Real-world credit database
We focus on five real-world credit datasets. The Australian, German and Japanese datasets are from the UCI machine learning repository (Lichman 2013) and have been extensively used as benchmarks in many prediction models. The Chinese credit dataset is a project dataset provided by a Chinese commercial bank, while the SPSS credit modeling dataset is from Vukovic et al. (2012). The datasets comprise examples of non-default and default creditors with a binary target variable, described by a set of risk drivers that capture information from the creditor application form. A summary of the five datasets is presented in Table 1.

Neural network architecture
We used a popular neural network architecture, the Multi-Layer Perceptron (MLP), in which all neurons and data flow are assembled through hidden units in a feedforward manner. The first layer is the input layer, which receives pieces of credit information from the external environment. In this study, the number of input neurons differs between databases; e.g., for the Australian credit dataset, the NN input layer has 14 neurons, matching the number of credit applicant features in the database. The final layer is the output layer, where the network produces the target output, Y, classifying a credit customer as either non-default or default. Any layers between these two are called hidden layers; they have no contact with external credit information and can only receive responses from the connected layers. The final decision is obtained by comparing the target output, Y, with a threshold, generally set at 0.5: the decision is non-default if Y > 0.5; otherwise the customer is classified as a default. To reduce computational complexity, a single hidden layer containing K neurons is used in this study for all networks. Figure 1 depicts the architecture of the ANN classifier.
The number of neurons, K, in the hidden layer was first chosen as 9 (ANN-1), then changed to 10 (ANN-2) and 23 (ANN-3) in subsequent experiments to compare system performance, since K has a large impact on the performance of the network architecture. A large number of hidden nodes requires a large number of iterations in the training phase. It is vital not to over-fit the network with more hidden units than needed, to the point where it can memorize the training set. As a result, different rules of thumb have been proposed in the literature for selecting the optimal number of hidden nodes, but their optimality is not ensured. For instance, Salchenberger et al. (1992) suggested that the number of hidden units should be 75% of the number of features. Tang and Chi (2005) advocated that the number of hidden units in a three-layer MLP network should be m/2, 2m/3, m, m+1, and 2m+1, where m is the number of features in the respective database. Moreover, a random selection of hidden layer neurons was followed in Tsai and Wu (2008). Later, a more popular method was cited by Jeong et al. (2012), who reported that the hidden units of the MLP architecture were optimized by means of a Cross-Validation (CV) procedure. But some evidence (Tommi 2014; Arlot, Celisse 2010; Barrow, Crone 2016) suggests that in practice CV suffers from two major shortcomings. The first is that when it is employed to choose between two or more networks, the CV estimate of network accuracy tends to be higher than the true network accuracy, and this tendency becomes more pronounced as the number of networks tested rises. The second, associated difficulty is that, usually, the more networks that are tested, the higher the probability that CV will fail to choose the best available neural architecture.
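The rules of thumb quoted above can be turned into concrete candidate sizes. The following is a minimal sketch, not part of the original study: the function name `hidden_unit_candidates` and the choice to round fractional values up are assumptions made here for illustration.

```python
import math


def hidden_unit_candidates(m):
    """Hidden-layer sizes suggested by the quoted rules of thumb for m features.

    Rounding fractional values up is an assumption; the cited rules do not
    state how non-integer results should be handled.
    """
    return {
        "0.75m": math.ceil(0.75 * m),   # Salchenberger et al. (1992)
        "m/2": math.ceil(m / 2),        # Tang and Chi (2005), first option
        "2m/3": math.ceil(2 * m / 3),
        "m": m,
        "m+1": m + 1,
        "2m+1": 2 * m + 1,
    }


# Example: the Australian credit dataset has 14 input features.
candidates = hidden_unit_candidates(14)
for rule, k in candidates.items():
    print(f"{rule:>6}: K = {k}")
```

For 14 features the heuristics spread from 7 to 29 hidden units, which illustrates why the text treats them as starting points rather than guarantees of optimality.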
However, an alteration experiment was used in the studies most relevant to this work (Khashman 2009, 2010, 2011; Zhao et al. 2015) to determine the number of neurons in the hidden layer. This procedure has also been used in other application areas, e.g., project selection in portfolio management (Costantino 2015), nonlinear time series forecasting (Zhong, Enke 2017), and predicting soil distribution (Falamaki 2013). Considering their successful experimentation, the optimal number of hidden neurons was selected after several trials in which the number of hidden neurons was altered from one to 50, subject to the following criteria: the lowest RMSE, the largest overall accuracy and the lowest type II error (see, e.g., Supplementary information Tables A1-A5). The learning rate is set to 0.4 and the momentum term to 0.9, while the learning of the NN is stopped when a predetermined number of epochs is reached. It uses the gradient descent method to control the speed of training. The activation function in the hidden layer and the output layer is the hyperbolic tangent.
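The architecture and training settings described in this section can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the helper names (`init_layer`, `layer_forward`, `predict`, `momentum_step`), the weight initialization range and the bias handling are assumptions. Only the tanh activation, the 0.5 decision threshold, the learning rate of 0.4 and the momentum of 0.9 come from the text.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    # One row per neuron; the last entry of each row is the bias weight.
    # The uniform(-0.5, 0.5) range is an assumption for illustration.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def layer_forward(weights, inputs):
    # Hyperbolic tangent activation, as stated for hidden and output layers.
    extended = list(inputs) + [1.0]  # append constant bias input
    return [math.tanh(sum(w * x for w, x in zip(row, extended))) for row in weights]


def predict(hidden_w, output_w, features, threshold=0.5):
    # Single hidden layer with K neurons, one output neuron producing Y.
    hidden = layer_forward(hidden_w, features)
    y = layer_forward(output_w, hidden)[0]
    return y, ("non-default" if y > threshold else "default")


def momentum_step(weight, grad, velocity, lr=0.4, momentum=0.9):
    # Classic gradient descent update with a momentum term.
    velocity = momentum * velocity - lr * grad
    return weight + velocity, velocity


rng = random.Random(0)
hidden_w = init_layer(14, 10, rng)   # 14 inputs (Australian), K = 10 (ANN-2)
output_w = init_layer(10, 1, rng)
y, label = predict(hidden_w, output_w, [0.1] * 14)
print(f"Y = {y:.3f} -> {label}")
```

The untrained network here only demonstrates the data flow; in practice the weights would be updated with `momentum_step` over a fixed number of epochs, as the text describes.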

Training and testing sub-sets
No rules of thumb are suggested in the literature for designing training to testing subset ratios (Khashman 2010). Therefore, in the current experiments, we use three balanced/imbalanced mixtures of training and testing instances, 40%:60%, 50%:50% and 90%:10%, to determine the best one. The first two ratios were chosen because they are close or equal to 50%:50%. For the third case, 80%:20% could also be sensible, but the training accuracy of the model is assumed to increase with the number of training examples, which supports adopting 90%:10% to enhance the network's predictability (Zhao et al. 2015). It would be reasonable to adopt all possible mixtures, i.e. 10%:90%, 20%:80%, 30%:70%, and so on up to 90%:10%, but space constraints make it impractical to do so over five different databases. The training data is used to train the network, while the test data is used to validate the network's performance upon completion of training.
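As a rough illustration of the three SCM ratios, the sketch below partitions a dataset into training and testing subsets. It is not taken from the study: the function name `split_by_ratio`, the seeded shuffle and the rounding of the training size are assumptions, and the text does not say whether instances were shuffled before splitting.

```python
import random

# Training fraction for each SCM ratio used in the experiments.
SCM_RATIOS = {"40%:60%": 0.40, "50%:50%": 0.50, "90%:10%": 0.90}


def split_by_ratio(instances, train_fraction, seed=0):
    """Shuffle a copy of the instances and split it into (train, test)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # shuffling is an assumption
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Example with 690 placeholder instances (the size of the Australian dataset).
instances = list(range(690))
for name, fraction in SCM_RATIOS.items():
    train, test = split_by_ratio(instances, fraction)
    print(f"{name}: {len(train)} training, {len(test)} testing")
```

Using a fixed seed keeps the three splits reproducible, so that differences between models can be attributed to the ratio and the number of hidden units rather than to a different random partition.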

Performance evaluation
In order to evaluate the NN-based credit evaluation system, three standard measures are used, which are derived from a 2 × 2 confusion matrix as given in Table 2, where tp are true positive, fp false positive, fn false negative, and tn true negative counts. The evaluation measures are defined respectively as follows:
Overall accuracy = (tp + tn) / (tp + fp + fn + tn); (1)
Type I error = fn / (tp + fn); (2)
Type II error = fp / (tn + fp). (3)
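The measures above translate directly into code. The sketch below uses the type I and type II definitions given in the text; the overall accuracy function uses the standard confusion-matrix definition, which is an assumption here. The counts in the example are hypothetical and are not results from the study.

```python
def overall_accuracy(tp, fp, fn, tn):
    # Standard definition: correctly classified cases over all cases.
    return (tp + tn) / (tp + fp + fn + tn)


def type_i_error(tp, fn):
    # Share of non-default (positive) creditors misclassified as default.
    return fn / (tp + fn)


def type_ii_error(tn, fp):
    # Share of default (negative) creditors misclassified as non-default.
    return fp / (tn + fp)


# Hypothetical confusion-matrix counts, for illustration only.
tp, fn, tn, fp = 90, 10, 40, 10
print(f"accuracy      = {overall_accuracy(tp, fp, fn, tn):.4f}")
print(f"type I error  = {type_i_error(tp, fn):.4f}")
print(f"type II error = {type_ii_error(tn, fp):.4f}")
```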

Cost of credit prediction errors
We summarize the costs of credit prediction errors, type I and type II errors, and their impact on classifier selection. Several studies (Abellán, Castellano 2017; Lee, Chen 2005; West 2000) suggest that adding these costs into the prediction models can lead to better and more precise results. The costs related to type I errors (a non-default creditor is misclassified as default) and type II errors (a default creditor is misclassified as non-default) are notably different. Usually, the misclassification costs related to type II errors, P12, are much higher and more detrimental than those related to type I errors, P21. It is therefore vital to assess the credit neural network algorithms by their associated cost, described as follows, rather than relying on overall accuracy alone.
To determine the cost function of the credit prediction models, the ratio of misclassification (MC) costs associated with type II and type I errors proposed by Dr. Hofmann is 5:1 (West 2000). We do not rely only on this relative cost ratio of 5:1 but also provide a sensitivity analysis using higher cost ratios of 7:1, 10:1, 12:1 and 15:1; in the current study, therefore, we consider five different levels of MC cost for each database. In turbulent financial conditions in particular, a higher cost ratio might be more suitable, and Kao et al. (2012) suggested that the relative cost ratio can range from 5:1 to 20:1. Determining the cost function also requires an estimate of the prior probabilities of non-default credit, π1, and default credit, π2, in the applicant pool of the credit prediction model. These prior probabilities are estimated from the actual ratios of non-default and default credit in the empirical databases. The ratios s2/S2 and s1/S1 in Equation (4) compute the probability of making type II errors and type I errors, respectively.
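Equation (4) itself does not appear in this text. As a hedged sketch, the code below assumes the expected misclassification cost takes the commonly used form EMC = P21 · π1 · (s1/S1) + P12 · π2 · (s2/S2), which is consistent with the symbols defined above but should be checked against West (2000). The function name and all input values are hypothetical.

```python
def expected_mc_cost(type_i_rate, type_ii_rate, prior_non_default, prior_default,
                     cost_type_ii=5.0, cost_type_i=1.0):
    """Expected misclassification cost under the assumed form of Equation (4).

    type_i_rate  corresponds to s1/S1, type_ii_rate to s2/S2,
    prior_non_default to pi_1 and prior_default to pi_2.
    """
    return (cost_type_i * prior_non_default * type_i_rate
            + cost_type_ii * prior_default * type_ii_rate)


# Sensitivity analysis over the five cost ratios named in the text.
# Error rates and priors below are illustrative, not results from the study.
for ratio in (5, 7, 10, 12, 15):
    emc = expected_mc_cost(0.20, 0.30, 0.70, 0.30, cost_type_ii=ratio)
    print(f"MC ratio {ratio}:1 -> EMC = {emc:.3f}")
```

Because the type II term is scaled by the cost ratio, a model with a low type II error becomes increasingly attractive as the ratio rises, even if its overall accuracy is not the highest.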

Model prediction
We use three hidden-layer sizes, 9 (ANN-1), 10 (ANN-2) and 23 (ANN-3), which were picked through pilot studies, and therefore execute 45 (5×3×3) unique neural models. The effects of the number of hidden units on accuracy rate and error rate, and of the different levels of MC cost ratios, over the five credit databases with three SCM ratios are summarized in Tables 3-7.
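The 45 models are the full cross product of databases, architectures and SCM ratios. A minimal sketch of that experimental grid is shown below; the variable names are chosen here for illustration, while the values come from the text.

```python
from itertools import product

DATABASES = ["Australian", "German", "Japanese", "Chinese", "SPSS"]
HIDDEN_UNITS = {"ANN-1": 9, "ANN-2": 10, "ANN-3": 23}
SCM_RATIOS = ["40%:60%", "50%:50%", "90%:10%"]

# One configuration per (database, architecture, SCM ratio) combination.
experiments = [
    {"database": db, "model": name, "hidden_units": k, "scm_ratio": ratio}
    for db, (name, k), ratio in product(DATABASES, HIDDEN_UNITS.items(), SCM_RATIOS)
]

print(len(experiments))  # 5 x 3 x 3 = 45
print(experiments[0])
```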
From the experimental results shown in   Zhao et al. (2015), in the setting of 70%:30% approved/rejected instances with nine hidden units, Zhao's model achieved 87% accuracy. There are two possible reasons: first, their database was designed by a novel data distribution method, namely "Average Random Choosing"; second, their models were trained with training, validation, and test datasets. As indicated in Table 5, the prediction results for the Japanese credit dataset display some patterns similar to those discussed for the German credit dataset. The ANN-3 classifier in 90%:10% SCMR outperforms the other neural models with regard to test dataset accuracy. Conversely, the ANN-3 classifier under 40%:60% SCMR shows the best classification capability, with an overall accuracy of 95.28% (97.09% training and 87.50% testing accuracy) and the minimum RMSE of 14.05%.
For the Chinese credit dataset, the experimental results presented in Table 6 reveal that the ANN-2 credit classifier in 90%:10% SCMR produces the best results, with the highest overall accuracy of 78.83% (80.77% training and 58.01% testing accuracy) together with a minimum RMSE of 15.39%. The ANN-3 classifier in 90%:10% SCMR, by contrast, shows the greatest ability to predict default creditors, with a test dataset accuracy of 76.19%. However, there is a remarkable accuracy gap between training and testing examples, as the database is highly unbalanced.
Note: * Tr-dataset and Te-dataset refer to training and testing datasets, respectively.
For the SPSS credit modeling database, the results presented in Table 7 reveal that the ANN-1 credit prediction classifier in 40%:60% SCMR shows the best predictive performance, with 91.22% overall accuracy (92.55% training and 85.92% testing accuracy) together with a minimum RMSE of 11.47%. The ANN-2 classifier in 90%:10% SCMR, in contrast, shows the best predictive performance in terms of testing dataset accuracy, at 90.91%.

Type I and type II errors with their corresponding EMC cost
Tables 3-7 also summarize the type I and type II errors of the neural models across the five credit databases, with their corresponding expected MC (EMC) costs. According to Tables 3-7, ANN-2 in 50%:50% SCMR reduces the type I and type II errors to 7.22% and the most competitive 3.49%, respectively, for the Australian credit dataset, and to the best 0.50% and the worst 93.94%, respectively, for the Chinese credit dataset. As mentioned earlier, the latter dataset is the most imbalanced (the ratio is about 43:1) and produces the worst type II error rate. For the German (Japanese) credit dataset, ANN-2 in 90%:10% (ANN-3 in 40%:60%) SCMR has the most competitive type II error rate of 30.53% (5.05%) and a significantly low (the lowest) type I error rate of 19.87% (4.25%) in comparison with the other neural models. In addition, ANN-1 in 40%:60% SCMR decreases the two indicators to the most competitive 10.81% and 8.24% for the SPSS credit modeling dataset. These results are consistent with the overall accuracy of the models for all databases except the Chinese.
In terms of prediction cost, for the Chinese and Australian credit databases the ANN-2 in 50%:50% SCMR neural model is the best amongst all classifiers at an MC ratio of 5:1. When the investigation is extended to MC ratios of 7:1 to 15:1, ANN-2 in 50%:50% SCMR still produces the minimum MC costs. For the Japanese credit dataset, the lowest EMC amongst all neural classifiers at all MC ratios, 5:1 to 15:1, is for the ANN-3 in 40%:60% SCMR model; for the SPSS credit dataset, it is for the ANN-1 in 40%:60% SCMR model; for the German credit dataset, it is for the ANN-2 in 90%:10% SCMR model. These results are highly relevant for decision makers, who need to achieve an appropriate balance between both error types so as not to lose potentially non-default creditors.

Selecting the optimal SCM ratio
We average the performance indicators in Figures 2-4 across all training to testing ratios over the five credit datasets to assess their influence on modeling the credit approval data. We then display the performance indicators in Figures 5-7 across the five credit datasets to further review the neural based credit prediction models.
As illustrated by Figures 2-4, an SCM ratio of 90%:10% performs best in the ANN-1 and ANN-2 models with regard to accuracy rate, but it is an imbalanced ratio that is biased toward the majority class. There is no clear "winner" for the type I error indicator: ANN-1 performs best at the 50%:50% SCM ratio, ANN-2 at 90%:10%, and ANN-3 at 40%:60%. ANN-1 and ANN-2, once more, perform best on type II error at the 40%:60% SCM ratio. Different SCM ratios therefore show different results on different indicators. Taking a closer look at Figures 5-7, the predictive performance of the ANN-1 credit approval model is the best on all indicators, which suggests that the neural model with nine hidden units is more stable.
Note: * Tr/Te sets refer to training and testing datasets, respectively.

Comparison with the most perfect models
To assess the reliability of the findings, we used a non-parametric Wilcoxon signed-ranks (WSR) test, with the significance level set at p = 0.01/0.05, to identify statistically significant performance differences between the neural based credit approval classifiers. All credit classifiers (Model Y) are tested for significant differences from the best performing classifier (Model X) in the database. The null hypothesis is "Model X's overall accuracy/type I/II errors = Model Y's overall accuracy/type I/II errors", while the reverse is the alternative hypothesis. The column "improvement" presents the relative improvement in overall accuracy (type I/II errors) that Model X achieves over Model Y. The results are summarized in Supplementary information Tables B1-B5.
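For readers unfamiliar with the WSR test, the sketch below computes the signed-rank statistic and a two-sided p-value using the normal approximation. It is a simplified illustration under stated assumptions: zero differences are dropped, tied absolute differences receive average ranks, and no tie or continuity correction is applied. The function name and the paired values are hypothetical; the text does not specify exactly which paired observations were compared.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W, two-sided p) for paired samples via the normal approximation."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a block of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return w, p


# Hypothetical paired accuracies (%) for Model X and Model Y.
model_x = [91, 93, 95, 97, 99, 90, 92, 94, 96, 98]
model_y = [90, 91, 92, 93, 94, 84, 85, 86, 87, 88]
w, p = wilcoxon_signed_rank(model_x, model_y)
print(f"W = {w}, p = {p:.4f}")
```

With small samples an exact test is preferable to the normal approximation; statistical packages typically offer both.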
Supplementary information Tables B1-B5 show that ANN-2 in 50%:50% for the Chinese credit dataset, followed by ANN-1 in 50%:50% for the SPSS credit dataset, achieves a large improvement over the other classifiers on the overall accuracy criterion. For type I error, ANN-1 in 50%:50% for the SPSS credit dataset yields more than 81% improvement, while for type II error, ANN-2 in 50%:50% for the Australian credit dataset obtains more than 78% improvement. Supplementary information Tables B1-B5 also show that the improvements in all databases, except the Chinese and a few cases in others, are statistically significant with respect to the best performing neural classifiers. The following observations emerge from the results presented in Tables 3-7 and Supplementary information Tables B1-B5.
1) The number of hidden units affects the accuracy of the classifier. Across most situations, ANN-2, the credit modeling classifier with ten hidden units, achieves higher prediction accuracy, lower type I and type II errors, and the most competitive EMC cost in all databases except SPSS, which is also supported by the non-parametric WSR test. As illustrated in Figures 5-7, in contrast, the predictive performance of ANN-1 with nine hidden units over the five databases is the best in terms of average accuracy (82.37%), type I error (11.82%), and type II error (32.35%). The neural architecture (ANN-2) with ten hidden units is therefore proposed as a feasible approach to classifying the credit data, which is consistent with the findings of Khashman (2011).
2) Although different ratios of training to testing set usually result in different performances, we found that an SCM ratio of 90%:10% gives better prediction accuracy, suggesting that increasing the training sample size has a positive influence on classification accuracy.
(Table 1), and the best accuracy for each database from Tables 3-7, an interesting result emerges: the lower the number of creditors with a reasonable feature set, the higher the classification accuracy and the lower the type II error. For the Australian and Japanese credit datasets, 94.25% and 95.28% overall accuracy is achieved, respectively, with 690 cases each (Tables 3 and 5); for the SPSS dataset, it is 91.22% with 700 creditors (Table 7). A 79.00% overall accuracy is obtained with 1000 creditors in the German (

Conclusions
Neural networks have occupied an important position in the history of credit prediction research. For many years, the existing literature supported the supremacy of the NN classifier over a variety of optimization and statistical methods. However, three challenges have confronted the standing of NN classifiers. First, the use of an insufficient number of databases and a small number of instances with few features has been a common problem in the most relevant studies. Second, most studies using neural networks rely on either a balanced or an imbalanced training to testing ratio. More examples used in training means fewer used in testing, which may result in low accuracy. The number of hidden units is the third challenge, since too many hidden nodes can easily cause over-training of the network, while too few units cannot attain optimal accuracy. Therefore, we set out an experimental investigation to optimize an MLP feedforward neural network for modeling credit approval data over five different databases, the Australian, Chinese, German, Japanese and SPSS, which can help to overcome the aforementioned challenges. We focused on the effect of the number of units in the hidden layer and on versatile databases with different SCM ratios of training and testing examples to improve the performance of the suggested approach. In our experiment, we trained 45 unique neural models on three SCM ratios over five different databases. These models vary in the proportion of credit approval examples used for training against those used for testing. Having compared the experimental results of the 45 neural models on the performance appraisal criteria, we conclude that ANN-2, the neural architecture with ten hidden units, outperforms the remaining neural models. In addition, an SCM ratio of 90%:10% gives better prediction accuracy, so this combination is the most fitting for modeling an acceptable neural architecture.
In addition, we find an interesting relationship between sample size and overall accuracy: when a network is trained on a limited sample with a reasonable feature set, it may achieve higher classification accuracy. With these contributions, we complement previous evidence and modernize the methods of credit prediction modeling in several respects. First, our classifier outperforms most of the relevant studies in terms of predictive ability. Second, the findings of our study are compared against a more extensive set of alternative methods than in the relevant articles. Third, our investigation is simpler and offers a clear visualization of the complex neural architecture.
This study has practical implications for bank managers and other stakeholders in delineating the risk profile of credit customers. From the managerial point of view, the defensive measures can differ in the short, medium, or long term based on the predicted status of creditors, that is, on the group to which the creditor belongs. Similarly, the profitability value of more accurate credit predictions is an important concern. Therefore, it is vital to judge whether the findings of this study generalize to real-world applications, and to what extent their implementation would add to profit. These questions are much debated in the literature, and this study adds some points to the debate. This investigation could also be extended to include other financial products by collecting more important features that will improve the prediction ability. We hope that such attempts will be made in other regions by many modelers.