A DYNAMIC CREDIT SCORING MODEL BASED ON SURVIVAL GRADIENT BOOSTING DECISION TREE APPROACH

Credit scoring, which is typically formulated as a classification problem, is a powerful tool to manage credit risk since it forecasts the probability of default (PD) of a loan application. However, there is a growing trend of integrating survival analysis into credit scoring to provide a dynamic prediction of PD over time and a clear treatment of censoring. A novel dynamic credit scoring model (i.e., SurvXGBoost) is proposed based on the survival gradient boosting decision tree (GBDT) approach. Our proposal, which combines survival analysis and the GBDT approach, is expected to enhance predictability relative to statistical survival models. The proposed method is compared with several common benchmark models on a real-world consumer loan dataset. The results of out-of-sample and out-of-time validation indicate that SurvXGBoost outperforms the benchmarks in terms of predictability and misclassification cost. The incorporation of macroeconomic variables can further enhance the performance of survival models. The proposed SurvXGBoost meanwhile maintains some interpretability since it provides information on feature importance.


Introduction
Since the global financial crisis, risk management has gained prominence and become a primary focus of both academia and industry. Among the various types of risk in financial institutions, credit risk, which is defined as the potential loss when a counterparty fails to meet his/her obligation, is regarded as the largest risk that financial institutions face (Apostolik et al., 2009). The Basel Accord proposed several credit risk parameters to quantify credit risk. Financial institutions are also allowed to employ internal ratings-based (IRB) methods, which means that they can build their own quantitative models to estimate these risk parameters as Basel Accord II instructed. Although the advanced IRB approach is prohibited for some types of assets in Basel Accord III, it can still be used to estimate key risk parameters for retail portfolios. Among the credit risk parameters, probability of default (PD) has received much attention from banks and researchers since it supports decision making in consumer loans and the calculation of regulatory capital requirements (Crook et al., 2007). Credit scoring, defined as an empirical model-based prediction of the undesired behaviour of a potential borrower (Lessmann et al., 2015), is a popular way to estimate PD in practice.
Credit scoring mainly includes three sequential stages: pre-modelling, modelling, and post-modelling. During the first stage, feature selection (Maldonado et al., 2017), wavelet analysis (Hung, 2019) and data transformation (Han & Ge, 2017) are utilized to provide a representative dataset. In the post-modelling stage, model validation (e.g., misclassification cost and profit-based evaluation measures) (Lohmann & Ohliger, 2019; Xia et al., 2017b), probability calibration, credit rating migration (Huang et al., 2020; Liang et al., 2016) and interpretability (Munkhdalai et al., 2019) have been considered in prior studies. The pursuit of an accurate model is the central task of the modelling stage because a minor improvement of a credit scoring model may bring enormous economic benefits (Finlay, 2011). Although credit scoring models are built by clustering algorithms in some cases (Lim & Sohn, 2007), they are routinely built using classification methods. Statistical models (e.g., logistic regression and generalized additive models) were initially used for credit scoring. Despite their transparency and ease of implementation, statistical methods rely on some strong assumptions that are far from reality (e.g., linear separability or normal distribution of input data). Consequently, statistical methods are not comparable with machine learning methods in terms of predictability, as suggested by several comprehensive comparative analyses (Baesens et al., 2003; Lessmann et al., 2015). Popular machine learning approaches used in credit scoring include decision tree (DT), support vector machine (SVM), artificial neural network (ANN), and evolutionary algorithms, among others (Huang et al., 2007; Ong et al., 2005; West, 2000).
Inspired by the famous "no free lunch theorem" (Wolpert & Macready, 1997), ensemble learning, which combines the predictions of multiple models, has been increasingly introduced to credit risk assessment, mainly due to its superior predictive accuracy as shown in several comparative studies (Finlay, 2011; Wang et al., 2012). Random forest (RF) is even advocated as an industry benchmark in credit scoring (Lessmann et al., 2015). Among the ensemble models, gradient boosting decision tree (GBDT) and its variant algorithms have been applied as a homogeneous ensemble model in their own right (Ma et al., 2018; Xia et al., 2018b, 2020a) or as a critical component of a heterogeneous ensemble structure (Xia et al., 2018a).
However, these classification approaches are typically cross-sectional models, which share a number of drawbacks that can be overcome by survival models. Survival models have a long history and were initially applied in medical research (Klein & Moeschberger, 2006). Over the past few decades, the application of survival models to credit risk assessment has become a research hotspot (Malik & Thomas, 2010; Stepanova & Thomas, 2002; Tong et al., 2012). The advantages of survival models in credit scoring are: (1) Survival models can provide a dynamic PD prediction over time (e.g., 12-month or lifetime PDs of a loan portfolio as required in CECL and IFRS 9). A dynamic PD means that financial institutions can precisely adjust the capital charge and collection strategy during the repayment of loans; (2) Survival models can offer a reasonable treatment of censoring, which contributes to a realistic and practical credit scoring model (Leow & Crook, 2016); (3) Survival models can easily incorporate time-dependent covariates (e.g., macroeconomic or behavioural covariates) (Bellotti & Crook, 2009). Recent advances in applying survival models to credit risk assessment mainly concern building accurate survival models. One path to this goal is improving statistical models. Time-dependent covariates and coefficients are examined in depth for the survival function and partial likelihood function to consider the time-varying effects in credit scoring (Dirick et al., 2019; Djeundje & Crook, 2018, 2019; Leow & Crook, 2016). Another path is the fusion of survival models with machine learning. So far, survival models have been integrated with ANN (Baesens et al., 2005) and RF. In a benchmark study, Dirick et al. (2017) compared several classical survival models used in credit scoring and revealed that Cox PH-based models performed well, especially in combination with spline methods.
However, few studies have combined survival models with GBDT-based techniques, even though GBDT has shown superiority relative to classical classifiers on cross-sectional data (Xia et al., 2017a). Table 1 includes a selection of research applying survival models to credit scoring, which shows that only a limited number of studies have incorporated survival analysis into machine learning. We aim to fill this gap in this paper. We develop SurvXGBoost, a survival gradient boosting decision tree approach that combines XGBoost (Chen & Guestrin, 2016) and survival analysis, to provide an accurate and dynamic prediction of PD, and validate it on a large dataset of consumer loans. The experimental results also illustrate the effectiveness of our proposal.
We make three contributions to prior literature in this paper. First, we establish a novel method (i.e., SurvXGBoost) to predict PD over time. SurvXGBoost is a modified Cox proportional hazard (PH) model (Cox, 1972) that departs from the prototype by relaxing the PH assumption and allowing for non-linearity in the covariates. To the best of our knowledge, survival gradient boosting decision tree approaches have not been applied to credit risk assessment. Second, as shown in Table 1, four out of seven studies have considered only a small number of macroeconomic variables in modelling. Macroeconomic variables can reflect sudden changes in the economy (Kartal, 2020; Sukharev, 2020) and economic uncertainty (Liu et al., 2019), which are expected to affect the borrowers' ability and willingness to pay (Zhang & Thomas, 2012) and therefore determine the PD. Thus, we further enhance the predictability of the SurvXGBoost model by extracting information from the principal components of 1,042 monthly macroeconomic variables. Finally, the out-of-sample (OOS) validation that is frequently used in existing studies may contain future information in the training set and thus over-estimate model performance. Instead of the fixed training and test sets used in prior studies, we compare model performance under both OOS and out-of-time (OOT) validation to examine the external validity of empirical comparisons. The remainder of the paper is structured as follows. Section 1 introduces the preliminaries. Section 2 presents the proposed SurvXGBoost model. Section 3 explains the experimental setup, including the data, model validation, and evaluation measures. Section 4 compares and analyses the experimental results. Finally, conclusions and future research are discussed in the last section.

Standard survival analysis
Survival analysis is used to model the time until a certain event occurs (e.g., default). Since the event distribution is usually modelled as a continuous function of time, we define the survival function as the probability of not having encountered the event until a specific time $t$, namely

$$S(t) = P(T^* > t) = \int_t^{\infty} f(s)\,ds, \tag{1}$$

where $f(s)$ is the probability density function. A concept closely related to the survival function is the hazard function

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T^* < t + \Delta t \mid T^* \ge t)}{\Delta t} = \frac{f(t)}{S(t)}, \tag{2}$$

which measures the event rate at time $t$ conditional on survival until $t$. Once the hazard function is acquired, one can retrieve the survival function through the cumulative hazard $H(t)$:

$$H(t) = \int_0^t h(s)\,ds, \qquad S(t) = \exp\bigl(-H(t)\bigr). \tag{3}$$

In real-world data, a proportion of censored samples exists, which means that some loans have not defaulted by the time of data collection. Under this circumstance, early-paid and fully paid loans are interpreted as censored, the true event time of which is unknown. Instead of observing the true event time $T^*$, we can only observe the right-censored event time

$$T = \min(T^*, C), \qquad \delta = \mathbb{1}(T^* \le C),$$

where $C$ is the censoring time and $\mathbb{1}(\cdot)$ is the indicator function. Specifically, $\delta = 1$ denotes the occurrence of a default event, and $\delta = 0$ indicates an early-paid or fully paid loan. For the $i$-th sample in the dataset, let $x_i$ denote the covariates and $T_i$ the observed duration. The likelihood for right-censored data over $n$ samples is represented as follows:

$$L(\theta) = \prod_{i=1}^{n} h(T_i \mid x_i; \theta)^{\delta_i}\, S(T_i \mid x_i; \theta), \tag{4}$$

where $\theta$ is the set of parameters. Eq. (4) can be optimized by the maximum-likelihood approach over the functional space of $S$ and the parameter space of $\theta$, but this is usually intractable when no prior form is specified for the hazard function. Recent extensions of standard survival analysis mainly focus on the specification of the survival function, hazard function, or cumulative hazard. We subsequently introduce two types of modified survival analysis model, namely the Cox PH model and the random survival forests model.
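As a minimal illustration of how censored and defaulted loans enter the right-censored likelihood, the sketch below assumes a constant hazard (an exponential model), which is not a specification used in this paper. The function names and the toy data are hypothetical.

```python
import math

def exponential_log_likelihood(rate, times, events):
    """Right-censored log-likelihood under an assumed constant hazard `rate`.

    A default (event = 1) contributes log h(T) + log S(T); a censored loan
    (event = 0) contributes only log S(T), with S(t) = exp(-H(t)) = exp(-rate * t).
    """
    total = 0.0
    for t, d in zip(times, events):
        log_hazard = math.log(rate)
        log_survival = -rate * t  # log S(t) = -H(t)
        total += d * log_hazard + log_survival
    return total

def exponential_mle(times, events):
    """Closed-form maximiser: number of defaults divided by total observed time."""
    return sum(events) / sum(times)

# Hypothetical toy data: observed months on book and default indicators.
times = [6.0, 12.0, 18.0, 24.0, 36.0]
events = [1, 0, 1, 0, 0]
rate_hat = exponential_mle(times, events)
```

Censored loans still add exposure time to the denominator of the estimate, which is how the likelihood uses the information that they survived at least until their observed time.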

Cox Proportional hazard model
The Cox PH model is widely used in survival analysis. It develops a semi-parametric specification of the hazard function described in Eq. (2):

$$h(t \mid x) = h_0(t)\exp(\theta^{\top} x), \tag{5}$$

where $h_0(t)$ is a non-parametric baseline hazard, and $\exp(\theta^{\top} x)$ denotes a parametric relative risk function. Due to the semi-parametric form, the function cannot be directly optimized using the maximum-likelihood approach. In this paper, the cumulative baseline hazard (i.e., $H(t)$ described in Eq. (3)) is estimated by the Breslow estimator:

$$\hat{H}_0(t) = \sum_{i:\, T_i \le t} \frac{\delta_i}{\sum_{j \in R_i} \exp(\theta^{\top} x_j)}, \tag{6}$$

and $h_0(t)$ can therefore be estimated by smoothing the increments (i.e., $\Delta \hat{H}_0(T_i)$). The parametric part, namely the relative risk function, is fitted by maximizing the Cox partial likelihood, which is expressed as follows:

$$L(\theta) = \prod_{i:\, \delta_i = 1} \frac{\exp(\theta^{\top} x_i)}{\sum_{j \in R_i} \exp(\theta^{\top} x_j)}, \tag{7}$$

where $R_i$ denotes the risk set, i.e., the set of samples that have neither defaulted nor been censored before time $T_i$. The Cox PH model has been extensively applied to survival analysis in credit scoring. Some recent studies modified the conventional Cox PH model to handle time-varying covariates or time-varying coefficients (Bellotti & Crook, 2009; Djeundje & Crook, 2018, 2019; Leow & Crook, 2016). Specifically, the hazard function described in Eq. (5) is modified as follows:

$$h(t \mid x(t)) = h_0(t)\exp\bigl(\theta^{\top} x(t)\bigr), \tag{8}$$

$$h(t \mid x) = h_0(t)\exp\bigl(\theta(t)^{\top} x\bigr). \tag{9}$$

Although the two models are closer to reality than the prototype, one major drawback remains, namely the parametric form of the relative risk function. The universal approximation property of ANN provides a non-parametric alternative for the relative risk function (Baesens et al., 2005), but abundant evidence has shown that a single ANN is not comparable to ensemble models in credit risk assessment (Lessmann et al., 2015). To further enhance the predictability of survival models, we aim to introduce tree-based ensemble methods into survival analysis.
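The partial likelihood and the Breslow estimator can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the implementation used in the paper: it assumes the risk set of sample i is {j : T_j >= T_i}, ignores special handling of tied default times, and uses hypothetical function names and toy data.

```python
import math

def linear_scores(theta, X):
    """Linear predictor theta^T x for every row of X."""
    return [sum(c * v for c, v in zip(theta, row)) for row in X]

def cox_neg_log_partial_likelihood(theta, X, times, events):
    """Negative log partial likelihood; assumed risk set R_i = {j : T_j >= T_i}."""
    scores = linear_scores(theta, X)
    nll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 1:  # only observed defaults contribute a factor
            risk_sum = sum(math.exp(scores[j])
                           for j, t_j in enumerate(times) if t_j >= t_i)
            nll -= scores[i] - math.log(risk_sum)
    return nll

def breslow_cumulative_baseline_hazard(theta, X, times, events, t):
    """Breslow estimate of H_0(t): sum of 1 / (risk-set sum) over defaults up to t."""
    scores = linear_scores(theta, X)
    h0 = 0.0
    for t_i, d_i in zip(times, events):
        if d_i == 1 and t_i <= t:
            risk_sum = sum(math.exp(scores[j])
                           for j, t_j in enumerate(times) if t_j >= t_i)
            h0 += 1.0 / risk_sum
    return h0

# Hypothetical toy data: one covariate, four loans.
X = [[0.5], [1.0], [-0.3], [0.2]]
times = [5.0, 8.0, 12.0, 20.0]
events = [1, 0, 1, 0]
nll_at_zero = cox_neg_log_partial_likelihood([0.0], X, times, events)
h0_at_12 = breslow_cumulative_baseline_hazard([0.0], X, times, events, 12.0)
```

With all coefficients at zero every loan has relative risk 1, so each default contributes the log of its risk-set size, which makes the toy values easy to check by hand.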

Random survival forests
RF is a popular non-parametric tree-based ensemble algorithm proposed by Breiman (2001). In the survival setting, random survival forests (RSF) (Ishwaran et al., 2008) make predictions on time-to-event by combining the results of multiple survival trees. The general procedure of RSF is as follows:
Step 1. Draw a bootstrap sample from the training set.
Step 2. Train a survival tree on the bootstrap sample. For each node of the survival tree, RSF randomly selects a subset of candidate splitting variables. The splitting variable and threshold are determined by a certain criterion such as the log-rank test.
Step 3. Continue to split the nodes until a stopping criterion is reached.
Step 4. Aggregate the information of all the survival trees to obtain the risk prediction of RSF.
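Let $h$ denote a terminal node of a survival tree. Different from the conventional forms illustrated in Subsection 1.1, the cumulative hazard and survival function for terminal node $h$ are estimated using the Nelson-Aalen and Kaplan-Meier estimators, respectively:

$$\hat{H}_h(t) = \sum_{t_{l,h} \le t} \frac{d_{l,h}}{Y_{l,h}}, \tag{10}$$

$$\hat{S}_h(t) = \prod_{t_{l,h} \le t} \left(1 - \frac{d_{l,h}}{Y_{l,h}}\right), \tag{11}$$

where $t_{l,h}$ are the distinct event times in node $h$, and $d_{l,h}$ and $Y_{l,h}$ are the number of events and the number of samples at risk at time $t_{l,h}$. Given a new sample with feature $X$, $X$ is assigned to a unique terminal node $h$. The cumulative hazard and survival function for $X$ can be calculated as

$$\hat{H}(t \mid X) = \hat{H}_h(t), \quad X \in h, \tag{12}$$

$$\hat{S}(t \mid X) = \hat{S}_h(t), \quad X \in h. \tag{13}$$

The ensemble cumulative hazard and survival function are determined by averaging the tree estimators over all the survival trees. Let $B$ denote the number of survival trees and let $\hat{H}_b$ and $\hat{S}_b$ denote the estimators of the $b$-th tree; then

$$\hat{H}_e(t \mid X) = \frac{1}{B}\sum_{b=1}^{B} \hat{H}_b(t \mid X), \tag{14}$$

$$\hat{S}_e(t \mid X) = \frac{1}{B}\sum_{b=1}^{B} \hat{S}_b(t \mid X). \tag{15}$$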
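The terminal-node estimators and the averaging step can be sketched as follows. The sketch assumes that the samples falling into a terminal node are already known; growing the trees themselves is not shown, and the function names and toy data are hypothetical.

```python
def node_estimators(times, events):
    """Nelson-Aalen and Kaplan-Meier steps for the samples in one terminal node.

    Returns a list of (t, cumulative_hazard, survival) at each distinct default time.
    """
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    cum_hazard, survival, curve = 0.0, 1.0, []
    for t in event_times:
        defaults = sum(1 for u, d in zip(times, events) if u == t and d == 1)
        at_risk = sum(1 for u in times if u >= t)
        cum_hazard += defaults / at_risk      # Nelson-Aalen increment
        survival *= 1.0 - defaults / at_risk  # Kaplan-Meier factor
        curve.append((t, cum_hazard, survival))
    return curve

def evaluate(curve, t):
    """Step-function value (H(t), S(t)); before the first default it is (0, 1)."""
    value = (0.0, 1.0)
    for step_time, cum_hazard, survival in curve:
        if step_time <= t:
            value = (cum_hazard, survival)
    return value

def ensemble_estimate(curves, t):
    """Average the tree-level estimates of the nodes a new sample falls into."""
    values = [evaluate(c, t) for c in curves]
    n = len(values)
    return (sum(v[0] for v in values) / n, sum(v[1] for v in values) / n)

# Hypothetical node: five loans, three observed defaults.
curve = node_estimators([3, 5, 5, 8, 10], [1, 1, 0, 1, 0])
```

Here `curves` would hold one curve per tree, namely the curve of the terminal node that the new sample reaches in that tree.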

SurvXGBoost model
GBDT is a member of the family of boosting algorithms, which combine multiple weak learners into a strong one in an additive manner (Friedman, 2000). XGBoost (Chen & Guestrin, 2016) is an advanced GBDT-based approach that provides superior performance in credit risk assessment (Xia et al., 2017a). For a given dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, GBDT-based techniques use $K$ additive functions to make predictions on the target variable:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}, \tag{16}$$

where $\mathcal{F}$ is the space of classification and regression trees. To determine the set of $f_k(x)$ in Eq. (16), SurvXGBoost aims to minimize the objective function $L_{obj}$ below:

$$L_{obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \tag{17}$$

where $l(\cdot)$ is a convex loss function that measures the difference between the true value $y_i$ and the prediction $\hat{y}_i$. The loss function of SurvXGBoost is derived from the likelihood defined in Eq. (4). $\Omega(f_k) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$ herein is a regularization term, where $\gamma$ is a regularization hyper-parameter, $T$ denotes the number of splits, $w$ is the vector of leaf weights, and $\lambda$ is the L2 regularization coefficient. Let $\hat{y}_i^{(t)}$ denote the prediction of the $i$-th sample at the $t$-th iteration. SurvXGBoost attempts to optimize Eq. (17) by adding a new base learner $f_t$:

$$L_{obj}^{(t)} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t). \tag{18}$$

In SurvXGBoost, the optimal new base learner is approximated by the Newton-Raphson method rather than the gradient descent method used in conventional GBDT, since the use of second-order gradient information usually provides a faster approximation. After removing the constant term, Eq. (18) can be simplified to the following form at the $t$-th iteration:

$$\tilde{L}_{obj}^{(t)} = \sum_{i=1}^{n} \Bigl[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Bigr] + \Omega(f_t), \tag{19}$$

where $g_i$ and $h_i$ represent the first- and second-order gradients of the loss with respect to $\hat{y}_i^{(t-1)}$ for the $i$-th sample, respectively. Once the optimal base learner is acquired, it is added to the prior functions following Eq. (16) to finish an iteration. Moreover, XGBoost also makes some engineering optimizations to build a scalable GBDT approach. For example, a histogram-based approximation algorithm is developed to quickly optimize Eq. (19). In addition, XGBoost supports distributed learning and GPU computing, which accommodates big-data applications.
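To make the second-order step concrete, the sketch below computes first- and second-order gradients for one plausible survival loss, the negative log Cox partial likelihood with the linear predictor replaced by the ensemble prediction. That choice of loss is an assumption made for illustration and is not confirmed by the text above. The closed-form leaf weight is the standard Newton step for a regularized tree leaf; all function names and toy data are hypothetical.

```python
import math

def cox_loss(preds, times, events):
    """Assumed loss: negative log partial likelihood, risk set {j : T_j >= T_i}."""
    n = len(preds)
    loss = 0.0
    for i in range(n):
        if events[i] == 1:
            risk = sum(math.exp(preds[j]) for j in range(n) if times[j] >= times[i])
            loss -= preds[i] - math.log(risk)
    return loss

def cox_grad_hess(preds, times, events):
    """First- and (diagonal) second-order gradients of cox_loss w.r.t. each prediction."""
    n = len(preds)
    risk_sums = [sum(math.exp(preds[j]) for j in range(n) if times[j] >= times[i])
                 for i in range(n)]
    grads, hess = [], []
    for m in range(n):
        e_m = math.exp(preds[m])
        # Sample m belongs to the risk set of every default i with T_i <= T_m.
        a = sum(1.0 / risk_sums[i] for i in range(n)
                if events[i] == 1 and times[i] <= times[m])
        b = sum(1.0 / risk_sums[i] ** 2 for i in range(n)
                if events[i] == 1 and times[i] <= times[m])
        grads.append(e_m * a - events[m])
        hess.append(e_m * a - e_m ** 2 * b)
    return grads, hess

def optimal_leaf_weight(grads, hess, lam):
    """Newton step for one leaf: w* = -sum(g) / (sum(h) + lambda)."""
    return -sum(grads) / (sum(hess) + lam)

# Hypothetical toy data: current predictions, observed times, default indicators.
preds = [0.1, -0.2, 0.3, 0.0]
times = [5.0, 8.0, 12.0, 20.0]
events = [1, 0, 1, 0]
grads, hess = cox_grad_hess(preds, times, events)
# Weight of a hypothetical leaf that contains samples 0 and 2, with lambda = 1.
w_leaf = optimal_leaf_weight([grads[0], grads[2]], [hess[0], hess[2]], lam=1.0)
```

Only the diagonal of the Hessian is used, which matches the per-sample form of the second-order approximation. The leaf weight follows from minimizing the quadratic in the leaf value for the samples assigned to that leaf.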

Data
To demonstrate the superiority of SurvXGBoost, an experiment is conducted on a real-world consumer loan dataset. The dataset is derived from the consumer loan transactions of a major P2P lending platform in the U.S. It consists of 226,148 loan applications issued between January 2009 and December 2013. The dataset comprises a variety of application variables, which can be roughly categorized into three types, namely loan characteristics, borrower's creditworthiness, and borrower's solvency. In this dataset, we define the loan status "reject" as 120 days or more past due. The status "fully paid" means that the loan is paid off before or at maturity. The summary statistics of the time-to-event, loan status, and the features are displayed in Table 2, which shows that the class distribution is imbalanced in this dataset.
Moreover, macroeconomic variables are added to reflect the effect of business cycle dynamics on the repayment of retail loans. Instead of the limited number of macroeconomic variables used in prior studies (Bellotti & Crook, 2009; Dirick et al., 2019; Djeundje & Crook, 2019), we apply 1,042 macroeconomic variables to provide a comprehensive description of economic conditions. All these macroeconomic variables are monthly series. Since multicollinearity occurs among the macroeconomic variables, principal component analysis is applied to convert the original set of variables into a small number of principal components (PCs), which are subsequently employed as features in the survival analysis. Table 3 displays the top five influential macroeconomic variables for each PC. According to the influential macroeconomic variables, the PCs are named the labour force indicator, employment indicator, price indicator, recession indicator, technological diffusion and advancement indicator, and inventory indicator, respectively.

Model validation
The initial task of model validation lies in splitting the data into training and test sets. In prior studies, fixed training and test sets were determined based on a certain time threshold, and the survival models were built only once using the training set and then made predictions on the test set (Bellotti & Crook, 2009; Dirick et al., 2019; Djeundje & Crook, 2018, 2019). However, such a validation approach raises concerns in that it may provide unreliable experimental results, especially for small datasets (Lessmann et al., 2015). As a result, we use two types of validation approaches in this paper. Regarding the OOS validation, a five-fold cross-validation (CV) is repeated 50 times and the average performance is used to evaluate the models. Concerning the OOT validation, we apply a sliding-window method, the basic idea of which is described in Figure 1. The "window" is specified as one year and the test set comprises the samples issued in the corresponding year. The training set includes the samples issued before the window. Once the models are built and evaluated, the window slides into the next year. The sliding-window method does not stop until the samples issued in the final year have been employed as the test set. The proposed SurvXGBoost model is compared with three benchmark survival models, namely the Cox PH model, RSF, and the survival GBDT model (Chen et al., 2013). The Cox PH model is a state-of-the-art benchmark that has been extensively considered in the existing literature (Bellotti & Crook, 2009; Wang et al., 2018). We have also employed a time-varying Cox PH model following Bellotti and Crook (2013) to capture the time-varying effects of some features. The number of trees is set to 200 for RSF. To optimize the hyper-parameters of GBDT and XGBoost, a Bayesian hyper-parameter tuning approach is performed following Xia et al. (2017a). Bayesian hyper-parameter optimization is a type of sequential model-based optimization that runs for a pre-set number of iterations. During each iteration, it builds a surrogate probability model of a fitness function and determines the optimal hyper-parameters for the surrogate. These hyper-parameters are then applied to the real fitness function, which returns the corresponding result. The hyper-parameters and the corresponding result are then used to update the surrogate model. The iteration continues until the pre-set number of iterations is reached. Specifically, the five-fold CV Concordance Index (C-index) on the training set is employed as the fitness function of survival GBDT and SurvXGBoost. The surrogate probability model is established based on the Tree Parzen Estimator following Bergstra et al. (2011). The hyper-parameters considered in this paper are summarized in Table 4.

Figure 1. Sliding-window method for OOT validation (each row shows the TRAIN periods followed by the TEST period for one position of the window)
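The sliding-window split can be sketched as a small generator. It assumes that each loan carries its issue year and that the training set grows with every step, as described above; the function name and the toy data are hypothetical.

```python
def sliding_window_splits(issue_years, first_test_year):
    """Yield (test_year, train_indices, test_indices) for a one-year test window.

    The training set contains every loan issued before the test year, so no
    information from the test year or later leaks into model fitting.
    """
    last_year = max(issue_years)
    for year in range(first_test_year, last_year + 1):
        train = [i for i, y in enumerate(issue_years) if y < year]
        test = [i for i, y in enumerate(issue_years) if y == year]
        yield year, train, test

# Hypothetical toy data: issue year of seven loans.
issue_years = [2009, 2009, 2010, 2011, 2011, 2012, 2013]
splits = list(sliding_window_splits(issue_years, first_test_year=2010))
```

Each element of `splits` would drive one round of model fitting and evaluation, and the per-year results can then be compared across models.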

Evaluation measures
We employ four popular evaluation measures to examine model performance in terms of discriminative ability and label prediction. Discriminative ability refers to the model's capability of distinguishing between default and non-default borrowers. Following the instructions of CECL and IFRS 9, the 12-month PD of the survival models is evaluated. The evaluation metrics consist of: (1) C-index, which is a popular performance metric that quantifies the quality of rankings for survival models. It is computed as the number of concordant pairs divided by the number of comparable pairs. The C-index typically ranges over $[0.5, 1]$, where 0.5 indicates a random guess and 1 indicates a perfect model.
(2) Area under the ROC curve (AUC), which evaluates the quality of a model's predictions irrespective of the decision threshold. AUC has been a frequently used metric for evaluating credit scoring models in prior literature (Xia et al., 2017a). AUC measures the entire two-dimensional area under the ROC curve. Following Huang and Ling (2005), AUC is calculated as follows for a binary classification:

$$\mathrm{AUC} = \frac{S_0 - n_1(n_1 + 1)/2}{n_0 n_1}, \qquad S_0 = \sum_{j=1}^{n_1} rank_j,$$

where $n_0$ and $n_1$ denote the number of non-default and default loans in the test set, respectively, and $rank_j$ is the rank of the predicted probability of the $j$-th default loan among all test loans sorted in ascending order.
(3) H measure, which was proposed by Hand (2009) to overcome the inconsistent misclassification costs that are implicitly assumed in AUC. Specifically, AUC implicitly assumes that the misclassification cost depends on the classifier rather than on the dataset, which is far from reality. Thus, Hand (2009) advocated employing a beta distribution as the cost weight function. In a further analysis, Hand and Anagnostopoulos (2014) discussed the optimal parameters of the beta distribution, and in this paper we follow their suggestion to use a beta(2, 2) distribution. Recent credit scoring studies have introduced the H measure as an efficient alternative to AUC when evaluating models (Ala'raj & Abbod, 2016; He et al., 2018). (4) Misclassification cost. Since credit scoring is usually cost-sensitive, accuracy can seldom provide an overall evaluation of the label prediction (Shen et al., 2020). As a result, we follow Lohmann and Ohliger (2019) and compute the misclassification cost by counting each type of classification error with Iverson brackets $[\cdot]$ and weighting the errors with the cost parameter $\alpha$, which we set to 5 in this paper. The misclassification cost further raises the issue of the decision threshold, which can dramatically affect the label prediction. The samples with PDs higher than the decision threshold are rejected and the remaining loan applications are granted. In a highly imbalanced dataset, a credit scoring model tends to predict all the applications as the majority class (usually non-default) and thus lacks the capability to identify risky ones (Sahin et al., 2013). Cost-sensitive learning is a solution for imbalanced datasets, and can be roughly divided into direct and indirect methods (Shen et al., 2020; Xia et al., 2017b). The direct cost-sensitive learning methods design models that are cost-sensitive in themselves, whereas the indirect methods transform cost-insensitive models into cost-sensitive ones by sampling or thresholding.
The sampling technique balances the class distribution in the training set, whereas the thresholding technique adjusts the decision threshold. In this paper, we employ the thresholding technique due to its ease of implementation and popularity. Specifically, we determine the decision threshold as the fraction of good and risky applications in the training set, as advocated by Xia et al. (2020b).
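Three of the measures above can be sketched in plain Python; the H measure is omitted because it requires numerical integration over the beta weight. The sketch makes several assumptions that are not taken from the paper: label 1 marks a default, the C-index follows Harrell's pairwise definition, the AUC uses a rank-sum form over default loans with mid-ranks for ties, and the cost function charges `alpha` for an accepted default and 1 for a rejected good loan, averaged per loan. All names, thresholds and toy data are hypothetical.

```python
def concordance_index(times, events, risk):
    """Harrell-style C-index: a pair (i, j) is comparable if i defaults before T_j."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

def auc_from_ranks(labels, scores):
    """Rank-sum AUC; ranks are ascending in the predicted PD, ties get mid-ranks."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mid_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mid_rank
        pos = end + 1
    n1 = sum(labels)   # number of default loans
    n0 = n - n1        # number of non-default loans
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n1 * (n1 + 1) / 2.0) / (n0 * n1)

def misclassification_cost(labels, pds, threshold, alpha=5.0):
    """Average cost per loan: alpha per accepted default, 1 per rejected good loan."""
    cost = 0.0
    for y, p in zip(labels, pds):
        rejected = p > threshold
        if y == 1 and not rejected:
            cost += alpha
        elif y == 0 and rejected:
            cost += 1.0
    return cost / len(labels)

# Hypothetical toy data.
labels = [0, 0, 1, 1]
pds = [0.1, 0.4, 0.35, 0.8]
auc = auc_from_ranks(labels, pds)
cost = misclassification_cost(labels, pds, threshold=0.38)
c_index = concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.3, 0.5, 0.1])
```

Lowering the threshold rejects more applications, which trades cheaper rejected-good errors against more expensive accepted-default errors; this is why the threshold choice matters for the cost measure but not for the ranking-based measures.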

Out-of-sample validation
The results in Table 5 show the average performance of SurvXGBoost and the benchmarks across the evaluation measures. The standard deviations are given in brackets and the best-performing model for each evaluation metric is highlighted in bold. Performances that are significantly inferior to the best model at the 95% confidence level according to a paired t-test are underlined. Table 5 exhibits several important findings.
First, the superiority of machine learning algorithms is explicitly demonstrated. The proposed SurvXGBoost performs significantly better than the benchmark models in both discriminative capability and misclassification cost. Moreover, survival GBDT and RSF are also marginally better than the Cox PH models. These results are similar to those in Chen et al. (2013) and further imply that the predictive ability of machine learning algorithms is superior to that of statistical ones in most cases. This finding supports the extension of machine learning approaches to real-world credit risk evaluation.
Second, among the parametric survival models, the time-varying Cox PH model provides better results in terms of C-index and misclassification cost than the original model. This implies that the time-varying Cox PH model quantifies the non-linear relationship between covariates and default time to some extent. However, the inferior performance of the time-varying Cox PH model relative to the non-parametric survival models indicates that the parametric form proposed by Bellotti and Crook (2013) can capture only part of the non-linear effects.
Finally, among the non-parametric models, Table 5 reveals that combining survival analysis with GBDT-based models improves performance relative to RSF. This result is similar to those reported in Xia et al. (2017a) and Xia et al. (2020) for classification-based credit scoring models. Future research can explore applying other advanced GBDT-based methods to survival credit scoring models.

Out-of-time validation
The OOT validation starts from the year 2010, meaning that the samples issued before 2010 are employed as the training set and the loans issued during 2010 are employed as the test set. The window then slides into the next year until the final year of the dataset is reached. Figures 2 to 5 display the heatmaps of C-index, AUC, H measure, and misclassification cost for the models, respectively. In these figures, the columns represent the models and the rows represent the years.
In a horizontal comparison, the machine learning variants of survival analysis provide promising results. SurvXGBoost provides the best performance on all evaluation metrics except the H measure. This is in line with the findings of the previous subsection and again demonstrates the superiority of the proposed SurvXGBoost. RSF and survival GBDT achieve the best performance when the data are limited. Concretely, when the samples in 2011 are utilized as the test set, the two models achieve the best AUC, H measure and misclassification cost.
In a vertical comparison, Figures 2 to 5 reveal that model performance varies across years. Although survival models provide unsatisfactory performance when training data are limited, their performance does not necessarily improve as the number of training samples grows. A possible explanation for this phenomenon is that the characteristics of loan applications show very different patterns over the years 2010 to 2013. Further investigation is required to verify this argument.

The effects of macroeconomic variables
The main goal of this subsection is to examine whether model performance improves after adding macroeconomic variables. Moreover, we aim to see whether the determinants of default and of time-to-event differ. Thus, a Cox PH model (abbreviated as Model 1) and a logistic regression model (abbreviated as Model 2) are established with all the covariates listed in Table 2.
The fitted coefficients for the two models are reported in Table 6, from which a few important findings emerge. First, loan characteristics, borrowers' creditworthiness and solvency are powerful determinants of time-to-default. This finding is consistent with those of Wang et al. (2018) and Dirick et al. (2019). Moreover, the fitted coefficients for the Cox PH model and logistic regression exhibit very different patterns. Concretely, the number of delinquencies, revolving utilization rate, and price indicator hardly affect loan status significantly, whereas they are significant determinants of time-to-default. This encourages a deeper investigation of the determinants of time-to-default in future research. Finally, macroeconomic variables, especially indicators concerning employment, price, recession, and technological diffusion and advancement, are powerful determinants in survival analysis. This further encourages us to examine the predictability of survival models when incorporating macroeconomic variables. The comparisons of the evaluation measures are presented in Table 7. The models are benchmarked against models that employ the same loan characteristics, borrowers' creditworthiness and solvency variables without macroeconomic variables. The results of both OOS and OOT validation are reported to give a comprehensive description of the models. Concerning the results of OOS validation, a comparison between Tables 5 and 7 shows that the performance of the benchmark models improves in most cases after adding macroeconomic variables to the survival models, which partially demonstrates the effectiveness of macroeconomic variables in credit risk assessment. According to the famous five Cs of credit, character, capacity, capital, collateral, and conditions are key factors in predicting a borrower's PD. The former four characteristics have received much attention, whereas macroeconomic conditions are not frequently considered in credit risk modelling. Future research should be performed on this topic.
However, the OOS validation may over-estimate the effects of macroeconomic variables since it includes future information on macroeconomic variables in the training set. Thus, we also report the OOT results in Table 7. This table reveals that macroeconomic variables can enhance model performance under OOT validation in most cases. The comparison between OOS and OOT validation indicates that in-time modelling can capture a larger share of the variation in the training set and leads to a higher C-index, H measure, and AUC than OOT validation. Considering that OOT validation is closer to the real-world modelling process yet receives limited attention in related studies, this finding highlights the necessity of OOT validation in model comparisons.
Heterogeneity in model performance is also observed under OOT validation. Concretely, for the benchmark Cox PH model, the incorporation of macroeconomic variables does not improve model performance. These results are in line with expectations since the macroeconomic variables may have non-linear effects on time-to-event. For example, Figure 6 shows the estimates of the time-dependent coefficients for the six PCs of macroeconomic variables in the Cox PH model, where the solid vertical lines indicate a coefficient of zero and the dashed lines represent the fixed coefficients of the Cox PH model. This figure illustrates rapid changes in the sign and magnitude of the coefficients for macroeconomic variables. The Cox PH model can hardly capture the non-linear effects of macroeconomic variables and thus performs worse than the non-parametric survival models.
From Table 7 we can also observe that the ranks of the models remain the same as when macroeconomic variables are not included. SurvXGBoost is the best-performing model under both OOS and OOT validation, which confirms the robustness of the proposed method.
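As a reminder of what the C-index in these comparisons measures, the following is a minimal sketch of Harrell's concordance index in plain Python. It is an illustration under simplifying assumptions (right-censored data, no tied event times, higher score meaning higher risk) and not necessarily the exact estimator used to produce the tables; the toy data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when the subject with the shorter observed
    time experienced the event (events[i] == 1). The pair is concordant when
    that subject also has the higher risk score; tied scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: observed months, default indicator, predicted risk.
times = [5, 8, 12, 20, 24]
events = [1, 0, 1, 1, 0]
good_scores = [0.9, 0.5, 0.6, 0.3, 0.1]

c_good = concordance_index(times, events, good_scores)
c_reversed = concordance_index(times, events, [-s for s in good_scores])
print(c_good, c_reversed)
```

A value of 0.5 corresponds to random ranking, so differences between models are read relative to that baseline.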

Interpretability
Lack of interpretability may hinder managers' willingness to employ complex credit models. Moreover, regulators in many regions and countries require transparent models. Since CART is employed as the base learner of SurvXGBoost, one can plot all the base models as graphs so that the decision-making process is clear to users. However, inspecting hundreds of base models is laborious. Thus, the proposed SurvXGBoost maintains some interpretability, although it is less interpretable than parametric models. Nevertheless, we can explain the proposed SurvXGBoost model by identifying the important features. XGBoost provides several feature importance measures that describe how important a feature is during modelling. In this paper, we select the relative gain measure since it directly evaluates the relative contribution to the model of a feature selected as a splitting variable. A higher relative gain indicates a more important feature when generating base models. Figure 7 shows the feature importance measured by relative gain for the 50 × 5-fold CV. The error bars in Figure 7 indicate the confidence intervals. As shown in Figure 7, interest rate, annual income, and credit grades are the three most important features in SurvXGBoost. In contrast, some features concerning loan characteristics, borrowers' creditworthiness, and solvency are hardly used for splitting nodes. The macroeconomic variables, especially the labour force and employment indicators, also play important roles in building SurvXGBoost models.
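The idea behind a relative gain measure can be sketched as follows. This is an illustration under the assumption that relative gain is the sum of split gains attributed to a feature divided by the total gain over all splits; the feature names and gain values are hypothetical and are not results from this paper.

```python
from collections import defaultdict


def relative_gain(splits):
    """Aggregate split gains per feature and normalise them to sum to 1."""
    totals = defaultdict(float)
    for feature, gain in splits:
        totals[feature] += gain
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("total gain must be positive")
    shares = {f: g / grand_total for f, g in totals.items()}
    # Return features ordered from most to least important.
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical (feature, gain) records, one per split across all base trees.
splits = [
    ("interest_rate", 40.0),
    ("annual_income", 25.0),
    ("interest_rate", 20.0),
    ("credit_grade", 10.0),
    ("unemployment_pc", 5.0),
]

importance = relative_gain(splits)
for feature, share in importance.items():
    print(f"{feature}: {share:.2f}")
```

In the Python `xgboost` package, per-feature summed gains of a trained booster can be retrieved with `Booster.get_score(importance_type="total_gain")` and normalised in the same way. Repeating this over each CV fold would give the spread summarised by error bars.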

Conclusions and future research
Credit risk is a major type of risk that financial institutions encounter. To quantify the credit risk of a portfolio, financial institutions have developed several risk parameters, among which PD is a major concern. Credit scoring is a common method to predict PD. In the modelling stage of credit scoring, ensemble models, which combine the predictions of multiple models, have shown superior predictability. Moreover, survival analysis, which can provide dynamic predictions of PD over time, has been considered as an alternative to common classification algorithms. Thus, we develop a novel SurvXGBoost model that integrates a state-of-the-art ensemble model (i.e., XGBoost) with survival analysis. Our proposal is compared with several benchmark survival models on a large real-world consumer loan dataset. To further enhance the performance of survival models, information extracted from the principal components of 1,042 macroeconomic variables is integrated with the original features, which include loan characteristics, borrowers' creditworthiness, and solvency. Model performance in terms of predictability and misclassification cost is compared under OOS and OOT validation. For OOS validation, the proposed SurvXGBoost model outperforms the benchmarks significantly on all the evaluation measures. For OOT validation, SurvXGBoost is marginally better than the benchmarks in most cases. The information extracted from macroeconomic variables can improve the predictability of survival models, which confirms the relationship between the business cycle and time-to-event. Model performance under OOT validation is worse than under OOS validation, which confirms that OOS validation may include future information on macroeconomic variables in modelling and therefore lead to over-estimated performance. It is thus recommended that OOT validation be emphasized in the comparison of credit models.
We also found that SurvXGBoost can maintain some interpretability by disclosing the feature importance.
Regarding directions for future research, the proposed SurvXGBoost can currently handle only right-censored data. One may further extend SurvXGBoost to support other types of censored data. Moreover, the interpretability of non-parametric survival models requires further exploration. The SurvLIME algorithm might be applied to explain complex survival models in credit risk assessment. The integration of other efficient GBDT-based algorithms into survival analysis is also an interesting research direction.

and analysis. Yating Fu and Yinguo Li were responsible for data interpretation. Yufei Xia and Yinguo Li wrote the first draft of this article. Yufei Xia revised the draft.