MODELLING THE STOCHASTIC DEPENDENCE UNDERLYING CONSTRUCTION COST AND DURATION

Construction cost and duration are two critical project indicators. It is acknowledged that these two indicators are closely dependent and highly uncertain due to various common factors and limited data for explanatory model calibration. However, the stochastic dependence underlying construction cost and duration is usually ignored, and the subsequent probabilistic analysis can be misleading. In response, this study develops a Nataf distribution model of building cost and duration, in which the uncertainties of total cost, unit cost, and duration are respectively quantified by univariate distribution fitting, while their stochastic dependence is inferred by maximum likelihood estimation. This method is applied to the costs and durations of 77 residential building projects in China completed between 2011 and 2016. The goodness of fit test illustrates that the data conform well to the developed model. The conditional distributions of cost and duration are then derived and the corresponding conditional expectations and variances are given. The results provide the distribution of building costs for a desired duration and the expected duration given a budget. This, together with the ability to update probabilities when new project information is available, confirms the potential of the proposed model to benefit precontract decision making from a risk perspective.


Introduction
Construction cost and duration are two critical project indicators. It is widely considered that construction cost and duration are interrelated (Bromilow 1969; Žujo et al. 2017). Several studies indicate that construction cost can be used to estimate the construction time (Bromilow 1969; Ogunsemi, Jagboro 2006) while some works explore the influence of duration delays on cost overrun (Huo et al. 2018; Lo et al. 2006). Despite the close relationship between construction cost and duration, they are usually deemed two contradictory objectives that need to be traded off in project management. A planned budget largely determines a reasonable construction time through its impact on resource supply. In other words, completion within a certain period usually requires a sufficient budget. From the owner's perspective, determining the planned budget demands a good understanding of the dependence between construction cost and duration. Some planners may prioritize financial and budgetary factors in early project planning while others would consider completion on time of paramount importance. Understanding this dependency between construction cost and duration helps planners to select a reasonable duration according to the budget limit and to make reasonable decisions according to their risk preference.
The close relationship between construction cost and time has been demonstrated by numerous studies. Several studies have examined the causes of cost and duration overrun in different regions, including Vietnam (Kim et al. 2018), Nigeria (Dada, Jagboro 2007), Hong Kong (Huo et al. 2018; Lo et al. 2006), and Malaysia (Ismail et al. 2014; Shehu et al. 2014a, 2014b). Construction delay has been identified as one of the major reasons for construction cost overrun (Flyvbjerg et al. 2004). On the other hand, schedule delay often occurs due to an insufficient budget for covering the extra costs originating from unpredicted events such as bad weather or design changes. Some owners prioritize financial and budgetary factors in early project planning while others consider completion on time of paramount importance.
Several studies have attempted to address the prediction of construction cost and duration by using both qualitative and quantitative methods. This has resulted in several factors being identified as the major causes of construction duration and cost overruns, including ineffective early project planning, changes in design or scope, and late payments (Famiyeh et al. 2017; Larsen et al. 2015; Mulla, Waghmare 2015; Shehu et al. 2014a). Studies of the relationship between construction duration and cost began with a simple linear model (Fulkerson 1961), followed later by the fitting of several non-linear models, such as the quadratic (Deckro et al. 1995) and exponential (Žujo et al. 2017). A representative model developed by Bromilow (1969) in Australia has had its predictive ability tested and improved in different countries (Chan 1999; Kaka, Price 1991; Ogunsemi, Jagboro 2006; Ng et al. 2001; Yeong 1994). Optimization approaches based on these models have been proposed to address the cost-duration tradeoff problem; these methods include the genetic algorithm (Agdas et al. 2017; Koo et al. 2015; Zheng et al. 2004), artificial bee colony algorithms (Tran et al. 2015), particle swarm optimization (Aminbakhsh, Sonmez 2016), and the critical path method (Alavipour, Arditi 2018).
Various factors contribute to construction performance, including project complexity, financial considerations, political conditions, and market environments. Some are difficult to quantify because of the limited amount of data available (particularly at the early stage of construction), while other factors are difficult to incorporate due to limited knowledge of their explanatory relations, such as project communication and team hierarchical diversity (Sanchez et al. 2017). Consequently, predicting construction cost and duration is surrounded by considerable uncertainty; even the performance of projects with an identical design can still differ because of these uncertainties (Pinto, Morris 2004).
In minimizing the risk of unrealistic project planning, therefore, it is critical to capture the uncertainty in construction cost and duration. This is difficult to do by deterministic methods as they only provide point-value estimates (Kim, Reinschmidt 2009). As shown in Figure 1, due to the uncertainty of the underlying factors, the relationship between cost and duration cannot be represented by a deterministic function; rather, for a given duration (e.g. d in Figure 1), the corresponding cost (c in Figure 1) is uncertain and better characterized by a distribution of probable points ($f(x)$ in Figure 1). In contrast, the probabilistic approach provides a distributive estimate rather than a point estimate, which is potentially more useful (Skitmore 2001). The fit of several general distribution types, such as Weibull (Nassar et al. 2005), uniform (Fine, Hackemer 1970; Grinyer, Whittaker 1973; Van Cauwelaert, Heynig 1979), Gamma (Friedman 1956), and lognormal (Skitmore 1991), has been explored for construction cost or duration. Other studies investigate the distribution of duration under different weather conditions (Lee et al. 2009), project and contract types (Irfan et al. 2011), and logic types (Wang 2005).
Based on predefined distributions, several probabilistic methods such as Monte Carlo simulation (MCS), the program evaluation and review technique (PERT) and agent-based simulation have been widely used to address the uncertainty in construction duration (Ahuja, Nandakumar 1985; Ballesteros-Pérez 2017; Erol et al. 2017; Farshchian et al. 2017; Karabulut 2017; Moret, Einstein 2016). Based on assumptions of precedence relationships between activities, these methods can model the distribution of scheduled duration (Wang 2005). However, a common assumption underlying probabilistic simulation is the independence between variables (Irfan et al. 2011), whereas construction duration and cost are closely dependent (Žujo et al. 2017), raising concerns over the accuracy of the results obtained under the independence assumption (Isidore, Back 2002). Therefore, there is a need for a probabilistic method that considers the stochastic dependence between construction cost and duration. The significance of modelling the cost-time dependency is prominent within the context of early project decision making. Many construction companies consider cost and time to be the main objectives of project management. Quantifying, from a probabilistic perspective, the changes in cost caused by a deviation in duration helps the planner to better select the duration and prepare for possible crashing.
This study employs the Nataf distribution to capture the uncertainty of construction duration and cost and their stochastic dependence (Žujo et al. 2017). The Nataf distribution is widely used to describe dependent uncertainties. Unlike other multivariate distribution models, the Nataf distribution is more flexible in the choice of univariate normal or non-normal distributions. Moreover, the stochastic dependence is modeled independent of the uncertainty characterization of each random variable, which simplifies the procedures of model development. Several studies have employed the Nataf distribution for construction reliability analysis (Chen et al. 2015).
This paper introduces a probability model based on the Nataf distribution to capture the dependent uncertainty between cost and duration. The model is developed and tested on a dataset comprising the total cost (TC), unit cost (UC), and duration (D) of 77 residential building projects in China. UC (total final contract cost/total floor area) is included here because it is also deemed a key indicator of construction performance (Chan, A. P. C., Chan, A. P. L. ). Univariate and multivariate analyses of TC, UC and D are carried out, the conditional distribution based on the multivariate distribution model is derived, and the method of updating the distributions given new observations is then illustrated. Finally, the potential application of the proposed model is presented and its implications for early stage project management are discussed.

Methodology
This research aims to model the stochastic relationship between building cost and duration. A multistep process is proposed (Figure 2), which includes data processing, model development and validation, and model updating. Each step is presented in detail in the subsequent paragraphs.

Data source and processing
The data collected in this study were obtained from a professional database of the construction industry in China, which covers various types of buildings. The detailed information of each project is stored in a structured way, such that each project sample consists of several fundamental construction attributes, such as project type, number of storeys, total floor area, bidding, decoration standard, and delivery method. This study focuses on one major type of project, i.e., apartment buildings. Moreover, cost and schedule overruns are caused by errors, omissions, or changes (Love et al. 2012). Therefore, the data samples corresponding to projects with cost or schedule overruns are removed to minimize the influence of unpredicted factors. In other words, this study focuses on the cost-duration relationship that a 'normal' project is likely to possess.
To eliminate the bias caused by inconsistency in commencement date and location, the project cost data were adjusted to an identical point in time using the construction price indices from the National Bureau of Statistics of China. Such an index is a constructed statistic used to estimate price fluctuation by assigning a value of 100 to the price at a certain point in time (Koo et al. 2010).
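The price-index adjustment described above can be illustrated with a small sketch. The function name and the index values below are hypothetical and are not taken from the study; the sketch only assumes the usual convention that a cost is rescaled by the ratio of the base-period index (100) to the index of the period in which the cost was incurred.

```python
def adjust_to_base_period(cost, index_at_completion, index_at_base=100.0):
    """Rescale a cost to the base period of a price index (base index = 100).

    Hypothetical helper for illustration; the study's exact adjustment
    procedure is not given in the text.
    """
    if index_at_completion <= 0:
        raise ValueError("price index must be positive")
    return cost * index_at_base / index_at_completion


# Illustrative numbers only: a cost of 5.2 million recorded when the
# index stood at 104 corresponds to 5.0 million in base-period prices.
adjusted_cost = adjust_to_base_period(5.2e6, 104.0)
```

Costs recorded in a period with a higher index are scaled down, so that projects completed in different years become comparable.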

Model development and validation
Investigating the stochastic relationship between cost and duration requires the modeling of their joint probability distribution. This study employs the Nataf distribution to capture the dependent uncertainty of building cost and duration. The major advantage of this model is its flexibility to incorporate various univariate distributions while preserving the correlations between variables. The model is useful because the uncertainties of building cost and duration are usually non-Gaussian and correlated. The model parameters (including those of the univariate distributions) are estimated by the maximum likelihood method. Finally, univariate and multivariate goodness of fit tests are conducted to examine whether the uncertainties of cost and duration and their stochastic dependence are well represented by the developed model. The following explains the method used in this study.
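The Nataf distribution is a popular model used to approximate the joint probability distribution of the input random variables. The model is flexible and efficient since it can incorporate various marginal distributions and can be easily generalized to higher dimensions. The stochastic dependence is modeled via a joint Gaussian distribution after mapping the random variables from the original space onto standard normal space (Chen et al. 2015). Suppose $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ is an n-dimensional random vector. Assume that the univariate probability density function (PDF) and cumulative distribution function (CDF) of each element $X_i$, $i = 1, \ldots, n$, are $f_{X_i}(x_i)$ and $F_{X_i}(x_i)$, respectively. Each element is mapped onto standard normal space by

$$y_i = \Phi^{-1}\left(F_{X_i}(x_i)\right), \quad i = 1, \ldots, n,$$

with the standard normal CDF denoted as $\Phi(\cdot)$; $y_i$ is a standard normal random variable. By applying the differentiation rule, the joint PDF and CDF of $\mathbf{X}$ are:

$$f_{\mathbf{X}}(\mathbf{x}) = \varphi_n(\mathbf{y}; \boldsymbol{\rho}_0) \prod_{i=1}^{n} \frac{f_{X_i}(x_i)}{\varphi(y_i)} \tag{2}$$

$$F_{\mathbf{X}}(\mathbf{x}) = \Phi_n(\mathbf{y}; \boldsymbol{\rho}_0)$$

where $\varphi(\cdot)$ is the standard normal PDF, $\varphi_n(\cdot; \boldsymbol{\rho}_0)$ and $\Phi_n(\cdot; \boldsymbol{\rho}_0)$ are the PDF and CDF of an n-variate standard normal distribution respectively, and $\boldsymbol{\rho}_0$ is its correlation matrix. The joint probability distribution of $\mathbf{X}$ with the PDF given by Eqn (2) is referred to as the Nataf distribution and $\boldsymbol{\rho}_0$ is termed the fictive correlation matrix acting as the Nataf distribution model parameter (Higham 2002).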
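As an illustration of the construction described above, the following minimal sketch evaluates a bivariate Nataf PDF by mapping each variable onto standard normal space and coupling the scores through a bivariate normal density. The two-parameter Weibull marginals, their parameter values, and the names `cost_marginal` and `duration_marginal` are assumptions chosen for illustration only; they are not the distributions fitted in this study.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: cdf, pdf, inv_cdf


def weibull_cdf(x, shape, scale):
    """CDF of a two-parameter Weibull distribution (x > 0)."""
    return 1.0 - math.exp(-((x / scale) ** shape))


def weibull_pdf(x, shape, scale):
    """PDF of a two-parameter Weibull distribution (x > 0)."""
    z = x / scale
    return (shape / scale) * z ** (shape - 1.0) * math.exp(-(z ** shape))


def bivariate_std_normal_pdf(y1, y2, rho):
    """PDF of a bivariate standard normal with correlation rho."""
    det = 1.0 - rho ** 2
    quad = (y1 ** 2 - 2.0 * rho * y1 * y2 + y2 ** 2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def nataf_pdf(x1, x2, marginal1, marginal2, rho0):
    """Bivariate Nataf PDF; each marginal is a (cdf, pdf) pair of callables."""
    cdf1, pdf1 = marginal1
    cdf2, pdf2 = marginal2
    # Map the original variables onto standard normal space.
    y1 = STD_NORMAL.inv_cdf(cdf1(x1))
    y2 = STD_NORMAL.inv_cdf(cdf2(x2))
    # Joint normal density times the ratio of marginal to normal densities.
    ratio = (pdf1(x1) * pdf2(x2)) / (STD_NORMAL.pdf(y1) * STD_NORMAL.pdf(y2))
    return bivariate_std_normal_pdf(y1, y2, rho0) * ratio


# Hypothetical marginals (illustrative parameters only).
cost_marginal = (lambda x: weibull_cdf(x, 2.0, 1500.0),
                 lambda x: weibull_pdf(x, 2.0, 1500.0))
duration_marginal = (lambda x: weibull_cdf(x, 3.0, 600.0),
                     lambda x: weibull_pdf(x, 3.0, 600.0))

density = nataf_pdf(1200.0, 500.0, cost_marginal, duration_marginal, -0.5)
```

With a fictive correlation of zero the expression reduces to the product of the two marginal PDFs, which is a convenient sanity check on any implementation.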
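Maximum likelihood estimation (MLE) is extensively used to infer parameters of statistical models from a set of observations. The basic idea is to maximize the likelihood of the statistical model given the observations with respect to the model parameters. For the Nataf distribution model, there are two types of parameters: (1) the parameters of the univariate distribution of each random variable; and (2) the fictive correlation matrix describing the dependence strength between different random variables. Estimating the Type 1 parameters for a univariate distribution is straightforward. For the Type 2 parameters, the inference by MLE starts from the log-likelihood of the model:

$$\log f_{\mathbf{X}}(\mathbf{x}) = \log \varphi_n(\mathbf{y}; \boldsymbol{\rho}_0) + \sum_{i=1}^{n} \left[ \log f_{X_i}(x_i) - \log \varphi(y_i) \right]$$

Since the summation does not depend on $\boldsymbol{\rho}_0$, the maximization of $\log f_{\mathbf{X}}(\mathbf{x})$ with respect to $\boldsymbol{\rho}_0$ is equivalent to the maximization of $\log \varphi_n(\mathbf{y}; \boldsymbol{\rho}_0)$ with respect to $\boldsymbol{\rho}_0$. In other words, the estimator of $\boldsymbol{\rho}_0$ is the estimator of the correlation matrix of an n-variate standard normal distribution, which is:

$$\hat{\rho}_{0,ij} = \frac{\sum_{k=1}^{m} \left(\hat{y}_i^{(k)} - \bar{Y}_i\right)\left(\hat{y}_j^{(k)} - \bar{Y}_j\right)}{\sqrt{\sum_{k=1}^{m} \left(\hat{y}_i^{(k)} - \bar{Y}_i\right)^2 \sum_{k=1}^{m} \left(\hat{y}_j^{(k)} - \bar{Y}_j\right)^2}}$$

where $\hat{Y}_i = (\hat{y}_i^{(1)}, \ldots, \hat{y}_i^{(m)})$ and $\hat{Y}_j = (\hat{y}_j^{(1)}, \ldots, \hat{y}_j^{(m)})$ are transformed from the m observations of the i-th and j-th random variable by Eqn (4) respectively; $\bar{Y}_i$ and $\bar{Y}_j$ denote the sample mean of $\hat{Y}_i$ and $\hat{Y}_j$, respectively.

The goodness of fit of a statistical model measures the discrepancy between the proposed statistical model and the observations. This measure is commonly used in statistical hypothesis testing to test whether the data frequencies can be captured by a given distribution. The goodness of fit of a univariate distribution model is measured in terms of the deviation between the theoretical CDF $F(x)$ from the proposed model and the empirical CDF $\hat{F}(x)$ from observations, which are respectively:

$$F(x) = P(X \le x), \qquad \hat{F}(x) = \frac{1}{m} \sum_{k=1}^{m} \mathbf{1}_{\{x_{(k)} \le x\}}$$

where $x_{(1)} \le x_{(2)} \le \cdots \le x_{(m)}$ are m observations of $X$ sorted in ascending order, and $\mathbf{1}_E$ is the indicator function of event $E$. The Anderson-Darling (A-D) statistic ($A^2$) has excellent power properties against a variety of alternatives (D'Agostino 2017). Therefore, in this work, the A-D test is taken to examine whether the data samples are drawn from the proposed univariate distribution model, where the A-D statistic ($A^2$) is defined as:

$$A^2 = -m - \frac{1}{m} \sum_{k=1}^{m} (2k - 1) \left[ \ln F\left(x_{(k)}\right) + \ln\left(1 - F\left(x_{(m+1-k)}\right)\right) \right]$$

For a multivariate distribution model, the A-D test based on the probability integral transformation (PIT) is employed to test its goodness of fit. Particularly, to examine whether the observations of a random vector $\mathbf{X}$ are appropriately modeled by the Nataf distribution, a null hypothesis is defined as:

$$H_0: \mathbf{X} \text{ follows the Nataf distribution } F_{\mathbf{X}}(\mathbf{x})$$

The PIT maps $\mathbf{X}$ to $\mathbf{Z}$ through the sequence of marginal conditional CDFs:

$$Z_1 = F_{X_1}(x_1), \qquad Z_i = F_{X_i \mid X_1, \ldots, X_{i-1}}(x_i \mid x_1, \ldots, x_{i-1}), \quad i = 2, \ldots, n$$

It has been proved that, under $H_0$, the elements of the random vector $\mathbf{Z} = (Z_1, Z_2, \ldots, Z_n)$ are uniformly and independently distributed on $[0, 1]$ (Rosenblatt 1952). Consequently, the statistic $S = \sum_{i=1}^{n} \left[\Phi^{-1}(Z_i)\right]^2$ follows the $\chi^2_n$ distribution, and the resulting auxiliary hypothesis $H_0^*$ can again be tested using the A-D statistic. To visualize the fit of the Nataf distribution, the theoretical CDF is plotted based on Eqn (2) whereas the multivariate empirical CDF is defined as:

$$\hat{F}_{\mathbf{X}}(\mathbf{x}) = \frac{1}{m} \sum_{k=1}^{m} \mathbf{1}_{\bigcap_{i=1}^{n} \{x_i^{(k)} \le x_i\}}$$

where $\bigcap$ indicates the intersection of event sets.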
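The estimation and testing steps above can be sketched on synthetic data. This is an illustrative sketch, not the study's implementation: the sample is simulated, the true correlation of -0.5 is an arbitrary assumption, and the bivariate case is used so that the chi-square CDF with 2 degrees of freedom has the closed form 1 - exp(-s/2). The PIT is carried out in normal space with a Cholesky factor, which for a Gaussian dependence structure yields the same sequentially conditioned normal scores.

```python
import numpy as np
from statistics import NormalDist

_STD = NormalDist()


def normal_scores(u):
    """Element-wise Phi^{-1}(u) for CDF values strictly inside (0, 1)."""
    return np.vectorize(_STD.inv_cdf)(np.asarray(u, dtype=float))


def fictive_correlation(U):
    """Sample correlation of the normal scores.

    U is an (m, n) array holding the marginal CDF value of every observation.
    """
    Y = normal_scores(U)
    return Y, np.corrcoef(Y, rowvar=False)


def pit_statistic(Y, rho0):
    """S = sum_i [Phi^{-1}(Z_i)]^2 for each observation (row of Y)."""
    L = np.linalg.cholesky(rho0)
    W = np.linalg.solve(L, Y.T).T  # independent N(0, 1) scores under H0
    return np.sum(W ** 2, axis=1)


def anderson_darling(sample, cdf):
    """A-D statistic of a sample against a fully specified CDF."""
    x = np.sort(np.asarray(sample, dtype=float))
    m = x.size
    F = np.clip(cdf(x), 1e-12, 1.0 - 1e-12)  # guard the logarithms
    k = np.arange(1, m + 1)
    return -m - np.mean((2 * k - 1) * (np.log(F) + np.log(1.0 - F[::-1])))


# Synthetic bivariate example (assumed correlation, not project data).
rng = np.random.default_rng(0)
true_rho = np.array([[1.0, -0.5], [-0.5, 1.0]])
Y_true = rng.multivariate_normal(np.zeros(2), true_rho, size=2000)
U = np.vectorize(_STD.cdf)(Y_true)  # stands in for fitted marginal CDF values

Y, rho_hat = fictive_correlation(U)
S = pit_statistic(Y, rho_hat)
A2 = anderson_darling(S, lambda s: 1.0 - np.exp(-s / 2.0))  # chi-square, 2 dof
```

In practice `U` would hold the fitted marginal CDF evaluated at each observation, and `A2` would be compared with the critical value at the chosen significance level.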

Model updating
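The established Nataf distribution model can be updated to infer the conditional distribution of some random variables given new observations of the others. In addition, the calculation of conditional statistics such as expectation and variance is introduced. Let $\mathbf{X} = (\mathbf{X}_a, \mathbf{X}_b)$, where $\mathbf{X}_a$ collects the k random variables of interest and $\mathbf{X}_b$ the remaining n - k random variables for which new observations $\mathbf{e}$ are available. The corresponding transformed standard normal random variable is then denoted as $\mathbf{Y} = (\mathbf{Y}_a, \mathbf{Y}_b)$. Suppose that the Nataf distribution has been established, where the model parameter is partitioned accordingly as

$$\boldsymbol{\rho}_0 = \begin{pmatrix} \boldsymbol{\rho}_{aa} & \boldsymbol{\rho}_{ab} \\ \boldsymbol{\rho}_{ba} & \boldsymbol{\rho}_{bb} \end{pmatrix}$$

The conditional CDF of $\mathbf{X}_a$ given $\mathbf{X}_b = \mathbf{e}$ is then

$$F_{\mathbf{X}_a \mid \mathbf{X}_b}(\mathbf{x}_a \mid \mathbf{e}) = \Phi_k(\mathbf{y}_a; \boldsymbol{\mu}^*, \boldsymbol{\rho}^*)$$

which is the CDF of a multivariate normal distribution. In other words, the CDF of $\mathbf{X}_a$ given $\mathbf{X}_b = \mathbf{e}$ is obtained from $\mathbf{Y}_a$ given $\mathbf{Y}_b = \mathbf{y}_b$, which follows a k-variate normal distribution $N_k(\boldsymbol{\mu}^*, \boldsymbol{\rho}^*)$ with mean and covariance matrix respectively given by:

$$\boldsymbol{\mu}^* = \boldsymbol{\rho}_{ab} \boldsymbol{\rho}_{bb}^{-1} \mathbf{y}_b, \qquad \boldsymbol{\rho}^* = \boldsymbol{\rho}_{aa} - \boldsymbol{\rho}_{ab} \boldsymbol{\rho}_{bb}^{-1} \boldsymbol{\rho}_{ba}$$

where $\mathbf{y}_b$ is the transform of $\mathbf{e}$ in standard normal space. For the computation of the marginal conditional CDFs in the PIT, we can simply apply Eqns (12). Based on the conditional distribution, the conditional statistical moments can be calculated. For example, the conditional expectation and variance of an element $X_i$ of $\mathbf{X}_a$ are calculated as follows:

$$\mathbb{E}[X_i \mid \mathbf{e}] = \int x_i \, f_{X_i \mid \mathbf{X}_b}(x_i \mid \mathbf{e}) \, \mathrm{d}x_i \tag{15}$$

$$\mathbb{V}[X_i \mid \mathbf{e}] = \int \left(x_i - \mathbb{E}[X_i \mid \mathbf{e}]\right)^2 f_{X_i \mid \mathbf{X}_b}(x_i \mid \mathbf{e}) \, \mathrm{d}x_i \tag{16}$$

where $\mathbb{E}$ is the expectation operator, $\mathbb{V}$ is the variance operator, and $\mathbf{e}$ denotes the observations of the conditioning variables.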
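The updating step in standard normal space amounts to conditioning a multivariate normal vector on a subset of its components. A minimal sketch is given below; the function name, the index arguments, and the correlation values are hypothetical and serve only to illustrate the standard conditional-normal formulas.

```python
import numpy as np


def conditional_normal(rho0, idx_a, idx_b, y_b):
    """Mean and covariance of Y_a given Y_b = y_b for a standard normal vector.

    rho0  : (n, n) fictive correlation matrix
    idx_a : indices of the variables of interest
    idx_b : indices of the conditioning (observed) variables
    y_b   : normal scores of the new observations
    """
    rho0 = np.asarray(rho0, dtype=float)
    r_aa = rho0[np.ix_(idx_a, idx_a)]
    r_ab = rho0[np.ix_(idx_a, idx_b)]
    r_bb = rho0[np.ix_(idx_b, idx_b)]
    # Solve linear systems instead of forming the explicit inverse.
    mu = r_ab @ np.linalg.solve(r_bb, np.asarray(y_b, dtype=float))
    cov = r_aa - r_ab @ np.linalg.solve(r_bb, r_ab.T)
    return mu, cov


# Illustrative bivariate case with an assumed correlation of -0.5.
mu_star, rho_star = conditional_normal([[1.0, -0.5], [-0.5, 1.0]], [0], [1], [2.0])
```

A negative fictive correlation shifts the conditional mean in the opposite direction to the observed score, and the conditional variance is always smaller than the unconditional one whenever the correlation is non-zero.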

Data collected
Considering the consistency in the project attributes, this study finally collected 77 project samples of apartment buildings from the database. These projects were completed between 2011 and 2016 in China without cost or schedule overruns. The project cost was adjusted using the construction price indices from the National Bureau of Statistics of China to eliminate the effects of inflation.
The statistics of the data are shown in Table 1. The Pearson correlation between TC and D is -0.4949, between UC and D is -0.3621, and between TC and UC is 0.3112, indicating that TC and D, as well as UC and D, are negatively correlated, while there is a positive correlation between TC and UC.

Results
This study includes three variables (UC, TC, D) indicating project performance, which are stochastic and interdependent. The aim is to model them statistically and to unravel their stochastic relationship. Both univariate and multivariate analyses are included in this section.

Univariate analysis
To establish the model, the first step is to quantify the uncertainty of each random variable by determining the univariate distributions. A number of continuous distributions are examined, including the Beta, Burr, Weibull, Log-Logistic, Cauchy, Gumbel Max/Min, Gamma, Johnson SB, and Normal. The parameters of the distributions are estimated using the MLE method. The software package Easyfit 5.5 is used for parameter estimation and the A-D test. The null hypothesis is accepted if $A^2$ is less than the critical value at the given significance level $\alpha$; otherwise, the null hypothesis is rejected. Table 2 presents the hypothesis test results corresponding to the best-fit distributions for UC, TC, and D, respectively. The probability density functions (PDFs) and cumulative distribution functions (CDFs) of the Weibull (3-parameter), Gamma (3-parameter), and Log-Logistic (3-parameter) distributions and the meaning of the associated parameters are given in Table 3. Figure 3 compares the empirical distribution (stair step lines) with the best-fit distribution for UC, TC, and D respectively. The empirical CDF is generally in good agreement with the theoretical CDF, which indicates that the uncertainty of UC, TC, and D can be characterized by the theoretical CDFs.
Figure 3. Best fit distributions of TC, UC and D

Multivariate analysis
The next step in developing the Nataf distribution model is to estimate the fictive correlation matrix. The results are shown in Table 4. The fictive correlation matrix is close to, but deviates from, the correlation matrix associated with the original random vector. In general, the fictive correlation and original correlation are co-monotonic (i.e. fictive correlation strictly increases as the original correlation strictly increases, and vice versa) and the fictive correlation is positive or negative if the original correlation is positive or negative. The deviation is largely determined by the normality of the univariate distribution of each random variable.
The goodness of fit of the developed multivariate distribution model is then examined. A critical step in the goodness of fit test of the Nataf distribution model is the PIT, which involves the sequential computation of the marginal conditional CDFs of the random vector. The result of the A-D test for goodness of fit is shown in Table 5. This indicates that the auxiliary hypothesis $H_0^*$ (the statistic $S$ conforms to the $\chi^2_n$ distribution) is accepted, which means that the Nataf distribution can be used to model the trivariate distribution of the random vector. The histogram of $S$ compared with the PDF of the $\chi^2_3$ distribution is shown in Figure 4(a) and the empirical cumulative distribution of $S$ compared with the CDF of the $\chi^2_3$ distribution is shown in Figure 4(b). Figures 4(a) and 4(b) illustrate that the proposed model fits the empirical data well. Figure 5 compares the contours of the trivariate empirical distribution with those of the Nataf distribution in two dimensions. The empirical and fitted CDFs are generally consistent despite some small deviations.
Note: α is the shape parameter, β is the scale parameter and γ is the location parameter.

Model application
An application of the established Nataf distribution model is to infer the conditional distribution of construction duration or cost when new information on the other variables is available. This section discusses the conditional distribution of cost or duration given new observations. In addition, the calculation of the conditional expectation and variance of building cost and duration is illustrated. The potential of the proposed model for project management is discussed later.

Conditional distribution of TC or UC given D
This section discusses the conditional distribution of TC or UC given new values of D. Figures 6(a) and 6(b) present the PDF and CDF of the conditional distribution of UC given D. Here, we take the 5th (355 days) and 95th (837 days) percentiles of D for illustrative purposes. The unconditional distribution of UC is included as a reference. The mode of UC that corresponds to the maximum of the unconditional PDF is ¥1322/m². If the duration equals its 95th percentile, the mode of UC decreases to ¥1134/m² and the conditional CDF is above the unconditional CDF, which implies that a smaller value of UC is likely to occur. In contrast, as the duration is shortened to its 5th percentile, the mode of UC increases to ¥1619/m² and the conditional CDF is below the unconditional CDF. In other words, the probability of UC being less than a given value is lower, which indicates that a larger value of UC is likely to appear. A similar pattern of influence is observed for the conditional distribution of TC. When D equals its 5th and 95th percentile, the mode of TC deviates from its unconditional value of ¥4.22e6 to ¥1.50e7 and ¥3.15e6, respectively. The conditional CDFs of TC given the 5th and 95th percentiles of D are likewise below and above the unconditional CDF of TC, respectively. However, the deviations between the conditional and unconditional CDF lines for UC and TC are different. A sensitivity index is used to investigate the sensitivity of UC and TC to D, in which $x_2$ is set to the median of its unconditional distribution (i.e., $P(X_2 < x_2) = 0.5$) and $x_1$ to its 5th and 95th percentile, respectively. Table 6 summarizes the sensitivity of TC and UC to D. The conditional CDF of TC deviates more from its unconditional CDF than that of UC does. This difference indicates that TC is more sensitive to D.
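The direction of the shifts described above can be reproduced with a bivariate sketch that works entirely on percentile levels, so that no particular marginal distribution is needed. The fictive correlation of -0.5 is an assumed value for illustration (the estimated matrix is reported in Table 4), and `conditional_cdf` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def conditional_cdf(p_a, p_b, rho0):
    """P(X_a <= x_a | X_b = x_b) in a bivariate Nataf model.

    p_a : unconditional CDF level of x_a, i.e. F_a(x_a)
    p_b : unconditional CDF level of the observed x_b, i.e. F_b(x_b)
    rho0: fictive correlation between the two variables
    """
    y_a = STD.inv_cdf(p_a)
    y_b = STD.inv_cdf(p_b)
    mu = rho0 * y_b                      # conditional mean in normal space
    sigma = math.sqrt(1.0 - rho0 ** 2)   # conditional standard deviation
    return STD.cdf((y_a - mu) / sigma)


# Probability that cost stays below its unconditional median, given that
# duration sits at its 95th or 5th percentile (assumed rho0 = -0.5).
p_long = conditional_cdf(0.5, 0.95, -0.5)
p_short = conditional_cdf(0.5, 0.05, -0.5)
```

With a negative correlation, `p_long` exceeds 0.5 (the conditional CDF lies above the unconditional one) and `p_short` falls below 0.5, mirroring the pattern reported for UC and TC.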

Conditional distribution of D given UC and TC
Conversely, the possible duration under a specific budgetary cost can be analyzed. Similarly, the conditional distribution of D is illustrated with UC and TC at their 5th (UC = ¥997/m², TC = ¥3.56e6) and 95th (UC = ¥2274/m², TC = ¥2.61e7) percentiles. As illustrated in Figure 8, a limited budget can increase the likelihood of a longer construction duration while sufficient financial support can decrease this likelihood. In particular, the mode of the unconditional distribution of D is 484 days. If both UC and TC are set to their 5th percentile (or 95th percentile), the mode of D becomes 675 days (or 385 days). This result quantifies the budgetary influence on construction duration.

Conditional statistics
Based on the conditional distribution, the conditional statistics of construction duration and cost are calculated using Eqns (15) and (16). Table 7 tabulates some expectations and variances corresponding to the conditional distributions investigated above. The results illustrate that, as the cost increases, the expectation of duration decreases, which is consistent with the construction practice that shortening the duration will cause an increase in the budget. Beyond confirming this trend, the model provides a practical method to quantify the influence. For example, if the duration needs to be substantially shortened from 837 days to 355 days, the increase in total cost is expected to be ¥1.394e7. The model can therefore serve as an effective tool to support owners and clients in estimating the changes in cost caused by shortening the construction time. This capability not only reduces the likelihood of an unreasonable schedule plan but also lowers the risk of an infeasible balance between budget and duration. As a result, owners can make rational decisions for project development and management with the aid of this model.
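The conditional expectation and variance can be approximated numerically once the conditional distribution in normal space is known. The sketch below is a hedged illustration for a bivariate case: the two-parameter Weibull marginal for cost, its parameters, and the fictive correlation of -0.5 are assumptions, not the fitted values of this study. It integrates the quantile function of the conditional distribution with a midpoint rule.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def weibull_quantile_from_sf(sf, shape, scale):
    """Value x with P(X > x) = sf for a two-parameter Weibull distribution."""
    return scale * (-math.log(sf)) ** (1.0 / shape)


def conditional_moments(p_b, rho0, shape, scale, n_grid=20000):
    """Conditional mean and variance of a Weibull cost variable, given that
    the conditioning variable sits at its unconditional CDF level p_b."""
    y_b = STD.inv_cdf(p_b)
    mu, sigma = rho0 * y_b, math.sqrt(1.0 - rho0 ** 2)
    total = total_sq = 0.0
    for k in range(n_grid):
        p = (k + 0.5) / n_grid                 # midpoint probability level
        y_a = mu + sigma * STD.inv_cdf(p)      # conditional normal score
        # Use the survival probability Phi(-y) to stay accurate in the tail.
        x = weibull_quantile_from_sf(STD.cdf(-y_a), shape, scale)
        total += x
        total_sq += x * x
    mean = total / n_grid
    return mean, total_sq / n_grid - mean ** 2


# Assumed values: shape 2, scale 1000, fictive correlation -0.5.
mean_short, var_short = conditional_moments(0.05, -0.5, 2.0, 1000.0)
mean_long, var_long = conditional_moments(0.95, -0.5, 2.0, 1000.0)
expected_increase = mean_short - mean_long
```

The difference `expected_increase` plays the role of the expected cost change when the duration moves from its 95th to its 5th percentile; the figure reported in the text comes from the fitted trivariate model, not from this sketch.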

Conclusions
This paper proposes a probabilistic model based on the Nataf distribution to address the dependent uncertainties associated with building cost and duration. The model was demonstrated and tested with a sample of 77 residential buildings completed between 2011 and 2016 in China. The uncertainties of total cost, unit cost, and duration were quantified by univariate distribution fitting, while their stochastic dependence was inferred from the dataset by maximum likelihood estimation. The goodness of fit test indicates that the collected data conform well to the developed Nataf distribution. The conditional distribution of cost or duration given new observations was derived and the calculation of the conditional expectation and variance was illustrated. Based on the established distribution, the stochastic relationship between cost and duration was explored by investigating their respective conditional distributions given the 5th and 95th percentiles of the other variables. The results show how the distribution of TC and/or UC changes as the duration varies and vice versa. Moreover, it is illustrated how owners or clients can use the developed model to estimate the conditional expectation and variance of cost, such that the risk of cost increasing due to a shortened construction time can be assessed.
The multivariate distribution is established by separately modeling the uncertainty of each random variable and the stochastic dependence between them. In general, the proposed model can update the uncertainty of project performance when new project information becomes available, which benefits decision making in project management from a risk perspective. Quantifying the cost changes as a function of duration helps the planner to select a reasonable duration and prepare for possible schedule crashing. Further research is needed to incorporate more explanatory factors and establish a more sophisticated model.

Funding
This work was supported by the National Natural Science Foundation of China (NSFC) under Grant [number 51608399].

Author Contributions
Xue Xiao contributes to the conception and design of the study, acquisition of data, drafting and revision of the article. Fan Wang contributes to the conception and design of the study, analysis and interpretation of data, drafting and revision of the article. Heng Li and Martin Skitmore contribute to the revision of the article for important intellectual content.