SARIMA MODELLING APPROACH FOR RAILWAY PASSENGER FLOW FORECASTING

. In this paper, railway passenger flows are analyzed and a suitable modeling method proposed. Based on historical data composed from monthly passenger counts realized on Serbian railway network it is concluded that the time series has a strong autocorrelation of seasonal characteristics. In order to deal with seasonal periodicity, Seasonal AutoRegressive Integrated Moving Average (SARIMA) method is applied for fitting and forecasting the time series that spans over the January 2004 – June 2014 periods. Experimental results show good prediction performances. Therefore, developed SARIMA model can be considered for forecasting of monthly passenger flows on Serbian railways.


Introduction
Forecasting of future demand for transport service represent important element of success for a transport company. Forecasting also provides basic input for planning and control of functional areas including transport operations planning, marketing and finance.
There are two broad categories of forecasting approaches in transportation field: parametric and nonparametric (Smith et al. 2002;Vlahogianni et al. 2004;Wei, Chen 2012). The main difference between these two categories lies in functional dependence between independent variables and dependent variable. Within the non-parametric methods there are applications of neural networks (Dougherty 1995;Vlahogianni et al. 2004;Xie, Zhang 2006;Zhang, He 2007;Zhang, Ye 2008;Tsai et al. 2009;Alekseev, Seixas 2009), non-parametric regression (Smith et al. 2002;Williams et al. 1998;Clark 2003), Kalman filters (Whittaker et al. 1997;Stathopoulos, Karlaftis 2003;Saab, Zouein 2001;Wang et al. 2007) and Gaussian maximum likelihood (Tang et al. 2003). For example, Wei and Chen (2012) developed a hybrid forecasting approach based on a combination of Empirical Mode Decomposition (EMD) and Back-Propagation Neural (BPN) networks to predict short term passenger flow in metro systems. Xu et al. (2004) applied spatiotemporal data mining to forecast the railway passenger flow based on spatio. Approach is decomposed into two phases. In first phase, it forecasts the time sequence of the target object using statistical principles. In second phase, it figures out the spatial influence of neighbor objects using a neural network, and finally combines the two forecasting results using linear regression. Li et al. (2012) used Grey model for forecast annual outgoing passenger flows of a railway station. Authors analyzed monthly flucations in passenger flows and applied proportional coefficient method for prediction of passenger flows for each month.
In the traditional parametric techniques, historical average (Smith, Demetsky 1997;Stephanedes et al. 1981), smoothing techniques (Williams et al. 1998), and AutoRegressive Integrated Moving Average (ARIMA) (Hansen et al. 1999;Williams et al. 1998;Lee, Fambro 1999;Williams 2001;Kamarianakis, Prastacos 2005) have been applied to forecast transportation demand. For example, Grubb and Mason (2001) estimated a very long time series about air passenger traffic using Holt-Winters forecasting methods. They estimated the histor-ical growth using Holt-Winters decomposition and produced long lead-time forecasts. On the base of long data series, they evaluated the out-of-sample forecasting performance of this and other forecasting procedures and concluded that presented modification of Holt-Winter method greatly improves forecasting performance for long lead-times. Presented approach also considers an assessment of uncertainty in the predictions. By modified Holt-Winters procedure they varied the trend used for predictions and estimated the sensitivity of forecasts to assumptions about the future trend. Grosche et al. (2007) presented two gravity models for the estimation of air passenger flows between pairs of cities. Models are based on variables describing economic activity and geographical characteristics of city pairs. This means that developed models can be applied in case when new service is considered or historical data are not available.
Within the class of parametric methods, AutoRegressive Integrated Moving Average (ARIMA) has become one of the common parametric forecasting approaches since the 1970s (Wei, Chen 2012). In addition, with the characteristics of seasonality and trends in traffic data, some researches use seasonal ARIMA to predict traffic flow (Williams, Hoel 2003;Tan et al. 2009) and international air passenger flow (Faraway, Chatfield 1993;Lim, McAleer 2002;Chen et al. 2009).
As already mentioned, many railway planning processes including financial, asset, capital investment and service plan are predicted on the estimate of passenger demand to be handled within a given time period. Forecasts of traffic volume represent one of the most important outputs of a marketing plan and also the important input to the corporate plan. Current planning process in JSC 'Serbian Railways' lacks of any planning tool, so having that in mind, as well as the actual market share and increasing competition there is a need for better planning instruments in the company. Therefore, in this paper we present the first detailed application of Seasonal ARIMA (SARIMA) models in order to generate passenger traffic forecasts for the case of JSC 'Serbian Railways' . We use the data representing monthly passenger flows on all lines of Serbian railway network provided by Statistical Office of the Republic of Serbia (http://webrzs. stat.gov.rs). At the time of our analysis a time series of monthly passenger flows from January 2004 to June 2014 (126 months) was available.
Paper is organized as follows. Section 1 introduces ARIMA and SARIMA models and describes Box-Jenkins methodology for their selection. In Section 2, SARIMA modeling process has been applied for deriving an efficient model to assess rail passenger demand for a case of Serbian railways. Last section concludes this paper. Box et al. (2013) introduced the ARIMA method. This method now represents one of the most frequently used univariate time series modeling tools. ARIMA models are based on AutoRegressive Model (AR), the Moving Average Model (MA) and the combination of the AR and MA, the ARMA models (Suhartono 2011).

ARIMA and SARIMA models
AR model includes lagged terms on the time series itself, and MA model includes lagged terms on the noise or residuals. The first requirement for ARIMA modeling is that the time series data to be modeled are either stationary or can be transformed into stationary. Therefore, the letter 'I' (Integrated) means that the first order difference is applied in order to stationarize given time series. First order differencing requires that there is a need to find between observations in two successive months. In case that there is a time series with trends, seasonal pattern and short time correlations a Seasonal ARIMA (SARIMA) model can be used. Besides three main components present in ARIMA model, there is a need for seasonal differencing in order to make a seasonal time series stationary. Therefore, a SARIMA model has four components: -the non-seasonal and seasonal AR polynomial term of order p and P, respectively: -the non-seasonal and seasonal Moving Average (MA) part of order q and Q, respectively: (3) is the of order d used to eliminate polynomial trends; -seasonal differencing operator, ( ) − 1 D s B , of order D used to eliminate seasonal patterns. Parameters f and q are the ordinary ARMA coefficients, F and Q are the seasonal ARMA coefficients, B is the backshift operator, whose effect on a time series Y t can be summarized as B Y Y . Therefore, the generalized form of SARIMA(p, d, q)× (P, D, Q) s model for a series Y t can be written as (Box et al. 2013;Cryer, Chan 2008): where: s is the length of the periodicity (seasonality) and e t is a white noise sequence. Box-Jenkins method in essence involve three steps for fitting the ARIMA and SARIMA models. These are identification, estimation and validation of the model (Box et al. 2013;Prista et al. 2011). First step is generally based on an analysis of AutoCorrelation Function (ACF) and Partial ACF (PACF) and their comparison with theoretical profiles of these functions in AR, MA and ARMA processes. The output of the identification step , , S p d q P D Q model structure. The model structure selected in previous step has to be fitted to the time series and its parameters to be estimated. This is the essence of the second step and it is done by using the conditional sum of squares or maximum likelihood method. Validation of selected model is performed within the diagnostic checking by analysis of stationarity, invertibility as well as the presence of redundancy in model parameters. If a selected model fails in diagnostic check it is necessary to repeat the whole procedure again. After an appropriate model is found it can be used for forecasting purpose (Box et al. 2013;Prista et al. 2011).
Subjectivity is mostly present in model identification step. In the fact that this step is based essentially on graphical interpretations of ACF/PACF estimates. To cope with this subjectivity and to improve the determination the final orders of the ARMA processes there are a lot of model selection criteria proposed in literature (De Gooijer et al. 1985).
The most frequently used are information criteria such as Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC) and normalized version of BIC. These information criteria are designed to deal with the fit of the nonlinear models and to account for the number of the parameters in the model as well. They consist of the natural log of the Mean Squared Error (MSE) and a penalty for the number of parameters being estimated (Yaffee, McGee 2000): where: T is number of observations, k is the number of parameters in the model = + + + +1 k p q P Q . Common procedure often involves estimating the largest model for which it is assumed to correctly capture the dynamics of a time series and then decreasing its size (dropping the lags) until the minimum value of AIC, BIC or Normalized BIC is reached. Presented methodology will be used in next chapter for analysis of railway passenger flows on Serbian railways.

SARIMA model for rail passenger flow forecasting: a case study of Serbian railways
In this chapter, we outline the practical steps, which need to be undertaken to use SARIMA time series models for passenger flows forecasting on Serbian railways. To test alternative SARIMA models we obtained a time series of a total monthly number of passengers travelled by the railway from the Statistical Office of the Republic of Serbia (http://webrzs.stat.gov.rs). Data set covers the period from January 2004 to June 2014 (126 monthly values). The time series is presented on Figure 1. First 114 monthly observations are used for fitting purpose, whereas the remaining 12 observations served for verification the forecasting capability of selected SARIMA model.
According to the Box-Jenkins methodology, in SA-RIMA models there is an explicit consideration of trend and seasonal non-stationarities by the model structure. However, it is needed to analyze non-stationarities of variance before the model fitting. As it can be seen from for rail passenger traffic in Serbia have strong seasonality pattern with a trend, which is not constant throughout the whole study period, rather slightly decreasing in period from 2004 to 2010 and increasing in period from 2011 to 2014, an increase in variance with the serious mean is identified from variance-mean plots ( Figure 2). Therefore, we log-transformed the data in order to stabilize the variance and then used these log transformed data set as input to the SARIMA analysis ( Figure 3). Different SARIMA models were applied to find the best fitting model. Among all candidate models, the most suitable predictive model for our case was found on the base of Box-Jenkins model and normalized BIC. This approach involved selection of the candidate model set, estimation of the model and determination of normalized BIC and diagnostic check. The set of alternative models is determined by analyzing the sample estimates of ACF and PACF, and defining the three major orders of the SARIMA models: d, D, and S. Lags 1,2,3,4,9,10,11,12,13,14,22,23,24,25,26,29 and 30 had large autocorrelations with values 0.80,0.64,0.45,0.28,0.25,0.36,0.46,0.51,0.38,0.27,0.22,0.31,0.39,0.29,0.19,respectively (Figure 4). According to these values, namely, the very sharp decrease of autocorrelation values in first three lags; it is obvious that there exists a long-term trend. In order to eliminate it, the first lag difference term has to be included in the structure of the SARIMA model (d = 1). Annual lags and its multiplies also reported large autocorrelation values. Consequently, seasonal difference terms are included in models (S = 12, D = 1). Further justification of these conclusions is obvious from ACF and PACF plots of differenced time series (Figure 4). Based on this conclusions, we selected SARIMA( ) ( ) × 12 ,1, ,1, SARIMA p q P Q as the basic structure of the alternative SARIMA models.
Among the statistical models, SARIMA (0,1,0)×(0,1,1) 12 was the most appropriate with the lowest normalized BIC of 7.056 and a Mean Absolute Percent Error (MAPE) of 4.176 (Table 1). 70.8% of the variance of the time series was covered by this model (stationary R-squared). According to the Ljung-Box statistics it is proven that the selected model is correctly specified. Namely, the value of 0.542 shown here is not significant, so we can be confident in correct specification of the model. This model has the following equation:   According to diagnostic checks SARIMA model appears as stationary and invertible without redundant parameters. According to the Ljung-Box Q-test and p-value (Ljung-Box Q = 15.747, p-value >0.05) residuals were white noise and therefore there is no significant auto-correlation between residuals at different lag times (Figure 5). Significance level of a = 0.05 was used for all tests.
Twelve    Figure 6 displays observed values and SARIMA model fit and forecast values. Model gives slightly lower forecasts then observed values (September 2013-March 2014, but pattern of model forecasts almost matched the one in observed passenger counts except for period from May 2014 to June 2014. The main reason for this is a slightly increasing trend for the last two years of the study period. RMSE during the prediction period (RMSE = 56.45) was 1.9 times the RMSE of the fitting period (RMSE = 29.69). Negative errors and low ME (ME = 4.32) registered in eight of twelve forecasts indicate on minority of underestimation in global terms. However, it has to be noted that this conclusion is valid for the case of random error, but this will not be the case if the error is systematic (system changes -transportation, activity or flow patterns). MAPE was 7.13% reflecting the slightly lower forecasts during the July 2013 and September 2013 -March 2014. The SARIMA model forecasts also registered better performances with respect to SES, resulting in 9.8% reduction in RMSE, 44.2% reduction in ME, and 9.7% reduction in MAPE.
Outliers are known to cause trouble in time series model identification, estimation and forecasts (Prista et al. 2011). Rail passenger flow data set presented four apparent outliers, which are incorporated into the model (Figure 7). They are of different kinds. These outliers are mainly additive that affects a single observation (August 2007, September 2009and February 2012, but there is also one transient outlier whose impact exponentially decays (June 2010). The main reason for existence of outliers at the beginning of year (like those in February 2012) lies in winter vacations and non-existence of daily and weekly school migrations. During the August 2007 significant flows appeared especially within the international passenger services. Higher volumes of passenger flows in September 2009 resulted by ending of summer vacations and beginning the activities at schools and faculties. June of 2010 in Serbia was very hot and this reflected on lower volumes of passenger flows on Serbian railways.

Conclusions
In this paper, we proposed a sophisticated mathematical model for generating rail passenger traffic forecasts for the case of JSC 'Serbian Railways' . Different SARIMA models were tested to select an appropriate.
The results of diagnostic check indicate that the SA-RIMA(0,1,0)×(0,1,1) 12 is the most appropriate for modeling the rail passenger demand on Serbian railways.
This method can be a valuable decision support tool for company's marketing sector considering from one side, the current situation of rail passenger sector in Serbia (market share, revenues generated, number of passengers) and lack of any sophisticated planning tools, and from the other side, the accuracy of forecasting approach.