PATTERN RECOGNITION BASED SPEED FORECASTING METHODOLOGY FOR URBAN TRAFFIC NETWORK

A full methodology of short-term traffic prediction is proposed for urban road traffic networks via Artificial Neural Network (ANN). The goal of the forecasting is to provide speed estimation forward by 5, 15 and 30 min. Unlike similar research results in this field, the investigated method aims to predict traffic speed for signalized urban road links and not for highways or arterial roads. The methodology contains an efficient feature selection algorithm in order to determine the appropriate input parameters required for neural network training. As another contribution of the paper, a built-in incomplete data handling is provided, as input data (originating from traffic sensors or Floating Car Data (FCD)) might be absent or biased in practice. Therefore, input data handling can assure a robust operation of speed forecasting also in case of missing data. The proposed algorithm is trained, tested and analysed in a test network built in a microscopic traffic simulator by using the daily course of real-world traffic.


Introduction
The forecasting of traffic states has always been a popular topic in transportation research. Several investigations have been conducted both for freeway and urban traffic parameter prediction, such as traffic flow, travel time, occupancy, probability of congestion, or emission (Vlahogianni et al. 2014; Zefreh, Török 2016; Buzási, Csete 2015). Nowadays, short-term road traffic prediction in particular has become an important problem due to new mobility trends, i.e. emerging ITS (Intelligent Transportation System) tools for traffic management as well as the sharing economy in transport. The importance of relevant average speed forecast is straightforward for ITS applications, e.g. (Csikós et al. 2015a; Ficzere et al. 2014). At the same time, it must be emphasised that services based on resource sharing also require information sharing. In fact, resource pooling in urban transport cannot be successfully achieved without appropriate knowledge of real-time and future traffic states.
Traffic prediction methods can be classified as classical prediction or data-driven methods. Classical methods apply micro- and macroscopic traffic models and/or use statistical tools for model-based estimation, e.g. Ben-Akiva (1998), Van Grol et al. (1999) or Lin et al. (2008). Among the classical methods with a statistical approach, the most commonly used are Bayesian network models (Fei et al. 2011), History Average (HA) models, Autoregressive Integrated Moving Average (ARIMA) models (Williams et al. 1988; Billings, Yang 2006; Guin 2006), non-parametric regressions, and procedures based on the Kalman filter (Okutani, Stephanedes 1984; Guo et al. 2014). These prediction methods achieve their forecast through the analysis of historical data time series. Therefore, they are mostly used for freeway traffic. For this reason, with the evolution of computational intelligence, data-driven methods, which consider the traffic system as a black box, have gained attention. This approach is based on analysing the data in order to find relations between the input and output state variables. Data-driven methods may also be complemented by statistical tools to improve future predictions. The most popular data-driven techniques include self-learning pattern recognition based on Artificial Neural Network (ANN) models (Vlahogianni et al. 2005; Dougherty, Cobbett 1997; Chen et al. 2012), fuzzy-rule based logic (Li et al. 2008), Support Vector Machines (SVM) (Yu et al. 2013), k-means clustering (Montazeri-Gh, Fotouhi 2011; Lin et al. 2013), and expectation maximization based algorithms (Lo 2013). Summarizing the contributions of the papers cited above, the main advantage of the data-driven methods is the ability to capture the linkage of traffic variables in a complex urban network, even under rapidly changing conditions.
A recent overview of the latest traffic forecasting research and future challenges is provided by Vlahogianni et al. (2014). Therefore, a fully detailed literature review is omitted here. Only the most relevant papers are discussed, focusing on the new contributions proposed in this paper.
Traffic flow estimation via computational intelligence has been investigated in depth in recent years. Gastaldi et al. (2014) presented a combined ANN-Fuzzy method for estimating the average daily traffic flow based on one-week traffic counts. Zhu et al. (2014) also aimed to forecast traffic volume by using a radial basis function neural network. Lozano et al. (2009) introduced a camera-based image processing method for congestion level recognition without prediction. Srinivasan et al. (2009) and Dimitriou et al. (2008) proposed an ANN based urban traffic flow forecast. Kumar et al. (2013, 2015) investigated short-term traffic flow prediction via ANN and successfully validated the method by using highway traffic flow data. Although the previous results clearly justify the applicability of computational intelligence, they only investigate traffic flow prediction. Concerning travel time or speed forecasting via ANN, several studies are also available. These papers, however, only concern prediction on freeways or urban arterial roads; see the papers listed in the review article of Vlahogianni et al. (2014), or the papers of Liu et al. (2006), Van Lint (2004) or Basu, Maitra (2006). At the same time, traffic speed forecasting for complex urban networks with traffic lights is still a rarely investigated problem. After a thorough literature review, only the articles of Fusco et al. (2015) and Csikós et al. (2015b) have been found as research results on this specific problem. The former studied the short-term traffic prediction problem on a large urban network by using Floating Car Data (FCD) via ANN. The latter also proposed ANN based prediction for speed categories in an arbitrary urban network.
Compared to the surveyed relevant papers, the contributions of our research, which provide solutions to some open problems, are listed below:
- data-driven methods might be very sensitive to training data quality. This fact is especially important as the input parameters in this problem can be generated from FCD or traffic sensors. In practice, however, this kind of measurement data is often biased or intermittent by nature. Therefore, the ANN might suffer a loss of performance when trained on incomplete data. The previously cited research works either do not deal with the missing data problem or apply a preprocessing step known as imputation to generate the missing values (e.g. by using the mean). The application of these techniques, however, may reduce the final prediction performance in case of absent data. Therefore, a built-in incomplete data handling is proposed as in Viharos et al. (2002) in order to guarantee a robust operation of speed forecasting also in case of missing data;
- the choice of the right input parameters for neural network training is not a straightforward problem. The reviewed papers generally apply a few input parameters without any preprocessing. As a conscious approach, this paper suggests an additional algorithm called feature selection in order to find the proper set of inputs (statistical parameters). The best operation of machine learning is typically achieved by using a small set of input variables. Feature selection is therefore crucial: it allows a wide range of features to be defined first and then filtered based on their importance, thus reducing the number of input variables for model building;
- due to the lack of appropriate (frequent) real-world speed data, data originating from a validated microscopic traffic simulator was used during the research. This, however, resulted in an additional contribution. Simulation based data generation is capable of producing a huge set of traffic states, including extreme scenarios which only occur rarely. Consequently, compared to real-world traffic measurement, it has the great advantage that the forecasting model is trained to cover a much wider range of traffic situations;
- a detailed comparative analysis has been carried out on the performance of two methods, the suggested ANN approach and the widely used SVM. These machine learning methods are very popular due to their generalization capabilities and their ability to model complex nonlinear problems with high accuracy. They are especially useful for modelling highly uncertain problems where the connections between the variables are unknown;
- finally, it is emphasised that the present paper provides a full methodology by summarizing practical considerations on the use of ANN with built-in missing data handling, and thus shares experiences with the research community on the special problem of short-term traffic speed estimation for signalized urban road links.
Basically, the contributed methodology of the paper is motivated by the complex engineering problem of short-term traffic state estimation in case of incomplete traffic data. This consists of the following practical tasks: data collection (adequate traffic simulation or measurement), preparation of data, model building (feature selection, ANN training).
Section 1 introduces the basic preparatory tasks for pattern recognition. In Section 2, relevant soft computing techniques and algorithms are presented. Section 3 provides a full methodology for ANN based short-term speed prediction, demonstrated by a traffic simulation based case study. Section 4 contains an evaluation study that demonstrates the viability of the proposed method. Conclusions are given at the end of the paper.

Preparatory Works for Pattern Recognition
The efficiency of pattern recognition can be considerably enhanced by preparing and using appropriate data. During the construction of the dataset, three main aspects are addressed:
- realistic patterns are needed, but recurring patterns must be excluded to avoid overfitting of the ANN;
- the irrelevant data need to be filtered;
- dynamic characteristics of the process need to be built into the database.
The first point is realized by creating traffic excitations as a sum of sinusoids with different frequencies. By using this scheme, the occurrence of different traffic demand waves can be mimicked (e.g. the short rush before school opening during the morning rush-hour). The amplitudes of the different sinusoids are given by random variables to exclude deterministic patterns.
The second point is addressed by considering topological characteristics: spatially irrelevant information (i.e. the data of non-connected links) is excluded. When creating the database, it is reasonable to exploit the dynamic characteristics of the system. The most basic consideration is that the analysis and prediction horizons need to be longer than the time constant of the system. Further dynamic characteristics can be involved by using statistical features, such as high-order moments, the tendencies and the highest relative variations. The additional input features of the neural network are listed in Table 1. Note that the prediction is carried out for each link separately, thus a dedicated dataset must be calculated for each link.
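The sum-of-sinusoids excitation with random amplitudes can be illustrated with a minimal sketch. The function name, the base demand, the amplitude bound and the periods are all hypothetical values chosen for illustration; they are not taken from the case study.

```python
import math
import random


def demand_profile(n_steps, periods, base=600.0, max_amp=100.0, seed=None):
    """Return a traffic demand series built from a sum of sinusoids.

    n_steps -- number of time steps to generate
    periods -- period of each sinusoid in time steps (hypothetical values)
    base    -- constant mean demand in veh/h (hypothetical value)
    max_amp -- upper bound of the random amplitude of each sinusoid
    """
    rng = random.Random(seed)
    # One random amplitude per component, so that runs do not repeat.
    amps = [rng.uniform(0.0, max_amp) for _ in periods]
    series = []
    for k in range(n_steps):
        value = base + sum(
            a * math.sin(2.0 * math.pi * k / p) for a, p in zip(amps, periods)
        )
        series.append(max(0.0, value))  # a demand cannot be negative
    return series


# 6 hours at 5 min resolution, with slow, medium and fast demand waves.
profile = demand_profile(72, periods=[72, 24, 6], seed=1)
```

Drawing a new set of amplitudes for every simulation run gives each run a different demand shape while the daily course remains realistic.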

ANN Model for Pattern Recognition
This section describes those soft computing techniques and algorithms that are applied in the presented methodology.

First Stage: Feature Selection
The Euclidean-distance based feature selection algorithm was originally proposed by Devijver and Kittler (1982). It assumes a pure classification task with the goal of reducing the number of inputs needed for one single output. As a generalization, continuous output parameters can be mapped onto the discrete classification scheme with an appropriate heuristic. Such a heuristic is used in the applied method, where the values of the output encountered in the training data set are grouped into the highest possible number of clusters (i.e. intervals of equal length), so that at least one element is contained in each interval.
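The discretization heuristic can be sketched as follows. The function name and the downward search over the number of intervals are assumptions made for illustration; the text only specifies the result (the largest number of equal-length intervals with no empty interval).

```python
def discretize_output(values):
    """Map continuous outputs to class labels using equal-length intervals.

    Returns (labels, k), where k is the largest number of equal-length
    intervals over [min, max] such that no interval is empty.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values), 1

    def labels_for(k):
        width = (hi - lo) / k
        # The maximum lies on the right edge, so clamp it into the last bin.
        return [min(int((v - lo) / width), k - 1) for v in values]

    # Try the finest partition first; k = 1 always succeeds.
    for k in range(len(set(values)), 0, -1):
        labels = labels_for(k)
        if len(set(labels)) == k:
            return labels, k


labels, k = discretize_output([0.0, 0.1, 10.0])
```

In this example three intervals would leave the middle one empty, so the heuristic falls back to two classes.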
Once the continuous output vector is transformed into a discrete range, the feature selection algorithm can be applied to rank the input features based on relevance. This is done by using sequential forward selection and applying a statistical measure, which tries to maximize the separability of the output classes. The following equations define the statistical measure: where: c is the number of classes of the output; n_i is the number of samples in the i-th class; n is the number of samples; m_i is the centre of gravity of the i-th class; m is the centre of gravity of the samples; p_ij is the j-th sample of the i-th class.
Vector parameters p_ij, m_i and m are defined in a subset of the whole feature set, i.e. the dimension of these vectors equals the number of features the subset contains. The dimension of p_ij, m_i and m increases over the iterations of the sequential forward selection as more and more features are selected. In a given iteration the newly selected feature is the one for which the M value of the containing subset is the highest. Basically, S_b represents the average distance between the classes and S_w represents the average distance within the classes, and M has to be maximized in each iteration for the classes to be the most separated in a given subset. Fig. 1 describes the pseudocode for the feature selection algorithm, where the calculate() function calculates the M value described in Eq. (1) on its input set. The set R grows by one new feature in each iteration based on the M value. Note that the number of the iteration in which the feature was chosen is also stored in R by pairing it with the feature. This also means that the number of maximization tasks in this algorithm equals the total number of features.
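The equations themselves are not reproduced in this text. One common scatter-based separability criterion that is consistent with the symbols defined above is given below; it is an assumption offered for orientation and may differ in detail from Eq. (1) of the original.

```latex
\begin{align}
S_b &= \sum_{i=1}^{c} \frac{n_i}{n}\,\lVert m_i - m \rVert^2, \\
S_w &= \sum_{i=1}^{c} \frac{n_i}{n}\cdot\frac{1}{n_i}\sum_{j=1}^{n_i} \lVert p_{ij} - m_i \rVert^2, \\
M   &= \frac{S_b}{S_w}.
\end{align}
```

Under this assumed form, a large M means that the class centres lie far from the overall centre relative to the spread of the samples around their own class centres.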
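The sequential forward selection can be sketched as below. The measure used here, M = S_b / S_w with squared Euclidean distances, is an assumed form of the criterion, and the function names are hypothetical; only the selection loop and the pairing of each feature with its iteration number follow the description above.

```python
def separability(X, labels, subset):
    """Assumed measure M = S_b / S_w on the feature columns in `subset`."""
    n = len(X)
    pts = [[row[f] for f in subset] for row in X]
    dim = len(subset)
    m = [sum(p[d] for p in pts) / n for d in range(dim)]  # overall centre
    s_b = s_w = 0.0
    for c in set(labels):
        cls = [p for p, y in zip(pts, labels) if y == c]
        m_i = [sum(p[d] for p in cls) / len(cls) for d in range(dim)]
        # Between-class part: weighted squared distance of m_i from m.
        s_b += len(cls) / n * sum((a - b) ** 2 for a, b in zip(m_i, m))
        # Within-class part: squared distances of samples from m_i.
        s_w += sum(sum((a - b) ** 2 for a, b in zip(p, m_i)) for p in cls) / n
    return s_b / s_w if s_w > 0 else float("inf")


def forward_selection(X, labels):
    """Rank all features; returns [(feature_index, iteration), ...]."""
    remaining = list(range(len(X[0])))
    ranked = []  # plays the role of the set R in the text
    while remaining:
        chosen = [f for f, _ in ranked]
        best = max(
            remaining, key=lambda f: separability(X, labels, chosen + [f])
        )
        ranked.append((best, len(ranked) + 1))
        remaining.remove(best)
    return ranked


# Feature 0 separates the two classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]]
order = forward_selection(X, [0, 0, 1, 1])
```

One maximization is carried out per iteration, so the loop runs as many times as there are features, and the first n entries of the result give the n most relevant inputs.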

Second Stage: ANN Model Building
Over the decades, ANNs have proved to be powerful computational models for solving complex estimation and classification problems. An ANN mimics the functionality of biological neural networks (McCulloch, Pitts 1943). One of the most popular and widespread ANN models is the Multi-Layer Perceptron (MLP) (Werbos 1974). Fig. 2 shows an MLP model where the neurons are organized into layers and each layer is fully connected with the next one. Supervised training of an MLP means repeated adjustment of the weight of each link to receive more and more favourable output on specific neurons (output neurons) while stimulating other neurons (input neurons). The backpropagation algorithm achieves this by calculating the derivatives of the network's error with respect to all of its weights and adjusting the weights to a position where, based on the derivatives, the error is smaller, i.e. moving the weights in the descent direction indicated by the derivatives. Here the error is a measure of the difference between the network's output and the target values for the same input. This is a form of supervised learning because the data samples used for training are known before the model building.

Built-In Incomplete Data Handling
Incomplete data is a common problem in pattern recognition. The typical solution is to impute the missing or incorrect values with a default or interpolated value. This solution has the downside of generating distortion in the dataset. The applied MLP model has an extension to the original backpropagation algorithm, which allows dynamic handling of missing values (e.g. FCD may be intermittent) (Viharos et al. 2002). The concept of the extension is reconfiguring the network for each sample and turning off the input and output neurons corresponding to the missing values. The neurons that are turned off and every link connected to them behave as objects outside the network. Training runs until the maximum number of iterations is reached or the error is below the predefined threshold. Basically, it works the same way as the original backpropagation algorithm, but before the forward and backward calculation the network is reconfigured according to the missing values, and after processing the network is reverted to its original state. In each training iteration the following steps are executed for every training sample:
1) turn off the network neurons according to the missing values of the given data vector;
2) apply the model on the complete part of the input data vector;
3) calculate the derivatives of the network weights;
4) calculate the corresponding changes of the weights and sum them up;
5) turn on all the neurons that were turned off in step 1.
Earlier results (Viharos et al. 2002) showed that this solution performs better than the typical imputation methods because no distortion is added to the data during the procedure. This paper also compares some imputation methods with the built-in data handling and reaches the same conclusion.
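The five steps can be sketched for a small MLP as follows. This is an illustrative reading of the procedure and not the implementation of Viharos et al. (2002): the single tanh hidden layer, the linear outputs, the squared error, the learning rate and all function names are assumptions, and missing values are marked with None.

```python
import math
import random


def init_mlp(n_in, n_hid, n_out, seed=0):
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"W1": rand(n_hid, n_in), "b1": [0.0] * n_hid,
            "W2": rand(n_out, n_hid), "b2": [0.0] * n_out}


def forward(net, x):
    """Step 2: apply the model using only the inputs that are present."""
    on = [i for i, v in enumerate(x) if v is not None]  # step 1 (inputs)
    h = [math.tanh(b + sum(row[i] * x[i] for i in on))
         for row, b in zip(net["W1"], net["b1"])]
    y = [b + sum(w * hj for w, hj in zip(row, h))
         for row, b in zip(net["W2"], net["b2"])]
    return on, h, y


def train_epoch(net, samples, lr=0.05):
    """One batch iteration over (x, t) pairs; returns the summed squared error."""
    n_hid, n_in, n_out = len(net["b1"]), len(net["W1"][0]), len(net["b2"])
    gW1 = [[0.0] * n_in for _ in range(n_hid)]
    gb1 = [0.0] * n_hid
    gW2 = [[0.0] * n_hid for _ in range(n_out)]
    gb2 = [0.0] * n_out
    sq_err = 0.0
    for x, t in samples:
        on, h, y = forward(net, x)
        # Step 1 (outputs): an output with a missing target carries no error.
        err = [0.0 if tk is None else yk - tk for yk, tk in zip(y, t)]
        sq_err += sum(e * e for e in err)
        # Steps 3-4: derivatives of the weights, summed over the samples.
        for k, e in enumerate(err):
            gb2[k] += e
            for j in range(n_hid):
                gW2[k][j] += e * h[j]
        for j in range(n_hid):
            delta = (1.0 - h[j] ** 2) * sum(
                net["W2"][k][j] * err[k] for k in range(n_out))
            gb1[j] += delta
            for i in on:  # links of switched-off inputs are left untouched
                gW1[j][i] += delta * x[i]
        # Step 5: nothing to restore, the masking was local to this sample.
    for j in range(n_hid):
        net["b1"][j] -= lr * gb1[j]
        for i in range(n_in):
            net["W1"][j][i] -= lr * gW1[j][i]
    for k in range(n_out):
        net["b2"][k] -= lr * gb2[k]
        for j in range(n_hid):
            net["W2"][k][j] -= lr * gW2[k][j]
    return sq_err
```

Because the switched-off inputs are skipped rather than replaced, no artificial value enters the forward pass and the weights of their links receive no update from that sample.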

Methodology through a Traffic Simulation Based Case Study
In the case study, the objective is to predict the state of traffic around a high capacity intersection. The measured data covers only the mean speed of traffic for the network links with a sampling period of 5 min. Based on a measurement record of 30 min long periods, state prediction is carried out for different horizon lengths (5, 15 and 30 min). Using continuous input values, continuous speeds of traffic are forecast.

Traffic Simulation in the Test Network
The case study network models the vicinity of Oktogon square in District 6, Budapest (Fig. 4).
The models were trained and tested using VISSIM simulation data exclusively, as we only have limited access to real-world traffic data. The characteristics of the real-world traffic flow (peak period dynamics) were imitated in the simulations. Therefore, the daily pattern of traffic demands could be reproduced. In addition, the network traffic parameters (e.g. signal plans) were also tuned according to the real-world attributes.
All links are examined in all directions. The selected road links are separated by intersections with traffic lights. Thus, the lengths of the links differ: the shortest is approximately 100 m (No 10 and 15 in Fig. 4), while the longest is approximately 330 m (No 11 and 14).
For the simulations, the microscopic traffic simulator VISSIM is utilized (Wiedemann 1974) together with MATLAB (Tettamanti, Varga 2012).

Preparatory Works
Fig. 4. Scheme of the modelled real-world road network (Oktogon square, Budapest, Hungary, GPS: 47.505207, 19.0633920) and Bing Map of Budapest
During a simulation, the mean speed of traffic in each link is measured with a sampling time of 5 min. A sample is given as a row vector containing the mean speeds of the links. The measurements are organized in 60 min blocks. Thus, one record contains data of 12 measurements. Each record is divided into two parts: the measurement data of the first 30 min are used as inputs (which are further modified), while the last 30 min serve as the basis of the outputs of the neural network. Applying a time-shifted framework for the measurement dataset, from a t-hour long simulation a total of (t - 1) × 12 records can be produced. In the case study, 60 simulation runs were conducted, each lasting 6 hours. Each simulation run resulted in 60 records. Therefore, a total of 3600 records were obtained, of which 2500 records were used for training and 1100 for testing the neural network. In the methodology, the prediction is carried out separately for each link, using a dedicated dataset (note that Section 4 presents a case study for link 13 only). First, the non-relevant data are excluded: link measurements of opposite directions (e.g. for link No 1, data of links No 5-8 are excluded). Then, the statistical features of Table 1 for each relevant link are calculated and attached to the input vector. As a result of the preparations, one record of the data set is a vector of length 120.
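The time-shifted construction of records can be sketched as follows. The function name and the data layout are hypothetical, and the statistical features of Table 1 are not included. A plain sliding window would admit one further start position, so the range of start indices below is an assumption chosen to reproduce the (t - 1) × 12 count stated above.

```python
def build_records(speeds, n_in=6, horizons=(1, 3, 6)):
    """Split one simulation run into (inputs, targets) records.

    speeds   -- list of samples, one per 5 min; each sample is a list of
                link mean speeds
    n_in     -- number of samples in the 30 min input part
    horizons -- steps ahead of the last input sample (5, 15 and 30 min)
    """
    window = n_in + max(horizons)  # 12 samples = 60 min
    records = []
    for start in range(len(speeds) - window):
        inputs = speeds[start:start + n_in]
        last = start + n_in - 1
        targets = [speeds[last + h] for h in horizons]
        records.append((inputs, targets))
    return records


# A 6-hour run with 16 links, filled with dummy values for illustration.
run = [[float(k)] * 16 for k in range(72)]
records = build_records(run)
```

With a 6-hour run this yields 60 records, and 60 such runs give the 3600 records used in the case study.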
The output of each pattern recognition problem is thus a three-element continuous valued vector (with state prediction of 5, 15 and 30 min ahead of the last input).

Application of the ANN
The applied model building method consists of two process stages. The first stage is a feature selection method, which greatly reduces the number of parameters, making the problem tractable for neural network training. In the presented case, there are 120 possible input features describing the whole traffic network and 3 output features, as one of the 16 links is estimated 5, 15 and 30 min into the future. For each output a separate feature selection is required, as the order of features depends on the estimation task. Once the order of features is established, one can select the best n features as the inputs of the MLP model. This decision is usually based on expert knowledge. In the presented application the first 10 features are selected in each of the estimation tasks for the sake of comparability. Moreover, the rest of the features can be considered insignificant based on the feature selection measure. Table 2 shows the first 10 selected features. In the naming, the first part describes the feature type discussed earlier with one exception: Speed means the link mean speed measurements. The second part refers to the one of the 16 links from which the feature was calculated. The third part denotes which 5 min period (e.g. the fifth) of the 30 min input interval is used by the feature (if the third part is missing, then the feature is calculated based on the whole 30 min interval).
For evaluation purposes, two other methods were tested for selecting features: mRMR (Maximum-Relevance Minimum-Redundancy) (Peng et al. 2005) and expert knowledge. Expert knowledge means the opinion of engineers or scientists who have significant experience in the field of traffic network dynamics. The selected features are listed in Tables 3-4 and the performance evaluation is presented in Section 4. Table 3 shows the feature selection order using the mRMR algorithm, while Table 4 shows 8 features which were selected by using expert knowledge (these were applied for all three tasks). After selecting the most significant features from the feature set during the first stage, the second stage can apply the MLP training on the reduced dataset. For each estimation task a separate model is built and tested, where the inputs of the given model are the 10 selected features.
Furthermore, different versions of each dataset are created for simulating varying amounts of incompleteness of the data. Two types of incomplete data have been generated. In the first case, link measurements of random samples are missing. In the second case, the measurements of the whole network are missing from random measurement periods. Incomplete databases are created offline, following a random choice on sample loss. The additional statistical features are then recalculated considering the incomplete measurement data. The following levels of incompleteness are considered: 10, 20 and 30%. This incompleteness is handled by the MLP model automatically, while for the SVM comparison imputation is applied on the datasets.
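The two types of offline data loss can be sketched as below. The function names are hypothetical, None marks a missing measurement, and removing an exact share of the entries (rather than dropping each entry with a given probability) is an assumption; the recalculation of the statistical features is not shown.

```python
import random


def drop_link_samples(speeds, ratio, rng):
    """Type 1: individual link measurements are missing at random."""
    n_links = len(speeds[0])
    total = len(speeds) * n_links
    lost = set(rng.sample(range(total), round(ratio * total)))
    return [[None if s * n_links + j in lost else v
             for j, v in enumerate(sample)]
            for s, sample in enumerate(speeds)]


def drop_periods(speeds, ratio, rng):
    """Type 2: the whole network is missing in random measurement periods."""
    lost = set(rng.sample(range(len(speeds)), round(ratio * len(speeds))))
    return [[None] * len(sample) if s in lost else list(sample)
            for s, sample in enumerate(speeds)]


rng = random.Random(0)
speeds = [[30.0] * 10 for _ in range(10)]  # dummy data: 10 periods, 10 links
type1 = drop_link_samples(speeds, 0.2, rng)
type2 = drop_periods(speeds, 0.2, rng)
```

Both variants remove the same share of the measurements here, but the second concentrates the loss in complete rows, which mimics an outage of the whole measurement system.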

Evaluation
The following case study describes the evaluation results for link 13 solely (see the network in Fig. 4). The presentation of the estimation results is divided into two parts. The first part shows a comparison of the ANN and SVM models on complete datasets. Then, the second part discusses the performance of these two model types on incomplete datasets. As a state-of-the-art technology, the SVM is chosen for comparative evaluation. SVM is a popular method nowadays, widely used in many classification and regression problems (Byun, Lee 2002; Moguerza, Muñoz 2006). One of the most popular SVM libraries is LIBSVM, which is coded in the C++ programming language and has an interface for MATLAB.
The main tuning parameters for the SVM were applied as follows: SVM type: nu-SVR; kernel type: radial basis function; epsilon: 0.001; cost: 1; gamma: 0.01. They were chosen by experimenting with different settings and choosing the one that yields the best performance results.

Feature Selection Comparison
This section presents the results of the comparison of three different feature selection approaches: the Euclidean-distance based feature selection described in Section 2.1; the mRMR algorithm (Peng et al. 2005) and manual selection using expert knowledge (as described in Section 3.3). Table 5 shows how the three different feature selection approaches performed after training the MLP model with the selected features. In the 5 and 15 min estimation tasks the models trained with the inputs selected by the Euclidean-distance based feature selection performed better than those of the other two approaches. In the 30 min estimation task expert knowledge was the best and mRMR was the worst, but the performance differences are smaller than in the other two tasks. The following analysis (in Sections 4.2 and 4.3) was carried out using the inputs provided by the Euclidean-distance based feature selection (Table 2).

Speed Estimation: Full Data Case
This subsection presents the results of the 3 estimation tasks (5, 15 and 30 min) using the ANN and the SVM model. Both models were trained with the same dataset, but evaluated on two different test datasets (denoted as Test data #1 and Test data #2) in order to provide a valid comparison. Table 6 shows an overall comparison of the ANN and SVM models. The results show that the ANN provides estimates with 10-15% lower relative errors compared to SVM. Fig. 5 provides the estimation results of the ANN and SVM models on the 5 min estimation task. The figures display the estimated value for every sample of the test dataset ordered by the real value. The results highlight that the ANN approach performed better than the SVM method. ANN provides the lowest modelling error at low speeds. Low relative errors are also present at high speeds. However, during transient periods (in the interval of [15, 40] km/h) high relative errors can be observed. SVM generally results in a highly uncertain estimation with large errors.

Speed Estimation: Incomplete Data Case
This subsection discusses the estimation results of the test cases with different amounts of incompleteness. As mentioned earlier, the ANN model has built-in incomplete data handling (Viharos et al. 2002). The SVM model, however, is unable to operate on an incomplete dataset. For this reason the SVM model was trained and tested on preprocessed data where the missing values were imputed. Three different imputation values are used: -0, which is a standard imputation value; -0.5, which is the centre of the normalization interval; -the average value, which is the average of the non-missing values of a given feature.
There are two types of incomplete data generation (both for the training and testing datasets), which were described at the end of Section 3.3. These are referenced as Incompleteness #1 and Incompleteness #2 in Table 7 and Figs 6-8. The selected incompleteness percentages are chosen to represent real-world-like data loss situations, i.e. gradual incompleteness cases of 10, 20 and 30%.
The estimation results based on incomplete data are summarized in Table 7. Note that the imputation value denoted by '-' (in the 3rd column of Table 7) indicates that no imputation was applied and the ANN with the built-in incomplete data handling was used instead.
The ANN approach together with the built-in incomplete data handling results in the best performance in comparison with the other methods. By analysing the SVM results, it is observable that with the imputation of the 0.5 value and the average value the SVM models perform similarly, while the imputation of the 0 value shows slightly worse accuracy. The reason for this is that 0 is completely independent of the generated dataset, whereas 0.5 is the centre of the normalization interval and the average value trivially depends on the dataset. In conclusion, it can be seen that the numerical test results clearly justify the efficient applicability of the proposed method. Fig. 6 depicts a comparison of the ANN and SVM models on the first type of incomplete datasets. In the case of the SVM model only the results with the imputation of the average value are shown in the figure, as these proved to be the best. It can be seen how the accuracy decreases as the ratio of incomplete data rises in the datasets. Fig. 7 shows a further analysis of the built-in incomplete data handling. In this case, the ANN model is trained on the complete dataset, but evaluated on the incomplete test datasets. The results indicate that the performance of the trained model decreases as the incompleteness increases in the test datasets. The explanation is that the model was not prepared to deal with incomplete data, as it was trained on a complete dataset. Compared to the results depicted in Fig. 6, one can observe that if the model is trained with the same amount of incompleteness as in the test dataset, the test performance is much better. Fig. 8 shows the average estimation capability of the different imputation methods at different levels of incompleteness. It can be seen that the built-in missing data handling performs better than the plain imputation methods, improving the estimation accuracy by about 10%.
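The three imputation variants used for the SVM comparison can be sketched for one normalised feature column as follows. The function and strategy names are hypothetical, None marks a missing value, and the fallback to 0.5 for a column without any valid value is an assumption.

```python
def impute(column, strategy):
    """Replace the None entries of one normalised feature column.

    strategy -- "zero", "centre" (0.5) or "mean" (mean of present values)
    """
    present = [v for v in column if v is not None]
    if strategy == "zero":
        fill = 0.0
    elif strategy == "centre":
        fill = 0.5
    elif strategy == "mean":
        # Assumed fallback: use the interval centre if nothing is present.
        fill = sum(present) / len(present) if present else 0.5
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]


filled = impute([0.25, None, 0.75], "mean")
```

Only the mean variant adapts to the data of the given feature; the other two insert the same constant regardless of the feature's distribution.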
As a summary of the performance analysis and experiences, an engineering suggestion can be given: the proposed method is appropriate for short-term prediction of 5-15 min, as the simulation results show that the estimation error in this case remains in a reasonably low range. Considering this performance, the elaborated method can be accepted for further use in ITS applications, e.g. traffic incident detection, route guidance, or traffic control.

Conclusions
A traffic speed prediction algorithm and related methodology has been investigated specifically for urban road traffic networks. During the research, important experiences have been gained concerning the methodology for input-output parameter selection and appropriate feature selection.
Numerical results attest that the generation and narrowing of the input dataset play a key role in the urban traffic speed estimation performance.
Basically, two main targets have been achieved. On the one hand, the applicability of the proposed feature selection method was confirmed. Based on the order of features established by the algorithm, a reduced set of 10 parameters could be selected for each of the estimation tasks as the most relevant inputs for ANN training. On the other hand, the advantage of the built-in incomplete data handling solution for traffic speed estimation was shown in the paper. Nevertheless, there are always limitations in such systems.
In the case of the proposed method, if the data loss exceeds the level of 30%, the performance starts to decrease. In conclusion, the simulations justified the viability of the proposed method considering the obtained recognition rates, also in comparison with the competing SVM method. Hence, acceptable urban traffic speed prediction can be performed for short periods, serving as a practically applicable method for several traffic applications.
Additionally, the application of the paper's results may also contribute to efficient collaborative transport services capable of considering traffic incidents.