RECENT ADVANCES ON SUPPORT VECTOR MACHINES RESEARCH

Support vector machines (SVMs), with their roots in Statistical Learning Theory (SLT) and optimization methods, have become powerful tools for problem solving in machine learning. SVMs reduce most machine learning problems to optimization problems, and optimization lies at the heart of SVMs. Many SVM algorithms involve solving not only convex problems, such as linear programming, quadratic programming, second-order cone programming and semi-definite programming, but also non-convex and more general optimization problems, such as integer programming, semi-infinite programming, bi-level programming and so on. The purpose of this paper is to understand SVMs from the optimization point of view and to review several representative optimization models in SVMs, together with their applications in economics, in order to promote research interest in both optimization-based SVM theory and economic applications. This paper starts with summarizing and explaining the nature of SVMs. It then proceeds to discuss optimization models for SVMs following three major themes. First, least squares SVM, twin SVM, AUC-maximizing SVM, and fuzzy SVM are discussed for standard problems. Second, support vector ordinal machine, semi-supervised SVM, Universum SVM, robust SVM, knowledge-based SVM and multi-instance SVM are presented for nonstandard problems. Third, we explore other important issues such as the l_p-norm SVM for feature selection, LOOSVM based on minimizing the LOO error bound, probabilistic outputs for SVMs, and rule extraction from SVMs. Finally, several applications of SVMs to forecasting, bankruptcy prediction and credit risk analysis are introduced.

Reference to this paper should be made as follows: Tian, Y.; Shi, Y.; Liu, X. 2012. Recent advances on support vector machines research, Technological and Economic Development of Economy 18(1): 5–33.

In recent years, the fields of machine learning and mathematical programming have become increasingly intertwined (Bennett, Parrado-Hernández 2006), and SVMs are typical representatives of this trend. SVMs reduce most machine learning problems to optimization problems; optimization lies at the heart of SVMs, and convex optimization in particular plays an important role. Since convex problems are much more tractable algorithmically and theoretically, many SVM algorithms involve solving convex problems, such as linear programming (Nash, Sofer 1996; Vanderbei 2001), convex quadratic programming (Nash, Sofer 1996), second-order cone programming (Alizadeh, Goldfarb 2003; Boyd, Vandenberghe 2004; Goldfarb, Iyengar 2003), and semi-definite programming (Klerk 2002). However, non-convex and more general optimization problems also appear in SVMs: integer or discrete optimization, which considers non-convex problems with integer constraints, semi-infinite programming (Goberna, López 1998), bi-level optimization, and so on. Especially during model construction, these optimization problems may be solved many times. The research area of mathematical programming intersects closely with SVMs through these core optimization problems.
Generally speaking, there are three major themes in the interplay of SVMs and mathematical programming. The first theme contains the development of underlying models for standard classification or regression problems. Novel methods are developed by making changes to the standard SVM models that enable powerful new algorithms, including ν-SVM (Schölkopf, Smola 2002; Vapnik 1998), linear programming SVM (Deng et al. 2012), least squares SVM (LSSVM) (Johan et al. 2002), proximal SVM (PSVM), twin SVM (TWSVM) (Khemchandani, Chandra 2007; Shao et al. 2011), multi-kernel SVM (Sonnenburg et al. 2006; Wu et al. 2007), AUC-maximizing SVM (Ataman, Street 2005; Brefeld, Scheffer 2005), localized SVM (Segata, Blanzieri 2009), cost-sensitive SVM (Akbani et al. 2004), fuzzy SVM (Lin, Wang 2002), Crammer-Singer SVM (Crammer, Singer 2001), K-support vector classification regression (K-SVCR) (Angulo, Català 2000), etc. The second theme concerns well-known optimization methods extended to new SVM models and paradigms. A wide range of programming methods is used to create novel optimization models that deal with different practical problems such as ordinal regression (Herbrich et al. 1999), robust classification (Goldfarb, Iyengar 2003; Yang 2007; Zhong, Fukushima 2007), semi-supervised and unsupervised classification (Xu, Schuurmans 2005; Zhao et al. 2006, 2007), transductive classification (Joachims 1999b), knowledge-based classification (Fung et al. 2003; Mangasarian, Wild 2006), Universum classification (Vapnik 2006), privileged classification (Vapnik, Vashist 2009), multi-instance classification (Mangasarian, Wild 2008), multi-label classification (Tsoumakas, Katakis 2007; Tsoumakas et al. 2010), multi-view classification (Farquhar et al. 2005), structured output classification (Tsochantaridis et al. 2005), etc. The third theme considers important issues in constructing and solving SVM optimization problems.
On the one hand, several methods are developed for constructing optimization problems in order to enforce feature selection (Tan et al. 2010), model selection (Kunapuli et al. 2008), probabilistic outputs (Platt 2000), rule extraction from SVMs (Martens et al. 2008), and so on. On the other hand, existing SVM optimization models are aimed at being solved more efficiently for large-scale data sets, where the key point is creating algorithms that exploit the structure of the optimization problem and pay careful attention to algorithmic and numerical issues, such as SMO (Platt 1999), efficient methods for solving large-scale linear SVMs (Hsieh et al. 2008; Joachims 2006; Keerthi et al. 2008), parallel methods for solving large-scale SVMs (Zanghirati, Zanni 2003), etc.
Considering the many variants of SVM core optimization problems, a systematic survey is needed to help understand and use this family of data mining techniques more easily. The goal of this paper is to closely review SVMs from the optimization point of view. Section 2 of the paper takes standard C-SVM as an example to summarize and explain the nature of SVMs. Section 3 describes SVM optimization models with different variations according to the above three major themes. Several applications of SVMs to financial forecasting, bankruptcy prediction and credit risk analysis are introduced in Section 4. Finally, Section 5 provides remarks and future research directions.

The nature of C-Support vector machines
In this section, standard C-SVM (Deng, Tian 2004, 2009; Deng et al. 2012; Vapnik 1998) for binary classification is briefly summarized and understood from several points of view. Given a training set

T = \{(x_1, y_1), \ldots, (x_l, y_l)\}, \quad x_i \in R^n, \; y_i \in \{-1, 1\}, \; i = 1, \ldots, l,

the goal is to find a real function g(x) on R^n and derive the value of y for any x by the decision function f(x) = \mathrm{sgn}(g(x)). C-SVM formulates the problem as the convex quadratic program

\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i
\text{s.t.} \; y_i((w \cdot x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; i = 1, \ldots, l,

where \xi = (\xi_1, \ldots, \xi_l)^T and C > 0 is a penalty parameter. For this primal problem, C-SVM solves its Lagrangian dual problem

\min_{\alpha} \; \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j) - \sum_{j=1}^{l} \alpha_j
\text{s.t.} \; \sum_{i=1}^{l} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C, \; i = 1, \ldots, l,

where K(x, x') is the kernel function; this is also a convex quadratic program, from whose solution the decision function is constructed.
As is well known, the principle of Structural Risk Minimization (SRM) is embodied in SVMs: the confidence interval and the empirical risk should be considered at the same time. The two terms in the objective function (3) indicate that we not only minimize \|w\|^2 (maximize the margin), but also minimize \sum_{i=1}^{l} \xi_i, which is a measure of the violation of the constraints y_i((w \cdot x_i) + b) \ge 1, i = 1, \ldots, l. Here the parameter C determines the weighting between the two terms: the larger the value of C, the larger the punishment on the empirical risk.
In fact, the parameter C has another meaningful interpretation (Deng et al. 2012). Consider the binary classification problem, select a decision function candidate set F(t) depending on a real parameter t, and suppose the loss function is the soft margin loss function. Structural risk minimization is then implemented by solving a convex program for an appropriate parameter t. An interesting result proves that when the parameters C and t are chosen to satisfy t = \psi(C), where \psi(\cdot) is nondecreasing on the interval (0, +\infty), problem (3)~(5) and problem (11)~(14) yield the same decision function. Hence a very interesting and important meaning of the parameter C emerges: C corresponds to the size of the decision function candidate set in the principle of SRM; the larger the value of C, the larger the decision function candidate set. We can now summarize and understand C-SVM from the following points of view: (i) construct a decision function by selecting a proper size of the decision function candidate set via adjusting the parameter C; (ii) construct a decision function by selecting the weighting between the margin of the decision function and the deviation of the decision function, measured by the soft-margin loss function, via adjusting the parameter C; (iii) another understanding of C-SVM can be found in the literature (Deng et al. 2012): construct a decision function by selecting the weighting between the flatness of the decision function and its deviation, measured by the soft-margin loss function, via adjusting the parameter C.
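To make the role of C concrete, the following minimal sketch minimizes the C-SVM primal objective directly by sub-gradient descent on a made-up toy data set. This is only an illustration of the objective function discussed above, not one of the solvers used in the literature (those solve the dual quadratic program); the data, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

def train_csvm(X, y, C=1.0, lr=0.01, epochs=2000):
    """Sub-gradient descent on the C-SVM primal:
    min_{w,b}  0.5*||w||^2 + C * sum_i max(0, 1 - y_i*((w.x_i) + b))."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1          # points violating the margin
        grad_w = w - C * (X[viol].T @ y[viol])
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy 2-D data: class +1 around (2,2), class -1 around (-2,-2)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.3, (20, 2)), rng.normal(-2, 0.3, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)
w, b = train_csvm(X, y, C=1.0)
pred = np.sign(X @ w + b)
print((pred == y).mean())
```

Raising C here shifts the trade-off toward penalizing margin violations; lowering it favors a flatter (larger-margin) solution, exactly the weighting described in point (ii) above.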

Optimization models of support vector machines
In this section, several representative and important SVM optimization models with different variations are described and analyzed. These models can be divided into three categories: models for standard problems, models for nonstandard learning problems, and models combining SVMs with other issues in machine learning.

Models for standard problems
For standard classification or regression problems, many methods have been developed based on standard SVM models to become powerful new algorithms. Here we briefly introduce several basic and efficient models; many further developments of these models are omitted.

Least squares support vector machine
Just like standard C-SVM, the starting point of least squares SVM (LSSVM) (Johan et al. 2002) is also to find a separating hyperplane, but with a different primal problem. Introducing the transformation x \mapsto \Phi(x) and the corresponding kernel K(x, x') = (\Phi(x) \cdot \Phi(x')), the primal problem becomes the convex quadratic program

\min_{w, b, \eta} \; \frac{1}{2}\|w\|^2 + \frac{C}{2} \sum_{i=1}^{l} \eta_i^2
\text{s.t.} \; y_i((w \cdot \Phi(x_i)) + b) = 1 - \eta_i, \; i = 1, \ldots, l.
The geometric interpretation of the above problem with x \in R^2 is shown in Figure 1, where minimizing \frac{1}{2}\|w\|^2 realizes the maximal margin between the straight lines (w \cdot x) + b = 1 and (w \cdot x) + b = -1. In C-SVM, the error is measured by the soft margin loss function, which leads to the fact that the decision function is decided only by the support vectors. In LSSVM, by contrast, almost all training points contribute to the decision function, so it loses sparseness. However, LSSVM only needs to solve a quadratic program with equality constraints, or equivalently a system of linear equations. Therefore, it is simpler and faster than C-SVM.
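Because the LSSVM optimality conditions reduce to one linear system, a few lines of NumPy suffice to sketch a linear-kernel trainer. The bordered-system layout below follows the standard LSSVM dual formulation; the toy data and the value of C are illustrative assumptions.

```python
import numpy as np

def train_lssvm(X, y, C=10.0):
    """Solve the LSSVM KKT linear system
    [[0, y^T], [y, Omega + I/C]] [b; a] = [0; 1],
    with Omega_ij = y_i*y_j*K(x_i, x_j); here K is the linear kernel."""
    l = len(y)
    K = X @ X.T
    A = np.zeros((l + 1, l + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = np.outer(y, y) * K + np.eye(l) / C
    rhs = np.concatenate([[0.0], np.ones(l)])
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]              # alpha, b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.5, 0.4, (15, 2)), rng.normal(-1.5, 0.4, (15, 2))])
y = np.array([1.0] * 15 + [-1.0] * 15)
alpha, b = train_lssvm(X, y)
g = (alpha * y) @ (X @ X.T) + b         # g(x_i) = sum_j a_j*y_j*K(x_j, x_i) + b
print((np.sign(g) == y).mean())
```

Note that, as discussed above, essentially every alpha_j is nonzero: the equality constraints destroy sparseness, but a single `np.linalg.solve` call replaces the inequality-constrained quadratic program of C-SVM.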

Twin support vector machine
Twin support vector machine (TWSVM) is a binary classifier that performs classification using two nonparallel hyperplanes instead of the single hyperplane used in conventional SVMs (Shao et al. 2011). Suppose the two nonparallel hyperplanes are the positive hyperplane (w_+ \cdot x) + b_+ = 0 and the negative hyperplane (w_- \cdot x) + b_- = 0. The primal problems for finding these two hyperplanes are the two convex quadratic programs (Shao et al. 2011)

\min_{w_+, b_+, \xi} \; \frac{1}{2} \sum_{i=1}^{l_+} ((w_+ \cdot x_i^+) + b_+)^2 + C_1 \sum_{j=1}^{l_-} \xi_j
\text{s.t.} \; -((w_+ \cdot x_j^-) + b_+) \ge 1 - \xi_j, \quad \xi_j \ge 0, \; j = 1, \ldots, l_-,

and

\min_{w_-, b_-, \eta} \; \frac{1}{2} \sum_{j=1}^{l_-} ((w_- \cdot x_j^-) + b_-)^2 + C_2 \sum_{i=1}^{l_+} \eta_i
\text{s.t.} \; (w_- \cdot x_i^+) + b_- \ge 1 - \eta_i, \quad \eta_i \ge 0, \; i = 1, \ldots, l_+,

where x_i^+, i = 1, \ldots, l_+, are the positive inputs and x_j^-, j = 1, \ldots, l_-, are the negative inputs. TWSVM is established by solving the two dual problems of the above primal problems separately. The generalization of TWSVM has been shown to be significantly better than that of standard SVMs for both linear and nonlinear kernels. It has become one of the popular methods in machine learning because of its low computational complexity, since it solves the above two smaller convex quadratic programs; on average, it is about four times faster than the standard SVM.

AUC maximizing support vector machine
Nowadays the area under the receiver operating characteristic (ROC) curve, which corresponds to the Wilcoxon-Mann-Whitney test statistic, is increasingly used as a performance measure for classification systems, especially when one has to deal with imbalanced class priors or misclassification costs. The area under that curve is the probability that a randomly drawn positive example has a higher decision function value than a random negative example; it is called the AUC (area under the ROC curve). When the goal of a learning problem is to find a decision function with a high AUC value, it is natural to use a learning algorithm that directly maximizes this criterion. Over the last years, AUC-maximizing SVMs (AUCSVM) have been developed (Ataman, Street 2005; Brefeld, Scheffer 2005), in which one kind of primal problem to be solved is the convex problem

\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l_+} \sum_{j=1}^{l_-} \xi_{ij}
\text{s.t.} \; (w \cdot (x_i^+ - x_j^-)) \ge 1 - \xi_{ij}, \quad \xi_{ij} \ge 0, \; i = 1, \ldots, l_+, \; j = 1, \ldots, l_-,

where x_i^+, i = 1, \ldots, l_+, and x_j^-, j = 1, \ldots, l_-, are the positive and negative inputs, respectively. Its dual problem is also a convex quadratic programming problem.
However, the existing algorithms all have the serious drawback that the number of constraints is quadratic in the number of training points, so the problems become very large even for small training sets. To cope with this, different strategies can be constructed; in one of them, a Fast and Exact k-Means (FEKM) (Goswami et al. 2004) algorithm is applied to approximate the problem by representing the l_+ l_- many pairs (x_i^+ - x_j^-) by cluster centers, thereby reducing the number of constraints and parameters. The approximate k-Means AUCSVM is more effective at maximizing the AUC than the standard SVM for linear kernels, and its execution time is quadratic in the sample size.
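The pairwise (Wilcoxon-Mann-Whitney) definition of the AUC that AUCSVM maximizes can be computed directly, which also makes the quadratic number of pairs visible; the score values below are made up for illustration.

```python
import numpy as np

def auc_wmw(scores_pos, scores_neg):
    """AUC as the Wilcoxon-Mann-Whitney statistic: the fraction of
    (positive, negative) pairs ranked correctly; ties count one half."""
    diff = scores_pos[:, None] - scores_neg[None, :]   # l_+ x l_- pairs
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

pos = np.array([0.9, 0.8, 0.4])   # decision values of positive examples
neg = np.array([0.7, 0.3, 0.1])   # decision values of negative examples
print(auc_wmw(pos, neg))          # 8 of the 9 pairs are ordered correctly
```

The `diff` matrix has l_+ * l_- entries, one per constraint of the AUCSVM primal above, which is exactly the growth the FEKM clustering approximation is designed to tame.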

Fuzzy support vector machine
In standard SVMs, each sample is treated equally; i.e., each input point is fully assigned to one of the two classes. However, in many applications, some input points, such as outliers, may not be exactly assignable to one of the two classes, and each point does not have the same influence on the decision surface. To solve this problem, each data point in the training set is assigned a membership: if a data point is detected as an outlier, it is assigned a low membership, so its contribution to the total error term decreases. Unlike the equal treatment in standard SVMs, this kind of SVM fuzzifies the penalty term in order to reduce the sensitivity to less important data points. Fuzzy SVM (FSVM) constructs its primal problem as (Lin, Wang 2002)

\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} s_i \xi_i
\text{s.t.} \; y_i((w \cdot x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; i = 1, \ldots, l,

where s_i is the membership generated by some outlier-detecting method. Its dual problem is deduced, similarly to C-SVM, to be a convex quadratic program. Model (32)~(34) is also the general formulation of the cost-sensitive SVM (Akbani et al. 2004) for the imbalanced problem, in which different error costs C_+ and C_- are used for the positive and negative classes.
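The effect of the memberships can be sketched by weighting the slack of each point in a sub-gradient descent on the FSVM primal above. This is only an illustration (the literature solves the dual quadratic program); the data, the planted outlier, and its membership value are all made-up assumptions.

```python
import numpy as np

def train_fsvm(X, y, s, C=1.0, lr=0.01, epochs=2000):
    """Sub-gradient descent on the FSVM primal:
    min 0.5*||w||^2 + C * sum_i s_i * max(0, 1 - y_i*((w.x_i) + b)),
    so a point with low membership s_i contributes little to the error."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1
        coef = (C * s * y)[viol]
        w -= lr * (w - coef @ X[viol])
        b -= lr * (-coef.sum())
    return w, b

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(2, 0.3, (15, 2)), rng.normal(-2, 0.3, (15, 2)),
               [[2.5, 2.5]]])                     # one point deep in "+" region...
y = np.array([1.0] * 15 + [-1.0] * 15 + [-1.0])   # ...but labeled negative
s = np.concatenate([np.ones(30), [0.05]])         # outlier gets tiny membership
w, b = train_fsvm(X, y, s)
pred = np.sign(X[:30] @ w + b)
print((pred == y[:30]).mean())   # clean points remain well classified
```

With s_i = 1 for all i this reduces to the standard C-SVM objective; the low membership on the mislabeled point keeps it from dragging the separating hyperplane toward the positive cluster.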

Models for nonstandard problems
For the nonstandard problems appearing in different practical applications, a wide range of programming methods is used to build novel optimization models. Here we present several important and interesting models to show the interplay of SVMs and optimization.

Support vector ordinal regression
Support vector ordinal regression (SVOR) (Herbrich et al. 1999) is a method to solve a specialization of the multi-class classification problem: ordinal regression problem. The problem of ordinal regression arises in many fields, e.g., information retrieval, econometric models, and classical statistics. It is complementary to the classification problem and metric regression problem due to its discrete and ordered outcome space.

Definition 3.1. (Ordinal regression problem). Given a training set
T = \{(x_i^j, j) : j = 1, \ldots, M, \; i = 1, \ldots, l_j\},

where x_i^j is an input of a training point, the superscript j denotes the corresponding class number, i = 1, \ldots, l_j is the index within each class, and l_j is the number of training points in class j, find a decision function such that the class number for any x can be predicted.

SVOR constructs the primal problem

\min_{w, b, \xi, \xi^*} \; \frac{1}{2}\|w\|^2 + C \sum_{j=1}^{M} \sum_{i=1}^{l_j} (\xi_i^j + \xi_i^{*j})
\text{s.t.} \; (w \cdot x_i^j) - b_j \le -1 + \xi_i^j, \quad \xi_i^j \ge 0, \; i = 1, \ldots, l_j, \; j = 1, \ldots, M,
\quad (w \cdot x_i^j) - b_{j-1} \ge 1 - \xi_i^{*j}, \quad \xi_i^{*j} \ge 0, \; i = 1, \ldots, l_j, \; j = 1, \ldots, M,

with b_0 = -\infty and b_M = +\infty; its dual problem is a convex quadratic program. Though SVOR is a method for a specialization of the multi-class classification problem and has many applications itself (Herbrich et al. 1999), it is also used in the context of solving the general multi-class classification problem (Deng et al. 2012; Yang 2007; Yang et al. 2005), in which SVOR serves as a basic classifier and is used several times instead of only once, just as binary classifiers are used for multi-class classification. There are many choices, since any p-class SVOR with a different order can be a candidate, where p = 2, 3, ..., M; when p = 2, this approach reduces to the approach based on binary classifiers.
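Once the score function and the ordered thresholds b_1 ≤ ... ≤ b_{M-1} are learned, ordinal prediction is just a threshold lookup. A small sketch of that final step, with made-up thresholds and scores:

```python
import numpy as np

def svor_predict(g, thresholds):
    """Ordinal prediction from scores g(x): class j is assigned when
    b_{j-1} < g(x) <= b_j, with b_0 = -inf and b_M = +inf."""
    return np.searchsorted(thresholds, g) + 1   # classes numbered 1..M

b = np.array([-1.0, 1.0])         # two thresholds -> three ordered classes
g = np.array([-2.0, 0.0, 3.0])    # hypothetical values of (w . x)
print(svor_predict(g, b))         # [1 2 3]
```

The thresholds are what distinguishes ordinal regression from plain multi-class classification: a single score axis carries the order of the classes.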

Semi-supervised support vector machine
In practice, labeled instances are often difficult, expensive, or time-consuming to obtain, while unlabeled instances may be relatively easy to collect. Unlike standard SVMs using only labeled training points, many semi-supervised SVMs (S3VM) use a large amount of unlabeled data, together with the labeled data, to build better classifiers. Transductive support vector machine (TSVM) (Joachims 1999b) is such a method: it finds a labeling of the unlabeled data so that a linear boundary has the maximum margin on both the original labeled data and the (now labeled) unlabeled data; the resulting decision function has the smallest generalization error bound on the unlabeled data. However, finding the exact solution to this problem is NP-hard, and major effort has focused on efficient approximation algorithms. SVM-light is the first widely used software (Joachims 1999b).
Among the approximation algorithms, several relax the above TSVM training problem to a semi-definite program (SDP) (Xu, Schuurmans 2005; Zhao et al. 2006, 2007). The basic idea is to work with the binary label matrix of rank 1 and relax it to a positive semi-definite matrix without the rank constraint. However, the computational cost of SDP is still expensive for large-scale problems.
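The flavor of these alternating label-and-retrain schemes can be sketched with a simple self-labeling loop. The base learner below is a toy nearest-centroid classifier standing in for an SVM, and the whole procedure is only an illustration of the idea, not Joachims' SVM-light algorithm or the SDP relaxation; all data are made up.

```python
import numpy as np

def centroid_fit_predict(Xtr, ytr, Xq):
    """Toy base learner (nearest class centroid), standing in for an SVM."""
    cp, cn = Xtr[ytr == 1].mean(axis=0), Xtr[ytr == -1].mean(axis=0)
    dp = np.linalg.norm(Xq - cp, axis=1)
    dn = np.linalg.norm(Xq - cn, axis=1)
    return np.where(dp < dn, 1.0, -1.0)

def self_labeling(Xl, yl, Xu, rounds=5):
    """Alternate between (a) labeling the unlabeled points with the current
    classifier and (b) retraining on all points -- the spirit of TSVM's
    label-switching heuristic."""
    yu = centroid_fit_predict(Xl, yl, Xu)
    for _ in range(rounds):
        yu = centroid_fit_predict(np.vstack([Xl, Xu]),
                                  np.concatenate([yl, yu]), Xu)
    return yu

rng = np.random.default_rng(5)
Xl = np.array([[2.0, 2.0], [-2.0, -2.0]])   # one labeled point per class
yl = np.array([1.0, -1.0])
Xu = np.vstack([rng.normal(2, 0.4, (20, 2)), rng.normal(-2, 0.4, (20, 2))])
yu = self_labeling(Xl, yl, Xu)
true = np.array([1.0] * 20 + [-1.0] * 20)
print((yu == true).mean())
```

With only two labeled points, the unlabeled clusters pull the decision rule into place, which is exactly the benefit semi-supervised SVMs seek at scale.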

Universum support vector machine
Different from semi-supervised SVMs, which leverage unlabeled data from the same distribution, the Universum support vector machine (USVM) uses additional data not belonging to either class of interest. The Universum contains data belonging to the same domain as the problem of interest and is expected to represent meaningful information related to the pattern recognition task at hand. The Universum classification problem can be formulated as follows.

Definition 3.2 (Universum classification problem). Given a training set T = \{(x_1, y_1), \ldots, (x_l, y_l)\}, x_i \in R^n, y_i \in \{-1, 1\}, together with a Universum U = \{x_1^*, \ldots, x_u^*\}, which is a collection of unlabeled inputs known not to belong to either class, find a real function g(x) on R^n such that the value of y for any x can be predicted by the decision function f(x) = \mathrm{sgn}(g(x)).

Universum SVM constructs a primal problem that adds to the standard C-SVM objective a loss term on the Universum points, encouraging them to lie near the separating hyperplane. We can also derive its dual problem and introduce a kernel function for dealing with nonlinear classification.
It is natural to consider the relationship between USVM and some 3-class classification. In fact, it can be shown that, under some assumptions, USVM is equivalent to K-SVCR (Angulo, Català 2000), and also to SVOR with M = 3 after slight modification (Gao 2008). USVM's performance depends on the quality of the Universum; methodology for choosing an appropriate Universum is a subject of future research.

Robust support vector machine
In standard SVMs, the parameters in the optimization problems are implicitly assumed to be known exactly. However, in practice, some uncertainty is often present in real-world problems: the parameters have perturbations, since they are estimated from training data that are usually corrupted by measurement noise. The solutions to the optimization problems are sensitive to such parameter perturbations, so it is useful to explore formulations that yield discriminants robust to measurement errors. For example, when the inputs are subject to measurement errors, it is better to describe the inputs by uncertainty sets X_i \subset R^n, i = 1, \ldots, l, since all we know is that the input belongs to the set X_i. The standard problem then turns into the following robust classification problem.
Given uncertainty sets X_i \subset R^n with labels y_i \in \{-1, 1\}, i = 1, \ldots, l, find a real function g(x) on R^n such that the value of y for any x can be predicted by the decision function f(x) = \mathrm{sgn}(g(x)). The geometric interpretation of the robust problem with circle perturbations is shown in Figure 3, where the circles with "+" and "o" are positive and negative input sets, respectively, and the optimal separating hyperplane (w^* \cdot x) + b^* = 0 obtained by the principle of maximal margin is constructed by robust SVM (RSVM). For the case where the set X_i is a supersphere obtained from perturbation of a point x_i within radius r_i, the primal problem of RSVM is the semi-infinite programming problem

\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i
\text{s.t.} \; y_i((w \cdot x) + b) \ge 1 - \xi_i \;\; \text{for all } x \in X_i, \quad \xi_i \ge 0, \; i = 1, \ldots, l.

Fig. 3. Geometric interpretation of robust classification problem
This semi-infinite programming problem can be proved equivalent to the following second-order cone program (Goldfarb, Iyengar 2003; Yang 2007):

\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i
\text{s.t.} \; y_i((w \cdot x_i) + b) \ge 1 - \xi_i + r_i \|w\|, \quad \xi_i \ge 0, \; i = 1, \ldots, l.

Its dual problem is also a second-order cone program, which can be efficiently solved by Self-Dual-Minimization (SeDuMi). SeDuMi is a tool for solving optimization problems: it handles linear programming, second-order cone programming and semi-definite programming, and is available at http://sedumi.mcmaster.ca.
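The reason the semi-infinite constraint collapses to a single second-order cone constraint is that the worst value of y((w \cdot x) + b) over the ball \|x - x_i\| \le r_i has the closed form y((w \cdot x_i) + b) - r_i \|w\|, attained where the perturbation points against w. A quick Monte-Carlo check of that closed form, with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
w, b = np.array([1.0, -0.5]), 0.2
xbar, r, y = np.array([2.0, 1.0]), 0.3, 1.0

# Closed form: worst case of y*((w.x) + b) over ||x - xbar|| <= r
closed_form = y * (w @ xbar + b) - r * np.linalg.norm(w)

# Monte-Carlo check: the minimum of a linear function over a ball is on
# the boundary, so sample many points on the sphere of radius r.
u = rng.normal(size=(100000, 2))
u = u / np.linalg.norm(u, axis=1, keepdims=True) * r
sampled = (y * ((xbar + u) @ w + b)).min()
print(closed_form, sampled)   # the two values agree closely
```

Substituting this closed form for the "for all x in X_i" constraint is exactly what produces the r_i\|w\| term in the second-order cone program above.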

Knowledge based support vector machine
In many real-world problems, we are given not only the traditional training set but also prior knowledge, such as advised classification rules. If appropriately used, prior knowledge can significantly improve the predictive accuracy of learning algorithms or reduce the amount of training data needed. The problem can be extended in the following way: the single input points in the training set are extended to input sets, called knowledge sets. If the input sets are restricted to polyhedrons, each knowledge set takes the form \{x : A_i x \le c_i\} with a given label, and the primal problem can be constructed as a semi-infinite program requiring the margin constraint to hold at every point of each knowledge set. However, it was shown that the constraints (88)~(90) can be converted into a set of finitely many constraints, and the problem then becomes a quadratic program. This model considered linear knowledge incorporated into linear SVM, while linear knowledge based nonlinear SVM and nonlinear knowledge based SVM were also proposed by Mangasarian and his co-workers (Fung et al. 2003; Mangasarian, Wild 2006). Handling prior knowledge is worthy of further study, especially when training data may not be easily available whereas expert knowledge may be readily available in the form of knowledge sets. Other prior information, such as additional descriptions of the training points, was also considered, and a method called privileged SVM was proposed (Vapnik, Vashist 2009), which allows one to introduce human elements of teaching into the machine learning process: teacher's remarks, explanations, analogy, and so on.

Multi-instance support vector machine
The multi-instance problem was proposed in the application domain of drug activity prediction; similar to both the robust and knowledge-based classification problems, it can be formulated as follows.
Given bags X_i with labels y_i, find a real function g(x) on R^n such that the label y for any instance x can be predicted by the decision function f(x) = \mathrm{sgn}(g(x)). The set X_i is called a bag, containing a number of instances. The interesting point of this problem is that the label of a bag is related to the labels of the instances in the bag and decided in the following way: a bag is positive if and only if at least one instance in the bag is positive, and a bag is negative if and only if all instances in the bag are negative. A geometric interpretation of the multi-instance classification problem is shown in Figure 4, where every enclosure stands for a bag; a bag with "+" is positive and a bag with "o" is negative, and both "+" and "o" stand for instances.
Here r and s are respectively the numbers of instances in all positive bags and all negative bags, and p is the number of positive bags. Though the resulting problem is nonlinear, among its constraints only the first one is nonlinear, and it is in fact bilinear. A local solution is then obtained by solving a succession of fast linear programs in a few iterations: alternately hold one set of the variables constituting the bilinear terms constant while varying the other set. For a nonlinear classifier, a similar statement applies to the higher-dimensional space induced by the kernel.
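The bag-labeling rule itself is easy to state in code; a tiny sketch with made-up instance predictions and bag assignments:

```python
import numpy as np

def bag_labels(instance_preds, bag_ids):
    """Multi-instance label rule: a bag is positive iff at least one of
    its instances is predicted positive; negative iff all are negative."""
    return {b: 1 if (instance_preds[bag_ids == b] == 1).any() else -1
            for b in np.unique(bag_ids)}

preds = np.array([-1, -1, 1, -1, -1, -1])   # hypothetical instance predictions
bags  = np.array([ 0,  0, 0,  1,  1,  1])   # which bag each instance is in
print(bag_labels(preds, bags))              # bag 0 positive, bag 1 negative
```

This asymmetry (an existential condition for positive bags, a universal one for negative bags) is what makes the first constraint of the optimization problem bilinear rather than linear.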

Other SVM issues
This section concerns some important issues of SVMs: feature selection, parameter (model) selection, probabilistic outputs, rule extraction, implements of algorithms and so on, in which the optimization models are also applied.

Feature selection via SVMs
Standard SVMs cannot identify the important features, yet identifying a subset of features which contribute most to classification is an important task in machine learning. The benefit of feature selection is twofold: it leads to parsimonious models that are often preferred in many scientific problems, and it is crucial for achieving good classification accuracy in the presence of redundant features. We can combine SVMs with various feature selection strategies. Some of them are "filters": general feature selection methods independent of SVMs, which select important features first, after which SVMs are applied. Others are wrapper-type methods: modifications of SVMs which choose important features as well as conduct training and testing. In the machine learning literature, there are several proposals for automatic feature selection in SVMs (Guyon et al. 2001; Li et al. 2007; Weston et al. 2001; Zhu et al. 2004; Zou, Yuan 2008) via optimization problems, some of which apply the l_0-norm, l_1-norm or l_∞-norm SVM with competitive performance. Naturally, we expect that using the l_p-norm (0 < p < 1) in SVMs can find sparser solutions than the l_1-norm, with additional algorithmic advantages. Combining C-SVM with this feature selection strategy by introducing the l_p-norm (0 < p < 1), the primal problem of the l_p-support vector machine (l_p-SVM) is (Deng et al. 2012; Tian et al. 2010)

\min_{w, b, \xi} \; \|w\|_p^p + C \sum_{i=1}^{l} \xi_i
\text{s.t.} \; y_i((w \cdot x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; i = 1, \ldots, l,

where p is a nonnegative parameter. For p = 0, \|w\|_0 represents the number of nonzero components of w; for p = 1, the problem becomes a linear program; for p = 2, a convex quadratic program; and for p = ∞, the problem is proved equivalent to a linear programming problem (Zou, Yuan 2008).
However, solving this nonconvex, non-Lipschitz-continuous minimization problem is very difficult. After transforming the problem into an equivalent smooth constrained form and introducing the first-order Taylor expansion as an approximation of the nonlinear objective function, the problem can be solved by a successive linear approximation algorithm (Deng et al. 2012). Furthermore, a lower bound for the absolute value of nonzero entries in every local optimal solution of l_p-SVM has been developed, which reflects the relationship between the sparsity of the solution and the choice of the parameters C and p.

LOO error bounds for SVMs
The success of SVMs depends on the tuning of several parameters which affect the generalization error. An effective approach to choosing parameters which will generalize well is to estimate the generalization error and then search for parameters that minimize this estimator. This requires estimators that are both effective and computationally efficient. The leave-one-out (LOO) method (Vapnik, Chapelle 2000) is the extreme case of cross-validation, and the LOO error provides an almost unbiased estimate of the generalization error. However, one shortcoming of the LOO method is that it is highly time-consuming when the number of training points l is very large, so methods are sought to speed up the process. An effective approach is to approximate the LOO error by an upper bound that is computed by running a concrete classification algorithm only once on the original training set T of size l. This approach has been developed successfully for support vector classification (Gretton et al. 2001; Jaakkola, Haussler 1998; Joachims 2000; Vapnik, Chapelle 2000), support vector regression (Chang, Lin 2005; Tian 2005), and support vector ordinal regression (Yang et al. 2009). We can then search for parameters that minimize this upper bound. Furthermore, inspired by the LOO error bound, approaches were proposed that directly minimize the expression given by the bound in an attempt to minimize the leave-one-out error (Tian 2005; Weston 1999); these approaches are called LOO support vector machines (LOOSVM). LOOSVMs also involve solving convex optimization problems; one such algorithm solves a linear programming problem involving the kernel function. LOOSVMs possess many of the same properties as SVMs. The main novelty of these algorithms is that, apart from the choice of kernel, they are parameterless: the selection of the number of training errors is inherent in the algorithms and not chosen by an extra free parameter as in SVMs.
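The l-fold cost that motivates the bounds above is easy to see in code: the exact LOO estimate retrains l times. The sketch below uses a toy nearest-centroid classifier in place of an SVM purely to keep it fast; the data are made up.

```python
import numpy as np

def loo_error(X, y, train_and_predict):
    """Exact leave-one-out: train l times on l-1 points, test on the
    held-out point; the average error is an almost unbiased estimate
    of the generalization error (hence the search for cheap bounds)."""
    l = len(y)
    errs = 0
    for i in range(l):
        mask = np.arange(l) != i
        errs += train_and_predict(X[mask], y[mask], X[i]) != y[i]
    return errs / l

def nearest_centroid(Xtr, ytr, x):
    """Toy stand-in classifier (not an SVM), used only for speed here."""
    cp, cn = Xtr[ytr == 1].mean(axis=0), Xtr[ytr == -1].mean(axis=0)
    return 1.0 if np.linalg.norm(x - cp) < np.linalg.norm(x - cn) else -1.0

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(1, 0.5, (10, 2)), rng.normal(-1, 0.5, (10, 2))])
y = np.array([1.0] * 10 + [-1.0] * 10)
loo = loo_error(X, y, nearest_centroid)
print(loo)
```

Replacing `train_and_predict` with an SVM trainer makes the l retrainings expensive, which is precisely why the single-run upper bounds surveyed above are valuable for parameter search.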

Probabilistic outputs for support vector machines
For a binary classification problem with the training set (1), standard C-SVM computes a decision function (2) that can be used to predict the label of any test input x. However, we cannot guarantee that this prediction is absolutely correct, so sometimes we wish to know how much confidence we have, i.e., the probability of the input x belonging to the positive class. To answer this question, investigate the information contained in g(x): it is not difficult to imagine that the larger g(x) is, the larger this probability is. So the value of g(x) can be used to estimate the probability P(y = 1 | g(x)) of the input x belonging to the positive class. In fact, we only need to establish an appropriate monotonic function from (-\infty, +\infty), where g(x) takes its values, to the probability interval [0, 1]; the sigmoid function is used (Platt 2000):

p(g) = \frac{1}{1 + \exp(c_1 g + c_2)},

where c_1 < 0 and c_2 are two parameters to be found by maximum likelihood. This is a two-parameter maximization, so it can be performed using any number of optimization algorithms; Figure 5 shows a numerical result of the probabilistic outputs for a linear SVM on some data (Platt 2000). For a better implementation of solving problem (118), an improved algorithm that theoretically converges and avoids numerical difficulties was also proposed (Lin et al. 2007).
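A minimal sketch of the sigmoid fit: plain gradient descent on the negative log-likelihood of Platt's two-parameter model. This is not the numerically careful algorithm of Lin et al. (2007); the decision values, labels, learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

def platt_scale(g, y, lr=0.01, epochs=5000):
    """Fit p(y=1|g) = 1 / (1 + exp(c1*g + c2)) by gradient descent on the
    negative log-likelihood; c1 < 0 makes p increase with g."""
    t = (y + 1) / 2                     # targets in {0, 1}
    c1, c2 = -1.0, 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(c1 * g + c2))
        dz = t - p                      # dNLL/d(c1*g + c2) per point
        c1 -= lr * np.sum(dz * g)
        c2 -= lr * np.sum(dz)
    return c1, c2

# made-up decision values g(x_i) and labels
g = np.array([2.0, 1.5, 0.5, -0.4, -1.2, -2.5])
y = np.array([1, 1, 1, -1, -1, -1])
c1, c2 = platt_scale(g, y)
probs = 1.0 / (1.0 + np.exp(c1 * g + c2))
print(np.round(probs, 3))   # entries decrease along the array as g decreases
```

Because the mapping is monotone, the probabilistic output preserves the ranking induced by g(x); it only calibrates the decision values onto [0, 1].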

Rule extraction from support vector machines
Though SVMs are state-of-the-art tools in data mining, their strength is also their main weakness, as the generated nonlinear models are typically regarded as incomprehensible black-box models. Therefore, opening the black box, or making SVMs explainable, i.e., extracting rules from SVM models to mimic their behavior and make them comprehensible, has become important and necessary in areas such as medical diagnosis and credit evaluation (Martens et al. 2008).
There are several techniques to extract rules from SVMs, and one way of classifying these rule extraction techniques is in terms of "translucency": the view the rule extraction method takes of the underlying classifier. The two main categories of rule extraction methods are known as pedagogical (Setiono et al. 2006) and decompositional (Fung et al. 2005; Núñez et al. 2002). Pedagogical algorithms consider the trained model as a black box and directly extract rules which relate the inputs and outputs of the SVM. The decompositional approach, on the other hand, is closely related to the internal workings of the SVM and its constructed hyperplane. Fung et al. (2005) present an algorithm to extract propositional classification rules from linear SVMs. The method is considered decompositional because it is only applicable when the underlying model provides a linear decision boundary. The resulting rules are axis-parallel and non-overlapping, but only (asymptotically) exhaustive. The algorithm is iterative and extracts the rules by solving a constrained optimization problem that is computationally inexpensive. Figure 6 shows an execution of the algorithm for binary classification in which only rules for the black squares are extracted (Fung et al. 2005). Different optimal rules are extracted according to different criteria; one such criterion, maximizing the log of the volume of the region that the rule encloses, leads to solving a corresponding optimization problem. However, existing rule extraction algorithms have limitations in real applications, especially when the problems are large scale with high dimensions. The incorporation of feature selection into the rule extraction problem is therefore also a possibility to be explored, and some papers already consider this topic (Yang, Tian 2011).

Applications in economics
SVMs have been successfully applied in many fields, including economics, finance and management. Several applications of SVMs to financial forecasting problems have been reported (Tay, Cao 2002; Cao, Tay 2003; Kim 2003). Tay and Cao (2002) proposed C-ascending SVMs, in which the value of the parameter C increases over the training sequence; the idea is based on the assumption that it is better to give more weight to recent data than to distant data. Their results showed that C-ascending SVMs give better performance than standard SVMs in financial time series forecasting. Cao and Tay (2003) also compared SVMs with the multilayer backpropagation (BP) neural network and the regularized radial basis function (RBF) neural network; simulation results showed that SVMs with adaptive parameters outperform both.
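The C-ascending idea can be sketched as a per-sample regularization schedule that grows toward the most recent observations, so that recent training errors are penalized more heavily. The logistic ramp and the parameter a below are one plausible choice for illustration, not necessarily the exact form used by Tay and Cao (2002).

```python
import math

def ascending_C(n_samples, C=10.0, a=3.0):
    """Per-sample C values increasing from old to recent data.
    The logistic-ramp form and parameter a are illustrative assumptions;
    values rise monotonically and stay below 2*C."""
    return [C * 2.0 / (1.0 + math.exp(a - 2.0 * a * (i + 1) / n_samples))
            for i in range(n_samples)]

weights = ascending_C(100)  # near 0 for the oldest sample, near 2*C for the newest
```

In practice such per-sample weights would be passed to an SVM solver that supports individual error penalties (many implementations expose this as a sample-weight argument).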
Bankruptcy prediction is an important and widely studied topic, since it can have a significant impact on bank lending decisions and profitability, and SVMs have been successfully adopted for this problem in recent years (Fan, Palaniswami 2000; Huang et al. 2004; Min, Lee 2005; Min et al. 2006; Shin et al. 2005). Results on different real-world data sets demonstrated that SVMs outperform BP networks in accuracy and generalization performance. The variability in performance with respect to various parameter values of the SVMs was also investigated.
Due to recent financial crises and regulatory concerns, credit risk assessment is an area that has seen a resurgence of interest from both the academic world and the business community. Since credit risk analysis, or credit scoring, is in fact a classification problem, many classification techniques have been applied to this field, and SVMs are a natural, competitive candidate (Stoenescu Cimpoeru 2011; Shi et al. 2005; Thomas et al. 2005; Van Gestel et al. 2003; Yu et al. 2009; Zhou et al. 2009). Additionally, combining genetic algorithms with SVMs, the hybrid GA-SVM can perform feature selection and model parameter optimization simultaneously (Huang et al. 2007). Because in credit scoring one usually cannot label a customer as absolutely good or bad, a fuzzy support vector machine different from model (32)~(34) was proposed that treats every input as belonging to both the positive and the negative class, but with different memberships (Wang et al. 2005), where m_i is the membership of the ith input in class y_i.
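The bilateral-membership idea can be sketched as follows: each applicant receives a membership in both the "good" and the "bad" class rather than a hard label. Deriving the memberships from distances to the two class centers, as below, is a simple illustrative choice; the membership function actually used by Wang et al. (2005) may differ.

```python
def bilateral_memberships(x, good_center, bad_center):
    """Assign one applicant a membership in both classes from its
    Euclidean distances to the class centers (closer center -> larger
    membership; the two memberships sum to 1). Distance-based
    assignment is an illustrative assumption."""
    d_good = sum((a - b) ** 2 for a, b in zip(x, good_center)) ** 0.5
    d_bad = sum((a - b) ** 2 for a, b in zip(x, bad_center)) ** 0.5
    if d_good + d_bad == 0.0:
        return 0.5, 0.5  # degenerate case: coincident centers
    m_good = d_bad / (d_good + d_bad)
    return m_good, 1.0 - m_good

# An applicant closer to the hypothetical "good" center gets m_good = 0.8
m_good, m_bad = bilateral_memberships([0.2, 0.8], [0.0, 1.0], [1.0, 0.0])
```

Each training point then enters the fuzzy SVM twice, once per class, with its slack variable weighted by the corresponding membership.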
Other applications in economics, including motor insurance fraud management (Furlan et al. 2011), environmental risk assessment (Kochanek, Tynan 2010) and e-banking website quality assessment (Kaya, Kahraman 2011), can also be explored with SVMs.

Remarks and future directions
This paper has offered an extensive review of optimization models of SVMs, including least squares SVM, twin SVM, AUC-maximizing SVM, and fuzzy SVM for standard problems; support vector ordinal machine, semi-supervised SVM, Universum SVM, robust SVM, knowledge-based SVM, and multi-instance SVM for nonstandard problems; as well as l_p-norm SVM for feature selection, LOOSVM based on minimizing the LOO error bound, probabilistic outputs for SVM, and rule extraction from SVM. These models have already been used in many real-life applications, such as text categorization, bio-informatics, bankruptcy prediction, remote sensing image analysis, network intrusion detection, information security, and credit assessment management. Some applications to financial forecasting, bankruptcy prediction and credit risk analysis are also reviewed in this paper. Researchers and engineers in data mining, especially in SVMs, can benefit from this survey through a better understanding of the relation between SVMs and optimization. In addition, it can also serve as a reference repertory of such approaches.
Research in SVMs and research in optimization have become increasingly coupled. In this paper we have seen linear, nonlinear, second order cone, semi-definite, integer or discrete, and semi-infinite programming models in use. Of course, many optimization models of SVMs are not discussed here, and new practical problems remaining to be explored present new challenges for constructing new SVM optimization models. These models should have the same desirable properties as the models in this paper: good generalization, scalability, simple and easily implemented algorithms, robustness, as well as theoretically established convergence and complexity.