Calibration of Lévy Processes using Optimal Control of Kolmogorov Equations with Periodic Boundary Conditions

We present an optimal control approach to the problem of model calibration for Lévy processes based on a nonparametric estimation procedure. The calibration problem is of considerable interest in mathematical finance and beyond. Calibration of Lévy processes is particularly challenging, as the jump distribution is given by an arbitrary Lévy measure, and these measures form an infinite-dimensional space. In this work, we follow an approach related to the maximum likelihood theory of sieves. The sampling of the Lévy process is modelled as independent observations of the stochastic process at some terminal time $T$. We use a generic spline discretization of the Lévy jump measure and select an adequate size of the spline basis using the Akaike Information Criterion (AIC). The numerical solution of the Lévy calibration problem requires efficient optimization of the log-likelihood functional in high-dimensional parameter spaces. We achieve this by optimal control of Kolmogorov's forward equation for the probability density function (the Fokker-Planck equation). The first-order optimality conditions are derived with the Lagrange multiplier technique in a functional space. The resulting partial integro-differential equations (PIDEs) are discretized, numerically solved and controlled using a scheme composed of Chang-Cooper, BDF2 and direct quadrature methods. For the numerical solver of Kolmogorov's forward equation we prove conditions for non-negativity and stability in the $L^1$ norm of the discrete solution. To set boundary conditions, we argue that any Lévy process on the real line can be projected to a torus, where it is again a Lévy process. If the torus is sufficiently large, the loss of information is negligible.


Introduction
Lévy processes play a large rôle in contemporary mathematical finance [15], but also in many areas of physics, see e.g. [2,26]. A real-valued Lévy process is a stochastic process Y(t) with increments Y(t) − Y(s), t ≥ s, that are independent of the past. The increments are also stationary in the sense that the probability distribution of an increment depends only on the time difference t − s. Furthermore, Y(0) = 0 and a stochastic continuity condition for t = 0 holds, see e.g. [2]. Under the given conditions, the characteristic function of Y(t) is given by the Lévy-Khinchine representation E[e^{ikY(t)}] = e^{tψ(k)} (1)
In Eq. (2), 1_{{|s|≤1}}(s) is the characteristic function of the set {|s| ≤ 1}, which takes the value 1 on this set and 0 otherwise. The calibration problem for Lévy processes consists of the estimation of the canonical triplet (b, σ², ν) given observations Y(t_j) of the process's trajectory Y(t) at some prescribed times t_j, j = 1, …, L. For instance, Y(t) could be the process of log-returns of some asset and t_j could be the closing time of the j-th trading day ('historic low-frequency data'). As the j-th increment Y_j = Y(t_j) − Y(t_{j−1}) has the same distribution as Y(T) if T = t_j − t_{j−1}, from a statistical point of view this is equivalent to the L-fold independent observation of the terminal value Y(T) at time T.
Y(t) can also be understood as the solution to a stochastic differential equation. Here B(t) is a standard Brownian motion and N((t, t+∆t], A) ∼ Po(ν(A)∆t) is the random counting measure of jumps of height in the set A ⊆ R in the time interval (t, t+∆t]. Po(λ) stands for the Poisson distribution with intensity λ, and Ñ((t, t+∆t], A) = N((t, t+∆t], A) − ν(A)∆t is the compensated, or martingale, jump measure for small jumps, where we require A ⊂ R \ {|x| ≤ ε} for some ε > 0, see [2] for further details. The calibration problem for Lévy processes, i.e. the solution of (4), is unfortunately ill-posed: the collection of all Lévy measures ν is infinite-dimensional, while only L observations are available. Direct application of the maximum likelihood principle in this situation leads to severe over-fitting [21]. In many applications, one chooses families of Lévy measures ν(ᾱ) that depend only on a finite-dimensional parameter vector ᾱ, see e.g. [24]. Furthermore, one often restricts attention to parametrizations where the density f(x, T, ᾱ) of the probability distribution of Y(t) can be calculated explicitly or at low numerical cost. One then assumes that the true distribution of Y(t) lies inside the prescribed set and uses the maximum likelihood approach for calibration [19]. This assumption might, however, not be justified and can give rise to modelling errors.
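As an illustration of these dynamics, the following sketch (function and parameter names are ours, not from the paper) simulates a path of a jump-diffusion with finite jump intensity by an Euler-Maruyama step for the drift and diffusion parts plus a Poisson number of jumps per step; the compensator for small jumps is not needed here since the intensity is finite.

```python
import numpy as np

def simulate_jump_diffusion(b, sigma, lam, jump_sampler, T, n_steps, rng):
    """Euler-Maruyama path of dY = b dt + sigma dB(t) + jump increments,
    assuming a finite jump intensity lam (compound Poisson part)."""
    dt = T / n_steps
    y = np.zeros(n_steps + 1)
    for m in range(n_steps):
        dB = rng.normal(0.0, np.sqrt(dt))       # Brownian increment
        n_jumps = rng.poisson(lam * dt)         # number of jumps in (t, t + dt]
        dJ = jump_sampler(rng, n_jumps).sum() if n_jumps > 0 else 0.0
        y[m + 1] = y[m] + b * dt + sigma * dB + dJ
    return y

rng = np.random.default_rng(0)
path = simulate_jump_diffusion(
    b=0.05, sigma=0.2, lam=3.0,
    jump_sampler=lambda rng, n: rng.normal(0.0, 0.1, size=n),
    T=1.0, n_steps=1000, rng=rng)
```

Simulated paths of this kind are what the calibration procedure later treats as observed data.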
As a non-parametric alternative, one can use generic parametrizations for the density of the Lévy measure ν that can be refined depending on the amount of data available. This gives rise to a hierarchy, or sieve [21], of maximum likelihood problems with a finite number of parameters. Once a suitable finite parametrization has been chosen, it remains to solve the maximum likelihood problem at the given level of parametrization. One also has to determine this level based on the quality and stability of the fits obtained. The resulting densities can no longer be calculated analytically. Also, the solution of the maximum likelihood problem gives rise to high-dimensional optimization problems.
The maximum likelihood method requires a parametric representation of the probability density function (PDF). The PDF can be obtained as the solution to the Kolmogorov forward equation (Fokker-Planck equation). The parameters ᾱ then enter this equation via the coefficients in the generator of the semigroup [2]. If the Lévy measure ν is not zero, the generator does not only contain a second-order partial differential operator, but also an integral operator of convolution type. This places the model calibration problem in the framework of optimal control problems with partial integro-differential equation (PIDE) constraints.
Indeed, the Kolmogorov forward equation represents a stochastic process described in terms of SDEs such as that of Eq. (4). The parametrization used to approximate the PDF corresponds to a set of controls of the stochastic dynamics, so that the maximum likelihood problem corresponds to a stochastic optimal control problem. The classical way to deal with the optimal control of stochastic processes is the dynamic programming principle and the related Hamilton-Jacobi-Bellman equation [9]. However, this problem has recently been framed as a constrained PDE optimization problem, where the PDE is the Fokker-Planck, i.e. Kolmogorov forward, equation [3,4,5]. Following this framework, the solution of the maximum likelihood problem, i.e. the stochastic optimization, is found by solving the first-order optimality conditions in a functional space: the optimality system consists of two PIDEs, the forward and backward (or adjoint) equations, plus an optimality condition.
This optimality system can be numerically solved by a gradient-based iterative algorithm as follows. The Kolmogorov forward equation carries a set of control parameters that are tuned to maximize the log-likelihood functional of its terminal value. The gradient with respect to these controls involves the Kolmogorov backward (adjoint) equation with a suitable terminal condition derived from the log-likelihood functional. Hence, given an initial approximation of the unknown parametrization, one first solves the forward equation, then sets up the terminal condition and solves the adjoint one. With both the forward and adjoint solutions, the gradient is computed from the optimality condition. A gradient ascent technique, such as a nonlinear conjugate gradient method, then yields a better approximation of the control parameters, and the procedure is repeated until the parametrization reaches satisfactory accuracy.
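The loop just described can be sketched as follows. This is a minimal projected gradient-ascent version with placeholder solver callbacks (plain ascent instead of nonlinear conjugate gradients); all names are ours and the callbacks stand in for the actual PIDE solvers.

```python
import numpy as np

def calibrate(alpha0, solve_forward, solve_adjoint, gradient,
              lr=0.1, max_iter=500, tol=1e-8):
    """Projected gradient-ascent sketch of the forward/adjoint loop.

    solve_forward(alpha)  -> f   forward PIDE solution
    solve_adjoint(f)      -> p   adjoint solution (backward in time)
    gradient(f, p, alpha) -> dJ/dalpha from the optimality condition
    """
    alpha = np.asarray(alpha0, dtype=float)
    for _ in range(max_iter):
        f = solve_forward(alpha)           # forward equation
        p = solve_adjoint(f)               # adjoint with terminal condition
        g = gradient(f, p, alpha)          # gradient of the log-likelihood
        # KKT handling: drop ascent components violating alpha_j >= 0
        g = np.where((alpha <= 0.0) & (g < 0.0), 0.0, g)
        alpha_new = np.maximum(alpha + lr * g, 0.0)
        if np.linalg.norm(alpha_new - alpha) < tol:
            return alpha_new
        alpha = alpha_new
    return alpha

# toy check with trivial 'solvers': maximize -0.5 * ||alpha - target||^2
target = np.array([0.3, 0.0, 0.7])
alpha_hat = calibrate(np.ones(3),
                      solve_forward=lambda a: a,
                      solve_adjoint=lambda f: f,
                      gradient=lambda f, p, a: target - a)
```

In the actual method the callbacks are replaced by the discretized forward and adjoint PIDE solvers of Section 4.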
Since this maximum likelihood problem can involve a high-dimensional parameter space and a large number L of observations, a fast, stable and sufficiently accurate numerical solver for the PIDE is required. In our case the Kolmogorov forward equation is a PIDE of parabolic type. Such PIDEs are, e.g., investigated in option pricing models as a generalization of the Black-Scholes equation. The first difficulty in solving this equation numerically is the integral term: a fully implicit method would require solving a dense system of equations, so implicit-explicit (IMEX) or operator splitting methods are applied to bypass this problem (see [18,10,16]). The solution of the Kolmogorov forward equation is a probability density function, i.e. non-negative with constant integral over the domain. The discrete solution must possess these properties too. The Chang-Cooper (CC) scheme is a non-negativity-preserving and conservative numerical method that has been used to solve the classical Fokker-Planck equation [14,4,11]. Here, we use a numerical method that can be classified as IMEX: we apply the CC method with an implicit time-differencing scheme to the differential operators of our PIDE and evaluate the integral operator at the solution from the previous time step, i.e. explicitly. We prove for the resulting numerical solver: conservativeness, non-negativity and stability in the L¹-norm. The numerical solver for the adjoint equation is obtained directly from the solver for the forward equation via the "discretize then optimize" approach to the optimization problem.
Finally, we note that related work with vanishing Lévy measure can be found in, e.g., [4,12], and estimation procedures based on non-parametric approximations of the empirical characteristic function in, e.g., [7]. An approach based on the method of moments and asymptotic expansions of Lévy densities can be found in [22].
The article is organised as follows: In the following Section 2 we describe the hierarchy of estimation problems. We show that the estimation problems that can actually be solved numerically come arbitrarily close, at increasing computational cost, to the fully general Lévy estimation problem. We also show that the use of periodic boundary conditions in the Kolmogorov equations can be understood in terms of mapping the original Lévy process on the real line to a derived Lévy process on the torus.
In Section 3, we set up the maximum likelihood estimation problem for a given parametrization and derive Kolmogorov's forward (Fokker-Planck) equation and its adjoint (Kolmogorov backward) equation, with terminal conditions set by the log-likelihood objective functional. This maximum likelihood estimation problem is solved in the framework of Fokker-Planck optimal control of stochastic processes, as a constrained PDE optimal control problem.
In Section 4 the discretization of Kolmogorov's equations and the optimal control scheme is derived following a Chang-Cooper and IMEX approach. In particular, we prove the structural properties of the numerical solution, i.e. conservativeness, non-negativity and stability.
Section 5 gives numerical tests for the consistency of the proposed procedure based on simulated data. We propose to use Akaike's information criterion (AIC) [13] to choose an adequate parametrization from the hierarchy of spline parametrizations for the density of the Lévy measure. Three different tests are performed. We first fit data simulated from a given distribution within our hierarchy of Lévy distributions. The fits obtained are of very good quality, and the AIC selection criterion almost reproduces the original parametrization. As a second test, we fit simulated data from a bi-directional Gamma process, i.e. the difference of two independent Gamma processes [2], which does not lie inside any of the parametrizations of degree N_Θ but can be approximated by them. The bi-directional Gamma process is augmented by a small diffusive component and projected to the torus. The AIC selection criterion and the fitting results again reproduce the final distribution of this process rather adequately. As a final test, we select financial data from the German stock exchange DAX in the period between April 1998 and March 2002 and consider daily log-returns over 1000 trading days. This represents a rather stable period for the DAX with an almost constant level of volatility. After projection, the AIC-based method selects a six-parameter spline approximation of the Lévy measure density. The resulting fits again give a decent reproduction of the empirical distribution.
Our conclusions and an outlook are given in the final Section 6.

A Hierarchy of Parametrizations for Maximum Likelihood Estimation of Lévy Data
The estimation problem for the Lévy measure ν is plagued by several issues. Here we take a step-by-step approach towards the derivation of a hierarchy of estimation problems that approximate the original one. Let f(x|ᾱ) = f(x|ᾱ, N_Θ), ᾱ ∈ R^{N_Θ}, be a family of probability density functions with variable dimension N_Θ of the parameter set ᾱ, and let f(x) be the (unknown) probability density of Y(t).
For N_Θ fixed, we apply the maximum likelihood method to select an estimated value α̂ based on increasing sample size. It is known from the general theory of maximum likelihood [19] that the estimate α̂ converges almost surely to the true value ᾱ, provided suitable regularity and identifiability conditions hold. This leaves open the question which parametrization, or which value of N_Θ, one should choose. We solve this problem by maximizing the Akaike Information Criterion (AIC). Maximizing the AIC corresponds to minimizing (asymptotically) the expected Kullback-Leibler distance, or relative entropy, between the true distribution f(x) and its parametric estimate f(x|ᾱ, N_Θ). See [13, Chapter 6] for a detailed derivation.
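As a concrete sketch of this selection step, the snippet below uses a "larger is better" convention (log-likelihood penalized by the parameter count, to be maximized, matching the text's sign convention; it is equivalent up to sign and a factor 2 to minimizing the usual AIC = 2k − 2 log L̂). The helper names and the toy numbers are ours.

```python
import numpy as np

def aic_score(max_log_likelihood, n_params):
    """'Larger is better' AIC score: penalized log-likelihood.
    Equivalent (up to sign and factor 2) to the usual AIC = 2k - 2 log L."""
    return max_log_likelihood - n_params

def select_n_theta(fits):
    """fits: list of (n_params, max_log_likelihood) pairs over the hierarchy.
    Returns the pair with the best AIC score."""
    scores = [aic_score(ll, k) for k, ll in fits]
    return fits[int(np.argmax(scores))]

# toy hierarchy: the likelihood gain saturates around 4 parameters
best = select_n_theta([(2, -520.0), (4, -505.0), (6, -503.5), (8, -503.0)])
```

Beyond four parameters the likelihood gain no longer offsets the parameter penalty, so the four-parameter model is selected.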
Let us now define the parametrizations with N_Θ parameters. We intend to show that, for sufficiently large N_Θ, we can approximate the original Lévy distribution to arbitrary precision. This requires several steps of approximation. Truncating small and large jumps. The total mass of the Lévy measure, λ = ν(R \ {0}), can be infinite. This quantity defines the average number of jumps per unit time of the associated Lévy process [2]. The easiest way to deal with this is to truncate small jumps by setting ν_ε = 1_{{|s|>ε}} ν, which is a finite measure by equation (3). Using (3) and dominated convergence, one can moreover prove that ψ_ε(k) → ψ(k) for k ∈ R as ε ↓ 0. Here ψ_ε(k) is given by (2) with ν replaced by ν_ε. By the continuity theorem of Paul Lévy, see e.g. [6, Theorem 23.8], this implies (weak) convergence in law of the respective probability distributions.
At the same time, a finite Lévy measure ν permits one to re-parametrize ψ(k) of Eq.
In the following we assume ν to be finite and use parametrization (5). The last term in (5) now has the structure of a compound Poisson distribution, i.e. Y(t) can be represented as Y(t) = Z(t) + Σ_{j=1}^{N(t)} U_j, where N(t) is Poisson distributed with intensity λt and the U_j are i.i.d. random variables with distribution given by the normalized Lévy measure, U_j ∼ λ⁻¹ν. Furthermore Z(t), N(t) and the U_j are all stochastically independent.
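Sampling the terminal value from this compound Poisson representation is straightforward; a sketch under our own naming, with Z(T) Gaussian with mean bT and variance σ²T:

```python
import numpy as np

def sample_terminal(T, b, sigma, lam, sample_u, size, rng):
    """Draw Y(T) = Z(T) + sum_{j=1}^{N(T)} U_j for a finite Lévy measure:
    Z(T) ~ Normal(b*T, sigma^2*T), N(T) ~ Poisson(lam*T), U_j i.i.d. ~ nu/lam."""
    z = rng.normal(b * T, sigma * np.sqrt(T), size=size)
    n = rng.poisson(lam * T, size=size)
    jumps = np.array([sample_u(rng, k).sum() if k > 0 else 0.0 for k in n])
    return z + jumps

rng = np.random.default_rng(1)
# jumps U_j ~ Normal(0, 0.2^2), so Var(Y(1)) = 0.1^2 + 5 * 0.2^2 = 0.21
y = sample_terminal(T=1.0, b=0.0, sigma=0.1, lam=5.0,
                    sample_u=lambda rng, k: rng.normal(0.0, 0.2, size=k),
                    size=20000, rng=rng)
```

Samples generated this way serve as synthetic calibration data in the numerical tests of Section 5.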
With a similar argument, we can cut off large jumps by replacing ν with 1_{{|s|≤ε⁻¹}} ν. Also in this case, in the limit ε ↓ 0, the truncated Lévy distributions converge in law to the non-truncated one. In the following we thus assume that the support of ν is contained in some finite interval [Ω_a, Ω_b]. The appropriate size of this region can be estimated, e.g., by Chebyshev's inequality using the empirical mean and variance of the data.
Spline approximation of the densities. Let thus dν(s) = ρ(s)ds with ρ(s) non-negative, continuously differentiable and of compact support. Let θ_j, j = 1, …, n+1, be a collection of uniform grid points in R such that the support of ρ is covered and θ_j − θ_{j−1} = ∆. Define ρ_∆(s) by linear interpolation between the points (θ_j, ρ(θ_j)). Again, one easily sees that ρ_∆(s) converges to ρ(s) as ∆ ↓ 0. Also, for small ∆, the functions ρ_∆(s) have support in a fixed compact interval and are uniformly bounded by the maximum value of ρ(s). If we insert the measures dν_∆(s) = ρ_∆(s)ds into (5), this expression converges to the related one with dν(s) = ρ(s)ds. This suffices to prove that approximation of the probability distributions (in law) is feasible with first-order spline densities.
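A minimal sketch of this piecewise-linear construction with triangular ("hat") basis functions; helper names and the toy density are ours:

```python
import numpy as np

def hat_basis(s, theta, j):
    """Triangular (hat) basis function centered at grid point theta[j],
    assuming a uniform grid with spacing theta[1] - theta[0]."""
    d = theta[1] - theta[0]
    return np.clip(1.0 - np.abs(s - theta[j]) / d, 0.0, None)

def spline_density(s, theta, alpha):
    """Piecewise-linear density rho_Delta(s) = sum_j alpha_j * hat_j(s)."""
    return sum(a * hat_basis(s, theta, j) for j, a in enumerate(alpha))

# linear interpolation reproduces the nodal values exactly
theta = np.linspace(-1.0, 1.0, 9)
rho = np.exp(-np.abs(theta)) * (np.abs(theta) > 0.25)  # toy jump density
vals = spline_density(theta, theta, rho)
```

Since each hat function equals 1 at its own node and 0 at all others, choosing alpha_j = rho(theta_j) reproduces the nodal values exactly, which is the interpolation property used above.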
Fixing drift and diffusion terms. One might or might not like to include the drift and diffusion terms determined by b and σ² in the estimation problem. Although, in general, these quantities have to be estimated, here we keep a small fixed value for σ² for reasons of numerical stability of the Kolmogorov equations. As long as this value underestimates the true diffusion, this corresponds to a splitting of the Lévy process Y(t) = Z(t) + L(t) into stochastically independent components, where Z(t) is determined by the purely Gaussian Lévy triplet (b, σ², 0). L(t) is then a Lévy process that contains drift, the excess diffusion and jumps. However, the distribution of L(t) at time T can be approximated, in the sense of convergence in law, by compound Poisson distributions without drift and diffusion parts. The explicit construction can e.g. be found in [8]. We can thus approximate our estimation problem to arbitrary accuracy with a problem where drift and diffusion take fixed values. We also note that a non-vanishing diffusion implies the existence of a probability density function f(x, t) for the distribution of Y(t).
Periodic boundary conditions from the projection to a torus. Another issue with the Kolmogorov forward equation is the choice of boundary conditions. We have already shown that we can approximate the estimation problem by one where the Lévy measure ν has support inside a large interval [Ω_a, Ω_b]. Let Ω be the torus [Ω_a, Ω_b] with the end points of the interval identified. Let K = Ω_b − Ω_a and define the operation +_K as x +_K y := (x + y) mod K, a group operation on Ω, with (·) mod (·) the modulus operation. Let furthermore φ : (R, +) → (Ω, +_K) be the group homomorphism defined by φ(x) = x mod K, and let X(t) = φ(Y(t)) be the induced stochastic process on Ω. If Y(t) is a Lévy process on the group (R, +), the same applies to X(t) with respect to the group (Ω, +_K). Note that only the group structure of (R, +) enters the definition of Lévy processes; Lévy processes are naturally defined on locally compact Abelian groups like (R, +) or (Ω, +_K), see [8]. Let us now consider the characteristic function of the Ω-valued process X(t) at time t. By the periodicity of Ω, only k-values from (2π/K)Z are needed. To derive a Lévy-Khinchine formula (2) for X(t) on Ω from that of Y(t) on R, we consider the characteristic function for such values of k. Inserting (5), we obtain the corresponding representation with φ_*ν, the image measure of ν under φ. Note that, under the hypothesis that ν has support in [Ω_a, Ω_b], ν can be reconstructed from φ_*ν as ν = φ⁻¹_* φ_*ν, where φ⁻¹ : Ω → R is the natural embedding. Using this, we identify ν and φ_*ν in the following.
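The projection φ is easy to sketch numerically. Here we shift the standard modulus so that results land in the window [Ω_a, Ω_b); this representative choice is ours (the text's φ maps to [0, K)), but the two differ only by the labelling of the torus points.

```python
import numpy as np

def project_to_torus(y, omega_a, omega_b):
    """phi: R -> torus, here represented by the window [omega_a, omega_b).
    Computes x mod K and shifts the representative into the window."""
    K = omega_b - omega_a
    return omega_a + np.mod(y - omega_a, K)

y = np.array([-7.3, -0.2, 0.0, 3.9, 12.5])
x = project_to_torus(y, omega_a=-5.0, omega_b=5.0)
```

Points already inside the window are unchanged, while -7.3 wraps to 2.7 and 12.5 wraps to 2.5; a histogram of the projected samples is what the periodic forward equation models.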
Summing up, we consider the maximum likelihood parameter estimation problem with fixed b and σ², a piecewise linear density function with N_Θ grid points for the finite Lévy measure, and periodic boundary conditions on Ω. By refining the grid for the linear interpolation, enlarging the size of the torus and letting the fixed diffusion go to zero, we can approximate the distribution of any Lévy process with one of our candidate processes in the topology of weak convergence in law. This constitutes the hierarchy of maximum likelihood estimation procedures.

Kolmogorov Equations and Optimality for the Log-Likelihood
In this section we formulate the optimization problem for the maximum likelihood parameter estimation. The log-likelihood is the objective functional to be maximized, with the Kolmogorov forward PIDE as a constraint. The optimality system is derived using the Lagrange multiplier method in a functional space, including the Karush-Kuhn-Tucker conditions for the non-negativity of the optimization parameters.
Objective functional and forward equation. Let L independent sample values X_1, …, X_L be given, with X_l ∈ Ω, l = 1, …, L, where Ω = [Ω_a, Ω_b]. These values can, e.g., be obtained as X_l = φ(Y_l), where φ is the group homomorphism defined in Eq. (6) and the Y_l are observations of the Lévy process on R. We deal with the problem of finding the PDF of X(T) that best fits the sample values. For this purpose we consider the maximum likelihood problem in the framework of PIDE-constrained optimization: we have to find the maximum likelihood estimator with respect to the parametrization of the measure given by ᾱ, where the objective is the (normalized) log-likelihood and the constraint is the following Kolmogorov forward (Fokker-Planck) equation for the Lévy process X(t) on the torus Ω with Lévy data (b, σ², ν_ᾱ), using the parametrization (5), where f(x, t) represents the PDF of the process at time t. This PIDE is posed on the time interval t ∈ [0, T], with periodic boundary conditions at Ω_a = Ω_b. Here the Θ_j(s) are triangular-shaped basis functions for the set of continuous functions that are linear on (θ_{j−1}, θ_j), see the preceding section. The existence and uniqueness of the solution of the Fokker-Planck equation (11) is well established, also for initial conditions belonging to the class of measures [8].
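Given the forward solution on a spatial grid, the normalized log-likelihood (1/L) Σ_l log f(X_l, T) can be evaluated by interpolating the terminal density at the sample points. A sketch under our own naming, using a Gaussian as a stand-in terminal PDF:

```python
import numpy as np

def log_likelihood(samples, x_grid, f_grid):
    """Normalized log-likelihood (1/L) * sum_l log f(X_l, T), with the
    terminal PDF given on a grid and evaluated by linear interpolation."""
    f_vals = np.interp(samples, x_grid, f_grid)
    # floor avoids -inf where the discrete density is (numerically) zero
    return np.log(np.maximum(f_vals, 1e-300)).mean()

x_grid = np.linspace(-5.0, 5.0, 2001)
f_grid = np.exp(-0.5 * x_grid**2) / np.sqrt(2 * np.pi)  # toy terminal PDF
J = log_likelihood(np.array([0.0, 1.0, -1.0]), x_grid, f_grid)
```

This is the quantity the control parameters ᾱ are tuned to maximize.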
The optimality system. Writing the mapping ᾱ → f(ᾱ) between the maximization parameters and the PDF, we introduce the so-called reduced cost functional Ĵ(ᾱ) = J(f(ᾱ), ᾱ), so that the maximization problem becomes unconstrained. A local maximum ᾱ* of Ĵ can be found by solving the optimality system obtained by setting to zero the variations of the Lagrangian functional, where π̄ = (π_1, …, π_{N_Θ}) and ᾱ fulfil the usual Karush-Kuhn-Tucker (KKT) conditions π_j α_j = 0 and π_j ≥ 0. These are important to include the non-negativity constraints on the control variables. Note that if the condition α_j ≥ 0 is violated for some j ∈ {1, …, N_Θ}, the density of the measure dν_ᾱ(s) is negative in a neighbourhood of θ_j and thus is no longer a Lévy measure. The sum Σ_{j=1}^{N_Θ} π_j α_j extends only over the active constraints, i.e. those j with α*_j = 0; for those j with α*_j > 0 at the maximum we have π*_j = 0. First we calculate the variation L(f + δf) − L(f) to derive the adjoint equation; the variations are calculated separately for each summand of the right-hand side. For the term −Σ_j α_j ∫_0^T ∫_Ω ∫_Ω δf(x+s, t) Θ_j(s) ds p(x, t) dx dt we apply the substitution y = x + s and then exchange x ↔ y, so that it recasts to −Σ_j α_j ∫_0^T ∫_Ω ∫_Ω p(y, t) Θ_j(x − y) dy δf(x, t) dx dt. Then we substitute s = x − y again and insert the former term. For the variation of the time derivative, integrating by parts, one obtains boundary terms at t = 0 and t = T. The variation δf(x, 0) = 0 holds because of the Cauchy initial condition, while the variation at T remains. Next, we integrate by parts the term with the first-order derivative in x. Due to periodicity, the boundary term δf(Ω_a, t)(p(Ω_b, t) − p(Ω_a, t)) has to be zero, hence p(Ω_a, t) = p(Ω_b, t). For the second-order term, the first boundary contribution is zero because of the periodicity of the variation of the derivative of f at the boundaries, and because of the previous periodic condition on p.
The second boundary contribution is analogous and also has to vanish; we therefore get the continuity condition ∂_x p(Ω_a, t) = ∂_x p(Ω_b, t). Collecting all the terms under the double integral, we obtain the adjoint equation. The remaining boundary term ∫_Ω δf(x, T) p(x, T) dx will be considered below.
To calculate the variation of f in the functional J, we perform an additional integration in space, so that the first-order terms, together with the remaining boundary term, must vanish for every δf(x, T). This yields the terminal condition for the adjoint equation: p(X_l, T) = −1/(L f(X_l, T)), and p(x, T) = 0 if x ∉ {X_1, …, X_L}. If an X_l occurs with multiplicity m_l, the condition becomes p(X_l, T) = −m_l/(L f(X_l, T)). Summarizing, the adjoint equation (Kolmogorov's backward equation) is as follows: we note that by reversing the sign of time we get the same type of PIDE as the forward equation (up to a reflection of the drift and jump direction); hence this equation has a unique solution, also for the non-regular final value problem [8].
Second, we take the variation of Eq. (13) with respect to the fitting parameters, L(α_j + δα_j) − L(α_j), from which we obtain the optimality equations, where j runs over the set of values with α*_j ≠ 0. Note that the active π_j do not change the gradient, but simply balance the non-zero gradient components that point in directions where the inequality constraint α_j ≥ 0 is violated. As in our case we deal with simple box constraints on the α_j themselves, when determining the update we can set to zero those components of the negative gradient that correspond to an active index j and are negative; this then accounts for the effect of the π_j, see e.g. [20,25].
The first-order necessary optimality system consists of Eqs. (11), (21) and (22). Its solution gives values α*_1, …, α*_{N_Θ} that are candidates for maximizing the functional (10). Note that maximum likelihood fits in most cases do not correspond to convex optimization problems, and one always has to account for the perils of local optima that are sub-optimal globally.
Forward equation in flux form. The forward equation (11) can be written in flux form, and it is easy to verify that Eq. (23) is equivalent to Eq. (11). Further, from the conservation of total probability, it follows that the flux satisfies the periodic boundary condition F(Ω_a, t) = F(Ω_b, t). From this we immediately obtain the corresponding periodic condition on the first derivative of f.

Numerical Scheme
The numerical solution of the optimality system is found by a nonlinear conjugate gradient iterative procedure [23,29,4]. At each iteration, the solutions of two PIDEs, the forward and the adjoint one, must be found. In particular, the structural properties of the PDF must be satisfied by the discrete solution, and the numerical scheme for the PIDEs must satisfy a stability condition.
For the numerical discretization of the Kolmogorov forward equation we use the Chang-Cooper scheme (CC) [14], combined with a second-order backward differentiation formula (BDF2) for the discrete time operator. The CC method was proposed for a Fokker-Planck, resp. Kolmogorov, equation [4] without the integral term. It is a stable, second-order accurate, non-negativity-preserving and conservative numerical scheme [4,11].
The CC method is used for the differential operators; the integral term is treated separately according to an IMEX methodology. We denote B = −b and C = σ²/2; then the Kolmogorov forward equation reads as in Eq. (24), where the discrete time derivative is given by the BDF2 operator. Q(f^m_i; ᾱ) is the sum of the integrals of Eq. (23), calculated with the mid-point scheme; f̃^m_{ik} denotes the solution f^m_i translated by the value s_k ∈ Ω_h and continued by periodicity, and θ_{jk} = Θ_j(s_k). Note that the summation starts from k = 1, because the point k = 0 coincides with the point k = N. Therefore, the solution at a new time step is calculated by solving Eq. (26) for the unknown f^{m+1}, with the initial condition (31). This scheme is based on the fluxes at the N cell boundaries. The partial flux at the position x_{i+1/2} results from a linear convex combination of f at the points i and i+1. The idea of this combination was proposed by Chang and Cooper in [14] and is motivated by the need to guarantee positive solutions that preserve the equilibrium configuration. Indeed, the CC method is related to exponential fitting methods, such as the one proposed by Allen and Southwell [1] and the Scharfetter-Gummel discretization scheme [28]. The value of the parameter δ is δ = 1/w − 1/(exp(w) − 1), where w = hB/C, which can be shown to decrease monotonically from 1 to 0 as w goes from −∞ to ∞. With this choice of δ, the numerical scheme shares the properties of the continuous FP equation that guarantee positivity and conservativeness. This is a special case of the CC scheme: in the general one, the functions B and C may depend on (x, t), hence δ may also depend on (x, t). Both the CC scheme [11] and the mid-point rule are second-order accurate, so a second-order numerical scheme results. Let f^m = (f^m_1, …, f^m_N)† be the discrete solution at time t_m, with f^m_0 omitted due to periodicity, and let β = C/h − δB. The action of the finite-difference operator for F^m in Eq. (26) is expressed by a matrix A whose elements are the coefficients related to Eqs. (30) and (32). We note that this method needs a second starting point, which can be calculated using a first-order Euler scheme with a smaller time step size than δt. The implicit Euler scheme for Eqs. (24) and (32) is given in Eq. (37). These two numerical schemes possess some properties that can be easily proved; we list them here as remarks.
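The weight δ(w) = 1/w − 1/(e^w − 1) has a removable singularity at w = 0 (the drift-free case), where its limit is 1/2. A numerically safe evaluation, as a sketch with our own function name:

```python
import numpy as np

def cc_delta(w):
    """Chang-Cooper weight delta(w) = 1/w - 1/(e^w - 1), with the
    removable singularity at w = 0 replaced by its limit 1/2."""
    w = np.asarray(w, dtype=float)
    out = np.full_like(w, 0.5)
    nz = np.abs(w) > 1e-8
    # expm1 keeps e^w - 1 accurate for small |w|
    out[nz] = 1.0 / w[nz] - 1.0 / np.expm1(w[nz])
    return out

w = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])
d = cc_delta(w)
```

The computed values decrease monotonically from near 1 to near 0 over this range, matching the stated behaviour of δ.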
In fact, Σ_{i=1}^N A_{i,j} = 0 for all j, and Σ_{i=1}^N Q(f^m_i, ᾱ) = 0, because the values f̃^m_{ik} are the same as the f^m_i, only translated by k.
Under the same assumptions, the BDF2-CC scheme (35) applied to Eqs. (24) and (32), defined on the periodic domain Ω_h, is conservative: for the same constraints on A and Q as above, we get the corresponding identity. The positivity of the numerical scheme is proved using the theory of M-matrices [27]. Given a non-negative matrix E, E_{ij} ≥ 0, we say that M = sI − E is a non-singular M-matrix if s > ρ(E), where ρ(E) is the spectral radius of E. Proof. The argument is as follows: let R be the matrix operator such that (Rf^m)_i = h Σ_{j=1}^{N_Θ} α_j Σ_{k=1}^N θ_{jk} f̃^m_{ik}. This matrix is non-negative because the α_j and θ_{jk} are. The numerical scheme (37) can be recast with Ã = A − diag(A), which is a non-negative matrix. Provided that f^{m−1} ≥ 0 and δt ≤ 1/a, the right-hand side is a non-negative vector. We observe that the matrix on the left-hand side is always diagonally dominant, hence it has a convergent regular splitting and consequently is an M-matrix [27]. Therefore, (I − δtA)⁻¹ is non-negative and f^m will be too.
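Both properties can be checked on a small example. The sketch below (our own construction, for constant B and C and with the jump integral term omitted for brevity) assembles the periodic Chang-Cooper flux-difference matrix, whose columns sum to zero, and performs one implicit Euler step: the total mass is preserved and the solution stays non-negative.

```python
import numpy as np

def cc_matrix(N, B, C, h):
    """Periodic Chang-Cooper matrix A for d f_i/dt = (F_{i+1/2} - F_{i-1/2})/h,
    with flux F_{i+1/2} = B[(1-delta) f_{i+1} + delta f_i] + (C/h)(f_{i+1} - f_i).
    Columns sum to zero (conservation) and off-diagonals are non-negative."""
    w = h * B / C
    delta = 1.0 / w - 1.0 / np.expm1(w) if abs(w) > 1e-12 else 0.5
    up = (B * (1.0 - delta) + C / h) / h   # coupling to f_{i+1}
    lo = (C / h - B * delta) / h           # coupling to f_{i-1}
    A = np.zeros((N, N))
    for i in range(N):
        A[i, (i + 1) % N] = up
        A[i, (i - 1) % N] = lo
        A[i, i] = -(up + lo)
    return A

N, h, dt = 64, 0.1, 0.01
A = cc_matrix(N, B=0.8, C=0.05, h=h)
f0 = np.maximum(np.sin(np.linspace(0.0, 2.0 * np.pi, N)), 0.0)
f0 /= f0.sum() * h                                 # normalized initial PDF
f1 = np.linalg.solve(np.eye(N) - dt * A, f0)       # one implicit Euler step
```

Since the off-diagonal entries of A are non-negative and its columns sum to zero, I − δtA is an M-matrix, which is exactly the structure the proof above exploits.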
In order to prove the positivity of the BDF2 numerical scheme (35), we need the following lemma, which gives a lower bound on the rate of decrease of the solution.

Lemma 1.
Let the vector f^m ∈ R^N be non-negative and take a number ξ > 1. Then the solution f^{m+q} calculated with the Euler scheme of Eq. (37) after q time steps satisfies the stated inequality, provided that δt < (ξ − 1)/(aξ + β(1 + ω)/h), with the parameters defined in Eq. (34).
Proof. The relevant matrix is non-negative. Given the bound on δt, the right-hand side is non-negative and, from Thm. 1, we get that v ≥ 0. Iterating this inequality q times yields the claim. Proof. We apply the operator and use Eq. (35) for the first term on the right-hand side; the resulting matrix is non-negative. We know that (3I − 2δtA) is an M-matrix, so its inverse is non-negative. Also 2Ã + ξR is non-negative. Hence we can prove the non-negativity of v for a value m = m*, because the hypothesis ξf^{m*+1} − f^{m*} ≥ 0 also implies f^{m*+1} ≥ 0. The last inequality is just the bound on δt in the assertion, which gives a positive value for δt only when 1 < ξ < 3.
Indeed, this lemma proves positivity of the numerical solution of Eq. (35), provided that f⁰ ≥ 0 and ξf¹ − f⁰ ≥ 0, where f¹ is the second starting value of the numerical scheme, which can be calculated with the Euler scheme (37).
Theorem 2. Let f⁰ ≥ 0 be the discrete initial condition (31), and let f¹ be the second starting value calculated with the Euler scheme (37) with an appropriate time step, such that ξf¹ − f⁰ ≥ 0 for some 1 < ξ < 3. Then the BDF2 scheme (35) applied to Eq. (24), defined on the periodic domain Ω_h, is positivity preserving for the solution f^m, m > 1.
Proof. The proof is an application of Lemmas 1 and 2.
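A minimal numerical sketch of the startup procedure of Theorem 2, again with an illustrative periodic diffusion operator in place of the paper's full Chang-Cooper/quadrature discretization of (35); the grid parameters and ξ are our own choices.

```python
import numpy as np

# Illustrative stand-in for the operator A of scheme (35) (an assumption).
N, h, dt, sigma2 = 64, 2*np.pi/64, 1e-3, 0.02
A = np.zeros((N, N))
for i in range(N):
    A[i, i], A[i, (i - 1) % N], A[i, (i + 1) % N] = -2.0, 1.0, 1.0
A *= 0.5*sigma2/h**2

f0 = np.zeros(N); f0[N//2] = 1.0/h
# First step with implicit Euler (37): (I - dt*A) f1 = f0.
f1 = np.linalg.solve(np.eye(N) - dt*A, f0)
xi = 2.0
assert (xi*f1 - f0).min() >= -1e-12   # starting hypothesis of Theorem 2, 1<xi<3

# BDF2 steps: (3I - 2*dt*A) f_{m+1} = 4 f_m - f_{m-1}.
B = 3*np.eye(N) - 2*dt*A
fm1, fm = f0, f1
for _ in range(200):
    fm1, fm = fm, np.linalg.solve(B, 4*fm - fm1)
assert fm.min() >= -1e-12             # positivity preserved
assert abs(h*fm.sum() - 1.0) < 1e-8   # mass conserved
```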
In order to establish the stability of the discrete numerical schemes (35) and (37), we need inequalities of the form $\|f^{m+1}\| \le K \|f^m\|$ in a suitable norm, with K less than or equal to 1. We prove that this holds in the 1-norm with K = 1.
Proof. Let r = δt β/h and invert the matrix operator on the l.h.s.; then Eq. (39) reads as

Now we observe that
Proof. The numerical scheme (35) can be written in a form in which R is defined as in Thm. 1. We apply $M^{-1}$ and take the 1-norm on both sides. Following the same calculations as in Thm. 3, we get that $\|M^{-1}\|_1 = 1/3$. From the bound on δt we note that, for all ζ in the interval $5 - 2\sqrt{3} < \zeta < 3$, we have $\zeta - a\,\delta t = \xi$ with ξ ∈ (1, 3). Now $f^m \ge 0$ by virtue of the positivity condition, $(\zeta - a\,\delta t) f^m - f^{m-1} \ge 0$ by our assumptions, and 4 − ζ > 0; hence the sum on the r.h.s. is guaranteed to be a non-negative vector and the modulus in the calculation of the 1-norm can be removed. Using the property given in Rem. 1, we conclude that $\|f^{m+1}\|_1 \le \|f^m\|_1$.
Since the solutions are non-negative, the previous conservativeness identity corresponds to an equality in the 1-norm. Further, we can state that for these numerical schemes conservativeness and non-negativity imply the stability of the discrete operator.
Adjoint equation. The discrete adjoint equation can be found by discretizing the Lagrangian function of Eq. (13) and then performing the variations with respect to the discrete variables. This is known as the discretize-then-optimize approach (see Ref. [4] for details). This technique yields the discrete adjoint equation. Its numerical stability is governed by the same condition as for the forward equation, since the transpose of the operator M has the same eigenvalues; in this case, however, the non-negativity and conservativeness properties are not required.
Care has to be taken with the discrete terminal condition, since it cannot be defined through Eq. (21) because of the presence of the Dirac δ-measure. For this purpose we discretize the term (18), where $\bar x_i$ are the points given by the mean value theorem for integrals. Then we use the approximation $f(\bar x_i, T) \approx f^{N_T}_i$, so that, by performing the variation $\delta f^{N_T}_i$ on this discrete functional, we get the discrete terminal condition. According to Eq. (21), this completes the formulation of the discrete adjoint problem.
Discrete gradient. The discrete version of the reduced gradient related to the optimality condition of Eq. (22) is calculated with the mid-point quadrature formula, so that each component j satisfies $(D_{\bar\alpha}\hat J)_j \approx (\nabla_{\bar\alpha}\hat J)_j$.
Nonlinear conjugate gradient method. The availability of the discrete gradient allows us to implement a nonlinear conjugate gradient (NLCG) scheme in order to solve the optimization problem (12). NLCG is an extension of the linear conjugate gradient method to non-quadratic problems [23,29,4].
The optimality system is solved by implementing the gradient evaluation of the following algorithm: Algorithm 1 (Evaluation of the Gradient at ᾱ), within an NLCG scheme. The search directions are defined recursively as $d_{k+1} = -g_{k+1} + \beta_k d_k$, where k = 0, 1, 2, . . . in this paragraph stands for the iteration index, $g_k = D_{\bar\alpha}\hat J(\bar\alpha_k)$ is the numerical gradient, and $d_0 = -g_0$. Let $\bar\alpha_k$ be an estimate of the best rates at iteration k; the next iterate toward a minimum point is given by $\bar\alpha_{k+1} = \bar\alpha_k + \xi_k d_k$, where $\xi_k > 0$ is a step length obtained with a line search that satisfies the Armijo condition of sufficient decrease of $\hat J$'s value, $\hat J(\bar\alpha_k + \xi_k d_k) \le \hat J(\bar\alpha_k) + \delta\,\xi_k\,(g_k, d_k)$, where 0 < δ < 1/2; see [25]. Notice that we use the inner product of the space U = R^n.
For the formula of $\beta_k$ we use the formulation due to Dai and Yuan [17], $\beta_k^{DY} = \|g_{k+1}\|^2 / (d_k, y_k)$, where $y_k = g_{k+1} - g_k$. Summarizing, the NLCG scheme is implemented as follows: Algorithm 2 (NLCG Scheme).
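The NLCG loop with the Dai-Yuan update and Armijo backtracking can be sketched as follows. The objective here is a toy quadratic rather than the log-likelihood functional $\hat J$, and the tolerances, safeguard, and iteration caps are our own illustrative choices.

```python
import numpy as np

def nlcg_dy(J, gradJ, a0, delta=0.1, xi0=0.5, shrink=0.3, tol=1e-8, max_iter=2000):
    """Nonlinear CG with Dai-Yuan beta and an Armijo backtracking line search."""
    a, g = a0.copy(), gradJ(a0)
    d = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        if np.dot(g, d) >= 0:          # safeguard: fall back to steepest descent
            d = -g
        xi = xi0                       # Armijo: J(a+xi*d) <= J(a) + delta*xi*(g,d)
        while J(a + xi*d) > J(a) + delta*xi*np.dot(g, d) and xi > 1e-16:
            xi *= shrink
        a_new = a + xi*d
        g_new = gradJ(a_new)
        denom = np.dot(d, g_new - g)   # Dai-Yuan denominator (d_k, y_k)
        beta = np.dot(g_new, g_new)/denom if abs(denom) > 1e-30 else 0.0
        d = -g_new + beta*d
        a, g = a_new, g_new
    return a

# Toy convex quadratic J(x) = 0.5 x^T Q x - b^T x with known minimizer Q^{-1} b.
Q = np.diag([1.0, 2.0, 3.0]); b = np.ones(3)
J = lambda x: 0.5*x @ Q @ x - b @ x
gradJ = lambda x: Q @ x - b
a_star = nlcg_dy(J, gradJ, np.zeros(3))
assert np.linalg.norm(a_star - np.linalg.solve(Q, b)) < 1e-5
```

The steepest-descent safeguard is not in the paper's Algorithm 2; it is a common practical guard when the Dai-Yuan direction loses the descent property under an Armijo-only line search.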

End while
Correction factor for the logarithm in the objective. The numerical evaluation of the functional of Eq. (10) is problematic because of the logarithm at the points X_l where the PDF at the final time vanishes. Hence, the logarithm in the functional is regularized, replacing $\log f$ by $\log(f + \epsilon)$ with $\epsilon = 10^{-12}$.
Nearest grid point for sample values. The discrete PDF is defined on the mesh grid Ω_h; therefore the sample values X_l used for the evaluation of the PDF are rounded to the nearest points of the mesh grid Ω_h. This approximation affects both the value of the functional and the terminal condition of the adjoint equation.
Von Mises distribution. The initial distribution $f_0(x)$ of Eq. (11) is set to the von Mises distribution $f_0(x) = \frac{\exp(\kappa\cos(x-\mu))}{2\pi I_0(\kappa)}$, where $I_0(\cdot)$ is the modified Bessel function of order 0 and κ is the concentration parameter, which should be taken large in order to approximate the Dirac delta function at zero as initial data for the forward PIDE.
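A quick numerical check of this initial condition (the grid size and tolerances are our own choices; κ = 400 matches the consistency test below): for large κ the von Mises density is normalized and concentrates like a Gaussian with variance 1/κ.

```python
import numpy as np

# Von Mises density on a periodic grid; np.i0 is the modified Bessel I_0.
N = 4096
h = 2*np.pi/N
x = -np.pi + h*np.arange(N)
mu, kappa = 0.0, 400.0
f0 = np.exp(kappa*np.cos(x - mu)) / (2*np.pi*np.i0(kappa))

mass = h*f0.sum()              # trapezoid rule on a periodic grid
var = h*np.sum(x**2 * f0)      # approaches 1/kappa for large kappa
assert abs(mass - 1.0) < 1e-6
assert abs(var - 1.0/kappa) < 1e-4   # delta-like concentration around zero
```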

Numerical Tests
In this section we perform the non parametric estimation of the Lévy density distribution function, that is, we find the value ᾱ = (α_1, . . . , α_{N_Θ}) that best fits the given data. We present two validation test cases and one application to financial data.
Testing for Consistency. We perform a numerical test on the consistency of our estimation procedure. According to maximum likelihood theory, consistency here means that, if we fix a parametrization N_Θ and the parameter values, we can (approximately) reconstruct these values from our estimation procedure and the maximization of the AIC, provided that a sufficiently large sample from the true distribution is given. We simulate such a sample using pseudo-random realizations of the Lévy process X(t). Details on the simulation methods can be found e.g. in [24]. Note, however, that in our case cyclic boundary conditions have to be taken into account. The data setting for our test case is as follows: the space domain is Ω = [−π, π), the final time T = 1, the initial von Mises distribution has center µ = 0 and concentration κ = 400, the drift of the stochastic process is b = 0 and the Gaussian volatility is σ = √0.02. The setting for the numerical solution is: space grid size N = 420, time grid size N_T = 250. The setting for the optimization is critical; we found the following parameters by experience: initial approximation of the parameter rates ᾱ_0 = (0.1, 0.1, . . .), constant of the Armijo condition δ = 0.1, initial step length of point 2 of Algorithm 2 set to ξ_k = 0.5 and shrunk by a factor of 0.3, ξ_{k+1} = 0.3 ξ_k.
As a first test, we perform a fit for a set of L = samples. We see a good match for N_Θ = 5 with the original rates. In Figs. 1, 2 and 3 we can also appreciate the good fit between the calculated PDF and the histograms of the simulated Monte Carlo data, for the proposed optimization problem with N_Θ = 3, 5, 6.
Another interesting problem is the selection of the number of parameters N_Θ and the corresponding basis functions Θ_j for the best data fit. In Fig. 4 we depict the resulting values of Akaike's Information Criterion (AIC) [13]. A common choice in statistics is to pick the parametrization that maximizes the AIC.
We can see that the criterion gives the value N_Θ,opt = 6, while the correct value is 5. The differences in the AIC are, however, rather small for N_Θ between 5 and 7.
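The mechanics of AIC-based order selection can be illustrated on a toy polynomial-regression problem instead of the Lévy calibration. This sketch is ours; we use a generic "log-likelihood minus number of parameters" score to be maximized, which may differ from the paper's AIC formula by model-independent constants.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = np.linspace(-1, 1, n)
y = x**3 - x + 0.05*rng.standard_normal(n)   # true model: degree 3

def aic(deg):
    """Generic AIC (to be maximized) for a degree-`deg` polynomial fit."""
    coeffs = np.polyfit(x, y, deg)
    rss = np.sum((y - np.polyval(coeffs, x))**2)
    loglik = -0.5*n*np.log(rss/n)            # Gaussian profile log-likelihood
    return loglik - (deg + 1)                # penalize the parameter count

scores = {deg: aic(deg) for deg in range(1, 9)}
best = max(scores, key=scores.get)
assert scores[3] > scores[1] and scores[3] > scores[2]  # underfits rejected
assert best >= 3
```

As in the paper's Fig. 4, nearby model orders can have very similar scores, so the maximizer may overshoot the true order slightly; only clearly underfitting models are reliably penalized.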
Fitting Data from a bi-directional gamma process. In the second test we fit the final position at T = 1 of 10^5 samples of a stochastic process with jumps distributed according to $d\nu(s) = A \frac{e^{-\beta|s|}}{|s|}\,ds$.
Here A > 0 is the so-called shape parameter and β is the rate parameter. Note that this is not a finite measure, so we are outside the compound Poisson class, and the trajectory of the bi-directional gamma process has infinitely many (small) jumps. In [2] only the unidirectional gamma process is described. Let $Y_+(t)$ be such a unidirectional gamma process; then its Lévy measure is $d\nu_+(s) = A\,\frac{e^{-\beta s}}{s}\,\mathbf{1}_{\{s>0\}}\,ds$. Let thus $Y_+(t)$ and $Y_-(t)$ be two independent copies of the gamma process; then the difference $Y_+(t) - Y_-(t)$ is our bi-directional gamma process, which is the jump part of our Lévy process that also includes diffusion as in the first experiment. If we project Y(t) to the torus [−π, π], the Lévy measure of the projected Lévy process is the projected measure $\varphi_*\nu$, see (8). Using $\sum_{n=0}^{\infty} \frac{e^{-q(n+1)}}{p+n} = e^{-q}\,\Phi(e^{-q}, 1, p)$, $p \neq 0$, $q > 0$, with $\Phi(z, s, a)$ the Lerch transcendent, for p = |s|/π and q = βπ we get the Lévy measure on the torus for the projected bi-directional gamma process.
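The Lerch-transcendent representation of the wrapped measure can be cross-checked numerically. The sketch below uses our own scaling (q = 2πβ, p = x/(2π), valid for 0 < x < π), which differs from the paper's p = |s|/π, q = βπ convention, and compares the resulting closed form against direct summation of the 2π-translates of ν.

```python
import math

def lerch_phi(z, s, a, terms=400):
    """Lerch transcendent Phi(z, s, a) = sum_{n>=0} z^n/(n+a)^s, for |z| < 1."""
    return sum(z**n / (n + a)**s for n in range(terms))

A, beta = 0.5, 1.0     # illustrative shape and rate parameters
x = 1.0                # a point in (0, pi)
q = 2*math.pi*beta
p = x/(2*math.pi)

# Closed form of the wrapped measure sum_n A exp(-beta|x+2*pi*n|)/|x+2*pi*n|:
closed = (A/(2*math.pi))*(math.exp(-q*p)*lerch_phi(math.exp(-q), 1, p)
                          + math.exp(-q*(1 - p))*lerch_phi(math.exp(-q), 1, 1 - p))

# Brute force: direct summation over the image charges.
brute = sum(A*math.exp(-beta*abs(x + 2*math.pi*n))/abs(x + 2*math.pi*n)
            for n in range(-60, 61))
assert abs(closed - brute) < 1e-12
```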
In the simulations, data are generated by scaling the rate parameter such that 1/β = 1 and setting the shape to A = 0.5. Finally, the data Y(t) are projected to the torus [−π, π). In Fig. 5 we report the result of the AIC test and the fit with N_Θ = 9 basis functions centered at θ_j = (−1 + 2(j − 1)/9)π, j = 1, . . . , 9, for which the calculated rates are ᾱ = (0, 0.0417, 0.0235, 0.1529, 0.8827, 0.8616, 0.1517, 0.0316, 0.0396). We conclude that our procedure yields high-quality fits, even for Lévy distributions that are not part of our hierarchy of parametrizations but can only be approximated by it.
Financial Data. As an example from a real-world problem we report the result of fitting the German Stock Exchange (DAX) index, which is publicly available, e.g. from Yahoo Finance, see Figure 6, left panel. Within the data of all closing quotes between April 1998 and March 2015, there are several periods of elevated volatility, so-called volatility bursts. Obviously, this contradicts the description of the market with an exponential Lévy model S(t) = e^{Y(t)} [2,15,24], as the statistical law of Y(t) − Y(s) does not depend only on t − s. In order to avoid the pitfalls of time dependent (or stochastic [15,24]) volatility, we identify a period of comparatively stable volatility of 1000 trading days between April 1998 and February 2002, see the right panel of Figure 6. This data set has a small drift value b = 6.787 × 10^{-4}, which corresponds to an increase in the value of the stocks of 67% nominal interest rate over the 1000 trading days (followed by severe losses in the subsequent period). The empirical daily volatility (i.e. the standard deviation of daily log-returns) in this period of time is σ = 9.304 × 10^{-3}. The obvious absence of axial symmetry rules out a Gaussian (Black-Scholes) market model from the outset. Our goal is to find a suitable description of this sample with an exponential Lévy market model from our hierarchy of parametrizations. The data have been mapped to the [−π, π] torus by rescaling and 'wrapping' the daily log-returns below/above 3%.
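A minimal sketch of this torus mapping, assuming that ±3% log-returns are rescaled to ±π before wrapping; the precise rescaling used in the paper is not stated in this excerpt, so the constant below is an assumption.

```python
import math

CUTOFF = 0.03   # assumed: a 3% daily log-return maps to the torus boundary pi

def to_torus(log_return):
    """Rescale a daily log-return and wrap it into the torus [-pi, pi)."""
    y = log_return * math.pi / CUTOFF
    return (y + math.pi) % (2*math.pi) - math.pi

assert abs(to_torus(0.015) - math.pi/2) < 1e-12     # half the cutoff -> pi/2
assert abs(to_torus(0.045) - (-math.pi/2)) < 1e-12  # 4.5% wraps around
```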
With this setting we calculated the fit of the distribution, with equally spaced basis functions in the interval [−π, π). In Fig. 7 (left panel) we report the result of the AIC test, and the fit with N_Θ = 6 (right panel). The selected basis functions are centred at θ_j = (−1 + (j − 1)/3)π, j = 1, . . . , 6, for which the calculated rates are ᾱ = (0, 0, 0.484, 0.223, 0.304, 0). Although only three parameters are different from zero, the AIC is maximized at N_Θ = 6. That the AIC at N_Θ = 3 is lower is explained by the fact that the more localized basis functions in the N_Θ = 6 basis are more adequate for fitting the data. It would also be a misinterpretation to conclude that the chosen parametrization misses an effective description with three parameters, since the positions of the grid points act as additional parameters. Note that the zero entries in the 1st, 2nd and 6th slots of ᾱ actually correspond to small positive values and are only represented as zero when rounded to the 3rd digit.

Conclusion and Outlook
In the present article, we demonstrate the use of optimal control for PIDE for the non parametric estimation of Lévy processes. Here the PIDE is given by Kolmogorov's forward equation, which allows one to calculate the terminal distribution of the Lévy process at time T. The objective functional is the log-likelihood evaluated on a sample of terminal values of the Lévy process. Based on the study of Lévy distributions, we set up approximate estimation problems that can be tackled by maximum likelihood estimation along with model selection based on Akaike's information criterion (AIC). As the density of Lévy probability distributions in most cases cannot be determined analytically, numerical solutions of the Kolmogorov forward equation (Fokker-Planck equation) and its backward (adjoint) analogue are needed for the efficient maximization of the log-likelihood functional.
For the numerical solution of the optimality system we used the Chang-Cooper method with a mid-point quadrature rule and a second-order backward differentiation formula (BDF2) in time. This numerical scheme is second-order accurate and conservative, and we found conditions for the stability and positivity of the numerical solution. We use a nonlinear conjugate gradient method to solve the first-order optimality conditions.
We have shown that this method works for spline discretizations of the density of the Lévy measure with symmetric boundary conditions for up to 11 parameters. The results consistently fit simulated data from the family of discretizations itself. The same turns out to be true for Lévy processes that can only be approximated by such discretizations as the number of parameters goes to infinity, like the gamma process. Here the AIC provides an effective mechanism to choose an adequate discretization at a given sample size. Finally, we have demonstrated that real-world financial data can also be effectively fitted using our strategy.
The future potential of this solution lies in the fact that, unlike FFT / spectral based calibration procedures that are widely used in financial engineering [7,15,24], the present approach naturally generalizes to processes that originate as the solution of Stochastic Differential Equations (SDE) with state dependent coefficients. Such local volatility models are frequently used in contemporary financial engineering.
In this work, we used historic and low frequency data for non parametric model calibration. High frequency historical data and implicit volatility data [15,24] are natural candidates to set up new objective functionals for related control problems that go beyond the control of the terminal distribution.
Another relevant problem is the notorious occurrence of local minima in the maximum likelihood estimation. We expect this to be more severe, when the number of parameters significantly increases. An interesting hybrid approach would combine the robustness of non-parametric spectral calibration methods as a sort of pre-conditioner with the highly efficient maximum likelihood estimation.