Accuracy of Nonparametric Density Estimation for Univariate Gaussian Mixture Models: a Comparative Study

Abstract. Flexible and reliable probability density estimation is fundamental in unsupervised learning and classification. Finite Gaussian mixture models are commonly used for this purpose. However, the parametric form of the distribution is not always known; in this case, non-parametric density estimation methods are used. Usually, these methods become computationally demanding as the number of components increases. In this paper, a comparative simulation study of the accuracy of several nonparametric density estimators is presented. The following approaches are considered: an adaptive bandwidth kernel estimator, a projection pursuit estimator, a logspline estimator, and a k-nearest neighbor estimator. It is concluded that data clustering as a pre-processing step improves the estimation of mixture densities. However, when the data do not have clearly defined clusters, the pre-processing step offers little advantage. The application of the density estimators is illustrated using municipal solid waste data collected in Kaunas (Lithuania); the data distribution follows the general observations of Marron and Wand (i.e., it has a kurtotic unimodal density). Based on homogeneity tests, it can be concluded that the distributions of the municipal solid waste fractions in Kutaisi (Georgia), Saint-Petersburg (Russia), and Boryspil (Ukraine) are statistically indistinguishable from the distribution of the waste fractions in Kaunas.


Introduction
The problem under consideration is closely related to distribution analysis, an important branch of data analysis used to solve various other problems (discriminant analysis, image recognition, etc.). The methodology for estimating distributions is receiving increasing attention due to emerging applications: processing of genetic information, analysis of astronomical objects, research of computer equipment and its peripheral data, etc. Even though there are already many methods for estimating data distributions, new advancements continue to be proposed by various authors.
When the data distribution is multimodal and the sample size is small, it is not easy in practice to choose a robust density estimation method. Kernel estimators are the most popular method for density estimation (see, for example, [19,26,33]). Other estimators are also frequently used by researchers (see [6,11,18,22]). As big data have become commonplace in almost every field of data science, various approaches to density estimation are being investigated, including machine learning algorithms; academia and industry focus on the development of innovative density estimation procedures (see [25,30]). In some cases the accuracy of the assessment can be significantly increased (according to [32]) if the observations are clustered first (i.e., the multimodal density is treated as a mixture of unimodal densities) and the popular nonparametric estimators are applied to each cluster separately.
The present work extends the pilot research presented in [32]. The following new research directions are investigated: 1) new density estimation methods based on the inversion formula are formulated; 2) in order to compare the robustness of the estimators, the wide set of benchmark distributions proposed by Marron and Wand [26] is used; 3) in order to obtain results with a reasonably high level of confidence, a relatively large number (100 000) of independent samples have been generated.
The primary purpose of this paper is to assess the performance of several density estimators using benchmark densities [26] that represent different types of problems arising for unimodal and multimodal densities. The relationship between the estimation accuracy and complex structures in the case of univariate Gaussian mixture models (GMM) is analysed by the Monte Carlo method. GMM distributions are popular and are applied in various scientific fields to important problems. For example, Younghong [14], in a medical study of limb mobility classification using myoelectric signals, compared GMM with three other frequently used classification methods (linear discriminant analysis, linear multilayer perceptron network, and multilayer neural network); GMM demonstrated exceptional classification accuracy in [14]. Henthorn [16] examined, with the help of GMM, the ability to hydroxylate debrisoquine and O-demethylate dextromethorphan (the degree of urinary metabolism). Zimmerman [37] used GMM for EKG analysis, using ST segments and T waves to determine ischemia. Kovacs [23] applied it to molecular dynamics simulations of membrane pore formation. Stachniss [35] applied GMM in geological investigations for gas concentration prediction in mined areas. Gruszczyński [13] used approximation by GMM in ecological research on soil contamination. Chernova and Veloso [4], in the field of applied sociology, described by GMM the results of interactive policy formation that learners demonstrate later.
In this work, after the clustering pre-processing step, the mixture components were identified using: 1) a kernel estimator with adaptively selected bandwidth, 2) Friedman's algorithm based on projection pursuit, 3) the Kooperberg and Stone logspline estimator based on approximation of the logarithm of the density by a sum of cubic B-splines, 4) the Fix and Hodges k-nearest neighbor density estimator, and 5) the proposed modification of the inversion-formula density estimator, whose structure is given in (3.6). A pilot comparative study of the accuracy of several non-parametric estimators [32] showed that the Friedman procedure is more robust in the majority of the examined Gaussian multivariate mixture cases where the components can be separated. The kernel estimator is more accurate even with a small sample size. Initial clustering of the sample yields less biased density estimates.
The scientific novelty of this research is the comparative accuracy study of density estimation methods. The performance of various non-parametric estimators has been compared using different data sets. Representative density estimators of the various techniques analysed by other researchers have been selected. In addition, this study reveals the practical value of homogeneity tests.
This paper is organized as follows: Section 2 describes the EM algorithm used for sample clustering before density estimation. Section 3 summarizes the five selected density estimators. Section 4 contains the simulation results and states the findings in the context of the municipal solid waste data. The detailed results of the Monte Carlo study can be found in the appendices. Concluding remarks are presented in Section 5.

Sample clustering with the EM algorithm
Although previous numerical simulations [32] have shown that initial data clustering is beneficial for evaluating multimodal distribution densities, the question of the optimal clustering method is still open. A comparative study in [32] showed that probabilistic clustering methods are more robust than geometric clustering methods (e.g., k-means, hierarchical, etc.) applied in conjunction with the transformation of mixture components to spherical form [34]. This work is limited to the study of sample clustering based on approximation of the considered density by a Gaussian mixture distribution.
Let X(1), ..., X(n) be observed independent and identically distributed random variables with unknown distribution density f(x). If the density is multimodal, it can be analysed as a mixture of several unimodal densities:

f(x) = Σ_{k=1}^{q} p_k f_k(x). (2.1)

Suppose the observed random variable X depends on a random variable v taking the values 1, ..., q (interpreted as the class number of the observed object). The weights p_k = P{v = k} are called, in classification theory, the a priori probabilities of the observed object belonging to class k, and the weights π_k(x) = P{v = k | X = x} are called the a posteriori probabilities. The function f_k is treated as the conditional distribution density of X under the condition v = k. The assessment of the values π_k(X(t)) for all k = 1, ..., q, t = 1, ..., n, is called soft clustering of the sample. In this paper the sample is hard clustered: it is split into subsets using the estimates v̂(1), ..., v̂(n), where v̂(t) denotes the estimated class number of observation X(t).
GMM is a well-studied model in model-based clustering theory and practice; therefore it was selected for this research. The application of GMM is based on the assumption that the functions f_k(x) are Gaussian densities with means µ_k and variances σ_k^2. In this case the right-hand side of equality (2.1) is denoted f(θ, x), where θ = ((p_k, µ_k, σ_k^2), k = 1, ..., q). Since the equalities

π_k(x) = p_k f_k(x) / f(θ, x), k = 1, ..., q, (2.2)

are valid, the a posteriori probability estimates can be obtained by a plug-in method which replaces the unknown parameter vector θ on the right-hand side of (2.2) with its maximum likelihood estimate θ* = argmax_θ L(θ), L(θ) = Π_{t=1}^{n} f(θ, X(t)). The EM algorithm is applied in this work to find the estimate of the parameter vector.
Given the estimates π̂_k^{(r)}(X(t)), the M-step yields the parameter estimates

p̂_k^{(r+1)} = n^{-1} Σ_{t=1}^{n} π̂_k^{(r)}(X(t)),
µ̂_k^{(r+1)} = Σ_{t=1}^{n} π̂_k^{(r)}(X(t)) X(t) / Σ_{t=1}^{n} π̂_k^{(r)}(X(t)),
(σ̂_k^2)^{(r+1)} = Σ_{t=1}^{n} π̂_k^{(r)}(X(t)) (X(t) − µ̂_k^{(r+1)})^2 / Σ_{t=1}^{n} π̂_k^{(r)}(X(t)),

for all k = 1, ..., q. Inserting θ̂^{(r+1)} into (2.2), the right-hand side yields π̂_k^{(r+1)}(X(t)), k = 1, ..., q, t = 1, ..., n. A non-decreasing sequence of likelihoods L(θ̂^{(r)}) is obtained as a result of this iterative procedure. However, convergence to the global maximum depends on the choice of the initial estimate θ̂^{(0)} (or π̂^{(0)}). The simplest strategy is to apply the EM algorithm to a number of randomly generated initial estimates π̂^{(0)} and take the resulting estimate with the highest likelihood L(θ̂). The sequential mixture component extraction method described in [7,27,31] could be used here as well.
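The E- and M-steps above can be sketched as follows. This is a minimal illustration with a deterministic quantile-based initialisation standing in for the multi-start strategy described in the text; the function and parameter names are ours, not the paper's.

```python
import numpy as np

def em_gmm(x, q, n_iter=300, tol=1e-10):
    """EM estimation of a univariate q-component Gaussian mixture.

    A sketch of the scheme in Section 2; quantile-based initialisation
    replaces the multi-start strategy for brevity.
    """
    x = np.asarray(x, float)
    n = len(x)
    p = np.full(q, 1.0 / q)
    mu = np.quantile(x, (np.arange(q) + 0.5) / q)   # spread initial means
    var = np.full(q, x.var())
    ll_old = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probabilities pi[t, k] via equation (2.2)
        dens = p * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        pi = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means and variances from the posteriors
        w = pi.sum(axis=0)
        p = w / n
        mu = (pi * x[:, None]).sum(axis=0) / w
        var = np.maximum((pi * (x[:, None] - mu) ** 2).sum(axis=0) / w, 1e-12)
        ll = np.log(dens.sum(axis=1)).sum()          # non-decreasing likelihood
        if ll - ll_old < tol:
            break
        ll_old = ll
    return p, mu, var, pi
```

The returned posterior matrix `pi` can be used directly for the hard clustering described above by taking the maximal component per observation.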
A variety of model adequacy tests can be used for selection of the number of clusters q [28]. Let f*(x) = f(θ*, x), where θ* is the maximum likelihood estimate. Let us denote the corresponding distribution function F*(x) = ∫_{−∞}^{x} f*(t) dt and the empirical distribution function F̂(x) = n^{−1} Σ_{t=1}^{n} 1{X(t) < x}. Let f̂ be a non-parametric density estimate of f, to be presented in Section 3. Defining the discrepancy statistic ψ from these estimates (for more details we refer the reader to [31]), we do not reject the adequacy hypothesis of model (2.1) if the condition ψ < ε is fulfilled. Given a significance level α, it is desirable to choose the level ε as close as possible to the quantile u_α: P{ψ > u_α} = α. The bootstrap method (see [3,15,24,36]) can be used for this purpose. Let G(u) denote the distribution function of the statistic ψ; then, for fixed n, it is completely defined by the distribution function F of the random variable X, i.e., P{ψ > u} = G(u, F) for all u. The level ε is defined by the equality G(ε, F*) = 1 − α for the given α. The function G(·, F*) can be obtained by the bootstrap method.
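The bootstrap computation of the critical level ε can be sketched as follows. Here `psi_stat` and `sample_from_model` are hypothetical placeholders for the adequacy statistic ψ and a sampler from the fitted model f(θ*, ·); the names are ours, not the paper's.

```python
import numpy as np

def bootstrap_critical_level(theta_star, psi_stat, sample_from_model,
                             n, alpha=0.05, B=500, seed=0):
    """Parametric bootstrap approximation of eps: G(eps, F*) = 1 - alpha.

    theta_star        : fitted mixture parameters
    psi_stat          : callable computing the statistic psi for a sample
    sample_from_model : callable (theta, n, rng) -> bootstrap sample of size n
    Returns the (1 - alpha) empirical quantile of psi under F*.
    """
    rng = np.random.default_rng(seed)
    psis = np.empty(B)
    for b in range(B):
        xb = sample_from_model(theta_star, n, rng)  # draw from f(theta*, .)
        psis[b] = psi_stat(xb)                       # recompute the statistic
    return np.quantile(psis, 1.0 - alpha)
```

The adequacy hypothesis of model (2.1) is then rejected whenever the observed ψ exceeds the returned ε.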

The density estimation algorithms analysed
The comparative study of the accuracy of the density estimation algorithms has been performed using estimators of different types, which are also used by other researchers. In this paper the following estimators of the probability distribution density were examined by the Monte Carlo method:

1. The projection pursuit density estimator (PPDE), proposed by Friedman (see [11,12]), which is based on sequential search of projections and their Gaussianization;

2. The kernel density estimator with adaptively selected bandwidth (AKDE), examined in [18,19];

3. The logspline density estimator (LSDE) of Kooperberg and Stone [21], based on approximation of the logarithm of the density by a sum of cubic B-splines;

4. The k-nearest neighbor density estimator (NNDE), which is based on the distance from the evaluation point to its k-th nearest observation and has been examined by Devroye and Krzyżak [9];

5. The modified inversion-formula density estimator (MIDE), whose original version was proposed in [20] and analysed in [32], based on application of the inversion formula.
The methods are applied to a standardized sample. The examined algorithms are described in detail below.
PPDE algorithm. J.H. Friedman proposed an iterative algorithm for multidimensional density estimation. It is based on a sequential search of one-dimensional projections [17] along which the distribution differs most from the normal one. These projections are transformed into Gaussian variables. Let X be the standardized random variable (i.e., with zero mean and unit variance) with unknown distribution density f(x). The sample is transformed, Z^(k) = Q_k(X), k = 1, 2, ..., at each iteration. The transformation is performed so that, along the direction τ where the density g_k of the projection of Z^(k) differs most from the standard normal density ϕ, the projected distribution becomes Φ, while the projection onto the orthogonal (d − 1)-dimensional subspace of R^d remains unchanged. Friedman proved [12] that Z^(k) converges in distribution, as k → ∞, to the standard multidimensional Gaussian random variable; therefore, for sufficiently large k, the density f can be recovered from the sequence of transformations via expression (3.1). The Friedman statistic is obtained by substituting statistical estimates for the unknown densities g_k on the right-hand side of (3.1). In the univariate case the direction is τ = 1 and M = 1. Friedman's projection estimator based on Legendre orthogonal polynomials was applied for the statistical estimation of the density g_k. Let ξ_1, ..., ξ_n be random variables with distribution density g(u).
Expanding g in the Legendre polynomial basis ψ_0, ψ_1, ... and replacing the coefficients b_j = (j + 1/2) E ψ_j(η) by their empirical analogues b̂_j = (j + 1/2) n^{−1} Σ_{t=1}^{n} ψ_j(η_t), the following estimator is obtained:

ĝ(u) = Σ_{j=0}^{s} b̂_j ψ_j(u). (3.2)

According to the recommendation of [18], the order of the expansion (3.2) should be s ≤ 6.
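The orthogonal-series estimator (3.2) can be sketched with NumPy's Legendre utilities; a minimal version, assuming the sample has already been transformed to lie in [-1, 1]:

```python
import numpy as np
from numpy.polynomial import legendre as L

def legendre_density(eta, s=6):
    """Orthogonal-series density estimate, expansion (3.2), on [-1, 1].

    eta : sample assumed to lie in [-1, 1]; s : expansion order (s <= 6).
    Returns a callable g_hat(u).
    """
    eta = np.asarray(eta, float)
    # b_j = (j + 1/2) * empirical mean of P_j(eta_t), P_j the j-th
    # Legendre polynomial (evaluated via a unit coefficient vector)
    coef = np.array([(j + 0.5) * L.legval(eta, np.eye(s + 1)[j]).mean()
                     for j in range(s + 1)])
    return lambda u: L.legval(np.asarray(u, float), coef)
```

Since the integral of P_0 over [-1, 1] is 2 and all higher polynomials integrate to 0, the estimate integrates to 1 by construction.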
AKDE algorithm. The kernel density estimator with local bandwidth has the form

f̂(x) = n^{−1} Σ_{j=1}^{n} h_j^{−1} ϕ((x − X(j))/h_j). (3.3)

The same procedure as in [18] is used in this paper: the standard Gaussian density ϕ is used as the kernel function, and the local bandwidths [19] are defined by

h_j = h (f̃(X(j))/q)^{−v},

where h = (4/(3n))^{1/5}, f̃(·) is the kernel estimator (3.3) [19] obtained by replacing every h_j with the fixed h, q = exp{n^{−1} Σ_{j=1}^{n} log f̃(X(j))} is the geometric mean of the pilot values, and v is the sensitivity parameter.
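A direct sketch of the adaptive bandwidth scheme (3.3): a fixed-bandwidth pilot estimate, the geometric-mean normalisation q, and the local bandwidths h_j. Cross-validated choice of v is omitted here for brevity.

```python
import numpy as np

def akde(x, v=0.5):
    """Adaptive kernel density estimate in the spirit of (3.3)-(3.4).

    Pilot fixed-bandwidth estimate with h = (4/(3n))^(1/5) on the
    (assumed standardized) sample, then local bandwidths
    h_j = h * (pilot(X_j)/q)^(-v), q the geometric mean of pilot values.
    """
    x = np.asarray(x, float)
    n = len(x)
    h = (4.0 / (3.0 * n)) ** 0.2
    phi = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    # pilot estimate evaluated at the sample points themselves
    pilot = phi((x[:, None] - x[None, :]) / h).mean(axis=1) / h
    q = np.exp(np.mean(np.log(pilot)))          # geometric mean
    hj = h * (pilot / q) ** (-v)                # local bandwidths
    def f_hat(t):
        t = np.atleast_1d(np.asarray(t, float))
        return (phi((t[:, None] - x[None, :]) / hj) / hj).mean(axis=1)
    return f_hat
```

Larger v widens the kernels in low-density regions, which mainly improves tail behaviour.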
The authors of [18] proposed choosing the sensitivity parameter v from the set {0.2, 0.4, 0.6, 0.8}; the specific value is determined by means of cross-validation.

LSDE algorithm. Logspline density estimators approximate the logarithm of the density by a sum of splines,

f̂(x; β) = exp( Σ_{j=1}^{s} β_j B_j(x) − C(β) ),

for a given set of basis functions B_1, ..., B_s, a coefficient vector β = (β_1, ..., β_s), and the normalization constant C(β).
The procedure applies the cubic B-splines proposed by Kooperberg and Stone. The spline knots are chosen as in [1], and the coefficients are obtained by the maximum likelihood method. The computer program which computes the sample density estimate is given in [21] and has been used in our study.
NNDE algorithm. The main idea behind this algorithm is revealed in the multivariate case by the formula [10]

f̂(x) = k / (2^d n Π_{j=1}^{d} |x_j − X_j(k)|),

where X(k) is the k-th nearest neighbor of x in the whole sample of observations {X(t), t = 1, ..., n}. The k-nearest neighbor density estimator depends on the coordinate system, since the distances ‖x − X(t)‖, 1 ≤ t ≤ n, vary under coordinate changes. In order to avoid this dependence, the following modification is introduced [9]: X(1), ..., X(n) is the permutation of the sample ordered by increasing values of the product Π_{j=1}^{d} |x_j − X_j(t)|. This permutation is invariant under linear transformations of the coordinate axes (excluding rotations). The standard k-nearest neighbor estimator is obtained when d = 1. It is recommended to choose k = log(n) [9].
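A minimal univariate (d = 1) sketch of the k-nearest neighbour estimator with the recommended k = log(n):

```python
import numpy as np

def nnde(x, k=None):
    """Univariate k-nearest-neighbour density estimate (d = 1).

    f_hat(t) = k / (2 n R_k(t)), with R_k(t) the distance from t to its
    k-th nearest observation; k = log(n) as recommended in [9].
    """
    x = np.sort(np.asarray(x, float))
    n = len(x)
    if k is None:
        k = max(1, int(round(np.log(n))))
    def f_hat(t):
        t = np.atleast_1d(np.asarray(t, float))
        # distance to the k-th nearest sample point for each query point
        d = np.sort(np.abs(t[:, None] - x[None, :]), axis=1)[:, k - 1]
        return k / (2.0 * n * d)
    return f_hat
```

Note the characteristic drawback visible even in this sketch: the estimate is not integrable over the whole line, since it decays like 1/|t| in the tails.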
MIDE algorithm. The density function f(x) can be obtained from the characteristic function using the inversion formula

f(x) = (2π)^{−d} ∫_{R^d} e^{−i(t,x)} ψ(t) dt, (3.4)

where ψ(t) = E e^{i(t,X)} denotes the characteristic function of the random vector X. Let us denote the characteristic function of the data projected in direction τ = t/|t| by ψ_τ. Since there is a one-to-one correspondence between densities and characteristic functions, using the inversion formula (3.4) and transforming to a spherical coordinate system, we obtain an equation (3.5) for f whose first integral denotes the surface integral over the unit sphere T. By using estimates of ψ_τ and replacing the surface integral by a sum, we obtain the formula for calculation of the estimate

f̂(x) = c(d) M^{−1} Σ_{τ∈T_0} ∫_0^{∞} u^{d−1} e^{−λu^2} Re[ e^{−iu(τ,x)} ψ̂_τ(u) ] du, (3.6)

where c(d) = d 2^{−d} π^{−d/2} / Γ(d/2 + 1), T_0 consists of M random points uniformly distributed on the sphere T, the multiplier e^{−λu^2} performs additional smoothing, and the estimator ψ̂_τ(·) is the Fourier transform of a one-dimensional estimate f̂_τ of the density of τ'X. The estimate f̂_τ is obtained in [32] by the AKDE procedure with the normal density ϕ as the kernel function, which enables analytic calculation of the integral on the right-hand side of (3.6).
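For d = 1 the inversion-formula estimator amounts to damping an estimated characteristic function by exp(−λu²) and inverting it. The following sketch uses the empirical characteristic function and a simple quadrature grid rather than the analytic integration used in the paper; grid bounds and λ are illustrative choices of ours.

```python
import numpy as np

def mide(x, lam=0.05):
    """Inversion-formula density estimate (d = 1), a sketch of (3.4)-(3.6).

    The empirical characteristic function psi_hat is damped by
    exp(-lam * u^2) and inverted on a fixed quadrature grid. With a
    Gaussian damping factor this coincides with a Gaussian kernel
    estimate of bandwidth sqrt(2 * lam).
    """
    x = np.asarray(x, float)
    t = np.linspace(-30, 30, 2001)               # quadrature grid
    dt = t[1] - t[0]
    # empirical characteristic function on the grid
    psi = np.exp(1j * t[:, None] * x[None, :]).mean(axis=1)
    damped = psi * np.exp(-lam * t ** 2)
    def f_hat(u):
        u = np.atleast_1d(np.asarray(u, float))
        integrand = np.exp(-1j * np.outer(u, t)) * damped
        return (integrand.sum(axis=1) * dt).real / (2 * np.pi)
    return f_hat
```

The damping factor makes the integral converge and controls smoothing: larger λ yields smoother, more biased estimates.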
The disadvantage of the inversion formula method (3.6) is that this estimator, built on a Gaussian mixture model, only gives good estimates for observations whose density resembles such a model. The inversion formula density estimator often becomes complicated because approximating the analysed density by a Gaussian mixture introduces many components with small a priori probabilities. Their influence can be reduced by introducing a noise cluster. In the modified density estimator equation (3.6), let the characteristic function estimator be constructed as a mixture of the characteristic functions of Gaussian distributions and of a uniform distribution with the corresponding a priori probabilities, where the uniform term describes the cluster of uniformly distributed noise, p_0 is the noise cluster weight, and a = a(τ), b = b(τ) are the bounds of the uniform distribution. The estimator is then written from the corresponding distribution parameter estimates and the projected data. The direction is τ = 1 and M = 1, and the calculations become simpler using one-dimensional (d = 1) random variables. The proposed noise cluster weight is p_0 = 0.1.

We now consider the application of these methods to the estimation of the conditional densities when the sample has been clustered. As noted at the beginning of Section 2, we assume that the observed random variable X depends on a latent random variable v taking the values 1, ..., q, and that the conditional density f_i(x) under the condition {v = i} is unimodal for i = 1, ..., q. Since formula (2.1) holds, we use f̂(x) = Σ_{i=1}^{q} p̂_i f̂_i(x) for the estimation of the density f(x) when the sample is clustered. The weights are obtained from the estimates π̂_i(x) of the posterior probabilities π_i(x) = P{v = i | X = x} on the basis of the equality p_i = E π_i(X): p̂_i = n^{−1} Σ_{t=1}^{n} π̂_i(X(t)), i = 1, ..., q.
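Combining the cluster-wise estimates into f̂(x) = Σ p̂_i f̂_i(x) can be sketched as follows; the posterior matrix is assumed to come from an EM step as in Section 2, and any of the Section 3 estimators can be passed in as `estimator` (names are ours, for illustration).

```python
import numpy as np

def clustered_density(x, pi_hat, estimator):
    """Mixture estimate f_hat(x) = sum_i p_hat_i * f_hat_i(x) after clustering.

    pi_hat    : (n, q) posterior probabilities pi_hat_i(X(t))
    estimator : callable mapping a subsample to a density function
    Observations are hard-assigned to the cluster of maximal posterior;
    the weights use p_hat_i = mean of pi_hat_i over the sample.
    """
    x = np.asarray(x, float)
    labels = pi_hat.argmax(axis=1)            # hard clustering
    p_hat = pi_hat.mean(axis=0)               # p_i = E pi_i(X)
    parts = [estimator(x[labels == i]) for i in range(pi_hat.shape[1])]
    return lambda t: sum(p * f(t) for p, f in zip(p_hat, parts))
```

Each component density is estimated on its own (unimodal) cluster, which is precisely the pre-processing benefit studied in Section 4.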

The analysis of estimation accuracy
This study was performed by comparing the previously described density estimation methods. As benchmarks, the Gaussian mixture models introduced by Marron and Wand [26] were used (see Figure 1). These distributions have been carefully chosen because they thoroughly represent many different types of challenges to non-parametric estimators. The first five represent different types of problems that can arise for unimodal densities; the densities of the remaining distributions are multimodal. The densities of distributions f) to i) are mildly multimodal, and one might hope to estimate them fairly well with a data set of moderate size. Small and medium-size samples (50, 100, 200, 400, 800, 1600, 3200 observations) were used for modeling, and 100 000 replications were generated in each case.
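Monte Carlo replications from a benchmark mixture can be generated as follows; the parameters below are illustrative 'separated bimodal'-style values of ours, not the exact Marron-Wand specification.

```python
import numpy as np

def sample_gmm(p, mu, sigma, n, rng):
    """Draw n observations from a univariate Gaussian mixture.

    p, mu, sigma : component weights, means and standard deviations.
    First a component index is drawn for each observation, then the
    observation is drawn from that component.
    """
    k = rng.choice(len(p), size=n, p=p)
    return rng.normal(np.asarray(mu, float)[k], np.asarray(sigma, float)[k])

rng = np.random.default_rng(0)
# one replication of size 800 from a well-separated two-component mixture
x = sample_gmm([0.5, 0.5], [-1.5, 1.5], [0.5, 0.5], 800, rng)
```

Repeating this draw, fitting each estimator, and averaging the error measure over replications gives the tables of Appendix B.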
Several error measures have been computed to compare the accuracy of the estimators.

The results of the analysis. The error measures of the estimators with the best parameters are given in Appendices A and B for typical models; in the other cases the error tendencies are similar. Arithmetic means over the 100 000 generated samples are given for each estimator. Appendix A graphically shows the results of density estimation without initial data clustering and after the clustering step. The symbols A, P, L, N, and M mark the AKDE, PPDE, LSDE, NNDE, and MIDE algorithms respectively (the last two, because of their significantly larger error values, are not shown in all graphs); the density estimation errors without initial data clustering are indicated by a solid line, and the errors after initial data clustering by a dashed line. The tables in Appendix B provide the accuracy results of the density estimators.
The computer experiment results qualitatively confirmed the conclusions of [32] on the appropriateness of clustering when the mixture components are easily separated, as in the case of the 'separated bimodal' density. Clustering is also useful for larger samples (from 800 observations) in the case of asymmetric skewed unimodal densities. When clustering fails to isolate the components of a mixture, i.e., when the density has a strongly expressed single peak or outlying observations (such as the kurtotic and outlier densities), primary clustering brings no benefit.
In this section, we discuss the results of applying the density estimators in more detail: 1. The PPDE algorithm gave the best estimates when the components of the mixtures were separated well enough during clustering, i.e., were closest to Gaussian densities; besides the Gaussian density itself, these are the separated bimodal densities. This contrasts with [32], where multidimensional densities were used; the projection-based estimates in this study had the disadvantage that only one projection can be used in the one-dimensional case. We analysed a wide range of univariate distribution densities of different types. Based on the results, we can state that there is no universal estimator suitable for most cases. The selection of a particular density estimator depends on the shape of the density function, which can be characterized by descriptive statistics: skewness, kurtosis, percentage of outliers in the sample, number of modes, etc.
Empirical examples. We compared density estimates of 519 observations of weekly changes of municipal solid waste data (kg/capita) in a medium-sized Eastern European city (Kaunas, Lithuania) from 2000 to 2009, analysed in [29]. We chose the three density estimators of the highest accuracy as suggested by the results of our comparative study.
The top row of panels of Figure 2 illustrates the density estimates without data clustering (solid line) and after initial data clustering (dashed line). The peak height is significantly lower for the PPDE estimates, i.e., the peak is underestimated, reflecting the tendency of the PPDE algorithm in the univariate case to fit transformations that reveal too little of the distribution structure. The same density estimates are illustrated in the bottom row of panels of Figure 2, with a log-scaled vertical axis to emphasize tail-fitting performance. The bumps generated by the outliers are not visible in the tails. A similar pattern was exhibited by the kurtotic unimodal density in Figure 1.

Figure 2. Density of the municipal solid waste data (kg capita^{-1} week^{-1}) changes (PPDE, AKDE, LSDE).

LSDE is the best estimator for this density type when the sample size is more than 400 observations. Statistical data analysis [29] showed that the amount of mixed municipal waste was changing significantly and that statistically important periodic (seasonal) variations exist. By studying the time series data and the form of the distribution density of the changes of the waste amount generated in households, we can see that the largest changes are observed before and after the main holidays of the year. This can be explained by the increased and decreased consumption per capita. An increase of waste is also observed in the middle of spring and at the beginning of autumn, and a sharp drop in early summer and winter. The relation between the amount of accumulated municipal waste and GDP can be influenced not only by the specificity of municipal waste production but also by the fact that fluctuation of national income does not necessarily change the main household expenses [5]. For example, a decrease of GDP may have no influence on waste-producing consumption, yet it can reduce saving. Thus, accumulation of municipal waste can be explained by the part of GDP spent on consumption by private economies. The amount of waste generated by the population stabilizes quickly after a change in household habits; this causes the form of the density peak.
We expand the study [8] of short-term forecasting of the generation of municipal solid waste fractions (kg capita^{-1} month^{-1}). In that study a bootstrapping technique was applied under the assumption that the distribution densities of the waste fractions in Kutaisi, Saint-Petersburg, and Boryspil coincide with those in Kaunas. For homogeneity we use tests based on the density estimators described in the previous section. For two given samples of independent observations with unknown probability density functions f(x) and g(x), it is required to test the homogeneity hypothesis H_0: f = g considered by Bakshaev [2] (the original version was proposed by Rudzkis). The test statistic T_{n,m} for the hypothesis of homogeneity of two samples X and Y is built from kernel density estimates compared on randomly split parts of the samples X and Y, where K(u, v) is a kernel function of the difference between u and v. K is a non-negative function that integrates to 1; usually, but not always, K is a symmetric probability density function. The critical region of the test statistic T_{n,m} is established by the following statement.
Under the null hypothesis the statistic T_{n,m} asymptotically (n, m → ∞ and n/m → ρ > 0) has a Gaussian distribution with mean zero and variance σ^2 (see the proof of the theorem in [2]).
In this study we used two tests of homogeneity, the Kolmogorov-Smirnov test (denoted K-S) and the Rudzkis-Bakshaev test (denoted R-B), with three different density estimators. One of the steps leading to the main result was to check the homogeneity between the densities of the Kaunas municipal solid waste fractions and the densities of the municipal solid waste fractions of the other cities. All results (p-values) of the homogeneity tests between the densities of the municipal solid waste fractions are shown in Tables 1-3.
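A two-sample homogeneity check of the K-S type can be run with SciPy (assuming it is available); the samples below are synthetic placeholders of ours, not the study's waste-fraction measurements.

```python
import numpy as np
from scipy import stats

# Two hypothetical city samples drawn from the same distribution,
# mimicking the comparison of waste-fraction distributions between cities.
rng = np.random.default_rng(1)
city_a = rng.gamma(shape=2.0, scale=1.0, size=519)
city_b = rng.gamma(shape=2.0, scale=1.0, size=300)

# Two-sample Kolmogorov-Smirnov test of H0: the samples share one density.
stat, p_value = stats.ks_2samp(city_a, city_b)
# A large p-value means homogeneity is not rejected.
```

The R-B test follows the same pattern, with the K-S statistic replaced by the kernel-based statistic T_{n,m} described above.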
This study has shown that the densities of the Kutaisi, Saint-Petersburg, and Boryspil municipal solid waste fractions are similar to the densities of the Kaunas municipal solid waste fractions (the differences are not statistically significant).

Conclusions and future work
This paper examines several statistical estimation procedures for a mixture of Gaussian densities. In practice, it is not easy to select a robust estimation procedure if the data density is multimodal and the sample size is small. The authors have introduced an inversion formula method with a uniform noise component for density estimation. The proposed method, together with other popular estimators, was validated by means of computer simulation.
The quality of multimodal distribution density statistical estimation increases significantly if the sample is pre-clustered (i.e., the multimodal density function is approximated by a mixture of unimodal densities), and the density estimation methods are applied to each cluster afterwards.
The application of statistical methods revealed the relationships and dependencies of municipal solid waste data, and the results could be used in supporting decision making.
As a future work it might be worthwhile to overview the current recommendation systems and propose one for the selection of the most appropriate density estimators for the given sample.

Appendix A
A performance analysis of the preliminary data clustering is provided here. Each figure corresponds to the dependence of the estimation error on the sample size for a different density.

Appendix B
The density estimation errors are given in the tables. For each method, the error without the initial clustering is given in the first row, and the error after the initial data clustering in the second row.