Accuracy of nonparametric density estimation for univariate Gaussian mixture models: A comparative study

    Jurgita Arnastauskaitė; Tomas Ruzgas

Abstract

Flexible and reliable probability density estimation is fundamental in unsupervised learning and classification. Finite Gaussian mixture models are commonly used for this purpose. However, the parametric form of the distribution is not always known; in this case, nonparametric density estimation methods are used. These methods typically become computationally demanding as the number of mixture components increases. In this paper, the accuracy of several nonparametric density estimators is compared by means of simulation. The following approaches are considered: an adaptive-bandwidth kernel estimator, a projection pursuit estimator, a logspline estimator, and a k-nearest neighbor estimator. It is concluded that clustering the data as a pre-processing step improves the estimation of mixture densities; however, when the data do not have clearly defined clusters, the pre-processing step provides little advantage. The application of the density estimators is illustrated using municipal solid waste data collected in Kaunas (Lithuania). The data distribution is similar to a benchmark distribution introduced by Marron and Wand (i.e., a kurtotic unimodal density). Based on homogeneity tests, it is concluded that the distributions of the municipal solid waste fractions in Kutaisi (Georgia), Saint Petersburg (Russia), and Boryspil (Ukraine) do not differ statistically from the distribution of waste fractions in Kaunas. The distribution of the waste data collected in Kaunas follows the pattern described by Marron and Wand (i.e., it has one mode and pronounced kurtosis).
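The comparison described above can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: it simulates an arbitrary two-component Gaussian mixture, then compares a kernel density estimator (Gaussian kernel with the default Scott's-rule bandwidth, not the adaptive bandwidth studied in the paper) against a simple k-nearest-neighbor estimator via integrated squared error on a grid. All parameter choices (weights, means, sample size, k) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

# Illustrative sketch (not the paper's implementation): compare a kernel
# estimator with a k-nearest-neighbor estimator on a simulated
# two-component Gaussian mixture.  All parameters are arbitrary choices.
rng = np.random.default_rng(0)
w, mu, sd = [0.5, 0.5], [-2.0, 2.0], [1.0, 0.6]
n = 500
comp = rng.choice(2, size=n, p=w)                    # component labels
x = rng.normal(np.take(mu, comp), np.take(sd, comp)) # mixture sample

def true_pdf(t):
    """Exact density of the simulated mixture."""
    return sum(wi * norm.pdf(t, mi, si) for wi, mi, si in zip(w, mu, sd))

def knn_density(t, sample, k=20):
    """k-NN estimate: f(t) ~ k / (2 n d_k(t)), d_k = distance to the
    k-th nearest sample point."""
    d = np.abs(np.asarray(sample)[None, :] - np.asarray(t)[:, None])
    dk = np.sort(d, axis=1)[:, k - 1]
    return k / (2.0 * len(sample) * dk)

grid = np.linspace(-6.0, 6.0, 400)
step = grid[1] - grid[0]
kde = gaussian_kde(x)  # Gaussian kernel, Scott's-rule bandwidth by default
ise_kde = np.sum((kde(grid) - true_pdf(grid)) ** 2) * step
ise_knn = np.sum((knn_density(grid, x) - true_pdf(grid)) ** 2) * step
print(f"ISE kernel: {ise_kde:.5f}  k-NN: {ise_knn:.5f}")
```

In a simulation study such as the paper's, this error computation would be repeated over many Monte Carlo replications and averaged to approximate the mean integrated squared error of each estimator.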

Keywords: univariate probability density, nonparametric density estimation, homogeneity test, sample clustering, Monte Carlo method, municipal solid waste
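The homogeneity comparison between cities can likewise be sketched in a few lines. The paper relies on N-distance-based homogeneity tests (Bakshaev, 2009); the two-sample Kolmogorov-Smirnov test below is only an illustrative stand-in, and the gamma-distributed samples are hypothetical placeholders for waste-fraction measurements, not the actual data.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical stand-in data: two synthetic "city" samples drawn from the
# same gamma distribution, standing in for waste-fraction measurements.
# The KS test replaces the paper's N-distance test purely for illustration.
rng = np.random.default_rng(1)
kaunas_sample = rng.gamma(shape=2.0, scale=3.0, size=200)
kutaisi_sample = rng.gamma(shape=2.0, scale=3.0, size=200)  # same law

stat, p_value = ks_2samp(kaunas_sample, kutaisi_sample)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.3f}")
# A large p-value gives no evidence against homogeneity of the two samples.
```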

How to Cite
Arnastauskaitė, J., & Ruzgas, T. (2020). Accuracy of nonparametric density estimation for univariate Gaussian mixture models: A comparative study. Mathematical Modelling and Analysis, 25(4), 622-641. https://doi.org/10.3846/mma.2020.10505
Published in Issue
Oct 13, 2020
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

References

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974. https://doi.org/10.1109/TAC.1974.1100705

A. Bakshaev. Goodness of fit and homogeneity tests on the basis of N-distances. Journal of Statistical Planning and Inference, 139(11):3750–3758, 2009. ISSN 0378-3758. Special Issue: The 8th Tartu Conference on Multivariate Statistics & The 6th Conference on Multivariate Distributions with Fixed Marginals. https://doi.org/10.1016/j.jspi.2009.05.014

M.D. Burke and E. Gombay. The bootstrapped maximum likelihood estimator with an application. Statistics & Probability Letters, 12(5):421–427, 1991. https://doi.org/10.1016/0167-7152(91)90031-L

S. Chernova and M. Veloso. Confidence-based policy learning from demonstration using Gaussian mixture models. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS’07, pp. 1–8, New York, NY, USA, 2007. Association for Computing Machinery. https://doi.org/10.1145/1329125.1329407

K.M. Christiansen and C. Fischer. Baseline projections of selected waste streams. Development of Methodology, European Environment Agency, Copenhagen, 1999. Technical Report No. 28.

J. Ćwik and J. Koronacki. Multivariate density estimation: A comparative study. Neural Computing & Applications, 6(3):173–185, 1997. https://doi.org/10.1007/BF01413829

A.P. Dempster, N.M. Laird and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x

G. Denafas, T. Ruzgas, D. Martuzevičius, S. Shmarin, M. Hoffmann, V. Mykhaylenko, S. Ogorodnik, M. Romanov, E. Neguliaeva, A. Chusov et al. Seasonal variation of municipal solid waste generation and composition in four East European cities. Resources, Conservation and Recycling, 89:22–30, 2014. https://doi.org/10.1016/j.resconrec.2014.06.001

L. Devroye and A. Krzyżak. New multivariate product density estimators. Journal of Multivariate Analysis, 82(1):88–110, 2002. https://doi.org/10.1006/jmva.2001.2021

E. Fix and J.L. Hodges. Discriminatory analysis: nonparametric discrimination, consistency properties. Interscience Publishers, New York, 1951.

J.H. Friedman. Exploratory projection pursuit. Journal of the American Statistical Association, 82(397):249–266, 1987. https://doi.org/10.1080/01621459.1987.10478427

J.H. Friedman, W. Stuetzle and A. Schroeder. Projection pursuit density estimation. Journal of the American Statistical Association, 79(387):599–608, 1984. https://doi.org/10.1080/01621459.1984.10478086

S. Gruszczyński. The assessment of variability of the concentration of chromium in soils with the application of neural networks. Polish Journal of Environmental Studies, 14(6):743–751, 2005.

Y. Huang, K.B. Englehart, B. Hudgins and A.D.C. Chan. A Gaussian mixture model based classification scheme for myoelectric control of powered upper limb prostheses. IEEE Transactions on Biomedical Engineering, 52(11):1801–1811, 2005. https://doi.org/10.1109/TBME.2005.856295

P. Hall. The bootstrap and Edgeworth expansion. Springer, New York, NY, 1992. https://doi.org/10.1007/978-1-4612-4384-7

T.K. Henthorn, J. Benitez, M.J. Avram, C. Martinez, A. Llerena, J. Cobaleda, T.C. Krejcie and R.D. Gibbons. Assessment of the debrisoquin and dextromethorphan phenotyping tests by Gaussian mixture distributions analysis. Clinical Pharmacology & Therapeutics, 45(3):328–333, 1989. https://doi.org/10.1038/clpt.1989.36

P.J. Huber. Projection pursuit. The Annals of Statistics, 13(2):435–475, 1985. https://doi.org/10.1214/aos/1176349519

J.-N. Hwang, Sh.-R. Lay and A. Lippman. Nonparametric multivariate density estimation: a comparative study. IEEE Transactions on Signal Processing, 42(10):2795–2810, 1994. https://doi.org/10.1109/78.324744

M.C. Jones, J.S. Marron and S.J. Sheather. A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91(433):401–407, 1996. https://doi.org/10.1080/01621459.1996.10476701

M. Kavaliauskas, R. Rudzkis and T. Ruzgas. The projection-based multi-variate distribution density estimation. Acta Comment. Univ. Tartu. Math., Tartu University Press, 8:135–141, 2004.

C. Kooperberg. Logspline software, 2013. Available from Internet: http://bear.fhcrc.org

Ch. Kooperberg and C.J. Stone. A study of logspline density estimation. Computational Statistics & Data Analysis, 12(3):327–347, 1991. https://doi.org/10.1016/0167-9473(91)90115-I

J.A. Kovacs, C. Helmick and W. Wriggers. A balanced approach to adaptive probability density estimation. Frontiers in Molecular Biosciences, 4:25, 2017. https://doi.org/10.3389/fmolb.2017.00025

Ch. Léger, D.N. Politis and J.P. Romano. Bootstrap technology and applications. Technometrics, 34(4):378–398, 1992. https://doi.org/10.1080/00401706.1992.10484950

D. Li, K. Yang and W.H. Wong. Density estimation via discrepancy based adaptive sequential partition. In D.D. Lee, M. Sugiyama, U.V. Luxburg, I. Guyon and R. Garnett (Eds.), Advances in Neural Information Processing Systems 29, pp. 1091–1099. Curran Associates, Inc., 2016. Available from Internet: http://papers.nips.cc/paper/6217-density-estimation-via-discrepancy-based-adaptive-sequential-partition.pdf

J.S. Marron and M.P. Wand. Exact mean integrated squared error. Annals of Statistics, 20(2):712–736, 1992. https://doi.org/10.1214/aos/1176348653

G.J. McLachlan and T. Krishnan. The EM algorithm and extensions. John Wiley & Sons, Inc., 2008. https://doi.org/10.1002/9780470191613

G.J. McLachlan and D. Peel. Finite mixture models. John Wiley & Sons, Inc., 2000. https://doi.org/10.1002/0471721182

I. Rimaitytė, T. Ruzgas, G. Denafas, V. Račys and D. Martuzevicius. Application and evaluation of forecasting methods for municipal solid waste generation in an eastern-European city. Waste Management & Research, 30(1):89–98, 2012. https://doi.org/10.1177/0734242X10396754

J. Rothfuss, F. Ferreira, S. Walther and M. Ulrich. Conditional density estimation with neural networks: Best practices and benchmarks. arXiv preprint arXiv:1903.00954, 2019.

R. Rudzkis and M. Radavičius. Statistical estimation of a mixture of Gaussian distributions. Acta Applicandae Mathematica, 38(1):37–54, 1995. https://doi.org/10.1007/BF00992613

T. Ruzgas, R. Rudzkis and M. Kavaliauskas. Application of clustering in the nonparametric estimation of distribution density. Nonlinear Analysis: Modeling and Control, 11(4):393–411, 2006. https://doi.org/10.15388/NA.2006.11.4.14741

B.W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, 1986.

R. Smidtaite. Application of nonlinear statistics for distribution density estimation of random vectors. Kaunas University of Technology, 2008. (MS thesis).

C. Stachniss, C. Plagemann and A.J. Lilienthal. Learning gas distribution models using sparse Gaussian process mixtures. Autonomous Robots, 26(2):187–202, 2009. https://doi.org/10.1007/s10514-009-9111-5

M.A. Wong. A bootstrap testing procedure for investigating the number of subpopulations. Journal of Statistical Computation and Simulation, 22(2):99–112, 1985. https://doi.org/10.1080/00949658508810837

M.W. Zimmerman, R.J. Povinelli, M.T. Johnson and K.M. Ropella. A reconstructed phase space approach for distinguishing ischemic from non-ischemic ST changes using Holter ECG data. In Computers in Cardiology, 2003, pp. 243–246, 2003. https://doi.org/10.1109/CIC.2003.1291136