
Boundary behavior in High Dimension, Low Sample Size asymptotics of PCA



Jung, Sungkyu, Sen, Arusharka and Marron, J.S. (2012) Boundary behavior in High Dimension, Low Sample Size asymptotics of PCA. Journal of Multivariate Analysis, 109 . pp. 190-203. ISSN 0047259X

Text (application/pdf)
sen2012.pdf - Accepted Version

Official URL: http://dx.doi.org/10.1016/j.jmva.2012.03.005


In High Dimension, Low Sample Size (HDLSS) data situations, where the dimension d is much larger than the sample size n, principal component analysis (PCA) plays an important role in statistical analysis. Under which conditions does the sample PCA reflect the population covariance structure well? We answer this question in a relevant asymptotic context where d grows and n is fixed, under a generalized spiked covariance model. Specifically, we assume the largest population eigenvalues to be of the order d^α, where α < 1, α = 1, or α > 1. Earlier results show the conditions for consistency and strong inconsistency of eigenvectors of the sample covariance matrix. In the boundary case, α = 1, where the sample PC directions are neither consistent nor strongly inconsistent, we show that eigenvalues and eigenvectors do not degenerate but have limiting distributions. The result smoothly bridges the phase transition represented by the other two cases, and thus gives a spectrum of limits for the sample PCA in the HDLSS asymptotics. While the results hold under a general situation, the limiting distributions under the Gaussian assumption are illustrated in greater detail. In addition, the geometric representation of HDLSS data is extended to give three different representations, which depend on the magnitude of variances in the first few principal components.
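The three regimes described in the abstract can be explored numerically. The following is a minimal simulation sketch, not the authors' code. It assumes a single-spike Gaussian model with mean zero, one population eigenvalue equal to d^α along the first coordinate axis, and unit variance elsewhere; the function name `first_pc_angle` and the values of d, n, and the seed are illustrative choices. It measures the angle between the first sample PC direction and the population one: small angles correspond to consistency, angles near 90 degrees to strong inconsistency, and for α = 1 the angle is expected to stay random rather than settle at either extreme.

```python
import numpy as np


def first_pc_angle(d, n, alpha, rng):
    """Angle in degrees between the first sample PC direction and the
    first population PC direction (the first coordinate axis).

    Assumed model (illustrative only): mean-zero Gaussian data with one
    spiked eigenvalue d**alpha and all other eigenvalues equal to 1.
    """
    # n observations in d dimensions, stored as rows.
    X = rng.standard_normal((n, d))
    # Inflate the first coordinate so that its variance is d**alpha.
    X[:, 0] *= np.sqrt(float(d) ** alpha)
    # With a known zero mean, the sample covariance is X.T @ X / n, and its
    # leading eigenvector is the first right singular vector of X.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    # The sign of a singular vector is arbitrary, so use the absolute value.
    cosine = min(abs(vt[0, 0]), 1.0)
    return float(np.degrees(np.arccos(cosine)))


rng = np.random.default_rng(0)
d, n, reps = 20000, 10, 20
for alpha in (0.5, 1.0, 1.5):
    angles = [first_pc_angle(d, n, alpha, rng) for _ in range(reps)]
    print(f"alpha={alpha}: mean angle {np.mean(angles):.1f} deg, "
          f"std {np.std(angles):.1f} deg")
```

Repeating the run with larger d while keeping n fixed gives a rough feel for the d-growing, n-fixed asymptotics; the spread of angles at α = 1 is the quantity whose limiting distribution the paper studies.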

Divisions: Concordia University > Faculty of Arts and Science > Mathematics and Statistics
Item Type: Article
Authors: Jung, Sungkyu and Sen, Arusharka and Marron, J.S.
Journal or Publication: Journal of Multivariate Analysis
Digital Object Identifier (DOI): 10.1016/j.jmva.2012.03.005
Keywords: Principal component analysis; High Dimension Low Sample Size; Geometric representation; ρ-mixing; Consistency and strong inconsistency; Spiked covariance model
ID Code: 976820
Deposited On: 29 Jan 2013 13:28
Last Modified: 18 Jan 2018 17:43


All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
