Department of Statistics Unitmark
Dietrich College of Humanities and Social Sciences

High-Dimensional Statistics

In statistical theory, the field of high-dimensional statistics studies data whose dimension is larger than dimensions considered in classical multivariate analysis. High-dimensional statistics relies on the theory of random vectors. In many applications, the dimension of the data vectors may be larger than the sample size.

A Spectral Series Approach to High-Dimensional Nonparametric Regression

Authors:
Lee, A. B., & Izbicki, R.
Publication Date:
February, 2016

A key question in modern statistics is how to make fast and reliable inferences for complex, high-dimensional data. While there has been much interest in sparse techniques, current methods do not generalize well to data with nonlinear structure. In this work, we present an orthogonal series estimator for predictors that are complex aggregate objects, such as natural images, galaxy spectra, trajectories, and movies. Our series approach ties together ideas from manifold learning, kernel machine learning, and Fourier methods. We expand the unknown regression on the data in terms of the eigenfunctions of a kernel-based operator, and we take advantage of orthogonality of the basis with respect to the underlying data distribution, P, to speed up computations and tuning of parameters. If the kernel is appropriately chosen, then the eigenfunctions adapt to the intrinsic geometry and dimension of the data. We provide theoretical guarantees for a radial kernel with varying bandwidth, and we relate smoothness of the regression function with respect to P to sparsity in the eigenbasis. Finally, using simulated and real-world data, we systematically compare the performance of the spectral series approach with classical kernel smoothing, k-nearest neighbors regression, kernel ridge regression, and state-of-the-art manifold and local regression methods.

Lab Groups:

A unified framework for constructing, tuning and assessing photometric redshift density estimates in a selection bias setting

Authors:
P. E. Freeman, R. Izbicki, & A. B. Lee
Publication Date:
July, 2017

Photometric redshift estimation is an indispensable tool of precision cosmology. One problem that plagues the use of this tool in the era of large-scale sky surveys is that the bright galaxies that are selected for spectroscopic observation do not have properties that match those of (far more numerous) dimmer galaxies; thus, ill-designed empirical methods that produce accurate and precise redshift estimates for the former generally will not produce good estimates for the latter. In this paper, we provide a principled framework for generating conditional density estimates (i.e. photometric redshift PDFs) that takes into account selection bias and the covariate shift that this bias induces. We base our approach on the assumption that the probability that astronomers label a galaxy (i.e. determine its spectroscopic redshift) depends only on its measured (photometric and perhaps other) properties x and not on its true redshift. With this assumption, we can explicitly write down risk functions that allow us to both tune and compare methods for estimating importance weights (i.e. the ratio of densities of unlabeled and labeled galaxies for different values of x) and conditional densities. We also provide a method for combining multiple conditional density estimates for the same galaxy into a single estimate with better properties. We apply our risk functions to an analysis of approximately one million galaxies, mostly observed by SDSS, and demonstrate through multiple diagnostic tests that our method achieves good conditional density estimates for the unlabeled galaxies.

Lab Groups:

Converting High-Dimensional Regression to High-Dimensional Conditional Density Estimation

Authors:
Izbicki, R., & Lee, A. B.
Publication Date:
July, 2017

There is a growing demand for nonparametric conditional density estimators (CDEs) in fields such as astronomy and economics. In astronomy, for example, one can dramatically improve estimates of the parameters that dictate the evolution of the Universe by working with full conditional densities instead of regression (i.e., conditional mean) estimates. More generally, standard regression falls short in any prediction problem where the distribution of the response is more complex with multi-modality, asymmetry or heteroscedastic noise. Nevertheless, much of the work on high-dimensional inference concerns regression and classification only, whereas research on density estimation has lagged behind. Here we propose FlexCode, a fully nonparametric approach to conditional density estimation that reformulates CDE as a non-parametric orthogonal series problem where the expansion coefficients are estimated by regression. By taking such an approach, one can efficiently estimate conditional densities and not just expectations in high dimensions by drawing upon the success in high-dimensional regression. Depending on the choice of regression procedure, our method can adapt to a variety of challenging high-dimensional settings with different structures in the data (e.g., a large number of irrelevant components and nonlinear manifold structure) as well as different data types (e.g., functional data, mixed data types and sample sets). We study the theoretical and empirical performance of our proposed method, and we compare our approach with traditional conditional density estimators on simulated as well as real-world data, such as photometric galaxy data, Twitter data, and line-of-sight velocities in a galaxy cluster.

Lab Groups:

High-Dimensional Density Ratio Estimation with Extensions to Approximate Likelihood Computation

Authors:
Izbicki, R., Lee, A. B., & Schafer, C. M.
Publication Date:
January, 2014

The ratio between two probability density functions is an important component of various tasks, including selection bias correction, novelty detection and classification. Recently, several estimators of this ratio have been proposed. Most of these methods fail if the sample space is high-dimensional, and hence require a dimension reduction step, the result of which can be a significant loss of information. Here we propose a simple-to-implement, fully nonparametric density ratio estimator that expands the ratio in terms of the eigenfunctions of a kernel-based operator; these functions reflect the underlying geometry of the data (e.g., submanifold structure), often leading to better estimates without an explicit dimension reduction step. We show how our general framework can be extended to address another important problem, the estimation of a likelihood function in situations where that function cannot be well-approximated by an analytical form. One is often faced with this situation when performing statistical inference with data from the sciences, due the complexity of the data and of the processes that generated those data. We emphasize applications where using existing likelihood-free methods of inference would be challenging due to the high dimensionality of the sample space, but where our spectral series method yields a reasonable estimate of the likelihood function. We provide theoretical guarantees and illustrate the effectiveness of our proposed method with numerical experiments.

Lab Groups:

New image statistics for detecting disturbed galaxy morphologies at high redshift

Authors:
P. E. Freeman, R. Izbicki, A. B. Lee, J. A. Newman, C. J. Conselice, A. M. Koekemoer, J. M. Lotz, & M. Mozena
Publication Date:
September, 2013

Testing theories of hierarchical structure formation requires estimating the distribution of galaxy morphologies and its change with redshift. One aspect of this investigation involves identifying galaxies with disturbed morphologies (e.g. merging galaxies). This is often done by summarizing galaxy images using, e.g. the concentration, asymmetry and clumpiness and Gini-M20 statistics of Conselice and Lotz et al., respectively, and associating particular statistic values with disturbance. We introduce three statistics that enhance detection of disturbed morphologies at high redshift (z ˜ 2): the multimode (M), intensity (I) and deviation (D) statistics. We show their effectiveness by training a machine-learning classifier, random forest, using 1639 galaxies observed in the H band by the Hubble Space Telescope WFC3, galaxies that had been previously classified by eye by the Cosmic Assembly Near-IR Deep Extragalactic Legacy Survey collaboration. We find that the MID statistics (and the A statistic of Conselice) are the most useful for identifying disturbed morphologies.

We also explore whether human annotators are useful for identifying disturbed morphologies. We demonstrate that they show limited ability to detect disturbance at high redshift, and that increasing their number beyond ≈10 does not provably yield better classification performance. We propose a simulation-based model-fitting algorithm that mitigates these issues by bypassing annotation.

Lab Groups:

Nonparametric Conditional Density Estimation in a High-Dimensional Regression Setting

Authors:
Izbicki, R., & Lee, A. B.
Publication Date:
October, 2016

In some applications (e.g., in cosmology and economics), the regression E[Z|x] is not adequate to represent the association between a predictor x and a response Z because of multi-modality and asymmetry of f(z|x); using the full density instead of a single-point estimate can then lead to less bias in subsequent analysis. As of now, there are no effective ways of estimating f(z|x) when x represents high-dimensional, complex data. In this article, we propose a new nonparametric estimator of f(z|x) that adapts to sparse (low-dimensional) structure in x. By directly expanding f(z|x) in the eigenfunctions of a kernel-based operator, we avoid tensor products in high dimensions as well as ratios of estimated densities. Our basis functions are orthogonal with respect to the underlying data distribution, allowing fast implementation and tuning of parameters. We derive rates of convergence and show that the method adapts to the intrinsic dimension of the data. We also demonstrate the effectiveness of the series method on images, spectra, and an application to photometric redshift estimation of galaxies. Supplementary materials for this article are available online.

Lab Groups:

RFCDE: Random Forests for Conditional Density Estimation

Authors:
Pospisil, T., & Lee, A. B.
Publication Date:
May, 2018

Random forests is a common non-parametric regression technique which performs well for mixed-type data and irrelevant covariates, while being robust to monotonic variable transformations. Existing random forest implementations target regression or classification. We introduce the RFCDE package for fitting random forest models optimized for nonparametric conditional density estimation, including joint densities for multiple responses. This enables analysis of conditional probability distributions which is useful for propagating uncertainty and of joint distributions that describe relationships between multiple responses and covariates. RFCDE is released under the MIT open-source license and can be accessed at this https URL . Both R and Python versions, which call a common C++ library, are available.

Lab Groups: