Department of Statistics
Dietrich College of Humanities and Social Sciences

High-Dimensional Statistics

In statistical theory, the field of high-dimensional statistics studies data whose dimension is larger than the dimensions considered in classical multivariate analysis. It relies on the theory of random vectors. In many applications, the dimension of the data vectors may exceed the sample size.
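One consequence of the dimension exceeding the sample size can be illustrated with a minimal sketch. The simulated Gaussian data and the values of n and p below are assumptions chosen only for illustration. With n observations in p > n dimensions, the sample covariance matrix has rank at most n - 1, so it cannot be inverted, and classical procedures that need its inverse break down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setting: n = 20 observations of a p = 50 dimensional vector.
n, p = 20, 50
X = rng.normal(size=(n, p))

# Sample covariance matrix (p x p); columns of X are the variables.
S = np.cov(X, rowvar=False)

# Centering removes one degree of freedom, so rank(S) <= n - 1 < p.
rank = np.linalg.matrix_rank(S)
print(S.shape, rank)
```

Because the rank is below p, S is singular here. This is one reason methods such as regularized covariance estimation are studied in this setting.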

There are currently no projects for this area of research.

A unified framework for constructing, tuning and assessing photometric redshift density estimates in a selection bias setting

Photometric redshift estimation is an indispensable tool of precision cosmology. One problem that plagues the use of this tool in the era of large-scale sky surveys is that the bright galaxies that are selected for spectroscopic observation do not have properties that match those of (far more numerous) dimmer galaxies; thus, ill-designed empirical methods that produce accurate and precise redshift estimates for the former generally will not produce good estimates for the latter. In this paper, we provide a principled framework for generating conditional density estimates (i.e. photometric redshift PDFs) that takes into account selection bias and the covariate shift that this bias induces. We base our approach on the assumption that the probability that astronomers label a galaxy (i.e. determine its spectroscopic redshift) depends only on its measured (photometric and perhaps other) properties x and not on its true redshift. With this assumption, we can explicitly write down risk functions that allow us to both tune and compare methods for estimating importance weights (i.e. the ratio of densities of unlabeled and labeled galaxies for different values of x) and conditional densities. We also provide a method for combining multiple conditional density estimates for the same galaxy into a single estimate with better properties. We apply our risk functions to an analysis of approximately one million galaxies, mostly observed by SDSS, and demonstrate through multiple diagnostic tests that our method achieves good conditional density estimates for the unlabeled galaxies.
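The idea of tuning an importance-weight estimator with a risk function can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation. The weight is β(x) = f_U(x) / f_L(x), the ratio of the unlabeled to the labeled density. One standard squared-error risk, ∫(β̂ − β)² dF_L, equals E_L[β̂²] − 2 E_U[β̂] plus a constant, so it can be estimated from held-out labeled and unlabeled samples without knowing β. The one-dimensional simulated covariate, the nearest-neighbor estimator `knn_weights`, the helper `weight_risk`, and the grid of k values are all hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 1-D covariate (assumption): the unlabeled sample is shifted
# relative to the labeled one, so the true weight is beta(x) = exp(x - 0.5).
x_lab_train, x_lab_val = rng.normal(0.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000)
x_unl_train, x_unl_val = rng.normal(1.0, 1.0, 1000), rng.normal(1.0, 1.0, 1000)


def knn_weights(x_eval, x_lab, x_unl, k):
    """Hypothetical kNN estimate of beta(x) = f_unlabeled(x) / f_labeled(x)."""
    # Distance from each evaluation point to its k-th nearest labeled point.
    radius = np.sort(np.abs(x_eval[:, None] - x_lab[None, :]), axis=1)[:, k - 1]
    # Number of unlabeled training points within that distance.
    n_near = (np.abs(x_eval[:, None] - x_unl[None, :]) <= radius[:, None]).sum(axis=1)
    # Ratio of the unlabeled fraction to the labeled fraction in the neighborhood.
    return (n_near / len(x_unl)) / (k / len(x_lab))


def weight_risk(w_on_labeled, w_on_unlabeled):
    """Empirical squared-error risk, up to a constant not depending on the estimate."""
    return np.mean(w_on_labeled ** 2) - 2.0 * np.mean(w_on_unlabeled)


# Tune k by minimizing the estimated risk on held-out samples.
k_grid = [5, 20, 50, 100, 200]
risks = {
    k: weight_risk(
        knn_weights(x_lab_val, x_lab_train, x_unl_train, k),
        knn_weights(x_unl_val, x_lab_train, x_unl_train, k),
    )
    for k in k_grid
}
best_k = min(risks, key=risks.get)
w_val = knn_weights(x_lab_val, x_lab_train, x_unl_train, best_k)
print(best_k, risks[best_k])
```

The same estimated risk can be used to compare entirely different weight estimators, since each is scored on the same held-out samples.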

New image statistics for detecting disturbed galaxy morphologies at high redshift

Testing theories of hierarchical structure formation requires estimating the distribution of galaxy morphologies and its change with redshift. One aspect of this investigation involves identifying galaxies with disturbed morphologies (e.g. merging galaxies). This is often done by summarizing galaxy images using, e.g. the concentration, asymmetry and clumpiness statistics of Conselice and the Gini-M20 statistics of Lotz et al., and associating particular statistic values with disturbance. We introduce three statistics that enhance detection of disturbed morphologies at high redshift (z ∼ 2): the multimode (M), intensity (I) and deviation (D) statistics. We show their effectiveness by training a machine-learning classifier, random forest, using 1639 galaxies observed in the H band by the Hubble Space Telescope WFC3, galaxies that had been previously classified by eye by the Cosmic Assembly Near-IR Deep Extragalactic Legacy Survey collaboration. We find that the MID statistics (and the A statistic of Conselice) are the most useful for identifying disturbed morphologies.
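As an illustration of what an image summary statistic looks like, the sketch below computes the Gini statistic, one of the existing summaries named above. It uses the usual sorted-pixel formula G = Σ_i (2i − n − 1)|X_i| / (X̄ n (n − 1)), where the |X_i| are sorted in increasing order and X̄ is their mean. The function name `gini` and the toy images are assumptions made for this example. In practice the pixels would be restricted to a galaxy's segmentation map, which is not modeled here. The M, I and D statistics are not sketched because their definitions are not given in this text.

```python
import numpy as np


def gini(pixels):
    """Gini statistic of pixel fluxes: 0 for uniform flux, 1 for a single lit pixel."""
    # Sort absolute pixel values in increasing order.
    x = np.sort(np.abs(np.asarray(pixels, dtype=float).ravel()))
    n = x.size
    i = np.arange(1, n + 1)
    return np.sum((2 * i - n - 1) * x) / (x.mean() * n * (n - 1))


# Toy images (assumed): perfectly uniform flux vs. all flux in one pixel.
flat = np.ones((8, 8))
point = np.zeros((8, 8))
point[3, 4] = 5.0

g_flat = gini(flat)
g_point = gini(point)
print(g_flat, g_point)
```

Values near 0 indicate flux spread evenly over the pixels, and values near 1 indicate flux concentrated in a few pixels.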

We also explore whether human annotators are useful for identifying disturbed morphologies. We demonstrate that they show limited ability to detect disturbance at high redshift, and that increasing their number beyond ≈10 does not provably yield better classification performance. We propose a simulation-based model-fitting algorithm that mitigates these issues by bypassing annotation.