Astrostatistics is the statistical study of stars, galaxies and the large-scale structure of the Universe. Using data from telescopes and satellites, astrostatisticians address questions about the origin, evolution and fate of the Universe. In the last decade there has been a deluge of valuable data, and statisticians play an important role in analyzing it. Genovese and Wasserman are founding members of the International Computational Astrostatistics (INCA) group, a cross-disciplinary research team of astrophysicists, statisticians and computer scientists. Within the department, several faculty, post-docs and graduate students are members of the group; other active members are drawn from other departments at Carnegie Mellon, the University of Pittsburgh, and several international institutions. The statistics department works closely with the McWilliams Center for Cosmology at Carnegie Mellon as well as with the Department of Physics and Astronomy at the University of Pittsburgh. Recent projects include: analysis of the cosmic microwave background, estimating the dark energy equation of state, analysis of galaxy spectra, detecting galaxy clusters, identifying filaments, and estimating density functions from truncated data. A common theme in this work is the goal of detecting subtle, nonlinear signals in noisy, high-dimensional data. Our primary focus is on using state-of-the-art data and analytical methods to advance cosmology.

Cosmic Microwave Background

We analyze first-year data of WMAP to determine the significance of asymmetry in summed power between arbitrarily defined opposite hemispheres, using maps that we create ourselves with software developed independently of the WMAP team. We find that over the multipole range l=[2,64], the significance of asymmetry is ~ 10^-4, a value insensitive to both frequency and power spectrum. We determine the smallest multipole ranges exhibiting significant asymmetry, and find twelve, including l=[2,3] and [6,7], for which the significance -> 0. In these ranges there is an improbable association between the direction of maximum significance and the ecliptic plane (p ~ 0.01). Also, contours of least significance follow great circles inclined relative to the ecliptic at the largest scales. The great circle for l=[2,3] passes over previously reported preferred axes and is insensitive to frequency, while the great circle for l=[6,7] is aligned with the ecliptic poles. We examine how changing map-making parameters affects asymmetry, and find that at large scales, it is rendered insignificant if the magnitude of the WMAP dipole vector is increased by approximately 1-3 sigma (or 2-6 km/s). While confirmation of this result would require data recalibration, such a systematic change would be consistent with observations of frequency-independent asymmetry. We conclude that the use of an incorrect dipole vector, in combination with a systematic or foreground process associated with the ecliptic, may help to explain the observed asymmetry.

P. E. Freeman, C. R. Genovese, C. J. Miller, R. C. Nichol, L. Wasserman

Nonlinear Data Transformation

Dimension-reduction techniques can greatly improve statistical inference in astronomy. A standard approach is to use Principal Components Analysis (PCA). In this work we apply a recently developed technique, diffusion maps, to astronomical spectra for data parameterization and dimensionality reduction, and develop a robust, eigenmode-based framework for regression. We show how our framework provides a computationally efficient means by which to predict redshifts of galaxies, and thus could inform more expensive redshift estimators such as template cross-correlation. It also provides a natural means by which to identify outliers (e.g., misclassified spectra, spectra with anomalous features). We analyze 3835 SDSS spectra and show how our framework yields a more than 95% reduction in dimensionality. Finally, we show that the prediction error of the diffusion map-based regression approach is markedly smaller than that of a similar approach based on PCA, clearly demonstrating the superiority of diffusion maps over PCA for this regression task.
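To make the idea concrete, here is a minimal NumPy sketch of a diffusion-map construction: build a Gaussian affinity matrix, row-normalize it into a Markov transition matrix, and use its leading nontrivial eigenvectors as new coordinates. This is a toy version, not the code used in the paper; the bandwidth `eps`, diffusion time `t` and number of retained coordinates `n_coords` are illustrative choices.

```python
import numpy as np

def diffusion_map(X, eps, n_coords=2, t=1):
    """Toy diffusion-map embedding for the rows of X."""
    # Pairwise squared distances and Gaussian kernel affinities
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / eps)
    # Row-normalize to obtain a Markov transition matrix
    P = K / K.sum(axis=1, keepdims=True)
    # Eigendecompose; P is similar to a symmetric matrix, so eigenvalues are real
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # Drop the trivial constant eigenvector; scale by eigenvalues^t
    return vecs[:, 1:n_coords + 1] * vals[1:n_coords + 1] ** t
```

Redshift regression would then be run on the returned coordinates rather than on the raw high-dimensional spectra.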

A Statistical Method for Estimating Luminosity Functions Using Truncated Data

The observational limitations of astronomical surveys lead to significant statistical inference challenges. One such challenge is the estimation of luminosity functions given redshift (z) and absolute magnitude (M) measurements from an irregularly truncated sample of objects. This is a bivariate density estimation problem; we develop here a statistically rigorous method which (1) does not assume a strict parametric form for the bivariate density; (2) does not assume independence between redshift and absolute magnitude (and hence allows evolution of the luminosity function with redshift); (3) does not require dividing the data into arbitrary bins; and (4) naturally incorporates a varying selection function. We accomplish this by decomposing the bivariate density φ(z,M) via log φ(z,M) = f(z) + g(M) + h(z,M,θ), where f and g are estimated nonparametrically and h takes an assumed parametric form. There is a simple way of estimating the integrated mean squared error of the estimator; smoothing parameters are selected to minimize this quantity. Results are presented from the analysis of a sample of quasars.

A unified framework for constructing, tuning and assessing photometric redshift density estimates in a selection bias setting

Photometric redshift estimation is an indispensable tool of precision cosmology. One problem that plagues the use of this tool in the era of large-scale sky surveys is that the bright galaxies that are selected for spectroscopic observation do not have properties that match those of (far more numerous) dimmer galaxies; thus, ill-designed empirical methods that produce accurate and precise redshift estimates for the former generally will not produce good estimates for the latter. In this paper, we provide a principled framework for generating conditional density estimates (i.e. photometric redshift PDFs) that takes into account selection bias and the covariate shift that this bias induces. We base our approach on the assumption that the probability that astronomers label a galaxy (i.e. determine its spectroscopic redshift) depends only on its measured (photometric and perhaps other) properties x and not on its true redshift. With this assumption, we can explicitly write down risk functions that allow us to both tune and compare methods for estimating importance weights (i.e. the ratio of densities of unlabeled and labeled galaxies for different values of x) and conditional densities. We also provide a method for combining multiple conditional density estimates for the same galaxy into a single estimate with better properties. We apply our risk functions to an analysis of approximately one million galaxies, mostly observed by SDSS, and demonstrate through multiple diagnostic tests that our method achieves good conditional density estimates for the unlabeled galaxies.
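The core quantity here, the importance weight w(x), is a ratio of the unlabeled to the labeled covariate densities. A one-dimensional toy sketch conveys the idea; the helper names `kde` and `importance_weights`, the Gaussian kernel, and the bandwidth `h` are illustrative choices, not the tuned estimators compared in the paper.

```python
import numpy as np

def kde(x_eval, x_train, h):
    """Simple 1-D Gaussian kernel density estimate at the points x_eval."""
    z = (x_eval[:, None] - x_train[None, :]) / h
    return np.exp(-0.5 * z**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

def importance_weights(x_labeled, x_unlabeled, h=0.3):
    """w(x) = f_unlabeled(x) / f_labeled(x), evaluated at the labeled points."""
    num = kde(x_labeled, x_unlabeled, h)
    den = kde(x_labeled, x_labeled, h)
    return num / den
```

Reweighting labeled (spectroscopic) galaxies by w(x) up-weights regions of covariate space where unlabeled (photometric-only) galaxies are overrepresented, which is exactly the correction selection bias requires.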

Comparing Distributions of Galaxy Morphologies

A principal goal of astronomy is to describe and understand how galaxies evolve as the Universe ages. To understand the processes that drive evolution, one needs to investigate the connections between various properties of galaxies, such as mass, star-formation rate (SFR), and morphology, in a quantitative manner. The last of these properties, morphology, refers to the two-dimensional appearance of a galaxy projected onto the plane of the sky.

Cosmic web reconstruction through density ridges: catalogue

We construct a catalogue of filaments using a novel approach called SCMS (subspace constrained mean shift). SCMS is a gradient-based method that detects filaments through density ridges (smooth curves tracing high-density regions). A great advantage of SCMS is its uncertainty measure, which allows an evaluation of the errors for the detected filaments. To detect filaments, we use data from the Sloan Digital Sky Survey, comprising three galaxy samples: the NYU main galaxy sample (MGS), the LOWZ sample and the CMASS sample. Each of the three samples covers a different redshift range, so the combined sample allows detection of filaments up to z = 0.7. Our filament catalogue consists of a sequence of two-dimensional filament maps at different redshifts that provide several useful statistics on the evolution of the cosmic web. To construct the maps, we select spectroscopically confirmed galaxies within 0.050 < z < 0.700 and partition them into 130 redshift bins. For each bin, we ignore the redshift, treat the galaxies as two-dimensional data, and detect filaments using SCMS. The filament catalogue consists of 130 individual two-dimensional filament maps, each comprising points on the detected filaments that describe the filamentary structures at a particular redshift. We also apply our filament catalogue to investigate galaxy luminosity and its relation to distance from the nearest filament. Using a volume-limited sample, we find strong evidence (6.1σ-12.3σ) that galaxies close to filaments are generally brighter than those at significant distance from filaments.

Cosmic web reconstruction through density ridges: method and algorithm

The detection and characterization of filamentary structures in the cosmic web allows cosmologists to constrain parameters that dictate the evolution of the Universe. While many filament estimators have been proposed, they generally lack estimates of uncertainty, reducing their inferential power. In this paper, we demonstrate how one may apply the subspace constrained mean shift (SCMS) algorithm (Ozertem & Erdogmus 2011; Genovese et al. 2014) to uncover filamentary structure in galaxy data. The SCMS algorithm is a gradient ascent method that models filaments as density ridges, one-dimensional smooth curves that trace high-density regions within the point cloud. We also demonstrate how augmenting the SCMS algorithm with bootstrap-based methods of uncertainty estimation allows one to place uncertainty bands around putative filaments. We first apply the SCMS algorithm to a data set generated from a Voronoi model; the density ridges show strong agreement with the Voronoi filaments. We then apply the SCMS method to data sets sampled from a P3M N-body simulation, with galaxy number densities consistent with SDSS and WFIRST-AFTA, and to LOWZ and CMASS data from the Baryon Oscillation Spectroscopic Survey (BOSS). To further assess the efficacy of SCMS, we compare the relative locations of BOSS filaments with galaxy clusters in the redMaPPer catalogue, and find that redMaPPer clusters are significantly closer (with p-values < 10^-9) to SCMS-detected filaments than to randomly selected galaxies.

Examining the Effect of the Map-making Algorithm on Observed Power Asymmetry in WMAP Data

We analyze first-year data of WMAP to determine the significance of asymmetry in summed power between arbitrarily defined opposite hemispheres. We perform this analysis on maps that we create ourselves from the time-ordered data, using software developed independently of the WMAP team. We find that over the multipole range l=[2, 64], the significance of asymmetry is ~10^-4, a value insensitive to both frequency and power spectrum. We determine the smallest multipole ranges exhibiting significant asymmetry and find 12, including l=[2, 3] and [6, 7], for which the significance -> 0. Examination of the 12 ranges indicates both an improbable association between the direction of maximum significance and the ecliptic plane (significance ~0.01) and that contours of least significance follow great circles inclined relative to the ecliptic at the largest scales. The great circle for l=[2, 3] passes over previously reported preferred axes and is insensitive to frequency, while the great circle for l=[6, 7] is aligned with the ecliptic poles. We examine how changing map-making parameters, e.g., foreground masking, affects asymmetry. Only one change appreciably reduces asymmetry: asymmetry at large scales (l <= 7) is rendered insignificant if the magnitude of the WMAP dipole vector (368.11 km s^-1) is increased by ~1-3 σ (~2-6 km s^-1). While confirmation of this result requires the recalibration of the time-ordered data, such a systematic change would be consistent with observations of frequency-independent asymmetry. We conclude that the use of an incorrect dipole vector, in combination with a systematic or foreground process associated with the ecliptic, may help to explain the observed power asymmetry.

Inference for the dark energy equation of state using Type Ia supernova data

The surprising discovery of an accelerating universe led cosmologists to posit the existence of "dark energy" - a mysterious energy field that permeates the universe. Understanding dark energy has become the central problem of modern cosmology. After describing the scientific background in depth, we formulate the task as a nonlinear inverse problem that expresses the comoving distance function in terms of the dark-energy equation of state. We present two classes of methods for making sharp statistical inferences about the equation of state from observations of Type Ia Supernovae (SNe). First, we derive a technique for testing hypotheses about the equation of state that requires no assumptions about its form and can distinguish among competing theories. Second, we present a framework for computing parametric and nonparametric estimators of the equation of state, with an associated assessment of uncertainty. Using our approach, we evaluate the strength of statistical evidence for various competing models of dark energy. Consistent with current studies, we find that with the available Type Ia SNe data, it is not possible to distinguish statistically among popular dark-energy models, and that, in particular, there is no support in the data for rejecting a cosmological constant. With much more supernova data likely to be available in coming years (e.g., from the DOE/NASA Joint Dark Energy Mission), we address the more interesting question of whether future data sets will have sufficient resolution to distinguish among competing theories.
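The forward model at the heart of this inverse problem, the comoving distance as an integral over the expansion history, can be sketched numerically. The snippet below assumes a flat universe with a constant equation of state `w` (for w = -1 it reduces to the cosmological-constant case); the parameter values and the function name are illustrative, not the paper's estimators.

```python
import numpy as np

def comoving_distance(z, w=-1.0, H0=70.0, om=0.3):
    """Comoving distance (Mpc) in a flat universe with constant dark-energy EoS w.

    D_C(z) = (c/H0) * integral_0^z dz' / E(z'), with
    E(z) = sqrt(Om*(1+z)^3 + (1-Om)*(1+z)^(3(1+w))).
    """
    c = 299792.458  # speed of light, km/s
    zz = np.linspace(0.0, z, 2001)
    E = np.sqrt(om * (1.0 + zz)**3 + (1.0 - om) * (1.0 + zz)**(3.0 * (1.0 + w)))
    f = 1.0 / E
    # Trapezoidal integration over the redshift grid
    return (c / H0) * np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(zz))
```

Inference about w then amounts to comparing distances predicted by candidate equations of state against distance moduli measured from Type Ia supernovae.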

Inferring the Evolution of Galaxy Morphology

In astronomy, one of the major goals is to put tighter constraints on parameters in the Lambda-CDM model, which is currently the standard model describing the evolution of the Universe after the Big Bang. One way to work towards this goal is to estimate how galaxy structure and morphology evolve; we can then compare what we observe with rates predicted by the standard model via simulation.

Investigating galaxy-filament alignments in hydrodynamic simulations using density ridges

In this paper, we study the filamentary structures and the galaxy alignment along filaments at redshift z = 0.06 in the MassiveBlack-II simulation, a state-of-the-art, high-resolution hydrodynamical cosmological simulation which includes stellar and AGN feedback in a volume of (100 Mpc h^-1)^3. The filaments are constructed using the subspace constrained mean shift algorithm (SCMS; Ozertem & Erdogmus; Chen et al.). First, we show that filaments reconstructed from galaxies and filaments reconstructed from dark matter particles are similar to each other; over 50 per cent of the points on the galaxy filaments have a corresponding point on the dark matter filaments within a distance of 0.13 Mpc h^-1 (and vice versa), and this distance is even smaller in high-density regions. Second, we observe the alignment of the major principal axis of a galaxy with respect to the orientation of its nearest filament and detect a 2.5 Mpc h^-1 critical radius for a filament's influence on the alignment when the subhalo mass of the galaxy is between 10^9 M_⊙ h^-1 and 10^12 M_⊙ h^-1. Moreover, we find the alignment signal to increase significantly with subhalo mass. Third, when a galaxy is close to filaments (less than 0.25 Mpc h^-1), the galaxy alignment towards the nearest galaxy group is positively correlated with the galaxy subhalo mass. Finally, we find that galaxies close to filaments or groups tend to be rounder than those away from filaments or groups.

New image statistics for detecting disturbed galaxy morphologies at high redshift

Testing theories of hierarchical structure formation requires estimating the distribution of galaxy morphologies and its change with redshift. One aspect of this investigation involves identifying galaxies with disturbed morphologies (e.g. merging galaxies). This is often done by summarizing galaxy images using, e.g., the concentration, asymmetry and clumpiness statistics of Conselice and the Gini-M20 statistics of Lotz et al., and associating particular statistic values with disturbance. We introduce three statistics that enhance detection of disturbed morphologies at high redshift (z ~ 2): the multimode (M), intensity (I) and deviation (D) statistics. We show their effectiveness by training a machine-learning classifier, random forest, using 1639 galaxies observed in the H band by the Hubble Space Telescope WFC3, galaxies that had been previously classified by eye by the Cosmic Assembly Near-IR Deep Extragalactic Legacy Survey collaboration. We find that the MID statistics (and the A statistic of Conselice) are the most useful for identifying disturbed morphologies.

We also explore whether human annotators are useful for identifying disturbed morphologies. We demonstrate that they show limited ability to detect disturbance at high redshift, and that increasing their number beyond ≈10 does not demonstrably yield better classification performance. We propose a simulation-based model-fitting algorithm that mitigates these issues by bypassing annotation.
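The asymmetry (A) statistic mentioned above has a simple core: compare an image with its 180-degree rotation. A stripped-down sketch, omitting the background correction and centroid search that the full statistic requires:

```python
import numpy as np

def asymmetry(img):
    """Toy rotational asymmetry: A = sum|I - I_180| / sum|I|.

    img is a 2-D array of pixel fluxes; background subtraction and
    centring (both part of the real statistic) are omitted here.
    """
    rot = np.rot90(img, 2)  # 180-degree rotation about the image centre
    return np.abs(img - rot).sum() / np.abs(img).sum()
```

A perfectly symmetric source scores 0, while flux concentrated off-centre on one side, as in a merger remnant, drives the statistic toward larger values.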

Non-parametric 3D map of the intergalactic medium using the Lyman-alpha forest

Visualizing the high-redshift Universe is difficult due to the dearth of available data; however, the Lyman-alpha forest provides a means to map the intergalactic medium at redshifts not accessible to large galaxy surveys. Large-scale structure surveys, such as the Baryon Oscillation Spectroscopic Survey (BOSS), have collected quasar (QSO) spectra that enable the reconstruction of H I density fluctuations. The data fall on a collection of lines defined by the lines of sight (LOS) of the QSOs, and a major issue with producing a 3D reconstruction is determining how to model the regions between the LOS. We present a method that produces a 3D map of this relatively uncharted portion of the Universe by employing local polynomial smoothing, a non-parametric methodology. The performance of the method is analysed on simulated data that mimic the varying number of LOS expected in real data, and then is applied to a sample region selected from BOSS. The reconstruction is assessed by considering various features of the predicted 3D maps, including visual comparison of slices, probability density functions (PDFs), counts of local minima and maxima, and standardized correlation functions. This 3D reconstruction allows for an initial investigation of the topology of this portion of the Universe using persistent homology.
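Local polynomial smoothing itself is straightforward: at each evaluation point, fit a low-order polynomial by weighted least squares, with kernel weights that decay with distance. A one-dimensional local-linear sketch follows (the paper's reconstruction is three-dimensional and handles irregular sightline geometry; the Gaussian kernel and bandwidth `h` here are illustrative):

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear regression estimates at the points x0 (1-D toy version)."""
    out = np.empty(len(x0))
    for j, x_star in enumerate(x0):
        # Gaussian kernel weights centred at the evaluation point
        w = np.exp(-0.5 * ((x - x_star) / h) ** 2)
        # Design matrix for a degree-1 fit, centred at x_star
        X = np.column_stack([np.ones_like(x), x - x_star])
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        out[j] = beta[0]  # intercept = fitted value at x_star
    return out
```

Because the fit is local, the estimate adapts to regions dense with sightlines and degrades gracefully in the sparse gaps between them.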

Nonparametric Inference for the Cosmic Microwave Background

The cosmic microwave background (CMB), which permeates the entire Universe, is the radiation left over from just 380,000 years after the Big Bang. On very large scales, the CMB radiation field is smooth and isotropic, but the existence of structure in the Universe - stars, galaxies, clusters of galaxies - suggests that the field should fluctuate on smaller scales. Recent observations, from the Cosmic Background Explorer to the Wilkinson Microwave Anisotropy Probe, have strikingly confirmed this prediction.

CMB fluctuations provide clues to the Universe's structure and composition shortly after the Big Bang that are critical for testing cosmological models. For example, CMB data can be used to determine what portion of the Universe is composed of ordinary matter versus the mysterious dark matter and dark energy. To this end, cosmologists usually summarize the fluctuations by the power spectrum, which gives the variance as a function of angular frequency. The spectrum's shape, and in particular the location and height of its peaks, relates directly to the parameters in the cosmological models. Thus, a critical statistical question is how accurately these peaks can be estimated.

We use recently developed techniques to construct a nonparametric confidence set for the unknown CMB spectrum. Our estimated spectrum, based on minimal assumptions, closely matches the model-based estimates used by cosmologists, but we can make a wide range of additional inferences. We apply these techniques to test various models and to extract confidence intervals on cosmological parameters of interest. Our analysis shows that, even without parametric assumptions, the first peak is resolved accurately with current data but that the second and third peaks are not.

Photometric redshift estimation using spectral connectivity analysis

The development of fast and accurate methods of photometric redshift estimation is a vital step towards being able to fully utilize the data of next-generation surveys within precision cosmology. In this paper, we apply a specific approach to spectral connectivity analysis (SCA) called diffusion map. SCA is a class of non-linear techniques for transforming observed data (e.g. photometric colours for each galaxy, where the data lie on a complex subset of p-dimensional space) to a simpler, more natural coordinate system wherein we apply regression to make redshift predictions. In previous applications of SCA to other astronomical problems, we demonstrated its superiority vis-à-vis principal components analysis, a standard linear technique for transforming data. As SCA relies upon eigendecomposition, our training set size is limited to <~ 10^4 galaxies; we use the Nyström extension to quickly estimate diffusion coordinates for objects not in the training set. We apply our method to 350,738 Sloan Digital Sky Survey (SDSS) main sample galaxies, 29,816 SDSS luminous red galaxies and 5,223 galaxies from DEEP2 with Canada-France-Hawaii Telescope Legacy Survey ugriz photometry. For all three data sets, we achieve prediction accuracies on par with previous analyses, and find that the use of the Nyström extension leads to a negligible loss of prediction accuracy relative to that achieved with the training sets. As in some previous analyses, we observe that our predictions are generally too high (low) in the low (high) redshift regimes. We demonstrate that this is a manifestation of attenuation bias, wherein measurement error (i.e. uncertainty in diffusion coordinates due to uncertainty in the measured fluxes/magnitudes) reduces the slope of the best-fitting regression line. Mitigation of this bias is necessary if we are to use photometric redshift estimates produced by computationally efficient empirical methods in precision cosmology.
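The Nyström extension that makes this scalable can be sketched in a few lines: a new point's diffusion coordinates are kernel-weighted averages of the training eigenvectors, divided by the corresponding eigenvalues. This is a toy version assuming a Gaussian kernel with bandwidth `eps`; `psi` and `lam` denote eigenvectors and eigenvalues of the training-set Markov matrix, and the function name is mine.

```python
import numpy as np

def nystrom(x_new, X_train, psi, lam, eps):
    """Nystrom extension of diffusion coordinates to out-of-sample points.

    psi : (n, m) training eigenvectors; lam : (m,) matching eigenvalues.
    """
    # Gaussian kernel affinities between new points and the training set
    d2 = ((x_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / eps)
    # Transition probabilities from each new point to the training points
    P = K / K.sum(axis=1, keepdims=True)
    # psi_k(x) = (1/lam_k) * sum_j p(x, x_j) psi_k(x_j)
    return (P @ psi) / lam
```

By construction, extending a point that is already in the training set reproduces its training coordinates, and nearby new points land at nearby coordinates, so the (relatively cheap) extension replaces a full re-decomposition for the bulk of the survey.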

Semi-supervised learning for photometric supernova classification

We present a semi-supervised method for photometric supernova typing. Our approach is to first use the non-linear dimension reduction technique diffusion map to detect structure in a data base of supernova light curves and subsequently employ random forest classification on a spectroscopically confirmed training set to learn a model that can predict the type of each newly observed supernova. We demonstrate that this is an effective method for supernova typing. As supernova numbers increase, our semi-supervised method efficiently utilizes this information to improve classification, a property not enjoyed by template-based methods. Applied to supernova data simulated by Kessler et al. to mimic those of the Dark Energy Survey, our methods achieve (cross-validated) 95 per cent Type Ia purity and 87 per cent Type Ia efficiency on the spectroscopic sample, but only 50 per cent Type Ia purity and 50 per cent efficiency on the photometric sample due to their spectroscopic follow-up strategy. To improve the performance on the photometric sample, we search for better spectroscopic follow-up procedures by studying the sensitivity of our machine-learned supernova classification on the specific strategy used to obtain training sets. With a fixed amount of spectroscopic follow-up time, we find that, despite collecting data on a smaller number of supernovae, deeper magnitude-limited spectroscopic surveys are better for producing training sets. For supernova Ia (II-P) typing, we obtain a 44 per cent (1 per cent) increase in purity to 72 per cent (87 per cent) and a 30 per cent (162 per cent) increase in efficiency to 65 per cent (84 per cent) of the sample using a 25th (24.5th) magnitude-limited survey instead of the shallower spectroscopic sample used in the original simulations. When redshift information is available, we incorporate it into our analysis using a novel method of altering the diffusion map representation of the supernovae. Incorporating host redshifts leads to a 5 per cent improvement in Type Ia purity and 13 per cent improvement in Type Ia efficiency.