### Bayesian Workshop 7 Invited Papers

The two invited papers to be presented at the workshop are:

• Modeling Species Diversity Through Species Level Hierarchical Modeling
• Highly Structured Models for Spectral and Image Analysis in High Energy Astrophysics

## Modeling Species Diversity Through Species Level Hierarchical Modeling''

Alan E. Gelfand, John A. Silander, Jr., Shanshan Wu, Andrew Latimer and Paul Lewis

Understanding spatial patterns of species diversity and the distribution of individual species is a consuming problem in biogeography and conservation. In particular, to respond to the ancient challenge which asks why there are so many species in some areas and so few in others, one finds many different "universal explanations". For instance, a survey paper by Palmer (1996) lists 120 named hypotheses to explain patterns in biodiversity. Arguably more disconcerting is to witness the same data used to support different claims.

The advent of inexpensive high speed computation, including widely available GIS software, has revised the way that many ecologists think about spatial prediction of species distribution. The survey paper of Guisan and Zimmermann (2000) provides an extensive review of the variety of statistical and algorithmic methods which have been proposed.

What we envisage is a region which has been surveyed at a number of sites. At each site, presence (hence, implicitly, absence) of a collection of species has been recorded resulting in a site (row) by species (column) matrix. A classification-then-modeling strategy gathers either sites into groups containing similar species ("communities") or species into groups occurring at similar sites ("assemblages"). Marginalization across rows yields richness at sites, marginalization down columns produces species prevalence. Regression modeling using generalized linear models (GLM's) or generalized additive models (GAM's) are the current widely advocated approaches to explain the aggregated measurements.

Rather than aggregating, one might model directly at the species/site level. Regression work exists in this case as well, including the recently proposed generalized regression analysis and spatial prediction (GRASP) methodolgy (Lehmann, Overton and Leathwick (2002). We also work at the site level but we seek to accommodate a variety of concerns which are often ignored in the literature. These include very irregular sampling intensity, ecological factors measured at much different resolution than our essentially point-referenced sampling sites, human interventions which have transformed land use, the incorporation of phylogenetic information, and the accommodation of large physical regions with large numbers of sampling sites. To accommodate these featurees we work with spatially explicit hierarchical models and fit the data within a Bayesian framework. As a result, we are able to offer full inference and better assessment of uncertainty associated with our inference.

The work undertaken through this project enables a challenging collaboration involving a team of statisticians, ecologists, botanists, geneticists, a mathematical modeler and a GIS specialist. Thus far, one paper has been completed with a tentative acceptance from Applied Statistics (Gelfand, et al., 2003).

We have at our disposal arguably the richest dataset in existence to model and analyze biodiversity patterns.. In particular, we work within the Cape Floristic Region in South Africa, a global hotspot of diversity and endemism, e.g., this region includes 9000 plant species, 69% of which are found nowhere else. Our dataset has sampled some 60,000 localities within the roughly 90,000 sq.kms which this region encompasses. Our focus is the Proteaceae, an icon flowering plant family of South Africa, which shows a remarkable level of speciation, about 400 species across Africa of which roughly 330 are 99% restricted to the Cape Floristic Region. We have some 250,000 species counts among these protea species. We also have extensive environmental data for the region including hydrology and climatology, elevation and slope, and also geology. For the species themselves, we have attribute data but most intriguingly we also have phylogenetic data in the form of a constructed genetic tree. We are attempting a synthesis of all of these sources to explain species range and richness.

The modeling employs a multistage, spatially explicit, hierarchical logistic regression structure to link up "potential" presence or absence, "adjusted" presence or absence, reflecting the effects of human transformation of land use, and "observed" presence or absence which is collected with varying sampling intensity across the region. More precisely, for each species we conceptualize a latent binary presence/absence process over the study region. Then, species k is potentially present in areal unit i with probability which is the block average of its associated latent binary process over unit i. This approach enables direct extension to transformed presence under the assumption that the transformation status (yes or no) at a particular location is independent of the value of the latent presence/absence state for that location. Finally, the probability of observed presence or absence at a particular location for species k within areal unit i emerges as the size of a realized block average relative to the potential block average. As a result, all of these probabilities have an "areal" definition and are modeled on the logit scale using ecological variables and species attributes which are anticipated to affect block average size.

A variety of models can be developed within this hierarchical framework and this necessitates an approach for model comparison. Since our primary focus is on spatial prediction of species locations, we have developed a predictive model choice criterion which extends earlier work of Gelfand and Ghosh (1998). Such extension becomes necessary to avoid working on the binary scale and to accommodate the "censoring" introduced by transformation.

An attraction of the modeling is that it enables us to examine a variety of novel biodiversity measures (which emerge as posterior features) to capture range, richness, relative diversity, and turnover for species. Our initial analyses have been applied to a subset of 25 species over a subregion of roughly 4400 sq.kms. We have been surprised by how well our modeling approach describes observed biogeographical patterns over the region. We are currently extending our analyses by enriching our models through the inclusion of geological information and phylogenetic information.

These analyses include roughly 10% of our sampled data. Even with this smaller version of the problem we have interesting challenges involving rarely occurring species and "holes", grid cells where no sampling has been done . Escalation to the entire region and the entire collection of species exacerbates these issues but in addition raises formidable computational problems. We are developing and examining modeling approaches to enable us to handle so many spatial units. A related question which we are also addressing is that of scale. How do our proposed biodiversity measures change with scale? Classical biodiversity measures are ad hoc in definition but do attempt to reflect the extent of aggregation. So, we are considering this as well.

A further extension is to enable inclusion of phylogenetic information into the foregoing model specifications. Here, we move into the realm of cladistic biogeography where we are proposing the first integration of spatial environmental factors with genetic tree data. Trees can be built using either purely molecular or a combination of molecular and morphological data. Regardless, they characterize speciation with information regarding historical time of speciation. But, from a presence/absence perspective, there is often evidence of a phenomenon called vicariance which suggests that speciation may have occurred as a result of some historical event under which, for at least a portion of the study region, species were discouraged from jointly appearing in the same place. This introduces more complex dependence structure for species across areal units. We have built "vicariance promoter" models and are currently investigating their performance.

The project is funded under two grants, a three-year interdisciplinary award from NSF and a three-year travel/meeting award from the National Center for Ecological Analysis and Synthesis. The research team, consisting of scientists from three continents, is:

Alan Gelfand, Statistician, Duke Univ
Alexandra Schmidt, Statistician, Univ Federal, Rio De Janeiro
Shanshan Wu, Grad Student, Statistics, Univ of Connecticut
Jarrett Barber, Post-doctoral Researcher, Statistics, Duke University
John Silander, Ecologist, Univ of Connecticut
Andrew Latimer, Grad Student, Ecology, Univ of Connecticut
Paul Lewis, Geneticist, Univ of Connecticut

Mark Hunter, Post-doctoral researcher, Ecology, Univ of Connecticut
Richard Cowling, Botanist, Univ of Port Elizabeth, SA
Anthony Rebelo, Botanist, Kirstenbosch National Botanical Gardens, SA
Walter Smit, GIS specialist, Kirstenbosch National Botanical Gardens, SA
Gail Reeves, Geneticist, Kirstenbosch National Botanical Gardens, SA
Henri Laurie, Applied Mathematician, Univ of Capetown, SA

References

Palmer, M.W. (1996) Variation in Species Richness: Towards a Unification of Hypotheses. Folia Geobotanica et Phytotaxonomica 29,511-530.

Guisan, A. and Zimmermann, N.E. (2000) Predictive Habitat Distribution Models in Ecology. Ecological Modelling, 135, 147-186.

Lehmann, A., Overton, J.M., and Leathwick, J.R. (2002) GRASP: Generalized Regression Analysis and Spatial Prediction. Ecological Modelling 157, 189-207.

Gelfand, A.E., Schmidt, A.M., Wu, S., Silander, J.A., Latimer, A. and Rebelo, A.G. (2003) Modelling Species Diversity through Species Level Hierarchical Modeling. Applied Statistics (tentative acceptance)

Gelfand, A.E. and Ghosh, S. (1998) Model Choice: A Minimum Posterior Predictive Loss Approach. Biometrika 85,1-11.

Discussants: Jennifer Hoeting, Jay VerHoef

## Highly Structured Models for Spectral and Image Analysis in High Energy Astrophysics''

David A. van Dyk, David N. Esch, Yaming Yu, Margarita Karovska, Vinay L. Kashyap, Peter Freeman, Aneta Siemiginowska and Alanna Connors

X-rays are high energy electromagnetic waves, i.e., photons, which can be produced by very hot matter. The production of X-rays requires temperatures of millions of degrees that can occur in the presence of high magnetic fields, extreme gravity, or explosive forces. Thus, X-ray telescopes can map the remnants of exploding stars, areas of star formation, regions near the event horizon of a stellar black hole, or very distant but very turbulent galaxies. The distribution of the energy of the electromagnetic emissions gives insight into the composition, temperatures, and relative velocity of an astronomical source. The spatial distribution of the emission reflects physical structures in an extended source, for example, emission jets or the shape of the debris of a stellar explosion. Some sources exhibit temporal variability or periodicity that might result from surface or internal pulsations, eclipses, starspots, or magnetic activity cycles. Thus, instrumentation that can precisely measure the energy, sky coordinates, and arrival time of X-ray band photons enables astrophysicists to extract subtle clues as to the underlying physics of X-ray sources. Such instrumentation is necessarily space-based since high energy photons are absorbed by the Earth's atmosphere. The Chandra X-ray Observatory is an example of the new class of instruments providing high precision data. Launched in July of 1999, Chandra has already given astronomers a wealth of important new data. The instrumentation aboard Chandra includes four pairs of ultra-smooth high-resolution mirrors, and efficient X-ray detectors to provide images at least thirty times sharper than any previous X-ray telescope. Data is collected on each X-ray photon that arrives at the detector; the time of arrival, the two dimensional sky coordinates, and the energy are all recorded. Due to instrumental constraints, each of these four variables is discrete. Thus, a data set can be represented by a four-way table of counts with margins corresponding to time, two sky coordinates, and energy. Here, we focus on spectral analysis, which models the one-way energy margin and image analysis, which models the two-way marginal table of sky coordinates.

There are numerous statistical challenges that arise when analyzing data from Chandra and other high-resolution count-based detectors. It is to address these challenges that we develop new statistical models and methods. To motivate the need for new methodology, we briefly outline some of these challenges and some of our model-based solutions.

Chandra's capacity for high resolution imaging means that it has a much finer discretization of energy and sky coordinates than previous instruments. This results in an overall increase in the number of bins and pixels that leads to lower observed counts in each. Thus, Gaussian assumptions that might have been appropriate for data from older instruments are often inappropriate for Chandra data. For example, so-called minimum $\chi^2$ fitting of model parameters, popular in astrophysics, involves an implicit Gaussian assumption. We avoid such assumptions by directly modeling spectra and images as Poisson processes.

Models for the spectrum and for the image of an X-ray source often include several components. For example, a spectral model typically contains several additive Poisson components which can be formulated as a finite mixture model. Roughly speaking, the components can be split into two groups: continuum terms, which describe the distribution over the entire energy range of interest and emission lines, which are local positive aberrations from the continuum. The finite mixture is then multiplied by an absorption factor, which represents stochastic censoring of photons. A proportion of photons are absorbed by matter at the surface of the source or between the source and detector; the probability of absorption varies with energy. Models for spatial structure in the images are even more complex. Although some X-ray emitters are simple star-like point sources many of the interesting images are composed of hot gas in an extended diffuse nebula with irregular and often unpredictable structure. The nebula may contain long filaments, loops of hot gas, or extended jets emanating from one or more point sources. Again, we use Poisson mixture models, perhaps including components for point sources and jets on top of multi-scale model (e.g., wavelet like---but in a fully Bayesian model) for the extended nebula.

Unfortunately, due to instrumental constraints the photon counts are degraded in a number of ways. For example, the effective area of the detector varies with the energy of the photon. Heuristically the instrument works like a prism which bends an X-ray by an angle which depends on its energy. Some of the photons are bent so far that they miss the detector all together. Thus, the probability that an X-ray is not refracted off the detector depends on its energy. A second form of data degradation is due to instrument response, which is a characteristic of the detector that results in a blurring of the photon energies and sky coordinates. The recorded energy and sky coordinates of a photon that arrives with a particular energy and sky coordinates has a probability distribution. Like the effective area vector the probability matrix which describes blurring is evaluated (perhaps with error) by calibration of the detector. Finally, the source photons are generally contaminated by background counts which originate somewhere other than the source of interest.

Pile-up occurs in X-ray CCDs (charged coupled devices such as Chandra) when two or more photons arrive at the same area of the detector during the same time frame (i.e., time bin). Such coincident events are counted as a single event with energy equal to the sum of the coincident event energies. The event is lost altogether if the total energy goes above the on-board discriminators. Thus, for bright sources pile-up can seriously distort not only the count rate but also both and the energy spectrum and the image. Accounting for pile-up in a principled manner requires careful modeling and sophisticating statistical computation.

In addition to fitting spectral and image models and obtaining error bars for the various model parameters, astrophysicists are often interested in conducting formal statistical tests for the presence of additive spectral or image features. The likelihood ratio test, or a Gaussian approximation thereof, is routinely used. Since the intensity of these features are generally constrained to be non-negative, the null hypothesis of no feature is on the boundary of the parameter space. Thus, the standard asymptotic reference distribution of the likelihood ratio test is inapplicable and more sophisticated methods are needed for testing such hypotheses.

We use model-based Bayesian methods to handle the complexity of Chandra data. Multi-level models can be designed with components for both the data collection process (e.g., background contamination and pile-up) and the complex spectral and image structures of the sources themselves. Sophisticated computational methods are required for fitting the resulting highly structured models. We believe a Bayesian perspective is ideally suited to such models in terms of both inference and computation. The models may be parameterized on high dimensional spaces with numerous nuisance parameters, for example, describing the data collection process. Bayesian methods offer a straightforward way of handling such parameter spaces; we base inference on marginal posterior distributions of scientifically interesting parameters or groups of parameters.

Computational tools such as the EM algorithm, the Data Augmentation algorithm, the Gibbs sampler, and other Markov chain Monte Carlo (MCMC) methods are ideally suited to highly structured models of this sort. The modular structure of these algorithms fits hand-in-glove with the hierarchical structure of our models. The Gibbs sampler, for example, samples one set of model parameters from their conditional posterior distribution given all other model parameters. This allows us to sequentially fit one component of the overall model at a time while conditioning on the other components. In this way, a complex model fitting task is divided into a sequence of much easier tasks. Many of these easier tasks involve well understood procedures. Using an EM algorithm to handle a blurring matrix and background contamination of Poisson data is a good example. Although this well known (and often rediscovered) technique is unable to handle the richness of our highly structured model, we utilize it and its stochastic counterpart as a step in our mode finding and posterior sampling algorithms.

In our presentation we illustrate how our highly structured models and Bayesian techniques have solved numerous scientific questions where standard methods fail.

This work is part of an on-going collaboration of the Harvard-Smithsonian Astronomy and Statistics Working Group (URL: www.people.harvard.edu/~vandyk/astrostat.html). Participants include faculty, researchers, graduate and undergraduate students in both statistics and astrophysics.

Discussants: To be announced

The previous six Workshops provided extended presentation and discussion on diverse topics.

Back to the top of the page