Jiashun's Research


Reports

I have sorted my published papers and submitted manuscripts in two ways: by year and by area.

Click here to see reports listed by year

Click here to see reports listed by area

Research description

My primary research area is statistical inference for Big Data, with a focus on improving inference by exploiting the many kinds of sparsity we see today (signal sparsity, graph sparsity, sparsity in eigenvalues of matrices, etc.).

Vision

Much of my work has revolved around the vision that in many of the Big Data applications we see today, the signals are Rare and Weak. To me, Rare and Weak signals are not a mathematical curiosity but the unavoidable consequence of the "large p, small n" trend we frequently see with Big Data.

When you collect data with more and more features (i.e., increasingly large dimensions), the signals tend to become increasingly sparse, because the number of true features does not increase proportionally. At the same time, in many cases we cannot enroll enough subjects for experiments (for example, in a study of a rare disease), so the sample size does not grow proportionally with the number of features, and the signals end up being weak.

In this "Rare and Weak" situation, classical methods and most contemporary empirical methods are simply overwhelmed, and principled statistical approaches are badly needed.

Research Topics

In the past years, I have explored the following topics in high-dimensional data analysis; in a significant fraction of this work, the theme is "Rare and Weak" signals.

Large-Scale Multiple Hypotheses Testing.
Cancer Classification.
Variable Selection.
Spectral Clustering and Principal Component Analysis (PCA).
Random Matrix Theory (RMT).
Network Analysis.
Graph Theory and Precision Matrix Estimation.

Applications

My research is motivated by many interesting problems in various application areas.

Genomics: gene microarray, SNP, protein mass spectroscopy.
Complex graphs and networks.
Astronomy and Cosmology: non-Gaussian signature detection in Cosmic Microwave Background (CMB).
Computer Security.
Link to Astrostatistics group at Purdue
Link to the International Computational Astrostatistics Group

Methods

I have developed and co-developed four groups of new methods appropriate for Rare and Weak signals.

Higher Criticism (with variants for signal detection, classification, and spectral clustering).
Graphlet Screening for variable selection (with two other variants: Univariate Penalized Screening (UPS) and Covariance Assisted Screening and Estimation (CASE)).
Fourier-transform-based procedures for estimating the null parameters and the proportion of non-null effects in large-scale multiple testing.
SCORE for network community detection.
A new method for estimating the precision matrix (in progress).
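
As a concrete illustration of the first method in the list above, here is a minimal Python sketch of the basic Higher Criticism statistic for signal detection. It assumes the standard form computed from sorted p-values, the maximum over the smallest fraction alpha0 of p-values of sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))). The function name higher_criticism and the default alpha0 = 0.5 are choices made for this sketch, not names taken from the papers, and the variants for classification and spectral clustering are not covered.

```python
import math


def higher_criticism(p_values, alpha0=0.5):
    """Basic Higher Criticism statistic from a list of p-values.

    Illustrative sketch only: `alpha0` (the fraction of smallest
    p-values that are scanned) is an assumed tuning choice.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    sorted_p = sorted(p_values)
    upper = max(1, int(alpha0 * n))  # scan the `upper` smallest p-values
    best = -math.inf
    for i in range(1, upper + 1):
        p = sorted_p[i - 1]
        if p <= 0.0 or p >= 1.0:
            continue  # the standardization is undefined at 0 and 1
        # Standardized gap between the empirical fraction i/n and p_(i).
        z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, z)
    return best


# Evenly spread p-values mimic the global null: the statistic stays small.
null_p = [i / 101 for i in range(1, 101)]
# Replacing a few of them with tiny p-values mimics rare signals.
signal_p = [1e-6] * 5 + null_p[5:]

print(higher_criticism(null_p))
print(higher_criticism(signal_p))
```

A large value suggests that the collection of p-values departs from the global null. In practice the statistic is compared against a threshold calibrated under the null, which this sketch does not attempt.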

Theory

I have a strong interest in statistical theory, and I am especially fond of the so-called "Phase Diagram", which is a novel way to justify optimality. The phase diagram can be viewed as a new criterion for optimality that is especially appropriate for Rare and Weak signals in Big Data.

Just as there are three phases of water (water vapor, water, and ice), there are three phases for many statistical problems (variable selection, classification, multiple testing, spectral clustering). The phase diagram is a two-dimensional parameter space, where the x-axis calibrates the signal rarity and the y-axis calibrates the signal strength. For a particular statistical problem, say, variable selection, the phase space usually partitions into three sub-regions (hence the name "phase diagram"), Phases I-III.

In Phase I, the problem under consideration is relatively trivial since the signals are sufficiently strong.
In Phase II, the problem under consideration is nontrivial, but it is still possible to have a reliable solution as the signals are moderately strong.
In Phase III, it is impossible to have a reliable solution because the signals are simply too rare and weak.
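
To make the three phases concrete, the following Python sketch classifies a point of the phase space for variable selection. Everything specific in it is an assumption made for illustration and should be checked against the papers: the calibration (a fraction p^(-beta) of the p features are signals, each with strength sqrt(2 r log p)), the two boundary curves r = beta and r = (1 + sqrt(1 - beta))^2 reported for a simple orthogonal-design setting, the mapping of the resulting regions onto Phases I-III, and the function name variable_selection_phase.

```python
import math


def variable_selection_phase(beta, r):
    """Classify (beta, r) into Phase I, II, or III.

    Assumed calibration (illustrative, not taken from this page):
      beta in (0, 1): rarity, a fraction p**(-beta) of features are signals
      r > 0:          strength, signal size is sqrt(2 * r * log(p))
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    if r <= 0.0:
        raise ValueError("r must be positive")
    # Assumed upper boundary: above it, signals are strong enough
    # that the problem is relatively easy.
    upper = (1.0 + math.sqrt(1.0 - beta)) ** 2
    # Assumed lower boundary: below r = beta, no method is reliable.
    lower = beta
    if r > upper:
        return "Phase I"
    if r > lower:
        return "Phase II"
    return "Phase III"


# Fix the rarity at beta = 0.5 and move up the strength axis.
for r in (0.3, 1.0, 3.5):
    print(r, variable_selection_phase(0.5, r))
```

Here beta plays the role of the x-axis (rarity) and r the role of the y-axis (strength). Other problems, such as detection or classification, have their own boundary curves, so only the general shape of the partition carries over.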

In the past years, I have worked out the phase diagrams for the following problems.

Detecting rare and weak signals.
Variable selection.
Classification.
Clustering.
Estimating the proportion of signals.
Low-rank matrix recovery.