Reports and software
I have sorted my published papers and submitted manuscripts in two ways:
by year and by area.
I have created a separate page containing the software I developed or co-developed,
as well as some data sets prepared for use with the software. The data sets include
a handful of real data sets in genomics and network analysis.
My primary research area is statistical inference
for Big Data, focusing on how to improve inference
by exploiting the many kinds of sparsity
we see today (signal sparsity, graph sparsity,
sparsity in the eigenvalues of matrices, etc.).
Much of my work has orbited around the vision that in many
application examples of Big Data we see today, the signals are Rare and Weak.
To me, Rare and Weak signals are not
a mathematical curiosity but the unavoidable consequence of the
trend of "large p and small n" we frequently see with Big Data.
When you collect data with increasingly
more features (i.e., increasingly large dimensions), the signals tend
to become increasingly sparse, as the number of true features does not
increase proportionally. At the same time, in many cases we cannot
enroll sufficient subjects for experiments (such as a study of a rare
disease), so the sample size does not grow proportionally with the
number of features, and the signals end up being weak.
In this "Rare and Weak" situation, classical methods and most contemporary
empirical methods are simply overwhelmed, and principled statistical approaches
are badly needed.
In the past years, I have explored the following topics in high dimensional data analysis,
where in a significant fraction of the work, the theme is on "Rare and Weak" signals.
Large-Scale Multiple Hypothesis Testing.
Spectral clustering and Principal Component Analysis (PCA).
Random Matrix Theory (RMT).
Graph theory and the precision matrix.
My research is motivated by many interesting problems in various application areas.
I have developed and co-developed four groups of new methods appropriate for Rare and Weak signals.
Higher Criticism (with variants for signal detection, classification, and spectral clustering).
Graphlet Screening for variable selection (with two other variants: Univariate Penalized Screening (UPS) and Covariance Assisted
Screening and Estimation (CASE)).
Fourier-transformation based procedures for estimating the null parameters and proportion of non-null effects in large-scale multiple testing.
SCORE for network community detection.
A new method for estimating the precision matrix (in progress).
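To illustrate the first group of methods, here is a minimal sketch of the Higher Criticism statistic in its standard form (maximize the normalized discrepancy between sorted p-values and the uniform distribution over the smallest p-values); the function name and defaults are illustrative, not taken from any particular implementation:

```python
import numpy as np

def higher_criticism(pvals, alpha0=0.5):
    """Higher Criticism statistic for detecting rare and weak signals.

    Sorts the p-values and computes
        HC_i = sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
    then returns the maximum over the smallest alpha0 fraction of them.
    A large value suggests the presence of some non-null effects even
    when no individual p-value is significant on its own.
    """
    p = np.sort(np.asarray(pvals, dtype=float))
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against division by zero
    n = len(p)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
    k = max(1, int(alpha0 * n))  # restrict to the smallest p-values
    return float(np.max(hc[:k]))
```

For example, a batch of p-values containing one very small value yields a much larger HC score than a comparable batch with none, which is the behavior the detection procedure relies on.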
I have a strong interest in statistical theory, and I am especially
fond of the so-called "Phase Diagram", which can be viewed as a new
criterion for optimality that is especially appropriate for Rare and Weak
signals in Big Data.
Just as water has three phases
(vapor, liquid, and ice), there are three phases for
many statistical problems (variable selection, classification,
multiple testing, spectral clustering). The phase diagram is a
two-dimensional parameter space, where the x-axis calibrates the
signal rarity and the y-axis calibrates the signal strength.
For a particular statistical problem, say, variable selection,
the phase space usually partitions into three sub-regions (hence the
name "phase diagram"), Phases I-III.
In Phase I, the problem under consideration
is relatively trivial, since the signals are sufficiently strong.
In Phase II, the problem under consideration is nontrivial, but it is still possible to
obtain a reliable solution, as the signals are moderately strong.
In Phase III, it is impossible to obtain a reliable solution, simply because the
signals are too rare and weak.
In the past years, I have worked out the phase diagrams for the following problems.
Detecting rare and weak signals.
Estimating the proportion of signals.
Low-rank matrix recovery.
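As a concrete example of how the two axes are calibrated, consider the standard rare/weak model for signal detection (sparsity n^(-beta), signal strength sqrt(2 r log n)). The well-known boundary separating Phase II from Phase III in this model can be sketched as follows; this is a textbook formula from the signal-detection literature, and the function name is illustrative:

```python
import numpy as np

def detection_boundary(beta):
    """Detection boundary rho*(beta) in the standard rare/weak model.

    With n tests, signal sparsity eps_n = n**(-beta) and signal
    strength mu_n = sqrt(2 * r * log(n)) for 1/2 < beta < 1:
      - if r > rho*(beta), reliable detection is possible (Phase II);
      - if r < rho*(beta), no test beats random guessing (Phase III).
    """
    beta = np.asarray(beta, dtype=float)
    return np.where(
        beta <= 0.75,
        beta - 0.5,                          # moderately sparse regime
        (1.0 - np.sqrt(1.0 - beta)) ** 2,    # very sparse regime
    )
```

Note that the two branches agree at beta = 3/4 (both give 1/4), so the boundary is continuous across the regime change.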