Chad M. Schafer and Kjell A. Doksum
This paper considers multiple regression procedures for analyzing the
relationship between a response variable and a vector of
covariates in a nonparametric setting where both tuning parameters and
the number of covariates need to be selected. We introduce an
approach that handles the dilemma that, with high-dimensional data,
the sparsity of data in regions of the sample space makes estimation
of nonparametric curves and surfaces virtually impossible. This is
accomplished by abandoning the goal of trying to estimate true
underlying curves and instead estimating measures of dependence that
can determine important relationships between variables. These
dependence measures are based on local parametric fits on subsets of
the covariate space that vary in both dimension and size within each
dimension. The subset that maximizes a signal-to-noise ratio is
chosen, where the signal is a local estimate of a dependence parameter
that depends on the subset dimension and size, and the noise is an
estimate of the standard error (SE) of the estimated signal. This
approach of choosing the window size to maximize a signal-to-noise
ratio lifts the curse of dimensionality because in regions where data
are sparse the SE is very large. It corresponds to
asymptotically maximizing the probability of correctly finding
non-spurious relationships between covariates and a response or, more
precisely, maximizing asymptotic power among a class of asymptotic
level-α tests indexed by subsets of the covariate space.
Subsets that achieve this goal are called features. We investigate the
properties of specific procedures based on the preceding ideas using
asymptotic theory and Monte Carlo simulations and find that within a
selected dimension, the volume of the optimally selected subset does
not tend to zero as the sample size increases unless the volume of the
subset of the covariate space where the response depends on the
covariate vector tends to zero.
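
The signal-to-noise selection rule above can be sketched in a minimal one-covariate form. This is an illustrative assumption, not the paper's exact procedure: here the "signal" is a local least-squares slope fit within a candidate window, the "noise" is its standard error, and the window (center `c`, half-width `h`) maximizing |signal|/SE is chosen. Windows with too few points are skipped, mimicking how sparse regions are rejected by their large SEs.

```python
import numpy as np

def local_slope_and_se(x, y):
    """Least-squares slope of y on x within a window, with its standard error."""
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)          # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of (intercept, slope)
    return beta[1], np.sqrt(cov[1, 1])

def best_window(x, y, centers, half_widths, min_pts=10):
    """Return (SNR, (center, half_width)) for the window maximizing
    |slope| / SE(slope) -- the signal-to-noise selection rule.
    Windows with fewer than min_pts observations are skipped."""
    best_snr, best_win = -np.inf, None
    for c in centers:
        for h in half_widths:
            mask = np.abs(x - c) <= h
            if mask.sum() < min_pts:
                continue
            slope, se = local_slope_and_se(x[mask], y[mask])
            snr = abs(slope) / se
            if snr > best_snr:
                best_snr, best_win = snr, (c, h)
    return best_snr, best_win
```

For example, with data whose mean depends on x only on part of the range, the rule picks a window inside the region of dependence, since the slope there is large relative to its SE.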