Chad M. Schafer and Kjell A. Doksum
This paper considers multiple regression procedures for analyzing the relationship between a response variable and a vector of covariates in a nonparametric setting where both tuning parameters and the number of covariates need to be selected. We introduce an approach which handles the dilemma that with high dimensional data the sparsity of data in regions of the sample space makes estimation of nonparametric curves and surfaces virtually impossible. This is accomplished by abandoning the goal of trying to estimate true underlying curves and instead estimating measures of dependence that can determine important relationships between variables. These dependence measures are based on local parametric fits on subsets of the covariate space that vary in both dimension and size within each dimension. The subset which maximizes a signal to noise ratio is chosen, where the signal is a local estimate of a dependence parameter which depends on the subset dimension and size, and the noise is an estimate of the standard error (SE) of the estimated signal. This approach of choosing the window size to maximize a signal to noise ratio lifts the curse of dimensionality because for regions with sparsity of data the SE is very large. It corresponds to asymptotically maximizing the probability of correctly finding non-spurious relationships between covariates and a response or, more precisely, maximizing asymptotic power among a class of asymptotic level -tests indexed by subsets of the covariate space. Subsets that achieve this goal are called features. We investigate the properties of specific procedures based on the preceding ideas using asymptotic theory and Monte Carlo simulations and find that within a selected dimension, the volume of the optimally selected subset does not tend to zero as the sample size increases unless the volume of the subset of the covariate space where the response depends on the covariate vector tends to zero.