Department of Statistics
Dietrich College of Humanities and Social Sciences

Clustering & Unsupervised Methods


A Comprehensive Approach to Mode Clustering

Mode clustering is a nonparametric method for clustering that defines clusters using the basins of attraction of the modes of a density estimator. We provide several enhancements to mode clustering: (i) a soft variant of cluster assignment, (ii) a measure of connectivity between clusters, (iii) a technique for choosing the bandwidth, (iv) a method for denoising small clusters, and (v) an approach to visualizing the clusters. Combining all these enhancements gives us a complete procedure for clustering in multivariate problems. We also compare mode clustering to other clustering methods in several examples.
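The basin-of-attraction idea behind mode clustering can be sketched with Gaussian mean shift: ascend each data point along the kernel density surface and group points that arrive at the same mode. This is a minimal illustration, not the paper's full procedure (no soft assignment, connectivity measure, bandwidth selection, or denoising step); the bandwidth h and the mode-merging tolerance below are illustrative choices.

```python
import numpy as np

def mean_shift_modes(X, h, n_iter=200, tol=1e-6):
    """Ascend each point to a mode of a Gaussian KDE with bandwidth h."""
    Z = X.astype(float).copy()
    for _ in range(n_iter):
        # weight of data point x_j for iterate z_i: K((z_i - x_j) / h)
        d2 = ((Z[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-0.5 * d2 / h ** 2)
        Z_new = (W @ X) / W.sum(1, keepdims=True)   # kernel-weighted mean
        if np.abs(Z_new - Z).max() < tol:
            return Z_new
        Z = Z_new
    return Z

def mode_cluster(X, h, merge_tol=None):
    """Label each point by the mode whose basin of attraction it falls in."""
    merge_tol = h / 2 if merge_tol is None else merge_tol
    Z = mean_shift_modes(X, h)
    labels, modes = -np.ones(len(X), dtype=int), []
    for i, z in enumerate(Z):
        for k, m in enumerate(modes):
            if np.linalg.norm(z - m) < merge_tol:
                labels[i] = k
                break
        else:                       # no nearby mode found: register a new one
            modes.append(z)
            labels[i] = len(modes) - 1
    return labels, np.array(modes)
```

On two well-separated blobs this recovers one cluster per blob; in practice the bandwidth drives everything, which is why its selection is treated as a problem in its own right above.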

A Fast Clustering Algorithm with Application to Cosmology

We present a fast clustering algorithm for density contour clusters (Hartigan, 1975) that is a modified version of the Cuevas, Febrero and Fraiman (2000) algorithm. By Hartigan's definition, clusters are the connected components of a level set \(S_c \equiv \{f \geq c\}\) where \(f\) is the probability density function. We use kernel density estimators and orthogonal series estimators to estimate \(f\) and modify the Cuevas, Febrero and Fraiman (2000) algorithm to extract the connected components from level set estimators \(\hat{S}_c \equiv \{\hat{f} \geq c\}\). Unlike the original algorithm, our method does not require an extra smoothing parameter and can use the Fast Fourier Transform (FFT) to speed up the calculations. We show that the cosmological definition of clusters of galaxies is equivalent to density contour clusters and present an application in cosmology.
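The level-set idea can be sketched in a few lines: keep the sample points where a density estimate exceeds the level c, link any two survivors closer than a radius eps, and read off connected components. Note that this generic sketch still uses the extra linking radius that the modified algorithm is designed to avoid, and scipy's default KDE bandwidth stands in for a considered choice.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.spatial import cKDTree

def level_set_clusters(X, c, eps):
    """Connected components of the sample points inside {f_hat >= c}.

    Returns one label per point; -1 marks points below the level."""
    f_hat = gaussian_kde(X.T)(X.T)          # KDE evaluated at the data points
    keep = np.where(f_hat >= c)[0]
    parent = np.arange(len(keep))           # union-find over kept points

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i, j in cKDTree(X[keep]).query_pairs(eps):
        parent[find(i)] = find(j)           # link nearby high-density points

    labels = -np.ones(len(X), dtype=int)
    roots = {}
    for k, idx in enumerate(keep):
        labels[idx] = roots.setdefault(find(k), len(roots))
    return labels
```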

A Generalized Single Linkage Method for Estimating the Cluster Tree of a Density

The goal of clustering is to detect the presence of distinct groups in a data set and assign group labels to the observations. Nonparametric clustering is based on the premise that the observations may be regarded as a sample from some underlying density in feature space and that groups correspond to modes of this density. The goal then is to find the modes and assign each observation to the domain of attraction of a mode. The modal structure of a density is summarized by its cluster tree; modes of the density correspond to leaves of the cluster tree. Estimating the cluster tree is the primary goal of nonparametric cluster analysis. We adopt a plug-in approach to cluster tree estimation: estimate the cluster tree of the feature density by the cluster tree of a density estimate. For some density estimates the cluster tree can be computed exactly, for others we have to be content with an approximation. We present a graph-based method that can approximate the cluster tree of any density estimate. Density estimates tend to have spurious modes caused by sampling variability, leading to spurious branches in the cluster tree. We propose excess mass as a measure for the size of a branch, reflecting the height of the corresponding peak of the density above the surrounding valley floor as well as its spatial extent. Excess mass can be used as a guide for pruning the graph cluster tree. We point out mathematical and algorithmic connections to single linkage clustering and illustrate our approach on several examples.
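The cluster tree is easy to trace in one dimension: as the level c rises, the number of connected components of {f_hat >= c} goes from one, to one per mode, to zero, and each increase marks a branch point of the tree. A grid-based sketch of that profile (using scipy's default bandwidth; the excess-mass pruning described above is not shown):

```python
import numpy as np
from scipy.stats import gaussian_kde

def cluster_tree_profile(x, levels, grid_size=512):
    """Count connected components of {f_hat >= c} on a 1-d grid, per level c."""
    grid = np.linspace(x.min() - 1, x.max() + 1, grid_size)
    f = gaussian_kde(x)(grid)
    counts = []
    for c in levels:
        above = f >= c
        # each False -> True transition starts a new component
        starts = above & ~np.r_[False, above[:-1]]
        counts.append(int(starts.sum()))
    return counts
```

For a bimodal sample the profile reads 1 component at a low level, 2 between the valley and the peaks, and 0 above the highest mode; spurious bumps in the estimate show up as extra short-lived components, which is exactly what excess-mass pruning is meant to remove.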

CATS: Clustering After Transformation and Smoothing

CATS - Clustering After Transformation and Smoothing - is a technique for nonparametrically estimating and clustering a large number of curves. Our motivating example is a genetic microarray experiment, but the method is very general. The method includes: transforming and smoothing multiple curves, nonparametric multiple testing for trends, clustering curves with similar shapes, and nonparametric inference for the misclustering rate.
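A toy version of the transform, smooth, and cluster pipeline, with stand-in choices throughout: per-curve standardization as the transformation, a moving average as the smoother, and plain 2-means on the smoothed curves as the clustering step (the actual CATS components, and its misclustering inference, are more refined).

```python
import numpy as np

def smooth_and_cluster(curves, window=5, n_iter=50):
    """Standardize each curve, smooth with a moving average,
    then split the smoothed shapes into two groups with 2-means."""
    C = np.asarray(curves, dtype=float)
    C = (C - C.mean(1, keepdims=True)) / C.std(1, keepdims=True)  # shape only
    kern = np.ones(window) / window
    S = np.array([np.convolve(c, kern, mode="same") for c in C])
    # seed one center at curve 0 and the other at the curve farthest from it
    centers = np.stack([S[0], S[((S - S[0]) ** 2).sum(1).argmax()]])
    for _ in range(n_iter):
        labels = ((S[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.stack([S[labels == j].mean(0) for j in range(2)])
    return labels
```

Standardizing first makes the grouping depend on curve shape rather than level or amplitude, which is the point of clustering "curves with similar shapes".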

Clustering with Confidence: A Binning Approach

We present a plug-in method for estimating the cluster tree of a density. The method takes advantage of the ability to exactly compute the level sets of a piecewise constant density estimate. We then introduce clustering with confidence, an automatic pruning procedure that assesses the significance of splits (and thereby clusters) in the cluster tree; the only user input is the desired confidence level.
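The advantage of a piecewise constant (binned) estimate is that its level sets need no approximation: {f_hat >= c} is a union of bins, so its connected components can be read off exactly. A one-dimensional sketch with a histogram estimate (the confidence-based pruning itself is not shown):

```python
import numpy as np

def histogram_level_components(x, c, bins=30):
    """Exact number of connected components of {f_hat >= c}
    for a histogram (piecewise constant) density estimate."""
    f_hat, _ = np.histogram(x, bins=bins, density=True)
    above = f_hat >= c
    # maximal runs of consecutive above-level bins are the components
    starts = above & ~np.r_[False, above[:-1]]
    return int(starts.sum())
```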

Low-Noise Density Clustering

We study density-based clustering under low-noise conditions. Our framework allows for sharply defined clusters such as clusters on lower dimensional manifolds. We show that accurate clustering is possible even in high dimensions. We propose two data-based methods for choosing the bandwidth and we study the stability properties of density clusters. We show that a simple graph-based algorithm known as the "friends-of-friends" algorithm successfully approximates the high density clusters.
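The friends-of-friends rule is simple to state: fix a linking length b, join any two points closer than b, and take the connected components of the resulting graph as clusters. A minimal sketch using scipy (the linking length is an illustrative tuning choice):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse.csgraph import connected_components

def friends_of_friends(X, b):
    """Group labels from the friends-of-friends linking rule."""
    tree = cKDTree(X)
    # sparse adjacency matrix of all pairs within linking length b
    adj = tree.sparse_distance_matrix(tree, b, output_type="coo_matrix")
    _, labels = connected_components(adj, directed=False)
    return labels
```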

Nonparametric Density Estimation and Clustering in Astronomical Sky Surveys

We present a nonparametric method for galaxy clustering in astronomical sky surveys. We show that the cosmological definition of clusters of galaxies is equivalent to density contour clusters (Hartigan, 1975) \(S_c = \{ f > c \}\) where \(f\) is a probability density function. The plug-in estimator \(\hat S_c =\{ \hat f > c \}\) is used to estimate \(S_c\), where \(\hat f\) is the multivariate kernel density estimator. To choose the optimal smoothing parameter, we use cross-validation and the plug-in method and show that cross-validation outperforms the plug-in method in our case. A new cluster catalogue based on the plug-in estimator is compared to existing cluster catalogues, the Abell and the EDCCI. Our result is more consistent with the EDCCI than with the Abell, which is the most widely used catalogue. We present a practical algorithm for local smoothing and use the smoothed bootstrap to assess the validity of clustering results.
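Cross-validation for the smoothing parameter can be illustrated with least-squares cross-validation in one dimension: for a Gaussian kernel the integrated squared density has a closed form, and the leave-one-out term is a sum over pairs. This is a generic textbook criterion sketched under those assumptions, not necessarily the exact variant used here:

```python
import numpy as np

def lscv_score(x, h):
    """Least-squares cross-validation score for a 1-d Gaussian KDE;
    smaller is better: LSCV(h) = int f_hat^2 - (2/n) sum_i f_hat_{-i}(x_i)."""
    n = len(x)
    d = x[:, None] - x[None, :]
    # int f_hat^2 in closed form: two Gaussian kernels convolve to N(0, 2h^2)
    int_f2 = np.exp(-d ** 2 / (4 * h ** 2)).sum() / (n ** 2 * 2 * h * np.sqrt(np.pi))
    K = np.exp(-d ** 2 / (2 * h ** 2)) / (h * np.sqrt(2 * np.pi))
    loo = (K.sum(0) - K.diagonal()) / (n - 1)   # leave-one-out density at each x_i
    return int_f2 - 2 * loo.mean()
```

Minimizing the score over a grid of bandwidths gives the data-driven choice; the score blows up as h shrinks toward zero and drifts back toward zero as h grows, so the minimizer is interior.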

(Revised 10/04)

Scan Clustering: A False Discovery Approach

We present a method that scans a random field for localized clusters while controlling the fraction of false discoveries. We use a kernel density estimator as the test statistic and correct for the bias in this estimator by a method we introduce in this paper. We also show how to combine information across multiple bandwidths while maintaining false discovery control.
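The false-discovery machinery can be illustrated with the Benjamini-Hochberg step-up rule, the standard way to turn per-location p-values into a discovery set with FDR control; the bias correction and multi-bandwidth combination described above are not shown.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.1):
    """Boolean mask of rejected hypotheses at FDR level alpha (BH step-up)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # compare sorted p-values to the BH line alpha * i / m
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.where(below)[0].max()        # largest index on or under the line
        reject[order[:k + 1]] = True
    return reject
```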