Approximate Single Linkage Cluster Analysis of Large Data Sets
in High Dimensional Spaces
William F. Eddy, Audris Mockus, and Shingo Oue
Abstract:
Motivated by a real-world data set of more than 40,000 observations,
we consider single linkage clustering of large data sets in high
dimensional spaces. In this particular case, the data we consider are
observations in the space of one dimensional sub-manifolds of
Euclidean space. This data set is so large that the individual
observations cannot each be inspected. A goal of the analysis is to
produce a partial ordering of the data so that a useful inspection of
selected observations can take place. An essential step is
construction of a metric on the objects in this high dimensional space
that can be computed fairly quickly. The size of the problem makes
the running time of Prim's algorithm (the fastest available algorithm
for finding a minimal spanning tree of a complete graph) prohibitive.
We have developed several methods which approximate the single
linkage tree. We report the result of applying our approximations to
the data set.
Here is the full postscript text for this
technical report. It is 3.648 Mbytes (thats MegaBytes).