Variable Selection for Consistent Clustering


A common problem in cluster analysis is that different clustering methods can produce drastically different results for the same dataset. Ideally, the variables used to identify the true subgroups in the data should yield consistent results regardless of the clustering algorithm. Building on the maximum clustering similarity (MCS) framework of Albatineh and Niewiadomska-Bugaj (2011), we propose a variable selection technique that identifies the variables with the most consistent clustering results. As in the model-based clustering variable selection method of Raftery and Dean (2006), a greedy search algorithm finds the set of variables that retains the highest average similarity index across the assignments of several different clustering methods. We apply the method to several simulated and real datasets, demonstrating the impact of selecting consistent variables and how the method relates to other variable selection techniques.
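
The greedy search described above can be illustrated with a minimal sketch. It is not the procedure of the paper: it assumes a forward search that stops when agreement no longer improves, uses the plain Rand index as the similarity index, and replaces real clustering methods with three toy two-group splitters (`toy_clusterers`, a hypothetical name) so that the example runs with the Python standard library alone. In practice the clusterers would be actual algorithms such as k-means or hierarchical clustering, and the similarity index and search direction may differ.

```python
from itertools import combinations
from statistics import mean, median

def rand_index(a, b):
    """Fraction of item pairs that two labelings treat alike (both together or both apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def average_similarity(rows, variables, clusterers):
    """Mean pairwise Rand index across the labelings produced by each clusterer."""
    sub = [[row[v] for v in variables] for row in rows]
    labelings = [cluster(sub) for cluster in clusterers]
    return mean(rand_index(a, b) for a, b in combinations(labelings, 2))

def greedy_select(rows, clusterers):
    """Forward search (an assumption): add the variable that most improves
    agreement among the clusterers; stop when no candidate improves it."""
    selected, best = [], float("-inf")
    remaining = list(range(len(rows[0])))
    while remaining:
        scored = [(average_similarity(rows, selected + [v], clusterers), v)
                  for v in remaining]
        score, v = max(scored)
        if score <= best:
            break
        selected.append(v)
        remaining.remove(v)
        best = score
    return selected, best

def _split(rows, cut):
    """Toy stand-in for a clustering method: threshold the row sums at cut(sums)."""
    sums = [sum(r) for r in rows]
    threshold = cut(sums)
    return [int(s > threshold) for s in sums]

# Hypothetical stand-ins for "several different clustering methods".
toy_clusterers = [
    lambda rows: _split(rows, mean),
    lambda rows: _split(rows, median),
    lambda rows: _split(rows, lambda s: (min(s) + max(s)) / 2),
]

# Variable 0 separates two clean groups; variable 1 is skewed noise.
rows = [[0, 0], [1, 0], [0, 6], [10, 0], [11, 0], [10, 20]]
selected, score = greedy_select(rows, toy_clusterers)
print(selected, score)
```

On this toy data the three splitters agree perfectly on variable 0 alone, and adding the skewed variable 1 lowers their agreement, so the search keeps only variable 0.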

University of California, Santa Cruz