A Comparative Study of Dimension Reduction Techniques for Document Clustering


Bin Tang
Xiao Luo
Malcolm I. Heywood
Michael Shepherd

Author Addresses: 

Faculty of Computer Science
Dalhousie University
6050 University Ave.
PO Box 15000
Halifax, Nova Scotia, Canada
B3H 4R2


Dimension reduction techniques (DRTs) are applicable to a wide range of information systems, and the application context has a significant impact on which DRT is appropriate. This research presents a systematic study of four DRTs for the text clustering problem using five benchmark datasets. The four methods are Independent Component Analysis (ICA), Latent Semantic Indexing (LSI), Document Frequency (DF) and Random Projection (RP). ICA and LSI are clearly superior when the k-means clustering algorithm is applied, irrespective of the dataset. Random Projection consistently returns the worst results, which appears to be due to the noise distribution characterizing the document clustering task.
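
As a rough illustration of what two of the four DRTs do to a document representation, the following minimal sketch applies Document Frequency selection and Random Projection to a toy corpus. It is not the code used in the report: the corpus, the function names (document_frequency_select, term_vectors, random_projection), the raw term-count weighting, the Gaussian projection matrix and the target dimensions are all assumptions made for illustration. ICA, LSI and the k-means step are omitted because they would need a linear algebra library.

```python
import random


def document_frequency_select(docs, k):
    """Keep the k terms that occur in the most documents (DF selection)."""
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    # Highest document frequency first; ties broken alphabetically
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return ranked[:k]


def term_vectors(docs, vocab):
    """Raw term-count vectors over a fixed vocabulary (assumed weighting)."""
    return [[doc.count(t) for t in vocab] for doc in docs]


def random_projection(vectors, k, seed=0):
    """Project d-dimensional rows to k dimensions with a Gaussian matrix."""
    rng = random.Random(seed)
    d = len(vectors[0])
    scale = 1.0 / k ** 0.5  # common scaling choice, assumed here
    R = [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]
    return [[sum(v[i] * R[i][j] for i in range(d)) for j in range(k)]
            for v in vectors]


# Hypothetical toy corpus: two finance documents, two sports documents
docs = [
    "stock market trading stock".split(),
    "market price trading".split(),
    "football match goal".split(),
    "football goal season match".split(),
]

full_vocab = sorted({t for doc in docs for t in doc})

# DF: keep 4 of the 8 terms, then build reduced vectors
df_vocab = document_frequency_select(docs, 4)
df_vectors = term_vectors(docs, df_vocab)

# RP: map the full 8-dimensional vectors down to 3 dimensions
rp_vectors = random_projection(term_vectors(docs, full_vocab), 3)
```

Either reduced representation (df_vectors or rp_vectors) could then be passed to a clustering algorithm such as k-means. Note that DF keeps a subset of the original term axes, whereas RP mixes all terms into new axes through a data-independent random matrix.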

Tech Report Number: 
Report Date: 
December 6, 2004
CS-2004-14.pdf (1.97 MB)