A Comparative Study of Dimension Reduction Techniques for Document Clustering

Authors: 

Bin Tang
Xiao Luo
Malcolm I. Heywood
Michael Shepherd

Author Addresses: 

Faculty of Computer Science
Dalhousie University
6050 University Ave.
PO Box 15000
Halifax, Nova Scotia, Canada
B3H 4R2

Abstract: 

Dimension reduction techniques (DRTs) are applicable to a wide range of information systems, and the application context has a significant impact on which DRT is appropriate. In this research, a systematic study of four DRTs is conducted for the text clustering problem using five benchmark datasets. Of the four methods -- Independent Component Analysis (ICA), Latent Semantic Indexing (LSI), Document Frequency (DF) and Random Projection (RP) -- ICA and LSI are clearly superior when the k-means clustering algorithm is applied, regardless of the dataset. Random projection consistently returns the worst results, which appears to be due to the noise distribution characterizing the document clustering task.
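
To make the idea of dimension reduction concrete, here is a minimal sketch of the simplest of the four methods, Document Frequency (DF). It assumes the common variant that keeps the m terms occurring in the most documents; the report itself may use a different threshold rule. The function name df_reduce, the whitespace tokenizer, and the toy documents are hypothetical and chosen only for illustration.

```python
from collections import Counter


def df_reduce(docs, m):
    """Keep the m terms found in the most documents (ties broken alphabetically).

    Assumption: whitespace tokenization and raw term counts; a real
    pipeline would typically add stop-word removal and term weighting.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df, key=lambda t: (-df[t], t))[:m]
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for tokens in tokenized:
        vec = [0] * len(vocab)
        for term in tokens:
            if term in index:
                vec[index[term]] += 1
        vectors.append(vec)
    return vocab, vectors


# Hypothetical toy corpus: two finance documents, two sports documents.
docs = [
    "stock market falls",
    "market rally lifts stock prices",
    "team wins football match",
    "football team loses match",
]
vocab, vectors = df_reduce(docs, 4)
```

The reduced vectors could then be passed to any clustering routine such as k-means. In this toy corpus the two topics already separate in the four retained dimensions.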

Tech Report Number: 
CS-2004-14
Report Date: 
December 6, 2004
Attachment: 
CS-2004-14.pdf (1.97 MB)