A Systematic Study of Document Representation and Dimension Reduction for Text Clustering

Authors: 

Mahdi Shafiei
Singer Wang
Roger Zhang
Evangelos Milios
Bin Tang
Jane Tougas
Ray Spiteri

Author Addresses: 

Faculty of Computer Science
Dalhousie University
6050 University Ave.
PO Box 15000
Halifax, Nova Scotia, Canada
B3H 4R2

Abstract: 

Increasingly large text datasets and the high dimensionality associated with natural language pose a major challenge in text mining. This research presents a systematic study in which three dimension reduction techniques (DRT) are applied to three document representation methods in the context of the text clustering problem. Several standard benchmark datasets are used. The dimension reduction methods considered are independent component analysis (ICA), latent semantic indexing (LSI), and a technique based on document frequency (DF). These three methods are applied to three document representation methods based on the vector space model: word, multi-word term, and character N-gram representations. Results are compared in terms of clustering performance, using the k-means clustering algorithm. Experiments show that ICA and LSI are clearly better than DF on all datasets. For the word and N-gram representations, ICA generally gives better results than LSI. Experiments also show that the word representation gives better clustering results than the term and N-gram representations. Finally, for the N-gram representation, a profile length of 2000 is shown to be enough to capture the information, and in most cases a 4-gram representation performs better than a 3-gram representation.
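As a rough illustration of two of the ingredients named in the abstract, the sketch below builds a character N-gram representation and then reduces its dimensionality by document frequency. It is not the report's implementation: the abstract does not specify how the DF technique selects features, so keeping the N-grams with the highest document frequency is an assumption, and the function names (`char_ngrams`, `select_by_df`, `vectorize`) and the toy documents are hypothetical. The ICA, LSI, and k-means steps are not shown.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def document_frequency(docs, n):
    """Count, for each n-gram, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # set() ensures each n-gram is counted once per document
        df.update(set(char_ngrams(doc, n)))
    return df


def select_by_df(docs, n, keep):
    """Keep the `keep` n-grams with the highest document frequency.

    Assumption: "highest DF first" is only one possible DF-based rule.
    Ties are broken alphabetically so the result is deterministic.
    """
    df = document_frequency(docs, n)
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:keep]]


def vectorize(doc, vocabulary, n):
    """Map a document to raw n-gram counts over the reduced vocabulary."""
    counts = Counter(char_ngrams(doc, n))
    return [counts[gram] for gram in vocabulary]


# Toy corpus (hypothetical); the report uses standard benchmark datasets.
docs = ["text mining", "text clustering", "data mining"]
vocab = select_by_df(docs, n=4, keep=5)
vectors = [vectorize(d, vocab, n=4) for d in docs]
```

Each entry of `vectors` is a fixed-length count vector over the reduced vocabulary, which could then be passed to a clustering algorithm such as k-means. In the setting described in the abstract, `keep` would play the role of the profile length (for example 2000) and `n` would be 3 or 4.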

Tech Report Number: 
CS-2006-05
Report Date: 
July 11, 2006
Attachment: 
CS-2006-05.pdf (PDF, 4.67 MB)