Interactive Document Clustering Using Iterative Class-Based Feature Selection

Authors: 

Yeming Hu
Evangelos Milios
James Blustein

Author Addresses: 

Faculty of Computer Science
Dalhousie University
6050 University Ave.
PO Box 15000
Halifax, Nova Scotia, Canada
B3H 4R2

{yeming,eem,jamie}@cs.dal.ca

Abstract: 

Semi-supervised clustering has been found to improve clustering performance by using constraints between documents. Recent research in active learning indicates that feature identification, which takes much less user effort than document labeling, can improve classification performance. We aim to use this new finding to improve document clustering. We first propose an unsupervised clustering framework that iteratively updates the feature set. Users are then invited to help update the feature set by identifying good features. Experiments on various datasets indicate that the performance of document clustering may improve significantly with some user input. In addition, clustering performance increases initially and then stabilizes as user effort grows. Feature reweighting, which gives higher weights to features confirmed by the users, can achieve better clustering performance with less user effort. Based on our experiments, several guidelines are suggested for applying the interactive framework.
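
The loop described in the abstract (cluster, select features per cluster, re-cluster, optionally boost user-confirmed features) can be illustrated with a minimal sketch. This is not the report's implementation: the abstract does not name the clustering algorithm or the feature-scoring function, so the code below assumes k-means with cosine similarity and a simple document-frequency-difference score as stand-ins. All function names, parameters (`per_class`, `boost`, `confirmed`), and the toy documents are hypothetical.

```python
import math
from collections import Counter

def vectorize(docs, features, weights=None):
    """L2-normalized term-frequency vectors restricted to `features`."""
    weights = weights or {}
    vecs = []
    for doc in docs:
        counts = Counter(doc.split())
        v = [counts[f] * weights.get(f, 1.0) for f in features]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vecs.append([x / norm for x in v])
    return vecs

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kmeans(vecs, k, iters=20):
    """Plain k-means; the first k vectors seed the centroids (assumption, keeps it deterministic)."""
    centroids = [list(v) for v in vecs[:k]]
    labels = [0] * len(vecs)
    for _ in range(iters):
        labels = [max(range(k), key=lambda c: dot(v, centroids[c])) for v in vecs]
        for c in range(k):
            members = [v for v, l in zip(vecs, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def select_features(docs, labels, k, per_class):
    """Per cluster, keep the terms most concentrated in it.
    Score (assumed, not from the report): share of in-cluster documents
    containing the term minus share of out-of-cluster documents containing it."""
    tokens = [set(d.split()) for d in docs]
    vocab = sorted(set().union(*tokens))
    chosen = set()
    for c in range(k):
        inside = [t for t, l in zip(tokens, labels) if l == c]
        outside = [t for t, l in zip(tokens, labels) if l != c]
        def score(term):
            p_in = sum(term in t for t in inside) / max(len(inside), 1)
            p_out = sum(term in t for t in outside) / max(len(outside), 1)
            return p_in - p_out
        ranked = sorted(vocab, key=lambda term: (-score(term), term))
        chosen.update(ranked[:per_class])
    return sorted(chosen)

def iterative_clustering(docs, k, per_class=2, rounds=5, confirmed=(), boost=2.0):
    """Alternate clustering and class-based feature selection until labels stop changing.
    `confirmed` stands in for user-identified features; they are always kept and
    weighted by `boost` (a hypothetical form of feature reweighting)."""
    features = sorted({w for d in docs for w in d.split()})
    weights = {f: boost for f in confirmed}
    labels = None
    for _ in range(rounds):
        new_labels = kmeans(vectorize(docs, features, weights), k)
        if new_labels == labels:
            break
        labels = new_labels
        features = sorted(set(select_features(docs, labels, k, per_class)) | set(confirmed))
    return labels, features

# Made-up toy corpus: three sports-like and three finance-like documents.
toy_docs = [
    "goal team match score",
    "stock market price trade",
    "team match coach goal",
    "market price stock bank",
    "match goal team player",
    "trade bank market stock",
]
```

Calling `iterative_clustering(toy_docs, k=2)` returns the cluster labels together with the reduced feature set; passing `confirmed=("team",)` simulates a user confirming one feature, which is then retained and up-weighted in every round.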

Tech Report Number: 
CS-2010-04
Report Date: 
April 29, 2010
Attachment: 
CS-2010-04.pdf (6.57 MB)