Document clustering using character N-grams: a comparative evaluation with term-based and word-based clustering

Authors: 

Yingbo Miao
Vlado Keselj
Evangelos Milios

Author Addresses: 

Faculty of Computer Science
Dalhousie University
6050 University Ave.
PO Box 15000
Halifax, Nova Scotia, Canada
B3H 4R2

Abstract: 

We propose a new method of document clustering based on character N-grams. Traditionally, in the vector-space model, each dimension corresponds to a word, with an associated weight equal to a word or term frequency measure (e.g., TFIDF). In our method, N-grams define the dimensions, with weights equal to the normalized N-gram frequencies. We further introduce a new measure of the distance between two vectors based on the N-gram representation. In addition, we compare the N-gram representation with a representation based on automatically extracted terms. We evaluate the clustering results using entropy and accuracy, and use a t-test to determine whether the differences between two results are statistically significant. In our experiments, document clustering using character N-grams produces the best results.
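The N-gram representation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the value of N, any text preprocessing, or the definition of the new distance measure, so the function names `ngram_profile` and `euclidean_distance`, the default `n=3`, and the use of plain Euclidean distance are all assumptions made for illustration; the distance shown here is a stand-in, not the measure proposed in the report.

```python
from collections import Counter
import math


def ngram_profile(text, n=3):
    """Map each character n-gram in `text` to its normalized frequency.

    Assumption: frequencies are normalized by the total number of
    n-grams in the document; the report may normalize differently.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}  # text shorter than n has no n-grams
    counts = Counter(grams)
    return {gram: count / total for gram, count in counts.items()}


def euclidean_distance(p, q):
    """Euclidean distance between two sparse n-gram profiles.

    Stand-in only: the report introduces its own distance measure,
    which the abstract does not define.
    """
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


doc_a = ngram_profile("document clustering")
doc_b = ngram_profile("document clusters")
doc_c = ngram_profile("stock market prices")

# Pairwise distances that a clustering algorithm could consume.
print(euclidean_distance(doc_a, doc_b))
print(euclidean_distance(doc_a, doc_c))
```

Each distinct N-gram acts as one dimension of the vector space, so the dictionaries above are sparse vectors; any clustering algorithm that accepts pairwise distances could be run on top of them.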

Tech Report Number: 
CS-2005-23
Report Date: 
September 18, 2005
Attachment:
CS-2005-23.pdf (866.98 KB)