Feature Selection for an n-gram Approach to Web Page Genre Classification

Authors: 

Jane E. Mason
Michael Shepherd
Jack Duffy

Author Addresses: 

Faculty of Computer Science
Dalhousie University
6050 University Ave.
PO Box 15000
Halifax, Nova Scotia, Canada
B3H 4R2

Abstract: 

Web page genre classification is a potentially powerful tool for filtering the results of online searches. In this paper, we describe a set of experiments on the automatic classification of Web pages by genre, using n-gram representations of both the Web pages and the Web page genres together with a distance function classification model. The experiments examine the effect of three feature selection measures on the classification accuracy of this model: frequency, the Chi-square statistic, and Information Gain. The experiments are run on two well-known data sets, 7-Genre and KI-04, for which published results are available. Our results compare very favorably with those of other researchers.
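
The general shape of the approach described in the abstract can be illustrated with a minimal sketch. It is not the implementation from the report: the abstract does not say whether the n-grams are taken over characters, bytes, or words, and it does not name the distance function, so the character-level n-grams, the normalized squared-difference dissimilarity, and the names ngram_profile, profile_distance, and classify are all assumptions made for illustration. Only frequency-based feature selection is shown; the Chi-square statistic and Information Gain are omitted.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=None):
    """Map each selected character n-gram to its relative frequency.

    Frequency-based feature selection: only the top_k most frequent
    n-grams are kept (top_k=None keeps all of them).
    Character-level n-grams are an assumption of this sketch.
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    selected = counts.most_common(top_k)
    return {gram: count / total for gram, count in selected}


def profile_distance(p, q):
    """Assumed dissimilarity between two profiles (not taken from the abstract).

    Sums the squared difference of relative frequencies, each normalized
    by the mean of the two frequencies. An n-gram missing from one
    profile is treated as having frequency 0 there.
    """
    distance = 0.0
    for gram in set(p) | set(q):
        fp = p.get(gram, 0.0)
        fq = q.get(gram, 0.0)
        # At least one of fp, fq is positive, so the divisor is never zero.
        distance += ((fp - fq) / ((fp + fq) / 2)) ** 2
    return distance


def classify(page_text, genre_profiles, n=3, top_k=None):
    """Return the genre label whose profile is closest to the page profile."""
    page_profile = ngram_profile(page_text, n, top_k)
    return min(
        genre_profiles,
        key=lambda genre: profile_distance(page_profile, genre_profiles[genre]),
    )
```

In this sketch a genre profile would be built by calling ngram_profile on the concatenated text of that genre's training pages, and top_k controls how many of the most frequent n-grams survive feature selection.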

Tech Report Number: 
CS-2009-04
Report Date: 
June 22, 2009
Attachment: 
CS-2009-04.pdf (289.24 KB)