Technical Report

Report Title: 

Feature Selection for an n-gram Approach to Web Page Genre Classification

Authors: 

Jane E. Mason, Michael Shepherd, and Jack Duffy

Tech Report Number: 

CS-2009-04

Report Date: 

June 22nd, 2009

Abstract: 

Web page genre classification is a potentially powerful tool for filtering the results of online searches. In this paper, we describe a set of experiments on the automatic classification of Web pages by genre, using n-gram representations of both the Web pages and the Web page genres together with a distance function classification model. The experiments examine the effect of three feature selection measures on the classification accuracy of this model: frequency, the Chi-square statistic, and Information Gain. The experiments are run on two well-known data sets, 7-Genre and KI-04, for which published results are available. Our results compare very favorably with those of other researchers.
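
As a rough illustration of the kind of model the abstract describes, the following Python sketch builds character n-gram profiles, keeps only the most frequent n-grams (frequency-based feature selection), and assigns a page to the genre whose profile is closest. The abstract does not specify the distance function, the n-gram length, or the profile size, so the relative-difference distance and the names `ngram_profile`, `profile_distance`, `classify`, `n`, and `top_k` are all assumptions made for this example. Chi-square and Information Gain selection are not shown.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=None):
    """Return normalized character n-gram frequencies for `text`.

    Frequency-based feature selection: only the `top_k` most frequent
    n-grams are kept (all of them when top_k is None).
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {g: c / total for g, c in counts.most_common(top_k)}


def profile_distance(p, q):
    """Sum of squared relative frequency differences over all n-grams.

    Assumed distance for illustration only; an n-gram missing from a
    profile is treated as having frequency 0 in that profile.
    """
    d = 0.0
    for g in set(p) | set(q):
        a, b = p.get(g, 0.0), q.get(g, 0.0)
        d += ((a - b) / ((a + b) / 2)) ** 2
    return d


def classify(page_text, genre_profiles, n=3, top_k=None):
    """Return the genre label whose profile is nearest to the page profile."""
    page = ngram_profile(page_text, n, top_k)
    return min(genre_profiles,
               key=lambda genre: profile_distance(page, genre_profiles[genre]))


# Toy usage: in practice each genre profile would be built from the
# combined text of that genre's training pages.
genres = {
    "faq": ngram_profile("question answer question answer how do i", 3, 50),
    "shop": ngram_profile("price cart buy now add to cart checkout price", 3, 50),
}
label = classify("how do i answer this question", genres, 3, 50)
```

Swapping in a different selection measure would only change which n-grams `ngram_profile` retains; the distance and classification steps stay the same under these assumptions.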

Author Addresses: 

Dalhousie University
Halifax, NS
Canada
