A Survey of SOM Based Approaches to Document Classification

Authors: 

Chris Jordan

Author Addresses: 

Faculty of Computer Science
Dalhousie University
6050 University Ave.
PO Box 15000
Halifax, Nova Scotia, Canada
B3H 4R2

Abstract: 

Document classification has been an object of study for many years. The Web, along with other technologies such as digital libraries, has facilitated the growth of document collections in both size and popularity. Classification is important because it allows users to browse effectively and to quickly understand the general contents of these corpuses. Most approaches to classification have used supervised learning. This has been acceptable in the past because collections were small enough for teams of experts to generate representative training datasets; this is not feasible for very large corpuses. Such repositories require unsupervised learning methods that can find the clusters with a minimum of human direction. A self-organizing map (SOM) is one such clustering algorithm; it places clusters that are similar to each other close together on a lattice. It is attractive in that it offers users an intuitive interface for browsing the document collection. This paper discusses what a SOM is and what data preprocessing needs to be done before it can be applied to a document set. Following this, an analytical survey of four popular works involving SOM-based algorithms for document classification is presented.

Tech Report Number: 
CS-2003-10
Report Date: 
December 2, 2003
Attachment: 
CS-2003-10.pdf (2.1 MB)