An Unsupervised Method for Extracting Domain-specific Affixes in Biological Literature

Authors: 

Haibin Liu
Christian Blouin
Vlado Keselj

Author Addresses: 

Faculty of Computer Science
Dalhousie University
6050 University Ave.
PO Box 15000
Halifax, Nova Scotia, Canada
B3H 4R2

Abstract: 

We propose an unsupervised method, based on a PATRICIA tree, for automatically extracting domain-specific prefixes and suffixes from biological corpora. The method is evaluated by integrating the extracted affixes into an existing learning-based biological term annotation system.

The system based on our method achieves experimental results comparable to those of the original system, both in locating biological terms and in exact-match term annotation.

However, our method improves system efficiency by significantly reducing the size of the feature set. It also achieves better performance when only a small training data set is available.

Since the affix extraction process is unsupervised, we expect that the method can be generalized to extract affixes specific to other domains, thus assisting in domain-specific concept recognition.
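
As a rough illustration of the general idea of tree-based affix extraction, the sketch below inserts each distinct word into a character trie (and each reversed word into a second trie) and reports character paths shared by several words as candidate prefixes and suffixes. It is not the procedure from the report: it uses a plain, uncompressed trie rather than a true PATRICIA tree, and the frequency and length thresholds (`min_count`, `min_len`), the function names, and the sample word list are all assumptions made for this example.

```python
class TrieNode:
    """Node of a plain character trie (a simplification of a PATRICIA tree)."""

    def __init__(self):
        self.children = {}
        self.count = 0  # number of inserted words whose path passes through this node


def insert(root, word):
    """Insert a word, incrementing the count of every node along its path."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
        node.count += 1


def frequent_paths(root, min_count, min_len):
    """Return {path: count} for paths shared by at least min_count words."""
    results = {}
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node.children.items():
            # Counts never increase with depth, so infrequent branches are pruned.
            if child.count >= min_count:
                new_path = path + ch
                if len(new_path) >= min_len:
                    results[new_path] = child.count
                stack.append((child, new_path))
    return results


def extract_affixes(words, min_count=2, min_len=3):
    """Hypothetical helper: candidate prefixes and suffixes with their word counts.

    The thresholds are illustrative assumptions, not values from the report.
    """
    prefix_root, suffix_root = TrieNode(), TrieNode()
    for w in set(words):  # count each distinct word once
        insert(prefix_root, w)
        insert(suffix_root, w[::-1])  # reversed words turn suffixes into prefixes
    prefixes = frequent_paths(prefix_root, min_count, min_len)
    suffixes = {
        path[::-1]: count
        for path, count in frequent_paths(suffix_root, min_count, min_len).items()
    }
    return prefixes, suffixes


if __name__ == "__main__":
    sample = ["kinase", "synthase", "protease", "interleukin", "interferon"]
    prefixes, suffixes = extract_affixes(sample)
    print("prefixes:", prefixes)
    print("suffixes:", suffixes)
```

On this toy word list the only suffix candidate is "ase" (shared by three words), and the prefix candidates are "int", "inte", and "inter" (each shared by two words). A practical system would need an additional scoring step to select among such nested candidates; how the report does this is not stated in the abstract.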

Tech Report Number: 
CS-2008-01
Report Date: 
January 16, 2008
Attachment: 
CS-2008-01.pdf (416.59 KB)