N-gram-based Classification and Hierarchical Clustering of Genome Sequences

Authors: 

Andrija Tomovic
Predrag Janicic
Vlado Keselj

Author Addresses: 

Andrija Tomovic
Friedrich Miescher Institute for Biomedical Research
Maulbeerstrasse 66, CH-4058 Basel, Switzerland

and

Predrag Janicic
Faculty of Mathematics, University of Belgrade,
Studentski trg 16, 11000 Belgrade, Serbia and Montenegro

and

Vlado Keselj
Faculty of Computer Science, Dalhousie University,
Halifax, NS, Canada, B3H 1W5

Abstract: 

We address the problem of automated classification of isolates, i.e., determining the family of genomes to which a given genome belongs. We also address the problem of automated hierarchical clustering of isolates based solely on their statistical substring properties. For both problems we present algorithms based on a similarity distance between nucleotide n-gram profiles. The experimental results are very positive and suggest that the proposed techniques can be applied successfully to a variety of related problems.
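
To make the idea of comparing nucleotide n-gram profiles concrete, the following is a minimal sketch and not the report's actual algorithm. The abstract does not specify the distance measure, so the relative-difference dissimilarity used here is an assumption, as are the function names `ngram_profile`, `profile_distance`, and `classify`, the default n-gram length, and the representation of each family by a single reference sequence.

```python
from collections import Counter


def ngram_profile(seq, n=3, size=None):
    """Normalized frequencies of the `size` most common n-grams in seq.

    Hypothetical helper: the profile length and n are illustrative choices.
    """
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    # most_common(None) returns every n-gram, ordered by count.
    return {gram: c / total for gram, c in counts.most_common(size)}


def profile_distance(p1, p2):
    """Assumed dissimilarity: sum of squared relative frequency differences."""
    d = 0.0
    for gram in set(p1) | set(p2):
        f1 = p1.get(gram, 0.0)
        f2 = p2.get(gram, 0.0)
        # f1 + f2 > 0 because gram occurs in at least one profile.
        d += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return d


def classify(query, families, n=3):
    """Return the family name whose reference profile is closest to the query.

    `families` maps a family name to one reference sequence (an assumption).
    """
    qp = ngram_profile(query, n)
    return min(
        families,
        key=lambda name: profile_distance(qp, ngram_profile(families[name], n)),
    )
```

Under the same assumptions, the pairwise values of `profile_distance` over a set of isolates could serve as input to a standard agglomerative procedure for hierarchical clustering.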

Tech Report Number: 
CS-2005-02
Report Date: 
March 10, 2005
Attachment: 
CS-2005-02.pdf (PDF, 1002.4 KB)