An Eulerian Approach to N-Gram Table Conversion

Authors: 

Aaron Olson

Author Addresses: 

Faculty of Computer Science
Dalhousie University
6050 University Ave.
PO Box 15000
Halifax, Nova Scotia, Canada
B3H 4R2

Abstract: 

In natural language processing, sequences of n characters or words known as n-grams are used for authorship attribution, text clustering, and other types of text analysis. Given a text corpus and a word n-gram table, we examine the problem of generating tables of character n-grams from tables of word n-grams and vice versa. We also present a framework for analysing this problem and develop techniques to solve some of the more common instances of the problem.

Given a table of word n-grams, we describe how to derive the equivalent character n-gram table. For character n-gram table to word n-gram table conversion, we develop a new technique based on Eulerian circuits in directed graphs. We test a proof-of-concept implementation of this technique and evaluate its effectiveness in guessing words from the original text. We also suggest ways in which this technique may be extended and improved for future work with n-grams in NLP and other areas of text processing.

Tech Report Number: 
CS-2005-22
Report Date: 
January 5, 2005
AttachmentSize
PDF icon CS-2005-22.pdf997.8 KB