CSCI 4141A - MILESTONE 1

Given: September 12, 2006

Due: September 26, 2006 - in class

Purpose of Milestone

The purpose of this milestone is to examine the term-frequency distribution in full text documents, and to become familiar with the data that you will be using for the project.

Zipf Constant

The Zipf constant is not really a constant; it is an empirically determined value that is approximately the same across text databases, which allows some "rule-of-thumb" modeling. The Zipf value for a particular term is calculated as follows:

calculate the frequency of occurrence of each term
rank the terms in descending order of frequency (ties do not matter)
assign rank 1 to the first term in the list (highest frequency)
assign ranks in ascending order to the rest of the list, so that the largest rank number goes to the lowest-frequency term
the Zipf value for each term is the rank of the term times its frequency of occurrence, divided by the total number of word occurrences (not the number of different terms) in the database
the Zipf constant for a database is the average of all the Zipf values
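
The steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic on a toy word list, not a required implementation; the function names zipf_values and zipf_constant are made up for this example, and breaking ties alphabetically is an arbitrary choice since ties do not matter.

```python
from collections import Counter


def zipf_values(counts):
    """Return (term, rank, frequency, zipf_value) rows, highest frequency first."""
    total = sum(counts.values())  # total word occurrences, not distinct terms
    # Sort by descending frequency. Ties are broken alphabetically only to
    # make the output deterministic (an arbitrary choice for this sketch).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (term, rank, freq, rank * freq / total)
        for rank, (term, freq) in enumerate(ranked, start=1)
    ]


def zipf_constant(counts):
    """Average of the Zipf values over all ranks."""
    rows = zipf_values(counts)
    if not rows:
        raise ValueError("no terms to average")
    return sum(row[3] for row in rows) / len(rows)


# Toy data: 8 word occurrences, 5 different terms.
sample = Counter("the cat and the dog and the bird".split())
rows = zipf_values(sample)
constant = zipf_constant(sample)

for term, rank, freq, value in rows:
    print(f"{rank:>4} {term:<6} {freq:>3} {value:.3f}")
print(f"Zipf constant: {constant:.3f}")
```

In the toy data, "the" has rank 1 and frequency 3, so its Zipf value is 1 * 3 / 8 = 0.375.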

Two very good references are:

The Text File - Please do NOT copy the file to your own disk area

The file (description) to be processed is on torch:
/data/fcs/Courses/InfoRetrieval/csci4141/csci4141.txt

Please do the following:

Process only terms in the ‹TITLE› and ‹BODY› fields of each news item.
Convert all alphabetic characters to either upper or lower case so that your result will be case insensitive.
Process the text, extracting each term and counting the number of times it occurs. This should result in a file in which each record consists of 2 fields: the term and the number of times it occurs.
Sort the terms file into descending order of occurrence.
Remove all XML tags (do this before counting, so that tags are not counted as terms)
Determine the following and hand in the results:
- Statement of Confirmation of Independent Work
- the number of different terms
- the total number of term occurrences
- the number of different terms that occur only once
- what proportion of the total number of word occurrences do the top 20% of the different terms account for
- the top 20 occurring terms, showing the term and the number of times it occurs
- a log-log plot of the rank-frequency distribution
- the Zipf constant for this database (average for all ranks)
- DO NOT hand in the complete list of terms
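
As a rough illustration of how the counts requested above fit together, here is a small Python sketch run on a made-up news item. Several things in it are assumptions rather than facts about the course file: that the fields are delimited as <TITLE>...</TITLE> and <BODY>...</BODY>, that a term is a run of alphabetic characters, and that "top 20%" is rounded down to a whole number of terms. The names count_terms and summarize are hypothetical. Check the file description before relying on any of this.

```python
import re
from collections import Counter

# Assumption: fields are written as <TITLE>...</TITLE> and <BODY>...</BODY>.
FIELD_RE = re.compile(r"<(TITLE|BODY)>(.*?)</\1>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")   # any tag left inside a field
TERM_RE = re.compile(r"[a-z]+")   # assumption: a term is a run of letters


def count_terms(text):
    """Count lower-cased terms found inside TITLE and BODY fields only."""
    counts = Counter()
    for match in FIELD_RE.finditer(text):
        field = TAG_RE.sub(" ", match.group(2)).lower()
        counts.update(TERM_RE.findall(field))
    return counts


def summarize(counts, top_fraction=0.2, top_n=20):
    """Compute the summary figures from a term -> frequency mapping."""
    ranked = counts.most_common()             # descending order of occurrence
    total = sum(counts.values())
    top_k = int(len(ranked) * top_fraction)   # rounded down (an assumption)
    covered = sum(freq for _, freq in ranked[:top_k])
    return {
        "different_terms": len(ranked),
        "total_occurrences": total,
        "terms_occurring_once": sum(1 for _, freq in ranked if freq == 1),
        "top_fraction_share": covered / total if total else 0.0,
        "top_terms": ranked[:top_n],
    }


# Made-up news item; the DATE field is ignored because it is not TITLE or BODY.
sample = """<DOC><DATE>12-SEP-2006</DATE>
<TITLE>The Cat and the Dog</TITLE>
<BODY>The cat saw the dog. The dog ran.</BODY></DOC>"""

counts = count_terms(sample)
summary = summarize(counts)
for key, value in summary.items():
    print(key, value)
```

On the real file, read the text in pieces rather than all at once if memory is a concern; the sketch keeps everything in one string only for brevity.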

Please note the following:

From this milestone, you should be able to develop your own stop list of noise words for the rest of the project.

Power Hints for Milestone 1

The ranked list can be created with a one-line string of Unix commands.
I am not going to give you the command string, but the sequence of Unix tools includes the following: grep, sort, tr, uniq
These are listed in alphabetical order. Pipe the output of one tool into the input of the next, and so on. Some tools were used more than once.
See the man pages if you are not familiar with these.

Plotting Hints for Milestone 1

Please use gnuplot for plotting the log-log graph. It is available on torch.
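
One possible way to prepare the plot, sketched in Python: write the rank and frequency as two columns in a data file, plus a short gnuplot script that switches both axes to logarithmic scale. The file names rankfreq.dat and rankfreq.gp and the function name write_gnuplot_inputs are made up for this example, and the frequencies shown are toy values.

```python
from pathlib import Path


def write_gnuplot_inputs(frequencies, data_path="rankfreq.dat",
                         script_path="rankfreq.gp"):
    """Write a 'rank frequency' data file and a gnuplot script that plots it."""
    ranked = sorted(frequencies, reverse=True)  # rank 1 = highest frequency
    lines = [f"{rank} {freq}" for rank, freq in enumerate(ranked, start=1)]
    Path(data_path).write_text("\n".join(lines) + "\n")

    script = "\n".join([
        "set logscale xy",          # log scale on both axes
        'set xlabel "rank"',
        'set ylabel "frequency"',
        f'plot "{data_path}" using 1:2 with points title "rank-frequency"',
    ]) + "\n"
    Path(script_path).write_text(script)
    return data_path, script_path


# Toy frequencies; in practice these come from the sorted terms file.
data_file, script_file = write_gnuplot_inputs([5, 3, 2, 1, 1, 1])
print(Path(script_file).read_text())
```

The generated script can then be run with something like gnuplot -persist rankfreq.gp. To produce a file for handing in, a set terminal and set output line would need to be added to the script; which terminals are available depends on the gnuplot installation.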