CSCI 4141A - MILESTONE 1

Given: September 12, 2006

Due: September 26, 2006 - in class

Purpose of Milestone

The purpose of this milestone is to examine the term-frequency distribution in full-text documents and to become familiar with the data that you will be using for the project.
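
As a starting point, a term-frequency table can be built in a few lines. The sketch below is only an illustration: the function name term_frequencies and the sample sentence are hypothetical, and it assumes a term is a lowercased run of ASCII letters, which may not match the tokenization expected for the project.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each term occurs in text.

    Assumption: a term is a lowercased run of ASCII letters.
    Digits, hyphens, and apostrophes are treated as separators.
    """
    terms = re.findall(r"[a-z]+", text.lower())
    return Counter(terms)


# Hypothetical sample text, not taken from the course data file.
sample = "The cat saw the dog. The dog ran."
freqs = term_frequencies(sample)

# Print the terms from most to least frequent.
for term, count in freqs.most_common():
    print(term, count)
```

For the real data, the same function could be applied one line at a time while reading the file in place on torch, adding each line's counts to a running Counter.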

Zipf Constant

The Zipf constant is not truly constant; it is an empirically determined value that holds only approximately for text databases, which is enough to support some "rule-of-thumb" modeling. The Zipf value for a particular term is calculated as follows:
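
For orientation only, here is a minimal sketch that assumes one common textbook form of Zipf's law, in which rank times frequency divided by the total number of term occurrences is roughly constant. That form, the function name zipf_values, the toy counts, and the rule for breaking ties in rank are all assumptions and may differ from the definition intended for this milestone.

```python
from collections import Counter


def zipf_values(freqs):
    """Return (term, rank, frequency, value) tuples, most frequent first.

    Assumption: value = rank * frequency / N, where N is the total
    number of term occurrences. Terms with equal frequency receive
    distinct consecutive ranks, ordered alphabetically.
    """
    total = sum(freqs.values())
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (term, rank, count, rank * count / total)
        for rank, (term, count) in enumerate(ranked, start=1)
    ]


# Hypothetical toy counts chosen so the arithmetic is easy to check by hand.
toy = Counter({"the": 50, "of": 25, "data": 15, "zipf": 10})
rows = zipf_values(toy)
for term, rank, count, value in rows:
    print(f"{term:6s} rank={rank} freq={count} value={value:.2f}")
```

With these toy counts the value stays between 0.40 and 0.50 across all four ranks, which is the kind of approximate constancy the rule of thumb relies on.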

Two very good references are:

The Text File - Please do NOT copy the file to your own disk area

The file (description) to be processed is on torch:
/data/fcs/Courses/InfoRetrieval/csci4141/csci4141.txt

Please do the following:

Please note the following:

From this milestone, you should be able to develop your own stop list of noise words for the rest of the project.
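
One simple way to start a stop list is to take the most frequent terms as candidates and then review them by hand. The sketch below assumes that approach; the function name stop_list_candidates, the cutoff top_k, and the toy counts are hypothetical, and frequency alone will not separate noise words from common but meaningful terms.

```python
from collections import Counter


def stop_list_candidates(freqs, top_k):
    """Return the top_k most frequent terms as stop-word candidates.

    Ties in frequency are broken alphabetically so the result is
    deterministic. The cutoff top_k is a judgment call, not a rule.
    """
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


# Hypothetical toy counts, not taken from the course data file.
toy = Counter({"the": 120, "of": 80, "retrieval": 9, "and": 75, "index": 7})
candidates = stop_list_candidates(toy, top_k=3)
print(candidates)
```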

Power Hints for Milestone 1

Plotting Hints for Milestone 1

Please use gnuplot for plotting the log-log graph. It is available on torch.
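
A sketch of preparing the data for gnuplot, assuming the graph plots frequency against rank: the code writes one "rank frequency" pair per line to a plain-text file. The function name write_rank_frequency, the file name zipf.dat, and the toy counts are hypothetical.

```python
import os
import tempfile
from collections import Counter


def write_rank_frequency(freqs, path):
    """Write 'rank frequency' pairs, one per line, most frequent first.

    The first line starts with '#', which gnuplot treats as a comment.
    """
    counts = sorted(freqs.values(), reverse=True)
    with open(path, "w") as out:
        out.write("# rank frequency\n")
        for rank, count in enumerate(counts, start=1):
            out.write(f"{rank} {count}\n")


# Hypothetical toy counts; the output goes to the system temp directory.
toy = Counter({"the": 50, "of": 25, "data": 15, "zipf": 10})
data_path = os.path.join(tempfile.gettempdir(), "zipf.dat")
write_rank_frequency(toy, data_path)
print("wrote", data_path)
```

In gnuplot, the command set logscale xy switches both axes to logarithmic scale, after which plot 'zipf.dat' using 1:2 with points (with the path adjusted to wherever the file was written) draws the log-log graph.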