CSCI 4141A - MILESTONE 1
Given: September 12, 2006
Due: September 26, 2006 - in class
Purpose of Milestone
The purpose of this milestone is to examine the term-frequency distribution
in full text documents, and to become familiar with the data that you will be using for
the project.
Zipf Constant
The Zipf constant is not really a constant. It is an empirically determined value that
holds approximately for text databases, so that some "rule-of-thumb" modeling can be done.
The Zipf value for a particular term is calculated as follows:
- calculate the frequency of occurrence of each term
- rank the terms in descending order of frequency (ties do not matter)
- assign rank 1 to the first term in the list (highest frequency)
- assign ranks 2, 3, ... in ascending order to the rest of the list, so that the
largest rank number goes to the lowest frequency
- the Zipf value for each term is the rank of the term times the frequency of occurrence
of that term, divided by the total number of word occurrences (not the number of
different terms) in the database
- the Zipf constant for a database is the average of all the Zipf values
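The steps above can be sketched in a few lines of Python on a toy word list. This is only an illustration of the arithmetic, not the required method: the function names zipf_values and zipf_constant are made up for this example, and ties are broken alphabetically purely so the output is repeatable (the handout says ties do not matter).

```python
from collections import Counter

def zipf_values(term_counts):
    """Return (term, rank, frequency, zipf_value) tuples; rank 1 is the most frequent term."""
    # Total number of word occurrences, not the number of different terms.
    total = sum(term_counts.values())
    # Descending frequency; alphabetical tie-break is an assumption for repeatability.
    ranked = sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, rank, freq, rank * freq / total)
            for rank, (term, freq) in enumerate(ranked, start=1)]

def zipf_constant(term_counts):
    """Average of the Zipf values over all ranks."""
    values = zipf_values(term_counts)
    return sum(z for _, _, _, z in values) / len(values)

# Toy data: 8 word occurrences, 5 different terms.
counts = Counter("the cat and the dog and the bird".split())
for term, rank, freq, z in zipf_values(counts):
    print(rank, term, freq, z)
print("Zipf constant:", zipf_constant(counts))
```

On a collection this small the values are far from stable; the point is only to show which quantities are multiplied and which total is used as the divisor.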
Two very good references are:
The Text File - Please do NOT copy the file to your own disk area
The file (description) to be processed is on torch:
/data/fcs/Courses/InfoRetrieval/csci4141/csci4141.txt
Please do the following:
- Process only terms in the ‹TITLE› and ‹BODY› fields of each news item.
- Convert all alphabetic characters to either upper or lower case so
that your result will be case insensitive.
- Process the text extracting each term and counting the number of times
it occurs. This should result in a file with each record consisting of
2 fields, the term and the number of times it occurs.
- Sort the terms file into descending order of occurrence.
- Remove all XML tags
- Determine the following and hand in the results:
  - Statement of Confirmation of Independent Work
  - the number of different terms
  - the total number of term occurrences
  - the number of different terms that occur only once
  - the proportion of the total number of word occurrences that the top 20% of the
    different terms account for
  - the top 20 occurring terms, showing the term and the number of times each
    occurs
  - a log-log plot of the rank-frequency distribution
  - the Zipf constant for this database (average for all ranks)
- DO NOT hand in the complete list of terms
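As a rough illustration of the quantities requested above, the following Python sketch computes them for a short sample string. Several things here are assumptions rather than part of the assignment: a term is taken to be a run of the letters a-z after lowercasing, "top 20%" is rounded to the nearest whole number of terms (at least one), and the names count_terms and summarize are invented for this example. It does not deal with field selection or tag removal.

```python
import re
from collections import Counter

def count_terms(text):
    """Count case-insensitive terms. Assumption: a term is a run of letters a-z."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def summarize(term_counts, top_fraction=0.2, top_n=20):
    """Summary statistics for a non-empty term-count table."""
    # Descending frequency; alphabetical tie-break only for repeatable output.
    ranked = sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    distinct = len(ranked)
    total = sum(freq for _, freq in ranked)
    once = sum(1 for _, freq in ranked if freq == 1)
    # Assumption: round the size of the "top 20%" group to a whole number of terms.
    k = max(1, round(top_fraction * distinct))
    top_share = sum(freq for _, freq in ranked[:k]) / total
    return {
        "different_terms": distinct,
        "total_occurrences": total,
        "terms_occurring_once": once,
        "top_share": top_share,
        "top_terms": ranked[:top_n],
    }

sample = "The cat and the dog and THE bird."
stats = summarize(count_terms(sample))
for name, value in stats.items():
    print(name, value)
```

With the real file, the same summary would be applied to the full term-count table rather than to a sample string.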
Please note the following:
From this milestone, you should be able to develop your own stop
list of noise words for the rest of the project.
Power Hints for Milestone 1
- The ranked list can be
created with a one-line string of unix commands.
- I am not going to give you the command string, but the sequence of unix
tools includes the following: grep, sort, tr, uniq
- These are listed in alphabetical order. Pipe the output of one into the
input of the next, and so on. Some tools were used more than once.
- See the man pages if you are not familiar with these.
Plotting Hints for Milestone 1
Please use gnuplot for plotting the log-log graph.
It is available on torch.
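One possible way to prepare the plot data is to write a two-column file of rank and frequency that gnuplot can read. The Python sketch below assumes the term counts are already in memory; the function name write_rank_frequency and the file name rank_freq.dat are hypothetical. In gnuplot, `set logscale xy` followed by `plot "rank_freq.dat" using 1:2 with points` should then give the log-log graph.

```python
import os
import tempfile

def write_rank_frequency(term_counts, path):
    """Write one 'rank frequency' line per term, most frequent first; return the line count."""
    freqs = sorted(term_counts.values(), reverse=True)
    with open(path, "w") as out:
        for rank, freq in enumerate(freqs, start=1):
            out.write(f"{rank} {freq}\n")
    return len(freqs)

# Toy counts; the demo writes into a temporary directory so nothing is left behind.
toy_counts = {"the": 3, "and": 2, "cat": 1, "dog": 1, "bird": 1}
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "rank_freq.dat")  # hypothetical file name
    written = write_rank_frequency(toy_counts, data_path)
    with open(data_path) as f:
        lines = f.read().splitlines()
print(lines)
```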