CSCI 4141 -- Topics

Term distribution, term weighting, feature selection
- Term distribution - Zipf's Law
- tf.idf
- Feature Set Selection
  - Cluster-based measure
  - Information Gain Measure
  - Unsupervised concept identification
    - Feature set reduction based on clustering
    - Latent Semantic Indexing
Models
- Boolean Model
- Vector Space Model (VSM)
  - Vector Space Model - binary weights
  - Vector Space Model - non-binary weights
  - cosine similarity measure
  - Rocchio's Feedback Method
- Probabilistic Model
- Language Models
File structures
- Inverted file structures for Boolean and VSM
Evaluation of Effectiveness
- Recall, Precision and Fallout
- Average Precision and Recall
- F-measure and E-measure
- Normalized Recall
Clustering
- Cluster Hypothesis
- Retrieval using clusters, More-like-this-one, Scatter-Gather
- Partitioning
  - Single Pass Algorithm
  - k-means algorithm
  - The Davies-Bouldin Index for evaluation of clustering structure
- Hierarchical
  - Top-down or Divisive
    - Bisecting k-means
  - Bottom-up or Agglomerative
    - Minimum Spanning Tree
    - Prim-Dijkstra Algorithm
  - Λ Measure for evaluation of clustering structure
Index Structures
- Tries
- PATRICIA Trees
- Suffix Tries and Suffix Trees
- Word Signatures and Bloom Filters
- String searching (Aho-Corasick)
Social Network Analysis
- Prestige measure
- Co-citation networks
- PageRank Algorithm
- HITS Algorithm
Research Talks (see PowerPoint slides under readings)
- Challenges in Information Retrieval
- Genre and Task
- Tacit Knowledge