Node Similarity in Networked Information Spaces

Authors: 

Wangzhong Lu
Jeannette Janssen
Evangelos Milios
Nathalie Japkowicz

Author Addresses: 

Wangzhong Lu
Faculty of Computer Science
Dalhousie University
6050 University Ave.
PO Box 15000
Halifax, Nova Scotia, Canada
B3H 4R2

Jeannette Janssen
Department of Mathematics and Statistics
Dalhousie University
Halifax, Nova Scotia, Canada, B3H 3J5
janssen@mathstat.dal.ca

Evangelos Milios
Faculty of Computer Science
Dalhousie University
6050 University Ave.
PO Box 15000
Halifax, Nova Scotia, Canada
B3H 4R2
eem@cs.dal.ca

Nathalie Japkowicz
School of Information Technology & Engineering
University of Ottawa
150 Louis Pasteur, P.O. Box 450 Stn. A
Ottawa, Ontario, Canada K1N 6N5
nat@csi.uottawa.ca

Abstract: 

Networked information spaces contain information entities, corresponding to nodes, which are connected by associations, corresponding to links in the network. Examples of networked information spaces are: the World Wide Web, where information entities are web pages, and associations are hyperlinks; the scientific literature, where information entities are articles and associations are references to other articles. Similarity between information entities in a networked information space can be defined not only based on the content of the information entities, but also based on the connectivity established by the associations present. This paper explores the definition of similarity based on connectivity only, and proposes several algorithms for this purpose. Our metrics take advantage of the local neighborhoods of the nodes in the networked information space. Therefore, explicit availability of the networked information space is not required, as long as a query engine is available for following links and extracting the necessary local neighbourhoods for similarity estimation. Two variations of similarity estimation between two nodes are described, one based on the separate local neighbourhoods of the nodes, and another based on the joint local neighbourhood expanded from both nodes at the same time. The algorithms are implemented and evaluated on the citation graph of computer science. The immediate application of this work is in finding papers similar to a given paper in a digital library, but they are also applicable to other networked information spaces, such as the Web.

Tech Report Number: 
CS-2001-03
Report Date: 
January 26, 2001