web robots

Web Robot Project Bibliography

1. WEB INDEXING, SEARCH ENGINES
2. IMAGE/MULTIMEDIA CONTENT-BASED RETRIEVAL
3. NATURAL LANGUAGE PROCESSING
4. FOCUSED CRAWLERS
5. WEB SCIENCE and GRAPH THEORETIC APPROACHES
6. SPECIAL ISSUE ON INTELLIGENT INTERNET SYSTEMS ARTIFICIAL INTELLIGENCE JOURNAL 118(1-2)
7. INTELLIGENT WEB AGENTS TALKS
8. ONTOLOGY/HIERARCHY LEARNING
9. AI-SPECIFIC SOFTWARE RESOURCES
10. MISC
11. Agent-based economics
12. Web-Information Filtering Lab --- TREC
13. Machine learning and information extraction
14. Web information retrieval
15. Data sets
16. Statistical Machine Learning

(Material) Robotics references

Journals to publish in
Text mining - Knowledge Management industry
Health text mining
Industry collaboration reference

Graduate Courses
J. Kleinberg: The Structure of Information Networks Cornell, Computer Science 685 Fall 2002
J. Kleinberg: Randomized and High-Dimensional Algorithms Cornell, Computer Science (Spring 2001).

1. WEB INDEXING, SEARCH ENGINES

Google http://google.stanford.edu/about.html
Google Anatomy paper GoogleAnatomy.pdf
T. Haveliwala: Efficient Computation of PageRank. Stanford U. CS Technical Report, 1999
D. Rafiei and A. Mendelzon, What do the Neighbours Think? Computing Web Page Reputations, IEEE Data Engineering Bulletin, September 2000. (WWW9 version)
Open directory project (human-edited indexing) http://dmoz.org/
The Web Robot's Pages
Computer Science Research Paper Search Engine http://www.cora.jprc.com/
Surf Companion web agent http://surfcompanion.wwz.de/Help/TOC_Help.html
The "Invisible Web," the part of cyberspace that's inaccessible to search engines, but is still searchable -- if you know where to find the gateways. http://gwis2.circ.gwu.edu/~gprice/direct.htm
The Extreme Searcher's Web Page http://www.onstrat.com/
Search engine resources
Web developer's virtual library
Crawling the hidden Web, Sriram Raghavan, Hector Garcia-Molina. In the Proceedings of the 27th Intl. Conf. on Very Large Databases (VLDB), pp. 129-138, September 2001.
The Deep Web http://www.brightplanet.com/
Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan, Searching the Web, ACM Transactions on Internet Technology, Vol. 1, No. 1, August 2001, Pages 2-43.
UIUC Web Integration Repository (Deep Web data sets - information extraction, interaction with deep web sites)

2. IMAGE/MULTIMEDIA CONTENT-BASED RETRIEVAL

Centre for Intelligent Information Retrieval at UMass http://ciir.cs.umass.edu/
Columbia's Content-Based Visual Query Project http://comet.ctr.columbia.edu/~sfchang/demos.html
Image Processing and Retrieval on the Web (Theo Gevers) http://carol.wins.uva.nl/~gevers/
ImageRover S. Sclaroff at BU http://www.cs.bu.edu/groups/ivc/ImageRover/
Excalibur Visual RetrievalWare http://vrw.excalib.com:8015/cst
Interpix http://www.interpix.com
Mike Swain's tech reports on Multimedia Indexing http://www.crl.research.digital.com/publications/techreports/techreports.html
Stanford digital library project http://walrus.stanford.edu/diglib/pub/reports/
MPEG standard for multimedia data compression http://www.mpeg.org/MPEG/
Cobion Visual Content Search
V. Wu's Finding Text in Images (2nd ACM Int. Conf. on Digital Libraries, 1997) --
also Manmatha's papers on multimedia indexing and retrieval.
Free-form object recognition survey
Text-based approaches for the categorization of images, ECDL-99, 3rd European Conference on Research and Advanced Technology for Digital Libraries, Sable and Hatzivassiloglou, also IJDL 2001.
NSERC proposal summary text
J. R. Smith and S.-F. Chang, "Visually Searching the Web for Content," IEEE Multimedia Magazine, Summer, Vol. 4 No. 3, pp.12-20, 1997. (also Columbia U. CU/CTR Technical Report #459-96-25). (WebSEEk demo)
James Z. Wang, Penn State
1. James Z. Wang, Jia Li, Gio Wiederhold, ``SIMPLIcity: Semantics-sensitive Integrated Matching for Picture LIbraries,'' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 9, 16 pp., 2001
2. James Z. Wang, Gio Wiederhold, Oscar Firschein, Sha Xin Wei, ``Wavelet-based image indexing techniques with partial sketch retrieval capability,'' Proc. IEEE Forum on Research and Technology Advances in Digital Libraries (ADL'97), pp. 13-24, Washington D.C., IEEE, May 1997
Mohan Kankanhalli's course on Multimedia Information Retrieval
Rubner and Tomasi: Earth Mover's Distance for Image Database Navigation (applied to color and texture based retrieval, locally!)
Bartlett, M.S., Donato, G.L., Movellan, J.R., Hager, J.C., Ekman, P., and Sejnowski, T.J. (2000). Image representations for facial expression coding. In S. Solla, T. Leen, & K. Mueller, Eds. Advances in Neural Information Processing Systems 12, Cambridge, MA: MIT Press, p. 886-892.
C.C. Jay Kuo. Content-based Audio Classification and Retrieval
Shu-Ching Chen, Mei-Ling Shyu, and R. L. Kashyap, "Augmented Transition Network as a Semantic Model for Video Data," International Journal of Networking and Information Systems, Special Issue on Video Data, vol. 3, no. 1, pp. 9-25, 2000.
Ming-Hsuan Yang, Dan Roth and Narendra Ahuja, "Learning to Recognize 3D Objects With SNoW", podium presentation, in Proceedings of the Sixth European Conference on Computer Vision (ECCV 2000) , pp. 439-454, vol. 1, Dublin, June, 2000.
Ming-Hsuan Yang, Narendra Ahuja, David Kriegman A Survey on Face Detection Methods (1999)
Pixar/Lucasfilm Graphics Memos: http://www.alvyray.com/Memos/MemosPixar.htm
Spline Tutorial Notes (the classic) by A.R. Smith, 1983
Tech Memo 77, Computer Division, Lucasfilm, May 1983. Also issued as tutorial notes at SIGGRAPHs 83 and 84
"PicASHOW: Pictorial Authority Search by Hyperlinks on the WEB", Ronny Lempel, Aya Soffer, ACM Trans. on Information Systems, Vol. 20, No 1, Jan. 2002, pp. 1-24.

3. NATURAL LANGUAGE PROCESSING

Lillian Lee's distributional clustering approach for hierarchical clustering.
Ellen Riloff's research on NLP information extraction.
1. Riloff, E. (1993) "Automatically Constructing a Dictionary for Information Extraction Tasks", Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-93), AAAI Press/The MIT Press, pp. 811-816.
2. Riloff, E. and Schmelzenbach, M. (1998) "An Empirical Approach to Conceptual Case Frame Acquisition", In Proceedings of the Sixth Workshop on Very Large Corpora , 1998.
3. Riloff, E. and Jones, R. (1999) "Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping," In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99) , 1999.
Efficient Crawling Through URL Ordering http://www-db.stanford.edu/~cho/crawler-paper/
Linguist http://linguist.emich.edu
WordNet http://www.cogsci.princeton.edu/~wn/
Beyond Document Similarity: Understanding Value-Based Search and Browsing Technologies http://www-diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1998-0099
Jurafsky's computational corpus linguistics http://www.colorado.edu/ling/jurafsky/
Institut für Maschinelle Sprachverarbeitung , U. Stuttgart http://www.ims.uni-stuttgart.de/
(research areas -> research results -> IMS, Decision Tree Tagger)
IMS Corpus Workbench http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/
CORPUS SEARCH TOOLS, Lancaster U. http://www.comp.lancs.ac.uk/computing/research/ucrel/tools.html
Word sense disambiguation using a word graph
1. Retrieving with good sense
2. Survey of IR
3. G. Hirst's CITO project
4. Senseval
5. Harabagiu pubs
6. Budanitsky, Alexander and Hirst, Graeme. ``Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures.'' Workshop on WordNet and Other Lexical Resources, Second meeting of the North American Chapter of the Association for Computational Linguistics, Pittsburgh, June 2001.
Eric Brill: A Simple Rule-Based Part Of Speech Tagger (Proceedings of ANLP-92, 3rd Conference on Applied Natural Language Processing) -- Brill's online publications -- POS software
Applying Machine Learning for High-Performance Named-Entity Extraction (Baluja, Mittal, Sukthankar), Computational Intelligence, 16(4), 2000, pp. 586-595.
Knowledge-based Extraction of Named Entities (J. Callan, T. Mitamura), ACM Conference on Information and Knowledge Management (CIKM), Nov. 4-9, 2002, McLean, Virginia.
An experimental comparison of model-based clustering methods (Meila, Heckerman), Machine Learning, 42, pp. 9-29, 2001.
Concept decompositions for Large sparse text data using clustering (Dhillon, Modha), Machine Learning, 42, pp. 143-175, 2001.
Publications by the Natural Language Processing Group, Univ. of Salford
1. Mima, H., Ananiadou, S. and Tsujii, J. (1999). A web-based integrated knowledge mining aid system using term-oriented NLP, Proceedings of Natural Language Processing Pacific Rim Symposium 99, Beijing, pp. 13-18.
2. S. Ananiadou, S. Albert, D. Schuhmann. Evaluation of automatic term recognition of nuclear receptors from MEDLINE.
3. Maynard, D. and Ananiadou, S. (2000a). Creating and using domain-specific ontologies for terminological applications, Proceedings of Second International Conference on Language Resources and Evaluation, Athens, pp. 868-874.
4. Hideki MIMA, Sophia ANANIADOU, An Application and Evaluation of the C/NC-value Approach for the Automatic term Recognition of Multi-Word units in Japanese.
5. D. Maynard, S. Ananiadou. Terminological acquaintance: the importance of contextual information in terminology.
6. Frantzi, K., Ananiadou, S. and Mima, H. (2000). Automatic recognition of multiword terms, International Journal of Digital Libraries 3(2): 117-132.
7. Mima, Ananiadou, Nenadic: The ATRACT Workbench: Automatic Term Recognition and Clustering of Terms, 2001.
8. Nenadic, Spasic, Ananiadou: Automatic Discovery of Term Similarities using Pattern Mining, Computerm 2002.
9. Nenadic, Mima, Spasic, Ananiadou, Tsujii: Terminology-driven Literature Mining and Knowledge Acquisition in Biomedicine, International Journal of Medical Informatics (2002).
10. Goran Nenadic, Irena Spasic, Sophia Ananiadou: Term Clustering using a Corpus-Based Similarity Measure, TSD2002.
11. Goran Nenadic, Irena Spasic, Sophia Ananiadou: Automatic Acronym Acquisition and Term Variation Management within Domain Specific Texts, 3rd Int. Conf. on Language Resources and Evaluation, 2002.
Microevolutionary language theory (Mike Best thesis sup. by P. Maes)
Resnik, P. Using Information Content to Evaluate Semantic Similarity in a Taxonomy, IJCAI 95. A longer and more recent version appears in JAIR, 11, 1999.
Dolan, William; Vanderwende, Lucy; Richardson, Stephen D. Automatically Deriving Structured Knowledge Bases From On-Line Dictionaries. In Proceedings of the Pacific Association for Computational Linguistics, April 21-24, 1993, Vancouver, British Columbia.
Ken Church on practical tips on how to implement simple text processing using Unix tools. (53 pages long)
S. Soderland, Learning Information Extraction Rules for Semi-structured and Free Text. Machine Learning, 1999
Christopher D. Manning. Automatic acquisition of a large subcategorization dictionary from corpora. Proceedings of the 31st ACL, pp. 235-242.
Keselj, Vlado (Nick Cercone) Unification-based grammars
Subgrammar extraction for Head-Driven Phrase Structure Grammars HPSG
Stefy: Java parser for HPSGs
Obtaining Language Models of Web Collections Using Query-Based Sampling Techniques (HICSS 2002) Gary A. Monroe, James C. French, and Allison L. Powell
Interactive Document Summarisation Using Automatically Extracted Keyphrases - (HICSS 2002) - Steve Jones, Stephen Lundy, and Gordon W. Paynter
A Novel Method for Detecting Similar Documents (HICSS 2002) James W. Cooper, Anni S. Coden, and Eric W. Brown
MindMap: Utilizing Multiple Taxonomies and Visualization to Understand a Document Collection (HICSS2002) Scott Spangler, Jeffrey T. Kreulen, and Justin Lessler
The Interspace: Concept Navigation Across Distributed Communities
Course on Text Mining (with lots of papers) by Wanda Pratt at UC Irvine (Spring 2001)
Wanda Pratt's home page at U. of Washington
Clifton, C, Cooley, R, Zytkow, JM, and Rauch, J; TopCat: data mining for topic identification in a text corpus. In Principles of Data Mining and Knowledge Discovery. Third European Conference, PKDD'99. 1999, pp. 174-83
Statistical NLP and Corpus Based Linguistics Resources
Vlado Keselj's Natural Language Processing: literature, programs and text corpora
Sheffield NLP Group (see resources)

4. FOCUSED CRAWLERS

Overview by Francis Crimmins, Sep. 2001
WebSPHINX: A Personal, Customizable Web Crawler
http://www.cs.cmu.edu/~rcm/websphinx/ --- http://www.cis.upenn.edu/~lrossey/websphinx.html
The NAUTILUS: NAvigate AUtonomously and Target Interesting Links for USers http://nautilus.dii.unisi.it/
Focused crawling: a new approach to topic-specific Web resource discovery
Soumen Chakrabarti, Martin van den Berg, Byron Dom http://www.almaden.ibm.com/almaden/feat/www8/
Recent results in automatic Web resource discovery, Soumen Chakrabarti, ACM Computing Surveys ??(???), December 1999,
http://www.cs.brown.edu/memex/ACMCSHT/42/42.html
Lee Giles' publications --- Context graphs vldb2000.pdf --- Min-cut framework kdd2000.pdf
Cora -- (the search engine for CS papers) --- http://cora.whizbang.com/ --- publications on Cora
A. Ng's ML Papers http://gubbio.cs.berkeley.edu/mlpapers/
ResearchIndex Publications
Structural Web Search using a Graph-based Discovery System. Graph-Based Data Mining
Henry Lieberman, C. Fry, L. Weitzman Exploring the Web with Reconnaissance Agents Comm ACM Aug 2001, Vol 44(8) --- Letizia
Larbin: a recommended web crawler
FunnelBack crawler for the P@NOPTIC Search Engine: http://www.panopticsearch.com/
Persona: A Contextualized And Personalized Web Search (HICSS 2002) Francisco Tanudjaja and Lik Mui
Intelligent Crawling on the World Wide Web with Arbitrary Predicates Charu C. Aggarwal, Fatima Al-Garawi, and Philip S. Yu, WWW10, 2001.
The shark-search algorithm, Michael Hersovici, Michal Jacovi, Yoelle S. Maarek, Dan Pelleg, Menachem Shtalhaim, and Sigalit Ur, WWW7 (the algorithm used in the Mapuccino system).
Watson Jay Budzik, Kristian Hammond, Larry Birnbaum, Devlab, Northwestern U. (also here).

5. WEB SCIENCE and GRAPH THEORETIC APPROACHES

Kleinberg's Authoritative Sources in a Hyperlinked Environment http://www.cs.cornell.edu/home/kleinber/auth.pdf
Barabasi's home page: http://www.nd.edu/~networks/
The diameter of the WWW http://www.nd.edu/~networks/Papers/401130A0.pdf
Emergence of Scaling in the WWW http://www.nd.edu/~networks/Papers/science.pdf
The topology of the WWW http://www.nd.edu/~networks/Papers/proceeding.pdf
The bow tie model of the Web http://www.almaden.ibm.com/almaden/webmap_press.html
Social Networks: http://www.chass.utoronto.ca/~wellman/
Clustering in large graphs and matrices (Drineas et al. Proc. Symp. Discr. Alg, SIAM, 1999)
J. Kleinberg, C. Papadimitriou, P. Raghavan. Segmentation problems: A micro-economic view of data mining. Proc. 30th ACM Symposium on Theory of Computing, 1998.
Silk from a sow's ear: Extracting usable structures from the Web, P. Pirolli, J. Pitkow, and R. Rao, Proc. ACM SIGCHI, 1996.
How Popular is Your Paper? An Empirical Study of the Citation Distribution, S. Redner, Eur. Phys. Jour. B 4, 131-134 (1998).
The Campfire project (bipartite cores to identify communities on the WWW)
Bipartite cores for modelling web communities
Extracting large scale knowledge bases from the Web (bipartite cores) VLDB 1999
IBM Clever project: http://www.almaden.ibm.com/cs/k53/clever.html
Mining the Link Structure of the World Wide Web (1999) Soumen Chakrabarti, Byron E. Dom, David Gibson, Jon Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, IEEE Computer
Graph connectivity quick reference
Citation graph
1. Chen, C. (1999) Visualising Semantic Spaces and Author Co-Citation Networks in Digital Libraries. Information Processing & Management, 35(3), 401-420.
2. Eugene Garfield's home page.
A. Borodin, G.O. Roberts, J.S. Rosenthal, and P. Tsaparas, Finding Authorities and Hubs From Link Structures on the World Wide Web. (WWW10, to appear. See published version.)
Moses Charikar, Greedy approximation algorithms for finding dense components in a graph, In Proc. Third International Workshop Approximation Algorithms for Combinatorial Optimization, APPROX 2000, Klaus Jansen, Samir Khuller (Eds.), LNCS 1913, 84-95.
Monika Henzinger publications sigir98 www8
An Atlas of Cyberspaces: Surf maps, visualizing browsing behaviour
Corinna Cortes, Daryl Pregibon, and Chris T. Volinsky, Communities of Interest, Proceedings of IDA 2001 - Intelligent Data Analysis (2001)
Self-similarity in the web, Stephen Dill, Ravi Kumar, Kevin McCurley, Sridhar... VLDB 2001
Cybergeography Research
Dodge and Kitchin Mapping Cyberspace Routledge, Oct. 2000
Locating Information with Uncertainty in Fully Interconnected Networks with Applications to World Wide Web Information Retrieval (Kirousis, Kranakis et al).
Self-Organization and Identification of Web Communities Gary Flake, Steve Lawrence, C. Lee Giles, Frans Coetzee
Peer influence groups: identifying dense clusters in large networks James Moody, Social Networks 23 (2001) pp. 261-283
Coevolution and self-organization in dynamical networks (COSIN European consortium)

6. SPECIAL ISSUE ON INTELLIGENT INTERNET SYSTEMS ARTIFICIAL INTELLIGENCE JOURNAL 118(1-2)

Lesser, Victor, Horling, Bryan, Klassner, Frank, Raja, Anita, Wagner, Thomas, and Zhang, Shelley.
BIG: An Agent for Resource-Bounded Information Gathering and Decision Making.
http://mas.cs.umass.edu/publications.shtml
Kushmerick, N. Wrapper induction: Efficiency and expressiveness.
http://www.cs.ucd.ie/staff/nick/home/research/pubs.html
W. Cohen WHIRL: A Word-based Information Representation Language, a journal-length overview paper on WHIRL. (A shorter version is also available.) http://whirl.research.att.com/
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery Learning to Construct Knowledge Bases from the World Wide Web http://www.ri.cmu.edu/people/mitchell_tom.html#publications

7. INTELLIGENT WEB AGENTS TALKS

WebACE
Latent Semantic Indexing
Kozima: Context Sensitive Measure of Word Distance orig.paper
WebToKB (McCallum et al)

8. ONTOLOGY/HIERARCHY LEARNING/USING

Ontology Learning ECAI-2000 Workshop -- http://ol2000.aifb.uni-karlsruhe.de/
1. Enriching very large ontologies using the WWW. E. Agirre, O. Ansa, E. Hovy, D. Martinez
2. Designing Clustering Methods for Ontology Building - The Mo'K Workbench. G.Bisson, C. Nedellec and D. Canamero.
International Journal on Digital Libraries ISSN: 1432-5012 Index Volume 3 Number 3 October 2000
1. Declarative Specification of Z39.50 Wrappers using Description Logics Yannis Velegrakis , Vassilis Christophides , Panos Constantopoulos
2. Text-Based Approaches for the Categorization of Images (1999) Carl L. Sable and Vasileios Hatzivassiloglou
Ayad & Kamel: Topic Discovery from Text using Aggregation of different Clustering Methods, Canadian AI Conference, 2002.
I. Varlamis, M. Vazirgiannis, M. Halkidi, B. Nguyen. "THESUS: Effective Thematic Selection And Organization Of Web Document Collections Based On Link Semantics", to appear in the IEEE Transactions on Knowledge and Data Engineering, 2003.
Data Bases and Knowledge Discovery group @ AUEB
Ontology Matching

9. AI-SPECIFIC SOFTWARE RESOURCES

(partly from Ali&McRoy "Java Resource for Artificial Intelligence", intelligence, SIGART ACM, 11(2), Summer 2000)

General Java Resources

Sun's Java Website http://java.sun.com

Gamelan, repository of Java tools http://gamelan.earthweb.com

Java Programmer's FAQ http://www.afu.com/javafaq.html

Links to AI-specific Java Resources

Jess, Rule-based system similar to CLIPS http://herzberg.ca.sandia.gov/jess/

Weka, collection of machine learning algorithms http://www.cs.waikato.ac.nz/ml/weka

Genetic Programming, S. Luke's ECJ and A. Qureshi's gpsys http://www.cs.umd.edu/projects/plus/ec/ecj
http://www.cs.ucl.ac.uk/staff/A.Qureshi/gpsys_doc.html

JavaBayes: Bayesian networks http://www.cs.cmu.edu/~javabayes/

Neural networks: jaNet package http://www.hta-bi.bfh.ch/Projects/janet/

YAG: natural language generator http://tigger.cs.uwm.edu/~nlkrrg/

NGram Statistics package http://www.d.umn.edu/~tpederse/nsp.html

AgentBuilder's survey of agent construction tools http://www.agentbuilder.com/AgentTools/

GATE (General Architecture for Text Engineering) http://gate.ac.uk/

Protege Ontology Editor http://protege.stanford.edu/

The KIM Platform for Knowledge & Information Management http://www.sirma.bg/OntoText/KIM/

Jakarta Lucene: a high-performance, full-featured text search engine written entirely in Java. http://jakarta.apache.org/

Torch: a machine-learning library, written in simple C++ http://www.torch.ch/

SVMlight: implementation of Support Vector Machines (SVMs) in C. http://svmlight.joachims.org/

OSU SVM Classifier Matlab Toolbox http://www.ece.osu.edu/~maj/osu_svm/

SOM Matlab Toolbox http://www.cis.hut.fi/projects/somtoolbox/

Pattern Recognition Matlab Toolbox http://neural.cs.nthu.edu.tw/jang/matlab/toolbox/DCPR/

Matlab toolboxes http://www.tech.plym.ac.uk/spmc/matlab/matlab_toolbox.html

BioNLP Resources http://www.tufts.edu/~amorga02/bcresources.html

OntoParser, an XML2RDF translator for OntoBuilder ontologies http://ie.technion.ac.il/OntoBuilder
Ontologies are available under "Ontologies downloads," partitioned into 14 domains. For the OntoParser, go to "OntoBuilder downloads" and follow the link to "OntoParser: an XML2RDF translator of OntoBuilder ontologies." The zip file contains a user manual with all installation information.

FIHC, Frequent Itemset-based Hierarchical Clustering http://www.cs.sfu.ca/~ddm

eprints software for online publishing http://www.eprints.org/

htdig a search engine http://www.htdig.org/

lemur, language modelling/IR http://www.lemurproject.org/

lucene apache: a search engine in Java http://lucene.apache.org/java/docs/

Tools for the Reuters collection http://www.lins.fju.edu.tw/~tseng/Collections/Reuters-21578.html

OpenNLP: large collection of open NLP tools http://opennlp.sourceforge.net/projects.html

FrameNet - on-line lexical resource for English, based on frame semantics http://framenet.icsi.berkeley.edu/

VerbNet - a lexical resource on verbs http://en.wikipedia.org/wiki/VerbNet

LIBSVM -- Library for Support Vector Machines http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Maximum Entropy / Logistic Regression
http://www.cs.utah.edu/~hal/megam/
Patrick Haffner: Scaling large margin classifiers for spoken language understanding, Speech Communication 48 (2006) 239–261

MEAD -- multi-document summarization system (Dragomir Radev) http://www.summarization.com/mead/

JUNG -- Java Universal Network/Graph Framework http://jung.sourceforge.net/
JUNG is a software library that provides a common and extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network.

K. Murphy: Bayes Network Toolbox for Matlab

Datasets

UCI KDD Archive
Reuters-21578
Reuters Corpus (RCV1 and RCV2)
Wikipedia XML corpus

10. MISC

LEDA: A C++ library of the data types and algorithms of combinatorial computing Book Manual
R, a free (GPL) version of statistical software S/S-plus (includes multidimensional scaling)
Book: Social Dimensions of Information Technology
Click on "book excerpt" to read Chapter 1, Virtual Communities and Social Capital by Blanchard and Horan
Sample copies of the Social Science Computer Review from jsamples@sagepub.com
Cluster for Intelligent Mobile Agents for Telecommunication Environments -- CLIMATE
How to get (and keep) an NSERC research grant (ps) other resources
Internet Application Workbook by Philip Greenspun
GNU's not Unix
Social Scientists: Managing Identity in Socio-Technical Networks (HICSS2002) Roberta Lamb and Elizabeth Davidson
Discrete Algorithms and Data Structures software
K. Murphy: Bayes Network Toolbox for Matlab
K. Murphy: HMM toolbox
Cawley, G. C. Matlab Support Vector Machine Toolbox
Boost graph library (an open source alternative to LEDA for graph algorithms)
Experimental Design 1 2 3 4

11. Agent-based economics

ASPEN Microsimulation Economics Model
Microsimulation
Mike Wellman's Market-oriented programming
J. Kephart's Dynamic Pricing by software agents html pdf
M. Huhns home page (online auctions, agents)

12. Web-Information Filtering Lab

Context aware retrieval links:
http://www.sims.berkeley.edu/~hearst/papers/data-engineering/
http://www.dcs.ex.ac.uk/~pjbrown/papers/ir.html http://www.research.microsoft.com/research/db/debull/A00sept/issue.htm
The Effect of Linking on Genres of Web documents (Crowston and Williams)
XML Schema Formal Description
IJCAI 2001 Workshop on Intelligent Techniques for Web Personalization (WEB-2)
1. Clustering navigation patterns on a website using a sequence alignment method. Birgit Hay, Geert Wets and Koen Vanhoof, Limburg University, Belgium
2. Modeling users navigation history (IJCAI 2001 Workshop on Intelligent Techniques for Web Personalization).
  Ernesto Damiani, Barbara Oliboni, Elisa Quintarelli and Letizia Tanca, Universita degli Studi di Milano, Italy
3. Improving the effectiveness of collaborative filtering on anonymous web usage data Bamshad Mobasher, Honghua Dai, Tao Luo, Miki Nakagawa, School of Computer Science, Telecommunication, and Information Systems, DePaul University, Chicago, Illinois, USA
4. Web site personalizers for mobile devices Corin R. Anderson, Pedro Domingos, Daniel S. Weld, University of Washington, Seattle, WA, USA
C.J. van Rijsbergen, INFORMATION RETRIEVAL Second Edition (on-line text)
Information Retrieval Links (including SMART).
Biovista.com

ACM SIGIR Information Retrieval Resources

TREC
Main TREC Web site
Web research collections --- Paper on the wt10g collection (TREC 2001)
TREC 2002 Web Track guidelines
Overview of the TREC 2001 Web Track competition (ps.gz) --- Overview of the TREC 2000 Web Track competition
Descriptions of the 2001 contributions for each of the two tasks (adhoc and entry page) described in the overview.
. Web adhoc results (2001) --- topics
. Web entry-page results (2001) --- topics
Agreement to use the data sets on hermes.cs.dal.ca

13. Machine learning and information extraction

Ghahramani, Z. (2001) An Introduction to Hidden Markov Models and Bayesian Networks International Journal of Pattern Recognition and Artificial Intelligence 15(1):9-42.
Jeff Bilmes A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998).
Learning regular languages from simple positive examples
Learning Dfa from Simple Examples
Efficient algorithms for the inference of minimum size DFAs
Hierarchical Wrapper Induction for Semistructured Information Sources Ion Muslea, Steve Minton, Craig Knoblock. Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114, 2001
Line Eikvil Information Extraction from World Wide Web - A Survey
IJCAI-2001 Workshop on Adaptive Text Extraction and Mining (ML-1)
1. William W. Cohen & Lee S. Jensen: A structured wrapper induction system for extracting information from semi-structured documents
2. Lee S. Jensen & William W. Cohen: Grouping extracted fields
3. Craig A. Knoblock, Kristina Lerman, Steven Minton & Ion Muslea: A machine-learning approach to accurately and reliably extracting data from the Web
4. Kristina Lerman, Craig Knoblock & Steven Minton: Automatic data extraction from lists and tables in Web sources
5. David Pierce & Claire Cardie: User-oriented machine learning strategies for information extraction: Putting the human back in the loop
IJCAI-2001 Workshop on Text Learning: Beyond Supervision (ML-3)
1. Selective Sampling + Semi-supervised Learning = Robust Multi-View Learning Ion Muslea, Steven Minton, and Craig A. Knoblock
2. Detection of errors in training data by using a decision list and Adaboost Hiroyuki Shinnou
3. Ontology-based Text Clustering A. Hotho, S. Staab, and A. Maedche
4. Probabilistic Models of Text and Link Structure for Hypertext Classification Lise Getoor, Eran Segal, Ben Taskar, and Daphne Koller
WebDB 2000 Proceedings
Kamal Nigam. Using Unlabeled Data to Improve Text Classification. Doctoral Dissertation, Computer Science Department, Carnegie Mellon University. Technical Report CMU-CS-01-126. 2001 (ML paper)
Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2/3). pp. 103-134. 2000
Kamal Nigam and Rayid Ghani. Analyzing the Effectiveness and Applicability of Co-training. In Ninth International Conference on Information and Knowledge Management (CIKM-2000), pp. 86-93. 2000
David Cohn, Les Atlas and Richard Ladner. (1994) Improving generalization with active learning, Machine Learning 15(2):201-221.
D. Freitag. Information Extraction from HTML: Application of a General Machine Learning Approach, AAAI/IAAI 1998
Fabrizio Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys, 2002
http://faure.iei.pi.cnr.it/~fabrizio/Publications/ACMCS02.pdf
Rosario, B., and Hearst, M., Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy, in the Proceedings of Empirical Methods in Natural Language Processing EMNLP '01, Pittsburgh, PA, June 2001. (From the BAILANDO, or "Better Access to Information using Language Analysis and New Displays and Organizations", publication list.)
Support Vector Machine resources -- Kernel machine resources
Tutorial on Support Vector Machines and Kernel Methods Presented at ICML-2001 by Nello Cristianini
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
A. J. Smola and B. Schölkopf. A Tutorial on Support Vector Regression. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998.
Peter Cheeseman, John Stutz, Bayesian Classification (AutoClass): Theory and Results (1996), Advances in Knowledge Discovery and Data Mining
Hinrich Schütze, Craig Silverstein, Projections for Efficient Document Clustering (1997)
D. Heckerman. A tutorial on learning with Bayesian Networks. Microsoft Research TR, 1996
ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.ps , ftp://ftp.research.microsoft.com/pub/dtg/david/tutorial.ps
Henry Lieberman, Bonnie A. Nardi, David Wright Training Agents to Recognize Text by Example, ACM Conference on Autonomous Agents [Agents-99], Seattle, 1-5 May 1999
Joachims. Text Categorization with Support Vector Machines. TR VIII-23, U. of Dortmund, 1997.
An Evaluation of Statistical Approaches to Text Categorization (1997) Yiming Yang
Active Learning for Natural Language Parsing and Information Extraction Cynthia A. Thompson, Mary Elaine Califf, and Raymond J. Mooney, Proceedings of the Sixteenth International Machine Learning Conference (ICML-99) , Bled, Slovenia, pp. 406-414, June 1999 (ps)
Relational Learning of Pattern-Match Rules for Information Extraction Mary Elaine Califf and Raymond J. Mooney, Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), Orlando, FL, pp. 328-334, July, 1999 (ps)
A Comparison of Document Clustering Techniques Michael Steinbach, George Karypis, Vipin Kumar, KDD Workshop on Text Mining, 2000.
On the merits of building categorization systems by supervised clustering, Charu C. Aggarwal, Stephen C. Gates, Philip S. Yu, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, 1999, San Diego, California.
Manolis Koubarakis work on Boolean queries with proximity operators:
Manolis Koubarakis, Theodoros Koutris, Paraskevi Raftopoulou, and Christos Tryfonopoulos: Efficient dissemination of textual information using the Boolean model 2nd Hellenic Conference on Artificial Intelligence, April 11-12, 2002, Thessaloniki, Greece.
Manolis Koubarakis Boolean queries with proximity operators for information dissemination
International Workshop on FOUNDATIONS OF MODELS FOR INFORMATION INTEGRATION (FMII-2001) as the 10th Workshop in the Series Foundations of Models and Languages for Data and Objects (FMLDO) Viterbo (near Rome), Italy 16-18 September, 2001 (immediately after VLDB-2001)
Text clustering (from Biao Chen)
1. Web Document Clustering: A Feasibility Demonstration (1998) Oren Zamir, Oren Etzioni
2. Fast and Intuitive Clustering of Web Documents (1997) Oren Zamir, Oren Etzioni, Omid Madani, Richard M. Karp Department of Computer...
3. Scalable Techniques for Clustering the Web
4. A Min-max Cut Algorithm for Graph Partitioning and Data Clustering, Chris Ding, Xiaofeng He, Hongyuan Zha, Ming Gu and Horst Simon. Proc. 1st IEEE Int'l Conf. Data Mining. San Jose, CA, 2001. pp.107-114.
5. Automatic Topic Identification Using Webpage Clustering Xiaofeng He, Chris H.Q. Ding, Hongyuan Zha, Horst D. Simon Proc. 1st IEEE Int'l Conf. Data Mining. San Jose, CA, 2001. pp.195-202.
6. Co-clustering documents and words using Bipartite Spectral Graph Partitioning
Statistics reference (hyperstat online)
Clustering overview by Schuetze
Information Extraction in Biology
Clustering software for gene expression profiles, XCluster

14. Web information retrieval

1. The Web IR and IE collection http://www.haifa.il.ibm.com/webir/
Of particular interest are "Selected Publications" and "PhD/MSc related work"

2. Intl. Workshop on Web Document Analysis - WDA2001 http://www.csc.liv.ac.uk/~wda2001/

15. Data sets

1. RCV1-v2 Text Categorization Test Collection (Reuters). Appendix to:
Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004.
2. ArXiv is an e-print service in the fields of physics, mathematics, non-linear science, computer science, and quantitative biology.

16. Statistical Machine Learning

1. PASCAL network of excellence - Pattern Analysis, Statistical modelling and ComputAtional Learning (incl. video lectures)
2. Machine Learning Summer schools

(Material) Robotics references

1. Particle swarms
2. Sensor magazine SensorPortal
3. CVOnline, a Computer Vision Encyclopedia
4. Ballard and Brown's Computer Vision text

Related Industry

Halifax-Atlantic Canada
IT Interactive Services (ITIS) (Halifax, Web applications) Genieknows
Coemergence (Halifax, Business Knowledge Management specializing in the mining sector)
Kanayo (Halifax, Peer-to-peer engine for medical information cataloguing and dissemination)
Skywire Software (Moncton, legal document generation and transformation)

Ontario-Quebec
Palomino (Toronto, Web site creation and maintenance tool)
Sysomos (Toronto, Blog text mining, Koudas & Bansal)
OpenText (Waterloo, Livelink is the leading collaboration and knowledge management software for the global enterprise.)
Hummingbird (Toronto, Enterprise Content Management)
Techne Knowledge Systems, Inc. (Toronto, Business Knowledge Management)
Pattern Discovery Software Systems (Waterloo, data mining, spinoff of PAMI lab)
Protana (formerly MDS Proteomics Inc. - MDSP) (Toronto, a drug discovery company)
Bell University Laboratories (Toronto)
Branddimensions (Toronto, Buzz analysis)

Nstein (Montreal) - Multilingual information management
Lingua Technologies (Montreal) - Translation, Text Mining (Precarn member)
Language Industry Association (Quebec-Canada) - trade association in multilingual text information management
Nomino Technologies (Montreal) - natural language processing in e-customer-service (through web sites)

Western Canada
Axonwave Software Inc. (Vancouver, SFU spinoff, F. Popowich)
Business Objects (Vancouver, Business Intelligence)

USA
Entopia Knowledge Builder (bottom-up Knowledge Management)
Mohomine (document classification)
Wherewithal (knowledge management from the intranet portal)
Autonomy (corporate document knowledge management) white papers, case studies (Autonomy bought Verity (K2 Enterprise, Knowledge Management))
Entrieva (term extraction, document classification, taxonomy building (manual))
Stratify (text classification, taxonomy management)
Applied Semantics (acquired by Google, ontology-based software)
Systems Research and Development (non-obvious relationship awareness)
Dynago (document organization and summarization, metasearch engine. Check out DART)
Eurekster (collaborative Web searches)
Interwoven Inc. (enterprise content management)
Inxight Software Inc. (unstructured data management)
Vivisimo (document clustering engine)
Meaning Master (search engine technology) used by Eurekah Biosciences DB
WebBrain (conceptual organization of web spaces, visual interface for browsing the ODP directory)
Recommind (search, categorization and taxonomy generation from text - Hofmann - Probabilistic LSA)
Language Weaver (statistical machine translation)
Zoominfo (summarization of web content about people or companies)
Semagix (enterprise content exploitation)
Business Objects (business intelligence, bought by SAP)
Insightful Corp. (S-Plus statistical software, data mining)
Burning Glass (resume processing and matching with job requirements)
MarkLogic (Web content management server, some text mining - see customer demos for a rich set of applications)

Europe
BOC Information Technologies Consulting (Vienna, Business Process Management)
Mint Business Solutions (UK) Mint MCI Document Management
Autonomy Corp PLC (UK) Enterprise portals/search, clustering
Unicorn Solutions Inc. (Israel) data semantics
Clearforest (Israel/US) text analytics
Atypon (Greece) Text data mining ScienceLine
Biovista (Greece) Discovery algorithmics in Biotech (AI, NLP)
Velti (Greece) Enterprise content management / portals.
Ontotext (Bulgaria) Semantic annotation, indexing, retrieval
Neurosoft (Greece) NLP, Lexicon of modern Greek
ANCO (Greece) Educational applications, telecommunications
@Semantics (Italy) Enterprise Information Integration
Teezir (Netherlands) Enterprise Search
Collexis (Netherlands) Expert profiling, expert social networks
Dialogos: speech communication systems (Greece) Speech recognition systems - represents Nuance.com in Greece / partly owned by Intracom IT Services

Australia
Mind Systems (Topic Mapping, Personal Information Management)

Deep Web
Quigo Technologies' Intellisonar
Deep Web Technologies' Distributed Explorit
EduMed (based on Multimedia DBMS, VDMS)

Industry collaboration reference
Sample agreements: AUTM -> Agreements -> Sample Agreements
SR&ED Tax Credits
Canadian Research Transfer Network (CRTN)

OTHER

The secret of how Microsoft stays on top

General Java Resources
Sun's Java Website	http://java.sun.com
Gamelan, repository of Java tools	http://gamelan.earthweb.com
Java Programmer's FAQ	http://www.afu.com/javafaq.html
Links to AI-specific Java Resources
Jess, Rule-based system similar to CLIPS	http://herzberg.ca.sandia.gov/jess/
Weka, collection of machine learning algorithms	http://www.cs.waikato.ac.nz/ml/weka
Genetic Programming, S. Luke's ECJ and A. Qureshi's gpsys	http://www.cs.umd.edu/projects/plus/ec/ecj http://www.cs.ucl.ac.uk/staff/A.Qureshi/gpsys_doc.html
JavaBayes: Bayesian networks	http://www.cs.cmu.edu/~javabayes/
Neural networks: jaNet package	http://www.hta-bi.bfh.ch/Projects/janet/
YAG: natural language generator	http://tigger.cs.uwm.edu/~nlkrrg/
NGram Statistics package	http://www.d.umn.edu/~tpederse/nsp.html
AgentBuilder's survey of agent construction tools	http://www.agentbuilder.com/AgentTools/
GATE (General Architecture for Text Engineering)	http://gate.ac.uk/
Protege Ontology Editor	http://protege.stanford.edu/
The KIM Platform for Knowledge & Information Management	http://www.sirma.bg/OntoText/KIM/
Jakarta Lucene: a high-performance, full-featured text search engine written entirely in Java.	http://jakarta.apache.org/
Torch: a machine-learning library, written in simple C++	http://www.torch.ch/
SVMlight: implementation of Support Vector Machines (SVMs) in C.	http://svmlight.joachims.org/
OSU SVM Classifier Matlab Toolbox	http://www.ece.osu.edu/~maj/osu_svm/
SOM Matlab Toolbox	http://www.cis.hut.fi/projects/somtoolbox/
Pattern Recognition Matlab Toolbox	http://neural.cs.nthu.edu.tw/jang/matlab/toolbox/DCPR/
Matlab toolboxes	http://www.tech.plym.ac.uk/spmc/matlab/matlab_toolbox.html
BioNLP Resources	http://www.tufts.edu/~amorga02/bcresources.html
OntoParser, an XML2RDF translator for OntoBuilder ontologies	http://ie.technion.ac.il/OntoBuilder Ontologies are available under "Ontologies downloads," partitioned into 14 domains. For the OntoParser, go to "OntoBuilder downloads" and follow the link to "OntoParser: an XML2RDF translator of OntoBuilder ontologies." The zip file contains a user manual with all installation information.
FIHC, Frequent Itemset-based Hierarchical Clustering	http://www.cs.sfu.ca/~ddm
eprints software for online publishing	http://www.eprints.org/
htdig a search engine	http://www.htdig.org/
lemur, language modelling/IR	http://www.lemurproject.org/
lucene apache: a search engine in Java	http://lucene.apache.org/java/docs/
Tools for the Reuters collection	http://www.lins.fju.edu.tw/~tseng/Collections/Reuters-21578.html
OpenNLP: large collection of open NLP tools	http://opennlp.sourceforge.net/projects.html
FrameNet - on-line lexical resource for English, based on frame semantics	http://framenet.icsi.berkeley.edu/
VerbNet - a lexical resource on verbs	http://en.wikipedia.org/wiki/VerbNet
LIBSVM -- Library for Support Vector Machines	http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Maximum Entropy / Logistic Regression	http://www.cs.utah.edu/~hal/megam/ Patrick Haffner: Scaling large margin classifiers for spoken language understanding, Speech Communication 48 (2006) 239–261
MEAD -- multi-document summarization system (Dragomir Radev)	http://www.summarization.com/mead/
JUNG -- Java Universal Network/Graph Framework	http://jung.sourceforge.net/ is a software library that provides a common and extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network.
K. Murphy: Bayes Network Toolbox for Matlab