Web Robot Project Bibliography

1. WEB INDEXING, SEARCH ENGINES
2. IMAGE/MULTIMEDIA CONTENT-BASED RETRIEVAL
3. NATURAL LANGUAGE PROCESSING
4. FOCUSED CRAWLERS
5. WEB SCIENCE and GRAPH THEORETIC APPROACHES
6. SPECIAL ISSUE ON INTELLIGENT INTERNET SYSTEMS ARTIFICIAL INTELLIGENCE JOURNAL 118(1-2)
7. INTELLIGENT WEB AGENTS TALKS
8. ONTOLOGY/HIERARCHY LEARNING
9. AI-SPECIFIC SOFTWARE RESOURCES
10. MISC
11. Agent-based economics
12. Web-Information Filtering Lab --- TREC
13. Machine learning and information extraction
14. Web information retrieval
15. Data sets
16. Statistical Machine Learning

(Material) Robotics references

Journals to publish in
Text mining - Knowledge Management industry
Health text mining
Industry collaboration reference

Graduate Courses
J. Kleinberg: The Structure of Information Networks Cornell, Computer Science 685 Fall 2002
J. Kleinberg: Randomized and High-Dimensional Algorithms Cornell, Computer Science (Spring 2001).

1. WEB INDEXING, SEARCH ENGINES

  1. Google http://google.stanford.edu/about.html
  2. Google Anatomy paper GoogleAnatomy.pdf
  3. T. Haveliwala: Efficient Computation of PageRank. Stanford U. CS Technical Report, 1999
  4. D. Rafiei and A. Mendelzon, What do the Neighbours Think? Computing Web Page Reputations, IEEE Data Engineering Bulletin, September 2000. (WWW9 version)
  5. Open directory project (human-edited indexing) http://dmoz.org/
  6. The Web Robot's Pages
  7. Computer Science Research Paper Search Engine http://www.cora.jprc.com/
  8. Surf Companion web agent http://surfcompanion.wwz.de/Help/TOC_Help.html
  9. The "Invisible Web," the part of cyberspace that's inaccessible to search engines, but is still searchable -- if
    you know where to find the gateways. http://gwis2.circ.gwu.edu/~gprice/direct.htm
  10. The Extreme Searcher's Web Page http://www.onstrat.com/
  11. Search engine resources
  12. Web developer's virtual library
  13. Crawling the hidden Web, Sriram Raghavan, Hector Garcia-Molina. In the Proceedings of the 27th Intl. Conf. on Very Large Databases (VLDB), pp. 129-138, September 2001.
  14. The Deep Web http://www.brightplanet.com/
  15. Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan, Searching the Web, ACM Transactions on Internet Technology, Vol. 1, No. 1, August 2001, Pages 2–43.
  16. UIUC Web Integration Repository (Deep Web data sets - information extraction, interaction with deep web sites)

2. IMAGE/MULTIMEDIA CONTENT-BASED RETRIEVAL

  1. Centre for Intelligent Information Retrieval at UMass http://ciir.cs.umass.edu/
  2. Columbia's Content-Based Visual Query Project http://comet.ctr.columbia.edu/~sfchang/demos.html
  3. Image Processing and Retrieval on the Web (Theo Gevers) http://carol.wins.uva.nl/~gevers/
  4. ImageRover S. Sclaroff at BU http://www.cs.bu.edu/groups/ivc/ImageRover/
  5. Excalibur Visual RetrievalWare http://vrw.excalib.com:8015/cst
  6. Interpix http://www.interpix.com
  7. Mike Swain's tech reports on Multimedia Indexing http://www.crl.research.digital.com/publications/techreports/techreports.html
  8. Stanford digital library project http://walrus.stanford.edu/diglib/pub/reports/
  9. MPEG standard for multimedia data compression http://www.mpeg.org/MPEG/
  10. Cobion Visual Content Search
  11. V. Wu's Finding Text in Images (2nd ACM Int. Conf. on Digital Libraries, 1997) --
    also Manmatha's papers on multimedia indexing and retrieval.
  12. Free-form object recognition survey
  13. Text-based approaches for the categorization of images, ECDL-99, 3rd European Conference on Research and Advanced Technology for Digital Libraries, Sable and Hatzivassiloglou, also IJDL 2001.
  14. NSERC proposal summary text
  15. J. R. Smith and S.-F. Chang, "Visually Searching the Web for Content," IEEE Multimedia Magazine, Summer, Vol. 4 No. 3, pp.12-20, 1997. (also Columbia U. CU/CTR Technical Report #459-96-25). (WebSEEk demo)
  16. James Z. Wang, Penn State
    1. James Z. Wang, Jia Li, Gio Wiederhold, ``SIMPLIcity: Semantics-sensitive Integrated Matching for Picture LIbraries,'' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 9, 16 pp., 2001
    2. James Z. Wang, Gio Wiederhold, Oscar Firschein, Sha Xin Wei, ``Wavelet-based image indexing techniques with partial sketch retrieval capability,'' Proc. IEEE Forum on Research and Technology Advances in Digital Libraries (ADL'97), pp. 13-24, Washington D.C., IEEE, May 1997
  17. Mohan Kankanhalli's course on Multimedia Information Retrieval
  18. Rubner and Tomasi: Earth Mover's Distance for Image Database Navigation (applied to color and texture based retrieval, locally!)
  19. Bartlett, M.S., Donato, G.L., Movellan, J.R., Hager, J.C., Ekman, P., and Sejnowski, T.J. (2000). Image representations for facial expression coding. In S. Solla, T. Leen, & K. Mueller, Eds. Advances in Neural Information Processing Systems 12, Cambridge, MA: MIT Press, p. 886-892.
  20. C.C. Jay Kuo. Content-based Audio Classification and Retrieval
  21. Shu-Ching Chen, Mei-Ling Shyu, and R. L. Kashyap, "Augmented Transition Network as a Semantic Model for Video Data," International Journal of Networking and Information Systems, Special Issue on Video Data, vol. 3, no. 1, pp. 9-25, 2000.
  22. Ming-Hsuan Yang, Dan Roth and Narendra Ahuja, "Learning to Recognize 3D Objects With SNoW", podium presentation, in Proceedings of the Sixth European Conference on Computer Vision (ECCV 2000) , pp. 439-454, vol. 1, Dublin, June, 2000.
  23. Ming-Hsuan Yang, Narendra Ahuja, David Kriegman A Survey on Face Detection Methods (1999)
  24. Pixar/Lucas films Graphics Memos: http://www.alvyray.com/Memos/MemosPixar.htm
    Spline Tutorial Notes (the classic) by A.R. Smith, 1983
    Tech Memo 77, Computer Division, Lucasfilm, May 1983. Also issued as tutorial notes at SIGGRAPHs 83 and 84
  25. "PicASHOW: Pictorial Authority Search by Hyperlinks on the WEB", Ronny Lempel, Aya Soffer, ACM Trans. on Information Systems, Vol. 20, No 1, Jan. 2002, pp. 1-24.

3. NATURAL LANGUAGE PROCESSING

  1. Lillian Lee's distributional clustering approach for hierarchical clustering.
  2. Ellen Riloff's research on NLP information extraction.
    1. Riloff, E. (1993) "Automatically Constructing a Dictionary for Information Extraction Tasks", Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-93), AAAI Press/The MIT Press, pp. 811-816.
    2. Riloff, E. and Schmelzenbach, M. (1998) "An Empirical Approach to Conceptual Case Frame Acquisition", In Proceedings of the Sixth Workshop on Very Large Corpora , 1998.
    3. Riloff, E. and Jones, R. (1999) "Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping," In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99) , 1999.
  3. Efficient Crawling Through URL Ordering http://www-db.stanford.edu/~cho/crawler-paper/
  4. Linguist http://linguist.emich.edu
  5. WordNet http://www.cogsci.princeton.edu/~wn/
  6. Beyond Document Similarity: Understanding Value-Based Search and Browsing Technologies http://www-diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1998-0099
  7. Jurafsky's computational corpus linguistics http://www.colorado.edu/ling/jurafsky/
  8. Institut für Maschinelle Sprachverarbeitung , U. Stuttgart http://www.ims.uni-stuttgart.de/
    (research areas -> research results -> IMS, Decision Tree Tagger)
  9. IMS Corpus Workbench http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/
  10. CORPUS SEARCH TOOLS, Lancaster U. http://www.comp.lancs.ac.uk/computing/research/ucrel/tools.html
  11. Word sense disambiguation using a word graph
    1. Retrieving with good sense
    2. Survey of IR
    3. G. Hirst's CITO project
    4. Senseval
    5. Harabagiu pubs
    6. Budanitsky, Alexander and Hirst, Graeme. ``Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures.'' Workshop on WordNet and Other Lexical Resources, Second meeting of the North American Chapter of the Association for Computational Linguistics, Pittsburgh, June 2001.
  12. Eric Brill: A Simple Rule-Based Part Of Speech Tagger (Proceedings of ANLP-92, 3rd Conference on Applied Natural Language Processing) -- Brill's online publications -- POS software
  13. Applying Machine Learning for High-Performance Named-Entity Extraction (Baluja, Mittal, Sukthankar), Computational Intelligence, 16(4), 2000, pp. 586-595.
  14. Knowledge-based Extraction of Named Entities (J. Callan, T. Mitamura), ACM Conference on Information and Knowledge Management (CIKM), Nov. 4-9, 2002, McLean, Virginia.
  15. An experimental comparison of model-based clustering methods (Meila, Heckerman), Machine Learning, 42, pp. 9-29, 2001.
  16. Concept decompositions for Large sparse text data using clustering (Dhillon, Modha), Machine Learning, 42, pp. 143-175, 2001.
  17. Publications by the Natural Language Processing Group, Univ. of Salford
    1. Mima, H., Ananiadou, S. and Tsujii, J. ( 1999). A web-based integrated knowledge mining aid system using term-oriented NLP, Proceedings of Natural Language Processing Pacific Rim Symposium 99, Beijing, pp. 13-18.
    2. S. Ananiadou, S. Albert, D. Schuhmann. Evaluation of automatic term recognition of nuclear receptors from MEDLINE.
    3. Maynard, D. and Ananiadou, S. (2000a). Creating and using domain-specific ontologies for terminological applications, Proceedings of Second International Conference on Language Resources and Evaluation, Athens, pp. 868-874.
    4. Hideki MIMA, Sophia ANANIADOU, An Application and Evaluation of the C/NC-value Approach for the Automatic term Recognition of Multi-Word units in Japanese.
    5. D. Maynard, S. Ananiadou. Terminological acquaintance: the importance of contextual information in terminology.
    6. Frantzi, K., Ananiadou, S. and Mima, H. ( 2000). Automatic recognition of multiword terms, International Journal of Digital Libraries 3(2): 117-132.
    7. Mima, Ananiadou, Nenadic: The ATRACT Workbench: Automatic Term Recognition and Clustering of Terms, 2001.
    8. Nenadic, Spasic, Ananiadou: Automatic Discovery of Term Similarities using Pattern Mining, Computerm 2002.
    9. Nenadic, Mima, Spasic, Ananiadou, Tsujii: Terminology-driven Literature Mining and Knowledge Acquisition in Biomedicine, International Journal of Medical Informatics (2002).
    10. Goran Nenadic, Irena Spasic, Sophia Ananiadou: Term Clustering using a Corpus-Based Similarity Measure, TSD2002.
    11. Goran Nenadic, Irena Spasic, Sophia Ananiadou: Automatic Acronym Acquisition and Term Variation Management within Domain Specific Texts, 3rd Int. Conf. on Language Resources and Evaluation, 2002.
  18. Microevolutionary language theory (Mike Best thesis sup. by P. Maes)
  19. Resnik, P. Using Information Content to Evaluate Semantic Similarity in a Taxonomy, IJCAI 95. A longer and more recent version appears in JAIR, 11, 1999.
  20. Dolan, William ; Vanderwende, Lucy ; Richardson, Stephen D. Automatically Deriving Structured Knowledge Bases From On-Line Dictionaries In Proceedings of the Pacific Association for Computational Linguistics, April 21-24, 1993, Vancouver, British Columbia.
  21. Ken Church on practical tips on how to implement simple text processing using Unix tools. (53 pages long)
  22. S. Soderland, Learning Information Extraction Rules for Semi-structured and Free Text. Machine Learning, 1999
  23. Christopher D. Manning. Automatic acquisition of a large subcategorization dictionary from corpora. Proceedings of the 31st ACL, pp. 235-242.
  24. Keselj, Vlado (Nick Cercone) Unification-based grammars
    Subgrammar extraction for Head-Driven Phrase Structure Grammars HPSG

    Stefy: Java parser for HPSGs
  25. Obtaining Language Models of Web Collections Using Query-Based Sampling Techniques (HICSS 2002) Gary A. Monroe, James C. French, and Allison L. Powell
  26. Interactive Document Summarisation Using Automatically Extracted Keyphrases - (HICSS 2002) - Steve Jones, Stephen Lundy, and Gordon W. Paynter
  27. A Novel Method for Detecting Similar Documents (HICSS 2002) James W. Cooper, Anni S. Coden, and Eric W. Brown
  28. MindMap: Utilizing Multiple Taxonomies and Visualization to Understand a Document Collection (HICSS2002) Scott Spangler, Jeffrey T. Kreulen, and Justin Lessler
  29. The Interspace: Concept Navigation Across Distributed Communities
  30. Course on Text Mining (with lots of papers) by Wanda Pratt at UC Irvine (Spring 2001)
    Wanda Pratt's home page at U. of Washington
  31. Clifton, C. and Cooley, R.; TopCat: data mining for topic identification in a text corpus. In Zytkow, J.M. and Rauch, J. (eds.), Principles of Data Mining and Knowledge Discovery, Third European Conference, PKDD'99, 1999, pp. 174-183.
  32. Statistical NLP and Corpus Based Linguistics Resources
  33. Vlado Keselj's Natural Language Processing: literature, programs and text corpora
  34. Sheffield NLP Group (see resources)

4. FOCUSED CRAWLERS Overview by Francis Crimmins, Sep. 2001

  1. WebSPHINX: A Personal, Customizable Web Crawler
    http://www.cs.cmu.edu/~rcm/websphinx/ --- http://www.cis.upenn.edu/~lrossey/websphinx.html
  2. The NAUTILUS: NAvigate AUtonomously and Target Interesting Links for USers http://nautilus.dii.unisi.it/
  3. Focused crawling: a new approach to topic-specific Web resource discovery
    Soumen Chakrabarti, Martin van den Berg, Byron Dom http://www.almaden.ibm.com/almaden/feat/www8/
  4. Recent results in automatic Web resource discovery, Soumen Chakrabarti, ACM Computing Surveys ??(???), December 1999,
    http://www.cs.brown.edu/memex/ACMCSHT/42/42.html
  5. Lee Giles' publications --- Context graphs vldb2000.pdf --- Min-cut framework kdd2000.pdf
  6. Cora -- (the search engine for CS papers) --- http://cora.whizbang.com/ --- publications on Cora
  7. A. Ng's ML Papers http://gubbio.cs.berkeley.edu/mlpapers/
  8. ResearchIndex Publications
  9. Structural Web Search using a Graph-based Discovery System. Graph-Based Data Mining
  10. Henry Lieberman, C. Fry, L. Weitzman Exploring the Web with Reconnaissance Agents Comm ACM Aug 2001, Vol 44(8) --- Letizia
  11. Larbin: a recommended web crawler
  12. FunnelBack crawler for the P@NOPTIC Search Engine: http://www.panopticsearch.com/
  13. Persona: A Contextualized and Personalized Web Search, Francisco Tanudjaja and Lik Mui, HICSS 2002
  14. Intelligent Crawling on the World Wide Web with Arbitrary Predicates Charu C. Aggarwal, Fatima Al-Garawi, and Philip S. Yu, WWW10, 2001.
  15. The shark-search algorithm, Michael Hersovici, Michal Jacovi, Yoelle S. Maarek, Dan Pelleg, Menachem Shtalhaim, and Sigalit Ur, WWW7 (the algorithm used in the Mapuccino system).
  16. Watson Jay Budzik, Kristian Hammond, Larry Birnbaum, Devlab, Northwestern U. (also here).

5. WEB SCIENCE and GRAPH THEORETIC APPROACHES

  1. Kleinberg's Authoritative Sources in a Hyperlinked Environment http://www.cs.cornell.edu/home/kleinber/auth.pdf
  2. Barabasi's home page: http://www.nd.edu/~networks/
  3. The diameter of the WWW http://www.nd.edu/~networks/Papers/401130A0.pdf
  4. Emergence of Scaling in the WWW http://www.nd.edu/~networks/Papers/science.pdf
  5. The topology of the WWW http://www.nd.edu/~networks/Papers/proceeding.pdf
  6. The bow tie model of the Web http://www.almaden.ibm.com/almaden/webmap_press.html
  7. Social Networks: http://www.chass.utoronto.ca/~wellman/
  8. Clustering in large graphs and matrices (Drineas et al. Proc. Symp. Discr. Alg, SIAM, 1999)
  9. J. Kleinberg, C. Papadimitriou, P. Raghavan. Segmentation problems: A micro-economic view of data mining. Proc. 30th ACM Symposium on Theory of Computing, 1998.
  10. Silk from a sow's ear: Extracting usable structures from the Web, P. Pirolli, J. Pitkow, and R. Rao. , Proc. ACM SIGCHI, 1996.
  11. How Popular is Your Paper? An Empirical Study of the Citation Distribution, S. Redner, Eur. Phys. Jour. B 4, 131-134 (1998).
  12. The Campfire project (bipartite cores to identify communities on the WWW)
    Bipartite cores for modelling web communities

    Extracting large scale knowledge bases from the Web (bipartite cores), VLDB 1999
  13. IBM Clever project: http://www.almaden.ibm.com/cs/k53/clever.html
  14. Mining the Link Structure of the World Wide Web (1999) Soumen Chakrabarti, Byron E. Dom, David Gibson, Jon Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, IEEE Computer
  15. Graph connectivity quick reference
  16. Citation graph
    1. Chen, C. (1999) Visualising Semantic Spaces and Author Co-Citation Networks in Digital Libraries. Information Processing & Management, 35(3), 401-420.
    2. Eugene Garfield's home page.
  17. A. Borodin, G.O. Roberts, J.S. Rosenthal, and P. Tsaparas, Finding Authorities and Hubs From Link Structures on the World Wide Web. (WWW10, to appear. See published version.)
  18. Moses Charikar, Greedy approximation algorithms for finding dense components in a graph, In Proc. Third International Workshop Approximation Algorithms for Combinatorial Optimization, APPROX 2000, Klaus Jansen, Samir Khuller (Eds.), LNCS 1913, 84-95.
  19. Monika Henzinger publications sigir98 www8
  20. An Atlas of Cyberspaces: Surf maps, visualizing browsing behaviour
  21. Corinna Cortes, Daryl Pregibon, and Chris T. Volinsky: Communities of Interest (2001), Proceedings of IDA 2001 - Intelligent Data Analysis
  22. Self-similarity in the web Stephen Dill Ravi Kumar Kevin McCurley Sridhar... VLDB 2001
  23. Cybergeography Research
    Dodge and Kitchin Mapping Cyberspace Routledge, Oct. 2000
  24. Locating Information with Uncertainty in Fully Interconnected Networks with Applications to World Wide Web Information Retrieval (Kirousis, Kranakis et al).
  25. Self-Organization and Identification of Web Communities Gary Flake, Steve Lawrence, C. Lee Giles, Frans Coetzee
  26. Peer influence groups: identifying dense clusters in large networks James Moody, Social Networks 23 (2001) pp. 261-283
  27. Coevolution and self-organization in dynamical networks (COSIN European consortium)

6. SPECIAL ISSUE ON INTELLIGENT INTERNET SYSTEMS ARTIFICIAL INTELLIGENCE JOURNAL 118(1-2)

  1. Lesser, Victor, Horling, Bryan, Klassner, Frank, Raja, Anita, Wagner, Thomas, and Zhang, Shelley.
    BIG: An Agent for Resource-Bounded Information Gathering and Decision Making.
    http://mas.cs.umass.edu/publications.shtml
  2. Kushmerick, N. Wrapper induction: Efficiency and expressiveness.
    http://www.cs.ucd.ie/staff/nick/home/research/pubs.html
  3. W. Cohen WHIRL: A Word-based Information Representation Language, a journal-length overview paper on WHIRL. (A shorter version is also available.) http://whirl.research.att.com/
  4. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery Learning to Construct Knowledge Bases from the World Wide Web http://www.ri.cmu.edu/people/mitchell_tom.html#publications

7. INTELLIGENT WEB AGENTS TALKS

  1. WebACE
  2. Latent Semantic Indexing
  3. Kozima: Context Sensitive Measure of Word Distance orig.paper
  4. WebToKB (McCallum et al)

8. ONTOLOGY/HIERARCHY LEARNING/USING

  1. Ontology Learning ECAI-2000 Workshop -- http://ol2000.aifb.uni-karlsruhe.de/
    1. Enriching very large ontologies using the WWW. E. Agirre, O. Ansa, E. Hovy, D. Martinez
    2. Designing Clustering Methods for Ontology Building - The Mo'K Workbench. G.Bisson, C. Nedellec and D. Canamero.
  2. International Journal on Digital Libraries ISSN: 1432-5012 Index Volume 3 Number 3 October 2000
    1. Declarative Specification of Z39.50 Wrappers using Description Logics Yannis Velegrakis , Vassilis Christophides , Panos Constantopoulos
    2. Text-Based Approaches for the Categorization of Images (1999) Carl L. Sable and Vasileios Hatzivassiloglou
  3. Ayad & Kamel: Topic Discovery from Text using Aggregation of different Clustering Methods, Canadian AI Conference, 2002.
  4. I. Varlamis, M. Vazirgiannis, M. Halkidi, B. Nguyen. «THESUS: Effective Thematic Selection And Organization Of Web Document Collections Based On Link Semantics», to appear in the IEEE Transactions on Knowledge and Data Engineering, 2003.
    Data Bases and Knowledge Discovery group @ AUEB
  5. Ontology Matching

9. AI-SPECIFIC SOFTWARE RESOURCES

(partly from Ali&McRoy "Java Resource for Artificial Intelligence", intelligence, SIGART ACM, 11(2), Summer 2000)
General Java Resources  
Sun's Java Website http://java.sun.com
Gamelan, repository of Java tools http://gamelan.earthweb.com
Java Programmer's FAQ http://www.afu.com/javafaq.html
Links to AI-specific Java Resources  
Jess, Rule-based system similar to CLIPS http://herzberg.ca.sandia.gov/jess/
Weka, collection of machine learning algorithms http://www.cs.waikato.ac.nz/ml/weka
Genetic Programming, S. Luke's ECJ and A. Qureshi's gpsys http://www.cs.umd.edu/projects/plus/ec/ecj
http://www.cs.ucl.ac.uk/staff/A.Qureshi/gpsys_doc.html
JavaBayes: Bayesian networks http://www.cs.cmu.edu/~javabayes/
Neural networks: jaNet package http://www.hta-bi.bfh.ch/Projects/janet/
YAG: natural language generator http://tigger.cs.uwm.edu/~nlkrrg/
NGram Statistics package http://www.d.umn.edu/~tpederse/nsp.html
AgentBuilder's survey of agent construction tools http://www.agentbuilder.com/AgentTools/
GATE (General Architecture for Text Engineering) http://gate.ac.uk/
Protege Ontology Editor http://protege.stanford.edu/
The KIM Platform for Knowledge & Information Management http://www.sirma.bg/OntoText/KIM/
Jakarta Lucene: a high-performance, full-featured text search engine written entirely in Java. http://jakarta.apache.org/
Torch: a machine-learning library, written in simple C++ http://www.torch.ch/
SVMlight: implementation of Support Vector Machines (SVMs) in C. http://svmlight.joachims.org/
OSU SVM Classifier Matlab Toolbox http://www.ece.osu.edu/~maj/osu_svm/
SOM Matlab Toolbox http://www.cis.hut.fi/projects/somtoolbox/
Pattern Recognition Matlab Toolbox http://neural.cs.nthu.edu.tw/jang/matlab/toolbox/DCPR/
Matlab toolboxes http://www.tech.plym.ac.uk/spmc/matlab/matlab_toolbox.html
BioNLP Resources http://www.tufts.edu/~amorga02/bcresources.html
OntoParser, an XML2RDF translator for OntoBuilder ontologies http://ie.technion.ac.il/OntoBuilder
Ontologies are available under "Ontologies downloads," partitioned into 14 domains. For the OntoParser, go to "OntoBuilder downloads" and follow the link to "OntoParser: an XML2RDF translator of OntoBuilder ontologies." The zip file contains a user manual with all installation information.
FIHC, Frequent Itemset-based Hierarchical Clustering http://www.cs.sfu.ca/~ddm
eprints software for online publishing http://www.eprints.org/
htdig a search engine http://www.htdig.org/
lemur, language modelling/IR http://www.lemurproject.org/
lucene apache: a search engine in Java http://lucene.apache.org/java/docs/
Tools for the Reuters collection http://www.lins.fju.edu.tw/~tseng/Collections/Reuters-21578.html
OpenNLP: large collection of open NLP tools http://opennlp.sourceforge.net/projects.html
FrameNet - on-line lexical resource for English, based on frame semantics http://framenet.icsi.berkeley.edu/
VerbNet - a lexical resource on verbs http://en.wikipedia.org/wiki/VerbNet
LIBSVM -- Library for Support Vector Machines http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Maximum Entropy / Logistic Regression http://www.cs.utah.edu/~hal/megam/
Patrick Haffner: Scaling large margin classifiers for spoken language understanding, Speech Communication 48 (2006) 239–261

MEAD -- multi-document summarization system (Dragomir Radev) http://www.summarization.com/mead/
JUNG -- Java Universal Network/Graph Framework http://jung.sourceforge.net/
A software library that provides a common and extensible language for the modeling, analysis, and visualization of data that can be represented as a graph or network.
K. Murphy: Bayes Network Toolbox for Matlab

Datasets

UCI KDD Archive
Reuters-21578
Reuters Corpus (RCV1 and RCV2)
Wikipedia XML corpus

10. MISC

  1. LEDA: A C++ library of the data types and algorithms of combinatorial computing Book Manual
  2. R, a free (GPL) version of statistical software S/S-plus (includes multidimensional scaling)
  3. Book: Social Dimensions of Information Technology
    Click on "book excerpt" to read Chapter 1, Virtual Communities and Social Capital by Blanchard and Horan
    Sample copies of the Social Science Computer Review from jsamples@sagepub.com
  4. Cluster for Intelligent Mobile Agents for Telecommunication Environments — CLIMATE
  5. How to get (and keep) an NSERC research grant (ps) other resources
  6. Internet Application Workbook by Philip Greenspun
  7. GNU's not Unix
  8. Social Scientists: Managing Identity in Socio-Technical Networks (HICSS2002) Roberta Lamb and Elizabeth Davidson
  9. Discrete Algorithms and Data Structures software
  10. K. Murphy: Bayes Network Toolbox for Matlab
  11. K. Murphy: HMM toolbox
  12. Cawley, G. C. Matlab Support Vector Machine Toolbox
  13. Boost graph library (an open source alternative to LEDA for graph algorithms)
  14. Experimental Design 1 2 3 4

11. Agent-based economics

  1. ASPEN Microsimulation Economics Model
  2. Microsimulation
  3. Mike Wellman's Market-oriented programming
  4. J. Kephart's Dynamic Pricing by software agents html pdf
  5. M. Huhns home page (online auctions, agents)

12. Web-Information Filtering Lab

  1. Context aware retrieval links:
    http://www.sims.berkeley.edu/~hearst/papers/data-engineering/
    http://www.dcs.ex.ac.uk/~pjbrown/papers/ir.html http://www.research.microsoft.com/research/db/debull/A00sept/issue.htm
  2. The Effect of Linking on Genres of Web documents (Crowston and Williams)
  3. XML Schema Formal Description
  4. IJCAI 2001 Workshop on Intelligent Techniques for Web Personalization (WEB-2)
    1. Clustering navigation patterns on a website using a sequence alignment method . Birgit Hay, Geert Wets and Koen Vanhoof, Limburg University, Belgium
    2. Modeling users navigation history (IJCAI 2001 Workshop on Intelligent Techniques for Web Personalization).
      Ernesto Damiani, Barbara Oliboni, Elisa Quintarelli and Letizia Tanca, Universita degli Studi di Milano, Italy
    3. Improving the effectiveness of collaborative filtering on anonymous web usage data Bamshad Mobasher, Honghua Dai, Tao Luo, Miki Nakagawa, School of Computer Science, Telecommunication, and Information Systems, DePaul University, Chicago, Illinois, USA
    4. Web site personalizers for mobile devices Corin R. Anderson, Pedro Domingos, Daniel S. Weld, University of Washington, Seattle, WA, USA
  5. C.J. van Rijsbergen, INFORMATION RETRIEVAL Second Edition (on-line text)
  6. Information Retrieval Links (including SMART).
  7. Biovista.com

ACM SIGIR Information Retrieval Resources

TREC
Main TREC Web site
Web research collections
--- Paper on the wt10g collection (TREC 2001)
TREC 2002 Web Track guidelines
Overview of the TREC 2001 Web Track competition (ps.gz) --- Overview of the TREC 2000 Web Track competition
Descriptions of the 2001 contributions for each of the two tasks (adhoc and entry page) described in the overview.
Web adhoc results (2001) --- topics
Web entry-page results (2001) --- topics

Agreement to use the data sets on hermes.cs.dal.ca

13. Machine learning and information extraction

  1. Ghahramani, Z. (2001) An Introduction to Hidden Markov Models and Bayesian Networks International Journal of Pattern Recognition and Artificial Intelligence 15(1):9-42.
  2. Jeff Bilmes A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998).
  3. Learning regular languages from simple positive examples
  4. Learning DFA from Simple Examples
  5. Efficient algorithms for the inference of minimum size DFAs
  6. Hierarchical Wrapper Induction for Semistructured Information Sources Ion Muslea, Steve Minton, Craig Knoblock. Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114, 2001
  7. Line Eikvil Information Extraction from World Wide Web - A Survey
  8. IJCAI-2001 Workshop on Adaptive Text Extraction and Mining (ML-1)
    1. William W. Cohen & Lee S. Jensen: A structured wrapper induction system for extracting information from semi-structured documents
    2. Lee S. Jensen & William W. Cohen: Grouping extracted fields
    3. Craig A. Knoblock, Kristina Lerman, Steven Minton & Ion Muslea: A machine-learning approach to accurately and reliably extracting data from the Web
    4. Kristina Lerman, Craig Knoblock & Steven Minton: Automatic data extraction from lists and tables in Web sources
    5. David Pierce & Claire Cardie: User-oriented machine learning strategies for information extraction: Putting the human back in the loop
  9. IJCAI-2001 Workshop on Text Learning: Beyond Supervision (ML-3)
    1. Selective Sampling + Semi-supervised Learning = Robust Multi-View Learning Ion Muslea, Steven Minton, and Craig A. Knoblock
    2. Detection of errors in training data by using a decision list and Adaboost Hiroyuki Shinnou
    3. Ontology-based Text Clustering A. Hotho, S. Staab, and A. Maedche
    4. Probabilistic Models of Text and Link Structure for Hypertext Classification Lise Getoor, Eran Segal, Ben Taskar, and Daphne Koller
  10. WebDB 2000 Proceedings
  11. Kamal Nigam. Using Unlabeled Data to Improve Text Classification. Doctoral Dissertation, Computer Science Department, Carnegie Mellon University. Technical Report CMU-CS-01-126. 2001 (ML paper)
    Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2/3). pp. 103-134. 2000
    Kamal Nigam and Rayid Ghani. Analyzing the Effectiveness and Applicability of Co-training. In Ninth International Conference on Information and Knowledge Management (CIKM-2000), pp. 86-93. 2000
  12. David Cohn, Les Atlas and Richard Ladner. (1994) Improving generalization with active learning, Machine Learning 15(2):201-221.
  13. D. Freitag. Information Extraction from HTML: Application of a General Machine Learning Approach, AAAI/IAAI 1998
  14. Fabrizio Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys, 2002
    http://faure.iei.pi.cnr.it/~fabrizio/Publications/ACMCS02.pdf
  15. Rosario, B., and Hearst, M., Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy, in the Proceedings of Empirical Methods in Natural Language Processing EMNLP '01, Pittsburgh, PA, June 2001. (From the BAILANDO, or "Better Access to Information using Language Analysis and New Displays and Organizations", publication list.)
  16. Support Vector Machine resources -- Kernel machine resources
    Tutorial on Support Vector Machines and Kernel Methods Presented at ICML-2001 by Nello Cristianini
  17. C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
    A. J. Smola and B. Schölkopf. A Tutorial on Support Vector Regression. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998.
  18. Peter Cheeseman, John Stutz: Bayesian Classification (AutoClass): Theory and Results (1996), Advances in Knowledge Discovery and Data Mining
  19. Hinrich Schütze, Craig Silverstein Projections for Efficient Document Clustering (1997)
  20. D. Heckerman. A tutorial on learning with Bayesian Networks. Microsoft Research TR, 1996
    ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.ps , ftp://ftp.research.microsoft.com/pub/dtg/david/tutorial.ps
  21. Henry Lieberman, Bonnie A. Nardi, David Wright Training Agents to Recognize Text by Example, ACM Conference on Autonomous Agents [Agents-99], Seattle, 1-5 May 1999
  22. T. Joachims. Text Categorization with Support Vector Machines. TR VIII-23, U. of Dortmund, 1997.
  23. An Evaluation of Statistical Approaches to Text Categorization (1997) Yiming Yang
  24. Active Learning for Natural Language Parsing and Information Extraction Cynthia A. Thompson, Mary Elaine Califf, and Raymond J. Mooney, Proceedings of the Sixteenth International Machine Learning Conference (ICML-99), Bled, Slovenia, pp. 406-414, June 1999 (ps)
  25. Relational Learning of Pattern-Match Rules for Information Extraction Mary Elaine Califf and Raymond J. Mooney, Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), Orlando, FL, pp. 328-334, July, 1999 (ps)
  26. A Comparison of Document Clustering Techniques Michael Steinbach, George Karypis, Vipin Kumar, KDD Workshop on Text Mining, 2000.
  27. On the merits of building categorization systems by supervised clustering, Charu C. Aggarwal, Stephen C. Gates, Philip S. Yu, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, 1999, San Diego, California.
  28. Manolis Koubarakis's work on Boolean queries with proximity operators:
    Manolis Koubarakis, Theodoros Koutris, Paraskevi Raftopoulou, and Christos Tryfonopoulos: Efficient dissemination of textual information using the Boolean model 2nd Hellenic Conference on Artificial Intelligence, April 11-12, 2002, Thessaloniki, Greece.
    Manolis Koubarakis: Boolean queries with proximity operators for information dissemination. International Workshop on Foundations of Models for Information Integration (FMII-2001), the 10th workshop in the series Foundations of Models and Languages for Data and Objects (FMLDO), Viterbo (near Rome), Italy, 16-18 September 2001 (immediately after VLDB-2001).
  29. Text clustering (from Biao Chen)
    1. Web Document Clustering: A Feasibility Demonstration (1998) Oren Zamir, Oren Etzioni
    2. Fast and Intuitive Clustering of Web Documents (1997) Oren Zamir, Oren Etzioni, Omid Madani, Richard M. Karp
    3. Scalable Techniques for Clustering the Web
    4. A Min-max Cut Algorithm for Graph Partitioning and Data Clustering, Chris Ding, Xiaofeng He, Hongyuan Zha, Ming Gu and Horst Simon. Proc. 1st IEEE Int'l Conf. Data Mining. San Jose, CA, 2001. pp.107-114.
    5. Automatic Topic Identification Using Webpage Clustering Xiaofeng He, Chris H.Q. Ding, Hongyuan Zha, Horst D. Simon Proc. 1st IEEE Int'l Conf. Data Mining. San Jose, CA, 2001. pp.195-202.
    6. Co-clustering documents and words using Bipartite Spectral Graph Partitioning
  30. Statistics reference (hyperstat online)
  31. Clustering overview by Schuetze
  32. Information Extraction in Biology
  33. Clustering software for gene expression profiles, XCluster

14. Web information retrieval

1. The Web IR and IE collection http://www.haifa.il.ibm.com/webir/
Of particular interest are "Selected Publications" and "PhD/MSc related work"

2. Intl. Workshop on Web Document Analysis - WDA2001 http://www.csc.liv.ac.uk/~wda2001/

15. Data sets

1. RCV1-v2 Text Categorization Test Collection (Reuters). Appendix to:
Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004.
2. ArXiv is an e-print service in the fields of physics, mathematics, non-linear science, computer science, and quantitative biology.

16. Statistical Machine Learning

1. PASCAL network of excellence - Pattern Analysis, Statistical modelling and ComputAtional Learning (incl. video lectures)
2. Machine Learning Summer schools


(Material) Robotics references

1. Particle swarms
2. Sensor magazine SensorPortal
3. CVOnline, a Computer Vision Encyclopedia
4. Ballard and Brown's Computer Vision text


Related Industry

Halifax-Atlantic Canada
IT Interactive Services (ITIS) (Halifax, Web applications) Genieknows
Coemergence (Halifax, Business Knowledge Management specializing in the mining sector)
Kanayo (Halifax, Peer-to-peer engine for medical information cataloguing and dissemination)
Skywire Software (Moncton, legal document generation and transformation)

Ontario-Quebec
Palomino (Toronto, Web site creation and maintenance tool)
Sysomos (Toronto, Blog text mining, Koudas & Bansal)
OpenText (Waterloo, Livelink collaboration and knowledge management software for the enterprise)
Hummingbird (Toronto, Enterprise Content Management)
Techne Knowledge Systems, Inc. (Toronto, Business Knowledge Management)
Pattern Discovery Software Systems (Waterloo, data mining, spinoff of PAMI lab)
Protana (formerly MDS Proteomics Inc. - MDSP) (Toronto, a drug discovery company)
Bell University Laboratories (Toronto)
Branddimensions (Toronto, Buzz analysis)

Nstein (Montreal) - Multilingual information management
Lingua Technologies (Montreal) - Translation, Text Mining (Precarn member)
Language Industry Association (Quebec-Canada) - trade association in multilingual text information management
Nomino Technologies (Montreal) - natural language processing in e-customer-service (through web sites)

Western Canada
Axonwave Software Inc. (Vancouver, SFU spinoff, F. Popowich)
Business Objects (Vancouver, Business Intelligence)

USA
Entopia Knowledge Builder (bottom-up Knowledge Management)
Mohomine (document classification)
Wherewithal (knowledge management from the intranet portal)
Autonomy (corporate document knowledge management) white papers, case studies (Autonomy bought Verity (K2 Enterprise (Knowledge Management)))
Entrieva (term extraction, document classification, taxonomy building (manual))
Stratify (text classification, taxonomy management)
Applied Semantics (acquired by Google, ontology-based software)
Systems Research and Development (non-obvious relationship awareness)
Dynago (document organization and summarization, metasearch engine. Check out DART)
Eurekster (collaborative Web searches)
Interwoven Inc. (enterprise content management)
Inxight Software Inc. (unstructured data management)
Vivisimo (document clustering engine)
Meaning Master (search engine technology), used by Eurekah Biosciences DB
WebBrain (conceptual organization of web spaces, visual interface for browsing the ODP directory)
Recommind (search, categorization and taxonomy generation from text - Hofmann - Probabilistic LSA)
Language Weaver (statistical machine translation)
Zoominfo (summarization of web content about people or companies)
Semagix (enterprise content exploitation)
Business Objects (business intelligence, bought by SAP)
Insightful Corp. (S-Plus statistical software, data mining)
Burning Glass (resume processing and matching with job requirements)
MarkLogic (Web content management server, some text mining - see customer demos for a rich set of applications)

Europe
BOC Information Technologies Consulting (Vienna, Business Process Management)
Mint Business Solutions (UK) Mint MCI Document Management
Autonomy Corp PLC (UK) Enterprise portals/search, clustering
Unicorn Solutions Inc. (Israel) data semantics
Clearforest (Israel/US) text analytics
Atypon (Greece) Text data mining ScienceLine
Biovista (Greece) Discovery algorithmics in Biotech (AI, NLP)
Velti (Greece) Enterprise content management / portals.
Ontotext (Bulgaria) Semantic annotation, indexing, retrieval
Neurosoft (Greece) NLP, Lexicon of modern Greek
ANCO (Greece) Educational applications, telecommunications
@Semantics (Italy) Enterprise Information Integration
Teezir (Netherlands) Enterprise Search
Collexis (Netherlands) Expert profiling, expert social networks
Dialogos: speech communication systems (Greece) Speech recognition systems - represents Nuance.com in Greece / partly owned by Intracom IT Services

Australia
Mind Systems (Topic Mapping, Personal Information Management)

Deep Web
Quigo Technologies' Intellisonar
Deep Web Technologies' Distributed Explorit
EduMed (based on Multimedia DBMS, VDMS)

Industry collaboration reference
Sample agreements: AUTM -> Agreements -> Sample Agreements
SR&ED Tax Credits
Canadian Research Transfer Network (CRTN)

OTHER

The secret of how Microsoft stays on top