Web Robot Project Bibliography
1. WEB INDEXING, SEARCH ENGINES
2. IMAGE/MULTIMEDIA CONTENT-BASED RETRIEVAL
3. NATURAL LANGUAGE PROCESSING
4. FOCUSED CRAWLERS
5. WEB SCIENCE and GRAPH THEORETIC APPROACHES
6. SPECIAL ISSUE ON INTELLIGENT INTERNET SYSTEMS, ARTIFICIAL INTELLIGENCE JOURNAL 118(1-2)
7. INTELLIGENT WEB AGENTS TALKS
8. ONTOLOGY/HIERARCHY LEARNING
9. AI-SPECIFIC SOFTWARE RESOURCES
10. MISC
11. Agent-based economics
12. Web-Information Filtering Lab --- TREC
13. Machine learning and information extraction
14. Web information retrieval
15. Data sets
16. Statistical Machine Learning
(Material) Robotics references
Journals to publish in
Text mining - Knowledge Management
industry
Health
text mining
Industry collaboration
reference
Graduate Courses
J. Kleinberg: The Structure of Information Networks, Cornell, Computer Science 685, Fall 2002
J. Kleinberg: Randomized and High-Dimensional Algorithms, Cornell, Computer Science (Spring 2001).
1. WEB INDEXING, SEARCH ENGINES
- Google http://google.stanford.edu/about.html
- Google Anatomy paper GoogleAnatomy.pdf
- T. Haveliwala: Efficient
Computation of PageRank. Stanford U. CS Technical Report, 1999
- D. Rafiei and A.Mendelzon What
do the Neighbours Think? Computing Web Page Reputations, IEEE Data Engineering
Bulletin, September 2000. (WWW9 version)
- Open directory project (human-edited indexing) http://dmoz.org/
- The Web Robot's Pages
- Computer Science Research Paper Search Engine http://www.cora.jprc.com/
- Surf Companion web agent http://surfcompanion.wwz.de/Help/TOC_Help.html
- The "Invisible Web," the part of cyberspace that's inaccessible
to search engines, but is still searchable -- if
you know where to find the gateways. http://gwis2.circ.gwu.edu/~gprice/direct.htm
- The Extreme Searcher's Web Page http://www.onstrat.com/
- Search engine resources
- Web developer's virtual library
- Crawling
the hidden Web, Sriram Raghavan, Hector Garcia-Molina. In the Proceedings
of the 27th Intl. Conf. on Very Large Databases (VLDB), pp. 129-138, September
2001.
- The Deep Web
http://www.brightplanet.com/
- Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram
Raghavan, Searching the Web, ACM Transactions
on Internet Technology, Vol. 1, No. 1, August 2001, Pages 2–43.
- UIUC Web Integration
Repository (Deep Web data sets - information extraction, interaction with
deep web sites)
2. IMAGE/MULTIMEDIA CONTENT-BASED RETRIEVAL
- Centre for Intelligent Information Retrieval at UMass http://ciir.cs.umass.edu/
- Columbia's Content-Based Visual Query Project http://comet.ctr.columbia.edu/~sfchang/demos.html
- Image Processing and Retrieval on the Web (Theo Gevers) http://carol.wins.uva.nl/~gevers/
- ImageRover S. Sclaroff at BU http://www.cs.bu.edu/groups/ivc/ImageRover/
- Excalibur Visual RetrievalWare http://vrw.excalib.com:8015/cst
- Interpix http://www.interpix.com
- Mike Swain's tech reports on Multimedia Indexing http://www.crl.research.digital.com/publications/techreports/techreports.html
- Stanford digital library project http://walrus.stanford.edu/diglib/pub/reports/
- MPEG standard for multimedia data compression http://www.mpeg.org/MPEG/
- Cobion Visual Content Search
- V. Wu's Finding Text in Images (2nd ACM
Int. Conf. on Digital Libraries, 1997) --
also Manmatha's papers
on multimedia indexing and retrieval.
- Free-form
object recognition survey
- Text-based approaches
for the categorization of images, ECDL-99, 3rd European Conference on
Research and Advanced Technology for Digital Libraries, Sable and Hatzivassiloglou,
also IJDL 2001.
- NSERC proposal summary
text
- J. R.
Smith and S.-F. Chang, "Visually Searching the Web for Content," IEEE
Multimedia Magazine, Summer, Vol. 4 No. 3, pp.12-20, 1997. (also Columbia
U. CU/CTR Technical Report #459-96-25). (WebSEEk demo)
- James Z. Wang, Penn State
- James Z. Wang, Jia Li, Gio Wiederhold, ``SIMPLIcity:
Semantics-sensitive Integrated Matching for Picture LIbraries,'' IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no.
9, 16 pp., 2001
- James Z. Wang, Gio Wiederhold, Oscar Firschein, Sha Xin Wei, ``Wavelet-based
image indexing techniques with partial sketch retrieval capability,''
Proc. IEEE Forum on Research and Technology Advances in Digital Libraries
(ADL'97), pp. 13-24, Washington D.C., IEEE, May 1997
- Mohan Kankanhalli's course
on Multimedia Information Retrieval
- Rubner and Tomasi:
Earth Mover's Distance for Image Database Navigation (applied to color and
texture based retrieval, locally!)
- Bartlett, M.S.,
Donato, G.L., Movellan, J.R., Hager, J.C., Ekman, P., and Sejnowski, T.J.
(2000). Image representations for facial
expression coding. In S. Solla, T. Leen, & K. Mueller, Eds. Advances in
Neural Information Processing Systems 12, Cambridge, MA: MIT Press, p. 886-892.
- C.C. Jay
Kuo. Content-based Audio Classification and Retrieval
- Shu-Ching Chen, Mei-Ling Shyu,
and R. L. Kashyap, "Augmented
Transition Network as a Semantic Model for Video Data," International
Journal of Networking and Information Systems, Special Issue on Video Data,
vol. 3, no. 1, pp. 9-25, 2000.
- Ming-Hsuan Yang,
Dan Roth and Narendra Ahuja, "Learning
to Recognize 3D Objects With SNoW", podium presentation, in Proceedings
of the Sixth European Conference on Computer Vision (ECCV 2000) , pp. 439-454,
vol. 1, Dublin, June, 2000.
- Ming-Hsuan Yang, Narendra Ahuja, David Kriegman A
Survey on Face Detection Methods (1999)
- Pixar/Lucasfilm Graphics Memos: http://www.alvyray.com/Memos/MemosPixar.htm
Spline Tutorial Notes (the
classic) by A.R. Smith, 1983
Tech Memo 77, Computer Division, Lucasfilm, May 1983. Also issued as tutorial
notes at SIGGRAPHs 83 and 84
- "PicASHOW: Pictorial Authority
Search by Hyperlinks on the WEB", Ronny Lempel, Aya Soffer, ACM Trans.
on Information Systems, Vol. 20, No 1, Jan. 2002, pp. 1-24.
3. NATURAL LANGUAGE PROCESSING
- Lillian Lee's distributional
clustering approach for hierarchical clustering.
- Ellen Riloff's
research on NLP information extraction.
- Riloff, E. (1993) "Automatically
Constructing a Dictionary for Information Extraction Tasks", Proceedings
of the Eleventh National Conference on Artificial Intelligence (AAAI-93)
, AAAI Press/The MIT Press, pp. 811-816.
- Riloff, E. and Schmelzenbach, M. (1998) "An
Empirical Approach to Conceptual Case Frame Acquisition", In Proceedings
of the Sixth Workshop on Very Large Corpora , 1998.
- Riloff, E. and Jones, R. (1999) "Learning
Dictionaries for Information Extraction by Multi-Level Bootstrapping,"
In Proceedings of the Sixteenth National Conference on Artificial Intelligence
(AAAI-99) , 1999.
- Efficient Crawling Through URL Ordering http://www-db.stanford.edu/~cho/crawler-paper/
- Linguist http://linguist.emich.edu
- WordNet http://www.cogsci.princeton.edu/~wn/
- Beyond Document Similarity: Understanding Value-Based Search and Browsing
Technologies http://www-diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1998-0099
- Jurafsky's computational corpus linguistics http://www.colorado.edu/ling/jurafsky/
- Institut für Maschinelle Sprachverarbeitung , U. Stuttgart http://www.ims.uni-stuttgart.de/
(research areas -> research results -> IMS, Decision Tree Tagger)
- IMS Corpus Workbench http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/
- CORPUS SEARCH TOOLS, Lancaster U. http://www.comp.lancs.ac.uk/computing/research/ucrel/tools.html
- Word sense disambiguation using a word graph
- Retrieving
with good sense
- Survey
of IR
- G.
Hirst's CITO project
- Senseval
- Harabagiu
pubs
- Budanitsky, Alexander and Hirst, Graeme. ``Semantic
distance in WordNet: An experimental, application-oriented evaluation
of five measures.'' Workshop on WordNet and Other Lexical Resources,
Second meeting of the North American Chapter of the Association for Computational
Linguistics, Pittsburgh, June 2001.
- Eric Brill: A Simple Rule-Based
Part Of Speech Tagger (Proceedings of ANLP-92, 3rd Conference on Applied
Natural Language Processing) -- Brill's
online publications -- POS
software
- Applying Machine Learning for
High-Performance Named-Entity Extraction (Baluja, Mittal, Sukthankar),
Computational Intelligence, 16(4), 2000, pp. 586-595.
- Knowledge-based Extraction
of Named Entities (J. Callan, T. Mitamura), ACM Conference on Information
and Knowledge Management (CIKM), Nov. 4-9, 2002, McLean, Virginia.
- An experimental comparison of
model-based clustering methods (Meila, Heckerman), Machine Learning, 42,
pp. 9-29, 2001.
- Concept decompositions for Large
sparse text data using clustering (Dhillon, Modha), Machine Learning,
42, pp. 143-175, 2001.
- Publications by the Natural
Language Processing Group, Univ. of Salford
- Mima, H., Ananiadou, S. and Tsujii, J. (1999). A
web-based integrated knowledge mining aid system using term-oriented NLP,
Proceedings of Natural Language Processing Pacific Rim Symposium 99, Beijing,
pp. 13-18.
- S. Ananiadou, S. Albert, D. Schuhmann. Evaluation
of automatic term recognition of nuclear receptors from MEDLINE.
- Maynard, D. and Ananiadou, S. (2000a). Creating
and using domain-specific ontologies for terminological applications,
Proceedings of Second International Conference on Language Resources and
Evaluation, Athens, pp. 868-874.
- Hideki MIMA, Sophia ANANIADOU, An
Application and Evaluation of the C/NC-value Approach for the Automatic
term Recognition of Multi-Word units in Japanese.
- D. Maynard, S. Ananiadou. Terminological
acquaintance: the importance of contextual information in terminology.
- Frantzi, K., Ananiadou, S. and Mima, H. (2000). Automatic
recognition of multiword terms, International Journal of Digital Libraries
3(2): 117-132.
- Mima, Ananiadou, Nenadic: The ATRACT
Workbench: Automatic Term Recognition and Clustering of Terms, 2001.
- Nenadic, Spasic, Ananiadou: Automatic
Discovery of Term Similarities using Pattern Mining, Computerm 2002.
- Nenadic, Mima, Spasic, Ananiadou, Tsujii: Terminology-driven
Literature Mining and Knowledge Acquisition in Biomedicine, International
Journal of Medical Informatics (2002).
- Goran Nenadic, Irena Spasic, Sophia Ananiadou:
Term Clustering using a Corpus-Based Similarity Measure, TSD2002.
- Goran Nenadic, Irena Spasic, Sophia Ananiadou: Automatic
Acronym Acquisition and Term Variation Management within Domain Specific
Texts, 3rd Int. Conf. on Language Resources and Evaluation, 2002.
- Microevolutionary
language theory (Mike Best thesis sup. by P. Maes)
- Resnik, P. Using Information Content to
Evaluate Semantic Similarity in a Taxonomy, IJCAI 95. A longer and more
recent version appears in JAIR, 11, 1999.
- Dolan, William ; Vanderwende, Lucy ; Richardson, Stephen D. Automatically
Deriving Structured Knowledge Bases From On-Line Dictionaries In Proceedings
of the Pacific Association for Computational Linguistics, April 21-24, 1993,
Vancouver, British Columbia.
- Ken Church on practical tips on
how to implement simple text processing using Unix tools. (53 pages long)
- S. Soderland, Learning Information
Extraction Rules for Semi-structured and Free Text. Machine Learning,
1999
- Christopher D. Manning.
Automatic acquisition of a large subcategorization
dictionary from corpora. Proceedings of the 31st ACL, pp. 235-242.
- Keselj, Vlado (Nick Cercone) Unification-based
grammars
Subgrammar extraction for Head-Driven Phrase Structure Grammars HPSG
Stefy: Java parser for HPSGs
- Obtaining Language Models
of Web Collections Using Query-Based Sampling Techniques (HICSS
2002) Gary A. Monroe, James C. French, and Allison L. Powell
- Interactive Document Summarisation
Using Automatically Extracted Keyphrases - (HICSS
2002) - Steve Jones, Stephen Lundy, and Gordon W. Paynter
- A Novel Method for Detecting
Similar Documents (HICSS
2002) James W. Cooper, Anni S. Coden, and Eric W. Brown
- MindMap: Utilizing Multiple Taxonomies
and Visualization to Understand a Document Collection (HICSS2002)
Scott Spangler, Jeffrey T. Kreulen, and Justin Lessler
- The Interspace: Concept Navigation
Across Distributed Communities
- Course on Text
Mining (with lots of papers) by Wanda Pratt at UC Irvine (Spring 2001)
Wanda Pratt's home page
at U. of Washington
- Clifton, C, Cooley, R, Zytkow, JM, and Rauch, J; TopCat:
data mining for topic identification in a text corpus. In Principles of
Data Mining and Knowledge Discovery, Third European Conference, PKDD'99, 1999, pp. 174-183.
- Statistical NLP
and Corpus Based Linguistics Resources
- Vlado Keselj's Natural Language
Processing: literature, programs and text corpora
- Sheffield NLP Group (see resources)
4. FOCUSED CRAWLERS
- Overview by Francis Crimmins, Sep. 2001
- WebSPHINX: A Personal, Customizable Web Crawler
http://www.cs.cmu.edu/~rcm/websphinx/
--- http://www.cis.upenn.edu/~lrossey/websphinx.html
- The NAUTILUS: NAvigate AUtonomously and Target Interesting Links for USers
http://nautilus.dii.unisi.it/
- Focused crawling: a new approach to topic-specific Web resource discovery
Soumen Chakrabarti, Martin van den Berg, Byron Dom http://www.almaden.ibm.com/almaden/feat/www8/
- Recent results in automatic Web resource discovery, Soumen Chakrabarti,
ACM Computing Surveys ??(???), December 1999,
http://www.cs.brown.edu/memex/ACMCSHT/42/42.html
- Lee
Giles' publications --- Context graphs
vldb2000.pdf --- Min-cut framework kdd2000.pdf
- Cora -- (the search engine for CS papers) --- http://cora.whizbang.com/
--- publications on Cora
- A. Ng's ML Papers http://gubbio.cs.berkeley.edu/mlpapers/
- ResearchIndex
Publications
- Structural Web Search using a Graph-based
Discovery System. Graph-Based
Data Mining
- Henry Lieberman,
C. Fry, L. Weitzman Exploring
the Web with Reconnaissance Agents Comm ACM Aug 2001, Vol 44(8) --- Letizia
- Larbin: a recommended
web crawler
- FunnelBack crawler
for the P@NOPTIC Search Engine: http://www.panopticsearch.com/
- Persona: A Contextualized And Personalized
Web Search (HICSS 2002) Francisco Tanudjaja and Lik Mui
- Intelligent Crawling
on the World Wide Web with Arbitrary Predicates Charu C. Aggarwal, Fatima
Al-Garawi, and Philip S. Yu, WWW10, 2001.
- The
shark-search algorithm, Michael Hersovici, Michal Jacovi, Yoelle S. Maarek,
Dan Pelleg, Menachem Shtalhaim, and Sigalit Ur, WWW7 (the algorithm used
in the Mapuccino
system).
- Watson
Jay Budzik, Kristian Hammond, Larry Birnbaum, Devlab,
Northwestern U. (also here).
5. WEB SCIENCE and GRAPH THEORETIC APPROACHES
- Kleinberg's Authoritative Sources in a Hyperlinked Environment http://www.cs.cornell.edu/home/kleinber/auth.pdf
- Barabasi's home page: http://www.nd.edu/~networks/
- The diameter of the WWW http://www.nd.edu/~networks/Papers/401130A0.pdf
- Emergence of Scaling in the WWW http://www.nd.edu/~networks/Papers/science.pdf
- The topology of the WWW http://www.nd.edu/~networks/Papers/proceeding.pdf
- The bow tie model of the Web http://www.almaden.ibm.com/almaden/webmap_press.html
- Social Networks: http://www.chass.utoronto.ca/~wellman/
- Clustering in large graphs and matrices
(Drineas et al. Proc. Symp. Discr. Alg, SIAM, 1999)
- J. Kleinberg, C. Papadimitriou, P. Raghavan. Segmentation
problems: A micro-economic view of data mining. Proc. 30th ACM Symposium
on Theory of Computing, 1998.
- Silk
from a sow's ear: Extracting usable structures from the Web, P. Pirolli,
J. Pitkow, and R. Rao, Proc. ACM SIGCHI, 1996.
- How Popular is Your Paper? An
Empirical Study of the Citation Distribution, S. Redner, Eur. Phys. Jour.
B 4, 131-134 (1998).
- The Campfire project
(bipartite cores to identify communities on the WWW)
Bipartite cores for modelling web communities
Extracting large scale knowledge bases from the Web (bipartite cores), VLDB 1999
- IBM Clever project:
http://www.almaden.ibm.com/cs/k53/clever.html
- Mining the
Link Structure of the World Wide Web (1999) Soumen Chakrabarti, Byron
E. Dom, David Gibson, Jon Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar
Rajagopalan, Andrew Tomkins, IEEE Computer
- Graph connectivity quick reference
- Citation graph
- Chen, C. (1999)
Visualising Semantic
Spaces and Author Co-Citation Networks in Digital Libraries. Information
Processing & Management, 35(3), 401-420.
- Eugene Garfield's home
page.
- A. Borodin, G.O. Roberts, J.S. Rosenthal, and P. Tsaparas, Finding
Authorities and Hubs From Link Structures on the World Wide Web. (WWW10,
to appear. See published version.)
- Moses Charikar, Greedy approximation algorithms
for finding dense components in a graph, In Proc. Third International
Workshop Approximation Algorithms for Combinatorial Optimization, APPROX 2000,
Klaus Jansen, Samir Khuller (Eds.), LNCS 1913, 84-95.
- Monika Henzinger
publications sigir98 www8
- An Atlas of Cyberspaces:
Surf maps, visualizing browsing behaviour
- Corinna Cortes, Daryl Pregibon, and Chris T. Volinsky, Communities
of Interest (2001), Proceedings of IDA 2001 - Intelligent Data
Analysis
- Self-similarity in the
web Stephen Dill Ravi Kumar Kevin McCurley Sridhar... VLDB 2001
- Cybergeography Research
Dodge and Kitchin Mapping Cyberspace
Routledge, Oct. 2000
- Locating Information with
Uncertainty in Fully Interconnected Networks with Applications to World Wide
Web Information Retrieval (Kirousis, Kranakis et al).
- Self-Organization and Identification
of Web Communities Gary Flake, Steve Lawrence, C. Lee Giles, Frans Coetzee
- Peer
influence groups: identifying dense clusters in large networks James Moody,
Social Networks 23 (2001) pp. 261-283
- Coevolution and self-organization in dynamical
networks (COSIN European consortium)
6. SPECIAL ISSUE ON INTELLIGENT INTERNET SYSTEMS, ARTIFICIAL INTELLIGENCE JOURNAL 118(1-2)
- Lesser, Victor, Horling, Bryan, Klassner, Frank, Raja, Anita, Wagner, Thomas,
and Zhang, Shelley.
BIG: An Agent for Resource-Bounded Information Gathering and Decision Making.
http://mas.cs.umass.edu/publications.shtml
- Kushmerick, N. Wrapper induction: Efficiency and expressiveness.
http://www.cs.ucd.ie/staff/nick/home/research/pubs.html
- W. Cohen WHIRL: A Word-based Information Representation Language, a journal-length
overview paper on WHIRL. (A shorter version is also available.) http://whirl.research.att.com/
- M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam,
and S. Slattery Learning to Construct Knowledge Bases from the World Wide
Web
http://www.ri.cmu.edu/people/mitchell_tom.html#publications
7. INTELLIGENT WEB AGENTS TALKS
- WebACE
- Latent Semantic Indexing
- Kozima: Context Sensitive Measure of Word
Distance orig.paper
- WebToKB
(McCallum et al)
8. ONTOLOGY/HIERARCHY LEARNING/USING
- Ontology Learning ECAI-2000 Workshop -- http://ol2000.aifb.uni-karlsruhe.de/
- Enriching very large ontologies using the WWW. E. Agirre, O. Ansa,
E. Hovy, D. Martinez
- Designing Clustering Methods for Ontology Building - The Mo'K Workbench.
G.Bisson, C. Nedellec and D. Canamero.
- International
Journal on Digital Libraries ISSN: 1432-5012 Index Volume 3 Number 3 October
2000
- Declarative
Specification of Z39.50 Wrappers using Description Logics, Yannis Velegrakis,
Vassilis Christophides, Panos Constantopoulos
- Text-Based Approaches
for the Categorization of Images (1999) Carl
L. Sable and Vasileios Hatzivassiloglou
- Ayad & Kamel: Topic
Discovery from Text using Aggregation of different Clustering Methods,
Canadian AI Conference, 2002.
- I. Varlamis, M.
Vazirgiannis, M. Halkidi,
B. Nguyen. «THESUS: Effective
Thematic Selection And Organization Of Web Document Collections Based On Link
Semantics», to appear in the IEEE Transactions on Knowledge and Data Engineering,
2003.
Data Bases and Knowledge Discovery
group @ AUEB
- Ontology
Matching
9. AI-SPECIFIC SOFTWARE RESOURCES
(partly from Ali&McRoy "Java Resource for Artificial Intelligence",
intelligence, SIGART ACM, 11(2), Summer 2000)
General Java Resources
Sun's Java Website | http://java.sun.com
Gamelan, repository of Java tools | http://gamelan.earthweb.com
Java Programmer's FAQ | http://www.afu.com/javafaq.html
Links to AI-specific Java Resources
Jess, rule-based system similar to CLIPS | http://herzberg.ca.sandia.gov/jess/
Weka, collection of machine learning algorithms | http://www.cs.waikato.ac.nz/ml/weka
Genetic Programming, S. Luke's ECJ and A. Qureshi's gpsys | http://www.cs.umd.edu/projects/plus/ec/ecj , http://www.cs.ucl.ac.uk/staff/A.Qureshi/gpsys_doc.html
JavaBayes: Bayesian networks | http://www.cs.cmu.edu/~javabayes/
Neural networks: jaNet package | http://www.hta-bi.bfh.ch/Projects/janet/
YAG: natural language generator | http://tigger.cs.uwm.edu/~nlkrrg/
NGram Statistics package | http://www.d.umn.edu/~tpederse/nsp.html
AgentBuilder's survey of agent construction tools | http://www.agentbuilder.com/AgentTools/
GATE (General Architecture for Text Engineering) | http://gate.ac.uk/
Protege Ontology Editor | http://protege.stanford.edu/
The KIM Platform for Knowledge & Information Management | http://www.sirma.bg/OntoText/KIM/
Jakarta Lucene: a high-performance, full-featured text search engine written entirely in Java | http://jakarta.apache.org/
Torch: a machine-learning library, written in simple C++ | http://www.torch.ch/
SVMlight: implementation of Support Vector Machines (SVMs) in C | http://svmlight.joachims.org/
OSU SVM Classifier Matlab Toolbox | http://www.ece.osu.edu/~maj/osu_svm/
SOM Matlab Toolbox | http://www.cis.hut.fi/projects/somtoolbox/
Pattern Recognition Matlab Toolbox | http://neural.cs.nthu.edu.tw/jang/matlab/toolbox/DCPR/
Matlab toolboxes | http://www.tech.plym.ac.uk/spmc/matlab/matlab_toolbox.html
BioNLP Resources | http://www.tufts.edu/~amorga02/bcresources.html
OntoParser, an XML2RDF translator for OntoBuilder ontologies | http://ie.technion.ac.il/OntoBuilder
  Ontologies are available under "Ontologies downloads,"
  partitioned into 14 domains. For the OntoParser, go to "OntoBuilder downloads"
  and follow the link to "OntoParser: an XML2RDF translator of OntoBuilder
  ontologies." The zip file contains a user manual with all installation information.
FIHC, Frequent Itemset-based Hierarchical Clustering | http://www.cs.sfu.ca/~ddm
eprints software for online publishing | http://www.eprints.org/
htdig, a search engine | http://www.htdig.org/
lemur, language modelling/IR | http://www.lemurproject.org/
lucene apache: a search engine in Java | http://lucene.apache.org/java/docs/
Tools for the Reuters collection | http://www.lins.fju.edu.tw/~tseng/Collections/Reuters-21578.html
OpenNLP: large collection of open NLP tools | http://opennlp.sourceforge.net/projects.html
FrameNet - on-line lexical resource for English, based on frame semantics | http://framenet.icsi.berkeley.edu/
VerbNet - a lexical resource on verbs | http://en.wikipedia.org/wiki/VerbNet
LIBSVM -- Library for Support Vector Machines | http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Maximum Entropy / Logistic Regression | http://www.cs.utah.edu/~hal/megam/
  Patrick Haffner: Scaling large margin classifiers for spoken language
  understanding, Speech Communication 48 (2006) 239–261
MEAD -- multi-document summarization system (Dragomir Radev) | http://www.summarization.com/mead/
JUNG -- Java Universal Network/Graph Framework | http://jung.sourceforge.net/
  A software library that provides a common and extendible
  language for the modeling, analysis, and visualization of data that can
  be represented as a graph or network.
K. Murphy: Bayes Network Toolbox for Matlab
Datasets
UCI KDD Archive
Reuters-21578
Reuters Corpus (RCV1
and RCV2)
Wikipedia XML corpus
10. MISC
- LEDA: A C++ library of the
data types and algorithms of combinatorial computing Book
Manual
- R, a free (GPL) version of statistical
software S/S-plus (includes multidimensional
scaling)
- Book: Social
Dimensions of Information Technology
Click on "book excerpt" to read Chapter 1, Virtual
Communities and Social Capital by Blanchard and Horan
Sample copies of the Social
Science Computer Review from jsamples@sagepub.com
- Cluster
for Intelligent Mobile Agents for Telecommunication Environments — CLIMATE
- How to get (and keep)
an NSERC research grant (ps)
other resources
- Internet
Application Workbook by Philip Greenspun
- GNU's not Unix
- Social
Scientists: Managing Identity in Socio-Technical Networks (HICSS2002)
Roberta Lamb and Elizabeth Davidson
- Discrete Algorithms and Data Structures
software
- K. Murphy: Bayes
Network Toolbox for Matlab
- K. Murphy: HMM
toolbox
- Cawley, G. C. Matlab
Support Vector Machine Toolbox
- Boost graph
library (an open source alternative to LEDA for graph algorithms)
- Experimental Design 1 2 3 4
11. Agent-based economics
- ASPEN Microsimulation Economics
Model
- Microsimulation
- Mike
Wellman's Market-oriented programming
- J. Kephart's Dynamic Pricing by software agents html
pdf
- M. Huhns home page (online
auctions, agents)
12. Web-Information Filtering Lab
- Context aware retrieval links:
http://www.sims.berkeley.edu/~hearst/papers/data-engineering/
http://www.dcs.ex.ac.uk/~pjbrown/papers/ir.html
http://www.research.microsoft.com/research/db/debull/A00sept/issue.htm
- The Effect of Linking
on Genres of Web documents (Crowston and Williams)
- XML Schema Formal Description
- IJCAI 2001 Workshop on Intelligent Techniques
for Web Personalization (WEB-2)
- Clustering navigation patterns on a website
using a sequence alignment method. Birgit Hay, Geert Wets and Koen
Vanhoof, Limburg University, Belgium
- Modeling users navigation history
(IJCAI 2001 Workshop on Intelligent Techniques for Web Personalization).
Ernesto Damiani, Barbara Oliboni, Elisa Quintarelli and Letizia Tanca,
Universita degli Studi di Milano, Italy
- Improving the effectiveness of collaborative
filtering on anonymous web usage data Bamshad Mobasher, Honghua Dai,
Tao Luo, Miki Nakagawa, School of Computer Science, Telecommunication,
and Information Systems, DePaul University, Chicago, Illinois, USA
- Web site personalizers for mobile devices
Corin R. Anderson, Pedro Domingos, Daniel S. Weld, University of Washington,
Seattle, WA, USA
- C.J. van Rijsbergen, INFORMATION
RETRIEVAL Second Edition (on-line text)
- Information
Retrieval Links (including SMART).
- Biovista.com
ACM SIGIR Information
Retrieval Resources
TREC
Main TREC Web site
Web research collections --- Paper
on the wt10g collection (TREC 2001)
TREC 2002
Web Track guidelines
Overview
of the TREC 2001 Web Track competition (ps.gz) ---
Overview of the TREC
2000 Web Track competition
Descriptions of the 2001 contributions for each of the two tasks (adhoc
and entry page) described in the overview.
Web adhoc results (2001) --- topics
Web entry-page results (2001) --- topics
Agreement to use the data sets on hermes.cs.dal.ca
13. Machine learning and information extraction
- Ghahramani,
Z. (2001) An Introduction to Hidden
Markov Models and Bayesian Networks International Journal of Pattern Recognition
and Artificial Intelligence 15(1):9-42.
- Jeff Bilmes A Gentle Tutorial of the
EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture
and Hidden Markov Models (1998).
- Learning regular
languages from simple positive examples
- Learning DFA
from Simple Examples
- Efficient algorithms for
the inference of minimum size DFAs
- Hierarchical Wrapper
Induction for Semistructured Information Sources Ion Muslea, Steve Minton,
Craig Knoblock. Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114,
2001
- Line Eikvil Information
Extraction from World Wide Web - A Survey
- IJCAI-2001 Workshop on Adaptive Text
Extraction and Mining (ML-1)
- William W. Cohen & Lee S. Jensen: A
structured wrapper induction system for extracting information from semi-structured
documents
- Lee S. Jensen & William W. Cohen: Grouping
extracted fields
- Craig A. Knoblock, Kristina Lerman, Steven Minton & Ion Muslea: A
machine-learning approach to accurately and reliably extracting data from
the Web
- Kristina Lerman, Craig Knoblock & Steven Minton: Automatic
data extraction from lists and tables in Web sources
- David Pierce & Claire Cardie: User-oriented
machine learning strategies for information extraction: Putting the human
back in the loop
- IJCAI-2001 Workshop on Text Learning:
Beyond Supervision (ML-3)
- Selective Sampling + Semi-supervised
Learning = Robust Multi-View Learning Ion Muslea, Steven Minton, and
Craig A. Knoblock
- Detection of errors in training data
by using a decision list and Adaboost Hiroyuki Shinnou
- Ontology-based Text Clustering
A. Hotho, S. Staab, and A. Maedche
- Probabilistic Models of Text and Link
Structure for Hypertext Classification Lise Getoor, Eran Segal, Ben
Taskar, and Daphne Koller
- WebDB
2000 Proceedings
- Kamal Nigam. Using
Unlabeled Data to Improve Text Classification. Doctoral Dissertation,
Computer Science Department, Carnegie Mellon University. Technical Report
CMU-CS-01-126. 2001 (ML paper)
Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell.
Text Classification from Labeled and Unlabeled Documents using EM. Machine
Learning, 39(2/3). pp. 103-134. 2000
Kamal Nigam and Rayid Ghani. Analyzing
the Effectiveness and Applicability of Co-training. In Ninth International
Conference on Information and Knowledge Management (CIKM-2000), pp. 86-93.
2000
- David Cohn, Les
Atlas and Richard Ladner. (1994) Improving
generalization with active learning, Machine Learning 15(2):201-221.
- D. Freitag. Information Extraction
from HTML: Application of a General Machine Learning Approach, AAAI/IAAI
1998
- Fabrizio Sebastiani, Machine learning
in automated text categorization, ACM Computing Surveys, 2002
http://faure.iei.pi.cnr.it/~fabrizio/Publications/ACMCS02.pdf
- Rosario, B., and Hearst, M., Classifying
the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy,
in the Proceedings
of Empirical Methods in Natural Language Processing EMNLP '01, Pittsburgh,
PA, June 2001. (From the BAILANDO,
or "Better Access to Information using Language Analysis and New Displays
and Organizations", publication list).
- Support Vector Machine resources
-- Kernel machine resources
Tutorial on Support Vector Machines
and Kernel Methods Presented at ICML-2001 by Nello Cristianini
- C. J. C. Burges. A Tutorial on Support Vector
Machines for Pattern Recognition. Data Mining and Knowledge Discovery,
2(2), 1998.
A. J. Smola and B. Schölkopf. A
Tutorial on Support Vector Regression. NeuroCOLT Technical Report NC-TR-98-030,
Royal Holloway College, University of London, UK, 1998.
- Peter Cheeseman, John Stutz, Bayesian
Classification (AutoClass): Theory and Results (1996), Advances in Knowledge
Discovery and Data Mining
- Hinrich Schütze, Craig Silverstein Projections
for Efficient Document Clustering (1997)
- D. Heckerman. A tutorial on
learning with Bayesian Networks. Microsoft Research TR, 1996
ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.ps , ftp://ftp.research.microsoft.com/pub/dtg/david/tutorial.ps
- Henry Lieberman, Bonnie A. Nardi, David Wright Training
Agents to Recognize Text by Example, ACM Conference on Autonomous Agents
[Agents-99], Seattle, 1-5 May 1999
- Joachims. Text Categorization with
Support Vector Machines. TR VIII-23, U. of Dortmund, 1997.
- An Evaluation of Statistical Approaches
to Text Categorization (1997) Yiming Yang
- Active Learning for Natural Language
Parsing and Information Extraction Cynthia A. Thompson, Mary Elaine Califf,
and Raymond J. Mooney, Proceedings of the Sixteenth International Machine
Learning Conference (ICML-99) , Bled, Slovenia, pp. 406-414, June 1999 (ps)
- Relational Learning of Pattern-Match
Rules for Information Extraction Mary Elaine Califf and Raymond J. Mooney,
Proceedings of the Sixteenth National Conference on Artificial Intelligence
(AAAI-99), Orlando, FL, pp. 328-334, July, 1999 (ps)
- A Comparison
of Document Clustering Techniques Michael Steinbach, George Karypis, Vipin
Kumar, KDD Workshop on Text Mining, 2000.
- On
the merits of building categorization systems by supervised clustering,
Charu C. Aggarwal, Stephen C. Gates, Philip S. Yu, Proceedings of the fifth
ACM SIGKDD international conference on Knowledge discovery and data mining,
1999, San Diego, California.
- Manolis Koubarakis's work on Boolean queries with proximity operators:
Manolis Koubarakis, Theodoros Koutris, Paraskevi Raftopoulou, and Christos
Tryfonopoulos: Efficient dissemination
of textual information using the Boolean model, 2nd Hellenic Conference
on Artificial Intelligence, April 11-12, 2002, Thessaloniki, Greece.
Manolis Koubarakis: Boolean queries with
proximity operators for information dissemination,
International Workshop on Foundations of Models for Information Integration
(FMII-2001), the 10th workshop in the series Foundations of Models and Languages
for Data and Objects (FMLDO), Viterbo (near Rome), Italy, 16-18 September 2001
(immediately after VLDB-2001).
- Text clustering (from Biao Chen)
1. Web Document Clustering:
A Feasibility Demonstration (1998) Oren Zamir, Oren Etzioni
2. Fast and Intuitive
Clustering of Web Documents (1997) Oren Zamir, Oren Etzioni, Omid Madani,
Richard M. Karp, Department of Computer...
3. Scalable
Techniques for Clustering the Web
4. A Min-max Cut Algorithm
for Graph Partitioning and Data Clustering, Chris Ding, Xiaofeng He, Hongyuan
Zha, Ming Gu and Horst Simon. Proc. 1st IEEE Int'l Conf. Data Mining. San
Jose, CA, 2001. pp. 107-114.
5. Automatic Topic Identification
Using Webpage Clustering, Xiaofeng He, Chris H.Q. Ding, Hongyuan Zha, Horst
D. Simon. Proc. 1st IEEE Int'l Conf. Data Mining. San Jose, CA, 2001. pp. 195-202.
6. Co-clustering documents
and words using Bipartite Spectral Graph Partitioning
- Statistics reference (HyperStat
Online)
- Clustering
overview by Schütze
- Information
Extraction in Biology
- Clustering
software for gene expression profiles, XCluster
14. Web information retrieval
1. The Web IR and IE collection http://www.haifa.il.ibm.com/webir/
Of particular interest are "Selected Publications" and "PhD/MSc
related work"
2. Intl. Workshop on Web Document Analysis - WDA2001 http://www.csc.liv.ac.uk/~wda2001/
15. Data sets
1. RCV1-v2
Text Categorization Test Collection (Reuters). Appendix to:
Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A
New Benchmark Collection for Text Categorization Research. Journal of Machine
Learning Research, 5:361-397, 2004.
2. ArXiv is an e-print service in the fields
of physics, mathematics, non-linear science, computer science, and quantitative
biology.
16. Statistical Machine Learning
1. PASCAL network of excellence
- Pattern Analysis, Statistical modelling and ComputAtional Learning (incl.
video lectures)
2. Machine Learning Summer schools
(Material) Robotics references
1. Particle swarms
2. Sensor magazine SensorPortal
3. CVOnline, a Computer Vision Encyclopedia
4. Ballard and Brown's Computer
Vision text
Related Industry
Halifax-Atlantic Canada
IT Interactive Services
(ITIS) (Halifax, Web applications)
Genieknows
Coemergence (Halifax, Business
Knowledge Management specializing in the mining sector)
Kanayo (Halifax, Peer-to-peer engine
for medical information cataloguing and dissemination)
Skywire Software (Moncton,
legal document generation and transformation)
Ontario-Quebec
Palomino (Toronto, Web site creation
and maintenance tool)
Sysomos (Toronto, Blog
text mining, Koudas & Bansal)
OpenText (Waterloo,
maker of Livelink, collaboration and knowledge management software for
enterprises)
Hummingbird (Toronto, Enterprise
Content Management)
Techne Knowledge Systems, Inc. (Toronto,
Business Knowledge Management)
Pattern Discovery Software Systems
(Waterloo, data mining, spinoff of PAMI lab)
Protana (formerly MDS Proteomics Inc. - MDSP)
(Toronto, a drug discovery company)
Bell University Laboratories
(Toronto)
Branddimensions
(Toronto, Buzz analysis)
Nstein (Montreal) - Multilingual
information management
Lingua
Technologies (Montreal) - Translation, Text Mining (Precarn member)
Language Industry Association (Quebec-Canada)
- trade association in multilingual text information management
Nomino Technologies
(Montreal) - natural language processing in e-customer-service (through web
sites)
Western Canada
Axonwave Software Inc. (Vancouver,
SFU spinoff, F. Popowich)
Business Objects
(Vancouver, Business Intelligence)
USA
Entopia Knowledge Builder
(bottom-up Knowledge Management)
Mohomine (document classification)
Wherewithal (knowledge management
from the intranet portal)
Autonomy (corporate document knowledge
management) white papers,
case studies (Autonomy bought Verity, maker of K2
Enterprise, Knowledge Management)
Entrieva (term
extraction, document classification, taxonomy building (manual))
Stratify (text classification,
taxonomy management)
Applied Semantics (acquired
by Google, ontology-based software)
Systems Research and Development
(non-obvious relationship awareness)
Dynago (document organization and summarization,
metasearch engine. Check out DART)
Eurekster (collaborative Web
searches)
Interwoven Inc. (enterprise
content management)
Inxight Software Inc. (unstructured
data management)
Vivisimo (document clustering engine)
Meaning Master (search engine technology) used by Eurekah
Biosciences DB
WebBrain (conceptual organization
of web spaces, visual interface for browsing the ODP directory)
Recommind (search, categorization
and taxonomy generation from text - Hofmann
- Probabilistic LSA)
Language Weaver (statistical
machine translation)
Zoominfo (summarization of web
content about people or companies)
Semagix
(enterprise content exploitation)
Business Objects
(business intelligence, bought by SAP)
Insightful Corp. (S-Plus
statistical software, data mining)
Burning Glass (resume
processing and matching with job requirements)
MarkLogic (Web content
management server, some text mining - see customer demos for a rich set of applications)
Europe
BOC Information Technologies Consulting
(Vienna, Business Process Management)
Mint
Business Solutions (UK) Mint
MCI Document Management
Autonomy Corp PLC (UK)
Enterprise portals/search, clustering
Unicorn Solutions Inc. (Israel)
data semantics
Clearforest (Israel/US) text
analytics
Atypon (Greece) Text data mining
ScienceLine
Biovista (Greece) Discovery algorithmics
in Biotech (AI, NLP)
Velti (Greece) Enterprise content
management / portals.
Ontotext (Bulgaria) Semantic annotation,
indexing, retrieval
Neurosoft (Greece) NLP, Lexicon
of modern Greek
ANCO (Greece) Educational applications,
telecommunications
@Semantics (Italy)
Enterprise Information Integration
Teezir (Netherlands) Enterprise
Search
Collexis (Netherlands)
Expert profiling, expert social networks
Dialogos: speech communication systems
(Greece) Speech recognition systems - represents Nuance.com
in Greece / partly owned by Intracom
IT Services
Australia
Mind Systems (Topic
Mapping, Personal Information Management)
Deep Web
Quigo Technologies' Intellisonar
Deep Web Technologies' Distributed
Explorit
EduMed (based on Multimedia
DBMS, VDMS)
Industry collaboration reference
Sample agreements: AUTM ->
Agreements -> Sample Agreements
SR&ED Tax Credits
Canadian Research Transfer Network
(CRTN)
OTHER
The secret
of how Microsoft stays on top