Spring 2004
MATH710 and CMSC691
Introduction to Computational Information Retrieval
Large collections of text documents are now increasingly common and
available. Mining such data sets is a major contemporary challenge.
An approach to the problem is to transform the set of
documents into vectors in a finite dimensional Euclidean
space and to deal with vectors rather than texts.
The course will focus on vector space models,
and linear algebra and clustering techniques for handling
large data sets with limited resources (e.g., memory and
CPU cycles).
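As a rough illustration of the vector space idea described above, the sketch below builds a small term-by-document matrix from raw term counts and compares documents by the cosine of the angle between their vectors. The function names, the whitespace tokenizer, and the three sample documents are assumptions made for this example only; they are not part of the course material, and the term weighting schemes covered in the course would replace the raw counts used here.

```python
import math
from collections import Counter


def build_term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term i in document j."""
    # Assumption: lowercase and split on whitespace; real collections
    # need proper tokenization, stop-word removal, and stemming.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


def document_vector(matrix, j):
    """Extract column j of the matrix, i.e. the vector for document j."""
    return [row[j] for row in matrix]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample collection, for illustration only.
docs = ["data mining text", "text mining", "linear algebra"]
vocab, tdm = build_term_document_matrix(docs)
sim_01 = cosine_similarity(document_vector(tdm, 0), document_vector(tdm, 1))
sim_02 = cosine_similarity(document_vector(tdm, 0), document_vector(tdm, 2))
```

Here the first two documents share two terms and so have a high similarity, while the first and third share none and are orthogonal. Even in this tiny example most matrix entries are zero, which is why sparse storage matters for large collections.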
Potential topics to be covered
- Vector Space Models
  - Term by Document Matrices
  - Term Weighting
  - Polysemy and Synonymy
  - Automatic Ambiguity Resolution
- Linear Algebra Techniques and Applications
  - Singular Value Decomposition (SVD)
  - Semidiscrete Decomposition (SDD)
  - QR Factorization
  - Lanczos Method
  - Latent Semantic Indexing (LSI)
- Clustering Techniques
  - Divisive and Agglomerative Methods
  - k-Means Algorithm
  - Principal Direction Divisive Partitioning
  - BIRCH
  - EM Clustering
  - Clustering by means of the Wavelet Transform
Students will participate in a class project involving both the creation and
management of a large document collection on the WWW. This project will
require programming in languages such as Perl/CGI, C/C++, or Java.
Have questions? Interested?
Drop me an e-mail at kogan@math.umbc.edu