Abstract:
In this work we present a study of different techniques for semantic indexing
by dimension reduction, with special emphasis on the LSI technique. Dimension
reduction is important in the Information Retrieval (IR) context to enable fast
retrieval and elimination of noisy data.
LSI attempts to improve IR quality by deriving a latent semantic space with
lower dimensionality, based on the co-occurrence of the terms in the documents
from the document collection. It is a heuristic method: although experiments
show that LSI often improves retrieval performance, its mathematical modeling
and rigorous theoretical justification remain deficient. Several
variants of the LSI technique have been proposed, which differ in the function
used for the mapping to the lower-dimensional space.
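
As a rough illustration of the latent-space idea described above, the following
minimal sketch computes a rank-k LSI projection with NumPy. The toy matrix, the
function names lsi_project and cosine, and the choice of mapping x -> U_k^T x
are assumptions made for this example; they are not taken from this work and
do not correspond to a specific variant studied in it.

```python
import numpy as np

def lsi_project(term_doc, k):
    """Project documents into a k-dimensional latent space (illustrative only)."""
    # Thin SVD of the terms x documents matrix: term_doc = U @ diag(s) @ Vt,
    # with singular values returned in descending order.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :k]  # top-k left singular vectors span the latent space
    # Assumed mapping function: x -> U_k^T x. LSI variants differ in this step.
    return U_k, U_k.T @ term_doc

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy collection: rows are terms, columns are documents d0..d3.
A = np.array([
    [1, 0, 0, 0],  # "car"
    [0, 1, 0, 0],  # "auto"
    [1, 1, 0, 0],  # "engine"
    [0, 0, 1, 1],  # "flower"
    [0, 0, 1, 0],  # "petal"
], dtype=float)

U_k, docs_k = lsi_project(A, k=2)

# d0 and d1 share only the term "engine".
sim_term_space = cosine(A[:, 0], A[:, 1])
sim_latent = cosine(docs_k[:, 0], docs_k[:, 1])
```

On this toy matrix, d0 and d1 have cosine similarity 0.5 in the original term
space and become essentially collinear in the 2-dimensional latent space,
because the co-occurrence of "car" and "auto" with "engine" places both
documents along the same latent direction.
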
Our comparative study combines mathematical tools, such as linear algebra,
with systematic experiments.
We present a theoretical analysis of the two main LSI variants found in the
literature - we call them Angle-stretching LSI and Angle-preserving LSI - and
we prove that the results of the two can, in principle, differ arbitrarily.
The experiments reveal interesting features of the LSI variants and the
differences in their behavior. In our experiments, the Angle-stretching LSI
performs consistently worse than the Angle-preserving LSI.