Free keywords: -
Abstract:
The construction of full-text indexes on very large text collections
is nowadays a pressing problem. The suffix array [Manber-Myers,~1993] is
one of the most attractive full-text indexing data structures because of
its simplicity, its space efficiency, and the powerful and fast search
operations it supports. In this paper we analyze, both theoretically and
experimentally, the I/O complexity and the working space of six
algorithms for constructing large suffix arrays: three of them represent
the state of the art, while the other three are our new proposals.
proposals. We perform a set of experiments based on three different
data sets (English texts, Amino-acid sequences and random texts) and
give a precise hierarchy of these algorithms according to their
working-space vs. construction-time tradeoff. Given the current
trends in model design~\cite{Farach-et-al,Vitter} and disk
technology~\cite{dahlin,Ruemmler-Wilkes}, we will pose particular
attention to differentiate between ``random'' and ``contiguous''
disk accesses, in order to reasonably explain some practical
I/O-phenomena which are related to the experimental behavior of
these algorithms and that would be otherwise meaningless in the
light of other simpler external-memory models.
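
Purely as an illustration (this sketch is not part of the original paper and is not one of the six external-memory algorithms it analyzes), the following minimal in-memory Python fragment shows what a suffix array is: the suffix start positions of a text sorted lexicographically, queried by binary search. The names build_suffix_array and count_occurrences are hypothetical.

def build_suffix_array(text):
    """Naive in-memory construction: sort suffix start positions
    by the lexicographic order of the corresponding suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count occurrences of `pattern` by binary searching the suffix
    array `sa` for the range of suffixes starting with `pattern`."""
    m = len(pattern)

    # Leftmost suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # Leftmost suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - left


if __name__ == "__main__":
    t = "banana"
    sa = build_suffix_array(t)
    print(sa)                              # [5, 3, 1, 0, 4, 2]
    print(count_occurrences(t, sa, "an"))  # 2

This naive construction sorts the suffixes in main memory and is therefore feasible only when the whole text fits in RAM; handling texts that exceed main memory is precisely the setting addressed by the external-memory algorithms studied in the paper.
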
To the best of our knowledge, this is the first study providing
a wide spectrum of possible approaches to the construction of suffix
arrays in external memory, and it should therefore be helpful to anyone
interested in building full-text indexes on very large text
collections.
Finally, we conclude the paper by addressing two further issues. The
first concerns the problem of building word indexes: we show
that our results can be successfully applied to this case too,
without any loss in efficiency and without compromising the
simplicity of programming, thus achieving a uniform, simple and
efficient approach to both indexing models. The second issue
is related to the intriguing and apparently counterintuitive
``contradiction'' between the good practical performance of the
well-known Baeza-Yates-Gonnet-Snider algorithm~\cite{book-info},
confirmed by our experiments, and its unappealing (i.e., cubic)
worst-case behavior. We devise a new external-memory algorithm that
follows the basic philosophy underlying that algorithm, but in a
significantly different manner, resulting in a novel approach
that combines good worst-case bounds with efficient practical
performance.