Help Privacy Policy Disclaimer
  Advanced SearchBrowse





A theoretical and experimental study on the construction of suffix arrays in external memory


Crauser,  Andreas
Algorithms and Complexity, MPI for Informatics, Max Planck Society;


Ferragina,  Paolo
Algorithms and Complexity, MPI for Informatics, Max Planck Society;

External Resource
No external resources are shared
Fulltext (restricted access)
There are currently no full texts shared for your IP range.
Fulltext (public)

(Any fulltext), 458KB

Supplementary Material (public)
There is no public supplementary material available

Crauser, A., & Ferragina, P.(1999). A theoretical and experimental study on the construction of suffix arrays in external memory (MPI-I-1999-1-001). Saarbrücken: Max-Planck-Institut für Informatik.

Cite as: https://hdl.handle.net/11858/00-001M-0000-0014-6F9B-2
The construction of full-text indexes on very large text collections is nowadays a hot problem. The suffix array [Manber-Myers,~1993] is one of the most attractive full-text indexing data structures due to its simplicity, space efficiency and powerful/fast search operations supported. In this paper we analyze, both theoretically and experimentally, the I/O-complexity and the working space of six algorithms for constructing large suffix arrays. Three of them are the state-of-the-art, the other three algorithms are our new proposals. We perform a set of experiments based on three different data sets (English texts, Amino-acid sequences and random texts) and give a precise hierarchy of these algorithms according to their working-space vs. construction-time tradeoff. Given the current trends in model design~\cite{Farach-et-al,Vitter} and disk technology~\cite{dahlin,Ruemmler-Wilkes}, we will pose particular attention to differentiate between ``random'' and ``contiguous'' disk accesses, in order to reasonably explain some practical I/O-phenomena which are related to the experimental behavior of these algorithms and that would be otherwise meaningless in the light of other simpler external-memory models. At the best of our knowledge, this is the first study which provides a wide spectrum of possible approaches to the construction of suffix arrays in external memory, and thus it should be helpful to anyone who is interested in building full-text indexes on very large text collections. Finally, we conclude our paper by addressing two other issues. The former concerns with the problem of building word-indexes; we show that our results can be successfully applied to this case too, without any loss in efficiency and without compromising the simplicity of programming so to achieve a uniform, simple and efficient approach to both the two indexing models. The latter issue is related to the intriguing and apparently counterintuitive ``contradiction'' between the effective practical performance of the well-known BaezaYates-Gonnet-Snider's algorithm~\cite{book-info}, verified in our experiments, and its unappealing (i.e., cubic) worst-case behavior. We devise a new external-memory algorithm that follows the basic philosophy underlying that algorithm but in a significantly different manner, thus resulting in a novel approach which combines good worst-case bounds with efficient practical performance.