Free keywords: -
Abstract:
The construction of full-text indexes on very large text collections
is nowadays a pressing problem. The suffix array [Manber-Myers,~1993] is
one of the most attractive full-text indexing data structures because of
its simplicity, its space efficiency, and the powerful and fast search
operations it supports. In this paper we analyze, both theoretically and
experimentally, the I/O complexity and the working space of six
algorithms for constructing large suffix arrays: three of them represent
the state of the art, while the other three are our new proposals.
proposals. We perform a set of experiments based on three different
data sets (English texts, Amino-acid sequences and random texts) and
give a precise hierarchy of these algorithms according to their
working-space vs. construction-time tradeoff. Given the current
trends in model design~\cite{Farach-et-al,Vitter} and disk
technology~\cite{dahlin,Ruemmler-Wilkes}, we will pose particular
attention to differentiate between ``random'' and ``contiguous''
disk accesses, in order to reasonably explain some practical
I/O-phenomena which are related to the experimental behavior of
these algorithms and that would be otherwise meaningless in the
light of other simpler external-memory models.
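
Purely as an illustration (this sketch is not part of the original paper and is not one of the six external-memory algorithms it analyzes), the following minimal in-memory Python fragment shows what a suffix array is: the suffix start positions of a text sorted lexicographically, queried by binary search. The names build_suffix_array and count_occurrences are hypothetical.

def build_suffix_array(text):
    """Naive in-memory construction: sort suffix start positions
    by the lexicographic order of the corresponding suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count occurrences of `pattern` by binary searching the suffix
    array `sa` for the range of suffixes starting with `pattern`."""
    m = len(pattern)

    # Leftmost suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # Leftmost suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - left


if __name__ == "__main__":
    t = "banana"
    sa = build_suffix_array(t)
    print(sa)                              # [5, 3, 1, 0, 4, 2]
    print(count_occurrences(t, sa, "an"))  # 2

This naive construction sorts the suffixes in main memory and is therefore feasible only when the whole text fits in RAM; handling texts that exceed main memory is precisely the setting addressed by the external-memory algorithms studied in the paper.
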
To the best of our knowledge, this is the first study providing
a wide spectrum of possible approaches to the construction of suffix
arrays in external memory, and it should therefore be helpful to anyone
interested in building full-text indexes on very large text
collections.
Finally, we conclude the paper by addressing two further issues. The
first concerns the problem of building word indexes: we show
that our results can be successfully applied to this case too,
without any loss in efficiency and without compromising the
simplicity of programming, thus achieving a uniform, simple and
efficient approach to both indexing models. The second issue
is related to the intriguing and apparently counterintuitive
``contradiction'' between the good practical performance of the
well-known Baeza-Yates-Gonnet-Snider algorithm~\cite{book-info},
confirmed by our experiments, and its unappealing (i.e., cubic)
worst-case behavior. We devise a new external-memory algorithm that
follows the basic philosophy underlying that algorithm, but in a
significantly different manner, resulting in a novel approach
that combines good worst-case bounds with efficient practical
performance.