Mind the Gap: Large-scale Frequent Sequence Mining

Miliaraki, Iris; Berberich, Klaus; Gemulla, Rainer; Zoupanos, Spyros

doi:10.1145/2463676.2465285


Miliaraki, I., Berberich, K., Gemulla, R., & Zoupanos, S. (2013). Mind the Gap: Large-scale Frequent Sequence Mining. In K. Ross, D. Srivastava, D. Papadias, & S. Papadopoulos (Eds.), SIGMOD'13 (pp. 797-808). New York, NY: ACM. doi:10.1145/2463676.2465285.

Item is Released


Basic data


Record permalink: https://hdl.handle.net/11858/00-001M-0000-0015-1D76-9
Version permalink: https://hdl.handle.net/11858/00-001M-0000-0024-BF85-A

Genre: Conference paper


Creators:
Miliaraki, Iris¹, Author
Berberich, Klaus¹, Author
Gemulla, Rainer¹, Author
Zoupanos, Spyros¹, Author

Affiliations:
1 Databases and Information Systems, MPI for Informatics, Max Planck Society, ou_24018

Content


Keywords: -

Abstract: Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this paper, we propose PFSM, a scalable algorithm for frequent sequence mining on MapReduce. PFSM can handle so-called "gap constraints", which can be used to limit the output to a controlled set of frequent sequences. At its heart, PFSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of w-equivalency, which is a generalization of the notion of a "projected database" used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our extensive experimental study in the context of text mining suggests that PFSM is significantly more efficient and scalable than alternative approaches.
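To make the idea of a gap constraint concrete, here is a minimal single-machine sketch. It is a naive baseline for illustration only: it is not PFSM, it does not use MapReduce or partitioning, and it is not taken from the paper. It assumes that a gap constraint means "at most `max_gap` other items may lie between two consecutive matched items" and that support is the number of input sequences containing a pattern. The names `gap_subsequences`, `mine`, `min_support`, `max_gap`, and `max_len` are hypothetical.

```python
from collections import Counter


def gap_subsequences(seq, max_gap, max_len):
    """Return the set of subsequences of `seq` with length 2..max_len in which
    consecutive matched items are separated by at most `max_gap` other items.
    (Assumed reading of a gap constraint; names are illustrative.)"""
    found = set()

    def extend(prefix, last_pos):
        if len(prefix) >= 2:
            found.add(prefix)
        if len(prefix) == max_len:
            return
        # The next item may sit at most max_gap + 1 positions after the last one.
        stop = min(len(seq), last_pos + max_gap + 2)
        for pos in range(last_pos + 1, stop):
            extend(prefix + (seq[pos],), pos)

    for start in range(len(seq)):
        extend((seq[start],), start)
    return found


def mine(database, min_support, max_gap, max_len):
    """Count each pattern once per input sequence and keep the frequent ones."""
    counts = Counter()
    for seq in database:
        counts.update(gap_subsequences(seq, max_gap, max_len))
    return {pat: n for pat, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    db = [("a", "b", "c"), ("a", "x", "b"), ("a", "x", "y", "b")]
    # Allowing one intervening item keeps more patterns than allowing none.
    print(mine(db, min_support=2, max_gap=1, max_len=2))
    print(mine(db, min_support=2, max_gap=0, max_len=2))
```

On this toy input, tightening `max_gap` from 1 to 0 shrinks the frequent set from three patterns to one, which is the sense in which a gap constraint limits the output. The enumeration is exponential in `max_len` and is meant only to show the semantics, not a scalable method.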

Details


Language(s): eng - English

Date: Published online: 2013-06-22; Published: 2013

Publication status: Published

Pages: -

Place, publisher, edition: -

Table of contents: -

Type of review: -

Identifiers: BibTex Citekey: Miliaraki2013
Other: Local-ID: 086027E8ABA46DC6C1257B0F003D8C96-Miliaraki2013
DOI: 10.1145/2463676.2465285

Degree type: -

Event


Title: ACM SIGMOD International Conference on Management of Data

Venue: New York, NY, USA

Start / end date: 2013-06-22 - 2013-06-27


Source 1


Title: SIGMOD'13

Subtitle: International Conference on Management of Data

Short title: SIGMOD 2013

Source genre: Conference proceedings

Creators:
Ross, Kenneth¹, Editor
Srivastava, Divesh¹, Editor
Papadias, Dimitris¹, Editor
Papadopoulos, Stavros¹, Editor

Affiliations:
1 External Organizations, ou_persistent22

Place, publisher, edition: New York, NY : ACM

Pages: -
Volume / Issue: -
Article number: -
Start / end page: 797 - 808
Identifier: ISBN: 978-1-4503-2037-5
