Faster text search with hybrid indexing

Auer, Eric

Released

Poster

MPG Authors

Auer, Eric
The Language Archive, MPI for Psycholinguistics, Max Planck Society;

Full texts (freely accessible)

poster-trova-lucene-clin-2013.pdf
(Preprint), 2MB

Citation

Auer, E. (2013). Faster text search with hybrid indexing. Poster presented at the 23rd Meeting of Computational Linguistics in the Netherlands (CLIN 2013), Enschede, The Netherlands.

Citation link: https://hdl.handle.net/11858/00-001M-0000-000E-780E-B

Abstract

Growing amounts of annotation data in The Language Archive (TLA) make it necessary to speed up search significantly to keep response times user-friendly. Unlike keyword-oriented web search engines, the Trova and CQL Search services at TLA allow searching for arbitrary exact substrings and (at lower speed) even regular expressions, not just whole words. To achieve search that is both fast and versatile, a combination of indexes is used. Word, substring and regular expression queries are analyzed, yielding information about substrings and other properties that must be present in a tier (or file) for that tier to contain a hit for the query at all. Those properties are then either hash-mapped to fixed-size bit vectors (fingerprints) for PostgreSQL-based filtering or expressed as sets of N-grams (up to a fixed length) for filtering with Lucene N-gram indexes. Both methods aim to quickly find a small list of candidate tiers that contains every tier which may contain hits, but not many more. As Lucene has no native support for substring search, our system uses a fast but accurate N-gram-based approximation. We present details of the implemented algorithm and elaborate on the improvements in response times achieved. We were able to speed up most steps (opening indexes, defining a search domain, gathering candidates, finding hits and collecting hit details), and a typical benchmark session now completes in a fraction of the time used by the already powerful previous implementation.
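
The two filtering ideas named in the abstract can be illustrated with a minimal in-memory sketch. This is not the Trova or CQL Search implementation: plain Python dictionaries stand in for PostgreSQL and Lucene, and the N-gram length (`N = 3`), the fingerprint width (`BITS = 256`), the use of CRC32 as the hash, and all function names are assumptions chosen for illustration only. The sketch covers plain substring queries; analysis of regular expressions is left out.

```python
import zlib

N = 3       # assumed N-gram length (hypothetical value, not from the poster)
BITS = 256  # assumed fingerprint width in bits (hypothetical value)


def ngrams(text, n=N):
    """Return the set of all character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def fingerprint(grams, bits=BITS):
    """Hash every n-gram to one bit position of a fixed-size bit vector."""
    fp = 0
    for gram in grams:
        fp |= 1 << (zlib.crc32(gram.encode("utf-8")) % bits)
    return fp


def build_index(tiers):
    """Precompute the n-gram set and the fingerprint of each tier."""
    index = {}
    for tier_id, text in tiers.items():
        grams = ngrams(text)
        index[tier_id] = (grams, fingerprint(grams))
    return index


def candidates_by_fingerprint(index, query):
    """Keep tiers whose fingerprint has every bit set that the query needs."""
    needed = fingerprint(ngrams(query))
    return {t for t, (_, fp) in index.items() if fp & needed == needed}


def candidates_by_ngrams(index, query):
    """Keep tiers that contain every n-gram of the query."""
    needed = ngrams(query)
    return {t for t, (grams, _) in index.items() if needed <= grams}


def search(tiers, index, query):
    """Filter cheaply first, then confirm candidates with an exact test."""
    candidates = candidates_by_ngrams(index, query)
    return sorted(t for t in candidates if query in tiers[t])
```

Both filters can return false positives but never false negatives: a tier holding "abc xbca" contains every 3-gram of the query "abca" without containing the query itself, so the final exact check in `search` is still required. Queries shorter than `N` produce no n-grams in this sketch, so every tier stays a candidate for them.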