Item Details


Released

Poster

Faster text search with hybrid indexing

MPS-Authors
/persons/resource/persons4

Auer, Eric
The Language Archive, MPI for Psycholinguistics, Max Planck Society;

Fulltext (public)

poster-trova-lucene-clin-2013.pdf
(Preprint), 2MB

Citation

Auer, E. (2013). Faster text search with hybrid indexing. Poster presented at the 23rd Meeting of Computational Linguistics in the Netherlands (CLIN 2013), Enschede, The Netherlands.


Cite as: https://hdl.handle.net/11858/00-001M-0000-000E-780E-B
Abstract
Growing amounts of annotation data in The Language Archive make it necessary to significantly speed up search to keep response times user friendly. Unlike keyword-oriented web search engines, the Trova and CQL Search services at TLA allow searching for arbitrary exact substrings and (at lower speed) even regular expressions, not just whole words. To achieve both fast and versatile search, a combination of indexes is used. Word, substring and regular expression search queries are analyzed, yielding information about substrings and other properties which must be present in a tier (or file) for that tier to contain a hit for the query at all. Those properties are then either hash-mapped to fixed-size bit vectors (fingerprints) for PostgreSQL-based filtering or expressed as sets of N-grams (up to a fixed length) for filtering with Lucene N-gram indexes. Both methods aim to quickly find a small list of candidate tiers, containing all (but not many more) tiers which may contain hits. As Lucene has no native support for substring search, our system uses a fast but accurate N-gram-based approximation. We present details of the implemented algorithm and elaborate on the improvements in response times achieved. We were able to speed up most steps (opening indexes, defining a search domain, gathering candidates, finding hits, and collecting hit details), and a typical benchmark session now completes in a fraction of the time used by the already powerful previous implementation.
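
To make the candidate-filtering idea concrete, the following is a minimal, self-contained Java sketch, not the actual Trova/TLA implementation: it decomposes a substring query into the N-grams that must all be present in a tier, hashes property sets into fixed-size bit-vector fingerprints, and performs the conservative containment test used to select candidate tiers. The constants MAX_N and FINGERPRINT_BITS, as well as all class and method names, are illustrative assumptions.

import java.util.BitSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch of the two filtering mechanisms described in the abstract;
// names and constants are assumptions, not the archive's actual code.
public class HybridIndexSketch {

    // Assumed maximum N-gram length used when decomposing substrings.
    static final int MAX_N = 3;
    // Assumed size of the fixed-length fingerprint bit vector.
    static final int FINGERPRINT_BITS = 1024;

    // Decompose a substring query into the N-grams (of length min(MAX_N, |s|))
    // that must all occur in a tier for it to possibly contain a hit.
    static Set<String> requiredNGrams(String substring) {
        Set<String> grams = new LinkedHashSet<>();
        int n = Math.min(MAX_N, substring.length());
        for (int i = 0; i + n <= substring.length(); i++) {
            grams.add(substring.substring(i, i + n));
        }
        return grams;
    }

    // Hash a tier's properties (e.g. its N-grams) into a fixed-size bit vector,
    // the "fingerprint" that would be stored for PostgreSQL-based filtering.
    static BitSet fingerprint(Set<String> properties) {
        BitSet bits = new BitSet(FINGERPRINT_BITS);
        for (String p : properties) {
            bits.set(Math.floorMod(p.hashCode(), FINGERPRINT_BITS));
        }
        return bits;
    }

    // A tier can only contain a hit if its fingerprint covers every bit set by
    // the query: false positives are possible (weeded out later), false negatives are not.
    static boolean mayContainHit(BitSet tierFingerprint, BitSet queryFingerprint) {
        BitSet missing = (BitSet) queryFingerprint.clone();
        missing.andNot(tierFingerprint);
        return missing.isEmpty();
    }

    public static void main(String[] args) {
        Set<String> query = requiredNGrams("annotation");
        System.out.println("Required N-grams: " + query);

        Set<String> tierGrams = requiredNGrams("growing amounts of annotation data");
        System.out.println("Candidate tier? " +
                mayContainHit(fingerprint(tierGrams), fingerprint(query)));
    }
}

Against a real Lucene N-gram index, the required N-grams would presumably become conjunctive (MUST) clauses of a Boolean query over the N-gram field; on the PostgreSQL side, the fingerprint containment test would be evaluated over the stored bit vectors. Both paths only narrow the search to candidate tiers, in which actual hits are then located exactly.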