Item Details


Released

Poster

Faster text search with hybrid indexing

MPS-Authors
/persons/resource/persons4

Auer, Eric
The Language Archive, MPI for Psycholinguistics, Max Planck Society;

Fulltext (public)

poster-trova-lucene-clin-2013.pdf
(Preprint), 2MB

Citation

Auer, E. (2013). Faster text search with hybrid indexing. Poster presented at the 23rd Meeting of Computational Linguistics in the Netherlands (CLIN 2013), Enschede, The Netherlands.


Cite as: https://hdl.handle.net/11858/00-001M-0000-000E-780E-B
Abstract
Growing amounts of annotation data in The Language Archive make it necessary to significantly speed up search to keep response times user friendly. Unlike keyword-oriented web search engines, the Trova and CQL Search services at TLA allow searching for arbitrary exact substrings and (at lower speed) even regular expressions, not just whole words. To achieve both fast and versatile search, a combination of indexes is used. Word, substring and regular expression search queries are analyzed, yielding information about substrings and other properties which must be present in a tier (or file) for that tier to contain a hit for the query at all. Those properties are then either hash-mapped to fixed-size bit vectors (fingerprints) for PostgreSQL-based filtering or expressed as sets of N-grams (up to a fixed length) for filtering with Lucene N-gram indexes. Both methods aim to quickly find a small list of candidate tiers, containing all (but not many more) tiers which may contain hits. As Lucene has no native support for substring search, our system uses a fast but accurate N-gram-based approximation. We present details of the implemented algorithm and elaborate on the improvements in response times achieved. We were able to speed up most steps (opening indexes, defining a search domain, gathering candidates, finding hits, and collecting hit details), and a typical benchmark session now completes in a fraction of the time used by the already powerful previous implementation.
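
To make the candidate-filtering idea concrete, the following is a minimal, self-contained Java sketch, not the actual Trova/TLA implementation: it decomposes a substring query into the N-grams that must all be present in a tier, hashes property sets into fixed-size bit-vector fingerprints, and performs the conservative containment test used to select candidate tiers. The constants MAX_N and FINGERPRINT_BITS, as well as all class and method names, are illustrative assumptions.

import java.util.BitSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch of the two filtering mechanisms described in the abstract;
// names and constants are assumptions, not the archive's actual code.
public class HybridIndexSketch {

    // Assumed maximum N-gram length used when decomposing substrings.
    static final int MAX_N = 3;
    // Assumed size of the fixed-length fingerprint bit vector.
    static final int FINGERPRINT_BITS = 1024;

    // Decompose a substring query into the N-grams (of length min(MAX_N, |s|))
    // that must all occur in a tier for it to possibly contain a hit.
    static Set<String> requiredNGrams(String substring) {
        Set<String> grams = new LinkedHashSet<>();
        int n = Math.min(MAX_N, substring.length());
        for (int i = 0; i + n <= substring.length(); i++) {
            grams.add(substring.substring(i, i + n));
        }
        return grams;
    }

    // Hash a tier's properties (e.g. its N-grams) into a fixed-size bit vector,
    // the "fingerprint" that would be stored for PostgreSQL-based filtering.
    static BitSet fingerprint(Set<String> properties) {
        BitSet bits = new BitSet(FINGERPRINT_BITS);
        for (String p : properties) {
            bits.set(Math.floorMod(p.hashCode(), FINGERPRINT_BITS));
        }
        return bits;
    }

    // A tier can only contain a hit if its fingerprint covers every bit set by
    // the query: false positives are possible (weeded out later), false negatives are not.
    static boolean mayContainHit(BitSet tierFingerprint, BitSet queryFingerprint) {
        BitSet missing = (BitSet) queryFingerprint.clone();
        missing.andNot(tierFingerprint);
        return missing.isEmpty();
    }

    public static void main(String[] args) {
        Set<String> query = requiredNGrams("annotation");
        System.out.println("Required N-grams: " + query);

        Set<String> tierGrams = requiredNGrams("growing amounts of annotation data");
        System.out.println("Candidate tier? " +
                mayContainHit(fingerprint(tierGrams), fingerprint(query)));
    }
}

Against a real Lucene N-gram index, the required N-grams would presumably become conjunctive (MUST) clauses of a Boolean query over the N-gram field; on the PostgreSQL side, the fingerprint containment test would be evaluated over the stored bit vectors. Both paths only narrow the search to candidate tiers, in which actual hits are then located exactly.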