Fast Error-Tolerant Search on Very Large Texts

Celikik, Marjan; Bast, Holger

Conference Paper

MPS-Authors

Celikik, Marjan
Algorithms and Complexity, MPI for Informatics, Max Planck Society

Bast, Holger
Algorithms and Complexity, MPI for Informatics, Max Planck Society

Citation

Celikik, M., & Bast, H. (2009). Fast Error-Tolerant Search on Very Large Texts. In D. Shin (Ed.), The 24th Annual ACM Symposium on Applied Computing (pp. 1724-1731). New York, NY: ACM.

Cite as: https://hdl.handle.net/11858/00-001M-0000-000F-184C-3

Abstract

We consider the following spelling variants clustering problem: given a list of distinct words, called a lexicon, compute (possibly overlapping) clusters of words that are spelling variants of each other. This problem arises naturally in error-tolerant full-text search of the following kind: for a given query, return not only the documents that match the query words exactly but also those that match their spelling variants. This is the inverse of the well-known "Did you mean: ... ?" feature of web search engines, where the error tolerance is on the side of the query and not on the side of the documents. We combine various ideas from the large body of literature on approximate string searching and spelling correction into a new algorithm for the spelling variants clustering problem that is both accurate and very efficient in time and space. Our largest lexicon, containing roughly 10 million words, can be processed in about 16 minutes on a standard PC using 10 MB of additional space. This beats the previously best scheme by a factor of two in running time and by a factor of more than ten in space usage. We have integrated our algorithms into the CompleteSearch engine in a way that achieves error-tolerant search without significant blowup in either index size or query processing time.
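To make the problem statement concrete, here is a minimal Python sketch of spelling variants clustering and the query expansion it enables. It is not the algorithm from the paper: it assumes that "spelling variant" means "within a small Levenshtein distance" and compares all word pairs by brute force, which would be far too slow for a lexicon of 10 million words. The function names (`edit_distance`, `cluster_variants`, `expand_query`), the `max_dist` threshold, and the toy lexicon are all illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter word in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cluster_variants(lexicon, max_dist=1):
    """Map each word to the lexicon words within max_dist edits of it.

    Every word belongs to its own cluster, and clusters may overlap.
    Assumption: closeness in edit distance stands in for "spelling variant".
    """
    words = sorted(set(lexicon))
    clusters = {w: {w} for w in words}
    for i, w in enumerate(words):
        for v in words[i + 1:]:
            # Cheap length filter: a length gap above max_dist rules the pair out.
            if abs(len(w) - len(v)) > max_dist:
                continue
            if edit_distance(w, v) <= max_dist:
                clusters[w].add(v)
                clusters[v].add(w)
    return clusters


def expand_query(query, clusters):
    """Replace each query word by the sorted list of its spelling variants."""
    return [sorted(clusters.get(word, {word})) for word in query.split()]


# Hypothetical toy lexicon containing two misspellings.
lexicon = ["algorithm", "algoritm", "algorithms", "search", "serch", "text"]
clusters = cluster_variants(lexicon, max_dist=1)
print(expand_query("algorithm search", clusters))
# [['algorithm', 'algorithms', 'algoritm'], ['search', 'serch']]
```

In this sketch the cluster of "algorithm" contains "algoritm" and "algorithms", while those two words are not in each other's clusters, which shows why the clusters are allowed to overlap. A search engine could then match documents against every word in each expanded list, so the error tolerance sits on the document side rather than the query side.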