Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures

Hoffmann, Steve; Otto, Christian; Kurtz, Stefan; Sharma, Cynthia Mira; Khaitovich, Philipp; Vogel, Jörg; Stadler, Peter F.; Hackermüller, Jörg

Released

Journal Article

MPS-Authors

Sharma, Cynthia Mira
Max-Planck Research Group RNA Biology, Max Planck Institute for Infection Biology, Max Planck Society;

Vogel, Jörg
Max-Planck Research Group RNA Biology, Max Planck Institute for Infection Biology, Max Planck Society;

Stadler, Peter F.
Max Planck Society;

Fulltext (public)

PLoS_Comput_Biol_2009_5_e1000502.pdf
(Publisher version), 704KB

Supplementary Material (public)

There is no public supplementary material available

Citation

Hoffmann, S., Otto, C., Kurtz, S., Sharma, C. M., Khaitovich, P., Vogel, J., Stadler, P. F., & Hackermüller, J. (2009). Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures. PLoS Computational Biology, 5(9):.

Cite as: https://hdl.handle.net/11858/00-001M-0000-000E-C0B3-5

Abstract

With few exceptions, current methods for short read mapping make use of simple seed heuristics to speed up the search. Most of the underlying matching models neglect the necessity to allow not only mismatches, but also insertions and deletions. Current evaluations indicate, however, that very different error models apply to the novel high-throughput sequencing methods. While the most frequent error-type in Illumina reads are mismatches, reads produced by 454's GS FLX predominantly contain insertions and deletions (indels). Even though 454 sequencers are able to produce longer reads, the method is frequently applied to small RNA (miRNA and siRNA) sequencing. Fast and accurate matching in particular of short reads with diverse errors is therefore a pressing practical problem. We introduce a matching model for short reads that can, besides mismatches, also cope with indels. It addresses different error models. For example, it can handle the problem of leading and trailing contaminations caused by primers and poly-A tails in transcriptomics or the length-dependent increase of error rates. In these contexts, it thus simplifies the tedious and error-prone trimming step. For efficient searches, our method utilizes index structures in the form of enhanced suffix arrays. In a comparison with current methods for short read mapping, the presented approach shows significantly increased performance not only for 454 reads, but also for Illumina reads. Our approach is implemented in the software segemehl available at http://www.bioinf.uni-leipzig.de/Software/segemehl/.
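To make the matching model in the abstract concrete, the following minimal sketch shows what it means to match a short read against a reference while tolerating mismatches, insertions and deletions. It is not the segemehl algorithm: segemehl searches an enhanced suffix array, whereas this illustration uses plain dynamic programming with unit costs, and the function name `best_semiglobal_match` is hypothetical.

```python
def best_semiglobal_match(read: str, reference: str) -> tuple[int, int]:
    """Return (edit_distance, end_position) for the best match of `read`
    anywhere in `reference`.

    Illustrative assumption: mismatches, insertions and deletions each
    cost 1. The read must be aligned completely; the reference may be
    entered and left at any position at no cost.
    """
    # prev[j] is the cheapest cost of aligning the read prefix processed
    # so far such that the alignment ends at reference position j.
    # Row 0 is all zeros: a match may start anywhere in the reference.
    prev = [0] * (len(reference) + 1)
    for i, read_char in enumerate(read, start=1):
        # Column 0: i read characters aligned against nothing.
        curr = [i] + [0] * len(reference)
        for j, ref_char in enumerate(reference, start=1):
            curr[j] = min(
                prev[j - 1] + (read_char != ref_char),  # match or mismatch
                prev[j] + 1,       # extra character in the read
                curr[j - 1] + 1,   # character missing from the read
            )
        prev = curr
    # The best match may end anywhere in the reference; take the first
    # position with minimal cost.
    end = min(range(len(prev)), key=prev.__getitem__)
    return prev[end], end


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"
    for read in ("ACGTTAGC", "ACGTTCGC", "ACGTAGC", "ACGTTTAGC"):
        distance, end = best_semiglobal_match(read, reference)
        print(f"{read}: distance={distance}, ends at {end}")
```

This sketch runs in time proportional to the product of read length and reference length, which is far too slow for genome-scale mapping. That cost is the motivation for index-based searches such as the one the abstract describes.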