hide
Language(s):
eng - English
Dates:
2010-10-07
Publication Status:
Accepted / In Press
Pages:
III, 127 S.
Publishing info:
Zagreb : University of Zagreb
Table of Contents:
Table of Contents
1. GENERAL INTRODUCTION.........................................................................1
1.1. Suffix trees and other index data structures used in biological sequence analysis.....................................................................................................................9
1.1.1. Suffix Tree..........................................................................................11
1.1.2. The space and the time complexity of the algorithms for the suffix tree construction.......................................................................................................13
1.1.3. Suffix Array........................................................................................14
1.1.4. The space and the time complexity of the algorithms for suffix array construction.......................................................................................................15
1.1.5. Enhanced Suffix Array.......................................................................17
1.1.6. The 64-bit implementation of the lightweight suffix array construction algorithm 21
1.1.7. Self-index...........................................................................................22
1.1.8. Burrows-Wheeler transform..............................................................23
1.1.9. The FM-Index and the backward search algorithm..........................25
1.1.10. The space and the time-complexity of the FM-index.........................29
2. EFFICIENT ESTIMATION OF PAIRWISE DISTANCES BETWEEN GENOMES...............................................................................................................31
2.1. Introduction................................................................................................31
2.2. Methods.....................................................................................................33
2.2.1. Definition of an alignment-free estimator of the rate of substitution, Kr 33
2.2.2. Problem statement.............................................................................35
2.2.3. Time complexity analysis of the previous approach (kr 1)................35
2.2.4. Time complexity analysis of the new approach (kr 2).......................37
2.2.5. Algorithm 1: Computation of all Kr values during the traversal of a generalized suffix tree of n sequences................................................................38
2.2.6. The implementation of kr version 2...................................................44
2.3. Analysis of Kr on simulated data sets........................................................45
2.3.1. Auxiliary programs............................................................................45
2.3.2. Consistency of Kr...............................................................................46 i
2.3.3. The affect of horizontal gene transfer on the accuracy of Kr............48
2.3.4. The effect of genome duplication on the accuracy of Kr....................49
2.3.5. Run time comparison of kr 1 and kr 2...............................................50
2.4. Application of kr version 2........................................................................53
2.4.1. Auxililary software used for the analysis of real data sets................56
2.4.2. The analysis of 12 Drosophila genomes............................................57
2.4.3. The analysis of 13 Escherichia coli and Shigella genomes...............58
2.4.4. The analysis of 825 HIV-1 pure subtype genomes.............................61
2.5. Discussion..................................................................................................62
3. EFFICIENT ALIGNMENT-FREE DETECTION OF LOCAL SEQUENCE HOMOLOGY....................................................................................66
3.1. Introduction................................................................................................66
3.2. Methods.....................................................................................................69
3.2.1. Problem statement – determining subtype(s) of a query sequence....69
3.2.2. Construction of locally homologous segments..................................71
3.2.3. Time complexity of computing a list of intervals Ii............................72
3.2.4. Algorithm 2: Construction of an interval tree...................................73
3.2.5. Computing a list of segements Gi.......................................................80
3.3. Analysis of st on simulated data sets.........................................................82
3.3.1. Run-time and memory usage analysis of st........................................82
3.3.2. Consistency of st................................................................................85
3.3.3. Comparison to SCUEAL on simulated data sets...............................92
3.4. Application of st.........................................................................................97
3.4.1. The analysis of Neisseria meningitidis..............................................98
3.4.2. The analysis of a recombinant form of HIV-1...................................99
3.4.3. The analysis of circulating recombinant forms of HIV-1................103
3.4.4. The analysis of an avian pathogenic Escherichia coli strain..........104
3.5. Discussion................................................................................................107
4. CONCLUSION..............................................................................................110
5. REFERENCES..............................................................................................112
6. ELECTRONIC SOURCES...........................................................................121
7. LIST OF ABBREVIATIONS AND SYMBOLS.........................................122 ii
iii
ABSTRACT............................................................................................................124
SAŽETAK..............................................................................................................125
CURRICULUM VITAE........................................................................................126
ŽIVOTOPIS...........................................................................................................127
Rev. Type:
-
Identifiers:
eDoc: 544378
Other: Diss/12255
Degree:
PhD