Free keywords:
Normalization of read count data; Enrichment Calling; Difference Calling; ChIP-seq; RNA-seq; ATAC-seq; STARR-seq
Abstract:
Molecular biology studies the molecular basis of the regulation of biomolecular processes in the cell, e.g. gene expression or the genome-wide localization of DNA-associated proteins. These molecular quantities are routinely measured by Next Generation Sequencing (NGS)-based techniques due to their genome-wide scalability and cost-efficiency. In order to discern background regions from genomic loci that harbor a biologically relevant signal, i.e. difference calling, the NGS measurements need to be corrected for technical biases with the help of a control, i.e. normalization. However, the normalization itself requires knowledge of the background regions and, consequently, difference calling and normalization are inseparable. Here, this problem is solved by the data-driven “normR” framework, which models the interdependency of NGS measurements in background and signal regions as a multinomial sampling trial with a binomial mixture model. The robust normR normalization accounts for the effect of signal on the overall measurement statistic by modeling treatment and control simultaneously. In this thesis, I used normR in three studies concerning the inference of DNA-protein binding from ChIP-seq data. Firstly, the two-component “enrichR” model is shown to achieve more sensitive enrichment calling (AUC≥0.93) than six competitor methods (AUC≤0.86) in ChIP-seq data with low, e.g. H3K36me3, and high, e.g. H3K4me3, signal-to-noise ratio (S/N). enrichR’s enrichment calls augment the resolution and comprehensiveness of chromatin segmentations by chromHMM, and its normalization improves on existing in silico and in vitro ChIP-seq normalization methods. Secondly, the three-component “regimeR” model dissects enrichment into two previously undescribed regimes of different signal levels. A regimeR-based analysis identified two distinct facultative and constitutive heterochromatic enrichment regimes in H3K27me3 and H3K9me3 ChIP-seq data, respectively.
The identified peak regions (high enrichment) resemble nucleation sites for heterochromatin embedded in regions of broad (low) enrichment. Lastly, the three-component “diffR” model calls conditional differences in ChIP-seq enrichment between two conditions. The diffR calls in low (H3K27me3) and high (H3K4me3) S/N ChIP-seq data are confirmed by a systematic comparison to four difference callers. Overall, normR represents a robust and versatile framework for the comprehensive analysis of ChIP-seq data, yet it can be readily applied to other NGS-based experiments such as ATAC-seq, STARR-seq or RNA-seq.
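To illustrate the idea of a two-component binomial mixture over treatment and control counts, here is a minimal Python sketch fitted by expectation-maximization. It is not the normR implementation: the function name `fit_binomial_mixture`, the initialization, the convergence rule and the simulated counts are all assumptions made for illustration, and the sketch assumes that the background component is the one with the lower treatment fraction.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import binom


def fit_binomial_mixture(treatment, control, n_iter=500, tol=1e-8):
    """Fit a two-component binomial mixture by EM (illustrative sketch only).

    For each bin with treatment count s and control count r, the model is
    s | (s + r) ~ Binomial(s + r, theta_k), with k = background or enriched.
    Bins with s + r == 0 carry no information and are dropped.
    Returns (theta, pi, posterior), with component 0 = lower theta.
    """
    s = np.asarray(treatment, dtype=np.int64)
    n = s + np.asarray(control, dtype=np.int64)
    keep = n > 0
    s, n = s[keep], n[keep]

    # Hypothetical initialization around the pooled treatment fraction.
    pooled = s.sum() / n.sum()
    theta = np.clip(np.array([0.9 * pooled, 1.5 * pooled]), 1e-6, 1 - 1e-6)
    pi = np.array([0.9, 0.1])

    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability of each component for every bin.
        log_joint = binom.logpmf(s[:, None], n[:, None], theta[None, :])
        log_joint = log_joint + np.log(pi)[None, :]
        log_marginal = logsumexp(log_joint, axis=1)
        post = np.exp(log_joint - log_marginal[:, None])

        # M-step: update mixture weights and treatment fractions.
        pi = np.clip(post.mean(axis=0), 1e-12, 1.0)
        pi = pi / pi.sum()
        theta = (post * s[:, None]).sum(axis=0) / (post * n[:, None]).sum(axis=0)
        theta = np.clip(theta, 1e-6, 1 - 1e-6)

        # Stop once the log-likelihood no longer improves noticeably.
        ll = log_marginal.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    # Posteriors come from the final E-step; sort so index 0 is background.
    order = np.argsort(theta)
    return theta[order], pi[order], post[:, order]


if __name__ == "__main__":
    # Simulated counts (not real ChIP-seq data): 20% of bins are enriched.
    rng = np.random.default_rng(0)
    enriched = rng.random(5000) < 0.2
    total = np.full(5000, 60)
    treat = rng.binomial(total, np.where(enriched, 0.7, 0.4))
    theta, pi, post = fit_binomial_mixture(treat, total - treat)
    # Under this model the background component implies an expected
    # treatment-to-control ratio of theta_bg / (1 - theta_bg).
    print("theta:", theta, "pi:", pi, "scale:", theta[0] / (1 - theta[0]))
```

Because treatment and control enter jointly through the per-bin total, the background treatment fraction is estimated while the enriched bins are absorbed by the second component, which is the intuition behind normalizing and calling enrichment in one step. A three-component variant, as in the regimeR and diffR models described above, would follow the same pattern with one additional component.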