Released

Journal Article

Turning Vice into Virtue: Using Batch-Effects to Detect Errors in Large Genomic Data Sets

MPS-Authors
Mafessoni, Fabrizio
Genomes, Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Max Planck Society;

Prüfer, Kay
Genomes, Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Max Planck Society;

Fulltext (public)

Mafessoni_Turning_GBE_2018.pdf
(Publisher version), 541KB

Citation

Mafessoni, F., Prasad, R. B., Groop, L., Hansson, O., & Prüfer, K. (2018). Turning Vice into Virtue: Using Batch-Effects to Detect Errors in Large Genomic Data Sets. Genome Biology and Evolution, 10(10), 2697-2708. doi:10.1093/gbe/evy199.


Cite as: https://hdl.handle.net/21.11116/0000-0002-D530-7
Abstract
It is often unavoidable to combine data from different sequencing centers or sequencing platforms when compiling data sets
with a large number of individuals. However, the different data are likely to contain specific systematic errors that will appear
as SNPs. Here, we devise a method to detect systematic errors in combined data sets. To measure quality differences between
individual genomes, we study pairs of variants that reside on different chromosomes and co-occur in individuals. The
abundance of these pairs of variants in different genomes is then used to detect systematic errors due to batch effects.
Applying our method to the 1000 Genomes data set, we find that coding regions are enriched for errors, where ~1% of the
higher frequency variants are predicted to be erroneous, whereas errors outside of coding regions are much rarer
(<0.001%). As expected, predicted errors are found less often than other variants in a data set that was generated with
a different sequencing technology, indicating that many of the candidates are indeed errors. However, predicted 1000
Genomes errors are also found in other large data sets; our observation is thus not specific to the 1000 Genomes data set. Our
results show that batch effects can be turned into a virtue by using the resulting variation in large scale data sets to detect
systematic errors.
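
The core idea in the abstract, pairs of variants on different chromosomes that co-occur in the same individuals, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cooccurring_pairs`, `pair_load`), the input layout, and the strict "identical carrier set" criterion are assumptions made for illustration only. The published method may define co-occurrence and test for batch effects quite differently.

```python
from itertools import combinations

def cooccurring_pairs(carriers, chrom):
    """Return variant pairs on different chromosomes with identical carrier sets.

    carriers: dict mapping variant id -> set of individuals carrying it
    chrom:    dict mapping variant id -> chromosome name
    Singletons are skipped: one shared carrier says little about co-occurrence.
    (Hypothetical criterion, chosen only to keep the example small.)
    """
    pairs = []
    for a, b in combinations(sorted(carriers), 2):
        if chrom[a] == chrom[b]:
            continue  # only inter-chromosomal pairs are of interest
        if len(carriers[a]) >= 2 and carriers[a] == carriers[b]:
            pairs.append((a, b))
    return pairs

def pair_load(carriers, chrom):
    """Count, per individual, how many co-occurring pairs they carry."""
    counts = {}
    for a, _b in cooccurring_pairs(carriers, chrom):
        for individual in carriers[a]:
            counts[individual] = counts.get(individual, 0) + 1
    return counts

# Made-up example data: s1 and s2 share three variants across two chromosomes.
carriers = {
    "v1": {"s1", "s2"},
    "v2": {"s1", "s2"},
    "v3": {"s1", "s2"},
    "v4": {"s3"},
    "v5": {"s3"},
    "v6": {"s2", "s3"},
}
chrom = {"v1": "chr1", "v2": "chr2", "v3": "chr1",
         "v4": "chr3", "v5": "chr4", "v6": "chr5"}

pairs = cooccurring_pairs(carriers, chrom)
load = pair_load(carriers, chrom)
print(pairs)
print(sorted(load.items()))
```

In this toy data, individuals carrying an unusually high number of such pairs (here `s1` and `s2`) would be the ones to inspect for a shared batch; a real analysis would need a statistical test rather than a raw count.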