



Journal Article

Sourcepredict: Prediction of metagenomic sample sources using dimension reduction followed by machine learning classification


Borry, Maxime
Archaeogenetics, Max Planck Institute for the Science of Human History, Max Planck Society;


Borry, M. (2019). Sourcepredict: Prediction of metagenomic sample sources using dimension reduction followed by machine learning classification. The Journal of Open Source Software, 01540. doi:10.21105/joss.01540.

Cite as: http://hdl.handle.net/21.11116/0000-0004-DD2C-3
Sourcepredict is a Python package, distributed through Conda, that classifies and predicts the origin of metagenomic samples given a reference dataset of known origins, a problem also known as source tracking.

DNA shotgun sequencing of human, animal, and environmental samples has opened new ways to explore the diversity of life in these environments, a field known as metagenomics (Hugenholtz & Tyson, 2008). One aspect of metagenomics is investigating the community composition of organisms within a sequencing sample with tools known as taxonomic classifiers, such as Kraken (Wood & Salzberg, 2014). When the origin of a metagenomic sample, its source, is unknown, predicting or confirming that source is often part of the research question. For example, in microbial archaeology, it is sometimes necessary to rely on metagenomics to validate the source of paleofaeces.

Using samples of known sources, a reference dataset can be established from the taxonomic composition of the samples: the organisms identified in each sample serve as features, and the source of each sample serves as its class label. With this reference dataset, a machine learning algorithm can be trained to predict the source of unknown samples (sinks) from their taxonomic composition. Other tools for predicting a sample's source already exist, such as SourceTracker (Knights et al., 2011), which employs Gibbs sampling. However, the Sourcepredict results are more easily interpreted, since the samples are embedded in a human-observable low-dimensional space. The embedding is produced by a dimension reduction algorithm, and the embedded samples are then classified with K-Nearest-Neighbours (KNN).
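
The pipeline described above (taxonomic composition as features, source as class label, dimension reduction, then KNN) can be illustrated with a minimal, standard-library-only Python sketch. It is not Sourcepredict's code or API. The abstract does not name the dimension reduction algorithm, so a simple PCA-style linear projection computed by power iteration is swapped in here as a stand-in. The taxon counts, the source labels, and every function name (`to_relative_abundance`, `fit_embedding`, `embed`, `knn_predict`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def to_relative_abundance(counts):
    """Scale raw taxon counts so they sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leading_axis(rows, iterations=200):
    """Power iteration for the leading principal axis of centred rows."""
    dim = len(rows[0])
    # Non-constant start vector: centred compositions are orthogonal
    # to (1, 1, ..., 1), so a constant start would collapse to zero.
    v = [float(j + 1) for j in range(dim)]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        scores = [dot(row, v) for row in rows]
        w = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(dim)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def fit_embedding(rows, n_components=2):
    """Return (means, axes) for a PCA-style linear embedding (stand-in only)."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    residual = [[x - m for x, m in zip(row, means)] for row in rows]
    axes = []
    for _ in range(n_components):
        axis = leading_axis(residual)
        axes.append(axis)
        # Deflate: remove the variance already explained by this axis.
        residual = [[x - dot(row, axis) * a for x, a in zip(row, axis)]
                    for row in residual]
    return means, axes


def embed(row, means, axes):
    """Project one relative-abundance vector into the low-dimensional space."""
    centred = [x - m for x, m in zip(row, means)]
    return [dot(centred, axis) for axis in axes]


def knn_predict(point, ref_points, ref_labels, k=3):
    """Majority vote among the k nearest reference samples."""
    nearest = sorted(zip(ref_points, ref_labels),
                     key=lambda pair: math.dist(point, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    label = votes.most_common(1)[0][0]
    proportions = {name: count / k for name, count in votes.items()}
    return label, proportions


# Hypothetical reference dataset: rows are samples, columns are four taxa.
reference_counts = [
    [80, 10, 5, 5], [75, 15, 5, 5], [85, 5, 5, 5],     # labelled "human"
    [10, 80, 5, 5], [15, 75, 5, 5], [5, 85, 5, 5],     # labelled "dog"
    [5, 5, 45, 45], [5, 5, 50, 40], [5, 5, 40, 50],    # labelled "soil"
]
reference_labels = ["human"] * 3 + ["dog"] * 3 + ["soil"] * 3
sink_counts = [78, 12, 6, 4]  # hypothetical sample of unknown source

reference = [to_relative_abundance(c) for c in reference_counts]
means, axes = fit_embedding(reference)
ref_points = [embed(r, means, axes) for r in reference]
sink_point = embed(to_relative_abundance(sink_counts), means, axes)
predicted, proportions = knn_predict(sink_point, ref_points, reference_labels)
print(predicted, proportions)
```

Because every sample ends up as a point in two dimensions, the reference clusters and the sink can be plotted together, which is the interpretability advantage the abstract attributes to embedding before classification. A real analysis would use many more taxa and samples, and the choice of embedding method and of `k` would need to be validated on held-out reference samples.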