Sourcepredict: Prediction of metagenomic sample sources using dimension 
reduction followed by machine learning classification

Borry, Maxime

doi:10.21105/joss.01540

Journal Article

MPS-Authors

Borry, Maxime
Archaeogenetics, Max Planck Institute for the Science of Human History, Max Planck Society;

Fulltext (public)

shh2440.pdf
(Publisher version), 148KB

Citation

Borry, M. (2019). Sourcepredict: Prediction of metagenomic sample sources using dimension reduction followed by machine learning classification. The Journal of Open Source Software, 01540. doi:10.21105/joss.01540.

Cite as: https://hdl.handle.net/21.11116/0000-0004-DD2C-3

Abstract

Sourcepredict is a Python package, distributed through Conda, that classifies and predicts the
origin of metagenomic samples given a reference dataset of known origins, a problem
known as source tracking.
DNA shotgun sequencing of human, animal, and environmental samples has opened up new
doors to explore the diversity of life in these different environments, a field known as metagenomics
(Hugenholtz & Tyson, 2008). One aspect of metagenomics is investigating the community
composition of organisms within a sequencing sample with tools known as taxonomic
classifiers, such as Kraken (Wood & Salzberg, 2014).
When the origin of a metagenomic sample, its source, is unknown, predicting or confirming that
source is often part of the research question. For example, in microbial archaeology,
it is sometimes necessary to rely on metagenomics to validate the source of paleofaeces.
Using samples of known sources, a reference dataset can be established with the taxonomic
composition of the samples, i.e., the organisms identified in the samples as features, and the
sources of the samples as class labels.
With this reference dataset, a machine learning algorithm can be trained to predict the source
of unknown samples (sinks) from their taxonomic composition.
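
The reference dataset described above can be pictured as a table with one row per sample, one column per taxon, and one class label per row. The following is only a minimal sketch of that layout, not Sourcepredict's own code: the sample names, taxa, counts, and the helper names `build_feature_table` and `to_row` are all hypothetical, and the conversion to relative abundances is an assumed normalisation.

```python
# Hypothetical per-sample read counts (taxon -> count), e.g. as parsed from
# a taxonomic classifier report. All names and numbers are made up.
reference_counts = {
    "human_gut_1": {"Bacteroides": 120, "Prevotella": 30, "Clostridium": 50},
    "human_gut_2": {"Bacteroides": 90, "Prevotella": 60},
    "soil_1": {"Streptomyces": 80, "Bacillus": 40, "Clostridium": 5},
    "soil_2": {"Streptomyces": 100, "Bacillus": 20},
}
reference_sources = {
    "human_gut_1": "human",
    "human_gut_2": "human",
    "soil_1": "soil",
    "soil_2": "soil",
}


def to_row(counts, taxa):
    """Relative abundance of each reference taxon in one sample.

    Taxa absent from `taxa` are dropped, but still count towards the total.
    """
    total = sum(counts.values())
    return [counts.get(taxon, 0) / total for taxon in taxa]


def build_feature_table(counts_by_sample, source_by_sample):
    """Return (samples, taxa, rows, labels) with rows and labels aligned."""
    samples = sorted(counts_by_sample)
    taxa = sorted({t for counts in counts_by_sample.values() for t in counts})
    rows = [to_row(counts_by_sample[s], taxa) for s in samples]
    labels = [source_by_sample[s] for s in samples]
    return samples, taxa, rows, labels


samples, taxa, rows, labels = build_feature_table(reference_counts, reference_sources)

# A sink is expressed over the same taxa columns so that it can be compared
# with the reference rows.
sink_counts = {"Streptomyces": 70, "Bacillus": 25, "Escherichia": 5}
sink_row = to_row(sink_counts, taxa)
```

Here `rows` holds the features and `labels` the class labels; any classifier that accepts a feature matrix and a label vector could then be trained on them.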
Other tools for predicting the source of a sample already exist, such as
SourceTracker (Knights et al., 2011), which employs Gibbs sampling.
However, Sourcepredict results are easier to interpret because the samples are embedded
in a human-observable low-dimensional space. The embedding is produced by a dimension
reduction algorithm and is followed by K-Nearest-Neighbours (KNN) classification.
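
The two-step idea (embed, then classify by nearest neighbours) can be illustrated with a small standard-library sketch. It is an illustration under stated assumptions, not the package's implementation: the abstract does not name the dimension reduction algorithm, so a plain principal component projection (computed by power iteration) is swapped in as a stand-in, and the abundance values, labels, and function names are made up.

```python
import math
from collections import Counter

# Made-up relative abundances over four hypothetical taxa (columns).
reference = [
    [0.60, 0.30, 0.05, 0.05],
    [0.55, 0.35, 0.05, 0.05],
    [0.65, 0.25, 0.05, 0.05],
    [0.05, 0.05, 0.60, 0.30],
    [0.05, 0.10, 0.55, 0.30],
    [0.10, 0.05, 0.50, 0.35],
]
labels = ["human", "human", "human", "soil", "soil", "soil"]
sink = [0.08, 0.07, 0.55, 0.30]


def fit_projection(rows, n_axes=2, iterations=200):
    """Column means and leading principal axes (power iteration with deflation)."""
    d = len(rows[0])
    means = [sum(col) / len(rows) for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Scatter matrix (unnormalised covariance) of the centered data.
    scatter = [[sum(r[i] * r[j] for r in centered) for j in range(d)] for i in range(d)]
    axes = []
    for a in range(n_axes):
        v = [1.0 / (i + 1 + a) for i in range(d)]  # fixed, non-random start
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(iterations):
            w = [sum(scatter[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * scatter[i][j] * v[j] for i in range(d) for j in range(d))
        # Deflate so that the next pass finds the following axis.
        scatter = [[scatter[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
                   for i in range(d)]
        axes.append(v)
    return means, axes


def embed(row, means, axes):
    """Coordinates of one sample in the low-dimensional space."""
    centered = [x - m for x, m in zip(row, means)]
    return [sum(c * a for c, a in zip(centered, axis)) for axis in axes]


def knn_predict(points, point_labels, query, k=3):
    """Majority label among the k embedded reference samples closest to the query."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    votes = Counter(point_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


means, axes = fit_projection(reference)
embedded = [embed(row, means, axes) for row in reference]
sink_embedded = embed(sink, means, axes)
predicted = knn_predict(embedded, labels, sink_embedded, k=3)
print(predicted)
```

Because every sample ends up as a pair of coordinates, the reference samples and the sink could be drawn on a scatter plot, which is the sense in which the embedding makes the prediction easier to inspect.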