Amino acid encoding for deep learning applications

ElAbd, Hesham; Bromberg, Yana; Hoarfrost, Adrienne; Lenz, Tobias L.; Franke, Andre; Wendorff, Mareike

doi:10.1186/s12859-020-03546-x

ElAbd, H., Bromberg, Y., Hoarfrost, A., Lenz, T. L., Franke, A., & Wendorff, M. (2020). Amino acid encoding for deep learning applications. BMC Bioinformatics, 21: 235. doi:10.1186/s12859-020-03546-x.

Item status: Released

Basic data

Record permalink: https://hdl.handle.net/21.11116/0000-0006-AFEA-E
Version permalink: https://hdl.handle.net/21.11116/0000-0006-AFEB-D

Genre: Journal article

Creators:
ElAbd, Hesham, Author
Bromberg, Yana, Author
Hoarfrost, Adrienne, Author
Lenz, Tobias L.¹, Author
Franke, Andre, Author
Wendorff, Mareike, Author

Affiliations:
¹ Research Group Evolutionary Immunogenomics, Department Evolutionary Ecology, Max Planck Institute for Evolutionary Biology, Max Planck Society, ou_2068286

Content

Keywords: Deep learning; Amino acid encoding; Amino acid embedding; Protein-protein interaction (PPI); HLA-II peptide interaction; Convolutional neural network (CNN); Recurrent neural network (RNN); Machine learning (ML); Human leukocyte antigen (HLA)

Abstract: Background

The number of applications of deep learning algorithms in bioinformatics is increasing, as they usually achieve superior performance over classical approaches, especially when larger training datasets are available. In deep learning applications, discrete data, e.g. words or n-grams in language, or amino acids or nucleotides in bioinformatics, are generally represented as continuous vectors through an embedding matrix. Recently, learning this embedding matrix directly from the data as part of the continuous iteration of the model to optimize the target prediction – a process called ‘end-to-end learning’ – has led to state-of-the-art results in many fields. Although the use of embeddings is well described in the bioinformatics literature, the potential of end-to-end learning for single amino acids, as compared to more classical manually curated encoding strategies, has not been systematically addressed. To this end, we compared classical encoding matrices, namely one-hot, VHSE8 and BLOSUM62, to end-to-end learning of amino acid embeddings for two different prediction tasks using three widely used architectures, namely recurrent neural networks (RNN), convolutional neural networks (CNN), and the hybrid CNN-RNN.
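
As a rough illustration of the embedding-matrix idea described above, the following sketch encodes a peptide by looking up one matrix row per residue. It is not taken from the paper: the amino acid ordering, the function names, the example peptide, and the initialization range are all assumptions made for this example. It uses only the Python standard library, and the training step that would update the matrix in end-to-end learning is left out.

```python
import random

# The 20 standard amino acids in one-letter code (the ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_matrix():
    """Fixed 20 x 20 encoding matrix: row i is the one-hot vector of amino acid i."""
    n = len(AMINO_ACIDS)
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def random_embedding_matrix(dim, seed=0):
    """20 x dim matrix of small random values (hypothetical initialization).

    In end-to-end learning, a matrix like this would be the starting point
    and would then be updated by the optimizer; that step is omitted here.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in AMINO_ACIDS]


def encode(sequence, matrix):
    """Map each residue to its row in the matrix (an embedding lookup)."""
    return [matrix[AA_INDEX[aa]] for aa in sequence]


peptide = "ACDK"  # hypothetical example peptide
one_hot = encode(peptide, one_hot_matrix())
learned = encode(peptide, random_embedding_matrix(dim=8))
print(len(one_hot), len(one_hot[0]))  # sequence length, then 20 columns
print(len(learned), len(learned[0]))  # sequence length, then 8 columns
```

Under this view, a classical scheme and a learned embedding differ only in where the matrix values come from and whether they are updated during training; the lookup itself is the same.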
Results

By using different deep learning architectures, we show that end-to-end learning is on par with classical encodings for embeddings of the same dimension even when limited training data is available, and might allow for a reduction in the embedding dimension without performance loss, which is critical when deploying the models to devices with limited computational capacities. We found that the embedding dimension is a major factor in controlling the model performance. Surprisingly, we observed that deep learning models are capable of learning from random vectors of appropriate dimension.
Conclusion

Our study shows that end-to-end learning is a flexible and powerful method for amino acid encoding. Further, due to the flexibility of deep learning systems, amino acid encoding schemes should be benchmarked against random vectors of the same dimension to disentangle the information content provided by the encoding scheme from the distinguishability effect provided by the scheme.
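
The random-vector benchmark suggested in the conclusion could be set up along the following lines. This is a minimal sketch under stated assumptions, not the authors' implementation: `random_baseline_like` and `rows_are_distinct` are hypothetical helper names, a one-hot matrix stands in for a curated scheme, and the model training that would follow is not shown.

```python
import random


def random_baseline_like(encoding, seed=0):
    """Return a random matrix with the same shape as `encoding`.

    Each amino acid gets a random vector: the rows carry no biochemical
    information but should still be distinguishable from one another.
    """
    rng = random.Random(seed)
    dim = len(encoding[0])
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in encoding]


def rows_are_distinct(matrix):
    """True if no two rows are identical, i.e. every residue stays distinguishable."""
    return len({tuple(row) for row in matrix}) == len(matrix)


# Stand-in for a curated scheme: a 20 x 20 one-hot matrix.
curated = [[1.0 if i == j else 0.0 for j in range(20)] for i in range(20)]
baseline = random_baseline_like(curated)

# Both matrices would then be fed to the same model and training setup;
# a performance gap would point to information carried by the curated scheme
# beyond mere distinguishability.
print(len(baseline), len(baseline[0]), rows_are_distinct(baseline))
```

Matching the shape of the curated matrix keeps the embedding dimension fixed, so that any difference in performance is not confounded by dimension.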

Details

Language(s): eng - English

Date: Submitted: 2020-02-20; Accepted: 2020-05-12; Published online: 2020-06-09

Publication status: Published online

Pages: -

Place, publisher, edition: -

Table of contents: -

Review type: -

Identifiers: DOI: 10.1186/s12859-020-03546-x

Degree type: -

Source 1

Title: BMC Bioinformatics

Source genre: Journal

Creators:

Affiliations:

Place, publisher, edition: BioMed Central

Pages: -
Volume / Issue: 21
Article number: 235
Start / End page: -
Identifier: ISSN: 1471-2105
CoNE: https://pure.mpg.de/cone/journals/resource/111000136905000
