A Framework for Space-Efficient String Kernels.

Belazzougui, Djamal; Cunial, Fabio

doi:10.1007/s00453-017-0286-4

Local TagsRelease HistoryDetailsSummary

A Framework for Space-Efficient String Kernels.

Belazzougui, D., & Cunial, F. (2017). A Framework for Space-Efficient String Kernels. Algorithmica, 79(3), 857-883. doi:10.1007/s00453-017-0286-4.

Item is Released

show all hide all

Basic

show hide

Item Permalink: https://hdl.handle.net/21.11116/0000-0009-0F71-8 Version Permalink: https://hdl.handle.net/21.11116/0000-0009-0F72-7

Genre: Journal Article

Files

show Files

Locators

show

Creators

show

hide

Creators:
Belazzougui, Djamal, Author
Cunial, Fabio¹, Author

Affiliations:
1Max Planck Institute for Molecular Cell Biology and Genetics, Max Planck Society, ou_2340692

Content

show

hide

Free keywords: -

Abstract: String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient, or incur large slowdowns. We show that a number of exact kernels on pairs of strings of total length n, like the k-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in O(nd) time and in o(n) bits of space in addition to the input, using just a rangeDistinct data structure on the Burrows-Wheeler transform of the input strings that takes O(d) time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of k, like the k-mer profile and the k-th order empirical entropy, and for calibrating the value of k using the data. All such algorithms become O(n) using a suitable implementation of the rangeDistinct data structure, and by concatenating them to a suitable BWT construction algorithm, we can compute all the mentioned kernels and complexity measures, directly from the input strings, in O(n) time and in (n log sigma) bits of space in addition to the input, where sigma is the size of the alphabet. Using similar data structures, we also show how to build a compact representation of the variable-length Markov chain of a string T of length n, that takes just 3n log sigma + o(n log sigma) bits of space, and that can be learnt in randomized O(n) time using O(n log sigma) bits of space in addition to the input. Such model can then be used to assign a probability to a query string S of length m in O(m) time and in 2m + o(m) bits of additional space, thus providing an alternative, compositional measure of the similarity between S and T that does not require alignment.

Details

show

hide

Language(s):

Dates: Date issued: 2017-11-01

Publication Status: Issued

Pages: -

Publishing info: -

Table of Contents: -

Rev. Type: -

Identifiers: DOI: 10.1007/s00453-017-0286-4
Other: cbg-8145

Degree: -

Event

show

Legal Case

show

Project information

show

Source 1

show

hide

Title: Algorithmica

Other : Algorithmica

Source Genre: Journal

Creator(s):

Affiliations:

Publ. Info: -

Pages: - Volume / Issue: 79 (3) Sequence Number: - Start / End Page: 857 - 883 Identifier: -