Free keywords:
Computer Science, Computational Complexity (cs.CC); Computer Science, Data Structures and Algorithms (cs.DS)
Abstract:
Can we analyze data without decompressing it? As our data keeps growing,
understanding the time complexity of problems on compressed inputs, rather than
in convenient uncompressed forms, becomes more and more relevant. Suppose we
are given a compression of size $n$ of data that originally has size $N$, and
we want to solve a problem with time complexity $T(\cdot)$. The naive strategy
of "decompress-and-solve" gives time $T(N)$, whereas "the gold standard" is
time $T(n)$: to analyze the compression as efficiently as if the original data
were small.
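To illustrate the potential gap with hypothetical numbers: for a quadratic-time
problem, $T(x) = x^2$, an input of size $N = 10^9$ that compresses to size
$n = 10^4$ costs $T(N) = 10^{18}$ operations under decompress-and-solve, but
only $T(n) = 10^8$ at the gold standard: a factor of $10^{10}$.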
We restrict our attention to data in the form of a string (text, files,
genomes, etc.) and study the most ubiquitous tasks. While the challenge might
seem to depend heavily on the specific compression scheme, most methods of
practical relevance (the Lempel-Ziv family, dictionary methods, and others) can
be unified under the elegant notion of Grammar Compressions. A vast literature,
across many disciplines, has established this as an influential notion for
algorithm design.
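As a concrete illustration, here is a minimal sketch (in Python; the rule
encoding and names are hypothetical, not from the paper) of a grammar
compression in its standard form as a straight-line program: every rule derives
either a single terminal character or the concatenation of two earlier rules,
so a grammar with $n$ rules can derive a string of length $N$ exponentially
larger than $n$.

    # Minimal straight-line program (SLP) sketch. Every rule is either a
    # terminal character (a str) or the concatenation of two previously
    # defined rules (a pair of rule names). Encoding is illustrative.
    def decompress(rules, start):
        """Naively expand rule `start` into the full string of length N."""
        memo = {}
        def expand(sym):
            if sym not in memo:
                rule = rules[sym]
                if isinstance(rule, str):      # terminal rule
                    memo[sym] = rule
                else:                          # concatenation rule
                    left, right = rule
                    memo[sym] = expand(left) + expand(right)
            return memo[sym]
        return expand(start)

    # n = 5 rules, yet the derived string has length N = 8; each extra
    # doubling rule like X4 or X5 doubles N while adding only one rule.
    rules = {
        "X1": "a",
        "X2": "b",
        "X3": ("X1", "X2"),  # "ab"
        "X4": ("X3", "X3"),  # "abab"
        "X5": ("X4", "X4"),  # "abababab"
    }
    print(decompress(rules, "X5"))  # -> "abababab"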
We introduce a framework for proving (conditional) lower bounds in this
field, allowing us to assess whether decompress-and-solve can be improved, and
by how much. Our main results are:
- The $O(nN\sqrt{\log(N/n)})$ bound for LCS and the $O(\min\{N \log N, nM\})$
bound for Pattern Matching with Wildcards are optimal up to $N^{o(1)}$ factors,
under the Strong Exponential Time Hypothesis. (Here, $M$ denotes the
uncompressed length of the compressed pattern; a naive decompress-and-solve
baseline for LCS is sketched after this list.)
- Decompress-and-solve is essentially optimal for Context-Free Grammar
Parsing and RNA Folding, under the $k$-Clique conjecture.
- We give an algorithm showing that decompress-and-solve is not optimal for
Disjointness.
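For intuition about the baseline these bounds are measured against, here is a
hedged sketch (in Python; variable names are hypothetical) of
decompress-and-solve for LCS: expand both compressed strings to their full
length $N$, e.g., with the SLP expander sketched above, and run the textbook
quadratic dynamic program, for $T(N) = O(N^2)$ total time.

    # Decompress-and-solve baseline for LCS: fully expand both inputs,
    # then run the classic O(N^2)-time, O(N)-space dynamic program.
    def lcs_length(s, t):
        prev = [0] * (len(t) + 1)
        for ch_s in s:
            cur = [0]
            for j, ch_t in enumerate(t, start=1):
                if ch_s == ch_t:
                    cur.append(prev[j - 1] + 1)
                else:
                    cur.append(max(prev[j], cur[-1]))
            prev = cur
        return prev[-1]

    # The strings stand in for decompress(...) applied to two
    # grammar-compressed inputs, as in the SLP sketch above.
    s = "abababab"
    t = "babababa"
    print(lcs_length(s, t))  # -> 7

The first result above says that the known $O(nN\sqrt{\log(N/n)})$ algorithm,
which already improves on this baseline, is essentially the best possible
under SETH.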