# Item

ITEM ACTIONSEXPORT

Released

Paper

#### Faster Approximate Pattern Matching: A Unified Approach

##### External Resource

No external resources are shared

##### Fulltext (restricted access)

There are currently no full texts shared for your IP range.

##### Fulltext (public)

arXiv:2004.08350.pdf

(Preprint), 2MB

##### Supplementary Material (public)

There is no public supplementary material available

##### Citation

Charalampopoulos, P., Kociumaka, T., & Wellnitz, P. (2020). Faster Approximate Pattern Matching: A Unified Approach. Retrieved from https://arxiv.org/abs/2004.08350.

Cite as: https://hdl.handle.net/21.11116/0000-0007-8C63-C

##### Abstract

Approximate pattern matching is a natural and well-studied problem on

strings: Given a text $T$, a pattern $P$, and a threshold $k$, find (the

starting positions of) all substrings of $T$ that are at distance at most $k$

from $P$. We consider the two most fundamental string metrics: the Hamming

distance and the edit distance. Under the Hamming distance, we search for

substrings of $T$ that have at most $k$ mismatches with $P$, while under the

edit distance, we search for substrings of $T$ that can be transformed to $P$

with at most $k$ edits.

Exact occurrences of $P$ in $T$ have a very simple structure: If we assume

for simplicity that $|T| \le 3|P|/2$ and trim $T$ so that $P$ occurs both as a

prefix and as a suffix of $T$, then both $P$ and $T$ are periodic with a common

period. However, an analogous characterization for the structure of occurrences

with up to $k$ mismatches was proved only recently by Bringmann et al.

[SODA'19]: Either there are $O(k^2)$ $k$-mismatch occurrences of $P$ in $T$, or

both $P$ and $T$ are at Hamming distance $O(k)$ from strings with a common

period $O(m/k)$. We tighten this characterization by showing that there are

$O(k)$ $k$-mismatch occurrences in the case when the pattern is not

(approximately) periodic, and we lift it to the edit distance setting, where we

tightly bound the number of $k$-edit occurrences by $O(k^2)$ in the

non-periodic case. Our proofs are constructive and let us obtain a unified

framework for approximate pattern matching for both considered distances. We

showcase the generality of our framework with results for the fully-compressed

setting (where $T$ and $P$ are given as a straight-line program) and for the

dynamic setting (where we extend a data structure of Gawrychowski et al.

[SODA'18]).

strings: Given a text $T$, a pattern $P$, and a threshold $k$, find (the

starting positions of) all substrings of $T$ that are at distance at most $k$

from $P$. We consider the two most fundamental string metrics: the Hamming

distance and the edit distance. Under the Hamming distance, we search for

substrings of $T$ that have at most $k$ mismatches with $P$, while under the

edit distance, we search for substrings of $T$ that can be transformed to $P$

with at most $k$ edits.

Exact occurrences of $P$ in $T$ have a very simple structure: If we assume

for simplicity that $|T| \le 3|P|/2$ and trim $T$ so that $P$ occurs both as a

prefix and as a suffix of $T$, then both $P$ and $T$ are periodic with a common

period. However, an analogous characterization for the structure of occurrences

with up to $k$ mismatches was proved only recently by Bringmann et al.

[SODA'19]: Either there are $O(k^2)$ $k$-mismatch occurrences of $P$ in $T$, or

both $P$ and $T$ are at Hamming distance $O(k)$ from strings with a common

period $O(m/k)$. We tighten this characterization by showing that there are

$O(k)$ $k$-mismatch occurrences in the case when the pattern is not

(approximately) periodic, and we lift it to the edit distance setting, where we

tightly bound the number of $k$-edit occurrences by $O(k^2)$ in the

non-periodic case. Our proofs are constructive and let us obtain a unified

framework for approximate pattern matching for both considered distances. We

showcase the generality of our framework with results for the fully-compressed

setting (where $T$ and $P$ are given as a straight-line program) and for the

dynamic setting (where we extend a data structure of Gawrychowski et al.

[SODA'18]).