
# Item


Released

Paper

#### Confidence-Calibrated Adversarial Training and Detection: More Robust Models Generalizing Beyond the Attack Used During Training

##### MPS-Authors

Stutz, David
Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society;


Schiele, Bernt
Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society;

##### External Resource
No external resources are shared
##### Fulltext (public)

arXiv:1910.06259.pdf
(Preprint), 2MB

##### Supplementary Material (public)
There is no public supplementary material available
##### Citation

Stutz, D., Hein, M., & Schiele, B. (2019). Confidence-Calibrated Adversarial Training and Detection: More Robust Models Generalizing Beyond the Attack Used During Training. Retrieved from http://arxiv.org/abs/1910.06259.

Cite as: https://hdl.handle.net/21.11116/0000-0005-5559-8
##### Abstract
Adversarial training is the standard to train models robust against
adversarial examples. However, adversarial training incurs a significant loss
in accuracy and is known to generalize poorly to stronger attacks, e.g.,
larger perturbations or other threat models.
In this paper, we introduce confidence-calibrated adversarial training (CCAT)
where the key idea is to enforce that the confidence on adversarial examples
decays with their distance to the attacked examples. We show that CCAT
preserves better the accuracy of normal training while robustness against
adversarial examples is achieved via confidence thresholding, i.e., detecting
adversarial examples based on their confidence. Most importantly, in strong
contrast to adversarial training, the robustness of CCAT generalizes to larger
perturbations and other threat models, not encountered during training. For
evaluation, we extend the commonly used robust test error to our detection
setting, present an adaptive attack with backtracking and allow the attacker to
select, per test example, the worst-case adversarial example from multiple
black- and white-box attacks. We present experimental results using $L_\infty$,
$L_2$, $L_1$ and $L_0$ attacks on MNIST, SVHN and Cifar10.
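The two ingredients sketched in the abstract, a training target whose confidence decays with the size of the perturbation and confidence thresholding at test time, can be illustrated roughly as follows. The decay schedule, threshold value, and function names below are assumptions for illustration, not the exact formulation from the paper.

```python
import numpy as np

def ccat_style_target(one_hot, perturbation, epsilon, num_classes, rho=1.0):
    """Illustrative CCAT-style training target: interpolate between the one-hot
    label and the uniform distribution, with the weight on the true label
    decaying as the L_inf size of the perturbation grows (assumed schedule)."""
    norm = np.abs(perturbation).max()              # L_inf norm of the perturbation
    lam = (1.0 - min(1.0, norm / epsilon)) ** rho  # weight on the one-hot label
    uniform = np.full(num_classes, 1.0 / num_classes)
    return lam * one_hot + (1.0 - lam) * uniform

def reject_low_confidence(probs, threshold=0.9):
    """Confidence thresholding at test time: flag examples whose maximum
    predicted probability falls below the threshold as detected adversarial."""
    return probs.max(axis=1) < threshold
```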