Released

Paper

#### Confidence-Calibrated Adversarial Training and Detection: More Robust Models Generalizing Beyond the Attack Used During Training

##### MPS-Authors
Stutz, David
Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society;

Schiele, Bernt
Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society;

##### External Resource
No external resources are shared
##### Fulltext (public)

arXiv:1910.06259.pdf
(Preprint), 2MB

##### Supplementary Material (public)
There is no public supplementary material available
##### Citation

Stutz, D., Hein, M., & Schiele, B. (2019). Confidence-Calibrated Adversarial Training and Detection: More Robust Models Generalizing Beyond the Attack Used During Training. Retrieved from http://arxiv.org/abs/1910.06259.

Cite as: http://hdl.handle.net/21.11116/0000-0005-5559-8
##### Abstract
Adversarial training is the standard approach for training models that are robust against adversarial examples. However, especially for complex datasets, adversarial training incurs a significant loss in accuracy and is known to generalize poorly to stronger attacks, e.g., larger perturbations or other threat models. In this paper, we introduce confidence-calibrated adversarial training (CCAT), where the key idea is to enforce that the confidence on adversarial examples decays with their distance to the attacked examples. We show that CCAT better preserves the accuracy of normal training, while robustness against adversarial examples is achieved via confidence thresholding, i.e., detecting adversarial examples based on their confidence. Most importantly, in strong contrast to adversarial training, the robustness of CCAT generalizes to larger perturbations and other threat models not encountered during training. For evaluation, we extend the commonly used robust test error to our detection setting, present an adaptive attack with backtracking, and allow the attacker to select, per test example, the worst-case adversarial example from multiple black- and white-box attacks. We present experimental results using $L_\infty$, $L_2$, $L_1$ and $L_0$ attacks on MNIST, SVHN and Cifar10.
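
The two mechanisms named in the abstract, a confidence that decays with the distance to the attacked example and detection by confidence thresholding, can be illustrated with a minimal sketch. The abstract does not specify the decay schedule, so everything concrete here is an assumption made for illustration: the target distribution is interpolated between the one-hot label and the uniform distribution, the interpolation weight follows a power law in the $L_\infty$ norm of the perturbation, and the names `calibrated_target`, `is_rejected`, `epsilon`, `rho` and `threshold` are hypothetical.

```python
from typing import List, Sequence


def linf_norm(delta: Sequence[float]) -> float:
    """Largest absolute entry of the perturbation (0.0 if it is empty)."""
    return max((abs(d) for d in delta), default=0.0)


def calibrated_target(
    label: int,
    num_classes: int,
    delta: Sequence[float],
    epsilon: float,
    rho: float = 10.0,
) -> List[float]:
    """Target distribution whose confidence decays with the perturbation size.

    Assumed schedule (illustrative, not taken from the abstract):
        lam = (1 - min(1, ||delta||_inf / epsilon)) ** rho
        target = lam * one_hot(label) + (1 - lam) * uniform
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    lam = (1.0 - min(1.0, linf_norm(delta) / epsilon)) ** rho
    uniform = 1.0 / num_classes
    return [
        lam * (1.0 if k == label else 0.0) + (1.0 - lam) * uniform
        for k in range(num_classes)
    ]


def is_rejected(probs: Sequence[float], threshold: float) -> bool:
    """Confidence thresholding: flag an input whose top probability is too low."""
    return max(probs) < threshold


# Unperturbed example: the target stays one-hot and passes the threshold.
clean = calibrated_target(label=2, num_classes=4, delta=[0.0, 0.0], epsilon=0.3)
# Perturbation at the budget: the target is uniform and gets rejected.
attacked = calibrated_target(label=2, num_classes=4, delta=[0.3, -0.1], epsilon=0.3)
print(clean, is_rejected(clean, threshold=0.5))
print(attacked, is_rejected(attacked, threshold=0.5))
```

Under these assumptions, a model trained to match such targets would report low confidence far from the training data, so a single threshold on the top probability serves as the detector. In practice the threshold would have to be chosen on held-out clean data; that choice is outside the scope of this sketch.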