
Released

Paper

Good Teachers Explain: Explanation-Enhanced Knowledge Distillation

MPS-Authors
/persons/resource/persons292901
Parchami-Araghi, Amin
Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society;

/persons/resource/persons230363
Boehle, Moritz Daniel
Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society;

/persons/resource/persons253144
Rao, Sukrut Sridhar
Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society;

/persons/resource/persons45383
Schiele, Bernt
Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society;

External Resource
No external resources are shared
Fulltext (public)

arXiv:2402.03119.pdf (Preprint), 17MB

Supplementary Material (public)
There is no public supplementary material available
Citation

Parchami-Araghi, A., Boehle, M. D., Rao, S. S., & Schiele, B. (2024). Good Teachers Explain: Explanation-Enhanced Knowledge Distillation. Retrieved from https://arxiv.org/abs/2402.03119.


Cite as: https://hdl.handle.net/21.11116/0000-000F-5534-7
Abstract
Knowledge Distillation (KD) has proven effective for compressing large teacher models into smaller student models. While it is well known that student models can achieve accuracies similar to those of their teachers, it has also been shown that they nonetheless often do not learn the same function. It is, however, often highly desirable that the student's and teacher's functions share similar properties, such as basing the prediction on the same input features, as this ensures that students learn the 'right features' from the teachers. In this work, we explore whether this can be achieved by optimizing not only the classic KD loss but also the similarity of the explanations generated by the teacher and the student. Despite the idea being simple and intuitive, we find that our proposed 'explanation-enhanced' KD (e$^2$KD) (1) consistently provides large gains in terms of accuracy and student-teacher agreement, (2) ensures that the student learns from the teacher to be right for the right reasons and to give similar explanations, and (3) is robust with respect to the model architectures and the amount of training data, and even works with 'approximate', pre-computed explanations.
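
To make the core idea of the abstract concrete, below is a minimal PyTorch sketch of how an explanation-similarity term could be added to a standard distillation loss. This is not the authors' implementation: the choice of explanation method (input-times-gradient saliency), the temperature T, and the weighting lam are illustrative assumptions, and the abstract notes that e$^2$KD even works with 'approximate', pre-computed explanations, so the explanation function is a placeholder.

```python
import torch
import torch.nn.functional as F


def input_x_gradient(model, x, target_idx, create_graph=False):
    """Stand-in explanation: input-times-gradient saliency for the given class.

    Illustrative only; e^2KD itself does not prescribe a specific explanation method.
    """
    x = x.clone().requires_grad_(True)
    logits = model(x)
    score = logits.gather(1, target_idx.unsqueeze(1)).sum()
    grad, = torch.autograd.grad(score, x, create_graph=create_graph)
    return x * grad


def e2kd_loss(student, teacher, x, T=2.0, lam=1.0):
    """Classic KD loss plus an explanation-similarity term (lam is a hypothetical weight)."""
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)

    # Standard KD term: KL divergence between temperature-softened distributions.
    kd = F.kl_div(
        F.log_softmax(s_logits / T, dim=1),
        F.softmax(t_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    # Explanation term: align student and teacher attributions for the
    # teacher's predicted class via cosine similarity of the flattened maps.
    target = t_logits.argmax(dim=1)
    s_expl = input_x_gradient(student, x, target, create_graph=True).flatten(1)
    t_expl = input_x_gradient(teacher, x, target).detach().flatten(1)
    expl_sim = F.cosine_similarity(s_expl, t_expl, dim=1).mean()

    return kd + lam * (1.0 - expl_sim)
```

In this sketch the explanations are matched on the teacher's predicted class and compared with cosine similarity; pre-computed teacher explanations or a different attribution method would slot into the same place of the second term.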