
Released

Conference Paper

Speaker diarization using gesture and speech

MPS-Authors

Gebre, Binyam Gebrekidan
The Language Archive, MPI for Psycholinguistics, Max Planck Society

Wittenburg, Peter
The Language Archive, MPI for Psycholinguistics, Max Planck Society

Drude, Sebastian
The Language Archive, MPI for Psycholinguistics, Max Planck Society

External Resource
No external resources are shared
Fulltext (public)

interspeech_paper.pdf
(Preprint), 950 KB

Supplementary Material (public)
There is no public supplementary material available
Citation

Gebre, B. G., Wittenburg, P., Drude, S., Huijbregts, M., & Heskes, T. (2014). Speaker diarization using gesture and speech. In H. Li & P. Ching (Eds.), Proceedings of Interspeech 2014: 15th Annual Conference of the International Speech Communication Association (pp. 582-586).


Cite as: https://hdl.handle.net/11858/00-001M-0000-0019-B65B-7
Abstract
We demonstrate how the problem of speaker diarization can be solved using both gesture and speaker parametric models. The novelty of our solution is that we approach speaker diarization as a speaker recognition problem, after learning speaker models from speech samples that co-occur with gestures (the occurrence of a gesture indicates the presence of speech, and its location indicates the identity of the speaker). This new approach offers several advantages: performance comparable to the state of the art, faster computation, and greater adaptability. In our implementation, parametric models capture both the speakers' voices and their gestures: specifically, Gaussian mixture models are used to model the voice characteristics of each person and of all persons together, and gamma distributions are used to model gestural activity based on features extracted from Motion History Images. Tests on 4.24 hours of the AMI meeting data show that our solution improves the DER (diarization error rate) score by 19% on speech-only segments and by 4% on all segments including silence, compared with the AMI system.
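To make the abstract's pipeline concrete, the following is a minimal Python sketch of the two parametric models it describes: per-speaker Gaussian mixture models over acoustic frames plus a background model over all speakers, and per-speaker gamma distributions over motion-energy values such as those derived from Motion History Image regions. All data, dimensions, and the diarize_frame helper below are hypothetical stand-ins for illustration under stated assumptions; this is not the authors' implementation, and feature extraction (MFCCs, MHIs) is assumed to have been done elsewhere.

# A minimal sketch, assuming precomputed acoustic and motion features.
# Synthetic data stands in for real MFCC frames and MHI motion energies.
import numpy as np
from scipy.stats import gamma
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_speakers = 2

# Stand-ins for per-speaker MFCC frames gathered from gesture-aligned speech.
train_mfcc = [rng.normal(loc=i * 2.0, scale=1.0, size=(500, 13))
              for i in range(n_speakers)]

# One GMM per speaker (voice model of "each person") and a background
# GMM over the pooled frames (voice model of "all persons").
voice_models = [GaussianMixture(n_components=4, random_state=0).fit(x)
                for x in train_mfcc]
background = GaussianMixture(n_components=8, random_state=0).fit(np.vstack(train_mfcc))

# Stand-ins for per-speaker motion-energy values from MHI regions,
# each modeled with a fitted gamma distribution (gesture model).
train_motion = [gamma.rvs(a=2.0 + i, scale=1.0, size=300, random_state=i)
                for i in range(n_speakers)]
gesture_models = [gamma.fit(m, floc=0) for m in train_motion]  # (a, loc, scale)

def diarize_frame(mfcc_frame, motion_energies):
    """Assign one frame to the speaker with the highest combined score.

    mfcc_frame: one 13-dim acoustic frame.
    motion_energies: motion energy observed in each speaker's MHI region.
    """
    scores = []
    frame = mfcc_frame.reshape(1, -1)
    for spk in range(n_speakers):
        # Speaker log-likelihood relative to the background model.
        voice_ll = voice_models[spk].score(frame) - background.score(frame)
        a, loc, scale = gesture_models[spk]
        gesture_ll = gamma.logpdf(motion_energies[spk], a, loc=loc, scale=scale)
        scores.append(voice_ll + gesture_ll)
    return int(np.argmax(scores))

# Usage: a synthetic test frame that should match speaker 1.
test_mfcc = rng.normal(loc=2.0, scale=1.0, size=13)
test_motion = [0.5, 3.0]
print("assigned speaker:", diarize_frame(test_mfcc, test_motion))

Summing the two log-likelihoods is one simple way to combine the voice and gesture evidence; how the paper actually weights or sequences the two models is not specified here.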