
Released

Conference Paper

Modeling brain responses to video stimuli using multimodal video transformers

MPS-Authors

Dong, Tianai
International Max Planck Research School for Language Sciences, MPI for Psycholinguistics, Max Planck Society;
Multimodal Language Department, MPI for Psycholinguistics, Max Planck Society;

Fulltext (public)

Dong_etal_2023_CCN 2023.pdf
(Publisher version), 187KB

Citation

Dong, T., & Toneva, M. (2023). Modeling brain responses to video stimuli using multimodal video transformers. In Proceedings of the Conference on Cognitive Computational Neuroscience (CCN 2023) (pp. 194-197).


Cite as: https://hdl.handle.net/21.11116/0000-000F-DE62-9
Abstract
Prior work has shown that internal representations of artificial neural networks can significantly predict brain responses elicited by unimodal stimuli (e.g., reading a book chapter or viewing static images). However, the computational modeling of brain representations of naturalistic video stimuli, such as movies or TV shows, remains underexplored. In this work, we present a promising approach for modeling vision-language brain representations of video stimuli with a transformer-based model that represents videos jointly through audio, text, and vision. We show that the joint representations of vision and text information are better aligned with brain representations of subjects watching a popular TV show. We further show that the incorporation of visual information improves brain alignment across several regions that support language processing.
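
For readers unfamiliar with how such brain alignment is typically quantified, the sketch below illustrates a common encoding-model setup: stimulus features (standing in for the multimodal transformer's representations) are mapped to fMRI responses with cross-validated ridge regression, and alignment is scored as the per-voxel correlation between predicted and measured responses on held-out data. The abstract does not specify the paper's exact procedure, so the regression choice, cross-validation scheme, and the synthetic data below are illustrative assumptions, not the authors' released code.

```python
# Minimal encoding-model sketch for "brain alignment" (illustrative assumptions):
# predict fMRI responses from model-layer features with cross-validated ridge
# regression and score held-out voxelwise correlations.

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def brain_alignment(features, bold, n_folds=5, alphas=(1.0, 10.0, 100.0, 1000.0)):
    """Return cross-validated Pearson correlation per voxel.

    features : (n_timepoints, n_dims) stimulus representations from the model
    bold     : (n_timepoints, n_voxels) preprocessed fMRI responses
    """
    scores = np.zeros(bold.shape[1])
    kf = KFold(n_splits=n_folds)
    for train, test in kf.split(features):
        reg = RidgeCV(alphas=alphas).fit(features[train], bold[train])
        pred = reg.predict(features[test])
        # accumulate per-voxel correlation between prediction and measured BOLD
        for v in range(bold.shape[1]):
            scores[v] += np.corrcoef(pred[:, v], bold[test, v])[0, 1] / n_folds
    return scores

# Synthetic example standing in for transformer features and fMRI data:
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))                                   # e.g. pooled vision+text features per TR
Y = X @ rng.standard_normal((64, 10)) + rng.standard_normal((200, 10))
print(brain_alignment(X, Y).mean())
```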