Extending DeepGaze II: Scanpath prediction from deep features

Kümmerer, M; Wallis, T; Bethge, M

doi:10.1167/18.10.371

Meeting Abstract

External Resource

https://jov.arvojournals.org/article.aspx?articleid=2699363
(Publisher version)

Citation

Kümmerer, M., Wallis, T., & Bethge, M. (2018). Extending DeepGaze II: Scanpath prediction from deep features. Journal of Vision, 18(10): 32.21, 371.

Cite as: https://hdl.handle.net/21.11116/0000-0001-7E3D-F

Abstract

Predicting where humans choose to fixate can help us understand a variety of human behaviour. Recent years have seen substantial progress in predicting spatial fixation distributions for static images. Our own model "DeepGaze II" (Kümmerer et al., ICCV 2017) extracts features from input images using the pretrained VGG deep neural network and uses a simple pixelwise readout network to predict fixation distributions from these features. DeepGaze II is state-of-the-art for predicting free-viewing fixation densities according to the established MIT Saliency Benchmark.
However, DeepGaze II predicts only spatial fixation distributions, not scanpaths. The model therefore ignores crucial structure in the fixation selection process. Here we extend DeepGaze II to predict fixation densities conditioned on the previous scanpath. We add feature maps encoding the previous scanpath (e.g. the distance of image pixels to previous fixations) to the input of the readout network. Apart from these few additional feature maps, the architecture is exactly that of DeepGaze II. The model is trained on ground-truth human fixation data (MIT1003) using maximum-likelihood optimization. Even using only the last fixation location increases performance by approximately 30% relative to DeepGaze II and reproduces the strong spatial fixation clustering effect reported previously (Engbert et al., JoV 2015). This contradicts the way Inhibition of Return has often been used in computational models of fixation selection. Using a history of two fixations increases performance further, and the model learns a suppression effect around the earlier fixation location. Due to the probabilistic nature of our model, we can sample new scanpaths from it that capture the statistics of human scanpaths much better than scanpaths sampled from a purely spatial distribution. The modular architecture of our model allows us to explore the effects of many different possible factors on fixation selection.
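
The ideas of a distance feature map, a conditional fixation density, maximum-likelihood scoring and scanpath sampling can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `sigma` parameter and the hand-set Gaussian term that favours pixels near the last fixation are assumptions standing in for the learned readout network, and the spatial logits are assumed to be given as a plain nested list.

```python
import math
import random


def distance_map(height, width, fixation):
    """Euclidean distance of every pixel to a previous fixation given as (x, y)."""
    fx, fy = fixation
    return [[math.hypot(x - fx, y - fy) for x in range(width)]
            for y in range(height)]


def conditional_density(spatial_logits, last_fixation, sigma=30.0):
    """Fixation density over pixels, conditioned on the last fixation.

    Assumption: a fixed Gaussian term in the distance map replaces the
    learned readout. Returns a flattened, row-major list that sums to 1.
    """
    height, width = len(spatial_logits), len(spatial_logits[0])
    dist = distance_map(height, width, last_fixation)
    logits = [spatial_logits[y][x] - dist[y][x] ** 2 / (2 * sigma ** 2)
              for y in range(height) for x in range(width)]
    peak = max(logits)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(spatial_logits, scanpath):
    """Sum of log densities of each fixation given the one before it."""
    width = len(spatial_logits[0])
    total = 0.0
    for prev, (x, y) in zip(scanpath, scanpath[1:]):
        density = conditional_density(spatial_logits, prev)
        total += math.log(density[y * width + x])
    return total


def sample_scanpath(spatial_logits, start, n_fixations, rng):
    """Draw fixations one at a time, each from the density given the last one."""
    width = len(spatial_logits[0])
    path = [start]
    for _ in range(n_fixations):
        density = conditional_density(spatial_logits, path[-1])
        idx = rng.choices(range(len(density)), weights=density)[0]
        path.append((idx % width, idx // width))
    return path
```

In this sketch, training by maximum likelihood would correspond to adjusting the model parameters so that `log_likelihood` is as large as possible on recorded human scanpaths; a longer fixation history could be encoded by adding one distance map per earlier fixation.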