Free keywords:
-
Abstract:
Autonomous robots and vehicles that act primarily in environments inhabited by humans must recognize human actions from incoming sensory video data in order to form a complete scene understanding. This scene understanding is necessary for all high-level functionality of the autonomous machine. Motivated by this, this work reviews several methods for human action recognition and detection in videos. To recognize or detect actions, the extraction of spatio-temporal features is a prerequisite. The good performance of 2D CNNs in the domain of action recognition has shown that the spatial dimension carries indications of the present actions. 3D CNNs, which convolve the spatial and temporal dimensions symmetrically, improve on 2D CNNs only moderately, hinting that the temporal dimension requires special treatment. This thesis takes inspiration from visual processing in mammalian brains to select a very deep two-stream architecture, SlowFast, whose two streams address the spatial and temporal dimensions respectively. Furthermore, this thesis empirically demonstrates that the chosen SlowFast architecture has exceptional spatio-temporal modeling capabilities, as evidenced by its high parameter utilization. Low runtime cost, high throughput, and state-of-the-art accuracy underline the representational power of the architecture. Moreover, the methodical analysis reveals that hierarchical processing in SlowFast is a significant contributor to its performance. The two streams achieve high functional specialization for modeling motion and form, respectively, from the same underlying computational principles. Channel capacity and temporal resolution are shown to be largely responsible for this functional specialization.
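The two-pathway principle summarized above can be illustrated with a minimal sketch (illustrative only, not the thesis's implementation; the names `ALPHA`, `BETA`, and `split_pathways` are assumptions): the Slow pathway samples frames at a low temporal rate but with high channel capacity, while the Fast pathway keeps full temporal resolution with a fraction of the channels.

```python
import numpy as np

# Hypothetical sketch of the SlowFast two-pathway input split.
# ALPHA: frame-rate ratio between the Fast and Slow pathways.
# BETA: fraction of the Slow pathway's channels used by the Fast pathway.
ALPHA = 8
BETA = 1 / 8
SLOW_CHANNELS = 64

def split_pathways(video, alpha=ALPHA):
    """Temporally subsample a clip of shape (T, H, W, C) for the two pathways."""
    fast = video          # full temporal resolution for motion modeling
    slow = video[::alpha] # every alpha-th frame for form (spatial) modeling
    return slow, fast

# A toy clip: 32 frames of 8x8 RGB pixels
clip = np.zeros((32, 8, 8, 3))
slow, fast = split_pathways(clip)
print(slow.shape[0], fast.shape[0])        # → 4 32
print(int(SLOW_CHANNELS * BETA))           # → 8 (Fast pathway channel width)
```

The asymmetry in `ALPHA` and `BETA` mirrors the abstract's point that temporal resolution and channel capacity drive the functional specialization of the two streams.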