We propose a network structure for action recognition that extracts multi-scale temporal representations of actions. The network, called the multi-scale temporal pooling dense convolutional network (MTPDNet), combines a multi-scale temporal pooling module with a dense connection module. The multi-scale temporal pooling module consists of multiple temporal scale levels. At each scale level, the video frames are divided into several segments, and a pooling operation is performed on each segment to obtain temporal pooling information. The number of segments differs across temporal scale levels, so that pooling information is obtained at multiple temporal scales. In addition, at each scale level, we adopt a redesigned dense connection module to learn motion representations from the temporal pooling information. Finally, predictions are made independently at each scale level, and the class scores of all scale levels are fused to obtain the final prediction scores. Experimental results on two standard datasets, UCF101 and HMDB51, show that MTPDNet achieves results comparable to or better than those of leading methods, which supports the effectiveness of combining multi-scale temporal pooling with dense connections.
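The segment-wise pooling and score fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the pooling operator, the number of segments per level, or the fusion rule, so max pooling, the segment counts (1, 2, 4), and score averaging are assumptions made here for illustration, and all function names are hypothetical. The dense connection module is omitted.

```python
from statistics import fmean
from typing import Callable, List, Sequence

Frame = Sequence[float]  # one per-frame feature vector


def split_into_segments(num_frames: int, num_segments: int) -> List[range]:
    """Split frame indices into contiguous, near-equal segments."""
    if not 1 <= num_segments <= num_frames:
        raise ValueError("num_segments must be between 1 and num_frames")
    base, extra = divmod(num_frames, num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        # The first `extra` segments take one additional frame.
        length = base + (1 if i < extra else 0)
        segments.append(range(start, start + length))
        start += length
    return segments


def temporal_pool(
    frames: Sequence[Frame],
    num_segments: int,
    op: Callable = max,  # assumption: max pooling
) -> List[List[float]]:
    """Pool each segment feature-wise, giving one vector per segment."""
    pooled = []
    for segment in split_into_segments(len(frames), num_segments):
        columns = zip(*(frames[t] for t in segment))
        pooled.append([op(column) for column in columns])
    return pooled


def multi_scale_temporal_pool(
    frames: Sequence[Frame],
    segments_per_level: Sequence[int] = (1, 2, 4),  # assumption
    op: Callable = max,
) -> List[List[List[float]]]:
    """Apply temporal pooling once per scale level."""
    return [temporal_pool(frames, k, op) for k in segments_per_level]


def fuse_scores(level_scores: Sequence[Sequence[float]]) -> List[float]:
    """Fuse per-level class scores by averaging (assumed fusion rule)."""
    return [fmean(column) for column in zip(*level_scores)]
```

With 8 frames and segment counts (1, 2, 4), the three levels yield 1, 2, and 4 pooled vectors respectively. In the network described above, each level's pooled output would pass through its own dense connection module and classifier before `fuse_scores` is applied to the resulting class scores.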
CITATIONS
Cited by 1 scholarly publication and 1 patent.
Video
RGB color model
Optical flow
Video acceleration
Convolution
Network architectures
3D modeling