Spatial-Temporal Neural Networks for Action Recognition

Abstract. Action recognition is an important yet challenging problem in many applications. Recently, neural network and deep learning approaches have been widely applied to action recognition and have yielded impressive results. In this paper, we present a spatial-temporal neural network model to recognize human actions in videos. This network is composed of two connected structures. A two-stream network extracts appearance and optical flow features from video frames and characterizes the spatial information of human actions in videos. A group of LSTM structures following the spatial network describes the temporal information of human actions. We test our model with data from two public datasets, and the experimental results show that our method improves the action recognition accuracy compared to the baseline methods.


Introduction
Action recognition is the task of predicting an action category label for an input video. It is an important problem in many applications, such as video search, security surveillance, and human-machine interaction.
Recognizing actions in daily-activity videos is a challenging problem. First, some different action categories have similar appearance and motion features. For example, the actions drinking and eating are very similar in their motion features. Second, motion noise in videos increases the difficulty of action recognition. Third, unrelated background or scene features often prevent the model from capturing the key information needed for action recognition.
In this paper, we present a spatial-temporal neural network model to recognize human actions in videos. This network is composed of two connected structures, the spatial structure and the temporal structure, as shown in Figure 1. The spatial structure is a Two-Stream Network [1], which extracts appearance and optical flow features from video frames. Following the spatial network, the temporal structure is a group of LSTM networks [23] which represent the temporal and transition information of human actions. With these two structures, our model can deeply mine and utilize the spatial and temporal features in videos for action recognition. We test our model with data from two challenging datasets: MSR DailyActivity 3D [2] and UCF101 [3]. The experimental results show that our method improves the action recognition performance compared to other baseline methods.

Related Work
Traditional action recognition methods generally consist of two key parts: feature extraction and feature classification. For feature extraction, most approaches are based on appearance, geometric, or motion features of human bodies, such as skeleton features [4,5] and optical flow features [6]. These methods extract features from human bodies and can achieve satisfactory results in most scenes. However, in complex scenes with cluttered backgrounds, it is difficult to compute accurate positions of human body parts, and the action recognition accuracy drops drastically. Similar to HOG (Histogram of Oriented Gradients) [7] and SIFT (Scale-Invariant Feature Transform) [8] in images, multi-scale feature extraction algorithms with prior knowledge were proposed. For example, some approaches extracted action features around spatial-temporal interest points [9][10][11]. In complex scenes or backgrounds, such methods have achieved impressive improvements in action recognition accuracy. With the extracted features, various classifiers are learned to recognize actions, such as the Support Vector Machine (SVM) [12].
Recently, neural networks and deep learning techniques [13][14][15] have been widely used in action recognition and have achieved impressive performance [16][17][18][19][20][21]. Compared with static image classification, the temporal component of videos provides an additional and important recognition cue: motion information [1,22]. Early approaches adopted a single CNN model for action recognition [16]. Although this method improves action recognition performance compared with traditional methods, it does not deeply exploit the time-series characteristics of videos. Later, two-stream networks [1], which utilize appearance and optical flow CNNs, significantly improved action recognition performance compared with the single CNN model. After that, Long Short-Term Memory (LSTM) models [23] and other Recurrent Neural Network (RNN) models were applied to action recognition tasks [24][25][26]. LSTM and RNN models incorporate the temporal information of videos into spatial features and therefore remarkably improve the recognition accuracy compared with previous neural network architectures.
Inspired by those models, our spatial-temporal network model is a hybrid architecture of the two-stream network [1] and the LSTM network [23][24][25]. From data preprocessing to network structure, it extracts local and global features, combines multi-feature learning, and is consistent with sequential-data-based action recognition.

Spatial-Temporal Neural Network Model
A video frame containing appearance and geometric information of human actions is the smallest feature unit of the video sequence [24][25][26][27][28]. The temporal and motion information between successive frames is also essential for distinguishing different actions. Inspired by the previous convolutional network and LSTM methods [1, 23-25, 29], we present a spatial-temporal neural network model to jointly describe the spatial information in single frames and the temporal information between successive frames. This network is composed of two connected structures, the spatial structure and the temporal structure, as shown in Figure 1. The spatial structure is a two-stream network [1], which extracts appearance and optical flow features from video frames. Following the spatial network, the temporal structure is a group of LSTM networks [25] which describe the temporal and motion information of human actions. With these two structures, our model can deeply mine and utilize the spatial and temporal features in videos for action recognition.
For a complete video, we first carry out the frame pre-processing described in the Data Processing section. One stream of the spatial structure extracts RGB features and the other stream extracts optical flow features. The main structure of each stream is 6 convolutional layers and two dense layers [29]. The size of each channel input to the first convolutional layer (conv1) is 227*227*8. After feature extraction by the convolutional layers (conv1 to conv6), the FC1 and FC2 layers each have a dimension of 4096. We use two dense layers to prevent overfitting. Through this part, we obtain the feature in the Fe+ layer with a dimension of 4096*2L.
The temporal structure is a sequential six-layer LSTM network. By dynamically feeding the obtained Fe+ features into this sequence learning module, the network learns the temporal features of the video sequence.

Data Processing
Data preprocessing converts the original video sequence into the actual input data of the network model. In this section, we address the problem of multi-frame inputs and introduce how to calculate multi-frame optical flows.

RGB Multi Frame Sequence Input Processing
We adopt a pre-processing method that accounts for the feature expression of both single-frame and multi-frame RGB data. The basic idea is to preserve the continuity between video segments when dividing a complete video sequence into multiple segments. Every 8 frames form a unit segment, and a sliding window advances 4 frames per step from the start of the video to the end, so adjacent segments overlap by 4 frames. This procedure is shown in Algorithm 1.
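The sliding-window segmentation described above can be sketched as follows. This is a minimal illustration rather than the authors' Algorithm 1: the function name `split_into_segments` is hypothetical, and dropping a trailing window with fewer than 8 frames is an assumption, since the text does not say how the end of the video is handled.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_segments(frames: Sequence[T], window: int = 8, step: int = 4) -> List[Sequence[T]]:
    """Return overlapping segments of `window` frames, advancing `step` frames each time."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    segments = []
    # The last valid start index is len(frames) - window; a trailing partial
    # window is dropped (assumption, not stated in the text).
    for start in range(0, len(frames) - window + 1, step):
        segments.append(frames[start:start + window])
    return segments
```

With the default window of 8 and step of 4, consecutive segments share 4 frames, which is what keeps neighbouring segments connected.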

Multi Optical Flow Calculation
In this section, we introduce how to calculate optical flow features in videos. Optical flow is an important feature of videos and is widely used in action recognition [1,22,30]. The optical flow contains the motion information of targets and describes the changes between video frames. In video sequence data, the instantaneous speeds of pixels can be used to characterize the correlations between pixel sequences in the time domain, as shown in Figure 2.
We adopt a method similar to that of the two-stream network [1] to calculate the optical flow features in videos. For a video segment with L frames, we extract the optical flow information along the X and Y axes between each pair of adjacent frames. The optical flow feature of the video segment is then an encapsulation of all the per-frame optical flow features. It is a vector with a dimension of w*h*2L [1], where w*h is the dimension of a single optical flow field.
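The stacking step can be sketched with NumPy as below. This sketch assumes the per-frame flow fields have already been computed by some dense optical flow method (the text does not name one) and that L counts the flow fields being stacked. The function name `stack_optical_flow` and the interleaved x, y channel order are illustrative assumptions.

```python
import numpy as np


def stack_optical_flow(flows):
    """Stack L flow fields of shape (h, w, 2) into one array of shape (h, w, 2L)."""
    fields = [np.asarray(f, dtype=np.float32) for f in flows]
    if not fields:
        raise ValueError("at least one flow field is required")
    expected = fields[0].shape
    if len(expected) != 3 or expected[2] != 2:
        raise ValueError("each flow field must have shape (h, w, 2)")
    if any(f.shape != expected for f in fields):
        raise ValueError("all flow fields must share the same shape")
    # Concatenate along the channel axis; the resulting channel order is
    # x_1, y_1, x_2, y_2, ... (this ordering is an assumption).
    return np.concatenate(fields, axis=-1)
```

The result holds w*h*2L values in total, matching the dimension given in the text.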

Experiments
We use the action recognition accuracy to evaluate the performance of different methods. The action recognition accuracy is defined as the ratio of the number of correctly labeled videos to the total number of testing videos. We train the model under the Caffe framework [31] and use CUDA with a GPU to handle the floating-point matrix operations of the network. We test the models on the MSR DailyActivity 3D dataset [2] and on data samples from the UCF101 dataset [3].
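As a small illustration of the accuracy definition above, the metric can be computed as follows. The function name `recognition_accuracy` and the use of string labels are hypothetical choices for this sketch.

```python
def recognition_accuracy(predicted, actual):
    """Fraction of test videos whose predicted label equals the ground-truth label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one test video is required")
    # Count the videos whose predicted label matches the ground truth.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```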
For the convolutional network component, we use video frames and optical flows to fine-tune a pre-trained AlexNet model [32]. We set the batch size to 32. The learning rate starts at 0.001 and is divided by 10 after every 30k iterations. For all experimental settings, we set the dropout ratio to 0.5 to reduce complex co-adaptations of neurons in the networks.
For the LSTM part, the output of Fe+ is used as the input to the LSTM. The momentum and weight decay are set to 0.9 and 0.0005, respectively. The learning rate starts at 0.01 and is divided by 10 after every 30k iterations. The output dimension of the softmax layer is 16.
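Both learning-rate schedules above follow the same step pattern, which can be sketched as a plain function. This is an illustration of the stated rule only, not the authors' solver configuration. The function name `step_learning_rate` is hypothetical, and treating the drop as taking effect exactly at iterations 30k, 60k, and so on is an assumption.

```python
def step_learning_rate(iteration, base_lr, step_size=30_000, factor=10.0):
    """Learning rate that starts at base_lr and is divided by `factor` every `step_size` iterations."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    # Number of completed steps determines how many times the rate has been divided.
    return base_lr / (factor ** (iteration // step_size))
```

For the convolutional component one would call it with `base_lr=0.001`, and for the LSTM part with `base_lr=0.01`.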

Action Recognition on MSR DailyActivity 3D Dataset
The MSR DailyActivity 3D dataset [2] is captured using a Kinect camera. There are 16 action classes: drinking water, eating, reading, calling, writing on paper, using notebooks, vacuuming, waking up, sitting, throwing paper, playing games, lying on the sofa, walking, playing guitar, standing up, and sitting down. There are ten subjects in total, and each subject performs every action in two ways: once in a standing position and once in a sitting position. The depth frames, the 3D skeletons of human bodies, and the RGB frames are recorded. We compare our method with seven other approaches: Dynamic Temporal Warping [33], Actionlet Ensemble on Joint Features [34], HDMM+3ConvNets [35], 4DH [36], 4DHOI [36], Proposed method with Spatial Structure, and Proposed method with Temporal Structure. Proposed method with Spatial Structure uses only the Two-Stream Network component, and Proposed method with Temporal Structure uses only the LSTM component. Table 1 shows the overall action recognition accuracy comparison, and Figure 3 (a) shows the accuracy of each action category.
Our method achieves an accuracy of 0.87, which outperforms the other baseline approaches. Table 1 also shows that our method outperforms the spatial structure method and the temporal structure method by a considerable margin, which demonstrates the effectiveness of the joint spatial-temporal network.
On the UCF101 data, we compare our approach with the spatial structure method (two-stream network) and the temporal structure method (LSTM). Table 2 shows the accuracy comparison and Figure 3 (b) shows the accuracy of each action class. Our method achieves an accuracy of 0.85. The results show that our method outperforms the comparison methods by a large margin, which demonstrates the strength and effectiveness of our method.

Conclusions
This paper presents a spatial-temporal neural network model to recognize human actions in videos. Our model jointly uses the temporal and spatial features of video sequences. With spatial and temporal structures, our model can deeply mine and utilize the spatial and temporal features in videos for action recognition. We test our model on two challenging datasets. The experimental results show that our method improves the performance compared to other baseline methods.
Our future work will focus on more complex neural network models for action recognition and video understanding.

Fig. 1. Illustration of our spatial-temporal network model. The left side is a two-stream model, and the right side, with LSTM, describes the temporal information of human actions.

Fig. 2. Illustration of optical flow in human actions. (a) and (b) show a pair of successive video frames with human body motion. (c) shows the motion area.

Fig. 3. Per-action recognition accuracy of our network on (a) MSR DailyActivity 3D and (b) UCF101.

Table 1. Action recognition comparison on MSR DailyActivity 3D Dataset.

Each category consists of 25 groups and each group has 4 videos, for a total of 100 videos per category. Videos from the same group may share some common features, such as a similar background or a similar viewpoint.

Table 2. Action recognition comparison with UCF101 data.