

Title:
DEVICE AND METHOD FOR RECOGNIZING ACTIVITY IN VIDEOS
Document Type and Number:
WIPO Patent Application WO/2020/088763
Kind Code:
A1
Abstract:
Embodiments of the present invention relate to action recognition in videos. To this end, an embodiment of the invention includes a device and method for recognizing one or more activities in a video, wherein the device and method employ a deep-learning network. The device is configured to: receive the video; separate the video into an RGB part and an optical flow (OF) part; employ a spatial part of the deep-learning network to calculate a plurality of spatial label predictions based on the RGB part; employ a temporal part of the deep-learning network to calculate a plurality of temporal label predictions based on the OF part; and fuse the spatial and temporal label predictions to obtain a label associated with an activity in the video.

Inventors:
REDZIC MILAN (DE)
CHOWDHURY TARIK (DE)
LIU SHAOQING (DE)
YU BING (DE)
YUAN PENG (DE)
OZBAYBURTLU HAMDI (DE)
WANG HONGBIN (DE)
Application Number:
PCT/EP2018/079890
Publication Date:
May 07, 2020
Filing Date:
October 31, 2018
Assignee:
HUAWEI TECH CO LTD (CN)
REDZIC MILAN (DE)
International Classes:
G06K9/00
Other References:
KAREN SIMONYAN ET AL: "Two-Stream Convolutional Networks for Action Recognition in Videos", 9 June 2014 (2014-06-09), XP055324674, Retrieved from the Internet [retrieved on 20180604]
"Serious Games", vol. 9912, 2016, SPRINGER INTERNATIONAL PUBLISHING, Cham, ISBN: 978-3-540-37274-5, ISSN: 0302-9743, article LIMIN WANG ET AL: "Temporal Segment Networks: Towards Good Practices for Deep Action Recognition", pages: 20 - 36, XP055551834, 032682, DOI: 10.1007/978-3-319-46484-8_2
ANDREJ KARPATHY ET AL: "Large-Scale Video Classification with Convolutional Neural Networks", 2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, June 2014 (2014-06-01), pages 1725 - 1732, XP055560536, ISBN: 978-1-4799-5118-5, DOI: 10.1109/CVPR.2014.223
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:

1. Device (100) for recognizing one or more activities in a video (101), each activity being associated with a predetermined label (104), wherein the device (100) is configured to employ a deep-learning network (102) and, in an inference phase, to:

- receive the video (101),

- separate the video (101) into an RGB part (101a) and an optical flow, OF, part (101b),

- employ a spatial part (102a) of the deep-learning network (102) to calculate a plurality of spatial label predictions (103a) based on the RGB part (101a),
- employ a temporal part (102b) of the deep-learning network (102) to calculate a plurality of temporal label predictions (103b) based on the OF part (101b), and
- fuse the spatial and temporal label predictions (103a, 103b) to obtain a label (104) associated with an activity in the video (101).

2. Device (100) according to claim 1, further configured to:

- extract a plurality of RGB snippets (200a) and a plurality of OF snippets (200b) from the video (101), in order to separate the video into the RGB part (101a) and OF part (101b),

- employ the spatial part (102a) of the deep-learning network (102) to calculate a plurality of label predictions (201a) for each of the RGB snippets (200a),
- employ the temporal part (102b) of the deep-learning network (102) to calculate a plurality of label predictions (201b) for each of the OF snippets (200b),
- calculate the plurality of spatial label predictions (103a) based on the label predictions (201a) of the RGB snippets (200a), and

- calculate the plurality of temporal label predictions (103b) based on the label predictions (201b) of the OF snippets (200b).

3. Device (100) according to claim 1 or 2, further configured to:

- employ the spatial part (102a) of the deep-learning network (102) to calculate a plurality of label predictions for each RGB frame in a given RGB snippet (200a) and calculate the plurality of label predictions (201a) for the given RGB snippet (200a) based on the label predictions of the RGB frames, and/or
- employ the temporal part (102b) of the deep-learning network (102) to calculate a plurality of label predictions for each OF frame in a given OF snippet (200b) and calculate the plurality of label predictions (201b) for the given OF snippet (200b) based on the label predictions of the OF frames.

4. Device (100) according to one of the claims 1 to 3, further configured to, in order to fuse the spatial and temporal label predictions (103a, 103b):

- calculate a sum of normalized label predictions for the same label from a determined number of the plurality of spatial label predictions (103a) and a determined number of the plurality of temporal label predictions (103b), and
- select a normalized label prediction having the highest score as the label (104).

5. Device (100) according to claim 4, further configured to:

calculate, as the sum of normalized label predictions for the same label, a sum of a normalized scaled frequency of appearance of all spatial and temporal label predictions for the same label (103a, 103b).

6. Device (100) according to one of the claims 1 to 5, further configured to:

- obtain a label (104) for each of a plurality of videos (101) in a dataset, and
- calculate an accuracy for the dataset based on the obtained labels (104).

7. Device (100) according to one of the claims 1 to 5, wherein

the deep-learning network (102) is a TSN-bn-inception, enhanced with skip connections from a residual network, type of a network.

8. Device (100) according to one of the claims 1 to 6, wherein

the spatial part (102a) and/or the temporal part (102b) of the deep-learning network (102) comprises a plurality of connected input layers, a plurality of connected output layers, and a plurality of skip connections, each skip connection connecting an input layer to an output layer.

9. Device (100) according to one of the claims 1 to 8, further configured to, in a training/testing phase:

- receive a training or testing video (300), and
- output a result (301) including a ranked list of predicted labels (104) based on the training/testing video (300), each predicted label (104) being associated with a confidence value score.

10. Device (100) according to claim 9, wherein

the result (301) further includes a calculated loss.

11. Device (100) according to claim 9 or 10, configured to:

interrupt the training phase, if calculating a loss of a predetermined value.

12. Device (100) according to one of the claims 9 to 11, configured to:

obtain a pre-trained network model of the deep-learning network (102) at the end of the training phase.

13. Method (700) for recognizing one or more activities in a video (101), each activity being associated with a predetermined label (104), wherein the method (700) employs a deep-learning network (102) and comprises, in an inference phase:

- receiving (701) the video (101),

- separating (702) the video (101) into an RGB part (101a) and an optical flow, OF, part (101b),

- employing (703) a spatial part (102a) of the deep-learning network (102) to calculate a plurality of spatial label predictions (103a) based on the RGB part (101a),

- employing (704) a temporal part (102b) of the deep-learning network (102) to calculate a plurality of temporal label predictions (103b) based on the OF part (101b), and

- fusing (705) the spatial and temporal label predictions (103a, 103b) to obtain a label (104) associated with an activity in the video (101).

14. Computer program product comprising program code for controlling a device to perform the method (700) of claim 13 when the program code is executed by one or more processors of the device.

Description:
DEVICE AND METHOD FOR RECOGNIZING ACTIVITY IN VIDEOS

TECHNICAL FIELD

Embodiments of the present invention relate to action recognition in videos. To this end, embodiments of the invention provide a device and method for recognizing one or more activities in a video, wherein the device and method employ a deep-learning network. Accordingly, embodiments of the invention are also concerned with designing an effective deep-learning network architecture, which is particularly suited for recognizing activities in videos. Embodiments of the invention are applicable, for instance, to video surveillance systems and cameras.

BACKGROUND

Conventional video surveillance systems assist, for instance, police or security personnel in preventing crimes. The benefits of surveillance camera networks are clear: instead of having security or law enforcement personnel stationed at every corner, huge territories can be monitored by a few individuals, for instance, from a control room. The number of surveillance cameras has grown exponentially since the 1990s. Typically, motion detection algorithms are used for video surveillance systems; these are sensitive to illumination changes, camera shake, and motion in the background, such as moving foliage or distant vehicles, but usually cannot deal with continuous motion in the camera field of view.

There has thus been significant ongoing research effort to apply image and video analysis methods together with deep-learning techniques, in order to move towards a more autonomous analysis. Employing deep-learning algorithms can increase the robustness of video surveillance systems, particularly when large amounts of data are used and the algorithms are trained for extensive amounts of time (e.g. days). Current deep-learning frameworks are based on networks, tuned by many parameters, which obtain an improved performance compared to classical computer-vision-based solutions. Notably, an adequate use of hardware and dataset creation is needed to obtain more reliable results in this field of research. Accordingly, the potential to develop a robust end-to-end solution for behavior analysis of a video, and to integrate it e.g. with existing surveillance systems, looks promising so far.

In particular, behavior analysis (BA), or unusual behavior analysis (UBA) for certain classes of activities, based on video action recognition from video surveillance data has attracted intensive research interest in the consumer industry. BA systems based on deep-learning have the following advantages when compared to BA based on classical computer vision:

1. They can avoid extraction of human-engineered features (edges, corners, color).

2. They are robust to use on different datasets (by supporting data augmentation)

3. A model can be saved and transfer learning can be used to fine-tune the weights, in order to achieve robustness on different datasets.

4. They provide a way to leverage a lot of computational resources, in order to run a system by employing a multi-GPU heuristic for available GPU processors.

However, conventional approaches of BA based on video action recognition still remain dependent on training based on human annotations (i.e. supervised learning), which need to be provided for every camera sensor. Further, conventional approaches of BA based on video action recognition still suffer from a lack of accuracy, a lack of efficiency and/or a lack of robustness.

SUMMARY

In view of the above, embodiments of the invention aim to improve conventional approaches for action recognition in videos. An objective is to provide a device and method able to recognize activities in a video more efficiently, more accurately, and more reliably, or in other words with improved robustness, than conventional approaches.

In particular, the device and method should be able to recognize different types of user activities (e.g. classes) given in video or image form (i.e. so-called action events in a video) by associating a label with each activity identified in the video. Based on such labels, the device and method should also be able to analyze the behavior of people present in these videos. Nowadays, video surveillance systems and applications target event tasks typically performed by security personnel, i.e.: detection of loitering, perimeter breach, detection of unattended objects, etc. Accordingly, the device and method should be capable of detecting specifically those aforementioned activities in a video. The device and method should be based on deep-learning techniques.

Embodiments of the invention are defined in the enclosed independent claims. Advantageous implementations of the embodiments of the invention are further defined in the dependent claims.

In particular, embodiments of the invention allow realizing a BA module based on an activity recognition heuristic, which takes into account several conventional methods but puts them in a new, unified framework. Thereby, a late fusion function is employed as one main idea, in order to derive more information about the input video. Embodiments of the invention further take into account principles of effective deep-learning network architectures for action recognition in a video, and are able to learn network models given only limited training samples.

An idea is also to extract Red, Green, Blue (RGB) and Optical Flow (OF) frames, respectively, from an input video in a dual-network fashion, and then to combine a sparse temporal sampling strategy with video-level supervision, in order to enable effective learning using the late fusion function. The RGB frames are individual images of the video, extracted at a particular frame rate. The OF can, for example, be calculated by determining a pattern of apparent motion of image objects between two consecutive frames, caused by the movement of the objects and/or the camera. OF can be described as a two-dimensional (2D) vector field where each vector is a displacement vector showing the movement of points from a first frame to a second frame. One can obtain a 2-channel array with optical flow vectors and find their magnitude and direction. The direction corresponds to the Hue value of the image, while the magnitude corresponds to the Value plane.
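As an illustration of the OF extraction described above, the following minimal Python sketch splits a video into RGB frames and OF frames and encodes the flow direction as Hue and the flow magnitude as Value. OpenCV's Farneback dense flow is used here only as an illustrative choice; the embodiments do not prescribe a particular OF algorithm.

import cv2
import numpy as np

def extract_rgb_and_flow(video_path):
    # Minimal sketch: split a video into RGB frames and dense optical-flow frames.
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    rgb_frames, flow_frames = [prev], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # 2-channel displacement field between two consecutive frames
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        hsv = np.zeros_like(frame)
        hsv[..., 0] = ang * 180 / np.pi / 2  # direction -> Hue
        hsv[..., 1] = 255
        hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # magnitude -> Value
        flow_frames.append(cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR))
        rgb_frames.append(frame)
        prev_gray = gray
    cap.release()
    return rgb_frames, flow_frames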

A first aspect of the invention provides a device for recognizing one or more activities in a video, each activity being associated with a predetermined label, wherein the device is configured to employ a deep-learning network and, in an inference phase, to: receive the video, separate the video into an RGB part and an OF part, employ a spatial part of the deep-learning network to calculate a plurality of spatial label predictions based on the RGB part, employ a temporal part of the deep-learning network to calculate a plurality of temporal label predictions based on the OF part, and fuse the spatial and temporal label predictions to obtain a label associated with an activity in the video.

By first obtaining the spatial and temporal label predictions separately from one another, and then fusing them to obtain the final label (i.e. applying a late fusion function), a more efficient, accurate, and reliable recognition of activities in the video is achieved.

A “deep-learning network” includes, for instance, a neural network like a Convolutional Neural Network (CNN) or Convolutional Networks (ConvNets), and/or includes one or more skip connections as proposed in Residual Networks (ResNet), and/or a batch normalization (bn)-inception type of network. A deep-learning network can be trained in a training phase of the device and can be used for recognizing activities in the video during an inference phase of the device.

A “label” or “class label” identifies an activity or a class of an activity (e.g. “loitering” or “perimeter breach”). That is, a label is directly associated with an activity. Labels can be determined before operating the device for activity recognition.

A “label prediction” is a predicted label, i.e. at least one preliminary label, and may typically include a prediction of multiple labels, e.g. label candidates, each associated with a different probability of being the correct label.

“Temporal” is based on the OF, i.e. it refers to the motion in the video, while “spatial” is based on the RGB, i.e. it refers to the spatial distribution of features (e.g. colors, brightness, etc. of e.g. pixels or areas) in the video.

In an implementation form of the first aspect, the device is further configured to: extract a plurality of RGB snippets and a plurality of OF snippets from the video, in order to separate the video into the RGB part and OF part, employ the spatial part of the deep-learning network to calculate a plurality of label predictions for each of the RGB snippets, employ the temporal part of the deep-learning network to calculate a plurality of label predictions for each of the OF snippets, calculate the plurality of spatial label predictions based on the label predictions of the RGB snippets, and calculate the plurality of temporal label predictions based on the label predictions of the OF snippets.

In this way, the label can be predicted even more accurately and efficiently. A “snippet” is a short segment or piece of the video, which may for instance be randomly sampled from the video. An “RGB snippet” consists of RGB frames extracted from the video snippet, while an “OF snippet” consists of OF frames extracted from the video snippet.

In a further implementation form of the first aspect, the device is further configured to: employ the spatial part of the deep-learning network to calculate a plurality of label predictions for each RGB frame in a given RGB snippet and calculate the plurality of label predictions for the given RGB snippet based on the label predictions of the RGB frames, and/or employ the temporal part of the deep-learning network to calculate a plurality of label predictions for each OF frame in a given OF snippet and calculate the plurality of label predictions for the given OF snippet based on the label predictions of the OF frames.

The RGB part, and each RGB snippet, includes a plurality of frames, i.e. “RGB frames”. The OF part, and each OF snippet, includes a plurality of frames, i.e. “OF frames”. A frame is an image or picture of the video, i.e. a label prediction for a frame takes into account that picture of the video to predict one or more labels associated with activities.

In a further implementation form of the first aspect, the device is further configured to, in order to fuse the spatial and temporal label predictions: calculate a sum of normalized label predictions for the same label from a determined number of the plurality of spatial label predictions and a determined number of the plurality of temporal label predictions, and select a normalized label prediction having the highest score as the label.

In this way, the label can be predicted even more accurately and efficiently.

In a further implementation form of the first aspect, the device is further configured to: calculate, as the sum of normalized label predictions for the same label, a sum of a normalized scaled frequency of appearance of all spatial and temporal label predictions for the same label. “Frequency of appearance” means how often a spatial or temporal label (candidate), i.e. label prediction, is predicted.

In a further implementation form of the first aspect, the device is further configured to: obtain a label for each of a plurality of videos in a dataset, and calculate an accuracy for the dataset based on the obtained labels.

Thus, the accuracy of the action recognition can be further improved.

In a further implementation form of the first aspect, the deep-learning network is a Temporal Segment Network (TSN)-bn-inception, enhanced with skip connections from a residual network (ResNet), type of a network.

In this way, the device is able to efficiently but accurately obtain the label(s) based on deep-learning. A skip connection shortcuts layers in the deep-learning network.

In a further implementation form of the first aspect, the spatial part and/or the temporal part of the deep-learning network comprises a plurality of connected input layers, a plurality of connected output layers, and a plurality of skip connections, each skip connection connecting an input layer to an output layer.

In a further implementation form of the first aspect, the device is further configured to, in a training and testing phase: receive a training/testing video, and output a result including a ranked list of predicted labels based on the training/testing video, each predicted label being associated with a confidence value score.

Thus, an accurate training of the deep-learning network is possible, and improves the results obtained in the inference phase of the device.

In a further implementation form of the first aspect, the result further includes a calculated loss.

In a further implementation form of the first aspect, the device is further configured to: interrupt the training phase if calculating a loss of a predetermined value.

In a further implementation form of the first aspect, the device is further configured to: obtain a pre-trained network model of the deep-learning network at the end of the training phase.

A second aspect of the invention provides a method for recognizing one or more activities in a video, each activity being associated with a predetermined label, wherein the method employs a deep-learning network and comprises in an inference phase: receiving the video, separating the video into an RGB part and an OF part, employing a spatial part of the deep-learning network to calculate a plurality of spatial label predictions based on the RGB part, employing a temporal part of the deep-learning network to calculate a plurality of temporal label predictions based on the OF part, and fusing the spatial and temporal label predictions to obtain a label associated with an activity in the video.

In an implementation form of the second aspect, the method further comprises: extracting a plurality of RGB snippets and a plurality of OF snippets from the video, in order to separate the video into the RGB part and OF part, employing the spatial part of the deep-learning network to calculate a plurality of label predictions for each of the RGB snippets, employing the temporal part of the deep-learning network to calculate a plurality of label predictions for each of the OF snippets, calculating the plurality of spatial label predictions based on the label predictions of the RGB snippets, and calculating the plurality of temporal label predictions based on the label predictions of the OF snippets.

In a further implementation form of the second aspect, the method further comprises: employing the spatial part of the deep-learning network to calculate a plurality of label predictions for each RGB frame in a given RGB snippet and calculate the plurality of label predictions for the given RGB snippet based on the label predictions of the RGB frames, and/or employing the temporal part of the deep-learning network to calculate a plurality of label predictions for each OF frame in a given OF snippet and calculate the plurality of label predictions for the given OF snippet based on the label predictions of the OF frames.

In a further implementation form of the second aspect, the method further comprises, in order to fuse the spatial and temporal label predictions: outputting and calculating a sum of normalized label predictions for the same label from a determined number of the plurality of spatial label predictions and a determined number of the plurality of temporal label predictions, and selecting a normalized label prediction having the highest score as the label.

In a further implementation form of the second aspect, the method further comprises: calculating, as the sum of normalized label predictions for the same label, a sum of a normalized scaled frequency of appearance of all spatial and temporal label predictions for the same label.

In a further implementation form of the second aspect, the method further comprises: obtaining a label for each of a plurality of videos in a dataset, and calculating an accuracy for the dataset based on the obtained labels.

In a further implementation form of the second aspect, the deep-learning network is a TSN-bn-inception, enhanced with skip connections from a residual network, type of a network.

In a further implementation form of the second aspect, the spatial part and/or the temporal part of the deep-learning network comprises a plurality of connected input layers, a plurality of connected output layers, and a plurality of skip connections, each skip connection connecting an input layer to an output layer.

In a further implementation form of the second aspect, the method further comprises, in a training/testing phase: receiving a training/testing video, and outputting a result including a ranked list of predicted labels based on the training/testing video, each predicted label being associated with a confidence value score.

In a further implementation form of the second aspect, the result further includes a calculated loss.

In a further implementation form of the second aspect, the method further comprises: interrupting the training phase if calculating a loss of a predetermined value.

In a further implementation form of the second aspect, the method further comprises: obtaining a pre-trained network model of the deep-learning network at the end of the training phase.

The method of the second aspect and its implementation forms achieve all advantages and effects described above with respect to the device of the first aspect and its respective implementation forms.

A third aspect of the invention provides a computer program product comprising program code for controlling a device to perform the method of the second aspect or any of its implementation forms, when the program code is executed by one or more processors of the device.

Accordingly, by executing the program code the advantages of the method of the second aspect and its implementation forms are achieved.

Generally, in the above aspects and implementation forms, the fusing (late fusion function) of the spatial and temporal label predictions may be performed taking into account predictions of both the RGB and OF parts, particularly by fusing the output predictions of the two data streams, respectively. This fusion function may be based on the top k predictions (k > 1) for each stream separately: e.g. for each RGB frame of an input video, the top k predictions of the network output may be found. Then all the predictions may be grouped based on one source of information (all the RGB frames) and the top k predictions may again be chosen (based on a majority of votes or a frequency of appearance). To output a prediction for the input video based on the RGB part only, the first-ranked (most likely) prediction may be taken as the correct one and compared to its label (ground-truth prediction). For the OF part, the same process may be repeated, and the prediction based on this part only may be obtained. To fuse the RGB and OF parts, the processes for RGB and OF may be repeated, but this time taking the top m (m > 1 and preferably > k) predictions into account. Then a union (sum) of normalized predictions for the same label may be found from both parts, and the one with the most votes may be chosen. One may notice that this fusion heuristic does not actually depend on the type of the input data. The main improvement provided by the embodiments of the invention (as defined in the aspects and implementation forms) is an increase of accuracy and an improvement of efficiency, as compared to conventional approaches to video action recognition. The accuracy improvement is reflected on three different datasets tested, while there is also a slight improvement of the training speed.
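The following minimal Python sketch illustrates this kind of late fusion heuristic. The function names and the per-frame score format are illustrative assumptions rather than part of the claimed embodiments; the choice m = 13 follows the example given later in the description.

from collections import Counter

def top_k_labels(frame_predictions, k):
    # frame_predictions: list of per-frame score dictionaries {label: score}.
    # Returns the k labels appearing most often among the per-frame top-k picks.
    votes = Counter()
    for scores in frame_predictions:
        for label in sorted(scores, key=scores.get, reverse=True)[:k]:
            votes[label] += 1
    return votes.most_common(k)

def late_fusion(rgb_frame_preds, of_frame_preds, m=13):
    # Union (sum) of normalized vote frequencies from both streams; pick the winner.
    fused = Counter()
    for stream in (top_k_labels(rgb_frame_preds, m), top_k_labels(of_frame_preds, m)):
        total = sum(count for _, count in stream) or 1
        for label, count in stream:
            fused[label] += count / total  # normalized frequency of appearance
    return fused.most_common(1)[0][0]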

It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF DRAWINGS

The above-described aspects and implementation forms of the present invention will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which

FIG. 1 shows a device according to an embodiment of the invention.

FIG. 2 shows an inference phase of a device according to an embodiment of the invention.

FIG. 3 shows a training phase of a device according to an embodiment of the invention.

FIG. 4 shows an exemplary inference phase procedure, implemented by a device according to an embodiment of the invention.

FIG. 5 shows a basic block example of a deep-learning network using skip connections for a device according to an embodiment of the invention.

FIG. 6 shows an example of a part of a deep-learning network used by a device according to an embodiment of the invention.

FIG. 7 shows a method according to an embodiment of the invention.

DETAILED DESCRIPTION OF EMBODIMENTS

FIG. 1 shows a device 100 according to an embodiment of the invention. The device 100 is configured to recognize one or more activities in an (input) video 101. The device 100 may be implemented in a video surveillance system, and/or may receive the video 101 from a camera, particularly a camera of a video surveillance system. However, the device 100 is able to perform action recognition on any kind of input video, regardless of its origin. Each activity is associated with a predetermined label 104. A number of predetermined labels 104 may be known to the device 100, and/or determined labels 104 may be learned or trained by the device 100. The device 100 is specifically configured to employ a deep-learning network 102, and can accordingly be operated in an inference phase and a training phase. The deep-learning network may be implemented by at least one processor or processing circuitry of the device 100.

As shown in FIG. 1, in the inference phase, the device 100 is particularly configured to receive the video 101 (e.g. from a video camera or from a video post-processing device, which outputs a post-processed video), in which video activities are to be recognized by the device 100. To this end, the device 100 is configured to first separate the video 101 into an RGB part 101a and an OF part 101b, respectively. The RGB part 101a represents spatial features (e.g. colors, contrast, shapes, etc.) in the video 101, and the OF part 101b represents temporal features (i.e. motion features) in the video 101.

Further, the device 100 is configured to employ the deep-learning network 102, which includes a spatial part 102a and a temporal part 102b. The deep-learning network 102 may be software-implemented in the device 100. The spatial part 102a is employed to calculate a plurality of spatial label predictions 103a based on the RGB part 101a of the video 101, while the temporal part 102b is employed to calculate a plurality of temporal label predictions 103b based on the OF part 101b of the video 101.

After that, the device 100 is configured to fuse the spatial and temporal label predictions 103a, 103b, in order to obtain a (fused) label 104 associated with an activity in the video 101. This fusing is also referred to as late fusion, since it operates on label predictions, i.e. on preliminary results. The label 104 classifies an activity in the video 101, i.e. an activity in the video 101 has been recognized.

Two more specific block diagrams of the device 100, which are shown in FIG. 2 and FIG. 3, respectively, give more insight into the dependencies and functionalities between certain components of the device 100. FIG. 2 shows in particular a device 100 according to an embodiment of the invention, which builds on the device 100 shown in FIG. 1, and is operated in the inference phase. FIG. 3 also shows a device 100 according to an embodiment of the invention, which builds on the device 100 shown in FIG. 1, but is operated in a training phase. The devices 100 of FIG. 2 and FIG. 3 may be the same. Training and inference (also referred to as testing) phases (also called stages) can be distinguished, because the device 100 is deep-learning network based.

A block diagram with respect to the testing/inference phase of the device 100 is shown in FIG. 2. It can be seen that the device 100 is configured to extract a plurality of RGB snippets 200a and a plurality of OF snippets 200b, respectively, from the video 101, in order to separate the video into the RGB part 101a and the OF part 101b. The RGB and OF snippets 200a, 200b then propagate through the corresponding deep-learning network 102 parts (i.e. the spatial part 102a and the temporal part 102b, respectively). Thereby, the spatial part 102a calculates a plurality of label predictions 201a for each of the RGB snippets 200a, and the temporal part 102b calculates a plurality of label predictions 201b for each of the OF snippets 200b.

Then, spatial and temporal consensus predictions are obtained, wherein the device 100 is configured to calculate the plurality of spatial label predictions 103a based on the label predictions 201a of the RGB snippets 200a, and to calculate the plurality of temporal label predictions 103b based on the label predictions 201b of the OF snippets 200b.

Afterwards, the spatial and temporal label predictions 103a, 103b are fused (late fusion) to obtain at least one label 104 associated with at least one activity in the video 101, i.e. a final prediction of the activity is made. The label 104, or multiple labels 104, may be provided to a watch-list ranking block 202. Multiple predictions for multiple available videos (from a dataset) may be processed to obtain a final accuracy (on the whole dataset). In other words, the device 100 may obtain at least one label 104 for each of a plurality of videos 101 in the dataset, and may calculate an accuracy for the dataset based on the obtained labels 104.

A block diagram with respect to the training phase of the device 100 is shown in FIG. 3. After each training iteration, a validation output 301, which includes a ranked list of (the current) predicted labels 104 based on input video frames (images) of a training/testing video 300, may be used to calculate a validation accuracy result, in addition to corresponding confidence value scores. In other words, the device 100 may output the result 301 including a ranked list of predicted labels 104 based on the training/testing video 300, wherein each predicted label 104 is associated with a confidence value score. Also, a loss may be calculated, and the overall training process may be repeated until the process either finishes with the last training iteration or reaches a particular (predefined) loss value. In other words, the result 301 output by the device 100 in the training phase may further include a calculated loss, and the device 100 may be configured to interrupt the training phase if a loss of a predetermined value is calculated.
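A minimal sketch of such a training loop, which stops either after the last training iteration or when the calculated loss reaches a predetermined value, is given below. The names and signatures (model, loader, loss_fn, target_loss) are illustrative PyTorch-style assumptions, not part of the embodiments.

def train(model, loader, optimizer, loss_fn, max_iterations, target_loss):
    # Illustrative training loop with early interruption at a predetermined loss value.
    for iteration in range(max_iterations):
        for video_batch, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(video_batch), labels)  # e.g. cross-entropy over predicted labels
            loss.backward()
            optimizer.step()
        if loss.item() <= target_loss:  # predetermined loss value reached -> interrupt training
            break
    return model  # pre-trained network model (network graph and trained weights)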

Eventually, the device 100 obtains a pre-trained network model of the deep-learning network 102 at the end of the training phase (i.e. at least a network graph and trained network weights), which can be used by the device 100 during the testing (inference) phase, e.g. as shown in FIG. 2, for recognizing activities in the video 101.

The deep-learning network 102 employed by the device 100 according to an embodiment of the invention may be a TSN-bn-inception, enhanced with skip connections from a ResNet, type of network. In particular, the deep-learning network 102 may be a modification and/or combination of different building blocks; it may be based on a combination of a TSN and a bn-inception type network with skip connections as proposed in the ResNets. Below, first the individual building blocks and then the combined deep-learning network 102 are described.

The TSN may be chosen as one building block of the deep-learning network 102. The TSN is generally enabled to model dynamics throughout a video. To this end, the TSN may be composed of spatial stream ConvNets and temporal stream ConvNets. A general approach for performing action recognition in a video with a device 100 employing such a TSN is shown in FIG. 4. In particular, an inference phase of such a device 100 is shown. In the example of FIG. 4, the input video is divided 400 into a plurality of segments (also referred to as slices or chunks), and then short snippets are extracted 401 from each segment, wherein a snippet comprises more than one frame, i.e. a plurality of frames. That means, instead of working on single frames or frame stacks, the TSN operates on a sequence of short snippets sparsely sampled (in the time and/or spatial domain, for example depending on the video size) from the entire video. Each snippet in the sequence may produce its own preliminary prediction of action classes (class scores 402). Class scores 402 of different snippets may be fused 403 by a segmental consensus function to yield a segmental consensus, which is a video-level prediction. Predictions from all modalities are then fused 404 to produce the final prediction. ConvNets on all snippets may share parameters.
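A minimal Python sketch of the sparse snippet sampling and the segmental consensus described above might look as follows; the function names, the snippet length, and the use of an average as aggregation function are illustrative assumptions.

import random

def sample_snippets(num_frames, num_segments=3, snippet_len=5):
    # Divide the frame indices into equal segments and draw one short snippet per segment.
    seg_len = num_frames // num_segments
    snippets = []
    for k in range(num_segments):
        start = k * seg_len + random.randint(0, max(seg_len - snippet_len, 0))
        snippets.append(list(range(start, min(start + snippet_len, num_frames))))
    return snippets

def segmental_consensus(snippet_scores):
    # Aggregation function G: here an element-wise average of the per-snippet class scores.
    return [sum(col) / len(snippet_scores) for col in zip(*snippet_scores)]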

In the training/learning phase, the loss values of video-level predictions, rather than those of snippet-level predictions, may be optimized by iteratively updating the model parameters. Formally, a given video V may be divided into K segments {S1, S2, ..., SK} of equal duration. Then, the TSN may model a sequence of snippets as follows:

TSN(T1, T2, ..., TK) = M(G(F(T1; W), F(T2; W), ..., F(TK; W))).

Here, (T1, T2, ..., TK) is a sequence of snippets. Each snippet Tk may be randomly sampled from the corresponding segment Sk, wherein k is an integer index in a range from 1 to K. F(Tk; W) may define a function representing a ConvNet with parameters W, which operates on the short snippet Tk and produces class scores for all the classes. The segmental consensus function G combines the outputs from multiple short snippets to obtain a consensus of class hypotheses among them. Based on this consensus, the prediction function M (Softmax function) predicts the probability of each action class for the whole video. Combined with the standard categorical cross-entropy loss, the final loss function regarding the segmental consensus G = (G1, G2, ..., GC) may read:

L(y, G) = - sum_{i=1..C} yi * (Gi - log sum_{j=1..C} exp(Gj)).

Here, C is the number of action classes and yi is the ground-truth label concerning class i. A class score Gi is inferred from the scores of the same class on all the snippets, using an aggregation function.
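For a one-hot ground-truth label, the loss above can be sketched in a few lines of Python; the function name and the plain-list representation of the consensus scores G are illustrative.

import math

def segmental_consensus_loss(consensus_scores, target_class):
    # Cross-entropy over the video-level consensus scores G for a one-hot label y.
    log_sum_exp = math.log(sum(math.exp(g) for g in consensus_scores))
    return log_sum_exp - consensus_scores[target_class]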

Inception with Batch Normalization (bn-inception) may be chosen as another building block of the deep-learning network 102. That is, the deep-learning network 102 may particularly be or include a bn-inception type of network. The bn-inception type of network may be specifically chosen due to its good balance between accuracy and efficiency.

The bn-inception architecture may be specifically designed for the two-stream ConvNets as the first building block. The spatial stream ConvNet may operate on a single RGB image, and the temporal stream ConvNet may take a stack of consecutive OF fields as input. The two-stream ConvNets may use RGB images for the spatial stream and stacked OF fields for the temporal stream. A single RGB image usually encodes a static appearance at a specific time point and lacks the contextual information about previous and next frames. The temporal stream ConvNet takes the OF fields as input and aims to capture the motion information. In realistic videos, however, there usually exists camera motion, and the OF fields may not concentrate on the human movements.

A ResNet framework may be chosen as another building block of the deep-learning network 102. Although deep networks have better performance in classification most of the time, they are harder to train than ResNets, which is mainly due to two reasons:

1. Vanishing/exploding gradients: sometimes a neuron dies during the training process and, depending on its activation function, it might never be in operation again. This problem can be resolved by employing some initialization techniques.

2. Harder optimization: when the model introduces more parameters, it becomes more difficult to train the network.

The main difference in ResNets is that they have shortcut connections parallel to their normal convolutional layers. This results in faster training and also provides a clear path for gradients to back-propagate to the early layers of the network. This makes the learning process faster by avoiding vanishing gradients or dead neurons.

A ResNet model specifically designed for the deep-learning network 102 of the device 100 according to an embodiment of the invention may accept images and classify them. A naive method could be to just up-sample an image and then feed it to the trained model, or to just skip the first layer and insert the original image as the input of the second convolutional layer, and then fine-tune a few of the last layers to get higher accuracy.

To sum up, as already mentioned above, the deep-learning network 102 of the device 100 according to an embodiment of the invention may be based on TSN-bn-inception with skip connections (as proposed in the ResNets). For obtaining such a network 102, a two-step approach may be applied, as described in the following:

1. When stacked with more layers on top of a very deep network model, skip connections and deep residual layers may be used to allow the network to learn deviations from the identity layer.

2. The network may be simplified by reducing the layers and approximating them with layers which can better distinguish the features and improve the accuracy.

An embodiment of the deep-learning network consists, for example, of 202 layers in total. The residual connections are also part of the network. The Rectified Linear Unit (ReLU) layer is connected to the convolutional layer of every sub-inception unit. Also, the convolutional layer of this unit is connected to the output of the 8th unit placed in the middle of the network. An addition layer connects the input with a batch normalization layer and can lead to a ReLU layer after the addition process. Modifications were made to the following parts, for both the RGB and OF streams and throughout the network (a minimal sketch of this addition pattern is given after the layer definitions below):

1. An addition layer is placed between the convolutional inception_3a_1x1 and bn inception_3a_1x1_bn layers and connects an input directly to the addition layer.

2. Output of a batch_normalization('inception_3b_1x1_bn') as an input to an addition layer which is placed between convolutional 'inception_3c_3x3' and bn 'inception_3c_3x3_bn' layers.

3. Output of a batch_normalization('inception_3a_double_3x3_2_bn') as an input to an addition layer which is placed between convolutional 'inception_3b_double_3x3_2' and bn 'inception_3b_double_3x3_2_bn' layers.

4. Output of a batch_normalization('inception_3c_3x3_bn') as an input to an addition layer which is placed between convolutional 'inception_4a_3x3' and bn 'inception_4a_3x3_bn' layers.

5. Output of a batch_normalization('inception_4a_3x3_bn') as an input to an addition layer which is placed between convolutional 'inception_4b_3x3' and bn 'inception_4b_3x3_bn' layers.

6. Output of a batch_normalization('inception_4c_pool_proj_bn') as an input to an addition layer which is placed between convolutional 'inception_4e_3x3' and bn 'inception_4e_3x3_bn' layers.

7. Output of a batch_normalization('inception_4e_double_3x3_2_bn') as an input to an addition layer which is placed between convolutional 'inception_5a_3x3' and bn 'inception_5a_3x3_bn' layers.

8. Output of a batch_normalization('inception_5a_1x1_bn') as an input to an addition layer which is placed between convolutional 'inception_5a_double_3x3_2' and bn 'inception_5a_double_3x3_2_bn' layers.

9. Output of a batch_normalization('inception_5b_3x3_bn') as an input to an addition layer which is placed between convolutional 'inception_5b_pool_proj' and bn 'inception_5b_pool_proj_bn' layers.

The layers referenced above are defined as follows:

inception_3a_1x1 = Convolution2D(192, 64, kernel_size=(1, 1), stride=(1, 1))
inception_3a_1x1_bn = like above, in addition to the batch normalization
inception_3a_3x3 = Convolution2D(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
inception_3a_5x5 = Convolution2D(32, 32, kernel_size=(5, 5), stride=(1, 1), padding=(1, 1))
inception_3a_pool_proj = Convolution2D(192, 32, kernel_size=(1, 1), stride=(1, 1))
inception_3a_double_3x3_2 = Convolution2D(96, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
inception_3c_3x3 = Convolution2D(320, 128, kernel_size=(1, 1), stride=(1, 1))
inception_4a_3x3 = Convolution2D(576, 64, kernel_size=(1, 1), stride=(1, 1))
inception_4c_pool_proj = Convolution2D(576, 128, kernel_size=(1, 1), stride=(1, 1))
inception_4e_double_3x3 = Convolution2D(608, 192, kernel_size=(1, 1), stride=(1, 1))
inception_5a_1x1 = Convolution2D(1056, 352, kernel_size=(1, 1), stride=(1, 1))
inception_5b_3x3 = Convolution2D(192, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

In cases where bn is part of the name, batch normalization is applied afterwards, as given in the inception_3a_1x1_bn example.
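A minimal PyTorch sketch of modification 1, i.e. the addition layer placed between inception_3a_1x1 and its batch normalization with the block input fed directly into the addition, might look as follows. The 1x1 projection on the skip path is an assumption made here only so that the channel counts match; the text above merely states that the input is connected to the addition layer.

import torch.nn as nn

class ResidualInceptionUnit(nn.Module):
    # Sketch of the addition pattern: conv -> addition with the input -> BN -> ReLU.
    def __init__(self, in_channels=192, out_channels=64):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1)
        self.project = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # assumed projection for the skip path
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv(x) + self.project(x)  # addition layer fed directly by the input
        return self.relu(self.bn(out))        # batch normalization, then ReLU after the addition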

Due to space limitations, not the entire modified network is shown in this form. However, by adding a simple shortcut connection, accuracy is improved in the inference phase and the training process becomes faster. The tradeoff is that ResNets are more prone to overfitting, which is not desirable. By using a dropout layer and data augmentation, the overfitting can be reduced, as shown in experiments.

A basic block example of the network is shown in FIG. 5, and a zoomed-in part of the network looks, for example, like the one shown in FIG. 6.

In the following, details of the data augmentation are described. The learning effectiveness of deep-learning networks is known to depend on the availability of sufficiently large training data. Data augmentation is an effective method to expand the training data by applying transformations and deformations to the labeled data, resulting in new samples as additional training data. In this work, the following data augmentation techniques were used: random brightness, random flip (left-to-right flipping) and a bit of random contrast. The addition of these varying versions of images enables the networks to model discriminative characteristics pertaining to this variety of representations. Thus, the training of the deep networks with the augmented data improves their generalization on unseen samples.

In the following, details of the network training are described. A cross-modality pre-training technique is applied, in which RGB models are utilized to initialize the temporal networks. First, the OF fields (OF snippets) are discretized into the interval from 0 to 255 by a linear transformation. This step makes the range of the OF fields the same as that of the RGB images (RGB snippets). Then, the weights of the first convolution layer of the RGB models are modified to handle the input of OF fields. Specifically, the weights are averaged across the RGB channels and this average is replicated according to the channel number of the temporal network input.
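A minimal PyTorch sketch of this cross-modality initialization is given below; the number of OF input channels is an assumption (e.g. 5 stacked flow fields with 2 channels each).

import torch.nn as nn

def cross_modality_init(rgb_conv1: nn.Conv2d, flow_channels: int = 10) -> nn.Conv2d:
    # Initialize the temporal stream's first convolution from an RGB-pretrained one:
    # average its weights over the 3 RGB channels and replicate the average.
    w = rgb_conv1.weight.data                 # shape (out_c, 3, kH, kW)
    avg = w.mean(dim=1, keepdim=True)         # average across the RGB channels
    flow_conv1 = nn.Conv2d(flow_channels, rgb_conv1.out_channels,
                           kernel_size=rgb_conv1.kernel_size,
                           stride=rgb_conv1.stride,
                           padding=rgb_conv1.padding)
    flow_conv1.weight.data = avg.repeat(1, flow_channels, 1, 1)  # replicate per OF channel
    return flow_conv1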

During the learning process, batch normalization will estimate the activation mean and variance within each batch and use them to transform these activation values into a standard Gaussian distribution. This operation speeds up the convergence of training but also leads to over-fitting in the transferring process, due to the biased estimation of activation distributions from a limited number of training samples. Therefore, after initialization with pre-trained models, the mean and variance parameters of all batch normalization layers except the first one are frozen. As the distribution of the OF is different from the RGB images, the activation values of the first convolution layer will have a different distribution and the mean and variance need to be re-estimated accordingly. An extra dropout layer is added after the global pooling layer in the bn-inception architecture to further reduce the effect of over-fitting. Data augmentation can generate diverse training samples and prevent severe over-fitting. In the original two-stream ConvNets, random left-to-right flipping in addition to random contrast and brightness are employed to augment training samples. Also, we fix the size of the input image or optical flow fields, and the width and height of all training and validation images. This is similarly achieved for the testing images as well.
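A minimal PyTorch sketch of freezing all batch normalization layers except the first one might look as follows; in a real training loop this would have to be re-applied after each call to model.train().

import torch.nn as nn

def freeze_bn_except_first(model: nn.Module):
    # Freeze the mean/variance estimation (and optionally the affine parameters)
    # of every BatchNorm layer except the first one.
    first_seen = False
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            if not first_seen:
                first_seen = True  # keep the first BN adaptive, since OF statistics differ from RGB
                continue
            module.eval()          # stop re-estimating the running mean and variance
            for p in module.parameters():
                p.requires_grad = False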

In the following, details of the fusion part and the network testing are described. Due to the fact that all snippet-level ConvNets share the model parameters in temporal segment networks, the learned models can perform frame-wise evaluation as normal ConvNets. This allows carrying out a fair comparison with models learned without the temporal segment network framework. Specifically, we follow the testing scheme of the original two-stream ConvNets, where we sample 25 RGB frames or optical flow stacks from the action videos. For the fusion of the spatial and temporal stream networks, we take a weighted average of them. This means the following: for each RGB frame of an input video, we find the top 8 predictions of the network output. Then we group all the predictions based on all the RGB frames and also choose the top 8 predictions (based on a majority of votes or a frequency of appearance). To output a prediction for the input video based on the RGB part only, we take the first-ranked (most likely) prediction as the correct one and compare it with its label (ground-truth prediction). For the OF part, we repeat the same process and obtain the prediction based on this part only. To fuse the RGB and OF parts, we repeat the processes for RGB and OF, but this time we take the top 13 predictions into account. Then we find a union (sum) of normalized predictions for the same label from both parts and choose the one with the most votes.

Some variations of this fusion heuristic have been tested and applied. One way is to employ the normalized Softmax predictions (normalized in the [0, 1] interval) and to scale the normalized frequency of appearance with the aforementioned ones. Thus, one would be able to get the outputs for the RGB and Optical Flow parts similarly to what we described above. Then, in the fusion process, we find the sum of the normalized weighted scaled frequency of appearance for all the predictions. Eventually, for the output, we would take the most likely prediction and compare it to the ground-truth label. Another variation of the fusion function is to employ additional weights on top of the scaled joint predictions discussed above and/or to take the first and second ranked guesses based on the threshold difference between them: by observing differences between the top two ranked predictions for RGB and for OF in many confidence pairs, we concluded that for differences beyond some reliably large thresholds, and for both RGB and OF, the correct prediction was the 1st-ranked one, based either on the RGB or on the OF (or both). Then we take that prediction as the correct one.

Another thing worth mentioning is that the performance gap between the spatial stream ConvNets and the temporal stream ConvNets is much smaller than that in the original two-stream ConvNets. Based on this fact, embodiments give equal credit to the spatial and temporal streams by setting their weights to 1 and analyzing their outputs on a per-classification basis. Thus, the segmental consensus function is applied before the Softmax normalization. To test a video prediction, we decided to fuse the prediction scores of all extracted predictions for RGB and Optical Flow frames and different streams before the Softmax normalization.

FIG. 7 shows a method according to an embodiment of the invention. The method 700 is for recognizing one or more activities in a video 101, each activity being associated with a predetermined label 104. The method 700 employs a deep-learning network 102 and may be carried out by the device 100 shown in FIG. 1 or FIG. 2.

In an inference phase, the method 700 comprises: a step 701 of receiving the video 101; a step 702 of separating the video 101 into an RGB part 101a and an OF part 101b; a step 703 of employing a spatial part 102a of the deep-learning network 102 to calculate a plurality of spatial label predictions 103a based on the RGB part 101a; a step 704 of employing a temporal part 102b of the deep-learning network 102 to calculate a plurality of temporal label predictions 103b based on the OF part 101b; and a step 705 of fusing the spatial and temporal label predictions 103a, 103b to obtain a label 104 associated with an activity in the video 101.

Embodiments of the invention may be implemented in hardware, software or any combination thereof. Embodiments of the invention, e.g. the device and/or the hardware implementation, may be implemented as any of a variety of suitable circuitry, such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, etc., or any combinations thereof. Embodiments may comprise computer program products comprising program code for performing, when implemented on a processor, any of the methods described herein. Further embodiments may comprise at least one memory and at least one processor, which are configured to store and execute program code to perform any of the methods described herein. For example, embodiments may comprise a device configured to store instructions for software in a suitable, non-transitory computer-readable storage medium and to execute the instructions in hardware using one or more processors to perform any of the methods described herein.

The present invention has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed invention, from a study of the drawings, this disclosure and the independent claims. In the claims as well as in the description, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.