Title:
SYSTEM AND METHOD FOR DETECTING PERSON ACTIVITY IN VIDEO
Document Type and Number:
WIPO Patent Application WO/2021/032295
Kind Code:
A1
Abstract:
A system and a method for detecting person activity in a video are disclosed. The method comprises detecting one or more persons in a frame of the video; generating a first set of feature vectors based respectively on a set of frames of the video; generating a second set of person-specific feature vectors based on the set of frames; determining a temporal scene context vector for the frame, based on the first set of feature vectors and the second set of person-specific feature vectors; and for a detected person of the one or more detected persons, determining a hidden state vector for the frame, based on the temporal scene context vector and a subset of the second set of person-specific feature vectors corresponding to the detected person; and detecting one or more activities performed by the detected person in the frame based on the determined hidden state vector.

Inventors:
MINCIULLO LUCA (BE)
SOURI YASER (DE)
GALL JÜRGEN (DE)
BISWAS SOVAN (DE)
Application Number:
PCT/EP2019/072348
Publication Date:
February 25, 2021
Filing Date:
August 21, 2019
Assignee:
TOYOTA MOTOR EUROPE (BE)
UNIV BONN RHEINISCHE FRIEDRICH WILHELMS (DE)
International Classes:
G06V10/25
Other References:
SINGH BHARAT ET AL: "A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection", 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 27 June 2016 (2016-06-27), pages 1961 - 1970, XP033021374, DOI: 10.1109/CVPR.2016.216
TKACHENKO DMYTRO: "Human Action Recognition Using Fusion of Modern Deep Convolutional and Recurrent Neural Networks", 2018 IEEE FIRST INTERNATIONAL CONFERENCE ON SYSTEM ANALYSIS & INTELLIGENT COMPUTING (SAIC), IEEE, 8 October 2018 (2018-10-08), pages 1 - 6, XP033433486, DOI: 10.1109/SAIC.2018.8516860
GKIOXARI ET AL.: "Finding action tubes", CVPR, 2015, pages 759 - 768, XP032793486, DOI: 10.1109/CVPR.2015.7298676
HOU: "Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos", ICCV, 2017, pages 5822 - 5831
KALOGEITON ET AL.: "Action Tubelet Detector for Spatio-Temporal Action Localization", ICCV, 2017, pages 4415 - 4423, XP033283315, DOI: 10.1109/ICCV.2017.472
SINGH: "Online Real-Time Multiple Spatiotemporal Action Localisation and Prediction", ICCV, 2017, pages 3657 - 3666, XP033283236, DOI: 10.1109/ICCV.2017.393
ROSS GIRSHICK: "Fast R-CNN", ICCV, 2015, pages 1440 - 1448, XP032866491, DOI: 10.1109/ICCV.2015.169
CHAO: "Rethinking the Faster R-CNN Architecture for Temporal Action Localization", CVPR, 2018, pages 1130 - 1139, XP033476075, DOI: 10.1109/CVPR.2018.00124
WEINZAEPFEL ET AL., TOWARDS WEAKLY-SUPERVISED ACTION LOCALIZATION, 2016
LI: "Recurrent tubelet proposal and recognition networks for action detection", ECCV, 2018, pages 303 - 318
GIRDHAR, A BETTER BASELINE FOR AVA, 2018
IBRAHIM: "A Hierarchical Deep Temporal Model for Group Activity Recognition", CVPR, 2016, pages 1971 - 1980, XP033021375, DOI: 10.1109/CVPR.2016.217
BISWAS ET AL.: "Structural Recurrent Neural Network (SRNN) for Group Activity Analysis", WACV, 2018, pages 1625 - 1632, XP033337832, DOI: 10.1109/WACV.2018.00180
TANG ET AL.: "Mining Semantics-Preserving Attention for Group Activity Recognition", ACM-MM, 2018, pages 1283 - 1291, XP058420861, DOI: 10.1145/3240508.3240576
REN: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS, 2015, pages 91 - 99
HE: "Deep Residual Learning for Image Recognition", CVPR, 2016, pages 770 - 778, XP055536240, DOI: 10.1109/CVPR.2016.90
CARREIRA: "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset", CVPR, 2017, pages 4724 - 4733
GU: "AVA: A Video dataset of Spatio-Temporally Localized Atomic Visual Actions", CVPR, 2018, pages 6047 - 6056, XP033473520, DOI: 10.1109/CVPR.2018.00633
SCHUSTER, MIKE; PALIWAL, KULDIP K.: "Bidirectional recurrent neural networks", IEEE TRANSACTIONS ON SIGNAL PROCESSING, vol. 45, no. 11, 1997, pages 2673 - 2681
LECUN, YANN: "Backpropagation applied to handwritten zip code recognition", NEURAL COMPUTATION, vol. 1.4, 1989, pages 541 - 551, XP000789854
Attorney, Agent or Firm:
DELUMEAU, François et al. (FR)
Claims:
CLAIMS

1. A method for detecting person activity in a video, comprising: detecting (104) one or more persons in a frame of the video; generating (106) a first set of feature vectors (110) based respectively on a set of frames of the video, the set of frames including the frame and contained in a frame window centered at the frame; generating (112) a second set of person-specific feature vectors (114) based on the set of frames; determining, using a first trained neural network (116), a temporal scene context vector (118) for the frame, based on the first set of feature vectors (110) and the second set of person-specific feature vectors (114); and for a detected person of the one or more detected persons, determining, using a second trained neural network (120), a hidden state vector (122) for the frame, based on the temporal scene context vector (118) and a subset (114_i) of the second set of person-specific feature vectors corresponding to the detected person; and detecting (124) one or more activities (126) performed by the detected person in the frame based on the determined hidden state vector (122).

2. The method of claim 1, wherein determining the temporal scene context vector (118) for the frame comprises: performing a maxpool operation (302) based on the second set of person-specific feature vectors (114) to generate a maxpooled person feature vector (304); concatenating (306) the maxpooled person feature vector (304) with the first set of feature vectors (110) to generate a concatenated feature vector (308); and applying the concatenated feature vector (308) together with temporal context vectors (312) of a preceding frame and a consecutive frame to a bidirectional Gated Recurrent Unit, BiGRU (310) to generate the temporal scene context vector (118) for the frame.

3. The method of any of claims 1 and 2, wherein determining the hidden state vector (122) for the frame comprises determining the hidden state vector (122) based on the temporal scene context vector (118), the subset (114_i) of the second set of person-specific feature vectors corresponding to the detected person, and a combined hidden state vector (406) corresponding to the one or more detected persons in the frame.

4. The method of claim 3, wherein the combined hidden state vector (406) contains information representative of relations between potential activities performed by the one or more detected persons in the frame.

5. The method of any of claims 3 and 4, wherein determining the hidden state vector (122) for the frame comprises: performing a maxpool operation (404) on previous iteration hidden state vectors (402) corresponding respectively to the one or more detected persons in the frame to generate the combined hidden state vector (406); concatenating (408) the combined hidden state vector (406) with the temporal scene context vector (118) and the subset (114_i) of the second set of person-specific feature vectors corresponding to the detected person to generate a concatenated vector (410); and applying the concatenated vector (410) to a Gated Recurrent Unit, GRU (412) to generate the hidden state vector (122) corresponding to the frame.

6. The method of claim 1, wherein detecting (124) the one or more activities (126) performed by the detected person in the frame comprises: applying the determined hidden state vector (122) to a fully-connected layer (502) to generate a set of probabilities (504_1, 504_M), each probability corresponding to a respective activity of a set of activities and indicating the probability that the detected person is performing the respective activity in the frame; and detecting (506) the one or more activities (126) performed by the detected person in the frame based on the generated set of probabilities (504_1, 504_M).

7. The method of claim 6, wherein applying the determined hidden state vector (122) to the fully-connected layer (502) comprises applying the hidden state vector (122) to the fully-connected layer together with equivalent hidden state vectors corresponding to other detected persons in the frame.

8. A system (100) for detecting person activity in a video, comprising: a person detector (104) configured to detect one or more persons in a frame of the video; a feature extractor (106) configured to generate a first set of feature vectors (110) based respectively on a set of frames of the video, the set of frames including the frame and contained in a frame window centered at the frame; a Region of Interest, RoI pooling module (112) configured to generate a second set of person-specific feature vectors (114) based on the set of frames; a first trained neural network (116) configured to determine a temporal scene context vector (118) for the frame, based on the first set of feature vectors (110) and the second set of person-specific feature vectors (114); a second trained neural network (120) configured to determine a hidden state vector (122) for the frame for a detected person of the one or more detected persons, based on the temporal scene context vector (118) and a subset (114_i) of the second set of person-specific feature vectors corresponding to the detected person; and an activity detector (124) configured to detect one or more activities (126) performed by the detected person in the frame based on the determined hidden state vector (122).

9. A computer program (PROG) including instructions that when executed by a processor (202) cause the processor (202) to execute a method according to any one of claims 1 to 7.

10. A computer-readable medium (204) having recorded thereon a computer program according to claim 9.

Description:
SYSTEM AND METHOD FOR DETECTING PERSON ACTIVITY IN VIDEO

Field of the disclosure

The present disclosure relates to the field of person activity detection in a video.

Description of Related Art

The task of person activity detection in a video concerns recognizing the activities of persons appearing in frames of the video where each person can perform one or more activities in a frame.

A common conventional approach includes detecting bounding boxes in each frame using object detectors and linking the detected bounding boxes to obtain action tubes, which are then classified. Works describing aspects of this conventional approach include "Gkioxari et al., "Finding action tubes," in CVPR, 2015, pp. 759-768," "Hou et al., "Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos," in ICCV, 2017, pp. 5822-5831," "Kalogeiton et al., "Action Tubelet Detector for Spatio-Temporal Action Localization," in ICCV, 2017, pp. 4415-4423," "Singh et al., "Online Real-Time Multiple Spatiotemporal Action Localisation and Prediction," in ICCV, 2017, pp. 3657-3666," "Ross Girshick, "Fast R-CNN," in ICCV, 2015, pp. 1440-1448," and "Chao et al., "Rethinking the Faster R-CNN Architecture for Temporal Action Localization," in CVPR, 2018, pp. 1130-1139."

A deficiency of this conventional approach is that in order for it to learn an underlying model (using full supervision) a fully-annotated training dataset is required. This means that for every frame of the training dataset all actions (activities) must be labeled and all persons must be identified with bounding boxes. This is very expensive to implement in practice. For example, the large-scale AVA 2.1 dataset provides only sparse annotations at the rate of one frame per second (i.e., every second, only one frame of the video is annotated).

Other conventional approaches may be trained with sparsely annotated datasets (e.g., one annotated frame per second). These approaches are described in "Weinzaepfel et al., "Towards weakly-supervised action localization," arXiv: 1605.05197, 2016," "Li et al., "Recurrent tubelet proposal and recognition networks for action detection," in ECCV, 2018, pp. 303-318," and "Girdhar et al., "A better baseline for AVA," arXiv: 1807.10066, 2018." However, these methods treat each person in the frame independently, i.e., a person's activity is inferred without regard to potential interactions between persons in the frame which may or may not impact the person's activity.

In the context of group activity analysis, the relations between various individual persons are used to infer an action label for the group as well as for the individuals. Typically, these methods rely on a hierarchical model in which the individuals' actions are modeled in a lower level model and the group activity is modeled in an upper level model. Works describing these methods include "Ibrahim et al., "A Hierarchical Deep Temporal Model for Group Activity Recognition," in CVPR, 2016, pp. 1971-1980," "Biswas et al., "Structural Recurrent Neural Network (SRNN) for Group Activity Analysis," in WACV, 2018, pp. 1625-1632," and "Tang et al., "Mining Semantics-Preserving Attention for Group Activity Recognition," in ACM-MM, 2018, pp. 1283-1291."

However, these methods assume that each person is a part of a group (e.g., a sports team) and that each person performs a single action as part of the group, i.e., that the person's activity necessarily depends on being part of the group. As such, these methods are not suitable for videos in which a person may perform multiple activities in a frame and in which the person's activities may or may not be based on interactions with other persons.

Summary

The present disclosure overcomes one or more deficiencies of the prior art by proposing a method for detecting person activity in a video, comprising: detecting one or more persons in a frame of the video; generating a first set of feature vectors based respectively on a set of frames of the video, the set of frames including the frame and contained in a frame window centered at the frame; generating a second set of person-specific feature vectors based on the set of frames; determining, using a first trained neural network, a temporal scene context vector for the frame, based on the first set of feature vectors and the second set of person-specific feature vectors; and for a detected person of the one or more detected persons, determining, using a second trained neural network, a hidden state vector for the frame, based on the temporal scene context vector and a subset of the second set of person-specific feature vectors corresponding to the detected person; and detecting one or more activities performed by the detected person in the frame based on the determined hidden state vector.

Accordingly, the one or more activities detected for a person are identified as a function of a temporal scene context vector, which describes the temporal scene taking place around the frame being processed. The temporal scene context is defined by a frame window whose size may be configured as desired. For example, the size of the frame window may be set depending on the activities of greatest interest in the video (certain actions are better recognized with a larger temporal context than others).

The one or more activities detected for the person are also identified as a function of a hidden state vector, which contains implicit information describing what is happening in the scene from the perspective of the detected person.

In an embodiment, detecting the one or more persons in the frame comprises determining one or more bounding boxes corresponding to the locations of one or more persons in the frame. In an embodiment, the one or more bounding boxes are determined using a Faster Region-Based Convolutional Neural Network (FRCNN). In an embodiment, the FRCNN has a Residual Network (ResNet) architecture.

In an embodiment, the frame window comprises L = 2l + 1 frames, where l > 0. In an embodiment, generating the first set of feature vectors comprises generating a first set of feature maps based on the L frames contained in the frame window; and generating the first set of feature vectors from the first set of feature maps.

In an embodiment, generating the first set of feature maps comprises extracting Inflated 3D (I3D) feature maps from the L frames contained in the frame window.

Each feature vector of the second set of person-specific feature vectors includes person-specific feature information corresponding to at least one person detected in the respective frame corresponding to the feature vector.

In an embodiment, generating the second set of person-specific feature vectors comprises extracting a second set of feature maps from the first set of feature maps; and generating the second set of person-specific feature vectors from the second set of feature maps. In an embodiment, extracting the second set of feature maps from the first set of feature maps comprises, for each feature map of the first set of feature maps (which corresponds to a respective frame in the frame window), applying Region of Interest (RoI) pooling to the feature map based on one or more bounding boxes detected in the respective frame.

In an embodiment, the first trained neural network is a recurrent neural network (RNN).

In an embodiment, determining the temporal scene context vector for the frame comprises: performing a maxpool operation based on the second set of person-specific feature vectors to generate a maxpooled person feature vector; concatenating the maxpooled person feature vector with the first set of feature vectors to generate a concatenated feature vector; and applying the concatenated feature vector together with temporal context vectors of a preceding frame and a consecutive frame to a bidirectional Gated Recurrent Unit (BiGRU) to generate the temporal scene context vector for the frame.

The maxpool operation based on the second set of person-specific feature vectors allows for the different person-specific feature vectors to be combined in a compact manner, which reduces redundant information. The resulting maxpooled feature vector is relatively small in size, which increases the speed of determining the temporal scene context vector.

In an embodiment, determining the hidden state vector for the frame comprises determining the hidden state vector based on the temporal scene context vector, the subset of the second set of person-specific feature vectors corresponding to the detected person, and a combined hidden state vector corresponding to the one or more detected persons in the frame. The combined hidden state vector contains information representative of relations between potential activities performed by the one or more detected persons in the frame. As such, the one or more activities detected for the person are identified additionally based on interactions taking place between persons in the frame. In other words, the detection also takes into account whether or not a particular interaction impacts the person's activity and determines the person's activity accordingly.

In an embodiment, the second trained neural network is a recurrent neural network (RNN).

In an embodiment, determining the hidden state vector for the frame comprises: performing a maxpool operation on previous iteration hidden state vectors corresponding respectively to the one or more detected persons in the frame to generate the combined hidden state vector; concatenating the combined hidden state vector with the temporal scene context vector and the subset of the second set of person-specific feature vectors corresponding to the detected person to generate a concatenated vector; and applying the concatenated vector to a Gated Recurrent Unit (GRU) to generate the hidden state vector corresponding to the frame.

In an embodiment, detecting the one or more activities performed by the detected person in the frame comprises: applying the determined hidden state vector to a fully-connected layer to generate a set of probabilities, each probability corresponding to a respective activity of a set of activities and indicating the probability that the detected person is performing the respective activity in the frame; and detecting the one or more activities performed by the detected person in the frame based on the generated set of probabilities. In an embodiment, applying the determined hidden state vector to the fully-connected layer comprises applying the hidden state vector to the fully-connected layer together with equivalent hidden state vectors corresponding to other detected persons in the frame. As such, the detection of the one or more activities performed by the detected person benefits from information about the interactions taking place between the different actors in the video. As certain activities involve interaction between people (e.g., looking at someone, talking to someone, etc.) and other activities do not (e.g., sitting, standing, etc.), the interaction information helps distinguish at least between these two categories of activities.

In an embodiment, detecting the one or more activities for the detected person in the frame comprises selecting one or more activities from the set of activities having respective probabilities above a threshold. The threshold may be configurable, which allows for increasing or decreasing the confidence level with which activities are selected.

The present disclosure also proposes a system for detecting person activity in a video, comprising: a person detector configured to detect one or more persons in a frame of the video; a feature extractor configured to generate a first set of feature vectors based respectively on a set of frames of the video, the set of frames including the frame and contained in a frame window centered at the frame; a Region of Interest (RoI) pooling module configured to generate a second set of person-specific feature vectors based on the set of frames; a first trained neural network configured to determine a temporal scene context vector for the frame, based on the first set of feature vectors and the second set of person-specific feature vectors; a second trained neural network configured to determine a hidden state vector for the frame for a detected person of the one or more detected persons, based on the temporal scene context vector and a subset of the second set of person-specific feature vectors corresponding to the detected person; and an activity detector configured to detect one or more activities performed by the detected person in the frame based on the determined hidden state vector.

This system may be configured to perform any of the above-described method embodiments of the present disclosure.

In an embodiment, any of the above-described method embodiments may be implemented as instructions of a computer program. As such, the present disclosure provides a computer program including instructions that when executed by a processor cause the processor to execute a method according to any of the above-described method embodiments.

The computer program can use any programming language and may take the form of a source code, an object code, or a code intermediate between a source code and an object code, such as a partially compiled code, or any other desirable form.

The computer program may be recorded on a computer-readable medium. As such, the present disclosure is also directed to a computer- readable medium having recorded thereon a computer program as described above. The computer-readable medium can be any entity or device capable of storing the computer program.

Brief description of the drawings

Further features and advantages of the present disclosure will become apparent from the following description of certain embodiments thereof, given by way of illustration only, not limitation, with reference to the accompanying drawings in which:

FIG. 1 illustrates a system for detecting person activity in a video according to an embodiment;

FIG. 2 illustrates an example computer implementation of the system of FIG. 1;

FIG. 3 illustrates an example implementation of a first neural network according to an embodiment;

FIG. 4 illustrates an example implementation of a second neural network according to an embodiment; and

FIG. 5 illustrates an example implementation of an activity detector according to an embodiment.

Description of embodiments

FIG. 1 illustrates a system 100 for detecting person activity in a video according to an embodiment. System 100 is provided for the purpose of illustration only and is not limiting of embodiments of the present disclosure.

As shown in FIG. 1, system 100 includes a person detector 104, a feature extractor 106, a Region of Interest (ROI) pooling module 112, a hierarchical neural network 130, and an activity detector 124.

In an embodiment, the hierarchical neural network 130 includes a first trained neural network 116 and a second trained neural network 120. The first trained neural network 116 and the second trained neural network 120 may be trained together or separately during a training phase. Only a sparsely-annotated training dataset (e.g., one annotated frame or less per second) is required to train the first neural network 116 and the second neural network 120. This is due to the fact that neural network 130 is configured to operate on the basis of a frame window of several video frames as further described below. In an embodiment, the AVA 2.1 dataset is used to train the first neural network 116 and the second neural network 120. The AVA 2.1 dataset is annotated for 60 action classes (i.e., 60 person activities) and includes 235 videos of 15 minutes each, annotated at the rate of one frame per second.

In an embodiment, system 100 may be implemented on a computer system such as computer system 200 shown in FIG. 2. Specifically, system 100 may be implemented as a computer program including instructions that, when executed by a processor 202 of computer system 200, cause the processor 202 to execute methods or functions of system 100 as further described herein. In an embodiment, the computer program may be recorded on a computer-readable medium 204 of computer system 200.

Returning to FIG. 1, system 100 is configured to receive a video including a series of frames 102. It is noted that a frame corresponds to a single image of the video. As further described below, for a given frame (hereinafter referred to as "the frame" or the "frame t") of the video, system 100 detects one or more activities for at least one detected person in the frame (assuming that a person is detected in the frame). The processing as further described herein is repeated per frame.

As shown in FIG. 1, the frames 102 are provided to person detector 104 and feature extractor 106.

Person detector 104 acts on the frames 102 on a frame-by-frame basis. Specifically, for a frame t of the video, person detector 104 is configured to detect one or more persons in the frame.

In an embodiment, detecting the one or more persons in the frame comprises determining one or more bounding boxes corresponding to the locations of one or more persons in the frame. In an embodiment, the one or more bounding boxes are determined using a Faster Region-Based Convolutional Neural Network (FRCNN) as described in "Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," in NIPS, 2015, pp. 91-99". The FRCNN may have a Residual Network (ResNet) architecture as described in "He et al., "Deep Residual Learning for Image Recognition," in CVPR, 2016, pp. 770-778".
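
As a concrete illustration, the person detection step can be sketched with an off-the-shelf Faster R-CNN detector with a ResNet backbone. The following Python snippet is a minimal sketch using a pretrained torchvision model as a stand-in for the detector described above; the pretrained weights, score threshold, and helper name are illustrative assumptions, not part of the patent.

```python
# Hypothetical sketch: frame-level person detection with a Faster R-CNN (ResNet backbone).
import torch
import torchvision

# Pretrained detector used as a stand-in; the patent trains and fine-tunes its own detector.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_persons(frame, score_threshold=0.8):
    # frame: (3, H, W) float tensor with values in [0, 1]
    with torch.no_grad():
        output = detector([frame])[0]
    # COCO label 1 is "person"; keep only confident person detections.
    keep = (output["labels"] == 1) & (output["scores"] >= score_threshold)
    return output["boxes"][keep]  # (N, 4) bounding boxes in (x1, y1, x2, y2) format
```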

Before being used as part of system 100, person detector 104 is trained to detect persons in frames. In an embodiment, person detector 104 is trained on a generic object dataset (e.g., PASCAL Visual Object Classes (VOC)) and then fine-tuned for improved person detection. Assuming that one or more persons are detected in the frame t by person detector 104, the processing of the frame t by system 100 continues. Otherwise, a next frame is processed. Hereinafter, it is assumed that one or more persons are detected in the frame t.

Feature extractor 106 is configured to extract features that can be used to describe the temporal scene context of the frame t. The temporal scene context of the frame t describes the scene that is taking place around the frame t. For example, a series of frames may show a person opening the door of a car, entering the car, and sitting behind the wheel. The temporal scene context for a frame in this series of frames is then a person entering a car.

In an embodiment, feature extractor 106 is pre-configured to define a frame window centered at the frame t based on which the temporal scene context of the frame t is to be determined. In an embodiment, the frame window comprises L = 2l + 1 frames, where l > 0. For example, the frame window may include 3 frames, i.e., the frame t, a frame t-1, and a frame t+1. The frames t-1, t, and t+1 may or may not be immediately consecutive within the series of frames, e.g., the frame t-1 may precede the frame t by one second, and the frame t+1 may follow the frame t by one second, but intervening frames may be present between the frames t and t-1 and the frames t and t+1.
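
For illustration, the frame window can be built by sampling l frames on each side of the frame t at a fixed stride (e.g., one second apart at the video frame rate). The helper below is a minimal sketch; the clamping at the video boundaries and the parameter names are assumptions.

```python
# Hypothetical helper: indices of the L = 2*l + 1 window frames centered at frame t.
def frame_window_indices(t, l, stride, num_frames):
    indices = [t + k * stride for k in range(-l, l + 1)]
    # Clamp to valid indices near the start/end of the video (an assumption).
    return [min(max(i, 0), num_frames - 1) for i in indices]

# Example: a 3-frame window around frame 250, one second apart at 25 fps.
print(frame_window_indices(t=250, l=1, stride=25, num_frames=10000))  # [225, 250, 275]
```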

In an embodiment, feature extractor 106 is configured to generate a first set of feature vectors 110 based respectively on a set of frames of the video. The set of frames include the frame t and are contained in a frame window centered at the frame t. Assuming a frame window of size L = 2l + 1, i.e., l frames before and l frames after the frame t, the first set of feature vectors 110 includes L feature vectors. Hereinafter, the L feature vectors are treated as a vector sequence spanning the frame window. Assuming that the L frames are separated by one second, the L feature vectors correspond to a temporal scene context of L seconds.

In an embodiment, generating the first set of feature vectors 110 comprises generating a first set of feature maps based on the L frames contained in the frame window; and generating the first set of feature vectors 110 from the first set of feature maps. In an embodiment, the first set of feature maps are flattened to a single dimension to obtain the corresponding first set of feature vectors 110.

In an embodiment, a single feature map, and feature vector, is generated per frame of the frame window.

A feature map contains spatio-temporal information that captures what can be viewed in the frame (e.g., objects and activities appearing in the frame). Typically, a feature map may include one or more features. Each feature may be computed based on a temporal context of several frames that surround the frame for which the feature map is computed.

In an embodiment, generating the first set of feature maps comprises extracting Inflated 3D (I3D) feature maps from the L frames contained in the frame window. This may be done as described in "Carreira et al., "Quo Vadis, Action recognition? A new model and the kinetics dataset," in CVPR, 2017, pp. 4724-4733".
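
The sketch below illustrates the idea of extracting one spatio-temporal feature map (and a flattened feature vector) per window frame. It uses torchvision's R3D-18 video backbone as a stand-in for the I3D network named above, so the backbone choice, layer wiring, and temporal averaging are assumptions rather than the patent's implementation.

```python
# Hypothetical sketch: one spatio-temporal feature map per window frame (R3D-18 as an I3D stand-in).
import torch
import torchvision

backbone = torchvision.models.video.r3d_18(weights="DEFAULT")
backbone.eval()

def frame_feature_map(clip):
    # clip: (3, T, H, W) short clip of frames surrounding one window frame
    with torch.no_grad():
        x = backbone.stem(clip.unsqueeze(0))
        x = backbone.layer1(x)
        x = backbone.layer2(x)
        x = backbone.layer3(x)
        x = backbone.layer4(x)
        # Average over the temporal dimension to keep a 2D spatial feature map.
        feature_map = x.mean(dim=2)[0]          # (C, H', W')
    return feature_map, feature_map.flatten()   # feature map and flattened feature vector
```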

Returning to FIG. 1, ROI pooling module 112 is configured to generate a second set of person-specific feature vectors 114 based on the set of frames contained in the frame window. Each feature vector of the second set of person-specific feature vectors 114 includes person-specific feature information corresponding to a respective person detected in the respective frame corresponding to the feature vector.

In an embodiment, ROI pooling module 112 receives from person detector 104 a dataset 108 including, for each frame of the L frames contained in the frame window, the one or more bounding boxes determined for the frame (corresponding to the one or more persons detected in the frame). In addition, ROI pooling module 112 receives from feature extractor 106 a dataset 128 containing the first set of feature maps generated by feature extractor 106 based on the L frames contained in the frame window.

In an embodiment, generating the second set of person-specific feature vectors 114 comprises extracting a second set of feature maps from the first set of feature maps; and generating the second set of person-specific feature vectors 114 from the second set of feature maps.

In an embodiment, extracting the second set of feature maps from the first set of feature maps comprises, for each feature map of the first set of feature maps (which corresponds to a respective frame in the frame window), applying Region of Interest (RoI) pooling to the feature map based on the one or more bounding boxes detected in the respective frame (obtained from dataset 108). Specifically, the one or more bounding boxes detected in the respective frame are used to pull specific feature map information from the feature map (of the first set) corresponding to the frame, the specific feature map information being the spatio-temporal information specific to the person(s) detected in the frame.
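
A minimal sketch of this step is shown below, using torchvision's roi_pool to pull person-specific features out of one frame's feature map. The pooled output size and the spatial scale are assumptions.

```python
# Hypothetical sketch: person-specific feature vectors via RoI pooling of a frame's feature map.
import torch
from torchvision.ops import roi_pool

def person_specific_vectors(feature_map, person_boxes, spatial_scale=1.0 / 16):
    # feature_map: (C, H', W') feature map of one window frame
    # person_boxes: (N, 4) person bounding boxes in frame coordinates
    pooled = roi_pool(
        feature_map.unsqueeze(0),      # add a batch dimension -> (1, C, H', W')
        [person_boxes],                # one list of boxes per image in the batch
        output_size=(7, 7),            # pooled spatial size (assumed)
        spatial_scale=spatial_scale,   # maps frame coordinates to feature-map coordinates
    )                                  # -> (N, C, 7, 7)
    return pooled.flatten(start_dim=1) # one person-specific feature vector per box
```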

In an embodiment, the second set of feature maps are extracted from the first set of feature maps 128 as described in "Gu et al., "AVA: A Video dataset of Spatio-Temporally Localized Atomic Visual Actions," in CVPR, 2018, pp. 6047-6056".

In an embodiment, for each frame contained in the frame window, one or more person-specific feature vectors corresponding respectively to one or more persons detected in the frame are generated. Generally, the persons detected in the set of L frames corresponding to the frame window are substantially the same over the set of frames, such that a detected person p_i is associated with L person-specific feature vectors, each corresponding to a respective frame of the L frames.

Hereinafter, the person-specific feature vector of a detected person p_i in the frame t is simply referred to as the person-specific feature vector of p_i for the frame t.

Returning to FIG. 1, the first neural network 116 is configured to receive the first set of L feature vectors 110 from feature extractor 106 and the second set of person-specific feature vectors 114 from ROI pooling module 112. The first neural network 116 is configured to determine a temporal scene context vector 118 for the frame t, based on the first set of feature vectors 110 and the second set of person-specific feature vectors 114. In an embodiment, the first neural network is a recurrent neural network (RNN).

FIG. 3 illustrates an example implementation 300 of the first neural network 116 according to an embodiment. As shown in FIG. 3, example first neural network 300 includes a maxpool module 302, a concatenation module 306, and a bidirectional Gated Recurrent Unit (BiGRU) 310. The maxpool module 302 is configured to perform a maxpool operation based on the second set of person-specific feature vectors 114 to generate a maxpooled person feature vector 304.

In an embodiment, the maxpooled person feature vector 304 is obtained by maxpooling the person-specific feature vectors corresponding to all detected persons (V) in the frame t.

The concatenation module 306 is configured to concatenate the maxpooled person feature vector 304 with the first set of L feature vectors to generate a concatenated feature vector 308.

In an embodiment, the concatenated feature vector 308 is obtained by a simple vector concatenation of the maxpooled person feature vector 304 with the first set of L feature vectors 110.
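
In one possible notation (an illustrative reconstruction; the patent's own symbols are not reproduced here), writing $f_{t-l}, \dots, f_{t+l}$ for the first set of feature vectors and $p_t^i$ for the person-specific feature vector of detected person i in frame t, these two operations can be written as:

$b_t = \max_{i \in V} p_t^i$ (element-wise maximum over the detected persons)

$c_t = f_{t-l} \oplus \cdots \oplus f_{t+l} \oplus b_t$, where $\oplus$ denotes vector concatenation.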

The concatenated feature vector 308 is applied together with temporal context vectors 312 of a preceding frame (t-1) and a consecutive frame (t+1) to BiGRU 310 to generate the temporal scene context vector 118 for the frame t. A BiGRU is a recurrent network in which temporal nodes are connected bidirectionally such that information from one node in the temporal sequence affects the adjacent nodes (t-1, t+1). A description of a BiGRU can be found, for example, in "Schuster, Mike and Kuldip K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, Vol. 45, No. 11, (1997): 2673-2681".

In an embodiment, the temporal scene context vector 118 for the frame t is computed by BiGRU 310 from the concatenated feature vector 308 together with the temporal scene context vectors 312 of the preceding frame (t-1) and the consecutive frame (t+1).
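
The sketch below illustrates the temporal scene context computation with a PyTorch bidirectional GRU run over the concatenated feature vectors of the window frames. The hidden size, the projection that merges the forward and backward states, and the per-window processing are assumptions rather than the patent's exact wiring.

```python
# Hypothetical sketch: temporal scene context vectors from a bidirectional GRU (BiGRU).
import torch
import torch.nn as nn

class TemporalSceneContext(nn.Module):
    def __init__(self, input_dim, context_dim):
        super().__init__()
        self.bigru = nn.GRU(input_dim, context_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * context_dim, context_dim)  # merge forward/backward states (assumed)

    def forward(self, concatenated_vectors):
        # concatenated_vectors: (L, input_dim), one concatenated feature vector 308 per
        # window frame; the BiGRU lets information from the preceding and consecutive
        # frames flow into each frame's temporal scene context vector.
        out, _ = self.bigru(concatenated_vectors.unsqueeze(0))  # (1, L, 2 * context_dim)
        return self.proj(out[0])                                # (L, context_dim)
```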

Returning to FIG. 1, the second neural network 120 is configured to receive the temporal scene context vector 118 of the frame t and the second set of person-specific feature vectors 114. For each detected person i in the frame t, the second neural network 120 is configured to determine a hidden state vector 122 for the frame t, based on the temporal scene context vector 118 and a subset 114_i of the second set of person-specific feature vectors 114 corresponding to the detected person i. The hidden state vector 122 for a detected person contains implicit information describing what is happening in the scene from the perspective of the detected person.

In an embodiment, the second neural network 120 is a recurrent neural network (RNN). In an embodiment, given V detected persons in the frame t, the second neural network 120 models the relations of the detected persons and uses this modeling to infer the activity(ies) of each of them. With each detected person represented as a node v_i ∈ V of a fully-connected graph, the second neural network 120 iteratively updates a hidden state vector for each node v_i based on the intermediate hidden state vectors of the other nodes.

In an embodiment, the hidden state vector 122 for a detected person i is determined based on the temporal scene context vector 118, the subset 114_i of the second set of person-specific feature vectors corresponding to the detected person, and a combined hidden state vector corresponding to the one or more detected persons in the frame.

In an embodiment, the combined hidden state vector contains information representative of relations between potential activities performed by the one or more detected persons in the frame. As such, the one or more activities detected for a detected person (which as described below are determined based on the hidden state vector 122) are identified additionally based on interactions taking place between persons in the frame. In other words, the activity detection also takes into account whether a particular interaction impacts or not the person's activity and determines the person's activity accordingly. This is based on the insight that actions of different persons may be correlated (e.g., "talk to" and "listen to", actions often performed by a group together, e.g., "play instrument"), uncorrelated, or even exclusive of each other (e.g., if a person "drives" a car, it is unlikely that other persons in the car "stand" or "play an instrument").

FIG. 4 illustrates an example implementation 400 of the second neural network 120 according to an embodiment. As shown in FIG. 4, example second neural network 400 includes a maxpool module 404, a concatenation module 408, and a Gated Recurrent Unit (GRU) 412.

The maxpool module 404 is configured to perform a maxpool operation on previous iteration hidden state vectors 402, which correspond respectively to the one or more detected persons in the frame t, to generate a combined hidden state vector 406. In an embodiment, the combined hidden state vector 406 for iteration j is obtained by maxpooling the hidden state vectors of the nodes v_i for iteration (j-1) (for the frame t).

The combined hidden state vector 406 is then concatenated, using concatenation module 408, with the temporal scene context vector 118 and the subset 114_i of the second set of person-specific feature vectors corresponding to the detected person i, to generate a concatenated vector 410 for the detected person i (for the frame t) at iteration j.

The concatenated vector 410 is then applied to GRU 412 to generate the hidden state vector 122 for the detected person i for the frame t.

In an embodiment, the hidden state vector 122 for the detected person i for iteration j (for the frame t) is computed inside GRU 412 by gated recurrent unit update operations parameterized by weights U_z, W_z, U_r, W_r, U_s, and W_s of GRU 412, in which the gates are computed with the sigmoid function and applied via the Hadamard product.
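
The patent's own update equations are not reproduced in this text. For reference, the standard GRU update with this weight naming reads as follows (an illustrative reconstruction, with $x$ standing for the concatenated vector 410, $h^{(j-1)}$ for the previous-iteration hidden state, $\sigma$ for the sigmoid function, and $\circ$ for the Hadamard product; the original notation may differ):

$z = \sigma(U_z x + W_z h^{(j-1)})$

$r = \sigma(U_r x + W_r h^{(j-1)})$

$\tilde{s} = \tanh(U_s x + W_s (r \circ h^{(j-1)}))$

$h^{(j)} = (1 - z) \circ h^{(j-1)} + z \circ \tilde{s}$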

Typically, after a fixed number of iterations, the hidden state vector 122 for the detected person i is output by GRU 412.
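
Putting the pieces of this subsection together, the iterative per-person update can be sketched as below with PyTorch's GRUCell. The zero initialization of the hidden states, the number of iterations, and the module and variable names are assumptions.

```python
# Hypothetical sketch: iterative person-relation updates with a GRU cell.
import torch
import torch.nn as nn

class PersonRelationGRU(nn.Module):
    def __init__(self, person_dim, context_dim, hidden_dim, num_iterations=3):
        super().__init__()
        self.num_iterations = num_iterations
        self.hidden_dim = hidden_dim
        # Input per person: [combined hidden state ; temporal scene context ; person features]
        self.gru_cell = nn.GRUCell(hidden_dim + context_dim + person_dim, hidden_dim)

    def forward(self, person_feats, scene_context):
        # person_feats: (V, person_dim) person-specific features for frame t
        # scene_context: (context_dim,) temporal scene context vector for frame t
        num_persons = person_feats.size(0)
        hidden = person_feats.new_zeros(num_persons, self.hidden_dim)  # assumed initialization
        context = scene_context.unsqueeze(0).expand(num_persons, -1)
        for _ in range(self.num_iterations):
            # Combined hidden state: maxpool over previous-iteration hidden states.
            combined = hidden.max(dim=0).values.unsqueeze(0).expand(num_persons, -1)
            x = torch.cat([combined, context, person_feats], dim=1)  # concatenated vector
            hidden = self.gru_cell(x, hidden)                        # updated hidden states
        return hidden  # one hidden state vector per detected person
```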

Returning to FIG. 1, activity detector 124 is configured to receive the hidden state vector 122 for detected person i (for frame t) from the second neural network 120. Activity detector 124 is configured to detect one or more activities 126 performed by the detected person i based on the hidden state vector 122.

FIG. 5 illustrates an example implementation 500 of the activity detector 124 according to an embodiment. As shown in FIG. 5, example activity detector 500 includes a fully-connected layer 502 and an activity inference module 506.

Fully-connected layer 502 is configured to receive the hidden state vector 122 for a detected person i and to generate a set of probabilities 504_1, ..., 504_M each corresponding to a respective activity of a predefined set of activities (e.g., 60 possible activities) and indicating the probability that the detected person is performing the respective activity in the frame t.

In an embodiment, the fully-connected layer 502 additionally receives the equivalent hidden state vectors for frame t of some or all of the other detected persons in the frame t, and calculates the set of probabilities 504_1, ..., 504_M taking all received hidden state vectors into account. As such, the set of probabilities 504_1, ..., 504_M reflects that the action(s) performed by the detected person i in frame t depend, at least in part, on the actions of the other persons detected in frame t.

A fully-connected layer is a known neural network component. A description of fully-connected layers can be found for example in "LeCun, Yann, et al., "Backpropagation applied to handwritten zip code recognition," Neural computation 1.4 (1989): 541-551".

In an embodiment, the fully-connected layer 502 uses a sigmoid activation function.

Based on the generated set of probabilities (504_1, ..., 504_M), activity inference module 506 is configured to detect the one or more activities 126 performed by the detected person i in the frame t.

In an embodiment, the activities having a respective probability above a predefined threshold are taken as the one or more detected activities 126 of the detected person i. The threshold may be set according to a desired confidence level in the activities detected.
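
As a minimal illustration of this last step, the sketch below applies a fully-connected layer with a sigmoid to one person's hidden state vector and thresholds the resulting probabilities. The dimensions, the threshold value, and the single-person variant are assumptions; the embodiment that also feeds in the other persons' hidden state vectors is not shown.

```python
# Hypothetical sketch: per-activity probabilities and threshold-based activity selection.
import torch
import torch.nn as nn

NUM_ACTIVITIES = 60   # e.g., the 60 AVA action classes
HIDDEN_DIM = 512      # assumed size of a hidden state vector

classifier = nn.Linear(HIDDEN_DIM, NUM_ACTIVITIES)  # fully-connected layer

def detect_activities(hidden_state, threshold=0.5):
    # hidden_state: (HIDDEN_DIM,) hidden state vector of one detected person
    probs = torch.sigmoid(classifier(hidden_state))            # per-activity probabilities
    detected = (probs > threshold).nonzero(as_tuple=True)[0]   # indices of detected activities
    return detected, probs
```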

In another embodiment, the thresholding may be further combined with inferences based on what other detected persons are doing. For example, if the probability that detected person i is speaking is 0.5 (i.e., 50%) and the probability that detected person j is speaking is 0.3, then it is likely that the person i is speaking and person j is listening in the frame.

Although the present invention has been described above with reference to certain specific embodiments, it will be understood that the invention is not limited by the particularities of the specific embodiments. Numerous variations, modifications and developments may be made in the above-described embodiments within the scope of the appended claims.