Title:
EVENT DETECTION
Document Type and Number:
WIPO Patent Application WO/2020/192868
Kind Code:
A1
Abstract:
A neural network-based video processing system for determining a correlation between two time-spaced images from a video stream, the system comprising: an image feature map extractor comprising a neural network, the image feature map extractor being configured to determine, for each of the images, a feature map comprising a plurality of channels and a plurality of locations of pixels in the image, the feature map representing the response of the respective image over each of the plurality of channels and at each of the plurality of locations of pixels in the respective image; and a feature map aggregator configured to form an aggregated feature map by weighting each of the values in the feature map by (i) a factor representing the total channel response at the location corresponding to the respective value in the feature map normalised with respect to the total channel response over the respective image and (ii) a factor that indicates the extent to which the feature map indicates the image's response over the channel corresponding to the respective value in the feature map; the system being configured to determine the correlation by comparing the aggregated feature maps for each of the images.

Inventors:
REDZIC MILAN (DE)
LIU SHAOQING (DE)
YUAN PENG (DE)
Application Number:
PCT/EP2019/057258
Publication Date:
October 01, 2020
Filing Date:
March 22, 2019
Assignee:
HUAWEI TECH CO LTD (CN)
REDZIC MILAN (DE)
International Classes:
G06K9/00; G06K9/46; G06K9/62
Domestic Patent References:
WO2018210796A1, 2018-11-22
Foreign References:
US20180032846A1, 2018-02-01
Other References:
KAREN SIMONYAN ET AL: "Two-Stream Convolutional Networks for Action Recognition in Videos", 9 June 2014 (2014-06-09), XP055324674, retrieved from the Internet [retrieved on 2019-12-05]
FEICHTENHOFER CHRISTOPH ET AL: "Convolutional Two-Stream Network Fusion for Video Action Recognition", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 27 June 2016, pages 1933-1941, XP033021371, DOI: 10.1109/CVPR.2016.213
DOSOVITSKIY ALEXEY ET AL: "FlowNet: Learning Optical Flow with Convolutional Networks", 2015 IEEE International Conference on Computer Vision (ICCV), 7 December 2015, pages 2758-2766, XP032866621, DOI: 10.1109/ICCV.2015.316
LIMIN WANG ET AL: "Temporal Segment Networks: Towards Good Practices for Deep Action Recognition", Computer Vision - ECCV 2016, LNCS vol. 9912, Springer International Publishing, 2016, pages 20-36, XP055551834, DOI: 10.1007/978-3-319-46484-8_2
SINGH GURKIRT ET AL: "Online Real-Time Multiple Spatiotemporal Action Localisation and Prediction", 2017 IEEE International Conference on Computer Vision (ICCV), 22 October 2017, pages 3657-3666, XP033283236, DOI: 10.1109/ICCV.2017.393
YUE ZHAO; YUANJUN XIONG; LIMIN WANG; ZHIRONG WU; XIAOOU TANG; DAHUA LIN: "Temporal Action Detection with Structured Segment Networks", ICCV 2017
V. ESCORCIA; F. CABA HEILBRON; J. C. NIEBLES; B. GHANEM: "DAPs: Deep Action Proposals for Action Understanding", ECCV 2016, pages 768-784
A. DOSOVITSKIY; P. FISCHER; E. ILG; P. HAUSSER; C. HAZIRBAS; V. GOLKOV; P. V.D. SMAGT; D. CREMERS; T. BROX: "FlowNet: Learning Optical Flow with Convolutional Networks", ICCV 2015
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
CLAIMS

1. A neural network-based video processing system for determining a correlation between two time-spaced images from a video stream, the system comprising:

an image feature map extractor comprising a neural network, the image feature map extractor being configured to determine, for each of the images, a feature map comprising a plurality of channels and a plurality of locations of pixels in the image, the feature map representing the response of the respective image over each of the plurality of channels and at each of the plurality of locations of pixels in the respective image; and

a feature map aggregator configured to form an aggregated feature map by weighting each of the values in the feature map by (i) a factor representing the total channel response at the location corresponding to the respective value in the feature map normalised with respect to the total channel response over the respective image and (ii) a factor that indicates the extent to which the feature map indicates the image’s response over the channel corresponding to the respective value in the feature map; the system being configured to determine the correlation by comparing the aggregated feature maps for each of the images.

2. The system as claimed in claim 1, wherein comparing the aggregated feature maps for each of the images to determine the correlation comprises using a linear weighted combination of unidirectional and bidirectional combinations of values from the aggregated feature maps.

3. The system as claimed in claim 2, wherein the linear weightings are determined based on one of: (i) a determination of the importance of the locations; (ii) a metric determined for the exponential space of the locations; and (iii) a metric determined by correlating values from the aggregated feature maps.

4. The system as claimed in any one of the preceding claims, wherein the factor that indicates the extent to which the feature map indicates the image’s response over the channel corresponding to the respective value in the feature map is derived from the proportion of non-zero locations per channel.

5. The system as claimed in any one of the preceding claims, wherein the determined correlation between the two time-spaced images represents the 2D motion of pixels between the images.

6. The system as claimed in claim 5, wherein the system is further configured to use the determined correlation to match and/or fail to match heuristics corresponding to one or more predetermined types of objects or activities.

7. A method for determining correlation between two time-spaced images from a video stream, the method comprising:

determining, for each of the images, a feature map comprising a plurality of channels and a plurality of locations of pixels in the image, the feature map representing the response of the respective image over each of the plurality of channels and at each of the plurality of locations of pixels in the respective image; forming an aggregated feature map by weighting each of the values in the feature map by (i) a factor representing the total channel response at the location corresponding to the respective value in the feature map normalised with respect to the total channel response over the respective image and (ii) a factor that indicates the extent to which the feature map indicates the image’s response over the channel corresponding to the respective value in the feature map; and

determining the correlation by comparing the aggregated feature maps for each of the images.

8. A video processing system for estimating subject matter of a video stream, the system being configured to:

for each of a series of still images in the video stream, form a first series of estimates of the respective image’s subject matter ranked by prediction strength; for each of a series of video segments in the video stream, form a second series of estimates of the respective video segment’s subject matter ranked by prediction strength;

for each of a first set of numbers in turn, analyse that number of the estimates having the highest prediction strength in each of the first series, and thereby form a first combined estimate of the subject matter of the video stream; for each of a second set of numbers in turn, analyse that number of the estimates having the highest prediction strength in each of the second series, and thereby form a second combined estimate of the subject matter of the video stream; for each of a third set of numbers in turn, analyse that number of the estimates having the highest prediction strength in each of the first series, and thereby form a third combined estimate of the subject matter of the video stream, the third set being different from the first set;

for each of a fourth set of numbers in turn, analyse that number of the estimates having the highest prediction strength in each of the second series, and thereby form a fourth combined estimate of the subject matter of the video stream, the fourth set being different from the second set; and

analyse the first, second, third and fourth combined estimates to form a global estimate of the subject matter of the video stream.

9. The video processing system as claimed in claim 8, wherein the second set of numbers is the same as the first set of numbers.

10. The video processing system as claimed in claim 8 or claim 9, wherein the fourth set of numbers is the same as the third set of numbers.

11. The video processing system as claimed in any one of claims 8 to 10, wherein the first set of numbers is the set of integers between 4 and 10 inclusive and/or the third set of numbers is the set of integers between 6 and 11 inclusive.

12. The video processing system as claimed in any of claims 8 to 11, wherein the second series of estimates is formed from an intermediate input representing the 2D motion of pixels between two still images, the two still images being the first and last frames of each respective video segment in the series.

13. The video processing system as claimed in claim 12, wherein the intermediate input is formed by the method of claim 7.

14. The video processing system as claimed in any one of claims 8 to 13, wherein the first and second series of estimates are formed using a respective pretrained deep learning network model.

15. A method for estimating subject matter of a video stream, the method comprising: for each of a series of still images in the video stream, forming a first series of estimates of the respective image’s subject matter ranked by prediction strength; for each of a series of video segments in the video stream, forming a second series of estimates of the respective video segment’s subject matter ranked by prediction strength;

for each of a first set of numbers in turn, analysing that number of the estimates having the highest prediction strength in each of the first series, and thereby forming a first combined estimate of the subject matter of the video stream;

for each of a second set of numbers in turn, analysing that number of the estimates having the highest prediction strength in each of the second series, and thereby forming a second combined estimate of the subject matter of the video stream; for each of a third set of numbers in turn, analysing that number of the estimates having the highest prediction strength in each of the first series, and thereby forming a third combined estimate of the subject matter of the video stream, the third set being different from the first set;

for each of a fourth set of numbers in turn, analysing that number of the estimates having the highest prediction strength in each of the second series, and thereby forming a fourth combined estimate of the subject matter of the video stream, the fourth set being different from the second set; and

analysing the first, second, third and fourth combined estimates to form a global estimate of the subject matter of the video stream.

Description:
EVENT DETECTION

FIELD OF THE INVENTION

This disclosure relates to methods for detecting events in video streams, for example for use in video monitoring or for identifying highlights in a passage of video.

BACKGROUND

Event detection (ED) systems based on video action detection and recognition from video data have attracted extensive research interest in the consumer industry. There are many situations where it is desirable to find the starting and ending timestamps of events in a video stream, in addition to classifying the event (recognition). Given the ubiquity of CCTV, there is a significant research effort to apply image and video analysis methods together with deep learning techniques to move towards autonomous analysis of such data sources.

Traditional approaches to scene understanding remain dependent on training based on human annotations provided by different sensors. Systems have been introduced to assist human effort, for example to assist the police and security personnel in preventing crime. Even if an event is not identified at the time of its occurrence, recorded data can be used to provide evidence of the crime and to identify perpetrators and victims after the act.

Previous methods relating to event detection predominantly use sliding windows as candidates and focus on designing hand-crafted feature representations for classification. Recent works incorporate deep learning convolutional neural networks (CNNs) into the detection frameworks and have been shown to obtain improved performance, as detailed in "Temporal Action Detection with Structured Segment Networks", by Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, Dahua Lin, ICCV 2017. One solution, S-CNN (as described in V. Escorcia, F. Caba Heilbron, J. C. Niebles, and B. Ghanem, "DAPs: Deep action proposals for action understanding", B. Leibe, J. Matas, N. Sebe, and M. Welling (editors), ECCV, pages 768-784, 2016), proposes a multi-stage CNN which improves detection accuracy via a localization network, but relies on C3D as the feature extractor, which was initially designed for snippet-wise action classification. Extending this method to detection of long action proposals requires an undesirably large temporal kernel stride.

Other known methods use a Recurrent Neural Network (RNN) to learn a glimpse policy for predicting the start and end points of an action. Such sequential prediction is often time-consuming for processing long videos and does not support joint training of the underlying feature extraction CNN. Some other works use classical dense optical flow extraction primarily based on OpenCV libraries.

ED systems based on deep learning have numerous advantages compared to those based on classical computer vision approaches. For example, the extraction of human-engineered features, such as edges, corners and color, can be avoided. Such systems are also robust when applied to different datasets, supported by data augmentation. Furthermore, models can be saved and transfer learning can be used to fine-tune the model weights in order to achieve robustness on different datasets. However, deep learning approaches generally require more computational resources, such as GPUs, in order to run.

While current video analytics methods greatly improve on passive surveillance systems, they suffer from undesirable false positive rates. Typically, motion detection algorithms are sensitive to illumination changes, camera shake, motion in the background, such as moving foliage, and usually cannot deal with continuous motion in the camera field of view. Many current approaches also require the tuning of a large number of parameters and demand high computational costs.

It is desirable to develop an improved method of accurately detecting and recognising events in video data that is less sensitive to these problems.

SUMMARY OF THE INVENTION

According to a first aspect of the invention, there is provided a neural network-based video processing system for determining a correlation between two time-spaced images from a video stream, the system comprising: an image feature map extractor comprising a neural network, the image feature map extractor being configured to determine, for each of the images, a feature map comprising a plurality of channels and a plurality of locations of pixels in the image, the feature map representing the response of the respective image over each of the plurality of channels and at each of the plurality of locations of pixels in the respective image; and a feature map aggregator configured to form an aggregated feature map by weighting each of the values in the feature map by (i) a factor representing the total channel response at the location corresponding to the respective value in the feature map normalised with respect to the total channel response over the respective image and (ii) a factor that indicates the extent to which the feature map indicates the image’s response over the channel corresponding to the respective value in the feature map; the system being configured to determine the correlation by comparing the aggregated feature maps for each of the images. The formation of weighted aggregated feature maps in the correlation process may help to improve the accuracy of the correlation determination performed by the system.

Comparing the aggregated feature maps for each of the images to determine the correlation may comprise using a linear weighted combination of unidirectional and bidirectional combinations of values from the aggregated feature maps. The linear weightings may be determined based on one of: (i) a determination of the importance of the locations; (ii) a metric determined for the exponential space of the locations; and (iii) a metric determined by correlating values from the aggregated feature maps. This may improve the accuracy of the correlation determination.

The factor that indicates the extent to which the feature map indicates the image’s response over the channel corresponding to the respective value in the feature map may be derived from the proportion of non-zero locations per channel. In this way, the per-channel sparsity is taken into account in the formation of the weighted aggregated feature maps.

The determined correlation between the two time-spaced images may represent the 2D motion of pixels between the images. The processing system may therefore generate a representation of optical flow.

The system may be further configured to use the determined correlation to match and/or fail to match heuristics corresponding to one or more predetermined types of objects or activities. This may allow for classification of sections of video data in dependence on the activities that are taking place in the video, or the objects that appear in the video.

According to a second aspect of the invention there is provided a method for determining correlation between two time-spaced images from a video stream, the method comprising: determining, for each of the images, a feature map comprising a plurality of channels and a plurality of locations of pixels in the image, the feature map representing the response of the respective image over each of the plurality of channels and at each of the plurality of locations of pixels in the respective image; forming an aggregated feature map by weighting each of the values in the feature map by (i) a factor representing the total channel response at the location corresponding to the respective value in the feature map normalised with respect to the total channel response over the respective image and (ii) a factor that indicates the extent to which the feature map indicates the image’s response over the channel corresponding to the respective value in the feature map; and determining the correlation by comparing the aggregated feature maps for each of the images. The use of weighted aggregated feature maps in the correlation process may help to improve the accuracy of the correlation determination.

According to a third aspect of the invention there is provided a video processing system for estimating subject matter of a video stream, the system being configured to: for each of a series of still images in the video stream, form a first series of estimates of the respective image’s subject matter ranked by prediction strength; for each of a series of video segments in the video stream, form a second series of estimates of the respective video segment’s subject matter ranked by prediction strength; for each of a first set of numbers in turn, analyse that number of the estimates having the highest prediction strength in each of the first series, and thereby form a first combined estimate of the subject matter of the video stream; for each of a second set of numbers in turn, analyse that number of the estimates having the highest prediction strength in each of the second series, and thereby form a second combined estimate of the subject matter of the video stream; for each of a third set of numbers in turn, analyse that number of the estimates having the highest prediction strength in each of the first series, and thereby form a third combined estimate of the subject matter of the video stream, the third set being different from the first set; for each of a fourth set of numbers in turn, analyse that number of the estimates having the highest prediction strength in each of the second series, and thereby form a fourth combined estimate of the subject matter of the video stream, the fourth set being different from the second set; and analyse the first, second, third and fourth combined estimates to form a global estimate of the subject matter of the video stream. The use of a late fusion method to combine predictions from two streams may improve the accuracy of the subject matter estimation performed by the device.

The second set of numbers may be the same as the first set of numbers. The fourth set of numbers may be the same as the third set of numbers. This may allow the predictions from the two data streams to be combined to form a global estimate.

The first set of numbers may be the set of integers between 4 and 10 inclusive and/or the third set of numbers may be the set of integers between 6 and 11 inclusive. These sets of integers have been shown to achieve improved results.

The second series of estimates may be formed from an intermediate input representing the 2D motion of pixels between two still images, the two still images being the first and last frames of each respective video segment in the series. The first series of estimates may be formed from RGB images. Combining predictions from these two complementary approaches may help to achieve improved accuracy and robustness to the problems that affect the individual modalities. The intermediate input may be formed by the method of the second aspect above. This may allow for a section of video to be classified.

The first and second series of estimates may be formed using a respective pretrained deep learning network model.

According to a fourth aspect of the invention there is provided a method for estimating subject matter of a video stream, the method comprising: for each of a series of still images in the video stream, forming a first series of estimates of the respective image's subject matter ranked by prediction strength; for each of a series of video segments in the video stream, forming a second series of estimates of the respective video segment's subject matter ranked by prediction strength; for each of a first set of numbers in turn, analysing that number of the estimates having the highest prediction strength in each of the first series, and thereby forming a first combined estimate of the subject matter of the video stream; for each of a second set of numbers in turn, analysing that number of the estimates having the highest prediction strength in each of the second series, and thereby forming a second combined estimate of the subject matter of the video stream; for each of a third set of numbers in turn, analysing that number of the estimates having the highest prediction strength in each of the first series, and thereby forming a third combined estimate of the subject matter of the video stream, the third set being different from the first set; for each of a fourth set of numbers in turn, analysing that number of the estimates having the highest prediction strength in each of the second series, and thereby forming a fourth combined estimate of the subject matter of the video stream, the fourth set being different from the second set; and analysing the first, second, third and fourth combined estimates to form a global estimate of the subject matter of the video stream. The use of a late fusion method to combine predictions from two streams may improve the accuracy of the subject matter estimation.

BRIEF DESCRIPTION OF THE FIGURES

The present invention will now be described by way of example with reference to the accompanying drawings. In the drawings:

Figure 1 shows an overview of an event detection framework, as described in "Temporal Action Detection with Structured Segment Networks", by Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, Dahua Lin, ICCV 2017.

Figure 2 shows an overview of the training phase of the event detection process.

Figure 3 shows an overview of the inference phase of the event detection process.

Figures 4 (a) and (b) show examples of temporal flow images.

Figure 5 shows how two temporally separated images from an input video are processed to form a representation of optical flow using two separate, yet identical, processing streams for the two input images. The input images are combined together at a later stage, as described in "FlowNet: Learning Optical Flow with Convolutional Networks", A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. v.d. Smagt, D. Cremers and T. Brox, ICCV 2015.

Figure 6 shows an example of a correlation layer.

Figure 7 illustrates the formation of a weighted feature map.

Figure 8 shows an example of a method for forming a representation of correlation between two time-spaced images from a video stream.

Figure 9 shows an example of a neural network-based video processing system for determining a correlation between two time-spaced images from a video stream.

Figure 10 shows an example of a method for estimating the subject matter of a video stream.

Figure 11 shows an example of an end-to-end process employing the aspects described herein.

DETAILED DESCRIPTION

An overview of an event detection framework is shown in Figure 1. The system takes as its input a video 11 and a set of temporal action proposals 12. The system outputs a set of predicted activity instances 13, each associated with a category label and a temporal range of the video. The temporal range is delimited by a starting point and an ending point of the input video.

From the input to the output of the system, there are three key steps. Firstly, the framework relies on a proposal method to produce the set of temporal proposals 12 of varying durations, where each proposal has a starting and an ending time and is composed of three consecutive stages: starting, course, and ending, which respectively capture how the action starts, proceeds, and ends.

Next, for each proposal, structured temporal pyramid pooling (STPP) is performed by:

1) splitting the proposal into the three stages (starting, course and ending);

2) building a temporal pyramidal representation for each stage; and

3) building a global representation 14 for the whole proposal by concatenating stage-level representations.

Finally, two classifiers 15, 16, for recognizing the activity category and the completeness (whether a proposal captures a complete activity instance) respectively, are applied to the representation 14 obtained using STPP and their predictions are combined, resulting in a subset of proposals tagged with category labels. The classifiers together with STPP are integrated into a single network that is trained in an end-to-end way. Examples of the classification functions used are described in "Temporal Action Detection with Structured Segment Networks", by Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, Dahua Lin, ICCV 2017.
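To make the pooling step concrete, the following is a minimal sketch of structured temporal pyramid pooling over per-snippet features. The stage boundaries, pyramid depths and the use of average pooling are illustrative assumptions rather than the exact configuration of the framework described above.

```python
import numpy as np

def pyramid_pool(features, levels=(1, 2)):
    """Average-pool a (T, D) feature block at several pyramid levels and concatenate."""
    parts = []
    for level in levels:
        # Split the temporal axis into `level` roughly equal chunks and average each.
        for chunk in np.array_split(features, level, axis=0):
            parts.append(chunk.mean(axis=0) if len(chunk) else np.zeros(features.shape[1]))
    return np.concatenate(parts)

def stpp(snippet_features, start_idx, end_idx):
    """Build a global proposal representation from starting/course/ending stages."""
    starting = snippet_features[:start_idx]          # how the action starts
    course = snippet_features[start_idx:end_idx]     # main body of the action
    ending = snippet_features[end_idx:]              # how the action ends
    # Each stage gets its own temporal pyramid; the course stage uses a deeper pyramid here.
    stage_reprs = [
        pyramid_pool(starting, levels=(1,)),
        pyramid_pool(course, levels=(1, 2)),
        pyramid_pool(ending, levels=(1,)),
    ]
    return np.concatenate(stage_reprs)  # global representation fed to the classifiers

# Example: 12 snippets with 64-dim features, stage boundaries at snippets 3 and 9.
feats = np.random.rand(12, 64)
print(stpp(feats, start_idx=3, end_idx=9).shape)
```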

The proposals can be described as follows (a simple labelling sketch is given after this list):

- positive proposals: overlap with the closest ground truth instance with an intersection over union (IoU) of at least 0.7;

- background proposals: do not overlap with any ground truth instance; and

- incomplete proposals: 80% of the proposal's own span is contained in a ground truth instance, while its IoU with that instance is below 0.3.
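A minimal sketch of how these categories might be assigned for a single proposal against its closest ground truth instance, assuming temporal IoU. The helper names and the handling of proposals falling outside the three categories are illustrative assumptions.

```python
def temporal_iou(p_start, p_end, g_start, g_end):
    """Temporal intersection over union between a proposal and a ground truth instance."""
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0

def label_proposal(p_start, p_end, g_start, g_end):
    """Assign one of the three proposal categories using the thresholds quoted above."""
    iou = temporal_iou(p_start, p_end, g_start, g_end)
    overlap = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    span = p_end - p_start
    if iou >= 0.7:
        return "positive"       # IoU with the closest ground truth instance of at least 0.7
    if overlap == 0.0:
        return "background"     # no overlap with any ground truth instance
    if overlap / span >= 0.8 and iou < 0.3:
        return "incomplete"     # mostly contained in an instance, but IoU below 0.3
    return "unlabelled"         # proposals outside the three categories are not used here

print(label_proposal(10.0, 20.0, 12.0, 19.0))  # "positive"
```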

As the whole framework is deep-network based, training and testing (inference) stages can be distinguished. A block diagram of the training process is shown in Figure 2. After each training iteration, the validation output comprises a ranked list of the current predicted labels based on the input video frames (images) used to calculate the validation accuracy result, in addition to corresponding confidence value scores. Additionally, the loss is calculated and the overall training process is repeated until either the last training iteration is finished or a particular (predefined) loss value is reached. This results in a pre-trained network model (network graph and trained network weights) that can be used during the testing (inference) stage.

A block diagram of the inference phase is shown in Figure 3. At block 31, an input video of an event is sliced into snippets. The snippets are created by extracting and grouping RGB frames and optical flow images. Each image is associated with a label corresponding to the event it belongs to. Formally, given a video V, the video is divided into K segments {S1, S2, ..., SK} of equal duration. The temporal network models a sequence of snippets as follows:
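One formulation along the lines of the cited Temporal Segment Networks reference, in which one snippet $T_k$ is sampled from each segment $S_k$ and a segmental consensus aggregates the per-snippet class scores, is given below. This is an assumption drawn from that reference, not necessarily the exact expression used in this disclosure.

```latex
\mathrm{TSN}(T_1, T_2, \ldots, T_K)
  = \mathcal{H}\big(\mathcal{G}(\mathcal{F}(T_1; \mathbf{W}), \mathcal{F}(T_2; \mathbf{W}), \ldots, \mathcal{F}(T_K; \mathbf{W}))\big)
```

Here $\mathcal{F}(T_k; \mathbf{W})$ is the class score the network with parameters $\mathbf{W}$ produces for snippet $T_k$, $\mathcal{G}$ is the segmental consensus function (for example averaging), and $\mathcal{H}$ (for example softmax) produces the final class probabilities.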

The snippets proceed to flow extractors, denoted by 32 and 33 for RGB and optical flow image snippets respectively. Optical flow relates the motion field, the 2D projection of the physical movement of points relative to the observer, to the 2D displacement of pixel patches on the image plane. Examples of optical flow images are shown in Figures 4(a) and (b).

Then each image from a corresponding snippet is propagated through a deep network stream, denoted by 34 and 35 for the spatial and temporal networks respectively, to obtain intermediate output predictions for each network stream separately. The streams are handled separately because, for some classes, the optical flow model may outperform the RGB model and vice versa. Activities that are better classified by the RGB model tend to be those best identified by the objects present in a scene, while activities that are better classified by the flow model tend to be those best identified by the kind of motion in the scene.

The intermediate output predictions are input into a class-score fusion heuristics block 36, which provides the final watch list of rankings and confidence scores, shown as cylinder-shaped block 37, from which the most likely prediction is taken as the correct one. Because RGB and flow data are complementary, taking both streams into account using a fusion process has been shown to improve on baseline single-frame systems. This process will be described in further detail later.

The operation of the block 33 of Figure 3 according to the present invention will now be described in more detail.

Figure 5 shows how two temporally separated images 50, 51 from an input video are processed using two separate, yet identical, processing streams for the two input images. The input images are combined together at a later stage. With this architecture, the network is constrained to first produce meaningful representations, known as feature maps, of the two images separately and then combine them at a higher level. This resembles the standard feature matching approach, in which features are first extracted from patches of both images and then those feature vectors are compared, as described in "FlowNet: Learning Optical Flow with Convolutional Networks", A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. v.d. Smagt, D. Cremers and T. Brox, ICCV 2015.

A feature map f for an image comprises a plurality of channels and a plurality of locations of pixels in the image and represents the response of the image over each of the plurality of channels and at each of the plurality of locations of pixels in the image. The tensor f, taken from a particular layer L, can be denoted by $f \in \mathbb{R}^{C \times W \times H}$, where C is the total number of channels and W and H are the spatial dimensions of the tensor.

An entry of f corresponding to channel c at spatial location (i, j) is denoted by $f_c^{(i,j)}$. The channel-wise matrices are denoted $f_c \in \mathbb{R}^{W \times H}$, where $c \in \{1, \ldots, C\}$. Similarly, $f^{(i,j)} \in \mathbb{R}^{C}$ is used to denote the vector of channel responses at location (i, j), where $i \in \{1, \ldots, W\}$ and $j \in \{1, \ldots, H\}$.

When the processing streams are combined, a correlation layer 52 is formed that performs multiplicative patch comparisons between the two feature maps corresponding to the first and second images 50, 51. A magnified view of the correlation layer 52 is shown in Figure 6.

The original correlation (matching) of two patches centred at $x_1$ in the first feature map $f_1$ (coming from the first input image) and centred at $x_2$ in the second feature map $f_2$ (coming from the second input image) is calculated, as in the FlowNet approach referenced above, as $c(x_1, x_2) = \sum_{o \in [-k,k] \times [-k,k]} \langle f_1(x_1 + o), f_2(x_2 + o) \rangle$ for a square patch of size $K = 2k + 1$.

Comparing all patch combinations involves $(WH)^2$ computations, where W and H are the dimensions (width and height) of the feature map. The correlation feature map thus comprises all the correlation values obtained when the correlation function is applied to different patches originating from the different feature maps (the feature maps originating from the different input images). Since this feature map is crucial for determining the optical flow prediction, this layer should be as discriminative as possible in terms of feature separation. In order to achieve a more robust matching process, the approach described herein uses a correlation layer which uses enhanced feature maps in order to better distinguish features in the feature maps and more precisely facilitate the matching process. The approach also uses a linear weighted combination of unidirectional and bidirectional matches to improve the correlation between two patches. Furthermore, three different strategies may be used to find the optimal weighting system (one based on importance, one based on exponential space, one based on correlation values), which will be explained in more detail below. To perform an improved correlation process between patches $x_1$ (in feature map $f_1$) and $x_2$ (in feature map $f_2$), weighted aggregated feature maps can be used, where the weighting is based on their importance, i.e. the relationship between their elements across dimensions. Then, correlation matching is performed on all the patches between the weighted aggregated feature maps.
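For concreteness, the following is a minimal sketch of a correlation layer applied to two feature maps of shape (C, H, W), restricted to a local displacement window. The window size, the per-location dot product over single feature vectors and the normalisation by C are illustrative assumptions, not the exact layer used here.

```python
import torch

def correlation_layer(f1, f2, max_disp=4):
    """Multiplicative patch comparison between two (C, H, W) feature maps.

    For every location in f1, the feature vector is compared (by dot product) with
    the feature vectors of f2 at displacements within +/- max_disp, producing a
    ((2 * max_disp + 1) ** 2, H, W) correlation volume.
    """
    C, H, W = f1.shape
    pad = max_disp
    f2p = torch.nn.functional.pad(f2, (pad, pad, pad, pad))  # zero-pad spatially
    out = []
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            shifted = f2p[:, pad + dy:pad + dy + H, pad + dx:pad + dx + W]
            out.append((f1 * shifted).sum(dim=0) / C)  # normalised dot product per location
    return torch.stack(out, dim=0)

# Example with two random (weighted aggregated) feature maps.
f1 = torch.rand(64, 32, 32)
f2 = torch.rand(64, 32, 32)
print(correlation_layer(f1, f2).shape)  # torch.Size([81, 32, 32])
```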

As shown schematically in Figure 7, weighted aggregated feature maps are formed by weighting the feature map channel-wise by a weight vector with entries $b_c$ and location-wise (spatially) by a weight matrix with entries $a_{ij}$, based on the normalized total response across all channels.

The spatial weighting factor is a factor representing the total channel response at the location corresponding to the respective value in the feature map normalised with respect to the total channel response over the respective image. For each location (i, j) in the feature map, a weight $a_{ij}$ is assigned that is applied to each channel at that location. An $L_p$ normalization may be used for $a_{ij}$.

$S \in \mathbb{R}^{W \times H}$ is a matrix of aggregated responses over all channels per spatial location (i, j), which can be computed by summing the feature maps channel-wise: $S_{ij} = \sum_{c=1}^{C} f_c^{(i,j)}$.

After applying normalization and scaling, the aggregated spatial response map gives, for the element at location (i, j), the spatial weight $a_{ij}$ (equation (5)).

After computing the 2D spatial aggregation map S for feature map f, $a_{ij}$ can be applied independently on every channel. The values of the coefficients a and b in equation (5) can be chosen to achieve the best results. In one example, a = 0.5 and b = 2. The spatial weighting boosts locations for which multiple channels are active relative to other spatial locations of the same image. Since the spatial weighting $a_{ij}$ has already been precomputed, it provides a non-parametric way to improve the feature maps at no additional cost.

The channel weighting factor is a factor that indicates the extent to which the feature map indicates the image's response over the channel corresponding to the respective value in the feature map. For each channel c, a weight $b_c$ is assigned that is applied to each location in that channel; $b_c$ is derived from the per-channel sparsity, i.e. the proportion of non-zero locations in channel c.

The weighted feature map is then given by $f'^{(i,j)}_c = a_{ij} \, b_c \, f_c^{(i,j)}$, where $f$ is the original feature map.
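The sketch below assumes a CroW-style cross-dimensional weighting (power-normalised spatial aggregation with a = 0.5 and b = 2, and a channel weight based on the proportion of non-zero locations per channel). This matches the behaviour described above but is an assumption about the exact form of $a_{ij}$ and $b_c$, not a reproduction of equation (5).

```python
import numpy as np

def weighted_aggregated_feature_map(f, a=0.5, b=2.0, eps=1e-8):
    """Apply spatial weights a_ij and channel weights b_c to a (C, H, W) feature map.

    Assumed CroW-style formulation:
      S_ij = sum_c f[c, i, j]                                 (total channel response per location)
      a_ij = (S_ij / (sum_mn S_mn ** a) ** (1 / a)) ** (1 / b)  (normalised and scaled)
      b_c  = log(sum_k Q_k / (eps + Q_c))                     (Q_c = proportion of non-zero locations)
    """
    C, H, W = f.shape
    S = f.sum(axis=0)                                    # aggregated response per location
    norm = (np.power(S, a).sum()) ** (1.0 / a) + eps
    a_ij = np.power(S / norm, 1.0 / b)                   # spatial weight matrix
    Q = (f > 0).reshape(C, -1).mean(axis=1)              # per-channel non-zero proportion
    b_c = np.log(Q.sum() / (eps + Q))                    # channel weight vector
    return f * a_ij[None, :, :] * b_c[:, None, None]     # weight every value by both factors

# Example on a random non-negative (ReLU-like) feature map.
f = np.maximum(np.random.randn(64, 16, 16), 0.0)
print(weighted_aggregated_feature_map(f).shape)  # (64, 16, 16)
```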

The unidirectional and bidirectional correlations of two patches, centred at $x_1$ in the weighted aggregated first feature map $f'_1$ and centred at $x_2$ in the weighted aggregated second feature map $f'_2$, are then defined on the weighted aggregated feature maps.

The total correlation is calculated as a linear weighted combination of these unidirectional and bidirectional correlations (equation (11)). The correlation layer lets the network compare each patch from $f'_1$ with each patch from $f'_2$ using this total correlation function.

Different weighting systems may be used when integrating the unidirectional and bidirectional matchings by using different values of the coefficients applied to the corresponding terms in equation (11). The optimal linear weightings may be determined based on, for example, one of: a determination of the importance of the locations, a metric determined for the exponential space of the locations, and a metric determined by correlating values from the aggregated feature maps.
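As a sketch of the linear weighted combination, assuming the unidirectional and bidirectional correlation values have already been computed and brought into a common indexing, and that the three weight names and default values are purely placeholders:

```python
def total_correlation(corr_12, corr_21, corr_121, w_12=0.4, w_21=0.4, w_121=0.2):
    """Linear weighted combination of unidirectional and bidirectional correlations.

    The weight values are placeholders; the text above describes choosing them based on
    location importance, an exponential-space metric, or the correlation values themselves.
    """
    return w_12 * corr_12 + w_21 * corr_21 + w_121 * corr_121
```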

A heuristic may be used to determine the coefficients that better separates them in "exponential space"; the corresponding coefficients for the unidirectional and bidirectional terms can then be obtained from it.

In one example, a particular combination of coefficient values was shown to achieve good results.

Alternatively, the coefficients may be determined using equations (13)-(15).

These coefficients are based on an exponential type of normalization and were shown to achieve good results.

Figure 8 summarises a method for forming a representation of correlation between two time-spaced images from a video stream. At step 81, the method comprises, for each of the images, determining a feature map comprising a plurality of channels and a plurality of locations of pixels in the image, the feature map representing the response of the respective image over each of the plurality of channels and at each of the plurality of locations of pixels in the respective image. The method continues at step 82 by forming an aggregated feature map by weighting each of the values in the feature map by (i) a factor representing the total channel response at the location corresponding to the respective value in the feature map normalised with respect to the total channel response over the respective image and (ii) a factor that indicates the extent to which the feature map indicates the image's response over the channel corresponding to the respective value in the feature map. At step 83, the method further comprises determining the correlation by comparing the aggregated feature maps for each of the images.

Figure 9 shows an example of a system for implementing the method described above. The neural network-based video processing system is shown generally at 90. The system comprises an image feature map extractor 91 comprising a neural network. The image feature map extractor is configured to determine the feature map for each of the images using the method described above. The system also comprises a feature map aggregator 92. The feature map aggregator 92 is configured to form an aggregated feature map by weighting each of the values in the feature map by the spatial and channel weighting factors, as described above. The feature map extractor 91 and the feature map aggregator 92 may each comprise a processor and a non-volatile memory. The memory may store a set of instructions that are executable by the processor. The processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine-readable storage medium. The computer program may store instructions for causing the processor to perform its methods in the manner described herein. The system 90 is configured to determine the correlation by comparing the weighted aggregated feature maps for each of the images, as described above. The output of the correlation process described above is a representation of optical flow. This representation can be input to the network model denoted by block 35 in Figure 3, which handles the optical flow stream and outputs predictions based on that data stream.

Alternatively, a representation of optical flow determined using other known methods may be input to block 35.

Blocks 34 and 35 use a pretrained artificial intelligence model, such as a deep learning model, to form predictions of activities from the RGB and optical flow stream respectively.

Network blocks 34 and 35 are supported by data augmentation. The following data augmentation techniques may additionally be used: random brightness, random flip (left-to-right flipping) and random contrast. These blocks also carry out the saving of the model and the determined weights after the training and validation phase, and perform the model initialization and reloading before the testing phase. The training parameters for the RGB and optical flow network streams are different and are data specific.

Once the intermediate RGB and optical flow predictions have been output from blocks 34 and 35, the predictions are input into a class-score fusion heuristics block 36 of Figure 3. This gives solutions for obtaining the final watch list of rankings and confidence scores, shown as cylinder-shaped block 37, from which the most likely prediction is taken as the correct one.

A late fusion function which takes into account predictions using both RGB and Optical Flow is derived and applied to the output predictions of the two data streams. This function is based on the top k predictions for each of the streams separately.

For each RGB frame of an input video, the top k predictions of the network output are determined. Then, all of the predictions are grouped based on one source of information (for all the RGB frames) and the top k predictions are again chosen, which may be based on a majority of votes or frequency of appearance. To output the prediction for the input video based on the RGB part only, the first-ranked (most likely) prediction is taken as the correct one and is compared to its label (ground truth prediction). This heuristic is employed for k = 4, 5, ..., 10, and the most likely prediction is identified for each value of k to form a first series of estimates of the respective RGB frame's subject matter ranked by prediction strength.

For the optical flow stream, the same process is repeated for k = 4, 5, ..., 10 and the most likely prediction based on this part only is obtained.

The fusion function then merges both analysis results into one semantic interpretation of an input video. To fuse the RGB and optical flow parts, the processes are repeated for RGB and optical flow, but this time the top m predictions are taken into account. Then a union (sum) of normalized predictions is found from both parts and the one with the most votes is chosen. This heuristic is employed for m = 6, ..., 11, and the most likely prediction is determined.

Taking k as the set of integers from 4 to 10 inclusive and m as the set of integers from 6 to 11 inclusive has been shown to achieve good results. However, other values or ranges of the integers k and m may also be used.
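A minimal sketch of the top-k/top-m voting heuristic described above, assuming each frame-level (RGB) or snippet-level (flow) prediction is supplied as a list of class labels already ranked by confidence. The vote-counting details are a simplification of the scheme, not its exact implementation.

```python
from collections import Counter

def top_k_vote(ranked_predictions, k):
    """Count how often each label appears among the top-k predictions of every frame/snippet."""
    votes = Counter()
    for ranked in ranked_predictions:       # one ranked label list per RGB frame or flow snippet
        votes.update(ranked[:k])
    return votes

def fuse_streams(rgb_ranked, flow_ranked, k_values=range(4, 11), m_values=range(6, 12)):
    """Late fusion of RGB and optical-flow predictions via top-k and top-m voting."""
    combined = Counter()
    # Per-stream estimates for each k in 4..10.
    for k in k_values:
        combined.update(top_k_vote(rgb_ranked, k))
        combined.update(top_k_vote(flow_ranked, k))
    # Joint estimates for each m in 6..11, summing normalised votes from both streams.
    for m in m_values:
        for stream in (rgb_ranked, flow_ranked):
            votes = top_k_vote(stream, m)
            total = sum(votes.values()) or 1
            for label, count in votes.items():
                combined[label] += count / total    # normalised contribution
    return combined.most_common(1)[0][0]            # most likely global prediction

# Example: each entry is a ranked label list for one RGB frame or one flow snippet.
rgb = [["goal", "pass", "foul"], ["goal", "foul", "pass"]]
flow = [["pass", "goal", "foul"], ["goal", "pass", "foul"]]
print(fuse_streams(rgb, flow, k_values=range(1, 3), m_values=range(2, 4)))  # "goal"
```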

Figure 10 shows an example of a method carried out by the video processing system for estimating the subject matter of a video. At step 101, the method comprises forming, for each of a series of still images in the video stream, a first series of estimates of the respective image's subject matter ranked by prediction strength. At step 102, for each of a series of video segments in the video stream, the method comprises forming a second series of estimates of the respective video segment's subject matter ranked by prediction strength. The method continues at step 103 by analysing, for each of a first set of numbers in turn, that number of the estimates having the highest prediction strength in each of the first series, thereby forming a first combined estimate of the subject matter of the video stream. At step 104, the method further comprises analysing, for each of a second set of numbers in turn, that number of the estimates having the highest prediction strength in each of the second series, thereby forming a second combined estimate of the subject matter of the video stream. At step 105, for each of a third set of numbers in turn, the method further comprises analysing that number of the estimates having the highest prediction strength in each of the first series, thereby forming a third combined estimate of the subject matter of the video stream, the third set being different from the first set. The method continues at step 106 by analysing, for each of a fourth set of numbers in turn, that number of the estimates having the highest prediction strength in each of the second series, thereby forming a fourth combined estimate of the subject matter of the video stream, the fourth set being different from the second set. Finally, at step 107, the first, second, third and fourth combined estimates are analysed to form a global estimate of the subject matter of the video stream.

Because RGB and optical flow data are complementary, taking the predictions output from each of these streams into account using a fusion process has been shown to improve on baseline single-frame systems. Combining the strengths of these two complementary approaches achieves improved accuracy and robustness to the problems that affect the individual modalities. In this late fusion, the predictions are combined on a semantic level. Since each modality is trained separately, the joint probability of the modalities is not learned explicitly; instead, the separately trained unimodal models are used in a late fusion process. As no re-training is necessary, late fusion systems can use larger datasets more easily.

Different variations of this fusion heuristic may be applied. In one example, the normalized softmax predictions (normalized to the [0, 1] interval) may be employed, and the normalized frequency of appearance may be scaled by these scores. Thus, outputs for the RGB and optical flow streams may be obtained, similarly to the method described above. The fusion process then finds the sum of the normalized, weighted, scaled frequency of appearance over all the predictions. The most likely output can then be compared to the ground truth label.
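A sketch of this variant under the assumption that each stream provides per-class softmax scores already normalised to [0, 1]; the way each class's mean score is scaled by its normalised frequency of appearance is the part being illustrated, and the array layout is an assumption.

```python
import numpy as np

def fuse_softmax_variant(rgb_scores, flow_scores):
    """Fuse per-class softmax scores scaled by each class's normalised frequency of appearance.

    rgb_scores / flow_scores: arrays of shape (num_frames, num_classes) with values in [0, 1].
    """
    outputs = []
    for scores in (rgb_scores, flow_scores):
        top1 = scores.argmax(axis=1)                                      # top prediction per frame
        freq = np.bincount(top1, minlength=scores.shape[1]) / len(top1)   # normalised appearance frequency
        outputs.append(scores.mean(axis=0) * freq)                        # scale mean scores by frequency
    fused = outputs[0] + outputs[1]                                       # sum over both streams
    return int(fused.argmax())                                            # most likely class index
```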

In other examples, additional weights may be employed on top of the scaled joint predictions discussed above, or the first and second ranked guesses may both be utilized based on the threshold difference between them. The optical flow representation determined using the method described above in the first aspect of the invention can be input to the late fusion function to fuse information from RGB and optical flow predictions separately. However, optical flow calculated using any traditional method may also be input to the late fusion function.

Furthermore, this fusion heuristic does not depend on the type of input data, and thus the method is versatile and may be applied to a range of different input data.

An end-to-end diagram summarizing the methods described herein is shown in Figure 11. As shown at 111, a video is input to the system. The input video may be based on one of a YouTube™ URL or a video file, such as an .avi or .mp4 file. The video is then sliced into chunks at 112. The chunks may in one example be portions of video with a duration of 1 s. At 113, the slices are iteratively processed as described above to obtain RGB and optical flow snippets. In this example, at 114, CNN-based pre-trained models are then used to obtain event prediction scores for the RGB and optical flow frame stacks. At 115, the frame-wise scores are aggregated and the fusion function is then applied at 116 as described above to obtain the final score, from which the event is classified based on predefined labels at 117.
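The end-to-end flow of Figure 11 could be orchestrated roughly as follows. All helper names here are hypothetical placeholders for the components described earlier (slicing, the two pretrained streams, and the fusion heuristic sketched above); the 1-second chunk duration follows the example in the text.

```python
def detect_event(video_path, rgb_model, flow_model, chunk_seconds=1.0):
    """Rough end-to-end sketch: slice, predict per stream, aggregate, fuse, classify.

    slice_video, extract_rgb_frames, extract_optical_flow and the models'
    rank_predictions methods are hypothetical placeholders; fuse_streams is the
    late fusion sketch given earlier.
    """
    rgb_ranked, flow_ranked = [], []
    for chunk in slice_video(video_path, chunk_seconds):           # e.g. 1 s chunks
        for frame in extract_rgb_frames(chunk):
            rgb_ranked.append(rgb_model.rank_predictions(frame))   # ranked labels per RGB frame
        for flow in extract_optical_flow(chunk):
            flow_ranked.append(flow_model.rank_predictions(flow))  # ranked labels per flow snippet
    return fuse_streams(rgb_ranked, flow_ranked)                   # final event label
```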

The method presented here explicitly models the action structure via structured temporal pyramid pooling in addition to efficiently utilizing optical flow information. The aspects described herein may be utilised separately or may be combined for efficient use of memory-based optical flow in addition to the use of a late fusion function of the predictions from the RGB and optical flow streams.

In previous works, this integration process has been addressed through the network only, by applying an early fusion through modification of the layers to achieve integration on a feature level (feature concatenation). By employing the methods described herein, the robustness of the system may be increased by utilizing a large amount of data and by training the models all together for an extensive amount of time (for example, days).

In one example, during testing, video snippets may be sampled with a fixed interval of 6 frames, and the temporal pyramid constructed thereon. The original formulation of the temporal pyramid first computes pooled features and then applies the classifiers and regressors on top, which is not efficient. For each video, hundreds of proposals will be generated, and these proposals can significantly overlap with each other; therefore, a considerable portion of the snippets, and the features derived thereon, are shared among proposals. Use of the technique described herein may reduce the processing time after extracting network outputs from around 10 seconds to less than 4 seconds per video on average.
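The reuse of snippet-level features across overlapping proposals described above can be illustrated with a simple cache keyed by snippet index; the data structures and the feature dimensionality are illustrative assumptions.

```python
import numpy as np

class SnippetFeatureCache:
    """Compute each snippet's network output once and reuse it for overlapping proposals."""

    def __init__(self, feature_fn):
        self.feature_fn = feature_fn      # maps a snippet index to its network output
        self._cache = {}

    def features_for(self, snippet_indices):
        rows = []
        for idx in snippet_indices:
            if idx not in self._cache:
                self._cache[idx] = self.feature_fn(idx)   # computed only on first request
            rows.append(self._cache[idx])
        return np.stack(rows)

# Example: overlapping proposals share most of their snippets, so features are reused.
cache = SnippetFeatureCache(lambda idx: np.random.rand(64))
proposal_a = cache.features_for(range(0, 10))
proposal_b = cache.features_for(range(5, 15))   # snippets 5-9 are reused, not recomputed
```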

As the system is capable of supporting different hardware requirements, it can provide a viable way of solving event detection problems in videos and may be integrated with existing products.

The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.