Title:
DEVICE AND METHOD FOR SEPARATING A PICTURE INTO FOREGROUND AND BACKGROUND USING DEEP LEARNING
Document Type and Number:
WIPO Patent Application WO/2020/043296
Kind Code:
A1
Abstract:
Embodiments of the present invention relate to the field of separating pictures, particularly pictures of a surveillance video, into foreground and background. A device and method are provided that employ a Convolutional Neural Network (CNN), i.e. are based on deep learning. The CNN is configured to receive as an input the picture and a background model image. The CNN is configured to generate feature maps of different resolution based on the input, wherein the resolution of feature maps is gradually reduced. Based on the feature maps, the CNN is configured to generate activation maps of different resolution, wherein the resolution of activation maps is gradually increased. Further, the CNN is configured to output a 1-channel probability map having the same resolution as the picture, wherein each pixel of the output 1-channel probability map corresponds to a pixel of the picture and indicates a probability that the corresponding pixel of the picture is associated with a foreground object or with a background object.

Inventors:
HOANG THAI (DE)
BRENNER MARKUS (DE)
WANG HONGBIN (DE)
TANG JIAN (DE)
Application Number:
PCT/EP2018/073381
Publication Date:
March 05, 2020
Filing Date:
August 30, 2018
Assignee:
HUAWEI TECH CO LTD (CN)
HOANG THAI V (DE)
International Classes:
G06N3/08; G06N3/04; G06T7/11; G06T7/194
Foreign References:
CN108154518A (2018-06-12)
Other References:
LIM KYUNGSUN ET AL: "Background subtraction using encoder-decoder structured convolutional neural network", 2017 14TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS), IEEE, 29 August 2017 (2017-08-29), pages 1 - 6, XP033233444, DOI: 10.1109/AVSS.2017.8078547
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims

1. Device (100) for separating a picture (101) into foreground and background configured to employ a Convolutional Neural Network, CNN, to

receive as an input (101, 102) the picture (101) and a background model image (102),

generate a plurality of feature maps (103) of different resolution based on the input (101, 102), wherein the resolution of feature maps (103) is gradually reduced,

generate a plurality of activation maps (104) of different resolution based on the plurality of feature maps (103) of different resolution, wherein the resolution of activation maps (104) is gradually increased, and

output a 1-channel probability map (105) having the same resolution as the picture (101),

wherein each pixel of the output 1-channel probability map (105) corresponds to a pixel of the picture (101) and indicates a probability that the corresponding pixel of the picture (101) is associated with a foreground object or with a background object.

2. Device (100) according to claim 1, configured to

threshold the output 1-channel probability map (105) to get a binary mask, wherein each pixel of the binary mask indicates whether the corresponding pixel of the picture (101) is associated with a foreground object or with a background object.

3. Device according to claim 1 or 2, wherein

the input (101, 102) includes a 3-channel [particularly RGB] high-resolution background model image (102) and a 3-channel [particularly RGB] picture (101).

4. Device (100) according to one of the claims 1 to 3, wherein

the CNN includes an encoder-decoder architecture,

the encoder (200) is configured to generate the plurality of feature maps (103) of different resolution, and

the decoder (210) is configured to generate the plurality of activation maps (104) of different resolution.

5. Device (100) according to one of the claims 1 to 4, wherein

the CNN comprises an encoder (200) with a plurality of consecutive encoder layers (201a, 201b) and a decoder (210) with a plurality of consecutive decoder layers (211a, 211b),

the encoder (200) is configured to generate one of the plurality of feature maps (103) per encoder layer (201a, 201b),

wherein the first encoder layer (201a) is configured to generate and downsample a feature map (103) from the received input (101, 102), and each further encoder layer (201b) is configured to generate and downsample a further feature map (103) based on the feature map (103) generated by the previous encoder layer (201a, 201b), and

the decoder (210) is configured to generate one of the plurality of activation maps (104) per decoder layer (211a, 211b),

wherein the first decoder layer (211a) is configured to upsample the feature map (103) generated by the last encoder layer (201b) and generate an activation map (104) based on the upsampled feature map, and each further decoder layer (211b) is configured to upsample the activation map (104) generated by the previous decoder layer (211a, 211b) and generate a further activation map (104) based on the upsampled activation map (104).

6. Device (100) according to claim 5, wherein

each encoder layer (201a, 201b) contains at least one convolutional filter configured to operate on respectively the input (101, 102) or the feature map (103) of the previous encoder layer (201a, 201b), in order to generate a feature map (103), and

each decoder layer (211a, 211b) contains at least one convolutional filter configured to operate on respectively the feature map (103) of the last encoder layer (201b) or the activation map (104) of the previous decoder layer (211a, 211b), in order to generate an activation map (104).

7. Device (100) according to claim 5 or 6, wherein

each encoder layer (201b) is configured to reduce the resolution of the feature map (103) by performing a strided convolution or pooling operation, and

each decoder layer (211b) is configured to increase the resolution of the feature map (103) of the last encoder layer (201b) or of the activation map (104) generated by the previous decoder layer (211a, 211b) by performing a transposed convolution or unpooling operation.

8. Device (100) according to one of the claims 5 to 7, wherein

the CNN further comprises a plurality of skip connections (202),

each skip connection (202) connects one of the further encoder layers (201b), which is configured to generate a feature map (103) of a certain resolution, with one of the further decoder layers (211b), which is configured to generate an activation map (104) of a same resolution or the most similar resolution, and

said further decoder layer (211b) is configured to generate the activation map (104) based on the activation map (104) of the previous decoder layer (211a, 211b) and the feature map (103) generated by the encoder layer (201b) to which it is connected via the skip connection (202).

9. Device (100) according to one of the claims 5 to 8, wherein

each of the further encoder layers (201b) is configured to generate a feature map (103) including more channels (300) than included in the feature map (103) generated by the previous encoder layer (201a, 201b).

10. Device (100) according to one of the claims 5 to 9, wherein

each of the further decoder layers (211b) is configured to generate an activation map (104) including fewer channels (400) than included in the activation map (104) generated by the previous decoder layer (211a, 211b).

11. Device (100) according to one of the claims 5 to 10, wherein

each decoder layer (211a, 211b) is further configured to output a 1-channel activation map estimation (212), and the device (100) is configured to

calculate a multi-resolution loss based on all the output 1-channel activation map estimations (212) of the decoder layers (211a, 211b), and

upsample each 1-channel activation map estimation (212) and use it as input to the next decoder layer (211b).

12. Hardware implementation of a Convolutional Neural Network, CNN, configured to

receive as an input (101, 102) a picture (101) and a background model image (102),

generate a plurality of feature maps (103) of different resolution based on the input (101, 102), wherein the resolution of feature maps (103) is gradually reduced,

generate a plurality of activation maps (104) of different resolution based on the plurality of feature maps (103) of different resolution, wherein the resolution of activation maps (104) is gradually increased, and

output a 1-channel probability map (105) having the same resolution as the picture (101),

wherein each pixel of the output 1-channel probability map (105) corresponds to a pixel of the picture (101) and indicates a probability that the corresponding pixel of the picture (101) is associated with a foreground object or with a background object.

13. Method (500) of employing a Convolutional Neural Network, CNN, for separating a picture (101) into foreground and background, the method comprising

receiving (501) as an input (101, 102) the picture (101) and a background model image (102),

generating (502) a plurality of feature maps (103) of different resolution based on the input (101, 102), wherein the resolution of feature maps (103) is gradually reduced, generating (503) a plurality of activation maps (104) of different resolution based on the plurality of feature maps (103) of different resolution, wherein the resolution of activation maps (104) is gradually increased, and

outputting (504) a 1-channel probability map (105) having the same resolution as the picture (101),

wherein each pixel of the output 1-channel probability map (105) corresponds to a pixel of the picture (101) and indicates a probability that the corresponding pixel of the picture (101) is associated with a foreground object or with a background object.

14. Computer program product comprising program code for performing, when implemented on a processor, a method (500) according to claim 13.

15. Computer comprising at least one memory and at least one processor, which are configured to store and execute program code to perform the method (500) according to claim 13.

Description:
DEVICE AND METHOD FOR SEPARATING A PICTURE INTO FOREGROUND AND BACKGROUND USING DEEP LEARNING

TECHNICAL FIELD

Embodiments of the present invention relate to the task of separating a picture, e.g. a picture of a video, particularly of a surveillance video, into foreground and background, and specifically to separating moving foreground objects from a static background scene. To this end, the present invention presents a device and method, which employ a Convolutional Neural Network (CNN), i.e. perform the separation based on deep learning.

BACKGROUND

The ever-expanding camera networks around the world generate a huge amount of surveillance video data. This video data needs efficient video analytics pipelines, in order to provide timely, accurate and useful information to the concerned authorities.

Segmentation is a key component of conventional video analytics, and is used to extract moving foreground objects from static background scenes. At the picture level, segmentation can be seen as the grouping of pixels of the picture into regions that represent moving objects. It is essential to achieve high segmentation accuracy, since it is one of the first steps in many processing pipelines. Current segmentation techniques do not deliver satisfactory results for a range of different recording conditions of the cameras, while maintaining a low computational complexity for real-time processing.

Since surveillance cameras have a stationary position most of the time, they record the same background scene with moving foreground objects of interest. Conventional approaches usually exploit this assumption, and extract the moving objects from each picture of the video by removing the stationary regions of the picture.

“Background subtraction” is one such conventional approach, and is based on the “difference” between a current picture and a reference picture, often called “background model”. Variants of this approach depend on how the background model is constructed and how the “difference” operation is defined at the pixel level. For example, the background model can be estimated as the median at each pixel location of all the pictures (or frames) in a sliding window close to the current picture (or current frame). And the “difference” can be defined as the difference in pixel intensity between the current picture and the background model at each pixel location. Although some background subtraction techniques are relatively fast, and a number of them have been widely used for surveillance video analytics, these techniques have several limitations:

• Noisy segmentation, due to a slight intensity difference between the current picture and the background model in the case of e.g. shadows, illumination changes, or weather conditions.

• Color similarity between the foreground and background regions may create holes or even break the foreground masks into disconnected blobs.

• Intermittent moving objects can become part of the background model, and thus cannot be extracted.

“Semantic/instance segmentation” techniques in computer vision have improved substantially in recent years, due to the use of deep learning with ever-increasing computing resources and training data. Semantic segmentation is the process of associating each pixel of a picture with a class label (such as “person”, “vehicle”, “bicycle”, “tree”, ...), while instance segmentation combines object detection with semantic segmentation to label each object instance using a unique instance label. Even though these techniques start to be used in advanced perception systems, such as autonomous vehicles, they are not designed explicitly for surveillance video applications, and thus do not exploit in their algorithmic formulation the fact that surveillance cameras are stationary. The performance of these techniques in foreground object extraction is thus sub-optimal.

“Background subtraction/semantic segmentation combination” is a hybrid technique, which consists of leveraging object-level semantics to address some challenging scenarios for background subtraction. More precisely, the technique combines a “probability map” output from semantic segmentation with the output of a background subtraction technique, in order to reduce false positives. However, this hybrid technique has to run two models separately in parallel, and does not directly use deep learning in an end-to-end solution for extracting foreground moving objects.

“Deep learning-based background subtraction” is a relatively new technique. There are some conventional approaches that use deep neural networks to solve foreground object extraction from surveillance videos. Characteristic features and disadvantages of these approaches using deep learning are summarized below:

• Some approaches use scene-specific models. However, a new training is required in this case for each new deployment to a new scene, making these approaches rather inefficient and impractical.

• Some approaches use patches of a small size, where patches are extracted from a picture and a background model image, respectively. However, the small size of the patches leads to too little contextual information from the neighboring regions in the spatial domain to help decide accurately whether a given pixel in the small patch belongs to the foreground or background.

• Some approaches convert imaging data from RGB to grayscale, before inputting it into a CNN model. However, since grayscale data (1 color channel) contains much less information than RGB data (3 color channels), it is more difficult for the CNN model to perform the background subtraction task. Thus, the performance becomes sub-optimal.

• Some approaches do not have the above-mentioned small patch size or grayscale problems, but have instead an overly complex architecture, using e.g. 10 consecutive pictures as an input. The high computational complexity prevents these approaches from being deployed in real systems.

• Some approaches do not work on patches, but work with pictures that are resized (reduced in size) to e.g. 336x336 and 320x240, respectively. Due to this size reduction, it becomes difficult to segment small-sized foreground objects.

SUMMARY

In view of the above-mentioned problems and disadvantages, embodiments of the present invention aim to improve the conventional approaches. An objective is to provide a segmentation technique, which is high-performing and robust to different recording conditions. In particular, embodiments of the invention aim for a lightweight device and a method for separating pictures of surveillance videos in an improved manner. Embodiments of the present invention are defined in the enclosed independent claims. Advantageous implementations of the present invention are further defined in the dependent claims.

In particular, embodiments of the present invention propose a segmentation technique, which is developed specifically for surveillance videos. Contrary to conventional semantic segmentation, which assigns a category label to each pixel of a picture to indicate the object or stuff the pixel belongs to, the proposed segmentation technique is able to assign a binary value to each pixel to indicate whether the pixel belongs to a moving object, or in other words, whether it belongs to a foreground object of interest or to the background. The output is, for example, a binary map indicating for each pixel of the picture (or image or frame) whether it is associated with a foreground object of interest or not (e.g. with the background).

The limitations of the above-described conventional approaches are particularly addressed by proposing a CNN model for separating the picture into foreground and background (the CNN model is also referred to as “BackgroundNet”). The CNN model is based on the features summarized below:

• Background subtraction is reformulated into a trainable end-to-end segmentation problem, which is suitable for deep learning.

• The CNN input is a picture and a background model image, preferably in high-resolution RGB (6 channels in total).

• The CNN output is a 1-channel probability map of the same resolution as the input, i.e. as the picture. The probability value at each pixel location indicates the confidence that the pixel of the picture belongs to a foreground moving object. Then, preferably, a decision on foreground/background per pixel is forced by thresholding to obtain a binary map.

• The CNN has an encoder-decoder architecture for multi-resolution feature extraction in the encoder, and multi-resolution foreground activation maps in the decoder.

• Preferably, multiple skip connections are provided from the encoder to the decoder to help restore fine boundary details of the activation maps at multi-resolution levels.

• Preferably, the training of the CNN optimizes the multi-resolution binary cross entropy loss by using multi-resolution activation maps. Thus, the appearance of holes in each activation map or even breakdown of an activation map into disconnected regions can be avoided.

Accordingly, embodiments of the invention are defined by the following aspects and implementation forms.

A first aspect of the invention provides a device for separating a picture into foreground and background configured to employ a CNN, to receive as an input the picture and a background model image, generate a plurality of feature maps of different resolution based on the input, wherein the resolution of feature maps is gradually reduced, generate a plurality of activation maps of different resolution based on the plurality of feature maps of different resolution, wherein the resolution of activation maps is gradually increased, and output a 1-channel probability map having the same resolution as the picture, wherein each pixel of the output 1-channel probability map corresponds to a pixel of the picture and indicates a probability that the corresponding pixel of the picture is associated with a foreground object or with a background object.

The “picture” may be a still image or may be a picture of a video. That is, the picture may be one picture in a sequence of pictures, which forms a video. Accordingly, the picture may also be a frame of a video. The video may specifically be a surveillance video, which is typically taken by a stationary surveillance video camera.

Each value of a “feature map” indicates whether one or more determined features are present at one of multiple different regions of the input. The resolution of feature maps is gradually reduced so that longer range information is more easily captured in the deeper feature maps. Each value of an “activation map” indicates a confidence that one of multiple different regions of a corresponding feature map is associated with a foreground object or with background. That is, the activation maps may be considered to represent foreground masks. The resolution of the activation maps is gradually increased so that object details are better recovered in the deeper activation maps.

A “probability map” can be obtained by applying a sigmoid function to a 1-channel “activation map”. Thus, the probability map is also 1-channel, i.e. a 1-channel probability map, and its values are, e.g., in the range [0, 1]. By employing the CNN as described above, the device of the first aspect is able to segment the picture with high performance and robustness to different recording conditions. The device of the first aspect is particularly well suited for separating one or more pictures of a surveillance video.

In an implementation form of the first aspect, the device is configured to threshold the output 1-channel probability map to get a binary mask, wherein each pixel of the binary mask indicates whether the corresponding pixel of the picture is associated with a foreground object or with a background object.

That is, the device can easily produce an accurate separation of the picture in foreground and background.

In a further implementation form of the first aspect, the input includes a 3-channel [particularly RGB] high-resolution background model image and a 3-channel [particularly RGB] picture of similar resolution.

That is, the device does not need to perform a greyscale conversion, i.e. more information can be used for the separation of the picture.

In a further implementation form of the first aspect, the CNN includes an encoder-decoder architecture, the encoder is configured to generate the plurality of feature maps of different resolution, and the decoder is configured to generate the plurality of activation maps of different resolution.

This structure allows a particular high performance and accurate separation of the picture into foreground and background.

In a further implementation form of the first aspect, the CNN comprises an encoder with a plurality of consecutive encoder layers and a decoder with a plurality of consecutive decoder layers, the encoder is configured to generate one of the plurality of feature maps per encoder layer, wherein the first encoder layer is configured to generate and downsample a feature map from the received input, and each further encoder layer is configured to generate and downsample a further feature map based on the feature map generated by the previous encoder layer, and the decoder is configured to generate one of the plurality of activation maps per decoder layer, wherein the first decoder layer is configured to upsample the feature map generated by the last encoder layer and generate an activation map based on the upsampled feature map, and each further decoder layer is configured to upsample the activation map generated by the previous decoder layer and generate a further activation map based on the upsampled activation map.

In a further implementation form of the first aspect, each encoder layer contains at least one convolutional filter configured to operate on respectively the input or the feature map of the previous encoder layer, in order to generate a feature map, and each decoder layer contains at least one convolutional filter configured to operate on respectively the feature map of the last encoder layer or the activation map of the previous decoder layer, in order to generate an activation map.

In a further implementation form of the first aspect, each encoder layer is configured to reduce the resolution of the feature map by performing a strided convolution or pooling operation, and each decoder layer is configured to increase the resolution of the feature map of the last encoder layer or of the activation map generated by the previous decoder layer by performing a transposed convolution or unpooling operation.

In a further implementation form of the first aspect, the CNN further comprises a plurality of skip connections, each skip connection connects one of the further encoder layers, which is configured to generate a feature map of a certain resolution, with one of the further decoder layers, which is configured to generate an activation map of a same resolution or the most similar resolution, and said further decoder layer is configured to generate the activation map based on the activation map of the previous decoder layer and the feature map generated by the encoder layer to which it is connected via the skip connection.

That is, a further activation map may be generated based on a concatenation of the feature map from an encoder layer with similar resolution (due to the skip connection) and an upsampled activation map generated by the previous decoder layer. The skip connections are beneficial for restoring fine boundary details in the multi-resolution activation maps.

In a further implementation form of the first aspect, each of the further encoder layers is configured to generate a feature map including more channels than included in the feature map generated by the previous encoder layer.

In a further implementation form of the first aspect, each of the further decoder layers is configured to generate an activation map including fewer channels than included in the activation map generated by the previous decoder layer.

In a further implementation form of the first aspect, each decoder layer is further configured to output a 1-channel activation map estimation, and the device is configured to calculate a multi-resolution loss based on all the output 1-channel activation map estimations of the decoder layers, and upsample each 1-channel activation map estimation and use it as input to the next decoder layer.

This improves the separation of the picture into foreground and background. Further, the final loss can be used for training the CNN, thus improving the performance of the CNN.

A second aspect of the invention provides a hardware implementation of a CNN configured to receive as an input a picture and a background model image, generate a plurality of feature maps of different resolution based on the input, wherein the resolution of feature maps is gradually reduced, generate a plurality of activation maps of different resolution based on the plurality of feature maps of different resolution, wherein the resolution of activation maps is gradually increased, and output a 1-channel probability map having the same resolution as the picture, wherein each pixel of the output 1-channel probability map corresponds to a pixel of the picture and indicates a probability that the corresponding pixel of the picture is associated with a foreground object or with a background object.

With the hardware implementation of the second aspect and corresponding implementations of the second aspect corresponding to the above implementations of the first aspect, the respective advantages and effects of the device of the first aspect and its implementations can be achieved.

A third aspect of the invention provides a method of employing a CNN for separating a picture into foreground and background, the method comprising receiving as an input the picture and a background model image, generating a plurality of feature maps of different resolution based on the input, wherein the resolution of feature maps is gradually reduced, generating a plurality of activation maps of different resolution based on the plurality of feature maps of different resolution, wherein the resolution of activation maps is gradually increased, and outputting a 1-channel probability map having the same resolution as the picture, wherein each pixel of the output 1-channel probability map corresponds to a pixel of the picture and indicates a probability that the corresponding pixel of the picture is associated with a foreground object or with a background object.

In an implementation form of the third aspect, the method comprises thresholding the output 1-channel probability map to get a binary mask, wherein each pixel of the binary mask indicates whether the corresponding pixel of the picture is associated with a foreground object or with a background object.

In a further implementation form of the third aspect, the input includes a 3-channel [particularly RGB] high-resolution background model image and a 3-channel [particularly RGB] picture.

In a further implementation form of the third aspect, the employed CNN includes an encoder-decoder architecture, the encoder generates the plurality of feature maps of different resolution, and the decoder generates the plurality of activation maps of different resolution.

In a further implementation form of the third aspect, the employed CNN comprises an encoder with a plurality of consecutive encoder layers and a decoder with a plurality of consecutive decoder layers, the encoder generates one of the plurality of feature maps per encoder layer, wherein the first encoder layer generates and downsamples a feature map from the received input, and each further encoder layer generates and downsamples a further feature map based on the feature map generated by the previous encoder layer, and the decoder generates one of the plurality of activation maps per decoder layer, wherein the first decoder layer upsamples the feature map generated by the last encoder layer and generates an activation map based on the upsampled feature map, and each further decoder layer upsamples the activation map generated by the previous decoder layer and generates a further activation map based on the upsampled activation map.

In a further implementation form of the third aspect, each encoder layer contains at least one convolutional filter operating on respectively the input or the feature map of the previous encoder layer, in order to generate a feature map, and each decoder layer contains at least one convolutional filter operating on respectively the feature map of the last encoder layer or the activation map of the previous decoder layer, in order to generate an activation map.

In a further implementation form of the third aspect, each encoder layer reduces the resolution of the feature map by performing a strided convolution or pooling operation, and each decoder layer increases the resolution of the feature map of the last encoder layer or of the activation map generated by the previous decoder layer by performing a transposed convolution or unpooling operation.

In a further implementation form of the third aspect, the CNN further comprises a plurality of skip connections, each skip connection connects one of the further encoder layers, which generates a feature map of a certain resolution, with one of the further decoder layers, which generates an activation map of a same resolution or the most similar resolution, and said further decoder layer generates the activation map based on the activation map of the previous decoder layer and the feature map generated by the encoder layer to which it is connected via the skip connection.

In a further implementation form of the third aspect, each of the further encoder layers is configured to generate a feature map including more channels than included in the feature map generated by the previous encoder layer.

In a further implementation form of the third aspect, each of the further decoder layers generates an activation map including fewer channels than included in the activation map generated by the previous decoder layer.

In a further implementation form of the third aspect, each decoder layer further outputs a 1-channel activation map estimation, and the method comprises calculating a multi-resolution loss based on all the output 1-channel activation map estimations of the decoder layers, and upsampling each 1-channel activation map estimation and using it as input to the next decoder layer.

With the method of the third aspect and its implementation forms, the respective advantages and effects described above for the device of the first aspect and its respective implementation forms can be achieved.

A fourth aspect of the invention provides a computer program product comprising program code for performing, when implemented on a processor, a method according to the third aspect or any of its implementation forms.

A fifth aspect of the invention provides a computer comprising at least one memory and at least one processor, which are configured to store and execute program code to perform the method according to the third aspect or any of its implementation forms.

In summary, the above-described aspects and implementation forms have the following advantages over the conventional approaches and techniques:

• Better than conventional background subtraction techniques, because of no feature engineering and no parameter tuning.

• Better than semantic/instance segmentation, because of an optimized foreground object extraction for surveillance videos.

• Better than the hybrid technique, since a single end-to-end trainable CNN model is provided, and no additional semantic segmentation is necessary.

• Better than existing CNN models for background subtraction, because:

- Not scene-specific, i.e. a single model is pre-trained to work on all scenes.

- No RGB-to-grayscale conversion required, i.e. more information of the picture and background model image is available.

- No picture resizing required, i.e. pictures or videos inputted to the CNN can have a resolution similar to the resolution used for video recording, for high segmentation performance. For example, if a video is recorded at 1920x1080, the CNN will accept input data of size 1920x1080.

- Lightweight, because the CNN architecture is carefully designed so that it can perform foreground object extraction on 1920x1080 videos in real time using a common GPU.

• Better foreground object extraction results than all conventional approaches and techniques.

It has to be noted that all devices, elements, units and means described in the present application could be implemented in software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application, as well as the functionalities described to be performed by the various entities, are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear to a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF DRAWINGS

The above described aspects and implementation forms of the present invention will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which

FIG. 1 shows a device according to an embodiment of the invention.

FIG. 2 shows a device according to an embodiment of the invention.

FIG. 3 shows an encoder of a CNN of a device according to an embodiment of the invention.

FIG. 4 shows a decoder of a CNN of a device according to an embodiment of the invention.

FIG. 5 shows a method according to an embodiment of the invention.

FIG. 6 shows a comparison between results obtained by the invention and results obtained by conventional background subtraction techniques.

DETAILED DESCRIPTION OF EMBODIMENTS

FIG. 1 shows a device 100 according to an embodiment of the invention. The device 100 is configured to separate a picture 101 into foreground and background, e.g. into moving objects and a static scene. To this end, the device 100 of FIG. 1 is configured to employ a CNN (CNN model, CNN architecture), i.e. is configured to perform the separation of the picture 101 by using deep learning. The device 100 may be an image processor, computer, microprocessor, or the like, or multiples thereof or any combination thereof, that implements the CNN.

The CNN is configured to receive the picture 101 and a background model image 102 as an input 101, 102. The background model image 102 may be an image of a scene, which is monitored by a surveillance video camera also providing the picture, and which is taken in advance (or at some determined time) without any (moving) foreground objects, or it can be estimated as the median at each pixel location of all the pictures (or frames) in a sliding window close to the current picture (or current frame). The picture 101 may be a still picture or may be one of a sequence of pictures, e.g. pictures of a video, as provided for instance by said surveillance video camera.

Further, the CNN is configured to generate a plurality of feature maps 103 (indicated in FIG. 1 by dotted rectangles) of different resolution (indicated by the different sizes of the dotted rectangles) based on the input 101, 102. The resolution of the feature maps 103 is gradually reduced, i.e. each further generated feature map 103 has a lower resolution than the previous one.

Further, the CNN is configured to generate a plurality of activation maps 104 (indicated by dashed rectangles in FIG. 1) of different resolution (indicated by the different sizes of the dashed rectangles) based on the plurality of feature maps 103 of different resolution. The resolution of the activation maps 104 is gradually increased, i.e. each further generated activation map 104 has a higher resolution than the previous one.

The CNN is finally configured to output a 1-channel probability map 105 having the same resolution as the picture 101. Each pixel of the output 1-channel probability map 105 corresponds to a pixel of the picture 101 and indicates a probability that the corresponding pixel of the picture 101 is associated with a foreground object or with a background object. To generate the 1-channel probability map, the CNN may apply a sigmoid function to an activation map having 1 channel.

The device 100 may be further configured to threshold the output 1-channel probability map 105 to obtain a binary mask, wherein each pixel of the binary mask indicates whether the corresponding pixel of the picture 101 is associated with a foreground object or with a background object. In other words, the probability per pixel of the probability map is compared with a probability threshold, and the pixel is e.g. attributed to the background if its probability value is below the threshold, and to the foreground if its probability value is above the threshold. Notably, the thresholding can also be done by another device receiving the 1-channel probability map 105 from the CNN.
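
As an illustration only, the sigmoid and thresholding steps described above can be sketched in PyTorch as follows; the function name and the threshold value of 0.5 are assumptions for this example, not values prescribed by the embodiment.

```python
import torch

def probability_map_to_binary_mask(activation_map: torch.Tensor,
                                   threshold: float = 0.5) -> torch.Tensor:
    """Turn a 1-channel activation map of shape (N, 1, H, W) into a binary
    foreground mask.

    A sigmoid maps the activations to probabilities in [0, 1] (the 1-channel
    probability map 105); thresholding then forces a per-pixel decision.
    """
    probability_map = torch.sigmoid(activation_map)  # values in [0, 1]
    binary_mask = probability_map > threshold        # True = foreground, False = background
    return binary_mask
```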

During a training stage of the CNN, binary masks of different resolution are beneficial, since they can be compared with the ground-truth data of different resolution. In the inference stage of the CNN, only the binary mask calculated from the output 1-channel probability map may be used.

FIG. 2 shows a device 100 according to an embodiment of the invention, which builds on the device 100 shown in FIG. 1. Accordingly, the device 100 of FIG. 2 includes all elements of the device 100 of FIG. 1, wherein identical elements are labelled with the same reference signs and function likewise.

It can be seen in FIG. 2 that the CNN of the device 100 includes an encoder-decoder architecture, i.e. it includes an encoder 200 and a decoder 210. The encoder 200 is configured to generate the plurality of feature maps 103, while the decoder 210 is configured to generate the plurality of activation maps 104 and the 1-channel probability map 105, respectively. The encoder 200 comprises a plurality of consecutive encoder layers 201a, 201b, namely a first encoder layer 201a and at least one further encoder layer 201b. The encoder 200 is configured to generate one of the plurality of feature maps 103 per encoder layer 201a, 201b, wherein each feature map 103 has a different resolution. The decoder 210 comprises a plurality of consecutive decoder layers 211a, 211b, namely a first decoder layer 211a and at least one further decoder layer 211b. The decoder 210 is configured to generate one of the plurality of activation maps 104 per decoder layer 211a, 211b, wherein each activation map 104 has a different resolution. The last decoder layer 211b specifically generates the 1-channel probability map 105 from the activation map 104 it generates.

The CNN further has a plurality of skip connections 202. Each skip connection 202 connects one of the further encoder layers 201b, which is configured to generate a feature map 103 of a certain resolution, with one of the decoder layers 211a and 211b, which is configured to generate an activation map 104 of a same resolution or at least of the most similar resolution.

Each decoder layer 211a, 211b is further configured to output a 1-channel activation map estimation 212. These estimations 212 can be used by the device 100 to calculate a multi-resolution loss based on all the output 1-channel activation map estimations 212 of the decoder layers 211a, 211b, in order to recover spatial information lost during downsampling. The CNN of the device 100 is further configured to upsample each 1-channel activation map estimation 212 and to then use it as input to the next decoder layer 211b. The 1-channel probability map 105 output by the last decoder layer 211b may correspond to the 1-channel activation map estimation 212 output from that layer 211b.

The encoder 200 is now exemplarily described in more detail. The encoder 200 may be defined as a sequence of convolution filtering intertwined with downsampling operations to obtain feature maps 103 of the picture 101 and the background model 102, respectively. Each encoder layer 201a, 201b may contain:

• A certain number of convolution filters to be applied on the feature map 103 that is output from the previous layer 201a, 201b or on the input of the network.

• Downsampling means for operating based on e.g. a strided convolution or pooling operation to reduce the resolution of the feature map 103 of the current layer 201a, 201b.

A feature map 103 output from each layer 201a, 201b can be described as a set of activations, which represents semantic information for a region of the input 101, 102 at that layer’s resolution. Thus, the encoder 200 can be seen as a multi-resolution feature extractor. Feature maps 103 output from encoder layers 201a, 201b can be used as input to the decoder 210 by means of the skip connections 202, in order to reconstruct activation maps 104 of the foreground moving objects at multiple scales at the decoder 210.

An example encoder 200 is shown in FIG. 3. The encoder 200 may accept 6-channel input data (3 channels for the picture 101 and 3 channels for the background model image 102) of original picture size, i.e. no picture resizing to a fixed size value is required. This input 101, 102 will go through a sequence of 5 encoder layers 201a, 201b (1 to 5). The number of channels in the feature maps 103 (indicated by the dashed rectangles) output from these layers 201a, 201b increases as the data goes deeper into the encoder 200 (e.g., in FIG. 3 the first encoder layer 201a outputs a feature map 103 with 64 channels, whereas the fifth encoder layer 201b outputs a feature map 103 with 512 channels). At the same time, the spatial resolution of the feature maps 103 decreases by a factor of 1/2 after each encoder layer 201a, 201b, reaching a downsampling factor of 1/32 at the end of the encoder 200.

The decoder 210 is now exemplarily described in more detail. The decoder 210 may be defined as a sequence of convolution filtering intertwined with upsampling operations to obtain activation maps 104 of foreground moving objects. Each decoder layer 211a, 211b may contain:

• A certain number of convolution filters to be applied on the concatenation of the previous decoder layer’s activation map 104 and/or 1-channel activation map estimation 212 with the corresponding encoder’s feature map 103 (in terms of resolution).

• Upsampling means operating based on e.g. a transposed convolution or unpooling operation.

An activation map 104 output of each decoder layer 211a, 211b may be described as an estimation of binary masks at the current decoder layer’s resolution. Thus, a multi-layer decoder 210 produces multi-resolution estimations of binary masks of foreground moving objects.

An example decoder 210 is shown in FIG. 4. The decoder 210 receives multi-resolution feature maps 103 from the encoder 200, particularly by means of the skip connections 202 between encoder-decoder layers of similar resolution. The concatenation of the previous decoder layer’s activation map 104 with the corresponding encoder’s feature map 103 will go through a sequence of 4 decoder layers 211a, 211b (1 to 4). The number of channels in the activation maps 104 output from these layers 211a, 211b decreases as the data gets nearer to the end of the decoder 210 (e.g., the first decoder layer 211a outputs an activation map 104 of 256 channels, whereas the fourth (last) decoder layer 211b generates an activation map 104 of 1 channel and outputs a probability map 105 having 1 channel). At the same time, the spatial resolution of the activation maps 104 increases by a factor of 2 after each decoder layer 211a, 211b, reaching 1/4 at the end of the decoder 210 and before a final bilinear interpolation with a resizing factor of 4 and a sigmoid function to obtain the 1-channel probability map 105.

Each decoder layer 211a, 211b may additionally have a mask-estimator that produces a 1-channel activation map estimation 212 at that layer’s resolution. In total, there are 4 mask-estimators, one of which is used as the last layer 211b of the decoder module. The outputs of these 4 mask-estimators can be used to calculate the multi-resolution loss in the training phase.

The skip connections 202 are used to bring feature maps 103 from encoder layers 201a, 201b to the corresponding decoder layers 211a, 211b of same or similar resolution. The skip connections 202:

• Allow the encoder’s 200 feature maps 103 of different resolutions to contribute directly to the generation of the activation maps 104 of foreground moving objects in the decoder 210.

• Are beneficial, if fine boundary details in the activation maps 104 are desired.

• Can take the form of a direct concatenation of the encoder’s 200 feature maps 103 and the corresponding decoder’s 210 activation maps 104. A 1x1 convolution can be used to reduce the number of channels in the encoder’s feature map 103 before concatenation, in order to decrease the decoder’s 210 computational complexity.
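
A minimal sketch of such a skip connection, assuming PyTorch, concatenation along the channel dimension and matching spatial sizes; the class name, the channel arguments and the use of a 1x1 convolution for channel reduction are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SkipConnection(nn.Module):
    """Concatenate an encoder feature map with a decoder activation map of the
    same (or most similar) spatial resolution. A 1x1 convolution first reduces
    the encoder channels to keep the decoder's computational complexity low."""

    def __init__(self, encoder_channels: int, reduced_channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(encoder_channels, reduced_channels, kernel_size=1)

    def forward(self, encoder_feature_map: torch.Tensor,
                decoder_activation_map: torch.Tensor) -> torch.Tensor:
        reduced = self.reduce(encoder_feature_map)
        # Channel-wise concatenation; spatial sizes are assumed to match.
        return torch.cat([reduced, decoder_activation_map], dim=1)
```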

A multi-resolution loss may further be calculated and used to enforce the generation of activation maps 104 at multiple resolutions. For example, the loss at each resolution is first computed as the binary cross-entropy between the estimation of binary masks at that resolution and a downsampled version of the expected binary masks (ground-truth). The final loss used for training (i.e. updating the values of the encoder’s and decoder’s convolution filters) is the weighted sum of all the losses at all resolutions.
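
A sketch of this multi-resolution loss in PyTorch, assuming the activation map estimations are logits, the ground-truth mask is downsampled with nearest-neighbour interpolation, and the per-resolution weights are left as illustrative parameters:

```python
import torch
import torch.nn.functional as F

def multi_resolution_loss(mask_estimations, ground_truth, weights=None):
    """Weighted sum of binary cross-entropy losses over all resolutions.

    mask_estimations: list of 1-channel activation map estimations 212
                      (logits of shape (N, 1, H_i, W_i), one per decoder layer).
    ground_truth:     expected binary mask at full resolution, shape (N, 1, H, W), float.
    weights:          per-resolution weights (assumed all equal to 1 if omitted).
    """
    if weights is None:
        weights = [1.0] * len(mask_estimations)
    total = 0.0
    for estimation, weight in zip(mask_estimations, weights):
        # Downsample the expected binary mask to this estimation's resolution.
        target = F.interpolate(ground_truth, size=estimation.shape[-2:], mode="nearest")
        total = total + weight * F.binary_cross_entropy_with_logits(estimation, target)
    return total
```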

Stochastic gradient descent supervised training may be used to determine the convolution filters, so that the device 100 can perform optimally on certain datasets. For each mini-batch of size k of the training dataset:

• Input to the CNN model: I = [(i^1, b^1), (i^2, b^2), ..., (i^k, b^k)], where k is the number of training samples in the current mini-batch, and i and b denote the picture 101 and the corresponding background model image 102, respectively.

• The corresponding ground-truth (i.e. expected network output) of the network input I is T = [T^1, T^2, ..., T^k], where k is the number of training samples.

• Use I as input to the model and obtain the multi-resolution 1-channel activation maps O = [O^1_(1→4), O^2_(1→4), ..., O^k_(1→4)], where k is the number of training samples, and (1→4) denotes the four multi-resolution indices (i.e. there are 4 generated activation maps 104 at 4 resolutions).

• Calculate the multi-resolution loss between T and O and update the model parameters (convolution filters’ values) using back-propagation of the loss function’s gradient, until no improvement of the loss function is observed.
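
A minimal sketch of this training procedure, assuming the CNN model returns the list of multi-resolution 1-channel activation map estimations, that a multi_resolution_loss function as sketched above is available, and that the data loader yields (picture, background model image, ground-truth mask) triples; the optimizer hyper-parameters and the fixed number of epochs (instead of an explicit no-improvement stopping criterion) are assumptions:

```python
import torch

def train_background_net(model, data_loader, epochs: int = 10, lr: float = 0.01):
    """Stochastic gradient descent supervised training over mini-batches."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for picture, background, ground_truth in data_loader:
            # 6-channel input: 3 channels for the picture 101,
            # 3 channels for the background model image 102.
            model_input = torch.cat([picture, background], dim=1)
            mask_estimations = model(model_input)   # multi-resolution estimations 212
            loss = multi_resolution_loss(mask_estimations, ground_truth)
            optimizer.zero_grad()
            loss.backward()                         # back-propagate the loss gradient
            optimizer.step()                        # update the convolution filters' values
```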

Below are given examples of an implementation in PyTorch of an encoder 200 and of a decoder 210, respectively.
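
The original listings are not reproduced here; instead, the following is a hedged PyTorch sketch of an encoder and a decoder following the structure described above (five encoder layers from a 6-channel input up to 512-channel feature maps, four decoder layers with skip connections and per-layer mask-estimators, and a final bilinear interpolation with a sigmoid). The class names, kernel sizes, activation functions, intermediate channel counts and per-layer scaling factors are assumptions, not taken from the patent figures:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderLayer(nn.Module):
    """Convolution followed by a strided convolution that halves the resolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=2, padding=1),  # downsample by 1/2
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class Encoder(nn.Module):
    """Five encoder layers 201a, 201b operating on the 6-channel input 101, 102."""
    def __init__(self):
        super().__init__()
        channels = [6, 64, 128, 256, 512, 512]  # intermediate counts are assumptions
        self.layers = nn.ModuleList(
            [EncoderLayer(channels[i], channels[i + 1]) for i in range(5)])

    def forward(self, x):
        feature_maps = []                        # multi-resolution feature maps 103
        for layer in self.layers:
            x = layer(x)
            feature_maps.append(x)
        return feature_maps

class DecoderLayer(nn.Module):
    """Upsample, concatenate the skip feature map, convolve, and estimate a
    1-channel mask at this layer's resolution (activation map estimation 212)."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)  # upsample by 2
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True))
        self.mask_estimator = nn.Conv2d(out_ch, 1, kernel_size=1)

    def forward(self, x, skip):
        x = self.conv(torch.cat([self.up(x), skip], dim=1))   # skip connection 202
        return x, self.mask_estimator(x)

class Decoder(nn.Module):
    """Four decoder layers 211a, 211b producing activation maps 104 with a
    decreasing number of channels, plus the final 1-channel probability map 105."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            DecoderLayer(512, 512, 256),
            DecoderLayer(256, 256, 128),
            DecoderLayer(128, 128, 64),
            DecoderLayer(64, 64, 1)])

    def forward(self, feature_maps, output_size):
        x = feature_maps[-1]                     # lowest-resolution feature map
        estimations = []                         # 1-channel estimations 212
        for layer, skip in zip(self.layers, reversed(feature_maps[:-1])):
            x, estimation = layer(x, skip)
            estimations.append(estimation)
        # Final bilinear interpolation back to the picture resolution, then a
        # sigmoid, yields the 1-channel probability map 105.
        probability_map = torch.sigmoid(F.interpolate(
            estimations[-1], size=output_size, mode="bilinear", align_corners=False))
        return estimations, probability_map
```

A possible usage, assuming a 6-channel input tensor x of shape (N, 6, H, W) with H and W divisible by 32: feature_maps = Encoder()(x), followed by estimations, probability_map = Decoder()(feature_maps, x.shape[-2:]).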

FIG. 5 shows a method 500 according to an embodiment of the invention. The method 500 employs a CNN and is used for separating a picture 101 into foreground and background. The method 500 may be carried out by the device 100 shown in FIG. 1 or FIG. 2.

The method 500 comprises a step 501 of receiving as an input 101, 102 the picture 101 and a background model image 102. Further, a step 502 of generating a plurality of feature maps 103 of different resolution based on the input 101, 102, wherein the resolution of feature maps 103 is gradually reduced. Further, a step 503 of generating a plurality of activation maps 104 of different resolution based on the plurality of feature maps 103 of different resolution, wherein the resolution of activation maps 104 is gradually increased. Further, a step 504 of outputting a 1-channel probability map 105 having the same resolution as the picture 101. Each pixel of the output 1-channel probability map 105 corresponds to a pixel of the picture 101 and indicates a probability that the corresponding pixel of the picture 101 is associated with a foreground object or with a background object.

FIG. 6 shows example foreground object extraction results from a surveillance video frame (original picture shown top-left of FIG. 6) using the device 100 of the present invention (BackgroundNet, shown top-right of FIG. 6) and two implementations of conventional background subtraction techniques (CNT and MOG2, shown bottom-left and bottom-right respectively in FIG. 6). It can be seen that BackgroundNet provides segmentation results with much less noise and disconnectivity.

Embodiments of the invention may be implemented in hardware, software or any combination thereof. Embodiments of the invention, e.g. the device and/or the hardware implementation, may be implemented as any of a variety of suitable circuitry, such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, etc., or any combinations thereof. Embodiments may comprise computer program products comprising program code for performing, when implemented on a processor, any of the methods described herein. Further embodiments may comprise at least one memory and at least one processor, which are configured to store and execute program code to perform any of the methods described herein. For example, embodiments may comprise a device configured to store instructions for software in a suitable, non-transitory computer-readable storage medium and to execute the instructions in hardware using one or more processors to perform any of the methods described herein.

The present invention has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed invention, from a study of the drawings, this disclosure and the independent claims. In the claims as well as in the description, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.