METHOD AND SYSTEM FOR KEYPOINT DETECTION BASED ON NEURAL NETWORKS

Title:

METHOD AND SYSTEM FOR KEYPOINT DETECTION BASED ON NEURAL NETWORKS

Document Type and Number:

WIPO Patent Application WO/2021/213742

Abstract:

The invention relates to a computer-implemented method for determining keypoints of multiple objects included in an image and associating said keypoints to the respective objects based on a neural network (2) and a post processing system (3) coupled with said neural network (2).

Inventors:

GOR CSABA (DE)
ROHOSKA PETER (DE)
KALAPOS ANDRAS (DE)

Application Number:

PCT/EP2021/056887

Publication Date:

October 28, 2021

Filing Date:

March 17, 2021

Assignee:

CONTINENTAL AUTOMOTIVE GMBH (DE)

International Classes:

G06K9/00

Other References:

BAI YANG ET AL: "ACPNet:Anchor-Center Based Person Network for Human Pose Estimation and Instance Segmentation", 2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), IEEE, 8 July 2019 (2019-07-08), pages 1072 - 1077, XP033590402, DOI: 10.1109/ICME.2019.00188
YU XIANG ET AL: "PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 1 November 2017 (2017-11-01), XP081319465
ZHE CAO ET AL: "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 24 November 2016 (2016-11-24), XP080734074, DOI: 10.1109/CVPR.2017.143
GEORGE PAPANDREOU ET AL: "PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model", COMPUTER VISION - ECCV 2018 : 15TH EUROPEAN CONFERENCE, MUNICH, GERMANY, SEPTEMBER 8-14, 2018, PROCEEDINGS, PART XIV, 14 September 2018 (2018-09-14), Cham, pages 1 - 21, XP055611454, ISBN: 978-3-030-01264-9, Retrieved from the Internet [retrieved on 20190807], DOI: 10.1007/978-3-030-01264-9_17
ZHE CAO, TOMAS SIMON, SHIH-EN WEI, YASER SHEIKH: "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", COMPUTER VISION AND PATTERN RECOGNITION, 24 November 2016 (2016-11-24)
XINGYI ZHOU, DEQUAN WANG, PHILIPP KRAHENBUHL: "Objects as Points", COMPUTER VISION AND PATTERN RECOGNITION, 16 April 2019 (2019-04-16)

Attorney, Agent or Firm:

CONTINENTAL CORPORATION (DE)

Claims:

1) Computer-implemented method for determining keypoints of multiple objects included in an image and associating said keypoints to the respective objects based on a neural network (2) and a post processing system (3) coupled with said neural network (2), the method comprising the steps of:

- Providing confidence heatmaps (CH) by the neural network (2) (S10), said confidence heatmaps (CH) having a lower resolution than the image and therefore providing information regarding the rough location of keypoints of interest;

- Providing refinement offset vectors (ROV) by the neural network (2) (S11), each refinement offset vector (ROV) being associated with a certain keypoint, wherein each refinement offset vector (ROV) provides information for refining the location of the associated keypoint;

- Providing (S12):
  o a set of centroid offset vectors (COV) by the neural network (2), each centroid offset vector (COV) of the set of centroid offset vectors (COV) being associated with a keypoint, the centroid offset vector (COV) providing information regarding the distance and direction between the keypoint and the centroid of the object to which said keypoint belongs; or
  o one or more centroid confidence heatmaps by the neural network (2), each centroid confidence heatmap having a lower resolution than the image and therefore providing information regarding the rough location of a centroid, wherein one or more centroid refinement offset vectors are provided by the neural network (2), each centroid refinement offset vector being associated with a centroid and providing information for refining the location of said associated centroid;

- Determining the location of the maximum value in the respective confidence heatmaps (CH), using the locations of the maximum values in the respective confidence heatmaps (CH) as rough keypoint locations and refining said rough keypoint locations by adding refinement offset vectors (ROV), thereby obtaining refined keypoint locations (S13);

- Determining the centroids of said multiple objects (S14) by
  o adding centroid offset vectors (COV) to the refined keypoint locations or the rough keypoint locations; or
  o refining the rough centroid locations provided by centroid confidence heatmaps by adding a centroid refinement offset vector to the respective rough centroid location;

- Associating keypoints to the objects based on the determined centroids of the objects (S15).

2) Method according to claim 1, wherein the neural network (2) provides a set of affinity field vectors (AFV), said affinity field vectors (AFV) providing information regarding connections between pairs of keypoints.

3) Method according to claim 2, wherein the affinity field vectors (AFV) are used for determining pairs of keypoints which are interconnected by object connections.

4) Method according to claim 2 or 3, wherein in case of keypoint ambiguities, affinity field vectors (AFV) are used to remove keypoint ambiguities.

5) Method according to any one of the preceding claims, wherein the neural network (2) provides confidence heatmaps (CH), refinement offset vectors (ROV), centroid offset vectors and/or centroid confidence heatmaps for different types of objects.

6) Method according to any one of the preceding claims, wherein confidence heatmaps (CH) are grouped and/or labelled according to given types of keypoints.

7) Method according to any one of the preceding claims, wherein the step of associating keypoints to objects is performed based on searching the nearest neighboring centroid of one or more keypoints.

8) Method according to any one of the preceding claims, wherein the step of associating keypoints to objects is performed by determining one or more centroid clusters, assigning a label to each centroid cluster and assigning the label of a certain centroid cluster to the keypoints by considering centroid offset vectors (COV) associated with said keypoints.

9) Method according to any one of the preceding claims, wherein the step of determining the centroids of the objects comprises applying centroid offset vectors (COV) to the keypoint locations, leading to multiple provisional centroids, and determining a centroid by applying an interpolation algorithm to said multiple provisional centroids.

10) Method according to any one of the preceding claims, wherein the method steps are executed by processing hardware included in a vehicle in order to process environment images surrounding the vehicle and/or to process images captured from the interior of the vehicle.

11) Method according to any one of the preceding claims, wherein the method steps are executed by processing hardware included in a camera of a vehicle.

12) Method according to any one of the preceding claims, wherein keypoints associated with an object are connected for estimating the pose of the object.

13) Computer program product for determining keypoints of multiple objects included in an image and associating said keypoints to the respective objects, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions being executable by a processor to cause the processor to perform the method according to any one of the preceding claims.

14) System for determining keypoints of multiple objects included in an image and associating said keypoints to the respective objects, the system comprising a neural network (2) and a post processing system (3) coupled with said neural network (2), the system further being configured to execute the steps of:

- Providing confidence heatmaps (CH) by the neural network (2), said confidence heatmaps (CH) having a lower resolution than the image and therefore providing information regarding the rough location of keypoints of interest;

- Providing refinement offset vectors (ROV) by the neural network (2), each refinement offset vector (ROV) being associated with a certain keypoint, wherein each refinement offset vector (ROV) provides information for refining the location of the associated keypoint;

- Providing:
  o a set of centroid offset vectors (COV) by the neural network (2), each centroid offset vector (COV) of the set of centroid offset vectors (COV) being associated with a keypoint, the centroid offset vector (COV) providing information regarding the distance and direction between the keypoint and the centroid of the object to which said keypoint belongs; or
  o one or more centroid confidence heatmaps by the neural network (2), each centroid confidence heatmap having a lower resolution than the image and therefore providing information regarding the rough location of a centroid, wherein one or more centroid refinement offset vectors are provided by the neural network (2), each centroid refinement offset vector being associated with a centroid and providing information for refining the location of said associated centroid;

- Determining the location of the maximum value in the respective confidence heatmaps (CH), using the locations of the maximum values in the respective confidence heatmaps (CH) as rough keypoint locations and refining said rough keypoint locations by adding refinement offset vectors (ROV), thereby obtaining refined keypoint locations;

- Determining the centroids of said multiple objects by
  o adding centroid offset vectors (COV) to the refined keypoint locations or the rough keypoint locations; or
  o refining the rough centroid locations provided by centroid confidence heatmaps by adding a centroid refinement offset vector to the respective rough centroid location;

- Associating a set of keypoints to the objects based on the determined centroids of the objects.

Description:

Method and system for keypoint detection based on neural networks

The present invention relates generally to the field of neural networks, specifically deep neural networks. More specifically, the invention relates to a method and a system for detecting keypoints of multiple objects included in an image and associating said keypoints to the respective objects based on a neural network. The neural network may be, for example, a convolutional neural network.

Keypoint detection, which aims to locate a predefined set of points of each object present in an image and to group these points into individual object instances, is a challenging task, specifically in automotive applications. By detecting keypoints of objects such as pedestrians, cyclists or vehicles, it is possible to determine the pose of said objects with respect to the line of sight of a camera. In addition, the skeletons of one or more persons included in an image provided by a camera can be detected.

Specifically in vehicles, computational resources are limited. Nevertheless, in automotive applications such as autonomous or at least partly assisted driving, keypoints have to be detected, grouped and associated with objects in real time (e.g. with a delay of 0.1 s or less), because decisions based on the detected object keypoints or poses have to be made promptly.

Zhe Cao, Tomas Simon, Shih-En Wei, Yaser Sheikh, Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, Computer Vision and Pattern Recognition, 24th November 2016 discloses predicting Gaussian confidence maps centered around the locations of keypoints of interest. The keypoints are related to person instances in two steps: first, directed connections between keypoints are introduced which together form a complete skeleton; then, these connections are located by predicting vector fields aligned with them. After all valid connections have been found, discrete skeletons can be reconstructed by joining the connections that share keypoints with each other. The main disadvantages of the pose estimation algorithm proposed by Cao et al. are the computational complexity of the upscaling step, which uses bicubic pixel interpolation, and the computational complexity of the post processing which groups the keypoints and associates the grouped keypoints with object instances.

Xingyi Zhou, Dequan Wang, Philipp Krähenbühl, Objects as Points, Computer Vision and Pattern Recognition, 16th April 2019 discloses an algorithm which detects centroids of the objects in the image, relates keypoint predictions to these centroids and predicts the offsets of the keypoints relative to the centroid. The proposed method cannot handle cases where keypoints of an object instance are missing (e.g. missing knees and ankles because the lower portions of the legs are not visible in the image). In such cases, the network is forced to predict the missing keypoints of a person as well, which creates a high number of false positive keypoint detections.

It is an objective of the embodiments of the invention to provide a method for detecting keypoints of multiple objects included in an image and associating said keypoints to the respective objects which on the one hand requires low computational resources and, on the other hand, is less prone to false positive keypoint detections. The objective is solved by the features of the independent claims. Preferred embodiments are given in the dependent claims. If not explicitly indicated otherwise, embodiments of the invention can be freely combined with each other.

According to an aspect, the invention refers to a method for determining keypoints of multiple objects included in an image and associating said keypoints to the respective objects. The method is performed based on a neural network and a post processing system coupled with said neural network. The method comprises the following steps:

First, confidence heatmaps are provided by the neural network. Each confidence heatmap provides information regarding the rough location of keypoints of interest. More specifically, each confidence heatmap may be a Gaussian or Gaussian-like distribution centered around the location of a certain keypoint. The confidence heatmap may have a lower resolution than the image provided as input to the neural network. This reduces the computational complexity, at the cost of lower accuracy: due to its reduced resolution or scale, the confidence heatmap can only provide the rough location of a keypoint.
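The Gaussian confidence heatmap described above can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the disclosed implementation: the helper name `gaussian_heatmap`, the image size, the stride of 8 pixels between image and grid, and the value of `sigma` are all hypothetical. It also shows where the accuracy loss comes from: a keypoint lying between two grid cells yields two equal maxima, so the grid alone cannot tell where the keypoint is.

```python
import math


def gaussian_heatmap(height, width, center, sigma=1.0):
    """Build a height x width confidence heatmap (list of rows).

    `center` is the keypoint location (x, y) in heatmap-grid units. The values
    follow an unnormalised Gaussian, so a peak lying exactly on a cell is 1.0.
    """
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 160 x 120 pixel image with an assumed stride of 8 -> 20 x 15 grid.
heatmap = gaussian_heatmap(height=15, width=20, center=(4.5, 7.25), sigma=1.5)

# The keypoint lies between columns 4 and 5, which therefore score equally.
print(heatmap[7][4] == heatmap[7][5])  # True
```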

In addition, refinement offset vectors are provided by the neural network. Each refinement offset vector is associated with a certain keypoint and provides information for refining the rough location of the associated keypoint indicated by the confidence heatmap. A refinement offset vector is a vector, specifically a two-dimensional vector, which indicates where exactly the keypoint is located relative to the rough location indicated by the confidence heatmap. More specifically, the confidence heatmap may be a probability distribution, i.e. it provides values indicating the probability that a keypoint is located at the respective position of the image. Due to the lower resolution of the confidence heatmap, its maximum value only indicates a rough location of the keypoint. Adding the refinement offset vector to the location of the maximum value of the confidence heatmap yields a refined location of the keypoint. For example, if the confidence heatmap indicates the rough location (x=2, y=2) and the refinement offset vector has the values (0.1, 0.8), the refined location (2.1, 2.8) is obtained by adding the values of the refinement offset vector to those of the rough location.
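The refinement step can be sketched in plain Python, reproducing the worked example. This is an illustration only; the function names are hypothetical, and it assumes that the network outputs a dense offset field from which the refinement offset vector is read at the grid cell of the heatmap maximum, with offsets in grid units as in the example.

```python
def argmax_2d(heatmap):
    """Return the (x, y) grid cell holding the maximum value of a 2-D list."""
    best_value, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy


def refine_keypoint(heatmap, offset_field):
    """Rough location = heatmap maximum; refined location = rough + offset.

    offset_field[y][x] is the refinement offset vector (dx, dy), in grid units,
    assumed to be read at the grid cell of the maximum.
    """
    x, y = argmax_2d(heatmap)
    dx, dy = offset_field[y][x]
    return (x + dx, y + dy)


# Worked example from the text: maximum at (2, 2), offset (0.1, 0.8).
heatmap = [[0.0] * 4 for _ in range(4)]
heatmap[2][2] = 0.9
offsets = [[(0.0, 0.0)] * 4 for _ in range(4)]
offsets[2][2] = (0.1, 0.8)

x, y = refine_keypoint(heatmap, offsets)
print(round(x, 6), round(y, 6))  # 2.1 2.8
```

Only one addition per keypoint is needed, which is why this is cheaper than upscaling the whole heatmap.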

In addition, the neural network provides either a set of centroid offset vectors or one or more centroid confidence heatmaps together with centroid refinement offset vectors. In the first alternative, each centroid offset vector of the set of centroid offset vectors is associated with a certain keypoint and provides information regarding the distance and direction between the keypoint and the centroid of the object to which said keypoint belongs. More specifically, the position of the centroid of an object can be obtained by adding the values of the centroid offset vector to the location of the keypoint. The location of the keypoint may be the rough location of the keypoint (i.e. the location of the maximum value of the confidence heatmap without adding the refinement offset vector) or the refined location of the keypoint (i.e. the location obtained by adding the refinement offset vector to the location of the maximum value of the confidence heatmap).
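The first alternative reduces to a vector addition per keypoint. The sketch below uses hypothetical keypoint names, locations and centroid offset vectors; it only shows that keypoints of the same object, each shifted by its own centroid offset vector, land on a common centroid.

```python
def centroid_from_keypoint(keypoint, centroid_offset):
    """Centroid estimate = keypoint location + centroid offset vector (COV)."""
    return (keypoint[0] + centroid_offset[0], keypoint[1] + centroid_offset[1])


# Hypothetical keypoints of one person and the COVs predicted at them.
keypoints = {"left_shoulder": (10.0, 6.0), "right_hip": (13.0, 12.0)}
covs = {"left_shoulder": (1.5, 3.0), "right_hip": (-1.5, -3.0)}

estimates = {name: centroid_from_keypoint(kp, covs[name])
             for name, kp in keypoints.items()}
print(estimates)
# {'left_shoulder': (11.5, 9.0), 'right_hip': (11.5, 9.0)}
```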

In the second alternative, each centroid confidence heatmap provides information regarding the rough location of a centroid. More specifically, each centroid confidence heatmap may be a Gaussian or Gaussian-like distribution centered around the location of a certain centroid. The centroid confidence heatmap may be a probability distribution, i.e. it provides values indicating the probability that a centroid is located at the respective position of the image. The centroid confidence heatmap may have a lower resolution than the image provided as input to the neural network. This reduces the computational complexity, at the cost of lower accuracy: due to its reduced resolution or scale, the centroid confidence heatmap can only provide the rough location of a centroid.

Each centroid refinement offset vector is associated with a centroid and provides information for refining the location of said associated centroid. A centroid refinement offset vector is a vector, specifically a two-dimensional vector, which indicates where exactly the centroid is located relative to the rough location indicated by the centroid confidence heatmap. More specifically, due to the lower resolution of the centroid confidence heatmap, its maximum value only indicates a rough location of the centroid. Adding the centroid refinement offset vector to the location of the maximum value of the centroid confidence heatmap yields a refined location of the centroid. For example, if the centroid confidence heatmap indicates the rough location (x=3, y=3) and the centroid refinement offset vector has the values (0.2, 0.7), the refined location (3.2, 3.7) is obtained by adding the values of the centroid refinement offset vector to those of the rough location.
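Because an image contains multiple objects, a centroid confidence heatmap may show several peaks. The sketch below assumes that one heatmap can hold the centroids of several objects and extracts them as thresholded local maxima over a 3 x 3 neighbourhood before applying the centroid refinement offset vectors. The threshold of 0.5, the neighbourhood size and the function names are hypothetical choices; the text itself only speaks of the location of the maximum value. The second peak in the example is invented for illustration, while the first reproduces the worked example.

```python
def local_maxima(heatmap, threshold):
    """Return (x, y) cells above `threshold` that are >= all their 8 neighbours.

    On a flat plateau every plateau cell qualifies; a real system would need
    an additional suppression step, which is omitted here.
    """
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y][x]
            if v < threshold:
                continue
            neighbours = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (nx, ny) != (x, y)
            ]
            if all(v >= n for n in neighbours):
                peaks.append((x, y))
    return peaks


def refined_centroids(centroid_heatmap, centroid_offsets, threshold=0.5):
    """Add the centroid refinement offset vector found at each peak cell."""
    return [
        (x + centroid_offsets[y][x][0], y + centroid_offsets[y][x][1])
        for x, y in local_maxima(centroid_heatmap, threshold)
    ]


# Two objects: the example from the text at (3, 3) and a second one at (0, 1).
grid = [[0.0] * 6 for _ in range(6)]
grid[3][3], grid[1][0] = 0.8, 0.7
offsets = [[(0.0, 0.0)] * 6 for _ in range(6)]
offsets[3][3], offsets[1][0] = (0.2, 0.7), (0.4, 0.1)

centroids = refined_centroids(grid, offsets)
print([(round(cx, 6), round(cy, 6)) for cx, cy in centroids])
# [(0.4, 1.1), (3.2, 3.7)]
```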

In post processing, the information provided by the neural network is processed. The location of the maximum value in each confidence heatmap is determined in order to obtain the rough keypoint locations. These rough keypoint locations are refined based on the information included in the refinement offset vectors: adding the refinement offset vector to the rough location of a keypoint yields the refined location of the keypoint.

In addition, the centroids of the objects are determined either by applying centroid offset vectors to the keypoint locations or by refining the rough centroid locations provided by the centroid confidence heatmaps based on the centroid refinement offset vectors. In the latter case, the location of the maximum value of a centroid confidence heatmap indicates the rough location of a centroid, and adding the centroid refinement offset vector to this rough location yields the refined location of the centroid.

Finally, the keypoints are grouped based on the determined centroids of the objects, thereby associating sets of keypoints with the respective objects. The association can be performed in different ways. For example, the information provided by the centroid offset vector may be used for associating a keypoint with a centroid and therefore with the object to which the centroid belongs. In addition, object connections may be determined based on the grouped keypoints.
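One way to use the centroid offset vectors for the association is sketched below: each keypoint is shifted by its centroid offset vector and assigned to the nearest detected centroid. This is a minimal sketch, not the disclosed procedure; the function name, the data layout and the gating distance `max_dist`, which leaves keypoints unassigned when no detected centroid is close enough, are hypothetical.

```python
import math


def associate(keypoints, covs, centroids, max_dist=2.0):
    """Group keypoints by the detected centroid closest to keypoint + COV.

    keypoints: list of (label, (x, y)); covs: list of (dx, dy), one per keypoint;
    centroids: list of (x, y). Returns {centroid_index: [(label, (x, y)), ...]}.
    Keypoints whose predicted centroid is farther than max_dist from every
    detected centroid are left unassigned.
    """
    groups = {i: [] for i in range(len(centroids))}
    for (label, (x, y)), (dx, dy) in zip(keypoints, covs):
        px, py = x + dx, y + dy
        distances = [math.hypot(cx - px, cy - py) for cx, cy in centroids]
        if not distances:
            continue
        best = min(range(len(distances)), key=distances.__getitem__)
        if distances[best] <= max_dist:
            groups[best].append((label, (x, y)))
    return groups


centroids = [(5.0, 5.0), (20.0, 5.0)]
keypoints = [("nose", (5.0, 1.0)), ("nose", (20.0, 1.2)), ("left_knee", (4.0, 9.0))]
covs = [(0.0, 4.0), (0.2, 3.9), (1.0, -4.0)]

groups = associate(keypoints, covs, centroids)
print({i: [label for label, _ in members] for i, members in groups.items()})
# {0: ['nose', 'left_knee'], 1: ['nose']}
```

A linear scan over the centroids is adequate for the handful of objects in a typical frame; a spatial index would only pay off for many objects.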

This method is advantageous because using refinement offset vectors for refining the locations of keypoints avoids computationally complex upscaling operations. In addition, by determining centroids and using them for grouping keypoints and associating sets of keypoints with objects, the tasks of keypoint grouping and association can also be handled with low computational complexity. Furthermore, the proposed method is less prone to false positive detections: because keypoint locations are transformed into centroid locations, it is not necessary to predict keypoints which are not present in the image.

Overall, the proposed method is therefore more efficient and more reliable.

According to an embodiment, the neural network provides a set of affinity field vectors, said affinity field vectors providing information regarding connections between pairs of keypoints. More specifically, a set of affinity field vectors is a vector field which provides information regarding an existing object connection between a pair of keypoints. The sets of affinity field vectors can be used for keypoint grouping.

According to an embodiment, the affinity field vectors are used for determining pairs of keypoints which are interconnected by object connections. Based on the set of affinity field vectors, it is possible to determine whether a connection between a pair of keypoints exists. This determination may serve as a verification step for checking whether a connection detected on the basis of centroid information is correct. Thereby, the risk of wrong keypoint grouping can be significantly reduced.
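The disclosure does not specify how a connection is verified from the affinity field vectors. A common scoring method, taken from the Part Affinity Fields approach of Cao et al. cited above, samples the field along the segment between two keypoints and averages its projection onto the segment direction. The sketch below implements that approximation; the function name, the sampling density and the toy field are hypothetical, and any threshold on the score for accepting a connection would be a further design choice.

```python
import math


def affinity_score(paf_x, paf_y, p1, p2, num_samples=10):
    """Mean alignment between an affinity field and the segment p1 -> p2.

    paf_x[y][x] and paf_y[y][x] are the affinity field vector components at
    grid cell (x, y). Points along the segment are rounded to the nearest
    cell; num_samples must be at least 2.
    """
    vx, vy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(vx, vy)
    if length == 0:
        return 0.0
    ux, uy = vx / length, vy / length  # unit vector of the candidate connection
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1)
        x = int(round(p1[0] + t * vx))
        y = int(round(p1[1] + t * vy))
        total += paf_x[y][x] * ux + paf_y[y][x] * uy
    return total / num_samples


# Hypothetical 5 x 10 field: a limb pointing in +x direction along row 2.
paf_x = [[0.0] * 10 for _ in range(5)]
paf_y = [[0.0] * 10 for _ in range(5)]
for x in range(10):
    paf_x[2][x] = 1.0

print(affinity_score(paf_x, paf_y, (1, 2), (8, 2)))  # 1.0  (aligned)
print(affinity_score(paf_x, paf_y, (8, 2), (1, 2)))  # -1.0 (opposite direction)
print(affinity_score(paf_x, paf_y, (1, 0), (1, 4)))  # 0.0  (perpendicular)
```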

According to an embodiment, in case of keypoint ambiguities, affinity field vectors are used to remove the ambiguities. Keypoints may carry labels indicating a feature or position of the respective keypoint. For example, such a label may indicate that the keypoint is “right ear” or “left shoulder”. If the method described above associates two or more keypoints with the same label with a certain object, the information included in the affinity field vectors can be used for determining which of these keypoints is correctly associated with the object. Thereby, the risk of wrong keypoint grouping can be significantly reduced.
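A minimal sketch of the ambiguity resolution follows, assuming that the ambiguous candidates are compared by their affinity score towards a keypoint that the skeleton connects to their label (for example the neck for a shoulder), and that the best-scoring candidate is kept. The function name, the anchor choice and the precomputed scores are hypothetical; `score_fn` could, for example, be a line-integral score over the affinity field.

```python
def resolve_ambiguity(candidates, anchor, score_fn):
    """Keep the candidate keypoint best connected to `anchor`.

    candidates: (x, y) locations that share one label (e.g. two "left shoulder"
    detections) and were associated with the same object.
    anchor: (x, y) of a keypoint that the skeleton connects to that label.
    score_fn(p, q): affinity score of the connection p -> q.
    """
    return max(candidates, key=lambda c: score_fn(anchor, c))


# Hypothetical precomputed affinity scores from a neck keypoint at (5, 2).
scores = {((5, 2), (3, 2)): 0.9, ((5, 2), (12, 2)): 0.1}
best = resolve_ambiguity([(3, 2), (12, 2)], (5, 2), lambda p, q: scores[(p, q)])
print(best)  # (3, 2)
```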

According to an embodiment, a confidence heatmap and/or a centroid confidence heatmap is provided based on a grid which has a lower resolution than the image. Thereby, the computational effort for calculating confidence heatmaps and/or centroid confidence heatmaps can be significantly reduced. In order to mitigate the loss of accuracy caused by the lower resolution, the keypoint position can be refined in post processing based on the refinement offset vectors.
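Mapping a refined grid location back to image pixels makes the benefit of the refinement concrete. The sketch assumes a stride of 8 pixels per grid cell, that cell (i, j) starts at pixel (i * stride, j * stride), and that offsets are expressed in grid units; none of these conventions is fixed by the text, and the function name is hypothetical. With these assumptions, the unrefined y coordinate in the example is off by 6.4 pixels.

```python
def grid_to_image(grid_xy, offset, stride):
    """Map a grid cell plus refinement offset (grid units) to image pixels.

    Assumes that grid cell (i, j) starts at image pixel (i * stride, j * stride)
    and that the offset is expressed in grid units; other conventions (for
    example cell centres at (i + 0.5) * stride) would shift the result.
    """
    return ((grid_xy[0] + offset[0]) * stride, (grid_xy[1] + offset[1]) * stride)


stride = 8  # hypothetical downsampling factor between image and heatmap grid
rough = grid_to_image((2, 2), (0.0, 0.0), stride)
refined = grid_to_image((2, 2), (0.1, 0.8), stride)
print(rough, [round(v, 6) for v in refined])  # (16.0, 16.0) [16.8, 22.4]
```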

According to an embodiment, the neural network provides confidence heatmaps, refinement offset vectors, centroid offset vectors and/or centroid confidence heatmaps for different types of objects. For example, a first type may be “human being” and a second type may be “vehicle”. The neural network may provide information on which type of object is detected. The detection of different objects may be performed in parallel, i.e. the method may determine keypoints for different objects in a single pass of the neural network. Preferably, only those keypoints of an object that are actually present in the image are detected. Thereby, the detection quality is significantly improved.

According to an embodiment, confidence heatmaps are grouped and/or labelled according to given types of keypoints. In other words, the keypoints may carry an indicator of which kind of keypoint is present at a certain position of the image (e.g. left elbow, right shoulder, etc.). Thereby, the determination of an object skeleton is significantly improved.

According to an embodiment, the step of grouping a set of keypoints to an object is performed by searching for the nearest neighboring centroid of one or more keypoints. Thereby, the complexity of keypoint grouping is significantly reduced.

According to an embodiment, the step of grouping a set of keypoints to an object is performed by determining one or more centroid clusters, assigning a label to each centroid cluster and assigning the label of a certain centroid cluster to the keypoints by considering the centroid offset vectors associated with said keypoints. In other words, this embodiment uses the information about which keypoints belong to which centroid to give keypoints associated with the same centroid the same label, so that these keypoints are grouped together and the group is associated with an object.
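The clustering embodiment can be sketched as follows. The disclosure does not name a clustering algorithm, so a simple greedy radius clustering of the provisional centroids (keypoint plus centroid offset vector) stands in for it here; the radius, the function name and the example data are hypothetical. The cluster index serves as the label passed on to the keypoints, and the cluster mean is one simple way of interpolating multiple provisional centroids into a single centroid.

```python
import math


def cluster_and_label(keypoints, covs, radius=1.5):
    """Greedily cluster provisional centroids (keypoint + COV) and label keypoints.

    A provisional centroid joins the first cluster whose running mean lies within
    `radius`; otherwise it starts a new cluster. Returns (means, labels), where
    means[k] is the average of cluster k and labels[i] is the cluster of keypoint i.
    """
    sums, counts, labels = [], [], []
    for (x, y), (dx, dy) in zip(keypoints, covs):
        px, py = x + dx, y + dy  # provisional centroid of this keypoint
        for k, ((sx, sy), n) in enumerate(zip(sums, counts)):
            if math.hypot(sx / n - px, sy / n - py) <= radius:
                sums[k], counts[k] = (sx + px, sy + py), n + 1
                labels.append(k)
                break
        else:  # no cluster close enough: start a new one
            sums.append((px, py))
            counts.append(1)
            labels.append(len(sums) - 1)
    means = [(sx / n, sy / n) for (sx, sy), n in zip(sums, counts)]
    return means, labels


# Hypothetical keypoints of two persons with slightly noisy COVs.
keypoints = [(10.0, 6.0), (13.0, 12.0), (30.0, 6.0), (31.0, 12.0)]
covs = [(1.5, 3.0), (-1.3, -2.8), (1.0, 3.0), (0.0, -3.0)]

means, labels = cluster_and_label(keypoints, covs)
print(labels)  # [0, 0, 1, 1]
print([(round(mx, 6), round(my, 6)) for mx, my in means])  # [(11.6, 9.1), (31.0, 9.0)]
```

Greedy clustering depends on the input order; a more robust choice, such as mean shift, would trade some runtime for order independence.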

According to an embodiment, the step of determining the centroid of one or more objects comprises applying centroid offset vectors to the keypoint locations, leading to multiple provisional centroids, and determining the centroid by applying an interpolation algorithm to said multiple provisional centroids. Thus, centroids are determined in an inward-directed manner, starting from the keypoint positions. Thereby, inaccuracies of individual centroid offset vectors can be mitigated.

According to an embodiment, the method steps are executed by processing hardware included in a vehicle in order to process environment images of the surroundings of the vehicle and/or to process images captured in the interior of the vehicle. Specifically in automotive applications, the proposed method is advantageous because it has low runtime and code complexity and is therefore highly suitable for automotive processing hardware, which typically has very limited processing power.

According to an embodiment, the method steps are executed by processing hardware included in a camera of a vehicle. In other words, at least the neural network tasks, and preferably also the post processing, are performed on camera-internal processing hardware.

According to an embodiment, grouped keypoints of an object are connected for estimating the pose of the object. Thereby it is possible to derive information regarding future motion and/or future poses in order to coordinate driving maneuvers according to the estimated future motion/poses.
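Connecting the grouped keypoints of an object into a skeleton can be sketched with a fixed list of label pairs. The skeleton below is a hypothetical subset of a human skeleton, and the function name is likewise hypothetical. The point of the sketch is that a missing keypoint produces no connection, so no placeholder keypoint has to be predicted; estimating future motion from the resulting pose is not shown.

```python
# Hypothetical subset of a human skeleton, given as pairs of keypoint labels.
SKELETON = [
    ("left_shoulder", "left_elbow"),
    ("left_elbow", "left_wrist"),
    ("left_shoulder", "right_shoulder"),
    ("left_hip", "left_knee"),
]


def connect_keypoints(grouped, skeleton=SKELETON):
    """Return (label_a, label_b, point_a, point_b) for every skeleton edge
    whose two keypoints were both detected for this object.

    grouped maps keypoint labels to (x, y) locations of one object. Missing
    keypoints simply produce no connection; nothing is invented for them.
    """
    return [
        (a, b, grouped[a], grouped[b])
        for a, b in skeleton
        if a in grouped and b in grouped
    ]


person = {"left_shoulder": (10, 6), "left_elbow": (9, 9), "right_shoulder": (14, 6)}
for a, b, pa, pb in connect_keypoints(person):
    print(a, "->", b, pa, pb)
# left_shoulder -> left_elbow (10, 6) (9, 9)
# left_shoulder -> right_shoulder (10, 6) (14, 6)
```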

According to a further aspect, the invention relates to a computer program product for determining keypoints of multiple objects included in an image and associating said keypoints to the respective objects. The computer program product comprises a computer readable storage medium having program instructions embodied therewith. The program instructions are executable by a processor to cause the processor to perform the method according to any one of the preceding embodiments.

According to yet a further aspect, the invention relates to a system for determining keypoints of multiple objects included in an image and associating said keypoints to the respective objects. The system comprises a neural network and a post processing system coupled with said neural network. The system is configured to execute the steps of:

- Providing confidence heatmaps by the neural network, said confidence heatmaps having a lower resolution than the image and therefore providing information regarding the rough location of keypoints;

- Providing refinement offset vectors by the neural network, each refinement offset vector being associated with a certain keypoint, wherein each refinement offset vector provides information for refining the location of the associated keypoint;

- Providing:
  o a set of centroid offset vectors by the neural network, each centroid offset vector of the set of centroid offset vectors being associated with a keypoint, the centroid offset vector providing information regarding the distance and direction between the keypoint and the centroid of the object to which said keypoint belongs; or
  o one or more centroid confidence heatmaps by the neural network, each centroid confidence heatmap having a lower resolution than the image and therefore providing information regarding the rough location of a centroid, wherein one or more centroid refinement offset vectors are provided by the neural network, each centroid refinement offset vector being associated with a centroid and providing information for refining the location of said associated centroid;

- Determining the location of the maximum value in the respective confidence heatmaps, using the locations of the maximum values in the respective confidence heatmaps as rough keypoint locations and refining said rough keypoint locations by adding refinement offset vectors, thereby obtaining refined keypoint locations;

- Determining the centroids of said multiple objects by
  o adding centroid offset vectors to the refined keypoint locations or the rough keypoint locations; or
  o refining the rough centroid locations provided by centroid confidence heatmaps by adding a centroid refinement offset vector to the respective rough centroid location;

- Associating a set of keypoints to the objects based on the determined centroids of the objects.

Any feature described above as an embodiment of the method is also applicable as a feature of the system according to the present disclosure.

The term “vehicle” as used in the present disclosure may refer to a car, truck, bus, train or any other craft.

The term “keypoint” as used in the present disclosure may refer to an object location or object point which characterizes the pose of said object. In case of a human being, the keypoint may be “eye”, “ear”, “shoulder”, “elbow”, “wrist”, “hip”, “knee”, “ankle” etc.

The term “confidence heatmap” as used in the present disclosure may refer to a region of the object in which a keypoint is located and which is overlaid with a probability distribution, for example a Gaussian distribution. The probability distribution indicates the probability with which the keypoint is located at the location of the respective probability value.

The term “refinement offset vector” as used in the present disclosure may refer to a vector indicating at which distance and in which direction the rough keypoint location has to be moved in order to arrive at the actual keypoint location.

The term “centroid” as used in the present disclosure may refer to the center, specifically the center of gravity, of an object.

The term “centroid offset vector” as used in the present disclosure may refer to a vector indicating the distance and direction between a keypoint and the centroid of the object to which said keypoint belongs, i.e. at which distance and in which direction the keypoint location has to be moved in order to arrive at the centroid location.

The term “centroid confidence heatmap” as used in the present disclosure may refer to a region of the object in which the centroid of an object is located and which is overlaid with a probability distribution. The probability distribution indicates the probability with which the centroid is located at the location of the respective probability value.

The term “object connection” as used in the present disclosure may refer to a direct connection of a pair of keypoints which forms part of an object skeleton. In case of a human object, an object connection may be, for example, a limb.

The term “essentially” or “approximately” as used in the invention means deviations from the exact value by +/- 10%, preferably by +/- 5% and/or deviations in the form of changes that are insignificant for the function and/or for the traffic laws.

Various aspects of the invention, including its particular features and advantages, will be readily understood from the following detailed description and the accompanying drawings, in which:

Fig. 1 shows an exemplary schematic diagram of a system for determining keypoints of multiple objects included in an image and associating said keypoints to the respective objects;

Fig. 2 schematically illustrates the steps performed in the post processing system of the system according to Fig. 1;

Fig. 3 shows an exemplary image on which the method is applied;

Fig. 4 illustrates keypoints and centroids of the image according to Fig. 3;

Fig. 5 illustrates a confidence heatmap and a refinement offset vector, both being associated with a certain keypoint;

Fig. 6 schematically illustrates a process for determining the centroid of an object by using multiple centroid offset vectors;

Fig. 7 schematically illustrates the process of associating keypoints to centroids based on centroid offset vectors;

Fig. 8 illustrates multiple affinity field vectors being arranged between a pair of keypoints;

Fig. 9 illustrates detected skeletons of the image according to Fig. 3; and

Fig. 10 shows a schematic block diagram illustrating the steps of a method for determining keypoints of multiple objects included in an image and associating said keypoints to the respective objects.

The present invention will now be described more fully with reference to the accompanying drawings, in which example embodiments are shown. The embodiments in the figures may relate to preferred embodiments, while all elements and features described in connection with embodiments may be used, as far as appropriate, in combination with any other embodiment and feature as discussed herein, in particular related to any other embodiment discussed further above. However, this invention should not be construed as limited to the embodiments set forth herein. Throughout the following description similar reference numerals have been used to denote similar elements, parts, items or features, when applicable.

The features of the present invention disclosed in the specification, the claims, examples and/or the figures may both separately and in any combination thereof be material for realizing the invention in various forms thereof.

Fig. 1 illustrates a system 1 for determining keypoints of multiple objects included in an image and associating said keypoints to the respective objects. An object may be, for example, a person, a vehicle etc. The image may include multiple objects of a single object class (e.g. multiple persons on the image) or may include multiple objects of different object classes (e.g. one or more persons and one or more cars).

The image may be provided by a camera, specifically a camera attached to or included in a vehicle. The image is received by a neural network 2, which can be, for example, a convolutional neural network. The neural network 2 may be a trained neural network, i.e. the neural network 2 has been trained in advance for specific tasks in order to provide certain characteristics used for determining information regarding keypoints of multiple objects included in an image and information required for associating said keypoints to the respective objects.

As shown in Fig. 1, the neural network 2 provides different output information. First, the neural network 2 provides multiple confidence heatmaps CH. Each confidence heatmap CH defines a certain area on the image and is associated with a certain keypoint of an object (e.g. the left elbow of a person). The confidence heatmap CH indicates that the associated keypoint is located within said confidence heatmap CH. The neural network 2 may provide a confidence heatmap CH for each keypoint visible in the image.

According to embodiments, confidence heatmaps CH are labelled. Based on said labels, it is possible to determine the keypoint type, e.g. “right elbow”, “right shoulder” etc.

Confidence heatmap CH further provides probability values over at least two dimensions of the image. Said probability values may indicate according to which probability the keypoint is located at a certain area of the image. The probability values of the confidence heatmap CH may be arranged according to a Gaussian or Gaussian-like distribution.
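As an illustration only, a Gaussian confidence heatmap for a single keypoint could be rendered as sketched below. The sketch assumes a hypothetical downsampling factor `stride` between image and heatmap, a hypothetical spread `sigma` in heatmap cells, and, following the convention of Zhou et al. (“Objects as Points”), a peak placed at the heatmap cell that contains the keypoint. The values are peak-normalised (1.0 at the maximum) rather than summing to one; none of these choices is prescribed by the present disclosure.

```python
import numpy as np

def gaussian_heatmap(keypoint_xy, heatmap_shape, stride, sigma=1.0):
    """Render a Gaussian confidence heatmap for a single keypoint.

    keypoint_xy:   (x, y) location of the keypoint in input-image pixels.
    heatmap_shape: (height, width) of the low-resolution heatmap grid.
    stride:        assumed downsampling factor between image and heatmap.
    sigma:         assumed spread of the Gaussian, in heatmap cells.
    """
    h, w = heatmap_shape
    # Rough keypoint location: the heatmap cell that contains the keypoint.
    cx = np.floor(keypoint_xy[0] / stride)
    cy = np.floor(keypoint_xy[1] / stride)
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    # Values are 1.0 at the peak and decay with distance from it.
    return np.exp(-d2 / (2.0 * sigma ** 2))

hm = gaussian_heatmap((100, 60), heatmap_shape=(16, 32), stride=8)
peak_row, peak_col = np.unravel_index(hm.argmax(), hm.shape)  # (7, 12)
```

Because the peak sits on the cell grid, the sub-cell part of the keypoint position (here 0.5 cells in each direction) is lost; this is exactly the accuracy loss that the refinement offset vectors ROV are meant to recover.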

In order to reduce computational complexity, the confidence heatmap CH may have a lower resolution than the image provided as an input to the neural network 2. For example, the confidence heatmap CH may have only ¼ to 1/100, specifically 1/8, of the resolution of the image. Said resolution reduction leads to a loss of accuracy, since each cell of the confidence heatmap CH covers several pixels of the image.

In order to mitigate this loss of accuracy, the neural network 2 provides refinement offset vectors ROV. Each refinement offset vector ROV is associated with a certain confidence heatmap CH. The refinement offset vector ROV provides correction information for correcting the position of the maximum value of the associated confidence heatmap CH in order to precisely define the position on the image at which the keypoint (which is associated with the confidence heatmap CH) is located.

More specifically, the refinement offset vector ROV may be a vector defining a change of position of the maximum of the confidence heatmap CH, thereby mitigating the accuracy loss induced by the lower resolution of the confidence heatmap CH compared to the image resolution.

It is worth mentioning that the refinement offset vector ROV is provided by the neural network 2, i.e. the mitigation of the accuracy loss induced by the lower resolution of the confidence heatmap CH compared to the image resolution does not have to be provided by a complex refinement algorithm performed in post processing 3.

The neural network may be trained based on ground truth information to estimate refinement offset vectors ROV. Ground truth information may include ground truth refinement offset vectors which are established by calculating a vector which starts at the rough location of the ground truth keypoint and ends at the exact location of the ground truth keypoint. The rough location of the ground truth keypoint is determined in the lower resolution of the confidence heatmap and the exact location may be determined in a higher resolution, namely the resolution of the image provided as an input to the neural network. Based on said ground truth information it is possible to train the neural network to estimate refinement offset vectors, for example by using a loss function.
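A minimal sketch of how such ground truth refinement offset vectors, and a loss on them, might be computed is given below. It assumes that the rough location is the heatmap cell containing the keypoint (the floor of the down-scaled coordinate), that offsets are expressed in heatmap-cell units, and that the predicted offsets are stored in a hypothetical (H, W, 2) array. The L1 offset loss of Zhou et al. (“Objects as Points”, section 3) is used as the example loss; in an actual training setup it would be written in the training framework so that gradients can flow, NumPy only shows the arithmetic.

```python
import numpy as np

def gt_refinement_offsets(keypoints_xy, stride):
    """Ground-truth refinement offset vectors for (N, 2) keypoints in image pixels.

    Returns the rough heatmap cell (x, y) of every keypoint and the vector that
    starts at that cell and ends at the exact keypoint location (cell units).
    """
    exact = np.asarray(keypoints_xy, dtype=float) / stride
    cells = np.floor(exact).astype(int)
    return cells, exact - cells

def offset_l1_loss(pred_offsets, keypoints_xy, stride):
    """Mean L1 error between predicted and ground-truth refinement offsets.

    pred_offsets: (H, W, 2) predicted (dx, dy) per heatmap cell. Only the cells
    that contain a ground-truth keypoint contribute to the loss.
    """
    cells, target = gt_refinement_offsets(keypoints_xy, stride)
    pred = pred_offsets[cells[:, 1], cells[:, 0]]  # index as [row = y, col = x]
    return np.abs(pred - target).sum() / len(target)

# Keypoint at (100, 60) px with stride 8: rough cell (12, 7), offset (0.5, 0.5).
cells, offsets = gt_refinement_offsets([[100, 60]], stride=8)
```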

In addition, reference is made to Xingyi Zhou, Dequan Wang, Philipp Krähenbühl, Objects as Points, Computer Vision and Pattern Recognition, 16th April 2019, which provides in section 3 an example loss function for training a neural network to provide vectors which mitigate the loss of accuracy due to downsampling. The neural network 2 can be trained in a similar way to estimate refinement offset vectors ROV.

Furthermore, the neural network 2 may provide information regarding the centroid of one or more objects included in the image. Said information regarding the centroid can be provided in different ways.

According to the embodiment of Fig. 1, the neural network 2 provides centroid offset vectors COV. Each centroid offset vector COV is associated with a certain keypoint of an object. Each centroid offset vector COV provides displacement information indicating in which direction and how far a certain keypoint has to be shifted in order to arrive at the centroid of the object. In other words, based on the centroid offset vector COV, a certain keypoint of an object can be shifted to the centroid location of said object.

According to another embodiment, the neural network 2 may provide information regarding the centroid of one or more objects included in the image based on one or more centroid confidence heatmaps and one or more centroid refinement offset vectors.

Similar to the above-mentioned confidence heatmaps, a centroid confidence heatmap defines a certain area on the image in which a centroid of an object is located. The centroid confidence heatmap further provides signal values over at least two dimensions of the image. Said signal values may indicate the probability that the centroid of an object is located at a certain area of the image. The signal values of the centroid confidence heatmap may be arranged according to a Gaussian or Gaussian-like distribution.

The centroid confidence heatmap may also have a lower resolution than the image provided as an input to the neural network 2. In order to mitigate the loss of accuracy induced by said lowered resolution, the neural network 2 provides one or more centroid refinement offset vectors. Each centroid refinement offset vector is associated with a certain centroid confidence heatmap. The centroid refinement offset vector provides correction information for correcting the position of the maximum value of the associated centroid confidence heatmap in order to precisely define the position on the image at which the centroid is located. More specifically, the centroid refinement offset vector may be a vector defining a change of position of the maximum of the centroid confidence heatmap, thereby mitigating the accuracy loss induced by the lower resolution of the centroid confidence heatmap compared to the image resolution.

It is worth mentioning that the centroid refinement offset vector may be provided by the neural network 2, i.e. the mitigation of the accuracy loss induced by the lower resolution of the centroid confidence heatmap compared to the image resolution does not have to be provided by a complex refinement algorithm performed in post processing 3.

The neural network 2 may be trained based on ground truth information to estimate said centroid refinement offset vectors. Ground truth information may include ground truth centroid refinement offset vectors which are established by calculating a vector which starts at the rough location of the ground truth centroid and ends at the exact location of the ground truth centroid. The rough location of the ground truth centroid is determined in the lower resolution of the centroid confidence heatmap and the exact location may be determined in a higher resolution, namely the resolution of the image provided as an input to the neural network 2. Based on said ground truth information it is possible to train the neural network 2 to estimate centroid refinement offset vectors, for example by using a loss function.

Finally, according to an embodiment, the neural network 2 may provide one or more sets of affinity field vectors AFV. A set of affinity field vectors is a vector field indicating a given connection between a pair of keypoints. For example, a set of affinity field vectors may indicate the connection between a first keypoint “right shoulder” and a second keypoint “right elbow”, i.e. it may, for example, be indicative of a limb of a person. The vectors included in the set of affinity field vectors are aligned according to the existing connection.
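The present disclosure does not specify how ground truth for the affinity field vectors AFV is built. One common construction, the part affinity fields of Cao et al. (listed among the references of this application), is sketched below as an assumption: every grid cell within a hypothetical distance `width` of the segment between the two keypoints stores the unit vector pointing from the first keypoint to the second, and all other cells store zero.

```python
import numpy as np

def affinity_field(kp_a, kp_b, shape, width=1.0):
    """Ground-truth affinity field vectors for one connection (e.g. a limb).

    kp_a, kp_b: (x, y) grid positions of the two connected keypoints.
    shape:      (height, width) of the field grid.
    width:      assumed half-thickness of the connection, in grid cells.
    """
    h, w = shape
    a = np.asarray(kp_a, dtype=float)
    b = np.asarray(kp_b, dtype=float)
    length = np.linalg.norm(b - a)
    field = np.zeros((h, w, 2))
    if length == 0:
        return field
    u = (b - a) / length                          # unit direction a -> b
    ys, xs = np.mgrid[0:h, 0:w]
    rel = np.stack([xs - a[0], ys - a[1]], axis=-1)
    along = rel @ u                               # position along the connection
    across = np.abs(rel @ np.array([-u[1], u[0]]))  # distance from the line
    mask = (along >= 0) & (along <= length) & (across <= width)
    field[mask] = u                               # aligned with the connection
    return field

# Horizontal "limb" from (2, 5) to (8, 5) on a 10 x 10 grid.
f = affinity_field((2, 5), (8, 5), (10, 10))
```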

The sets of affinity field vectors may be used to associate keypoints to certain objects and/or to remove ambiguities, as described below in greater detail.

Output information of the neural network 2 may be provided to a post-processing system 3 which is configured to process received information in order to determine keypoints, determine associations of keypoints to certain objects and determine existing connections between keypoints in order to build a skeleton of the object.

Fig. 2 shows the processes performed by the post processing system 3 in more detail.

The confidence heatmaps CH provided by the neural network 2 may be parsed in order to determine local maxima. In other words, the peak of each confidence heatmap CH is determined in order to obtain rough location information of the keypoint associated with the respective confidence heatmap. The confidence heatmaps CH may be provided on a grid. The local maximum of a confidence heatmap CH may be arranged at an intersection of grid lines.
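One simple way to parse a confidence heatmap for local maxima is a 3x3 neighbourhood comparison, sketched below in NumPy. A cell counts as a peak if no neighbour is larger and its value exceeds a threshold; the threshold is a hypothetical tuning parameter, since the disclosure does not state how weak peaks are suppressed. Flat plateaus would yield several adjacent peaks, which this sketch does not handle.

```python
import numpy as np

def find_local_maxima(heatmap, threshold=0.1):
    """Return (row, col) grid positions of local maxima in a confidence heatmap."""
    heatmap = np.asarray(heatmap, dtype=float)
    h, w = heatmap.shape
    # Pad with -inf so that border cells only compete with real neighbours.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the 3x3 neighbourhood of every cell (including the cell itself).
    neighbours = np.stack([padded[dy:dy + h, dx:dx + w]
                           for dy in range(3) for dx in range(3)])
    is_peak = (heatmap >= neighbours.max(axis=0)) & (heatmap > threshold)
    return [(int(r), int(c)) for r, c in np.argwhere(is_peak)]
```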

Fig. 3 shows an example image containing two objects, namely two ballet dancers.

In Fig. 4, confidence heatmaps of the objects shown in Fig. 3 are highlighted by white ovals and centroids of the objects are highlighted by white squares.

Fig. 5 exemplarily shows a certain confidence heatmap CH which is provided on top of a grid. Different grey-scales of the squares indicate the probability value at the respective position. The peak probability value is located in the middle, indicated by the black square.

Returning to Fig. 2, after searching for the local maxima, a keypoint refinement is performed based on information provided by the refinement offset vectors ROV. As mentioned before, each refinement offset vector ROV is associated with a certain confidence heatmap CH. Due to the lowered resolution of the confidence heatmap CH, the determined local maximum only roughly indicates the location of the keypoint.

By applying the refinement offset vector ROV to the local maximum, a refined position of the keypoint is determined. In other words, the loss of accuracy is mitigated by shifting the position of the local maximum according to the information provided by the refinement offset vector ROV.

Fig. 5 exemplarily shows the shifting of the local maximum by the refinement offset vector ROV, indicated by the white arrow in the middle of the black square. By applying the refinement offset vector ROV to the local maximum of the confidence heatmap CH, the refined position of the keypoint is determined, which is indicated by the white cross in the area of the tip of the white arrow.
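Applying a refinement offset vector ROV to a local maximum amounts to adding the offset to the grid position and scaling the result back to image resolution. The sketch below assumes that the predicted offsets are stored as a hypothetical (H, W, 2) array in heatmap-cell units and that `stride` is the downsampling factor between image and heatmap.

```python
import numpy as np

def refine_keypoint(peak_rc, refinement_offsets, stride):
    """Refine a rough heatmap peak into a keypoint location in image pixels.

    peak_rc:            (row, col) of the local maximum on the heatmap grid.
    refinement_offsets: (H, W, 2) predicted (dx, dy) per cell, in cell units.
    stride:             assumed downsampling factor between image and heatmap.
    """
    row, col = peak_rc
    dx, dy = refinement_offsets[row, col]
    # Shift the rough location by the offset, then return to image resolution.
    return ((col + dx) * stride, (row + dy) * stride)

offsets = np.zeros((16, 32, 2))
offsets[7, 12] = [0.5, 0.5]
keypoint = refine_keypoint((7, 12), offsets, stride=8)  # (100.0, 60.0)
```

Without the offset, the same peak would map back to (96, 56) pixels, i.e. the sub-cell part of the location would be lost.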

Again returning to Fig. 2, after keypoint refinement, the centroids of the objects included in the image are determined.

As mentioned before, determination of centroids can be performed in different ways. The embodiment shown in Fig. 2 uses centroid offset vectors COV for determining a centroid of an object.

Each centroid offset vector COV is associated with a certain keypoint. By applying the centroid offset vector COV to its associated keypoint, the position of the centroid of the object to which the keypoint belongs, or at least an estimate of that position, can be established.

Fig. 6 exemplarily shows multiple keypoints of an object, indicated by black dots, and a set of centroid offset vectors COV, indicated by white arrows. Each centroid offset vector COV is associated with a certain keypoint. When a shift operation according to the centroid offset vector COV is applied to the keypoint, an estimate of the centroid is obtained, indicated by a white dot in the centre of the skier. The estimated centroid obtained by said shifting operation may not be located exactly at the actual centroid of the object. However, by considering all estimated centroids and applying an averaging algorithm to them, a reliable estimate of the actual centroid can be derived.
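The averaging step can be sketched as follows, assuming the keypoints of one object are already known and that the plain arithmetic mean is used as the averaging algorithm; a more robust statistic such as the median would be an equally valid choice and is not excluded by the description.

```python
import numpy as np

def estimate_centroid(keypoints_xy, centroid_offsets):
    """Average the centroid votes cast by the keypoints of one object.

    keypoints_xy:     (N, 2) keypoint locations.
    centroid_offsets: (N, 2) centroid offset vectors COV, one per keypoint.
    Each keypoint shifted by its COV gives one noisy centroid estimate.
    """
    votes = np.asarray(keypoints_xy, float) + np.asarray(centroid_offsets, float)
    return votes.mean(axis=0)

# Three slightly inconsistent votes: (5, 6), (5, 4) and (5, 5).
centroid = estimate_centroid([[0, 0], [10, 0], [0, 10]],
                             [[5, 6], [-5, 4], [5, -5]])
```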

According to another embodiment, as mentioned before, the centroids of objects included in the image can also be determined based on centroid confidence heatmaps and centroid refinement offset vectors, each centroid refinement offset vector corresponding to one of said centroid confidence heatmaps. The location of the maximum of a centroid confidence heatmap indicates the rough position of the centroid. The exact location of the centroid is derived by shifting the rough position of the centroid based on the information included in the corresponding centroid refinement offset vector.

Again returning to Fig. 2, after the determination of the centroids, the keypoints are associated with certain objects. In other words, in case of multiple keypoints present in the image, a decision is made as to which keypoint belongs to which object.

The keypoints provided by the neural network may be labelled. The keypoint label may indicate at which position the keypoint is located at the object. For example, the label of a keypoint belonging to a person may, for example, indicate “left elbow” or “right shoulder”.

However, in case of multiple keypoints with the same label category, the association of keypoints to objects has to be determined, i.e., for example, which “right shoulder” belongs to which person.

According to a first example embodiment, the association of keypoints to objects can be determined based on a “next neighbour search” algorithm. In other words, a certain keypoint is associated with the object whose centroid is closest to said keypoint.

According to a second example embodiment, information provided by the centroid offset vectors COV can be used to associate keypoints to objects. As mentioned before, the centroid offset vectors COV provide information regarding the direction and distance from a certain keypoint to the centroid of the object to which the keypoint belongs. Said information can be used for keypoint-object-association.
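For the first example embodiment, the “next neighbour search” can be sketched as a brute-force search over all keypoint-centroid distances. Euclidean distance in image coordinates is assumed here; for many keypoints and centroids a spatial index could replace the full distance matrix.

```python
import numpy as np

def assign_to_nearest_centroid(keypoints_xy, centroids_xy):
    """Return, for each keypoint, the index of the closest object centroid."""
    kps = np.asarray(keypoints_xy, float)[:, None, :]    # (K, 1, 2)
    cents = np.asarray(centroids_xy, float)[None, :, :]  # (1, C, 2)
    dist = np.linalg.norm(kps - cents, axis=-1)          # (K, C) distances
    return dist.argmin(axis=1)

object_ids = assign_to_nearest_centroid([[1, 1], [9, 9], [2, 0]],
                                        [[0, 0], [10, 10]])  # [0, 1, 0]
```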

For example, a clustering algorithm may be used for determining clusters of the rough centroid locations obtained by applying the centroid offset vectors COV to the keypoints. Example clustering mechanisms are the K-means algorithm and the EM algorithm (EM: expectation maximization). Each cluster may be assigned a label which indicates a certain object. Using the knowledge of the centroid offset vectors COV, the label assigned to a cluster can also be assigned to the set of keypoints which correspond to said centroid offset vectors COV (cf. Fig. 6). Thereby, a set of keypoints belonging to a certain cluster, i.e. to a certain object, can be determined.
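A minimal K-means sketch of this clustering step is shown below. It assumes that the number of objects `k` is known beforehand (for instance from the number of detected centroids) and uses plain Lloyd iterations with a fixed, hypothetical iteration count; the EM algorithm mentioned above could be used instead. Each vote's cluster label is inherited by the keypoint that cast it.

```python
import numpy as np

def cluster_centroid_votes(votes_xy, k, iters=20, seed=0):
    """Cluster rough centroid locations (keypoint + COV) with plain K-means.

    votes_xy: (N, 2) rough centroid locations, one per keypoint.
    k:        number of objects, assumed known here.
    Returns the cluster label of every vote, i.e. of every keypoint.
    """
    votes = np.asarray(votes_xy, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the k cluster centres with k distinct votes.
    centers = votes[rng.choice(len(votes), size=k, replace=False)]
    labels = np.zeros(len(votes), dtype=int)
    for _ in range(iters):
        # Assignment step: each vote joins its closest centre.
        dist = np.linalg.norm(votes[:, None, :] - centers[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        # Update step: move each non-empty centre to the mean of its votes.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = votes[labels == j].mean(axis=0)
    return labels

keypoints = np.array([[0, 0], [4, 0], [50, 50], [54, 50]])
covs = np.array([[2, 1], [-1, 2], [2, 1], [-1, 2]])
labels = cluster_centroid_votes(keypoints + covs, k=2)
```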

Fig. 7 schematically illustrates the association of keypoints to objects based on a clustering algorithm. Detected keypoints are indicated by X and detected centroids are indicated by C. The arrows indicate the centroid offset vectors COV. By considering the centroid offset vector COV information, a grouping of keypoints to centroids, and thus to objects, can be obtained.

According to a third embodiment, keypoint grouping can also be obtained by using affinity field vectors AFV. As mentioned before, each set of affinity field vectors provides information regarding keypoint coupling, i.e. which keypoint is coupled with which further keypoint by an object connection. Such an object connection may be, for example, a limb of a person. By considering the information included in said affinity field vectors, a grouping of keypoints to objects can be obtained, because only keypoint pairs which are interconnected by affinity field vectors can belong to the same object.

Fig. 8 shows a set of affinity field vectors. Said affinity field vectors may indicate, for example, the coupling of the two keypoints “shoulder” and “elbow”. The affinity field vectors are aligned according to the object connection, for example, the orientation of the limb.
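To decide whether two candidate keypoints are coupled, the affinity field vectors can be sampled along the straight line between them and compared with the line's direction, as in the line-integral score of Cao et al. The sketch below is one such approximation under stated assumptions: a hypothetical (H, W, 2) field for one connection type, a fixed number of samples and nearest-cell lookup instead of interpolation. A score near 1 suggests the pair is joined by the object connection; a score near 0 or below suggests it is not.

```python
import numpy as np

def affinity_score(field, kp_a, kp_b, num_samples=10):
    """Approximate line integral of an affinity field along kp_a -> kp_b.

    field:      (H, W, 2) affinity field vectors for one connection type.
    kp_a, kp_b: (x, y) grid positions of the two candidate keypoints.
    """
    a = np.asarray(kp_a, dtype=float)
    b = np.asarray(kp_b, dtype=float)
    length = np.linalg.norm(b - a)
    if length == 0:
        return 0.0
    u = (b - a) / length                  # direction of the candidate connection
    h, w, _ = field.shape
    total = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = np.rint(a + t * (b - a)).astype(int)
        x, y = np.clip(x, 0, w - 1), np.clip(y, 0, h - 1)
        total += field[y, x] @ u          # agreement between field and direction
    return total / num_samples

field = np.zeros((10, 10, 2))
field[5, 2:9] = [1.0, 0.0]                # a horizontal connection in row 5
score = affinity_score(field, (2, 5), (8, 5))
```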

After associating keypoints to objects included in the image, ambiguities may occur. For example, one object may comprise multiple keypoints with the same keypoint label, e.g. multiple “left shoulders”. In order to remove such ambiguities, the affinity field vectors AFV can be used to check which keypoint actually belongs to the object and to remove the at least one further keypoint association. Said removal of ambiguities may be specifically advantageous in embodiments which use “next neighbour search” or “clustering” mechanisms for obtaining the keypoint-object-association, because in said embodiments the affinity field vectors AFV have not been used for obtaining the keypoint-object-association.
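The ambiguity removal can be sketched as keeping, among the duplicate candidates, the one whose connection to an already accepted keypoint of the same object has the highest affinity score. The choice of that reference keypoint (the `anchor`) and the scoring function are assumptions; in practice `score` would be an affinity field line integral over the AFV, while the toy score below only stands in for it to keep the example self-contained.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]

def resolve_duplicates(candidates: Sequence[Point], anchor: Point,
                       score: Callable[[Point, Point], float]) -> Point:
    """Keep one keypoint when an object received several with the same label.

    candidates: keypoints sharing one label (e.g. two "left shoulder"s)
                that were associated with the same object.
    anchor:     accepted keypoint of the object that is linked to this label
                by an object connection (e.g. its "left elbow").
    score:      affinity score of the connection anchor -> candidate.
    """
    return max(candidates, key=lambda c: score(anchor, c))

def toy_score(a: Point, c: Point) -> float:
    """Toy stand-in for an AFV-based score: nearer candidates score higher."""
    return -((a[0] - c[0]) ** 2 + (a[1] - c[1]) ** 2)

kept = resolve_duplicates([(0.0, 0.0), (5.0, 5.0)], anchor=(4.0, 4.0),
                          score=toy_score)
```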

Finally, after ambiguity removal, the skeleton of each object is built. Said skeleton may be obtained, for example, by connecting the keypoints associated with a certain object according to their keypoint labels. Such connecting may include, for example, coupling associated keypoints according to a known structure of the object, e.g. the “left wrist” keypoint with the “left elbow” keypoint and the “left elbow” keypoint with the “left shoulder” keypoint.
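Connecting keypoints according to a known object structure can be sketched with a table of label pairs. The table `PERSON_SKELETON` and its label strings are hypothetical and only illustrate the idea; connections whose end points were not detected (e.g. occluded keypoints) are simply skipped.

```python
# Hypothetical connection table for a person; labels are illustrative only.
PERSON_SKELETON = [
    ("left_wrist", "left_elbow"),
    ("left_elbow", "left_shoulder"),
    ("right_wrist", "right_elbow"),
    ("right_elbow", "right_shoulder"),
    ("left_shoulder", "right_shoulder"),
]

def build_skeleton(keypoints, structure=PERSON_SKELETON):
    """Connect the labelled keypoints of one object according to its structure.

    keypoints: dict mapping a keypoint label to its (x, y) location.
    Returns a list of line segments ((x1, y1), (x2, y2)).
    """
    return [(keypoints[a], keypoints[b]) for a, b in structure
            if a in keypoints and b in keypoints]

segments = build_skeleton({"left_wrist": (0, 0),
                           "left_elbow": (1, 1),
                           "left_shoulder": (2, 2)})
```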

Fig. 9 illustrates detected object skeletons for the images according to Fig. 3 and 4. The skeletons determined based on detected keypoints are indicated by white lines.

The artificial neural network 2 may be trained by using images which are manually labelled before using it for training purposes. Said labels provided to the images are also referred to as ground truth data. For example, ground truth data may provide a bounding box to each object indicating the extent of the object. Based on the bounding box, the centroid of an object can be determined, for example by calculating the centre coordinate of the bounding box and using that centre coordinate as centroid location of the object.
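The bounding-box rule for the ground truth centroid can be written down directly. The sketch also derives ground truth centroid offset vectors COV as the shift from each labelled keypoint to that centroid, which is one plausible reading of the COV definition given earlier rather than a procedure stated in the disclosure; the box format (x_min, y_min, x_max, y_max) is an assumption.

```python
import numpy as np

def bbox_centroid(bbox):
    """Centroid as the centre of a ground-truth box (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = bbox
    return np.array([(x_min + x_max) / 2.0, (y_min + y_max) / 2.0])

def gt_centroid_offsets(keypoints_xy, bbox):
    """Ground-truth COVs: the shift from each labelled keypoint to the centroid."""
    return bbox_centroid(bbox) - np.asarray(keypoints_xy, dtype=float)

center = bbox_centroid((10, 20, 30, 60))                            # [20., 40.]
covs = gt_centroid_offsets([[20, 40], [10, 20]], (10, 20, 30, 60))  # [[0, 0], [10, 20]]
```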

In addition, ground truth data may provide information regarding where the keypoints of the object are located. Furthermore, ground truth data may provide information regarding the type of keypoint, e.g. left shoulder, right ear, left knee etc. or in case of a vehicle as an object, left exterior mirror, right headlight etc.

By using said ground truth data, the artificial neural network 2 can be trained to estimate confidence heatmaps, rough keypoint locations, refinement offset vectors, centroid offset vectors, centroid refinement offset vectors and/or centroid confidence heatmaps.

Some processes performed by the post-processing system 3 have been described above in connection with certain embodiments. In addition, the post-processing system 3 may perform further tasks such as pose estimation, providing bounding boxes around objects based on detected keypoints (also in 3D by regressing the depth-wise extension of an object), predicting the relationship between objects (e.g. detecting a motorcycle and a person close to the motorcycle, thereby taking the decision “motorcyclist”), 3D keypoint detection by regressing or tracking the distance of keypoints, determining the 3D orientation of an object (e.g. by determining roll, pitch and yaw of the object with respect to the camera) and object tracking by regressing the temporal offset of the keypoint coordinates.
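For the 2D case, a bounding box around an object can be derived from its detected keypoints as their axis-aligned extent. The sketch below covers only that case (the 3D variant with regressed depth is not shown); the relative `margin` is a hypothetical parameter, motivated by the fact that keypoints such as joints usually lie inside the visible outline of the object rather than on it.

```python
import numpy as np

def bbox_from_keypoints(keypoints_xy, margin=0.0):
    """Axis-aligned 2D box (x_min, y_min, x_max, y_max) around one object's keypoints.

    margin: assumed relative padding added on each side of the keypoint extent.
    """
    kps = np.asarray(keypoints_xy, dtype=float)
    (x0, y0), (x1, y1) = kps.min(axis=0), kps.max(axis=0)
    pad_x, pad_y = margin * (x1 - x0), margin * (y1 - y0)
    return (x0 - pad_x, y0 - pad_y, x1 + pad_x, y1 + pad_y)

box = bbox_from_keypoints([[1, 2], [5, 8], [3, 4]])  # (1.0, 2.0, 5.0, 8.0)
```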

The present disclosure can be used for keypoint and object detection in the surroundings of a vehicle. However, according to another embodiment, it may also be possible to monitor the interior of the vehicle based on a camera and to detect keypoints of people within the car in order to derive information regarding the vehicle passengers, e.g. the readiness of the driver to take over control of the car in autonomous driving situations.

Fig. 10 shows a block diagram illustrating the method steps of a method for determining keypoints of multiple objects and associating said keypoints to objects.

As a first step, multiple objects may be detected in an image. Each object comprises a centroid and multiple keypoints. Subsequently, confidence heatmaps are provided by the neural network (S10). Said confidence heatmaps provide information regarding the rough location of keypoints.

After providing confidence heatmaps, refinement offset vectors are provided by the neural network (S11). Each refinement offset vector is associated with a certain keypoint and provides information for refining the location of the associated keypoint.

After providing confidence heatmaps, centroid offset vectors or centroid confidence heatmaps are provided by the neural network (S12).

As a further step, the location of the maximum value in each confidence heatmap is determined, which defines the rough keypoint location. Said rough keypoint locations are refined based on the refinement offset vectors (S13).

As a further step, centroids of objects included in the image are determined (S14).

Finally, the determined keypoints are grouped based on determined centroids (S15).

It should be noted that the description and drawings merely illustrate the principles of the proposed invention. Those skilled in the art will be able to implement various arrangements that, although not explicitly described or shown herein, embody the principles of the invention.

List of reference numerals

1 system

2 neural network

3 post processing

AFV affinity field vector

CH confidence heatmap

COV centroid offset vector

ROV refinement offset vector
