Title:
CONVOLUTION NEURAL NETWORK BASED LANDMARK TRACKER
Document Type and Number:
WIPO Patent Application WO/2020/216804
Kind Code:
A1
Abstract:
There are provided systems and methods for facial landmark detection using a convolutional neural network (CNN). The CNN comprises a first stage and a second stage where the first stage produces initial heat maps for the landmarks and initial respective locations for the landmarks. The second stage processes the heat maps and performs Region of Interest-based pooling while preserving feature alignment to produce cropped features. Finally, the second stage predicts from the cropped features a respective refinement location offset to each respective initial location. Combining each respective initial location with its respective refinement location offset provides a respective final coordinate (x,y) for each respective landmark in the image. The two-stage localization design helps to achieve fine-level alignment while remaining computationally efficient. The resulting architecture is both small enough in size and inference time to be suitable for real-time web applications such as product simulation and virtual reality.

Inventors:
LI TIAN XING (CA)
YU ZHI (CA)
KEZELE IRINA (CA)
PHUNG EDMUND (CA)
AARABI PARHAM (CA)
Application Number:
PCT/EP2020/061249
Publication Date:
October 29, 2020
Filing Date:
April 22, 2020
Assignee:
OREAL (FR)
International Classes:
G06V10/764; G06V10/771
Other References:
XIAOJIE GUO ET AL: "PFLD: A Practical Facial Landmark Detector", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 28 February 2019 (2019-02-28), XP081121741
NING ZHANG ET AL: "Fine-grained pose prediction, normalization, and recognition", 22 November 2015 (2015-11-22), XP055717074, Retrieved from the Internet
M. KOWALSKI, J. NARUNIEC, T. TRZCINSKI: "Deep alignment network: A convolutional neural network for robust face alignment", CORR, 2017
Y. SUN, X. WANG, X. TANG: "Deep convolutional network cascade for facial point detection", 2013 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, June 2013 (2013-06-01), pages 3476 - 3483, XP032493158, DOI: 10.1109/CVPR.2013.446
K. YUEN, M. M. TRIVEDI: "An occluded stacked hourglass approach to facial landmark localization and occlusion estimation", CORR, 2018
V. KAZEMI, J. SULLIVAN: "One millisecond face alignment with an ensemble of regression trees", 2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2014, pages 1867 - 1874, XP032649427, DOI: 10.1109/CVPR.2014.241
D. E. KING: "Dlib-ml: A machine learning toolkit", JOURNAL OF MACHINE LEARNING RESEARCH, vol. 10, 2009, pages 1755 - 1758
P. N. BELHUMEUR, D. W. JACOBS, D. J. KRIEGMAN, N. KUMAR: "Localizing parts of faces using a consensus of exemplars", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 35, December 2013 (2013-12-01), pages 2930 - 2940
V. LE, J. BRANDT, Z. LIN, L. BOURDEV, T. S. HUANG: "Computer Vision - ECCV 2012", 2012, SPRINGER BERLIN HEIDELBERG, article "Interactive facial feature localization", pages: 679 - 692
G. TRIGEORGIS, P. SNAPE, M. A. NICOLAOU, E. ANTONAKOS, S. ZAFEIRIOU: "Mnemonic descent method: A recurrent process applied for end-to-end face alignment", 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, pages 4177 - 4187, XP033021603, DOI: 10.1109/CVPR.2016.453
A. NEWELL, K. YANG, J. DENG: "Stacked hourglass networks for human pose estimation", CORR, 2016
M. SANDLER, A. G. HOWARD, M. ZHU, A. ZHMOGINOV, L. CHEN: "MobileNetV2: Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation", CORR, 2018
F. N. IANDOLA, M. W. MOSKEWICZ, K. ASHRAF, S. HAN, W. J. DALLY, K. KEUTZER: "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1MB model size", CORR, 2016
X. ZHANG, X. ZHOU, M. LIN, J. SUN: "Shufflenet: An extremely efficient convolutional neural network for mobile devices", CORR, 2017
A. G. HOWARD, M. ZHU, B. CHEN, D. KALENICHENKO, W. WANG, T. WEYAND, M. ANDREETTO, H. ADAM: "Mobilenets: Efficient convolutional neural networks for mobile vision applications", CORR, 2017
K. HE, X. ZHANG, S. REN, J. SUN: "Deep residual learning for image recognition", CORR, 2015
A. BULAT, G. TZIMIROPOULOS: "Human pose estimation via convolutional part heat map regression", CORR, 2016
S. WEI, V. RAMAKRISHNA, T. KANADE, Y. SHEIKH: "Convolutional pose machines", CORR, 2016
Y. CHEN, C. SHEN, X. WEI, L. LIU, J. YANG: "Adversarial learning of structure-aware fully convolutional networks for landmark localization", CORR, 2017
E. INSAFUTDINOV, L. PISHCHULIN, B. ANDRES, M. ANDRILUKA, B. SCHIELE: "Deepercut: A deeper, stronger, and faster multi-person pose estimation model", CORR, 2016
R. B. GIRSHICK: "Fast R-CNN", CORR, 2015
S. REN, K. HE, R. B. GIRSHICK, J. SUN: "Faster R-CNN: towards real-time object detection with region proposal networks", CORR, 2015
J. LONG, E. SHELHAMER, T. DARRELL: "Fully convolutional networks for semantic segmentation", CORR, 2014
K. HE, G. GKIOXARI, P. DOLLAR, R. B. GIRSHICK: "Mask R-CNN", CORR, 2017
N. ZHANG, E. SHELHAMER, Y. GAO, T. DARRELL: "Fine-grained pose prediction, normalization, and recognition", CORR, 2015
Attorney, Agent or Firm:
POTTER, Julian (GB)
Claims:
Claims

What is claimed is:

1. A computing device comprising a processing unit and a storage device coupled thereto, the storage device storing instructions that when executed by the processing unit configure the computing device to process an image to determine respective locations of each of a plurality of landmarks by: processing the image using a Convolutional Neural Network (CNN) having a first stage and a second stage, wherein: the first stage processes the image to generate initial predictions of the respective locations; and the second stage, using intermediate features generated from the image by the first stage, and the initial predictions, generates crops of shared convolutional features for regions of interest to define cropped features and further processes the cropped features to produce respective refinement location offsets to the initial predictions of the respective locations.

2. The computing device of claim 1 wherein the first stage produces and uses the intermediate features to produce initial heat maps from which the initial predictions are generated.

3. The computing device of claim 1 or claim 2 wherein the second stage produces second heat maps from the cropped features, one for each landmark, and uses the second heat maps to produce the respective refinement location offsets.

4. The computing device of any one of claims 1 to 3 wherein the CNN combines the initial predictions of the respective locations and the respective refinement location offsets to provide the respective locations for each of a plurality of landmarks.

5. A computing device comprising a processing unit and a storage device coupled thereto, the storage device storing instructions that when executed by the processing unit configure the computing device to process an image to determine respective locations of each of a plurality of landmarks by: processing the image using a Convolutional Neural Network (CNN) having a first stage and a second stage, wherein: the first stage receives the image and determines: in a first portion of the first stage, a volume of intermediate features from the image; and, in a second portion of the first stage using the volume of intermediate features, a respective initial location for each of the landmarks within the image; and the second stage receives, at a first portion of the second stage, the volume of intermediate features and receives, at a second portion of the second stage, the respective initial location for each of the landmarks, the second stage further: processing the volume to further refine the intermediate features; performing, on the intermediate features as further refined, a Region of Interest-based pooling while preserving feature alignment to produce cropped features for each of the plurality of landmarks; and, determining, using the cropped features and for each landmark, respective refinement location offsets for each respective initial location of each landmark; and wherein the processing further operates to combine each respective initial location and the respective refinement location offset to determine final location coordinates in the image of each of the plurality of landmarks.

6. The computing device of claim 5 wherein the second portion of the first stage uses the volume of intermediate features to determine initial heat maps and predicts each respective initial location for each of the landmarks using the initial heat maps.

7. The computing device of any one of claims 1 to 6 wherein at least some of the first stage comprises a series of Inverted Residual Blocks and wherein at least some of the second stage comprises a series of Inverted Residual Blocks.

8. The computing device of any one of claims 1 to 7 wherein the second stage: uses RoIAlign for Region of Interest-based pooling while preserving feature alignment to produce the cropped features; and concatenates the cropped features.

9. The computing device of any one of claims 1 to 8 wherein the second stage comprises a predict block to process the cropped features, the predict block performing, in order: channel-wise convolutions with 3X3 kernel followed by BatchNorm and ReLU activation; and, group-wise channel convolutions with 1X1 kernel followed by BatchNorm; to output each of the respective refinement location offsets.

10. The computing device of any one of claims 1 to 9 wherein the CNN model is trained using respective training images having ground truth heat maps for respective landmarks of the plurality of landmarks defined in accordance with a Gaussian distribution with a mode corresponding to respective ground truth coordinate positions of the landmarks in the respective training images.

11. The computing device of claim 10 wherein the Gaussian distribution is defined in accordance with:

$$f(x,y) = \frac{1}{2\pi\sigma_x\sigma_y}\exp\left(-\left(\frac{(x-x_i)^2}{2\sigma_x^2} + \frac{(y-y_i)^2}{2\sigma_y^2}\right)\right) \qquad (1)$$

where x, y denote the coordinates of any pixel in a training image and (x_i, y_i) is the corresponding landmark coordinate.

12. The computing device of claim 11 wherein a regressed x_pred, y_pred is the expected value of the pixel locations according to the distribution (1) computed from a respective predicted heat map such that:

$$x_{pred} = \frac{\sum_j w_j x_j}{\sum_j w_j}, \qquad y_{pred} = \frac{\sum_j w_j y_j}{\sum_j w_j} \qquad (2)$$

where j is an index over all the pixels in the respective heat map, and w_j denotes a heat map value for that pixel.

13. The computing device of any one of claims 1 to 12 wherein the CNN is trained with a loss function defined by pixel-wise sigmoid cross entropy for learning heat maps.

14. The computing device of claim 13 wherein the loss function further includes an L2 distance loss.

15. The computing device of claim 13 or claim 14 wherein the loss function comprises:

$$\mathcal{L} = -\sum_{n=1}^{N}\sum_{l=1}^{L}\sum_{i=1}^{H}\sum_{j=1}^{W} w^{nl}_{ij}\left[\hat{p}^{nl}_{ij}\log p^{nl}_{ij} + \left(1-\hat{p}^{nl}_{ij}\right)\log\left(1-p^{nl}_{ij}\right)\right] + \sum_{n=1}^{N}\sum_{l=1}^{L}\left\|\left(\hat{x}^{nl},\hat{y}^{nl}\right)-\left(x^{nl},y^{nl}\right)\right\|_2^2 \qquad (3)$$

where: p^{nl}_{ij} is the prediction value of the heat map in the lth channel at pixel location (i,j) of the nth sample; \hat{p}^{nl}_{ij} is the corresponding ground truth; w^{nl}_{ij} is the weight at pixel location (i,j), which is calculated from equation 4; (\hat{x}^{nl}, \hat{y}^{nl}) is the ground truth coordinate of the nth sample's lth landmark; and (x^{nl}, y^{nl}) is the expected coordinate of the same landmark.

16. The computing device of any one of claims 1 to 15 further configured via instructions to receive the image and perform landmark detection on the image.

17. The computing device of any one of claims 1 to 16 further configured via instructions to modify the image at or about at least one of the landmarks using the respective locations.

18. The computing device of claim 17 wherein to modify the image comprises simulating a product applied to the image.

19. The computing device of claim 17 or claim 18 wherein the image is a video image and the computing device is configured via the instructions to modify and present the image in real time to simulate a virtual reality.

20. The computing device of claim 19 further comprising a camera and wherein the video is a selfie video taken by the camera.

21. The computing device of any one of claims 16 to 20 wherein the landmarks are facial landmarks, the image comprises a face, and the computing device is further configured via the instructions to use the respective locations of the landmarks to update the image with at least one product simulation.

22. A method comprising: processing an image to determine respective locations of each of a plurality of landmarks by using a Convolutional Neural Network (CNN) having a first stage and a second stage, wherein: the first stage processes the image to generate initial predictions of the respective locations; and the second stage, using intermediate features generated from the image by the first stage, and the initial predictions, generates crops of shared convolutional features for regions of interest to define cropped features and further processes the cropped features to produce respective refinement location offsets to the initial predictions of the respective locations.

23. The method of claim 22 wherein the first stage produces and uses the intermediate features to produce initial heat maps from which the initial predictions are generated.

24. The method of claim 22 or claim 23 wherein the second stage produces second heat maps from the cropped features, one for each landmark, and uses the second heat maps to produce the respective refinement location offsets.

25. The method of any one of claims 22 to 24 wherein the CNN combines the initial predictions of the respective locations and the respective refinement location offsets to provide the respective locations for each of a plurality of landmarks.

26. A method comprising: processing an image to determine respective locations of each of a plurality of landmarks by using a Convolutional Neural Network (CNN) having a first stage and a second stage, wherein: the first stage receives the image and determines: in a first portion of the first stage, a volume of intermediate features from the image; and, in a second portion of the first stage using the volume of intermediate features, a respective initial location for each of the landmarks within the image; and the second stage receives, at a first portion of the second stage, the volume of intermediate features and receives, at a second portion of the second stage, the respective initial location for each of the landmarks, the second stage further: processing the volume to further refine the intermediate features; performing, on the intermediate features as further refined, a Region of Interest-based pooling while preserving feature alignment to produce cropped features for each of the plurality of landmarks; and, determining, using the cropped features and for each landmark, respective refinement location offsets for each respective initial location of each landmark; and wherein the processing further operates to combine each respective initial location and the respective refinement location offset to determine final location coordinates in the image of each of the plurality of landmarks.

27. The method of claim 26 wherein the second portion of the first stage uses the volume of intermediate features to determine initial heat maps and predicts each respective initial location for each of the landmarks using the initial heat maps.

28. The method of any one of claims 22 to 27 wherein at least some of the first stage comprises a series of Inverted Residual Blocks and wherein at least some of the second stage comprises a series of Inverted Residual Blocks.

29. The method of any one of claims 22 to 28 wherein the second stage: uses RoIAlign for Region of Interest-based pooling while preserving feature alignment to produce the cropped features; and concatenates the cropped features.

30. The method of any one of claims 22 to 29 wherein the second stage comprises a predict block to process the cropped features, the predict block performing, in order: channel-wise convolutions with 3X3 kernel followed by BatchNorm and ReLU activation; and, group-wise channel convolutions with 1X1 kernel followed by BatchNorm; to output each of the respective refinement location offsets.

31. The method of any one of claims 22 to 30 wherein the CNN model is trained using respective training images having ground truth heat maps for respective landmarks of the plurality of landmarks defined in accordance with a Gaussian distribution with a mode corresponding to respective ground truth coordinate positions of the landmarks in the respective training images.

32. The method of claim 31 wherein the Gaussian distribution is defined in accordance with:

$$f(x,y) = \frac{1}{2\pi\sigma_x\sigma_y}\exp\left(-\left(\frac{(x-x_i)^2}{2\sigma_x^2} + \frac{(y-y_i)^2}{2\sigma_y^2}\right)\right) \qquad (1)$$

where x, y denote the coordinates of any pixel in a training image and (x_i, y_i) is the corresponding landmark coordinate.

33. The method of claim 32 wherein a regressed x_pred, y_pred is the expected value of the pixel locations according to the distribution (1) computed from a respective predicted heat map such that:

$$x_{pred} = \frac{\sum_j w_j x_j}{\sum_j w_j}, \qquad y_{pred} = \frac{\sum_j w_j y_j}{\sum_j w_j} \qquad (2)$$

where j is an index over all the pixels in the respective heat map, and w_j denotes a heat map value for that pixel.

34. The method of any one of claims 22 to 33 wherein the CNN is trained with a loss function defined by pixel-wise sigmoid cross entropy for learning heat maps.

35. The method of claim 34 wherein the loss function further includes an L2 distance loss.

36. The method of claim 34 or claim 35 wherein the loss function comprises:

$$\mathcal{L} = -\sum_{n=1}^{N}\sum_{l=1}^{L}\sum_{i=1}^{H}\sum_{j=1}^{W} w^{nl}_{ij}\left[\hat{p}^{nl}_{ij}\log p^{nl}_{ij} + \left(1-\hat{p}^{nl}_{ij}\right)\log\left(1-p^{nl}_{ij}\right)\right] + \sum_{n=1}^{N}\sum_{l=1}^{L}\left\|\left(\hat{x}^{nl},\hat{y}^{nl}\right)-\left(x^{nl},y^{nl}\right)\right\|_2^2 \qquad (3)$$

where: p^{nl}_{ij} is the prediction value of the heat map in the lth channel at pixel location (i,j) of the nth sample; \hat{p}^{nl}_{ij} is the corresponding ground truth; w^{nl}_{ij} is the weight at pixel location (i,j), which is calculated from equation 4; (\hat{x}^{nl}, \hat{y}^{nl}) is the ground truth coordinate of the nth sample's lth landmark; and (x^{nl}, y^{nl}) is the expected coordinate of the same landmark.

37. The method of any one of claims 22 to 36 further comprising modifying the image at or about at least one of the respective locations of the plurality of landmarks.

38. The method of claim 37 wherein modifying the image comprises simulating a product applied to the image.

39. The method of claim 37 or claim 38 wherein the image is a video image and wherein the method presents the image as modified in real time to simulate a virtual reality.

40. The method of any one of claims 37 to 39 further comprising performing the method by a personal computing device, preferably a smartphone or tablet, having a camera and wherein the image is a selfie taken by the camera.

41. The method of any one of claims 37 to 40 wherein the landmarks are facial landmarks, the image comprises a face, and the method further comprises using the respective locations of the landmarks to update the image with at least one product simulation.

42. A non-transitory storage device storing instructions that when executed by a processing unit configure a computing device to process an image to determine respective locations of each of a plurality of landmarks by: processing the image using a Convolutional Neural Network (CNN) having a first stage and a second stage, wherein: the first stage receives the image and determines: in a first portion of the first stage, a volume of intermediate features from the image; and, in a second portion of the first stage using the volume of intermediate features, a respective initial location for each of the landmarks within the image; and the second stage receives, at a first portion of the second stage, the volume of intermediate features and receives, at a second portion of the second stage, the respective initial location for each of the landmarks, the second stage further: processing the volume to further refine the intermediate features; performing, on the intermediate features as further refined, a Region of Interest-based pooling while preserving feature alignment to produce cropped features for each of the plurality of landmarks; and, determining, using the cropped features and for each landmark, a respective refinement location offset for the respective initial location of each landmark; and wherein the processing further operates to combine each respective initial location and the respective refinement location offset to determine final location coordinates in the image of each of the plurality of landmarks.

43. A computing device comprising a processor and a storage device, the computing device configured via a Convolutional Neural Network (CNN) to process an image to detect respective locations of a plurality of landmarks in the image, the CNN comprising: a two-stage localization architecture in which a first stage employs first heat maps to determine initial coordinates for the respective locations and a second stage employs second heat maps to determine refinement offsets for the initial coordinates, the second stage further using region of interest pooling for each individual landmark for reducing overlapping computation to avoid non-relevant regions and to guide the production of relevant shared features; and wherein the CNN is trained with auxiliary coordinate regression loss to minimize a size and computational resource use of the respective heat maps.

Description:
CONVOLUTION NEURAL NETWORK BASED LANDMARK TRACKER

Field

[0001] This disclosure relates to improvements in computers and computer processing, particularly image processing and neural networks and more particularly to convolution neural network based landmark tracker systems and methods.

Background

[0002] Facial landmark detection, the process of locating pre-defined landmarks on a human face in an image, is a common requirement of many image processing/computer vision applications. Image processing applications of interest providing practical applications may include facial recognition, animation and augmented reality uses, among others. One example of augmented reality image processing is a virtual try-on application, such as for makeup or other products applied to an individual. Virtual makeup try-on applications are tasked to render makeup onto the right locations under different lighting, pose, and face shape variations. Precise alignment, especially for frontal face poses, which are commonly seen in virtual try-on applications, is desired to provide an accurate and pleasing experience. Furthermore, for client-side Web applications, load time is extremely important, and GPUs necessary for fast execution of larger neural network architectures are not as efficiently utilizable.

[0003] While these resource constraints are not a large point of concern for state-of-the-art facial alignment architectures [1][2][3] (see the References list herein below, each of which is incorporated herein by reference), to strike a better balance for real-time applications, it is desired that an ideal architecture minimize load and execution time while preserving or improving alignment accuracy.

Summary

[0004] In the proposed architecture, a first stage makes initial predictions, from which crops of shared convolutional features are taken; these regions of interest are then processed by a second stage to produce refined predictions. This two-stage localization design helps to achieve fine-level alignment while remaining computationally efficient. The resulting architecture is both small enough in size and inference time to be suitable for real-time web applications.

[0005] In one aspect there is provided a computing device comprising a processing unit and a storage device coupled thereto, the storage unit storing instructions that when executed by the processing unit configure the computing device to process an image to determine respective locations of each of a plurality of landmarks by: processing the image using a Convolutional Neural Network (CNN) having a first stage and a second stage, wherein: the first stage processes the image to generate initial predictions of the respective locations; and the second stage, using intermediate features generated from the image by the first stage, and the initial predictions, generates crops of shared convolutional features for regions of interest to define cropped features and further processes the cropped features to produce respective refinement location offsets to the initial predictions of the respective locations.

[0006] In one aspect there is provided a computing device comprising a processing unit and a storage device coupled thereto, the storage unit storing instructions that when executed by the processing unit configure the computing device to process an image to determine respective locations of each of a plurality of landmarks by: processing the image using a Convolutional Neural Network (CNN) having a first stage and a second stage, wherein: the first stage receives the image and determines: in a first portion of the first stage, a volume of intermediate features from the image; and, in a second portion of the first stage using the volume of intermediate features, a respective initial location for each of the landmarks within the image; and the second stage receives, at a first portion of the second stage, the volume of intermediate features and receives, at a second portion of the second stage, the respective initial location for each of the landmarks, the second stage further: processing the volume to further refine the intermediate features; performing, on the intermediate features as further refined, a Region of Interest-based pooling while preserving feature alignment to produce cropped features for each of the plurality of landmarks; and, determining, using the cropped features and for each landmark, respective refinement location offsets for each respective initial location of each landmark; and wherein the processing further operating to combine each respective initial location and the respective refinement location offset to determine final location coordinates in the image of each of the plurality of landmarks.

[0007] In one aspect there is provided a computing device comprising a processor and a storage device, the computing device configured via a Convolutional Neural Network (CNN) to process an image to detect respective locations of a plurality of landmarks in the image, the CNN comprising: a two stage localization architecture in which a first stage employs first heat maps to determine initial coordinates for the respective locations and a second stage employs second heat maps to determine refinement offsets for the initial coordinates, the second stage further using region of interest pooling for each individual landmark for reducing overlapping computation to avoid non-relevant regions and to guide the production of relevant shared features; and wherein the CNN is trained with auxiliary coordinate regression loss to minimize a size and computational resource use of the respective heat maps.

[0008] Method, computer program product and other aspects will be apparent to those of ordinary skill in the art. A computer program product as used herein comprises a non-transitory storage device storing instructions that when executed by a processing unit configure a computing device.

Brief Description of the Drawings

[0009] Fig. 1 is a network diagram showing a two stage CNN.

[0010] Fig. 2 is a diagram of an inverted residual block of feature maps/volumes.

[0011] Fig. 3 is a diagram of a predict block of feature maps/volumes of a second stage of the network of Fig. 1.

[0012] Figs. 4 and 5 are flowcharts of operations.

[0013] The present inventive concept is best described through certain embodiments thereof, which are described herein with reference to the accompanying drawings, wherein like reference numerals refer to like features throughout. It is to be understood that the term invention, when used herein, is intended to connote the inventive concept underlying the embodiments described below and not merely the embodiments themselves. It is to be understood further that the general inventive concept is not limited to the illustrative embodiments described below and the following descriptions should be read in such light. More than one inventive concept may be shown and described and each may stand alone or be combined with one or more others unless stated otherwise.

Detailed Description

1. Context

1.1 Facial Landmark Alignment

[0014] The facial landmark alignment problem has a long history with classical computer vision solutions. For instance, the fast ensemble-tree-based algorithm of [4] achieves reasonable accuracy and is widely used for real-time face tracking [5]. However, the model size required to achieve such accuracy is prohibitively large.

[0015] Current state-of-the-art accuracy for facial landmark alignment is achieved by convolutional neural network based methods. To maximize accuracy on extremely challenging datasets [6][7][8], large neural networks are used that are not real-time, have model sizes of tens to hundreds of megabytes (MB) [3][9], and entail unreasonable load times for Web applications.

1.2 Efficient CNN Architectures

[0016] To bring the performance of convolutional neural networks to mobile vision applications, numerous architectures with efficient building blocks such as MobileNetV2 [10], SqueezeNet [11] and ShuffleNet [12] have recently been released. These networks aim to maximize performance (e.g. classification accuracy) for a given computational budget, which consists of the number of required learnable parameters (the model size) and multiply-adds.

[0017] A focus is given to MobileNetV2, whose inverted residual blocks may be used in an implementation of the present design. MobileNetV2's use of depthwise convolution over regular convolutions drastically reduces the number of multiply-adds and learnable parameters, at a slight cost in performance [13]. Furthermore, the inverted design, which is based upon the principle that network expressiveness can be separated from capacity, allows for a large reduction in the number of cross-channel computations within the network [10]. Finally, the residual design taken from ResNet [14] eases issues with gradient propagation in deeper networks.

1.3 Heat map

[0018] Fully convolutional neural network architectures based on heat map regression [15][16][17][18] have been widely used on human pose estimation tasks. The use of heat maps provides a high degree of accuracy, along with an intuitive means of seeing the network’s understanding and confidence of landmark regression. This technique has also been used in recent facial alignment algorithms such as the Stacked Hourglass architecture [3]. However, the Stacked Hourglass approach[3] uses high resolution heat maps, which require a large amount of computation in the decoding layers. There is room for optimization here, as the heat maps only have non-negligible values in a very concentrated and small portion of the overall image. This observation motivates us to use regional processing, which allows for the network to focus its processing on relevant areas (i.e. the approximate region of interest).

1.4 Mask-RCNN

[0019] There are a series of frameworks which are flexible and robust for object detection and semantic segmentation, like Fast R-CNN [19], Faster R-CNN [20] and the Fully Convolutional Network [21]. Faster R-CNN uses a multi-branch design to perform bounding box regression and classification in parallel. Mask R-CNN [22] is an extension of Faster R-CNN, and adds a new branch for predicting segmentation masks based on each Region of Interest. Of particular interest is Mask R-CNN's use of RoIAlign [22] (where RoI is an initialism of the term "Region of Interest"), which allows for significant savings in computation time by taking crops from shared convolutional features. By doing this, it avoids re-computing features for overlapping regions of interest.

1.5 Verification

[0020] In order to keep the output facial shape valid, a verification step may be performed before returning a final prediction, for example, to prevent returning an implausible shape when there is no face, only part of a face, or the face is over-rotated. To provide a standard reference for a face shape, Principal Component Analysis may be used to obtain the first 100 principal clusters from the training dataset. The smallest distance between the transformed predicted shape and one of the cluster centres may be determined, and this smallest distance is used as a score to verify whether the predicted shape is valid.
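By way of non-limiting illustration, the following is a minimal Python sketch of such a verification step. The disclosure does not fix the exact procedure, so scikit-learn, a PCA projection followed by k-means to obtain the 100 cluster centres, the component count, and the train_shapes.npy file of flattened, similarity-normalized landmark coordinates are all assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Training shapes: one row per image of flattened (x, y) landmark coordinates,
# already similarity-normalized; file name and component count are illustrative.
train_shapes = np.load("train_shapes.npy")                 # (num_samples, #L * 2)
pca = PCA(n_components=20).fit(train_shapes)
kmeans = KMeans(n_clusters=100).fit(pca.transform(train_shapes))

def shape_score(pred_shape: np.ndarray) -> float:
    """Smallest distance between the transformed predicted shape and a cluster
    centre; thresholding this score verifies whether the shape is valid."""
    z = pca.transform(pred_shape.reshape(1, -1))
    return float(np.min(np.linalg.norm(kmeans.cluster_centers_ - z, axis=1)))
```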

2. Computing Devices, Systems, Methods and Other Aspects

[0021] The following are some of the features described herein:

[0022] - RoIAlign [22] is used for each individual landmark to save potentially overlapping computation, allow the network to avoid non-relevant regions, and force the network to learn to produce good shared features. In an example, 8x8 heat maps from stage 1 indicate the coordinates of each of the (facial) landmarks. These landmarks may be calculated to form coordinates (x,y) by using a mask mean method. RoIAlign uses the first stage's predicted coordinates to crop an intermediate feature map with a uniform size 4x4. For example, assume there is a first landmark predicted to be located at (0.5, 0.5) in normalized coordinates, and that a 32x32 feature map is cropped. The cropped box will be [(14.0, 14.0), (18.0, 18.0)] [top_left_corner, bottom_right_corner] (see the sketch following this feature list).

[0023] - The proposed two-stage localization architecture, along with an auxiliary coordinate regression loss, allows working with extremely small and computationally cheap heat maps at both stages. Two losses may be combined together: a heat map loss and a coordinate distance loss.
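The crop-box arithmetic of the example in [0022] can be checked with a few lines of plain Python (the function name is illustrative only):

```python
def roi_crop_box(x_norm: float, y_norm: float, fmap_size: int = 32, crop_size: int = 4):
    """Axis-aligned crop box centred on a stage-1 landmark prediction given in
    normalized coordinates, over a square feature map of side fmap_size."""
    cx, cy = x_norm * fmap_size, y_norm * fmap_size        # centre in feature-map pixels
    half = crop_size / 2.0
    return (cx - half, cy - half), (cx + half, cy + half)  # (top_left, bottom_right)

print(roi_crop_box(0.5, 0.5))  # ((14.0, 14.0), (18.0, 18.0)), matching the example
```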

2.1 Model Structure.

[0024] The CNN model has two stages and is trained end-to-end, as illustrated in Fig. 1. Fig. 1 shows CNN 100 comprising flows of layers and/or blocks having output volumes (feature maps) beginning from an input layer 102 (e.g. an image with a face for a facial landmark example) of dimension 128x128x3 (height, width and color). Other dimensions may be used such as 224x224x3. The dimension may be adjustable. Different dimensions (resolution) may be used for different purposes.

[0025] The flows comprise two stages, a first stage 104 and a second stage 106. The flows of the first and second stages are defined in accordance with respective groups of layers and/or blocks comprising first stage layers/blocks and second stage layers/blocks, each having first portions and second portions. These groups of layers/blocks (e.g. 108, 110, 114 and 116) are represented by arrows between the feature maps/volumes as will be understood by persons of skill in the art. First stage 104 comprises group 108 in a first portion and group 110 in a second portion while second stage 106 comprises group 114 in a first portion and group 116 in a second portion. Groups 108 and 110 may also be referenced as first group 108 and second group 110 of the first stage. Groups 114 and 116 may also be referenced as first group 114 and second group 116 of the second stage. First stage 104 further comprises layer 112 and second stage 106 further comprises layer 118. The outputs of these layers 112 and 118 are combined at layer 120 to provide the output of CNN 100 as further described.

[0026] The shading legend of Fig. 1 indicates a processing operation type for each of the layers and/or blocks of CNN 100. In further detail, group 108 comprises a convolutional layer 108A of dimension 64x64x8 and inverted residual blocks 108B and 108C of dimensions 64x64x8 and 32x32x8, respectively. It is understood that the dimensions for respective blocks or layers reference the size of the output feature maps. A general form of an expanded inverted residual block in accordance with [10] is shown in Fig. 2. Group 110 comprises inverted residual blocks 110A-110D of respective dimensions 16x16x16, 8x8x32, 8x8x32 and 8x8x#L, where #L is the number or count of the plurality of landmarks. As trained and tested, #L=16. Other landmark counts (e.g. #L = 65, 86, etc.) may be implemented. The value of #L may be adjustable. Following group 110 is layer 112, a get mask mean layer of dimension #Lx2.

[0027] The output of group 108 (e.g. following block 108C) is an intermediate feature map (sometimes referenced as a volume of intermediate features) of the first stage 104 and is shared with (e.g. is an input to) second stage 106 at group 114. Group 114 comprises inverted residual blocks 114A-114C of respective dimensions 32x32x8, 32x32x16 and 32x32x16.

[0028] The output of group 114 (e.g. the intermediate feature map as further refined by the processing of blocks 114A-114C), along with the output of layer 112 representing initial locations of the landmarks, is processed by group 116. Group 116 comprises #L RoI Crop + concatenate blocks (represented by blocks 116_1, 116_2, ..., 116_#L), one for each of the #L landmarks, each block having a dimension of 4x4x16, giving 4x4x16*#L output feature maps when concatenated. The concatenated feature maps are provided to predict block 117 of group 116, having a dimension of 4x4x#L. Predict block 117 is expanded in Fig. 3.

[0029] In turn, the output of predict block 117 is provided to layer 118, a second get mask mean layer of dimension #Lx2. The respective outputs of the two layers 112 and 118 represent initial locations of the #L landmarks and refinement offsets thereto. These are provided to output layer 120, also having dimension #Lx2, such that when combined, there is produced an (x,y) coordinate, in relation to input layer 102, for each of the respective #L landmarks.

[0030] Thus, the first stage 104 shows a series of Inverted Residual Blocks which, by 110D, predict 8 by 8 heat maps, one for each facial landmark. Interpreting the normalized activations over the heat maps as a probability distribution, the expected values of these heat maps are computed to obtain the x, y coordinates. This is described in more detail below.

[0031] The second stage has several shared layers/blocks, which branch off from part of the first stage. Using the initial predictions from the previous stage (the intermediate feature maps from group 108 following block 108C, as further refined by group 114 following block 114C), RoIAlign [22] is applied to the final shared convolutional features. Each of the cropped features is input to one final convolutional layer (of a predict block 117), which has separate weights for each individual landmark. Predict block 117 makes use of group convolutions [12] to implement this in a straightforward manner. The output at 117 is a heat map for each landmark. The coordinates obtained from these heat maps indicate the required offset from the initial "coarse" prediction, i.e., if the heat map at this stage is perfectly centered, then there is effectively no refinement.

[0032] This Region of Interest based pooling by group 116 uses the first stage's prediction (from layer 112) as a crop centre, with the coordinates [x_c, y_c] of each landmark being derived by applying the mask mean layer at 112 to the coarse heat map from 110D. Group 116 (via predict block 117) uses the cropped features (e.g. the concatenated output from blocks 116_1, 116_2, ..., 116_#L) to predict the refinement offsets (predicting a refinement heat map first and then using a mask mean layer to get the refinement shifting distance [x_r, y_r]). The final prediction (output layer) adds up the coarse prediction from the first stage and the refinement prediction from the second stage.
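By way of non-limiting illustration, a sketch of this pooling-and-combine flow in Python/PyTorch follows; torchvision's roi_align is assumed as an off-the-shelf RoIAlign, and the disclosure does not mandate any particular framework:

```python
import torch
from torchvision.ops import roi_align

def crop_landmark_features(shared_feats: torch.Tensor, coarse_xy: torch.Tensor,
                           crop: int = 4) -> torch.Tensor:
    """shared_feats: (1, C, 32, 32) refined shared features (output of group 114);
    coarse_xy: (#L, 2) stage-1 coordinates [x_c, y_c] in feature-map pixels (layer 112).
    Returns (#L, C, crop, crop) cropped features, one per landmark."""
    half = crop / 2.0
    boxes = torch.cat([coarse_xy - half, coarse_xy + half], dim=1)  # (x1, y1, x2, y2)
    return roi_align(shared_feats, [boxes], output_size=crop, aligned=True)

# The predict block turns these crops into refinement heat maps, a mask mean layer
# extracts the offsets [x_r, y_r], and the output layer adds:
#   final_xy = coarse_xy + refinement_offsets
```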

2.2 Coordinate Regression from Heat Maps

[0033] For the ground truth heat maps, a Gaussian distribution is used with a mode corresponding to the ground truth coordinates' positions. Letting x, y denote the coordinates of any pixel in the feature map, the value can be computed using the following distribution:

$$f(x,y) = \frac{1}{2\pi\sigma_x\sigma_y}\exp\left(-\left(\frac{(x-x_i)^2}{2\sigma_x^2} + \frac{(y-y_i)^2}{2\sigma_y^2}\right)\right) \qquad (1)$$

where (x_i, y_i) is the corresponding landmark coordinate. In experiments, σ_x and σ_y are both set to 0.8.
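A short numpy sketch generating a ground truth heat map of this form (the leading constant assumes the normalized Gaussian of equation (1)):

```python
import numpy as np

def gt_heatmap(h: int, w: int, xi: float, yi: float, sx: float = 0.8, sy: float = 0.8):
    """Ground truth heat map per equation (1): a 2D Gaussian whose mode sits at
    the landmark coordinate (xi, yi); sigma_x = sigma_y = 0.8 as in experiments."""
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    expo = ((xs - xi) ** 2) / (2 * sx ** 2) + ((ys - yi) ** 2) / (2 * sy ** 2)
    return np.exp(-expo) / (2 * np.pi * sx * sy)
```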

[0034] In accordance with the get mask mean layer (e.g. each of 112 and 118), the regressed x_pred, y_pred is then the expected value of the pixel locations according to the distribution computed from the heat map predicted by the CNN (e.g. the "predicted heat map"). Letting j index over all the pixels in the predicted heat map, and w_j denote the heat map value for that pixel:

$$x_{pred} = \frac{\sum_j w_j x_j}{\sum_j w_j}, \qquad y_{pred} = \frac{\sum_j w_j y_j}{\sum_j w_j} \qquad (2)$$
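A minimal PyTorch sketch of such a get mask mean layer (the framework is an assumption; equation (2) is implemented directly):

```python
import torch

def mask_mean(heatmaps: torch.Tensor) -> torch.Tensor:
    """Expected pixel location under each heat map, per equation (2).
    heatmaps: (N, #L, H, W) -> (N, #L, 2) coordinates as (x, y)."""
    n, l, h, w = heatmaps.shape
    ys = torch.arange(h, dtype=heatmaps.dtype, device=heatmaps.device).view(1, 1, h, 1)
    xs = torch.arange(w, dtype=heatmaps.dtype, device=heatmaps.device).view(1, 1, 1, w)
    total = heatmaps.sum(dim=(2, 3)).clamp_min(1e-8)        # normalizing denominator
    x_pred = (heatmaps * xs).sum(dim=(2, 3)) / total
    y_pred = (heatmaps * ys).sum(dim=(2, 3)) / total
    return torch.stack((x_pred, y_pred), dim=-1)
```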

2.3 Loss Function

[0035] The loss function uses a pixel-wise sigmoid cross entropy [23] to learn the heat maps.

[0036] Additionally, in order to alleviate issues with the heat maps being cut off for landmarks near boundaries, an L2 distance loss is added on:

$$\mathcal{L} = -\sum_{n=1}^{N}\sum_{l=1}^{L}\sum_{i=1}^{H}\sum_{j=1}^{W} w^{nl}_{ij}\left[\hat{p}^{nl}_{ij}\log p^{nl}_{ij} + \left(1-\hat{p}^{nl}_{ij}\right)\log\left(1-p^{nl}_{ij}\right)\right] + \sum_{n=1}^{N}\sum_{l=1}^{L}\left\|\left(\hat{x}^{nl},\hat{y}^{nl}\right)-\left(x^{nl},y^{nl}\right)\right\|_2^2 \qquad (3)$$

where p^{nl}_{ij} is the prediction value of the heat map in the lth channel at pixel location (i,j) of the nth sample, \hat{p}^{nl}_{ij} is the corresponding ground truth, and w^{nl}_{ij} is the weight at pixel location (i,j), which is calculated from equation 4. (\hat{x}^{nl}, \hat{y}^{nl}) is the ground truth coordinate of the nth sample's lth landmark, and (x^{nl}, y^{nl}) is the predicted coordinate of the same landmark. Here L is the number of landmarks and H and W are the height and width of the heat map (e.g. 8x8). The auxiliary coordinate regression loss is the mean-square-error loss in the second line of equation (3). The loss function, comprising the combined pixel-wise sigmoid cross entropy loss and the L2 loss, is applied to each respective stage such that each has its own loss determination during training. As noted, the use of two stages facilitates smaller heat maps and, thus, lower computing resource consumption.
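A hedged PyTorch sketch of this per-stage loss follows; the pixel weights of equation 4 are not reproduced in the text, so they are taken as a caller-supplied tensor:

```python
import torch
import torch.nn.functional as F

def stage_loss(logits: torch.Tensor, gt_maps: torch.Tensor, pix_w: torch.Tensor,
               pred_xy: torch.Tensor, gt_xy: torch.Tensor) -> torch.Tensor:
    """Equation (3) for one stage: weighted pixel-wise sigmoid cross entropy on
    the heat maps plus the auxiliary mean-square-error coordinate loss.
    logits/gt_maps/pix_w: (N, #L, H, W); pred_xy/gt_xy: (N, #L, 2)."""
    ce = F.binary_cross_entropy_with_logits(logits, gt_maps, weight=pix_w,
                                            reduction="sum")
    mse = ((pred_xy - gt_xy) ** 2).sum()
    return ce + mse   # computed separately for the first and second stages
```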

2.4 Blocks

2.4.1 Inverted Residual Block

[0037] With reference to Fig. 2, there is shown a general expansion of an inverted residual block 200 substantially in accordance with reference [10]. Fig. 2 differs in that reference [10] employs ReLU6 while the present example employs ReLU. Such a general approach may be applied to the inverted residual blocks of CNN 100, though some dimensions may differ.

[0038] Experimentally, it was found that an expand ratio of 5 best suited the competing needs of performance and efficiency.

[0039] The processing of Inverted Residual Block 200 performs, in order: a. channel-wise convolutions with 1x1 kernel followed by BatchNorm and ReLU activation of an input layer 202 having dimension HxWxC, where C represents channels rather than color per se, outputting a feature map 204 of dimension HxWxC*5; b. depth-wise convolutions with 3x3 kernel followed by BatchNorm and ReLU activation of feature map 204, providing an output (feature map) 206 having dimensions HxWxC*5; and c. channel-wise convolutions with 1x1 kernel followed by BatchNorm on feature map 206, and an add operation with layer 202, providing an output having dimensions HxWxC.
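A minimal PyTorch sketch of such a stride-1, dimension-preserving block (the framework is an assumption):

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of block 200, after MobileNetV2 [10] but with ReLU in place of
    ReLU6 and the expand ratio of 5 noted above."""
    def __init__(self, channels: int, expand: int = 5):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),   # a. 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,       # b. 3x3 depth-wise
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),   # c. 1x1 projection
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)                          # residual add with input 202
```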

2.4.2 Predict Block

[0040] After concatenating all RoIAlign cropped features, the number of channels is equal to the number of input channels (e.g. 16) multiplied by the number of landmarks (#L). Since each landmark's refinement heat map is predicted independently, such may be implemented using a 16-channel channel-wise convolution [12], as Fig. 3 shows. Fig. 3 shows an expansion of block 117 working on an input of cropped and concatenated features (input feature map 117A) having dimensions 4x4x16*#L.

[0041] Predict block 117 performs, in order: a. group-wise convolutions with 3x3 kernel followed by BatchNorm and ReLU activation on input feature map 117A to output a feature map 117B having dimensions 4x4x16*#L; and, b. channel-wise convolutions with 1x1 kernel followed by BatchNorm to output feature map 117C having dimensions 4x4x#L (defining 4x4 heat maps for each of the #L landmarks).
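A PyTorch sketch of predict block 117 follows; grouping the 1x1 convolution per landmark is an assumption consistent with each landmark's refinement heat map being predicted independently:

```python
import torch.nn as nn

def predict_block(num_landmarks: int, ch: int = 16) -> nn.Sequential:
    """Predict block 117 via group convolutions [12]: each landmark's ch cropped
    channels are processed with separate weights."""
    c = ch * num_landmarks                                 # input: 4x4x16*#L
    return nn.Sequential(
        nn.Conv2d(c, c, 3, padding=1,                      # a. 3x3 group-wise conv
                  groups=num_landmarks, bias=False),
        nn.BatchNorm2d(c),
        nn.ReLU(inplace=True),
        nn.Conv2d(c, num_landmarks, 1,                     # b. 1x1 conv per landmark
                  groups=num_landmarks, bias=False),
        nn.BatchNorm2d(num_landmarks),                     # output: 4x4x#L heat maps
    )
```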

2.5 Data Augmentation

[0042] Several common methods are used to perform data augmentation; for example, random rotation, shifting, and horizontal flipping of the input image are used. To better equip the model for handling common occlusion cases such as glasses or hands, such objects are also randomly pasted into pictures around the faces therein.
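A hedged numpy sketch of two of these augmentations (horizontal flip with mirrored landmark x-coordinates, and pasting a random RGBA occluder patch); rotation and shifting are omitted for brevity, and re-pairing of left/right landmark indices after a flip is left out:

```python
import numpy as np

def augment(image: np.ndarray, landmarks: np.ndarray, occluders: list,
            rng=np.random.default_rng()):
    """image: (H, W, 3); landmarks: (#L, 2) in pixels; occluders: RGBA patches."""
    h, w = image.shape[:2]
    if rng.random() < 0.5:                                   # horizontal flip
        image = image[:, ::-1].copy()
        landmarks = landmarks.copy()
        landmarks[:, 0] = (w - 1) - landmarks[:, 0]
    patch = occluders[rng.integers(len(occluders))]          # random occluder
    ph, pw = patch.shape[:2]
    y0, x0 = int(rng.integers(0, h - ph)), int(rng.integers(0, w - pw))
    alpha = patch[..., 3:4] / 255.0                          # alpha-blend the paste
    region = image[y0:y0 + ph, x0:x0 + pw]
    image[y0:y0 + ph, x0:x0 + pw] = ((1 - alpha) * region
                                     + alpha * patch[..., :3]).astype(image.dtype)
    return image, landmarks
```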

3. Results and Comparison

[0043] Training of the new model used batch size 8 and an SGD optimizer with learning rate 8e-5 and momentum 0.9. The new model was evaluated on an in-house test set, and the distance error was calculated, normalized by the distance between eye centers. The normalized error of the first stage is 3.35% and the error of the full model is 2.89%. The running time of the new model in Web browsers on an iPhone 7 is around 40 ms/frame, and its learnable parameters amount to around 300 KB in total.

[0044] Tables 1 and 2 show comparative data for the new model, including comparisons to a larger in-house RF tracker model and to Mobilenet_v2_0.35_128.

Table 1: The performance compared with in-house RF tracker.

Table 2: The performance compared with MobileNetV2 [10].

[0045] Figs. 4 and 5 are flowcharts of operations showing computer implemented method aspects. Fig. 4 shows operations 400 of a computer implemented method. At 402, an image is received for processing. The image may be a selfie image or video-based selfie image. The image may be received from a camera that is a component of the computing device or system performing the method. Such may be a mobile device, kiosk at a product counter, tablet, etc. Other form factors and computing devices and systems will be apparent. Cloud or other service-based systems may be contemplated where a local computing device may receive the image via a local camera and provide the image to a remote computing device that is configured to perform as a service. The service may be provided via a native application or browser of the local computing device.

[0046] The image may be pre-processed such as by scaling to a particular dimension (step 404). At 406, operations process the image to determine respective locations of each of a plurality of landmarks by using a Convolutional Neural Network (CNN) having a first stage and a second stage. The first stage processes the image to generate initial predictions of the respective locations. The second stage, using intermediate features generated from the image by the first stage, and the initial predictions, generates crops of shared convolutional features for regions of interest to define cropped features and further processes the cropped features to produce respective refinement location offsets to the initial predictions of the respective locations.

[0047] The CNN may combine the initial predictions of the respective locations and the respective refinement location offsets to provide the respective locations for each of a plurality of landmarks. The respective locations of the landmarks may comprise final locations determined by combining the initial predictions with the respective refinement location offsets or may comprise the set of location data including the initial predictions and the respective refinement location offsets. The location data, in any of its forms, may be provided for use, such as to modify the image at at least one of the landmark locations (step 408).

[0048] The CNN’s first stage may produce and use the intermediate features to produce initial heat maps from which the initial predictions are generated.

[0049] The second stage may produce second heat maps from the cropped features, one for each landmark, and use the second heat maps to produce the respective refinement location offsets.

[0050] Fig. 5 is a flowchart showing operations 500 of a computer implemented method. The operations may be performed by a computing device or system such as described herein (e.g. in relation to Fig. 4, or otherwise). Steps 502 and 504 are similar to steps 402 and 404. Step 510 is similar to step 408, where the image is modified.

[0051] Step 506 shows processing the image to determine respective locations of each of a plurality of landmarks by using a Convolutional Neural Network (CNN) having a first stage and a second stage. The first stage receives the image and determines: in a first portion of the first stage, a volume of intermediate features from the image; and, in a second portion of the first stage using the volume of intermediate features, a respective initial location for each of the landmarks within the image. The second stage receives, at a first portion of the second stage, the volume of intermediate features and receives, at a second portion of the second stage, the respective initial location for each of the landmarks.

[0052] The second stage further operates to: process the volume to further refine the intermediate features; perform, on the intermediate features as further refined, a Region of Interest-based pooling while preserving feature alignment to produce cropped features for each of the plurality of landmarks; and, determine, using the cropped features and for each landmark, respective refinement location offsets for each respective initial location of each landmark.

[0053] The operations 500 may further operate (e.g. at 508) to combine each respective initial location and the respective refinement location offset to determine final location coordinates in the image of each of the plurality of landmarks.

[0054] In the operations 500, the second portion of the first stage may use the volume of intermediate features to determine initial heat maps and predict each respective initial location for each of the landmarks using the initial heat maps.

[0055] In the operations of Fig. 4 or Fig. 5, at least some of the first stage may comprise a series of Inverted Residual Blocks and at least some of the second stage may comprise a series of Inverted Residual Blocks. The second stage may: use RoIAlign for Region of Interest-based pooling while preserving feature alignment to produce the cropped features; and concatenate the cropped features.

[0056] In the operations of Fig. 4 or Fig. 5, the second stage may comprise a predict block to process the cropped features, the predict block performing, in order: channel-wise convolutions with 3X3 kernel followed by BatchNorm and ReLU activation; and, group-wise channel convolutions with 1X1 kernel followed by BatchNorm; to output each of the respective refinement location offsets.

[0057] In the operations of Fig. 4 or Fig. 5, the CNN model may be trained using respective training images having ground truth heat maps for respective landmarks of the plurality of landmarks defined in accordance with a Gaussian distribution with a mode corresponding to respective ground truth coordinate positions of the landmarks in the respective training images. The Gaussian distribution may be defined as previously shown herein above.

[0058] The CNN in the operations 400 or 500 may be trained with a loss function defined by pixel-wise sigmoid cross entropy for learning heat maps. The loss function may further include an L2 distance loss. The loss function may be as shown and described earlier herein above.

[0059] In operations 400 or 500, modifying the image may comprise simulating a product applied to the image. The image may be a video image and the method may present the image as modified in real time to simulate a virtual reality.

[0060] In operations 400 or 500, the landmarks may be facial landmarks and the image may comprise a face. The respective operations may comprise using the respective locations of the landmarks to update the image with at least one product simulation.

[0061] In addition to computing device (or system) aspects and method aspects, a person of ordinary skill will understand that computer program product aspects are disclosed, where instructions are stored in a non-transient storage device (e.g. a memory, CD-ROM, DVD-ROM, disc, etc.) to configure a computing device to perform any of the method aspects disclosed herein.

[0062] It will be understood that the CNN may provide the respective landmark locations for further processing of the image. For example, a computing device may be configured via instructions to receive an image and perform landmark detection on the image using the CNN.

[0063] The instructions may configure the computing device to modify the image at or about at least one of the landmarks using the final coordinates for the at least one of the landmarks. The image may be annotated (an example of a modification) at or about the at least one of the landmarks, for example, showing a bounding box or region, showing a mask, etc. To modify the image may comprise simulating a product applied to the image. The product may be a makeup product such as when the image is a face and the landmarks are facial landmarks. The image may be a video image and the computing device may be configured via the instructions to modify and present the image in real time to simulate a virtual reality. The computing device may further comprise a camera and the video may be a selfie video taken by the camera.

[0064] Practical implementation may include any or all of the features described herein. These and other aspects, features and various combinations may be expressed as methods, apparatus, systems, means for performing functions, program products, and in other ways, combining the features described herein. A number of embodiments have been described. Nevertheless, it will be understood that various modifications can be made without departing from the spirit and scope of the processes and techniques described herein. In addition, other steps can be provided, or steps can be eliminated, from the described process, and other components can be added to, or removed from, the described systems. Accordingly, other embodiments are within the scope of the following claims.

[0065] Insofar as embodiments of the invention described above are implementable, at least in part, using a software-controlled programmable processing device such as a general purpose processor or special-purpose processor, digital signal processor, microprocessor, or other processing device, data processing apparatus or computer system, it will be appreciated that a computer program for configuring a programmable device, apparatus or system to implement the foregoing described methods, apparatus and system is envisaged as an aspect of the present invention. The computer program may be embodied as any suitable type of code, such as source code, object code, compiled code, interpreted code, executable code, static code, dynamic code, and the like. The instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language, such as C, C++, Java, BASIC, Perl, Matlab, Pascal, Visual BASIC, JAVA, ActiveX, assembly language, machine code, and so forth. A skilled person would readily understand that the term "computer" in its most general sense encompasses programmable devices such as referred to above, and data processing apparatus and computer systems in whatever format they may arise, for example, desktop personal computer, laptop personal computer, tablet, smart phone or other computing device.

[0066] Suitably, the computer program is stored on a carrier medium in machine readable form; for example, the carrier medium may comprise memory, removable or non-removable media, erasable or non-erasable media, writeable or re-writeable media, digital or analog media, hard disk, floppy disk, Compact Disk Read Only Memory (CD-ROM), Compact Disk Recordable (CD-R), Compact Disk Rewriteable (CD-RW), optical disk, magnetic media, magneto-optical media, removable memory cards or disks, various types of Digital Versatile Disk (DVD), subscriber identity module, tape, cassette or solid-state memory. The computer program may be supplied from a remote source embodied in the communications medium such as an electronic signal, radio frequency carrier wave or optical carrier waves. Such carrier media are also envisaged as aspects of the present invention.

[0067] Throughout the description and claims of this specification, the words "comprise" and "contain" and variations of them mean "including but not limited to" and are not intended to (and do not) exclude other components, integers or steps. Throughout this specification, the singular encompasses the plural unless the context requires otherwise. In particular, where the indefinite article is used, the specification is to be understood as contemplating plurality as well as singularity, unless the context requires otherwise.

[0068] Features, integers, characteristics, compounds, chemical moieties or groups described in conjunction with a particular aspect, embodiment or example of the invention are to be understood to be applicable to any other aspect, embodiment or example unless incompatible therewith. All of the features disclosed herein (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive. The invention is not restricted to the details of any foregoing examples or embodiments. The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings) or to any novel one, or any novel combination, of the steps of any method or process disclosed.

[0069] As used herein, the terms "comprises," "comprising," "includes," "including," "has," "having" or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, "or" refers to an inclusive or and not to an exclusive or.

[0070] In addition, the words "a" or "an" are employed to describe elements and components of the invention. This is done merely for convenience and to give a general sense of the invention. This description should be read to include one or at least one and the singular also includes the plural unless it is obvious that it is meant otherwise.

[0071] In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

[0072] The scope of the present disclosure includes any novel feature or combination of features disclosed therein either explicitly or implicitly or any generalisation thereof, unless incompatible therewith, irrespective of whether or not it relates to the claimed invention or mitigates any or all of the problems addressed by the present invention. The applicant hereby gives notice that new claims may be formulated to such features during prosecution of this application or of any such further application derived therefrom. In particular, with reference to the appended claims, features from dependent claims may be combined with those of the independent claims and features from respective independent claims may be combined in any appropriate manner and not merely in specific combinations enumerated in the claims.

References

The following publications are incorporated herein by reference.

[1] M. Kowalski, J. Naruniec, and T. Trzcinski, "Deep alignment network: A convolutional neural network for robust face alignment," CoRR, vol. abs/1706.01789, 2017.

[2] Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade for facial point detection,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3476-3483, June 2013.

[3] K. Yuen and M. M. Trivedi,“An occluded stacked hourglass approach to facial landmark localization and occlusion estimation,” CoRR, vol. abs/1802.02137, 2018.

[4] V. Kazemi and J. Sullivan,“One millisecond face alignment with an ensemble of regression trees,” 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1867- 1874, 2014.

[5] D. E. King,“Dlib-ml: A machine learning toolkit,” Journal of Machine Learning Research, vol. 10, pp. 1755-1758, 2009.

[6] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar,“Localizing parts of faces using a consensus of exemplars,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 2930-2940, Dec 2013.

[7] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang, “Interactive facial feature localization,” in Computer Vision - ECCV 2012 (A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, eds.), (Berlin, Heidelberg), pp. 679-692, Springer Berlin Heidelberg, 2012.

[8] G. Trigeorgis, P. Snape, M. A. Nicolaou, E. Antonakos, and S. Zafeiriou, “Mnemonic descent method: A recurrent process applied for end-to-end face alignment,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4177-4187, 2016.

[9] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” CoRR, vol. abs/1603.06937, 2016.

[10] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation," CoRR, vol. abs/1801.04381, 2018.

[11] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1MB model size," CoRR, vol. abs/1602.07360, 2016.

[12] X. Zhang, X. Zhou, M. Lin, and J. Sun,“Shufflenet: An extremely efficient convolutional neural network for mobile devices,” CoRR, vol. abs/1707.01083, 2017.

[13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," CoRR, vol. abs/1704.04861, 2017.

[14] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.

[15] A. Bulat and G. Tzimiropoulos,“Human pose estimation via convolutional part heat map regression,” CoRR, vol. abs/1609.01743, 2016.

[16] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” CoRR, vol. abs/1602.00134, 2016.

[17] Y. Chen, C. Shen, X. Wei, L. Liu, and J. Yang, "Adversarial learning of structure-aware fully convolutional networks for landmark localization," CoRR, vol. abs/1711.00253, 2017.

[18] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, “Deepercut: A deeper, stronger, and faster multi-person pose estimation model,” CoRR, vol. abs/1605.03170, 2016.

[19] R. B. Girshick,“Fast R-CNN,” CoRR, vol. abs/1504.08083, 2015.

[20] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” CoRR, vol. abs/1506.01497, 2015.

[21] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," CoRR, vol. abs/1411.4038, 2014.

[22] K. He, G. Gkioxari, P. Dollar, and R. B. Girshick, “Mask R-CNN,” CoRR, vol. abs/1703.06870, 2017.

[23] N. Zhang, E. Shelhamer, Y. Gao, and T. Darrell, "Fine-grained pose prediction, normalization, and recognition," CoRR, vol. abs/1511.07063, 2015.