

Title:
FACIAL SHAPE REPRESENTATION AND GENERATION SYSTEM AND METHOD
Document Type and Number:
WIPO Patent Application WO/2020/169983
Kind Code:
A1
Abstract:
A method of training a generative adversarial network, GAN, for generation of a 3D face surface. The GAN comprises a generator neural network and a discriminator neural network. The method includes pre-training the discriminator neural network and jointly training the generator neural network and the discriminator neural network. Pre-training the discriminator neural network includes processing a pre-training set of input facial data and updating parameters of the discriminator neural network. Jointly training the generator neural network and the discriminator neural network includes initialising parameters of the networks; processing a training set of input facial data using the generator; processing the training set and generator outputs using the discriminator; updating parameters of the generator based on the generator outputs and discriminator outputs for the generator outputs; and updating parameters of the discriminator based on the training set, the generator outputs and the discriminator outputs for the generator outputs.

Inventors:
MOSCHOGLOU STYLIANOS (GB)
ZAFEIRIOU STEFANOS (GB)
Application Number:
PCT/GB2020/050410
Publication Date:
August 27, 2020
Filing Date:
February 21, 2020
Assignee:
HUAWEI TECH CO LTD (CN)
International Classes:
G06K9/00; G06K9/62
Other References:
STYLIANOS MOSCHOGLOU ET AL: "3DFaceGAN: Adversarial Nets for 3D Face Representation, Generation, and Translation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 1 May 2019 (2019-05-01), XP081271106
DENG JIANKANG ET AL: "UV-GAN: Adversarial Facial UV Map Completion for Pose-Invariant Face Recognition", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, IEEE, 18 June 2018 (2018-06-18), pages 7093 - 7102, XP033473628, DOI: 10.1109/CVPR.2018.00741
WANG YAXING ET AL: "Transferring GANs: Generating Images from Limited Data", 6 October 2018, INTERNATIONAL CONFERENCE ON FINANCIAL CRYPTOGRAPHY AND DATA SECURITY; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER, BERLIN, HEIDELBERG, PAGE(S) 220 - 236, ISBN: 978-3-642-17318-9, XP047488191
BOOTH JAMES ET AL: "Optimal UV spaces for facial morphable model construction", 2014 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), IEEE, 27 October 2014 (2014-10-27), pages 4672 - 4676, XP032967515, DOI: 10.1109/ICIP.2014.7025947
SLOSSBERG RON ET AL: "High Quality Facial Surface and Texture Synthesis via Generative Adversarial Networks", 23 January 2019, ROBOCUP 2008: ROBOCUP 2008: ROBOT SOCCER WORLD CUP XII; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER INTERNATIONAL PUBLISHING, CHAM, PAGE(S) 498 - 513, ISBN: 978-3-319-10403-4, XP047501135
Attorney, Agent or Firm:
GILL JENNINGS & EVERY LLP (GB)
Claims

1. A method of training a generative adversarial network, GAN, for generation of a 3D face surface, wherein the GAN comprises a generator neural network and a discriminator neural network, wherein the training comprises:

pre-training the discriminator neural network by:

processing a pre-training set of input facial data using the discriminator neural network, wherein the pre-training set of input facial data represents a plurality of input 3D face models, and

updating parameters of the discriminator neural network based on a comparison of the pre-training set of input facial data and the discriminator neural network output for the pre-training set of input facial data; and jointly training the generator neural network and the discriminator neural network by:

initialising parameters of the generator neural network and the discriminator neural network using the updated parameters of the discriminator neural network;

processing a training set of input facial data using the generator neural network, wherein the training set of input facial data represents a plurality of input 3D face models, to output a set of generator outputs having the same form as the input facial data;

processing the training set of input facial data and the set of generator outputs using the discriminator neural network; and

updating parameters of the generator neural network based on the set of generator outputs and the discriminator neural network outputs for the set of generator outputs; and

updating parameters of the discriminator neural network based on the training set of input facial data, the discriminator neural network outputs for the training set of input facial data, the set of generator outputs, and the discriminator neural network outputs for the set of generator outputs.

2. The method of claim 1, wherein:

updating parameters of the generator neural network comprises updating a plurality of parameters of a decoder portion of the generator neural network without changing a plurality of parameters of an encoder portion of the generator neural network; and updating parameters of the discriminator neural network comprises updating a plurality of parameters of a decoder portion of the discriminator neural network without changing a plurality of parameters of an encoder portion of the discriminator neural network.

3. The method of any preceding claim, wherein updating parameters of the generator neural network is based on a comparison of the training set of input facial data and the set of generator outputs.

4. The method of any preceding claim, wherein each input facial data item of the pre-training set of input facial data and the training set of input facial data is an image-like tensor representation of a 3D facial mesh.

5. The method of claim 4, wherein the image-like tensor representation of a 3D facial mesh comprises a UV map representation of a corresponding 3D facial mesh, wherein the components for each of the pixels of the UV map represent a spatial displacement of a corresponding vertex of the 3D facial mesh with respect to a template 3D facial mesh.

6. The method of claim 5, further comprising obtaining the sets of input facial data, wherein obtaining the sets of input facial data comprises:

morphing a template mesh on to each of one or more raw facial meshes to derive one or more registered facial meshes;

unwrapping the template mesh to create a UV representation of the template mesh;

aligning and normalising the registered facial meshes; and for each of the one or more registered 3D facial meshes, storing the registered 3D facial mesh as a UV map by storing the spatial displacement of each vertex of the registered 3D facial mesh as components for the pixel corresponding to the respective vertex.

7. The method of claim 6, wherein obtaining the sets of input facial data further comprises interpolating the UV maps for each of the one or more registered 3D facial meshes.

8. The method of any preceding claim, wherein the generator neural network is an autoencoder.

9. The method of any preceding claim, wherein the discriminator neural network is an autoencoder.

10. The method of any preceding claim, further comprising:

extracting bottlenecks for each input facial data item of the training set of input facial data using the encoder portion of the trained generator neural network;

calculating the mean bottleneck of the extracted bottlenecks;

calculating the covariance of the extracted bottlenecks;

sampling a bottleneck based on the mean bottleneck and the covariance of the extracted bottlenecks; and

generating a face based on the sampled bottleneck using the decoder portion of the trained generator neural network.

11. The method of claim 10, wherein the sampling of the bottleneck is from a multivariate Gaussian distribution with a mean of the mean bottleneck and a covariance of the covariance of the extracted bottlenecks.

12. The use of a decoder portion of a generator neural network of a generative adversarial network trained using the method of any preceding claim for the generation of a 3D face surface.

13. A face generation system comprising:

a bottleneck source;

a trained generation network, wherein the trained generation network is the decoder portion of the generator of a generative adversarial network trained using the method of any of claims 1 to 11; and

a network output to facial mesh converter,

wherein the face generation system is configured to:

obtain a bottleneck using the bottleneck source;

process the bottleneck using the trained generation network to produce a generation network output; and convert the network output to a 3D facial mesh using the network output to facial mesh converter.

14. The face generation system of claim 13, wherein the bottleneck source is configured to sample bottlenecks based on a mean bottleneck and covariance of a set of bottlenecks for the training set of input facial data extracted using the encoder portion of the generator neural network.

15. The face generation system of claim 13 or 14, wherein the bottleneck source is configured to retrieve stored bottlenecks from one or more storage systems.

16. The face generation system of any of claims 13 to 15, wherein the bottleneck is a single dimensional vector of a fixed size.

17. The face generation system of any of claims 13 to 16, wherein the generation network output is an image-like tensor representation of a 3D facial mesh.

18. The face generation system of any of claims 13 to 17, wherein the network output to facial mesh converter is configured to convert the network output to a 3D facial mesh by deforming a template 3D facial mesh based on data stored in the generation network output.

19. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any one of claims 1 to 11.

20. A data processing apparatus comprising means for carrying out the method of any one of claims 1 to 11.

Description:
Facial Shape Representation and Generation System and Method

Field of the Invention

The present invention relates to facial shape representation and generation, and in particular to systems and methods for representing facial shapes and generating facial shapes using a neural network system.

Background

Facial shape representation relates to deriving a compact representation describing the key characteristics of a 3D face surface. Facial shape generation relates to generating 3D face surfaces. These generated 3D face surfaces may be arbitrary, reconstructed from a compact representation, or an interpolation of two or more 3D face surfaces. Representation and generation of the 3D face surface have wide-ranging applications including, but not limited to, 3D face reconstruction from images, 3D face recognition, 3D face tracking, and face relighting and transfer.

Summary

According to an embodiment, a method of training a generative adversarial network, GAN, for generation of a 3D face surface, is provided. The GAN comprises a generator neural network and a discriminator neural network. The training may include pre-training the discriminator neural network by processing a pre-training set of input facial data using the discriminator neural network, wherein the pre-training set of input facial data represents a plurality of input 3D face models, and updating parameters of the discriminator neural network based on a comparison of the pre-training set of input facial data and the discriminator neural network output for the pre-training set of input facial data. The training may include jointly training the generator neural network and the discriminator neural network by initialising parameters of the generator neural network and the discriminator neural network using the updated parameters of the discriminator neural network; processing a training set of input facial data using the generator neural network, wherein the training set of input facial data represents a plurality of input 3D face models, to output a set of generator outputs having the same form as the input facial data; processing the training set of input facial data and the set of generator outputs using the discriminator neural network; and updating parameters of the generator neural network based on the set of generator outputs and the discriminator neural network outputs for the set of generator outputs; and updating parameters of the discriminator neural network based on the training set of input facial data, the discriminator neural network outputs for the training set of input facial data, the set of generator outputs, and the discriminator neural network outputs for the set of generator outputs.

Updating parameters of the generator neural network may include updating a plurality of parameters of a decoder portion of the generator neural network without changing a plurality of parameters of an encoder portion of the generator neural network.

Updating parameters of the discriminator neural network may include updating a plurality of parameters of a decoder portion of the discriminator neural network without changing a plurality of parameters of an encoder portion of the discriminator neural network.

Updating parameters of the generator neural network may be based on a comparison of the training set of input facial data and the set of generator outputs.

Each input facial data item of the pre-training set of input facial data and the training set of input facial data may be an image-like tensor representation of a 3D facial mesh.

The image-like tensor representation of a 3D facial mesh may include a UV map representation of a corresponding 3D facial mesh. The components for each of the pixels of the UV map may represent a spatial displacement of a corresponding vertex of the 3D facial mesh with respect to a template 3D facial mesh.

The method may include obtaining the sets of input facial data. Obtaining the sets of input facial data may include morphing a template mesh on to each of one or more raw facial meshes to derive one or more registered facial meshes; unwrapping the template mesh to create a UV representation of the template mesh; aligning and normalising the registered facial meshes; and for each of the one or more registered 3D facial meshes, storing the registered 3D facial mesh as a UV map by storing the spatial displacement of each vertex of the registered 3D facial mesh as components for the pixel corresponding to the respective vertex.

Obtaining the sets of input facial data may include interpolating the UV maps for each of the one or more registered 3D facial meshes. The generator neural network may be an autoencoder.

The discriminator neural network may be an autoencoder.

The method may include extracting bottlenecks for each input facial data item of the training set of input facial data using the encoder portion of the trained generator neural network; calculating the mean bottleneck of the extracted bottlenecks; calculating the covariance of the extracted bottlenecks; sampling a bottleneck based on the mean bottleneck and the covariance of the extracted bottlenecks; and generating a face based on the sampled bottleneck using the decoder portion of the trained generator neural network.

The sampling of the bottleneck may be from a multivariate Gaussian distribution with a mean of the mean bottleneck and a covariance of the covariance of the extracted bottlenecks.

According to another aspect, the use of the decoder portion of a generator neural network of a generative adversarial network, trained according to any of the aforementioned methods, for the generation of a 3D face surface is provided.

According to another aspect, a face generation system including a bottleneck source; a trained generation network; and a network output to facial mesh converter is provided. The trained generation network may be the decoder portion of the generator of a generative adversarial network trained using any of the aforementioned methods. The face generation system may be configured to obtain a bottleneck using the bottleneck source; process the bottleneck using the trained generation network to produce a generation network output; and convert the network output to a 3D facial mesh using the network output to facial mesh converter.

The bottleneck source may be configured to sample bottlenecks based on a mean bottleneck and covariance of a set of bottlenecks for the training set of input facial data extracted using the encoder portion of the generator neural network.

The bottleneck source may be configured to retrieve stored bottlenecks from one or more storage systems. The bottleneck may be a single dimensional vector of a fixed size.

The generation network output may be an image-like tensor representation of a 3D facial mesh.

The network output to facial mesh converter may be configured to convert the network output to a 3D facial mesh by deforming a template 3D facial mesh based on data stored in the generation network output.

According to another aspect, a computer-readable storage medium is provided according to claim 19.

According to another aspect, a data processing apparatus is provided according to claim 20.

Brief Description of the Drawings

Certain embodiments of the present invention will now be described, by way of example, with reference to the following figures.

Figure 1 is a schematic block diagram illustrating an example of a neural network system for pre-training a discriminator network;

Figure 2 is a schematic block diagram illustrating an example of a neural network system for adversarial training of a generator network;

Figure 3 is a flow diagram of an example method for training a generative adversarial network;

Figure 4 is a schematic block diagram illustrating an example of a neural network system for generating faces;

Figure 5 is a flow diagram of an example method for sampling bottlenecks;

Figure 6 is a flow diagram of an example method for reparametising 3D facial meshes; and

Figure 7 illustrates several representations of a 3D facial mesh.

Detailed Description

Example implementations provide system(s) and method(s) for improved representation and generation of 3D facial surfaces. Representation and generation of 3D face surfaces have applications in 3D face reconstruction from 2D texture images, 3D face recognition, 3D face tracking and face relighting and transfer. Therefore, improving the representation and generation of 3D face surfaces may improve the performance of systems directed at these applications. For example, 3D face tracking systems using the system(s) and method(s) described herein may track faces in a three-dimensional space with a greater accuracy than systems and methods in the state-of-the-art.

Improved representation and generation of 3D faces may be achieved by various example implementations by way of a neural network training method and/or a 3D facial mesh reparametisation method.

The neural network training method is a method for training variants of generative adversarial networks (GANs). In some embodiments, both the generator network and the discriminator network may be autoencoders. In this method, the discriminator network may be pre-trained by updating the parameters of the discriminator network based on a comparison of an input with the output of the discriminator for that input. The parameters of the pre-trained discriminator network may be used to initialise the parameters of the generator network prior to adversarial training of the GAN. Pre-training the discriminator may result in components of the trained GAN being usable to more accurately reconstruct 3D facial surfaces from a latent representation than state-of-the-art methodologies. Components of this trained GAN may also be usable to generate more realistic faces than state-of-the-art methodologies, as evaluated qualitatively or quantitatively, e.g. using the generalisation and/or specificity metrics.

During adversarial training of the GAN, the parameters in the encoder parts of the generator network and the discriminator network may be frozen. As part of the adversarial training, the GAN processes training inputs and updates the (unfrozen) parameters of the discriminator network and the generator network. Freezing these parameters may result in components of the trained GAN being usable to more accurately reconstruct 3D facial surfaces from a latent representation than state-of-the-art methodologies. Components of this trained GAN may also be usable to generate more realistic faces than state-of-the-art methodologies, as evaluated qualitatively or quantitatively, e.g. using the generalisation and/or specificity metrics. Freezing these parameters also increases training speed as fewer gradients need to be calculated during training. The loss function used to update the (unfrozen) parameters of the discriminator network of the GAN may be based on both differences between the training inputs and respective discriminator outputs for those inputs; and differences between generator outputs for the training inputs and respective discriminator outputs for those generator outputs.

The loss function used to update the (unfrozen) parameters of the generator network of the GAN may be based on both differences between the generator outputs for the training inputs and the respective discriminator outputs for those generator outputs; and differences between the generator outputs and the respective training inputs, i.e. a reconstruction loss component. Using a loss function with a reconstruction loss component encourages the output of the generator to be as close as possible to the corresponding input.

The 3D facial mesh reparametisation method is a method for representing 3D facial meshes in an image-like tensor format. Representing 3D facial meshes in an image-like tensor format enables them to be inputted into any neural network variant capable of accepting 2D colour images. Some of these neural network variants, such as the trained GAN described above, may be better able to preserve high-frequency details of the 3D facial meshes than the neural network variants currently used for generating 3D mesh representations, which typically utilise spectral convolutions or discrete volumetric representations.

Discriminator Pre-training System

Referring to Figure 1, a neural network system 100 for pre-training a discriminator network is shown.

The discriminator pre-training system 100 may be implemented on one or more suitable computing devices. For example, the one or more computing devices may be any number of or any combination of one or more mainframes, one or more blade servers, one or more desktop computers, one or more tablet computers, or one or more mobile phones. Where the discriminator pre-training system 100 is implemented on a plurality of suitable computing devices, the computing devices may be configured to communicate with each other over a suitable network. Examples of suitable networks include the internet, local area networks, fixed microwave networks, cellular networks and wireless networks. The discriminator pre-training system 100 may be implemented using a suitable machine learning framework such as MXNet, TensorFlow, and/or PyTorch.

The discriminator pre-training system includes a discriminator input 102, a discriminator output 104, a discriminator network 110, and a pre-training loss calculator 120.

The discriminator input 102 may be an image-like tensor representation of a 3D facial mesh. An appropriate 3D facial mesh reparametisation method, such as that outlined in relation to Fig. 6, may be used to derive the image-like tensor representation. The image-like tensor representation may be a representation of the 3D facial mesh as a UV map where the components for each pixel of the UV map represent the 3D offset, i.e. the spatial displacement, of a vertex of the 3D facial mesh from that of a corresponding vertex of a template 3D facial mesh.

The discriminator output 104 has the same form as the discriminator input 102.

The discriminator network 110 includes an encoder 111, an encoding bottleneck layer 112, a bottleneck 113, a decoding bottleneck layer 114 and a decoder 115. The discriminator network 110 may be an autoencoder.

The encoder 111 receives and processes the discriminator input 102 to produce an encoder output. The encoder processes the discriminator input 102 using convolutional layers. These convolutional layers perform 2D convolution operations. An element-wise activation function, such as an exponential linear unit, may also be applied to the outputs of these 2D convolution operations. Downsampling operations, such as 2D average pooling, may also be performed. The applications of the activation function and the downsampling operations may be understood as forming part of a convolutional layer, or may be understood as layers in themselves. The outputs of some of these convolutional layers may have more channels than the inputs to the respective layers, i.e. the convolutional layers may be channel increasing. Because of these channel increasing layers, the encoder output may have more channels than the discriminator input 102 while, because of the downsampling operations, it may have smaller spatial dimensions.

The encoding bottleneck layer 112 receives and processes the encoder output to produce the bottleneck 113. The encoding bottleneck layer may be a fully connected layer using an activation function, such as a tanh or sigmoid function, to process a multi-channel, 2D output from the encoder to produce the bottleneck 113.

The bottleneck 113 may be a single dimensional vector of a fixed size. The size of the bottleneck 113 may be significantly smaller than that of the discriminator input 102, i.e. the number of elements in the bottleneck 113 may be significantly fewer than the number in the discriminator input 102. Therefore, the bottleneck 113 may be understood as the attempt of the encoder 111 and the encoding bottleneck layer 112 to compactly represent information contained in the discriminator input 102. The bottleneck 113 may also be understood as a representation of the important structure of the (high-dimensional) discriminator input 102 in a low dimensional space, i.e. a latent embedding.

The decoding bottleneck layer 114 receives and processes the bottleneck 113 to produce a decoder input. The decoding bottleneck layer may be a fully connected layer using an activation function, such as a tanh or sigmoid function, to process the bottleneck to produce a multi-channel, 2D input for the decoder.

The decoder 115 receives and processes the decoder input to produce the discriminator output 104. The decoder processes the decoder input using convolutional layers. These convolutional layers perform 2D deconvolution operations. An element-wise activation function, such as an exponential linear unit, may also be applied to the outputs of these 2D deconvolution operations. Upsampling operations, such as nearest-neighbour upsampling, may also be performed. Applications of the activation function and upsampling operations may be understood as forming part of a given convolutional layer, or may be understood as layers in themselves. The upsampling operations of the decoder 115 may result in the output of the decoder having the same spatial dimensions as the discriminator input 102. The output layer of the decoder 115 may perform a channel decreasing deconvolution reducing the number of channels in the output to the same number as in the discriminator input 102, e.g. three. Therefore, in some embodiments, the output of the decoder 115 may have the same spatial dimensions and the same number of channels as the discriminator input 102. As the output of the decoder 115 is the discriminator output 104, in these embodiments, the discriminator output 104 has the same format as the discriminator input 102.

The pre-training loss calculator 120 is used to calculate a loss measure based on a batch of one or more discriminator inputs 102 and corresponding discriminator outputs 104. The loss measure may be the L1 loss or the L2 loss. The L1 loss may be calculated as:

L_1(X) = Σ_{x ∈ X} ||x - D(x)||_1

, where X is a batch of one or more discriminator inputs and D(x) is a discriminator output for a discriminator input x. Using the same notation, the L2 loss may be calculated as:

L_2(X) = Σ_{x ∈ X} ||x - D(x)||_2^2

The parameters of the discriminator network 110, i.e. the weights and biases of the neural network layers of the encoder, encoding bottleneck layer, decoder and decoding bottleneck layer, are updated using a gradient based method to minimise the loss measure. These gradients may be calculated by backpropagation.

The pre-training of the discriminator network 110 involves multiple iterations of calculating the loss measure and calculating the corresponding updates to the parameters of the discriminator network. The pre-training of the discriminator network may continue until the loss measure for a number of batches is consistently below a threshold value, a maximum number of iterations is reached, and/or a time limit is reached.
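
As an illustration of the kind of architecture and pre-training loop described above, the following PyTorch sketch builds an autoencoder discriminator (encoder, encoding bottleneck layer, decoding bottleneck layer, decoder) and pre-trains it with an L1 reconstruction loss. The channel counts, bottleneck dimension, optimiser and training schedule are illustrative assumptions rather than values taken from this disclosure.

```python
import torch
import torch.nn as nn

class AutoencoderDiscriminator(nn.Module):
    """Autoencoder-style discriminator: encoder -> bottleneck -> decoder (illustrative sizes)."""
    def __init__(self, in_channels=3, base=32, bottleneck_dim=128, spatial=64):
        super().__init__()
        # Encoder: channel-increasing 2D convolutions with ELU and average pooling.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, base, 3, padding=1), nn.ELU(), nn.AvgPool2d(2),
            nn.Conv2d(base, base * 2, 3, padding=1), nn.ELU(), nn.AvgPool2d(2),
            nn.Conv2d(base * 2, base * 4, 3, padding=1), nn.ELU(), nn.AvgPool2d(2),
        )
        feat = base * 4 * (spatial // 8) * (spatial // 8)
        # Encoding bottleneck layer: fully connected with a tanh activation.
        self.enc_bottleneck = nn.Sequential(nn.Flatten(), nn.Linear(feat, bottleneck_dim), nn.Tanh())
        # Decoding bottleneck layer: fully connected with a tanh activation.
        self.dec_bottleneck = nn.Sequential(nn.Linear(bottleneck_dim, feat), nn.Tanh())
        self._dec_shape = (base * 4, spatial // 8, spatial // 8)
        # Decoder: nearest-neighbour upsampling and deconvolutions, ending in a
        # channel-decreasing layer back to the input channel count.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.ConvTranspose2d(base * 4, base * 2, 3, padding=1), nn.ELU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.ConvTranspose2d(base * 2, base, 3, padding=1), nn.ELU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.ConvTranspose2d(base, in_channels, 3, padding=1),
        )

    def encode(self, x):
        return self.enc_bottleneck(self.encoder(x))

    def decode(self, z):
        h = self.dec_bottleneck(z).view(-1, *self._dec_shape)
        return self.decoder(h)

    def forward(self, x):
        return self.decode(self.encode(x))


def pretrain_discriminator(disc, loader, epochs=10, lr=1e-4, device="cpu"):
    """Pre-train the discriminator as an autoencoder with an L1 reconstruction loss."""
    disc.to(device)
    opt = torch.optim.Adam(disc.parameters(), lr=lr)
    for _ in range(epochs):
        for uv_maps in loader:  # batches of image-like UV-map tensors
            uv_maps = uv_maps.to(device)
            loss = torch.mean(torch.abs(uv_maps - disc(uv_maps)))  # L1 loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return disc
```

The `encode` and `decode` helpers expose the bottleneck explicitly; later sketches in this description reuse them.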

Adversarial Training System

Referring to Figure 2, a neural network system 200 for adversarial training of a generator network is shown.

The adversarial training system 200 may be implemented on one or more suitable computing devices. For example, the one or more computing devices may be any number of or any combination of one or more mainframes, one or more blade servers, one or more desktop computers, one or more tablet computers, or one or more mobile phones. Where the adversarial training system 200 is implemented on a plurality of suitable computing devices, the computing devices may be configured to communicate with each other over a suitable network. Examples of suitable networks include the internet, local area networks, fixed microwave networks, cellular networks and wireless networks. The adversarial training system may be implemented using a suitable machine learning framework such as MXNet, TensorFlow, and/or PyTorch.

The adversarial training system includes a training input 202, a generator network 210, a generator output 204, a discriminator network 110, a discriminator output for the generator output 206, a discriminator output for the training input 208, a generator loss calculator 220, and a discriminator loss calculator 230.

The training input 202 may have the same format as the discriminator input 102 used during pre-training. The training input 202 may be an image-like tensor representation of a 3D facial mesh. An appropriate 3D facial mesh reparametisation method, such as that outlined in relation to Fig. 6, may be used to derive the image-like tensor representation. The image-like tensor representation may be a representation of the 3D facial mesh as a UV map where the components for each pixel of the UV map represent the 3D offset, i.e. spatial displacement, of a vertex of the 3D facial mesh from that of a corresponding vertex of a template 3D facial mesh.

The generator output 204, the discriminator output for the generator output 206, and the discriminator output for the training input 208 have the same form as the training input.

The generator network 210 may be identical or very similar to the discriminator network 110, and the parameters derived during pre-training the discriminator network 110 may be used to initialise the generator network 210.

The generator network 210 includes an encoder 211, an encoding bottleneck layer 212, a bottleneck 213, a decoding bottleneck layer 214 and a decoder 215. The generator network 210 may be an autoencoder.

The encoder 211 receives and processes the training input 202 to produce an encoder output. The encoder processes the training input 202 using convolutional layers. These convolutional layers perform 2D convolution operations. An element-wise activation function, such as an exponential linear unit, may also be applied to the outputs of these 2D convolution operations. Downsampling operations, such as 2D average pooling, may also be performed. The applications of the activation function and the downsampling operations may be understood as forming part of a convolutional layer, or may be understood as layers in themselves. The outputs of some of these convolutional layers may have more channels than the inputs to the respective layers, i.e. the convolutional layers may be channel increasing. Because of these channel increasing layers, the encoder output may have more channels than the training input 202 while, because of the downsampling operations, it may have smaller spatial dimensions.

The encoding bottleneck layer 212 receives and processes the encoder output to produce the bottleneck 213. The encoding bottleneck layer may be a fully connected layer using an activation function, such as a tanh or sigmoid function, to process a multi-channel, 2D output from the encoder to produce the bottleneck 213.

The bottleneck 213 may be a single dimensional vector of a fixed size. The size of the bottleneck 213 may be significantly smaller than that of the training input 202, i.e. the number of elements in the bottleneck 213 may be significantly fewer than the number in the training input 202. Therefore, the bottleneck 213 may be understood as the attempt of the encoder 211 and the encoding bottleneck layer 212 to compactly represent information contained in the training input 202. The bottleneck 213 may also be understood as a representation of the important structure of the (high-dimensional) training input 202 in a low dimensional space, i.e. a latent embedding.

The decoding bottleneck layer 214 receives and processes the bottleneck 213 to produce a decoder input. The decoding bottleneck layer may be a fully connected layer using an activation function, such as a tanh or sigmoid function, to process the bottleneck to produce a multi-channel, 2D input for the decoder.

The decoder 215 receives and processes the decoder input to produce the generator output 204. The decoder processes the decoder input using convolutional layers. These convolutional layers perform 2D deconvolution operations. An element-wise activation function, such as an exponential linear unit, may also be applied to the outputs of these 2D deconvolution operations. Upsampling operations, such as nearest-neighbour upsampling, may also be performed. Applications of the activation function and upsampling operations may be understood as forming part of a given convolutional layer, or may be understood as layers in themselves. The upsampling operations of the decoder 215 may result in the output of the decoder having the same spatial dimensions as the training input 202. The output layer of the decoder 215 may perform a channel decreasing deconvolution reducing the number of channels in the output to the same number as in the training input 202, e.g. three. Therefore, in some embodiments, the output of the decoder 215 may have the same spatial dimensions and the same number of channels as the training input 202. As the output of the decoder 215 is the generator output 204, in these embodiments, the generator output 204 has the same format as the training input 202.

The discriminator network 110 is, or is substantially identical to, the discriminator network 110 of discriminator pre-training system 100 subsequent to pre-training.

The discriminator network 110 is capable of processing the generator output 204 to produce the discriminator output for the generator output 206. The discriminator network 110 is also capable of processing the training input 202 to produce a discriminator output for the training input 208.

The generator loss calculator 220 is used to calculate a generator loss measure based on a batch of one or more training inputs 202, corresponding generator outputs 204 and respective discriminator outputs for these generator outputs 206. The generator loss measure may include an adversarial loss component and a reconstruction loss component.

The adversarial loss component may be:

L_adv(X) = Σ_{x ∈ X} L(G(x))

, where X is a batch of one or more training inputs, where x is a given training input and G(x) is a corresponding generator output, and

L(y) = ||y - D(y)||

, where D(y) is the output of the discriminator for a given input y.

The reconstruction loss component may be:

L_rec(X) = Σ_{x ∈ X} ||x - G(x)||

, using the same notation as above. The reconstruction loss component may encourage the generator output 204 to be closer to the training input 202.

Therefore, the generator loss measure may be:

L_G(X) = L_adv(X) + λ · L_rec(X)

, where λ is a hyper-parameter that controls how much weight should be placed on the reconstruction loss.

The parameters of the generator network 210, i.e. the weights and biases of the neural network layers of the encoder, encoding bottleneck layer, decoder and decoding bottleneck layer, are updated using a gradient based method to minimise the generator loss measure. These gradients may be calculated by backpropagation.

The training of the generator network 210 involves multiple iterations of calculating a generator loss measure based on a batch of training inputs and corresponding updates to the parameters of the generator network. In some embodiments, only the parameters of the decoding bottleneck layer 214 and decoder 215 may be updated during training. The training of the generator network may continue until the generator loss measure for a number of batches is consistently below a threshold value, a maximum number of iterations is reached, and/or a time limit is reached. The training of the generator network 210 may be concurrent with the training of the discriminator network 110. Alternatively, training may alternate between the generator network 210 and the discriminator network 110.

The discriminator loss calculator 230 is used to calculate a discriminator loss measure based on a batch of one or more training inputs 202, corresponding generator outputs 204, discriminator outputs for the generator outputs 206, and discriminator outputs for the training inputs 208.

Using the same notation as above, the discriminator loss measure may be:

L_D(X) = Σ_{x ∈ X} ( L(x) - k_t · L(G(x)) )

k_{t+1} = k_t + λ_k · ( L(x) - γ · L(G(x)) )

, where γ ∈ [0, 1] is the hyper-parameter which controls the equilibrium E[L(x)] = γ · E[L(G(x))], and k_t is the parameter which controls how much emphasis should be placed on L(G(x)). k_t is incrementally updated over each discriminator loss calculator iteration, t, according to the formula above, where λ_k is the learning rate of k. The parameters of the discriminator network 110, i.e. the weights and biases of the neural network layers of the encoder, encoding bottleneck layer, decoder and decoding bottleneck layer, are updated using a gradient based method to minimise the loss measure. These gradients may be calculated by backpropagation.

The training of the discriminator network 110 involves multiple iterations of calculating a discriminator loss measure based on a batch of training inputs and corresponding updates to the parameters of the discriminator network. In some embodiments, only the parameters of the decoding bottleneck layer 114 and decoder 115 may be updated during training. The training of the discriminator network may continue until the discriminator loss measure for a number of batches is consistently below a threshold value, a maximum number of iterations is reached, and/or a time limit is reached. The training of the discriminator network 110 may be concurrent with the training of the generator network 210. Alternatively, training may alternate between the discriminator network 110 and the generator network 210.
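
The following sketch shows one possible joint update step along the lines described above, with an L1 norm used for both the adversarial and reconstruction terms and with only decoder-side parameters passed to the optimisers. The λ, γ and λ_k values, the clamping of k to [0, 1], and the module names (which follow the earlier autoencoder sketch) are assumptions rather than values taken from this disclosure.

```python
import itertools
import torch

def decoder_side_parameters(net):
    """Parameters of the decoding bottleneck layer and decoder only; the
    encoder-side weights keep their pre-trained values."""
    return itertools.chain(net.dec_bottleneck.parameters(), net.decoder.parameters())

def l1_recon(y, net):
    """L(y) = ||y - net(y)||_1, averaged over a batch y."""
    return torch.mean(torch.abs(y - net(y)))

def adversarial_step(gen, disc, x, opt_g, opt_d, k, lam=1.0, gamma=0.5, lambda_k=1e-3):
    """One joint update on a batch x of UV-map tensors; returns the updated k."""
    # Generator update: the adversarial term asks the discriminator to reconstruct
    # G(x) well, the reconstruction term keeps G(x) close to the training input x.
    g_out = gen(x)
    loss_g = l1_recon(g_out, disc) + lam * torch.mean(torch.abs(x - g_out))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    # Discriminator update: reconstruct real inputs well and generated inputs
    # poorly, with k weighting the emphasis placed on the generated inputs.
    loss_real = l1_recon(x, disc)
    loss_fake = l1_recon(gen(x).detach(), disc)
    loss_d = loss_real - k * loss_fake
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Incremental update of k towards the chosen equilibrium, clamped to [0, 1].
    k = k + lambda_k * (loss_real.item() - gamma * loss_fake.item())
    return float(min(max(k, 0.0), 1.0))

# Usage (illustrative): optimisers over decoder-side parameters only.
# opt_g = torch.optim.Adam(decoder_side_parameters(gen), lr=1e-4)
# opt_d = torch.optim.Adam(decoder_side_parameters(disc), lr=1e-4)
# k = 0.0
# for batch in loader:
#     k = adversarial_step(gen, disc, batch, opt_g, opt_d, k)
```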

Generative Adversarial Network Training Method

Figure 3 is a flow diagram illustrating an example method 300 for training a generative adversarial network, where the generative adversarial network includes a generator neural network and a discriminator neural network. The method may be performed by executing computer-readable instructions using one or more processors of one or more computing devices, e.g. the one or more computing devices implementing the adversarial training system 200. The generative adversarial network may, for example, be the generative adversarial network of adversarial training system 200.

In submethod 310, pre-training of the discriminator network of the generative adversarial network occurs. The pre-training submethod 310 includes an input processing step 312 and a parameter update step 314.

In the input processing step 312, a set of input facial data representing a plurality of input 3D face models is processed by the discriminator neural network. Each of the input facial data items may be an image-like tensor representation of a 3D facial mesh. An appropriate 3D facial mesh reparametisation method, such as that outlined in relation to Fig. 6, may be used to derive the image-like tensor representation. The image-like tensor representation may be a representation of the 3D facial mesh as a UV map where the components for each pixel of the UV map represent the 3D offset, i.e. spatial displacement, of a vertex of the 3D facial mesh from that of a corresponding vertex of a template 3D facial mesh.

The processing by the discriminator neural network may include processing each of the input facial data items using one or more neural network layers, e.g. the neural network layers of discriminator neural network 110. The processing of each input facial data item produces a discriminator neural network output having the same form as the input facial data item.

In the parameter update step 314, the parameters of the discriminator neural network are updated based on the set of input facial data and the corresponding outputs of the discriminator neural network. The updated parameters of the discriminator network may include the weights and biases of the neural network layers of the discriminator network. The discriminator network may have both encoder and decoder portions and the weights and biases of the neural networks in both of these portions may be updated. The parameters of the discriminator neural network are updated to minimise a loss measure using a gradient based method. These gradients may be calculated by backpropagation.

The loss measure may be the L1 loss or the L2 loss. The L1 loss may be calculated as:

L_1(X) = Σ_{x ∈ X} ||x - D(x)||_1

, where X is a set of input facial data and D(x) is a discriminator output for an input facial data item x. Using the same notation, the L2 loss may be calculated as:

L_2(X) = Σ_{x ∈ X} ||x - D(x)||_2^2

The pre-training submethod 310 involves multiple iterations of the input processing step 312 and parameter update step 314. These iterations may occur on different sets of input facial data. Iterations of the pre-training submethod may continue until the loss measure for a number of batches is consistently below a threshold value, a maximum number of iterations is reached, and/or a time limit is reached.

In submethod 320, adversarial training of the generative adversarial network occurs. The adversarial training submethod 320 includes a parameter initialisation step 321, a generator processing step 322, a discriminator processing step 323, a generator parameter update step 324 and a discriminator parameter update step 325.

In the parameter initialisation step 321, the parameters of the generator network are initialised using the updated parameters of the pre-trained discriminator network. The discriminator network of the generative adversarial network may be the pre-trained discriminator network. In some embodiments, however, the discriminator network of the generative adversarial network may be different, e.g. where pre-training and adversarial training occur at substantially different times or are performed using different machine learning frameworks. In these embodiments, the discriminator network of the generative adversarial network is also initialised using the updated parameters of the pre-trained discriminator network.

In the generator processing step 322, a set of input facial data representing one or more input 3D face models is processed by the generator neural network. This set of input facial data has the same form as that processed during the discriminator pre-training. This set of input facial data may also be the same as, different to, or overlap with data used for pre-training the discriminator network.

The processing by the generator neural network may include processing each of the input facial data items using one or more neural network layers, e.g. the neural network layers of generator neural network 210. The processing of each input facial data item produces a generator neural network output having the same form as an input facial data item.

In the discriminator processing step 323, the set of input facial data and the corresponding outputs of the generator neural network are processed by the discriminator neural network.

The processing by the discriminator neural network may include processing each of the input facial data items and each of the corresponding outputs of the generator neural network using one or more neural network layers, e.g. the neural network layers of discriminator neural network 110. The processing of each of these items produces a discriminator neural network output having the same form as an input facial data item and, therefore, the same form as a generator output. In the generator parameter update step 324, the parameters of the generator neural network are updated based on the outputs of the generator neural network, the outputs of the discriminator for those generator outputs and, optionally, the set of input facial data. The updated parameters of the generator network may include the weights and biases of the neural network layers of the generator network. The generator network may have both encoder and decoder portions and the weights and biases of the neural networks in the decoder portion may be updated. The parameters of the generator neural network are updated to minimise a generator loss measure using a gradient based method. These gradients may be calculated by backpropagation.
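
Where only the decoder portion is to be updated, one simple way to keep the encoder-side weights fixed is to disable their gradients before constructing the optimiser, as in this brief PyTorch sketch; the submodule names follow the earlier autoencoder sketch and are assumptions.

```python
import torch

def freeze_encoder_side(net, lr=1e-4):
    """Freeze encoder-side weights and return an optimiser over the remaining
    (decoding bottleneck layer and decoder) parameters.

    `net` is assumed to expose `encoder` and `enc_bottleneck` submodules, as in
    the earlier autoencoder sketch.
    """
    for module in (net.encoder, net.enc_bottleneck):
        for p in module.parameters():
            p.requires_grad_(False)
    return torch.optim.Adam((p for p in net.parameters() if p.requires_grad), lr=lr)
```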

The generator loss measure may include an adversarial loss component and a reconstruction loss component. The adversarial loss component may be:

L_adv(X) = Σ_{x ∈ X} L(G(x))

, where X is a set of input facial data, where x is a given input facial data item, and G(x) is a corresponding generator output, and

L(y) = ||y - D(y)||

, where D(y) is the output of the discriminator for a given input y.

The reconstruction loss component may be:

L_rec(X) = Σ_{x ∈ X} ||x - G(x)||

, using the same notation as above. The reconstruction loss component may encourage the generator output for an input facial data item to be closer to the input facial data item. Therefore, the generator loss measure may be:

L_G(X) = L_adv(X) + λ · L_rec(X)

, where λ is a hyper-parameter that controls how much weight should be placed on the reconstruction loss.

In the discriminator parameter update step 325, the parameters of the discriminator neural network are updated based on the set of input facial data, the corresponding outputs of the generator neural network, the outputs of the discriminator for those generator outputs and the outputs of the discriminator for the input facial data. The updated parameters of the discriminator network may include the weights and biases of the neural network layers of the discriminator network. The discriminator network may have both encoder and decoder portions and the weights and biases of the neural networks in the decoder portion may be updated. The parameters of the discriminator neural network are updated to minimise a discriminator loss measure using a gradient based method. These gradients may be calculated by backpropagation.

Using the same notation as above, the discriminator loss measure may be:

L_D(X) = Σ_{x ∈ X} ( L(x) - k_t · L(G(x)) )

k_{t+1} = k_t + λ_k · ( L(x) - γ · L(G(x)) )

, where γ ∈ [0, 1] is the hyper-parameter which controls the equilibrium E[L(x)] = γ · E[L(G(x))], and k_t is the parameter which controls how much emphasis should be placed on L(G(x)). k_t is incrementally updated over each iteration, t, according to the formula above, where λ_k is the learning rate of k.

The generator parameter update step 324 and the discriminator parameter update step 325 may be sequential or concurrent.

The adversarial training submethod 320 involves multiple iterations of the steps 322-325. These iterations may occur on different sets of input facial data. Iterations of the adversarial training submethod may continue until the loss measure for a number of batches is consistently below a threshold value, a maximum number of iterations is reached, and/or a time limit is reached.
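
Putting the two submethods together, a compact sketch of the overall training flow might look as follows, reusing `AutoencoderDiscriminator`, `pretrain_discriminator`, `decoder_side_parameters` and `adversarial_step` from the earlier sketches; the epoch counts and learning rates are illustrative assumptions.

```python
import copy
import torch

def train_gan(loader, epochs_pre=10, epochs_adv=50, device="cpu"):
    """Two-phase training sketch: submethod 310 followed by submethod 320."""
    disc = AutoencoderDiscriminator().to(device)

    # Submethod 310: pre-train the discriminator as an autoencoder (steps 312, 314).
    pretrain_discriminator(disc, loader, epochs=epochs_pre, device=device)

    # Step 321: initialise the generator from the pre-trained discriminator parameters.
    gen = copy.deepcopy(disc)

    # Steps 322-325: joint adversarial training, updating decoder-side parameters only.
    opt_g = torch.optim.Adam(decoder_side_parameters(gen), lr=1e-4)
    opt_d = torch.optim.Adam(decoder_side_parameters(disc), lr=1e-4)
    k = 0.0
    for _ in range(epochs_adv):
        for batch in loader:
            k = adversarial_step(gen, disc, batch.to(device), opt_g, opt_d, k)
    return gen, disc
```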

Face Generation System

Referring to Figure 4, a system 400 for generating 3D face models is shown.

The face generation system 400 may be implemented on one or more suitable computing devices. For example, the one or more computing devices may be any number of or any combination of one or more mainframes, one or more blade servers, one or more desktop computers, one or more tablet computers, or one or more mobile phones. Where the face generation system 400 is implemented on a plurality of suitable computing devices, the computing devices may be configured to communicate with each other over a suitable network. Examples of suitable networks include the internet, local area networks, fixed microwave networks, cellular networks and wireless networks.

The face generation system includes a generation network output 402, a generated facial mesh 404, a trained generation network 410, a bottleneck source 420 and a network output to facial mesh converter 430.

The generation network output 402 may be an image-like tensor representation of a 3D facial mesh. The image-like tensor representation may be a representation of the 3D facial mesh as a UV map where the components for each pixel of the UV map represent the 3D offset, i.e. spatial displacement, of a vertex of the 3D facial mesh from that of a corresponding vertex of a template 3D facial mesh.

The generated facial mesh 404 may be in any format capable of representing 3D geometry. These formats include custom in-memory object formats and 3D geometry file formats such as COLLADA or OBJ. The generated facial mesh 404 may be displayed and/or manipulated subsequent to generation and/or may be saved to persistent storage.

The trained generation network 410 is a trained neural network including a decoding bottleneck layer 214 and a decoder 215. The decoding bottleneck layer 214 and decoder 215 may be, or may be initialised using parameters from, a decoding bottleneck layer and decoder of a generator network, e.g. generator network 210, trained according to a generative adversarial network training method, e.g. generative adversarial network training method 300. The trained generation network 410 may be implemented using a suitable machine learning framework such as MXNet, TensorFlow, and/or PyTorch. The trained generation network receives a bottleneck, e.g. a single dimensional vector of a fixed size, from the bottleneck source 420 and processes it to produce a generation network output 402, e.g. an image-like tensor representation of a 3D facial mesh.

The bottleneck source 420 provides bottlenecks to be processed by the trained generation network 410. The bottleneck source 420 may randomly generate plausible bottlenecks, e.g. according to bottleneck sampling method 500 of Fig. 5, to cause the face generation system 400 to generate random plausible facial meshes. The bottleneck source 420 may alternatively or additionally retrieve bottlenecks created from real facial data, e.g. input facial data encoded using the encoder 211 and the encoding bottleneck layer 212 of the generator network 210, to cause the face generation system to reconstruct the input facial data. The network output to facial mesh converter 430 processes the generation network output 402 to produce the generated facial mesh 404. Where the generation network output 402 is a representation of a 3D facial mesh in an image-like tensor format, the network output to facial mesh converter 430 may use data in the image-like tensor format to deform a template 3D facial mesh to produce the generated facial mesh 404. In particular, where the generation network output 402 is a UV map, the components for each pixel of the UV map may be used as a 3D offset to deform a vertex of a template 3D facial mesh corresponding to that pixel.
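
A condensed sketch of the face generation pipeline described above is given below. It assumes a trained decoder of the kind sketched earlier (mapping a bottleneck to a UV offset map), a template mesh supplied as an (N, 3) vertex array, and a precomputed per-vertex UV pixel index; all of these names are illustrative assumptions.

```python
import numpy as np
import torch

def generate_face(decode, bottleneck, template_vertices, vertex_to_pixel):
    """Decode a bottleneck to a UV offset map and deform the template mesh with it.

    decode:             callable mapping a (1, bottleneck_dim) tensor to a (1, 3, H, W) UV map
    bottleneck:         (1, bottleneck_dim) torch tensor from the bottleneck source
    template_vertices:  (N, 3) numpy array of template-mesh vertex coordinates
    vertex_to_pixel:    (N, 2) integer numpy array giving the UV pixel of each vertex
    """
    with torch.no_grad():
        uv_map = decode(bottleneck)[0].cpu().numpy()   # (3, H, W) offset map
    rows, cols = vertex_to_pixel[:, 0], vertex_to_pixel[:, 1]
    offsets = uv_map[:, rows, cols].T                  # (N, 3) per-vertex 3D offsets
    return template_vertices + offsets                 # deformed (generated) vertices
```

With the earlier autoencoder sketch, `gen.decode` could serve as `decode`, and the resulting vertex array can then be written out in a mesh format such as OBJ.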

Bottleneck Sampling Method

Figure 5 is a flow diagram illustrating an example method 500 for sampling bottlenecks for use in a face generation system. The method may be performed by executing computer-readable instructions using one or more processors of one or more computing devices, e.g. the one or more computing devices implementing the face generation system 400. The face generation system may, for example, be face generation system 400.

In step 510, the bottlenecks of a generator network of a trained generative adversarial network, e.g. generator network 210 subsequent to training, for a number of training inputs are extracted. These bottlenecks are extracted by processing each of the training inputs using at least an encoder and encoding bottleneck layer of the generator network to obtain a respective bottleneck, then reading, and, at least temporarily, storing each of these bottlenecks.

In step 520, the mean bottleneck of the extracted bottlenecks is determined. The mean bottleneck may be determined by creating a matrix, Z, which is a column-wise concatenation of the extracted bottlenecks, and then using a suitable matrix computation to determine the mean bottleneck, m_z. For example, the matrix computation functions of a suitable machine learning framework, e.g. MXNet, TensorFlow, and/or PyTorch, or of a numeric computation library, e.g. numpy, may be used.

In step 530, the covariance of the extracted bottlenecks is determined. The covariance may be represented as a zero-mean covariance matrix and determined by using a suitable matrix computation to find the zero-mean covariance matrix, Σ_z, of the matrix, Z, as defined above. For example, the matrix computation functions of a suitable machine learning framework, e.g. MXNet, TensorFlow, and/or PyTorch, or of a numeric computation library, e.g. numpy, may be used.

In step 540, a bottleneck is sampled based on the mean bottleneck and the covariance of the extracted bottlenecks. The bottleneck may be sampled by obtaining a (pseudo-)random sample from a multivariate Gaussian distribution, N(m_z, Σ_z), i.e. a multivariate Gaussian distribution having the mean m_z and the covariance Σ_z. A pseudo-random sample from the multivariate Gaussian distribution may be obtained using the pseudo-random sampling functionality of a machine learning framework, e.g. MXNet, TensorFlow, and/or PyTorch, or a numeric computation library, e.g. numpy.
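
A short numpy sketch of steps 510 to 540 is given below, assuming `encode` maps a batch of UV-map tensors to their bottlenecks (as in the earlier autoencoder sketch); the names are illustrative.

```python
import numpy as np
import torch

def sample_bottlenecks(encode, training_inputs, n_samples=1):
    """Fit a Gaussian to the extracted bottlenecks and sample plausible new ones.

    encode:          callable mapping a (B, C, H, W) tensor to (B, bottleneck_dim) bottlenecks
    training_inputs: tensor of training UV maps
    """
    with torch.no_grad():
        z = encode(training_inputs).cpu().numpy()    # extracted bottlenecks (step 510)
    m_z = z.mean(axis=0)                             # mean bottleneck (step 520)
    sigma_z = np.cov(z, rowvar=False)                # covariance of the bottlenecks (step 530)
    return np.random.multivariate_normal(m_z, sigma_z, size=n_samples)  # step 540
```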

3D Facial Mesh Reparametisation Method

Figure 6 is a flow diagram illustrating an example method 600 for reparametising 3D facial meshes into an image-like tensor format. The method may be performed by executing computer-readable instructions using one or more processors of one or more computing devices, e.g. the one or more computing devices implementing the discriminator pre-training system 100 and/or the adversarial training system 200.

In step 610, a template mesh is morphed onto one or more raw facial meshes to derive one or more registered facial meshes. The template mesh may be a facial mesh for a mean face, i.e. an average face as determined according to previous experiments or a theoretical model. For example, the template mesh may be the mean face of the Large Scale Facial Model. The one or more raw facial meshes may be, or have previously been, obtained using a face capturing apparatus, e.g. a structured light stereo system. The template mesh may be morphed onto each of the one or more raw facial meshes using a non-rigid iterative closest point algorithm such that it captures the surface of the respective raw facial mesh to produce a corresponding registered facial mesh, i.e. a representation of the respective raw facial mesh having a fixed topology and number of vertices. By deriving a registered facial mesh for each of a plurality of raw facial meshes, each of the one or more raw facial meshes may be represented using the same topology and number of vertices.

In step 620, the template mesh is unwrapped to create a UV representation of the template mesh. The UV representation of the template mesh is a representation of the template mesh in two dimensions in an image-like format, where each vertex of the template mesh corresponds to a pixel of the UV representation.

In step 630, the registered facial meshes are aligned and normalised. The registered facial meshes may be aligned in 3D space by performing Generalised Procrustes Analysis. The registered facial meshes may be normalised such that all vertices are within the range [-1, 1].

In step 640, the vertices of each of the one or more registered 3D facial meshes are stored as a UV map. A registered 3D facial mesh may be stored as a UV map by storing the coordinates of each of its vertices, i.e. the 3D offset of that vertex from that of the template mesh, as components for the pixel corresponding to the respective vertex. Each of these coordinates has three dimensions so may be stored using three components. Each pixel of a conventional colour image also has three components (for each of red, green and blue). The UV map for a registered facial mesh may be stored in a raw format, e.g. as a Python pickle file, and/or as a colour image, e.g. a PNG image. The UV map may be processed by systems and/or methods that accept colour images as input, e.g. 2D convolutional neural networks.

In step 650, each of the UV maps is interpolated. Interpolation fills out the missing areas of each UV map to produce a dense representation of that UV map. The interpolation may be 2D nearest point interpolation and/or barycentric interpolation in the UV domain. The interpolation may facilitate more effective processing of the UV map, for example by 2D convolutional neural networks.
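
An abbreviated numpy/scipy sketch of steps 640 and 650 follows. It assumes the registered mesh and the template mesh share the same vertex ordering, that a UV coordinate in [0, 1] is available for every vertex, and that scipy's nearest-point interpolation is an acceptable stand-in for the interpolation step; the resolution and names are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import griddata

def mesh_to_uv_map(registered_vertices, template_vertices, vertex_uv, size=256):
    """Store per-vertex 3D offsets as a UV map (step 640) and densify it (step 650).

    registered_vertices: (N, 3) aligned, normalised registered-mesh vertices
    template_vertices:   (N, 3) template-mesh vertices (same topology and ordering)
    vertex_uv:           (N, 2) UV coordinates in [0, 1] for each vertex
    """
    offsets = registered_vertices - template_vertices              # spatial displacements
    pixels = np.clip((vertex_uv * (size - 1)).round().astype(int), 0, size - 1)

    sparse = np.zeros((size, size, 3), dtype=np.float32)
    sparse[pixels[:, 0], pixels[:, 1]] = offsets                   # sparse UV map (step 640)

    # 2D nearest-point interpolation to fill the unoccupied pixels (step 650).
    grid_r, grid_c = np.mgrid[0:size, 0:size]
    dense = np.stack(
        [griddata(pixels, offsets[:, c], (grid_r, grid_c), method="nearest")
         for c in range(3)],
        axis=-1)
    return sparse, dense
```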

3D Facial Mesh Representations

Referring to Figure 7, a number of 3D facial mesh representations 700 are shown.

A raw facial mesh 710 is shown. The raw facial mesh indicates the facial mesh as obtained by a face capturing apparatus, e.g. a structured light stereo system.

A registered facial mesh 720 is shown. The registered facial mesh indicates the appearance of a template facial mesh, e.g. the mean face of the Large Scale Facial Model, morphed onto the raw facial mesh 710. A UV map 730 representing an unwrapped facial mesh is shown. Pixels of the UV map 730 correspond to vertices of the registered facial mesh 720, and indicate their offset from the template facial mesh.

An interpolated UV map 740 is shown. The interpolated UV map 740 is the 2D nearest neighbour interpolation of the UV map 730.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the invention, the scope of which is defined in the appended claims. Various components of different embodiments may be combined where the principles underlying the embodiments are compatible.