

Title:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR GENERATING THREE-DIMENSIONAL MODELS OF A SUBJECT
Document Type and Number:
WIPO Patent Application WO/2022/189693
Kind Code:
A1
Abstract:
The embodiments relate to an apparatus, a method and a computer program product for generating 3D models. The apparatus comprises at least means for receiving an input image depicting a subject, wherein the input image comprises a standard representation (401) of the subject; means (402) for processing the input image to produce a warping parameter representation; means for downsampling the warping parameter representation to result in one or more downsampled warping parameter maps (403); means for combining the downsampled warping parameter maps with outputs of encoder layers to produce warped outputs; and means for creating a UV map representation (410) of the subject based on at least the warped outputs and features of decoder layers, wherein the UV map representation includes, at least in part, a UV map-visual representation of at least one visual characteristic of the at least one subject.

Inventors:
FASOGBON PETER OLUWANISOLA (FI)
CRICRI FRANCESCO (FI)
ZHANG HONGLEI (FI)
REZAZADEGAN TAVAKOLI HAMED (FI)
AKSU EMRE BARIS (FI)
Application Number:
PCT/FI2022/050059
Publication Date:
September 15, 2022
Filing Date:
February 01, 2022
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
G06T15/04; G06N3/02; G06N20/00; G06T3/40; G06T7/55; G06T17/00
Domestic Patent References:
WO2020096368A12020-05-14
WO2021008444A12021-01-21
Other References:
NA, I. S. ET AL.: "Facial UV Map Completion for Pose-invariant Face Recognition: A Novel Adversarial Approach based on Coupled Attention Residual UNets", HUMAN-CENTRIC COMPUTING AND INFORMATION SCIENCES, SPRINGER OPEN, vol. 10, no. 45, 10 November 2020 (2020-11-10), pages 1-17, XP021283999, [retrieved on 20220323], DOI: 10.1186/s13673-020-00250-w
GRIGOREV ARTUR; SEVASTOPOLSKY ARTEM; VAKHITOV ALEXANDER; LEMPITSKY VICTOR: "Coordinate-Based Texture Inpainting for Pose-Guided Human Image Generation", 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 15 June 2019 (2019-06-15), pages 12127 - 12136, XP033687009, DOI: 10.1109/CVPR.2019.01241
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (FI)
Claims:

1. An apparatus comprising at least

- means for receiving an input image depicting a subject, wherein the input image comprises a standard representation of the subject;

- means for processing the input image to produce a warping parameter representation;

- means for downsampling the warping parameter representation to result in one or more downsampled warping parameter maps;

- means for combining the downsampled warping parameter maps with outputs of encoder layers to produce warped outputs; and

- means for creating a UV map representation of the subject based on at least the warped outputs and features of decoder layers, wherein the UV map representation includes, at least in part, a UV map-visual representation of at least one visual characteristic of the at least one subject.

2. The apparatus according to claim 1, wherein said at least one visual characteristic includes, at least in part, a texture of the subject.

3. The apparatus according to claim 1 or 2, wherein said at least one visual characteristic includes a geometry of the subject.

4. The apparatus according to claim 1, wherein the means for processing comprises a neural network comprising encoder layers and decoder layers and a warp skip connection network.

5. The apparatus according to claim 4, wherein the warp skip connection is configured to pass a warped transformation of the standard representation from an encoder layer to a corresponding decoder layer having the UV representation.

6. The apparatus according to claim 4 or 5, wherein the warping parameter representation is downsampled a predefined number of times, wherein the predefined number corresponds to a number of encoder-decoder layers of the neural network.

7. The apparatus according to claim 4, 5 or 6, wherein the input image is processed at the warp skip connection network to produce the warping parameter representation.

8. The apparatus according to any of the claims 1 to 7, further comprising means for reshaping an image into the standard representation.

9. The apparatus according to any of the claims 4 to 8, wherein the warp skip connection network comprises two neural networks.

10. The apparatus according to claim 9, wherein said two neural networks both comprise a warping map generator.

11. The apparatus according to claim 9 or 10, further comprising a skip connection between encoder layers of said two neural networks.

12. The apparatus according to claim 10 or 11, further comprising means for training the warping map generator and said two neural networks separately.

13. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- to receive an input image depicting a subject, wherein the input image comprises a standard representation of the subject;

- to process the input image at the warp skip connection network to produce a warping parameter representation;

- to downsample the warping parameter representation according to a number of pairs of encoder-decoder layers in the neural network;

- to combine the downsampled warping parameter maps with outputs of corresponding encoder layers to produce warped outputs; and

- to create a UV map representation of the subject based on at least the warped outputs with corresponding features at decoder layers, wherein the UV map representation includes, at least in part, a UV map-visual representation of at least one visual characteristic of the at least one subject.

14. A method, comprising:

- receiving an input image depicting a subject, wherein the input image comprises a standard representation of the subject;

- processing the input image at the warp skip connection network to produce a warping parameter representation;

- downsampling the warping parameter representation according to a number of pairs of encoder-decoder layers in the neural network;

- combining the downsampled warping parameter maps with outputs of corresponding encoder layers to produce warped outputs; and

- creating a UV map representation of the subject based on at least the warped outputs with corresponding features at decoder layers, wherein the UV map representation includes, at least in part, a UV map-visual representation of at least one visual characteristic of the at least one subject.

15. A computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:

- receive an input image depicting a subject, wherein the input image comprises a standard representation of the subject;

- process the input image at the warp skip connection network to produce a warping parameter representation;

- downsample the warping parameter representation according to a number of pairs of encoder-decoder layers in the neural network;

- combine the downsampled warping parameter maps with outputs of corresponding encoder layers to produce warped outputs; and

- create a UV map representation of the subject based on at least the warped outputs with corresponding features at decoder layers, wherein the UV map representation includes, at least in part, a UV map-visual representation of at least one visual characteristic of the at least one subject.

Description:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR GENERATING THREE-DIMENSIONAL MODELS OF A SUBJECT

Technical Field

The present solution generally relates to generation of three-dimensional models.

Background

Computer-generated fully textured three-dimensional (3D) human or object models can be used for various applications including, but not limited to, virtual reality, video editing, virtual clothes try-on, 3D animations etc. However, processes for generating textured 3D models can require considerable effort and resources to produce detailed and realistic models.

Summary

An improved method and technical equipment implementing the method have been invented to provide 3D textures and/or other representations of visual characteristics of a human or an object using a UV representation (or another equivalent two-dimensional (2D) representation of the 3D surface(s) of a model of the human or object). Various aspects include a method, an apparatus and a non-transitory computer readable medium comprising a computer program, which are characterized by what is stated in the independent claims. Various details of the example embodiments are disclosed in the dependent claims and in the corresponding images and description.

The scope of protection sought for various example embodiments of the invention is set out by the independent claims. The example embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various example embodiments of the invention.

According to a first aspect, there is provided an apparatus comprising at least means for receiving an input image depicting a subject, wherein the input image comprises a standard representation of the subject; means for processing the input image to produce a warping parameter representation; means for downsampling the warping parameter representation to result in one or more downsampled warping parameter maps; means for combining the downsampled warping parameter maps with outputs of encoder layers to produce warped outputs; and means for creating a UV map representation of the subject based on at least the warped outputs and features of decoder layers, wherein the UV map representation includes, at least in part, a UV map-visual representation of at least one visual characteristic of the at least one subject.

According to a second aspect, there is provided an apparatus comprising at least one processor and memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: to receive an input image depicting a subject, wherein the input image comprises a standard representation of the subject; to process the input image at the warp skip connection network to produce a warping parameter representation; to downsample the warping parameter representation according to a number of pairs of encoder-decoder layers in the neural network; to combine the downsampled warping parameter maps with outputs of corresponding encoder layers to produce warped outputs; and to create a UV map representation of the subject based on at least the warped outputs with corresponding features at decoder layers, wherein the UV map representation includes, at least in part, a UV map-visual representation of at least one visual characteristic of the at least one subject.

According to a third aspect, there is provided a method comprising receiving an input image depicting a subject, wherein the input image comprises a standard representation of the subject; processing the input image at the warp skip connection network to produce a warping parameter representation; downsampling the warping parameter representation according to a number of pairs of encoder-decoder layers in the neural network; combining the downsampled warping parameter maps with outputs of corresponding encoder layers to produce warped outputs; and creating a UV map representation of the subject based on at least the warped outputs with corresponding features at decoder layers, wherein the UV map representation includes, at least in part, a UV map-visual representation of at least one visual characteristic of the at least one subject.

According to a fourth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive an input image depicting a subject, wherein the input image comprises a standard representation of the subject; to process the input image at the warp skip connection network to produce a warping parameter representation; to downsample the warping parameter representation according to a number of pairs of encoder-decoder layers in the neural network; to combine the downsampled warping parameter maps with outputs of corresponding encoder layers to produce warped outputs; and to create a UV map representation of the subject based on at least the warped outputs with corresponding features at decoder layers, wherein the UV map representation includes, at least in part, a UV map-visual representation of at least one visual characteristic of the at least one subject.

According to an embodiment, said at least one visual characteristic includes, at least in part, a texture of the subject.

According to an embodiment, said at least one visual characteristic includes a geometry of the subject.

According to an embodiment, the means for processing comprises a neural network comprising encoder layers and decoder layers and a warp skip connection network.

According to an embodiment, the warp skip connection is configured to pass a warped transformation of the standard representation from an encoder layer to a corresponding decoder layer having the UV representation.

According to an embodiment, the warping parameter representation is downsampled a predefined number of times, wherein the predefined number corresponds to a number of encoder-decoder layers of the neural network.

According to an embodiment, the input image is processed at the warp skip connection network to produce the warping parameter representation.

According to an embodiment, an image is reshaped into the standard representation.

According to an embodiment, the warp skip connection network comprises two neural networks.

According to an embodiment, said two neural networks both comprise a warping map generator.

According to an embodiment, there is a skip connection between encoder layers of said two neural networks.

According to an embodiment, the warping map generator and the two neural networks are trained separately.

Description of the Drawings

In the following, various embodiments will be described in more detail with reference to the appended drawings, in which

Fig. 1 shows an example of two image representations used for 3D model texture generation;

Fig. 2 shows an inappropriate example of a skip connection between two different image representations;

Fig. 3 shows a warped skip connection between two different image representations according to an example embodiment;

Fig. 4 shows a skip connection branch operation on an encoder-decoder neural network according to an example embodiment;

Fig. 5 shows a warp skip connection branch operation on an encoder-decoder neural network according to an example embodiment;

Fig. 6 shows a warping map generator network according to an example embodiment;

Fig. 7 shows a warping operation using an input image according to an example embodiment;

Fig. 8 shows a bi-modal approach incorporating the warp skip connection according to an example embodiment;

Fig. 9 shows a system according to an embodiment;

Fig. 10 shows a computer system according to an embodiment; and

Fig. 11 is a flowchart illustrating a method according to an embodiment.

In the following, several embodiments will be described in the context of virtual reality. It is to be noted, however, that the invention is not limited to use with virtual reality solutions only. In fact, the different embodiments have applications in any environment where generation of three-dimensional models is required.

The example embodiments are discussed in the following description and with reference to the drawings, which both are illustrative and are not to be construed as unnecessarily limiting. The specific details are provided for a thorough understanding of the disclosure. However, in certain instances, well-known or conventional details are not described in order to avoid obscuring the description. References to one embodiment or an embodiment in the present disclosure can be, but are not necessarily, references to the same embodiment, and such references mean at least one of the embodiments.

Fully textured three-dimensional (3D) models (e.g., human or object models) can be used for various applications including, but not limited to, virtual reality (VR), video editing, virtual clothes try-on, realistic 3D animations, and/or the like. However, conventional processes for generating textured 3D models can often require considerable effort and resources to produce detailed and realistic models. When creating textured 3D models, UV mapping can be used. UV mapping is the process of projecting a 2D image onto a 3D model's surface for texture mapping. The image that results from UV mapping is called a UV map, or UV texture map. In one example embodiment, the UV map representation is a two-dimensional representation (e.g., based on a U-axis and a V-axis) of the 3D mesh or model of the subject, where the 3D mesh has been "unwrapped" from the 3D shape of the subject and flattened onto a plane represented by the U and V axes of the UV map representation.
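The texture-mapping step described above can be sketched in a few lines. The following is a minimal, hypothetical nearest-neighbour sampler (the function name, shapes and test values are illustrative, not from the disclosure): given a UV texture map and per-vertex (u, v) coordinates, it fetches each vertex's colour.

```python
import numpy as np

# Minimal sketch of UV texture lookup: map normalised (u, v) in [0, 1]^2
# to texel indices and fetch the colours (nearest-neighbour sampling).
def sample_uv(texture, uv):
    """texture: (H, W, 3) UV map; uv: (N, 2) coordinates in [0, 1]^2."""
    h, w, _ = texture.shape
    cols = np.clip(np.round(uv[:, 0] * (w - 1)).astype(int), 0, w - 1)
    rows = np.clip(np.round(uv[:, 1] * (h - 1)).astype(int), 0, h - 1)
    return texture[rows, cols]          # (N, 3) per-vertex colours

texture = np.zeros((4, 4, 3))
texture[0, 0] = [1.0, 0.0, 0.0]         # a red texel at the UV origin
colors = sample_uv(texture, np.array([[0.0, 0.0], [1.0, 1.0]]))
```

Real renderers typically use bilinear interpolation instead of nearest-neighbour lookup, but the indexing idea is the same.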

Figure 1 illustrates two important representations that are needed in textured 3D modelling: a standard map 101 and a UV map 102. The standard map 101 signifies a usual image representation format that humans are familiar with, and it comprises multiple-view low-resolution images. The standard map may include a color image and a depth image extracted from the color image or from other equivalent means for determining or extracting depth information associated with the color image. In one example embodiment, the color image is a standard image captured using a camera sensor of a user equipment (UE). The system may use the standard map as an input to a machine learning algorithm, such as a neural network, for example, to predict or infer the UV map representation of the subject that embeds both the texture/color (or other visual characteristics) of the subject along with the 3D surface geometry of the subject.

The UV map 102 is a warped representation of multiple input images on a single image plane. It is a high-resolution texture image, which can be seen as a compact representation of information from multiple image inputs, and is very useful especially in texture representation for 3D models. Using methods such as SMPL (Skinned Multi-Person Linear model) or VideoAvatar, one can extract i) shape parameters corresponding to the 3D vertices of a human model; and/or ii) pose parameters for each input image. Thereafter, the face visibility of those 3D vertices can be back-projected onto a plane to create the UV map representation. Finally, there is a bijective mapping between the UV map 102 and the original 3D vertices (i.e. the low-fidelity 3D model 103), so that the resulting UV map texture image can be applied to fully texture a human model 104.

The advantage of using the UV map representation for a textured model, rather than using the standard map directly, is that i) it allows generation of a realistic high-resolution textured 3D model 104 regardless of the resolution of the input color image; and ii) it simplifies human texture generation to a "2D-2D" space rather than a "3D-2D" space. This is possible because of the representation that maps a partial visibility of the input image to a complete visibility on the 2D plane defined by the high-resolution UV map. An accurate UV map is required regardless of how bad the 3D vertices are due to a possibly low-fidelity 3D human capture system. In the quest to reduce computational time complexity for easy 3D model generation, known systems that are based on neural networks may provide 3D estimates of low quality due to inaccurate pose estimations and perturbed input standard images. These may eventually lead to texture problems that occur in the form of spatial incoherence, texture misalignment, and ghosting effects. The use of a machine learning algorithm, for example a neural network, to map the standard map to the UV map is particularly attractive for handling these problems.

A neural network (NN) is a computation graph that may comprise several layers of successive computation. A layer may comprise one or more units or neurons performing an elementary computation. A unit may be connected to one or more other units, and each connection may have an associated weight. The weight, also called a "parameter" or a "weight parameter", may be used for scaling the signal passing through the associated connection. A weight may be a learnable parameter, i.e., a value which can be learned from training data.
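As a concrete illustration of one unit's elementary computation (the weights, inputs and tanh nonlinearity below are illustrative choices, not values from the disclosure): each input is scaled by its connection weight, the results are summed, and a nonlinearity is applied.

```python
import numpy as np

# Sketch of a single unit: a weighted sum of its inputs followed by a
# nonlinearity. The weights are the learnable parameters.
weights = np.array([0.5, -1.0, 2.0])   # connection weights (learnable)
inputs = np.array([1.0, 2.0, 3.0])     # signals arriving at the unit
pre_activation = weights @ inputs      # 0.5*1 - 1.0*2 + 2.0*3 = 4.5
activation = np.tanh(pre_activation)   # unit output after nonlinearity
```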

In order to configure a neural network to perform a task, an untrained neural network model has to go through a training phase. The training phase is the development phase, where the neural network learns to perform the final task. The training data set that is used to train the neural network is supposed to be representative of the data on which the neural network will be used. During training, the neural network uses the examples in the training data set to modify its learnable parameters (e.g., its connections' weights) in order to achieve the desired task. The input to the neural network is the data, and the output of the neural network represents the desired task. Training may happen by minimizing or decreasing the output's error, also referred to as the loss. In deep learning techniques, training is an iterative process, where at each iteration the algorithm modifies the weights of the neural network to make a gradual improvement in the network's output, i.e., to gradually decrease the loss.
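The iterative loss-minimisation loop described above can be sketched on a deliberately tiny "network" with a single learnable weight (the model y = w*x, the data and the learning rate are all assumptions for illustration):

```python
import numpy as np

# Minimal training loop: each iteration nudges the learnable weight w
# in the direction that decreases the mean-squared-error loss.
rng = np.random.default_rng(0)
x = rng.normal(size=100)               # training inputs
y = 3.0 * x                            # training targets (true weight: 3)

w, lr = 0.0, 0.1                       # initial weight, learning rate
for _ in range(100):
    pred = w * x                       # network output
    grad = 2.0 * np.mean((pred - y) * x)   # gradient d(MSE)/dw
    w -= lr * grad                     # gradient step: loss decreases
```

After training, w has converged close to the true value 3, illustrating how repeated small weight updates minimise the loss.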

After training, the trained neural network model is applied to new data during an inference phase, in which the neural network performs the desired task for which it has been trained. The inference phase is also known as the testing phase. As a result of the inference phase, the neural network provides an output which is the result of inference on the new data. Training can be performed in several ways. The main ones are supervised, unsupervised, and reinforcement training. In supervised training, the neural network model is provided with input-output pairs, where the output may be a label. In unsupervised training, the neural network is provided only with input data (and also with output raw data in the case of self-supervised training). In reinforcement learning, the supervision is sparser and less precise; instead of input-output pairs, the neural network gets input data and, sometimes, delayed rewards in the form of scores (e.g., -1, 0, or +1). The learning is a result of the training algorithm, or of a meta-level neural network providing the training signal. In general, the training algorithm comprises changing some properties of the neural network so that its output is as close as possible to a desired output.

Training a neural network is an optimization process, but the final goal may be different from the goal of optimization. In machine learning, the goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset. In other words, the goal is to use a limited training dataset in order to learn to generalize to previously unseen data, i.e. data which was not used for training the model. This may be referred to as generalization. In practice, data may be split into at least two datasets, the training set and the validation set. The training set is used for training the network, i.e., to modify its learnable parameters in order to minimize the loss. The validation set is used for checking the performance of the network on data which was not used to minimize the loss, as an indication of the final performance of the model.
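The train/validation split described above can be sketched as follows (the 80/20 ratio and the toy dataset are assumptions; any disjoint split with the validation set held out of training works the same way):

```python
import random

# Split a dataset into a training set (used to minimise the loss) and a
# disjoint validation set (used only to check generalisation).
data = list(range(100))                # stand-in for a real dataset
random.seed(0)
random.shuffle(data)                   # avoid ordering bias in the split
split = int(0.8 * len(data))           # assumed 80/20 split
train_set, val_set = data[:split], data[split:]
```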

Recently, neural image compression and decompression systems have been based on neural auto-encoders, or simply auto-encoders. An auto-encoder may comprise two neural networks, one of which is the neural encoder (also referred to as the "encoder" in this description for simplicity) and the other is the neural decoder (also referred to as the "decoder" in this description for simplicity). The encoder is configured to map the input data (such as an image, for example) to a representation which is more easily or more efficiently compressed. The decoder gets the compressed version of the data and is configured to decompress it, thus reconstructing the data. The two networks in the auto-encoder may be trained simultaneously, in an end-to-end fashion. The training may be performed by using at least a reconstruction loss, which trains the auto-encoder to reconstruct the image correctly. An example of a reconstruction loss is the mean squared error (MSE). In order to also achieve high compression gains, an additional loss on the output of the encoder may be used.
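The auto-encoder structure described above can be sketched as follows. The weights here are random (not trained) and the layer sizes are assumptions; training would adjust the weights to minimise the MSE computed at the end.

```python
import numpy as np

# Structural sketch of an auto-encoder: the encoder compresses an 8-d
# input to a 3-d code; the decoder reconstructs an 8-d output from it.
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(8, 3))        # encoder weights: 8-d -> 3-d
W_dec = rng.normal(size=(3, 8))        # decoder weights: 3-d -> 8-d

x = rng.normal(size=8)                 # input data
code = np.tanh(x @ W_enc)              # compressed representation
x_hat = code @ W_dec                   # reconstruction of the input
mse = np.mean((x - x_hat) ** 2)        # reconstruction loss to minimise
```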

A neural network comprising multiple layers between the input layer and the output layer is called a deep neural network (DNN). Deep neural networks are becoming popular and attractive because of their capability to produce detailed texture with high fidelity. For example, an end-to-end supervised network may be used to learn a UV map of texture representing the human body from input standard images, which gives good results but requires intensive design choices during training and complicated hyperparameter tuning to achieve very good results.

The so-called "skip connection" is an element of convolutional architectures (such as DNNs) that can be used to skip some layers in the neural network and feed the output of one layer as an input to later layers. The idea of the skip connection was proposed to deal with the vanishing gradient problem, and thus it helps in the stable training of neural networks. By using a skip connection, one provides an alternative path for the neural network model's convergence. In the literature, skip connections are mostly grouped into short and long skip connections. Short skip connections are used along with consecutive convolutional layers that do not change the input dimensions, while long skip connections may exist in encoder-decoder architectures (e.g. the U-Net architecture), so that layers from the encoder are passed directly to their corresponding decoder layers. It is known that the passed image/feature shape and dimensions retain global information, while small details in image patches resolve the local image information. These are useful and help the decoder in resolving problems or lost information. In addition, long skip connections can be used to pass features from the encoder to the decoder in order to recover spatial information that is lost during downsampling operations. For example, "deactivable skip connections" allow the integration of two auto-encoded neural network branches within the same architecture that can be trained end-to-end. Further, "attention skip connections" add attention neural network gates to standard skip connections for the UV map completion task.
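A long (U-Net style) skip connection of the kind described above can be sketched as a simple channel-wise concatenation (shapes are hypothetical; real implementations concatenate inside the network graph):

```python
import numpy as np

# Sketch of a long skip connection: an encoder feature map is passed
# directly to its corresponding decoder layer and concatenated along
# the channel axis, restoring spatial detail lost during downsampling.
def long_skip(encoder_feat, decoder_feat):
    """Both (H, W, C); returns (H, W, 2C) for the next decoder layer."""
    return np.concatenate([encoder_feat, decoder_feat], axis=-1)

enc = np.ones((16, 16, 32))            # saved encoder feature map
dec = np.zeros((16, 16, 32))           # upsampled decoder feature map
merged = long_skip(enc, dec)           # shape (16, 16, 64)
```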

In order to tackle the aforementioned problems occurring when the UV map of texture representing the human body is learned from an input standard image, a skip connection would be tempting to use, as shown in Figure 2. As shown in Figure 2, a skip connection would be included in the neural network comprising both the neural encoder and the neural decoder. Such a skip connection would feed the output of one neural network layer 202 as the input to another layer 206.

However, a standard skip connection - as shown in Figure 2 - is not appropriate between two different image representations, i.e., from the standard representation 210 to the UV map representation 220. Therefore, the present embodiments provide a modified skip connection to handle the changes, in particular the spatial misalignment, between the standard and UV representations. The modified skip connection according to the embodiments helps to improve the practicality and accuracy of the system architecture.

The modified skip connection according to the example embodiments can be directly attached to any neural network architecture. It is referred to here as a "warped skip connection". Its task is to pass a warped transformation of standard image features to the corresponding decoder layer with the UV map structure. This design strategy helps the training and leads to an improvement in the neural network output. The warped skip connection effectively handles the projection difference between the standard map and the UV map, and thus presents meaningful information in the neural network training stages.

Figure 3 illustrates a network adopting an encoder-decoder architecture and being configured to map standard map inputs to UV map outputs. The network has a modified skip connection (i.e. a warped skip connection) 315 according to the example embodiments. For the encoder, the system can use a neural network, wherein the number of layers and/or neurons of the network can be determined by considering the balance between performance and speed. For the decoder, the system can likewise use a neural network. The encoder-decoder model, according to Figure 3, contains three convolutional layers each (301, 302, 303 and 311, 312, 313), and the output has three channels, like the input. It is appreciated, however, that the number of layers can vary from what has been presented in Figure 3. Each convolution operation may be followed by a nonlinear tanh activation function. The architectural representation shown in Figure 3 is used in the remaining part of the present disclosure as an example architecture.
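The three-layer encoder-decoder of Figure 3 can be sketched structurally as follows. For brevity the sketch uses 1x1 convolutions (per-pixel channel-mixing matrix multiplies) with tanh activations as a stand-in for real spatial convolutions, and the layer widths are assumptions:

```python
import numpy as np

# Structural sketch of a 3-layer encoder / 3-layer decoder where each
# "convolution" is a 1x1 channel mix followed by tanh; the output has
# three channels, like the input.
rng = np.random.default_rng(0)

def conv1x1(feat, w):
    return np.tanh(feat @ w)           # conv followed by tanh activation

enc_ws = [rng.normal(size=(3, 8)),     # encoder layer widths (assumed)
          rng.normal(size=(8, 16)),
          rng.normal(size=(16, 32))]
dec_ws = [rng.normal(size=(32, 16)),   # decoder mirrors the encoder
          rng.normal(size=(16, 8)),
          rng.normal(size=(8, 3))]

h = rng.normal(size=(16, 16, 3))       # standard-map input, 3 channels
for w in enc_ws:
    h = conv1x1(h, w)                  # encoder layers
for w in dec_ws:
    h = conv1x1(h, w)                  # decoder layers -> 3-channel output
```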

Figure 4 illustrates a detailed example of the warped skip connection branch 415 according to the example embodiments. A reshaped standard image input 401 is passed into the warp skip connection branch 407 comprising the warping parameter map generator 402. The output of the warping parameter map generator 402 is reshaped to generate the warping parameter map 403, which can be downsampled based on the number of layers existing in the neural network (i.e. encoder and decoder pairs). In addition, the warp skip connection branch 407 also receives the outputs of layer 405 and layer 406. Thereafter, the corresponding downsampled warping parameter maps 418, 419 can eventually be combined with the outputs of layers 405, 406 to create the warped outputs. Finally, the warped outputs can be concatenated with the corresponding decoder features at layers 420, 421 to generate the high-resolution UV map 410.
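The data flow of this branch can be sketched schematically. All shapes, the stride-2 downsampling, and the use of the visibility channel as the "combine" step are simplifying assumptions; the real branch applies a full spatial warp per encoder-decoder pair:

```python
import numpy as np

# Schematic data flow of the warp skip connection branch: generate a
# warping parameter map, downsample it once per encoder-decoder pair,
# combine each downsampled map with the matching encoder output, then
# concatenate the warped result with the decoder features.
def downsample(m):
    return m[::2, ::2]                       # naive stride-2 downsampling

warp_map = np.ones((8, 8, 3))                # (dx, dy, alpha) per pixel
maps = [warp_map]
for _ in range(2):                           # one map per layer pair
    maps.append(downsample(maps[-1]))        # 8x8 -> 4x4 -> 2x2

encoder_outs = [np.ones((4, 4, 8)), np.ones((2, 2, 16))]
decoder_feats = [np.zeros((4, 4, 8)), np.zeros((2, 2, 16))]

warped = [m[..., 2:3] * f                    # stand-in "warp": mask by alpha
          for m, f in zip(maps[1:], encoder_outs)]
merged = [np.concatenate([w, d], axis=-1)    # feed to decoder layers
          for w, d in zip(warped, decoder_feats)]
```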

When discussing the example embodiments in more detail, the following notation is used: the standard image (e.g. item 401 in Figure 4) is defined as I: (H x W x D), where H, W, D are the standard image height, width and number of channels, respectively. For j = 1...n features, a feature on the standard input image is F_j: (x_j, y_j, z_j), where x, y are coordinate values, and z is the intensity depending on the total number of channels D. Likewise, the UV map image is defined as I^u: (H^u x W^u x D^u), where a feature of the UV map image is F^u_j: (x^u_j, y^u_j, z^u_j).

In the example networks, i = 1...N indexes the network layers. Thus for layer i, the image is I_i: (H_i x W_i x D_i) with features F_{j,i}: (x_j, y_j, z_j). In addition, the UV-side image at layer i is I^u_i: (H^u_i x W^u_i x D^u_i) with features F^u_{j,i}: (x^u_j, y^u_j, z^u_j).

An example of the warp skip connection branch is also shown in Figure 5. The direct input to the warping map generator 502 according to an example embodiment can be an image input 501, e.g. a texture image, a skeleton representation (2D or 3D, using known pose estimation methods such as "openpose" or "vnect"), depth maps, normals, a silhouette, a segmentation, or sparse 3D depth points. However, regardless of the type of the image input, and if it is not already, the image needs to be represented in the standard image format, i.e. the standard map, as shown in Figure 6 (item 601: 2D or 3D keypoints, a texture image, or 3D mesh vertices). The standard map may include a color image and a depth image extracted from the color image or from other equivalent means for determining or extracting depth information associated with the color image. The warping map generator 502 according to the example embodiments can be formed according to any deep convolutional neural network architecture, and it reshapes the input image into a warping map kernel (H x W x 3) holding (Δx, Δy, α). The estimated warping parameter map (H x W x 3) 503, i.e. the warping parameter kernel, is a projection tensor and has the same size as the input image. The map features of the estimated warping parameter map 503 are presented with the parameters (Δx, Δy, α), where (Δx, Δy) define the offset from a standard coordinate to a UV coordinate, and where α is the visibility constraint such that, when a pixel is not visible in the input image, α = 0, and vice versa. The visibility information allows manipulation of the network for a multi-view solution.

In the warping map generator 502, four fully convolutional (FC) operations can be used, according to an example. The network can be trained by minimizing a warping loss function l_w as in equation (1), where F and F̂ represent the ground truth feature pixels and the estimated ones on the UV map (Figure 6, UV map 610; a warping parameter map 603). The ground truth features can be estimated through backprojection using orthographic projection of estimated 3D human shapes, and image pose parameters. L(P, P̂) represents the distance function of two warping parameters, ||·|| represents the l2 loss, and l_1, l_2 are weighting parameters set at the beginning of the training. The second term on the RHS of equation (1) is a regularization term that helps the system generalize better.

An example of L(P, P̂) is given by an equation where t_1 is a weight parameter, and Δx, Δy are the estimated parameters. According to an alternative design, the first term in that equation can be a log loss; the second and the third terms can be L1 losses. In addition, optimizing L(P, P̂) assumes that the parameter α is unknown. However, in an alternative embodiment, when α is known, there is no need to estimate it by the neural network, and thus it can be removed.
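A hedged sketch of these loss terms is given below, following the alternative design mentioned above: a binary log loss on the visibility α plus L1 losses on the offsets for L(P, P̂), and an l2 feature term plus the parameter-distance regularizer for the warping loss. The exact weighting and combination are assumptions; the patent's own equations (1) and their weights are not reproduced here.

```python
import numpy as np

# Hedged sketch of one possible concrete form of the losses above; the
# weighting and the way the terms combine are assumptions.
def warp_parameter_distance(P_hat, P, t1=1.0):
    # Log loss on visibility alpha (channel 2), L1 loss on offsets (0, 1).
    a_hat = np.clip(P_hat[..., 2], 1e-7, 1 - 1e-7)
    a = P[..., 2]
    log_loss = -np.mean(a * np.log(a_hat) + (1 - a) * np.log(1 - a_hat))
    l1_dx = np.mean(np.abs(P_hat[..., 0] - P[..., 0]))
    l1_dy = np.mean(np.abs(P_hat[..., 1] - P[..., 1]))
    return t1 * log_loss + l1_dx + l1_dy

def warping_loss(F_hat, F, P_hat, P, l1=1.0, l2=0.1):
    # l2 feature loss between estimated and ground truth UV features, plus
    # the warping-parameter distance as a regularization term.
    return l1 * np.mean((F_hat - F) ** 2) + l2 * warp_parameter_distance(P_hat, P)
```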

The warping map generator 502 according to an embodiment can apply to sparse or dense data, or both combined, thanks to the representation of the standard image input to the warp generator. The sparse data can also be 2D or 3D keypoints representing a human body in the images. The proposed warping generator can learn misalignments and noise perturbations from standard image inputs over multiple views. The model parameters of the warping generator are obtained through a training procedure and fine-tuned together with the encoder-decoder architecture. A good initialization of the warping parameters is obtained through orthographic projection. The system according to an example embodiment is able to learn the best visibility from multi-view inputs for the relevant UV map as shown in Figure 6.

The warping parameter map 503 may then be downsampled to result in one or more downsampled warping parameter maps. According to an embodiment, the warping parameter map can be downsampled a number of times equal to the number of encoder-decoder network layers 1 to N. All layer features between (inclusive) neural network encoder layer i=1 505 and neural network encoder layer i=N 509 from the Main NN's encoder are matched to the corresponding downsampled parameter map 504, 508, and eventually passed through a "warp" operation to create warped features I_1^u: (H_1^u x W_1^u x D_1^u) 506 in encoder layer i=1 and warped features I_N^u: (H_N^u x W_N^u x D_N^u) 510 in encoder layer i=N, both with content (x_j^u = x_j + Δx_j, y_j^u = y_j + Δy_j, z_j, α). The warp operation may use the warping parameter map and an interpolation operation to create a visually pleasing UV map as shown in Figure 7. Eventually, the warped features are concatenated with the corresponding features in decoder layers i=1 507 to i=N 511 at the main NN's decoder.
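The warp operation itself can be sketched as follows. Each encoder feature at standard coordinates (x, y) is moved to UV coordinates (x + Δx, y + Δy), keeping only pixels whose visibility α is non-zero. Nearest-neighbour placement is used here for brevity; as noted above, the embodiments may instead use an interpolation operation.

```python
import numpy as np

# Sketch of the "warp" operation: move each encoder feature from standard
# coordinates to UV coordinates using the (downsampled) warping parameter
# map, masking out pixels that are not visible (alpha = 0).
def warp_features(features, pmap):
    H, W, D = features.shape
    warped = np.zeros_like(features)
    for y in range(H):
        for x in range(W):
            dx, dy, alpha = pmap[y, x]
            if alpha <= 0:
                continue  # pixel not visible in the input image
            xu, yu = int(round(x + dx)), int(round(y + dy))
            if 0 <= xu < W and 0 <= yu < H:
                warped[yu, xu] = features[y, x]
    return warped
```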

Figure 7 illustrates a warping operation using an input image. The visibility of the input image 711 is used to manipulate the visibility parameter α of the visibility image 712. Figure 7 shows a warping illustration 701 and the visibility illustration 710. The warping parameter map 703 is used in the warping.

In the following, two example embodiments of neural network implementations for texture generation using the warped skip connection according to the present embodiments are discussed. The methods comprise 1) a Bi-modal and 2) a Mono-modal system with warped skip connection.

Bi-modal neural network using warped skip connection

According to an embodiment, a bi-modal neural network system combines two neural network branches A and B, which is illustrated in Figure 8. The network of Figure 8 is a warped skip connection network having a neural network A and a neural network B. In a practical application, when 3D information is already available, one can create a coarse depth representation of human 3D vertices to incorporate geometry information at the standard map input (see Figure 8, "Neural network A"). In Neural network A, UV map geometry information 810 contains three channels: the first two channels embed coarse shape information while the third one embeds the depth information. However, Neural network B only incorporates texture information 830, with three-channel pixel intensity values. It is noted that the numbers and types of channels described above are provided by way of illustration and not as limitations. It is contemplated that more or fewer channels can be used to represent the geometry or visual characteristics of a subject depending on the types of geometries/coordinate systems used and/or the types of visual characteristics being used. For example, three color channels can be used to represent a full color texture pixel, while a channel can be used to represent a surface or bump height of a pixel. Each network branch A, B includes the warped skip connection C according to example embodiments to aggregate and transfer warped information from their respective encoder to decoder layers. For example, layers 802A transfer neural network A features; layers 802B transfer neural network B features; layers 803A transfer neural network A features and neural network A special skip connection features; layers 804B transfer neural network B features and neural network A standard skip connection features; layers 805B transfer neural network B features and neural network B standard skip connection features.
It is worth noting that the skip connection network C is the same for the two branches (networks A and B). In addition, this bi-modal configuration uses an "intra-connection" 801 based on the warped skip connection branch to transfer the features. In order to transfer features between networks A and B, a standard skip connection in the form of an inter-skip connection D is used. The standard skip connection D simply copies features from the network A encoder directly to that of network B. In Figure 8, connection E stands for a convolution, and connection F stands for a deconvolution.
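The standard inter-skip connection D can be sketched in a few lines: network A encoder features are simply copied and joined to the network B features at the corresponding layer. Channel-wise concatenation is an assumption here about how the copied features are consumed by network B.

```python
import numpy as np

# Minimal sketch of the standard (inter) skip connection D: encoder features
# from network A are copied and joined to network B's features at the
# corresponding layer (channel-wise concatenation is an assumption).
def inter_skip_connection(features_a, features_b):
    return np.concatenate([features_b, features_a], axis=-1)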

The full network is trained by minimizing the loss function l_bi in equation (2), where l_w is the warp loss in equation (1), and l_A, l_B are the loss functions pertaining to networks A and B respectively. These loss functions follow the same approach as equation (1), except that they are computed only on the UV map geometry and the UV map texture respectively.

As the training strategy, the warping map generator may be trained first, which will ensure better results. Then the neural networks A and B can be individually and separately trained. Finally, the whole network can be refined with the full loss in equation (2). During the testing step, the use of Neural network A is optional. In that case, one can directly use Neural network B by neglecting the intra-skip connection.
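The phased training strategy above can be sketched as a schedule. The train_* callables below are hypothetical stand-ins for optimizer loops minimizing the warp loss l_w, the branch losses l_A and l_B, and the full loss l_bi.

```python
# Sketch of the phased training strategy described above. The train_*
# callables are hypothetical stand-ins for optimizer loops minimizing the
# warp loss l_w, the branch losses l_A and l_B, and the full loss l_bi.
def training_schedule(train_warp_generator, train_a, train_b, train_full):
    train_warp_generator()  # phase 1: warping map generator first
    train_a()               # phase 2: branches A and B, individually
    train_b()
    train_full()            # phase 3: refine the whole network with l_bi
```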

In an alternative embodiment, the warping map generator may be trained jointly with the auto-encoder neural network, instead of having a two-phase training process. In such case, there is no explicit loss for the warped features, thus the generator learns to warp features in such a way that maximizes the final output results (the UV map). This alternative training process may be referred to as end-to-end training.

In an alternative embodiment, the warping map generator outputs multiple warping maps - one for each warping operation. Therefore, there are no downsampling operations of the warping maps.

In an alternative embodiment, there may be multiple warping map generators, where each generator is specialized to output an optimal warping map for a specific warping operation, i.e., for a specific intermediate layer where the warping skip connection is applied.

Mono-modal neural network using warped skip connection

In some applications, it may be difficult to have standard map geometry information inputs, or one may need to speed up the whole process and avoid geometry information during inference. Therefore, the present embodiments further provide a variant of the example embodiments. This variant is called a Mono-modal neural network. Such an embodiment only uses the neural network branch B, which incorporates the texture information. The network can be trained to minimize the loss function l_mono in equation (3), where l_w is the warp loss in equation (1), and l_B is the loss function pertaining to network B. l_B follows the same approach as equation (1) and is estimated on the UV map.

Even without the depth information that network A can provide, the incorporation of the warped skip connection is able to guide the network according to example embodiments to produce accurate results that are close to those of the Bi-modal network. For the training strategy, the warping map generator can be trained first, after which network B is trained. Finally, the whole network can be refined with the full loss in equation (3).

Training

In training the warping map generator, datasets from textured 3D models can be used. The 3D models provide photorealistic images and corresponding textures. As there is not enough diversity in the datasets, non-rigid registration of SMPL parameters is performed on those 3D models, and then the models are rigged to different poses. Finally, the rigged models are rendered from multiple views to create standard image and UV map representations. In addition, these data can be rendered using different backgrounds to increase the diversity of the dataset.

To train the neural network, the system can collect a diverse set of images of various subjects (e.g., humans and/or objects). The training set of images can include actual and/or synthetic images. When using synthetic images, the system can assemble a target number of images from a variety of sources. For example, the system can obtain training datasets that are generated using SURREAL datasets, Human3.6M datasets, A36pose datasets (e.g., a proprietary dataset created by the inventors and not generally available to the public), and/or any other equivalent dataset. SURREAL (Synthetic hUmans foR REAL tasks), for instance, is a large scale synthetic dataset that supports the SMPL model and provides photorealistic images and corresponding texture UV maps with good resolution and complete visibility UV map-color. The system can create any number of image frames, e.g., approximately 50,000 frames spanning subjects with various clothing, backgrounds, and poses.

To increase the diversity of training images, the system can add images from more than one source or from a source that provides different features in the images. For example, the system can collect training data from a source such as but not limited to A36pose. The A36pose dataset, for instance, was created by the inventors to improve the diversity of the training images and provides images of subjects from non-rigid registration of SMPL to people in clothing. The system can further obtain images of subjects with different visual appearances, features, or identities such as moustache, chunky, bald, hairy, etc. In addition, the system can render or obtain images with different backgrounds, which further increases diversity. Other datasets such as Human3.6M can provide imagery with subjects engaged in action sequences (e.g., to assist in training to infer movement or changes in the UV map-geometry and/or UV map-color over time to generate 3D animation videos). Images from this dataset can also be selected for challenging scenarios such as increased inter-occlusion of body parts and/or other objects. In one example embodiment, for the training dataset, where there are occlusions, the system can present the occluded portions of images for manual inpainting to ensure complete UV maps.

The present embodiments may be used in a neural network solution that may require a projection transformation from multiple plane inputs to a single compressed output representation. The compressed representation can be, for example, an image plane, 3D planes, etc., while the input can be 360-degree videos or even images. A warping operation can be seen as a form of projection that transforms one image representation to another, i.e. a transformation that goes beyond a direct transformation: a high-level transformation. An example is calibration work that requires correction of lens distortion, whereby the input is highly distorted and the output is not. This kind of situation may incorporate warping or some projection process in the neural network.

Figure 9 illustrates a system according to an embodiment. The system comprises at least a user equipment (UE) 919, a texture platform 923, one or more content providers 931a...931m, and a services platform 927. The various elements of the system are configured to communicate through a communication network 933. According to an embodiment, a camera sensor of the user equipment 919 may capture the color image, which with a depth image may compose the standard map for the purposes of the present embodiments. The standard map is provided as an input to the neural network to predict or infer the UV map representation of the subject that embeds both the texture/color (or other visual characteristics) of the subject along with the 3D surface geometry of the subject.

The user equipment may comprise a texture client 921 to generate 3D texture according to embodiments. In addition, or alternatively, the system can include a texture platform 923 including a machine learning system 925 to generate 3D textures according to embodiments described herein, alone or in combination with the UE 919 and/or texture client 921, for instance over the communication network 933. The modules and the components of the system can be implemented in circuitry, hardware, firmware, software, or a combination thereof. It is appreciated that the texture client 921 and/or texture platform 923 may be implemented as a module of any other components of the system such as but not limited to a services platform 927, one or more services 929a-929n of the services platform 927, and/or content providers 931a-931m that use the UV map representation outputs. According to another example, the texture client 921 and/or texture platform 923 may be implemented as a cloud-based service, local service, native application, or a combination thereof.

According to an embodiment, content providers 931a-931m (also referred to as content providers 931) may provide content or data (e.g., including image data, training data, textures, etc.) to the texture client 921 and/or texture platform 923. The content provided may be any type of content, such as text content, audio content, video content, image content, etc. According to an embodiment, the content providers may provide content that may aid in the generation of 3D texture using a UV map representation according to the embodiments described herein. According to an embodiment, the content providers may also store content (e.g. textures, training data, trained machine learning models, 3D mesh data, etc.) used or generated by the texture client 921 and/or texture platform 923. According to another embodiment, the content providers may manage access to a central repository of data, and offer a consistent, standard interface to data.

According to an embodiment, the user equipment 919 is any type of mobile terminal, fixed terminal, or portable terminal including a built-in navigation system, a personal navigation device, mobile handset, station, unit, device, multimedia computer, multimedia tablet, Internet node, communicator, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, personal communication system (PCS) device, personal digital assistants (PDAs), audio/video player, digital camera/camcorder, positioning device, fitness device, television receiver, radio broadcast receiver, electronic book device, game device, or any combination thereof, including the accessories and peripherals of these devices, or any combination thereof. It is also contemplated that the UE 919 can support any type of interface to the user (such as “wearable” circuitry, etc.).

In one embodiment, the communication network 933 of system includes one or more networks such as a data network, a wireless network, a telephony network, or any combination thereof. It is contemplated that the data network may be any local area network (LAN), metropolitan area network (MAN), wide area network (WAN), a public data network (e.g., the Internet), short range wireless network, or any other suitable packet-switched network, such as a commercially owned, proprietary packet-switched network, e.g., a proprietary cable or fiber-optic network, and the like, or any combination thereof. In addition, the wireless network may be, for example, a cellular network and may employ various technologies including enhanced data rates for global evolution (EDGE), general packet radio service (GPRS), global system for mobile communications (GSM), Internet protocol multimedia subsystem (IMS), universal mobile telecommunications system (UMTS), etc., as well as any other suitable wireless medium, e.g., worldwide interoperability for microwave access (WiMAX), Long Term Evolution (LTE) networks, code division multiple access (CDMA), wideband code division multiple access (WCDMA), wireless fidelity (Wi-Fi), wireless LAN (WLAN), Bluetooth®, Internet Protocol (IP) data casting, satellite, mobile ad-hoc network (MANET), 5G New Radio, cloud Radio Access Network (RAN), and the like, or any combination thereof.

By way of example, the texture client 921 and/or texture platform 923 communicate with each other and other components of the system using well known, new or still developing protocols. In this context, a protocol includes a set of rules defining how the network nodes within the communication network 933 interact with each other based on information sent over the communication links. The protocols are effective at different layers of operation within each node, from generating and receiving physical signals of various types, to selecting a link for transferring those signals, to the format of information indicated by those signals, to identifying which software application executing on a computer system sends or receives the information. The conceptually different layers of protocols for exchanging information over a network are described in the Open Systems Interconnection (OSI) Reference Model.

Communications between the network nodes are typically effected by exchanging discrete packets of data. Each packet typically comprises (1) header information associated with a particular protocol, and (2) payload information that follows the header information and contains information that may be processed independently of that particular protocol. In some protocols, the packet includes (3) trailer information following the payload and indicating the end of the payload information. The header includes information such as the source of the packet, its destination, the length of the payload, and other properties used by the protocol. Often, the data in the payload for the particular protocol includes a header and payload for a different protocol associated with a different, higher layer of the OSI Reference Model. The header for a particular protocol typically indicates a type for the next protocol contained in its payload. The higher layer protocol is said to be encapsulated in the lower layer protocol. The headers included in a packet traversing multiple heterogeneous networks, such as the Internet, typically include a physical (layer 1 ) header, a data-link (layer 2) header, an internetwork (layer 3) header and a transport (layer 4) header, and various application (layer 5, layer 6 and layer 7) headers as defined by the OSI Reference Model.

Figure 10 illustrates a computer system 1000 upon which an embodiment of the invention may be implemented. Computer system 1000 is programmed (e.g., via computer program code or instructions) to provide 3D texture generation using UV map representations as described herein and includes a communication mechanism such as a bus 1010 for passing information between other internal and external components of the computer system 1000. Information (also called data) is represented as a physical expression of a measurable phenomenon, typically electric voltages, but including, in other embodiments, such phenomena as magnetic, electromagnetic, pressure, chemical, biological, molecular, atomic, sub atomic and quantum interactions. For example, north and south magnetic fields, or a zero and non-zero electric voltage, represent two states (0, 1 ) of a binary digit (bit). Other phenomena can represent digits of a higher base. A superposition of multiple simultaneous quantum states before measurement represents a quantum bit (qubit). A sequence of one or more digits constitutes digital data that is used to represent a number or code for a character. In some embodiments, information called analog data is represented by a near continuum of measurable values within a particular range.

A bus 1010 includes one or more parallel conductors of information so that information is transferred quickly among devices coupled to the bus 1010. One or more processors 1002 for processing information are coupled with the bus 1010.

A processor 1002 performs a set of operations on information as specified by computer program code related to providing 3D texture generation using UV map representations. The computer program code is a set of instructions or statements providing instructions for the operation of the processor and/or the computer system to perform specified functions. The code, for example, may be written in a computer programming language that is compiled into a native instruction set of the processor. The code may also be written directly using the native instruction set (e.g., machine language). The set of operations include bringing information in from the bus 1010 and placing information on the bus 1010. The set of operations also typically include comparing two or more units of information, shifting positions of units of information, and combining two or more units of information, such as by addition or multiplication or logical operations like OR, exclusive OR (XOR), and AND. Each operation of the set of operations that can be performed by the processor is represented to the processor by information called instructions, such as an operation code of one or more digits. A sequence of operations to be executed by the processor 1002, such as a sequence of operation codes, constitute processor instructions, also called computer system instructions or, simply, computer instructions. Processors may be implemented as mechanical, electrical, magnetic, optical, chemical or quantum components, among others, alone or in combination.

Computer system 1000 also includes a memory 1004 coupled to bus 1010. The memory 1004, such as a random access memory (RAM) or other dynamic storage device, stores information including processor instructions for providing 3D texture generation using UV map representations. Dynamic memory allows information stored therein to be changed by the computer system 1000. RAM allows a unit of information stored at a location called a memory address to be stored and retrieved independently of information at neighboring addresses. The memory 1004 is also used by the processor 1002 to store temporary values during execution of processor instructions. The computer system 1000 also includes a read only memory (ROM) 1006 or other static storage device coupled to the bus 1010 for storing static information, including instructions, that is not changed by the computer system 1000. Some memory is composed of volatile storage that loses the information stored thereon when power is lost. Also coupled to bus 1010 is a non-volatile (persistent) storage device 1008, such as a magnetic disk, optical disk or flash card, for storing information, including instructions, that persists even when the computer system 1000 is turned off or otherwise loses power.

Information, including instructions for providing 3D texture generation using UV map representations, is provided to the bus 1010 for use by the processor from an external input device 1012, such as a keyboard containing alphanumeric keys operated by a human user, or a sensor. A sensor detects conditions in its vicinity and transforms those detections into physical expression compatible with the measurable phenomenon used to represent information in computer system 1000. Other external devices coupled to bus 1010, used primarily for interacting with humans, include a display device 1014, such as a cathode ray tube (CRT) or a liquid crystal display (LCD), or plasma screen or printer for presenting text or images, and a pointing device 1016, such as a mouse or a trackball or cursor direction keys, or motion sensor, for controlling a position of a small cursor image presented on the display 1014 and issuing commands associated with graphical elements presented on the display 1014. In some embodiments, for example, in embodiments in which the computer system 1000 performs all functions automatically without human input, one or more of external input device 1012, display device 1014 and pointing device 1016 is omitted.

In the illustrated embodiment, special purpose hardware, such as an application specific integrated circuit (ASIC) 1020, is coupled to bus 1010. The special purpose hardware is configured to perform operations not performed by processor 1002 quickly enough for special purposes. Examples of application specific ICs include graphics accelerator cards for generating images for display 1014, cryptographic boards for encrypting and decrypting messages sent over a network, speech recognition, and interfaces to special external devices, such as robotic arms and medical scanning equipment that repeatedly perform some complex sequence of operations that are more efficiently implemented in hardware.

Computer system 1000 also includes one or more instances of a communications interface 1070 coupled to bus 1010. Communication interface 1070 provides a one-way or two-way communication coupling to a variety of external devices that operate with their own processors, such as printers, scanners and external disks. In general, the coupling is with a network link 1078 that is connected to a local network 1080 to which a variety of external devices with their own processors are connected. For example, communication interface 1070 may be a parallel port or a serial port or a universal serial bus (USB) port on a personal computer. In some embodiments, communications interface 1070 is an integrated services digital network (ISDN) card or a digital subscriber line (DSL) card or a telephone modem that provides an information communication connection to a corresponding type of telephone line. In some embodiments, a communication interface 1070 is a cable modem that converts signals on bus 1010 into signals for a communication connection over a coaxial cable or into optical signals for a communication connection over a fiber optic cable. As another example, communications interface 1070 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN, such as Ethernet. Wireless links may also be implemented. For wireless links, the communications interface 1070 sends or receives or both sends and receives electrical, acoustic or electromagnetic signals, including infrared and optical signals, that carry information streams, such as digital data. For example, in wireless handheld devices, such as mobile telephones like cell phones, the communications interface 1070 includes a radio band electromagnetic transmitter and receiver called a radio transceiver. In certain embodiments, the communications interface 1070 enables connection to the communication network 933 for providing 3D texture generation using UV map representations.

The term computer-readable medium is used herein to refer to any medium that participates in providing information to processor 1002, including instructions (e.g., computer program instructions) for execution. For example, the instructions can cause an apparatus (e.g., processor, computer, device, etc.) to perform one or more steps, functions, operations, etc. specified in the instructions or computer program instructions. Accordingly, a computer program may comprise instructions for causing an apparatus to perform at least any of the steps, functions, operations, etc. specified in the instructions. Similarly, a computer-readable medium (e.g., transitory or non-transitory) may comprise instructions (e.g., program instructions, computer program instructions, or equivalent) for causing an apparatus to perform any of the specified steps, functions, operations, etc. Such a medium may take many forms, including, but not limited to, non-volatile or non-transitory media, volatile or transitory media and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as storage device 1008. Volatile media include, for example, dynamic memory 1004. Transmission media include, for example, coaxial cables, copper wire, fiber optic cables, and carrier waves that travel through space without wires or cables, such as acoustic waves and electromagnetic waves, including radio, optical and infrared waves. Signals include man-made transient variations in amplitude, frequency, phase, polarization or other physical properties transmitted through the transmission media.
Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, CD-RW, DVD, any other optical medium, punch cards, paper tape, optical mark sheets, any other physical medium with patterns of holes or other optically recognizable indicia, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave, or any other medium from which a computer can read. Network link 1078 typically provides information communication using transmission media through one or more networks to other devices that use or process the information. For example, network link 1078 may provide a connection through local network 1080 to a host computer 1082 or to equipment 1084 operated by an Internet Service Provider (ISP). ISP equipment 1084 in turn provides data communication services through the public, world-wide packet-switching communication network of networks now commonly referred to as the Internet 1090.

A computer called a server host 1092 connected to the Internet hosts a process that provides a service in response to information received over the Internet. For example, server host 1092 hosts a process that provides information representing video data for presentation at display 1014. It is contemplated that the components of system can be deployed in various configurations within other computer systems, e.g., host 1082 and server 1092.

The method according to an embodiment is shown in Figure 11. The method generally comprises receiving 1110 an input image depicting a subject, wherein the input image comprises a standard representation of the subject; processing 1120 the input image at the warp skip connection network to produce a warping parameter representation; downsampling 1130 the warping parameter representation according to a number of pairs of encoder-decoder layers in the neural network; combining 1140 the downsampled warping parameter maps with outputs of corresponding encoder layers to produce warped outputs; and creating 1150 a UV map representation of the subject based on at least the warped outputs with corresponding features at decoder layers, wherein the UV map representation includes, at least in part, a UV map-visual representation of at least one visual characteristic of the at least one subject. Each of the steps can be implemented by a respective module of a computer system.
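The data flow of the method can be sketched step by step. The encoder, decoder and warping map generator below are stubs (an identity warp with all pixels marked visible, and placeholder feature tensors); only the shapes and the order of the numbered steps follow the description above.

```python
import numpy as np

# Data-flow sketch of the method of Figure 11, one call per numbered step.
# The generator, encoder and decoder are stubbed; shapes are illustrative.
def downsample(pmap):
    H, W, C = pmap.shape
    return pmap.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))  # 2x2 pooling

def generate_uv_map(image, num_layers=2):
    # 1110: receive the input image (standard representation of the subject)
    H, W, _ = image.shape
    # 1120: produce a warping parameter representation (identity warp here)
    pmap = np.zeros((H, W, 3))
    pmap[..., 2] = 1.0               # all pixels marked visible
    # 1130: downsample once per encoder-decoder layer pair
    maps, m = [], pmap
    for _ in range(num_layers):
        m = downsample(m)
        maps.append(m)
    # 1140: combine each downsampled map with a (stub) encoder output
    warped = [np.ones(m.shape[:2] + (4,)) * m[..., 2:3] for m in maps]
    # 1150: create the UV map from the warped outputs and decoder features
    return np.zeros_like(image), maps, warped
```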

An apparatus according to an embodiment comprises means for receiving an input image depicting a subject, wherein the input image comprises a standard representation of the subject; means for processing the input image at the warp skip connection network to produce a warping parameter representation; means for downsampling the warping parameter representation according to a number of pairs of encoder-decoder layers in the neural network; means for combining the downsampled warping parameter maps with outputs of corresponding encoder layers to produce warped outputs; and means for creating a UV map representation of the subject based on at least the warped outputs and corresponding features at the decoder layers, wherein the UV map representation includes, at least in part, a UV map-visual representation of at least one visual characteristic of the at least one subject. The means comprise at least one processor and a memory including computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 11 according to various embodiments.

The various embodiments may provide advantages. For example, the present embodiments provide realistic and high-fidelity multi-view texture from a low-fidelity 3D human capture system. The solution according to the present embodiments requires neither full user cooperation, nor a professional capture system, nor calibration. It can be implemented on smartphones using from 2 to 8 images of a subject.

The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims together with the features of the independent claims, and not solely the combinations explicitly set out in the claims. It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.