Title:
REALISTIC AUDIO DRIVEN 3D AVATAR GENERATION
Document Type and Number:
WIPO Patent Application WO/2022/103877
Kind Code:
A1
Abstract:
This application is directed to generation of a 3D avatar that is animated in synchronization with audio data. A computer system generates face parameters of a face associated with the person from an image. The face parameters include shape parameters describing a shape of the face and expression parameters describing an expression of the face. The computer system generates a color texture map and a displacement map of a 3D face model of the face associated with the person based on the face parameters. Additionally, audio-based face parameters are extracted from the audio data, independently of the image. In accordance with the shape parameters, expression parameters, color texture map, displacement map, and audio-based face parameters, the computer system renders the 3D avatar of the person in a video clip in which the 3D avatar is animated for an audio activity synchronous with the audio data.

Inventors:
LIU CELONG (US)
WANG LINGYU (US)
XU YI (US)
Application Number:
PCT/US2021/058838
Publication Date:
May 19, 2022
Filing Date:
November 10, 2021
Assignee:
INNOPEAK TECH INC (US)
International Classes:
G06T15/00; G06N20/00; G06T13/00; G06T17/00
Foreign References:
US20190122411A12019-04-25
US20200151559A12020-05-14
Attorney, Agent or Firm:
WANG, Jianbai et al. (US)
Claims:
What is claimed is:

1. A method for avatar rendering, comprising: obtaining a two-dimensional (2D) image, the 2D image including a person; obtaining audio data, the audio data independent of the 2D image; generating, from the 2D image, a plurality of face parameters of a face associated with the person, the plurality of face parameters including a first set of shape parameters describing a shape of the face and a second set of expression parameters describing an expression of the face; generating, from the 2D image, a color texture map and a displacement map of a three-dimensional (3D) face model of the face associated with the person based on the plurality of face parameters; generating, from the audio data, a plurality of audio-based face parameters, independently of the 2D image; and in accordance with the first set of shape parameters, second set of expression parameters, color texture map, displacement map, and audio-based face parameters, rendering a 3D avatar of the person in a video clip in which the 3D avatar is animated for an audio activity synchronous with the audio data, the audio activity including lip movement.

2. The method of claim 1, wherein rendering the 3D avatar of the person in the video clip further comprises: generating a plurality of avatar driving parameters from the first set of shape parameters, second set of expression parameters, color texture map, displacement map, and audio-based face parameters based on an audio-driven 3D avatar head network; and creating the video clip including the 3D avatar of the person based on the plurality of avatar driving parameters.

3. The method of claim 1, wherein the plurality of face parameters of the face are generated from the 2D image using a first reconstruction network, and the first reconstruction network includes a convolutional neural network (CNN).

4. The method of claim 3, wherein the 3D face model includes a plurality of vertices, and the first reconstruction network includes a graph convolutional network (GCN) configured to predict a color for each vertex of the 3D face model.

5. The method of claim 3 or 4, further comprising: obtaining a shape dataset including a plurality of shape training images and a plurality of shape ground truths corresponding to the plurality of shape training images; feeding a subset of the plurality of shape training images to the first reconstruction network to generate a plurality of shape parameters; identifying a shape parameter loss between the plurality of generated shape parameters and the plurality of shape ground truths; and iteratively training the first reconstruction network using the plurality of shape training images and the shape ground truths of the shape dataset based on the shape parameter loss.

6. The method of any of claims 3-5, further comprising: obtaining a first training image, a second training image, and a third training image, the first and second training images corresponding to a first facial expression, the third training image corresponding to a second facial expression distinct from the first facial expression; feeding the first, second, and third training images to the first reconstruction network to generate a first set of expression parameters, a second set of expression parameters, and a third set of expression parameters; identifying a first expression loss equal to a difference between the first set of expression parameters and the second set of expression parameters; identifying a second expression loss of the third training image with respect to the first and second training images; and iteratively training the first reconstruction network based on the first expression loss and the second expression loss.

7. The method of any of claims 3-6, further comprising: obtaining a plurality of training lip images, each training lip image including a lip and a plurality of mouth ground truth keypoints; feeding the plurality of training lip images to the first reconstruction network to generate a first set of mouth keypoints and a second set of face parameters; identifying a mouth keypoint loss between the first set of mouth keypoints and the plurality of mouth ground truth keypoints; for each training lip image, rendering a mouth region using the second set of face parameters, and identifying a mouth rendering loss between the rendered mouth region and the training lip image; and iteratively training the first reconstruction network based on the mouth keypoint loss and the mouth rendering loss.

8. The method of claim 1, wherein the color texture map and displacement map of the 3D face model is generated from the 2D image using a second reconstruction network, and the second reconstruction network includes a first generative adversarial network (GAN) configured to convert a low-resolution color texture map to a high-resolution color texture map and a second GAN configured to convert the high-resolution color texture map to the displacement map.

9. The method of any of the preceding claims, wherein the plurality of audio-based face parameters are generated from the audio data using an audio-face neural network, and the audio-face neural network further comprises: a first audio-face neural network configured to predict a plurality of face keypoints from the audio data; a second audio-face neural network configured to generate a plurality of facial parameters from the audio data; and a face refining network configured to refine the plurality of facial parameters with the plurality of face keypoints around a mouth region to generate the plurality of audio-based face parameters.

10. The method of any of the preceding claims, wherein the person in the 2D image is a first person, and the audio data is recorded from a second person that is distinct from the first person.

11. The method of any of the preceding claims, wherein rendering the 3D avatar of the person in the video clip further comprises rendering one or more of semi-transparent eyeball, skin details, hair strands, soft shadow, global illumination, and subsurface scattering.

12. The method of any of the preceding claims, wherein the plurality of face parameters have a total number of face parameters among which a first number of face parameters describing a mouth region of the person, and a ratio of the first number and the total number exceeds a predefined threshold ratio.

13. The method of claim 1, wherein: the plurality of face parameters of the face are generated from the 2D image using a first reconstruction network; the color texture map and displacement map of the 3D face model is generated from the 2D image using a second reconstruction network; the plurality of audio-based face parameters are generated from the audio data using an audio-face neural network; the first reconstruction network, second reconstruction network, and audio-face neural network are trained, and the method of claim 1 is implemented at a server, and the video clip is streamed to an electronic device communicatively coupled to the server.

14. The method of claim 1, wherein: the plurality of face parameters of the face are generated from the 2D image using a first reconstruction network; the color texture map and displacement map of the 3D face model is generated from the 2D image using a second reconstruction network; the plurality of audio-based face parameters are generated from the audio data using an audio-face neural network; the first reconstruction network, second reconstruction network, and audio-face neural network are trained at a server, and provided to an electronic device communicatively coupled to the server; and the method of claim 1 is implemented at the electronic device.

15. A computer system, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-14.

16. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform a method of any of claims 1-14.

Description:
Realistic Audio Driven 3D Avatar Generation

RELATED APPLICATION

[0001] This application claims priority to U.S. Provisional Patent Application No. 63/113,746, titled “Realistic Audio Drivable 3D Head Generation”, filed November 13, 2020, which is incorporated by reference in its entirety.

TECHNICAL FIELD

[0002] This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for using deep learning techniques to animate a three-dimensional (3D) avatar in synchronization with audio data.

BACKGROUND

[0003] Deep learning techniques have been applied to generate a 3D personalized head from a single image. A template 3D face model is fitted to keypoints of a related face and combined with the 3D personalized head. Such a 3D personalized head is oftentimes static and not animated, and facial details (e.g., wrinkles) are missing from the 3D personalized head. The 3D personalized head can be driven by an audio sequence in some situations. However, only mouth motion is synthesized on the 3D personalized head without involving any facial expression. It would be beneficial to animate a 3D personalized avatar with audio data.

SUMMARY

[0004] Accordingly, there is a need for an efficient 3D avatar driving mechanism for creating a 3D personalized avatar from a two-dimensional (2D) image and driving the 3D personalized avatar in synchronization with independent audio data. The 3D avatar driving mechanism generates a 3D head model from a single image including a personalized face automatically. The 3D head model has a high-resolution texture map and high-resolution geometry details, and is ready to be driven or animated according to a set of animation parameters. The set of animation parameters are predicted from an audio sequence of human voice speaking or singing, and applied to drive and animate the 3D head model. Additionally, the 3D head model is rendered with photo-realistic facial features. In some embodiments, such a 3D avatar driving mechanism is implemented by a neural network model that is optimized for a mobile device having limited computational resources.

[0005] In one aspect, a method is implemented at a computer system for rendering an animated 3D avatar. The method includes obtaining a 2D image including a person. The method includes obtaining audio data, and the audio data is independent of the 2D image. The method further includes generating, from the 2D image, a plurality of face parameters of a face associated with the person. The plurality of face parameters includes a first set of shape parameters describing a shape of the face and a second set of expression parameters describing an expression of the face. The method further includes generating, from the 2D image, a color texture map and a displacement map of a 3D face model of the face associated with the person based on the plurality of face parameters. The method further includes generating, from the audio data, a plurality of audio-based face parameters, independent of the 2D image. The method further includes in accordance with the first set of shape parameters, second set of expression parameters, color texture map, displacement map, and audio-based face parameters, rendering a 3D avatar of the person in a video clip in which the 3D avatar is animated for an audio activity synchronous with the audio data. The audio activity includes at least lip movement. It is noted that animation of the 3D avatar is not limited to a mouth region and involves movements of one or more of a head, facial expression, mouth, hair, or other regions of the 3D avatar.

[0006] In another aspect, some implementations include a computer system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.

[0007] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.

[0008] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.

[0010] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.

[0011] Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.

[0012] Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.

[0013] Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node in the neural network, in accordance with some embodiments.

[0014] Figure 5 is a block diagram of an avatar generation model configured to render a 3D avatar based on a 2D image and in synchronization with audio data, in accordance with some embodiments.

[0015] Figures 6A, 6B, and 6C are flow charts of three processes of training a coarse reconstruction network (CRN) for generating a 3D avatar, in accordance with some embodiments, respectively.

[0016] Figure 7 is a block diagram of a fine reconstruction network (FRN), in accordance with some embodiments.

[0017] Figure 8 is a block diagram of an audio-face neural network, in accordance with some embodiments.

[0018] Figure 9 is a flow diagram of a method for generating or driving a 3D avatar, in accordance with some embodiments.

[0019] Like reference numerals refer to corresponding parts throughout the several views of the drawings.

DETAILED DESCRIPTION

[0020] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.

[0021] In various embodiments of this application, a 3D digital model of a human head is animated using an audio sequence of human voice (e.g., speaking or singing). The 3D digital model of the human head is reconstructed from an input image, including eyes, hair, and teeth of a person. A texture of the 3D digital model of the human head is generated and used for rendering the 3D digital model. A rigged head model is formed when the 3D digital model of the human head is animated by a set of parameters and driven with given audio data. The 3D digital model is applied to generate a personalized virtual avatar for a user based on a photo of the user. The avatar can be applied in many different user applications, including social networking applications that involve augmented or virtual reality. Such a personalized avatar is associated with an identity of the user, and talks and expresses emotions on behalf of the user.

[0022] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, a head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.

[0023] The one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C. The networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and shares information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E remotely and in real time.

[0024] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages.

[0025] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. The content data may broadly include inertial sensor data captured by inertial sensor(s) of a client device 104. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequent to model training, the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.

[0026] In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D). The client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A and HMD 104D). The server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.

[0027] In some embodiments, a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100. The AR glasses 104D include a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the AR glasses 104D, and recognizes the hand gestures locally and in real time using a two-stage hand gesture recognition model. In some situations, the microphone records ambient sound, including the user’s voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses. Optionally, deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses. The device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D. In some embodiments, the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.

[0028] As explained above, in some embodiments, deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 104D. 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model. Visual content is optionally generated using a second data processing model. Training of the first and second data processing models is optionally implemented by the server 102 or AR glasses 104D. Inference of the device poses and visual content is implemented by each of the server 102 and AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.

[0029] Figure 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments. The data processing system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in Figure 1), a storage 106, or a combination thereof. The data processing system 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104.

[0030] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:

• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;

• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;

• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);

• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;

• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;

• One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);

• Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;

• Data processing module 228 (e.g., applied to implement an avatar generation model 500 in Figure 5) for processing content data using data processing models 240 (e.g., the avatar generation model 500), thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;

• One or more databases 230 for storing at least data including one or more of:
o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name;
o Training data 238 for training one or more data processing models 240;
o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where the data processing models 240 include an avatar generation model 500 that includes reconstruction networks 508 and 510, an audio-face neural network 512, and an audio-driven 3D avatar head network 514, and is applied to render a 3D avatar of a person in a video clip in which the 3D avatar is animated for an audio activity synchronous with audio data, e.g., in Figure 5; and
o Content data and results 242 that are obtained by and outputted to the client device 104 of the data processing system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 or remotely at the server 102 to provide the associated results 242 to be presented on the client device 104.

[0031] Optionally, the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200. Optionally, the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.

[0032] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.

[0033] Figure 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and a client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.

[0034] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 240 is trained according to a type of the content data to be processed. The training data 306 is consistent with the type of the content data, as is the data pre-processing module 308 applied to process the training data 306. For example, an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 240 is provided to the data processing module 228 to process the content data.
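
For illustration only, the training cycle described in paragraph [0034] can be sketched as a short loop in which a model produces an output, a loss is compared against a ground truth, and weights are adjusted until a loss criterion is met. The class name, layer sizes, optimizer, and loss threshold below are assumptions introduced for this sketch and are not part of the disclosure.

```python
# Minimal sketch of the model training engine 310 and loss control module 312.
import torch
import torch.nn as nn

class SimpleRegressor(nn.Module):
    """Stand-in for a data processing model 240 (here, a tiny MLP)."""
    def __init__(self, in_dim=64, out_dim=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, x):
        return self.net(x)

def train(model, dataset, loss_threshold=1e-3, max_epochs=100, lr=1e-3):
    """Repeat forward/backward passes until the loss satisfies a loss criterion."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()  # loss comparing the output with the ground truth
    for _ in range(max_epochs):
        running = 0.0
        for inputs, ground_truth in dataset:
            optimizer.zero_grad()
            outputs = model(inputs)                # forward propagation
            loss = criterion(outputs, ground_truth)
            loss.backward()                        # backward propagation
            optimizer.step()                       # adjust weights to reduce the loss
            running += loss.item()
        if running / len(dataset) < loss_threshold:  # loss control check
            break
    return model

# Usage with random tensors standing in for pre-processed training data 306.
data = [(torch.randn(8, 64), torch.randn(8, 10)) for _ in range(16)]
trained_model = train(SimpleRegressor(), data)
```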

[0035] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.

[0036] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
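
As an illustrative sketch of the pre-processing described in paragraphs [0034] and [0036], the snippet below crops an image to a region of interest at a fixed size and converts an audio clip to the frequency domain with a Fourier transform. The function names, the 256x256 target size, and the frame/hop lengths are assumptions for illustration.

```python
# Hedged sketch of image ROI cropping and audio frequency-domain conversion.
import numpy as np

def preprocess_image(image: np.ndarray, roi: tuple, size: int = 256) -> np.ndarray:
    """Extract the ROI (top, left, height, width) and fit it into a size x size frame."""
    top, left, h, w = roi
    patch = image[top:top + h, left:left + w]
    out = np.zeros((size, size, image.shape[2]), dtype=image.dtype)
    out[:min(size, patch.shape[0]), :min(size, patch.shape[1])] = patch[:size, :size]
    return out

def preprocess_audio(waveform: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Short-time Fourier magnitude of a mono waveform (frames x frequency bins)."""
    frames = [waveform[i:i + frame_len]
              for i in range(0, len(waveform) - frame_len + 1, hop)]
    window = np.hanning(frame_len)
    return np.abs(np.array([np.fft.rfft(f * window) for f in frames]))

# Example with synthetic data standing in for a training image and audio clip.
img = preprocess_image(np.random.rand(480, 640, 3), roi=(100, 200, 300, 300))
spectrogram = preprocess_audio(np.random.randn(16000))
```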

[0037] Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments, and Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments. The data processing model 240 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. In an example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
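
For concreteness, a minimal sketch of a single node 420 is given below, reading the propagation function as a non-linear activation applied to the weighted combination of the node inputs. The ReLU choice and the specific input and weight values are assumptions for illustration only.

```python
# Sketch of one node 420: weighted combination of inputs w1..w4, then activation.
import numpy as np

def node_output(inputs, weights, bias=0.0):
    """Combine the node inputs with their weights and apply a ReLU activation."""
    z = float(np.dot(weights, inputs)) + bias
    return max(0.0, z)  # non-linear activation (ReLU as one common choice)

# Four node inputs combined with weights w1, w2, w3, and w4.
y = node_output(inputs=[0.2, -0.5, 1.0, 0.3], weights=[0.7, 0.1, -0.4, 0.9])
```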

[0038] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the one or more layers include a single layer acting as both an input layer and an output layer. Optionally, the one or more layers include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down-sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.

[0039] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolutional layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, imprinting, or synthesis.

[0040] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. In an example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.

[0041] The training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set that is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer.

[0042] Figure 5 is a block diagram of an avatar generation model 500 configured to render a 3D avatar 502 based on a 2D image 504 and in synchronization with audio data 506, in accordance with some embodiments. The avatar generation model 500 receives the image 504 and audio data 506, and outputs the animated 3D avatar 502. The image 504 includes a person, and records facial features of the person. Optionally, the audio data 506 includes a voice recording message made by another person that is not in the 2D image 504. Optionally, the audio data 506 includes a voice recording message of the same person that is in the 2D image 504, and the 2D image 504 is captured at an instant that is independent of when the person makes a speech of the audio data 506. Optionally, the audio data 506 includes a voice recording message that is synthesized from a text message. Optionally, the audio data 506 includes a text message that is not converted to any voice recording message. The avatar generation model 500 includes a coarse reconstruction network (CRN) 508, a fine reconstruction network (FRN) 510, an audio-face neural network 512, and an audio-driven 3D avatar head network 514. These networks 508-514 are configured to process the image 504 and audio data 506 jointly to personalize the 3D avatar 502 and animate the 3D avatar 502 in synchronization with the audio data 506. Specifically, the 3D avatar 502 is animated with head, mouth, eye, and/or facial muscle movements. These movements are synchronized with and dynamically controlled based on the audio data 506, i.e., each movement varies dynamically and in real time with one or more of content, a volume and a pitch of a voice, a speech rate, and other characteristics of the audio data 506. For example, an instantaneous voice increase corresponds to an increase in a head movement range and a mouth of the person captured by the image 504 being opened wider, indicating that the person associated with the 3D avatar 502 is excited.
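
The data flow of Figure 5 can be summarized, purely for illustration, by composing placeholder callables for the CRN 508, FRN 510, audio-face neural network 512, audio-driven 3D avatar head network 514, and avatar renderer 530. The class and attribute names and the dictionary outputs below are assumptions; the sketch only mirrors the described ordering of the networks.

```python
# High-level sketch of the avatar generation flow in Figure 5.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class AvatarGenerationModel:
    crn: Callable[[Any], dict]            # image -> shape/expression parameters 516
    frn: Callable[[Any, dict], dict]      # image + parameters -> color texture 522, displacement 524
    audio_face_net: Callable[[Any], Any]  # audio -> audio-based face parameters 526
    head_net: Callable[..., Any]          # all of the above -> avatar driving parameters 528
    renderer: Callable[[Any], Any]        # driving parameters -> rendered video clip

    def animate(self, image, audio):
        face_params = self.crn(image)
        maps = self.frn(image, face_params)
        audio_params = self.audio_face_net(audio)
        driving_params = self.head_net(face_params, maps, audio_params)
        return self.renderer(driving_params)  # 3D avatar 502 animated with the audio
```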

[0043] The CRN 508 is configured to fit the input image to a parametric 3D face model based on a 3D digital scan of human faces, e.g., generate a plurality of face parameters 516 of a face associated with the person from the 2D image 504. The plurality of face parameters 516 include a first set of shape parameters 518 and a second set of expression parameters 520. The first set of shape parameters 518 describe a shape of the face of the person in the 2D image 504, and do not change with time. The second set of expression parameters 520 describe an expression of the face, and temporally vary with the person’s activity. The shape parameters 518 and expression parameters 520 are applied to control an identity and the expression of a face of an avatar 502 to be rendered. In some embodiments, a subset of the face parameters 516 provides information of a mouth region used to control lip movement. The subset of face parameters 516 has a number of face parameters 516 that is greater than a control threshold. In an example, the control threshold is 20, and the subset of face parameters 516 has 30 face parameters 516. From a different perspective, the plurality of face parameters 516 have a total number (e.g., 50) of face parameters, among which a first number (e.g., 22) of face parameters describe the mouth region of the person. A ratio of the first number and the total number exceeds a predefined threshold ratio (e.g., %). By these means, the 3D face model is suitable for animating the 3D avatar 502 for an audio activity (e.g., talking, singing, laughing), and the audio activity includes different movements (e.g., head, facial muscle, eye, and mouth movements) of the 3D avatar 502 which are synchronous with the audio data 506.

[0044] In some embodiments not shown in Figure 5, the CRN 508 includes a convolutional neural network (CNN) configured to regress the face parameters 516 from the 2D image 504. A face differentiable module is optionally coupled to the CNN and configured to utilize a pixel color distribution of the 2D image 504 to regulate the CNN. Further, in some embodiments, a 3D face model includes a mesh, and a topology of the mesh is assumed to be constant. The CRN 508 further includes a graph convolutional network (GCN) to predict a per-vertex color of each vertex of the mesh of the 3D face model.
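
The following is a hedged sketch of the CRN structure described in paragraph [0044]: a CNN that regresses face parameters 516 from the 2D image, plus a simple graph convolution predicting a per-vertex color over the face mesh. The layer sizes, the 50/50 shape-expression split, and the normalized-adjacency formulation of the graph convolution are assumptions for illustration.

```python
# Illustrative CRN components: a parameter-regressing CNN and a per-vertex color GCN.
import torch
import torch.nn as nn

class ParamRegressor(nn.Module):
    """CNN regressing shape (518) and expression (520) parameters from an image."""
    def __init__(self, n_shape=50, n_expr=50):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(64, n_shape + n_expr)
        self.n_shape = n_shape

    def forward(self, img):
        params = self.head(self.backbone(img))
        return params[:, :self.n_shape], params[:, self.n_shape:]  # shape, expression

class VertexColorGCN(nn.Module):
    """One graph-convolution step predicting an RGB color for each mesh vertex."""
    def __init__(self, in_dim=16, out_dim=3):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, vertex_feats, adjacency):
        # Row-normalized adjacency aggregates features from neighboring vertices.
        norm_adj = adjacency / adjacency.sum(dim=-1, keepdim=True).clamp(min=1)
        return torch.sigmoid(self.linear(norm_adj @ vertex_feats))

# Example: one 256x256 image and a toy five-vertex mesh.
shape, expr = ParamRegressor()(torch.randn(1, 3, 256, 256))
colors = VertexColorGCN()(torch.randn(5, 16), torch.eye(5))
```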

[0045] The FRN 510 is configured to reconstruct the 3D face model from the 2D image 504 and the plurality of face parameters 516, i.e., generate a color texture map 522 and a displacement map 524 of the 3D face model of the face associated with the person. The CRN 508 and FRN 510 are jointly trained using a 3D face scanning dataset that includes ground-truth 3D face models and corresponding multi-view training images that are optionally 3D scans of faces of real persons. For each 3D training face model in the training images, the CRN 508 generates a parametric training face model in the form of a plurality of training face parameters 516. In some embodiments, the 3D face scanning dataset includes a shape database and a texture database. The shape database includes a plurality of paired training images and face parameters that can be used for training the CRN 508. In some embodiments, the CRN 508 is trained separately using the shape database. On a texture side, a color texture map 522 is derived from each training image, and 3D face fine details of the respective training image are converted into a displacement map 524. Such a texture database includes a plurality of correlated training images, color texture maps 522, and displacement maps 524, which can be used for training the FRN 510. In some embodiments, the FRN 510 is trained separately using the texture database.

[0046] In some embodiments, the FRN 510 unwarps a mesh of the 3D face model using the face parameters 516 to determine a partial low-resolution color texture map having a first texture resolution. The FRN 510 includes a first generative adversarial network (GAN) configured to generate a complete and high-resolution color texture map 522 from the partial low-resolution color texture map. The high-resolution color texture map 522 has a second texture resolution that is greater than the first texture resolution. In some embodiments, a first training dataset includes first training data pairs of low-resolution and high-resolution training color texture maps, and is applied to train the first GAN of the FRN 510. In each first data pair, the low-resolution training color texture map is applied as an input to the first GAN, and the high-resolution training color texture map is applied as a ground truth to train the first GAN. Additionally, in some embodiments, the FRN 510 further includes a second GAN to generate the displacement map 524 from a color texture map 522 along with the face parameters 516. The second GAN is trained using a second training dataset including second training data pairs of high-resolution training color texture maps and training displacement maps. In each second data pair, the high-resolution training color texture map is applied as an input to the second GAN, and the training displacement map is applied as a ground truth to train the second GAN.
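
A minimal sketch of the two generators described in paragraph [0046] is given below: one mapping a partial low-resolution texture to a high-resolution color texture map 522, and one mapping that texture to a displacement map 524. The encoder-decoder architecture and channel counts are assumptions, and the adversarial discriminators and GAN losses used for training are omitted.

```python
# Illustrative image-to-image generators standing in for the two FRN GANs.
import torch
import torch.nn as nn

def image_to_image_generator(in_ch=3, out_ch=3, upscale=1):
    """Small convolutional encoder-decoder; `upscale` > 1 also raises the resolution."""
    layers = [
        nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    ]
    if upscale > 1:
        layers.append(nn.Upsample(scale_factor=upscale, mode="bilinear", align_corners=False))
    layers.append(nn.Conv2d(64, out_ch, 3, padding=1))
    return nn.Sequential(*layers)

# First GAN generator: partial low-resolution texture -> complete high-resolution texture 522.
texture_generator = image_to_image_generator(in_ch=3, out_ch=3, upscale=4)
# Second GAN generator: high-resolution texture -> single-channel displacement map 524.
displacement_generator = image_to_image_generator(in_ch=3, out_ch=1)

low_res = torch.randn(1, 3, 64, 64)
high_res_texture = texture_generator(low_res)             # 1 x 3 x 256 x 256
displacement = displacement_generator(high_res_texture)   # 1 x 1 x 256 x 256
```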

[0047] The audio-face neural network 512 is configured to receive audio data 506 (e.g., an audio data sequence) and generate a plurality of audio-based face parameters 526 (also called refined face parameters), independently of the 2D image. The audio data 506 is also used to predict a plurality of face keypoints. During the course of training the audio-face neural network 512, a third training dataset includes third data pairs of training audio data and related training face parameters, and is used to train the audio-face neural network 512. After training, the audio-face neural network 512 predicts face parameters directly from the audio data 506, and these predicted face parameters are refined to the audio-based face parameters 526 based on the predicted face keypoints, particularly around the mouth region. More details on generation of the face parameters 526 are described with reference to Figure 8.
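
For illustration, the three-part structure recited in claim 9 and paragraph [0047] can be sketched as one branch predicting face keypoints from audio features, another predicting facial parameters, and a refining step that fuses the two around the mouth. All dimensions, the feature type, and the concatenation-based fusion are assumptions for this sketch.

```python
# Hedged sketch of the audio-face neural network 512 and its refinement step.
import torch
import torch.nn as nn

class AudioFaceNetwork(nn.Module):
    def __init__(self, audio_dim=80, n_keypoints=68, n_params=50):
        super().__init__()
        self.keypoint_branch = nn.Sequential(
            nn.Linear(audio_dim, 128), nn.ReLU(), nn.Linear(128, n_keypoints * 2))
        self.param_branch = nn.Sequential(
            nn.Linear(audio_dim, 128), nn.ReLU(), nn.Linear(128, n_params))
        self.refiner = nn.Sequential(
            nn.Linear(n_params + n_keypoints * 2, 128), nn.ReLU(),
            nn.Linear(128, n_params))

    def forward(self, audio_features):
        keypoints = self.keypoint_branch(audio_features)  # predicted face keypoints
        params = self.param_branch(audio_features)        # coarse facial parameters
        refined = self.refiner(torch.cat([params, keypoints], dim=-1))
        return refined                                    # audio-based face parameters 526

# One frame of (assumed) 80-dimensional audio features, e.g., a mel spectrum slice.
audio_params = AudioFaceNetwork()(torch.randn(1, 80))
```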

[0048] In accordance with the first set of shape parameters 518, second set of expression parameters 520, color texture map 522, displacement map 524, and audio-based face parameters 526, an audio-driven 3D avatar head network 514 determines avatar driving parameters 528, and an avatar renderer 530 renders a 3D avatar of the person in a video clip in which the 3D avatar is animated for an audio activity synchronous with the audio data 506. The audio activity includes at least lip movement. The audio-driven 3D avatar head network 514 is applied to obtain the set of avatar driving parameters, and the avatar renderer 530 is applied to render a plurality of human head related visual effects, such as semi-transparent eyeball, skin details, hair strands, soft shadow, global illumination, and subsurface scattering. In some embodiments, those effects are generated by a graphics processing unit (GPU). In various embodiments of this application, the avatar renderer 530 is configured to reduce a computational cost of the plurality of human head related visual effects and enable a subset or all of the plurality of human head visual effects on a mobile device (e.g., a mobile phone 104C). Specifically, for human skin, the displacement map 524 is configured to enhance skin bumping details. In some embodiments, a sub-surface scattering (SSS) method is used to mimic a skin-like material. One or more types of light (e.g., point light and directional light) are applied to make a human head more realistic. In some embodiments, a percentage closer soft shadow (PCSS) method is used to simulate a shadow in the real world where edges are soft based on the types of light. In some embodiments, a texture-based hair strand method is used to simulate human hair to reduce computational cost. Most of the above-mentioned methods are adaptively applied to render the avatar 502 in real time.

[0049] Figures 6A, 6B, and 6C are flow charts 600, 620, and 640, respectively, of three processes of training a coarse reconstruction network (CRN) 508 for generating a 3D avatar 502, in accordance with some embodiments. Referring to Figure 6A, the CRN 508 is trained to generate a first set of shape parameters 518 describing a shape of a face based on a shape loss L_SP (604). A shape dataset 602 includes one or more training images and ground truth shape parameters. The shape parameters 518 are predicted by the CRN 508. The shape loss L_SP (604) is equal to a difference between the predicted shape parameters 518 and the ground truth shape parameters. In some situations, during training, the shape loss L_SP (604) is optimized (e.g., minimized) by adjusting the weights of the filters of the CRN 508.
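
A minimal sketch of the shape loss L_SP is shown below, assuming the "difference" between predicted and ground-truth shape parameters is measured as a mean squared (L2) error; the function name is a placeholder.

```python
# Minimal sketch of the shape-parameter loss L_SP; the L2 (mean squared error) form
# is an assumption, as the text specifies only "a difference".
import torch
import torch.nn.functional as F

def shape_loss(pred_shape: torch.Tensor, gt_shape: torch.Tensor) -> torch.Tensor:
    # Penalize the deviation of predicted shape parameters from the ground truth.
    return F.mse_loss(pred_shape, gt_shape)
```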

[0050] Referring to Figure 6B, the CRN 508 is trained to generate a second set of expression parameters 520 describing an expression of the face based on expression losses L_EP (614) and L_T (616). An expression dataset 606 includes a plurality of training images containing different facial expressions. The CRN 508 is trained to provide expression parameters 520 for the different facial expressions using this dataset 606 and the loss functions. Such a CRN 508 enables the avatar generation model 500 to produce a more accurate avatar 502 with the different facial expressions. In an example, each data sample includes a set of three training images 606A, 606B, and 606C. The training images 606A and 606B have a first facial expression, and the image 606C has a second, distinct facial expression. The CRN 508 generates a first expression parameter 520A corresponding to the training image 606A and a second expression parameter 520B corresponding to the training image 606B. Given that the training images 606A and 606B have the same first facial expression, a first expression loss L_EP (614) between the first and second expression parameters 520A and 520B is substantially equal to zero (e.g., less than a threshold expression difference). Conversely, a second expression loss L_T (616) is defined in terms of EP_1, EP_2, and EP_3, which are the expression parameters 520A, 520B, and 520C, respectively (e.g., based on L2 distances between EP_3 and each of EP_1 and EP_2). Given that the training image 606C has a facial expression distinct from that of 606A and 606B, the difference L_T should be large. As such, the avatar generation model 500 is configured to predict substantially similar expression parameters 520A and 520B for the first two training images 606A and 606B, and to predict a different expression parameter 520C (as measured by the second expression loss L_T (616), which is based on an L2 distance) for the third training image 606C. During a training process, the CRN 508 is refined iteratively to minimize the first expression loss L_EP (614) and to maximize the second expression loss L_T (616).
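
The following sketch illustrates one plausible formulation of the two expression losses, assuming L2 distances and a triplet-style combination for minimizing L_EP while maximizing L_T; the exact form of L_T is not reproduced in the text above, so this is an assumption rather than the disclosed formula.

```python
# Plausible sketch of the expression losses L_EP and L_T; the exact definition of L_T
# is an assumption based on the L2-distance description above.
import torch

def expression_losses(ep1: torch.Tensor, ep2: torch.Tensor, ep3: torch.Tensor):
    # L_EP: images 606A and 606B share an expression, so their parameters should match.
    l_ep = torch.norm(ep1 - ep2, p=2)
    # L_T: image 606C has a distinct expression, so its parameters should differ
    # from those of 606A and 606B.
    l_t = torch.norm(ep1 - ep3, p=2) + torch.norm(ep2 - ep3, p=2)
    return l_ep, l_t

def combined_expression_loss(ep1, ep2, ep3, margin: float = 1.0):
    # Minimizing L_EP while maximizing L_T can be folded into one triplet-style objective;
    # the margin value is illustrative.
    l_ep, l_t = expression_losses(ep1, ep2, ep3)
    return l_ep + torch.clamp(margin - l_t, min=0.0)
```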

[0051] In some embodiments, Facial Action Coding System (FACS) standards are applied to associate the expression parameters 520 with a plurality of human facial movements, thereby describing the different facial expressions more precisely. For example, a predefined number of (e.g., 50) expression parameters 520 are organized in an ordered sequence of expression parameters 520, and each expression parameter 520 corresponds to an action unit number and a FACS name that represent one or more muscular controls on a human face. Each expression parameter 520 indicates an intensity level of the one or more muscular controls associated with the corresponding FACS name.
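
The following is an illustrative encoding of expression parameters as FACS action-unit intensities; the specific action units and names listed are examples only and do not reflect the full predefined sequence.

```python
# Illustrative mapping of ordered expression parameters to FACS action-unit intensities.
# The action units shown are examples, not the disclosed parameter sequence.
from typing import Dict, Tuple

# Index in the ordered expression-parameter sequence -> (FACS action unit, name).
FACS_UNITS: Dict[int, Tuple[str, str]] = {
    0: ("AU1", "Inner Brow Raiser"),
    1: ("AU12", "Lip Corner Puller"),
    2: ("AU26", "Jaw Drop"),
}

def describe_expression(expression_params) -> Dict[str, float]:
    # Each parameter value is read as the intensity of its corresponding muscular control.
    return {FACS_UNITS[i][0]: float(v)
            for i, v in enumerate(expression_params) if i in FACS_UNITS}
```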

[0052] Referring to Figure 6C, in some embodiments, a training dataset 608 includes a number of interview videos, and image frames that contain lip movement are extracted from these interview videos. In each extracted image frame, keypoints are detected around a mouth region as ground-truth keypoints, and the mouth region is also segmented using computer vision techniques. When the CRN 508 is fine-tuned with this dataset 608, a corresponding mouth loss 610 is a combination of two losses, including a mouth keypoint loss 610A and a mouth rendering loss 610B. The mouth keypoint loss 610A indicates a difference of physical locations between predicted mouth keypoints 612 and the ground-truth keypoints, and the mouth rendering loss 610B indicates a color difference between a rendered mouth region 618 of a predicted face and the ground-truth mouth region. By optimizing these two mouth losses 610A and 610B, the CRN 508 is adjusted to refine the face parameters 516 around the mouth region iteratively, and the resulting face parameters 516 can be applied to reconstruct complex lip movement on a human head model.
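
A minimal sketch of the combined mouth loss 610 is shown below, assuming an L2 keypoint term and an L1 photometric term restricted to the segmented mouth region; the weights and function names are placeholders.

```python
# Sketch of the combined mouth loss used for fine-tuning; the L2 keypoint term and
# masked L1 photometric term are assumptions consistent with the description above.
import torch
import torch.nn.functional as F

def mouth_loss(pred_keypoints, gt_keypoints, rendered_mouth, gt_image, mouth_mask,
               keypoint_weight: float = 1.0, render_weight: float = 1.0):
    # Keypoint loss 610A: distance between predicted and ground-truth mouth keypoints.
    l_kp = F.mse_loss(pred_keypoints, gt_keypoints)
    # Rendering loss 610B: color difference restricted to the segmented mouth region.
    l_render = F.l1_loss(rendered_mouth * mouth_mask, gt_image * mouth_mask)
    return keypoint_weight * l_kp + render_weight * l_render
```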

[0053] Figure 7 is a block diagram of a fine reconstruction network (FRN) 510, in accordance with some embodiments. The FRN 510 includes an unwarp module 702, a first generative adversarial network (GAN) 704, and a second GAN 706. The FRN 510 is configured to receive an image 504 and the face parameters 516 from the CRN 508, and unwarp the image 504 to a partial low-resolution color texture map 708 using the unwarp module 702. The first GAN 704 is configured to generate a complete and high-resolution color texture map 522 from the partial low-resolution color texture map 708. The second GAN 706 is configured to generate a displacement map 524 from the color texture map 522 along with the face parameters 516 received from the CRN 508. In some embodiments, during training, a first training dataset includes first training data pairs of low-resolution and high-resolution training color texture maps, and is applied to train the first GAN 704. A second training dataset includes second data pairs of high-resolution training color texture maps and training displacement maps, and is applied to train the second GAN 706 separately. Alternatively, in some embodiments, a texture training dataset 710 includes sets of a face mesh 712, a high-resolution training color texture map, and a training displacement map, and is applied to train the FRN 510 in an end-to-end manner.
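
The following is a structural sketch of the FRN forward pass described above; the constituent modules are passed in as placeholders standing in for the unwarp module 702 and the generators of the two GANs 704 and 706, and are not implementations of the disclosed networks.

```python
# Structural sketch of the FRN data flow; the sub-modules are placeholders.
import torch.nn as nn

class FineReconstructionNetwork(nn.Module):
    def __init__(self, unwarp_module, texture_gan, displacement_gan):
        super().__init__()
        self.unwarp = unwarp_module               # image + face parameters -> partial texture
        self.texture_gan = texture_gan            # partial low-res -> complete high-res texture
        self.displacement_gan = displacement_gan  # texture + face parameters -> displacement map

    def forward(self, image, face_params):
        partial_texture = self.unwarp(image, face_params)
        color_texture = self.texture_gan(partial_texture)
        displacement = self.displacement_gan(color_texture, face_params)
        return color_texture, displacement
```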

[0054] Figure 8 is a block diagram of an audio-face neural network 512, in accordance with some embodiments. The audio-face neural network 512 includes a first audio-face neural network 802, a second audio-face neural network 804, and a face refining network 806. The first audio-face neural network 802 is configured to receive audio data 506 and generate a plurality of face keypoints 808, including a subset of mouth keypoints associated with a mouth region, based on the audio data 506. The second audio-face neural network 804 is configured to generate a plurality of face parameters 810 from the audio data 506. The plurality of face parameters 810 include one or more shape parameters describing a shape of a face and/or one or more expression parameters describing an expression of the face at the time of generating the audio data 506. The face refining network 806 is configured to generate refined face parameters 526 from the face keypoints 808 and face parameters 810. The refined face parameters 526 are applied to drive the 3D avatar 502. In some embodiments, a third training dataset includes third data pairs of training audio data and related training face parameters, and is applied to train at least the second audio-face neural network 804. The face parameters 810 generated by the second audio-face neural network 804 are further refined by the predicted face keypoints 808, e.g., around the mouth region.
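
The following is a structural sketch of the audio-face pipeline of Figure 8, with placeholder sub-modules standing in for the keypoint network 802, the parameter network 804, and the refining network 806; the audio feature representation is an assumption.

```python
# Structural sketch of the audio-face pipeline; the sub-modules are placeholders.
import torch.nn as nn

class AudioFaceNetwork(nn.Module):
    def __init__(self, keypoint_net, param_net, refiner):
        super().__init__()
        self.keypoint_net = keypoint_net  # audio -> face keypoints (incl. mouth keypoints)
        self.param_net = param_net        # audio -> coarse shape/expression parameters
        self.refiner = refiner            # keypoints + parameters -> refined face parameters

    def forward(self, audio_features):
        keypoints = self.keypoint_net(audio_features)
        face_params = self.param_net(audio_features)
        # Refinement concentrates on the mouth region so lip motion tracks the audio.
        return self.refiner(face_params, keypoints)
```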

[0055] Figure 9 is a flow diagram of a method 900 for generating or driving a 3D avatar, in accordance with some embodiments. For convenience, the method 900 is described as being implemented by a computer system (e.g., a client device 104, a server 102, or a combination thereof). In some embodiments, the client device 104 is a mobile phone 104C, AR glasses 104D, a smart television device, or a drone. Method 900 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 9 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the computer system 200 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 900 may be combined and/or the order of some operations may be changed.

[0056] The computer system obtains (902) a two-dimensional (2D) image 504, and the 2D image 504 includes a person. The computer system obtains (904) audio data 506, and the audio data 506 is independent of the 2D image 504. Optionally, the audio data 506 is made by a second person that is distinct from the person in the 2D image 504. Optionally, the audio data 506 is made by the same person in the 2D image 504. In either case, the content of the audio data 506 is independent of (e.g., not related to) that of the 2D image 504.

[0057] The computer system generates (906), from the 2D image 504, a plurality of face parameters 516 of a face associated with the person. The plurality of face parameters 516 includes (908) a first set of shape parameters 518 describing a shape of the face and a second set of expression parameters 520 describing an expression of the face. In some embodiments, the plurality of face parameters 516 have (910) a total number of face parameters, among which a first number of face parameters describe a mouth region of the person, and a ratio of the first number to the total number exceeds a predefined threshold ratio. In some embodiments, the plurality of face parameters 516 of the face are generated from the 2D image using a first reconstruction network (e.g., a CRN 508), and the first reconstruction network includes a convolutional neural network (CNN). Further, in some embodiments, the 3D face model includes a plurality of vertices, and the first reconstruction network includes a graph convolutional network (GCN) configured to predict a color for each vertex of the 3D face model.

[0058] The computer system generates (912), from the 2D image 504, a color texture map 522 and a displacement map 524 of a three-dimensional (3D) face model of the face associated with the person based on the plurality of face parameters 516. In some embodiments, the color texture map 522 and displacement map 524 of the 3D face model are generated from the 2D image 504 using a second reconstruction network (e.g., an FRN 510), and the second reconstruction network includes a first generative adversarial network (GAN) 704 and a second GAN 706. The first GAN 704 is configured to convert a low-resolution color texture map 708 to a high-resolution color texture map 522, and the second GAN 706 is configured to convert the high-resolution color texture map 522 to the displacement map 524.

[0059] The computer system generates (914), from the audio data 506, a plurality of audio-based face parameters 526, independently of the 2D image 504, e.g., using an audio-face neural network 512. In some embodiments, the audio-face neural network 512 includes a first audio-face neural network 802 configured to predict a plurality of face keypoints 808 from the audio data 506, a second audio-face neural network 804 configured to generate a plurality of face parameters 810 from the audio data 506, and a face refining network 806 configured to refine the plurality of face parameters 810 with the plurality of face keypoints 808 around a mouth region to generate the plurality of audio-based face parameters 526 (also called refined face parameters).

[0060] In accordance with the first set of shape parameters 518, second set of expression parameters 520, color texture map 522, displacement map 524, and audio-based face parameters 526, the computer system renders (916) a 3D avatar 502 of the person in a video clip in which the 3D avatar 502 is animated for an audio activity synchronous with the audio data 506. The audio activity includes lip movement. In some embodiments, a plurality of avatar driving parameters 528 are generated (918) from the first set of shape parameters 518, second set of expression parameters 520, color texture map 522, displacement map 524, and audio-based face parameters 526 based on an audio-driven 3D avatar head network 514. The computer system creates (920) the video clip including the 3D avatar 502 of the person based on the plurality of avatar driving parameters 528. In some embodiments, the 3D avatar 502 of the person in the video clip is rendered with one or more of: semi-transparent eyeball, skin details, hair strands, soft shadow, global illumination, and subsurface scattering.
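
The following sketch wires the operations of method 900 together at a high level; the callable names are placeholders for the CRN 508, FRN 510, audio-face neural network 512, audio-driven 3D avatar head network 514, and avatar renderer 530, and the interfaces are assumptions for illustration.

```python
# High-level sketch of method 900; all callables and their interfaces are placeholders.
def render_audio_driven_avatar(image, audio, crn, frn, audio_face_net, head_net, renderer):
    shape_params, expr_params = crn(image)                                   # steps 906, 908
    color_texture, displacement = frn(image, (shape_params, expr_params))    # step 912
    audio_face_params = audio_face_net(audio)                                # step 914
    driving_params = head_net(shape_params, expr_params,
                              color_texture, displacement, audio_face_params)  # step 918
    return renderer(driving_params)                                          # step 920: video clip
```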

[0061] In some embodiments, the computer system obtains a shape dataset 602 including a plurality of shape training images and a plurality of shape ground truths corresponding to the plurality of shape training images. A subset of the plurality of shape training images is fed to the first reconstruction network (e.g., the CRN 508) to generate a plurality of shape parameters 518. A shape parameter loss L_SP 604 is identified between the plurality of generated shape parameters and the plurality of shape ground truths. The first reconstruction network is iteratively trained using the plurality of shape training images of the shape dataset based on the shape parameter loss L_SP 604.

[0062] In some embodiments, a first training image 606A, a second training image 606B, and a third training image 606C are applied to train the first reconstruction network. The first and second training images 606A and 606B correspond to a first facial expression, and the third training image 606C corresponds to a second facial expression distinct from the first facial expression. The first, second, and third training images 606A-606C are fed to the first reconstruction network to generate a first set of expression parameters 520A, a second set of expression parameters 520B, and a third set of expression parameters 520C, respectively. A first expression loss 614 (e.g., L_EP) is equal to a difference between the first set of expression parameters 520A and the second set of expression parameters 520B. A second expression loss 616 (e.g., L_T) of the third training image 606C is identified with respect to the first and second training images 606A and 606B. The first reconstruction network is iteratively trained based on the first expression loss 614 and the second expression loss 616. Specifically, the first reconstruction network is trained to make the first expression loss 614 substantially close to zero and to maximize the second expression loss 616.

[0063] In some embodiments, a plurality of training lip images (e.g., of interview videos 608) are applied, and each training lip image includes a lip and a plurality of mouth ground truth keypoints. The plurality of training lip images are fed to the first reconstruction network to generate a first set of mouth keypoints 612 and a second set of face parameters 516. A mouth keypoint loss 610A is identified between the first set of mouth keypoints 612 and the plurality of mouth ground truth keypoints. For each training lip image, a mouth region 618 is rendered using the second set of face parameters 516, and a mouth rendering loss 610B is identified between the rendered mouth region 618 and the training lip image. The first reconstruction network is iteratively trained based on the mouth keypoint loss 610A and the mouth rendering loss 610B.

[0064] In some embodiments, the plurality of face parameters 516 of the face are generated from the 2D image 504 using a first reconstruction network (e.g., a CRN 508), and the color texture map 522 and displacement map 524 of the 3D face model are generated from the 2D image 504 using a second reconstruction network (e.g., an FRN 510). The plurality of audio-based face parameters 526 are generated from the audio data 506 using an audio-face neural network 512. The first reconstruction network, second reconstruction network, and audio-face neural network 512 are trained. The method 900 is implemented at a server 102, and the video clip is streamed to an electronic device 104 communicatively coupled to the server 102.

[0065] Alternatively, in some embodiments, the plurality of face parameters 516 of the face are generated from the 2D image 504 using a first reconstruction network. The color texture map 522 and displacement map 524 of the 3D face model are generated from the 2D image 504 using a second reconstruction network. The plurality of audio-based face parameters 526 are generated from the audio data 506 using an audio-face neural network 512. The first reconstruction network, second reconstruction network, and audio-face neural network 512 are trained at a server 102, and provided to an electronic device 104 communicatively coupled to the server 102. The method 900 is implemented at the electronic device.

[0066] In various embodiments of this application, a parametric face model is applied to enable fine control around a mouth region of a 3D human model of an avatar. This allows the avatar to be animated with complex lip movements, particularly when the avatar is talking. The face parameters 516, color texture map 522, displacement map 524, and refined face parameters 526 are readily applicable in avatar rendering without any further manual enhancement. Color information of an input image is utilized during differentiable rendering in 3D head reconstruction. Lip movement prediction from the audio data 506 is easy to use and produces natural results. Complex rendering effects in facial rendering are made available on mobile phones that have limited power, computational, or storage resources. As such, high-resolution 3D faces are rendered on mobile phones in real time and without losing visual performance.

[0067] It should be understood that the particular order in which the operations in Figure 9 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to render a 3D avatar in synchronization with audio data as described herein. Additionally, it should be noted that details of other processes described above with respect to Figures 5-8 are also applicable in an analogous manner to the method 900 described above with respect to Figure 9. For brevity, these details are not repeated here.

[0068] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.

[0069] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.

[0070] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.

[0071] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.