

Title:
METHODS AND SYSTEMS FOR RENDERING A MODEL OF A USER USING HAIR DENSITY ESTIMATION
Document Type and Number:
WIPO Patent Application WO/2023/227198
Kind Code:
A1
Abstract:
Systems and methods for rendering a model of the face of a user in real-time. A machine learning trained model (901) is used to generate skin parameters for a BSSRDF function from real or synthetic image data, depth data, albedo data, and IR data. Hair growth and degeneration are estimated using additional machine learned models (401, 501) from a sequence of images. The systems and methods provide, in real-time, a model or synthetic representation of a user that includes accurate skin rendering and estimated hair rendering and that can track the evolution of hair loss and regrowth.

Inventors:
CHAGANTI SHIKHA (US)
COMANICIU DORIN (US)
KAPOOR ANKUR (US)
PEICHO DAVID (GB)
TEIXEIRA BRIAN (US)
YU DAPHNE (US)
Application Number:
PCT/EP2022/063942
Publication Date:
November 30, 2023
Filing Date:
May 23, 2022
Assignee:
SIEMENS HEALTHCARE GMBH (DE)
International Classes:
G06T17/00
Foreign References:
JP2006051210A (2006-02-23)
Other References:
TIM WEYRICH ET AL: "Analysis of Human Faces Using a Measurement-Based Skin Reflectance Model", ACM SIGGRAPH 2006 Papers, International Conference on Computer Graphics and Interactive Techniques, Boston, Massachusetts, ACM, New York, NY, USA, 1 July 2006, pages 1013-1024, XP058326332, ISBN: 978-1-59593-364-5, DOI: 10.1145/1179352.1141987
KIM MINKI ET AL: "Evaluation of Automated Measurement of Hair Density Using Deep Neural Networks", Sensors, vol. 22, no. 2, 14 January 2022, page 650, XP093007126, ISSN: 1424-8220, DOI: 10.3390/s22020650
ABBAS FAYCAL ET AL: "Efficient Deep Neural Network Architectures for Subsurface Scattering Approximation", 2022 7th International Conference on Image and Signal Processing and their Applications (ISPA), IEEE, 8 May 2022, pages 1-4, XP034131352, DOI: 10.1109/ISPA54004.2022.9786314
CHEN TENN F ET AL: "Hyperspectral Modeling of Skin Appearance", ACM Transactions on Graphics, vol. 34, no. 3, 8 May 2015, pages 1-14, XP058516452, ISSN: 0730-0301, DOI: 10.1145/2701416
Claims:
WHAT IS CLAIMED IS:

1. A system (100) for generating a model of a face of a user, the system comprising: one or more sensors (105) configured to acquire (200) sensor data of the face of the user; a processor (102) configured to estimate (202) subsurface scattering parameters for a diffuse BSSRDF function using a machine learning model, the estimation based on albedo data, depth data, and infrared data acquired by the one or more sensors or derived from the sensor data; the processor (102) further configured to estimate (203) hair density using a neural network (401) and a temporal generative model (501), the estimated hair density based on estimating hair densities derived from a sequence of images from the sensor data; a rendering system (101) configured to render (204) the model of the face based on a facial geometry of the user, the diffuse BSSRDF function using the subsurface scattering parameters, and the estimated hair density; and a display (106) configured to display the rendered model.

2. The system (100) of claim 1, wherein the subsurface scattering parameters comprise a reduced scattering coefficient and an absorption coefficient.

3. The system (100) of claim 1, wherein the processor (102) is further configured to estimate a percentage of hair degeneration or hair growth based on the estimated hair densities of different images in the sequence of images.

4. The system (100) of claim 1, wherein the one or more sensors (105), the processor (102), the rendering system (101), and the display (106) are included in a single handheld device.

5. The system (100) of claim 1, wherein the sequence of images comprises images acquired at least a week apart from one another.

6. The system (100) of claim 1, wherein the rendering system (101) comprises at least a conditional generative model (701) configured to generate hair for the model of the user based on an image of the sequence of images and a selected time.

7. The system (100) of claim 1, wherein the sequence of images comprises at least three images taken over a period of time exceeding a month.

8. A method for estimating hair density evolution over time, the method comprising: acquiring (A110) a sequence of RGB images of a user; estimating (A120), from a first image of the sequence of RGB images, a first density of hair of the user using a neural network (401); estimating (A130), from a second image of the sequence of RGB images, a second density of hair of the user using the neural network (401); optimizing (A140) a latent vector provided to a temporal generative model (501), wherein the temporal generative model takes as input the latent vector and a temporal value and generates an estimated hair density for the temporal value; and extrapolating (A150), by the temporal generative model (501), hair density for a future temporal value.

9. The method of claim 8, further comprising: inputting an RGB image of the sequence of RGB images of the user and the estimated hair density into a conditional generative model (701); and outputting (A160), by the conditional generative model (701), a new RGB image depicting the user’s hair with a target density for a selected time.

10. The method of claim 8, further comprising: estimating, based on at least the first density, the second density, and a temporal difference between the first image and second image, a percentage of hair degeneration or hair growth.

11. The method of claim 8, wherein the temporal generative model (501) further inputs additional information entered by the user to refine its estimation.

12. The method of claim 8, wherein the sequence of RGB images comprises at least three images taken over a period of time exceeding a month.

13. The method of claim 8, wherein the neural network (401) is configured to segment and classify an image, the neural network (401) configured to determine a density of hair from the segmentation and classification.

14. A method for estimating subsurface scattering parameters for a diffuse BSSRDF function, the method comprising: capturing (A210) one or more images of a face of a user; acquiring (A210) albedo data, depth data, and infrared data of the face; generating (A220), by a machine learning model (901), the subsurface scattering parameters based on the one or more images, the albedo data, the depth data, and the infrared data, the machine learning model (901) trained using a dataset containing measured absorption coefficients and reduced scattering coefficients, corresponding color, corresponding depth, and corresponding infrared information; rendering (204), using a rendering system (101) and the diffuse BSSRDF function with the subsurface scattering parameters, the face; and displaying the rendered face.

15. The method of claim 14, further comprising: estimating, using a temporal generative network (501), hair density for a selected time; wherein rendering the face comprises rendering the face with the estimated hair density.

16. The method of claim 14, further comprising: determining an estimated hair density based on the one or more images; and rendering and displaying the face with the estimated hair density.

17. The method of claim 14, wherein the subsurface scattering parameters comprise a reduced scattering coefficient and an absorption coefficient.

18. The method of claim 14, wherein acquiring comprises generating synthetic albedo, synthetic depth, and synthetic infrared data from the one or more images.

19. The method of claim 14, wherein acquiring comprises predicting albedo data and infrared data from one or more RGBD images.

20. The method of claim 14, wherein the one or more images of the user are captured by a handheld device (100).

Description:
METHODS AND SYSTEMS FOR RENDERING A MODEL OF A USER USING HAIR DENSITY ESTIMATION

FIELD

[0001] The present embodiments relate to rendering a model of a user.

BACKGROUND

[0002] Modeling or reconstruction of human faces and bodies is an important field that has a wide range of applications in areas such as virtual reality, gaming, virtual shopping, and teleconferencing, among others. However, reconstructing a face with a high level of accuracy, effectiveness, and convenience is a complex task that requires multiple steps.

[0003] In an example, sensors or cameras are used to capture an image of a user. Different mechanisms may be used to obtain or derive three-dimensional measurements of the user, which are then used, for example, to determine a facial geometry of the user. Additional components such as skin, hair, and facial features must then be generated and rendered to provide a high-quality facial reconstruction. These additional components present some of the most difficult computer graphics challenges, as humans are incredibly adept at interpreting facial appearance and noticing any issues. The term uncanny valley was coined to describe the common unsettling feeling that people experience when simulations closely resemble humans in many respects but are not quite convincingly realistic. Although a lot of effort has been devoted to face modeling in computer graphics and software, models still fail to provide an acceptable representation while remaining practical. There are many issues, but two that are necessary to overcome are accurately depicting hair and skin in an efficient and practical manner.

SUMMARY

[0004] In a first aspect, a system is provided for generating a model of a face of a user. The system includes one or more sensors, a processor, a rendering system, and a display. The one or more sensors are configured to acquire sensor data of the face of the user. The processor is configured to estimate subsurface scattering parameters for a diffuse BSSRDF function using a machine learning model, the estimation based on albedo, depth, and infrared data acquired by the one or more sensors or derived from the sensor data. The processor is further configured to estimate hair density using a neural network and a temporal generative model, the estimated hair density based on estimating hair densities derived from a sequence of images from the sensor data. The rendering system is configured to render the model of the face based on a facial geometry, the diffuse BSSRDF function using the subsurface scattering parameters, and the estimated hair density. The display is configured to display the rendered model.

[0005] The subsurface scattering parameters may comprise a reduced scattering coefficient and an absorption coefficient. The processor may be further configured to estimate a percentage of hair degeneration or hair growth based on the estimated hair densities of different images in the sequence of images. The one or more sensors, the processor, the rendering system, and the display may be included in a single handheld device. The sequence of images may comprise images acquired at least a week apart from one another. The rendering system may comprise at least a conditional generative model configured to generate hair for the model of the user based on an image of the sequence of images and a selected time. The sequence of images may comprise at least three images taken over a period of time exceeding a month.

[0006] In a second aspect, a method is provided for estimating hair density evolution over time, the method comprising: acquiring a sequence of RGB images of a user; estimating, from a first image of the sequence of RGB images, a first density of hair of the user using a neural network; estimating, from a second image of the sequence of RGB images, a second density of hair of the user using the neural network; optimizing a latent vector provided to a temporal generative model, wherein the temporal generative model takes as input the latent vector and a temporal value and generates an estimated hair density for the temporal value; and extrapolating, by the temporal generative model, hair density for a future temporal value.

[0007] In an embodiment, the method further includes inputting an RGB image of the sequence of RGB images of the user and the estimated hair density into a conditional generative model and outputting, by the conditional generative model, a new RGB image depicting the user's hair with a target density for a selected time.

[0008] In an embodiment, the method further includes estimating, based on at least the first density, the second density, and a temporal difference between the first image and second image, a percentage of hair degeneration or hair growth.

[0009] The temporal generative model may further input additional information entered by the user to refine its estimation. The sequence of RGB images may comprise at least three images taken over a period of time exceeding a month. The neural network may be configured to segment and classify an image, the neural network also configured to determine a density of hair from the segmentation and classification.

[0010] In a third aspect, a method is provided for estimating subsurface scattering parameters for a diffuse BSSRDF function, the method comprising: capturing one or more images of a face of a user; acquiring albedo data, depth data, and infrared data of the face; generating, by a machine learning model, the subsurface scattering parameters based on the one or more images, the albedo data, the depth data, and the infrared data, the machine learning model trained using a dataset containing measured absorption coefficients and reduced scattering coefficients, corresponding color, corresponding depth, and corresponding infrared information; rendering, using a rendering system and the diffuse BSSRDF function with the subsurface scattering parameters, the face; and displaying the rendered face.

[0011] In an embodiment, the method further comprises estimating, using a temporal generative network, hair density for a selected time. Rendering the face comprises rendering the face with the estimated hair density.

[0012] In an embodiment, the method further comprises determining an estimated hair density based on the one or more images and rendering and displaying the face with the estimated hair density.

[0013] In an embodiment, the subsurface scattering parameters comprise a reduced scattering coefficient and an absorption coefficient.

[0014] Acquiring may comprise generating synthetic albedo, synthetic depth, and synthetic infrared data from the one or more images or predicting albedo data and infrared data from one or more RGBD images.

[0015] In an embodiment, the one or more images of the user are captured by a handheld device.

[0016] Any one or more of the aspects described above may be used alone or in combination. These and other aspects, features and advantages will become apparent from the following detailed description of preferred embodiments, which is to be read in connection with the accompanying drawings. The present invention is defined by the following claims, and nothing in this section should be taken as a limitation on those claims. Further aspects and advantages of the invention are discussed below in conjunction with the preferred embodiments and may be later claimed independently or in combination.

BRIEF DESCRIPTION OF THE DRAWINGS

[0017] The components and the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. Moreover, in the figures, like reference numerals designate corresponding parts throughout the different views.

[0018] Figure 1 depicts an example system for rendering a model of a user in real-time according to an embodiment.

[0019] Figure 2 depicts an example workflow for rendering a model of a user in real-time according to an embodiment.

[0020] Figure 3 depicts an example workflow for estimating hair density and generating a synthetic image therefrom according to an embodiment.

[0021] Figure 4 depicts an example of estimating hair density from two different images taken at two different times according to an embodiment.

[0022] Figure 5 depicts an example of optimizing a latent vector according to an embodiment.

[0023] Figure 6 depicts an example of the output of the temporal generative model according to an embodiment.

[0024] Figure 7 depicts an example of an estimated image of the hair of a user at a future time according to an embodiment.

[0025] Figure 8 depicts an example method for calculating parameters for rendering skin of a user according to an embodiment.

[0026] Figure 9 depicts an example workflow for rendering the skin of a human model according to an embodiment.

DETAILED DESCRIPTION

[0027] Embodiments provide systems and methods for rendering a model of a user in real-time. A machine learning trained model is used to generate skin parameters for a BSSRDF function from real or synthetic RGB (red, green, blue) data, depth data, albedo data, and IR data. Hair growth and degeneration are estimated using additional machine learned models from a sequence of images. The systems and methods provide, in real-time, a model or synthetic representation of a user that includes accurate skin rendering and estimated hair rendering, that can track the evolution of hair loss and regrowth, and that can show how the user may look in the future.

[0028] Previous attempts to create accurate models have been successful, but they either require complex hardware and setups that users do not have access to, or use coarse approximations of the parameters that can be far from their true values.

[0029] Hair modeling is important for creating convincing virtual humans. Modeling hair is also required in a large number of applications, for example in makeup, advertisement, and so on. However, measuring the hair of a user and rendering a depiction of the hair over time is complicated due to the inherent characteristics of hair, for example the microscale geometry and the large number of hair strands. Previous hair reconstruction methods use either a single photo (thus compromising 3D quality) or multiple views, but they require manual user interaction (manual hair segmentation and capture of fixed camera views spanning a full 360°).

[0030] Modeling skin is likewise complicated, as realism requires a very complex model for the skin reflectance of human faces. Skin reflectance varies for different people, e.g., due to age, race, gender, health, etc. Skin reflectance also varies for the same person throughout the course of a day, e.g., hot vs. cold skin, or dry vs. wet. Previous attempts to accurately estimate skin parameters use complex setups. Parameter estimation for skin parameters has also been performed using classic optimization techniques, such as the Quasi-Minimal Residual algorithm, which provides subpar results. Generating accurate skin typically requires complex measurement systems, such as the measurement-based skin reflectance model described in "Analysis of Human Faces Using a Measurement-Based Skin Reflectance Model." However, these complex measurement systems are not practical. One example includes domes containing carefully disposed cameras with hundreds of LEDs. These approaches are limited for practical use because they require a complex, custom setup with expensive devices and a room with a lot of space, and as such users cannot acquire the data by themselves. Other approaches have tried to estimate parameters from a single image, such as bidirectional scattering-surface reflectance distribution function (BSSRDF) estimation from single images. However, these approaches typically use coarse light estimation and highlight removal that negatively affect the estimation, coarsely reconstruct depth from an image using inaccurate algorithms, or use optimization methods that may not converge properly or that converge only over a long period. There is a need for systems and methods for rendering a model of a user that are practical and accurate.

[0031] Embodiments provide for parameter estimation that uses a supervised machine learning system giving absorption and reduced scattering coefficients to a rendering system. Embodiments also provide for hair density estimation over time, allowing for a more detailed and accurate representation of a user or other human face. The combination provides a quick, efficient, and accurate model of a human face or body.

[0032] Advantages of this implementation include time and cost benefits, reliance on commonly used and available hardware (no specialized cameras or sensors), and ease of use. The systems and methods reduce the need for complex hardware (cost reduction) and improve speed (for example, less than a second on commonly used consumer devices). In addition, with the ability to view future hair density or coverage, users experiencing hair growth issues (as a result of stress, genetic hair loss, or following cancer treatment) can accurately see how their hair could look in the future. The combination of the improved skin parameter estimation and the hair density estimation provides useful representations while solving the technical problems that arise from requiring advanced or specialized equipment.

[0033] Figure 1 depicts an example system for rendering a model of a user in real-time. The system includes a device 100 connected optionally with a server 107. The device includes a rendering system 101 that includes at least a processor 102 and a memory 103. The processor 102 may be part of the rendering system 101 or may provide analysis and processing for the rendering system 101 while being configured to perform additional tasks or instructions for the device 100. The device 100 further includes a display 106 and one or more sensors 105 configured to capture or acquire sensor data such as image data, depth data, infrared data, etc. The rendering system 101 acquires data from the sensors 105, the memory 103, or the server 107. The rendering system 101 renders a model or representation of a user that is displayed on the display 106. A workstation or computer may be used with the device 100. Additional, different, or fewer components may be provided. For example, a computer network is included for remote processing or storage or communication with the server 107. As another example, a user input device (e.g., keyboard, buttons, sliders, dials, trackball, mouse, or other device) is provided for input to the device 100.

[0034] In an embodiment, the rendering system 101 and the sensor(s) are included in a device 100 such as a laptop computer, desktop computer, tablet computer, smartphone, or other handheld device. The device 100 may be equipped with one or more sensors 105 that are configured to capture data such as albedo, depth, and infrared data. The device 100 also includes an imaging sensor that is configured to acquire a single RGB image or a sequence of RGB images (or other types of image data). The imaging sensor may also be configured to acquire albedo maps and depth and infrared images, for example predicted from the acquired RGB or RGBD images. One sensor is shown, but multiple different sensors 105 may be used. The sensor data may also be acquired from the server or the memory 103. In an example, a sequence of images is acquired at different temporal points. The images may be acquired by the same device 100 or by different devices. A user may input a sequence of images that were taken in the past and stored in the memory 103 in addition to a captured image or images that are acquired in real time.

[0035] For depth data, the sensor may directly measure depth from the sensor to the user. The sensor may include a separate processor for determining depth measurements from images, or the processor 102 determines the depth measurements from images captured by the sensor. The depth sensor may be or include a LIDAR, 2.5D, RGBD, stereoscopic optical sensor, or other depth sensor. Albedo data may be captured directly or estimated from the other data. Albedo is the measure of the diffuse reflection of solar radiation out of the total solar radiation and is measured on a scale from 0, corresponding to a black body that absorbs all incident radiation, to 1, corresponding to a body that reflects all incident radiation. Infrared data may be acquired using an infrared sensor. An infrared sensor works the same way an object detection sensor does. The sensor typically has an IR LED and an IR photodiode; combining the two provides a photo-coupler or optocoupler. The IR LED is a transmitter emitting IR radiation. The radiation is detected by infrared receivers, which are available in photodiode form. The infrared photodiode responds to the infrared light generated by the infrared LED. The resistance of the photodiode and the change in output voltage are directly proportional to the received infrared light. After the infrared transmitter has produced an emission, the emission arrives at the object and some of it bounces or reflects back towards the infrared receiver. Based on the intensity of the response, the sensor output is determined by the IR receiver. In an embodiment, the albedo, depth, and infrared data are derived from the image data. For example, synthetic albedo, depth, and infrared data generated from RGB images may be used to estimate parameters of the diffuse BSSRDF function. Alternatively, the system may predict albedo and infrared data from RGBD images to estimate parameters of the diffuse BSSRDF function.

[0036] The processor 102 is configured to generate or use a facial geometry of the face of the user based on the sensor data, a previously used geometry, or a template. The processor 102 is configured to estimate skin parameters and hair density from a sequence of images acquired by the image capturing device or stored on the server by implementing one or more machine trained models or neural networks. The processor 102 is further configured to render a model of a user using the estimated skin parameters in a diffuse BSSRDF function and the hair density evolution over time. In an embodiment, the processor 102 is configured to train one or more of the models or neural networks using supervised or unsupervised training methods to estimate the skin parameters and the hair density evolution. The processor 102 is a control processor, image processor, general processor, digital signal processor, three-dimensional data processor, graphics processing unit, application specific integrated circuit, field programmable gate array, artificial intelligence processor, digital circuit, analog circuit, combinations thereof, or other now known or later developed device. The processor 102 is a single device, a plurality of devices, or a network. For more than one device, parallel or sequential division of processing may be used. In one embodiment, the processor 102 is a control processor or other processor of a device 100. The processor 102 operates pursuant to and is configured by stored instructions, hardware, and/or firmware to perform various acts described herein.

[0037] The memory 103 is configured to store the skin parameters, hair estimation functions, neural networks, image data, and rendered images. For example, the configuration, nodes, weights, and other parameters of the machine learned models and networks may be stored in the memory 103. The memory 103 may be or include an external storage device, RAM, ROM, database, and/or a local memory (e.g., solid state drive or hard drive). The same or different non-transitory computer readable media may be used for the instructions and other data. The memory 103 may be implemented using a database management system (DBMS) and residing on a memory, such as a hard disk, RAM, or removable media. Alternatively, the memory 103 is internal to the processor 102 (e.g., cache).

[0038] The instructions for implementing the processes, methods, and/or techniques discussed herein are provided on non-transitory computer-readable storage media or memories, such as a cache, buffer, RAM, removable media, hard drive, or other computer readable storage media (e.g., the memory 103). The instructions are executable by the processor 102 or another processor. Computer readable storage media include various types of volatile and nonvolatile storage media. The functions, acts or tasks illustrated in the figures or described herein are executed in response to one or more sets of instructions stored in or on computer readable storage media. The functions, acts or tasks are independent of the instructions set, storage media, processor 102 or processing strategy and may be performed by software, hardware, integrated circuits, firmware, micro code, and the like, operating alone or in combination. In one embodiment, the instructions are stored on a removable media device for reading by local or remote systems. In other embodiments, the instructions are stored in a remote location for transfer through a computer network. In yet other embodiments, the instructions are stored within a given computer, CPU, GPU, or system. Because some of the constituent system components and method steps depicted in the accompanying figures may be implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present embodiments are programmed.

[0039] The display 106 is configured to display or otherwise provide the model of the user to the user. The display 106 is a CRT, LCD, projector, plasma, printer, tablet, smart phone or other now known or later developed display device for displaying the output.

[0040] The server connects to the device 100 via a network. The network is a local area, wide area, enterprise, another network, or combinations thereof. In one embodiment, the network is, at least in part, the Internet. Using TCP/IP communications, the network provides for communication between the device 100 and the server. Any format for communications may be used. In other embodiments, dedicated or direct communication is used. The server is a processor or group of processors. More than one server may be provided. The server is configured by hardware and/or software. The server may be configured to handle a portion of the processing of the sensor data. The server may be configured to train the networks or models using labeled or unlabeled datasets. The server may be cloud based and provide access for users to an application that provides facial rendering as described herein.

[0041] Figure 2 depicts an example workflow for rendering a model of a face of a user. Sensor data is acquired 200. The sensor data, for example a sequence of images, is used to generate facial geometry 201, generate skin parameters 202, and estimate hair density 203. A rendering system 101 renders 204 a model or face of a user using the facial geometry, skin parameters, and estimated hair density. The facial geometry may be generated using any known or future methods, for example, by segmentation or other image processing techniques. The estimated hair density is described below in relation to Figure 3. The generation of the skin parameters, in particular for use in a diffuse BSSRDF function, is described in Figure 8. The rendering system 101 may render the face using any known or future rendering process. Additional features may be rendered by the rendering system 101, for example, eyes, mouth, ears, nose, etc., using known or future feature rendering mechanisms. The facial geometry, skin parameters, and estimated hair density may be generated from only a sequence of images. Additional information or inputs may be used, such as user information, RGBD data, albedo data, infrared data, and depth data, among other inputs.

[0042] Figure 3 depicts an example workflow for estimating hair density evolution through time. The method uses a sequence of images (for example RGB images) taken of a user's face and hair and estimates hair density at a given time. As a result, users suffering hair loss can automatically track the evolution of their loss and regrowth and see how their hair would look in the future. As presented in the following sections, the acts may be performed using any combination of the components indicated in Figure 1. The following acts may be performed by the rendering system 101, processor 102, device sensors 105, or a combination thereof. Additional, different, or fewer acts may be provided. For example, additional information may be acquired about the user and used to estimate hair density and/or generate a new image. The acts are performed in the order shown or other orders. The acts may also be repeated. Certain acts may be skipped. For example, an image may not be rendered. The output of the hair estimation may be a value or score that indicates positive or negative growth of the density of the hair of the user. The workflow of Figure 3 may be performed simultaneously with the workflow of Figure 8 (skin parameters) in order to generate a more accurate and believable rendering of the user's face.

[0043] At act A110, a system acquires a sequence of images of the user. The system may include a device 100, for example a handheld device 100, that is configured to acquire image data, process the image data, and display an image to the user. The device 100 may acquire the image data from, for example, a camera or other imaging sensor equipped with a standard sensor through which the images of persons and objects are acquired. Alternatively, one or more of the images may be acquired from a different device or location, for example cloud storage. The images may be part of a video or stream or may be individual images or frames. The images may be high definition or low definition. Each image of the sequence of images may be captured at a different time. The images may be taken at intervals, for example, every day, week, month, or year. Alternatively, the intervals may be random, e.g., with varying lengths of time between when the images were captured. One benefit of the current process is the ability to monitor hair density and coverage over time. The more images in the sequence, and the longer the period of time they cover, the more the system can improve its predictions and estimations. The system may be updated over time and become more precise as a user enters more pictures. In an example, the system may acquire two images of a user a month apart. The system can determine the hair density (as described below) and estimate a percentage of hair degeneration (predicting hair loss) or a percentage of hair growth, with an estimate of a time for full regrowth. However, with only two images the accuracy and precision may not be very high. If the system uses five, ten, or more images from different times, the system may be able to provide very accurate estimates, particularly if the images depict the hair density or coverage of the user over a longer period of time. If, for example, a user takes an image every month for a year, the system may be able to provide a very accurate estimation of hair loss, hair growth, or future density based on the sequence of twelve images.

[0044] At act A120, the system estimates, from a first image of the sequence of images, a first density of hair of the user using a neural network, also referred to as a hair density estimator. When the user takes a first picture of their hair, the system estimates a first density using a neural network or machine trained model. The density may be represented as a weighted segmentation map or a scalar. Image segmentation may be used to locate objects and boundaries in an image. For an image of a user, the segmentation or segmentation map may provide a label for each pixel or voxel that is classified as hair and each pixel or voxel that is classified as "not hair." A scalar value may represent the amount or coverage of hair for the user based on the segmentation of the image. The classification of the user's hair in the image is provided by a neural network or machine trained model. The neural network or model may be any type of network or model that is trained using machine learning. In an example, the model may be configured to segment the image and calculate a value therefrom.
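
The following is a minimal sketch, for illustration only, of how a weighted segmentation map could be collapsed into a scalar density value as described above; the thresholding, the region-of-interest mask, and the function name are illustrative assumptions rather than the claimed implementation.

```python
import numpy as np

def hair_density_from_segmentation(prob_map: np.ndarray,
                                   scalp_mask: np.ndarray,
                                   threshold: float = 0.5) -> float:
    """Collapse a weighted segmentation map into a scalar density.

    prob_map   : HxW array of per-pixel hair probabilities in [0, 1]
    scalp_mask : HxW boolean array marking the region of interest,
                 so pixels outside it (e.g. background) are ignored.
    Returns the fraction of region-of-interest pixels classified as hair.
    """
    hair_pixels = (prob_map >= threshold) & scalp_mask
    return float(hair_pixels.sum()) / max(int(scalp_mask.sum()), 1)

# Toy usage with random data standing in for a network output.
rng = np.random.default_rng(0)
prob = rng.random((256, 256))
roi = np.ones((256, 256), dtype=bool)
print(hair_density_from_segmentation(prob, roi))
```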

[0045] At act A130, the system estimates, from a second image of the sequence of images, a second density of hair of the user using the neural network. After the user enters a second picture of their hair, the system estimates the hair density of this second picture using the aforementioned neural network. The system now has two density estimations for two given times and may assess whether density is increasing, decreasing, or staying steady. The segmentation maps or scalars of the two images output by the neural network may be compared against one another to determine, for example, if there are more hair pixels in one image than the other. Different standards may be used, for example, length of hair versus density of hair. A growth score may be calculated to determine the growth rate (positive or negative) of the hair density over the period of time between the two images. If additional images are used from the sequence, the growth rate may be calculated as a fitting of the outputs of the hair density estimator over the period of time of the additional images. Additional information may be used by the neural networks. For example, any products used by the user may affect the volume or density of the hair. A haircut may affect the length of the hair. The segmentation map or scalar value may focus on a particular region if, for example, the user has had a haircut during the time between the two images. A particular portion that might be unaffected by the haircut may be used as the baseline for determining whether density is increasing, decreasing, or staying steady.

[0046] Figure 4 depicts an example of estimating hair density from two different images taken at two different times, t=0 and t=1, using the neural network / hair density estimator 401. The two images are input into the hair density estimator 401. The hair density estimator 401 segments the images to determine which pixels represent hair. The segmented images are then assigned a density by the hair density estimator 401. The difference between the densities may be used to calculate a score. In Figure 4, the density of the hair has increased for the user between t=0 and t=1, and as such there is a positive growth score. In Figure 4, the user maintains the same pose for both time periods t=0 and t=1. The images may depict different poses, different zoom levels, different lighting, etc. The hair density estimator 401 may be configured to estimate the hair density for each individual image regardless of the properties of the image.
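
As one possible way to realize the growth score and the fitting over additional images mentioned above, the sketch below fits a least-squares line to (time, density) pairs; the specific slope and percentage-change formulas are illustrative assumptions, not a prescribed scoring rule.

```python
import numpy as np

def growth_rate(times_days: np.ndarray, densities: np.ndarray) -> float:
    """Least-squares slope of density versus time (density units per day).

    With only two images this reduces to a simple difference quotient;
    with more images it becomes a fit over the whole sequence.
    """
    slope, _intercept = np.polyfit(times_days, densities, deg=1)
    return float(slope)

def percent_change(d_first: float, d_last: float) -> float:
    """Percentage of hair growth (positive) or degeneration (negative)."""
    return 100.0 * (d_last - d_first) / d_first

# Example: densities estimated from images taken 0, 30, and 60 days apart.
t = np.array([0.0, 30.0, 60.0])
d = np.array([0.42, 0.45, 0.49])
print(growth_rate(t, d), percent_change(d[0], d[-1]))
```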

[0047] The hair density estimator 401 is a machine trained model or network. The networks or machine trained models as described herein may be defined as a plurality of sequential feature units or layers. The general flow of output feature values may be from one layer to input to a next layer. The information from the next layer is fed to a next layer, and so on until the final output. The layers may only feed forward or may be bidirectional, including some feedback to a previous layer. Skip connections may be provided where some information from a layer is fed to a layer beyond the next layer. The nodes of each layer or unit may connect with all or only a sub-set of nodes of a previous and/or subsequent layer or unit. Various units or layers may be used, such as convolutional, pooling (e.g., max pooling), deconvolutional, fully connected, or other types of layers. Within a unit or layer, any number of nodes is provided. For example, one hundred nodes are provided. Later or subsequent units may have more, fewer, or the same number of nodes. In general, for convolution, subsequent units have more abstraction. Other network arrangements may be used, such as a support vector machine. Deep architectures include convolutional neural networks (CNN) or deep belief nets (DBN), but other deep networks may be used. A CNN learns feed-forward mapping functions while a DBN learns a generative model of data. In addition, a CNN uses shared weights for all local regions while a DBN is a fully connected network (e.g., including different weights for different areas of the states). The training of a CNN is entirely discriminative through back-propagation. A DBN, on the other hand, employs layer-wise unsupervised training (e.g., pre-training) followed by discriminative refinement with back-propagation if necessary. In an embodiment, the arrangement of the machine learnt network is a fully convolutional network (FCN). Alternative network arrangements may be used, for example, a 3D Very Deep Convolutional Network (3D-VGGNet). VGGNet stacks many layer blocks containing narrow convolutional layers followed by max pooling layers. A 3D Deep Residual Network (3D-ResNet) architecture may be used. A ResNet uses residual blocks and skip connections to learn residual mapping.

[0048] In an embodiment, the model is trained using a gradient descent technique or a stochastic gradient descent technique. Both techniques attempt to minimize an error function defined for the model. Training the model involves adjusting internal weights or parameters of the model until the model is able to accurately predict the correct outcome given a newly input data point. The result of the training process is a model that includes one or more parameters that minimize the errors of the function given the training data. The one or more parameters may be represented as a vector. In an embodiment, the training data is labeled. Labeled data is used for supervised learning. The model is trained by inputting known inputs and known outputs. Weights or parameters are adjusted until the model accurately matches the known inputs and outputs. In an example, to train a machine learned model to identify hair density, a segmentation model is used to segment an input image. The output is compared to a labeled map or vector. In an embodiment, the training data is labeled, and the model is taught using a supervised learning process. A supervised learning process may be used to predict numerical values (regression) and for classification purposes (predicting the appropriate class).
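
A minimal PyTorch-style sketch of the supervised, gradient-descent training described above; the tiny convolutional architecture, the binary cross-entropy loss, and the toy data are illustrative stand-ins, not the specific network 401 or training set disclosed here.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a hair segmentation network:
# a tiny fully convolutional model mapping RGB to a per-pixel hair logit.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)   # gradient descent
loss_fn = nn.BCEWithLogitsLoss()                           # supervised error function

# Toy labeled batch standing in for (image, hair-mask) training pairs.
images = torch.rand(4, 3, 64, 64)
masks = (torch.rand(4, 1, 64, 64) > 0.5).float()

for step in range(100):                 # training loop
    optimizer.zero_grad()
    logits = model(images)              # predicted segmentation logits
    loss = loss_fn(logits, masks)       # compare output to labeled map
    loss.backward()                     # back-propagation
    optimizer.step()                    # adjust internal weights
```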

[0049] At act A140, the system optimizes a latent vector provided to a temporal generative model. The temporal generative model takes as input the latent vector and a temporal value and generates an estimated hair density for the temporal value. The predicted densities, together with the time information, are used to optimize a latent vector given to the temporal generative model. The latent vector is optimized such that, when given the times corresponding to the previously acquired images, the generative model generates densities matching those of the input images.

[0050] In an embodiment, the generative model may be trained using an adversarial training process, e.g., the model may include a generative adversarial network (GAN). For an adversarial training approach, a generative network and a discriminative network are provided for training. The generative network is trained to identify the features of data in one domain A and transform the data from domain A into data that is indistinguishable from data in domain B. In the training process, the discriminative network plays the role of a judge to score how likely the transformed data from domain A is to be similar to the data of domain B, e.g., whether the data is a forgery or real data from domain B. A temporal generative model (or TGAN) is a type of generative adversarial network that is capable of learning a representation from an unlabeled sequence and producing a new image. The generator may include two sub-networks called a temporal generator and an image generator. The temporal generator first yields a set of latent variables, each of which corresponds to a latent variable for the image generator. Then, the image generator transforms these latent variables into a sequence of images that has the same number of images as the variables. The model including the temporal and image generators can efficiently capture the time series. In an embodiment, the density may be represented as scalars and a statistics extrapolation method such as linear or polynomial extrapolation may be used.
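
A minimal sketch of the latent vector optimization of act A140 (and the extrapolation of act A150), assuming a pretrained, frozen temporal generative model that maps a latent vector and a time value to a scalar density; the toy MLP generator, latent dimension, optimizer settings, and example values are illustrative assumptions rather than the disclosed model 501.

```python
import torch
import torch.nn as nn

# Toy placeholder for a pretrained temporal generative model:
# it maps (latent vector, time) to a scalar hair density.
class TemporalGenerator(nn.Module):
    def __init__(self, latent_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 1, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, z: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, t], dim=-1))

generator = TemporalGenerator()
for p in generator.parameters():        # the generator itself stays frozen
    p.requires_grad_(False)

# Observed (time, density) pairs from the hair density estimator.
times = torch.tensor([[0.0], [1.0]])
densities = torch.tensor([[0.42], [0.47]])

# Optimize only the latent vector so the generator reproduces the observations.
z = torch.zeros(1, 8, requires_grad=True)
opt = torch.optim.Adam([z], lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    pred = generator(z.expand(times.shape[0], -1), times)
    loss = ((pred - densities) ** 2).mean()
    loss.backward()
    opt.step()

# Extrapolate the density at a future time (act A150).
with torch.no_grad():
    future_density = generator(z, torch.tensor([[5.0]]))
```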

[0051] Figure 5 depicts an example of optimizing a latent vector. In Figure 5, the segmentation maps and densities of Figure 4 are used to configure the temporal generative model 501 in order to generate the latent vectors. The temporal generative model 501 may be trained to encode the latent vector. In an embodiment, the temporal generative model 501 takes as input the latent vector and outputs an image. The goal of the model is to learn to generate the underlying distribution of the real dataset.

[0052] At act A150, the temporal generative model 501 extrapolates hair density for a future temporal value. As a result of learning from at least the first image and the second image, the temporal generative model 501 is now able to extrapolate future hair density when given times in the future. The more data the user enters, the better the optimization will be and the more accurate the hair density estimation will be. In an embodiment, the user may enter additional information including, but not limited to, a new hair care routine, use of a hair growth treatment, change in weight, and chemotherapy schedule, among other information. The device 100 may leverage this additional information to better refine the estimation.
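
Where the density is represented as scalars, the statistics extrapolation mentioned at the end of paragraph [0050] could look like the following sketch; the polynomial degree and the example values are arbitrary illustrations.

```python
import numpy as np

# Scalar densities at known times (e.g. months 0, 1, 2, and 4).
t = np.array([0.0, 1.0, 2.0, 4.0])
d = np.array([0.40, 0.43, 0.45, 0.50])

# Fit a low-order polynomial and evaluate it at a future time (month 6).
coeffs = np.polyfit(t, d, deg=2)
d_future = np.polyval(coeffs, 6.0)
print(round(float(d_future), 3))
```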

[0053] Figure 6 depicts an example of the output of the temporal generative model 501, for example, an estimated density of hair. In Figure 6, the temporal generative model 501 is given as input the optimized latent vector and the time T=5. The output is the estimated density (shown here as a segmented image).

[0054] At act A160, the system synthesizes an image with a corresponding hair density. An additional conditional generative model may be trained to take as input an RGB image of the user's hair together with a target hair density and generate a new image depicting the user's hair with the target density. The image may be displayed to the user to further motivate the user in their hair growth. The conditional generative model (CGAN) is a type of GAN that includes a generator and a discriminator. The generator is given a label and a random array as input and generates data with the same structure as the training data observations corresponding to the same label. The discriminator is given batches of labeled data containing observations from both the training data and generated data from the generator, and attempts to classify the observations as "real" or "generated".
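
A toy sketch of conditional generation at inference time, assuming the generator is conditioned by concatenating the RGB image with a one-channel target density map; the architecture shown is a placeholder for illustration, not the hair inpainting network 701.

```python
import torch
import torch.nn as nn

# Toy stand-in for a conditional hair inpainting generator: it takes an
# RGB image plus a 1-channel target density map and produces a new RGB image.
generator = nn.Sequential(
    nn.Conv2d(3 + 1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, kernel_size=3, padding=1), nn.Sigmoid(),
)

rgb = torch.rand(1, 3, 128, 128)             # image from the sequence
target_density = torch.rand(1, 1, 128, 128)  # density predicted for the selected time

with torch.no_grad():
    synthesized = generator(torch.cat([rgb, target_density], dim=1))
print(synthesized.shape)   # torch.Size([1, 3, 128, 128])
```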

[0055] Figure 7 depicts an example of an estimated image of the hair of a user at a time t=5. In Figure 7, the conditional generative model (hair inpainting network 701) takes as input the estimated density and a previous image. The conditional generative model outputs an estimated image that renders the estimated density in place of the previous hair of the user.

[0056] The result of the workflow of Figure 3 is an automated approach for quantifying and projecting human hair density. The system helps patients detect early signs of hair loss and may give users hope when trying to grow hair. The process is usually long, and seeing an estimate of the results in a few months can be very motivating. In addition, the system may run in real time on commonly used mobile devices. The output of the hair density estimation and rendering system 101 is an image or representation of the user at a future time with the estimated hair coverage. While the hair estimation may be beneficial, hair is only one part of a successful model of a user. The lighting, and more particularly the lighting of the user's skin, plays a large role in generating a convincing and accurate portrayal of the user.

[0057] Figure 8 depicts an example method for calculating parameters for rendering skin of a user. As presented in the following sections, the acts may be performed using any combination of the components indicated in Figure 1. The following acts may be performed by the rendering system 101, processor 102, device sensors 105, or a combination thereof. Additional, different, or fewer acts may be provided. The acts are performed in the order shown or other orders. The acts may also be repeated. Certain acts may be skipped.

[0058] One widely used reflectance model for skin is the Bidirectional Reflectance Distribution Function (BRDF). The fundamental limitation of such a model is that a BRDF alone ignores subsurface scattering, which is largely responsible for the soft appearance of facial skin. To describe the full effect of light scattering between two points on the surface, one can use the Bidirectional Scattering-Surface Reflectance Distribution Function (BSSRDF). The BSSRDF is a mathematical tool describing the outgoing radiance at a point as a function of the incoming radiance from surrounding areas. The BSSRDF requires the measurement or provision of several parameters. One method for determining subsurface scattering parameters is to use a face-scanning dome. The subject sits in a chair with a headrest to keep the head still during the capture process. The chair is surrounded by 16 cameras and 150 LED light sources that are mounted on a geodesic dome. The system sequentially turns on each light while simultaneously capturing images with all 16 cameras. The complete sequence takes about 25 seconds for the two passes through all 150 light sources (limited by the frame rate of the cameras). Embodiments described herein provide a method for estimating the parameters of a BSSRDF function for human skin using a device 100 capable of capturing or synthesizing RGB, depth, albedo, and IR data in a real-time fashion. The estimated parameters may include, for example, the absorption coefficient and the reduced scattering coefficient.
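
For reference, the textbook form of the rendering integral under a BSSRDF S, together with the derived quantities of the classical dipole diffusion approximation in which the two estimated parameters, the absorption coefficient and the reduced scattering coefficient, appear; this is the standard formulation from the literature, given only to make the role of the estimated parameters concrete, not a formula recited in the claims.

```latex
% Outgoing radiance at surface point x_o in direction \omega_o:
L_o(x_o,\omega_o)=\int_A\int_{2\pi}
  S(x_i,\omega_i;\,x_o,\omega_o)\,L_i(x_i,\omega_i)\,
  (\mathbf{n}\cdot\omega_i)\,d\omega_i\,dA(x_i)

% Diffuse term of the dipole approximation, where the estimated
% parameters enter through the diffuse reflectance profile R_d:
S_d(x_i,\omega_i;x_o,\omega_o)=\tfrac{1}{\pi}\,
  F_t(\eta,\omega_i)\,R_d\!\left(\lVert x_i-x_o\rVert\right)\,
  F_t(\eta,\omega_o)

% Derived optical quantities built from the absorption coefficient
% \sigma_a and the reduced scattering coefficient \sigma_s':
\sigma_t'=\sigma_a+\sigma_s',\qquad
\alpha'=\frac{\sigma_s'}{\sigma_t'},\qquad
\sigma_{tr}=\sqrt{3\,\sigma_a\,\sigma_t'}
```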

[0059] At act A210, the system captures a single shot or a video stream of a face of a user. The system may include or be a device 100, for example a handheld device 100, that is configured to acquire images, process the image data, and display an image to the user. The device 100 may acquire the images from, for example, a camera or other imaging sensor equipped with a standard sensor through which the images of persons and objects are acquired. Alternatively, one or more of the images may be acquired from a different device or location, for example cloud storage. In an embodiment, the device 100 also acquires infrared, depth, and albedo data. The albedo, depth, and infrared data may be captured directly, estimated, or synthetic. In an embodiment, a Time of Flight (ToF) sensor uses infrared light to determine depth information. The ToF sensor uses the known speed of light to measure distance, effectively counting the amount of time it takes for a reflected beam of light to return to the device sensor. The device 100 may also be equipped with a LIDAR sensor (a type of ToF sensor). The LIDAR sensor emits multiple signals, which hit the subject and return to the sensor. The time the signals take to bounce back is then measured and provides depth-mapping capabilities. The albedo data may be acquired by another sensor or derived from the depth and image data. Albedo is defined as the fraction of solar energy reflected. Albedo may be expressed as a simple number or a percentage figure. The higher the percentage, the more energy is reflected back to the source. An albedo map may be provided.
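
The time-of-flight principle described above reduces to the following relation between the measured round-trip time and the depth, stated here only for concreteness:

```latex
d=\frac{c\,\Delta t}{2},\qquad c\approx 2.998\times 10^{8}\ \mathrm{m/s}
```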

[0060] Figure 9 depicts a workflow for calculating parameters. In Figure 9, there are four inputs to the machine learning system 901 (as described below). The four inputs are image data, albedo data, depth data, and infrared data. The albedo data, depth data, and infrared data may be derived or generated from the image data.

[0061] At act A220, the single shot or video stream data is input to a machine learning system 901. The machine learning system 901 includes a machine learning trained model that takes as input a single image or a sequence of images together with albedo maps, depth images, and infrared images either obtained by sensors 105 or predicted from the images. The machine learning system 901 generates per-pixel absorption coefficients and reduced scattering coefficients. The machine learning system 901 / machine learning techniques may be or include a trainable algorithm, an artificial neural network (for example a deep, i.e., multilayer, network), a Support Vector Machine (SVM), a decision tree, and/or the like. The machine-learning techniques may be based on k-means clustering, Temporal Difference (TD) learning, for example Q learning, a genetic algorithm, and/or association rules or an association analysis. The machine-learning techniques may, for example, be or include a (deep) convolutional neural network (CNN), a (deep) adversarial neural network, a generative adversarial neural network (GAN), or another type of network.
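
A minimal sketch of what a per-pixel parameter regressor of this kind could look like, assuming RGB, albedo, depth, and infrared inputs are stacked along the channel dimension; the channel counts, layer sizes, and Softplus output activation are illustrative assumptions, not the disclosed model 901.

```python
import torch
import torch.nn as nn

# Toy stand-in for a parameter estimation model: a small CNN that maps
# concatenated RGB (3) + albedo (3) + depth (1) + infrared (1) channels
# to two per-pixel outputs, an absorption coefficient and a reduced
# scattering coefficient.
model = nn.Sequential(
    nn.Conv2d(8, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 2, kernel_size=1), nn.Softplus(),  # coefficients are positive
)

rgb = torch.rand(1, 3, 128, 128)
albedo = torch.rand(1, 3, 128, 128)
depth = torch.rand(1, 1, 128, 128)
infrared = torch.rand(1, 1, 128, 128)

params = model(torch.cat([rgb, albedo, depth, infrared], dim=1))
sigma_a, sigma_s_prime = params[:, 0], params[:, 1]   # per-pixel maps
```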

[0062] The training data for the machine learning trained model includes ground truth data or gold standard data. Ground truth data and gold standard data are data that include correct or reasonably accurate labels that are verified manually or by some other accurate method. The training data may be acquired at any point prior to inputting the training data into the model. The machine learning model is trained using a dataset containing measured absorption coefficients and reduced scattering coefficients, corresponding color, corresponding depth, and corresponding infrared information. The training dataset may be represented as a limited sample of discrete acquisitions on the skin. The machine learning model uses spatial information (pixel coordinates and UV mapping) to leverage these partial annotations. The machine learning model may be (but is not limited to) an index-to-value network mapping the pixel information together with its coordinates to the measured absorption and reduced scattering coefficients at the given pixel. In an embodiment, a differentiable rendering system 101 may be used so that the rendered images supervise the training of the machine learning model, enabling weakly supervised and unsupervised training.

[0063] The predicted absorption and reduced scattering coefficients may be provided to a rendering system 101 as a per-pixel map generated by the network; as a per-vertex map, using the generated pixel map and the UV mapping of the 3D model that the system is rendering; or as a single pair of absorption and reduced scattering coefficients for the whole model, obtained using statistical operations on the generated per-pixel map (including but not limited to mean, median, min, or max).
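
A minimal sketch of two of the delivery formats listed above, assuming nearest-neighbour UV sampling for the per-vertex map; the helper function names and the sampling choice are illustrative assumptions, not the disclosed mechanism.

```python
import numpy as np

def per_vertex_from_pixel_map(param_map: np.ndarray,
                              uv: np.ndarray) -> np.ndarray:
    """Sample a per-pixel parameter map at each vertex's UV coordinate.

    param_map : HxWx2 array of (absorption, reduced scattering) per pixel
    uv        : Nx2 array of vertex UV coordinates in [0, 1]
    Returns an Nx2 array of per-vertex parameters (nearest-neighbour lookup,
    an illustrative choice; bilinear sampling would also work).
    """
    h, w = param_map.shape[:2]
    cols = np.clip((uv[:, 0] * (w - 1)).round().astype(int), 0, w - 1)
    rows = np.clip(((1.0 - uv[:, 1]) * (h - 1)).round().astype(int), 0, h - 1)
    return param_map[rows, cols]

def single_pair(param_map: np.ndarray, reducer=np.median) -> np.ndarray:
    """Collapse the map to one (absorption, reduced scattering) pair."""
    return reducer(param_map.reshape(-1, 2), axis=0)

# Toy usage with a random map standing in for the network output.
pm = np.random.rand(64, 64, 2)
uvs = np.random.rand(100, 2)
print(per_vertex_from_pixel_map(pm, uvs).shape, single_pair(pm))
```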

[0064] At act A230, the output (parameters) is used to re-create the user’s face using a rendering system 101. The rendering system 101 may use geometry, viewpoint, texture, lighting, and shading information describing a scene. Different rendering algorithms may be used, for example using different light models such as rasterization, ray casting, or ray tracing. In an embodiment, the hair of the user may be estimated using the process described above in Figure 3. The output is a rendered model of the user that is accurate and useful for different applications. In an embodiment, a facial geometry is derived from the sensor data and is used by the rendering system 101 to re-create the user’s face.

[0065] It is to be understood that the elements and features recited in the appended claims may be combined in different ways to produce new claims that likewise fall within the scope of the present invention. Thus, whereas the dependent claims appended below depend on only a single independent or dependent claim, it is to be understood that these dependent claims may, alternatively, be made to depend in the alternative from any preceding or following claim, whether independent or dependent, and that such new combinations are to be understood as forming a part of the present specification.

[0066] While the present invention has been described above by reference to various embodiments, it may be understood that many changes and modifications may be made to the described embodiments. It is therefore intended that the foregoing description be regarded as illustrative rather than limiting, and that it be understood that all equivalents and/or combinations of embodiments are intended to be included in this description.