

Title:
GENERATIVE MODELING OF THREE DIMENSIONAL SCENES AND APPLICATIONS TO INVERSE PROBLEMS
Document Type and Number:
WIPO Patent Application WO/2023/129190
Kind Code:
A1
Abstract:
Systems and methods for training a generative neural radiance field model can include geometric regularization. Geometric regularization can involve the utilization of reference geometry data and/or an output of a surface prediction model. The geometry regularization can train the generative neural radiance field model to mitigate artifact generation by limiting a distribution considered for color value prediction and density value prediction to a range associated with a realistic geometry range.

Inventors:
DARAS IOANNIS (US)
KUMAR ABHISHEK (US)
CHU WEN-SHENG (US)
LAGUN DMITRY (US)
Application Number:
PCT/US2022/012252
Publication Date:
July 06, 2023
Filing Date:
January 13, 2022
Assignee:
GOOGLE LLC (US)
International Classes:
G06T15/20; G06T17/00
Domestic Patent References:
WO2020226630A1 (2020-11-12)
Other References:
GIANNIS DARAS ET AL: "Solving Inverse Problems with NerfGANs", arXiv.org, 16 December 2021 (2021-12-16), XP091117575
CHAN ERIC R ET AL: "pi-GAN: Periodic Implicit Generative Adversarial Networks for 3D-Aware Image Synthesis", 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 20 June 2021 (2021-06-20), pages 5795-5805, XP034009106, DOI: 10.1109/CVPR46437.2021.00574
YANG HONG ET AL: "HeadNeRF: A Real-time NeRF-based Parametric Head Model", arXiv.org, 13 December 2021 (2021-12-13), XP091115780
BEN MILDENHALL ET AL: "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", arXiv.org, 19 March 2020 (2020-03-19), XP081624988
Attorney, Agent or Firm:
WALTERS, Michael S. et al. (US)
Claims:
WHAT IS CLAIMED IS:

1. A computing system, the system comprising:
one or more processors; and
one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising:
obtaining training data, wherein the training data comprises a training image, a respective position associated with the training image, and reference geometry data;
processing an input position with a machine-learned model to generate a view rendering and three-dimensional reconstruction data, wherein the input position is associated with an observation space associated with the respective position, wherein the view rendering comprises predicted image data descriptive of a scene, and wherein the three-dimensional reconstruction data is descriptive of predicted three-dimensional geometry;
evaluating a first loss function that evaluates a difference between the view rendering and the training image;
evaluating a second loss function that evaluates a difference between the three-dimensional reconstruction data and the reference geometry data; and
adjusting one or more parameters of the machine-learned model based at least in part on at least one of the first loss function or the second loss function.

2. The computing system of any preceding claim, wherein the first loss function comprises a two-dimensional reconstruction loss, wherein the two-dimensional reconstruction loss comprises a contrastive loss.

3. The computing system of any preceding claim, wherein the operations further comprise: annealing a temperature parameter after adjusting the one or more parameters.

4. The computing system of any preceding claim, wherein the second loss function comprises a three-dimensional consistency loss.


5. The computing system of any preceding claim, wherein the training image comprises a frontal view of a face.

6. The computing system of any preceding claim, wherein the one or more parameters comprise one or more geometry parameters associated with a range of realistic geometries for an input image.

7. The computing system of any preceding claim, wherein adjusting the one or more parameters of the machine-learned model comprises training the machine-learned model to generate a color value prediction based at least in part on a range of geometric priors.

8. The computing system of any preceding claim, wherein adjusting the one or more parameters of the machine-learned model comprises training the machine-learned model to generate a density value prediction based at least in part on a range of geometric priors.

9. The computing system of any preceding claim, wherein the view rendering comprises data descriptive of a predicted density value and a predicted color value.

10. The computing system of any preceding claim, wherein the view rendering comprises a predicted color value determined based at least in part on an integration of a color value distribution in a learned geometric range associated with a three-dimensional voxel grid.

11. The computing system of any preceding claim, wherein the machine-learned model comprises a mapping network model configured with feature-wise linear modulation conditioning.

12. The computing system of any preceding claim, wherein the view rendering comprises a two-dimensional reconstructed image associated with the respective position.

13. The computing system of any preceding claim, wherein the machine-learned model comprises a first model and a second model, wherein the first model comprises a surface prediction model, and wherein the second model comprises a renderer model.


14. The computing system of any preceding claim, wherein the input position is determined by sampling a point in the observation space.

15. A computer-implemented method, the method comprising:
obtaining, by a computing system comprising one or more processors, an input position and an input view direction;
processing, by the computing system, the input position and the input view direction with a surface prediction model to determine geometric range data, wherein the geometric range data is associated with a predicted range of geometric outcomes;
processing, by the computing system, the input position, the input view direction, and the geometric range data with a generative neural radiance field model to generate a view rendering, wherein the view rendering comprises a predicted color value and a predicted density value, wherein at least one of the predicted color value or the predicted density value are determined based on a learned neural radiance field and the geometric range data; and
providing, by the computing system, the view rendering for display.

16. The method of any preceding claim, wherein the surface prediction model and the generative neural radiance field model were jointly trained with a training dataset comprising one or more training camera parameters and latent encoding data.

17. The method of any preceding claim, wherein the generative neural radiance field model was trained with a single frontal view image of a face, and wherein the view rendering comprises a novel view rendering of the face.

18. One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations, the operations comprising:
obtaining a plurality of sampled noise vectors and a plurality of camera parameters;
processing the plurality of sampled noise vectors with a neural radiance field model to generate a plurality of neural radiance field datasets;
processing the plurality of neural radiance field datasets and the plurality of camera parameters with a renderer model to generate a plurality of two-dimensional images; and
determining one or more particular images of the plurality of two-dimensional images with a realistic geometry.

19. The one or more non-transitory computer-readable media of any preceding claim, wherein determining the one or more particular images of the plurality of two-dimensional images with the realistic geometry comprises: processing the plurality of two-dimensional images with a geometry determination model to determine the one or more particular images, wherein the geometry determination model is configured to recognize one or more visual concepts.

20. The one or more non-transitory computer-readable media of any preceding claim, wherein the operations further comprise: training at least one of a surface prediction model or a generative neural radiance field model with geometry data associated with the one or more particular images.

Description:
GENERATIVE MODELING OF THREE DIMENSIONAL SCENES AND

APPLICATIONS TO INVERSE PROBLEMS

RELATED APPLICATIONS

[0001] The present application is based on and claims the right of priority under 35 U.S.C. § 119 to Greek National Application No. 20210100924 having a filing date of December 30, 2021, the disclosure of which is incorporated by reference herein in its entirety for all purposes.

FIELD

[0002] The present disclosure relates generally to generative modeling of three-dimensional scenes and applications to inverse problems. More particularly, the present disclosure relates to training a generative neural radiance field model based on reference geometry data and/or an output of a surface prediction model to reduce and/or eliminate artifact generation when generating view renderings and three-dimensional reconstructions.

BACKGROUND

[0003] NeRFGAN models have become an increasingly prevalent research topic. However, very little research addresses inverse problems. In particular, there can be various challenges in solving inverse problems with NeRFGANs that can stem from unrealistic three-dimensional structures during inference.

[0004] Some approaches can side-step the computational cost of training a traditional NeRFGAN at high resolutions and present promising results in solving inverse problems. As larger and more realistic three-dimensional neural radiance field generators are made possible, solving inverse problems with the models may become increasingly more relevant for numerous applications.

SUMMARY

[0005] Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.

[0006] One example aspect of the present disclosure is directed to a computing system. The system can include one or more processors and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations. The operations can include obtaining training data. The training data can include a training image, a respective position associated with the training image, and reference geometry data. In some implementations, the operations can include processing an input position with a machine-learned model to generate a view rendering and three-dimensional reconstruction data. The input position can be associated with an observation space associated with the respective position. The view rendering can include predicted image data descriptive of a scene, and the three-dimensional reconstruction data can be descriptive of predicted three-dimensional geometry. The operations can include evaluating a first loss function that evaluates a difference between the view rendering and the training image and evaluating a second loss function that evaluates a difference between the three-dimensional reconstruction data and the reference geometry data. The operations can include adjusting one or more parameters of the machine-learned model based at least in part on at least one of the first loss function or the second loss function.

[0007] In some implementations, the first loss function can include a two-dimensional reconstruction loss, and the two-dimensional reconstruction loss can include a contrastive loss. The operations can include annealing a temperature parameter after adjusting the one or more parameters. The second loss function can include a three-dimensional consistency loss. In some implementations, the training image can include a frontal view of a face. The one or more parameters can include one or more geometry parameters associated with a range of realistic geometries for an input image. In some implementations, adjusting the one or more parameters of the machine-learned model can include training the machine-learned model to generate a color value prediction based at least in part on a range of geometric priors. Adjusting the one or more parameters of the machine-learned model can include training the machine-learned model to generate a density value prediction based at least in part on a range of geometric priors.

[0008] In some implementations, the view rendering can include data descriptive of a predicted density value and a predicted color value. The view rendering can include a predicted color value determined based at least in part on an integration of a color value distribution in a learned geometric range associated with a three-dimensional voxel grid. The machine-learned model can include a mapping network model configured with feature-wise linear modulation conditioning. In some implementations, the view rendering can include a two-dimensional reconstructed image associated with the respective position. The machine-learned model can include a first model and a second model. The first model can include a surface prediction model, and the second model can include a renderer model. The input position can be determined by sampling a point in the observation space.

[0009] Another example aspect of the present disclosure is directed to a computer-implemented method. The method can include obtaining, by a computing system including one or more processors, an input position and an input view direction. The method can include processing, by the computing system, the input position and the input view direction with a surface prediction model to determine geometric range data. The geometric range data can be associated with a predicted range of geometric outcomes. The method can include processing, by the computing system, the input position, the input view direction, and the geometric range data with a generative neural radiance field model to generate a view rendering. The view rendering can include a predicted color value and a predicted density value, and at least one of the predicted color value or the predicted density value can be determined based on a learned neural radiance field and the geometric range data. The method can include providing, by the computing system, the view rendering for display.

[0010] In some implementations, the surface prediction model and the generative neural radiance field model can have been jointly trained with a training dataset including one or more training camera parameters and latent encoding data. The generative neural radiance field model can have been trained with a single frontal view image of a face, and the view rendering can include a novel view rendering of the face.

[0011] Another example aspect of the present disclosure is directed to one or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations. The operations can include obtaining a plurality of sampled noise vectors and a plurality of camera parameters. The operations can include processing the plurality of sampled noise vectors with a neural radiance field model to generate a plurality of neural radiance field datasets. The operations can include processing the plurality of neural radiance field datasets and the plurality of camera parameters with a renderer model to generate a plurality of two-dimensional images and determining one or more particular images of the plurality of two-dimensional images with a realistic geometry.

[0012] In some implementations, determining the one or more particular images of the plurality of two-dimensional images with the realistic geometry can include processing the plurality of two-dimensional images with a geometry determination model to determine the one or more particular images. The geometry determination model can be configured to recognize one or more visual concepts. The operations can include training at least one of a surface prediction model or a generative neural radiance field model with geometry data associated with the one or more particular images.

[0013] Other aspects of the present disclosure are directed to various systems, apparatuses, non-transitory computer-readable media, user interfaces, and electronic devices.

[0014] These and other features, aspects, and advantages of various embodiments of the present disclosure will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate example embodiments of the present disclosure and, together with the description, serve to explain the related principles.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] Detailed discussion of embodiments directed to one of ordinary skill in the art is set forth in the specification, which makes reference to the appended figures, in which:

[0016] Figure 1A depicts a block diagram of an example computing system that performs view rendering according to example embodiments of the present disclosure.

[0017] Figure 1B depicts a block diagram of an example computing device that performs view rendering according to example embodiments of the present disclosure.

[0018] Figure 1C depicts a block diagram of an example computing device that performs view rendering according to example embodiments of the present disclosure.

[0019] Figure 2 depicts a block diagram of an example generative neural radiance field model according to example embodiments of the present disclosure.

[0020] Figure 3 depicts a block diagram of an example geometry determination model according to example embodiments of the present disclosure.

[0021] Figure 4 depicts an illustration of example model outputs according to example embodiments of the present disclosure.

[0022] Figure 5 depicts a block diagram of an example generative neural radiance field model according to example embodiments of the present disclosure.

[0023] Figure 6 depicts a flow chart diagram of an example method to perform machine-learned model training according to example embodiments of the present disclosure.

[0024] Figure 7 depicts a flow chart diagram of an example method to perform view rendering according to example embodiments of the present disclosure.

[0025] Figure 8 depicts a flow chart diagram of an example method to perform image selection according to example embodiments of the present disclosure.

[0026] Figure 9 depicts an illustration of example temperature annealing results according to example embodiments of the present disclosure.

[0027] Figure 10 depicts an illustration of example inpainting plot results according to example embodiments of the present disclosure.

[0028] Figure 11 depicts an illustration of example model results according to example embodiments of the present disclosure.

[0029] Reference numerals that are repeated across plural figures are intended to identify the same features in various implementations.

DETAILED DESCRIPTION

Overview

[0030] Generally, the present disclosure is directed to training a generative neural radiance field model with geometry regularization. For example, the systems and methods disclosed herein can train a generative neural radiance field model based on reference geometry data and/or an output of a surface prediction model. In some implementations, the systems and methods can determine and/or select images with realistic geometry for optimizing the model to produce geometrically realistic outputs.

[0031] In some implementations, the generative neural radiance field model can include a machine-learned model. Training the machine-learned model can include obtaining training data. The training data can include a training image, a respective position associated with the training image, and reference geometry data. The reference geometry data can be based on one or more images determined to include realistic geometry and/or good geometry. The systems and methods for training the machine-learned model can process an input position with a machine-learned model to generate a view rendering and three-dimensional reconstruction data. In some implementations, the input position can be associated with an observation space associated with the respective position. Additionally and/or alternatively, the view rendering can include predicted image data descriptive of a scene. The three-dimensional reconstruction data can be descriptive of predicted three-dimensional geometry. A gradient descent can then be generated based on one or more loss functions to be backpropagated to the machine-learned model to train the model. A first loss function can be evaluated by evaluating a difference between the view rendering and the training image. In some implementations, a second loss function can be evaluated by evaluating a difference between the three-dimensional reconstruction data and the reference geometry data. One or more parameters of the machine-learned model can then be adjusted based at least in part on at least one of the first loss function or the second loss function.
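
The training loop described in the preceding paragraph can be pictured with the following minimal sketch. The application does not prescribe particular loss forms or a model interface; `model(position)` returning a rendering plus reconstruction data, plain MSE losses, and the weight `lambda_geo` are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, training_image, position, reference_geometry,
                  lambda_geo: float = 0.1):
    """One illustrative update combining a 2D rendering loss with a 3D geometry loss.

    `model(position)` is assumed to return (view_rendering, reconstruction_3d);
    MSE stands in for whichever reconstruction/consistency losses are used.
    """
    view_rendering, reconstruction_3d = model(position)

    # First loss: difference between the view rendering and the training image.
    loss_2d = F.mse_loss(view_rendering, training_image)

    # Second loss: difference between predicted geometry and the reference geometry.
    loss_3d = F.mse_loss(reconstruction_3d, reference_geometry)

    # Adjust parameters based on one or both losses.
    loss = loss_2d + lambda_geo * loss_3d
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the relative weighting of the two losses controls how strongly the inferred geometry is pulled toward the reference geometry versus the pixel-level reconstruction.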

[0032] The systems and methods for training the machine-learned model can include obtaining training data. The training data can include a training image, a respective position associated with the training image, and reference geometry data. In some implementations, the training image can include a frontal view of a face. In some implementations, the training data can further include a respective view direction associated with the training image. The training image may be a single frontal view image. Additionally and/or alternatively, the training data can include a plurality of training images, a plurality of respective positions, a plurality of respective view directions, and reference geometry data. The reference geometry data can be based on one or more images determined to include realistic geometry and/or good geometry. The one or more images may include an analogous object or scene to the object or scene depicted in the training image. For example, in some implementations, the one or more images can include one or more objects of a same object class as an object in the training image. The object class may include a face class, and the one or more objects may be different faces.

[0033] In some implementations, the reference geometry data may be generated with a surface prediction model that may process latent vector data and/or position data to generate reference geometry data. The reference geometry data may be based in part on one or more image selections by a geometry determination model configured to recognize good and/or realistic geometry based on recognition of a wide variety of visual concepts. The good geometry may be kept as the reference.

[0034] The systems and methods can process an input position with a machine-learned model to generate a view rendering and three-dimensional reconstruction data. In some implementations, the input position can be associated with an observation space associated with the respective position. The input position may be determined by sampling a point in the observation space. In some implementations, the input position can be the respective position associated with the training image, and the view rendering can include a two-dimensional reconstructed image associated with the respective position.

[0035] In some implementations, the machine-learned model may include a mapping network model configured with feature-wise linear modulation conditioning. Additionally and/or alternatively, the machine-learned model can include a first model and a second model. The first model can include a surface prediction model, and the second model may include a renderer model. The renderer model can include a neural radiance field model influenced at least in part by the output of the surface prediction model.
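
One common way to realize feature-wise linear modulation (FiLM) conditioning is a mapping network that turns a latent code into per-layer scale and shift vectors for the radiance field network. The sketch below is a generic illustration, not the application's architecture; the layer sizes, activation, and names are assumptions.

```python
import torch
import torch.nn as nn

class FiLMMappingNetwork(nn.Module):
    """Maps a latent code z to per-layer (scale, shift) pairs for FiLM conditioning."""

    def __init__(self, latent_dim: int = 256, hidden_dim: int = 256, num_layers: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, hidden_dim), nn.LeakyReLU(0.2),
        )
        # One (gamma, beta) pair per conditioned layer of the radiance field network.
        self.to_film = nn.Linear(hidden_dim, num_layers * 2 * hidden_dim)
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim

    def forward(self, z: torch.Tensor):
        film = self.to_film(self.mlp(z))
        film = film.view(-1, self.num_layers, 2, self.hidden_dim)
        gamma, beta = film[:, :, 0], film[:, :, 1]
        return gamma, beta  # each: [batch, num_layers, hidden_dim]

def film(features: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Feature-wise linear modulation: scale and shift intermediate features."""
    return gamma * features + beta
```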

[0036] The view rendering can include predicted image data descriptive of a scene (e.g., a scene with an image, a park, a house, a table, a kitchen, etc.). In some implementations, the view rendering may be descriptive of a face. The view rendering can be a reconstruction of the training image (e.g., a reconstruction of a single frontal view image of a face). Alternatively and/or additionally, the view rendering may be descriptive of a novel view rendering not depicted in the training image. In some implementations, the view rendering can include data descriptive of a predicted density value and a predicted color value. Additionally and/or alternatively, the view rendering may include a predicted color value determined based at least in part on an integration of a color value distribution in a learned geometric range associated with a three-dimensional voxel grid. The learned geometric range may be based at least in part on one or more reference images and/or reference geometry.

[0037] In some implementations, the three-dimensional reconstruction data is descriptive of predicted three-dimensional geometry.
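
The integration "in a learned geometric range" mentioned above can be read as restricting the usual alpha-compositing along a ray to samples whose depths fall inside a predicted interval. The sketch below assumes a simple [t_near, t_far] interval per ray and standard NeRF-style quadrature; neither detail is specified by the application.

```python
import torch

def render_color_in_range(colors, densities, t_vals, t_near, t_far):
    """Alpha-composite colors along a ray, ignoring samples outside a learned range.

    colors:    [num_samples, 3] predicted color at each sample along the ray
    densities: [num_samples]    predicted volume density at each sample
    t_vals:    [num_samples]    distances of the samples along the ray
    t_near, t_far: scalars bounding the geometric range assumed for this ray
    """
    in_range = (t_vals >= t_near) & (t_vals <= t_far)
    densities = densities * in_range  # zero out contributions outside the range

    deltas = torch.diff(t_vals, append=t_vals[-1:] + 1e10)
    alphas = 1.0 - torch.exp(-densities * deltas)
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas + 1e-10])[:-1], dim=0)
    weights = alphas * transmittance
    return (weights.unsqueeze(-1) * colors).sum(dim=0)  # [3] rendered color
```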

[0038] A first loss function can then be evaluated by evaluating a difference between the view rendering and the training image. The first loss function can include a two-dimensional reconstruction loss. Alternatively and/or additionally, the two-dimensional reconstruction loss may include a contrastive loss. In some implementations, the first loss function can include a perceptual loss, an L2 loss, a generative loss, a KL divergence loss, and/or an adversarial loss.

[0039] In some implementations, a second loss function can be evaluated by evaluating a difference between the three-dimensional reconstruction data and the reference geometry data. The second loss function can include a three-dimensional consistency loss.
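
A contrastive variant of the two-dimensional loss could take an InfoNCE-style form, where each rendering should match the embedding of its own training image and not those of other images in a batch. This specific formulation, the temperature, and the use of an external image encoder are assumptions for illustration, not the application's definition.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(rendered_embeddings, target_embeddings, temperature: float = 0.1):
    """InfoNCE-style contrastive loss between renderings and their training images.

    rendered_embeddings, target_embeddings: [batch, dim] embeddings produced by
    some image encoder (the encoder itself is outside the scope of this sketch).
    """
    rendered = F.normalize(rendered_embeddings, dim=-1)
    target = F.normalize(target_embeddings, dim=-1)
    logits = rendered @ target.t() / temperature        # [batch, batch] similarities
    labels = torch.arange(logits.shape[0], device=logits.device)
    return F.cross_entropy(logits, labels)              # positives on the diagonal
```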

[0040] One or more parameters of the machine-learned model can then be adjusted based at least in part on the first loss function and/or the second loss function. In some implementations, the one or more parameters may include one or more geometry parameters associated with a range of realistic geometries for an input image. Additionally and/or alternatively, adjusting the one or more parameters of the machine-learned model can include training the machine-learned model to generate a color value prediction based at least in part on a range of geometric priors.

[0041] In some implementations, adjusting the one or more parameters of the machine-learned model may include training the machine-learned model to generate a density value prediction based at least in part on a range of geometric priors.

[0042] In some implementations, the systems and methods can include annealing a temperature parameter after adjusting the one or more parameters.
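
Annealing the temperature parameter after parameter updates could follow a simple schedule such as the one below; the exponential decay, the endpoint values, and the step count are illustrative assumptions rather than values from the application.

```python
def annealed_temperature(step: int,
                         initial_temperature: float = 1.0,
                         final_temperature: float = 0.1,
                         decay_steps: int = 10000) -> float:
    """Exponentially anneal a temperature from its initial to its final value."""
    progress = min(step / decay_steps, 1.0)
    return initial_temperature * (final_temperature / initial_temperature) ** progress
```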

[0043] The trained generative neural radiance field model can then be utilized for one or more tasks (e.g., view rendering generation, three-dimensional geometric reconstruction, etc.). For example, the systems and methods can include obtaining an input position and an input view direction. The input position and the input view direction can then be processed with a surface prediction model to determine geometric range data. In some implementations, the geometric range data can be associated with a predicted range of geometric outcomes. The input position, the input view direction, and the geometric range data can then be processed with a generative neural radiance field model to generate a view rendering. The view rendering can include a predicted color value and a predicted density value. In some implementations, the predicted color value and/or the predicted density value can be determined based at least in part on a learned neural radiance field and the geometric range data. The view rendering can then be provided for display.
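
The inference flow described in the preceding paragraph can be summarized as a short pipeline. The callables and the exact data passed between them are placeholders assumed for the sketch; the application leaves these interfaces open.

```python
def generate_view_rendering(surface_prediction_model, generative_nerf_model,
                            input_position, input_view_direction):
    """Illustrative inference pipeline: surface prediction -> range -> rendering."""
    # 1. Predict a range of plausible geometry for this position and view direction.
    geometric_range = surface_prediction_model(input_position, input_view_direction)

    # 2. Predict color and density, restricted by the geometric range data.
    color, density = generative_nerf_model(
        input_position, input_view_direction, geometric_range)

    # 3. Package the view rendering so it can be provided for display.
    return {"color": color, "density": density}
```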

[0044] More specifically, in some implementations, the systems and methods can include obtaining an input position and an input view direction. The input position and the input view direction can be associated with an observation associated with a scene. The scene may include one or more objects for observance. The one or more objects may include one or more faces.

[0045] The systems and methods can include processing the input position and the input view direction with a surface prediction model to determine geometric range data. In some implementations, the geometric range data may be associated with a predicted range of geometric outcomes. The surface prediction model may be trained to understand ranges of geometric outcomes for a rendering based on similar geometries to the object or scene being rendered. In some implementations, the geometric range data can include a voxel grid to restrict the color prediction and the density prediction to a set of pixels or voxels in the observation space.
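
One way to use a voxel grid to restrict color and density prediction to a subset of the observation space is to discard ray samples that fall outside voxels flagged as plausible geometry. The boolean-grid representation and the helper below are assumptions for illustration.

```python
import torch

def filter_samples_by_voxel_grid(sample_points, occupancy_grid, grid_min, grid_max):
    """Keep only ray samples that fall inside voxels flagged as plausible geometry.

    sample_points:  [N, 3] 3D points sampled along rays in the observation space
    occupancy_grid: [R, R, R] boolean grid, e.g. from the surface prediction step
    grid_min, grid_max: [3] tensors bounding the voxelized observation space
    """
    resolution = occupancy_grid.shape[0]
    # Map points to voxel indices.
    normalized = (sample_points - grid_min) / (grid_max - grid_min)
    indices = (normalized * resolution).long().clamp(0, resolution - 1)
    keep = occupancy_grid[indices[:, 0], indices[:, 1], indices[:, 2]]
    return sample_points[keep]
```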

[0046] In some implementations, the systems and methods can include processing the input position, the input view direction, and the geometric range data with a generative neural radiance field model to generate a view rendering. The view rendering can include a predicted color value and a predicted density value. The predicted color value and/or the predicted density value may be determined based at least in part on a learned neural radiance field of the generative neural radiance field model and the geometric range data. In some implementations, the generative neural radiance field model may be trained with a single frontal view image of a face. Additionally and/or alternatively, the view rendering may include a novel view rendering of the face.

[0047] In some implementations, the geometric range data can be processed with the generative neural radiance field model to limit the distribution being integrated for color value prediction and density value prediction. For example, the geometric range data can limit consideration to only a subset of a distribution for color value prediction and/or density value prediction.

[0048] The view rendering can then be provided for display. Providing for display can include sending the view rendering to a user computing device and/or may include displaying the view rendering as an image via a visual display.

[0049] In some implementations, the surface prediction model and the generative neural radiance field model may have been jointly trained with a training dataset. The training dataset may include one or more training camera parameters and latent encoding data. The latent encoding data may include one or more latent vectors. In some implementations, the latent vectors can include noise vectors.

[0050] In some implementations, reference geometry data utilized for training the generative neural radiance field model can be determined and/or generated based on one or more systems and methods for reference geometry determination. The systems and methods can include obtaining a plurality of sampled noise vectors and a plurality of camera parameters. The plurality of sampled noise vectors can be processed with a neural radiance field model to generate a plurality of neural radiance field datasets. In some implementations, the plurality of neural radiance field datasets and the plurality of camera parameters can be processed with a renderer model to generate a plurality of two-dimensional images. The systems and methods can include determining one or more particular images of the plurality of two-dimensional images with a realistic geometry.

[0051] More specifically, in some implementations, the systems and methods can include obtaining a plurality of sampled noise vectors and a plurality of camera parameters. The plurality of camera parameters can include a plurality of three-dimensional positions and a plurality of two-dimensional view directions.

[0052] In some implementations, the systems and methods can include processing the plurality of sampled noise vectors with a neural radiance field model to generate a plurality of neural radiance field datasets. The plurality of neural radiance field datasets can include a plurality of five-dimensional functions. The five-dimensional functions can be descriptive of three-dimensional positions, two-dimensional view directions, color values, appearance embeddings, shape embeddings, and/or volume density values.

[0053] The plurality of neural radiance field datasets and the plurality of camera parameters can then be processed with a renderer model to generate a plurality of two-dimensional images. The plurality of two-dimensional images can include reconstruction images and/or novel view renderings. For example, the plurality of two-dimensional images can include a plurality of different facial views of a plurality of different faces.

[0054] In some implementations, the systems and methods can include determining one or more particular images of the plurality of two-dimensional images with a realistic geometry and/or a desired geometry. Determining the one or more particular images can include processing the plurality of two-dimensional images with a geometry determination model to determine the one or more particular images. In some implementations, the geometry determination model can be configured to recognize one or more visual concepts.

[0055] Additionally and/or alternatively, the systems and methods can include training a surface prediction model and/or a generative neural radiance field model with geometry data associated with the one or more particular images.
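
The selection pipeline in paragraphs [0050]-[0055] could be wired together roughly as follows. The `geometry_scorer` stands in for the geometry determination model and is assumed to return a per-image realism score in [0, 1]; the threshold, latent dimension, and callable interfaces are illustrative assumptions.

```python
import torch

def select_reference_images(nerf_model, renderer_model, geometry_scorer,
                            camera_parameters, num_samples: int = 64,
                            latent_dim: int = 256, score_threshold: float = 0.5):
    """Sample latent codes, render 2D images, and keep those with realistic geometry."""
    # Sample noise vectors and turn each into a neural radiance field dataset.
    noise_vectors = torch.randn(num_samples, latent_dim)
    radiance_fields = [nerf_model(z) for z in noise_vectors]

    # Render a 2D image for each radiance field under its camera parameters.
    images = [renderer_model(field, cam)
              for field, cam in zip(radiance_fields, camera_parameters)]

    # Keep the images whose geometry the scoring model flags as realistic.
    scores = torch.stack([geometry_scorer(image) for image in images])
    return [image for image, score in zip(images, scores) if score > score_threshold]
```

The geometry of the retained images can then serve as the reference geometry used in the training losses described earlier.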

[0056] In some implementations, the systems and methods can train the generative neural radiance field model with a geometry regularization. The geometry regularization can be enforced and/or implemented by generating a voxelized representation of the three-dimensional observational space. For example, the observational space can include a plurality of points that can be encompassed by a plurality of voxels (e.g., pixel by pixel grids). The points of the scene can be utilized to generate a voxelized representation of the scene. The voxelized representation can be a three-dimensional representation. Portions of the voxelized representation can be confined in one or more voxel grids.
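
Building the voxelized representation from scene points can be as simple as bucketing points into a fixed-resolution occupancy grid, as in the sketch below (the cubic grid and resolution are assumptions). The returned grid and bounds match the inputs of the sample-filtering sketch given earlier.

```python
import torch

def voxelize_points(points, resolution: int = 128):
    """Build a boolean occupancy grid from 3D points of the observation space.

    points: [N, 3] point positions in the observation space.
    Returns (grid, grid_min, grid_max) where grid is [R, R, R] and True marks
    voxels that contain at least one point.
    """
    grid_min = points.min(dim=0).values
    grid_max = points.max(dim=0).values
    normalized = (points - grid_min) / (grid_max - grid_min + 1e-8)
    indices = (normalized * (resolution - 1)).long()

    grid = torch.zeros(resolution, resolution, resolution, dtype=torch.bool)
    grid[indices[:, 0], indices[:, 1], indices[:, 2]] = True
    return grid, grid_min, grid_max
```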

[0057] The geometric regularization can include obtaining one or more latent vectors associated with the density values. The inferred geometry can then be regularized towards a suitable geometry in a realistic dataset (e.g., a most suitable geometry of an example image of a same or similar object class or scene class).

[0058] In some implementations, geometric regularization can include determining a realistic range of geometric outcomes for a rendering or three-dimensional representation (e.g., a mean three-dimensional geometry with a standard deviation for error and/or for variance). The range can then be utilized to confine value consideration for predictions to a range of distributions associated with the range of outcomes. The regularization can therefore mitigate artifacts by removing outliers outside of the range of outcomes.

[0059] The range of outcomes may be refined and narrowed throughout training until an ideal geometry is found, determined, or chosen.
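
The mean-plus-deviation reading of the geometric range could be computed per ray from a set of reference geometries, as sketched below; predictions falling outside the resulting interval would then be discarded, e.g. via the range-restricted rendering shown earlier. The per-ray depth representation and the two-standard-deviation width are illustrative assumptions.

```python
import torch

def geometry_range_from_references(reference_depths, num_std: float = 2.0):
    """Compute a per-ray depth range (mean plus/minus a few standard deviations).

    reference_depths: [num_references, num_rays] surface depths taken from
    reference geometries of similar objects or scenes.
    """
    mean = reference_depths.mean(dim=0)
    std = reference_depths.std(dim=0)
    t_near = mean - num_std * std
    t_far = mean + num_std * std
    return t_near, t_far
```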

[0060] In some implementations, the systems and methods disclosed herein can be utilized to improve the training and sampling time of existing three-dimensional generative models by using a predictor that can identify the location of surfaces of a scene given the latent code associated with each scene. The systems and methods presented herein can include a multi-worker distributional training strategy, enabling model training at an unprecedented scale. The fast rendering of the radiance field and a custom distribution strategy can allow the system to train the models quickly with very high-resolution images.

[0061] In some implementations, the systems and methods can solve inverse problems in a multi-view consistent manner. The systems and methods can train a network to identify whether two images are views of the same scene using a contrastive loss. The systems and methods can add the trained identification network to the optimization loop to achieve three-dimensional consistent solutions to inverse problems.

[0062] The systems and methods of the present disclosure provide a number of technical effects and benefits. As one example, the system and methods can train a generative neural radiance field model to generate view renderings and reconstruct three-dimensional data. More specifically, the systems and methods can utilize learned geometric priors to regularize the outputs of generative neural radiance field models. The geometric priors can be determined by processing a plurality of vectors and a plurality of camera parameters to generate a plurality of rendered images, which can be processed to determine particular images with realistic geometry. The geometry data of the particular images can then be utilized to train a surface prediction model. The surface prediction model can then influence the predictions of the neural radiance field model, which can therefore mitigate the generation of artifacts in view renderings and volume representations.

[0063] Another technical benefit of the systems and methods of the present disclosure is the ability to generate novel view facial renderings using a model trained on a single frontal view image of a face. For example, the generative neural radiance field models may be trained on the single frontal view image by backpropagating a gradient descent resulting from two loss functions. The first loss function can evaluate a difference between a reconstructed two-dimensional image and the ground truth single frontal view image. The second loss function can evaluate a difference between a reconstructed three-dimensional geometry and one or more reference geometries.

[0064] Another example technical effect and benefit relates to the reduction of computational cost and computational time. The systems and methods disclosed herein can utilize a surface prediction model and/or reference geometry data to reduce the range of a distribution being integrated for the color value prediction and/or the density value prediction. For example, the reference geometry data and/or the output of the surface prediction model can provide context to an expected range of possibilities to consider for determining the predicted values.

[0065] With reference now to the Figures, example embodiments of the present disclosure will be discussed in further detail.

Example Devices and Systems

[0066] Figure 1A depicts a block diagram of an example computing system 100 that performs view rendering according to example embodiments of the present disclosure. The system 100 includes a user computing device 102, a server computing system 130, and a training computing system 150 that are communicatively coupled over a network 180.

[0067] The user computing device 102 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.

[0068] The user computing device 102 includes one or more processors 112 and a memory 114. The one or more processors 112 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 114 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 114 can store data 116 and instructions 118 which are executed by the processor 112 to cause the user computing device 102 to perform operations.

[0069] In some implementations, the user computing device 102 can store or include one or more generative neural radiance field models 120. For example, the generative neural radiance field models 120 can be or can otherwise include various machine-learned models such as neural networks (e.g., deep neural networks) or other types of machine-learned models, including non-linear models and/or linear models. Neural networks can include feedforward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks or other forms of neural networks. Example generative neural radiance field models 120 are discussed with reference to Figures 2 - 8.

[0070] In some implementations, the one or more generative neural radiance field models 120 can be received from the server computing system 130 over network 180, stored in the user computing device memory 114, and then used or otherwise implemented by the one or more processors 112. In some implementations, the user computing device 102 can implement multiple parallel instances of a single generative neural radiance field model 120 (e.g., to perform parallel view rendering across multiple instances of positions and/or view directions).

[0071] More particularly, the generative neural radiance field model can intake a position and a view direction (e.g., a three-dimensional position in an observational space and a two-dimensional view direction) and output a view rendering and/or a three-dimensional reconstruction.

[0072] Additionally or alternatively, one or more generative neural radiance field models 140 can be included in or otherwise stored and implemented by the server computing system 130 that communicates with the user computing device 102 according to a client-server relationship. For example, the generative neural radiance field models 140 can be implemented by the server computing system 130 as a portion of a web service (e.g., a view rendering service). Thus, one or more models 120 can be stored and implemented at the user computing device 102 and/or one or more models 140 can be stored and implemented at the server computing system 130.

[0073] The user computing device 102 can also include one or more user input components 122 that receive user input. For example, the user input component 122 can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus). The touch-sensitive component can serve to implement a virtual keyboard. Other example user input components include a microphone, a traditional keyboard, or other means by which a user can provide user input.

[0074] The server computing system 130 includes one or more processors 132 and a memory 134. The one or more processors 132 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 134 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 134 can store data 136 and instructions 138 which are executed by the processor 132 to cause the server computing system 130 to perform operations.

[0075] In some implementations, the server computing system 130 includes or is otherwise implemented by one or more server computing devices. In instances in which the server computing system 130 includes plural server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof.

[0076] As described above, the server computing system 130 can store or otherwise include one or more machine-learned generative neural radiance field models 140. For example, the models 140 can be or can otherwise include various machine-learned models. Example machine-learned models include neural networks or other multi-layer non-linear models. Example neural networks include feed forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks. Example models 140 are discussed with reference to Figures 2 - 8.

[0077] The user computing device 102 and/or the server computing system 130 can train the models 120 and/or 140 via interaction with the training computing system 150 that is communicatively coupled over the network 180. The training computing system 150 can be separate from the server computing system 130 or can be a portion of the server computing system 130.

[0078] The training computing system 150 includes one or more processors 152 and a memory 154. The one or more processors 152 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 154 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 154 can store data 156 and instructions 158 which are executed by the processor 152 to cause the training computing system 150 to perform operations. In some implementations, the training computing system 150 includes or is otherwise implemented by one or more server computing devices.

[0079] The training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors. For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function). Various loss functions can be used such as mean squared error, likelihood loss, cross entropy loss, hinge loss, and/or various other loss functions. Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations.

[0080] In some implementations, performing backwards propagation of errors can include performing truncated backpropagation through time. The model trainer 160 can perform a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.

[0081] In particular, the model trainer 160 can train the generative neural radiance field models 120 and/or 140 based on a set of training data 162. The training data 162 can include, for example, one or more training images (e.g., a single frontal view of a face), one or more respective positions, one or more respective view directions, and reference geometry data. In some implementations, the reference geometry data can be replaced with or complemented with one or more latent vectors. Additionally and/or alternatively, training and/or model inferencing may involve the use of a surface prediction model that can determine a range of outcomes based on known geometric priors of similar or the same object or scene.

[0082] In some implementations, if the user has provided consent, the training examples can be provided by the user computing device 102. Thus, in such implementations, the model 120 provided to the user computing device 102 can be trained by the training computing system 150 on user-specific data received from the user computing device 102. In some instances, this process can be referred to as personalizing the model.

[0083] The model trainer 160 includes computer logic utilized to provide desired functionality. The model trainer 160 can be implemented in hardware, firmware, and/or software controlling a general purpose processor. For example, in some implementations, the model trainer 160 includes program files stored on a storage device, loaded into a memory and executed by one or more processors. In other implementations, the model trainer 160 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, a hard disk, or optical or magnetic media.

[0084] The network 180 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links. In general, communication over the network 180 can be carried via any type of wired and/or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).

[0085] The machine-learned models described in this specification may be used in a variety of tasks, applications, and/or use cases.

[0086] In some implementations, the input to the machine-learned model(s) of the present disclosure can be image data. The machine-learned model(s) can process the image data to generate an output. As an example, the machine-learned model(s) can process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an image segmentation output. As another example, the machine-learned model(s) can process the image data to generate an image classification output. As another example, the machine-learned model(s) can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an encoded image data output (e.g., an encoded and/or compressed representation of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an upscaled image data output. As another example, the machine-learned model(s) can process the image data to generate a prediction output.

[0087] In some implementations, the input to the machine-learned model(s) of the present disclosure can be text or natural language data. The machine-learned model(s) can process the text or natural language data to generate an output. As an example, the machine-learned model(s) can process the natural language data to generate a language encoding output. As another example, the machine-learned model(s) can process the text or natural language data to generate a latent text embedding output. As another example, the machine-learned model(s) can process the text or natural language data to generate a classification output. As another example, the machine-learned model(s) can process the text or natural language data to generate a semantic intent output. As another example, the machine-learned model(s) can process the text or natural language data to generate a prediction output.

[0088] In some implementations, the input to the machine-learned model(s) of the present disclosure can be latent encoding data (e.g., a latent space representation of an input, etc.). The machine-learned model(s) can process the latent encoding data to generate an output. As an example, the machine-learned model(s) can process the latent encoding data to generate a recognition output. As another example, the machine-learned model(s) can process the latent encoding data to generate a reconstruction output. As another example, the machine-learned model(s) can process the latent encoding data to generate a search output. As another example, the machine-learned model(s) can process the latent encoding data to generate a reclustering output. As another example, the machine-learned model(s) can process the latent encoding data to generate a prediction output.

[0089] In some implementations, the input to the machine-learned model(s) of the present disclosure can be statistical data. The machine-learned model(s) can process the statistical data to generate an output. As an example, the machine-learned model(s) can process the statistical data to generate a recognition output. As another example, the machine-learned model(s) can process the statistical data to generate a prediction output. As another example, the machine-learned model(s) can process the statistical data to generate a classification output. As another example, the machine-learned model(s) can process the statistical data to generate a segmentation output. As another example, the machine-learned model(s) can process the statistical data to generate a visualization output. As another example, the machine-learned model(s) can process the statistical data to generate a diagnostic output.

[0090] In some implementations, the input to the machine-learned model(s) of the present disclosure can be sensor data. The machine-learned model(s) can process the sensor data to generate an output. As an example, the machine-learned model(s) can process the sensor data to generate a recognition output. As another example, the machine-learned model(s) can process the sensor data to generate a prediction output. As another example, the machine-learned model(s) can process the sensor data to generate a classification output. As another example, the machine-learned model(s) can process the sensor data to generate a segmentation output. As another example, the machine-learned model(s) can process the sensor data to generate a visualization output. As another example, the machine-learned model(s) can process the sensor data to generate a diagnostic output. As another example, the machine-learned model(s) can process the sensor data to generate a detection output.

[0091] In some cases, the input includes visual data and the task is a computer vision task. In some cases, the input includes pixel data for one or more images and the task is an image processing task. For example, the image processing task can be image classification, where the output is a set of scores, each score corresponding to a different object class and representing the likelihood that the one or more images depict an object belonging to the object class. The image processing task may be object detection, where the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that region depicts an object of interest. As another example, the image processing task can be image segmentation, where the image processing output defines, for each pixel in the one or more images, a respective likelihood for each category in a predetermined set of categories. For example, the set of categories can be foreground and background. As another example, the set of categories can be object classes. As another example, the image processing task can be depth estimation, where the image processing output defines, for each pixel in the one or more images, a respective depth value. As another example, the image processing task can be motion estimation, where the network input includes multiple images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input.

[0092] The systems and methods can utilize one or more neural radiance field models. Neural radiance field models can be trained and/or configured to process a three-dimensional position and a two-dimensional view direction to generate and/or determine a predicted color and a predicted volume density. The predicted color and the predicted volume density can be determined based on a five-dimensional function. The five-dimensional function can be learned by training based on comparisons between a training output and one or more ground truth images. In some implementations, the predicted color and/or the predicted volume density can be determined based on an integration over a distribution (e.g., a color probability distribution or a volume density probability distribution).
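
For reference, the standard neural radiance field rendering integral that such a model approximates (as described by Mildenhall et al., cited above) can be written as follows; restricting the geometric range as described elsewhere in this disclosure amounts to tightening the integration bounds $t_n$ and $t_f$ for each ray $\mathbf{r}$.

```latex
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
```

Here $\sigma$ is the predicted volume density, $\mathbf{c}$ the predicted color for view direction $\mathbf{d}$, and $T(t)$ the accumulated transmittance along the ray.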

[0093] In some implementations, the systems and methods disclosed herein can reduce the size of the distribution being integrated over based on reference geometry data, an output of a surface prediction model, and/or geometric priors.

[0094] Figure 1A illustrates one example computing system that can be used to implement the present disclosure. Other computing systems can be used as well. For example, in some implementations, the user computing device 102 can include the model trainer 160 and the training dataset 162. In such implementations, the models 120 can be both trained and used locally at the user computing device 102. In some of such implementations, the user computing device 102 can implement the model trainer 160 to personalize the models 120 based on user-specific data.

[0095] Figure 1B depicts a block diagram of an example computing device 10 that performs according to example embodiments of the present disclosure. The computing device 10 can be a user computing device or a server computing device.

[0096] The computing device 10 includes a number of applications (e.g., applications 1 through N). Each application contains its own machine learning library and machine-learned model(s). For example, each application can include a machine-learned model. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.

[0097] As illustrated in Figure 1B, each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, each application can communicate with each device component using an API (e.g., a public API). In some implementations, the API used by each application is specific to that application.

[0098] Figure 1C depicts a block diagram of an example computing device 50 that performs according to example embodiments of the present disclosure. The computing device 50 can be a user computing device or a server computing device.

[0099] The computing device 50 includes a number of applications (e.g., applications 1 through N). Each application is in communication with a central intelligence layer. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. In some implementations, each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).

[0100] The central intelligence layer includes a number of machine-learned models. For example, as illustrated in Figure 1C, a respective machine-learned model (e.g., a model) can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model (e.g., a single model) for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of the computing device 50.

[0101] The central intelligence layer can communicate with a central device data layer. The central device data layer can be a centralized repository of data for the computing device 50. As illustrated in Figure 1C, the central device data layer can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, the central device data layer can communicate with each device component using an API (e.g., a private API).

Example Model Arrangements

[0102] Figure 2 depicts a block diagram of an example generative neural radiance field model 200 according to example embodiments of the present disclosure. In some implementations, the generative neural radiance field model 200 is trained to receive a set of input data 204 descriptive of one or more camera parameters and, as a result of receipt of the input data 204, provide output data 206 that can include one or more rendered two-dimensional images. Thus, in some implementations, the generative neural radiance field model 200 can include a renderer model 214 that is operable to render the two-dimensional images based on radiance field data 212.

[0103] The generative neural radiance field model 200 can be and/or can include one or more machine-learned models. Training the generative neural radiance field model can include obtaining one or more camera parameters 204 (e.g., one or more positions and one or more view directions), one or more latent vectors 202 (e.g., one or more noise vectors), one or more training images 216 (e.g., one or more frontal view images), and reference geometry data 218 (e.g., one or more reference geometry datasets descriptive of three-dimensional geometry for one or more faces).

[0104] The latent vectors 202 can be processed with a neural radiance field model 210 (e.g., a pi-GAN model including one or more feature-wise linear modulation blocks) to generate radiance field data 212. The radiance field data can then be utilized to generate reconstructed three-dimensional geometry 208. Additionally and/or alternatively, the radiance field data 212 and the input data 204 (e.g., the camera parameters) can be processed with a renderer model 214 to generate a rendered two-dimensional image 206.

[0105] The rendered two-dimensional image 206 and the reconstructed three- dimensional geometry 208 can then be utilized to evaluate one or more loss functions. For example, the rendered two-dimensional image 206 can be compared with a training image 216 to evaluate a two-dimensional reconstruction loss 220. Additionally and/or alternatively, the reconstructed three-dimensional geometry 208 can be compared with the reference geometry data 218 to evaluate a three-dimensional consistency loss 222. The evaluations can be used to generate a two-dimensional gradient descent and a three-dimensional gradient descent. The gradient descents can then be backpropagated to adjust one or more parameters of the generative neural radiance field model 200.

[0106] In particular, Figure 2 can depict an overview of an example method according to implementations disclosed herein. The systems and methods can jointly optimize over two-dimensional reconstruction of the input image 216 and three-dimensional consistency of the radiance field 212. Given a randomly sampled noise vector z 202 from a Gaussian distribution, the systems and methods can obtain a radiance field G(z,·) 212 via the pi-GAN generator G 210 and rendered two-dimensional images 206 using a conventional volumetric renderer R 214 for given camera parameters c 204.
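For illustration only, the following is a minimal sketch of one joint training update corresponding to the description of Figure 2, in which a two-dimensional reconstruction loss and a three-dimensional consistency loss are evaluated and backpropagated together. The interfaces (generator, renderer, voxelize) and the use of mean-squared-error losses are assumptions for this sketch, not the claimed implementation.

```python
import torch

def training_step(generator, renderer, voxelize, training_image,
                  reference_geometry, camera_params, optimizer,
                  lambda_3d=1.0):
    """Hedged sketch of one joint 2D/3D training update.

    `generator`, `renderer`, and `voxelize` are assumed interfaces: the
    generator maps a noise vector to a radiance field, the renderer produces a
    2D image for given camera parameters, and `voxelize` extracts a
    reconstructed 3D geometry from the radiance field.
    """
    z = torch.randn(generator.latent_dim)                 # sampled noise vector
    radiance_field = generator(z)                         # radiance field data
    rendering = renderer(radiance_field, camera_params)   # rendered 2D image
    geometry = voxelize(radiance_field)                   # reconstructed 3D geometry

    # First loss: difference between the view rendering and the training image.
    loss_2d = torch.nn.functional.mse_loss(rendering, training_image)

    # Second loss: difference between reconstructed and reference geometry.
    loss_3d = torch.nn.functional.mse_loss(geometry, reference_geometry)

    loss = loss_2d + lambda_3d * loss_3d
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```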

[0107] Figure 3 depicts a block diagram of an example geometry determination model 300 according to example embodiments of the present disclosure. In some implementations, the geometry determination model 300 is trained to receive a set of input data 304 descriptive of one or more latent vectors (e.g., one or more noise vectors) and, as a result of receipt of the input data 304, provide output data 306 that can be utilized for training a generative neural radiance field model and/or a surface prediction model. Thus, in some implementations, the geometry determination model 300 can include a pre-trained geometry determination model 302 that is operable to determine whether an image includes realistic geometry.

[0108] Reference geometry data for training one or more machine-learned models can be selected or determined by utilizing a plurality of different techniques. One generation and/or determination method can involve the use of a geometry determination model 300.

[0109] The generation process can include obtaining input data 304 (e.g., one or more latent vectors) and one or more camera parameters 310 (e.g., one or more positions and one or more view directions). The input data 304 can be processed by a radiance field model 302 to generate one or more radiance fields 308. The radiance fields 308 and the camera parameters 310 can then be processed with a renderer model 312 to generate a plurality of rendered two-dimensional images 314. The plurality of rendered two-dimensional images 314 can then be processed with a pre-trained geometry determination model 316 to determine if each image has good geometry (e.g., realistic geometry) or bad geometry. The good geometry 306 images may be selected to generate reference geometry data for training one or more machine-learned models.

[0110] In particular, Figure 3 can depict candidate selection for reference geometry. The systems and methods can include a systematic process to pick reference geometries 306, which can enable computation of the three-dimensional consistency loss. Given rendered two-dimensional images 314 from randomly sampled noise vectors {z_1, ..., z_M} 304, the systems and methods can classify "good" vs. "bad" geometries 306 based on which two-dimensional images 314 satisfy a pre-trained geometry determination model 316 that recognizes a wide variety of visual concepts. The good geometry can then be kept as the reference.
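As a hedged illustration of this candidate-selection procedure, the sketch below renders each randomly sampled latent vector from several camera positions and keeps it only if a pre-trained scoring model judges every rendering to be realistic. The scorer interface, the threshold, and the sampling loop are assumptions introduced for this example.

```python
import torch

def select_reference_geometries(generator, renderer, scorer,
                                camera_params_list, num_candidates=100,
                                threshold=0.5):
    """Hedged sketch of selecting "good" reference geometries.

    `scorer` stands in for the pre-trained geometry determination model; its
    interface is an assumption. Each candidate latent vector is rendered from
    several camera positions and kept only if every rendering is judged
    realistic.
    """
    good_latents = []
    for _ in range(num_candidates):
        z = torch.randn(generator.latent_dim)
        radiance_field = generator(z)
        renderings = [renderer(radiance_field, c) for c in camera_params_list]
        # "Good" geometry: all rendered views look plausible to the scorer.
        scores = torch.stack([scorer(img) for img in renderings])
        if scores.min() >= threshold:
            good_latents.append(z)
    return good_latents
```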

[0111] Figure 4 depicts an illustration of example model outputs according to example embodiments of the present disclosure.

[0112] In particular, Figure 4 can depict an example NerfGAN inversion 400. Given a single frontal view image 402, the systems and methods can generate novel angle views 404 & 406 and the underlying three-dimensional geometry 408 & 410. As shown, latent space optimization as proposed in pi-GAN can create obstructions (stones) (e.g., the circled sections) that can produce artifacts in novel views. The middle column can illustrate the artifacts in the rendered image (top row) 404 and the three-dimensional geometry (bottom row) 408. The example inversion algorithm according to systems and methods disclosed herein can remove these issues by optimizing over both the two-dimensional view 406 and the three-dimensional shape 410 of the radiance field. The middle column can depict a reconstructed three-dimensional geometry 408 and a novel view 404 using pi-GAN direct latent space optimization. The right column can depict generated three-dimensional structure 410 and novel view 406 using the proposed reconstruction algorithm. The systems and methods can emphasize that the example algorithm can also use the same generator (pi- GAN), but may recover a better latent vector compared to direct latent-space optimization. This can lead to a better three-dimensional geometry reconstruction and better two- dimensional novel views.

[0113] Figure 5 depicts a block diagram of an example generative neural radiance field model 500. The generative neural radiance field model 500 can be trained using a process similar to the process depicted in Figure 2. In particular, Figure 5 can depict a plurality of blocks that can process data to generate a view rendering. For example, an upper pipeline 506 can include one or more linear blocks and one or more feature-wise linear modulation blocks, and a lower pipeline (e.g., the mapping network model pipeline) 508 can include one or more linear blocks and one or more ReLU blocks.

[0114] The input can include a position 502 and noise data (e.g., a latent vector) 504. The position 502 can be processed with the upper pipeline 506 conditioned based on one or more outputs of the lower pipeline 508. For example, the noise data 504 can be processed with a mapping network model (e.g., a mapping network model with one or more linear blocks and one or more ReLU blocks) to generate a latent encoding output. The position 502 can then be processed by a linear block to generate a linear block output. The linear block output can then be processed with a feature-wise linear modulation block conditioned on the latent encoding data to generate a second output. The processing can occur iteratively.

[0115] The second output can then be processed by a linear block to generate volume density data 510. In some implementations, the second output can be processed with a linear block, a feature-wise linear modulation block conditioned on latent encoding data, and another linear block to generate color data 512. The latent encoding data can include ray direction data.

[0116] The feature-wise linear modulation block 514 can intake a layer input and output a layer output. The layer input can be processed based on frequency data and phase shift data associated with the latent encoding data.
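For illustration, a feature-wise linear modulation block of the kind described above might be sketched as follows; the sine activation mirrors SIREN-style conditioning used in pi-GAN-like generators, but the concrete layer shapes and activation are assumptions for this example.

```python
import torch
from torch import nn

class FiLMBlock(nn.Module):
    """Hedged sketch of a feature-wise linear modulation (FiLM) block.

    The block scales and shifts its layer input using frequency and phase-shift
    data derived from the latent encoding, roughly as described above.
    """

    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, layer_input, frequency, phase_shift):
        # frequency and phase_shift come from the mapping network output.
        x = self.linear(layer_input)
        return torch.sin(frequency * x + phase_shift)
```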

[0117] In some implementations, the ReLU blocks can include one or more multi-layer perceptron sub-blocks.

Example Methods

[0118] Figure 6 depicts a flow chart diagram of an example method to perform according to example embodiments of the present disclosure. Although Figure 6 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement. The various steps of the method 600 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure.

[0119] At 602, a computing system can obtain training data. The training data can include a training image, a respective position associated with the training image, and reference geometry data. In some implementations, the training image may include one or more faces. Additionally and/or alternatively, the training image may include a single frontal view of the face.

[0120] At 604, the computing system can process an input position with a machine-learned model to generate a view rendering and three-dimensional reconstruction data. The input position may be associated with an observation space associated with the respective position. The view rendering can include predicted image data descriptive of a scene, and the three-dimensional reconstruction data can be descriptive of a predicted three-dimensional geometry of the scene. In some implementations, the input position may be randomly sampled from the observational space. Alternatively and/or additionally, the input position may be the same position as the respective position associated with the training image. In some implementations, the view rendering can include data descriptive of a predicted density value and a predicted color value.

[0121] The view rendering may include a predicted color value determined based at least in part on an integration of a color value distribution in a learned geometric range associated with a three-dimensional voxel grid. In some implementations, the view rendering comprises a two-dimensional reconstructed image associated with the respective position.

[0122] Additionally and/or alternatively, the machine-learned model can include a mapping network model configured with feature-wise linear modulation conditioning. In some implementations, the machine-learned model can include a first model and a second model. The first model can include a surface prediction model, and the second model may include a renderer model.

[0123] At 606, the computing system can evaluate a first loss function that evaluates a difference between the view rendering and the training image. In some implementations, the first loss function can include a two-dimensional reconstruction loss, and the two- dimensional reconstruction loss may include a contrastive loss.

[0124] At 608, the computing system can evaluate a second loss function that evaluates a difference between the three-dimensional reconstruction data and the reference geometry data. In some implementations, the second loss function can include a three-dimensional consistency loss.

[0125] At 610, the computing system can adjust one or more parameters of the machine- learned model based at least in part on at least one of the first loss function or the second loss function. In some implementations, the one or more parameters can include one or more geometry parameters associated with a range of realistic geometries for an input image. Adjusting the one or more parameters of the machine-learned model can include training the machine-learned model to generate a color value prediction based at least in part on a range of geometric priors. Additionally and/or alternatively, adjusting the one or more parameters of the machine-learned model can include training the machine-learned model to generate a density value prediction based at least in part on a range of geometric priors.

[0126] In some implementations, the computing system can anneal a temperature parameter after adjusting the one or more parameters. The training process may be repeated iteratively, and for each iteration, annealing may occur until a single reference geometry may be used for geometric regularization.

[0127] Figure 7 depicts a flow chart diagram of an example method to perform according to example embodiments of the present disclosure. Although Figure 7 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement. The various steps of the method 700 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure.

[0128] At 702, a computing system can obtain an input position and an input view direction. The input position and the input view direction can be input by a user using a user computing device and/or may be randomly selected. The input position can be associated with a three-dimensional position in an observational space. Additionally and/or alternatively, the view direction can be a two-dimensional view direction associated with the input position.

[0129] At 704, the computing system can process the input position and the input view direction with a surface prediction model to determine geometric range data. The geometric range data can be descriptive of a range of possible geometric outcomes associated with geometries associated with similar scenes or objects.

[0130] At 706, the computing system can process the input position, the input view direction, and the geometric range data with a generative neural radiance field model to generate a view rendering. The view rendering can include a predicted color value and a predicted density value. In some implementations, the view rendering can include a two-dimensional image. The predicted color value and/or the predicted density value may be determined by integrating over a portion of a distribution associated with a learned neural radiance field. The portion can be selected based at least in part on the geometric range data.

[0131] In some implementations, the surface prediction model and the generative neural radiance field model may have been jointly trained with a training dataset including one or more training camera parameters and latent encoding data. Alternatively and/or additionally, the generative neural radiance field model may have been trained with a single frontal view image of a face, and the view rendering may include a novel view rendering of the face.
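As a hedged sketch of steps 704 and 706, the example below restricts the rendering integral to a depth interval returned by a surface prediction model, so that the color and density predictions are integrated only over the portion of the distribution selected by the geometric range data. The (near, far) interface of the surface model and the sampling scheme are assumptions for illustration.

```python
import torch

def render_with_geometric_range(surface_model, nerf_model, position,
                                view_direction, num_samples=32):
    """Hedged sketch: integrate only over a learned geometric range.

    `surface_model` is assumed to return a (near, far) depth interval around
    the predicted surface for the given position and view direction.
    """
    near, far = surface_model(position, view_direction)   # geometric range data
    t = torch.linspace(0.0, 1.0, num_samples) * (far - near) + near
    points = position[None, :] + t[:, None] * view_direction[None, :]
    dirs = view_direction[None, :].expand(num_samples, 3)

    rgb, sigma = nerf_model(points, dirs)

    # Composite only over the restricted portion of the distribution.
    delta = ((far - near) / num_samples) * torch.ones(num_samples)
    alpha = 1.0 - torch.exp(-sigma * delta)
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * transmittance
    color = (weights[:, None] * rgb).sum(dim=0)
    accumulated = weights.sum()   # accumulated opacity over the restricted range
    return color, accumulated
```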

[0132] At 708, the computing system can provide the view rendering for display. Providing for display can involve providing the view rendering as an image displayed on a visual display of a computing device.

[0133] Figure 8 depicts a flow chart diagram of an example method to perform according to example embodiments of the present disclosure. Although Figure 8 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement. The various steps of the method 800 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure.

[0134] At 802, a computing system can obtain a plurality of sampled noise vectors and a plurality of camera parameters.

[0135] At 804, the computing system can process the plurality of sampled noise vectors with a neural radiance field model to generate a plurality of neural radiance field datasets.

[0136] At 806, the computing system can process the plurality of neural radiance field datasets and the plurality of camera parameters with a renderer model to generate a plurality of two-dimensional images.

[0137] At 808, the computing system can determine one or more particular images of the plurality of two-dimensional images with a realistic geometry. Determining the one or more particular images can include processing the plurality of two-dimensional images with a geometry determination model to determine the one or more particular images. The geometry determination model can be configured to recognize one or more visual concepts.

[0138] In some implementations, the computing system can train at least one of a surface prediction model or a generative neural radiance field model with geometry data associated with the one or more particular images.

Example Implementations

[0139] The systems and methods disclosed herein can leverage different techniques and unsupervised methods for inverse problems using pre-trained generative models. For example, the systems and methods can leverage a pre-trained generator and hence may avoid the computational overhead of training an end-to-end supervised network to go from observations to three-dimensional geometry. There can be various benefits of using unsupervised methods for solving inverse problems including robustness to data structure shifts and unknown variations in the corruption process.

[0140] Inversion algorithms can involve building on latent space optimization, and the systems and methods may extend the inversion algorithms using Perceptual Loss and Geodesic regularization for frequencies. The systems and methods disclosed herein can include a three-dimensional regularization term which can be directly applied in the neural radiance field.

[0141] The systems and methods disclosed herein can be utilized for solving inverse problems for three-dimensional neural radiance fields given a single two-dimensional view. The systems and methods can include a framework that naturally generalizes even if a partial or corrupted two-dimensional view is available and can extend on existing work on unsupervised methods for inverse problems. In particular, the systems and methods can include regularizing the neural radiance field using reference geometries. The systems and methods can be applicable to generative neural radiance field methods and can improve in performance as more powerful pre-trained NerfGANs become available.

[0142] In some implementations, the systems and methods can include a framework for solving inverse problems using NeRF-style generative models. The systems and methods can solve the problem of three-dimensional scene reconstruction given a single two-dimensional image and known camera parameters. An additional problem can involve naively optimizing the latent space, which can lead to artifacts and poor novel view rendering. The problem may be attributed to volume obstructions that are not visible in the given view but become visible in novel views and are clearly visible in the three-dimensional geometry.

[0143] The systems and methods can include a radiance field regularization method to obtain better three-dimensional surfaces and improved novel views given single view observations. The systems and methods can naturally extend to general inverse problems including inpainting where one observes only partially a single view.

[0144] The systems and methods can achieve visual improvements and performance boosts over the baselines in a wide range of tasks. In some implementations, the methods can achieve 30 - 40% MSE reduction and 15 - 25% reduction in LPIPS loss compared to the previous state of the art.

[0145] Generative models can become capable of generating extremely high-fidelity images of the two-dimensional world. Despite their wide success, current generative models can often fail to capture the three-dimensional structure of the represented scenes and can offer limited control over the geometrical properties of the generated images.

[0146] NerfGANs can be considered a new family of generative models that directly model the three-dimensional space by leveraging the success of Neural Radiance Fields (NeRFs). NerfGANs can generate three-dimensional structure in the form of a Neural Radiance Field and can then output two-dimensional images by rendering the field from different camera views. These models may not yet be as competitive as state-of-the-art two-dimensional models for image generation. However, their ability to directly model the three-dimensional space offers many new possibilities, extending beyond generating photorealistic images, that may not yet be fully explored.

[0147] The systems and methods can include the solution of inverse problems using pre-trained NeRF-style generative adversarial models (NerfGANs). The problem of single-view inversion can include: given a single two-dimensional image (e.g., a photograph of a person), the system may be configured to create novel views and reconstruct the three-dimensional geometry leveraging a pre-trained NerfGAN (e.g., the pi-GAN model). The systems and methods may denote by G(z, p) a 3D NerfGAN that takes a latent vector z and a three-dimensional space position p in R^3 and can output a color and a density value. For a given latent vector z, the NerfGAN scene can be rendered as a two-dimensional image for any camera position. Formally, for a given camera position c, the produced two-dimensional image can be denoted by R(c, G(z,·)), where R can be the rendering operator. Given a single target image x* and known camera parameters c, NerfGAN inversion can be the problem of finding the optimal latent code z* that can create a 3D scene that renders to x*.

[0148] A natural method for NerfGAN inversion (used in pi-GAN) can be to optimize the latent vector to match the observed target image:

min_z || R(c, G(z,·)) − x* ||.    (1)
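For illustration, direct latent-space optimization of Equation (1) might be sketched as follows; the optimizer choice, step count, and module interfaces are assumptions rather than the claimed method.

```python
import torch

def invert_latent(generator, renderer, target_image, camera_params,
                  num_steps=500, lr=1e-2):
    """Hedged sketch of direct latent-space inversion as in Equation (1).

    The latent vector z is optimized so that the rendered view matches the
    observed target image x*; module names are illustrative assumptions.
    """
    z = torch.randn(generator.latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(num_steps):
        rendering = renderer(generator(z), camera_params)
        loss = torch.norm(rendering - target_image)   # ||R(c, G(z,.)) - x*||
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return z.detach()
```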

[0149] The produced neural radiance field can render correctly to the given target image x*, but even small rotations may produce significant artifacts in novel views. A single front view can be given, and a radiance field can be reconstructed using latent space optimization. The neural fields created by solving the inverse problem can often have three-dimensional obstructions (that may be referred to as stones) that are invisible in the frontal view because of their color pattern. These obstructions can create significant artifacts that become more pronounced in rotated views.

[0150] Similar issues can be raised for the pi-GAN model, which can be observed to have inverted hollow-face artifacts but not obstructions. Solving inverse problems using direct latent space optimization can frequently produce unrealistic three-dimensional obstructions that can also lead to visual artifacts when rendered from novel views. To account for this problem, the systems and methods may penalize divergence between the SIREN frequencies and phase shifts and their average values. The approach, even though it produces smooth geometries, can significantly reduce the range of the generator, leading to blurred reconstructions. The systems and methods may instead solve a regularized version of the latent optimization in which a three-dimensional regularization term S_3D imposes a high penalty for latent vectors z that can create unnatural geometries. The systems and methods may achieve such a 3D regularization by creating a convex combination of distances to reference geometries.

[0151] Beyond reconstructing the three-dimensional structure of a scene using a single view x*, the systems and methods can extend the method to general inverse problems. For example, the method can be directly applied when there are missing pixels in the view (inpainting), a blurred observed view, or observations of the single view with random projections or Fourier projections arising in medical imaging and compressed sensing.

[0152] Consider the general setting where the unknown two-dimensional scene can be x* and the system can observe

y = A[x*] + noise,    (3)

where A is the forward operator that performs the pixel removal, blurring, or projections respectively. Direct latent optimization in this case can correspond to solving the optimization problem to explain the measurements

min_z || A[R(c, G(z,·))] − y ||.    (4)
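A hedged sketch of the measurement-matching objective of Equation (4) is shown below, with a pixel-selection mask as one possible forward operator A for inpainting; the interfaces and the example operator are assumptions for illustration.

```python
import torch

def invert_with_forward_operator(generator, renderer, forward_op, measurements,
                                 camera_params, num_steps=500, lr=1e-2):
    """Hedged sketch of Equation (4): explain measurements y = A[x*] + noise.

    `forward_op` stands in for any (almost-everywhere) differentiable operator
    A, e.g. a pixel-selection mask for inpainting or a blurring kernel.
    """
    z = torch.randn(generator.latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(num_steps):
        rendering = renderer(generator(z), camera_params)
        loss = torch.norm(forward_op(rendering) - measurements)  # ||A[R(c,G(z,.))] - y||
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return z.detach()

# Example forward operator for random inpainting: keep only observed pixels.
def make_inpainting_operator(mask):
    return lambda image: image * mask
```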

[0153] The natural baseline can create three-dimensional obstructions and artifacts as in the case of single-view inversion. The method can use the same three-dimensional regularization and can solve the corresponding regularized optimization problem, denoted (5).

[0155] The method can be applied to linear inverse problems (where A is a matrix) or even non-linear problems like phase retrieval as long as the forward operator can be differentiable almost everywhere.

[0156] The key issue can be to devise a three-dimensional regularizer, S_3D, that does not lead to measurements overfitting (i.e., a small measurements error (the first term of (5)) but a big real error || R(c, G(z,·)) − x* ||). The systems and methods can use a set of reference geometries and an annealing mechanism in gradient descent to lock in on better fitting geometries.

[0157] One limitation on reconstructing three-dimensional geometries and novel views from a single reference image can include that existing algorithms may not be sufficient, because either: i) they overfit to the measurements and produce artifacts to novel views, or ii) they significantly limit the expressive power of the generator.

[0158] The problem can be traced back to unrealistic three-dimensional geometries that need to be avoided in the course of the optimization. The systems and methods can include a principled framework for regularizing the radiance field itself, without sacrificing the range of the generator. The framework can drive the network to generate realistic geometries by measuring distance from a set of realistic geometries under a novel three-dimensional loss.

[0159] The systems and methods can obtain a candidate set of realistic geometries using CLIP and a pre-trained NerfGAN. The systems and methods can be experimentally evaluated and can achieve visual improvements and performance boosts over the baselines in a wide range of tasks. The method can achieve 30 - 40% MSE reduction and 15 - 25% reduction in LPIPS loss compared to the previous state of the art.

[0160] Reconstructed images can be sufficiently different compared to the input images. The poor frontal view reconstruction can be attributed to the limited range of the pi-GAN generator. Using the exact same generator, the systems and methods can obtain much better reconstructions compared to pi-GAN, as shown visually and by the MSE and Perceptual losses.

[0161] The pi-GAN reconstructions can look like smoothed versions of the reference images. The superficial smoothness can come from the regularization term that can be used in the pi-GAN model. Specifically, the systems and methods can penalize divergence between the SIREN frequencies and phase shifts and their average values. A novel view can be obtained with pi-GAN without this regularization. The face can be much closer to the given input. There can be a caveat; the novel view can show artifacts. Experiments can show an important trade-off (i.e., matching better the measurements but with poor generalization vs inferior reconstruction of the image but smoother novel views).

[0162] A natural question can be why the optimization trajectory finds such geometries that match almost perfectly the given two-dimensional image but can produce artifacts in novel views. The short answer can be that these unrealistic geometries are fairly common, even in pure image generation with Gaussian inputs. An experiment can be used to validate this.

[0163] The system can first collect latent vectors, sampled from the Gaussian distribution, and the system can form the set S = {z_1, ..., z_M}. For each of the elements in the set, the system can then render images from camera positions c_1, ..., c_K and the system can form the set S_z = {R(c_1, G(z,·)), ..., R(c_K, G(z,·))}. The system may assign a cost to each of these sets that can be given by the maximum difference of CLIP logits between the images in the set and the text prompt T = "A non-corrupted image of a person." The cost w_z of S_z can be given by (6).

[0164] After assigning the costs, the set of latents that correspond to unrealistic geometries can be given by:

Bad_ε = {z ∈ S | w_z > ε}.    (7)

[0165] The fraction of unrealistic geometries, i.e., |Bad_ε| / M,

[0166] can be a (noisy) measure of how often bad geometries occur in the range of a NerfGAN. Experiments can convey that unrealistic geometries can be quite common in the range of pi-GAN - approximately 40% of geometries can be classified as "bad" by CLIP. For a visualization of geometries that are classified as bad by CLIP, see the rightmost part of Figure 3.

[0167] Similarly, one can use CLIP to identify realistic geometries. The systems and methods may include collecting a set of realistic geometries that render to visually plausible two-dimensional images. For each Gaussian sampled z, the system can assign two costs: i) the consistency cost that can be defined in (6) and ii) the plausibility cost:

[0168] The system can then collect the set:

[0169] The whole procedure can be illustrated in Figure 3, with examples of geometries and renderings of the “good” and “bad” set. The collected set of realistic geometries can be open sourced.
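For illustration only, the sketch below splits sampled latents into "good" and "bad" geometries by comparing CLIP-style scores of renderings from several camera positions against a text prompt; the exact cost (here, the spread of scores across views) and the clip_score interface are assumptions and do not reproduce the claimed costs exactly.

```python
import torch

def classify_geometries(latents, generator, renderer, clip_score,
                        camera_positions, prompt, eps):
    """Hedged sketch of splitting latents into "good" and "bad" geometries.

    `clip_score(image, prompt)` stands in for a CLIP-style logit between an
    image and a text prompt; a large spread of scores across views is treated
    as an indicator of an unrealistic geometry.
    """
    good, bad = [], []
    for z in latents:
        views = [renderer(generator(z), c) for c in camera_positions]
        scores = torch.stack([clip_score(v, prompt) for v in views])
        # Consistency cost: maximum spread of the scores across views.
        consistency = scores.max() - scores.min()
        if consistency > eps:
            bad.append(z)
        else:
            good.append(z)
    return good, bad
```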

[0170] Without regularization, the optimization trajectory can often reach points of minimum loss but with poor generalization to novel views. Hence, the systems and methods may regularize towards realistic geometries. The systems and methods may introduce notation to denote the density part of the radiance field and to denote a set of points that corresponds to its voxelized representation.

[0171] One technique to constrain the three-dimensional shape can be to force it to be close to a three-dimensional geometry that is known to be good. The systems and methods may collect a set of latent vectors S that correspond to realistic geometries and can try to regularize the inferred geometry towards the most suitable geometry in the realistic set. This can be made more concrete in the form of the constrained optimization problem denoted (10).

[0172] The systems and methods can also work with the penalized version of the problem, denoted (11), and can solve that objective instead.

[0173] There may be two issues with the formulation of (11). First, one issue can involve a min-min problem where the inner minimum can be over a discrete set. Gradient Descent (GD) can be likely to get stuck in a local minimum: the reference geometry that happens to be closer to the initialization can be likely to be the active constraint of (10), even though it might not be the one that minimizes the total objective. The systems and methods can observe this problem experimentally. The second issue can be that the two terms in the loss function might be incompatible. For example, if the reference set S is small, there may be no geometry that renders to the measurements.

[0174] In some implementations, the systems and methods can include a relaxation of the objective of (11), where the min can be replaced with a soft-operator that allows all reference geometries to contribute to the gradients based on how close they are, under the distance ℓ, to the current radiance field. The systems and methods can consider the resulting optimization problem, denoted (12).

[0175] This can be interpreted as finding a nonnegative weighting of the losses that is constrained, under some divergence measure D, to lie within a radius γ of the uniform distribution u. The softmax weighting can emerge when D is taken to be the KL-divergence. Observe that for T → ∞ (corresponding to a large enough γ), the optimization problems of Equations (11) and (12) may have the same solution. However, this formulation can be more powerful since it can allow blending of the reference geometries in case the system cannot match the measurements otherwise. If the distribution of the contributions of each loss can be close to a Dirac and the combined loss can be small, then the systems and methods may include guaranteed three-dimensional consistency.
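As a hedged illustration of this soft relaxation, the sketch below blends distances to all reference geometries with softmax weights controlled by a temperature-like parameter; the mean-squared voxel distance is an assumption chosen purely for this example.

```python
import torch

def soft_reference_loss(current_voxels, reference_voxels, temperature):
    """Hedged sketch of relaxing the hard minimum over reference geometries.

    Every reference contributes with a softmax weight determined by how close
    it is to the current voxelized radiance field; as the temperature-like
    parameter grows, the weighting approaches the hard minimum of Equation (11).
    """
    distances = torch.stack([
        torch.mean((current_voxels - ref) ** 2) for ref in reference_voxels
    ])
    # Closer references get exponentially larger weights.
    weights = torch.softmax(-temperature * distances, dim=0)
    return (weights * distances).sum(), weights
```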

[0176] In the experiments, the systems and methods can use GD to solve the optimization problem of (12). The systems and methods can gradually anneal the temperature parameter T during the course of the optimization to encourage convergence to a single reference geometry. The systems and methods may choose the z that minimizes the total loss. If the loss curve is flattened, the systems and methods can prefer the z that corresponds to higher temperature, because it can signify better three-dimensional consistency.

[0177] To regularize for the stones, for each of the reference radiance fields, the systems and methods may obtain a face surface mask on the three-dimensional space and may constrain the reconstructed voxel grid to match the reference outside of this three- dimensional mask. The intuition can be that voxels outside of the facial surface should have low values (as in the reference geometries). Matching only these voxels can give the method enough flexibility to adjust the facial three-dimensional structure to match the measurements without having high density clusters (stones) outside of the face.

[0178] The systems and methods can let M(p): R^3 → {0,1} be the operator that gives the facial mask. The systems and methods can define the loss function as the Frobenius norm, denoted || · ||_F, of the masked difference between the reconstructed voxel grid and the reference voxel grid outside of the facial mask.
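For illustration, the masked geometry loss described in this paragraph might be sketched as follows, where the facial mask zeroes out the region that is allowed to deviate from the reference; the exact claimed form of the loss is not reproduced, and the grid-based formulation here is an assumption.

```python
import torch

def masked_geometry_loss(voxels, reference_voxels, face_mask):
    """Hedged sketch of a masked voxel-grid loss.

    `face_mask` is a {0,1} voxel grid M(p) marking the facial surface region;
    the reconstructed voxel grid is constrained to match the reference only
    outside of this mask (Frobenius norm of the masked difference).
    """
    outside = 1.0 - face_mask
    diff = outside * (voxels - reference_voxels)
    return torch.linalg.norm(diff.flatten())   # Frobenius norm of masked difference
```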

[0179] In the experiments, the systems and methods can use the vertices of the generated polygons of the Marching Cubes algorithm to get the facial mask.

Example Experimentation

[0180] Figure 9 depicts an illustration of example temperature annealing results according to example embodiments of the present disclosure. In particular, Figure 9 can depict example temperature annealing results. As the inverse temperature increases, the max weight and entropy of the distribution can progress in the desired direction, showing that an example method of the systems and methods disclosed herein can "lock in" good reference geometries as latent space optimization evolves. Changes in the entropy 902, max weight 904, and delta 906 can occur as the steps increase, and the changes are depicted in the three graphs of Figure 9.

[0181] Figure 10 depicts an illustration of example inpainting plot results according to example embodiments of the present disclosure. In particular, Figure 10 can depict illustrations of example inpainting plots 1000 for different views, as the number of measurements (observed pixels) of the frontal view change. As depicted, the example method can consistently outperform the baseline. As the number of observed pixels increases, the baseline method may overfit more and more to the measurements (frontal view) and can perform worse in novel views, indicating that the reconstructed geometry gets worse. The four example plots 1000 depicted include a first plot (with 103.50 pitch and 90.00 yaw) 1002, second plot (with 76.50 pitch and 180.00 yaw) 1004, a third plot (with 76.50 pitch and 72.00 yaw) 1006, and a fourth plot (with 103.50 pitch and 90.00 yaw) 1008.

[0182] Figure 11 depicts an illustration of example model results according to example embodiments of the present disclosure. In particular, Figure 11 can depict example comparisons 1100 between pi-GAN and an example implementation of the systems and methods disclosed herein on different camera angles on MSE and LPIPS metrics. MSE (top) 1102 can indicate two-dimensional pixel reconstruction, where the example method shows comparable loss in the given frontal view at (90,90), and noticeably lower loss in all other novel views. LPIPS (bottom) 1104 can describe the perceptual loss, suggesting that the example method has consistently less perceptual differences than pi-GAN.

[0183] In the experiments, the comparisons can use a pi-GAN generator, pre-trained on faces from CelebA.

[0184] The first comparison can be with the regularization proposed in the pi-GAN paper (an ℓ2 penalty for divergence from the average frequencies and phase shifts). For a fair comparison, the experiment can use the reconstructions directly from known pi-GAN outcomes. The method can significantly outperform pi-GAN on the frontal view - the systems and methods can observe 42% reduction in MSE and 80% reduction in LPIPS for the first image. The pi-GAN method can indeed give novel views without artifacts, but the distance to the ground truth can be very large.

[0185] The experiments can compare with the unregularized baseline that follows the CSGM approach to match the measurements (i.e., solves the problem defined in (1)). The experiment may run the method and the CSGM baseline on the images from a database and for some images in the range of NerfGAN. For the experiments, input can be the frontal view (one can run the method for any view, as long as the camera parameters are known).

[0186] For the images in the range, the experiment can have ground truth novel views. The method can produce fewer artifacts (e.g., the blurry blobs in a reconstructed image). Figure 11 can show a quantitative comparison between the method and standard pi-GAN inversion for different views. In this experiment, the system can start with a generated three-dimensional neural radiance field such that the system can have ground truth for all views. As shown, the method can achieve 30 - 40% MSE reduction and 15 - 25% reduction in LPIPS loss compared to latent space optimization without the three-dimensional regularization.

[0187] The systems and methods may be motivated by the need to converge as much as possible to a single geometry. However, in the early stages of the optimization, the systems and methods may allow all reference geometries to contribute to the gradients since otherwise gradient descent might get trapped in a local minimum. To achieve this, the systems and methods may use temperature annealing: in the early steps of gradient descent the system may have a large temperature (i.e., the system may allow the distribution over the references to look close to uniform), and as the optimization progresses the system may decrease the temperature (increase δ) to converge to a single radiance field. In the experiments, the systems and methods may perform step annealing, increasing δ by 50 every 100 optimization steps. Figure 9 can show how δ, the entropy of the distribution over the references, and its maximum weight can evolve over time. As shown, in the early stages, the distribution may have high entropy (close to uniform), and as time progresses the system can converge to a single radiance field, as may be desired.
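A minimal sketch of the step-annealing schedule described above is shown below; the initial value and increment are illustrative assumptions rather than claimed constants.

```python
def annealed_delta(step, initial=1.0, increment=50.0, every=100):
    """Hedged sketch: increase the inverse-temperature-like parameter delta by a
    fixed increment every fixed number of optimization steps, so the soft
    weighting over reference geometries gradually locks in on a single one."""
    return initial + increment * (step // every)

# Example: delta stays at 1.0 for steps 0-99, becomes 51.0 at step 100, and so on.
```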

[0188] In some implementations, the systems and methods can address the problem of inpainting where one does not observe a full view x* but rather a known subset of pixels is missing. The system may consider two different inpainting settings: random inpainting, where only a random subset of pixels may be observed and box inpainting where a part of an image may be missing.

[0189] In this case, the system may observe y = Ax*, where A can be a matrix created by selecting rows from the identity matrix, one for each observed pixel. In Figure 10, the experiment can be plotted with the Mean Squared Error (MSE) versus the ratio of observed pixels for a novel view. The observations y = Ax* can be from the frontal view x* but with random sub-sampling. As shown, latent-space optimization baseline can have an increasing MSE as the number of observations increases. This can happen because the baseline may be overfitting and may fail to reconstruct the novel view correctly. In contrast, the method may be consistently producing lower MSE for the novel view.

[0190] At each optimization step, the method can generate a voxel grid using the current latent z. One natural question may be how coarse the voxel grid representation should be in order for the regularization to be effective. In some implementations, a voxel grid at resolution 32x32x32 can be utilized across the considered tasks. Hence, the number of additional queries to the NerfGAN can be the same as the ones needed to generate an image at resolution 128x128 (which can be the standard resolution used in all the experiments). The method may run in approximately 3 mins per image on a workstation of 4 V100 GPUs.

[0191] The success of the method may be based in part on the quality of the collected set of voxel grids. If the set is not diverse or if it contains non-realistic three-dimensional structures, then the method may have decreased quality for some instances. Moreover, since all the geometries may be in the range of the GAN, any dataset biases may be reflected in the reconstructions. The method may only introduce biases in the three-dimensional structure of a face since the system may not be regularizing for color.

[0192] Another concern can be that the relaxation of the optimization problem can allow for solutions that may be blendings of the three-dimensional structures of the collected set. A blending of two realistic facial geometries may have artifacts. The systems and methods may account for this by annealing the temperature, effectively encouraging the optimization to converge to a single three-dimensional structure.

[0193] Additionally and/or alternatively, the systems and methods may regularize for a realistic three-dimensional structure, but may not add any regularization on the colors. Nonsmoothness in the three-dimensional color signal may give undesired transitions between nearby views. The systems and methods may not observe any such behavior with the pi-GAN generator, but it may happen with other models from the Nerf-GAN family.

Additional Disclosure

[0194] The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.

[0195] While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation, not limitation of the disclosure. Those skilled in the art, upon attaining an understanding of the foregoing, can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the subject disclosure does not preclude inclusion of such modifications, variations and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that the present disclosure cover such alterations, variations, and equivalents.