

Title:
METHODS AND SYSTEMS FOR VIEW SYNTHESIS WITH IMAGE RELIGHTING
Document Type and Number:
WIPO Patent Application WO/2023/244488
Kind Code:
A1
Abstract:
The present invention is directed to image processing methods and techniques. According to a specific embodiment, a plurality of images characterizing a three-dimensional (3D) scene is obtained. The plurality of images is used to determine a set of coordinates associated with a ray projecting into the 3D scene. The set of coordinates serves as an input to train a neural network to predict RGB radiance for rendering the 3D scene from different viewpoints with different light conditions via a machine learning process. One or more losses are calculated to refine the neural network. There are other embodiments as well.

Inventors:
LI ZHONG (US)
SONG LIANGCHEN (US)
XU YI (US)
Application Number:
PCT/US2023/024796
Publication Date:
December 21, 2023
Filing Date:
June 08, 2023
Assignee:
INNOPEAK TECH INC (US)
International Classes:
G06T15/04; G06N3/02; G06T15/20; G06T15/50
Domestic Patent References:
WO2022098358A12022-05-12
Foreign References:
US20180204314A12018-07-19
US20210295592A12021-09-23
Attorney, Agent or Firm:
BRATSCHUN, Thomas D. et al. (US)
Claims:
WHAT IS CLAIMED IS:

1. A method for image processing, the method comprising: obtaining a plurality of images characterizing a three-dimensional (3D) scene, the plurality of images being characterized by a plurality of camera views and a plurality of light directions; determining a first camera view and a first light direction associated with the 3D scene using the plurality of images, the first light direction being associated with a first pixel; calculating a set of coordinates using the first camera view and the first light direction; generating a first normal, a first albedo, and a first roughness using the set of coordinates and a first neural network; generating a first color of the first pixel using the first normal, the first albedo, the first roughness, the first light direction, and a second neural network; and providing an output image associated with the 3D scene using at least the first color.

2. The method of claim 1 further comprising calculating a first loss using the first color and a ground-truth color of the first pixel.

3. The method of claim 2 further comprising updating the first neural network using the first loss.

4. The method of claim 1 further comprising calculating a second loss using the first color, the first normal, the first albedo, and the first roughness.

5. The method of claim 4 further comprising updating the second neural network using the second loss.

6. The method of claim 1 wherein the first camera view is different from any of the plurality of camera views.

7. The method of claim 1 wherein the first light direction is different from any of the plurality of light directions.

8. The method of claim 1 wherein the set of coordinates comprises a four-dimensional (4D) coordinate.

9. The method of claim 1 wherein the first neural network comprises a fully connected network.

10. A system for image processing, the system comprising: a camera module configured to capture a plurality of images characterizing a three-dimensional (3D) scene, the camera module comprising one or more cameras and one or more light sources; a storage configured to store a plurality of images, the plurality of images being characterized by a plurality of camera views and a plurality of light directions; and a processor coupled to the storage, the processor being configured to: retrieve the plurality of images from the storage; determine a first camera view and a first light direction associated with the 3D scene using the plurality of images, the first light direction being associated with a first pixel; calculate a set of coordinates using the first camera view and the first light direction; generate a first normal, a first albedo, and a first roughness using the set of coordinates and a first neural network; generate a first color of the first pixel using the first normal, the first albedo, the first roughness, the first light direction, and a second neural network; and provide an output image associated with the 3D scene using at least the first color.

11. The system of claim 10 further comprising a display configured to display the output image.

12. The system of claim 10 wherein the processor comprises a neural processing unit (NPU) and/or a graphics processing unit (GPU).

13. The system of claim 10 wherein the plurality of images is captured by a plurality of cameras under a plurality of directional lights.

14. The system of claim 10 wherein the processor is further configured to: calculate a first loss using the first color and a ground-truth color of the first pixel; and update the first neural network using the first loss.

15. The system of claim 10 wherein the processor is further configured to: calculate a second loss using the first color, the first normal, the first albedo, and the first roughness; and update the second neural network using the second loss.

16. A method for image processing, the method comprising: obtaining a first image associated with a three-dimensional (3D) scene, the first image being characterized by a first camera view and a first light direction; determining a second camera view and a second light direction associated with the 3D scene, the second camera view being different from the first camera view, the second light direction being different from the first light direction; calculating a set of coordinates using the second camera view and the second light direction; generating a first normal, a first albedo, and a first roughness using the set of coordinates and a first neural network; generating a first color using the first normal, the first albedo, the first roughness, the first light direction, and a second neural network, the first color being associated with the second light direction; and providing an output image associated with the 3D scene using at least the first color.

17. The method of claim 16 wherein: the 3D scene comprises a first object characterized by a first surface property; and the first normal, the first albedo, and the first roughness are associated with the first surface property.

18. The method of claim 16 further comprising decomposing the first surface property.

19. The method of claim 16 wherein the set of coordinates comprises a four-dimensional (4D) coordinate.

20. The method of claim 16 wherein the first neural network comprises a fully connected network.

Description:
METHODS AND SYSTEMS FOR VIEW SYNTHESIS WITH IMAGE RELIGHTING

CROSS-REFERENCES TO RELATED APPLICATIONS

[0001] The present application claims priority to U.S. Provisional Application No. 63/351,832, entitled “RELIT-NEULF: EFFICIENT RELIGHTING AND NOVEL VIEW SYNTHESIS VIA NEURAL 4D LIGHT FIELD”, filed June 14, 2022, which is commonly owned and incorporated by reference herein for all purposes.

BACKGROUND OF THE INVENTION

[0002] The present invention is directed to image processing methods and techniques.

[0003] As digital reality applications become more prevalent, the need to generate photorealistic digital representations of real-world objects has surged. These applications include virtual and augmented reality, video games, movies, and advertising, among others. To meet this demand, various image processing techniques have been developed to create visually rich and convincing three-dimensional (3D) representations of real-world objects. For example, free-view synthesis and image relighting techniques aim to recreate images of a scene from viewpoints different from the original image under different lighting conditions to enrich scene understanding. Over the years, many image processing techniques have been proposed, but they have been inadequate, for the reasons detailed below.

[0004] Therefore, new and improved methods and systems for image processing are desired.

BRIEF SUMMARY OF THE INVENTION

[0005] The present invention is directed to image processing methods and techniques. According to a specific embodiment, a plurality of images characterizing a three-dimensional (3D) scene is obtained. The plurality of images is used to determine a set of coordinates associated with a ray projecting into the 3D scene. The set of coordinates serves as an input to train a neural network to predict RGB radiance for rendering the 3D scene from different viewpoints with different light conditions via a machine learning process. One or more losses are calculated to refine the neural network. There are other embodiments as well.

[0006] Embodiments of the present invention can be implemented in conjunction with existing systems and processes. For example, the image processing system according to the present invention can be used in a wide variety of systems, including mobile devices, communication systems, and the like. Additionally, various techniques according to the present invention can be adopted into existing systems via training of the neural network(s), which is compatible with most image processing applications. There are other benefits as well.

[0007] A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions. One general aspect includes a method for image processing. The method includes obtaining a plurality of images characterizing a three-dimensional (3D) scene, the plurality of images being characterized by a plurality of camera views and a plurality of light directions. The method also includes determining a first camera view and a first light direction associated with the 3D scene using the plurality of images, the first light direction being associated with a first pixel. The method also includes calculating a set of coordinates using the first camera view and the first light direction. The method also includes generating a first normal, a first albedo, and a first roughness using the set of coordinates and a first neural network. The method also includes generating a first color of the first pixel using the first normal, the first albedo, the first roughness, the first light direction, and a second neural network. The method also includes providing an output image associated with the 3D scene using at least the first color. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.

[0008] Implementations may include one or more of the following features. The method may include calculating a first loss using the first color and a ground-truth color of the first pixel. The method may include updating the first neural network using the first loss. The method may include calculating a second loss using the first color, the first normal, the first albedo, and the first roughness. The method may include updating the second neural network using the second loss. The first camera view may be different from any of the plurality of camera views. The first light direction may be different from any of the plurality of light directions. The set of coordinates may include a four-dimensional (4D) coordinate. The first neural network may include a fully connected network. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.

[0009] One general aspect includes a system for image processing. The system also includes a camera module configured to capture a plurality of images characterizing a three-dimensional (3D) scene, the camera module including one or more cameras and one or more light sources. The system also includes a storage configured to store a plurality of images, the plurality of images being characterized by a plurality of camera views and a plurality of light directions. The system also includes a processor coupled to the storage, the processor being configured to: retrieve the plurality of images from the storage; determine a first camera view and a first light direction associated with the 3D scene using the plurality of images, the first light direction being associated with a first pixel; calculate a set of coordinates using the first camera view and the first light direction; generate a first normal, a first albedo, and a first roughness using the set of coordinates and a first neural network; generate a first color of the first pixel using the first normal, the first albedo, the first roughness, the first light direction, and a second neural network; and provide an output image associated with the 3D scene using at least the first color. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.

[0010] Implementations may include one or more of the following features. The system may include a display configured to display the output image. The plurality of images is captured by a plurality of cameras under a plurality of directional lights. The processor is further configured to: calculate a first loss using the first color and a ground-truth color of the first pixel and update the first neural network using the first loss. The processor is further configured to: calculate a second loss using the first color, the first normal, the first albedo, and the first roughness; and update the second neural network using the second loss. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.

[0011] One general aspect includes a method for image processing. The method also includes obtaining a first image associated with a three-dimensional (3D) scene, the first image being characterized by a first camera view and a first light direction. The method also includes determining a second camera view and a second light direction associated with the 3D scene, the second camera view being different from the first camera view, the second light direction being different from the first light direction. The method also includes calculating a set of coordinates using the second camera view and the second light direction. The method also includes generating a first normal, a first albedo, and a first roughness using the set of coordinates and a first neural network. The method also includes generating a first color using the first normal, the first albedo, the first roughness, the first light direction, and a second neural network, the first color being associated with the second light direction. The method also includes providing an output image associated with the 3D scene using at least the first color. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.

[0012] Implementations may include one or more of the following features. The 3D scene may include a first object characterized by a first surface property. The first normal, the first albedo, and the first roughness may be associated with the first surface property. The method may include decomposing the first surface property. The set of coordinates may include a four-dimensional (4D) coordinate. The first neural network may include a fully connected network. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.

[0013] It is to be appreciated that embodiments of the present invention provide many advantages over conventional techniques. Among other things, the present systems and methods for image processing can generate photorealistic images under arbitrary changes in viewpoints and lighting conditions based on limited image inputs (e.g., sparse viewpoints and limited light sources), allowing for enhanced efficiency and reduced memory footprint. Additionally, the system, trained with one or more losses, can recover the spatially-varying bidirectional reflectance distribution function (SVBRDF) parameters in a weakly-supervised manner, resulting in visually rich representations that enable immersive and interactive virtual experiences.

[0014] The present invention achieves these benefits and others in the context of known technology. However, a further understanding of the nature and advantages of the present invention may be realized by reference to the latter portions of the specification and attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] Figure 1 is a simplified block diagram illustrating a system for image processing according to embodiments of the present invention.

[0016] Figure 2 is a simplified diagram illustrating a camera module of a system for image processing according to embodiments of the present invention.

[0017] Figure 3 is a simplified diagram illustrating a representation of light field according to embodiments of the present invention.

[0018] Figure 4 is a simplified diagram illustrating a data flow for image processing according to embodiments of the present invention.

[0019] Figure 5 is a simplified flow diagram illustrating a method for image processing according to embodiments of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[0020] The present invention is directed to image processing methods and techniques. According to a specific embodiment, a plurality of images characterizing a three-dimensional (3D) scene is obtained. The plurality of images is used to determine a set of coordinates associated with a ray projecting into the 3D scene. The set of coordinates serves as an input to train a neural network to predict RGB radiance for rendering the 3D scene from different viewpoints with different light conditions via a machine learning process. One or more losses are calculated to refine the neural network. There are other embodiments as well.

[0021] Over the years, many techniques for free viewpoint rendering have been developed. For instance, some existing techniques rely on geometric reconstruction, which utilizes multiview stereo or structured light to reconstruct the geometry of the 3D object and the diffuse texture. However, due to the complexity of the real-world scene (e.g., complex physical geometry, spatially varying surface reflection, uncontrolled lighting conditions, etc.), existing techniques struggle to provide satisfactory visual representations for complex scenes with sufficient fine details and/or varying lighting conditions. Moreover, many existing techniques are sensitive to input data quality and usually require a large number of input images to achieve high-fidelity results, which can be both time-consuming and computationally expensive.

[0022] Thus, a general aspect of the present invention is to provide a new solution that generates high-fidelity free view synthesis results with arbitrary lighting. In various embodiments, the present invention provides methods and systems that use a limited number of input images to realize photorealistic visual representations with fast rendering speed and low memory cost via a deep learning process.

[0023] The following description is presented to enable one of ordinary skill in the art to make and use the invention and to incorporate it in the context of particular applications. Various modifications, as well as a variety of uses in different applications will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to a wide range of embodiments. Thus, the present invention is not intended to be limited to the embodiments presented, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

[0024] In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without necessarily being limited to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.

[0025] The reader's attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification (including any accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.

[0026] Furthermore, any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6. In particular, the use of “step of” or “act of” in the Claims herein is not intended to invoke the provisions of 35 U.S.C. 112, Paragraph 6.

[0027] Please note, if used, the labels left, right, front, back, top, bottom, forward, reverse, clockwise and counter-clockwise have been used for convenience purposes only and are not intended to imply any particular fixed direction. Instead, they are used to reflect relative locations and/or directions between various portions of an object.

[0028] Figure 1 is a simplified block diagram illustrating a system 100 for image processing according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.

[0029] As shown, system 100 may include a camera module 110 (or other image or video capturing device), a storage 120, and a processor 130. For example, camera module 110 is configured to capture a plurality of images characterizing a three-dimensional (3D) scene and includes one or more cameras and one or more light sources. The 3D scene may include an object characterized by a first surface property. The one or more light sources may be positioned at various locations around a 3D scene and configured to illuminate the scene to provide various lighting conditions. For instance, the one or more light sources are focused toward an object contained in the scene and provide directional lights to illuminate the object from various angles. The one or more light sources may include, without limitation, one or more studio light(s), one or more flash unit(s), one or more LED panel(s), and/or the like. The direction, intensity, and color temperature of each light source may be adjustable to create different lighting scenarios. In some cases, light modifiers (e.g., softboxes, umbrellas, reflectors, etc.) may be used to control the shape, direction, and softness of the light, creating various lighting effects and shadows.

[0030] In some embodiments, the one or more cameras are positioned at various locations around the 3D scene and configured to capture images of the scene from multiple viewpoints. In an example, the one or more cameras are configured to capture a video clip including a sequence of consecutive image frames depicting the scene from various viewpoints. The one or more cameras may include, without limitation, one or more RGB camera(s), one or more Digital Single-Lens Reflex (DSLR) camera(s), one or more mirrorless camera(s), one or more High Dynamic Range (HDR) camera(s), one or more image sensor(s), one or more video recorder(s), and/or the like. Depending on the implementations, camera module 110 may be configured in various arrangements to collect image samples characterized by a wide range of perspectives and lighting conditions. For instance, the one or more cameras and light sources are placed evenly around the object or scene, at a fixed distance and elevation. This arrangement captures images from various angles in a horizontal plane, offering 360-degree coverage. In other examples, the one or more cameras and light sources are placed at various elevations and azimuth angles around the object or scene (e.g., in a spherical arrangement), creating images with multiple perspectives under different lighting conditions. In some embodiments, the plurality of images captured by camera module 110 is characterized by a plurality of camera views and a plurality of light directions.

[0031] According to some embodiments, storage 120 is configured to store the plurality of images captured by camera module 110. Storage 120 may include, without limitation, local and/or network-accessible storage, a disk drive, a drive array, an optical storage device, and a solid-state storage device, which can be programmable, flash-updateable, and/or the like. Processor 130 can be coupled to each of the previously mentioned components and be configured to communicate between these components. In a specific example, processor 130 includes a central processing unit (CPU) 132, graphics processing unit (GPU) 134, and/or neural processing unit (NPU) 136, or the like. For example, each of the processing units may include one or more processing cores for parallel processing. In a specific embodiment, CPU 132 includes both high-performance cores and energy-efficient cores. Processor 130 is configured to process the plurality of images to generate a light field representation of the 3D scene and train a neural network to predict RGB radiance for efficient relighting and free view synthesis, as will be described in further detail below.

[0032] The system 100 can also include a network interface 140 and a display 150. Display 150 is configured to display an output image generated by processor 130. The output image may be associated with the same 3D scene and characterized by a camera view and a light direction that are different from any of the plurality of input images. Network interface 140 can be configured to transmit and receive images (e.g., using Wi-Fi, Bluetooth, Ethernet, etc.) for neural network training and/or image processing. In a specific example, the network interface 140 can also be configured to compress or down-sample images for transmission or further processing. Network interface 140 can also be configured to send one or more images to a server for post-processing. The processor 130 can also be coupled to and configured to communicate between display 150, the network interface 140, and any other interfaces. In various implementations, system 100 further includes one or more peripheral devices 160 configured to improve user interaction in various aspects. For example, peripheral devices 160 may include, without limitation, at least one of the speaker(s) or earpiece(s), audio sensor(s) or microphone(s), noise sensors, keyboard, mouse, and/or other input/output devices.

[0033] In an example, processor 130 can be configured to retrieve the plurality of images from storage 120; to determine a first camera view and a first light direction associated with the 3D scene using the plurality of images; to calculate a set of coordinates using the first camera view and the first light direction; to generate a first normal, a first albedo, and a first roughness using the set of coordinates and a first neural network; to generate a first color of a first pixel using the first normal, the first albedo, the first roughness, the first light direction, and a second neural network; and to provide an output image associated with the 3D scene using at least the first color. In various implementations, GPU 134 is coupled to display 150 and camera module 110. GPU 134 may be configured to transmit output images to display 150. NPU 136 may be used to train one or more neural networks for image relighting and view synthesis with one or more losses. For instance, NPU 136 is configured to train the neural network(s) by minimizing a render loss between the SVBRDF rendering result and the ground truth color via decomposing the first surface property of the object contained in the 3D scene.

[0034] Other embodiments of this system include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. Further details of methods for image relighting and view synthesis, model training, and related techniques are discussed with reference to the following figures.

[0035] Figure 2 is a simplified diagram illustrating a camera module 200 of a system for image processing according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.

[0036] Camera module 200 may include one or more cameras and one or more light sources. In an example, as shown in Figure 2, camera module 200 includes a first camera 202a, a second camera 202b, a third camera 202c, a first light source 204a, a second light source 204b, and a third light source 204c. The cameras (202a, 202b, 202c) and light sources (204a, 204b, 204c) may be arranged around a scene including one or more subjects (e.g., a person 206) and are configured to capture images of the scene from various viewpoints under different light configurations. Depending on the implementations, the images captured by camera module 200 may be a video clip including a sequence of images from different viewpoints. In some cases, each camera is configured to capture an image of the scene at each light position, providing a plurality of images characterized by a plurality of camera views and a plurality of light directions. For instance, given N viewpoints and L light sources, each viewpoint is illuminated by L light sources, resulting in a total of N × L images. It is to be appreciated that the number of cameras and light sources is not limited to what is shown in Figure 2, and a different number of cameras and light sources may be employed in other embodiments.
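
As an illustrative sketch only (the capture protocol above does not prescribe any particular code, and the function name below is hypothetical), the following Python snippet shows how the N × L view/light combinations described in this paragraph could be enumerated when organizing a capture session.

from itertools import product

def enumerate_captures(num_views: int, num_lights: int):
    # Every viewpoint is photographed once under every light source,
    # yielding num_views * num_lights images in total.
    return list(product(range(num_views), range(num_lights)))

pairs = enumerate_captures(3, 3)
print(len(pairs))  # 9 images for 3 viewpoints and 3 light sources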

[0037] In other examples, camera module 200 may include a lab-controlled device that consists of a 3D structure (e.g., spherical, cuboid, ellipsoid, cylinder, and/or the like) fitted with an array of light sources and cameras configured to surround the subject/scene. This configuration allows for precise control over lighting conditions and camera calibrations, resulting in high-quality images that are later used for training neural network(s). In some embodiments, the one or more cameras may be synchronized to capture the images from different viewpoints simultaneously. In other embodiments, the one or more cameras may be unsynchronized to capture the plurality of images sequentially. The subject of the scene (e.g., person 206) may remain stationary during the image capture process.

[0038] Figure 3 is a simplified diagram illustrating a representation of light field 300 according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.

[0039] To allow for free view rendering, a light field representation may be used. The light field representation is configured to store the light ray’s properties (e.g., color, intensity, direction, and/or the like) at each point in the 3D space as a function of position and direction. For instance, a 3D scene 308 can be represented as a 4D light field using two-plane parameterization as shown in Figure 3. As shown, the 4D light field parameterizes a light ray R from a camera viewpoint 302 with a known camera pose intersecting with two planes: a uv plane 304 and a st plane 306. Each point on the st plane is connected to a corresponding point on the uv plane. For example, light ray R intersects with uv plane 304 and st plane 306 at point 310 and point 312, respectively. Point 310 on uv plane 304 having a coordinate (ui, vi) is connected to point 312 having a coordinate (si, ti) on st plane 306. An oriented line indicating the direction of light ray R can thus be defined by connecting point 310 and point 312 and parameterized by a 4D coordinate (ui, vi, si, ti).

[0040] The light field representation offers a variety of advantages that allow for efficient free view rendering. For example, it can capture comprehensive light information of a scene, including complex reflections, refractions, shadows, and/or the like, allowing for the creation of photorealistic images that portray the intricate lighting interactions between objects in the scene. Additionally, light field data can be processed and manipulated using machine learning algorithms to construct new views with arbitrary light direction, as will be described in further detail below.
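
As a minimal sketch of the two-plane parameterization described above (the plane depths and the function name are assumptions made for illustration, not part of the disclosure), the following Python code intersects a ray with a uv plane and an st plane placed at two fixed depths and returns the 4D coordinate (u, v, s, t).

import numpy as np

def two_plane_coords(origin, direction, z_uv=0.0, z_st=1.0):
    # Intersect the ray origin + t * direction with the planes z = z_uv
    # (uv plane) and z = z_st (st plane) and return (u, v, s, t).
    direction = direction / np.linalg.norm(direction)
    t_uv = (z_uv - origin[2]) / direction[2]
    t_st = (z_st - origin[2]) / direction[2]
    u, v = (origin + t_uv * direction)[:2]
    s, t = (origin + t_st * direction)[:2]
    return np.array([u, v, s, t])

# Example: a ray cast from a camera placed in front of the uv plane.
ray_4d = two_plane_coords(np.array([0.2, 0.1, -1.0]), np.array([0.05, -0.02, 1.0]))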

[0041] Figure 4 is a simplified diagram illustrating a data flow 400 for image processing according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, rearranged, replaced, modified, and/or overlapped, and they should not limit the scope of the claims.

[0042] According to an example, the present invention provides a method to render images from novel viewpoints — viewpoints that are different from the input images — with arbitrary lighting conditions. The method may include two stages implemented by two neural networks: (1) determining a light ray’s SVBRDF components (e.g., normal, albedo, and roughness), which can be used to model the appearance of materials and reflectance properties; (2) training a neural network to render images from novel viewpoints using at least the SVBRDF components. By leveraging light field representation and decomposing SVBRDF components, embodiments of the present invention can efficiently generate high-fidelity novel view synthesis results with arbitrary lighting.

[0043] As shown, flow 400 implements a two-stage network architecture to provide free viewpoint rendering results based on sparse camera views and limited light sources. According to some embodiments, the two-stage network receives input 4D coordinates 402 as inputs, which can be obtained by extracting 4D ray parameterizations of pixels in a 3D scene. The 3D scene may include an object characterized by a first surface property. In an example, a first camera view and a first light direction (e.g., camera viewpoint 302 and light ray R of Figure 3) may be determined using the plurality of input images (e.g., captured by camera module 110 of Figure 1) to calculate the input 4D coordinates 402. For instance, given a set of N calibrated input images characterizing the 3D scene {I_1, I_2, ..., I_N}, a 4D ray parameterization of each pixel can be extracted as (u_i^k, v_i^k, s_i^k, t_i^k), where k = 1...N, i = 1...N_k, and N_k is the total number of pixels in the k-th image. In some cases, the first light direction may be associated with a first pixel in the 3D scene.
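
As a hedged illustration of how per-pixel rays could be obtained from calibrated input images before applying the two-plane parameterization (the helper below is hypothetical and assumes a standard pinhole model with world-to-camera extrinsics; each returned ray could then be passed to a function such as the two_plane_coords() sketch above), one possible Python sketch is:

import numpy as np

def pixel_rays(K, R, t, width, height):
    # Back-project every pixel of a calibrated image into a world-space ray.
    # K: 3x3 intrinsics; R, t: world-to-camera rotation and translation.
    cam_center = -R.T @ t
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)], axis=0)
    dirs = (R.T @ np.linalg.inv(K) @ pix).T
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    origins = np.broadcast_to(cam_center, dirs.shape)
    return origins, dirs  # each (origin, direction) pair yields one 4D coordinate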

[0044] In various implementations, input 4D coordinates 402 may be fed to a first neural network 404 (which may also be referred to as “DecomposeNet”) to generate SVBRDF parameters. The Spatially-Varying Bidirectional Reflectance Distribution Function (SVBRDF) refers to a mathematical function that describes how light interacts with a surface. Unlike a uniform Bidirectional Reflectance Distribution Function (BRDF), an SVBRDF accounts for the fact that real-world surfaces have spatial variations in their reflectance properties due to imperfections like bumps, scratches, and other irregularities that impact light interaction. SVBRDF parameters are a set of values that define the reflectance properties of a surface at each point and can be used to model the appearance of real-world objects with complex surface reflectance properties (e.g., a human face).

[0045] In an example, the first neural network 404 generates the SVBRDF parameters including a first normal 412, a first albedo 414, and a first roughness 416 via normal branch 406, albedo branch 408, and roughness branch 410, respectively. For example, a normal parameter defines the direction of a surface normal, which is a vector perpendicular to the surface at a given point. The normal parameter may be used to determine the direction in which light is reflected off the surface. An albedo parameter may include a diffuse albedo and/or a specular albedo. The diffuse albedo is configured to quantify the amount of light that is diffusely reflected from a surface. The specular albedo is configured to measure the amount of light that is specularly reflected from a surface (i.e., in a mirror-like manner). A roughness parameter is used to determine the microsurface irregularities of a material, which affects the way light is scattered on a surface.

[0046] In various implementations, the first neural network 404 includes a multilayer perceptron (MLP) comprising a series of layers of interconnected neurons. The output of each layer may be fed into the next layer to perform various classification and regression tasks. In an example, the MLP network takes input 4D coordinates 402 r = (u, v, s, t) for each pixel in view i illuminated by light source l as inputs. The MLP network first extracts a shared feature among SVBRDF parameters and then employs three decoders (e.g., normal branch 406, albedo branch 408, and roughness branch 410) to generate SVBRDF parameters (e.g., first normal 412, first albedo 414, and first roughness 416). It is to be appreciated that SVBRDF parameters are closely correlated as they collectively capture various aspects of material appearance, thus the extraction of a shared feature can significantly improve the efficiency of the neural network and reduce the risk of overfitting. The first neural network 404 (DecomposeNet) configured to predict SVBRDF parameters can be represented as:

N, A, R = DecomposeNet(r | Θ_d) (Eqn. 1)
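
The following PyTorch sketch is offered only as one plausible realization of the shared-trunk MLP with three decoder branches described above; the layer widths, activations, and class name are assumptions rather than the disclosed architecture.

import torch
import torch.nn as nn

class DecomposeNet(nn.Module):
    # Shared feature extractor over the 4D ray coordinate, followed by
    # separate heads for normal, albedo, and roughness.
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.normal_head = nn.Linear(hidden, 3)
        self.albedo_head = nn.Linear(hidden, 3)
        self.rough_head = nn.Linear(hidden, 1)

    def forward(self, ray_4d):
        feat = self.trunk(ray_4d)
        normal = nn.functional.normalize(self.normal_head(feat), dim=-1)
        albedo = torch.sigmoid(self.albedo_head(feat))
        roughness = torch.sigmoid(self.rough_head(feat))
        return normal, albedo, roughness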

[0047] In various implementations, the SVBRDF parameters (e.g., first normal 412, first albedo 414, and first roughness 416) are fed into a second neural network 420 (which may also be referred to as “RenderNet”) as inputs. Second neural network 420 may be configured to generate a first color of the first pixel using first normal 412, first albedo 414, first roughness 416, and the first light direction. Second neural network 420 may include a multilayer perceptron (MLP) comprising a series of layers of interconnected neurons. The output of each layer may be fed into the next layer to perform various classification and regression tasks. To generate a novel view synthesis result under a specific light direction l, second neural network 420 is trained to utilize first normal 412, first albedo 414, first roughness 416, and light direction 422 to generate the ray color 424 via an implicit rendering process. For example, second neural network 420 is used to learn an implicit function RenderNet() that defines the surface of an object in the 3D scene. To render a ray under a specific light direction l, second neural network 420 is trained as follows:

C_r = RenderNet(r, N, A, R, l | Θ_r) (Eqn. 2)
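
Similarly, a minimal PyTorch sketch of a rendering MLP that consumes the ray coordinate, the SVBRDF parameters, and the light direction is shown below; the input sizes and network depth are assumptions made for illustration.

import torch
import torch.nn as nn

class RenderNet(nn.Module):
    # Maps (ray, normal, albedo, roughness, light direction) to an RGB color.
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 + 3 + 3 + 1 + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, ray_4d, normal, albedo, roughness, light_dir):
        x = torch.cat([ray_4d, normal, albedo, roughness, light_dir], dim=-1)
        return torch.sigmoid(self.mlp(x))  # predicted ray color C_r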

[0048] One or more losses may be used to train first neural network 404 and second neural network 420. For example, for each ray in camera view i and light direction l, a predicted color C_pred (e.g., ray color 424) output by second neural network 420 may be calculated as follows:

C_pred = RenderNet(DecomposeNet(r | Θ_d), r, l | Θ_r) (Eqn. 3)

A first loss 426 may include a photometric loss L_p that minimizes the multi-view photometric error. The photometric loss L_p can be calculated using the predicted color of a pixel and its ground-truth color.
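
As an illustrative sketch only (the exact expression may differ from the disclosed formulation), a standard multi-view photometric loss sums the squared error between the predicted and ground-truth ray colors over the training rays:

L_p = \sum_{r} \left\| C_{pred}(r) - C_{gt}(r) \right\|_2^2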

[0049] In some embodiments, second neural network 420 may be trained using a second loss 428, which may include a microfacet renderer loss L_m. The microfacet renderer loss L_m may be calculated using microfacet rendering results and the ground truth to train second neural network 420 in a weakly-supervised manner. For instance, rendering layer 430 is configured to perform a microfacet rendering process to calculate the color and/or intensity of each pixel in the image (e.g., the first pixel) using first normal 412, first albedo 414, and first roughness 416. The microfacet rendering process can model the reflection and refraction of light by taking into account the effects of surface roughness. In some cases, rendering layer 430 first calculates a distribution of microfacets on the surface of an object, which describes the probability of finding a microfacet with a particular orientation and size. With the known distribution of microfacets, rendering layer 430 can then calculate the reflection and refraction of light at the surface. The color of each pixel in the image can therefore be calculated using the microfacet distribution and the SVBRDF parameters (e.g., first normal 412, first albedo 414, and first roughness 416). The second neural network 420 can be further refined by minimizing the microfacet renderer loss L_m, which is computed using the microfacet BRDF rendering model M.
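
A hedged sketch of such a weakly-supervised objective (not necessarily the exact disclosed expression) compares the output of the microfacet BRDF rendering model M, evaluated per ray with the predicted SVBRDF parameters and the light direction l, against the ground-truth color:

L_m = \sum_{r} \left\| M(N, A, R, l) - C_{gt}(r) \right\|_2^2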

[0050] In various implementations, to render the 3D scene from a novel viewpoint, let v be the view direction and h = (l + v) / ||l + v|| be the half vector, where D(h, R) is the normal distribution function (NDF), F(v, h) = F_0 + (1 − F_0)(1 − (v · h))^5 is the Fresnel term with F_0 = 0.05, and G is the geometry term.
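
For illustration, assuming a conventional Cook-Torrance-style microfacet model built from the terms defined above (the specific combination below is an assumption made for this sketch, not taken verbatim from the disclosure), the rendered color of a surface point with normal n = N, albedo A, and roughness R under light direction l and view direction v could be assembled as:

M(N, A, R, l, v) = \left( \frac{A}{\pi} + \frac{D(h, R)\, F(v, h)\, G(l, v, h)}{4\,(n \cdot l)\,(n \cdot v)} \right) (n \cdot l)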

[0051] Figure 5 is a simplified flow diagram illustrating a method 500 for image processing according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, rearranged, replaced, modified, and/or overlapped, and they should not limit the scope of the claims.

[0052] According to an example, method 500 may be performed by a computing system, such as system 100 shown in Figure 1. As shown, method 500 includes step 502 of obtaining a plurality of images characterizing a three-dimensional (3D) scene. The 3D scene may include a first object characterized by a first surface property. The plurality of images may be captured by a camera module — such as camera module 200 shown in Figure 2 — which includes one or more cameras and one or more light sources. The plurality of images may be characterized by a plurality of camera views and a plurality of light directions. The plurality of images may be used as training inputs to train one or more deep learning models to generate one or more images characterized by viewpoints and lighting conditions that are different from the input images. In some embodiments, the plurality of images can provide ground-truth values (e.g., the colors of pixels) to calculate one or more loss functions to further refine the deep learning models for improved performance.

[0053] In step 504, method 500 includes determining a first camera view and a first light direction associated with the 3D scene using the plurality of images. For instance, the first camera view may be different from any of the plurality of camera views. The first light direction may be different from any of the plurality of light directions. In some cases, the first light direction may be associated with a first pixel in the 3D scene.

[0054] In step 506, method 500 includes calculating a set of coordinates using the first camera view and the first light direction. For instance, the set of coordinates may be calculated using a light field representation as shown in Figure 3. By casting a light ray from the first camera view into the first light direction, the light field representation calculates the intersection points of the light ray and two planes in 3D space. In some cases, the set of coordinates may include a four-dimensional (4D) coordinate, which includes the coordinates of the two intersection points on the two planes.

[0055] In step 508, method 500 includes generating a first normal, a first albedo, and a first roughness using the set of coordinates and a first neural network. The first neural network may include a fully connected network. For instance, the first neural network includes a multilayer perceptron (MLP) comprising a series of layers of interconnected neurons. The first neural network takes the set of coordinates as input to generate one or more SVBRDF parameters including the first normal, the first albedo, and the first roughness, which can later be used to render surfaces with complex reflectance properties. The first neural network may be trained to generate SVBRDF parameters for a variety of surfaces (e.g., metals, plastic, cloth, and/or the like) and can be further refined using one or more loss functions. In an example, method 500 may further include decomposing the first surface property to generate a set of SVBRDF parameters (e.g., the first normal, the first albedo, and the first roughness) that describes the surface reflectance property at each point.

[0056] In step 510, method 500 includes generating a first color of the first pixel using the first normal, the first albedo, the first roughness, the first light direction, and a second neural network. The second neural network may include a fully connected network. In step 512, method 500 includes providing an output image associated with the 3D scene using at least the first color. For instance, the second neural network may include a multilayer perceptron (MLP) comprising a series of layers of interconnected neurons. The second neural network may be trained with the SVBRDF parameters (e.g., the first normal, the first albedo, the first roughness) to render an image of the 3D scene from a novel viewpoint via an implicit rendering process by calculating the first color of the first pixel. In some cases, the first normal, the first albedo, and the first roughness are associated with the first surface property of the object contained in the 3D scene. The output image may be characterized by a camera view that is different from any of the plurality of camera views. The output image may also be characterized by a light direction that is different from any of the plurality of light directions. Compared to existing techniques that render an image of the scene by tracing rays through the scene and accumulating the contributions of all surfaces where the rays intersect, embodiments of the present invention can generate photorealistic images of surfaces with complex textures more effectively and efficiently.
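
As a usage sketch tying the hypothetical network sketches above together (tensor shapes and helper names are assumptions), an output image for a novel viewpoint under a chosen light direction could be rendered ray by ray as follows:

import torch

def render_novel_view(decompose_net, render_net, rays_4d, light_dir):
    # rays_4d: (H*W, 4) two-plane coordinates of the target view's pixel rays.
    # light_dir: (3,) unit light direction, broadcast to every ray.
    with torch.no_grad():
        normal, albedo, roughness = decompose_net(rays_4d)
        light = light_dir.unsqueeze(0).expand(rays_4d.shape[0], -1)
        colors = render_net(rays_4d, normal, albedo, roughness, light)
    return colors  # reshape to (H, W, 3) for display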

[0057] According to some embodiments, method 500 may include calculating a first loss using the first color and a ground-truth color of the first pixel. The first loss may be used to update the first neural network for improved performance. For instance, the first loss includes a photometric loss that measures the difference between the predicted color and the ground-truth color. The parameters of the first neural network may be updated by minimizing the photometric loss. In some embodiments, method 500 may further include calculating a second loss using the first color, the first normal, the first albedo, and the first roughness. The second loss may be used to update the second neural network. For instance, the second loss may include a microfacet renderer loss, which is calculated using the SVBRDF parameters generated by the first neural network via a microfacet rendering process. The second neural network may be further refined by enforcing the output color (e.g., the first color of the first pixel) to be close to the microfacet rendering result.

[0058] While the above is a full description of the specific embodiments, various modifications, alternative constructions and equivalents may be used. Therefore, the above description and illustrations should not be taken as limiting the scope of the present invention which is defined by the appended claims.