

Title:
END-TO-END 3D SCENE RECONSTRUCTION AND IMAGE PROJECTION
Document Type and Number:
WIPO Patent Application WO/2022/150217
Kind Code:
A1
Abstract:
The present disclosure provides methods and apparatuses for end-to-end three-dimension (3D) scene reconstruction and image projection. A set of original images shot by a set of cameras may be obtained. A 3D scene may be reconstructed based at least on the set of original images and camera parameters of the set of cameras, through joint optimization of a scene reconstruction network and an image enhancement network. A target viewpoint may be obtained. A projected image corresponding to the target viewpoint may be generated with the 3D scene through the scene reconstruction network. The projected image may be updated to an enhanced projected image through the image enhancement network.

Inventors:
WEI YANAN (US)
ZHANG ZHENG (US)
LIANG YAOBO (US)
ZHANG XIAO (US)
TANG JIE (US)
XU TAO (US)
Application Number:
PCT/US2021/065595
Publication Date:
July 14, 2022
Filing Date:
December 30, 2021
Assignee:
MICROSOFT TECHNOLOGY LICENSING LLC (US)
International Classes:
G06N3/02; G06T15/20
Domestic Patent References:
WO2020242170A12020-12-03
Other References:
WILES OLIVIA ET AL: "SynSin: End-to-End View Synthesis From a Single Image", 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 13 June 2020 (2020-06-13), pages 7465 - 7475, XP033805337, DOI: 10.1109/CVPR42600.2020.00749
MESHRY MOUSTAFA ET AL: "Neural Rerendering in the Wild", 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 15 June 2019 (2019-06-15), pages 6871 - 6880, XP033687320, DOI: 10.1109/CVPR.2019.00704
Attorney, Agent or Firm:
CHATTERJEE, Aaron C. et al. (US)
Claims:
CLAIMS

1. A method for end-to-end three-dimension (3D) scene reconstruction and image projection, comprising: obtaining a set of original images shot by a set of cameras; reconstructing a 3D scene based at least on the set of original images and camera parameters of the set of cameras, through joint optimization of a scene reconstruction network and an image enhancement network; obtaining a target viewpoint; generating a projected image corresponding to the target viewpoint with the 3D scene through the scene reconstruction network; and updating the projected image to an enhanced projected image through the image enhancement network.

2. The method of claim 1, wherein the joint optimization is based at least on a gradient back propagation mechanism of the scene reconstruction network and a gradient back propagation mechanism of the image enhancement network.

3. The method of claim 2, wherein the reconstructing a 3D scene comprises: generating an initial 3D point set; generating a projected image instance based on the initial 3D point set and camera parameters of at least one camera in the set of cameras, through the scene reconstruction network; and producing gradient back propagation based at least on the projected image instance and an original image shot by the at least one camera, to optimize the scene reconstruction network and the initial 3D point set.

4. The method of claim 2, wherein the reconstructing a 3D scene comprises: generating an explicit 3D scene representation.

5. The method of claim 4, wherein the generating an explicit 3D scene representation comprises: generating an initial 3D point set; generating a decoded 3D point set based on the initial 3D point set, through a deep learning model in the scene reconstruction network; projecting the decoded 3D point set to a projected image instance with camera parameters of at least one camera in the set of cameras, through a transformation model in the scene reconstruction network; and producing gradient back propagation based at least on the projected image instance and an original image shot by the at least one camera, to optimize the scene reconstruction network, the initial 3D point set and the decoded 3D point set, wherein the optimized decoded 3D point set corresponds to the explicit 3D scene representation.

6. The method of claim 2, wherein the reconstructing a 3D scene comprises: generating an implicit 3D scene representation.

7. The method of claim 6, wherein the generating an implicit 3D scene representation comprises: generating an initial 3D point set; obtaining camera information corresponding to at least one camera in the set of cameras based on camera parameters of the at least one camera, through a transformation model in the scene reconstruction network; generating a projected image instance based on the initial 3D point set and the camera information, through a deep learning model in the scene reconstruction network; and producing gradient back propagation based at least on the projected image instance and an original image shot by the at least one camera, to optimize the scene reconstruction network and the initial 3D point set, wherein the optimized initial 3D point set corresponds to the implicit 3D scene representation.
8. The method of claim 2, further comprising: updating a projected image instance, which corresponds to at least one camera in the set of cameras and is output by the scene reconstruction network, to an enhanced projected image instance through the image enhancement network; and producing gradient back propagation based at least on the enhanced projected image instance and an original image shot by the at least one camera, to optimize the image enhancement network.

9. The method of claim 8, wherein the image enhancement network is based on a generative adversarial network (GAN).

10. The method of claim 5 or 7, wherein each item in the initial 3D point set corresponds to a point in the 3D scene, and comprises at least a space position coordinate and a randomly-initialized space information encoding representation of the point.

11. The method of claim 5, wherein each item in the decoded 3D point set corresponds to a point in the 3D scene, and comprises at least a space position coordinate and an appearance property of the point.

12. The method of claim 1, wherein the 3D scene is reconstructed for the whole space associated with the 3D scene.

13. The method of claim 1, wherein the target viewpoint corresponds to any space position in the 3D scene.

14. An apparatus for end-to-end three-dimension (3D) scene reconstruction and image projection, comprising: at least one processor; and a memory storing computer-executable instructions that, when executed, cause the at least one processor to: obtain a set of original images shot by a set of cameras, reconstruct a 3D scene based at least on the set of original images and camera parameters of the set of cameras, through joint optimization of a scene reconstruction network and an image enhancement network, obtain a target viewpoint, generate a projected image corresponding to the target viewpoint with the 3D scene through the scene reconstruction network, and update the projected image to an enhanced projected image through the image enhancement network.

15. A computer program product for end-to-end three-dimension (3D) scene reconstruction and image projection, comprising a computer program that is executed by at least one processor for: obtaining a set of original images shot by a set of cameras; reconstructing a 3D scene based at least on the set of original images and camera parameters of the set of cameras, through joint optimization of a scene reconstruction network and an image enhancement network; obtaining a target viewpoint; generating a projected image corresponding to the target viewpoint with the 3D scene through the scene reconstruction network; and updating the projected image to an enhanced projected image through the image enhancement network.
Description:
END-TO-END 3D SCENE RECONSTRUCTION AND IMAGE PROJECTION

BACKGROUND

[0001] 3D scene reconstruction may refer to the process of establishing a 3D mathematical model, suitable for representation and processing by a computer, for a scene in the objective world, which is a key technique for establishing virtual reality that expresses the objective world in a computer. For example, in image-based 3D scene reconstruction, 3D information may be recovered and a 3D scene may be reconstructed from a plurality of scene images shot from different angles, through a predetermined algorithm. 3D scene reconstruction has been widely applied in, e.g., industrial measurement, architectural design, medical imaging, 3D animation games, virtual reality (VR), etc.

SUMMARY

[0002] This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

[0003] Embodiments of the present disclosure propose methods and apparatuses for end-to-end 3D scene reconstruction and image projection. A set of original images shot by a set of cameras may be obtained. A 3D scene may be reconstructed based at least on the set of original images and camera parameters of the set of cameras, through joint optimization of a scene reconstruction network and an image enhancement network. A target viewpoint may be obtained. A projected image corresponding to the target viewpoint may be generated with the 3D scene through the scene reconstruction network. The projected image may be updated to an enhanced projected image through the image enhancement network.

[0004] It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] The disclosed aspects will hereinafter be described in conjunction with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.

[0006] FIG.1 illustrates an exemplary process of end-to-end 3D scene reconstruction and image projection according to an embodiment.

[0007] FIG.2 illustrates an exemplary process of performing joint optimization of a scene reconstruction network and an image enhancement network according to an embodiment.

[0008] FIG.3 illustrates an exemplary implementation of a scene reconstruction network according to an embodiment.

[0009] FIG.4 illustrates an exemplary implementation of a scene reconstruction network according to an embodiment.

[0010] FIG.5 illustrates an exemplary implementation of an image enhancement network according to an embodiment.

[0011] FIG.6 illustrates a flowchart of an exemplary method for end-to-end 3D scene reconstruction and image projection according to an embodiment.

[0012] FIG.7 illustrates an exemplary apparatus for end-to-end 3D scene reconstruction and image projection according to an embodiment.

[0013] FIG.8 illustrates an exemplary apparatus for end-to-end 3D scene reconstruction and image projection according to an embodiment.
DETAILED DESCRIPTION

[0014] The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.

[0015] There are some existing techniques for performing image-based 3D scene reconstruction. These techniques first need to collect a plurality of original images shot by a plurality of pre-deployed cameras, and then reconstruct a 3D scene with these original images. Generally, in order to obtain a better 3D scene reconstruction effect, a large number of cameras need to be deployed, and expensive 3D cameras, e.g., VR cameras, etc., may need to be adopted. In some applications, the reconstructed 3D scene may be further used for implementing image projection, so as to present a projected image associated with a specific space viewpoint to the user. Herein, a viewpoint or space viewpoint may refer to a point in space, which has attributes such as a specific position, a specific direction, etc. For example, when the user is watching at a specific viewpoint in the virtual space corresponding to the 3D scene, a projected image corresponding to the viewpoint may be projected with the reconstructed 3D scene, and the projected image may be presented to the user. Therefore, if the user selects different viewpoints in the 3D scene, corresponding projected images may be presented to the user at these viewpoints, respectively. Accordingly, an experience in which the user feels as if being in a virtual space may be achieved. However, the viewpoints that the user can select are often restricted by the cameras that are pre-deployed during the image collecting process. For example, the user can only select a viewpoint corresponding to each pre-deployed camera, but cannot watch at other viewpoints in the 3D scene. Moreover, in these existing techniques, 3D scene reconstruction and image projection are two independent processes that are trained separately, and these two processes are only combined in the application phase.

[0016] Embodiments of the present disclosure propose end-to-end 3D scene reconstruction and image projection. The embodiments of the present disclosure also propose an image enhancement mechanism for projected images, to generate enhanced projected images with higher quality. In the embodiments of the present disclosure, a 3D scene reconstruction process, an image projection process, and an image enhancement process may be concatenated together and optimized or trained together. For example, end-to-end joint optimization may be performed on a scene reconstruction network used for the 3D scene reconstruction and an image enhancement network used for the image enhancement mechanism. Through the joint optimization, the scene reconstruction network and the image enhancement network may be coupled more closely and effectively, and may adapt to each other more accurately, thereby generating more realistic images and improving the user experience accordingly.

[0017] The embodiments of the present disclosure may achieve high-quality 3D scene reconstruction for the whole space associated with a 3D scene, through at least the end-to-end joint optimization of the scene reconstruction network and the image enhancement network.
Through modeling the whole space, high-quality image projection may be performed at any viewpoint in the 3D scene, without being restricted by the cameras pre-deployed during the image collecting process. For example, it may be simulated that, while the user is walking arbitrarily in the 3D scene, images corresponding to any viewpoints along the way are presented to the user continuously. Therefore, the user's interaction freedom in the 3D scene may be significantly improved, and the user experience may be improved accordingly. In contrast, since the existing techniques do not perform the joint optimization involved in the embodiments of the present disclosure, the 3D scene reconstructed by the existing techniques can only achieve effective image projection at viewpoints corresponding to the cameras used for the image collection, but cannot support high-quality image projection at any other viewpoints.

[0018] The embodiments of the present disclosure may utilize limited camera resources for performing 3D scene reconstruction. For example, compared with the existing techniques, the embodiments of the present disclosure may utilize fewer cameras. Moreover, the cameras adopted in the embodiments of the present disclosure are not limited to 3D cameras, and any other types of ordinary cameras for shooting 2D images may also be adopted. Therefore, the embodiments of the present disclosure may greatly reduce the collection cost of original images and improve the convenience of the original image collecting process. For example, through the end-to-end joint optimization of the scene reconstruction network and the image enhancement network, the embodiments of the present disclosure may achieve good 3D scene reconstruction, and further high-quality image projection, with only original images shot by a small number of cameras.

[0019] The embodiments of the present disclosure may be deployed in any known or potential applications. For example, in a VR live stream of, e.g., a concert, through the embodiments of the present disclosure, a viewer may move arbitrarily in the 3D scene of the concert and watch the performance at any viewpoint. For example, in a 3D video conference involving, e.g., a picture of a conference room, through the embodiments of the present disclosure, a participant may move arbitrarily in the 3D scene of the conference room and watch the conference scene at any viewpoint. Only some exemplary applications are given above, and the embodiments of the present disclosure may also be deployed for any other applications. Moreover, the embodiments of the present disclosure are not limited to either live streaming applications or applications for playing recorded content; that is, original images may be shot in real time, shot in advance, etc.

[0020] FIG.1 illustrates an exemplary process 100 of end-to-end 3D scene reconstruction and image projection according to an embodiment.

[0021] According to the process 100, a set of original images 104 shot by a set of cameras 102 may be obtained first. The set of cameras 102 may be pre-deployed in the actual scene. Taking a concert scene as an example, a plurality of cameras may be deployed at different locations such as the stage, auditorium, passages, etc., so that these cameras may shoot images from different shooting angles. In an implementation, the set of original images 104 may be shot by the set of cameras 102 at the same time point.
Accordingly, the set of original images 104 corresponding to the same time point may be used for reconstructing the 3D scene at that time point through the process 100. It should be understood that the set of original images 104 may be shot by the set of cameras 102 in real time, and thus the process 100 may be performed for, e.g., applications involving live streaming; or the set of original images 104 may be previously shot by the set of cameras 102, and thus the process 100 may be performed for, e.g., applications involving playing recorded content.

[0022] The set of cameras 102 may comprise a total of K cameras. Each camera may have corresponding camera parameters. For example, camera parameters of the k-th camera may be represented as P_k = {x_k, y_k, z_k, θ_k, f_k}, wherein 1 ≤ k ≤ K, (x_k, y_k, z_k) are space position coordinates of the k-th camera in the real space, θ_k is a direction or orientation of the k-th camera, and f_k denotes field of view (FOV) parameters of the k-th camera. An original image I_k shot by the k-th camera is composed of a set of pixels, and may be represented as:

I_k = {(u_i, v_i, c_i)}, 1 ≤ i ≤ N        Equation (1)

wherein N is the number of pixels included in the original image I_k, (u_i, v_i) are position coordinates of the i-th pixel in the original image I_k, and c_i is an appearance property of the i-th pixel, e.g., an RGB value, etc. It should be understood that in the case of adopting ordinary cameras for shooting 2D images, a shot image may be directly represented by Equation (1), while in the case of adopting 3D cameras or VR cameras for shooting images with depth-of-field information, the depth-of-field information obtained in the shooting process may be ignored, and a shot image may still be represented by Equation (1). It can be seen that the embodiments of the present disclosure may even adopt only images shot by ordinary cameras without requiring the use of more expensive 3D cameras, thereby reducing the collection cost of original images and improving the convenience of the original image collecting process.

[0023] At 110, 3D scene reconstruction may be performed. For example, a 3D scene may be reconstructed based at least on the set of original images 104 and the camera parameters of the set of cameras 102.

[0024] An actual 3D scene S may be represented as:

S = {(x_i, y_i, z_i, c_i)}, 1 ≤ i ≤ M        Equation (2)

wherein M is the number of points or voxels included in the actual 3D scene S, (x_i, y_i, z_i) are space position coordinates of the i-th point in the actual 3D scene S, and c_i is an appearance property of the i-th point, e.g., an RGB value, etc.

[0025] As the theoretical basis of 3D scene reconstruction, the following relationship may be established between the original image I_k shot by the k-th camera and the actual 3D scene S:

I_k = ℳ(P_k) · S        Equation (3)

wherein ℳ is a transformation model for projecting the actual 3D scene S into a 2D image corresponding to the camera parameters P_k. ℳ(P_k) may be referred to as camera information of the k-th camera, which is obtained through applying the transformation model ℳ to the camera parameters P_k of the k-th camera. ℳ may be implemented through various approaches. For example, in an implementation, ℳ may be a hybrid transformation matrix which is used for performing projection transformation, affine transformation, rendering transformation, etc. Equation (3) shows that I_k may be represented by performing transformation to the actual 3D scene S based on camera parameters or camera information. Accordingly, the original image I_k may be used for reconstructing a 3D scene through a variant of Equation (3). For example, the actual 3D scene S may be reconstructed with a combination of camera parameters or camera information and corresponding original images.
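For illustration only, the sketch below shows one way the data structures of Equations (1)-(3) could be expressed in code. It is not part of the disclosure: the names (CameraParams, ScenePoint, project_scene), the image resolution, and the simple pinhole-style projection are hypothetical stand-ins for the transformation model ℳ, which the disclosure leaves open.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CameraParams:
    """Camera parameters P_k: position, orientation (yaw only here), FOV (simplified)."""
    x: float
    y: float
    z: float
    yaw: float           # direction/orientation of the camera, in radians
    fov: float           # horizontal field of view, in radians
    width: int = 64      # image resolution used by this toy projection
    height: int = 64

@dataclass
class ScenePoint:
    """One item of the 3D scene S: a space position plus an appearance property (RGB)."""
    x: float
    y: float
    z: float
    rgb: Tuple[float, float, float]

def project_scene(scene: List[ScenePoint], cam: CameraParams):
    """Toy stand-in for Equation (3): pixels = M(P_k) applied to S, using a pinhole model."""
    focal = 0.5 * cam.width / math.tan(0.5 * cam.fov)
    cos_y, sin_y = math.cos(-cam.yaw), math.sin(-cam.yaw)
    pixels = []  # list of (u_i, v_i, c_i), i.e. the representation of Equation (1)
    for p in scene:
        # Move the point into the camera coordinate frame (translation + yaw rotation).
        dx, dy, dz = p.x - cam.x, p.y - cam.y, p.z - cam.z
        cx = cos_y * dx - sin_y * dz
        cz = sin_y * dx + cos_y * dz
        cy = dy
        if cz <= 1e-6:           # point is behind the camera
            continue
        u = cam.width / 2 + focal * cx / cz
        v = cam.height / 2 + focal * cy / cz
        if 0 <= u < cam.width and 0 <= v < cam.height:
            pixels.append((u, v, p.rgb))
    return pixels

# Usage: one red point in front of a camera placed at the origin, looking down +z.
cam = CameraParams(x=0.0, y=0.0, z=0.0, yaw=0.0, fov=math.radians(60))
scene = [ScenePoint(0.1, 0.0, 2.0, (1.0, 0.0, 0.0))]
print(project_scene(scene, cam))
```

In this reading, scene reconstruction is the inverse problem: given several (pixels, camera parameters) pairs, recover the point set that produced them.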
[0026] The 3D scene reconstruction at 110 may implement joint optimization or training of a scene reconstruction network 112 and an image enhancement network 114, through concatenating a 3D scene reconstruction process, an image projection process, and an image enhancement process. Through training the scene reconstruction network 112, a reconstructed 3D scene may be obtained. In an implementation, the scene reconstruction network 112 may be constructed based on an approach of explicitly representing a 3D scene, which may generate an explicit 3D scene representation. In an implementation, the scene reconstruction network 112 may be constructed based on an approach of implicitly representing a 3D scene, which may be used for obtaining an implicit 3D scene representation. A 3D scene reconstructed with the scene reconstruction network 112 may be used for performing image projection. Moreover, during training, the image enhancement network 114 may be used for performing image enhancement on a projected image instance output by the scene reconstruction network 112, in order to improve image quality, e.g., to make the image clearer, to make the image look more realistic, etc. The image enhancement network 114 may be constructed based on various approaches, e.g., a Generative Adversarial Network (GAN). It should be understood that the 3D scene reconstruction at 110 actually performs end-to-end joint optimization of the processes of 3D scene reconstruction, image projection, image enhancement, etc. Further details of this joint optimization will be discussed later in connection with FIG.2.

[0027] It should be understood that, through the above joint optimization, a good 3D scene reconstruction may be achieved even when utilizing only original images shot by a small number of cameras. Therefore, compared with the existing techniques, the embodiments of the present disclosure may utilize a smaller number of cameras, thereby reducing the collection cost of original images and improving the convenience of the original image collecting process.

[0028] After the 3D scene is reconstructed, the process 100 may obtain a target viewpoint 106 at 120. The target viewpoint 106 may be, e.g., designated by a user, or automatically detected based on the user's behavior. The target viewpoint 106 may indicate at what space position, in what direction, etc. the user wants to watch in the 3D scene. The target viewpoint 106 may be represented in an approach similar to camera parameters, e.g., it may be represented through at least one of a space position coordinate, a direction, field of view parameters, etc. It should be understood that since the 3D scene reconstruction is performed at least through the above joint optimization, the reconstructed 3D scene can effectively and fully characterize any point in the whole space, and thus may be used for performing the subsequent image projection process for any target viewpoint. Accordingly, the target viewpoint 106 may actually correspond to any space position in the 3D scene.

[0029] At 130, an image projection process may be performed. For example, a projected image corresponding to the target viewpoint 106 may be generated with the reconstructed 3D scene, through the trained scene reconstruction network 112.

[0030] At 140, an image enhancement process may be performed.
For example, the projected image generated at 130 may be updated to an enhanced projected image 108 corresponding to the target viewpoint 106, through the trained image enhancement network 114. The enhanced projected image 108 may be further presented to the user.

[0031] It should be understood that the process 100 may be repeatedly performed along with time. For example, assuming that the set of original images 104 is obtained at the time point t, accordingly, the 3D scene reconstruction at 110 actually reconstructs a 3D scene at the time point t, and the scene reconstruction network 112 and the image enhancement network 114 are also trained for the time point t. When reaching the time point t+1, a new set of original images obtained at the time point t+1 may be used for performing the 3D scene reconstruction at 110 again, and accordingly a new scene reconstruction network and image enhancement network may be obtained for finally producing a new enhanced projected image. The target viewpoint at the time point t+1 may be the same as or different from the target viewpoint at the time point t.

[0032] FIG.2 illustrates an exemplary process 200 of performing joint optimization of a scene reconstruction network and an image enhancement network according to an embodiment. The process 200 may be performed during the 3D scene reconstruction process at 110 in FIG.1. A scene reconstruction network 210 and an image enhancement network 220 may correspond to the scene reconstruction network 112 and the image enhancement network 114 in FIG.1, respectively.

[0033] In the process 200, an initial 3D point set 202 may be generated first. In an implementation, the initial 3D point set 202 may be a randomly initialized 3D point set. The initial 3D point set 202 may be represented as S_0 = {(x_i, y_i, z_i, e_i)}, wherein 1 ≤ i ≤ M, and M is the number of points or voxels included in a 3D scene. Each item in S_0 corresponds to a point in the 3D scene, and includes at least a space position coordinate of the point and a randomly initialized space information encoding representation. For example, (x_i, y_i, z_i) are pre-defined space position coordinates of the i-th point, obtained through, e.g., uniform sampling in the whole space, and e_i is a randomly initialized space information encoding representation of the i-th point. e_i is a randomly initialized vector, which may be regarded as a hidden variable that encodes 3D scene space information, and at least implicitly contains information related to appearance property and other possible information.

[0034] It is assumed that the process 200 is currently performed for an original image 206 shot by the k-th camera with camera parameters P_k. A projected image instance 212 may be generated based on the initial 3D point set 202 and the camera parameters P_k, through the scene reconstruction network 210.

[0035] Gradient back propagation 214 may be generated based at least on the projected image instance 212 and the original image 206 shot by the k-th camera, to optimize the scene reconstruction network 210 and the initial 3D point set 202. For example, the scene reconstruction network 210 and the initial 3D point set 202 may be optimized by minimizing the difference between the projected image instance 212 and the original image 206. In an implementation, e.g., the per-pixel L1 loss may be adopted in the gradient back propagation.

[0036] In the process 200, the projected image instance 212 output by the scene reconstruction network 210 may be updated to an enhanced projected image instance 222 through the image enhancement network 220.
Gradient back propagation 224 may be generated based at least on the enhanced projected image instance 222 and the original image 206, to optimize the image enhancement network 220. For example, the image enhancement network 220 may be optimized by minimizing the difference between the enhanced projected image instance 222 and the original image 206.

[0037] The joint optimization of the scene reconstruction network 210 and the image enhancement network 220 in the process 200 is based at least on both the gradient back propagation mechanism of the scene reconstruction network 210 (e.g., the gradient back propagation 214) and the gradient back propagation mechanism of the image enhancement network 220 (e.g., the gradient back propagation 224). For example, since the projected image instance 212 serves as both the output of the scene reconstruction network 210 and the input of the image enhancement network 220, when the scene reconstruction network 210 and the image enhancement network 220 are concatenated together in the approach shown in FIG.2 and are optimized or trained together, the influence of the gradient back propagation 224 will be further propagated to the gradient back propagation 214, thereby achieving end-to-end joint optimization of the scene reconstruction network 210 and the image enhancement network 220.

[0038] It should be understood that the process 200 may be repeatedly performed for each original image in a set of original images shot by a set of cameras (e.g., the set of original images 104 in FIG.1), so as to iteratively train or optimize the scene reconstruction network 210 and the image enhancement network 220. Through the joint optimization of the scene reconstruction network 210 and the image enhancement network 220 based on the process 200, a 3D scene may be reconstructed at the scene reconstruction network 210 more accurately and more effectively. Moreover, it should be understood that the embodiments of the present disclosure are not limited to any specific techniques for constructing the scene reconstruction network 210 and the image enhancement network 220.

[0039] As described above, according to the embodiments of the present disclosure, depending on different implementations of the scene reconstruction network, the reconstructing of a 3D scene may comprise generating an explicit 3D scene representation, obtaining an implicit 3D scene representation, etc.
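To make the data flow of the process 200 concrete, the following PyTorch-style sketch shows one possible way the two gradient back propagation mechanisms (214 and 224) could be combined in a single training step. It is an illustrative reading of FIG.2 only, under stated assumptions: the module names (scene_net, enhance_net), the tensor shapes, and the unweighted sum of the two losses are hypothetical; the per-pixel L1 loss follows paragraph [0035], and an adversarial term as in FIG.5 could be added.

```python
import torch
import torch.nn.functional as F

def joint_optimization_step(scene_net, enhance_net, init_points, cam_params,
                            original_image, scene_opt, enhance_opt):
    """One iteration of the joint optimization for the image shot by one camera.

    scene_net      : maps (init_points, cam_params) -> projected image instance
    enhance_net    : maps a projected image -> an enhanced projected image
    init_points    : learnable tensor of shape (M, 3 + d): positions plus encodings e_i
    original_image : ground-truth image I_k from the k-th camera, shape (3, H, W)
    """
    scene_opt.zero_grad()
    enhance_opt.zero_grad()

    # Scene reconstruction network: project the 3D point set to the k-th camera view.
    projected = scene_net(init_points, cam_params)       # (3, H, W)

    # Image enhancement network: refine the projected image instance.
    enhanced = enhance_net(projected)                     # (3, H, W)

    # Gradient back propagation 214: per-pixel L1 between projection and original image.
    loss_proj = F.l1_loss(projected, original_image)

    # Gradient back propagation 224: difference between the enhanced image and the original.
    loss_enh = F.l1_loss(enhanced, original_image)

    # Because the projected image is also the input of the enhancement network, the second
    # loss back-propagates through enhance_net into scene_net and init_points as well,
    # which is the end-to-end coupling described for FIG.2.
    (loss_proj + loss_enh).backward()
    scene_opt.step()
    enhance_opt.step()
    return loss_proj.item(), loss_enh.item()
```

In such a sketch, the optimizer for the scene branch would typically cover both the network weights and the point set, e.g. torch.optim.Adam(list(scene_net.parameters()) + [init_points]), so that gradient back propagation 214 also updates the initial 3D point set 202 as described in paragraph [0035].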
[0040] FIG.3 illustrates an exemplary implementation of a scene reconstruction network according to an embodiment. In this implementation, a scene reconstruction network 300 is constructed based on an approach of explicitly representing a 3D scene, which may generate an explicit 3D scene representation. The scene reconstruction network 300 is an example of the scene reconstruction network 210 in FIG.2.

[0041] An initial 3D point set 302 may correspond to the initial 3D point set 202 in FIG.2, and may be represented as S_0 = {(x_i, y_i, z_i, e_i)}, 1 ≤ i ≤ M.

[0042] The scene reconstruction network 300 may comprise a randomly initialized deep learning model 310, which may be represented as F_θ, wherein θ is a learnable network parameter. The deep learning model 310 may generate a decoded 3D point set 312 based on the initial 3D point set 302. The decoded 3D point set 312 may be represented as S_d = {(x_i, y_i, z_i, c_i)}, wherein 1 ≤ i ≤ M, M is the number of points or voxels included in a 3D scene, (x_i, y_i, z_i) are space position coordinates of the i-th point, and c_i is an appearance property of the i-th point. The deep learning model 310 may at least decode the space information encoding representation e_i in the initial 3D point set 302 into the appearance property c_i in the decoded 3D point set 312. Since the appearance property c_i explicitly represents parameters for presenting the i-th point in a 3D scene, e.g., an RGB value, etc., the decoded 3D point set 312 may correspond to an explicit 3D scene representation of the 3D scene.

[0043] The scene reconstruction network 300 may comprise a transformation model 320, which may utilize camera parameters 304 for projecting the decoded 3D point set 312 into a projected image instance 322. In an implementation, the transformation model 320 may perform image projection according to Equation (3), wherein S_d represents the decoded 3D point set 312, P_k represents the camera parameters 304 of the k-th camera, and the resulting image represents the projected image instance 322.

[0044] As described above in connection with FIG.2, gradient back propagation may be generated based at least on the projected image instance 322 and an original image shot by the k-th camera. The gradient back propagation will optimize the scene reconstruction network 300, and will optimize the initial 3D point set 302 and the decoded 3D point set 312. Accordingly, the optimized decoded 3D point set may be used as an explicit 3D scene representation.

[0045] After the optimization of the scene reconstruction network 300 is completed, the scene reconstruction network 300 may be used for performing image projection for a target viewpoint at, e.g., 130 in FIG.1. It should be understood that during the image projection process, a projected image corresponding to the target viewpoint may be generated with the optimized decoded 3D point set through the transformation model 320 in the scene reconstruction network 300, wherein the target viewpoint may be represented in an approach similar to camera parameters and provided as an input to the transformation model.
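The following minimal sketch illustrates, under stated assumptions, the shape of an explicit scene reconstruction network in the spirit of FIG.3: a per-point decoder standing in for the deep learning model 310, followed by a toy point-splatting projection standing in for the transformation model 320. The class name, layer sizes, camera matrix, and splatting scheme are all hypothetical and are not taken from the disclosure.

```python
import torch
import torch.nn as nn

class ExplicitSceneNet(nn.Module):
    """Decodes per-point encodings e_i into appearance c_i, then splats the decoded
    points into an image for a given camera (toy stand-in for FIG.3)."""

    def __init__(self, enc_dim: int = 32, hidden: int = 64):
        super().__init__()
        # Stand-in for deep learning model 310: maps (x, y, z, e_i) -> RGB appearance c_i.
        self.decoder = nn.Sequential(
            nn.Linear(3 + enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, points: torch.Tensor, cam: torch.Tensor, size: int = 32):
        """points: (M, 3 + enc_dim) initial 3D point set; cam: (3, 4) projection matrix."""
        xyz = points[:, :3]
        rgb = self.decoder(points)                               # decoded 3D point set, (M, 3)

        # Stand-in for transformation model 320: project homogeneous points with the camera.
        ones = torch.ones(xyz.shape[0], 1)
        uvw = (cam @ torch.cat([xyz, ones], dim=1).T).T           # (M, 3)
        uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)              # perspective divide

        # Naive splatting: write each colour into its nearest pixel (gradients flow to rgb).
        image = torch.zeros(3, size, size)
        u = uv[:, 0].round().long().clamp(0, size - 1)
        v = uv[:, 1].round().long().clamp(0, size - 1)
        image[:, v, u] = rgb.T
        return image

# Usage: random initial point set and a simple orthographic-style camera matrix.
net = ExplicitSceneNet()
pts = torch.cat([torch.rand(100, 3) * 2 - 1, torch.randn(100, 32)], dim=1)
cam = torch.tensor([[16.0, 0.0, 0.0, 16.0],
                    [0.0, 16.0, 0.0, 16.0],
                    [0.0, 0.0, 0.0, 1.0]])
print(net(pts, cam).shape)   # torch.Size([3, 32, 32])
```

After joint optimization, the optimized decoded point set (positions plus decoded colours) would play the role of the explicit 3D scene representation, and only the projection step is needed to render a new viewpoint.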
[0046] FIG.4 illustrates an exemplary implementation of a scene reconstruction network according to an embodiment. In this implementation, a scene reconstruction network 400 is constructed based on an approach of implicitly representing a 3D scene, which may be used for obtaining an implicit 3D scene representation. The scene reconstruction network 400 is an example of the scene reconstruction network 210 in FIG.2.

[0047] An initial 3D point set 402 may correspond to the initial 3D point set 202 in FIG.2 or the initial 3D point set 302 in FIG.3, and may be represented as S_0 = {(x_i, y_i, z_i, e_i)}, 1 ≤ i ≤ M.

[0048] The scene reconstruction network 400 may comprise a transformation model 410, which may obtain camera information corresponding to a camera based on camera parameters 404 of the camera. For example, the transformation model 410 may output camera information ℳ(P_k) according to Equation (3), wherein P_k represents the camera parameters 404 of the k-th camera, and ℳ is the transformation model.

[0049] The scene reconstruction network 400 may comprise a deep learning model 420, which may be represented as G_φ, wherein φ is a learnable network parameter. The deep learning model 420 may generate a projected image instance 412 based on the initial 3D point set 402 and the camera information output by the transformation model 410.

[0050] As described above in connection with FIG.2, gradient back propagation may be generated based at least on the projected image instance 412 and an original image shot by the k-th camera. The gradient back propagation will optimize the scene reconstruction network 400 and optimize the initial 3D point set 402. In FIG.4, although the scene reconstruction network 400 does not generate an explicit 3D scene representation similar to the decoded 3D point set 312 in FIG.3, the optimized initial 3D point set will contain a space information encoding representation of the 3D scene, e.g., it will at least implicitly contain information related to appearance property and other possible information. Therefore, the optimized initial 3D point set may be used as an implicit 3D scene representation.

[0051] After the optimization of the scene reconstruction network 400 is completed, the scene reconstruction network 400 may be used for performing image projection for a target viewpoint at, e.g., 130 in FIG.1. It should be understood that during the image projection process, a projected image corresponding to the target viewpoint may be generated with the optimized initial 3D point set through the transformation model 410 and the deep learning model 420 in the scene reconstruction network 400, wherein the target viewpoint may be represented in an approach similar to camera parameters and provided as an input to the transformation model.

[0052] FIG.5 illustrates an exemplary implementation of an image enhancement network according to an embodiment. The image enhancement network 500 is an example of the image enhancement network 220 in FIG.2. In this implementation, the image enhancement network 500 is constructed based on a GAN.

[0053] The image enhancement network 500 may comprise an enhancement model 510. The enhancement model 510 may generate an enhanced projected image instance 512 based on a projected image instance 502, wherein the projected image instance 502 may correspond to the projected image instance 212 in FIG.2, the projected image instance 322 in FIG.3, the projected image instance 412 in FIG.4, etc. During the training process, the enhancement model 510 aims to update a projected image instance to improve image quality.

[0054] The image enhancement network 500 may comprise a discriminator 520. The discriminator 520 may take the enhanced projected image instance 512 and an original image 504 as inputs, wherein the original image 504 may correspond to the original image 206 in FIG.2.

[0055] The enhancement model 510 may be trained to generate an image that is as similar as possible to a real image, e.g., the original image 504, and the discriminator 520 may be trained to distinguish between the image generated by the enhancement model 510 and the real image as accurately as possible.

[0056] As described above, gradient back propagation may be generated based at least on the enhanced projected image instance 512 and the original image 504, to optimize the image enhancement network 500.

[0057] After the optimization or training of the image enhancement network 500 is completed, the image enhancement network 500 may be used for performing image enhancement at, e.g., 140 in FIG.1. It should be understood that in the image enhancement process, the enhancement model 510 in the image enhancement network 500 may be used for updating a projected image, to obtain a high-quality enhanced projected image.

[0058] It should be understood that the embodiments of the present disclosure are not limited to constructing an image enhancement network with a GAN, but may adopt any other technique for constructing an image enhancement network.
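For illustration, the sketch below shows one hypothetical GAN-style arrangement in the spirit of FIG.5: an enhancement model (generator) refines a projected image instance, and a discriminator judges enhanced images against the original image. The network architectures, learning rates, and the combined adversarial plus per-pixel L1 objective are assumptions for the sketch, not details taken from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enhancer = nn.Sequential(              # stand-in for enhancement model 510 (generator)
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
)
discriminator = nn.Sequential(         # stand-in for discriminator 520
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 1, 4, stride=2, padding=1),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Sigmoid(),
)
g_opt = torch.optim.Adam(enhancer.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def enhancement_step(projected, original):
    """One adversarial update; projected / original are (B, 3, H, W) images."""
    batch = original.shape[0]
    enhanced = enhancer(projected)

    # Discriminator: the original image is "real", the enhanced projection is "fake".
    d_loss = (F.binary_cross_entropy(discriminator(original), torch.ones(batch, 1))
              + F.binary_cross_entropy(discriminator(enhanced.detach()), torch.zeros(batch, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Enhancement model: fool the discriminator while staying close to the original image.
    g_loss = (F.binary_cross_entropy(discriminator(enhanced), torch.ones(batch, 1))
              + F.l1_loss(enhanced, original))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return enhanced

# Usage with dummy tensors standing in for a projected image instance and an original image.
out = enhancement_step(torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32))
print(out.shape)   # torch.Size([1, 3, 32, 32])
```

In the jointly optimized setting of FIG.2, the generator-side loss would additionally back-propagate through the projected image into the scene reconstruction network, rather than updating the enhancement network in isolation as in this stand-alone sketch.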
[0059] FIG.6 illustrates a flowchart of an exemplary method 600 for end-to-end 3D scene reconstruction and image projection according to an embodiment.

[0060] At 610, a set of original images shot by a set of cameras may be obtained.

[0061] At 620, a 3D scene may be reconstructed based at least on the set of original images and camera parameters of the set of cameras, through joint optimization of a scene reconstruction network and an image enhancement network.

[0062] At 630, a target viewpoint may be obtained.

[0063] At 640, a projected image corresponding to the target viewpoint may be generated with the 3D scene through the scene reconstruction network.

[0064] At 650, the projected image may be updated to an enhanced projected image through the image enhancement network.

[0065] In an implementation, the joint optimization may be based at least on a gradient back propagation mechanism of the scene reconstruction network and a gradient back propagation mechanism of the image enhancement network.

[0066] In an implementation, the reconstructing a 3D scene may comprise: generating an initial 3D point set; generating a projected image instance based on the initial 3D point set and camera parameters of at least one camera in the set of cameras, through the scene reconstruction network; and producing gradient back propagation based at least on the projected image instance and an original image shot by the at least one camera, to optimize the scene reconstruction network and the initial 3D point set.

[0067] In an implementation, the reconstructing a 3D scene may comprise: generating an explicit 3D scene representation. The generating an explicit 3D scene representation may comprise: generating an initial 3D point set; generating a decoded 3D point set based on the initial 3D point set, through a deep learning model in the scene reconstruction network; projecting the decoded 3D point set to a projected image instance with camera parameters of at least one camera in the set of cameras, through a transformation model in the scene reconstruction network; and producing gradient back propagation based at least on the projected image instance and an original image shot by the at least one camera, to optimize the scene reconstruction network, the initial 3D point set and the decoded 3D point set, wherein the optimized decoded 3D point set corresponds to the explicit 3D scene representation.

[0068] In an implementation, the reconstructing a 3D scene may comprise: generating an implicit 3D scene representation. The generating an implicit 3D scene representation may comprise: generating an initial 3D point set; obtaining camera information corresponding to at least one camera in the set of cameras based on camera parameters of the at least one camera, through a transformation model in the scene reconstruction network; generating a projected image instance based on the initial 3D point set and the camera information, through a deep learning model in the scene reconstruction network; and producing gradient back propagation based at least on the projected image instance and an original image shot by the at least one camera, to optimize the scene reconstruction network and the initial 3D point set, wherein the optimized initial 3D point set corresponds to the implicit 3D scene representation.
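Once the joint optimization of steps 610-620 has completed, steps 630-650 reduce to two forward passes, as the short sketch below illustrates. It reuses the hypothetical scene_net and enhance_net modules from the earlier sketches and is not a definitive implementation; the target viewpoint is passed in the same camera-parameter-like form described for the target viewpoint 106.

```python
import torch

def render_for_viewpoint(scene_net, enhance_net, optimized_points, target_viewpoint):
    """Steps 630-650 of method 600: project the reconstructed scene for a target
    viewpoint and refine the result with the image enhancement network."""
    with torch.no_grad():                                            # networks are already optimized
        projected = scene_net(optimized_points, target_viewpoint)    # step 640: image projection
        enhanced = enhance_net(projected)                            # step 650: image enhancement
    return enhanced
```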
[0069] In an implementation, the method 600 may further comprise: updating a projected image instance, which corresponds to at least one camera in the set of cameras and is output by the scene reconstruction network, to an enhanced projected image instance through the image enhancement network; and producing gradient back propagation based at least on the enhanced projected image instance and an original image shot by the at least one camera, to optimize the image enhancement network. The image enhancement network is based on a GAN.

[0070] Each item in the initial 3D point set may correspond to a point in the 3D scene, and may comprise at least a space position coordinate and a randomly-initialized space information encoding representation of the point.

[0071] Each item in the decoded 3D point set may correspond to a point in the 3D scene, and may comprise at least a space position coordinate and an appearance property of the point.

[0072] In an implementation, the camera parameters of the set of cameras may comprise a space position coordinate, a direction and field of view parameters of each camera.

[0073] In an implementation, the 3D scene may be reconstructed for the whole space associated with the 3D scene.

[0074] In an implementation, the target viewpoint may correspond to any space position in the 3D scene.

[0075] In an implementation, the set of original images may be shot at the same time point.

[0076] In an implementation, the set of original images may be shot in real time or shot in advance.

[0077] It should be understood that the method 600 may further comprise any step/process for end-to-end 3D scene reconstruction and image projection according to the above embodiments of the present disclosure.

[0078] FIG.7 illustrates an exemplary apparatus 700 for end-to-end 3D scene reconstruction and image projection according to an embodiment.

[0079] The apparatus 700 may comprise: an original image obtaining module 710, for obtaining a set of original images shot by a set of cameras; a 3D scene reconstructing module 720, for reconstructing a 3D scene based at least on the set of original images and camera parameters of the set of cameras, through joint optimization of a scene reconstruction network and an image enhancement network; a target viewpoint obtaining module 730, for obtaining a target viewpoint; an image projecting module 740, for generating a projected image corresponding to the target viewpoint with the 3D scene through the scene reconstruction network; and an image enhancement module 750, for updating the projected image to an enhanced projected image through the image enhancement network.

[0080] Moreover, the apparatus 700 may further comprise any other modules that perform steps of the methods for end-to-end 3D scene reconstruction and image projection according to the above embodiments of the present disclosure.

[0081] FIG.8 illustrates an exemplary apparatus 800 for end-to-end 3D scene reconstruction and image projection according to an embodiment.

[0082] The apparatus 800 may comprise: at least one processor 810; and a memory 820 storing computer-executable instructions.
When executing the computer-executable instructions, the at least one processor 810 may: obtain a set of original images shot by a set of cameras; reconstruct a 3D scene based at least on the set of original images and camera parameters of the set of cameras, through joint optimization of a scene reconstruction network and an image enhancement network; obtain a target viewpoint; generate a projected image corresponding to the target viewpoint with the 3D scene through the scene reconstruction network; and update the projected image to an enhanced projected image through the image enhancement network. Moreover, the processor 810 may further perform any other steps/processes of the methods for end-to-end 3D scene reconstruction and image projection according to the above embodiments of the present disclosure.

[0083] The embodiments of the present disclosure propose a computer program product for end-to-end 3D scene reconstruction and image projection, comprising a computer program that is executed by at least one processor for: obtaining a set of original images shot by a set of cameras; reconstructing a 3D scene based at least on the set of original images and camera parameters of the set of cameras, through joint optimization of a scene reconstruction network and an image enhancement network; obtaining a target viewpoint; generating a projected image corresponding to the target viewpoint with the 3D scene through the scene reconstruction network; and updating the projected image to an enhanced projected image through the image enhancement network. Moreover, the computer program may be further executed for implementing any other steps/processes of the methods for end-to-end 3D scene reconstruction and image projection according to the above embodiments of the present disclosure.

[0084] The embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any steps/processes of the methods for end-to-end 3D scene reconstruction and image projection according to the above embodiments of the present disclosure.

[0085] It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.

[0086] It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.

[0087] Processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system.
By way of example, a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, a microcontroller, a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, microcontroller, DSP, or other suitable platform.

[0088] Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc. The software may reside on a computer-readable medium. A computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk. Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).

[0089] The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the claims.