Title:
MULTIPLANE FEATURES ENCODER-RENDERER FOR VIDEO RESTORATION
Document Type and Number:
WIPO Patent Application WO/2024/094300
Kind Code:
A1
Abstract:
An image restoration apparatus (700) configured to: obtain input images (301) of the same scene, each input image (301) being captured from a different physical viewpoint (310); transform plane sweep volumes (302) from the input images (301) to encode a multiplane feature representation (303) of the scene; backward project the multiplane feature representation (303) to generate a back-projected multiplane feature representation (304) for each output image (305); and render each back-projected multiplane feature representation (304) into a respective output image (305) in dependence on a trained renderer model (309). The apparatus may improve the rendering of the output images.

Inventors:
TANAY THOMAS (DE)
LEONARDIS ALES (DE)
MAGGIONI MATTEO (DE)
Application Number:
PCT/EP2022/080612
Publication Date:
May 10, 2024
Filing Date:
November 03, 2022
Assignee:
HUAWEI TECH CO LTD (CN)
TANAY THOMAS (DE)
International Classes:
G06T5/50
Foreign References:
US20200226816A12020-07-16
Other References:
PRATUL P SRINIVASAN ET AL: "Pushing the Boundaries of View Extrapolation with Multiplane Images", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 1 May 2019 (2019-05-01), XP081271207
MILDENHALL BEN ET AL: "Local light field fusion", ACM TRANSACTIONS ON GRAPHICS, ACM, NY, US, vol. 38, no. 4, 12 July 2019 (2019-07-12), pages 1 - 14, XP058686672, ISSN: 0730-0301, DOI: 10.1145/3306346.3322980
DAXENBICHLER JULIA: "Local Light Field Fusion using Focus Stacking", 7 September 2020 (2020-09-07), XP055871299, Retrieved from the Internet [retrieved on 20211209]
CHAN K ET AL.: "BasicVSR: the search for essential components in video super-resolution and beyond", CVPR 21
CHAN K: "BasicVSR++:improving video super-resolution with enhanced propagation and alignment", CVPR 22
BHAT, G: "Deep reparameterization of multi-frame super-resolution and denoising", CVPR 21
MILDENHALL, B ET AL.: "NeRF in the Dark: High Dynamic Range View Synthesis from Noisy Raw Images", CVPR 22
PEARL, N ET AL.: "NaN: Noise-Aware NeRFs for Burst Denoising", CVPR 22
MA, L ET AL.: "Deblur-NeRF: Neural Radiance Fields from Blurry Images", CVPR 22
ZHOU, T ET AL.: "Stereo Magnification: Learning View Synthesis using Multiplane Images", SIGGRAPH 18
MILDENHALL, B ET AL.: "Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines", SIGGRAPH 19
FLYNN, J ET AL.: "DeepView: View Synthesis with Learned Gradient Descent", CVPR 19
TUCKER, R ET AL.: "Single-View View Synthesis with Multiplane Images", CVPR 20
HAN, Y ET AL.: "Single-View View Synthesis in the Wild with Learned Adaptive Multiplane Images", SIGGRAPH 22
Attorney, Agent or Firm:
HUAWEI EUROPEAN IPR (DE)
Claims:
CLAIMS

1. An image restoration apparatus (700), the apparatus (700) comprising one or more processors (701) and a memory (702) storing in non-transient form data defining program code executable by the one or more processors (701) to implement an image restoration model, the apparatus (700) being configured to: obtain input images (301) of the same scene, each input image (301) being captured from a different physical viewpoint (310); forward project each of the input images (301) into two or more depth planes (204) based on a reference viewpoint (311) to generate a plane sweep volume (302) for each input image (301); transform the plane sweep volumes (302) to encode a multiplane feature representation (303) of the scene; backward project the multiplane feature representation (303) to one or more synthetic viewpoint (312) corresponding to respective target physical viewpoints (312) to generate a back-projected multiplane feature representation (304) for each output image (305); and render each back-projected multiplane feature representation (304) into a respective output image (305), the rendering of the respective output images (305) being in dependence on a trained renderer model (309).

2. The image restoration apparatus (700) of claim 1, wherein the apparatus (700) is configured to transform the plane sweep volumes (302) to encode the multiplane feature representation (303) of the scene in dependence on a trained encoder model (307).

3. The image restoration apparatus (700) of claim 2, wherein the trained renderer model (309) and the trained encoder model (307) are trained concurrently by means of end-to-end supervised learning.

4. The image restoration apparatus (700) of claim 2 or 3, wherein the apparatus (700) is configured to implement the trained renderer model (309) and/or the trained encoder model (307) on a convolutional neural network.

5. The image restoration apparatus (700) of any preceding claim, wherein the depth planes (204) are distributed orthogonally with respect to a viewing vector from the reference viewpoint (311).

6. The image restoration apparatus (700) of any preceding claim, wherein the depth planes (204) are computed from characteristics about the capture of the input images (301).

7. The image restoration apparatus (700) of any preceding claim, wherein the apparatus (700) is configured to forward project each of the input images (301) into two or more depth planes (204) based on the reference viewpoint (311) to generate the plane sweep volume (302) for each input image (301) by using homographies induced from the depth planes (204).

8. The image restoration apparatus (700) of any preceding claim, wherein the multiplane feature representation (303) is centred on the reference viewpoint (311) and comprises one set of features per depth plane (204).

9. The image restoration apparatus (700) of any preceding claim, wherein the multiplane feature representation (303) comprises an RGB colour representation and/or a plurality of further feature representations.

10. The image restoration apparatus (700) of any preceding claim, wherein the apparatus (700) is configured to backward project the multiplane feature representation (303) to one or more synthetic viewpoint (312) corresponding to respective target physical viewpoints (312) to generate the back-projected multiplane feature representation (304) for each output image (305) by using inverse homographies induced from the depth planes (204).

11. The image restoration apparatus (700) of any preceding claim, wherein a respective synthetic viewpoint (312) is the same as the corresponding physical viewpoint (310) for that input image (301).

12. The image restoration apparatus (700) of any preceding claim, wherein a respective synthetic viewpoint (312) is different from the corresponding physical viewpoint (310) for that input image (301).

13. The image restoration apparatus (700) of any preceding claim, wherein the apparatus (700) is configured to: obtain a video stream of successive frames, each frame comprising input images (301) for that frame time; and repeat one or more of the steps of any preceding claim for each of the frames.

14. The image restoration apparatus (700) of claim 13, wherein the apparatus (700) is configured to: transform the plane sweep volumes (302) to encode the multiplane feature representation (303) of the scene for only one of the frames of the video stream; and render each back-projected multiplane feature representation (304) into the respective output image (305) for all the frames of the video stream.

15. An imaging device comprising the image restoration apparatus (700) of any preceding claim and a plurality of cameras, the plurality of cameras each being configured to capture a respective input image (301) from a different physical viewpoint (310).

16. A method (600) for restoring an image, the method (600) comprising: obtaining input images of the same scene, each input image being captured from a different physical viewpoint (601); forward projecting each of the input images into two or more depth planes based on a reference viewpoint to generate a plane sweep volume for each input image (602); transforming the plane sweep volumes to encode a multiplane feature representation of the scene (603); backward projecting the multiplane feature representation to one or more synthetic viewpoint corresponding to respective target physical viewpoints to generate a back-projected multiplane feature representation for each output image (604); and rendering each back-projected multiplane feature representation into a respective output image, the rendering of the respective output images being in dependence on a trained renderer model (605).

Description:
MULTIPLANE FEATURES ENCODER-RENDERER FOR VIDEO RESTORATION

FIELD OF THE INVENTION

This invention relates to image restoration, for example for rendering output images based on different viewpoints.

BACKGROUND

This invention relates to video restoration applications for computational photography. Any imaging device is invariably affected by various forms of degradation, such as noise and blur. The degradation may be due to imperfections in the acquisition process, challenging acquisition settings, and inherent limits of the imaging sensors. Computational photography is used in these cases to go beyond the limits of the imaging sensor and improve the final visual quality of the acquired data. In practice this is implemented with an image signal processing (ISP) pipeline consisting of several operations, such as denoising, demosaicking, deblurring, white balancing, and super-resolution, which take as input the degraded image and generate a faithful and high-quality restored one.

This invention may be specifically adapted to deal with the problem of video restoration. A video is in essence an arbitrarily long sequence of images (or frames) acquired in rapid succession. While this problem can be solved with single-image processing (i.e., processing each frame independently), much better solutions can be found by exploiting the temporal correlation naturally existing across frames. In other words, frames that are adjacent or close to each other typically contain similar scenes and structures. Therefore, it can be critical to design a video processing solution which is able to effectively extract and exploit such similarities.

The main issue is to design algorithms that are able to compensate for motion in the video. Traditionally this is done by finding correspondences between pixels (optical flow) which can then be used to warp and align adjacent frames. However, this strategy is often not robust to local motion in the scene, and often generates significant artifacts in the aligned frames. More recently, more sophisticated methods, such as convolutional neural networks (CNNs), have been used in various ways to deal explicitly or implicitly with motion in the video, but these strategies may also rely on finding pixel-wise correspondences and then warping data (in either the image or feature domain). This can be a big limitation, as it is fundamentally still based on pure 2D processing and thus is likely to fail, especially when processing unreliable data (i.e., corrupted by high noise or large blur). Examples of different previous systems are discussed below.

2D-based video restoration: The current systems in video restoration are generally 2D-based. Examples include: BasicVSR [Chan K. et al. BasicVSR: the search for essential components in video super-resolution and beyond (CVPR 21)], BasicVSR++ [Chan K. et al. BasicVSR++: improving video super-resolution with enhanced propagation and alignment (CVPR 22)], and Deep Rep [Bhat, G. et al. Deep reparameterization of multi-frame super-resolution and denoising (CVPR 21)]. 2D-based methods may rely on two main components: recurrent or joint processing of a batch of successive frames and some form of 2D-based optical flow alignment, to compensate for object and camera motion across frames. By design, these approaches may not make use of the 3D consistency of the visual world, and as a result may suffer from inaccurate reconstruction of 3D geometries, particularly along edges, or flickering artifacts (due to the lack of 3D consistency in time).

3D-based NeRFs for video restoration: Examples of 3D-based video restoration include: NeRF in the Dark [Mildenhall, B. et al. NeRF in the Dark: High Dynamic Range View Synthesis from Noisy Raw Images (CVPR 22, oral)], Noise-aware-NeRFs [Pearl, N. et al. NaN: Noise-Aware NeRFs for Burst Denoising (CVPR 22)], and Deblur-NeRF [Ma, L. et al. Deblur-NeRF: Neural Radiance Fields from Blurry Images (CVPR 22)]. 3D-based methods may be based on a Neural Radiance Fields (NeRF) framework. This framework has been successful in some ways, but it may suffer from strong limitations: it may be extremely heavy computationally (orders of magnitude higher than 2D-based SOTA), and it may not generalize easily to unseen and dynamic scenes. These limitations may make this framework unusable in virtually all practical cases for video restoration.

Multiplane images (MPIs) for view synthesis: Instead of the NeRF framework, a 3D-based approach can use the multiplane image framework for view synthesis, introduced in [Zhou, T. et al. Stereo Magnification: Learning View Synthesis using Multiplane Images (SIGGRAPH 18)] for 2 input views, generalized in [Mildenhall, B. et al. Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines (SIGGRAPH 19)] and [Flynn, J. et al. DeepView: View Synthesis with Learned Gradient Descent (CVPR 19)] for more than 2 input views, and adapted in [Tucker, R. et al. Single-View View Synthesis with Multiplane Images (CVPR 20)] and [Han, Y. et al. Single-View View Synthesis in the Wild with Learned Adaptive Multiplane Images (SIGGRAPH 22)] for single-view view synthesis. These works may use an image restoration pipeline similar to that shown in Figure 1.

An image restoration pipeline 100 may receive input images 101. The input images 101 may be forward warped 106 into plane sweep volumes 102. The plane sweep volumes 102 may be transformed into a multiplane image representation 103 by a multiplane image network 107. The multiplane image representation 103 may be backward warped 108 into a backward warped multiplane image representation 104. The backward warped multiplane image representation 104 may be rendered into an output image 105 by an overcompositing operator 109.
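
For reference, the fixed overcompositing operator 109 composites the D RGBA planes of the back-warped MPI in a back-to-front manner. A common closed form used in the MPI literature cited above (with plane d = 1 the farthest from the camera, plane d = D the nearest, and c_d and α_d the RGB and alpha images of plane d) is:

\[
I(x, y) = \sum_{d=1}^{D} c_d(x, y)\,\alpha_d(x, y) \prod_{d'=d+1}^{D} \bigl(1 - \alpha_{d'}(x, y)\bigr)
\]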

The main problem with the standard multiplane image processing pipeline is that MPIs are immutable scene representations. Once predicted, MPIs are turned into novel views by applying two fixed operators with very limited information processing power: backward-warping and overcompositing. There is no dynamic information processing happening after the prediction of the MPI. This means that the MPI representation is subject to a double constraint: on the one hand, it must contain all the information necessary to generate all possible views (no new content is created during backward-warping and overcompositing), but on the other hand it cannot contain any excess information either (no content is discarded). In practice, this problem manifests itself in several ways.

Depth discretization: The correct discretization of scene content across depths is a particularly challenging problem for MPI representations. Missing or redundant information at the boundary between two depth planes can result in depth discretization artifacts after applying the overcompositing operator. It is therefore necessary to communicate across depths during MPI prediction, and multiple communication mechanisms have been proposed. Some may simply predict all the depth planes in one shot, such that cross-depth information can be exchanged within the convolutional layers of the MPI network. This solution may be particularly heavy computationally, especially for more than 2 input views, as the dimension of the input tensor grows with both the number of input views and the number of depth planes. Improved approaches may use 3D convolutions, such that information is only exchanged across neighbouring depth planes. By design however, this solution may not be able to handle interactions between distant depth planes. Other solutions treat each depth plane separately and adopt an iterative refinement approach through learned gradient descent to finetune depth discretization. This solution requires the MPI network to be run multiple times, which is again computationally heavy. Another solution adopts a feature masking strategy, to explicitly deal with “inter-plane interaction”. This solution is both complex (use of multiple networks to predict the masks) and rigid (the masking operations are fixed and still work on a per-depth basis).

View dependent effects: Real world scenes typically contain both matte and glossy surfaces. Formally, a matte or “Lambertian” surface is a surface whose apparent brightness to an observer is the same regardless of the observer’s angle of view. The apparent brightness of a non-Lambertian surface, on the other hand, depends on the point of view. By construction, MPI representations struggle to model such surfaces, because the same set of RGBA images is used to render all viewing directions. Some solutions partially address this problem by predicting one MPI per input view, and then fusing the different MPIs by weighted average.

Expressive power: The expressive power of the MPI representation depends on its dimension, which is fixed and relatively small: D × 4 × H × W, where 4 corresponds to the RGBA channels required for overcompositing. Again, some solutions partially increase expressive power by predicting one MPI per input view (dimensions D × (V × 4) × H × W).

Ease of optimization: Finally, the RGBA nature of the MPI representation requires values in the [0,1] range which is typically enforced using a sigmoid activation function. This, and the overall rigidity of the MPI representation, tends to make optimization slow and subject to suboptimal convergence.

It is desirable to develop an apparatus and method that overcomes the above problems.

SUMMARY

According to a first aspect there is provided an image restoration apparatus, the apparatus comprising one or more processors and a memory storing in non-transient form data defining program code executable by the one or more processors to implement an image restoration model, the apparatus being configured to: obtain input images of the same scene, each input image being captured from a different physical viewpoint; forward project each of the input images into two or more depth planes based on a reference viewpoint to generate a plane sweep volume for each input image; transform the plane sweep volumes to encode a multiplane feature representation of the scene; backward project the multiplane feature representation to one or more synthetic viewpoint corresponding to respective target physical viewpoints to generate a back-projected multiplane feature representation for each output image; and render each back-projected multiplane feature representation into a respective output image, the rendering of the respective output images being in dependence on a trained renderer model. In this way, the rendering may be pre-trained based on target output images. This may allow the rendering to take account of missing or redundant information in the multiplane representation and non-Lambertian effects, allow the multiplane representation to have higher dimensions, and allow unconstraining of the multiplane representation.

In some implementations, the apparatus may be configured to transform the plane sweep volumes to encode the multiplane feature representation of the scene in dependence on a trained encoder model. In this way, the trained encoder model and the trained renderer model may work together and learn from one another to provide improved output images.

In some implementations, the trained renderer model and the trained encoder model may be trained concurrently by means of end-to-end supervised learning. In this way, when the entire pipeline is trained, the trained renderer model and the trained encoder model may be trained to work together and learn from one another.

In some implementations, the apparatus may be configured to implement the trained renderer model and/or the trained encoder model on a convolutional neural network. CNNs may use relatively little pre-processing compared to other algorithms. This means that the CNN may learn to optimize the filters (or kernels) through automated learning, whereas in traditional algorithms these filters are fixed. This independence from prior knowledge and human intervention in image restoration may be advantageous.

In some implementations, the depth planes may be distributed orthogonally with respect to a viewing vector from the reference viewpoint. In this way, the depth planes may be parallel to one another and orthogonal to the viewing vector, which may reduce the variables and calculations to be carried out, reducing computational loading.

In some implementations, the depth planes may be computed from characteristics about the capture of the input images. In this way, the characteristics about the camera may be used to render the output images more accurately as the depths will be more accurate.

In some implementations, the apparatus may be configured to forward project each of the input images into two or more depth planes based on the reference viewpoint to generate the plane sweep volume for each input image by using homographies induced from the depth planes. In this way, the homographies help approximate the real 3D projection from one view to the other, by a set of 2D projections. This means that input images may be simply projected from the physical viewpoint to the reference viewpoint.

In some implementations, the multiplane feature representation may be centred on the reference viewpoint and comprises one set of features per depth plane. In this way, the multiplane feature representation is at a common location relative to the depth planes, such that the variables and calculations to be carried out are reduced, reducing computational loading.

In some implementations, the multiplane feature representation may comprise an RGB colour representation and/or a plurality of further feature representations. In this way, more than simply the RGBA representations may be used, which may provide a more accurate multiplane feature representation, which may improve the output images.

In some implementations, the apparatus may be configured to backward project the multiplane feature representation to one or more synthetic viewpoint corresponding to respective target physical viewpoints to generate the back-projected multiplane feature representation for each output image by using inverse homographies induced from the depth planes. In this way, the inverse homographies help approximate the real 3D projection from one view to the other by a set of 2D projections. This means that the multiplane feature representation may be simply projected from the reference viewpoint to the synthetic viewpoints.

In some implementations, a respective synthetic viewpoint may be the same as the corresponding physical viewpoint for that input image. In this way, the pipeline may be used for pure image restoration, without changing the viewpoint.

In some implementations, a respective synthetic viewpoint may be different from the corresponding physical viewpoint for that input image. In this way, the pipeline may be used for image restoration and alteration of the viewpoint.

In some implementations, the apparatus may be configured to: obtain a video stream of successive frames, each frame comprising input images for that frame time; and repeat one or more of the steps of any preceding claim for each of the frames. In this way, the pipeline may be used for a video stream restoration.

In some implementations, the apparatus may be configured to: transform the plane sweep volumes to encode the multiplane feature representation of the scene for only one of the frames of the video stream; and render each back-projected multiplane feature representation into the respective output image for all the frames of the video stream. In this way, the multiplane features encoder may need to carry out less encoding for a given number of rendered output images, which may reduce the computational loading.

According to a second aspect there is provided an imaging device comprising the image restoration apparatus of any preceding claim and a plurality of cameras, the plurality of cameras each being configured to capture a respective input image from a different physical viewpoint. In this way, the pipeline may be used for a multi-camera imaging apparatus.

According to a third aspect there is provided a method for restoring an image, the method comprising: obtaining input images of the same scene, each input image being captured from a different physical viewpoint; forward projecting each of the input images into two or more depth planes based on a reference viewpoint to generate a plane sweep volume for each input image; transforming the plane sweep volumes to encode a multiplane feature representation of the scene; backward projecting the multiplane feature representation to one or more synthetic viewpoint corresponding to respective target physical viewpoints to generate a back-projected multiplane feature representation for each output image; and rendering each back-projected multiplane feature representation into a respective output image, the rendering of the respective output images being in dependence on a trained renderer model. In this way, the rendering may be pre-trained based on target output images. This may allow the rendering to take account of missing or redundant information in the multiplane representation and non-Lambertian effects, allow the multiplane representation to have higher dimensions, and allow unconstraining of the multiplane representation.

BRIEF DESCRIPTION OF THE FIGURES

The present invention will now be described by way of example with reference to the accompanying drawings. In the drawings:

Figure 1 schematically illustrates a prior art image restoration pipeline.

Figure 2 schematically illustrates a multiplane representation from different viewpoints.

Figure 3 schematically illustrates an exemplary image restoration pipeline.

Figure 4 schematically illustrates the exemplary image restoration pipeline of Figure 3, including a breakdown of the input images.

Figure 5 shows example results of the exemplary image restoration pipeline compared to the prior art.

Figure 6 illustrates an example method for restoring an image.

Figure 7 illustrates an example of an apparatus configured to perform the methods described herein.

DETAILED DESCRIPTION

The apparatuses and methods described herein concern using a trained renderer model.

Embodiments of the present system may tackle one or more of the problems previously mentioned by rendering each back-projected multiplane feature representation into a respective output image in dependence on a trained renderer model. In this way, the rendering may be pre-trained based on target output images. This may allow the rendering to take account of missing or redundant information in the multiplane representation and non-Lambertian effects, allow the multiplane representation to have higher dimensions, and allow unconstraining of the multiplane representation.

The present system may shift the paradigm of video restoration from pure 2D processing to hybrid 2D and 3D processing. That is, a video processing solution that uses additional camera information (the 3D position of the camera in space as well as information about its lens) which enables the creation of a 3D representation of a sequence of frames. This 3D representation may be further processed to allow disentanglement of motion in the scene not only from 2D pixel information, but also from the geometry and depth in the scene. This additional information may be critical, especially when 2D information is corrupted by large degradations, to generate high-quality output.

Figure 2 schematically illustrates a multiplane representation from different viewpoints.

Figure 2 shows a first view 201 of a scene. The first view 201 may be captured by a camera. The first view 201 may be the original view in an image restoration pipeline. The scene may comprise objects. The objects 201a, 201b, 201c may be captured in the first view 201. Depending on the viewing vector 205 of the first view 201, the objects 201a, 201b, 201c may be located in different places within the first view 201. Figure 2 also shows a second view 202 of the scene. The second view 202 may also be captured by a camera in a different location to the first view 201. The second view 202 may be a novel view in an image restoration pipeline. The second view 202 may be different to the first view 201. The scene may comprise objects. The objects 202a, 202b, 202c may be captured in the second view 202. Depending on the viewing vector 206 of the second view 202, the objects 202a, 202b, 202c may be located in different places within the second view 202. If the viewing vector 205 of the first view 201 and the viewing vector 206 of the second view 202 are different, the objects in the scene will appear in different places in the first view 201 and the second view 202, as shown in Figure 2.

The first view 201 may be forward projected into depth planes 204. In Figure 2, three depth planes 204a, 204b, 204c are shown. The objects 203a, 203b, 203c may appear in different depth planes 204a, 204b, 204c if they are different distances from the camera along the first viewing vector 205. The second view 202 may be forward projected into depth planes 204. In Figure 2, three depth planes 204a, 204b, 204c are shown. The objects 203a, 203b, 203c may appear in different depth planes 204a, 204b, 204c if they are different distances from the camera along the second viewing vector 206. Alternatively, the first view 201 or second view 202 may be backward projected from the depth planes 204. For example, the first view 201 may be forward projected into depth planes 204, and the depth planes 204 may be backward projected to the second view 202, as shown in Figure 2. In this way, the second view 202 may be a novel view generated from the original view of the first view 201.

The present system may address the general problem of the immutability of the MPI representation by replacing the standard overcompositing operator with a learnable module (see Figure 3). The pipeline may comprise a learnable Encoder-Renderer pair interspaced with fixed warping operators, manipulating an unconstrained multiplane representation, referred to as Multiplane Features (MPF) - the generalization of the MPI to feature space. This design change constitutes a significant conceptual departure from the standard MPI processing pipeline, and it may address the practical problems in the following ways.

Depth discretization: The Encoder-Renderer pair may divide the problem of depth discretization into two smaller subproblems. The encoder may focus on depth separation: the problem of fusing information across views for each depth, while the renderer may focus on depth fusion: the problem of fusing information across depths for each view. Missing or redundant information in the MPF may now be dealt with by the renderer, without introducing depth discretization artifacts.

View dependent effects: The MPF representation may not be static (unlike the MPI). The MPF may be reprocessed dynamically for each rendered view by the renderer. This view-specific processing may allow the modelling of more complex non-Lambertian effects than the standard MPI representation.

Expressive power: The dimension of the MPF representation is not constrained anymore, and its expressive power can be increased. The number of channels now becomes a hyperparameter, typically set to 8 or 16 (2 to 4 times the size of the standard MPI).

Ease of optimization: The values of the MPF representation are now also unconstrained, and don’t have to stay in the [0, 1] range. This means that the sigmoid activation function can be dropped, therefore facilitating optimization.

Figure 3 schematically illustrates an exemplary image restoration pipeline. The pipeline 300 may be implemented by an image restoration apparatus (as shown in Figure 7). The pipeline 300 may comprise a warping operator 306. The warping operator 306 may be a fixed warping operator 306. In other words, the warping operator 306 may be pre-defined and is not affected by the supervised training described herein. The pipeline 300 may comprise a multiplane features encoder 307. The multiplane features encoder 307 may be a learnable multiplane features encoder 307. The pipeline 300 may comprise an inverse-warping operator 308. The inverse-warping operator 308 may be a fixed inverse-warping operator 308. In other words, the inverse-warping operator 308 may be pre-defined and is not affected by the supervised training described herein. The pipeline 300 may comprise a multiplane features renderer 309. The multiplane features renderer 309 may be a learnable multiplane features renderer 309. The data flowing through the pipeline 300 shown in Figure 3 is represented by the cuboids. The data may comprise tensors including the image information. The nomenclature is the following: V: number of input views; D: number of depth planes; h, w: height and width of the input images; H, W: height and width of the multiplane representation; R: number of rendered views.
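
As a rough illustration of this data flow only (the sizes below are assumed example values, not values taken from the application), the tensors at each stage can be sketched as follows:

```python
import torch

# Assumed example sizes, not values from the application:
V, D, C, R = 4, 32, 8, 4       # input views, depth planes, MPF channels, rendered views
h = w = H = W = 256            # image and multiplane-representation height/width

inputs = torch.rand(V, 3, h, w)        # V input images 301
psv    = torch.rand(V, D, 3, H, W)     # plane sweep volumes 302 (after the fixed warping operator 306)
mpf    = torch.rand(D, C, H, W)        # multiplane feature representation 303 (after the learnable encoder 307)
bpmpf  = torch.rand(R, D, C, H, W)     # back-projected representations 304 (after the fixed inverse-warping operator 308)
out    = torch.rand(R, 3, h, w)        # rendered output images 305 (after the learnable renderer 309)
```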

The apparatus may be configured to obtain an input image 301. The apparatus may be configured to receive the input image 301 from a separate apparatus, for example an imaging device such as a camera. The apparatus may be configured to obtain more than one input image 301 over a period of time. The input image 301 may be a frame of a video stream. The apparatus may be configured to obtain a video stream of input image 301 frames.

The input image 301 may have been captured at a physical viewpoint 310. The input image 301 may have been captured by a camera. The physical viewpoint 310 is the actual viewpoint of the camera. The apparatus may be configured to obtain more than one input image 301 of the same scene at a time. Each input image 301 may have been captured from a different physical viewpoint 310. Each input image 301 may have been captured by a different camera capturing the same scene from different physical viewpoints 310.

Figure 4 schematically illustrates the exemplary image restoration pipeline of Figure 3, including a breakdown of the input images.

Figure 4 shows four input images 301a, 301b, 301c, 301d captured from different physical viewpoints 310. However, the number of input images 301 may be varied depending on the requirements for the apparatus. The physical viewpoints 310 may be relatively close to one another. The scene may comprise one or more objects. The object may be captured in the input image 301. When a plurality of input images 301a, 301b, 301c, 301d are captured from different physical viewpoints 310, the object may appear in a different location in each input image 301a, 301b, 301c, 301d, as explained with reference to Figure 2.

The input image 301 may be degraded. The degradation may be due to noise and blur, which are due to imperfections in the acquisition process, challenging acquisition settings, or inherent limits of the imaging sensors. The degradation may cause issues in the location and structure of the objects in the input image 301.

The apparatus may also be configured to receive camera parameters corresponding to the input images 301. The camera parameters may be in the form of two matrices: the intrinsic matrix may comprise information about the camera lens (focal length, centre of projection), and the extrinsic matrix may comprise information about the pose of the camera in 3D space (rotation, translation).
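
For illustration only (the application does not specify a particular parameterisation), a standard pinhole-camera form of these two matrices is:

\[
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad
[R \mid t] = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix},
\]

where f_x, f_y are the focal lengths, (c_x, c_y) is the centre of projection, R is the camera rotation and t its translation.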

The apparatus may be configured to forward project each of the input images 301 into two or more depth planes 204. In particular, the warping operator 306 may be configured to forward project each of the input images 301 into two or more depth planes 204. The apparatus may be configured to forward project each of the input images 301 into two or more depth planes 204 by using homographies induced from the depth planes 204. As illustrated in Figure 4, there may be a set of depth planes 204 for each of the input images 301a, 301b, 301c, 301d.

The number of depth planes 204 may be designated by the unit D. Each of the depth planes 204 may be located at a different level of depth. The depth planes 204 may be based on a reference viewpoint 311. The reference viewpoint 311 may be a combination of the different physical viewpoints 310 of the input images 301. The reference viewpoint 311 may be an arbitrary viewpoint. The depth planes 204 may be distributed orthogonally with respect to a viewing vector from the reference viewpoint 311. In other words, the depth planes 204 may extend perpendicular to the line between the camera and depth plane 204. The depth planes 204 may also be distributed in disparity with reference to the viewing vector. The depth planes 204 may be computed from characteristics about the capture of the input images. For example, the depth planes 204 may be computed from the intrinsic matrix and/or the extrinsic matrix. In this way, information about the camera lens and/or the camera pose may be used to generate the depth planes 204. The camera parameters used to generate the depth planes 204 may then be used to induce the homographies.

The apparatus may be configured to generate a plane sweep volume 302 for each input image 301. In particular, the warping operator 306 may be configured to generate a plane sweep volume 302 for each input image 301. The forward projection of each of the input images 301 into the two or more depth planes 204 may form the plane sweep volume 302. The plane sweep volume 302 may define a combination of each of the depth planes 204. The plane sweep volume 302 may be made up of the set of depth planes 204. As illustrated in Figure 4, there may be a plane sweep volume 302a, 302b, 302c, 302d for each of the input images 301a, 301b, 301c, 301d.
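
A minimal sketch of this forward-warping step is given below, assuming fronto-parallel depth planes in the reference frame, depth planes spaced uniformly in disparity, and the usual plane-induced homography; the sign conventions for (R, t) and the exact construction used in the application are assumptions.

```python
import cv2
import numpy as np

def plane_sweep_volume(src_img, K_src, K_ref, R, t, depths):
    """Warp one source image into D fronto-parallel depth planes of the
    reference camera (a sketch of the fixed warping operator 306).
    Assumes X_src = R @ X_ref + t maps reference to source coordinates."""
    h, w = src_img.shape[:2]
    n = np.array([[0.0, 0.0, 1.0]])   # fronto-parallel plane normal in the reference frame
    planes = []
    for d in depths:
        # Plane-induced homography mapping reference pixels to source pixels
        # for the plane at depth d (points with n . X_ref = d).
        H = K_src @ (R + (t.reshape(3, 1) @ n) / d) @ np.linalg.inv(K_ref)
        # WARP_INVERSE_MAP: sample the source image at H @ [x_ref, y_ref, 1].
        warped = cv2.warpPerspective(src_img, H, (w, h),
                                     flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
        planes.append(warped)
    return np.stack(planes)           # plane sweep volume 302: shape (D, h, w, 3)

# Example depth planes distributed uniformly in disparity (inverse depth):
depths = 1.0 / np.linspace(1.0 / 1.0, 1.0 / 100.0, 32)
```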

The apparatus may be configured to transform the plane sweep volumes 302 to encode a multiplane feature representation 303 of the scene. In particular, the multiplane features encoder 307 may be configured to transform the plane sweep volumes 302 to encode a multiplane feature representation 303 of the scene. Each of the plane sweep volumes 302 may be transformed into a single multiplane feature representation 303, as illustrated in Figure 4. In this way, the features from each of the input images 301 may be combined into the single multiplane feature representation 303.

The multiplane feature representation 303 may be centred on the reference viewpoint 311. In other words, the multiplane feature representation 303 may orientate the features of the image as if they were viewed from the reference viewpoint 311. The multiplane feature representation 303 may comprise a 3D representation of the input image 301. The multiplane feature representation 303 may comprise one set of features per depth plane. In this way, the features from each depth plane 204 of each input image 301 may be combined to be accounted for in, and form, the multiplane feature representation 303.

In particular, the multiplane feature representation 303 may comprise a plurality of feature representations. Each of the feature representations may depict the content of the scene at a given depth plane 204. The multiplane feature representation 303 may comprise an RGB (red-green-blue) colour representation. The multiplane feature representation 303 may comprise a plurality of further feature representations, such as an alpha component which represents the transparency level. The advantage of the multiplane feature representation 303 is that it is not limited to only RGB and alpha (a total of four) representations, like a multiplane image representation. As such, the multiplane feature representation 303 may comprise an RGB representation and more than one further feature representation. Alternatively, the multiplane feature representation 303 need not comprise an RGB representation and may comprise a plurality of other representations.

The apparatus may be configured to transform the plane sweep volumes 302 to encode the multiplane feature representation 303 of the scene in dependence on a trained encoder model 307. The multiplane features encoder 307 may be configured to implement the trained encoder model 307. The trained encoder model 307 may be trained by means of end-to-end supervised learning of the pipeline 300. In other words, the entire pipeline 300 may be trained, which in turn trains the trained encoder model 307. The training may be carried out before the pipeline 300 is used in practice. The training may be carried out so as to minimise the loss between the rendered views and corresponding ground truths.

The multiplane features encoder 307 may implement the trained encoder model 307 on a convolutional neural network (CNN). CNNs may use relatively little pre-processing compared to other algorithms. This means that the CNN may learn to optimize the filters (or kernels) through automated learning, whereas in traditional algorithms these filters are fixed. This independence from prior knowledge and human intervention in image restoration may be advantageous. Figure 4 illustrates four multiplane features encoders 307. However, the number of multiplane features encoders 307 may be varied depending on the number of input images 301. In practice, there may be a single multiplane features encoder 307 configured to transform any number of plane sweep volumes 302.
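
A minimal sketch of such an encoder is shown below: a small CNN, shared across depth planes, that fuses the V views of each depth plane into C feature channels. This is only an illustration of the tensor shapes and of the view-fusion role; the application describes a deeper network (with a base channel count of 64), which this sketch does not reproduce.

```python
import torch
import torch.nn as nn

class MPFEncoder(nn.Module):
    """Sketch of a learnable multiplane features encoder 307: fuses the V views
    of each depth plane of the plane sweep volumes into C feature channels."""
    def __init__(self, num_views=4, out_channels=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_views * 3, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, out_channels, 3, padding=1),
        )

    def forward(self, psv):                                      # psv: (V, D, 3, H, W)
        V, D, _, H, W = psv.shape
        x = psv.permute(1, 0, 2, 3, 4).reshape(D, V * 3, H, W)   # one sample per depth plane
        return self.net(x)                                       # MPF 303: (D, C, H, W)

mpf = MPFEncoder()(torch.rand(4, 32, 3, 64, 64))                 # -> (32, 8, 64, 64)
```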

The apparatus may be configured to backward project the multiplane feature representation 303 to one or more synthetic viewpoint 312. In particular, the inverse-warping operator 308 may be configured to backward project the multiplane feature representation 303 to one or more synthetic viewpoint 312. The apparatus may be configured to backward project the multiplane feature representation 303 to one or more synthetic viewpoint 312 by using inverse homographies induced from the depth planes 204.

The depth planes 204 may be computed from characteristics about the capture of the input images. For example, the depth planes 204 may be computed from the intrinsic matrix and/or the extrinsic matrix. In this way, information about the camera lens and/or the camera pose may be used to generate the depth planes 204. The camera parameters used to generate the depth planes 204 may then be used to induce the inverse homographies.

The synthetic viewpoint 312 may correspond to a respective target physical viewpoint. In other words, the apparatus may aim to backward project the multiplane feature representation 303 to a plurality of target physical viewpoints. The target physical viewpoints may be viewpoints which the user has requested to view. Alternatively, the target physical viewpoints may be arbitrary. A respective synthetic viewpoint 312 may be the same as the corresponding physical viewpoint 310 for that input image 301. In other words, the input and output viewpoints for a certain input image 301 may remain the same. Each of the synthetic viewpoints 312 may be the same as their corresponding physical viewpoints 310. In this way, the apparatus may be used for pure image restoration, and there is no adaptation of the viewpoint. Alternatively, a respective synthetic viewpoint 312 may be different from the corresponding physical viewpoint 310 for that input image 301. In other words, the input and output viewpoints for a certain input image 301 may change. Each of the synthetic viewpoints 312 may be different from their corresponding physical viewpoints 310. In this way, the apparatus may be used for image restoration and viewpoint adaptation.

The apparatus may be configured to generate a back-projected multiplane feature representation 304 for each output image 305. In particular, the inverse-warping operator 308 may be configured to generate a back-projected multiplane feature representation 304 for each output image 305. The backward projection of the multiplane feature representation to one or more synthetic viewpoint may form the back-projected multiplane feature representation 304 for each output image 305. As illustrated in Figure 4, there may be a back-projected multiplane feature representation 304a, 304b, 304c, 304d for each of the output images 305a, 305b, 305c, 305d.

The apparatus may be configured to render each back-projected multiplane feature representation 304 into a respective output image 305. In particular, the multiplane features renderer 309 may be configured to render each back-projected multiplane feature representation 304 into a respective output image 305. The overcompositing operator illustrated in Figure 1 may transform each back-projected MPI into a final rendered view by alpha-compositing the images in the MPI recursively in a back-to-front manner. However, the multiplane features renderer 309 may fuse each back-projected multiplane feature representation 304 to render a respective output image 305. As illustrated in Figure 4, in which there are four input images 301a, 301b, 301c, 301d, each of the four back-projected multiplane feature representations 304a, 304b, 304c, 304d is fused to render a corresponding output image 305a, 305b, 305c, 305d. Generally, there are the same number of back-projected multiplane feature representations 304 as output images 305.
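
A minimal sketch of such a renderer is shown below: it replaces the fixed overcompositing operator by a small CNN that fuses all D × C channels of a back-projected multiplane feature representation into a single RGB image. Again, this is only an illustration of the shapes and of the depth-fusion role; the application describes a deeper network (with a base channel count of 64), which this sketch does not reproduce.

```python
import torch
import torch.nn as nn

class MPFRenderer(nn.Module):
    """Sketch of a learnable multiplane features renderer 309: fuses the D depth
    planes of a back-projected multiplane feature representation into an RGB image."""
    def __init__(self, num_planes=32, mpf_channels=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_planes * mpf_channels, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 3, 3, padding=1),
        )

    def forward(self, bpmpf):                               # bpmpf: (R, D, C, H, W)
        R, D, C, H, W = bpmpf.shape
        x = bpmpf.reshape(R, D * C, H, W)                   # stack depth planes as channels
        return self.net(x)                                  # output images 305: (R, 3, H, W)

out = MPFRenderer()(torch.rand(4, 32, 8, 64, 64))           # -> (4, 3, 64, 64)
```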

The number of input images 301 and the number of output images 305 may be the same. Although, in some embodiments, there may be a different number of output images 305 to input images 301 depending on the requirements on the output of the pipeline 300. For example, the pipeline 300 may be used to generate additional viewpoints 312 of the input images 301, e.g., two input images 301 may be used to generate four output images 305. The multiplane features renderer 309 may aim to produce output images 305 including what the scene should look like from the synthetic viewpoints 312.

The apparatus may be configured to render each back-projected multiplane feature representation 304 into a respective output image 305 in dependence on a trained renderer model 309. The multiplane features renderer 309 may be configured to implement the trained renderer model 309. The trained renderer model 309 may be trained by means of end-to-end supervised learning of the pipeline 300. In other words, the entire pipeline 300 may be trained, which in turn trains the trained renderer model 309. The training may be carried out before the pipeline 300 is used in practice. The training may be carried out so as to minimise the loss between the rendered views and corresponding ground truths.

In implementations which comprise both a multiplane features encoder 307 and a multiplane features renderer 309, both may be trained by means of end-to-end supervised learning of the pipeline 300. In this way, the entire pipeline 300 may be trained, which in turn concurrently trains both the multiplane features encoder 307 and the multiplane features renderer 309. This may be advantageous as the outputs of the multiplane features encoder 307 are used as inputs to the multiplane features renderer 309, so both the multiplane features encoder 307 and the multiplane features renderer 309 may learn from one another. This may improve the accuracy of the encoding and the accuracy of the rendering. The training may be carried out before the pipeline 300 is used in practice. The training may be carried out so as to minimise the loss between the rendered views and corresponding ground truths.

The multiplane features encoder 307 may be implemented as a UNet, for example with a base number of channels of 64. The multiplane features renderer 309 may be implemented as a UNet, for example with a base number of channels of 64. The multiplane feature representation 303 may have 8 channels. An example training was based on 90 scenes, with 10 scenes kept out for validation.

The multiplane features renderer 309 may implement the trained renderer model 309 on a convolutional neural network (CNN). CNNs may use relatively little pre-processing compared to other algorithms. This means that the CNN may learn to optimize the filters (or kernels) through automated learning, whereas in traditional algorithms these filters are fixed. This independence from prior knowledge and human intervention in image restoration may be advantageous. Figure 4 illustrates four multiplane features renderers 309. However, the number of multiplane features renderers 309 may be varied depending on the number of input images 301 or output images 305. In practice, there may be a single multiplane features renderer 309 configured to render any number of output images 305.

Figure 4 also shows a pathway 313. The pathway 313 may pass the input images 301 directly to the multiplane features renderer 309. In this way, the apparatus may be configured to use both the input images 301 and the back-projected multiplane feature representations 304 in combination to render the output images 305. The pathway 313 may only function when the output image 305 has the same viewpoint as the input image 301.

As described herein, the apparatus may be configured to obtain a video stream of successive frames. Each of the frames may comprise an input image 301 for that frame time. In situations where there are input images 301 captured from different physical viewpoints 310, there may be a plurality of input images 301 for each frame time. A practical implementation may be that a plurality of cameras at different physical viewpoints 310 may be capturing a video stream concurrently with respect to one another. The apparatus may be configured to repeat one or more of the steps of the pipeline 300 for each of the frames. In other words, the successive input images 301 may flow through the pipeline 300 successively. The output of the pipeline 300 would be a successive video stream of output images 305. In this way, the apparatus may be used for restoring a video stream.

In the case of a video stream as the input to the pipeline 300, the apparatus may be configured to transform the plane sweep volumes 302 to encode the multiplane feature representation 303 of the scene for only one of the frames of the video stream. In other words, only one of the frames of the video stream is passed through the multiplane features encoder 307. The frame selected to be passed through the multiplane features encoder 307 may be the first frame in the video stream. Alternatively, the frame selected to be passed through the multiplane features encoder 307 may be an arbitrary frame in the video stream. In some implementations, a frame may be selected to be passed through the multiplane features encoder 307 after a certain number of frames in the video stream. For example, a frame may be selected to be passed through the multiplane features encoder 307 every 100 frames.

In the case of a video stream as the input to the pipeline 300, the apparatus may be configured to render each back-projected multiplane feature representation 304 into the respective output image 305 for all of the frames of the video stream. In other words, all of the frames of the video stream are passed through the multiplane features renderer 309. By not encoding all of the frames while still rendering all of the frames, the computational loading may be reduced. The ability to do this may be realised by the concurrent training of both the multiplane features encoder 307 and the multiplane features renderer 309. In this way, the output of the multiplane features encoder 307 may be more stable for frames which are close together, and so it is not required to encode all the frames. A sketch of this keyframe strategy is given below.
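
The following sketch illustrates the keyframe strategy just described; the callables, the 100-frame interval and the loop structure are assumptions used for illustration, not the application's API.

```python
from typing import Callable, Iterable, List

def restore_video(frames: Iterable,          # per-frame-time sets of input images 301
                  encode: Callable,          # warping operator 306 + multiplane features encoder 307
                  backproject: Callable,     # inverse-warping operator 308 for the requested viewpoints
                  render: Callable,          # multiplane features renderer 309
                  encode_every: int = 100) -> List:
    """Run the encoder only on keyframes, but back-project and render every frame."""
    outputs, mpf = [], None
    for i, views in enumerate(frames):
        if mpf is None or i % encode_every == 0:
            mpf = encode(views)                            # MPF 303 refreshed only on keyframes
        outputs.append(render(backproject(mpf, i)))        # output images 305 for every frame
    return outputs
```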

Figure 5 shows example results of the exemplary image restoration pipeline compared to the prior art. The exemplary image restoration pipeline 300 and three prior art pipelines were tested on a video denoising problem. The Spaces dataset was used, which consists of 100 scenes captured from 3 to 10 rig positions, using a rig containing 16 cameras. Synthetic noise was applied to the images. The synthetic noise followed a standard noise model made of signal-dependent and signal-independent components modelled by Gaussian distributions. In particular, a high noise level was used with gains 4, 8, 16 and 20.
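
A minimal sketch of such a signal-dependent plus signal-independent Gaussian noise model is given below; the exact parameterisation (in particular how the gain maps onto the two variance terms) is an assumption and may differ from the one used in the experiments.

```python
import numpy as np

def add_synthetic_noise(img, gain=16.0, read_std=1.0, rng=None):
    """Add heteroscedastic Gaussian noise: a signal-dependent (shot-noise-like)
    term plus a signal-independent (read-noise-like) term. `img` is assumed to
    be in linear intensity units."""
    rng = rng or np.random.default_rng()
    shot_var = gain * np.clip(img, 0.0, None)   # signal-dependent variance
    read_var = (gain * read_std) ** 2           # signal-independent variance
    return img + rng.normal(0.0, np.sqrt(shot_var + read_var))
```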

Performance of the exemplary image restoration pipeline 300 and the three prior art pipelines is compared in Figure 5. The three prior art pipelines are state-of-the-art 2D-based video denoisers. Peak signal to noise ratio (PSNR), structural similarity index (SSIM) and learned perceptual image patch similarity (LPIPS) metrics were used for the comparison. As shown in Figure 5, the exemplary image restoration pipeline 300 produces results across almost all metrics that are significantly above the three prior art pipelines, particularly at high noise levels.

Figure 6 summarises an example of a method 600 for restoring an image. At step 601, the method 600 comprises obtaining input images of the same scene, each input image being captured from a different physical viewpoint. At step 602, the method 600 comprises forward projecting each of the input images into two or more depth planes based on a reference viewpoint to generate a plane sweep volume for each input image. At step 603, the method 600 comprises transforming the plane sweep volumes to encode a multiplane feature representation of the scene. At step 604, the method 600 comprises backward projecting the multiplane feature representation to one or more synthetic viewpoint corresponding to respective target physical viewpoints to generate a back-projected multiplane feature representation for each output image. At step 605, the method 600 comprises rendering each back-projected multiplane feature representation into a respective output image, the rendering of the respective output images being in dependence on a trained renderer model.

An example of an apparatus 700 configured to implement the methods described herein is schematically illustrated in Figure 7. The apparatus 700 may be implemented on an electronic device, such as a laptop, tablet, smart phone or digital camera. The apparatus 700 comprises a processor 701 configured to process the datasets in the manner described herein. For example, the processor 701 may be implemented as a computer program running on a programmable device such as a Central Processing Unit (CPU). The apparatus 700 comprises a memory 702 which is arranged to communicate with the processor 701. Memory 702 may be a non-volatile memory. The processor 701 may also comprise a cache (not shown in Figure 7), which may be used to temporarily store data from memory 702. The apparatus 700 may comprise more than one processor 701 and more than one memory 702. The memory 702 may store data that is executable by the processor 701. The processor 701 may be configured to operate in accordance with a computer program stored in non-transitory form on a machine-readable storage medium. The computer program may store instructions for causing the processor 701 to perform its methods in the manner described herein.

Specifically, the image restoration apparatus 700 may comprise one or more processors, such as processor 701, and a memory 702 storing in non-transient form data defining program code executable by the processor(s) to implement an image restoration model. The image restoration apparatus may obtain input images of the same scene, each input image being captured from a different physical viewpoint. The image restoration apparatus may forward project each of the input images into two or more depth planes based on a reference viewpoint to generate a plane sweep volume for each input image. The image restoration apparatus may transform the plane sweep volumes to encode a multiplane feature representation of the scene. The image restoration apparatus may backward project the multiplane feature representation to one or more synthetic viewpoint corresponding to respective target physical viewpoints to generate a back-projected multiplane feature representation for each output image. The image restoration apparatus may render each back-projected multiplane feature representation into a respective output image, the rendering of the respective output images being in dependence on a trained renderer model.

The image restoration apparatus 700 may be implemented on an imaging device. The imaging device may comprise the image restoration apparatus 700. The imaging device may be a laptop, tablet, smart phone, or digital camera. The imaging device may comprise a plurality of cameras. Each of the plurality of cameras may be configured to capture a respective input image 301. Each of the respective input images 301 may be captured from a different physical viewpoint 310, as described herein.

The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description, it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.