Title:
SINGLE PASS RENDERING AND UPSCALING
Document Type and Number:
WIPO Patent Application WO/2021/239205
Kind Code:
A1
Abstract:
A graphics processing system configured to process an image comprising a plurality of rendering pixels, the system being configured to divide the image into a plurality of tiles, each tile comprising a sub-set of the rendering pixels of the image and corresponding to multiple display pixels, the system comprising: an on-chip tile memory; and a processor, wherein the processor is configured to, for at least one rendering pixel in the image, perform a processing pass comprising: rendering the rendering pixel by: determining a set of samples, each sample having a respective location in the rendering pixel and the locations collectively having a sampling density in the rendering pixel; shading the rendering pixel at a first resolution to give a shading result for the rendering pixel; writing a shade value for each of the set of samples to the on-chip tile memory based on the shading result; and storing, in the on-chip tile memory, an upscaled representation of the rendered pixel by determining, in dependence on the shade values for the set of samples, a display pixel shade value for each of a plurality of display pixels overlapping the rendering pixel at a target resolution greater than the first resolution.

Inventors:
LIU BAOQUAN (DE)
Application Number:
PCT/EP2020/064404
Publication Date:
December 02, 2021
Filing Date:
May 25, 2020
Assignee:
HUAWEI TECH CO LTD (CN)
LIU BAOQUAN (DE)
International Classes:
G06T11/40; G06T15/00
Foreign References:
EP3598393A12020-01-22
US20140327696A12014-11-06
Other References:
ENGEL WOLFGANG: "Diary of a Graphics Programmer: Multisample Anti-Aliasing", 16 June 2009 (2009-06-16), XP055772135, Retrieved from the Internet [retrieved on 20210203]
PAVLOS MAVRIDIS ET AL: "MSAA-Based Coarse Shading for Power-Efficient Rendering on High Pixel-Density Displays", HIGH-PERFORMANCE GRAPHICS 2015, 7 August 2015 (2015-08-07), pages 1 - 1, XP055770616
MAVRIDIS, P.; PAPAIOANNOU, G.: "MSAA-based coarse shading for power-efficient rendering on high pixel-density displays", HIGH PERFORMANCE GRAPHICS, 2015
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
CLAIMS

1. A graphics processing system configured to process an image comprising a plurality of rendering pixels, the system being configured to divide the image into a plurality of tiles, each tile comprising a sub-set of the rendering pixels of the image and corresponding to multiple display pixels, the system comprising: an on-chip tile memory; and a processor, wherein the processor is configured to, for at least one rendering pixel in the image, perform a processing pass comprising: rendering the rendering pixel by: determining a set of samples, each sample having a respective location in the rendering pixel and the locations collectively having a sampling density in the rendering pixel; shading the rendering pixel at a first resolution to give a shading result for the rendering pixel; and writing a shade value for each of the samples to the on-chip tile memory based on the shading result; and storing, in the on-chip tile memory, an upscaled representation of the rendered pixel by determining, in dependence on the shade values for the set of samples, a display pixel shade value for each of a plurality of display pixels overlapping the rendering pixel at a target resolution greater than the first resolution.

2. The system of claim 1, wherein shading is performed once per rendering pixel.

3. The system of claim 1 or claim 2, wherein the target resolution is greater than the first resolution by a factor equal to the sampling density of the rendering pixel.

4. The system of any preceding claim, wherein the sampling density corresponds to a multisample anti-aliasing level.

5. The system of claim 4, wherein the multisample anti-aliasing level is 4x or 16x.

6. The system of any preceding claim, wherein the system is implemented by a mobile graphics processing unit.

7. The system of any preceding claim, wherein rendering of the rendering pixel further comprises determining, for each of the sample locations, whether a respective sample location is covered by a primitive in the image, wherein the shade value for each sample having a location determined to be covered by a primitive is equal to the shading result for the rendering pixel.

8. The system of claim 7, wherein the shading location for a rendering pixel is located at the centroid of the samples determined to be covered by a primitive in the image.

9. The system of claim 7 or claim 8, wherein each of the samples determined to be covered by a primitive in a rendering pixel correspond to the same color in the rendered pixel.

10. The system of any preceding claim, wherein the processor is configured to subsequently write the upscaled representation stored at the on-chip tile memory to a frame buffer in a system memory.

11. The system of any preceding claim, wherein the system is configured to implement an application programming interface, wherein the application programming interface is Vulkan or OpenGL ES.

12. The system of any preceding claim, wherein the step of storing the upscaled representation comprises transforming one of the samples and/or one of the shade values of the samples of the rendering pixel into one display pixel at the target resolution.

13. The system of any preceding claim, wherein the target resolution is 4K or 8K.

14. A method for processing an image at an image processing system, the image comprising a plurality of rendering pixels, the system being configured to divide the image into a plurality of tiles, each tile comprising a sub-set of the rendering pixels of the image and corresponding to multiple display pixels, the method comprising, in a processing pass, for at least one rendering pixel in the image: rendering the rendering pixel by: determining a set of samples, each sample having a respective location in the rendering pixel and the locations collectively having a sampling density in the rendering pixel; shading the rendering pixel at a first resolution to give a shading result for the rendering pixel; and writing a shade value for each of the samples to an on-chip tile memory based on the shading result; and storing, in the on-chip tile memory, an upscaled representation of the rendered pixel by determining, in dependence on the shade values for the set of samples, a display pixel shade value for each of a plurality of display pixels overlapping the rendering pixel at a target resolution greater than the first resolution.

15. A computer program which, when executed by a computer, causes the computer to perform the method of claim 14.

Description:
SINGLE PASS RENDERING AND UPSCALING

FIELD OF THE INVENTION

This invention relates to rendering and upscaling of images with anti-aliasing, for example for game rendering on a tile-based mobile graphics processing unit (GPU).

BACKGROUND

The demands placed on mobile GPUs are rising as a result of increases in resolution in consumer electronics products. It is desirable for mobile GPUs to render for large-screen products at 4K or 8K image resolution. 4K screens are becoming more popular and 8K screens are now coming to market. The requirements of these resolutions are illustrated in Figure 1.

Using a mobile GPU to support rendering at 4K and 8K is very challenging for several reasons. Firstly, the pixel shading stage is already the bottleneck for most 3D games on a mobile screen at 1080p resolution. Secondly, a large screen means many more pixels to be shaded: 4K is four times the resolution of 1080p, and 8K is 16 times the resolution of 1080p. Even modern high-end desktop graphics cards may struggle to cope natively with this extra shading workload. Therefore, it may not be possible to perform pixel shading natively at 8K resolution. As a result, upscaling may have to be performed after a low-resolution image is rendered cheaply.

Since it is generally not possible for a mobile GPU to perform pixel shading directly at 8K native resolution, due to the heavy shading workload involved in the pixel shader, rendering at low resolution (LR) and then upscaling to a target high resolution (HR) is required. Figure 2 schematically illustrates the process of tackling the pixel shading workload by rendering at reduced resolution and then upscaling to the target resolution required. Upscaling essentially means stretching a LR image to achieve a HR one. It requires "zooming in" on a LR image, so that it can fill a larger HR screen for display.

However, a drawback of this process is that when a game performs rendering at reduced resolution, the rasterization occurs at a low resolution. Upscaling also means under-sampling of the original imaging signal at a bigger spatial domain. As a result, this introduces aliasing to the rendering result. This can end up creating highly jagged edges of the objects in the image and in the worst case can compromise the entire visibility of smaller objects, resulting in shimmering and flickering effects. For example, spatial noise and temporal flickering may be introduced.

According to the Nyquist-Shannon Theorem, the sampling rate must be equal or superior to double the highest frequency of the signal, otherwise there will be aliasing.
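This frequency-folding behaviour can be sketched numerically. The following Python helper (illustrative only, not part of the disclosure; the function name is an assumption) computes the apparent frequency of an under-sampled pure tone:

```python
def aliased_frequency(signal_hz: float, sample_rate_hz: float) -> float:
    """Return the apparent (aliased) frequency observed when a pure tone
    of `signal_hz` is sampled at `sample_rate_hz`, by folding the signal
    frequency back into the Nyquist band [0, sample_rate_hz / 2]."""
    nyquist = sample_rate_hz / 2.0
    f = signal_hz % sample_rate_hz   # fold into one sampling period
    if f > nyquist:                  # mirror around the Nyquist limit
        f = sample_rate_hz - f
    return f
```

For example, a 7 Hz tone sampled at only 10 Hz (below the 14 Hz Nyquist rate) appears as a spurious 3 Hz tone, whereas a 3 Hz tone sampled at 10 Hz is reproduced faithfully.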

Mobile devices require real-time rendering performance, with high frame-rate and low latency user interaction for games. At the same time, they require low power consumption to extend battery life and low heat dissipation for comfort when hand-held by a user for long periods. This may be a dilemma for rendering high quality and complex pixel shading at high screen resolution, and at the same time ensuring power consumption and heat dissipation are kept at a very low level.

One known upscaling method is nearest reconstruction, as described at https://en.wikipedia.org/wiki/Image_scaling. In the simplest form of scaling, the original pixels are duplicated to fit the required higher resolution. This is a simple and fast upscaling method. However, the image quality is commonly very poor and the resulting image can look jagged and blocky.
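Nearest reconstruction amounts to duplicating each source pixel into a block of destination pixels. A minimal Python sketch (illustrative only; the function name and list-of-rows image layout are assumptions):

```python
def nearest_upscale(image, factor):
    """Upscale a 2D image (a list of rows of pixel values) by an integer
    `factor` using nearest-neighbour reconstruction: each source pixel is
    duplicated into a factor x factor block of destination pixels."""
    out = []
    for row in image:
        stretched = [p for p in row for _ in range(factor)]   # widen the row
        out.extend([list(stretched) for _ in range(factor)])  # repeat it vertically
    return out
```

The blocky output of this duplication is exactly the jagged appearance described above: object edges in the source become `factor`-pixel-wide staircases in the result.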

DLSS, as described at https://www.techspot.com/article/1873-radeon-image-sharpening-vs-nvidia-dlss/, leverages a deep neural network to extract multidimensional features of the rendered scene to construct a high-quality final image at high resolution. DLSS forces a game to render at a lower resolution (typically 1440p) and then uses an artificial intelligence (AI) algorithm to infer what it would look like if it were rendered at a higher resolution (typically 4K). This can result in less aliasing. The method exploits the dedicated Tensor Cores on an Nvidia GPU for the AI algorithm. Less pixel shading is required, so rendering is faster. However, the method is invasive and requires case-by-case AI training for each game before release, involving a large training set and re-training for every game update. It is also not suitable for other GPUs that lack dedicated Tensor Cores, and DLSS has produced blurry image quality for some tested scenes.

As described at https://www.amd.com/en/technologies/radeon-software-image-sharpening, Radeon Image Sharpening is an intelligent sharpening technology that may provide a lift in the visual quality of a game without a significant dip in performance. The effect examines the high-contrast parts of any given scene in a game and artificially draws out more detail. Textures are sharpened overall, which may make a 1080p image appear close to 1440p after upscaling and display on a HR monitor. Unlike DLSS, which must be implemented on a game-by-game basis, Radeon Image Sharpening is image post-processing applied after upscaling, involving some spatial-domain image filtering. It can be switched on or off for any game, does not require per-game implementation and is non-invasive. There is a low cost on discrete GPUs with dedicated video memory, and good image quality in most cases. However, the method does not know which elements should be sharpened and which should not, so existing artefacts may be sharpened and become more visible.

As described at https://software.intel.com/en-us/articles/checkerboard-rendering-for-real-time-upscaling-on-intel-integrated-graphics, checkerboard rendering (CBR) is a technique that produces full-resolution pixels with a significant reduction in shading and minimal impact on visual quality. It develops a solution to the problem of rendering content that was designed for a higher target resolution. It may take content that had a target resolution of 1080p and instead render it at 540p (960 x 540), then use the checkerboard technique to scale up to 1080p by temporally combining the results of two frames, each rendered with 2x multisample anti-aliasing (MSAA). It allows 4x upscaling with good image quality at the low cost of 2x MSAA, and there is a reduction in pixel shading rate (only one quarter of the target HR). However, the first column of pixels is not reconstructed, so aliasing will appear. The previous frame must be maintained with 2x MSAA when rendering the current frame, so the method has a bandwidth cost. Temporal artefacts, such as blurring and ghosting, may result if objects move too fast between two frames. It is also invasive, as users must use specific extensions.

Variable rate shading (VRS), or coarse pixel shading, as described at https://microsoft.github.io/DirectX-Specs/d3d/VariableRateShading.html, is a mechanism to enable allocation of rendering performance and power at varying rates across the rendered image for a frame. Visually, there are cases where the shading rate can be reduced with little or no reduction in perceptible output quality, leading to "free" performance. This model extends supersampling-with-MSAA in the opposite, "coarse pixel", direction by adding a new concept of coarse shading, where shading can be performed at a frequency coarser than a pixel. That is to say, a group of pixels can be shaded as a single unit and the result is then broadcast to all samples in the group. A coarse shading API allows apps to specify the number of pixels that belong to a shaded group. The coarse pixel size can be varied after the render target is allocated, so different portions of the screen or different draw passes can have different subsampling rates. VRS is thus a complex solution for decoupling the shading rate from visibility, allowing the shading rate to be varied flexibly at a fine granularity. However, VRS requires more complex GPU thread scheduling logic, which depends on accessing the density image at different portions of the screen. There is an extra scheduling cost for the many concurrent GPU threads, due to the irregularity of the different shading rates across neighboring pixels. There is also a cost to build and access the density map for each frame, in order to specify and retrieve the shading rate for each coarse pixel.

MSAA-based Coarse Shading, as described in Mavridis, P., and Papaioannou, G. 2015, 'MSAA-based coarse shading for power-efficient rendering on high pixel-density displays', High Performance Graphics 2015, employs a similar idea of decreasing pixel-shader invocations by using MSAA samples, but it focusses on a software implementation for existing GPUs and involves two render passes: the first pass renders to an intermediate render buffer at a lower pixel count but with the appropriate number of MSAA samples, while the second pass reads in the intermediate render buffer and performs the mapping of sub-pixel MSAA samples to pixels. The problem with this two-pass approach is that the intermediate render buffer (which has a very large memory size due to the multiple MSAA samples for each pixel) storing the output of the first pass has to be read back by the second pass from DDR memory to the GPU chip, which is very bandwidth-consuming due to the outputting and inputting of the same intermediate render buffer during the two passes.

In summary, the key technical problems for rendering and upscaling are that the algorithm may be too complex for mobile GPUs due to their limited computing power, memory bandwidth, battery and heat dissipation capabilities, and that image quality may be poor, due to aliasing on the final image after LR rendering and then upscaling to HR.

It is desirable to develop a method for rendering and upscaling that overcomes these problems.

SUMMARY OF THE INVENTION

According to a first aspect there is provided a graphics processing system configured to process an image comprising a plurality of rendering pixels, the system being configured to divide the image into a plurality of tiles, each tile comprising a sub-set of the rendering pixels of the image and corresponding to multiple display pixels, the system comprising: an on-chip tile memory; and a processor, wherein the processor is configured to, for at least one rendering pixel in the image, perform a processing pass comprising: rendering the rendering pixel by: determining a set of samples, each sample having a respective location in the rendering pixel and the locations collectively having a sampling density in the rendering pixel; shading the rendering pixel at a first resolution to give a shading result for the rendering pixel; and writing a shade value for each of the samples to the on-chip tile memory based on the shading result; and storing, in the on-chip tile memory, an upscaled representation of the rendered rendering pixel by determining, in dependence on the shade values for the set of samples, a display pixel shade value for each of a plurality of display pixels overlapping the rendering pixel at a target resolution greater than the first resolution.

Both rendering (including pixel shading) and upscaling may be finished within only one single rendering pass along the GPU pipeline. This may produce a high-quality final image at high resolution. In some implementations, there may be no jagged object edges, with very low rendering cost at low shading resolution and low memory bandwidth consumption on a mobile GPU.

The shading may be performed once per rendering pixel. Using per-pixel shading at low resolution may produce a high-quality final image at high resolution with a low rendering cost.

The target resolution may be greater than the first resolution by a factor equal to the sampling density of the rendering pixel. This may allow the samples of each pixel to be read from the on-chip memory and resolved one-by-one into a respective display pixel (one sample per display pixel) during upscaling in the on-chip tile memory of the mobile GPU.

The sampling density may correspond to a multisample anti-aliasing level. The multisample anti-aliasing (MSAA) level may be 4x or 16x. Other MSAA levels may be used. This may allow the pixel to be rendered at low resolution before being upscaled to the target resolution.

The system may be implemented by a mobile graphics processing unit. This may allow the system to be implemented in consumer products such as smartphones and tablets.

Rendering of the rendering pixel may further comprise determining, for each of the sample locations, whether a respective sample location is covered by a primitive in the image. The shade value for each sample having a location determined to be covered by a primitive may be equal to the shading result for the rendering pixel. This may result in a low rendering cost.

The shading location for a rendering pixel may be located at the centroid of the samples determined to be covered by a primitive in the image. Each of the samples determined to be covered by a primitive in a rendering pixel may correspond to the same color in the rendered pixel, resulting from per-pixel shading. Therefore, a pixel may be shaded only once and the result assigned only to the samples that are covered by a primitive. The processor may be configured to subsequently write the upscaled representation stored at the on-chip tile memory to a frame buffer in a system memory. This may allow the image to be subsequently displayed.

The system may be configured to implement an application programming interface. The application programming interface may be Vulkan or OpenGL ES. Other suitable APIs may be used. The subpass feature (or equivalent) of APIs such as Vulkan can be used to perform the custom resolving (from a pixel with multiple samples into multiple final pixels) inside the on-chip tile memory.

The use of extensions in these APIs may allow the sample locations to be moved into regular grid positions inside a pixel. In Vulkan, one rendering pass can be split into multiple subpasses, which can allow rendering and upscaling to be performed in a single processing pass without requiring modification of the GPU’s hardware.

The step of storing the upscaled representation may comprise transforming one of the samples and/or one of the shade values of the samples of the rendering pixel into one display pixel at the target resolution. This may allow the pixel to be rendered at low resolution before upscaling to high resolution.

The target resolution may be 4K or 8K. This may allow the system to be implemented in current high resolution consumer electronics products. Other target resolutions may be used.

According to a second aspect there is provided a method for processing an image at image processing system, the image comprising a plurality of rendering pixels, the system being configured to divide the image into a plurality of tiles, each tile comprising a sub-set of the rendering pixels of the image and corresponding to multiple display pixels, the method comprising, in a processing pass, for at least one rendering pixel in the image: rendering the rendering pixel by: determining a set of samples, each sample having a respective location in the rendering pixel and the locations collectively having a sampling density in the rendering pixel; shading the rendering pixel at a first resolution to give a shading result for the rendering pixel; and writing a shade value for each of the samples to an on-chip tile memory based on the shading result; and storing, in the on-chip tile memory, an upscaled representation of the rendered rendering pixel by determining, in dependence on the shade values for the set of samples, a display pixel shade value for each of a plurality of display pixels overlapping the rendering pixel at a target resolution greater than the first resolution. Using this method, both rendering and upscaling may be finished within only one single rendering pass along the GPU pipeline. This may produce a high-quality final image at high resolution. In some implementations, there may be no jagged object edges, with very low rendering cost (due to per-pixel shading at low resolution) and low memory bandwidth consumption on a mobile GPU.

The method may further comprise determining, for each of the sample locations, whether a respective sample location is covered by a primitive in the image. The shade value for each sample having a location determined to be covered by a primitive may be equal to the shading result for the rendering pixel. This may result in a low rendering cost.

The target resolution may be greater than the first resolution by a factor equal to the sampling density of the rendering pixel. This may allow the shade values for the samples of each pixel to be read from the on-chip memory and resolved one-by-one into a respective display pixel (one sample shade value per display pixel) in the on-chip tile memory.

The sampling density may correspond to a multisample anti-aliasing level. This may allow the pixel to be rendered at low resolution before being upscaled to the target resolution.

According to a third aspect there is provided a computer program which, when executed by a computer, causes the computer to perform the method described above. The computer program may be provided on a non-transitory computer readable storage medium.

BRIEF DESCRIPTION OF THE FIGURES

The present invention will now be described by way of example with reference to the accompanying drawings.

In the drawings:

Figure 1 shows the requirements of different screen resolutions.

Figure 2 schematically illustrates rendering at reduced resolution and then upscaling.

Figure 3 schematically illustrates a pipeline where both rendering and upscaling are finished within only one single rendering pass.

Figures 4(a) and 4(b) schematically illustrate rendering without and with centroid sampling respectively.

Figure 5 schematically illustrates moving sample locations into a regular grid position inside a pixel.

Figure 6 illustrates modification of the tile buffer hardware block of a GPU before outputting to the system memory.

Figure 7 shows a flowchart for an example of a method for processing an image.

Figure 8 shows an example of a graphics processing system.

DETAILED DESCRIPTION OF THE INVENTION

Herein, rendering refers to any form of generating a visible image, for example for displaying the image on a computer screen, printing, or projecting.

Figure 3 shows a rendering pipeline implemented by a graphics processing system. The system that implements the pipeline comprises an on-chip tile memory and a processor. The system implements tile-based rendering and is configured to receive an image to be rendered comprising a plurality of rendering pixels. The system is configured to divide the image into a plurality of tiles, each tile comprising a sub-set of the rendering pixels of the input image and corresponding to multiple display pixels.

As will now be described, the processor is configured to perform rendering and upscaling of the image in a single pass along the mobile GPU pipeline.

During the rendering stage, following vertex shading, shown at 301, the image is rasterized, shown at 302. Multisample anti-aliasing (MSAA) is used to mitigate aliasing over the edges of geometries. This is facilitated by the usage of multi-sampled buffers (render targets). Some possible configurations for the buffers are 2x, 4x, 8x and 16x (i.e. nx, where 'n' denotes the number of samples allocated per pixel). MSAA takes place as part of the rasterization stage 302 of the pipeline and is shown generally at 303 in Figure 3. In the example shown in Figure 3, 16x MSAA is used. However, other MSAA levels, such as 4x, may be used. MSAA, which is supported by GPUs in hardware, allows a pixel to be shaded (i.e. an RGB color to be calculated in a pixel shader) only once for a primitive, but stores the result in multiple samples for the pixel. In a preferred implementation, the primitive, such as triangle 304 in Figure 3, can be tested for coverage at each of the 'n' sample points, building a 16-bit (for 16x MSAA) coverage mask identifying the portion of the pixel covered by the triangle. The pixel shader 305 is then executed once and the shading result for the pixel is assigned across the samples identified by the coverage mask. This multi-sampled buffer is then resolved by fixed-function GPU hardware into a single color value for each final pixel in the frame buffer to address edge aliasing.

By default, the resolving hardware can calculate an average color of these multiple samples as each pixel’s final color. For example, if a pixel’s first three samples contain the exact same shade (under 4x MSAA), and the fourth sample contains background color, averaging them yields anti-aliased output for the final pixel on an object edge. However, as will be described in more detail below, in the system described herein, centroid sampling can be employed and the pixel may be shaded once for a primitive, but the result assigned only to the samples covered by the primitive.
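The default averaging resolve described above can be modelled with a short Python sketch (illustrative only; the function name and the colour-tuple layout are assumptions, not GPU code):

```python
def average_resolve(sample_colors):
    """Default MSAA resolve: average all sample colours of one pixel,
    per channel, into a single final colour for the frame buffer."""
    n = len(sample_colors)
    channels = len(sample_colors[0])
    return tuple(sum(color[ch] for color in sample_colors) / n
                 for ch in range(channels))
```

With 4x MSAA, three samples holding a red shade and one holding black background average to a 75%-intensity red, which is the anti-aliased edge colour described in the text.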

Therefore, using MSAA, the pipeline determines a set of n samples per pixel, each sample having a respective location in the pixel and the locations collectively having a sampling density in the pixel that is equal to the MSAA level. In a preferred implementation, it can then be determined which of the samples in the pixel are covered by a primitive in the image.
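The per-sample coverage test can be sketched as follows (an illustrative Python model; real rasterizers use fixed-point edge functions and watertight tie-breaking rules that this simplified sketch omits):

```python
def edge(a, b, p):
    """Signed-area edge function: positive when p is to the left of a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def coverage_mask(triangle, sample_locations):
    """Build an n-bit coverage mask for one pixel: bit i is set when
    sample i lies inside the counter-clockwise triangle."""
    a, b, c = triangle
    mask = 0
    for i, s in enumerate(sample_locations):
        if edge(a, b, s) >= 0 and edge(b, c, s) >= 0 and edge(c, a, s) >= 0:
            mask |= 1 << i
    return mask
```

The resulting mask is what the pipeline uses to decide which of a pixel's samples receive the single shading result.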

The pixel shader 305 is then executed to shade each rendering pixel once (per-pixel shading) to give a shading result for each pixel. A shade value for each of the samples is then written to the on-chip tile memory (i.e. the tile buffer) based on the shading result for the pixel. In the preferred implementation, the shading result is assigned to the samples which were determined by the system to be covered by a primitive. After rendering, as indicated at 306, a custom resolve is then performed on the multiple samples of the rendered pixel stored in the on-chip tile memory (tile buffer). As will be described in more detail below, in the on-chip tile memory, an upscaled representation of a rendered pixel is stored after determining, in dependence on the shade values for the set of samples of the rendering pixel, a display pixel shade value for each display pixel overlapping the rendering pixel. One of the shade values for the set of samples is assigned to one display pixel as that display pixel’s shade value in the high resolution image. The processor may be configured to subsequently write the upscaled representation to a frame buffer in system memory at the target resolution. Once all of the pixels in the image have been rendered and upscaled and output to the system memory, the final image 307 may then be displayed.

Therefore, both rendering (including pixel shading) and upscaling of the image are finished within only one single rendering pass along the pipeline. This pipeline enables rendering at a low resolution (LR) whilst achieving a high resolution (HR) final image by a single rendering pass on mobile GPU.

This is achieved by implicitly exploiting the MSAA functionality of the hardware. The upscaling feature can be enabled so that fragment outputs (from the pixel shader) are upscaled by a custom resolving operation, which resolves a MSAA sample color to a final pixel color inside the on-chip tile memory of the mobile GPU. As a result, access to the system memory is not required for the intermediate MSAA buffer during the upscaling.

Therefore, after all other fragment operations have been completed (including, for example, alpha blending), each of the multiple samples of a rendering pixel is custom resolved to produce a single display pixel color value, which can finally be written into the corresponding color buffer in the system memory at HR.

In one implementation, the writing of the color buffers may be deferred until a later time when all of the pixels of a tile have been rendered and upscaled and can be output to system memory in a batch corresponding to a tile.

In one particular example, the algorithm steps for 4x upscaling (corresponding to rendering using an MSAA level of 4x) are as follows. 3D rendering to a low-resolution frame buffer, with 4x MSAA turned on, is performed. Per-pixel shading is performed at low resolution, i.e. shading once for each pixel (with four samples), so that all samples covered by a primitive inside a pixel will have the same shared color.

In the preferred implementation described herein, the shading location for a pixel is at the geometrical center of all of the covered sample locations of a fragment. Figures 4(a) and 4(b) schematically illustrate rendering without and with centroid sampling respectively. In Figure 4(a), the shading location 401 for the pixel 402 is at the center of all of the samples 403, 404, 405, 406, irrespective of whether the samples are covered by primitive 407 or not. In Figure 4(b), using centroid sampling, the shading location 408 for the pixel 402 is at the center of only the covered samples 404, 406. Therefore, using centroid sampling, a pixel is shaded once, and the result is assigned only to the samples that are covered by a primitive. This can be enabled by adding a “centroid” shader qualifier to all fragment shader inputs (for varying interpolation).
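The centroid sampling rule of Figure 4(b) can be sketched in Python (illustrative only; the function name and the bitmask encoding of covered samples are assumptions):

```python
def centroid_shading_location(sample_locations, coverage_mask):
    """Centroid sampling: the shading location for a pixel is the
    geometric centre of only those sample locations whose bit is set
    in the coverage mask (i.e. the samples covered by the primitive)."""
    covered = [s for i, s in enumerate(sample_locations)
               if coverage_mask & (1 << i)]
    n = len(covered)
    return (sum(x for x, _ in covered) / n,
            sum(y for _, y in covered) / n)
```

If the mask covers all samples, this reduces to the pixel-centre case of Figure 4(a); with a partial mask, the shading location shifts towards the covered side of the pixel, keeping the interpolated attributes inside the primitive.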

As described above, custom resolving is performed using a custom resolve algorithm inside the on-chip tile buffer at the end of the tile-based rendering process, just before writing the tile buffer to the system memory (i.e., a frame buffer in DDR). Using 4x MSAA, the four samples of each pixel from the LR tile buffer are read and resolved one-by-one into 2x2 display pixels (one sample per display pixel) in the HR tile buffer, which can then be written to the HR framebuffer in the system memory for display. The result is a HR framebuffer, which is 4x larger than the LR framebuffer.

The GPU may implement an application programming interface, such as Vulkan or OpenGL ES.

When the API is OpenGL ES, the API extension GL_ARB_sample_locations (globally user-defined sample locations) may be used so that all of the sample locations can be moved to the regular grid positions inside a pixel, shown in Figure 5(b), instead of using the hardware default locations, shown in Figure 5(a). The locations shown in Figures 5(a) and 5(b) correspond to a global setting applied to all pixels of a framebuffer.

The sampling locations can therefore be changed from the default locations into the regular grid locations inside a rendering pixel. This happens during the rendering step, which is before the upscaling step.

When the Vulkan API is used, the corresponding API extension VK_EXT_sample_locations may be used similarly to move the sampling locations into the regular grid locations.

In another example, the algorithm steps for 16x upscaling are as follows. Rendering to an LR framebuffer is performed with 16x MSAA turned on. Per-pixel shading is performed at LR, shading once per pixel (with 16 samples), so that all samples covered by a primitive inside a pixel share the same color. As in the example shown in Figure 5(b), the shading location for a fragment is moved to the geometrical center of the locations of all of the samples covered by a primitive by using a “centroid” shader qualifier on all fragment shader inputs (for varying interpolation). Custom resolving is performed at the end of the tile-based rendering (before writing the tile buffer to DDR). The 16 samples of each pixel from the LR tile buffer are resolved one-by-one into 4x4 display pixels (each display pixel has only one sample) in the resulting HR tile buffer, which is then written to the HR framebuffer in DDR for display. The result is an HR framebuffer, which is 16x larger than the LR framebuffer.

The API extension GL_ARB_sample_locations for OpenGL ES and VK_EXT_sample_locations for Vulkan (globally user-defined sample locations) may be used so that the sample locations can be moved to the regular grid positions inside a pixel, shown at 501b, 502b, 503b and 504b in Figure 5(b), instead of using the default locations 501a, 502a, 503a and 504a shown in Figure 5(a).

The custom-resolving algorithm can be implemented at the granularity of a pixel at LR (each pixel containing multiple samples, and each sample may have a 4-byte RGBA color) using pseudo-code as below:

/* vec2 LR = HR / vec2(xscale, yscale); with MSAA_level = xscale * yscale; */

// MSAA_level could be 4 or 16
for (int y = 0; y < TILE_LR_height; y++)      // tile size at LR
    for (int x = 0; x < TILE_LR_width; x++)
    {
        // multiple samples in a pixel (suppose each sample has an RGBA8
        // color value, i.e. 4 bytes)
        int temp_fragment[MSAA_level];
        int sampleID = 0;
        int offsetLR = x + y * TILE_LR_width;  // pixel offset at LR (row-major)
        // each fragment has MSAA_level * 4 bytes
        memcpy(temp_fragment, pTileBufferLR + offsetLR, MSAA_level * 4);

        for (int j = 0; j < yscale; j++)
            for (int i = 0; i < xscale; i++)
            {
                int offsetHR = x * xscale + i + (y * yscale + j) * TILE_HR_width;
                pTileBufferHR[offsetHR] = temp_fragment[sampleID++];
            }
    }

Two exemplary solutions will now be described for the implementation of the custom resolve algorithm on a mobile GPU. These solutions conveniently take advantage of the on-chip tile memory of a mobile GPU. One solution is via dedicated fixed-function hardware. The other solution is via the subpass feature in the Vulkan API. Both solutions require only a single rendering pass by the GPU to perform both rendering and upscaling.

In the first solution, as illustrated in Figure 6, a hardware modification is required to resolve the LR attachment into the HR one. The dedicated fixed-function hardware implements the custom resolving at the tile buffer unit 601, which accesses only the on-chip memory for the resolving.

The tile buffer (on-chip tile memory) 601 on a mobile GPU 602 is modified as described below. To render a tile, the tile buffer module maintains two on-chip color buffers (for a tile) at two different resolutions: LR and HR. The LR buffer requires a high MSAA level (for example, 4x or 16x), while the HR buffer requires only one sample per pixel.

The tile buffer 601 performs the custom resolving at the end of the tile-based rendering pipeline (before writing the tile-buffer to DDR) as shown in Figure 6, so that each sample color at the LR buffer can be written into the HR tile buffer as a final pixel color, which will be finally written to the explicit framebuffer 603 in DDR 604 for display.

In-place resolving may be used, allocating only one copy of the tile buffer, because the two tile buffers have the same memory size even though one is at LR and the other is at HR, with different MSAA levels. For example, the LR tile buffer may have an MSAA level of 4 or 16, whereas the HR tile buffer has an MSAA level of 1, with only one sample per pixel.

For 4x upscaling, this in-place resolving can be implemented by reading in the RGBA values of one pixel’s multiple (4x) samples and then writing them to the same address inside the tile buffer as a 2x2 pixel block. Because the RGBA values of the pixel’s samples are no longer needed after they have been read in, the same memory location can be overwritten as the HR buffer.

In the second solution, the custom resolve is performed without requiring hardware modification, by using the subpass feature of the Vulkan API. In this implementation, there is no need to modify the GPU driver or hardware. For a tile-based renderer in a mobile GPU, the results of previous rendering can efficiently stay on-chip if subsequent rendering operations are at the same resolution and only need the data at the pixel location currently being rendered (accessing different pixel locations would require access to values outside the current tile, which breaks this optimization). The solution in Vulkan is to split the rendering operations of one render pass into multiple subpasses, which share the same resolution and tile arrangement, so that one subpass can access the results of previous subpasses.

In Vulkan, a render pass can therefore comprise multiple subpasses, and one subpass can access a previous subpass’s rendering results for a tile, which are still stored in the on-chip memory of a mobile GPU before being output to the system memory.

This solution takes advantage of this feature in order to implement the single pass pipeline, including both rendering and custom resolving, by using two subpasses.

It should be noted that in Vulkan, multiple sub-passes of one render pass share the same rendering resolution and cannot output to render targets that have multiple different resolutions, for example LR and HR. In order to be consistent with the Vulkan specification, both of the sub-passes’ render targets (color attachments) can be set at LR. However, only the first subpass’s render target may actually be used, while the second subpass may not output to its render target. Instead, the second subpass may output to a separate storage image at HR (for example, using the imageStore() function in the GLSL shading language) so that the system can output to an arbitrary resolution at arbitrary pixel locations.

This solution therefore compensates sub-pass rendering with a storage image as output, so that the system can output to a higher resolution than the render target’s resolution.

The algorithm steps for 4x upscaling are as follows. In the first subpass, the pixel is rendered to its render target at LR with 4x MSAA turned on. The second subpass performs the custom resolve, which renders a full-screen quad. In the fragment shader, four samples are read for each pixel from the first subpass’s render target (as input), but the resolved result (i.e., 4 output pixels) is written into a storage image at HR (i.e., using the imageStore function), instead of writing to its own render target. Therefore, the second subpass may not output anything to its own render target: although its render target is set up at LR, it may not be used. An example of the input and output is as below:

> Input: subpassLoad(gsubpassInputMS subpass0, int sample);

> Output: imageStore to a storage image, created with VK_IMAGE_USAGE_STORAGE_BIT, using this shader function: imageStore(gimage2D image, ivec2 P, gvec4 data);

In one example, the shader code of the second subpass may be as shown below:

// suppose [x, y] is the current pixel coordinate at LR, and there are
// multiple samples in each input pixel; the output is a storage image
for (int j = 0; j < yscale; j++)
    for (int i = 0; i < xscale; i++)
    {
        int sampleID = j * xscale + i;
        // subpass0 is the rendering result of the previous subpass, and is the
        // subpass input of this subpass; it is a multi-sampled texture,
        // like texture2DMS
        vec4 sampleColor = subpassLoad(subpass0, sampleID);
        imageStore(storageImage, ivec2(x * xscale + i, y * yscale + j), sampleColor);
    }

The benefit of the second exemplary solution described above is that no driver or hardware modification is required. The pipeline can also support other upscale factors, as shown in the table below, with MSAA_level = xscale * yscale, where (xscale, yscale) is the upscale value pair along the X and Y directions.

Table 1

If the upscale value pair is not {1,1}, the pixel color values are rendered to an LR intermediate render buffer with an MSAA level larger than 1x, for example, 2x, 4x, 8x or 16x. The resulting final image is at HR, which is MSAA_level times larger than LR. The solution described above using the subpass feature can also be implemented using other similar 3D APIs, such as DirectX12 or Metal. Any API having a subpass feature, whereby a first sub-pass can be used to render the pixel and a second sub-pass can be used to upscale the pixel to the target resolution, may be used. For the OpenGL ES API, the subpass solution can be similarly implemented in the GPU driver by an End-of-Tile shader (also known as a post-frame shader), which is a fragment shader that post-processes the current tile’s pixels. This can be launched immediately after the tile is rendered into the tile buffer.

In the examples described above, MSAA is used to determine a set of samples for each rendering pixel. However, supersampling anti-aliasing (SSAA) also operates on the above principles and may alternatively be used. SSAA executes the pixel shader for all of the covered samples of each pixel, and may therefore be more costly than MSAA: MSAA operates only along the geometric edges, whereas SSAA operates even inside the geometry. This results in higher visual quality, but at a higher performance cost compared to MSAA. For this reason, the use of MSAA is preferred.

Figure 7 shows a flowchart detailing an example of a method for processing an image at an image processing system, the image comprising a plurality of rendering pixels, the system being configured to divide the image into a plurality of tiles, each tile comprising a sub-set of the rendering pixels of the image and corresponding to multiple display pixels. The method comprises, in a processing pass, the following steps for at least one rendering pixel in the image. In steps 701 to 703, the rendering pixel is rendered. In step 701, the method comprises determining a set of samples, each sample having a respective location in the rendering pixel and the locations collectively having a sampling density in the rendering pixel. At step 702, the method comprises shading the rendering pixel at a first resolution to give a shading result for the rendering pixel. At step 703, the method comprises writing a shade value for each of the samples to an on-chip tile memory based on the shading result. At step 704, the rendered pixel is upscaled by storing, in the on-chip tile memory, an upscaled representation of the rendered pixel by determining, in dependence on the shade values for the set of samples, a display pixel shade value for each of a plurality of display pixels overlapping the rendering pixel at a target resolution greater than the first resolution.

Figure 8 is a schematic representation of a system 800 configured to perform the methods described herein. The system 800 may be implemented on a device, such as a laptop, tablet, smart phone, TV or any other device in which graphics data is to be processed. The system is preferably implemented by a mobile GPU. The system 800 comprises a graphics processor 801 configured to process data. For example, the processor 801 may be a GPU. Alternatively, the processor 801 may be implemented as a computer program running on a programmable device such as a GPU or a Central Processing Unit (CPU). The system 800 comprises an on-chip memory 802 which is arranged to communicate with the graphics processor 801. The system may comprise more than one processor and more than one memory. The memory may store data that is executable by the processor. The processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine-readable storage medium. The computer program may store instructions for causing the processor to perform its methods in the manner described herein. The processor may also write data to an external system memory (not shown in Figure 8).

In summary, in the GPU described herein, rendering is performed at LR using MSAA and custom resolving is performed at the tile buffer (the on-chip memory of the GPU) to result in a HR image. Pixel shading is performed at LR, shading once for each pixel (with n samples). Preferably, all samples covered by a primitive inside a pixel will have the same color. Custom resolving is performed at the tile buffer before writing the upscaled samples from the tile buffer to the system memory. Where, for example, 16x MSAA is used, 16 samples of a rendering pixel from LR are output into 4x4 pixels (one sample per pixel) in the resulting HR image. Therefore, the pixels are rendered at LR with MSAA on, and then a custom resolve operation is performed at the end of the same rendering pipeline to produce a HR (high resolution) final image, instead of using the hardware’s default behaviour for MSAA resolving in a tile-based mobile GPU.

The system is therefore configured to perform rendering (including pixel shading) and upscaling in a single rendering pass on a mobile GPU. In some implementations, this may result in significantly less memory and bandwidth consumption compared with a two-pass method, i.e., one pass for each of rendering and upscaling.

This enables pixel shading at a low resolution, with a very low shading cost, and may result in a high resolution final image without jagged object edges. This is achieved by implicitly exploiting the hardware MSAA functionality and by using the custom resolving solution at the end of the same rendering pipeline.

This may be useful for various rendering techniques where it is desirable to generate high-quality upscaled images with no jagged edges for large-screen products in an efficient and cheap manner. There may be a significantly lower fragment shading cost at LR, but an HR final image may be achieved. There is no explicit stretching required to produce the HR image from the LR image. In some implementations, there may be no jagged edges, due to rasterization with 4x or 16x MSAA. The edges of polygons (the most obvious source of aliasing in 3D graphics) may be anti-aliased.

The approach described herein may in some implementations be much more bandwidth efficient, because the tile-based rendering pipeline exploits the on-chip tile memory of a mobile GPU: the custom resolving accesses only the on-chip tile memory (instead of the system memory) for the intermediate render buffer to produce the final HR image, and only a single render pass of the processor is required, without any image stretching from LR to HR. As a result, the approach may be much more memory-bandwidth efficient, and thus more suitable for bandwidth-limited mobile GPUs than traditional methods.

The lower GPU fragment shading rate (at LR) and lower memory and bandwidth consumption may result in longer battery life for mobile devices and an improved frame rate for complex and demanding game rendering.

The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.