

Title:
RENDERING AND POST-PROCESSING FILTERING IN A SINGLE PASS
Document Type and Number:
WIPO Patent Application WO/2022/058012
Kind Code:
A1
Abstract:
Described herein is a graphics processing system (1400) configured to receive an input image comprising a plurality of pixels, the system being configured to divide the input image into a plurality of tiles, each tile comprising a subset of the pixels of the input image, the system comprising: a memory (503, 1402) configured to store a set of tiles (702, 901, 1105), the set of tiles being a subset of the plurality of tiles of the input image; and a processor (1401), wherein the processor is configured to, for at least one tile in the set (702, 901, 1105), perform a processing pass comprising: rendering the tile; filtering the tile in dependence on at least one other tile of the set (702, 901, 1105); and storing, in the memory (503, 1402), a rendered and filtered tile. This may result in less memory bandwidth consumption compared with traditional two-pass methods, with less external memory bandwidth impact.

Inventors:
LIU BAOQUAN (DE)
Application Number:
PCT/EP2020/075929
Publication Date:
March 24, 2022
Filing Date:
September 17, 2020
Assignee:
HUAWEI TECH CO LTD (CN)
LIU BAOQUAN (DE)
International Classes:
G06T1/60
Foreign References:
US20080094406A12008-04-24
US20030227462A12003-12-11
Other References:
HAIQIAN YU ET AL: "Optimizing data intensive window-based image processing on reconfigurable hardware boards", IEEE WORKSHOP ON SIGNAL PROCESSING SYSTEM DESIGN AND IMPLEMENTATION, 2 November 2005 (2005-11-02), pages 491 - 496, XP010882621
DONGJU LI ET AL: "Design Optimization of VLSI Array Processor Architecture for Window Image Processing", IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS, COMMUNICATIONS AND COMPUTER SCIENCE, vol. 82, no. 8, 1 August 1999 (1999-08-01), pages 1475 - 1484, XP055658095
CHIHOUB A ET AL: "A Band Processing Imaging Library for a TriCore-Based Digital Still Camera", REAL-TIME IMAGING, ACADEMIC PRESS LIMITED, GB, vol. 7, no. 4, 1 August 2001 (2001-08-01), pages 327 - 337, XP004419458
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
CLAIMS

1. A graphics processing system (1400) configured to receive an input image comprising a plurality of pixels, the system being configured to divide the input image into a plurality of tiles, each tile comprising a subset of the pixels of the input image, the system comprising: a memory (503, 1402) configured to store a set of tiles (702, 901, 1105), the set of tiles being a subset of the plurality of tiles of the input image; and a processor (1401), wherein the processor is configured to, for at least one tile in the set (702, 901, 1105), perform a processing pass comprising: rendering the tile; filtering the tile in dependence on at least one other tile of the set (702, 901, 1105); and storing, in the memory (503, 1402), a rendered and filtered tile.

2. The system of claim 1, wherein the memory is a high-speed cache (503), the high-speed cache being a system level cache or a static on-chip memory.

3. The system of claim 1 or claim 2, wherein the set of tiles comprises a predetermined fixed number of tiles.

4. The system of claim 3, wherein the system is configured to implement a sliding window algorithm to keep the fixed number of tiles in the set for each processing pass performed by the processor.

5. The system of any preceding claim, wherein the set of tiles comprises at least one tile that is un-filtered and at least one tile that has been rendered and filtered.

6. The system of any preceding claim, wherein the system further comprises a scheduler configured to schedule the processing of the tiles of the input image in a vertical or horizontal scanline order.

7. The system of any preceding claim, wherein the set of tiles comprises a fixed number of columns of tiles of the input image.


8. The system of claim 7, wherein the number of columns is three.

9. The system of any of claims 1 to 6, wherein the set of tiles comprises K tiles, where K is at least 2*(NTC+1) and where NTC is the number of tiles along each column of the final image.

10. The system of any preceding claim, wherein the rendering and filtering of two different tiles in the set are performed concurrently.

11. The system of any preceding claim, wherein the system is configured to store a different set of tiles for each processing pass.

12. The system of any preceding claim, wherein the processor is configured to subsequently write a rendered and filtered image to a frame buffer in a system memory (504).

13. The system of any preceding claim, wherein the system is implemented by a mobile graphics processing unit.

14. A method for processing an input image comprising a plurality of pixels at an image processing system (1400), the system being configured to receive the input image and divide the input image into a plurality of tiles, each tile comprising a subset of the pixels of the input image, the system comprising a memory (503, 1402) configured to store a set of tiles (702, 901, 1105), the set of tiles being a subset of the plurality of tiles of the input image, the method comprising, in a processing pass, for at least one tile in the set (702, 901, 1105): rendering (1301) the tile; filtering (1302) the tile in dependence on at least one other tile of the set (702, 901, 1105); and storing (1303), in the memory (503, 1402), a rendered and filtered tile.

15. A computer program which, when executed by a computer, causes the computer to perform the method of claim 14.

Description:
RENDERING AND POST-PROCESSING FILTERING IN A SINGLE PASS

FIELD OF THE INVENTION

This invention relates to rendering and post-processing filtering of images, for example for game rendering in a mobile graphics processing unit (GPU).

BACKGROUND

Post-processing filtering for game rendering involves applying a spatial filtering operation to the rendering result, with a filter footprint covering other surrounding pixels. For example, filtering may be applied to an image in a framebuffer.

It is common for game engines, for example Unreal, Unity and CryEngine, to render a 3D scene into a framebuffer and then immediately apply some post-processing filters to the rendered image to achieve special visual effects (such as a Gaussian blur filter for a bloom effect, or an image sharpening filter), or to apply an anti-aliasing filter, such as Fast Approximate Anti-Aliasing (FXAA), as described at https://en.wikipedia.org/wiki/Fast_approximate_anti-aliasing.

FXAA ignores polygons and line edges and simply analyses the pixels on the screen using a pixel shader program that runs for each pixel of the rendered frame. However, it also needs to access a local neighborhood of 3x3 surrounding pixels. If it finds pixels that create an artificial edge, it smooths them out. FXAA can smooth edges in all pixels on the screen, including those inside alpha-blended textures and those resulting from a previous pass's pixel shader effects, which are immune to Multi-Sample Anti-Aliasing (MSAA) but can be handled by FXAA.

FXAA runs very fast on a desktop GPU, usually costing only a millisecond or two. Because desktop GPUs have high-bandwidth dedicated video memory, a 3D rendering pass to generate the image followed by a separate 2D post-process pass to filter the rendered image is not a problem. However, this is generally not the case for mobile GPUs, which do not have such dedicated video memory.

These post-process filters need to access not only the currently processed pixel but also some neighboring pixels in the footprint of a local neighborhood with a filter size of 3x3, 5x5, or even bigger. Figure 1 shows a 3x3 filtering operation applied to an image, where nine pixels in a local neighborhood are read from DDR to filter the current pixel, shown at 101.

This causes a bandwidth problem for mobile GPUs, which do not have dedicated video memory, because two render passes are involved: a 3D rendering pass that outputs the intermediate framebuffer to DDR (double data rate memory), and then a post-processing pass such as FXAA (with a 3x3 pixel footprint) in which each pixel has to be read from DDR into the GPU many times to perform the filter operation.
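
For illustration only, the following CPU-side sketch models this traditional two-pass flow; it is not taken from the application, the single-channel float framebuffer and the function names are assumptions, and it simply makes the nine-reads-per-pixel cost of a 3x3 filter explicit.

    #include <cstddef>
    #include <vector>

    // Stand-in for the intermediate framebuffer written to DDR by the first pass.
    struct Image {
        int width = 0, height = 0;
        std::vector<float> pixels;                        // one channel for simplicity
        float at(int x, int y) const { return pixels[static_cast<std::size_t>(y) * width + x]; }
    };

    Image renderScene(int width, int height);             // pass 1: 3D rendering (assumed to exist)

    // Pass 2: a 3x3 box filter that reads nine neighbouring pixels from the
    // intermediate framebuffer for every output pixel.
    Image filter3x3(const Image& src) {
        Image dst{src.width, src.height, std::vector<float>(src.pixels.size())};
        for (int y = 1; y < src.height - 1; ++y) {
            for (int x = 1; x < src.width - 1; ++x) {
                float sum = 0.0f;
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx)
                        sum += src.at(x + dx, y + dy);     // nine reads per filtered pixel
                dst.pixels[static_cast<std::size_t>(y) * dst.width + x] = sum / 9.0f;
            }
        }
        return dst;
    }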

Modern mobile GPUs usually use tile-based rendering, where one tile of pixels at a time is rendered into the on-chip tile buffer memory, which is much faster than the off-chip system memory. As a result, the surrounding pixels cannot be accessed within one single render pass or its multiple sub-passes (in the terminology of the Vulkan API). A second, separate pass is therefore launched for the post-processing purpose, as shown in Figure 2.

Therefore, modern game engines generally use two passes to perform a rendering pass followed by a post-processing filtering pass, i.e., two separate render passes with DDR as their data connection. The first pass writes the rendered framebuffer to system memory, and the second pass reads it back into the GPU to perform filtering. While flexible, this approach consumes a large amount of system memory bandwidth (especially read bandwidth), which is at a premium on mobile devices, in terms of both latency and power consumption. This approach is therefore not well suited to tile-based mobile GPUs.

The OpenGL ES extension EXT_shader_pixel_local_storage cannot help either, because it only allows access to values stored at the current pixel location, and does not allow access to the surrounding pixels in a local neighborhood.

Furthermore, another difficulty for a tile-based mobile GPU is that, to filter the current pixel with a filter footprint larger than 1x1, the required surrounding pixels may lie outside of the current tile being rendered, in other tiles which may not have been rendered yet, as shown in Figure 3.

In summary, the problem of performing a 3D rendering pass (to generate the image) followed by a separate 2D post-process pass (to filter the rendered image) on a tile-based mobile GPU is that the rendered intermediate frame buffer in system memory has to be accessed many times (written once and read nine times for a 3x3 filter) to perform the filtering. This incurs a heavy read and write bandwidth cost, in terms of both latency and power consumption, when accessing the rendered intermediate framebuffer in the system memory of a System-on-Chip. In particular, the read bandwidth cost is very high for any practical filter size, for example 3x3, 5x5, 7x7, etc. This may also result in a low rendering speed in terms of frames per second.

Mobile devices require real-time rendering performance for games, with high frame rate and low latency of user interaction, and at the same time require low power consumption to extend battery life, and also low heat dissipation for comfortable user hand-holding.

It is desirable to develop a method that overcomes these problems.

SUMMARY OF THE INVENTION

According to a first aspect there is provided a graphics processing system configured to receive an input image comprising a plurality of pixels, the system being configured to divide the input image into a plurality of tiles, each tile comprising a subset of the pixels of the input image, the system comprising: a memory configured to store a set of tiles, the set of tiles being a subset of the plurality of tiles of the input image; and a processor, wherein the processor is configured to, for at least one tile in the set, perform a processing pass comprising: rendering the tile; filtering the tile in dependence on at least one other tile of the set; and storing, in the memory, a rendered and filtered tile.

The rendering pass and the filtering pass are therefore combined into a single processing pass by keeping a small fixed amount of data in the memory, i.e., a small subset of the whole frame image is stored in the memory. As a result, the intermediate framebuffer does not need to go to external system memory. This may result in less memory bandwidth consumption compared with the traditional two-pass methods, with less external memory bandwidth impact.

The processing pass may be a processing pipeline comprising the rendering and filtering stages above. The rendered and filtered tile that is output from the pipeline may be stored in the memory.

The memory may be an on-chip memory. The memory may be a high-speed cache. The high-speed cache may be a system level cache or a static on-chip memory. The memory may be separate from the external system memory. Storing a subset of tiles in a high-speed cache may reduce the bandwidth consumption when rendering and filtering tiles. The set of tiles may comprise a predetermined fixed number of tiles. The memory may store a predetermined fixed number of tiles for each processing pass.

The system may be configured to implement a sliding window algorithm to keep the fixed number of tiles in the set for each processing pass performed by the processor. The sliding window scheduling algorithm may exploit a small fixed number of pixel-tiles as a working set stored in the on-chip memory, so that only the necessary surrounding tiles (of the current tiles that are being, or are to be, filtered) are kept in the on-chip memory at a given time when applying the filtering operation to those active tiles in the set that have been rendered recently.

The set of tiles may comprise at least one tile that is un-filtered. The set of tiles may comprise at least one tile that has been rendered and filtered. The set may therefore comprise neighbouring tiles to the tile currently being filtered.

The system may further comprise a scheduler configured to schedule the processing of the tiles of the input image in a vertical or horizontal scanline order. This may be a convenient way of scheduling the tiles. A zigzagged order may also be used to schedule the small tiles inside a larger super-tile, with the scanline order then used to schedule the super-tiles. Any other combination of these scheduling orders may also be used.

The set of tiles may comprise a fixed number of columns of tiles of the input image. The number of columns may be three. This may allow the image to be scanned three columns of tiles at a time using a sliding window algorithm.

The set of tiles may comprise K tiles, where K is at least 2*(NTC+1) and where NTC is the number of tiles along each column of the final image. This may allow a sufficient number of tiles to be stored in the on-chip memory so that filtering of a current tile can be performed without the need to access DDR. K may be greater than 2*(NTC+1).

The rendering and filtering of two different tiles in the set may be performed concurrently. This may allow the image to be rendered more quickly.

The system may be configured to store a different set of tiles for each processing pass. Each different set of tiles may comprise a predetermined fixed number of tiles. The processor may be configured to subsequently write a rendered and filtered image to a frame buffer in a system memory. This may allow the final image to be displayed.

The system may be implemented by a mobile graphics processing unit. Implementing the system in a mobile device may help to achieve real-time rendering performance for games, with high frame rate and low latency of user interaction, and at the same time low power consumption to extend battery life, and low heat dissipation for comfortable user hand-holding.

According to a second aspect there is provided a method for processing an input image comprising a plurality of pixels at an image processing system, the system being configured to receive the input image and divide the input image into a plurality of tiles, each tile comprising a subset of the pixels of the input image, the system comprising a memory configured to store a set of tiles, the set of tiles being a subset of the plurality of tiles of the input image, the method comprising, in a processing pass, for at least one tile in the set: rendering the tile; filtering the tile in dependence on at least one other tile of the set; and storing, in the memory, a rendered and filtered tile.

In this method, the rendering pass and the filtering pass are combined into a single pass by keeping a small fixed amount of data in the on-chip memory. A small subset of the whole frame image is stored in the on-chip memory. As a result, the intermediate framebuffer does not need to go to external system memory. This may result in less memory bandwidth consumption compared with traditional two-pass methods, with less external memory bandwidth impact.

The processing pass may be a processing pipeline comprising the rendering and filtering stages above. The rendered and filtered tile that is output from the pipeline may be stored in the memory.

The memory may be an on-chip memory. The memory may be a high-speed cache. The high-speed cache may be a system level cache or a static on-chip memory. The memory may be separate to the external system memory. Storing a subset of tiles in a high-speed cache may reduce the bandwidth consumption when rendering and filtering tiles.

According to a third aspect there is provided a computer program which, when executed by a computer, causes the computer to perform the method described above. The computer program may be provided on a non-transitory computer readable storage medium.

BRIEF DESCRIPTION OF THE FIGURES

The present invention will now be described by way of example with reference to the accompanying drawings.

In the drawings:

Figure 1 schematically illustrates a 3x3 filtering operation.

Figure 2 schematically illustrates two passes involved in processing an image on a mobile GPU: a first pass for 3D rendering into the DDR, and a second pass for post-processing filtering which reads from the DDR.

Figure 3 schematically illustrates that, for a 2x2 filtering, surrounding pixels from neighbouring tiles are required for correct filtering.

Figure 4 shows a rendering pipeline where both the rendering pass and the filtering pass can be finished within only one single pass of the processor.

Figure 5 illustrates how a high speed cache usually does not have enough memory space to hold the whole framebuffer for a specific rendering application.

Figure 6 illustrates filtering of a tile of pixels of an image.

Figure 7 illustrates allocation of a fixed memory size of three columns of tiles as an active working set in the high speed cache, and how a sliding window algorithm is performed for these three columns of tiles (by sliding one column at a time from left to right).

Figure 8 illustrates the stages to render a 3D geometry scene to a framebuffer. A tile-based mobile GPU generally has two stages: a binning stage for geometry processing, and a rendering stage for rasterization and pixel shading.

Figures 9(a) to 9(d) illustrate the first four iterations of the rendering and filtering process using a first sliding window algorithm.

Figure 10 illustrates exemplary pseudocode for the first sliding window rendering and filtering algorithm.

Figures 11(a) to 11(d) illustrate the first four iterations of the rendering and filtering process using a second sliding window algorithm.

Figure 12 illustrates exemplary pseudocode for the second sliding window rendering and filtering algorithm.

Figure 13 shows a flowchart for an example of a method for processing an input image.

Figure 14 shows an example of a graphics processing system.

DETAILED DESCRIPTION OF THE INVENTION

Herein, rendering refers to any form of generating a visible image, for example for displaying the image on a computer screen, printing, or projecting.

In exemplary embodiments, described are two sliding window algorithms for a tile-based mobile GPU in order to combine the 3D rendering pass and the spatial filtering pass (for example, FXAA) into a single pass by keeping a small fixed amount of data in a memory, for example a high speed cache (HSC) in buffer mode.

As shown in Figure 4, both the rendering pass and post-process filtering pass may be performed within only one rendering pass along the GPU pipeline. The stages shown in the pipeline of Figure 4 are geometry data submission 401, vertex shader (VS) 402, pixel shader (PS) 403, render target 404 and post-process 405. The rendering 404 and post-processing filtering 405 are performed in a single processing pass along the GPU pipeline.

As a result, the intermediate framebuffer does not need to go to system memory. This may allow a large amount of data bandwidth to be saved, with much less external memory bandwidth impact. This can be very useful for various rendering techniques used by game engines where it is desirable to apply a post-processing filter to the rendered framebuffer.

For a traditional tile-based renderer, for example on a mobile GPU, a tile's rendering result from previous rendering operations can efficiently stay on-chip if subsequent rendering operations are within the same tile of pixels, and if only the pixels' data in the current tile being rendered is accessed. However, in traditional implementations, access to other pixel locations would require data outside of the current tile, which breaks the tile-based rendering mechanics.

In the Vulkan API, a render pass can comprise multiple subpasses, and one subpass can access a previous subpass's rendering results for a tile which are still staying in the on-chip memory of a mobile GPU before being output to the external system memory. These multiple sub-passes share the same tile arrangement, so that one subpass can access the result of a previous subpass, one tile at a time. However, access to pixels of other surrounding tiles (required by a spatial filter) is not allowed, since other tiles may have been evicted out of the on-chip memory or may not have been rendered yet.

As will be described in more detail below, in the implementations described herein, the high-speed caches (HSC) on a System-on-Chip (SoC) may be exploited to assist the filtering process, to avoid the memory bandwidth involved in accessing the intermediate framebuffer in system memory (i.e., DDR).

The HSCs on a SoC, which may be a system level cache or a static on-chip memory, are expensive and are usually very small in memory size. They are also shared by multiple applications. As a result, as shown in Figure 5, HSCs usually do not have enough memory space to hold the whole framebuffer for a specific rendering application. In Figure 5, the rendered image is shown at 501 and the filtered image at 502. The DDR is shown at 504. The HSC 503 is too small to hold one full frame.

This is because framebuffer memory sizes are becoming increasingly large due to the prevalence of higher resolution displays (for example, 4K or 8K) and HDR pixel formats (for example, 16- or 32-bit per color channel) on modern mobile devices such as smart phones and tablets. This means that a system can only afford to store a very small subset of tiles, instead of the whole image, of the framebuffer inside the HSC during the rendering process of an image.

The rendered intermediate framebuffer data is read by a filtering pixel shader or a compute shader (executing one GPU thread for every single pixel location) which accesses only a small neighborhood of surrounding pixels when processing each individual pixel and when filtering a tile of pixels, for example tile 601 of image 600 in Figure 6. Therefore, only access to the eight additional tiles surrounding the current tile is required. These tiles are shown within box 602 along with the current tile 601. The tiles at the image boundary may have some of their surrounding tiles lying outside of the image boundary. However, this is not a problem for filtering, because these outlying pixels can be clamped to the last pixel at the image edge, or simply clamped to zero.
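
As a minimal sketch of this boundary handling (an assumption for illustration, not code from the application), clamp-to-edge addressing can be expressed as follows; clamping to zero would simply return 0 for out-of-range coordinates instead.

    #include <algorithm>
    #include <cstddef>

    // Clamp a coordinate to the valid range [0, size-1], i.e. to the last pixel at the edge.
    inline int clampCoord(int v, int size) {
        return std::min(std::max(v, 0), size - 1);
    }

    // Read a neighbour pixel with clamp-to-edge behaviour from a single-channel image
    // stored row-major in 'pixels' with dimensions width x height.
    inline float readClamped(const float* pixels, int width, int height, int x, int y) {
        return pixels[static_cast<std::size_t>(clampCoord(y, height)) * width + clampCoord(x, width)];
    }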

If these eight surrounding tiles can be temporarily kept in a HSC when filtering the current tile 601 , then the two passes can be combined into one single pass without the need for the intermediate framebuffer going to system memory.

In two exemplary embodiments described herein, two sliding window solutions for tile-based mobile GPU are implemented in order to combine the 3D rendering pass and the spatial filtering pass into a single pass by keeping a small fixed amount of tile-data in the HSC using buffer mode.

For a framebuffer, a fixed small number of tiles are maintained as a working set (WS) in the HSC at any one time. The working set of tiles comprises a predetermined fixed number of tiles.

A sliding window algorithm is implemented to keep a fixed number of active tiles in the WS, in which there is at least one tile (there may be more than one such tile) that is ready to be filtered (i.e. that has already been rendered) with the surrounding tiles also available in the WS. The WS of tiles therefore comprises at least one tile that is un-filtered and at least one tile that has been rendered and filtered. The system is configured to store a different set of tiles for each processing pass.

The GPU hardware schedules a tile-based rendering order (one tile after the other) in a vertical or horizontal scanline order. As will be described in more detail below, the system may use a scheduling algorithm to allow the two jobs (rendering and filtering) to be scheduled to run on a GPU in a cooperating and synchronized way so that a minimum number of tiles is stored in the WS.

In the following description, for simplicity it will be assumed that the GPU hardware renders tiles in vertical scanline order. However, tiles may also be rendered in a horizontal scanline order.

A first exemplary sliding window solution will now be described.

In this example, a fixed memory size of three columns of tiles is allocated as an active WS in the HSC. A sliding window algorithm is applied to these three columns of tiles by sliding one column at a time from left to right. In the preferred embodiment, the tiles in a column are rendered and filtered in a vertical scanline order.

The number of tiles in each column of the image is NTC = ⌈imageHT / TileSize⌉, and the number of tiles in each row of the image is NTR = ⌈imageWD / TileSize⌉.

As shown in Figure 7, the light grey tiles indicated at 701 are the tiles that have already been rendered and filtered, and have been evicted out of the WS. The darker tiles shown at 702 are the three columns of tiles that are currently stored in the WS, in which the middle column can be filtered because all of its surrounding tiles are already available in the WS. The white tiles indicated at 703 are the tiles that have not been rendered yet and are not in the WS at present.

In a preferred embodiment, a ring buffer is used to manage the memory of the WS in the HSC. The ring buffer includes three slots (i.e., slot0, slot1, slot2). Each slot stores one column of tiles, i.e., NTC tiles.

If a tile has 32x32=1024 pixels, i.e., TileSize=32 pixels, then its memory size is 4 KB in RGBA8 format. If imageHT=1080p, then NTC = ⌈1080/32⌉ = 34, and the memory size of the WS is 3*34*4KB = 408 KB. If imageHT=720p and TileSize=32 pixels, NTC = ⌈720/32⌉ = 23, and the memory size of the WS is 3*23*4KB = 276 KB.
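
The same sizing can be reproduced with a few lines of code (an illustrative sketch, not part of the application; the helper and its names are assumptions):

    #include <cstdio>
    #include <initializer_list>

    int main() {
        const int tileSize   = 32;                                   // 32x32 pixels per tile
        const int bytesPerPx = 4;                                    // RGBA8
        const int tileBytes  = tileSize * tileSize * bytesPerPx;     // 4 KB per tile

        for (int imageHT : {1080, 720}) {
            const int NTC     = (imageHT + tileSize - 1) / tileSize; // ceil(imageHT / tileSize)
            const int wsBytes = 3 * NTC * tileBytes;                 // three columns of tiles
            std::printf("imageHT=%d: NTC=%d, WS=%d KB\n", imageHT, NTC, wsBytes / 1024);
        }
        return 0;   // prints NTC=34, WS=408 KB and NTC=23, WS=276 KB
    }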

To render the 3D geometry scene to a framebuffer, a tile-based mobile GPU usually has two stages, as shown in Figure 8: a binning stage for geometry processing, shown generally at 801, and a rendering stage for rasterization and pixel shading, shown generally at 802. Following the rendering stage of a tile, a filtering job is inserted (for example, by launching a compute shader or a fragment shader for each tile) after the tile is rendered.

Conveniently, a scheduling algorithm can be implemented for the rendering stage to allow the two jobs (rendering and then filtering for each tile) to be scheduled on a GPU in a synchronized way. This allows rendering and filtering of each tile in a certain order using a sliding window algorithm, so that a minimum number of tiles is stored in the WS and, at the same time, the surrounding tiles of the tile being filtered are available in the WS.

In one example, the scheduling algorithm has two stages as follows. In the initialization stage, the first three columns of tiles of the frame are rendered and each of these rendered tiles is stored into the three slots of the WS. Then, filtering is performed only for the first two columns of the tiles.

In the iteration stage, at each iteration step, the following algorithmic steps are performed by sliding forward one column of tiles at a time along the scanline direction, which will move the three columns in the WS forwards towards the right hand side of the image.

In a first step, the leftmost column (i.e., slot0 = (slot0+1) % 3) is released from the WS memory, and the WS (with three columns) slides towards the right hand side to render a new wavefront column of tiles and store these newly rendered tiles into slot2 of the WS.

In a second step, filtering is performed only for the middle column (slot1) of tiles in the WS and the result is stored to DDR, because all of the surrounding tiles (of the middle column) are now available in the WS.

If the last column is reached, it is filtered and the result stored to DDR and then returned, or else the first iteration step resumes.
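
The following host-side sketch summarises this schedule. It is an illustration under assumptions (it is not the pseudocode of Figure 10; renderColumn and filterColumn are hypothetical stand-ins for the GPU rendering and filtering jobs, and NTR is assumed to be greater than three):

    void renderColumn(int column, int slot);   // render one column of tiles into a WS slot (assumed)
    void filterColumn(int column, int slot);   // filter one column of tiles and write it to DDR (assumed)

    // First sliding-window schedule: serial, interleaved rendering and filtering,
    // sliding one column of tiles at a time from left to right.
    void renderAndFilterFrame(int NTR) {       // NTR = number of tile columns in the image
        int slot0 = 0, slot1 = 1, slot2 = 2;   // ring-buffer slots, one column of tiles each

        // Initialization: render the first three columns, then filter the first two.
        for (int col = 0; col < 3; ++col) renderColumn(col, col);
        filterColumn(0, slot0);
        filterColumn(1, slot1);

        // Iteration: slide forward one column at a time along the scanline direction.
        for (int wavefront = 3; wavefront < NTR; ++wavefront) {
            slot0 = (slot0 + 1) % 3;            // release the leftmost column from the WS
            slot1 = (slot1 + 1) % 3;
            slot2 = (slot2 + 1) % 3;
            renderColumn(wavefront, slot2);     // render the new wavefront column into slot2
            filterColumn(wavefront - 1, slot1); // filter the middle column, now fully surrounded
        }
        filterColumn(NTR - 1, slot2);           // finally filter the last column of the image
    }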

The first four iterations of the process are shown in Figures 9(a)-9(d). Figure 9(a) shows the WS state after the initialization stage and Figures 9(b)-(d) show the next three consecutive iterations. The three columns of tiles of the WS are shown in dark grey at 901, 903, 906 and 909 for each iteration. These are the tiles being rendered or filtered currently in each iteration. The light grey tiles 904, 907 and 910 are the finished tiles (not currently in the WS), and the white tiles 902, 905, 908 and 911 have not yet been processed (also not currently in the WS).

The screen space offset in pixels (along the scanline direction) of the WS after each sliding-window iteration is: x_offsetWS = (wavefront - 2) * TileSize, where wavefront is the tile-index of the rightmost column in the WS.

An example of the algorithm's pseudocode is shown in Figure 10.

The filtering job can be launched at the granularity of one column of tiles for the middle column stored in slot1 of the WS, for example 1080x32 pixels launched together using one shader. The filtering job could be defined by a fragment shader or compute shader, which is scheduled to run after the rendering job of the wavefront column, i.e., slot2, has finished.

The rendering job (defined by a fragment shader) is still launched at tile granularity, i.e., one tile after the other along the vertical direction within the wavefront column, and the result is stored into slot2 of the WS.

After all the tiles of the wavefront column are rendered, a filtering job can be inserted for the middle column so that the rendering job and the filtering job are scheduled in an interleaved pattern, one column after the other.

The GPU hardware (HW) may require modification to achieve such interleaved scheduling of the rendering job and filtering job at each iteration of the sliding window algorithm. After the whole column (i.e., the wavefront column) of tiles is rendered and stored into slot2 of the WS, a filtering job is launched for the whole column of tiles stored in slot1 of the WS. This scheduling can be implemented by the GPU HW (more efficiently than by the driver) by inserting a filtering job whenever a new wavefront column of tiles has been rendered.

A circular ring buffer can be used to manage memory of the WS in the HSC. The HSC is preferably used in buffer mode, which can guarantee that the preset ring buffer (with 3* NTC tiles of memory size) will never be evicted out of the HSC, i.e., every bit of the WS will stay in the HSC at all times during the rendering process.

The ring buffer includes three slots, each of which stores one column of tiles, i.e., NTC tiles. If ImageHT=1080p and TileSize=32 pixels, NTC = ⌈1080/32⌉ = 34.

After each sliding window iteration, the three columns of tiles in the WS move towards the end of the image along the scanline direction by one column, and at the same time the three slots will move circularly by one slot within the HSC as below: slot0 = (slot0 + 1) % 3; slot1 = (slot1 + 1) % 3; slot2 = (slot2 + 1) % 3;

Access to neighbouring pixels in neighbouring tiles may be performed as follows. When performing the filtering job for the middle column of tiles stored in WS, it is easy to calculate the memory address (in WS) of the surrounding tiles and pixels covered by the footprint of a filter kernel applied to the current pixel. Because the WS has only three columns, a required neighbouring tile may first be found using the tile-index in a WS slot, and the required pixels within it may then be found easily using the intra-tile pixel-offsets.

Developers may write their custom filtering shaders using this shader instruction: vec4 textureOffset(gsampler2D samplerWS_in_HSC, vec2 pos, ivec2 offset), where pos is the current pixel location to be filtered; offset is the integer offset to the current pixel (e.g., the offset along X and Y could be [-2, -1, 0, 1, 2] for a filter with a diameter of 5 pixels, i.e., a 5x5 filter); and samplerWS_in_HSC is the working set in the HSC (in buffer mode) with three columns of tiles.

The compiler calculates the address (in the WS) to load the neighbouring pixels from the HSC. It is possible to calculate the buffer’s offset address in HSC by using the following three levels of indexing:

• slot-index (i.e., [0, 1, 2] of the ring buffer): with each slot pointing to a certain column (of tiles) in the HSC;

• tile-row-index: pointing to a tile (along the vertical direction) within a certain slot (each slot holds one column of NTC tiles);

• intra-tile offset: i.e. the XY offset within a tile.
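
A compact sketch of this three-level address computation is given below (an assumption for illustration, not the compiler's actual code generation; the RGBA8 format and tile size follow the example values above):

    #include <cstddef>

    constexpr int kTileSize   = 32;                                   // pixels per tile edge
    constexpr int kBytesPerPx = 4;                                    // RGBA8
    constexpr int kTileBytes  = kTileSize * kTileSize * kBytesPerPx;  // 4 KB per tile

    // Byte offset of a pixel inside the working-set ring buffer in the HSC,
    // built from the slot index, the tile-row index and the intra-tile XY offset.
    std::size_t wsByteOffset(int slotIndex,    // 0..2: which column slot of the ring buffer
                             int tileRow,      // 0..NTC-1: tile index along that column
                             int px, int py,   // intra-tile pixel offset, 0..kTileSize-1
                             int NTC) {        // number of tiles per column
        const std::size_t slotBase = static_cast<std::size_t>(slotIndex) * NTC * kTileBytes;
        const std::size_t tileBase = static_cast<std::size_t>(tileRow) * kTileBytes;
        const std::size_t pixel    = (static_cast<std::size_t>(py) * kTileSize + px) * kBytesPerPx;
        return slotBase + tileBase + pixel;
    }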

In some implementations, GPU driver modifications may be required. An API extension may be provided for applications to use this GPU feature (i.e., rendering and filtering combined in a single pass). The developers then only need to provide a customized filtering shader. Alternatively, the GPU driver may manage everything directly, including providing the filtering shader, for some specific post-processing, such as FXAA. The developers then only need to enable or disable this feature to apply one of these very common post-processing filters to the rendering result.

In the first exemplary sliding window solution, the memory addressing of a tile in the WS is easy to calculate, because of the alignment of the three columns of tiles stored in the HSC. The HW scheduler of the two jobs (rendering and filtering) is very simple to implement via an interleaved mode. The two jobs are scheduled by the HW in a serial and interleaved manner. As a result, one job waits for the other before sliding forward to the next column. In some implementations, this kind of waiting may introduce pipeline bubbles for the rendering job.

An alternative second sliding window solution is proposed to solve the pipeline bubble problem that may be encountered in some implementations due to the interleaved scheduling of two jobs which have to wait for each other.

In this alternative implementation, a more complex scheduling algorithm is used which can allow the two jobs (rendering and filtering) to be scheduled in a concurrent way on a GPU, instead of in a serial and interleaved way, by using binary semaphore mechanics to synchronize the two jobs so that the rendering job does not need to wait for the filtering job. In some implementations, some HW modification may be used to achieve the described binary semaphore signal-sending mechanics, as will be described in more detail below.

In the second sliding window implementation, a memory size of a fixed number of tiles is still allocated for the WS in the HSC. Instead of three columns of tiles, K tiles are maintained in the HSC (in buffer mode) at all times, where K is a pre-set value that is preferably at least 2*(NTC+1), where NTC is the number of tiles along each column of the final image.
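
For a concrete sense of scale (an illustrative calculation using the example values above, not a figure from the application):

    const int tileSize  = 32;
    const int tileBytes = tileSize * tileSize * 4;            // 4 KB per RGBA8 tile
    const int NTC       = (1080 + tileSize - 1) / tileSize;   // 34 tiles per column at 1080p
    const int K         = 2 * (NTC + 1) + 1;                  // 71 tiles (the preferred minimum, see below)
    const int ringBytes = K * tileBytes;                      // about 284 KB kept in the HSC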

Preferably, a ring buffer is used to manage memory of these K tiles in order to store the WS. The ring buffer manages only K slots, where each slot stores only one tile of pixels. The WS can grow larger as more tiles are being added to it during the sliding window iterations, but it should not be larger than K tiles, otherwise the ring buffer will overflow.

A sliding window algorithm, sliding one tile at a time along the vertical scanline order, can be used to schedule both the rendering job and the filtering job. In this implementation, only one tile is slid at each iteration step, instead of sliding one column of tiles at each iteration, as found in the first example.

Figures 11(a)-11(d) show four consecutive iteration steps, where the WS has been allocated memory in the HSC with K tiles of memory size. The tiles 1101, 1102, 1103 and 1104 (hereinafter referred to as the “current tile” for each iteration) and the surrounding grey tiles within boxes 1105, 1106, 1107 and 1108 are currently in the WS and are being processed (rendered or filtered) at the current iteration step, the light grey tiles to the left of the WS are the finished tiles (not currently in the WS), and the white tiles to the right of the WS have not been touched yet and will be slid into the WS in future iteration steps (they are also not currently in the WS).

Three indices of tile-slots are traced in the ring buffer at each iteration step of the second sliding window algorithm, as below:

slot0: pointing to the wavefront tile, i.e., the first grey tile in the WS. It is also the most recently rendered tile, which moves forward in vertical scanline order.

slot1: pointing to the “current” tile (shown in dark grey), which is the only tile that is ready to be filtered.

slot2: pointing to the tail tile, i.e., the last grey tile in the WS.

Whenever a new wavefront tile (i.e., slot0) finishes its rendering job, a check can be performed to determine whether the current tile (i.e., slot1) can start to perform its filtering job. For example, it may progress to filtering when its lagging-behind distance to the wavefront tile is larger than (NTC+1). Otherwise, the current tile's filtering job waits before sliding forward to the next tile. This ensures that the current tile has all its surrounding tiles available in the WS before performing its filtering job.

The second sliding window algorithm is therefore a more complex HW scheduling algorithm which allows the two jobs (rendering and filtering) to be scheduled in a concurrent style, instead of a serial style, by using binary semaphore signalling. The rendering job can send a signal to the filtering job to indicate that it is safe (because all of the tiles required for filtering are present in the WS) for the filtering job to perform filtering for the current tile (i.e., slot1 in the WS). At the same time, the rendering job can keep moving forward at its own rendering rhythm (by sliding forward to the next tile, to render one tile after the other in the vertical scanline order) without needing to stop and wait for the filtering job. Also at the same time, the filtering job waits for a new semaphore signal from the rendering job which indicates that it is safe to filter the current tile. After that, it may perform the filtering operation and then slide forward to the next tile in slot1.

Note that the rendering job (at the wavefront tile) does not need to wait, while the filtering job does need to wait for a certain lagging-behind distance (from it to the wavefront tile) before it can slide forward.

After the current tile finishes its filtering and slides to the next tile, a check can be performed to determine whether the tail tile's distance to the current tile is larger than (NTC+1). If so, the tail tile can be evicted out of the WS and its slot-index can slide forward to point to the next tile.

Generally, rendering a tile is slower than filtering a tile, due to the complex shading equations used by game engines (such as physically based rendering). Therefore, in this example, the number of valid tiles in the WS (which may generally be equal to slot0-slot2+1, i.e., the tile-index distance between the wavefront tile and the tail tile) is no larger than K. In one implementation, K is the pre-set allocated memory size for the ring buffer. In this example, if the above condition is not true (i.e., if the number of tiles in the WS is greater than K), the ring buffer may be full and could overflow (i.e., when (slot0+1) is equal to slot2). If this happens, the pre-set value of K may be too small, and additional memory may need to be allocated in the HSC or even in system memory in order to hold a bigger ring buffer. In order to avoid using additional memory, the GPU hardware scheduler may allocate more computing units (shader cores) to the filtering job to make it run faster, so that it can catch up with the rendering job's rhythm and the lagging-behind distance (between slot0 and slot2) will not exceed the pre-set K, and therefore the ring buffer will not overflow. To avoid this, K should preferably be larger than 2*(NTC+1): slot0 - slot1 is usually equal to NTC+1, and slot1 - slot2 is usually also equal to NTC+1, as shown at 1107 and 1108 of Figure 11. Therefore, K should preferably be at least 2*(NTC+1)+1 to avoid the ring buffer becoming full and overflowing.

Exemplary pseudocode for the second algorithm is shown in Figure 12, where the two jobs (rendering and filtering) run concurrently on the same GPU and are scheduled by hardware in a synchronized way by using the binary semaphore signaling mechanics.

Using such semaphore mechanics, the rendering job sends a semaphore signal to the filtering job whenever the slot-index distance (in the ring buffer) between the wavefront tile (slot0) and the current tile (slot1) is larger than (NTC+1). At the same time, the filtering job waits for a new semaphore signal from the rendering job which indicates that it is safe to filter the current tile (because all the required neighboring tiles are present in the WS); after receiving the signal, the current tile can be filtered and the system then slides forward to the next tile for filtering.
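
A minimal CPU-thread sketch of this concurrent schedule is shown below. It is an assumption made purely for illustration (it is not the Figure 12 pseudocode or the actual HW scheduler): renderTile and filterTile are hypothetical stand-ins for the GPU jobs, tiles are indexed in vertical scanline order, and a C++20 counting semaphore plays the role of the binary semaphore signalling described above.

    #include <semaphore>
    #include <thread>

    void renderTile(int tileIndex, int slot);   // GPU rendering job for one tile (assumed)
    void filterTile(int tileIndex, int slot);   // GPU filtering job for one tile (assumed)

    // Second sliding-window schedule: rendering and filtering run concurrently and are
    // synchronised so that a tile is only filtered once all of its neighbours are in the WS.
    void renderAndFilterFrame(int NTC, int NTR, int K) {
        const int totalTiles = NTC * NTR;                  // tiles in vertical scanline order
        std::counting_semaphore<> readyToFilter(0);        // rendering -> filtering signals

        std::thread renderJob([&] {
            for (int wavefront = 0; wavefront < totalTiles; ++wavefront) {
                renderTile(wavefront, wavefront % K);      // slot0 chases the wavefront tile
                // Once the wavefront is NTC+1 tiles ahead, the lagging tile has all of its
                // neighbours rendered, so signal the filtering job; no waiting is needed here.
                if (wavefront >= NTC + 1) readyToFilter.release();
            }
            // Release the last NTC+1 tiles at the end of the frame; their missing
            // neighbours lie outside the image and are clamped at the boundary.
            for (int i = 0; i < NTC + 1; ++i) readyToFilter.release();
        });

        std::thread filterJob([&] {
            for (int current = 0; current < totalTiles; ++current) {   // slot1
                readyToFilter.acquire();                   // wait until it is safe to filter
                filterTile(current, current % K);
                // The tail tile (slot2 = current - (NTC + 1)) can now be evicted from the WS.
            }
        });

        renderJob.join();
        filterJob.join();
    }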

Therefore, in the second exemplary solution, a fixed, pre-set number of tiles K, where K should preferably be at least 2*(NTC+1), is maintained as the ring buffer in the HSC to store all the tiles in the WS. Memory alignment in the HSC is still easy, as a pool of K tiles is maintained in the HSC, so memory is aligned at the granularity of a tile's memory size. The ring buffer's memory size is small and fixed: at most K tiles are stored in the HSC at any time.

It is easy to address the tiles in the WS and access neighboring pixels in neighboring tiles for filtering. When performing the filtering job for the current tile, it is easy to calculate the memory address (in the WS) of the surrounding tiles and pixels covered by the footprint of a filtering operation applied to the current pixel. The WS's memory is managed using a ring buffer with K tiles, where each tile has a slot-index in the WS. Therefore, a required neighboring tile can first be found by using the slot-index distance (in the WS) between this neighboring tile and the current tile, and then the required pixels within this tile can be found using the intra-tile pixel-offsets.

For example, the slot-indexes of the eight neighboring tiles of a current tile (i.e., slot1) in the WS can be calculated as below:

o its top-left neighboring tile's slot-index is slot1-(NTC+1);

o its left neighboring tile's slot-index is slot1-NTC;

o its bottom-left neighboring tile's slot-index is slot1-(NTC-1);

o its top neighboring tile's slot-index is slot1-1;

o its bottom neighboring tile's slot-index is slot1+1;

o its top-right neighboring tile's slot-index is slot1+(NTC-1);

o its right neighboring tile's slot-index is slot1+NTC;

o its bottom-right neighboring tile's slot-index is slot1+(NTC+1).
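
A small helper capturing these relationships is sketched below (an assumption for illustration; the wrap-around keeps indices inside the K-slot ring buffer):

    struct NeighbourSlots {
        int topLeft, left, bottomLeft, top, bottom, topRight, right, bottomRight;
    };

    // Slot indices of the eight neighbours of the current tile (slot1) in the K-slot ring buffer.
    NeighbourSlots neighbourSlots(int slot1, int NTC, int K) {
        auto wrap = [K](int s) { return ((s % K) + K) % K; };   // wrap into [0, K-1]
        return {
            wrap(slot1 - (NTC + 1)),   // top-left
            wrap(slot1 - NTC),         // left
            wrap(slot1 - (NTC - 1)),   // bottom-left
            wrap(slot1 - 1),           // top
            wrap(slot1 + 1),           // bottom
            wrap(slot1 + (NTC - 1)),   // top-right
            wrap(slot1 + NTC),         // right
            wrap(slot1 + (NTC + 1)),   // bottom-right
        };
    }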

The HW scheduler of the two jobs (rendering and filtering) allows the two jobs to be scheduled in a concurrent way. As a result, the rendering job can keep moving forward in its own rhythm, by sliding forward a tile at a time, without the need to wait for the filtering job. Thus, this will not introduce rendering pipeline bubbles, such that the rendering job can keep running continuously for one tile after the other without stopping.

The filtering job may, for example, be a fragment shader or a compute shader. Both are possible because the filtering job operates at the granularity of a tile.

The main benefits of the rendering solutions described herein are as follows.

The intermediate framebuffer (after rendering) does not need to go to system memory. This may save read bandwidth by up to 49x for a 7x7 filter and save 1x write bandwidth of the whole frame.

For a 3x3 filter, the approximate bandwidth (BW) saving is as below:

■ Saving in read BW: up to 9x 1080x1920 pixels per frame;

■ Saving in write BW: 1x 1080x1920 pixels per frame;

■ At 60 FPS, the BW saving is up to 10x2Mx60 pixels per second, i.e., 1200M x 8 Byte ≈ 9.6 GB/s for the RGBA16F pixel format. This is a huge bandwidth saving.

The filter kernel can be as large as (2*TileSize+1) x (2*TileSize+1). For example, the filter size may be from 3x3 to 65x65 if TileSize=32.
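
The arithmetic can be checked with a short calculation (an illustrative sketch using the figures above; the pixel count assumes a 1080x1920 frame):

    // Back-of-the-envelope bandwidth saving for a 3x3 filter at 1080p, 60 FPS, RGBA16F.
    const long long pixelsPerFrame = 1080LL * 1920;        // ~2.07 Mpixels per frame
    const long long accessesSaved  = 9 + 1;                // 9 reads + 1 write per pixel avoided
    const long long bytesPerPixel  = 8;                    // RGBA16F = 4 x 16-bit channels
    const long long bytesPerSecond = pixelsPerFrame * accessesSaved * bytesPerPixel * 60;
    // bytesPerSecond is roughly 9.95e9, i.e. about 10 GB/s of DDR traffic avoided
    // (the 2M-pixel approximation used in the text gives 9.6 GB/s).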

After rendering a frame, many different post-processing filters can be applied to the rendered framebuffer by game engines. The solutions described herein can support at least the following post-processing filters without the need to store the intermediate framebuffer to external system memory: Nvidia's FXAA: 3x3; bloom effect filter: 7x7; Gaussian blur filter: 3x3, 5x5, 7x7; super-resolution filter: 7x7; AMD's CAS filter: 3x3 and 4x4; chromatic aberration; depth of field: applying a blur effect based on distance to the focal point; motion blur: blurring objects based on their motion using a variable blur size; bicubic filtering: 4x4; and many more spatial filters, or any other post-processes applied to a rendered framebuffer that use neighboring pixels' values to calculate a new value for the current pixel.

Instead of the tile-based scheduling of the two GPU jobs (rendering and filtering), other granularities of pixel-blocks (super-tiles, or a hierarchy of tiles) may also be used as the unit for job scheduling, and such a granularity can also be used for ring-buffer memory slot management.

Other similar sliding window algorithms may also be used which exploit a small fixed number of pixel tiles as the WS stored in HSC, so that only the necessary surrounding tiles (of some current tiles that are to be filtered) are cached in HSC at a certain time when applying the filtering operation to these currently active tiles (in the WS) that have been rendered recently.

Other tile scheduling orders may be utilized. As described above, vertical and horizontal scanline orders may be used. A zigzagged order may also be used to schedule the small tiles (for example, four tiles) inside a big super-tile, and then use the scanline order to schedule the big-super tiles. Any other possible combinations of these different scheduling orders may also be used.

Figure 13 shows a flowchart detailing an example of a method for processing an input image comprising a plurality of pixels at an image processing system, the system being configured to receive the input image and divide the input image into a plurality of tiles, each tile comprising a subset of the pixels of the input image, the system comprising a memory configured to store a set of tiles, the set of tiles being a subset of the plurality of tiles of the input image. The method comprises, in a processing pass, for at least one tile in the set, performing steps 1301-1303 as follows. At step 1301, the method comprises rendering the tile. At step 1302, the method comprises filtering the tile in dependence on at least one other tile of the set. At step 1303, the method comprises storing, in the memory, a rendered and filtered tile.

Figure 14 is a schematic representation of a system 1400 configured to perform the methods described herein. The system 1400 may be implemented on a device, such as a laptop, tablet, smart phone, TV or any other device in which graphics data is to be processed. The system is preferably implemented by a mobile GPU.

The system 1400 comprises a graphics processor 1401 configured to process data. For example, the processor 1401 may be a GPU. Alternatively, the processor 1401 may be implemented as a computer program running on a programmable device such as a GPU or a Central Processing Unit (CPU). The system 1400 comprises an on-chip memory 1402 which is arranged to communicate with the graphics processor 1401. The system may comprise more than one processor and more than one memory. The memory may store data that is executable by the processor. The processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine readable storage medium. The computer program may store instructions for causing the processor to perform its methods in the manner described herein. The processor may also write data to an external system memory (not shown in Figure 14).

In summary, a rendering pipeline for a mobile GPU is described above which implements rendering and post-processing filtering within a single pass via the HSC for a tile-based mobile GPU.

The 3D rendering pass and the spatial filtering pass (for example, FXAA) are combined into a single pass by keeping a small fixed amount of data on a high speed cache (for example, a system level cache or a static on-chip memory) in buffer mode. The method utilizes the advantage of the HSC memory on a mobile GPU to store a small subset of the whole frame image. As a result, the intermediate framebuffer does not need to go to external system memory. This may result in less memory bandwidth consumption compared with the traditional two-pass methods, with less external memory bandwidth impact.

The described sliding window scheduling algorithms exploit a small fixed number of pixel tiles as a working set stored in the HSC, so that only the necessary surrounding tiles (of the tiles that are being, or are to be, filtered) are kept in the HSC at a given time when applying the filtering operation to the current tiles in the WS that have been recently rendered. As described above, in the first described example, three columns of tiles are maintained in the HSC as a ring buffer. The rendering job and the filtering job run in serial mode, sliding forward one new column of tiles at a time. The scheduling granularity is one column of tiles at each iteration step. In the second described example, K tiles are maintained (where K is at least 2*(NTC+1)) in the HSC as a ring buffer. Tiles are rendered one by one, for example in a vertical scanline order, with no need to wait. Three slot indices are tracked to chase the wavefront tile, the current tile, and the tail tile, respectively. Semaphore signalling mechanics are used to synchronize the rendering job and the filtering job, so that the rendering job does not need to wait for the filtering job. The scheduling granularity is one tile at each iteration step.

The proposed solutions require only one single rendering pass of the processor to finish both rendering and post-process filtering to a framebuffer. The solutions can be extended to support many different post-processes that are applied to the rendered framebuffer and that use some neighboring pixels' information to calculate a new value for the current pixel (such as FXAA), or use traditional spatial filters (such as a Gaussian blur filter) with a kernel size as large as (2*TileSize+1) x (2*TileSize+1), meaning the filter size can be from 3x3 to 65x65 if TileSize=32.

Specific API extensions (for example, Vulkan and GLES) for very common filters, such as FXAA, can be implemented in the GPU driver directly so that the user can simply enable the feature and the driver will set up everything else involved. A general extension can be used for all other user-defined filtering operations. The user can provide their own customized filtering shader, which can be compiled and used by the GPU for the filtering job.

The described system may therefore achieve the objective of reducing the memory bandwidth (read and write) consumed in accessing the intermediate framebuffer in system memory, by combining the 3D rendering pass and post-processing filtering pass into a single render pass along the graphics pipeline on the mobile GPU, using sliding window algorithms which only need to store a small fixed amount of data (a subset of the whole frame) in a HSC in buffer mode. The approach can reduce the amount of memory bandwidth required by avoiding the intermediate framebuffer going to system memory. This may result in a faster frame rate and less power consumption. There is no need to perform the cumbersome and redundant copying, addressing, and storing of each tile's edge-pixels and corner-pixels to its eight neighbor tiles. DRAM page breaking is not a problem, as the HW's special port for writing a tile to DRAM can be exploited after a tile is filtered and needs to be written to DDR.

The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.