
Patent Searching and Data


Title:
INSTRUCTION PROCESSING
Document Type and Number:
WIPO Patent Application WO/2019/206398
Kind Code:
A1
Abstract:
A hardware unit for forming descriptors for acting as instructions to a processing unit, each descriptor comprising multiple fields, the hardware unit being configured to receive data defining (i) the addresses of memory locations storing descriptor content and (ii) a set of inherited fields, and the hardware unit being configured to form a new descriptor by (i) populating the inherited fields of the new descriptor by copying data from a previously formed descriptor and (ii) populating other fields of the new descriptor by copying data from memory locations at the defined addresses.

Inventors:
PRASAD VATSALYA (DE)
Application Number:
PCT/EP2018/060384
Publication Date:
October 31, 2019
Filing Date:
April 23, 2018
Assignee:
HUAWEI TECH CO LTD (CN)
PRASAD VATSALYA (DE)
International Classes:
G06T1/60; G06F9/54
Foreign References:
US20140347371A12014-11-27
Other References:
TREVOR HAMMOCK: "In Depth Overview of Descriptors and How to Organize Them", CS 419V / 519V -- VULKAN, 7 March 2018 (2018-03-07), XP055546306, Retrieved from the Internet [retrieved on 20190123]
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
CLAIMS

1. A hardware unit for forming descriptors for acting as instructions to a processing unit, each descriptor comprising multiple fields, the hardware unit being configured to receive data defining (i) the addresses of memory locations storing descriptor content and (ii) a set of inherited fields, and the hardware unit being configured to form a new descriptor by (i) populating the inherited fields of the new descriptor by copying data from a previously formed descriptor and (ii) populating other fields of the new descriptor by copying data from memory locations at the defined addresses.

2. The hardware unit of claim 1, wherein the hardware unit is configured to store the new descriptor.

3. The hardware unit of claim 1 or claim 2, wherein the hardware unit is configured to populate the inherited fields of the new descriptor from the fields of a descriptor previously stored by the hardware unit.

4. The hardware unit of any preceding claim, wherein the data defining the addresses of memory locations storing descriptor content comprises a base address and a first data structure comprising offset values corresponding to each descriptor field stored in the memory, and wherein the hardware unit is configured to determine each said address by determining an address offset from the base address by a respective offset value in the first data structure.

5. The hardware unit of any preceding claim, wherein the hardware unit is configured to receive data defining the set of inherited fields comprising a second data structure indicating which fields are to be populated by copying data from a previously formed descriptor.

6. The hardware unit of claim 5, wherein the second data structure comprises a mask, wherein each field of the mask corresponds to a respective field of the descriptor and indicates whether that field of the descriptor is an inherited field.

7. The hardware unit of any preceding claim, wherein the hardware unit is configured to provide the new descriptor to the processing unit.

8. The hardware unit of any preceding claim, wherein the hardware unit is configured to provide the new descriptor to a graphics processing unit as the processing unit.

9. The hardware unit of any preceding claim, wherein the instructions are one of shader, vertex, pixel, fragment and compute.

10. The hardware unit of any preceding claim, wherein the descriptor is a graphics processing unit job descriptor.

11. A hardware unit as claimed in any preceding claim, wherein the hardware unit shares with a driver memory containing the memory locations.

12. A hardware unit as claimed in claim 11, wherein:

the format of the descriptors is such that a first subset of the fields of the descriptors are of a fixed length and a second subset of the fields of the descriptors are of a variable length; and

the driver is configured to store data for populating the fields of the first subset in a first common area of the memory and to store data for populating the fields of the second subset in a second common area of the memory separate from the first common memory area.

13. A method for forming descriptors for acting as instructions to a processing unit, each descriptor comprising multiple fields, the method comprising:

receiving data defining (i) the addresses of memory locations storing descriptor content and (ii) a set of inherited fields; and

forming a new descriptor by (i) populating the inherited fields of the new descriptor by copying data from a previously formed descriptor and (ii) populating other fields of the new descriptor by copying data from memory locations at the defined addresses.

14. A computer program product including computer executable instructions that, when executed by a processor, perform the steps of the method of claim 13.

Description:
INSTRUCTION PROCESSING

This invention relates to handling instructions to a processing unit such as a graphics processing unit (GPU).

Typically, instructions are provided to a processing unit by a host, such as a central processing unit (CPU), on which a driver is running. Instructions to the processor may be packed into jobs, each job comprising a set of instructions that relate to a common task, and the jobs sent to the processor. Jobs may be sent to the processor in the form of a descriptor. The descriptor may include information defining the job either directly (through information contained in the descriptor) or indirectly (through information contained in memory whose address is contained in the descriptor). It is desirable to reduce the load required of the driver. This may allow it to prepare instructions or jobs faster or may free it up to perform other tasks. In a graphics processing application, the instructions and jobs may relate to graphics processing tasks. The jobs and instructions may result from application programming interface (API) calls directed to the driver. The overhead required of the driver to pack and send jobs is influenced by the architecture of the graphics API and of the processor to which the jobs are directed.

In a system comprising a GPU, the overhead of the driver can have a substantial impact on overall power consumption and computing performance. Many mobile systems-on-chip (SoCs) have a fixed total power budget and distribute that budget dynamically between the CPU and a GPU. Some approaches to this are intelligent power allocation (IPA) and energy aware scheduling (EAS).

In a typical mobile SoC, the driver is implemented by the CPU. In this architecture, high driver overhead means that the CPU takes a bigger part of the fixed power budget. This can lead to throttling of the GPU, e.g. by its operational frequency being reduced, hurting overall performance.

Research has indicated that in an existing GPU driver implementation a relatively large amount of time is consumed on memory transactions, and that most of that memory overhead is due to allocation, copy and set memory operations. Typically, a part of GPU memory is allocated from a pool and data is copied from host memory to the allocated part of GPU-mapped memory whenever a new GPU resource (texture, uniform, etc.) is required or updated for a draw call as part of a graphics API state machine implementation. This can use considerable resources.

The Vulkan API is a new generation graphics and compute API that provides high-efficiency, cross-platform access to GPUs and can reduce driver overhead. However, Vulkan does not support parameterized reusable blocks, such as the parameterized command buffer. With Vulkan, the GPU driver builds GPU hardware jobs in the form of a complete GPU hardware descriptor set for each draw call. This can involve a significant level of memory access. In some scenarios, some of the information sent or referenced in one call (e.g. a texture map) may be repeated in a subsequent call, with significant memory resource being used both times.

It is desirable to reduce driver overhead in a graphics processing unit by efficiently generating job descriptors.

According to a first aspect there is provided a hardware unit for forming descriptors for acting as instructions to a processing unit, each descriptor comprising multiple fields, the hardware unit being configured to receive data defining (i) the addresses of memory locations storing descriptor content and (ii) a set of inherited fields, and the hardware unit being configured to form a new descriptor by (i) populating the inherited fields of the new descriptor by copying data from a previously formed descriptor and (ii) populating other fields of the new descriptor by copying data from memory locations at the defined addresses.

The hardware unit may be configured to store the new descriptor. This may allow the descriptor to be easily accessed by the hardware unit.

In an implementation the hardware unit may be provided by a dedicated hardware block or dedicated processing circuit. For example, the hardware unit may be provided by a hardware job manager or the hardware unit may be part of the hardware job manager. The hardware unit may be configured to populate the inherited fields of the new descriptor from the fields of a descriptor previously stored by the hardware unit. This may reduce the need to copy data for descriptor fields whose content is the same as a field of a previous descriptor, which may reduce driver overhead.

The data defining the addresses of memory locations storing descriptor content may comprise a base address and a first data structure comprising offset values corresponding to each descriptor field stored in the memory, and the hardware unit may be configured to determine each said address by determining an address offset from the base address by a respective offset value in the first data structure. This is a convenient way of defining the memory addresses.
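As a minimal sketch of the address derivation just described (all field names and numeric values are hypothetical, chosen only for illustration), each field address may be formed by adding the field's offset value from the first data structure to the base address:

```python
# Sketch of the address derivation described above. The field names
# and offsets are hypothetical; the mechanism (base + offset) follows
# the text.

def field_addresses(base_address, offset_table):
    """Derive the memory address of each descriptor field by offsetting
    the base address by the field's value in the first data structure."""
    return {field: base_address + offset
            for field, offset in offset_table.items()}

# A hypothetical offset table for three descriptor fields.
offsets = {"shader": 0x00, "vertex_buffer": 0x40, "texture": 0x80}
addresses = field_addresses(0x1000, offsets)
```

This keeps the driver free to relocate the descriptor content as a whole: only the base address changes, while the offset table stays constant.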

The unit may be configured to receive data defining the set of inherited fields comprising a second data structure indicating which fields are to be populated by copying data from a previously formed descriptor. The second data structure may comprise a mask, wherein each field of the mask corresponds to a respective field of the descriptor and indicates whether that field of the descriptor is an inherited field. This is an efficient way of indicating which fields are inherited fields.
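The mask-driven population of a new descriptor can be sketched as follows (the field names, mask width and values are hypothetical; the mask semantics follow the text):

```python
# Illustrative sketch: a bitmask (the "second data structure") marks
# which descriptor fields are inherited from the previous descriptor.

FIELDS = ["shader", "vertex_buffer", "texture", "uniforms"]  # hypothetical

def form_descriptor(mask, previous, fresh_values):
    """Populate each field either from the previously formed descriptor
    (if its mask bit is set) or from newly supplied content."""
    new = {}
    for i, name in enumerate(FIELDS):
        if mask & (1 << i):          # inherited field: copy from previous
            new[name] = previous[name]
        else:                        # other field: copy fresh content
            new[name] = fresh_values[name]
    return new

prev = {"shader": "S1", "vertex_buffer": "V1", "texture": "T1", "uniforms": "U1"}
fresh = {"shader": "S2", "vertex_buffer": "V2", "texture": "T2", "uniforms": "U2"}
# Mask 0b0101 inherits "shader" and "texture" from the previous descriptor.
d2 = form_descriptor(0b0101, prev, fresh)
```

With one mask bit per field, the inheritance decision costs a single test per field and no memory traffic for the inherited entries.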

The hardware unit may be configured to provide the new descriptor to the processing unit. The descriptor can then be used by the processing unit.

The processing unit may be a graphics processing unit. The descriptor may be a graphics processing unit job descriptor. The instructions may be one of graphics-specific tasks or compute-specific tasks. The instructions may be one of shader, vertex, pixel, fragment and compute tasks. This may allow the hardware unit to be used to carry out common graphics processing operations.

The hardware unit may share with a driver memory containing the memory locations. This is an efficient architecture arrangement. The format of the descriptors may be such that a first subset of the fields of the descriptors are of a fixed length and a second subset of the fields of the descriptors are of a variable length; and the driver may be configured to store data for populating the fields of the first subset in a first common area of the memory and to store data for populating the fields of the second subset in a second common area of the memory separate from the first common memory area. This is an efficient way of storing the descriptor elements.

According to a second aspect there is provided a method for forming descriptors for acting as instructions to a processing unit or processor, each descriptor comprising multiple fields, the method comprising: receiving data defining (i) the addresses of memory locations storing descriptor content and (ii) a set of inherited fields; and forming a new descriptor by (i) populating the inherited fields of the new descriptor by copying data from a previously formed descriptor and (ii) populating other fields of the new descriptor by copying data from memory locations at the defined addresses.

All the features described in relation with the hardware unit according to the first aspect and their associated advantages can be combined to the method according to the second aspect to obtain further implementation forms of the method of the second aspect.

According to a third aspect, there is provided a computer program product including computer executable instructions that, when executed by a processor, perform the steps of the method according to the second aspect.

BRIEF DESCRIPTION OF THE FIGURES

The present invention will now be described by way of example with reference to the accompanying drawings. In the drawings:

Figure 1 illustrates an example of high-level structure of GPU hardware.

Figure 2 shows an example of descriptor content.

Figure 3 shows a method for forming descriptors.

Figure 4 shows a method for forming descriptors.

DETAILED DESCRIPTION OF THE INVENTION

Figure 1 shows an example of a high-level structure of GPU hardware. The GPU hardware comprises a number of graphic processing clusters 102, 103, 104 behind a common job manager 101. These clusters and their interface 101 are collectively referred to as the GPU 100. The GPU may include other hardware not shown in figure 1. The interface 101 is exposed to a driver 105 of a CPU 108. The CPU also supports an application. The application may be a user application or a part of an operating system. When the application requires the GPU to perform processing, it instructs the driver 105 via an API. The driver can then compose a job for the GPU. The driver forms a descriptor defining the job. The descriptor is passed to the GPU. The job manager 101 of the GPU prepares the job for execution and allocates it to one or more of the processing clusters, which then execute the job. The driver may pass the descriptor directly to the GPU, or it may store the job descriptor in an area 106 of shared memory 107 which can be accessed by both the CPU and the GPU. In the latter case, the job manager 101 can read the descriptors from a queue in area 106 and then action them. The driver may store other information, for example texture definitions, in the shared memory 107, and that information can then be accessed by the GPU.

As indicated above, it is convenient for the driver to pass job descriptors and any associated information defining one or more jobs to the GPU by storing that information in data structures in shared memory 107. However, it can be convenient for at least some functions to be passed more directly from the driver to the GPU. This may for example be done for some high-level configuration and control functions. A convenient way to pass these is through a register interface such as a memory-mapped advanced peripheral bus (APB) register interface.

The jobs sent to the GPU may be for graphics-specific tasks (e.g. vertex, shader, pixel or fragment processing), compute tasks or for jobs internal to the system, e.g. to prepare or suspend elements of the system. The approach described herein may be applied for providing jobs to processors that are not dedicated GPUs.

An effect of the job manager interface 101 (or, for short, the job manager) is to hide the number and organization of the graphics processing clusters from the CPU. In this way, the CPU can implement a common mechanism for passing jobs to the GPU irrespective of the internal architecture of the GPU.

The job manager 101 may conveniently comprise a dedicated hardware block, dedicated processing circuit or hardware unit. The job manager 101 interfaces to the driver 105 via memory 107. As will be described below, the job manager 101 is configured for late GPU descriptor and address patching using parameterized data and a skiplist mask generated by the GPU driver.

Each GPU job descriptor formed by the driver 105 may comprise multiple fields. The content of each field may be stored at a respective address in the memory 107. That content may itself define a parameter of the job, or it may be a pointer to another location in memory (conveniently in shared memory 107) where a parameter is stored.

Thus, each GPU descriptor may be a pointer-based data structure. It may have non-sequential nodes (i.e. the fields of a descriptor may point to non-sequential blocks in memory). The payload of the descriptor may be small relative to the total amount of data referenced by the descriptor. The nodes contain data to be read and/or processed and the next pointer or pointers to subsequent nodes. Figure 2 shows an example of the format of a descriptor, in the form of a list of fields. In figure 2 the list itself is shown in two columns. Each field is defined by a row in the list shown in figure 2. Where a field refers to a "mem region", the content of that field may be a pointer to a region of memory (referred to as a node) containing the requisite data to satisfy the informational requirements of that field.
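A pointer-based descriptor of this kind can be sketched as follows (the field names, addresses and node contents are hypothetical; only the direct-value versus "mem region" pointer distinction follows the text):

```python
# Hypothetical sketch of a pointer-based descriptor: small fields hold
# values directly, while "mem region" fields hold pointers (addresses)
# into a simulated shared memory of nodes.

memory = {0x200: b"texture-data", 0x300: b"uniform-data"}  # node contents

descriptor = {
    "job_type": 3,            # direct value held in the field itself
    "texture_region": 0x200,  # pointer to a node in memory
    "uniform_region": 0x300,  # pointer to a node in memory
}

def resolve(descriptor, memory):
    """Follow each "mem region" pointer to fetch the node it references;
    pass direct values through unchanged."""
    return {k: memory[v] if k.endswith("_region") else v
            for k, v in descriptor.items()}

resolved = resolve(descriptor, memory)
```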

For the job manager hardware block 101 to work most efficiently, it would be expected that the GPU driver should pack the nodes of a descriptor sequentially in memory. This could make optimal use of spatial locality optimizations in standard memory subsystems. The GPU descriptors and the associated array data structures of descriptors could be packed and indexed in a unified memory layout. In a typical descriptor, a number of the regions may be of the same, fixed size. In a typical processing system, this fixed size may conveniently be obtained by invoking a "sizeof" operator on the relevant part of the descriptor structure. When the structure in question is an array formed of multiple elements of the same size, the total size may be obtained as the product of the size of the array elements (which can, e.g., be determined using a sizeof operator) and the number of array elements. The fixed size values may be set up during configuration of the system and may then be constant, or they could vary from descriptor to descriptor. The size of one or more elements may be determined from shader information at the linking stage. In the case of descriptors having variable-size elements their size may depend on specific draw call parameters.

As will become apparent, it can be efficient for elements of a descriptor that are of a fixed length and/or those elements of the descriptor that are static or substantially static to be stored in a first contiguous region of the memory 106 containing the descriptor. This region is indicated as a primary template region (PTSZ) in Figure 2. It can further be efficient for other elements, e.g. those of variable sizes and/or whose values are not static or substantially static, to be stored in a second contiguous region of the memory 106 containing the descriptor. This region is indicated as a secondary template region (STSZ) in Figure 2.

The efficiency with which descriptors can be processed can be improved by reducing the need to copy data for descriptor fields whose content is the same as a field of a previous descriptor. The fields in question may be corresponding fields (in descriptors of the type illustrated in figure 2, that would be fields designated by the same row in the list but belonging to different descriptors). The previous descriptor may be the immediately preceding descriptor, e.g. in write order and/or in read order. Fields of this nature in a subsequent descriptor to the first may be referred to as inherited fields. Suppose a first descriptor (D1) and a subsequent descriptor (D2) are to be written to memory 106, and that the subsequent descriptor includes a field (F2) which is to convey the same data as a field (F1) of the first descriptor. The data to be conveyed may be included directly in the field F1 of the first descriptor. Alternatively, the data to be conveyed may be stored in another area R of memory 106 and the field F1 of the first descriptor may comprise a reference to that area. In the present approach, the first descriptor D1 is written to memory 106. The content of the field F1 is stored at a certain location ("M1"). If the data to be conveyed is not to be held directly in that field then the data to be conveyed is stored in memory area ("R") and the details of R (e.g. its start address and its length) are represented in F1 by being stored at M1. When the subsequent descriptor D2 is written, it is to have the field F2 representing the same content as field F1 of descriptor D1. If the content was stored directly in F1, then F2 is written to memory 106 so as to contain a reference to area M1. If F1 held a reference to R then a reference to R is stored as field F2. In both cases, field F2 can be inherited; there is no need to store the actual content data a second time as part of storing D2.
Content data at a memory area at which it was stored for a first descriptor can be referenced and re-used for a subsequent descriptor. This logic can reduce the workload of memory operations needed to write the subsequent descriptor to memory, because the content data for the inherited field F2 does not need to be written as part of storing D2. On the read side, when the job manager 101 is retrieving a descriptor having one or more inherited fields it can use an indexing table to enable it to efficiently retrieve job descriptor field values that require updating for each draw call. If it has cached the content of the last-read descriptor, it can skip the process of reading from memory parts of a descriptor that are the same as those of the last-read descriptor. This can improve efficiency.
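The write-side re-use can be sketched as follows (the names D1, D2, F1, F2 and R follow the example in the text; the address and payload values are hypothetical):

```python
# Sketch of the write-side re-use described above: the content for the
# inherited field is written to shared memory once, and the subsequent
# descriptor stores only a reference to it.

shared_memory = {}
R = 0x500                       # area holding the content to be conveyed
shared_memory[R] = b"texture-map"

# Writing the first descriptor: field F1 stores a reference to area R.
D1 = {"F1": R}

# Writing the subsequent descriptor: the inherited field F2 stores the
# same reference; the content at R is not written a second time.
D2 = {"F2": D1["F1"]}
```

Both descriptors now resolve to the same content, and storing D2 required no copy of the payload itself.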

The data defining the addresses of memory locations storing descriptor content may comprise a base address and offset values corresponding to each descriptor field stored in the memory. The hardware unit may determine each memory address by determining an address offset from the base address by a respective offset value.

Fields that are to be populated from a previously generated descriptor (inherited fields) may be defined by a data structure comprising a list of fields and for each field an indication of whether it should be read from memory or skipped. If a field is skipped it is populated from the content of a previous descriptor and hence not read for a second time by the job manager 101 from memory. The data structure may be referred to as a skiplist mask. Each field of the mask may correspond to a respective field of the descriptor and indicate whether that field of the descriptor is an inherited field: i.e. whether the field should be populated from a previously generated descriptor or whether the memory address should be looked up from the base address and offset value and the field populated from memory.

The job manager 101 may be configured so that if a field of a descriptor is absent in memory 106, or empty or indicated by a null marker, it will be treated as having the same value as the corresponding field of the last-read descriptor.

The role of the job manager 101 is to process the descriptors provided by the driver 105 and, in dependence on the content of each descriptor and any other information in memory 106 indicated by reference from the content of the descriptor, initialize one or more of the processing clusters 102-104 to execute the job defined by the descriptor. To do this, the job manager may traverse the stored descriptors one by one. The order in which the descriptors are traversed may be, for example, the order in which they were written, the order in which they are sited in the memory 106 or may be dependent on a priority value associated with each descriptor. As the job manager reads each descriptor and any values it references, it caches that data locally to it. This allows inherited data to be reconstituted from the cache without being re-read from memory.

One or more offset tables may be provided. A first offset table may indicate the address of the memory area where each descriptor is stored, as an offset from a base address. The addresses of the descriptors in the memory 106 are derived from the base pointer plus the respective offset values stored in the offset table. This allows the job manager 101 to access each descriptor and allows the driver to write the content of each descriptor to memory without knowing beforehand what address it will be stored at. A second offset table may indicate, for at least some elements of a descriptor, the offset from the start address of that descriptor where the respective element is stored. Using the second offset table the job manager can copy and/or clone and patch addresses stored in specific elements directly once it knows the start address of the relevant descriptor.
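The two-level lookup can be sketched as follows (the base address, table contents and element names are hypothetical; the base-plus-offset composition follows the text):

```python
# Hypothetical sketch of the two offset tables: the first locates each
# descriptor relative to a base pointer, the second locates elements
# within a descriptor relative to that descriptor's start address.

base = 0x4000
descriptor_offsets = [0x000, 0x100, 0x200]   # first table: one per descriptor
element_offsets = {"F1": 0x00, "F2": 0x10}   # second table: one per element

def descriptor_address(index):
    """Address of descriptor `index` = base pointer + first-table offset."""
    return base + descriptor_offsets[index]

def element_address(index, element):
    """Address of an element = descriptor start + second-table offset."""
    return descriptor_address(index) + element_offsets[element]
```

For example, element F2 of the second descriptor would be found at base + 0x100 + 0x10.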

Each of the skiplist mask and the base address may conveniently be provided to the job manager 101 using general purpose registers, as opposed to by being stored in shared memory 107. Once a descriptor has been generated, the hardware unit may store or cache the new descriptor. Where a descriptor has an inherited field, the unit may populate the inherited field from a descriptor stored in the cache or memory that was previously generated and stored.

The approach of avoiding redundant copying of descriptor payload values is especially convenient when the system architecture has a unified memory layout where the descriptors and any other data they refer to are stored. The memory may take the form of a parameterized buffer.

The size of a block of data that is addressed by a descriptor element can vary depending on the application. It could, for example, be in the range from 4 bytes to 1 Kbyte. The payload size should correspond to how much of the payload needs to be examined in the traversal of a job within a job chain.

The arrangement of memory 107 as described above can reduce the overheads of memory allocation, fragmentation and management. One aspect that contributes to this is the division of the memory 107 into separate areas for the storage of (i) descriptors and (ii) data linked to descriptors by being addressed in one or more descriptor elements. A second aspect that contributes to this is the division of memory for each descriptor into a first area where fixed-length descriptor elements are stored and a second area where variable-length descriptor elements are stored. The allocation and initialization of the unified memory 107 to serve a particular application 109 can be done at the stage when the application is initialized.

By arranging the job manager to re-use job descriptor data as described above, the number of copies from driver 105 to memory 107 and from memory 107 to the job manager 101 can be reduced compared to some other approaches.

Conveniently the job manager can read a descriptor by maintaining a pointer to the current read location in the memory 107. The location of the pointer can be computed with knowledge of a relatively small number of parameters: 1. a base address where the start of storage of the job descriptor is located in memory 107;

2. the address of a table storing the offsets of the elements of the descriptor from its base address (there may be multiple such tables for descriptors of different types);

3. the size of the offset table (indicating how far the job manager should read from the table address; this will depend on the number of fields in the descriptor; conveniently the fields in the table are of fixed length for each descriptor field).

The job manager can traverse the offset table, starting from the table address (2), determining a series of offset values each relating to a field of the current descriptor. Each of these can be added to the base address (1) to determine the address of a respective descriptor field. That field can be read to give one or more further addresses from which the content data will be read.

In the system described above, the execution characteristics of many different pointer-based descriptors may be represented by the following parameters:

1. Job base address of unified memory layout.

2. Offset table address.

3. Offset table size.

4. Current parameterized buffer address.

5. Any dependence tree.

The memory subsystem could be hard-wired to accept a control logic that permits the job manager to access it using a set of such parameters. This may improve performance.

The job manager may be implemented as a GPU coprocessor executing software or as a dedicated, fixed hardware block. It is expected that a hardware implementation would be more efficient.

Figure 3 illustrates a method of reading descriptors according to the invention. At 301 a list of base addresses is read and the base address of a specific descriptor is retrieved. The skiplist mask is read at 302 to determine which elements of the current descriptor are to be read from memory 107 and which are to be skipped (not read) because they will be populated from the corresponding data of the last-read descriptor. (It should be noted that the last-read descriptor may itself have inherited that data from a previously-read descriptor.) The offset table address is read at 303. An initial address in parameterized memory 106 is read at 304. At step 305, a determination is made as to whether the end of the offset table has been reached. If so, the process ends at 306 and the unit then passes the descriptor that has been formed to one or more of the clusters 102, 103, 104 shown in Figure 1. Thus at step 306 the job descriptor is prepared and ready for use by the graphics processing clusters of the GPU with all its fields populated.

If the end of the offset table has not been reached, the next value from the offset table is loaded at step 307 and a current address pointer is formed at 308 as the base address plus that offset value. The current address pointer for each descriptor field is used by the subsequent control logic in the job manager hardware block to directly access the value stored at that address while preparing the task for the GPU cluster. It does not need to perform the arithmetic address calculation again by adding the base address and the required offset index. This is a late address patching operation, performed just before the task is consumed. At 309 the offset table pointer is incremented so as to point to the next offset. At step 310, a determination is made as to whether the next bit value of the skiplist mask is set. If the answer to this determination is yes, the entry is skipped and the old data from the previous descriptor is used from template memory at 311. If the answer is no, a block starting at the current value of a parameterized memory pointer, and whose size is given by the offset table entry indicated by the current offset pointer, is copied from a parameterized memory location in template memory (Step 312). The parameterized memory pointer is then updated to the current value of the parameterized memory pointer plus the current offset size (Step 313). The flow then returns, as shown at 314, to step 305.
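The read loop of Figure 3 can be sketched behaviourally as follows (the addresses, offsets and field values are hypothetical; the skip-versus-copy decision per offset-table entry follows the figure):

```python
# Behavioural sketch of the Figure 3 read loop: for each offset-table
# entry, the skiplist mask decides whether the field is re-read from
# parameterized memory or taken from the last-read (template) descriptor.

def read_descriptor(base, offset_table, skiplist_mask, memory, previous):
    new = []
    for i, offset in enumerate(offset_table):    # loop over table entries
        address = base + offset                  # step 308: current pointer
        if skiplist_mask & (1 << i):             # step 310: mask bit set?
            new.append(previous[i])              # step 311: reuse old data
        else:
            new.append(memory[address])          # step 312: copy from memory
    return new                                   # step 306: descriptor ready

memory = {0x1000: "A2", 0x1004: "B2", 0x1008: "C2"}
previous = ["A1", "B1", "C1"]
# Mask 0b010 skips the second field, inheriting "B1" from the last pass.
result = read_descriptor(0x1000, [0x0, 0x4, 0x8], 0b010, memory, previous)
```

The skipped field costs no memory read, which is the source of the efficiency gain described above.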

A summary of the method according to the present invention is described in Figure 4. In step 401, data is received defining (i) the addresses of memory locations storing descriptor content and (ii) a set of inherited fields. In step 402, a new descriptor is formed by (i) populating the inherited fields of the new descriptor by copying data from a previously formed descriptor and (ii) populating other fields of the new descriptor by copying data from memory locations at the defined addresses.

The hardware unit described above can reduce GPU driver overhead, which may be defined as the amount of CPU work that is needed for packing and sending jobs to the GPU resulting from API calls. Reducing the GPU driver overhead on mobile SoCs can improve overall performance by avoiding thermal throttling of, e.g., the GPU. Overall power consumption can also be reduced.

The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.