

Title:
MICROPROCESSOR APPARATUS
Document Type and Number:
WIPO Patent Application WO/2014/202825
Kind Code:
A1
Abstract:
A heterogeneous multi-core processor is disclosed which comprises a controller, memory in the form of register files, as well as a local and a global memory, and a plurality of functional units (FUs). The FUs are transport-triggered type units which compute a particular algorithm using a particular data format (e.g. integer or floating point, 8-bit, 16-bit etc.) in response to a move or transfer instruction received at a triggering socket. The controller is configured to compile and execute different kernels in sequence using a Single Instruction Multiple Thread (SIMT) programming paradigm for efficient parallel processing. The controller is further configured to identify the computational and data format requirements of the current kernel, and to configure the core to support the requirements by means of connecting appropriate ones of the FUs to the controller and memory. In the event that a core storing data for use by the current kernel is not itself able to support the kernel's requirements, inter-core communications take place so that an alternative core can process the kernel.

Inventors:
ZETTERMAN TOMMI (FI)
HIRVOLA HARRI (FI)
Application Number:
PCT/FI2013/050679
Publication Date:
December 24, 2014
Filing Date:
June 20, 2013
Assignee:
NOKIA CORP (FI)
International Classes:
G06F9/38; G06F9/30; G06F13/16
Foreign References:
US20110066811A12011-03-17
US20050216707A12005-09-29
Attorney, Agent or Firm:
NOKIA CORPORATION et al. (Jussi Jaatinen, Karakaari 7, Espoo, FI)
Claims:
Claims

1. Apparatus comprising:

at least one processing core, the or each processing core comprising:

a controller;

memory; and

a plurality of functional units, each being connectable to the controller and memory by a common bus or buses and each being configured to perform a respective computation and/or to compute data received in a respective format,

wherein the controller is configured to receive a kernel program for launching on the core, to identify one or more computations and/or data format required by the kernel, and to configure the core by means of connecting to a bus or buses only a subset of the functional units which support the one or more computations and/or data format.

2. Apparatus according to claim 1, wherein the functional units are transport-triggered type functional units configured to perform their computation in response to receiving a transfer instruction at a triggering input thereof.

3. Apparatus according to claim 1 or claim 2, wherein the functional units are arranged into first and second groups of plural functional units, each group providing a different computational and/or data format capability, wherein the controller is configured to select either a first or second configuration mode based on the identifying step, and to connect one of the groups in accordance with the selected mode.

4. Apparatus according to claim 3, wherein the controller is configured to configure the core by means of enabling the functional units of the selected group and/or connecting their outputs to the bus or buses.

5. Apparatus according to claim 3 or claim 4, wherein the first and second functional units each include at least an arithmetic and logic unit (ALU) requiring respective data types, e.g. integer n-bit, float n-bit, integer m*n-bit, float m*n-bit.

6. Apparatus according to any of claims 3 to 5, wherein at least one of the first and second functional units or groups of functional units includes an instruction accelerator module associated with particular hardware.

7. Apparatus according to any preceding claim, wherein the controller is further configured to execute a received kernel program using the memory to store intermediate data.

8. Apparatus according to claim 7, wherein the memory includes registers for storing data resulting from the execution of individual program threads.

9. Apparatus according to claim 7 or claim 8, wherein the memory further includes local and global memory, the controller being configured to store the intermediate data resulting from the simultaneous execution of a group of program threads in the local memory and data for input to a different kernel in the global memory.

10. Apparatus according to any of claims 7 to 9, further comprising a host memory configured to be accessed by one or more other cores of the microprocessor apparatus.

11. Apparatus according to claim 9 or claim 10, further comprising a load and store unit (LSU) for accessing the global and/or local memory on the core.

12. Apparatus according to any of claims 8 to 11, wherein the registers and/or local memory and/or global memory are provided as transport-triggered type devices for processing data in response to a transfer instruction being received at a triggering input.

13. Apparatus according to any preceding claim, wherein the ALUs and bus or buses are configured so that their input vector lengths match.

14. Apparatus comprising a plurality of processing cores according to any preceding claim, further comprising means to determine whether a core storing data required by a received kernel has functional units that will support its computation and/or data format requirements, and if not, to transfer the data to a different core that will support the kernel requirements.

15. A method comprising: controlling a microprocessor having at least one processing core that includes a plurality of functional units, each connectable to a controller and memory by a common bus or buses and each being configured to perform a respective computation and/or to compute data received in a respective format, by:

receiving a first kernel program;

identifying the requirements of the kernel program in terms of computation and/or data format; and

configuring the core by means of connecting to a bus or buses only a subset of the functional units which support the computation and/or data format.

16. The method of claim 15, wherein the functional units are transport-triggered type functional units configured to perform their computation in response to receiving a transfer instruction at a triggering input thereof.

17. The method of claim 15 or claim 16, wherein the functional units are arranged into first and second groups of plural functional units, each group providing a different computational and/or data format capability, and wherein the method further comprises selecting either a first or second configuration mode based on the identifying step, and the configuring step comprises connecting one of the groups in accordance with the selected mode.

18. The method of claim 17, wherein the configuring step comprises enabling the functional units of the selected group and/or connecting their outputs to the bus or buses.

19. The method of claim 17 or claim 18, wherein the first and second functional units each include at least an arithmetic and logic unit (ALU) requiring respective data types, e.g. integer n-bit, float n-bit, integer m*n-bit, float m*n-bit.

20. The method of any of claims 15 to 19, wherein at least one of the first and second functional units or groups of functional units includes an instruction accelerator module associated with particular hardware.

21. The method of any of claims 15 to 20, further comprising executing a received kernel program and using memory to store intermediate data.

22. The method of claim 21, wherein the executing step comprises executing individual program threads and storing resulting data in register memory.

23. The method of claim 21 or claim 22, wherein the executing step comprises simultaneously executing a group of program threads and storing the data resulting therefrom in local memory and data for input to a different kernel in global memory.

24. The method of any of claims 21 to 23, further comprising using a host memory to store data for access by one or more other cores of the microprocessor apparatus.

25. The method of claim 23 or claim 24, further comprising using a load and store unit (LSU) for accessing the global and/or local memory on the core.

26. The method of any of claims 22 to 25, wherein the registers and/or local memory and/or global memory are provided as transport-triggered type devices for processing data in response to a transfer instruction being received at a triggering input.

27. The method of any of claims 15 to 26, wherein the ALUs and bus or buses are provided with matching input vector lengths.

28. The method of any of claims 15 to 27, performed on a multi-core device and further comprising determining whether a core storing data required by a received kernel has functional units that will support its computation and/or data format requirements, and if not, transferring the data to a different core that will support the kernel requirements.

29. A computer program comprising instructions that when executed by a computer apparatus control it to perform the method of any of claims 15 to 28.

30. A non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method comprising:

receiving a first kernel program;

identifying the requirements of said kernel program in terms of computation and/or data format; and

configuring a processor core that includes a plurality of functional units, each connectable to a controller and memory of the core by a common bus or buses and each configured to perform a respective computation and/or to compute data received in a respective format, the configuring including connecting to a bus or buses only a subset of the functional units which support said computation and/or data format.

31. Apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor:

to receive a first kernel program;

to identify the requirements of said kernel program in terms of computation and/or data format; and

to configure a processor core that includes a plurality of functional units, each connectable to a controller and memory of the core by a common bus or buses and each configured to perform a respective computation and/or to compute data received in a respective format, the configuring including connecting to a bus or buses only a subset of the functional units which support said computation and/or data format.

32. The apparatus of claim 31, wherein the functional units are transport-triggered type functional units configured to perform their computation in response to receiving a transfer instruction at a triggering input thereof.

33. The apparatus of claim 31 or claim 32, wherein the functional units are arranged into first and second groups of plural functional units, each group providing a different computational and/or data format capability, and wherein the computer-readable code when executed controls the at least one processor to select either a first or second configuration mode based on the identifying step, and to connect one of the groups in accordance with the selected mode.

34. The apparatus of claim 33, wherein the computer-readable code when executed controls the at least one processor to enable the functional units of the selected group and/or connect their outputs to the bus or buses.

35. The apparatus of claim 33 or claim 34, wherein the first and second functional units each include at least an arithmetic and logic unit (ALU) requiring respective data types, e.g. integer n-bit, float n-bit, integer m*n-bit, float m*n-bit.

36. The apparatus of any of claims 31 to 35, wherein at least one of the first and second functional units or groups of functional units includes an instruction accelerator module associated with particular hardware.

37. The apparatus of any of claims 31 to 36, wherein the computer-readable code when executed controls the at least one processor to execute a kernel program and use memory to store intermediate data.

38. The apparatus of claim 37, wherein the computer-readable code when executed controls the at least one processor to execute individual program threads and store resulting data in register memory.

39. The apparatus of claim 37 or claim 38, wherein the computer-readable code when executed controls the at least one processor to simultaneously execute a group of program threads and store the data resulting therefrom in local memory and data for input to a different kernel in global memory.

40. The apparatus of any of claims 31 to 39, wherein the computer-readable code when executed controls the at least one processor to store on host memory data for access by one or more other cores of the microprocessor apparatus.

41. The apparatus of claim 39 or claim 40, wherein the computer-readable code when executed controls the at least one processor to use a load and store unit (LSU) for accessing the global and/or local memory on the core.

42. The apparatus of any of claims 38 to 41, wherein the registers and/or local memory and/or global memory are provided as transport-triggered type devices for processing data in response to a transfer instruction being received at a triggering input.

43. The apparatus of any of claims 31 to 42, wherein the ALUs and bus or buses are provided with matching input vector lengths.

44. The apparatus of any of claims 31 to 43, comprising a multi-core device and wherein the computer-readable code when executed controls the at least one processor to determine whether a core storing data required by a received kernel has functional units that will support its computation and/or data format requirements, and if not, to transfer the data to a different core that will support the kernel requirements.

45. The non-transitory computer-readable storage medium of claim 30, wherein the functional units are transport-triggered type functional units configured to perform their computation in response to receiving a transfer instruction at a triggering input thereof.

46. The non-transitory computer-readable storage medium of claim 30, wherein the functional units are arranged into first and second groups of plural functional units, each group providing a different computational and/or data format capability, and wherein the computer-readable code when executed causes the at least one processor to select either a first or second configuration mode based on the identifying step, and to connect one of the groups in accordance with the selected mode.

47. The non-transitory computer-readable storage medium of claim 46, wherein the computer-readable code when executed causes the at least one processor to enable the functional units of the selected group and/or connect their outputs to the bus or buses.

48. The non-transitory computer-readable storage medium of claim 46, wherein the first and second functional units each include at least an arithmetic and logic unit (ALU) requiring respective data types, e.g. integer n-bit, float n-bit, integer m*n-bit, float m*n-bit.

49. The non-transitory computer-readable storage medium of claim 30, wherein at least one of the first and second functional units or groups of functional units includes an instruction accelerator module associated with particular hardware.

50. The non-transitory computer-readable storage medium of claim 30, wherein the computer-readable code when executed causes the at least one processor to execute a kernel program and use memory to store intermediate data.

51. The non-transitory computer-readable storage medium of claim 50, wherein the computer-readable code when executed causes the at least one processor to execute individual program threads and store resulting data in register memory.

52. The non-transitory computer-readable storage medium of claim 50, wherein the computer-readable code when executed causes the at least one processor to simultaneously execute a group of program threads and store the data resulting therefrom in local memory and data for input to a different kernel in global memory.

53. The non-transitory computer-readable storage medium of claim 30, wherein the computer-readable code when executed causes the at least one processor to store on host memory data for access by one or more other cores of the microprocessor apparatus.

54. The non-transitory computer-readable storage medium of claim 52, wherein the computer-readable code when executed causes the at least one processor to use a load and store unit (LSU) for accessing the global and/or local memory on the core.

55. The non-transitory computer-readable storage medium of claim 51, wherein the registers and/or local memory and/or global memory are provided as transport-triggered type devices for processing data in response to a transfer instruction being received at a triggering input.

56. The non-transitory computer-readable storage medium of claim 30, wherein the ALUs and bus or buses are provided with matching input vector lengths.

57. The non-transitory computer-readable storage medium of claim 30, wherein the computer-readable code when executed causes the at least one processor to determine whether a core storing data required by a received kernel has functional units that will support its computation and/or data format requirements, and if not, to transfer the data to a different core that will support the kernel requirements.

Description:
Microprocessor Apparatus

Field of the Invention

The present invention relates to a microprocessor apparatus, particularly, though not exclusively, a multi-core microprocessor apparatus for use in applications such as software defined radio (SDR).

Background of the Invention

SDR involves performing baseband radio communications functions using software instead of dedicated hardware components. SDR may require significant computing power, preferably with low latency and low energy consumption. SDR signal processing typically uses different kernels, each associated with a particular data type and/or signal processing algorithm; each places different requirements on the microprocessor. For example, Forward Error Correction (FEC) algorithms like Viterbi and Turbo may use short fixed-point or integer data types, eight bits in length. Channel estimation and Fast Fourier Transform (FFT) algorithms generally require longer fixed-point data types, sixteen bits in length, or floating point data types. To meet performance and energy targets, application-specific instructions are provided to accelerate the appropriate algorithms.

Summary of the Invention

A first aspect of the invention provides an apparatus comprising: at least one processing core, the or each processing core comprising: a controller; memory; and a plurality of functional units, each being connectable to the controller and memory by a common bus or buses and each being configured to perform a respective computation and/or to compute data received in a respective format, wherein the controller is configured to receive a kernel program for launching on the core, to identify one or more computations and/or data format required by the kernel, and to configure the core by means of connecting to a bus or buses only a subset of the functional units which support the one or more computations and/or data format.

The functional units may be transport-triggered type functional units configured to perform their computation in response to receiving a transfer instruction at a triggering input thereof. The functional units may be arranged into first and second groups of plural functional units, each group providing a different computational and/or data format capability, wherein the controller is configured to select either a first or second configuration mode based on the identifying step, and to connect one of the groups in accordance with the selected mode.

The controller may be configured to configure the core by means of enabling the functional units of the selected group and/or connecting their outputs to the bus or buses.

The first and second functional units may each include at least an arithmetic and logic unit (ALU) requiring respective data types, e.g. integer n-bit, float n-bit, integer m*n-bit, float m*n-bit. At least one of the first and second functional units or groups of functional units may include an instruction accelerator module associated with particular hardware.

The controller may be further configured to execute a received kernel program using the memory to store intermediate data. The memory may include registers for storing data resulting from the execution of individual program threads. The memory may further include local and global memory, the controller being configured to store the intermediate data resulting from the simultaneous execution of a group of program threads in the local memory and data for input to a different kernel in the global memory. The memory may comprise a host memory configured to be accessed by one or more other cores of the microprocessor apparatus. The apparatus may comprise a load and store unit (LSU) for accessing the global and/or local memory on the core.

The registers and/or local memory and/or global memory may be provided as transport- triggered type devices for processing data in response to a transfer instruction being received at a triggering input.

The ALUs and bus or buses may be configured so that their input vector lengths match.

A microprocessor having a plurality of processing cores according to any of the above definitions may be provided, further comprising means to determine whether a core storing data required by a received kernel has functional units that will support its computation and/or data format requirements, and if not, to transfer the data to a different core that will support the kernel requirements.

A second aspect of the invention provides a method comprising: controlling a microprocessor having at least one processing core that includes a plurality of functional units, each connectable to a controller and memory by a common bus or buses and each being configured to perform a respective computation and/or to compute data received in a respective format, by: receiving a first kernel program; identifying the requirements of the kernel program in terms of computation and/or data format; and configuring the core by means of connecting to a bus or buses only a subset of the functional units which support the computation and/or data format.

The functional units may be transport-triggered type functional units configured to perform their computation in response to receiving a transfer instruction at a triggering input thereof. The functional units may be arranged into first and second groups of plural functional units, each group providing a different computational and/or data format capability, and wherein the method may further comprise selecting either a first or second configuration mode based on the identifying step, and the configuring step comprises connecting one of the groups in accordance with the selected mode.

The configuring step may comprise enabling the functional units of the selected group and/or connecting their outputs to the bus or buses.

The first and second functional units may each include at least an arithmetic and logic unit (ALU) requiring respective data types, e.g. integer n-bit, float n-bit, integer m*n-bit, float m*n-bit.

At least one of the first and second functional units or groups of functional units may include an instruction accelerator module associated with particular hardware.

The method may further comprise executing a received kernel program and using memory to store intermediate data.

The executing step may comprise executing individual program threads and storing resulting data in register memory.

The executing step may comprise simultaneously executing a group of program threads and storing the data resulting therefrom in local memory and data for input to a different kernel in global memory.

The method may further comprise using a host memory to store data for access by one or more other cores of the microprocessor apparatus. The method may further comprise using a load and store unit (LSU) for accessing the global and/or local memory on the core. The registers and/or local memory and/or global memory may be provided as transport-triggered type devices for processing data in response to a transfer instruction being received at a triggering input.

The ALUs and bus or buses may be provided with matching input vector lengths.

The method of any preceding definition may be performed on a multi-core device and further comprises determining whether a core storing data required by a received kernel has functional units that will support its computation and/or data format requirements, and if not, transferring the data to a different core that will support the kernel requirements.

A third aspect of the invention provides a computer program comprising instructions that when executed by a computer apparatus control it to perform the method of any preceding method definition.

A fourth aspect of the invention provides a non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method comprising:

receiving a first kernel program; identifying the requirements of said kernel program in terms of computation and/or data format; and configuring a processor core that includes a plurality of functional units, each connectable to a controller and memory of the core by a common bus or buses and each configured to perform a respective computation and/or to compute data received in a respective format, the configuring including connecting to a bus or buses only a subset of the functional units which support said computation and/or data format.

The functional units may be transport-triggered type functional units configured to perform their computation in response to receiving a transfer instruction at a triggering input thereof.

The functional units may be arranged into first and second groups of plural functional units, each group providing a different computational and/or data format capability, and the computer-readable code when executed may cause the at least one processor to select either a first or second configuration mode based on the identifying step, and to connect one of the groups in accordance with the selected mode. The computer-readable code when executed may cause the at least one processor to enable the functional units of the selected group and/or connect their outputs to the bus or buses.

The first and second functional units may each include at least an arithmetic and logic unit (ALU) requiring respective data types, e.g. integer n-bit, float n-bit, integer m*n-bit, float m*n-bit.

At least one of the first and second functional units or groups of functional units may include an instruction accelerator module associated with particular hardware. The computer-readable code when executed may cause the at least one processor to execute a kernel program and use memory to store intermediate data.

The computer-readable code when executed may cause the at least one processor to execute individual program threads and store resulting data in register memory.

The computer-readable code when executed may cause the at least one processor to simultaneously execute a group of program threads and store the data resulting therefrom in local memory and data for input to a different kernel in global memory. The computer-readable code when executed may cause the at least one processor to store on host memory data for access by one or more other cores of the microprocessor apparatus.

The computer-readable code when executed may cause the at least one processor to use a load and store unit (LSU) for accessing the global and/or local memory on the core.

The registers and/or local memory and/or global memory may be provided as transport- triggered type devices for processing data in response to a transfer instruction being received at a triggering input.

The ALUs and bus or buses may be provided with matching input vector lengths. The computer-readable code when executed may cause the at least one processor to determine whether a core storing data required by a received kernel has functional units that will support its computation and/or data format requirements, and if not, to transfer the data to a different core that will support the kernel requirements.

A fifth aspect of the invention provides apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor: to receive a first kernel program; to identify the requirements of said kernel program in terms of computation and/or data format; and to configure a processor core that includes a plurality of functional units, each connectable to a controller and memory of the core by a common bus or buses and each configured to perform a respective computation and/or to compute data received in a respective format, the configuring including connecting to a bus or buses only a subset of the functional units which support said computation and/or data format.

The functional units may be transport-triggered type functional units configured to perform their computation in response to receiving a transfer instruction at a triggering input thereof. The functional units may be arranged into first and second groups of plural functional units, each group providing a different computational and/or data format capability, and wherein the computer-readable code when executed controls the at least one processor to select either a first or second configuration mode based on the identifying step, and to connect one of the groups in accordance with the selected mode. The computer-readable code when executed may control the at least one processor to enable the functional units of the selected group and/or connect their outputs to the bus or buses. The first and second functional units may each include at least an arithmetic and logic unit (ALU) requiring respective data types, e.g. integer n-bit, float n-bit, integer m*n-bit, float m*n-bit.

At least one of the first and second functional units or groups of functional units may include an instruction accelerator module associated with particular hardware.

The computer-readable code when executed may control the at least one processor to execute a kernel program and use memory to store intermediate data. The computer-readable code when executed may control the at least one processor to execute individual program threads and store resulting data in register memory.

The computer-readable code when executed may control the at least one processor to simultaneously execute a group of program threads and store the data resulting therefrom in local memory and data for input to a different kernel in global memory.

The computer-readable code when executed may control the at least one processor to store on host memory data for access by one or more other cores of the microprocessor apparatus.

The computer-readable code when executed may control the at least one processor to use a load and store unit (LSU) for accessing the global and/or local memory on the core. The registers and/or local memory and/or global memory may be provided as transport-triggered type devices for processing data in response to a transfer instruction being received at a triggering input.

The ALUs and bus or buses may be provided with matching input vector lengths.

The apparatus may be a multi-core device and wherein the computer-readable code when executed controls the at least one processor to determine whether a core storing data required by a received kernel has functional units that will support its computation and/or data format requirements, and if not, to transfer the data to a different core that will support the kernel requirements.

Brief Description of the Drawings

The invention will now be described, by way of non-limiting example, with reference to embodiments in which:

Figure 1 is a schematic diagram of a generalised Transport Triggered Architecture for a microprocessor, which is useful for understanding the invention;

Figure 2 is a perspective view of a mobile terminal embodying aspects of the invention;

Figure 3 is a schematic diagram illustrating components of the Figure 2 mobile terminal and their interconnection;

Figure 4 is a schematic diagram of a generalised multi-core microprocessor architecture embodying aspects of the invention;

Figure 5 is a schematic diagram of a first embodiment architecture according to the invention;

Figure 6 is a schematic diagram of a second embodiment architecture according to the invention;

Figure 7 is a flow diagram illustrating processing steps for compiling a kernel in accordance with the invention;

Figure 8 is a flow diagram illustrating processing steps for launching a compiled kernel in accordance with the invention; and

Figure 9 is a schematic diagram of an OpenCL memory model which is useful for understanding the invention.

Detailed Description of Embodiments

Embodiments described herein relate to a heterogeneous microprocessor (hereafter "processor") configuration having multiple cores, although in theory the configuration can be applied in a single core processor. The configuration allows for the execution of programs at the instruction level in a more efficient way by reducing inter-core communications and without the necessity of using homogeneous processor cores, which tend to be inefficient. The multi-core processor offers particular advantages in the field of Software Defined Radio (SDR) which, as explained, typically uses multiple kernels each placing different requirements on the processor, for example in terms of data type (fixed-point or floating-point types of different word lengths) or acceleration algorithms.

However, the processor is not to be considered limited to such applications and can be used outside the field of SDR.

As will be understood, a processor core is a central processing unit that reads and executes program instructions in association with other functional units such as an arithmetic and logic unit (ALU) and memory. A multi-core processor is a single computing component that comprises two or more independent cores which thereby facilitates the execution of multiple instructions in parallel, and which may also communicate with each other, e.g. when the output of one core is required by the input of another core. A multi-core processor may be provided on a single physical integrated circuit (IC).

As will also be understood, a signal processing application such as SDR baseband is divided into multiple kernels. To create a signal processing application, multiple kernels can be organized as a signal flow graph, based on their execution order and the data transfer between kernels. As will also be understood, a kernel is a function that is executed on a processor core. A kernel execution is started when it gets input data from a processor input or from one or more other kernels. While executing, a kernel computes output data and sends the output data to a processor output or to one or more other kernels.
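As a purely illustrative aid (not taken from the application), the following Python sketch models a few hypothetical SDR kernels organized as a signal flow graph and derives a launch order from their data dependencies; the kernel names and the launcher are invented for illustration only.

```python
# Hypothetical SDR kernels organized as a signal flow graph (names invented).
# Each kernel starts once all kernels it depends on have produced output data.

from collections import namedtuple

Kernel = namedtuple("Kernel", ["name", "inputs"])   # inputs: upstream kernel names

pipeline = [
    Kernel("fft",         inputs=[]),                # fed directly by processor input
    Kernel("channel_est", inputs=["fft"]),
    Kernel("demap",       inputs=["fft", "channel_est"]),
    Kernel("fec_decode",  inputs=["demap"]),         # result goes to processor output
]

def launch_order(kernels):
    """Return a launch order respecting the data dependencies of the flow graph."""
    done, order = set(), []
    while len(order) < len(kernels):
        for k in kernels:
            if k.name not in done and all(dep in done for dep in k.inputs):
                order.append(k.name)
                done.add(k.name)
    return order

print(launch_order(pipeline))   # ['fft', 'channel_est', 'demap', 'fec_decode']
```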

In the context of SDR, multiple different kernels may be used in sequence, each operating using different types or formats of data, different word lengths, and/or providing a distinct function such as Forward Error Correction (FEC), Fast Fourier Transforms (FFT) and so on.

In embodiments below, the processor configuration for each core employs what is commonly known as a Transport Triggered Architecture (TTA). TTA is a processor design paradigm in which the program directly controls the internal data transfer on the processor buses. The TTA processor is comprised of functional units (FUs) which are interconnected by multiple buses; each FU usually has a specific computational function and includes a triggering input and either or both of an input and output socket. Program execution is controlled by a global control unit (GCU). When a transfer is made by the GCU to the triggering input, e.g. using a transfer or move instruction issued by the kernel, the FU performs its computation and the result of the computation appears at the output. Thus, the kernel directly controls the internal transport buses, and computation results occur as a side effect of the transports. A TTA also employs register files (RFs), memory load/store units (LSUs), ALUs, and other processor building blocks.

Due to their modular structure and the use of multiple buses and FUs, TTA configurations are ideally suited to instruction level parallelism, for example using single-instruction multiple thread (SIMT) programming, which is the programming paradigm used in this case. A TTA instruction resembles a Very Long Instruction Word (VLIW) configuration; a TTA instruction typically comprises multiple slots, one slot per bus, each slot determining a data transfer on the corresponding bus. In a SIMT configuration, multiple lightweight threads fed by a single instruction stream are executed simultaneously in a TTA core. The instruction word size can be optimized by broadcasting a single instruction slot to multiple functional units. In the example below (Figure 6) each SIMT 'lane' receives identical instructions (transfers). In the situation where thread executions take divergent paths (e.g. if one thread executes the IF part and another the ELSE part of an IF_THEN_ELSE branch) this can be handled for example by using predication, where both parts are executed in all lanes and lane FU outputs are predicated on/off depending on which path the lane actually takes. Alternatively, each lane can have an individual virtual program counter which is modified in case the thread branches, and the lane branching forward waits until the master program counter catches up with the virtual program counter.

TTA processors may be well suited for high performance computing, such as SDR, particularly because it is straightforward to add custom FUs for application-specific needs; because instructions are data transfers there is no need to modify the instruction set to access the new FUs. Also, the results of an FU computation may be passed directly to another FU, which avoids storing intermediate results to registers. Hence, smaller register files may be used and there is better performance and lower energy consumption. The so-called 'available register file ports' bottleneck which limits the instruction word length in traditional VLIW processors is relieved by allowing direct data forwarding.
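The transport-triggered idea can be illustrated with a minimal behavioural sketch in Python; the class, socket and move names below are illustrative assumptions, not the actual hardware or instruction set of the application.

```python
# Simplified behavioural model of a transport-triggered functional unit (FU):
# a move to the triggering socket fires the computation as a side effect, and
# the result appears at the output socket. All names here are illustrative.

class AddFU:
    def __init__(self):
        self.operand = 0    # input (operand) socket
        self.trigger = 0    # triggering socket
        self.output = 0     # output socket

    def move_to_operand(self, value):
        self.operand = value                  # plain data transfer, no computation

    def move_to_trigger(self, value):
        self.trigger = value
        self.output = self.operand + value    # computation fires on the trigger move

fu = AddFU()
fu.move_to_operand(3)    # e.g. "3 -> add.o"  (move issued by the program)
fu.move_to_trigger(4)    # e.g. "4 -> add.t"  (triggering move fires the addition)
print(fu.output)         # 7, ready to be moved on a bus to an RF or another FU
```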

A basic TTA architecture 1 is shown in Figure 1. A GCU 3, RF 5 and three FUs 7, 9, 11 are shown connected to first and second buses 13, 15. The GCU 3 is responsible for issuing the scheduled instructions, namely the transfer instructions. The RF stores intermediate results when required. The FUs include an ALU 7, an LSU 9 and a custom FU 11 which may be dedicated to any computation. The different types of socket are indicated on the LSU 9, and comprise in this particular case an input socket 17, an output socket 19 and a triggering socket 21. A transfer instruction appearing at the latter socket 21 causes the computation to be performed on data appearing at the input socket 17 and triggering socket 21, and the output of the computation appears at the output socket 19. The RF 5 has no triggering socket as it is used simply for the storing of data and performs no computation as such.

Referring now to Figure 2, a mobile communications terminal 100 is shown, being one example of a terminal utilising SDR. The exterior of the terminal 100 may have a touch sensitive display 102, hardware keys 104, a microphone 105, a speaker 118 and a headphone port 120.

Figure 3 shows an example schematic diagram of the components of terminal 100. The terminal 100 has a multi-core processor 106, a touch sensitive display 102 comprised of a display part 108 and a tactile interface part 110, the hardware keys 104, a memory 112, RAM 114, a speaker 118, the headphone port 120, a wireless communication module 122, an antenna 124 and a battery 116. The multi-core processor 106 is connected to each of the other components (except the battery 116) in order to control operation thereof. The memory 112 may be a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD). The memory 112 stores, amongst other things, an operating system 126 and may store software applications 128. The RAM 114 is used by the multi-core processor 106 for the temporary storage of data. The operating system 126 may contain code which, when executed by the multi-core processor 106 in conjunction with RAM 114, controls operation of each of the hardware components of the terminal. The operating system 126 includes a plurality of kernels each dedicated to a particular SDR function. The multi-core processor 106 is described in further detail below.

Figure 4 shows a typical multi-core processor architecture at the top level, comprised of four interconnected cores 200. Any number of cores may be used.

The terminal 100 may be a mobile telephone or smartphone, a personal digital assistant (PDA), a portable media player (PMP), a portable computer or any other device capable of running SDR applications. In some embodiments, the terminal 100 may engage in cellular communications using the wireless communications module 122 and the antenna 124. The wireless communications module 122 may be configured to communicate via several protocols such as GSM, CDMA, UMTS, Bluetooth and IEEE 802.11 (Wi-Fi).

The display part 108 of the touch sensitive display 102 is for displaying images and text to users of the terminal and the tactile interface part 110 is for receiving touch inputs from users. As well as storing the operating system 126 and software applications 128, the memory 112 may also store multimedia files such as music and video files. A wide variety of software applications 128 may be installed on the terminal including web browsers, radio and music players, games and utility applications. Some or all of the software applications stored on the terminal may provide audio outputs. The audio provided by the applications may be converted into sound by the speaker(s) 118 of the terminal or, if headphones or speakers have been connected to the headphone port 120, by the headphones or speakers connected to the headphone port 120.

In some embodiments the terminal 100 may also be associated with external software applications not stored on the terminal. These may be applications stored on a remote server device and may run partly or exclusively on the remote server device. These applications can be termed cloud-hosted applications. The terminal 100 may be in communication with the remote server device in order to utilise the software application stored there. This may include receiving audio outputs provided by the external software application. In some embodiments, the hardware keys 104 are dedicated volume control keys or switches. The hardware keys may for example comprise two adjacent keys, a single rocker switch or a rotary dial. In some embodiments, the hardware keys 104 are located on the side of the terminal 100.

Referring to Figure 5, a first embodiment core architecture 300 is shown, which architecture may be used on one or more cores of the multi-core processor 106.

The architecture 300 is a TTA-type architecture comprising multiple buses 301 and a plurality of FUs and an RF 305 connected to the buses. A global control unit (GCU) (not shown) is provided for executing kernels in sequence on the architecture 300. The FUs include a global LSU 303 and a local LSU 307. Kernel compilation is typically performed off-line, on a workstation using a compiler or cross-compiler; OpenCL also allows on-the-fly compilation. The remaining FUs are arranged into at least two groups. The first group comprises a vector ALU 309 which computes using 4 x 8-bit integer words and another, custom, FU 311 which may perform a first predetermined computation algorithm, e.g. FEC or FFT. The second group comprises an ALU 313 which may compute using 32-bit floating point words, and another custom FU 315 which computes a second predetermined algorithm. In other embodiments, different combinations of FUs may be provided for the different groups. As will be clear from the description below, each group provides one or more FUs offering respective capabilities in terms of, e.g., data format and/or algorithm and/or computation. No two groups need offer the same capability. The actual number of blocks depends on how many modes the processor core has. In this embodiment, there are two blocks.

Each FU 303, 307, 309, 311, 313, 315 has a triggering socket and an output socket as mentioned previously. The triggering sockets are indicated by a cross 'x'. Certain ones of the FUs also have an input socket.

The architecture 300 is configured to operate in at least two different modes dependent on the requirements of the kernel that is currently required to be executed. Each mode is associated with a particular configuration of the FUs and their interconnection to the bus or buses 301. The GCU is configured to identify the requirements of the current kernel and which mode is appropriate, and to select that mode, which results in the corresponding configuration being applied. If the GCU selects a first mode, for example, only the first group of FUs 309, 311 will be enabled and their outputs connected to the bus or buses 301. Similarly, if the GCU selects a second mode, only the second group of FUs 313, 315 will be enabled and their outputs connected to the bus or buses 301. More than two modes and therefore groups of FUs may be provided but the principle remains the same.

As will be explained below, where data resulting from the execution of one kernel is required by another kernel, where possible, the same core is used albeit in a different configuration to minimise inter-core transfers. If however the same core does not support the requirements of the kernel (i.e. the groups of FUs do not provide the required computation and/or the required data format) then a core which does support it will be selected and data moved to host memory.

Each mode may for example be associated with a specific scalar or vector ALU for a native data type used by kernel type sets, including vector floating point, vector fixed point, vector int8 and so on. Additionally, or alternatively, each mode may be associated with a specific set of instruction acceleration hardware or a specific algorithm, e.g. FEC, FFT.

In overview, when a kernel is required to be compiled, the programmer or cross-compiler tool identifies the specific requirements of the kernel in terms of data type and/or instruction acceleration hardware and/or algorithm to be performed and then selects a suitable mode. A cross-compiler can act as an automatic tool to select the mode. A cross-compiler can use predefined rules describing the possible processor configurations stored in memory. If for example the current kernel needs to operate using 32-bit floating point data, mode two will be selected by the GCU and the configuration is set up accordingly. The kernel is then compiled by the cross-compiler. The kernel can be programmed using a SIMT programming paradigm, e.g. using OpenCL or CUDA, so that multiple threads are executed in parallel whenever possible.

Kernels are compiled for a one-mode view of the multi-mode processor, meaning that only the resources available for that mode are visible and the compiler does not have to be aware of multiple modes. The runtime scheduler configures the processor to the correct mode at launch of the kernel execution and the processor, using SIMT processing, runs threads to completion so that there is no need to save any state when the mode is switched. Data can be passed between kernels utilising different modes of the same core by using the global memory (device-wide memory). The sending kernel writes results to the global memory where the data is ready for the receiving kernel, to be executed later on, on the same core.

In order to set up the configuration, the GCU enables the appropriate group of FUs for the selected mode by means of a MODE SET line 317. Further, selector demultiplexers 319, 321 can be used to select the appropriate output to be placed on the bus or buses 301. It will be noted from Figure 5 that the FUs are arranged in complementary pairs, with only one of each pair being utilised at a given time. Again, the MODE SET line 317 is used to select the appropriate output although a separate signal from the GCU could be provided for this purpose.
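A minimal sketch, assuming a two-mode core, of how a runtime might enable only the FU group matching a kernel's mode identifier (a software stand-in for the MODE SET line); the mode table, kernel structure and names are illustrative assumptions, not the actual design.

```python
# Sketch of runtime mode configuration on a two-mode core: only the FU group
# that supports the kernel's requirements is enabled and connected to the
# buses before execution. The mode table and names are illustrative.

FU_GROUPS = {
    "MODE1": ["vector_int8_alu", "fec_accelerator"],   # e.g. 4 x 8-bit integer work
    "MODE2": ["float32_alu", "fft_accelerator"],       # e.g. 32-bit floating point work
}

class Core:
    def __init__(self):
        self.enabled_fus = []

    def set_mode(self, mode_id):
        """Enable one FU group and leave the others disconnected from the buses."""
        self.enabled_fus = list(FU_GROUPS[mode_id])

    def launch(self, kernel):
        self.set_mode(kernel["mode_id"])   # threads run to completion, so no state to save
        print(f"running {kernel['name']} with FUs {self.enabled_fus}")

core = Core()
core.launch({"name": "fec_decode", "mode_id": "MODE1"})
core.launch({"name": "channel_est", "mode_id": "MODE2"})
```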

It will be appreciated that the GCU is common to all modes/configurations on this core architecture 300. In a multi-core processor, there is one GCU per core. The GCU controls the program execution on its respective core, including instruction fetching, decoding, decompressing (if used) and program execution, e.g. CALL and JMP instructions.

Local and global LSUs 303, 307 and the RF 305 are common to all modes/configurations in the architecture 300 and are shared thereby, as is the interconnection network of buses 301. Due to the fact that the only actual instruction issued by the GCU in this architecture 300 corresponds to a data transfer or move between sockets, there is no need to extend the instruction set if new modes/configurations are added. As long as the vector length of the ALUs for the different modes/configurations match the bus length, the shared or common parts of the architecture can remain the same.

Figure 6 shows a second embodiment core architecture 400, which architecture may be used on one or more cores of the multi-core processor 106. This second embodiment architecture 400 comprises multiple instances of the first embodiment architecture 300 utilising a common GCU 401, a common ALU 403, a common RF 405 and the interconnect network of buses 407. In effect, there are n lanes on the single core which may execute in parallel different threads of execution according to principles of SIMT processing. Each lane comprises its own global LSU 303 and local LSU 307 to access global and local memory and an RF 305 as private memory for the lane. In addition, there are the two mode-specific FUs 309, 311, 313, 315 for each of the two modes. The shown architecture 400 uses 32-bit buses 407 to be able to transfer either one 32-bit value (mode 2) or four 8-bit integers (mode 1) using one bus cycle. In order to utilise the common interconnect bus structure 407, the 8-bit integer ALU 309 in each lane can be a 4-width vector ALU so that each lane can compute either one float operation or four int8 operations depending on the mode.
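The sub-word parallelism on the 32-bit buses can be illustrated numerically: one bus word carries either a single 32-bit value or four packed 8-bit integers per cycle. The following Python helpers are an illustrative sketch of that packing, not an actual hardware interface.

```python
# Sub-word parallelism on a 32-bit bus word: it can carry either one 32-bit
# value or four packed 8-bit integers per cycle. Helper names are invented.

def pack_int8x4(values):
    """Pack four unsigned 8-bit integers into one 32-bit bus word."""
    assert len(values) == 4 and all(0 <= v < 256 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (8 * i)
    return word

def unpack_int8x4(word):
    """Unpack a 32-bit bus word into four unsigned 8-bit integers."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

word = pack_int8x4([10, 20, 30, 40])
print(hex(word))              # 0x281e140a
print(unpack_int8x4(word))    # [10, 20, 30, 40]
```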

It is also possible to introduce greater sub-word parallelism by means of increasing the bus size, e.g. using a 64-bit bus, length-2 vector ALUs in float mode and length-8 vector ALUs in int8 mode.

Figure 7 shows in overview a kernel compilation process for the multi-core processor 106. In this embodiment, it is assumed that the kernel is an OpenCL kernel. As will be known, OpenCL is a SIMT programming model in which the program is executed in parallel lightweight threads of execution. Each execution of a thread is called a work item and a group of threads executed simultaneously is called a workgroup. For completeness, the OpenCL memory model 500 will be mentioned briefly with reference to Figure 9. Multiple instances of an OpenCL program (in this case a kernel) are executed as parallel lightweight threads. There are four different types of memory:

Private memory for each work item (thread execution) using the fastest memory technology;

Local memory is shared between work items in a workgroup (simultaneous thread executions);

Global and Constant memories are visible to all work items. Global memory can be used to communicate data between different work groups and between different kernels executed on the same OpenCL device;

Host memory is used to communicate data between different OpenCL devices, for example different cores in a multi-core processor. This uses the slowest memory technology.

Memory management in the OpenCL memory model 500 is explicit.

In a first step 7.1, the GCU 401 selects a mode for the current kernel. This step determines the requirements of the kernel (e.g. in terms of data types, required custom instruction accelerators etc.) and identifies the mode required. In the next step 7.2, the compiler or cross-compiler compiles the kernel with resources dependent on the selected mode. In a third step 7.3, the compiler or cross-compiler attaches a mode identifier (e.g. MODE1, MODE2, ... MODE N) to the compiled kernel, which mode identifier determines the signal on MODE SET line 317.

Figure 8 shows in overview how a task scheduler launches the pre-compiled kernel. Task scheduling may be done dynamically (at run time) or statically (at compile time).

Execution of a task means the execution of one kernel, which usually launches multiple work items. Generally speaking, task scheduling includes the following main steps:

- Task assignment (which core runs the task);

- Ordering (in which order tasks are launched); and

- Timing (at which exact time the task is launched).

Any combination of performing these steps statically or dynamically is possible.

Task scheduling also takes into account the situation where the core having the data needed for execution of the current (new) kernel is not appropriate for executing the current kernel. This process is performed by the task scheduler. In a first step 8.1, the kernel is launched. In the next step 8.2, the mode identifier attached to the kernel from the compilation stage (Figure 7) is read. Based on the mode identifier, in the next step 8.3 it is determined if the core which has the input data required by this kernel supports the mode identified in the previous step. If yes, then in step 8.4 the core is configured as described previously for the mode read in step 8.2. In step 8.5 the kernel is executed, using a SIMT programming method, in that same core, retrieving required data from the global memory on the core if necessary. If at step 8.3 the processor core holding input data does not support the mode read in step 8.2, then in step 8.6 another core is selected which does support the kernel mode and this other core is used. In step 8.7 the input data is transferred to the host memory on the core selected in step 8.6 and the process then returns to step 8.4.
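A minimal sketch of the Figure 8 launch decision (steps 8.2 to 8.7), assuming the mode identifier was attached at compile time as in Figure 7; the core table and kernel representation are simplified assumptions for illustration.

```python
# Sketch of the launch flow: read the kernel's mode identifier (step 8.2),
# prefer the core already holding the input data (steps 8.3-8.5), otherwise
# pick another supporting core and move the data to its host memory (8.6-8.7).
# The core table and kernel representation are simplified assumptions.

def launch_kernel(kernel, cores, data_location):
    """cores: {core name: set of supported mode ids};
    data_location: name of the core currently holding the kernel's input data."""
    mode = kernel["mode_id"]
    if mode in cores[data_location]:
        target = data_location                       # same core: no inter-core transfer
    else:
        target = next(c for c, modes in cores.items() if mode in modes)
        print(f"transfer input data: {data_location} -> host memory on {target}")
    print(f"configure {target} for {mode} and execute {kernel['name']}")
    return target

cores = {"core0": {"MODE1"}, "core1": {"MODE1", "MODE2"}}
launch_kernel({"name": "channel_est", "mode_id": "MODE2"}, cores, data_location="core0")
launch_kernel({"name": "fec_decode", "mode_id": "MODE1"}, cores, data_location="core0")
```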

It will be appreciated that steps 8.6 and 8.7 involve inter-core communications that of course will not be required if the current core can handle the required mode. Energy, memory and time are saved in situations where consecutively run kernels in the signal processing pipeline use the same core. Kernel scheduling may be arranged accordingly to minimise inter-core communications.

In conventional heterogeneous multi-core processors, each core may have an architecture specific to a subset of algorithms to cater for particular native data types and attached hardware accelerators for the subset. This necessitates a large number of inter-core data transfers, which results in communication becoming expensive compared with computation. There is also the requirement for extra memory, which is also expensive. Homogeneous multi-core solutions, in which every core can compute every algorithm, have significant disadvantages in that they tend not to be as energy efficient as heterogeneous cores, are significantly more complex in order to support multiple data types, and require complex instruction sets. The present description provides a heterogeneous architecture in which the amount of data transfer between cores may be reduced.

Whilst silicon technology is scaling to smaller dimensions, and the number of available transistors tends to follow the exponential growth curve (Moore's Law), wire delays and RAM sizes are not expected to follow this exponential scaling in the appropriate direction. As a result, communication becomes more expensive compared to computation because the distance between cores is typically longer than the wires inside a single core, and memory space becomes more expensive compared to computation because SRAM and DRAM cells do not scale as well as logic (inter-core communication needs extra memory space to store copies of data whilst the transfer is taking place, as well as to store the received data whilst it waits to be processed). Hence, by controlling inter-core data communications using the above multi-mode, multi-configuration heterogeneous architecture, the present application addresses the performance and energy challenges.

If the known OpenCL computation model is used (as an example of SIMT) then less host memory may be needed because there is no need to transfer data via host memory when consecutive kernels are executed on the same core. Note that kernel compilation is not affected, other than attaching or associating the mode identifier bits to it. Runtime control to support multiple modes is straightforward to implement in, and compatible with, the OpenCL execution model. There is no need to increase the instruction length, and if the processor architecture is TTA, as in the above embodiments, there is no need to change the instruction set.

Example

As an example of the additional silicon area cost resulting from adding multiple modes to a core, an eight-lane TTA SDR core, targeted for FEC computation (native data type: 8-bit), has a total area of 3.63 mm² (when synthesized for 40 nm CMOS technology) for one mode, consisting of:

Common parts of core (including interconnect buses): 0.33 mm²
Private memories: 0.19 mm²
Local memory: 0.91 mm²
Global memory: 0.97 mm²
Instruction memory: 1.01 mm²
Lane ALUs (total of 8 lanes): 0.15 mm²
Lane custom instruction accelerators (total of 8 lanes): 0.07 mm²

The last two (ALUs and accelerators) are mode specific. If a second mode is added to the core and the lane ALUs and custom instruction accelerators are expected to be about the same size in both modes, the total area will increase to 3.85 mm² (e.g. a dual-mode core will be about 6.1% bigger than a single-mode core). Moreover, with three modes, the expected area increase is ~12% compared to a single mode, with 4 modes having an increase of ~18%, etc. With future ASIC technologies, the area overhead penalty is expected to decrease because custom logic scales better than memory cells.
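The quoted figures can be reproduced with simple arithmetic, under the stated assumption that each additional mode adds another copy of the lane ALUs and custom accelerators while all other parts are shared:

```python
# Area estimate check: only lane ALUs (0.15 mm²) and custom accelerators
# (0.07 mm²) are duplicated per mode; the remaining parts are shared.

shared = 0.33 + 0.19 + 0.91 + 0.97 + 1.01   # common parts + memories, mm²
per_mode = 0.15 + 0.07                       # lane ALUs + accelerators, mm²
single = shared + per_mode                   # 3.63 mm² for one mode

for modes in (1, 2, 3, 4):
    total = shared + modes * per_mode
    print(f"{modes} mode(s): {total:.2f} mm², +{(total / single - 1) * 100:.1f}% vs single mode")
# 1 mode: 3.63 mm² (+0.0%), 2 modes: 3.85 mm² (+6.1%),
# 3 modes: 4.07 mm² (+12.1%), 4 modes: 4.29 mm² (+18.2%)
```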

It will be appreciated that the above described embodiments are purely illustrative and are not limiting on the scope of the invention. Other variations and modifications will be apparent to persons skilled in the art upon reading the present application.

Moreover, the disclosure of the present application should be understood to include any novel features or any novel combination of features either explicitly or implicitly disclosed herein or any generalization thereof and during the prosecution of the present application or of any application derived therefrom, new claims may be formulated to cover any such features and/or combination of such features.