Title:
SHARED REGISTER FOR VECTOR REGISTER FILE AND SCALAR REGISTER FILE
Document Type and Number:
WIPO Patent Application WO/2022/220835
Kind Code:
A1
Abstract:
Embodiments of processors and operations thereof are disclosed. In an example, a processor includes a scalar register configured to store a first scalar, a vector register configured to store a first vector, a shared register configured to store a second vector including a set of second scalars, a scalar function unit operatively coupled to the scalar register and the shared register, and a vector function unit operatively coupled to the vector register and the shared register. The scalar function unit is configured to access the first scalar from the scalar register and the set of second scalars from the shared register. The vector function unit is configured to access the first vector from the vector register and the second vector from the shared register.

Inventors:
CONG LI (US)
WEI JIAN (US)
Application Number:
PCT/US2021/027543
Publication Date:
October 20, 2022
Filing Date:
April 15, 2021
Assignee:
ZEKU INC (US)
International Classes:
G06F9/30; G06F15/00
Domestic Patent References:
WO2006055546A2, 2006-05-26
Foreign References:
US5513366A, 1996-04-30
US20140047211A1, 2014-02-13
US20160094535A1, 2016-03-31
US20080195983A1, 2008-08-14
US20100042808A1, 2010-02-18
US20110047533A1, 2011-02-24
US20040243788A1, 2004-12-02
US20150193234A1, 2015-07-09
Attorney, Agent or Firm:
ZOU, Zhiwei (US)
Claims:
WHAT IS CLAIMED IS:

1. A processor, comprising: a scalar register configured to store a first scalar; a vector register configured to store a first vector; a shared register configured to store a second vector comprising a set of second scalars; a scalar function unit operatively coupled to the scalar register and the shared register and configured to access the first scalar from the scalar register and the set of second scalars from the shared register; and a vector function unit operatively coupled to the vector register and the shared register and configured to access the first vector from the vector register and the second vector from the shared register.

2. The processor of claim 1, wherein a total length of the set of second scalars is the same as a length of the second vector.

3. The processor of claim 1, wherein the second vector consists of the set of second scalars.

4. The processor of claim 1, wherein the scalar register and the shared register form a scalar register file of the processor, and the vector register and the shared register form a vector register file of the processor.

5. The processor of claim 4, wherein each of the scalar register file and the vector register file comprises two read-ports and one write-port.

6. The processor of claim 1, wherein the scalar function unit is further configured to access the set of second scalars in parallel from the shared register.

7. The processor of claim 1, wherein the vector function unit is further configured to operate on the first vector accessed from the vector register to generate the second vector, and store the second vector comprising the set of second scalars in the shared register; and the scalar function unit is further configured to access one or more second scalars of the set of second scalars from the shared register, and operate on the one or more second scalars.

8. The processor of claim 1, wherein the scalar function unit is further configured to operate on the first scalar accessed from the scalar register to generate a second scalar, and store the second scalar in the shared register as part of the set of second scalars; and the vector function unit is further configured to access the second vector comprising the second scalar from the shared register, and operate on the second vector.

9. The processor of claim 1, further comprising: an instruction decode unit configured to decode an instruction to move a target scalar in a target vector stored in the vector register to the scalar register to provide a control signal to the vector register and the shared register; and the vector register and the shared register are configured to, in response to the control signal, move the target vector comprising the target scalar to the shared register.

10. A processor, comprising: a scalar register file comprising a plurality of scalar registers; and a vector register file comprising a plurality of vector registers, wherein the scalar register file and the vector register file share a set of registers.

11. The processor of claim 10, wherein the set of registers is configured to be used as a set of scalar registers by the scalar register file or a vector register used by the vector register file.

12. The processor of claim 11, wherein the set of scalar registers is m n-bit scalar registers, and the vector register is one (m×n)-bit vector register.

13. The processor of claim 10, wherein each of the scalar register file and the vector register file comprises two read-ports and one write-port.

14. The processor of claim 10, further comprising: a scalar function unit operatively coupled to the scalar register file and configured to access a scalar from the scalar register file; and a vector function unit operatively coupled to the vector register file and configured to access a vector from the vector register file.

15. The processor of claim 14, wherein the scalar function unit is further configured to access a set of scalars in parallel from the set of registers.

16. A method for processor operation, comprising: accessing a first vector from a vector register; operating on the first vector to generate a second vector comprising a set of scalars; storing the second vector in a shared register; accessing a first scalar of the set of scalars from the shared register; and operating on the first scalar to generate a second scalar.

17. The method of claim 16, wherein a total length of the set of scalars is the same as a length of the second vector.

18. The method of claim 16, wherein the second vector consists of the set of scalars.

19. The method of claim 16, further comprising: storing the second scalar in the shared register; accessing a third vector comprising the second scalar from the shared register; and operating on the third vector.

20. A method for processor operation, comprising: accessing a first scalar from a scalar register; operating on the first scalar to generate a second scalar; storing the second scalar in a shared register; accessing a vector comprising the second scalar from the shared register; and operating on the vector.

Description:
SHARED REGISTER FOR VECTOR REGISTER FILE AND SCALAR REGISTER FILE

BACKGROUND

[0001] Embodiments of the present disclosure relate to processors and operations thereof.

[0002] To improve computation performance, central processing unit (CPU, or so-called microprocessor) designs typically add a vector engine (a.k.a., a vector co-processor), for example, to a digital signal processor (DSP). Vector engines can greatly improve performance on certain workloads, notably scientific computations, multimedia and graphics applications, and similar tasks. Vector engines can support the single-instruction-multiple-data (SIMD) execution model, which reduces the instruction bandwidth.

SUMMARY

[0003] Embodiments of processors and operations thereof are disclosed herein.

[0004] In one example, a processor includes a scalar register configured to store a first scalar, a vector register configured to store a first vector, a shared register configured to store a second vector including a set of second scalars, a scalar function unit operatively coupled to the scalar register and the shared register, and a vector function unit operatively coupled to the vector register and the shared register. The scalar function unit is configured to access the first scalar from the scalar register and the set of second scalars from the shared register. The vector function unit is configured to access the first vector from the vector register and the second vector from the shared register.

[0005] In another example, a processor includes a scalar register file including a plurality of scalar registers, and a vector register file including a plurality of vector registers. The scalar register file and the vector register file share a set of registers.

[0006] In still another example, a method for processor operation is disclosed. A first vector is accessed from a vector register. The first vector is operated on to generate a second vector including a set of scalars. The second vector is stored in a shared register. A first scalar of the set of scalars is accessed from the shared register. The first scalar is operated on to generate a second scalar.

[0007] In yet another example, a method for processor operation is disclosed. A first scalar is accessed from a scalar register. The first scalar is operated on to generate a second scalar. The second scalar is stored in a shared register. A vector including the second scalar is accessed from the shared register. The vector is operated on.
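
The two method flows summarized in [0006] and [0007] can be illustrated with a short sketch (not part of the disclosure; register widths, values, and operations are invented for illustration), using plain Python lists to stand in for the registers:

```python
# Hypothetical sketch of the two method flows, with a list standing in for
# the shared register. All names and values are illustrative only.

shared = [0, 0, 0, 0]          # shared register: a vector of scalars

# Flow of [0006]: vector unit -> shared register -> scalar unit
first_vector = [1, 2, 3, 4]                     # accessed from a vector register
second_vector = [x * 2 for x in first_vector]   # vector unit operates on it
shared[:] = second_vector                       # stored in the shared register
first_scalar = shared[0]                        # scalar unit accesses one element
second_scalar = first_scalar + 10               # scalar unit operates on it
assert second_scalar == 12

# Flow of [0007]: scalar unit -> shared register -> vector unit
shared[1] = second_scalar       # scalar result stored in the shared register
vector = shared[:]              # vector unit accesses the whole vector
assert vector == [2, 12, 6, 8]
```

The point of both flows is that no trip through data memory is needed: the shared register is visible to both function units.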

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments of the present disclosure and, together with the description, further serve to explain the principles of the present disclosure and to enable a person skilled in the pertinent art to make and use the present disclosure.

[0009] FIG. 1 illustrates a block diagram of an exemplary system having a system-on-a-chip (SoC), according to some embodiments of the present disclosure.

[0010] FIG. 2 illustrates a detailed block diagram of an exemplary SoC in the system of FIG. 1, according to some embodiments of the present disclosure.

[0011] FIGs. 3A-3C illustrate various exemplary shared registers for a vector register file and a scalar register file, according to various embodiments of the present disclosure.

[0012] FIG. 4 illustrates a flow chart of an exemplary method for processor operation using a shared register, according to some embodiments of the present disclosure.

DETAILED DESCRIPTION

[0013] Although specific configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. A person skilled in the pertinent art will recognize that other configurations and arrangements can be used without departing from the spirit and scope of the present disclosure. It will be apparent to a person skilled in the pertinent art that the present disclosure can also be employed in a variety of other applications.

[0014] It is noted that references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” “some embodiments,” “certain embodiments,” etc., indicate that one or more embodiments described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

[0015] In general, terminology may be understood at least in part from usage in context. For example, the term "one or more" as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures, or characteristics in a plural sense. Similarly, terms such as "a," "an," or "the," again, may be understood to convey a singular or a plural usage, depending at least in part upon context. In addition, the terms "based on," "based upon," and terms with similar meaning may be understood as not necessarily intended to convey an exclusive set of factors and may instead allow for the existence of additional factors not necessarily expressly described, again depending at least in part on context.

[0016] Various aspects of the present disclosure will now be described with reference to various apparatus and methods. These apparatus and methods will be described in the following detailed description and illustrated in the accompanying drawings by various blocks, modules, units, components, circuits, steps, operations, processes, algorithms, etc. (collectively referred to as “elements”). These elements may be implemented using electronic hardware, firmware, computer software, or any combination thereof. Whether such elements are implemented as hardware, firmware, or software depends upon the particular application and design constraints imposed on the overall system.

[0017] Vector processing is a data processing method that implements a set of computer instructions on one-dimensional arrays of data, as opposed to single data items. Such a data array is also known as a "vector." Vector processing avoids the overhead of the loop-control mechanism that occurs in general-purpose computers. Each vector includes multiple data elements, which can be of different data types, e.g., fixed-point numbers (short integers and long integers), floating-point numbers, etc., and have different accuracies, e.g., 8-bit, 12-bit, 2-bit, 32-bit, etc.
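
As a rough illustration of this distinction (not part of the disclosure; the function names and data are invented), a vector operation conceptually replaces the per-element loop and its loop-control overhead:

```python
# Hypothetical sketch: scalar vs. vector processing of two one-dimensional
# arrays. Names and sizes are illustrative only.

def scalar_add(a, b):
    # Scalar processing: one addition per loop iteration, with the
    # loop-control overhead (index update, bounds check) on every element.
    result = []
    for i in range(len(a)):
        result.append(a[i] + b[i])
    return result

def vector_add(a, b):
    # Vector (SIMD-style) processing: conceptually a single instruction
    # operates on all elements of the arrays at once.
    return [x + y for x, y in zip(a, b)]

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
assert scalar_add(a, b) == vector_add(a, b) == [11, 22, 33, 44, 55, 66, 77, 88]
```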

[0018] Known CPU architectures add vector engines with vector processing capability to improve computation performance and thus also add a vector register file (reg-file), dedicated to vector processing, alongside the existing scalar register file (a.k.a. the generic/general-purpose register file). However, the communication between the scalar register file and the vector register file becomes a performance bottleneck: it is so slow that the most effective way to transfer data between the two is actually through data memory, which imposes a dramatic overhead in both computation performance and battery power consumption.

[0019] On the other hand, each processor needs to react quickly to interrupts generated by external hardware blocks. Most of the time, the major bottleneck in interrupt response is the context switch, which needs to save the current entire register file to data memory (one register at a time) and then load the content of the interrupt service from data memory into the register file (again one by one). To improve the context-switch speed, known processors use a much larger register file to hold more than one set of contexts, yet still access only one set at any given time; when an interrupt happens, the processor switches to another set of registers instead of saving/loading through data memory. This helps to reduce latency, at the price of a dramatically increased register file size, which is a significant cost increase in silicon area. Also, a large register file can slow down each access, thereby preventing the CPU from running at a higher clock frequency and hurting overall system performance.

[0020] Various embodiments in accordance with the present disclosure introduce shared registers that can be used for both the scalar register file and the vector register file of a processor. A portion of the vector register file, which is in effect a set of scalar registers, can make the communication between the scalar function units and the vector function units much more efficient by removing the overhead of transferring data between them. Moreover, as part of the vector register file, the set of scalar registers in the shared registers can be accessed in parallel over a wide bus. A wide-bus-accessed scalar register file can make context switches much faster, which is a significant improvement in interrupt response capability.

[0021] FIG. 1 illustrates a block diagram of an exemplary system 100 having an SoC 102, according to some embodiments of the present disclosure. System 100 may include SoC 102 having a processor 108 and a primary memory 110, a bus 104, and a secondary memory 106. System 100 may be applied or integrated into various systems and apparatus capable of high-speed data processing, such as computers and wireless communication devices. For example, system 100 may be part of a mobile phone, a desktop computer, a laptop computer, a tablet, a vehicle computer, a gaming console, a printer, a positioning device, a wearable electronic device, a smart sensor, a virtual reality (VR) device, an augmented reality (AR) device, or any other suitable electronic device having high-speed data processing capability. Using a wireless communication device as an example, SoC 102 may serve as an application processor (AP) and/or a baseband processor (BP) that imports data and instructions from secondary memory 106, executes the instructions to perform various mathematical and logical calculations on the data, and exports the calculation results for further processing and transmission over cellular networks.

[0022] As shown in FIG. 1, secondary memory 106 may be located outside SoC 102 and operatively coupled to SoC 102 through bus 104. Secondary memory 106 may receive and store data of different types from various sources via communication channels (e.g., bus 104). For example, secondary memory 106 may receive and store digital imaging data captured by a camera of the wireless communication device, voice data transmitted via cellular networks, such as a phone call from another user, or text data input by the user of the system through an interactive input device, such as a touch panel, a keyboard, or the like. Secondary memory 106 may also receive and store computer instructions to be loaded to processor 108 for data processing. Such instructions may be in the form of an instruction set, which contains discrete instructions that teach the microprocessor or other functional components of the microcontroller chip to perform one or more of the following types of operations: data handling and memory operations, arithmetic and logic operations, control flow operations, co-processor operations, etc. Secondary memory 106 may be provided as a standalone component in or attached to the apparatus, such as a hard drive, a Flash drive, a solid-state drive (SSD), or the like. Other types of memory compatible with the present disclosure may also be conceived. It is understood that secondary memory 106 may not be the only component capable of storing data and instructions. Primary memory 110 may also store data and instructions and, unlike secondary memory 106, may have direct access to processor 108. Secondary memory 106 may be a non-volatile memory, which can keep the stored data even when power is lost. In contrast, primary memory 110 may be a volatile memory, and its data may be lost once power is lost. Because of this difference in structure and design, each type of memory may have its own dedicated use within the system.

[0023] Data between secondary memory 106 and SoC 102 may be transmitted via bus 104.

Bus 104 functions as a highway that allows data to move between various nodes, e.g., memory, microprocessor, transceiver, user interface, or other sub-components in system 100, according to some embodiments. Bus 104 can be serial or parallel. Bus 104 can also be implemented by hardware (such as electrical wires, optical fiber, etc.). It is understood that bus 104 can have sufficient bandwidth for storing and loading a large amount of data (e.g., vectors) between secondary memory 106 and primary memory 110 without delay to the data processing by processor 108.

[0024] SoC designs may integrate one or more components for computation and processing on an integrated-circuit (IC) substrate. For applications where chip size matters, such as smartphones and wearable gadgets, SoC design is an ideal choice because of its compact area. It further has the advantage of low power consumption. In some embodiments, as shown in FIG. 1, one or more processors 108 and primary memory 110 are integrated into SoC 102. It is understood that in some examples, primary memory 110 and processor 108 may not be integrated on the same chip, but instead on separate chips.

[0025] Processor 108 may include any suitable specialized processor including, but not limited to, CPU, graphic processing unit (GPU), DSP, tensor processing unit (TPU), vision processing unit (VPU), neural processing unit (NPU), synergistic processing unit (SPU), physics processing unit (PPU), and image signal processor (ISP). Processor 108 may also include a microcontroller unit (MCU), which can handle a specific operation in an embedded system. In some embodiments in which system 100 is used in wireless communications, each MCU handles a specific operation of a mobile device, for example, communications other than cellular communication (e.g., Bluetooth communication, Wi-Fi communication, frequency modulation (FM) radio, etc.), power management, display drive, positioning and navigation, touch screen, camera, etc.

[0026] As shown in FIG. 1, processor 108 may include one or more processing cores 112 (a.k.a. "cores"), a register array 114, and a control module 116. In some embodiments, processing core 112 may include one or more functional units that perform various data operations. For example, processing core 112 may include an arithmetic logic unit (ALU) that performs arithmetic and bitwise operations on data (also known as "operands"), such as addition, subtraction, increment, decrement, AND, OR, Exclusive-OR, etc. Processing core 112 may also include a floating-point unit (FPU) that performs similar arithmetic operations but on a type of operand (e.g., floating-point numbers) different from those operated on by the ALU (e.g., binary numbers). The operations may be addition, subtraction, multiplication, etc. As described below in detail, another way of categorizing the functional units may be based on whether the data processed by the functional unit is a scalar or a vector. For example, processing cores 112 may include scalar function units (SFUs) for handling scalar operations and vector function units (VFUs) for handling vector operations. It is understood that in the case that processor 108 includes multiple processing cores 112, each processing core 112 may carry out data and instruction operations in serial or in parallel. This multi-core design can effectively enhance the processing speed of processor 108 and multiply its performance. In some embodiments, processor 108 may be a CPU with a vector co-processor (a vector engine) that can handle both scalar operations and vector operations.

[0027] Register array 114 may be operatively coupled to processing core 112 and primary memory 110 and may include multiple sets of registers for various purposes. Because of its architectural design and proximity to processing core 112, register array 114 allows processor 108 to access data, execute instructions, and transfer computation results faster than primary memory 110, according to some embodiments. In some embodiments, register array 114 includes a plurality of physical registers fabricated on SoC 102, such as fast static random-access memory (RAM) having multiple transistors and multiple dedicated read and write ports for high-speed processing and simultaneous read and/or write operations, thus distinguishing it from primary memory 110 and secondary memory 106 (such as dynamic random-access memory (DRAM), a hard drive, or the like). The register size may be measured by the number of bits a register can hold (e.g., 4 bits, 8 bits, 16 bits, 32 bits, 64 bits, 128 bits, 256 bits, 512 bits, etc.). In some embodiments, register array 114 serves as an intermediary memory placed between primary memory 110 and processing core 112. For example, register array 114 may hold frequently used programs or processing tools so that access time to these data can be reduced, thus increasing the processing speed of processor 108 while also reducing power consumption of SoC 102. In another example, register array 114 may store data being operated on by processing core 112, thus reducing delay in accessing the data from primary memory 110. Registers of this type are known as data registers. Another type is address registers, which may hold addresses and may be used by instructions for indirect access of primary memory 110. There are also status registers that decide whether a certain instruction should be executed, such as the control and status register (CSR). In some embodiments, at least part of register array 114 is implemented by one or more physical register files (PRFs) within processor 108.

[0028] Consistent with the scope of the present disclosure, in some embodiments, register array 114 includes a vector register file having scalar registers and a scalar register file having vector registers, where the vector register file and the scalar register file share a set of registers configured to be used either as a set of scalar registers by the scalar register file or as a vector register by the vector register file. In other words, for example, register array 114 may include a scalar register configured to store a first scalar, a vector register configured to store a first vector, and a shared register configured to store a second vector including a set of second scalars.

[0029] Control module 116 may be operatively coupled to primary memory 110 and processing core 112. Control module 116 may be implemented by circuits fabricated on the same semiconductor chip as processing core 112. Control module 116 may serve a role similar to that of a command tower. For example, control module 116 may retrieve and decode various computer instructions from primary memory 110 for processing core 112 and instruct processing core 112 as to which processes are to be carried out on operands loaded from primary memory 110. Computer instructions may be in the form of a computer instruction set. Different computer instructions may have a different impact on the performance of processor 108. For example, instructions from a reduced instruction set computer (RISC) are generally simpler than those from a complex instruction set computer (CISC) and thus may be used to achieve fewer cycles per instruction, therefore reducing the processing time of processor 108. Examples of processes carried out by processor 108 include setting a register to a fixed value, copying data from a memory location to a register, copying data between registers, adding, subtracting, multiplying, and dividing, comparing values stored in two different registers, etc. In some embodiments, control module 116 may further include an instruction decoder (not shown in FIG. 1, described below in detail) that decodes the computer instructions into instructions readable by other components on processor 108, such as processing core 112. The decoded instructions may be subsequently provided to processing core 112.

[0030] It is understood that additional components, although not shown in FIG. 1, may be included in SoC 102 as well, such as interfacing components for data loading, storing, routing, or multiplexing within SoC 102, as described below in detail with respect to FIG. 2.

[0031] FIG. 2 illustrates a detailed block diagram of exemplary SoC 102 in system 100 of FIG. 1, according to some embodiments of the present disclosure. As shown in FIG. 2, processor 108 may be configured to handle both vectors and scalars by including one or more vector function units (VFUs) 202 and one or more scalar function units (SFUs) 204. Each VFU 202 or SFU 204 may be fully pipelined and can perform arithmetic or logic operations on vectors or scalars, respectively. VFUs 202 and SFUs 204 may be parts of processing core 112 in FIG. 1.

[0032] Processor 108 in FIG. 2 may further include vector registers 206 and scalar registers 208 configured to store vectors and scalars, respectively, as parts of register array 114 in FIG. 1. Processor 108 may further include one or more multiplexer units (MUXs) 218 operatively coupled to vector registers 206 and VFUs 202 and configured to select between multiple input data to output. For example, two MUXs 218 may each select a vector operand from vector registers 206 and output it to VFUs 202 for vector operations on the two vector operands, and one MUX 218 may select an operation result and output it back to vector registers 206. Similarly, processor 108 may further include one or more MUXs 220 operatively coupled to scalar registers 208 and SFUs 204 and configured to select between multiple input data to output.

[0033] As shown in FIG. 2, processor 108 may further include data load/store units for moving data between primary memory 110 and registers 206 and 208, including a vector load/store unit 214 operatively coupled to vector registers 206 and a scalar load/store unit 216 operatively coupled to scalar registers 208. For example, vector load/store unit 214 may load vector operands from primary memory 110 to vector registers 206 and store vector results from vector registers 206 to primary memory 110; scalar load/store unit 216 may load scalar operands from primary memory 110 to scalar registers 208 and store scalar results from scalar registers 208 to primary memory 110. In some embodiments, as shown in FIG. 2, processor 108 may also include one or more MUXs 219 operatively coupled to vector registers 206 and vector load/store unit 214 and configured to select between multiple input data to output. For example, MUX 219 may select a vector from vector registers 206 and output it to primary memory 110. Similarly, processor 108 may further include one or more MUXs 221 operatively coupled to scalar registers 208 and scalar load/store unit 216 and configured to select between multiple input data to output. For example, MUX 221 may select a scalar from scalar registers 208 and output it to primary memory 110. As described above, data, such as scalars and vectors, can be transferred and processed in data paths having components such as vector load/store unit 214, MUXs 219, vector registers 206, MUXs 218, VFUs 202, scalar load/store unit 216, MUXs 221, scalar registers 208, MUXs 220, and SFUs 204.

[0034] Consistent with the scope of the present disclosure, in some embodiments, processor 108 further includes shared registers 209 configured to store one or more vectors each including a set of scalars. Different from scalar registers 208 and vector registers 206, which are dedicated to storing scalars and vectors, respectively, shared registers 209 may store either scalars or vectors. As described below in detail, in some embodiments, shared registers 209 include one or more sets of registers each configured to be used as a set of scalar registers or a vector register. In some embodiments, scalar registers 208 and shared registers 209 form a scalar register file 207 (scalar reg-file) of processor 108, and vector registers 206 and shared registers 209 form a vector register file 205 (vector reg-file) of processor 108. In other words, processor 108 may include vector register file 205 including scalar registers, and scalar register file 207 including vector registers; scalar register file 207 and vector register file 205 may share a set of registers, i.e., shared registers 209, which may be used either as a set of scalar registers by scalar register file 207 or as a vector register by vector register file 205. Each of scalar register file 207 and vector register file 205 may be a PRF including two read-ports and one write-port. In some embodiments, the set of scalar registers is m n-bit scalar registers, and the vector register is one (m×n)-bit vector register. For example, the set of scalar registers may be 16 32-bit scalar registers, and the vector register may be one 512-bit vector register.
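
The dual use of a shared register can be sketched as follows. This is an illustrative software model only, not the disclosed hardware; the m = 16, n = 32 sizing follows the 16 × 32-bit = 512-bit example above, and all names are invented:

```python
# Hypothetical model of a shared register: the same storage can be accessed
# as m n-bit scalar registers or as one (m*n)-bit vector register.

M, N = 16, 32            # 16 scalar lanes of 32 bits = one 512-bit vector
MASK = (1 << N) - 1      # mask for one n-bit scalar lane

class SharedRegister:
    def __init__(self):
        self.bits = 0    # one (m*n)-bit value backing both views

    # Vector-register view (used by the vector register file / VFU)
    def write_vector(self, value):
        self.bits = value & ((1 << (M * N)) - 1)

    def read_vector(self):
        return self.bits

    # Scalar-register view (used by the scalar register file / SFU)
    def write_scalar(self, i, value):
        self.bits = (self.bits & ~(MASK << (i * N))) | ((value & MASK) << (i * N))

    def read_scalar(self, i):
        return (self.bits >> (i * N)) & MASK

r = SharedRegister()
r.write_scalar(0, 0xDEADBEEF)   # scalar unit writes lane 0
r.write_scalar(1, 0x12345678)   # and lane 1
# The vector unit sees both scalars packed into one vector, with no copy
# through data memory.
assert r.read_vector() == (0x12345678 << 32) | 0xDEADBEEF
```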

[0035] As shown in FIG. 2, SFUs 204 may be operatively coupled to scalar registers 208 and shared registers 209 and configured to access both a first scalar from scalar registers 208 and a set of second scalars from shared registers 209; VFUs 202 may be operatively coupled to vector registers 206 and shared registers 209 and configured to access a first vector from vector registers 206 and a second vector from shared registers 209. The second vector may include the set of second scalars. In some embodiments, the second vector consists of the set of second scalars. That is, the total length of the set of second scalars may be the same as the length of the second vector. For example, the set of second scalars may be m n-bit scalars, and the vector may be one (m×n)-bit vector. In one example, the set of second scalars may be 16 32-bit scalars, and the vector may be one 512-bit vector. In some embodiments, SFUs 204 may be operatively coupled to scalar register file 207 and configured to access a scalar from scalar register file 207, and VFUs 202 may be operatively coupled to vector register file 205 and configured to access a vector from vector register file 205.

[0036] The scalars stored in shared registers 209, forming a vector, may be accessible by SFUs 204 in parallel to increase the bandwidth, which can, for example, make context switching faster and improve the interrupt response capability. In some embodiments, SFUs 204 are configured to access a set of scalars (e.g., forming a vector) in parallel from shared registers 209, e.g., the set of registers shared by scalar register file 207 and vector register file 205.

[0037] As shown in FIG. 2, in some embodiments, control module 116 of processor 108 may include an instruction fetch unit 211 and an instruction decode unit 212 (a.k.a. instruction processing unit (IPU) collectively). Instruction fetch unit 211 may be operatively coupled to primary memory 110 and configured to fetch instructions from primary memory 110 that are to be processed by processor 108. Instruction decode unit 212 may be operatively coupled to instruction fetch unit 211 and each of the components in the data paths described above and configured to decode each instruction and control the operations of each component in the data paths described above based on the decoded instruction, as described below in detail. Consistent with the scope of the present disclosure, in some embodiments, for vector processing, the instruction may be a SIMD instruction or a vector instruction. Instruction decode unit 212 may determine whether the fetched instruction is a scalar instruction or a vector instruction. If it is a scalar instruction, scalar processing may be performed by SFUs 204 in conjunction with scalar registers 208 and shared registers 209. If it is a vector instruction, vector processing may be performed by VFUs 202 in conjunction with vector registers 206 and shared registers 209. For example, the address of a vector operand in primary memory 110 may be determined from the decoded instruction and provided to vector load/store unit 214 to load the vector operand into vector registers 206. At the same time, the decoded instruction may be provided to VFUs 202 to operate on the vector operand from vector registers 206. Similarly, for example, the address of a scalar operand in primary memory 110 may be determined from the decoded instruction and provided to scalar load/store unit 216 to load the scalar operand into scalar registers 208. At the same time, the decoded instruction may be provided to SFUs 204 to operate on the scalar operand from scalar registers 208.
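The decode-and-dispatch behavior just described can be modeled as a small routing function. This is a hypothetical sketch, not the claimed circuitry; the `Instruction` type and the `is_vector` flag are illustrative assumptions standing in for whatever encoding the decode unit actually inspects.

```python
# Hypothetical model of instruction decode unit 212's routing decision:
# a vector instruction goes to the VFU data path, a scalar instruction
# to the SFU data path. Field names here are assumptions.
from dataclasses import dataclass

@dataclass
class Instruction:
    opcode: str
    is_vector: bool  # decoded instruction class: vector vs. scalar

def dispatch(inst: Instruction) -> str:
    """Route a decoded instruction to the appropriate function unit."""
    if inst.is_vector:
        # vector processing: VFUs with vector registers + shared registers
        return "VFU"
    # scalar processing: SFUs with scalar registers + shared registers
    return "SFU"

assert dispatch(Instruction("vadd", True)) == "VFU"
assert dispatch(Instruction("add", False)) == "SFU"
```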

[0038] FIGs. 3A-3C illustrate various exemplary shared registers for a vector register file and a scalar register file, according to various embodiments of the present disclosure. Each of FIGs. 3A-3C shows one example of SoC 102 with a different design of shared registers 209 for vector register file 205 and scalar register file 207. As shown in FIG. 3A, scalar register file 207 may include a plurality of scalar registers (32 scalar registers from R0 to R31 in this example). In some embodiments, each scalar register R0 to R31 is configured to store a scalar. A scalar may have a single data element, as opposed to multiple data elements. A scalar may have a length, i.e., the accuracy of the data element (e.g., 8 bits, 12 bits, 16 bits, 24 bits, 32 bits, 64 bits, etc.). Accordingly, each scalar register R0 to R31 may be an n-bit scalar register to store the respective scalar with a length of n bits (also the accuracy of the data element of the scalar), where n is a positive integer.

[0039] On the other hand, vector register file 205 may include a plurality of vector registers (32 vector registers from V0 to V31 in this example). In some embodiments, each vector register V0 to V31 is configured to store a vector. A vector may have multiple data elements, as opposed to a single data element. That is, a vector may include a set of scalars, each of which is one element of the vector. In other words, a scalar may be an element of a field that is used to define a vector. In some embodiments, a vector consists of a set of scalars. Thus, a vector may have a length that is the same as the total length of the set of scalars. For example, the length of a vector consisting of 16 32-bit scalars may be 512 bits. Accordingly, each vector register V0 to V31 may consist of m n-bit scalar registers to store the respective vector with a length of (m×n) bits, where each of m and n may be a positive integer.

[0040] As shown in FIG. 3A, scalar register file 207 and vector register file 205 share a set of registers R16 to R31 in shared registers 209, according to some embodiments. In other words, scalar register file 207 may include a set of scalar registers R0 to R15 each configured to store a scalar, and shared registers 209 configured to store a vector including a set of scalars; vector register file 205 may include a set of vector registers V1 to V31 each configured to store a vector, and shared registers 209 configured to store a vector including a set of scalars. As described above, in some embodiments, since a vector may consist of a set of scalars, shared registers 209 include one vector register V0 consisting of a set of scalar registers (R16 to R31 in this example). That is, the set of registers in shared registers 209 may be configured to be used as a set of m n-bit scalar registers by scalar register file 207 or as one (m×n)-bit vector register by vector register file 205. In one example, each of 16 scalar registers R16 to R31 may be a 32-bit scalar register, such that vector register V0 may be a 512-bit vector register.
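The aliasing in FIG. 3A (one backing store visible as scalar registers R16 to R31 or as vector register V0) can be sketched as a software model. This is a minimal illustrative model, not the disclosed hardware; the class and method names are assumptions made for the sketch.

```python
# Minimal model of the FIG. 3A sharing: one array of 16 32-bit cells is
# visible both as scalar registers R16..R31 (scalar register file view)
# and as the single vector register V0 (vector register file view).
class SharedRegisters:
    def __init__(self, m=16, n=32):
        self.m, self.n = m, n
        self.cells = [0] * m  # one n-bit cell per shared scalar register

    # scalar-register-file view: R16..R31 map to cells 0..15
    def read_scalar(self, r):
        return self.cells[r - 16]

    def write_scalar(self, r, value):
        self.cells[r - 16] = value & ((1 << self.n) - 1)

    # vector-register-file view: V0 is all m cells taken together
    def read_vector(self):
        return list(self.cells)

    def write_vector(self, elems):
        self.cells = [e & ((1 << self.n) - 1) for e in elems]

regs = SharedRegisters()
regs.write_vector(list(range(16)))   # a VFU writes vector V0
assert regs.read_scalar(16) == 0     # an SFU sees element 0 in R16
regs.write_scalar(31, 99)            # an SFU updates R31
assert regs.read_vector()[15] == 99  # the VFU sees the change in V0
```

Because both views address the same cells, a write through either view is immediately visible through the other, with no copy.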

[0041] Different from some known processors in which the SFUs can only access the scalar register file, and the VFUs can only access the vector register file, by introducing shared registers 209, a portion of register array 114 of processor 108 can be accessed by both SFUs 204 and VFUs 202. For example, one vector register V0 in vector register file 205 may actually be the combination of a set of scalar registers R16 to R31 in scalar register file 207. As a result, data transfer between scalar register file 207 and vector register file 205 (as well as between SFUs 204 and VFUs 202) does not have to go through primary memory 110, but instead can be done internally via shared registers 209, thereby significantly reducing the overhead in both computation performance and battery power consumption.

[0042] For example, conventionally, when moving the scalars of a vector between vector register file 205 and scalar register file 207, a first data load instruction may need to be used to first load the scalars of the vector from vector register file 205 or scalar register file 207 into primary memory 110 or secondary memory 106, followed by a second data load instruction to load the scalars from primary memory 110 or secondary memory 106 into scalar register file 207 or vector register file 205. Alternatively, multiple data load instructions may need to be used one by one to move each scalar of the vector between vector register file 205 and scalar register file 207 in series. By implementing shared registers 209 disclosed herein, data transfer between vector register file 205 and scalar register file 207 can be achieved without the above-mentioned data load instructions. A vector having scalars to be used by SFUs 204 may be stored into shared registers 209, such that the scalars can then be accessed by SFUs 204 or scalar load/store unit 216 directly. Similarly, a set of scalars forming a vector to be used by VFUs 202 may be stored into shared registers 209, such that the vector can then be accessed by VFUs 202 or vector load/store unit 214 directly. That is, any scalars or vectors that need to be transferred between scalar register file 207 and vector register file 205 (as well as between SFUs 204 and VFUs 202) can be temporarily stored in shared registers 209.
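The contrast between the conventional memory round trip and the shared-register path can be sketched with toy data structures. The models and instruction counts below are illustrative assumptions for exposition only, not measured figures or the actual instruction set.

```python
# Illustrative comparison of the two transfer paths: a memory round trip
# versus a single write into the shared registers. All structures are
# simplified stand-ins for primary memory 110 and the register files.
memory = {}                                        # models primary memory 110
vector_regs = {"V1": [i * 3 for i in range(16)]}   # vector register file
scalar_regs = {}                                   # scalar register file
shared = [0] * 16                                  # shared registers 209 (V0 = R16..R31)

def move_conventional(src):
    """Store the vector to memory, then load it back scalar by scalar."""
    memory["buf"] = list(vector_regs[src])          # vector store instruction
    for i, s in enumerate(memory["buf"]):           # one scalar load each
        scalar_regs[f"R{i}"] = s
    return 1 + len(memory["buf"])                   # rough instruction count

def move_via_shared(src):
    """Write the vector once into the shared registers; SFUs read directly."""
    shared[:] = vector_regs[src]
    return 1

assert move_via_shared("V1") < move_conventional("V1")
assert shared == vector_regs["V1"]   # scalars now directly visible to SFUs
```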

[0043] In some embodiments, VFUs 202 are configured to operate on at least a first vector accessed from a vector register V1 to V31 to generate a second vector, and store the second vector including a set of second scalars in the vector register V0 of shared registers 209. SFUs 204 are configured to access one or more second scalars of the set of second scalars (e.g., stored in scalar registers R16 to R31) from shared registers 209, and operate on the one or more second scalars, according to some embodiments.

[0044] In some embodiments, SFUs 204 are configured to operate on at least a first scalar accessed from a scalar register R0 to R15 to generate a second scalar, and store the second scalar in a scalar register R16 to R31 of shared registers 209 as part of a set of second scalars. VFUs 202 are configured to access a second vector including the second scalar from the vector register V0 of shared registers 209, and operate on at least the second vector.

[0045] It is understood that in some examples, SoC 102 may still support the legacy data transfer instruction described above used by some known processors. For example, when instruction fetch unit 211 of control module 116 receives an instruction to move a target scalar in a target vector stored in a vector register to a scalar register, instruction decode unit 212 of control module 116 may be configured to decode the instruction and provide a control signal to vector register file 205 (including vector registers V1 to V31 and shared registers 209), such that vector register file 205 is configured to, in response to the control signal, move the target vector including the target scalar to shared registers 209, which are also part of scalar register file 207 (scalar registers R16 to R31).

[0046] Moreover, in some embodiments, since the set of scalar registers R16 to R31 in shared registers 209 is also a vector register V0 that can be accessed as a whole to load the vector stored therein, the set of scalar registers R16 to R31 in shared registers 209 is also parallel-accessible. That is, SFUs 204 may be configured to access a set of scalars in parallel from the set of scalar registers R16 to R31 in shared registers 209. Similarly, scalar load/store unit 216 may be configured to access a set of scalars in parallel from the set of scalar registers R16 to R31 in shared registers 209 as well. For example, as shown in FIG. 3A, the buses between scalar register file 207 and SFUs 204 and scalar load/store unit 216, respectively, may each include both a 32-bit bus for accessing one 32-bit scalar in series from scalar registers R0 to R15 and a 512-bit wide bus for accessing a set of 16 scalars in parallel from scalar registers R16 to R31. As a result, the wide-bus access to scalar register file 207 can make context switching much faster, which significantly improves interrupt response capability.
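The wide-bus benefit described above can be sketched by counting bus transactions when the 16 shared scalar registers are saved, e.g., during a context switch. The transaction model is an illustrative assumption, not a cycle-accurate figure.

```python
# Sketch of the parallel-access advantage: saving the 16 shared scalar
# registers R16..R31 takes one 512-bit transaction over the wide bus
# instead of sixteen 32-bit transactions over the scalar bus.
shared = list(range(16))  # contents of the 16 32-bit shared registers

def save_serial(regs):
    """One 32-bit bus transaction per register (narrow scalar bus)."""
    return [("read32", i) for i in range(len(regs))]

def save_parallel(regs):
    """A single 512-bit transaction reads all registers as vector V0."""
    return [("read512", tuple(regs))]

assert len(save_parallel(shared)) == 1   # one wide transaction
assert len(save_serial(shared)) == 16    # sixteen narrow transactions
```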

[0047] Although shared registers 209 in FIG. 3A can be used as one vector register V0, it is understood that the design of shared registers 209 may vary in other examples. In one example, as shown in FIG. 3B, shared registers 209 may be used as multiple vector registers (two vector registers V0 and V1 in this example), each of which may include a set of scalar registers (V0 = R0 to R15, and V1 = R16 to R31 in this example). That is, multiple sets of registers may be shared by scalar register file 207 and vector register file 205, and each set of registers may be configured to be used as a set of scalar registers by scalar register file 207 or as a vector register by vector register file 205. The multiple sets of registers of shared registers 209 may be accessed in parallel as a whole (e.g., using a 1,024-bit wide bus, not shown), or each set of registers of shared registers 209 may be accessed in parallel as a whole (e.g., using a 512-bit wide bus as shown in FIG. 3B). In another example, as shown in FIG. 3C, the vector register V0' in shared registers 209 may have a shorter length than the other vector registers V1 to V31 in vector register file 205. For example, the vector register V0' in shared registers 209 may be a 256-bit vector register consisting of 8 32-bit scalar registers R24 to R31, instead of a 512-bit vector register consisting of 16 32-bit scalar registers R16 to R31. Accordingly, the bus for accessing the 8 32-bit scalar registers R24 to R31 in shared registers 209 by SFUs 204 or scalar load/store unit 216 may include a 256-bit bus as well.

[0048] It is also understood that the external data path for transferring data between scalar register file 207 and vector register file 205 (and between SFUs 204 and VFUs 202) is not limited to the example in FIG. 3A (i.e., including only primary memory 110), but instead, may include primary memory 110 and/or secondary memory 106 (e.g., as shown in the example of FIG. 3D that includes both primary memory 110 and secondary memory 106).

[0049] FIG. 4 illustrates a flow chart of an exemplary method 400 for processor operation using a shared register, according to some embodiments of the present disclosure. Examples of the apparatus that can perform operations of method 400 include, for example, processor 108 depicted in FIGs. 1, 2, and 3A-3D or any other suitable apparatus disclosed herein. It is understood that the operations shown in method 400 are not exhaustive and that other operations can be performed as well before, after, or between any of the illustrated operations. Further, some of the operations may be performed simultaneously, or in a different order than shown in FIG. 4.

[0050] Referring to FIG. 4, method 400 starts at operation 402, in which a first vector is accessed from a vector register. For example, as shown in FIG. 3A, a first vector may be accessed from a vector register V1 to V31 of vector register file 205 by VFUs 202. Method 400 proceeds to operation 404, as illustrated in FIG. 4, in which the first vector is operated on to generate a second vector including a set of scalars. The total length of the set of scalars may be the same as the length of the second vector. For example, as shown in FIG. 3A, the first vector may be operated on by VFUs 202 to generate a second vector consisting of a set of scalars. Method 400 proceeds to operation 406, as illustrated in FIG. 4, in which the second vector is stored in a shared register. For example, as shown in FIG. 3A, the second vector consisting of the set of scalars may be stored in the vector register V0 (consisting of the set of scalar registers R16 to R31) in shared registers 209. Method 400 proceeds to operation 408, as illustrated in FIG. 4, in which a first scalar of the set of scalars is accessed from the shared register. For example, as shown in FIG. 3A, a first scalar of the set of scalars stored in a scalar register R16 to R31 in shared registers 209 may be accessed by SFUs 204. Method 400 proceeds to operation 410, as illustrated in FIG. 4, in which the first scalar is operated on to generate a second scalar. For example, as shown in FIG. 3A, the first scalar may be operated on by SFUs 204 to generate a second scalar.
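Operations 402 through 410 can be walked through end to end with toy integer-list vectors. The specific VFU and SFU operations chosen here (element-wise doubling, incrementing) are arbitrary placeholders for illustration; method 400 does not prescribe any particular arithmetic.

```python
# Hedged walk-through of operations 402-410 of method 400 using toy data;
# the arithmetic is an arbitrary placeholder, not part of the method.
vector_regs = {"V1": [1, 2, 3, 4]}
shared = None  # the shared register (vector V0 / scalar registers R16..)

first_vector = vector_regs["V1"]               # 402: access first vector
second_vector = [x * 2 for x in first_vector]  # 404: VFU generates second vector
shared = second_vector                         # 406: store it in shared register
first_scalar = shared[0]                       # 408: SFU reads one scalar from it
second_scalar = first_scalar + 1               # 410: SFU generates second scalar

assert shared == [2, 4, 6, 8]
assert second_scalar == 3
```

No step between 404 and 408 touches memory: the SFU reads in operation 408 exactly what the VFU wrote in operation 406.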

[0051] In some embodiments, method 400 proceeds to operation 412, as illustrated in FIG. 4, in which the second scalar is stored in the shared register. For example, as shown in FIG. 3A, the second scalar may be stored in a scalar register R16 to R31 in shared registers 209. Method 400 proceeds to operation 414, as illustrated in FIG. 4, in which a third vector including the second scalar is accessed from the shared register. For example, as shown in FIG. 3A, a third vector including the second scalar stored in the vector register V0 (consisting of the set of scalar registers R16 to R31) in shared registers 209 may be accessed by VFUs 202. Method 400 proceeds to operation 416, as illustrated in FIG. 4, in which the third vector including the second scalar is operated on. For example, as shown in FIG. 3A, the third vector including the second scalar may be operated on by VFUs 202.

[0052] In various aspects of the present disclosure, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or encoded as instructions or code on a non-transitory computer-readable medium. Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computing device, such as system 100 in FIG. 1. By way of example, and not limitation, such computer-readable media can include RAM, read-only memory (ROM), electrically erasable programmable ROM (EEPROM), compact disc-ROM (CD-ROM) or other optical disk storage, hard disk drive (HDD), such as magnetic disk storage or other magnetic storage devices, flash drive, SSD, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processing system, such as a mobile device or a computer. Disk and disc, as used herein, includes CD, laser disc, optical disc, digital versatile disc (DVD), and floppy disk, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

[0053] According to one aspect of the present disclosure, a processor includes a scalar register configured to store a first scalar, a vector register configured to store a first vector, a shared register configured to store a second vector including a set of second scalars, a scalar function unit operatively coupled to the scalar register and the shared register, and a vector function unit operatively coupled to the vector register and the shared register. The scalar function unit is configured to access the first scalar from the scalar register and the set of second scalars from the shared register. The vector function unit is configured to access the first vector from the vector register and the second vector from the shared register.

[0054] In some embodiments, a total length of the set of second scalars is the same as a length of the second vector.

[0055] In some embodiments, the second vector consists of the set of second scalars.

[0056] In some embodiments, the scalar register and the shared register form a scalar register file of the processor, and the vector register and the shared register form a vector register file of the processor.

[0057] In some embodiments, each of the scalar register file and the vector register file includes two read-ports and one write-port.

[0058] In some embodiments, the scalar function unit is further configured to access the set of second scalars in parallel from the shared register.

[0059] In some embodiments, the vector function unit is further configured to operate on the first vector accessed from the vector register to generate the second vector, and store the second vector including the set of second scalars in the shared register. In some embodiments, the scalar function unit is further configured to access one or more second scalars of the set of second scalars from the shared register, and operate on the one or more second scalars.

[0060] In some embodiments, the scalar function unit is further configured to operate on the first scalar accessed from the scalar register to generate a second scalar, and store the second scalar in the shared register as part of the set of second scalars. In some embodiments, the vector function unit is further configured to access the second vector including the second scalar from the shared register, and operate on the second vector.

[0061] In some embodiments, the processor further includes an instruction decode unit configured to decode an instruction to move a target scalar in a target vector stored in the vector register to the scalar register to provide a control signal to the vector register and the shared register. In some embodiments, the vector register and the shared register are configured to, in response to the control signal, move the target vector including the target scalar to the shared register.

[0062] According to another aspect of the present disclosure, a processor includes a scalar register file including a plurality of scalar registers, and a vector register file including a plurality of vector registers. The scalar register file and the vector register file share a set of registers.

[0063] In some embodiments, the set of registers is configured to be used as a set of scalar registers by the scalar register file or a vector register used by the vector register file.

[0064] In some embodiments, the set of scalar registers is m n-bit scalar registers, and the vector register is one (m×n)-bit vector register.

[0065] In some embodiments, each of the scalar register file and the vector register file includes two read-ports and one write-port.

[0066] In some embodiments, the processor further includes a scalar function unit operatively coupled to the scalar register file and configured to access a scalar from the scalar register file, and a vector function unit operatively coupled to the vector register file and configured to access a vector from the vector register file.

[0067] In some embodiments, the scalar function unit is further configured to access a set of scalars in parallel from the set of registers.

[0068] According to still another aspect of the present disclosure, a method for processor operation is disclosed. A first vector is accessed from a vector register. The first vector is operated on to generate a second vector including a set of scalars. The second vector is stored in a shared register. A first scalar of the set of scalars is accessed from the shared register. The first scalar is operated on to generate a second scalar.

[0069] In some embodiments, a total length of the set of scalars is the same as a length of the second vector.

[0070] In some embodiments, the second vector consists of the set of scalars.

[0071] In some embodiments, the second scalar is stored in the shared register, a third vector including the second scalar is accessed from the shared register, and the third vector is operated on.

[0072] According to yet another aspect of the present disclosure, a method for processor operation is disclosed. A first scalar is accessed from a scalar register. The first scalar is operated on to generate a second scalar. The second scalar is stored in a shared register. A vector including the second scalar is accessed from the shared register. The vector is operated on.

[0073] The foregoing description of the specific embodiments will reveal the general nature of the present disclosure such that others can, by applying knowledge within the skill of the art, readily modify and/or adapt such specific embodiments for various applications, without undue experimentation, without departing from the general concept of the present disclosure. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.

[0074] Embodiments of the present disclosure have been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.

[0075] The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present disclosure as contemplated by the inventor(s), and thus, are not intended to limit the present disclosure and the appended claims in any way.

[0076] Various functional blocks, modules, and steps are disclosed above. The particular arrangements provided are illustrative and without limitation. Accordingly, the functional blocks, modules, and steps may be re-ordered or combined in different ways than in the examples provided above. Likewise, certain embodiments include only a subset of the functional blocks, modules, and steps, and any such subset is permitted.

[0077] The breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.