

Title:
SHARED FUNCTION-MEMORY CIRCUITRY FOR A PROCESSING CLUSTER
Document Type and Number:
WIPO Patent Application WO/2012/068478
Kind Code:
A2
Abstract:
An apparatus for performing parallel processing is provided. The apparatus has a message bus (1420), a data bus (1422), and a shared function-memory (1410). The shared function-memory (1410) has a data interface (7620, 7606, 7624-1 to 7624-R), a message interface (7626) that is coupled to the message bus (1420), a function-memory (7602), a vector-memory (7603), single-input-multiple-data (SIMD) datapaths (7605-1 to 7605-Q and 7607-1 to 7607-P), an instruction memory (7616), a data memory (7618), and a processor (7614). The data interface (7620, 7606, 7624-1 to 7624-R) is coupled to the data bus (1422). The message interface (7626) is coupled to the message bus (1420). The function-memory (7602) is coupled to the data interface (7620, 7606, 7624-1 to 7624-R) and implements lookup-tables (LUTs) and histograms. The vector-memory (7603) is coupled to the data interface (7620, 7606, 7624-1 to 7624-R) and supports operations that employ vector instructions. The SIMD datapaths (7605-1 to 7605-Q and 7607-1 to 7607-P) are coupled to the vector-memory (7603). Additionally, the processor (7614) is coupled to the data memory (7618), the instruction memory (7616), the function-memory (7602), and the vector-memory (7603).

Inventors:
NYE JEFFREY L (US)
BARTLEY DAVID H (US)
GLOTZBACH JOHN W (US)
JOHNSON WILLIAM (US)
JAYARAJ AJAY (US)
NYCHKA ROBERT J (US)
GUPTA SHALINI (US)
BUSCH STEPHEN (FR)
NAGATA TOSHIO (US)
SHEIKH HAMID (US)
CHINNAKONDA MURALI (US)
SUNDARARAJAN GANESH (US)
Application Number:
PCT/US2011/061431
Publication Date:
May 24, 2012
Filing Date:
November 18, 2011
Assignee:
TEXAS INSTRUMENTS INC (US)
TEXAS INSTRUMENTS JAPAN (JP)
NYE JEFFREY L (US)
BARTLEY DAVID H (US)
GLOTZBACH JOHN W (US)
JOHNSON WILLIAM (US)
JAYARAJ AJAY (US)
NYCHKA ROBERT J (US)
GUPTA SHALINI (US)
BUSCH STEPHEN (FR)
NAGATA TOSHIO (US)
SHEIKH HAMID (US)
CHINNAKONDA MURALI (US)
SUNDARARAJAN GANESH (US)
International Classes:
G06F13/14
Foreign References:
US20090276606A12009-11-05
US20090316798A12009-12-24
US20020174318A12002-11-21
US20060155955A12006-07-13
Attorney, Agent or Firm:
FRANZ, Warren, L. et al. (Deputy General Patent Counsel, P.O. Box 655474, Mail Station 399, Dallas TX, US)
Claims:
CLAIMS

What is claimed is:

1. An apparatus characterized by:

a message bus (1420);

a data bus (1422); and

a shared function-memory (1410) having:

a data interface (7620, 7606, 7624-1 to 7624-R) that is coupled to the data bus (1422);

a message interface (7626) that is coupled to the message bus (1420);

a function-memory (7602) that is coupled to the data interface (7620, 7606, 7624-1 to 7624-R), wherein the function-memory (7602) implements lookup-tables (LUTs) and histograms;

a vector-memory (7603) that is coupled to the data interface (7620, 7606, 7624-1 to 7624-R), wherein the vector-memory (7603) supports operations that employ vector instructions;

single-input-multiple-data (SIMD) datapaths (7605-1 to 7605-Q and 7607-1 to 7607-P) that are coupled to the vector-memory (7603);

an instruction memory (7616);

a data memory (7618); and

a processor (7614) that is coupled to the data memory (7618), the instruction memory (7616), the function-memory (7602), and the vector-memory (7603).

2. The apparatus of Claim 1, wherein the shared function-memory (1410) is further characterized by a save/restore memory (7610) that is coupled to the processor and that is configured to store register states for suspended threads.

3. The apparatus of Claims 1 or 2, wherein the SIMD datapaths (7605-1 to 7605-Q and 7607-1 to 7607-P) are further characterized by:

a plurality of ports (7605-1 to 7605-Q) that are coupled to the vector-memory (7603); and

a plurality of functional units (7607-1 to 7607-P) that are each coupled to at least one of the plurality of ports (7605-1 to 7605-Q).

4. The apparatus of Claims 1, 2, or 3, wherein the vector-memory (7603) is arranged into a plurality of sets of memory banks (7802-1 to 7802-L).

5. The apparatus of Claims 1, 2, 3, or 4, wherein a plurality of functional units

(7607-1 to 7607-P) are arranged into a plurality of sets of a plurality of functional units (7607-1 to 7607-P), and wherein the SIMD datapaths (7605-1 to 7605-Q and 7607-1 to 7607-P) are further characterized by a plurality of registers (7804-1 to 7804-L) with each register (7804-1 to 7804-L) being associated with at least one of the sets of functional units (7607-1 to 7607-P).

6. The apparatus of Claims 1, 2, 3, 4, or 5, wherein the processor (7614) is configured to perform motion estimation, resampling, discrete-cosine transform, and distortion correction for image processing.

7. A system characterized by:

system memory (1416); and

a processing cluster that is coupled to the system memory (1416); wherein the processing cluster includes:

a message bus (1420);

a data bus (1422);

a plurality of processing nodes (808-1 to 808-N) arranged in partitions (1402-1 to 1402-R) with each partition having a bus interface unit (4710-1 to 4710-R) that is coupled to the data bus (1422), wherein each processing node (808-1 to 808-N) is coupled to the message bus (1420);

a control node (1406) that is coupled to the message bus (1420); and

a load/store unit (1408) that is coupled to the message bus (1420) and the data bus (1422); and

a shared function-memory (1410) having:

a data interface (7620, 7606, 7624-1 to 7624-R) that is coupled to the data bus (1422);

a message interface (7626) that is coupled to the message bus (1420);

a function-memory (7602) that is coupled to the data interface (7620, 7606, 7624-1 to 7624-R), wherein the function-memory (7602) implements lookup-tables (LUTs) and histograms;

a vector-memory (7603) that is coupled to the data interface (7620, 7606, 7624-1 to 7624-R), wherein the vector-memory (7603) supports operations that employ vector instructions;

single-input-multiple-data (SIMD) datapaths (7605-1 to 7605-Q and 7607-1 to 7607-P) that are coupled to the vector-memory (7603);

an instruction memory (7616);

a data memory (7618); and

a processor (7614) that is coupled to the data memory (7618), the instruction memory (7616), the function-memory (7602), and the vector-memory (7603).

8. The system of Claim 7, wherein the shared function-memory (1410) is further characterized by a save/restore memory (7610) that is coupled to the processor and that is configured to store register states for suspended threads.

9. The system of Claims 7 or 8, wherein the SIMD datapaths (7605-1 to 7605-Q and 7607-1 to 7607-P) are further characterized by:

a plurality of ports (7605-1 to 7605-Q) that are coupled to the vector-memory (7603); and

a plurality of functional units (7607-1 to 7607-P) that are each coupled to at least one of the plurality of ports (7605-1 to 7605-Q).

10. The system of Claims 7, 8, or 9, wherein the vector-memory (7603) is arranged into a plurality of sets of memory banks (7802-1 to 7802-L).

11. The system of Claims 7, 8, 9, or 10, wherein a plurality of functional units (7607-1 to 7607-P) are arranged into a plurality of sets of a plurality of functional units (7607-1 to 7607-P), and wherein the SIMD datapaths (7605-1 to 7605-Q and 7607-1 to 7607-P) are further characterized by a plurality of registers (7804-1 to 7804-L) with each register (7804-1 to 7804-L) being associated with at least one of the sets of functional units (7607-1 to 7607-P).

12. The system of Claims 7, 8, 9, 10, or 11, wherein the processor (7614) is configured to perform motion estimation, resampling, discrete-cosine transform, and distortion correction for image processing.

Description:
SHARED FUNCTION-MEMORY CIRCUITRY FOR A PROCESSING CLUSTER

[0001] The disclosure relates generally to a processor and, more particularly, to a processing cluster.

BACKGROUND

[0002] FIG. 1 shows a graph that depicts speed-up in execution rate versus parallel overhead for multi-core systems (ranging from 2 to 16 cores), where speed-up is the single-processor execution time divided by the parallel-processor execution time. As can be seen, the parallel overhead has to be close to zero to obtain a significant benefit from a large number of cores. But, since the overhead tends to be very high if there is any interaction between parallel programs, it is normally very difficult to efficiently use more than one or two processors for anything but completely decoupled programs. Thus, there is a need for an improved processing cluster.
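As an illustration only, the following sketch evaluates a simple speed-up model of the kind plotted in FIG. 1; the overhead model used here (a fixed fraction of single-processor time added to the parallel execution time) is an assumption for illustration and is not necessarily the model used to generate the figure.

```cpp
#include <cstdio>

// Illustrative speed-up model (assumption, not taken from FIG. 1):
//   T_parallel = T_single / cores + overhead_fraction * T_single
//   speedup    = T_single / T_parallel = cores / (1 + overhead_fraction * cores)
double speedup(int cores, double overhead_fraction) {
    return cores / (1.0 + overhead_fraction * cores);
}

int main() {
    const int    cores_list[] = {2, 4, 8, 16};
    const double overheads[]  = {0.0, 0.01, 0.05, 0.10};
    for (int cores : cores_list) {
        for (double oh : overheads) {
            // As the overhead fraction grows, the benefit of more cores collapses.
            std::printf("cores=%2d overhead=%.2f speedup=%.2f\n",
                        cores, oh, speedup(cores, oh));
        }
    }
    return 0;
}
```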

SUMMARY

[0003] An embodiment of the present disclosure, accordingly, provides an apparatus for performing parallel processing. The apparatus has a message bus (1420); a data bus (1422); and a shared function-memory (1410) having: a data interface (7620, 7606, 7624-1 to 7624-R) that is coupled to the data bus (1422); a message interface (7626) that is coupled to the message bus (1420); a function-memory (7602) that is coupled to the data interface (7620, 7606, 7624-1 to 7624-R), wherein the function-memory (7602) implements lookup-tables (LUTs) and histograms; a vector-memory (7603) that is coupled to the data interface (7620, 7606, 7624-1 to 7624-R), wherein the vector-memory (7603) supports operations that employ vector instructions; single-input-multiple-data (SIMD) datapaths (7605-1 to 7605-Q and 7607-1 to 7607-P) that are coupled to the vector-memory (7603); an instruction memory (7616); a data memory (7618); and a processor (7614) that is coupled to the data memory (7618), the instruction memory (7616), the function-memory (7602), and the vector-memory (7603).

BRIEF DESCRIPTION OF THE DRAWINGS

[0004] FIG. 1 is a graph of multicore speed-up parameters;

[0005] FIG. 2 is a diagram of a system in accordance with an embodiment of the present disclosure;

[0006] FIG. 3 is a diagram of the SOC in accordance with an embodiment of the present disclosure;

[0007] FIG. 4 is a diagram of a parallel processing cluster in accordance with an embodiment of the present disclosure;

[0008] FIG. 5 is a block diagram of shared function-memory;

[0009] FIG. 6 is a diagram of the SIMD data paths for the shared function-memory;

[0010] FIG. 7 is a diagram of a portion of one SIMD data path;

[0011] FIG. 8 is an example of address formation;

[0012] FIGS. 9 and 10 are examples of addressing performed for vectors and arrays that are explicitly in a source program;

[0013] FIG. 11 is an example of a program parameter;

[0014] FIG. 12 is an example of how horizontal groups can be stored in function-memory contexts; and

[0015] FIG. 13 is an example of the organization for the SFM data memory.

DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

[0016] An example of an application for an SOC that performs parallel processing can be seen in FIG. 2. In this example, an imaging device 1250 is shown, and this imaging device 1250 (which can, for example, be a mobile phone or camera) generally comprises an image sensor 1252, an SOC 1300, a dynamic random access memory (DRAM) 1254, a flash memory 1256, a display 1258, and a power management integrated circuit (PMIC) 1260. In operation, the image sensor 1252 is able to capture image information (which can be a still image or video) that can be processed by the SOC 1300 and DRAM 1254 and stored in a nonvolatile memory (namely, the flash memory 1256). Additionally, image information stored in the flash memory 1256 can be displayed to the user over the display 1258 by use of the SOC 1300 and DRAM 1254. Also, imaging devices 1250 are oftentimes portable and include a battery as a power supply; the PMIC 1260 (which can be controlled by the SOC 1300) can assist in regulating power use to extend battery life.

[0017] In FIG. 3, an example of a system-on-chip or SOC 1300 is depicted in accordance with an embodiment of the present disclosure. This SOC 1300 (which is typically an integrated circuit or IC, such as an OMAP™) generally comprises a processing cluster 1400 (which generally performs the parallel processing described above) and a host processor 1316 that provides the hosted environment (described and referenced above). The host processor 1316 can be a wide (i.e., 32-bit, 64-bit, etc.) RISC processor (such as an ARM Cortex-A9) that communicates with the bus arbitrator 1310, buffer 1306, bus bridge 1320 (which allows the host processor 1316 to access the peripheral interface 1324 over interface bus or Ibus 1330), hardware application programming interface (API) 1308, and interrupt controller 1322 over the host processor bus or HP bus 1328. Processing cluster 1400 typically communicates with functional circuitry 1302 (which can, for example, be a charge-coupled device or CCD interface and which can communicate with off-chip devices), buffer 1306, bus arbitrator 1310, and peripheral interface 1324 over the processing cluster bus or PC bus 1326. With this configuration, the host processor 1316 is able to provide information (i.e., configure the processing cluster 1400 to conform to a desired parallel implementation) through API 1308, while both the processing cluster 1400 and host processor 1316 can directly access the flash memory 1256 (through flash interface 1312) and DRAM 1254 (through memory controller 1304). Additionally, test and boundary scan can be performed through a Joint Test Action Group (JTAG) interface 1318.

[0018] Turning to FIG. 4, an example of the parallel processing cluster 1400 is depicted in accordance with an embodiment of the present disclosure. Typically, processing cluster 1400 corresponds to hardware 722. Processing cluster 1400 generally comprises partitions 1402-1 to 1402-R, which include nodes 808-1 to 808-N, node wrappers 810-1 to 810-N, instruction memories 1404-1 to 1404-R, and bus interface units (BIUs) 4710-1 to 4710-R. Nodes 808-1 to 808-N are each coupled to data interconnect 814 (through their respective BIUs 4710-1 to 4710-R and the data bus 1422), and the controls or messages for the partitions 1402-1 to 1402-R are provided from the control node 1406 through the message bus 1420. The global load/store (GLS) unit 1408 and shared function-memory 1410 also provide additional functionality for data movement (as described below). Additionally, a level 3 or L3 cache 1412, peripherals 1414 (which are generally not included within the IC), memory 1416 (which is typically flash memory 1256 and/or DRAM 1254 as well as other memory that is not included within the SOC 1300), and a hardware accelerator (HWA) unit 1418 are used with processing cluster 1400. An interface 1405 is also provided so as to communicate data and addresses to control node 1406.

[0019] Processing cluster 1400 generally uses a "push" model for data transfers. The transfers generally appear as posted writes, rather than request-response types of accesses. This has the benefit of reducing occupation on the global interconnect (i.e., data interconnect 814) by a factor of two compared to request-response accesses, because data transfer is one-way. There is generally no need to route a request through the interconnect 814 and then route the response back to the requestor, which would result in two traversals of the interconnect 814. The push model generates a single transfer. This is important for scalability because network latency increases as network size increases, and this invariably reduces the performance of request-response transactions.

[0020] The push model, along with the dataflow protocol (i.e., 812-1 to 812-N), generally minimizes global data traffic to that required for correctness, while also generally minimizing the effect of global dataflow on local node utilization. There is normally little to no impact on node (i.e., 808-i) performance even with a large amount of global traffic. Sources write data into global output buffers (discussed below) and continue without requiring an acknowledgement of transfer success. The dataflow protocol (i.e., 812-1 to 812-N) generally ensures that the transfer succeeds on the first attempt to move data to the destination, with a single transfer over interconnect 814. The global output buffers (which are discussed below) can hold up to 16 outputs (for example), making it very unlikely that a node (i.e., 808-i) stalls because of insufficient instantaneous global bandwidth for output. Furthermore, the instantaneous bandwidth is not impacted by request-response transactions or replaying of unsuccessful transfers.

[0021] Finally, the push model more closely matches the programming model, namely programs do not "fetch" their own data. Instead, their input variables and/or parameters are written before being invoked. In the programming environment, initialization of input variables appears as writes into memory by the source program. In the processing cluster 1400, these writes are converted into posted writes that populate the values of variables in node contexts.

[0022] The global input buffers (which are discussed below) are used to receive data from source nodes. Since the data memory for each node 808-1 to 808-N is single-ported, the write of input data might conflict with a read by the local Single Input Multiple Data (SIMD). This contention is avoided by accepting input data into the global input buffer, where it can wait for an open data memory cycle (that is, there is no bank conflict with the SIMD access). The data memory can have 32 banks (for example), so it is very likely that the buffer is freed quickly. However, the node (i.e., 808-i) should have a free buffer entry because there is no handshaking to acknowledge the transfer. If desired, the global input buffer can stall the local node (i.e., 808-i) and force a write into the data memory to free a buffer location, but this event should be extremely rare. Typically, the global input buffer is implemented as two separate random access memories (RAMs), so that one can be in a state to write global data while the other is in a state to be read into the data memory. The messaging interconnect is separate from the global data interconnect but also uses a push model.
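For illustration, the following is a minimal behavioral sketch of the two-RAM global input buffer described above, in which one RAM accepts interconnect writes while the other is drained into data memory during a spare cycle; the eight-entry depth, 64-bit word width, class shape, and swap policy shown here are assumptions, not details taken from the disclosure.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

// Behavioral model of a two-RAM ("ping-pong") global input buffer: one RAM
// accepts writes from the global interconnect while the other is drained into
// data memory during a spare (conflict-free) cycle.
class PingPongInputBuffer {
public:
    // Accept one word from the interconnect; returns false if the write RAM is
    // full, in which case the source side would have to be stalled.
    bool write(uint64_t data) {
        Ram& w = ram_[write_sel_];
        if (w.wr == w.entries.size()) return false;
        w.entries[w.wr++] = data;
        return true;
    }
    // Move one word toward data memory when the data-memory cycle is free.
    std::optional<uint64_t> drain(bool data_memory_cycle_free) {
        if (!data_memory_cycle_free) return std::nullopt;
        Ram& r = ram_[1 - write_sel_];
        if (r.rd == r.wr) {            // read RAM empty: swap roles with the write RAM
            r.rd = r.wr = 0;
            write_sel_ = 1 - write_sel_;
            return std::nullopt;
        }
        return r.entries[r.rd++];
    }
private:
    struct Ram { std::array<uint64_t, 8> entries{}; std::size_t wr = 0, rd = 0; };
    std::array<Ram, 2> ram_{};
    int write_sel_ = 0;
};

int main() {
    PingPongInputBuffer buf;
    buf.write(0x1234);
    buf.drain(true);   // first call may only swap roles
    buf.drain(true);   // a later call returns the buffered word
    return 0;
}
```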

[0023] At the system level, nodes 808-1 to 808-N are replicated in processing cluster 1400 analogous to SMP or symmetric multi-processing, with the number of nodes scaled to the desired throughput. The processing cluster 1400 can scale to a very large number of nodes. Nodes 808-1 to 808-N are grouped into partitions 1402-1 to 1402-R, with each partition having one or more nodes. Partitions 1402-1 to 1402-R assist scalability by increasing local communication between nodes, and by allowing larger programs to compute larger amounts of output data, making it more likely to meet desired throughput requirements. Within a partition (i.e., 1402-i), nodes communicate using local interconnect, and do not require global resources. The nodes within a partition (i.e., 1402-i) also can share instruction memory (i.e., 1404-i), with any granularity: from each node using an exclusive instruction memory to all nodes using common instruction memory. For example, three nodes can share three banks of instruction memory, with a fourth node having an exclusive bank of instruction memory. When nodes share instruction memory (i.e., 1404-i), the nodes generally execute the same program synchronously.

[0024] The processing cluster 1400 also can support a very large number of nodes (i.e., 808-i) and partitions (i.e., 1402-i). The number of nodes per partition, however, is usually limited to 4 because having more than 4 nodes per partition generally resembles a non-uniform memory access (NUMA) architecture. In this case, partitions are connected through one (or more) crossbars (which are described below with respect to interconnect 814) that have a generally constant cross-sectional bandwidth. Processing cluster 1400 is currently architected to transfer one node's width of data (for example, 64, 16-bit pixels) every cycle, segmented into 4 transfers of 16 pixels per cycle over 4 cycles. The processing cluster 1400 is generally latency-tolerant, and node buffering generally prevents node stalls even when the interconnect 814 is nearly saturated (note that this condition is very difficult to achieve except by synthetic programs).

[0025] Typically, processing cluster 1400 includes global resources that are shared between partitions: (1) Control Node 1406, which implements the system-wide messaging interconnect (over message bus 1420), event processing and scheduling, and the interface to the host processor and debugger (all of which are described in detail below).

(2) GLS unit 1408, which contains a programmable RISC processor, enabling system data movement that can be described by C++ programs that can be compiled directly as GLS data-movement threads. This enables system code to execute in cross-hosted environments without modifying source code, and is much more general than direct memory access because it can move from any set of addresses (variables) in the system or SIMD data memory (described below) to any other set of addresses (variables). It is multi-threaded, with (for example) a 0-cycle context switch, supporting up to 16 threads.

(3) Shared Function-Memory 1410, which is a large shared memory that provides a general lookup table (LUT) and statistics-collection facility (histogram). It also can support pixel processing using the large shared memory that is not well supported by the node SIMD (for cost reasons), such as resampling and distortion correction. This processing uses (for example) a six-issue RISC processor (i.e., SFM processor 7614, which is described in detail below), implementing scalar, vector, and 2D arrays as native types.

(4) Hardware Accelerators 1418, which can be incorporated for functions that do not require programmability, or to optimize power and/or area. Accelerators appear to the subsystem as other nodes in the system, participate in the control and data flow, can create events and be scheduled, and are visible to the debugger. (Hardware accelerators can have dedicated LUT and statistics gathering, where applicable.)

(5) Data Interconnect 814 and System Open Core Protocol (OCP) L3 connection 1412. These manage the movement of data between node partitions, hardware accelerators, and system memories and peripherals on the data bus 1422. (Hardware accelerators can have private connections to L3 also.)

(6) Debug interfaces. These are not shown on the diagram but are described in this document.

[0026] Turning to FIG. 5, the shared function-memory 1410 can be seen. The shared function-memory 1410 is generally a large, centralized memory supporting operations that are not well-supported by the nodes (i.e., for cost reasons). The main components of the shared function-memory 1410 are the two large memories: the function-memory 7602 and the vector-memory 7603 (each of which has a configurable size of, for example, 48 to 1024 Kbytes and a configurable organization). This function-memory 7602 provides a synchronous, instruction-driven implementation of high-bandwidth, vector-based lookup-tables (LUTs) and histograms. The vector-memory 7603 can support operations by (for example) a 6-issue processor (i.e., SFM processor 7614) that employs vector instructions, which can, for example, be used for block-based pixel processing. Typically, this SFM processor 7614 can be accessed using the messaging interface 1420 and data bus 1422. The SFM processor 7614 can, for example, operate on wide pixel contexts (64 pixels) that can have a much more general organization and total memory size than SIMD data memory in the nodes, with much more general processing applied to the data. It supports scalar, vector, and array operations on standard C++ integer datatypes as well as operations on packed pixels that are compatible with various datatypes. For example and as shown, the SIMD data paths associated with the vector memory 7603 and function-memory 7602 generally include ports 7605-1 to 7605-Q and functional units 7607-1 to 7607-P.

[0027] The function-memory 7602 and vector-memory 7603 are generally "shared" in the sense that all processing nodes (i.e., 808-i) can access function-memory 7602 and vector-memory 7603. Data provided to the function-memory 7602 can be accessed via the SFM wrapper (typically in a write-only manner). This sharing is also generally consistent with the context management for processing nodes (i.e., 808-i). Data I/O between processing nodes and shared function-memory 1410 also uses the dataflow protocol, and processing nodes typically cannot directly access vector-memory 7603. The shared function-memory 1410 can also write to the function-memory 7602, but not while it is being accessed by processing nodes. Processing nodes (i.e., 808-i) can read and write common locations in function-memory 7602, but (usually) either as read-only LUT operations or write-only histogram operations. It is also possible for a processing node to have read-write access to a function-memory 7602 region, but this should be exclusive for access by a given program.

[0028] In the FIG. 5 example of the shared function-memory 1410, there are ports 7624-1 to 7624-R for node access (the actual number is configurable, but there is typically one port per partition). The ports 7624-1 to 7624-R are generally organized to support parallel access, so that all datapaths in the node SIMD, from any given node, can perform a simultaneous LUT or histogram access. [0029] The function-memory 7602 organization in this example has 16 banks containing 16, 16-bit pixels each. It can be assumed that there is a lookup table or LUT of 256 entries, aligned starting at bank 7608-1. The nodes present input vectors of pixel values (16 pixels per cycle, 4 cycles for an entire node), and the table is accessed in one cycle using vector elements to index the LUT (a behavioral sketch of this access follows the list below). Since this table is represented on a single line of each bank (i.e., 7608-1 to 7608-J), all nodes can perform a simultaneous access because no element of any vector can create a bank conflict. The result vector is created by replicating table values into elements of the result vector. For each element in the result vector, the result value is determined by the LUT entry selected by the value of the corresponding element of the input vector. If, at any given bank (i.e., 7608-1 to 7608-J), input vectors from two nodes create different LUT indexes into the same bank, the bank access is prioritized in favor of the least recent input, or, if all inputs occur at the same time, the left-most port input. Bank conflicts are not expected to occur very often, or to have much if any effect on throughput. There are several reasons for this:

Many tables are small compared to the total number of entries (i.e., 256) that can be accessed at the same time in the same table.

Input vectors are usually from relatively small, local horizontal regions of pixels (for example), and the values are not generally expected to have much variation (which should not cause much variation in LUT index). For example, if the image frame is 5400 pixels wide, the input vector of 16 pixels per cycle represents less than 0.3% of the total scan-line.

- Finally, the processor instruction that accesses the LUT is decoupled from the instruction that uses the result of the LUT operation. The processor compiler attempts to schedule the use as far as possible from the initial access. If there is sufficient separation between LUT access and use, there are no stalls even when a few additional cycles are taken by LUT bank conflicts.
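Referring back to the LUT access of paragraph [0029], the following behavioral sketch shows how each element of an input pixel vector selects a table entry and how the result vector is assembled from the selected table values; the 256-entry table size, the simple masking used to form indexes, and the example gamma table are illustrative assumptions, and bank arbitration is omitted.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Behavioral model of a vector LUT access: element i of the result vector is
// the table entry selected by element i of the input vector (16-bit pixels).
std::vector<uint16_t> lut_lookup(const std::vector<uint16_t>& input_vector,
                                 const std::array<uint16_t, 256>& lut) {
    std::vector<uint16_t> result(input_vector.size());
    for (std::size_t i = 0; i < input_vector.size(); ++i) {
        // The index is masked to the table size here; in the disclosure a
        // descriptor selects which contiguous pixel bits form the index
        // (see paragraph [0033]).
        result[i] = lut[input_vector[i] & 0xFF];
    }
    return result;
}

int main() {
    // Hypothetical gamma-style table used only to exercise the lookup.
    std::array<uint16_t, 256> gamma{};
    for (int i = 0; i < 256; ++i) gamma[i] = static_cast<uint16_t>(i * i / 255);
    std::vector<uint16_t> pixels = {0, 16, 128, 255, 300 /* masked to 44 */};
    std::vector<uint16_t> out = lut_lookup(pixels, gamma);
    (void)out;
    return 0;
}
```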

[0030] Within a partition, one node (i.e., node 808-i) usually accesses the function memory 7602 at any given time, but this should not have a significant effect on performance. Nodes (i.e., node 808-i) executing the same program are at different points in the program, and distribute access to a given LUT in time. Even for nodes executing different programs, LUT access frequency is low, and there is a very low probability of a simultaneous access to different LUTs at the same time. If this does occur, the impact is generally minimized because the compiler schedules LUT access as far as possible from the use of the results. [0031] Nodes in different partitions can access function memory 7602 at the same time, assuming no bank conflicts, but bank conflicts should rarely occur. If, at any given bank, input vectors from two partitions create different LUT indexes into the same bank, the bank access is prioritized in favor of the least recent input, or, if all inputs occur at the same time, the left-most port input (e.g., Port 0 is prioritized over Port 1).

[0032] Histogram access is similar to LUT access, except that there is no result returned to the node. Instead, the input vectors from the nodes are used to access histogram entries, these entries are updated by an arithmetic operation, and the result is placed back into the histogram entries. If multiple elements of the input vector select the same histogram entry, this entry is updated accordingly: for example, if three input elements select a given histogram entry, and the arithmetic operation is a simple increment, the histogram entry can be incremented by 3. Histogram updates can typically take one of three forms (a behavioral sketch follows this list):

The entries can be incremented by a constant in the histogram instruction.

The entries can be incremented by the value of a variable in a register within a processor.

The entries can be incremented by a separate weight vector that is sent with the input vector. For example, this can weight the histogram update depending on the relative positions of pixels in the input vector.
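A corresponding behavioral sketch of the histogram update follows; it coalesces multiple input elements that select the same entry (so an entry hit three times by a simple increment grows by 3) and shows the constant-increment and weight-vector forms of update. The bin count, index masking, and function signature are illustrative assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Behavioral model of a vector histogram update. Each input element selects a
// bin; bins selected by several elements of the same vector receive the
// combined contribution in a single update.
void histogram_update(std::array<uint32_t, 256>& hist,
                      const std::vector<uint16_t>& input_vector,
                      const std::vector<uint16_t>* weights,   // optional weight vector
                      uint16_t constant_increment) {
    std::array<uint32_t, 256> delta{};   // coalesce per-bin contributions first
    for (std::size_t i = 0; i < input_vector.size(); ++i) {
        uint16_t bin = input_vector[i] & 0xFF;                // index formation simplified
        delta[bin] += weights ? (*weights)[i] : constant_increment;
    }
    for (std::size_t b = 0; b < hist.size(); ++b) hist[b] += delta[b];
}

int main() {
    std::array<uint32_t, 256> hist{};
    std::vector<uint16_t> pixels = {10, 10, 10, 42};
    histogram_update(hist, pixels, nullptr, 1);   // bin 10 becomes 3, bin 42 becomes 1
    return 0;
}
```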

[0033] Each descriptor can specify the base address of the associated table (bank-aligned), the size of the input data used to form the indexes, and two, 16-bit (for example) masks used to form indexes into this table relative to the base address. The masks generally determine which bits of the pixel(s) (for example) can be selected to form indexes - any contiguous bits - and thus indirectly indicate the table size. When a node executes a LUT or Histogram instruction, it typically uses a 4-bit field to select the descriptor. The instruction determines the operation on the table, so LUTs and histograms can be in any combination. For example, a node (i.e., 808-i) can access histogram entries by performing a lookup-table operation into the histogram. The table descriptors can be initialized as part of SFM data memory 7618 initialization. However, these values can also be copied to hardware descriptors, so that LUT and histogram operations can access the descriptors, in parallel if desired, without requiring an access to SFM data memory 7618. [0034] Turning back to FIG. 5, the SFM processor 7614 generally provides for general programming access to relatively wide (for example) pixel contexts in a large region of the function-memory 7602. This can include: (1) general vector and array operations; (2) operations on horizontal groups of pixels (for example), compatible with Line datatypes; and (3) operations on (for example) pixels in Block datatypes, which can support two-dimensional access for data such as video macroblocks or rectangular regions of a frame. Thus, processing cluster 1400 can support both scan-line-based and block-based pixel processing. The size of function-memory 7602 is also configurable (i.e., from 48 to 1024 Kbytes). Typically, a small portion of this memory 7602 is taken for LUT and histogram use, so the remaining memory can be used for general vector operations on banks 7608-1 to 7608-J, including, for example, vectors of related pixels.
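Returning to the table descriptors of paragraph [0033], the sketch below shows one way a descriptor's mask can select contiguous pixel bits to form a table index relative to a bank-aligned base. The struct layout, the use of a single mask (the text describes two), and the example field values are assumptions for illustration only.

```cpp
#include <cstdint>

// Illustrative LUT/histogram table descriptor: a bank-aligned base address and
// a mask selecting the contiguous pixel bits that form the table index.
struct TableDescriptor {
    uint32_t base_address;   // bank-aligned base of the table in function-memory
    uint16_t index_mask;     // contiguous bits of the pixel used as the index
};

// Extract the masked contiguous bit-field, shift it down to bit 0, then add
// the table base to obtain the function-memory index.
uint32_t form_index(const TableDescriptor& d, uint16_t pixel) {
    uint16_t masked = pixel & d.index_mask;
    uint16_t m = d.index_mask;
    while (m != 0 && (m & 1u) == 0) { masked >>= 1; m >>= 1; }   // align field to bit 0
    return d.base_address + masked;
}

int main() {
    // Hypothetical descriptor: a 256-entry table indexed by pixel bits 9:2.
    TableDescriptor d{0x400, 0x03FC};
    uint32_t idx = form_index(d, 0x1234);   // bits 9:2 of 0x1234 = 0x8D
    (void)idx;
    return 0;
}
```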

[0035] As shown, SFM processor 7614 uses a RISC processor for 32-bit (for example) scalar processing (i.e., two-issue in this case), and extends the instruction set architecture to support vector and array processing in (for example) 16, 32-bit datapaths, which can also operate on packed, 16-bit data for up to twice the operational throughput, and on packed, 8-bit data for up to four times the operational throughput. The SFM processor 7614 permits the compilation of any C++ program, while making available the ability to perform operations (for example) on wide pixel contexts, compatible with pixel datatypes (Line, Pair, and uPair). SFM processor 7614 also can provide more general data movement between (for example) pixel positions, rather than the limited side-context access and packing provided by a node processor, including both in the horizontal and vertical directions. This generality, compared to the node processor, is possible because SFM processor 7614 uses the 2-D access capability of the function-memory 7602, and because it can support a load and a store every cycle instead of four loads and two stores.

[0036] SFM processor 7614 can perform operations such as motion estimation, resampling, and discrete-cosine transform, as well as more general operations such as distortion correction. Instruction packets can be 120 bits wide, providing for parallel issue of up to two scalar and four vector operations in a single cycle. In code regions where there is less instruction parallelism, scalar and vector instructions can be executed in any combination less than six wide, including serial issue of one instruction per cycle. Parallelism is detected using an instruction bit to indicate parallel issue with the preceding instruction, and instructions are issued in-order. There are two forms of load and store instructions for the SIMD datapath, depending on whether the generated function-memory address is linear or two-dimensional. The first type of access of function-memory 7602 is performed in the scalar datapath, and the second in the vector datapaths. In the latter case, the addresses can be completely independent, based on (for example) 16-bit register values in each datapath half (to access up to, for example, 32 pixels from independent addresses).

[0037] The node wrapper 7626 and control structures of the SFM processor 7614 are similar to those of a node processor, and share many common components, with some exceptions. The SFM processor 7614 can support (for example) very general pixel access in the horizontal direction, and the side-context management techniques used for nodes (i.e., 808-i) are generally not possible. For example, the offsets used can be based on program variables (in a node processor, pixel offsets are typically instruction immediates), so the compiler 706 cannot generally detect and insert task boundaries to satisfy side-context dependencies. For a node processor, the compiler 706 should know the location of these boundaries and can ensure that register values are not expected to live across these boundaries. For the SFM processor 7614, hardware determines when task switching should be performed and provides hardware support to save and restore all registers, in both the scalar and the SIMD vector units. Typically, the hardware used for save and restore is the context save/restore circuitry 7610 and the context-state circuit 7612 (which can be, for example, 16x256 bits). This circuitry 7610 (for example) comprises scalar context save circuits (which can be, for example, 16x16x32 bits) and 32 vector context save circuits (which can each, for example, be 16x512 bits), which can be used to save and restore SIMD registers. Generally, the vector-memory 7603 does not support side-context RAMs, and, since pixel offsets (for example) can be variables, it does not generally permit the same dependency mechanisms used in a node processor. Instead, pixels (for example) within a region of a frame are within the same context, rather than distributed across contexts. This provides functionality similar to node contexts, except that the contexts should not be shared horizontally across multiple, parallel nodes. The shared function-memory 1410 also generally comprises an SFM data memory 7618, SFM instruction memory 7616, and a global IO buffer 7620. Additionally, the shared function-memory 1410 also includes an interface 7606 that can perform prioritization, bank select, index select, and result assembly and that is coupled to the node ports (i.e., 7624-1 to 7624-4) through partition BIUs (i.e., 4710-i). [0038] Turning to FIG. 6, an example of the SIMD data paths 7800 for the shared function-memory 1410 can be seen. For example, eight SIMD data paths (each of which can be partitioned into two, 16-bit halves because it can operate on 16-bit packed data) can be used. As shown, these SIMD data paths generally comprise sets of banks 7802-1 to 7802-L, associated registers 7804-1 to 7804-L, and associated sets of functional units 7806-1 to 7806-L.

[0039] In FIG. 7, an example of a portion of one SIMD data path (namely and for example, a portion of one of the registers 7804-1 to 7804-L and a portion of one of the functional units 7806-1 to 7806-L) can be seen. As shown and for example, this SIMD data path can include a 16-entry, 32-bit register file 7902, two 16-bit multipliers 7904 and 7906, and a single, 32-bit arithmetic/logical unit 7908 that can also perform two, 16-bit packed operations in a cycle. Also, as an example, each SIMD data path can perform two independent 16-bit operations, or a combined, 32-bit operation. For example, this can form a 32-bit multiply using the 16-bit multipliers combined with 32-bit adds. Additionally, the arithmetic/logical unit 7908 can be capable of performing addition, subtraction, logical operations (e.g., AND), comparisons, and conditional moves.
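For illustration, the following sketch shows the standard partial-product decomposition by which a 32-bit multiply can be formed from 16x16-bit multiplies and 32-bit adds, as the paragraph above indicates the datapath can do; the exact sequencing of micro-operations in the hardware is not specified here.

```cpp
#include <cassert>
#include <cstdint>

// 32-bit x 32-bit -> low 32-bit product built only from 16x16 multiplies and
// 32-bit adds, mirroring how two 16-bit multipliers plus a 32-bit ALU can
// form a wider multiply (partial-product decomposition).
uint32_t mul32_from_16bit_units(uint32_t a, uint32_t b) {
    uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;
    uint32_t lo_lo = a_lo * b_lo;                 // 16x16 multiply
    uint32_t lo_hi = a_lo * b_hi;                 // 16x16 multiply
    uint32_t hi_lo = a_hi * b_lo;                 // 16x16 multiply
    // a_hi * b_hi only affects bits above 31, so it is dropped for a 32-bit result.
    return lo_lo + ((lo_hi + hi_lo) << 16);       // 32-bit adds and shift
}

int main() {
    uint32_t a = 123456789u, b = 987654u;
    assert(mul32_from_16bit_units(a, b) == a * b);   // equal modulo 2^32
    return 0;
}
```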

[0040] Turning back to FIG. 6, the SIMD data path registers 7804-1 to 7804-L can use a load/store interface to the vector memory 7603. These loads and stores can use features of the vector memory 7603 that are provided for parallel LUT and histogram access by nodes (i.e., 808-i): for nodes, each SIMD data path half can provide an index into function-memory 7602; and, similarly, each SIMD data path half in SFM processor 7614 can provide an independent vector memory 7603 address. Addressing is generally organized so that adjacent data paths can perform the same operation on multiple instances of datatypes such as scalars, vectors, and arrays of 8-, 16-, or 32-bit (for example) data: these are called vector-implied addressing modes (the vector is implied by the SIMD with linear vector memory 7603 addressing). Alternatively, each data path can operate on packed pixels from regions of a frame within banks 7608-1 to 7608-J: these are called vector-packed addressing modes (vectors of packed pixels are implied by the SIMD, with two-dimensional vector memory 7603 addressing). In both cases, as with the node processor, the programming model can hide the width of the SIMD, and programs are written as if they operate on a single pixel or element of another datatype.

[0041] Vector-implied datatypes are generally SIMD-implemented vectors of either 8-bit chars, 16-bit halfwords, or 32-bit ints, operated on individually by each SIMD data path (i.e., FIG. 7). These vectors are not generally explicit in the program, but rather implied by hardware operation. These datatypes can also be structured as elements within explicit program vectors or arrays: the SIMD effectively adds a hidden second or third dimension to these program vectors or arrays. In effect, the programming view can be a single SIMD data path with a dedicated, 32-bit data memory, and this memory is accessed using conventional addressing modes. In the hardware, this view is mapped in a way that each of the 32 SIMD data paths has the appearance of a private data memory, but the implementation takes advantage of the wide, banked organization of vector memory 7603 to implement this functionality in the shared function-memory 1410.

[0042] The SFM processor 7614 SIMD generally operates within vector memory 7603 contexts similar to node processor contexts, with descriptors having a base address aligned to the sets of banks 7802-1, and sufficiently large to address the entire vector memory 7603 (i.e., 13 bits for the size of 1024 kBytes). Each half of a SIMD data path is numbered with a 6-bit identifier (POSN), starting at 0 for the left-most data path. For vector-implied addressing, the LSB of this value is generally ignored, and the remaining five bits are used to align the vector memory 7603 addresses generated by the data path to the respective words in the vector memory 7603.

[0043] In FIG. 8, an example of address formation can be seen. Typically, a load or store instruction executed by the SIMD results in an address being generated by each data path, based on registers in the data path and/or instruction-immediate values: this is the address, in the programming view, that accesses a single, private data memory. Since this can, for example, be a 32-bit access, the two LSBs of this address can be ignored for vector memory 7603 accesses and may be used to address the byte or halfword within the word. The address is added to the context base address, resulting in a context index for the implied vector. Each data path concatenates this index with bits 5:1 of the POSN value (since this is for a word access), and the resulting value is the vector memory 7603 address for the implied vector within the context for the datapath.
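A minimal sketch of this vector-implied address formation follows, using the bit positions stated above (byte bits 1:0 dropped for word accesses, the context base added, and POSN bits 5:1 selecting the lane); the overall address width and the example values are assumptions.

```cpp
#include <cstdint>

// Illustrative vector-implied address formation (see FIG. 8). Each datapath
// half carries a 6-bit POSN; for word accesses the POSN LSB is ignored and
// the remaining five bits select which word of the wide vector-memory row
// that datapath owns. Field widths beyond those stated in the text are assumed.
uint32_t vector_implied_address(uint32_t program_address,   // register + immediate
                                uint32_t context_base,      // context descriptor base
                                uint8_t  posn /* 6-bit datapath-half id */) {
    uint32_t word_index    = program_address >> 2;           // drop byte-in-word bits 1:0
    uint32_t context_index = context_base + word_index;      // index of the implied vector
    uint32_t lane          = (posn >> 1) & 0x1F;             // POSN bits 5:1 (word access)
    return (context_index << 5) | lane;                      // concatenate index with lane
}

int main() {
    // Hypothetical values: context base index 0x40, program accesses byte
    // offset 12 (word 3), datapath-half POSN = 9 (lane 4).
    uint32_t vm_addr = vector_implied_address(/*program_address=*/12,
                                              /*context_base=*/0x40,
                                              /*posn=*/9);
    (void)vm_addr;
    return 0;
}
```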

[0044] These addresses access values aligned to a bank from each set 7802-1 to 7802-L (i.e., four of the sixteen banks), and the access can occur in a single cycle. No bank conflicts occur, since all addresses are based on the same scalar register and/or immediate values, differing in the POSN value in the LSBs.

[0045] FIGS. 9 and 10 illustrate examples of how addressing can be performed for vectors and arrays that are explicit in the source program. The program computes the address of the desired element for the first 32-bit data path (with POSN values of 0 and 1 for the two 16-bit halves of the data path) using conventional base-plus-offset addition. Other data paths perform the same computation and compute the same value for the address, but the final address is offset for each data path by the relative position of the data path. This results in an access to four vector memory banks (i.e., 7608-1, 7608-5, 7608-9, and 7608-12) that (for example) access 32 adjacent, 32-bit values, illustrating how the addressing modes typically use the vector memory 7603 organization efficiently. Because each data path addresses a private set of function-memory 7602 entries, store-to-load dependencies are checked within the local data path, with forwarding applied when there is a dependency. There is generally no need to check dependencies between data paths, which would be very complex. These dependencies should be avoided by the compiler 706 scheduling delay slots after a store before a dependent load can be performed (likely 3-4 cycles).

[0046] Vector-packed addressing modes generally permit the SFM processor 7614 SIMD data paths to operate on datatypes that are compatible with (for example) packed pixels in nodes (808-i). The organization of these datatypes is significantly different in function-memory 7602 compared to the organization in node data memory. Instead of storing horizontal groups across multiple contexts, these groups can be stored in a single context. The SFM processor 7614 can take advantage of the vector memory 7603 organization to pack (for example) pixels from any horizontal or vertical location into data path registers, based on variable offsets, for operations such as distortion correction. In contrast, nodes (i.e., 808-i) access pixels in the horizontal direction using small, constant offsets, and these pixels are all in the same scan-line. Addressing modes for shared function-memory 1410 can support one load and one store per cycle, and performance is variable depending on vector memory bank (i.e., 7608-1) conflicts created by the random accesses.

[0047] Vector-packed addressing modes generally employ addressing analogous to the addressing of two-dimensional arrays, where the first dimension corresponds to the vertical direction within the frame and the second to the horizontal. To access a pixel (for example) at a given vertical and horizontal index, the vertical index is multiplied by the width of the horizontal group, in the case of a Line, or by the width of a Block. This results in an index to the first pixel located at that vertical offset: to this is added the horizontal index to obtain the vector memory 7603 address of the accessed pixel within the given data structure.
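For illustration, the two-dimensional address computation described above reduces to the following sketch; the structure base address and pixel units are assumptions.

```cpp
#include <cstdint>

// Illustrative vector-packed (two-dimensional) address computation from
// paragraph [0047]: the vertical index is scaled by the horizontal-group or
// Block width, then the horizontal index is added.
uint32_t vector_packed_address(uint32_t vertical_index,
                               uint32_t horizontal_index,
                               uint32_t width_in_pixels,    // HG Size or Block Width
                               uint32_t structure_base) {   // base of the Line/Block
    return structure_base + vertical_index * width_in_pixels + horizontal_index;
}

int main() {
    // Pixel at row 3, column 17 of a Block that is 64 pixels wide.
    uint32_t addr = vector_packed_address(3, 17, 64, /*structure_base=*/0x1000);
    // addr == 0x1000 + 3*64 + 17 == 0x10D1
    (void)addr;
    return 0;
}
```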

[0048] The vertical index calculation is based on a programmed parameter, an example of which is shown in FIG. 11. This parameter controls the vertical address of both Line and Block datatypes. The fields for this example are generally defined as follows (circular buffers generally contain Line data):

Top Flag (TF): This indicates that a circular buffer is near the top edge of the frame.

Bottom Flag (BF): This indicates that a circular buffer is near the bottom edge of the frame.

Mode (Md): This two-bit field encodes information related to the access. A value of 00'b means that the access is for a Block. The values 01-11'b encode the type of boundary processing used for circular buffers: 01'b to mirror across the boundary, 10'b to repeat the boundary pixel across the boundary, and 11'b to return a saturated value 7FFF'h (a pixel is a 16-bit value). These boundary modes are illustrated in the sketch following this list.

Store Disable (SD): This suppresses writes using this pointer, to account for start-up delays in a series of dependent buffers.

Top/Bottom Offset (TPOffset): This field indicates, for relative location 0 of a circular buffer, how far the location is below the top, or above the bottom, of a frame, in terms of the number of scan-lines. This locates the boundary of the frame with respect to negative (top) or positive (bottom) offsets from location 0.

Pointer: This is a pointer to the scan-line at relative offset 0 in the vertical direction. This can be at any absolute position within the buffer's address range.

Buffer Size: This is the total vertical size of a circular buffer in number of scan-lines. It controls modulo addressing within the buffer.

HG Size / Block Width: This is the width, in units of 32 pixels, of a horizontal group (HG Size) or Block (Block Width). It is the magnitude of the first dimension used to form the vector-packed address.

This parameter is encoded so that, for a Block, all fields but Block Width are zeros, and code generation can treat the value as a char, based on the dimensions of a Block declaration. The other fields are usually used for circular buffers, and are set by both the programmer and code generation.
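For illustration, the sketch below models the three circular-buffer boundary-processing behaviors selected by the Mode field (mirror, repeat, saturate to 7FFF'h) for a fetch above the top edge of a frame; the index arithmetic, frame layout, and function signature are assumptions, and the bottom-edge and circular-buffer modulo handling are omitted.

```cpp
#include <cstdint>

enum class BoundaryMode : uint8_t { Block = 0, Mirror = 1, Repeat = 2, Saturate = 3 };

// Illustrative boundary handling for a vertical pixel fetch near a frame edge.
// Mirror reflects across the edge, Repeat clamps to the edge row, and Saturate
// returns the constant 7FFF'h instead of reading memory.
uint16_t fetch_with_boundary(const uint16_t* frame, int width, int height,
                             int row, int col, BoundaryMode mode) {
    if (row < 0 || row >= height) {
        switch (mode) {
            case BoundaryMode::Mirror:   row = (row < 0) ? -row : 2 * height - 2 - row; break;
            case BoundaryMode::Repeat:   row = (row < 0) ? 0 : height - 1;              break;
            case BoundaryMode::Saturate: return 0x7FFF;
            case BoundaryMode::Block:    break;   // Blocks are expected to stay in range
        }
    }
    return frame[row * width + col];
}

int main() {
    uint16_t frame[4 * 4] = {};
    uint16_t p = fetch_with_boundary(frame, 4, 4, /*row=*/-1, /*col=*/2,
                                     BoundaryMode::Mirror);   // reads row 1 instead
    (void)p;
    return 0;
}
```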

[0049] Turning to FIG. 12, an example of how horizontal groups can be stored in function-memory contexts can be seen. This organization of horizontal groups mimics the horizontal groups allocated across nodes (i.e., 808-i), except that these groups (as shown and for example) are stored in a single function-memory context, instead of multiple node contexts. The example shows a horizontal group that is the equivalent of six node contexts wide. The first 64 pixels of the group, numbered 0, are stored in contiguous locations in banks 0-3. The second 64 pixels of the group, numbered 1, are stored in banks 4-7. This pattern repeats up to the sixth set of 64 pixels, numbered 5 and stored in banks 4-7, one line below the second set of 64 pixels, relative to the bank. In this example, the first 64 pixels of the next vertical line, numbered 0, are stored in banks 8-B'h, below the third set of 64 pixels in the first line. These pixels correspond to node pixels stored in the next scan-line in a circular buffer in SIMD data memory. Pixels in the scan-line are accessed using packed addresses generated by the datapaths. Each half of the datapath generates an address for a pixel to be packed into that half of the datapath, or to be written to function-memory 7602 from that half of the datapath. To mimic the node context organization, the SIMD can be conceptually centered on a given set of 64 pixels in the horizontal group. In this case, each half of a datapath is centered on a single pixel within the set, addressed using the POSN value for that half of the datapath. Vector-packed addressing modes define a signed offset from this pixel location, either an instruction immediate or a packed, signed value in a register half associated with the datapath half. This is comparable to the pixel offsets in the node processor instruction set, but is more general, since it has a larger range of values and can be based on a program variable.

[0050] Since the SFM processor 7614 performs processing operations analogous to a node (i.e., 808-i), it is scheduled and sequenced much like a node, with analogous context organization and program scheduling. However, unlike a node, data is not necessarily shared between contexts horizontally across a scan line. Instead, the SFM processor 7614 can operate on much larger, standalone contexts. Additionally, because side contexts may not be dynamically shared, there is no requirement to support fine-grained multi-tasking between contexts, though the scheduler can still use program pre-emption to schedule around dataflow stalls. [0051] Turning to FIG. 13, an example of the organization for SFM data memory 7618 can be seen. This memory 7618 is generally the data memory for the scalar data path of SFM processor 7614, and can, for example, have 2048 entries, each 32 bits wide. The first eight locations, for example, of this SFM data memory 7618 generally contain context descriptors 8502 for the SFM data memory 7618 contexts. The next 32 locations, for example, of the SFM data memory 7618 generally contain table descriptors 8504 for up to (for example) 16 LUT and histogram tables in function-memory 7602, with two, 32-bit words taken for each table descriptor 8504. Though these table descriptors 8504 are generally located in SFM data memory 7618, these table descriptors 8504 can be copied during initialization of the SFM data memory 7618 into hardware registers used to control LUT and histogram operations from nodes (i.e., 808-i). The remainder of the SFM data memory 7618 generally contains program data memory contexts 8506, which have variable allocations. Additionally, the vector memory 7603 can function as the data memory for the SIMD of SFM processor 7614.
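The word offsets implied by the SFM data memory layout of paragraph [0051] can be restated as constants, as in the following sketch; the constant names are illustrative.

```cpp
#include <cstdint>

// Word offsets within the example 2048-entry, 32-bit SFM data memory layout
// described in paragraph [0051]; the constants restate the text and imply
// nothing beyond it.
constexpr uint32_t kContextDescriptorBase  = 0;    // first 8 words: context descriptors
constexpr uint32_t kContextDescriptorWords = 8;
constexpr uint32_t kTableDescriptorBase =
    kContextDescriptorBase + kContextDescriptorWords;         // = 8
constexpr uint32_t kTableDescriptorWordsPerTable = 2;          // two 32-bit words each
constexpr uint32_t kMaxTables = 16;
constexpr uint32_t kProgramContextBase =
    kTableDescriptorBase + kTableDescriptorWordsPerTable * kMaxTables;   // = 40
constexpr uint32_t kDataMemoryWords = 2048;

static_assert(kProgramContextBase == 40,
              "program data memory contexts start after the descriptor regions");

int main() { return 0; }
```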

[0052] SFM processor 7614 can also support a fully general task switch, with full context save and restore, including SIMD registers. The Context Save/Restore RAMs support a 0-cycle context switch. This is similar to the node Context Save/Restore RAM, except in this case there are 16 additional memories to save and restore SIMD registers. This allows program pre-emption to occur with no penalty, which is important for supporting dataflow into and out of multiple SFM processor 7614 programs. The architecture uses pre-emption to permit execution on partially-valid blocks, which can optimize resource utilization since blocks can require a large amount of time to transfer in their entirety. The Context State RAM is analogous to the node (i.e., 808-i) Context State RAM, and provides similar functionality. There are some differences in the context descriptors and dataflow state, reflecting the differences in SFM functionality, and these differences are described below. The destination descriptors and pending permissions tables are usually the same as for nodes (808-i). SFM contexts can be organized in a number of ways, supporting dependency checking on various types of input data and the overlap of Line and Block input with execution.

[0053] SFM node wrapper 7626 is a component of shared function-memory 1410 which implements the control and dataflow around the SFM processor 7614. SFM node wrapper 7626 generally implements the interface of the SFM to other nodes in processing cluster 1400. Namely, the SFM wrapper 7626 can implement the following functions: initialization of the node configuration (IMEM, LUT); context management; program scheduling, switching, and termination; input dataflow and enables for input dependency checking; output dataflow and enables for output dependency checking; handling dependencies between contexts; and signaling events on the node and supporting node-debug operations.

[0054] SFM wrapper 7626 typically has three main interfaces to other blocks in processing cluster 1400: a messaging interface, a data interface, and a partition interface. The message interface is on the OCP interconnect, where input and output messages map to the slave and master ports of the message interconnect, respectively. The input messages from the interface are written into (for example) a 4-deep message buffer to decouple message processing from the OCP interface. Unless the message buffer is full, the OCP burst is accepted and processed offline. If the message buffer gets full, then the OCP interconnect is stalled until more messages can be accepted. The data interface is generally used for exchanging vector data (input and output), as well as for initialization of instruction memory 7616 and function-memory LUTs. The partition interface generally includes at least one dedicated port in shared function-memory 1410 for each partition.

[0055] The initialization of instruction memory 7616 is done using a node instruction memory initialization message. The message sets up the initialization process, and the instruction lines are sent on the data interconnect. The initialization data is sent by GLS unit 1408 in multiple bursts. MReqInfo[15:14]="00" (for example) identifies the data on data interconnect 814 as instruction memory initialization data. In each burst, the starting instruction memory location is sent on MReqInfo[20:19] (MSBs) and MReqInfo[8:0] (LSBs). Within a burst, the address is internally incremented with each beat. Mdata[119:0] (for example) carries the instruction data. A portion of instruction memory 7616 can be reinitialized by providing the starting address to reinitialize a selected program.
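For illustration, the sketch below reassembles the starting instruction-memory address of an initialization burst from the MReqInfo fields named above; the 11-bit total address width is an inference from those two fields, and the example MReqInfo value is hypothetical.

```cpp
#include <cstdint>

// Reassemble the starting instruction-memory address of an initialization
// burst from the fields described in paragraph [0055]: MReqInfo[20:19]
// supplies the two MSBs and MReqInfo[8:0] supplies the nine LSBs.
uint32_t imem_init_start_address(uint32_t mreqinfo) {
    uint32_t msbs = (mreqinfo >> 19) & 0x3;    // MReqInfo[20:19]
    uint32_t lsbs = mreqinfo & 0x1FF;          // MReqInfo[8:0]
    return (msbs << 9) | lsbs;
}

int main() {
    uint32_t addr = imem_init_start_address(0x00100042u);   // hypothetical MReqInfo value
    // bits 20:19 = 10'b, bits 8:0 = 0x042  ->  address = (2 << 9) | 0x42 = 0x442
    (void)addr;
    return 0;
}
```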

[0056] The initialization of function-memory 7602 lookup tables or LUTs is generally performed using an SFM function-memory initialization message. The message sets up the initialization process, and the data word lines are sent on data interconnect 814. The initialization data is sent by GLS unit 1408 in multiple bursts. MReqInfo[15:14]="10" identifies the data on data interconnect 814 as function-memory 7602 initialization data. In each burst, the starting function-memory address location is sent on MReqInfo[25:19] (MSBs) and MReqInfo[8:0] (LSBs). Within a burst, the address is internally incremented with each beat. A portion of function-memory 7602 can be reinitialized by providing the starting address. Function-memory initialization access to memory has lower priority than partition access to the function-memory.

[0057] Various control settings of the SFM are initialized using the SFM control initialization message. This initializes context descriptors, function-memory table descriptors, and destination descriptors. Since the number of words required to initialize the SFM control is expected to be more than the message OCP interconnect maximum burst length, this message can be split into multiple OCP bursts. The message bursts for control initialization should be contiguous, with no other message type in between. The total number of words for control initialization should be (1 + #Contexts/2 + #Tables + 4*#Contexts). The SFM control initialization should be completed before any input or program scheduling to shared function-memory 1410.
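The word-count formula above can be restated directly, as in the following sketch; the treatment of an odd context count under integer division is an assumption.

```cpp
#include <cstdint>

// Total 32-bit words needed for SFM control initialization, restating the
// formula in paragraph [0057]: 1 + #Contexts/2 + #Tables + 4*#Contexts.
uint32_t sfm_control_init_words(uint32_t num_contexts, uint32_t num_tables) {
    return 1 + num_contexts / 2 + num_tables + 4 * num_contexts;
}

int main() {
    // Example: 8 contexts and 16 tables -> 1 + 4 + 16 + 32 = 53 words,
    // which may need to be split across several OCP message bursts.
    uint32_t words = sfm_control_init_words(8, 16);
    (void)words;
    return 0;
}
```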

[0058] Now, turning to input dataflow and dependency checking, the input dataflow sequence generally starts with a Source Notification message from the source. The SFM destination context processes the source notification message and responds with Source Permission (SP) messages to enable data from the source. Then the source sends data on the respective interconnect followed by Set Valid (encoded on an MReqInfo bit on the interconnect). The scalar data is sent using an update data memory message to be written into data memory 7618. The vector data is sent on data interconnect 814 to be written into vector-memory 7603 (or function-memory 7602 for a synchronization context with Fm=1). SFM wrapper 7626 also maintains dataflow state variables, which are used to control the dataflow and also to enable the dependency checking in SFM processor 7614.

[0059] The input vector data from the OCP interconnect 1412 is first written into (for example) two 8-entry global input buffers 7620 - consecutive data is written into, and read from, alternate buffers in a ping-pong arrangement. Unless the input data buffer is full, the OCP burst is accepted and processed offline. The data is written into vector-memory 7603 (or function-memory 7602) in a spare cycle when the SFM processor 7614 (or partition) is not accessing the memory. If the global input buffer 7620 becomes full, then the OCP interconnect 1412 is stalled until more data can be accepted. In the input-buffer-full condition, SFM processor 7614 is also stalled to allow the write into the data memory and avoid stalling the interconnect 1412. The scalar data on the OCP message interconnect is also written into (for example) a 4-entry message buffer, to decouple message processing from the OCP interface. Unless the message buffer is full, the OCP burst is accepted and the data is processed offline. The data is written to data memory 7618 in a spare cycle when SFM processor 7614 is not accessing the data memory 7618. If the message buffer becomes full, then the OCP interconnect 1412 is stalled until more messages can be accepted, and SFM processor 7614 is stalled to allow the write into memory 7618.

[0060] Input dependency checking is employed to generally ensure that the vector data being accessed by SFM processor 7614 from vector-memory 7603 is valid data (already received from input). The input dependency check is done for vector packed load instructions. Wrapper 7626 maintains a pointer (valid_inp_ptr) to the largest valid index in the memory 7618. The dependency check fails in an SFM processor 7614 vector unit if H Index is greater than valid_inp_ptr (RLD) or Blk Index is greater than valid_inp_ptr (ALD). Wrapper 7626 also provides a flag to indicate that the complete input has been received and dependency checking is not desired. An input dependency check failure in SFM processor 7614 also causes a stall or context switch - the processor signals the dependency check failure to the wrapper, and the wrapper does a task switch to another ready program (or stalls processor 7614 if there are no ready programs). After a dependency check failure, the same context program can be executed again after at least one more input has been received (so that dependency checking may pass). When the context program is enabled to execute again, the same instruction packet has to be re-executed. This requires special handling in processor 7614 because the input dependency check failure is detected in the execute stage of the pipeline, which means that the other instructions in the instruction packet have already executed before processor 7614 stalls due to the dependency check failure. To handle this special case, wrapper 7626 provides a signal to processor 7614 (wp_mask_non_vpld_instr) when it re-enables a context program to execute after a previous dependency check failure. The vector packed load access is usually in a specific slot in the instruction packet, so only that slot's instruction is re-executed the next time, and the instructions in the other slots are masked from execution.
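
A minimal sketch of the input dependency check, under the assumption that the check reduces to comparing the access index against valid_inp_ptr unless the complete-input flag is set, is as follows (names other than valid_inp_ptr are illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct input_dataflow_state {
    uint32_t valid_inp_ptr;   /* largest valid input index received so far    */
    bool     input_complete;  /* complete input received; checking not needed */
};

/* Returns true when a vector packed load at access_index may proceed; false
 * is a dependency check failure, which the wrapper turns into a task switch
 * (or a stall when no other context is ready). */
static bool input_dep_check(const struct input_dataflow_state *s, uint32_t access_index)
{
    if (s->input_complete)
        return true;
    return access_index <= s->valid_inp_ptr;
}

int main(void)
{
    struct input_dataflow_state s = { .valid_inp_ptr = 12, .input_complete = false };
    printf("index 10 ok? %d\n", input_dep_check(&s, 10));  /* 1: already received   */
    printf("index 20 ok? %d\n", input_dep_check(&s, 20));  /* 0: data not yet valid */
    return 0;
}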

[0061] Turning now to Release Input, once the complete input has been received for an iteration, no more inputs can be accepted from the sources. The source permission is not sent to the sources to enable more input. Programs may release the inputs before the end of an iteration, so that the input for the next iteration can be received. This is done through a Release Input instruction, which is signaled via the flag risc_is_release.

[0062] HG POSN is the position for the current execution of Line data. For a Line data context, HG POSN is used for relative addressing of a pixel. HG POSN is initialized to 0 and is incremented on the execution of a branch instruction (TBD) in processor 7614. The execution of the instruction is indicated to the wrapper by the flag risc_inc_hg_posn. HG POSN wraps to 0 after it reaches the right-most pixel (HG Size) and an increment flag is received from instruction execution.
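
For illustration only, the HG POSN update rule might be modeled as below, assuming HG Size counts pixel positions so that valid positions run from 0 to HG Size-1; the function name is not part of the described circuitry.

/* hg_posn_update: apply the increment flag and wrap past the right-most pixel. */
unsigned hg_posn_update(unsigned hg_posn, unsigned hg_size, int risc_inc_hg_posn)
{
    if (!risc_inc_hg_posn)
        return hg_posn;
    return (hg_posn + 1u >= hg_size) ? 0u : hg_posn + 1u;
}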

[0063] The wrapper 7626 also provides program scheduling and switching. A Schedule Node Program message is generally used for program scheduling, and the program scheduler performs the following functions: it maintains a list of scheduled programs (active contexts) and the data structure from the "Schedule Node Program" message; it maintains a list of ready contexts, marking a program as "ready" when the context becomes ready to execute (an active context becomes ready on receiving sufficient inputs); it schedules a ready program for execution (based on round-robin priority); it provides the program counter (Start PC) to processor 7614 for a program being scheduled to execute for the first time; and it provides dataflow variables to processor 7614 for dependency checking, as well as some state variables for execution. The scheduler also can continuously keep looking for the next ready context (the next ready context in priority after the currently executing context).
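
The round-robin selection of the next ready context might be modeled, for illustration only, by the following C sketch; the context count and structure fields are assumptions.

#include <stdbool.h>

#define NUM_CONTEXTS 8   /* assumed context count, for illustration */

struct sfm_context {
    bool active;          /* a program has been scheduled into this context  */
    bool ready;           /* sufficient input received to make progress      */
    unsigned start_pc;    /* Start PC from the Schedule Node Program message */
};

/* Return the next ready context after 'current' in round-robin order,
 * or -1 when no context is ready. */
int next_ready_context(const struct sfm_context ctx[NUM_CONTEXTS], int current)
{
    for (int i = 1; i <= NUM_CONTEXTS; i++) {
        int c = (current + i) % NUM_CONTEXTS;
        if (ctx[c].active && ctx[c].ready)
            return c;
    }
    return -1;
}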

[0064] SFM wrapper 7626 can also maintain a local copy of the descriptor and state bits for the currently executing context for instant access - these bits normally reside in data memory 7618 or the context descriptor memory. It keeps the local copy coherent when state variables in the context descriptor memory are updated. For the executing context, the following bits are usually used by processor 7614 for execution: the data memory context base address; the vector-memory context base address; input dependency check state variables; output dependency check state variables; HG POSN; and a flag for hg_posn != hg_size. SFM wrapper 7626 also maintains a local copy of the descriptor and state bits for the next ready context. When a different context becomes the "next ready context", it again loads the required state variables and configuration bits from data memory 7618 and the context descriptor memory. This is done so that context switching is efficient and does not have to wait on memory accesses to retrieve settings.

[0065] Task switching suspends the currently executing program and moves processor 7614 execution to the "next ready context". Shared function-memory 1410 dynamically does a task switch in case of a dataflow stall (examples of which can be seen in FIGS. 309 and 310). A dataflow stall is an input dependency check failure or an output dependency check failure. In case of a dataflow stall, processor 7614 signals a dependency check failure flag to SFM wrapper 7626. Based on the dependency check failed flag, SFM wrapper 7626 starts task switching to a different ready program. While the wrapper does the task switch, processor 7614 enters IDLE and flushes the pipeline of instructions already in the fetch and decode stages - those instructions will be re-fetched when the program resumes next time. If there are no other ready contexts, then execution remains suspended until the dataflow stall condition can be resolved - respectively, on receiving inputs or receiving output permissions. It should also be noted that SFM wrapper 7626 usually guesses whether the dataflow stall has resolved or not, since it does not know the actual index failing the input dependency check, or the actual destination failing the output dependency check. On receiving any new input (increment of valid_inp_ptr) or output permission (receiving SP from any destination), the program is marked ready again (and resumed if no other program is executing). It is therefore possible that the program might again fail a dependency check after resuming and go through another task switch. The task suspend and resume sequence in the same context is the same as the task switch sequence to a different context. A task switch can also be attempted on execution of an END instruction in a program (examples of which can be seen in FIGS. 311 and 312). This is to give all ready programs a chance to run. If there are no other ready programs, then the same program continues to execute. Additionally, the following steps are followed by SFM wrapper 7626 on a task switch (a sketch of this sequence appears after the list):

(1) Assert force_ctxz=0 to processor 7614.

i. Save the processor 7614 state for this program into the context state memory.

ii. Restore the T20 and T80 state for the new program from the context state memory.

(2) Assert force_pcz=0 and provide new_pc to processor 7614.

i. For a program being suspended or resuming execution, the PC is saved to / restored from the context state memory.

ii. For a program starting execution for the first time, the PC is the Start PC from the "Schedule Node Program" message.

(3) Load the state variable and configuration bit copy of the "next ready context" into the "current executing context".
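
For illustration, the three-step sequence above might be modeled in C as follows; the function and field names stand in for the force_ctxz/force_pcz handshakes and are not the actual wrapper logic.

#include <stdbool.h>
#include <stdint.h>

struct context_state { uint32_t regs[16]; uint32_t pc; };

struct sfm_wrapper_model {
    struct context_state save_mem[8];   /* context state (save/restore) memory  */
    struct context_state local_copy;    /* local copy for the executing context */
    uint32_t start_pc[8];               /* from Schedule Node Program messages  */
    bool started[8];
};

/* Stubs standing in for the force_ctxz / force_pcz handshakes of steps (1)
 * and (2); in hardware these are signals to processor 7614. */
static void processor_save_state(struct context_state *dst)        { (void)dst; }
static void processor_restore_state(const struct context_state *s) { (void)s; }
static void processor_force_pc(uint32_t new_pc)                    { (void)new_pc; }

void task_switch(struct sfm_wrapper_model *w, unsigned cur_ctx, unsigned next_ctx)
{
    /* (1) force_ctxz=0: save the suspended program's state, restore the next one's. */
    processor_save_state(&w->save_mem[cur_ctx]);
    processor_restore_state(&w->save_mem[next_ctx]);

    /* (2) force_pcz=0: PC comes from the context state memory on a resume,
     *     or from Start PC on the first execution of the program. */
    uint32_t new_pc = w->started[next_ctx] ? w->save_mem[next_ctx].pc
                                           : w->start_pc[next_ctx];
    processor_force_pc(new_pc);
    w->started[next_ctx] = true;

    /* (3) The "next ready context" becomes the "current executing context":
     *     load its state variables and configuration bits into the local copy. */
    w->local_copy = w->save_mem[next_ctx];
}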

[0066] Turning now to the output data protocol for different data types, in general, at the start of a program execution, SFM wrapper 7626 sends a Source Notification message to all destinations. The destinations are programmed in destination descriptors, and the destinations respond with Source Permission to enable output. For vector output, the P Incr field in the source permission message indicates the number of transfers (vector set valid) permitted to be sent to the respective destination. The OutSt state machine governs the behavior of the output dataflow. Two types of outputs can be produced by SFM 1410: scalar output and vector output. Scalar output is sent on message bus 1420 using an update data memory message, and vector output is sent on data interconnect 814 (over data bus 1422). Scalar output is the result of execution of an OUTPUT instruction in processor 7614, and processor 7614 provides an output address (computed), a control word (U6 instruction immediate), and an output data word (32-bit from a GPR). The format of (for example) a 6-bit control word is Set Valid ([5]), Output Data Type ([4:3], which is Input Done (00), node line (01), Block (10), or SFM Line (11)), and destination number ([2:0], which can be 0-7). Vector output occurs by execution of a VOUTPUT instruction in processor 7614, and processor 7614 provides an output address (computed) and a control word (U6 instruction immediate). The output data is provided by a vector unit (i.e., 512-bit, [32-bit per vector unit GPR] * 16 vector units) within processor 7614. The format of (for example) a 6-bit control word for VOUTPUT is the same as for OUTPUT. The output data, address, and controls from processor 7614 can first be written into a (for example) 8-entry global output buffer 7620. SFM wrapper 7626 reads the outputs from global output buffer 7620 and drives them onto bus 1422. This scheme is used so that processor 7614 can continue execution while output data is being sent out on the interconnect. If the interconnect 814 is busy and the global output buffer 7620 becomes full, then processor 7614 can be stalled.
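
The (for example) 6-bit control word layout described above can be illustrated by the following C sketch for packing and unpacking; only the field positions come from the text, and the helper names are illustrative.

#include <stdint.h>
#include <stdio.h>

enum out_type { OUT_INPUT_DONE = 0, OUT_NODE_LINE = 1, OUT_BLOCK = 2, OUT_SFM_LINE = 3 };

/* Set Valid in bit [5], Output Data Type in bits [4:3], destination in [2:0]. */
static uint8_t make_ctrl_word(int set_valid, enum out_type type, unsigned dest)
{
    return (uint8_t)(((set_valid & 1) << 5) | (((unsigned)type & 3u) << 3) | (dest & 7u));
}

int main(void)
{
    uint8_t cw = make_ctrl_word(1, OUT_BLOCK, 5);
    printf("ctrl=0x%02X set_valid=%u type=%u dest=%u\n",
           (unsigned)cw, (cw >> 5) & 1u, (cw >> 3) & 3u, cw & 7u);
    return 0;
}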

[0067] For output dependency checking, the processor 7614 is allowed to execute an output only if the respective destination has given permission to the SFM source context for sending data. If processor 7614 encounters an OUTPUT or VOUTPUT instruction when the output to the destination is not enabled, the result is an output dependency check failure, causing a task switch. SFM wrapper 7626 provides two flags to processor 7614 as enables, per destination, for scalar and vector output respectively. Processor 7614 flags the output dependency check failure to SFM wrapper 7626 to start the task switch sequence. The output dependency check failure is detected in the decode pipeline stage of processor 7614, and processor 7614 enters IDLE and flushes the fetch and decode pipeline if it encounters an output dependency check failure. Typically, 2 delay slots are employed after an OUTPUT or VOUTPUT instruction with Set Valid, so that the OutSt state machine can be updated based on the Set Valid and the output enable to processor 7614 can be updated before the next Set Valid.

[0068] SFM wrapper 7626 also handles program termination for SFM contexts. There are typically two mechanisms for program termination in processing cluster 1400. If the schedule node program message had Te=1, then the program terminates on an END instruction. The other mechanism is based on dataflow termination. With dataflow termination, the program terminates when it has finished execution on all the input data. This allows the same program to run multiple iterations before termination (multiple ENDs and multiple iterations of input data). A source signals Output Termination (OT) to its destinations when it has no more data to send - no more program iterations. The destination context stores the OT signal and terminates at the end of the last iteration (END) - when it has completed execution on the last iteration of input data. Or, it may receive the OT signal after finishing the last iteration's execution, in which case it terminates immediately.

[0069] The source signals the OT through the same interconnect path as the last output data (scalar or vector). If the last output data from the source was scalar, then the output termination is signaled by a scalar output termination message on message bus 1420 (same as scalar output). If the last output data from the source was vector, then the output termination is signaled by a vector termination packet on data interconnect 814 or bus 1422 (same as data). This is to generally ensure that a destination never receives the OT signal before the last data. On termination, an executing context sends an OT message to all of its destinations. The OT is sent on the same interconnect as the last output from that program. After finishing sending the OT, the context sends a node program termination message to control node 1406.

[0070] The InTm state machine can also be used for termination. In particular, the InTm state machine can be used to store the Output Termination message and sequence the termination. SFM 1410 uses the same InTm state machine as the nodes, but uses "first set valid" for state transitions instead of any set valid as in the nodes. The following sequence orderings are possible between input (set valid), OT, and END at a destination context: Input Set Valid - OT - END: terminate on END; Input Set Valid - END - OT: terminate on OT; Input Set Valid (iter n-1) - Release Input - Input Set Valid (iter n) - OT - END - END: terminate on 2nd END (last iteration); Input Set Valid (iter n-1) - Release Input - Input Set Valid (iter n) - END - OT - END: terminate on 2nd END (last iteration); and Input Set Valid (iter n-1) - Release Input - Input Set Valid (iter n) - END - END - OT: terminate on OT.
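
The orderings above are consistent with a simple rule - terminate once OT has been received and END has executed for every received input iteration - which the following illustrative C model captures; the event encoding and names are assumptions, not the InTm implementation.

#include <stdbool.h>
#include <stdio.h>

struct term_state {
    bool ot_received;          /* OT stored (InTm)                      */
    int  pending_iterations;   /* iterations received but not yet ENDed */
};

static bool should_terminate(const struct term_state *s)
{
    return s->ot_received && s->pending_iterations == 0;
}

static void on_first_set_valid(struct term_state *s) { s->pending_iterations++; }
static void on_ot(struct term_state *s)              { s->ot_received = true; }
static void on_end(struct term_state *s)             { s->pending_iterations--; }

int main(void)
{
    /* Ordering: Input Set Valid - END - OT  =>  terminate on OT */
    struct term_state s = { false, 0 };
    on_first_set_valid(&s);
    on_end(&s); printf("after END: terminate? %d\n", should_terminate(&s));  /* 0 */
    on_ot(&s);  printf("after OT:  terminate? %d\n", should_terminate(&s));  /* 1 */
    return 0;
}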

[0071] A node state write message can update instruction memory 7616 (i.e., 256 bits wide), data memory 7618 (i.e., 1024 bits wide), and the SIMD register (i.e., 1024 bits wide). Example burst lengths for these can be as follows: instruction memory - 9 beats; data memory - 33 beats; and SIMD register - 33 beats. In the partition BIU (i.e., 4710-i), there is a counter called debug_cntr which increments for every data beat received. Once the count reaches (for example) 7, which means 8 data beats (the first header beat carrying the data count is not counted), debug stall is asserted, which disables cmd accept and data accept until the write is done to the destination. The debug stall is a state bit that is set in the partition BIU and reset by the node wrapper when the write is done by the node wrapper (i.e., 810-1) - the unstall comes on the nodex_unstall_msg_in input (for partition 1402-x) in partition BIU 4710-x. An example of 32 data beats sent from partition BIU 4710-x to the node wrapper on the bus is as follows:

nodex_wp_msg_en[2:0], which is set to M_DEBUG

nodex_wp_msg_wdata[M_DEBUG_OP] == M_NODE_STATE_WR, where M_DEBUG_OP is bits 31:29, identifying the message traffic as a node state write when message address[8:6] has the 110 encoding

This then fires the node state write signal in the node wrapper. Here, two counters are maintained, called debug_cntr and simd_wr_cntr (analogous to the ones in the partition BIU); see the NODE STATE WRITE comment in node_wrapper.v for this code.

The 32-bit packets are then accumulated in the node_state_wr_data flop (256 bits).

When the 256 bits are filled, the instruction memory is written.

Similarly for SIMD data memory: when 256 bits have been accumulated, SIMD data memory is written. The partition BIU stalls the message interconnect from sending more data beats until the node wrapper successfully updates SIMD data memory, since other traffic could be updating SIMD data memory - for example, data from the global data interconnect in the global IO buffers. Once the update into the data memory is done, the unstall is enabled through debug_node_state_wr_done, which has debug_imem_wr | debug_simd_wr | debug_dmem_wr components. This then unstalls the partition BIU to accept 8 more data packets and do the next 256-bit write until the entire 1024 bits are done. simd_wr_cntr counts the 256-bit packets.
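
For illustration only, the accumulation of 32-bit beats into 256-bit writes described above might be modeled as follows; the structure and function names are assumptions and not the BIU/wrapper RTL.

#include <stdint.h>
#include <string.h>

struct node_state_writer {
    uint32_t node_state_wr_data[8];   /* 8 x 32 bits = one 256-bit word */
    int      debug_cntr;              /* beats collected so far         */
};

/* Collect one 32-bit data beat.  Returns 1 when a full 256-bit word is ready
 * in out256; the caller would keep the interconnect stalled until that word
 * has actually been written to instruction memory / SIMD data memory. */
int node_state_wr_beat(struct node_state_writer *w, uint32_t beat, uint32_t out256[8])
{
    w->node_state_wr_data[w->debug_cntr++] = beat;
    if (w->debug_cntr == 8) {
        memcpy(out256, w->node_state_wr_data, sizeof w->node_state_wr_data);
        w->debug_cntr = 0;
        return 1;
    }
    return 0;
}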

[0072] When a node state read message comes in, the appropriate slave - instruction memory, SIMD data memory, or SIMD register - is read and then placed into the (for example) 16x1024-bit global output buffer 7620. From there, the data is sent to the partition BIU (i.e., 4710-x), which then pumps the data out to message bus 1420. When global output buffer 7620 is read, the following signals can (for example) be enabled out of the node wrapper - these buses typically carry traffic for vector outputs but are overloaded to carry node state read data as well, so not all bits of nodeX_io_buffer_ctrl are typically pertinent:

nodeX_io_buf_has_data tells the partition BIU that data is being sent by the node wrapper

nodeX_io_buffer_data[255:0] has the instruction memory read data, or data memory data (256 bits at a time), or SIMD register data (256 bits at a time)

nodeX_read_io_buffer[3:0] has signals that indicate bus availability, using which the output buffer is read and data is sent to the partition BIU

nodeX_io_buffer_ctrl indicates various pieces of information

o relevant information is on bits 16:14

// 16:14 : 3-bit op
// 000 : node state read - IOBUF_CNTL_OP_DEB
// 001 : LUT
// 010 : his_i
// 011 : his_w
// 100 : his
// 101 : output
// 110 : scalar output
// 111 : nop

// 32:31
// 00 : imem read
// 10 : SIMD register
// 11 : SIMD DMEM

In partition BIU 4710-x, look for the comments SCALAR OUTPUTS: and follow the signals node0_msg_misc_en and node0_imem_rd_out_en. These then set up the ocp_msg_master instance. Various counters are used again; debug_cntr_out breaks the (for example) 256-bit packet into 32-bit packets to be sent to message bus 1420. The message that is sent is Node State Read Response.

[0073] Reading of the data memory is similar to the node state read - the appropriate slave is read and then placed into the global output buffer, and from there it goes to partition BIU 4710-x. For example, bits 32:31 of nodeX_io_buffer_ctrl are set to 01, and the message to be sent can (for example) be 32 bits wide and is sent as a data memory read response. Bits 16:14 should also indicate IOBUF_CNTL_OP_DEB. The slaves can (for example) be:

1. Data memory, CX=0 (aka LS-DMEM), application data - using the context number, we get the descriptor base and then add the offset (address bits) that comes along with the message

2. Data memory descriptor area, CX=1 - message data beat [8:7] = 00 identifies this area - use the context number to figure out which descriptor is being updated

3. SIMD descriptor - [8:7] = 01 identifies this area - the context number provides the address

4. Context save memory - [8:7] = 10 identifies this area - the context number provides the address

5. Registers inside of processor 7614 - such as breakpoint, tracepoint, and event registers - [8:7] = 11 identifies this area

a. The following signals are then set up on the interface for processor 7614:

i. .dbg_req (dbg_req),

ii. .dbg_addr ({15'b000_0000_0000_0000, dbg_addr}),

iii. .dbg_din (dbg_din),

iv. .dbg_xrw (dbg_xrw),

b. The following parameters are defined in tx_sim_defs in the tpic library directory:

v. define NODE_EVENT_WIDTH 16

vi. define NODE_DBG_ADDR_WIDTH 5

c. Dbg_addr[4:0] is set as follows for breakpoint/tracepoint - it comes from bits 26:25 of the Set Breakpoint/Tracepoint message:

vii. Address of 0 is for breakpoint/tracepoint register 0

viii. Address of 1 is for breakpoint/tracepoint register 1

ix. Address of 2 is for breakpoint/tracepoint register 2

x. Address of 3 is for breakpoint/tracepoint register 3

d. Dbg_addr[4:0] is set to the lower 5 bits of the read data memory offset when event registers are addressed - these have to be set to 4 and above in the message.

[0074] The context save memory 7610 that holds the state for processor 7614 also can have (for example) address offsets as follows:

1. The 16 general purpose registers have address offsets 0, 4, 8, C, 10, 14, 18, 1C, 20, 24, 28, 2C, 30, 34, 38, and 3C.

2. The rest of the registers are updated as follows:

a. 40 - CSR - 12 bits wide

b. 42 - IER - 4 bits wide

c. 44 - IRP - 16 bits

d. 46 - LBR - 16 bits

e. 48 - SBR - 16 bits

f. 4A - SP - 16 bits

g. 4C - PC - 17 bits
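
For reference, these offsets can be captured as a C enum, as in the following sketch; the enum and helper names are invented for the sketch, while the offsets and widths come from the list above.

enum ctx_save_offset {
    CTX_GPR0 = 0x00,   /* 16 general purpose registers at 0x00, 0x04, ..., 0x3C */
    CTX_CSR  = 0x40,   /* 12 bits wide */
    CTX_IER  = 0x42,   /*  4 bits wide */
    CTX_IRP  = 0x44,   /* 16 bits      */
    CTX_LBR  = 0x46,   /* 16 bits      */
    CTX_SBR  = 0x48,   /* 16 bits      */
    CTX_SP   = 0x4A,   /* 16 bits      */
    CTX_PC   = 0x4C    /* 17 bits      */
};

/* Offset of general purpose register n (0..15). */
static inline unsigned ctx_gpr_offset(unsigned n) { return (unsigned)CTX_GPR0 + 4u * n; }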

[0075] When a Halt message is received, the halt_acc signal is enabled, which then sets the halt_seen state. This is then sent on bus 1420 as follows:

• Halt_t20[0]: halt seen

• Halt_t20[1]: save context

• Halt_t20[2]: restore context

• Halt_t20[3]: step

The halt_seen state is then sent to ls_pc.v, where it is used to disable imem_rdy so that no more instructions are fetched and executed. However, we desire to make sure that both the processor 7614 and SIMD pipes are empty before continuing. Once the pipe is drained - that is, there are no stalls - pipe_stall[0] is enabled as an input to the node wrapper (i.e., 810-1). Using this signal, the halt acknowledge message is sent and the entire context of processor 7614 is saved into the context memory. The debugger can then modify the state in the context memory using an update data memory message with CX=1 and address bits 8:7 indicating context save memory 7610.

[0076] When the resume message is received, halt_risc[2] is enabled, which will then restore the context - a force_pcz is then asserted to continue execution from the PC in the context state. Processor 7614 uses force_pcz to enable cmem_wdata_valid, which is disabled by the node wrapper if the force_pcz is due to a resume. The resume_seen signal also resets various states - such as halt_seen and the flag recording that the halt acknowledge message was sent.

[0077] When the step N instruction message is received, the number of instructions to step comes on (for example) bits 20:16 of the message data payload. Using this, imem_rdy is throttled. The throttling works as follows:

1. Reload everything from the context state, as the debugger could have changed state.

2. imem_rdy is disabled for one clock - one instruction is fetched and executed.

3. pipe_stall[0] is then examined, to see if the instruction has completed execution.

4. Once pipe_stall[0] is asserted high - meaning the pipes are drained - the context is saved. The process is repeated until the step counter goes to 0; once it reaches 0, a halt acknowledge message is sent.
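
For illustration, the throttling loop above might be modeled as follows; the function names stand in for the hardware signals (imem_rdy, pipe_stall[0]) and are stubbed so the sketch is self-contained, not the actual node wrapper behavior.

#include <stdbool.h>

/* Stubs standing in for hardware behavior so the sketch compiles. */
static void reload_context_from_state(void) { /* debugger may have changed state */ }
static void fetch_and_execute_one(void)     { /* imem_rdy enabled for one clock  */ }
static bool pipe_drained(void)              { return true; /* pipe_stall[0] high */ }
static void save_context_to_state(void)     { }
static void send_halt_acknowledge(void)     { }

void step_n(unsigned step_count /* from bits 20:16 of the message payload */)
{
    while (step_count != 0) {
        reload_context_from_state();   /* 1. reload state                     */
        fetch_and_execute_one();       /* 2. one instruction fetched/executed */
        while (!pipe_drained())        /* 3. wait for execution to complete   */
            ;
        save_context_to_state();       /* 4. save the context, then repeat    */
        step_count--;
    }
    send_halt_acknowledge();           /* step counter reached 0              */
}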

[0078] Breakpoint/tracepoint matches can be indicated (for example) as follows:

risc_brk_trc_match - a breakpoint or tracepoint match took place

risc_trc_pt_match - means it was a tracepoint match

risc_brk_trc_match_id[1:0] indicates which one of the 4 registers matched. A breakpoint match can occur when we are halted; when this happens, a halt acknowledge message is sent. A tracepoint match can occur when not halted. Back-to-back tracepoint matches are handled by stalling the second one until the first one has had a chance to send the halt acknowledge message.

[0079] Shared function-memory 1410 program scheduling is generally based on active contexts, and does not use a scheduling queue. The program scheduling message can identify the context that the program executes in, and the program identifier is equivalent to the context number. If more than one context executes the same program, each context is scheduled separately. Scheduling a program in a context causes the context to become active, and it remains active until it terminates, either by executing an END instruction with Te=1 in the scheduling message, or by dataflow termination.

[0080] Active contexts are ready to execute as long as HG Input > HG POSN. Ready contexts can be scheduled in round-robin priority, and each context can execute until it encounters a dataflow stall or until it executes an END instruction. A dataflow stall can occur when the program attempts to read invalid input data, as determined by HG POSN and the relative horizontal-group position of the access with respect to HG Input, or when the program attempts to execute an output instruction and the output has not been enabled by a Source Permission. In either case, if there is another ready program, the stalled program is suspended and its state is stored in the context save/restore circuit 7610. The scheduler can schedule the next ready context in round-robin order, providing time for the stall condition to be resolved. All ready contexts should be scheduled before the suspended context is resumed.

[0081] If there is a dataflow stall and no other program is ready, the program remains active in the stalled condition. It remains stalled until either the stall condition is resolved, in which case it resumes from the point of the stall, or until another context becomes ready, in which case it is suspended to execute the ready program.

[0082] As described above, all system-level control is accomplished by messages. Messages can be considered system-level instructions or directives that apply to a particular system configuration. In addition, the configuration itself, including program and data memory initialization - and the system response to events within the configuration - can be set by a special form of messages called initialization messages.

[0083] Those skilled in the art to which the invention relates will appreciate that modifications may be made to the described embodiments and additional embodiments realized, without departing from the scope of the claimed invention.