Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
AUTOMATED LATENCY MANAGEMENT AND CROSS-COMMUNICATION EXCHANGE CONVERSION
Document Type and Number:
WIPO Patent Application WO/2014/110600
Kind Code:
A1
Abstract:
A system and method for communication in a parallel computing system is applied to a system having multiple processing units, each processing unit including processors), memory, and a network interface, where the network interface is adapted to support virtual connections. The memory has at least a portion of a parallel processing application program and a parallel processing operating system. The system has a network fabric between processing units. The method involves identifying need for communication by the first processing unit with a group of processing units, creating virtual connections between the processing units, and transferring data between the first processing units.

Inventors:
HOWARD KEVIN D (US)
Application Number:
PCT/US2014/011546
Publication Date:
July 17, 2014
Filing Date:
January 14, 2014
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
MASSIVELY PARALLEL TECH INC (US)
International Classes:
G06F9/38; G06F9/30
Foreign References:
US20080162834A12008-07-03
US20070143578A12007-06-21
US20060095716A12006-05-04
US5471592A1995-11-28
Attorney, Agent or Firm:
BARTON, Steven, K. (4845 Pearl East Circle Suite 20, Boulder CO, US)
Download PDF:
Claims:
CLAIMS

What is claimed is:

1. A method of communication in a multiple processor computing system, the system comprising:

A plurality of processing units, each processing unit including at least one processor, a memory, and a network interface, the network interface adapted to support virtual connections, the memory configured to contain at least a portion of a parallel processing application program and at least a portion of a parallel processing operating system, and

A network fabric coupled to each processing unit;

the method comprising:

identifying a need for communication by the parallel processing

application program executing on a first processing unit of the plurality of processing units with a second plurality of the plurality of processing units,

creating a virtual connection between the first processing unit with each processing unit of the second plurality of processing units, transferring data between the first processing unit and each processing unit of the second plurality of processing units.

2. The method of claim 1 further comprising automatically determining a number of processing units to assign to a task associated with the communication.

3. The method of claim 1 wherein the step of identifying a need for communication is performed by performing functional decomposition of a software design to generate a computer-executable finite state machine, the performing functional decomposition performed by a computer and comprising:

decomposing functions in the software design into data transformations and control transformations repetitively until each of the decomposed data transformations consists of a respective linear code block;

wherein the data transformations accept and generate data, and the control transformations evaluate conditions and send and receive control indications to and from associated instances of the data transformations;

converting the software design to a graphical diagram including a

plurality of graphical symbols interconnected to hierarchically represent the data transformations and the control transformations in the software design.

4. The method of claim 3 further comprising automatically determining a number of processing units to assign to a task associated with the communication.

5. The method of claim 3 wherein the step of automatically determining a number of processing units to assign to a task comprises determining a first and a second dataset size for the communication associated with a first and a second number of processors, and using a table interpolation to determine a first and a second communications time, the first communications time associated with the first dataset size and first number of processors, and the second communications time associated with the second dataset size and the second number of processors.

6. The method of claim 1 wherein the step of identifying a need for communication is performed by performing functional decomposition of a software design to generate a computer-executable finite state machine, the performing functional decomposition performed by a computer and

comprising:

decomposing functions in the software design into data transformations and control transformations repetitively until each of the decomposed data transformations consists of a respective linear code block; wherein the data transformations accept and generate data, and the control transformations evaluate conditions and send and receive control indications to and from associated instances of the data transformations;

configuring an automatically-determined number of the plurality of processing units to execute a task of the functionally- decomposed software design and executing the communication on an automatically-determined number of processing units of the plurality of processing units executing the task associated with the communication.

7. The method of claim 6 wherein automatically determining a number of processing units to execute a task comprises determining a first and a second dataset size for the communication associated with a first and a second number of processors, and using a table interpolation to determine a first and a second communications time, the first communications time associated with the first dataset size and first number of processors, and the second communications time associated with the second dataset size and the second number of processors.

8. The method of claim 6 wherein, for at least some of the communications, an all-to-all communication is performed over the virtual connection between the first processing unit and each processing unit of the second plurality of processing units, the first processing unit and second plurality of processing units comprising the processing units executing the task.

9. A multiple processor computing system, the system comprising: A plurality of processing units, each processing unit including at least one processor, a memory, and a network interface, the network interface adapted to support virtual connections, the memory configured to contain at least a portion of a parallel processing application program and at least a portion of a parallel processing operating system, and A network fabric coupled to each processing unit;

The network fabric adapted to support virtual connections between units of the plurality of processing units;

The memory of the processing units comprising machine readable code for creating a virtual connection between the first processing unit with each processing unit of the second plurality of processing units,

the system further comprising machine readable instructions in the memory of the processing units for performing an all-to-all communication over the virtual connection between the first processing unit and each processing unit of the second plurality of processing units, the all-to-all communication comprising transferring data between the first processing unit and each processing unit of the second plurality of processing units; wherein the system is configured to automatically determine a number of processing units to assign to a task associated with the communication, the units assigned to the task comprising the first processing unit and the second plurality of processing units; and wherein the number of processing units assigned to the task is less than a total number of processing units of the system. .

10. The system of claim 9, wherein the system is configured to automatically determine a number of processing units to assign to a task by executing machine readable code comprising code for determining a first and a second dataset size for the communication associated with a first and a second number of processors, and using a table interpolation to determine a first and a second communications time, the first communications time associated with the first dataset size and first number of processors, and the second communications time associated with the second dataset size and the second number of processors.

Description:
AUTOMATED LATENCY MANAGEMENT AND CROSS-COMMUNICATION

EXCHANGE CONVERSION

RELATED APPLICATIONS

[0001] The present document claims priority to U.S. Provisional Patent Application 61/752,292, filed 14 January 2013, the contents of which are incorporated herein by reference.

[0002] The present document is also a continuation-in-part of U.S. Patent Application 13/490,345, filed June 6, 2012 which is a continuation-in- part of U.S. Patent Application 13/425,136, filed 20 March 2012, the contents of which are incorporated herein by reference.

FIELD

[0003] The present document relates to the field of communications within a parallel computing system.

BACKGROUND

[0004] There are many computing problems that are amenable to parallel processing. In parallel processing on a parallel computing system an overall problem is typically divided into multiple sub-problems, or processes, each of which is then assigned to run on a particular processor of a number of processors, since the processors can then execute in parallel, rather than in serial, the overall problem is solved more quickly than with a single processor. Applications that have been run on parallel processing computer systems include cryptanalysis, weather forecasting, and many kinds of simulations.

[0005] In many parallel processing applications, a process running on one processor will need results from another processes running on another processor. For example, if a logic simulation of a microprocessor system is divided into a process simulating a RALU, another simulating a control unit, a third simulating a first level cache of a memory system, and a fourth simulating upper level memory, from time to time the process simulating the RALU may need to receive data from, and send data to, the process simulating the first level cache, and, when the simulated cache scores a "miss", the cache process will need to communicate with the process simulating the upper level memory.

[0006] It has been found that rapid, reliable, communications between processors within a parallel computing system is essential to successful execution.

[0007] A massively— parallel computer system is one in which there are large numbers of processors, each of which has at least some program and data memory associated with it, typically operating in a multiple- instruction, multiple-data (MIMD), processing model.

[0008] In provisional patent application 13/425136, a parallel computing system is described that is adapted to communicate with "scatter- gather" and "all to all" operations. In the scatter-gather operation, a message is sent from a first processor of the system to other processors of the system; the message either includes data being sent to those processors, or includes a request for the other processors to return specific data to the first processor.

[0009] Traditionally, the all-to-all communication exchange uses the binomial-tree multicast model. Fig. 69 illustrates an example of a binomial-tree multicast all-to-all exchange, showing the first of four tree broadcast interactions. Each processing element performs an iteration of the exchange. If "n" is the total number of processing units then it takes "n log2n" (see: Fig. 1 ) communication steps to complete. As can be seen in Fig. 69 , each communication step consists of a pair of processing units in communication.

[0010] It is noted that pair-wise communication is considered safe, as loop-back and other checks, including parity or other checksum checks combined with acknowledgment packets, can be performed on the data to insure that it arrived unchanged, and a retry can be initiated if corruption occurs. Unacknowledged broadcast communications are considered unsafe, since the sending processor may not recognize and correct communication errors.

SUMMARY

[0011] A system and method for communication in a parallel computing system is applied to a system having multiple processing units, each processing unit including processors), memory, and a network interface, where the network interface is adapted to support virtual connections. The memory has at least a portions of a parallel processing application program and a parallel processing operating system. The system has a network fabric between processing units. The method involves identifying need for communication by the first processing unit with a group of processing units, creating virtual connections between the processing units, and transferring data between the first processing units.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] Figure 1 is a system diagram showing an exemplary computing environment in which a system for decomposing functional data functions.

[0013] Figure 2 is a prior art standard functional decomposition diagram.

[0014] Figure 3 shows an example of multiple threads from decomposition of function with dissimilar parameters.

[0015] Figure 4 shows an example of functional decomposition with transition conditions and threads.

[0016] Figure 5 shows an example of functional decomposition with conditions, threads and added loops.

[0017] Figure 6 is an example illustrating the highest level decomposition (level 0).

[0018] Figure 6a is a flowchart showing an exemplary algorithm for converting an MPfd to a finite state machine.

[0019] Figure 7 shows an exemplary functional decomposition diagram.

[0020] Figure 8 shows a finite state machine view of the translation of a single-process bubble into its state machine equivalent.

[0021] Figure 9 shows an exemplary lower level decomposition diagram, functional decomposition view.

[0022] Figure 10 shows an exemplary lower level decomposition diagram, finite state machine view. [0023] Figure 11 shows multiple loops, functional decomposition view.

[0024] Figure 12 shows an example of multiple loops, finite state machine view.

[0025] Figure 13 shows an example of a loop with label, functional decomposition view.

[0026] Figure 14 shows an example of a loop with label, finite state machine view.

[0027] Figure 15 shows an example of multiple data on lines and multiple conditions on transition.

[0028] Figure 16 shows an example of transition and data lines using labels.

[0029] Figure 17 is an exemplary lower level decomposition diagram with composite variable names, functional decomposition view.

[0030] Figure 18 is an exemplary lower level decomposition diagram without composite array names and dimensionality.

[0031] Figure 19 is an exemplary lower level decomposition diagram with composite array names and dimensionality.

[0032] Figure 20 is an exemplary lower level decomposition diagram with composite matrix names with multiple dimensions.

[0033] Figure 21 shows an example of associated bubbles linked via control-flows.

[0034] Figure 22 shows an example of unassociated bubbles.

[0035] Figure 23 shows an example of data associated bubble.

[0036] Figure 24 shows an example of control linked, unassociated level-2 bubbles.

[0037] Figure 25 shows an example of transformation to standard unassociated form.

[0038] Figure 26 shows an example of transformation to standard associated form.

[0039] Figure 27 shows an example of unassociated process bubbles to task parallel indicating finite state machine. [0040] Figure 28 shows an example of transpose notation, functional decomposition view.

[0041] Figure 29 shows an example of transpose notation, finite state machine view.

[0042] Figure 30 shows an example of scatter/gather notation, functional decomposition view.

[0043] Figure 31 shows an example of scatter/gather, finite state machine view.

[0044] Figure 32 shows an example of parallel i/o indication.

[0045] Figure 33 shows an example of selecting particular matrix elements.

[0046] Figures 34a and 34b show examples of incomplete decomposition.

[0047] Figure 35 shows an example of a 1 -dimensional monotonic workload symbol, functional decomposition view.

[0048] Figure 36 shows an example of a 1 -dimensional monotonic workload symbol, finite state machine view.

[0049] Figure 37 shows an example of a 2-dimensional monotonic workload symbol, functional decomposition view.

[0050] Figure 38 shows an example of a 2-dimensional monotonic workload symbol, finite state machine view.

[0051] Figure 39 shows an example of a 3-dimensional monotonic workload symbol, functional decomposition view.

[0052] Figure 40 shows an example of a 3-dimensional monotonic workload symbol, finite state machine view.

[0053] Figure 41 shows an example of a left-right exchange symbol

- no stride, functional decomposition view.

[0054] Figure 42 shows an example of a left-right exchange symbol

- no stride, finite state machine view.

[0055] Figure 43 shows an example of a left-right exchange - with stride, functional decomposition view.

[0056] Figure 44 shows an example of a left-right exchange - with stride, finite state machine view. [0057] Figure 45 shows an example of a next-neighbor exchange symbol - no stride, functional decomposition view.

[0058] Figure 46 shows an example of a next-neighbor exchange - no stride, finite state machine view.

[0059] Figure 47 shows an example of a next-neighbor exchange symbol - with stride, functional decomposition view.

[0060] Figure 48 shows an example of a next-neighbor exchange - with stride, finite state machine view.

[0061] Figure 49 shows an example of a 3-dimensional next- neighbor exchange symbol - no stride, functional decomposition view.

[0062] Figure 50 shows an example of a 3-dimensional next- neighbor exchange - no stride, finite state machine view.

[0063] Figure 51 shows an example of a 3-dimensional next- neighbor exchange symbol - with stride, functional decomposition view.

[0064] Figure 52 shows an example of a 3-dimensional next- neighbor exchange - with stride, finite state machine view.

[0065] Figure 53 shows an example of a 2-dimensional matrix with 2-dimensional stencil for 2-d next-n-neighbor exchange symbol - no stride, functional decomposition view.

[0066] Figure 54 shows an example of a 2-dimensional matrix with 2-dimensional stencil for 2-d next-n-neighbor exchange - no stride, finite state machine view.

[0067] Figure 55 shows an example of a 2-dimensional matrix with 2-dimensional stencil for 2-d next-n-neighbor exchange symbol - with stride, functional decomposition view.

[0068] Figure 56 shows an example of a 2-dimenssional matrix with 2-dimensional stencil for 2-d next-n-neighbor exchange - with stride, finite state machine view.

[0069] Figure 57 shows an example of a 1 -dimensional all-to-all exchange symbol - no stride, functional decomposition view.

[0070] Figure 58 shows an example of a 1 -dimensional all-to-all exchange - no stride, finite state machine view. [0071] Figure 59 shows an example of a 1 -dimensional all-to-all exchange symbol - with stride, functional decomposition view.

[0072] Figure 60 shows an example of a 1 -dimensional all-to-all exchange - with stride, finite state machine view.

[0073] Figure 61 shows an example of a 2-dimensional all-to-all exchange symbol - no stride, functional decomposition view.

[0074] Figure 62 shows an example of a 2-dimensional all-to-all exchange - no stride, finite state machine view.

[0075] Figure 63 shows an example of a 2-dimensional all-to-all exchange symbol - with stride, functional decomposition view.

[0076] Figure 64 shows an example of a 2-dimensional all-to-all - with stride, finite state machine view.

[0077] Figure 65 shows an example of a 3-dimensional all-to-all exchange symbol - no stride, functional decomposition view.

[0078] Figure 66 shows an example of a 3-dimensional all-to-all exchange - no stride, finite state machine view.

[0079] Figure 67 shows an example of a 3-dimensional all-to-all exchange symbol - with stride, functional decomposition view.

[0080] Figure 68 shows an example of a 3-dimensional all-to-all exchange - with stride, finite state machine view.

[0081] Figure 69 is a diagram of a traditional all-to-all

communication on a parallel computer system.

[0082] Figure 70 illustrates a Howard all-to-all exchange, where a number of virtual channels are provided equal to the number of processors minus one.

[0083] Figure 71 is a system diagram showing an exemplary computing environment in which the present system may operate.

[0084] Figure 72 is a system block diagram annotated to show the exchanges of a Howard all-to-all exchange among processing units executing a program.

[0085] Figure 73 is a table indicating a correspondence between traditional data interchange in a parallel processing system, the Howard all-to- all exchange, and the Howard Scatter-Gather operation. [0086] Figure 74 an example of a decomposition of a loop extracted to standard form with an added parameter for indicating communications intent.

DETAILED DESCRIPTION OF THE EMBODIMENTS Decomposition

[0087] Traditional models for functional decomposition of algorithms are vague in their definition of lower decomposition levels. In the Yourdon structured model, control transformations decompose into state transition diagrams which represent the real-time aspects of the system. Although control transformations were used by Yourdon, Ward and Millor, and Hatley and Pirbhai to define real-time control transformation events, their definition of control transformation does not include any of the following types of software statements: goto, if-then-else, switch loops, and subroutine calls.

[0088] If the transformations decompose from the highest to the lower levels, but the complexity is not constrained by the developer as the functionality decomposes, as in the McCabe model, the amount of control is unconstrained, and it is not clear when the decomposition should end.

Furthermore, since the unconstrained decomposition does not inherently simplify the design, it does not actually meet the criteria of mathematical functional decomposition.

[0089] To eliminate the above-noted shortcomings of previous decomposition methods, a simple graph, created in accordance with the multiprocessor functional decomposition (MPfd) model described herein, is constrained to a single control structure per decomposition level and exposes all transitions, preparing the graph for translation into a finite state machine (FSM).

[0090] Traditionally, FSMs have been used to create compilers and have also been used in sequential circuit design. Being able to use FSMs in general software design and thus in general programming offers huge benefits for general programming including increased software clarity and the ability better combine computer software with computer hardware. [0091] Disclosed herein are a system and method for performing functional decomposition of a software design to generate a computer- executable finite state machine. Initially, the software design is received in a form wherein functions in the software design are repetitively decomposed into (1 ) data and control transformations. Included between the functions are control flow indicators which have transformation-selection conditions associated therewith. The data transformations and the control

transformations are translated into states in the finite state machine. The transformation-selection conditions associated with the control

transformations are translated into state transitions in the finite state machine.

[0092] Although functional decomposition has long been used to design software, the multiprocessor functional decomposition (MPfd) techniques and methods described herein extend beyond mere design. First, any design created using the presently described MPfd methods can, by definition, be translated directly into a finite state machine (FSM). Since field programmable gate arrays (FPGAs) and graphical processing units (GPUs) use FSMs in their programming, the MPfd is useful in creating not only CPU but GPU and FPGA codes as well. Second, incorrect MPfd structures can be automatically detected and corrected. Third, MPfd techniques incorporate the automatic selection of the pass-by-value or the pass-by-reference data movement model for moving data between functional elements. This allows the presently-described system to combine computer languages like "C" and "C++" with other computer languages like Fortran or Java. Fourth, MPfd elements are annotated with information concerning the use of any data, not just the data type. Using the MPfd model to automatically find task-level and non-task-level parallelism from design, instead of the user finding it within the code, allows separate compute threads to simultaneously process data.

[0093] Since a task in the present system is equivalent to one or more data transformations (or simply "transformations") and since a transformation is a state in the present finite state machine (FSM), showing which states can be executed in parallel is equivalent to indicating the task parallelism. Definitions

[0094] For the purpose of this document, the following definitions are supplied to provide guidelines for interpretation of the terms below as used herein:

[0095] Function - a software routine, or more simply an algorithm that performs one or more transformations.

[0096] Control Kernel - A control kernel is a software routine or function that contains only the following types of computer-language constructs: subroutine calls, looping statements (for, while, do, etc.), decision statements (if-then-else, etc.), and branching statements (goto, jump, continue, exit, etc.).

[0097] Process Kernel - A process kernel is a software routine or function that does not contain the following types of computer-language constructs: subroutine calls, looping statements, decision statements, or branching statements. Information is passed to and from a process kernel via RAM.

[0098] State Machine - The state machine employed herein is a two- dimensional network which links together all associated control kernels into a single non-language construct that provides for the activation of process kernels in the correct order. The process kernels form the "states" of the state- machine while the activation of those states form the state transition. This eliminates the need for software linker-loaders.

[0099] State Machine Interpreter - for the purpose of the present document, a State Machine Interpreter is a method whereby the states and state transitions of a state machine are used as active software, rather than as documentation.

[0100] Node - A node is a processing element comprised of a processing core, or processor, memory and communication capability.

[0101] Data transformation - A data transformation is a task that accepts data as input and transforms the data to generate output data.

[0102] Control transformation - A control transformation evaluates conditions and sends and receives control to/from other control

transformations and/or data transformations. [0103] Control bubble - A control bubble is a graphical indicator of a control transformation. A control bubble symbol indicates a structure that performs only transitions and does not perform processing.

[0104] Process bubble - A process bubble is a graphical indicator of a data transformation.

[0105] Finite state machine - A finite state machine is an executable program constructed from the linear code blocks resulting from

transformations, where the transformation-selection conditions are state transitions constructed from the control flow.

Computing Environment

[0106] Figure 1 is an exemplary diagram of the computing environment in which the present system and method operates. As shown in Figure 1 , system 100 includes a processor 101 which executes tasks and programs including a kernel management module 1 10, an algorithm management module 105, state machine 124, a kernel execution module 130, and an algorithm execution module 125. System 100 further includes storage 107, in which is stored data including libraries 1 15 / 120 which respectively store algorithms 1 17 and kernels 122. Storage 107 may be RAM, or a combination of RAM and other storage such as a disk drive.

Module 102 performs a translation of a graphical input functional

decomposition diagram 700 (see, e.g., Figure 7) to corresponding MPfd functions (ultimately, states in a state machine), and stores the translated functions in appropriate libraries in storage area 108. Module 103 generates appropriate FSMs from the translated functions.

[0107] System 100 is coupled to a host management system 145, which provides management of system functions, and issues system requests. Algorithm execution module 125 initiates execution of kernels invoked by algorithms that are executed. Algorithm execution system 135 may comprise any computing system with multiple computing nodes 140 which can execute kernels stored in system 100. Management system 145 can be any external client computer system which requests services from the present system 100. These services include requesting that kernels or algorithms be added/changed/deleted from a respective library within the current system. In addition, the external client system can request that a kernel/algorithm be executed. It should be noted that the present system is not limited to the specific file names, formats and instructions presented herein.

[0108] A kernel is an executable computer program or program segment that contains data transformation/data code, and no program execution control code, where execution control code is any code that can change which code is to be executed next. In the exemplary embodiment described herein, kernels 122 are stored in a kernel library file 121 in kernel library 120.

[0109] An algorithm is a state machine that comprises states (kernel invocations) and state transitions (the conditions needed to go from one state to another). References to the "system" in this section refer in general to system 100, and in applicable embodiments, to algorithm management module 105. Each algorithm 1 17 is kept in an algorithm definition file 1 16 in algorithm library 1 15 with a name (Algorithm_Title) that is the concatenation of the organization name, the category name, algorithm name, and user name with a '_' character between each of the names.

Algorithm Definition File with Task Parallelism Example:

[0110] StateNumber[(state1 state n), state x, state y, state z)],

KernellD(nodelnfoXlnputDatasets)(OutputDatasets)(Transiti ons)(Loops)

[0111] In the above example, the parallel tasks are executed at the same time as "StateNumber".

Functional Decomposition

[0112] A control transformation evaluates conditions and sends and receives control. One primary difference between the Yourdon model and the present MPfd model is in how control transformations are handled. MPfd allows a control transformation to contain non-event control items. Non-event control items are conditions that change the sequence of execution of a program (if-then-else, go to, function calls, function returns), and a condition is a regular conditional expression. [0113] Variables used by a control transformation can only be used in a condition; they cannot be transformed into any other value. An Invoke instruction initiates system operation; variables and constants are used in conditions to transition to a control transformation; and a Return instruction gives control back to the control transformation with the name of the returning routine. A control transformation can have only one selection condition per transformation, and there can be, at most, one control transformation per decomposition level.

[0114] The MPfd model creates hierarchical finite state machines (HFSM) whose state transitions have conditions and whose states are data transformations and control transformations. Data transformations can always, eventually, be associated with linear code blocks, while control transformations contain only transitions with no associated code blocks.

[0115] Data transformations represent the parallelizable portion of the software design. In MPfd designs, there are three data transformation types: associated, unassociated, and ambiguous. These types are concerned with the relationship between an upper-level transformation and its immediate next-level decomposition.

[0116] Associated transformations are grouped together and share data and/or control. Unassociated transformations are grouped together but share no data or control. Unassociated transformations can be executed in parallel. This is called task-level parallelization. Ambiguous transformations can always be converted to either associated or unassociated forms.

[0117] A data transformation can contain three types of looping structures: pre-work, post-work and recursion. Pre-work means that the loop- ending condition is checked prior to performing the work and is denoted by a downward-pointing solid-loop symbol on a transformation. Post-work means that the loop-ending condition is checked after performing the work and is denoted by an upward-pointing solid-loop symbol on a transformation.

Recursion means that the transformation calls itself and is denoted by a downward-pointing dashed-loop symbol on a transformation. [0118] In the Yourdon model, only the control transformation decomposes into a finite state machine (FSM). In an MPfd design, the entire diagram of the current decomposition level is converted into an FSM.

[0119] The lowest level of transformation decomposition represents a linear code block. Decomposition ends when a data transformation cannot decompose into a set of data transformations grouped together with a control transformation or when the decomposition results in the same graph as the decomposed transformation.

Equation 1 Mathematics of Functional Decomposition

y = f{a, b, c, ... ) = g (/7 1 (h 2 (a, b), c), h 3 (d, h (e), f), h n (a, b, c, ... ))

[0120] In the example of Equation 1 above, the "h x Q" functions can also be decomposed, and this decomposition can continue. In standard decomposition, there is no specified last decomposition. In an MPfd, the decomposition continues until only a series of function calls depicting the structure of the function remains. A final decomposition then occurs when there are no function calls, and only a single data transformation remains. At that point, the decomposition has progressed to the kernel level, with the non- transformation functions equivalent to control kernels and the transformation- only functions equivalent to process kernels. By its nature, an MPfd forms a disjoint, fully reduced set of functions.

Function Dependencies

[0121] Transforming a function into its decomposed equivalent set of functions means hierarchically identifying functions within functions such that the equivalent functionality of the original function is maintained while the complexity of the component functions simplifies. This can be illustrated using the "g()" function from Equation 1. The function g(hi(h2(a, b), c), h3(d, h.4(e)), ... h n (a, b, c, d, e, f)) uses the various "h x Q" functions as its parameters. The "h x ()" functions can, thus, be ordered by the "g()" function in the same way as variables are ordered within a function. If some or all of the "h x Q" functions were also decomposed, they would have the decomposed functions as additional parameters. Unfortunately, the standard decomposition diagram notation does not make this functional ordering fully visible; that is, usually, the ordering is bound in the mathematics of "g()".

[0122] The standard view of the functional ordering of decomposed functions "g()" might give is shown in Figure 2, which is a diagram showing a standard, prior art, functional decomposition. The function-order arrows (control flow indicators) on the standard functional decomposition diagram of Figure 2 indicate the calling order of the functions. This calling order comes from a combination of the decomposition level (indicated by the level number shown on the diagram) and the parameter order of the functions as shown in Figure 2. If the parameters used by some functions are different from those used by some other functions, those disjoint functions can be executed in parallel. The functions that share the same parameters are said to be joint and are executed serially.

[0123] In order to create different joint execution streams, in accordance with the present MPfd model, each function in a particular algorithm receives an execution-stream identifier. In the present exemplary embodiment, this execution-stream identifier is represented as a program thread. Graphically illustrated, this MPfd-type decomposition takes the form shown in the diagram of Figure 3, which shows multiple threads from decomposition of a function with dissimilar parameters. By examining Figure 3, it can be seen that thread 1 is used to coordinate the parallel execution of threads 2 and 3. In threads 2 and 3, the thread-sharing functions share variables and are linear to each other, but it is clear that threads 2 and 3 do not share data. Since there are no linear dependencies between thread 2 and thread 3 and no shared data, the two threads can be executed

simultaneously.

Conditions for Transition

[0124] In a standard functional decomposition diagram, the function- order arrows contain no information other than indicating a general relationship. In the present system, a condition is added to the function-order arrows and this additional information can be used to identify additional parallelism. The MPfd control flow indicators each comprise a function-order arrow plus an associated condition. Adding function-calling or transition information to a function-order arrow is a way to graphically depict the circumstances under which a function is called; that is, it shows the underlying logical/mathematical rationale for transitioning to another function. For example, separate threads containing functions with the same parameters can be identified if their transition conditions are different, as shown on Figure 4, which shows an example of functional decomposition with transition conditions and threads.

[0125] When the various function-order arrows indicate the transition conditions, they can be thought of as state-transition vectors. If one ignores the variables, the called functions can be thought of as states. Note that the transitions shown in Figure 4 are of two types: conditional from calculation, and conditional because a particular function has completed. Both types are necessary.

Multiple Threads as Nested Finite State Machines

[0126] Since parameters are a part of the function, they can be considered part of the state. Thus, the present functional decomposition with conditions and threads is functionally equivalent to a finite state machine. Furthermore, since each thread is separate from all other threads and each thread consists only of states and transitions, the threads act as their own state machines. Finally, since the threads are hierarchically formed, they depict nested finite-state machines.

Loops

[0127] As previously indicated, function transitions containing one of two types of transition conditions are required to externalize the control elements of functions, allowing them to be gathered together as threads. It is also clear that the transition is a separate entity type from the functions themselves. Loops or looping structures can be thought of as special, more generalized cases of function transition. Whereas a function transition contains only a condition, a looping structure contains a loop order, an initial loop-index value, a loop-index change calculation, and a loop-ending calculation. [0128] Figure 5 shows an exemplary functional decomposition with conditions, threads and added loops. The example in Figure 5 shows three loops: a single loop for a specific function, an outer loop across functions, and an inner loop. The loop across functions can be used to loop at the thread level. An inner loop, indicated by having the lowest number in a multiple-loop system, is incremented first with subsequent numbers then incremented in successive order. It should be noted that it is not possible to loop between threads.

Functional Decomposition Graphical Model

[0129] At this point, the ideas of the prior sections are manually incorporated into a simple graphical model (e.g., a functional decomposition diagram 700, described below with respect to Figure 7, et. seq.) that insures that all of the transitions are exposed. The functional decomposition diagram 700 is then input into graphics storage 108, and translated via graphics translation module 102 into corresponding functions in accordance with the MPfd decomposition methods described herein. The translated functions are stored in memory area 108.

[0130] It should be noted that a looping structure can be attached to any decomposition element. This looping structure initializes some data element (variable, array element, or matrix element), performs a calculation on the data element, tests the changed element value for the ending condition, and then transitions to the next functional decomposition element required if the condition is met. The data element used as the loop index is one of the function parameters, allowing the looping structure to interact with the functional element.

Highest Level of Decomposition

[0131] Level 0 of the MPfd consists of only three types of objects: (1 ) terminators, (2) a single-process bubble (or other indicator) corresponding to the un-decomposed function, and (3) data stores, along with function transitions, loops, and function parameters. The purpose of the highest level of decomposition is to place a function into a larger context. This is accomplished by allowing systems that are external to the function to transmit data and control to/from the function. A terminator represents a complete external system. Figure 6 shows an example of the highest level (level-0) decomposition. The "Function Transition Conditions" of Figure 6 correspond to the "Transition Conditions" shown in Figure 4. The "Process Bubble Name" of Figure 6 corresponds to function "g()" of Equation 1 and Figures 2 - 5. The "Function Parameter Names" of Figure 6 correspond to the parameters shown in Equation 1 and Figures 2 - 5.

Terminators

[0132] A terminator may be represented as a labeled square. The purpose of terminators is to be able to identify interfaces to outside systems. These interfaces do not correspond to any mathematical functions but instead represent access to data outside of the un-decomposed function. A terminator can be used to represent anything from another computer system to a display screen. Functionally, a terminator behaves similarly to a data store in that data can be sent from/to the terminator from/to the un- decomposed function. The difference between a terminator and a data store is that a terminator can transition from/to the un-decomposed function.

Process Bubble

[0133] A process bubble, adds data, changes data, deletes data, or moves data. Since a process-bubble manipulates data, all activities associated with sending and receiving data to various stores is allowed.

Furthermore, since a data element can also serve as a signal, activities associated with various signals are also allowed. A process bubble, as employed in the MPfd model, is a graphical indicator of a data transformation, which is a task that accepts input data and transforms it to generate output data.

Exemplary Allowed Process Bubble Activities

1 ) send data to a data store using output dataflow

2) receive data from a data store using input dataflow

3) Send standard signals to control-bubbles

4) Receive standard signals from control-bubbles 5) Send standard signals to terminators

6) Receive standard signals from terminators

7) Send data to terminators

8) Receive data from terminators

Single-Process Bubble

[0134] The single-process bubble of the highest level of

decomposition represents the un-decomposed function. Since the function is not decomposed, there can be only one level-0 process bubble. It is assumed that the level-0 process bubble will be decomposed into other functions.

Data Stores

[0135] A function typically transforms data. One way to graphically depict the transmission of data from/to the single-process bubble is via a terminator. Another way is with a data store. The displayed data stores can send/receive parameter data to/from the single-process bubble.

Control Bubble

[0136] A control bubble is a graphical indicator of a control transformation, which evaluates conditions and sends and receives control to/from other control transformations and/or data transformations. A control bubble symbol indicates a structure that performs only transitions that control the processing flow of a system, and which does not perform processing.

Conversion of MPFD to Finite State Machine

[0137] A primary goal of functional decomposition is the conversion of an MPfd into a finite state machine. This conversion is enabled by adhering to the following rules:

1 ) There can be only one control bubble at each decomposition level.

2) Only a control bubble can invoke a process bubble.

3) A process bubble can only transmit or receive data from a data store via a data flow.

4) A control bubble can only receive and use data as part of

determining which process bubble is to be called. 5) A control bubble can use process bubbles that have completed to sequence to other process bubbles.

6) Data used by a control bubble must be from a process flow.

7) Process bubbles always return control to their calling control

bubble.

8) A control bubble can receive/use/send control signals from/to

control flows.

9) Process bubbles can decompose into simpler process bubbles and/or a single control bubble and process bubbles.

[0138] An exemplary algorithm for converting an MPfd to a finite state machine is shown in Figure 6A and described below.

Conversion Algorithm

[0139] Step 605: Compare decomposition level x with level( x+ i) and determine if level( X+ i) process bubbles are associated or un-associated. A functional decomposition element, herein represented by a bubble symbol, can decompose into two types: associated and unassociated. Association has to do with the next-level decomposition of the bubble. Depending on the association type, loops defined at a higher decomposition level behave differently when they are integrated into a lower decomposition level.

[0140] If an un-decomposed bubble labeled "A" is decomposed into bubbles labeled "1 ", "2", "3", and "C", then the un-decomposed bubble is said to reside at Level 1. Bubbles "1 ", "2", "3", and "C" are said to reside at Level 2. If a control-flow links together any level 2 bubbles, then those bubbles are said to be associated. If the control-flows do not link together the level 2 bubbles, those bubbles are said to be unassociated.

[0141] Step 610: If level( X+ i) process bubbles are associated, then perform the following steps 615 - 630.

[0142] Step 615: Any loops found at level x start with the first associated process bubble and end with the last associated process bubble. That is, multiple states are in the loop. All loops are associated with the set of process bubbles. This step machine-analyzes the design and correctly interprets how the loops work. Using information from one decomposition level to next allows the system to change the algorithm definition file 1 16 such that the loops are executed correctly.

[0143] Step 620: The single control bubble that associates the levelx process bubbles will be the first state on the FSM of level(x+1 ).

[0144] Step 625: Level( x+ i) control flows are translated into state transition vectors of the level( X+ i) FSM.

[0145] Step 630: Level( X+ i) process bubbles are translated into the state of the FSM.

[0146] Step 635: If level( X+ i ) process bubbles are un-associated, then perform the following.

[0147] Step 640: Any loops found at levelx will form a loop of the same type on each un-associated level( X+ i )) process bubble.

[0148] Step 645: Decompose any non-recursively defined process bubble into an "x+1 " level of the decomposed process bubble. Decomposition levels are complete when an "x+1 " decomposition has no control bubble (a group of un-associated process bubbles) or when there is no "x+1 " level (step 650). All level( x+ i ) data stores are hidden within the states of the FSM. The various "x+1 " levels are represented as nested states, that is, each state is also an FSM.

[0149] Figure 7 shows an exemplary functional decomposition diagram 700 and Figure 8 shows a finite state machine view of the translation of a single-process bubble into its state machine equivalent. As used herein, the term "bubble" refers to a graphical element such as a solid or dashed line having the approximate form of a circle, ellipse, polygon, or the like. Notice that the control bubble is shown in the finite state machine view as the first state; only the control flows are seen, and these act as state transitions. The looping structure is captured as a looping state transition in the finite state machine 800. The process bubbles are translated into the states of the finite state machine. The data stores are captured as part of the states. Throughout this document, where applicable, both the functional decomposition and finite state machine view are shown in the Drawings. Lower Level Decomposition

[0150] All decomposition levels below level 0 have one additional item: the control bubble. There is only one control bubble per function decomposition. The purpose of the control bubble symbol is to indicate a structure that performs only transitions and does not perform processing. This symbol has the effect of insuring that all non-looping control is fully exposed. Allowing only a single control bubble per function decomposition forces the complexity of the work to be expressed primarily through decomposition, insuring a structured decomposition with the minimum amount of complexity for each of the decompositions. The control bubble retains the name of the higher-level process bubble.

[0151] Figures 9 and 10 respectively show functional decomposition and finite state machine views of an example of a lower level decomposition. The process bubbles cannot directly send information from one process bubble to another but can do so through a data store. If the data store has the same name, the finite state machine view assumes it will have the same memory addresses. Likewise, a process bubble cannot directly transition to another process bubble but can do so through a control bubble, which is always the initial state.

Multiple Loops

[0 52] In order to denote multiple loops, each loop definition is defined separately. Figures 11 and 12 respectively show functional decomposition and finite state machine views of multiple loops. As shown in Figures 10 and 1 1 , "LPBN1" represents "Lower Process Bubble Name 1":

[0153] Because multiple loop definitions can take up so much space on the diagram, a label representing a loop definition table can be used instead, changing the loop display to that shown in Figures 13 and 14, which respectively show functional decomposition and finite state machine views of an exemplary looping operation.

[0154] Selecting the loop name can cause the loop definition(s) to be displayed as shown in Table 1 , below: TABLE 1 - EXAMPLE LOOP LABEL DEFINITION

[0155] All loops associated with a process bubble are considered nested loops: one loop is within another loop. The first loop defined is considered the inner-most loop, with each successive outer loop defined as surrounding the inner loop. Thus, the example given in Figure 1 1 and Table 1 means that Loop 2 is inside of Loop 1 ; that is, Loop 1 is invoked after Loop 2. Parallel loops occur when two or more process bubbles, without any mutual dependency and occurring at the same decomposition level, each have a loop. The loops of these independent, loop-bearing process bubbles can occur in parallel.

DATA ELEMENTS

Variables, Arrays, and Matrices

[0156] Variables, arrays, and matrices represent data elements of various orders. A variable is a single data element of a certain type and can be thought of as a zero-dimensional object. An array consists of multiple data elements arranged linearly and can be thought of as a single-dimensional object. A matrix consists of multiple data elements arranged into greater than one dimension and can be thought of as a higher-dimensional object.

Transitions and loops can use these data objects in their conditions and calculations. This means that there must be a precise way to discuss all data objects.

[0157] As with the looping structures, there can be multiple data elements per input/output data line or transition. This means that the line or transition can be identified using a label that points to the appropriate definition, as shown in Figures 15 and 16, which respectively show functional decomposition and finite state machine views.

[0158] Selection of the labeled transition in Figure 16 would then display:

Variables

[0159] A variable only requires a label and a type in order to identify it. The following composite label will fully identify a variable:

Type:variableName

The composite variable name changes the "Function Parameters Names" to a comma-separated list of composite variable names, as shown in Figure 17, which is a functional decomposition view of an exemplary lower level decomposition with composite variable names.

Arrays

[0160] An array requires a composite consisting of a label, a type, and an array index or element number to identify it. The following composite label will fully identify an array:

Type:variableName:"index or element #" If the symbol after the second colon is a Greek symbol, it represents an index; otherwise, it represents an array element. The first index represents a row in MPfd, the second index a column, and the third index the matrix depth.

Designating multiple array elements does not designate a loop, only the movement of a certain number of variables.

[0161] The composite array name changes the "Function

Parameters Names" to a comma-separated list of composite array names, as shown in Figure 18 (lower level decomposition diagram without composite array names and dimensionality) and

[0162] Figure 19 (lower level decomposition diagram with composite array names and dimensionality).

Matrices

[0163] A matrix requires a composite consisting of a label, a type, and multiple array element designations to identify it. The following composite label will fully identify an array:

Type:variableName a,b,... n

[0164] Each matrix element represents a matrix dimension. The first element represents the first dimension, the second element the second dimension, etc.

[0165] The composite matrix name changes the "Function

Parameters Names" to a comma-separated list of composite matrix names, as shown in Figure 20, which illustrates a lower level decomposition with composite matrix names with multiple dimensions.

Profiling to Determine Node Count

[0166] Determining how well a process bubble will scale requires knowing how much exposed work and how much exposed communication time is present. The work time can be obtained by measuring the execution time of the process bubble's attached code with data of a known size. The data comes from the test plans and procedures that are attached to every process bubble of every project designed using the MPfd model. The communication time comes from the a priori determination of actual communication time and actual latency time. As long as the following criteria is met, computational elements can be added to increase the processing performance of a process bubble, as shown in Equation 2:

Equation 2 Profile Parallel Target

S t / (M, + E t ) > T

Where: S t = Single-node processing time

M t = Multi-node processing time

E t = Exposed communication time

[0167] The target value T can be set by the present system.

Profiling will continue until the condition is no longer met. The minimum, maximum, and median dataset sizes associated with a design bubble for an particular kernel or algorithm are used to calculate the number of processing units for any dataset size greater than the minimum and less than the maximum.

Automatic Selection of Data Movement Model

[0168] In computer science parlance, there are two ways to transmit data into a function: pass-by-value and pass-by-reference. Pass-by-value simply means that only the contents of some memory location are transmitted to the function. Sending the contents of a memory location is equivalent to having a constant as an input parameter. That is, all changes made to the value are kept internal to the function with none of those changes accessible outside of the function. This provides for the "encapsulation" of data, insuring that unwanted side effects do not occur between functions. Pass-by- reference allows a function to have multiple output parameters.

[0169] The following information is associated with a data element on an MPfd: composite name, input designation, and output designation. The input/output designations are a function of the directions of the lines associated with the composite name. The three possibilities are input, output, or both.

Pass by Value

[0170] In an MPfd, pass-by-value is another way of saying that a scalar data element (not an array or matrix) is only input into a function, never output from a function. A constant value must also be passed by value as there is no variable, hence no possibility of referencing a memory location. The input-only scalar data element or constant must use pass-by-value, insuring that the data use is encapsulated. Thus, whenever a scalar or constant input is used in an MPfd, it will signify the use of the pass-by-value method.

Pass by Reference

[0171] If the composite name in an MPfd refers to vector data (an array or matrix), particular data elements must be accessible. In computer programming, such access occurs as an offset to some base location. Thus, the base memory location must be transmitted to the function. Also, if the contents of a memory location must change (as is the case for output scalars), the memory location of the data element needs to be known. In both cases, a memory location is passed to the function, called referencing, and the contents of the memory location(s) accessed, called dereferencing. This allows the memory locations to be accessed and changed, with the changes visible to other functions simply using the same differencing method.

Functional Decomposition Data Transmission Model

[0172] Since it is possible for an MPfd to determine the data transmission model (pass-by-value or pass-by-reference) automatically from information generated as part of an MPfd, one of the most confusing aspects of modern computer programming can now be performed automatically, from design.

Automatic Detection of Parallel Algorithm Decomposition

[0173] There are two types of parallel processing indicators that can be included on MPfd design diagrams: structural and non-structural.

Structural parallel indicators are determined by the design without any extra information. Task parallelism is an example of structural indication. Other types of parallelism detectable via structural indication include: transpose detection, parallel I/O detection, scatter detection, and gather detection. [0174] Non-structural parallel indicators need more information than is usually given in design in order to determine the type of parallelism.

Variable definitions in computer languages only support the following information: variable name, variable type, and number of dimensions.

Parallelizing a code requires two other types of information: topology and data intent. Topology defines the computational behavior at the edges of a vector or matrix -examples include: Cartesian, toroidal, and spherical.

[0175] Data intent is the intended use of the data; examples include:

(1 ) particle-like usage - the data represents particles that move

throughout a matrix and may interact,

(2) field-like usage - a force that affects to some degree data across a large section of the matrix simultaneously,

(3) search-like intent - data that interacts with a larger set of data, giving some result, and

(4) series expansions/contractions - calculation of the terms of a

mathematical series.

[0176] The present MPfd method allows a designer to indicate the algorithm processing topology and the data intent, giving the design the information required to complete the parallel processing. The topology can be calculated by the present system 100 based upon the data intent.

Alternatively, the topology information can be added to the vector or matrix information of the input data of a transformation by the designer.

[0177] Since an algorithm is defined as a functional decomposition element, it can be decomposed into multiple, simpler algorithms and/or kernels. As previously noted, a functional decomposition element, herein represented by a bubble symbol, can decompose into two types: associated and unassociated. Association has to do with the next-level decomposition of the bubble. Depending on the association type, loops defined at a higher decomposition level behave differently when they are integrated into a lower decomposition level.

[0178] If the un-decomposed bubble labeled "A" is decomposed into bubbles labeled "1", "2", "3", and "C" then the un-decomposed bubble is said to reside at Level 1. Bubbles "1", "2", "3", and "C" are said to reside at Level 2. If the control-flows link together the level 2 bubbles then those bubbles are said to be associated. Figure 21 shows an example of associated level-2 bubbles linked via control-flows.

[0179] If a looping structure is added to Level 1 (Bubble A) then this is interpreted to have the following effect on Level 2: 1 ) the loop will start with the activation of the first process bubble and end with the last process-bubble ending, 2) the loop will continue to restart the first process bubble until the end-of-loop condition occurs, and 3) upon completion of the loop, control will be transferred back to the original level-1 -defined control bubble or terminator. This is also shown in Figure 21.

[0180] If the control-flows do not link together the level 2 bubbles, those bubbles are said to be unassociated. Figure 22 shows an example of unassociated level-2 bubbles.

[0181] If a looping structure is added to Level 1 (Bubble A) then the looping structure is added to each of the unassociated level 2 bubbles. This is shown in Figure 23. It is possible for level 2 bubbles to appear to be unassociated because no control-flow binds them but be associated instead via data. Data-associated level 2 bubbles are shown in Figure 23.

[0182] Similarly, it is possible to have level-2 bubbles which use the same control structure actually be unassociated as long as neither the control- flows nor the data associates them. This type of unassociated bubble structure is shown in Figure 24.

[0183] If the decomposition is incorrect, it is sometimes possible to rearrange the decomposition based upon association. An example of this transformation to standard unassociated form is shown in Figure 25.

Similarly, it is sometimes possible to rearrange the decomposition based upon un-association, as shown in Figure 26, which is an example showing transformation to standard associated form.

Unassociated Process Bubbles Indicating Task Parallelization

[0184] When process bubbles are grouped together but are not associated, this indicates that those processes can occur at the same time if the tasks are executed on parallel hardware. Figure 27 shows unassociated process bubbles to task parallel indicating finite state machine. Block 2700 indicates a new state made by the system, creating task level parallelism.

Transpose Notation

[0185] By telling the functional decomposition elements that a vector's or an array's data comes in and is processed then leaves, an opportunity to perform a scatter/gather operation (described below) is defined. The indices on an input vector or matrix are reversed on the output version of the same matrix, and the indices are found in the loop, as shown in Figure 28, which shows a transpose notation in functional decomposition view. Note that the accent mark by the second "A" means that at least one element of array A has been changed. Figure 29 shows a transpose notation in finite state machine view.

Scatter/Gather Notation

[0186] A scatter/gather moves data to multiple nodes or gathers information from multiple nodes. The indices of the loops match the active indices of the data, and the order of the data indices does not change. Figure 30 shows an example of scatter/gather notation, functional decomposition view, and Figure 31 shows the corresponding finite state machine view. Note that if bubble 1 is the first activated process bubble then "A"' is an input, if bubble 1 is the last process bubble then "A" is an output matrix.

Parallel Input/Output Indication

[0187] Parallel input and output is defined as being from/to a terminator block. Since a terminator block represents another system interfacing with the currently under-design system, obtaining data from this external system is considered input and transmitting data to this external system is considered output. Inputs and outputs to/from terminator blocks can designate that data for the same vector or matrix is being received or sent via separate, parallel data lines by adding the "[ ]" designator to the vector or matrix index. For example, the following are parallel input-data streams defined, as shown in Figure 32:

Α α [0-Ι00],β[0-10] = 2-dimensional array "A" with indexes a and β. Elements 0 through 100 of index a and elements 0 through 10 of index β are input.

Α α [ΐ01-200],β[0-10] = 2-dimensional array "A" with indexes a and β.

Elements 101 through 200 of index a and elements 0 through 10 of index β are input.

Α α [201-300],β[0-10] = 2-dimensional array "A" with indexes a and β.

[0188] Accordingly, elements 200 through 300 of index a and elements 0 through 10 of index β are input.

[0189] Output works analogously. If separate vector or matrix elements are input/output to/from a process bubble but not to/from a terminator, then a simple element selection is indicated. An example of selecting particular matrix elements is shown in Figure 33, wherein process element "1 " receives data elements from the "A" matrix rows 0 through 100 and columns 0 through 10.

Decomposition Completeness

[0190] The present system can automatically determine if a functional decomposition is complete, as indicated in Figures 34A / 34B, which illustrate examples of incomplete decomposition. One example of incomplete decomposition is shown in Figure 34A. If there is at least one algorithm (bubble 3 in the left-hand diagram, or bubble 2 in the right-hand diagram) which does not decompose into only process and control kernels (the remaining bubbles in Fig. 34A) then the decomposition is incomplete. Another example of incomplete decomposition is shown in Figure 34B. If there is a bubble that does not have at least one input and one output then the decomposition is considered incomplete.

Cross-Communication Notation

[0191] Data-type issues typically revolve around the concept of data primitive types: integer, real, double, complex, float, string, binary, etc.

Groups of data entities are discussed via their dimensionality, as structures, or as structures containing data entities with various dimensionalities. Data primitives, data group structure, and dimensionality all represent a static view of the data. In an MPfd, this information is placed in a table that appears on data flows and data stores. Table 2, below, is an example of a table that provides this information.

TABLE 2 - VARIABLE DESCRIPTION

[0192] The variable name gives a name to an object for the

Decomposition Analysis graph. The description is a text description of the variable just named. The variable type is the data-primitive type. The number of dimensions describes the dimensionality of the variable: 0-dimension means a standard variable, 1 -dimension a vector, and >1 -dimension a matrix. The dimension size is required for >1 -dimensional objects to indicate the number of variable objects that occur in each dimension. The topology explains how the >0-dimensional object treats its space.

[0193] The following are potential topologies: unconnected edges: Cartesian; connected edges: 1-dimension (ring), 2-dimensions (cylindrical, toroid, spherical), and 3-dimensions (hyper-cube). The topology information follows the variable.

[0194] In computer systems, data is rarely static; it is moved, transformed, combined, taken apart: data in computer systems is typically dynamic. The dynamic use of the data is an attribute that is not typically shown in standard representations of data for computer use. With the advent of parallel processing, the dynamic aspects of the data are needed for the selection of the proper parallel processing technique. Examples of the graphical depiction of possible dynamic data usage are shown below.

Monotonic Data Use

Concept: Linked calculations whose workload grows or shrinks after each calculation.

Use: Whenever the workload changes monotonically for each component calculation in a series of calculations.

Arbitrary precision series expansion calculation of transcendental numbers.

Load balancing. Since the workload changes the last calculation has a workload that is very different from the first calculation. Since the computation time of a group of nodes working on a single problem is equal to computation time of the slowest node and, further, since the effect of naively placing the work in the same order as the calculation order is to concentrate the work onto a single node, this produces a non-optimal parallel solution.

None

Create a mesh to provide load balancing. The purpose of this mesh type is to provide load balancing when there is a monotonic change to the work load as a function of which data item is used. The profiler shall calculate the time it takes to process each element. Below shows a naive attempt to parallelize such a problem. Sixteen work elements are distributed over four computational nodes. The work increases or decreases monotonically with the work-element number. Below is a 1- dimensional example of a naive work distribution of a monotonic workload-changing problem. TABLE3 NAIVE WORK DISTRIBUTION OF A MONOTONIC WORKLOAD

CHANGING PROBLEM

[0195] The mesh shown in Table 3 decomposes the work elements by dividing the number of work elements by the number of nodes and assigning each work element to each node in a linear fashion.

[0196] Instead of linearly assigning work elements to nodes, the work elements can be alternated to balance the work. For monotonic workload changes, this means the first and last elements are paired, the second and second-to-last elements are paired, etc., as shown in Table 4:

TABLE 4 - NON-NAIVE WORK 1 -DIMENSIONAL DISTRIBUTION OF A

MONOTONIC WORKLOAD CHANGING PROBLEM

[0197] Figure 35 shows a 1-dimensional monotonic workload symbol in functional decomposition view. If a one-dimensional workload is monotonic, then that information is given to MPfd with the symbols shown in

* u*

Figure 35. The symbol a means that the work (represented as the work within a loop) changes monotonically and that this workload effect applies to vector "A". That is, a means that index alpha is intended to access the data monotonically. Thus the alpha is the loop index and the * mu* is the intended use of the data accessed using the alpha index.

[0198] Note that, for brevity, the loop is defined by

(index:calculation:condition) where the index is the loop index plus any clarifying symbol by the loop index, the calculation is the next index-value calculation, and the condition is the loop-ending condition. Figure 36 shows a 1 -dimensional monotonic workload symbol in finite state machine view. Table 5, below, shows a two-dimensional version of the monotonic workload- changing mesh.

TABLE 5 NON-NAIVE WORK 2-DIMENSIONAL DISTRIBUTION OF A

MONOTONIC WORKLOAD CHANGING PROBLEM

[0199] If a two-dimensional workload is monotonic then that information is given to MPfd with the following symbols. The symbol means that the work (represented as the work within a loop) changes monotonically and that this workload effect applies to vector "A".

[0200] Figure 37 shows a 2-dimensional monotonic workload symbol in functional decomposition view, and Figure 38 shows a 2- dimensional monotonic workload symbol in finite state machine view.

[0201] Table 6, below, shows a three-dimensional version of the monotonic workload-changing mesh.

TABLE 6 NON-NAIVE WORK 2-DIMENSIONAL DISTRIBUTION OF A

MONOTONIC WORKLOAD CHANGING PROBLEM

35, 222, 36, 37, 220, 38, 39, 218, 40,

33, 224, 34, 223

221 219 217

Y1

43, 214, 44, 45, 212, 46, 47, 210, 48,

41 , 216, 42, 215

213 21 1 209

51 , 206, 52, 53, 204, 54, 55, 202, 56,

49, 208, 50, 207

205 203 201

Y2

59, 198, 60, 61 , 196, 62, 63, 194, 64,

57, 200, 58, 199

197 195 193

Ζ3

67, 190, 68, 69, 188, 70, 71 , 186, 72,

65, 192, 66, 191

189 187 185

Υ1

75, 182, 76, 77, 180, 78, 79, 178, 80,

73, 184, 74, 183

181 179 177

83, 174, 84, 85, 172, 86, 87, 170, 88,

81 , 176, 82, 175

173 171 169

Υ2

91 , 166, 92, 93, 164, 94, 95, 162, 96,

89, 168, 90, 167

165 163 161

Ζ4

99, 158, 101 , 156, 102, 103, 154,

97, 160, 98, 159

100, 157 155 104, 153

Υ1

107, 150, 109, 148, 1 10, 11 1 , 146,

105, 152, 106, 151

108, 149 147 1 12, 145

1 15, 142, 1 17, 140, 1 18, 119, 138,

1 13, 144, 1 14, 143

116, 141 139 120, 137

Υ2

123, 134, 125, 132, 126, 127, 130,

121 136, 122, 135

124, 133 131 128, 129

[0202] Figure 39 3-dimensional monotonic workload symbol in functional decomposition view, and Figure 40 shows a 3-dimensional monotonic workload symbol in finite state machine view. If a three- dimensional workload is monotonic then that information is given to MPfd with the symbol shown in Figure 39. There are three symbols attached to the three loops (α ,β , and γ ). These symbols mean that the work

(represented as the work within a loop) changes monotonically and that this workload effect applies to vector "A".

Particle Use Model

Concept: Particles are used to define discrete objects that move about a vector or array.

Use: Modeling physical phenomenon, atoms, ray- traces, fluids, etc.

Example Use: Computational fluid dynamics, changing image analysis.

Parallel Issue: Information sharing.

Action: Determine what to cross communicate.

[0203] A one-dimensional particle exchange with Cartesian topology generates the following version (shown in Tables 7 and 8) of a left-right exchange.

TABLE 7 - INITIAL 1-DIMENSIONAL CONDITION BEFORE LEFT-RIGHT

EXCHANGE

(Cartesian Topology)

TABLE 8 - 1-DIMENSIONAL CONDITION AFTER ONE LEFT-RIGHT

EXCHANGE

[0204] A one-dimensional particle exchange with a Ring topology generates the following version (shown in Table 9 and 10) of a left-right exchange.

TABLE 9 - INITIAL 1-DIMENSIONAL CONDITION BEFORE LEFT-RIGHT

EXCHANGE

(Ring Topology)

[0205] Note: Node 4 edge information wraps around to nodei and nodei wraps around to node 4 in the Ring topology version of the left-right exchange.

[0206] Figure 41 (functional decomposition view) depicts a left-right exchange symbol ( * ττ*) indicating no stride, also shown in the finite state machine view of Figure 42. If a one-dimensional vector is used to depict particles then the * ττ * symbol shown in Figure 41 is used.

[0207] If the processing of the vector skips one or more elements (called striding) then less data needs to be exchanged. The index calculation on the loop indicator can be modified to *π+η * to indicate striding. Figure 43 depicts a left-right exchange - with stride in a functional decomposition view, and Figure 44 depicts a left-right exchange in finite state machine view.

[0208] A two-dimensional particle exchange with Cartesian topology, generates the following version (shown in Table 1 1 below) of a next-neighbor exchange (edge-number exchange only). TABLE 11 - INITIAL 2-DIMENSIONAL CONDITION BEFORE NEXT- NEIGHBOR EXCHANGE (CARTESIAN TOPOLOGY)

TABLE 12 - 2-DIMENSIONAL CONDITION AFTER ONE NEXT-NEIGHBOR

EXCHANGE (CARTESIAN TOPOLOGY)

[0209] Note: Parenthesis indicates that the information here is overlaid such that the underlying code treats it as if it were adjacent memory.

[0210] A two-dimensional particle exchange with Cylindrical topology generates the following version (shown in Tables 13 and 14) of a next-neighbor exchange (edge-number exchange only).

TABLE 13 - INITIAL 2-DIMENSIONAL CONDITION BEFORE NEXT- NEIGHBOR EXCHANGE (CYLINDRICAL TOPOLOGY)

TABLE 14 - 2-DIMENSIONAL CONDITION AFTER ONE NEXT-NEIGHBOR

EXCHANGE (CYLINDRICAL TOPOLOGY)

A two-dimensional particle exchange with Toroid topology generates the version of a next-neighbor exchange (edge-number exchange only) shown in Tables 15 and 16 below.

TABLE 15 - INITIAL 2-DIMENSIONAL CONDITION BEFORE NEXT-

NEIGHBOR EXCHANGE (TOROID TOPOLOGY)

TABLE 16 - 2-DIMENSIONAL CONDITION AFTER ONE NEXT-NEIGHBOR

EXCHANGE

(Toroid Topology)

[0212] Figure 45 shows a next-neighbor exchange - no stride, in functional decomposition view; Figure 46 shows a next-neighbor exchange - no stride, in finite state machine view; Figure 47 shows a next-neighbor exchange symbol - with stride, in functional decomposition view; and Figure 48 shows a next-neighbor exchange - with stride, in finite state machine view. If a two-dimensional matrix is used to depict particles then the symbol shown in Figures 45/47 is used. A new state is automatically added when the system recognizes that a next neighbor exchange is to be used. The data exchange is modified with the "stride" information indicating how much data to skip with each exchange.

[0213] A three-dimensional particle exchange with Cartesian topology generates the version of a next-neighbor exchange (edge-number exchange only) shown in Tables 17 and 18, below. TABLE 17 INITIAL 3-DIMENSIONAL CONDITIONS BEFORE NEXT-

NEIGHBOR EXCHANGE (CYLINDRICAL TOPOLOGY) 215, 216 220 223, 224

229, 230, 233, 234, 235, 237, 238,

225, 226, 227, 228

231 , 232 236 239, 240

Y2

245, 246, 249, 250, 251 , 253, 254,

241 , 242, 243, 244

247, 248 252 255, 256

TABLE 18 - DIMENSIONAL CONDITION AFTER ONE NEXT-NEIGHBOR

EXCHANGE

(Cartesian Topology)

(109,29,157)

(101,21, 133),(10

(97,17,129)(98, (104,105,88,25,153),

2,22,134),(103,2 (110,30,158) 18,130),(99,19, (106.26.154) ,

3,135),

131), (107.27.155) ,

(104,89,105,24,1 (111,31,159) (100,20,132) (108,28,156)

36)

(112,32,160)

(85.37.165) ,

(81.33.161) , (89,104,88,41,169), (93.45.173) ,

(86.38.166) ,

(82.34.162) , (90.42.170) , (94.46.174) ,

(87.39.167) ,

(83.35.163) , (91.43.171) , (95.47.175) ,

(88,40,168,89,10

(84,36,164) (92,44,172) (96,48,176)

5)

Y2 (125,61,189)

(113.49.177) , (117.53.181) , (120.57.185) ,

(114.50.178) , (118.54.182) , (122.58.186) , (126,62,190)

(115.51.179) , (119.55.183) , (123.59.187) ,

(116,52,180) (121,56,184) (124,60,188) (127,191),

(128,64,192)

Z3

(129.65.193) , (133.69.197) , (136.73.201) , (141.77.205) ,

(130.66.194) , (134.70.198) , (138.74.202) , (142.78.206) ,

(131.67.195) , (135.71.199) , (139.75.203) , (143.79.207) , (132,68,196) (137,72,200) (140,76,204) (144,80,208)

Y1 (165.85.213) , (169,152,168,89,

(161.81.209) , (173.93.221) ,

(166.86.214) , 217),

(162.82.210) , (174.94.222) ,

(167.87.215) , (170.90.218) ,

(163.83.211) , (175.95.223) ,

(168,153,169,88, (171.91.219) ,

(164,84,212) (176,96,223)

216) (172,92,220)

(145.97.225) , (149.101.229) , (153,152,168,105 (157.109.237) ,

(146.98.226) , (150.102.230) , ,233), (158.110.238) ,

Y2

(147.99.227) , (151.103.231) , (154.106.234) , (159.111.239) , (148,100,228) (152,104,232,15 (155.107.235) , (160,112,240) 3,169) (156,108,236)

(177.1 13.241 ) , (181.1 17.245) , (184,121 ,249), (189.125.252) ,

(178.1 14.242) , (182.1 18.246) , (186.122.249) , (190.126.253) ,

(179.1 15.243) , (183.119.247) , (187.123.250) , (191.127.254) , (180,1 16,244) (185,120,248) (188,124,251 ) (192,128,255)

Z4

(197,133)

(193.129) , (200.137) , (205.141 ) ,

(198,134)

(194.130) , (202.138) , (206.142) ,

(195.131 ) , (203.139) , (207.143) ,

(199,135)

(196,132) (204, 140) (208,144) (201 ,136)

Y1 (229,149)

(225.145) , (230,150) (233,232,216,153), (237.157) ,

(226.146) , (234.154) , (238.158) ,

(227.147) , (231 ,151 ) (235.155) , (239.159) , (228,148) (236,156) (240,160)

(232,217,

233,152)

(213,165)

(209161 ), (214,166) (217,232,216,169), (221.173) ,

(210.162) , (218.170) , (222.174) ,

(211.163) , (215,167) (219.171 ) , (223.175) ,

Y2 (212,164) (220, 172) (224,176)

(216, 168,

217,233)

(241.177) , (245,181 ) (248.185) , (253.189) ,

(242.178) , (250.186) , (254.190) ,

(243.179) , (246,182) (251.187) , (255.191 ) , (244,180) (252, 188) (256,192)

(247,183)

(249, 184)

[0214] Figure 49 shows a 3-dimensional next-neighbor exchange symbol [* *] indicating no stride, in functional decomposition view; Figure 50 shows a 3-dimensional next-neighbor exchange - no stride, in finite state machine view; Figure 51 shows a 3-dimensional next-neighbor exchange - with stride, in functional decomposition view; and Figure 52 shows a 3- dimensional next-neighbor exchange - with stride, in finite state machine view. If a three-dimensional matrix is used to depict particles, then the symbol shown in Figure 49 is used.

[0215] Figure 53 shows a 2-dimensional matrix with 2-dimensional stencil for 2-d next-n-neighbor exchange symbol - no stride, in functional decomposition view; Figure 54 shows a 2-dimensional matrix with 2- dimensional stencil for 2-d next-n-neighbor exchange - no stride, in finite state machine view; and Figure 55 shows a 2-dimensional matrix with 2- dimensional stencil for 2-d next-n-neighbor exchange symbol - with stride, in functional decomposition view. The next-neighbor exchange can be extended to a next-n-neighbor exchange. Frequently, the depth of the exchange is a function of some size of the stencil that is applied to it. The exchange will consist of using the number of elements along the dimension of the exchange found in the stencil. If the number of elements is greater than the

discretization size then the data must be shared across multiple nodes. Since the stencil is itself a vector or matrix, the symbol for a two-dimensional matrix with a two-dimensional stencil (shown in Figure 53) can be used to generate a next-n-neighbor exchange.

[0216] Figure 56 shows a 2-dimenssional matrix with 2-dimensional stencil for 2-d next-n-neighbor exchange - with stride, in finite state machine view. Since B cannot change (depicted by the lack of an accent mark) and has the same number of dimensions as A', it is assumed to be a stencil. Note that the stencil must be smaller than the processed vector or matrix in every dimension; otherwise, it is considered a non-stenciled matrix operation, and the next-n-matrix does not apply.

Field Use Model

Concept: A field affects everything at once so if the field is distributed over multiple nodes then everything must communicate with everything.

Use: Modeling physical phenomenon.

Example Use: Gravity modeling.

Parallel Issue: Information exchange.

Action: Determine what to cross communicate.

Action Example: Perform an all-to-all exchange of data.

[0217] Figure 57 shows a 1 -dimensional all-to-all exchange symbol - no stride, in functional decomposition view; Figure 58 shows a 1 -dimensional all-to-all exchange - no stride, in finite state machine view; Figure 59 shows a 1 -dimensional all-to-all exchange symbol - with stride, in functional decomposition view; Figure 60 shows a 1 -dimensional all-to-all exchange - with stride, in finite state machine view; and If a one-dimensional vector is used to depict a field then the symbol shown in Figure 57 is used.

[0218] Figure 61 shows a 2-dimensional all-to-all exchange symbol

- no stride, in functional decomposition view; Figure 62 shows a 2- dimensional all-to-all exchange - no stride, in finite state machine view; Figure 63 shows a 2-dimensional all-to-all exchange symbol - with stride, in functional decomposition view figure; and Figure 64 shows a 2-dimensional all-to-all - with stride, IN finite state machine view. If a two-dimensional matrix is used to depict fields then the symbol shown in Figure 61 is used.

[0219] Figure 65 shows a 3-dimensional all-to-all exchange symbol - no stride, in functional decomposition view; Figure 66 shows a 3-dimensional all-to-all exchange - no stride, in finite state machine view; Figure 67 shows a 3-dimensional all-to-all exchange symbol - with stride, in functional decomposition view; and Figure 68 shows a 3-dimensional all-to-all exchange

- with stride, in finite state machine view. If a three-dimensional matrix is used to depict fields then the symbol shown in Figure 65 is used. ALL-TO-ALL DESCRIPTION

Parallel Processing System

[0220] A network-based parallel-processing system is illustrated in Fig. 71 , this system has two or more processing units 902, 903, 922 (PE). Each of the processing units has a network interface 904 that supports virtual connections, and in some embodiments may support frequency-division multiplexing, over a connection 905 to a network fabric 920 that supports virtual connectivity between processing units 902, 903, 922. Network fabric 220 includes physical connectivity elements such as cables or optical fibers, including connection 205, that couple it to each PE 902, 903, 922, and may also include one or more switches or routers as known in the art of computer networks.

[0221] Within each processing element 902, 903, 922, the network interface 904 is coupled to at least one processor 906, and, since high-speed communications typically involve direct-memory access, to a memory system 908. The memory system 908 is configured with sufficient portions of a parallel-processing operating system 910 to permit the processing element to process tasks designated for execution on that processing element, application code 912 sufficient to process at least one task designated for execution on that processing element, and is adapted to storing data 913 associated with both the operating system 910 and application code 912. At least one processing unit 922 is a portal processing element, the portal processing element is further configured with user interface and application program submission code 928 that permits assignment of tasks to other processing units 902, 903 of the system; and in embodiments, may or may not participate in processing tasks as well.

Howard All-To-All Exchange

[0222] Identifying when to use various communication models between processing units and tasks running on those processing units is traditionally critical in decreasing the number of communication steps required to execute a program on a parallel computer. However, when the maximum number of virtual connections between a given processing element 922 and other processing units of the system operating on tasks of the same application can be made to be equal or exceed the number of processing units involved as discussed above (called a Howard all-to-all exchange), then the number of communication steps is always one. A communication of this type is illustrated in Fig. 70, where each processing element 903 is able to communicate with every other processing element.

[0223] As long as the number of virtual channels equals or exceeds the number of communications that need to be exchanged, then it takes only one communication step from the software point of view, and messages need not be forwarded from through processing units to other processing units. Thus, it is better to use the virtual channel all-to-all exchange than other, existing, communication models. Since a primary reason for much of the complexity found in performing parallel processing is eliminated, parallel processing is more easily automated.

[0224] The communication of Fig. 70 operates on the hardware of Fig. 71 as illustrated in Fig. 72, where communicating processing units 902, 903 participate in a Howard all-to-all exchange through virtual channels because they are the processing units 930 executing a program, but processing units 903A and 922 - which are not executing tasks associated with that particular application program running on the system 900, are excluded from the all-to-all exchange.

[0225] The network interfaces 904 participating in the all-to-all exchange participate in verifying each transmission of the exchange with industry-standard techniques (loop-back verification, checksums, cyclic- redundancy checks, hash functions, error-correcting codes, etc.), including acknowledgment packets and retries when an error occurs.

[0226] In an embodiment, the network interface 904 of each processing element 902, 903, is coupled to at least one port of a network switch in the network fabric 920. Network fabric 920 may contain one or more such switches, as necessary, and as required by the system size. With high performance switches, this may enable the full bandwidth of the link between network interface 904 and the switch to be utilized for virtual channels 970, 972, 974 (Fig. 72) to other processing units [0227] Fig. 73 is a table illustrating data transfer intent in a first column 950, traditional communications operations in another column 952, and applicability of the Howard All-to-all exchange and gather-scatter operations in a third column 954.

Dataset Volume Reduction

[0228] Next, data that is to be sent during an all-to-all exchange is reviewed by each participating processing element to determine that portion of a full dataset that is of interest to each destination processing element. In an embodiment, this is done by the processing element, such as processing element 902, having updated data. In an alternative embodiment, the processing element having updated data breaks the exchange into two parts, a first part in which it asks each destination processing element, such as processing units 903, which portions of each updated data array it requires to enable further processing of its assigned task; each destination processing element then replies with its requirement. The processing element, such as element 902, in possession of updated data then transmits only those portions required by each destination processing element to that particular processing element.

[0229] The amount of data that is sent in each exchange of an all-to- all exchange is therefore reduced to the amount required by each specific task running on each processing element.

Automatic Determination of Howard Ail-To— All Exchanges

[0230] In the related application PARALLELISM FROM

FUNCTIONAL DECOMPOSITION, a program design is entered as a functional description into a particular database, and decomposed to automatically provide a program structure. Individual portions, or tasks, of the program may then be coded to give a complete parallel-enabled program. In particular, figures 41 -68 of that application and associated text describe recognition from program structure of each communication type used by the program, and which are copied herein. Many of the communications referenced in text of that application with respect to figures 41-68, including left-right exchanges, next-neighbor, and, of course, all-to-all exchanges, are performed by the system herein described with reference to Figs 69-73 using the Howard all-to-all exchange.

Effect of Howard AII-to-AII Exchange

[0231] Increasing the number of communication end-points without increasing the data transmission time is very important. The following tests were performed to verify the effect using the same 10-GigE communication channels.

1 ) Compare non-VLAN and VLAN single channel performance

2) Create multiple VLANs each with at least one IP address. This allows a single communication wire to decrease its bandwidth as a function of the number simultaneous VLANs.

3) Transmit different data across different VLANs starting from one transmission port and ending at multiple receiving ports.

4) Decrease the data size as transmitted to the receiving ports as a function of the number of receiving ports.

5) Perform ten transmissions of each dataset size and average the results.

TABLE 1

[0232] As Table 1 above shows, with the exception of the 1-GB dataset, the timings of all other dataset sizes are similar for both the VLAN and non-VLAN data transmissions.

1000 850 862 853 861 856 853 857 855 856 853 855.6

VLAN 3 10 10 9 9 9 9 9 9 9 10 9 9.2 Receivers 20 18 18 18 18 18 18 18 19 19 19 18.3

100 87 87 94 88 87 92 93 88 88 87 89.1

1000 854 852 850 850 1007 856 859 1051 851 851 888.1

VLAN 4 10 9 9 9 9 9 9 9 9 9 9 9 Receivers 20 19 19 18 17 17 17 18 17 17 18 17.7

100 85 87 86 86 86 86 86 87 86 87 86.2

1000 1081 886 1039 932 859 975 848 888 983 933 942.4

TABLE 2 - SINGLE TRANSMITTER MULTIPLE VLAN RECEIVERS

[0233] As can be seen in Table 2, with the exception of the 1 GB (1000 MB) dataset size, all of the timings for multiple receivers are the same for a given initial dataset size, regardless of the number of receivers. This is as predicted.

Modified Decomposition

[0234] Fig. 28 illustrates the decomposition of a loop using an earlier version of Applicant's system. Fig. 74 illustrates a loop decomposed using an added symbol to the matrix label, an associated variable / vector / matrix indicating object-like intent.

Optimizing Parallelism

[0235] Many algorithms that have been parallelized produce substantially less throughput than N (the number of processing units) times the throughput provided by a single processing element. This is in part because of the overhead, including input-output, communications between processors, and coordination between processors.

[0236] Consider tasks running on a parallel processing computer where each task must transmit a large number of datasets in the course of processing their data. It has been thought that increasing the bandwidth of the system would decrease the time spent in communication. However, many of these problems show little decrease in communication time, even when communication channels have greatly increased their bandwidth. We consider a new model that decreases the total communication wall-clock time. Rather than increasing the communication channel bandwidth further, decreasing the communication channel bandwidth in the correct way can greatly decrease the overall communication time. All parallel processing communication models will be shown to be inefficient compared to the Howard all-to-all cross- communication model based upon this effect.

Transmission Wall-clock Overhead

[0237] Splitting up a dataset creates overhead in many algorithms. A split dataset often requires transmitting the pieces to multiple processing units. Determining the time required for transmitting data requires considering transmission rate and physical latency. The transmission wall-clock time and latency wall-clock time can be masked, as shown in the following thought experiment.

Thought Experiment

[0238] Given a pipe containing some amount of a viscous fluid, there are only three conditions possible for the fluid in the pipe: less fluid than the pipe's capacity, exactly as much fluid as pipe capacity, and more fluid than pipe capacity. A force in the pipe that causes the fluid to move through the pipe, increasing the density of fluid is analogous to increasing the frequency of an electrical communication channel while increasing the pressure on the fluid (which decreases how long it takes to move the first bit of fluid to the other end of the pipe) is analogous to the physical latency of a communication channel.

[0239] The amount of wall-clock time it takes to move all of the fluid is given by:

o= f t (d)+ L t

where "o" = the total wall-clock time required to move all of the data through and out of the pipe, ft(d) = a function that calculates the wall-clock time used to move the fluid through the pipe minus the wall-clock time required to move the first bit of fluid to but not through the end of the pipe, and l_t = the wall-clock time required to move the first bit of fluid to but not through the end of the pipe.

[0240] If there is less fluid than pipe capacity then the total will-clock time is dominated by L t ; that is, f t (d) < L t . If there is enough fluid to fill the pipe then the f t (d) = l_ t . If there is more fluid than pipe capacity then f t (d) >

[0241] Given multiple pipes of the capacity of the initial, single pipe, originating from the same source but terminating at different locations, the fluid is split among the pipes. Doubling the number of pipes has different effects on the fluid in the pipes. If the pipe is mainly empty then U dominates, which means very little effect on "o". This same effect would take place if we doubled the density of the fluid or doubled the cross-sectional area of the pipe: that is, very little change in how long it takes to move the fluid through the pipe. Doubling the number of pipes, however, has a profound effect when there is enough fluid to fill one pipe. The performance increases by one- fourth. It increases only by one-fourth because with one pipe ft(d) = ½ of the time and U= ½ of the time and U is unaffected by the change. Thus ft(d) changes to one-fourth rather than one-half of the time, meaning the new time is three-fourths of the original time. The more fluid there is beyond that required to fill the pipe, the closer it comes to doubling the performance of the pipes when doubling the number of pipes.

Data Communication

[0242] The thought experiment can easily be applied to electrical data communication systems. The techniques that can be used to mask the physical latency wall-clock time, mask the data transmission wall-clock time, or both, differ for each of the three conditions. If the data is less than the channel capacity then physical latency dominates the wall-clock time;

otherwise, the frequency dominates. f t (d) < L t ; Latency Dominated Model

[0243] For the case where the data is less than the channel capacity (latency bound), the wall-clock time represented by the movement of the data is only a percentage of the total wall-clock data transmission time "o". For example, if f t (d) represents one% of "o" then decreasing the channel capacity to one-half of its frequency represents only a one% increase in total wall-clock time. One processing element transmitting different data to multiple different processing elements safely can be performed in only one of three ways: a) The data is transmitted first to one processing element followed by the next (standard serial method),

b) The data is multicast simultaneously to all processing elements (parallel method),

c) The physical-channel capacity is divided into multiple

bandwidth-limited channels are used to simultaneously transmit to all processing elements.

[0244] It is now possible to show the effects of each of the above data-transmission methods. The wall-clock time for method "a" is given by:

EQUATIO CLOCK TIME

where n = the number of pairs of processing elements

+ 1 in communication and

V t (d) = the time required to verify the transmission. Note: worst case V t d)

= f t (d) + L t

[0245] A complete transmission must end before the next one can begin; thus, the total wall-clock time multiplies the communication time by the number of pair-wise communications. The wall-clock time for method "b" is given by:

EQUATION 4 WORST CASE MULTICAST WALL CLOCK TIME

0 = f t (d) + L t + nV t (d)

[0246] Multicast is usually portrayed as the fastest single-to-multiple processing element communication method. Notice that in a multicast, the full dataset is transmitted to all processing elements. A standard multicast is unsafe because the proper receipt of the data is not guaranteed unless a verification step is included. The wall-clock time for method "c" is given by: EQUATION 5 WOR -TO-ALL WALL CLOCK TIME

[0247] Method "a", the standard method, multiplies the complete transmission wall-clock time by the number of processing-element pairs. Since data transmission is performed in a pair-wise fashion, each

transmission can be verified with industry-standard techniques (loop-back verification, checksums, cyclic-redundancy checks, bash functions, error- correcting codes, etc), making this a safe transmission. However, method "a" can be very costly in terms of wall-clock time.

[0248] The wall-clock time for method "b" grows primarily as a function of the verification and the fact that the full dataset is transmitted to all other processing elements. This is much better than what occurs with method "a" but still does not give the best possible wall-clock time.

[0249] The wall-clock time for method "c" grows as a function of decreasing channel capacity, which can be balanced by simultaneously decreasing dataset size, through the use of virtual connections each carrying only the necessary data as with the Howard All-to-all exchange. That is, the wall-clock time can remain the same regardless of the number of processing elements in communication. The dataset size can be the size required by the algorithm, without the redundancy found with multicast. In addition, method "c" is performed in a pair-wise fashion and is considered a safe data

transmission.

[0250] If the data fills the communication channel to capacity then the signal propagation wall-clock time is equal to the latency wall-clock time. If the dataset is split for transmission to multiple processing elements, then the f t (d) < L t rules apply. As long as the all communication channels are split evenly, the dataset size that occurs after the first split defines "d" for this case. Thus, if the first split is into four sections then the time it takes to transmit to eight sections is the same as the baseline four-section split. f t (d) > L, (Data Bound Model)

[0251] If the amount of data is greater than three times latency then the problem is considered is data bound. In data-bound problems, ft(d) dominates the wall-clock time. However, if the dataset is separable then there is always some number of processing elements across which the data can be split that will decrease the problem to a latency-bound problem, whose wall- clock time which is then manageable using the latency-bound model. Like the full-capacity model, selecting the baseline used to convert the data-bound problem to a latency-bound problem is important.

Reduction of Cross-Communication Models

[0252] The ability to safely communicate with multiple endpoints without incurring any additional data transfer costs means that many of the traditional communication models can be collapsed into a single model: the all-to-all exchange. Traditionally, the all-to-all communication exchange uses the binomial-tree multicast model illustrated in Fig. 69 and as discussed above

[0253] Each processing element performs an iteration of the exchange. If "n" is the total number of processing elements then it takes

"/7log2H" communication steps to complete. As can be seen, each

communication step consists of a pair of processing elements in

communication. Other communication models take fewer communication steps; for example, a true multicast all-to-all exchange takes "n"

communication steps, a 2-D next-neighbor exchanges take eight

communication steps, left-right up-down exchanges take four communication steps, and left-right exchange takes two communication steps. Identifying when to use various communication models is traditionally critical in decreasing the number of communication steps. However, when ft(d) can be made to be less than L t as discussed above (called a Howard all-to-all exchange), then the number of communication steps is always one. Automated Latency Management

[0254] In addition to extracting potential parallelism by decomposing the program structure as heretofore described, and using the Howard all-to-all exchange where all-to-all, left-right (with and without stride), and next- neighbors (with and without stride) exchanges are found in program structure, it is desirable to determine how much physical latency is in the system, and to determine how much parallelism can be exploited before reaching diminishing returns.

[0255] In an embodiment, a parallelizable loop is identified by program decomposition as heretofore described. A latency is determined, T(1 ), for execution of this loop by a single processing unit. A second latency, T(2) is determined for execution of the same loop parallelized into two simultaneously-executing processing units, with one virtual channel used to communicate between the processing units. A third latency, T(3) is determined for execution of this loop by four simultaneously executing processing units, with three virtual channels utilized for communications between each processing unit and the remaining processing units executing the loop, in a manner resembling the virtual channels of 970, 972, 974 of fig. 72. In an embodiment, the communications time component of the latency is estimated by interpolation into a two-dimensional table giving communications time for particular predetermined sizes of the data to be communicated versus number of connections at each processor performing the all-to-all exchange. The interpolated performance is used to determine the estimate.

[0256] An equation is then automatically derived for T(n), the wall- clock latency time of loop execution when the loop is divided among n processors, from latencies T(1 ), T(2), and T(3).

[0257] The equation T(n) is derived from Equation 5 above for latencies T(1 ), T(2), and T(3), the parameters Lt, Vt being fitted to T(1 ), T(2), and T(3). 6: Where ft is a function describing the time it takes to move the dataset (derived from the table interpolation as described above), l_t is the

computation time, and V t is a function describing time for verification of the data movement.

[0258] Since, for some parallelizable processes there can be a point where execution time increases with increasing number of processors, the equation T(n) is then evaluated to determine a minimum T(n), and a number of processors a are assigned to perform tasks associated with executing the loop that is either associated with the n of the minimum T(n) or the maximum number of available processors, whichever is less.

[0259] In an embodiment, equations T(n)(m) are derived for each of m loops, each loop derived from functional decomposition as heretofore described, where each of the m loops is capable of being executed simultaneously.

[0260] The equations T(n)(m) are then evaluated to determine a minimum T(n) for each of the m loops, and a number of processors a(m) are assigned to perform tasks associated with executing each loop that is within the size of the system and optimizes performance.

Combinations

[0261] The system and method herein described may be executed with various combinations of communications types assigned to particular processors and tasks, and may execute of various combinations of hardware. Among anticipated systems and methods are the following:

[0262] A method designated A of communication in a multiple processor computing system, the system including multiple processing units, each processing unit including at least one processor, a memory, and a network interface, the network interface adapted to support virtual connections between the processing units, the memory configured to contain at least a portion of a parallel processing application program and at least a portion of a parallel processing operating system, and a network fabric coupled to each processing unit. The method operating on the system includes identifying a need for communication by the parallel processing application program executing on a first processing unit with a second multiple of the processing units, creating a virtual connection between the first processing unit with each processing unit of the second multiple of processing units, and transferring data between the first processing unit and each processing unit of the second multiple of processing units.

[0263] A method designated AA including the method designated A further including automatically determining a number of processing units to assign to a task associated with the communication.

[0264] A method designated AB including the method designated A wherein the identifying a need for communication is performed by a computer performing functional decomposition of a software design to generate a computer-executable finite state machine, the performing functional decomposition including decomposing functions in the software design into data transformations and control transformations repetitively until each of the decomposed data transformations consists of a respective linear code block; wherein the data transformations accept and generate data, and the control transformations evaluate conditions and send and receive control indications to and from associated instances of the data transformations; converting the software design to a graphical diagram including a plurality of graphical symbols interconnected to hierarchically represent the data transformations and the control transformations in the software design.

[0265] A method designated AC including the method designated AB further including automatically determining a number of processing units to assign to a task associated with the communication.

[0266] A method designated AD wherein the step of automatically determining a number of processing units to assign to a task comprises determining a first and a second dataset size for the communication associated with a first and a second number of processors, and using a table interpolation to determine a first and a second communications time, the first communications time associated with the first dataset size and first number of processors, and the second communications time associated with the second dataset size and the second number of processors. [0267] A method designated AE wherein the step of identifying a need for communication is performed by performing functional decomposition of a software design to generate a computer-executable finite state machine, the performing functional decomposition performed by a computer and including decomposing functions in the software design into data

transformations and control transformations repetitively until each of the decomposed data transformations consists of a respective linear code block; wherein the data transformations accept and generate data, and the control transformations evaluate conditions and send and receive control indications to and from associated instances of the data transformations; configuring an automatically-determined number of the plurality of processing units to execute a task of the functionally-decomposed software design and executing the communication on an automatically-determined number of processing units of the plurality of processing units executing the task associated with the communication.

[0268] A method designated AF including the method designated A, AA, AB, AC, AD, or AE wherein automatically determining a number of processing units to execute a task comprises determining a first and a second dataset size for the communication associated with a first and a second number of processors, and using a table interpolation to determine a first and a second communications time, the first communications time associated with the first dataset size and first number of processors, and the second communications time associated with the second dataset size and the second number of processors.

[0269] A method designated AG including the method designated A, AA, AB, AC, AD, AE, or AF, wherein, for at least some of the communications, an all-to-all communication is performed over the virtual connection between the first processing unit and each processing unit of the second plurality of processing units, the first processing unit and second plurality of processing units comprising the processing units executing the task.

[0270] A multiple processor computing system designated B, the system including multiple processing units, each processing unit including at least one processor, a memory, and a network interface, the network interface adapted to support virtual connections, the memory configured to contain at least a portion of a parallel processing application program and at least a portion of a parallel processing operating system, and a network fabric coupled to each processing unit. The network fabric is a fabric adapted to support virtual connections between units of the plurality of processing units. The memory of the processing units is configured to contain machine readable code for creating a virtual connection between the first processing unit with each processing unit of the second plurality of processing units, and machine readable instructions in the memory of the processing units for performing an all-to-all communication over the virtual connection between the first processing unit and each processing unit of the second plurality of processing units, the all-to-all communication comprising transferring data between the first processing unit and each processing unit of the second plurality of processing units; wherein the system is configured to automatically determine a number of processing units to assign to a task associated with the communication, the units assigned to the task comprising the first processing unit and the second plurality of processing units; and wherein the number of processing units assigned to the task is less than a total number of processing units of the system. .

[0271] A system designated BA including the system designated B, wherein the system is configured to automatically determine a number of processing units to assign to a task by executing machine readable code comprising code for determining a first and a second dataset size for the communication associated with a first and a second number of processors, and using a table interpolation to determine a first and a second

communications time, the first communications time associated with the first dataset size and first number of processors, and the second communications time associated with the second dataset size and the second number of processors.

[0272] Certain changes may be made in the above methods and systems without departing from the scope of that which is described herein. It is to be noted that all matter contained in the above description or shown in the accompanying drawings is to be interpreted as illustrative and not in a limiting sense. The elements and steps shown in the present drawings may be modified in accordance with the methods described herein, and the steps shown therein may be sequenced in other configurations without departing from the spirit of the system thus described. The following claims are intended to cover all generic and specific features described herein, as well as all statements of the scope of the present method, system and structure, which, as a matter of language, might be said to fall therebetween.