

Title:
A SYSTEM AND METHOD FOR HIGH-PERFORMANCE GENERAL-PURPOSE PARALLEL COMPUTING WITH FAULT TOLERANCE AND TAIL TOLERANCE
Document Type and Number:
WIPO Patent Application WO/2019/086120
Kind Code:
A1
Abstract:
The invention relates to a distributed computing system (200) and method. The distributed computing system (200) comprises a plurality of computing nodes (201), wherein each computing node (201) is configured to simultaneously execute a respective sub-task of a parallel computing task in a plurality of computing rounds, and a communication network (203) configured to allow data exchange between the plurality of computing nodes (201). Each computing round comprises an execution stage, a communication stage between the plurality of computing nodes (201) and a synchronization stage between the plurality of computing nodes (201), wherein the distributed computing system (200) is configured to handle one or more high-latency computing nodes of the plurality of computing nodes (201) in each computing round.

Inventors:
MCCOLL BILL (DE)
Application Number:
PCT/EP2017/078153
Publication Date:
May 09, 2019
Filing Date:
November 03, 2017
Assignee:
HUAWEI TECH CO LTD (CN)
MCCOLL BILL (DE)
International Classes:
G06F9/48; G06F9/50; G06F9/54
Foreign References:
US20140181831A12014-06-26
US20140201564A12014-07-17
Other References:
G. Ananthanarayanan, A. Ghodsi, S. Shenker and I. Stoica: "Effective Straggler Mitigation: Attack of the Clones", Proc. 10th USENIX Conference on Networked Systems Design and Implementation (NSDI 2013), pages 185-198, XP055484981
J. Dean and L. A. Barroso: "The Tail at Scale", CACM, vol. 56, no. 2, 2013, pages 74-80, XP058030309, DOI: 10.1145/2408776.2408794
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
CLAIMS

1. A distributed computing system (200), comprising: a plurality of computing nodes (201), wherein each computing node (201) is configured to simultaneously execute a respective sub-task of a parallel computing task in a plurality of computing rounds; and a communication network (203) configured to allow data exchange between the plurality of computing nodes (201); wherein each computing round comprises an execution stage, a communication stage between the plurality of computing nodes (201) and a synchronization stage between the plurality of computing nodes (201), and wherein the distributed computing system (200) is configured to handle one or more high-latency computing nodes of the plurality of computing nodes (201) in each computing round.

2. The system (200) of claim 1, wherein for handling the one or more high-latency computing nodes in each computing round the plurality of computing nodes (201) are configured to simultaneously execute copies of a sub-task of the parallel computing task in the same computing round.

3. The system (200) of claim 2, wherein the number of computing nodes (201) is a multiple of the number of sub-tasks, allowing multiple copies of any sub-task to be run in parallel in the same computing round.

4. The system (200) of any one of the preceding claims, wherein for handling the one or more high-latency computing nodes in each computing round one or more of the plurality of computing nodes (201) is configured to execute at least two sub-tasks of the parallel computing task in sequence during a computing round.

5. The system (200) of any one of the preceding claims, wherein for handling the one or more high-latency computing nodes in each computing round the plurality of computing nodes (201) are configured to execute each of the sub-tasks of the parallel computing task in a computing round first on at least one of the plurality of computing nodes (201).

6. The system (200) of any one of the preceding claims, wherein the system (200) is configured to identify a respective computing node of the plurality of computing nodes (201) as a high-latency computing node in a computing round, if the duration of the execution stage of the respective computing node is larger than a duration threshold for the computing round.

7. The system (200) of claim 6, wherein each of the plurality of computing nodes (201) is configured to determine the duration threshold for a computing round on the basis of a minimum duration of the computing round, wherein the minimum duration of the computing round is defined by the duration of the execution stage of the computing node (201) having the shortest execution stage of the computing round.

8. The system (200) of claim 7, wherein each computing node (201) is associated with a respective synchronization parameter indicating whether or not the respective computing node (201) can define the minimum duration for the computing round.

9. The system (200) of any one of the preceding claims, wherein the system (200) further comprises a plurality of standby computing nodes (202), wherein each of the plurality of standby computing nodes (202) is configured to start executing a sub-task of a specific high-latency computing node or any high-latency computing node of the plurality of computing nodes (201) in a computing round.

10. The system (200) of any one of the preceding claims, wherein each computing node (201) comprises a physical computer, a virtual machine, or a software container.

11. A distributed computing method (300), comprising: executing (301) a respective sub-task of a parallel computing task in a plurality of computing rounds by a plurality of computing nodes (201), wherein each computing round comprises an execution stage, a communication stage between the plurality of computing nodes (201) and a synchronization stage between the plurality of computing nodes (201); and handling (303) one or more high-latency computing nodes of the plurality of computing nodes (201) in each computing round.

12. The method (300) of claim 11, wherein the step of executing (301) comprises simultaneously executing copies of a sub-task of the parallel computing task in the same computing round.

13. The method (300) of claim 12, wherein the number of computing nodes (201) is a multiple of the number of sub-tasks, allowing multiple copies of any sub-task to be run in parallel in the same computing round.

14. The method (300) of any one of claims 11 to 13, wherein the step of executing (301) comprises executing at least two sub-tasks of the parallel computing task in sequence during a computing round.

15. The method (300) of any one of claims 11 to 14, wherein the step of executing (301) comprises executing each of the sub-tasks of the parallel computing task in a computing round first on at least one of the plurality of computing nodes.

16. The method (300) of any one of claims 11 to 15, wherein the method (300) comprises the further step of identifying a respective computing node of the plurality of computing nodes (201) as a high-latency computing node in a computing round, if the duration of the execution stage of the respective computing node is larger than a duration threshold for the computing round.

17. The method (300) of claim 16, wherein the method (300) comprises the further step of determining the duration threshold for a computing round on the basis of a minimum duration of the computing round, wherein the minimum duration of the computing round is defined by the duration of the execution stage of the computing node having the shortest execution stage of the computing round.

18. The method (300) of claim 17, wherein each computing node (201) is associated with a respective synchronization parameter indicating whether or not the respective computing node can define the minimum duration for the round.

19. The method (300) of any one of claims 11 to 18, wherein the method (300) comprises the further step of starting to execute a sub-task of a specific high-latency computing node or any high-latency computing node of the plurality of computing nodes (201) in a computing round.

20. The method (300) of any one of claims 11 to 19, wherein each computing node comprises a physical computer, a virtual machine, or a software container.

21. A computer program product comprising program code for performing the method (300) of any one of claims 11 to 20 when executed on a computer or a processor.

Description:
A SYSTEM AND METHOD FOR HIGH-PERFORMANCE GENERAL-PURPOSE PARALLEL COMPUTING WITH FAULT TOLERANCE AND TAIL TOLERANCE

TECHNICAL FIELD

Generally, the present invention relates to the field of computing and data processing. More specifically, the present invention relates to a system and method for high-performance general-purpose parallel computing with fault tolerance and tail tolerance.

BACKGROUND

Distributed computing systems, high-performance computing systems, and other similar systems enable scientists and engineers to solve complex science, engineering, and business problems using applications that benefit from high-bandwidth, low-latency networking and very high compute capabilities. These systems may also execute data storage and retrieval, perform more straightforward tasks, and the like. Such systems include those for cloud computing, Big Data analytics, web services, enterprise services, distributed computing and the like. The competitive business of data and computing services drives manufacturers to continuously improve their processes and products in order to lower costs, deliver reliable service, increase speed and the like. Indeed, in data handling and processing, there is an ever-increasing demand to utilize processing resources more efficiently.

In many large-scale parallel computing scenarios, it is not possible to simply run the computation again, or to use simple methods such as checkpointing, to handle faults or long latencies (i.e. tails). For example, with petaflop and exaflop computations it is very costly to follow a "just run again or use checkpointing" approach. Another example is running infinite, continuous, nonstop computations, for which the "just run again or use checkpointing" approach is impossible, as it also is for real-time computations.

Many modern large-scale parallel computing applications are both highly iterative and communication-intensive. Such applications include, for instance, Big Data analytics, graph computing, machine learning, deep learning, artificial intelligence, HPC, modelling, genomics, network optimization, and simulation. To support such applications, large-scale parallel computing systems should preferably meet all of the following requirements: they can be used to efficiently run any parallel computation, at any scale; they can be used on cost-effective commodity architectures; they can be used to efficiently run computations continuously, without interruptions; they offer high performance; and they offer high availability, with automatic fault tolerance and tail tolerance, self-healing and self-optimizing.

Cloud Computing offers parallelism, scale and cost-effectiveness, but clouds are very unpredictable. The maximum latency is often more than 100 times greater than the average latency for identical tasks. In some cases, the factor can even be 1000 times or more. Such a case is illustrated in Figure 1, which shows an exemplary distribution of the number of processes over the expected latency for a conventional distributed computing system. These long high-latency tails are a major problem, and quite different from the related problem of handling faults.

However, with modern software container technology, containers can be relaunched very quickly - in seconds rather than the minutes normally required to relaunch a virtual machine or a physical server. Containers thus provide a means of restarting computations quickly, but the challenge of deciding when to restart a computing node remains.

In this new era of large-scale cloud computing with frequent long tails, traditional checkpointing-based approaches to recovery are inadequate for several simple reasons. The latency of writing to, and reading from, resilient storage is very high. A choice must therefore be made between frequent checkpointing and long recovery times, both of which are very bad, and there is no good tradeoff between them. Moreover, with checkpointing, nonstop performance and continuous real-time performance are both impossible, due to the stopping and restarting involved.

In the prior art there have been first attempts to provide alternatives to checkpointing for parallel computing with fault tolerance and tail tolerance, such as the paper G. Ananthanarayanan, A. Ghodsi, S. Shenker and I. Stoica: "Effective Straggler Mitigation: Attack of the Clones". Proc. 10th USENIX Conference on Networked Systems Design and Implementation (2013), 185-198 (herein referred to as the AGSS paper).

Another attempt has been described in the paper J Dean and L A Barroso: "The Tail at Scale". CACM 56, 2 (2013), 74-80 (herein referred to as DB paper). The DB paper deals with a simple special case of the MapReduce computations considered in the AGSS paper.

A simple extension of the results of the DB paper for handling simple web requests in the cloud was published in the paper Q. Lu, L. Zhu, X. Xu, L. Bass, S. Li, W. Zhang, and N. Wang: "Mechanisms and Architectures for Tail-Tolerant System Operations in Cloud". Proc. 6th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '14).

The focus of the AGSS paper is on simple data parallel, constant-round parallel computations, e.g. MapReduce/Hadoop. The AGSS paper considers computations that have a small number of tasks, e.g. at most 10, and execute in a small number of rounds, e.g. 1-3. General purpose parallel computations that may have thousands of tasks, and may execute for thousands of rounds are not considered in the AGSS paper. Moreover, the AGSS paper does not address the challenge of supporting long or continuous parallel computations that need to run nonstop over a huge number of rounds.

The approach disclosed in the AGSS paper is to replicate only small, short-duration tasks, in order to limit overhead. It is assumed that typically no more than about 2% of all resources will be used by 90% of tasks combined. In general purpose parallel computing, however, typically about 90% of all resources will be used by 90% of tasks combined.

The principal concern of the AGSS paper is to optimize data movement in disk-based data communications systems such as Hadoop HDFS. General purpose parallel computing with high performance point-to-point data communications systems such as BSP, MPI, RDMA is not considered in the AGSS paper.

In light of the above, there is a need for improved systems and methods for general purpose parallel computing.

SUMMARY

It is an object of the invention to provide improved systems and methods for general purpose parallel computing. The foregoing and other objects are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

Generally, embodiments of the invention provide systems and methods for high-performance general-purpose parallel computing with fault tolerance and tail tolerance. Embodiments of the invention provide systems for handling high-latency nodes for all large-scale parallel computations, including those that are general dataflow graphs, or communication-intensive, or highly iterative and multi-round. Embodiments of the invention provide distributed computing systems with scalability, high performance, cost-effectiveness, and high availability (automatic fault tolerance and tail tolerance). Embodiments of the invention provide a cost model and automatic optimization for general purpose parallel computations run on large-scale cloud and other commodity architectures where faults and long latency tails are common.

More specifically, according to a first aspect the invention relates to a distributed computing system comprising: a plurality of computing nodes, wherein each computing node is configured to simultaneously execute a respective sub-task of a parallel computing task in a plurality of computing rounds; and a communication network configured to allow data exchange between the plurality of computing nodes. Each computing round comprises an execution stage, i.e. processing of local data at each computing node, a communication stage between the plurality of computing nodes and a synchronization stage between the plurality of computing nodes. The distributed computing system is configured to handle one or more high-latency computing nodes of the plurality of computing nodes in each computing round. As used herein, a high-latency computing node is a computing node that takes much longer to execute a task than other similar nodes take to execute a task of the same computational cost.
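The round structure of the first aspect can be illustrated with a minimal, schematic sketch. The `Node` class, the `execute` callback and the `duration_threshold` parameter below are hypothetical names introduced purely for illustration and are not part of the claimed system:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A minimal model of a computing node with local data and an outbox."""
    node_id: int
    local_data: list = field(default_factory=list)
    outbox: dict = field(default_factory=dict)  # destination node_id -> messages

def computing_round(nodes, execute, duration_threshold=None):
    """Run one computing round: execution, communication, synchronization.

    `execute(node)` processes the node's local data and returns the duration
    of its execution stage; at the synchronization stage, nodes whose
    execution stage exceeded `duration_threshold` are reported as
    high-latency for this round.
    """
    # Execution stage: each node computes on its local data.
    durations = {node.node_id: execute(node) for node in nodes}

    # Communication stage: deliver queued messages between nodes.
    inboxes = {node.node_id: [] for node in nodes}
    for node in nodes:
        for dest, messages in node.outbox.items():
            inboxes[dest].extend(messages)
        node.outbox.clear()
    for node in nodes:
        node.local_data.extend(inboxes[node.node_id])

    # Synchronization stage: flag high-latency nodes for handling.
    if duration_threshold is None:
        return []
    return [nid for nid, d in durations.items() if d > duration_threshold]
```

In this sketch the return value gives the set of nodes whose handling (cloning, standby take-over, etc.) the subsequent implementation forms address.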

Thus, an improved general purpose parallel computing system is provided that can handle high-latency computing nodes.

In a further possible implementation form of the first aspect, for handling the one or more high-latency computing nodes the plurality of computing nodes are configured to execute copies of a sub-task of the parallel computing task simultaneously in the same computing round. Herein this is referred to as horizontal cloning.

In a further possible implementation form of the first aspect, the number of computing nodes is a multiple of the number of sub-tasks, allowing multiple copies of any sub-task to be run in parallel in the same computing round. In an implementation form the multiple can be an integer or a rational multiple. In one implementation form the number of computing nodes is at least twice as large as the number of sub-tasks, allowing multiple copies of any sub-task to be run in parallel in the same round.
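Horizontal cloning can be sketched as follows. The function names and the round-robin assignment rule are illustrative assumptions, not a prescribed scheduling policy; the key point is that each sub-task runs as several parallel copies and the first clone to finish supplies the result:

```python
def horizontal_clone_assignment(num_nodes, num_subtasks):
    """Assign clones of each sub-task to nodes for one computing round.

    Assumes num_nodes is an integer multiple of num_subtasks, so every
    sub-task is run as num_nodes // num_subtasks parallel copies, masking
    individual high-latency nodes.
    """
    if num_nodes % num_subtasks != 0:
        raise ValueError("number of nodes must be a multiple of the number of sub-tasks")
    return {node: node % num_subtasks for node in range(num_nodes)}

def first_finisher(assignment, finish_times, subtask):
    """Pick the earliest-finishing clone of a given sub-task."""
    clones = [n for n, t in assignment.items() if t == subtask]
    return min(clones, key=lambda n: finish_times[n])
```

With 8 nodes and 4 sub-tasks, for example, every sub-task gets two clones, and a slow clone of a sub-task is simply outrun by its twin.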

In a further possible implementation form of the first aspect, for handling the one or more high-latency computing nodes one or more of the plurality of computing nodes is configured to execute at least two sub-tasks of the parallel computing task in sequence during a computing round. Herein this is referred to as vertical cloning.
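Vertical cloning can be sketched as a node that has already completed its own sub-task taking on, in sequence within the same round, a copy of a sub-task still pending on a high-latency node. The mapping-based scheduling below is an assumed illustration, not the claimed mechanism itself:

```python
def vertical_cloning(pending_subtasks, idle_nodes):
    """Assign sub-tasks of high-latency nodes to nodes that finished early.

    `pending_subtasks` maps a high-latency node id to its unfinished
    sub-task; `idle_nodes` lists nodes that have completed their own
    sub-task this round and can execute a second one in sequence.
    Returns the extra (helper node -> sub-task) assignments.
    """
    extra = {}
    for helper, (slow_node, subtask) in zip(idle_nodes, pending_subtasks.items()):
        extra[helper] = subtask  # helper runs this sub-task after its own
    return extra
```

Each helper node thereby executes at least two sub-tasks in sequence during the round, as described above.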

In a further possible implementation form of the first aspect, for handling the one or more high-latency computing nodes the plurality of computing nodes are configured to execute each of the sub-tasks of the parallel computing task in a round first on at least one of the plurality of computing nodes.

In a further possible implementation form of the first aspect, the distributed computing system is further configured to identify a respective computing node of the plurality of computing nodes as a high-latency computing node in a computing round, if the duration of the execution stage of the respective computing node is larger than a duration threshold for the computing round.

In a further possible implementation form of the first aspect, each of the plurality of computing nodes is configured to determine the duration threshold for a computing round on the basis of a minimum duration of the round, wherein the minimum duration of the round is defined by the duration of the execution stage of the computing node having the shortest execution stage of the computing round.

In a further possible implementation form of the first aspect, each computing node is associated with a respective synchronization parameter indicating whether or not the respective computing node can define the minimum duration for the computing round.
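These two implementation forms can be combined in a small sketch. The `tail_factor` multiplier is an assumed knob, not a value given in the description; it merely illustrates deriving a threshold from the round's minimum duration while respecting each node's synchronization parameter:

```python
def round_duration_threshold(exec_durations, can_set_minimum, tail_factor=2.0):
    """Determine the duration threshold for a round from its minimum duration.

    The minimum duration is the shortest execution-stage duration among the
    nodes whose synchronization parameter allows them to define it; a node
    is then flagged high-latency when its execution stage exceeds
    tail_factor times that minimum.
    """
    minimum = min(d for nid, d in exec_durations.items() if can_set_minimum[nid])
    threshold = tail_factor * minimum
    high_latency = [nid for nid, d in exec_durations.items() if d > threshold]
    return threshold, high_latency
```

Excluding a node via its synchronization parameter prevents an atypically fast node from setting an unrealistically tight minimum for the round.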

In a further possible implementation form of the first aspect, the distributed computing system further comprises a plurality of standby computing nodes, wherein each of the plurality of standby computing nodes is configured to start executing a sub-task of a specific high-latency computing node or any high-latency computing node of the plurality of computing nodes in a computing round.
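The standby-node behaviour can be sketched as a pool from which idle standby nodes are started on the sub-tasks of high-latency nodes. A generic pool is assumed here for illustration; as stated above, a standby node could equally be dedicated to one specific computing node:

```python
def dispatch_standby(standby_pool, pending_subtasks):
    """Start standby nodes on the sub-tasks of high-latency nodes.

    `standby_pool` is a list of idle standby node ids; `pending_subtasks`
    maps a high-latency node id to its unfinished sub-task. Returns the
    (standby node -> sub-task) assignments made this round.
    """
    started = {}
    for slow_node, subtask in pending_subtasks.items():
        if not standby_pool:
            break  # no standby capacity left this round
        started[standby_pool.pop(0)] = subtask
    return started
```

The standby copy and the original then race to completion, and the round proceeds with whichever result arrives first.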

In a further possible implementation form of the first aspect, each computing node comprises a physical computer, a virtual machine, or a software container.

According to a second aspect the invention relates to a corresponding distributed computing method comprising the following steps: executing a respective sub-task of a parallel computing task in a plurality of computing rounds by a plurality of computing nodes, wherein each computing round comprises an execution stage, i.e. processing of local data at each computing node, a communication stage between the plurality of computing nodes and a synchronization stage between the plurality of computing nodes; and handling one or more high-latency computing nodes of the plurality of computing nodes in each computing round.

In a further possible implementation form of the second aspect, the step of executing comprises executing copies of a sub-task of the parallel computing task simultaneously in the same computing round.

In a further possible implementation form of the second aspect, the number of computing nodes is a multiple of the number of sub-tasks, allowing multiple copies of any sub-task to be run in parallel in the same computing round. In an implementation form the multiple can be an integer or a rational multiple. In one implementation form the number of computing nodes is at least twice as large as the number of sub-tasks, allowing multiple copies of any sub-task to be run in parallel in the same round.

In a further possible implementation form of the second aspect, the step of executing comprises executing at least two sub-tasks of the parallel computing task in sequence during a round.

In a further possible implementation form of the second aspect, the step of executing comprises executing each of the sub-tasks of the parallel computing task in a round first on at least one of the plurality of computing nodes.

In a further possible implementation form of the second aspect, the method comprises the further step of identifying a respective computing node of the plurality of computing nodes as a high-latency computing node in a computing round, if the duration of the execution stage of the respective computing node is larger than a duration threshold for the computing round.

In a further possible implementation form of the second aspect, the method comprises the further step of determining the duration threshold for a computing round on the basis of a minimum duration of the computing round, wherein the minimum duration of the computing round is defined by the duration of the execution stage of the computing node having the shortest execution stage of the computing round.

In a further possible implementation form of the second aspect, each computing node is associated with a respective synchronization parameter indicating whether or not the respective computing node can define the minimum duration for the computing round.

In a further possible implementation form of the second aspect, the method comprises the further step of starting to execute a sub-task of a specific high-latency computing node or any high-latency computing node of the plurality of computing nodes in a computing round.

In a further possible implementation form of the second aspect, each computing node comprises a physical computer, a virtual machine, or a software container.

The distributed computing method according to the second aspect of the invention can be performed by the distributed computing system according to the first aspect of the invention. Further features of the distributed computing method according to the second aspect of the invention result directly from the functionality of the distributed computing system according to the first aspect of the invention and its different implementation forms described above and below.

According to a third aspect the invention relates to a computer program product comprising program code for performing the distributed computing method according to the second aspect when executed on a computer or a processor.

The invention can be implemented in hardware and/or software.

BRIEF DESCRIPTION OF THE DRAWINGS

Further embodiments of the invention will be described with respect to the following figures, wherein:

Fig. 1 shows a schematic diagram illustrating an exemplary distribution of the number of computing tasks over the expected latency for a conventional distributed computing system;

Fig. 2 shows a schematic diagram illustrating a distributed computing system with a plurality of computing nodes according to an embodiment;

Fig. 3 shows a flow chart illustrating steps of a distributed computing method according to an embodiment;

Fig. 4 shows a flow chart illustrating the standard execution procedure for a conventional BSP program with no fault or tail tolerance;

Fig. 5a shows a schematic diagram illustrating an example of a non-load-balanced BSP program;

Fig. 5b shows a schematic diagram illustrating the example of figure 5a with a distributed computing system according to an embodiment;

Fig. 6 shows a flowchart illustrating steps for executing a nonstop BSP program by a distributed computing system according to an embodiment;

Fig. 7 shows a schematic diagram illustrating setting and sharing a minimum duration for a computing round in a distributed computing system according to an embodiment;

Fig. 8 shows a flow chart illustrating a tail limit interrupt procedure implemented in a distributed computing system according to an embodiment;

Fig. 9 shows a flow chart illustrating the completion of a computing round at the tail limit as implemented in a distributed computing system according to an embodiment;

Fig. 10a shows a schematic diagram illustrating the amount of time taken by a plurality of processes in an exemplary conventional distributed computing system;

Fig. 10b shows a schematic diagram illustrating the amount of time taken by the plurality of processes of figure 10a in a distributed computing system according to an embodiment implementing vertical cloning;

Fig. 10c shows a schematic diagram illustrating a further example of vertical cloning implemented in a distributed computing system according to an embodiment;

Fig. 11a shows a schematic diagram illustrating the amount of time taken by a plurality of processes in a distributed computing system according to an embodiment implementing horizontal cloning;

Fig. 11b shows a schematic diagram illustrating a further example of horizontal cloning implemented in a distributed computing system according to an embodiment;

Fig. 11c shows a schematic diagram illustrating a plurality of processes executed by a distributed computing system according to an embodiment implementing both vertical cloning and horizontal cloning;

Fig. 11d shows a schematic diagram illustrating a further example for a combination of vertical and horizontal cloning implemented in a distributed computing system according to an embodiment;

Fig. 12a shows a schematic diagram illustrating an example for a plurality of parallel processes in a conventional distributed computing system with disjoint redundancy;

Figs. 12b and 12c show schematic diagrams illustrating an example for a plurality of parallel processes in a distributed computing system according to an embodiment providing fault tolerance using combined redundancy;

Fig. 13a shows a schematic diagram illustrating the amount of time taken by a plurality of processes over several computing rounds in an exemplary conventional distributed computing system;

Fig. 13b shows a schematic diagram illustrating the amount of time taken by the plurality of processes of figure 13a in a distributed computing system according to an embodiment;

Fig. 14 shows a schematic diagram illustrating an exemplary execution of a plurality of processes by a distributed computing system according to an embodiment; and

Fig. 15 shows a schematic diagram illustrating a further exemplary execution of a plurality of processes by a distributed computing system according to an embodiment.

In the various figures, as far as possible, identical reference signs will be used for identical or functionally equivalent features.

DETAILED DESCRIPTION OF THE EMBODIMENTS

In the following description, reference is made to the accompanying drawings, which form part of the disclosure, and in which are shown, by way of illustration, specific aspects in which the present invention may be placed. It is understood that other aspects may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description, therefore, is not to be taken in a limiting sense, as the scope of the present invention is defined by the appended claims.

For instance, it is understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if a specific method step is described, a corresponding device may include a unit to perform the described method step, even if such unit is not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.

Figure 2 shows a distributed computing system 200 according to an embodiment. The distributed computing system 200 comprises a plurality of computing nodes 201 and a communication network 203 configured to allow data exchange between the plurality of computing nodes 201. To this end, the communication network 203 can comprise a plurality of wired and/or wireless connections between the computing nodes 201. In an embodiment, the communication network 203 can comprise an Ethernet communication network. In an embodiment, the distributed computing system 200 is a large-scale general purpose parallel computing system 200. The distributed computing system 200 can be, for instance, a distributed memory supercomputer cluster or a large-scale network of servers in a datacenter.

As can be taken from the detailed view in figure 2, each computing node 201 can comprise a processing unit 201a and a local memory 201b. In an embodiment, non-local memory references can be handled by inter-node communications across the communication network 203. Each computing node 201 is configured to simultaneously execute a respective sub-task of a parallel computing task in a plurality of computing rounds. Each computing round comprises an execution stage, a communication stage between the plurality of computing nodes 201 and a synchronization stage between the plurality of computing nodes 201.
As will be described in more detail below, the distributed computing system 200 is configured to handle one or more high-latency computing nodes of the plurality of computing nodes 201 in each computing round.

As already mentioned above, the distributed computing system 200 is configured to execute large-scale parallel algorithms by "computing in rounds". Such algorithms could be written in software such as MPI, BSP, MapReduce, Spark, Pregel, Giraph, Petuum, or other parallel programming models and systems. This style of parallel computing in rounds is normally referred to as Bulk Synchronous Parallel (BSP) computing. With BSP style parallel algorithms and software, the basic computational model comprises the following stages at each computing node 201: (i) compute on data in local memory 201b; (ii) globally communicate across the communication network 203; (iii) synchronize; and (iv) repeat.
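The four stages above form the classic BSP superstep loop, which can be written schematically as follows. The callback names are illustrative placeholders, not part of any particular BSP library:

```python
def bsp_run(nodes, num_rounds, compute, communicate, synchronize):
    """Schematic Bulk Synchronous Parallel loop over computing rounds.

    Each round performs the stages listed above: (i) compute on local data
    at every node, (ii) global communication, (iii) barrier
    synchronization, then (iv) repeat for the next round.
    """
    for round_index in range(num_rounds):
        for node in nodes:
            compute(node, round_index)     # (i) local computation
        communicate(nodes, round_index)    # (ii) global communication
        synchronize(nodes, round_index)    # (iii) synchronization barrier
    return nodes
```

The barrier at stage (iii) is exactly where a high-latency node stalls the whole round in a conventional system, and where the tail-tolerance mechanisms described herein intervene.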

Simple parallel models such as MapReduce provide an adequate framework for parallel computations that involve only a small number of rounds. Other models such as Spark provide an adequate framework for parallel computations of limited scale, where the low performance obtained by automatic management of communications is acceptable. For general purpose parallel computing at large scale, BSP style parallelism, either using MPI or BSP message passing software, has proven to be capable of delivering the highest levels of performance in all kinds of applications, including Dense Linear Algebra, Sparse Linear Algebra, Spectral Methods (e.g. FFT), N-Body Methods, Structured Grids, Unstructured Grids, Monte Carlo Simulations, Graph Computing, Dynamic Programming, Combinatorial Search (e.g. Branch-and-Bound), Machine Learning, and Discrete Event Simulation.

Thus, in embodiments of the invention the distributed computing system 200 is configured to implement a BSP scheme for "computing in rounds". Many of the most important modern large-scale commercial parallel applications, such as machine learning, deep learning, AI, network optimization, and graph analytics, are highly iterative, involving thousands of computing rounds, and can be very naturally and easily expressed as BSP computations.

Figure 3 shows a flow chart illustrating steps of a distributed computing method 300 according to an embodiment. The method 300 comprises the steps of: executing 301 a respective sub-task of a parallel computing task in a plurality of computing rounds by the plurality of computing nodes 201, wherein each computing round comprises an execution stage, a communication stage between the plurality of computing nodes 201 and a synchronization stage between the plurality of computing nodes 201; and handling 303 one or more high-latency computing nodes of the plurality of computing nodes 201 in each computing round.

Further embodiments of the distributed computing system 200 and the distributed computing method 300 will be described in the following. In an embodiment, the distributed computing system 200 could be implemented as a computing system in the cloud, herein also referred to as "Supercloud". Although the term is influenced by the many challenges that are faced in the specific area of cloud computing, it is not intended to imply that the invention applies only to cloud computing. As already described above, embodiments of the invention are applicable to any large scale parallel computing system such as a distributed memory supercomputer cluster or a large-scale network of servers in a datacenter. As powerful parallel computing with thousands of cores becomes much cheaper, it is becoming possible to move parallel computing applications from the center to the edge of the network - from centralized clouds and datacenters to edge computing - to support nonstop real-time low-latency computing such as mobile AI and intelligent IoT. Embodiments of the invention provide a solution not only for cloud and centralized datacenter computing, but also a solution to the resilience, latency and performance requirements of such mobile and edge computing.

Each computing node 201 can be a physical server or can be virtualized using virtual machines or software containers. Each computing node 201 is configured to run one or more software processes (herein referred to as sub-tasks) during a round of computation. For example, in an embodiment a distributed computing system 200 with 12 computing nodes can execute a parallel program with 48 parallel processes, i.e. sub-tasks, by allocating 4 processes, i.e. sub-tasks, to each computing node 201.

In an embodiment, the execution model implemented in the distributed computing system 200 can be considered as a nonstop BSP scheme. In an embodiment, this nonstop BSP scheme has four core features:

(i) It is a BSP program and is structured to compute in rounds. In each round, each of the processes computes on data in local memory 201b, globally communicates across the network 203, and synchronizes;

(ii) The number of processes in a program defines the parallelism parameter P;

(iii) The number of rounds in a program defines the parameter R, which may be finite or infinite; and

(iv) Each process comprises a BSP synchronization mechanism at the end of each round. As will be described in more detail below, this synchronization mechanism can be parameterized by a (typically Boolean) synchronization parameter or flag indicating whether or not the process/node is one that can set the minimum time value for the round. In an embodiment, the synchronization parameter can be changed dynamically during the execution of a program. If a program has the synchronization parameter set so that no process/node can set the minimum time value in any round, then the distributed computing system 200 can execute the program as a normal BSP program without fault tolerance or tail tolerance.

For illustrative purposes, figure 4 shows a flow chart illustrating the standard execution procedure for a conventional BSP program as described above (with no tail or fault tolerance). As already mentioned, the standard execution procedure for a conventional BSP program comprises an execution and communication stage (see block 401 in figure 4) and a synchronization stage (see block 403 in figure 4). If all local operations and communications for all processes are completed, the next computing round can be started (see block 405 in figure 4).

Figure 5a shows a simple example of a non-load-balanced conventional BSP program with four processes, where, by way of example, there is an expected time per round of 4 seconds for the processes, i.e. sub-tasks P0, P1 and P2, but only 100 ms for the process, i.e. sub-task P3. Figure 5b shows a corresponding nonstop BSP program for execution by the distributed computing system 200 according to an embodiment. In contrast to the program shown in figure 5a, in the embodiment shown in figure 5b the distributed computing system 200 implements a parameterization of the synchronization primitives in order to indicate whether or not the associated process is one that can set the minimum time value for the computing round. So, for the example shown in figure 5b, the distributed computing system 200 according to an embodiment can set the synchronization parameter to "False" for the process P3, and to "True" for the other processes.

The following provides a simple high-level overview of the distributed computing system 200 according to an embodiment with E computing nodes or processing elements (e.g. CPUs) 201, running a P process/sub-task parallel program, where P ≤ E. Each of the E computing nodes 201 can run one or more of the processes/sub-tasks during a single round. As will be described in more detail below, for handling any high-latency processes/sub-tasks the system 200 is configured to run multiple instances of a sub-task, which are referred to as clones. The system 200 is configured to run each of the P processes/sub-tasks as the first process/sub-task on at least one of the E computing nodes 201. In an embodiment, the first process/sub-task to be run on a computing node 201 is the only one that can possibly set a minimum duration value (herein referred to as MinTime) for the computing round. During each computing round, if the first sub-task to be run on each computing node 201 has a synchronization parameter set indicating that it can set MinTime, then it monitors its elapsed time for the current round (nonstop BSP) and attempts to write MinTime when it ends. The first such process to write its time sets MinTime for the computing round.
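The first-writer-wins rule for MinTime can be sketched as follows. The class and method names are hypothetical assumptions for illustration; a real system would use a shared or replicated register across the network rather than a local lock.

```python
import threading

# Sketch of "the first such process to write its time sets MinTime":
# only callers whose synchronization parameter is True may attempt the
# write, and only the first eligible writer succeeds.

class RoundClock:
    """First-writer-wins record of MinTime for the current round."""

    def __init__(self):
        self._lock = threading.Lock()
        self.min_time = None

    def try_set_min_time(self, elapsed, can_set_min_time):
        """Attempt to record `elapsed` as MinTime; return True on success."""
        if not can_set_min_time:
            return False              # synchronization parameter is "False"
        with self._lock:
            if self.min_time is None:
                self.min_time = elapsed   # first eligible finisher wins
                return True
            return False                  # MinTime was already set
```

In the figure 5b example, the short process P3 would call with the flag set to False and never set MinTime, while the first of P0, P1, P2 to finish would set it.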

In an embodiment, each sub-task can have access not only to its own local data and state, but also to other information, including: a copy of the minimum duration, i.e. MinTime, for the computation round; its elapsed time for the round; which clones of other sub-tasks it may need to communicate with; the standby resources available; and where other clones of the sub-task are located.

The distributed computing system 200 can be described by a polynomial Predictability exponent D, indicating that an expected number E/t^D of the computing nodes 201, i.e. a fraction 1/t^D, will fail to complete any round in less than t * MinTime. In an embodiment, the computation performed by the distributed computing system 200 has a TailLimit T. In an embodiment, any sub-task that fails to complete before T * MinTime is marked as a fault/tail, while the others are marked as live. A computing round is successful if at least one clone of each of the P sub-tasks completes.
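A minimal numeric sketch of these definitions, assuming the polynomial model E/t^D and the TailLimit rule exactly as stated; the function names are illustrative only.

```python
# Polynomial predictability model: an expected E / t**D nodes have not
# finished by t * MinTime, for any t >= 1.

def expected_stragglers(E, D, t):
    """Expected number of nodes still running at t * MinTime."""
    return E / t ** D

# TailLimit rule: a sub-task finishing before T * MinTime is live,
# otherwise it is marked as a fault/tail.

def classify(times, min_time, T):
    """Mark each sub-task's round time as "live" or "fault/tail"."""
    return ["live" if t < T * min_time else "fault/tail" for t in times]
```

For example, with E=1000 nodes and D=2, an expected 10 nodes would still be running at 10 * MinTime.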

The inter-process communications of the plurality of computing nodes 201 can be handled in several ways. For example, in an embodiment the first clone of a process/sub-task to complete can handle the global communications for all the clones of that process. A number of other variations are also possible, depending on other objectives such as balancing communications at endpoints, increasing network latency resilience, and other factors.

In an embodiment, process faults and tails can be handled by transferring state from another live clone of the same process. To improve the speed with which this state transfer and relaunch can be achieved, according to an embodiment a pool of standby computing nodes 202 that are ready to run is maintained (an exemplary standby computing node 202 is shown in figure 2). This can be done in several ways, for example, by having a static pool of standby computing nodes 202 directly associated with the various processes/sub-tasks, or by having a dynamic pool of standby computing nodes 202 each of which can be used with any process/sub-task. The choice between having a static pool, a dynamic pool, or no pool of standby computing nodes can be made based on a tradeoff between the speed of relaunch and the efficiency of standby computing node utilization. The above embodiments will be described in more detail further below, for instance, in the context of figures 6 to 9.

Figure 6 shows a flowchart illustrating steps for executing a nonstop BSP program by the distributed computing system 200 according to an embodiment. In a first step 601, a first process/sub-task is executed by a current computing node 201 and communications are initiated, unless another clone of the first process/sub-task has already been completed by another computing node 201. In a further step 603, it is checked whether the minimum duration, i.e. MinTime, has been set and/or whether the synchronization parameter has been set to "False" for the current computing node 201. If this is not the case, the current computing node 201 tries to set the minimum duration, i.e. MinTime, in a further step 605. In a further step 607, the current computing node 201 notifies other clones, i.e. computing nodes 201 executing the same process/sub-task, about the completion of the process by the current computing node 201. In a further step 609, the current computing node 201 is configured to check whether vertical cloning is implemented, i.e. whether the current computing node 201 is supposed to execute a further process/sub-task, which was originally assigned to and is still being executed by a different computing node 201. If this is the case, the current computing node 201 executes one or more further processes/sub-tasks and communicates the completion thereof to the other computing nodes 201, unless the one or more further processes/sub-tasks have already been completed by another computing node 201.

Figure 7 shows a schematic diagram illustrating setting and sharing the minimum duration, i.e. MinTime, of a computing round between computing nodes 201 in the distributed computing system 200 according to an embodiment.
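The skip-if-already-completed logic of steps 601 to 609 can be sketched for one node and one round. Representing "notifying other clones" as a shared set, and the callback name `run_task`, are assumptions made purely for brevity.

```python
# Hedged sketch of steps 601-609 for one node in one round: skip any
# sub-task that a clone on another node has already finished, otherwise
# run it and record its completion so the other clones can skip it.

def run_round(node_tasks, completed, run_task):
    """node_tasks: sub-task ids assigned to this node, the first one first;
    completed: shared set of sub-task ids already finished by some clone;
    run_task: callback executing a sub-task and its communications."""
    for position, task in enumerate(node_tasks):
        if task in completed:
            continue                  # a clone elsewhere already finished it
        run_task(task)                # steps 601/609: execute and communicate
        completed.add(task)           # step 607: notify the other clones
        # steps 603/605 (trying to set MinTime) would apply only when
        # position == 0, i.e. for the first sub-task on this node
```

For example, a node assigned sub-tasks [0, 1, 2], where sub-task 1 has already been completed elsewhere, executes only 0 and 2.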

Figure 8 shows a flow chart illustrating a tail limit interrupt procedure implemented in the distributed computing system 200 according to an embodiment. In a first step 801, it is checked whether the computing round is not complete yet and whether the elapsed time is larger than the minimum duration or a multiple thereof, e.g. T * MinTime. If this is the case, the respective computing node 201 is a high-latency computing node 201 and will send Tail Limit interrupts in a step 803 to the other computing nodes 201.

Figure 9 shows a flow chart illustrating the completion of a computing round at the tail limit as implemented in a current processing node 201 of the distributed computing system 200 according to an embodiment. In a first step 901, it is checked whether a first process/sub-task is still being executed by the current computing node 201 and has not been completed by another computing node 201 yet. If this is the case, the computation must stop, as there is no completed copy of that sub-task. If nonstop computation is required, then the degree of horizontal cloning, the degree of vertical cloning, and the tail limit T should all be set to ensure that this does not happen. Otherwise, in a step 903 it is checked whether the first process/sub-task is still being executed by the current computing node 201 and has already been completed by another computing node 201. If this is the case, the current computing node 201 can be relaunched and/or a standby computing node 202 can be used to replace it. In a step 907, the state(s) for all incomplete vertical clones can be transferred.

As already described above, in order to ensure fault tolerance and tail tolerance, embodiments of the distributed computing system 200 provide vertical cloning and/or horizontal cloning of processes/sub-tasks. In a distributed computing system 200 with E computing nodes 201 running a P process/sub-task parallel program with P ≤ E, each of the E computing nodes 201 can run one or more of the processes/sub-tasks during a single round. Each of the P processes/sub-tasks is run as the first process on at least one of the E computing nodes 201. As a first exemplary embodiment, the case of vertical cloning for P=E is considered. In this example P=10, with the processes/sub-tasks numbered from 0 to 9. For a perfectly load balanced program, every process/sub-task takes exactly the same time as all the others, in every round. Then, with no faults or tails, it is expected that all processes/sub-tasks will complete the round at the same time. If, however, there are faults or high-latency tails, then the processes/sub-tasks may take quite different amounts of time, as illustrated in figure 10a, where, by way of example, the process/sub-task "6" takes the longest time.

Figure 10b illustrates how the example of figure 10a can be handled by the distributed computing system 200 according to an embodiment on the basis of vertical cloning, i.e. multiple process/sub-task instances running consecutively on the same computing node 201. For instance, the respective computing node 201 having completed the process/sub-task "0" starts executing the process/sub-task "1", because the other processing node 201, which was assigned to execute the process/sub-task "1" in the first place, has not finished yet. As will be appreciated, due to vertical cloning the overall time to complete the round is substantially smaller than the time it would take for the high-latency nodes to complete their respective processes/sub-tasks, which in the example of figure 10b are the computing nodes 201 executing the processes/sub-tasks "3" and "6" as a respective first process of the round. As will be appreciated, vertical cloning does not require any additional computing nodes 201.

In the example shown in figure 10b, each process/sub-task appears twice, but vertical cloning can be used with any multiple. The multiples do not even need to be uniform, although this may normally be the case. For example, vertical cloning can be implemented by the distributed computing system 200 as illustrated in figure 10c. The degree of vertical cloning can be defined by a parameter VC. In the example shown in figure 10c, VC=4. In an embodiment, only the first row of processes/sub-tasks can set MinTime for the round, and can do so only if their synchronization parameter is set accordingly.

Figure 11a illustrates how high-latency computing nodes 201 can be handled by the distributed computing system 200 according to an embodiment on the basis of horizontal cloning, i.e. multiple process/sub-task instances running concurrently on different computing nodes 201. In the example shown in figure 11a, each process/sub-task appears twice, but horizontal cloning can be used with any multiple. The multiples do not need to be uniform. For example, figure 11b shows an exemplary embodiment for the case P=6 and E=21. The level of horizontal cloning can be defined by the parameter HC = E/P. In the examples shown in figures 11a and 11b, the values of HC are 2 and 3.5, respectively.
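The clone-level parameters can be expressed numerically. The round-robin schedule below is one plausible uniform vertical-cloning layout, matching the wrap-around pattern of figure 10b (where the node that finishes "0" picks up "1"); it is an illustrative assumption, not the only possible layout.

```python
# HC = E / P: the horizontal clone level, possibly non-integral.

def horizontal_clone_level(E, P):
    """Number of concurrent instances per sub-task across E nodes."""
    return E / P

# One plausible uniform vertical-cloning schedule: node i runs sub-tasks
# i, i+1, ..., i+VC-1 (mod P), so each sub-task appears on VC nodes.

def vertical_schedule(P, VC):
    """Per-node list of sub-tasks, first (MinTime-eligible) sub-task first."""
    return [[(i + k) % P for k in range(VC)] for i in range(P)]
```

This reproduces the values quoted above: for P=6 and E=21, HC is 3.5; and with P=4, VC=2, the node whose first sub-task is "0" has "1" as its vertical clone.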

In an embodiment, the distributed computing system 200 can be configured to combine vertical and horizontal cloning for handling high-latency and/or faulty computing nodes 201. Figure 11c shows an exemplary embodiment, where the distributed computing system 200 uses vertical cloning with a vertical cloning parameter VC=3 and horizontal cloning with a horizontal cloning parameter HC=2. Figure 11d shows another example with VC=2 and HC=1.5. As in the case of purely vertical cloning, in an embodiment only the first row of processes/sub-tasks can set MinTime for the round, and can do so only if their synchronization parameter is set accordingly.

In addition to handling any high-latency computing nodes 201, the distributed computing system 200 according to an embodiment using horizontal cloning (and in embodiments vertical cloning as well) can increase the resilience of large-scale parallel computing architectures by providing fault tolerance. This advantageous effect provided by horizontal cloning as implemented in the distributed computing system 200 according to an embodiment will be described in the following in the context of figures 12a-c, which show an example with 10 sub-tasks.

Figure 12a shows the conventional fault tolerance approach of running two separate, disjoint copies of the 10-process computation (referred to in the following as Left and Right), with no significant communication between them. Embodiments of the invention have the capability to run both computations together as a single combined computation, executing in rounds, with communications possible between any of the 2P processes, and with automatic fault tolerance and tail tolerance. The cost will be approximately the same as that of the conventional disjoint redundant computation, but the combined computation is much more resilient.

If, in the conventional approach shown in figure 12a, in any round a process in Left and a process in Right both experience a fault or long tail, then the computation has to stop. As will be appreciated, this does not have to be the same process. For example, if process 4 in Left and process 2 in Right both experience a fault or long tail, then the computation has to stop.

Figure 12b shows the example of figure 12a as implemented in the distributed computing system 200 according to an embodiment. As long as the two copies of any process/sub-task do not both experience a fault or long tail during the same round, there is no need to stop the computation. For example, if process clones 0, 3, 4, 6 from the left group and process clones 1, 5, 8, 9 from the right group all experience a fault or long tail during the same round, then the computation can still continue without interruption (as shown in figure 12c).

For the conventional disjoint computation illustrated in figure 12a, the probability of the 2p-process system surviving f failures is

2 * C(p, f) / C(2p, f)

whereas in the computation performed by the distributed computing system 200 according to an embodiment, the probability is

2^f * C(p, f) / C(2p, f)

where C(n, k) denotes the binomial coefficient.

Thus, it will be appreciated that the computation performed by the distributed computing system 200 is much more reliable than the disjoint computation as a function of f. For instance, for p=1000 and f=2, the conventional disjoint computation survives with probability close to 0.5, whereas the computation performed by the distributed computing system 200 survives with probability 0.9995. For p=10,000 and f=2, the conventional disjoint computation survives with probability close to 0.5 and the computation performed by the distributed computing system 200 survives with probability 0.99995. Embodiments of the invention can be easily extended to higher levels of replication. If, instead of 2x replication, an rx replication is used, then the probability of a computation performed by the distributed computing system 200 surviving f faults or long tails is given by

N(r, p, f) / C(rp, f)

where

N(r, p, f) = 1 if f = 0,

N(r, p, f) = 0 if p < 1 or f < 0, and otherwise

N(r, p, f) = SUM over i = 0 to min(f, r-1) of C(r, i) * N(r, p-1, f-i).

In the above formulae, r is the replication level, i.e. the horizontal clone level HC, p is the parallelism, and f is the number of failures.

Embodiments of the invention have an advantageous new cost model that enables optimal parameters to be calculated to achieve optimal performance, ensure nonstop resilience, and allow tuning to achieve the best possible trade-offs between performance, cost and resilience. The following parameters together make up this cost model applying to embodiments of the invention. Program parameters: Parallelism P, Rounds R; network parameters: Network Throughput g, Network Latency L; memory parameter: Local Memory Size M; parameters of the distributed computing system 200 according to an embodiment: Predictability D, TailLimit T, VerticalCloneLevel VC, HorizontalCloneLevel HC, StandbyLevel S. Given D, P and R, it is possible to automatically compute the optimal T, VC, HC and S to optimize performance and ensure nonstop resilience of the distributed computing system 200 as follows: Cost < T * HC * Perfect, where Perfect is the cost for an idealized system where D is infinite (no failures ever, no tails ever). This cost model can be placed at the top level of a hierarchy of cost models for different models of parallel computing, including PRAM (idealized shared memory), MapReduce, and standard BSP with no fault tolerance or tail tolerance.

The parameters P, R and the data distribution can be chosen by the programmer. The parameters g, L, M, D are determined by the infrastructure (hardware and software). Using the cost model, optimal T, VC, HC, S parameters can be automatically calculated to guarantee a given Quality of Service level (performance and resilience). Thus, in addition to providing high-performance general-purpose parallel computing with fault tolerance and tail tolerance, the distributed computing system 200 according to an embodiment is also easy to use. A Boolean synchronization parameter can be added to sync, which can be made always true for a load balanced program. Moreover, any large-scale parallel software can be automatically optimized for the distributed computing system 200 according to an embodiment. Given the program Parallelism P and Rounds R, and the Predictability D of the distributed computing system 200, the optimal values of the TailLimit, VerticalClone level, HorizontalClone level and StandbyLevel can be automatically generated.

Figure 13a shows a schematic diagram illustrating the amount of time taken by a plurality of processes over several computing rounds in an exemplary conventional distributed computing system. In comparison thereto figure 13b shows a schematic diagram illustrating the amount of time taken by the plurality of processes of figure 13a in the distributed computing system 200 according to an embodiment. As will be appreciated, the distributed computing system 200 provides a substantially improved performance.

As already mentioned above, the distributed computing system 200 allows running nonstop BSP programs having a degree of Parallelism P. A program with integer Parallelism P>0 consists of P processes that compute together in rounds. The number of Rounds R>0 may be finite or infinite. In each round, the processes perform local computation, global communication and barrier synchronization. At the end of each round, the distributed computing system 200 according to an embodiment handles faults and tails. The synchronization mechanism at the end of each round can be parameterized by a value that indicates whether that particular process can or cannot set the minimum time for the round. For example, it can be specified by a Boolean value, true or false. The first processes on each computing node 201 monitor their elapsed time for that round. If all processes have the synchronization parameter set false in every round then there is no fault or tail tolerance. In every round the subset that has the synchronization parameter set true may vary, under program control. In an approximately load balanced parallel computation for which efficient fault and tail tolerance is sought, the synchronization parameter can be simply set to true for all processes in every round.

As already described above, the number of computing nodes 201 can be significantly greater than the number of processes/sub-tasks (e.g., where horizontal cloning is used). Processes can be cloned vertically or horizontally or both. The clones of the processes can be run together within a combined computation, with communication between the process clones as required by the computation in each round. The first process clones on each computing node 201 monitor their elapsed time for the current round. The earliest first process clone to finish a round with a synchronization parameter set true can set MinTime for the round. Any timing error due to race conditions will typically be negligible compared to the round time.

As already described above, the (un)predictability of a distributed computing system 200 according to an embodiment due to faults and tails can be modelled in a number of ways. For example, a simple model of predictability is to have a polynomial predictability function defined by a Predictability exponent D, indicating that in any round, a fraction 1/t^D of the E computing nodes 201 will fail to complete the round in time less than t * MinTime, i.e. at t * MinTime one will have E/t^D first process faults/tails, for any t ≥ 1. For such a polynomial predictability model, realizing a higher value for D may require a much higher investment in the system infrastructure, and/or more cautious management and scheduling. Other predictability models could be chosen, for example, a model that factors in the length of MinTime, or one that factors in an additional extra term related to faults (infinite tails). Linear, sublinear or superpolynomial predictability are possible, e.g. D=1, D=0.5, or exponential predictability E/2^t, but these may not reflect realistic system architectures. A system with D=0.5 would probably be grossly inefficient and unusable for many advanced applications. A large-scale system with exponential predictability would have to overcome many unsolved issues in software architectures and/or would be prohibitively expensive and uneconomic.

As already described above, each round has a TailLimit T>1. Any first process clone that fails to complete before T * MinTime is marked as a fault/tail. There are a number of ways in which the inter-process communications of the program can be handled by the distributed computing system 200 according to an embodiment. For example, the first clone of a process to complete can handle the global communications for all the clones of that process. A number of other variations are also possible, depending on other objectives such as balancing communications at endpoints, increasing network latency resilience and other factors.

As already described above, process faults and tails can be handled by transferring state from another live clone of the same process. To improve the speed with which this state transfer and relaunch can be achieved, according to an embodiment a pool of standby computing nodes 202, e.g. containers, is maintained that are ready to run. This can be done in several ways, for example, by having a static pool of standby computing nodes 202 directly associated with the various processes, or by having a dynamic pool of standby computing nodes 202 each of which can be used with any process. The choice between having a static pool, a dynamic pool, or no pool of standby nodes 202 can be made based on a tradeoff between the speed of relaunch and the efficiency of container utilization. The general advantages and disadvantages of the various options for standby pools can be summarized as follows.

In case of no standby nodes 202, it is necessary to wait until all processes are complete or MaxTime is reached; then a processing node 201 needs to be relaunched and written to, assuming that at least one clone finishes by MaxTime. This may result in relaunch delays, but is an inexpensive option.

In case of a dynamic shared pool of standby nodes 202, the required standby pool size can be estimated based on the expected number of standbys needed in each round. Writes need to be delayed until all processes complete or MaxTime is reached.

In case of a static pool of standby nodes 202, each process can have an integer StandbyLevel S>0 assigned. Typically S=1. At all times, there is a pool of S * P extra standby nodes 202 ready. Relaunch delays can be minimized, although this can be an expensive solution in terms of resources.

The above variants offer a range of tradeoffs between cost and speed. As an example of how standby pools can be used, one can consider a simple case with VC=1 (no vertical cloning), uniform horizontal cloning with horizontal clone level HC, and a static pool of standby nodes 202 with StandbyLevel S. At all times, there is a pool of S * P extra computing nodes 202 ready. The first of a process's clones to finish writes to the standby node 202 for the process. If HC-1 clones complete before MaxTime, then the tail clone is relaunched as a new standby. If all HC clones complete, then the standby continues as a standby. If HC-1 clones complete and the standby is written to, then the round can be stopped before MaxTime. If S>1, then the first clone to finish can write to all S standbys, and it is only necessary to wait for HC-S clones to finish. For each process, let (HC, S) be the current state. In any round, let F be the number of the HC clones that do not complete within MaxTime: if F=0, then the new state is (HC+S-1, 1), else it is (HC+S-F, F); if F=HC, then the round is "STOP-FORCED". An example with P=4, HC=2, TailLimit T=3, and S=1 is shown in figure 14, where processes are denoted by p_i, clones by c_i, and standbys by s_i.
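The per-process bookkeeping described above is a small state transition, sketched here with illustrative names:

```python
# State transition for one process's (HC, S) bookkeeping, as described
# above. F is the number of the HC clones that miss MaxTime in a round.

def next_state(HC, S, F):
    """Return the new (HC, S) state, or "STOP-FORCED" if every clone missed."""
    if F == HC:
        return "STOP-FORCED"          # no live clone of this process remains
    if F == 0:
        return (HC + S - 1, 1)        # all clones finished; one standby kept
    return (HC + S - F, F)            # F standbys consumed to replace tails
```

For the example parameters HC=2, S=1, a round with one straggler (F=1) leaves the state at (2, 1), while F=2 forces a stop.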

A further example of a computation performed by the distributed computing system 200 according to an embodiment is shown in figure 15. This example illustrates how the distributed computing system 200 according to an embodiment can handle fault and tail latencies running a parallel computation with P=4, R=16, VC=1, HC=2, S=1, and TailLimit T=2.5. In the example shown in figure 15, the synchronization parameter is set false for p1, c1 and s1, and set true for all others. Each entry in the table denotes the time for that process clone as a multiple of the MinTime (=1) for that round.

For embodiments of the distributed computing system 200, some of the parameters described above can be chosen in one of the following ways. If all processes/sub-tasks have the synchronization parameter set false in a round, then the distributed computing system 200 would operate as a conventional system. In every round the subset of computing nodes 201 that have the synchronization parameter set true may vary, under program control. In an approximately load balanced computation for which efficient fault and tail tolerance is sought, it will typically be convenient simply to set the synchronization parameter to true for all processes in every round.

Load balancing is important in all areas of high performance computing. Without good load balancing, parallel performance will often be poor anyway, even on a fault-free and tail-free system as provided by the distributed computing system 200 according to an embodiment. If a computation is not approximately load balanced in each round (due to algorithmic work imbalance or hardware imbalance), then this should be incorporated into the setting of the TailLimit T, to avoid unnecessary actions. So if process A might be 2x shorter algorithmically than every other process in some rounds, and one wants to set a TailLimit that also allows 3x tail tolerance due to software unpredictability, then T should be set to 6. Alternatively, A can have its synchronization parameter set to false and T can be set to 3. Other alternatives are also possible. Since, in a load balanced computation, at most T vertical clones can be executed within a TailLimit of T, VC should typically be set to be at most T. For example, if T=3.5 then VC=3 might be a suitable choice.
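The arithmetic in this paragraph can be made explicit; the helper names are hypothetical and simply restate the two rules of thumb above.

```python
import math

# Two rules of thumb from the paragraph above: a 2x algorithmic imbalance
# combined with a desired 3x tail tolerance gives TailLimit T = 2 * 3 = 6,
# and at most floor(T) vertical clones fit within a TailLimit of T.

def tail_limit(algorithmic_imbalance, tail_tolerance):
    """TailLimit T covering both work imbalance and tail tolerance."""
    return algorithmic_imbalance * tail_tolerance

def max_vertical_clones(T):
    """Largest VC that can execute within a TailLimit of T."""
    return math.floor(T)
```

For T=3.5 this yields VC=3, matching the choice suggested above.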

While a particular feature or aspect of the disclosure may have been disclosed with respect to only one of several implementations or embodiments, such feature or aspect may be combined with one or more other features or aspects of the other implementations or embodiments as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "include", "have", "with", or other variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprise". Also, the terms "exemplary", "for example" and "e.g." are merely meant as an example, rather than the best or optimal. The terms "coupled" and "connected", along with derivatives may have been used. It should be understood that these terms may have been used to indicate that two elements cooperate or interact with each other regardless whether they are in direct physical or electrical contact, or they are not in direct contact with each other.

Although specific aspects have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific aspects shown and described without departing from the scope of the present disclosure. This application is intended to cover any adaptations or variations of the specific aspects discussed herein.

Although the elements in the following claims are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.

Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art readily recognize that there are numerous applications of the invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art recognize that many changes may be made thereto without departing from the scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.