

Title:
TECHNIQUES FOR PERFORMANCE EVALUATION OF AN ELECTRONIC HARDWARE DESIGN ON A COMPUTER SIMULATION SERVER
Document Type and Number:
WIPO Patent Application WO/2018/006928
Kind Code:
A1
Abstract:
The disclosure relates to a method (100) for performance evaluation of an electronic hardware design on a computer simulation server comprising a plurality of processing cores, the method comprising: partitioning (101) a computer simulation (110) of the electronic hardware design among a plurality of worker threads (111, 112, 113) for parallel execution on the plurality of processing cores of the computer simulation server; providing (102) a scheduling element (114) for controlling a progress of the plurality of worker threads (111, 112, 113); and running (103) the scheduling element (114) and the plurality of worker threads (111, 112, 113) on the plurality of processing cores to evaluate a performance of the electronic hardware design on the computer simulation server, wherein an execution of the plurality of worker threads (111, 112, 113) is mutually locked (121, 122) with an execution of the scheduling element (114).

Inventors:
CHALAK ORI (DE)
WU ZUGUANG (DE)
ZHENG LIBING (DE)
Application Number:
PCT/EP2016/065664
Publication Date:
January 11, 2018
Filing Date:
July 04, 2016
Assignee:
HUAWEI TECH CO LTD (CN)
CHALAK ORI (DE)
WU ZUGUANG (DE)
ZHENG LIBING (DE)
International Classes:
G06F9/48; G06F9/52; G06F17/50
Foreign References:
US20150058859A1 2015-02-26
Other References:
TOM BERGAN ET AL: "CoreDet", PROCEEDINGS OF THE FIFTEENTH EDITION OF ASPLOS ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS, ASPLOS '10, ACM PRESS, NEW YORK, NEW YORK, USA, 13 March 2010 (2010-03-13), pages 53 - 64, XP058123636, ISBN: 978-1-60558-839-1, DOI: 10.1145/1736020.1736029
ANDREW OVER ET AL: "A Comparison of Two Approaches to Parallel Simulation of Multiprocessors", PERFORMANCE ANALYSIS OF SYSTEMS & SOFTWARE, 2007. ISPASS 2007. IEEE INTERNATIONAL SYMPOSIUM ON, IEEE, PI, 1 April 2007 (2007-04-01), pages 12 - 22, XP031091884, ISBN: 978-1-4244-1081-1
CHRISTOPH SCHUMACHER ET AL: "parSC", 2010 IEEE/ACM/IFIP INTERNATIONAL CONFERENCE ON HARDWARE/SOFTWARE CODESIGN AND SYSTEM SYNTHESIS (CODES+ISSS 2010) : SCOTTSDALE, ARIZONA, USA, 24 - 29 OCTOBER 2010, 1 January 2010 (2010-01-01), Piscataway, NJ, pages 241, XP055352384, ISBN: 978-1-60558-905-3, DOI: 10.1145/1878961.1879005
"IFIP Advances in Information and Communication Technology", vol. 310, 1 January 2009, ISSN: 1868-4238, article RAUF SALIMI KHALIGH ET AL: "Efficient Parallel Transaction Level Simulation by Exploiting Temporal Decoupling", pages: 149 - 158, XP055234827, DOI: 10.1007/978-3-642-04284-3_14
EZUDHEEN P ET AL: "Parallelizing SystemC Kernel for Fast Hardware Simulation on SMP Machines", PRINCIPLES OF ADVANCED AND DISTRIBUTED SIMULATION, 2005. PADS 2005. WORKSHOP ON MONTEREY, CA, USA 01-03 JUNE 2005, PISCATAWAY, NJ, USA, IEEE, 1730 MASSACHUSETTS AVE., NW WASHINGTON, DC 20036-1992 USA, 22 June 2009 (2009-06-22), pages 80 - 87, XP058191459, ISSN: 1087-4097, ISBN: 978-0-7695-3713-9, DOI: 10.1109/PADS.2009.25
BRINGMANN OLIVER ET AL: "The next generation of virtual prototyping: Ultra-fast yet accurate simulation of HW/SW systems", 2015 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE), EDAA, 9 March 2015 (2015-03-09), pages 1698 - 1707, XP032765878, DOI: 10.7873/DATE.2015.1105
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
CLAIMS:

1. A method (100) for performance evaluation of an electronic hardware design on a computer simulation server, the computer simulation server comprising a plurality of processing cores, wherein the method comprises: partitioning (101) a computer simulation (110) of the electronic hardware design among a plurality of worker threads (111, 112, 113) for parallel execution on the plurality of processing cores of the computer simulation server; providing (102) a scheduling element (114) for controlling a progress of the plurality of worker threads (111, 112, 113); and running (103) the scheduling element (114) and the plurality of worker threads (111, 112, 113) on the plurality of processing cores to evaluate a performance of the electronic hardware design on the computer simulation server, wherein an execution of the plurality of worker threads (111, 112, 113) is mutually locked (121, 122) with an execution of the scheduling element (114).

2. The method (100) of claim 1, wherein the scheduling element (114) comprises a scheduler thread that is different from the plurality of worker threads (111, 112, 113).

3. The method (100) of claim 2, wherein the plurality of worker threads (111, 112, 113) and the scheduler thread (114) interlock execution of each other such that execution of a worker thread (111, 112, 113) locks execution of the scheduler thread (114) and execution of the scheduler thread (114) locks execution of the worker thread (111, 112, 113).

4. The method (100) of claim 2 or 3, wherein a processing cycle of the computer simulation server is split into a work phase for performing computation within the worker threads (111, 112, 113) and a transfer phase for performing data transfer between the worker threads (111, 112, 113).

5. The method (100) of claim 4, comprising: assigning one or more partitions (p1, p2, p3) of the computer simulation (110) to a respective worker thread (111, 112, 113); and serially processing the assigned partitions (p1, p2, p3) by the respective worker thread (111, 112, 113).

6. The method (100) of claim 4 or 5, comprising: assigning one or more communication channels (303) for communication between the partitions (301, 302) of the computer simulation (110) to a respective worker thread (111, 112, 113); and serially processing the assigned communication channels (303) by the respective worker thread (111, 112, 113).

7. The method (100) of one of claims 4 to 6, comprising: synchronizing a worker thread (511) by the scheduling thread (514) by setting a first synchronization point (521) for locking the worker thread (511) from transitioning to the transfer phase (502) and by setting a second synchronization point (522) for locking the worker thread (511) from transitioning to the work phase (501).

8. The method (100) of claim 7, wherein the second synchronization point (522) is later in time than the first synchronization point (521).

9. The method (100) of claim 8, comprising: first setting the second synchronization point (522) and then releasing the first synchronization point (521).

10. The method (100) of one of claims 7 to 9, comprising: synchronizing the scheduling thread (514) by a worker thread (511) by setting a third synchronization point (523) for locking the scheduling thread (514) from transitioning to a second phase (504) and by setting a fourth synchronization point (524) for locking the scheduling thread (514) from transitioning to a first phase (503).

11. The method (100) of claim 10, wherein the fourth synchronization point (524) is later in time than the third synchronization point (523).

12. The method (100) of claim 11, comprising: first setting the fourth synchronization point (524) and then releasing the third synchronization point (523).

13. The method (100) of one of claims 7 to 12, wherein setting the first (521), second (522), third (523) and fourth (524) synchronization points comprises setting a mutex, spinlock, semaphore, volatile variable, atomic variable or any other software synchronization technique between threads.

14. A computer program with a program code (500, 600) for performance evaluation of an electronic hardware design according to the method (100) of any of claims 1 to 13, when the computer program runs on a computer simulation server (800).

15. A computer simulation server (900) comprising a plurality of processing cores (910, 920, 930, 940) for performance evaluation of an electronic hardware design, the computer simulation server (900) comprising: a plurality of worker threads (111, 112, 113) running in parallel on the plurality of processing cores (910, 920, 930, 940), wherein the plurality of worker threads (111, 112, 113) is configured to execute a respective partition (p1, p2, p3) of a computer simulation (110) of the electronic hardware design on the plurality of processing cores (910, 920, 930, 940); and a scheduling element (114) configured to control a progress of the plurality of worker threads (111, 112, 113), wherein an execution of the plurality of worker threads (111, 112, 113) is mutually locked with an execution of the scheduling element (114).

Description:
Techniques for performance evaluation of an electronic hardware design on a computer simulation server

TECHNICAL FIELD

The present disclosure relates to techniques for performance evaluation of an electronic hardware design on a computer simulation server, in particular to a method and system with a parallel scheduler for a hardware microarchitecture performance simulator.

BACKGROUND

Architectural timing simulation, also referred to as performance simulation, is used to explore and optimize the performance of a hardware (HW) architecture at an early design stage. Simulated models grow larger every year, while simulation server CPUs (central processing units) are not getting faster. The result is a widening simulation speed gap: simulation runtime increases year by year. At the same time, more cores and threads are available on the simulation server. This in turn sets a challenge for performance simulators to parallelize the simulation efficiently among several threads. The frequent transfer of data between the threads requires that the data transfer be synchronized. The synchronization has a runtime overhead, caused by several factors such as memory barriers and waking sleeping threads.

As said above, synchronization has a runtime overhead. For a larger simulated model, more data is transferred between threads in each simulated cycle. More data transfer causes more synchronizations, and more synchronizations waste a larger portion of the simulation time on synchronization.

SUMMARY

It is the object of the invention to provide improved techniques for performance simulation.

This object is achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

A basic idea of the invention is to solve this problem not by synchronizing the transferred data itself, but by splitting the simulation runtime into a work phase and a transfer phase and synchronizing on the transition between the two phases. The solution adds synchronization points at which the execution of the simulation model moves to the next phase. The total number of synchronizations is proportional to the number of worker threads, so the approach can be used for large simulation models. The synchronization overhead ratio R can be described by the following formula: R = T_s / (T_s + T_m), where T_s is the synchronization overhead time and T_m is the model computation time. Hence, a large computation time T_m results in a small synchronization overhead and thus better parallel scaling.
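As a worked instance of this formula (with values assumed purely for illustration, not taken from the disclosure):

```latex
% Assumed per-cycle values: T_s = 0.2 ms synchronization overhead,
% T_m = 9.8 ms model computation.
R = \frac{T_s}{T_s + T_m} = \frac{0.2}{0.2 + 9.8} = 0.02
```

Doubling the model computation per cycle to T_m = 19.6 ms while T_s stays fixed gives R ≈ 0.01, illustrating why larger models scale better under this scheme.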

The methods, devices and systems described hereinafter may be executed on multi-core processors. A processor is a component that reads and executes program instructions; these instructions tell the processor what to do, such as reading data from memory or sending data to an output bus. A common type of processor is the Central Processing Unit (CPU). A multi-core processor is generally defined as an integrated circuit to which two or more independent processors (called cores) are attached. Multi-core processors emerged in the computing industry as a way to achieve greater performance through parallelism rather than raw clock speed. Over the last decades, the computer industry developed faster and faster processors, but this pursuit is drawing to a close due to the limits of transistor scaling, power requirements and heat dissipation. Because single-threaded cores are reaching a plateau in clock frequency, chip manufacturers have turned to multi-core processors to enhance performance using parallelism.

The methods, devices and systems described hereinafter may be based on threads and thread-level parallelism. Thread-level parallelism involves executing individual task threads delegated to the CPU simultaneously. It substantially impacts multi-threaded application performance through various factors, ranging from hardware-specific and thread-implementation-specific to application-specific ones. Each thread maintains its own memory stack and instructions, so that it may be thought of as an independent task, even if in reality the thread might not be fully independent within the program or operating system. Thread-level parallelism is used by programs and operating systems that have a multi-threaded design.

Conceptually, it is straightforward to see why thread-level parallelism increases performance. If the threads are truly independent, then spreading a set of threads among the available cores of a processor reduces the elapsed execution time to the maximum execution time of any single thread, whereas a single-threaded version would require the sum of the execution times of all threads. Ideally, the work is also evenly divided among the threads, and the overhead of allocating and scheduling threads is minimal.
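A minimal C++ sketch of this idea (all identifiers are illustrative and not taken from the disclosure): three uneven, independent tasks are spread across threads, so the elapsed time approaches that of the longest task rather than the sum of all three.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

// Illustrative stand-in for model computation: each worker runs an
// independent task; with one core per thread, elapsed time approaches
// the longest single task rather than the sum of all tasks.
void simulatePartition(int id, long workload) {
    volatile long sink = 0;
    for (long i = 0; i < workload * 1000000L; ++i) sink += i;
    std::printf("partition %d done\n", id);
}

int main() {
    const long workloads[] = {3, 5, 4}; // uneven work: elapsed ~ max, not sum
    std::vector<std::thread> workers;
    for (int id = 0; id < 3; ++id)
        workers.emplace_back(simulatePartition, id, workloads[id]);
    for (auto& t : workers) t.join();   // wait for all partitions to finish
}
```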

According to a first aspect, the invention relates to a method for performance evaluation of an electronic hardware design on a computer simulation server. The computer simulation server comprises a plurality of processing cores. The method comprises the steps of: partitioning a computer simulation of the electronic hardware design among a plurality of worker threads for parallel execution on the plurality of processing cores of the computer simulation server; providing a scheduling element for controlling a progress of the plurality of worker threads; and running the scheduling element and the plurality of worker threads on the plurality of processing cores to evaluate a performance of the electronic hardware design on the computer simulation server, wherein an execution of the plurality of worker threads is mutually locked with an execution of the scheduling element to ensure synchronized execution.

This provides the advantage that execution of the computer simulation can be efficiently and safely synchronized by worker threads and a scheduling element which can mutually lock each other. On the one hand, this minimizes synchronization overhead and thus increases simulation speed; on the other hand, it guarantees safety, as the plurality of threads does not compromise data integrity. In the context of this application, safety is to be understood as guaranteeing that a consumer can consume the data only after the producer has finalized production of the data.

The scheduling element controls the progress of the threads. The operating system controls the assignment of worker threads to processing cores.

In a first possible implementation form of the method according to the first aspect, the scheduling element comprises a scheduler thread that is different from the plurality of worker threads.

In a second possible implementation form of the method according to the first implementation form of the first aspect, the plurality of worker threads and the scheduler thread interlock execution of each other such that execution of a worker thread locks execution of the scheduler thread and execution of the scheduler thread locks execution of the worker thread.

This provides the advantage that the scheduler thread and the worker threads can implement a kind of handshake procedure for synchronization, which is both very efficient and safe. Safe means that it is guaranteed that a consumer can consume the data only after the producer has completed producing the data.

In a third possible implementation form of the method according to any of the first or second implementation forms of the first aspect, a processing cycle of the computer simulation server is split into a work phase for performing computation within the worker threads and a transfer phase for performing data transfer between the worker threads.

This provides the advantage that the simulation runtime can be split into a work phase and a transfer phase, with synchronization performed on the phase transition. This solution adds synchronization points at which the execution of the simulation model moves to the next phase. The total number of synchronizations may be proportional to the number of worker threads, so the solution may be used for large simulation models. It means that the following can be guaranteed: 1) transfer starts only after all threads have completed work; and 2) work starts only after all threads have completed transfer.

In a fourth possible implementation form of the method according to the third implementation form of the first aspect, the method comprises: assigning one or more partitions of the computer simulation to a respective worker thread; and serially processing the assigned partitions by the respective worker thread.

This provides the advantage that the partitioning of the computer simulation among the worker threads can be flexibly handled depending on the size of the electronic hardware design and the architecture of the computer simulation server. Hence, execution of the computer simulation can be improved in terms of speed and memory requirements.

In a fifth possible implementation form of the method according to any of the third or fourth implementation forms of the first aspect, the method comprises: assigning one or more communication channels for communication between the partitions of the computer simulation to a respective worker thread; and serially processing the assigned communication channels by the respective worker thread.

This provides the advantage that data communication between the respective partitions can be flexibly assigned depending on the requirements of the performance evaluation of the electronic hardware design resulting in optimal performance.

In a sixth possible implementation form of the method according to any of the third to the fifth implementation forms of the first aspect, the method comprises: synchronizing a worker thread by the scheduling thread by setting a first synchronization point for locking the worker thread from transitioning to the transfer phase and by setting a second synchronization point for locking the worker thread from transitioning to the work phase.

This provides the advantage that the worker thread can be synchronized between the transfer phase and the work phase. That means that data processed by a worker thread in the work phase is available when the transfer phase starts, and data transferred by a worker thread in the transfer phase is available when the next work phase starts. Simulation speed is achieved by the multiplicity of threads.

In a seventh possible implementation form of the method according to the sixth implementation form of the first aspect, the second synchronization point is later in time than the first synchronization point.

This provides the advantage that the worker thread can first transfer its processed data to the next worker thread before it starts processing the next data. Hence, there is enough time to finish processing before the next processing is started.

In an eighth possible implementation form of the method according to the seventh implementation form of the first aspect, the method comprises: first setting the second synchronization point and then releasing the first synchronization point.

This provides the advantage that the setting of the second synchronization point locks the worker thread from transitioning to the next work phase before the releasing of the first synchronization point allows the worker thread to transition to the transfer phase. As already said, safety of execution can thus be guaranteed.

In a ninth possible implementation form of the method according to any of the sixth to the eighth implementation forms of the first aspect, the method comprises: synchronizing the scheduling thread by a worker thread by setting a third synchronization point for locking the scheduling thread from transitioning to a second phase and by setting a fourth synchronization point for locking the scheduling thread from transitioning to a first phase.

This provides the advantage that the scheduling thread can be synchronized by a worker thread between first phase and second phase. As the scheduling thread controls the progress of the worker threads, this guarantees that the scheduling thread will not proceed to the next phase before all worker threads are ready to move to the next phase.

In a tenth possible implementation form of the method according to the ninth implementation form of the first aspect, the fourth synchronization point is later in time than the third synchronization point.

This provides the advantage that the scheduler thread can be controlled by the worker threads on a time-step by time-step basis. This enables synchronization of the scheduler thread with the worker threads on each phase transition rather than on each transferred data element, which greatly reduces the number of synchronizations.

In an eleventh possible implementation form of the method according to the tenth implementation form of the first aspect, the method comprises: first setting the fourth synchronization point and then releasing the third synchronization point.

This provides the advantage that the scheduler thread can be stopped from running a later process before the results of an earlier process are available. Thread safety is further improved.

In a twelfth possible implementation form of the method according to any of the sixth to the eleventh implementation forms of the first aspect, setting the first, second, third and fourth synchronization points comprises setting a mutex, spinlock, semaphore, volatile variable, atomic variable or any other software synchronization technique between threads.

This provides the advantage that elements available in common real-time operating systems can be applied for this synchronization technique in order to provide an efficient implementation. Any current or future technique for synchronization is supported.
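As a minimal sketch of how such a synchronization point might be realized with two of the listed primitives, a blocking mutex and an atomic variable used as a spinlock (type and member names are assumptions made for illustration):

```cpp
#include <atomic>
#include <mutex>

// Two interchangeable realizations of a synchronization point: the owning
// thread sets and releases the point; the other thread waits until it opens.
struct MutexPoint {                          // blocking realization
    std::mutex m;
    void set()     { m.lock(); }             // close the point
    void release() { m.unlock(); }           // open the point
    void wait()    { m.lock(); m.unlock(); } // pass once the point is open
};

struct SpinPoint {                           // busy-waiting realization
    std::atomic<bool> closed{false};
    void set()     { closed.store(true, std::memory_order_release); }
    void release() { closed.store(false, std::memory_order_release); }
    void wait()    { while (closed.load(std::memory_order_acquire)) { /* spin */ } }
};
```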

This may optionally be performed using release consistency and/or any other technique of synchronization between threads that is known to someone familiar with the field of multi-threaded software.

According to a second aspect, the invention relates to a computer program with a program code for performance evaluation of an electronic hardware design according to the method of the first aspect as such or any of the preceding implementation forms of the first aspect, when the computer program runs on a computer simulation server. This provides the advantage that the method according to the invention can easily be implemented as a computer program and executed on a computer simulation server.

According to a third aspect, the invention relates to a computer simulation server comprising a plurality of processing cores for performance evaluation of an electronic hardware design, the computer simulation server comprising: a plurality of worker threads running in parallel on the plurality of processing cores, wherein the plurality of worker threads is configured to execute a respective partition of a computer simulation of the electronic hardware design on the plurality of processing cores; and a scheduling element configured to control a progress of the plurality of worker threads, wherein an execution of the plurality of worker threads is mutually locked with an execution of the scheduling element.

This provides the advantage that execution on the computer simulation server can be efficiently synchronized by worker threads and a scheduling element which can mutually lock each other. Hence, defined stages exist in the execution of the computer simulation on the computer simulation server, and the solution can be used for implementing large simulation models on the computer simulation server. The scheduling element controls the progress of the threads. The operating system of the computer simulation server controls the assignment of worker threads to processing cores.

The proposed method and the proposed simulation server improve simulation speed through good parallel scaling with the model size, the number of simulation server cores and the number of threads. A highly efficient parallel scaling with a time-division-based thread synchronization mechanism is obtained.

BRIEF DESCRIPTION OF THE DRAWINGS

Further embodiments of the invention will be described with respect to the following figures, in which:

Fig. 1 shows a schematic diagram illustrating a method 100 for performance evaluation of an electronic hardware design on a computer simulation server according to an implementation form;

Fig. 2a shows a schematic diagram illustrating synchronization of worker threads in a work phase 200a according to an implementation form;

Fig. 2b shows a schematic diagram illustrating synchronization of worker threads in a transfer phase 200b according to an implementation form;

Fig. 3 shows a schematic diagram 300 illustrating two exemplary partitions 301, 302 of a computer simulation coupled by a communication channel 303 according to an implementation form;

Fig. 4a shows a schematic diagram illustrating synchronization of worker threads 411, 412 and scheduler thread 414 during a transfer phase 400a;

Fig. 4b shows a schematic diagram illustrating synchronization of worker threads 411, 412 and scheduler thread 414 during a work phase 400b;

Fig. 5 shows a schematic diagram illustrating synchronization of a worker thread 511 and scheduler thread 514 using synchronization points;

Fig. 6 shows a schematic diagram illustrating an exemplary scheduler algorithm 600 according to an implementation form;

Fig. 7 shows a schematic diagram illustrating an exemplary worker thread algorithm 700 according to an implementation form;

Fig. 8 shows a sequence diagram 800 illustrating an exemplary synchronization process between worker threads and scheduler thread according to an implementation form; and

Fig. 9 shows a block diagram illustrating an exemplary computer simulation server 900 for performance evaluation of an electronic hardware design according to an implementation form.

DETAILED DESCRIPTION OF EMBODIMENTS

In the following detailed description, reference is made to the accompanying drawings, which form a part thereof, and in which is shown by way of illustration specific aspects in which the disclosure may be practiced. It is understood that other aspects may be utilized and structural or logical changes may be made without departing from the scope of the present disclosure. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.

It is understood that comments made in connection with a described method may also hold true for a corresponding device or system configured to perform the method, and vice versa. For example, if a specific method step is described, a corresponding device may include a unit to perform the described method step, even if such a unit is not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.

Fig. 1 shows a schematic diagram illustrating a method 100 for performance evaluation of an electronic hardware design on a computer simulation server comprising a plurality of processing cores according to an implementation form. The method 100 includes partitioning 101 a computer simulation 110 of the electronic hardware design among a plurality of worker threads 111, 112, 113 for parallel execution on the plurality of processing cores of the computer simulation server.

The method 100 includes providing 102 a scheduling element 114 for controlling a progress of the plurality of worker threads 111, 112, 113.

The method 100 further includes running 103 the scheduling element 114 and the plurality of worker threads 111, 112, 113 on the plurality of processing cores to evaluate a performance of the electronic hardware design on the computer simulation server, wherein an execution of the plurality of worker threads 111, 112, 113 is mutually locked 121, 122 with an execution of the scheduling element 114 to ensure synchronized execution.

The scheduling element 114 may include a scheduler thread that is different from the plurality of worker threads 111, 112, 113. The plurality of worker threads 111, 112, 113 and the scheduler thread 114 may interlock execution of each other such that execution of a worker thread 111, 112, 113 locks execution of the scheduler thread 114 and execution of the scheduler thread 114 locks execution of the worker thread 111, 112, 113. A processing cycle of the computer simulation server may be split into a work phase for performing computation within the worker threads 111, 112, 113 and a transfer phase for performing data transfer between the worker threads 111, 112, 113.

The method may include: assigning one or more partitions p1, p2, p3 of the computer simulation 110 to a respective worker thread 111, 112, 113; and serially processing the assigned partitions p1, p2, p3 by the respective worker thread 111, 112, 113. The method 100 may further include: assigning one or more communication channels 303 for communication between the partitions 301, 302 of the computer simulation 110 to a respective worker thread 111, 112, 113; and serially processing the assigned communication channels 303 by the respective worker thread 111, 112, 113.

The method 100 may further include: synchronizing a worker thread 111, 112, 113, e.g. a worker thread 511 as shown below with respect to Fig. 5, by the scheduling thread 114, e.g. a scheduler thread 514 as shown below with respect to Fig. 5, by setting a first synchronization point, e.g. a first synchronization point 521 as shown below with respect to Fig. 5, for locking the worker thread 511 from transitioning to the transfer phase, e.g. a transfer phase 502 as shown below with respect to Fig. 5, and by setting a second synchronization point, e.g. a second synchronization point 522 as shown below with respect to Fig. 5, for locking the worker thread 511 from transitioning to the work phase, e.g. the work phase 501 as shown below with respect to Fig. 5. The second synchronization point 522 may be later in time than the first synchronization point 521. The method 100 may further include: first setting the second synchronization point 522 and then releasing the first synchronization point 521. The method 100 may further include: synchronizing the scheduling thread 514 by a worker thread 511 by setting a third synchronization point, e.g. a third synchronization point 523 as shown below with respect to Fig. 5, for locking the scheduling thread 514 from transitioning to a second phase, e.g. a second phase 504 as shown below with respect to Fig. 5, and by setting a fourth synchronization point, e.g. a fourth synchronization point 524 as shown below with respect to Fig. 5, for locking the scheduling thread 514 from transitioning to a first phase, e.g. a first phase 503 as shown below with respect to Fig. 5. The fourth synchronization point 524 may be later in time than the third synchronization point 523. The method 100 may include first setting the fourth synchronization point 524 and then releasing the third synchronization point 523. Setting the first 521, second 522, third 523 and fourth 524 synchronization points may include setting a mutex, spinlock, semaphore, volatile variable, atomic variable or any other software synchronization technique between threads.

Fig. 2a shows a schematic diagram illustrating synchronization of worker threads in a work phase 200a according to an implementation form. Fig. 2b shows a schematic diagram illustrating synchronization of worker threads in a transfer phase 200b according to an implementation form. There are multiple worker threads 210, 220, 230, each of which has a work group and a transfer group: worker thread 210 has work group A and transfer group A; worker thread 220 has work group B and transfer group B; worker thread 230 has work group C and transfer group C. In the work phase 200a the worker threads 210, 220, 230 are locked 215, 225, 235 by the scheduler element in their work groups A, B, C for processing data. Worker thread 210 processes input in0, 211 and generates outputs out0, 212 and out1, 213. Worker thread 220 processes input in1, 221 and generates output out2, 222. Worker thread 230 processes input in2, 231 and generates output out3, 232.

In the transfer phase 200b the worker threads 210, 220, 230 are locked 216, 226, 236 by the scheduler element in their work groups A, B, C for transferring data to the next worker threads 210, 220, 230. Worker thread 210 transfers data from its output out0, 212 to the input in1, 221 of worker thread 220 and from its output out1, 213 to the input in2, 231 of worker thread 230. Worker thread 220 transfers data from its output out2, 222 to an input of a next (not shown) worker thread. Worker thread 230 transfers data from its output out3, 232 to an input of a next (not shown) worker thread.

The simulation speed is increased by parallelizing the task among concurrent worker threads 210, 220, 230. The simulation also faces the constraint that simulation data has to be synchronized every simulated cycle. The scheduler (not shown) runs on at least one separate thread, and the worker threads and the scheduler thread inter-lock each other's progress. The worker thread simulation task is split into a work phase 200a and a transfer phase 200b, where the computation is done during the work phase 200a and the data transfer is done during the transfer phase 200b. The computation in the work phase 200a is parallelized among the worker threads 210, 220, 230 by assigning one or more partitions to each worker thread. Each worker thread processes its partitions serially. The data transfer in the transfer phase 200b is parallelized among the worker threads 210, 220, 230 by assigning one or more communication channels to each worker thread. A communication channel is a pair of a driver output port connected to a receiver input port. Each worker thread processes its channels serially. Each worker thread 210, 220, 230 is synchronized by the scheduler using two synchronization elements (such as a mutex, spinlock, conditional variable, etc.): one locks the thread from transitioning to the work phase 200a and the other locks it from transitioning to the transfer phase 200b. Each worker thread 210, 220, 230 synchronizes the scheduler using two synchronization elements: one locks the scheduler from transitioning to the first phase and the other locks it from transitioning to the second phase, e.g. as described below with respect to Fig. 4. In each of the above synchronization methods, the later (far) synchronization point is locked before the earlier (near) synchronization element is unlocked.
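A minimal sketch of this far-before-near ordering at one phase transition, assuming mutex-based synchronization points (identifiers are illustrative):

```cpp
#include <mutex>

// Scheduler-side transition of a worker from the work phase to the transfer
// phase: the far point (barring the next work phase) is locked before the
// near point (barring the transfer phase) is released, so the worker cannot
// run ahead into the next work phase.
void releaseToTransfer(std::mutex& barWork, std::mutex& barTransfer) {
    barWork.lock();       // set the far synchronization point
    barTransfer.unlock(); // release the near synchronization point
}
```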

Fig. 3 shows a schematic diagram 300 illustrating two exemplary partitions 301, 302 of a computer simulation coupled by a communication channel 303 according to an implementation form. As described above with respect to Fig. 2, the data transfer in the transfer phase 200b is parallelized among the worker threads 210, 220, 230 by assigning one or more communication channels to each worker thread. The communication channel 303 is a pair of a driver output port 304 connected to a receiver input port 305. For example, partition i 301 is assigned to worker thread 210 and partition j 302 is assigned to worker thread 220. The communication channel 303 then corresponds to the channel between the output port out0, 212 of worker thread 210 and the input port in1, 221 of worker thread 220 as shown in Fig. 2. In the transfer phase 200b the worker threads 210, 220 are locked 306 by the scheduler element for transferring data via the communication channel 303 from partition i 301 to partition j 302.
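A hypothetical data-structure sketch of such a channel (names assumed for illustration): the driver fills its output port during the work phase, and the worker thread that owns the channel copies the payload in the transfer phase, so the payload itself needs no locking.

```cpp
#include <vector>

struct Port { std::vector<int> data; };  // payload of one simulated cycle

struct Channel {
    Port* driverOut;   // output port of the driving partition
    Port* receiverIn;  // input port of the receiving partition
    // Executed only during the transfer phase, serially per worker thread.
    void transfer() { receiverIn->data = driverOut->data; }
};
```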

Fig. 4a shows a schematic diagram illustrating synchronization of worker threads 411, 412 and scheduler thread 414 during a transfer phase 400a.

In the transfer phase 400a the worker threads 410, 420 are locked for working 401, i.e. locked by the scheduler element 414 for processing data, and unlocked for transferring 402 data. The scheduler thread 414 sets a first synchronization point in the worker threads 410, 420 to zero, as can be seen in the upper part of the diagram of Fig. 4a, and a second synchronization point in the worker threads 410, 420 to one, as can be seen in the lower part of the diagram of Fig. 4a. In the transfer phase 400a the scheduler thread 414 is locked in phase0, 403 and unlocked for the transition to phase1, 404 by the worker threads 410, 420. The worker threads 410, 420 set a third synchronization point in the scheduler thread 414 to one, as can be seen in the upper part of the diagram of Fig. 4a, and a fourth synchronization point in the scheduler thread 414 to zero, as can be seen in the lower part of the diagram of Fig. 4a.

Fig. 4b shows a schematic diagram illustrating synchronization of worker threads 411, 412 and scheduler thread 414 during a work phase 400b. The situation in the work phase 400b is complementary to the situation in the transfer phase 400a shown in Fig. 4a. In the work phase 400b the worker threads 410, 420 are unlocked for working 401, i.e. unlocked by the scheduler element 414 for processing data, and locked for transferring 402 data. The scheduler thread 414 sets a first synchronization point in the worker threads 410, 420 to one, as can be seen in the upper part of the diagram of Fig. 4b, and a second synchronization point in the worker threads 410, 420 to zero, as can be seen in the lower part of the diagram of Fig. 4b. In the work phase 400b the scheduler thread 414 is locked in phase1, 404 and unlocked for the transition to phase0, 403 by the worker threads 410, 420. The worker threads 410, 420 set a third synchronization point in the scheduler thread 414 to zero, as can be seen in the upper part of the diagram of Fig. 4b, and a fourth synchronization point in the scheduler thread 414 to one, as can be seen in the lower part of the diagram of Fig. 4b.

The example of Figures 4a and 4b was implemented successfully in a working simulator and was able to significantly accelerate runtime with respect to a single thread.

Fig. 5 shows a schematic diagram illustrating synchronization of a worker thread 511 and scheduler thread 514 using synchronization points. Fig. 5 is related to Fig. 4; it gives an overview of the different worker threads, the scheduler thread, the phases and the synchronization points.

A worker thread 511, e.g. a worker thread 411, 412 as described above with respect to Fig. 4, can be synchronized by the scheduler thread 514, e.g. a scheduler thread 414 as described above with respect to Fig. 4, by setting a first synchronization point 521 for locking the worker thread 511 from transitioning to the transfer phase 502 and by setting a second synchronization point 522 for locking the worker thread 511 from transitioning to the work phase 501. The second synchronization point 522 may be later in time than the first synchronization point 521. The second synchronization point 522 may be set before releasing the first synchronization point 521. The scheduler thread 514 may be synchronized by a worker thread 511 by setting a third synchronization point 523 for locking the scheduling thread 514 from transitioning to the second phase 504 and by setting a fourth synchronization point 524 for locking the scheduling thread 514 from transitioning to the first phase 503. The fourth synchronization point 524 may be later in time than the third synchronization point 523. The fourth synchronization point 524 may be set before releasing the third synchronization point 523. Setting the first 521, second 522, third 523 and fourth 524 synchronization points may be implemented by setting a mutex, spinlock, semaphore, volatile variable, atomic variable or any other software synchronization technique between threads.

Fig. 6 shows a schematic diagram illustrating an exemplary scheduler algorithm 600 according to an implementation form, and Fig. 7 shows a schematic diagram illustrating an exemplary worker thread algorithm 700 according to an implementation form.

The following definitions apply to Figures 6 and 7. The expression "mutex" is representative of all synchronization methods, e.g. semaphore, futex, spinlock, etc. There are N > 1 worker threads (thread). A scheduler is running on a separate thread. Each worker thread adds 4 mutexes. Each mutex is common to exactly two threads: the scheduler thread and a worker thread. On a specific mutex of a specific thread, the following operations are applied:

o lock(mutexName, thread)

o unlock(mutexName, thread)

o wait(mutexName, thread)

On a specific mutex of all threads, the following operations are applied:

o lockAll(mutexName): lock(mutexName, thread) for all threads

o unlockAll(mutexName): unlock(mutexName, thread) for all threads

o waitAll(mutexName): wait(mutexName, thread) for all threads

The two (i.e. scheduler and worker) threads have the following roles: One is responsible for lock and unlock while the other one waits.

By the scheduler algorithm 600, the functionality of the scheduler thread 414, 514 as described above with respect to Figures 4a, 4b and 5 can be implemented. By the worker thread algorithm 700, the functionality of the worker threads 411, 412, 511 as described above with respect to Figures 4a, 4b and 5 can be implemented.
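Since the algorithm figures themselves are not reproduced here, the following C++ sketch is a hedged reconstruction of the described handshake, not the patented code: std::mutex stands in for the representative "mutex", waiting is the lock-immediately-unlock pattern defined above, and a C++20 std::latch is an added assumption covering the initialization that Figures 6 and 7 handle but Fig. 8 omits. All identifiers are illustrative.

```cpp
#include <cstdio>
#include <latch>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kWorkers = 2;
constexpr int kCycles  = 3;

std::mutex barWork[kWorkers];     // scheduler-owned: bars a worker's work phase
std::mutex barTransfer[kWorkers]; // scheduler-owned: bars a worker's transfer phase
std::mutex barPhase0[kWorkers];   // worker-owned: bars the scheduler's phase 0
std::mutex barPhase1[kWorkers];   // worker-owned: bars the scheduler's phase 1
std::latch ready(kWorkers);       // start-up rendezvous (initialization aid)

void waitOn(std::mutex& m) { m.lock(); m.unlock(); } // pass once the point opens

void worker(int i) {
    barPhase0[i].lock();                        // work not finished yet
    ready.count_down();
    for (int c = 0; c < kCycles; ++c) {
        waitOn(barWork[i]);                     // blocked while work is barred
        std::printf("worker %d cycle %d: compute partitions\n", i, c);
        barPhase1[i].lock();                    // far point first ...
        barPhase0[i].unlock();                  // ... then near point: work done
        waitOn(barTransfer[i]);                 // blocked while transfer is barred
        std::printf("worker %d cycle %d: transfer channels\n", i, c);
        barPhase0[i].lock();                    // far point first ...
        barPhase1[i].unlock();                  // ... then near point: transfer done
    }
    barPhase0[i].unlock();
}

void scheduler() {
    ready.wait();                               // all workers initialized
    for (int c = 0; c < kCycles; ++c) {
        for (auto& m : barPhase0) waitOn(m);    // phase 1: all work completed
        for (auto& m : barWork) m.lock();       // far: bar the next work phase
        for (auto& m : barTransfer) m.unlock(); // near: open the transfer phase
        for (auto& m : barPhase1) waitOn(m);    // phase 0: all transfer completed
        for (auto& m : barTransfer) m.lock();   // far: bar the next transfer phase
        for (auto& m : barWork) m.unlock();     // near: open the next work phase
    }
}

int main() {
    for (auto& m : barTransfer) m.lock();       // initial state: work open, transfer barred
    std::vector<std::thread> workers;
    for (int i = 0; i < kWorkers; ++i) workers.emplace_back(worker, i);
    scheduler();                                // scheduler runs on the main thread
    for (auto& t : workers) t.join();
    for (auto& m : barTransfer) m.unlock();     // leave no mutex locked at destruction
}
```

Each mutex is owned (locked and unlocked) by exactly one side, the scheduler or a worker, while the other side only waits on it, matching the role split described above; the far point is always locked before the near point is released.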

Fig. 8 shows a sequence diagram 800 illustrating an exemplary synchronization process between worker threads 811, 812 and scheduler thread 814 according to an implementation form. Fig. 8 presents a "ladder diagram" demonstrating the flow of the procedures tick() and the loop iteration task() as shown above with respect to Figures 6 and 7. The diagram of Fig. 8 is demonstrated with 2 worker threads 811, 812; the extension to more worker threads is straightforward. The diagram of Fig. 8 does not cover initialization and termination, which are covered in the above algorithms of Figures 6 and 7.

At the beginning of a cycle 820, the worker threads 811, 812 are in the work phase 801 and the scheduler thread 814 is in phase1, 804. Both worker threads 811, 812 are processing their input data to generate output data. The scheduler thread 814 monitors the progress of the worker threads 811, 812, and once processing of the data is completed, the scheduler thread 814 locks the worker threads 811, 812 in the transfer phase 802 and then unlocks the work phase 801. Further, the worker threads 811, 812 lock the scheduler thread 814 to phase0, 803 and unlock it from phase1, 804. Now the worker threads 811, 812 are in the transfer phase 802 and the scheduler thread 814 is in phase0, 803. Both worker threads 811, 812 are transferring their output data to the next worker threads. The scheduler thread 814 monitors the progress of the worker threads 811, 812, and once the transfer of the data is completed, the scheduler thread 814 locks the worker threads 811, 812 in the work phase 801 and then unlocks the transfer phase 802. Further, the worker threads 811, 812 lock the scheduler thread 814 to phase1, 804 and unlock it from phase0, 803.

The same procedure as described above is repeated for the next cycle 820.

Fig. 9 shows a block diagram illustrating an exemplary computer simulation server 900 for performance evaluation of an electronic hardware design according to an implementation form.

The computer simulation server 900 includes a plurality of processing cores 910, 920, 930, 940 for performance evaluation of an electronic hardware design. Each core 910, 920, 930, 940 is assigned a CPU memory 950 that is shared by all processing cores 910, 920, 930, 940. All software threads belong to a process that can access the entire memory space of the process. A bus interface 960 provides external access to the CPU memory 950. A plurality of worker threads 111, 112, 113, e.g. as described above with respect to Fig. 1, run in parallel on the processing cores 910, 920, 930. The worker threads 111, 112, 113 are configured to execute a respective partition p1, p2, p3 of a computer simulation 110 of the electronic hardware design, e.g. as described above with respect to Fig. 1, on the processing cores 910, 920, 930. A scheduling element 114 runs on processing core 940. The scheduling element 114 controls the progress of the worker threads 111, 112, 113. An execution of the plurality of worker threads 111, 112, 113 is mutually locked with an execution of the scheduling element 114 to ensure synchronized execution, e.g. as described above with respect to Figures 1 to 8.

The present disclosure also supports a computer program product including computer executable code or computer executable instructions that, when executed, cause at least one computer to execute the performing and computing steps described herein, in particular the steps of the method 100 described above with respect to Fig. 1 and the techniques described above with respect to Figures 2 to 8. Such a computer program product may include a readable non-transitory storage medium storing program code thereon for use by a computer. The program code may perform the steps described herein, in particular the method 100 described above. A computer program may include program code for performing the method 100 as described above with respect to Fig. 1 when executed on a computer.

While a particular feature or aspect of the disclosure may have been disclosed with respect to only one of several implementations, such feature or aspect may be combined with one or more other features or aspects of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "include", "have", "with", or other variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprise". Also, the terms "exemplary", "for example" and "e.g." are merely meant as an example, rather than the best or optimal. The terms "coupled" and "connected", along with derivatives may have been used. It should be understood that these terms may have been used to indicate that two elements cooperate or interact with each other regardless whether they are in direct physical or electrical contact, or they are not in direct contact with each other. Although specific aspects have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific aspects shown and described without departing from the scope of the present disclosure. This application is intended to cover any adaptations or variations of the specific aspects discussed herein.

Although the elements in the following claims are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.

Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art readily recognize that there are numerous applications of the invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art recognize that many changes may be made thereto without departing from the scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.