
Title:
TECHNIQUE FOR PROCESSING ASSEMBLY INSTRUCTIONS COMPRISED BY A PROGRAM
Document Type and Number:
WIPO Patent Application WO/2016/096053
Kind Code:
A1
Abstract:
A technique for processing assembly instructions comprised by a program is presented. A method implementation of the technique comprises receiving the assembly instructions, analyzing data dependencies between the assembly instructions (S4), and arranging the assembly instructions in a data structure that associates the assembly instructions by links representing their data dependencies (S6). The method further comprises assigning, to the links, a respective value indicative of a duration of a memory operation of an assembly instruction associated with the respective link (S8), re-arranging the assembly instructions based on the data structure and the values assigned to the links (S14, S16), and outputting the re-arranged assembly instructions (S18). Further, a computer program product and a computing device (60) are presented.

Inventors:
KELEMEN ZOLTÁN (HU)
DÉVAI GERGELY (HU)
LESKO DANIEL (HU)
NÉMETH BOLDIZSÁR (HU)
PINTER ADAM (HU)
Application Number:
PCT/EP2014/078864
Publication Date:
June 23, 2016
Filing Date:
December 19, 2014
Assignee:
ERICSSON TELEFON AB L M (SE)
International Classes:
G06F9/45
Other References:
ERVEN ROHOU ET AL: "SALTO: System for Assembly-Language Transformation and Optimization", INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE. RAPPORTS DE RECHERCHE, INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE (INRIA), FR, no. 2980, September 1996 (1996-09-01), pages 1 - 27, XP002411119, ISSN: 0249-6399
PHILIP H SWEANY ET AL: "Instruction Scheduling Using Simulated Annealing", PROCEEDINGS OF THE 3RD INTERNATIONAL CONFERENCE ON MASSIVELY PARALLEL COMPUTING SYSTEMS (MPCS '98), April 1998 (1998-04-01), pages 1 - 7, XP055192964, Retrieved from the Internet [retrieved on 20150602]
MARK HEFFERNAN ET AL: "Data-Dependency Graph Transformations for Instruction Scheduling", JOURNAL OF SCHEDULING, KLUWER ACADEMIC PUBLISHERS, BO, vol. 8, no. 5, October 2005 (2005-10-01), pages 427 - 451, XP019212497, ISSN: 1099-1425, DOI: 10.1007/S10951-005-2862-8
GHASSAN SHOBAKI ET AL: "Optimal trace scheduling using enumeration", ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, vol. 5, no. 4, March 2009 (2009-03-01), pages 1 - 32, XP058033748, ISSN: 1544-3566, DOI: 10.1145/1498690.1498694
ANJALI MAHAJAN ET AL: "Instruction Scheduling using Evolutionary Programming", PROCEEDINGS OF THE WSEAS INTERNATIONAL CONFERENCE ON APPLIED COMPUTING CONFERENCE (ACC'08), 27 May 1998 (1998-05-27), pages 137 - 144, XP055192968, Retrieved from the Internet [retrieved on 20150602]
Attorney, Agent or Firm:
RÖTHINGER, Rainer (Schweigerstr. 2, Muenchen, DE)
Claims:
Claims

1. A method for processing assembly instructions comprised by a program, the method comprising:

receiving the assembly instructions;

analyzing data dependencies between the assembly instructions (S4);

arranging the assembly instructions in a data structure that associates the assembly instructions by links representing their data dependencies (S6);

assigning, to the links, a respective value indicative of a duration of a memory operation of an assembly instruction associated with the respective link (S8);

re-arranging the assembly instructions based on the data structure and the values assigned to the links (S14, S16); and

outputting the re-arranged assembly instructions (S18).

2. The method according to claim 1, wherein

the step of re-arranging the assembly instructions (S14, S16) comprises rearranging the assembly instructions in an order that reduces a total processor stall time associated with the memory operations.

3. The method according to claim 1 or 2, wherein

the step of re-arranging the assembly instructions (S14, S16) comprises applying a search algorithm on the data structure comprising the values assigned to the links (S14).

4. The method according to claim 3, wherein

the search algorithm is an informed graph search algorithm.

5. The method according to any one of claims 1 to 4, wherein

analyzing data dependencies between the assembly instructions (S4) is carried out such that a data dependency between a first and a second assembly instruction is identified if and only if the first assembly instruction instructs to write a data location that the second assembly instruction instructs to read or to write.

6. The method according to any one of claims 1 to 5, further comprising:

after the step of arranging the assembly instructions in a data structure (S6), transforming the data structure to a transformed data structure by rearranging the assembly instructions (S10), wherein the step of re-arranging (S14, S16) is based on the transformed data structure.

7. The method according to any one of claims 1 to 6, wherein

the data structure is representative of a graph with nodes and edges connecting the nodes, wherein the links are the edges of the graph and the assembly instructions are the nodes of the graph.

8. The method according to claim 7, wherein

arranging the assembly instructions in a data structure (S6) is carried out such that two of the assembly instructions are connected by an edge if there is a data dependency between the two assembly instructions.

9. The method according to any one of claims 1 to 8, further comprising:

after the step of assigning a value indicative of a duration of a memory operation (S8),

calculating an upper bound for a total processor stall time associated with the memory operations (S12).

10. The method according to claim 9 in combination with claim 3, wherein

a search space used by the search algorithm is reduced by using the upper bound as an upper limit for the total processor stall time.

11. The method according to claim 10, wherein

the search algorithm searches a minimum value of the total processor stall time; and

the step of re-arranging the assembly instructions (S14, S16) comprises rearranging the assembly instructions in an optimal order having the minimum value of the total processor stall time when executed.

12. The method according to any one of claims 9 to 11, wherein

the upper bound for the value corresponds to the sum of all values assigned to the links of the data structure.

13. The method according to any one of claims 1 to 12, further comprising:

before the step of receiving the assembly instructions,

parsing a source code of the program and thereby obtaining the assembly instructions (S2).

14. The method according to any one of claims 1 to 13, wherein

the respective duration of a memory operation of an assembly instruction depends on a memory being accessed by the assembly instruction.


15. A computer program product comprising program code portions for performing the steps of any of the preceding claims when the computer program product is executed on a computing device.

16. The computer program product of claim 15, stored on a computer-readable recording medium.

17. A computing device (60) configured to process assembly instructions comprised by a program, the computing device (60) comprising:

a first interface (62) configured to receive assembly instructions;

at least one processor (64) configured to

analyze data dependencies between the assembly instructions;

arrange the assembly instructions in a data structure that associates the assembly instructions by links representing their data dependencies;

assign, to the links, a respective value indicative of a duration of a memory operation of an assembly instruction associated with the respective link; and

re-arrange the assembly instructions based on the data structure and the values assigned to the links; and

a second interface (66) configured to output the re-arranged assembly instructions.

18. The computing device of claim 17, further configured to perform the steps of any of claims 2 to 14.


Description:
Technique for processing assembly instructions comprised by a program

Technical Field

The present disclosure generally relates to processing of assembly instructions comprised by a program. The technique may be embodied in one or more of methods, computer program products, and computing devices.

Background

In the field of computer programming, and in particular in the field of programming of assembly code, it is desirable to obtain a computer program working as cost-efficiently as possible. The term "cost" as used in connection with a computer program, i.e., with a certain arrangement of (assembly) instructions, refers to the processing time or other processing resources required for executing the program.

In several architectures, for example in PPA (Packet Processing Architecture), a processor can be blocked on memory access for a certain duration of a memory operation. Certain instructions can only be executed after such a blocking time, for example because they access the same memory location as a previous instruction. This blocking, in turn, may lead to a processor stall time during which the processor typically only carries out NOP cycles ("No Operation"). The sum of these stall times occurring upon execution of a computer program is regarded as the total stall time of this particular computer program.

From the point of view of cost effectiveness, it is desirable to minimize the total stall time of a computer program. This may be achieved by a clever arrangement of the individual instructions, such that stall times are reduced.

In the prior art, prefetching is a commonly used method to reduce processor stall times caused by memory accesses, while not changing the semantics of the program. When a load instruction is prefetched, it is moved to the first position where it can logically be executed. This re-arrangement of instructions is currently done manually by a programmer. To manually prefetch instructions, the programmer needs to study the data and control flow of the program. This approach is costly in man-hours and error-prone. Therefore, prefetching is usually done only in the most obvious cases.

Summary

There is a need for a technique for processing assembly instructions comprised by a program, wherein the technique avoids one or more of the drawbacks discussed above or other problems.

According to a first aspect, a method for processing assembly instructions comprised by a program is presented, wherein the method comprises receiving the assembly instructions, analyzing data dependencies between the assembly instructions, and arranging the assembly instructions in a data structure that associates the assembly instructions by links representing their data dependencies. The method further comprises assigning, to the links, a respective value indicative of a duration of a memory operation of an assembly instruction associated with the respective link, rearranging the assembly instructions based on the data structure and the values assigned to the links, and outputting the re-arranged assembly instructions.

The method may entirely be performed by a computing device. As an example, the method may be performed under control of a program executed by one or more processors of the computing device. The assembly instructions may be instructions in any particular assembly language that may be specific to a particular computer architecture. In other words, an assembly instruction may be a low level program unit that encodes one simple activity that a processing unit (also called processor or CPU herein) may perform. In the following, if not indicated otherwise, the term "instruction" refers to an assembly instruction.

The step of re-arranging the assembly instructions may comprise re-arranging the assembly instructions in an order that reduces a total processor stall time associated with the memory operations. A total stall time of the re-arranged assembly instructions when executed in the re-arranged order may thus be shorter than a total stall time of the assembly instructions when executed in their initial order. Hence, a total time for executing a program comprising the re-arranged assembly instructions may be reduced compared to a total time for executing a program comprising the assembly instructions in a different order (e.g., their initial order). In one variant, a total stall time of the assembly instructions is optimized (e.g., minimized).

The values assigned to the links may be comprised by the data structure. The step of re-arranging the assembly instructions may then comprise applying a search algorithm on the data structure comprising the values assigned to the links. The search algorithm may be applied so as to search for an order of the assembly instructions that reduces (e.g., optimizes) the total processor stall time. The re-arranged assembly instructions may thus be arranged in an optimal order. The re-arranged assembly instructions may exhibit a reduced total processor stall time compared to their initial order.

The search algorithm may be an informed search algorithm and in particular an informed graph search algorithm (e.g., a limited A* search algorithm, a simulated annealing search algorithm, etc.). The informed search algorithm may exploit special properties of the search space. Such properties may be known a priori (i.e., prior to receiving the assembly instructions that are to be re-arranged). The search algorithm may be executed by a computing device (e.g., by a Personal Computer, PC).

A data dependency may be a relation between a first assembly instruction and a second assembly instruction which means that they should be executed in the same order or the behavior of the program may change. Thus, if the order of a first assembly instruction and a second assembly instruction with respect to each other is changed, an outcome of the program may change, or the program may crash.

Analyzing data dependencies between the assembly instructions may be carried out such that a data dependency between a first and a second assembly instruction is identified if and only if the first assembly instruction instructs to write a data location that the second assembly instruction instructs to read or to write. Thus, data dependencies may be identified based on the so-called Bernstein condition. A data location may be any hardware component (of a system executing the assembly instructions) that is capable of storing information.

The method may further comprise, after the step of arranging the assembly instructions in a data structure, transforming the data structure to a transformed data structure by re-arranging the assembly instructions, wherein the step of re-arranging is based on the transformed data structure. Transforming the assembly instructions may comprise processing (e.g., optimizing) the data structure for the step of re-arranging the assembly instructions based on the data structure comprising the values assigned to the links. In case a particular search algorithm is used in the step of re-arranging, the step of transforming may comprise adapting the data structure to the search algorithm. The transforming may be carried out such that a complexity of finding an optimal order of the assembly instructions is reduced.

The data structure may be representative of a graph with nodes and edges connecting the nodes. Thus, the data structure may take the form of a data configuration typically used to store a graph. The links (which may be representative of the data dependencies between the assembly instructions) are the edges of the graph and the assembly instructions are the nodes of the graph. All edges may be annotated with a number (e.g., a "cost" value) that represents the number of stall cycles that is the result of executing the two linked assembly instructions in sequence. This number may thus indicate the duration of the memory operation associated with the linked assembly instructions.

The assembly instructions may be arranged such that each assembly instruction is represented by one node. The resulting graph may be a directed graph. Further, the resulting graph may be an acyclic graph. For example, the resulting graph may be a DAG (Directed Acyclic Graph).
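Purely for illustration, and without limiting the disclosure to any programming language, one possible in-memory representation of such an annotated dependency graph is sketched below in Python; the class and field names are assumptions made only for readability.

    from dataclasses import dataclass, field
    from typing import Dict, Tuple

    @dataclass
    class DependencyGraph:
        # One node per assembly instruction, keyed by its position in the
        # initial (received) order; the value is the instruction text.
        nodes: Dict[int, str] = field(default_factory=dict)
        # Directed edges (predecessor, successor) annotated with the number of
        # stall cycles caused by executing the successor right after the predecessor.
        edges: Dict[Tuple[int, int], int] = field(default_factory=dict)

        def add_instruction(self, index: int, text: str) -> None:
            self.nodes[index] = text

        def add_dependency(self, pred: int, succ: int, stall_cycles: int = 0) -> None:
            self.edges[(pred, succ)] = stall_cycles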

Arranging the assembly instructions in a data structure may be carried out such that two of the assembly instructions are connected by an edge if there is at least one of a data dependency and a control dependency between the two assembly instructions. Each dependency may be represented by an edge connecting the two nodes representative of the two assembly instructions between which the dependency exists. There may be an edge from a first node to a second node if the instruction of the first node precedes the instruction of the second node and if there is a data or control dependency between the instructions of the first and second nodes.

The method may further comprise, after the step of assigning a value indicative of a duration of a memory operation, calculating an upper bound for a total processor stall time associated with the memory operations. The upper bound may be calculated based on the (e.g., measured or estimated) values indicative of a duration of a memory operation assigned to the plurality of links. A search space used by the search algorithm may be reduced by using the upper bound as an upper limit for the total processor stall time. For example, if the search algorithm considers an order of assembly instructions in which not all assembly instructions are used yet, but the total stall time of this ordering is already higher than the upper limit, this ordering may be discarded. Subsequently, another order may be considered as optimal order.

The search algorithm may search for a minimum value of the total processor stall time, and the step of re-arranging the assembly instructions may comprise re-arranging the assembly instructions in an optimal order having the minimum value of the total processor stall time when executed. In the case that the search algorithm searches for a minimum value of the total processor stall time, the search may be continued until an optimal ordering having a minimum total processor stall time has been found. The search may be continued until the optimal ordering has been found and it is proven that no other ordering can have a lower total processor stall time. For example, the search algorithm may consider every possible order of assembly instructions (in some embodiments limited by the upper bound) and choose the order as optimal order for which an (e.g., estimated) processor stall time has a minimum value. The other possible orders may be discarded or output together with the optimal order.

The upper bound for the value may correspond to the sum of all values assigned to the links of the data structure. In the case the data structure is a graph, the upper bound may be indicative of a sum of all values assigned to the edges of the graph.

The method may further comprise, before the step of receiving the assembly instructions, parsing a source code of the program and thereby obtaining the assembly instructions. Parsing may comprise decoding a source code from its actual textual format into an internal representation. The internal representation may comprise the assembly instructions. The step of parsing may be carried out by a parser (which may be, e.g., a program library).

The respective duration of a memory operation of an assembly instruction may depend on a memory (e.g., a particular hardware configuration) accessed responsive to execution of the assembly instruction. The duration of the memory operation may be representative of a memory access time (e.g., a time which is needed by the respective memory to process the memory access operation). Different memories may have different durations of a memory operation of an assembly instruction. Further, different assembly instructions may lead to different durations of a memory operation at one and the same memory. The values indicative of a duration of a memory operation may be based on calculated, estimated or empirical values (e.g., measured values). These values may be saved in a data structure (e.g., in a database). Alternatively or additionally, the values indicative of a duration of a memory operation may be obtained by simulation.

According to a second aspect, a computer program product is presented, wherein the computer program product comprises program code portions for performing the steps of any of the methods and method aspects described herein when the computer program product is executed on a computing device. The computing device may be a personal computer, a mobile computing device, or a server.

The computer program product may be stored on a computer-readable recording medium. The computer program product may be stored, e.g., in a volatile or a non-volatile manner. The computer-readable recording medium may be a hard disk, an optical recording medium (e.g., CD, DVD, or Blu-ray Disc), or a flash drive.

According to a third aspect, a computing device configured to process assembly instructions comprised by a program is presented, wherein the computing device comprises a first interface configured to receive assembly instructions, and at least one processor configured to analyze data dependencies between the assembly instructions, to arrange the assembly instructions in a data structure that associates the assembly instructions by links representing their data dependencies, to assign, to the links, a respective value indicative of a duration of a memory operation of an assembly instruction associated with the respective link, and to re-arrange the assembly instructions based on the data structure and the values assigned to the links. The computing device further comprises a second interface configured to output the re-arranged assembly instructions. The computing device may be a personal computer, a mobile computing device, or a server. The first interface may comprise a memory interface, a network interface, and/or a user interface (e.g., a keyboard, a touchpad, and/or a mouse). The processor may be a Central Processing Unit (CPU). For example, the processor may be a processor configured to process the assembly instructions. The second interface may comprise a display device interface, a printing device interface, a storage device interface (e.g., a memory device interface), and/or a network interface. The first interface and the second interface may in certain solutions be integrated into a single interface. The computing device may be further configured to perform the steps of any of the methods described herein.

Brief Description of the Drawings

Embodiments of the technique presented herein are described below with reference to the accompanying drawings, in which:

Fig. 1 shows a flow chart of a method for processing assembly instructions in accordance with an aspect of this disclosure;

Fig. 2a shows an example of a graph, wherein assembly instructions are represented by nodes of the graph and data dependencies are represented by edges of the graph;

Fig. 2b shows an example of a graph, wherein values indicative of a duration of a memory operation have been assigned to the edges of the graph;

Fig. 3a shows an example of a representation of assembly instructions as a graph before a step of transforming the graph to a transformed graph by rearranging the assembly instructions has been performed;

Fig. 3b shows an example of a transformed graph after transforming the graph shown in Fig. 3a by introducing additional links between assembly instructions;

Fig. 4a shows an example of four assembly instructions executed in their initial order;

Fig. 4b shows a graph representation of the assembly instructions of Fig. 4a and the data dependencies between them;

Fig. 4c shows an example of the assembly instructions shown in Fig. 4a executed in a re-arranged order; and

Fig. 5 shows a computing device configured to process assembly instructions comprised by a program, in accordance with an aspect of this disclosure.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as specific assembly instructions, data structures, and so on, in order to provide a thorough understanding of the present disclosure. It will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. Those skilled in the art will further appreciate that the methods, steps and functions explained herein below may be implemented using individual hardware circuitry, using software functioning in conjunction with a programmed processor or general purpose computer, using an Application Specific Integrated Circuit (ASIC) and/or using one or more Digital Signal Processors (DSPs). It will also be appreciated that when the present disclosure is described in connection with methods and method aspects, it may also be embodied in a processor and a memory coupled to the processor, wherein the memory is encoded with one or more programs or program instructions that cause the processor to perform the methods, steps and functions disclosed herein upon execution.

Fig. 1 shows a flow chart of a method embodiment for processing assembly instructions in accordance with an aspect of this disclosure. The method may be executed automatically by means of a computing device, such as a PC. In step S2, an original source code of a program is parsed by a parser. From this, an internal representation as a list of assembly instructions is obtained. An assembly instruction is a low level program unit that encodes one simple activity that the processor can perform. Thus, step S2 comprises receiving the assembly instructions by parsing. The assembly instructions may be received in the form of a list having a particular order. The order corresponds to a sequence in which the respective instructions are executed by a processor when the program is executed.

Parsing may be carried out by decoding the assembly instructions from an original textual format into an internal representation. For example, the text "ldd r3, [d1]" becomes the internal representation for an instruction that loads a doubleword of memory from the memory location d1 to the third register r3 (a minimal parsing sketch is given after the next paragraph).

In a further step S4, data dependencies and control dependencies between the assembly instructions are analyzed. In general, a data or control dependency is a relation between two assembly instructions implying that they should be executed in the same order (with respect to each other) or the behavior of the program could change, which might lead to a different outcome of the program (e.g., errors).
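Returning to the parsing of step S2: as a purely illustrative sketch (in Python, which is not mandated by this disclosure), such a textual-to-internal decoding could look as follows. The Instruction record and the very simple grammar handled here (a mnemonic followed by comma-separated operands) are assumptions of the sketch, not features of any particular assembler.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Instruction:
        original_text: str   # kept so that the output step can reproduce the source line
        mnemonic: str        # e.g. "ldd" or "mov"
        operands: List[str]  # e.g. ["r3", "[d1]"]

    def parse_line(line: str) -> Instruction:
        # "ldd r3, [d1]" -> mnemonic "ldd", operands ["r3", "[d1]"]
        mnemonic, _, rest = line.strip().partition(" ")
        operands = [op.strip() for op in rest.split(",")] if rest else []
        return Instruction(original_text=line.strip(), mnemonic=mnemonic, operands=operands)

    # parse_line("ldd r3, [d1]") yields
    # Instruction(original_text='ldd r3, [d1]', mnemonic='ldd', operands=['r3', '[d1]'])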

Control dependencies are identified when certain assembly instructions can alter the control flow of the program constituted by the assembly instructions (such as jumps and labels). There generally is a control dependency between two assembly instructions if one of them can send or receive program control (e.g., to or from the other).

To obtain the data dependencies between the assembly instructions, for each one of the assembly instructions the location (or the locations) of data that is read by the instruction is identified. Further, the location (or the locations) of data that can be changed (e.g., written) by the assembly instruction is identified. Such locations could comprise register values, data stored in memory or flags.

The step of analyzing data dependencies may be based on the Bernstein condition, which states that there is a data dependency between two instructions if and only if one of the instructions writes a data location that the other instruction reads or writes. The term reading is used herein for accessing data and the term writing is used herein for changing data. A data location may be any hardware component of a system executing the program that is capable of storing information.

As an example, the instructions "mov r1, r3" and "mov r2, r3" are not in data dependency. The first instruction reads from the third register r3 and writes into the first register r1, whereas the second instruction reads from the third register r3 and writes into the second register r2. Thus, both instructions read the third register r3 but they write different registers. Further, the instructions "mov r1, r2" and "ldd r2, [d1]" are in a data dependency, because the first instruction reads the second register r2 and the second instruction writes the register r2.

In a further step S6, the assembly instructions are arranged in a logical data structure that can be stored in a permanent or transitory storage and that associates the assembly instructions by links representing their data dependencies. In the embodiment described herein, the data structure can be represented by a graph, wherein each assembly instruction is represented by a node of the graph. The links are represented by edges connecting the nodes of the graph. Two nodes of the graph are connected by an edge if there is a data dependency between the two assembly instructions represented by the nodes.

Although in the example described herein the data structure is represented by a graph, it is to be noted that other types of representations are possible. Further, the assembly instructions do not have to be arranged physically; it is rather sufficient that locations in a logical data structure (e.g., representative of nodes in a graph) are assigned to the assembly instructions and that the assembly instructions are connected by logical links (e.g., representative of edges).

In the embodiment described herein, in step S6 a data dependency graph is built from the assembly instructions and the data dependencies between them, wherein nodes represent instructions. There is a directed edge from a first node representing a first instruction to a second node representing a second instruction if the first instruction precedes the second instruction and there is a data or control dependency between the first instruction and the second instruction. The resulting graph is directed and acyclic. More precisely, the graph is a Directed Acyclic Graph (DAG). Any ordering of instructions should be considered valid if it is a topological sort on the graph. A valid reordering does not change the semantics of the program.

Thus, a data dependency graph is in the present embodiment a graph where nodes are assembly instructions and edges stand for data dependencies between the instructions. The topological sort of a graph is an ordering of the nodes in the graph. In a topological sort there are no back edges, so no node can have an edge to another node that precedes it in the order. Only graphs with no directed cycles can have a topological sort. If in the following the terms "ordering" and "reordering" are mentioned, they refer to the changed order of the assembly instructions.
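The following sketch (Python, purely illustrative, using the same simplified register-only read/write model as in the previous sketch) builds such a data dependency graph as a list of directed edges and checks whether a given reordering is a topological sort of it. It uses, by way of example, the four instructions 20, 22, 24, and 26 that are also listed below in connection with Fig. 2a; all helper names are assumptions of the sketch.

    def reads_and_writes(mnemonic, operands):
        # Simplified register-only model, as in the previous sketch.
        ops = [op.strip("[]") for op in operands]
        if mnemonic == "std":
            return set(ops), set()
        return set(ops[1:]), {ops[0]}

    def build_dependency_edges(instructions):
        # Directed edge i -> j whenever instruction i precedes instruction j in
        # the initial order and the Bernstein condition holds between them.
        edges = []
        for i, a in enumerate(instructions):
            reads_a, writes_a = reads_and_writes(*a)
            for j in range(i + 1, len(instructions)):
                reads_b, writes_b = reads_and_writes(*instructions[j])
                if writes_a & (reads_b | writes_b) or writes_b & (reads_a | writes_a):
                    edges.append((i, j))
        return edges

    def is_topological(order, edges):
        # A reordering is valid if no edge points backwards in the new order.
        position = {node: k for k, node in enumerate(order)}
        return all(position[i] < position[j] for i, j in edges)

    # The instructions 20, 22, 24, 26 of Fig. 2a (indices 0..3):
    program = [("mov", ["r1", "0"]), ("mov", ["r2", "r1"]),
               ("mov", ["r3", "r1"]), ("std", ["[r2]", "r3"])]
    edges = build_dependency_edges(program)     # [(0, 1), (0, 2), (1, 3), (2, 3)]
    print(is_topological([0, 2, 1, 3], edges))  # True: instructions 22 and 24 may be swapped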

For example, from a set of received assembly instructions 20, 22, 24, and 26, the graph representation shown in Fig. 2a can be generated. In other words, Fig. 2a expresses the data structure in which the following exemplary assembly instructions 20, 22, 24, and 26 are arranged in step S6:

mov r1, 0 (20)

mov r2, r1 (22)

mov r3, r1 (24)

std [r2], r3 (26)

In step S8, a respective value indicative of a duration of a memory operation of an assembly instruction associated with the respective link is assigned to the links. In the embodiment described herein, an estimated stall time resulting from the duration of the respective memory operation is assigned to the edges of the graph connecting a first and a second instruction, wherein the estimated stall time corresponds to an estimated stall time from executing the second instruction immediately after the first. This estimation can be based on measured data about memory access times. Step S8 can be performed automatically once such data is provided.

Memory access time is generally a property of a particular memory type that determines how many computer cycles have to pass for a memory operation to complete. For example, if there are two kinds of memory, one from which the queried data arrives in 9 CPU cycles and another where the delay is 50 cycles, the graph shown in Fig. 2b is generated from the following instructions 30, 32, and 34:

ldd r1, [lab] (30)

ldd r2, [r1] (32)

mov r3, r2 (34)

The edges of the graph are labelled with the respective values indicative of a duration of a memory operation of the corresponding instruction 30, 32, 34 associated with the respective edge. In more detail, the edges are annotated with a number that represents the number of stall cycles that is the result of executing the two linked assembly instructions in sequence.
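One possible way to perform this annotation automatically is sketched below (Python, illustrative only). The latency table and the rule that only load-type instructions stall the processor are assumptions of the sketch; in practice the values would come from the measured memory access times mentioned above.

    # Hypothetical stall cycles per memory region (e.g., obtained by measurement).
    MEMORY_LATENCY = {"lab": 9, "hdr": 2}   # memory label -> CPU stall cycles
    DEFAULT_LATENCY = 50                    # any other memory access

    def stall_cycles(mnemonic, operands):
        # Only load instructions ("ldd" in these examples) block the processor;
        # register-to-register moves complete in the next cycle.
        if mnemonic != "ldd":
            return 0
        region = operands[1].strip("[]")
        return MEMORY_LATENCY.get(region, DEFAULT_LATENCY)

    def annotate_edges(instructions, edges):
        # Each edge (i, j) is annotated with the stall cycles caused by executing
        # instruction j immediately after the loading instruction i.
        return {(i, j): stall_cycles(*instructions[i]) for (i, j) in edges}

    # The instructions 30, 32, 34 of Fig. 2b (indices 0..2):
    fig2b = [("ldd", ["r1", "[lab]"]), ("ldd", ["r2", "[r1]"]), ("mov", ["r3", "r2"])]
    print(annotate_edges(fig2b, [(0, 1), (1, 2)]))   # {(0, 1): 9, (1, 2): 50}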

A cost associated with a given reordering is defined to be the number of CPU cycles that will be spent waiting for memory operations to complete. The cost corresponds in one variant to the maximal number of NOP instructions that can be inserted between instructions while not increasing the execution time. The associated cost is a measure of the performance of the program. With lower costs, the performance of the program will increase. For example, the ordering of the code of the above example (see Fig. 2b) will have the cost of 9 + 50 = 59.

In step S10, the data structure is transformed to a transformed data structure by re-arranging the assembly instructions (e.g., by adding and/or removing edges of the graph). In the embodiment described herein, the graph is transformed into a transformed graph. In other words, the graph is optimized for searching (i.e., for application of an informed graph search algorithm). This means that no transformations are applied to the graph which would change the optimal ordering, but the complexity of finding the optimal ordering is reduced by the step S10 of transforming. For example, the optimal ordering of the graph shown in Fig. 3a comprising instructions 40, 42, 44, and 46 is not changed by the introduction of a new edge from instruction 42 to instruction 44. Thus, the graph shown in Fig. 3a can be transformed, e.g., into the (transformed) graph shown in Fig. 3b.

Transforming step S10 is optional. Step S10 may be omitted and the following steps may be applied to the untransformed initial graph generated in steps S6 and S8.

In step S12, an upper bound for a total processor stall time associated with the memory operations is calculated. In this regard, an upper bound on the cost of the optimal solution is established in linear time.

An upper bound is a maximum of some quantity. The optimal solution is the cost of the best ordering that is possible. "In linear time" means that the cost of finding the bound increases proportionally to the number of nodes in the graph. For example, the cost cannot be less than the sum of the values assigned to the edges on the shortest path, and cannot be more than the sum of all values assigned to the edges in the graph. The shortest path in the graph is a sequence of edges from the start node to the end node. It does not contain the same edge twice. Every edge points to the source of the next. The sum of the numbers on the edges is minimal.
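As a purely illustrative sketch (the cost model is a simplification consistent with the examples given in this description: each instruction scheduled between two dependent instructions is assumed to hide one stall cycle, and stalls are assumed not to overlap), the cost of a concrete ordering and the trivial upper bound can be computed as follows:

    def ordering_cost(order, edge_cost):
        # An edge (i, j) with c stall cycles contributes the cycles that are not
        # hidden by instructions scheduled between i and j.
        position = {node: k for k, node in enumerate(order)}
        return sum(max(0, c - (position[j] - position[i] - 1))
                   for (i, j), c in edge_cost.items())

    def trivial_upper_bound(edge_cost):
        # No valid ordering can cost more than the sum of all edge annotations,
        # so this sum is a linear-time upper bound on the optimal solution.
        return sum(edge_cost.values())

    # Fig. 2b example: the ordering 30, 32, 34 costs 9 + 50 = 59, which here
    # also equals the trivial upper bound.
    fig2b_costs = {(0, 1): 9, (1, 2): 50}
    print(ordering_cost([0, 1, 2], fig2b_costs))   # 59
    print(trivial_upper_bound(fig2b_costs))        # 59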

Also step S12 of calculating an upper bound is optional and may be omitted. The method may jump directly from step S8 or S10 to step S14.

In step S14, an order of assembly instructions that reduces a total processor stall time associated with the memory operations is searched for. To this end, a search algorithm (e.g., an informed graph search algorithm) is applied to the transformed graph. The search algorithm searches for the optimal ordering. It starts from an empty instruction sequence and adds a new instruction in every step. In the present embodiment, the search space is reduced by using the upper bound on the optimal solution (calculated in step S12). The reduction of the search space in one implementation works like this: If there is a point during the search where not all instructions are used yet, but the cost of the ordering at that point is already higher than a known complete ordering or the established upper bound, then the optimal solution cannot be among the descendants of the node (i.e., a starting order of the optimal solution cannot correspond to the one found at that point). Thus, the ordering at that point may be discarded and need not be evaluated further. The reason for this is that all descendants of a node will have higher cost than their ancestor.
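The search of step S14 can be illustrated by the following sketch. It is a plain branch-and-bound enumeration of topological orderings with pruning by the bound, written in Python purely for illustration and not necessarily the specific informed search algorithm of any particular embodiment; the cost model is the simplified one from the previous sketch, and all names are assumptions.

    def find_optimal_ordering(num_instr, edge_cost):
        # Branch-and-bound search over the topological orderings of the DAG.
        # Partial orderings whose cost already reaches the best known complete
        # ordering (initially the trivial upper bound) are discarded.
        preds = {j: set() for j in range(num_instr)}
        for (i, j) in edge_cost:
            preds[j].add(i)

        best_cost = sum(edge_cost.values())   # trivial upper bound (step S12)
        best_order = list(range(num_instr))   # the initial order is always valid

        def partial_cost(order):
            position = {node: k for k, node in enumerate(order)}
            return sum(max(0, c - (position[j] - position[i] - 1))
                       for (i, j), c in edge_cost.items()
                       if i in position and j in position)

        def extend(order, scheduled):
            nonlocal best_cost, best_order
            cost = partial_cost(order)
            if len(order) == num_instr:
                if cost < best_cost:
                    best_cost, best_order = cost, list(order)
                return
            if cost >= best_cost:
                return   # prune: descendants cannot beat the best known ordering
            for node in range(num_instr):
                if node not in scheduled and preds[node] <= scheduled:
                    extend(order + [node], scheduled | {node})

        extend([], set())
        return best_order, best_cost

    # Fig. 4 example: edges 50a->50b (2 stall cycles) and 50c->50d (0 cycles).
    order, cost = find_optimal_ordering(4, {(0, 1): 2, (2, 3): 0})
    print(order, cost)   # [0, 2, 3, 1] (i.e., 50a, 50c, 50d, 50b) with cost 0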

In step S16, the assembly instructions are ordered according to the result of the graph search of step S14. Hence, in steps S14 and S16, the assembly instructions are re-arranged based on the data structure and the values assigned to the links. For example, if for the three instructions:

ldd r1, [lab1]

ldd r2, [lab2]

add r1, r2

step S14 has shown that the optimal ordering is second-first-third, the result (re-arranged instructions) will be:

ldd r2, [lab2]

ldd r1, [lab1]

add r1, r2

In step S18, the re-arranged assembly instructions are output. The modified (reordered) source code is generated using the original form that is stored at the parsing step S2 in the representation of each instruction.

Thus, the output text from the example discussed with respect to the previous step S16 will be:

"ldd r2, [lab2]"

"ldd r1, [lab1]"

"add r1, r2"

An assembler may then be used for transferring the re-arranged assembly instructions into executable machine code.
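A sketch of this output step is given below (illustrative Python; the stored original textual form of each instruction, mirrored from the earlier parsing sketch, is here simplified to a plain list of source lines):

    def emit_reordered_source(original_lines, order):
        # Re-generate the source text in the re-arranged order, using the original
        # textual form of each instruction as stored at the parsing step (S2).
        return "\n".join(original_lines[i] for i in order)

    print(emit_reordered_source(
        ["ldd r1, [lab1]", "ldd r2, [lab2]", "add r1, r2"], [1, 0, 2]))
    # ldd r2, [lab2]
    # ldd r1, [lab1]
    # add r1, r2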

Figs. 4a and 4b exemplarily illustrate the effect of re-arranging assembly instructions in accordance with the method embodiment illustrated in Fig. 1.

Fig. 4a shows an example of the following four instructions 50a, 50b, 50c, and 50d in an initial (non-optimal) order, namely in the following order:

ldd r2, [hdr] (50a)

cmp r2, 0 (50b)

mov r3, r4 (50c)

and r3, 0xff (50d)

In Fig. 4a, the instructions 50a-d are arranged on a linear horizontal time line. The program comprising the instructions 50a-d is started at a starting time t0.

In Fig. 4b, since the second instruction 50b reads from a register r2, to which the first instruction 50a writes, there is a data dependency 54ab between the first and the second instruction 50a, 50b. It is labeled with 2 as a cost, because in the example, the memory operation of instruction 50a takes 2 cycles to load the data to the given register. Further, since the third instruction 50c writes to the register r3, to which the fourth instruction 50d writes, there is a data dependency 54cd between the third and the fourth instruction 50c, 50d. It is labeled with 0 as cost, because 50c is not a memory operation, and the result of 50c will be available in the next cycle immediately. In the example shown in Fig. 4b, the data dependencies 54ab, 54cd are indicated as solid arrows connecting the respective instructions 50a-d.

The execution of each instruction takes one processor cycle. Additionally, the 52b and 52c cycles (i.e., stall cycles) are used to load the accessed data into the r2 register. This is necessary because the 50b instruction is executed after the 50a instruction in the 52d cycle, and the value of r2 is needed for the 50b instruction to execute.

As shown in Fig. 4a, the third instruction 50c may be carried out immediately after the second instruction 50b since there is no data dependency between the second and the third instruction. The fourth instruction 50d may also be carried out immediately after the third instruction 50c, because no memory load operation was performed in 50c. After the last instruction is finished, the program is ended at a time t0 + 6.

As can be seen in Fig. 4b, there is no data or control dependency between instructions 50b and 50c, so their order can be swapped. Thus, the following is a valid ordering of the instructions:

ldd r2, [hdr] (50a)

mov r3, r4 (50c)

cmp r2, 0 (50b)

and r3, 0xff (50d)

And because there is also no dependency between the instructions 50b and 50d, their order can also be changed. This results in the ordering that can be seen in Fig. 4c.

Fig. 4c shows the same instructions 50a-d as in the example of Fig. 4a but re-arranged according to the method described herein. Thus, the order of the instructions 50a-d is improved and stall times are reduced. As can be seen in Fig. 4c, the instructions 50a-d are executed in the following order:

ldd r2, [hdr] (50a)

mov r3, r4 (50c)

and r3, 0xff (50d)

cmp r2, 0 (50b)

Since the total stall time is reduced from 2 cycles to 0, the program finishes after 4 cycles instead of 6. Thus, the program with the re-arranged instructions is faster and the cost of the program is reduced.
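For illustration only, the simplified cost model sketched earlier reproduces this result; the edge costs 2 and 0 are taken from Fig. 4b, and the helper function is an assumption of the sketch, not part of the claimed method.

    def ordering_cost(order, edge_cost):
        # Residual stall cycles of an ordering under the simplified model used in
        # the earlier sketches: each intervening instruction hides one stall cycle.
        position = {node: k for k, node in enumerate(order)}
        return sum(max(0, c - (position[j] - position[i] - 1))
                   for (i, j), c in edge_cost.items())

    edges = {(0, 1): 2, (2, 3): 0}             # 50a->50b: 2 stall cycles, 50c->50d: 0
    print(ordering_cost([0, 1, 2, 3], edges))  # initial order of Fig. 4a: 2 stall cycles
    print(ordering_cost([0, 2, 3, 1], edges))  # re-arranged order of Fig. 4c: 0 stall cycles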

Fig. 5 shows a computing device 60 configured to process assembly instructions comprised by a program. The computing device 60 may be used for carrying out any one of the methods and method aspects described herein (see, e.g., Fig. 1). In particular, the computing device 60 comprises a first interface 62 configured to receive assembly instructions. The first interface 62 may be an input interface and may comprise an interface for a keyboard, a mouse, a touchpad, an interface to an external or internal storage device and/or a network interface. The first interface 62 is configured to receive the assembly instructions, e.g., in the form of an ordered list having an initial order. The first interface 62 may also receive the assembly instructions in an unordered form. The computing device 60 further comprises at least one processor 64 (e.g., a CPU). In some variants, a plurality of processors 64 may be provided, wherein the processors may share the tasks executed by the computing device 60.

The processor 64 of Fig. 5 is configured to execute one or more of the steps of the methods and method aspects described herein. In particular, the processor 64 is configured to execute at least steps S4, S6, S8, S14, and S16 of the method described above. In the case that a plurality of processors 64 is provided, the processors 64 may share the tasks of executing the steps described herein between them.

The step of arranging the assembly instructions may be carried out by the processor 64 without physically arranging the instructions but rather by providing a respective data structure referring to the respective assembly instructions. The data structure may be represented by a graph, wherein the assembly instructions are represented by nodes of the graph and the data dependencies are represented by edges of the graph.

A value assigned to each edge and indicative of a duration of a memory operation may be a value representing an amount of time or a number of clock cycles of the processor 64. The value indicative of a duration of a memory operation may represent a duration of a blocking time during which the accessed memory is blocked (e.g., not usable for another memory operation). The value may be an estimated value (e.g., based on empirical values). The computing device 60 comprises a second interface 66 configured to output the re-arranged assembly instructions. The second interface 66 may be an output interface connected, e.g., to a display device, a printing device, an interface to an external or internal storage device and/or a network interface. The first and the second interface 62, 66 may be combined into a single interface (e.g., a single physical interface).

The step of re-arranging may comprise assigning by the processor 64 a value representative of an order to the respective assembly instructions, without physically re-arranging the assembly instructions. The step of outputting may be performed via the second interface 66. In the step of outputting, the re-arranged assembly instructions may thus be printed out, displayed on a display screen, and/or saved to an internal or external memory device.

The computing device 60 further comprises a memory 68 for storing data necessary for carrying out the steps of the methods and method aspects described herein. The assembly instructions received by the first interface 62 may be stored in the memory 68 and read from the memory 68. Further, the re-arranged assembly instructions may be stored in the memory 68. The memory 68 can be an internal or external storage device (e.g., hard drive, flash drive and/or RAM) of the computing device 60.

The computing device 60 comprises a bus 70 connecting the first and second interfaces 62, 66, the processor 64 and the memory 68 with each other. By means of the bus 70, the aforementioned units of the computing device 60 may communicate with each other and/or exchange data with each other.

As has become apparent from the above description of exemplary embodiments, the technique presented herein makes it possible to process (e.g., optimize) assembly code so as to reduce stall times resulting from memory accesses. The technique presented herein may be implemented without human interaction, which is efficient in terms of manpower and less error-prone.

It is believed that the advantages of the technique presented herein will be fully understood from the foregoing description, and it will be apparent that various changes may be made in the form, constructions and arrangement of the exemplary aspects thereof without departing from the scope of the invention or without sacrificing all of its advantageous effects. Because the technique presented herein can be varied in many ways, it will be recognized that the invention should be limited only by the scope of the claims that follow.