

Title:
METHOD FOR PORTING CODE TO A GENERAL-PURPOSE GPU ARCHITECTURE
Document Type and Number:
WIPO Patent Application WO/2023/126464
Kind Code:
A1
Abstract:
The invention relates to a computer-implemented method for porting a code from a CPU-based architecture onto a GPU-based architecture. The method comprises clustering of dependency graphs based on topological proximity.

Inventors:
SOTTET JEAN-SÉBASTIEN (LU)
Application Number:
PCT/EP2022/087987
Publication Date:
July 06, 2023
Filing Date:
December 29, 2022
Assignee:
LUXEMBOURG INST SCIENCE & TECH LIST (LU)
International Classes:
G06F8/76; G06F8/41
Other References:
ZHOU KEREN ET AL: "An Automated Tool for Analysis and Tuning of GPU-Accelerated Code in HPC Applications", IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, IEEE, USA, vol. 33, no. 4, 1 July 2021 (2021-07-01), pages 854 - 865, XP011883306, ISSN: 1045-9219, [retrieved on 20211014], DOI: 10.1109/TPDS.2021.3094169
NICOLAS WEBER ET AL: "Adaptive GPU Array Layout Auto-Tuning", SOFTWARE ENGINEERING METHODS FOR PARALLEL AND HIGH PERFORMANCE APPLICATIONS, ACM, 2 PENN PLAZA, SUITE 701 NEW YORK NY 10121-0701 USA, 31 May 2016 (2016-05-31), pages 21 - 28, XP058260457, ISBN: 978-1-4503-4351-0, DOI: 10.1145/2916026.2916031
CHABERT MAXIME; SOLNON CHRISTINE: "Constraint programming for multi-criteria conceptual clustering", INTERNATIONAL CONFERENCE ON PRINCIPLES AND PRACTICE OF CONSTRAINT PROGRAMMING, SPRINGER, CHAM, 2017
Attorney, Agent or Firm:
LECOMTE & PARTNERS (LU)
Claims:
CLAIMS

1. Computer-implemented method for transforming a first source code executable by an architecture comprising at least one general-purpose CPU and no GPU, into a second source code executable by an architecture comprising at least one general-purpose GPU and at least one general-purpose CPU, the method comprising the execution of the following steps by a processor: a) performing a dynamic performance analysis (100) of the first source code to identify portions of potential enhancement; b) generating dependency graphs (200) of the portions of potential enhancement, each dependency graph comprising information represented as vertices and edges; c) for each dependency graph, calculating (300) a topological proximity between each vertex and each other vertex of the dependency graph; d) clustering (400) the information of each dependency graph based on a topological proximity criterion; e) identifying (500), in the first source code, statements and/or expressions corresponding to the clustered information; f) selecting transformation rules (600) to be applied to the identified statements and/or expressions based on at least one of: preestablished rules; the result of the performance analysis; the information of the dependency graph; the topological proximity criterion; and the content of the cluster; g) executing (700) the selected transformation rules to obtain an updated source code; h) estimating or running a performance analysis (800) of the updated source code and updating the portions of potential enhancement; i) repeating the above steps b) to h) until the updated source code satisfies a condition of performance and/or until the absence of portions of potential enhancement, wherein while repeating the steps b) to h), the identification of step e) is made in the updated source code and in step f) the result of the performance analysis refers to the analyses of step h) on the updated source code; and j) outputting (900) the last occurrence of the updated source code as the second source code.

2. Method according to claim 1, wherein the topological proximity is based on one or more of the following parameters: the function used by the vertices, the scope of the domain and/or range of the computation of the vertices, the memory allocations of the vertices.

3. Non-transitory computer readable medium comprising instructions, which, when read by a processor, make the processor execute the method according to claim 1 or 2.

Description:
METHOD FOR PORTING CODE TO A GENERAL-PURPOSE GPU ARCHITECTURE

FIELD OF THE INVENTION

[0001] The invention relates to computer-implemented methods and more particularly to a method for automatically porting a code executable on a CPU-based architecture into a code executable on a CPU and GPU-based architecture.

BACKGROUND ART

[0002] Recent hardware developments have made it possible to use a general-purpose GPU for executing a program, or part of it, in parallel with or in place of its execution on a CPU, thereby enhancing the performance of the execution by appropriately taking advantage of the hardware capabilities. This is common for programs involving video content (e.g., video games, video editing, etc.).

[0003] Adapting (porting) a program which has been developed for a CPU into a program executable by an architecture comprising a CPU and a GPU with the aim of using the computation capabilities of the GPU can be challenging.

[0004] The transformation of the program can be carried out manually by an experienced programmer, but this can be tedious and can lead to a ported program that does not run as efficiently as it could.

SUMMARY OF THE INVENTION

[0005] The invention proposes a universally applicable method for automatically porting a code for its execution on an architecture comprising a GPU.

[0006] The invention relates to a computer-implemented method for transforming a first source code executable by an architecture comprising at least one general-purpose CPU and no GPU, into a second source code executable by an architecture comprising at least one general-purpose GPU and at least one general-purpose CPU, the method comprising the execution of the following steps by a processor: a) performing a dynamic performance analysis of the first source code to identify portions of potential enhancement; b) generating dependency graphs of the portions of potential enhancement, each dependency graph comprising information represented as vertices and edges; c) for each dependency graph, calculating a topological proximity between each vertex and each other vertex of the dependency graph; d) clustering the information of each dependency graph based on a topological proximity criterion; e) identifying, in the first source code, statements and/or expressions corresponding to the clustered information; f) selecting transformation rules to be applied to the identified statements and/or expressions based on at least one of: preestablished rules; the result of the performance analysis; the information of the dependency graph; the topological proximity criterion; and the content of the cluster; g) executing the selected transformation rules to obtain an updated source code; h) estimating or running a performance analysis of the updated source code and updating the portions of potential enhancement; i) repeating the above steps b) to h) until the updated source code satisfies a condition of performance and/or until the absence of portions of potential enhancement, wherein while repeating the steps b) to h), the identification of step e) is made in the updated source code and in step f) the result of the performance analysis refers to the analysis of step h) on the updated source code; and j) outputting the last occurrence of the updated source code as the second source code. The invention also relates to a non-transitory computer readable medium comprising instructions, which, when read by a processor, make the processor execute the above-mentioned method.
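As an illustration only, the iterative structure of the claimed steps can be sketched in Python as follows. Every callable name here (analyse, build_graphs, locate, etc.) is a hypothetical placeholder supplied by the caller, not part of the claimed method's actual implementation:

```python
def port(source, analyse, build_graphs, proximity, cluster,
         locate, select_rules, apply_rules, max_iter=10):
    """Hedged sketch of the claimed loop; all callables are placeholders."""
    portions = analyse(source)                            # a) performance analysis
    for _ in range(max_iter):                             # i) repeat b) to h)
        if not portions:                                  # stop: nothing left to enhance
            break
        for graph in build_graphs(source, portions):      # b) dependency graphs
            prox = proximity(graph)                       # c) topological proximity
            clusters = cluster(graph, prox)               # d) clustering
            stmts = locate(source, clusters)              # e) statements/expressions
            rules = select_rules(stmts, graph, clusters)  # f) transformation rules
            source = apply_rules(source, rules)           # g) updated source code
        portions = analyse(source)                        # h) re-analyse
    return source                                         # output: second source code


# toy usage: "cpu_loop" stands in for a portion of potential enhancement
result = port(
    "x = cpu_loop(data)",
    analyse=lambda s: ["hot"] if "cpu_loop" in s else [],
    build_graphs=lambda s, p: [{"vertices": p}],
    proximity=lambda g: {},
    cluster=lambda g, prox: [g["vertices"]],
    locate=lambda s, c: ["cpu_loop"],
    select_rules=lambda stmts, g, c: [("cpu_loop", "gpu_kernel")],
    apply_rules=lambda s, rules: s.replace(rules[0][0], rules[0][1]),
)
assert result == "x = gpu_kernel(data)"
```

The toy run terminates after one iteration because, once the transformation rule has fired, the (placeholder) analysis no longer reports any portion of potential enhancement.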

[0007] This method ensures the optimisation of the source code for use on a GPU architecture. The particularly inventive aspect is the use of topological proximity, which takes advantage of the computation capabilities of the GPU, which are especially efficient for geometry-oriented objects. Because topological proximity has some analogy with geometry-oriented computation, the method yields an improved source code that is particularly efficient when run on a GPU. Clusters (by memory access, similar operations/functions, throughput) and parameter-dependent trends of clusters behave like a cluster of pixels when the GPU processes image or video content. The topological proximity is not applied to pixels having geometric positions; it is applied, for instance, to the similarity of basic functions (or composite routines) executed on the CPU or GPU, to the data transfers between different kinds of memories, or to the range of the data processed (e.g., small integers or big integers).

[0008] According to a preferred embodiment, the topological proximity is based on one or more of the following parameters: the function used by the vertices, the scope of the domain and/or range of the computation of the vertices, the memory allocations of the vertices.

DEFINITIONS

[0009] In the present document, “GPU” is used to refer to a general-purpose graphics processing unit, and “CPU” is used to refer to a general-purpose central processing unit, unless explicitly mentioned otherwise. Any GPU or CPU is part of an architecture that is composed of hardware. For instance, an architecture comprises the processors (chips) containing the CPU and/or GPU, the different kinds of memories (such as registers, caches, graphic memories, internal or external mass storage devices, RAM that is shared or not), and the communication between all of these (e.g., implemented with memory buses). Each item (CPU, GPU, memories, internal/external communication means, ...) is called a “node” of the architecture. This architecture allows sequential or parallel execution of programs. There are different kinds of architectures that our system will consider, such as 1 GPU with 1 CPU, 5 GPUs with 1 CPU, or 5 CPUs with 1 GPU.

[0010] The method for porting a source code according to the invention is carried out by a CPU which may or may not be the general-purpose CPU of the CPU-GPU architecture for which the source code is to be ported: the porting of the source code can be performed by a device that is independent from the targeted CPU-GPU architecture.

[0011] The wording “executable” refers to the property of a source code that can be run by a given CPU or GPU, potentially after its compilation.

[0012] The “first” source code can be written in any language or combination of languages (C, Java, etc.) and can be “sequential” in the sense that it requires a CPU to execute commands successively.

[0013] The first source code to be modified is given as input to the method and parts of it are referred to during steps of the method.

[0014] It must be noted that for each step, the source code can be analysed and mapped into different representations to ease the analysis that must be done, as is usual, for instance, in compilers and interpreters. Our method may use the representations used in different kinds of abstract machines, such as the LLVM IR (https://en.wikipedia.org/wiki/LLVM), the code used in the Java Virtual Machine (JVM), or the code used in the Virtual Machines of Prolog-like languages (e.g., the Warren Abstract Machine). All these representations keep traceability links towards the original (first) source code.

[0015] In this invention, optionally, source code annotations can also be used in addition to the software code. This is commonly done with software engineering IDEs (Integrated Development Environments). Some annotations can be, for instance, some or all specifications of the software code that are given as annotations in OCL (https://en.wikipedia.org/wiki/Object_Constraint_Language), or in other kinds of specification formalisms.

[0016] Hence, the source code given in input can be already annotated with other representations, such as those of virtual machines and of specification languages. This can improve the steps “c)” and “d)”.

[0017] A “dynamic performance analysis” is an analysis of the dynamic behaviour of the program described in the source code, resulting, for instance, in an estimate or a measure of the duration and/or of the memory size required to execute each line of code, and/or each command, and/or each expression, and/or in an estimate or measure of the amount of memory used, and/or in an estimate or measure of the time for accessing a network and/or its throughput, etc. The dynamic performance analysis can be made by analytics tools, by simulation, interpretation, or compilation and execution using, for instance, profilers, optimisers, emulators, or analytics or machine learning from data collected in past analyses.
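The patent does not prescribe a particular profiler (gprof is used as an example later). As a language-agnostic analogue, the following sketch uses Python's standard cProfile module to extract the costliest functions of a workload, a toy stand-in for the “portions of potential enhancement”:

```python
import cProfile
import pstats

def hotspots(workload, top=3):
    """Toy stand-in for the dynamic performance analysis: run `workload`
    under cProfile and return the names of the functions with the largest
    cumulative time (candidate portions of potential enhancement)."""
    pr = cProfile.Profile()
    pr.enable()
    workload()
    pr.disable()
    # pstats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    stats = pstats.Stats(pr).stats
    ranked = sorted(stats.items(), key=lambda kv: kv[1][3], reverse=True)
    return [key[2] for key, _ in ranked[:top]]

def busy():
    # deliberately slow loop so the profiler ranks it highly
    return sum(i * i for i in range(200_000))

assert "busy" in hotspots(busy)
```

Real uses would feed the ranked names back into parameter rules that decide which portions to treat first, as described above.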

[0018] A “portion of potential enhancement” is a description of a set of parts of the source code that is detected to be underperforming (e.g., using descriptors found in source code change management and version control), relative to the average performance of the entire source code, or in absolute value with respect to a given threshold. The portions of potential enhancement may be sorted according to their potential for enhancement and/or according to their degree of slowness. This may enable the method to prioritize the treatment of particular portions before others. In the present document, “portion” and “part” are used interchangeably.

[0019] A “dependency graph” is a formal representation of the interactions between the various objects of a code, i.e., expressions, statements, variables, etc. The objects are represented as vertices and their interactions are represented with edges. (The graphical diagram representation of a graph can be done, for instance, by representing vertices with dots, or circles, or boxes and by representing edges with straight arcs or curved arcs). Beyond the vertices and edges, the dependency graph contains information (metadata) such as which vertex is connected to which other vertex, which vertex has a particular type (function, variable, etc.) or calls a particular subroutine, as well as data related to the performance analysis.
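A minimal sketch of such an annotated dependency graph follows; the field names (kind, meta, etc.) are invented for illustration, since the patent does not prescribe a data structure:

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    name: str
    kind: str                                 # e.g. "function", "variable", "statement"
    meta: dict = field(default_factory=dict)  # metadata: performance data, types, ...

@dataclass
class DepGraph:
    vertices: list   # Vertex objects
    edges: list      # (source_name, target_name) pairs: "fib depends on n"

# toy graph: a call fib(n) depends on the variable n
g = DepGraph(
    vertices=[Vertex("fib", "function", {"calls": 42}),
              Vertex("n", "variable", {"range": "SmallInteger"})],
    edges=[("fib", "n")],
)
assert g.vertices[0].kind == "function" and ("fib", "n") in g.edges
```

The meta dictionaries carry the performance-analysis results mentioned in the text, so that later steps can select rules based on them.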

[0020] The expression “topological proximity” is used in the present document in line with the theory of descriptive nearness (“Topology of digital images - visual pattern discovery in proximity spaces”, DOI 10.1007/978-3-642-53845-2, ISBN 978-3-642-53844-5). This concept defines elements as descriptively near if and only if their descriptive intersection is non-empty. In other words, two items are near one another if they share a common property. This technique has been applied not only to the description of images (as presented in the book), but also, for instance, to graph comparison and formal concept analyses, which are similar to the comparison of software code. In the present document we will use the wording “nearness parameter” to refer to the property(ies) that is/are selected to measure the nearness or the topological proximity between two objects. “Proximity spaces provide a level of structure in between topologies and uniformities; in fact a proximity is equivalent to an equivalence class of uniformities with the same totally bounded reflection" (https://ncatlab.org/nlab/show/proximity+space).

[0021] In the present document the word “coherent” is given its regular meaning in the field of topology, in the context of the proximity discussed above. A topological space is coherent if it is uniquely determined by the union of its subspaces (see also https://en.wikipedia.org/wiki/Proximity_space; or Di Concilio, “Descriptive proximities: Properties and interplay between classical proximities and overlap”, arXiv:1609.06246).
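The definition of descriptive nearness quoted above (near iff the descriptive intersection is non-empty) is simple enough to state directly in code. In this sketch, each object is represented by a set of invented property labels:

```python
def descriptively_near(desc_a, desc_b):
    """Descriptive nearness: two objects are near iff their descriptive
    intersection is non-empty, i.e. they share at least one property."""
    return bool(set(desc_a) & set(desc_b))

# two vertices sharing the (invented) property label "SmallInteger" are near
assert descriptively_near({"call:fib", "SmallInteger"},
                          {"call:fact", "SmallInteger"})
# no shared property: the two descriptions are apart
assert not descriptively_near({"SmallInteger"}, {"BigInteger"})
```

The “nearness parameter” of the text corresponds to the choice of which properties are put into these description sets.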

[0022] The method (of claim 1) comprises several different steps and at each step, specific tools may be used, and parameters for automatic decision-making may be given in input.

[0023] For instance, at each step, the different target architectures (and/or their preferences) with (one or more) CPU and (one or more) GPU can be given inside (some or all of) the parameterisation rules. The architecture is not limited to the number of CPUs or GPUs, but can take into account the types of CPU and GPU, the network configuration between hardware microprocessors, the characteristics of the different memories (external, internal, hardware buses, shared memory, cache memory, registers, etc.), the speed (of CPU, GPU, memories, buses) and their throughputs, and the specialisations of their computations (matrix computations, unlimited Integer/Real computations, Fourier transforms, Deep Learning computations, etc.). This is important: all those parameters are used to map the execution needs more and more precisely onto the most adequate selected CPU/GPU architecture or set of different CPU/GPU architectures (including parameterised architectures). In some of the other steps, examples will be given.

[0024] Parameterisations are usually made by giving a configuration file or a parameter file when starting the method. An easy way to create and use those files is to describe the parameters or the configurations with rule-based languages. There are many rule-based sublanguages in usual programming languages. Software programming languages that are mainly based on rules are Jess, Prolog, or Drools. Build management tools (e.g., GNU Make, Nix, Apache Maven) also use rules to give parameters through the “command line arguments” of the tool. One can also use shell scripts (e.g., Linux Bash, Windows cmd, ...) for that automation. The description of the parameters to be applied during the method can be given at the beginning of the method, and those parameters will be used when appropriate, such as when going from one step to the next step of the method.

[0025] The use of rules may be important in our method because each step has optional parameters, each step can optionally access the parameters, inputs and results (outputs) of each preceding step, and each step may also optionally access the history of all preceding inputs and outputs when iterating (the last item in the list of steps).

[0026] Optionally and when necessary, the parameter files (i.e., the rules contained in those files) can be modified by each tool used during the method, in particular when iterating. This allows a fine tuning of the parameterisation for each context of use by giving a complete set of rules at the start of the method. The examples given in the description of some preferred implementations use many rules that are found in the parameterisation file. However, in other optimised implementations and without loss of generality, one can code the most important rules in a programming language so that the speed and the memory needed for the execution of those rules are optimised.

BRIEF DESCRIPTION OF DRAWINGS

[0027] Figure 1 shows a flowchart of the method of the invention;

[0028] Figures 2 to 4 relate to a first example;

[0029] Figures 5 and 6 relate to a second example;

[0030] Figure 7 relates to a third example.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

[0031] Figure 1 shows a summary of the method of the invention.

[0032] A key aspect of the invention is the determination of the specific portions of source code to be modified, i.e., statements and/or expressions of source code that must be transformed to enhance the efficiency of the code when used on a GPU-based architecture. Steps 100 to 500 determine these statements and/or expressions by using and combining successive techniques. As will be apparent below, each of the steps 100 to 700 is performed, and steps 200 to 700 use the outputs of one or more of the preceding steps, i.e., the output of steps 100-500 is kept in memory for being respectively potentially used in steps 300-700.

[0033] In a first step 100, a dynamic performance analysis is carried out. As explained above, this step detects portions of the source code that present a potential enhancement. The portions of potential enhancement may or may not be sorted by slowness and treated in that order.

[0034] The performance analysis may include dereferenced pointer analysis, called function porting analysis, shared memory analysis, data size computation, loop complexity analysis and/or conditional issue analysis.

[0035] The performance analysis may merge static code analysis (control flow graph, program dependency graph and/or data flow graph) and dynamic code analysis (profiling, line-by-line profiling, charge testing, and/or application coverage, etc.). These methods are modified in comparison to standard stand-alone methods by augmenting them with the information about the “coherence” for GPU architectures. Indeed, the analyses are merged, and the information from the different steps of the method is also merged at each iteration of the method. Rule-based languages are used to define, at each step, rules that use the results of all preceding analyses.

[0036] Merging static and dynamic information makes it possible to carry out performance analyses automatically and to finely detect the coherence issues for the GPU.

[0037] The performance analysis not only detects the presence of low-performing portions of the source code, but also detects showstoppers and provides insights on dependencies and solvable problems.

[0038] The input of this step is the source code of the program to be analysed.

[0039] The source code can be found in different source files as usual, for instance in the C programming language.

[0040] Parameter rules can be given, such as the target architecture (CPU/GPU, see the explanation in a previous paragraph), in order to focus on some functions depending on another function that has already been optimised for a GPU in a previous cycle of the method.

[0041] A simple example is the use of gprof (see gnu.org) to get some performance analysis. Using “symspec”, one can focus the analysis on some functions. After the analysis, the output can be the source code annotated with the number of times each code block and each function has been called, a flat profile, or the call graph profile. These outputs of gprof are analysed, and parameter rules can specify, for instance, that only the most costly functions will appear in the portions of the source code to be analysed.

[0042] Other techniques can be used based on the source code, such as abstract interpretation and symbolic execution. Abstract Interpretation is a sound over-approximation of all possible runtime states of a program. Symbolic execution tries to explore all possible execution paths of a program.

[0043] One part of source code (or, equivalently, “one portion of source code”) is a non-empty subset of segments of the source code. The segments of different parts may overlap. Each segment can be identified in different ways: for instance, with line numbers, with the absolute positions of the first and last characters in the source code file, or with techniques used in version control systems that track changes in source code (e.g., SCCS - Source Code Control System, or RCS - Revision Control System).

[0044] At the end of this step, each portion has an associated description of its performance analysis: for instance, the result of the performance analysis tool (e.g., GNU gprof), the type of performance issue (e.g., matrix operations) detected in the source code, the level of the performance (with categories such as HighMemoryTransfer, LowComputingTime, ...), etc. Those descriptions can be enhanced with structured descriptions interpretable by IT tools; for instance, such structured descriptions could be encoded using standards such as XML, CSV, or JSON.
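As an illustration of such a machine-interpretable description, a JSON encoding of one portion could look like the following; all field names and values are invented for the example:

```python
import json

# hypothetical structured description of one portion of potential enhancement
portion = {
    "segment": {"file": "solver.c", "first_line": 120, "last_line": 148},
    "tool": "gprof",
    "performance_type": "matrix operations",
    "performance_level": ["HighMemoryTransfer", "LowComputingTime"],
}

encoded = json.dumps(portion)                       # serialise for later steps
decoded = json.loads(encoded)                       # any IT tool can read it back
assert decoded["performance_level"][0] == "HighMemoryTransfer"
```

Later steps (rule selection, graph annotation) can then match preconditions against these fields rather than re-parsing tool output.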

[0045] In step 200, a dependency graph is computed for each portion of potential enhancement.

[0046] When the portions of the (one or more) codes are given as input to this step, the dependency graph (centred on the portions defined in the input) is computed for the portions given in the input, using all information from all preceding steps (inputs, outputs and parameters), in particular the information about the different Virtual Machines and the details of the performance analysis. By doing so, the rules can identify the various options of enhancement and focus on them.

[0047] It is important to understand that each portion has some associated descriptions (see the previous step) which are used to select dependency graph rules for potential enhancements (the selection of rules is made by defining preconditions for rules, and all this is given at the beginning of the method as parameters). For instance, one can decide, in the rules given at the beginning of the method, to focus on GPU memory enhancements or GPU computing throughput enhancements.

[0048] Any representation of the dependency graph related to the software code portion can be used. It is important to note that each portion does not mandatorily include all parts of the software source code needed for the dependency analysis and for the performance analysis, but just describes the information needed for the parameterisation rules to be selected using all information from the preceding steps (including from preceding cycles).

[0049] Different implementations and data structures (or interfaces, as defined in Java) can be used. Most important is to include annotations to trace all (structured) information that was generated in all previous steps. All those annotations allow the use of classical dependency graph techniques from software engineering.

[0050] The output of this step is the dependency graph with all those semantic annotations. See the small examples and the explanations of figures 2, 3 and 4, which give more detailed examples of the computation and information contained in dependency graphs. In particular, those explanations show how to get a dependency graph annotated with abstract interpretation information, which can be used for creating those graphs. Figures 5, 6 and 7 present more complex dependency graphs with annotations. Note that the annotations given in the input of step “b)” can be used for implementing known optimisation techniques. For instance, slicing techniques can be used (see, e.g., https://en.wikipedia.org/wiki/Program_slicing).

[0051] In step 300, a topological proximity between each vertex of the graph and each other vertex is calculated. Based on nearness parameters that are predetermined or selected by the processor, each vertex is compared to each other vertex. A proximity can be true or false between two vertices: for instance, if the proximity is true, then the two vertices can belong to the same cluster. Alternatively or in combination, a proximity can have a metric scale, which associates a value (e.g., absolute, relative, or a ratio, according to measurement theory), and the two vertices will belong to the same cluster under the condition that the value meets a criterion (for instance, being inferior to a predetermined threshold).
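The two proximity variants described in step 300 (a true/false predicate, and a metric value compared against a threshold) can be sketched as follows; the specific criteria (same function, allocation difference) are invented examples of nearness parameters:

```python
def near_boolean(v1, v2):
    """True/false proximity: both vertices use the same function
    (an invented nearness parameter for illustration)."""
    return v1["function"] == v2["function"]

def near_metric(v1, v2, threshold=1024):
    """Metric proximity: the memory allocations of the two vertices
    differ by less than a predetermined threshold."""
    return abs(v1["alloc"] - v2["alloc"]) < threshold

a = {"function": "fib", "alloc": 64}
b = {"function": "fib", "alloc": 512}
c = {"function": "fft", "alloc": 1 << 20}

assert near_boolean(a, b) and near_metric(a, b)        # same cluster candidates
assert not near_boolean(a, c) and not near_metric(a, c)  # kept apart
```

In practice the predicate and the threshold would come from the parameterisation rules rather than being hard-coded.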

[0052] This step is key in comparison to other known methods. As software code is more and more formalised by the semantics of its Virtual Machine, or by the semantics of specification annotations, step “c)” of the method also uses a rigorous approach.

[0053] For the clustering purposes of the next step “d)”, it is useful to have a means of comparison between pairs of vertices. Usually, the comparison is made with distance functions, metrics, or similarity evaluations, etc. This has drawbacks because it is not easy to have an algorithm (or parameterisation rules) for step “c)” that estimates a value (i.e., an instance of a quantity) for the distance, metric or similarity of vertices which themselves represent the execution of functions or data flowing in and out of memory items. Therefore, the nearness (or proximity) theory is used (for references, see for instance https://en.wikipedia.org/wiki/Proximity_space). This is more appropriate because it does not need an algorithm (or rules) to assess “quantities”, but only an algorithm (or rules) that defines relationships between subsets of vertices: A is near B, A is apart from B, A is in the proximal neighbourhood of B (see for instance the axioms and references at https://ncatlab.org/nlab/show/proximity+space).

[0054] For instance, as explained in the example of the Fibonacci function (further discussed below for illustrative purposes), two subsets of vertices that represent calls of the Fibonacci function with an argument that is a small integer (Small Integer) can be declared (automatically, by using parameterisation rules) as near, whereas a subset of vertices that represents function calls having a Small Integer argument and a subset of vertices that represents function calls having a Big Integer argument can be declared as being apart. This makes it easier to write the rules of the parameterisation file. Note also that the parameterisation files can be modified automatically by each step of the method. It is not easy to satisfy the correctness criteria (defined with usual software engineering techniques, with test cases or with formulae, e.g., in OCL) of initial parameterisation rules that change other parameterisation rules: being able to use the most appropriate concept to write those rules is important for their correctness.

[0055] From those basic relationships of nearness, apartness or proximal neighbourhood, additional relationships between vertices are algorithmically inferred from the semantics of the source code and from the formal descriptions of the nearness relationship (see, e.g., the preceding reference to ncatlab, which gives samples of derivation rules). This new graph is defined with vertices representing formulae and edges representing derivations (inferences), and it also includes the basic relationships already computed.
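One round of such derivations can be sketched as follows, using two of the rules discussed in the surrounding paragraphs (apartness implies disjointness; nearness propagates to supersets). Subsets of vertices are frozensets, facts are (relation, A, B) triples, and the candidate supersets are passed in explicitly:

```python
def infer(facts, universe):
    """One round of two sample derivation rules:
    - if A is apart from B, then A is disjoint from B;
    - if A is near B, then every strict superset A' of A
      (drawn from `universe`) is near B."""
    derived = set(facts)
    for rel, a, b in facts:
        if rel == "apart":
            derived.add(("disjoint", a, b))
        elif rel == "near":
            for sup in universe:
                if a < sup:  # strict superset of A
                    derived.add(("near", sup, b))
    return derived

A = frozenset({"fib(3)"})
A2 = frozenset({"fib(3)", "fib(4)"})          # a superset of A
B = frozenset({"fib(5)"})
out = infer({("near", A, B), ("apart", B, A)}, universe={A, A2, B})
assert ("disjoint", B, A) in out              # apart -> disjoint
assert ("near", A2, B) in out                 # superset rule
```

The derived triples correspond to the edges of the new inference graph described above; a real implementation would take its rules from the parameterisation file rather than hard-coding them.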

[0056] For instance, the previous example identified some basic relationships between subsets of vertices that represent Fibonacci function calls, depending on the arguments. From that, we can infer new relationships by using the rule (shown in ncatlab) that if A is apart from B, then A is disjoint from B, or by using another rule saying that if A is near B, then the intersection of A and B is not empty.

[0057] Another example is two subsets of vertices A and B that represent calls of the Fibonacci function with an argument that is a small integer (Small Integer), whose basic relationship is automatically computed, by using parameterisation rules, as near: “A near B”. This is a basic relationship. Continuing the example, an inference rule can be found in the parameterisation file specifying that, for any superset A’ of A, it will be inferred that A’ is near B. This can be done for all the nearness, apartness and proximal neighbourhood properties already mentioned.

[0058] The inference rules are also parametrised and can be used recursively to infer new relationships from preceding ones (with some stopping criteria, such as saturation, or fixpoint, or simply a memory size, computing time or number of recursion calls).
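The recursive application with a stopping criterion described above amounts to a saturation (fixpoint) loop, which can be sketched generically:

```python
def saturate(facts, step, max_rounds=100):
    """Apply the inference `step` repeatedly until saturation (a fixpoint:
    no new facts are produced) or until a simple resource bound
    (`max_rounds`) is hit -- two of the stopping criteria mentioned
    in the text."""
    for _ in range(max_rounds):
        new = step(facts)
        if new == facts:          # fixpoint reached: saturation
            return facts
        facts = new
    return facts                  # resource bound reached

# toy inference step: symmetric closure of a "near" relation
sym = lambda f: f | {(rel, b, a) for (rel, a, b) in f}

closed = saturate({("near", "u", "v")}, sym)
assert ("near", "v", "u") in closed
```

Any monotone inference step (such as the derivation rules of the preceding paragraphs) can be plugged in as `step`; the memory-size and recursion-depth criteria of the text would replace or complement `max_rounds`.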

[0059] Depending on the type of analysis of the source code (matrix multiplication, deep learning training, accounting, ...), and depending on the optimisations needed, different decision procedures for the inferences can be used: for instance, simple automated theorem provers based on saturation (e.g., SPASS, https://en.wikipedia.org/wiki/SPASS), Constraint Programming (CP, https://en.wikipedia.org/wiki/Constraint_programming) or CHR rules (https://en.wikipedia.org/wiki/Constraint_Handling_Rules). Both CP and CHR are implemented in SWI-Prolog (https://en.wikipedia.org/wiki/SWI-Prolog). Inferences can produce normalised expressions of the nearness of pairs of vertices of the dependency graph. (See the explanations of exemplary figures 2, 3 and 4.)

[0060] The parameters can select different definitions (i.e., nearness, apartness or proximal neighbourhoods) and different properties depending on the kind of analysis, such as the computation time, the memory size, or the throughput. Note that the rules of this step use the semantics of the source code (described in the dependency graph and its semantic annotations).

[0061] The output of this step is a new graph of the semantic derivation (inferences). The vertices are sets of vertices (including singletons) containing the vertices of the graph of the preceding step. The edges are the inferences. Annotations contain additional information, such as a new list of properties that uses the axioms and rules of the nearness theory. This new graph is a partial order.

[0062] In step 400, the content of the dependency graph is clustered based on topological proximity (for instance, memory access, similar operations/functions, throughput or pipelining), to obtain coherent clusters. Clustering is often used in image analysis to detect zones of images that have some coherence, such as zones of different colours, or sprites that are moving (when clustering the state transitions of images), and GPUs are designed to compute efficiently on “coherent” zones of images. Clustering the proximities of functions, data kinds, etc. will group and extract the parts of source code statements that can be improved to optimise their computation on a GPU.

[0063] GPU architectures were historically intended to calculate tuples of values for pixels, where the values within a tuple depend on each other (e.g., RGB values) although each pixel is independent from the others. The execution has some kind of coherence because there is some similarity between the processing of different entities (e.g., pixels of sprites), and due to this similarity some architectural resources can be shared to make efficient use of the computing resources. Two kinds of coherence can be pointed out: the coherence of the instructions (operations/functions) executed by each node of the GPU architecture, and the coherence of memory accesses, which makes it possible to optimise the use of the different memory caches.

[0064] The coherence of instructions is based on the similarity of the operations executed on a node of the GPU architecture. Indeed, this allows the instruction code of the operations (or functions) to be loaded once and reused again and again, without having to load the code of different operations.

[0065] The coherence of memory accesses aims to obtain a high throughput of accesses to (or transfers of) the content of memory cells. This is a dynamic execution point of view. Any kind of memory cell is considered here, such as the main memory, the shared memory, the cache memory, the registers, etc. For instance, a high throughput can sometimes be obtained with bulk data transfers between nodes of the GPU architecture, and sometimes when data is “pipelined” through a shared memory by cycling through the same pattern of accesses to the shared memory cells.

[0066] Note also that the dynamics of the code execution is an important part of the coherence to be found. Indeed, this concerns the sequence of memory accesses (main, local, cache, etc.) needed to fetch the instructions to be executed, but also the data to be used with those instructions. Those accesses are not necessarily simultaneous, so a separate analysis of each of these coherences is not efficient. Our solution makes this analysis: 1. conjointly for the different instructions and the different data; 2. from a dynamic perspective (i.e., taking into account the time-varying “flow” of code and data in the processors); and 3. at each stage of the computer-implemented main method.

[0067] The clustering can also optionally be improved by relying on specific ways to detect the clusters of coherent instruction accesses, instruction executions and memory accesses, and by analysing their trends (due to the time-varying small changes or “drifts” of the dynamics of those accesses and executions). For instance, if the source code requires mathematical operations on matrices, such as additions or products, the cells of the matrix are stored in memory cells in row order (the column-order case is similar). One can see that the dynamic changes in the memory accesses are often small during the computation: for a matrix addition, the trend of this change is to use the next cells.
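The trend detection described above can be sketched as follows. This is a hypothetical illustration: the function name `dominant_stride` and the flat address-trace representation are assumptions, not part of the method's specification.

```python
from collections import Counter

def dominant_stride(addresses):
    """Return (stride, fraction of accesses following it), where the
    stride is the most frequent delta between successive addresses."""
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    if not deltas:
        return (0, 0.0)
    stride, count = Counter(deltas).most_common(1)[0]
    return (stride, count / len(deltas))

# Row-order traversal of a 4x4 matrix stored from base address 1000:
trace = [1000 + i for i in range(16)]
stride, share = dominant_stride(trace)
```

For a matrix addition traversing cells in row order, the dominant stride is one cell and it is followed by every access, which is exactly the small, predictable drift mentioned above.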

[0068] The clustering can also optionally take computational paradigms or patterns into consideration to improve the search for clusters and the search for parameter-dependent dynamic trends. For the example of matrix computations, there are patterns of use depending on the kind of matrix, such as triangular matrices, block-based matrices, sparse matrices, etc. The detected paradigms or patterns will guide the analysis concerning the memory accesses that are expected to happen. Other examples of paradigms or patterns are: memoisation; staged increase of precision (e.g. robust convergence of numerical methods such as Newton-Raphson, or staged deep learning); fractal-like computation, which takes into account the similarities at different depth levels for deep parallelisation, allowing partitioning and aggregation to optimise memory access and processor throughput for getting partial results at different depth levels; block-based computation of matrix operations or of arithmetic operations on very big integers or bulk computations; and/or section-based propagation of computation (by analogy with the simulation of heat propagation in a rod).

[0069] As an alternative to clustering, other machine learning techniques can be used such as structured prediction or structured (output) learning, as well as inference techniques used in artificial intelligence such as deduction, induction, abduction or analogy.

[0070] More specifically, the input partial-order graph of the preceding step may be a derivation (inference) graph which is a DAG (Directed Acyclic Graph, https://en.wikipedia.org/wiki/Directed_acyclic_graph). Moreover, because the inference rules are transitive, the description of the DAG can be minimised to its non-transitive edges, so that the description of the graph is smaller, which decreases the memory footprint and computing time of the clustering algorithms that will be used on this graph.
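The minimisation to non-transitive edges is the classical transitive reduction of a DAG; a minimal sketch, assuming the graph fits in memory as a set of edge pairs:

```python
def transitive_reduction(edges):
    """edges: set of (u, v) pairs of a DAG. An edge (u, v) is redundant
    (transitive) if v is still reachable from u without using it."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)

    def reachable(src, dst, skip_edge):
        stack, seen = [src], set()
        while stack:
            n = stack.pop()
            for m in adj.get(n, ()):
                if (n, m) == skip_edge or m in seen:
                    continue
                if m == dst:
                    return True
                seen.add(m)
                stack.append(m)
        return False

    return {(u, v) for u, v in edges if not reachable(u, v, (u, v))}

# a->b->c plus the transitive edge a->c, which the reduction removes:
reduced = transitive_reduction({("a", "b"), ("b", "c"), ("a", "c")})
```

Feeding `reduced` instead of the full edge set to the clustering step preserves reachability while shrinking the graph description.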

[0071] It is preferred to use clustering methods that are compatible with partial orders: for instance, the clustering methods specific to program semantics described in “Abstract clustering for program comprehension”, by Tabacznyj, C. (2000, published on ucl.scienceopen.com), or the ones described in “DAG-structured clustering by nearest neighbors”, by Monath et al., International Conference on Artificial Intelligence and Statistics, PMLR, 2021. Or more general techniques adapted to DAGs can be used (e.g., https://en.wikipedia.org/wiki/Nearest-neighbor_chain_algorithm).

[0072] Another kind of clustering algorithm that is adapted to (inferred) partial orders is based on FCA (Formal Concept Analysis, https://en.wikipedia.org/wiki/Formal_concept_analysis); an example algorithm is explained in “Constraint programming for multi-criteria conceptual clustering”, by Chabert, Maxime, and Christine Solnon, International Conference on Principles and Practice of Constraint Programming, Springer, Cham, 2017.

[0073] Other kinds of algorithms can be used, such as Constrained Clustering (https://en.wikipedia.org/wiki/Constrained_clustering), Hierarchical Clustering (https://en.wikipedia.org/wiki/Hierarchical_clustering), or even Force-directed graph drawing (https://en.wikipedia.org/wiki/Force-directed_graph_drawing) which can give a good approximation of the clustering, as explained in the examples of Fig. 2, 3, 4.

[0074] When additional information is used, such as data resulting from performance analyses (and found in annotations of vertices), clustering methods for mixed-type data (i.e., numerical and categorical data) can be used, including those taking into account a hierarchy of concepts.

[0075] Moreover, by defining rules in the parameter file that depend on the target architecture (CPU/GPU, see the explanation in a paragraph above), one can compute partitions of the vertices (e.g., separating computation vertices from data-transfer vertices) onto different GPU architectural artefacts. This makes it possible to obtain good approximations of clusters that take the GPU architectural artefacts into account.

[0076] The output of this step is a number of clusters described by sets of vertices. The output also contains annotations created by the parameterisation rules, such as those for the GPU architecture, or descriptions that have been used to create the clusters themselves.

[0077] In step 500, the statements and/or expressions of the source code which correspond to a given cluster are identified thanks to traceability links that were managed in all previous steps, starting with the traceability links generated by the tool creating the dependency graph.

[0078] This step uses parametrisation rules specifically foreseen to select the most important clusters to be improved, depending on the content of the clusters. For instance, rules can select the cluster of Fibonacci function calls with small integers to be optimised. Other examples are rules that select two or more clusters conjointly, such as the cluster of deep learning operations and the cluster of operations transferring the sample data to those deep learning operations.

[0079] Note that traceability links to the source code are maintained in the vertex annotations. Those links are used to select parts of the source code (and their segments).

[0080] In step 600, a selection is made for the transformation to be carried out, based on the information flowing from the previous steps.

[0081] The selection of code transformation rules can involve: rescoping of called functions; automatic data splitting (large sets); duplicating information for non-shared memory; replacing dereferenced pointers with local variables; transforming complex loops (if it is feasible to use loops on the GPU); transforming conditions in functions (if feasible); and transforming CPU instructions into GPU instructions (CUDA, ROCm, the OpenACC GPU library, etc.).

[0082] The nature of the parameter that has been chosen for the clustering can be taken into account.

[0083] The rules to be applied when changing the identified portions of code can contain one or more of: changing to a programming language that has better primitives or libraries for the gpGPU architecture; using compiler directives, pragmas or JIT for compilers specialised for the gpGPU architecture; using program transformations specialised for the gpGPU architecture; or using specific tools for the gpGPU architecture (CUDA, Hadoop, loading/unloading data from/to the gpGPU architecture, etc.).

[0084] In step 700, the selected rules are applied to the full source code to obtain an updated full source code. The updated full source code contains more than just the updated lines of code.

[0085] After the transformations have been applied to the portions of code of potential enhancement, rules may also automatically infer other enhancements of the pipeline of data (i.e., bottlenecks) and/or enhancements regarding memory-copying issues from CPU to GPU (to correct or remove some memory copies). Those rules will add to the transformed code (with annotations) the information describing those enhancements or issues. This information is used in the next cycle of the method.

[0086] Step 800 corresponds to a new performance analysis (as in step 100) operated on the updated source code. This performance analysis makes it possible to identify portions of potential enhancement in the updated source code. If the performance analysis shows no potential enhancement, or if an absolute or relative threshold of efficiency is reached, the method outputs this last updated source code as the second source code, i.e., ready for the GPU. If the criteria are not met, the method loops back to the generation of dependency graphs based on the updated portions of potential enhancement.

[0087] In the parameterisation file, there may be rules describing the stopping criteria that end the cycling. For instance, stop when all matrix operations have been analysed (no need to optimise other functions), when all the performance figures computed in step h) have become superior to a threshold, or when a pre-determined number of iterations has already been made.

[0088] Figure 1 also highlights that for each step 100-500, information (input info or input rules) is conveyed and involved in the following step. The input info can be described with ontologies, such as OWL, and rules can be provided with SWRL. Other means can be used for describing the input info and rules, such as rule-based descriptions using, for instance, DROOLS, JESS, PROLOG, etc. The method can also select the most appropriate formalism, and specific knowledge about some kinds of languages (e.g., C or Java) and types of problems (e.g., matrix computations, or speech recognition) can be used.

[0089] With reference to figures 2 to 4, an example of a dependency graph of a source code is given.

[0090] For this example, we will consider the computation of the Fibonacci function. The first source code, i.e., the code that is executed on a CPU and that is to be improved, can be expressed in any language: C++, Java, PHP, XQuery/XSLT, SQL... Its expression in a functional language is: Fib(0)=0; Fib(1)=1; if n>1 then Fib(n)=Fib(n-1)+Fib(n-2).

[0091] Figure 2 shows a dependency graph of this function when calculating Fib(4). In this document “dependency graph” is synonymous with “computation dependency graph” and “computation graph”. The graph shows several occurrences of executions of computations of expressions. The arrows represent the dependencies, i.e., the computation of the source of the arrow depends on the computation of the target of the arrow. In this example we can see that the same expression is computed two or more times, each computation using processor time and memory space. When the data domains are too large (or “infinite”), it is not appropriate to detail all expressions as shown here. Techniques exist to solve that problem, such as abstract interpretation in software engineering, which automatically splits the data domains into a finite set of subdomains and splits the expressions into the different cases of subdomains. So, in our “Fibonacci” example, instead of numbers, there are subdomain names corresponding to subsets of Integers, such as Int0To999 and Int1000To999999, and a vertex can be associated with expressions such as “Fib(Int0To999)”. The proximity defined on those subdomains and clustering methods are used to group vertices that are coherent. Abstract interpretation is based on Galois connections, which map easily onto a topology (https://en.wikipedia.org/wiki/Abstract_interpretation).

[0092] This graph can be expressed as follows:
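For illustration, a figure-2 style computation dependency graph of Fib(4) can be reconstructed programmatically. In this sketch (an assumption made for the example, not the patent's tool), each occurrence of Fib(n) is numbered so that repeated computations become distinct vertices, and each edge records that the caller's computation depends on the callee's result.

```python
def build_graph(n, counter, edges):
    """Return the vertex id of this occurrence of Fib(n); record edges."""
    vid = ("Fib(%d)" % n, counter.get(n, 0))   # occurrence-numbered vertex
    counter[n] = counter.get(n, 0) + 1
    if n > 1:                                  # Fib(n) = Fib(n-1) + Fib(n-2)
        for k in (n - 1, n - 2):
            child = build_graph(k, counter, edges)
            edges.append((vid, child))         # vid depends on child
    return vid

counter, edges = {}, []
build_graph(4, counter, edges)
```

When computing Fib(4), Fib(2) is computed twice and Fib(1) three times: exactly the duplicated work that the memoised representation of figure 3, or a clustering onto the GPU, can factor out.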

[0093] Figure 3 is another way of representing the dependency graph, where only one occurrence of a given execution appears (memoisation or tail-recursion method).

[0094] This graph can be expressed as:

[0095] Figure 4 is yet another representation of the same function with more details, showing each computation (e.g. “f(2)+f(1)”) but also the storing of the input or the result (“f3”) and the transfer of the result between the computation and the store (“f(3) <:= f(2)+f(1)”).

[0096] Figure 4 is expressed as follows:

[0097] The parameters which can be used for establishing a proximity between vertices can be: (a) the functions used (in this case Fib, but also “+”, which is the addition); (b) the scope of the domain and range of the computation (here {0, 1, 2, 3, 4} for the domain and {0, 1, 2, 3} for the range; there could be different domains and ranges for each function or operation); and/or (c) the details of the computation as described using items of the CPU architecture and GPU architecture, for instance the processors (or nodes of the architecture), the memory cells (e.g. shared memory or registers), the transfers between nodes of the architecture, between memory cells or between both. For each item one can consider its type. Details can be given for any item of the CPU and GPU architecture. Here are some examples: capabilities of any item of the GPU architecture (e.g. different kinds of nodes of the architecture, different kinds of memory, and more specific capabilities for basic operations, functions or relations, e.g. +, *, ..., implemented in nodes of the architecture); the speed of access, of transfer or of computation; the type of transfer (transfer “in”, transfer “out”, transfer “in/out”, ...); the size of any item of the gpGPU architecture, such as, for instance, the size of the memory cells, of the registers or of the adders; and the quantity of any of the preceding items, such as 2 gpGPUs, 10 internal registers for each gpGPU, 1000 shared memory cells, etc.

[0098] Each parameter (a), (b), (c) can be considered alone, or they can be analysed simultaneously. Also, additional functions, bigger domains, or more GPU architecture details can be considered (such as, for instance, distinguishing the transfers between all kinds of memory (main memory cells, shared memory cells, local registers, input/output cells, cells used for internet connections, ...), or distinguishing between the different throughputs).

[0099] As shown in the figures, the graphs are sets of vertices and directed edges (G=<N,A>) with the constraint that all vertices used in A are in N.

[0100] The proximity is a relationship between subsets of those sets N and A. Those subsets are subgraphs: G1 is a subgraph of G when G1=<N1,A1> with N1 a subset of N and A1 a subset of A (and with the graph constraint for N1 and A1 stated in the preceding paragraph, i.e. all vertices used in A1 are in N1).

[0101] A parameter of the definition of the proximity is the collection of subsets (or subgraphs) that is considered for the definition of the proximity relationship.

[0102] Example: if we want to define a proximity relationship on the graph having N={n0, n1, n2, n3, n4} as vertices and A={n0←n1, n1←n2, n2←n3, n3←n4} as directed edges, the rules defining the parameters can focus only on the connected subgraphs that contain the vertex “n0” and can define the proximity relationship as simple subgraph inclusion based on the simple set inclusion of edges and vertices, i.e. proximity(G1,G2) is true exactly when G1 is a subgraph of G2 or G2 is a subgraph of G1.

[0103] Another example: the only subsets are SG4={n2←n3, n3←n4}, SG3={n1←n2, n2←n3} and SG2={n0←n1, n1←n2} (shortened definitions of graphs as explained above). The interpretation is that these are all the subsets that have two connected computation dependencies. Two of those subsets are involved in the proximity relationship if they have one edge in common. So, SG2 and SG3 have that proximity, but SG2 and SG4 do not.

[0104] Another example would be to modify the preceding one and use the set of singletons (i.e., subsets containing exactly one edge), the proximity being to share one vertex of the edge.

[0105] Another example would be to select any subgraph, the proximity being to share one vertex of the edges.
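The edge-sharing proximity of paragraph [0103] can be illustrated with a minimal sketch in which subgraphs are represented as sets of directed edges (the tuple encoding of the edges is an assumption made for the example):

```python
def proximity(g1, g2):
    """Two subgraphs are proximal when they share at least one edge."""
    return bool(g1 & g2)

# The three subsets of [0103], each with two connected dependencies:
SG2 = {("n0", "n1"), ("n1", "n2")}   # n0 <- n1, n1 <- n2
SG3 = {("n1", "n2"), ("n2", "n3")}
SG4 = {("n2", "n3"), ("n3", "n4")}
```

SG2 and SG3 share the edge n1 ← n2 and are therefore proximal, whereas SG2 and SG4 have no edge in common.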

[0106] Optionally, the topology used and the proximity used can be determined by starting with the method of Dana Scott (Scott continuity). For instance, when a part of the source code is written in the pure functional programming paradigm (e.g., the core part of the Haskell or Erlang programming languages), the fixpoint analysis method of Dana Scott can be used to define a topology, and this topology can be the basis of the proximity.

[0107] The parameter(s) chosen for the proximities can be one of the above, or any combination of those. For instance, a cross product between different proximities (or, in SQL terms, a full outer join) can be used. Tuples of heterogeneous information can also be used. The information contained in a graph is the information found in each vertex and also the information found in the edges; this includes data and metadata found in graphs. All that information can be represented by tuples of heterogeneous data, which can be clustered by different methods, such as extensions of K-means. Using a more complex analysis of proximities makes it possible to get more detailed clustering and hence a more precise coherence with the GPU architecture.

[0108] A cluster can be expressed as a set of tuples.

[0109] In the example above of the Fibonacci function, the clustering method can be K-means (https://en.wikipedia.org/wiki/K-means_clustering). In that case, for the graph in figure 2, we obtain one cluster for each n: all occurrences of Fib(n) belong to the same cluster, irrespective of the arrows of the graph.
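A minimal one-dimensional K-means sketch illustrating this clustering. The representation of each Fib occurrence by its argument value, and the choice of one initial centroid per distinct argument, are assumptions made for the example; a real run would cluster the annotated graph vertices.

```python
def kmeans_1d(points, centroids, steps=10):
    """Plain 1-D K-means: alternate nearest-centroid assignment and
    centroid recomputation for a fixed number of steps."""
    for _ in range(steps):
        clusters = [[] for _ in centroids]
        for p in points:                      # assign to nearest centroid
            j = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[j].append(p)
        centroids = [sum(c) / len(c) if c else centroids[j]   # recompute means
                     for j, c in enumerate(clusters)]
    return clusters

# Arguments of all Fib occurrences when computing Fib(4) (figure 2):
occurrences = [4, 3, 2, 2, 1, 1, 1, 0, 0]
clusters = kmeans_1d(occurrences, centroids=[0.0, 1.0, 2.0, 3.0, 4.0])
```

With k = 5 the algorithm converges to one cluster per distinct argument n, grouping all occurrences of Fib(n) together.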

[0110] The proximity with the function as parameter, proximity(Func(args1), Func(args2)), is true for the same function (e.g. Fib or “+”), irrespective of its arguments. This proximity is interesting for clustering the various occurrences of each function. In that case, for the graph in figure 4, one cluster will be made with all occurrences of Fib and another cluster with the “addition” function.

[0111] The clustering parameter can be the values (scope or range). For example, the proximity (Fib(x),Fib(y)) can be true only when x<2^64 and y<2^64, or when x>2^64 and y>2^64, or for all other cases of x, y. This parameter is relevant for analysing the memory cells used and distinguishing between the use of memory cells of the size of an Integer (often 2^32 or 2^64 bits) and the use of memory cells for a BigInteger (which can grow as large as the main memory). In that case, for figure 2, the clusters will be made of all Fib calls that both have an Integer argument, or that both have a BigInteger argument, or that are “in between”, with one having an Integer argument and the other a BigInteger argument.
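A minimal sketch of this value-based proximity (the predicate name `value_proximity` is an assumption made for the example): two calls are proximal when both arguments fit in a 64-bit machine word, or when both need a big-integer representation.

```python
WORD = 2 ** 64   # boundary between Integer-sized and BigInteger-sized cells

def value_proximity(x, y):
    """True when x and y fall in the same memory-cell size class."""
    return (x < WORD) == (y < WORD)

small = value_proximity(10, 1000)        # both fit in one word
mixed = value_proximity(10, 2 ** 100)    # one Integer, one BigInteger
```

The "in between" case of the paragraph above is exactly the case where the predicate is false: one argument fits in a word and the other does not.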

[0112] A clustering parameter can also be the existence of an arrow between Fib(x) and Fib(y). In that case, for the figure 2, clusters are made of all similar arrows (even of different occurrences).

[0113] A clustering parameter can be the existence of a directed path between Fib(x) to Fib(y) containing only ‘Fib’.

[0114] These K-means parameters are simple examples to illustrate the invention. However, as already mentioned above, various parameters and information can be used in the graphs, and those parameters and information can be used to define the proximity relationship. Moreover, high-dimensional clustering spaces can be used with combinations of these parameters (cross products of different proximity definitions). It is sometimes appropriate to use other kinds of clustering methods, such as hierarchical clustering methods, constrained clustering methods, biclustering methods, self-organising map methods, or other Machine Learning (ML) methods, such as Deep Learning (DL).

[0115] Another way to cluster our data is to use algorithms from “visual” clustering methods without displaying anything. For instance, it is possible to use the various alternatives of force-directed methods with multiple clusters (force-directed graph). Cluster positions may be defined with multiple foci or with boxes that can adapt to the size of each cluster. The “forces” of the force-directed graphs will be based on the proximity defined in the graphs, and so will behave in such a way that some kind of coherence is obtained between parts of the graphs after a number of iterations. The data obtained at the final iteration can be used to feed the next steps (500, 600). Indeed, without displaying the graph, we can collect the different clusters by detecting vertices whose coordinates are close to each other.
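A hypothetical sketch of this "visual clustering without display": a one-dimensional force-directed pass where proximal vertices attract each other and non-proximal vertices repel, after which the clusters are read off the final coordinates. The force constants, the 1-D layout and the vertex names are all assumptions made for the illustration.

```python
def force_cluster(vertices, proximal, pos, steps=200):
    """pos: dict vertex -> coordinate. Returns the final coordinates."""
    for _ in range(steps):
        new = dict(pos)
        for v in vertices:
            for w in vertices:
                if v == w:
                    continue
                d = pos[w] - pos[v]
                if (v, w) in proximal or (w, v) in proximal:
                    new[v] += 0.05 * d                     # attraction
                else:
                    new[v] -= 0.01 * d / (abs(d) + 1e-9)   # mild repulsion
        pos = new
    return pos

verts = ["a1", "a2", "a3", "b1", "b2", "b3"]
prox = {("a1", "a2"), ("a2", "a3"), ("a1", "a3"),   # two proximity cliques
        ("b1", "b2"), ("b2", "b3"), ("b1", "b3")}
pos = force_cluster(verts, prox, {"a1": 0.0, "a2": 1.0, "a3": 2.0,
                                  "b1": 3.0, "b2": 4.0, "b3": 5.0})
a_pos = [pos[v] for v in ("a1", "a2", "a3")]
b_pos = [pos[v] for v in ("b1", "b2", "b3")]
```

Without drawing anything, each proximity clique collapses to a tight group of coordinates while the groups drift apart, so the clusters can be collected by grouping vertices within a small distance of each other.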

[0116] Other examples of clustering: on figure 2 we could use the occurrence of each box, and the proximity would be based on the existence of a direct dependency between boxes. We can then use the number (and optionally the types) of GPUs as different foci for clustering, and use the force layout (or any other clustering method) to get an idea of the assignment of computations to the different device nodes (GPUs).

[0117] Figure 2 can also be clustered based on the expression considered in each vertex (diagrammatically represented with a box). In that case the proximity can depend on the expression: multiple occurrences of the same expression will have a high level of proximity. If we use complex proximity definitions, the proximity can depend on the proximity of the arguments (input arguments and/or output arguments), or on the logarithm (base 2^16) of those arguments to show that multiple memory-cell words will be used to store those arguments, etc. The latter example also shows that functions (e.g., “+”) can be distinguished depending on the logarithm of the arguments, for instance when using different source code algorithms for “big integers” (i.e., having different functions to run on the GPU architecture nodes). Some of those source code algorithms split the big numbers into blocks of digits, make the additions of the different parts in parallel, then assemble the partial results. This can be used in the definition of the parameters of the graph and in the definition of the proximity.

[0118] As can be seen above, the core of the method lies in the fact that clusters are created based on information about the source code. They may take into consideration information about the CPU architecture and also the GPU architecture by means of the topology-based proximity relationships.

[0119] Figure 5 relates to a further example. We consider the following program containing two independent loops:

float x = 0.01;
float y[100];
float z[100];
for (int i = 0; i < 100; i++) {
    y[i] = Func(x, i);
    z[i] = x * y[i];
}
float t[100];
for (int i = 0; i <= 100; i++) {
    t[i] = t[i+1] * x;
}

[0120] The dependency graph corresponding to this program, enriched with information regarding the performance analysis, is illustrated on figure 5. We apply the same technique described from paragraph [0050] onwards. All subexpressions of the function calls are detailed. In this case this will remove the loops and replace them with the instances of the iteration, e.g. “y[25]=Func(x,25); z[25]=x*y[25];”. The resulting dependency graph will not be deep because the only shared dependency concerns “x”. The clustering parameter imposes a number of clusters that are mapped to the GPUs of the selected architecture. The set of clusters shows the parallelisation onto different GPUs. All similar computations and subsequences of successive data of vectors will be found in a cluster assigned to a GPU.

[0121] Figure 6 shows the clusters produced by the method of the invention and the corresponding expressions in the code. Here we see that if the parameters impose 2 clusters (corresponding to two GPUs), then the computations on the vectors “y” and “z” are similar and will be found in a first cluster, and the computations on the vector “t” will be found in the other cluster. The two clusters, mapped onto the source code as in step 700, are shown on that figure.

[0122] Based on identified issues, removing the dependency when feasible, relying on an optimisation strategy and/or target architecture specificities (topology) and relying on a parallelisation library (e.g., OpenMP/OpenACC pragmas, C CUDA, RAPIDS in Python, ...), the rules of transformation can, for example, dictate to generate for each loop in the clusters the directive “#pragma acc loop” or “#pragma acc parallel” and to move up the variable initialisation (of the variable t in the code below). The second source code, adapted for GPU computing, reads:

float x = 0.01;
float y[100];
float z[100];
float t[100];
#pragma acc parallel
#pragma acc loop
for (int i = 0; i < 100; i++) {
    y[i] = Func(x, i);
    z[i] = x * y[i];
}
#pragma acc loop
for (int i = 0; i <= 100; i++) {
    t[i] = t[i+1] * x;
}
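The pragma-generating rule above can be sketched as a purely textual rewrite. This is a hypothetical illustration: the real transformation would operate on the dependency graph and its traceability links, not on raw source lines, and the function name `insert_acc_pragmas` is an assumption.

```python
def insert_acc_pragmas(lines, loop_lines):
    """lines: source lines. loop_lines: indices of loop headers that
    belong to a selected cluster; a "#pragma acc loop" directive is
    emitted just before each of them, preserving indentation."""
    out = []
    for i, line in enumerate(lines):
        if i in loop_lines:
            indent = line[:len(line) - len(line.lstrip())]
            out.append(indent + "#pragma acc loop")
        out.append(line)
    return out

code = ["float t[100];",
        "for (i = 0; i < 100; i++) {",
        "    t[i] = t[i+1] * x;",
        "}"]
ported = insert_acc_pragmas(code, {1})   # the loop header is line 1
```

The result keeps every original line and adds the directive immediately before the clustered loop, mirroring the transformation shown above.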

[0123] Figure 7 relates to a further example. We consider the following program containing two inter-related loops:

float x = 0.01;
float y[100];
float z[100];
for (int i = 0; i < 100; i++) {
    y[i] = Func(x, i);
    z[i] = x * y[i];
    x++;
}
float t[100];
for (int i = 0; i <= 100; i++) {
    t[i] = t[i+1] * x;
}

[0124] The dependency graph is drawn on figure 7. When applying steps (300) and (400), the “x++” can go in a cluster of its own (if the parameter concerning the number of clusters is sufficiently large), and there is a directed edge from the final result of “x” in the first loop to the “x” found in the second loop (there are also some other directed edges from “x” to other vertices). This fact is used in the next steps.

[0125] Based on identified issues, removing the dependency when feasible, relying on an optimisation strategy and/or target architecture specificities (topology) and relying on a parallelisation library (e.g., OpenMP/OpenACC pragmas, C CUDA, RAPIDS in Python, ...), the rules of transformation can be applied. For the example in figure 7, for the cluster related to the “x++” (obtained by steps 300 and 400 as described in the preceding paragraph), a rule dictates to pre-compute the (predicted value of) x at the end of the loop and to apply the strategy used for the example of figures 5-6. Thus, for each identified dependent {variable}, create a new statement ({variable}Predict = PredictedValue({variable})). For each statement using the identified dependent {variable}, change {variable} into {variable}Predict. PredictedValue is a function that computes the value from the dependency graph. Here: xPredict: the original x depends on for(0..99), i.e. 100 iterations of x=x+1, and the initial value is 0.01, therefore xPredict = 100 + 0.01 = 100.01.
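The {variable}Predict rule can be sketched as follows. The helper names `predicted_value` and `rewrite` are assumptions made for the illustration; note that the first loop runs for i = 0..99, so x++ executes 100 times starting from the initial value 0.01.

```python
def predicted_value(init, increments):
    """Pre-compute the final value of a variable that is only
    incremented inside the loop (100 iterations of x = x + 1)."""
    return init + increments

def rewrite(statements, var):
    """Emit the {variable}Predict definition and replace uses of var
    by varPredict in the dependent statements."""
    pred = var + "Predict"
    return (["%s = PredictedValue(%s);" % (pred, var)] +
            [s.replace(var, pred) for s in statements])

x_predict = predicted_value(0.01, 100)           # final value of x
new_code = rewrite(["t[i] = t[i+1] * x;"], "x")
```

Applying the rule breaks the dependency between the two loops, after which the strategy of figures 5-6 (one cluster and one pragma per loop) applies to the second loop as well.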

[0126] The second source code, transformed by the method of the invention, reads:

float x = 0.01;
float y[100];
float z[100];
for (int i = 0; i < 100; i++) {
    y[i] = Func(x, i);
    z[i] = x * y[i];
    x++;
}
xPredict = 100.01;
float t[100];
for (int i = 0; i <= 100; i++) {
    t[i] = t[i+1] * xPredict;
}

[0127] Optionally, the method of the invention further comprises the execution of the second source code on the CPU-GPU architecture.

[0128] Optionally, static analyses can be combined with dynamic performance analyses to detect (without running the program) points of potential enhancement.

[0129] Optionally, the method of the invention can contain a step in which the updated source code is annotated with comments. These comments can contain explanations for a programmer regarding the changes that have been performed or the gain of performance obtained by a particular change. Each explanation can have an identifier and a type and can contain some information that is specific to the explanation (an ontology written in RDF, OWL, ..., or any other kind of rigorous taxonomy can be used).