Title:
MACHINE LEARNING MODEL PERFORMANCE FORECASTING
Document Type and Number:
WIPO Patent Application WO/2022/164454
Kind Code:
A1
Abstract:
A machine learning model (MLM) performance forecasting system including a processor to execute forecasting instructions including an input, translator, and predictor modules. The input module executes to receive model files, each describing an architecture of a corresponding MLM, and hardware execution profiles, each corresponding to an ML engine and defining different types, numbers, and dependencies of hardware embodied procedures and operating metrics of each procedure. The translator module executes to derive a model execution profile of a selected MLM from its model file defining types, numbers, ordering and dependencies of ML procedures to execute the selected MLM on a selected ML engine, each ML procedure mapped to a sequence of one or more hardware embodied procedures of the selected ML engine. The predictor module executes to derive a performance forecast of the selected MLM on the selected ML engine based on the corresponding model execution profile and hardware execution profiles.

Inventors:
SUBRAMANIAM RAVI (US)
SILVA ADAM (BR)
GUIMARAES CAIO (BR)
SECRETO HUGO (BR)
NISHI HENRIQUE (BR)
SOUZA JOAO (BR)
MENDES LEANDRO (BR)
TREVISAN VINICIUS (BR)
Application Number:
PCT/US2021/015924
Publication Date:
August 04, 2022
Filing Date:
January 29, 2021
Assignee:
HEWLETT PACKARD DEVELOPMENT CO (US)
International Classes:
G06F17/40; G06N3/08; G06N20/00
Domestic Patent References:
WO2020131187A2, 2020-06-25
Foreign References:
US20150317589A1, 2015-11-05
US10319476B1, 2019-06-11
US20180045855A1, 2018-02-15
Attorney, Agent or Firm:
CARTER, Daniel J. et al. (US)
Claims:
CLAIMS

1. A machine learning model (MLM) performance forecasting system comprising:
memory to store forecasting instructions including an input module, a translator module, and a predictor module; and
a processor to execute:
the input module to receive:
model files, each model file having a representational format describing an architecture of a corresponding MLM; and
hardware execution profiles, each hardware execution profile defining the operation of a corresponding ML engine, including: different types, numbers, and dependencies of hardware embodied procedures of the ML engine; and operating metrics of each hardware embodied procedure;
the translator module to derive a model execution profile of a selected MLM from its corresponding model file, the model execution profile defining different types, numbers, ordering, and dependencies of ML procedures to execute the MLM on a selected ML engine, each ML procedure mapped to a sequence of one or more hardware embodied procedures of the selected ML engine; and
the predictor module to derive a performance forecast of the selected MLM on the selected ML engine based on the model execution profile of the selected MLM and the hardware execution profile of the selected ML engine.

2. The system of claim 1, the performance forecast defining one or more predicted performance metrics including a latency, power consumption, energy consumption, and memory consumption.

3. The system of claim 1, based on the mapping of the ML procedures of the model execution profile to the hardware embodied procedures of the hardware execution profile, the predictor module to aggregate total numbers of each type of hardware embodied procedure and derive the one or more performance metrics based on the aggregated totals and the operating metrics of each type of hardware embodied procedure corresponding to the one or more metrics.

4. The system of claim 3, the operating metrics including a latency of executing the selected MLM on the selected ML engine, the predictor module to determine the latency based on the aggregated total numbers of each type of hardware embodied procedure, a corresponding latency of each hardware embodied procedure included in the operating metrics, and a dependency defining a number of each type of hardware embodied procedure able to be processed in parallel.

5. The system of claim 3, the operating metrics including an amount of power and an amount of energy to execute the selected MLM on the selected ML engine, the predictor module to determine the power and energy based on the aggregated total numbers of each type of hardware embodied procedure, and a corresponding amount of power and energy of each hardware embodied procedure included in the operating metrics.

6. The system of claim 3, the model execution profile including an estimated memory consumption to execute the ML procedures, the performance forecast including a prediction of whether the selected ML engine has memory capacity to execute the selected MLM based on the estimated memory consumption and a memory capacity of the selected ML engine defined as a dependency of the hardware execution profile.

7. The system of claim 1, the hardware embodied procedures including computational operations, memory operations, dependencies, and inter-element communications.

8. The system of claim 1, the dependencies of the hardware execution profile including parallel processing capabilities, memory capacity, and operational constraints.

9. The system of claim 1, the ML procedures including ML functions and ML operations.

10. The system of claim 9, the ML operations including mathematical functions, mathematical operations, memory operations, and data transfers.

11. The system of claim 1, the hardware embodied procedures including computational operations, memory operations, and inter-element communications.

12. The system of claim 1, the translator module to provide model execution profiles from model files having a plurality of representational formats including open neural network exchange (ONNX) format, neural network exchange format (NNEF), and a plurality of proprietary formats.

13. The system of claim 1, the types and numbers of ML procedures in the model execution profile depending on the hardware execution profile of the selected ML engine.

14. A method of forecasting machine learning model (MLM) performance on machine learning (ML) engines comprising:
receiving model files, each model file having a representational format describing an architecture of a corresponding MLM;
receiving hardware execution profiles, each hardware execution profile defining the operation of a corresponding ML engine and defining different types, numbers, and dependencies of hardware embodied procedures of the ML engine, along with operating metrics of each hardware embodied procedure;
deriving a model execution profile of a selected MLM from its corresponding model file, the model execution profile defining different types, numbers, ordering, and dependencies of ML procedures to execute the MLM on a selected ML engine, each ML procedure mapped to a sequence of one or more hardware embodied procedures of the selected ML engine; and
deriving a performance forecast of the selected MLM on the selected ML engine based on the model execution profile of the selected MLM and the hardware execution profile of the selected ML engine.

15. A computing device comprising:
a machine learning (ML) platform including:
a plurality of ML engines;
a memory storing:
a plurality of model files, each model file representative of an architecture of a corresponding ML model (MLM);
a plurality of hardware execution profiles, each defining the operation of a different one of the plurality of ML engines and including: different types, numbers, and dependencies of hardware embodied procedures of the ML engine; and operating metrics of each hardware embodied procedure; and
forecasting instructions including a translator module and a predictor module; and
an MLM manager, for each ML engine of a number of selected ML engines of the plurality of ML engines, the MLM manager to execute:
the translator module to derive a model execution profile of a selected MLM from its corresponding model file, the model execution profile defining different types, numbers, ordering, and dependencies of ML procedures to execute the MLM on the ML engine, each ML procedure mapped to a sequence of one or more hardware embodied procedures of the ML engine; and
the predictor module to derive a performance forecast of the selected MLM on the ML engine based on the model execution profile of the selected MLM and the hardware execution profile of the ML engine;
the MLM manager to run the selected MLM on one of the ML engines of the number of selected ML engines based on the performance forecasts of each of the number of selected ML engines.

Description:
MACHINE LEARNING MODEL PERFORMANCE FORECASTING

Background

[0001] The application of artificial intelligence, including machine learning, is growing at a rapid pace. Various machine learning frameworks have been developed, including open source machine learning libraries to enable developers to more easily design, train, and validate machine learning models. Additionally, artificial intelligence/machine learning processing engines are continually being developed which are specifically designed to accelerate processing of machine learning models.

Brief Description of the Drawings

[0002] Figure 1 is a block and schematic diagram generally illustrating a machine learning model performance forecaster, according to one example.

[0003] Figure 2 is a schematic diagram generally illustrating a machine learning model, according to one example.

[0004] Figure 3 is a schematic diagram generally illustrating an operational flow diagram, according to one example.

[0005] Figure 4 is a flow diagram illustrating a method of forecasting performance of machine learning models on known hardware, according to one example.

[0006] Figure 5 is a block and schematic diagram generally illustrating a computing system for implementing a machine learning model performance forecaster, according to one example.

[0007] Figure 6 is a block and schematic diagram generally illustrating a computing system including a performance forecaster, according to one example.

Detailed Description

[0008] In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific examples in which the disclosure may be practiced. It is to be understood that other examples may be utilized, and structural or logical changes may be made without departing from the scope of the present disclosure. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims. It is to be understood that features of the various examples described herein may be combined, in part or whole, with each other, unless specifically noted otherwise.

[0009] The use of artificial intelligence (AI), including machine learning (ML), is growing at a rapid pace. In addition to proprietary systems, various ML frameworks are available from multiple providers (e.g., PyTorch by Facebook, MXNet by Apache, and TensorFlow by Google, to name a few) which provide open-source machine learning libraries and tools to enable developers to more easily design, train, validate, and deploy machine learning models (MLMs). Additionally, various vendors are continually developing AI/ML processing engines (also sometimes referred to as ML engines, hardware accelerators (MLAs), or Neural Processing Units (NPUs)) specifically designed to accelerate processing of MLMs. ML engines are application-specific or domain-specific integrated circuits (ASICs) typically having multi-core designs (e.g., hundreds or thousands of processing cores or compute elements) and employing both low and high precision arithmetic along with optimized dataflow architectures and memory use (e.g., in-memory computing) to accelerate calculation and increase computational throughput when processing MLMs. Typically, different ML engines are designed to optimize processing of one or more different types of MLMs.

[0010] MLMs are stored as model files having a representational data format which describes the architecture of the model (e.g., input, output, and hidden layers, layer weights, nodes of each layer, interconnections between nodes of different layers, and ML operations of each node/layer) along with operating parameters and, thus, describe or represent a process flow between input and output layers of an MLM. After development, it is advantageous for an MLM to be deployable in environments other than the environment or framework in which the model was initially trained (i.e., to be “portable”), such as on any number of different ML engines, for example.

[0011] However, because operational nomenclature and representational data formats of model files vary between different ML frameworks (e.g., each provider’s framework may have a proprietary representational data format), and because hardware capabilities vary widely between different ML engines (such as types of processing available, parallel processing capabilities, memory capacity and configurations, and data path structures, for example), the performance of an MLM when deployed on a given ML engine (e.g., latency of operation, power consumption) is unknown. In fact, an MLM may fail to function altogether on a given ML engine, such as if the engine has inadequate memory capacity.

[0012] As a result, an MLM must currently be installed and run on a given ML engine in order to determine/benchmark its performance thereon. If a model’s performance fails to meet expectations or requirements, the model’s architecture may be modified in attempts to improve its performance (e.g., layers may be combined/separated, parameters modified). However, the modified MLM must be re-run on the given ML engine in order to benchmark its adjusted performance. Such processes are time consuming and costly.

[0013] The present disclosure provides a technique to forecast performance of MLMs on ML engines without installing and running the models thereon. In examples, which are described in greater detail below, a model execution profile of a selected MLM, which is derived from a corresponding model file, is evaluated against a hardware execution profile of a selected ML engine to provide a prediction or forecast of the performance of the selected MLM on the selected ML engine. Such a performance forecast may include performance metrics such as an operational latency and energy consumption to execute the selected MLM on the selected ML engine, for example.

[0014] Figure 1 is a block and schematic diagram generally illustrating an MLM performance forecaster 20 for predicting a performance of an MLM on an ML engine, according to one example of the present disclosure. According to the illustrated example, performance forecaster 20 includes a memory 22 to store forecasting instructions 24, and a processor 26 to execute forecasting instructions 24 to provide a performance forecast of a selected MLM on a selected ML engine. In one example, forecasting instructions 24 include an input module 32, a translator module 34, and a predictor module 36.

[0015] In one example, processor 26 executes input module 32 to receive hardware execution profiles 38 (illustrated as hardware execution profiles 38-1 to 38-n) and model files 40 (illustrated as model files 40-1 to 40-n). As will be described in greater detail below, each hardware execution profile 38 defines the operation of a corresponding ML engine 39 (illustrated as ML engines 39-1 to 39-n) including different types, numbers, and dependencies of hardware embodied procedures along with corresponding operating metrics of each procedure, such as an execution latency, for example.
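
For illustration, a hardware execution profile 38 of this kind might be represented as in the following minimal Python sketch; the field names (latency_ns, power_mw, parallel_lanes, memory_capacity_bytes) and the numeric values are assumptions made for exposition, not part of the disclosure:

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class HardwareProcedure:
    """One hardware embodied procedure of an ML engine (e.g., add, multiply)."""
    latency_ns: float      # operating metric: execution latency per operation
    power_mw: float        # operating metric: power draw while executing
    parallel_lanes: int    # dependency: how many such operations run in parallel

@dataclass
class HardwareExecutionProfile:
    """Abstracted description of one ML engine 39 (e.g., profile 38-1)."""
    engine_name: str
    memory_capacity_bytes: int    # constraint: model and parameters must fit
    procedures: Dict[str, HardwareProcedure] = field(default_factory=dict)

# Hypothetical profile for a first ML engine.
profile_1 = HardwareExecutionProfile(
    engine_name="engine-1",
    memory_capacity_bytes=8 * 1024 * 1024,
    procedures={
        "add": HardwareProcedure(latency_ns=0.5, power_mw=1.0, parallel_lanes=256),
        "multiply": HardwareProcedure(latency_ns=0.7, power_mw=1.5, parallel_lanes=256),
    },
)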

[0016] In one example, each model file 40 has a representational format describing an architecture (e.g., a neural network (NN) model) of a corresponding MLM 42 (illustrated as MLMs 42-1 to 42-n), the model architecture including input, output, and hidden layers, layer weights, numbers of nodes of each layer, interconnections between nodes, ML procedures of each node/layer, and parameters and ordering of ML procedures, for example. In examples, each model file 40 may have one of any number of representational formats. In one case, each model file 40 has an Open Neural Network Exchange (ONNX) file format. The ONNX format is an open-source format (e.g., common sets of ML functions, operations, and sub-operations, and sets of parameters) which describes the architecture and parameters of an MLM 42, and which enables developers to more easily move MLMs between frameworks for MLM training and inferencing to provide network architecture flexibility. In other examples, model files 40 may have one of any number of suitable representational file formats other than the ONNX format, such as NNEF, and any number of proprietary representational formats.
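
As a concrete illustration of reading an architecture out of a model file 40, the open-source onnx Python package (one possible tool; the disclosure does not prescribe any particular library) exposes the graph of an ONNX file, from which ML procedure types and counts can be tallied. The file path below is hypothetical:

import onnx
from collections import Counter

# Load a model file 40 in the ONNX representational format.
model = onnx.load("model.onnx")

# Each graph node carries an op_type (e.g., Conv, PRelu, MaxPool),
# corresponding to the ML procedures described above.
op_counts = Counter(node.op_type for node in model.graph.node)
for op_type, count in op_counts.most_common():
    print(op_type, count)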

[0017] In one example, as will be described in greater detail below, processor 26 executes translator module 34 to derive a model execution profile 44 of a selected MLM 42 from its corresponding model file 40 (e.g., an ONNX file). In one example, the model execution profile 44 defines different types, numbers, ordering, and dependencies of ML procedures to execute the MLM 42 on a selected ML engine 39, with each ML procedure being mapped to a hardware embodied procedure, or a sequence of hardware embodied procedures, of the selected ML engine 39 as defined by the corresponding hardware execution profile 38. In examples, ML procedures may include any number of ML functions (e.g., convolution function) and ML operations (e.g., mathematical operations, mathematical functions, memory operations, data transfers).
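
A minimal sketch of this translation step, assuming a per-engine table that maps each ML procedure to its sequence of hardware embodied procedures (the table entries are illustrative, not taken from the disclosure):

from typing import Dict, List, Tuple

# Hypothetical per-engine mapping: ML procedure -> sequence of
# (hardware embodied procedure, count per mapped unit of work).
ENGINE_1_MAP: Dict[str, List[Tuple[str, int]]] = {
    "Conv": [("multiply", 1), ("add", 1)],   # one multiply-accumulate per unit
    "PRelu": [("multiply", 1)],
    "MaxPool": [("compare", 1)],
}

def translate(ml_procedure_units: Dict[str, int],
              mapping: Dict[str, List[Tuple[str, int]]]) -> Dict[str, int]:
    """Aggregate hardware embodied procedure totals for a model execution
    profile 44, given units of work per ML procedure type."""
    totals: Dict[str, int] = {}
    for proc, units in ml_procedure_units.items():
        for hw_proc, per_unit in mapping.get(proc, []):
            totals[hw_proc] = totals.get(hw_proc, 0) + units * per_unit
    return totals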

[0018] In examples, after derivation of model execution profile 44 of the selected MLM 42, processor 26 executes predictor module 36 to derive a performance forecast 46 of the selected MLM 42 on the selected ML engine 39, where such derivation is performed based on model execution profile 44 and the hardware execution profile 38 of the selected ML engine 39. In examples, as will be described in greater detail below, based on the mappings of the ML procedures of the model execution profile 44 of the selected MLM 42 to the hardware embodied procedures defined by the hardware execution profile 38 of the selected ML engine 39, and based on the types, numbers, ordering, and dependencies of the ML procedures and on the operating metrics of each hardware embodied procedure, including a corresponding latency, predictor module 36 derives the performance forecast 46 of the selected MLM 42 on the selected ML engine 39. In examples, performance forecast 46 includes at least a predicted latency to execute the selected MLM 42 on the selected ML engine 39. In other cases, in addition to latency, performance forecast 46 may include any number of other performance metrics, such as an amount of energy consumed to execute the selected MLM 42, and whether the selected ML engine 39 has sufficient memory to execute the selected MLM 42, for example.
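
The derivation described in this paragraph might reduce to an aggregation of the form sketched below; the unit conventions (nanoseconds, milliwatts) and the lane-based parallelism model are assumptions carried over from the earlier profile sketch:

import math

def forecast(hw_totals, procedures):
    """Derive a simple performance forecast 46.

    hw_totals: {"add": count, ...} from a model execution profile 44.
    procedures: {"add": {"latency_ns": ..., "power_mw": ..., "parallel_lanes": ...}}
    """
    latency_ns = 0.0
    energy_pj = 0.0
    for name, count in hw_totals.items():
        p = procedures[name]
        # Dependency: parallel_lanes operations of this type run concurrently,
        # so count operations execute in ceil(count / lanes) waves.
        waves = math.ceil(count / p["parallel_lanes"])
        latency_ns += waves * p["latency_ns"]
        # Per-operation energy: power (mW) * time (ns) = picojoules.
        energy_pj += count * p["power_mw"] * p["latency_ns"]
    return {"latency_ns": latency_ns, "energy_pj": energy_pj}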

[0019] As mentioned above, different ML engines 39 may be optimized for different types of MLMs 42 (e.g., convolutional models, deep stacking, stochastic, spiking, etc.) such that hardware architectures, including the processing types and capacities (e.g., numbers and types of discrete processing cores optimized to perform different operations, microcontrollers, CPUs), memory capacities and configurations, and data path structures, for example, vary between different ML engines 39. As a result, the types, numbers, and dependencies of hardware embodied procedures and corresponding operating metrics, as defined by hardware execution profiles 38, will vary between different ML engines 39. For example, a first ML engine and a second ML engine may each include a large number of processing cores (sometimes referred to as neural processors) which are optimized to perform certain granular hardware-embodied procedures (e.g., add and multiply), where such optimized hardware-embodied procedures may vary between the first and second ML engines depending on the types of MLMs 42 the ML engines are optimized to execute. Also, the first and second ML engines may each include microcontrollers and CPUs to process more complex (or less frequently occurring) hardware-embodied procedures where, again, such microcontroller- and/or CPU-performed procedures may vary between the first and second ML engines.

[0020] Dependencies may also vary between the first and second ML engines. Dependencies describe operating characteristics of an ML engine, such as a degree of processing parallelism (e.g., how many hardware-embodied procedures of a given type may be executed in parallel), ordering of procedures (e.g., whether procedures execute serially relative to one another, such as an “add” procedure needing to be performed before a “multiply” procedure, and, conversely, whether certain procedures cannot be processed consecutively relative to one another), memory operations (e.g., whether an output of a first procedure is provided directly to a second procedure or is transferred to memory before being provided to the second procedure), and operational constraints. Constraints describe various operating restrictions of an ML engine, such as memory capacity (sufficient capacity to execute and store parameters of a given MLM) and minimum execution times for certain memory operations, for example.
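
For exposition, the dependency and constraint information described in this paragraph could be captured in a structure like the following sketch; every field name here is an assumption:

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class EngineDependencies:
    """Illustrative dependency/constraint fields of a hardware execution profile 38."""
    parallel_lanes: Dict[str, int]         # e.g., {"add": 256, "multiply": 256}
    must_precede: List[Tuple[str, str]]    # forced ordering, e.g., [("add", "multiply")]
    buffered_outputs: bool                 # True if outputs round-trip through memory
    memory_capacity_bytes: int             # constraint: model plus parameters must fit
    min_memory_op_ns: float                # constraint: floor on memory operation time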

[0021] Because of the variations in hardware embodied procedures and dependencies between ML engines 39, a same MLM 42 may execute differently on different ML engines 39. For example, while a first ML engine may be optimized to process a given ML procedure of the MLM 42 via a single optimized hardware-embodied procedure (e.g., processing via a dedicated neural processor), a second ML engine may process the given ML procedure via multiple hardware-embodied procedures.

[0022] In examples, translator module 34 derives a model execution profile 44 for a selected MLM 42 from its corresponding model file 40 in view of the hardware execution profile 38 of the selected ML engine 39 for which the performance of the selected MLM 42 is being forecast. In examples, translator module 34 derives a model execution profile 44 including ML procedures which can be mapped to a hardware embodied procedure, or to a sequence of hardware embodied procedures, defined by the hardware execution profile 38 of the selected ML engine 39. For example, in one case, the selected ML engine 39 may execute a given ML function (e.g., a convolution) using a sequence of hardware embodied procedures (e.g., add and multiply procedures). In such a case, translator module 34 may represent such ML function in the corresponding model execution profile 44 as multiple sub-procedures (e.g., add and multiply) that can be mapped to hardware-embodied procedures defined in the hardware execution profile 38 of the selected ML engine 39. In another case, the selected ML engine 39 may execute a given ML function via a microcontroller, such that translator module 34 may represent such ML function in the corresponding model execution profile 44 as a single ML procedure rather than multiple sub-procedures. As such, in examples, translator module 34 derives model execution profiles 44 defining ML procedures at the level of granularity at which the hardware embodied procedures are defined in the hardware execution profile 38 of the selected ML engine 39, so that each ML procedure can be mapped.
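
To make the granularity point concrete, the same Conv function could translate differently on two engines, as in this sketch (both mappings are hypothetical):

# Engine A exposes granular add/multiply procedures, so the translator expands
# a convolution into sub-procedures, one pair per multiply-accumulate:
CONV_ON_ENGINE_A = [("multiply", 1), ("add", 1)]

# Engine B runs convolutions on a microcontroller as one opaque procedure,
# so the translator keeps the convolution as a single ML procedure:
CONV_ON_ENGINE_B = [("conv_uc", 1)]

# In both cases, the ML procedures in the model execution profile 44 land
# exactly at the granularity of the engine's hardware execution profile 38.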

[0023] Figure 2 is a schematic diagram generally illustrating an example of an MLM 42 implemented as a neural network (NN) 60, and graphically illustrates an architecture of NN 60 (such as input, output, and hidden layers, numbers of nodes at each layer, and interconnectivity between layers, for example) which is described by the representational format of the corresponding model file 40. NN 60 includes an input layer 62 including a plurality of input nodes 64-1 to 64-n, an output layer 66 including a plurality of output nodes 68-1 to 68-k, and a plurality of hidden layers 70, such as illustrated by hidden layer 72 having a plurality of hidden nodes 74-1 to 74-m, and hidden layer 76 having a plurality of hidden nodes 78-1 to 78-p. In examples, NN 60 may have any suitable number of hidden layers 70, and input layer 62, output layer 66, and each hidden layer 70 may have any suitable number of nodes, where each layer may have a different number of nodes.

[0024] In examples, outputs of each node of a layer are connected as weighted inputs to nodes of other layers of NN 60, such as illustrated by weighted connection 80, where different connections may have a different weight. In one example, as illustrated, each node of each layer is connected to each node of the next layer of the NN to form what is referred to as a fully connected NN. At each node, various ML procedures (e.g., functions, operations) are performed on input data received from other nodes to produce output data which is transmitted to other nodes via interconnects 80. In response to a set of input data 82 being provided at input layer 62, NN 60 provides a set of output data 84 at output layer 66.
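
For a concrete sense of the ML procedures such a network implies, the following toy forward pass (illustrative dimensions, with numpy as an assumed dependency) shows that every weighted connection reduces to countable multiply and add operations:

import numpy as np

rng = np.random.default_rng(0)

# Toy fully connected NN: 4 inputs -> 8 hidden nodes -> 3 outputs.
W1, b1 = rng.standard_normal((8, 4)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((3, 8)), rng.standard_normal(3)

x = rng.standard_normal(4)           # input data 82
h = np.maximum(W1 @ x + b1, 0.0)     # each hidden node: multiplies, adds, activation
y = W2 @ h + b2                      # output data 84

# Each matrix-vector product is a fixed, countable number of hardware
# operations: here 8*4 + 3*8 = 56 multiplies plus the corresponding adds.
print(y)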

[0025] In other examples, nodes of a given layer may not be connected to nodes of a next layer. For example, nodes of a given layer may be connected to nodes of a subsequent layer which is not the next layer (a so-called “skip connect” configuration). In other examples, a first portion of nodes of a given layer may be connected to nodes of a first subsequent layer, and a second portion of nodes of the given layer may be connected to nodes of a second subsequent layer. Any number of suitable interconnect arrangements may be employed between nodes of different layers.

[0026] Figure 3 is a graphical representation illustrating a derivation of an example model execution profile 144 from a corresponding model file 40 of a selected MLM 42 by translator module 34, according to one example. In the illustrated example of Figure 3, it is noted that the selected MLM 42 is a convolutional NN. In one example, from metadata and the architecture represented by the corresponding model file 40, translator module 34 derives the parameters and ordering of ML functions and operations, on a layer-by-layer basis, for execution of the selected MLM 42. Flow diagram 150 graphically represents the types and ordering of ML functions and operations, on a layer-by-layer basis, for the execution of the corresponding MLM 42. It is noted that, typically, the nodes of a given layer, such as nodes 74-1 to 74-m of hidden layer 72 of NN 60 (see Figure 2), each perform a same set of ML functions and/or operations.

[0027] In one example, an input layer and an output layer are respectively illustrated at 152 and 154, and hidden layers are illustrated at 160, 162, 164, 166, and 168. In the illustrated example, input layer 152 is graphically illustrated as including a transpose operation to place data in a desired format for hidden layer 160. Hidden layer 160 is graphically illustrated as including sequentially performed Convolution and PRelu functions followed by a MaxPool operation, with hidden layers 162 and 164 each including sequentially performed Convolution and PRelu functions. Layers 166 and 168 are illustrated as including sequentially performed Convolution and PRelu functions, with layer 166 further including a Softmax operation. In the illustrated example, the procedures of layers 152 and 160-164 are illustrated as being performed sequentially, while the procedures of layers 166 and 168 may be performed in parallel (being representative of a skip-layer architecture). While the Convolution and PRelu functions are illustrated and constructively treated as being part of a same layer for purposes of deriving model execution profile 44, it is noted that such functions may actually be arranged as separate layers in the architecture of the corresponding MLM 42.

[0028] A portion of model execution profile 144 is illustrated on the right-hand side of Figure 3, and illustrates ML procedures for executing the Convolution functions of each layer of the flow diagram illustrated by graph 150. In the illustrated example, each of the Convolution functions is broken down in model execution profile 144 into more granular sub-operations of addition and multiplication, with such sub-operations being defined as distinct ML procedures which may be mapped to corresponding hardware-embodied procedures of the hardware execution profile 38 of the selected ML engine 39. As illustrated, model execution profile 144 includes a total number of addition and multiplication operations executed for each layer 160-168, as respectively illustrated at 160-1 to 168-1, and total bytes of memory for each layer 160-168, as respectively illustrated at 160-2 to 168-2, to execute the corresponding Convolution functions. As illustrated by model execution profile 144, the convolution operations of the entire selected MLM 42 are determined to use about 100,780,118 add operations 170 and 101,801,166 multiply operations 172, and to consume about 4,084,192 bytes (about 4.1 MB) 174.
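
The per-layer addition/multiplication totals shown in profile 144 follow directly from each convolution's dimensions. The standard counting rule for a 2-D convolution layer is sketched below; the specific layer dimensions are hypothetical, not those of Figure 3:

def conv2d_op_counts(out_h, out_w, out_c, k_h, k_w, in_c):
    """Multiply/add counts to execute one Conv layer.

    Each of the out_h * out_w * out_c output elements accumulates
    k_h * k_w * in_c products: one multiply and (roughly) one add apiece.
    """
    macs = out_h * out_w * out_c * k_h * k_w * in_c
    return {"multiply": macs, "add": macs}

# Hypothetical layer: 56x56 output, 64 channels, 3x3 kernel, 32 input channels.
print(conv2d_op_counts(56, 56, 64, 3, 3, 32))   # about 57.8 million of each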

[0029] Although not explicitly illustrated, in examples, translator module 34 derives, in a similar fashion, similar metrics for model execution profile 144 regarding the PRelu functions (and any other ML procedures). Additionally, each of the arrows between the ML procedures (e.g., transpose, convolution, PRelu, Softmax, etc.) indicates a transfer of data between ML procedures and/or layers. In one example, a total number of such data transfers, along with corresponding memory consumption, is derived from model file 40 and included in model execution profile 144.

[0030] As described above, based on the model execution profile 44 of a selected MLM 42, and on the hardware execution profile 38 of a selected ML engine 39, predictor module 36 derives a performance forecast 46 for the selected MLM 42 on the selected ML engine 39. Using the example model execution profile 144 illustrated by Figure 3, in one case, predictor module 36 determines from the hardware execution profile 38 of the selected ML engine 39 how many add operations (if any) can be processed in parallel and how many multiplication operations (if any) can be processed in parallel, and, based on how long each addition and multiplication operation takes to run (its latency), determines a total latency for the convolution functions. In one example, predictor module 36 similarly determines a total latency for the remaining ML procedures, such as for PRelu functions and memory operations. In one example, predictor module 36 aggregates such latencies to derive an overall predicted latency (a model latency) for executing the selected MLM 42 on the selected ML engine 39.
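
Using the convolution totals from profile 144, the latency roll-up described here reduces to a ceiling division per procedure type. A worked sketch, in which the lane counts and per-operation latencies are assumed values rather than figures from the disclosure:

import math

adds, muls = 100_780_118, 101_801_166   # totals from model execution profile 144

add_lanes, mul_lanes = 256, 256         # assumed parallelism from profile 38
add_ns, mul_ns = 0.5, 0.7               # assumed per-operation latencies (ns)

add_latency_ns = math.ceil(adds / add_lanes) * add_ns
mul_latency_ns = math.ceil(muls / mul_lanes) * mul_ns

# If the add and multiply units run back to back (a serial dependency):
total_ms = (add_latency_ns + mul_latency_ns) * 1e-6
print(f"convolution latency ~ {total_ms:.2f} ms")   # about 0.48 ms here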

[0031] In other examples, in a similar fashion, predictor module 36 aggregates a predicted total memory consumption to execute the selected MLM 42 from the memory consumption of each of the individual ML procedures and, based on a memory capacity included in the hardware execution profile 38 of the selected ML engine 39, includes in the performance forecast 46 a prediction of whether the selected MLM 42 will execute on the selected ML engine 39. In other examples, based on an amount of power consumed for each hardware-embodied procedure included as operating metrics in the hardware execution profile 38 of the selected ML engine 39, and based on an aggregation of the ML procedures to which hardware embodied procedures are mapped, predictor module 36 provides an estimate of power and energy consumption of the selected ML engine 39 to execute the selected MLM 42.
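
The memory-fit prediction described here is a simple aggregate-and-compare, sketched below; the 4,084,192-byte convolution figure comes from Figure 3, while the other byte counts and the engine capacity are invented for the example:

def memory_fits(per_procedure_bytes, engine_capacity_bytes):
    """Predict whether the selected MLM 42 fits the selected ML engine 39."""
    total = sum(per_procedure_bytes.values())
    return total <= engine_capacity_bytes, total

estimates = {"conv": 4_084_192, "prelu": 500_000, "transfers": 250_000}
fits, total = memory_fits(estimates, engine_capacity_bytes=8 * 1024 * 1024)
print(fits, total)   # True: a ~4.8 MB estimate fits an 8 MB engine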

[0032] In summary, based on model execution profiles 44 and hardware execution profiles 38, which respectively comprise abstracted representations of corresponding MLMs and ML engines, MLM performance forecaster 20, in accordance with the present disclosure, calculates computational effort and memory consumption to provide a performance forecast of MLMs on ML engines (hardware) without running the models on the hardware. The process is ML framework and hardware agnostic, and provides a quick and easy way to estimate whether an MLM will “fit” certain hardware (and vice versa) and, if so, predicts performance metrics of the MLM on the ML engine without the need for costly deployment thereon.

[0033] During a learning phase of MLM development, MLM performance forecaster 20 enables expected changes in performance resulting from modifications made to MLM architectures to be quickly and efficiently estimated. MLM forecaster 20 simplifies the training and optimization of MLMs to particular hardware (e.g., a particular ML engine). By predicting a performance of the MLM on the hardware, changes can be made to the MLM’s network architecture while maintaining accuracy of the model’s output. The performance of any number of configurations of the MLM may be predicted and, when combined with training, assist in optimizing the MLM architecture without having to run the MLM on the hardware, thereby saving time and reducing costs.

[0034] Figure 4 is a flow diagram generally illustrating a process 200 for forecasting machine learning model (MLM) performance on machine learning (ML) engines. Process 200 begins at 202 with receiving model files, each model file having a representational format describing an architecture of a corresponding MLM, such as processor 26 executing input module 32 to receive model files 40-1 to 40-n corresponding to MLMs 42-1 to 42-n, according to Figure 1, for example. At 204, process 200 includes receiving hardware execution profiles, each hardware execution profile defining the operation of a corresponding ML engine and defining different types, numbers, and dependencies of hardware embodied procedures of the ML engine, along with operating metrics of each hardware embodied procedure, including an execution latency, such as processor 26 executing input module 32 to receive hardware execution profiles 38-1 to 38-n corresponding to ML engines 39-1 to 39-n, according to Figure 1, for example.

[0035] At 206, process 200 includes deriving a model execution profile of a selected MLM from its corresponding model file, the model execution profile defining different types, numbers, ordering, and dependencies of ML procedures to execute the MLM on a selected ML engine, each ML procedure mapped to a sequence of one or more hardware embodied procedures of the selected ML engine, such as processor 26 executing translator module 34 to derive a model execution profile 44 from a model file 40 corresponding to a selected MLM 42, according to Figure 1, for example.

[0036] At 208, process 200 includes deriving a performance forecast of the selected MLM on the selected ML engine based on the model execution profile of the selected MLM and the hardware execution profile of the selected ML engine, the performance forecast including a model latency, such as processor 26 executing predictor module 36 to derive a performance forecast 46 from the model execution profile 44 of the selected MLM 42 and the hardware execution profile 38 of the selected ML engine 39, according to Figure 1, for example. In some examples, as indicated at 210, process 200 includes deriving a performance forecast including additional performance metrics such as memory consumption, power consumption, and energy consumption of the selected MLM 42.

[0037] Figure 5 is a block and schematic diagram generally illustrating a computing system 300 for implementing MLM performance forecaster 20, according to one example. In the illustrated example, computing system or computing device 300 includes processing units 302 and system memory 304, where system memory 304 may be volatile (e.g., RAM), non-volatile (e.g., ROM, flash memory, etc.), or some combination thereof. Computing device 300 may also have additional features/functionality and additional or different hardware. For example, computing device 300 may include input devices 310 (e.g., keyboard, mouse, etc.), output devices 312 (e.g., display), and communication connections 314 that allow computing device 300 to communicate with other computers/applications 316, wherein the various elements of computing device 300 are communicatively coupled together via communication links 318.

[0038] In one example, computing device 300 may include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Such additional storage is illustrated as removable storage 306 and non-removable storage 308. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any suitable method or technology for non-transitory storage of information such as computer readable instructions, data structures, program modules, or other data, and does not include transitory storage media.

Computer storage media includes RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, and magnetic disc storage or other magnetic storage devices, for example.

[0039] System memory 304, removable storage 306, and non-removable storage 308 represent examples of computer storage media, including non-transitory computer readable storage media, storing computer executable instructions that, when executed by one or more processing units of processing units 302, cause the one or more processing units to perform the functionality of a system, such as MLM performance forecaster 20. For example, system memory 304 stores computer executable forecasting instructions 24 for MLM performance forecaster 20, including input module instructions 32, translator module instructions 34, and predictor module instructions 36, that when executed by one or more processing units of processing units 302 implement the functionalities of MLM performance forecaster 20, as described herein. In one example, one or more of the at least one machine-readable medium storing instructions for MLM performance forecaster 20, including input module instructions 32, translator module instructions 34, and predictor module instructions 36, may be separate from but accessible to computing device 300. In other examples, hardware and programming may be divided among multiple computing devices.

[0040] In some examples, the computer executable instructions can be part of an installation package that, when installed, can be executed by at least one processing unit to implement the functionality of MLM performance forecaster 20. In such examples, the machine-readable storage medium may be a portable medium, such as a CD, DVD, or flash drive, for example, or a memory maintained by a server from which the installation package can be downloaded and installed. In other examples, the computer executable instructions may be part of an application, applications, or component already installed on computing device 300, including the processing resource. In such examples, the machine-readable storage medium may include memory such as a hard drive, solid state drive, or the like. In other examples, the functionality of MLM performance forecaster 20, including input module instructions 32, translator module instructions 34, and predictor module instructions 36, may be implemented in the form of electronic circuitry, including via a flexible computing device, such as a field programmable gate array (FPGA), for example.

[0041] Figure 6 is a block and schematic diagram generally illustrating an example of a computing device 400 (e.g., a laptop) including an ML platform 402 having performance forecasting instructions 24 for forecasting performance of MLMs on ML engines, according to one example of the present disclosure. In one example, ML platform 402 includes an ML manager 404 (e.g., a microcontroller), a memory 406, and a plurality of ML engines 408 (illustrated as ML engines 408-1 to 408-n).

[0042] In one example, memory 406 stores a plurality of ML model files 410 (illustrated as ML files 410-1 to 410-n), where each ML model file 410 has a representational format (e.g., ONNX format) and corresponds to an MLM (such as MLMs 42, see Figure 1). In one example, memory 406 stores a number of hardware execution profiles 412 (such as described above with respect to hardware execution profiles 38 of Figure 1), illustrated as hardware execution profiles 412-1 to 412-n, with each hardware execution profile 412 corresponding to a different one of the ML engines 408. In one example, memory 406 also stores MLM performance forecasting instructions 24 including input module 32, translator module 34, and predictor module 36, as described above with respect to at least Figures 1-3.

[0043] In examples, each of the MLMs represented by ML files 410-1 to 410-n is configured and trained to perform a different machine learning task (such as voice recognition, face recognition, speech-to-text conversion, and text-to-speech conversion, for example). In one example, based on requests from computing device 400 for execution of a given machine learning task, ML manager 404, based on the ML file 410 corresponding to the requested machine learning task, loads the corresponding MLM 42 (see Figure 1) onto one of the ML engines 408.

[0044] In examples, at any given time, other MLMs may be implemented and running on different ones of the ML engines 408-1 to 408-n. In one example, based on the ML model file 410 of the requested MLM, ML manager 404 executes translator module 34 to derive a model execution profile 44 for the requested MLM, and executes predictor module 36 to derive a performance forecast 46 of the requested MLM on one or more of the still available ML engines 408 based on their corresponding hardware execution profiles 412.

[0045] In one example, ML manager 404 implements the MLM to perform the requested ML task on the ML engine 408 corresponding to the performance forecast having the most favorable performance metrics. In examples, the most favorable performance metrics may vary depending on operating policies of computing device 400. For example, in one case, if computing device 400 is operating on battery power, the most favorable performance metric may be the lowest power consumption to execute an MLM. In another case, the most favorable performance metric may be the smallest latency to execute an MLM.

[0046] Although specific examples have been illustrated and described herein, a variety of alternate and/or equivalent implementations may be substituted for the specific examples shown and described without departing from the scope of the present disclosure. This application is intended to cover any adaptations or variations of the specific examples discussed herein. Therefore, it is intended that this disclosure be limited only by the claims and the equivalents thereof.
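
For illustration, the policy-driven selection described in paragraph [0045] might reduce to the following sketch; the engine names, metric fields, and values are all hypothetical:

def pick_engine(forecasts, on_battery):
    """Choose an ML engine 408 from per-engine performance forecasts 46.

    forecasts: {engine_name: {"latency_ms": ..., "energy_mj": ...}}
    """
    # Operating policy: minimize energy on battery, otherwise minimize latency.
    key = "energy_mj" if on_battery else "latency_ms"
    return min(forecasts, key=lambda name: forecasts[name][key])

forecasts = {
    "engine-1": {"latency_ms": 4.2, "energy_mj": 9.0},
    "engine-2": {"latency_ms": 6.8, "energy_mj": 3.1},
}
print(pick_engine(forecasts, on_battery=True))    # engine-2: lowest energy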