Title:
TOTAL MOMENTUM FEEDBACK LOOP FOR BETTER ASGD GENERALIZATION
Document Type and Number:
WIPO Patent Application WO/2023/088569
Kind Code:
A1
Abstract:
A system and a method for distributed training of a machine learning model are disclosed. They are characterized by using a feedback loop for total momentum control, and comprise: detection of ASGD executions with compromised generalization; a feedback loop for adjusting the total momentum of an ASGD execution, i.e. the sum of the explicit, parametric momentum and the implicit momentum related to gradient staleness, toward zero; and tuning of ASGD hyperparameters in accordance with total momentum minimization.

Inventors:
TALYANSKY ROMAN (DE)
MELAMED ZACH (DE)
KISILEV PAVEL (DE)
KATZ MICHAEL (DE)
Application Number:
PCT/EP2021/082443
Publication Date:
May 25, 2023
Filing Date:
November 22, 2021
Assignee:
HUAWEI TECH CO LTD (CN)
TALYANSKY ROMAN (DE)
International Classes:
G06N3/08; G06N3/063
Other References:
JIAN ZHANG ET AL: "YellowFin and the Art of Momentum Tuning", arXiv.org, Cornell University Library, Ithaca, NY, 12 June 2017 (2017-06-12), XP081317882
HUNTER LANG ET AL: "Using Statistics to Automate Stochastic Optimization", arXiv.org, Cornell University Library, Ithaca, NY, 21 September 2019 (2019-09-21), XP081479994
DEEPAK NARAYANAN, AMAR PHANISHAYEE, KAIYU SHI, XIE CHEN, MATEI ZAHARIA: "Memory-Efficient Pipeline-Parallel DNN Training", arXiv:2006.09503v1 [cs.LG], 16 June 2020 (2020-06-16)
Attorney, Agent or Firm:
KREUZ, Georg M. (DE)
Claims:
WHAT IS CLAIMED IS:

1. A system for training machine learning based models, using a plurality of computing devices, and configured by hyperparameters, wherein at least one computing device is configured to: receive a parameter vector as estimated for a machine learning based model at a reference iteration, the parameter vector as estimated for the machine learning based model at a first previous iteration, and the parameter vector as estimated for the machine learning based model at a second previous iteration; calculate a first gap vector by subtracting the parameter vector as estimated for the machine learning based model at the first previous iteration from the parameter vector as estimated for the machine learning based model at the reference iteration; calculate a second gap vector by subtracting the parameter vector as estimated for the machine learning based model at the second previous iteration from the parameter vector as estimated for the machine learning based model at the first previous iteration; and calculate a learning rate to minimize the absolute value of a statistical representative value of a total momentum estimation scalar produced by: generating a multiplication by multiplying a gradient parameter vector as estimated for the machine learning based model at an iteration proximate to the reference iteration by the learning rate; generating a sum by summing the multiplication with the first gap vector; and generating a statistical representative of the division of the sum by the second gap vector.

2. The system of claim 1, further comprising at least one computing device configured to perform monitoring of the absolute value of the statistical representative value of the total momentum estimation scalar to detect compromised generalization.

3. The system of claim 2, wherein the monitoring is applied by counting the number of iterations wherein the absolute value of the statistical representative value of the total momentum estimation scalar exceeded a threshold within a history range.

4. The system of claim 1, wherein the at least one computing device is further configured to use the learning rate for training the machine learning based model.

5. The system of claim 1, wherein the at least one computing device is further configured to train the machine learning based model in an execution environment optimized for minimizing the absolute value of the total momentum.

6. A computer-implemented method for training machine learning based models, using a plurality of computing devices, and configured by hyperparameters, the method comprising: receiving a parameter vector as estimated for a machine learning based model at a reference iteration, the parameter vector as estimated for the machine learning based model at a first previous iteration, and the parameter vector as estimated for the machine learning based model at a second previous iteration; calculating a first gap vector by subtracting the parameter vector as estimated for the machine learning based model at the first previous iteration from the parameter vector as estimated for the machine learning based model at the reference iteration; calculating a second gap vector by subtracting the parameter vector as estimated for the machine learning based model at the second previous iteration from the parameter vector as estimated for the machine learning based model at the first previous iteration; and calculating a learning rate to minimize the absolute value of a statistical representative value of a total momentum estimation scalar produced by: generating a multiplication by multiplying a gradient parameter vector as estimated for the machine learning based model at an iteration proximate to the reference iteration by the learning rate; generating a sum by summing the multiplication with the first gap vector; and generating a statistical representative of the division of the sum by the second gap vector.

7. The computer-implemented method of claim 6, further comprising performing monitoring of the absolute value of the statistical representative value of the total momentum estimation scalar to detect compromised generalization.

8. The computer-implemented method of claim 7, wherein the monitoring is applied by counting the number of iterations wherein the absolute value of the statistical representative value of the total momentum estimation scalar exceeded a threshold within a history range.

9. The computer-implemented method of claim 6, further comprising using the learning rate for training the machine learning based model.

10. The computer-implemented method of claim 6, further comprising training the machine learning based model in an execution environment optimized for minimizing the absolute value of the total momentum.

11. A computer program comprising instructions for training machine learning based models, wherein execution of the instructions by one or more processors of a computing system causes the computing system to: receive a parameter vector as estimated for a machine learning based model at a reference iteration, the parameter vector as estimated for the machine learning based model at a first previous iteration, and the parameter vector as estimated for the machine learning based model at a second previous iteration; calculate a first gap vector by subtracting the parameter vector as estimated for the machine learning based model at the first previous iteration from the parameter vector as estimated for the machine learning based model at the reference iteration; calculate a second gap vector by subtracting the parameter vector as estimated for the machine learning based model at the second previous iteration from the parameter vector as estimated for the machine learning based model at the first previous iteration; and calculate a learning rate to minimize the absolute value of a statistical representative value of a total momentum estimation scalar produced by: generating a multiplication by multiplying a gradient parameter vector as estimated for the machine learning based model at an iteration proximate to the reference iteration by the learning rate; generating a sum by summing the multiplication with the first gap vector; and generating a statistical representative of the division of the sum by the second gap vector.

12. A non-transitory computer-readable medium storing the computer program of claim 11.

Description:
TOTAL MOMENTUM FEEDBACK LOOP FOR BETTER ASGD GENERALIZATION BACKGROUND Some embodiments described in the present disclosure relate to machine learning model training and, more specifically, but not exclusively, to controlling asynchronous stochastic gradient descent (ASGD) momentum during distributed training. Using asynchronous stochastic gradient descent (SGD) on a system comprising a computing device is a ubiquitous method of training machine learning models, and particularly neural networks. ASGD is a distributed algorithm for training very large scale deep learning models, shown for example by Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc Le, and Andrew Ng, in "Large Scale Distributed Deep Networks", at NIPS 2012. Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia proposed in "Memory-Efficient Pipeline-Parallel DNN Training", on arXiv:2006.09503v1 [cs.LG], 16 June 2020, a method for partitioning the training of complex machine learning models. The training of very large machine learning models may not fit into the main memory of a single computing device, an acceleration device, a graphic processing unit (GPU), or the like, and may be more effective when the model is partitioned into several parts and these parts are placed and trained over a set of physical acceleration devices. The algorithm may be applied using a plurality of computing devices, which may also be referred to as workers, wherein at least one computing device may also be referred to as a master. The workers may asynchronously receive parameters, namely x, from the master, compute a gradient, namely ∇f(x), and transmit the gradient to the master, which may also be referred to as the parameter server. The master may subsequently use the gradients to update the parameters stored thereon, and then repeat the steps above.
As compared to synchronous stochastic gradient descent algorithms, the asynchronous mode of operation of ASGD results in lower worker idle time, better resource utilization and faster wall-clock training time. However, ASGD is characterized by the disadvantage of gradient staleness. The problem of gradient staleness arises since the gradient, computed at iteration κ, is merged into the model at a latency of τ iterations, i.e. at iteration κ+τ: x[κ+τ+1] = x[κ+τ] − η∇f(x[κ]). Since the gradient computed at x[κ+τ] may be significantly different from the gradient computed at x[κ], the precision may be lower, and the convergence may be slower and difficult to reach. Updating parameters with stale gradients compromises the convergence rate, as compared to synchronous SGD, requiring more iterations to converge. Also, ASGD may sometimes fail to reach the same final test accuracy, and that disadvantage adversely affects the scalability of the algorithm. The concept of momentum, applied both in SGD and ASGD, was introduced by B. T. Polyak, in "Some methods of speeding up the convergence of iteration methods", USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964. The momentum enables the algorithm to remember previous gradients, and apply them in following parameter updating steps. Generally, momentum values between 0.9 and 0.999 are prevalent. I. Mitliagkas, C. Zhang, S. Hadjis and C. Ré, in "Asynchrony begets momentum, with an application to deep learning," Allerton 2016, suggested that the expected ASGD update satisfies Ε[x[t+1] − x[t]] = −η Σ_{τ=0}^{∞} p[τ] Ε[∇f(x[t−τ])], where the sum is over infinite values of staleness τ = 0, 1, 2, ..., staleness values are distributed according to a staleness distribution p[τ] ~ P, μ is the momentum, and η is the learning rate; in particular, a geometric staleness distribution p[τ] = (1−μ)μ^τ reproduces, in expectation, the update of SGD with momentum μ. They additionally stated that the ASGD algorithm has an implicit momentum, even when the explicit momentum, as defined in the algorithm, is not used.
Jian Zhang and Ioannis Mitliagkas, in "YellowFin and the Art of Momentum Tuning", at ICLR 2018, proposed to measure a total momentum that incorporates both the implicit and the algorithmic momentum. At time t the total momentum may be measured according to μ'(t) = median( (x[t] − x[t−1] + η∇f(x[t−1])) / (x[t−1] − x[t−2]) ). Note that, in accordance with Polyak's work, the division is defined element-wise, resulting in a vector, and the median is defined over the entries of the resulting vector. Samuel L. Smith and Quoc V. Le showed in "A Bayesian Perspective on Generalization and Stochastic Gradient Descent", at ICLR 2018, that an algorithm with momentum μ and learning rate η has an effective learning rate of η_eff = η / (1 − μ). An ASGD worker starts computing a gradient from a current version of the central model, which we call Starting Parameters (SP). When the worker finishes computing the gradient, other workers could have updated the central model, so that the worker merges its gradient into the updated version of the central model, which we call Final Parameters (FP). The number of updates between SP and FP is defined as the gradient staleness. Delay compensation methods were suggested by Shuxin Zheng, Qi Meng, Taifeng Wang, Wei Chen, Nenghai Yu, Zhiming Ma, and Tie-Yan Liu, in "Asynchronous Stochastic Gradient Descent with Delay Compensation", at ICML 2017, wherein a worker computes the gradient on the starting parameters (SP), then computes a first order approximation of the gradient at FP and, finally, merges the approximated gradient into the final parameters (FP). Their disadvantage is that the approximation quality usually drops as the distance between SP and FP grows or, in turn, as the number of workers grows, leading to poor scalability and generalization. Parameter prediction methods, such as suggested by Saar Barkai, Ido Hakimi, and A. Schuster, in "Gap Aware Mitigation of Gradient Staleness", at ICLR 2020, wherein a worker reads SP and predicts FP, were also introduced.
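The total momentum measurement described above can be sketched as follows; this is a minimal NumPy sketch assuming three consecutive snapshots of the parameter vector and the gradient evaluated at the first previous iteration (the function name and the division guard `eps` are illustrative, not part of the disclosure).

```python
import numpy as np

def total_momentum(x_t, x_prev1, x_prev2, grad, lr):
    """Estimate the total momentum mu'(t) from three consecutive
    parameter snapshots, YellowFin-style:
        mu'(t) = median( (x_t - x_prev1 + lr * grad) / (x_prev1 - x_prev2) )
    The division is element-wise; the median is taken over the entries
    of the resulting vector."""
    first_gap = x_t - x_prev1        # x(t) - x(t-1)
    second_gap = x_prev1 - x_prev2   # x(t-1) - x(t-2)
    eps = 1e-12                      # guard against flat coordinates (assumed)
    ratio = (first_gap + lr * grad) / (second_gap + eps)
    return float(np.median(ratio))
```

As a sanity check, three snapshots generated by a pure Polyak momentum step with momentum 0.9 and zero gradient yield an estimate of 0.9.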
Subsequently, the worker computes the gradient on the predicted parameters and merges it into FP. However, as the number of workers grows, the prediction accuracy degrades, leading to compromised generalization at high scale. Gap-awareness methods, together with parameter prediction methods, may predict FP and divide a gradient by its gap, which is computed as the ratio of the distance between the predicted and actual FP to the average step size. However, the generalization of the combination of gap-aware and parameter prediction methods may be compromised at high scale. Additionally, staleness-aware methods, such as Wei Zhang, Suyog Gupta, Xiangru Lian, and Ji Liu's "Staleness-aware async-sgd for distributed deep learning", introduced in CoRR 2015, divide gradients by their staleness. These methods may suffer from over-penalization, since while the staleness may be large, the actual distance between SP and FP may be small. This over-penalization may lead to compromised generalization and scalability. SUMMARY It is an object of the present disclosure to describe a system and a method for distributed training of a machine learning model, using a feedback loop for total momentum control, comprising detection of ASGD executions with compromised generalization, a feedback loop for adjusting the total momentum of an ASGD execution towards zero, and tuning of ASGD hyperparameters in accordance with total momentum minimization. The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to an aspect of some embodiments of the present invention there is provided a system for training machine learning based models, using a plurality of computing devices, and configured by hyperparameters, wherein at least one computing device is configured to: receive a parameter vector as estimated for a machine learning based model at a reference iteration, the parameter vector as estimated for the machine learning based model at a first previous iteration, and the parameter vector as estimated for the machine learning based model at a second previous iteration; calculate a first gap vector by subtracting the parameter vector as estimated for the machine learning based model at the first previous iteration from the parameter vector as estimated for the machine learning based model at the reference iteration; calculate a second gap vector by subtracting the parameter vector as estimated for the machine learning based model at the second previous iteration from the parameter vector as estimated for the machine learning based model at the first previous iteration; and calculate a learning rate to minimize the absolute value of a statistical representative value of a total momentum estimation scalar produced by: generating a multiplication by multiplying a gradient parameter vector as estimated for the machine learning based model at an iteration proximate to the reference iteration by the learning rate; generating a sum by summing the multiplication with the first gap vector; and generating a statistical representative of the division of the sum by the second gap vector.
According to an aspect of some embodiments of the present invention there is provided a computer-implemented method for training machine learning based models, using a plurality of computing devices, and configured by hyperparameters, the method comprising: receiving a parameter vector as estimated for a machine learning based model at a reference iteration, the parameter vector as estimated for the machine learning based model at a first previous iteration, and the parameter vector as estimated for the machine learning based model at a second previous iteration; calculating a first gap vector by subtracting the parameter vector as estimated for the machine learning based model at the first previous iteration from the parameter vector as estimated for the machine learning based model at the reference iteration; calculating a second gap vector by subtracting the parameter vector as estimated for the machine learning based model at the second previous iteration from the parameter vector as estimated for the machine learning based model at the first previous iteration; and calculating a learning rate to minimize the absolute value of a statistical representative value of a total momentum estimation scalar produced by: generating a multiplication by multiplying a gradient parameter vector as estimated for the machine learning based model at an iteration proximate to the reference iteration by the learning rate; generating a sum by summing the multiplication with the first gap vector; and generating a statistical representative of the division of the sum by the second gap vector.
According to an aspect of some embodiments of the present invention there is provided a computer program product comprising instructions for training machine learning based models, wherein execution of the instructions by one or more processors of a computing system causes the computing system to: receive a parameter vector as estimated for a machine learning based model at a reference iteration, the parameter vector as estimated for the machine learning based model at a first previous iteration, and the parameter vector as estimated for the machine learning based model at a second previous iteration; calculate a first gap vector by subtracting the parameter vector as estimated for the machine learning based model at the first previous iteration from the parameter vector as estimated for the machine learning based model at the reference iteration; calculate a second gap vector by subtracting the parameter vector as estimated for the machine learning based model at the second previous iteration from the parameter vector as estimated for the machine learning based model at the first previous iteration; and calculate a learning rate to minimize the absolute value of a statistical representative value of a total momentum estimation scalar produced by: generating a multiplication by multiplying a gradient parameter vector as estimated for the machine learning based model at an iteration proximate to the reference iteration by the learning rate; generating a sum by summing the multiplication with the first gap vector; and generating a statistical representative of the division of the sum by the second gap vector. Optionally, further comprising at least one computing device configured to perform monitoring of the absolute value of the statistical representative value of the total momentum estimation scalar to detect compromised generalization.
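The gap-vector computation and learning-rate selection described in the aspects above can be sketched as follows; this is a minimal NumPy sketch in which the statistical representative is taken to be the median and the minimization is performed over an assumed candidate grid `lr_grid` (the aspects do not fix a particular search method).

```python
import numpy as np

def tune_learning_rate(x_ref, x_prev1, x_prev2, grad, lr_grid):
    """Pick the learning rate from a candidate grid that minimizes the
    absolute value of the statistical representative (here: median) of
    the total-momentum estimation scalar."""
    first_gap = x_ref - x_prev1      # reference minus first previous iteration
    second_gap = x_prev1 - x_prev2   # first previous minus second previous
    eps = 1e-12                      # division guard (assumed)
    best_lr, best_abs_mu = None, float("inf")
    for lr in lr_grid:
        # sum = lr * gradient + first gap; divide element-wise by second gap
        mu = float(np.median((lr * grad + first_gap) / (second_gap + eps)))
        if abs(mu) < best_abs_mu:
            best_lr, best_abs_mu = lr, abs(mu)
    return best_lr, best_abs_mu
```

For example, when the first gap exactly cancels −η times the gradient for some candidate η, that candidate drives the estimated total momentum to zero and is selected.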
Optionally, wherein the monitoring is applied by counting the number of iterations wherein the absolute value of the statistical representative value of the total momentum estimation scalar exceeded a threshold within a history range. Optionally, the at least one computing device is further configured to use the learning rate for training the machine learning based model. Optionally, the at least one computing device is further configured to train the machine learning based model in an execution environment optimized for minimizing the absolute value of the total momentum. Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims. Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments pertain. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting. BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S) Some embodiments are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments.
In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments may be practiced. In the drawings: FIG. 1A is a schematic illustration of an exemplary system for training a neural network, according to some embodiments of the present disclosure; FIG. 1B is a schematic block diagram of an exemplary system for distributed training of a machine learning model, according to some embodiments of the present disclosure; FIG. 2A is a flowchart schematically representing an optional flow of operations for a distributed step of training a machine learning model, using adjusted total momentum, according to some embodiments of the present disclosure; FIG. 2B is a flowchart schematically representing an optional flow of operations for calculating a total momentum, according to some embodiments of the present disclosure; FIG. 3A is a schematic block diagram of a feedback loop for improving ASGD generalization, according to some embodiments of the present disclosure; FIG. 3B is a schematic block diagram of an implementation example for detection of compromised generalization, according to some embodiments of the present disclosure; FIG. 4 is a schematic block diagram of an implementation of a feedback loop, according to some embodiments of the present disclosure; FIG. 5A is a schematic graph depicting total momentum during training in an exemplary experiment, according to some embodiments of the present disclosure; and FIG. 5B is a schematic graph depicting test precision during training in an exemplary experiment, according to some embodiments of the present disclosure. DETAILED DESCRIPTION Some embodiments described in the present disclosure relate to machine learning model training and, more specifically, but not exclusively, to controlling asynchronous stochastic gradient descent (ASGD) total momentum during distributed training.
Training a complex machine learning model, such as a deep neural network, using a large dataset, is time-, memory-, and energy-consuming, and may be slow or even impractical on a single computing device. Therefore, methods for distributed training were developed. Distributed training between a plurality of computing devices, also referred to as worker nodes, may improve training speed, and enable training of large machine learning models. The gradient descent acceleration may be data parallel, i.e. based on distribution of the data from the training dataset between different computing devices, model parallel pipeline, i.e. based on partitioning the model and distributing it over different devices, or a combination thereof, for example by defining a two-dimensional array of computing devices, wherein each computing device is assigned to a pipeline stage according to an associated row location, and to a data part from the training set according to an associated column location. Training may alternatively be done by genetic algorithms; however, gradient descent is the ubiquitous choice, particularly stochastic gradient descent, which may be accelerated by the parallel computing used for distributed training using synchronous stochastic gradient descent (SSGD), or ASGD. SSGD requires waiting for the slowest, or most remote, computing device, from which the updated gradients arrive last, and therefore may be slower than ASGD. Staleness of ASGD algorithms leads to degradation of their convergence rate, generalization and scalability. As used herein, the terms computing devices, worker nodes or workers refer to computers, GPU cards, field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), workstations, servers, digital signal processing (DSP) modules, a combination thereof, and/or similar devices, which are apt to execute instructions for training complex machine learning models, and to store the parameters of the machine learning models.
The computing devices may communicate using a backbone bus, cables arranged in specific topologies, communication protocols such as Ethernet, the internet, and/or the like. For a better understanding of the disclosed method, recall the notation τ[t] for the staleness of a gradient that a computing device, or a worker, starts computing at time t, i.e. the computing device may merge this gradient into the central model at time t+τ[t], after other workers have updated the central model τ[t] times. A proposed analysis of the convergence of the ASGD algorithm shows that the largest factor in the bound on the convergence rate is of the form η²G² Σ_i Ε[τ[i]²], wherein the sum is over indices i ∈ {t−N+1, ..., t} such that i − τ[i] < t − N, G is the upper bound on the gradient norm, and Ε is the expected value. Note that the summand in this expression is quadratic and bounded only by T², since the staleness τ[t] may be as large as T. Also, it is defined over very stale gradients, which are computed on central model versions from iterations below t−N, and, thus, their staleness is above N. Therefore, minimizing this factor may improve the generalization. The disclosure suggests that managing the total momentum allows reducing the quadratic factor above, which is the largest factor in the bound on the convergence rate. In pipeline Model Parallel SGD the staleness is constant, so that t−τ−1 = t−N. Note that, following the definition of the total momentum in YellowFin, at time t the total momentum may be calculated based on quantities, i.e. model parameters x, hyperparameters such as η, and gradients, as calculated and measured around time t−τ−1. The total momentum may thus be computed from quantities measured around time t−τ−1 = t−N.
The YellowFin work suggested measuring the total momentum as μ'(t) = median( (x[t] − x[t−1] + η∇f(x[t−1])) / (x[t−1] − x[t−2]) ). However, it should be emphasized that the median is only one example of a statistical representative value, and other values such as the mean, a typical value, the average of the 25th and the 75th percentiles, the 10th percentile, the 65th percentile, the 95th percentile, and the like, may be functionally equivalent. Other variants of the formula are apparent to the person skilled in the art and are within the scope of the claims. When the total momentum is zero, i.e. μ'(t) = 0, according to Polyak momentum, the algorithm does not remember the gradients preceding the iteration t−N. In other words, the algorithm at time t does not remember gradients whose staleness exceeds N. Since the largest factor in the bound is defined over gradients of staleness above N, when μ'(t) = 0, the factor may be defined over an empty set of indices, i.e. be equal to 0. Therefore, the generalization of the algorithm may improve. While ubiquitous disclosures suggest a substantial momentum, the present disclosure suggests minimizing the total momentum to offset the gradient staleness. Some embodiments of the present disclosure may comprise a detection component for detecting ASGD sessions, or executions, having compromised generalization, which may constitute inefficient use of the computing devices. A detection component may be characterized by a threshold d, used for detecting iterations t when the total momentum is large in absolute value, and distant from 0, i.e. when |μ'(t)| > d. A total momentum distant from zero, in a large enough number of iterations, may indicate that the generalization is compromised. To capture this intuition, a second threshold ε is defined, and the detection component may indicate that an ASGD execution is compromised when the fraction of iterations for which |μ'(t)| > d is above ε: (1/T) Σ_{t=1}^{T} δ(|μ'(t)| > d) > ε, wherein the summation counts over iterations t = 1, ..., T, and δ() is the Kronecker delta, i.e. δ(c) = 1 if the Boolean value of c is True and δ(c) = 0 when c is False.
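The detection rule above can be sketched as follows; this is a minimal NumPy sketch operating on an assumed history of total-momentum estimates, with the thresholds d and ε passed in as parameters.

```python
import numpy as np

def generalization_compromised(mu_history, d, eps):
    """Detection-component sketch: flag an ASGD execution as having
    compromised generalization when the fraction of iterations whose
    total-momentum estimate exceeds d in absolute value is above eps,
    i.e. (1/T) * sum_t delta(|mu'(t)| > d) > eps."""
    mu = np.asarray(mu_history, dtype=float)
    fraction = np.mean(np.abs(mu) > d)  # Kronecker-delta count divided by T
    return bool(fraction > eps)
```

For instance, a history in which half of the estimates exceed d in absolute value is flagged when ε = 0.25 and not flagged when ε = 0.6.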
In summary, the detection component detects compromised generalization in an ASGD execution by monitoring the absolute value of the typical element of the momentum estimation scalar. The monitoring may be performed by counting the number of iterations in which the absolute value of the typical element of the momentum estimation scalar μ'(t) exceeded the threshold d within a history range, and checking whether their fraction exceeds the second threshold ε, or alternatively by extracting a statistic such as a mean or a median of the total momentum. It should be noted that a person skilled in the art may conceive many ways to obtain similar functionality of monitoring the total momentum. Some embodiments of the present disclosure may comprise a feedback loop for improving ASGD generalization. A feedback loop may start from running an ASGD algorithm in an execution environment, where the feedback loop measures the total momentum during the execution and inputs it into a control unit. The term execution environment refers to the settings of hardware, firmware, software, and the like, that enable a computing device to perform training of a machine learning model in general, and particularly to the hyperparameters, instructions, calculations, and the like, as used in training iterations, epochs, and/or the like during ASGD. Measuring the total momentum may be performed by the detection component, or by other methods, based on the model parameters during several iterations, gradients, hyperparameters, data from a central parameter server, and/or the like. The control unit may dynamically re-compute values of ASGD hyperparameters and set them back into the ASGD execution environment. The new re-computed values of the hyperparameters may adjust the total momentum of the ASGD execution toward zero, and, thus, improve the generalization of ASGD training.
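One control-unit update of such a feedback loop can be sketched as follows. This is a hypothetical illustration only: the disclosure does not fix a control law, and the proportional update, its sign convention, and the `gain` constant are all assumptions made for the sketch.

```python
def momentum_feedback_step(lr, mu_measured, gain=0.1):
    """Sketch of one control-unit step of a feedback loop that nudges a
    hyperparameter (here the learning rate) so that the measured total
    momentum moves toward zero.  A proportional correction with an
    assumed loop constant `gain` is used; the direction of the nudge is
    an illustrative convention, not taken from the disclosure."""
    new_lr = lr * (1.0 - gain * mu_measured)
    return max(new_lr, 1e-8)  # keep the learning rate positive
```

In this convention, a positive measured total momentum reduces the learning rate and a negative one increases it, so repeated steps move the control variable in opposite directions for opposite-signed momentum errors.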
Before explaining at least one embodiment in detail, it is to be understood that embodiments are not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. Implementations described herein are capable of other embodiments or of being practiced or carried out in various ways. Embodiments may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the embodiments. The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire. 
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device. Computer readable program instructions for carrying out operations of embodiments may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). 
In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of embodiments. Aspects of embodiments are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions. These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks. 
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks. The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions. Referring now to the drawings, FIG. 1A is a schematic illustration of an exemplary system for training of a neural network, according to some embodiments of the present disclosure. 
An exemplary training system 100 may function as a computing node for processes such as 200 and/or 250 for distributed training of a neural network or a similarly complex machine learning model from data records, using ASGD or variants thereof. Further details about these exemplary processes follow as FIG. 2A and FIG. 2B are described. The neural network training system 110 may include a network interface 113, which comprises an input interface 112 and an output interface 115. The training system may also comprise one or more processors 111 for executing processes such as 200 and/or 250, and storage 116 for storing code (program code storage 114) and/or memory 118 for data, such as network parameters and records for training. The training of the neural network may be performed on site, implemented on a cluster comprising mobile devices, implemented as a distributed system, implemented virtually on a cloud service, on machines also used for other functions, and/or in several other configurations. Alternatively, the system, or parts thereof, may be implemented on dedicated hardware, FPGA, and/or the like. Further alternatively, the system, or parts thereof, may be implemented on a server, a computer farm, the cloud, and/or the like. For example, the storage 116 may comprise a local cache on the device, and some of the less frequently used data and code parts may be stored remotely. The input interface 112 and the output interface 115 may comprise one or more wired and/or wireless network interfaces for connecting to one or more networks, for example, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a cellular network, the internet, and/or the like. The input interface 112 and the output interface 115 may further include one or more wired and/or wireless interconnection interfaces, for example, a universal serial bus (USB) interface, a serial port, and/or the like. 
Furthermore, the output interface 115 may include one or more wireless interfaces for delivering various indications to other systems or users, and the input interface 112 may include one or more wireless interfaces for receiving information from one or more devices. Additionally, the input interface 112 may include specific means for communication with one or more sensor devices 122, such as a touch screen or a microphone, for receiving instructions, configurations, and/or the like. Similarly, the output interface 115 may include specific means for communication with one or more display devices 125, such as a loudspeaker, a display, and/or the like. Both parts of the processing, the storage and delivery of data records and the inference result processing, may be executed using one or more optional Neighbor Systems 124. The one or more processors 111, homogenous or heterogeneous, may include one or more processing nodes arranged for parallel processing, as clusters and/or as one or more multi-core processors. Furthermore, the processors may comprise units optimized for deep learning such as Graphic Processing Units (GPU). The storage 116 may include one or more non-transitory persistent storage devices, for example, a hard drive, a Flash array, and/or the like. The storage 116 may also include one or more volatile devices, for example, a random access memory (RAM) component, enhanced bandwidth memory such as video RAM (VRAM), and/or the like. The storage 116 may further include one or more network storage resources, for example, a storage server, a network attached storage (NAS), a network drive, and/or the like, accessible via one or more networks through the input interface 112 and the output interface 115. 
The one or more processors 111 may execute one or more software modules such as, for example, a process, a script, an application, an agent, a utility, a tool, an operating system (OS), and/or the like, each comprising a plurality of program instructions stored in a non-transitory medium within the program code 114, which may reside on the storage medium 116. Reference is now made to FIG. 1B, which is a schematic block diagram of an exemplary system for distributed training of a machine learning model, according to some embodiments of the present disclosure. An exemplary distributed training system 150 may function as a computing node for processes such as 200 and/or 250 for distributed training of a neural network or a similarly complex machine learning model from data records. Further details about these exemplary processes follow as FIG. 2A and FIG. 2B are described. The network shown in 150 may be used for providing a plurality of users with a platform comprising a plurality of computing nodes, and may be implemented as a LAN, WAN, a cloud service, a network for distributed training of neural networks and similarly complex machine learning models, a compute server, and/or the like. The network may allow communication with physical or virtual machines, or parts thereof, for example graphic processing units (GPU), functioning as computing nodes, as shown in 151, 155 and 158. The network may interface the outside network, e.g. the internet, and collect data continuously. Some embodiments may prepare additional training data and perform periodic retraining and/or online training. The network computing nodes may be configured to function peer to peer; however, optionally, additional computing nodes may be configured as parameter servers, as shown in 165. A parameter server may be based on similar computing nodes; however, it may be a system configured for broad and fast memory access, and less processing capabilities. 
Optionally, more than one computing node may function as a parameter server, or the parameter server may be a plurality of devices configured to function as a single parameter server. For example, an auxiliary parameter server shown in 160 may store some of the parameters, the training data, and/or the like. Reference is also made to FIG. 2A, which is a flowchart schematically representing an optional flow of operations for a distributed step of training of a machine learning model, using adjusted total momentum, according to some embodiments of the present disclosure. The exemplary process 200 may be executed for training a model for executing one or more inference tasks, for example for analytics, web page traffic prediction, computer vision tasks, recommendation systems, and/or the like. The process 200 may be executed by the one or more processors 111. The exemplary process 200 starts, as shown in 201, with receiving a parameter vector as estimated for a machine learning based model at a reference iteration, the parameter vector as estimated for the machine learning based model at a first previous iteration, and the parameter vector as estimated for the machine learning based model at a second previous iteration. These parameter vectors, as well as the gradients, may be already present in computing devices performing ASGD in a model parallel pipeline. In case of a data parallel pipeline, other computing devices may have different gradients, as well as different parameter vectors, and this information may be exchanged periodically between the computing devices; however, the disclosure's benefit is more significant when the traffic-expensive communication operations, which also manage gradient staleness, are not very frequent. 
The exemplary process 200 continues, as shown in 202, with calculating a first gap vector by subtracting the parameter vector as estimated for the machine learning based model at the first previous iteration from the parameter vector as estimated for the machine learning based model at the reference iteration. The first gap vector may be calculated in accordance with the formula: x[t-τ] - x[t-τ-1]. The model parameters being trained, x, may be present in the ASGD execution environment; t denotes the current iteration number; and tau, τ, the gap of the reference iteration from the current iteration, may be determined in accordance with the frequency of the parameter matching between the nodes, the time elapsed since the most recent updating of the parameters by the parameter server, and/or the like. The reference iteration may be equal to the current iteration minus the staleness of the gradient that is currently merged into the central model, or proximal thereto. The number 1, indicating the first previous iteration, may be replaced by another number such as 2, 5, or 14; however, a small number is preferred, as larger numbers may require additional storage, and may offset and overestimate the staleness. The exemplary process 200 continues, as shown in 203, with calculating a second gap vector by subtracting the parameter vector as estimated for the machine learning based model at the second previous iteration from the parameter vector as estimated for the machine learning based model in the ASGD execution environment at the first previous iteration. The second gap vector may be calculated in accordance with the formula: x[t-τ-1] - x[t-τ-2]. Similarly to 202, the model parameters being trained, x, may be present in the ASGD execution environment, and tau, τ, may be determined in accordance with the frequency of the parameter matching between the nodes, the time elapsed since the most recent updating of the parameters by the parameter server, and/or the like. 
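The gap vector calculations of 202 and 203 may be sketched, for example, as follows; the function and argument names are hypothetical, and plain Python lists stand in for the model parameter vectors.

```python
def gap_vectors(x_ref, x_prev1, x_prev2):
    """Compute the first and second gap vectors of steps 202 and 203.

    x_ref:   parameter vector at the reference iteration, x[t-tau]
    x_prev1: parameter vector at the first previous iteration, x[t-tau-1]
    x_prev2: parameter vector at the second previous iteration, x[t-tau-2]
    """
    # 202: x[t-tau] - x[t-tau-1]
    first_gap = [a - b for a, b in zip(x_ref, x_prev1)]
    # 203: x[t-tau-1] - x[t-tau-2]
    second_gap = [b - c for b, c in zip(x_prev1, x_prev2)]
    return first_gap, second_gap
```

For example, with snapshots [3.0, 1.0], [2.0, 4.0], and [1.0, 1.0], the gaps are [1.0, -3.0] and [1.0, 3.0].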
The numbers 1 and 2, indicating the first previous iteration and the second previous iteration respectively, may be replaced by other numbers such as 2 and 4, 6 and 7, or 33 and 38; however, small numbers are preferred, as larger numbers may require additional storage, and may offset and overestimate the staleness. Inverting the number order requires inverting the formula sign. The process 200 may continue, as shown in 204, by using the machine learning based model, executed by the one or more processors 111, for calculating a total momentum, in order to minimize the total momentum for training the machine learning based model. The momentum estimation may be implemented by the process 250 as shown in FIG. 2B. The total momentum may be the sum of the parametric momentum, as defined by the external set of hyper-parameter values, and the implicit momentum derived from the gradient staleness. The total momentum of the ASGD execution environment, used for the machine learning model training, may be minimized by setting the parametric momentum to the negated value of the implicit momentum. The exemplary process 200 may continue, and conclude, as shown in 205, with calculating a learning rate to minimize the absolute value of a statistical representative value of a total momentum estimation scalar. The learning rate η may be calculated according to the effective learning rate formula η′ = η / (1 - μ). The more precise the momentum estimation is, the more precise the effective learning rate estimation may be. Further details about an exemplary learning rate calculation are described along FIG. 4. Some implementations may use other parameters, such as the batch size, to control the total momentum, and some implementations may apply the learning rate as designated by the set of hyper-parameter values, or by an adaptive learning rate algorithm. Note that variations of the process are apparent to the person skilled in the art, and are within the scope of the claims. Reference is also made to FIG. 
2B, which is a flowchart schematically representing an optional flow of operations for calculating a total momentum, according to some embodiments of the present disclosure. The exemplary process 250 may be executed at stage 204 of the process 200. The process 250 may be executed by the one or more processors 111. The process 250 may start, as shown in 251, with generating a multiplication by multiplying a gradient parameter vector as estimated for the machine learning based model at an iteration proximate to the reference iteration by the learning rate. The gradient parameter vector may be received from the ASGD execution environment, at a proximal iteration tp, for example, the last iteration t-1, the last-but-one iteration, iteration t-τ+1, iteration t-τ-1, or the like. The gradient parameter vector may be multiplied by the learning rate η, which may be, for example, the parametric learning rate as received from the external set of hyper-parameters, an adjusted value corresponding to the effective learning rate, or an otherwise adjusted value. The process 250 may continue, as shown in 252, by generating a sum by summing the multiplication with the first gap vector. The first gap vector may be, for example, (x[t-τ] - x[t-τ-1]); however, varieties in the offset, as described, for example, in 202 on FIG. 2A, may apply. The sum may be, for example, (x[t-τ] - x[t-τ-1]) + η·g(tp), where g(tp) denotes the gradient parameter vector at the proximal iteration; however, the reference iteration may vary as shown in 251. The process 250 may conclude, as shown in 253, by generating a statistical representative of the division of the sum by the second gap vector. The statistical representative value may be the median, the mean, a typical value such as an estimated distribution peak, a percentile, a functional alternative, or a combination of the options mentioned. Thereby, calculating: μ′(t) = median(((x[t-τ] - x[t-τ-1]) + η·g(tp)) / (x[t-τ-1] - x[t-τ-2])), where the division is element-wise; or, as an alternative example, wherein E represents the mean: μ′(t) = E[((x[t-τ] - x[t-τ-1]) + η·g(tp)) / (x[t-τ-1] - x[t-τ-2])]. The sum may thereby be divided element-wise by the second gap vector, and the statistical representative of the resulting vector is a scalar which may be used as a total momentum estimation. 
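Process 250 may be sketched, for example, as follows; the names are hypothetical, the median is used as the statistical representative, and the guard against division by a zero element (`tiny`) is an added assumption, not part of the disclosure.

```python
from statistics import median

def estimate_total_momentum(first_gap, second_gap, grad, lr, tiny=1e-12):
    """Estimate the total momentum scalar mu'(t) per process 250.

    first_gap:  x[t-tau]   - x[t-tau-1]
    second_gap: x[t-tau-1] - x[t-tau-2]
    grad:       gradient vector at an iteration proximate to the reference
    lr:         learning rate eta
    tiny:       guard against division by a zero element (an assumption)
    """
    # 251 and 252: multiply the gradient by the learning rate and sum it
    # with the first gap vector; 253: divide element-wise by the second
    # gap vector and take a statistical representative (here, the median).
    ratios = [(f + lr * g) / (s if abs(s) > tiny else tiny)
              for f, g, s in zip(first_gap, grad, second_gap)]
    return median(ratios)
```

Swapping `median` for `statistics.mean` yields the alternative estimator wherein E represents the mean.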
Note that variations of the process are apparent to the person skilled in the art, and are within the scope of the claims. Reference is now made to FIG. 3A, which is a schematic block diagram of a feedback loop for improving ASGD generalization, according to some embodiments of the present disclosure. Improving ASGD generalization using total momentum control may be performed by controlling the explicit momentum, for example as defined by Polyak, the batch size, and the learning rate; other methods may be conceived. This diagram shows an example of monitoring the absolute value of the statistical representative value of the total momentum estimation scalar to detect compromised generalization. The measure total momentum block may comprise a hardware, firmware, or software implementation of YellowFin, to measure the total momentum characterizing each step of the ASGD execution environment, and subsequently, the control component may adjust the external set of hyper-parameters so that the total momentum is close to zero, as well as detect compromised generalization. In some implementations of the present disclosure, the effective learning rate may be the basis of the control unit, computed as follows: η′ = η / (1 - μ′), where μ′ is the measured total momentum and η is the learning rate hyper-parameter. Within the control unit, the effective learning rate η′ may be set according to the scheduled algorithmic learning rate, as set through the external set of hyper-parameters, and then the algorithmic learning rate η may be computed by inverting the calculation above. The learning rate calculated thereby may be set to control the hyper-parameters of the ASGD execution environment. The disclosed control unit may thereby drive the total momentum toward zero, leading to improvement in ASGD generalization. 
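The control unit arithmetic above may be sketched, for example, as follows; the names are hypothetical, and `scheduled_lr` stands for the scheduled effective learning rate η′.

```python
def control_learning_rate(scheduled_lr, measured_total_momentum):
    """Re-compute the algorithmic learning rate eta so that the effective
    learning rate eta' = eta / (1 - mu') matches the scheduled value.

    scheduled_lr:            the scheduled effective learning rate eta'
    measured_total_momentum: the measured total momentum mu'
    """
    # Inverting eta' = eta / (1 - mu') gives eta = eta' * (1 - mu').
    return scheduled_lr * (1.0 - measured_total_momentum)
```

For example, a scheduled effective rate of 0.1 with a measured total momentum of 0.5 yields an algorithmic learning rate of 0.05.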
This learning rate calculation is an example of automatic tuning of ASGD hyper-parameters, according to the measured total momentum, and enables training the machine learning based model in an execution environment optimized for minimizing the absolute value of the total momentum. Also note that the disclosure may be applied both in pipeline Model Parallel ASGD and in Data Parallel ASGD, and combinations thereof. Experiments show that adjusting the total momentum toward 0 improves the model generalization in both parallel ASGD settings. Reference is now made to FIG. 3B, which is a schematic block diagram of an implementation example for detection of compromised generalization, according to some embodiments of the present disclosure. An ASGD execution may comprise three components: training data, a Neural Network architecture, and an ASGD algorithm. The disclosure also comprises an optional method for detecting when the generalization is compromised in a training experiment. The method may start with a definition of a threshold d, to detect iterations t when the total momentum is far away from 0. The calculation counts when the total momentum is far away from zero, i.e. when: |μ′(t)| > d. Generalization may be compromised when there is a large enough number of iterations when the above holds. To capture this intuition, we define a second threshold ε, and our detection component declares that an ASGD execution is compromised when the fraction of iterations, in which the statistical representative value of the total momentum exceeds the threshold, is above ε, i.e.: (1/T)·Σ δ(|μ′(t)| > d) > ε, where the summation is over iterations t = 1,…, T, and δ() is the Kronecker delta, i.e. δ(c) = 1 if the Boolean value of c = True and δ(c) = 0 when c = False. To summarize, the disclosed declaration component detects a compromised ASGD execution when the statistical representative value of the total momentum, as detected by the measure total momentum block, exceeds the threshold in many iterations, i.e. in more than an ε fraction of the iterations. 
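The two-threshold declaration above may be sketched, for example, as follows; the function and argument names are hypothetical, and δ() is realized by a Boolean comparison.

```python
def execution_compromised(mu_series, d, eps):
    """Declare an ASGD execution compromised by the two-threshold rule.

    mu_series: total momentum estimates mu'(t) for iterations t = 1..T
    d:         threshold on |mu'(t)| (the far-from-zero test)
    eps:       maximal tolerated fraction of far-from-zero iterations
    Implements (1/T) * sum_t delta(|mu'(t)| > d) > eps.
    """
    exceed = sum(1 for mu in mu_series if abs(mu) > d)  # sum of delta()
    return exceed / len(mu_series) > eps
```

For example, a series in which half the estimates exceed d = 0.5 in absolute value is declared compromised for ε = 0.4 but not for ε = 0.6.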
Some implementations of the ASGD execution environment may terminate sessions for which the statistical representative value of the total momentum still exceeds the threshold after some time has elapsed. The monitoring may be applied by counting the number of iterations wherein the absolute value of the statistical representative value of the total momentum estimation scalar |μ′(t)| exceeded a threshold d within a history range. Note that the proposed two-stage threshold method is an arbitrary choice made for simplicity, and other policies of detecting compromised generalization through the total momentum, or the implicit momentum, are apparent to the person skilled in the art, for example using dual thresholds at either or both stages, partitioning of steps by hierarchy, averaging momentum through iterations, and the like, and are within the scope of the claims. Reference is also made to FIG. 4, which is a schematic block diagram description of an implementation of a feedback loop, according to some embodiments of the present disclosure. This is an exemplary implementation of the present disclosure using the learning rate for training the machine learning based model; however, other implementations based on learning rate adjustments are apparent to the person skilled in the art and within the scope of the claims. A feedback loop may start from running an ASGD algorithm in an execution environment, where the feedback loop measures the total momentum during the execution at the ASGD execution environment, and feeds it into the control unit, which adjusts the learning rate to control the total momentum. Measuring total momentum may be performed as proposed, by implementing YellowFin in a measure total momentum block, or a functional alternative. The control unit may dynamically re-compute values of ASGD hyper-parameters and use them to control the ASGD execution environment. 
The new re-computed values of hyper-parameters may adjust the total momentum of ASGD execution toward zero, and, thus, may improve the generalization of ASGD training. Some implementations comprise a control unit for the effective learning rate, computed as follows: η′ = η / (1 - μ′), where μ′ is the measured total momentum and η is the learning rate hyper-parameter. The disclosed control unit may set the effective learning rate η′ according to the scheduled algorithmic learning rate, and then compute the algorithmic learning rate η using the following calculation: η = η′·(1 - μ′), and control the ASGD execution environment therewith. The disclosed control unit may adjust the total momentum toward zero, leading to improvement in ASGD generalization. Note that adjusting the momentum may also be performed by controlling the learning rate, the batch size, and/or the like. Reference is now made to FIG. 5A, which is a schematic graph depicting the total momentum during training in an exemplary experiment, according to some embodiments of the present disclosure. The curve 512 shows the total momentum as calculated by the method disclosed in FIG. 2B, when adjusted using the learning rate toward values closer to zero in absolute value, compared to the other curve 510, which shows the total momentum calculated similarly without adjustments. The exemplary experiment was performed using Model Parallel pipeline training based on the cifar10 dataset, using the resnet20 architecture and the ASGD algorithm with 8 computing devices as worker nodes. Reference is now made to FIG. 5B, which is a schematic graph depicting test precision during training in an exemplary experiment, according to some embodiments of the present disclosure. The curve 520 shows the test precision, i.e. precision on data not used for training or tuning hyper-parameters, when the total momentum is not adjusted to a lower absolute value, while the other curve 522 shows the test precision when the total momentum is adjusted. 
The exemplary experiment was also performed using Model Parallel pipeline training based on the cifar10 dataset, using resnet20 architecture and ASGD algorithm with 8 computing devices as worker nodes. These examples provide experimental support that pushing, or adjusting the total momentum toward zero may improve convergence properties of ASGD, for example in Model Parallel pipeline training. The descriptions of the various embodiments have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. It is expected that during the life of a patent maturing from this application many relevant machine learning models, neural network variants, and training methods will be developed and the scopes of the terms machine learning model, neural network, ASGD, training step and the like are intended to include all such new technologies a priori. As used herein the term “about” refers to ± 10 %. The terms "comprises", "comprising", "includes", "including", “having” and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of" and "consisting essentially of". The phrase "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method. 
As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof. The word “exemplary” is used herein to mean “serving as an example, instance or illustration”. Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments. The word “optionally” is used herein to mean “is provided in some embodiments and not provided in other embodiments”. Any particular embodiment may include a plurality of “optional” features unless such features conflict. Throughout this application, various embodiments may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of embodiments. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range. Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. 
The phrases “ranging/ranges between” a first indicate number and a second indicate number and “ranging/ranges from” a first indicate number “to” a second indicate number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween. It is appreciated that certain features of embodiments, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of embodiments, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements. Although embodiments have been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims. It is the intent of the applicant(s) that all publications, patents and patent applications referred to in this specification are to be incorporated in their entirety by reference into the specification, as if each individual publication, patent or patent application was specifically and individually noted when referenced that it is to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting. 
In addition, any priority document(s) of this application is/are hereby incorporated herein by reference in its/their entirety.