
Title:
ARTIFICIAL INTELLIGENCE SYSTEM AND COMPUTER-IMPLEMENTED METHOD FOR PREDICTING OUTPUTS FROM SAMPLES, AND TRAINING METHOD OF THE SYSTEM
Document Type and Number:
WIPO Patent Application WO/2023/241827
Kind Code:
A1
Abstract:
Artificial intelligence system and computer-implemented method for predicting outputs from samples, and training method thereof. The artificial intelligence system (100) comprises a path selector (110) comprising a neural network (112) and an estimator (120) comprising a collection of paths (122) in said neural network, each path (124) having an assigned path weight (126). For each sample, the path selector (110) obtains active paths (114) in the neural network (112) in which every neuron output is different from zero for the corresponding sample (102), and the estimator (120) computes estimator outputs (130) as the sum of contributions of active paths (214) leading to the corresponding neural network output (206). The training method (800) obtains active paths (114) using the path selector (110) for each sample in a training dataset (802) and generates a system of linear equations (1002) using the path weights (126) of the active paths (214) as unknowns. Advantageously, the training method (800) requires processing the training dataset (802) only once. Additionally, a retraining method (1100) requires processing only the additional samples (1104).

Inventors:
DUATO MARÍN JOSÉ FRANCISCO (ES)
Application Number:
PCT/EP2023/050313
Publication Date:
December 21, 2023
Filing Date:
January 09, 2023
Assignee:
QSIMOV QUANTUM COMPUTING S L (ES)
International Classes:
G06N3/048; G06N3/0499; G06N3/09; G06N3/0455; G06N3/0464
Other References:
FONTENLA-ROMERO O ET AL: "A new convex objective function for the supervised learning of single-layer neural networks", PATTERN RECOGNITION, ELSEVIER, GB, vol. 43, no. 5, 1 May 2010 (2010-05-01), pages 1984 - 1992, XP026892662, ISSN: 0031-3203, [retrieved on 20091204], DOI: 10.1016/J.PATCOG.2009.11.024
FONTENLA-ROMERO OSCAR ET AL: "LANN-SVD: A Non-Iterative SVD-Based Learning Algorithm for One-Layer Neural Networks", IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, IEEE, USA, vol. 29, no. 8, 1 August 2018 (2018-08-01), pages 3900 - 3905, XP011687389, ISSN: 2162-237X, [retrieved on 20180720], DOI: 10.1109/TNNLS.2017.2738118
Attorney, Agent or Firm:
CLARKE MODET & CO (ES)
Claims:
CLAIMS

1. An artificial intelligence system for predicting outputs from samples, the artificial intelligence system (100) comprising: a path selector (110) comprising a neural network (112) trained with an initial training dataset (804) including a plurality of initial training samples (806), the neural network (112) comprising a plurality of consecutive layers (204) jointly connecting one or more neural network inputs (202) to one or more neural network outputs (206), each layer (204) comprising one or more neurons (208), each neuron (208) configured to compute a neuron output for each sample (102) according to a function of the neuron inputs such that the neuron output is zero or a linear combination of one or more neuron inputs plus a bias; and an estimator (120) comprising a collection of paths (122) in the neural network (112), each path (124) connecting either a neural network input (202) or a neuron (208) to a neural network output (206) through a sequence of neurons (208) located in consecutive layers (204), one neuron (208) per each consecutive layer (204); each path (124) having an assigned path weight (126); wherein the path selector (110) is configured to receive one or more samples (102) and obtain a set of active paths (214) for each sample (102), wherein each active path (214) is a path (124) from the collection of paths (122) of the estimator (120) in which every neuron output is different from zero for the corresponding sample (102) and each neuron input in said path (124) has an effect on the corresponding neuron output for the corresponding sample (102); and wherein the estimator (120) is configured to perform a computation of predicted estimator outputs (130) for each sample (102), wherein each estimator output (130) is associated with a corresponding neural network output (206), wherein each estimator output (130) is computed as the sum of the contributions of all the active paths (214) for the corresponding sample (102) leading to the corresponding neural network output (206), wherein the contribution of each active path (214) is: when the active path (214) connects a neural network input (202) to a neural network output (206), the product of the path weight (126) times the neural network input (202); when the active path (214) connects a neuron (208) to a neural network output (206), the path weight (126).

2. A computer-implemented method of training the artificial intelligence system of claim 1 , the training method (800) comprising: obtaining (810) a set of active paths (114) using the path selector (110) of the artificial intelligence system (100) for each sample (102) in an extended training dataset (802), the extended training dataset (802) comprising the initial training samples (806) of the initial training dataset (804) and additional training samples (808); generating (820) a first set of equations (902) that relate the expected neural network outputs (206) with the neural network inputs (202) for each sample in the extended training dataset (802), using the weights and biases of the neurons (208) along the active paths (214) associated with each sample as primary unknowns (904), wherein each neural network output (206) is expressed as the sum of the contributions of all the active paths (214) for the corresponding sample leading to said neural network output (206), wherein the contribution of each active path (214) is: when the active path (214) connects a neural network input (202) to a neural network output (206), the product of the neuron weights along the active path (214) times the neural network input (202); when the active path (214) connects a neuron (208) to a neural network output (206), the product of the neuron weights along the active path (214) times the bias value of the neuron (208); transforming (830) the first set of equations (902) into a first system of linear equations (1002) by replacing each product (906) of two or more primary unknowns (904) with a secondary unknown (1006); solving (840) the first system of linear equations (1002); and updating (850) the path weight (126) assigned to each path (124) of the estimator (120) of the artificial intelligence system (100) using the value computed for the corresponding unknown (904,1006).

3. The method of claim 2, further comprising a subsequent retraining process (1100) of the artificial intelligence system, wherein the retraining process (1100) comprises: obtaining (1110) a set of active paths (114) using the path selector (110) for each sample in a retraining dataset (1102) containing additional samples (1104) not included in the extended training dataset (802); generating (1120) a second set of equations (1202) that relate the expected neural network outputs (206) with the neural network inputs (202) for each sample in the retraining dataset (1102); transforming (1130) the second set of equations (1202) into a second system of linear equations (1304) by replacing each product (906) of two or more primary unknowns (904) with a secondary unknown (1006); appending (1140) the second system of linear equations (1304) to the first system of linear equations (1002) to obtain an extended system of linear equations (1302); solving (1150) the extended system of linear equations (1302); and updating (1160) the path weight (126) assigned to each path (124) of the estimator (120) using the value computed for the corresponding unknown (904,1006).

4. The method of any of claims 2 to 3, wherein solving (840,1150) the first (1002) or the extended (1302) system of linear equations comprises obtaining a triangular system of equations using QR decomposition, and solving the triangular system of equations.

5. The method of claims 3 and 4, wherein appending (1140) the second system of linear equations (1304) to the first system of linear equations (1002) comprises appending the second system of linear equations (1304) to the triangular system of equations resulting from applying QR decomposition to the first system of linear equations (1002).

6. The method of claim 4 or 5, wherein solving (840, 1150) the first (1002) or the extended (1302) system of linear equations further comprises applying singular value decomposition, SVD, to determine the rank of the triangular system of equations expressed as an equation system matrix.

7. The method of any of claims 2 to 6, further comprising replacing one or more of the unknown weights and biases of the neurons (208) along the active paths (214) with a constant value in the step of generating (820,1120) a first (902) or a second (1202) set of equations or in the step of transforming (830, 1140) the first (902) or second (1202) set of equations into the first (1002) or second (1304) system of linear equations.

8. A computer-implemented method of predicting outputs from samples, the method (700) comprising: receiving (710) a sample (102); obtaining (720) a set of active paths (114) for the sample (102) in a neural network (112) trained with an initial training dataset (804) including a plurality of initial training samples (806), the neural network (112) comprising a plurality of consecutive layers (204) jointly connecting one or more neural network inputs (202) to one or more neural network outputs (206), each layer (204) comprising one or more neurons (208), each neuron (208) configured to compute a neuron output for each sample (102) according to a function of the neuron inputs such that the neuron output is zero or a linear combination of one or more neuron inputs plus a bias; wherein each active path (214) is a path (124) from a collection of paths (122) in the neural network (112) in which every neuron output is different from zero for the sample (102) and each neuron input in said path (124) has an effect on the corresponding neuron output for the sample (102); wherein each path (124) of the collection of paths (122) connects either a neural network input (202) or a neuron (208) to a neural network output (206) through a sequence of neurons (208) located in consecutive layers (204), one neuron (208) per each consecutive layer (204); each path (124) having an assigned path weight (126); and computing (730) predicted estimator outputs (130) for the sample (102), wherein each estimator output (130) is associated with a corresponding neural network output (206), wherein each estimator output (130) is computed as the sum of the contributions of all the active paths (214) for the sample (102) leading to the corresponding neural network output (206), wherein the contribution of each active path (214) is: when the active path (214) connects a neural network input (202) to a neural network output (206), the product of the path weight (126) times the neural network input (202); when the active path (214) connects a neuron (208) to a neural network output (206), the path weight (126).

9. A computer program product comprising computer code instructions which, when the program is executed by a processor, cause the processor to carry out the steps of the method (800) of any of claims 2 to 7.

10. The computer program product of claim 9, comprising at least one computer-readable storage medium having recorded thereon the computer code instructions.

11. A computer program product comprising computer code instructions which, when the program is executed by a processor, cause the processor to carry out the steps of the method (700) of claim 8.

12. The computer program product of claim 11, comprising at least one computer-readable storage medium having recorded thereon the computer code instructions.

Description:
ARTIFICIAL INTELLIGENCE SYSTEM AND COMPUTER-IMPLEMENTED METHOD FOR PREDICTING OUTPUTS FROM SAMPLES, AND TRAINING METHOD OF THE SYSTEM

DESCRIPTION

FIELD

The present invention is comprised within the field of artificial intelligence systems derived from artificial neural networks and used to predict outputs from samples, and in the field of supervised training and retraining methods for artificial intelligence systems using training samples.

BACKGROUND

The ongoing deluge of scientific and social data experienced in recent years has led to a new scientific exploration paradigm, based on data-intensive scientific discovery, asking for the efficient integration of big data management techniques (manipulation, analysis, visualisation, etc.) with large-scale computer systems and high-performance computing. In the current “Information Age”, we are already overwhelmed with data, and the emergence of the Internet of Things (IoT) will only exacerbate the escalating data concerns. The question in this scenario is how to transform all these data into useful information.

Artificial intelligence (AI) offers an effective means to infer knowledge from the data. As a consequence, AI has been called in to process not only the rocketing volumes of social and industrial data but also to tackle many traditional scientific applications. In particular, in recent years, machine learning via deep neural networks (DNNs), also referred to as deep learning (DL), has shown great success in a large variety of applications, extending beyond traditional niches in image classification, speech recognition, and machine translation to an ample range of scientific problems. Following this trend, the interest of the research community and the industry has paved the road toward the design of user-friendly DL frameworks such as Google TensorFlow, Facebook PyTorch, Microsoft CNTK, Apache MXNet, Matlab DL Toolbox, and Scikit-learn, among others, as well as complementary tools and libraries such as Keras, Intel oneDNN, ARMNN, NVIDIA cuDNN, etc., on top of specialised hardware accelerators. Supervised DL consists of two consecutive stages, an initial training process followed by inference, with very distinct computational requirements. Specifically, during the training stage, the parameters of the DNN model are iteratively tuned using a gradient descent method, such as stochastic gradient descent (SGD) or any of its variants (implicit SGD, Adam, AdaGrad, RMSProp, Momentum, Nesterov, etc.). This procedure computes and backpropagates the derivatives of the loss function with respect to the model parameters to adjust them until the difference between the model prediction and the ground truth is satisfactorily small. With the growth in the number of layers and neurons per layer, this iterative training becomes increasingly time- and energy-consuming, and this may be aggravated by problems such as slow convergence and local minimum stagnation. In contrast, compared with training, the inference stage is computationally much lighter.

To alleviate the training costs, non-iterative algorithms, such as extreme learning machines, random vector functional link networks, or neural networks with random weights, analytically compute the model parameters. However, they have been applied only to non-DNN-based machine learning algorithms or shallow neural networks, which are limited in their ability to tackle complex modelling tasks. Indeed, the expressiveness of DNNs to learn high-level abstractions comes from the integration of many hidden layers, in combination with non-linear activation functions between linear relations. One such function, the ReLU (Rectified Linear Unit), is of particular interest for research on training dynamics, having been proved to deliver satisfactory results for many applications. An appealing property of the ReLU activation function is its simplicity, given by its only two possible states, pass-through or zero (“active” or “inactive”, respectively).

For many applications with evolving data, DNN training is not a one-time task but an incremental learning process. In these cases, once the initial model is well-trained on historical data, it needs to be periodically fine-tuned or retrained based on a continuous flow of new data. For that purpose, it is important to determine not only when to retrain the model but also how to do it efficiently. In this scenario, catastrophic forgetting (that is, the undesired loss of knowledge acquired in the past) is a major barrier. Technically, this occurs when the retraining excessively adjusts the learned model parameters to the new samples such that the process does not generalise satisfactorily on the initial data. There have been numerous attempts to limit forgetting, such as memory replay (i.e., retraining on a combination of previous and new data, or retraining periodically using all previous data), adjusting the parameters according to their importance, or dynamically adjusting the underlying network architecture. However, none of these attempts has been combined with the advances in non-iterative training strategies that can be found in the state of the art.

In a different path, modular DNNs (MDNNs) are characterised by a series of potentially pre-trained independent models whose outputs are moderated by some interface layer(s) in the final stage. These interface layers are in charge of processing the individual model outputs in order to produce a global response. While such individual models are intended to process inputs of a different nature, ensembles of DNNs (a subset of MDNNs) usually process the same inputs to settle on a final prediction. The interesting feature is that, up to a certain degree of data evolution, retraining MDNNs implies adjusting the parameters of the last layer(s) only, while keeping the model components unmodified. Therefore, a non-iterative retraining of MDNNs can be performed by considering linear-regression methods for one-layer feed-forward neural networks. As a consequence, MDNN retraining can be much faster, also requiring much less energy than for conventional DNNs. Despite the shorter training time of MDNNs over DNNs, they still require retraining the individual models as soon as the global predictions become obsolete with new data.

Despite the impressive results achieved by supervised DL in multiple fields, there exist key application areas in which the designers refrain from using this powerful tool. The main obstacle to a wider adoption of DL is the walloping cost of DNN training. The slowdown of Moore’s Law and the end of Dennard’s Scaling Law resulted in the need to sacrifice single-core computational performance for higher energy efficiency via the design of multicore architectures and hardware accelerators. In particular, the increasing use of complex DNNs by the industry in general, and information technology (IT) companies in particular, has revitalised the interest in specialised processor architectures and accelerators.

Unfortunately, the time and energy costs of DNN training remain a problem whose severity is growing over time, because the demand for computational resources for this purpose is currently growing faster than the availability of faster hardware. The reason is that the DNNs that perform complex tasks require a considerable number of layers, integrating a huge number of neurons (and, therefore, tuneable parameters), together with a training phase involving a very large dataset in order to recognise the variety of meaningful features with acceptable accuracy.

The net result is that more complex DNNs as well as applications that require processing larger datasets will incur an increasingly larger number of operations with conventional SGD-based iterative training (in part due to problems such as slow convergence and techniques to avoid local minima). The long training time and high energy costs imply that this process must be done offline. This may be acceptable for some application areas. However, in many others, data evolves dynamically over time, asking for a periodic retraining that is costly and difficult to implement without disruption. Frequent DNN retraining requires a fundamentally different approach to that employed in DNN training, so that the former becomes much faster.

As relevant as the long computing time is the energy consumed by DNN (re)training, and this is related to another important problem the world is facing. The human consumption of Earth’s resources, especially energy, is ever-growing, but these resources are limited. Techniques based on AI methods are being reported to improve manufacturing efficiency in many production areas. However, the very high energy consumption of DNN training is partially cancelling the benefits of AI-based optimization. Although specialised hardware accelerators are between one and two orders of magnitude more energy-efficient than conventional CPUs for DNN training, forecasts on IT electricity consumption are of utmost concern. For example, data centres (including those used for data storage and DNN training) will consume four times more energy in 2030 than in 2010. Additionally, the high energy consumption will constrain (re)training in battery-operated devices (including cell phones and cars).

In summary, DL has proven to deliver excellent results in many application areas, and it is critical to optimising the use of scarce resources, but:

The long training time impedes online training and real-time retraining, drastically narrowing the usefulness of DL in key application areas.

The considerable energy consumed by DNN training partially cancels the benefits of AI-based resource optimizations and constrains (or even impedes) retraining in battery-operated devices.

In conclusion, there is a preeminent need to develop energy-efficient, lightning-fast training and retraining techniques for DNNs or similar artificial intelligence systems, so that this technology can dynamically evolve to adapt to time-varying scenarios, enabling very high performance while becoming much more environmentally friendly.

SUMMARY

The invention relates to a novel computer-implemented artificial intelligence system that decouples the selection of the computations to be executed when a sample is processed from the proper execution of said computations, also decoupling training. The artificial intelligence system predicts outputs from samples and can be rapidly (re)trained with additional samples using a novel training method based on solving a system of linear equations.

The artificial intelligence system of the present invention derives from an associated feed-forward neural network, pre-trained during the initialization of the system, whose neuron behaviour is classified as inactive (output equal to zero) or active (output different from zero). The neural network contains paths that are classified as inactive or active, as will be defined in detail below. The artificial intelligence system comprises a path selector and an estimator, as will be explained in more detail below. The path selector comprises the associated neural network plus additional code to compute a set of active paths, which will be later defined, for a sample. The estimator comprises a collection of all the paths in the associated neural network, each path having an assigned path weight that is initially computed from the weights and biases of the pre-trained associated neural network during the initialization of the artificial intelligence system. The path selector is configured to receive one or more samples and use the associated neural network to obtain a set of active paths for each sample but it does not predict any output for said sample. The estimator is configured to perform a computation of predicted outputs for each sample by using the sample inputs and the weights of the active paths for said sample.

As opposed to a neural network, where (re)training updates all the neuron weights and biases, the artificial intelligence system of the present invention only trains the path weights for the collection of paths in the estimator while leaving the path selector unaltered. Also, the paths in the collection of paths of the estimator are virtualized and do not take into account the path overlapping in the associated neural network. Thus, the path weights for the collection of paths of the estimator can be independently set. The decoupling between path selector and estimator together with the path virtualization in the collection of paths of the estimator enable the implementation of much faster methods for (re)training the artificial intelligence system of the present invention. Moreover, the path virtualization enables a higher degree of freedom to set the path weights with respect to the associated neural network, enabling a better fit during (re)training and higher prediction accuracy during inference.

The artificial intelligence system of the present invention is initialized to set up the neural network in the path selector and compute the initial path weights for the collection of paths in the estimator. The decoupling between path selector and estimator together with the path virtualization in the estimator enable (re)training methods that operate by forming and solving a system of linear equations. The different (re)training methods are configured to save as many computations as possible. In particular, when retraining with additional samples, the equations obtained in previous (re)training operations are reused. Moreover, almost all the computations performed to solve the system of linear equations in previous (re)training operations are reused, leading to the first truly incremental retraining method for an artificial intelligence system. The different (re)training methods are finally complemented with additional steps to reduce the number of unknowns in the system of linear equations and the associated memory requirements to solve the equation system.

As mentioned above, the artificial intelligence system comprises a path selector and an estimator. The path selector includes a neural network trained with an initial training dataset including a plurality of initial training samples. The neural network comprises a plurality of consecutive layers jointly connecting one or more neural network inputs to one or more neural network outputs, wherein each layer comprises one or more neurons that compute a neuron output for each sample according to a function of the neuron inputs, the function being such that the neuron output is zero or a linear combination of one or more neuron inputs plus a bias. The estimator comprises a collection of paths in the neural network, each path connecting either a neural network input or a neuron to a neural network output through a sequence of neurons located in consecutive layers, one neuron per each consecutive layer, each path having an assigned path weight that is initially computed from the weights and biases of the pre-trained associated neural network during the initialization of the artificial intelligence system. The artificial intelligence system is configured to predict outputs from samples. The path selector is configured to receive one or more samples and obtain a set of active paths for each sample, wherein each active path is a path from the collection of paths of the estimator in which every neuron output is different from zero for the corresponding sample and each neuron input in said path has an effect on the corresponding neuron output for the corresponding sample. Each estimator output has an associated neural network output. The estimator is configured to perform a computation of predicted estimator outputs for each sample, wherein each estimator output is computed as the sum of the contributions of all the active paths for the corresponding sample leading to the neural network output associated with said estimator output. The contribution of each active path is either the product of the path weight times the neural network input, when the active path connects a neural network input to a neural network output, or the path weight, when the active path connects a neuron to a neural network output.
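
By way of illustration, the computation performed by the estimator on the active paths of a sample can be sketched in Python as follows. The Path record and the mapping from path identifiers to path weights are hypothetical data structures chosen for this sketch; the code merely mirrors the sum of active-path contributions described above and is not the claimed implementation.

from dataclasses import dataclass
from typing import List, Mapping, Optional, Sequence

@dataclass
class Path:
    path_id: int                 # index into the collection of paths of the estimator
    input_index: Optional[int]   # neural network input the path starts from; None for a bias path
    output_index: int            # neural network output the path leads to

def estimator_outputs(sample: Sequence[float],
                      active_paths: Sequence[Path],
                      path_weights: Mapping[int, float],
                      num_outputs: int) -> List[float]:
    # Each estimator output is the sum of the contributions of the active paths
    # leading to the associated neural network output.
    outputs = [0.0] * num_outputs
    for p in active_paths:
        pw = path_weights[p.path_id]
        if p.input_index is not None:
            # full path: path weight times the neural network input value
            outputs[p.output_index] += pw * sample[p.input_index]
        else:
            # bias path: the path weight itself
            outputs[p.output_index] += pw
    return outputs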

The present invention also refers to a computer-implemented method of predicting outputs from samples, this method corresponding to the actions performed by the artificial intelligence system.

The present invention also refers to a computer-implemented method of training the artificial intelligence system. The training method comprises the following steps, sketched in code after the list:

Obtaining the set of active paths using the path selector of the artificial intelligence system for each sample in an extended training dataset, the extended training dataset comprising the initial training samples of the initial training dataset and additional training samples.

Generating a first set of equations that relate the expected neural network outputs with the neural network inputs for each sample in the extended training dataset, using the weights and biases of the neurons along the active paths associated with each sample as primary unknowns, wherein each neural network output is expressed as the sum of the contributions of all the active paths for the corresponding sample leading to said neural network output, and the contribution of each active path is the product of the neuron weights along the path times the neural network input (when the active path connects a neural network input to a neural network output) or the product of the neuron weights along the active path times the bias value of the neuron (when the active path connects said neuron to a neural network output).

Transforming the first set of equations into a first system of linear equations by replacing each product of two or more primary unknowns with a secondary unknown, wherein every remaining primary unknown and every secondary unknown is the weight of a path from the collection of paths of the estimator.

Solving the first system of linear equations.

Updating the path weight assigned to each path of the estimator of the artificial intelligence system using the value computed for the corresponding unknown.
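
By way of illustration, the training steps listed above can be sketched with NumPy as follows. The sketch assumes a hypothetical active_paths_fn provided by the path selector, returning for each active path its identifier, the neural network input it starts from (None for a bias path), and the neural network output it leads to; it is a schematic outline under these assumptions, not the claimed implementation.

import numpy as np

def train_path_weights(samples, expected_outputs, active_paths_fn, num_paths):
    # Assemble the first system of linear equations (one equation per sample and
    # per output) with the path weights as unknowns, then solve it in a
    # least-squares sense. active_paths_fn(sample) is assumed to yield tuples
    # (path_id, input_index, output_index), with input_index None for bias paths.
    rows, rhs = [], []
    for sample, expected in zip(samples, expected_outputs):
        coeff = np.zeros((len(expected), num_paths))
        for path_id, input_index, output_index in active_paths_fn(sample):
            # coefficient of the unknown path weight: input value (full path) or 1 (bias path)
            coeff[output_index, path_id] += sample[input_index] if input_index is not None else 1.0
        rows.append(coeff)
        rhs.append(np.asarray(expected, dtype=float))
    A = np.vstack(rows)                               # equation system matrix
    b = np.concatenate(rhs)                           # expected neural network outputs
    weights, *_ = np.linalg.lstsq(A, b, rcond=None)   # solve the linear system
    return weights                                    # updated path weights for the estimator

Note that the extended training dataset is processed only once to assemble the equations; the computed values are then used to update the path weight assigned to each path of the estimator.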

The training method may include a subsequent retraining process that allows saving computations. The retraining process comprises:

Obtaining the set of active paths using the path selector for each sample in a retraining dataset containing additional samples not included in the extended training dataset.

Generating a second set of equations that relate the expected neural network outputs with the neural network inputs for each sample in the retraining dataset.

Transforming the second set of equations into a second system of linear equations by replacing each product of two or more primary unknowns with a secondary unknown, wherein every remaining primary unknown and every secondary unknown is the weight of a path from the collection of paths of the estimator.

Appending the second system of linear equations to the first system of linear equations to obtain an extended system of linear equations.

Solving the extended system of linear equations.

Updating the path weight assigned to each path of the estimator with the value computed for the corresponding unknown.

The retraining process may include modified steps that allow processing only the additional samples (incremental retraining) without suffering from catastrophic forgetting, wherein appending the second system of linear equations to the first system of linear equations comprises appending the second system of linear equations to the triangular system of equations resulting from having applied QR decomposition (such as Householder transformations or Givens rotations) to the first system of linear equations in the previous (re)training process, and wherein solving the first or the extended system of linear equations is performed by obtaining a triangular system of equations using QR decomposition, and solving the triangular system of equations.
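
By way of illustration, this incremental variant can be sketched with NumPy as follows, under the assumption that only the triangular factor and the correspondingly transformed right-hand side from the previous (re)training are retained; the function names are hypothetical and the sketch is a schematic outline, not the claimed implementation.

import numpy as np

def triangularise(A, b):
    # Reduce A·x = b to an equivalent triangular system R·x = c via QR decomposition.
    Q, R = np.linalg.qr(A, mode='reduced')
    return R, Q.T @ b

def retrain_incrementally(R_prev, c_prev, A_new, b_new):
    # Append the equations generated for the additional samples to the retained
    # triangular system and re-triangularise; the previous samples are never
    # reprocessed, which is what makes the retraining incremental.
    R_ext, c_ext = triangularise(np.vstack([R_prev, A_new]),
                                 np.concatenate([c_prev, b_new]))
    # Solve the (possibly rank-deficient) triangular system for the path weights.
    weights, *_ = np.linalg.lstsq(R_ext, c_ext, rcond=None)
    return R_ext, c_ext, weights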

The training method may include a path count reduction step that allows controlling the number of paths and reducing memory requirements in a controlled manner. The path count reduction step comprises replacing one or more of the unknown weights and biases of the neurons along the active paths with a constant value in the step of generating a first or a second set of equations or in the step of transforming the first or second set of equations into the first or second system of linear equations.

Another aspect of the present invention refers to a computer program product comprising computer code instructions which, when the program is executed by a processor, cause the processor to carry out the steps of any of the aforementioned methods. The computer program product may comprise at least one computer-readable storage medium having recorded thereon the computer code instructions.

The training method according to the present invention provides extremely fast, cost-effective, and energy-efficient training of an artificial intelligence system by training only the path weights for the collection of paths in the estimator, and by transforming the optimization problem consisting of minimizing a cost function into solving a system of linear equations. The inference method according to the present invention provides increased inference accuracy by virtualizing the paths in the collection of paths of the estimator, thus allowing their weights to be independently set.

The present invention reduces (re)training time for supervised learning and the associated energy consumption, on the same hardware, by more than one order of magnitude in general, and by more than two orders of magnitude in the case of incremental retraining.

The present invention is applicable to artificial intelligence systems that derive from a wide variety of neural networks which do not rely on recurrency mechanisms (that is, do not contain backward connections among neuron layers). This includes multi-layer perceptrons, convolutional neural networks, modular neural networks, autoencoders, and transformers, among others, supporting the most popular connection patterns among layers (fully connected, convolutional, pooling, etc.) as well as activation functions that can be modelled with an active/inactive behaviour (no activation function, ReLU, average/maximum/minimum functions, etc.).

As opposed to iterative neural network (re)training methods, the present invention uses a radically different approach, introducing three key innovations:

1. Decomposing the behaviour of the neural network into two separate and decoupled functions, namely, an active path selector and an output value estimator based on the weights of the active paths. These separate functions define the components of the novel artificial intelligence system of the present invention, and enable training to be applied only to the estimator while keeping the path selector unaltered.

2. Transforming a highly non-linear optimization problem, such as (re)training, into a system of linear equations that can be solved with a reduced cost.

3. Virtualising the collection of paths in the estimator so that they are free from the overlapping constraints that occur in a neural network, enabling weights to be independently computed and assigned to each path in the collection of paths, and therefore allowing higher accuracy than the one achieved by a neural network.

Moreover, the present invention implements the first truly incremental retraining method for an artificial intelligence system, capable of processing only the new samples while preserving the knowledge generated from the previous samples without having to process them again. Finally, the training method of the present invention avoids some of the problems that are intrinsic to iterative optimization algorithms:

Very slow convergence: The training method of the present invention provides a direct solution that does not require iterations to solve the optimization problem.

Vanishing gradient problem: This problem is eliminated since no gradients need to be computed.

Local minimum stagnation: The solution to the equation system directly obtains the global minimum.

Catastrophic forgetting: The incremental approach is able to perform (re)training by processing only the new samples.

BRIEF DESCRIPTION OF THE DRAWINGS

A series of drawings which aid in better understanding the invention and which are expressly related with an embodiment of said invention, presented as a non-limiting example thereof, are very briefly described below.

Figure 1 shows an artificial intelligence system for predicting outputs from samples, according to an embodiment of the present invention.

Figure 2 shows an exemplary neural network with active/inactive paths, wherein the neurons apply the ReLU activation function to the weighted sum of their inputs for computing their respective outputs.

Figure 3A represents the computation of a neuron output, showing the configurable neuron weights. Figure 3B depicts a ReLU, used as an activation function in the neurons of Figure 3A.

Figure 4A shows the computation of a neural network output (O1) in the example of Figure 2. Figure 4B shows the computation of the corresponding estimator output (O1’).

Figure 5 shows an exemplary neural network with active/inactive paths, wherein the neurons of the third layer use the maximum function for computing their respective outputs and the rest of the neurons apply the ReLU activation function to the weighted sum of their inputs.

Figure 6 represents the computation of a neural network output in the example of Figure 5.

Figure 7 illustrates a flow diagram of a computer-implemented method of predicting outputs from samples using an artificial intelligence system, according to an embodiment of the present invention.

Figure 8 represents a flow diagram of a training method of the artificial intelligence system using additional training samples.

Figure 9 shows an example of a first set of equations generated by the training method.

Figure 10 shows the transformation of the first set of equations of Figure 9 into a first system of linear equations.

Figure 11 shows a subsequent retraining process used in the training method.

Figure 12 shows an example of a second set of equations generated by the retraining process of Figure 11.

Figure 13 shows an extended system of linear equations obtained during the retraining process.

DETAILED DESCRIPTION

This section presents a detailed description of an embodiment of the present invention. First, it shows a schematic diagram of the artificial intelligence system of the present invention, which derives from a neural network, also introducing the main components and their global behaviour as well as the main benefits of the present invention. Then it presents two examples of small neural networks with slightly different neuron functions, introducing some notation and precisely defining some basic concepts. Those two examples enable the computation of the relationship between inputs, outputs and neural network parameters. Next, the relationship between neural network parameters and the parameters of the artificial intelligence system is established. Such a relationship is then used to define an inference method to predict outputs from samples, and several variants of a training method for the artificial intelligence system of the present invention.

Figure 1 represents, according to an embodiment of the present invention, a schematic diagram of a trainable artificial intelligence system 100 for predicting outputs from samples using components derived from a neural network. The artificial intelligence system 100 of the present invention derives from an associated feed-forward neural network 112 whose neuron behaviour can be classified as inactive (output equal to zero) or active (output different from zero). Said neural network 112 contains paths, which are classified as inactive or active, as will be defined more precisely below. The artificial intelligence system 100 comprises a path selector 110 and an estimator 120. The path selector 110 comprises the associated neural network 112 plus additional code to compute the set of active paths 114 for a sample 102. Said neural network 112 is trained during the initialization of the artificial intelligence system 100 using prior-art methods. The estimator 120 comprises a collection of all the paths 122 (P1, P2, ..., PT) in the associated neural network 112, each path 124 having an assigned path weight 126 (PW1, PW2, ..., PWT) that is initially computed from the weights and biases of the associated neural network 112 during the initialization of the artificial intelligence system 100. The path selector 110 is configured to receive one or more samples 102 and use the associated neural network 112 to obtain a set of active paths 114 for each sample 102, but it does not predict any output for said sample. The estimator 120 is configured to perform a computation of predicted estimator outputs 130 for each sample 102 by using the sample inputs and the weights 126 of the active paths 114 for said sample.

The associated neural network 112 is used both as the main component of the path selector 110 and as a network from which a collection of paths 122 can be obtained for the estimator 120. As opposed to a neural network, which learns both structural and quantitative information during supervised training, the artificial intelligence system 100 decouples structural from quantitative information. Structural information is stored at the path selector 110. Quantitative information is stored at the estimator 120, and will be acquired during (re)training, as will be described below. This decoupling enables a (re)training method that updates the parameters of the estimator 120 (that is, the weights 126 of the paths 124 in the collection of paths 122) while leaving the path selector 110 unaltered, also allowing the implementation of extremely fast training methods. The inference method decouples the selection of the computations to be executed when a sample 102 is processed, performed at the path selector 110, from the proper execution of the computations associated with said sample, performed at the estimator 120, to obtain the predicted estimator outputs 130 with the highest possible accuracy.

Paths in the neural network 112 overlap with each other, thus sharing neuron weights and making it impossible to modify the contribution of a path to a neural network output without affecting the contribution of other paths. However, the collection of paths 122 comprises a virtualization of the paths in the neural network 112, not constrained by path overlapping, enabling each path weight 126 to be set independently from each other. Advantageously, the decoupling between the path selector 110 and the estimator 120 together with said path virtualization enable a (re)training method for the artificial intelligence system 100 that computes the parameters of the estimator 120 by solving a system of linear equations and processing the training dataset only once, which is much faster and requires far less energy than the customary iterative methods used to train the neural network 112, which require processing the training dataset many times. Additionally, by taking advantage of said path virtualization, the (re)training method for the artificial intelligence system 100 also achieves a better fit than the neural network 112, thus enabling a higher prediction accuracy than the neural network 112 from which it derives.

As mentioned above, the artificial intelligence system 100 comprises a path selector 110 and an estimator 120. The path selector 110 comprises a neural network 112 trained during the initialization of the artificial intelligence system 100 with an initial training dataset including a plurality of initial training samples. The neural network 112 is a feedforward neural network (FNN) that comprises a plurality of consecutive layers jointly connecting one or more neural network inputs to one or more neural network outputs. Each layer comprises one or more neurons, and each neuron is configured to compute, for each sample used as input to the neural network, a neuron output according to a function of the neuron inputs such that the neuron output is zero or a linear combination of one or more neuron inputs plus a bias.

Figure 2 depicts an example of a very simple neural network 112 used by the artificial intelligence system 100, wherein all the neurons implement the ReLU as the activation function, showing active and inactive paths for a particular sample 102. Although the example shows, for illustrative purposes, the neural network inputs 202 (first input i1, second input i2, third input i3 and fourth input i4) connected to the neural network outputs 206 (first output O1 and second output O2) through only four different layers 204 (first layer L1, second layer L2, third layer L3 and fourth layer L4), the neural network 112 used by the artificial intelligence system 100 may include up to thousands of different layers 204. In the example depicted in Figure 2, each layer 204 comprises a plurality of neurons 208, each neuron being expressed as n^l_j, wherein the superindex l represents the layer (column) index and the subindex j represents the neuron index within the layer 204 (row). In particular, the second layer L2 includes neurons n^2_1, n^2_2, n^2_3 and n^2_4, the third layer L3 includes neurons n^3_1, n^3_2, n^3_3 and n^3_4, and the fourth layer L4 includes neurons n^4_1 and n^4_2; the neurons of the first layer L1 follow the same notation.

The estimator 120 includes a collection of paths 122 obtained from the neural network 112, each path 124 connecting either a neural network input 202 or a neuron 208 to a neural network output 206 through a sequence of neurons 208 located in consecutive layers 204, one neuron 208 per each consecutive layer 204, each path 124 having an assigned path weight 126.

Since neural networks usually exhibit a very rich connectivity among layers 204, the paths partially overlap with each other. Moreover, there exist multiple paths from each input 202 to each output 206. In particular, the paths from a given input 202 to a given output 206 are referred to as full paths, and the paths from a given neuron 208 to a given output 206 are referred to as bias paths.

In Figure 2, lines (either solid lines 210 or dashed lines 212) represent a connection between:

(i) neural network inputs 202 and neurons 208 of the first layer Li ,

(ii) neurons 208 arranged in adjacent layers (e.g. neurons of the second layer L2 with neurons of the third layer L3), or

(iii) neurons 208 of the fourth layer L4 and neural network outputs 206.

Lines not connecting to a neural network output 206 include a connection weight w^k_(m,n) between a source (a first neuron 208 or an input 202) and a destination (a second neuron 208), wherein the superindex k represents the layer (or column) corresponding to the source (layer 0 in the case of an input), and the subindices m and n are, respectively, the destination row and the source row. The values of the weights w^k_(m,n) correspond to the weights of the trained neural network 112. For instance, in the example of Figure 2, a neuron of the second layer L2 is connected, to the left, with neurons of the first layer L1 and, to the right, with neurons of the third layer L3. Said neuron of the second layer L2 computes its output by first computing the weighted sum of its inputs, and then applying the ReLU activation function to the result; the weights applied to the inputs coming from the connected neurons of the first layer L1 are the corresponding connection weights w^1_(m,n).

Solid lines 210 connecting a first neuron to a second neuron imply that, for the particular sample 102 used as input to the neural network 112, the output of the first neuron is not zero and has an effect on the second neuron. Dashed lines 212 represent neuron outputs with value zero for the particular sample 102. In the example of Figure 2, the outputs of two neurons in the first layer L1 and of one neuron in the third layer L3 are zero for the particular sample 102, and the lines starting from said neurons are thus shown as dashed lines 212.

An active path is a path 124 from the collection of paths 122 in which every neuron output is different from zero for the corresponding sample 102 and each neuron input in said path has an effect on the corresponding neuron output for the corresponding sample 102. Full paths model the contribution of inputs 202 to outputs 206, while bias paths model the contribution of neuron biases to outputs 206. However, for a given sample 102, not all the paths contribute to the output values 206. For a given sample 102, only the active full paths, from one input 202 to one output 206, enable that input 202 to affect the values of the output 206. Similarly, only an active bias path enables the corresponding neuron bias to affect the values of the output 206.

Figure 2 depicts an exemplary active path 214 that connects a neuron in the second layer L2 to neural network output O2 through one neuron in each of the consecutive layers L3 and L4. Along this path every neuron output is different from zero (there are no dashed lines 212 in the path) for the corresponding sample 102 and each neuron input in said path has an effect on the corresponding neuron output for the corresponding sample 102; therefore, the path is considered as an active path 214. There are no more active paths 214 connecting said neuron of the second layer L2 and O2, since there is only one additional path that connects said neuron and neural network output O2, and this second path is not an active path because one of the neuron outputs in the path is zero (the path contains a dashed line 212).

The paths that are not active paths 214 can be referred to as “inactive paths”. Figure 2 shows an exemplary inactive path 216 connecting a neuron in the first layer L1 to neural network output O1. The path is inactive since the output of one of the neurons in the path is zero (see the corresponding dashed line 212 in the path) for the particular sample 102.

The collection of paths 122 includes all available paths 124 in the neural network 112 connecting a neural network input 202 to a neural network output 206. For example, in Figure 2 there are only four different paths connecting one particular input 202 to one particular output 206. Each path 124 has an assigned path weight 126 that is the product of the weights w^k_(m,n) of the lines forming said path; this way, the path weight 126 of each of those four paths is obtained as the product of the connection weights along the corresponding sequence of lines.

The paths 124 are determined irrespective of the value of the sample 102, since only the internal connections within the neural network 112 are considered (neuron input values and neuron output values are not taken into account). For instance, a path connecting neural network input i1 to a neural network output cannot pass through a neuron of the first layer L1 to which i1 is not connected by a line (either a solid line 210 or a dashed line 212 in Figure 2).

The collection of paths 122 further includes all available paths 124 in the neural network 112 connecting a neuron 208 to a neural network output 206. For example, in Figure 2 there are only two different paths connecting one particular neuron 208 of the second layer L2 to a neural network output 206. In this case, the path weight 126 of each such path is the product of the connection weights along the path times the bias of the source neuron, wherein the bias b^l_j of each neuron is labelled with the same scheme as the neuron n^l_j.
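
As a generic illustration with hypothetical row indices j1, j2, j3, j4 and input index n (introduced here only for this example), in a four-layer network such as the one of Figure 2 the initial weight of a full path from input i_n to an output through neurons n^1_(j1), n^2_(j2), n^3_(j3) and n^4_(j4), and the initial weight of a bias path from neuron n^2_(j2) to an output through n^3_(j3) and n^4_(j4), are respectively:

PW_full = w^0_(j1,n) · w^1_(j2,j1) · w^2_(j3,j2) · w^3_(j4,j3)

PW_bias = b^2_(j2) · w^2_(j3,j2) · w^3_(j4,j3)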

Since the estimator 120 includes a collection of paths 122 obtained from the neural network 112, the equations for the estimator 120 are obtained from the equations for the neural network 112. First, the mathematical expression for a neuron output as a function of its inputs is presented. Next, said expression for a neuron output is used to compute the mathematical expression for a neural network output 206. Since an estimator output replaces the corresponding neural network output 206 when computing predictions for a sample 102, the mathematical expression for a neural network output is used to obtain the mathematical expression for an estimator output by defining the estimator 120 parameters (the path weights 126 for the paths 124 in the collection of paths 122) and performing the corresponding substitutions in the mathematical expressions.

When using an activation function σ at a neuron n^l_i located at row i of layer l of a neural network, the value of the output a^l_i of said neuron n^l_i is a non-linear function (i.e. the activation function σ) of the weighted sum z^l_i of its inputs, which are the outputs {a^(l-1)_1, a^(l-1)_2, ...} of connected neurons located at the previous layer l-1. In the example of Figure 3A, the output a^l_1 of the neuron n^l_1 located at layer l is computed according to the following equations: z^l_1 = Σ_j w^(l-1)_(1,j) · a^(l-1)_j + b^l_1 and a^l_1 = σ(z^l_1), wherein b^l_1 is a bias and the activation function σ is a non-linear function.

As previously explained, the neuron output of a neuron 208 of the neural network 112 of the artificial intelligence system 100 is obtained according to a function of the neuron inputs of said neuron, the function being such that the value of the neuron output is either zero or a linear combination of one or more neuron inputs plus a bias. A ReLU (Rectified Linear Unit) activation function y=max(0,x), depicted in Figure 3B, satisfies this condition, since the neuron output y will always be either zero (y=0 when the value of the activation function input x<0, wherein y=a^l_1, x=z^l_1 and σ is a ReLU in the previous equation of Figure 3A) or a linear combination of one or more neuron inputs plus a bias (y=x=z^l_1, the weighted sum of the neuron inputs plus the bias b^l_1, when the value of the activation function input x>0).
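
For example, for a hypothetical neuron with two inputs a1 = 0.5 and a2 = 1.0, weights 2 and -3, and bias 0.4, the weighted sum is z = 2·0.5 - 3·1.0 + 0.4 = -1.6, so the ReLU output is y = max(0, -1.6) = 0 and the neuron is inactive for that sample. With a bias of 2.4 instead, z = 2·0.5 - 3·1.0 + 2.4 = 0.4, so y = 0.4 and the neuron is active, its output being exactly the linear combination of its inputs plus the bias.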

Neural networks often employ pooling layers to down-sample feature maps. Common pooling layers are max pooling and average pooling. While average pooling offers no challenge, max pooling requires supporting the maximum function at one or more layers. The output of a neuron at layer l using the maximum function y=f(x)=max(x1,x2,...,xn) is the maximum among all the inputs {x1,x2,...,xn}, wherein the inputs correspond to outputs of neurons from the previous layer l-1 that are connected with the neuron at layer l. The maximum function also fulfils the condition that the value of the neuron output is either zero or a linear combination of one or more neuron inputs plus a bias. In particular, when applying the maximum function, the output is a linear combination of all the connected inputs, wherein the weight is 1 for the input with maximum value and 0 for the rest of the inputs, and the bias is zero.
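
For example, for a hypothetical max-pooling neuron with inputs x1 = 0.2, x2 = 1.7 and x3 = 0.5, the output is max(0.2, 1.7, 0.5) = 1.7 = 0·0.2 + 1·1.7 + 0·0.5 + 0; that is, for this particular sample the neuron output is a linear combination of its inputs with weight 1 for the input with the maximum value, weight 0 for the other inputs, and bias 0, as required by the condition above.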

Similarly, the minimum function y=f(x)=min(x1,x2,...,xn) also fulfils the aforementioned condition and therefore is a function that can be applied to one or more neurons (e.g. all the neurons of the same layer) of the neural network to compute neuron outputs. Other functions that satisfy the condition that the value of the neuron output is either zero or a linear combination of one or more neuron inputs plus a bias may be used in the present invention. The neural network may include a combination of different functions that satisfy the condition; for instance, in one embodiment the neurons of one or more layers may employ the maximum function and the neurons of the rest of the layers may employ the ReLU of the weighted sum of the inputs.

Figure 4A shows the computation of neural network output O1 in the example of Figure 2, computed by replacing each neuron output by its mathematical function and taking into account the active/inactive status of each neuron for the sample 102. Since all the neurons in the example of Figure 2 use the ReLU as their activation function, each active neuron output has been replaced by the weighted sum of its inputs plus the neuron bias and each inactive neuron output has been replaced by zero.

After replacing each neuron output a^l_j above with its corresponding expression and performing operations to remove all the parentheses, the mathematical expression in Figure 4A is obtained.

As mentioned above, an active path 214 (regardless of whether it is full or bias) can be described by its path weight 126. The path weight 126 is assigned an initial value, computed as the product of the neuron weights along the active path 214 (for a full path) or the product of the neuron weights along the active path 214 times the neuron bias value (for a bias path).

There are 13 active paths 214 for the sample 102 depicted in Figure 2. Each of these active paths 214 has an associated initial path weight 126, computed as described above: the product of the neuron weights along the path for a full path, or the product of the neuron weights along the path times the neuron bias value for a bias path.

Taking into account the definition of the initial value assignment for a path weight 126, a neural network output 206 (and, in particular, the neural network output O1 in the example of Figure 2) can also be computed as the sum of the contributions of all the active paths 214 for the corresponding sample 102 leading to it, where each contribution is the product of the path weight 126 times the input value 202 (for a full path) or the path weight 126 (for a bias path). This alternative way of computing a neural network output 206 establishes the relationship between the neural network 112 and the estimator 120. Once initialized, both the neural network 112 and the estimator 120 will initially compute the same predicted outputs (206, 130) for a sample 102. However, the training method of the artificial intelligence system 100 of the present invention trains the path weights 126 of the paths 124 in the collection of paths 122 of the estimator 120, leaving the neural network 112 unaltered, as described below in more detail. Therefore, after one or more training operations, the neural network 112 and the estimator 120 will no longer compute the same predicted outputs (206, 130) for a sample 102.

Figure 4B shows the computation of estimator output O1', which is associated with neural network output O1 in the example of Figure 2, expressed as a function of the path weights 126 (PW1, PW2, ..., PW13) of the active paths 214 and the neural network inputs 202. It can easily be observed that when the path weights 126 (PW1, PW2, ..., PW13) in Figure 4B are replaced with their corresponding initial values defined above, the expression for the estimator output O1' (Figure 4B) is equal to the expression for its associated neural network output O1 (Figure 4A).

The computation shown in Figure 4B is therefore the computation performed at the estimator 120, where an estimator output 130 (O1') associated with a neural network output 206 (O1) is computed by adding the contributions of all the active paths 214 for the corresponding sample 102 leading to it, where each contribution is the product of the path weight 126 times the input value 202 (for a full path) or the path weight 126 (for a bias path). The estimator 120 stores the path weights 126 for all the paths 124 in the collection of paths 122, but it does not store any information about individual neuron parameters.
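A minimal sketch of this estimator computation, assuming an illustrative data layout in which each active path carries its path weight and, for full paths, the index of the neural network input it starts from (the dictionary keys and names are assumptions for this example only):

```python
def estimator_output(active_paths, sample_inputs):
    """Sum of the contributions of the active paths 214 leading to one output.

    active_paths: list of dicts with keys
        'weight'      - the path weight 126 (PW1, PW2, ...)
        'input_index' - index of the neural network input 202 for a full path,
                        or None for a bias path.
    sample_inputs: the neural network inputs 202 of the sample 102.
    """
    total = 0.0
    for path in active_paths:
        if path['input_index'] is not None:   # full path: path weight times input value
            total += path['weight'] * sample_inputs[path['input_index']]
        else:                                 # bias path: the path weight itself
            total += path['weight']
    return total

# Hypothetical example with two full paths and one bias path leading to one output:
paths_to_O1 = [{'weight': 0.8, 'input_index': 0},
               {'weight': -0.2, 'input_index': 2},
               {'weight': 0.05, 'input_index': None}]
print(estimator_output(paths_to_O1, [1.0, 0.5, 2.0]))   # 0.8*1.0 - 0.2*2.0 + 0.05 = 0.45
```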

In the example depicted in Figure 2, all the neurons use a ReLU activation function. However, a combination of different functions that compute the neuron outputs may be used in the neural network 112, provided the functions satisfy the condition that the value of the neuron output is either zero or a linear combination of one or more neuron inputs plus a bias for any sample. Figure 5 depicts an embodiment of a neural network 112 using the same number of layers, neurons and internal connections between neurons as the one shown in Figure 2, with the only difference that the neurons of the third layer L3 use the maximum function for computing their respective outputs. The input to the neural network 112 in Figure 5 is the same sample 102 used as input to the neural network 112 in Figure 2; therefore, the outputs of the corresponding neurons in the first layer L1 remain with a value of zero, and the dashed lines 212 starting from said neurons of the first layer L1 remain unaltered, as in Figure 2. However, in the example of Figure 5, the output of the corresponding neuron in the third layer L3 is no longer zero, since the function to compute the output of said neuron has changed (it is now the maximum function).

By applying the maximum function in the third layer L3, one or more neuron inputs in said layer have no effect on the corresponding neuron outputs. For instance, the output of neuron n2 of the previous layer is greater than the output of neuron n1 of that layer for the particular value of the sample 102, and therefore neuron n1 has no effect on the output of any of the neurons of the third layer L3 to which it is connected. This lack of effect is also depicted with dashed lines. As a result, the dashed lines 212 represent either neuron outputs with a value equal to zero or neuron inputs that have no effect on the neuron output for a particular neuron. This representation in Figures 2 and 5, with solid lines 210 and dashed lines 212, is used to understand how the active paths 214 are obtained for each sample 102.
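To illustrate how a set of active paths could be obtained, the following Python sketch enumerates the active full paths of a small fully connected ReLU network; it is a simplified, assumption-laden example (bias paths, the maximum-function case and selective connections are omitted, and the output layer is treated as linear), not the path selector 110 itself.

```python
import numpy as np
from itertools import product

def active_full_paths(sample, weights, biases):
    """Enumerate active full paths in a small fully connected ReLU network.

    weights[l] has shape (n_l, n_{l-1}); biases[l] has shape (n_l,).
    Returns tuples (input_index, hidden neuron indices..., output_index) such
    that every hidden neuron along the path has a non-zero output for the sample.
    """
    # Forward pass, recording which hidden neurons are active (ReLU output > 0).
    a, active = np.asarray(sample, dtype=float), []
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ a + b
        a = np.maximum(z, 0.0)
        active.append([int(i) for i in np.nonzero(z > 0)[0]])
    n_inputs, n_outputs = len(sample), weights[-1].shape[0]
    layer_choices = [range(n_inputs)] + active + [range(n_outputs)]
    # A full path is active when all hidden neurons it traverses are active.
    return list(product(*layer_choices))

# Tiny hypothetical network: 2 inputs, one hidden ReLU layer of 2 neurons, 1 output.
W1, b1 = np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([0.0, -2.0])
W2, b2 = np.array([[1.0, 1.0]]), np.array([0.1])
print(active_full_paths([2.0, 1.0], [W1, W2], [b1, b2]))   # [(0, 0, 0), (1, 0, 0)]
```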

Figure 6 represents the computation of neural network output O1 in the example of Figure 5, also computed by replacing each neuron output by its mathematical function and taking into account the active/inactive status of each neuron for the sample 102. Neural network output O1 can also be computed as the sum of the contributions of all the active paths 214 for the corresponding sample 102 leading to O1. Note that when applying the maximum function, the neuron weight is 1 for the input with maximum value and 0 for the rest of the inputs, and the bias is zero; this is the case for the neurons of the third layer L3 in Figure 5. Therefore, in this case there are 11 active paths 214 for the sample 102, out of which two bias paths have a null contribution because the neuron bias is zero.

Each of these 11 active paths 214 has an associated initial path weight 126, computed as defined above as the product of the neuron weights along the path (times the neuron bias value for a bias path).

Figure 7 depicts a flow diagram of a computer-implemented method 700 of predicting outputs from samples 102 using an artificial intelligence system 100, according to an embodiment of the present invention. The method 700 comprises receiving 710 a sample 102 and obtaining 720 a set of active paths 114 for the sample 102 in a neural network 112 trained with an initial training dataset including a plurality of initial training samples. The neural network 112 comprises a plurality of consecutive layers 204 jointly connecting one or more neural network inputs 202 to one or more neural network outputs 206, each layer 204 comprising one or more neurons 208, each neuron 208 computing a neuron output for each sample 102 according to a function of the neuron inputs such that the neuron output is zero or a linear combination of one or more neuron inputs plus a bias. Each active path 214 is a path from a collection of paths 122 in the neural network 112 in which every neuron output is different from zero for the sample 102 and each neuron input in said path has an effect on the corresponding neuron output for the sample 102. Each path 124 of the collection of paths 122 connects either a neural network input 202 or a neuron 208 to a neural network output 206 through a sequence of neurons 208 located in consecutive layers 204, one neuron 208 per each consecutive layer 204, wherein each path 124 has an assigned path weight 126.

As already mentioned above, the neural network 112 is trained during the initialization of the artificial intelligence system 100 with an initial training dataset including a plurality of initial training samples, using prior art techniques. The neural network 112 may have been pre-trained by an external training system, so that the artificial intelligence system 100 only retrieves the data associated with the neural network 112 (layout, weights, biases, etc.). Also, each path 124 in the collection of paths 122 has an assigned path weight 126 that is initially computed from the weights and biases of the pre-trained associated neural network 112 during the initialization of the artificial intelligence system 100. The initial path weight 126 of each path 124 in the collection of paths 122 is computed as follows:

(i) When the path connects a neural network input 202 to a neural network output 206, the product of the neuron weights along the path, said neuron weights obtained from the neural network 112 trained during the initialization with the initial training dataset.

(ii) When the path connects a neuron 208 to a neural network output 206, the product of the neuron weights along the path times the bias value of said neuron 208, said neuron weights and bias obtained from the neural network 112 trained during the initialization with the initial training dataset.

The method 700 further comprises computing 730 predicted estimator outputs 130 for the sample 102, wherein each estimator output 130 is mapped to the corresponding neural network output 206 of the associated neural network 112 (one-to-one mapping) and its predicted value is computed as the sum of the contributions of all the active paths 214 for the sample 102 leading to said corresponding neural network output 206 in the associated neural network 112 (as shown for instance in Figure 4B for estimator output O1', which is associated with neural network output O1). The contribution of each active path 214 is:

When the active path 214 connects a neural network input 202 to a neural network output 206, the product of the path weight 126 times the neural network input 202.

When the active path 214 connects a neuron 208 to a neural network output 206, the path weight 126.

When a training process with additional samples is applied to the artificial intelligence system 100, the path weights 126 of the different paths 124 included in the collection of paths 122 are trained. The goal of this training process, which is explained below, is not to obtain trained neuron weights for the neural network 112, as is customary in the prior art, but instead trained path weights 126 for the paths 124 in the collection of paths 122. In other words, in the present invention the training process using additional samples does not train the neural network 112 (since the neuron weights of the neural network 112 are not actually modified during training), but the estimator 120. This novel approach requires far fewer computational resources for training the estimator 120 than for training the neural network 112, as explained in detail below.

Figure 8 depicts a flow diagram of a training method 800 of the artificial intelligence system 100 using additional training samples. The training method 800 comprises:

Obtaining 810 a set of active paths 114 using the path selector 110 of the artificial intelligence system 100 for each sample 102 in an extended training dataset 802, the extended training dataset 802 comprising the initial training samples 806 of the initial training dataset 804 and additional training samples 808.

Generating 820 a first set of equations that relate the expected neural network outputs 206 with the neural network inputs 202 for each sample in the extended training dataset 802, using the weights and biases of the neurons 208 along the active paths 214 associated with each sample as primary unknowns, wherein each neural network output 206 is expressed as the sum of the contributions of all the active paths 214 for the corresponding sample leading to said neural network output 206, wherein the contribution of each active path 214 is:

- When the active path 214 connects a neural network input 202 to a neural network output 206, the product of the neuron weights along the active path times the neural network input 202.

- When the active path 214 connects a neuron 208 to a neural network output 206, the product of the neuron weights along the active path times the bias value of said neuron 208.

Transforming 830 the first set of equations into a first system of linear equations by replacing each product of two or more primary unknowns with a secondary unknown, wherein every unknown after the replacement (that is, every remaining primary unknown and every secondary unknown) is the weight 126 of a path 124 from the collection of paths 122 of the estimator 120.

Solving 840 the first system of linear equations.

Updating 850 the path weight 126 assigned to each path 124 in the collection of paths 122 of the estimator 120 of the artificial intelligence system 100 using the value computed for the corresponding unknown. In many cases, the path weight 126 is updated with the value computed for the corresponding unknown; however, in some cases (in particular, as will be explained later, when one or more of the unknown weights and biases of the neurons along the active paths 214 are replaced with a constant value) the path weight 126 may be updated with the product of the value computed for the corresponding unknown and at least one constant.

Figure 9 shows a first set of equations 902 generated by the training method 800. The first set of equations 902 includes the equations that relate the neural network outputs (O1, O2) with the neural network inputs (i1, i2, i3, i4) for the samples included in the extended training dataset 802 (e.g. N samples). In the example shown in Figure 9, the first sample corresponds to the sample 102 of Figure 2 and the equation for O1 is the one depicted for said sample 102 in Figure 4A. In these equations the weights and biases of the neurons 208 along the active paths 214 are considered as primary unknowns 904. For each sample (sample 1, sample 2, ..., sample N), the inputs (i1, i2, i3, i4) and the outputs (O1, O2) of the neural network 112 correspond to the values of the corresponding sample; therefore, in the first set of equations 902 the inputs (i1, i2, i3, i4) and the outputs (O1, O2) are known values.

As shown in Figure 10, the first set of equations 902 of Figure 9 is then transformed 830 into a first system of linear equations 1002 by replacing the products 906 of two or more primary unknowns with secondary unknowns 1006 (x1, x2, ..., xM). The unknowns to be solved in the first system of linear equations 1002 are the remaining primary unknowns (b1, ...) and the secondary unknowns (x1, x2, ..., xM). Said unknowns are the weights 126 of the paths 124 of the collection of paths 122 of the estimator 120. Obtaining a system of linear equations represents a huge simplification that leads to drastic speed-ups in the time-to-solution and to energy savings, when compared to solving the non-linear equations that arise when training neural networks according to conventional methods in the prior art.
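A hedged sketch of how such a system of linear equations could be assembled: each row corresponds to one (sample, output) pair, each column to one path weight unknown, and the coefficient is the input value for a full path and 1 for a bias path. The data layout and function names are assumptions made for this illustration.

```python
import numpy as np

def build_linear_system(samples, expected_outputs, active_paths_per_sample, n_unknowns):
    """Assemble A x = y, where x holds the path weight unknowns to be trained.

    active_paths_per_sample[s][o] is a list of (unknown_index, input_index)
    pairs describing the active paths 214 of sample s leading to output o,
    with input_index None for bias paths.
    """
    rows, rhs = [], []
    for s, (sample, outputs) in enumerate(zip(samples, expected_outputs)):
        for o, target in enumerate(outputs):
            row = np.zeros(n_unknowns)
            for unknown_index, input_index in active_paths_per_sample[s][o]:
                # A full path contributes its input value, a bias path contributes 1.
                row[unknown_index] += 1.0 if input_index is None else sample[input_index]
            rows.append(row)
            rhs.append(target)
    return np.vstack(rows), np.array(rhs)

# The (typically overdetermined) system can then be solved in the least-squares sense:
# A, y = build_linear_system(samples, outputs, active_paths, n_unknowns)
# path_weight_values, *_ = np.linalg.lstsq(A, y, rcond=None)
```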

The first system of linear equations 1002 is solved 840 by any method known in the prior art. When training the artificial intelligence system 100, some or all of the arithmetic operations performed may be executed with reduced arithmetic precision in order to save computing time and energy (e.g. by tuning the precision of arithmetic operations separately for path selection and for solving the system of linear equations). The path selector 110 may be implemented with low-precision arithmetic without any accuracy loss.

In an embodiment, depicted in Figure 11, the training method 800 further comprises a subsequent retraining process 1100 of the artificial intelligence system 100, the retraining process 1100 comprising:

Obtaining 1110 a set of active paths 114 using the path selector 110 for each sample in a retraining dataset 1102 containing additional samples 1104 not included in the extended training dataset 802.

Generating 1120 a second set of equations that relate the expected neural network outputs 206 with the neural network inputs 202 for each sample in the retraining dataset 1102.

Transforming 1130 the second set of equations into a second system of linear equations by replacing each product of two or more primary unknowns 904 with a secondary unknown 1006.

Appending 1140 the second system of linear equations to the first system of linear equations 1002 to obtain an extended system of linear equations.

Solving 1150 the extended system of linear equations.

Updating 1160 the path weight 126 assigned to each path 124 in the collection of paths 122 of the estimator 120 of the artificial intelligence system 100 with the value computed for the corresponding unknown.

Figure 12 shows a second set of equations 1202 generated by the retraining process 1100. The second set of equations 1202 includes the equations that relate the neural network outputs 206 (O1, O2) with the neural network inputs 202 (i1, i2, i3, i4) for the additional samples 1104 (e.g. R samples) included in the retraining dataset 1102. As in the example shown in Figure 9, the weights and biases of the neurons 208 along the active paths 214 are considered as primary unknowns 904. For each sample (sample N+1, sample N+2, ..., sample N+R), the inputs (i1, i2, i3, i4) and the outputs (O1, O2) of the neural network 112 correspond to the values of the corresponding sample; therefore, in the second set of equations 1202 the inputs (i1, i2, i3, i4) and the outputs (O1, O2) are known values.

The second set of equations 1202 is transformed 1130 into a second system of linear equations by replacing the products 906 of two or more primary unknowns 904 with secondary unknowns 1006. The remaining primary unknowns and the secondary unknowns (e.g. x1, x2, ..., xM) of the second system of linear equations will normally be the same ones as those in the first system of linear equations 1002, although in some cases the second system of linear equations may add further unknowns, depending on whether the active paths 214 activated by the additional samples 1104 in the retraining dataset 1102 are new or not when compared to the set of active paths 114 activated by the samples of the extended training dataset 802.

Figure 13 shows the extended system of linear equations 1302 obtained once the second system of linear equations 1304 for samples N+1 to N+R is appended to the first system of linear equations 1002 for samples 1 to N.

The extended system of linear equations 1302 is solved 1150 by any known method. Some or all the arithmetic operations performed while solving the system of linear equations may be executed with reduced arithmetic precision.
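The appending and solving steps (1140, 1150) can be sketched as follows, assuming a QR-based least-squares solver (one possible choice, elaborated in the embodiment described below); the function names are illustrative, and the triangular factor from the previous training is assumed to have been kept.

```python
import numpy as np

def qr_reduce(A, y):
    """Reduce A x = y to an equivalent triangular system R x = c (least-squares sense)."""
    Q, R = np.linalg.qr(A)        # A = Q R, with Q having orthonormal columns
    return R, Q.T @ y

def incremental_retrain(R, c, A_new, y_new):
    """Append the new equations to the triangular system and re-triangularize.

    Only the new rows (one per additional sample and output) are processed,
    together with the small triangular factor kept from the previous training.
    """
    return qr_reduce(np.vstack([R, A_new]), np.concatenate([c, y_new]))

# Hypothetical usage with random data standing in for the real systems:
rng = np.random.default_rng(0)
A1, y1 = rng.normal(size=(100, 10)), rng.normal(size=100)   # first system (samples 1..N)
R, c = qr_reduce(A1, y1)                                     # kept after the first training
A2, y2 = rng.normal(size=(5, 10)), rng.normal(size=5)        # retraining samples only
R, c = incremental_retrain(R, c, A2, y2)
path_weight_values = np.linalg.solve(R, c)   # back-substitution on the triangular system
print(path_weight_values)
```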

In an embodiment, solving (840, 1150) the first 1002 or the extended 1302 system of linear equations comprises obtaining a triangular system of equations using QR decomposition and solving said triangular system of equations. Appending 1140 the second system of linear equations 1304 to the first system of linear equations 1002 may comprise appending the second system of linear equations 1304 to the triangular system of equations resulting from applying QR decomposition to the first system of linear equations 1002. Applying QR decomposition to the extended system of linear equations 1302 to obtain a triangular system of equations requires a much smaller number of operations when the extended system of linear equations 1302 has been obtained by appending the second system of linear equations 1304 to the triangular system of equations resulting from applying QR decomposition to the first system of linear equations 1002. This embodiment represents a truly incremental retraining method for the artificial intelligence system 100, capable of processing only the additional samples 1104 included in the retraining dataset 1102 without leading to catastrophic forgetting, while also achieving additional dramatic speed-ups in the time-to-solution and energy savings. This is especially noticeable when the retraining dataset 1102 is much smaller than the extended training dataset 802.

In an embodiment, the training method 800 comprises replacing one or more of the primary unknowns 904 (i.e. one or more weights and biases of the neurons 208 along the active paths 214) with a constant value. In this case it is assumed that one or more weights and/or biases are known, usually spanning whole layers. The constant value for each known weight/bias may be set according to a previously known value; for instance, the value of the corresponding weight/bias in the neural network 112 trained with the initial training dataset 804 (i.e. the one previously computed with SGD). The replacement is performed in the step of generating (820, 1120) a first or a second set of equations or in the step of transforming (830, 1130) a first or a second set of equations into a first or a second system of linear equations.

In an embodiment, the primary unknowns 904 (i.e. the weights and biases) of all the layers but one are replaced with a previously known value (e.g. the one computed with SGD). The weights and biases of the remaining layer (e.g. the last layer or another fixed layer) are then considered as primary unknowns 904. In successive retraining operations the retraining method may alternate among layers; for example, in a first retraining process the remaining layer with the unknowns is the last layer, in a second retraining process the remaining layer with the unknowns is the penultimate layer, and so on. In both embodiments, the replacement of one or more of the primary unknowns 904 with constant values reduces the number of unknowns, so that the memory requirements are much lower.

For instance, in the example of Figure 9 the weights w22 and w23 are respectively replaced with constant values p and q. As a result, the corresponding products of primary unknowns 906 become the product of the remaining primary unknowns times p and times q, respectively. Thus, after the replacement, the remaining product of primary unknowns 906 is replaced in both terms with the same secondary unknown x5, and similarly for x10, the resulting equation for output O1 in sample 1 containing terms of the form:

... + p · x5 · i2 + q · x5 · i3 + ...

In this case the 12 unknowns (x1, ..., x12) shown in Figure 10 have been reduced to 10 unknowns (x1, ..., x10), since the values of p and q (as well as i2 and i3) are known. Both embodiments represent an enhancement of the training method 800, since the total number of paths and the associated memory requirements are reduced to tractable values. In addition, these embodiments may also be used to set the number of unknowns to be consistent with the number of samples and/or the number of equations in the system of linear equations. For example, weights and/or biases (or products of weights and/or biases) may be replaced with constant values so that the total number of unknowns becomes a certain fraction of the total number of equations in the first 1002 or the extended 1302 system of linear equations.

For a given (re)training operation, the number of equations in the first 1002 or second 1304 system of linear equations is equal to the number of samples (in the extended training dataset 802 or the retraining dataset 1102) being processed times the number of outputs in the neural network 112 (or the estimator 120). As mentioned above, the embodiment consisting of replacing one or more of the primary unknowns 904 with constant values allows a strict control of the total number of unknowns. In general, the number of unknowns should be adjusted so that the first 1002 or extended 1302 system of linear equations is overdetermined. It will also generally be inconsistent (i.e. have no exact solution), since it is very unlikely that the equations generated by different samples are linear combinations of other equations.
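For instance, under the purely illustrative assumption of an extended training dataset 802 with 100,000 samples and a neural network 112 with two outputs, the first system of linear equations 1002 contains 100,000 × 2 = 200,000 equations; replacing weights and biases with constants until, say, no more than 20,000 path weight unknowns remain keeps the system overdetermined by roughly a factor of ten.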

The resulting overdetermined system of linear equations can be solved by any method in the prior art, including obtaining a triangular system of equations using QR decomposition and solving said triangular system of equations. The result will be equivalent to that of an optimization procedure that minimizes a cost function (the cost function being the mean squared error in the case of using QR decomposition), as is done in customary methods in the prior art.
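A small numerical check (illustrative only, using random data) that solving an overdetermined system via QR decomposition yields the same unknowns as directly minimizing the mean squared error with a least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 12))   # overdetermined: 200 equations, 12 unknowns
y = rng.normal(size=200)

# QR route: reduce to a triangular system and back-substitute.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ y)

# Direct least-squares route: minimizes the mean squared error of A x - y.
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(x_qr, x_ls))   # True: both minimize the same cost function
```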

If the number of equations in the first 1002 or extended 1302 system of linear equations is not large enough with respect to the number of unknowns, it may occur that the system of linear equations is underdetermined. In such a case, there exist groups of paths in the collection of paths 122 that are all active or all inactive for any given sample 102, so their individual weights cannot be distinguished. Therefore, instead of computing the weight 126 for each path 124 in the collection of paths 122, the training method 800 will compute the joint weight for each group of paths that are simultaneously inactive/active.

As mentioned above, the training method 800 of the artificial intelligence system 100 achieves dramatic speed-ups in the time-to-solution and huge energy savings with respect to customary retraining methods in the prior art. This is especially true for the embodiment consisting of appending 1140 the second system of linear equations 1304 to the first system of linear equations 1002 by appending the second system of linear equations 1304 to the triangular system of equations resulting from applying QR decomposition to the first system of linear equations 1002 (incremental retraining).

As a quantitative example, training the convolutional neural network ResNet-50 for image recognition with the ImageNet image database (currently containing 14,197,122 images) has been reported to require 90 epochs (i.e. processing all the images in the database 90 times) to achieve the desired accuracy. However, the training method 800, once the neural network 112 has been initialized, only requires processing the image database once.

Moreover, successive retraining operations with additional training samples 808 require processing all the samples in the extended training dataset 802 when implementing customary methods in the prior art, in order to prevent catastrophic forgetting. However, the incremental retraining embodiment of the training method 800 of the artificial intelligence system 100 enables processing only the additional training samples 808 (or the additional samples 1104 in successive retraining operations) without suffering from catastrophic forgetting. For instance, if the artificial intelligence system 100 is retrained every time 10,000 new images are captured, only those images need to be processed, while the customary methods in the prior art would require processing 14,197,122 + 10,000 = 14,207,122 images. This represents a reduction by a factor of about 1,420, i.e. savings of more than three orders of magnitude.