


Title:
EFFICIENT COMPUTATIONAL INFERENCE
Document Type and Number:
WIPO Patent Application WO/2021/052609
Kind Code:
A1
Abstract:
A computer-implemented method of processing data comprising a plurality of observations associated with respective ordered input values to train a Gaussian process (GP) to model the data. The method includes initialising an ordered plurality of inducing input locations, and initialising parameters of a multivariate Gaussian distribution over a set of inducing states, each inducing state having components corresponding to a Markovian GP and one or more derivatives of the Markovian GP at a respective one of the inducing inputs. The initialised parameters include a mean vector and a banded Cholesky factor of a precision matrix for the multivariate Gaussian distribution. The method further includes iteratively modifying the parameters of the multivariate Gaussian distribution, to increase or decrease an objective function corresponding to a variational lower bound of a marginal log-likelihood of the observations under the Markovian GP.

Inventors:
ADAM VINCENT (GB)
ELEFTHERIADIS STEFANOS (GB)
DURRANDE NICOLAS (GB)
ARTEMEV ARTEM (GB)
HENSMAN JAMES (GB)
BORDEAUX LUCAS (GB)
Application Number:
PCT/EP2019/081394
Publication Date:
March 25, 2021
Filing Date:
November 14, 2019
Assignee:
SECONDMIND LTD (GB)
International Classes:
G06N7/00; G06N20/00
Foreign References:
EP3467717A1 (2019-04-10)
EP3467718A1 (2019-04-10)
Other References:
MICHALIS K. TITSIAS: "Variational Learning of Inducing Variables in Sparse Gaussian Processes", 16 April 2009 (2009-04-16)
NICOLAS DURRANDE ET AL.: "Banded Matrix Operators for Gaussian Markov Models in the Automatic Differentiation Era", arXiv.org, 26 February 2019 (2019-02-26)
DURRANDE ET AL.: "Banded Matrix Operators for Gaussian Markov Models in the Automatic Differentiation Era", AISTATS, 2019
SALIMBENI ET AL.: "Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models", AISTATS, 2018
H. SALIMBENI, M. DEISENROTH: "Doubly stochastic variational inference for deep Gaussian processes", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2017
Attorney, Agent or Firm:
EIP (GB)
Claims:
CLAIMS

1. A computer-implemented method of processing data comprising a plurality of observations associated with respective ordered input values, to train a Gaussian process (GP) to model the data, the method comprising: initialising an ordered plurality of inducing inputs; initialising parameters of a multivariate Gaussian distribution over a set of inducing states, each inducing state having components corresponding to a Markovian GP and one or more derivatives of the Markovian GP at a respective one of the inducing inputs, wherein the initialised parameters comprise a mean vector and a banded Cholesky factor of a precision matrix for the multivariate Gaussian distribution; and iteratively modifying the parameters of the multivariate Gaussian distribution, to increase or decrease an objective function corresponding to a variational lower bound of a marginal log-likelihood of the observations under the Markovian GP.

2. The method of claim 1, wherein initialising the parameters of the multivariate Gaussian distribution comprises allocating a first memory region to compactly store the lower block-banded Cholesky factor of the precision matrix.

3. The method of claim 1 or 2, comprising iteratively modifying the inducing inputs to increase or decrease the objective function.

4. The method of any preceding claim, comprising: receiving a data stream comprising the plurality of observations; and initialising the inducing inputs sequentially and concurrently with the receiving of the data stream.

5. The method of claim 4, wherein first input values associated with first observations of the plurality of observations lie within a first interval, and second input values associated with second observations of the plurality of observations lie within a second interval different from the first interval, the method comprising: receiving the first observations; initialising first inducing inputs within the first interval; initialising first parameters of the multivariate Gaussian distribution corresponding to first inducing states associated with the first inducing inputs; iteratively modifying the first parameters of the multivariate Gaussian distribution to increase or decrease an objective function for the first interval; receiving the second observations; initialising second inducing inputs within the second interval; initialising second parameters of the multivariate Gaussian distribution corresponding to second inducing states associated with the second inducing inputs; and iteratively modifying the second parameters of the multivariate Gaussian distribution to increase or decrease an objective function for the second interval.

6. The method of any preceding claim, wherein the number of inducing inputs is less than the number of observations.

7. The method of any preceding claim, wherein iteratively modifying the parameters of the multivariate Gaussian distribution comprises performing a natural gradient update.

8. The method of any preceding claim, wherein the data is time-series data and the ordered input values correspond to times.

9. The method of claim 8, wherein each of the observations corresponds to a sample from an audio file.

10. The method of claim 8, wherein each of the observations corresponds to a neural activation measurement.

11. The method of claim 8, wherein each of the observations corresponds to a measurement of a radio frequency signal.

12. The method of any preceding claim, wherein the Markovian GP is a component GP in a composite GP comprising a plurality of further component GPs.

13. The method of claim 12, wherein the composite GP is an additive GP and each of the component GPs of the composite GP represents a source underlying the plurality of observations, the method comprising training the Markovian GP and the plurality of further component GPs to determine a distribution of each of the sources underlying the plurality of observations.

14. A data processing system configured to process data comprising a plurality of observations associated with respective ordered input values, to train a Gaussian process (GP) to model the data, the system comprising: memory circuitry arranged to store: the data, an ordered plurality of inducing inputs, and parameters of a multivariate Gaussian distribution over a set of inducing states, each inducing state having components corresponding to a Markovian GP and one or more derivatives of the Markovian GP at a respective one of the inducing inputs, wherein the parameters comprise a mean vector and a banded Cholesky factor of a precision matrix for the multivariate Gaussian distribution; and processing circuitry arranged to: initialise the ordered plurality of inducing inputs; initialise the parameters of the multivariate Gaussian distribution over the set of inducing states; and iteratively modify the parameters of the multivariate Gaussian distribution, to increase or decrease an objective function corresponding to a variational lower bound of a marginal log-likelihood of the observations under the Markovian GP.

15. The data processing system of claim 14, comprising a receiver for receiving a data stream comprising the plurality of observations, wherein the processing circuitry is arranged to initialise the inducing inputs sequentially and concurrently with the receiving of the data stream.

Description:
EFFICIENT COMPUTATIONAL INFERENCE

Technical Field

The present invention relates to the computational implementation of Gaussian process models.

Background

Gaussian process (GP) models provide a powerful and flexible means of inferring statistical information from empirical data. GP models have been applied in various contexts in machine learning, for example in regression and classification tasks, and also for predicting states of an environment in a reinforcement learning system, as described in European patent publications EP3467717 and EP3467718. Certain GP models are well-suited to the modelling of time-series data such as a sequence of audio samples, a sequence of measurements of neural activity in a human or animal brain, and a sequence of measurements of physical states of a dynamical system. For such time-series data, empirical observations are assumed to be noisy realisations of an unobservable latent process encoding statistical properties of an underlying signal. GP models offer strong guarantees of accuracy and automatically yield well-calibrated uncertainty estimations. However, both in terms of computational operations and in terms of memory requirements, most computational implementations of GP models scale poorly with the size of an empirical dataset. This poor scalability has tended to limit the viability of GP models as an alternative to neural network-based models across a range of technical fields. Furthermore, in cases where a data signal is received in a streaming fashion (as is often the case for time-series data), existing GP methods require batch processing of historical data and in many cases do not provide the necessary efficiency for real-time analysis of the streamed data.

In a typical GP inference task, the objective is to fit a latent GP to an observed dataset D = {(x_n, y_n)}_{n=1}^N such that the resultant GP can be used to predict a distribution of an output value y_* at an unobserved input location x_*, or alternatively to determine properties of a signal underlying the data. This description encapsulates regression, in which an output y corresponds to one or more attributes of a corresponding input x, and classification, in which an output y corresponds to a probability of an input x being associated with a given class. GP models are conventionally defined using a kernel representation, which consists of a GP prior and an observation model as shown in Equation (1):

  f(·) ~ GP(m(·), k(·, ·')),    y_n | f(·) ~ p(y_n | f(x_n)),    n = 1, ..., N,    (1)

where it has been assumed that the likelihood p(y | f(·)) factorizes over the dataset D. The prior density p(f(·)) of the GP depends on a mean function m(·) and a kernel k(·, ·'), where · and ·' denote unspecified input locations. The aim of GP inference is to determine or approximate a posterior process p(f(·) | y) corresponding to the GP prior conditioned on the data.

In addition to the kernel representation of Equation (1), certain GP models admit a state-space representation in which a state vector has components given by evaluations of the GP and the first d − 1 derivatives of the GP, such that s(t) = [f(t), f^(1)(t), ..., f^(d−1)(t)]^T, where ^T denotes the transpose. In such models, the state vector s(t) satisfies a linear time-invariant (LTI) stochastic differential equation (SDE) given by Equation (2):

  ṡ(t) = F s(t) + L ε(t),    f(t) = H s(t),    (2)

where ṡ denotes the derivative of s with respect to time. The r-dimensional vector ε(t) ∈ ℝ^r is a white noise process with spectral density Q_c, i.e. an infinite collection of indexed uncorrelated random variables with zero mean and finite variance.
The coefficients F ∈ ℝ^{d×d}, L ∈ ℝ^{d×r}, H ∈ ℝ^{1×d}, along with the dimension d of the state vector s, can be derived for different choices of kernel in the kernel representation of Equation (1). Although the input locations in this example are represented by t to indicate temporal indexes, the LTI-SDE representation of Equation (2) is not limited to time-series data, and the input locations may alternatively represent spatial positions or other ordered indexes. Mappings between specific kernel functions and coefficients in the state-space representation are known, with various examples being set out, for example, in the textbook Applied Stochastic Differential Equations, Simo Särkkä and Arno Solin, Cambridge University Press, 2019. Note that, although the state-space model of Equation (2) has been introduced here as an alternative representation of the kernel model of Equation (1), in certain applications a state-space model may be derived directly from a dynamics model of a system and an observation model, without reference to any kernel representation. One such example is where data correspond to noisy observations of a dynamical system. The present invention is equally applicable in either of these contexts.

The marginal distribution of solutions to the LTI-SDE system of Equation (2) evaluated at an ordered set of input values [t_1, ..., t_N]^T follows a discrete-time linear system given by Equation (3):

  s(t_0) ~ N(0, P_0),    s(t_{n+1}) = A_{n,n+1} s(t_n) + q_n,    q_n ~ N(0, Q_{n,n+1}),    f(t_n) = H s(t_n),    (3)

for n = 0, ..., N − 1, where P_0 is the initial state covariance matrix. The state transition matrices A_{n,n+1} ∈ ℝ^{d×d} and the noise covariance matrices Q_{n,n+1} ∈ ℝ^{d×d} have analytical expressions given by

  A_{n,n+1} = exp(F Δ_n),    (4)

  Q_{n,n+1} = ∫_0^{Δ_n} exp(F (Δ_n − τ)) L Q_c L^T exp(F (Δ_n − τ))^T dτ,    (5)

where Δ_n = t_{n+1} − t_n. It is observed that the distribution for each state depends only on the previous state in the sequence. This property is referred to as the Markov property, and the GP is therefore an example of a hidden Markov model. Figure 1 is a graphical model showing the dependence between four successive states s_i = s(t_i) for input values t_i with i = 1, 2, 3, 4, illustrating the Markov property between the four states.

A smoothing solution of the LTI-SDE system of Equation (2) corresponds to the posterior density p(f(·) | y) of the GP conditioned on the data in the kernel representation of Equation (1). Determining a smoothing solution of the LTI-SDE system is therefore equivalent to performing GP inference using the corresponding kernel representation. Whereas the computational cost of performing exact GP inference using the kernel representation scales cubically with the size of the dataset, for conjugate likelihood models (i.e. regression in which the likelihood is assumed to be Gaussian), a smoothing solution to the LTI-SDE can be determined by applying Kalman filters and Rauch–Tung–Striebel (RTS) smoothers to the discrete-time linear system of Equation (3), resulting in a computational cost that scales linearly with the size of the dataset. Although this represents a significant improvement in computational cost compared with the cubic scaling of the kernel-based implementation mentioned above, in order to incorporate new data the Kalman/RTS method requires a forward and reverse pass through the entire dataset, which can still be prohibitive, for example in cases where time-series data is streamed at a high rate. Furthermore, the Kalman/RTS method does not extend to models with non-conjugate likelihoods or to composite models such as deep GP (DGP) models.
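By way of illustration only (this sketch is not part of the patent text), the following Python fragment constructs the state-space coefficients F, L, H and spectral density Q_c for a Matérn-3/2 kernel and the corresponding discrete-time transition matrices of Equations (4)-(5). The hyperparameter values, and the use of the stationary covariance P_inf to evaluate the integral in Equation (5), are assumptions made for the example.

```python
import numpy as np
from scipy.linalg import expm

# Matern-3/2 kernel hyperparameters (illustrative values)
sigma2, lengthscale = 1.0, 0.5
lam = np.sqrt(3.0) / lengthscale

# LTI-SDE coefficients (Equation (2)) for the Matern-3/2 kernel, d = 2
F = np.array([[0.0, 1.0],
              [-lam**2, -2.0 * lam]])       # drift matrix
L = np.array([[0.0], [1.0]])                # noise loading
H = np.array([[1.0, 0.0]])                  # observation vector, f(t) = H s(t)
Qc = np.array([[4.0 * lam**3 * sigma2]])    # spectral density of the white noise

# Stationary state covariance P_inf solves F P + P F^T + L Qc L^T = 0
P_inf = np.array([[sigma2, 0.0],
                  [0.0, lam**2 * sigma2]])

def discrete_transition(delta):
    """State transition and noise covariance (Equations (4)-(5)) over a gap delta."""
    A = expm(F * delta)
    # For a stationary prior the integral in Equation (5) reduces to:
    Q = P_inf - A @ P_inf @ A.T
    return A, Q

A, Q = discrete_transition(0.1)
print("A =", A)
print("Q symmetric, positive semi-definite:",
      np.allclose(Q, Q.T), bool(np.all(np.linalg.eigvalsh(Q) > -1e-12)))
print("marginal variance H P_inf H^T =", float(H @ P_inf @ H.T))  # equals sigma2
```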
As such, neither the Kalman/RTS method, nor existing kernel methods, are generally suitable for real-time analysis of data signals.

Summary

According to a first aspect of the present invention, there is provided a computer-implemented method of processing data comprising a plurality of observations associated with respective ordered input values to train a Gaussian process (GP) to model the data. The method includes initialising an ordered plurality of inducing inputs, and initialising parameters of a multivariate Gaussian distribution over a set of inducing states, each inducing state having components corresponding to a Markovian GP and one or more derivatives of the Markovian GP at a respective one of the inducing inputs. The initialised parameters include a mean vector and a banded Cholesky factor of a precision matrix for the multivariate Gaussian distribution. The method further includes iteratively modifying the parameters of the multivariate Gaussian distribution, to increase or decrease an objective function corresponding to a variational lower bound of a marginal log-likelihood of the observations under the Markovian GP.

The parameterisation of the Markovian GP described above results in an inference scheme with unexpected favourable properties which have not previously been exploited and which allow for a GP model to be implemented at a greater speed, requiring less memory usage, and with greater scalability than any existing method. The resulting method is carried out efficiently using reverse-mode differentiation, making the present invention suitable for implementation using well-developed automatic differentiation frameworks. The method is more scalable and better suited to the real-time analysis of data signals than the Kalman/RTS method and existing kernel-based methods.

Further features and advantages of the invention will become apparent from the following description of preferred embodiments of the invention, given by way of example only, which is made with reference to the accompanying drawings.

Brief Description of the Drawings

Figure 1 is a graphical model illustrating the dependence of a sequence of states in a state-space representation of a Gaussian process model.

Figures 2a and 2b are graphical models illustrating the dependence of a sequence of states and inducing states in a state-space representation of a Gaussian process model in accordance with an embodiment of the present invention.

Figure 3 shows a variational GP being used to model a sequence of time-series data in accordance with an embodiment of the present invention.

Figures 4a-c schematically show sparse and compact storage of a symmetric block-tridiagonal matrix and a lower block-bidiagonal matrix.

Figures 5a and 5b show a sampled audio signal at two different times.

Figure 6 shows a data processing system arranged to perform statistical inference in accordance with an embodiment of the present invention.

Figure 7 shows schematically a method of performing statistical inference in accordance with an embodiment of the present invention.

Figure 8 shows schematically a deep Gaussian process model with multiple layers in accordance with an embodiment of the present invention.

Detailed Description

The present invention relates to a computationally efficient method of performing statistical inference using Markovian GP models. The method is based on sparse variational inference, in which a variational GP is defined entirely by its marginal statistics at a set of inducing inputs, as opposed to the data input values.
Using inducing inputs significantly reduces the computational cost of the training process provided that the number M of inducing points is much smaller than the number N of observations in the dataset being modelled, i.e. M ≪ N. In accordance with the present invention, a set of inducing states u_i is defined at an ordered set of inducing inputs z_i, where u_i = s(z_i). The inducing states are an example of inter-domain inducing features, because the states are not simply evaluations of the GP at the inducing input locations (but rather, the inducing states also carry information about derivatives of the GP at the inducing input locations), and therefore exist in a different domain to the GP. The prior distribution p(u) of the inducing states satisfies the Markov property and inherits a multivariate Gaussian distribution from the Markovian GP, with mean vector m_ψ and precision matrix Q_ψ, which has an analytical expression in terms of the parameters of the discrete-time linear system of Equation (3). The Markov property of the inducing states advantageously results in the precision matrix being block-tridiagonal, such that all elements of the precision matrix not contained within a row of three d × d blocks centred on the leading diagonal are zero. A block-tridiagonal matrix is an example of a banded matrix, because all non-zero elements lie within a diagonal band containing the leading diagonal, along with 2d − 1 upper diagonals and 2d − 1 lower diagonals. The precision matrix is said to have a bandwidth of l = 2d − 1 (the maximum of the number of lower diagonals and the number of upper diagonals). The precision matrix is symmetric and positive definite, and can therefore be decomposed as Q_ψ = L_ψ L_ψ^T, where the Cholesky factor L_ψ is a lower block-bidiagonal matrix.

In order to perform sparse variational inference, a sparse variational GP q(f(·)) is introduced to approximate the posterior process p(f(·) | y), as given by Equation (6):

  q(f(·)) = ∫ p(f(·) | u) q(u) du,    (6)

where p(f(·) | u) is used as a shorthand to denote p(f(·) | {s(z_i) = u_i}). The sparse variational GP is thus given by the GP prior conditioned on the inducing states and marginalised over a variational Gaussian distribution q(u) for the inducing states. The training process involves iteratively modifying parameters of the variational Gaussian distribution q(u), along with (optionally) the inducing input locations and any hyperparameters of the kernel k and mean function m (or, equivalently, coefficients of the corresponding LTI-SDE system), and parameters of the likelihood, in order to minimise the Kullback-Leibler (KL) divergence between the variational GP q(f(·)) evaluated at the observed data points and the Bayesian posterior p(f(·) | y) evaluated at the observed data points. The KL divergence is given by Equation (7):

  KL[ q(f(·)) || p(f(·) | y) ] = log p(y) − ℒ,    (7)

where

  ℒ = Σ_{n=1}^{N} E_{q(f(x_n))}[ log p(y_n | f(x_n)) ] − KL[ q(u) || p(u) ].    (8)

The KL divergence in Equation (7) is non-negative, with equality when the variational GP q(f(·)) matches the Bayesian posterior process p(f(·) | y). The quantity ℒ is therefore a variational lower bound of the marginal log-likelihood log p(y). Maximising the variational bound ℒ with respect to the inducing input locations and the parameters of the variational distribution q(u) minimises the KL divergence between the variational GP and the Bayesian posterior process. Maximising the variational bound with respect to the hyperparameters maximises the marginal likelihood log p(y) of the data, or in other words how well the exact model fits the data.
By treating the variational bound as an objective function to be maximised with respect to the inducing inputs, the parameters of the variational distribution over inducing states, and the hyperparameters simultaneously, over a series of training iterations the variational GP tends towards a Bayesian posterior process for an optimal GP prior. Figure 3 shows an example of a sparse variational GP trained to fit a dataset in a time-series regression problem. Time-series data are shown as crosses and solid thin curves represent a single standard deviation above and below the mean function of the variational GP. The dashed curve shows a latent function f resulting from two sampled inducing states u_1 ~ q(s(z_1)) and u_2 ~ q(s(z_2)). In this example, the GP prior p(f(·)) admits a three-dimensional state-space representation (d = 3). The thick solid curves show the first three terms of the local Taylor expansion of the latent function f at the inducing inputs z_1, z_2, illustrating the information contained within the sampled inducing states u_1, u_2.

Given the state-space representation described above, the joint prior distribution of state vectors indexed at the data input locations and the inducing inputs satisfies the Markov property, as shown in Figure 2a. A key insight of the present invention is that the variational posterior distribution q(u) independently satisfies the Markov property, as shown by the solid arrows in Figure 2b, and the conditional distribution of a given state s(t_n) depends only on the closest left and right neighbouring inducing states, as shown by the dashed arrows in Figure 2b. The conditional dependence of the GP on the inducing states follows from this, and satisfies Equation (9):

  p(s(t_n) | u) = p(s(t_n) | u_−, u_+) = N( s(t_n); P_n [u_−^T, u_+^T]^T, T_n ),    (9)

where u_− = s(z_−) and u_+ = s(z_+), with z_−, z_+ being the neighbouring inducing inputs to t_n such that z_− ≤ t_n < z_+. Denoting the marginal variational distribution of the inducing state pair by q(u_−, u_+) = N(m_±, S_±), the predictive posterior distribution of a state vector at a given input location is given in terms of the parameters of the linear system by Equation (10):

  q(s(t_n)) = N( s(t_n); P_n m_±, T_n + P_n S_± P_n^T ),    (10)

where the matrices P_n ∈ ℝ^{d×2d} and T_n ∈ ℝ^{d×d} are given by Equations (11)-(13) in terms of the state transition matrices and noise covariance matrices of the discrete-time linear system evaluated on the intervals [z_−, t_n] and [t_n, z_+]. The predictive posterior distribution depends on the marginal covariance S_± of the neighbouring inducing states, which is a 2d × 2d matrix formed of neighbouring block-tridiagonal elements of the dense covariance matrix Q_uu^{-1}.

A further key insight of the present invention is that the optimal precision matrix of the variational Gaussian distribution q(u) is a symmetric block-tridiagonal matrix, and can therefore be decomposed as Q_uu* = L_u* L_u*^T, where the Cholesky factor L_u* is a lower block-bidiagonal matrix. In accordance with the present invention, the variational precision matrix is parameterised to reflect the form of the optimal precision matrix, such that Q_uu = L_u L_u^T, where the Cholesky factor L_u is restricted to being a banded matrix of sufficient bandwidth to contain the non-zero elements of a lower block-bidiagonal matrix of block size d (i.e. a bandwidth of at least l = 2d − 1). In one implementation, the Cholesky factor L_u is restricted to being a lower block-bidiagonal matrix. The sparseness and symmetry of the precision matrices Q_uu and Q_ψ allows for compact storage such that only the in-band elements are stored, because the other elements are guaranteed to be zero. In one example, the in-band elements are stored compactly as a dense matrix of size Md × (2l + 1) (with rows of the dense matrix having elements equal to the in-band elements of the corresponding sparse matrix). In other examples, blocks of the block-banded matrices are arranged for compact storage.
Figure 4a shows a block-tridiagonal matrix with the same sparseness pattern as the precision matrices Q_uu and Q_ψ, and Figure 4b shows an example of compact storage of the matrix of Figure 4a, in which the blocks are arranged in three columns (with the top-left and bottom-right blocks being redundant). Storing the precision matrices compactly reduces the memory required to be allocated to storing the precision matrices from O(M²d²) to O(Md²). Furthermore, the Cholesky factors L_ψ and L_u are banded matrices, and can similarly be compactly stored as dense matrices. In some examples, the Cholesky factors are restricted to being lower block-bidiagonal matrices as shown in Figure 4c, and can be stored compactly by arranging the blocks in two columns as shown in Figure 4b. Reducing the memory usage by compact storage of sparse matrices (i.e. by arranging the elements in a predetermined manner to form a corresponding dense matrix) is particularly advantageous where available memory is limited, for example when the present method is performed using a compact device such as a smartphone or a tablet computer. Such devices may only be able to allocate a small memory region to the inference process, especially if multiple applications or processes are running simultaneously.

The block-tridiagonal form of the variational precision matrix allows for the marginal covariances for pairs of neighbouring inducing states (each being a 2d × 2d block of block-tridiagonal elements of the dense covariance matrix Q_uu^{-1}) to be computed from the Cholesky factor L_u using a sparse inverse subset operator (described in more detail below) at a computational cost of O(Md²) operations. The matrix operations required to determine each marginal posterior density at the data points then add a further computational cost of O(d³) operations. The computational cost of evaluating the expectation terms in the objective function ℒ (or of estimating the expectation values in the case of non-conjugate likelihood models) is therefore O(Md² + Nd³) operations. Furthermore, the gradient of the expectation terms with respect to the parameters of the model has the same scaling by making use of reverse-mode sensitivities, as described in more detail hereafter. An unbiased estimator of the sum of expectation terms (and the gradient of the sum of expectation terms) can be determined by randomly sampling a subset of the data of size N_b ≪ N, referred to as a mini-batch, relabelling the indices to be ordered within the mini-batch, and determining an estimator for the sum of expectation terms on the basis of the mini-batch only. Mini-batching in this way reduces the computational cost to O(Md² + N_b d³) operations.

The KL divergence term in Equation (8) is given by Equation (14):

  KL[ q(u) || p(u) ] = ½ [ tr(Q_ψ Q_uu^{-1}) + (m_ψ − m_u)^T Q_ψ (m_ψ − m_u) − Md + log det(Q_uu) − log det(Q_ψ) ].    (14)

Only the block-tridiagonal elements of Q_uu^{-1} contribute to the trace term, and hence the trace term and its gradient can be evaluated in O(Md²) operations using the sparse inverse subset operator. The computational cost of the log determinants and their gradients is O(Md³) using the Cholesky decompositions of the precision matrices. The overall computational cost of evaluating the objective function ℒ and its gradient is therefore O((N + M)d³) operations, or O((N_b + M)d³) operations when mini-batches are used. It is stressed that the linear computational cost with respect to the number of inducing variables results from the auspicious parameterisation chosen for the variational distribution q(u).
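As an aside (not part of the patent text), the compact block arrangement of Figures 4a and 4b can be sketched as follows. The block ordering and padding conventions below are assumptions made for the example; the round-trip check builds a lower block-bidiagonal factor so that the resulting matrix is exactly block-tridiagonal.

```python
import numpy as np

def pack_block_tridiagonal(Q, M, d):
    """Store a symmetric block-tridiagonal (Md x Md) matrix as an (Md x 3d) dense
    array, with the sub-diagonal, diagonal and super-diagonal blocks of each block
    row placed side by side (cf. Figures 4a and 4b); the two corner blocks are unused."""
    packed = np.zeros((M * d, 3 * d))
    for i in range(M):
        r = slice(i * d, (i + 1) * d)
        packed[r, d:2 * d] = Q[r, r]                                  # diagonal block
        if i > 0:
            packed[r, 0:d] = Q[r, (i - 1) * d:i * d]                  # sub-diagonal block
        if i < M - 1:
            packed[r, 2 * d:3 * d] = Q[r, (i + 1) * d:(i + 2) * d]    # super-diagonal block
    return packed

def unpack_block_tridiagonal(packed, M, d):
    """Inverse of pack_block_tridiagonal."""
    Q = np.zeros((M * d, M * d))
    for i in range(M):
        r = slice(i * d, (i + 1) * d)
        Q[r, r] = packed[r, d:2 * d]
        if i > 0:
            Q[r, (i - 1) * d:i * d] = packed[r, 0:d]
        if i < M - 1:
            Q[r, (i + 1) * d:(i + 2) * d] = packed[r, 2 * d:3 * d]
    return Q

# Round-trip check: a lower block-bidiagonal factor L gives Q = L L^T, which is
# symmetric positive definite and exactly block-tridiagonal
M, d = 5, 3
rng = np.random.default_rng(0)
L = np.zeros((M * d, M * d))
for i in range(M):
    r = slice(i * d, (i + 1) * d)
    L[r, r] = np.tril(rng.standard_normal((d, d)), k=-1) + np.diag(1.0 + rng.random(d))
    if i > 0:
        L[r, (i - 1) * d:i * d] = rng.standard_normal((d, d))
Q = L @ L.T
packed = pack_block_tridiagonal(Q, M, d)
assert np.allclose(unpack_block_tridiagonal(packed, M, d), Q)
print("dense storage:", Q.shape, "-> compact storage:", packed.shape)
```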
Once the GP model has been trained (i.e. once the parameters of the variational distribution q(u), along with optionally the inducing input locations and any hyperparameters, have been determined), the distribution q(s(t_*)) at an unseen input location t_* is readily computed. Furthermore, a prediction at the unseen input location can be made in O(d³) operations by sampling a state vector from this distribution and then sampling from the conditional likelihood p(y_* | f(t_*)) at the corresponding value of the latent function f(t_*). The predictive probability distribution of an observation y_* at an unobserved input location t_* is given by Equation (15):

  p(y_* | y) ≈ ∫ p(y_* | f(t_*)) q(f(t_*)) df(t_*).    (15)

In cases where the likelihood is conjugate to the posterior distribution q(f(t_*)), the integral in Equation (15) can be evaluated analytically. In other examples, numerical methods can be used to estimate the predictive probability, for example Monte Carlo sampling as shown in Equation (16):

  p(y_* | y) ≈ (1/k) Σ_{j=1}^{k} p(y_* | f^(j)(t_*)),    f^(j)(t_*) ~ q(f(t_*)),    (16)

where k is the number of Monte Carlo samples used to approximate the integral. In other implementations, other numerical methods may be used to approximate the integral of Equation (15), such as Gaussian quadrature.

Banded matrix operations

As explained above, the banded matrices (block-tridiagonal and lower block-bidiagonal) used in the present formulation result in a dramatically reduced computational cost and memory requirements compared with other methods of implementing GP models. A number of computational operators specific to banded matrices are used to implement the present method. In particular, the following linear algebra operators are required for banded matrices Q, Q_1, Q_2, banded lower triangular matrices L, and vectors v, v_1, v_2:

• Cholesky decomposition C: Q → L such that L L^T = Q;
• Triangular solve S: (L, v) → L^{-1} v;
• Sparse inverse subset I: L → band_Q(Q^{-1});
• Reverse sparse inverse subset I^{-1}: band_Q(Q^{-1}) → L;
• Products P: (Q_1, Q_2) → Q_1 Q_2 or (Q, v) → Q v;
• Sparse outer product subset O: (v_1, v_2) → band_Q(v_1 v_2^T),

where band_Q denotes the elements in a band corresponding to that of the banded matrix Q. For a banded matrix of size Md × Md and bandwidth of order d (including the block-tridiagonal and block-bidiagonal matrices mentioned above), the above operators can be implemented at a computational cost of at most O(Md²) operations. The linear algebra operators described above allow for the objective function ℒ to be determined (or estimated in the case of a non-conjugate likelihood) in O((N_b + M)d³) operations. In order to determine the gradient of ℒ with respect to the model parameters (including the parameters of the variational distribution q(u) and, optionally, the inducing input locations and any hyperparameters), reverse-mode sensitivities are also required for each of these operators. Reverse-mode sensitivities are derivatives performed in reverse mode on the output of an operator. With the exception of the reverse sparse inverse subset operator, exemplary implementations of the linear operators above, as well as the associated reverse-mode sensitivities, can be found, for example, in the publication "Banded Matrix Operators for Gaussian Markov Models in the Automatic Differentiation Era", Durrande et al., AISTATS 2019, the entirety of which is incorporated herein by reference.
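As an informal illustration of the first two operators (not drawn from the patent), SciPy provides banded Cholesky and banded solve routines that operate directly on compactly stored bands; the matrix below and its bandwidth are assumed example values.

```python
import numpy as np
from scipy.linalg import cholesky_banded, cho_solve_banded

# Build a small symmetric positive-definite banded matrix Q with bandwidth bw
n, bw = 8, 2
rng = np.random.default_rng(1)
A = np.zeros((n, n))
for i in range(n):
    A[i, max(0, i - bw):i] = rng.standard_normal(min(bw, i))
    A[i, i] = 1.0 + rng.random()          # lower-banded, non-singular factor
Q = A @ A.T                               # bandwidth of Q is also bw

# Compact "lower" banded storage expected by SciPy: row k holds the k-th lower diagonal
ab = np.zeros((bw + 1, n))
for k in range(bw + 1):
    ab[k, :n - k] = np.diagonal(Q, offset=-k)

# Banded Cholesky decomposition (operator C) and banded solve (operator S),
# both performed without ever forming a dense n x n inverse
cb = cholesky_banded(ab, lower=True)
v = rng.standard_normal(n)
x = cho_solve_banded((cb, True), v)       # solves Q x = v
print("residual:", np.max(np.abs(Q @ x - v)))
```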
For the reverse sparse inverse subset operator I^{-1}, a novel operator is constructed as set out in the algorithm below, which takes as an input C = band_Q(Q^{-1}), where Q is a symmetric banded matrix of bandwidth d, and outputs a banded matrix L of lower bandwidth d for which Q = L L^T:

Reverse sparse inverse subset
  L ← 0 (initialise L to an Md × Md zero matrix)
  for i ∈ [0, ..., Md − 1] do
    c(i) ← C_{i:i+d, i:i+d} (extract the d × d symmetric sub-block at i)
    v(i) ← c(i)^{-1} e, where e = [1, 0, ..., 0]^T ∈ ℝ^d (select the first column of the inverse of the sub-block)
    L_{i:i+d, i} ← v(i) / √(v(i)_1) (write the normalised column into L)
  return L

Reverse-mode differentiation through the reverse sparse inverse subset operator is achieved using the following algorithm, which takes the reverse-mode sensitivity L̄ ≡ df/dL of an arbitrary function f with respect to the Cholesky factor L as an input, and outputs a matrix C̄ ≡ df/dC which is the reverse-mode sensitivity with respect to C = band_Q(Q^{-1}):

Reverse-mode differentiation for reverse sparse inverse subset
  C̄ ← 0 (initialise C̄ to an Md × Md zero matrix)
  for i ∈ [0, ..., Md − 1] do
    c(i) ← C_{i:i+d, i:i+d} (extract the d × d symmetric sub-block at i)
    accumulate into C̄_{i:i+d, i:i+d} the sensitivity of column i of L with respect to c(i)
  return C̄

where v(i) = c(i)^{-1} e and e = [1, 0, ..., 0]^T ∈ ℝ^d as before. In the algorithms above, the determination of the inverse of a symmetric sub-block can be achieved by computing the Cholesky factor s(i) of c(i) and then solving the resulting triangular systems. The Cholesky factors s(i) can either be computed directly in parallel, or alternatively using an iterative method in which, given the Cholesky factor s(i) of c(i), the Cholesky factor s(i − 1) of c(i − 1) is given by Equation (17) in terms of chol(C_{i−1:i+d, i−1:i+d}). The iterative method results in a computational cost of O(Md²) operations for the reverse sparse inverse subset operator and the associated reverse-mode derivative.

Real-time signal analysis

The inference method presented herein is particularly advantageous for the real-time or near real-time analysis of data received continually as a data stream. Examples include real-time analysis of audio samples, for example in speech detection, speech recognition, automatic pitch detection and similar applications; modelling of dynamical systems from noisy observations, for example to generate control signals for one or more entities in the dynamical system; and analysis or visualisation of neural activation data in human or animal brains. In the context of machine learning, performing inference in real time on streamed data is referred to as on-line learning. Until now, GP models have been generally unsuitable for on-line learning because GP inference generally requires the inversion of a dense matrix at each update step. In order for new data to be incorporated, a new dense matrix needs to be inverted, resulting in an O(N³) computational cost at each update, where N is the size of the matrix. In accordance with certain embodiments of the present invention, data may be received as a data stream and inducing states may be generated sequentially in an on-line manner. For example, new inducing inputs may be initialised concurrently with the receiving of streamed data, either at a predetermined rate or in response to the receiving of observations. Due to the Markov property of the inducing states, the marginal distribution q(u_M) of a new inducing state may be determined and optimised at a low computational cost. Figure 5 shows an example of a real sound file representing an uttered vowel from a female speaker, sampled at a frequency of 16 kHz.
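Before returning to the audio example of Figure 5, the column-by-column construction in the reverse sparse inverse subset algorithm above can be checked numerically. The sketch below is not taken from the patent: it uses dense NumPy arrays purely for clarity and verification, and the test matrix is an assumed example.

```python
import numpy as np

def reverse_sparse_inverse_subset(C, d):
    """Recover the lower-banded Cholesky factor L (d non-zeros per column) of Q
    from C = band(Q^{-1}), following the column-by-column construction set out above.
    C is passed as a dense array for clarity; only its in-band entries are used."""
    n = C.shape[0]
    L = np.zeros((n, n))
    e = np.zeros(d)
    e[0] = 1.0
    for i in range(n):
        m = min(d, n - i)                      # sub-blocks are clipped at the edge
        c_i = C[i:i + m, i:i + m]              # d x d symmetric sub-block at i
        v_i = np.linalg.solve(c_i, e[:m])      # first column of the inverse sub-block
        L[i:i + m, i] = v_i / np.sqrt(v_i[0])  # normalised column of the Cholesky factor
    return L

# Verification against a dense computation
rng = np.random.default_rng(2)
n, d = 12, 3                                   # d non-zeros per column of L
L_true = np.zeros((n, n))
for i in range(n):
    m = min(d, n - i)
    L_true[i:i + m, i] = rng.standard_normal(m)
    L_true[i, i] = abs(L_true[i, i]) + 1.0
Q = L_true @ L_true.T
band = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) <= d - 1
C = np.where(band, np.linalg.inv(Q), 0.0)      # band(Q^{-1}), matching the band of Q
L_rec = reverse_sparse_inverse_subset(C, d)
print("max reconstruction error:", np.max(np.abs(L_rec - L_true)))
```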
A vowel is typically a quasi-periodic signal in which the pitch is fixed but the repeated temporal pattern varies with time. The sound file of Figure 5 is therefore assumed to be generated by a Markovian GP with a quasi-periodic kernel given by Equation (18):

  k(t, t') = Σ_i γ_i cos(2π i f_0 (t − t')) k_mat3/2(t − t'),    (18)

where f_0 denotes the fundamental frequency of the pitched sound and k_mat3/2 denotes the Matérn kernel of order 3/2. The kernel has four hyperparameters: the fundamental frequency f_0 of the periodic variation, the variance γ_i of the various harmonics, the length-scale l of the Matérn kernel and the variance σ² of the Matérn kernel. This kernel has a state-space representation with d = 26, and accordingly the methods set out in the present invention can be used to analyse the signal. In the example of Figure 5, the audio signal is received by a microphone and analysed in real time using the method described herein. In the present example, inducing inputs are initialised sequentially at regular 0.05 s intervals. Figure 5a shows four white circles corresponding to inducing states sampled at inducing inputs {z_1, z_2, z_3, z_4} = {0.1, 0.15, 0.2, 0.25} s. The marginal distributions over the four inducing states are determined by optimising an objective function ℒ for samples in the interval Δ_1 = [0.1, 0.25] s as described above, for example using mini-batches of observations within this interval, resulting in a computational cost of O((N_Δ1 + M_Δ1) d³) operations, where N_Δ1 is the number of samples used in the interval Δ_1 (which may be a mini-batch of the samples in Δ_1) and M_Δ1 = 4 is the number of inducing states in the interval Δ_1. In this way, the GP model is fitted to the samples in the interval Δ_1. In addition to the distributions over inducing states, in some examples hyperparameters of the GP model and locations of the inducing inputs are optimised for the interval Δ_1.

In the interval 0.25 s < t < 0.3 s, additional samples are received, and a further inducing state u_5 is initialised at time z_5 = 0.3 s. In the present example, parameters of the model are updated by optimising ℒ for samples in the interval Δ_2 = [0.15, 0.3] s, for example using mini-batches of observations within this interval, resulting in a computational cost of O((N_Δ2 + M_Δ2) d³) operations. In this example, the intervals Δ_1 and Δ_2 are of equal duration and contain an equal number of inducing inputs. As such, a window of fixed size (a fixed interval) is moved over the data as the signal is received, resulting in a model that automatically adapts to changes in characteristics of the data. In some examples, an interval is bounded by two inducing inputs and contains no further inducing inputs (i.e. M_Δi = 2), allowing for very rapid adaptive updating of the model. In other examples, multiple inducing states may be initialised before optimisation is performed. Parameters determined for a preceding interval may be used as initial guesses for the new interval. In other examples, intervals may vary in size. For example, the size of an interval may depend on a rate at which data is received.

In some examples, characteristics of a data signal may evolve over time. For example, the pitch of a voice may vary in an audio file representing human speech. In a simple example, the present invention is used to determine a pitch of a voice or other sound from an audio file. The pitch corresponds to the fundamental frequency hyperparameter f_0 of the kernel in Equation (18).
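A minimal sketch of the quasi-periodic kernel of Equation (18) is given below (not part of the patent text). The number of harmonics and the hyperparameter values are assumptions made for the illustration.

```python
import numpy as np

def matern32(tau, lengthscale, variance):
    """Matern-3/2 kernel evaluated at a lag tau."""
    a = np.sqrt(3.0) * np.abs(tau) / lengthscale
    return variance * (1.0 + a) * np.exp(-a)

def quasi_periodic_kernel(t1, t2, f0, gammas, lengthscale, variance):
    """Quasi-periodic kernel of Equation (18): a sum over harmonics of cosine terms
    at multiples of the fundamental frequency f0, modulated by a Matern-3/2 kernel
    that lets the repeated temporal pattern drift slowly over time."""
    tau = t1 - t2
    harmonics = sum(g * np.cos(2.0 * np.pi * (j + 1) * f0 * tau)
                    for j, g in enumerate(gammas))
    return harmonics * matern32(tau, lengthscale, variance)

# Assumed hyperparameter values for a 200 Hz pitched sound with three harmonics
f0, gammas, lengthscale, variance = 200.0, [1.0, 0.5, 0.25], 0.05, 1.0
tau = np.linspace(0.0, 0.02, 5)
print(quasi_periodic_kernel(tau, 0.0, f0, gammas, lengthscale, variance))
```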
Using the approach described above, the evolution of the pitch over time can be measured using the optimised frequency hyperparameter for each interval. Other information can be determined from hyperparameters. For example, where a state-space model is derived from a model of a dynamical system, hyperparameters are related to physical properties of the system, which can therefore be derived from the optimised hyperparameters. In some examples, hyperparameters of a model may remain fixed (for example, where a state-space model corresponds to a model of a dynamical system).

Natural gradient descent

In the examples described above, the variational distribution q(u) is parameterised in terms of the mean m_u and the Cholesky factor L_u of the precision matrix Q_uu. The Cholesky factor L_u is advantageously restricted to being a lower block-bidiagonal matrix, which allows the computational cost and memory requirements of inference to scale linearly with the number M of inducing points, and further allows learning to be performed in an on-line manner. In optimising the objective function ℒ with respect to the parameters of the variational distribution, gradient descent is performed using reverse-mode automatic differentiation operators for banded matrices. In the present section, a variant of gradient descent is presented, referred to as natural gradient descent. It is observed that, in accordance with the present invention, natural gradient descent can be used as an alternative to gradient descent for optimising the parameters of the variational distribution, and retains the linear scaling of the optimisation procedure with respect to the number of inducing variables.

Gradient-based optimisation methods iteratively update a parameter vector ξ to form a sequence ξ_0, ξ_1, ξ_2, ..., terminating at converged parameters that sufficiently approximate an optimal set of parameters according to predefined convergence criteria. The optimal parameters maximise the objective function ℒ. A general example of an update rule for updating the parameter vector is given by Equation (19):

  ξ_{t+1} = ξ_t + γ_t P_t (dℒ/dξ)^T |_{ξ_t},    (19)

in which γ_t is the step size, which may be fixed or may vary with step number, and P_t is a preconditioning matrix that affects the direction and the size of the change in the parameters induced by the update. Different gradient-based optimisation methods vary from each other by using different preconditioning matrices. For gradient descent (referred to hereafter as ordinary gradient descent), P_t is an identity matrix. For Adam, P_t is a diagonal matrix with components given as described in the following section. For L-BFGS, P_t is an approximation of the inverse Hessian matrix of ℒ. For natural gradient descent, P_t is the inverse of the Fisher information matrix F_ξ, which is defined by Equation (20):

  F_ξ = E_{q(u)}[ (d log q(u)/dξ)^T (d log q(u)/dξ) ].    (20)

Defining the natural gradient of ℒ as ∇̃_ξ ℒ = (dℒ/dξ) F_ξ^{-1}, the update rule (19) for natural gradient descent is written as Equation (21):

  ξ_{t+1} = ξ_t + γ_t (∇̃_ξ ℒ)^T |_{ξ_t}.    (21)

For small step sizes, each iteration of the natural gradient method as given by Equation (21) updates the parameters in the direction that maximises the increase in ℒ for a fixed KL divergence between q(u) for the updated parameter vector and q(u) for the previous parameter vector. For a given sequence of step sizes, a sequence of updates resulting from Equation (21) has the special property that it is independent of the choice of parameterisation. For conjugate GP models, natural gradient descent is known to converge in very few steps, and in most cases in one step.
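The one-step convergence for conjugate models can be illustrated with a toy example (not from the patent): a univariate Gaussian posterior, with the natural-gradient step taken in the natural parameters and the gradient with respect to the expectation parameters obtained by finite differences. All values, and the helper names, are assumptions made for the illustration.

```python
import numpy as np

# Conjugate toy model: prior f ~ N(0, sp2), likelihood y_i ~ N(f, sn2)
rng = np.random.default_rng(3)
sp2, sn2 = 2.0, 0.5
y = rng.normal(1.0, np.sqrt(sn2), size=20)

def elbo(m, v):
    """Variational lower bound for q(f) = N(m, v)."""
    expected_loglik = np.sum(-0.5 * np.log(2 * np.pi * sn2)
                             - 0.5 * ((y - m) ** 2 + v) / sn2)
    kl = 0.5 * (v / sp2 + m ** 2 / sp2 - 1.0 + np.log(sp2 / v))
    return expected_loglik - kl

def eta_to_mv(eta1, eta2):        # expectation parameters -> (mean, variance)
    return eta1, eta2 - eta1 ** 2

# Start from an arbitrary q(f)
m, v = -3.0, 10.0
theta = np.array([m / v, -0.5 / v])          # natural parameters
eta = np.array([m, v + m ** 2])              # expectation parameters

# d(ELBO)/d(eta) by central finite differences
grad_eta = np.zeros(2)
for k in range(2):
    h = 1e-5 * max(1.0, abs(eta[k]))
    ep, em = eta.copy(), eta.copy()
    ep[k] += h
    em[k] -= h
    grad_eta[k] = (elbo(*eta_to_mv(*ep)) - elbo(*eta_to_mv(*em))) / (2 * h)

# One natural-gradient step with step size 1: theta <- theta + d(ELBO)/d(eta)
theta_new = theta + grad_eta
m_new, v_new = -0.5 * theta_new[0] / theta_new[1], -0.5 / theta_new[1]

# Exact posterior for comparison: a single step recovers it (up to FD error)
v_post = 1.0 / (1.0 / sp2 + len(y) / sn2)
m_post = v_post * np.sum(y) / sn2
print("one natural-gradient step:", m_new, v_new)
print("exact posterior          :", m_post, v_post)
```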
For non-conjugate GP models, a robust scheme for achieving rapid convergence using the natural gradient is to use a step size that varies according to two phases. During the first phase, in which the natural gradient updates take relatively large values, the step size starts at a small value and increases with step number over a predetermined number of iterations until it reaches a predetermined value. In the second phase, the step size remains at a constant or approximately constant value, for example the predetermined value from the first phase. One example of such a scheme is to increase the step size log-linearly over K steps, such that γ_0 = γ_initial and γ_K = γ_final, and to fix γ_t = γ_final for t > K. In a specific example, γ_initial = 10^{-4}, γ_final = 10^{-1}, and K = 5. Other combinations of γ_initial, γ_final, and K are also expected to lead to robust performance for γ_initial ≪ γ_final, as are other schemes in which the sequence has an initial increasing phase followed by a phase in which γ_t remains constant or approximately constant. Robust performance refers to cases in which the sequence reliably converges, independent of the initial choice of parameters.

An efficient method for calculating the natural gradient will now be described. The natural gradient is related to a first distinguished parameterisation, referred to as the natural parameterisation θ = [θ_1, θ_2], and a second distinguished parameterisation, referred to as the expectation parameterisation η = [η_1, η_2]. General definitions of these parameterisations are known in the art. In terms of the parameterisation described above (the mean m_u and Cholesky factor L_u of the precision matrix), the natural and expectation parameterisations are given by Equation (22):

  θ_1 = Q_uu m_u,    θ_2 = −½ btd[Q_uu],    η_1 = m_u,    η_2 = btd[Q_uu^{-1} + m_u m_u^T],    (22)

where btd[M] extracts the block-tridiagonal elements of a matrix M and returns them as a vector. It is evident from Equation (22) that the natural and expectation parameters inherit the banded structure of the original parameters. Inverse transformations from the natural and expectation parameters to the original parameters are given by Equations (23) and (24):

  m_u = (−2θ_2)^{-1} θ_1,    L_u = chol(−2θ_2),    (23)

  m_u = η_1,    L_u = I^{-1}( btd[η_2 − η_1 η_1^T] ),    (24)

where chol denotes the Cholesky factor and I^{-1} denotes the reverse sparse inverse subset operator. These parameter transformations are performed using the sparse inverse subset operator and the reverse sparse inverse subset operator mentioned above. The transposed natural gradient (which appears in the update rule of Equation (21)) is given by

  (∇̃_ξ ℒ)^T = (dℒ/dη)(dξ/dθ)^T.    (25)

The farthest right derivative is evaluated in practice using the chain rule dξ/dθ = (dξ/dη)(dη/dθ). Using the reverse-mode sensitivities of the banded matrix operators mentioned above, the derivatives not involving ℒ can be computed in O(Md²) operations. An efficient method of performing natural gradient descent using automatic reverse-mode differentiation libraries is discussed in the article "Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models", Salimbeni et al., AISTATS 2018.

Optimising hyperparameters and inducing input locations

The preceding section described a method of optimising parameters of a variational Gaussian distribution q(u) using natural gradient descent. In addition to these parameters, the variational GP q(f(·)) is dependent on hyperparameters of the GP (i.e.
coefficients of the LTI-SDE for the state-space model), and also on the inducing inputs. Natural gradient descent is not suitable for optimising the hyperparameters or the inducing inputs, as neither of these has an associated probability distribution and hence the natural gradient is undefined. Instead, in examples where natural gradient descent is used to optimise the parameters of the variational distribution, a different optimisation method must be used to update the inducing inputs and the hyperparameters. A hybrid scheme is therefore proposed which alternates between steps of natural gradient descent and the other chosen optimisation method. In a specific example, natural gradient descent is alternated with Adam, which is defined by the update rule of Equation (19) with the gradient replaced by its bias-corrected exponential moving average m̂ and with P given by a diagonal matrix with elements 1/(√(v̂_i) + ε), where m̂_i and v̂_i are the bias-corrected exponential moving averages of the gradient components and of their squares respectively, and ε is a fixed small number. In this example, ε = 10^{-8}. The step size γ_Adam for the Adam update is generally different to the step size γ used for the natural gradient update. In one example, the step size γ_Adam decreases with the number of update steps, for example exponentially. In another example, a search is performed over a set of candidate step sizes and the largest step size that remains stable is chosen.

Example data processing system

Figure 6 shows an example of a data processing system 600 arranged to perform signal analysis in accordance with an embodiment of the present invention. The system includes various additional components not shown in Figure 6, such as input/output devices, a wireless network interface and the like. The data processing system 600 includes a receiver 602, which is arranged to receive a data signal. In the present example, the receiver is a microphone receiver for receiving an audio signal. In other examples, a receiver may include a radio receiver, a camera, a radio antenna, or any other component capable of receiving a data signal. The data processing system 600 includes a data buffer 604 for temporarily storing a received data signal before passing the received data signal to working memory 606, which in this example includes volatile random-access memory (RAM), in particular static random-access memory (SRAM) and dynamic random-access memory (DRAM). The data processing system 600 may also include non-volatile storage, not shown in Figure 6. The working memory 606 is arranged to receive data from the data buffer 604, and to store parameters of a Markovian GP including inducing inputs, parameters of a variational distribution over inducing states, and hyperparameters of the GP model. Block-banded matrices, including the precision matrices Q_ψ and Q_uu and their Cholesky factors, are stored compactly as described above. The data processing system 600 includes an inference module 608, which includes processing circuitry configured to perform variational inference in accordance with the present invention. In this example, the inference module 608 includes multiple processor cores 610, of which four processor cores 610a-d are shown. In this example, the processor cores are located within a CPU, though in other examples multiple processor cores may be included, for example in one or more processing units such as a graphics processing unit (GPU) or a neural processing unit (NPU).
During inference, determination of the marginal distributions q(s(t_n)) and subsequent determination of the expectation terms in Equation (8) is parallelised across the processor cores once the marginal distributions of all pairs of consecutive inducing states have been determined.

Example method

Figure 7 shows an example of a method performed by the data processing system 600 to process a data signal in accordance with an embodiment of the present invention. The data processing system 600 initialises, at S702, first parameters including hyperparameters in the form of coefficients of a state-space representation of a Markovian GP model. In the present example, the hyperparameters are initialised randomly, though in other examples the hyperparameters may be initialised using a different strategy, for example by deriving the hyperparameters from a dynamical model of a physical system. The data processing system 600 receives, at S704, a first portion of the signal via the receiver 602. The received first portion is passed to the working memory 606 via the data buffer 604, either continually or in batches. The first portion of the signal is formed of one or more observations or samples. The data processing system 600 initialises, at S706, one or more inducing states. In this example, initialising an inducing state involves two steps: first initialising an inducing input, and then initialising entries of a mean vector and a banded Cholesky factor of a precision matrix corresponding to the inducing input. In this example, inducing inputs are initialised sequentially and concurrently with the receiving of the signal. In other words, the one or more inducing inputs are initialised at such a rate as to allow real-time updating of the GP model as the signal is received. In the present example, the inducing inputs are initialised at predetermined input locations, for example being evenly spaced from each other, though in other examples input locations may be initialised unevenly, for example randomly or in dependence on the input locations of received observations. In some examples, inducing inputs are initialised using K-means clustering of received observations, or at input locations of received observations. In the present example, the entries of the mean vector and Cholesky factor of the precision matrix are initialised randomly. The data processing system 600 determines, at S708, marginal distributions q(s(t_n)) for one or more observations in a given interval. As explained above, the interval may have a predetermined size or may be dependent on a rate of observations being received. The one or more observations may be a mini-batch formed of a subset of the observations received in that interval, or may include all of the observations received in the interval. The data processing system determines, at S710, an unbiased estimator for a gradient of an objective function ℒ for observations in the given interval using the reverse-mode sensitivities mentioned above. In some examples, an ordinary gradient is determined with respect to the parameters L_u, m_u of the variational Gaussian distribution q(u), and optionally with respect to the inducing inputs and hyperparameters of the Markovian GP model. Preferably, however, a natural gradient is determined with respect to the parameters of the variational Gaussian distribution q(u), as discussed above.
In examples where the expectation term in the objective function is intractable, an unbiased stochastic estimator is determined, for example using one or more Monte Carlo samples as shown in Equation (26):

  E_{q(s(t_n))}[ log p(y_n | f(t_n)) ] ≈ (1/S) Σ_{s=1}^{S} log p(y_n | H ŝ^(s)(t_n)),    ŝ^(s)(t_n) ~ q(s(t_n)),    (26)

where Φ denotes the set of parameters to be updated and S is the number of Monte Carlo samples. In practice, samples ε^(s) of a random variable are drawn from a normalised Gaussian distribution N(0, I_d) and a sampled state ŝ^(s)(t_n) is determined as ŝ^(s)(t_n) = μ_n + R_n ε^(s), where μ_n is the mean and R_n R_n^T is the covariance of the marginal distribution q(s(t_n)). This is commonly referred to as the reparameterisation trick, and ensures that the objective function is differentiable with respect to the parameters Φ, which in turn allows reverse-mode differentiation to be performed to determine the gradient. For examples in which the expectation terms are tractable, Monte Carlo sampling is unnecessary.

The data processing system 600 performs, at S712, a gradient-based update of the parameters Φ using the determined unbiased gradient estimator (using gradient descent, natural gradient descent, and/or variants of these) to increase the value of the objective function ℒ. Steps S708 to S712 are performed iteratively until either predetermined convergence criteria are satisfied or until a predetermined number of iterations have been performed. The resulting set of parameters may then be transferred to main storage or may remain in the working memory 606 to be used in further computations, for example to determine a predictive distribution over an unknown output y_* for a given input location t_*. At a later time, the data processing system 600 receives a second portion of the signal via the receiver 602. Accordingly, the method of Figure 7 begins again from S704, with new inducing points being initialised within a new interval encompassing the new portion of the signal. In this way, the model is updated adaptively. In some examples, hyperparameters of the Markovian GP model are updated automatically each time a new signal portion is processed. In other examples, hyperparameters are updated only if the performance of the model is determined to deteriorate (for example, if the optimised value of the objective function for the new interval is lower than the optimised value of the objective function for the previous interval by a threshold amount). In this way, the model is adaptively updated as the properties of the signal evolve.

Application to DGPs

The expressive capacity of a given GP model is limited by the choice of kernel, or equivalently, by the choice of state-space representation for a Markovian GP. Extending a GP model to having a deep structure can improve the expressive capacity of the model. In some embodiments, a deep GP model is formed by composing multiple Markovian GPs with respective state-space representations, using inter-domain inducing states as described herein. In an example, a deep GP architecture is based on a composition of functions f(·) = f_L(..., f_2(f_1(·))), where each component function f_l: ℝ → ℝ is given a Markovian GP prior with a state-space representation s_l(·). The joint density for the deep GP model is given by Equation (27):

  p(y, {h_{n,l}}, {s_l(·)}) = Π_{n=1}^{N} [ p(y_n | h_{n,L}) Π_{l=1}^{L} p(h_{n,l} | s_l(h_{n,l−1})) ] Π_{l=1}^{L} p(s_l(·)),    (27)

in which h_{n,0} ≡ t_n and the (predetermined) form of p(h_{n,l} | s_l(h_{n,l−1})) determines how the output vector h_{n,l} of a given GP layer depends on the state vector for that layer, and may be chosen to be stochastic or deterministic. In a specific deterministic example, the output of the layer is equal to the output of the latent function, such that h_{n,l} = f_l(h_{n,l−1}) = H_l s_l(h_{n,l−1}). Following the approach of H. Salimbeni and M.
Deisenroth in "Doubly stochastic variational inference for deep Gaussian processes", Advances in Neural Information Processing Systems, 2017, a variational deep GP is introduced and assumed to be factorisable between layers (referred to as a mean-field assumption), such that each layer of the deep GP is approximated by a variational GP layer q(f_l) with inducing states u_l = s_l(Z_l) defined at a respective set Z_l of inducing inputs. The inducing states are given variational Gaussian distributions q(u_l), in which the Cholesky factor of the precision matrix for each layer is restricted to being a banded matrix of bandwidth 2d_l − 1 (or, in some examples, a lower block-bidiagonal matrix). The mean and Cholesky factor of the precision for the Gaussian distribution in each layer, along with (optionally) the locations of the inducing inputs Z_l and hyperparameters of the kernels k_l, are determined through optimisation of an objective function given by Equation (28):

  ℒ = Σ_{n=1}^{N} E[ log p(y_n | h_{n,L}) ] − Σ_{l=1}^{L} KL[ q(u_l) || p(u_l) ],    (28)

where the approximate posterior density q(s_l(h_{n,l−1})) over which each expectation is taken is given by Equations (10)-(13), but with t_n replaced with h_{n,l−1}. Each of the KL terms in Equation (28) has an analogous expression to that of Equation (14). Due to the intractability of the expectation terms in Equation (28), it is necessary to draw Monte Carlo samples from the distributions q(s_l(h_{n,l−1})), which is achieved using a reparameterisation trick in which a random variable ε^(l) is sampled from a normalised Gaussian distribution and then passed through a deterministic function to arrive at the required distributions, as described above for the single-layer case. This reparameterisation trick allows for reverse-mode differentiation to be applied in order to perform gradient-based optimisation of the objective function of Equation (28).

Figure 8 schematically shows a training phase for a two-layer DGP model (L = 2) in a regression setting of the type described above. Each observation in a dataset has an input portion t_n and an output portion y_n. At a given training iteration, an ordered mini-batch B of observations is sampled from the dataset. For each observation in the mini-batch, a first random variable ε_n^(1) is drawn from the normalised multivariate Gaussian distribution, and a vector h_{n,1} is determined by evaluating the stochastic function f_1(t_n) using the random variable ε_n^(1). A second random variable ε_n^(2) is then drawn from the normalised multivariate Gaussian distribution, and a vector h_{n,2} is determined by evaluating the stochastic function f_2(h_{n,1}) using the random variable ε_n^(2). The likelihood p(y_n | h_{n,2}) is then evaluated at the vector h_{n,2}, and the logarithm of this likelihood is used as a Monte Carlo estimate of the expectation appearing in Equation (28). Reverse-mode differentiation is then performed to determine an unbiased estimator for a gradient of the objective function. In Figure 8, the above process is illustrated for a first observation (t_1, y_1) in the mini-batch B.

In the examples described above, DGPs were described in which each layer was a Markovian GP implemented in accordance with the present disclosure. In other examples, a Markovian GP implemented in this way may be used as a layer in a DGP comprising other types of GP layers, for example convolutional GP layers or any other type of GP layer suitable for use within a DGP.
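A schematic of the doubly stochastic forward pass through a two-layer model is sketched below (illustrative only, not the patent's implementation). The per-layer mean and variance functions are stand-ins for the marginal posterior moments q(s_l(h_{n,l−1})) of Equations (10)-(13), and all numerical values are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_layer(h_in, mean_fn, var_fn):
    """Reparameterised sample of a GP layer output:
    h_out = mu(h_in) + sqrt(var(h_in)) * eps, with eps ~ N(0, 1).
    mean_fn and var_fn stand in for the layer's marginal posterior moments."""
    eps = rng.standard_normal(h_in.shape)
    return mean_fn(h_in) + np.sqrt(var_fn(h_in)) * eps

# Stand-in posterior moments for two layers (assumptions for the illustration)
mean_1, var_1 = lambda t: np.sin(t), lambda t: 0.05 * np.ones_like(t)
mean_2, var_2 = lambda h: 0.5 * h ** 2, lambda h: 0.02 * np.ones_like(h)

def log_gauss(y, mu, noise_var):
    return -0.5 * np.log(2 * np.pi * noise_var) - 0.5 * (y - mu) ** 2 / noise_var

# One doubly stochastic Monte Carlo estimate of the expected log-likelihood term
# in Equation (28), averaged over a mini-batch of inputs t_n with targets y_n
t_batch = np.linspace(0.0, 1.0, 8)
y_batch = np.sin(t_batch) ** 2 + 0.1 * rng.standard_normal(8)
h1 = sample_layer(t_batch, mean_1, var_1)   # layer 1 sample, using eps^(1)
h2 = sample_layer(h1, mean_2, var_2)        # layer 2 sample, using eps^(2)
print("estimated expected log-likelihood per observation:",
      np.mean(log_gauss(y_batch, h2, 0.1)))
```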
Multi-dimensional inputs

In the examples described above, Markovian GP models are used to model data defined on a scalar input domain (for example, time-series data). In other examples, the methods described herein can be extended to data with multiple input dimensions, i.e. to data of the form $(x, y)$, where the input $x = \{x_i\}_{i=1}^{D}$ has $D$ dimensions. In one example, an additive model is constructed as a sum of $D$ Gaussian processes corresponding to the $D$ dimensions of the input domain, as shown in Equation (29):

$f(x) = \sum_{i=1}^{D} f_i(x_i)$,    (29)

in which each $f_i$ is a Gaussian process with kernel $k_i$ and $x_i \in \mathbb{R}$. For each dimension, the kernel $k_i(x_i, x'_i)$ has a state-space representation as described above. The additive model may be used, for example, in regression problems where observations are expected to depend on multiple covariates. In such cases, a factorised (mean-field) approximation of the overall posterior process is given by $q(f) = \prod_{i=1}^{D} q^{(i)}(f_i)$, where the $q^{(i)}(f_i)$ are component variational GPs parameterised using the Markovian GP formulation described in the present application, with independent inducing states for each input dimension.

In some applications, such as audio processing and telecommunications, a signal (referred to as a mixture) is formed as a superposition of unknown underlying components (referred to as sources), often nonlinearly transformed and modified by noise. Recovering the underlying sources is referred to as source separation. An application of source separation for audio signals is the so-called cocktail party problem, in which the objective is to isolate the speech of a single person in a room with several people speaking (and, possibly, other sounds). Source separation may be performed using the additive model of Equation (29), by defining an extended state-space vector $S(t) = (s_1(t), \ldots, s_D(t))^T$ as a concatenation of the states of a set of $D$ underlying component Markovian GPs, and an extended observation matrix $H(t) = (H_1(t), \ldots, H_D(t))$. Each of the component Markovian GPs is associated with a variational GP, with a respective set of inducing points and a respective variational distribution to be determined by optimising a combined objective function. The resulting component GPs can be extracted using the extended observation matrix $H$, thus solving the source separation problem. Using the present formulation of Markovian GP inference, source separation can be performed rapidly, for example in real time or near real time depending on the rate of sampling of the data.
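As an illustration of the extended state-space construction used for the additive model and for source separation (a sketch under stated assumptions, not part of the patent disclosure; the helper name and the Matérn-1/2 component models are chosen for the example), the combined feedback matrix is block-diagonal over the components and the combined observation matrix is the horizontal concatenation of the component observation matrices, so that the mixture output is the sum of the component outputs:

```python
import numpy as np
from scipy.linalg import block_diag

def combine_state_space(components):
    """Assemble the extended state-space model for an additive / source-separation
    model from D component Markovian GPs.

    components : list of (F_i, H_i) pairs, where F_i is the feedback (state
                 transition) matrix and H_i the observation matrix of the i-th
                 component.
    Returns the block-diagonal extended feedback matrix F and the concatenated
    extended observation matrix H, so that H @ S(t) = sum_i H_i @ s_i(t).
    """
    Fs, Hs = zip(*components)
    F = block_diag(*Fs)    # extended feedback matrix, block-diagonal
    H = np.hstack(Hs)      # extended observation matrix, 1 x (sum of state dims)
    return F, H

# Example: two Matern-1/2 (exponential-kernel) components with lengthscales 0.5 and 2.0,
# each with a one-dimensional state, F_i = [[-1 / lengthscale]], H_i = [[1.0]].
components = [(np.array([[-1.0 / 0.5]]), np.array([[1.0]])),
              (np.array([[-1.0 / 2.0]]), np.array([[1.0]]))]
F, H = combine_state_space(components)
# Each source s_i(t) can be recovered from the extended state using its own H_i block.
```

Each component keeps its own inducing states and variational distribution; the extended observation matrix only relates the concatenated state to the observed mixture, and the same block construction underlies the Gaussian process factor analysis application described below.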
Further embodiments and modifications

The above embodiments are to be understood as illustrative examples of the invention. Further embodiments of the invention are envisaged. For example, the present invention may be implemented using different hardware from the data processing system shown in Figure 6, for example using a distributed computing system and/or in accordance with a client-server arrangement. In some examples, the present invention may be used within other types of composite GP models, for example in models with multi-dimensional outputs modelled by correlated GPs.

In one example, the present Markovian GP model is used in Gaussian process factor analysis of neural activity, in which measurements of neural activity are taken over a period of time for multiple neurons in a human or animal brain. The neural activity measurements at a given time are correlated and are assumed to be noisy observations of an underlying low-dimensional neural state. This method allows neural activity data to be meaningfully analysed from a single trial, which is preferable to averaging measurements over multiple trials, because neural conditions generally vary even for nominally identical trials. In accordance with an embodiment of the present invention, each dimension of the neural state is represented by a Markovian GP with a respective characteristic time-scale, parameterised as described above. A multi-output composite GP is described by an extended state-space vector $S(t) = (s_1(t), \ldots, s_D(t))^T$, which is projected onto $k$ measured outputs using an extended observation matrix $H \in \mathbb{R}^{k \times Dd}$. Gaussian process factor analysis is performed by training the composite GPs along with the extended observation matrix. In this setting, the present invention allows for efficient real-time or near real-time analysis and/or visualisation of neural activity within the human or animal brain.

It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.