
Title:
A METHOD AND APPARATUS FOR DETERMINING DIFFERENT OPERATING POINTS IN A SYSTEM
Document Type and Number:
WIPO Patent Application WO/2011/086171
Kind Code:
A1
Abstract:
Apparatus for determining an operating point of a system that is used to predict a time dependent output value. The apparatus comprises an input for receiving two or more time-dependent input variables, and an initialiser coupled to said input for receiving said input variables therefrom. The initialiser comprises a pre-processor for removing constant correlations between said two or more input variables, and a processor configured to apply a non-linear function to said two or more input variables to obtain a series of time dependent instantaneous correlations. The apparatus further comprises an operating point determination unit for determining an appropriate operating point on the basis of the obtained instantaneous correlations.

Inventors:
VALPOLA HARRI (FI)
Application Number:
PCT/EP2011/050488
Publication Date:
July 21, 2011
Filing Date:
January 14, 2011
Assignee:
ZENROBOTICS LTD (FI)
VALPOLA HARRI (FI)
International Classes:
G06N3/08
Domestic Patent References:
WO 2008/102052 A2 (2008-08-28)
Other References:
HARRI VALPOLA: "Behaviourally meaningful representations from normalisation and context-guided denoising", INTERNET CITATION, 2004, pages 1 - 8, XP002493664, Retrieved from the Internet [retrieved on 20080826]
ANONYMOUS: "ZenRobotics Recycler introduction", 3 September 2010 (2010-09-03), pages 19PP, XP002624806, Retrieved from the Internet [retrieved on 20110224]
ILIN ET AL: "Exploratory analysis of climate data using source separation methods", NEURAL NETWORKS, vol. 19, no. 2, 1 March 2006 (2006-03-01), ELSEVIER SCIENCE PUBLISHERS, BARKING, GB, pages 155 - 167, XP005367355, ISSN: 0893-6080, DOI: 10.1016/J.NEUNET.2006.01.011
ALEXANDER ILIN ET AL: "Frequency-Based Separation of Climate Signals", KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2005, LECTURE NOTES IN COMPUTER SCIENCE; LECTURE NOTES IN ARTIFICIAL INTELLIGENCE; LNCS, vol. 3721, 1 January 2005 (2005-01-01), SPRINGER, BERLIN, DE, pages 519 - 526, XP019021315
JAAKKO SÄRELÄ ET AL: "Denoising Source Separation", JOURNAL OF MACHINE LEARNING RESEARCH, vol. 6, 3 May 2005 (2005-05-03), MIT PRESS, CAMBRIDGE, MA, US, pages 233 - 272, XP007905529, ISSN: 1532-4435
Attorney, Agent or Firm:
LIND, Robert (4220 Nash CourtOxford Business Park South, Oxford Oxfordshire OX4 2RU, GB)
CLAIMS:

1. Apparatus for determining an operating point of a system that is used to predict a time dependent output value, the apparatus comprising:

an input for receiving two or more time-dependent input variables;

an initialiser coupled to said input for receiving said input variables therefrom, and comprising

a pre-processor for removing constant correlations between said two or more input variables, and

a processor configured to apply a non-linear function to said two or more input variables to obtain a series of time dependent instantaneous correlations; and

an operating point determination unit for determining an appropriate operating point on the basis of the obtained instantaneous correlations.

2. Apparatus according to claim 1, wherein said pre-processor is configured to remove constant correlations between said two or more input variables by generating a linear prediction of one of said input variables from the or each other input variable, and removing the prediction from that one input variable.

3. Apparatus according to claim 1 or 2, wherein the pre-processor is configured to remove slow components.

4. Apparatus according to any one of the preceding claims, wherein said processor is configured to apply the non-linear function directly to the pre-processed input variables.

5. Apparatus according to any one of claims 1 to 3, wherein said processor is configured to generate projections of the input variables using a set of projection coefficients, and to apply the non-linear function to those projections.

6. Apparatus according to claim 5, wherein said processor is configured to employ an iterative estimation procedure to obtain the projection coefficients.

7. Apparatus according to any one of the preceding claims and comprising a sampler for capturing said two or more input variables over a pre-determined sample range and for supplying these to said input.

8. Apparatus according to any one of the preceding claims, wherein said input variables are one or more of video signals, audio signals, and mechanical and physical measurements.

9. Apparatus according to any one of the preceding claims, wherein said pre-processor is configured to whiten said input variables.

10. Apparatus according to any one of the preceding claims, wherein said non-linear function comprises multiplication of the two or more input variables.

11. Apparatus according to any one of the preceding claims, wherein said processor is configured to apply a non-linear function to said two or more input variables to obtain a first series of instantaneous correlations, and to apply dimension reduction to said first series of instantaneous correlations to determine a second series of instantaneous correlations, said operating point determination unit determining an appropriate operating point using said second set of instantaneous correlations.

12. Apparatus according to any one of the preceding claims, wherein said non-linear function applies spatial and/or temporal filtering, the spatial and/or temporal filtering using one or more of:

principal component analysis (PCA);

low-pass filtering;

high-pass filtering;

linear slow feature analysis (lSFA); and

canonical correlation analysis (CCA).

13. Apparatus according to any one of the preceding claims and comprising a modelling module coupled to said operating point determination unit whereby the operating point determination unit configures the modelling module according to the determined operating point, and the modelling module receiving as an input one or more of said time-dependent input variables and providing at an output a prediction of a physical variable.

14. Apparatus according to claim 13, wherein said prediction of a physical variable is provided in the form of a control signal for controlling a machine.

15. Apparatus according to claim 14, wherein the said control signal is provided to instruct a machine to move and/or carry out a pre-defined action.

16. A machine for performing a physical task comprising:

one or more electrically actuable moving parts;

one or more sensors for sensing properties of the moving part(s);

a controller for controlling the movement of said moving part(s); and

an apparatus according to claim 15 coupled to said sensor(s) to receive said input variables therefrom, and to said controller for providing said control signal thereto.

17. A machine according to claim 16, wherein an electrically actuable moving part of the machine is a robotic arm.

18. A method of determining an operating point of a system that is used to predict a time dependent output value, the method comprising:

receiving two or more time-dependent input variables;

removing constant correlations between said two or more input variables;

applying a non-linear function to said two or more input variables to obtain a series of time dependent instantaneous correlations; and

determining an appropriate operating point on the basis of the obtained instantaneous correlations.

19. A method according to claim 18, wherein said step of removing constant correlations between said two or more input variables comprises generating a linear prediction of one of said input variables from the or each other input variable, and removing the prediction from that one input variable.

20. A method according to claim 18 or 19 and comprising removing slow components from the two or more input variables.

21. A method according to any one of claims 18 to 20 and comprising whitening said input variables after said step of removing constant correlations.

22. A method according to any one of claims 18 to 21, wherein the non-linear function comprises spatial and/or temporal filtering.

23. A method according to claim 22, wherein said spatial and/or temporal filtering comprises using one of:

principal component analysis (PCA);

low-pass filtering;

high-pass filtering;

linear slow feature analysis (lSFA); and

canonical correlation analysis (CCA).

24. A method according to any one of claims 18 to 23 and comprising performing said step of applying a non-linear function to said two or more input variables to obtain a first series of instantaneous correlations, applying dimension reduction to said first series of instantaneous correlations to determine a second series of instantaneous correlations, and applying said step of determining an appropriate operating point to said second set of instantaneous correlations.

25. A method according to any one of claims 18 to 24 and comprising generating projections of the input variables using a set of projection coefficients, and applying the non-linear function to those projections.

26. A method according to claim 25 and comprising employing an iterative estimation procedure to obtain the projection coefficients.

27. A method according to any one of claims 18 to 26, wherein said non-linear function comprises multiplication of the two or more input variables.

28. A method of sorting waste comprising:

sensing properties of a robotic arm configured to pick waste out of a waste stream;

using the method of any one of claims 18 to 27 to determine an operating point of the robotic arm;

using the determined operating point to predict a time dependent output value; and

using the time dependent output value to control the position of and/or forces applied by the robotic arm.

Description:
A Method and Apparatus for Determining Different Operating Points in a System

Field of the Invention

This invention relates to a method and apparatus for determining different operating points in a system. It is applicable in particular, though not necessarily, to robotic machines and to the control of such machines.

Background to the Invention

Many fields of engineering apply mathematical models to generate an appropriate response to given input variables. Examples include model predictive control, where the model needs to capture the behaviour of a system to generate an output control signal from given input variables, and pattern recognition tasks such as speech recognition, where the model needs to capture the structure of acoustic features and their relation to underlying phonemes and words so as to perform a pre-determined task in response to a spoken command.

In such systems it is common for the relations between the input variables to change for given contexts and these are often referred to as the different operating points of the system. For example, the different operating points of a speech recognition system may correspond to different speakers, tones of voice or ambient sound environments. It will be understood that, for each operating point, the relationship between the input and the output of the model may need to be calculated in a different manner (i.e. by a different sub-model). It is therefore important that the changing relations between the input variables are identified by the model so that an appropriate output can be generated.

Typically, human experts are required to specify the potential contexts and their effects so that a model structure can be defined which takes the different operating points into account. Data-driven machine-learning techniques may then be employed to determine any unknown variables of the system. However, this approach can be time-consuming and expensive since it requires human experts to fully define the structure of the problem (at all conceivable operating points) and to design the appropriate modelling techniques required. Moreover, in respect of complex problems, human experts are often unable to express their knowledge in such a way that will allow a suitable model to be readily created. For example, it is a difficult task for a human expert to define groups of similar speakers and to assign a particular speaker to a specific group.

In order to set the context for the present invention, some particular problems are described below.

Example 1 - Water Tanks

Firstly consider a simple task of mixing water from two water tanks. This particular problem can be solved with existing techniques but is described in order to illustrate the problem addressed by the present invention.

In this example, water from Tank 1 has a temperature T_1 and water from Tank 2 has a temperature T_2. There is a constant output flow of 1 (undefined units) and the task is to regulate the output temperature T by controlling a valve which defines the mixing proportion from Tanks 1 and 2. If the output-controlling valve is denoted by u we can assume, for example, that when u = 1, all the water comes from Tank 1 and hence T = T_1, while when u = 0, all the water comes from Tank 2 and hence T = T_2. Assuming a linear relationship, this leads to Equation 1A below for controls 0 ≤ u ≤ 1.

T = u T_1 + (1 - u) T_2     (Equation 1A)

For such a simple problem it would usually be possible to measure T_1 and T_2 directly and thus solve Equation 1A for the optimum control u but, if we assume that these temperatures are unknown (and not measurable), it will only be possible to infer their values from the relationship between u and T (which are both assumed to be measured and known). In statistics, unknown variables (which usually nevertheless affect the system) are called latent variables. Thus, in this particular problem, the latent variables T_1 and T_2 affect the relations between the input variables u and T. In other words, the latent variables define the operating points of the system.

Example 2 - Driving on a Slippery Road

Similar but far more complex problems are encountered in real-world situations. As an example, consider the task of a robotically-controlled car driving on a road which may be slippery. This time, the slipperiness truly is a latent (unknown) variable because it usually cannot be measured directly. It does, however, affect the way the car responds to control outputs such as turning the wheel or pushing the brake or gas pedal.

Finding such latent variables is very difficult using existing machine-learning techniques. However, it can be done in principle because human drivers are able to learn to recognise slipperiness as an important variable and are able to feel it from the way the car responds to the controls. Thus, it is (theoretically) possible to define a model that can produce an appropriate output (e.g. a control signal to turn the wheel or actuate the brake or gas pedal) in response to given input signals - some of which might be measurable (e.g. the resistance in wheel movement) and some of which might initially be unknown (i.e. relating to latent variables such as slipperiness).

Once the latent variables have been identified, it is usually possible to predict their values from auxiliary inputs. For example, in the present case, humans readily use visual cues to assess the slipperiness of the road. However, it would be very difficult to directly use visual cues for machine control and latent variables thus simplify the task considerably. For example, the car could be controlled using a set of control systems, each of which would be designed to operate best for specific slipperiness conditions (e.g. different degrees of slipperiness). In this case, the latent variable characterising the slipperiness would need to be discovered first in order to realise that different controllers are needed for different degrees of slipperiness. Once the latent variable has been found and estimated for different situations, it is possible to try to find the same information from auxiliary sensors such as cameras. For instance, a certain texture on the road might predict a particular degree of slipperiness. It would, however, be a very complex learning task to learn to use the texture directly as an input for the controller. The latent variable thus acts as a useful intermediate target for a learning system.

In both of the examples described above, the latent variables define the relationship between control outputs and observed measurements. In many cases, for example in speech recognition and machine vision, it is important to find relations between observed measurements alone. More specifically, in the speech recognition task, relations between the amplitudes of different spectral components of speech signals could be used as an indicator of the gender or emotional state of the speaker.

Example 3 - Machine Vision

As an example, consider the machine vision task of estimating the 3D shape of a surface based on its texture, as recorded by a camera. Such 3D scene information is important, for example, for a moving robot to avoid obstacles. Texture is yet another instance of a latent variable which affects the relations between input variables. More specifically, the texture defines the spatial relations of elementary image features. For example, the correlation of the intensities of pixels located far apart may be large for a smooth surface but may be small for a rugged (e.g. furry) surface. The characteristic spatial relations between image features may look quite different for the same texture when it is recorded by a camera from different viewing angles or distances. Thus, the 3D shape of the surface can be considered a higher-order latent variable affecting the relations between image features. Accordingly, the problem can be formulated as a two-level model where the latent variables at the first level (textures) act as the inputs whose relations are modelled by the second level latent variables (3D shape of the surface).

Human experts are often able to identify the relevant latent variables in a problem which involves a single level of latent variables which affect the relations between input variables. For example, human experts can easily identify the acoustic features required for separating male and female speakers in the speech recognition task. However, they often perform poorly when the problem involves several levels, such as the example of recognising 3D shape from image texture. Such multi-level problems are nevertheless common in application areas such as robot control, machine vision and speech recognition.

Problems with Existing Techniques

There are many relatively general-purpose machine-learning techniques available for learning latent variable models. The models themselves may employ a mixture of 'experts' (linear or non-linear sub-models) and a gating function which uses auxiliary variables (usually not latent variables) to predict which 'expert' is suitable for modelling the relation at the current operating point. If the 'experts' have learnt to handle different situations then the latent variables can be inferred by monitoring which 'expert' best predicts the observed behaviour. However, as this is only possible after the 'experts' have learnt to handle the different situations it is not particularly effective as a learning technique.

While existing methods can be used to adequately determine latent variables which directly affect individual input variables, they are not reliable in identifying latent variables which affect the correlation between input variables.

Although the above problem has been described in the context of a 'mixture of experts' model, typically any model (e.g. one using conditional random fields) having latent variables describing relations between input variables will suffer from the same problem.

Examples of existing techniques which are often used for learning latent variables include the Pearson product-moment correlation coefficient, slow-feature analysis, canonical correlation analysis (CCA), denoising source separation (DSS), principal component analysis (PCA) and linear regression. Some of these techniques are described in more detail below.

Pearson Product-Moment Correlation Coefficient

The Pearson product-moment correlation coefficient is a measure of the linear dependence between two variables, x and y, giving a value between +1 and -1 inclusive.

Figure 1 outlines the steps performed on the variables, x and y, to obtain the Pearson product-moment correlation coefficient. In the first step 10, the mean value is removed from each variable. The next step 12 consists of normalising each variable to unit variance before the tensor product 14 of the two variables is computed. It will be understood that the tensor product 14 is computed in the case where x and y are vectors. However, in the case where x and y are one-dimensional variables, this step will equate to calculating the product of the two variables. In either case, the result is then averaged 16 to obtain the correlation coefficient.

Notably, the Pearson product-moment correlation coefficient can only be computed for two one-dimensional variables. Furthermore, this method assumes a constant (i.e. static) mean value.
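To make the procedure of Figure 1 concrete, the following minimal numpy sketch implements the four steps for one-dimensional variables (the function name and parameter choices are illustrative, not taken from the patent):

```python
import numpy as np

def pearson_coefficient(x, y):
    """Pearson product-moment correlation of two one-dimensional signals,
    following the four steps of Figure 1."""
    x = x - x.mean()                  # step 10: remove the mean value
    y = y - y.mean()
    x = x / x.std()                   # step 12: normalise to unit variance
    y = y / y.std()
    return float(np.mean(x * y))      # steps 14 and 16: product, then average
```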

As an example, we can consider the example with the water tanks once more. In this case, we will assume that the temperatures T_1 and T_2 do not change with time. We can then obtain a series of measurements of the temperature T(t) and the control input u(t) at time instances t = 1, 2, ..., and we can then rewrite Equation 1A as follows:

ΔT(t) = T(t) - T(t-1) = (T_1 - T_2)(u(t) - u(t-1)) = (T_1 - T_2)Δu(t)     (Equation 1B)

The Pearson correlation coefficient ρ can then be computed for ΔT(t) and Δu(t) (with Δu normalised to unit variance, as in step 12 of Figure 1), and this gives us ρ = (T_1 - T_2) / std(ΔT), where std(ΔT) denotes the standard deviation of ΔT.

Accordingly, the correlation coefficient can be used to identify the linear regression coefficient in Equation 1B, since T_1 - T_2 = std(ΔT)ρ.
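As an illustrative check of Equation 1B (with assumed tank temperatures and simulated valve positions, not data from the patent), the following sketch recovers T_1 - T_2 from the measured increments. The explicit division by std(Δu) reduces to the std(ΔT)ρ expression above when Δu has unit variance:

```python
import numpy as np

rng = np.random.default_rng(0)
T1, T2 = 55.0, 20.0                       # latent tank temperatures (assumed values)
u = rng.uniform(0.0, 1.0, size=1000)      # measured valve position
T = u * T1 + (1.0 - u) * T2               # measured output temperature (Equation 1A)

dT, du = np.diff(T), np.diff(u)           # the increments of Equation 1B
rho = pearson_coefficient(dT, du)         # from the sketch above; here rho = 1
# T1 - T2 = rho * std(dT) / std(du); the std(du) factor disappears when
# du has already been normalised to unit variance, giving std(dT) * rho.
print(rho * dT.std() / du.std())          # ~= 35.0, i.e. T1 - T2
```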

In the machine vision (texture) example described above, we can denote the intensity of an image pixel at location (x, y) by I(x, y). It is then possible to compute the Pearson correlation coefficients between pixels located at distance r from each other: for example, between I(x, y) and I(x - r, y), where the averaging at step 16 is performed with respect to x. The resulting coefficient ρ(r) (as a function of displacement r) could be used as a distinctive feature of textures because different textures would often produce distinct correlation coefficients ρ(r). Thus, those features could be useful for classification of textures.

Slow Feature Analysis (SFA)

Slow feature analysis is a method for extracting slowly varying features from a quickly varying data set. As an example, we can consider the water tank problem described above but, in this case, we will assume that there are several pairs of mixing water tanks with temperatures T_{i1} and T_{i2}, i = 1, 2, ... We can also assume that the temperature in one tank of each pair changes synchronously with temperatures in tanks from other pairs. For example, Tanks 2 in each pair might be connected to the same heater with temperature T_2(t) and the heating efficiency (represented by coefficients w_{i2}) might be different for each tank, yielding T_{i2}(t) ≈ w_{i2} T_2(t). Provided that temperature T_2(t) changes slowly, slow feature analysis applied to ΔT_i and Δu_i will allow us to discover T_2(t) as the slowest feature. The knowledge of T_2(t) can therefore facilitate modelling of the dependency of ΔT_i(t) on Δu_i(t) because the relation becomes linear: ΔT_i(t) = (T_{i1} - w_{i2}T_2(t))Δu_i(t).

Figure 2 outlines the steps performed in slow feature analysis. Thus, the tensor product 20 of the variables, x and y, is determined after the mean has been removed from each set of variables in step 21 but before a whitening step 22 is performed. It will be noted that step 20 includes the multiplication of all of the possible pairs of the elements in vectors x and y. This is, in fact, a simplification of SFA: more generally, the linear terms x and y themselves would also be included.

It will be understood that the term 'whitening' is used throughout to denote so-called whitening transformation which is a de-correlation method to convert a covariance matrix of a set of samples into an identity matrix. Effectively this creates new random variables that are uncorrelated but that can be linearly transformed to reconstruct the original random variables.

After whitening 22, a low-pass filtering step 24 is performed and lastly principal component analysis (PCA) 26 is undertaken on the data. PCA is a known technique of orthogonal linear transformation which transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance lies on the second coordinate, and so on.

The result of this method is therefore a series of values indicating the slowly changing signal that can be found from the nonlinear expansion (with the pair-wise products) of signals x and y. This signal can be used, for example, for modelling relations between x and y as in the example with several pairs of mixing water tanks.
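A minimal sketch of the Figure 2 pipeline might look as follows, using a simple moving average as the (otherwise unspecified) low-pass filter and keeping only the pair-wise products of step 20; all names and parameter values here are illustrative:

```python
import numpy as np

def slow_features(x, y, n_components=1, lpf_width=20):
    """Sketch of the Figure 2 pipeline (simplified to pair-wise products
    only). x and y have shape (T, d); returns the slowest signals."""
    x = x - x.mean(axis=0)                        # step 21: remove the means
    y = y - y.mean(axis=0)
    z = np.einsum('ti,tj->tij', x, y).reshape(len(x), -1)   # step 20: products

    z = z - z.mean(axis=0)                        # step 22: whitening
    w, V = np.linalg.eigh(z.T @ z / len(z))
    keep = w > 1e-9
    z = z @ V[:, keep] / np.sqrt(w[keep])

    kernel = np.ones(lpf_width) / lpf_width       # step 24: low-pass filter
    z = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode='same'), 0, z)

    z = z - z.mean(axis=0)                        # step 26: PCA; the leading
    w, V = np.linalg.eigh(z.T @ z / len(z))       # components carry the slow,
    order = np.argsort(w)[::-1]                   # high-energy signals
    return z @ V[:, order[:n_components]]
```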

Linear slow feature analysis (lSFA) is a variant of this method whereby the multiplication step 20 is omitted. This can be used when the slowest features can be observed in x and y directly and the nonlinear expansion with the pair-wise products does not bring new information.

Canonical Correlation Analysis (CCA)

Canonical correlation analysis is another technique and is generally performed using eigen-decompositions. However, it may also be implemented as an iterative procedure, as shown in Figure 3. Thus, the mean value is removed 30 from each of the variables, x and y, before a whitening step 32 is performed. An iterative procedure 34 is then undertaken whereby vectors a(t) and b(t) are respectively computed 36 in accordance with Equations 2A and 2B below, where W_x and W_y are random projection vectors that are re-estimated 38 from the values of b(t) and a(t), respectively.

a(t) = W_x x(t)     (Equation 2A)

b(t) = W_y y(t)     (Equation 2B)

The iterative procedure 34 is repeated until a(t) and b(t) converge.

As per the Pearson product-moment correlation coefficient, CCA assumes a constant (i.e. static) mean value. In addition, the correlation coefficient in CCA is assumed to be constant.

A typical use for CCA is to search for what is common between two sets of variables x and y. For example, in climatology, one can use global measurements of sea surface temperatures and sea level pressures to find factors which can be observed in the variability of both variables. Such common factors would be explained by global climate phenomena such as the annual cycle and the El Nino-Southern Oscillation. In CCA, a(t) and b(t) would represent the common factors such that a(t) is a linear projection from x, b(t) is a linear projection from y, and a(t) ≈ b(t). In the climatology example, a(t) and b(t) could be a sine wave showing transitions from summers to winters. In CCA, W_x (and W_y) are vectors of weights which tell how the observations x (and y) should be combined to get the common factor a(t) (respectively b(t)). For example, El Nino is known to be most prominent in the tropical regions and therefore the weights would be larger over those areas. The convergence of a(t) and b(t) tells us that the two signals can no longer be made closer to each other by changing the projection coefficients W_x and W_y. This means that the Pearson correlation coefficient between a(t) and b(t) (which is the objective function optimised in CCA) cannot be improved further. The outputs a(t), b(t) represent some important phenomena underlying the data, so they can be used as such to understand the variability in x and y. They could also be used for building a model which relates common factors in the two sets of variables.
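A sketch of this alternating procedure is given below. It assumes that the regression in step 38 reduces to a cross-correlation because the data have been whitened; the function names and the normalisation step are illustrative:

```python
import numpy as np

def whiten(x):
    """Remove the mean and map to unit covariance (steps 30 and 32)."""
    x = x - x.mean(axis=0)
    w, V = np.linalg.eigh(x.T @ x / len(x))
    return x @ V / np.sqrt(w)

def iterative_cca(x, y, n_iter=100, seed=0):
    """One canonical pair found by the alternating loop of Figure 3."""
    rng = np.random.default_rng(seed)
    x, y = whiten(x), whiten(y)
    Wx = rng.standard_normal(x.shape[1])          # random initial projections
    Wy = rng.standard_normal(y.shape[1])
    for _ in range(n_iter):                       # iterative procedure 34
        a = x @ Wx                                # Equation 2A
        b = y @ Wy                                # Equation 2B
        # step 38: regress each weight vector on the other block's signal;
        # with whitened data this reduces to a cross-correlation
        Wx = x.T @ b / len(x)
        Wy = y.T @ a / len(y)
        Wx /= np.linalg.norm(Wx)
        Wy /= np.linalg.norm(Wy)
    return x @ Wx, y @ Wy, Wx, Wy
```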

Denoising Source Separation (DSS)

Denoising source separation is a general framework which can be tuned to find signals with some pre-defined properties from given data. DSS is implemented as the iterative procedure shown in Figure 4. Thus, DSS involves the steps of whitening 100 a data set, x, computing a projected signal a(t) 102 in accordance with Equation 2A above, denoising 104 the projected signal such that â(t) = f(a(t)), re-estimating the projection coefficient W_x in step 106, and then repeating the above from step 102.

The key step in DSS is the denoising procedure 104 applied to the current estimates of the projected signals a(t). The specific choice of the denoising function 104 depends on what types of signals need to be found in the data. For example, one can consider the CCA implementation shown in Figure 3 as a specific case of DSS. In that example, the denoising function 104 would correspond to obtaining the denoised signals â(t) as signals projected from another dataset, y, that is â(t) = b(t) = W_y y(t). It is also possible to use the sum â(t) = a(t) + b(t) to obtain steadier convergence. Similarly, extraction of the signal b(t) from y(t) also follows the DSS structure. Thus, the procedure in Figure 3 implements CCA by connecting two blocks of DSS (one for each input vector, x and y) and defining step 104 of Figure 4 to be "denoising the signal from the other block".

Other possibilities for the denoising function 104 include (but are not limited to) low-pass filtering (corresponding to linear SFA) or applying a nonlinearity â(t) = a³(t) (corresponding to independent component analysis).

In the DSS methodology, the steps in Figure 4 are tuned to extract signals of a desired type by selecting a suitable denoising function in step 104. The choice of the denoising function 104 implicitly determines what kinds of signals to look for in the data. For example, in the CCA example in Figure 3, we are interested in finding signals in x which would be similar to signals b(t) projected from the other dataset y. Therefore the denoising function 104 makes the denoised signal â(t) similar to that projected signal b(t) = W_y y(t).
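The pluggable nature of the denoising step can be sketched as follows. The two example denoisers correspond to the linear-SFA and ICA cases mentioned above; the concrete filter width and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def dss(x, denoise, n_iter=100, seed=0):
    """Generic DSS loop of Figure 4; the kind of signal extracted is set
    entirely by the choice of the `denoise` function."""
    rng = np.random.default_rng(seed)
    x = x - x.mean(axis=0)                  # step 100: whiten the data
    w, V = np.linalg.eigh(x.T @ x / len(x))
    x = x @ V / np.sqrt(w)
    W = rng.standard_normal(x.shape[1])
    for _ in range(n_iter):
        a = x @ W                           # step 102: project (Equation 2A)
        a_hat = denoise(a)                  # step 104: denoise the estimate
        W = x.T @ a_hat / len(x)            # step 106: re-estimate W_x
        W /= np.linalg.norm(W)
    return x @ W, W

# The denoiser selects the method: a low-pass filter yields linear SFA,
# a cubic nonlinearity yields ICA-style source extraction.
lowpass = lambda a: np.convolve(a, np.ones(20) / 20, mode='same')
cubic = lambda a: a ** 3
```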

Thus, it can be determined from the above that DSS relies on a procedural definition of what should be learned. This is in contrast to many current machine-learning methods which rely on a cost function which should be minimised by any suitable means. For example, CCA is traditionally derived as a procedure maximising the Pearson correlation coefficient between the two projected signals a(t) = W_x x(t) and b(t) = W_y y(t).

However, as explained above, none of these techniques are reliable in identifying latent variables which affect the correlation between input variables. The primary reason for this is that they tend to rely on random mappings and estimation procedures that easily get stuck in local minima rather than converging to a useful solution.

It is therefore an aim of the present invention to provide a machine-learning system and a method for determining different operating points in such a system.

Summary of the Invention

According to a first aspect of the present invention there is provided an apparatus for determining an operating point of a system that is used to predict a time dependent output value. The apparatus comprises an input for receiving two or more time-dependent input variables, and an initialiser coupled to said input for receiving said input variables therefrom. The initialiser comprises a pre-processor for removing constant correlations between said two or more input variables, and a processor configured to apply a non-linear function to said two or more input variables to obtain a series of time dependent instantaneous correlations. The apparatus further comprises an operating point determination unit for determining an appropriate operating point on the basis of the obtained instantaneous correlations. The time dependent instantaneous correlations may be considered as latent variables of the system. They may have a real world meaning, e.g. slipperiness, or may be abstract mathematical values.

The pre-processing of the input variables ensures that any correlation which is always present in the data is discarded. This is because we are interested in obtaining information concerning the changing relationships between the input variables since this is what will provide us with information about the different operating points of the system. In other words, the system can learn abstract latent variables which do not directly indicate which lower-level features are active but which can indicate the kind of relations the lower-level features have. Thus, embodiments of the present invention can be used to determine values for latent variables (i.e. the instantaneous correlations defined above) which describe the mode (or state) of relations between the input variables (or input-output relations) of a system. The values of the latent variables can be used, for example, for detecting the operating points of the system or for estimating a complex mapping between pairs of input/output variables. Accordingly, the present invention can facilitate simpler or more efficient modelling of complex systems.

Depending on the problem being modelled, the values of the latent variables may be either fixed or optimised during a learning phase by using the latent variables as inputs to the model and solving the problem to refine the fixed parameters of the system. Implementations of the present invention therefore overcome the problem with existing techniques whereby latent variables cannot be estimated before the fixed parameters of the system are known and vice versa.

Some advantages of implementations of the present system are that it provides an efficient means for learning a model structure in a data-driven fashion, since it reduces the need for expensive and time-consuming stages in which human experts are required to define the structure of the problem and design a suitable model to describe it. Implementations of the present invention can also be used to solve problems which human experts are unable to describe sufficiently well to build a suitable model.

The approach described here is useful in machine-learning applications where the correlation coefficients are assumed to change with time and to contain prominent structure such as, for example, slowness, synchrony of changes in correlations between different pairs of inputs, or synchrony of changes in correlations with some linear combination of the inputs. For example, in solving the water-tank example described above, slowness can be assumed and therefore the instantaneous temperature at any time t can be inferred from T_i(t-1) ≈ T_i(t).

In the approach described, each different instantaneous correlation is fed into the model. However, the operating points for different time instances may be found by integrating information from all of the instantaneous correlations collected from the data. Thus, embodiments of the invention may include an integrator for integrating the series of instantaneous correlations.

Notably, the present invention can be employed with multi-dimensional input variables.

Embodiments of the invention provide a system in which the response from the modelling module provides a prediction of a physical variable. The predicted physical variable may be used to simulate a human response (i.e. vision) and/or to facilitate an action.

The pre-processor may be configured to remove the linear prediction of the two or more input variables. References to 'remove the linear prediction' will be understood by persons skilled in the art to mean the removal of data satisfying Equation 3 below for the simple linear regression from x to y (or vice versa), where the matrix A and vector m are found using the method of least-squares.

y(t) ≈ A x(t) + m     (Equation 3)

The step of removing the linear prediction is employed because we are not interested in linear dependencies or constant correlations.
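A least-squares sketch of this pre-processing step is given below, estimating A and m of Equation 3 jointly and returning the residual (an illustrative batch implementation, not code from the patent):

```python
import numpy as np

def remove_linear_prediction(x, y):
    """Return y minus its least-squares linear prediction from x
    (Equation 3): the residual carries no constant linear correlation
    with x. x: (T, dx), y: (T, dy)."""
    X = np.hstack([x, np.ones((len(x), 1))])      # bias column carries m
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # stacks A and m
    return y - X @ coef
```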

The pre-processor may also be configured to remove slow components. It will be understood that references to 'remove slow components' can be performed in a variety of ways that would readily be understood by a person skilled in the art. For example, this procedure could be implemented by temporal high-pass filtering or by removing the Kalman filter predictions from the data. In practice this step can be used to remove a slowly changing mean value.

In one embodiment of the invention, the initialiser is configured to apply the non-linear function directly to the input variables (after pre-processing). This embodiment will be suitable for use when the relations of interest are seen directly from the input variables.

In another embodiment of the invention, the initialiser is configured to apply the nonlinear function to projections of the input variables (after pre-processing). The projections may be linear combinations of the input variables. In this case, suitable projection coefficients may not be known in advance and so an iterative estimation procedure may be employed to obtain the projection coefficients. This embodiment will be suitable for use when the relations of interest are found from the projections of the input variables.

The initialiser may further comprise a sampler for capturing said two or more input variables over a pre-determined sample range.

The response from the modelling module may be in the form of a control signal. For example, the model may be configured to instruct a machine or robot to move and/or carry out a pre-determined action (e.g. operate a control valve).

Additionally, or alternatively, the machine-learning system may be configured for sound and/or vision recognition, for example, to recognise speech and/or faces.

According to a second aspect of the present invention there is provided a machine for performing a physical task comprising:

one or more electrically actuable moving parts;

one or more sensors for sensing properties of the moving part(s);

a controller for controlling the movement of said moving part(s); and

an apparatus according to claim 15 coupled to said sensor(s) to receive said input variables therefrom, and to said controller for providing said control signal thereto. An electrically actuable moving part of the machine may be a robotic arm.

According to a third aspect of the present invention there is provided a method of determining an operating point of a system that is used to predict a time dependent output value, the method comprising:

receiving two or more time-dependent input variables;

removing constant correlations between said two or more input variables;

applying a non-linear function to said two or more input variables to obtain a series of time dependent instantaneous correlations; and

determining an appropriate operating point on the basis of the obtained instantaneous correlations.

It will be understood that the method of the present invention can be applied to any data but whether the result is meaningful will depend on whether the appropriate structure is present in the data.

A pre-processing step may comprise removing the linear prediction of the two or more input variables.

A pre-processing step may comprise removing slow components from the two or more input variables.

A pre-processing step may comprise removing the mean value of the input variables with respect to time. This is because we are particularly interested in changing (rather than static) correlations.

A pre-processing step may comprise whitening. Whitening may be performed by removing the mean, performing principal component analysis (PCA) and normalising the principal components to unit variance.

A pre-processing step may comprise applying a saturating function to the input variables. Alternatively, the pre-processing step may comprise removing outlying values of the input variables since otherwise these may skew the results. Outlying values may be detected by looking at the deviation of each input variable compared to its standard deviation. Depending on the nature of the problem, values having a deviation of approximately 3 sigma or more could be considered outlying and therefore should be removed.
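A possible 3-sigma screening step is sketched below, on the understanding that the threshold is problem-dependent, as noted above:

```python
import numpy as np

def remove_outliers(x, n_sigma=3.0):
    """Drop samples deviating by roughly 3 sigma or more in any input
    dimension; the threshold is problem-dependent. x: (T, d)."""
    dev = np.abs(x - x.mean(axis=0)) / x.std(axis=0)
    return x[(dev < n_sigma).all(axis=1)]
```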

A pre-processing step may comprise dimension reduction (for example by the application of dimension reduction projections). This is particularly useful when the input dimensions are high since the number of possible instantaneous correlations (i.e. distinct pairs of inputs) grows relatively fast, for example, quadratically with dimension.

The non-linear function may have a Taylor-series expansion containing cross terms (e.g. x^n y^m). In other words, the non-linear function f(x, y) may be such that it cannot be expressed as f(x, y) = f_1(x) + f_2(y).

When the input variables are binary, the non-linear function may comprise, for example, XOR, AND or OR. In certain embodiments, any function except for the six out of sixteen with the following truth tables may be employed (each table gives the outputs for x = 0 on the top row and x = 1 on the bottom row, with y = 0, 1 across the columns):

00/00   00/11   11/00   11/11   01/01   10/10

These are the six truth tables that can be expressed as f(x, y) = f_1(x) + f_2(y). The second truth table represents x + 0*y, the third represents (1 - x) + 0*y, the fifth represents 0*x + y, and so on.

In certain embodiments, the non-linear function may comprise multiplication of the two or more input variables. Alternatively, the non-linear function may comprise multiplication of projected signals obtained by multiplying the input variables by projection vectors. The projection vectors may be obtained by an iterative estimation procedure.

A saturating function may be applied after the multiplication step. The non-linear function may further comprise spatial integration and/or temporal filtering to extract useful information about the current operating point from the noise in the signal. More specifically, the non-linear function may comprise one or more of the following: Kalman filtering, low-pass filtering, high-pass filtering, principal component analysis (PCA), linear slow feature analysis (lSFA), and canonical correlation analysis (CCA).

Since, for example, PCA produces correlations of its input variables, applying PCA to the product of the input variables (i.e. after multiplication of the input variables) means that the procedure effectively searches for correlations between correlations. This procedure can therefore be used to find higher-order latent variables, as mentioned above.

Projection coefficients may be applied before or after the instantaneous correlations are obtained.

In certain embodiments, where the non-linear function comprises multiplication of projected signals obtained by multiplying the input variables by projection vectors, the projection vectors are obtained by simple linear regression plus projection of the regression matrix onto a set of orthogonal matrices (i.e. orthogonalisation). In a specific embodiment, the following steps are carried out after pre-processing of the input variables, x and y:

i) projected signals are obtained from a(t) = W_x x(t) and b(t) = W_y y(t);

ii) the element-wise product of a(t) and b(t) is obtained (e.g. if several pairs are estimated simultaneously in a symmetric approach, then the products are computed for each pair of elements in a and b);

iii) the covariance of a(t) and b(t) is estimated from c(t) = LPF(a(t)b(t)), where LPF stands for Low-Pass Filter;

iv) the variances of a and b are estimated, respectively, from v_a(t) = LPF(a²(t)) and v_b(t) = LPF(b²(t));

v) the correlation coefficient is estimated from ρ(t) = c(t) / √(v_a(t)v_b(t));

vi) a correction is applied such that â(t) = ρ(t)b(t) and b̂(t) = ρ(t)a(t);

vii) W_x and W_y are re-estimated by one of the approaches outlined below;

viii) the above steps are repeated using the re-estimated values until a(t) and b(t) converge.

In one embodiment of the above (referred to as the deflation approach), only one pair of signals a(t) and b(t) is estimated at a time and the whole iterative procedure is run many times in order to find several pairs of signals. In another embodiment of the above (referred to as the symmetric approach), several pairs of signals a(t) and b(t) are estimated simultaneously. Both the deflation approach and the symmetric approach may be used in source separation methods.

More specifically, in the deflation approach, the i-th pair of signals a_i(t) and b_i(t) is estimated using the projection vector W_{x,i}. The regression is performed from a_i(t) to x, which simplifies to W_{x,i} = Σ_t x(t)^T a_i(t) because x is pre-whitened. The orthonormalisation of W_{x,i} can be done, for example, using the Gram-Schmidt orthonormalisation procedure, whereby W_{x,i} is made orthogonal to the previously found vectors W_{x,j}, j = 1, ..., i-1. The same procedure can then also be applied to W_{y,i}.

In the symmetric approach, both W_x and W_y are matrices in which each row corresponds to one signal a_i(t) or b_i(t), respectively. Then, the operations specified above (such as the product, low-pass filtering and so on) are applied to each pair of signals a_i(t) and b_i(t) separately. Regression is done by W_x = Σ_t a(t)x(t)^T, where a(t) is a vector with elements a_i(t). Orthonormalisation of W_x can be performed by W_x = (W_x W_x^T)^(-1/2) W_x. The same procedure can then also be applied to W_y.
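Putting steps i) to viii) together with the symmetric re-estimation, one possible numpy sketch is given below. It assumes x and y are already pre-processed and whitened, uses a moving average for the LPF, and re-estimates the projections from the corrected signals â(t) and b̂(t); these concrete choices (filter width, iteration count, use of the corrected signals in the regression) are illustrative rather than prescribed by the patent:

```python
import numpy as np

def lpf(z, width=20):
    """Moving-average low-pass filter along the time axis."""
    k = np.ones(width) / width
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, z)

def sym_orth(W):
    """Symmetric orthonormalisation: W <- (W W^T)^(-1/2) W."""
    w, V = np.linalg.eigh(W @ W.T)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T @ W

def instantaneous_correlations(x, y, n_pairs=3, n_iter=50, seed=0):
    """Steps i)-viii) with symmetric re-estimation. x and y are assumed
    to be pre-processed and whitened already; shapes (T, dx), (T, dy)."""
    rng = np.random.default_rng(seed)
    Wx = sym_orth(rng.standard_normal((n_pairs, x.shape[1])))
    Wy = sym_orth(rng.standard_normal((n_pairs, y.shape[1])))
    for _ in range(n_iter):
        a, b = x @ Wx.T, y @ Wy.T            # i)   projected signals a_i, b_i
        c = lpf(a * b)                       # ii)-iii) products -> covariance
        va, vb = lpf(a ** 2), lpf(b ** 2)    # iv)  local variances
        rho = c / (np.sqrt(va * vb) + 1e-12) # v)   instantaneous correlation
        a_hat, b_hat = rho * b, rho * a      # vi)  correction
        Wx = sym_orth(a_hat.T @ x / len(x))  # vii) regression (x, y whitened)
        Wy = sym_orth(b_hat.T @ y / len(y))  #      plus orthonormalisation
    return rho, a, b
```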

It will be noted that the correlation coefficient is estimated as shown above when we are interested in the correlated projections of a(t) and b(t). However, if we are interested in finding a mapping from a(t) to b(t), the following regression coefficient is calculated in step v) instead:

β(t) = c(t) / v_a(t)

It will also be clear from the above that the correlation coefficient is variable as opposed to constant.

In certain embodiments, the method may further comprise the step of fine-tuning. Fine-tuning may be achieved by using the estimated latent variables as inputs to a model having unknown latent variables and parameters, and solving the model to obtain the parameters and optimise the values of the latent variables. For example, fine-tuning may comprise the estimation of a mapping of a latent variable x to a latent variable y. The estimated latent variables could be used as indicators for different operating points and a different mapping estimated for each operating point. The estimated values for the latent variables may be fixed or updated in the fine-tuning step.

Fine-tuning may be implemented using, for example, known gradient-based techniques such as maximum-a-posteriori (MAP) or maximum likelihood estimation by gradient descent (or any other optimisation technique); Bayesian inference by Markov-chain Monte Carlo sampling; or Bayesian inference by variational techniques; or by solving each of the parameter matrices assuming the others are fixed and then iterating the results to convergence.

The method may further comprise a post-processing step. The post-processing step may comprise normalisation or orthogonalisation. Normalisation may comprise division by a local estimate of the standard deviation (i.e. the square root of the variance), for example, if the correlation is symmetrical. Alternatively, if the correlation is asymmetrical, normalisation may comprise division by the variance of the first variable (x).

Post-processing may comprise dimension reduction.

It will be understood that the number of data points required will depend on the signal-to-noise ratio. A few hundred values of instantaneous correlations (or even fewer if the data is very well-behaved) may be enough. Typically, a large number of instantaneous correlations will be used; for example, standard computers are able to handle tens of billions of instantaneous correlations in a matter of seconds.

It will be noted that with a pair of input signals, only one instantaneous correlation can be produced. However, with more input signals, more instantaneous correlations can be produced. Thus, several latent variables can be extracted using standard dimension reduction methods such as PCA or SFA.

According to a fourth aspect of the present invention there is provided a method of sorting waste comprising:

sensing properties of a robotic arm configured to pick waste out of a waste stream;

using the method of any one of claims 18 to 27 to determine an operating point of the robotic arm;

using the determined operating point to predict a time dependent output value; and

using the time dependent output value to control the position of and/or forces applied by the robotic arm.

Brief Description of the Drawings

Figure 1 illustrates a mechanism for producing a Pearson product-moment correlation coefficient;

Figure 2 outlines steps performed in a slow feature analysis;

Figure 3 illustrates a canonical correlation analysis that is generally performed using Eigen-decompositions;

Figure 4 illustrates a Denoising source separation procedure;

Figure 5 illustrates the processing steps in a first embodiment of the present invention, whereby the non-linear function operates directly on the inputs x and y;

Figure 6 illustrates the processing steps in a second embodiment of the present invention, whereby the non-linear function operates on projections of the inputs x and y;

Figure 7 shows a graph illustrating the efficiency of using the present invention to initialise the value of a latent variable when compared to using a random initial value as with current techniques;

Figure 8 schematically illustrates a machine learning system according to an embodiment of the present invention;

Figure 9 schematically illustrates an initialiser according to an embodiment of the present invention; and

Figure 10 schematically illustrates the processing steps in a method of determining changes in relations between input variables in a machine-learning system according to an embodiment of the present invention.

Detailed Description of Certain Embodiments

With reference to Figure 5, there is illustrated a method 40 of determining changes in relations between input variables, x and y, in a machine-learning system, according to a first embodiment. Thus, it can be seen that pre-processing is applied to the input variables x and y. This includes the steps of removing slow components 42, removing the linear prediction 44 from x to y, and then whitening the input variables 46. The whitening step 46 comprises removing the mean, performing principal component analysis (PCA) and normalising the principal components to unit variance.

Following the pre-processing described above, the input variables x and y are multiplied in step 48. Spatial integration and/or temporal filtering are then employed in step 50 so as to extract (i.e. separate) useful information about the current operating point from the noise in the signal. As illustrated, this may involve one or more of the following procedures: principal component analysis (PCA), low-pass filtering, linear slow feature analysis (lSFA) or canonical correlation analysis (CCA). In the case of CCA, it will be noted that an external input z is required. This can be any other data set related to the studied system. For example, it could be the slow components removed from x and y in step 42.

Finally, the instantaneous correlations (i.e. the latent variables) obtained from the above procedures are fine-tuned in step 52. Ideally, the latent variables will indicate the different operating points of the system such that different mappings, from x to y, can be selected or estimated for each operating point.

With reference to Figure 6, there is illustrated a method 60 of determining changes in relations between projections of the input variables, x and y, in a machine-learning system, according to a second embodiment. It will be noted that the pre-processing steps described above in relation to Figure 5 are also employed in this embodiment. Thus, the steps of removing slow components 62, removing the linear prediction 64 from x to y, and then whitening the input variables 66 are all performed as above. However, in this case, an iterative procedure 68 is next employed in which multiplication 70 is applied to projected signals obtained by multiplying the input variables by projection vectors in step 72. The projection vectors themselves are obtained by simple linear regression and orthogonalisation. More specifically, the following steps are carried out after pre-processing of the input variables, x and y:

1) projected signals are obtained from a(t) = W_x x(t) and b(t) = W_y y(t) (step 72);

2) the element-wise product of a(t) and b(t) is obtained (step 70);

3) the covariance of a(t) and b(t) is estimated from c(t) = LPF(a(t)b(t)) (step 76), where LPF stands for Low-Pass Filter;

4) the variances of a and b are estimated, respectively, from v_a(t) = LPF(a²(t)) and v_b(t) = LPF(b²(t)) (step 74);

5) the correlation coefficient is estimated from ρ(t) = c(t) / √(v_a(t)v_b(t));

6) a correction is applied such that â(t) = ρ(t)b(t) and b̂(t) = ρ(t)a(t) (step 80);

7) W_x and W_y are re-estimated using either the deflation approach or the symmetric approach (step 82);

8) the above steps are repeated using the re-estimated values until a(t) and b(t) each converge.

Finally, the latent variables a(t), b(t) and p(t) obtained from the above procedures are fine-tuned in step 84. As above, it is hoped that the latent variables will indicate the different operating points of the system such that different mappings, from x to y, can be selected or estimated for each operating point.

It will be noted that the signals ρ(t) can take any values in the range [-1, 1] and they can be selected as indicators of a changing operating point. Thus, ρ(t) can be used in the fine-tuning stage as either fixed or updated values. The signals a(t), b(t) represent the features (found in the two data sets) whose changing correlations are most useful for distinguishing different operating points. The exact meaning of a(t), b(t) and ρ(t) depends on the application, some examples of which are described below.

Application Example 1 - Motion Analysis for Images

The method described above in relation to Figure 5 may be employed in a machine- learning system configured for motion analysis of images.

In this example, the input data consists of a ten-by-ten grey-scale image patch which moves across a larger grey-scale image. In other words, consecutive image patches are shifted by one pixel, and the direction changes from time to time. In this example, 5000 samples were extracted for learning (i.e. the "training data" referred to below) and 1000 samples were used for testing. Input x(t) was the collection of image patches at time instances t = 1, 2, ..., T-1, while input y(t) was the collection of image patches at the subsequent time instances t = 2, 3, ..., T. Synchrony of changes in relations between the different inputs was assumed in this example.

Step 1 - Pre-Processing

The 100-dimensional vectors x(t) and y(t) were whitened by removing their mean value and applying a 100-by-100 rotation matrix such that the resulting vectors had zero mean and unit covariance. Next, y(t) was set to be x(t+1) minus a linear prediction from x(t). This step effectively removes constant correlations between the two input variables x(t) and y(t). Step 1 may also include removing slow components from the input variables, as illustrated in Figures 5 and 6.

Step 2 - Multiplication and Normalization

Products x_i(t)y_j(t) were computed to yield 10,000 signals, which were then divided by ‖x(t)‖ + 0.1. The constant 0.1 was chosen so as to avoid magnifying the signal too much if the norm of x(t) happens to be small (as is the case when the image patch happens to be sampled from a smooth region of the image). The resulting signals z_ij(t) are the instantaneous correlations of the input signals x(t) and y(t) and they can be used as the initial values for the latent variables s(t). Since the number of signals in this example is relatively large (10,000), dimension reduction can be employed to reduce the number of latent variables to a more manageable number, as is done in the next step.

Step 3 - Spatial Integration / Temporal Filtering

Three signals were extracted by applying PCA to z_ij(t). In this example, three signals were chosen because the data was generated with four possible movement directions for the image patch. In other embodiments, it would be necessary to choose a suitable number of signals for the application in question, or to try several numbers of signals and select the most appropriate number based on validation tests. Since the dimensionality of the input was relatively large (10,000), a divide-and-conquer approach was taken and PCA was performed in two phases. In the first phase, PCA was applied, for each j, to the 100-dimensional input obtained by collecting z_ij(t) over all i. In other words, one 10,000-dimensional PCA was replaced by 100 100-dimensional PCAs. Three signals were extracted by each of the 100 PCAs, resulting in 300 signals. In the second phase, PCA was applied to these 300 signals, resulting in three signals s(t) to be used as the initial values for the latent variables.
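The divide-and-conquer PCA of this step can be sketched as follows; a plain eigendecomposition stands in for whatever PCA routine is actually used, and the block layout of z_ij(t) is an assumption made for illustration:

```python
import numpy as np

def pca_scores(z, k):
    """Scores of the k leading principal components of z, shape (T, d)."""
    z = z - z.mean(axis=0)
    w, V = np.linalg.eigh(z.T @ z / len(z))
    return z @ V[:, np.argsort(w)[::-1][:k]]

def two_phase_pca(z, d=100, k=3):
    """Divide-and-conquer PCA of step 3: z holds the products z_ij(t) as
    an array of shape (T, d*d). Phase one runs one d-dimensional PCA per
    index j (100 PCAs of 100 dimensions), phase two pools the results."""
    blocks = z.reshape(len(z), d, d)                 # axes: (t, i, j)
    phase1 = np.concatenate(
        [pca_scores(blocks[:, :, j], k) for j in range(d)], axis=1)  # 300 signals
    return pca_scores(phase1, k)                     # three signals s(t)
```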

Step 4 - Fine-Tuning

A fine-tuning step can be used to further correct s(t) so that more accurate predictions can be made using the resulting model. The signals s(t) produced in step 3 are used for initializing the fine-tuning. The modeling assumption is that the latent variables s(t) explain the linear regressors A(t):

A(t) ~ B s(t)

and that the latent variables s(t) change slowly. This is natural because the operating points can be assumed to be somewhat stable and only change slowly in time. The slow changing of the operating point is modeled with a temporal model

s(t) ~ D s(t-1)

which implements the assumption that s(t) should change slowly. Now we have a model:

y(t) = A(t) x(t) + noise

A(t) = B s(t) + noise

s(t) = D s(t-1) + noise

where x(t), y(t) are given, and an initial value for s(t) is known from step 3. Matrices B and D are unknown parameters. During fine-tuning, B, D and s(t) are re-estimated from the training data to make the model fit the data. Numerous methods for fitting the model to given training data are known from the literature; for example, gradient descent on a MAP-based cost function can be used, with the fitted model validated by predicting the samples of the test set.
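By way of illustration, one such MAP-based squared-error cost can be written out as follows in Python; the function name, the fixed noise variances v1 to v3 and the array shapes are assumptions made for concreteness:

import numpy as np

def map_cost(A, s, B, D, x, y, v1=1.0, v2=1.0, v3=1.0):
    # A: T x dy x dx time-variant regressors, s: T x k latents,
    # B: (dy*dx) x k, D: k x k, x: T x dx, y: T x dy.
    T = x.shape[0]
    pred = np.einsum('tij,tj->ti', A, x)
    a = A.reshape(T, -1)                              # a(t) = vec(A(t))
    cost = np.sum((y - pred) ** 2) / v1               # y(t) = A(t) x(t) + noise
    cost += np.sum((a - s @ B.T) ** 2) / v2           # A(t) = B s(t) + noise
    cost += np.sum((s[1:] - s[:-1] @ D.T) ** 2) / v3  # s(t) = D s(t-1) + noise
    return cost

Gradient descent on this cost with respect to A, s, B and D then implements the fine-tuning; initialising s(t) with the step 3 signals is what avoids the initial plateau discussed in connection with Figure 7 below.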

Matrix A represents how (on average) the current image patch is changed to yield the image patch at the next time instant. For example, when the image patch is shifted by one pixel upwards, matrix A would contain ones in the elements corresponding to shifting pixels and zeros elsewhere (one per row and one per column, with some exceptions at the image edges: in this case, the rows of A corresponding to output pixels whose source lies outside the patch, and the columns corresponding to input pixels whose destination lies outside the patch, contain only zeros). Since the movement directions are assumed to be changing, A is modeled to be dependent on time. The noise term accounts for modeling inaccuracy and noisy measurements. The noise in this example was assumed to be Gaussian with constant (but unknown) variance.
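As a hypothetical illustration of this structure, the sketch below constructs such a matrix for an upward one-pixel shift of a flattened (row-major) h x w patch:

import numpy as np

def shift_up_matrix(h=10, w=10):
    # Output pixel (r, c) takes its value from input pixel (r+1, c);
    # rows of A for the bottom-most output row (no source pixel) and
    # columns for the top-most input row (no destination) stay zero.
    n = h * w
    A = np.zeros((n, n))
    for r in range(h - 1):
        for c in range(w):
            A[r * w + c, (r + 1) * w + c] = 1.0
    return A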

The fine-tuned model with the calculated B and D can then be used to predict new data y(t) from known x(t) by first calculating the next operating point, then calculating the time-variant matrix A(t) for that operating point (in this example as a linear mapping from the underlying latent variables s(t)), and then using that A(t) as a model to predict the new y(t). The procedure for this is:

1. Data x(1) is received. At this point y(1) cannot be predicted without an initial value for the latent variables s(1).

2. An initial value for s(1) is calculated. If the received data is a continuation of the training data, the previous value s(0) of the latent variables is known and s(1) = D s(0). Otherwise, some initial guess for s(1) is made.

3. Prediction y(1) = A( B s(1) ) x(1) is calculated.

4. Actual y(1) is received (e.g. measured).

5. The previously calculated s(1) can now be updated because x(1) and y(1) are known. This can be done by finding the s(1) which minimizes the prediction error when calculating y(1) = A( B s(1) ) x(1).

6. Data x(2) is received.

7. s(2) is calculated as D s(1).

8. y(2) is predicted as y(2) = A( B s(2) ) x(2).

9. Actual y(2) is received (e.g. measured).

10. s(2) is then re-estimated as was done with s(1) in step 5. Each time new data is received, the process repeats from step 7; a minimal sketch of this loop is given below.
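The loop might be sketched in Python as follows; the stream of (x(t), y(t)) pairs and the fit_s re-estimation routine (which minimises the prediction error of step 5) are assumptions standing in for the data interface, which the method leaves open:

def online_predict(stream, B, D, s_prev, fit_s):
    for x_t, y_t in stream:              # steps 1/4 and 6/9
        s_t = D @ s_prev                 # steps 2 and 7: s(t) = D s(t-1)
        dy, dx = y_t.shape[0], x_t.shape[0]
        A_t = (B @ s_t).reshape(dy, dx)  # A(t) from the latent variables
        yield A_t @ x_t                  # steps 3 and 8: predict y(t)
        s_prev = fit_s(x_t, y_t, s_t)    # steps 5 and 10: update s(t)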

This problem was simple enough that even a random initialisation of the latent variables allowed the gradient descent to converge to the same result. However, as shown in Figure 7, the random initialisation 90 suffered from an initial plateau 91 of the cost function (i.e. it took many iterations before the optimization procedure escaped the chicken-and-egg problem whereby the latent variables cannot be estimated before the fixed parameters of the system are known, and vice versa). Initialisation with the signals 92 obtained using the above method avoided the initial plateau, which shows that this embodiment was able to provide an initial solution close to the final optimum (in other words, the present method was much faster at finding the minimum of the cost function). It will be noted that the y-axis of the graph in Figure 7 represents the value of the cost function, which is minimized during the fine-tuning stage, while the x-axis represents the number of iterations.

Application Example 2 - Control of a Robot Hand-like Gripper

The method can be utilized to control a robotic system which uses a hand-like gripper to grip objects, for example for waste sorting purposes. For the purposes of this example we assume a gripper which is equipped with sensors for sensing the forces and torques caused by objects when they collide with, or are gripped by, the gripper. Such sensors can be, for example, strain gauges, in which case the output signals are voltages which are sampled periodically using an A/D converter. The sampled values for the forces and torques at sampling time t are, in this example, collectively referred to as the strains s(t). A controller unit moves the gripper; the location of the gripper at time t is in this example referred to as the vector x(t), comprising the coordinates of the gripper's location at time t. The aim is to move the gripper to a position where the gripper is able to grip an object, while avoiding running the gripper into objects or other obstructions. This position must be found by observing the sampled strains s(t) on the gripper. The gripper is controlled to essentially "grope" its way to the object, trying to find a position where the strains indicate that the gripper encloses an object, but backing away when the strains indicate that the gripper is colliding with an obstruction. In practice such a system can also be equipped with other sensors, such as cameras, which can also be used to guide the gripper, but for the purposes of this example only the strains are considered.

In order to move the gripper at time t the controller needs to calculate a change in location, denoted here with dx(t), to be applied to the previous location x(t-l). To do this, a linear model is constructed to map a change in gripper location dx(t) to a change in strain ds(t). A matrix A(t) is calculated so that:

ds(t) = A(t) dx(t)

Once A(t) has been calculated, it can be used to find a change dx(t) in gripper location which will result in some predetermined strain values s_goal which are known to correspond to a good gripping position. Such strain values can be, for example, recorded while placing an object inside the gripper. The calculation can be done by finding the dx(t) which minimizes the (Euclidean) distance between the resulting new strain s(t) + ds(t) and the desired predetermined strain s_goal:

|| s_goal - s(t) - ds(t) || = || s_goal - s(t) - A(t) dx(t) ||

which is then minimized with respect to dx(t). The dx(t) which minimizes the above expression corresponds to the gripper movement which will result in strains closest to the known "desired" strains for the given linear model (the matrix A(t)).
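Since this is an ordinary linear least squares problem, the movement calculation reduces to a few lines; a minimal numpy sketch, assuming A_t, s_t and s_goal are given as arrays:

import numpy as np

def gripper_step(A_t, s_t, s_goal):
    # Find dx(t) minimising || s_goal - s(t) - A(t) dx(t) ||.
    dx, *_ = np.linalg.lstsq(A_t, s_goal - s_t, rcond=None)
    return dx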

The matrix A(t) depends on the operating point of the system at time t and must be chosen or calculated accordingly. By way of example, an operator may configure a plurality of different matrix values corresponding to different object materials. Thus, one matrix value may be defined which results in optimal gripping of an object made of soft plastic, whilst another may be defined which results in optimal gripping of an object made of hard metal. During a waste sorting operation, the system must be able to determine a suitable operating point and select the corresponding matrix value in order to determine how next to adjust the gripper. Alternatively, A(t) can be calculated to fit the operating point from the values of the extracted latent variables, in the same way as was done in example 1 above. In general, it is clear to a person skilled in the art that "choosing" a model according to an operating point of the system can comprise selecting a model from a predetermined set of models, or calculating or "fitting" a model in some predetermined fashion to the operating point. In principle the difference between these two is not great, as selecting e.g. a matrix from a predetermined set according to some operating point could just as well be thought of as a very coarse way of calculating the matrix.

The benefits of the present proposal become apparent in the phase where A(t) is selected, i.e. during operation of the gripper to sort waste. In order to determine the correct operating point and hence select the appropriate matrix A(t), the input variables dx(t) and ds(t) are observed and processed in order to determine the instantaneous correlations between the two inputs. With reference to Figures 5 and 6, dx(t) and ds(t) are the input variables x(t) and y(t). The output of the selected control process (i.e. the output of the fine-tuning block in Figures 5 and 6) is the matrix A(t), which corresponds to the mapping between the measured strains and the desired gripper movements at the current operating point.

It will be appreciated that, although linear models have been calculated or chosen in the previous two examples, the determination of a current operating point for a system may result in the selection of an equation or other non-linear model that is associated with the operating point.

Application Example 3 - Analysis of Climate Data

Essentially the same method as described above in example 1 was used in the analysis of climate data. More specifically, the data consisted of weather measurements of air pressure from a ten-by-ten grid over Helsinki. The spacing of the grid was 2.5 degrees, and the same numbers of samples were used as above.

The method employed in this example was identical to that described above (example 1), except that 10 signals were extracted in the final PCA phase instead of three, since this turned out to give better results. It is believed that more latent variables were required in this example because, unlike in the previous example, where there were only four possible image transformations (i.e. one pixel to the left, right, up or down) which are adequately represented by three latent variables (as corners of a tetrahedron), with weather data there are various types of wind patterns and other phenomena which appear to require more latent variables to describe them.

A similar generative model was used and the method was tested for prediction in a similar manner to the above. Again the results were similar, showing an initial plateau for a random initialisation and a much faster and more reliable result when using the present method. This therefore reinforces the usefulness of the present approach in overcoming the chicken-and-egg problem encountered in various types of applications.

Application Example 4 - Control Application

The method described above in relation to Figure 6 was employed in a machine-learning system configured for a control application. Unlike in the previous examples, in this case dimension reduction is applied before the products of the input variables are obtained.

In this example, the system was configured to simulate a vehicle moving at a constant velocity in one of 20 equally spaced directions in a 2D plane. The output was thus one of the 20 options. These were treated as a 20-dimensional binary vector x(t). It was further assumed that the only measurement the vehicle receives is a delayed measurement of the distance to a target position, which was unknown to the vehicle. The task was therefore to steer the vehicle to the target position, which changed from time to time. The task was relatively difficult since there was assumed to be no prior information about what each of the control outputs does or what the delay of the observation is. A sequence of seven future observations was collected as y(t).

Step 1 - Pre-Processing

The vectors x(t) were whitened and a linear prediction from them was removed from y(t), which were then also whitened.

Step 2 - Multiplication of Projected Signals

Two products of projections were calculated as z_i(t) = (a_i^T x(t)) (b_i^T y(t)), where a_i and b_i are projection vectors and i was 1 and 2.

Step 3 - Spatial Integration / Temporal Filtering

The two signals were then low-pass filtered.

Step 4 - Post-Processing

The resulting signals were normalised by dividing by estimates of the low-pass filtered amplitudes of the signals that were multiplied (square roots of low-pass filtered squares); that is, the squares of a_i^T x(t) and b_i^T y(t) were low-pass filtered with the same filter as was used in step 3. The signals were then orthogonalised in order to make sure that they convey different information about the operating point. The resulting signals are essentially windowed estimates of the correlations of the two projected signals, and we will denote these signals as c_i(t). This step corresponds to step 78 in Figure 6, where c_i(t) corresponds to p_i(t).

A target for one projected signal is then obtained by multiplying the other projected signal by c_i(t). These targets are then used for updating the projection vectors. As an example, the new estimate for a_i is obtained as the least squares solution of the equation c_i(t) b_i^T y(t) = a_i^T x(t). This process corresponds to steps 80 and 82 of Figure 6.

Steps 2 through 4 are then iterated until the vectors a_i and b_i no longer change significantly.
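One possible sketch of a single pass of steps 2 to 4 for one projection pair is given below; the exponential moving average standing in for the low-pass filter, and the omission of the orthogonalisation across pairs, are simplifying assumptions:

import numpy as np

def low_pass(z, alpha=0.1):
    # Exponential moving average as a simple low-pass filter.
    out = np.empty_like(z)
    acc = z[0]
    for t in range(len(z)):
        acc = (1 - alpha) * acc + alpha * z[t]
        out[t] = acc
    return out

def update_projection(x, y, a, b, alpha=0.1):
    px, py = x @ a, y @ b                 # step 2: projected signals
    c = low_pass(px * py, alpha)          # step 3: temporal filtering
    amp = np.sqrt(low_pass(px ** 2, alpha) * low_pass(py ** 2, alpha))
    c = c / (amp + 1e-9)                  # step 4: normalisation
    target = c * py                       # target c_i(t) b_i^T y(t)
    a_new, *_ = np.linalg.lstsq(x, target, rcond=None)
    return a_new, c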

The obtained projection vectors turned out to capture the structure of the control problem: b_i correctly estimated the delay of the observation signal and a_i captured the spatial structure of the control output x. Furthermore, the estimated correlations c_i(t) made a prediction about how the distance to the target would change if different control outputs were applied (as obtained by c_1 a_1 + c_2 a_2). This was then successfully applied for steering the vehicle towards the target, although the only feedback the vehicle got was the change in the distance to the target. It is noted that the meaning of the signals a_i, b_i and c_i(t) in the above example was determined using the applicant's own knowledge about the problem. In more complex problems, it may be more difficult to find a simple explanation for the extracted signals. However, it is often not necessary to know the meaning of the signals when they are used as input data to machine learning methods.

Application Example 5 - View-Independent Face Recognition

The mapping required for correct classification of face images can be very complex when different views are present among training and test examples. However, the present approach can be used to detect the different modes for the mapping (corresponding to the different views), and this can significantly simplify the classification task.

Application Example 6 - Video Compression

Efficient video compression techniques can be built based on a generative model for a stream of frames (e.g. a model which explains how the current frame is generated from the previous one). The method according to the present approach, as explained in the motion analysis of images example, can therefore be used to learn such a model. The problem of video compression is then reduced to coding the innovation process, which is the difference between the current frame and its prediction from the previous one. In other words, the innovation process is equivalent to the noise term in x(t) = A(t) x(t-1) + noise.
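As a sketch of the idea only (assuming the time-variant matrices A(t) have already been learned, as in example 1), the innovation to be coded is simply the prediction residual:

def innovations(frames, A):
    # frames: list of flattened frames; A[t]: learned predictor
    # mapping frame t-1 to a prediction of frame t.
    return [frames[t] - A[t] @ frames[t - 1]
            for t in range(1, len(frames))]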

Application Example 7 - Guiding a robot to search for a place with high concentration of an interesting substance

The embodiment described in relation to Figure 6 (and illustrated in the vehicle control example) can be used in this application. The robot may be searching for truffles in a forest, for an oil spill in the open sea, or for explosive substances in public places.

Figure 8 schematically illustrates a machine learning system 100 according to an embodiment of the present invention. Thus, the machine learning system 100 comprises an initialiser 102 configured to estimate an initial series of values for a latent variable, a modelling module 104 configured to generate a response (e.g. in the form of a control signal 108) to a given input signal, and an input device 106 for inputting the initial series of values obtained from the initialiser 102 into the modelling module 104 as input signals therefor. As illustrated independently in Figure 9, the initialiser 102 comprises a pre-processor 110 configured to remove constant correlations between two or more input variables (e.g. in the form of input signals 112), and a processor 114 configured to apply a non-linear function to the two or more input variables to obtain a series of instantaneous correlations 116 for use as the initial series of values for the latent variable. The initialiser 102 may optionally include a sampler 118 configured to capture the two or more input variables over a pre-determined sample range.

It will be understood that the machine learning system 100 and initialiser 102 may be further configured to carry out the steps described above in relation to Figures 5 and 6. Furthermore, the machine learning system 100 and initialiser 102 may be configured for use in any of the applications described above.

Figure 10 schematically illustrates the processing steps in a method 120 of determining changes in relations between input variables in a machine-learning system according to an embodiment of the present invention. Thus, the method 120 comprises the step 122 of pre-processing two or more input variables by removing constant correlations therebetween, and the step 124 of applying a non-linear function to the two or more input variables to obtain a series of instantaneous correlations. A further step 126 of applying projection coefficients may be included either before or after the instantaneous correlations are obtained. In addition, the steps of applying a saturating function 128, fine-tuning 130 or post-processing 132 may be applied after step 124.

It will be understood that the discussions above relating to the methods of Figures 5 and 6 provide more details as to how each of the steps illustrated in Figure 10 could be performed.

From all of the above it will be appreciated that embodiments of the present invention can be used to solve complicated problems with valuable commercial applications, such as machine vision or robotic control. It will be appreciated by persons skilled in the art that various modifications may be made to the above embodiments without departing from the scope of the present invention.