

Title:
TRAINABLE DEVICE
Document Type and Number:
WIPO Patent Application WO/2001/067382
Kind Code:
A1
Abstract:
A device for recognising structure in data such as handwritten digits within a scanned image. The device is implemented on a computer (10) and comprises an array of ten sub-devices which can be used to recognise, for example, US postal codes. Each sub-device generates a score in respect of an input digit image which it has been trained to recognise. Each sub-device is trained on a training input image by adjusting weights which determine the score generated by the sub-device in response to an input image according to a five stage process in which a reconstruction of a training input signal is formed and used to adjust the weights to try to give a higher score to the training input signal than to the reconstruction.

Inventors:
HINTON GEOFFREY E (GB)
Application Number:
PCT/GB2000/000799
Publication Date:
September 13, 2001
Filing Date:
March 06, 2000
Assignee:
UNIV LONDON (GB)
HINTON GEOFFREY E (GB)
International Classes:
G06K9/66; G06N3/02; (IPC1-7): G06K9/66; G06N7/00; G10L15/14
Other References:
HINTON G E: "Products of experts", ICANN99. NINTH INTERNATIONAL CONFERENCE ON ARTIFICIAL NEURAL NETWORKS (IEE CONF. PUBL. NO.470), PROCEEDINGS OF 9TH INTERNATIONAL CONFERENCE ON ARTIFICIAL NEURAL NETWORKS: ICANN '99, EDINBURGH, UK, 7-10 SEPT. 1999, 1999, London, UK, IEE, UK, pages 1 - 6 vol.1, XP000965328, ISBN: 0-85296-721-7
Attorney, Agent or Firm:
Beresford, Keith Denis Lewis (Beresford & Co. 2-5 Warwick Court High Holborn London WC1R 5DJ, GB)
Claims:
Claims
1. A method of adjusting a plurality of control parameters of a trainable device having a plurality of processing units, said device being operable to receive an input signal and to generate a plurality of activation and gradient signals using the received input signal and said control parameters, said method comprising the steps of: (a) applying a training input signal to the trainable device to generate an initial activation signal in respect of each processing unit and an initial gradient signal in respect of each control parameter; (b) successively performing the following two steps (c) and (d) a predetermined number of times, n; (c) generating a reconstruction signal using the control parameters and the activation signals most recently generated in either step (a) or step (d); (d) generating a subsequent activation signal in respect of each processing unit using said control parameters and the reconstruction signal most recently generated in step (c), and generating a subsequent gradient signal in respect of each control parameter using the subsequent activation signals generated in this step together with the reconstruction signal most recently generated in step (c); (e) comparing the initial gradient signals generated in step (a) with the subsequent gradient signals generated during the nth performance of step (d); and (f) adjusting said control parameters in dependence upon the results of the comparison performed in step (e); wherein the predetermined number, n, is less than or equal to ten.
2. A method as claimed in claim 1 wherein the predetermined number, n, is one.
3. A method as claimed in either one of the preceding claims wherein each processing unit has at least one latent variable which is adapted to take on any one of at least two different values, and wherein each processing unit is adapted to stochastically select a value for its at least one latent variable in dependence upon its respective activation signal.
4. A method as claimed in claim 3 wherein each processing unit includes probabilistic model means for assigning a probability, or a related value such as the logarithm of a probability, to every possible reconstruction signal, to thereby define a respective probability distribution over possible reconstruction signals, in dependence on the values of a subset of the control parameters, said control parameters within said subset being associated with the respective processing unit, and the value of the or each respective latent variable.
5. A method as claimed in claim 4 wherein the step of generating a reconstruction signal includes combining together the probabilities or related values assigned to all possible reconstruction signals by each processing unit such that an overall probability distribution over possible reconstruction signals is defined, wherein the combining is done in such a way that the overall probability distribution depends upon the product of the individual probability distributions defined by each respective processing unit.
6. A method as claimed in claim 5 wherein the step of generating a reconstruction signal further includes stochastically choosing the reconstruction signal from under the overall probability distribution over possible reconstruction signals.
7. A method as claimed in any one of the preceding claims wherein the trainable device is a restricted Boltzmann machine and wherein each processing unit is a hidden unit within the restricted Boltzmann machine.
8. A method as claimed in any one of claims 1 to 6 wherein the trainable device is a product of hidden Markov model devices and each processing unit is an individual hidden Markov model device.
9. A method of training a trainable device having a plurality of processing units and a plurality of control parameters, said device being operable to receive an input signal and to generate a plurality of activation and gradient signals using the received input signal and said control parameters, said method comprising: setting initial values for the control parameters; and adjusting the control parameters by repeatedly using the method of any one of the preceding claims.
10. A computer program for carrying out the method of any one of the preceding claims.
11. A carrier medium carrying the computer program of claim 10.
12. A trainable device having a plurality of processing units and a plurality of control parameters associated with the device, wherein each processing unit is adapted to generate a respective activation signal in response to an input signal, in dependence upon the input signal and the control parameters, wherein said device further comprises adjustment means for adjusting the control parameters, said adjustment means comprising: (a) means for applying a training input signal to each processing unit to thereby generate a respective initial activation signal, and for generating an initial gradient signal in respect of each control parameter; (b) reconstruction signal generating means for generating a reconstruction signal using the control parameters and activation signals; (c) means for applying a reconstruction signal to each processing unit to thereby generate a respective subsequent activation signal and for generating a subsequent gradient signal in respect of each control parameter; (d) means for controlling the reconstruction signal generating means and the means for applying a reconstruction signal to each processing unit to successively generate higher order reconstruction signals and gradient signals respectively up to a predetermined order; (e) comparison means for comparing the initial gradient signals with the higher order gradient signals generated by said means for applying a reconstruction signal to each processing unit; and (f) means for adjusting each control parameter in dependence on the comparison performed by the comparison means; wherein the predetermined order is less than or equal to ten.
13. Apparatus for producing a pretrained pattern recognition device, the pattern recognition device comprising a plurality of processing units and being adapted to store a plurality of control parameters, said pattern recognition device being operable to receive an input signal and to generate an activation signal in respect of each processing unit in response thereto, said apparatus comprising a trainable device as claimed in claim 12 together with download means for downloading control parameters from the trainable device to the pattern recognition device.
Description:
TRAINABLE DEVICE

The present invention relates to data processing devices which can be trained to recognise features within an input signal, and in particular to a method of training such devices.

An example of a typical application of a trainable device is in recognising handwritten digits. In such an application, the device may first be trained with example sets of handwritten digits (eg a set of handwritten "1's" to train the device to recognise "1's", a set of handwritten "2's" to train the device to recognise "2's", etc.) and thereafter, when a previously unseen handwritten digit is presented to the device, the device will generate an estimate as to what digit it has seen.

Hinton, G. E. and Sejnowski, T. J. (1986) Learning and Relearning in Boltzmann machines. In Rumelhart, D. E. and McClelland, J. L. editors, Parallel Distributed Processing: Explorations in the microstructure of cognition. Volume 1: Foundations, MIT Press, describes a trainable device which is referred to as a Boltzmann machine, and a general method for training such a device.

The method is described as applying to any Boltzmann machine having a plurality of visible units, a plurality of hidden units and weighted connections between units (note any two units can be connected to each other in this way up to a maximum in which every unit is connected to every other unit, including connections from one hidden unit to another, etc). The method includes a first positive Hebbian learning phase which involves clamping the states of the visible units in accordance with "the environment" to be modelled and performing a first series of processing steps in which hidden units are individually probed. When a hidden unit is probed, it stochastically chooses its state (ie ON or OFF) in dependence upon the ratio of the overall probability of the configuration of all units when the hidden unit is ON versus the overall probability when the hidden unit is OFF. In this way, the machine is allowed to settle towards thermal equilibrium; once there, the weights are updated to increase the probability of the clamped configuration of the visible units. The method also includes a second negative Hebbian learning phase which involves allowing the machine to reach thermal equilibrium without clamping the visible units and then updating the weights (negatively) to reduce the probability of the finally reached configuration.

There are several drawbacks with this method. Firstly, the procedure is very slow. In one example, about thirty iterations (one iteration corresponds to approximately every unit stochastically choosing a state) are used in each positive phase in respect of each of twenty training examples, and about thirty iterations repeated twenty times in each negative phase before the weights are updated. Furthermore, there is a large amount of noise in each value calculated during the negative phase which means that the weights can only be adjusted by very small amounts (ie a slow learning rate).

Freund, Y. and Haussler, D. (1992) Unsupervised learning of distributions on binary vectors using two layer networks. Advances in Neural Information Processing Systems 4. J. E. Moody, S. J. Hanson and R. P. Lippmann (Eds.), Morgan Kaufmann: San Mateo, CA., describes a method of training a particular type of Boltzmann machine, which will hereinafter be referred to as a restricted Boltzmann machine, having a basic architecture in which there is a single layer of visible units, a single layer of hidden units and weighted connections from visible units to hidden units, but no intra-layer connections (ie no connections between hidden units or between visible units). This paper explains that the standard learning algorithm for the Boltzmann machine requires repeated stochastic approximation which results in unacceptably slow learning, and describes an alternative method which avoids performing any stochastic approximations by finding a formula for the negative phase term which does not require stochastic methods.

However, the formula is only readily calculable for small numbers of hidden units or cases in which only a small number of hidden units are likely to be ON together.

Thus, in order to train a Boltzmann machine accurately and reliably, it was generally believed to be necessary to perform tens or hundreds of iterations during each iterative processing procedure for each training input signal prior to obtaining the weight update components.

This method is therefore relatively slow and is not suited to some applications.

The inventor has found that restricted Boltzmann machines can be reliably and accurately trained when only one iteration is performed during the iterative processing in respect of each training signal prior to obtaining the weight update components. The training can therefore be done much more quickly and consequently can be used in more applications. Surprisingly, the inventor has also shown that this new training method actually achieves more accurate results than the conventional training method.

Furthermore, the inventor has realised that this training method can be applied to any device sharing some basic properties of a restricted Boltzmann machine. This means that the training method can be applied to devices which employ quite sophisticated probabilistic models such as Hidden Markov Models (HMM's). A device which shares the necessary basic properties of a restricted Boltzmann machine and to which the present invention can therefore apply will hereinafter be referred to as a product of experts device.

According to a first aspect of the present invention, there is provided a method of adjusting control parameters of a trainable device, said trainable device comprising a plurality of first processing units and a plurality of second processing units, wherein each of said control parameters is associated with a respective pair of a first processing unit and a second processing unit, and wherein each processing unit is adapted to generate activation and state signals; said method comprising the steps of:-

a) applying a training signal having a plurality of elements to said trainable device; b) generating an initial state signal in respect of each first processing unit in dependence on a respective element of said training signal; c) generating an initial activation signal in respect of each second processing unit, each initial activation signal being generated in dependence upon all of the initial state signals of the first processing units together with the respective control parameters associating the first processing units with the respective second processing unit; d) generating an initial state signal in respect of each second processing unit in dependence upon its respective initial activation signal; e) generating a further activation signal in respect of each first processing unit, each further activation signal being generated in dependence upon all of the initial state signals of the second processing units together with the respective control parameters associating the second processing units with the respective first processing unit; f) generating a further state signal in respect of each first processing unit in dependence on its respective further activation signal;

g) generating a further activation signal in respect of each second processing unit, each further activation signal being generated in dependence upon all of the further state signals of the first processing units together with the respective control parameters associating the first processing units with the respective second processing unit; h) generating an initial product term in respect of each control parameter in dependence on the product of the initial state signal of the first processing unit associated with said control parameter and the initial activation signal of the second processing unit associated with said control parameter; i) generating a further product term in respect of each control parameter in dependence on the product of the further state signal of the first processing unit associated with said control parameter and the further activation signal of the second processing unit associated with said control parameter; and j) adjusting each control parameter in dependence on the difference between the respective initial product term and the respective further product term.

When the above method of training is applied to a restricted Boltzmann machine, the first processing units correspond to the visible units, the second processing units correspond to the hidden units and the control parameters correspond to the weights connecting each visible unit to each hidden unit. Furthermore, each activation signal corresponds to the probability of activation of the respective unit and each state signal corresponds to the state of the respective unit when represented as a binary value (ie 1 for ON and 0 for OFF).

Thus, in a restricted Boltzmann machine, the above method comprises: a) applying a training signal having a plurality of elements to the restricted Boltzmann machine; b) setting the states of the visible units in dependence on the respective elements of the training signal; c) computing the probability of activation of each hidden unit in dependence upon the states of the visible units and the respective weights connecting the respective hidden unit to the visible units; d) stochastically choosing a state in respect of each hidden unit in dependence upon its respective probability of activation; e) computing the probability of activation of each visible unit in dependence upon the states of the hidden units and the respective weights connecting the respective visible unit to the hidden units; f) stochastically choosing a reconstructed state in respect of each visible unit in dependence upon its respective probability of activation; g) computing a new probability of activation of each hidden unit in the same way as was done in step (c) except that the reconstructed states of the visible units are used instead of the states as set by the training signal; h) generating a "positive" product term in respect of each weight by multiplying the state of the respective visible unit set in step (b) with the probability of activation of the respective hidden unit calculated in step (c); i) generating a "negative" product term in respect of each weight by multiplying the reconstructed state of the respective visible unit stochastically chosen in step (f) with the new probability of activation of the respective hidden unit computed in step (g); and j) adjusting each weight in dependence on the difference between its respective "positive" product term generated in step (h) and its respective "negative" product term generated in step (i).

Note that steps (h) and (i) can preferably be incorporated at the end of or as part of steps (c) and (g) respectively. Also note that steps (a) and (b) could preferably be jointly expressed as the restricted Boltzmann machine observing some data which is to be modelled by the restricted Boltzmann machine. Step (c) corresponds to each hidden unit computing the posterior probability distribution over its latent variables (ie its state values). It is also the case that for a restricted Boltzmann machine, each "positive" product term calculated in step (h) corresponds to a linear function of the derivative, with respect to the respective weight, of the logarithm of the probability of the training signal (or the states of the visible units set in step (b)) under the restricted Boltzmann machine when this probability is defined in the normal way for a Boltzmann machine. Similarly, each "negative" product term calculated in step (i) corresponds to the same linear function of the derivative, with respect to the respective weight, of the logarithm of the probability of the reconstructed signal generated in step (f) under the restricted Boltzmann machine.
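The steps (a) to (j) above can be sketched in code. The following Python fragment is an illustrative sketch only: the function names, the learning rate, the random generator and the omission of bias terms are assumptions made for this example, not details taken from the specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.1):
    """One weight update for a restricted Boltzmann machine following
    steps (a)-(j) above. v0 is a binary visible vector set from the
    training signal; W has shape (n_visible, n_hidden)."""
    # (c) probability of activation of each hidden unit
    ph0 = sigmoid(v0 @ W)
    # (d) stochastically choose a state for each hidden unit
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # (e)-(f) reconstruct the visible units from the hidden states
    pv1 = sigmoid(h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    # (g) new hidden probabilities from the reconstruction
    ph1 = sigmoid(v1 @ W)
    # (h)-(i) "positive" and "negative" product terms for every weight
    positive = np.outer(v0, ph0)
    negative = np.outer(v1, ph1)
    # (j) adjust each weight by the difference of its two product terms
    return W + lr * (positive - negative)
```

Note that only a single reconstruction is performed before the weights are adjusted, which is the point of departure from the conventional Boltzmann machine training procedure discussed earlier.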

According to a second aspect of the present invention, there is provided a method of adjusting the parameters of a product of hidden Markov models device wherein each parameter is adjusted in dependence upon the difference between a respective positive term and a respective negative term, wherein the positive and negative terms are calculated using the following five stage procedure:- 1) computing in respect of each individual hidden Markov model a positive term in respect of each of its parameters in its transition matrix and its output probability models, each positive term depending on the derivative of the logarithm of the probability of an observed data sequence with respect to the respective parameter; 2) choosing a path in respect of each individual hidden Markov model through its hidden states from its posterior probability distribution over paths through its hidden states calculated in dependence upon the observed data sequence and the respective parameters of the respective individual hidden Markov model; 3) computing for each time, t, in the observation sequence a probability distribution over the possible observations at time t in dependence upon the product of the probability distributions specified by the hidden states of each hidden Markov model at time t; 4) choosing a reconstructed observation sequence by choosing a reconstructed observation at each time t from the probability distributions computed in stage 3; and 5) computing in respect of each individual hidden Markov model a negative term in respect of each of its parameters in its transition matrix and its output probability models, each negative term depending on the derivative of the logarithm of the probability of the reconstructed observation sequence with respect to the respective parameter.
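Stages 3 and 4 of the procedure can be illustrated by a short sketch. The function name and the renormalisation step are assumptions made for this example; a real implementation would also guard against numerical underflow by working with logarithms.

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruct_sequence(state_dists):
    """Stages 3 and 4: state_dists has shape (T, n_models, n_symbols),
    giving, for each time t, the output distribution over possible
    observations specified by the chosen hidden state of each hidden
    Markov model. At each t the distributions are multiplied together
    and renormalised, and a reconstructed observation is drawn from
    the resulting product distribution."""
    recon = []
    for dists in state_dists:          # one time step at a time
        p = np.prod(dists, axis=0)     # product of the models' outputs
        p = p / p.sum()                # renormalise to a distribution
        recon.append(int(rng.choice(len(p), p=p)))
    return recon
```

Because each time step is handled independently given the chosen hidden states, the reconstructed observations at different times could be drawn in parallel.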

In the above second aspect of the present invention, the first stage (1) of computing a positive term in respect of each parameter of each hidden Markov model corresponds to step (h) of the first aspect of the present invention; the second stage (2) of choosing a path in respect of each hidden Markov model through its hidden states corresponds to steps (c) and (d) in the first aspect of the present invention; the third stage (3) of the second aspect of the present invention corresponds to step (e) of the first aspect of the present invention; the fourth stage (4) of the second aspect of the present invention corresponds to step (f) of the first aspect; and stage (5) of the second aspect of the present invention corresponds to step (i) of the first aspect.

From the above discussion of the first and second aspects of the present invention, it should be clear that the present invention can be generalised to provide a general method for training any trainable device in which the probability of an observation or an input data vector, etc. depends upon the product of the probabilities assigned to the observation or input data vector, etc. by each one of a plurality of probabilistic models represented by, for example, a hidden unit in a restricted Boltzmann machine or a hidden Markov model in a product of hidden Markov models device.

Thus, according to a third aspect of the present invention, there is provided a method of training a trainable device for assigning scores which represent the logarithms of probabilities (up to an unknown additive constant) of points which may be represented by data vectors in a continuous or discrete data space, the device comprising a plurality of individual expert models each of which is itself adapted to assign scores to points in the data space which represent the logarithms of the probabilities (up to an additive constant which may or may not be known) assigned to the points by said individual expert models; wherein the product of experts device is adapted to add the scores assigned by the individual expert models and wherein the individual expert models contain one or more adjustable parameters and one or more latent variables whose values influence the scores assigned by each individual expert model; the method comprising fitting observed data vectors by adjusting some or all of the parameters of some or all of the individual expert models in dependence on the difference between a positive term and a negative term, wherein the positive and negative terms are calculated using the following five stage procedure:- (1) computing in respect of each individual expert the posterior probability distribution over its latent variables given an observed data vector and also computing in respect of each individual expert a respective positive term in respect of each of some or all of said adjustable parameters, wherein each positive term depends upon the derivative, with respect to the respective adjustable parameter, of the logarithm of the probability of an observed data vector to be modelled, or a value which depends upon this derivative such as, for example, a linear function of this derivative;

(2) stochastically choosing, in respect of each individual expert, a specific value for each of its latent variables from the posterior probability distribution over its latent variables given said observed data vector which was computed in the first stage; (3) computing in respect of the product of experts device as a whole a representation of the posterior probability distribution over all data vectors in the data space given the values of the latent variables of the individual expert models computed in the second stage; (4) generating in respect of the product of experts device as a whole a reconstruction by stochastically choosing a data vector in the data space from under the probability distribution computed in the third stage; and (5) computing in respect of each individual expert model the posterior probability distribution over its latent variables given the reconstruction generated in the fourth stage and also computing in respect of each individual expert model a negative term in respect of each of some or all of the adjustable parameters, each negative term depending upon the derivative, with respect to the respective adjustable parameter, of the logarithm of the probability of the reconstructed data vector, or a value which depends on this derivative such as, for example, a linear function of this derivative.
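The score-addition property of the product of experts device described above can be sketched as follows. The function name is illustrative; the key point, which the code makes explicit, is that because each expert's score is a log probability up to an unknown additive constant, adding the scores and exponentiating yields the product of the experts' distributions, with the unknown constants disappearing in the renormalisation.

```python
import numpy as np

def poe_distribution(scores):
    """scores: array of shape (n_experts, n_points) of log probabilities,
    each up to an arbitrary additive constant. Adding the scores
    corresponds to multiplying the experts' distributions."""
    total = scores.sum(axis=0)     # add scores = multiply probabilities
    total -= total.max()           # subtract a constant for stability
    p = np.exp(total)
    return p / p.sum()             # renormalised product distribution
```

Any per-expert additive constant shifts every entry of `total` equally, so it cancels when the result is renormalised; this is why the constants may remain unknown.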

Preferably, each reconstruction signal or reconstructed data vector comprises a plurality of components each of which may be generated independently of all of the other components as this permits all of these components to be generated in parallel with one another (assuming the hardware used to implement the device permits parallel processing to be performed). Similarly, it is also preferable if each activation signal or posterior probability distribution may be generated in respect of each processing unit, hidden unit or individual expert independently of all of the other units or experts, as this permits all of the activation signals or posterior probability distributions to be generated in parallel with one another.

Preferably, an apparatus or device according to the present invention may be implemented in the form of an integrated circuit. Preferably, the integrated circuit includes means for generating a random signal derived from a randomly varying property of the integrated circuit or the substrate on which the integrated circuit is formed. This random signal may then be used to stochastically choose the values or states of internal variables or units. Preferably, the integrated circuit includes real value to binary value conversion means for stochastically choosing a binary value on the basis of an input real value, and a binary value multiplication means (ie an AND gate) for multiplying first and second binary values together to generate a binary value product output. The combined use of a real value to binary value conversion means and binary value multiplication means obviates the need for a real value multiplication means (which might require the use of a floating point calculation engine). Because the real value to binary value conversion is performed on a stochastic basis, the product of two real variables, or of a real variable and a binary variable, can be recovered as the expected value of the corresponding binary product: if the binary product is sampled a large number of times and the results are averaged, the average converges to the product of the original values.
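The stochastic multiplication scheme described above can be illustrated in software. The sampling loop here is an illustrative simulation only; on the integrated circuit the random bits would come from the hardware random signal generator, and the multiplication would be a single AND gate.

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_bits(p, n):
    """Stochastically convert a real value p (between 0 and 1) into
    n binary samples whose expected value is p."""
    return (rng.random(n) < p).astype(int)

# Multiplying a real value by a binary value using only an AND operation:
p, s = 0.75, 1
products = stochastic_bits(p, 100_000) & s   # binary "multiplication" (AND)
# The average of the binary products approximates the real product p * s.
```

Each individual product is a single bit, so no floating point multiplier is needed; accuracy comes from averaging many samples rather than from precision in any one sample.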

The integrated circuit may be adapted to output the values or states of a plurality of variables or units or it may be adapted to output a score in respect of any input signal. Preferably, the integrated circuit is adapted to receive analogue input signals and to output digital signals, thus removing the need for a separate analogue to digital converter.

According to a fourth aspect of the present invention, there is provided a computer program for controlling a computer, or a processor in general, to carry out any of the above described methods. Preferably, the computer program is carried on a suitable carrier medium. The carrier medium may be an integrated circuit adapted to receive analogue signals and to output one or more values or states of internal variables or units in a digital format.

In order that the present invention may be better understood, embodiments thereof will now be described, by way of example only, with reference to the accompanying drawings in which:-
Figure 1 is a block diagram of automatic mail sorting apparatus including a computer suitably programmed to perform digit recognition;
Figure 2 is a block diagram of the functional modules used by the computer of Figure 1 to perform digit recognition;
Figure 3 is a block diagram of a digit recognition module which is one of the modules shown in Figure 2 illustrating an array of restricted Boltzmann machines;
Figure 4 is a block diagram of the array of restricted Boltzmann machines of Figure 3;
Figure 5 is a schematic diagram showing in more detail one of the restricted Boltzmann machines shown in the array of Figure 4;
Figure 6 is a block diagram of an arrangement for training the restricted Boltzmann machine shown in Figure 5;
Figure 7 is a block diagram of an arrangement for training a score post-processor which forms part of the digit recognition module shown in Figure 3;
Figure 8 is a schematic diagram of various arrays used by the computer of Figure 1 during training of the restricted Boltzmann machine shown in Figures 5 and 6;
Figure 9 is a schematic diagram of further arrays used by the computer of Figure 1 during the training of the restricted Boltzmann machine shown in Figures 5 and 6;
Figure 10 is a flow diagram illustrating a method of training the restricted Boltzmann machine of Figures 5 and 6;
Figure 11 is a schematic block diagram of a device for preprocessing speech prior to being processed by a speech recognition device;
Figure 12 is a schematic block diagram of a digital camera; and
Figure 13 is a schematic block diagram of apparatus for monitoring complex physical processes for rare failures according to the present invention.

Figure 1 illustrates a mail sorting system for automatically sorting envelopes 1 into different piles 9a, 9b and 9c according to the zip code included in the address 2 written on the envelopes. The apparatus includes an envelope feeder 4 for feeding envelopes one at a time to a scanner 6 which, after scanning each envelope, passes them on to an envelope sorter 8. The scanner 6 scans each envelope to generate a digital image signal 11 which is input to a computer 10. The computer 10 is programmed to process the image signal 11 and to generate a control signal 17 which controls envelope sorter 8 to transfer the scanned envelope to a particular pile as determined by the computer 10 on the basis of the image signal 11 received from the scanner 6.

To determine which pile an envelope 1 should be sorted into, the computer 10 attempts to identify and then read the zip code included as part of the address 2 written on the envelope 1. The computer 10 then compares this reading of the zip code with entries contained within a pre-stored look-up table (not shown) indicating which piles correspond to which zip codes. If the computer 10 is unable to identify or successfully read the zip code, or if a zip code is read which cannot be matched with a genuine zip code in the look-up table, the computer outputs a default control signal which causes the envelope to be transferred to a default pile 9a for manual sorting.

In order to attempt to identify and read each zip code, the computer 10 is programmed to include a number of functional modules illustrated in Figure 2. The three illustrated modules are a pre-processing module 12, a

digit recognition module 14 and a post-processing module 16. Pre-processing module 12 receives the image signal 11 output from the scanner 6 and, using conventional techniques, it firstly attempts to locate the portion of the image corresponding to the zip code and then, if this is successful, it secondly attempts to locate the portions of the zip code which correspond to individual digits. If the two preceding steps are successful, the pre-processing module 12 then normalises each of the digit images, again using conventional techniques. The normalising processing involves representing each digit image in a sixteen by sixteen pixel, grey-scale image 13 in which the whole digit is contained within the sixteen by sixteen pixel square without leaving a completely unused margin of white pixels around the edge of the square. Each of these normalised bitmap images 13 of individual digits is then individually input to the digit recognition module 14.

The digit recognition module 14 attempts to recognise the digit represented by each input image and outputs its best estimate of this digit as a digital signal 15 which can be readily understood by the post-processing module 16.

The post-processing module 16 collates together the best estimates of the individual digits to generate a best estimate of the entire zip code and tries to match this with a zip code contained within its pre-stored look-up table relating zip codes to desired sorting piles. If the post-processing module 16 finds a match it looks up the associated desired sorting pile and outputs a suitable control signal 17 for controlling the envelope sorter 8 so that the corresponding envelope 1 is transferred to the correct pile 9b, 9c. Otherwise, the post-processing module 16 outputs the default control signal for controlling the envelope sorter 8 so that the corresponding envelope 1 is transferred to the default pile 9a for manual sorting.

Figure 3 shows the digit recognition module 14 in more detail. In this embodiment, digit recognition module 14 comprises a data vector generator 18, an array 20 of restricted Boltzmann machines and a score post-processor 30. Data vector generator 18 receives an individual grey-scale pixel map 13 from pre-processing module 12 and, in this embodiment, converts this into a data vector 19 having 256 binary values (the individual elements of the vector), each of which corresponds to a respective pixel of the input grey-scale pixel map 13.

Each binary value of the data vector is stochastically chosen by comparing the grey-scale value of the corresponding pixel with a pseudo-random number between 0 and 1 and choosing a 1 if the grey-scale value is higher than the pseudo-random number or a zero otherwise, such that dark grey pixels are likely to generate a 1 and light grey pixels are likely to generate a 0. The data vector 19 generated in this way is then applied to the array 20 of restricted Boltzmann machines.
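The stochastic binarisation performed by the data vector generator 18 can be sketched as follows. This is a minimal illustration only, not part of the embodiment; the function name, the use of NumPy, and the convention that grey values lie in [0, 1] with 1 meaning dark are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data_vector(grey_image):
    """Stochastically binarise a 16x16 grey-scale image (values in [0, 1],
    1 = dark) into a 256-element binary data vector."""
    pixels = np.asarray(grey_image, dtype=float).reshape(256)
    # Each element becomes 1 when its grey value exceeds a fresh uniform
    # pseudo-random number, so dark pixels are likely to give a 1.
    return (pixels > rng.uniform(size=256)).astype(int)
```

Applied to a fully dark image every element is 1, and to a fully light image every element is 0, matching the behaviour described above.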

The array 20 processes the input data vector 19 and generates ten scores 201, 211, 221, 231, 241, 251, 261, 271, 281, 291. When each of the restricted Boltzmann machines within the array 20 has been properly trained, each of the ten scores represents how closely the input data vector corresponds to a particular digit. Thus the "0" score 201 gives a score as to how closely the input data vector 19 resembles a "0"; the "1" score 211 is an indication of how closely the input data vector 19 resembles a "1"; etc.

In this embodiment, the ten scores 201-291 are then applied to the score post-processor 30 for further processing. In particular, the purpose of the score post-processor 30 is to make better use of the information available from all ten scores, instead of simply picking the best score. For example, digit recognising devices usually find it very difficult to distinguish between "7"s and "9"s. Thus array 20 might frequently generate a higher "9" score 291 than "7" score 271 even when the data vector 19 actually represents a "7". However, a "7" is usually more similar to a "2" than a "9" is to a "2". Thus, when the "7" score 271 and "9" score 291 have a similar value, the score post-processor 30 can use information from the "2" score 221 to help to try and determine whether a "9" or a "7" has been input to the digit recognition module 14. A more detailed description of the structure of the score post-processor 30 in this embodiment is given below.

Figure 4 illustrates the array 20 of restricted Boltzmann machines in greater detail. As shown, the array 20 contains ten restricted Boltzmann machines 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, each of which is operable to receive the input data vector 19 and to output therefrom a respective one of the scores 201-291.

One of the restricted Boltzmann machines 210 is schematically illustrated in greater detail in Figure 5.

As shown, the restricted Boltzmann machine 210 includes a layer 60 of visible units v1-vN plus a first true unit T1, a layer 70 of hidden units h1-hM plus a second true unit T2, and a plurality of connections c11-cNM (only some of which are shown for clarity) which connect each visible unit to each hidden unit. Additionally, there is a connection cTj between the first true unit T1 and every hidden unit and a connection ciT between every visible unit and the second true unit T2. Each connection cij, cTj, ciT has an associated weight wij which can be any real number but which will typically lie between minus ten and plus ten. In the present embodiment, there are two-hundred-and-fifty-six (256) visible units v1-vN (ie N=256), one hundred (100) hidden units h1-hM (ie M=100), a first and second true unit T1, T2 and twenty-five-thousand-nine-hundred-and-fifty-six (25,956) weighted connections c11-cNM, cT1-cTM, c1T-cNT.
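As a quick sketch, the connection count for this structure can be checked in a few lines. The NumPy arrays stand in for the simulated weight arrays described later in the text; the variable names are illustrative assumptions, and the initial weight range of plus or minus one is taken from the training description below.

```python
import numpy as np

N, M = 256, 100                          # visible units v1-vN, hidden units h1-hM
rng = np.random.default_rng(0)

w = rng.uniform(-1.0, 1.0, size=(N, M))  # weights wij on connections c11-cNM
w_Tj = rng.uniform(-1.0, 1.0, size=M)    # weights on connections from T1 to each hidden unit
w_iT = rng.uniform(-1.0, 1.0, size=N)    # weights on connections from each visible unit to T2

# 256*100 + 100 + 256 = 25,956 weighted connections in total
total_connections = w.size + w_Tj.size + w_iT.size
```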

In this embodiment each of the 256 visible units vi is arranged to receive a respective one of the 256 binary values in the input data vector 19. This binary value sets a "state" of the corresponding visible unit vi, such that the visible unit is set to be in its "ON" state if the binary value is a 1 and is set to be in its "OFF" state if the binary value is 0. The state which each visible unit is in is monitored by an associated state variable svi. In this embodiment, the true units T1, T2 have only a single state: they are always ON, and the reason for this will be described later.

Once the states svi of all of the visible units vi have been set in this way, each hidden unit hj calculates a value which is termed the total input xhj to hidden unit hj by performing the following summation:

    xhj = Σi svi wij + wTj     (1)

In this embodiment, because of the general structure of the restricted Boltzmann machine, each hidden unit hj is effectively adapted (through the weights associated with the hidden unit) to identify whether or not a particular feature of the digit the machine is trained to recognise is present within the input digit being recognised. This means that most of its weights will be small, but a few of its weights will be large (both positive and negative), and it is the visible units associated with these large weights on which the hidden unit has specialised. For this reason, the hidden units are often referred to as feature detectors. If this feature is present in the digit to be recognised, then this total input will be a large positive number and if it is not present then this total input will be a large negative number. If the machine is unsure, then this total input will be a small positive or negative number.

The way that this is achieved can be seen from equation (1) above. For example, the total input xhj to a hidden unit hj will be a large positive value when the weights of connections between the hidden unit and visible units which are ON are large positive weights. This will only happen when there is a good correspondence between the states of the visible units which are ON for the digit being recognised and the states of the visible units which were ON for the training digits. Where the weights are small (either positive or negative), the total input xhj will not be greatly affected one way or the other if the associated visible units are ON or OFF. This occurs when the hidden unit is not concerned about the pixel values associated with these visible units. Where the weights are large negative weights, the total input xhj will not be reduced provided the associated visible units are OFF. However, to effectively increase the total input when a visible unit associated with a large negative weight is OFF, the first true unit T1 can include a component in its weight wTj which is large and positive. In this way, it can be seen that, in this embodiment, the first true unit T1 acts as a biasing unit.

Therefore, in summary, when the visible units with the associated large weights have (on the whole) the "correct" states (ie ON for large positive weights and OFF for large negative weights), the total input to that hidden unit will be high and this can be equated with a high probability of the "feature", which the hidden unit has been trained to detect, having been detected within the digit being recognised.

In the present embodiment, this vague notion of a high probability of a feature being detected is converted into a precise probability value termed the probability of activation, phj, given by:

    phj = 1 / (1 + exp(-xhj))     (2)

From equation (2), it can be seen that the total input is converted into a probability (having the property of always being between 0 and 1 for any value of the total input from minus infinity to plus infinity) by taking an exponent of the total input. In the field of Boltzmann machines, this is referred to as moving from the logarithmic probability domain to the probability domain, wherein the total input is in the logarithmic probability domain. Therefore, a high positive total input xhj will result in a high probability of activation phj (since exp(-xhj) will be small) whilst a large negative total input xhj will result in a low probability of activation (since exp(-xhj) will be large). When the total input is zero the probability of activation is a half (since exp(0) = 1).
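Equations (1) and (2) amount to a matrix-vector product followed by the logistic function, and can be sketched as follows (the function and variable names are illustrative assumptions):

```python
import numpy as np

def hidden_probabilities(sv, w, w_Tj):
    """Total input xhj = sum_i svi * wij + wTj  (equation (1)), then
    probability of activation phj = 1/(1 + exp(-xhj))  (equation (2))."""
    x_h = sv @ w + w_Tj                 # one total input per hidden unit
    p_h = 1.0 / (1.0 + np.exp(-x_h))    # logistic function
    return x_h, p_h
```

With a total input of zero the probability of activation is a half, and a large positive total input drives the probability towards 1, as noted above.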

Restricted Boltzmann machine 210 then combines the total inputs xhj and the probabilities of activation phj for all the hidden units hj to generate the score 211 which is output by the restricted Boltzmann machine 210 for the input data vector 19 corresponding to the digit being recognised. In this embodiment, the combination is performed according to the following equation:

    Score = Σj [ -phj ln(phj) - (1 - phj) ln(1 - phj) + phj xhj ]     (3)

Persons skilled in the art will recognise that the first two terms within the summation of equation (3) represent

the entropy of the machine 210. For present purposes, it simply needs to be noted that the entropy terms will always make a positive contribution to each term of the summation, have a maximum when phj is a half and decrease as phj moves away from a half, and, most importantly, will always be small compared to the third term within the summation when the total input xhj is large.

With regard to the third term within the summation (ie +phj xhj), it will be noted that when xhj is large and positive, phj will be close to 1 and the entropy terms will be negligible. In this case, therefore, the terms within the summation will be approximately equal to the total input xhj. When xhj is small, phj will be close to a half, and the terms within the summation (which may now be dominated by the entropy terms) will be small relative to the terms within the summation for a hidden unit for which xhj is large and positive. Finally, when xhj is large and negative, phj will be small and the third term within the summation will be negative. However, such terms will have a lesser effect in reducing the total score than terms for which xhj is correspondingly large, but positive, because the large negative xhj terms will be scaled down by the small corresponding values of phj.

Therefore, as can be seen from equation (3), the score 211 output by machine 210 depends only on the following factors: the structure of the machine, the values of the weights and the input data vector 19. The other restricted Boltzmann machines in array 20 also operate in the same way. Therefore, as those skilled in the art will appreciate from the above description, the object of the training method for training the machines is to adjust the weights on the connections in such a way that a high score will be output by a device (after it has been trained) when it receives an input data vector 19 which is similar to training data vectors (on which it was trained).

Training

Figure 6 illustrates an arrangement for initially training restricted Boltzmann machine 210 to recognise the digit "1". During training, the restricted Boltzmann machine 210 is in a training mode and it is supplied with training data vectors 209 which represent various training "1" images which are stored in a database 41.

The training data vectors 209 stored in database 41 are generated by the data vector generator 18 from grey-scale 256-pixel training images for the digit "1". In order that the restricted Boltzmann machine, once trained, will be able to recognise most input "1" digits, the training images are manually chosen to represent a good cross-section of different styles and shapes of the digit "1".

Initially, the weights wij of restricted Boltzmann machine 210 are set to be small (ie between plus and minus one) random values. Then, the machine 210 is supplied with a first training data vector 209 from the database 41, which the machine 210 processes in accordance with a training procedure (which is described in greater detail below) to generate an array of weight updates. Further training data vectors 209 are then sequentially supplied to the restricted Boltzmann machine 210 which successively processes each of these training vectors in a similar manner to generate further weight updates. Once all of the training data vectors have been processed in this way, the array of weight updates is deemed to be complete (since the contribution from each training data vector has been included in the array) and the completed array of weight updates is used to update the weights wij of the restricted Boltzmann machine 210.

Once the weights wij have been updated in this way, each of the training data vectors 209 in database 41 is reapplied to the machine 210 in order to generate a new array of weight updates which is again used to update the weights wij. By repeating this process a large number of times (eg a few thousand times achieves very good training) the weights of the restricted Boltzmann machine 210 converge towards values which ensure that, when it is switched to its normal mode, it generates high scores for data vectors which share some of the features found in the training data vectors stored in database 41. Such data vectors 19 which receive a high score should therefore correspond to "1"s, whilst data vectors which do not correspond to "1"s should receive a low score.

The processing which is performed during training is described in greater detail below with reference to Figure 10.

Figure 7 illustrates an arrangement for training the score post-processor 30. This training is done after all of the restricted Boltzmann machines 200 to 290 within the array 20 have been fully trained. In order to train the score post-processor 30, a database 50 of unseen training data vectors 52 for all digits is used. The database 50 also stores a digit label 51 for each of the training data vectors 52 which identifies the digit represented by the data vector. As in the case of training database 41, the training data vectors 52 have

again been generated from images that are manually selected to ensure that the database 50 contains data vectors representing a good cross-section of styles and shapes of different digits, with approximately equal numbers of all the different digits.

In this embodiment, the score post-processor comprises an artificial neural network (not shown) having ten input units and ten output units, in which each input unit is connected to each output unit via a weighted connection.

Each output unit corresponds to a particular digit and can be either ON or OFF, although only one output unit may be ON at any one time. When the score post-processor 30 is operating in its normal mode (see Figure 3), one of the output units will turn ON in response to a particular set of scores 201-291 output from the array 20 of restricted Boltzmann machines. This indicates that the score post-processor's best estimate of the input data vector 19 is that it corresponds to the digit associated with the ON output unit. In this embodiment, the score post-processor 30 determines (when operating in its normal mode) which output unit to turn ON in the following way.

Each input unit is set according to the value of a respective one of the scores 201-291 output by the array 20. Each output unit then multiplies each of these

values by the corresponding weight (of the connection from the output unit to the respective input unit) and sums the products from all of these input units to obtain a total input to that output unit. The output unit with the greatest total input is then allowed to turn (or stay) ON whilst the other output units are constrained to turn (or stay) OFF.
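The decision rule just described (weight each of the ten scores, sum per output unit, and let only the greatest total turn ON) can be sketched as follows; the function and variable names are illustrative assumptions:

```python
import numpy as np

def choose_digit(scores, w_out):
    """scores: the ten scores 201-291 applied to the input units.
    w_out: (10, 10) weights on the connections from input to output units.
    Returns the index of the single output unit allowed to turn ON."""
    totals = w_out @ np.asarray(scores, dtype=float)  # total input per output unit
    return int(np.argmax(totals))                     # greatest total input wins
```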

In this embodiment, the score post-processor 30 is trained by adjusting the weights on the connections using a conventional back propagation method.

In particular, the score post-processor 30 is trained by applying training data vectors 52 from the database 50 to the array 20 of restricted Boltzmann machines (each of which is now operating in its normal mode). The array 20 therefore produces, for each training data vector 52, ten scores 201-291 indicating how closely the training data vector 52 is deemed by each restricted Boltzmann machine to resemble the respective digit which each restricted Boltzmann machine has been trained to recognise. These ten scores 201-291 are then presented to the input units of the score post-processor 30.

Simultaneously, a label 51 indicating which digit the data vector 52 actually represented is also input to the

score post-processor 30. Whilst the score post-processor 30 is in its training mode, it determines which output unit would have been turned ON on the basis of the scores 201 to 291 applied to the input units and the current values of the weights on the connections between the input units and the output units. This result is then compared with the desired result, which can be determined by examining the digit label 51, to produce an error.

This error is then used to adjust the values of the weighted connections between the input and the output units in such a way as to minimise the error. This type of method for training an artificial neural network is well understood in the art and the specific implementation details of this method will not therefore be described here in further detail.

Referring now to Figures 8 and 9, the restricted Boltzmann machine 210 illustrated in Figure 5 is simulated in the present embodiment, using a conventional computer, by manipulating a plurality of one and two dimensional arrays which are schematically illustrated in Figures 8 and 9. The elements of the arrays are stored in a conventional manner in the memory of the computer 10 and are processed according to the various equations and algorithms described herein (and in greater detail below)

by the central processing unit of the computer. The arrays include a visible units states array 61 which is a one-dimensional binary array having N members (one for each visible unit) of the form svi; a visible units probability array 62 which is a one-dimensional real-valued array having N members (one for each visible unit) of the form pvi; a hidden units states array 71 which is a one-dimensional binary array having M members (one for each hidden unit) of the form shj; a hidden units probability array 72 which is a real-valued one-dimensional array having M members (one for each hidden unit) of the form phj; a weights array 81 which is a two-dimensional real-valued array having N by M members of the form wij; a positive terms (sv0ph0) array 82 which is also a two-dimensional real-valued array having N by M members of the form svi0phj0; and a negative terms (sv1ph1) array 83 which is also a two-dimensional real-valued array having N by M members of the form svi1phj1.

Note that the computer 10 will also have a first true unit weights array (not shown) which stores the weights wTj on the connections from the first true unit T1 to each hidden unit hj, and a second true unit weights array (not shown) which stores the weights wiT on the connections from each visible unit vi to the second true unit T2. Both of these arrays are one-dimensional arrays having M and N real-valued elements respectively.

Similarly, the computer 10 can also keep track of the positive and negative terms associated with the weights to and from the true units T1 and T2 using further one-dimensional arrays.

Reconstructions and Gibbs Sampling

In order to train the device 210 according to the method of the present embodiment, it is necessary to operate the machine 210 in such a way as to generate a "reconstruction" of an input training data vector 209. This involves firstly choosing the state of each hidden unit by setting corresponding values for state variables shj which monitor the states of the hidden units hj (by comparing phj with a pseudo-random number between 0 and 1, such that high values of phj are likely to lead to the unit hj being turned ON and low values of phj are likely to lead to the unit hj being turned OFF). In this embodiment, once the states shj of the hidden units have been set in this way, the machine is effectively run backwards and each visible unit vi calculates a total input xvi to itself given by:

    xvi = Σj shj wij + wiT     (4)

and then calculates a corresponding probability of activation pvi given by:

    pvi = 1 / (1 + exp(-xvi))     (5)

From these probabilities of activation pvi, a reconstruction of the input training data vector 209 can be generated by setting the states svi of the visible units vi as either ON or OFF in dependence upon the probabilities pvi (again by comparing the probabilities of activation with a pseudo-random number between 0 and 1). Clearly, this process could then be repeated indefinitely to generate a reconstruction of a reconstruction, which is termed a second order reconstruction, a reconstruction of a reconstruction of a reconstruction (a third order reconstruction), etc.
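One cycle of the reconstruction process (sample the hidden states from the visible states, then run the machine backwards via equations (4) and (5) and sample the visible states) might be sketched as follows; the function name and random-number handling are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_reconstruction(sv, w, w_Tj, w_iT):
    """Return a (first order) reconstruction of the data vector sv."""
    # Forwards: equation (2), then sample binary hidden states.
    p_h = 1.0 / (1.0 + np.exp(-(sv @ w + w_Tj)))
    s_h = (p_h > rng.uniform(size=p_h.size)).astype(int)
    # Backwards: equations (4) and (5), then sample visible states.
    x_v = s_h @ w.T + w_iT
    p_v = 1.0 / (1.0 + np.exp(-x_v))
    return (p_v > rng.uniform(size=p_v.size)).astype(int)
```

Feeding the function its own output would yield the second and third order reconstructions mentioned above.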

To help to understand what is happening in this process, each hidden unit hj can be thought of as a probabilistic binary feature detector. This means that each hidden unit hj is a probabilistic model having two states: an ON state and an OFF state. The state of the hidden unit hj is monitored by the state variable shj, which is set to equal "1" when the unit is ON (ie when it is in its ON state) and "0" when the unit is OFF. Every probabilistic model has an associated data space and it assigns a probability (or a probability density in the case of a continuous as opposed to discrete data space) to every point in the data space, and the distribution of these probabilities across the whole data space is referred to as the probability distribution specified by the probabilistic model. Accordingly, the hidden unit hj assigns a probability to any data vector representing a point in its data space. The data space associated with each hidden unit hj is the space of all possible configurations of the two hundred and fifty six visible units to which it is connected. Since each visible unit vi can either be ON or OFF (as monitored by state variable svi which is set equal to "1" when the unit is ON and "0" when it is OFF), the data space associated with each hidden unit includes 2^256 different data vectors, each of which corresponds to a different configuration of (the states of) the visible units.

The probability assigned to a particular configuration of the visible units by a hidden unit hj depends upon the state of the hidden unit. In particular, when a hidden unit is OFF, its contribution to the total input xvi of each visible unit will be zero (see equation 4).

Therefore, when a hidden unit hj is OFF, it assigns an equal probability to every possible configuration of the visible units. That is to say, when a hidden unit hj is OFF it specifies a uniform distribution over its data space. It does this by specifying that each visible unit has an equal probability of being either ON or OFF; this is discussed in greater detail below.

When a hidden unit hj is ON it specifies a non-uniform distribution over its data space. This distribution is specified by means of the weights wij. In this embodiment, each weight wij corresponds to the logarithm of the ratio of the probabilities assigned by hidden unit hj to visible unit vi being ON or OFF. For example, if weight w23 = 1 (approximately ln(0.73/0.27)), this implies that hidden unit h3 (when it is ON) assigns a probability of approximately 0.73 to visible unit v2 being ON and the corresponding probability of approximately 0.27 to visible unit v2 being OFF. Since the probability of each visible unit being ON is conditionally independent of the other visible units (for a given configuration of the hidden units), it is possible to specify a complete probability distribution over all possible configurations of the visible units by specifying an individual probability distribution (ie the probability of the visible unit being ON, or in other words the weight) for each visible unit. Therefore, the probability of a particular configuration of the states of all two hundred and fifty six visible units is the product of the probability of each visible unit having its correct state. For example, the probability of the configuration in which every visible unit is ON is equal to the product of the probability of each individual visible unit being ON.

In the present embodiment, each visible unit vi has a probability of turning ON which is given by the probability of activation pvi given in equation (5) above. A mathematical analysis of equation (5) shows that the odds pvi/(1 - pvi) are equal to the product of the individual odds assigned to the visible unit vi by each of the hidden units and the second true unit T2. From the above discussion of the conditional independence of the visible units, it should be apparent that, by specifying in this way the odds of each visible unit being ON, the hidden units also specify a probability for any configuration of the visible units and therefore a probability distribution over all possible configurations of the visible units (ie over the whole data space of each of the hidden units). It should also be apparent that this probability distribution can be computed by taking the product of the probability distributions specified by each individual hidden unit.

A mathematical way of restating the above is to state that if the weights and binary states of the hidden units are given, the product of the distributions specified by the hidden units is a factorial distribution in which the visible units are conditionally independent and the probability of an individual visible unit being ON is obtained by adding together the weights to that visible unit from the active hidden units (ie those which are in the ON state) together with the weight from the second true unit, and putting this sum through the logistic function given in equation 5 above.

The above described manner of generating a reconstruction is referred to as a Gibbs sample when each hidden unit is thought of as a probabilistic model, since the reconstruction represents a sample from under the probability distribution specified by the hidden units, for a given configuration of the states of the hidden units. If the analysis of each hidden unit as a probabilistic model is continued, the question arises as to the significance of the probability of activation phj of each hidden unit hj for a given configuration of the states of the visible units. Under this analysis, the probability of activation phj of a hidden unit hj can be equated with the posterior probability of the hidden unit hj being ON given the particular observed configuration of the states of the visible units. In general, this means that if the particular configuration of the visible units had been drawn as a sample from underneath the probability distribution specified either by the hidden unit in its OFF state (which is the uniform "don't care" distribution) or by the hidden unit in its ON state (which can be thought of as corresponding to a particular feature), but we don't know which, then the posterior probability gives an indication of which distribution it was more likely to have come from.

As before, since a posterior probability is associated with each hidden unit, the combination of those posterior probabilities specifies a posterior probability distribution over all possible configurations (of the states) of the hidden units. Choosing the states of the

hidden units in the way described above therefore corresponds to sampling a particular configuration of the states of the hidden units from under the posterior probability distribution specified by a particular configuration of the visible units.

From an intuitive viewpoint, generating a reconstruction can be thought of as comprising the two steps of:

(1) analysing an input data vector to detect which features (as specified by the values of the weights) corresponding to particular hidden units are present in the input data vector and turning ON the hidden units whose features are detected; and

(2) forming a reconstruction which has the same features as those detected to be in the input data vector.

Mathematically, this can be regarded as taking a Gibbs sample with reference to an input data vector by:

(1) selecting a combination of the states of the hidden units from under the posterior probability distribution specified by the input data vector (for a given set of values of the weights); and

(2) sampling from under the probability distribution defined by the product of the probability distributions specified by each hidden unit in its state as determined in step (1).

Note that the posterior probability distribution is factorial because the posterior probability of each hidden unit being ON is specified independently of the other hidden units. This factorial nature of the posterior probability distribution is a useful feature which may be present in other trainable devices to which the present invention may apply.

From an intuitive point of view, it can be considered that the posterior probability of a hidden unit being ON will be high if the visible units are in a configuration which is quite likely under the probability distribution specified by the hidden unit when it is ON.

Figure 10 is a flow diagram illustrating the first part of the method used in this embodiment for training each restricted Boltzmann machine. The first part of the training method involves generating reconstructed data vectors using an iterative method (as illustrated in Figure 10). Each cycle of the iteration involves five stages. The second part of the method involves updating the weights in the restricted Boltzmann machine. The first part of the method involves six steps which are labelled 90 through to 95, comprising an initial stage 90 of setting or clamping the visible units to the corresponding values of the presented training data vector, and first through fifth stages 91-95. In initial stage 90, a training input signal 209 representing a training data vector is presented to the restricted Boltzmann machine 210 from database 41. From the presented training data vector, the states sv1-svN of the visible units v1-vN in the layer of visible units 60 are set or clamped. In the computer simulation, this corresponds to setting the members of the sv array 61 to equal the corresponding elements of the presented data vector.

In the first stage 91 of the first part of the training method, the probabilities phj0 of all of the hidden units h1-hM are computed using equation (2) and stored in the ph array 72. Also, for every pair of a visible unit vi and a hidden unit hj, the product of the binary state svi0 of the visible unit and the probability phj0 of the hidden unit being ON given the states of the visible units is calculated and stored in the sv0ph0 array 82 (i.e. svi0phj0 over all visible and all hidden units is stored in array 82). These products will be referred to as the positive products (note that the 0 superscript indicates that these terms correspond to the original values (or clamped values in the case of the visible units) of these terms, prior to any reconstructions described below).

In stage two (92), each probability phj0 calculated in stage one is used to stochastically determine the binary state shj0 of each hidden unit (or feature detector) hj, and these are stored in the sh array 71. That is to say, for each feature detector or hidden unit hj, stage two converts the real number phj0 to the binary number shj0 and then stores each of these in the appropriate location within the array 71. This conversion is done by comparing each probability phj0 with a pseudo-random number selected from a set of pseudo-random numbers uniformly distributed over the real interval 0 to 1.
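The stage-two conversion can be sketched as follows; the function name and the use of a NumPy generator are illustrative assumptions, not taken from the specification:

```python
import numpy as np

def sample_binary(probs, rng):
    """Compare each real-valued probability with a pseudo-random number drawn
    uniformly from [0, 1); the binary state is 1 exactly when the random draw
    falls below the probability, so a unit with probability p turns ON a
    fraction p of the time."""
    return (rng.random(np.shape(probs)) < np.asarray(probs)).astype(np.int8)
```

The same routine serves for stage four, applied to the visible-unit probabilities instead of the hidden-unit ones.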

In stage three (93), equation (4) is used to determine the probability pvi1 of each visible unit vi given the binary state shj0 of each hidden unit or feature detector determined in stage two, and each of these is stored in the pv array 62 (note that the superscript 1 indicates that these terms relate to a first-order reconstruction).

Stage four (94) is exactly analogous to stage two (92) except that stage four operates on visible units rather than on hidden units (i.e. the state svi1 of each visible unit vi is stochastically determined on the basis of the probability pvi1 of each visible unit vi). That is to say, for each visible unit vi, stage four (94) converts the real number pvi1 to the binary number svi1 and stores each of these in the appropriate location within the sv array 61. Thus the array 61 is updated from the presented training data vector 209 values to the reconstructed data vector values.

Stage five (95) is exactly analogous to stage one (91) except that it operates with the value svi1 of each visible unit vi given by the reconstructed data vector generated in stage four (94), instead of the presented training data vector 209, and each product svi1phj1 computed in stage five (95) is stored in the negative products sv1ph1 array 83 and is referred to as a negative product.
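The five stages above, applied to a single clamped data vector, can be sketched as follows. Since equations (2) and (4) are not reproduced in this text, the standard logistic (sigmoid) activation of a restricted Boltzmann machine is assumed, and the bias weights from the true units are folded into separate bias vectors for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_products(v0, W, b_vis, b_hid, rng):
    """Run stages one to five on one clamped data vector v0 and return the
    positive products sv0*ph0 and negative products sv1*ph1 used in the
    weight update."""
    # stage one: probabilities ph0 of the hidden units given the clamped data
    ph0 = sigmoid(v0 @ W + b_hid)
    pos = np.outer(v0, ph0)                       # positive products sv0*ph0
    # stage two: stochastic binary states sh0 of the hidden units
    sh0 = (rng.random(ph0.shape) < ph0).astype(float)
    # stage three: probabilities pv1 of the visible units (first-order reconstruction)
    pv1 = sigmoid(sh0 @ W.T + b_vis)
    # stage four: stochastic binary reconstruction sv1
    sv1 = (rng.random(pv1.shape) < pv1).astype(float)
    # stage five: hidden probabilities ph1 given the reconstruction
    ph1 = sigmoid(sv1 @ W + b_hid)
    neg = np.outer(sv1, ph1)                      # negative products sv1*ph1
    return pos, neg
```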

The weights wij, wiT and wTj (which are stored as values in the w array 81 and the second and first true units weight arrays) are then updated using the values stored in the positive and negative products arrays in accordance with equation (6) given below:

Δwij = ε Σd [svi0(d)·phj0(d) − svi1(d)·phj1(d)]   (6)

where d is an index over all training data vectors, svi0(d) is the binary value of visible unit vi after clamping by data vector d, phj0(d) is the real-valued probability of hidden unit hj turning ON given data vector d, svi1(d) is the binary value of visible unit vi in a first-order reconstruction of data vector d, phj1(d) is the real-valued probability of hidden unit hj turning ON (or staying ON) given the first-order reconstruction of data vector d, and ε is a scaling factor which is approximately equal to the inverse of the number of terms in the summation. As will be appreciated by persons skilled in the art, the scaling factor will determine the learning rate of the device.

Note that equation (6) is also used to update the bias weights from the first and second true units by replacing either the i or the j with a T respectively and setting svT = 1 and phT = 1 where necessary.

Note that the weights are "repeatedly updated in parallel", which means that the right-hand side of equation (6) is recomputed on the whole training set before the weights wij are updated by adding the Δwij calculated using equation (6) to each respective wij. In order to calculate the right-hand side of equation (6) in respect of the whole training set, a two-dimensional array is used to store the Δwij and two one-dimensional arrays are used to store the ΔwiT and the ΔwTj. These arrays are updated after each iteration with each new data vector from the training set. When the iteration has been performed in respect of all of the data vectors in the training set, the final values of Δwij, ΔwiT and ΔwTj are arrived at and these are then used to update the values of wij, wiT and wTj, and the whole process is repeated. The updating of the weights in this way constitutes the second part of the training method.
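One pass of this two-part method, accumulating the products over the whole training set before a single parallel update, might look as follows. The sigmoid activation is assumed and the true-unit bias weights are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_epoch(data, W, rng, eps=None):
    """Part one: accumulate the positive and negative products for every
    training vector. Part two: add the accumulated difference, scaled by
    eps, to the weights in one parallel update."""
    dW = np.zeros_like(W)
    for v0 in data:
        ph0 = sigmoid(v0 @ W)
        sh0 = (rng.random(ph0.shape) < ph0).astype(float)
        sv1 = (rng.random(v0.shape) < sigmoid(sh0 @ W.T)).astype(float)
        ph1 = sigmoid(sv1 @ W)
        dW += np.outer(v0, ph0) - np.outer(sv1, ph1)   # positive minus negative
    if eps is None:
        eps = 1.0 / len(data)   # roughly the inverse of the number of terms
    return W + eps * dW
```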

An important distinction between the present embodiment and the conventional method of training a restricted Boltzmann machine is that in the present embodiment the negative term is derived after only a first-order reconstruction has been generated, as opposed to the conventional method in which the negative term is derived after the machine to be trained has been allowed to settle to an equilibrium or stationary configuration which is substantially independent of the originally input training data vector. As mentioned above, the conventional method not only requires more cycles or iterations of the first part of the method, but also means that the negative term includes a significantly greater noise component which reduces the accuracy with which the "correct" values for Δwij and ΔwTj may be approximated.

Once each restricted Boltzmann machine has been trained in the above described manner, each machine reverts to its normal mode in which the weights are fixed and each device is used to generate one of the scores 201-291.

Each score 201-291 calculated in this way is representative of the logarithm of the probability of an input data vector 19 under the probability distribution of data vectors learnt by the restricted Boltzmann machine during training (up to an unknown additive constant). That is to say, for a particular set of values of the weights wij, wiT and wTj, a certain probability distribution over data vectors is defined, and it will be appreciated by persons skilled in the art that equation (3) generates a score which reflects this learnt distribution.

Equation (3) is specific to a restricted Boltzmann machine. However, a person skilled in the art will recognise that equation (3) corresponds to the so-called "free energy" of the restricted Boltzmann machine, and that corresponding equations for different types of product of experts devices can be used to generate a score which is representative of the logarithm of the probability of a respective data vector (at least up to an unknown additive constant) in an analogous way.
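Since equation (3) itself is not reproduced in this text, the standard restricted-Boltzmann-machine free-energy form is assumed in the sketch below: the visible bias term plus, for each hidden unit, the softplus of its total input:

```python
import numpy as np

def score(v, W, b_vis, b_hid):
    """Score representative of log p(v) up to an unknown additive constant,
    assuming the standard RBM free-energy formula (negated)."""
    x_h = v @ W + b_hid                       # total input to each hidden unit
    return float(v @ b_vis + np.log1p(np.exp(x_h)).sum())
```

A data vector that matches the features encoded in the weights drives the hidden inputs high and therefore receives a higher score than an unmatched vector.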

Although a specific embodiment has been described above with reference to Figures 1 to 10, many aspects of this embodiment could be changed. For example, there are many alternative ways of adjusting the weights using the difference between the positive and negative terms which will still give rise to the advantages of the above described embodiment. For example, instead of using the whole training set in Equation (6), a subset of the training set could be used instead; furthermore, a different subset of the training set could be used for each update of the weights, each subset being chosen randomly or according to a predetermined cyclical order.

In particular, the weights may be updated after only a single training data vector. Such a situation is referred to as on-line training and can be useful when it is not possible to store training data vectors.

On-line training has the disadvantage that the weights may be adjusted in the wrong direction at times; however, if the weight updates are sufficiently small, the process will even itself out eventually, arriving at values for the weights which are similar to those which would have been obtained using equation (6) to repeatedly update the weights in parallel in respect of a whole training set.
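An on-line step of this kind can be sketched as below, again assuming the sigmoid activation and omitting bias weights; the small default step size is an illustrative choice:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def online_step(v0, W, rng, eps=0.01):
    """On-line variant: the positive and negative products for a single
    training vector are applied immediately instead of being accumulated over
    the whole training set. A small eps keeps the occasional wrong-direction
    step from doing lasting harm."""
    ph0 = sigmoid(v0 @ W)
    sh0 = (rng.random(ph0.shape) < ph0).astype(float)
    sv1 = (rng.random(v0.shape) < sigmoid(sh0 @ W.T)).astype(float)
    ph1 = sigmoid(sv1 @ W)
    return W + eps * (np.outer(v0, ph0) - np.outer(sv1, ph1))
```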

The important aspect of the above described embodiment resides in updating or adjusting the weights using the difference between the positive and negative terms calculated in the way which has been described above, or in some equivalent or approximately equivalent manner.

The distinction between the present embodiment and the conventional method is that no attempt is made in the above described embodiment to generate negative terms which are independent of the applied training data vector or which are generated after the machine has been allowed to settle to a stable or equilibrium state.

Furthermore, although the above-described embodiment is applied to the training of restricted Boltzmann machines, the method can be advantageously applied to any trainable device comprising a plurality of probabilistic models, each of which specifies a probability distribution over a particular data space, wherein the device can be trained to maximise the probability of any input training data vectors under the product of the probability distributions specified by each probabilistic model. For example, a device having a plurality of hidden Markov models could be trained in this way. In the general case, the method comprises performing a Gibbs sample in respect of an input training data vector and adjusting the adjustable parameters of each probabilistic model to maximise the difference between the probabilities of the input data vector and the respective Gibbs sample. This is advantageous over the conventional method of attempting to maximise the absolute probability of an input training data vector under the product of experts device as a whole, because it is often much easier and quicker to calculate since only a single Gibbs sample is required. Also note that it may not actually be necessary to generate a reconstruction (i.e. draw a Gibbs sample) to determine how the adjustable parameters should be adjusted to maximise the difference between an input training data vector and a reconstruction (which might have been generated). Instead, it may be possible to use, say, a probability distribution of possible reconstructions (see Variation 1 below), or some other equivalent or approximate technique.

Further possible envisaged variations of the above described embodiment are set out below.

VARIATION 1 Instead of using binary values for the input data vectors, the intensities in the image are converted by the data vector generator 18 into real values between 0 and 1. In stage one these real values, pvi0, are used to set the values of the real-valued pv array 62, instead of the binary values svi0 setting the values of the sv array 61 (which may in fact be dispensed with). The elements of the pv array 62 are used in place of the elements of the sv array 61 both in equation (2), which is used for computing the probabilities of activation phj of the hidden units, and in the positive products, which are therefore of the form pvi0phj0. Stages two and three are unaltered but stage four is omitted. Stage five is exactly analogous to stage one in using pvi1 instead of svi1 for determining both the probabilities phj1 of the activation of the hidden units and the negative products.

VARIATION 2 Instead of using the probabilities phj0 of activation of the hidden units in the positive products, the states shj0 of the hidden units computed in stage two are used, so that the positive products are of the form svi0shj0. It is also possible to introduce a sixth stage which is analogous to stage two and converts phj1 into a binary state shj1 that can be used instead of phj1 in determining the negative products. This can be advantageous in hardware implementations because the negative and the positive products are then all binary.

VARIATION 3 After training the restricted Boltzmann machine 210 as described above (i.e. by adjusting the values of the weights wij), a second layer of hidden units is trained to model the binary activities of the first layer 70 of hidden units. The second layer is trained in exactly the same way as the first layer but uses the states shj (or the probabilities of activation phj) of the hidden units hj in the first layer 70 in place of the input data vectors (which correspond directly to the states of pixels). Each of the ten Boltzmann machines has its own separate second layer.

When a test image is presented after all ten Boltzmann machines have been trained, each Boltzmann machine outputs two scores. The first score is indicative of the logarithm of the probability of the configuration of the states of the visible units (i.e. of the input data vector) given the weights between the visible units and the first layer of hidden units 70, and the second score is indicative of the logarithm of the probability of the activations shj (or the configuration) of the first layer of hidden units 70 given the probability distribution of configurations of the first layer 70 of hidden units learned by the second layer of hidden units (i.e. the probability of the configuration of the first layer given the weights between the first and second layers of hidden units). These 20 scores can then be used in the same way as described above with respect to a single layer of hidden units for each restricted Boltzmann machine. Thus a separate score post-processing system similar to the score post-processor 30, but now with 20 input units, may be trained as before to convert the now 20 scores into class labels or probabilities of class labels.

This variation may be extended to use any number of additional layers of hidden units in the same way (e.g. 2, 3, 4, etc. additional layers for each Boltzmann machine).

VARIATION 4 Instead of using ten separate Boltzmann machines for the ten separate digits, a single restricted Boltzmann machine is used, but an adapted data vector generator is used to generate extended data vectors, each of which corresponds to both an image and a label representing a digit; the label could, for example, consist of a set of ten binary values only one of which is active at a time. Furthermore, instead of having 256 visible units, the modified restricted Boltzmann machine requires 266 visible units (256 for the pixels as before, plus ten visible units for the ten binary values representing a digit label).

The product of experts device is trained as before except that the training set now contains extended data vectors representing all ten digits with their respective digit labels. After training, a test image without a digit label is recognised in the following way: the test image is used by the data vector generator to generate an extended data vector which includes one of the labels, and a score is obtained for this extended data vector. This process is repeated for all ten labels and the ten scores obtained in this way are used as before (e.g. by feeding the scores to a separate score post-processor).
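The label-by-label recognition procedure can be sketched as below; `score_fn` stands in for whatever implements equation (3) for the trained machine, and taking the argmax is only one of the possible uses of the ten scores:

```python
import numpy as np

def classify(image_vec, score_fn, n_labels=10):
    """Append each candidate one-hot digit label to the 256-element image
    vector, score every extended data vector with the single trained machine,
    and return the best-scoring label along with all ten scores."""
    scores = []
    for d in range(n_labels):
        label = np.zeros(n_labels)
        label[d] = 1.0
        scores.append(score_fn(np.concatenate([image_vec, label])))
    return int(np.argmax(scores)), scores
```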

VARIATION 5 Before training ten separate restricted Boltzmann machines on data vectors of different digits, data vectors representing images of all the different digits are pre-processed by a single restricted Boltzmann machine, and the binary states shj or the probabilities phj of activation of the hidden units hj in this shared pre-processing restricted Boltzmann machine (on data vectors representing the relevant digit) are used as the data vectors for training each of the ten separate restricted Boltzmann machines.

Note that any combination of Variations 1, 2, 3, 4 and 5 may also be used.

2nd Embodiment SPEECH PRE-PROCESSING DEVICE Figure 11 illustrates a speech pre-processing device comprising a microphone 102, a spectrogram generator 104, a data vector generator 106 and a restricted Boltzmann machine 108. A speech signal 101 is processed by the microphone 102 into an analogue electrical signal 103.

This is converted into a digital signal 105 representative of a spectrogram by means of spectrogram generator 104 (suitable devices for this purpose are well known in the art and will not be described herein). Each spectrogram generated in this way represents the energy in each frequency band in each time window of the original speech signal 101. In this embodiment, each time window corresponds to one millisecond. A "frame" consists of the energies in all frequency bands in a single time window, or a vector of coefficients derived from these energies. Digital signal 105 is input to the data vector generator 106, which generates digital signals 107 representing data vectors, each of which consists of the concatenation of one or more temporally adjacent frames.
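The data vector generator's windowing of adjacent frames can be sketched as follows, treating the spectrogram as a bands-by-windows array; the function name and array layout are illustrative assumptions:

```python
import numpy as np

def frames_to_vectors(spectrogram, n_frames):
    """Produce one data vector per valid offset: each vector is the
    concatenation of n_frames temporally adjacent frames (columns of the
    spectrogram), so one speech sequence yields many overlapping data
    vectors."""
    n_bands, n_windows = spectrogram.shape
    return [spectrogram[:, t:t + n_frames].T.ravel()
            for t in range(n_windows - n_frames + 1)]
```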

The restricted Boltzmann machine 108, including a layer of hidden units, is trained in exactly the same way as for the first embodiment (the digit recognition application) except that, in this embodiment, each speech sequence gives rise to a number of different data vectors corresponding to all the different ways of picking the appropriate number of adjacent frames. After training, the restricted Boltzmann machine 108 outputs a digital signal 109 which is representative either of the states of the hidden units or of their probabilities of activation and which, in this embodiment, is used as input (or partial input) to a speech recognition device (not shown). In this embodiment, the restricted Boltzmann machine 108 thus performs non-linear pre-processing on the spectrograms of the input speech signal 101.

As before, multiple layers of hidden units may be used and one or more layers (e. g. the top layer only) may be used to provide the pre-processed data (i. e. the output of the product of experts device 108).

A variation of the above described second embodiment is envisaged as described below.

VARIATION Instead of training a single restricted Boltzmann machine 108, several different restricted Boltzmann machines are trained on several different classes of speech (e.g. different speakers saying the same sentence, as could be used in a speaker verification task, or different digits being spoken by one or more speakers for use in a spoken digit recognition task). A test utterance is then recognised by combining in some way (e.g. accumulating) the scores from each of the restricted Boltzmann machines for all data vectors derived from the utterance, and feeding these combined scores to a separate score post-processing system as before.

3rd Embodiment DIGITAL CAMERA Figure 12 is a schematic diagram of a digital, black-and-white camera. The digital camera includes a lens and shutter arrangement 172, a photosensitive screen 174 (having 64 by 48 pixels), and an integrated circuit 180.

The lens and shutter arrangement 172 is adapted to expose the photosensitive screen to an input image in a conventional manner. In this embodiment, the photosensitive screen is divided into 64 by 48 pixels, each of which generates an analogue signal representative of the amount of light falling onto it. These signals are transmitted to the integrated circuit 180, which processes these input analogue signals to generate a digital signal 185 which represents in a digital format features contained within the image which was exposed on the photosensitive screen.

These features can then be transmitted over a suitable network (not shown) to a remote processing centre (not shown) for further processing without having to transmit the entire image.

In the present embodiment, the integrated circuit 180 comprises a data vector generator 182 and a restricted Boltzmann machine 184. The restricted Boltzmann machine 184 has 3,072 visible units corresponding to the 64 by 48 pixels of the photosensitive screen 174, and 1,024 hidden units. Data vector generator 182 receives analogue signals from each pixel and converts each of these into a corresponding binary element of a data vector 183, which is used to set the states of the visible units of restricted Boltzmann machine 184. The conversion from an analogue signal to a binary element of data vector 183 is performed by comparing each analogue signal with a random analogue signal generated by the integrated circuit 180 using, in this embodiment, a random thermal noise signal which is processed to have maximum and minimum values corresponding to those of the analogue signals from the screen 174, and an approximately uniform distribution between these values. As before, the comparison is performed in such a way that high-value signals from a pixel are likely to result in a one being generated as the corresponding element of data vector 183, whilst low-value signals are likely to result in a zero being generated as the corresponding element of the data vector 183 (note that in this case, as opposed to in the first embodiment, a low signal from a pixel means that the pixel received little light, i.e. it is dark grey or black, whereas a high signal means that it received a large amount of light, i.e. it is light grey or white).

Each hidden unit hj then calculates its total input xhj, and from this its probability of activation phj, on the basis of the states of the visible units as set by the data vector 183. Note that in this embodiment, each hidden unit has its own associated processing capability and local memory, which permits all of the hidden units to calculate their respective total inputs and probabilities of activation in parallel. This may require either multiple access to a central register storing the states of the visible units, or each hidden unit may have its own local copy of the states of the visible units. Finally, each hidden unit stochastically chooses a state (i.e. either ON or OFF) in dependence on its probability of activation. The restricted Boltzmann machine 184 then outputs the states of the hidden units in the form of an activation vector for transmission to a remote processing centre. When the restricted Boltzmann machine has been properly trained, the activation vector will provide useful pre-processed information about the viewed image for further processing.
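The analogue-to-binary conversion performed by the data vector generator 182 can be sketched as below; a uniform pseudo-random draw stands in for the processed thermal-noise reference signal of the real circuit, and all names are illustrative:

```python
import numpy as np

def pixels_to_data_vector(pixel_signals, noise_rng, lo, hi):
    """Compare each pixel's analogue signal with an approximately uniform
    random reference between the signals' minimum and maximum values, so
    bright (high-value) pixels are likely to yield a one and dark (low-value)
    pixels a zero."""
    reference = noise_rng.uniform(lo, hi, size=pixel_signals.shape)
    return (pixel_signals > reference).astype(np.int8)
```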

To train the device 184, a set of training images is recorded by the camera and used to generate a set of training data vectors; thereafter, the restricted Boltzmann machine 184 is trained in the normal way described in relation to the first embodiment. Note that there is no need to supervise the learning at all.

Provided the restricted Boltzmann machine can detect certain common features in the input images, it will train itself to be able to detect at least some of the features common to the training set of images. Note that a possible variation of the above-described embodiment, in which the restricted Boltzmann machine 184 is trained after it has been inserted into the digital camera, is to insert the integrated circuit 180 into the camera only after the restricted Boltzmann machine 184 has been pre-trained. One way that this could advantageously be performed is to initially train a master prototype simulation of the machine to find suitable values for the weights and then to simply download the weights onto the integrated circuit to generate a pre-trained restricted Boltzmann machine. One advantage of doing this is that the visible units do not need to be able to perform any processing and can therefore be replaced simply with a register. Preferably, the integrated circuit 180 stores the weights in a non-volatile re-writeable memory such that new weights can be written to the integrated circuit 180 at any future time to permit the functioning of the integrated circuit 180 to change.

4th Embodiment PRODUCT OF HIDDEN MARKOV MODELS DEVICE The product of hidden Markov models device according to this embodiment may be used in any application where a single hidden Markov model could be used. In such an application the product of hidden Markov models device replaces the single hidden Markov model in the application. The product of hidden Markov models device is trained in the following way:
1) each individual hidden Markov model device computes the derivative of the logarithm of the probability of an observed training data sequence with respect to each of the parameters in its transition matrix and its output probability models (the adjustable parameters); these derivatives are the positive terms;
2) each individual hidden Markov model device chooses a path through its hidden states from the posterior probability distribution of paths through its hidden states defined by the observed training data sequence and the current values of its adjustable parameters;
3) for each time, the product of the probability distributions specified by the output models of the hidden states of each of the hidden Markov model devices at that time is computed;
4) a reconstructed observation sequence is computed by picking an observation at each time from under the probability distributions computed in stage 3 (note that for efficiency reasons stages 3 and 4 may be combined and the probability distribution specified in stage 3 may not be computed explicitly); and
5) this is exactly the same as stage 1 except that the reconstruction computed in stage 4 is used instead of the observed training data sequence, and the derivatives computed in this stage are the negative terms.

The adjustable parameters are then updated so as to increase the difference between the probability of the observed training data sequence and the probability of the reconstructed observation sequence using the difference between the positive and negative terms calculated in stages 1 and 5 respectively.

5th Embodiment APPARATUS FOR MONITORING COMPLEX PHYSICAL PROCESSES FOR RARE FAILURES Referring now to Figure 13, apparatus 130 for monitoring complex physical processes for rare failures comprises a data vector generator 150 and a data vector score generator 152. A complex physical process which is equipped with sensors 131-137, each of which outputs an electrical signal indicative of the instantaneous reading of the sensor, can be monitored in the following way. During the training phase, a restricted Boltzmann machine forming part of the data vector score generator 152 is fitted to a large set of data vectors that are generated by the data vector generator 150 and input to the data vector score generator 152 by means of digital signal 151, and which consist of individual time frames of sensor readings. The data vectors used for training the restricted Boltzmann machine are all acquired from the complex physical process when it is operating within its normal, safe regime, so no examples of failures or unsafe conditions are required for training the restricted Boltzmann machine. After training, the restricted Boltzmann machine can rapidly assign a numerical score to any data vector using equation (3). Note that if the restricted Boltzmann machine were replaced with another type of trainable product of experts device, an equation corresponding to equation (3) could be used. This score represents the logarithm of the probability of the data vector under the restricted Boltzmann machine, or product of experts device, up to an additive constant.

To detect when the complex physical process is outside of its normal or safe operating regime, the score provided by the restricted Boltzmann machine by means of signal 153 is compared with a threshold using a suitable comparator (not shown). An appropriate threshold can be estimated by looking at the distribution of the scores for the normal operating regime or by using examples of failures or unsafe regimes if these are available.
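The thresholding decision can be sketched as below; estimating the threshold from a low quantile of the normal-regime score distribution is one illustrative choice among those the text permits:

```python
import numpy as np

def fit_threshold(normal_scores, quantile=0.01):
    """Estimate a threshold from the distribution of scores observed during
    normal, safe operation: anything scoring below the chosen low quantile
    will be flagged."""
    return float(np.quantile(normal_scores, quantile))

def is_anomalous(score, threshold):
    """A score well below those of the normal regime suggests the process is
    outside its normal or safe operating regime."""
    return score < threshold
```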

Instead of using single time frames of sensor readings, several adjacent time frames can be concatenated into each data vector, or a whole sequence of time frames can be used in conjunction with individual experts that can model temporal sequences, such as, for example, hidden Markov models. The complex physical process being monitored could be a physiological process in a human or other organism or plant.

6th Embodiment APPARATUS FOR MODELLING CONDITIONAL PROBABILITY DENSITIES The learning algorithm for product of experts devices described above can easily be modified to deal with tasks in which the data vector is divided into two parts, X and Y, and the goal is to model the conditional density of Y given X. In stages 3 and 4, instead of reconstructing both parts of the data vector, only part Y is reconstructed and part X is kept the same as it was in the initial data vector that is presented to the apparatus.

An example of an application for such apparatus is in recognising handwritten digits. In this example, a restricted Boltzmann machine is provided with 266 visible units divided into a first part having 256 visible units (one for each pixel of a 16 x 16 bitmap image of a handwritten digit) and a second part having ten visible units, each of which represents a particular digit (e.g. 0, 1, 2, etc.).

To train this restricted Boltzmann machine, a training set of training data vectors is used in which each data vector comprises a first part having 256 elements (one for each visible unit in the first part of the visible units) representing the pixels of a 16 x 16 bitmap image, and a second part having ten elements (of which only one will be a one) to represent the class of digit (e.g. "0", "1", "2", etc.) to which the training image corresponds.

During training, a training data vector is applied to the restricted Boltzmann machine which sets all of its visible units accordingly (ie ON when its corresponding element is a one and OFF when it is a zero). Thereafter, stage (1) of calculating the probabilities of activation in respect of each hidden unit and using these together with the activations of the visible units to calculate positive terms in respect of each weight is performed.

Stage (2) of stochastically selecting a state for each hidden unit in dependence upon the probabilities of activation calculated in stage (1) is then performed as normal. However, in stage (3), instead of calculating the probability of activation of all of the visible units, the probabilities of activation are only calculated in respect of the ten visible units in the second part of the visible units. In stage (4), one of the ten visible units is turned ON in accordance with the probabilities of activation calculated in stage (3) (whilst the 256 visible units of the first part are held clamped with the same values originally set); note that it is important that only one of the ten visible units of the second part turns ON, so some suitable form of constraint is used; in this example, only the unit with the highest probability of activation calculated in stage (3) is turned ON. In stage (5), new probabilities of activation of all of the hidden units are calculated (note that these will reflect any changes in the configuration of the ten visible units of the second part) and these are used together with the states of all of the visible units after stage (4) to generate a negative term in respect of each weight. The weights are then adjusted using, for example, equation (6) together with the differences between the positive and negative terms calculated as described above.

Once the restricted Boltzmann machine has been trained in this way, a previously unseen digit is recognised by setting each of the visible units of the first part according to the input image in any of the above described ways (eg using a binary bitmap to directly set the states of the visible units, using grey-scale values to set the probabilities of activation of the visible units, etc). The states of the visible units of the second part are initialised in some predetermined way, such as by setting them all to zero or setting them to random states. Thereafter, the above described stages of selecting states for the hidden units, and then selecting states for the visible units of the second part only whilst keeping the states of the visible units of the first part clamped at their original values, are repeated a predetermined number of times. The number of iterations is predetermined to a value which is deemed to be sufficient for the machine to approach thermal equilibrium (eg approximately thirty iterations). At the end of this process, the visible unit of the second part which is ON indicates the digit which is deemed by the trained machine to correspond to the input image.
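The recognition procedure can be sketched as follows, assuming the same 256+10 visible-unit layout; the random weights here are a placeholder standing in for a trained machine, and biases are again omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
N_IMG, N_LAB, N_HID = 256, 10, 100                      # sizes assumed from the text
W = rng.normal(0.0, 0.01, size=(N_IMG + N_LAB, N_HID))  # stands in for trained weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recognise(image, n_iter=30):
    """Clamp the image units, then iterate hidden/label updates for a
    number of steps deemed sufficient to approach thermal equilibrium."""
    v = np.concatenate([image, np.zeros(N_LAB)])        # label units start at zero
    for _ in range(n_iter):
        # stochastically select hidden states given all visible units
        h = (rng.random(N_HID) < sigmoid(v @ W)).astype(float)
        # update only the ten label units; the image part stays clamped
        p_lab = sigmoid(W[N_IMG:] @ h)
        v[N_IMG:] = 0.0
        v[N_IMG + np.argmax(p_lab)] = 1.0
    return int(np.argmax(v[N_IMG:]))                    # the label unit left ON

digit = recognise((rng.random(N_IMG) > 0.5).astype(float))
```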

INITIALISATION OF THE PARAMETERS

The parameters of the experts can be initialised in many different ways, including methods that involve using other learning procedures. For example, each expert could be trained separately, but with different training data for each expert, or differently weighted training data for each expert, or different starting values of the parameters of each expert. The present invention is not limited to any particular method of initialising the parameters of the experts.

VARIETIES OF EXPERT

The learning algorithm for product of experts devices described above can be applied whenever the individual experts satisfy two conditions. The "tractable inference condition" is that it must be tractable to compute or approximate the derivative, with respect to the parameters of an individual expert, of the log probability that the expert assigns to a data vector, or a linear function of this derivative. The "tractable reconstruction condition" is that it must be tractable to compute a reconstructed data vector in stage (4) that is drawn from the exact or approximate conditional distribution over data vectors given the states of the latent variables computed in stage (2) for all the experts. An important aspect of the present invention resides in using the difference between the positive and negative terms calculated in the way which has been described above.

The present invention is not limited to any particular types of individual experts, except insofar as they are required to satisfy the tractable inference and tractable reconstruction conditions. Types of individual expert that satisfy these conditions include mixtures of Gaussians with full or constrained covariance matrices and independent or constrained means, mixtures of multinomials, mixtures of factor analysers, mixtures of Poissons, hidden Markov models, mixtures of linear dynamical systems, or mixtures of any of these types of expert with a uniform distribution. It is also possible to use experts that assign numerical scores to data vectors even if these scores cannot be interpreted as the logarithms of probabilities, provided that they can be interpreted as being linearly, or approximately linearly, related to the logarithms of probabilities.

MULTIPLE GIBBS STEPS

In the standard training algorithm, the negative terms are computed on reconstructions of the data vector from latent states of the experts which are themselves computed from the data vector. It is also possible to iterate the reconstruction procedure several times before computing the negative terms. On each such iteration, stage (4) is followed by stage (1), except that the computation of the positive terms is omitted from the repetitions of stage (1), and the reconstructed data vector computed in the immediately preceding stage (4) is used in place of the real data vector in the repetitions of stage (1).
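This iterated variant can be sketched for a generic binary restricted Boltzmann machine expert as follows; the toy layer sizes, learning rate, and number of iterations are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N_VIS, N_HID = 16, 8          # toy layer sizes (assumed)
W = rng.normal(0.0, 0.01, size=(N_VIS, N_HID))
EPS = 0.05                    # learning rate (assumed)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_n_delta(v_data, n=3):
    """Return a weight adjustment: positive terms from the data vector,
    then n reconstruction iterations before computing the negative terms."""
    # initial stage (1): hidden probabilities and positive terms
    p_h = sigmoid(v_data @ W)
    positive = np.outer(v_data, p_h)
    v = v_data
    for _ in range(n):
        # stage (4): sample hidden states, then reconstruct the visible units
        h = (rng.random(N_HID) < p_h).astype(float)
        v = (rng.random(N_VIS) < sigmoid(W @ h)).astype(float)
        # repeated stage (1) on the reconstruction (positive terms omitted)
        p_h = sigmoid(v @ W)
    negative = np.outer(v, p_h)
    return EPS * (positive - negative)

dW = cd_n_delta((rng.random(N_VIS) > 0.5).astype(float), n=3)
```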

An important aspect of the present invention resides in using far fewer repetitions of the reconstruction procedure than would be required to approach a stationary probability distribution over the reconstructed data vectors. Prior art training methods attempted to approximate the gradient of the logarithm of the likelihood of the data vector under the product of experts, and therefore assumed that enough repetitions of the reconstruction procedure to approximate the stationary probability distribution over reconstructed data vectors were necessary before computing the negative terms in stage (5). The reason for this is as follows. The derivative of the logarithmic probability of a data vector under a product of experts device, with respect to an adjustable parameter of the device, can be shown to be the difference between two terms: a first term corresponding to the derivative of the logarithmic probability of the data vector under the individual expert associated with the adjustable parameter, and a second term which results from the normalising or partition function associated with the product of experts. The second term can be shown to be the expected value of the derivative of the logarithmic probability, under the individual expert, of a fantasy data vector generated from the probability distribution of the product of experts device as a whole.

This expected value can be estimated by drawing random samples of data vectors from the stationary probability distribution of the product of experts device, but each sample contributes a noise factor which is only cancelled out if enough samples are taken.

The present invention approximates the gradient of a different objective function, namely the difference in the logarithms of the likelihoods of each data vector and its reconstruction. By using this novel objective function, the present invention avoids the large number of repetitions of the reconstruction process that are generally required to approach the stationary probability distribution over reconstructed data vectors. It also avoids the high variance in the estimated gradient that is a result of sampling from the stationary probability distribution over reconstructed data vectors.
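In the standard notation used for this kind of objective (an assumption; the patent's own equation numbering is not reproduced here, and the proportionality hides the learning rate), the parameter adjustment followed for an individual expert can be written as:

```latex
% d^0 is the data vector, d^1 its reconstruction, and theta_m the
% parameters of expert m; the update follows the difference of the two
% log-likelihood gradients rather than the true likelihood gradient.
\Delta\theta_m \;\propto\;
  \frac{\partial \log p_m(\mathbf{d}^0 \mid \theta_m)}{\partial \theta_m}
  \;-\;
  \frac{\partial \log p_m(\mathbf{d}^1 \mid \theta_m)}{\partial \theta_m}
```

Because the second term is evaluated on a reconstruction reached after only a few steps, rather than on a sample from the stationary distribution, both the number of reconstruction repetitions and the sampling variance are reduced.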