

Title:
THRESHOLD SETTING AND TRAINING OF A SPEAKER VERIFICATION SYSTEM
Document Type and Number:
WIPO Patent Application WO/1999/054868
Kind Code:
A1
Abstract:
Speaker verification system comprising model building means which build n different speaker models, each based on n-1 tokens, for each model one independent token being available for estimating the distance between the model and an utterance that has not been used to build the model. Estimation means estimate the central tendency of the distance between the speaker's model and newly produced utterances of the same speaker, and also its variance. The central tendency of distances between the model and utterances is optimised, based on linear interpolation: Th(new) = b * CTi + (1-b) * CTt, where Th(new) is the optimal threshold, CTi is the central tendency obtained from pre-recorded impostor speech, CTt is the central tendency estimated from the enrolment speech of the new customer, and b is the interpolation parameter, which is optimised using additional pre-recorded utterances not used in estimating CTi.

Inventors:
BOVES LODEWIJK WILLEM JOHAN (NL)
Application Number:
PCT/EP1999/002641
Publication Date:
October 28, 1999
Filing Date:
April 16, 1999
Assignee:
KONINKL KPN NV (NL)
BOVES LODEWIJK WILLEM JOHAN (NL)
International Classes:
G10L17/04; (IPC1-7): G10L5/06
Domestic Patent References:
WO1996041334A11996-12-19
Other References:
FAKOTAKIS N ET AL: "SPEAKER VERIFICATION OVER TELEPHONE LINES BASED ON DIGITAL STRINGS", SIGNAL PROCESSING THEORIES AND APPLICATIONS, BRUSSELS, AUG. 24 - 27, 1992, vol. 1, no. CONF. 6, 24 August 1992 (1992-08-24), VANDEWALLE J;BOITE R; MOONEN M; OOSTERLINCK A, pages 399 - 402, XP000348685, ISBN: 0-444-89587-6
Attorney, Agent or Firm:
Klein, Bart (Koninklijke KPN N.V. P.O. Box 95321 CH The Hague, NL)
Claims:
CLAIMS
1. Speaker verification system, comprising receiving means (1) for receiving utterances of speakers during an enrolment process, during which a speaker (2) produces n tokens of some set of phrases, model building means (3) for building one or more models consisting of explicit or implicit sets of central tendencies and variances of all speech coefficients of utterances received via said receiving means, threshold means (4) for establishing accept/reject thresholds during said enrolment process, and estimating means (5), characterized in that said model building means (3) build n different speaker models, each based on n-1 tokens, for each model one independent token being available for estimating, by said estimating means (5), the distance between the model and an utterance that has not been used to build the model.
2. Speaker verification system according to claim 1, characterized in that said estimation means (5) estimate the central tendency of the distance between the speaker's model and newly produced utterances of the same speaker, and also its variance.
3. Speaker verification system according to claim 1, in which the estimation of the accept/reject threshold from enrolment speech is combined, by combining means (6), with prerecorded impostor speech, characterized in that the central tendency of the distances between the model and utterances is optimised by optimising means (7).
4. Speaker verification system according to claim 3, characterized in that said optimisation, in said optimisation means (7), is executed by linear interpolation: Th(new) = b * CTi + (1-b) * CTt, where Th(new) is the optimal threshold, CTi is the central tendency obtained from prerecorded impostor speech, CTt is the central tendency estimated from the enrolment speech of the new customer, and b is the interpolation parameter, which is optimised using additional prerecorded utterances not used in estimating CTi.
5. Speaker verification system according to claim 3, characterized in that enrolment speech and prerecorded impostor utterances are segmented, in said optimising means (7), into a large number of theoretically independent parts, for each of which the distance to the newly enrolled model is computed.
6. Speaker verification system according to claim 5, characterized in that the central tendencies of the distance distributions are corrected, in said optimising means (7), removing a bias caused by the fact that the enrolment speech has been used both for building the model and for computing the distances to the model.
7. Speaker verification system according to claim 3, characterized in that, in said optimising means (7), a single optimal value is computed that applies to all speakers.
8. Speaker verification system according to claim 3, characterized in that, in said optimising means (7), an optimal correction factor is estimated for each newly enrolled speaker.
Description:
THRESHOLD SETTING AND TRAINING OF A SPEAKER VERIFICATION SYSTEM BACKGROUND OF THE INVENTION The invention relates to a speaker verification system.

Speaker verification (SV) systems are systems in which models of each customer must be built during an enrolment process, accept/reject thresholds must be established during the same enrolment process, and speech of customers who claim a certain identity must be compared to the claimed speaker's model, to determine whether the identity claim is likely to be true. This patent application addresses the second sub-process mentioned above, i.e., the estimation of the accept/reject threshold during enrolment. Speech is a behavioural biometric measure. As all other behaviour, speech behaviour is variable. Therefore, it is not possible to build exact models of a person's speech behaviour. Rather, models must always consist of some combination of central tendencies and the attendant variance around the central tendency value of all parameters with which the speech is characterised. By consequence, the process of verifying a claimed identity is always statistical in nature: one must test what the likelihood is that the newly observed speech pattern is indeed produced by the person who has enrolled the model (i.e., the person whose identity is claimed by the speaker).

Speaker verification systems may use a wide range of parameters to characterise the speech, including spectral coefficients, Mel Frequency coefficients, Cepstral coefficients, Mel Cepstral coefficients, Pitch, Loudness, etc. All these different parameter representations are used in essentially the same process during model enrolment: for all individual coefficients central tendencies and variances must be estimated. This patent application applies to all speaker verification systems that build models consisting of explicit or implicit sets of central tendencies and variances of all speech coefficients.

It is generally known in the field that ideally the estimation of the model parameters would require speech that is recorded in many different sessions. This is so because only multiple sessions can produce data about the behavioural variations in the client (e. g. due to physiological and psychological status) and variations in the handset microphone, acoustic background noise, etc.

However, for Human Factors reasons most operational speaker verification systems must limit the number of enrolment sessions to one or two. This is obviously not enough to capture all variation. Therefore, in general the estimates of the model parameters are not fully reliable.

In addition to the parameters of the speaker model, another pair of statistical distributions must be estimated during the enrolment process, viz. the distribution of the distances to the speaker model of new utterances of the same speaker, and the distribution of the distances of suitable utterances produced by impostor speakers to this speaker's model. This pair of distributions is needed to enable the system to determine whether a new utterance is more likely to have been produced by the speaker who has enrolled the model or by an impostor speaker. In estimating the distribution of the distances of impostor speaker utterances to the newly enrolled speaker's model it may be possible to use speech of many speakers that has been recorded well before the start of the enrolment session. Therefore, it may be possible to estimate the distance distribution for impostor speakers relatively easily and reliably (provided, of course, that the model parameters are reliable themselves). Estimating the distribution of the distances of the newly enrolled customer to her or his own model is much more difficult, because usually no independent speech utterances are available for that purpose: to keep the enrolment procedure as short and easy as possible, usually all speech produced during the enrolment session must be used to estimate the model parameters. Consequently, no data are left that would allow an independent estimate of the distance between the model and newly produced utterances of the same speaker.

Moreover, all speech that is available for the estimation of the true customer's distance to his/her own model is recorded under the same conditions as the speech used to build the models. Therefore, one must expect that the estimates of the distance of the true customer's new utterances to his/her model will have a strong bias towards small values.

Almost all scientific research on speaker verification has been based on pre-recorded corpora of utterances of true and impostor customers. Some utterances are set apart for building the models; the remaining speech is used for testing. In these tests the distances of all test utterances to a speaker model are computed, both for the test utterances of the true customers and for the test utterances of the impostor speakers. This process yields two distributions, viz. one distribution of the distances of the true customer to her/his model, and another distribution of the distances of the impostor speakers' utterances to the same model. Based on these distributions it is then possible to determine the distance threshold that minimises the number or the proportion of false accept or false reject decisions. Here, a false accept decision means that the distance to the model of an impostor utterance is so small that it falls well within the distribution of the distances of the true customer to her/his own model, and must therefore be accepted as if it was indeed produced by the true customer. False reject means that the distance between an utterance of the true customer and her/his own model happened to be so large that it falls well within the distribution of impostor utterances, and therefore must be considered as an utterance produced by a speaker different from the true customer. In this process the continuum of the distances between a model and a new utterance is divided into two halves: distances that are smaller than some threshold lead to the acceptance of the utterance as produced by the speaker who enrolled the model, while distances larger than this threshold will be rejected, because they are considered as produced by some other speaker. The process sketched in this paragraph is known as 'a posteriori threshold optimisation', because it uses all data in the corpus, thereby biasing the threshold setting.
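The a posteriori optimisation described above can be sketched as follows. This is a minimal illustration, not the patent's method: the exhaustive scan over candidate thresholds and the toy Gaussian distance data are assumptions, since the text only defines the optimisation goal (minimising false accepts plus false rejects).

```python
import random

def a_posteriori_threshold(true_dists, impostor_dists):
    """Scan every observed distance as a candidate threshold and keep the
    one minimising false accepts + false rejects (a hypothetical search
    procedure; the text only states the optimisation criterion)."""
    best_th, best_err = None, float("inf")
    for th in sorted(true_dists + impostor_dists):
        false_rejects = sum(1 for d in true_dists if d > th)       # true customer rejected
        false_accepts = sum(1 for d in impostor_dists if d <= th)  # impostor accepted
        if false_rejects + false_accepts < best_err:
            best_err = false_rejects + false_accepts
            best_th = th
    return best_th

# Toy data: true-customer distances cluster low, impostor distances high.
random.seed(0)
true_d = [random.gauss(1.0, 0.3) for _ in range(200)]
imp_d = [random.gauss(2.5, 0.5) for _ in range(200)]
th = a_posteriori_threshold(true_d, imp_d)
```

With well-separated distributions the selected threshold falls between the two central tendencies, which is exactly what a real system cannot do a priori.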

In real world applications of speaker verification systems 'a posteriori threshold optimisation' is obviously impossible, because accept/reject decisions must be made immediately for each new utterance, whether it has been produced by a true customer or an impostor. Thus, in real world applications of speaker verification systems the accept/reject threshold must be estimated 'a priori', using only true customer speech recorded during the enrolment session. This obviously requires a solution for the problem described above, viz. that the estimates that can be obtained from speech recorded in the enrolment session are by necessity biased.

This patent application describes a number of ways in which the bias can be minimised, so that maximally reliable estimates of the accept/reject threshold can be obtained.

SUMMARY OF THE INVENTION In this patent application two classes of solutions to the problem of a priori threshold estimation are described.

Moreover, the two classes can be combined, so as to obtain even better results. Both classes of techniques address the issue of improving the estimates of the distance between the newly built model and utterances of the true customer.

The estimation of the distribution of distances to impostor utterances is not addressed.

Leaving One Out The bias in the estimates of the distance between new utterances of the true customer and the newly built model can be reduced by a new application of the 'leaving one out' technique. 'Leaving one out' as a general experimental procedure has been used previously, e.g. in testing speaker-dependent automatic speech recognizers for isolated words on the basis of pre-recorded speech corpora. Here, the technique is used differently, among others to circumvent the lack of a suitable pre-recorded corpus of utterances.

During a typical enrolment session for a speaker verification system a new speaker is requested to produce n tokens of some set of phrases, where typically n > 2. We have developed a procedure in which n different speaker models are built, each based on n-1 tokens. For each model one independent token is then available to estimate the distance between the model and an utterance that has not been used to build the model. This procedure allows us to estimate not only the central tendency of the distance between the new speaker's model and newly produced utterances of this same speaker, but also its variance. Obviously, since all utterances are recorded under similar conditions, both central tendency and variance are still biased towards too small values.
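The leave-one-out procedure above can be sketched as follows. `build_model` and `distance` are hypothetical stand-ins for the system's actual modelling and scoring functions, which the text leaves unspecified; the scalar tokens in the usage example are likewise only illustrative.

```python
def leave_one_out_distances(tokens, build_model, distance):
    """For each of the n enrolment tokens, build a model from the other
    n-1 tokens and measure the held-out token's distance to that model.
    Returns the central tendency (mean) and variance of those distances."""
    dists = []
    for i in range(len(tokens)):
        rest = tokens[:i] + tokens[i + 1:]   # n-1 tokens build the model
        model = build_model(rest)
        dists.append(distance(model, tokens[i]))  # held-out token scores it
    n = len(dists)
    mean = sum(dists) / n
    variance = sum((d - mean) ** 2 for d in dists) / n
    return mean, variance

# Toy stand-ins: a "model" is just the mean of scalar tokens, and the
# distance is the absolute difference.
build = lambda toks: sum(toks) / len(toks)
dist = lambda m, t: abs(m - t)
ct, var = leave_one_out_distances([1.0, 1.2, 0.9], build, dist)
```

Each of the n distances is computed against a model that never saw the held-out token, which is the source of the (partial) independence the text relies on.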

Therefore, it is advantageous to combine this technique with one or more versions of the distance interpolation technique described in the next subsection.

Distance Interpolation This subsection describes two different approaches to the estimation of the accept/reject threshold from enrolment speech combined with pre-recorded impostor speech.

In the first method, only the central tendency of the distances between the new model and true customer utterances or impostor utterances is optimised. The variances of the two distributions are not used. No assumptions are needed about the statistical nature of the distributions.

Optimisation is based on linear interpolation: Th(new) = b * CTi + (1-b) * CTt, where Th(new) is the optimal threshold, CTi is the central tendency obtained from pre-recorded impostor speech, CTt is the central tendency estimated from the enrolment speech of the new customer, and b is the interpolation parameter, which is optimised using additional pre-recorded impostor utterances that were not used in estimating CTi.
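A minimal sketch of this interpolation follows. The grid search over b and the toy held-out data are assumptions: the text only says b is optimised on additional pre-recorded impostor utterances, without fixing a procedure.

```python
def interpolated_threshold(ct_impostor, ct_true, b):
    """Th(new) = b * CTi + (1 - b) * CTt, as in the first method."""
    return b * ct_impostor + (1 - b) * ct_true

def optimise_b(ct_impostor, ct_true, true_dists, extra_impostor_dists, grid=101):
    """Pick b on a grid so that the resulting threshold minimises
    false accepts + false rejects on held-out data (a hypothetical
    optimisation strategy standing in for the unspecified one)."""
    best_b, best_err = 0.0, float("inf")
    for k in range(grid):
        b = k / (grid - 1)
        th = interpolated_threshold(ct_impostor, ct_true, b)
        err = (sum(1 for d in true_dists if d > th)
               + sum(1 for d in extra_impostor_dists if d <= th))
        if err < best_err:
            best_b, best_err = b, err
    return best_b

th = interpolated_threshold(ct_impostor=3.0, ct_true=1.0, b=0.5)  # -> 2.0
b_opt = optimise_b(3.0, 1.0, [0.8, 1.1, 1.2], [2.8, 3.1])
```

Because CTt is biased towards small values, b > 0 pulls the threshold away from the enrolment-based estimate towards the more reliably estimated impostor side.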

In the second method it is explicitly assumed that the distance distributions of both true customers and impostors approach the Gaussian distribution. To improve the estimates, enrolment speech and pre-recorded impostor utterances are segmented into a large number of theoretically independent parts, for each of which the distance to the newly enrolled model is computed.

The Central Tendencies of the distance distributions are then corrected to remove the bias caused by the fact that the enrolment speech has been used both for building the model and for computing the distances to the model. The optimal correction parameter h is optimised using additional pre-recorded impostor speech.
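One plausible reading of the segmentation and correction steps can be sketched as follows. Both the absolute-difference distance and the multiplicative form of the correction h are assumptions: the text says only that utterances are segmented into theoretically independent parts and that the central tendencies are corrected by an optimal parameter h.

```python
def segment_distances(frames, model_center, n_parts):
    """Split an utterance's frames into n_parts equal-sized, theoretically
    independent parts and compute each part's distance to the model
    (a toy absolute-difference stand-in for the real distance measure)."""
    size = len(frames) // n_parts
    parts = [frames[i * size:(i + 1) * size] for i in range(n_parts)]
    return [abs(sum(p) / len(p) - model_center) for p in parts]

def corrected_client_central_tendency(enrol_distances, h):
    """Correct the biased (too small) client central tendency with a
    factor h tuned on additional pre-recorded impostor speech; the
    multiplicative form is an illustrative assumption."""
    return h * (sum(enrol_distances) / len(enrol_distances))

dists = segment_distances([2.0] * 10, model_center=0.0, n_parts=5)
ct = corrected_client_central_tendency(dists, h=1.5)  # -> 3.0
```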

Both methods set forth in this sub-section can be applied to optimise the correction factors either speaker independently, where a single optimal value is computed that applies to all customers, or speaker dependently, where an optimal correction factor is estimated for each newly enrolled customer.

More detailed descriptions of the methods described above can be found in two papers attached to this patent application (Attachments I and II). In these papers it is shown that the newly introduced methods improve the operational performance of speaker verification systems.

Finally, an implementation of a speaker verification system according to the present invention is visualized in figure 1, while the outstanding capabilities of the invention are illustrated in Attachment III.

Figure 1 schematically shows a system according to the invention, comprising a number of co-operating, hardware or software based modules, viz. a receiving module 1, a model building module 3, a threshold module 4, an estimating module 5, a combining module 6 and an optimisation module 7. Said modules are interconnected by a (software or hardware) bus 8 and controlled by a "control & output" module 9, which also outputs an "accept/reject" signal resulting from the processes executed in the various modules of the shown speaker verification system.

The receiving module 1 receives utterances of a speaker 2 during an enrolment process, during which speaker 2 produces n tokens of some set of phrases. Model building module 3 builds one or more models consisting of explicit or implicit sets of central tendencies and variances of the speech coefficients of the utterances received via receiving module 1. Threshold module 4 establishes accept/reject thresholds during said enrolment process. Model building module 3 builds n different speaker models, each based on n-1 tokens, for each model one independent token being available for estimating, by the estimating module 5, the distance between the model and an utterance that has not been used to build the model.

The estimation module 5 estimates the central tendency of the distance between the speaker's model and newly produced utterances of the same speaker, and also its variance. The estimation of the accept/reject threshold from enrolment speech is combined, by combining module 6, with pre-recorded impostor speech, whereby the central tendency of the distances between the model and utterances is optimised by optimising module 7. Optimisation, in the optimisation module 7, is executed by linear interpolation: Th(new) = b * CTi + (1-b) * CTt, where Th(new) is the optimal threshold, CTi is the central tendency obtained from pre-recorded impostor speech, CTt is the central tendency estimated from the enrolment speech of the new customer, and b is the interpolation parameter, which is optimised using additional pre-recorded utterances not used in estimating CTi. Enrolment speech and pre-recorded impostor utterances are segmented, in said optimising module 7, into a large number of theoretically independent parts, for each of which the distance to the newly enrolled model is computed. The central tendencies of the distance distributions are corrected, in the optimising module 7, removing a bias caused by the fact that the enrolment speech has been used both for building the model and for computing the distances to the model. In the optimising module 7 a single optimal value is computed that applies to all speakers. Preferably, in the optimising module 7 an optimal correction factor is estimated for each newly enrolled speaker.

TECHNIQUES FOR A PRIORI DECISION THRESHOLD ESTIMATION IN SPEAKER VERIFICATION

ABSTRACT
A key problem for field applications in speaker verification is the issue of a priori threshold setting. In the context of the CAVE project several methods for estimating speaker-independent and speaker-dependent decision thresholds were compared. Relevant parameters are estimated from development data only, i.e. without resorting to additional client data. The various approaches were tested on the Dutch SESP database.

1. INTRODUCTION The CAVE project (CAller VErification in Banking and Telecommunications) was a 2-year project that ended in December 1997. It was supported by the Language Sector of the Telematics Applications Programme of the European Union, and for the Swiss partners by the Office Fédéral de l'Education et de la Science (Bundesamt für Bildung und Wissenschaft). The partners were Dutch PTT Telecom, KUN, KTH, ENST, UBILAB, IDIAP, VOCALIS, TELIA and Swiss Telecom PTT. In the realm of the project, 2 telephone-based systems which used Speaker Verification (SV) were developed and assessed.

Work Package 4 (WP4) in this project focused on the research and development aspects. The SV system used in the experiments reported here is the CAVE-WP4 generic SV system [7], based on the HTK software platform [2].

Laboratory evaluations of SV systems usually base their assessments on the Equal Error Rate (EER). The EER is obtained by setting the decision threshold(s) a posteriori so that false acceptance and false rejection rates become equal.

The EER gives a good estimate of the modeling module of the SV system. The EER does, however, not give much information about the performance to expect in a field application. In such a case the decision threshold(s) must be estimated a priori during the enrollment phase. Bayesian theory indicates that the decision threshold(s) could be predicted from the false acceptance and false rejection costs.

The mismatch between the speaker and non-speaker model(s) and the real data distributions requires adjustments of the threshold(s) for efficient decisions to be made.

Part of the results in this paper are also reported in [8]. In this paper a new method (SD-4) has been added and compared to the results reported in [8].

2. THEORETICAL BACKGROUND

2.1 Notations
Let X denote a speaker, and l_X his probabilistic model. Let l_X' denote the non-speaker model for speaker X, i.e. the model of the rest of the population. Let Y be a speech utterance claimed as being from speaker X.

If we denote as A (resp. A') the acceptance (resp. rejection) decision of the system, and P_X (resp. P_X') the a priori probability of the claimed speaker to be (resp. not to be) speaker X, the total cost function of the system is:

C = P_X * C(A'|X) * P(A'|X) + P_X' * C(A|X') * P(A|X')   (1)

where P(A|X') and P(A'|X) denote respectively the probability of a false acceptance and of a false rejection, while C(A|X') and C(A'|X) represent the corresponding costs (assuming a null cost for a true acceptance and a true rejection).
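The total cost function of equation (1) can be written out directly; this sketch assumes the standard form (priors times error probabilities times costs) that the surrounding text describes, and the argument names are illustrative.

```python
def total_cost(p_client, p_false_accept, p_false_reject, cost_fa, cost_fr):
    """Expected cost: each error probability is weighted by the prior of
    the corresponding class and by its cost; true accepts and true
    rejects are assumed cost-free, as in the text."""
    p_impostor = 1.0 - p_client
    return p_client * p_false_reject * cost_fr + p_impostor * p_false_accept * cost_fa

# Equal 0.5 costs and equiprobable classes, the special case of section 2.3.
c = total_cost(p_client=0.5, p_false_accept=0.02, p_false_reject=0.10,
               cost_fa=0.5, cost_fr=0.5)
```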

2.2 PDF Ratio and Bayesian Threshold
If we now denote by p_X and p_X' the Probability Density Functions (PDFs) of the speaker and of the non-speaker distributions, the minimisation of C in equation (1) is obtained by implementing the PDF Ratio (PR) test [4]:

PR_X(Y) = p_X(Y) / p_X'(Y)  >= R : accept;  < R : reject   (2)

where R is the Bayesian threshold:

R = (P_X' * C(A|X')) / (P_X * C(A'|X))   (3)

2.3 Half Total Error Rate
Equation (3) shows that the optimal threshold only depends on the false acceptance/false rejection cost ratio and the impostor/client a priori probability ratio. In the particular case of equal costs of 0.5 and when clients and impostors are assumed a priori equiprobable, the choice of R = 1 as a decision threshold should then lead to a minimum of the Half Total Error Rate:

HTER = (FA + FR) / 2   (4)

2.4 Likelihood Ratio and Threshold Adjustment
In practice, however, the PR in equation (2) is calculated from likelihood functions, i.e. estimations of the PDFs, which do not match the exact speaker and non-speaker distributions. As a consequence, it is usually necessary to adjust the threshold of the PR test accordingly, in order to correct for the improper fit between the model and the data [5]. Thus the PR test becomes a Likelihood Ratio (LR) test:

LR_X(Y) = l_X(Y) / l_X'(Y)  >= T_X(R) : accept;  < T_X(R) : reject   (5)

where l_X and l_X' denote the respective model likelihood functions for the speaker and the non-speaker, and T_X(R) is a speaker- and cost-dependent threshold.
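A minimal sketch of the Bayesian threshold and the accept/reject test follows. The threshold formula assumes the standard prior-times-cost ratio that the text describes (equation (3) is partly reconstructed here), and the function names are illustrative.

```python
def bayesian_threshold(p_client, cost_fa, cost_fr):
    """R of equation (3): ratio of impostor-side to client-side
    prior-times-cost. With equal costs and equiprobable classes, R = 1."""
    p_impostor = 1.0 - p_client
    return (p_impostor * cost_fa) / (p_client * cost_fr)

def pr_decision(pdf_client, pdf_nonspeaker, threshold):
    """PR test of equation (2): accept the identity claim when the
    density ratio reaches the threshold."""
    return (pdf_client / pdf_nonspeaker) >= threshold

r = bayesian_threshold(p_client=0.5, cost_fa=0.5, cost_fr=0.5)  # -> 1.0
```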

2.5 Modelling the log-LR
In most cases, the logarithm of LR_X(Y) is obtained as the sum of the frame-based log-likelihood ratio scores log lr_X(y_t), where y_t denotes the t-th frame in utterance Y, of total length n:

log LR_X(Y) = sum_{t=1..n} log lr_X(y_t)   (6)

In some variants, the average log-LR is used instead of the log-LR:

log LR'_X(Y) = (1/n) * sum_{t=1..n} log lr_X(y_t)   (7)

We will refer to these two quantities as unnormalised and normalised LR, respectively.

If n is large enough, the utterance log-likelihood ratio can be assumed to follow a Gaussian distribution. This distribution is different depending on whether the speech utterance Y was pronounced by speaker X or by an impostor X':

log LR_X(Y) ~ N(M_X, S_X) for speaker X   (8)
log LR_X(Y) ~ N(M_X', S_X') for an impostor   (9)

and similarly N(m_X, s_X) and N(m_X', s_X') for the normalised log-LR', with the obvious relations:

M_X = n * m_X,  M_X' = n * m_X',  S_X = n * s_X,  S_X' = n * s_X'   (10)

As opposed to the utterance log-likelihood ratio, the frame-based log-likelihood ratio does not generally follow a Gaussian distribution. But if we denote as mu_X and sigma_X (resp. mu_X' and sigma_X') the mean and standard deviation of the distribution of the frame-based client (resp. impostor) log-likelihood ratio log lr_X(y|X) (resp. log lr_X(y|X')), and if we assume that the frame-based scores are statistically independent, we have (according to the Central Limit Theorem):

M_X = n * mu_X,  S_X = sqrt(n) * sigma_X   (11)

Under the assumption that the client and impostor log-LR follow Gaussian distributions, the optimal decision threshold T_X(R) can be obtained from M_X, S_X, M_X' and S_X' (12), and similarly for log-LR'.
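The scaling relations of equation (11) can be checked numerically. The Gaussian i.i.d. frame scores below are simulated, since no data accompany the text; the point is only that the utterance-level sum lands near n*mu with spread on the order of sqrt(n)*sigma.

```python
import math
import random

# Frame-based log-LR scores assumed i.i.d.; by the Central Limit Theorem
# the utterance-level sum is approximately Gaussian with mean n*mu and
# standard deviation sqrt(n)*sigma.
random.seed(1)
mu, sigma, n = 0.2, 1.0, 400
frame_scores = [random.gauss(mu, sigma) for _ in range(n)]
utterance_log_lr = sum(frame_scores)

expected_mean = n * mu                # M_X = n * mu_X
expected_std = math.sqrt(n) * sigma   # S_X = sqrt(n) * sigma_X
```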

In practice it is feasible to obtain reasonable estimates of M_X' and S_X' from scores yielded by a population of impostors. Conversely, in real applications, M_X and S_X have to be estimated from the enrollment data themselves and are therefore strongly biased, especially in the case when very few enrollment data are available.

3. SPEAKER-INDEPENDENT (SI) THRESHOLD A classical method for adjusting the threshold in equation (5) consists in estimating a speaker-independent threshold so as to optimise the cost function of equation (1). In practice this optimisation is carried out on a development data set, composed of enrollment and test data for a population of speakers which is distinct from (but representative of) the actual client population. In our experiments, we have tested the method both with unnormalised and normalised LR. We denote these two approaches as SI and SI-N, respectively. The SI and SI-N methods do not make any particular assumption as regards the shape of the log-LR distribution.

However, the fact that the threshold is speaker-independent relies on the hypothesis that the mismatch between the likelihood function and the actual client PDF translates into a client-independent shift between the log-PR and the log-LR.

This is obviously a very simplistic hypothesis as part of the model mismatch is certainly variable across speakers.

4. SPEAKER-DEPENDENT (SD) THRESHOLD Conversely. the estimation of a speaker-dependent threshold accounting for the variability in modeling accuracy can be hindered by the lack of proper data for estimating that threshold. Indeed. in the context of practical applications. enrollment material is so limited that it is not reasonable to reserve some of it for threshold setting. The speaker- dependent threshold must be derived from the same client data as those used for training the client model (and from some pseudo-impostor data).

In the next sections, we present 4 methods for speaker-dependent TS. Methods SD-1, SD-2 and SD-4 were tested with the unnormalised log-LR, whereas SD-3 was used with normalised scores (log-LR').

4.1 Method SD-1
SD-1 consists of estimating T_X(R) as a linear combination of the log-LR mean M_X' and standard deviation S_X', following an approach similar to the one proposed by Furui [6]:

T_X(R) = M_X' + alpha * S_X'   (13)

where M_X' and S_X' are obtained from pseudo-impostor data, whereas alpha is optimised on a development population.

4.2 Method SD-2
The second method relies on an estimation of T_X(R) using also the client scores obtained with the enrollment data. In this method, T_X(R) is obtained as a linear combination of the estimates M_X' and M*_X:

T_X(R) = beta * M_X' + (1 - beta) * M*_X   (14)

where M_X' is obtained from pseudo-impostor data, whereas M*_X is the (biased) estimate of M_X. Parameter beta is optimised on a development population.
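Both SD-1 and SD-2 reduce to one-line combinations, sketched below. The exact sign conventions in equations (13) and (14) are assumptions on my reading; either way the free parameters alpha and beta are tuned on development data and can absorb the sign.

```python
def sd1_threshold(m_impostor, s_impostor, alpha):
    """SD-1 (eq. 13): shift the pseudo-impostor mean by alpha standard
    deviations; alpha is tuned on a development population."""
    return m_impostor + alpha * s_impostor

def sd2_threshold(m_impostor, m_client_biased, beta):
    """SD-2 (eq. 14): interpolate between the pseudo-impostor mean and the
    (biased) client mean estimated from the enrollment data."""
    return beta * m_impostor + (1 - beta) * m_client_biased

# Illustrative log-LR means: impostor scores low, client scores high.
th1 = sd1_threshold(m_impostor=-4.0, s_impostor=2.0, alpha=0.5)        # -> -3.0
th2 = sd2_threshold(m_impostor=-4.0, m_client_biased=2.0, beta=0.5)    # -> -1.0
```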

4.3 Method SD-3
This method is explicitly based on the Gaussian model of the utterance log-LR distribution, as exposed in [5]. The method uses the Gaussian model introduced in subsection 2.5. Estimates mu*_X and sigma*_X of mu_X and sigma_X are initially obtained from the client enrollment data, whereas mu_X' and sigma_X' are estimated from the pseudo-impostor population. Then, a speaker-independent correction h is applied to mu*_X only (15), where h is optimised on a development population. Then, estimates of M_X, S_X, M_X' and S_X' are obtained from mu_X, sigma_X, mu_X' and sigma_X' as in equation (11). Finally, T_X(R) is obtained as in equation (12).

4.4 Method SD-4
The fourth SD method can be viewed as a speaker-dependent adjustment of an estimated SI threshold. T_X(R) is obtained as a linear combination of the SI threshold and the estimates M_X and M*_X (16), where M_X is obtained from the enrolment set of a development population, whereas M*_X is the (biased) estimate of M_X. Parameter gamma is optimised on the development population.

5. DATABASE All our experiments on TS were carried out on the realistic telephone speech database SESP [1], collected by KPN Research. It contains telephone utterances from 21 male and 20 female speakers calling with different handsets (including some calls from mobile phones) from a wide variety of places. During each call, the speaker was asked to utter a number of items, including a speaker-dependent sequence of 14 digits (twice) and a few other sequences of 14 digits, corresponding to other speakers.

Each session contains, therefore, 2 utterances of the client card number. For the experiments described in this paper we used 2 enrollment sessions with a low level of background noise, corresponding to 2 calls placed from 2 different handsets. Two other calls were reserved as extended enrollment material. The rest of the calls were used as test material.

In our experiment on TS, we have split the SESP data into 2 sub-populations which we denote SESP-a and SESP-b.

SESP-a contains 11 male and 10 female speakers while SESP-b contains 10 male and 10 female speakers. Each data set is composed of approximately 800 genuine trials and 250 impostor attempts from other clients (out of which about 75% are same-sex attempts). We use SESP-b as pseudo-impostors and development data for SESP-a and vice-versa.

Acoustic features are 16 LPC cepstral coefficients with log-energy, together with their first and second derivatives.

Cepstral mean subtraction is applied. Our tests were carried out using Left-Right HMM digit models, with 2 different topologies: p=2 states per phoneme with q=3 Gaussian densities per state, and p=3 states per phoneme with q=2 Gaussian densities per state. In these experiments, both the client and world models have the same topology. These configurations were chosen as they were those that worked best in terms of Equal Error Rate in previous experiments on SESP [1].

In all our experiments, we aim at optimising the HTER, defined in equation (4).

Table 1: Equal Error Rates and comparative results for several a priori threshold setting methods, on the SESP-a and SESP-b databases. Columns give rates (in %) for the two model topologies.

a posteriori (speaker-dependent thresholds):
TS method  Eval. data  Dev. data |  p=2,q=3 EER |  p=3,q=2 EER
EER        SESP-a      -         |  0.57        |  0.99
EER        SESP-b      -         |  0.46        |  0.63
EER-N      SESP-a      -         |  0.57        |  0.99
EER-N      SESP-b      -         |  0.26        |  0.89

a priori:
TS method  Eval. data  Dev. data |  FR     FA    HTER |  FR     FA    HTER
R = 1      SESP-a      -         |  12.11  0.25  6.18 |  13.25  0.25  6.75
R = 1      SESP-b      -         |  8.21   0.00  4.10 |  9.76   0.00  4.88
SI         SESP-a      SESP-b    |  0.86   4.60  2.73 |  1.85   4.01  2.93
SI         SESP-b      SESP-a    |  1.72   1.73  1.72 |  1.47   1.61  1.54
SI-N       SESP-a      SESP-b    |  1.63   4.95  3.29 |  2.73   2.15  2.44
SI-N       SESP-b      SESP-a    |  2.25   1.96  2.11 |  2.12   1.61  1.87
SD-1       SESP-a      SESP-b    |  4.08   2.26  3.17 |  3.25   3.59  3.42
SD-1       SESP-b      SESP-a    |  1.05   3.69  2.37 |  1.44   2.98  2.21
SD-2       SESP-a      SESP-b    |  2.83   1.82  2.32 |  2.72   2.52  2.62
SD-2       SESP-b      SESP-a    |  1.28   1.12  1.20 |  1.02   1.80  1.41
SD-3       SESP-a      SESP-b    |  4.86   1.66  3.26 |  2.80   1.89  2.35
SD-3       SESP-b      SESP-a    |  1.65   2.44  2.05 |  1.76   3.11  2.43
SD-4       SESP-a      SESP-b    |  0.38   4.18  2.28 |  0.58   2.28  1.43
SD-4       SESP-b      SESP-a    |  1.08   1.61  1.35 |  1.47   1.61  1.54

6. RESULTS

Comprehensive results are reported in Table 1. We provide separate performances for SESP-a and SESP-b. We first give Equal Error Rates for both unnormalised and normalised likelihood scores. Then we give the performance with the fixed threshold, followed by those obtained with the various speaker-dependent TS methods presented above.

7. COMMENTS AND CONCLUSIONS

On our task, normalisation by the utterance length seems to have little effect. But SESP utterances all have quite similar lengths; therefore, the real impact of normalisation cannot be studied accurately. Loosely speaking, the HTER is about 3 to 5 times larger than the EER. This stresses once more the fact that the EER figure is a very optimistic evaluation of the actual performance of a SV system.

All methods yield similar results, except methods SD-2 and SD-4, which seem to perform consistently better. This may come from the fact that these methods only use the means of the log-LR distributions, which are probably estimated more reliably than the variances, given the small amount of data and the strong bias in the client estimates. It must also be noted that the SI methods do not perform especially worse than the SD methods, which tends to show that a large part of the model mismatch can be accounted for by a speaker-independent shift of the Bayesian threshold.

Quite important differences are observed between the performances obtained on SESP-a and SESP-b, which illustrates the relatively large confidence interval that must be taken into account when interpreting these results.

Future work will consolidate these results, by extending the number of experiments and the size of the database, and by testing the merit of the various Threshold Setting methods for other cost functions than the HTER.

8. REFERENCES

[1] Bimbot F., Hutter H.-P., Jaboulet C., Koolwaaij J., Lindberg J., Pierrot J.-B., "Speaker Verification in the Telephone Network: Research Activities in the CAVE Project", Proc. EUROSPEECH'97, Rhodes, Greece, 1997, Vol. 2, pp. 971-974.

[2] Young S., Jansen J., Odell J., Ollason D., Woodland P., The HTK Book (HTK 2.0 Manual), 1995.

[3] Duda R.O., Hart P.E., Pattern Classification and Scene Analysis, John Wiley & Sons, 1973.

[4] Scharf L.L., Statistical Signal Processing: Detection, Estimation and Time Series Analysis, Addison-Wesley Publishing Company, 1991.

[5] Bimbot F., Genoud D., "Likelihood Ratio Adjustment for the Compensation of Model Mismatch in Speaker Verification", Proc. EUROSPEECH'97, Rhodes, Greece, 1997, Vol. 2, pp. 1387-1390.

[6] Furui S., "Cepstral Analysis Technique for Automatic Speaker Verification", IEEE Trans. on ASSP, Vol. 29, No. 2, pp. 254-272, 1981.

[7] Jaboulet C., Koolwaaij J., Lindberg J., Pierrot J.-B., Bimbot F., "The CAVE-WP4 Generic Speaker Verification System", Proc. RLA2C, Avignon, France, 1998.

[8] Pierrot J.-B., Lindberg J., Koolwaaij J., Hutter H.-P., Genoud D., Blomberg M., Bimbot F., "A Comparison of A Priori Threshold Setting Procedures for Speaker Verification in the CAVE Project", Proc. ICASSP, Seattle, USA, 1998.

A COMPARISON OF A PRIORI THRESHOLD SETTING PROCEDURES FOR SPEAKER VERIFICATION IN THE CAVE PROJECT

ABSTRACT

The issue of a priori threshold setting in speaker verification is a key problem for field applications. In the context of the CAVE project, we compared several methods for estimating speaker-independent and speaker-dependent decision thresholds. Relevant parameters are estimated from development data only, i.e. without resorting to additional client data. The various approaches are tested on the Dutch SESP database.

1. INTRODUCTION

The CAVE project (CAller VErification in Banking and Telecommunications) is a 2-year project supported by the Language Engineering Sector of the Telematics Applications Programme of the European Union, and, for the Swiss partners, by the Office Fédéral de l'Education et de la Science (Bundesamt für Bildung und Wissenschaft). The partners are Dutch PTT Telecom, KUN, KTH, ENST, UBILAB, IDIAP, VOCALIS, TELIA and Swiss Telecom PTT. It started in December 1995. The technical objective of the CAVE project is to design, implement and assess 2 telephone-based systems which use Speaker Verification (SV) technology. Work Package 4 (WP4) of this project focuses on research and development aspects. The speaker verification system used in the experiments reported here is the Generic CAVE-WP4 SV system [1], based on the HTK software platform [2].

Affiliations: 1 ENST, Dépt Signal, CNRS-URA 820, 46 Rue Barrault, 75634 Paris cedex 13, FRANCE-EU. 2 KTH, Department of Speech, Music and Hearing, Drottning Kristinas Väg 31, S-10044 Stockholm, SWEDEN-EU. 3 KUN, Dept of Language & Speech, Erasmusplein 1, NL-6525 HT Nijmegen, THE NETHERLANDS-EU. 4 UBILAB, Union Bank of Switzerland, Bahnhofstrasse 45, CH-8021 Zürich, SWITZERLAND. 5 IDIAP, Rue du Simplon 4, Case Postale 592, CH-1920 Martigny, SWITZERLAND.

Laboratory evaluations of SV systems are usually based on the Equal Error Rate (EER), obtained by setting the decision threshold(s) a posteriori so as to equalise the false rejection and acceptance rates. Indeed, the EER gives a good idea of the quality of the modeling module in a SV system. However, in the context of field applications, a specific procedure must be implemented in order to set the decision threshold a priori, namely during the enrollment procedure. Whereas Bayesian theory indicates that the decision threshold could be readily predicted from the false rejection and false acceptance costs, the mismatch between the speaker (and non-speaker) model(s) and the real data distributions requires the adjustment of the threshold for an efficient decision.

This paper reports on a series of comparative experiments on a priori Threshold Setting (TS) carried out by WP4. We first recall the main theoretical aspects involved in TS. Then, we express several TS procedures under a common formalism. Finally, we compare their efficiency on a task of speaker verification on a realistic telephone speech database (the SESP database).

2. THEORETICAL ASPECTS

2.1. Notations

Let X denote a speaker, and λX his probabilistic model. Let λX̄ denote the non-speaker model for speaker X, i.e. the model of the rest of the population. Let Y be a speech utterance claimed as being from speaker X.

If we denote by X (resp. X̄) the acceptance (resp. rejection) decision of the system, and by pX (resp. pX̄) the a priori probability of the claimed speaker being (resp. not being) speaker X, the total cost function of the system is [3]:

C = C(X|X̄) · pX̄ · P(X|X̄) + C(X̄|X) · pX · P(X̄|X)   (1)

where P(X|X̄) and P(X̄|X) denote respectively the probability of a false acceptance and of a false rejection, while C(X|X̄) and C(X̄|X) represent the corresponding costs (assuming a null cost for a true acceptance and a true rejection).
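As a sketch of the total cost function of equation (1) above (parameter names are our own, chosen for readability; they are not from the paper), each error type contributes its cost weighted by the corresponding prior and error probability:

```python
def total_cost(c_fa, c_fr, p_impostor, p_client, p_false_accept, p_false_reject):
    """Total cost of equation (1):
    C = C(X|Xbar)·p_Xbar·P(X|Xbar) + C(Xbar|X)·p_X·P(Xbar|X),
    i.e. false-acceptance cost x impostor prior x FA rate, plus
    false-rejection cost x client prior x FR rate."""
    return c_fa * p_impostor * p_false_accept + c_fr * p_client * p_false_reject

# With equal costs of 0.5 and equiprobable clients/impostors, the cost
# reduces to 0.25 x (FA + FR), i.e. half the HTER of equation (4).
cost = total_cost(0.5, 0.5, 0.5, 0.5, p_false_accept=0.04, p_false_reject=0.02)
print(round(cost, 6))  # → 0.015
```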

2.2. PDF Ratio and Bayesian Threshold

If we now denote by PX and PX̄ the Probability Density Functions (PDFs) of the speaker and of the non-speaker distributions, the minimisation of C in equation (1) is obtained by implementing the PDF Ratio (PR) test [4]:

PR_X(Y) = PX(Y) / PX̄(Y)   →   accept if PR_X(Y) ≥ θ, reject otherwise   (2)

where θ is the Bayesian threshold:

θ = [C(X|X̄) · pX̄] / [C(X̄|X) · pX]   (3)

2.3. Half Total Error Rate

As can be seen from equation (3), the optimal threshold should only depend on the false acceptance/rejection cost ratio and the impostor/client a priori probability ratio. In the particular case when the costs C(X|X̄) and C(X̄|X) are equal to 0.5, and when clients and impostors are assumed a priori equiprobable, the choice of θ = 1 as a decision threshold should then lead to a minimum of the Half Total Error Rate:

HTER = ½ [P(X|X̄) + P(X̄|X)]   (4)

2.4. Likelihood Ratio and Threshold Adjustment

In practice, however, the PR in equation (2) is calculated from likelihood functions, i.e. estimations of the PDFs, which do not match the exact speaker and non-speaker distributions. As a consequence, it is usually necessary to adjust the threshold of the PR test accordingly, in order to correct for the improper fit between the model and the data [5]. Thus, the PR test becomes an LR (Likelihood Ratio) test:

LR_X(Y) = LX(Y) / LX̄(Y)   →   accept if LR_X(Y) ≥ θX(θ), reject otherwise   (5)

where LX and LX̄ denote the respective model likelihood functions for the speaker and the non-speaker, and θX(θ) is a speaker- (and cost-) dependent threshold.
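As a sketch (function and variable names are ours, not the paper's), the Bayesian threshold of equation (3), the accept/reject rule of equations (2) and (5) in the log domain, and the HTER of equation (4) can be written as:

```python
import math

def bayesian_threshold(c_fa, c_fr, p_impostor, p_client):
    """θ from equation (3): the cost/prior ratio at which the
    PDF-ratio test of equation (2) minimises the cost of equation (1)."""
    return (c_fa * p_impostor) / (c_fr * p_client)

def lr_decision(log_lr, log_threshold):
    """Accept iff the log likelihood ratio reaches the (log) threshold,
    as in the LR test of equation (5)."""
    return log_lr >= log_threshold

def hter(p_false_accept, p_false_reject):
    """Half Total Error Rate of equation (4)."""
    return 0.5 * (p_false_accept + p_false_reject)

# With C(X|Xbar) = C(Xbar|X) = 0.5 and equiprobable clients/impostors,
# the Bayesian threshold is θ = 1 (log θ = 0), as noted in Section 2.3.
theta = bayesian_threshold(0.5, 0.5, 0.5, 0.5)
print(theta, lr_decision(log_lr=0.3, log_threshold=math.log(theta)))  # → 1.0 True
```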

2.5. Gaussian log-LR model

In most cases, the logarithm of LR_X(Y) is obtained as the sum of the logarithms of the frame-based likelihood ratio scores lr_X(y_i):

log LR_X(Y) = Σ_{i=1..n} log lr_X(y_i)   (6)

where y_i denotes the i-th frame in utterance Y, of total length n. In some variants, the average log-LR is used instead of the LR:

log L̄R_X(Y) = (1/n) Σ_{i=1..n} log lr_X(y_i)   (7)

We will refer to these two quantities as the unnormalised and normalised LR, respectively.

If n is large enough, and if the frame-based log-likelihood ratio values are supposed statistically independent, the normalised log-LR can be assumed to follow one of two conditional Gaussian distributions, N(μX; σX) or N(μX̄; σX̄), depending on whether Y was uttered by the client speaker X or by an impostor X̄. According to the Central Limit Theorem, μX = mX and μX̄ = mX̄, where mX and mX̄ are the means of the client (resp. impostor) frame-based log-likelihood ratio values, and σX = sX/√n and σX̄ = sX̄/√n, with sX and sX̄ denoting the standard deviations of the frame-based distributions. Note that, in general, the frame-based log-likelihood ratio values do not themselves follow a Gaussian distribution.
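The Central Limit Theorem argument above can be sketched as follows (the function name and the toy frame scores are illustrative, not from the paper): the normalised log-LR has mean m and standard deviation s/√n, where m and s are the frame-level statistics:

```python
import math

def normalised_log_lr_params(frame_log_lrs):
    """Gaussian parameters for the normalised log-LR of one utterance:
    mean m of the frame-based log-likelihood ratio values, and
    standard deviation s/sqrt(n) per the Central Limit Theorem."""
    n = len(frame_log_lrs)
    m = sum(frame_log_lrs) / n
    s = math.sqrt(sum((v - m) ** 2 for v in frame_log_lrs) / n)
    return m, s / math.sqrt(n)

# Made-up frame-based log-LR scores for a single short utterance.
scores = [0.8, 1.2, 0.5, 1.1, 0.9, 1.0, 0.7, 1.3]
mu, sigma = normalised_log_lr_params(scores)
print(round(mu, 3), round(sigma, 3))  # → 0.938 0.088
```

Note how the utterance-level standard deviation shrinks with √n, so longer utterances give more reliable scores.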

In the rest of the paper, we will indicate by a star the estimates of μX, σX, mX and sX obtained from the enrollment data only, i.e. from the same data as those used to estimate the parameters of the client model λX.

3. SPEAKER-INDEPENDENT (SI) THRESHOLD

A classical method for adjusting the threshold θX(θ) in (5) consists in estimating a speaker-independent threshold so as to optimise the cost function of equation (1). In practice, this optimisation is carried out on a development data set, composed of enrollment and test data for a population of speakers which is distinct from (but representative of) the actual client population. We will refer to this method as method SI.

However, the fact that the threshold is speaker-independent relies on the hypothesis that the mismatch between the likelihood function and the actual client PDF translates into a client-independent shift between the log-PR and the log-LR. This is obviously too simplistic a hypothesis, as the model mismatch is certainly variable across speakers. SI denotes this method without normalisation; SI-N with normalisation.
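Method SI can be sketched as an exhaustive sweep over candidate thresholds on development scores, keeping the one that minimises the HTER of equation (4) (the helper name and toy scores are ours; the paper does not specify the search procedure):

```python
def si_threshold(client_scores, impostor_scores):
    """Pick one speaker-independent threshold minimising the HTER of
    equation (4) on development data: log-LR scores from a development
    population distinct from the actual clients."""
    candidates = sorted(set(client_scores) | set(impostor_scores))
    best_theta, best_hter = None, float("inf")
    for theta in candidates:
        fr = sum(s < theta for s in client_scores) / len(client_scores)
        fa = sum(s >= theta for s in impostor_scores) / len(impostor_scores)
        h = 0.5 * (fr + fa)
        if h < best_hter:
            best_theta, best_hter = theta, h
    return best_theta, best_hter

# Toy development scores: clients score higher than impostors on average.
clients = [1.2, 0.9, 1.5, 0.7, 1.1]
impostors = [-0.3, 0.2, -0.1, 0.4, 0.1]
print(si_threshold(clients, impostors))  # → (0.7, 0.0)
```

The same single threshold is then applied to every client, which is exactly what the client-independent-shift hypothesis criticised above assumes.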

4. SPEAKER-DEPENDENT (SD) THRESHOLD

Conversely, the estimation of a speaker-dependent threshold accounting for the variability in modeling accuracy can be hindered by the lack of proper data for estimating that threshold. Indeed, in the context of practical applications, it is not feasible to reserve any enrollment material for threshold setting. The speaker-dependent threshold must be derived from the enrollment data themselves.

4.1. Method SD-1

Second method, from KUN, identical to Furui's [6]. [To be completed.]

4.2. Method SD-2

Third method, from KUN; best performing. [To be completed.]

4.3. Method SD-3

Tested by Bimbot/Genoud/Pierrot (Eurospeech'97) [5]. [To be completed.]

5. DATABASE

All our experiments on TS were carried out on the realistic telephone speech database SESP. SESP was collected by KPN Research. It contains telephone utterances from 21 male and 20 female speakers calling with different handsets (including some calls from mobile phones) from a wide variety of places (such as restaurants, public phones and airport departure lounges). All the recordings were made between March and … 1994. A substantial proportion of the calls was made from foreign countries. During each call, the speaker was asked to utter a number of items, including a speaker-dependent sequence of 14 digits (twice) and a few other sequences of 14 digits, corresponding to other speakers.

Each session therefore contains 2 utterances of the client card number. For the experiments described in this paper, we used 2 enrollment sessions with a low level of background noise, corresponding to 2 calls placed from 2 different handsets. Two other calls were reserved as extended enrollment material. The rest of the data was used as test material.

The SESP data are very realistic in many aspects.

However, the obvious factor that makes them significantly different from those that could be expected from a field test data collection is the lack of intentional impostor attempts.
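Since the SD method descriptions in Section 4 are left incomplete in this copy, the following is only a hedged sketch in the spirit of the interpolation rule stated in the abstract, Th(new) = b · CTi + (1 - b) · CTt: a speaker-dependent threshold is placed between the central tendency of pre-recorded (pseudo-)impostor log-LR scores and that of the client's own enrollment scores (all names and scores below are illustrative assumptions, not the paper's implementation):

```python
def sd_threshold(client_enrol_scores, impostor_scores, b):
    """Speaker-dependent threshold by linear interpolation,
    Th(new) = b * CT_i + (1 - b) * CT_t, where CT_i is the central
    tendency of pre-recorded (pseudo-)impostor log-LR scores and CT_t
    that of the client's own enrollment scores. The weight b would
    itself be tuned on held-out data not used to estimate CT_i."""
    ct_i = sum(impostor_scores) / len(impostor_scores)
    ct_t = sum(client_enrol_scores) / len(client_enrol_scores)
    return b * ct_i + (1 - b) * ct_t

# Toy example: client enrollment scores sit well above impostor scores,
# so the threshold lands between the two central tendencies.
print(round(sd_threshold([1.0, 1.4, 1.2], [-0.2, 0.0, 0.2], b=0.5), 3))  # → 0.6
```

Using only the means (and not the variances) of the log-LR distributions matches the observation in the results that mean-only methods are estimated more reliably from the small amount of enrollment data.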


