

Title:
VOICE AUTHENTICATION
Document Type and Number:
WIPO Patent Application WO/2003/098373
Kind Code:
A2
Abstract:
A method of voice authentication comprises enrolment and authentication stages. During enrolment, a user is prompted to provide a spoken response which is recorded using a microphone (5). The recorded signal is divided into frames and converted into feature vectors. Feature vectors are concatenated to form featuregrams. Endpoints of speech may be determined using either explicit endpointing based on analysis of the energy of the timeslices or dynamic time warping methods. Featuregrams corresponding to speech are generated and averaged together to produce a speech featuregram archetype. During an authentication stage, a user is prompted to provide a spoken response to the prompt from which a speech featuregram is obtained. The speech featuregram obtained during authentication is compared with the speech featuregram archetype and is scored. The score is evaluated to determine whether the user is a valid user or an impostor.

Inventors:
PHIPPS TIMOTHY (GB)
ROBSON JOHN H (GB)
Application Number:
PCT/GB2003/002246
Publication Date:
November 27, 2003
Filing Date:
May 22, 2003
Assignee:
DOMAIN DYNAMICS LTD (GB)
PHIPPS TIMOTHY (GB)
ROBSON JOHN H (GB)
International Classes:
G10L17/24; G10L25/78; (IPC1-7): G06F/
Domestic Patent References:
WO2003021539A1 (2003-03-13)
Foreign References:
US4827518A (1989-05-02)
US4851654A (1989-07-25)
EP0920674A1 (1999-06-09)
Attorney, Agent or Firm:
Piotrowicz, Pawel (Shipley & Co. 20 Little Britain, London EC1A 7DH, GB)
Claims:
Claims
1. A token storing a voice authentication biometric.
2. A token according to claim 1 for possession by a user.
3. A token according to claim 1 or 2, which is small enough to be kept on a user.
4. A token according to any preceding claim, which is small enough to be worn by a user as jewellery.
5. A token according to any preceding claim, which is small enough to be kept in a pocket of an article of clothing worn by a user.
6. A smart card storing a voice authentication biometric.
7. A token or smart card according to any preceding claim, wherein the voice authentication biometric is for use in authenticating a user using a sample of speech from the user.
8. A token or smart card according to any preceding claim, wherein the voice authentication biometric includes at least one set of feature vectors.
9. A token or smart card according to any preceding claim, wherein the voice authentication biometric includes at least one archetype set of feature vectors.
10. A token or smart card according to claim 8 or 9, wherein the voice authentication biometric includes at least one prompt, each prompt associated with a respective set of feature vectors.
11. A token or smart card according to any one of claims 8 to 10, wherein the voice authentication biometric includes corresponding statistical information relating to each set of feature vectors.
12. A token or smart card according to any preceding claim, wherein the voice authentication biometric includes data for controlling authentication procedure.
13. A token or smart card according to any preceding claim, wherein the voice authentication biometric includes data for determining authentication.
14. A token or smart card according to any preceding claim, wherein the voice authentication biometric includes data for configuring a voice authentication apparatus.
15. A token or smart card according to any preceding claim, wherein the voice authentication biometric is encrypted.
16. A token or smart card according to any preceding claim including non-volatile memory storing the voice authentication biometric.
17. A token or smart card according to any preceding claim storing a computer program comprising program instructions for causing a computer to perform a matching process for use in voice authentication.
18. A token for voice authentication including a processor, the token storing a voice authentication biometric including a first set of feature vectors and a computer program comprising program instructions for causing the processor to perform a method, the method comprising: receiving a second set of feature vectors; and comparing the first and second set of feature vectors.
19. A smart card for voice authentication including a processor, the smart card storing a voice authentication biometric including a first set of feature vectors and storing a computer program comprising program instructions for causing the processor to perform a method, the method comprising: receiving a second set of feature vectors; and comparing the first and second set of feature vectors.
20. A token or smart card according to claim 18 or 19, wherein the computer program comprises program instructions for causing the processor to perform a method, the method comprising : requesting a user to provide a spoken response.
21. A token or smart card according to any one of claims 18 to 20, wherein the computer program comprises program instructions for causing the processor to perform a method, the method comprising: receiving a recorded signal including a recorded signal portion corresponding to a spoken response.
22. A token or smart card according to claim 21, wherein the computer program comprises program instructions for causing the processor to perform a method, the method comprising: determining endpoints of said recorded signal portion corresponding to a spoken response.
23. A token or smart card according to claim 21 or 22, wherein the computer program comprises program instructions for causing the processor to perform a method, the method comprising: deriving said second set of feature vectors for characterising said recorded signal portion.
24. A token or smart card according to any one of claims 18 to 23, wherein the computer program comprises program instructions for causing the processor to perform a method, the method comprising: producing a score dependent upon a degree of matching between said first and second set of feature vectors.
25. A token or smart card according to claim 24, wherein the computer program comprises program instructions for causing the processor to perform a method, the method comprising: comparing the score with a predefined threshold so as to determine authentication of a user.
26. A method of voice authentication, the method comprising in a token or smart card: providing a first set of feature vectors; receiving a second set of feature vectors for characterising a recorded signal portion; and comparing said first and second sets of feature vectors.
27. A method according to claim 26, further comprising: providing data relating to a prompt.
28. A method according to claim 26 or 27, further comprising: receiving a recorded signal including a recorded signal portion corresponding to a spoken response.
29. A method according to claim 28, further comprising: determining endpoints of said recorded signal portion.
30. A method according to claim 28 or 29, further comprising: deriving said second set of feature vectors for characterising said recorded signal portion.
31. A method according to any one of claims 26 to 30, further comprising: producing a score dependent upon a degree of matching between said first and second set of feature vectors.
32. A method according to claim 31, further comprising: comparing the score with a predefined threshold so as to determine authentication of a user.
33. A method according to any one of claims 26 to 32, further comprising: receiving a recorded signal which includes a recorded signal portion corresponding to a spoken response and which includes a plurality of frames; determining endpoints of said recorded signal including: determining whether a value of energy for a first frame exceeds a first predetermined value; and determining whether a second frame immediately preceding the first frame represents a spoken utterance portion.
34. A method according to any one of claims 26 to 33, further comprising: requesting said authenticating user to provide first and second spoken responses to said prompt; obtaining a recorded signal including first and second recorded signal portions corresponding to said first and second spoken responses; isolating said first and second recorded signal portions; deriving second and third sets of feature vectors for characterising said first and second isolated recorded signal portions respectively; comparing said second set of feature vectors with said third set of feature vectors so as to produce a score dependent upon the degree of matching; and comparing the score with a predefined threshold so as to determine whether the first set of feature vectors is substantially identical to the second set of feature vectors.
35. A method according to any one of claims 26 to 34, further comprising: requesting a user to provide a plurality of spoken responses to a prompt; obtaining a plurality of corresponding recorded signals, each recorded signal including a recorded signal portion corresponding to a respective spoken response; deriving a plurality of sets of feature vectors, each set of feature vectors for characterising a respective recorded signal portion; comparing said sets of feature vectors with said first set of feature vectors so as to produce a plurality of scores dependent upon a degree of matching and determining whether authentication is successful in dependence upon said plurality of scores.
36. A method according to any one of claims 26 to 35, further comprising: receiving a recorded signal which includes a recorded signal portion; determining endpoints of said recorded signal by dynamic time warping said second set of feature vectors onto said first set of feature vectors, including: determining a first subset of feature vectors within said second set of feature vectors from which a dynamic time warping winning path may start and determining a second subset of feature vectors within said second set of feature vectors at which the dynamic time warping winning path may finish.
37. A method of determining an endpoint of a recorded signal portion in a recorded signal including a plurality of frames, the method comprising: determining whether a value of energy for a first frame exceeds a first predetermined value; and determining whether a second frame immediately preceding the first frame represents a spoken utterance portion.
38. A method according to claim 37, wherein the first predetermined value represents a value of energy of a frame comprised of background noise.
39. A method according to claim 37 or 38 comprising: defining a start point if the value of energy of the first frame exceeds the first predetermined value and the second frame does not represent a spoken utterance portion.
40. A method according to claim 39, further comprising : indicating that the first frame represents a spoken utterance portion.
41. A method according to any one of claims 37 to 39 comprising: defining a stop point if the value of energy of the first frame does not exceed the first predetermined value and the second frame represents a spoken utterance portion.
42. A method according to claim 41, further comprising: defining the first frame as not representing a spoken utterance portion.
43. A method according to claim 41 or 42, further comprising: counting a number of frames preceding a start point of the spoken utterance portion.
44. A method according to claim 43 further comprising: pairing the stop point with said start point of the spoken utterance portion if the number of frames exceeds a predetermined number.
45. A method according to claim 43 further comprising: pairing the stop point with a start point of a preceding spoken utterance portion if the number of frames does not exceed a predetermined number.
46. A method according to claim 37 or 38 comprising: determining whether the value of energy for a first frame exceeds a third predetermined value; and counting a number of frames preceding a start point of the spoken utterance portion.
47. A method according to claim 46 further comprising: defining a start point if the value of energy of the first frame exceeds the third predetermined value, the second frame does not represent a spoken utterance portion and if the number of frames does not exceed a predetermined number.
48. A method according to claim 47 further comprising: determining whether a value of energy for a third frame following said first frame exceeds the second predetermined value.
49. A method according to claim 48 further comprising: defining a stop point if the value of energy of the third frame does not exceed the third predetermined value.
50. A method according to claim 49 further comprising: pairing the stop point with the start point of the spoken utterance portion.
51. A method according to claim 50 further comprising: pairing the stop point with a start point of a preceding spoken utterance portion.
52. A method according to claim 42 comprising: defining the first frame as representing background noise if the value of energy of the first frame does not exceed the third predetermined value.
53. A method according to claim 52 further comprising: calculating an updated value of background energy using said value of energy of the first frame.
54. A method according to claim 53, further comprising: counting a number of frames preceding a start point of the spoken utterance portion and determining whether said number of frames exceeds another, large number.
55. A method according to any one of claims 37 to 54 comprising: determining whether a value of rate of change of energy of the first frame exceeds a second predetermined value.
56. A method according to claim 55, wherein the second predetermined value represents a value of rate of change of energy of a frame comprised of background noise.
57. A method according to claim 55 or 56, comprising: defining a start point if the value of energy of the first frame exceeds the first predetermined value, and the value of rate of change of energy exceeds the second predetermined value and the second frame does not represent a spoken utterance portion.
58. A method according to claim 55 or 57, comprising : defining a stop point if the value of energy of the first frame does not exceed the first predetermined value, and the value of rate of change of energy does not exceed the second predetermined value and the second frame represents a spoken utterance portion.
59. A method according to any one of claims 55 to 58 comprising : determining whether the value of rate of change of energy for the first frame exceeds a fourth predetermined value.
60. A method of dynamic time warping for warping a first speech pattern (B) characterised by a first set of feature vectors onto a second speech pattern (A) characterised by a second set of feature vectors, the method comprising : identifying a first subset of feature vectors within said first set of feature vectors from which a dynamic time warping winning path starts and identifying a second subset of feature vectors within said first set of feature vectors at which the dynamic time warping winning path finishes.
61. A method of voice authentication comprising: enrolling a user including: requesting said enrolling user to provide a spoken response to a prompt; obtaining a recorded signal including a recorded signal portion corresponding to said spoken response; determining endpoints of said recorded signal portion; deriving a set of feature vectors for characterising said recorded signal portions; averaging a plurality of sets of feature vectors, each set of feature vectors relating to one or more different spoken responses to the prompt by said enrolling user so as to provide an archetype set of feature vectors for said response; storing said archetype set of feature vectors together with data relating to said prompt; authenticating a user including: retrieving said data relating to said prompt and said archetype set of feature vectors; requesting said authenticating user to provide another spoken response to said prompt; obtaining another recorded signal including another recorded signal portion corresponding to said other spoken response; determining endpoints of said other recorded signal portion; deriving another set of feature vectors for characterising said other recorded signal portions; comparing said another set of feature vectors with said archetype set of feature vectors so as to produce a score dependent upon a degree of matching; and comparing said score with a predefined threshold so as to determine whether said enrolling user and said authenticating user are the same.
62. A method of gain control comprising a plurality of times: determining whether an amplified signal level is above a predetermined limit; either decreasing gain if the amplified signal level is above the predetermined limit or maintaining gain if otherwise; thereby permitting no increase in gain.
63. A method of gain control comprising a plurality of times: determining whether an amplified signal level is below a predetermined limit; either increasing gain if the amplified signal level is below the predetermined limit or maintaining gain if otherwise; thereby permitting no decreases in gain.
64. A method of voice authentication comprising: requesting a user to provide first and second spoken responses to a prompt; obtaining a recorded signal including first and second recorded signal portions corresponding to said first and second spoken responses; isolating said first and second recorded signal portions; deriving first and second sets of feature vectors for characterising said first and second isolated recorded signal portions respectively; comparing said first set of feature vectors with said second set of feature vectors so as to produce a second score dependent upon the degree of matching; and comparing the second score with another predefined threshold so as to determine whether the first set of feature vectors is substantially identical to the second set of feature vectors.
65. A method of voice authentication including: requesting an authenticating user to provide a plurality of spoken responses to a prompt; obtaining a plurality of corresponding recorded signals, each recorded signal including a recorded signal portion corresponding to a respective spoken response; deriving a plurality of sets of feature vectors, each set of feature vectors for characterising a respective recorded signal portion; comparing said sets of feature vectors with an archetype set of feature vectors so as to produce a plurality of scores dependent upon a degree of matching and determining whether authentication is successful in dependence upon said plurality of scores.
66. A method of determining an authentication threshold score, the method including: requesting a first set of users to provide respective spoken responses to a prompt; for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response; for each user, deriving a set of feature vectors for characterising the recorded signal portion; for each user, comparing said set of feature vectors with an archetype set of feature vectors for said user so as to produce a score dependent upon a degree of matching; fitting a first probability density function to frequency of scores for said first set of users; requesting a second set of users to provide respective spoken responses to a prompt; for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response; for each user, deriving a set of feature vectors for characterising the recorded signal portion; for each user, comparing said set of feature vectors with an archetype set of feature vectors for a different user so as to produce a score dependent upon a degree of matching; fitting a second probability density function to frequency of scores for said set of users.
67. A method of averaging a plurality of feature vectors, the method comprising: providing a plurality of feature vectors; comparing each set of feature vectors with each other set of feature vectors so as to produce a respective set of scores dependent upon a degree of matching; searching for a minimum score; determining whether at least one score is below a predetermined threshold.
68. A smart card for voice authentication comprising: means for storing a first set of feature vectors and data relating to a prompt; means for providing said data to an external circuit; means for receiving a second set of feature vectors relating to said prompt; means for comparing said first and second set of feature vectors so as to determine a score; and means for comparing said score with a predetermined threshold.
69. A smart card for voice authentication comprising: a memory for storing a first set of feature vectors and data relating to a prompt; an interface for providing said data to an external circuit and for receiving a second set of feature vectors relating to said prompt; a processor for comparing said first and second set of feature vectors so as to determine a score and for comparing said score with a predetermined threshold.
70. An information storage medium storing a voice authentication biometric.
71. A medium according to claim 70, which is portable.
72. A computer program comprising program instructions for causing a smart card to perform a method, the method comprising: retrieving from memory a first set of feature vectors; receiving a second set of feature vectors; and comparing the first and second set of feature vectors.
73. A method comprising writing at least part of a voice authentication biometric to a smart card or token.
74. A method according to claim 73, wherein said at least part of a voice authentication biometric is a set of feature vectors.
75. A method comprising writing a computer program to a smart card or token, the computer program comprising computer instructions for performing a method, the method comprising performing voice authentication.
76. A method according to claims 73 and 75.
77. A smart card for voice authentication including a processor, the smart card storing a computer program comprising program instructions for causing the processor to perform a method, the method comprising performing voice authentication.
78. A smart card reader/writer connected to apparatus for recording speech and generating feature vectors, said reader/writer being configured to transmit a set of feature vectors to a smart card or token and receive a response therefrom.
Description:
Voice authentication

Field of the Invention

The present invention relates to voice authentication.

Background Art

Voice authentication may be defined as a process in which a user's identity is validated by analysing the user's speech patterns. Such a process may be used for controlling access to a system, such as a personal computer, cellular telephone handset or telephone banking account.

Aspects of voice authentication are known in voice recognition systems. Examples of voice recognition systems are described in US-A-4956865, US-A-507939, US-A-5845092 and WO-A-0221513.

Summary of the Invention

The present invention seeks to provide voice authentication.

According to the present invention there is provided a token storing a voice authentication biometric. The token may be suitable for possession by the user and may be small enough to be kept on a user, worn by the user as jewellery or kept in a pocket of an article of clothing worn by them. The token may be an information storage medium or device.

According to the present invention there is also provided a smart card storing a voice authentication biometric.

Storing a voice authentication biometric on a smart card can help validate that a smart card user is the smart card owner.

The voice authentication biometric may be suitable for use in authenticating a user using a sample of speech from the user. The voice authentication biometric may include at least one set of feature vectors, such as an archetype set of feature vectors.


The voice authentication biometric may include at least one prompt, each prompt associated with a respective set of feature vectors. The voice authentication biometric may include corresponding statistical information relating to each set of feature vectors. The voice authentication biometric may include data for controlling authentication procedure, data for determining authentication and/or data for configuring a voice authentication apparatus. The voice authentication biometric may be encrypted. The token or smart card may include non-volatile memory storing the voice authentication biometric. The token or smart card may store a computer program comprising program instructions for causing a computer to perform a matching process for use in voice authentication.

According to the present invention there is provided a token for voice authentication including a processor, the token storing a voice authentication biometric including a first set of feature vectors and a computer program comprising program instructions for causing the processor to perform a method, the method comprising receiving a second set of feature vectors and comparing the first and second set of feature vectors.

According to the present invention there is provided a smart card for voice authentication including a processor, the smart card storing a voice authentication biometric including a first set of feature vectors and storing a computer program comprising program instructions for causing the processor to perform a method, the method comprising receiving a second set of feature vectors and comparing the first and second set of feature vectors.

The computer program may comprise program instructions for causing the processor to perform a method, the method comprising requesting a user to provide a spoken response. The computer program may comprise program instructions for causing the processor to perform a method, the method comprising receiving a recorded signal including a recorded signal portion corresponding to a spoken response. The computer program may comprise program instructions for causing the processor to perform a method, the method comprising determining endpoints of the recorded signal portion corresponding to a spoken response. The computer program may comprise program instructions for causing the processor to perform a method, the method comprising deriving the second set of feature vectors for characterising the recorded signal portion. The computer program may comprise program instructions for causing the processor to perform a method, the method comprising producing a score dependent upon a degree of matching between the first and second set of feature vectors. The computer program may comprise program instructions for causing the processor to perform a method, the method comprising comparing the score with a predefined threshold so as to determine authentication of a user.
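
For illustration, the on-card comparison described above might look like the following sketch. The frame distance, the frame-by-frame alignment (rather than a time-warped alignment) and the threshold are assumptions made for brevity, not details taken from this application.

```python
# Hypothetical sketch of the on-token matching step described above: the token
# holds an enrolled (first) set of feature vectors and scores an incoming
# (second) set against it. The Euclidean frame distance and the pass threshold
# are illustrative assumptions.
from typing import List, Sequence

def frame_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance between two feature vectors (assumed Euclidean here)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def compare_feature_sets(first: List[Sequence[float]],
                         second: List[Sequence[float]]) -> float:
    """Produce a score dependent on the degree of matching.

    A fuller implementation would align the two sets (e.g. by dynamic time
    warping); here they are compared frame by frame purely for illustration.
    """
    n = min(len(first), len(second))
    return sum(frame_distance(first[i], second[i]) for i in range(n)) / n

def authenticate_on_token(stored: List[Sequence[float]],
                          received: List[Sequence[float]],
                          threshold: float = 1.0) -> bool:
    """Lower score means a closer match; the threshold value is illustrative."""
    return compare_feature_sets(stored, received) <= threshold
```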

According to the present invention there is also provided a method of voice authentication, the method comprising, in a token or smart card, providing a first set of feature vectors, receiving a second set of feature vectors for characterising a recorded signal portion and comparing the first and second sets of feature vectors.

The method may further comprise providing data relating to a prompt. The method may further comprise receiving a recorded signal including a recorded signal portion corresponding to a spoken response. The method may further comprise determining endpoints of the recorded signal portion. The method may further comprise deriving the second set of feature vectors for characterising the recorded signal portion. The method may further comprise producing a score dependent upon a degree of matching between the first and second set of feature vectors. The method may further comprise comparing the score with a predefined threshold so as to determine authentication of a user. The method may further comprise receiving a recorded signal which includes a recorded signal portion corresponding to a spoken response and which includes a plurality of frames, determining endpoints of the recorded signal including determining whether a value of energy for a first frame exceeds a first predetermined value and determining whether a second frame immediately preceding the first frame represents a spoken utterance portion. The method may further comprise requesting the authenticating user to provide first and second spoken responses to the prompt, obtaining a recorded signal including first and second recorded signal portions corresponding to the first and second spoken responses, isolating the first and second recorded signal portions, deriving second and third sets of feature vectors for characterising the first and second isolated recorded signal portions respectively, comparing the second set of feature vectors with the third set of feature vectors so as to produce a score dependent upon the degree of matching; and comparing the score with a predefined threshold so as to determine whether the first set of feature vectors is substantially identical to the second set of feature vectors. The method may further comprise requesting a user to provide a plurality of spoken responses to a prompt, obtaining a plurality of corresponding recorded signals, each recorded signal including a recorded signal portion corresponding to a respective spoken response, deriving a plurality of sets of feature vectors, each set of feature vectors for characterising a respective recorded signal portion, comparing the sets of feature vectors with the first set of feature vectors so as to produce a plurality of scores dependent upon a degree of matching and determining whether authentication is successful in dependence upon the plurality of scores. The method may further comprise receiving a recorded signal which includes a recorded signal portion, determining endpoints of the recorded signal by dynamic time warping the second set of feature vectors onto the first set of feature vectors, including determining a first sub-set of feature vectors within the second set of feature vectors from which a dynamic time warping winning path may start and determining a second sub-set of feature vectors within the second set of feature vectors at which the dynamic time warping winning path may finish.

Endpointing seeks to locate a start and stop point of a spoken utterance.

The present invention seeks to provide an improved method of endpointing.

According to the present invention there is also provided a method of determining an endpoint of a recorded signal portion in a recorded signal including a plurality of frames, the method comprising determining whether a value of energy for a first frame exceeds a first predetermined value and determining whether a second frame immediately preceding the first frame represents a spoken utterance portion.

The first predetermined value may represent a value of energy of a frame comprised of background noise. The method may comprise defining a start point if the value of energy of the first frame exceeds the first predetermined value and the second frame does not represent a spoken utterance portion. The method may further comprise indicating that the first frame represents a spoken utterance portion. The method may comprise defining a stop point if the value of energy of the first frame does not exceed the first predetermined value and the second frame represents a spoken utterance portion. The method may comprise defining the first frame as not representing a spoken utterance portion. The method may comprise counting a number of frames preceding a start point of the spoken utterance portion. The method may further comprise pairing the stop point with the start point of the spoken utterance portion if the number of frames exceeds a predetermined number.

The method may further comprise pairing the stop point with a start point of a preceding spoken utterance portion if the number of frames does not exceed a predetermined number. The method may comprise determining whether the value of energy for a first frame exceeds a third predetermined value and counting a number of frames preceding a start point of the spoken utterance portion. The method may further comprise defining a start point if the value of energy of the first frame exceeds the third predetermined value, the second frame does not represent a spoken utterance portion and if the number of frames does not exceed a predetermined number. The method may further comprise determining whether a value of energy for a third frame following the first frame exceeds the second predetermined value. The method may further comprise defining a stop point if the value of energy of the third frame does not exceed the third predetermined value.

The method may further comprise pairing the stop point with the start point of the spoken utterance portion. The method may further comprise pairing the stop point with a start point of a preceding spoken utterance portion. The method may comprise defining the first frame as representing background noise if the value of energy of the first frame does not exceed the third predetermined value. The method may further comprise calculating an updated value of background energy using the value of energy of the first frame. The method may further comprise counting a number of frames preceding a start point of the spoken utterance portion and determining whether the number of frames exceeds another, large number. The method may comprise determining whether a value of rate of change of energy of the first frame exceeds a second predetermined value. The second predetermined value may represent a value of rate of change of energy of a frame comprised of background noise. The method may further comprise defining a start point if the value of energy of the first frame exceeds the first predetermined value, and the value of rate of change of energy exceeds the second predetermined value and the second frame does not represent a spoken utterance portion. The method may comprise defining a stop point if the value of energy of the first frame does not exceed the first predetermined value, and the value of rate of change of energy does not exceed the second predetermined value and the second frame represents a spoken utterance portion. The method may comprise determining whether the value of rate of change of energy for the first frame exceeds a fourth predetermined value.
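
The basic start/stop decision summarised above can be sketched as follows. This is a minimal illustration assuming a single background-energy threshold; the rate-of-change tests, the additional thresholds and the frame counting described above are omitted.

```python
# A minimal sketch of explicit endpointing: a start point is declared when a
# frame's energy rises above a threshold while the preceding frame was not
# speech, and a stop point when the energy falls back below the threshold
# while the preceding frame was speech.
from typing import List, Tuple

def find_endpoints(frame_energies: List[float],
                   background_energy: float) -> List[Tuple[int, int]]:
    """Return (start, stop) frame index pairs for detected utterance portions."""
    endpoints = []
    in_speech = False          # whether the preceding frame represented speech
    start = 0
    for i, energy in enumerate(frame_energies):
        if energy > background_energy and not in_speech:
            start = i                           # define a start point
            in_speech = True
        elif energy <= background_energy and in_speech:
            endpoints.append((start, i))        # define a stop point, pair with start
            in_speech = False
    if in_speech:                               # utterance ran to the end of the recording
        endpoints.append((start, len(frame_energies) - 1))
    return endpoints
```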

Many voice recognition and authentication systems use dynamic time warping to match a recording to a template. However, a user may pause, cough, sigh or generate other sounds before or after providing a response to a prompt. These silences or sounds may be included in the recording. Thus, only a portion of the recording is relevant.

The present invention seeks to provide a solution to this problem.

According to the present invention there is provided a method of dynamic time warping for warping a first speech pattern characterised by a first set of feature vectors onto a second speech pattern characterised by a second set of feature vectors, the method comprising identifying a first sub-set of feature vectors within the first set of feature vectors from which a dynamic time warping winning path starts and identifying a second sub-set of feature vectors within the first set of feature vectors at which the dynamic time warping winning path finishes.

The first speech pattern may include speech, background noise and/or silence.
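
As an illustration of the idea, the sketch below performs dynamic time warping in which the winning path may start within a designated sub-set of the recorded pattern's first frames and finish within a sub-set of its last frames, so that leading or trailing non-speech can be bypassed. The local distance measure, the simplified slope handling and the size of the sub-sets are assumptions made for the sketch.

```python
# A sketch of dynamic time warping with constrained start/end sub-sets on the
# recorded pattern B, warped onto a template A. Frame distances, the step
# pattern and the `edge` size are illustrative assumptions.
import numpy as np

def dtw_word_spot(b: np.ndarray, a: np.ndarray, edge: int = 10) -> float:
    """Warp pattern B (frames x features) onto template A and return the best cost.

    The path may start at any of B's first `edge` frames and finish at any of
    B's last `edge` frames; A is consumed from its first frame to its last.
    """
    nb, na = len(b), len(a)
    # local distances between every frame of B and every frame of A
    d = np.linalg.norm(b[:, None, :] - a[None, :, :], axis=2)
    acc = np.full((nb, na), np.inf)
    # the winning path may start at (i, 0) for any frame i in B's start sub-set
    acc[:min(edge, nb), 0] = d[:min(edge, nb), 0]
    for i in range(nb):
        for j in range(1, na):
            best_prev = acc[i, j - 1]                       # horizontal step
            if i > 0:
                best_prev = min(best_prev,
                                acc[i - 1, j - 1],          # diagonal step
                                acc[i - 1, j])              # vertical step
            acc[i, j] = d[i, j] + best_prev
    # the winning path may finish at (i, na - 1) for any i in B's end sub-set
    return float(acc[max(nb - edge, 0):, na - 1].min())
```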

The present invention seeks to provide a method of voice authentication.

According to the present invention there is provided a method of voice authentication comprising: enrolling a user including requesting the enrolling user to provide a spoken response to a prompt, obtaining a recorded signal including a recorded signal portion corresponding to the spoken response, determining endpoints of the recorded signal portion, deriving a set of feature vectors for characterising the recorded signal portions, averaging a plurality of sets of feature vectors, each set of feature vectors relating to one or more different spoken responses to the prompt by the enrolling user so as to provide an archetype set of feature vectors for the response, storing the archetype set of feature vectors together with data relating to the prompt; and authenticating a user including retrieving the data relating to the prompt and the archetype set of feature vectors, requesting the authenticating user to provide another spoken response to the prompt, obtaining another recorded signal including another recorded signal portion corresponding to the other spoken response, determining endpoints of the other recorded signal portion, deriving another set of feature vectors for characterising the other recorded signal portions, comparing the another set of feature vectors with the archetype set of feature vectors so as to produce a score dependent upon a degree of matching and comparing the score with a predefined threshold so as to determine whether the enrolling user and the authenticating user are the same.
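
At a high level, this enrolment and authentication flow might be orchestrated as in the following sketch. The helper functions (record_response, find_speech, make_featuregram, average_featuregrams, score_match) are hypothetical placeholders for stages described elsewhere in this application, and the pass threshold is illustrative.

```python
# A high-level sketch of the enrol/authenticate flow; all helpers are
# hypothetical placeholders passed in by the caller.
def enrol(prompt, n_specimens, record_response, find_speech,
          make_featuregram, average_featuregrams):
    featuregrams = []
    for _ in range(n_specimens):
        signal = record_response(prompt)            # spoken response to the prompt
        start, stop = find_speech(signal)           # determine endpoints
        featuregrams.append(make_featuregram(signal[start:stop]))
    archetype = average_featuregrams(featuregrams)  # featuregram archetype
    return {"prompt": prompt, "archetype": archetype}

def authenticate(biometric, threshold, record_response, find_speech,
                 make_featuregram, score_match):
    signal = record_response(biometric["prompt"])
    start, stop = find_speech(signal)
    featuregram = make_featuregram(signal[start:stop])
    score = score_match(featuregram, biometric["archetype"])
    return score <= threshold                       # lower score = closer match
```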

Voice authentication systems typically include an amplifier. If a user provides a spoken response which is too quiet, then amplifier gain may be increased.

Conversely, if a spoken response is too loud, then amplifier gain may be reduced.

Usually, a succession of samples is taken and amplifier gain is increased or reduced accordingly until a settled value of amplifier gain is obtained. However, there is a danger that the amplifier gain rises and falls and never settles.

The present invention seeks to ameliorate this problem.

According to the present invention there is provided a method of gain control comprising a plurality of times determining whether an amplified signal level is above a predetermined limit, either decreasing gain if the amplified signal level is above the predetermined limit or maintaining gain if otherwise, thereby permitting no increase in gain.

According to the present invention there is provided a method of gain control comprising a plurality of times determining whether an amplified signal level is below a predetermined limit, either increasing gain if the amplified signal level is below the predetermined limit or maintaining gain if otherwise, thereby permitting no decreases in gain.

A potential threat to the security offered by any voice authentication system is the possibility of an impostor secretly recording a spoken response of a valid user and subsequently replaying a recording to gain access to the system. This is known as a "replay attack". The present invention seeks to help detect a replay attack.

According to the present invention there is provided a method of voice authentication comprising requesting a user to provide first and second spoken responses to a prompt, obtaining a recorded signal including first and second recorded signal portions corresponding to the first and second spoken responses, isolating the first and second recorded signal portions, deriving first and second sets of feature vectors for characterising the first and second isolated recorded signal portions respectively, comparing the first set of feature vectors with the second set of feature vectors so as to produce a second score dependent upon the degree of matching and comparing the second score with another predefined threshold so as to determine whether the first set of feature vectors is substantially identical to the second set of feature vectors.
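
A minimal sketch of such a check is shown below: the two responses are scored against each other and an implausibly close match is flagged. The score_match helper and the similarity threshold are assumptions for illustration.

```python
# A sketch of the replay-attack check: two spoken responses to the same prompt
# are scored against each other, and a suspiciously close match suggests the
# same recording was played twice.
def looks_like_replay(featuregram_1, featuregram_2, score_match,
                      similarity_threshold: float = 0.05) -> bool:
    """Return True if the two responses are substantially identical."""
    score = score_match(featuregram_1, featuregram_2)
    return score < similarity_threshold   # live speech is rarely this consistent
```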

During authentication, users may occasionally provide an uncharacteristic spoken response.

The present invention seeks to provide an improved method of dealing with uncharacteristic responses.

According to the present invention there is provided a method of voice authentication including requesting an authenticating user to provide a plurality of spoken responses to a prompt, obtaining a plurality of corresponding recorded signals, each recorded signal including a recorded signal portion corresponding to a respective spoken response, deriving a plurality of sets of feature vectors, each set of feature vectors for characterising a respective recorded signal portion, comparing the sets of feature vectors with an archetype set of feature vectors so as to produce a plurality of scores dependent upon a degree of matching and determining whether authentication is successful in dependence upon the plurality of scores.

If a user provides a spoken response during authentication, a threshold score is usually generated.

The present invention seeks to provide an improved method of determining an authentication threshold score.

According to the present invention there is provided a method of determining an authentication threshold score, the method including requesting a first set of users to provide respective spoken responses to a prompt, for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response, for each user, deriving a set of feature vectors for characterising the recorded signal portion, for each user, comparing the set of feature vectors with an archetype set of feature vectors for the user so as to produce a score dependent upon a degree of matching, fitting a first probability density function to frequency of scores for the first set of users, requesting a second set of users to provide respective spoken responses to a prompt, for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response, for each user, deriving a set of feature vectors for characterising the recorded signal portion, for each user, comparing the set of feature vectors with an archetype set of feature vectors for a different user so as to produce a score dependent upon a degree of matching, fitting a second probability density function to frequency of scores for the set of users.
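
One way such a threshold might be derived from the two fitted distributions is sketched below. Fitting normal distributions and taking the score at which the two densities cross are assumptions made for illustration; the description above specifies only that a probability density function is fitted to each set of scores.

```python
# A sketch of deriving a threshold from genuine-user and impostor score
# distributions. Normal fits and the crossing point are illustrative choices.
import numpy as np
from scipy.stats import norm

def choose_threshold(genuine_scores, impostor_scores):
    """Pick the score at which the two fitted densities intersect."""
    g = norm(np.mean(genuine_scores), np.std(genuine_scores) + 1e-9)
    i = norm(np.mean(impostor_scores), np.std(impostor_scores) + 1e-9)
    lo = min(np.min(genuine_scores), np.min(impostor_scores))
    hi = max(np.max(genuine_scores), np.max(impostor_scores))
    xs = np.linspace(lo, hi, 1000)
    # the score where the two densities are closest to equal
    return float(xs[np.argmin(np.abs(g.pdf(xs) - i.pdf(xs)))])
```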

The present invention seeks to provide a method of averaging a plurality of feature vectors.

According to the present invention there is provided a method of averaging a plurality of feature vectors, the method comprising providing a plurality of feature vectors, comparing each set of feature vectors with each other set of feature vectors so as to produce a respective set of scores dependent upon a degree of matching, searching for a minimum score and determining whether at least one score is below a predetermined threshold.
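
A sketch of the selection step this implies is given below: each featuregram is scored against every other, the minimum score is located and averaging proceeds only if at least one pairing is sufficiently consistent. The score_match helper and the threshold are illustrative assumptions.

```python
# A sketch of pairwise scoring prior to averaging: find the closest-matching
# pair and check it against a consistency threshold.
from itertools import combinations

def consistent_pair(featuregrams, score_match, threshold):
    scores = {(i, j): score_match(featuregrams[i], featuregrams[j])
              for i, j in combinations(range(len(featuregrams)), 2)}
    if not scores:
        return None                         # fewer than two featuregrams supplied
    best_pair = min(scores, key=scores.get)
    if scores[best_pair] >= threshold:      # no pairing is consistent enough
        return None
    return best_pair                        # indices of the closest-matching pair
```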

According to the present invention there is provided a smart card for voice authentication comprising means for storing a first set of feature vectors and data relating to a prompt, means for providing the data to an external circuit, means for receiving a second set of feature vectors relating to the prompt, means for comparing the first and second set of feature vectors so as to determine a score; and means for comparing the score with a predetermined threshold.

According to the present invention there is provided a smart card for voice authentication comprising a memory for storing a first set of feature vectors and data relating to a prompt, an interface for providing the data to an external circuit and for receiving a second set of feature vectors relating to the prompt, a processor for comparing the first and second set of feature vectors so as to determine a score and for comparing the score with a predetermined threshold.
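
The arrangement might be sketched as follows, with the memory, interface and processor roles mapped onto a simple class. The layout and the score_match helper are illustrative assumptions rather than a definitive implementation.

```python
# A sketch of the smart-card arrangement: stored featuregram plus prompt data,
# an interface that hands the prompt out and accepts feature vectors back, and
# a processor that scores the comparison against a stored threshold.
class VoiceAuthCard:
    def __init__(self, archetype, prompt_data, threshold, score_match):
        self._archetype = archetype          # first set of feature vectors
        self._prompt_data = prompt_data      # data relating to the prompt
        self._threshold = threshold
        self._score_match = score_match

    def get_prompt(self):
        """Interface: provide prompt data to the external circuit."""
        return self._prompt_data

    def verify(self, feature_vectors) -> bool:
        """Interface: receive the second set of feature vectors and decide."""
        score = self._score_match(self._archetype, feature_vectors)
        return score <= self._threshold      # lower score = closer match
```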

According to the present invention there is also provided an information storage medium storing a voice authentication biometric. Preferably, the information storage medium is portable and may be, for example, a memory stick.

According to the present invention there is also provided a computer program comprising program instructions for causing a smart card to perform a method, the method comprising retrieving from memory a first set of feature vectors, receiving a second set of feature vectors and comparing the first and second set of feature vectors.

According to the present invention there is also provided a method comprising writing at least part of a voice authentication biometric to a smart card or token.

The at least part of a voice authentication biometric may be a set of feature vectors.

According to the present invention there is also provided a method comprising writing a computer program to a smart card or token, the computer program comprising computer instructions for performing a method, the method comprising performing voice authentication.

According to the present invention there is also provided a method comprising writing at least part of a voice authentication biometric to a smart card or token and writing a computer program to said smart card or token, the computer program comprising computer instructions for performing a method, the method comprising performing voice authentication.

According to the present invention there is also provided a smart card for voice authentication including a processor, the smart card storing a computer program comprising program instructions for causing the processor to perform a method, the method comprising performing voice authentication.

According to the present invention there is also provided a smart card reader/writer connected to apparatus for recording speech and generating feature vectors, said reader/writer being configured to transmit a set of feature vectors to a smart card or token and receive a response therefrom.

Brief Description of the Drawings

Embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings in which:

Figure 1 shows a voice authentication system 1 for performing a method of voice authentication;
Figure 2 is a process flow diagram of a method of voice authentication;
Figure 3 is a process flow diagram of a method of enrolment;
Figure 4 is a process flow diagram of a method of calibration;
Figure 5 is an analog representation of a recorded signal;
Figure 6 is a generic representation of a recorded signal;
Figure 7 is a digital representation of a recorded signal;
Figure 8 illustrates dividing a recorded signal into timeslices;
Figure 9 is a process flow diagram of a method of generating a featuregram;
Figure 10 illustrates generation of a feature vector;
Figure 11 illustrates generation of a featuregram from a plurality of feature vectors;
Figure 12 shows first and second endpointing processes;
Figure 13 illustrates explicit endpointing;
Figure 14 is a process flow diagram of a method of explicit endpointing;
Figure 15 illustrates determination of energy and delta energy values of a timeslice;
Figure 16 shows pairing of a stop point with two start points;
Figure 17 shows pairing of a stop point with a start point of a preceding section;
Figure 18 is a process flow diagram of a method of detecting lip smack;
Figure 19 shows pairing of a stop point with an updated start point for removing lip smack;
Figure 20 illustrates a dynamic time warping process for word spotting;
Figure 21 shows a warping function from a start point to an end point;
Figure 22 illustrates a local slope constraint on a warping function;
Figure 23 illustrates a global condition imposed on a warping function;
Figure 24 is a process flow diagram of a method of finding a minimum distance for an optimised path from a start point to an end point representing matched speech patterns;
Figure 25a shows an array following initialisation for holding a cumulative distance associated with a path from a start point to an end point;
Figure 25b shows an array for holding a cumulative distance associated with a path from a start point to an end point;
Figure 25c shows a completed array for holding a cumulative distance associated with a path from a start point to an end point including a winning path;
Figure 26 shows a process flow diagram of a method of performing a plurality of sanity checks;
Figure 27 illustrates creation of a speech featuregram;
Figure 28 illustrates generation of a speech featuregram archetype;
Figure 29 is a process flow diagram of a method of generating a speech featuregram archetype;
Figure 30 illustrates generation of a featuregram cost matrix;
Figure 31 shows a featuregram cost matrix;
Figure 32 is a process flow diagram of a method of finding a minimum distance for an optimised path from a start point to an end point representing matched speech patterns;
Figure 33 illustrates creation of featuregram archetypes using featuregrams;
Figure 34 illustrates generation of a featuregram archetype cost matrix;
Figure 35 shows a featuregram archetype cost matrix;
Figure 36 shows a probability distribution function;
Figure 37 shows a continuous distribution function;
Figure 38 shows a voice authentication biometric;
Figure 39 is a process flow diagram of a method of authentication;
Figure 40 is an analog representation of an authentication recorded signal;
Figure 41 illustrates dividing an authentication recorded signal into timeslices;
Figure 42 illustrates generation of an authentication feature vector;
Figure 43 illustrates generation of an authentication featuregram from a plurality of feature vectors;
Figure 44 illustrates generation of endpoints;
Figure 45 illustrates comparison of a featuregram archetype with an authentication featuregram;
Figure 46 illustrates a featuregram including first and second spoken responses of the same prompt for detecting replay attack;
Figure 47 is a process flow diagram of a method of detecting replay attack;
Figure 48 shows a voice authentication system employing a smart card;
Figure 49 illustrates a contact smart card;
Figure 50 illustrates a contactless smart card;
Figure 51 is a schematic diagram showing a smart card reader and a smart card;
Figure 52 is an application program data unit (APDU) table;
Figure 53 shows exchange of messages between a laptop and a smart card during template loading;
Figure 54 shows a first exchange of messages between a laptop and a smart card during authentication; and
Figure 55 shows a second exchange of messages between a laptop and a smart card during authentication.

Detailed Description of the Invention

Voice authentication system 1

Referring to Figure 1, a voice authentication system 1 for performing a method of voice authentication is shown. The voice authentication system 1 limits access by a user 2 to a secure system 3. The secure system 3 may be physical, such as a room or building, or logical, such as a computer system, cellular telephone handset or bank account. The voice authentication system 1 is managed by a system administrator 4.

The voice authentication system 1 includes a microphone 5 into which a user may provide a spoken response and which converts a sound signal into an electrical signal, an amplifier 6 for amplifying the electrical signal, an analog-to-digital (A/D) converter 7 for sampling the amplified signal and generating a digital signal, a filter 8, a processor 9 for performing signal processing on the digital signal and controlling the voice authentication system 1, volatile memory 10 and non-volatile memory 11. In this example, the A/D converter 7 samples the amplified signal at 11025 Hz and provides a mono-linear 16-bit pulse code modulation (PCM) representation of the signal. The digital signal is filtered using a 4th order 100 Hz high-pass filter to remove any d.c. offset.
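
For illustration, the capture parameters given above (11025 Hz sampling, 16-bit PCM, a 4th order 100 Hz high-pass filter) might be applied as in the following sketch. A Butterworth design is assumed here; the application does not name the filter type.

```python
# A sketch of the front end: 16-bit PCM samples at 11025 Hz are passed through
# a 4th order 100 Hz high-pass filter to remove any d.c. offset.
import numpy as np
from scipy.signal import butter, sosfilt

SAMPLE_RATE_HZ = 11025
HIGHPASS_CUTOFF_HZ = 100.0
FILTER_ORDER = 4

def remove_dc_offset(pcm_samples: np.ndarray) -> np.ndarray:
    """High-pass filter a block of 16-bit PCM samples."""
    sos = butter(FILTER_ORDER, HIGHPASS_CUTOFF_HZ, btype="highpass",
                 fs=SAMPLE_RATE_HZ, output="sos")
    return sosfilt(sos, pcm_samples.astype(np.float64))
```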

The system 1 further includes a digital-to-analog (D/A) converter 12, another amplifier 13 and a speaker 14 for providing audio prompts to the user 2 and a display 15 for providing text prompts to the user 2. The system 1 also includes an interface 16, such as a keyboard and/or mouse, and a display 17 for allowing access by the system administrator 4. The system 1 also includes an interface 18 to the secure system 3.

In this embodiment, the voice authentication system 1 is provided by a personal computer which operates software performing the voice authentication process.

Referring to Figure 2, the voice authentication process comprises two stages, namely enrolment (step S1) and authentication (step S2).

The purpose of the enrolment is to obtain a plurality of specimens of speech from a person who is authorised to enrol with the system 1, referred to herein as a "valid user". The specimens of speech are used to generate a reliable and distinctive voice authentication biometric, which is subsequently used in authentication.

A voice authentication biometric is a compact data structure comprising acoustic information-bearing attributes that characterise the way a valid user speaks. These attributes take the form of templates, herein referred to as "featuregram archetypes" (FGAs), which are described in more detail later.

The valid user's voice authentication biometric may also include further information relating to enrolment and authentication. The further information may include data relating to prompts to which a valid user has responded during enrolment, which may take the form of text prompts or equivalent identifiers, the number of prompts to be used during authentication and whether prompts should be presented in a random order during authentication, and other data relating to authentication such as scoring strategy, pass/fail/retry thresholds, the number of acceptable failed attempts and amplifier gain.
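
A record along the following lines could hold this information; the field names, types and default values are illustrative assumptions rather than a layout taken from the application.

```python
# A sketch of the kind of record the voice authentication biometric might form.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class VoiceAuthenticationBiometric:
    archetypes: Dict[str, List[List[float]]]    # featuregram archetype per prompt
    prompts: List[str]                          # text prompts or equivalent identifiers
    prompts_per_authentication: int = 3
    randomise_prompt_order: bool = True
    pass_threshold: float = 1.0                 # scoring / pass-fail-retry data
    max_failed_attempts: int = 3
    amplifier_gain: float = 1.0                 # stored calibration result
    statistics: Dict[str, dict] = field(default_factory=dict)
```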

Enrolment

Referring to Figure 3, the enrolment process corresponding to step S1 in Figure 2 is shown. The voice authentication system 1 is calibrated, for example to ensure that a proper amplifier gain is set (step S1.1). Once the system is calibrated, a plurality of spoken responses are recorded (step S1.2). The recordings are characterised by generating so-called "featuregrams", which comprise a set of feature vectors (step S1.3). The recordings are also examined so as to isolate speech from background noise and periods of silence (steps S1.4 and S1.5). Checks are performed to ensure that the recorded responses, isolated specimens of speech and featuregrams are suitable for processing (step S1.6). A plurality of speech featuregrams are then generated (step S1.7). Thereafter, an average of some or all of the featuregrams is taken, thereby forming a more representative featuregram, namely a featuregram archetype (step S1.8). A pass level is set (step S1.9) and a voice authentication biometric is generated and stored (step S1.10).

Calibration

Referring to Figure 4, the calibration process of step S1.1 (Figure 3) is described in more detail. One of the purposes of calibration is to set the gain of the amplifier 6 (Figure 1) such that the amplitude of a captured speech utterance meets a predetermined standard. The predetermined standard may specify that the amplitude of the speech utterance peaks at a predetermined value, such as 70% of the full-scale deflection of the recording range. In this example, the A/D converter 7 (Figure 1) is 16 bits wide and so 70% of full-scale deflection corresponds to a signal of 87 dB. The predetermined standard may also specify that the signal has a minimum signal-to-noise ratio, for instance 20 dB, which corresponds to a signal ten times stronger than the background noise.

The gain of the amplifier 6 (Figure 1) is set to the highest value (step S1.1.1) and first and second counters are set to zero (steps S1.1.2 & S1.1.3). The first counter keeps a tally of the number of specimens provided by the valid user. The second counter is used to determine the number of consecutive specimens which meet the predetermined standard.

A prompt is issued (step S1.1.4). In this example, the prompt is randomly selected.

This has the advantage that it prevents the valid user from anticipating the spoken response and thereby providing an uncharacteristic or unnatural response, for example one which is unnaturally loud or quiet. The prompt may be a text prompt or an audio prompt. The valid user may be prompted to say a single word, such as "thirty-four", or a phrase, such as "My voice is my pass phrase". In this example, the prompts comprise numbers. Preferably, the numbers are chosen from a range between 21 and 99. This has the advantage that the spoken utterance is sufficiently long and complex so as to include a plurality of features.

A speech utterance is recorded (step S1.1.5). This comprises the user providing a spoken response which is picked up by the microphone 5 (Figure 1), amplified by the amplifier 6 (Figure 1), sampled by the analog-to-digital (A/D) converter 7 (Figure 1), filtered and stored in volatile memory 10 (Figure 1) as the recorded signal. The processor 9 (Figure 1) calculates the power of the speech utterance included in the recorded signal and analyses the result.

The signal-to-noise ratio is determined (step S1. 1. 6). If the signal-to-noise ratio is too low, for example less than 20dB, then the spoken response is too quiet and the corresponding signal generated is too weak, even at the highest gain. The user is informed of this fact (step S1. 1.7) and the calibration stage ends. Otherwise, the process continues.

The signal level is determined (step S1.1.8). If the signal level is too high, for example greater than 87 dB, which corresponds to the 95th percentile of the speech utterance energy being greater than 70% of the full-scale deflection of the A/D converter 7 (Figure 1), then the spoken response is too loud and the corresponding signal generated is too strong. If the gain has already been reduced to its lowest value, then the signal is too strong, even at the lowest gain (step S1.1.9). The user is informed of this fact and calibration ends (step S1.1.10). Otherwise, the gain of the amplifier 6 is reduced (step S1.1.11). The gain may be reduced by a fixed amount regardless of signal strength. Alternatively, the gain may be reduced by an amount dependent on signal strength. This has the advantage of obtaining an appropriate gain more quickly. The fact that a specimen spoken response has been taken is noted by incrementing the first counter by one (step S1.1.12). The second counter is reset (step S1.1.13).

If too many specimens have been taken, for example 15, then calibration ends (step S1. 1. 14). Otherwise, the process returns to step S1. 1.4, wherein the user is prompted, and the process of recording, calculating and analysing is repeated.

If, at step S1. 1.8, the signal level is not too high, then the spoken response is considered to be satisfactory, i. e. neither too loud nor too quiet. Thus, the recorded signal falls within an appropriate range of values of signal energy. The fact that a specimen spoken response has been taken is recorded by incrementing the first counter by one (step S1. 1.16). The fact that the specimen is satisfactory is also recorded by incrementing the second counter by one (step S1. 1.17). The gain remains unchanged.

If a predetermined number of consecutive specimens are taken without a change in gain, then calibration is successfully terminated and the gain setting of the amplifier 6 (Figure 1) is stored (step S1. 1.18 & S1. 1.19). In this example, the gain setting is stored in the voice authentication biometric.

Additional steps may be included. For example, once a settled value of gain is achieved at step S1.1.18, the signal level is measured a final time. If the signal level is too low, then calibration ends without the gain setting being stored.

The calibration process allows decreases, but not increases, in gain. This has the advantage of preventing infinite loops in which the gain fluctuates without reaching a stable setting.

Alternatively, the calibration process may be modified to start at the lowest gain and allow increases, but not decreases, in gain. Thus, if the signal strength is too low, for example below a predetermined limit, then gain is increased. Once a settled value of gain has been achieved, then the signal level may be measured a final time to determine whether it is too high.

The calibration process may include a further check of signal-to-noise ratio. For example, once a settled value of gain has been determined, the peak signal-to-noise ratio of the signal is measured. If the signal-to-noise ratio exceeds a predetermined level, such as 20dB, then the gain setting is stored. Otherwise, the user is instructed to repeat calibration in a quieter environment, move closer to the microphone or speak with a louder voice.
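By way of illustration, the calibration loop described above might be sketched in Python as follows. The helper record_specimen(), the list of gain settings and the constant REQUIRED_CONSECUTIVE are assumptions introduced for the example and are not taken from the application; the sketch simply mirrors the two counters and the decrease-only gain rule described above.

```python
MAX_SPECIMENS = 15          # abort after this many specimens (first counter limit)
REQUIRED_CONSECUTIVE = 3    # assumed number of consecutive stable specimens needed
MIN_SNR_DB = 20.0           # minimum acceptable signal-to-noise ratio
MAX_LEVEL_DB = 87.0         # roughly 70% of 16-bit full-scale deflection

def calibrate(record_specimen, gains):
    """Walk down a list of gain settings (highest first) until the level settles.

    record_specimen(gain) is a hypothetical callback that issues a prompt, records a
    spoken response at the given gain and returns (snr_db, level_db).
    Returns the settled gain, or None if calibration fails.
    """
    gain_index = 0      # start at the highest gain
    specimens = 0       # first counter: total specimens taken
    consecutive = 0     # second counter: consecutive specimens meeting the standard

    while specimens < MAX_SPECIMENS:
        snr_db, level_db = record_specimen(gains[gain_index])
        specimens += 1

        if snr_db < MIN_SNR_DB:
            return None                     # too quiet, even at the current gain

        if level_db > MAX_LEVEL_DB:         # too loud: reduce the gain if possible
            if gain_index == len(gains) - 1:
                return None                 # already at the lowest gain
            gain_index += 1
            consecutive = 0
        else:
            consecutive += 1                # satisfactory specimen, gain unchanged
            if consecutive >= REQUIRED_CONSECUTIVE:
                return gains[gain_index]    # settled gain, stored in the biometric

    return None                             # too many specimens taken
```

Because the gain only ever decreases, the loop cannot oscillate, which is the property noted above for the decrease-only calibration strategy.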

Referring again to Figure 3, during enrolment, the voice authentication system 1 records one or more spoken responses (step S1.2). This may occur during calibration at step S1.1. Additionally or alternatively, a separate recording stage may be used.

During enrolment, the voice authentication system 1 asks the user to provide a spoken response. Preferably, the system prompts the user a plurality of times. Four types of prompt may be used: In a first type, the prompt comprises a request for a single word, for example "Say 81". Preferably, the user is asked to repeat the word. The user may be asked to repeat the word once so as to obtain two specimens of the spoken response. The user may be asked to repeat the word more than once so as to obtain multiple examples.

In a second type, the prompt comprises a request for a single phrase, for example "Say My voice is my pass phrase". Preferably, the user is asked to repeat the phrase.

The user may be asked to repeat the phrase once or more than once.

In a third type, the prompt may comprise a challenge requesting personal information, such as "What is your home telephone number?". The valid user provides a spoken response which includes the personal information. This type of prompt is referred to as a "challenge-response". This type of prompt has the advantage of increasing security. During subsequent authentication, an impostor must know or guess what to say as well as attempt to say the spoken response in the correct manner. For example, a valid user may pronounce digits in different ways, such as pronouncing "10" as "ten", "one, zero", "one, nought" or "one-oh", and/or pause while saying a string of numbers, such as reciting "12345678" as "12-34-56-78" or "1234-5678".

In a fourth type, the prompt may comprise a cryptic challenge-response, such as "NOD?". For example, "NOD" may signify "Name of dog?". Preferably, the cryptic challenge is specified by the user. This type of prompt has the advantage of increasing security since the prompt is meaningful only to the valid user. It offers few clues as to what the spoken response should be.

A set of prompts may be common to all users. Alternatively, a set of prompts may be randomly selected on an individual basis. If the prompts are chosen randomly, then a record of the prompts issued to each user is stored in the voice authentication biometric, together with corresponding data generated from the spoken responses. Preferably, this information is used during the authentication stage to ensure that only prompts responded to by the valid user are issued and that appropriate comparisons are made with corresponding featuregram archetypes.

Preferably, the administrator 4 (Figure 1) determines the type and number of prompts used during enrolment and authentication.

Recording
Referring again to Figure 1, a spoken response is recorded by the microphone 5, amplified by amplifier 6 and sampled using A/D converter 7 at 11025 Hz to provide a 16-bit PCM digital signal. The duration of the recording may be fixed. Preferably, the recording lasts between 2 and 3 seconds. The signal is then filtered to remove any d.c. component. The signal may be stored in volatile memory 10.

Referring to Figures 5,6, 7, an example of a recorded signal 19 is shown in analog, generic and digital representations.

Referring particularly to Figure 5, the recorded signal 19 may comprise one or more speech utterances 20, one or more background noises 21 and/or one or more silence intervals 22. A speech utterance 20 is defined as a period in a recorded signal 19 which is derived solely from the spoken response of the user. A background noise 21 is defined as a period in a recorded signal arising from audible sounds, but not originating from the speech utterance. A silence interval 22 is defined as a period in a recorded signal which is free from background noise and speech utterance.

As explained earlier, the purpose of the enrolment is to obtain a plurality of specimens of speech so as to generate a voice authentication biometric. To help achieve this, recorded responses are characterised by generating "featuregrams" which comprise sets of feature vectors. The recordings are also examined so as to isolate speech from background noise and silences.

If the recordings are known to contain specific words, then they are searched for those words. This is known as "word spotting". If there is no prior knowledge of the content of the recordings, then the recordings are inspected to identify spoken utterances. This is known as "endpointing". By identifying speech utterances using one or both of these processes, a speech featuregram may be generated which corresponds to portions of the recorded signal comprising speech utterances.

Referring to Figure 8, a portion 19' of the recorded signal 19 is shown. The recorded signal 19 is divided into frames, referred to herein as timeslices 23. The recorded signal 19 is divided into partially-overlapping timeslices 23 having a predetermined period. In this example, timeslices 23 have a period of 50 ms, i.e. t1 = 50 ms, and overlap by 50%, i.e. t2 = 25 ms.

Featuregram generation
Referring to Figures 9, 10 and 11, a process by which a featuregram is generated will be described in more detail: The recorded signal 19 is divided into frames, herein referred to as timeslices 23 (step S1.3.1). Each timeslice 23 is converted into a feature vector 24 using a feature transform 25 (step S1.3.2).

The content of the feature vector 24 depends on the transform 25 used. In general, a feature vector 24 is a one-dimensional data structure comprising data related to acoustic information-bearing attributes of the timeslice 23. Typically, a feature vector 24 comprises a string of numbers, for example 10 to 50 numbers, which represent the acoustic features of the signal comprised in the timeslice 23.

In this example, a so-called mel-cepstral transform 25 is used. This transform 25 is suitable for use with a 32-bit fixed-point microprocessor. A mel-cepstral transform 25 is a cosine transform of the real part of a logarithmic-scale energy spectrum. A mel is a measure of the perceived pitch or frequency of a tone by the human auditory system. Thus, in this example, for a sampling rate of 11025Hz, each feature vector 24 comprises twelve signed 8-bit integers, typically representing the second to thirteenth calculated mel-cepstral coefficients. Data relating to energy (in dB) may be included as a 13th feature. This has the advantage of helping to improve the performance of a word spotting routine that would otherwise operate on the feature vector coefficients alone.

The transform 25 may also calculate first and second differentials, referred to as "delta" and "delta-delta" values.

Further details regarding mel-cepstral transforms may be found in "Fundamentals of Speech Recognition" by Rabiner & Juang (Prentice Hall, 1993).

Other transforms may be used. For example, a linear predictive coefficient (LPC) transform may be used in conjunction with a regression algorithm so as to produce LPC cepstral coefficients. This transform is suitable for use with a 16-bit microprocessor. Alternatively, a TESPAR transform may be used.

The linear predictive coefficient (LPC) transform is described by B. S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification", Journal of the Acoustical Society of America, Vol. 55, pp. 1304-1312, June 1974. Further details regarding the TESPAR transform may be found in GB-B-2162025.

Referring to Figure 11, a featuregram 25 comprises a set or concatenation of feature vectors 24. The featuregram 25 includes speech utterances, background noise and silence intervals.
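A minimal sketch of the timeslicing and mel-cepstral conversion described above is given below, assuming NumPy, a simple triangular mel filterbank and a direct DCT of the log mel energies. The filterbank size, FFT length, fixed-point scaling and the optional energy, delta and delta-delta terms of the actual system are simplifications introduced for the example.

```python
import numpy as np

SAMPLE_RATE = 11025
FRAME_MS, STEP_MS = 50, 25          # 50 ms timeslices overlapping by 50%
N_MELS, N_CEPSTRA = 24, 12          # keep coefficients 2..13, as in the example above

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_fft, n_mels=N_MELS, sr=SAMPLE_RATE):
    """Triangular filters equally spaced on the mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2)
    hz_points = 700.0 * (10 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fb[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[m - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def featuregram(signal):
    """Return an array of feature vectors, one row per 50 ms timeslice."""
    frame_len = int(SAMPLE_RATE * FRAME_MS / 1000)
    step = int(SAMPLE_RATE * STEP_MS / 1000)
    n_fft = 1024
    fb = mel_filterbank(n_fft)
    vectors = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        log_mel = np.log(fb @ spectrum + 1e-10)
        # DCT-II of the log mel energies gives the mel-cepstral coefficients;
        # coefficients 2 to 13 (indices 1..12) are kept, as in the example above.
        n = len(log_mel)
        k = np.arange(N_CEPSTRA + 1)
        mfcc = np.cos(np.pi * np.outer(k, np.arange(n) + 0.5) / n) @ log_mel
        vectors.append(mfcc[1:])
    return np.array(vectors)
```

Concatenating the rows returned by featuregram() corresponds to the set of feature vectors forming the featuregram of Figure 11.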

Endpointing
Endpointing seeks to identify portions of a recorded signal which contain spoken utterances. This allows generation of speech featuregrams which characterise the spoken utterances.

Referring to Figure 12, two methods of endpointing may be used, namely explicit endpointing (step S1. 4) and dynamic time warping (DTW) word spotting (step S1. 5).

- Explicit Endpointing -
Explicit endpointing seeks to locate approximate endpoints of a speech utterance in a particular domain without using any a priori knowledge of the words that might have been spoken. Explicit endpointing tracks changes in the signal energy profile over time and frequency and makes boundary decisions based on general assumptions regarding the nature of profiles that are indicative of speech and those that are representative of noise or silence. Explicit endpointing cannot easily distinguish between speech spoken by the enrolling user and speech prominently forming part of background noise. Therefore, it is desirable that no-one else speaks in close proximity to the valid user when enrolment takes place.

Referring to Figure 13, an explicit endpointing process 27 generates a plurality of pairs 28 of possible start and stop points for a stream of timeslices 23. The advantage of generating a plurality of endpoints is that the true endpoints are likely to be identified. However, a drawback is that if too many endpoint combinations are identified, then the system response time is adversely affected. Therefore, a trade-off is sought between the number of potential endpoint combinations identified and the response time required.

Explicit endpointing is suitable for both fixed and continuous recording environments, although it is mainly intended for use with isolated word or isolated phrase recognition systems.

Referring to Figure 14, an explicit endpointing process is shown in more detail: A check is made whether initialisation is needed, whereby background noise energy is measured (step S1.4.A). If so, a background noise signal is recorded (step S1.4.B), divided into timeslices (step S1.4.C) and a background energy value is calculated (step S1.4.D).

After initialisation, or if no initialisation is needed, a signal is recorded and divided into a plurality of timeslices 23 (Figure 8) (step S1.4.1). A first counter, i, for keeping track of which timeslice 23i is currently being processed is set to one (step S1.4.2). A second counter, j, for counting the number of consecutive timeslices 23 which represent background noise is set to zero (step S1.4.3). A "word" flag is set to zero to represent that the current timeslice 23i does not represent a spoken utterance portion, such as a word portion (step S1.4.4).

Referring also to Figure 15, the energy of the current timeslice 23i is calculated (step S1.4.5).

Preferably, a plurality of timeslices 23 are used to calculate a value of energy for the current timeslice 23i. The timeslices 23 are comprised in a window 29. In this example, five timeslices 23i-2, 23i-1, 23i, 23i+1, 23i+2 are used to calculate a value of energy of the ith timeslice 23i.

A time encoded speech processing and recognition (TESPAR) coding process 30 is used to calculate an energy value for each timeslice 23i-2, 23i-1, 23i, 23i+1, 23i+2. This comprises taking each timeslice 23i-2, 23i-1, 23i, 23i+1, 23i+2 and dividing it into a plurality of so-called "epochs" according to where the signal magnitude changes from positive to negative and vice versa. These points are known as "zero crossings". An absolute value of a peak magnitude for each epoch is used to calculate an average energy for each timeslice 23i-2, 23i-1, 23i, 23i+1, 23i+2. Thus, five energy values are obtained from which a mean energy value 31 is calculated.

A description of the TESPAR coding process is given in GB-A-2020517.

A delta energy value 32 indicative of changes in the five energy values is also calculated (step S1.4.6). In this example, the delta energy value 32 is calculated by performing a smoothed linear regression calculation using the energy values for the timeslices 23i-2, 23i-1, 23i, 23i+1, 23i+2. The delta energy value 32 represents the gradient of a straight line fitted to the values of energy. Thus, large changes in the energy values result in a large value of the delta energy value 32.
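A sketch of the windowed energy and delta-energy calculation follows. The helper timeslice_energy() is an assumption that stands in for the TESPAR epoch coding, using a simple mean absolute sample value; the delta energy is obtained, as described above, from the gradient of a straight line fitted to the five windowed energy values.

```python
import numpy as np

def timeslice_energy(timeslice):
    """Crude stand-in for the TESPAR-based energy: mean absolute sample value."""
    return float(np.mean(np.abs(timeslice)))

def energy_and_delta(timeslices, i, half_window=2):
    """Mean energy and delta energy for the i-th timeslice using a 5-slice window."""
    window = timeslices[max(0, i - half_window): i + half_window + 1]
    energies = np.array([timeslice_energy(t) for t in window])
    mean_energy = energies.mean()
    # Delta energy: gradient of a straight line fitted to the windowed energies.
    x = np.arange(len(energies))
    delta_energy = np.polyfit(x, energies, 1)[0]
    return mean_energy, delta_energy
```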

The values 31, 32 of energy Ei and delta energy ΔEi are used to determine whether the ith timeslice 23i represents a spoken utterance.

Referring again to Figure 14, if the energy 31 of an ith timeslice 23i is equal to or greater than a first threshold, which is a first predetermined multiple of the background noise energy, i.e. Ei > k1 x E0 (step S1.4.7), and the delta energy 32 is equal to or greater than a second threshold, which is a second predetermined multiple of the background delta energy, i.e. ΔEi > k2 x ΔE0 (step S1.4.8), then the ith timeslice 23i is considered to form part of a word. The timeslice 23i is said to form part of a voiced or energetic section. In this example, k1 = 2 and k2 = 3.

If the word flag is not set to one, representing that the previous timeslice 23i-1 was background noise (step S1.4.9), then the current timeslice 23i is considered to be the beginning of a word (step S1.4.10). Thus, the word flag is set to one (step S1.4.11).

If, at step S1.4.9, the word flag is set to one, then the beginning of the word has already been detected and so the current timeslice 23i is located within an existing word (step S1.4.12).

The first counter, i, is incremented by one (step S1. 4.13) and the process returns to step S1. 4.5 where the energy of the new ith timeslice 23i is calculated.

If the energy value 31 falls below the first threshold at step S1. 4.7 or the delta energy value 32 falls below the second threshold at step S1. 4.8, then it is determined whether there is a stop point, and if so with which start point or start points it could be paired.

If the word flag is set to one (step S1.4.14), then the current timeslice 23i is considered to be a stop point.

The stop point may be paired with one or more other start points, as will now be explained: Referring to Figure 16, first and second sections 33,34 are separated by a gap 35.

The first section 33 includes a first start point 361 and a first stop point 371. The second section 34 has a second start point 362. According to step S1.4.7 or S1.4.8 and step S1.4.14, a second stop point 372 is found. The second stop point 372 may be paired with the second start point 362, so identifying the second section 34 as a word. However, the second stop point 372 may also be paired with the first start point 361. Thus, the first start point 361 and the second stop point 372 may define a larger word 38 which includes both the first and second sections 33, 34. Therefore, it is desirable to determine the duration of a gap 35 between the first stop point 371 and the second start point 362. If the gap 35 is sufficiently short, then an additional pairing is made and the additional word 38 is identified. This has the advantage of identifying a greater number of candidates and thus increasing the chances of correctly identifying a word.

Referring again to Figure 14, a check is made as to whether the start point preceding the current endpoint occurs within ten timeslices 23 of the stop point of the previous word (step S1. 4.15). If it does not, then the current endpoint is paired with only the preceding start point, thereby identifying a single word (step S1. 4.16).

If the start point is within ten timeslices 23 or less of the stop point, then the current stop point is paired with both the start point of the current section (step

S1. 4.17) and the start point of the preceding word (step S1. 4.18), thereby identifying two potential words.

The second counter, j, is reset to zero (step S1.4.19), the word flag is set to zero (step S1.4.20) and the first counter is incremented by one (step S1.4.21) before returning to step S1.4.5.

If, at step S1.4.14, the word flag is not set to one, then a further check is made as to whether the current timeslice 23i may be considered to be the start point of an unvoiced or unenergetic section, hereinafter referred to simply as an unvoiced section.

If the energy 31 of an ith timeslice 23i is equal to or greater than a third threshold, which is lower than the first and which is a third predetermined multiple of the background noise energy, i.e. Ei > k3 x E0 (step S1.4.22), and the delta energy 32 is equal to or greater than a fourth threshold, which is lower than the second and which is a fourth predetermined multiple of the background delta energy, i.e. ΔEi > k4 x ΔE0 (step S1.4.23), and provided that the timeslice 23i is found within 10 timeslices of the previous stop point (step S1.4.24), then the ith timeslice 23i is considered to be the start point of an unvoiced section (step S1.4.25). In this example, k3 = 1.25 and k4 = 2.

The extent of the unvoiced section is determined by incrementing the first counter i (step S1.4.26), calculating values 31, 32 of energy and delta energy (steps S1.4.27 & S1.4.28) and determining whether the energy 31 of the current timeslice 23i exceeds a fifth threshold corresponding to a fifth predetermined multiple of the background noise energy, i.e. Ei > k5 x E0 (step S1.4.29). In this case, k5 = k3 = 1.25. Provided that the energy 31 of the current timeslice 23i exceeds the fifth threshold, the timeslice 23i is identified as being part of the unvoiced section (step S1.4.30).

If the energy value 31 falls below the third threshold at step S1.4.22 or the delta energy value 32 falls below the fourth threshold at step S1.4.23, then the current timeslice 23i is deemed to represent background noise (step S1.4.31). The values of background noise energy and delta background noise energy are updated using the current timeslice 23i. In this case, a weighted average is taken using 95% of the background noise energy E0 and 5% of the timeslice energy Ei (step S1.4.32).

Similarly, a weighted average is taken using 95% of the delta background noise energy ΔE0 and 5% of the delta energy ΔEi (step S1.4.33). The second counter, j, is incremented by one (step S1.4.34).

A check is made to see whether an isolated word has been found (step S1. 4.35). If a sufficiently long period of background noise is identified, for example by counting twenty timeslices after the end of a word which corresponds to 0.5 seconds of silence (step S1. 4.36), then it is reasonable to assume that the last stop point represents the end of an isolated word. If an isolated word is found, then pairing of possible start and stop points may be terminated. Otherwise, searching continues by returning to step S1. 4.5.

If, at step S1.4.30, the energy 31 of the timeslice 23i falls below the fifth threshold, then a stop point of an unvoiced section is identified (step S1.4.36). The stop point is associated with the start point of the preceding word (step S1.4.37) and the first counter, i, is incremented by one (step S1.4.38).

Referring to Figure 17, a first section 39 precedes a second section 40 and has a first start point 411 and a first stop point 421. A second stop point 422 is found in the second section 40 according to step S1.4.36. The second stop point 422 is paired with the first start point 411. Thus, the first start point 411 and the second stop point 422 may define a word 43 which includes both the first and second sections 39, 40.

Thus, two types of stop points may be identified. The stop point may be an end point of a voiced section, such as the "le" in "left", or a stop point of an unvoiced section, such as the "t" in "left".
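A condensed sketch of the voiced-section detection described above is shown below, using the example thresholds k1 = 2 and k2 = 3. The unvoiced-section handling, gap merging, background-noise adaptation and the isolated-word test are omitted for brevity, and energy_and_delta() refers to the hypothetical helper sketched earlier.

```python
def find_voiced_sections(timeslices, e0, de0, energy_and_delta, k1=2.0, k2=3.0):
    """Return (start, stop) timeslice index pairs for voiced sections.

    e0 and de0 are the background noise energy and background delta energy.
    """
    sections = []
    start = None                            # word flag: None means "not in a word"
    for i in range(len(timeslices)):
        energy, delta = energy_and_delta(timeslices, i)
        if energy >= k1 * e0 and delta >= k2 * de0:
            if start is None:               # beginning of a word
                start = i
        else:
            if start is not None:           # stop point of a voiced section
                sections.append((start, i))
                start = None
    if start is not None:                   # word still open at the end of the signal
        sections.append((start, len(timeslices)))
    return sections
```

Each returned pair corresponds to one candidate start/stop pairing; in the full process additional pairings spanning short gaps would also be generated, as described above.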

Referring to Figure 18, a process for finding and removing extraneous noises such as lip smack and generating an additional pair of endpoints is shown:

When a stop point is located at step S1.4.16 or S1.4.17 in a voiced section, the current start point is located (step S1.4.39). First and second pointers p, q are set to the start point (steps S1.4.40 & S1.4.41). The first index p points to an updated start point. The second index q keeps track of which timeslice is currently being examined.

The delta energy of a current timeslice 23q is compared with the delta energy of a succeeding timeslice 23q+1 (step S1.4.42). If the delta energy of the current timeslice 23q is greater than the delta energy of the succeeding timeslice 23q+1, then the delta energy of the succeeding timeslice 23q+1 is compared with the delta energy of a second succeeding timeslice 23q+2 (step S1.4.43). If the delta energy of the succeeding timeslice 23q+1 is greater than the delta energy of the second succeeding timeslice 23q+2, then the start point is updated by incrementing the first index p by one (step S1.4.44). A check is made to see whether the updated start point and the stop position are separated by at least three timeslices (step S1.4.45). If not, then the process terminates without generating an additional pair of endpoints including an updated start point.

If, at either step S1.4.42 or S1.4.43, the delta energy of the current timeslice 23q is less than the delta energy of the succeeding timeslice 23q+1, or the delta energy of the succeeding timeslice 23q+1 is less than the delta energy of the second succeeding timeslice 23q+2, then the process terminates and generates an additional pair of endpoints including an updated start point.

Referring to Figures 19a and 19b, the effect of the process for finding and removing extraneous noises is illustrated.

Figure 19a shows a voiced section 44 having a pair of start and stop points 45, 46.

Figure 19b shows the voiced section 44 after the process has identified a section portion 47 comprising a lip smack. Another pair of start and stop points 48, 49 are generated.

Preferably, explicit endpointing is performed in real-time. This has the advantage that it may be determined whether or not a timeslice 23 corresponds to a spoken utterance, i.e. whether a portion of the recorded signal currently being processed corresponds to part of a word. If so, a featuregram is generated. If not, a featuregram need not be generated. Processing resources may be better put to use, for example by generating a template (if in the training mode) or performing a comparison (if in the real-time live interrogation mode).

- Word spotting -
Word spotting seeks to locate endpoints of a speech utterance in a particular domain using a priori knowledge of the words that should have been spoken as a guide. The a priori knowledge is typically presented as a speaker-independent featuregram archetype (FGA) generated from speech utterances of the word or phrase being sought that have previously been supplied by a wide range of representative speakers. The featuregram archetype may include an energy term.

Referring to Figure 20, a dynamic time warping process 50, herein referred to as DTWFlex, is used. The process 50 compares a featuregram 51 derived from the recorded signal 19 (Figure 5) with a speaker-independent featuregram archetype 52, representing a word or phrase being sought. This is achieved by compressing and/or expanding different sections of the featuregram 51 until a region inside the featuregram 51 matches the speaker-independent featuregram archetype 52. The best fit is known as the winning path and the endpoints of the winning path are output 28'.

One advantage of word spotting is that it delivers more accurate endpoints than those produced by explicit endpointing, particularly when heavy non-stationary background noise is present. If word spotting is used during enrolment, users are asked to respond to fixed-word or fixed-phrase prompts for which speaker-independent featuregram archetypes have been prepared in advance. It is difficult to use word spotting in conjunction with challenge-response prompts, particularly if spoken responses cannot be easily anticipated. Thus, it is preferable to use explicit endpointing when using challenge-response prompts.

An outline of a word spotting process will now be described: First and second speech patterns A, B may be expressed as sequences of first and second respective sets of feature vectors a, b, wherein:

A = a1, a2, ..., ai, ..., aI    (1a)
B = b1, b2, ..., bj, ..., bJ    (1b)

Each respective vector a, b represents a fixed period of time.

Referring to Figure 21, a dynamic time warping process seeks to eliminate timing differences between the first and second speech patterns A, B. The timing differences may be illustrated using an i-j plot, wherein the first speech pattern A is developed along an i-axis 53 and the second speech pattern B is developed along a j-axis 54.

The timing differences between the first and second speech patterns A, B may be represented by a sequence F, wherein:

F = c(1), c(2), ..., c(k), ..., c(K)    (2)

where c(k) = (i(k), j(k)). The sequence F may be considered to represent a function which approximately maps the time axis of the first speech pattern A onto that of the second speech pattern B. The sequence F is referred to as a warping function 55.

When there is no timing difference between the first and second speech patterns A, B, the warping function 55 coincides with a diagonal line j = i, indicated by reference number 56. As the timing differences grow, the warping function 55 increasingly deviates from the diagonal line 56.

A Euclidean distance, d, is used to measure the timing difference between a pair of time points in the form of feature vectors ai, bj, wherein:

d(c(k)) = d(i, j) = || ai - bj ||    (3)

However, other distances may be used to measure the timing difference, such as the Manhattan distance. A weighted sum of distances along the warping function 55 is calculated using:

E(F) = Σ d(c(k)) w(k), summed over k = 1 to K    (4)

where w(k) is a positive weighting coefficient. E(F) reaches a minimum value when the warping function 55 optimally adjusts the timing differences between the first and second speech patterns A, B. The minimum value may be considered to be a distance between the first and second speech patterns A, B, once the timing differences between them have been eliminated, and is expected to be stable against time-axis fluctuation. Based on these considerations, a time-normalised distance D between the first and second speech patterns A, B is defined as:

D(A, B) = min over F of [ Σ d(c(k)) w(k) / Σ w(k) ]    (5)

where the denominator compensates for the number of points on the warping function 55.

Two conditions are imposed on the speech patterns A, B. Firstly, the speech patterns A, B are time-sampled with a common and constant sampling period.

Secondly, there is no a priori knowledge about which parts of the speech pattern contain linguistically important information. In this case, each part of the speech pattern is considered to have an equal amount of linguistic information.

As explained earlier, the warping function 55 is a model of time-axis fluctuations in a speech pattern. Thus, the warping function 55, when viewed as a mapping

function from the time axis of the second speech pattern B onto that of the first speech pattern A, preserves linguistically important structures in the second speech pattern B time axis, and vice versa. In this example, important speech pattern time-axis structures include continuity, monotonicity and limitation on acoustic parameter transition speed in speech.

In this example, asymmetric time warping is used, wherein the weighting function w(k) is dependent on i but not j. This condition is realised using the following restrictions on the warping function 55: Firstly, a monotonic condition is applied, wherein:

i(k-1) <= i(k) and j(k-1) <= j(k)    (6)

The monotonic condition specifies that the warping function 55 does not turn back on itself. Secondly, a continuity condition is imposed, wherein:

i(k) - i(k-1) = 1 and j(k) - j(k-1) <= 2    (7)

The continuity condition specifies that the warping function 55 advances a predetermined number of steps at a time. As a result of these two conditions, the following relation holds between two consecutive points, namely:

c(k-1) = (i(k) - 1, j(k)), (i(k) - 1, j(k) - 1) or (i(k) - 1, j(k) - 2)    (8)

Boundary conditions are set such that the warping function 55 starts at (1, 1) and ends at (I, J), i.e.:

i(1) = 1, j(1) = 1, and i(K) = I, j(K) = J    (9)

A local slope constraint condition is also imposed. This defines a relation between consecutive points on the warping function 55 and places limitations on possible configurations. In this example, the Itakura condition is used.

Referring to Figure 22, if a point 571 moves forward in the i-direction but not in the j-direction, then the point 572 cannot move again in the i-direction without consecutively moving in the j-direction. Therefore, this condition, combined with the monotonicity and continuity conditions, imposes a maximum slope of 2 and a minimum slope of 0.5 on the warping function F. In other words, the second speech pattern B may be maximally compressed or expanded by a factor of 2 in order to time align it with the first speech pattern A.

Referring to Figure 23, the above conditions effectively constrain the possible warping function 55 to a region in the time axis bounded by a parallelogram 58, which is referred to as the "legal" search region. The legal search region conforms to the following conditions:

j <= 2(i - 1) + 1 and j >= (i - 1)/2 + 1    (10)
j >= J - 2(I - i) and j <= J - (I - i)/2    (11)

Thus, j may take a maximum value 58max and a minimum value 58min for a particular value of i.

The weighting coefficient is also restricted. If the denominator in equation (5) is independent of the warping function, then:

N = Σ w(k), summed over k = 1 to K    (12)

where N is the normalisation coefficient. Equation 5 may then be simplified and re-written as:

D(A, B) = (1/N) min over F of [ Σ d(c(k)) w(k) ]    (13)

The time-normalised distance D may be solved using standard dynamic programming techniques. The aim is to find the cost of the shortest path.

In this example, an asymmetric weighting function w(k) is used, namely:

w(k) = i(k) - i(k-1)    (14)

The use of an asymmetric weighting function simplifies the normalisation coefficient N of equation 12, such that:

N = I    (15)

where I is the length of speech pattern A.

An algorithm for solving equation 13 comprises defining an array g for holding the lowest cost path to each point and initialising it such that:

g1(c(1)) = d(c(1)) w(1)    (16)

In other words, the lowest cost to the first point is the distance between the first two elements multiplied by the weighting factor. For a symmetric weighting factor w(1) = 2, while for an asymmetric weighting factor w(1) = 1.

The algorithm comprises calculating gk(i, j) for each row i and column j, wherein:

gk(c(k)) = min over c(k-1) of [ gk-1(c(k-1)) + d(c(k)) w(k) ]    (17)

The solution for the time-normalised distance D(A, B) is given by:

D(A, B) = (1/N) gK(c(K))    (18)

The asymmetric weighting coefficient w(k) of equation 14 may be substituted into equation 17, wherein w(1) = 1.

Thus, the algorithm defined by equations 16 to 18 is simplified and comprises defining an array g for holding the lowest cost path to each point and initialising it such that:

g(1, 1) = d(1, 1)    (19)

In other words, the lowest cost to the first point is the distance between the first two elements.

The algorithm comprises calculating g(i, j) for each row i and column j, wherein:

g(i, j) = d(i, j) + min[ g(i-1, j), g(i-1, j-1), g(i-1, j-2) ]    (20)

The algorithm further comprises applying the following global conditions, namely:

Thus, the solution for the time-normalised distance D(A, B) is given by:

D(A, B) = g(I, J)    (24)

An algorithm based on equations 19 to 24 may be used to obtain a score when comparing speech utterances of substantially the same length. For example, the algorithm is used when comparing featuregram archetypes, which is described in more detail later.
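A sketch of the simplified asymmetric recursion of equations 19 to 24 follows, assuming Euclidean local distances. The Itakura restriction on repeated horizontal moves and the parallelogram bounds of Figure 23 are omitted for brevity, and the final score is normalised by I as in equations 15 and 18.

```python
import numpy as np

def dtw_distance(A, B):
    """Time-normalised distance between two featuregrams (rows are feature vectors).

    Uses g(i, j) = d(i, j) + min(g(i-1, j), g(i-1, j-1), g(i-1, j-2)) with the
    fixed boundary conditions of equation 9, returning g(I, J) / I.
    """
    I, J = len(A), len(B)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # local distances
    g = np.full((I, J), np.inf)
    g[0, 0] = d[0, 0]
    for i in range(1, I):
        for j in range(J):
            predecessors = [g[i - 1, j]]
            if j >= 1:
                predecessors.append(g[i - 1, j - 1])
            if j >= 2:
                predecessors.append(g[i - 1, j - 2])
            g[i, j] = d[i, j] + min(predecessors)
    return g[I - 1, J - 1] / I
```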

However, the algorithm based on equations 19 to 24 may be adapted for word spotting applications. In word spotting, it is assumed that the start and stop points of the first speech pattern A are known. However, the start and stop points of the relevant speech in the second pattern B are unknown. Therefore, the conditions of equation 9 no longer hold and can be re-defined such that:

i(1) = 1, j(1) = start, and i(K) = I, j(K) = stop    (25)

Based on the fact that the maximum expansion/compression in the speech pattern is 2, the start point can assume any value from 1 to J - I/2 and the stop point may assume any value from I/2 to J. Consequently, the global conditions (26) and (27) are relaxed to reflect these start and stop ranges. The time-normalised distance D(A, B) is now defined as:

D(A, B) = min[ g(I, k) ] for I/2 <= k <= J    (28)

Referring to Figures 21, 24 and 25, a process for calculating the time-normalised distance D is shown.

The featuregram 51 derived from the recorded signal 19 (Figure 5) is compared with the speaker-independent featuregram archetype 52. As explained earlier, the featuregram comprises a speech utterance, such as "twenty-one", silence intervals and background noise. The speaker-independent featuregram archetype 52 comprises a word or phrase being sought, which in this example is "twenty-one".

The featuregram 51 is warped onto the speaker-independent featuregram archetype 52. The aim is to locate a region within the featuregram 51 (speech pattern B) which best matches speaker-independent featuregram archetype 52 (speech pattern A).

An array g for holding the lowest cost path to each point is defined (step S1. 5.1).

The array may be considered as a net of points or nodes. As explained earlier, the start point can assume any value from 1 to J - I/2, therefore the elements g(1, 1) to g(1, J - I/2) are set to values d(1, 1) to d(1, J - I/2) respectively (step S1.5.2).

Elements g(1, J - I/2 + 1) to g(1, J) may be set to a large number. A corresponding array 59 is shown in Figure 25a.

Equation 20 is then calculated for some, but not all, elements (i, j) of array g.

The process comprises incrementing index i (step S1.5.3), checking whether the algorithm has come to an end (step S1.5.4), determining the bounds 58max, 58min (Figure 23) of the legal search region (steps S1.5.5 to S1.5.8) and determining whether an index value j falls outside the bounds 58max, 58min (step S1.5.9). If so, then a large distance is entered, i.e. g(i, j) = infinity, which in practice is a large number (step S1.5.10).

Otherwise, equation 20 is calculated and a corresponding distance, herein labelled d'ij, is entered, i.e. g(i, j) = d'ij (step S1.5.11). The process continues by incrementing index j at step S1.5.7 and continuing until j exceeds J (step S1.5.8).

A corresponding array 59', partially filled, is shown in Figure 25b.

The algorithm continues until the array is completed, i.e. (i, j) = (I, J) (step S1.5.4).

A corresponding completed array 59"is shown in Figure 25c.

The winning score with the lowest value is found (step S1.5.12). As explained earlier, the stop point may assume any value from I/2 to J. Therefore, elements g(I, I/2) to g(I, J) are searched. Once a stop point 60 has been found, a start point 61 may be estimated by tracing back the winning path 62. Thus, endpoints 28' are found by reading the j-values corresponding to the start and stop points 61, 60.
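A sketch of the word-spotting adaptation of equations 25 to 28 follows: the first row of the array is opened so that the match may start anywhere in the allowed range, the last row is searched for the best stop point, and the start point is recovered by tracing back the winning path. The legal-region bounds are again simplified, so this is an illustration of the approach rather than the full DTWFlex process.

```python
import numpy as np

def word_spot(A, B):
    """Locate the region of featuregram B that best matches archetype A.

    Returns (score, start, stop) where start/stop index rows of B.
    """
    I, J = len(A), len(B)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    g = np.full((I, J), np.inf)
    back = np.zeros((I, J), dtype=int)       # backpointer to the previous column
    max_start = J - I // 2                    # start may lie anywhere up to J - I/2
    g[0, :max_start] = d[0, :max_start]
    back[0, :] = np.arange(J)
    for i in range(1, I):
        for j in range(J):
            best_prev, best_j = np.inf, j
            for step in (0, 1, 2):            # predecessors (i-1, j), (i-1, j-1), (i-1, j-2)
                if j - step >= 0 and g[i - 1, j - step] < best_prev:
                    best_prev, best_j = g[i - 1, j - step], j - step
            if np.isfinite(best_prev):
                g[i, j] = d[i, j] + best_prev
                back[i, j] = best_j
    stop = I // 2 + int(np.argmin(g[I - 1, I // 2:]))   # stop lies from I/2 to J
    j = stop
    for i in range(I - 1, 0, -1):             # trace back the winning path
        j = back[i, j]
    return g[I - 1, stop] / I, j, stop
```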

Performing sanity checks
The ability of a voice authentication system to consistently accept valid users and reject impostors is dependent on the generation of featuregrams that in some way represent the user's key speech characteristics. A plurality of sanity checks may be applied during enrolment and authentication, preferably on the recorded signal or a recorded signal portion, to ensure that they are suitable for enrolment and authentication, i.e. that the speech utterances carry sufficient information for featuregrams to be generated. Preferably, all of the following sanity checks are performed.

- Speech Length -
A first sanity check comprises confirming that the length of speech exceeds a minimum length. The minimum length of speech is a function not only of time but also of the number of feature vector timeslices. In this example, the minimum

length of speech is 0.5 seconds of speech and 30 feature vector timeslices, and timeslice duration and overlap are defined accordingly.

- Noise Length -
A second sanity check comprises checking that each speech utterance includes a silence interval which exceeds a minimum length. The silence interval is used to determine noise threshold levels for explicit endpointing, signal-to-noise measurements and for speech/noise entropy. In this example, the minimum length of silence is 0.5 seconds and 30 feature vector timeslices.

- Signal-to-Noise Ratio (SNR) -
A third sanity check includes examining whether the signal-to-noise ratio exceeds a minimum. In this example, the minimum signal-to-noise ratio is 20dB. The purpose of setting a minimum signal-to-noise ratio is to obtain an accurate speaker biometric template uncorrupted by background noise.

An estimate of the SNR can be determined using:

SNR = 10 log10( Is / In )

where Is is the speech energy and In is the noise energy. The speech and noise energy Is, In can be calculated using:

I = (1/n) Σ pcmi^2, summed over the n samples of the region

where pcmi is the value of the digital signal. Other values of signal-to-noise ratio may be used, for example 25dB.
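A sketch of this check, assuming the mean-square energy and 10·log10 ratio given above and that the speech and noise sample regions have already been isolated by the endpointer:

```python
import numpy as np

def snr_db(speech_samples, noise_samples):
    """Estimate SNR in dB from PCM sample arrays for the speech and noise regions."""
    i_s = np.mean(np.square(speech_samples.astype(float)))
    i_n = np.mean(np.square(noise_samples.astype(float))) + 1e-10
    return 10.0 * np.log10(i_s / i_n)

def passes_snr_check(speech_samples, noise_samples, minimum_db=20.0):
    """Third sanity check: the SNR must exceed the minimum, 20 dB in this example."""
    return snr_db(speech_samples, noise_samples) >= minimum_db
```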

- Speech Intensity -
A fourth sanity check comprises checking whether the speech energy exceeds a minimum. The purpose of setting a minimum speech intensity is not only to provide an adequate signal-to-noise ratio, but also to avoid excessive quantisation in the digital signal. In this example, the minimum speech intensity is 47 dB.

- Clipping -
A fifth sanity check comprises determining whether the degree of clipping exceeds a maximum value. The degree of clipping is defined as the average number of samples in each speech frame which exceed an absolute value. In this case, the absolute value is 32000, which represents about 98% of the full-scale deflection of a 16-bit analog-to-digital converter.

- Speech Entropy -
A sixth sanity check includes checking whether a so-called "speech entropy" exceeds a minimum. In this example, the minimum speech entropy is 40.

Speech entropy is defined as the average distance between a speech featuregram and the mean feature vector of the speech featuregram. The mean feature vector is calculated by taking an average of the n-feature vectors in the featuregram. A distance between each feature vector and the mean feature vector is determined.

Preferably, a Manhattan distance is calculated, although a Euclidean distance may be used. An average distance is calculated by taking an average of the n values of distance.
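A sketch of the two entropy measures using Manhattan distances, as described above; both operate on featuregrams represented as NumPy arrays with one feature vector per row.

```python
import numpy as np

def speech_entropy(speech_featuregram):
    """Average Manhattan distance between each feature vector and the mean vector."""
    mean_vector = speech_featuregram.mean(axis=0)
    distances = np.abs(speech_featuregram - mean_vector).sum(axis=1)
    return distances.mean()

def speech_noise_entropy(speech_featuregram, noise_featuregram):
    """Average Manhattan distance between the mean speech vector and the noise vectors."""
    mean_vector = speech_featuregram.mean(axis=0)
    return np.abs(noise_featuregram - mean_vector).sum(axis=1).mean()
```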

- Speech/Noise Entropy -
A seventh sanity check comprises testing whether a so-called "speech-to-noise entropy" exceeds a minimum. Speech-to-noise entropy is defined as the average distance between the mean feature vector of the speech featuregram and the feature vectors of the background noise. In this example, the minimum speech-to-noise entropy is 40.

Referring to Figure 26, a process of performing the sanity checks is shown. A plurality of sanity checks are performed and a tally is kept of the number of failures (steps S1.6.1 to S1.6.6). If the number of failures exceeds a threshold, for example 3, then the signal is deemed to be inadequate and the user is asked to check their set-up

(steps S1. 6.7 and S1. 6.8). Otherwise, the recorded signal 19 (Figure 5) is considered to be satisfactory (step S1. 6. 9).

Creating a speech featuregram
Once the endpoints of the recorded signal 19 (Figure 5) have been identified and the recorded signal (Figure 5) passes a plurality of sanity checks, a speech featuregram may be created.

Referring to Figure 27, a speech featuregram 63 is created using a process 64 by concatenating feature vectors 24 extracted from the section of the featuregram 25 that originates from the speech utterance. The speech section of the featuregram is located via the speech endpoints 28, 28'.

Creating a speech featuregram archetype
The aim of the enrolment is to provide a characteristic voiceprint for one or more words or phrases. However, specimens of the same word or phrase provided by the same user usually differ from one another. Therefore, it is desirable to obtain a plurality of specimens and derive a model or archetypal specimen. This may involve discarding one or more specimens that differ significantly from other specimens.

Referring to Figure 28, a speech featuregram archetype 65 is calculated using an averaging process 66 applied to w featuregrams 631, 632, ..., 63w. Typically, an average of three featuregrams 63 is taken.

Referring to Figures 29, 30, 31 and 32, the featuregram archetype 65 is computed by determining a winning score D for each featuregram 631, 632, ..., 63w warped, using a modified version of process 50 which is shown in Figure 32, against each other featuregram 631, 632, ..., 63w to create a w-by-w featuregram cost matrix 67, whose diagonal elements are zero (steps S1.8.1 to S1.8.9).

Excluding the diagonal elements, a minimum value Dmin in the featuregram cost matrix 67 is determined (step S1.8.10). If the minimum value Dmin is greater than a predefined threshold distance D0, then all the featuregrams 631, 632, ..., 63w are considered to be so dissimilar that a featuregram archetype 65 cannot be created (step S1.8.11).

Referring to Figures 29 and 31, if one or more values in the featuregram cost matrix 67 are less than the threshold D0, then w featuregram archetypes 681, 682, ..., 68w are computed using each featuregram 631, 632, ..., 63w as a reference and warping each of the remaining (w-1) featuregrams onto it (steps S1.8.12 to S1.8.21).

Referring to Figures 29, 33, 34 and 35, once the w featuregram archetypes 681, 682, ..., 68w have been created, a w-by-w featuregram archetype cost matrix 69 is computed whose elements consist of the winning scores E' from warping each featuregram 631, 632, ..., 63w onto each featuregram archetype 681, 682, ..., 68w (steps S1.8.22 to S1.8.28).

An average featuregram archetype cost matrix 70 is computed by averaging the elements within each column 71 corresponding to a featuregram 631, 632, ..., 63w (steps S1.8.29 to S1.8.37).

A maximum value E'max in the featuregram archetype cost matrix 69 is also determined (step S1.8.38).

If the maximum value E'max in the featuregram archetype cost matrix 69 is less than the threshold D0, then the featuregram archetype 681, 682, ..., 68w which provides the lowest mean featuregram archetype cost <E'1>, <E'2>, ..., <E'w> is chosen to be included in the voice authentication biometric (steps S1.8.37 to S1.8.50). The mean featuregram archetype costs <E'1>, <E'2>, ..., <E'w> are calculated by averaging the elements within each row 72.

If the maximum value E'max in the featuregram archetype cost matrix 69 is greater than the threshold D0, then a featuregram 631, 632, ..., 63w is excluded, thus reducing the number of featuregrams to (w-1), and steps S1.8.1 to S1.8.50 are repeated (step S1.8.54).

A featuregram 631, 632, ..., 63w is chosen for exclusion by calculating a variance σ1, σ2, ..., σw for each featuregram archetype 681, 682, ..., 68w and excluding the featuregram 631, 632, ..., 63w corresponding to the featuregram archetype 681, 682, ..., 68w having the lowest value of variance σ1, σ2, ..., σw (steps S1.8.51 to S1.8.53). For example, for an ith featuregram archetype 68i, a variance σi is calculated from the average featuregram archetype cost matrix 70. Thus, the mean featuregram archetype cost <E'1>, <E'2>, ..., <E'w> which produced the lowest average distance results in the reference featuregram 631, 632, ..., 63w from which it was created being discarded.

Steps S1.8.1 to S1.8.50 are repeated until a featuregram archetype 65 (Figure 28) is obtained or until only one featuregram 631, 632, ..., 63w is left.
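A compressed sketch of the archetype-selection logic follows. The scorer dtw_distance() is the sketch given earlier and warp_average() is a hypothetical helper that builds a candidate archetype by warping the remaining specimens onto a reference featuregram and averaging; the variance-based exclusion and the outer repetition loop are reduced to a single pass for the purpose of illustration.

```python
import numpy as np

def choose_archetype(featuregrams, dtw_distance, warp_average, d0):
    """Pick the candidate archetype with the lowest mean cost against all specimens.

    Returns the chosen archetype, or None if the specimens are too dissimilar.
    """
    w = len(featuregrams)
    cost = np.zeros((w, w))
    for a in range(w):
        for b in range(w):
            if a != b:
                cost[a, b] = dtw_distance(featuregrams[a], featuregrams[b])
    off_diagonal = cost[~np.eye(w, dtype=bool)]
    if off_diagonal.min() > d0:
        return None                     # all specimens too dissimilar

    # One candidate archetype per reference featuregram; score each candidate
    # against every specimen and keep the one with the lowest mean cost.
    candidates = [warp_average(featuregrams[r],
                               [f for k, f in enumerate(featuregrams) if k != r])
                  for r in range(w)]
    mean_costs = [np.mean([dtw_distance(c, f) for f in featuregrams]) for c in candidates]
    return candidates[int(np.argmin(mean_costs))]
```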

Setting an appropriate pass level
A featuregram archetype 65 is obtained for each prompt. Thus, during subsequent authentication, a user is asked to provide a response to a prompt. A featuregram is obtained and compared with the featuregram archetype 65 using a dynamic time warping process which produces a score. The score is compared with a preset pass level. A score which falls below the pass level indicates a good match and so the user is accepted as being a valid user.

A valid user is likely to provide a response that results in a low score, falling below the pass level, and which is accepted. However, there may be occasions when even a valid user provides a response that results in a high score and which is rejected.

Conversely, an impostor may be expected to provide poor responses which are usually rejected. Nevertheless, they may occasionally provide a sufficiently close- matching response which is accepted. Thus, the pass level affects the proportion of

valid users being incorrectly rejected, i.e. the "false reject rate" (FRR), and the proportion of impostors which are accepted, i.e. the "false accept rate" (FAR).

In this example, a neutral strategy is adopted which shows no bias towards preventing unauthorised access or allowing authorised access.

A pass level for a fixed-word or fixed-phrase prompt is determined using previously acquired captured recordings taken from a wide range of representative speakers.

A featuregram archetype is obtained for each of a first set of users for the same prompt in a manner hereinbefore described. Thereafter, each user provides a spoken response to the prompt from which a featuregram is obtained and compared with the user's featuregram archetype using a dynamic time warping process so as to produce a score. This produces a first set of scores corresponding to valid users.

The process is repeated for a second set of users, again using the same prompt.

Once more, each user provides a spoken response to the prompt from which a featuregram is obtained. However, the featuregram is compared with a different user's featuregram archetype. Another set of scores is produced, this time corresponding to impostors.

Referring to Figure 36, the frequencies of scores for valid users and impostors are fitted to first and second probability density functions 731, 732 respectively using:

p(x) = (1 / (σ √(2π))) exp( -(x - μ)^2 / (2σ^2) )

where p is probability, x is score, μ is the mean score and σ is the standard deviation.

Other probability density functions may be used.

The mean score μ1 for valid users is expected to be lower than the mean score μ2 for the impostors. Furthermore, the standard deviation σ1 for the valid users is usually smaller than the standard deviation σ2 of the second density function. Referring to Figure 37, the first and second probability density functions 731, 732 are numerically integrated to produce first and second cumulative density functions 741, 742. The point of intersection 75 of the first and second cumulative density functions 741, 742 is the equal error rate (EER), wherein FRR = FAR. The score at the point of intersection 75 is used as the pass score for the prompt.
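A sketch of the pass-level calculation: normal densities are fitted to the valid-user and impostor scores, the densities are integrated numerically, and the score at which the false reject and false accept rates are equal is located by scanning candidate thresholds. The grid resolution and scan range are illustrative assumptions.

```python
import numpy as np

def pass_level(valid_scores, impostor_scores):
    """Score at which FRR (valid scores above threshold) equals FAR (impostor scores below)."""
    mu1, sigma1 = np.mean(valid_scores), np.std(valid_scores)
    mu2, sigma2 = np.mean(impostor_scores), np.std(impostor_scores)

    def fitted_cdf(x, mu, sigma):
        # Numerically integrate the fitted density up to x (trapezoidal rule).
        grid = np.linspace(mu - 6 * sigma, x, 2000)
        pdf = np.exp(-0.5 * ((grid - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        return np.trapz(pdf, grid)

    # Scan candidate thresholds and pick the point where FRR and FAR intersect.
    thresholds = np.linspace(min(mu1 - 3 * sigma1, mu2 - 3 * sigma2),
                             max(mu1 + 3 * sigma1, mu2 + 3 * sigma2), 500)
    frr = np.array([1.0 - fitted_cdf(t, mu1, sigma1) for t in thresholds])  # valid users rejected
    far = np.array([fitted_cdf(t, mu2, sigma2) for t in thresholds])        # impostors accepted
    return thresholds[int(np.argmin(np.abs(frr - far)))]
```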

Creating a voice authentication biometric
Referring to Figure 38, a voice authentication biometric 76 is shown. The voice authentication biometric 76 comprises sets of data 771, 772, ..., 77q corresponding to featuregram archetypes 65 and associated prompts 78. Statistical information 79 regarding each featuregram archetype 65 and an associated prompt 78 may also be stored and will be described in more detail later. The voice authentication biometric 76 further comprises ancillary information including the number of prompts to be issued during authentication 80, scoring strategy 81, and pass level and gain settings 82. The biometric 76 may include further information, for example related to high-level logic for analysing scores.

The voice authentication biometric 76 is stored in non-volatile memory 11 (Figure 1).

Authentication
Referring again to Figures 1 and 2, once enrolment has been successfully completed, the user is registered as a valid user. Access to the secure system 3 is conditional on successful authentication.

Referring to Figure 39, the authentication process corresponding to step S2 in Figure 2, is shown:

The voice authentication system 1 is initialised, for example by setting the amplifier gain to a value stored in the voice authentication biometric, or calibrated, for example to ensure that an appropriate amplifier gain is set (step S2.1). The user is then prompted (step S2.2) and the user's responses are recorded (step S2.3). Featuregrams are generated from the recordings (step S2.4). The recordings are examined so as to isolate speech from background noise and periods of silence (step S2.5, step S2.6).

Checks are performed to ensure that the recordings, isolated speech utterances and featuregrams are suitable for processing (step S2.7). The featuregrams are then matched with the featuregram archetype (step S2.8). The response is also checked for replay attack (step S2.9). The user's response is then scored (step S2.10).

Initialisation/Calibration
The gain of the amplifier 6 (Figure 1) is set according to the value 82 (Figure 37) stored in the voice authentication biometric 76 (Figure 37) which is stored in non-volatile memory 11 (Figure 1).

Alternatively, the system may be calibrated in a way similar to that used in enrolment. However, the process may differ. For example, prompts used in authentication may differ from those used in enrolment. A value of gain determined during this calibration need not be recorded, but may be compared with the value stored in the voice authentication biometric and used to determine whether the user is a valid user.

Authentication prompts
Authentication prompts are chosen from those stored in the voice authentication biometric 76 (Figure 37). Preferably, prompts are randomly chosen from a sub-set.

This has the advantage that it becomes more difficult for a user to guess what prompt will be used and so give an unnatural response. Moreover, this improves security.

Recording
Referring to Figure 40, following the or each prompt, a signal 83 is recorded using the microphone 5 (Figure 1) in a manner hereinbefore described.

Creating authentication featuregrams
Referring to Figures 41, 42 and 43, the or each recorded signal 83 is divided into timeslices 84. The timeslices 84 use the same window size and the same overlap as used for enrolment. Feature vectors 85 are created. Again, the same process 25 is used in authentication as in enrolment. The feature vectors 85 are concatenated to produce featuregrams 86. The featuregrams 86 generated during authentication are usually referred to as authentication featuregrams.

Referring to Figure 44, explicit endpointing may be performed using the process 27 described earlier so as to generate endpoints 87. Explicit endpointing may be used to support sanity checks.

Sanity checks
Sanity checks are conducted on the recorded signal 83 as described earlier.

Matching authentication featuregrams with the voice authentication biometric
Referring to Figure 45, the process 50 and the featuregram archetype 65 are used to word spot the authentication featuregram 86 and provide a dynamic time warping winning score 87. The process 50 may be used to provide endpoints 28'.

Rejecting a replay attack
A potential threat to the security offered by any voice authentication system is the possibility of an impostor secretly recording a spoken response of the valid user and subsequently replaying the recording to gain access to the system. This is known as a "replay attack". One solution to this problem is to issue, during each separate authentication, a randomly chosen subset of prompts from the full set of prompts responded to during enrolment. This means that several different authentication sessions will need to be secretly recorded before an impostor can collect a complete set of the responses.

However, this does not combat the threat from recordings made during enrolment.

Another solution is to store copies of the featuregrams generated during recent authentications and track them to see if they vary sufficiently over time. However, this has several drawbacks. Firstly, additional storage is needed. Secondly, replaying the same recording on several occasions under different levels and types of background noise may in itself provide sufficient variability for the system to be fooled into thinking that it is observing legitimate live spoken responses provided by the valid user.

Referring to Figures 46 and 47, a process for rejecting a replay attack is shown: A fixed-phrase prompt is randomly selected (step S2.9.1). An example of a fixed-phrase prompt is "This is my voiceprint". A recording is started (step S2.9.2). The user is then prompted a first time (step S2.9.3). After a predetermined period of time, for example 1 or 2 seconds, the user is prompted a second time with the same prompt (steps S2.9.4 & S2.9.5). Thus, the user supplies two different examples 891, 892 separated by a 1-2 second interval 90. A featuregram 86 is generated as described earlier (step S2.9.6). The interval may comprise silence and/or noise.

The word spotting process 50 is used to isolate the two spoken responses 891, 892 to the fixed-phrase prompt and the interval 90 (steps S2.9.7 & S2.9.8). The isolated responses 891, 892, in the form of truncated featuregrams, are fed to process 88.

Each truncated featuregram provides a representation of the spoken response. The duration of the interval 90 is determined.

If the featuregrams 891, 892 are too similar, either to each other, or to the featuregram archetype 65 stored in the voice authentication biometric, then authentication is rejected on suspicion of a replay attack (steps S2.9. 9 to S2.9. 13). A corresponding reject flag 91 is set.

A record 92 is kept of a degree of match between the two featuregrams 891, 892 and the length of the intermediate silence 90 (step S2.9.11). This record 92 is known as a "Replay Attack Statistic". The record 92 comprises two integers.

Therefore, it is possible to store a plurality of replay attack statistics 92 for each fixed-phrase prompt in the voice authentication biometric 76 (Figure 36) without consuming a significant amount of memory. The record 92 is stored in the statistical information 79 (Figure 37).

If, during a subsequent authentication, a close match is detected between the latest replay attack statistic 92 and any replay attack statistic 92 previously stored in the voice authentication biometric 76 (Figure 36) (steps S2.9.15 to S2.9.16), then the authentication may be rejected on suspicion of a replay attack. Additionally or alternatively, the process may be repeated using a different prompt and a check made for a replay attack based on another set of replay attack statistics 92.

If, during a subsequent authentication, the duration of the interval 90 is found to be the same as the duration of the interval 90 for the same prompt arising from an earlier authentication, then the authentication may also be rejected on suspicion of a replay attack (steps S2.9.17 and S2.9.18).
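
The checks described above might be consolidated as in the following Python sketch. The thresholds, tolerances and the exact packing of the two-integer statistic are assumptions made for illustration; score stands for a matching function such as the dynamic time warping score given earlier, where a lower value means a closer match.

```python
def check_replay_attack(response1, response2, interval_length, archetype,
                        stored_stats, score, similarity_threshold,
                        match_tolerance, interval_tolerance):
    """Reject if the two responses are suspiciously alike, suspiciously close
    to the stored archetype, or if the new replay attack statistic closely
    repeats one recorded during an earlier authentication.
    Returns (accepted, new_statistic)."""
    match = int(score(response1, response2))
    if (match < similarity_threshold
            or score(response1, archetype) < similarity_threshold
            or score(response2, archetype) < similarity_threshold):
        return False, None                        # reject flag 91 set
    for previous_match, previous_interval in stored_stats:
        statistic_repeats = (abs(match - previous_match) <= match_tolerance
                             and abs(interval_length - previous_interval) <= interval_tolerance)
        interval_repeats = interval_length == previous_interval
        if statistic_repeats or interval_repeats:
            return False, None                    # suspected replay of an earlier recording
    # Accept, and return the new two-integer replay attack statistic so it can
    # be appended to the statistical information held in the biometric.
    return True, (match, int(interval_length))
```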

The advantage of using this approach is that it is possible to monitor and detect suspicious similarities between featuregrams even if the acoustic environment has changed since the time the recording was originally made.

Furthermore, the approach helps to guard against replay attacks based on recordings made during enrolment and authentication. Additionally, the cost of storing the replay attack statistics is low, typically 3 bytes per prompt. Thus, to monitor the last 5 authentication attempts across 5 fixed prompts typically requires 75 bytes of memory.

Higher-level decision logic
A decision on whether to accept or reject the user is based on the degree of match between featuregram archetypes 65 stored in the voice authentication biometric 76 (Figure 36) and the featuregrams 86 derived from the authentication recordings.

Higher-level decision logic is subsequently applied.

Higher-level decision logic may include calculating an average score for a plurality of featuregrams 86 and determining whether the average score falls below a first predetermined scoring threshold, i.e. Dav < Dthresh1, where Dav is the average score. If the average score falls below the first predetermined scoring threshold, then authentication is considered successful.

Higher-level decision logic may include determining the number, n, of featuregrams 86 whose scores fall below a second predetermined scoring threshold, i.e. Di < Dthresh2, where 0 ≤ i < p indexes the featuregrams. The decision logic subsequently comprises checking a pass condition. For example, the pass condition may be that the scores for n out of p featuregrams 86 fall below the second predetermined scoring threshold, where 1 ≤ n ≤ p. Allowing one or more of the featuregram scores to be ignored is useful because it allows the valid user to provide an uncharacteristic response to at least one of the prompts without being unduly penalised.
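
By way of illustration only, the two rules might be combined as in the following sketch. Whether the rules are applied together or separately is an assumption here; the text states only that the decision logic "may include" each of them.

```python
def accept_user(scores, d_thresh1, d_thresh2, n_required):
    """Apply both decision rules: the average score must fall below the first
    threshold, and at least n_required of the p scores must fall below the
    second threshold. Lower scores indicate better matches."""
    p = len(scores)
    average_ok = (sum(scores) / p) < d_thresh1
    n_good = sum(1 for d in scores if d < d_thresh2)
    return average_ok and n_good >= n_required
```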

For fixed prompts, with a priori knowledge of a response, the scoring thresholds may be set based upon the statistical method described earlier.

For challenge-response prompts, a threshold may be determined during enrolment.

A plurality of specimens, preferably two or three, of the same response are taken. A featuregram archetype is determined. Additionally, a variance is determined.
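
One possible way of turning the specimens into an archetype and threshold is sketched below, assuming the specimens have already been time-aligned to a common length. The scaling factor k and the standard-deviation form of the threshold are assumptions; the text states only that a variance is determined.

```python
import numpy as np

def challenge_response_threshold(specimens, k=2.0):
    """Derive a featuregram archetype and a per-prompt scoring threshold from
    two or three enrolment specimens of the same response."""
    stack = np.stack(specimens)
    archetype = stack.mean(axis=0)             # featuregram archetype
    variance = float(stack.var(axis=0).mean()) # overall variance of the specimens
    threshold = k * np.sqrt(variance)          # scoring threshold for this prompt
    return archetype, threshold
```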

Thus, a fixed number of prompts are issued and spoken responses are recorded.

The spoken responses are analysed to determine whether a valid user is addressing the system.

However, an alternative strategy may be used, which adaptively determines a number of prompts to be issued.

Initially, a user is prompted a predetermined number of times, for example two or three times. Spoken responses are recorded, corresponding featuregrams are obtained and compared with the featuregram archetype so as to produce a number of scores.

Depending on the score, further prompts may be issued. For example, if all or substantially all the scores fall below a threshold score, indicating a good number of matches, then no further prompts are issued and authentication is successful.

Conversely, if all or substantially all the scores exceed the threshold score, indicating a poor number of matches, then authentication is unsuccessful.

However, if some scores fall below the threshold and other scores exceed the threshold, then further prompts are issued and further scores obtained.

This process continues until either the proportion of successful scores exceeds a first predetermined proportion, for example 70%, in which case authentication is successful, or falls below a second predetermined proportion, such as 30%, in which case authentication is considered unsuccessful.
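
The adaptive strategy might be implemented along the following lines. The prompt_and_score helper and the upper limit on the number of prompts are assumptions made for illustration; the text does not state a maximum.

```python
def adaptive_authenticate(prompt_and_score, threshold, initial_prompts=3,
                          upper=0.7, lower=0.3, max_prompts=10):
    """Adaptive prompting: prompt_and_score issues one prompt, records the
    response and returns its matching score (lower is better).
    Returns True for successful authentication."""
    scores = [prompt_and_score() for _ in range(initial_prompts)]
    while True:
        good = sum(1 for s in scores if s < threshold) / len(scores)
        if good >= upper:
            return True                        # enough consistent matches
        if good <= lower:
            return False                       # mostly poor matches
        if len(scores) >= max_prompts:
            return False                       # still undecided after the assumed limit
        scores.append(prompt_and_score())      # undecided: issue a further prompt
```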

This has the advantage that valid users who provide consistently good examples of speech when prompted need only provide a small number of spoken responses, thus saving time.

In the above embodiment, the voice authentication system is comprised in a single unit, such as a personal computer. However, the voice authentication system may be distributed.

For example, the processor for performing the matching process and non-volatile memory holding the voice authentication biometric may be held on a so-called "smart card" which is carried by the valid user. This is particularly convenient for controlling access to a room or building via an electronically-controlled lockable door. The door is provided with a microphone and a smart card reader. The door is also provided with a speaker for providing audio prompts and/or a display for providing text prompts. When the smart card is inserted into the smart card reader, the voice authentication system is connected and permits authentication and, optionally, enrolment. Enrolment may be performed elsewhere, preferably under supervision of the system administrator, using another microphone and smart card reader together with a speaker and/or display. This has the advantage that access is conditional not only on successful authentication, but also on possession of the smart card. Furthermore, the voice authentication biometric and the matching process may be encrypted. The smart card may also be used in personal electronic devices, such as cellular telephones and personal data assistants.

Smart Card
Voice authentication using a smart card will now be described in more detail. Referring to Figure 48, a modified voice authentication system 1' is provided by a personal computer 93, for example in the form of a lap-top personal computer, and a smart card 94. The personal computer 93 includes a smart card reader 95 for permitting the personal computer 93 and smart card 94 to exchange signals. The smart card reader 95 may be a peripheral device connected to the computer 93.

The smart card 94 includes an input/output circuit 96, processor 9', non-volatile memory 10' and volatile memory 11'. If the smart card 94 is used for storing the voice authentication biometric, but not for performing the matching or other processes, then the smart card 94 need not include the processor 9' and volatile memory 11'.

Referring to Figure 49, the smart card 94 takes the form of a contact smart card 94₁.

The contact smart card 94₁ includes a set of contacts 97 and a chip 98. An example of a contact smart card 94₁ is a JCOP20 card.

Referring to Figure 50, the smart card 94 may alternatively take the form of a contactless smart card 94₂. The contactless smart card 94₂ includes a loop or coil 99 and a chip 100. The contactless smart card 94₂ may include a plurality of sets of loops (not shown) and corresponding chips (not shown). An example of a contactless smart card 94₂ is an iCLASS™ card produced by HID Corporation.

Referring to Figure 51, the contact smart card 94₁ and smart card reader 95 are shown in more detail.

The contact smart card 94₁ and smart card reader 95 are connected by an interface 101 including a voltage line Vcc 102₁, for example at 3 or 5 V, a reset line Rst 102₂ for resetting RAM, a clock line 102₃ for providing an external clock signal from which an internal clock is derived, and an input/output line 102₄. Preferably, the interface conforms to ISO 7816.

Volatile memory 11' (Figure 48) is in the form of RAM 103 and is used during operation of software on the card. If a reset signal is applied to line Rst 102₂ or if the card is disconnected from the card reader 95, then the contents of the RAM 103 are reset.

Non-volatile memory 10' (Figure 48) is in the form of ROM 104 and EEPROM 105.

An operating system (not shown) is stored in ROM 104. Application software (not shown) and the voice authentication biometric 76 (Figure 38) are stored in EEPROM 105. Contents of the EEPROM 105 may be set using a card manufacturer's development kit. The EEPROM 105 may have a memory size of 8 or 16 kbits, although other memory sizes may be used.

Processor 9' (Figure 48) may be in the form of an embedded processor 106 which handles, amongst other things, encryption, for example based on the triple Data Encryption Standard (triple DES).

The interface 96, RAM 103, ROM 104, EEPROM 105 and processor 106 may be incorporated into the chip 98 (Figure 49), although a plurality of chips may be used.

The smart card reader 95 is connected to the personal computer 93, which runs one or more computer programs for permitting communication with the smart card 94₁ using Application Protocol Data Units (APDUs), for example as specified in ISO 7816-4. However, other schemes and/or protocols which allow communication between a smart card and reader may be used.

Referring to Figure 52, a table 107 lists APDU commands 108 that can be sent to the smart card 94₁ (Figure 49) and corresponding responses 109. Matrix 110 links commands 108 with corresponding responses 109 using an 'X'.
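
By way of illustration only, the host side of such an exchange might resemble the following Python sketch, which assumes the pyscard library for reader access. The class byte and instruction code are hypothetical, since the actual values used for the commands 108 are not given here; the ISO 7816-4 layout (CLA, INS, P1, P2, Lc, data) and the 0x90 0x00 success status word are standard.

```python
from smartcard.System import readers   # pyscard

# Hypothetical class byte and instruction code for one of the commands 108.
CLA = 0x80
INS_TEMPLATE_DOWNLOAD = 0x10

def send_apdu(connection, ins, p1, p2, data):
    """Build and transmit an ISO 7816-4 command APDU."""
    apdu = [CLA, ins, p1, p2, len(data)] + list(data)
    response, sw1, sw2 = connection.transmit(apdu)
    return response, (sw1, sw2)         # status word 0x90 0x00 means success

# Assumes a reader with the card inserted is attached to the host computer.
connection = readers()[0].createConnection()
connection.connect()
resp, status = send_apdu(connection, INS_TEMPLATE_DOWNLOAD, 0x00, 0x00, [0x01, 0x02])
```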

The term "template" is used hereinafter to refer to a featuregram archetype 65.

Template download
During or following enrolment, a voice authentication biometric 76 (Figure 38), which includes one or more templates 65 (Figure 38), is stored on the smart card 94.

Referring to Figure 53, a process of downloading templates 65 (Figure 38) to the smart card 94 is shown.

One or more "template download" commands are transmitted, each command including a respective section or portion of the template 65 (step S3.1) to be stored on the smart card 94 in EEPROM 105 (Figure 51). For example, a section of a template can be a feature vector. The smart card 94 returns a response indicating whether the template download command was successful or unsuccessful (step S3.2).

If unsuccessful, the response specifies an error and the process is repeated (steps S3.3 & S3.4).

This process may be repeated a plurality of times for each template 65 (Figure 38), for example corresponding to different prompts.

If no more sections of a template are to be downloaded, then a "template download ended" command is sent (step S3.5). The smart card 94 returns a response indicating whether the template download was successful or unsuccessful (step S3.6).

Other data, such as data relating to the prompt, may be included in the data field of the "template download" command.
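
A host-side sketch of the download sequence is given below. The send_command helper, which is assumed to wrap the corresponding APDU and report success, and the retry limit are illustrative; the command names mirror those used in the text, not actual byte codes.

```python
def download_template(send_command, template_sections, max_retries=3):
    """Send each section of a template to the card, then close the sequence.
    Each section may be, for example, a single feature vector."""
    for section in template_sections:
        for _ in range(max_retries):
            if send_command("template download", section):
                break                          # section stored in EEPROM 105
        else:
            raise RuntimeError("template download failed after retries")
    if not send_command("template download ended", b""):
        raise RuntimeError("template download not acknowledged")
```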

Template upload
During authentication, at least one template 65 is compared with a featuregram 86 (Figure 45). The smart card 94 need not necessarily perform the comparison, i.e. the comparison is performed "off-chip". This may be because the smart card 94 does not have a sufficiently powerful processor. Alternatively, it may be decided to perform an "off-chip" comparison. Under these circumstances, one or more templates 65 (Figure 38) are uploaded to the computer 93.

The process is similar to template downloading, but carried out in the reverse direction. In other words, an "upload template to laptop" command is used.

Featuregram download
As explained earlier, during authentication, at least one template 65 is compared with a featuregram 86 (Figure 45). It is advantageous for the smart card 94 to perform the comparison, i.e. for the comparison to be performed "on-chip", in which case the templates 65 (Figure 38) do not leave the smart card 94. This can help prevent copying, stealing and corruption of templates 65 (Figure 38).

One or more "feature vector download" commands are transmitted, each including a respective feature vector (step S4.1). The smart card 94 returns a response indicating whether the feature vector download was successful or unsuccessful (step S4.2). If unsuccessful, the response specifies an error and the process is repeated (steps S4.3 & S4.4).

This process is repeated a plurality of times until a featuregram 86 is downloaded (Figure 38).

If no more feature vectors are to be downloaded, then a "feature vector download ended" command is sent (step S4.5). The smart card 94 returns a response indicating whether the feature vector download was successful or unsuccessful (step S4.6).

Referring to Figure 55, once the featuregram 86 has been downloaded, a "return score" command is sent to the card (step S5.1). The processor 106 compares the template 65 with the featuregram 86 as described above to produce a score, which is compared with a threshold, which may be hard-coded or previously downloaded (step S5.2), and a response is returned indicating whether authentication was successful or unsuccessful (step S5.3).
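
A host-side sketch of the on-chip matching sequence described above might look as follows. As before, send_command is an assumed wrapper around the corresponding APDU that reports success; the card performs the scoring itself, so the template never leaves the card.

```python
def authenticate_on_card(send_command, feature_vectors, max_retries=3):
    """Send the featuregram one feature vector at a time, then ask the card to
    score it against its stored template and apply the threshold on-chip.
    Returns True if the card reports successful authentication."""
    for vector in feature_vectors:
        for _ in range(max_retries):
            if send_command("feature vector download", vector):
                break
        else:
            raise RuntimeError("feature vector download failed after retries")
    if not send_command("feature vector download ended", b""):
        raise RuntimeError("feature vector download not acknowledged")
    return send_command("return score", b"")
```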

APDU commands may be used to delete featuregrams. As shown in Figure 57, other APDU commands may be provided, such as "delete biometric" for deleting all templates and other data, and "verify biometric loaded" for checking whether the card holds a voice authentication biometric.

The smart card can perform other processes, including some of the processes described earlier, such as detecting a replay attack and performing higher-level decision logic.

Storing the voice authentication biometric on a smart card can have several advantages.

It helps provide a secure mechanism for validating that a smart card user is the smart card owner. This is of particular importance for financial transactions, for example using credit and debit cards.

It helps keep the voice authentication biometric in the user's possession. This helps to avoid data protection issues, such as the need to comply with data protection legislation.

Authentication can be performed at a remote site without the need to communicate with a server holding a database containing voice authentication biometrics.

Furthermore, because the smart card is available locally at point of use, it helps avoid the need to communicate through telephone or data lines, thus helping to save costs, increase speed and improve security.

Performing matching on the smart card can also have several advantages.

It helps to avoid the voice authentication biometric being copied, stolen or corrupted.

Additionally, it can help provide backward compatibility by minimising modifications to an existing system so as to provide a facility for voice authentication.

Many modifications may be made to the embodiment hereinbefore described. For example, the recorded signal may comprise a stereo recording. The smart card may be any sort of token, such as a tag or key, or may be incorporated into objects such as a watch or jewellery, which can be held in the user's possession. Information storage media and devices may be used, such as a memory stick, floppy disk or optical disk. The smart card may be a mobile telephone SIM card. The smart card may be marked and/or store data so as to identify that the card belongs to a given user.

Prompts need not be explicitly stated. For example, a prompt may be a green light or the word "Go".

Measurements of background noise may be made in different ways. For example, a recorded signal, or part thereof, may be divided into a plurality of frames. A value of background noise may be determined by selecting one or more of the lowest energy frames and either using one of the selected frames as a representative frame or obtaining an average of all the selected frames. To select the one or more lowest energy frames, the frames may be arranged in order of signal energy. Thereafter, the ordered frames may be examined to determine a boundary where signal energy jumps from a relatively low level to a relatively high level. Alternatively, a predetermined number of frames at the lower energy end may be selected.
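
A minimal sketch of the second strategy, in which a predetermined number of the lowest-energy frames are averaged, is given below; the choice of n_lowest is illustrative. The boundary-detection variant would instead examine the ordered energies for a jump from low to high values.

```python
import numpy as np

def estimate_background_noise(frames, n_lowest=5):
    """Order the frames by signal energy and average the n_lowest lowest-energy
    frames to obtain a background-noise estimate."""
    energies = np.array([float(np.sum(np.asarray(f, dtype=float) ** 2)) for f in frames])
    lowest = np.sort(energies)[:n_lowest]
    return float(lowest.mean())
```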