

Title:
USER AUTHENTICATION AND LOGIN METHODS
Document Type and Number:
WIPO Patent Application WO/2023/081962
Kind Code:
A1
Abstract:
The present invention is directed to the authentication of a user of a digital service provider online or via telephone. More particularly, the invention relates to methods for the authentication of a user by signal analysis of the user's voice by digital audio signal processing means. The signal processing may be performed on an audio signal of the user speaking an identifier such as their cell phone number.

Inventors:
NIKITIN DMITRY (AU)
Application Number:
PCT/AU2022/051333
Publication Date:
May 19, 2023
Filing Date:
November 08, 2022
Assignee:
OMNI INTELLIGENCE PTY LTD (AU)
International Classes:
G06F21/32; G10L17/24; H04M1/64; H04M1/663; H04M3/38
Foreign References:
US20210090561A1 (2021-03-25)
US20150095028A1 (2015-04-02)
US20110276323A1 (2011-11-10)
US7386448B1 (2008-06-10)
US20180233152A1 (2018-08-16)
Attorney, Agent or Firm:
BOROVEC, Steven (AU)
Claims:
CLAIMS:

1. A computer-implemented method for authenticating or partially authenticating the identity of a user, the method comprising the steps of: receiving an audio signal encoding a user's voice speaking a unique identifier, recognising the unique identifier by speech recognition and using the recognised identifier to determine the user amongst a plurality of users, and comparing the user audio signal or a representation thereof to a reference audio signal for the user or a representation of a reference audio signal for the user, wherein the identity of the user is authenticated or partially authenticated where the user audio signal or a representation thereof and the reference audio signal or representation of a reference audio signal for the user are comparable to at least a minimum level.

2. The method of claim 1, wherein the unique identifier is a unique string of letters, a unique string of numbers, a unique string of letters or numbers or a combination thereof, a telephone number, an account identifier, a customer identifier, a business identifier, an email address, or a username.

3. The method of claim 1 or claim 2, wherein the unique identifier is unique to the user amongst the plurality of users.

3. The method of claim 1 or claim 2, wherein the authentication or partial authentication is for the purpose of a user logging into a computer-implemented service, or for identifying a user participating in a voice call.

4. The method of any one of claims 1 to 3, comprising an auxiliary authentication step.

5. The method of claim 4, wherein the auxiliary authentication step comprises the step of transmitting a verification code to the user via a validated communication channel, and requesting input of the transmitted verification code from the user.

6. The method of claim 5, wherein the validated communication channel is a cell phone or an email address of the user.

7. The method of claim 4, wherein the auxiliary authentication step is software-enabled.

8. The method of claim 7, wherein the software is embodied in the form of an authenticator application software.

9. The method of claim 8, wherein the authenticator application software is configured to present a time-limited verification code to the user.

10. The method of claim 9, wherein the authenticator application software is the Google™ authenticator app or a functional equivalent thereof.

11. The method of any one of claims 1 to 10, wherein the reference audio signal or representation thereof was provided by the user before execution of the method in the course of enrolment.

12. The method of any one of claims 1 to 11, implemented on an authentication server or a login server.

13. The method of any one of claims 1 to 12, wherein the audio signal is generated by the user speaking into: the microphone of a cell phone, or the microphone of a processor enabled device having Internet connectivity.

14. The method of claim 12 or claim 13, wherein the audio signal is transmitted by a cell phone, or a processor enabled device having Internet connectivity to the authentication server or the login server.

15. The method of any one of claims 1 to 14, wherein the step of receiving a user audio signal encoding a user's voice speaking a unique identifier is performed by an audio input software module, and/or the step of recognising the unique identifier by speech recognition is performed by a speech recognition software module, and/or the step of using the recognised identifier to determine the user amongst a plurality of users is performed by a user determination software module, and/or the step of comparing the user audio signal or a representation thereof to a reference audio signal or representation thereof for the user is performed by a comparison software module.

16. The method of claim 15, wherein the audio input software module, and/or the speech recognition software module, and/or the user determination software module, and/or the comparison software module software is/are executed on an authentication server or a login server.

17. A computer-implemented method of logging into a digital service or participating in a voice call, the method comprising the steps of: a user speaking a unique identifier into a user processor-enabled device so as to provide an audio signal encoding the user's voice speaking the unique identifier, receiving the audio signal encoding a user's voice speaking a unique identifier into a login server or an authentication server, recognising the unique identifier by speech recognition, and using the recognised identifier to determine the user amongst a plurality of users, and comparing the user audio signal or a representation thereof to a reference audio signal or a representation of the reference audio signal for the user, wherein the identity of the user is authenticated or partially authenticated where the user audio signal or a representation thereof and the reference audio signal or representation thereof for the user are comparable to at least a minimum level.

18. The method of claim 17, wherein the user processor enabled device is a personal computer or a mobile device.

19. The method of claim 18, wherein the mobile device is a smart phone or a tablet device or a smart watch.

20. The method of any one of claims 17 to 19 having a feature of any one of claims 1 to 16.

21. A processor-enabled device configured to authenticate or partially authenticate the identity of a user, the processor enabled device configured to: receive an audio signal encoding a user's voice speaking a unique identifier, recognise the unique identifier by speech recognition, and using the recognised identifier to determine the user amongst a plurality of users, and compare the user audio signal or a representation thereof to a reference audio signal or representation thereof for the user, wherein the identity of the user is authenticated or partially authenticated where the user audio signal or a representation thereof and the reference audio signal or representation of the reference audio signal for the user are comparable to at least a minimum level.

22. The processor enabled device of claim 21 having a feature of any one of claims 1 to 16.

23. The processor-enabled device of claim 21 or claim 22, that is a personal computer or a mobile device of the user.

24. The processor-enabled device of claim 23, wherein the mobile device is a smart phone or a tablet device.

24. A computer-readable medium comprising software configured to authenticate or partially authenticate the identity of a user, the medium having stored thereon program instructions configured to execute the method of any one of claims 1 to 16.

25. The medium of claim 24 configured as an application programming interface.

26. A system configured to authenticate or partially authenticate the identity of a user, the system comprising: a user processor-enabled device, and a login server or an authentication server configured to: receive an audio signal encoding a user's voice speaking a unique identifier from the user processor-enabled device, recognise the unique identifier by speech recognition, and using the recognised identifier determine the user amongst a plurality of users, compare the user audio signal or a representation thereof to a reference audio signal for the user or a derivative of a reference audio signal for the user, wherein the identity of the user is authenticated or partially authenticated where the user audio signal or a representation thereof and the reference audio signal or representation of a reference audio signal for the user are comparable to at least a minimum level.

27. The system of claim 26, wherein the login server or authentication server is configured to execute the method of any one of claims 1 to 16.

28. The system of claim 26 or claim 27, wherein the user processor enabled device is a personal computer or a mobile device.

29. The system of claim 28, wherein the mobile device is a smart phone or a tablet device or a smart watch.

30. A method for enrolling a user in a digital service, the method comprising the step of obtaining a digital audio signal of a user's voice speaking so as to allow for later recognition of the user's voice when the user speaks a unique identifier.

31. The method of claim 30, wherein the digital audio signal of the user's voice is converted into a representation thereof.

32. The method of claim 31 comprising a feature of the method of any one of claims 1 to 16.

33. A computer-implemented method of enrolling a user for a digital service, the method comprising the steps of: receiving an audio signal encoding the user's voice speaking a unique identifier, and storing the audio signal or a representation thereof in a database in linked association with the unique identifier.

34. The method of claim 33 comprising the step of receiving a unique identifier from the user.

35. The method of claim 33 or claim 34 comprising the step of receiving from the user information relating to an auxiliary authentication method and storing that information in a database in linked association with the unique identifier.

36. The method of any one of claims 33 to 35 comprising a feature of the method of any one of claims 1 to 16.

37. An enrolment server configured to execute the method of any one of claims 33 to 36.

38. The enrolment server of claim 37 that is administered by a digital service and configured to allow a user to enrol for the digital service.

39. The enrolment server of claim 37 configured to make network connection with a plurality of digital services and a plurality of users of each of the plurality of digital services.

Description:
USER AUTHENTICATION AND LOGIN METHODS

FIELD OF THE INVENTION

The present invention relates generally to the field of authentication of a user of a digital service provider online or via telephone. More particularly, the invention relates to methods for the authentication of a user by signal analysis of the user's voice by digital audio signal processing means.

BACKGROUND TO THE INVENTION

In accessing various services online or by telephone, a user is often required to firstly authenticate their identity. Such authentication is required to minimise any opportunity for unauthorised access to a service by a third party having dishonest intent. Authentication is particularly important where a service holds a user's private information or for services dealing in financial matters, such as a bank or an online stockbroker.

When accessing an online bank account for example, a user is typically required to insert a username into a first data input box and then a password into a second data input box, followed by clicking a "login" button or similar. If the password matches that stored against the username, then the login will be successful and the user is permitted to transact on the account. These entries are generally made manually, and are difficult if not impossible to perform when the hands are otherwise occupied, such as when driving. In any event, the process of entering details is time consuming.

As a further problem, the user may not recall the username and/or password for the account in question. Where the username has been forgotten, the user must make direct contact with the service provider (often by telephone) and authenticate their identity. The service provider will then provide the username, or ask the user to enrol again with a fresh username. The username may be issued by the service provider (such as an account number) and not be easily recalled by the user.

Where the password has been forgotten, the user will be required to manually reset the password to a new password via a previously elected email address associated with the account. Users are encouraged to use passwords that are not easily derived by a third party, and that contain long strings of characters of different types including upper and lower case letters, numbers, and special characters. Moreover, many providers require that a user regularly changes their password. Thus, it is not uncommon for a user to forget a password when making a login attempt.

User authentication is also often required when a user calls a telephone contact center. The contact center operator will usually ask the user for a unique identifier and then a series of questions in order to authenticate their identity before proceeding with the discussion. It is important that the authentication question has an answer that cannot be easily derived by a third party. For example, questions such as "what was the name of your first school?" will be easily answerable by a user, however a simple check of a social media page of the user by a dishonest third party may readily reveal the answer. Such questions are therefore discouraged. The disadvantage of authentication questions providing a higher level of security is that they are generally more obscure to the user and therefore more easily forgotten. A further problem is that the process of obtaining the unique identifier and presenting and answering a series of authentication questions can be time consuming. Again, the user may not remember an important detail such as an account number, or have the answer to an authentication question.

An overarching problem that applies to both login methods and voice call authentication is security. A user may write down a password or the answer to a user authentication query, thereby instantly compromising security.

A further overarching problem that applies to both login methods and voice call authentication is the time and complexity involved in enrolling for a digital service. In the course of enrolment via a webpage, a user will be required to manually enter a password. The password must be chosen according to a number of parameters dictated by the service provider (length, character types, no consecutive numbers, no birth dates, etc.), and it can take the user some time and thought to devise a password that complies and can also be remembered.

Similarly, time and mental effort is required on the part of a user to select a series of authentication questions when enrolling for a service having call center authentication. A balance between security and ease of recall must be judiciously considered.

It is an aspect of the present invention to provide an improvement in methods for logging a user into a digital service whereby the login process is shorter, and/or simplified and/or less likely to be aborted due to forgotten details. It is a further aspect to provide a useful alternative to prior art login methods.

A further aspect of the present invention is to provide an improvement in prior art methods for authentication of a user in the course of a voice call. It is a further aspect to provide a useful alternative to prior art authentication methods in that environment.

A further aspect of the present invention is to provide an improvement in prior art methods for enrolment of a user in a digital service. It is a further aspect to provide a useful alternative to prior art enrolment methods.

The discussion of documents, acts, materials, devices, articles and the like is included in this specification solely for the purpose of providing a context for the present invention. It is not suggested or represented that any or all of these matters formed part of the prior art base or were common general knowledge in the field relevant to the present invention as it existed before the priority date of each claim of this application.

SUMMARY OF THE INVENTION

In a first aspect, but not necessarily the broadest aspect, the present invention provides a computer-implemented method for authenticating or partially authenticating the identity of a user, the method comprising the steps of: receiving an audio signal encoding a user's voice speaking a unique identifier, recognising the unique identifier by speech recognition and using the recognised identifier to determine the user amongst a plurality of users, and comparing the user audio signal or a representation thereof to a reference audio signal for the user or a representation of a reference audio signal for the user, wherein the identity of the user is authenticated or partially authenticated where the user audio signal or a representation thereof and the reference audio signal or representation of a reference audio signal for the user are comparable to at least a minimum level.
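The decision logic of this first aspect can be sketched as follows. This is a minimal illustration under stated assumptions: the speech-recognition and voice-embedding stages are taken to be supplied by external components, and the identifier, the vectors and the similarity threshold are hypothetical.

```python
import math

# Hypothetical enrolment database: unique identifier -> reference voice
# representation (a fixed-length embedding vector). How the embedding is
# computed from audio is assumed to be handled by an external
# speaker-recognition component.
REFERENCE_EMBEDDINGS = {
    "0412345678": [0.12, 0.87, 0.45, 0.33],
}

def cosine_similarity(a, b):
    # One common way to compare two voice representations.
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def authenticate(recognised_identifier, user_embedding, threshold=0.85):
    """Authenticate where the user representation and the stored reference
    representation are comparable to at least a minimum level."""
    reference = REFERENCE_EMBEDDINGS.get(recognised_identifier)
    if reference is None:
        return False  # identifier not found amongst the plurality of users
    return cosine_similarity(user_embedding, reference) >= threshold
```

The threshold of 0.85 is arbitrary; a deployed system would tune the minimum level against measured false-accept and false-reject rates.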

In one embodiment of the first aspect, the unique identifier is a unique string of letters, a unique string of numbers, a unique string of letters or numbers of combination thereof, a telephone number, an account identifier, a customer identifier, a business identifier, an email address, or a username.

In one embodiment of the first aspect, the unique identifier is unique to the user amongst the plurality of users.

In one embodiment of the first aspect, the authentication or partial authentication is for the purpose of a user logging into a computer-implemented service, or for identifying a user participating in a voice call.

In one embodiment of the first aspect, the method comprises an auxiliary authentication step.

In one embodiment of the first aspect, the auxiliary authentication step comprises the step of transmitting a verification code to the user via a validated communication channel, and requesting input of the transmitted verification code from the user.
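One way this auxiliary step might be implemented is sketched below; the in-memory code store and the five-minute lifetime are assumptions, and the actual transmission over the validated channel (SMS or email) is left to an external service.

```python
import secrets
import time

PENDING_CODES = {}  # identifier -> (code, expiry timestamp)

def send_verification_code(identifier, ttl_seconds=300, now=None):
    """Generate a 6-digit code and record it with an expiry; a real system
    would transmit it via the user's validated channel (SMS or email)."""
    now = time.time() if now is None else now
    code = f"{secrets.randbelow(10**6):06d}"
    PENDING_CODES[identifier] = (code, now + ttl_seconds)
    return code  # returned here only so the sketch can be exercised

def verify_code(identifier, submitted, now=None):
    """Accept the code once, and only before its expiry."""
    now = time.time() if now is None else now
    entry = PENDING_CODES.pop(identifier, None)  # single use
    if entry is None:
        return False
    code, expiry = entry
    return now <= expiry and secrets.compare_digest(code, submitted)
```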

In one embodiment of the first aspect, the validated communication channel is a cell phone or an email address of the user.

In one embodiment of the first aspect, the auxiliary authentication step is software-enabled.

In one embodiment of the first aspect, the software is embodied in the form of an authenticator application software.

In one embodiment of the first aspect, the authenticator application software is configured to present a time-limited verification code to the user.
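Authenticator apps of this kind conventionally compute their time-limited codes with the TOTP algorithm of RFC 6238; the following sketch is a generic implementation of that published algorithm, not code drawn from the specification itself.

```python
import hashlib
import hmac
import struct
import time

def totp(secret: bytes, t=None, step=30, digits=6):
    """Time-based one-time password (RFC 6238, HMAC-SHA1, 30-second
    steps), as presented by common authenticator apps."""
    t = int(time.time()) if t is None else int(t)
    counter = struct.pack(">Q", t // step)           # moving factor
    digest = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                       # dynamic truncation
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return f"{value % 10**digits:0{digits}d}"
```

Because the code depends only on a shared secret and the clock, the authentication server and the app compute it independently, with no transmission at login time.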

In one embodiment of the first aspect, the authenticator application software is the Google™ authenticator app or a functional equivalent thereof.

In one embodiment of the first aspect, the reference audio signal or representation thereof was provided by the user before execution of the method in the course of enrolment.

In one embodiment of the first aspect, the method is implemented on an authentication server or a login server.

In one embodiment of the first aspect, the audio signal is generated by the user speaking into: the microphone of a cell phone, or the microphone of a processor enabled device having Internet connectivity.

In one embodiment of the first aspect, the audio signal is transmitted by a cell phone, or a processor enabled device having Internet connectivity to the authentication server or the login server.

In one embodiment of the first aspect, the step of receiving a user audio signal encoding a user's voice speaking a unique identifier is performed by an audio input software module, and/or the step of recognising the unique identifier by speech recognition is performed by a speech recognition software module, and/or the step of using the recognised identifier to determine the user amongst a plurality of users is performed by a user determination software module, and/or the step of comparing the user audio signal or a representation thereof to a reference audio signal or representation thereof for the user is performed by a comparison software module.
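The module decomposition described above might be wired together as in the following sketch. The module names are taken from the text; the stand-in implementations, the dictionary data shapes and the threshold are assumptions for illustration.

```python
def audio_input_module(raw_signal):
    # Receives the user audio signal; a real module would buffer/decode it.
    return raw_signal

def speech_recognition_module(audio):
    # Stand-in for a speech-to-text engine recognising the identifier.
    return audio["identifier_transcript"]

def user_determination_module(identifier, users):
    # Determine the user amongst a plurality of users.
    return users.get(identifier)

def comparison_module(audio, user, threshold=0.85):
    # Stand-in: a real module would compare voice representations.
    return user is not None and audio["similarity_to_reference"] >= threshold

def authenticate_via_modules(raw_signal, users):
    audio = audio_input_module(raw_signal)
    identifier = speech_recognition_module(audio)
    user = user_determination_module(identifier, users)
    return comparison_module(audio, user)
```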

In one embodiment of the first aspect, the audio input software module, and/or the speech recognition software module, and/or the user determination software module, and/or the comparison software module software is/are executed on an authentication server or a login server.

In a second aspect, the present invention provides a computer-implemented method of logging into a digital service or participating in a voice call, the method comprising the steps of: a user speaking a unique identifier into a user processor-enabled device so as to provide an audio signal encoding the user's voice speaking the unique identifier, receiving the audio signal encoding a user's voice speaking a unique identifier into a login server or an authentication server, recognising the unique identifier by speech recognition, and using the recognised identifier to determine the user amongst a plurality of users, and comparing the user audio signal or a representation thereof to a reference audio signal or a representation of the reference audio signal for the user, wherein the identity of the user is authenticated or partially authenticated where the user audio signal or a representation thereof and the reference audio signal or representation thereof for the user are comparable to at least a minimum level.

In one embodiment of the second aspect, the user processor enabled device is a personal computer or a mobile device.

In one embodiment of the second aspect, the mobile device is a smart phone or a tablet device or a smart watch.

In one embodiment of the second aspect, the method has a feature of any embodiment of the first aspect.

In a third aspect, the present invention provides a processor-enabled device configured to authenticate or partially authenticate the identity of a user, the processor enabled device configured to: receive an audio signal encoding a user's voice speaking a unique identifier, recognise the unique identifier by speech recognition, and using the recognised identifier to determine the user amongst a plurality of users, compare the user audio signal or a representation thereof to a reference audio signal or representation thereof for the user, wherein the identity of the user is authenticated or partially authenticated where the user audio signal or a representation thereof and the reference audio signal or representation of the reference audio signal for the user are comparable to at least a minimum level.

In one embodiment of the third aspect, the processor enabled device has a feature of any embodiment of the first aspect.

In one embodiment of the third aspect, the processor-enabled device is a personal computer or a mobile device of the user.

In one embodiment of the third aspect, the mobile device is a smart phone or a tablet device.

In a fourth aspect, the present invention provides a computer-readable medium comprising software configured to authenticate or partially authenticate the identity of a user, the medium having stored thereon program instructions configured to execute the method of any embodiment of the first aspect.

In one embodiment of the fourth aspect, the medium is configured as an application programming interface.

In a fifth aspect, the present invention provides a system configured to authenticate or partially authenticate the identity of a user, the system comprising: a user processor-enabled device, and a login server or an authentication server configured to: receive an audio signal encoding a user's voice speaking a unique identifier from the user processor-enabled device, recognise the unique identifier by speech recognition, and using the recognised identifier determine the user amongst a plurality of users, and compare the user audio signal or a representation thereof to a reference audio signal for the user or a derivative of a reference audio signal for the user, wherein the identity of the user is authenticated or partially authenticated where the user audio signal or a representation thereof and the reference audio signal or representation of a reference audio signal for the user are comparable to at least a minimum level.

In one embodiment of the fifth aspect, the login server or authentication server is configured to execute the method of any embodiment of the first aspect.

In one embodiment of the fifth aspect, the user processor enabled device is a personal computer or a mobile device.

In one embodiment of the fifth aspect, the mobile device is a smart phone or a tablet device or a smart watch.

In a sixth aspect, the present invention provides a method for enrolling a user in a digital service, the method comprising the step of obtaining a digital audio signal of a user's voice speaking so as to allow for later recognition of the user's voice when the user speaks a unique identifier.

In one embodiment of the sixth aspect, the recorded user's voice speaking a unique identifier is converted into a representation thereof.

In one embodiment of the sixth aspect, the method comprises a feature of any embodiment of the first aspect.

In a seventh aspect, the present invention provides a computer-implemented method of enrolling a user for a digital service, the method comprising the steps of: receiving an audio signal encoding the user's voice speaking a unique identifier, and storing the audio signal or a representation thereof in a database in linked association with the unique identifier.
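A minimal sketch of this storage step, using an SQLite table to hold the representation in linked association with the unique identifier; the schema and the byte-string form of the representation are illustrative assumptions.

```python
import sqlite3

def create_enrolment_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS enrolment ("
        "identifier TEXT PRIMARY KEY,"    # e.g. the user's cell phone number
        "representation BLOB NOT NULL)")  # voice representation as bytes

def enrol(conn, identifier, representation):
    # Store (or refresh) the representation linked to the identifier.
    conn.execute(
        "INSERT OR REPLACE INTO enrolment VALUES (?, ?)",
        (identifier, representation))
    conn.commit()

def reference_for(conn, identifier):
    # Retrieve the stored representation for a later login comparison.
    row = conn.execute(
        "SELECT representation FROM enrolment WHERE identifier = ?",
        (identifier,)).fetchone()
    return None if row is None else row[0]
```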

In one embodiment of the seventh aspect, the method comprises the step of receiving a unique identifier from the user.

In one embodiment of the seventh aspect, the method comprises the step of receiving from the user information relating to an auxiliary authentication method and storing that information in a database in linked association with the unique identifier.

In one embodiment of the seventh aspect, the method comprises a feature of the method of any embodiment of the first aspect.

In an eighth aspect, the present invention provides an enrolment server configured to execute the method of any embodiment of the seventh aspect.

In one embodiment of the eighth aspect, the enrolment server is administered by a digital service and configured to allow a user to enrol for the digital service.

In one embodiment of the eighth aspect, the enrolment server is configured to make network connection with a plurality of digital services and a plurality of users of each of the plurality of digital services.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a flow chart illustrating a user enrolment process of the present invention.

FIG. 2 shows a series of database records obtained from the enrolment process shown in FIG. 1.

FIG. 3 is a flow chart of a login process of the present invention. The login process relies on (i) extraction of a cell phone number obtained in the course of a login attempt, (ii) use of the extracted number to locate a reference representation of the user's voice in a database (as obtained in the enrolment process of FIG. 1 and stored as a database record as shown in FIG. 2), and (iii) comparison of the reference representation with a user voice representation obtained in the course of the login attempt.

FIG. 4 is a flow chart of a voice call authentication process of the present invention. The process relies on (i) extraction of a cell phone number obtained in the course of a voice call, (ii) use of the extracted number to locate a reference representation of the user's voice in a database (as obtained in the enrolment process of FIG. 1 and stored as a database record as shown in FIG. 2), and (iii) comparison of the reference representation with a user voice representation obtained in the course of the voice call.

FIG. 5 is a diagram of a system of the present invention whereby a login or voice call authentication server of a digital service is in network connection with a plurality of users, and also other servers of the digital service.

FIG. 6 is a diagram of a system of the present invention whereby a login or voice call authentication server provides login or authentication services to a plurality of digital services in network connection therewith. A plurality of users are in network connection with the login or voice call authentication server, such that a user is able to log in to, or have voice call authentication with, at least one of the plurality of digital services.

FIG. 7 is a flow diagram showing a voice flow for enrolment, used when registering a new user for voice authentication.

FIG. 8 is a flow diagram showing a voice flow for voice verification, used to check if a speaker is registered for voice authentication.

FIG. 9 is a flow diagram showing a voice flow for voice Log-in Basic, used to identify and authenticate a person with just one audio file (the audio contains a spoken mobile phone number).

FIG. 10 is a flow diagram showing a voice flow for voice Log-in Strong, used to implement a voice-enabled log-in requiring a secure identification and authentication mechanism.

FIG. 11 is a block diagram showing security architecture for the platform.

FIG. 12 is a block diagram showing security architecture for the biometrics.

Unless otherwise indicated herein, features of the drawings labelled with the same numeral are taken to be the same features, or at least functionally similar features, when used across different drawings. The drawings are not prepared to any particular scale or dimension and are not presented as being a completely accurate presentation of the various embodiments.

DETAILED DESCRIPTION OF THE INVENTION AND PREFERRED EMBODIMENTS THEREOF

After considering this description it will be apparent to one skilled in the art how the invention is implemented in various alternative embodiments and alternative applications. However, although various embodiments of the present invention will be described herein, it is understood that these embodiments are presented by way of example only, and not limitation. As such, this description of various alternative embodiments should not be construed to limit the scope or breadth of the present invention. Furthermore, statements of advantages or other aspects apply to specific exemplary embodiments, and not necessarily to all embodiments, or indeed any embodiment covered by the claims.

Throughout the description and the claims of this specification the word "comprise" and variations of the word, such as "comprising" and "comprises" is not intended to exclude other additives, components, integers or steps.

Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment, but may do so.

The present invention is predicated at least in part on the inventor's discovery that a user's voice is able to provide two functions. A first function is to provide a unique identifier, such as a customer account identifier. A second function is to provide a means for authenticating the identity of a user by virtue of the unique characteristics of an individual's speech. Thus, a single spoken string of text and/or numbers can be used to log a user into a digital service, avoiding the need to manually enter a unique identifier (such as a username) and an authentication factor (such as a password) into a digital service login page. Furthermore, a user's voice is able to provide the same first and second functions in the course of a voice call to a contact center. Upon receiving a call from a user, the operator (robot or human) requests the user to speak their account number (or other identifier). No time is wasted by the operator listening to the user's unique identifier, accessing an authentication question associated with the unique identifier, asking the user the question and assessing the user's answer against the expected answer.

A further finding is that a user's cell phone number is a preferred unique identifier in the context of the invention. Firstly, the vast majority of users will readily be able to recall their cell phone number. Secondly, numbers are more readily recognisable by speech-to-text processing means. A cell phone number does not comprise letters or words and is therefore more readily converted to text without error. In some prior art methods, speech recognition is used to identify a user; however, more complex speech forms the basis for analysis, such as words and sentences. Such prior art methods are more prone to error in identification, or to simply failing to identify any user, because of the more complex comparisons that are required by the identification algorithms used.

The present invention is directed to the authentication or partial authentication of a user of a digital service. In some embodiments, the authentication is a partial authentication with supplementary authentication means (as discussed in more detail infra) being further implemented to result in a full authentication or substantially full authentication of a user. It will be understood that the present invention does not necessarily provide fail-safe authentication no matter how many authentication means are implemented.

In one embodiment the invention is used in the context of a user's interaction with a digital service. The digital service may be accessed by way of a web browser, an app, or a voice call. Exemplary digital services include a financial institution, a government department, an insurance company, a retailer, a telephone company, an airline, a booking agency, a gambling service, a social media company, or an online business retailing a good or a service. Even a "bricks and mortar" business may be considered a digital service where a customer can interact with the business by digital means (including a digitally-enabled voice call). Given the benefit of the present specification, other digital services having a use of the present invention will be apparent to the skilled person.

The user's interaction with the digital service is one requiring some means for identifying a person as an enrolled user of the service. A common scenario is where a user logs into an online or offline service. In the prior art the login step typically requires a user to have a unique identifier (such as a username) to identify the user, and a password to authenticate the identity of the user. The security of such login arrangements is reliant on the password being kept secret, or at least not being easily guessed or inferred. In the present invention the user's voice is exploited to provide both a unique identifier (such as a username) and a means of authenticating the user, as more fully described elsewhere herein. Another common scenario is found when an enrolled user of a digital service makes a voice call to the service (say, for an account enquiry) and their identity must be confirmed, for example to protect private information that may be disclosed by the digital service in the course of the call, or to prevent unauthorised instructions being given to the digital service. In the prior art, the caller (being a putative enrolled user of the service) typically provides their name or an account number, and is generally asked to provide a password or to answer a series of verification questions. In the present invention the user's voice provides both a unique identifier (such as an account number) and a means of authenticating the user, as more fully described elsewhere herein.

Turning now to the operation of the invention, the invention requires the receipt of a digital audio signal (such as in the form of an audio file or an audio stream) for analysis in a login or user authentication procedure, the signal being of a user's voice speaking a unique identifier. The unique identifier is obtained by a speech recognition means, such as provided by known speech-to-text algorithms. Exemplary means include Project DeepSpeech (Mozilla), an open source speech-to-text library that operates within the TensorFlow framework; Kaldi (released under the Apache public license); Julius; Wav2Letter++ (Facebook), a trainable tool; DeepSpeech2 (Baidu, released under a BSD license); Vosk, providing a streaming API allowing for online speech recognition; Athena (released under the Apache public license); and ESPnet (released under the Apache public license).
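Once a speech-to-text engine has produced a transcript of the spoken identifier, the word sequence must still be collapsed into a digit string. The sketch below is a hypothetical post-processing step: the word map, the filler-word handling and the "double four" convention are illustrative assumptions, not part of the specification.

```python
# Hypothetical post-processing of a speech-to-text transcript into a numeric
# identifier. Word map and conventions are illustrative assumptions.

DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}

def transcript_to_identifier(transcript: str) -> str:
    """Collapse a spoken-digit transcript (as returned by a speech-to-text
    engine) into a digit string usable as a unique identifier."""
    digits = []
    repeat_next = False
    for word in transcript.lower().split():
        if word == "double":          # "double four" -> "44"
            repeat_next = True
            continue
        d = DIGIT_WORDS.get(word)
        if d is None:
            continue                  # ignore filler words ("my", "number", ...)
        digits.append(d * (2 if repeat_next else 1))
        repeat_next = False
    return "".join(digits)
```

For example, a transcript such as "my number is six eight five two six two" reduces to the identifier string "685262" used in the database lookup described below.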

Once the unique identifier is recognised by the speech recognition means, the identifier is used to locate an electronic database record held by the digital service for the user. For example, the unique identifier may be the number string "685262", and the record associated with that unique identifier is located. A "fuzzy search" may be performed to locate the record (or at least a number of candidate records) in an attempt to locate a user by the provided identifier. A "fuzzy search" may be preferred since there is a degree of variability in the way the identifier can be submitted (i.e. pronounced by the user). Consider, for example, a phone number: it can be recited with a country code (+614xxxxx) or with the area code (04xxxxx). The search algorithm can be anything from simple "string matching" to a neural network based elastic search system. As will be appreciated, the record will generally exist as a database record amongst the records of a plurality of users.
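A minimal sketch of the "fuzzy" lookup described above: a spoken phone number may arrive with a country code (+61 4...) or a leading zero (04...), and normalising both the query and the stored identifiers to one form lets a simple match stand in for a more elaborate elastic search. The country code "61" and the record layout are illustrative assumptions.

```python
# Illustrative normalisation-based lookup; country code "61" is assumed.

def normalize_number(number: str, country_code: str = "61") -> str:
    """Keep only digits and rewrite a leading country code as a leading zero."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if digits.startswith(country_code):
        digits = "0" + digits[len(country_code):]
    return digits

def find_candidate_records(spoken: str, records: dict) -> list:
    """Return records whose normalised identifier matches the spoken one."""
    target = normalize_number(spoken)
    return [rec for ident, rec in records.items()
            if normalize_number(ident) == target]
```

With this normalisation, a record stored under "+61412345678" is found whether the caller speaks "0412 345 678" or the full international form.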

Where a number of candidate records are located, the correct record (i.e. the actual record for the user) is determined by some secondary means. For example, each of the voice references for the candidate users may be analysed for similarity to the voice used for login or voice call identification. The record associated with the closest voice representation match, and also having at least a minimum required similarity as is otherwise required, is likely to be the correct record. Alternatively, a user interface may prompt the user for further information such as a post code (zip code) or other item of information recorded for users so as to locate the correct record.

The located record will comprise a reference audio signal for the user or a representation of a reference audio signal for the user. The reference audio signal has been previously obtained from the user in the course of an enrollment process, and comprises the user's voice speaking sufficient text so as to allow for identification of the user speaking the unique identifier. In some embodiments the reference audio signal per se is used as a comparator against the received audio signal to determine the likelihood that the received audio signal is the voice of the user, thereby authenticating the identity of the user.

In other embodiments, the reference audio signal is processed to form a representation of the reference audio signal, which is then used as a comparator against a representation of the received audio signal. As will be appreciated, the same or similar processing is used to generate the representation of the reference audio signal and the received audio signal so as to allow for a useful comparison to be made.

The audio signal may be electronically processed to form a representation thereof. The representation may be a mathematical representation, a tensor, a vector (including a feature vector), or a scalar.

Speaker recognition may be achieved in three main steps: acoustic processing, feature extraction (to provide a representation) and classification/recognition.

The speech audio signal is firstly processed to improve the signal. One possible step is to remove noise so as to avoid confounding the extraction of the important speech attributes and thereby negatively affecting speaker identification. Another possible step is level normalisation or compression of the audio signal. In many circumstances, a silence removal step will be required.
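The acoustic processing steps above can be illustrated with peak normalisation followed by a simple energy-based silence gate. A production system would use a proper voice activity detector; the frame length and threshold here are arbitrary assumptions.

```python
# Illustrative pre-processing: peak normalisation + energy-based silence gate.
import numpy as np

def preprocess(signal: np.ndarray, frame_len: int = 400,
               silence_db: float = -40.0) -> np.ndarray:
    # Peak-normalise to the range [-1, 1]
    peak = np.max(np.abs(signal))
    if peak > 0:
        signal = signal / peak
    # Keep only frames whose RMS energy exceeds the silence threshold
    threshold = 10 ** (silence_db / 20.0)
    kept = [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)
            if np.sqrt(np.mean(signal[i:i + frame_len] ** 2)) >= threshold]
    return np.concatenate(kept) if kept else np.empty(0)
```

Applied to a recording bracketed by near-silence, this keeps only the spoken portion, so the feature extraction that follows operates on speech alone.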

The purpose of the feature extraction step is to characterise a speech audio signal by reference to a predetermined number of signal elements. This avoids consideration of the entirety of the data in the speech audio signal, which is generally avoided to limit processor burden. Ideally, the extraction process removes data that are not important, or at least not of prime importance, in the identification task.

Feature extraction may be accomplished by processing the speech waveform by digital means to a form of parametric representation at a relatively lower data rate or total amount of data for the comparison analysis. In some embodiments the audio signal is processed into a representation that is actually more discriminative and/or reliable than the original signal.

Generally, the feature extraction is configured to identify a representation that is relatively reliable for at least several conditions for the same speech signal, to take account of variations in the environmental conditions or the speaker themselves, while still retaining the portion of the data that characterizes the information in the speech signal.

Feature extraction processes typically yield a multidimensional feature vector for every speech signal. The prior art provides means for vectorizing audio signal data, and given the benefit of the present invention the skilled person is enabled to vectorize the digital audio data comprised in the received and reference audio signals. Preferably, a multidimensional linear vector is used as provided by a neural network configured to distinguish one user's speech from all others.

The vector representations (however formed) may have any number of dimensions, and in some embodiments only two dimensions. For example, an audio frequency spectrum has only two dimensions, and therefore may be represented as a simple matrix, with each number of the matrix representing an amplitude. Such a matrix may be used as a means for comparison between a received audio signal and a reference audio signal.

More complex representations of the audio signal may be used, including representations having 10, 100, 1000 or more dimensions. A useful representation is provided by 256 dimensions implementing a cosine (COS) distance as a measure.
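The 256-dimensional cosine-measure comparison suggested above can be sketched as follows. The embedding dimension and the 0.7 decision threshold are configuration choices assumed for illustration, not mandated values.

```python
# Sketch of comparing two fixed-length voice representations with a cosine
# measure; dimension and threshold are assumed configuration values.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_authenticated(login_vec: np.ndarray, reference_vec: np.ndarray,
                     threshold: float = 0.7) -> bool:
    """Authenticate when the representations are comparable to at least
    the minimum level (here, a cosine similarity threshold)."""
    return cosine_similarity(login_vec, reference_vec) >= threshold
```

Raising the threshold trades a lower false positive rate against more false negatives, as discussed below in relation to the minimum level of comparability.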

In addition or alternatively, non-spectrum means may be used to process the audio stream, including audio embedding or speech recognition means.

Having the benefit of the present specification, the skilled person is enabled to utilise one or more of a number of tools for providing a representation of a speech audio signal for the purpose of speaker (user) identification and authentication. Such tools include perceptual linear prediction (PLP), linear prediction coding (LPC) and mel-frequency cepstral coefficients (MFCC). A speech audio signal is a slowly time-varying signal and is quasi-stationary when observed over a short time period (such as between 5 and 100 msec). Accordingly, short-time spectral analysis (which includes MFCC, LPCC and PLP) may be preferred for the extraction of important discriminatory information from a speech audio signal. Preferably, a neural network-based feature extraction method is used, whereby the extractor is specifically trained to distinguish speakers using Mel-spectrograms.

Before feature extraction is performed, a pre-processing method may be carried out. The pre-processing method may comprise a pre-emphasis step involving passing the signal through a first-order finite impulse response (FIR) filter. A second step of frame blocking may be performed, whereby the speech audio signal is partitioned into a series of frames, generally for the purpose of removing the acoustic interference present at the beginning and end of the signal.

The framed speech audio signal may then be windowed by passing through a frequency filter (such as a bandpass filter) to minimize disjointedness at the start and finish of each frame. Useful types of window include Hamming and rectangular windows. The aim of this step is generally to improve the sharpness of harmonics, eliminate discontinuity in the signal by tapering the beginning and end of each frame to zero, and reduce spectral distortion.
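The pre-emphasis, frame blocking and windowing steps described above can be shown compactly in numpy. The coefficient 0.97 and the frame sizes are common textbook defaults assumed for illustration, not values fixed by the specification.

```python
# First-order pre-emphasis (FIR) filter, overlapping frame blocking, and a
# Hamming window per frame; parameters are common defaults, assumed here.
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    # y[n] = x[n] - alpha * x[n-1]
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_and_window(signal: np.ndarray, frame_len: int = 400,
                     hop: int = 160) -> np.ndarray:
    """Partition the signal into overlapping frames and taper each frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)
```

At a 16 kHz sample rate these defaults correspond to 25 ms frames with a 10 ms hop, a common choice for short-time spectral analysis.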

In the present invention, the identity of the user may be authenticated or partially authenticated where the received user audio signal or a representation thereof and the reference audio signal or representation of a reference audio signal for the user are comparable to at least a minimum level. Such comparison is typically undertaken by speaker recognition means as incorporated into the tools discussed above. With respect to the minimum level, the skilled person having the benefit of the present specification is enabled to determine a suitable minimum level of comparability. Typically, the level will be set sufficiently high such that the rate of false positive identifications is below an acceptable level and/or the rate of false negative identifications is below an acceptable level. The false positive rate will typically be the more critical, given the possibility that a dishonest third party will gain access to the digital service involved, and therefore also access to funds, information and other resources not intended for third party access.

The acceptable false positive rate may be expressed in terms of a percentage, and in some embodiments is equal to or less than about 0.1%, 0.2%, 0.3%, 0.4%, 0.5%, 0.6%, 0.7%, 0.8%, 0.9%, 1.0%, 1.5%, 2.0%, 2.5%, 3.0%, 3.5%, 4.0%, 4.5% or 5.0%.

The unique identifier should be unique amongst all of the identifiers implemented for the plurality of users enrolled for the digital service. For ease of recall, it is preferable that the unique identifier is chosen by the user, as distinct from one chosen by the digital service or some other party. The identifier may be a unique string of letters and numbers, a telephone number, an account identifier, a customer identifier, a business identifier, an email address, or a username. Applicant has found that a telephone number (such as a cell phone number) as a unique identifier provides a number of advantages. Firstly, a telephone number (and particularly a personal cell phone number) is retained by the user for many years, and possibly even a lifetime. The number becomes very well known to the user, and therefore easily recalled for use when required in a login or voice call authentication process.

A second advantage compounding the first advantage is a user's telephone number will be absolutely unique (at least in the user's country of residence) because telephone numbers must be unique by their very nature. Accordingly, there is no requirement during a user enrolment process to check whether a proposed unique identifier is in fact unique amongst all existing users.

A third advantage compounding the previously mentioned advantages is that a telephone number is normally spoken in an almost staccato manner, with each number recited discretely and with slight pauses in between. The discretely spoken numbers are more easily discerned as numbers and are therefore better recognised by a speech-to-text converter.

A fourth advantage compounding the previously mentioned advantages is that a telephone number comprises only ten sounds (zero, one, two ... nine). This dramatically decreases the difficulty of the analysis by the speech-to-text converter, thereby increasing the rate of success in transforming a received audio speech signal into a string of numbers (i.e. the unique identifier).

A fifth advantage compounding the previously mentioned advantages is that a telephone number may be used as a validated communication channel through which an auxiliary authentication step is effected. Whilst identification of a user's voice may provide a certain level of certainty that a user is in fact the real user, it is generally preferred that an auxiliary authentication method is used in addition to speaker recognition via the voice. As is presently common, a two-factor authentication (2FA) method is preferred to better ensure that a putative user is the real user. The auxiliary authentication method may be one that exploits the user's cell phone number and cell phone. As one possibility, a verification code is sent to the user's cell phone by the digital service, and the received code is read by the user and input by the user into a web browser or an app forming part of the authentication system of the digital service. As mentioned supra, a further advantage arises from use of the user's telephone number as the unique identifier because the text message is sent to that same number. There is no requirement to separately locate (say, via a database record) the user's cell phone number to which the verification code is sent.

Another possibility exploits an authentication application installed on the phone of a user. The verification application is set up specifically for the digital service concerned, so as to provide a shared secret key to the user over a secure channel, to be stored in the authenticator application. This secret key will be used for all future logins to the site. To log into a digital service that uses two-factor authentication and supports the authenticator application, the user provides a voice sample of the unique identifier to the authentication system of the digital service, which computes (but does not display) a required verification code and requests the user to enter it. The user runs the authenticator application, which independently computes and displays the same verification code, which the user types into a web page or app of the digital service, thereby authenticating their identity. An exemplary authenticator app useful in the context of the present invention is Google™ Authenticator.
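The shared-secret scheme above can be illustrated with HOTP (RFC 4226), the counter-based primitive underlying the time-based codes (TOTP, RFC 6238) used by apps such as Google Authenticator. This is a generic sketch of the published standard, not code taken from the specification.

```python
# Minimal HOTP (RFC 4226): both the digital service and the authenticator
# app derive the same short code from a shared secret and a moving counter.
import hashlib
import hmac
import struct

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    msg = struct.pack(">Q", counter)                       # 8-byte counter
    digest = hmac.new(secret, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                             # dynamic truncation
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

In the time-based variant, `counter` is replaced by `int(time.time()) // 30`, so the service and the app agree on the code without ever transmitting it.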

In computer-implemented form, the present methods may be executed on a computer server. The server may be administered by the digital service requiring authentication of a user. However, in some embodiments the authentication process is performed by a third party entity authorized to do so by the digital service and in which case the method is executed on a third party entity server.

The present invention may be implemented in one of the many forms discussed herein, and also in further forms such as a user interface configured to prompt or allow the various inputs and display the various outputs. The interface may be configured to prompt the user for voice input, and optionally the prompt is by audio output such that the user is not required to view or otherwise interact with the interface.

Reference is made now to the accompanying drawings to further describe non-limiting aspects and features of the invention.

FIG. 1 shows a non-limiting process flow for an enrolment process. In this case, the user has no existing relationship with a digital service, and seeks via the enrolment process to become a user of that digital service. The user starts the enrolment process by opening a relevant webpage of the digital service. The webpage requests entry of the usual details (name, address, email address, cell phone number etc.), at which time a database entry is made for the new user. The database entry is identified within the database by reference to the user's cell phone number. The user is prompted to clearly speak a passage of text (which may be provided by the user interface for the user to read out aloud). The user's speech is received as a digital audio signal. A simplified version of the audio signal is generated to form a representation of the audio signal, and the representation is saved as a data file in the new database record for the user. The representation is used after enrolment as a reference for the user.

An extract from the database referred to above is shown at FIG. 2. Partial records for several users are shown. Each record is uniquely identifiable by the user cell phone number.

Once successfully enrolled, and when the user wishes to use the digital service concerned, the user attempts to log in. Reference is made to FIG. 3. The login process is commenced by a user speaking their cell phone number into a user interface provided by the digital service. The interface may be presented by way of an app, a webpage or any other means. The user's voice is digitally processed to provide a representation of the voice audio signal (the representation being generated in the same manner as for the enrolment process). To verify the identity of the individual attempting to log in, the representation obtained in the login process is compared to the representation obtained during enrolment. Where the two representations display a minimum level of similarity, the login is successful. If not, the login is aborted. Preferably, the login will not be successful until an auxiliary authentication step (not shown) is successfully completed.

Where login is successful, the user is able to access any of the information and resources offered by the digital service.

FIG. 4 shows a similar scheme to that of FIG. 3, although in the context of caller verification in the course of a voice call to a digital service. The caller may be prompted to speak their cell phone number by a recording, a robot operator or a human operator. In any event, the caller's (user's) voice representation is compared to a reference representation. An optional auxiliary authentication step (not shown) may also be implemented. Where the caller's identity is verified, the digital service allows access by telephone to any of the information and resources offered by the digital service. For example, a human operator may be permitted to discuss banking details, or an automated transaction service may be accessed.

Reference is now made to FIG. 5 showing a non-limiting system of the invention comprising a server (100) configured as an enrolment server, a login server, or a voice call authentication server of a digital service as described elsewhere herein. A plurality of user devices (110) are in network connection with the server (100) via the Internet (120). Upon successful login or authentication, the user is enabled to access various services of the digital service via further connected servers (130).

Reference is made to FIG. 6 showing a non-limiting system of the present invention whereby enrolment, login or authentication is performed by a third party server (200). The plurality of user devices (110) are in network connection with the server (200) via the Internet (120). The third party server (200) is in network connection with a plurality of digital services (100). The user devices (110) are used to authenticate a user for one of the digital services (100), by logging into the third party server (200). Upon successful login the user is able to access the relevant digital service server (100) via third party server (200).

An exemplary workflow is provided below.

Enrollment Scenarios (Enrollment API)

* Enroll against identity (e.g. call center enrollments):
  o If identity is new, create a new speaker against the identity, then enroll against the speaker
  o If identity is not new, look up the speaker, then enroll against the speaker
* Enroll against speaker (e.g. pre-enrollments or web)

Omnivoice Workflow (Omnivoice Workflow API)

* Always Identification then Authentication
* Workflow flows:
  o Against Identifier (scenario name: "Voice Verification"); steps:
    1. Look up speaker by identifier
    2. If speaker is found, create workflow:
       + State: "identified" or "unenrolled"
       + Type: "authentication"
       + Return {Workflow_Dto: Id . . . }
    3. Authenticate by voice against the workflow ID; possible cases:
       + State: "identified", leave in "identified" if authentication is too late
       + State: "unenrolled", leave in "unenrolled" if there is no enrollment
       + State: "unauthorized", if the voiceprint doesn't match any of the enrollments
       + State: "authorized", if the voiceprint matches an enrollment associated with the speaker
       + Return {Workflow_Dto}
       => Can be done in a single transaction (e.g. when the voiceprint and id are passed into a call)
  o Against a Voiceprint and Identifier (scenario name: "Voice Log-in (1-Factor) - Weak"); steps:
    1. Look up speaker by identifier (may include customer lookup by voiceprint for a more robust search - next version)
    2. If speaker found => create workflow:
       + State: "unenrolled", "unauthorized" or "authorized" depending on whether the voiceprint matches any speaker enrollment, or enrollments are present
       + Type: "authorization-weak"
       + Return {Workflow_Dto}
  o Against a Voiceprint and Identifier + 2nd Factor (scenario name: "Voice Log-in (2-Factor) - Strong"); steps:
    1. Look up speaker by identifier (may include speaker lookup by voiceprint for a more robust search - next version)
    2. If speaker found => create workflow:
       + State: "unenrolled", "unauthorized" or "identified" depending on whether the voiceprint matches any speaker enrollment, or enrollments are present at all => here we need to check whether the speaker has any identifiers that support 2nd-Factor authentication
         => State: "no-modality-for-2nd-factor" if there are no identifiers that can be used for the 2nd Factor
       + Type: "authorization-strong"
       + If "identified", initiate the 2nd Factor based on the identifier (or any other identifier that has a supported modality for the 2nd Factor)
       + Return {Workflow_Dto}
    3a. Authenticate the 2nd Factor with voice against the workflow ID; possible cases:
       => Accepts config to require "only voiceprint match" or "voiceprint + code"
       + State: "identified", leave in "identified" if authentication is too late
       + State: "unenrolled", leave in "unenrolled" if there is no enrollment
       + State: "unauthorized", if the voiceprint doesn't match any of the enrollments or the code doesn't match
       + State: "authorized", if the voiceprint matches an enrollment associated with the speaker
       + State: "no-modality-for-2nd-factor" if there are no identifiers that can be used for the 2nd Factor
       + Return {Workflow_Dto}
    3b. Authenticate the 2nd Factor with code against the workflow ID; possible cases:
       + State: "identified", leave in "identified" if authentication is too late
       + State: "unenrolled", leave in "unenrolled" if there is no enrollment at step 2
       + State: "unauthorized", if the code doesn't match
       + State: "authorized", if the code matches
       + State: "no-modality-for-2nd-factor" if there are no identifiers that can be used for the 2nd Factor (carried over from step 2)
       + Return {Workflow_Dto}
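The state transitions in the workflow outline above can be reduced to a small sketch. The state names mirror the outline; the workflow object, the score-based matching and the threshold value are assumptions made for illustration.

```python
# Illustrative workflow object following the states in the outline above.
from dataclasses import dataclass
from typing import Optional

THRESHOLD = 0.7  # assumed minimum voiceprint match score

@dataclass
class Workflow:
    state: str   # "identified", "unenrolled", "unauthorized" or "authorized"
    type: str = "authentication"

def create_workflow(has_enrollment: bool) -> Workflow:
    """Step 2: speaker found, so the workflow starts 'identified'
    (or 'unenrolled' when no voiceprint has been enrolled)."""
    return Workflow(state="identified" if has_enrollment else "unenrolled")

def authenticate(wf: Workflow, score: Optional[float]) -> Workflow:
    """Step 3: authenticate by voice against the workflow."""
    if wf.state == "unenrolled" or score is None:
        return wf  # leave in "unenrolled" if there is no enrollment
    wf.state = "authorized" if score >= THRESHOLD else "unauthorized"
    return wf
```

A caller thus moves from "identified" to "authorized" only when the voiceprint comparison clears the threshold, and otherwise ends "unauthorized".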

Having provided a broad outline of the present invention, further details are provided in the disclosure below.

There are two fundamentally different scenarios that are involved in the process of identification / authentication by voice: (i) enrolment, and (ii) login / authentication.

Enrolment, Verification and Log-in

In Login / Authentication, a voiceprint is generated from a speech sample. The voiceprint is then compared against each of the reference voiceprints stored in a database. The best match that passes a confidence threshold is then returned. Among other parameters documented below, this match contains a comparison score (a value between [0..1]) and the Speaker's Identifier (a phone number) that corresponds to the best match. The Enrolment scenario is required to populate a reference database in the first instance. This is performed by calling the Enrolment API.
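The best-match comparison described here can be sketched as a linear scan over the reference database, returning the speaker identifier and score of the best reference that clears the confidence threshold. Modelling voiceprints as numpy vectors scored by cosine similarity is an assumption for illustration.

```python
# Return (identifier, score) for the best-matching stored voiceprint that
# passes the confidence threshold, or None if no reference qualifies.
import numpy as np

def best_match(voiceprint: np.ndarray, references: dict,
               threshold: float = 0.7):
    best = None
    for identifier, ref in references.items():   # identifier: phone number
        score = float(np.dot(voiceprint, ref) /
                      (np.linalg.norm(voiceprint) * np.linalg.norm(ref)))
        if score >= threshold and (best is None or score > best[1]):
            best = (identifier, score)
    return best
```

A `None` result corresponds to the failed-verification branch; a tuple gives the comparison score and the matched speaker identifier described above.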

Enrolment

The user records a wav file of their voice, speaking at least their cell phone number, preferably in E.164 format (e.g. +1234567890). Enrolment requires a valid cell phone number, and will not complete until the user clicks on the link sent to their phone. The flow sends an audio file containing the speaker's voice recording to create a voiceprint and associate it with the speaker's identifier.

Voice Verification

The user records a wav file of their voice, speaking at least their cell phone number. The flow checks whether the system has an enrolled voiceprint associated with a speaker identifier (e.g. account number) and, if so, sends an audio file containing the speaker's voice and compares it with the voiceprint stored in the system.

Voice Log-in Basic

The user records a wav file of their voice, speaking at least their cell phone number. The flow checks whether the system has an enrolled voiceprint associated with a speaker identifier (the identifier is obtained using speech recognition) and, if so, compares the speaker's voice with the voiceprint stored in the system.

Voice Log-in Strong

This flow is similar to Voice Log-in Basic (where the speaker records themselves saying their phone number), but it also requires that the speaker submit a verification code sent via a cellular network. The verification code can be submitted as a voice recording, in which case speech recognition and voice verification are performed on both the initial and the verification audio submissions.

Voice Sampling

Voice Biometrics compares voice samples and provides comparison results. Where the results show a minimum match (usually expressed as a numeric score), the voice samples are considered to be spoken by the same person. In reality, various factors may impact the performance of the technology, including background noise, the presence of multiple speakers during voice sample collection, the channel (e.g. phone line vs laptop mic), and microphone gain distortions. In addition, the human voice exhibits numerous properties depending on the vocal task being performed (e.g. singing vs speaking). Indeed, depending on the context and emotional state, even the same phrase may have different emphases that communicate complex semantics. In some embodiments, the present invention comprises an artificial neural network configured to capture a combination of properties of a user's voice as well as the way the user speaks. Performance will vary depending on the contents and duration of the speech samples. Generally, a longer speech sample exposes more voice properties and subsequently improves performance.

Provided below are recommendations around the collection of speech samples for various scenarios to maximize voice biometrics performance:

- Enrolment audio should be longer than Authentication audio.
- Net Speech duration is less than audio file duration in most cases.
- Cross-channel authentications will show lower authentication scores; increase speech sample length or use a same-phrase strategy to mitigate.
- Encourage users to make multiple Enrolments for each of the channels they normally use.
- The minimum Net Speech requirement can be lowered when similar phrases are used for both Enrolment and Authentication.
- Voice samples containing multiple speakers will affect system performance; avoid Enrolling multiple speakers against the same account / phone number.
- Speech Recognition is not used for Voice Biometrics; it is used to improve Customer Experience.
- Provide clear instructions to users on what they should say during the speech sample collection process.
- Customer identifiers should be numeric; speech samples containing spoken identifiers must not have any other numbers in them.
- Numeric identifiers work best with Automatic Speech Recognition and yield reliable performance; do not use non-numeric speaker identifiers.

Recommendations Summary (Parameter; Recommended Value; Notes):

- Min Enrolment Audio (Random); 30 sec; Net speech requirement for Enrolment that is agnostic to language / speech content
- Min Enrolment Audio (Phrase-dependent); 10 sec; Net speech requirement for Enrolment - specific phrase
- Min Auth. Audio (Random); 10 sec; Net speech requirement for authentication - content agnostic
- Min Auth. Audio (Phrase-dependent); 6 sec; Net speech requirement for authentication - specific phrase
- Score Threshold; 0.7; Possibly lowered for cross-channel scenarios
- Verification Code Complexity; 6; Number of digits in the verification code
- Audio / Net Speech Ratio; 1.4:1; Typical ratio between an audio file and the net speech extracted from it, e.g. a 14 sec audio file will typically contain 10 sec of net speech

Enrolment vs Authentication

• Enrolment - the initial (or reference) voiceprint registration

• Authentication - subsequent verification of the person's voice (e.g. when a previously registered person wishes to log-in using their voice)

Enrolment is done rarely (perhaps once or twice) but it requires that the person being Enrolled is authentic. As much of the user's voice as is practically possible should be captured to improve the quality of their Enrolment voiceprint. By contrast, Authentication is performed each time a previously Enrolled person wishes to gain access to a resource (e.g. their website account). During Authentication, a fresh voice sample is collected and compared against the Enrolment voiceprint. If Authentication fails, another attempt can be made provided that voice sample collection is quick and easy (i.e. short audio samples). For these reasons, Enrolment audio should be longer than Authentication audio.
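The Enrolment / Authentication split described above can be sketched as follows. This is a minimal illustrative sketch, not the claimed implementation: `make_voiceprint` and `compare` are hypothetical stand-ins for the (unspecified) biometric engine, and an in-memory dictionary stands in for the reference database.

```python
# Illustrative sketch of the Enrolment vs Authentication scenarios.
# The reference database is a plain dict; real systems use isolated stores.
enrolled: dict[str, list[float]] = {}

def make_voiceprint(audio: list[float]) -> list[float]:
    # Placeholder for real feature extraction from the audio signal.
    return audio

def compare(a: list[float], b: list[float]) -> float:
    # Placeholder similarity in [0, 1]; the actual scoring method is not
    # specified in this disclosure.
    return 1.0 - min(1.0, sum(abs(x - y) for x, y in zip(a, b)) / len(a))

def enrol(user_id: str, audio: list[float]) -> None:
    """Done rarely; requires the person to be verified as authentic first."""
    enrolled[user_id] = make_voiceprint(audio)

def authenticate(user_id: str, audio: list[float], threshold: float = 0.7) -> bool:
    """Done on every log-in attempt with a fresh, shorter sample."""
    return compare(make_voiceprint(audio), enrolled[user_id]) >= threshold

enrol("0412345678", [0.2, 0.4, 0.6])
print(authenticate("0412345678", [0.21, 0.41, 0.59]))  # True
```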

Net Speech vs Audio Duration

Net Speech duration is less than audio file duration. While audio files capture a wide variety of environmental sounds (including silence), the present invention requires only human speech as its input (i.e. Net Speech). Non-speech intervals may be filtered from the input audio files. For short phrases (in the range of 10-20 seconds), the average ratio between audio duration and Net Speech is 1.4:1. This, however, will depend on whether a person reads unfamiliar text or says something that they are comfortable with, and the ratio may have to be adjusted depending on the phrase strategy used.
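Under the stated 1.4:1 assumption, the relationship between raw audio duration and Net Speech reduces to a simple calculation, sketched here for illustration (the function names are not part of the disclosed system):

```python
# Typical audio-to-Net-Speech ratio for short phrases, per the text above.
AUDIO_TO_NET_SPEECH_RATIO = 1.4

def required_audio_duration(net_speech_sec: float,
                            ratio: float = AUDIO_TO_NET_SPEECH_RATIO) -> float:
    """Raw audio duration (sec) expected to yield `net_speech_sec` of speech."""
    return net_speech_sec * ratio

def estimated_net_speech(audio_sec: float,
                         ratio: float = AUDIO_TO_NET_SPEECH_RATIO) -> float:
    """Net Speech (sec) expected inside an audio file of `audio_sec` seconds."""
    return audio_sec / ratio

# A 14 sec audio file typically contains ~10 sec of net speech.
print(round(estimated_net_speech(14.0), 1))    # 10.0
print(round(required_audio_duration(10.0), 1)) # 14.0
```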

Performance Across Multiple Channels

Cross-channel Authentications will show lower authentication scores; increase speech sample length or use the Same-Phrase strategy to mitigate. Users should make multiple Enrolments for each of the channels they normally use.

Cross-channel authentication happens when Enrolments and Authentications arrive on different channels. For example, a person could be registered (Enrolled) on a telephone call; after that they could try to authenticate themselves by voice using a web browser (e.g. logging in to a website). Strategies to mitigate performance degradation for cross-channel authentications include: increasing the minimum speech sample duration for Authentications, collecting longer speech samples, and using the Same-Phrase strategy (i.e. Enrolment and Authentication both contain the same spoken phrase). It is recommended to Enrol more than one voice sample for each channel a user might come in through. Users may submit a verification code using audio (Voice Login Strong scenarios). A verification code sent for 2nd Factor Authentication can be recorded and sent as an audio recording; in this case the recording will be appended to the initial speech sample, thus increasing the amount of Net Speech provided.

It is preferred for a user to record a little more of their voice when trying to log in. For example, the user can be instructed to say their full name, town and suburb followed by their phone number; the phone number part will then be used in speech recognition to identify the user, followed by the voiceprint comparison.

Same-Phrase Strategy

The Min. Net Speech requirement can be lowered when similar phrases are used for both Enrolment and Authentication. For arbitrary speech samples it has been estimated that, for the system to perform well, the minimum amount of Net Speech for Enrolment and Authentication should be 30 and 10 seconds respectively. It has been found that the Min. Net Speech requirement can be lowered when the same-phrase approach is used. In this approach, both Enrolment and Authentication are done with voice samples capturing the same phrase; in this case, the Voice Biometrics engine faces an easier task because it compares like with like.
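The stage- and strategy-dependent minimums above can be captured in a small lookup, sketched here for illustration (the dictionary keys and function name are assumptions, not part of the disclosed system):

```python
# Minimum Net Speech requirements (seconds) per stage and phrase strategy,
# using the recommended default values from the summary table.
MIN_NET_SPEECH = {
    ("enrolment", "random"): 30.0,
    ("enrolment", "same_phrase"): 10.0,
    ("authentication", "random"): 10.0,
    ("authentication", "same_phrase"): 6.0,
}

def has_enough_net_speech(net_speech_sec: float, stage: str, strategy: str) -> bool:
    """True if the sample meets the minimum Net Speech for this stage/strategy."""
    return net_speech_sec >= MIN_NET_SPEECH[(stage, strategy)]

print(has_enough_net_speech(12.0, "enrolment", "same_phrase"))  # True
print(has_enough_net_speech(12.0, "enrolment", "random"))       # False
```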

The phrases used for Enrolment and subsequent Authentications do not have to match exactly; it is sufficient if the phrases are merely similar. One example is an address: two different addresses spoken by the same person will likely produce a high comparison score, allowing a positive authentication result to be concluded.

Audio Quality and Environmental Factors

Voice samples containing multiple speakers will affect system performance. It is preferable to avoid Enrolling multiple speakers against the same account / phone number. It has been found that when audio samples sent for voice matching contain multiple speakers talking, system performance will degrade. For this reason, if multiple people need to be Enrolled against the same account, it is best to Enrol them individually. A recommended approach in this situation is to provide a unique identifier (e.g. phone number) for each person being Enrolled in the system and to implement an authorized-person policy in the internal system of the organization.

Automatic Speech Recognition (ASR)

Speech Recognition is not used for Voice Biometrics; it is used to improve Customer Experience. Several Authentication Flows implemented in the system rely on ASR for two purposes: identifier extraction (e.g. phone number) and verification code extraction.

The main reason for using ASR is to reduce the number of steps required to complete a voice-enabled log-in process. Indeed, when a customer is asked to say their identifier as well as a verification code, with just two steps the invention can: (i) use this recording for voice authentication, (ii) identify the customer (their account / phone number), and (iii) complete 2-Factor Authentication for enhanced security. This is achieved mostly hands-free and provides a superior Customer Experience (CX).
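As an illustration of the identifier and verification-code extraction step, the following sketch parses individually spoken digits out of an ASR transcript. The digit-word table, the assumed lengths (10-digit identifier, 6-digit code) and the function name are illustrative assumptions, not the actual ASR component:

```python
import re

# Mapping from spoken digit words to numerals (illustrative; "oh" treated as zero).
DIGIT_WORDS = {"zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
               "four": "4", "five": "5", "six": "6", "seven": "7",
               "eight": "8", "nine": "9"}

def extract_identifier_and_code(transcript: str, id_len: int = 10, code_len: int = 6):
    """Pull a spoken numeric identifier (e.g. phone number) and a verification
    code out of an ASR transcript.  Assumes digits are spoken individually and
    the identifier precedes the code."""
    tokens = re.findall(r"[a-z]+|\d", transcript.lower())
    digits = "".join(DIGIT_WORDS.get(t, t)
                     for t in tokens if t in DIGIT_WORDS or t.isdigit())
    if len(digits) < id_len + code_len:
        return None, None
    return digits[:id_len], digits[id_len:id_len + code_len]

ident, code = extract_identifier_and_code(
    "my number is zero four one two three four five six seven eight "
    "code nine one one two two three")
print(ident, code)  # 0412345678 911223
```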

Voice Recording Instructions to Users

Clear instructions should be provided to users during speech sample collection processes. Customer identifiers should be numeric, and speech samples containing spoken identifiers must not have any other numbers in them.

Preferably, where Voice Login or Voice Login Strong is used, the speech sample should contain only a single numeric identifier (e.g. phone number) and no other numbers. This is required only for Authentication speech samples, not for Enrolment. Numeric identifiers work best with Automatic Speech Recognition and yield reliable performance.
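The single-numeric-identifier constraint can be checked on a normalized transcript as sketched below. This is an assumption-laden illustration: it presumes the ASR has already rendered the spoken identifier as one contiguous run of numerals.

```python
import re

def contains_single_number(transcript: str) -> bool:
    """True if the ASR transcript contains exactly one run of digits
    (the spoken identifier) and no other numbers."""
    return len(re.findall(r"\d+", transcript)) == 1

print(contains_single_number("my phone number is 0412345678"))       # True
print(contains_single_number("I am 45 and my number is 0412345678")) # False
```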

Authentication Prompt Examples

- Please state your full name (including any middle names) and your mobile phone number

- Please state your name, town and the ACME account number

- Please state your suburb, town and mobile phone number

Enrolment Prompt Examples

- Please state your full name (including middle names) and your mobile phone number

- Please state your office address followed by your mobile phone number

- Please state your full name and count 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

Authentication Flows

The following describes the flows to enable voice authentication in a third-party system (e.g. a website or telephony). As discussed, there are two fundamentally different scenarios involved in the process of identification / authentication by voice: (i) Enrolment, and (ii) Login / Authentication.

In Login / Authentication, a voiceprint is generated from the given audio containing a speech sample. The voiceprint is then compared against each of the reference voiceprints stored in a database. The best match that passes a confidence threshold is then returned. Among other parameters documented below, this match contains: the Comparison Score (a value in the range [0..1]) and the Speaker's Identifier (a phone number) corresponding to the best match. The Enrolment scenario is required to populate the reference database in the first instance. This is done by calling the Enrolment API.
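A minimal sketch of the Login / Authentication matching step follows. Cosine similarity is used here purely as an illustrative comparison function; the actual scoring method is not specified in this disclosure, and the toy three-element voiceprints stand in for real feature vectors.

```python
import math

def cosine_similarity(a, b):
    # Illustrative comparison score; real engines use proprietary scoring.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def best_match(probe, references, threshold=0.7):
    """Compare a probe voiceprint against every reference voiceprint and
    return (identifier, score) for the best match above `threshold`,
    or (None, best_score) if no reference passes."""
    best_id, best_score = None, -1.0
    for identifier, ref in references.items():
        score = cosine_similarity(probe, ref)
        if score > best_score:
            best_id, best_score = identifier, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score

refs = {"0412345678": [0.9, 0.1, 0.4], "0498765432": [0.1, 0.9, 0.2]}
ident, score = best_match([0.85, 0.15, 0.35], refs)
print(ident)  # 0412345678
```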

Reference is made to FIG. 7 showing a voice flow for Enrolment, used when registering a new user for voice authentication. Reference is made to FIG. 8 showing a voice flow for voice verification, used to check whether a speaker is registered for voice authentication. Reference is made to FIG. 9 showing a voice flow for Voice Log-in Basic, used to identify and authenticate a person with just one audio file (the audio contains a spoken mobile phone number). Reference is made to FIG. 10 showing a voice flow for Voice Log-in Strong, used to implement a voice-enabled log-in requiring a secure identification and authentication mechanism.

Security Architecture

Platform

Reference is made to FIG. 11. The platform does not expose any Public APIs. It implements a permission-based security model and issues expirable JSON Web Tokens (JWT) to logged-in users that are handled by the Front-End run in browsers. The Frontend APIs that deal with Sensitive Data are additionally protected using a traditional 2-Factor-Authentication (2FA) scheme. Each API has an orthogonal permission associated with it, and access is fully determined by whether the currently logged-in user has the respective permission in their role. The platform uses a dedicated data store that is physically and logically isolated from the publicly available network.
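The orthogonal-permission model can be illustrated with a small lookup. The API paths, permission names and roles below are hypothetical examples, not taken from the actual platform:

```python
# Each API has exactly one orthogonal permission; access depends solely on
# whether the logged-in user's role carries that permission.
API_PERMISSIONS = {
    "/api/enrolments": "enrolment.write",
    "/api/reports": "reports.read",
}

ROLES = {
    "operator": {"enrolment.write"},
    "analyst": {"reports.read"},
}

def is_authorized(role: str, api: str) -> bool:
    """True if the role holds the single permission associated with the API."""
    return API_PERMISSIONS[api] in ROLES.get(role, set())

print(is_authorized("operator", "/api/enrolments"))  # True
print(is_authorized("analyst", "/api/enrolments"))   # False
```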

Biometrics

Reference is made to FIG. 12. Unlike the Platform, access to any Biometrics API is determined by the type of API Key. This is true for both Frontend and Backend APIs. Additionally, Frontend APIs require an OAuth token issued by the Biometrics OAuth Service when a redirect from the Platform is made. The OAuth tokens are always bound to an API Key; thus, when a Business User makes a Frontend API call, the permission is determined by the type of API Key the OAuth token is bound to. The data stores utilized by Biometrics are physically and logically isolated from the publicly available network. Furthermore, Customer Data is implemented as a separate data store and can be hosted by an enterprise customer.

As will be appreciated by the skilled artisan, the computer-implemented methods and systems described herein may be deployed in part or in whole through one or more processors that execute computer software, program codes, and/or instructions on a processor. The processor may be part of a server, client, network infrastructure, mobile computing platform, stationary computing platform, or other computing platform. A processor may be any kind of computational or processing device capable of executing program instructions, codes, binary instructions and the like. The processor may be or may include a signal processor, digital processor, embedded processor, microprocessor or any variant such as a co-processor (math co-processor, graphic co-processor, communication co-processor and the like) and the like that may directly or indirectly facilitate execution of program code or program instructions stored thereon. Preferably a GPU configured to be operable with a parallel computing platform such as CUDA™ (compute unified device architecture; Nvidia, CA, United States) is used. CUDA™ is a parallel computing platform and programming model developed for general computing on GPUs. In GPU-accelerated applications, the sequential part of the workload runs on the CPU (which is optimized for single-threaded performance) while the compute-intensive portion of the application runs on a plurality (even thousands) of GPU cores in parallel. CUDA™ may be implemented in widely used languages such as C, C++, Fortran, Python and MATLAB, and expresses parallelism through extensions using only basic keywords and libraries.

In addition, the processor may enable execution of multiple programs, threads, and codes.

The threads may be executed simultaneously to enhance the performance of the processor and to facilitate simultaneous operations of the application. By way of implementation, methods, program codes, program instructions and the like described herein may be implemented in one or more threads. The thread may spawn other threads that may have assigned priorities associated with them; the processor may execute these threads based on priority or any other order based on instructions provided in the program code. The processor may include memory that stores methods, codes, instructions and programs as described herein and elsewhere.

Any processor or a mobile communication device or server may access a storage medium through an interface that may store methods, codes, and instructions as described herein and elsewhere. The storage medium associated with the processor for storing methods, programs, codes, program instructions or other type of instructions capable of being executed by the computing or processing device may include but may not be limited to one or more of a CD-ROM, DVD, memory, hard disk, flash drive, RAM, ROM, cache and the like.

A processor may include one or more cores that may enhance speed and performance of a multiprocessor. In some embodiments, the processor may be a dual core processor, quad core processor, or other chip-level multiprocessor and the like that combines two or more independent cores on a single die.

The methods and systems described herein may be deployed in part or in whole through one or more hardware components that execute software on a server, client, firewall, gateway, hub, router, or other such computer and/or networking hardware. The software program may be associated with a server that may include a file server, print server, domain server, internet server, intranet server and other variants such as secondary server, host server, distributed server and the like. The server may include one or more of memories, processors, computer readable media, storage media, ports (physical and virtual), communication devices, and interfaces capable of accessing other servers, clients, computers, and devices through a wired or a wireless medium, and the like. The methods, programs or codes as described herein and elsewhere may be executed by the server. In addition, other devices required for execution of methods as described in this application may be considered as a part of the infrastructure associated with the server.

The server may provide an interface to other devices including, without limitation, clients, other servers, printers, database servers, print servers, file servers, communication servers, distributed servers and the like. Additionally, this coupling and/or connection may facilitate remote execution of program across the network. The networking of some or all of these devices may facilitate parallel processing of a program or method at one or more location without deviating from the scope of the invention. In addition, any of the devices attached to the server through an interface may include at least one storage medium capable of storing methods, programs, code and/or instructions. A central repository may provide program instructions to be executed on different devices. In this implementation, the remote repository may act as a storage medium for program code, instructions, and programs.

The software program may be associated with a client that may include a file client, print client, domain client, internet client, intranet client and other variants such as secondary client, host client, distributed client and the like. The client may include one or more of memories, processors, computer readable media, storage media, ports (physical and virtual), communication devices, and interfaces capable of accessing other clients, servers, computers, and devices through a wired or a wireless medium, and the like. The methods, programs or codes as described herein and elsewhere may be executed by the client. In addition, other devices required for execution of methods as described in this application may be considered as a part of the infrastructure associated with the client.

The client may provide an interface to other devices including, without limitation, servers, other clients, printers, database servers, print servers, file servers, communication servers, distributed servers and the like. Additionally, this coupling and/or connection may facilitate remote execution of program across the network. The networking of some or all of these devices may facilitate parallel processing of a program or method at one or more location without deviating from the scope of the invention. In addition, any of the devices attached to the client through an interface may include at least one storage medium capable of storing methods, programs, applications, code and/or instructions. A central repository may provide program instructions to be executed on different devices. In this implementation, the remote repository may act as a storage medium for program code, instructions, and programs.

The methods and systems described herein may be deployed in part or in whole through network infrastructures. The network infrastructure may include elements such as computing devices, servers, routers, hubs, firewalls, clients, personal computers, communication devices, routing devices and other active and passive devices, modules and/or components as known in the art. The computing and/or noncomputing device(s) associated with the network infrastructure may include, apart from other components, a storage medium such as flash memory, buffer, stack, RAM, ROM and the like. The processes, methods, program codes, instructions described herein and elsewhere may be executed by one or more of the network infrastructural elements.

The methods, program codes, calculations, algorithms, and instructions described herein may be implemented on a cellular network having multiple cells. The cellular network may either be frequency division multiple access (FDMA) network or code division multiple access (CDMA) network. The cellular network may include mobile devices, cell sites, base stations, repeaters, antennas, towers, and the like. The cell network may be a GSM, GPRS, 3G, 4G, EVDO, mesh, or other networks types.

The methods, programs codes, calculations, algorithms and instructions described herein may be implemented on or through mobile devices. The mobile devices may include navigation devices, cell phones, mobile phones, mobile personal digital assistants, laptops, palmtops, netbooks, pagers, electronic books readers, music players and the like. These devices may include, apart from other components, a storage medium such as a flash memory, buffer, RAM, ROM and one or more computing devices. The computing devices associated with mobile devices may be enabled to execute program codes, methods, and instructions stored thereon.

Alternatively, the mobile devices may be configured to execute instructions in collaboration with other devices. The mobile devices may communicate with base stations interfaced with servers and configured to execute program codes. The mobile devices may communicate on a peer to peer network, mesh network, or other communications network. The program code may be stored on the storage medium associated with the server and executed by a computing device embedded within the server. The base station may include a computing device and a storage medium. The storage device may store program codes and instructions executed by the computing devices associated with the base station.

The computer software, program codes, and/or instructions may be stored and/or accessed on computer readable media that may include: computer components, devices, and recording media that retain digital data used for computing for some interval of time; semiconductor storage known as random access memory (RAM); mass storage typically for more permanent storage, such as optical discs, forms of magnetic storage like hard disks, tapes, drums, cards and other types; processor registers, cache memory, volatile memory, non-volatile memory; optical storage such as CD, DVD; removable media such as flash memory (e.g. USB sticks or keys), floppy disks, magnetic tape, paper tape, punch cards, standalone RAM disks, removable mass storage, off-line storage, and the like; other computer memory such as dynamic memory, static memory, read/write storage, mutable storage, read only, random access, sequential access, location addressable, file addressable, content addressable, network attached storage, storage area network, bar codes, magnetic ink, and the like.

The methods and systems described herein may transform physical and/or intangible items from one state to another. The methods and systems described herein may also transform data representing physical and/or intangible items from one state to another.

The elements described and depicted herein, including in flow charts and block diagrams throughout the figures, imply logical boundaries between the elements. However, according to software or hardware engineering practices, the depicted elements and the functions thereof may be implemented on computers through computer executable media having a processor capable of executing program instructions stored thereon as a monolithic software structure, as standalone software modules, or as modules that employ external routines, code, services, and so forth, or any combination of these, and all such implementations may be within the scope of the present disclosure.

Furthermore, the elements depicted in any flow chart or block diagrams or any other logical component may be implemented on a machine capable of executing program instructions. Thus, while the foregoing drawings and descriptions set forth functional aspects of the disclosed systems, no particular arrangement of software for implementing these functional aspects should be inferred from these descriptions unless explicitly stated or otherwise clear from the context. Similarly, it will be appreciated that the various steps identified and described above may be varied, and that the order of steps may be adapted to particular applications of the techniques disclosed herein. All such variations and modifications are intended to fall within the scope of this disclosure. As such, the depiction and/or description of an order for various steps should not be understood to require a particular order of execution for those steps, unless required by a particular application, or explicitly stated or otherwise clear from the context.

The methods and/or processes described above, and steps thereof, may be realized in hardware, software or any combination of hardware and software suitable for a particular application. The hardware may include a general purpose computer and/or dedicated computing device or specific computing device or particular aspect or component of a specific computing device. The processes may be realized in one or more microprocessors, microcontrollers, embedded microcontrollers, programmable digital signal processors or other programmable device, along with internal and/or external memory. The processes may also, or instead, be embodied in an application specific integrated circuit, a programmable gate array, programmable array logic, or any other device or combination of devices that may be configured to process electronic signals. It will further be appreciated that one or more of the processes may be realized as a computer executable code capable of being executed on a computer readable medium.

The Application software may be created using a structured programming language such as C, an object oriented programming language such as C++, or any other high-level or low-level programming language (including assembly languages, hardware description languages, and database programming languages and technologies) that may be stored, compiled or interpreted to run on one of the above devices, as well as heterogeneous combinations of processors, processor architectures, or combinations of different hardware and software, or any other machine capable of executing program instructions.

Thus, in one aspect, each method described above and combinations thereof may be embodied in computer executable code that, when executing on one or more computing devices, performs the steps thereof. In another aspect, the methods may be embodied in systems that perform the steps thereof, and may be distributed across devices in a number of ways, or all of the functionality may be integrated into a dedicated, standalone device or other hardware. In another aspect, the means for performing the steps associated with the processes described above may include any of the hardware and/or software described above. All such permutations and combinations are intended to fall within the scope of the present disclosure.

The invention may be embodied in program instruction set executable on one or more computers. Such instructions sets may include any one or more of the following instruction types:

Data handling and memory operations, which may include an instruction to set a register to a fixed constant value, to copy data from a memory location to a register or vice-versa (such a machine instruction is often called move, though the term is misleading), to store the contents of a register or the result of a computation, to retrieve stored data to perform a computation on it later, or to read and write data from hardware devices.

Arithmetic and logic operations, which may include an instruction to add, subtract, multiply, or divide the values of two registers, placing the result in a register, possibly setting one or more condition codes in a status register, to perform bitwise operations, e.g., taking the conjunction and disjunction of corresponding bits in a pair of registers, taking the negation of each bit in a register, or to compare two values in registers (for example, to see if one is less, or if they are equal).

Control flow operations, which may include an instruction to branch to another location in the program and execute instructions there, conditionally branch to another location if a certain condition holds, indirectly branch to another location, or call another block of code, while saving the location of the next instruction as a point to return to. Coprocessor instructions, which may include an instruction to load/store data to and from a coprocessor, or exchanging with CPU registers, or perform coprocessor operations.

A processor of a computer of the present system may include "complex" instructions in their instruction set. A single "complex" instruction does something that may take many instructions on other computers. Such instructions are typified by instructions that take multiple steps, control multiple functional units, or otherwise appear on a larger scale than the bulk of simple instructions implemented by the given processor. Some examples of "complex" instructions include: saving many registers on the stack at once, moving large blocks of memory, complicated integer and floating-point arithmetic (sine, cosine, square root, etc.), SIMD instructions, a single instruction performing an operation on many values in parallel, performing an atomic test-and-set instruction or other read-modify-write atomic instruction, and instructions that perform ALU operations with an operand from memory rather than a register.

An instruction may be defined according to its parts. According to more traditional architectures, an instruction includes an opcode that specifies the operation to perform, such as add contents of memory to register — and zero or more operand specifiers, which may specify registers, memory locations, or literal data. The operand specifiers may have addressing modes determining their meaning or may be in fixed fields. In very long instruction word (VLIW) architectures, which include many microcode architectures, multiple simultaneous opcodes and operands are specified in a single instruction.

Some types of instruction sets do not have an opcode field (such as Transport Triggered Architectures (TTA) or the Forth virtual machine), only operand(s). Other unusual "0-operand" instruction sets lack any operand specifier fields, such as some stack machines including NOSC.

Conditional instructions often have a predicate field — several bits that encode the specific condition to cause the operation to be performed rather than not performed. For example, a conditional branch instruction will be executed, and the branch taken, if the condition is true, so that execution proceeds to a different part of the program, and not executed, and the branch not taken, if the condition is false, so that execution continues sequentially. Some instruction sets also have conditional moves, so that the move will be executed, and the data stored in the target location, if the condition is true, and not executed, and the target location not modified, if the condition is false. Similarly, IBM z/Architecture has a conditional store. A few instruction sets include a predicate field in every instruction; this is called branch predication.

The instructions constituting a program are rarely specified using their internal, numeric form (machine code); they may be specified using an assembly language or, more typically, may be generated from programming languages by compilers.

Any of the methods disclosed herein may be performed by application software executable on any past, present or future operating system of a processor-enabled device such as Windows™, Linux™, Android™, iOS™, and the like. It will be appreciated that any software may be distributed across a number of devices or in a "software as a service" or "platform as a service" format, whereby participants require only some computer-based means of engaging with the software.

The present invention may be configured to interface with the infrastructure (in terms of both hardware and software) of a contact centre. Thus, the audio stream may be input into the present methods via a telephony line (physical or VOIP) of the contact centre. Furthermore, reporting outputs may be written into existing databases of the contact centre and thereby be amenable to inclusion in existing reporting systems.

Those skilled in the art will appreciate that the invention described herein is susceptible to further variations and modifications other than those specifically described. It is understood that the invention comprises all such variations and modifications which fall within the spirit and scope of the present invention.

While the invention has been disclosed in connection with the preferred embodiments shown and described in detail, various modifications and improvements thereon will become readily apparent to those skilled in the art.

Accordingly, the spirit and scope of the present invention is not to be limited by the foregoing examples, but is to be understood in the broadest sense allowable by law.