

Title:
SCAM CALL PREVENTION
Document Type and Number:
WIPO Patent Application WO/2023/245231
Kind Code:
A1
Abstract:
A method comprises receiving a rerouted phone call identified as a scam call, and processing received caller speech from the rerouted phone call to determine a response. The method includes interacting with a caller using the determined response, wherein the response is determined in order to extend a duration of the phone call. The processing may comprise identifying features in the received call speech associated with ending and/or extending a call. The identifying may comprise identifying one or more of: negative emotions in the caller speech, and threats in the caller speech.

Inventors:
KAAFAR DALI (AU)
WOOD IAN (AU)
KEPKOWSKI MICHAL (AU)
Application Number:
PCT/AU2023/050544
Publication Date:
December 28, 2023
Filing Date:
June 19, 2023
Assignee:
UNIV MACQUARIE (AU)
International Classes:
G06F40/56; G06F40/30; G06N3/0475; G06N3/09; G10L13/027; G10L15/26; G10L25/30; H04L65/1076; H04M3/436
Foreign References:
US20160119377A12016-04-28
US10110741B12018-10-23
US20190149575A12019-05-16
US20210203778A12021-07-01
US20180007199A12018-01-04
Other References:
SAHIN, MERVE; RELIEU, MARC: "Using chatbots against voice spam: Analyzing Lenny's effectiveness", USENIX, The Advanced Computing Systems Association, 12 July 2017 (2017-07-12), pages 324-342, XP061025299
Attorney, Agent or Firm:
FOUNDRY INTELLECTUAL PROPERTY PTY LTD (AU)
CLAIMS:

1. A method comprising: receiving a rerouted phone call identified as a scam call; processing received caller speech from the rerouted phone call to determine a response; and interacting with a caller using the determined response, wherein the response is determined in order to extend a duration of the phone call.

2. The method of claim 1, wherein the processing comprises identifying features in the received call speech associated with ending and/or extending a call.

3. The method of claim 2, wherein the identifying comprises identifying one or more of: negative emotions in the caller speech, and threats in the caller speech.

4. The method of any one of the preceding claims, wherein the response is determined in order to maximise the duration of the phone call.

5. The method of any one of the preceding claims, wherein the processing of the received caller speech comprises utilising a conversational artificial intelligence bot trained with a reinforcement learning training objective with a small positive reward for each utterance and a large negative reward when the rerouted phone call ends.

6. A method comprising: detecting a received scam call; and rerouting the detected scam call to a scam call bot, wherein the scam call bot is configured to extend a duration of the rerouted call.

7. The method of claim 6, wherein the scam call bot is configured to extend the duration of the rerouted call by interacting with a caller of the scam call via responses determined by the scam call bot.

8. The method of claim 7, wherein the responses are determined based on identified features in the caller’s speech associated with ending and/or extending a call.

9. The method of any one of the preceding claims, wherein the duration of the call is extended by intentionally generating and responding with a response imperfection selected from a group comprising: backchannelling utterances, time-wasting phrases, and conversation repair phrases.

10. A system comprising: a telephony endpoint for receiving a rerouted scam call; a speech-to-text module configured to convert caller speech from the received scam call to text; a conversational artificial intelligence (AI) bot configured to receive the text from the speech-to-text module, process the received text, determine a response so as to extend a duration of the scam call, and output the determined response; and a text-to-speech module configured to receive the determined response in text form from the bot, convert the text to a voice response, and output the voice response to the caller via the telephony endpoint.

11. The system of claim 10, wherein the text-to-speech module is configured for voice cloning.

12. The system of claim 10 or claim 11, wherein the conversational AI bot processes the received text by identifying features in the received call speech associated with ending and/or extending a call.

13. The system of claim 12, wherein the bot is configured to identify the features by identifying one or more of: negative emotions in the caller speech, and threats in the caller speech.

14. The system of any one of claims 10 to 13, wherein the bot is configured to determine the response in order to maximise the duration of the scam call.

15. The system of any one of claims 10 to 14, wherein the bot is trained with a reinforcement learning training objective with a small positive reward for each utterance and a large negative reward when the rerouted scam call ends.

16. The system of any one of claims 10 to 15 further comprising an audio processing module connecting the text-to-speech module and the telephony endpoint, and configured to process the voice response by mixing the voice response with an environment signal.

17. The system of any one of claims 10 to 16, wherein the conversational AI bot further comprises a conversation controller adapted to manage a conversation flow by adding utterances to the response that extend the duration of the scam call, wherein the added utterances comprise one or more of: a time-wasting phrase, a conversation repair phrase, a backchannelling phrase, and an interrupting phrase.

18. The system of any one of claims 10 to 17, wherein the conversational AI bot further comprises a response controller configured to: discard a response utterance in response to a scammer utterance occurring during a response utterance processing, and remove said discarded response from a conversation history of the AI bot.

Description:
Scam Call Prevention

Technical Field

[0001] The present disclosure broadly relates to scam call prevention and, more particularly, to a system for, and a method of, using conversational artificial intelligence to interact with a scam call.

Background

[0002] A scam call is a voice telephony call generated for the purpose of dishonestly obtaining a benefit, or causing a loss, by deception or other means. Phone calls are the most common way that scammers target victims and have the greatest financial impact compared to other scam contact methods (such as emails or social networks). Scams include fraud against phone company customers by third parties, for example in the form of telemarketing fraud or caller ID spoofing used for vishing (i.e., voice phishing). Scams may also take the form of fake security assistance, e-commerce platform follow-ups, impersonation of government agency requests, and so on.

[0003] Phone companies and governments are actively involved in curbing scam calls, and in some countries governments enforce legislation obliging phone companies to detect, trace and block scam calls. The Communications Alliance is an example of an organisation formed for the Australian communications industry in order to work towards reducing SMS and telephone scams, as outlined in its "Industry Code". One example of a scam technique addressed by phone companies is caller ID spoofing, where a fake caller ID is displayed when a call is made. Phone companies apply scam detection technology that identifies such calls, and the calls are then blocked. These methods are not 100% effective, however, and scam calls still get through and cause harm.

[0004] Some scam call detection systems make use of a bot (also called a chatbot), i.e. an autonomous program that interacts with the caller. In one example, an unsolicited phone call is detected based on an analysis of a conversation between a caller who initiated the call and a bot that uses a voice recording impersonating a scam target individual, and the call is then blocked.

[0005] These types of bots use conversational artificial intelligence (AI) to talk to the caller, i.e. the perpetrator of the scam call. Conversational AI uses machine learning and natural language processing to imitate human interactions by recognising speech and then responding with appropriate phrases, for example providing answers to questions according to a database and/or algorithm.

[0006] Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present disclosure as it existed before the priority date of each claim of this application.

Summary

[0007] Conventional scam call detection systems aim to terminate a scam call as soon as possible once a scam call has been identified. This conventional strategy, however, immediately frees up the resources of the scammer to initiate a new scam call. The systems and methods described herein aim to do the opposite: once a scam call has been detected, the call is rerouted to connect with a conversational artificial intelligence bot configured to present a convincing scam victim in order to maintain the call for as long as possible. Conversational AI bots engage with scammers, waste their time, and/or make available insights into the scams they perpetrate. In this way, scammer resources remain occupied with the bot and cannot be used to target new scam victims for the duration of the redirected call, and/or said insights are available for scam prevention tasks such as warning and educating potential victims.

[0008] In one aspect there is provided a method comprising: receiving a rerouted phone call identified as a scam call; processing received caller speech from the rerouted phone call to determine a response; interacting with a caller using the determined response, wherein the response is determined in order to extend a duration of the phone call.

[0009] The processing may comprise identifying features in the received call speech associated with ending and/or extending a call. The identifying may comprise identifying one or more of: negative emotions in the caller speech, and threats in the caller speech. These features are advantageous because they enable the duration of a scam conversation to be maximised, by maximising both the engagement of scammers and the believability of the bot used for the rerouted call.

[0010] The response may be determined in order to maximise the duration of the phone call.

[0011] The processing of the received caller speech may comprise utilising a conversational artificial intelligence bot trained with a reinforcement learning training objective with a small positive reward for each utterance and a large negative reward when the rerouted phone call ends.

[0012] In another aspect there is provided a method comprising: detecting a received scam call; and rerouting the detected scam call to a scam call bot, wherein the scam call bot is configured to extend a duration of the rerouted call.

[0013] The scam call bot may be configured to extend the duration of the rerouted call by interacting with a caller of the scam call via responses determined by the scam call bot.

[0014] The responses may be determined based on identified features in the caller’s speech associated with ending and/or extending a call.

[0015] The duration of the call may be extended by intentionally generating and responding with a response imperfection selected from a group comprising: backchannelling utterances, time-wasting phrases, and conversation repair phrases.

[0016] In another aspect there is provided a system comprising: a telephony endpoint for receiving a rerouted scam call; a speech-to-text module configured to convert caller speech from the received scam call to text; a conversational artificial intelligence (AI) bot configured to receive the text from the speech-to-text module, process the received text, determine a response so as to extend a duration of the scam call, and output the determined response; and a text-to-speech module configured to receive the determined response in text form from the bot, convert the text to a voice response, and output the voice response to the caller via the telephony endpoint.

[0017] The text-to-speech module may be configured for voice cloning.

[0018] The conversational AI bot may process the received text by identifying features in the received call speech associated with ending and/or extending a call. The bot may be configured to identify the features by identifying one or more of: negative emotions in the caller speech, and threats in the caller speech.

[0019] The bot may be configured to determine the response in order to maximise the duration of the scam call.

[0020] The bot may be trained with a reinforcement learning training objective with a small positive reward for each utterance and a large negative reward when the rerouted scam call ends.

[0021] The system may further comprise an audio processing module connecting the text-to-speech module and the telephony endpoint, and configured to process the voice response by mixing the voice response with an environment signal. This signal may be an audio signal, mimicking environmental and/or background sounds.

[0022] The conversational AI bot may further comprise a conversation controller adapted to manage a conversation flow by adding utterances to the response that extend the duration of the scam call, wherein the added utterances comprise one or more of: a time-wasting phrase, a conversation repair phrase, a backchannelling phrase, and an interrupting phrase.

[0023] The conversational AI bot may further comprise a response controller configured to: discard a response utterance in response to a scammer utterance occurring during a response utterance processing, and remove said discarded response from a conversation history of the AI bot.

[0024] Throughout this specification the word “comprise” or variations such as “comprises” or “comprising”, will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.

Brief Description of Drawings

[0025] Embodiments of the disclosure are now described by way of example with reference to the accompanying drawings in which:

[0026] Figure 1 is a schematic representation of a communication network.

[0027] Figure 2 is a schematic representation of a system used to implement a conversational artificial intelligence bot.

[0028] Figure 3 is a schematic representation of a method of predicting features as a side task using a K- Adapter.

[0029] Figure 4 is a schematic representation of a method of predicting input features.

[0030] Figure 5 is a schematic representation of a sequence-to-sequence transformer model.

[0031] Figure 6 illustrates an embodiment of a method of rerouting a detected scam call to a conversational artificial intelligence bot.

[0032] Figure 7 illustrates an embodiment of an on-phone scam detection and rerouting method.

[0033] Figure 8 illustrates an embodiment of a method of interacting with a scam call using a conversational artificial intelligence bot.

[0034] Figure 9 is a schematic diagram of an exemplary embodiment of a call processing system.

[0035] Figure 10 is a schematic diagram of a configuration server that forms part of the call processing system of Figure 9.

[0036] Figure 11 is a schematic diagram of an audio processing module that forms part of the call processing system of Figure 9.

[0037] Figure 12 is a schematic representation of an overtalk module that forms part of the call processing system of Figure 9.

[0038] Figure 13 is a schematic representation of an outbound call module that forms part of the call processing system of Figure 9.

[0039] Figure 14 is a schematic diagram of another exemplary embodiment of a call processing system.

[0040] Figure 15 is a schematic diagram of an exemplary embodiment of a docker deployment of the call processing system of Figure 14.

[0041] Figure 16 is a schematic representation of a load balancing module that forms part of the system of Figure 14.

[0042] In the drawings, like reference numerals designate similar parts.

Detailed Description

System Overview

[0043] Figure 1 of the drawings illustrates a communication network 100 that supports both data and telephony. The network operator 108 provides telecommunications services to its users via the network 100. A user can make or receive phone calls via a user device 102 (for example a mobile phone, a smartphone, a landline phone, a Voice over IP (VoIP) device or the like). An incoming call from an originating device 104 is managed by the network operator 108, and switched to the user device 102 via the network 100. A server 110 is in communication with the network operator 108 and/or the user device 102 via the network 100.

[0044] Figure 2 is a high level schematic representation of a system 210 provided by the server 110 that is used to implement a conversational artificial intelligence (AI) bot 206. The system 210 includes a telephony endpoint 202 for receiving a rerouted scam call or initiating calls to known scam phone numbers, and a speech-to-text (STT) module 204 configured to convert caller speech from the received scam call to text. The system 210 has a conversational AI bot 206 configured to receive the text from the speech-to-text module 204, process the received text, determine a response so as to extend a duration of the scam call, and output the determined response. The system 210 includes a text-to-speech (TTS) module 208 configured to receive the determined response in text form from the bot 206, convert the text to a voice response, and output the voice response to the caller via the telephony endpoint 202. The system 210 optionally includes an audio processing module 209 between the TTS module 208 and the telephony endpoint 202. The audio processing module 209 applies audio processing to mimic the background acoustic (i.e. sound) environment of a phone call and enhance voice believability, and outputs the processed voice response to the caller via the telephony endpoint 202. In some embodiments, the TTS module 208 includes voice cloning capabilities.

[0045] The telephony endpoint 202 may be, for example, an Asterisk server. In this embodiment, the system 210 includes a telephony endpoint 202 for receiving a rerouted scam call. In other embodiments the telephony endpoint 202 may be separate from the system 210, interfacing with the system via the STT and TTS modules. The telephony endpoint 202 is capable of receiving Session Initiation Protocol (SIP) calls. SIP is the communication protocol used for VoIP calls.
The telephony endpoint 202 communicates with the speech-to-text module 204 and the text-to-speech module 208 (the latter via the audio processing module 209, if present), which in turn communicate with the conversational AI bot 206. The telephony endpoint 202 processes the audio signals of the call, passing them to the speech-to-text module 204 and receiving them from the text-to-speech module 208 via raw audio WebSockets; these modules in turn communicate with the bot 206 over WebSockets in plain text. The speech-to-text module 204 may be implemented using, for example, Google STT.
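By way of illustration, the flow through the modules of Figure 2 may be sketched as a simple processing loop. The function names and signatures below are hypothetical stand-ins for the WebSocket-connected STT, bot, TTS and audio processing services, not the actual implementation:

```python
def handle_rerouted_call(audio_frames, stt, bot, tts, audio_fx=None):
    """Process one rerouted scam call: STT -> bot -> TTS (-> audio mixing)."""
    voice_responses = []
    for frame in audio_frames:
        text = stt(frame)            # transcribe the caller's speech
        if not text:
            continue                 # skip silence / unrecognised audio
        reply = bot(text)            # bot chooses a call-extending reply
        voice = tts(reply)           # synthesise the reply
        if audio_fx is not None:
            voice = audio_fx(voice)  # optional background-sound mixing
        voice_responses.append(voice)
    return voice_responses
```

In the real system each stage runs as a separate service and the loop is driven by streaming audio rather than discrete frames; the sketch only shows the order in which the modules are invoked.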

[0046] The architecture of the system 210 described with reference to Figure 2 is highly scalable. Multiple phone numbers and VoIP initiators can be assigned to the same SIP trunk and the telephony endpoint can be replicated and load balanced to withstand many simultaneous calls.

[0047] What the AI bot does:

[0048] The bot 206 is a text-based conversational AI bot, and in some embodiments, open source pre-trained bots such as the ParlAI "BlenderBot" may be adapted to implement the bot 206. The bot 206 is configured to process the received text by identifying features in the received call speech associated with ending and/or extending a call. In some embodiments, the features are associated with negative emotions and/or threats detected in the caller speech.

[0049] In some embodiments, the method comprises processing text-based features found to be associated with ending the call. These text-based features are derived from text transcripts of conversations between the scammer and the bot. The AI bot is configured to identify features in the text of the transcripts (such as phrases or word patterns identified by machine learning models trained to extract scam stages) that may be considered indicators that a call is heading towards its end (e.g. word length, number of words per utterance, uniqueness of words, vocabulary richness, etc.). The features may be determined by predictions of machine learning (ML) models trained on an objective statistically associated with ending calls, and/or the features may be identified by unsupervised ML models statistically associated with ending calls. The AI bot is then configured to avoid these identified features in order to avoid ending a call.
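By way of illustration, the simpler surface features mentioned above (number of words per utterance, word length, vocabulary richness) can be computed directly from a transcript line. This sketch is illustrative only; it is not the ML-derived feature set described in this disclosure:

```python
def utterance_features(utterance):
    """Compute simple text features of one transcript utterance."""
    words = utterance.lower().split()
    if not words:
        return {"n_words": 0, "mean_word_len": 0.0, "type_token_ratio": 0.0}
    return {
        "n_words": len(words),
        "mean_word_len": sum(len(w) for w in words) / len(words),
        # type-token ratio as a crude measure of vocabulary richness
        "type_token_ratio": len(set(words)) / len(words),
    }
```

Features like these could then be correlated with call endings across a corpus of transcripts to decide which ones the bot should avoid exhibiting.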

[0050] The processing of the received call speech may comprise utilising a conversational artificial intelligence bot trained to mimic victim utterances in scammer-victim phone conversations, e.g. in long scammer-victim phone conversations.

[0051] The processing of received caller speech from the rerouted phone call to determine a response may include the addition of heuristic features to responses determined to increase conversation length. Heuristic features may include predetermined initial responses, addition of speech disfluencies, conforming to a predetermined persona, and/or restriction to a maximum or minimum sentence length.

[0052] The processing of the received caller speech may comprise utilising a conversational artificial intelligence bot trained or fine-tuned on labelled real phone scam transcripts, for example manually labelled real phone scam transcripts. Sources of scam transcripts may include labelled transcripts from publicly available “scam baiter” videos in which concerned individuals (“scam baiters”) converse with real scammers knowing that the call is a scam.

[0053] The bot 206 is further configured to determine the response. In some embodiments, the bot 206 is configured to determine the response in order to maximise the duration of the scam call.

[0054] In some embodiments the bot 206 is configured to mimic scam victims. This may, for example, be done through the addition of short term memory, empathy, and personas that allow the bot 206 to maintain consistent knowledge of personal facts such as a name, address and aspects of a fictitious personal life. The personas include features that enable a sufficiently convincing mimic of a vulnerable human scam victim.

[0055] In some embodiments the bot 206 may include heuristic text generation designed to prolong conversations with scammers and/or produce better quality conversations with scammers. These heuristics may include fixed initial bot utterances, injection of disfluencies into bot utterances, bot utterance sentence length truncation or exclusion of long sentences, and heuristics to prevent the bot from talking over the scammer.
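By way of illustration, two of the heuristics above (sentence length truncation and disfluency injection) may be sketched as a post-processing step on the bot's reply. The specific disfluency tokens and the 50% injection rate are illustrative assumptions:

```python
import random


def apply_heuristics(reply, max_sentences=2, rng=None):
    """Illustrative post-processing: keep replies short, inject a disfluency."""
    rng = rng or random.Random()
    # Truncate the reply to at most `max_sentences` sentences.
    sentences = [s.strip() for s in reply.split(".") if s.strip()]
    reply = ". ".join(sentences[:max_sentences]) + "."
    # Occasionally prepend a disfluency to sound more human.
    if rng.random() < 0.5:
        reply = rng.choice(["Um, ", "Uh, ", "Hmm, "]) + reply
    return reply
```

In practice such heuristics would be tuned against measured call durations rather than fixed constants.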

[0056] In some embodiments, recordings and transcripts from conversations between scammers and AI bots may be analysed to determine threat intelligence information. Threat intelligence information may include: the target organisation that the scammer is pretending to be, the social engineering techniques used by the scammer, the topic of the scammer's script, and/or the structure and/or stages in scripts used by the scammer.

[0057] Threat intelligence from recordings and transcripts of conversations between scammers and AI bots may be utilised (e.g., by sale to a third party) as additional data used by AI bots to effectively prolong calls with scammers. The threat intelligence data may be used to identify and educate potential future scam victims so as to reduce the success rate of scams, or to inform concerned organisations so that they can warn their customers of existing scam campaigns impersonating the organisations' processes or personnel.

[0058] How the AI bot is trained:

[0059] In some embodiments the bot 206 implements AI models built around large pre-trained sequence-to-sequence models such as BART, T5, and GPT. These models achieve very good fluency. The models are fine-tuned on conversation data such as scam call transcripts for domain adaptation of pre-trained conversational AI models. In one embodiment Blenderbot is fine-tuned on "scam baiter" conversations with real scammers obtained, for example, from YouTube, or on synthetic scammer-victim conversations crafted for conversation diversity and/or for specific conversational patterns. Conversation data for training may be enhanced with the application of text generation heuristics found to be associated with longer scam call conversations.

[0060] The conversational AI bot described herein presents novel challenges to fine-tuning due to long call durations (pilot data averaged 86 utterances) and the adversarial nature of the task (the aim is not a high-quality, effective conversation, but to prolong the conversation irrespective of conversational quality).

[0061] “Wild” data from calls with real scammers enables an additional form of training. The primary goal is for the bot to achieve long call durations with real scammers. In some embodiments the duration of a “wild” call (one with a real scammer) is used as a reinforcement learning (RL) training objective with a small positive reward for each utterance and a large negative reward when the scammer hangs up. In this way the AI is optimised for longer conversations via reinforcement learning on call length and dialogue self-play.
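By way of illustration, this RL objective may be sketched as a per-utterance reward assignment over a completed call. The specific reward magnitudes below are illustrative placeholders, not values from this disclosure:

```python
def utterance_rewards(n_utterances, scammer_hung_up,
                      step_reward=0.1, hangup_penalty=-5.0):
    """Assign a small positive reward to each bot utterance, plus a large
    negative reward on the final utterance if the scammer hung up.
    Reward magnitudes are illustrative only."""
    rewards = [step_reward] * n_utterances
    if scammer_hung_up and rewards:
        rewards[-1] += hangup_penalty
    return rewards
```

The resulting per-utterance rewards would feed a standard policy-gradient update, so that the model learns to favour responses that keep the call alive.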

[0062] Identified conversation features that relate directly to longer call durations may also be used as RL training objectives, for example features associated with scammer script steps, and those expected or found to be associated with ending or extending a call, such as scammers’ negative emotions and threats. Features taken into consideration for the purpose of extending the duration of a scam call may include one or more of: the subject of the call, emotions, topics, and keywords. Relevant features may be determined through analysis of available scam call transcripts and based on existing research and understanding of persuasion, social engineering and psychology.

Available scam call transcripts will include previously existing public records of scam calls in addition to records of scammer conversations with the bots used to engage with rerouted scam calls. These features are incorporated into training as side tasks in addition to the main fine-tuning task. A model that is able to distil the knowledge necessary to predict call features associated with longer “wild” call durations is equipped to recognise model updates that are effective for achieving longer calls. In this way the duration of a scam conversation can be extended or maximised by increasing engagement of scammers and improving believability of the bot used for the rerouted call.

[0063] The main fine-tuning task consists of further training of a pre-trained model using task-specific data (e.g., scam transcripts). The bot is used to attempt to predict words in victim utterances given previous utterances in scam transcripts from the training data. The training is considered “fine-tuning” as the quantity of data used in these pre-trained models is orders of magnitude larger than the data used for fine-tuning. The training causes the model to iteratively give higher likelihood to generating the actual words spoken by victims in the training data (the scammer words are treated as model inputs). In this way the end-to-end conversational AI model adapts to new contexts.
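The asymmetry above (scammer words as model inputs, victim words as prediction targets) amounts to masking the training loss so that only victim tokens contribute. A minimal sketch, where `token_nll` is a hypothetical per-token negative log-likelihood supplied by the underlying model:

```python
def victim_token_loss(transcript, token_nll):
    """Average next-token loss over victim tokens only; scammer turns serve
    as context (inputs), not prediction targets. `token_nll(context, token)`
    is a hypothetical per-token loss from the language model."""
    losses, context = [], []
    for speaker, text in transcript:          # e.g. ("Scammer", "pay now")
        for tok in text.split():
            if speaker == "Victim":
                losses.append(token_nll(context, tok))
            context.append(tok)               # every token extends the context
    return sum(losses) / len(losses) if losses else 0.0
```

Minimising this masked loss pushes the model toward generating the victim's actual words while still conditioning on the full scammer-victim history.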

[0064] Transfer learning through training on multiple related tasks may result in an improved model. Therefore, in some embodiments, side tasks are implemented, such as predicting from the last hidden layers of the underlying transformer, predicting from the RL action space, or predicting from the adapter framework. Side tasks for embodiments based on a pre-trained transformer natural language processing (NLP) model may, for example, be implemented by predicting from hidden layers of the underlying transformer model or through the adapter framework.

[0065] Predicting from the RL action space may be understood with reference to T. Zhao, K. Xie, and M. Eskenazi, ‘Rethinking Action Spaces for Reinforcement Learning in End-to-end Dialog Agents with Latent Variable Models’, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, Jun. 2019, pp. 1208-1218, doi: 10.18653/v1/N19-1123, incorporated herein by reference in its entirety. The method is similar to the method illustrated in Figure 4 of the drawings, with the action space derived from encoder outputs. The encoder output is transformed with a small feed forward network and passed through a parameterisation function to provide a distribution over a discrete or continuous action space. It is the parameters of this distribution that can be used (again via a small feed forward network) to predict the features, and thus encourage alignment of actions with the features. The distribution over the action space is then passed to the decoder transformer network.

[0066] In some embodiments, two types of conversation features may be used as side tasks: features of scammer utterances and/or of victim (bot) utterances. Types of side tasks may include the following:

[0067] (1) Recognising features of scammer (input) utterances, with side tasks run alongside the bot’s utterance encoder.

[0068] (2) K-adapter style side tasks, i.e. parallel stacked transformers fed with encoder representations at each layer, as described in Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. 2021. K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1405-1418, Online. Association for Computational Linguistics, incorporated herein by reference.

[0069] Figure 3 is a schematic representation of a method 300 of predicting features as a side task using a K-Adapter 302. The K-Adapter 302 is a stacked transformer with layer-wise inputs of signals from between Encoder layers 304 and from the final Encoder layer. The K-Adapter includes a predictor 308 (typically a fully connected network) with softmax or sigmoid to provide probabilities for predicting features.

[0070] (3) Figure 4 is a schematic representation of a method 400 of predicting input features from the output of the encoder. The output of the encoder (which is also fed to the decoder/text generator 406, as well as a memory module, etc.) is fed into an NN model 408 (a one or two layer transformer with a classifier layer, or a one or two layer fully connected network) whose output predicts the feature.

[0071] (4) A side task based on desirable/undesirable features of victim (bot) utterances.

[0072] (5) A side task that predicts features from the decoder output and/or transformer layer outputs K-adapter-style (as described above).

[0073] (6) RL training with rewards based on (or at least partially based on) completed utterances. Figure 5 is a schematic representation of a sequence-to-sequence transformer model 500. The model 500 generates text one word at a time, with each subsequent word 502 predicted based on the previous words. The model is trained by predicting each word in a training utterance given the previous words. An error is determined from the probability the model gives to the word and is “propagated back through the network” (at 504), providing updates to the model that result in the word having higher probability. When RL is applied, it adds another component to this measured error. The features that relate to conversation length may be exhibited by a whole utterance or by one or more of its words, depending on how the feature is detected. For example, an emotion detector may not indicate which words signified the emotion (in which case the feature is associated with a whole utterance) or may provide some indication of which words contributed to the measured emotion (so the feature is associated with individual words). For whole utterance features, the RL reward is applied equally to each word. For individual word features, the RL reward is applied to those words that exhibit the feature. The reward is positive for features that are associated with longer conversations, and the reward is negative when associated with shorter conversations. The small positive reward for each new scammer utterance would work the same way as features associated with the (previous) whole generated utterance.
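By way of illustration, the distinction between whole-utterance and individual-word rewards may be sketched as follows. The reward magnitude and the mechanism for flagging feature words are illustrative assumptions:

```python
def word_rewards(words, feature_words, reward=0.5):
    """Distribute an RL reward over an utterance: words flagged by a feature
    detector receive the reward directly; if the detector cannot localise the
    feature, the reward is spread equally over the whole utterance.
    The reward magnitude is illustrative only."""
    if feature_words:
        return [reward if w in feature_words else 0.0 for w in words]
    return [reward / len(words)] * len(words)
```

A feature detected with word-level attribution (e.g. specific threat words) would populate `feature_words`; a detector that only scores the whole utterance (e.g. an utterance-level emotion classifier) would leave it empty.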

[0074] For the large negative reward for ending the conversation:

[0075] 1. The negative reward is applied to all utterances with exponentially decreasing magnitude from the last one (e.g., the full negative reward is applied to the last generated utterance, half of it to the second last, a quarter to the third last, an eighth to the fourth last, etc.).

[0076] 2. In some embodiments, the negative reward is applied using a model that estimates which utterances (or even which words in which utterances) contributed to ending the conversation and by how much; the negative reward is then applied proportionally to that contribution.
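The exponentially decreasing scheme in point 1 can be expressed as a short function. This is a hedged sketch under the example given in the text (halving per step back from the last utterance); the function name is illustrative.

```python
# Sketch of the decaying end-of-conversation penalty: the full (negative)
# reward goes to the last utterance, half to the second last, a quarter to
# the third last, and so on.

def decayed_end_rewards(num_utterances, end_reward):
    """Return one reward per utterance (oldest first) for a conversation
    that has ended; end_reward is the (negative) reward for ending it."""
    return [end_reward / (2 ** k) for k in reversed(range(num_utterances))]
```

For a four-utterance conversation with an end reward of -8, this yields -1, -2, -4, -8 from oldest to newest.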

[0077] Alternatively, or in addition to the positive and negative reward method, a K-Adapter-style side task may be used for the decoder. In some embodiments, features are predicted via separate predictors fed with intermediate layers of the decoder transformer. If the transformer has 12 layers, the model includes 12 (simple) NN predictors, and the errors in their predictions are back-propagated into the transformer (for example with at least a 12 times smaller learning rate than the learning rate used for predicting words in training utterances, so that these predictors do not dominate training).

[0078] In some embodiments, further training targets may be obtained by integrating background knowledge of scammer methodologies, social engineering and the psychology of persuasion. Further knowledge of scammer methodologies and social engineering techniques to be used as training targets can be obtained by analysis of scam calls, including those available in the public domain and calls between AI bots and scammers.

[0079] In some embodiments, further training targets may be obtained through the discovery of text generation heuristics, acoustic (i.e., sound) processing and voice characteristics found to be effective for longer bot-scammer conversations.

[0080] In some embodiments the AI bot may be implemented using one or more instances of Blenderbot/2/3, GPT, and/or other Large Language Models (LLMs), fine-tuned on transcripts of videos or voice recordings made of scam baiters. In some embodiments such transcripts may be manually edited and annotated to remove sections that are not part of conversations with scammers, and/or to label utterances as either Scammer or Victim (the scam baiter is considered the victim).

[0081] Some embodiments may include functionality to automatically recognise stages in scam calls by identifying types of scams and sequences of scam stages for each type from analysis of past call transcripts, and/or recognising the type of scam and current scam stage during live scam calls. In some of these embodiments the AI bot may be trained to utilise structured information about a current stage of the scam call, providing more contextualised responses and allowing tailored responses that depend on the context. This may be done, e.g., using an implementation as described in Meta Fundamental AI Research Diplomacy Team (FAIR) et al., "Human-level play in the game of Diplomacy by combining language models with strategic reasoning", Science 378, 1067-1074 (2022), incorporated herein by reference.

[0082] Voice Cloning

[0083] For text to speech, recent advances in the field have enabled convincing speech generation that is difficult to distinguish from human speech. The text-to-speech module 208 may include one or more voices.

[0084] Voice cloning is a type of “deep fake” consisting of deep learning Al models that generate speech audio that sounds like a given person from text inputs. The person whose voice is being cloned provides recordings of their voice which are used to train the Al model. Once sufficiently trained, arbitrary text can be provided to the model, and it will “speak” the text in the person’s voice. It is further possible to make variations on the voice to change, for example, the apparent age and gender of the generated voice and modulate expressed emotion.

[0085] In some embodiments the text-to-speech module 208 is configured to interpolate between “voice personas” and to adapt the “voice personas” along specific characteristics such as age and gender. This is achieved by applying technology similar to that used for adapting images such as faces to the voice cloning function of the TTS module 208.

[0086] Recent voice cloning models such as YourTTS (described in E. Casanova et al., "YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone", arXiv:2112.02418 [cs, eess], Dec. 2021, http://arxiv.org/abs/2112.02418, incorporated herein by reference in its entirety) use a single vector to represent a voice. Techniques in style representation and transfer such as normalising flows (as described in D. J. Rezende and S. Mohamed, "Variational Inference with Normalizing Flows", arXiv:1505.05770 [cs, stat], May 2015, http://arxiv.org/abs/1505.05770, incorporated herein by reference in its entirety) and as used in StyleGAN (described in T. Karras et al., "A Style-Based Generator Architecture for Generative Adversarial Networks", arXiv, Mar. 29, 2019, http://arxiv.org/abs/1812.04948, incorporated herein by reference in its entirety) can be applied to this representation to enable effective interpolation of voice qualities. With this type of technology, a wide variety of realistic artificial voices can be obtained that smoothly transition between multiple specific voices, and characteristics such as gender and age can be smoothly varied. This is useful for victim bots to be deployable at scale, as it provides a large variety of “voice personas” so that scammers will have difficulty recognising the bots by voice alone. These features provide the advantage of voice generation that produces a believable bot voice for the rerouted call in order to convince scammers that they are talking to a real person. For this, the generated voice needs to be human-like, with believability increased by applying features such as tone modulation, pauses, disfluencies and emotions.
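Because each voice is represented by a single embedding vector, interpolation between voices reduces to interpolation between vectors. The following minimal sketch assumes plain-list embeddings and linear blending; real systems may interpolate in a learned latent space (e.g. via normalising flows), so this is illustrative only.

```python
# Sketch: linear interpolation between two voice embedding vectors to
# obtain intermediate "voice personas". t = 0 gives voice_a exactly,
# t = 1 gives voice_b exactly, and values in between blend the two.

def interpolate_voice(voice_a, voice_b, t):
    """Blend two equal-length voice embeddings with mixing factor t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return [(1.0 - t) * a + t * b for a, b in zip(voice_a, voice_b)]
```

Sweeping t over a call population gives each bot instance a distinct persona while keeping every voice on the realistic manifold between the cloned endpoints.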

[0087] The combination of the conversational AI bot 206 together with the voice cloning capability of the text-to-speech module 208 produces “victim bots” that are almost indistinguishable from actual scam victims.

Method Overview

[0088] Figure 6 of the drawings shows a flow diagram of a method of rerouting a detected scam call to a conversational artificial intelligence bot. The method 600 comprises detecting (at 602) a received scam call, and rerouting (at 604) the detected scam call to a scam call bot. The scam call bot is configured to prolong the rerouted call, extending its duration by interacting with a caller of the scam call via responses determined by the scam call bot. The responses may be determined, for example, based on identified features in the caller’s speech associated with ending and/or extending a call. In some embodiments, conversational artificial intelligence is used to interact with a scam call and derive insights into current scams from such interactions.

[0089] Rerouting scam calls

[0090] The scam calls may be detected and rerouted to initiate a SIP call with the bot in several ways:

1. From a network operator that forwards calls determined to be scams.

2. From a VoIP provider of leased telephone numbers (i.e., a telephony honeypot). VoIP providers may include dedicated VoIP services or larger telecommunications companies (e.g., OPTUS or TELSTRA in Australia), which typically also have VoIP capabilities.

3. From a smartphone app that allows users to forward scam calls to the bot.

4. From third party services or individuals that reroute scam calls to the bot. Third party services and individuals forward calls to the bot either through SIP or via assigned VoIP phone numbers.

[0091] Scam detection may be performed by the phone company, for example using one or more of the methods in Table 1:

Table 1: Scam call detection techniques by the network operator

[0092] Network operator scam call detection techniques are not foolproof. Challenges include incoming calls lacking verifiable identity information, calls that travel through multiple carriers lacking metadata, and simple heuristics used for scam detection that do not evolve as rapidly as the scam techniques.

[0093] When the network operator does not detect a scam call, the call is routed to a user phone. On-phone functionality may be provided to identify a scam call, for example one or more of the methods described in Table 2:

[0094] On-phone scam call detection techniques are not foolproof. Challenges include that the required analysis must process the incoming call in real time in order to notify the user, that scammers modify their methods to remain undetected, and that scam detection software should not interrupt non-scam calls.

[0095] When the on-phone scam call detection software does not detect a scam call, the call is put through to the user. If the user identifies the incoming call as a scam call the user is able to forward the call to the bot system via an on-phone rerouting app.

[0096] Scam alert rerouting app

[0097] A mobile phone app may be used to redirect scam calls to the bot system. In some embodiments, the mobile phone app automatically detects and reroutes received scam calls.

[0098] Figure 7 illustrates an embodiment of an on-phone scam detection and rerouting method 700. Scam calls are detected by monitoring received calls (at 702), and comparing (at 704) caller speech patterns with one or more feature databases 706. The feature databases 706 describe, for example, scam patterns or scammer strategies identified from real and/or modelled conversations between scammers and victims. If a received call is identified as a scam call (at 708), the user is notified and the call is rerouted (at 710) to the bot system 210.

[0099] In some embodiments, the mobile phone app includes functionality to listen in on the bot-scammer conversation and/or record the conversation. In some embodiments, the mobile phone app includes functionality to “scam bait” the scammer, i.e., the phone owner pretends to be a victim while the app records the conversation and sends it to the server’s data storage.

[0100] If the mobile phone app is unable to detect a scam call, and the call continues with the user, the user may realise that the call is a scam call. The mobile phone app also includes functionality allowing the user to identify the call as a scam call (at 712) and to forward the call (at 710) to the server 110 and the bot system 210.

[0101] Figure 8 illustrates an embodiment of a method of interacting with a scam call using a conversational artificial intelligence bot. The method 800 comprises receiving (at 802) a rerouted phone call identified as a scam call, the call being rerouted e.g., from a honeypot, or from third party scam detection systems. A “telephony honeypot” is a collection of phone numbers made available to calls from the public with the intention of attracting calls from scammers. The numbers may be “dirtied” in some way such as including them on unreliable or dishonest websites that collect personal information.

[0102] The method then processes (at 804) received caller speech from the rerouted phone call to determine a response. The method 800 further includes interacting (at 806) with a caller using the determined response. The response is determined in order to extend a duration of the phone call.

[0103] The processing may include identifying features in the received call speech associated with ending and/or extending a call, for example by identifying negative emotions and/or threats in the caller speech. In some embodiments the response is determined in order to maximise the duration of the phone call.

[0104] The processing of the received caller speech may comprise utilising a conversational artificial intelligence bot trained with a reinforcement learning training objective with a small positive reward for each utterance and a large negative reward when the rerouted phone call ends.

Exemplary embodiment

[0105] Figure 9 is a schematic diagram of an exemplary embodiment of a call processing system 900. In addition to a telephony endpoint 202, a speech-to-text (STT) module 204, a conversational AI bot 206, a text-to-speech (TTS) module 208, and an audio processing module 209, the system 900 includes a configuration server 1000.

[0106] At 901 audio is passed into the system 900 from the telephony endpoint 202 through an audio socket. When the audio socket is initiated, a pipeline and bot configuration is selected by the configuration server 1000.

[0107] At 903 audio is processed by the Speech To Text (STT) module 204, rendering it as text utterances. Optionally, the specific STT module and configuration for that module (e.g., which voice type to use) are specified in the configuration managed by the configuration server 1000, and may include, for example, Azure or Google STT functionality. The STT module 204 also includes a speech detector 930.

[0108] Optionally, at 904, if an utterance is received by the AI bot 206 from the STT module 204 while the AI bot 206 is processing a preceding utterance, the subsequent utterance is stored by the AI bot 206 and sent (along with any other utterances that appear during processing) to the AI bot 206 as a single text when the bot has completed processing. This may be enabled or disabled as specified in the configuration managed by the configuration server 1000. This is referred to as “Overtalk Prevention” and is described below with reference to Figure 12.

[0109] The conversation controller 990 (described in more detail elsewhere herein) is responsible for controlling the flow of the conversation by managing turn-taking, triggering the pipeline to respond to the scammer’s utterances, triggering time-wasting phrases, conversation repair phrases, and/or backchannelling (e.g. “uh-huh”, “yeah”, “okay”, etc.), interrupting the scammer, and/or detecting barge-in (i.e., when the scammer interrupts the bot).

[0110] At 905 the AI bot 206 selects and/or generates a response utterance. In some embodiments, at 905a one or more hard-coded phrases may be injected into the bot’s speech, bypassing the AI bot 206 processing at 905b. Optionally, this may be done at the beginning of the call in the form of a sequence of initial phrases (e.g. “Hello”). Phrases, phrase generation, and/or phrase selection may be specified in the configuration received from the configuration server 1000. For example, phrases may be randomly selected from a pre-set list. Optionally, this may be done during the call on a random basis in the form of time-wasting phrases such as “I’m sorry, I didn’t catch that?”. The rate at which time-wasting phrases are injected may be specified in the configuration, and may be altered in a random or pseudo-random way during the call.

[0111] Scammer utterances and injected phrases are added to the conversation history of the AI bot 206. This may be achieved by collecting utterances (for example based on an expected pattern of: <scammer-utterance>, <bot-utterance>, <scammer-utterance>, etc.) and then passing the list of utterances to the AI bot 206, which interprets them in the same way as utterances generated by the AI bot 206.

[0112] Additionally or alternatively, at 905b scammer utterances (together with any injected AI bot utterances) are passed to the AI bot 206 and the AI bot 206 generates a response utterance. The specific AI bot model used for this may be specified in the configuration.

[0113] In some embodiments, as indicated at 906, a limit may be configured for the maximum number of long sentences the AI bot 206 can say in a response. The AI bot 206 removes sentences from the end of the response when this limit is exceeded. The maximum number of long sentences and the minimum number of words for a sentence to be considered long are specified in the configuration.
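The truncation at 906 can be sketched as follows. This is an illustrative implementation under stated assumptions (sentences split on full stops; function and parameter names are invented for the example), not the patented code.

```python
# Sketch of the long-sentence limit: sentences are dropped from the end of
# the response while it contains more than `max_long` sentences of at least
# `min_words` words.

def truncate_long_sentences(response, max_long=2, min_words=12):
    """Remove sentences from the end of `response` until at most `max_long`
    sentences of `min_words` or more words remain."""
    sentences = [s.strip() for s in response.split(".") if s.strip()]

    def long_count(ss):
        return sum(1 for s in ss if len(s.split()) >= min_words)

    while sentences and long_count(sentences) > max_long:
        sentences.pop()  # remove from the end, as described at 906
    return ". ".join(sentences) + ("." if sentences else "")
```

Both `max_long` and `min_words` would come from the pipeline configuration in practice.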

[0114] Optionally, at 907, disfluencies such as “um ...” may be injected into the response utterances. Disfluencies may be randomly selected from a pre-set list and injected at a specified rate. The configuration specifies (a) whether this is enabled as well as (b) the frequency, i.e., how the specified rate is determined (e.g. using a pseudorandom timer or according to a pre-set selection).
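Disfluency injection at 907 can be sketched as below. The disfluency list and function names are illustrative assumptions; a seeded `random.Random` instance stands in for the configured pseudorandom source so the behaviour is reproducible.

```python
import random

# Example pre-set disfluency list (illustrative, not from the specification).
DISFLUENCIES = ["um...", "uh...", "err..."]

def inject_disfluencies(utterance, rate=0.1, rng=None):
    """Insert a randomly chosen disfluency before a word with
    probability `rate` (the configured injection rate)."""
    rng = rng or random.Random()
    out = []
    for word in utterance.split():
        if rng.random() < rate:
            out.append(rng.choice(DISFLUENCIES))
        out.append(word)
    return " ".join(out)
```

With `rate=0` the utterance passes through unchanged; the configuration would enable the feature and set the rate per call.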

[0115] At 908 a response utterance is processed by the Text To Speech (TTS) module 208, which returns speech audio. The response utterance is processed to include Speech Synthesis Markup Language (SSML), which allows specification of the speed, pitch, volume and speech styles (e.g. emotions) of the voice to be generated. The TTS module may include one or more functional blocks provided by, for example, Azure, Google, and/or custom voice cloning TTS software. Settings for the TTS module 208 may be specified in the configuration; for example, the configuration may specify which voice to use, SSML options, etc.

[0116] In the case that a new scammer utterance occurs during response utterance processing, discarding the response utterance is managed by the Response Controller 909, which also edits the AI bot conversation history to remove the discarded response and to reflect any changes made at 906 and/or 907. This may be enabled or disabled as specified in the configuration.

[0117] Optionally, at 910, speech audio may be merged with background audio. Background audio may be selected from a pre-set collection of audio files. In some embodiments, the specific background audio stream and relative volume of speech and background audio are specified in the configuration.

[0118] In some embodiments, at 911, audio effects may be applied to the merged voice and background audio, and the resulting audio stream is passed to the telephony endpoint. The specific audio effects and their parameters may be specified in the configuration.

[0119] Conversation Controller 990

[0120] One exemplary embodiment of a conversation controller 990, configured to control the flow of the conversation, is now described.

[0121] The conversation controller 990 receives the following inputs: STT partial utterances, STT final utterances, and Voice Activity Detection (VAD) results. In some embodiments, the conversation controller 990 may additionally or alternatively receive call audio and/or use multimodal AI to help with turn-taking. In some embodiments, the conversation controller 990 has access to the bot’s transcript and state in order to interrupt the bot, and/or the ability to determine when the bot is talking and the scammer should be listening, and/or when there is barge-in from the scammer.

[0122] The conversation controller 990 triggers one or more of the following actions: time-wasting phrases, conversation repair phrases, backchannelling phrases, the AI bot generating a response, and interrupting the bot.

[0123] Backchannelling is a way to show that you are listening to the speaker. It involves small phrases like “uh-huh”, “yeah”, “okay”, etc. This can also be used to fill silence in a conversation, and before starting your turn in the conversation to show that you are thinking. Backchannelling makes the task of barge-in detection more difficult, as backchannelling can be detected as speech but does not signal an intent to interrupt.

[0124] Time-wasting phrases are phrases said by the bot to waste the scammer’s time. They are pre-defined phrases such as “I’m sorry, I didn’t quite catch that. Could you repeat that?” or “I need to sit down, can you wait a moment please”. These are injected randomly into the conversation whenever the bot has its turn to speak. The same phrase is preferably not repeated more than once per call.

[0125] Similar to time-wasting phrases, conversation repair phrases are injected into the conversation and said by the bot. However, these are not randomly injected. Instead, they are added when Speech-To-Text is slow or broken, and include phrases such as “What was that? I didn’t quite catch it?” or “Sorry, it’s noisy here, can you repeat that?”. These phrases are exclusively phrases that show the bot not hearing the scammer and asking the scammer to repeat. Time-wasting phrases can include phrases like this but are not limited to them.

[0126] The Inter-Pausal Unit (IPU) is a unit of speech delimited by a certain time reference, for example 200 ms. The IPU threshold is used to break speech into portions, while a ‘No Words’ threshold and a ‘No Final’ threshold are used to detect when the STT is not functioning correctly, in order to cause the bot to respond to partial STT utterances or to use conversation repair phrases.

[0127] The IPU can be used to break up the speech into portions to perform user-state detection or other tasks and can be the first trigger for backchannel responses.

[0128] The No Words threshold and the No Final threshold are used to detect when the VAD and the STT do not agree on the end of an utterance (or when the STT is not functioning correctly). The No Words threshold is the first threshold to be reached. It may require between 2 and 6 seconds of silence, for example about 3.5 s of silence from the VAD, with no partial response from the STT. This is an indication that the STT is not functioning correctly, as it would normally have at least a partial response after such a time interval. This triggers a conversation repair phrase to be said by the bot to the scammer.

[0129] The No Final threshold requires a time interval of 3-8 seconds, for example 5 s, of silence from the VAD and the STT to be activated. This is an indication that the STT is functioning slowly or incorrectly. If the STT has no words yet, then the No Words threshold would have been triggered and conversational repair would have started. If there are words from the STT in the form of a partial response, then this threshold triggers the pipeline to respond using the partial text.

[0130] A turn-switching time interval of 1-5 seconds, for example 3 s, may be used to force turn-switching even when the user-state detection shows that the speaker wants to keep their turn. In some embodiments a default 5 s interval may be used because the partial responses can take up to 3 s to return from the STT. The goal of a 5 s delay is to allow the STT to finish if it is working. This can be configured with the configuration server.
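The No Words / No Final logic can be sketched as a small decision function. The constants follow the example values in the text (3.5 s and 5 s); the function name and the returned action labels are illustrative assumptions.

```python
# Sketch of the silence-threshold logic used by the conversation controller.
NO_WORDS_S = 3.5  # silence with no partial STT result -> repair phrase
NO_FINAL_S = 5.0  # silence with only a partial result -> respond to partial

def silence_action(silence_s, partial_text):
    """Decide what to do after `silence_s` seconds of VAD silence, given
    the current partial STT text ("" when the STT has produced nothing)."""
    if not partial_text and silence_s >= NO_WORDS_S:
        return "conversation_repair"   # STT appears broken: ask to repeat
    if partial_text and silence_s >= NO_FINAL_S:
        return "respond_to_partial"    # STT appears slow: use partial text
    return "wait"
```

In a deployment these thresholds would come from the configuration server rather than module constants.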

[0131] State Transitions 1700 of the conversation controller 990 may be understood with reference to Figure 17 of the drawings and Table 3 below. Conversation states correspond to who is talking and who is listening. “Bot Responding” refers to the AI bot generating a response to the scammer’s utterance. When the response is finished generating, it is sent to the TTS module. When the TTS module finishes generating the speech audio, the duration of this speech is calculated and the state is set to “Bot Talking” for this duration. The audio is then sent to the audio mixer to be mixed and sent over the phone line.

[0132] The turn taking algorithm may include one or more goals, such as minimising silence and/or minimising talking over one another.

[0133] The states labelled in Figure 17 as “BAD” 1702, 1704 are times when the bot and scammer are talking at the same time. This is an indication of a false positive turn-taking detection, or of the scammer interrupting the bot. In some embodiments, the AI bot does not take the scammer interrupting the bot into consideration, and in most embodiments the bot will not intentionally talk over the scammer, hence the state being labelled as “BAD”, i.e., not intended. In some embodiments, these states may trigger the AI bot stopping in the middle of an utterance in response to barge-in from the scammer. The AI bot is configured to do one or more of: (1) detect barge-in, (2) assess whether the scammer intends to interrupt or not, (3) based on (1) and/or (2), stop the AI bot talking mid-utterance in response to scammer barge-in, and (4) update the AI bot conversation history with the barge-in interruption information.

[0134] The AI bot may be configured to detect a scammer’s intention to interrupt in various ways, and this functionality may include one or more of the following: (i) using the state model shown in Figure 17 and described in Table 3, (ii) ignoring false positives (e.g. a backchannel or background noise from the scammer would not be an intent to interrupt the AI bot), (iii) use of a language model (i.e., another AI model which may be incorporated in or separate from the bot) that takes the conversation history and parameters associated with a current utterance of the scammer and detects whether the intent is to interrupt the bot or not.

[0135] States are logged in a transcripter.py object using the log_state(state) function. The states are logged with the current timestamp. The duration of these states can be inferred by looking at the timestamps of the next state.
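The state logging described for the transcripter.py object can be sketched as below. The internals of the real object are not given in the specification, so this is an illustrative reconstruction of the described behaviour: each state is logged with the current timestamp, and durations are inferred from the timestamp of the next state.

```python
import time

class StateLog:
    """Minimal sketch of timestamped conversation-state logging."""

    def __init__(self):
        self.entries = []  # list of (timestamp, state)

    def log_state(self, state, now=None):
        # `now` allows deterministic timestamps in tests; the real object
        # would use the current wall-clock time.
        self.entries.append((now if now is not None else time.time(), state))

    def durations(self):
        """Duration of each state, inferred from the next state's timestamp."""
        return [(s, t2 - t1)
                for (t1, s), (t2, _) in zip(self.entries, self.entries[1:])]
```

The last logged state has no successor, so its duration is undefined until the next `log_state` call, mirroring the inference described above.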

[0136] Table 3 Conversation State Table

(N/A: the bot cannot respond and talk at the same time)

[0137] The transition 1720 from Silence to Bot Talking occurs once per call. The transitions 1722 from Silence to Scammer Talking, from Bot Responding to Scammer Overtalking, and from Bot Talking to Scammer Overtalking are caused by the scammer talking. The transitions 1724 from Scammer Talking to Bot Responding, and from the scammer overtalking the bot’s response to the scammer overtalking the bot talking, both skip a phase. The transition 1726 from the scammer overtalking a bot response to Bot Talking is an unlikely transition.

[0138] Figure 10 is a flow diagram illustrating the operation of the configuration server 1000 that forms part of the call processing system 900.

[0139] The configuration server 1000 maintains a looped list 1002 of pipeline configurations. Configurations can be added or removed from the list via a REST interface 1004. A web front end may be provided to view and edit the configuration queue, communicating with the configuration server via the REST interface 1004.

[0140] Configurations can specify random selection, pseudo-random selection, or specific configuration patterns selected from pre-set lists of values for some features, such as TTS voice and background audio.

[0141] As illustrated at 902 in Figure 9, when a new call is received, a pipeline configuration is requested from the configuration server. As illustrated in Figure 10, after a configuration request is received at 1006, a configuration is selected by checking the configuration list 1008 at 1010. If it is determined at 1012 that there are configurations in the list, then at 1014 the next configuration is selected. Selection continues from the beginning of the list after the last configuration in the list has been selected. Alternatively, if at 1012 it is determined that the list is empty, a default configuration is selected at 1016. At 1018 configuration elements are propagated to the respective pipeline modules, including the TTS module 208, the STT module 204, the mixer in the audio module 209, and the AI bot 206. At 1020 call processing with the pipeline is initiated.
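The looped selection in Figure 10 can be sketched as a round-robin server. The class and method names are illustrative; the real configuration server exposes this behaviour over a REST interface.

```python
# Sketch of the looped configuration list: configurations are handed out
# round-robin (looping back to the start after the last one), and a default
# configuration is returned when the list is empty (step 1016).

class ConfigServer:
    def __init__(self, configs=None, default=None):
        self.configs = list(configs or [])
        self.default = default or {"name": "default"}
        self._next = 0

    def select(self):
        if not self.configs:
            return self.default                 # empty list -> default
        config = self.configs[self._next]       # next configuration (1014)
        self._next = (self._next + 1) % len(self.configs)  # loop to start
        return config
```

Adding or removing entries through the REST interface would simply mutate `self.configs` between calls.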

[0142] Figure 11 illustrates an exemplary embodiment of the audio processing module 209. The audio processing module performs steps 910 and 911 in Figure 9. The module 209 is configured to merge audio from multiple sources 1100, including voice audio 1102 from the TTS module 208, one or more background audio sources 1104, and/or any other audio sources 1106 specified in the configuration.

[0143] The audio processing module 209 continuously merges audio from one or more of the audio sources. The TTS module 208 intermittently produces voice audio, and in some embodiments when this intermittent voice audio is received the amplitude (volume) of the voice audio may be adjusted at 1110, with the adjustment level specified in the configuration. When no speech audio is available, the audio merger 1112 forwards only the background audio.

[0144] In some embodiments the merged audio may be passed through one or more audio filters 1114 (such as low-pass, high-pass, and/or band-pass filters), typically applied in sequence. The filter parameters are specified in the configuration. Processed audio is then passed to the next module, i.e., the telephony endpoint.
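The merger 1112 can be sketched with plain sample lists standing in for audio blocks. This is illustrative only: real audio would arrive as PCM buffers, and the gain would come from the configuration.

```python
# Sketch of the audio merger: voice samples are scaled by a configured gain
# and mixed with background audio; when no voice block is available (between
# utterances), only the background audio is forwarded.

def merge_audio(voice, background, voice_gain=1.0):
    """Mix equal-length sample blocks; `voice` may be None between utterances."""
    if voice is None:
        return list(background)              # background only
    return [voice_gain * v + b for v, b in zip(voice, background)]
```

A subsequent filter stage would then apply the configured low-, high-, or band-pass filters to the mixed block before it reaches the telephony endpoint.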

[0145] Figure 12 illustrates an overtalk module 1200 used to gather utterances to prevent interruption at 904 in Figure 9. After receiving speech audio from the telephony endpoint, the speech audio is converted to text in the STT module 204, producing an utterance 1202. Overtalk prevention 904 is then executed by the overtalk module 1200 based on a processing lock status retrieved from the configuration provided by the configuration server 1000.

[0146] If the processing lock is activated, then the utterance 1202 is added to an utterance queue. If the processing lock is not activated when an utterance 1202 is received, then the processing lock is activated at 1206 and the utterance 1202 is merged with the utterance queue and, at 1210, passed to the AI bot for utterance generation.

[0147] On completion of audio processing of the response speech audio at 1212, the processing lock is de-activated at 1208. If the utterance queue is not empty, then the utterances in the queue are merged and passed on to the AI bot for utterance generation.
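The lock-and-queue behaviour of Figure 12 can be sketched as follows. The class name, the callable `bot` interface and the space-joined merge are assumptions for the example; the real module retrieves the lock status via the configuration server.

```python
# Sketch of overtalk prevention: while the bot is busy (lock held), incoming
# utterances are queued; when processing completes, queued utterances are
# merged and handed to the bot as a single text.

class OvertalkGuard:
    def __init__(self, bot):
        self.bot = bot        # callable: merged text -> response text
        self.locked = False
        self.queue = []

    def on_utterance(self, text):
        if self.locked:
            self.queue.append(text)          # bot busy: queue the utterance
            return None
        self.locked = True                   # 1206: take the lock
        merged = " ".join(self.queue + [text])
        self.queue.clear()
        return self.bot(merged)

    def on_audio_done(self):
        self.locked = False                  # 1208: release the lock
        if self.queue:                       # flush anything queued meanwhile
            merged = " ".join(self.queue)
            self.queue.clear()
            self.locked = True
            return self.bot(merged)
        return None
```

This matches the described flow: the bot never processes two overlapping utterances, and nothing the scammer says while the bot is busy is lost.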

[0148] Some embodiments support the generation of outbound calls to scammers.

[0149] Figure 13 illustrates the process for outbound call generation. Outbound calls are initiated from the Asterisk Command Line Interface (CLI). The Asterisk server has a dial plan 1302 with two phone extensions 1304, 1306 used for outbound calls, which are configured as follows:

Extension “12345” (1304) creates an audio socket to connect to the call processing pipeline (e.g. the pipeline as illustrated in Figure 9).

Extension “5402” (1306) sets the Caller ID, and then dials a scammer’s number.

To initiate an outbound call, a channel 1310 is created between these two extensions, which results in a call between the pipeline 1320 of the call processing system 900 and the phone of the scammer 1322. In this way, outbound calls to known scam phone numbers (from “callback” scams, phishing websites or other sources) are possible.

Exemplary embodiment

[0150] Figure 14 shows another exemplary embodiment of a call processing system 1400. In this embodiment, deployment of the system is accomplished with two virtual machines (VM) 1402, 1404 and two cloud services 1406, 1408. These may include, for example, one or more of:

- Amazon Web Services (AWS), which can be used for a public-facing VM running the Asterisk server (supporting a telephony endpoint).

- RONIN, which is an example of a managed AWS environment that can be used for the pipeline and bot VMs. RONIN VMs are accessible from the wider Internet via SSH connections, hence an SSH tunnel may be used for audio socket connections between the Asterisk server and the pipeline VM. SSH is not required for connections between the pipeline and bot VMs, as both are inside RONIN.

- Azure cloud services, which can be used for TTS and STT functionalities.

[0151] In the example embodiment, the Asterisk server 1410 (providing the telephony endpoint) runs on an AWS VM. A phone call is forwarded to the Asterisk server via SIP (Session Initiation Protocol). The Asterisk server then creates a unique ID (UUID) for the call, which is propagated to the pipeline VM via the ID of the audio socket and used as a label on all files and metadata associated with the call. The Asterisk server creates an audio socket 1412 to the pipeline VM, and then forwards call audio to the socket and stores call audio in a file on the VM. In some embodiments, the audio socket may pass through an SSH tunnel 1414 to the pipeline VM.

[0152] Pipeline VM processing may be understood with reference to the steps indicated in Figure 14 of the drawings. At 1441 the audio socket is attached to an Asterisk client Docker container 1420 in the pipeline VM and audio streaming commences. At this point, a request is sent to the configuration server 1000 and the returned configuration is propagated to all elements and stored in a database 1422 (such as MongoDB).

[0153] At 1442 the asterisk client 1420 passes the audio stream to the STT service (via the STT module, not shown in Figure 14). The STT service returns transcribed speech as text portions that are partial or complete sentences. Some embodiments may use Azure STT, which returns a flag stating that the scammer has finished speaking to indicate the end of an utterance. Text is gathered from the STT service until this flag is observed, at which time the gathered text is passed back to the pipeline.
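The gather-until-flag behaviour at 1442 can be sketched as follows. The `(text, is_final)` pair interface is hypothetical, standing in for an STT service's partial results and end-of-utterance flag.

```python
def accumulate_utterance(stt_results):
    """Gather partial STT text until an end-of-utterance flag is observed.

    `stt_results` yields (text, is_final) pairs, mirroring an STT service
    that flags when the speaker has finished. Returns the gathered
    utterance, or None if the stream ends without the flag.
    """
    parts = []
    for text, is_final in stt_results:
        if text:
            parts.append(text)
        if is_final:
            return " ".join(parts)
    return None

utterance = accumulate_utterance([
    ("hello this is", False),
    ("the fraud department", False),
    ("", True),  # end-of-utterance flag with no additional text
])
# utterance == "hello this is the fraud department"
```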

[0154] At 1443 accumulated text portions are passed to the AI bot, which determines a response utterance (also in text). Text accumulation processes are described elsewhere herein with respect to process 904 depicted in Figure 9 of the drawings. In some instances, the AI bot may be bypassed (for example when one or more hard coded phrases are injected into the bot’s speech as described with reference to 905a in Figure 9). The accumulated text portions are considered a scammer utterance and are stored to a call transcript log taking the form of a text file.
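The response determination at 1443, including the bypass path, can be sketched as below. The `bot` argument is any callable mapping text to text (the AI bot is assumed to expose such an interface), and the list-based transcript stands in for the call transcript text file; both are illustrative.

```python
def determine_response(scammer_utterance, bot, injected_phrase=None, transcript=None):
    """Pick the next response: a hard-coded injected phrase bypasses the
    AI bot; otherwise the bot generates a reply from the scammer text.
    Both utterances are appended to the transcript log."""
    if transcript is not None:
        transcript.append(("scammer", scammer_utterance))
    if injected_phrase is not None:
        response = injected_phrase          # AI bot bypassed
    else:
        response = bot(scammer_utterance)   # normal AI bot path
    if transcript is not None:
        transcript.append(("bot", response))
    return response

log = []
toy_bot = lambda text: "Sorry, could you repeat that?"  # toy stand-in bot
r1 = determine_response("Your account is compromised", toy_bot, transcript=log)
r2 = determine_response("Install this app now", toy_bot,
                        injected_phrase="Hold on, someone is at the door.",
                        transcript=log)
```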

[0155] At 1444, the response utterance is passed to the TTS service (via the TTS module, not shown in Figure 14). Post-processing of utterances such as injecting disfluencies, sentence truncation and/or SSML markup may also be performed before passing the utterances to the TTS module. The final response utterance after post-processing is stored in the call transcript log as an AI bot utterance. A version of the AI bot utterance including any SSML markup is also stored.

[0156] At 1445 the speech audio is passed to the audio mixer 1424 which optionally (a) combines it with, for example, background audio 1426, and/or (b) applies audio filters or other effects 1428.
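The post-processing step at 1444 can be sketched as below. This is a minimal illustration under stated assumptions: a single leading disfluency and a bare SSML wrapper; real embodiments may also truncate sentences or apply richer prosody markup.

```python
import random

def postprocess_utterance(text, rng=None, disfluencies=("um,", "uh,")):
    """Inject a leading disfluency and wrap the result in minimal SSML.

    Returns (plain_text, ssml_text) so that both the plain utterance and
    the SSML-marked version can be stored in the transcript log.
    """
    rng = rng or random.Random(0)
    plain = f"{rng.choice(disfluencies)} {text}"
    ssml = (
        '<speak version="1.0" xml:lang="en-US">'
        f'{plain}<break time="300ms"/></speak>'
    )
    return plain, ssml

plain, ssml = postprocess_utterance("I am not sure I understand.",
                                    rng=random.Random(1))
```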

[0157] At 1446 the merged audio is then passed back through the audio socket to the asterisk server 1410.

[0158] In addition to the labelled steps described above, in some embodiments additional data is entered into the database 1422 (in this embodiment provided by a MongoDB instance). A process on the asterisk VM monitors call recordings and passes call metadata 1430 of any new call recordings that appear (which happens with every call) to the database 1422 through the SSH tunnel 1414. This process may also store the call audio itself, for example in an AWS S3 storage facility. When a call is initiated, configuration metadata alongside call ID and time may also be stored in the database 1422.
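The metadata document built by the monitoring process might look as follows; the field names are illustrative, and a real deployment would add asterisk-derived metadata and the S3 location of the stored call audio.

```python
import datetime
import pathlib

def call_metadata(recording_path, config=None):
    """Build a metadata document for a newly observed call recording,
    as might be inserted into the database through the SSH tunnel."""
    path = pathlib.Path(recording_path)
    return {
        "call_id": path.stem,  # the call UUID labels the recording file
        "recording": path.name,
        "observed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "config": config or {},
    }

doc = call_metadata("/var/spool/asterisk/2f1b3c.wav", config={"voice": "en-AU"})
```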

[0159] Figure 15 shows a docker deployment of a call processing system, for example like the system described with reference to Figure 14. An Asterisk server 1501 runs on an AWS VM 1504, and is responsible for recording audio at 1506 and providing a call audio socket at 1508. In this embodiment the AI bot 206 is deployed in a separate GPU-equipped VM 1510. In some embodiments a relatively simple deployment may be achieved using a single GPU-equipped VM for each bot and pipeline on which all docker containers are housed. In some embodiments this type of implementation simplifies the automated deployment of load balancing (as described elsewhere herein).

[0160] The custom docker containers in the Pipeline VM 1520 may be understood with reference to Figure 15 of the drawings. An Asterisk client container 1502, a configuration server container 1530, an STT container 1532, and an audio mixer container 1534 are provided. A database (e.g. MongoDB) docker container 1536 is provided. In addition, a docker volume 1538 provides a folder on the VM accessible to the asterisk-client docker container 1502 for storing logs and call transcripts. A docker volume 1540 contains audio files, for example used as background audio.

[0161] The Asterisk client container 1502 contains various modules and supports various capabilities. The Asterisk Client 1502 controls the flow of the call, connecting to the audio socket 1508 from the telecommunications endpoint, passing data to the input of each module, and then passing the returned data to the input of the next module in the pipeline before finally passing the processed audio data back through the audio socket 1508. The Asterisk Client 1502 manages retrieval of configuration data from the configuration server 1530 and distributes it to other pipeline modules. The Asterisk Client 1502 is responsible for a number of functions including one or more of: Overtalk Prevention (see 904 in Figure 9), the “AutoResponse” 1550 functionality that enables inserting initial phrases and time wasting phrases into the conversation (see 905b in Figure 9), and the Text To Speech (TTS) 1552 that receives text from the pipeline and passes it to an external TTS server before passing on the returned audio data to the audio processing module.
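The module-to-module data flow managed by the Asterisk Client can be sketched generically. The toy stages below merely stand in for the real STT, AI bot, TTS and audio processing modules; only the chaining pattern is of interest.

```python
def run_pipeline(audio_in, modules):
    """Pass data through each pipeline module in order, feeding each
    module's output to the input of the next, as the Asterisk client
    does with the STT, bot, TTS and audio processing stages."""
    data = audio_in
    for module in modules:
        data = module(data)
    return data

# Toy stages standing in for STT -> bot -> TTS -> mixer:
stages = [
    lambda audio: audio.decode(),   # "STT": audio bytes -> text
    lambda text: text.upper(),      # "bot": text -> response text
    lambda text: text.encode(),     # "TTS": text -> audio bytes
    lambda audio: b"<bg>" + audio,  # "mixer": merge background audio
]
out = run_pipeline(b"hello", stages)
# out == b"<bg>HELLO"
```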

[0162] The STT container 1532 houses the STT module that connects to or implements an STT service.

[0163] The configuration server container 1530 houses the configuration module, and implements a configuration queue and/or a default configuration when a configuration is not set. The configuration server container 1530 provides a web interface for viewing and editing the configuration queue.

[0164] The Audio Mixer container 1534 houses the audio processing module and continuously produces background audio, merges AI bot response voice audio when available, and/or applies audio filters.
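The merging operation can be sketched as sample-wise addition with clipping to the 16-bit range of SLIN audio. This is a minimal stand-in for the audio mixer; filters and other effects are not modelled here.

```python
import array

def mix(background, voice, voice_gain=1.0):
    """Merge bot voice samples into background audio by sample-wise
    addition, clipping the result to the signed 16-bit range."""
    out = array.array("h", background)
    for i, sample in enumerate(voice):
        if i >= len(out):
            break
        mixed = out[i] + int(sample * voice_gain)
        out[i] = max(-32768, min(32767, mixed))  # clip to int16
    return out

bg = array.array("h", [100, 100, 100, 100])   # quiet background
vo = array.array("h", [32760, -200])          # short voice burst
merged = mix(bg, vo)
# merged[0] clipped to 32767; merged[1] == -100; tail unchanged
```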

[0165] The database (MongoDB) container houses a database (e.g. MongoDB) instance that stores configurations of past calls, asterisk derived metadata on past calls, metadata on available background audio, metadata on available TTS voices, and/or logs of exceptions that occur during pipeline operation.

[0166] The AI bot container (bot-parlai-gpu) 1560 houses the AI bot which converses with the scammer using text input and output. This is situated on a VM 1510 equipped with a fast GPU (graphics processing unit) or other machine learning acceleration hardware.

[0167] Figure 16 is a schematic representation of a load balancing module 1600 that forms part of the system of Figure 14. This figure depicts the plan for automated deployment, where new pipeline VMs are created when multiple simultaneous calls are received, and VMs are destroyed when demand again drops.

[0168] In some embodiments the pipeline and bot may occupy a single VM, though similar deployment would be possible with separate VMs. The load balancing implementation of this embodiment maintains a small number of idle VMs at all times so as to be able to accept new calls without delays. Metadata and call transcript recording, as well as logging, use a centralised database (e.g., MongoDB) with which all pipelines are able to communicate.

[0169] The Asterisk server 1602 is configured to receive calls from a telecommunications provider, and to connect audio sockets to asterisk clients running on pipeline VMs as directed by the Load Balancer 1604 (in this exemplary embodiment implemented using nginx LB). In some embodiments, the described load balancing configuration uses a scaled infrastructure to handle large call volumes. The Load Balancer 1604 selects an available pipeline instance to which to connect new incoming calls, and maintains a list of pipeline instances and/or a list of active calls. The Load Balancer 1604 is equipped with an automated healthcheck and/or a status web page.
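The instance selection performed by the Load Balancer 1604 can be sketched as below. The least-loaded policy is illustrative only (nginx offers its own balancing strategies), and the instance-map interface is hypothetical.

```python
def select_pipeline(instances):
    """Select an available pipeline instance for a new incoming call,
    preferring the idle instance that has handled the fewest calls.
    `instances` maps instance name -> {"idle": bool, "calls": int}.
    Returns None if no idle instance is available."""
    idle = {name: info for name, info in instances.items() if info["idle"]}
    if not idle:
        return None
    return min(idle, key=lambda name: idle[name]["calls"])

pool = {
    "pipeline-1": {"idle": False, "calls": 3},  # busy with an active call
    "pipeline-2": {"idle": True, "calls": 5},
    "pipeline-3": {"idle": True, "calls": 1},
}
chosen = select_pipeline(pool)
# chosen == "pipeline-3"
```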

The monitor 1606 checks the status of the system (continuously, or periodically on a preset, selected, and/or variable basis), querying the Load Balancer 1604 about the number of idle and/or busy pipeline instances. The monitor 1606 manages the creation and/or destruction of VM instances on the AWS 1608, synchronising the current list of instances with the Load Balancer 1604.
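The monitor's scaling decision can be sketched as follows: keep a small idle pool available so that new calls are accepted without delay, creating VMs when the pool shrinks and destroying them when demand drops. The `target_idle` parameter is illustrative, not part of the described embodiment.

```python
def scaling_action(idle, busy, target_idle=2):
    """Decide how many pipeline VMs to create (positive result) or
    destroy (negative result) to restore the target idle pool size.
    `busy` is accepted for completeness; only the idle count drives
    this simple policy."""
    return target_idle - idle

# Two new calls consumed the idle pool: create two replacement VMs.
delta = scaling_action(idle=0, busy=5)
# delta == 2
```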

[0170] The methods and systems described herein present a novel approach to defeating phone scam operators by breaking their business model and making their operations unprofitable. This is achieved through the implementation of conversational AI bots that present as convincing, potentially viable scam victims. The bots, deployed at scale, take up a great proportion of scammers’ time and significantly reduce their profit margins. Further, traces from these conversations provide valuable information on scam targets (the organisations the scammers pretend to be representatives of), scammers and current scammer strategies that is otherwise very difficult to obtain.

[0171] Advantageously, the proposed methods may be used to reduce the occurrence of vishing and other phone-based scams, may be used as a source of information on the scam landscape, and are readily complementary to existing approaches to scam detection.

[0172] It will be understood to persons skilled in the art of the invention that many modifications may be made without departing from the spirit and scope of the invention.