

Title:
CONTEXT DETECTION
Document Type and Number:
WIPO Patent Application WO/2021/118462
Kind Code:
A1
Abstract:
A method of identifying a semantic context of an utterance for use with an automatic conversational dialogue system (205) is disclosed herein. In a described embodiment, the method comprises: identifying a first context (S303); receiving a current utterance from the user; classifying the current utterance as consistent with the first context or inconsistent with the first context (S3011), and when the current utterance is classified as being inconsistent with the first context, identifying a second context associated with the current utterance. A method of conducting an automatic conversational dialogue with a user is also disclosed, among other aspects.

Inventors:
PAUL A AVINASH (SG)
JAIN SAURABH (SG)
VERMA ABHISHEK (SG)
MOURYA ANSHUMAN (SG)
NELLORE SIVA SAI (SG)
Application Number:
PCT/SG2020/050726
Publication Date:
June 17, 2021
Filing Date:
December 08, 2020
Assignee:
ACTIVE INTELLIGENCE PTE LTD (SG)
International Classes:
G10L15/183; G10L15/16; G10L15/26
Domestic Patent References:
WO2019119916A1 2019-06-27
Foreign References:
US20090018829A1 2009-01-15
US20180261223A1 2018-09-13
US20150279366A1 2015-10-01
Attorney, Agent or Firm:
POH, Chee Kian, Daniel (SG)
Claims:

1. A method of identifying a semantic context of an utterance for use with an automatic conversational dialogue system, the method comprising: identifying a first context; receiving a current utterance from the user; classifying the current utterance as consistent with the first context or inconsistent with the first context, and when the current utterance is classified as being inconsistent with the first context, identifying a second context associated with the current utterance.

2. A method of identifying a semantic context of an utterance according to claim 1, the first context being associated with a primary utterance, the primary utterance being received prior to the current utterance.

3. A method of identifying a semantic context of an utterance according to claim 1 or 2, the method further comprising: masking the current utterance by identifying named entities in the current utterance and replacing each identified named entity with a respective tag to obtain a masked current utterance, and wherein classifying the current utterance as consistent with the first context or inconsistent with the first context comprises classifying the masked current utterance as consistent with the first context or inconsistent with the first context.

4. A method of identifying a semantic context of an utterance according to claim 3, wherein the respective tag comprises a name of a category to which the relevant named entity belongs.

5. A method of identifying a semantic context of an utterance according to any of the preceding claims, wherein classifying the current utterance as consistent with the first context or inconsistent with the first context comprises determining vector representations of the first context and the current utterance; determining a function that merges the vector representations of the first context and the current utterance; and determining a probability distribution over context change and context entailment for the function.

6. A method of identifying a semantic context of an utterance according to claim 5, wherein the function prepends the vector representation of the first context to the vector representation of the current utterance.

7. A method of identifying a semantic context of an utterance according to claim 5 or 6, wherein determining a probability distribution over context change and context entailment for the function comprises employing a first neural model.

8. A method of identifying a semantic context of an utterance according to claim 7, wherein the first neural model comprises one or more long short-term memory (LSTM) layers.

9. A method of identifying a semantic context of an utterance according to claim 7 or 8, wherein the first neural model comprises bidirectional layers.

10. A method of identifying a semantic context of an utterance according to claim 8 or 9, wherein the first neural model comprises bidirectional long short-term memory (BiLSTM) layers.

11. A method of identifying a semantic context of an utterance according to claim 7 or 8 wherein the first neural model comprises a convolutional layer.

12. A method of identifying a semantic context of an utterance according to any of claims 7 to 11 wherein an attention function is applied to an output of the first neural model.

13. A method of identifying a semantic context of an utterance according to claim 7 wherein the first neural model has a transformer architecture.

14. A method of identifying a semantic context of an utterance according to any one of claims 3 to 13, wherein identifying named entities in the current utterance comprises: obtaining a character-level embedding for each word in the current utterance; obtaining a word-level embedding for each word in the current utterance based on the respective character-level embedding; and determining a probability distribution over a plurality of tags for each word in the current utterance based on the respective word-level embedding.

15. A method of identifying a semantic context of an utterance according to claim 14, wherein obtaining a character-level embedding comprises employing a second neural model with one or more long short-term memory (LSTM) layers.

16. A method of identifying a semantic context of an utterance according to claim 15, wherein the second neural model comprises bidirectional long short-term memory (BiLSTM) layers.

17. A method of identifying a semantic context of an utterance according to any one of claims 14 to 16, wherein obtaining a word-level embedding comprises employing a third neural model with one or more long short-term memory (LSTM) layers.

18. A method of identifying a semantic context of an utterance according to claim 17, wherein the third neural model comprises bidirectional long short-term memory (BiLSTM) layers.

19. A method of conducting an automatic conversational dialogue with a user, the method comprising: performing a method of identifying a semantic context of an utterance of any of claims 1 to 18; following identification of the first context, setting the first context as a current context; when the current utterance is classified as inconsistent with the first context, setting the second context as the current context, otherwise making no change to the current context; and outputting a response to the current utterance based on the current context.

20. A method of conducting an automatic conversational dialogue with a user according to claim 19, further comprising: receiving a primary utterance from the user and identifying the first context based on the primary utterance; and outputting a response to the primary utterance based on the current context prior to receiving the current utterance.

21. A method of conducting an automatic conversational dialogue with a user according to claim 19 or 20, wherein outputting responses to the primary and current utterances based on the current context comprises outputting text requesting predetermined information based on the current context.

22. A system for identifying the semantic context of an utterance, the system comprising: an input for receiving a current utterance from a user; and a processor configured to: identify a first context; classify the current utterance as consistent with the first context or inconsistent with the first context, and when the current utterance is classified as being inconsistent with the first context, identify a second context associated with the current utterance.

23. An automatic conversational dialogue system, the system comprising: the system for identifying the semantic context of an utterance according to claim 22, the processor being further configured to: set the first context as a current context, following identification of the first context, when the current utterance is classified as inconsistent with the first context, set the second context as the current context, otherwise making no change to the current context, and determine a response to the current utterance based on the current context; and an output for outputting the response to the current utterance.

24. A carrier medium comprising computer readable code configured to cause a processor to perform the method of any of claims 1 to 21.

Description:
CONTEXT DETECTION

Field and Background

The present invention relates to the field of automated conversational dialogue, particularly to handling semantic context change during a conversation with a user.

Conversational dialogue systems, or "chatbots" as they are sometimes known, provide an automated response to user queries. Typical uses of such systems include customer service provision. In general, automatic conversational dialogue systems are configured to retrieve an answer to a query. Existing automatic dialogue systems may take a number of forms: they may be highly complex, such as those based on artificial intelligence, or may comprise systems which perform keyword searches of a database stored in memory. Automatic conversational dialogue systems are often accessed via an Application Programming Interface (API) such as that provided by a bank or financial institution. Automatic conversational dialogue systems may be domain specific, that is, designed to respond to queries relating to a specific field, for example banking products if the dialogue system is hosted by a financial institution. Some sophisticated conversational dialogue systems employ natural language processing in combination with artificial intelligence, whereas others may simply perform a database search based on identified keywords in the user's query. Typically, such systems are designed to mimic human dialogue as closely as possible: a consumer or potential consumer will either speak into a microphone or type into a dialogue box, and the dialogue system will determine an appropriate response or action based on the customer's query. Sometimes the dialogue system will require more information from the user before it is able to act on the query.

A key challenge faced while developing such systems is supervising the flow of conversation over coherent utterances, i.e. seamlessly understanding user conversations over multiple interactions. Real users may change context or intent within a single conversation, and it is important to determine the flow of the conversation in order to correctly process each utterance by the user.

Figure 1 shows three examples of possible flows for a conversation conducted with an automatic dialogue system regarding the transfer of money from a user's bank account, which demonstrate this problem. For the task "transfer funds", the aim of the automatic conversational interface is to gather, over the course of the conversation from the user's inputs, all of the required information to perform the task that the user is requesting (i.e. to transfer funds, in this case); this is known as slot filling, as illustrated in the sketch below. Once the automatic dialogue system has all the pieces of information, it will then complete the conversation.
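As an illustration only, the "transfer funds" task can be modelled as a set of named slots that the dialogue fills from the user's inputs; the slot names in the following sketch are assumptions for the example, not taken from this application.

```python
# Minimal sketch of slot filling for a "transfer funds" task. The slot names
# ("amount", "currency", "recipient") are illustrative assumptions.
slots = {"amount": None, "currency": None, "recipient": None}

def conversation_complete(slots):
    # The dialogue system keeps requesting information from the user
    # until every slot holds a value, then completes the conversation.
    return all(value is not None for value in slots.values())
```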

In conversation A, the user provides specific entities requested by the automatic dialogue system in multiple interactions 101. In conversation B, the user provides all the required details to finish the task in a single utterance 103. Conversation C is an example of context change, where the user switches context in the middle of the conversation. The user makes a query 105 about his balance in the middle of a series of queries 107 regarding money transfer. Thus, the user switches contexts (or, equivalently, intents) midway through the interaction with the conversational dialogue system from "transfer" to "balance enquiry" and then back again to gather more information before completing the transaction.

Existing conversational dialogue systems often identify the intent of a user (e.g. balance, transfer) at the start of a conversation and then initiate slot filling based on the identified intent, for example by pre-loading slots and then requesting information from the user in order to fill the slots. If the user switches context (i.e. intent) before the conversation is complete, therefore, the system may not be able to handle this change as the user's utterance will not correspond with the slots being filled. The system may keep repeating the same questions until it receives all of the information required to complete the conversation, regardless of the user's utterances. Such systems are unable to handle a variety of user conversational contexts and only work well in a limited set of use cases.

Known approaches to coherence detection in text data typically employ an entire conversation as the input. However, this is not practical in an automatic dialogue system where real time context change detection is required.

It is desirable to provide a method for identifying context change during the course of a user interaction with an automatic dialogue system which addresses at least one of the drawbacks of the prior art and/or to provide the public with a useful choice.

Summary

In a first aspect, a method of identifying a semantic context of an utterance for use with an automatic conversational dialogue system is provided, the method comprising: identifying a first context; receiving a current utterance from the user; classifying the current utterance as consistent with the first context or inconsistent with the first context, and when the current utterance is classified as being inconsistent with the first context, identifying a second context associated with the current utterance.

The term semantic context, or equivalently intent, of an utterance may mean the broad topic to which the utterance is directed. Contexts may be predefined according to the intended application of the method. For example, if the method is to be employed with an automatic conversational dialogue system used for customer service provision by an organization, the contexts may be different products or services, or categories of products or services, offered by the relevant organization. In other words, the contexts may each be a predetermined use case that the user is trying to address using the chatbot.

By consistent with the first context it may be meant that the current utterance is providing information that is associated with the first context or requesting information associated with the first context. The first context may be one of a plurality of predefined contexts. By consistent with the first context it may be meant that an utterance contains information which is most strongly associated with the first context out of a plurality of pre-defined contexts. The current utterance may contain information which may fill or be associated with one or more predefined slots associated with the first context or predefined slots loaded in response to the identification of the first context. Consistency with the first context may, but need not, imply that the current utterance contains no information associated with other pre-defined contexts.

Advantageously, this method may provide a simple and computationally efficient method of detecting the context, or equivalently intent, of an utterance as it is received in real time.

The first context may be associated with a primary utterance, that is to say a conversation-starting utterance received prior to the current utterance. The method may therefore be a method of identifying if there has been a change in semantic context.

The method may further comprise masking the current utterance by identifying named entities in the current utterance and replacing each identified named entity with a respective tag to obtain a masked current utterance, and wherein classifying the current utterance as consistent with the first context or inconsistent with the first context comprises classifying the masked current utterance as consistent with the first context or inconsistent with the first context. The term named entity, equivalently entity, or entity name, may mean an item of specific information (as opposed to a general category) in the utterance, for example, a real-world object, a proper name, the name of a currency, a monetary value, and/or a name of a commercial product, etc. The respective tag may comprise a name of a category to which the relevant named entity belongs.

Classifying the current utterance as consistent with the first context or inconsistent with the first context may comprise determining vector representations of the first context and the masked current utterance; determining a function that merges the vector representations of the first context and the masked current utterance; and determining a probability distribution over context change and context entailment for the function. The function may prepend the vector representation of the first context to the vector representation of the current utterance as this advantageously may reduce the number of n-grams.

Determining a probability distribution over context change and context entailment for the function may comprise employing a first neural model. The first neural model may comprise one or more long short-term memory (LSTM) layers. The first neural model may comprise bidirectional layers. The first neural model may comprise bidirectional long short-term memory (BiLSTM) layers. The first neural model may comprise a convolutional layer. An attention function may be applied to an output of the first neural model. The first neural model may be selected from one or more of a CNN model; an LSTM model; a BiLSTM model; a CNN-LSTM model; a BiLSTM model with attention; and a transformer model, such as a BERT model.

Identifying any named entities present in the current utterance may comprise: obtaining a character-level embedding for each word in the current utterance; obtaining a word-level embedding for each word in the current utterance based on the respective character-level embedding; and determining a probability distribution over a plurality of tags for each word in the current utterance based on the respective word-level embedding. Employing a character-level embedding may improve accuracy and may enable the model to deal with unknown entities. Obtaining a character-level embedding may comprise employing a second neural model with one or more long short-term memory (LSTM), or BiLSTM layers. BiLSTM layers may enable a high level of accuracy to be obtained.

Obtaining a word-level embedding may comprise employing a third neural model with one or more long short-term memory (LSTM) or BiLSTM layers.

In an aspect, a method of conducting an automatic conversational dialogue with a user is provided, the method comprising: identifying a first context; setting the first context as a current context; receiving a current utterance from the user; classifying the current utterance as consistent with the first context or inconsistent with the first context, and when the current utterance is classified as being inconsistent with the first context, identifying a second context associated with the current utterance; when the likelihood of context change exceeds a predetermined threshold, setting the second context as the current context, otherwise making no change to the current context; and outputting a response to the current utterance based on the current context.

Advantageously, the method may enable context change by a user to be handled as part of automatic conversational dialogue, thereby improving the appropriateness of responses and efficiency of a dialogue with a user.

In this respect, the method may further comprise: receiving a primary utterance from the user and identifying the first context based on the primary utterance; and outputting a response to the primary utterance based on the current context prior to receiving the current utterance. Thus, the first context may be determined from a conversation-starting utterance received from a user. Outputting responses to the primary and current utterances based on the current context may comprise outputting text requesting predetermined information based on the current context, for example information to complete pre-loaded slots which have been loaded in response to the determination of the current context.

In an aspect, a system for identifying the semantic context of an utterance is provided, the system comprising: an input for receiving a current utterance from a user; and a processor configured to: identify a first context; classify the current utterance as consistent with the first context or inconsistent with the first context and when the current utterance is classified as being inconsistent with the first context, identify a second context associated with the current utterance.

In an aspect, an automatic conversational dialogue system is provided, the system comprising: an input for receiving a current utterance from a user; and a processor configured to: identify a first context, set the first context as a current context, classify the current utterance as consistent with the first context or inconsistent with the first context and when the current utterance is classified as being inconsistent with the first context, identify a second context associated with the current utterance and set the second context as the current context, otherwise make no change to the current context, and determine a response to the current utterance based on the current context; and an output for outputting the response to the current utterance.

In an aspect, a carrier medium comprising computer readable code is provided, the computer readable code configured to cause a processor to: identify a first context; classify a current utterance as consistent with the first context or inconsistent with the first context and when the current utterance is classified as being inconsistent with the first context, identify a second context associated with the current utterance. The carrier medium may be tangible or non-tangible. In an aspect, a carrier medium comprising computer readable code is provided, the computer readable code configured to cause a processor to: identify a first context; set the first context as a current context; classify the current utterance as consistent with the first context or inconsistent with the first context and when the current utterance is classified as being inconsistent with the first context, identify a second context associated with the current utterance and set the second context as the current context, otherwise make no change to the current context, and determine a response to the current utterance based on the current context. The carrier medium may be tangible or non-tangible.

Brief Description of the Drawings

Exemplary embodiments will now be described with reference to the accompanying drawings, in which:

Figure 1 shows examples of three conversational flows;

Figure 2 shows a schematic of a context detection module according to an embodiment;

Figure 3 shows a flowchart of a context detection method according to an embodiment;

Figure 4 shows an example of neural network architecture;

Figure 5 shows an example of BiLSTM neural network architecture;

Figure 6 shows the architecture of an entity masking module according to an embodiment;

Figure 7 shows an example of a context detection flow;

Figure 8 shows a schematic of an automatic conversational dialogue system, including the context detection module of Figure 2;

Figure 9 shows a method of performing automatic conversational dialogue according to an embodiment;

Figure 10 shows a method of training a neural network;

Figure 11 shows a method of training the context detection module of Figure 2 according to an embodiment;

Figure 12 shows a method of training data generation for training a context classifier module according to an embodiment; and

Figure 13 shows the architecture of a convolutional neural network for use as a context classifier module according to an embodiment.

Detailed Description of Preferred Embodiment

Figure 2 illustrates a computer system, or equivalently context detection module, 380 for detecting context according to an embodiment. The computer system 380 includes a processor 382 (which may be referred to as a central processor unit or CPU) that is in communication with memory devices including secondary storage 384, read only memory (ROM) 386, random access memory (RAM) 388, input/output (I/O) devices 390, and network connectivity devices 392. The processor 382 may be implemented as one or more CPU chips.

It is understood that by programming and/or loading executable instructions onto the computer system 380, at least one of the CPU 382, the RAM 388, and the ROM 386 are changed, transforming the computer system 380 in part into a particular machine or apparatus having the novel functionality taught by the present disclosure. It is fundamental to the electrical engineering and software engineering arts that functionality that can be implemented by loading executable software into a computer can be converted to a hardware implementation by well-known design rules. Decisions between implementing a concept in software versus hardware typically hinge on considerations of stability of the design and numbers of units to be produced rather than any issues involved in translating from the software domain to the hardware domain. Generally, a design that is still subject to frequent change may be preferred to be implemented in software, because re-spinning a hardware implementation is more expensive than re-spinning a software design. Generally, a design that is stable that will be produced in large volume may be preferred to be implemented in hardware, for example in an application specific integrated circuit (ASIC), because for large production runs the hardware implementation may be less expensive than the software implementation. Often a design may be developed and tested in a software form and later transformed, by well-known design rules, to an equivalent hardware implementation in an application specific integrated circuit that hardwires the instructions of the software. In the same manner as a machine controlled by a new ASIC is a particular machine or apparatus, likewise a computer that has been programmed and/or loaded with executable instructions may be viewed as a particular machine or apparatus.

Additionally, after the computer system 380 is turned on or booted, the CPU 382 may execute a computer program or application. For example, the CPU 382 may execute software or firmware stored in the ROM 386 or stored in the RAM 388. In some cases, on boot and/or when the application is initiated, the CPU 382 may copy the application or portions of the application from the secondary storage 384 to the RAM 388 or to memory space within the CPU 382 itself, and the CPU 382 may then execute instructions that the application is comprised of. In some cases, the CPU 382 may copy the application or portions of the application from memory accessed via the network connectivity devices 392 or via the I/O devices 390 to the RAM 388 or to memory space within the CPU 382, and the CPU 382 may then execute instructions that the application is comprised of. During execution, an application may load instructions into the CPU 382, for example load some of the instructions of the application into a cache of the CPU 382. In some contexts, an application that is executed may be said to configure the CPU 382 to do something, e.g., to configure the CPU 382 to perform the function or functions promoted by the subject application. When the CPU 382 is configured in this way by the application, the CPU 382 becomes a specific purpose computer or a specific purpose machine.

The secondary storage 384 is typically comprised of one or more disk drives or tape drives and is used for non-volatile storage of data and as an over-flow data storage device if RAM 388 is not large enough to hold all working data. Secondary storage 384 may be used to store programs which are loaded into RAM 388 when such programs are selected for execution. The ROM 386 is used to store instructions and perhaps data which are read during program execution. ROM 386 is a non-volatile memory device which typically has a small memory capacity relative to the larger memory capacity of secondary storage 384. The RAM 388 is used to store volatile data and perhaps to store instructions. Access to both ROM 386 and RAM 388 is typically faster than to secondary storage 384. The secondary storage 384, the RAM 388, and/or the ROM 386 may be referred to in some contexts as computer readable storage media and/or non-transitory computer readable media.

I/O devices 390 may include printers, video monitors, liquid crystal displays (LCDs), plasma displays, touch screen displays, keyboards, keypads, switches, dials, mice, track balls, voice recognizers, card readers, paper tape readers, or other well-known input devices.

The network connectivity devices 392 may take the form of modems, modem banks, Ethernet cards, universal serial bus (USB) interface cards, serial interfaces, token ring cards, fiber distributed data interface (FDDI) cards, wireless local area network (WLAN) cards, radio transceiver cards that promote radio communications using protocols such as code division multiple access (CDMA), global system for mobile communications (GSM), long-term evolution (LTE), worldwide interoperability for microwave access (WiMAX), near field communications (NFC), radio frequency identity (RFID), and/or other air interface protocol radio transceiver cards, and other well-known network devices. These network connectivity devices 392 may enable the processor 382 to communicate with the Internet or one or more intranets. With such a network connection, it is contemplated that the processor 382 might receive information from the network, or might output information to the network in the course of performing the below-described method steps. Such information, which is often represented as a sequence of instructions to be executed using processor 382, may be received from and outputted to the network, for example, in the form of a computer data signal embodied in a carrier wave.

Such information, which may include data or instructions to be executed using processor 382 for example, may be received from and outputted to the network, for example, in the form of a computer data baseband signal or signal embodied in a carrier wave. The baseband signal or signal embedded in the carrier wave, or other types of signals currently used or hereafter developed, may be generated according to several methods well-known to one skilled in the art. The baseband signal and/or signal embedded in the carrier wave may be referred to in some contexts as a transitory signal.

The processor 382 executes instructions, codes, computer programs, scripts which it accesses from hard disk, floppy disk, optical disk (these various disk-based systems may all be considered secondary storage 384), flash drive, ROM 386, RAM 388, or the network connectivity devices 392. While only one processor 382 is shown, multiple processors may be present. Thus, while instructions may be discussed as executed by a processor, the instructions may be executed simultaneously, serially, or otherwise executed by one or multiple processors. Instructions, codes, computer programs, scripts, and/or data that may be accessed from the secondary storage 384, for example, hard drives, floppy disks, optical disks, and/or other devices, the ROM 386, and/or the RAM 388 may be referred to in some contexts as non-transitory instructions and/or non-transitory information.

In an embodiment, the computer system 380 may comprise two or more computers in communication with each other that collaborate to perform a task. For example, but not by way of limitation, an application may be partitioned in such a way as to permit concurrent and/or parallel processing of the instructions of the application.

Alternatively, the data processed by the application may be partitioned in such a way as to permit concurrent and/or parallel processing of different portions of a data set by the two or more computers. In an embodiment, virtualization software may be employed by the computer system 380 to provide the functionality of a number of servers that is not directly bound to the number of computers in the computer system 380. For example, virtualization software may provide twenty virtual servers on four physical computers. In an embodiment, the functionality disclosed above may be provided by executing the application and/or applications in a cloud computing environment. Cloud computing may comprise providing computing services via a network connection using dynamically scalable computing resources. Cloud computing may be supported, at least in part, by virtualization software. A cloud computing environment may be established by an enterprise and/or may be hired on an as-needed basis from a third-party provider. Some cloud computing environments may comprise cloud computing resources owned and operated by the enterprise as well as cloud computing resources hired and/or leased from a third-party provider.

In an embodiment, some or all of the functionality according to embodiments may be provided as a computer program product. The computer program product may comprise one or more computer readable storage medium having computer usable program code embodied therein to implement the functionality disclosed below. The computer program product may comprise data structures, executable instructions, and other computer usable program code. The computer program product may be embodied in removable computer storage media and/or non-removable computer storage media. The removable computer readable storage medium may comprise, without limitation, a paper tape, a magnetic tape, magnetic disk, an optical disk, a solid state memory chip, for example analog magnetic tape, compact disk read only memory (CD-ROM) disks, floppy disks, jump drives, digital cards, multimedia cards, and others. The computer program product may be suitable for loading, by the computer system 380, at least portions of the contents of the computer program product to the secondary storage 384, to the ROM 386, to the RAM 388, and/or to other non-volatile memory and volatile memory of the computer system 380. The processor 382 may process the executable instructions and/or data structures in part by directly accessing the computer program product, for example by reading from a CD-ROM disk inserted into a disk drive peripheral of the computer system 380. Alternatively, the processor 382 may process the executable instructions and/or data structures by remotely accessing the computer program product, for example by downloading the executable instructions and/or data structures from a remote server through the network connectivity devices 392. The computer program product may comprise instructions that promote the loading and/or copying of data, data structures, files, and/or executable instructions to the secondary storage 384, to the ROM 386, to the RAM 388, and/or to other non-volatile memory and volatile memory of the computer system 380.

In some contexts, the secondary storage 384, the ROM 386, and the RAM 388 may be referred to as a non-transitory computer readable medium or a computer readable storage media. A dynamic RAM embodiment of the RAM 388, likewise, may be referred to as a non-transitory computer readable medium in that while the dynamic RAM receives electrical power and is operated in accordance with its design, for example during a period of time during which the computer system 380 is turned on and operational, the dynamic RAM stores information that is written to it. Similarly, the processor 382 may comprise an internal RAM, an internal ROM, a cache memory, and/or other internal non-transitory storage blocks, sections, or components that may be referred to in some contexts as non-transitory computer readable media or computer readable storage media.

The processing tasks performed by the processor 382 can be regarded as being conceptually arranged into three modules, namely a flow classification module 31, an information masking module 33 and a context classifier module 35.

Figure 3 shows a high-level overview of the method performed by the system 380 including the steps performed in each of these modules. In step S301, the system receives an utterance as text data from a user via a conversational interface. The conversational interface may be either the input/output module 390 or, in the case of a networked system, via the network 392.

In step S302, the received utterance S_i is converted to a vector representation using a pre-trained word-to-vector embedding, for example a GloVe embedding, optionally fine-tuned with domain-specific data, and added to the converted stack of user utterances U = {S_0, S_1, S_2, ..., S_n}. The stack is typically stored in the RAM 388 of the system 380.
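A minimal sketch of step S302 follows, assuming the pre-trained embeddings are available as a simple word-to-vector dictionary; the fallback to zero vectors for unknown words is an illustrative choice, not one specified here.

```python
import numpy as np

def embed_utterance(utterance, embeddings, dim=100):
    # Convert an utterance to a sequence of word vectors (cf. step S302).
    # `embeddings` maps words to pre-trained GloVe-style vectors; unknown
    # words fall back to a zero vector (an assumption for this sketch).
    tokens = utterance.lower().split()
    return np.stack([embeddings.get(tok, np.zeros(dim)) for tok in tokens])

# Each converted utterance S_i is appended to the per-session stack
# U = {S_0, S_1, ..., S_n}, held in RAM in the described embodiment.
utterance_stack = []
utterance_stack.append(embed_utterance("transfer money", {}))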

In step S303, where the utterance S_i = S_0, i.e. is the first of a sequence of utterances, the intent (i.e. the context) of the utterance S_i is predicted. This step is performed by the flow classification module 31.

In this step, the utterance S_0 = {x_0, x_1, x_2, ..., x_m} (where the x_j are tokens) is classified to the appropriate flow. In the described embodiment this is done using the attention-based model described in Sinha, Koustuv et al., "A Hierarchical Neural Attention-based Text Classifier", EMNLP (2018). In this approach, an artificial recurrent neural network with bidirectional long short-term memory (BiLSTM) is employed to obtain a fixed-length encoding H_i = {h_0, h_1, h_2, ..., h_m} from the variable-length word sequence, and the encodings are employed to compute attention.

Neural networks (neural models) are adaptive models trained by machine learning methods. In general, they comprise sets of algorithms configured to map inputs to outputs. A schematic of the simplest type of neural network is shown in Figure 4. The neural network comprises an input layer 1901 where the input data is input into the network, one or more hidden layers 1903 where inputs are combined and an output layer 1905 at which the output is received.

The hidden layer 1903 comprises a series of biased nodes 1909. Each input to each hidden layer is weighted and combined at a node with a non-linear activation function. The neural network is defined by a series of parameters including those characterizing the architecture of the neural network (i.e. number of nodes and number of hidden layers), activation functions, weights and biases. The weights and biases are determined during training of the neural network.

Note that although only one hidden layer is shown in Figure 4, the neural network may comprise a plurality of hidden layers, according to the architecture employed.

In order to train a neural network, training data in the form of inputs and corresponding outputs is employed and the weights and biases are adjusted in order to minimize the difference between the neural network output and the target output. In practice this is done by minimizing a so-called objective function which characterizes the error in the network.

Recurrent neural networks are typically employed for sequential data, such as text. In these networks, the basic neural network architecture shown in Figure 4 is applied recursively to each element in the sequence in a series of steps, with the output of one step for one element forming part of the input for the next element. For example, if the network is applied to a sequence of words, in the first step, the first word in the sequence forms part of the input 1901 in the architecture shown in Figure 4 and the network produces output 1905. In the second step, both the output 1905 of the first step and the second word together form inputs 1901 for the second step, and so on.

Recurrent neural networks with bi-directional long short-term memory (BiLSTM) take inputs from both directions in the sequence of words, i.e. from the word preceding the word in question and the subsequent word. LSTM and BiLSTM architectures are generally described in Klaus Greff, Rupesh Kumar Srivastava, Jan Koutnik, Bastiaan Steunebrink, and Jurgen Schmidhuber, 2017, "LSTM: A search space odyssey", IEEE Transactions on Neural Networks and Learning Systems, 28: 2222-2232. An example architecture employed in the described embodiment is shown in Figure 5, in which a BiLSTM network is applied to a sequence of tokens (individual units into which an utterance may be divided, for example words or terms) S_0 = {x_0, x_1, x_2, ..., x_m} 505 comprising elements 5051, 5053, 5055 and 5057. The last output, or the final state of the LSTM, concatenated along the feature axis, is used to produce the fixed-length sequence of encodings H_i = {h_0, h_1, h_2, ..., h_m} 507. The architecture comprises two layers: a forward LSTM layer 501 and a backward LSTM layer 503. In the forward layer 501, the basic neural network architecture 19 (with more nodes and/or hidden layers, as required) is applied for each step in the sequence 505, taking both the relevant word 5053 of the sequence itself as an input 5011 as well as an output 509 of the network 19 as applied to the previous word 5051 in the sequence (where available). The backward layer performs the reverse process. Outputs 5013 and 5015 of the forward and backward layers, respectively, are combined to obtain the respective encoding 5017 in the output sequence 507 of word encodings.
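For concreteness, a minimal sketch of such a BiLSTM encoder is given below, using PyTorch; the dimensions are illustrative and the framework choice is an assumption, not part of the described embodiment.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    # Sketch of the Figure 5 encoder: forward and backward LSTM layers whose
    # per-token outputs are concatenated to give encodings H = {h_0, ..., h_m}.
    def __init__(self, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, x):        # x: (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)      # h: (batch, seq_len, 2 * hidden_dim)
        return h                 # one encoding h_i per input token

# Example: encodings for a 5-token utterance with 100-dimensional embeddings.
h = BiLSTMEncoder()(torch.randn(1, 5, 100))   # shape (1, 5, 256)
```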

As well as the parameters of the basic neural network unit 19 discussed above, the BiLSTM network is also characterized by parameters defining how outputs from layers 501 and 503 are combined, as well as those defining how inputs 509 and 5011 are combined in the individual layers themselves. As with the weights and biases of the neural network module 19, these may be adjusted during training of the network and may be stored in the memory 386 of the context detection system, as shown in Figure 2.

Once the encodings 507 H_i = {h_0, h_1, h_2, ..., h_m} of each word in the sequence S_0 = {x_0, x_1, x_2, ..., x_m} have been determined using a neural network having the architecture shown in Figure 5, the encodings are employed to determine attention.

Attention is a method of weighting encodings in the sequence H i according to their importance in terms of intent classification. For example, in the utterance "I wish to transfer money to my mother's account" the word "transfer" is most significant in terms of determining the intent of the sentence (transferring money).

In order to apply attention to the sequence of encodings, a normalized attention vector a_1 is calculated as

a_1 = softmax(f_1(H_i))

where f_1 is an attention function. The feature vector c_1 is then calculated as

c_1 = H_i a_1^T

For multi-head attention, multiple attention functions f_1, f_2, ..., f_m are employed. Each corresponding attention vector a_1, a_2, ..., a_m is individually normalized and m feature vectors C = {c_1, c_2, ..., c_m} are obtained. These are concatenated to yield a fixed-size vector of dimension d * m, where d is the dimension of the hidden representation in the BiLSTM network.

This representation is fed to a fully connected layer to obtain probability scores for each context class.

The context class with the highest score is determined to be the intent of S_0.
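A hedged sketch of this multi-head attention classifier is given below; representing each attention function f_i as a learned linear layer is an assumption made for the example, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AttentionFlowClassifier(nn.Module):
    # Sketch of the flow classifier: m attention heads over the BiLSTM
    # encodings H, concatenated into a d*m vector and fed to a fully
    # connected layer with softmax over the context classes.
    def __init__(self, d=256, num_heads=4, num_contexts=10):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d, 1) for _ in range(num_heads))
        self.fc = nn.Linear(d * num_heads, num_contexts)

    def forward(self, H):                                   # H: (batch, T, d)
        feats = []
        for f in self.heads:
            a = torch.softmax(f(H).squeeze(-1), dim=-1)     # normalized a_i
            feats.append(torch.einsum('bt,btd->bd', a, H))  # feature vector c_i
        x = torch.cat(feats, dim=-1)                        # fixed size d * m
        return torch.softmax(self.fc(x), dim=-1)            # per-class scores

# The intent of S_0 is the highest-scoring context class:
# intent = AttentionFlowClassifier()(H).argmax(dim=-1)
```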

In step S305, utterances S_1, S_2, ..., S_n which follow S_0 sequentially in a dialogue with a user are masked. Masking an utterance means that named entities in the utterance are replaced by tags which simply describe a category of entities to which the named entity belongs, e.g. each tag comprises simply the name of the category. Thus, semantically similar phrases are masked to make the utterances more generic. Formally, given a query S_i = Q = {x_0, x_1, ..., x_m}, the information is masked by converting the named entities present in Q to their tags Y = {y_0, y_1, ..., y_m}, where the x_i are tokens in the query, m is the length of the query, and y_i is the tag corresponding to the i-th named entity in the query.

For example, the utterance "200$" comprises two named entities, "200" and "$", which are replaced by their respective tags "amount" and "currency code", resulting in a masked utterance Y of "amount currency code".

This process may be performed as each utterance is received, i.e. in real time as the dialogue with the user proceeds.

The step S305 of masking the utterances itself comprises two steps: named entity recognition (NER) S3051 and intent-entity mapping S3053. Together these steps make up the information masking module 33. The whole entity masking process S305 is shown in Figure 6, which shows both the named entity recognition steps S3051 (contained within the box) and the intent-entity mapping steps S3053.

The Named Entity recognition process S3051 will now be described in detail.

In step S401, the user query S_i = Q = {x_0, x_1, ..., x_m} is input into the model.

In step S403, character-level BiLSTM is performed for each word. In this process, each character c is initialised using a d-dimensional real-valued vector obtained from a pre-trained word-to-vector model, for example a GloVe embedding. The initialised characters of each word are then passed through a BiLSTM layer, an example architecture of which was described above (albeit in relation to words) in relation to Figure 5. The output o_f ∈ R^k of the last cell of the forward LSTM and the output o_b ∈ R^k of the backward LSTM are concatenated to obtain a character-level representation of each word U_c ∈ R^(2k).

In step S405, a pre-trained word embedding U_w ∈ R^b is obtained for each word.

In step S407, the character-level representation U_c is directly concatenated with the word embedding U_w to obtain the vector W ∈ R^(2k+b). A dropout layer (not shown) is then applied before the concatenated vector W is passed as an input to step S409.

In step S409, the concatenated vector W ∈ R^(2k+b) is passed as an input to a word-level BiLSTM. Again, a BiLSTM network with the architecture shown in Figure 5 is used here to help ensure that a rich feature representation is learnt for each word in the sequence, one which is aware of context through both preceding and succeeding words.

In step S4011, the hidden state of the forward LSTM at the i-th word, w_i^f ∈ R^d, is concatenated with the hidden state of the backward LSTM at the i-th word, w_i^b ∈ R^d, at each time stamp i to obtain the feature representation of each word H_i ∈ R^(2d).

In step S4013, individual tag decisions for each word in the sequence are made by feeding each H_i to a feed-forward layer (i.e. a neural network layer with simply a forward architecture such as that of Figure 4) with a softmax activation applied to the output in order to generate a probability distribution for each word over T tags, given by

p_i = softmax(W H_i + b)

where p_i,t is the probability of the i-th word having the t-th tag from the list of tags t = {y_0, y_1, ..., y_(T-1)}, and W and b are the learnable model parameters stored in the memory 386 of the context detection system shown in Figure 2.
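The named entity recognition pipeline of steps S403 to S4013 can be sketched as follows; this is a loose PyTorch rendering under assumed dimensions, with randomly initialised character embeddings standing in for the pre-trained initialisation described above.

```python
import torch
import torch.nn as nn

class NERTagger(nn.Module):
    # Sketch of steps S403-S4013: a character-level BiLSTM builds U_c per
    # word, which is concatenated with a pre-trained word embedding U_w and
    # passed through a word-level BiLSTM; a feed-forward layer with softmax
    # yields a probability distribution over T tags for each word.
    def __init__(self, char_vocab=100, char_dim=25, k=25, b=100, d=128, T=10):
        super().__init__()
        self.char_embed = nn.Embedding(char_vocab, char_dim)
        self.char_lstm = nn.LSTM(char_dim, k, bidirectional=True,
                                 batch_first=True)
        self.dropout = nn.Dropout(0.5)
        self.word_lstm = nn.LSTM(2 * k + b, d, bidirectional=True,
                                 batch_first=True)
        self.fc = nn.Linear(2 * d, T)

    def forward(self, char_ids, word_embs):
        # char_ids: (num_words, max_chars); word_embs: (num_words, b) = U_w
        _, (h_n, _) = self.char_lstm(self.char_embed(char_ids))
        u_c = torch.cat([h_n[0], h_n[1]], dim=-1)        # U_c in R^2k (S403)
        w = self.dropout(torch.cat([u_c, word_embs], dim=-1))    # W (S407)
        h, _ = self.word_lstm(w.unsqueeze(0))            # H_i in R^2d (S4011)
        return torch.softmax(self.fc(h.squeeze(0)), dim=-1)   # p_i (S4013)
```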

In step S4015, the highest-scoring tag for each word is predicted, and the resulting tags Y = {y_0, y_1, ..., y_m} are output by the entity extractor step S3051. In the masking step S3053, each element x_i of the input S_i = {x_0, x_1, ..., x_m} is replaced by the corresponding y_i if y_i is not "OTHER", where the tag "OTHER" indicates that x_i was not a named entity. Those elements x_i for which y_i is "OTHER" are left unchanged. A masked sequence X_masked is then obtained in which all the named entities are replaced by their corresponding tags.
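A sketch of the masking step S3053 itself, given the predicted tags, could be as simple as the following; the "OTHER" convention follows the text above.

```python
def mask_utterance(tokens, tags, other_tag="OTHER"):
    # Replace each token whose tag names a named-entity category with that
    # category name; tokens tagged "OTHER" are left unchanged (step S3053).
    return [tag if tag != other_tag else tok for tok, tag in zip(tokens, tags)]

# The "200$" example from the text: two named entities, no "OTHER" tokens.
print(" ".join(mask_utterance(["200", "$"], ["amount", "currency code"])))
# -> "amount currency code"
```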

In summary, a knowledge base is employed to map domain-specific intents to valid English words or phrases in order to obtain valid embeddings, so that the network can make more sense of the context and the words occurring in that context. The domain-specific intent-to-entity knowledge base is also employed to mask the entities.

Returning now to Figure 3, in step S307, intent-entity pre-processing is performed. In the described embodiment, this comprises the step of intent-word mapping S3071, which comprises converting the context I determined in step S303 (i.e. the intent predicted from the primary utterance S_0) and X_masked determined in step S305 to a vector representation. In an example, the 100-dimensional pre-trained word2vec embedding described in Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean, 2013, "Distributed representations of words and phrases and their compositionality", in Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111-3119, Red Hook, NY, USA, Curran Associates Inc., is employed.

Once the numerical representations are obtained, I and X_masked are merged using a function F(I, X_masked) to give output X_f. In the described embodiment, F(I, X_masked) prepends I to X_masked, as this may result in fewer n-grams.
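A sketch of this merge on the numerical representations might look as follows, assuming the context is represented by a single embedding vector.

```python
import numpy as np

def merge_representations(intent_vec, masked_vecs):
    # Sketch of F(I, X_masked): prepend the context's vector representation
    # to the sequence of masked-utterance vectors to form X_f.
    return np.vstack([intent_vec[None, :], masked_vecs])

x_f = merge_representations(np.ones(100), np.zeros((3, 100)))  # shape (4, 100)
```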

In step S309, X_f is input into the context prediction model, i.e. a prediction is made as to whether the context determined in step S303 has changed or not on the basis of I and X_masked. The probability of context change is modelled as P(y|X_f), where y ∈ {0,1}, with y = 1 if context change is predicted and y = 0 otherwise.

In the described embodiment, P(y|X_f) is modelled as an attention-based BiLSTM. Attention-based BiLSTM models are generally described in Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu, 2016, "Attention-based bidirectional long short-term memory networks for relation classification", in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 207-212, Berlin, Germany, Association for Computational Linguistics.

In order to obtain P(y|X_f), each word i of X_f is input into the BiLSTM network shown in Figure 5. The output is the matrix H containing the vectors [h_0, h_1, ..., h_T], where h_i is the output of the BiLSTM cell for the i-th input word, H ∈ R^(d×T), d is the dimension of the hidden representation and T is the number of words in the input sequence X_f.

The attention vector a is calculated as

a = softmax(W^T tanh(H))

where a ∈ R^T and W ∈ R^d is a learnable model parameter stored in the memory 386. The attention-weighted average of H is taken to obtain the final representation h_f = H a^T, where h_f ∈ R^d. P(y|X_f), a probability distribution over context change and context entailment, is then calculated by feeding h_f into a fully connected layer (such as that of Figure 4) with softmax activations to obtain the predicted output in step S3011.

When P(y = 1|X_f) exceeds a threshold, the utterance is classified as being inconsistent with the context, i.e. context change is predicted, and the system will then proceed to redetermine the intent, i.e. the method returns to step S303. Below this threshold, the utterance is classified as being consistent with the context. A threshold of 0.25 was determined via ablation experiments to produce a high level of accuracy, although it will be appreciated that other thresholds could be chosen according to the parameters of the system, desired accuracy, etc.
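A hedged sketch of this attention-based context change detector, including the 0.25 decision threshold, is given below; the tanh attention form follows the cited Zhou et al. formulation, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ContextChangeDetector(nn.Module):
    # Sketch of step S309: attention over the BiLSTM outputs H gives
    # h_f = H a^T, and a fully connected softmax layer yields P(y | X_f)
    # over context change (y = 1) and context entailment (y = 0).
    def __init__(self, d=128):
        super().__init__()
        self.w = nn.Linear(d, 1, bias=False)   # learnable attention weights W
        self.fc = nn.Linear(d, 2)

    def forward(self, H):                      # H: (batch, T, d)
        a = torch.softmax(self.w(torch.tanh(H)).squeeze(-1), dim=-1)
        h_f = torch.einsum('bt,btd->bd', a, H)          # h_f = H a^T
        return torch.softmax(self.fc(h_f), dim=-1)      # P(y | X_f)

def context_changed(p, threshold=0.25):
    # Classify as inconsistent with the active context when P(y=1) > 0.25.
    return p[..., 1] > threshold
```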

Thus, the method of Figure 3 predicts the likelihood that an utterance is consistent or inconsistent with a context, i.e. whether the context has changed during the course of the conversation with a user. In summary, when the user starts the conversation using the relevant interface, the user conversation stack is captured by the conversational middleware for the particular user session, and the user queries are classified to a particular intent using the intent classifier. Once the intent is captured, the entities are tagged using the entity recognizer and the intent-entity knowledge base is looked up to filter and mask the relevant entities using a valid English word or phrase. The pre-processed query is concatenated with the intent and passed through the context detection neural network to detect context digression or context entailment.

The system flow is shown schematically in Figure 7 for a sample conversation.

Figure 7 shows a complete conversation 601 comprising a number of inputs from a user 611, 621, 631, 641 and 651.

The first utterance 611 - "Transfer money" - is input into the flow classifier module 31 which performs the intent prediction step S303 in the method of Figure 3. The result of the intent prediction performed by the flow classifier module 31 determines the Active Context 605 which is employed to process the subsequent utterances 621, 631, 641 and 651. Additionally, when the context detection model is employed as part of an automatic dialogue system, the active context may determine the slots that are loaded for a slot-filling based chatbot.

Each of the subsequent utterances 621, 631, 641 and 651 initially bypasses the flow classifier module 31 and is input into the context change detector 607, which comprises the information masking and classifier modules 33 and 35, respectively, which perform steps S305 to S3011 of the method of Figure 3.

In real time, each input undergoes information masking (i.e. step S305 of Figure 3) in the information masking module 33 to obtain a masked representation 6011, which then undergoes classification according to steps S307 to S3011 of Figure 3 in the classification module 35. If the classification module 35 predicts that the context has changed, then (and only then) is the utterance input into the flow classifier 31 in order to determine a new active context 605 for that and subsequent utterances.

In the example of Figure 7, the context of the utterances 611 ("Transfer money") and 621 ("500$") is that of a balance transfer. However, the subject of the third utterance 631 ("Show account balance") is that of a balance enquiry. Utterances 641 ("Now transfer my money") and 651 ("John") revert to the subject matter of a balance transfer. Thus, the process shown in Figure 7 will result in a change in the active context 605 following input of utterance 631 and then again following input of utterance 641.

Figure 8 shows an exemplary application of the context detection module 380 as part of a system 200 for generating responses to queries using an automatic conversational dialogue system. As shown in Figure 8, the system 200 comprises a plurality of networked computing devices 380, 205, 207 and 209 which are connected via the Internet or via other configurations and protocols, such as Bluetooth, an intranet, a wide area network, etc., or various combinations of the above. These computing devices have an arrangement analogous to the device shown in Figure 2 and described above.

In Figure 8, personal user devices such as a PC 201 or a mobile device 203 are shown connected to the system via the network. These are configured to enable a user to enter a query for processing by the system. The system is configured to receive inputs from many such devices, for example via a webpage hosted by the server 207 into which the user inputs the query via the device 201, 203 which they are using. The inputs are then passed from the server 207 to the other devices in the network. The server is also configured to receive the answer to the query and display it, for example on the webpage, for the user to view on their device 201, 203.

The system 200 comprises at least one component, or node 209 functioning as conversational middleware. This node 209 is configured to receive queries from elsewhere in the network and to distribute them accordingly.

In an example method of operation of the system 200, the server 207 is configured to transmit raw (i.e. otherwise unprocessed) queries received from a user to the conversational middleware 209, performing decryption if necessary in the case of end-to-end encryption. The conversational middleware 209 then, in turn, distributes the raw query to the context detection module 380 for processing according to the method of Figure 3. The context detection module determines the active context of the query and transmits this information to the conversational middleware 209. The conversational middleware 209 then directs the query to a slot filling module 205 and informs the slot filling module of the active context. The slot filling module then employs this context information to preload slots corresponding to the determined context and fill them based on the information in the query. If all slots are not filled by the information in the query, the slot filling module generates responses to the query designed to obtain further information from the user. These are transmitted to the conversational middleware 209, which directs them to the server 207 for display, for example on the webpage hosted by the server 207. This process continues until all the slots are filled.

Once all slots are filled, the conversation completes.

An exemplary method performed by system 200 is shown in Figure 9.

In step S901, an input query is received from a user, for example via the PC 201 or mobile device 203.

In step S903, it is determined whether the query is a session starting query, i.e. the first query entered by the user in the session, or a subsequent query, i.e. a second or later query entered by the user. In an example, this is done by reference to a unique hash created when a user first logs into the website or other user interface for inputting a query. During the course of the dialogue, this hash token is exchanged between the conversational middleware and the user device, e.g. the browser, thereby enabling the system to recognise that the queries belong to the same user session. The token may expire due to inactivity or after a fixed period; for example, the token may expire after 20 minutes.
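For illustration only, such token-based session tracking might be sketched as follows (a minimal Python sketch assuming an in-memory token store and a 20-minute inactivity window; the names and structure are illustrative, not those of the described system):

    import time
    import uuid

    SESSION_TIMEOUT_SECONDS = 20 * 60  # example 20-minute expiry window
    _sessions = {}  # token -> last_seen timestamp (illustrative in-memory store)

    def start_session():
        # Create a unique hash token when the user first logs in.
        token = uuid.uuid4().hex
        _sessions[token] = time.time()
        return token

    def is_session_starting_query(token):
        # A query is session-starting if its token is unknown or has expired.
        last_seen = _sessions.get(token)
        if last_seen is None:
            return True
        if time.time() - last_seen > SESSION_TIMEOUT_SECONDS:
            del _sessions[token]  # expired due to inactivity
            return True
        _sessions[token] = time.time()  # refresh on each exchange
        return False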

If the hash token indicates that the query is a session starting query, i.e. a primary query, then the method moves to S905, the start of the classification branch of the method, and then in step S907, flow classification (i.e. step S303 of Figure 3) is performed on the query. The active context is then output in step S9013.

If the hash token indicates that the query is a subsequent utterance of a conversation (i.e. a secondary query), then the method moves to step S909, the start of the context change detection branch of the method, and context change detection (i.e. steps S305 to S3011 of Figure 3) is performed on the query in step S9011. If no change in context is detected, then the active context is unchanged and the method moves directly to S9013. If a change in context is detected, then the utterance undergoes flow classification in step S907, before the new active context is output in step S9013.

Once the active context is determined in step S9013, it is communicated to the conversational dialogue system to enable, for example, slot filling in step S9015 using the information in the received utterance. The process returns to step S901 until all slots are filled and the conversation completes in step S9017.
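The branching of this method may be sketched as follows (a Python sketch in which Session, flow_classify and detect_context_change are illustrative stand-ins for the modules described above, not the actual implementation):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Session:
        active_context: Optional[str] = None

    def flow_classify(query):
        # Stand-in for the flow classification module 31 (steps S303/S907).
        return "transfer" if "transfer" in query.lower() else "balance_enquiry"

    def detect_context_change(active_context, query):
        # Stand-in for context change detection (steps S305-S3011/S9011).
        return flow_classify(query) != active_context

    def determine_active_context(query, session):
        # S903: a primary query goes to the classification branch (S905/S907);
        # a secondary query goes to the context change detection branch (S909/S9011).
        if session.active_context is None or detect_context_change(session.active_context, query):
            session.active_context = flow_classify(query)  # S907
        return session.active_context  # S9013: output for e.g. slot filling (S9015)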

The training of the context detection module 380 will now be described according to an example. In order to train a neural network, training data in the form of inputs and corresponding outputs is employed, and the parameters characterizing the network are adjusted in order to minimize the difference between the neural network output and the target output. In practice this is done by minimizing a so-called objective function which characterizes the error in the network.

This process is shown for a general neural network in Figure 10. In step S1109, the neural network is initialized. In practice this means that the values of the hyperparameters (the constant parameters defining the network, such as the dimensions of the hidden representation, the number of layers, etc.) are selected and all of the trainable parameters characterizing the network are given an initial value, for example, a randomly chosen value. Where necessary, the activation functions are also chosen. Preferably, the hyperparameters are determined heuristically, for example via ablation experiments, according to the limitations of the systems employed and the requirements of the user, for example, the memory and speed constraints, desired throughput and desired accuracy. The tunable hyperparameters of each model will be discussed in more detail below. In step S1111, an input 1001 with a known, expected output 1003 is input into the neural network with the initialized parameters. The output produced by the neural network is compared to the target output 1003 and an error is calculated.

In step S1113, the parameters of the neural network are adjusted in order to minimize the error. The process is then repeated with the adjusted neural network by returning to step S1111 and minimizing the neural network error for other items of training data. Typically, this is done by optimizing a so-called objective function that characterizes the error in the network.
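As a concrete illustration of steps S1109, S1111 and S1113, a minimal training loop might look as follows (a PyTorch-style sketch; the model, data and hyperparameters are placeholders, not those of the described embodiment):

    import torch
    import torch.nn as nn

    # Placeholder training data: pairs of input 1001 and target output 1003.
    training_data = [(torch.randn(8, 100), torch.randint(0, 4, (8,))) for _ in range(10)]

    # S1109: initialise the network; hyperparameters are chosen heuristically.
    model = nn.Sequential(nn.Linear(100, 64), nn.Tanh(), nn.Linear(64, 4))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()  # objective function characterising the error

    for epoch in range(20):
        for inputs, targets in training_data:
            optimizer.zero_grad()
            outputs = model(inputs)           # S1111: run an input through the network
            loss = loss_fn(outputs, targets)  # compare with the target output 1003
            loss.backward()
            optimizer.step()                  # S1113: adjust parameters to minimise the error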

It will be appreciated that in the described embodiment of the context detection module 380, there are three trainable modules comprising neural networks: 1) the flow classification module 31 employed for intent prediction in step S303, 2) the information masking module 33 which performs step S305 and, finally, 3) the classifier 35 employed in steps S309 and S3011. According to the described embodiment, these are trained in order, as shown in Figure 11.

In the described embodiment, the training data comprises inputs from real users during conversations with an automatic dialogue system, which have been manually tagged for context. For example, where the training data is derived from user conversations with the automatic dialogue system of a financial institution, examples of tags which could be applied to the user inputs are transfer, balance enquiry, recharge and transaction history.

Further, each utterance of the training data is categorised as being either a primary utterance or a secondary utterance. Utterances which trigger a process flow (i.e. start a new conversational context or, equivalently, signal a new user intent) are primary utterances, and these utterances are unique to a specific process flow. 'Transfer money' in conversation A in Figure 1 is an example of a primary utterance since it initialises the 'transfer' flow. Utterances other than primary utterances are categorised as secondary utterances. Usually, but not exclusively, secondary utterances are responses input by a user to queries from the automatic dialogue system (for example queries designed to extract information for slot filling) and contain information required to complete a conversational flow. Note that the utterances of the automatic dialogue system employed to obtain the user conversation are not themselves required, since these utterances do not play a major role in determining context change; context change is generally driven by the user.

In step S1201 of the training method shown in Figure 11, the flow classification module 31, which is used to determine the current flow category using classification, is trained according to the method of Figure 10.

Firstly, as described in step S1109, the network is initialised. In the flow classification module of the described embodiment, the tuneable parameters include the number of attention heads, the kernel size, the number of recurrent neural network units, the number of recurrent neural network layers, the batch size, the number of epochs and the learning rate. These are determined experimentally by employing a holdout testing set and determining the best performing values. The activation function is also selected; in an example it is a sigmoid or tanh function.

In order to train this module, only primary utterances are employed and these are selected from the tagged data set. Referring to the general training method shown in Figure 10, in the training of this module the input 1001 comprises the primary utterances and the target output 1003 comprises the corresponding flow to which each of the primary utterances has been tagged. Thus, each input utterance has one ground truth class, or flow, assigned to it, and the trainable parameters characterising the BiLSTM neural network employed for step S303 are adjusted in order to minimize the error. In an example, the loss function is a softmax cross entropy function.
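By way of illustration, the training pairs for this module might resemble the following (invented banking examples in the spirit of the tags mentioned above):

    # Input 1001: primary utterance; target output 1003: tagged flow/intent.
    flow_training_data = [
        ("Transfer money",            "transfer"),
        ("Show my account balance",   "balance_enquiry"),
        ("Recharge my phone",         "recharge"),
        ("Show my last transactions", "transaction_history"),
    ]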

The next step of the training process, S1203, comprises training the information masking module 33 (i.e. step S305 of the method of Figure 3) according to the process shown in Figure 10. The architecture of the information masking module was described above in association with Figure 6.

In step S1109, the network is initialised. The tuneable parameters of the entity masking module according to the described embodiment include the number of dimensions of the hidden representation of the character level BiLSTM, the number of dimensions of the hidden representation of the word level BiLSTM, the batch size, the number of epochs and the learning rate. Examples of the activation function are sigmoid and tanh functions.

The input training data 1001 for training this module comprises both primary and secondary utterances. In order to obtain the target output 1003, the named entities in all of the utterances, both primary and secondary, across all the chat conversation sequences, are manually tagged, and the resulting masked utterance is assigned to the original utterance as the ground truth. The trainable parameters characterising the architecture shown in Figure 6 are adjusted in order to minimize the error according to the process of Figure 10. In an example, a sparse softmax cross-entropy function is employed as the loss function in step S1111.
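Illustrative (invented) input/target pairs for this module, in which named entities are replaced by category tags, might be:

    # Input 1001: raw utterance; target output 1003: entity-masked utterance.
    masking_training_data = [
        ("Transfer 500$ to John",                 "Transfer <amount> to <person>"),
        ("John",                                  "<person>"),
        ("Show balance for account 12345678",     "Show balance for account <account_number>"),
    ]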

In step S1205, the context classifier module 35 is trained according to the method of Figure 10. As described above, in the described embodiment, the context classifier module is an attention-based BiLSTM model.

As usual, in step S1109, the network is initialised. The tuneable parameters of the context change classifier include the batch size, the number of epochs, the number of LSTM units and the learning rate. Examples of suitable activation functions include ReLU, sigmoid and tanh functions. The training data for this module comprises a data set consisting of both primary and secondary queries. For each query, the training data consists of the utterance comprising the query, the active context, and a label indicating whether the intent comprises context digression or context entailment relative to the active context, for example a label of 1 or 0, respectively.

For example, if the conversation 601 of Figure 7 were employed as part of the training data set, the data columns for the query "show account balance" would comprise "show account balance" as the query, "balance transfer" as the active context, and 1 as the label indicating context digression.
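Extending this example to the full conversation 601 of Figure 7, the rows might be sketched as follows (the rows are illustrative; 1 = context digression, 0 = context entailment):

    # (query, active context, label)
    context_training_data = [
        ("500$",                  "balance transfer", 0),  # entailment
        ("show account balance",  "balance transfer", 1),  # digression to balance enquiry
        ("now transfer my money", "balance enquiry",  1),  # digression back to the transfer
        ("John",                  "balance transfer", 0),  # entailment
    ]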

In order to prepare the data, a semi-supervised method is employed as shown in Figure 12.

In step S1301, raw data is input into the system. In the described embodiment, this comprises manually created data 1313 for the domain, optionally supplemented with real data from chat logs 1311, which is preferably domain specific. For example, a user may employ an administration application to provide a list of use cases that they wish to address using the conversational dialogue system in order to create the manually created data 1313. A data generation script S1303 then converts the user-provided data into a specific format suitable for training the modules and combines it with the chat log data.

The raw data then undergoes intent classification, i.e. it is passed through the intent classification module 31 in S303, trained as described above, in order to determine the context of each input utterance. The utterances then undergo entity masking S305 using the masking module 33 trained as described above, followed by entity preprocessing in step S307.
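This preparation pipeline might be sketched as follows (a Python sketch; all function bodies are illustrative stubs standing in for the trained modules and the data generation script):

    def generate_data(use_cases):
        return list(use_cases)                         # stub for script S1303

    def intent_classify(utterance):
        return "transfer"                              # stub for trained module 31 (S303)

    def entity_mask(utterance):
        return utterance.replace("John", "<person>")   # stub for trained module 33 (S305)

    def preprocess_entities(utterance):
        return utterance.lower()                       # stub for S307

    def prepare_training_rows(manual_data, chat_logs):
        # Combine user-provided use cases with domain-specific chat log data.
        raw_utterances = generate_data(manual_data) + chat_logs
        rows = []
        for utterance in raw_utterances:
            context = intent_classify(utterance)       # determine context of each utterance
            masked = entity_mask(utterance)            # entity masking
            rows.append((context, preprocess_entities(masked)))
        # The (context, query) rows are then manually labelled 0/1.
        return rows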

In step S1309, the masked and preprocessed queries are then manually labelled. An example of the training data for this module is shown in Table 1.

Table 1

The inputs 1001 to the training process of Figure 10 for this module therefore comprise the active context and the pre-processed and masked query, and the target output 1003 comprises the label.

As discussed above, the context classifier module is configured to generate probability distributions for each word over T tags, given by

$p_i = \mathrm{softmax}(W h_i + b)$

where $p_i^t$ is the probability of the $i$th word having the $t$th tag from the list of tags $\mathcal{T} = \{y_0, y_1, \ldots, y_{T-1}\}$, and where $W$ and $b$ are two of the learnable model parameters determined during training. During training, the values of $W$ and $b$ are adjusted in step S1113 to minimize the neural network error, which is calculated in step S1111 as

$E = -\frac{1}{N} \sum_{n=1}^{N} \log p_n^{y_n}$

where $N$ is the number of training examples in the dataset.

In an example, the gradients are backpropagated against cross-entropy loss.
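In code, the per-word tag distribution and the averaged error above might be computed as follows (a NumPy sketch with illustrative dimensions; not the actual implementation):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    T, H = 4, 300               # number of tags and hidden dimension (illustrative)
    W = np.random.randn(T, H)   # learnable parameters adjusted in step S1113
    b = np.random.randn(T)

    def tag_distribution(h_i):
        # p_i = softmax(W h_i + b): probability of the i-th word having each tag.
        return softmax(W @ h_i + b)

    def mean_cross_entropy(hidden_states, gold_tags):
        # Error calculated in step S1111, averaged over the N training examples.
        losses = [-np.log(tag_distribution(h)[t]) for h, t in zip(hidden_states, gold_tags)]
        return float(np.mean(losses))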

It will be appreciated that all of the modules employ vector representations of the input words obtained using a pre-trained word-to-vector model, for example GloVe embeddings (see, for example, https://nlp.stanford.edu/projects/glove/), optionally fine-tuned with domain-specific data, for example, banking-specific terminology.

In the described embodiment, the context detection module 380 detects context change in real time during a conversation with a user. This is done without having to perform computationally expensive intent detection for each individual utterance of the user. Instead, an initial intent is detected from a first utterance and subsequent utterances are then analysed for context change. Only when context change is detected is intent detection performed. Thus, a simple method is provided that may evaluate coherence between multiple queries in subsequent bot-user interaction. This method may also be more accurate than detecting intent with every utterance, as secondary utterances may not include enough information to accurately identify a new context. Thus, the method of context change detection according to the described embodiment may enable accurate context change detection in real time - thereby enabling accurate and contextually appropriate responses from a conversational dialogue system - while minimising the additional computing time and memory required to perform it. Consequently, the accuracy of a conversational dialogue system may be improved without compromising on throughput. The method described above may enable a deep level of understanding of text pragmatics and logical semantics, to evaluate the coherence or digression of context to a high level of precision.

The method of context change detection may be particularly advantageous for automatic dialogue systems that apply a slot-filling based approach, as slot filling relies on identifying information in a user's utterances to fill predefined slots based on the context and intent of the user. By identifying a change in the context, the system may adapt the slots to the new context.

One additional advantage of the above design of the system 200 shown in Figure 8 is that, as developments in conversational AI software are made, no changes to the context detection module may be required; the context detection module 380 may simply be an out-of-the-box solution to real-time context identification which may be employed with any automatic conversational dialogue system.

Masking the entities in both operation and training of the context detection module 380, as in the described embodiment, may be advantageous because it may deduplicate patterns in utterances, helping to ensure that processing of the utterance data is both faster and less noisy than it would be for unmasked data. Preferably, the masking is domain-specific, which may ensure a high level of accuracy. By determining a character level representation prior to the word level representation, the entity masking module may accurately distinguish known and unknown entities with morphological similarities. Further, employing a BiLSTM for both character and word level embeddings may ensure that a rich feature representation is learnt for each word in the sequence which is aware of context through both preceding and succeeding words, thereby ensuring high accuracy.

The accuracy of context change detection according to the described embodiment was evaluated experimentally. The model was trained on a Tesla K80 GPU node with 12GB of memory, with hyperparameters determined by testing with a holdout testing set and choosing the best performing parameters. The maximum sequence length of utterances was 45.

A 100-dimensional word2vec embedding, fine-tuned on large banking data, was employed as an initializer for the flow classifier used in S303, the information masker in step S305 and the context classification models in step S309. The flow classification module employed 12 attention heads, a kernel size of 5, 256 RNN units, a single RNN layer, a batch size of 150, 20 epochs and a learning rate of 0.001. Sigmoid and tanh activation functions were employed, as well as a softmax cross entropy loss function. The entity extraction step S3051 used 50-dimensional randomly initialized character embeddings which were updated during model training. A 100-dimensional hidden representation for the character-level BiLSTM, a 300-dimensional hidden representation for the word-level BiLSTM, a batch size of 10, 4 epochs and a learning rate of 0.9 were employed, together with sigmoid and tanh activation functions. The Adam optimizer (Diederik P. Kingma and Jimmy Ba, 2014, Adam: A method for stochastic optimization, CoRR, abs/1412.6980) was employed, with cross entropy as the loss function. For the context change classifier, the batch size was 128, the number of epochs 5, the number of LSTM units 300 and the learning rate 0.001. ReLU, sigmoid and tanh activation functions were employed. Binary cross entropy was employed as the loss function. A threshold of P(y = 1|X_f) = 0.25 was employed for identifying context change, context change being identified if the probability equalled or exceeded this threshold.
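The thresholded decision described in this paragraph might be sketched as follows (the hyperparameter values are those quoted above; the config structure itself is illustrative):

    CONTEXT_CHANGE_THRESHOLD = 0.25

    classifier_config = {        # values quoted above for the context change classifier
        "batch_size": 128,
        "epochs": 5,
        "lstm_units": 300,
        "learning_rate": 0.001,
    }

    def context_changed(p_change):
        # Context change is identified if P(y=1|X_f) meets or exceeds the threshold.
        return p_change >= CONTEXT_CHANGE_THRESHOLD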

The model according to the described embodiment achieved an accuracy of 0.9753 and an F1 score of 0.9820, thereby showing excellent identification of context change. The performance of the entity extractor module 33 individually, according to the described embodiment and trained as described above, was also compared to a Named Entity Recognition model provided by OpenNLP which employs the Maximum Entropy Classifier described in Hai Leong Chieu and Hwee Tou Ng, 2002, Named entity recognition: A maximum entropy approach using global information, in COLING. The entity extractor according to the described embodiment achieved an accuracy of 92%, compared with the OpenNLP model which achieved an accuracy of 77%. Thus, the entity extractor according to the described embodiment has been shown to accurately mask information.

The described embodiment should not be construed as limitative. For example, although a particular method of initial flow classification is described above, the method of the initial flow classification performed in step S303 is not particularly limited and other methods could be used as an alternative in order to predict intent. The utterance S_0 may also not necessarily be the first utterance in a conversation with the user; the system may be configured to perform intent prediction after a particular number of utterances or only after certain requirements have been met.

Flow classification may not be performed for utterance S_0 at all. For example, the active context could be randomly initialised, or set to a most common value (for example the most common intent of a user's utterance for that particular system), and the context detection module could then determine whether S_0 is consistent with that active context or not, in accordance with the embodiments given above.

It will be appreciated that although entity masking S305 is described above as being performed as utterances are received, i.e. in real time, the entity masking could be performed after several utterances have been received.

Although the initial flow classification, or equivalently intent prediction, S303 is described as being performed using a BiLSTM with multihead attention, as this may result in high accuracy, it will be appreciated that other methods of performing the initial flow classification could alternatively be employed.

Although named-entity recognition S3051 is described above as being performed using a model with the architecture of Figure 6, it will be appreciated that other methods of named entity recognition could be performed.

It will be appreciated that although GloVe embedding is described above as being employed to obtain the vector representation of each utterance, other methods of obtaining the vector representation could be employed.

Although entity masking of the input query is described above, it may be omitted and context classification performed directly on the unmasked query. Alternatively, one or more of the steps of entity masking, such as steps S403 and S405 of obtaining the character embeddings could be omitted.

Although it is described above that, in step S405, the characters of each word are passed through a BiLSTM layer, other neural network architectures could also be employed according to embodiments. Likewise, although it is described above that, in step S409, word embeddings are determined using a BiLSTM, it will be appreciated that other neural network architectures could alternatively be employed.

Although in step S307 of the described embodiment, F(I, X_masked) prepends I to X_masked, it will be appreciated that I could alternatively be appended to X_masked, or the embeddings of both could be averaged.

Although the various computing devices of Figure 8 (201, 203, 205, 207 and 209) are described above as having the same arrangement as the device shown in Figure 2, they may have a different arrangement. Although they are shown as separate devices in Figure 8, the functionality of two or more devices may be performed by a single computing device. The network may comprise further devices which perform one or more of the functions of the devices shown in Figure 8.

The network 200 may comprise one or more further components with load balancing functionality, or load balancing may be performed by one or more of the nodes shown, for example the server 207.

The devices may be connected via the Internet or via other configurations and protocols such as Bluetooth, an intranet, a wide area network, etc., or various combinations of the above, for connecting some or all of the devices shown.

In Figure 8, personal user devices such as a PC 201 or mobile device 203 are shown as being connected to the system via the network. It will be appreciated that many such devices, or only one user device, may be connected. The system may be configured to receive inputs from many such devices, including simultaneously. Although the system is described as receiving inputs via a webpage, it will be appreciated that other mechanisms for receiving inputs, or equivalently utterances, could be employed. The user may submit the query verbally as speech into a microphone incorporated into their device 201 or 203. In this embodiment, one of the devices shown in Figure 8, or another networked device, not shown, may comprise speech recognition functionality and be configured to convert the input speech into text data. The conversion from input speech into text data may be carried out by a third-party provider. The input data may be end-to-end encrypted and the middleware 209 may be configured to perform decryption.

It will be appreciated that one or more of the components of the network shown in Figure 8 may be omitted according to embodiments. In particular, the functions of the different modules of the context detection module 380 described above could be performed by different components; for example, the flow classification module could comprise a separate component from the information masking module and the context classifier module. Different processes performed in the individual modules, such as the information masking module, could be performed by different devices.

In particular, the functions of the flow classification module 31 and/or the named entity recognition may not be performed by the context detector 380, instead being performed by the slot filling module 205 or by other, separate modules not shown in Figure 8. Although particular methods of initial flow classification and named entity recognition are described above, other methods of performing these steps may be employed. Consequently, the context detection module 380 may be plugged into an automatic dialogue system with an in-built initial flow classification module and/or named entity recognition module (regardless of how these modules perform these processes) and make use of these in-built modules for steps S303 and S3051 of Figure 3. These systems may also be employed in the training of the models employed in the information masking S3053 and classifier modules 35, as appropriate. In this case, for training, a corpus of secondary queries from previous chat logs involving the automatic dialogue system in question may be provided for each intent. The in-built entity recognition module may be employed to identify named entities, which are then masked in accordance with the methods described above. The context change classifier 35 is then trained as described above in step S1205 on the masked utterances.

This may enable straightforward integration of the context detection method according to embodiments with any automatic dialogue system.

Although training is described above for all three modules of the context detector module, it will be appreciated that pretrained modules could be employed for one or more of the modules, for example the flow classification module and the named entity recognition step S3051. One or more of the modules could be trained in parallel with the others using training data specific to each module. Although training using real data sets is described above, it will be appreciated that training could be conducted using artificially generated data, including fully manual preparation of the training data. Alternatively, the data could be custom generated in advance by a third party and may be domain specific or non-specific.

Although the classifier module described above in step S309 comprises an attention-based BiLSTM classifier, classifiers based on other models could alternatively be employed. Examples include, but are not limited to, CNN based classifiers, LSTM and BiLSTM (i.e. without attention) based models, CNN-LSTM based classifiers, and Transformer based classifiers. Each of these will now be described in turn according to embodiments.

CNN based classifier

In this embodiment, P(y|X_f) is modeled using a Convolutional Neural Network (CNN) based sentence classifier. Classification with CNNs is described in Yoon Kim, 2014, Convolutional neural networks for sentence classification, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746-1751, Doha, Qatar, Association for Computational Linguistics. CNNs are prominent in computer vision, but can also be used to generate meaningful representations of a text sequence. CNNs comprise one or more hidden layers that convolve their inputs by applying filters to them.

An example method according to this embodiment is shown in Figure 13.

In step S1401, the numerical representation of the concatenation of the active context and the entity-masked user utterance, obtained as described above in S307 and S3071, is input into the model. In step S1403, a convolutional layer with multiple filter widths is employed, where the filter width signifies the number of words of context in the input sequence. The filter sizes are hyperparameters that are tuned according to the computational limitations and requirements, for example by ablation. Each filter also has an activation function, for example a ReLU activation function.

As a result of this convolution operation, differently sized outputs are obtained from each filter. Consequently, in step S1405, a max pooling operation is performed on each filter output. In step S1407, the outputs are then concatenated to obtain a fixed-size representation of the whole input sequence.

In step S1409, the representation is reshaped and in step S1411 a dropout layer is applied. In an example, the dropout is set to 0.5.

Finally, in step S1413, P(y|X_f) is calculated by feeding the output of the dropout layer into a fully connected layer with softmax activations to obtain the predicted output in step S3011.
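A minimal sketch of this architecture (steps S1401 to S1413) is given below, assuming PyTorch and illustrative dimensions; it is one possible realisation, not the implementation of the embodiment:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CNNContextClassifier(nn.Module):
        def __init__(self, embed_dim=100, num_filters=128, filter_sizes=(3, 4, 5), num_classes=2):
            super().__init__()
            # S1403: a convolutional layer with multiple filter widths.
            self.convs = nn.ModuleList(
                nn.Conv1d(embed_dim, num_filters, k) for k in filter_sizes)
            self.dropout = nn.Dropout(0.5)  # S1411: dropout layer
            self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)  # S1413

        def forward(self, x):
            # x: (batch, seq_len, embed_dim) representation of X_f (S1401).
            x = x.transpose(1, 2)  # Conv1d expects (batch, channels, seq_len)
            # S1403-S1405: convolve with each filter width, then max pool each output.
            pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
            features = torch.cat(pooled, dim=1)  # S1407: fixed-size representation
            return F.softmax(self.fc(self.dropout(features)), dim=1)  # P(y|X_f)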

Training of the CNN-based classifier module may proceed as described above in association with Figure 10, with the model architecture as described above in steps S1401 to S1413 and the model trained against cross-entropy loss.

LSTM and BiLSTM based classifiers

In these embodiments, P(y|X_f) may be modeled using either an LSTM based classifier or a BiLSTM based classifier as described above in association with Figure 5.

In both cases, X_f is input into the LSTM or BiLSTM layer, which encodes it into a high-level representation. The output is the matrix $H$ containing the vectors $[h_0, h_1, \ldots, h_T]$, where $h_i$ is the output of the relevant LSTM or BiLSTM cell for the $i$th input word, i.e. at the $i$th time stamp. Max or average pooling may then be performed on these encodings:

$h_f = \mathrm{POOL}_p(H)$

where $\mathrm{POOL} \in \{\mathrm{avgpool}, \mathrm{maxpool}\}$ is the pooling function and $p \in \{\mathrm{avg}, \mathrm{max}\}$. $h_f$ is the final encoding of $X_f$, and is fed to a fully connected layer with softmax activations to obtain $P(y|X_f)$.
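A corresponding sketch of the BiLSTM variant with pooling (assuming PyTorch; dimensions illustrative, not those of the embodiment):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BiLSTMPoolClassifier(nn.Module):
        def __init__(self, embed_dim=100, hidden_dim=100, num_classes=2, pool="max"):
            super().__init__()
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.pool = pool  # p in {avg, max}
            self.fc = nn.Linear(2 * hidden_dim, num_classes)

        def forward(self, x):
            # x: (batch, seq_len, embed_dim) representation of X_f.
            H, _ = self.lstm(x)  # H contains [h_0, ..., h_T], one per time stamp
            h_f = H.max(dim=1).values if self.pool == "max" else H.mean(dim=1)
            return F.softmax(self.fc(h_f), dim=1)  # P(y|X_f)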

Both the LSTM and BiLSTM based models may be trained according to the method of Figure 10 using the backpropagation through time (BPTT) algorithm (described in Mikael Boden, 2001, A guide to recurrent neural networks and backpropagation) against cross-entropy loss.

CNN-LSTM based classifier

In this embodiment, P(y|X_f) is modeled by employing a CNN layer as described above to generate a feature representation of the text, followed by max pooling, as described above in relation to the CNN-based classifier. The generated representation is then input into an LSTM layer which generates a finer representation. Referring to Figure 13, the LSTM layer is applied between steps S1407 and S1409.

In this model, the CNN enables regional dependencies to be captured, whereas the LSTM captures the long-term dependencies of the text. The feature representation obtained from the LSTM layer is passed to a fully connected feed forward layer in order to obtain P(y|X_f).
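One possible reading of this combined architecture might be sketched as follows (assuming PyTorch; the exact wiring in the embodiment may differ):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CNNLSTMClassifier(nn.Module):
        def __init__(self, embed_dim=100, num_filters=128, kernel_size=3,
                     hidden_dim=100, num_classes=2):
            super().__init__()
            self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size)  # regional dependencies
            self.pool = nn.MaxPool1d(2)
            self.lstm = nn.LSTM(num_filters, hidden_dim, batch_first=True)  # long-term dependencies
            self.fc = nn.Linear(hidden_dim, num_classes)

        def forward(self, x):
            # x: (batch, seq_len, embed_dim) representation of X_f.
            c = self.pool(F.relu(self.conv(x.transpose(1, 2))))  # CNN features + max pooling
            out, _ = self.lstm(c.transpose(1, 2))                # finer sequence representation
            return F.softmax(self.fc(out[:, -1, :]), dim=1)      # P(y|X_f)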

This model is trained against cross-entropy loss in accordance with the method of Figure 10.

Transformer based classifiers

Transformer-based models may be employed as an alternative to the recurrent neural network based models of the classifier of the described embodiment. For example, a Bidirectional Encoder Representations from Transformers (BERT) model, as described in Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova, 2018, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR, abs/1810.04805, could be employed to advantageously impart the capabilities of bidirectional context to the transformer-based model. In this embodiment, two training objectives are used to pretrain the BERT model: the Masked Language Modelling task and the Next Sentence Prediction task. This pretrained model can then be fine-tuned using task-specific objectives.

In an embodiment, the output representation is obtained at a classification [CLS] token which represents the meaning of the entire sentence. The representation is then fed to a linear layer for classification.
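Such fine-tuning might be sketched using the Hugging Face transformers library as follows (the model name, label count and example inputs are illustrative; the evaluation described below used the larger 24-layer model):

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    # The active context and masked query are packed into one sequence; the
    # classification head operates on the [CLS] representation.
    inputs = tokenizer("balance transfer", "show account balance",
                       return_tensors="pt", truncation=True, max_length=45)
    labels = torch.tensor([1])  # 1 = context digression (illustrative label)

    outputs = model(**inputs, labels=labels)
    loss, logits = outputs.loss, outputs.logits
    loss.backward()  # fine-tune with a task-specific objective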

Accuracy and F1 scores were obtained for all of the model architectures described above, using the flow classifier and entity extractor trained as described above for use with the preferred BiLSTM with attention-based model.

For the BERT-based model, the BERT large implementation available in the Hugging Face transformers library (Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz and Jamie Brew, 2019, HuggingFace's transformers: State-of-the-art natural language processing, ArXiv, abs/1910.03771) was employed, which has 24 layers, a hidden dimension of 1024, 16 attention heads and around 340M parameters. For the CNN based classifier, 128 filters with filter sizes of 3, 4 and 5 were employed and the dropout was tuned and set to 0.5. For the LSTM based classifier, a 100-dimensional hidden representation with a drop probability of 0.2 was employed. The maximum sequence length of utterances was 45. All of the models were trained on a Tesla K80 GPU node with 12GB of memory.

The results are shown in Table 2, including the BiLSTM with Attention-based model (BiLSTM+attn) for comparison.

Table 2

It will be appreciated that all models showed good accuracy, with the best performance being obtained for the BiLSTM with attention-based classifier.

Although a slot-based approach to conversational dialogue is described above, with slot-filling module 205 shown in Figure 8, the context detection method according to embodiments may be used in conjunction with any task-oriented conversational agent. Task-oriented conversational agents may provide accurate handling of domain-specific orders.

It will be appreciated that although the embodiments above have been described with reference to examples relating to financial transactions, the methods described herein can be straightforwardly extended to any domain or intent, such as insurance, booking flights, restaurant reservations, and systems to help call centre agents.

Having now fully described the invention, it should be apparent to one of ordinary skill in the art that many modifications can be made hereto without departing from the scope as claimed.