

Title:
METHOD AND APPARATUS FOR INTERPRETING PHRASES FROM USERS
Document Type and Number:
WIPO Patent Application WO/2022/268335
Kind Code:
A1
Abstract:
A method of interpreting a phrase from a user. The method includes using a first machine learning module, trained as a classifier, to determine the top K classes most relevant or similar to the phrase among a collection of classes. The method further includes using a second machine learning module, trained independently from the first machine learning module, to determine a top 1 result. The top 1 result is the class among the top K classes that is most relevant or similar to the phrase, based on semantic similarity between the top K classes and the phrase. The method is based on two decoupled machine learning or deep learning model architectures (i.e., the first and second machine learning modules) which are trained in an improved manner on different lexical and semantic features, resulting in an improved interpretation of the phrase from the user. Thus, the top K classes and the top 1 result are accurately produced corresponding to the phrase from the user.

Inventors:
SALAMA HITHAM (DE)
DUTTA SOURAV (DE)
HU PENG (DE)
BURGIN EDWARD (DE)
Application Number:
PCT/EP2021/067480
Publication Date:
December 29, 2022
Filing Date:
June 25, 2021
Assignee:
HUAWEI TECH CO LTD (CN)
SALAMA HITHAM AHMED ASSEM ALY (DE)
International Classes:
G06F40/30; G06N20/00
Foreign References:
EP3454260A1 (2019-03-13)
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
CLAIMS

1. A method (100) of interpreting a phrase from a user, the method (100) comprising: using a first machine learning module, trained as a classifier, to determine top K classes most relevant or similar to the phrase among a collection of classes, and using a second machine learning module, trained independently from the first machine learning module, to determine a top 1 result, the top 1 result being a class among the top K classes that is most relevant or similar to the phrase based on semantic similarity between the top K classes and the phrase.

2. The method (100) of claim 1, wherein the first machine learning module and the second machine learning module are based on deep learning architectures.

3. The method (100) of claim 1 or 2, wherein the first machine learning module is trained to determine the top K classes most relevant or similar to the phrase based on latent lexical relationships between the classes and the phrase.

4. The method (100) of claim 3, wherein the first machine learning module comprises one of a neural network, a support vector machine, SVM, and a naive Bayes classifier.

5. The method (100) of claim 2, wherein the second machine learning module has a probabilistic Siamese-network based architecture.

6. The method (100) of any one of claims 1 to 5, wherein the first machine learning module and the second machine learning module are based on a monolingual language model or a multilingual language model.

7. An apparatus (202) for interpreting a phrase from a user, the apparatus (202) comprising: a first module (204) configured to determine top K classes most relevant or similar to the phrase among a collection of classes, using a first machine learning architecture (208), trained as a classifier, and a second module (206) configured to determine a top 1 result, the top 1 result being a class among the top K classes that is most relevant or similar to the phrase based on semantic similarity between the top K classes and the phrase, using a second machine learning architecture (210), trained independently from the first machine learning architecture (208).

8. The apparatus (202) of claim 7, wherein the first machine learning architecture (208) and the second machine learning architecture (210) are deep learning architectures.

9. The apparatus (202) of claim 7 or 8, wherein the first module (204) is trained to determine the top K classes most relevant or similar to the phrase based on latent lexical relationships between the classes and the phrase.

10. The apparatus (202) of claim 9, wherein the first module (204) comprises one of a neural network, a support vector machine, SVM, and a naive Bayes classifier.

11. The apparatus (202) of claim 8, wherein the second machine learning architecture (210) comprises a probabilistic Siamese-network based architecture.

12. The apparatus (202) of any one of claims 7 to 11, wherein the first module (204) and the second module (206) are based on a monolingual language model or a multilingual language model.

13. A computer program comprising executable instructions which when executed by a computer cause the computer to perform the method of any one of claims 1 to 6.

14. A non-transitory computer-readable storage medium storing a computer program comprising executable instructions which when executed by a computer cause the computer to perform the method of any one of claims 1 to 6.

Description:
METHOD AND APPARATUS FOR INTERPRETING PHRASES FROM USERS

TECHNICAL FIELD

The present disclosure relates generally to the field of conversational dialogue systems, natural language understanding, and chat bot related technologies; and more specifically, to a method and an apparatus for interpreting a phrase from a user.

BACKGROUND

Typically, a Frequently Asked Questions (FAQ) section of a document or website provides answers to a defined set of questions. In case of a query, a user can search the FAQ section, and a FAQ retrieval system provides an output. However, retrieval of suitable answers from the FAQ is a challenging task. This is because question-answer texts are short, making it technically difficult to bridge the lexical and semantic gap between the user query and the FAQ due to the limited context. Further, in certain cases, precise understanding of the user query by a conventional system may be difficult due to informal representations, domain-specificity, abbreviations, formal-colloquial term mismatches, and the like. Further, FAQ retrieval systems need to handle both keywords and natural language questions. Thus, due to the user-centric nature of such FAQ retrieval systems, there is a need for higher precision and better interpretability compared to traditional information retrieval systems.

Generally, the user experience or quality of service for a given information retrieval system depends on its ranking capabilities, such that the user can find an answer to the user query among the top results returned by the system. However, in current interactive applications, the fluidity of natural-language-based human-computer interaction imposes an additional requirement on the quality of the user experience. For example, a voice-based FAQ platform may be interfaced via a personal assistive system. In such cases, presenting the user with several top matching results (from the FAQ platform) to choose from hinders the natural fluidity of the interaction. Further, conventional FAQ retrieval systems are unable to automatically understand and/or infer context, meaning, and relevance adequately to provide the best matching answer to the user query, which reduces the quality of service they provide. Further, there are limitations associated with handling user queries from multi-lingual users, as a result of which conventional FAQ retrieval systems retrieve inappropriate answers for multi-lingual user queries. Thus, there is a technical problem of low recommendation accuracy and high inference time in the retrieval of answers by conventional FAQ retrieval systems, which adversely impacts the interactive customer experience of such systems.

Therefore, in light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks associated with the conventional FAQ systems or information retrieval systems.

SUMMARY

The present disclosure seeks to provide a method and an apparatus for interpreting a phrase from a user. The present disclosure seeks to provide a solution to the existing problem of low recommendation accuracy and high inference time in retrieval of answers by conventional retrieval systems, which adversely impact interactive customer experience from such conventional retrieval systems. An aim of the present disclosure is to provide a solution that overcomes at least partially the problems encountered in prior art, and provides improved method and system that provide an improved and more accurate retrieval of answers corresponding to a user query even if the query is in the form of a phrase.

One or more objects of the present disclosure is achieved by the solutions provided in the enclosed independent claims. Advantageous implementations of the present disclosure are further defined in the dependent claims.

In one aspect, the present disclosure provides a method of interpreting a phrase from a user, the method comprising: using a first machine learning module, trained as a classifier, to determine top K classes most relevant or similar to the phrase among a collection of classes, and using a second machine learning module, trained independently from the first machine learning module, to determine a top 1 result, the top 1 result being a class among the top K classes that is most relevant or similar to the phrase based on semantic similarity between the top K classes and the phrase. The method is based on two decoupled machine learning models or architectures (i.e., the first and second machine learning modules) which are trained independently of each other on different lexical and semantic features, resulting in an improved interpretation of the phrase from the user. Thus, the top K classes and the top 1 result are accurately produced corresponding to the phrase from the user. Beneficially, the method uses lexical and semantic similarities and relationships among user queries, pre-stored questions and their paraphrased versions to understand fine-grained differences and thereby provide enhanced one-shot accuracy. Further, the method provides improved accuracy for both monolingual and multi-lingual user queries. Beneficially, the method requires fewer model parameters than conventional methods, and thereby improves the functioning of an apparatus in which the method is employed in terms of computational resource usage efficiency as well as interpretation accuracy of the phrase from the user. In other words, recommendation accuracy is improved and inference time is significantly reduced in the retrieval of answers by the method, which in turn improves the interactive customer experience.

In an implementation form, the first machine learning module and the second machine learning module are based on deep learning architectures.

By virtue of the deep learning architectures, improved results i.e., higher accuracy of top K classes and the top 1 result are obtained.

In a further implementation form, the first machine learning module is trained to determine the top K classes most relevant or similar to the phrase based on latent lexical relationships between the classes and the phrase.

By virtue of the latent lexical relationships, the top K classes are determined with an improved accuracy based on the phrase from the user.

In a further implementation form, the second machine learning module has a probabilistic Siamese-network based architecture.

The probabilistic Siamese-network based architecture is used to capture fine-grained differences in semantic context between the questions and their possible variants, further improving the accuracy of the top 1 result.

In a further implementation form, the first machine learning module and the second machine learning module are based on a monolingual language model or a multilingual language model.

By virtue of the monolingual and multilingual models, a larger number of users can use the method of interpreting the phrase from the user. Further, there is improved accuracy with respect to answer provided based on multilingual user query.

In another aspect, the present disclosure provides an apparatus for interpreting a phrase from a user, the apparatus comprising: a first module configured to determine top K classes most relevant or similar to the phrase among a collection of classes, using a first machine learning architecture, trained as a classifier, and a second module configured to determine a top 1 result, the top 1 result being a class among the top K classes that is most relevant or similar to the phrase based on semantic similarity between the top K classes and the phrase, using a second machine learning architecture, trained independently from the first machine learning architecture.

The apparatus achieves all the advantages and effects of the method of the present disclosure.

It is to be appreciated that all the aforementioned implementation forms can be combined. It has to be noted that all devices, elements, circuitry, units and means described in the present application could be implemented in software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application, as well as the functionalities described to be performed by the various entities, are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear to a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof. It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims. Additional aspects, advantages, features and objects of the present disclosure will be made apparent from the drawings and the detailed description of the illustrative implementations construed in conjunction with the appended claims that follow.

BRIEF DESCRIPTION OF THE DRAWINGS

The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.

Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:

FIG. 1 is a flowchart of a method of interpreting a phrase from a user, in accordance with an embodiment of the present disclosure;

FIG. 2A is a block diagram of an apparatus for interpreting a phrase from a user, in accordance with an embodiment of the present disclosure;

FIG. 2B is a block diagram that illustrates various exemplary components of an apparatus for interpreting a phrase from a user, in accordance with an embodiment of the present disclosure;

FIG. 3 is an illustration of training of first and second machine learning architectures of an apparatus for interpreting a phrase from a user, in accordance with an embodiment of the present disclosure;

FIG. 4 is an illustration of inference of first and second machine learning architectures of an apparatus for interpreting a phrase from a user, in accordance with an embodiment of the present disclosure;

FIG. 5 is a graphical representation that depicts results of interpretation of a phrase from a user, in accordance with an embodiment of the present disclosure; and

FIG. 6 is a graphical representation that depicts results of interpretation of a phrase from a user, in accordance with another embodiment of the present disclosure.

In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the non-underlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.

DETAILED DESCRIPTION OF EMBODIMENTS

The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practicing the present disclosure are also possible.

FIG. 1 is a flowchart of a method of interpreting a phrase from a user, in accordance with an embodiment of the present disclosure. With reference to FIG. 1, there is shown a method 100. The method 100 is executed at an apparatus described, for example, in FIG. 2A. The method 100 includes steps 102 and 104.

In one aspect, the present disclosure provides a method 100 of interpreting a phrase from a user, the method 100 comprising: using a first machine learning module, trained as a classifier, to determine top K classes most relevant or similar to the phrase among a collection of classes, and using a second machine learning module, trained independently from the first machine learning module, to determine a top 1 result, the top 1 result being a class among the top K classes that is most relevant or similar to the phrase based on semantic similarity between the top K classes and the phrase.

The method 100 of the present disclosure enables interpretation of the phrase from the user. The phrase from the user may include a query or question of the user. In other words, the method 100 of the present disclosure enables understanding of a user's intent or sentence classification. Interpreting the phrase from the user is required in order to provide suitable answers to the question or query of the user based on pre-stored questions and respective answers. In an example, the pre-stored questions and respective answers may refer to the Frequently Asked Questions (FAQ) section of a document or website. Thus, the method 100 retrieves suitable answers from the FAQ section of a document or website based on the user query. In an example, the present method 100 is described in the context of FAQ retrieval; however, the method 100 can be extended to other classification solutions where textual labels contain semantic information.

The method 100 comprises receiving the phrase from the user. In an example, the phrase may be received via manual typing by the user or via voice-based input. In another example, the phrase may be in the form of a single word or multiple words (i.e., a sentence). The phrase from the user is then interpreted and understood in order to provide suitable answers.

At step 102, the method 100 comprises using a first machine learning module, trained as a classifier, to determine top K classes most relevant or similar to the phrase among a collection of classes. The first machine learning module determines the top K classes to enable retrieving the questions and corresponding answers present in the FAQ which are most relevant or similar to the phrase from the user. Beneficially, determining the top K classes reduces the search space via a classification that does not take semantic relationships into account. A reduced search space thus enables faster interpretation of the phrase from the user. In an example, the collection of classes herein refers to a classification of different original questions in the FAQ and their respective paraphrased versions. The first machine learning module is trained as a classifier to learn relationships (for example, latent lexical relationships) between the questions (i.e., original questions) in the FAQ (also referred to as QU) and their respective paraphrased variants (also referred to as EQ or extended questions), for generating the top K most relevant or similar questions within the collection of questions in the FAQ.

At step 104, the method 100 further comprises using a second machine learning module, trained independently from the first machine learning module, to determine a top 1 result. The top 1 result is the class among the top K classes that is most relevant or similar to the phrase based on semantic similarity between the top K classes and the phrase. The top K classes determined by the first machine learning module are fed to the second machine learning module to capture fine-grained differences in semantic context between the questions and their possible variants (i.e., proxies for user queries), further improving the accuracy of the final top 1 result. The top 1 result is determined by the second machine learning module by selecting, from the reduced search space (i.e., among the top K classes), the most semantically similar class. The top 1 result includes the question which is most similar to the question (i.e., the phrase) provided by the user, and thus the answer associated with the question in the top 1 result is suitable for the phrase provided by the user. In an example, as the first machine learning module and the second machine learning module are trained independently, these two learning modules may collectively be referred to as a Decoupled Training Architecture for FAQ Retrieval (DTAFA). By virtue of the independent training of the first machine learning module and the second machine learning module, there is an enhanced top 1 result accuracy (or one-shot accuracy). Thus, the overall quality of the interactive automated user experience is also improved.
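By way of a non-limiting illustration, the following Python sketch mimics the two-stage flow of steps 102 and 104 on toy data, with a logistic-regression classifier standing in for the first machine learning module and cosine similarity standing in for the Siamese second module; every data item and model choice here is an illustrative assumption, not the disclosed implementation.

```python
# Minimal sketch of the two-stage DTAFA-style flow, assuming toy data
# and scikit-learn stand-ins for the two independently trained modules.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

# Toy FAQ: each original question (QU) is one class.
faq_questions = [
    "How do I reset my password?",
    "How can I change my email address?",
    "Where do I view my billing history?",
]
# Paraphrased variants (EQ) used to train the first-stage classifier.
extended = [
    ("I forgot my password, what now?", 0),
    ("Need to recover my login password", 0),
    ("Update the email on my account", 1),
    ("Switch to a different email address", 1),
    ("Show me my past invoices", 2),
    ("Where are my previous bills?", 2),
]

vec = TfidfVectorizer().fit([q for q, _ in extended] + faq_questions)
X = vec.transform([q for q, _ in extended])
y = [label for _, label in extended]

# Step 102: first module, trained as a classifier over the classes.
clf = LogisticRegression(max_iter=1000).fit(X, y)

def top_k_classes(phrase, k=2):
    probs = clf.predict_proba(vec.transform([phrase]))[0]
    return np.argsort(probs)[::-1][:k]

# Step 104: a second, independently trained module re-ranks the top K
# by semantic similarity (cosine similarity stands in here for the
# patent's Siamese network).
def top_1_result(phrase, k=2):
    candidates = top_k_classes(phrase, k)
    q_vec = vec.transform([phrase])
    cand_vecs = vec.transform([faq_questions[c] for c in candidates])
    sims = cosine_similarity(q_vec, cand_vecs)[0]
    return faq_questions[candidates[int(np.argmax(sims))]]

print(top_1_result("how to get back into my account after losing my password"))
```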

Beneficially, the DTAFA has improved performance in comparison to conventional pre-trained contextualized language model based classifiers, as they do not use any label context. Further, the DTAFA has improved performance in comparison to conventional models which use only semantic similarity based techniques that do not learn lexical similarities between samples belonging to the same class. Further, the DTAFA requires fewer model parameters compared to existing deep learning architectures (e.g., contextualized language models like Bidirectional Encoder Representations from Transformers (BERT) and the Robustly Optimized Bidirectional Encoder Representations from Transformers approach (RoBERTa)). Thus, the present method 100 can be easily deployed in production environments, which has a direct impact on inference time.

According to an embodiment, in the method 100, the first machine learning module and the second machine learning module are based on deep learning architectures. In other words, the first machine learning module and the second machine learning module are trained based on artificial neural networks. By virtue of this training, improved results, i.e., higher accuracy of the top K classes and the top 1 result, are obtained. Examples of deep learning architectures include but are not limited to deep neural networks, deep belief networks, graph neural networks, recurrent neural networks and convolutional neural networks. An example of machine learning architectures is described in FIG. 2A.

According to an embodiment, in the method 100, the first machine learning module is trained to determine the top K classes most relevant or similar to the phrase based on latent lexical relationships between the classes and the phrase. The first machine learning module is trained as a classifier to determine the latent lexical relationships between the collection of classes and the phrase provided by the user, for generating the top K most relevant or similar questions within the collection of questions in the FAQ. Since the collection of classes includes the questions in the FAQ or extended/paraphrased versions of the questions in the FAQ, and the phrase also includes a question, a latent lexical relationship can be determined between them. By virtue of the latent lexical relationships, the top K classes are determined with an improved accuracy based on the phrase from the user.

According to an embodiment, in the method 100, the first machine learning module comprises one of a neural network, a support vector machine, SVM, and a naive Bayes classifier. In an example, the support vector machine is used for classification of the original questions of the FAQ and the extended questions of the questions in the FAQ. In another example, the naive Bayes classifier is used to identify differences between the original questions of the FAQ and the extended questions of the questions in the FAQ. By virtue of such classifiers in the first machine learning module, there is improved accuracy in the top K classes generated by the first machine learning module.
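As a hedged illustration of these interchangeable classifier choices, the sketch below fits both an SVM and a naive Bayes classifier on the same toy EQ-to-QU data; the sentences, labels and TF-IDF features are assumptions introduced purely for illustration.

```python
# Illustrative sketch: interchangeable first-stage classifiers.
# The data and feature choice are assumptions, not the patent's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

extended = [
    ("I forgot my password", 0),
    ("recover my login password", 0),
    ("update my email address", 1),
    ("change the email on my account", 1),
]
vec = TfidfVectorizer().fit([q for q, _ in extended])
X = vec.transform([q for q, _ in extended])
y = [label for _, label in extended]

# SVM variant: decision_function scores can rank candidate classes.
svm = SVC().fit(X, y)
# Naive Bayes variant: predict_proba gives per-class probabilities
# directly, usable for selecting the top K classes.
nb = MultinomialNB().fit(X, y)

query = vec.transform(["lost my password"])
print(svm.decision_function(query))  # signed margin score(s)
print(nb.predict_proba(query))       # class probabilities
```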

According to an embodiment, in the method 100, the second machine learning module has a probabilistic Siamese-network based architecture. In particular, the second machine learning module may comprise a probabilistic Siamese Long Short-Term Memory (LSTM) based architecture. The probabilistic Siamese-network based architecture is used to capture fine-grained differences in semantic context between the questions and their possible variants (i.e., proxies for user queries), further improving the accuracy of the final top 1 result.

According to an embodiment, in the method 100, the first machine learning module and the second machine learning module are based on a monolingual language model or a multilingual language model. In an example, as the first machine learning module and the second machine learning module are trained independently, i.e., are decoupled trained architectures, better accuracy is achieved for both monolingual and multi-lingual models compared to existing techniques. By virtue of the monolingual and multi-lingual models, a larger number of users can use the method 100 of interpreting the phrase from the user in comparison to conventional techniques. Further, there is improved accuracy with respect to the answer provided for a multilingual user query, in comparison to conventional methods where unsuitable answers are provided for multilingual user queries.

According to an embodiment, the first machine learning module and the second machine learning module are based on Language-Agnostic Sentence Representations (LASER) embeddings. By virtue of the LASER embeddings, the first machine learning module and the second machine learning module can support multilingualism and zero-shot learning for scaling to other languages. The LASER embeddings are language-independent representations in which similar sentences are mapped onto nearby points in the vector space (in terms of cosine distance), regardless of the input language. Beneficially, the LASER embeddings can generalize to languages belonging to the same family without explicitly training on them. Thus, in the present method 100, the training dataset is structured so as to bridge the gap between some low-resource and distant languages required in a production environment. Hence, in comparison to the conventional approach of training on only one language and performing zero-shot learning on the others, the present disclosure uses three languages, such as English, Spanish and Chinese, for training.
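Since the modules are built on LASER sentence embeddings, the following sketch shows how such language-agnostic vectors can be obtained with the third-party laserembeddings Python package (its models must first be downloaded with `python -m laserembeddings download-models`). The sentences and the cosine-similarity check are illustrative; the patent does not prescribe this particular package.

```python
# Sketch of language-agnostic matching with LASER sentence embeddings,
# assuming the third-party `laserembeddings` package and its models.
import numpy as np
from laserembeddings import Laser

laser = Laser()
# The same intent expressed in English and Spanish.
vecs = laser.embed_sentences(
    ["How do I reset my password?", "¿Cómo restablezco mi contraseña?"],
    lang=["en", "es"],
)  # shape: (2, 1024) -- the 1024-dim vectors referenced later in the text

# Similar sentences map to nearby vectors regardless of input language.
cos = np.dot(vecs[0], vecs[1]) / (
    np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1])
)
print(f"cross-lingual cosine similarity: {cos:.3f}")
```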

Beneficially, the method 100 can be implemented as a foundation framework for dialogue systems, chat bots, FAQ retrieval systems, and Natural Language Understanding (NLU) systems. The method 100 achieves beyond conventional performance while using about 20 times fewer parameters than the most recent conventional models. The method 100 can be used on top of any underlying multilingual embedding, such as LASER or a future multilingual language model. The method 100 uses multiple languages for training at the same time, which helps it scale easily to other languages. The method 100 relies on zero-shot learning; however, the framework is also compatible with few-shot learning approaches for other languages as needed. In the method 100, one model can be used for several languages, which reduces the overall cost of interpreting the phrase from the user. Further, in the method 100, no machine translation is required. Further, in the method 100, only the original languages (for example, English or Chinese) are used for training.

The performance on a monolingual dataset of conventional techniques, such as semantic similarity based models, context-free language models and contextualized language models, and of the Decoupled Training Architecture for FAQ Retrieval (DTAFA) used in the present disclosure, is shown in Table 1. The DTAFA-EN variant provides the highest accuracy compared to the conventional techniques. In the table, the conventional approach MaLSTM refers to Manhattan LSTM, SBERT refers to Sentence Bidirectional Encoder Representations from Transformers, TF-IDF refers to Term Frequency Inverse Document Frequency, GloVe refers to Global Vectors, ULMFiT refers to Universal Language Model Fine-Tuning, ELMo refers to Embeddings from Language Models, BERT refers to Bidirectional Encoder Representations from Transformers, and RoBERTa refers to the Robustly Optimized Bidirectional Encoder Representations from Transformers approach.

Table 1

The method 100 is based on two decoupled machine learning or deep learning model architectures (i.e., the first and second machine learning modules) which are trained in an improved manner on different lexical and semantic features, resulting in an improved interpretation of the phrase from the user. Thus, the top K classes and the top 1 result are accurately produced corresponding to the phrase from the user. Beneficially, the method 100 uses lexical and semantic similarities and relationships among user queries, pre-stored questions and their paraphrased versions to understand fine-grained differences and provide enhanced one-shot accuracy. Further, the method 100 provides improved accuracy for both monolingual and multi-lingual user queries. Beneficially, the method 100 requires fewer model parameters compared to conventional methods, thereby improving the functioning of an apparatus (in which the method 100 is employed) in terms of computational resource usage efficiency as well as interpretation accuracy of the phrase from the user. In other words, recommendation accuracy is improved and inference time is significantly reduced in the retrieval of answers by the method 100, which in turn improves the interactive customer experience.

In another aspect, there is provided a computer readable medium that is configured to store instructions which, when executed by a processor, cause the processor to execute the method 100. In another aspect, a computer program product is provided comprising a non-transitory computer-readable storage medium having instructions stored thereon, the instructions being executable by a processor to execute the method 100. In another aspect, a computer program is provided to execute the method 100. Examples of implementation of the non-transitory computer-readable storage medium include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, a Secure Digital (SD) card, Solid-State Drive (SSD), a computer readable storage medium, and/or CPU cache memory. A computer readable medium may include, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.

FIG. 2A is a block diagram of an apparatus for interpreting a phrase from a user, in accordance with an embodiment of the present disclosure. With reference to FIG. 2A, there is shown a block diagram 200A of an apparatus 202. The apparatus 202 includes a first module 204, a second module 206, a first machine learning architecture 208 and a second machine learning architecture 210.

In another aspect, the present disclosure provides an apparatus 202 for interpreting a phrase from a user, the apparatus 202 comprising: a first module 204 configured to determine top K classes most relevant or similar to the phrase among a collection of classes, using a first machine learning architecture 208, trained as a classifier, and a second module 206 configured to determine a top 1 result, the top 1 result being a class among the top K classes that is most relevant or similar to the phrase based on semantic similarity between the top K classes and the phrase, using a second machine learning architecture 210, trained independently from the first machine learning architecture 208.

The apparatus 202 includes suitable logic, circuitry, and interfaces that are configured for interpreting the phrase from the user. The phrase from the user may refer to a query or question of the user. In other words, the apparatus 202 of the present disclosure enables understanding of a user's intent or sentence classification. Interpreting the phrase from the user is required in order to provide suitable answers to the question or query of the user based on pre-stored questions and respective answers. In an example, the pre-stored questions and respective answers may refer to the Frequently Asked Questions (FAQ) section of a document or website. Thus, the apparatus 202 is configured to retrieve suitable answers from the FAQ section of a document or website based on the user query. Examples of the apparatus 202 include, but are not limited to, servers, cloud servers, data storage servers, datacenters including one or more hard disks, and the like.

The first module 204 includes suitable logic, circuitry, and interfaces that are configured to determine the top K classes most relevant or similar to the phrase provided by the user, using the first machine learning architecture 208. The second module 206 includes suitable logic, circuitry, and interfaces that are configured to determine the top 1 result among the top K classes most relevant or similar to the phrase provided by the user, using the second machine learning architecture 210. The first machine learning module and the second machine learning module in FIG. 1 are based on machine learning architectures such as the first machine learning architecture 208 and the second machine learning architecture 210. In this context, the first module 204 and the second module 206 can be considered analogues of the first machine learning module and the second machine learning module in FIG. 1.

In operation, the apparatus 202 is configured for receiving the phrase from the user. In an example, the phrase may be received via manual typing by the user or by providing voice based input. In another example, the phrase may be in form of a single word or multiple words (i.e., a sentence). The phrase from the user is further processed for interpretation and understanding of the user’s intent so as to provide a suitable feedback or answer.

The apparatus 202 comprises the first module 204 that is configured to determine top K classes most relevant or similar to the phrase among a collection of classes, using the first machine learning architecture 208, trained as a classifier. The first machine learning architecture 208 comprises determining the top K classes to enable retrieving questions and corresponding answers present in the FAQ which are most relevant or similar to the phrase from the user. Beneficially, by virtue of determining the top K classes, a search space is reduced via such classification which does not take into account semantic relationships. Thus, a reduced search space enables in faster interpretation of phrase from the user. In an example, the collection of classes herein refers to a classification of different original questions in the FAQ and their respective paraphrased versions.

The apparatus 202 further comprises the second module 206 that is configured to determine a top 1 result. The top 1 result is the class among the top K classes that is most relevant or similar to the phrase based on semantic similarity between the top K classes and the phrase. The top 1 result is determined using the second machine learning architecture 210, trained independently from the first machine learning architecture 208. The top K classes determined by the first machine learning architecture 208 are fed to the second machine learning architecture 210 to capture fine-grained differences in semantic context between the questions and their possible variants (i.e., proxies for user queries), further improving the accuracy of the final top 1 result. The top 1 result is determined by the second machine learning architecture 210 by selecting, from the reduced search space (i.e., among the top K classes), the most semantically similar class. The top 1 result includes the question which is most similar to the question (i.e., the phrase) provided by the user, and thus the answer associated with the question in the top 1 result is suitable for the phrase provided by the user. In an example, as the first machine learning architecture 208 and the second machine learning architecture 210 are trained independently, these architectures may collectively be referred to as a Decoupled Training Architecture for FAQ Retrieval (DTAFA). By virtue of the independent training of the first machine learning architecture 208 and the second machine learning architecture 210, there is an enhanced top 1 result accuracy (or one-shot accuracy). Thus, the overall quality of the interactive automated user experience is also improved.

According to an embodiment, in the apparatus 202, the first machine learning architecture 208 and the second machine learning architecture 210 are deep learning architectures. In other words, the first machine learning architecture 208 and the second machine learning architecture 210 are trained based on artificial neural networks. By virtue of this training, improved results, i.e., higher accuracy of the top K classes and the top 1 result, are obtained.

According to an embodiment, in the apparatus 202, the first module 204 is trained to determine the top K classes most relevant or similar to the phrase based on latent lexical relationships between the classes and the phrase. The first module 204 is trained as a classifier to determine the latent lexical relationships between the collection of classes and the phrase provided by the user, for generating the top K most relevant or similar questions within the collection of questions in the FAQ. Since the collection of classes includes the questions in the FAQ or extended/paraphrased versions of the questions in the FAQ, and the phrase also includes a question, a latent lexical relationship can be determined between them.

According to an embodiment, in the apparatus 202, the first module 204 comprises one of a neural network, a support vector machine, SVM, and a naive Bayes classifier. The support vector machine is used for classification of the original questions of the FAQ and the extended questions of the original questions in the FAQ. The naive Bayes classifier is used to identify differences between the original questions of the FAQ and the extended questions of the original questions in the FAQ.

According to an embodiment, in the apparatus 202, the second machine learning architecture 210 comprises a probabilistic Siamese-network based architecture. In particular, it can be a probabilistic Siamese Long Short-Term Memory (LSTM) based architecture. The probabilistic Siamese-network based architecture is used to capture fine-grained differences in semantic context between the questions and their possible variants (i.e., proxies for user queries) for further improving the accuracy of the final top 1 result.

According to an embodiment, in the apparatus 202, the first module 204 and the second module 206 are based on a monolingual language model or a multilingual language model. In an example, as the first module 204 and the second module 206 are trained independently i.e., are decoupled trained architectures, a better accuracy is achieved for both monolingual and multi-lingual models compared to existing techniques. By virtue of the monolingual and multi-lingual models, a larger number of users can use the apparatus 202 of interpreting the phrase from the user in comparison to conventional techniques.

The apparatus 202 is based on two decoupled machine learning or deep learning model architectures (i.e., the first and second machine learning architectures) which are trained in an improved manner on different lexical and semantic features, resulting in an improved interpretation of the phrase from the user. Thus, the top K classes and the top 1 result are accurately produced corresponding to the phrase from the user. Beneficially, the apparatus 202 uses lexical and semantic similarities and relationships among user queries, pre-stored questions and their paraphrased versions to understand fine-grained differences and provide enhanced one-shot accuracy. Further, the apparatus 202 provides improved accuracy for both monolingual and multi-lingual user queries. Beneficially, the apparatus 202 requires fewer model parameters compared to conventional deep learning architectures, thereby improving the functioning of the apparatus 202 in terms of computational resource usage efficiency as well as interpretation accuracy of the phrase from the user. In other words, recommendation accuracy is improved and inference time is significantly reduced in the retrieval of answers by the apparatus 202, which in turn improves the interactive customer experience.

FIG. 2B is a block diagram that illustrates various exemplary components of an apparatus for interpreting a phrase from a user, in accordance with an embodiment of the present disclosure. FIG. 2B is described in conjunction with elements from FIG. 2A. With reference to FIG. 2B, there is shown a block diagram 200B of the apparatus 202. The apparatus 202 includes a processor 212, a transceiver 214 and a memory 216. The memory 216 includes the first module 204 and the second module 206. The processor 212 is configured to train the first machine learning architecture 208 of the first module 204 and the second machine learning architecture 210 of the second module 206, and to receive and interpret the phrase from the user. The processor 212 may perform all the functionalities of the apparatus 202. In an example, the processor 212 may be a general-purpose processor. Other examples of the processor 212 include, but are not limited to, a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set (RISC) processor, a very long instruction word (VLIW) processor, a central processing unit (CPU), a state machine, a data processing unit, and other processors or control circuitry.

The transceiver 214 includes suitable logic, circuitry, and interfaces that may be configured to communicate with one or more external devices, such as a user device, to receive the phrase from the user. Examples of the transceiver 214 include, but are not limited to, an antenna, a telematics unit, a radio frequency (RF) transceiver, one or more amplifiers, one or more oscillators, a digital signal processor, a coder-decoder (CODEC) chipset, and/or a subscriber identity module (SIM) card.

The memory 216 refers to a primary storage of the apparatus 202. The memory 216 includes suitable logic, circuitry, and interfaces that is configured to store the first module 204 and the second module 206. Examples of implementation of the memory 216 may include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, Solid-State Drive (SSD). The memory 216 may store an operating system and/or other program products (including one or more operation algorithms) to operate the apparatus 202.

In operation, the processor 212 of the apparatus 202 is configured to receive and process the phrase from the user to interpret the phrase for providing suitable answers to the questions of the user. The processor 212 is further configured to determine the top K classes most relevant or similar to the phrase among a collection of classes, and further to determine the top 1 result among the top K classes that is most relevant or similar to the phrase based on semantic similarity between the top K classes and the phrase.

FIG. 3 is an illustration of training of first and second machine learning architectures of an apparatus for interpreting a phrase from a user, in accordance with an embodiment of the present disclosure. FIG. 3 is described in conjunction with elements from FIG. 2A and 2B. With reference to FIG. 3, there is shown an illustration 300. There are shown an extended question (EQ) embedding 302, original question (QU) encoded labels 304, fully connected network hidden layers 306, dense layers 308A, 308B and 308C, top K labels 310, pairwise EQ-QU processing 312, a QU embedding 314, bi-directional Long Short-Term Memory (LSTM) layers 316A and 316B, single-output Long Short-Term Memory (LSTM) layers 318A and 318B, a merge function 320 and a linear activation function 322.

The training of the first machine learning architecture such as the first machine learning architecture 208 of FIG. 2A and the second machine learning architecture such as the second machine learning architecture 210 of FIG. 2A may include three phases i.e., a first phase, a second phase and a third phase. In an example, the first machine learning architecture 208 is trained as a classifier to learn latent lexical relationships between questions (i.e., original questions) in the FAQ (also referred to as QU) and their respective paraphrased variants (also referred to as EQ or extended questions) for generating top K most relevant or similar questions within the collection of questions in the FAQ.

The first phase includes modelling the latent lexical and semantic similarities between the re-formulated extended questions (EQ), i.e., the EQ embedding 302, and the original questions in the FAQ (QU), i.e., the QU encoded labels 304. The different paraphrased or re-formulated versions of a question capture the same underlying intent in diverse lexical formulations, providing the apparatus, such as the apparatus 202 of FIG. 2A, with a wider view of how different users might express the same intent or query. Thus, in the first phase, the present disclosure learns to map the extended questions, i.e., the EQ embedding 302, to their corresponding original question, i.e., the QU encoded labels 304, formulated as a classification task based on the semantic similarities between EQ and QU.

Thus, the first machine learning architecture 208 is trained to map the extended questions to their corresponding original question, to produce the collection of classes based on the semantic similarities between the EQ and the QU. In an example, the first machine learning architecture 208 comprises a fully connected neural network with two hidden layers, such as the fully connected network hidden layers 306, with the extended questions (in embedded vector representation) as inputs and the original questions (encoded as class labels) as outputs. A resulting input matrix R (m×n), where m is the number of samples in a dataset and n = 1024 is the vector length of a LASER embedding, is passed through the fully connected neural network with two hidden layers of 700 units each and a ReLU (i.e., rectified linear) activation function. A final layer, i.e., the dense layer 308A, uses an activation function to output a classification probability corresponding to 336 different intent/query categories (QU labels). In an example, a dropout factor of 0.5 across all layers is used. The first machine learning architecture 208 is trained for 400 epochs using a batch size of 32. Further, the learning rate is reduced by a factor of 0.5 with a patience of 40 epochs on the validation loss. In an example, sparse categorical cross-entropy is used as the loss function and the Adam algorithm is used as the model optimizer. Further, the total number of trainable parameters is 1.5 million.
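The description above is concrete enough to reconstruct the classifier as a Keras sketch; the snippet below follows it, with the softmax output activation and the specific ReduceLROnPlateau callback being assumptions inferred from the text rather than confirmed details.

```python
# Hedged Keras reconstruction of the first-stage classifier as described:
# 1024-dim LASER inputs, two 700-unit ReLU hidden layers, 0.5 dropout,
# a 336-way output layer, sparse categorical cross-entropy, and Adam.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_first_stage(num_classes=336, embed_dim=1024):
    model = models.Sequential([
        # LASER embedding of an extended question (EQ) as input.
        layers.Dense(700, activation="relu", input_shape=(embed_dim,)),
        layers.Dropout(0.5),
        layers.Dense(700, activation="relu"),
        layers.Dropout(0.5),
        # Final dense layer outputs a classification probability per QU
        # label (softmax is an assumption; the text says only
        # "an activation function").
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Training as described: 400 epochs, batch size 32, learning rate halved
# after 40 epochs without validation-loss improvement.
# model = build_first_stage()
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=400, batch_size=32,
#           callbacks=[tf.keras.callbacks.ReduceLROnPlateau(
#               monitor="val_loss", factor=0.5, patience=40)])
```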

The first machine learning architecture 208 may comprise an EQ-EQ classification module (not shown) for the first phase and an EQ-QU classification module (not shown) for the second phase.

In the second phase, top K labels 310 for the extended questions (EQ) are generated. The vector representations of the paraphrased questions, EQ, are again fed to the classifier trained in the first phase, to obtain the top K QU labels for each of the EQ, along with the classification probability score. For this second phase, since the input to the classifier is the exact data on which it had been trained, a high classification score is obtained for the correct QU category and lower scores for the remaining classes. The second phase identifies different classes of user questions (or intents) that are semantically very close. These top K identified classes are further used to find fine-grained differences among the classes.

Beneficially, the EQ-QU classification module acts as a natural label smoothing mechanism, preventing the apparatus 202 from over-fitting and consequently improving performance and generalizability across domains and languages. Further, in the pairwise EQ-QU processing 312, for each extended question EQi (in the training dataset), the apparatus 202 generates Qi, the set of top K queries (QU) returned by the classifier in the first phase as possible matching class questions (or intents). Pi represents the classification probabilities associated with the class questions Qi. Thus, for each EQi, a set of K 3-tuples T = {⟨EQi, Qij, Pij⟩} (j ∈ [1, K]) is obtained, where Qij is the j-th element in Qi and Pij is its associated classification probability. In other words, the 3-tuple ⟨EQi, Qij, Pij⟩ represents that the question Qij in the FAQ collection (QU) is identified by the EQ-EQ classifier as a possible matching class (for the extended question EQi) with a classification score of Pij. The set of 3-tuples T for all the pairwise EQ-QU classes extracted from the FAQ collection is constructed and forms the input to the next stage.
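A sketch of this pairwise EQ-QU processing follows; `classifier` (a Keras-style model returning class probabilities), `eq_vectors`, `eq_texts` and `faq_questions` are hypothetical names introduced for illustration only.

```python
# Sketch: pack the top-K classifier outputs into <EQ_i, Q_ij, P_ij>
# 3-tuples for the second stage. Names and signatures are assumptions.
import numpy as np

def build_tuples(eq_vectors, eq_texts, classifier, faq_questions, k=5):
    """Return the 3-tuple set T used to train the similarity module."""
    tuples = []
    probs = classifier.predict(eq_vectors)      # (num_EQ, num_classes)
    for i, row in enumerate(probs):
        top_k = np.argsort(row)[::-1][:k]       # top-K class indices
        for j in top_k:
            # <EQ_i, Q_ij, P_ij>: extended question, candidate FAQ
            # question, and its classification probability score.
            tuples.append((eq_texts[i], faq_questions[j], float(row[j])))
    return tuples
```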

In the third phase, the second machine learning architecture 210 is used to assess semantic similarities and learn fine-grained differences among the above identified classes. Hence, for a class 3-tuple ⟨EQi, Qij, Pij⟩ ∈ T, the vector representation of the EQ-QU question pair (EQi, Qij) is given as input and the second machine learning architecture 210 is trained as a regression model with the associated probability score Pij treated as output. The Siamese network of the second machine learning architecture 210 comprises two branches, each with a masking layer, i.e., the EQ embedding 302 and the QU embedding 314, followed by the bi-directional LSTM layers 316A and 316B. An incorporation of intermediate representations across the branches enables increased context flow between them, which positively impacts the overall parameter updating process. Further, multiplication and subtraction layers, i.e., the single-output LSTM layers 318A and 318B, between the outputs of the branches from the bi-directional LSTM layers 316A and 316B are used to capture more variations between the paired sentences, thus fine-tuning the semantic similarity captured by the pre-trained language model (such as LASER). Beneficially, intermediate layers such as the dense layers 308B and 308C before a concatenation layer, i.e., the merge function 320, help avoid the vanishing gradient problem by allowing more gradient to flow. Further, the concatenation layer, i.e., the merge function 320, is followed by one hidden layer with a ReLU activation function. Further, the linear activation function 322 is used for regression based prediction.
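The following Keras sketch is a hedged reconstruction of this third phase: two branches with masking and a shared bi-directional LSTM, multiplication and subtraction interaction layers, dense layers before the merge, a ReLU hidden layer, and a linear output regressed against Pij. The layer sizes, sequence length, and exact wiring are assumptions; the figure's full topology is not reproduced here.

```python
# Hedged sketch of the Siamese regression network, assuming padded
# token-embedding sequences as inputs and illustrative layer sizes.
from tensorflow.keras import layers, models

def build_similarity_model(max_len=50, embed_dim=1024, units=64):
    eq_in = layers.Input(shape=(max_len, embed_dim), name="eq_sequence")
    qu_in = layers.Input(shape=(max_len, embed_dim), name="qu_sequence")

    mask = layers.Masking(mask_value=0.0)
    # One layer instance shared by both branches (Siamese weight sharing).
    bilstm = layers.Bidirectional(layers.LSTM(units))
    eq_vec = bilstm(mask(eq_in))
    qu_vec = bilstm(mask(qu_in))

    # Interaction layers capturing variation between the paired sentences.
    mul = layers.Multiply()([eq_vec, qu_vec])
    sub = layers.Subtract()([eq_vec, qu_vec])

    # Dense layers before the merge help gradient flow.
    d1 = layers.Dense(units, activation="relu")(mul)
    d2 = layers.Dense(units, activation="relu")(sub)

    merged = layers.Concatenate()([eq_vec, qu_vec, d1, d2])
    hidden = layers.Dense(units, activation="relu")(merged)
    # Linear output: regress the classification probability P_ij.
    out = layers.Dense(1, activation="linear")(hidden)

    model = models.Model([eq_in, qu_in], out)
    model.compile(optimizer="adam", loss="mse")
    return model
```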

FIG. 4 is an illustration of inference of first and second machine learning architectures of an apparatus for interpreting a phrase from a user, in accordance with an embodiment of the present disclosure. FIG. 4 is described in conjunction with elements from FIG. 2A and 2B. With reference to FIG. 4 there is shown an illustration 400. There is shown a new query 402, a trained EQ-EQ classifier 404, top K labels 406, a trained EQ-EQ semantic Siamese-based architecture 408, Argmax top K labels 410, a predicted query 412 and a recommended answer 414.

A new user query, such as the new query 402, is received by the apparatus, such as the apparatus 202 of FIG. 2A. The first machine learning architecture 208 and the second machine learning architecture 210 retrieve the most relevant answer from the FAQ collection, based on the trained architecture described in FIG. 3. The inference (i.e., the online interactive component) of the first machine learning architecture 208 and the second machine learning architecture 210 follows a flow similar to that of the training process shown in FIG. 3. The new query 402 is initially represented in a high-dimensional vector space using a multi-lingual embedding, and is subsequently fed to the trained EQ-EQ classifier 404, which extracts the top K best matching questions (QU), i.e., the top K labels 406, from the FAQ collection along with their classification scores. The new query 402 and the identified similar questions, along with their classification scores, are used to generate a list of 3-tuples. The 3-tuples are fed to a pre-trained EQ-QU similarity module, i.e., the trained EQ-EQ semantic Siamese-based architecture 408, and the question with the highest output score, i.e., the Argmax top K labels 410, is considered the best matching and most relevant FAQ for the new query 402. In other words, the predicted query 412 is the best match to the new query 402 and thus, the recommended answer 414 corresponding to the predicted query 412 is provided as output to the user.
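Under the same assumptions as the earlier sketches (a classifier exposing class probabilities and a trained similarity model), the inference flow of FIG. 4 can be sketched as follows; the function and variable names are hypothetical.

```python
# Sketch of the FIG. 4 inference flow: embed the new query, take the
# top-K candidates from the classifier, score each <query, candidate>
# pair with the similarity model, and return the argmax candidate's
# answer. All names here are illustrative assumptions.
import numpy as np

def answer_query(query_vec, query_seq, classifier, similarity_model,
                 faq_seqs, faq_answers, k=5):
    probs = classifier.predict(query_vec[None, :])[0]
    top_k = np.argsort(probs)[::-1][:k]          # top-K QU labels

    # Score every top-K candidate with the Siamese similarity module.
    scores = [
        float(similarity_model.predict(
            [query_seq[None, ...], faq_seqs[j][None, ...]],
            verbose=0)[0, 0])
        for j in top_k
    ]
    best = top_k[int(np.argmax(scores))]         # argmax over the top K
    return faq_answers[best]                     # recommended answer
```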

FIG. 5 is a graphical representation that depicts results of interpretation of a phrase from a user, in accordance with an embodiment of the present disclosure. With reference to FIG. 5, there is shown a graphical representation 500 of multilingual phrase interpretation by the apparatus 202 of the present disclosure and by a conventional technique.

The graphical representation 500 represents different languages on the X-axis 502 and accuracy (in percentage) on the Y-axis 504 for a first technique used by the apparatus 202 and a second technique (for example, Bidirectional Encoder Representations from Transformers (BERT)) used conventionally. As shown, for languages such as Italian, French, Portuguese, German, Catalan, Romanian, Russian, Japanese, Turkish and Arabic, the first technique used by the apparatus 202 of the present disclosure has close to 36 percent higher accuracy compared to the second technique used conventionally. Further, for the aforesaid languages, the present disclosure requires no training of models, no translation and no data collection.

FIG. 6 is a graphical representation that depicts results of interpretation of a phrase from a user, in accordance with another embodiment of the present disclosure. With reference to FIG. 6, there is shown a graphical representation 600 of phrase interpretation by the apparatus 202 of the present disclosure and by a conventional technique.

The graphical representation 600 represents different languages on the X-axis 602 and accuracy (in percentage) on the Y-axis 604 for a first technique used by the apparatus 202 and a second technique (for example, Multilingual Bidirectional Encoder Representations from Transformers (mBERT)) used conventionally. Languages such as English, Spanish and Arabic are used for training, and languages such as French, Italian, Portuguese, German, Japanese, Turkish, Catalan and Romanian are used for zero-shot evaluation. The languages used for zero-shot evaluation may be referred to as alliance data. The first technique used by the apparatus 202 of the present disclosure has close to 20 percent higher accuracy compared to the second technique used conventionally for the languages used for zero-shot evaluation.

Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as "including", "comprising", "incorporating", "have", "is" used to describe and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural. The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments. The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". It is appreciated that certain features of the present disclosure, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the present disclosure, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable combination or as suitable in any other described embodiment of the disclosure.