

Title:
SYSTEM AND METHOD FOR EFFICIENT ENSEMBLING OF NATURAL LANGUAGE INFERENCE
Document Type and Number:
WIPO Patent Application WO/2019/115200
Kind Code:
A1
Abstract:
Techniques disclosed herein relate to independent and dependent neural network models in an ensemble which can use features extracted from input sentence pairs to weigh networks within the ensemble. In various embodiments, data indicative of a premise and data indicative of a hypothesis are obtained, wherein the data indicative of the premise and the data indicative of the hypothesis form a natural language inference classification pair. For example, the data indicative of the premise and the data indicative of the hypothesis can be processed to extract a set of features (156), and a plurality of natural language inference models (154) can generate a plurality of natural language inference model outputs. The features and outputs can be used in a decision function (158) including an ensemble learning model that adds a weight to each model. Natural language inference classification labels (162) can be generated from the decision function.

Inventors:
GHAEINI REZA (NL)
AL HASAN SHEIKH (NL)
FARRI OLADIMEJI (NL)
Application Number:
PCT/EP2018/082259
Publication Date:
June 20, 2019
Filing Date:
November 22, 2018
Assignee:
KONINKLIJKE PHILIPS NV (NL)
International Classes:
G06F17/27; G06N3/04
Other References:
ZHIPENG XIE ET AL: "Max-Cosine Matching Based Neural Models for Recognizing Textual Entailment", LECTURE NOTES IN COMPUTER SCIENCE 10177 DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, 22 March 2017 (2017-03-22), Springer, Cham, pages 295 - 308, XP055554819, ISBN: 978-3-319-55753-3, Retrieved from the Internet [retrieved on 20190211]
YICHEN GONG ET AL: "Natural Language Inference over Interaction Space", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 13 September 2017 (2017-09-13), XP080820821
QIAN CHEN ET AL: "Enhanced LSTM for Natural Language Inference", PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (VOLUME 1: LONG PAPERS), 30 July 2017 (2017-07-30), Stroudsburg, PA, USA, pages 1657 - 1668, XP055555457, DOI: 10.18653/v1/P17-1152
ALEXIS CONNEAU ET AL: "Supervised Learning of Universal Sentence Representations from Natural Language Inference Data", PROCEEDINGS OF THE 2017 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, 21 July 2017 (2017-07-21), Stroudsburg, PA, USA, pages 670 - 680, XP055483765, ISBN: 978-1-945626-83-8, DOI: 10.18653/v1/D17-1070
ZHIGUO WANG ET AL: "Bilateral Multi-Perspective Matching for Natural Language Sentences", PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 19 August 2017 (2017-08-19), California, pages 4144 - 4150, XP055499056, ISBN: 978-0-9992411-0-3, DOI: 10.24963/ijcai.2017/579
Attorney, Agent or Firm:
VAN OUDHEUSDEN-PERSET, Laure, E. et al. (NL)
Claims:
CLAIMS

What is claimed is:

1. A method implemented with one or more processors, comprising:

obtaining (126) data indicative of a premise and data indicative of a hypothesis, wherein the data indicative of the premise and the data indicative of the hypothesis form a natural language inference classification pair (152);

processing (128) the data indicative of the premise and the data indicative of the hypothesis to extract a set of features (156), wherein the set of features comprises data indicative of a relationship between the data indicative of the premise and the data indicative of the hypothesis;

processing (130) the data indicative of the premise and the data indicative of the hypothesis using a plurality of natural language inference models (154) to generate a plurality of natural language inference models output;

processing (132) the set of features and the plurality of natural language inference models output in a decision function (158) including an ensemble learning model that adds a weight to each natural language inference model in the plurality of natural language inference models based on the set of features to generate a decision function output, wherein the set of features extracted from the data indicative of the premise and the data indicative of the hypothesis influence the weight assigned to each natural language inference model; and

generating (134) a natural language inference classification output (162) from the decision function output, wherein the natural language inference classification output is selected from the group consisting of entailment, neutral, and contradiction.

2. The method of claim 1, wherein the set of features is selected from the group consisting of a comparison of the length of the data indicative of the premise and the data indicative of the hypothesis, a comparison of the overlap between words in the data indicative of the premise and the data indicative of the hypothesis, lexical features in the data indicative of the premise, and lexical features in the data indicative of the hypothesis.

3. The method of claim 2, wherein lexical features are selected from the group consisting of a noun, a verb, an adjective, a pronoun, an adverb, a preposition, a conjunction, and an interjection.

4. The method of claim 1, wherein the plurality of natural language inference models further comprise a plurality of bidirectional long short term memory (Bi-LSTM) networks that independently and dependently read the data indicative of the premise and the data indicative of the hypothesis.

5. The method of claim 4, wherein each Bi-LSTM network in the plurality of Bi-LSTM networks is trained using varying training parameters, wherein the training parameters are selected from the group consisting of a number of dependent readings in the Bi-LSTM network, an activation function used by the Bi-LSTM, and an initialization seeding of the Bi-LSTM network.

6. The method of claim 1, wherein generating the natural language inference classification output from the decision function output further comprises feeding the decision function output into a softmax function to generate the natural language inference classification output.

7. The method of claim 1, wherein entailment indicates data indicative of the hypothesis is entailed by the data indicative of the premise in a natural language inference, wherein contradiction indicates the data indicative of the hypothesis is contradicted by the data indicative of the premise in the natural language inference, and wherein neutral indicates the data indicative of the hypothesis is not entailed or contradicted by the data indicative of the premise in the natural language inference.

8. The method of claim 1, further comprising preprocessing the data indicative of the premise and the data indicative of the hypothesis which form the natural language inference classification pair.

9. At least one non-transitory computer-readable medium comprising instructions that, in response to execution of the instructions by one or more processors, cause one or more processors to perform the following operations:

obtaining (126) data indicative of a premise and data indicative of a hypothesis, wherein the data indicative of the premise and the data indicative of the hypothesis form a natural language inference classification pair (152);

processing (128) the data indicative of the premise and the data indicative of the hypothesis to extract a set of features (156), wherein the set of features comprises data indicative of a relationship between the data indicative of the premise and the data indicative of the hypothesis;

processing (130) the data indicative of the premise and the data indicative of the hypothesis using a plurality of natural language inference models (154) to generate a plurality of natural language inference models output;

processing (132) the set of features and the plurality of natural language inference models output in a decision function (158) including an ensemble learning model that adds a weight to each natural language inference model in the plurality of natural language inference models based on the set of features to generate a decision function output, wherein the set of features extracted from the data indicative of the premise and the data indicative of the hypothesis influence the weight assigned to each natural language inference model; and

generating (134) a natural language inference classification output (162) from the decision function output, wherein the natural language inference classification output is selected from the group consisting of entailment, neutral, and contradiction.

10. The at least one non-transitory computer-readable medium of claim 9, wherein the set of features is selected from the group consisting of a comparison of the length of the data indicative of the premise and the data indicative of the hypothesis, a comparison of the overlap between words in the data indicative of the premise and the data indicative of the hypothesis, lexical features in the data indicative of the premise, and lexical features in the data indicative of the hypothesis.

11. The at least one non-transitory computer-readable medium of claim 10, wherein lexical features are selected from the group consisting of a noun, a verb, an adjective, a pronoun, an adverb, a preposition, a conjunction, and an interjection.

12. The at least one non-transitory computer-readable medium of claim 9, wherein the plurality of natural language inference models further comprise a plurality of bidirectional long short term memory (Bi-LSTM) networks that independently and dependently read the data indicative of the premise and the data indicative of the hypothesis.

13. The at least one non-transitory computer-readable medium of claim 12, wherein each Bi-LSTM network in the plurality of Bi-LSTM networks is trained using varying training parameters, wherein the training parameters are selected from the group consisting of a number of dependent readings in the Bi-LSTM network, an activation function used by the Bi-LSTM, and an initialization seeding of the Bi-LSTM network.

14. The at least one non-transitory computer-readable medium of claim 9, wherein generating the natural language inference classification output from the decision function output further comprises feeding the decision function output into a softmax function to generate the natural language inference classification output.

15. The at least one non-transitory computer-readable medium of claim 9, wherein entailment indicates data indicative of the hypothesis is entailed by the data indicative of the premise in a natural language inference, wherein contradiction indicates the data indicative of the hypothesis is contradicted by the data indicative of the premise in the natural language inference, and wherein neutral indicates the data indicative of the hypothesis is not entailed or contradicted by the data indicative of the premise in the natural language inference.

16. The at least one non-transitory computer-readable medium of claim 9, further comprising preprocessing the data indicative of the premise and the data indicative of the hypothesis which form the natural language inference classification pair.

17. A system comprising one or more processors and memory operably coupled with the one or more processors, wherein the memory stores instructions that, in response to execution of the instructions by one or more processors, cause the one or more processors to perform the following operations:

obtaining (126) data indicative of a premise and data indicative of a hypothesis, wherein the data indicative of the premise and the data indicative of the hypothesis form a natural language inference classification pair (152);

processing (128) the data indicative of the premise and the data indicative of the hypothesis to extract a set of features (156), wherein the set of features comprises data indicative of a relationship between the data indicative of the premise and the data indicative of the hypothesis;

processing (130) the data indicative of the premise and the data indicative of the hypothesis using a plurality of natural language inference models (154) to generate a plurality of natural language inference models output;

processing (132) the set of features and the plurality of natural language inference models output in a decision function (158) including an ensemble learning model that adds a weight to each natural language inference model in the plurality of natural language inference models based on the set of features to generate a decision function output, wherein the set of features extracted from the data indicative of the premise and the data indicative of the hypothesis influence the weight assigned to each natural language inference model; and

generating (134) a natural language inference classification output (162) from the decision function output, wherein the natural language inference classification output is selected from the group consisting of entailment, neutral, and contradiction.

18. The system of claim 17, wherein the set of features is selected from the group consisting of a comparison of the length of the data indicative of the premise and the data indicative of the hypothesis, a comparison of the overlap between words in the data indicative of the premise and the data indicative of the hypothesis, lexical features in the data indicative of the premise, and lexical features in the data indicative of the hypothesis.

19. The system of claim 17, wherein the set of features is selected from the group consisting of a comparison of the length of the data indicative of the premise and the data indicative of the hypothesis, a comparison of the overlap between words in the data indicative of the premise and the data indicative of the hypothesis, lexical features in the data indicative of the premise, and lexical features in the data indicative of the hypothesis.

20. The system of claim 17, wherein the plurality of natural language inference models further comprise a plurality of bidirectional long short term memory (Bi-LSTM) networks that independently and dependently read the data indicative of the premise and the data indicative of the hypothesis.

Description:
SYSTEM AND METHOD FOR EFFICIENT ENSEMBLING OF NATURAL LANGUAGE INFERENCE

Related Application

This application claims the benefit of and priority to U.S. Provisional No. 62/597,132, filed December 11, 2017, the entirety of which is incorporated by reference.

Technical Field

[0001] Various embodiments described herein are directed generally to natural language processing. More particularly, but not exclusively, various methods and apparatus disclosed herein relate to ensemble neural network methods for natural language inference.

Background

[0002] Natural Language Inference (NLI) is an important classification task in natural language processing (NLP). A system can be given a pair of sentences (e.g., premise and hypothesis), and the system classifies the pair of sentences with respect to three different classes: entailment, neutral, and contradiction. In other words, the classification of the pair of sentences conveys whether the hypothesis is entailed by the given premise, whether it is a contradiction, or whether it is otherwise neutral. Recognizing textual entailment can be an important step in many NLP applications, including automatic text summarizers and document simplifiers.

[0003] Information can be represented in different ways, with varying levels of complexity and/or ambiguity. NLI finds relationships, similarity, and/or alignment between sentences, which can simplify a document and/or remove redundant information (which can lead to confusion by a reader of the document). Reducing redundancy can additionally make the content of a document more focused and/or coherent. For example, reducing redundancy can make the essence of the information more meaningful to a reader. Existing NLI systems can use neural networks to classify the relationship (i.e., entailment, neutral, or contradiction) between a premise sentence and a hypothesis sentence. However, these techniques often rely on trivial ensemble strategies and use a majority voting method to determine the final classification label.

Summary

The present disclosure is directed to methods and apparatus for ensemble neural network methods which can utilize features extracted from natural language inference (NLI) premise and hypothesis sentence pairs in a decision function which can give weight to neural network models in the ensemble based on the received NLI premise and hypothesis sentence pair input. In other words, the input to the ensemble method, through extracted features, can influence the output of the ensemble network. In some embodiments, the neural network models within the ensemble can use both independent and dependent readings of the NLI premise and hypothesis sentence pairs for classification within each individual neural network model. Additional variations of independent and dependent reading neural network models can be included within the ensemble. In some embodiments, the classification accuracy of a neural network model within the ensemble can be related to the number of iterations of dependent readings of the NLI input sentence pair. For example, a less complex NLI input sentence pair might be better classified by a neural network model in the ensemble with fewer dependent readings, and a more complex NLI input sentence pair might be better classified by a neural network model in the ensemble with more dependent readings.

[0004] For example, in various embodiments, a deep learning based NLI neural network model within the ensemble method can classify the relationship between a pair of sentences with respect to generally three different classes: entailment, neutral, and contradiction. For example, in various embodiments, a premise sentence and a hypothesis sentence NLI pair can be classified by the independent and dependent readings of a deep learning neural network (e.g., recurrent neural networks, long short-term memory, or "LSTM," networks, etc.) with three classification labels: entailment, neutral, and contradiction.

[0005] Generally, in one aspect, a method may include: obtaining data indicative of a premise and data indicative of a hypothesis, wherein the data indicative of the premise and the data indicative of the hypothesis form a natural language inference classification pair; processing the data indicative of the premise and the data indicative of the hypothesis to extract a set of features, wherein the set of features comprises data indicative of a relationship between the data indicative of the premise and the data indicative of the hypothesis; processing the data indicative of the premise and the data indicative of the hypothesis using a plurality of natural language inference models to generate a plurality of natural language inference models output; processing the set of features and the plurality of natural language inference models output in a decision function including an ensemble learning model that adds a weight to each natural language inference model in the plurality of natural language inference models based on the set of features to generate a decision function output, wherein the set of features extracted from the data indicative of the premise and the data indicative of the hypothesis influence the weight assigned to each natural language inference model; and generating a natural language inference classification output from the decision function output, wherein the natural language inference classification output is selected from the group consisting of entailment, neutral, and contradiction.

[0006] In various embodiments, the method may further include the set of features which is selected from the group consisting of a comparison of the length of the data indicative of the premise and the data indicative of the hypothesis, a comparison of the overlap between words in the data indicative of the premise and the data indicative of the hypothesis, lexical features in the data indicative of the premise, and lexical features in the data indicative of the hypothesis.

[0007] In various embodiments, the method may further include lexical features which are selected from the group consisting of a noun, a verb, an adjective, a pronoun, an adverb, a preposition, a conjunction, and an interjection.

[0008] In various embodiments, the method may further include the plurality of natural language inference models further including a plurality of bidirectional long short term memory (Bi-LSTM) networks that independently and dependently read the data indicative of the premise and the data indicative of the hypothesis.

[0009] In various embodiments, the method may further include each Bi-LSTM network in the plurality of Bi-LSTM networks being trained using varying training parameters, wherein the training parameters are selected from the group consisting of a number of dependent readings in the Bi-LSTM network, an activation function used by the Bi-LSTM, and an initialization seeding of the Bi-LSTM network.

[0010] In various embodiments, the method may further include generating the natural language inference classification output of the decision function output further comprises feeding the decision function output into a softmax function to generate the natural language inference classification output.

[0011] In various embodiments, the method may further include entailment indicating data indicative of the hypothesis is entailed by the data indicative of the premise in a natural language inference, contradiction indicating the data indicative of the hypothesis is contradicted by the data indicative of the premise in the natural language inference, and neutral indicating the data indicative of the hypothesis is not entailed or contradicted by the data indicative of the premise in the natural language inference.

[0012] In various embodiments, the method further includes preprocessing the data indicative of the premise and the data indicative of the hypothesis which form the natural language inference classification pair.

[0013] In addition, some implementations include one or more processors of one or more computing devices, where the one or more processors are operable to execute instructions stored in associated memory, and where the instructions are configured to cause performance of any of the aforementioned methods. Some implementations also include one or more non-transitory computer readable storage media storing computer instructions executable by one or more processors to perform any of the aforementioned methods.

[0014] It should be appreciated that all combinations of the foregoing concepts and additional concepts discussed in greater detail below (provided such concepts are not mutually inconsistent) are contemplated as being part of the inventive subject matter disclosed herein. In particular, all combinations of claimed subject matter appearing at the end of this disclosure are contemplated as being part of the inventive subject matter disclosed herein. It should also be appreciated that terminology explicitly employed herein that also may appear in any disclosure incorporated by reference should be accorded a meaning most consistent with the particular concepts disclosed herein.

Brief Description of the Drawings

[0015] In the drawings, like reference characters generally refer to the same parts throughout the different views. Also, the drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating various principles of the embodiments described herein.

[0016] FIG. 1A is a flowchart illustrating an example process of performing selected aspects of the present disclosure, in accordance with various embodiments.

[0017] FIG. 1B is a flowchart illustrating another example process of performing selected aspects of the present disclosure, in accordance with various embodiments.

[0018] FIG. 1C is a diagram illustrating one example of ensemble methods in accordance with various embodiments.

[0019] FIG. 1D is a flowchart illustrating another example process of performing selected aspects of the present disclosure, in accordance with various embodiments.

[0020] FIG. 2 is a flowchart illustrating another example process of performing selected aspects of the present disclosure, in accordance with various embodiments.

[0021] FIGS. 3A - 3B are diagrams depicting one example of input encoding in accordance with various embodiments.

[0022] FIGS. 4A - 4B are diagrams depicting examples of attention in accordance with various embodiments.

[0023] FIG. 5 is a diagram illustrating one example of inference encoding in accordance with various embodiments.

[0024] FIG. 6 is a diagram illustrating one example of classification in accordance with various embodiments.

[0025] FIG. 7 is a diagram depicting an example computing system architecture.

Detailed Description

[0026] Many existing models can use simple reading mechanisms to encode the premise and hypothesis of a natural language inference (NLI) sentence pair using ensemble learning methods. However, in several embodiments, such a complex task can require a more sophisticated ensemble strategy to determine which NLI model within an ensemble system can perform the best for a specific input data sample, such as an NLI sentence pair. In several embodiments, NLI models within an ensemble strategy can assign a set of weights to the models within the system. In many embodiments, the weights can be determined by extracting a set of features (e.g., length, overlap, lexical features, etc.) from each NLI sentence pair the ensemble learning neural network model receives as input. The features can produce a set of weights for each NLI model within the ensemble system to help indicate which NLI model within the ensemble is more likely to give a more accurate classification of the input NLI sentence pair. In many embodiments, the features can be passed to a single-layer feed-forward neural network to produce the set of weights for each NLI model within the system. These weights can be used in a decision function to determine which NLI models in the ensemble system can provide a more accurate decision for the given NLI premise-hypothesis sentence input pair.
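As a rough sketch of the weighting strategy described above, the following Python fragment passes a feature vector through a single-layer feed-forward network to produce one softmax-normalized weight per NLI model, then combines the models' class distributions with those weights. This is a simplified illustration, not the disclosed implementation: the feature values, weight matrix, and model distributions are placeholders.

```python
import numpy as np

def ensemble_weights(features, W, b):
    # Single-layer feed-forward network mapping pair features to one
    # softmax-normalized weight per NLI model in the ensemble.
    logits = features @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

def weighted_decision(model_probs, weights):
    # Combine per-model class distributions with the feature-derived
    # weights into one ensemble distribution over
    # (entailment, neutral, contradiction).
    return weights @ model_probs  # (n_models,) @ (n_models, 3) -> (3,)

# Toy example: 4 extracted features, 3 NLI models in the ensemble.
rng = np.random.default_rng(0)
features = np.array([0.75, 0.4, 2.0, 3.0])  # e.g. length ratio, overlap, ...
W = rng.normal(size=(4, 3))                 # untrained weights, shapes only
b = np.zeros(3)

weights = ensemble_weights(features, W, b)
model_probs = np.array([[0.7, 0.2, 0.1],    # model 1's class distribution
                        [0.5, 0.3, 0.2],    # model 2
                        [0.6, 0.1, 0.3]])   # model 3
decision = weighted_decision(model_probs, weights)
label = ("entailment", "neutral", "contradiction")[int(decision.argmax())]
```

Because the weights form a convex combination, the ensemble distribution always sums to one; here every member favors entailment, so the ensemble does too.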

[0027] In many embodiments, NLI models can use a more explicit modeling of the dependency relationship between the premise and the hypothesis during the encoding and inference processes to prevent the loss of relevant contextual information in deep-learning networks. For simplicity, such strategies can be referred to as "dependent reading". While any of a variety of neural network models can be used as NLI models in the ensemble learning system, a dependent reading bidirectional long short term memory (Bi-LSTM) neural network model can be utilized in several embodiments. Bi-LSTM neural network models will be described in detail below.

[0028] In several embodiments, the NLI models including dependent reading of the data (e.g., Bi-LSTM neural network models) can vary the number of dependent readings of the input NLI sentence pair. Additional variations of NLI models used in ensemble learning systems in accordance with many embodiments can include: a single dependent reading of the input data (e.g., not using a dependent reading in the inference stage of the Bi-LSTM model described in detail below), multiple dependent readings of the input data (e.g., adding additional layers with dependent reading to the Bi-LSTM model described below, for example three dependent readings of the input data), varying the structure of the Bi-LSTM model (e.g., changing the activation function used in the NLI model, such as using a tanh activation function in some NLI models and a ReLU activation function in other NLI models), changing the initialization seeding of the NLI model, etc. In some embodiments, by combining different rounds of dependent reading of the premise and hypothesis data within the ensemble system, the features extracted from the premise-hypothesis pair can indicate whether a simple or more complex NLI model is a better fit for classifying the data sample. In other words, the model can determine how much dependent reading can be required for a specific data sample. For example, a simple premise-hypothesis sentence pair might be better classified by an NLI model with fewer rounds of dependent reading, while a more complex premise-hypothesis sentence pair might be better classified by an NLI model with more rounds of dependent reading.
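One way to picture these ensemble variations is as a configuration grid over the three axes named above. The specific counts, activation names, and seeds below are illustrative assumptions, not parameters taken from the disclosure.

```python
from itertools import product

# Hypothetical grid of ensemble-member configurations: each NLI model
# varies the number of dependent readings, the activation function, and
# the initialization seed, as described in the text.
ensemble_configs = [
    {"dependent_readings": rounds, "activation": act, "seed": seed}
    for rounds, act, seed in product([1, 2, 3], ["tanh", "relu"], [13, 42])
]
```

A 3 x 2 x 2 grid like this would yield twelve distinct ensemble members.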

[0029] Various techniques described herein utilize one or both of independent and dependent reading recurrent networks for natural language inference. For example, in a variety of embodiments, neural networks can perform an independent reading and a dependent reading of a premise and a hypothesis. In several embodiments, a dependent reading bidirectional long short term memory (DR-Bi-LSTM) element of a neural network model can be utilized. Given a premise u and a hypothesis v, various embodiments described herein may first encode the premise and the hypothesis independently and then encode them considering dependency on each other (i.e., encode the premise dependently with respect to the hypothesis, u|v, and encode the hypothesis dependently with respect to the premise, v|u).
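The independent-then-dependent reading order can be sketched as follows. To keep the example short, a minimal tanh RNN stands in for the Bi-LSTM encoder (an assumption, not the disclosed architecture); the key point is that the dependent pass re-reads each sentence starting from the other sentence's final state.

```python
import numpy as np

def rnn_read(seq, h0, Wx, Wh, b):
    # Minimal tanh RNN standing in for one direction of a Bi-LSTM.
    h = h0
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h

def dependent_encode(premise, hypothesis, Wx, Wh, b):
    zero = np.zeros(Wh.shape[0])
    # Independent readings: each sentence is encoded from a zero state.
    u = rnn_read(premise, zero, Wx, Wh, b)
    v = rnn_read(hypothesis, zero, Wx, Wh, b)
    # Dependent readings: each sentence is re-read starting from the
    # other's final state (u|v and v|u in the notation above).
    u_given_v = rnn_read(premise, v, Wx, Wh, b)
    v_given_u = rnn_read(hypothesis, u, Wx, Wh, b)
    return u_given_v, v_given_u

rng = np.random.default_rng(1)
d_in, d_hid = 5, 8
Wx = rng.normal(size=(d_hid, d_in)) * 0.1
Wh = rng.normal(size=(d_hid, d_hid)) * 0.1
b = np.zeros(d_hid)
premise = [rng.normal(size=d_in) for _ in range(4)]     # 4 token embeddings
hypothesis = [rng.normal(size=d_in) for _ in range(3)]  # 3 token embeddings
u_v, v_u = dependent_encode(premise, hypothesis, Wx, Wh, b)
```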

[0030] In many embodiments, the neural network model can employ an attention mechanism, for example a soft attention mechanism, to extract relevant information from these input encodings. In a variety of embodiments, the augmented sentence representations can then be passed to an inference encoding stage, which can use a similar independent and dependent reading strategy in both directions, i.e., u → v and v → u. In many embodiments, a classification decision, for example labeling the premise-hypothesis sentence pair with an entailment, neutral, or contradiction label, can be made through a multilayer perceptron (MLP) based on the aggregated information. In a variety of embodiments, neural network models to solve NLI problems can be divided into a variety of subsections including: input encoding, attention, inference encoding, and classification. In some embodiments, additional or alternative steps, for example a preprocessing step, can be added to any of the stages of the neural network model, including: input encoding, attention, inference encoding, and classification.
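A soft attention step of the kind mentioned above might look like the following sketch, which computes a soft alignment in both directions between premise and hypothesis token encodings. The dot-product scoring and the dimensions are illustrative assumptions, not details taken from the disclosure.

```python
import numpy as np

def soft_attention(U, V):
    # Soft alignment between premise token encodings U (m x d) and
    # hypothesis token encodings V (n x d). Each premise token attends
    # over the hypothesis tokens and vice versa; the attended vectors
    # can augment the originals before inference encoding.
    E = U @ V.T                                     # (m, n) alignment scores
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)               # premise -> hypothesis
    B = np.exp(E.T - E.T.max(axis=1, keepdims=True))
    B /= B.sum(axis=1, keepdims=True)               # hypothesis -> premise
    return A @ V, B @ U  # hypothesis-aware U, premise-aware V

rng = np.random.default_rng(2)
U = rng.normal(size=(4, 6))   # 4 premise tokens, 6-dim encodings
V = rng.normal(size=(3, 6))   # 3 hypothesis tokens
U_tilde, V_tilde = soft_attention(U, V)
```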

[0031] Referring to FIG. 1A, an example process 100 for practicing selected aspects of the present disclosure, in accordance with various embodiments, is disclosed. For convenience, the operations of the flowchart are described with reference to a system that performs the operations. This system may include various components of various computer systems, including those described in FIG. 7. Moreover, while operations of process 100 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted, and/or added.

[0032] At block 102, a premise sentence and a hypothesis sentence NLI sentence pair can be obtained. A pair of NLI sentences generally can have three relationship classifications: entailment, contradiction, and neutral. An entailment classification can indicate the hypothesis sentence is related to the premise sentence. A contradiction classification can indicate the hypothesis sentence is not related to the premise sentence. Additionally or alternatively, a neutral classification can indicate the hypothesis sentence has neither an entailment classification nor a contradiction classification. For example, the premise sentence "A senior is waiting at the window of a restaurant that serves sandwiches." can be linked with various hypothesis sentences. The hypothesis sentence "A person waits to be served his food." can indicate an entailment classification (i.e., the hypothesis sentence has a relationship with the premise sentence). The hypothesis sentence "A man is looking to order a grilled cheese sandwich." can indicate a neutral classification (i.e., the hypothesis sentence has neither entailment nor contradiction with the premise sentence). Additionally, the hypothesis sentence "A man is waiting in line for the bus." can indicate a contradiction classification (i.e., the hypothesis sentence has no relationship with the premise sentence).
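The worked example in the paragraph above can be written down as labeled NLI pairs, using the three class names from the text:

```python
# The restaurant example expressed as (premise, hypothesis, label) tuples.
premise = ("A senior is waiting at the window of a restaurant "
           "that serves sandwiches.")
nli_pairs = [
    (premise, "A person waits to be served his food.", "entailment"),
    (premise, "A man is looking to order a grilled cheese sandwich.",
     "neutral"),
    (premise, "A man is waiting in line for the bus.", "contradiction"),
]
```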

[0033] At block 104, the NLI sentences can be classified using an ensemble of trained neural networks. In many embodiments, a decision function using features extracted from the NLI sentences can make additional predictions regarding which neural network in the ensemble is best for the current set of input data. Neural network models used in an ensemble system in accordance with many embodiments of the disclosure can contain a variety of layers including: input encoding, attention, inference encoding, and classification. In many embodiments, the neural network can be a deep learning neural network, for example, a recurrent network. In many embodiments, a bidirectional Long Short-Term Memory (Bi-LSTM) can be used as a building block of the trained neural network. Additionally or alternatively, a dependent reading Bi-LSTM (DR-Bi-LSTM) can be used to both independently and dependently read premise and hypothesis sentence pairs. Additional information regarding the use of Bi-LSTM as the neural network models within the ensemble model will be described below.

[0034] Moreover, features can be extracted from the NLI sentences for use in a decision function to provide additional data to the ensemble system. Features can include: premise sentence length, hypothesis sentence length, a comparison between the premise sentence length and the hypothesis sentence length, a comparison of the overlap between the premise sentence and the hypothesis sentence, lexical features, etc. Lexical features can include classifying words in the premise sentence and the hypothesis sentence by their part of speech, for example as a noun, verb, adjective, pronoun, adverb, preposition, conjunction, or interjection. In various embodiments, these features can be used to generate a weight for each neural network within the ensemble system. In many embodiments, the addition of features can improve the accuracy of the system by providing information about which neural networks within the ensemble are more likely to generate correct predictions.
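The feature types listed above can be sketched as follows. This is an illustrative sketch only: the disclosure names the feature categories but does not fix an implementation, and the function name, whitespace tokenization, and the particular overlap ratio used here are hypothetical choices.

```python
def extract_features(premise, hypothesis):
    """Extract simple decision-function features from an NLI sentence pair."""
    p_tokens = premise.lower().split()
    h_tokens = hypothesis.lower().split()
    overlap = len(set(p_tokens) & set(h_tokens))  # shared word types
    return {
        "premise_len": len(p_tokens),                    # premise sentence length
        "hypothesis_len": len(h_tokens),                 # hypothesis sentence length
        "len_diff": len(p_tokens) - len(h_tokens),       # length comparison
        "overlap_ratio": overlap / max(len(set(h_tokens)), 1),  # overlap comparison
    }

feats = extract_features(
    "A senior is waiting at the window of a restaurant that serves sandwiches .",
    "A person waits to be served his food .",
)
```

Lexical (part-of-speech) features would be appended to the same dictionary by a tagger, which is omitted here to keep the sketch self-contained.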

[0035] At block 106, a classification label can be generated for the classified NLI sentence pair using the output of the ensemble system. In some embodiments, a softmax function can be used to classify the output of the ensemble system. A variety of embodiments can have three classification labels: entailment, neutral, and contradiction. In other embodiments, additional labels can be utilized; for example, when the NLI sentence pairs used in a training data set are labeled by one or more humans, additional classification labels can be generated for training input sentence pairs when the humans disagree on how an NLI sentence pair should be classified.
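A minimal sketch of the softmax labeling step, assuming the ensemble emits one raw score per label; the scores below are made-up illustrative values, not output of any disclosed model.

```python
import math

LABELS = ["entailment", "neutral", "contradiction"]

def softmax(scores):
    """Turn raw scores into a probability distribution over labels."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 0.5, -1.0])        # hypothetical ensemble scores
label = LABELS[probs.index(max(probs))]  # pick the highest-probability label
```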

[0036] FIG. 1B describes an example process 125 for practicing selected aspects of the present disclosure, in accordance with various embodiments of the disclosure. In many embodiments, an ensemble neural network system can be composed of several trained neural networks and a set of features extracted from the NLI sentence pair input. A trained decision function, using the set of features extracted from the NLI sentence pair, can predict which neural networks within the ensemble are more likely to produce an accurate result for the given input. For example, a simple input sentence pair may be more accurately classified by a simpler neural network model within the ensemble, and similarly a more complex sentence may be classified more accurately by a more complex neural network within the ensemble. In some embodiments, the decision function can be trained with the same training data set that is used to train the neural networks within the ensemble. In many embodiments, the decision function can be a single-layer feed-forward neural network.

[0037] At block 126, a premise sentence and a hypothesis sentence which can form an NLI sentence pair can be obtained. In many embodiments, NLI sentence pairs can be obtained in a manner similar to block 102 in FIG. 1A.

[0038] A set of features can be extracted from the NLI sentence pair at block 128. In a variety of embodiments, features can be extracted from NLI sentence pairs in a manner similar to block 104 in FIG. 1A. In many embodiments, feature extraction techniques can include: independent component analysis, principal component analysis, isomap, partial least squares, multifactor dimensionality reduction, nonlinear dimensionality reduction, multilinear subspace learning, semidefinite embedding, autoencoders, etc.

[0039] At block 130, a classification output from a set of trained NLI neural network models can be generated using the NLI sentence pair. In some embodiments, ensemble methods can use multiple models to obtain better predictive performance. Previous works typically utilize trivial ensemble strategies by either using majority votes or averaging the probability distributions over the same model with different initialization seeds. In many embodiments, a weighted averaging of the probability distributions of the models within the ensemble can be used, where the weight of each model is learned through its performance on a training data set such as the same training data set used to train the individual neural network models within the ensemble. Furthermore, in some embodiments, the differences between models within the ensemble can originate from a variety of neural network model differences including: variations in the number of dependent readings (i.e., one or three rounds of dependent reading can be used in different models, where three rounds of dependent readings can include repeating an additional inference encoding layer), variations in a projection layer activation (e.g., using tanh and ReLU in Equations 10 and 11), different initialization seeds of neural network models, etc. In a variety of embodiments, NLI neural network models within an ensemble can generate a classification in a manner similar to block 104 in FIG. 1A.
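The weighted averaging of per-model probability distributions described above can be sketched as follows. The per-model distributions and weights are illustrative stand-ins; in the disclosure the weights would be learned from each model's performance on a training data set.

```python
def weighted_ensemble(model_probs, weights):
    """Weighted average of per-model label distributions.

    model_probs: one [P(entail), P(neutral), P(contra)] list per model.
    weights: one learned (here: made-up) weight per model.
    """
    total_w = sum(weights)
    n_labels = len(model_probs[0])
    combined = [0.0] * n_labels
    for probs, w in zip(model_probs, weights):
        for k in range(n_labels):
            combined[k] += w * probs[k] / total_w  # normalize by total weight
    return combined

combined = weighted_ensemble(
    [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.6, 0.1, 0.3]],  # three models' outputs
    [0.5, 0.2, 0.3],                                       # their learned weights
)
```

Because the weights are normalized, the combined output is itself a valid probability distribution, unlike a plain sum of votes.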

[0040] An output can be generated at block 132 using a decision function which can weigh the output of the set of trained neural network models within the ensemble using the set of features extracted from the NLI sentence pair. In many embodiments, the effectiveness of a model may depend on the complexity of a premise-hypothesis instance. For example, for a simple instance, a simple model could perform better than a complex one, while a complex instance may need further consideration toward disambiguation. Therefore, using models with different rounds of dependent readings in the encoding stage can be beneficial.

[0041] The following is an example configuration of an ensemble network model in accordance with the disclosure. Six trained Bi-LSTM models with different initialization seeds can be used. A Bi-LSTM network can be used which includes a tanh activation function in place of a ReLU activation function. Equations 10 and 11 described below can be replaced with Equations 1 and 2 as follows:

[0042] p_i = tanh(W_p a_i + b_p) (1)

[0043] q_j = tanh(W_p b_j + b_p) (2)

[0044] In some embodiments, a Bi-LSTM with one round of dependent reading can use the same configuration as the general Bi-LSTM described below, without a dependent reading in the inference process. In other words, the independent readings can be used directly, i.e., p = p̄ and q = q̄, instead of Equations 14 and 15 described below.

[0045] Additionally, a Bi-LSTM with three rounds of dependent reading can be used, which can use the same basic configuration as the Bi-LSTM described below except Equations 5 and 6 are replaced with Equations 3 and 4.

[0046] −, s_v = BiLSTM(v, 0)

[0047] −, s_vu = BiLSTM(u, s_v)

[0048] −, s_vuv = BiLSTM(v, s_vu)

[0049] ũ, − = BiLSTM(u, s_vuv) (3)

[0050] −, s_u = BiLSTM(u, 0)

[0051] −, s_uv = BiLSTM(v, s_u)

[0052] −, s_uvu = BiLSTM(u, s_uv)

[0053] ṽ, − = BiLSTM(v, s_uvu) (4)

[0054] In many embodiments, the Bi-LSTM can have any number of rounds of dependent readings and neural networks within an ensemble method in accordance with the disclosure are not limited to three rounds of dependent readings.

[0055] In some embodiments, an example final ensemble model can include a combination of six models. In many embodiments, all six models can be initialized with different seeds. The final ensemble model includes a Bi-LSTM model with a tanh activation function in the projection layer, a Bi-LSTM with one round of dependent reading, a Bi-LSTM model with three rounds of dependent reading, and three additional Bi-LSTM models as described below initialized with different seeds. In some embodiments, a decision function to classify an ensemble system can generate output in a manner similar to block 104 in FIG. 1A.

[0056] At block 134, a classification of the NLI sentence pair can be generated. In many embodiments, a softmax function can generate a label for the NLI sentence pair. In various embodiments, a classification label (e.g., entailment, neutral, or contradiction) can be generated in a manner similar to block 106 in FIG. 1A.

[0057] FIG. 1C illustrates an example ensemble learning system in accordance with various embodiments. Image 150 includes a dataset 152 which provides the NLI sentence pair input to the network. Trained models 154 can include any of a variety of NLI neural network models for use in the ensemble system, including recurrent neural networks. In a variety of embodiments, trained models 154 can include Bi-LSTM network models described in detail below. In some embodiments, trained models 154 can include variations including: a single dependent reading of the input data (e.g., not using a dependent reading in the inference stage of the Bi-LSTM model described in detail below), multiple dependent readings of the input data (e.g., adding additional layers with dependent reading to the Bi-LSTM model described below), for example three dependent readings of the input data, varying the structure of the Bi-LSTM model (e.g., changing the activation function used in the NLI model, such as using a tanh activation function in some trained models and a ReLU activation function in other trained models), changing the initialization seeding of the model before training, etc.

[0058] Features can be extracted from the NLI input sentence pair at feature extraction module 156. In many embodiments, feature extraction can be performed on the pair of input premise and hypothesis sentences in a manner similar to block 104 in FIG. 1A and/or block 128 in FIG. 1B. A decision function 158 can receive the classification outputs of the trained models 154 and the features extracted at feature extraction module 156 for a given NLI sentence pair input. Using the input to the model, the decision function can make further predictions regarding which trained models 154 in the ensemble are more likely to generate an accurate classification of the NLI sentence pair input from the features extracted from the input. In other words, the decision function can pick an output of the ensemble method using information provided by the input (i.e., the features extracted from the premise and hypothesis sentence pair). In some embodiments, the decision function can be a single-layer feed-forward neural network. In various embodiments, the decision function can be performed in a manner similar to block 132 of FIG. 1B.
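A hedged sketch of such a decision function: a single-layer feed-forward network maps the extracted feature vector to one score per trained model, and a softmax over those scores yields per-model weights. The weight matrix, bias values, and two-feature input below are hypothetical; in the disclosure they would be learned from training data.

```python
import math

def decision_weights(features, W, b):
    """Single-layer feed-forward gating: one weight per ensemble model.

    features: feature vector extracted from the sentence pair.
    W, b: one row of weights and one bias per trained model (assumed learned).
    """
    scores = [sum(w * f for w, f in zip(row, features)) + bi
              for row, bi in zip(W, b)]
    m = max(scores)                      # softmax over model scores
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three hypothetical models, two features (e.g., premise and hypothesis lengths).
weights = decision_weights(
    [14.0, 9.0],
    W=[[0.1, -0.1], [0.0, 0.05], [-0.1, 0.1]],
    b=[0.0, 0.1, 0.0],
)
```

The resulting weights could then scale each model's probability distribution before the final label is chosen.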

[0059] In a variety of embodiments, the output of the decision function can be passed to another layer in the neural network system to generate a classification label. In some embodiments, the output is passed to a softmax function 160, which can generate classification label 162. In many embodiments, classification labels for NLI systems can include entailment, neutral, and contradiction.

[0060] Referring to FIG. 1D, an example process 175 for practicing selected aspects of the present disclosure, in accordance with various embodiments, is disclosed. For convenience, the operations of the flowchart are described with reference to a system that performs the operations. This system may include various components of various computer systems, including those described in FIG. 7. Moreover, while operations of process 175 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted, and/or added. In many embodiments, process 175 is an example of an individual NLI model within the ensemble that can classify an NLI sentence pair. Individual NLI models within an ensemble system in accordance with many embodiments of the present disclosure described with respect to process 175 can include Bi-LSTM neural network models.

[0061] At block 176, a premise sentence and a hypothesis sentence forming an NLI sentence pair can be obtained. In many embodiments, a premise sentence and a hypothesis sentence can be obtained in a manner similar to block 102 in FIG. 1A.

[0062] At block 178, the NLI sentence pair can be classified using a trained neural network. The trained neural network can perform independent readings and dependent readings of the premise and hypothesis sentences. Neural network models in accordance with many embodiments of the disclosure can contain a variety of layers including: input encoding, attention, inference encoding, and classification. In many embodiments, the neural network can be a deep learning neural network, for example, a recurrent network. In many embodiments, a bidirectional Long Short-Term Memory (Bi-LSTM) can be used as a building block of the trained neural network. Additionally or alternatively, a dependent reading Bi-LSTM (DR-Bi-LSTM) can be used to both independently and dependently read premise and hypothesis sentence pairs. Additional information regarding the use of Bi-LSTM in the neural network model will be described below.

[0063] In some embodiments, a neural network can be trained using a data set with a known set of inputs corresponding to a known classification. The input is passed through the network, and one or more adjustments can be made to the neural network by comparing the actual output of the network with what the output of the network should be for the given input according to the data set. For example, the Stanford Natural Language Inference (SNLI) data set can be used to train a neural network in accordance with many embodiments of the disclosure for use in NLI applications.
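The supervised training loop described above, compare the network's output on a known input with the known label and adjust the weights, can be sketched with a toy stand-in. The single linear unit here is hypothetical and merely stands in for the full Bi-LSTM network; the input, target, and learning rate are made-up values.

```python
def train_step(w, b, x, target, lr=0.1):
    """One supervised update: nudge weights toward the known target."""
    pred = sum(wi * xi for wi, xi in zip(w, x)) + b
    err = target - pred                               # how far off the output is
    w = [wi + lr * err * xi for wi, xi in zip(w, x)]  # gradient-style update
    b = b + lr * err
    return w, b, err

w, b = [0.0, 0.0], 0.0
for _ in range(50):                 # repeat over the known (input, label) pair
    w, b, err = train_step(w, b, [1.0, 2.0], target=1.0)
```

After repeated passes the error shrinks toward zero, which is the same compare-and-adjust principle a full training run applies over an entire data set such as SNLI.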

[0064] At block 180, a classification label can be generated for the classified NLI sentence pair. A variety of embodiments can have three classification labels: entailment, neutral, and contradiction. In many embodiments, classification labels for an individual NLI model can be generated in a manner similar to block 106 in FIG. 1A.

[0065] FIG. 2 describes an example process 200 for practicing selected aspects of the present disclosure, in accordance with various embodiments. In many embodiments, a Bi-LSTM neural network model, which can be an individual neural network included in the ensemble learning system, can be composed of the following components: input encoding, attention, inference encoding, and classification. For convenience, the operations of the flowchart are described with reference to a system that performs the operations. This system may include various components of various computer systems, including those described in FIG. 7. Moreover, while operations of process 200 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted, and/or added.

[0066] At block 202, a premise sentence and a hypothesis sentence for an NLI sentence pair can be obtained. In many embodiments, NLI sentence pairs can be obtained in a manner similar to block 102 in FIG. 1A.

[0067] An input encoding of the premise sentence and the hypothesis sentence can independently and dependently be generated at block 204 using a neural network model. In many embodiments, the neural network model can contain recurrent neural network elements, for example, Bi-LSTM blocks. Input encoding in accordance with several embodiments will be discussed in detail in FIGS. 3A-3B.

[0068] An attention of the premise sentence and the hypothesis sentence can independently and dependently be generated at block 206 using the neural network. Attention mechanisms can generate embeddings for each word sequence in a sentence considering the other sentence. For example, attention mechanisms can correlate which words in the premise and the hypothesis have a higher importance. Attention in accordance with several embodiments will be discussed in detail in FIGS. 4A-4B.

[0069] An inference encoding of the premise sentence and the hypothesis sentence can independently and dependently be generated at block 208 using the neural network model. In some embodiments, the neural network model at the inference encoding stage can contain recurrent neural network elements, for example, Bi-LSTM blocks. Inference encoding in accordance with several embodiments will be discussed in detail in FIG. 5.

[0070] At block 210, a classification of the NLI sentence pair can be generated using the neural network. In several embodiments, classification labels can include: entailment, neutral, and contradiction. Classification in accordance with various embodiments will be discussed in detail in FIG. 6.

[0071] FIGS. 3A - 3B illustrate an example input encoding in accordance with many embodiments. FIG. 3A and FIG. 3B illustrate images 300 and 350 respectively, which when combined can illustrate an example input encoding.

[0072] Image 300 contains an input premise sentence 302 and an input hypothesis sentence 304. Input premise sentence 302 can be passed to embedding 306, which can transform words in an input premise sentence into a word representation. Similarly, input hypothesis sentence can be passed to embedding 308 to transform words in an input hypothesis sentence into a word representation. In many embodiments, embedding 306 and/or embedding 308 can include a variety of word embeddings including: word2vec, GloVe, fastText, Gensim, Brown clustering, and/or latent semantic analysis.

[0073] Once an input premise sentence 302 has been embedded, a sequence of premise word embeddings 310, referred to simply as a "premise" for simplification, can be represented by u. Premise 310 is represented by diagonal line shading, and any data originating from premise 310 is similarly represented by diagonal line shading throughout FIGS. 3-6 in accordance with some embodiments of the disclosure. Similarly, once an input hypothesis sentence 304 has been embedded, a sequence of hypothesis word embeddings 312, referred to simply as a "hypothesis" for simplification, can be represented by v. Hypothesis 312 is represented by dotted shading, and any data originating from hypothesis 312 is similarly represented by dotted shading throughout FIGS. 3-6 in accordance with many embodiments of the disclosure. In some embodiments, u = [u_1, ..., u_n] can be a premise with length n and v = [v_1, ..., v_m] can be a hypothesis with length m, where u_i, v_j ∈ ℝ^r can be word embeddings, i.e., r-dimensional vectors. In a variety of embodiments, the classification task can be to predict a label y that can indicate the logical relationship between premise u and hypothesis v.

[0074] In several embodiments, recurrent neural networks (RNNs) can be utilized for variable length sequence modeling. Additionally or alternatively, a bidirectional Long Short-Term Memory (Bi-LSTM) block can be utilized for encoding the given premise 310 and hypothesis 312. Premise 310 and hypothesis 312 can be encoded with independent and dependent readings of Bi-LSTMs. For example, in an independent reading, the premise can be read without reading the hypothesis and similarly the hypothesis can be read without reading the premise. In a dependent reading, one sentence is read, and the reading of that first sentence is used in the reading of the second sentence. For example, in a dependent reading the premise can be read and the reading of the premise can be used to read the hypothesis.

[0075] Image 300 can contain four Bi-LSTM blocks which in a variety of embodiments, can work together to independently and dependently read the premise and hypothesis. Bi-LSTM block 314 can independently read hypothesis 312 to generate an independent hypothesis vector space 322. Similarly, Bi-LSTM block 318 can independently read premise 310 to generate independent premise vector space 326. Bi-LSTM block 316 can dependently read premise 310 using information passed from an independent reading of hypothesis 312 from Bi-LSTM block 314 to generate dependent premise vector space 324. Similarly, Bi-LSTM block 320 can dependently read hypothesis 312 using information from an independent reading of premise 310 passed from Bi-LSTM block 318 to generate dependent hypothesis vector space 328.

[0076] For ease of presentation, only a mathematical description of how to encode u depending on v (i.e., (u|v)) will be described, but in many embodiments, the same procedures can be utilized for the reverse direction to encode (v|u).

[0077] In a variety of embodiments, to dependently encode u, v can first be processed using the Bi-LSTM. Then u can be read through a Bi-LSTM that is initialized with the previous reading's final states, such as its memory cell and hidden states. For example, a word can be represented by u_i and its context can depend on the other sentence, v.

[0078] v̄, s_v = BiLSTM(v, 0)

[0079] ũ, − = BiLSTM(u, s_v) (5)

[0080] ū, s_u = BiLSTM(u, 0)

[0081] ṽ, − = BiLSTM(v, s_u) (6)

[0082] {ū ∈ ℝ^(n×2d), ũ ∈ ℝ^(n×2d), s_u} and {v̄ ∈ ℝ^(m×2d), ṽ ∈ ℝ^(m×2d), s_v} are the independent reading sequences, dependent reading sequences, and Bi-LSTM final states of the independent readings of u and v respectively (i.e., {independent reading sequence, dependent reading sequence, Bi-LSTM final state of independent reading} of u or v). It can be noted that "−" in these equations means that the associated variable and its value are unimportant. The Bi-LSTM inputs (i.e., premise 310 and hypothesis 312) can be the word embedding sequences.
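The state-passing pattern of Equations 5 and 6 can be illustrated with a toy stand-in for the Bi-LSTM. The `reader` function below is a hypothetical one-dimensional recurrent accumulator, not an LSTM; it only shows how the final state of one sentence's reading initializes the dependent reading of the other.

```python
def reader(sequence, init_state=0.0):
    """Toy recurrent reader: returns (per-step states, final state),
    mirroring the (outputs, state) = BiLSTM(x, s) interface."""
    state, outputs = init_state, []
    for x in sequence:
        state = 0.5 * state + x        # hypothetical recurrence rule
        outputs.append(state)
    return outputs, state

v = [1.0, 2.0]                         # toy hypothesis embeddings
u = [3.0]                              # toy premise embeddings

_, s_v = reader(v, 0.0)                # v_bar, s_v = BiLSTM(v, 0)
u_tilde, _ = reader(u, s_v)            # u_tilde, - = BiLSTM(u, s_v)   (Eq. 5)
u_bar, s_u = reader(u, 0.0)            # u_bar, s_u = BiLSTM(u, 0)     (Eq. 6)
```

Note how `u_tilde` differs from `u_bar` solely because its reading starts from `s_v`, which is exactly what makes it a dependent reading.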

[0083] Independent and dependent reading embeddings from Bi-LSTM blocks can be passed to pooling processes. Dependent premise vector space 324 and independent premise vector space 326 can be passed to pooling 330. Additionally or alternatively, independent hypothesis vector space 322 and dependent hypothesis vector space 328 can be passed to pooling 332. Pooling 330 and pooling 332 can combine data passed to them in different ways including: max pooling, average pooling, L2-norm pooling, etc.

[0084] Image 350 in FIG. 3B contains the pooling 330 and pooling 332 processes represented in FIG. 3A. The output of pooling 330 is a state vector 334, which can represent the pooling of the independent and dependent readings of the premise, and similarly the output of pooling 332 is a state vector 336, which can represent the pooling of the independent and dependent readings of the hypothesis. In several embodiments, state vector 334 and state vector 336 can be passed to an attention mechanism 338. In some embodiments, the input encoding mechanism can yield a richer representation for both premise and hypothesis by taking the history of each other into account. An attention mechanism in accordance with some embodiments of the disclosure will be discussed in FIGS. 4A-4B.

[0085] FIGS. 4A - 4B illustrate an example attention mechanism in accordance with many embodiments. Attention mechanisms in accordance with a variety of embodiments of the disclosure can generate an embedding for each word sequence in a sentence considering the other sentence. Image 400 in FIG. 4A can contain a state vector 334 with a length of m words, which represents the pooling of independent and dependent reading of the premise, and a state vector 336 with a length of n words, which represents the pooling of independent and dependent readings of the hypothesis similar to the state vectors illustrated in FIG. 3B. State vectors 334 and 336 can be combined to form a matrix 402 of size m x n. Matrix 402 can be the input into a softmax function 404 and a softmax function 406. In some embodiments, softmax function 404 can be over the first dimension and softmax function 406 can be over the second dimension. Summation element 410 can combine the output of softmax function 406 with state vector 336 to generate attentional representation 414. Attentional representation 414 is visually represented by cross-hatches. Similarly, summation element 408 can combine the output of softmax function 404 with state vector 334 to generate attentional representation 412. Attentional representation 412 is visually represented by vertical lines.

[0086] In some embodiments, an attention mechanism can pass the input embedding, the attentional embedding, the difference of the input embedding and the attentional embedding, and the element wise product of the input embedding and the attentional embedding to an attention output.

[0087] A premise attention output 416 can receive input from state vector 334 and attentional representation 414. A difference element 418 can compute the difference between state vector 334 and attentional representation 414 to generate difference output 426. An element wise product element 420 can compute an element wise product between state vector 334 and attentional representation 414 to generate element wise product output 428. In many embodiments, premise attention output 416 can represent one or more sequences of words, each word comprising elements of: state vector 334, attentional representation 414, difference output 426, and element wise product output 428.

[0088] Similarly, in several embodiments, a hypothesis attention output 430 can receive input from state vector 336 and attentional representation 412. A difference element 432 can compute the difference between state vector 336 and attentional representation 412 to generate difference output 440. An element wise product element 434 can compute an element wise product between state vector 336 and attentional representation 412 to generate element wise product output 442. In some embodiments, hypothesis attention output 430 can represent one or more sequences of words, each word comprising elements of: state vector 336, attentional representation 412, difference output 440, and element wise product output 442.

[0089] Image 450 in FIG. 4B contains premise attention output 416 and hypothesis attention output 430. Projector 452 can be a feed-forward layer which can transform the premise attention output 416 into premise attention state vector 456. Similarly, projector 454 can be a feed forward layer which can transform the hypothesis attention output 430 into hypothesis attention state vector 458. In many embodiments, premise attention state vector 456 and hypothesis attention state vector 458 can be a lower dimensional space than the input to the corresponding projectors. Premise attention state vector 456 and hypothesis attention state vector 458 can be passed to inference encoding 460. Inference encoding in accordance with some embodiments will be discussed in FIG. 5.

[0090] Additionally or alternatively, in some embodiments, attention can be performed by a soft alignment method which can associate the relevant sub-components between the given premise and hypothesis. In deep learning models, such a purpose is often achieved with a soft attention mechanism. In many embodiments, the unnormalized weights can be computed as the similarity of the hidden states of the premise and hypothesis with Equation 7. Equation 7, for example, can be an energy function.

[0091] e_ij = ũ_i^T ṽ_j (7)

[0092] where ũ_i and ṽ_j are the dependent reading hidden representations of u and v respectively. In some embodiments, for each word in either the premise or the hypothesis, the relevant semantics in the other sentence can be extracted and composed according to e_ij. In various embodiments, Equations 8 and 9 can provide formal and specific details of this procedure.

[0093] û_i = Σ_{j=1..m} ( exp(e_ij) / Σ_{k=1..m} exp(e_ik) ) ṽ_j (8)

[0094] v̂_j = Σ_{i=1..n} ( exp(e_ij) / Σ_{k=1..n} exp(e_kj) ) ũ_i (9)

[0095] where û_i represents the extracted relevant information of v by attending to ũ_i, while v̂_j represents the extracted relevant information of u by attending to ṽ_j.
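The soft attention of Equations 7 through 9 can be sketched with toy two-dimensional hidden states. The dot-product energy and the specific state values are illustrative; only the u-side attended vectors are computed, the v-side being symmetric.

```python
import math

def softmax(xs):
    """Normalize a row of energies into attention weights."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(u_states, v_states):
    # Energy function (Eq. 7): e_ij = dot(u_i, v_j)
    e = [[sum(a * b for a, b in zip(ui, vj)) for vj in v_states]
         for ui in u_states]
    # Attended premise vectors (Eq. 8): softmax over j, weighted sum of v_j
    u_hat = []
    for row in e:
        w = softmax(row)
        u_hat.append([sum(wj * vj[k] for wj, vj in zip(w, v_states))
                      for k in range(len(v_states[0]))])
    return e, u_hat

u_states = [[1.0, 0.0], [0.0, 1.0]]    # toy premise hidden states
v_states = [[1.0, 0.0], [0.0, 2.0]]    # toy hypothesis hidden states
e, u_hat = attend(u_states, v_states)
```

Each `u_hat[i]` is the hypothesis content most relevant to premise word i; concatenating it with the input state, their difference, and their element wise product yields the enriched vectors fed to the projection layer.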

[0096] In many embodiments, the collected attentional information can be further enriched by passing on the concatenation of the tuples (ũ_i, û_i) or (ṽ_j, v̂_j). To additionally add similarity and closeness measures, in some embodiments, the difference and element-wise product of the tuples (ũ_i, û_i) and (ṽ_j, v̂_j), which represent similarity and closeness, can be computed.

[0097] The difference and element-wise product are then concatenated with the computed vectors, (ũ_i, û_i) or (ṽ_j, v̂_j), respectively. Additionally or alternatively, a feed-forward neural layer with a ReLU activation function can project the concatenated vectors from an 8d-dimensional vector space into a d-dimensional space (Equations 10 and 11). In many embodiments, this can capture deeper dependencies between the sentences besides lowering the complexity of the vector representations.

[0098] a_i = [ũ_i, û_i, ũ_i − û_i, ũ_i ⊙ û_i]

[0099] b_j = [ṽ_j, v̂_j, ṽ_j − v̂_j, ṽ_j ⊙ v̂_j]

[00100] p_i = ReLU(W_p a_i + b_p) (10)

[00101] q_j = ReLU(W_p b_j + b_p) (11)

[00102] Here ⊙ stands for the element-wise product, while W_p ∈ ℝ^(8d×d) and b_p ∈ ℝ^d are the trainable weights and biases of the projector layer respectively.

[00103] FIG. 5 illustrates an example inference encoding in accordance with several embodiments. Image 500 includes premise attention state vector 456 and hypothesis attention state vector 458 similar to the state vectors illustrated in FIG. 4B. In a variety of embodiments, inference encoding can encode premise and hypothesis data using independent readings and dependent readings in a manner similar to the encoding mechanisms used in the input encoding steps of a neural network model described in FIGS. 3A-3B. Premise attention state vector 456 can be represented by p and hypothesis attention state vector 458 can be represented by q. An aggregation of p and q can be performed in a sequential manner to avoid losing the effect of latent variables that might rely on the sequence of matching vectors.

[00104] In a variety of embodiments, image 500 can contain four Bi-LSTM blocks which, similarly to input encoding, can work together to independently and dependently read premise attention state vector 456 and hypothesis attention state vector 458. Bi-LSTM block 506 can independently read premise attention state vector 456 to generate independent reading premise state vector 514. Similarly, Bi-LSTM block 502 can independently read hypothesis attention state vector 458 to generate independent reading hypothesis state vector 510. Bi-LSTM block 504 can dependently read premise attention state vector 456 using additional information passed from an independent reading of hypothesis attention state vector 458 from Bi-LSTM block 502 to generate dependent reading premise state vector 512. Similarly, in many embodiments, Bi-LSTM block 508 can dependently read hypothesis attention state vector 458 using additional information passed from an independent reading of premise attention state vector 456 by Bi-LSTM block 506 to generate dependent reading hypothesis state vector 516.

[00105] Independent and dependent readings of p and q can be passed to pooling processes. In various embodiments, dependent reading premise state vector 512 and independent reading premise state vector 514 can be passed to pooling process 518 to generate premise inference state vector 522. Similarly, independent reading hypothesis state vector 510 and dependent reading hypothesis state vector 516 can be passed to pooling process 520 to generate hypothesis inference state vector 524. In some embodiments, additional pooling processes can be performed on the data. In some such embodiments, premise inference state vector 522 can be passed to sequence pooling 526 and similarly hypothesis inference state vector 524 can be passed to sequence pooling 528. Sequence pooling 526 and sequence pooling 528 can be utilized in a classification step such as classification 530. In a variety of embodiments, sequence pooling can generate a non-sequential tensor that can be a combination of different pooling methods including: max-pooling, avg-pooling, min-pooling, etc. A classification step for a neural network model similar to classification 530 will be discussed in detail in FIG. 6.

[00106] In alternative or additional embodiments, inference processes similar to those described in FIG. 5 can be performed in a manner similar to that described below. Instead of aggregating the sequences of matching vectors individually, a Bi-LSTM reading process (Equations 12 and 13) similar to the input encoding step can be utilized in accordance with some embodiments of the disclosure. Both independent readings (p and q) and dependent readings (p̃ and q̃) can be fed to a max pooling layer, which can select maximum values element-wise from each pair of independent and dependent readings, as shown in Equations 14 and 15. In yet another embodiment, this architecture can maximize the inferencing ability of the model by considering both independent and dependent readings.

[00107] q, s_q = BiLSTM(q, 0)

[00108] p̃, _ = BiLSTM(p, s_q)    (12)

[00109] p, s_p = BiLSTM(p, 0)

[00110] q̃, _ = BiLSTM(q, s_p)    (13)

[00111] p̄ = MaxPooling(p, p̃)    (14)

[00112] q̄ = MaxPooling(q, q̃)    (15)

[00113] In many embodiments, {p ∈ ℝ^(n×2d), p̃ ∈ ℝ^(n×2d), s_p} and {q ∈ ℝ^(m×2d), q̃ ∈ ℝ^(m×2d), s_q} are the independent reading sequences, dependent reading sequences, and Bi-LSTM final state of independent reading of p and q, respectively (i.e., {independent reading sequence, dependent reading sequence, Bi-LSTM final state of independent reading}). Bi-LSTM inputs can be the word embedding sequences and initial state vectors.
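As a rough illustration of the reading scheme in Equations 12–15, the sketch below substitutes a trivial deterministic recurrence (toy_bilstm) for a trained Bi-LSTM; the function, weights, and tensor sizes are hypothetical stand-ins, chosen only to show how the partner's final state seeds the dependent reading and how an element-wise max combines the two readings.

```python
import numpy as np

def toy_bilstm(x: np.ndarray, s0: np.ndarray):
    """Schematic stand-in for a Bi-LSTM: a simple recurrence whose
    outputs depend on the initial state s0, so a dependent reading
    (s0 = partner's final state) differs from an independent one
    (s0 = 0). Returns (output sequence, final state)."""
    h = s0
    outs = []
    for t in range(x.shape[0]):
        h = np.tanh(x[t] + 0.5 * h)  # toy recurrence, not a real LSTM cell
        outs.append(h)
    return np.stack(outs), h

rng = np.random.default_rng(0)
p = rng.normal(size=(5, 4))  # premise sequence:    n = 5, 2d = 4
q = rng.normal(size=(6, 4))  # hypothesis sequence: m = 6, 2d = 4

# Equations 12-13: each dependent reading is seeded with the
# partner's final state from its independent reading.
q_ind, s_q = toy_bilstm(q, np.zeros(4))  # q, s_q = BiLSTM(q, 0)
p_dep, _ = toy_bilstm(p, s_q)            # p~, _  = BiLSTM(p, s_q)
p_ind, s_p = toy_bilstm(p, np.zeros(4))  # p, s_p = BiLSTM(p, 0)
q_dep, _ = toy_bilstm(q, s_p)            # q~, _  = BiLSTM(q, s_p)

# Equations 14-15: element-wise max over each independent/dependent pair.
p_bar = np.maximum(p_ind, p_dep)         # p_bar = MaxPooling(p, p~)
q_bar = np.maximum(q_ind, q_dep)         # q_bar = MaxPooling(q, q~)
```

Note that the max in Equations 14 and 15 pools across the two readings at each position, so p̄ and q̄ keep their sequence lengths (n and m) while combining both views of each sentence.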

[00114] In some embodiments, p̄ ∈ ℝ^(n×2d) and q̄ ∈ ℝ^(m×2d) can be converted to fixed-length vectors with pooling, U ∈ ℝ^(4d) and V ∈ ℝ^(4d). As shown in Equations 16 and 17, some embodiments may employ both max and average pooling and describe the overall inference relationship with a concatenation of their outputs.

[00115] U = [MaxPooling(p̄), AvgPooling(p̄)]    (16)

[00116] V = [MaxPooling(q̄), AvgPooling(q̄)]    (17)
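Equations 16 and 17 can be illustrated with a minimal numpy sketch; the helper name inference_vector and the tiny 2×2 reading are hypothetical, chosen so the pooled values can be checked by hand.

```python
import numpy as np

def inference_vector(x_bar: np.ndarray) -> np.ndarray:
    """Eqs 16-17: collapse a (length, 2d) reading into a fixed-length
    vector in R^(4d) by concatenating max- and average-pooling over
    the sequence axis."""
    return np.concatenate([x_bar.max(axis=0), x_bar.mean(axis=0)])

p_bar = np.array([[1., 2.],
                  [3., 0.]])          # toy premise reading: n = 2, 2d = 2
U = inference_vector(p_bar)           # U in R^(4d), here 4 entries
# U = [3., 2., 2., 1.]  (column-wise max followed by column-wise mean)
```

The same helper applied to q̄ would produce V, and because both reductions run over the sequence axis, U and V have the same length for any n and m.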

[00117] FIG. 6 illustrates an example classification in accordance with many embodiments. Image 600 contains sequence pooling 526 and sequence pooling 528, which in several embodiments can represent a sequence pooling similar to sequence pooling 526 and sequence pooling 528 illustrated in FIG. 5. Sequence pooling 526 and sequence pooling 528 can be concatenated into classification input 602. In many embodiments, classification input 602 can be fed into a feed-forward layer 604 and a softmax layer 606. Softmax layer 606 can generate a classification label 608 for the given premise and hypothesis NLI sentence pair (e.g., entailment, neutral, or contradiction).

[00118] Classification processes in accordance with many embodiments of the disclosure can be performed in a manner similar to that described below. The concatenation of U and V, for example, ([U, V]), can be fed into a multilayer perceptron (MLP) classifier that can include a hidden layer with tanh activation and a softmax output layer. In a variety of embodiments, the model can be trained in an end-to-end manner.

[00119] Output = MLP([U, V])    (18)
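A minimal sketch of Equation 18, assuming randomly initialized (untrained) weights; the layer sizes and the helper name mlp_classify are hypothetical, and a real system would learn the weights end-to-end as described above.

```python
import numpy as np

def mlp_classify(uv, W1, b1, W2, b2):
    """Eq 18: an MLP with one tanh hidden layer and a softmax output,
    applied to the concatenated inference vector [U, V]."""
    h = np.tanh(uv @ W1 + b1)            # hidden layer, tanh activation
    logits = h @ W2 + b2                 # one score per class
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(1)
uv = rng.normal(size=8)                  # toy [U, V] with 4d = 4 each
probs = mlp_classify(uv,
                     rng.normal(size=(8, 16)), np.zeros(16),
                     rng.normal(size=(16, 3)), np.zeros(3))
label = ["entailment", "neutral", "contradiction"][int(probs.argmax())]
```

The softmax output is a probability distribution over the three NLI labels, and the argmax yields classification label 608 for the sentence pair.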

[00120] FIG. 7 is a block diagram of an example computing device 710 that may optionally be utilized to perform one or more aspects of techniques described herein. In some implementations, one or more of a client computing device, user-controlled resources engine 130, and/or other component(s) may comprise one or more components of the example computing device 710.

[00121] Computing device 710 typically includes at least one processor 714 which communicates with a number of peripheral devices via bus subsystem 712. These peripheral devices may include a storage subsystem 724, including, for example, a memory subsystem 725 and a file storage subsystem 726, user interface output devices 720, user interface input devices 722, and a network interface subsystem 716. The input and output devices allow user interaction with computing device 710. Network interface subsystem 716 provides an interface to outside networks and is coupled to corresponding interface devices in other computing devices.

[00122] User interface input devices 722 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices. In general, use of the term "input device" is intended to include all possible types of devices and ways to input information into computing device 710 or onto a communication network.

[00123] User interface output devices 720 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term "output device" is intended to include all possible types of devices and ways to output information from computing device 710 to the user or to another machine or computing device.

[00124] Storage subsystem 724 stores programming and data constructs that provide the functionality of some or all of the modules described herein. For example, the storage subsystem 724 may include the logic to perform selected aspects of the processes of FIGS. 1 and 2.

[00125] These software modules are generally executed by processor 714 alone or in combination with other processors. Memory 725 used in the storage subsystem 724 can include a number of memories including a main random access memory (RAM) 730 for storage of instructions and data during program execution and a read only memory (ROM) 732 in which fixed instructions are stored. A file storage subsystem 726 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations may be stored by file storage subsystem 726 in the storage subsystem 724, or in other machines accessible by the processor(s) 714.

[00126] Bus subsystem 712 provides a mechanism for letting the various components and subsystems of computing device 710 communicate with each other as intended. Although bus subsystem 712 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.

[00127] Computing device 710 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 710 depicted in FIG. 7 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of computing device 710 are possible having more or fewer components than the computing device depicted in FIG. 7.

[00128] While several inventive embodiments have been described and illustrated herein, those of ordinary skill in the art will readily envision a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein, and each of such variations and/or modifications is deemed to be within the scope of the inventive embodiments described herein. More generally, those skilled in the art will readily appreciate that all parameters, dimensions, materials, and configurations described herein are meant to be exemplary and that the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the inventive teachings is/are used. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific inventive embodiments described herein. It is, therefore, to be understood that the foregoing embodiments are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, inventive embodiments may be practiced otherwise than as specifically described and claimed. Inventive embodiments of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the inventive scope of the present disclosure.

[00129] All definitions, as defined and used herein, should be understood to control over dictionary definitions, definitions in documents incorporated by reference, and/or ordinary meanings of the defined terms.

[00130] The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.”

[00131] The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising,” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.

[00132] As used herein in the specification and in the claims, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a list, “or” or “and/or” shall be interpreted as being inclusive, i.e., the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as “only one of” or “exactly one of,” or, when used in the claims, “consisting of,” will refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used herein shall only be interpreted as indicating exclusive alternatives (i.e., “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the claims, shall have its ordinary meaning as used in the field of patent law.

[00133] As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently, “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.

[00134] It should also be understood that, unless clearly indicated to the contrary, in any methods claimed herein that include more than one step or act, the order of the steps or acts of the method is not necessarily limited to the order in which the steps or acts of the method are recited.