


Title:
SYSTEM AND METHOD FOR SEQUENCE LABELING USING HIERARCHICAL CAPSULE BASED NEURAL NETWORK
Document Type and Number:
WIPO Patent Application WO/2020/261234
Kind Code:
A1
Abstract:
This disclosure relates generally to sequence labeling, and more particularly to a method and system for sequence labeling. The method includes employing a hierarchical capsule based neural network for sequence labeling, which includes a sentence encoding layer (having a word embedding layer, a feature extraction layer and multiple capsule layers), a document encoding layer comprising Bi-LSTMs, a fully connected layer and a conditional random fields (CRF) layer. The word embedding layer obtains fixed-size vector representations of the words of sentences associated with a dialogue or an abstract; the feature extraction layer then encodes the sentences, and the capsule layers extract high-level features from the sentence. All the sentence encodings are then stacked up together and passed through another Bi-LSTM layer to derive the contextual information from the sentences. A fully connected layer calculates likelihood scores. The CRF layer obtains an optimized label sequence based on the likelihood scores.

Inventors:
SRIVASTAVA SAURABH (IN)
AGARWAL PUNEET (IN)
SHROFF GAUTAM (IN)
VIG LOVEKESH (IN)
Application Number:
PCT/IB2020/056101
Publication Date:
December 30, 2020
Filing Date:
June 27, 2020
Assignee:
TATA CONSULTANCY SERVICES LTD (IN)
International Classes:
G06F17/28; G06F17/27
Foreign References:
US20160259775A12016-09-08
US20170270100A12017-09-21
Other References:
HAO PENG; JIANXIN LI; QIRAN GONG; SENZHANG WANG; LIFANG HE; BO LI; LIHONG WANG; PHILIP S YU: "Hierarchical Taxonomy-Aware and Attentional Graph Capsule RCNNs for Large- Scale Multi-Label Text Classification", ARXIV.ORG, 9 June 2019 (2019-06-09), XP081376232, Retrieved from the Internet [retrieved on 20200927]
ALZAIDY RABAH RAZAIDY, CARAGEA CORNELIA, GILES C LEE: "Bi-LSTM-CRF Sequence Labeling for Keyphrase Extraction from Scholarly Documents", THE WORLD WIDE WEB CONFERENCE, 27 September 2020 (2020-09-27), XP058471298, Retrieved from the Internet [retrieved on 20200927]
Attorney, Agent or Firm:
KHAITAN & CO (IN)
Claims:
Claims

1. A processor implemented method, comprising:

employing, via one or more hardware processors, a hierarchical capsule based neural network for sequence labeling, the hierarchical capsule based neural network comprising a sentence encoding layer, a document encoding layer, a fully connected layer and a conditional random fields (CRF) layer, the sentence encoding layer comprising a word embedding layer, a feature extraction layer composed of a first plurality of Bi-LSTMs, a primary capsule layer, and convolutional capsule layers, and the document encoding layer comprising a second plurality of Bi-LSTMs, wherein employing comprises:

determining, by the word embedding layer, initial sentence representations for a plurality of sentences associated with a task, each of the initial sentence representations comprising a concatenated embedding vector;

encoding, by the first plurality of Bi-LSTMs, contextual semantics between words within a sentence of the plurality of sentences using the associated concatenated embedding vector to obtain a plurality of context vectors (Ci);

convolving with the plurality of context vectors (Ci) while skipping one or more context vectors in between, by the primary capsule layer comprising a filter, to obtain a capsule map comprising a plurality of contextual capsules associated with the plurality of sentences, the plurality of contextual capsules connected in multiple levels using shared transformation matrices and a routing model;

computing, by the Convolutional Capsule Layer, a final sentence representation for the plurality of sentences by determining coupling strength between child-parent pair contextual capsules of the plurality of contextual capsules connected in the multiple levels;

obtaining, by a second plurality of Bi-LSTMs, contextual information between the plurality of sentences using the final sentence representations associated with the plurality of sentences, wherein the second plurality of Bi-LSTMs takes sentences at multiple time steps as input and produces a sequence of hidden state vectors corresponding to each of the plurality of sentences;

passing the hidden state vectors through the fully connected layer to output likelihood scores for probable labels for each sentence of the plurality of sentences; and

determining an optimized label sequence for the plurality of sentences by the CRF layer based at least on a sum of probable labels weighted by the likelihood scores.

2. The method as claimed in claim 1, wherein the concatenated embedding vector comprises a fixed-length vector corresponding to each word of the plurality of sentences, a fixed-length vector corresponding to a sentence being representative of lexical-semantics of words of the sentence.

3. The method as claimed in claim 1, wherein a number of the one or more skipped context vectors is a dilation rate (dr).

4. The method as claimed in claim 1, wherein a context vector of the plurality of context vectors associated with a word comprises a right context and a left context between the word and adjacent words.

5. The method as claimed in claim 1, wherein the filter (Wb) multiplies context vectors in {Ci+dr} with a stride of one to obtain a capsule pi, where pi = g(Wb Ci), and

g is a non-linear squash function.

6. The method as claimed in claim 1, wherein determining the optimized label sequence comprises calculating, by the CRF layer, probability score for the label sequence associated with the plurality of sentences based on the sum of possible labels weighted by the likelihood scores and transition scores of moving from one label to another label.

7. A system (701) for sequence labeling comprising:

one or more memories (704); and

one or more hardware processors (702), the one or more memories (704) coupled to the one or more hardware processors (702), wherein the one or more hardware processors (702) are configured to execute programmed instructions stored in the one or more memories (704) to:

employ a hierarchical capsule based neural network for sequence labeling, the hierarchical capsule based neural network comprising a sentence encoding layer, a document encoding layer, a fully connected layer and a conditional random fields (CRF) layer, the sentence encoding layer comprising a word embedding layer, a feature extraction layer composed of a first plurality of Bi-LSTMs, a primary capsule layer, and convolutional capsule layers, and the document encoding layer comprising a second plurality of Bi-LSTMs, wherein to employ the hierarchical capsule based neural network, the one or more hardware processors are configured by the instructions to:

determine, by the word embedding layer, initial sentence representations for a plurality of sentences associated with a task, each of the initial sentence representations comprising a concatenated embedding vector;

encode, by the first plurality of Bi-LSTMs, contextual semantics between words within a sentence of the plurality of sentences using the associated concatenated embedding vector to obtain a plurality of context vectors (Ci);

convolve with the plurality of context vectors (Ci) while skipping one or more context vectors in between, by the primary capsule layer comprising a filter, to obtain a capsule map comprising a plurality of contextual capsules associated with the plurality of sentences, the plurality of contextual capsules connected in multiple levels using shared transformation matrices and a routing model;

compute, by the Convolutional Capsule Layer, a final sentence representation for the plurality of sentences by determining coupling strength between child-parent pair contextual capsules of the plurality of contextual capsules connected in the multiple levels;

obtain, by a second plurality of Bi-LSTMs, contextual information between the plurality of sentences using the final sentence representations associated with the plurality of sentences, wherein the second plurality of Bi-LSTMs takes sentences at multiple time steps as input and produces a sequence of hidden state vectors corresponding to each of the plurality of sentences;

pass the hidden state vectors through the fully connected layer to output likelihood scores for probable labels for each sentence of the plurality of sentences; and

determine an optimized label sequence for the plurality of sentences by the CRF layer based at least on a sum of probable labels weighted by the likelihood scores.

8. The system as claimed in claim 7, wherein the concatenated embedding vector comprises a fixed-length vector corresponding to each word of the plurality of sentences, a fixed-length vector corresponding to a sentence being representative of lexical-semantics of words of the sentence.

9. The system as claimed in claim 7, wherein a number of the one or more skipped context vectors is a dilation rate (dr).

10. The system as claimed in claim 7, wherein a context vector of the plurality of context vectors associated with a word comprises a right context and a left context between the word and adjacent words.

11. The system as claimed in claim 7, wherein the filter (Wb) multiplies context vectors in {Ci+dr} with a stride of one to obtain a capsule pi, where pi = g(Wb Ci), and

g is a non-linear squash function.

12. The system as claimed in claim 7, wherein to determine the optimized label sequence, the one or more hardware processors are configured by the instruction to calculate, by the CRF layer, probability score for the label sequence associated with the plurality of sentences based on the sum of possible labels weighted by the likelihood scores and transition scores of moving from one label to another label.

13. One or more non-transitory machine readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause:

employing, via one or more hardware processors, a hierarchical capsule based neural network for sequence labeling, the hierarchical capsule based neural network comprising a sentence encoding layer, a document encoding layer, a fully connected layer and a conditional random fields (CRF) layer, the sentence encoding layer comprising a word embedding layer, a feature extraction layer composed of a first plurality of Bi-LSTMs, a primary capsule layer, and convolutional capsule layers, and the document encoding layer comprising a second plurality of Bi-LSTMs, wherein employing comprises:

determining, by the word embedding layer, initial sentence representations for a plurality of sentences associated with a task, each of the initial sentence representations comprising a concatenated embedding vector;

encoding, by the first plurality of Bi-LSTMs, contextual semantics between words within a sentence of the plurality of sentences using the associated concatenated embedding vector to obtain a plurality of context vectors (Ci);

convolving with the plurality of context vectors (Ci) while skipping one or more context vectors in between, by the primary capsule layer comprising a filter, to obtain a capsule map comprising a plurality of contextual capsules associated with the plurality of sentences, the plurality of contextual capsules connected in multiple levels using shared transformation matrices and a routing model;

computing, by the Convolutional Capsule Layer, a final sentence representation for the plurality of sentences by determining coupling strength between child-parent pair contextual capsules of the plurality of contextual capsules connected in the multiple levels;

obtaining, by a second plurality of Bi-LSTMs, contextual information between the plurality of sentences using the final sentence representations associated with the plurality of sentences, wherein the second plurality of Bi-LSTMs takes sentences at multiple time steps as input and produces a sequence of hidden state vectors corresponding to each of the plurality of sentences;

passing the hidden state vectors through the fully connected layer to output likelihood scores for probable labels for each sentence of the plurality of sentences; and

determining an optimized label sequence for the plurality of sentences by the CRF layer based at least on a sum of probable labels weighted by the likelihood scores.

Description:
SYSTEM AND METHOD FOR SEQUENCE LABELING USING HIERARCHICAL CAPSULE BASED NEURAL NETWORK

CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY

[001] The present application claims priority from Indian provisional application no. 201921025909, filed on June 28, 2019. The entire contents of the aforementioned application are incorporated herein by reference.

TECHNICAL FIELD

[002] The disclosure herein generally relates to sequence labeling, and, more particularly, to system and method for sequence labeling using hierarchical capsule based neural networks.

BACKGROUND

[003] In Natural Language Processing (NLP), maintaining a memory of history plays an important role in many tasks. While reading a book, for example, a reader retains certain summaries or short stories that represent the key aspects of the book; such retained information is referred to as 'context' in NLP.

[004] In many NLP areas, including but not limited to Dialogue Systems, Scientific Abstract Classification, and Part-of-Speech tagging, the treatment of context plays an important role in text classification. The context is sometimes arranged hierarchically, i.e., something from each of the previous statements needs to be remembered. Also, the context is sometimes present only sporadically in some of the previous sentences. The presence of such context makes the task of sentence classification even more challenging.

SUMMARY

[005] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for sequence labeling is provided. The method includes employing, via one or more hardware processors, a hierarchical capsule based neural network for sequence labeling. The hierarchical capsule based neural network includes a sentence encoding layer, a document encoding layer, a fully connected layer and a conditional random fields (CRF) layer, the sentence encoding layer comprising a word embedding layer, a feature extraction layer composed of a first plurality of Bi-LSTMs, a primary capsule layer, and convolutional capsule layers, and the document encoding layer comprising a second plurality of Bi-LSTMs. Employing the hierarchical capsule based neural network for sequence labeling includes determining, by the word embedding layer, initial sentence representations for a plurality of sentences associated with a task. Each of the sentence representations includes a concatenated embedding vector. The concatenated embedding vector includes a fixed-length vector corresponding to each word of the sentence. A fixed-length vector corresponding to a sentence is representative of lexical-semantics of words of the sentence. The feature extraction layer encodes contextual semantics between words within a sentence of the plurality of sentences using the concatenated embedding vector associated with each sentence of the plurality of sentences to obtain a plurality of context vectors. The method further includes convolving with the plurality of context vectors (Ci) while skipping one or more context vectors in between, by the primary capsule layer comprising a filter, to obtain a capsule map comprising a plurality of contextual capsules associated with the plurality of sentences. The plurality of contextual capsules are connected in multiple levels using shared transformation matrices and a routing model. A number of the one or more skipped context vectors is a dilation rate (dr). A final sentence representation is computed for the plurality of sentences. Computing the final sentence representation includes determining coupling strength between child-parent pair contextual capsules. Contextual information between the plurality of sentences is obtained by a second plurality of Bi-LSTMs. The contextual information is obtained using the final sentence representations associated with the plurality of sentences. The second plurality of Bi-LSTMs takes sentences at multiple time steps and produces a sequence of hidden state vectors corresponding to each of the plurality of sentences. The hidden state vectors are passed through the fully connected layer to output likelihood scores for possible labels for each sentence of the plurality of sentences. An optimized label sequence is obtained for the plurality of sentences by the CRF layer based at least on a sum of possible labels weighted by the likelihood scores.

[006] In another aspect, a system for sequence labeling is provided. The system includes one or more memories; and one or more hardware processors, the one or more memories coupled to the one or more hardware processors, wherein the one or more hardware processors are configured to execute programmed instructions stored in the one or more memories to employ a hierarchical capsule based neural network for sequence labeling. The hierarchical capsule based neural network includes a sentence encoding layer, a document encoding layer, a fully connected layer and a conditional random fields (CRF) layer, the sentence encoding layer comprising a word embedding layer, a feature extraction layer composed of a first plurality of Bi-LSTMs, a primary capsule layer, and convolutional capsule layers, and the document encoding layer comprising a second plurality of Bi-LSTMs. Employing the hierarchical capsule based neural network for sequence labeling includes determining, by the word embedding layer, initial sentence representations for a plurality of sentences associated with a task. Each of the sentence representations includes a concatenated embedding vector. The concatenated embedding vector includes a fixed-length vector corresponding to each word of the sentence. A fixed-length vector corresponding to a sentence is representative of lexical-semantics of words of the sentence. The feature extraction layer encodes contextual semantics between words within a sentence of the plurality of sentences using the concatenated embedding vector associated with each sentence of the plurality of sentences to obtain a plurality of context vectors. Employing the hierarchical capsule based neural network further includes convolving with the plurality of context vectors (Ci) while skipping one or more context vectors in between, by the primary capsule layer comprising a filter, to obtain a capsule map comprising a plurality of contextual capsules associated with the plurality of sentences. The plurality of contextual capsules are connected in multiple levels using shared transformation matrices and a routing model. A number of the one or more skipped context vectors is a dilation rate (dr). A final sentence representation is computed for the plurality of sentences. Computing the final sentence representation includes determining coupling strength between child-parent pair contextual capsules. Contextual information between the plurality of sentences is obtained by a second plurality of Bi-LSTMs. The contextual information is obtained using the final sentence representations associated with the plurality of sentences. The second plurality of Bi-LSTMs takes sentences at multiple time steps and produces a sequence of hidden state vectors corresponding to each of the plurality of sentences. The hidden state vectors are passed through the fully connected layer to output likelihood scores for possible labels for each sentence of the plurality of sentences. An optimized label sequence is obtained for the plurality of sentences by the CRF layer based at least on a sum of possible labels weighted by the likelihood scores.

[007] In yet another aspect, a non-transitory computer readable medium for a method of sequence labeling is provided. The method includes employing, via one or more hardware processors, a hierarchical capsule based neural network for sequence labeling. The hierarchical capsule based neural network includes a sentence encoding layer, a document encoding layer, a fully connected layer and a conditional random fields (CRF) layer, the sentence encoding layer comprising a word embedding layer, a feature extraction layer composed of a first plurality of Bi-LSTMs, a primary capsule layer, and convolutional capsule layers, and the document encoding layer comprising a second plurality of Bi-LSTMs. Employing the hierarchical capsule based neural network for sequence labeling includes determining, by the word embedding layer, initial sentence representations for a plurality of sentences associated with a task. Each of the sentence representations includes a concatenated embedding vector. The concatenated embedding vector includes a fixed-length vector corresponding to each word of the sentence. A fixed-length vector corresponding to a sentence is representative of lexical-semantics of words of the sentence. The feature extraction layer encodes contextual semantics between words within a sentence of the plurality of sentences using the concatenated embedding vector associated with each sentence of the plurality of sentences to obtain a plurality of context vectors. The method further includes convolving with the plurality of context vectors (Ci) while skipping one or more context vectors in between, by the primary capsule layer comprising a filter, to obtain a capsule map comprising a plurality of contextual capsules associated with the plurality of sentences. The plurality of contextual capsules are connected in multiple levels using shared transformation matrices and a routing model. A number of the one or more skipped context vectors is a dilation rate (dr). A final sentence representation is computed for the plurality of sentences. Computing the final sentence representation includes determining coupling strength between child-parent pair contextual capsules. Contextual information between the plurality of sentences is obtained by a second plurality of Bi-LSTMs. The contextual information is obtained using the final sentence representations associated with the plurality of sentences. The second plurality of Bi-LSTMs takes sentences at multiple time steps and produces a sequence of hidden state vectors corresponding to each of the plurality of sentences. The hidden state vectors are passed through the fully connected layer to output likelihood scores for possible labels for each sentence of the plurality of sentences. An optimized label sequence is obtained for the plurality of sentences by the CRF layer based at least on a sum of possible labels weighted by the likelihood scores.

[008] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

[009] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:

[010] FIG. 1 illustrates an exemplary network environment for implementation of a system for sequence labeling using a hierarchical capsule based neural network, according to some embodiments of the present disclosure.

[011] FIGS. 2A and 2B illustrate a flow diagram of a method for sequence labeling using a hierarchical capsule based neural network, in accordance with an example embodiment of the present disclosure.

[012] FIG. 3 illustrates an example block diagram of a hierarchical capsule based neural network for sequence labeling, in accordance with an example embodiment of the present disclosure.

[013] FIG. 4 illustrates a sentence encoder of the hierarchical capsule based NN of FIG. 3, in accordance with an example embodiment of the present disclosure.

[014] FIG. 5 illustrates a capsule layer of the sentence encoder of the hierarchical capsule based neural network of FIG. 3, in accordance with an example embodiment of the present disclosure.

[015] FIG. 6 illustrates a document encoder and a CRF layer of the hierarchical capsule based neural network of FIG. 3, in accordance with an example embodiment of the present disclosure.

[016] FIG. 7 illustrates a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS

[017] Sequence labeling is one of the most prominent tasks in NLP. Sequence labeling refers to assigning a sequence of labels to a sequence of objects. For instance, in NLP, sequence labeling has applications in speech recognition, POS tagging, named-entity recognition, and so on.

[018] In Natural Language Processing (NLP), maintaining a memory of history plays an important role in many tasks. For example, while reading a long book the reader may remember key aspects of sentences and, at the end of a paragraph, can create a short summary of what has been read. The reader may mentally collect these 'short stories' or 'summaries' while moving forward in the book and finally amalgamate these summaries to create a long-sustaining memory of the book, which typically lacks verbiage (words or phrases that do not play a crucial role in remembering key aspects) and focuses mainly on the critical points of the book. These 'short stories' or 'summaries' are referred to as 'context' in NLP.

[019] Sequence labeling is one of the important tasks in text classification problems and requires proper treatment of the context, making this task different from general text classification. Traditional text classification models do not carry context from one sentence to another and hence may not perform well on tasks associated with NLP, such as Dialogue Systems, Scientific Abstract Classification, Part-of-Speech tagging, and so on. These traditional models lack a hierarchical structure that can aid them in dissecting the input structure at different levels to allow flow of context between sentences. The aforementioned tasks tend to be complex because of improper treatment of context. The context is sometimes arranged hierarchically, meaning that the model may need to remember something from each of the previous statements. Also, the context is sometimes present only sporadically in some of the previous sentences. The presence of such context makes the task of sentence classification even more challenging.

[020] Prominent examples of sequence labeling tasks include dialogue act classification and scientific abstract classification. These tasks are described in more detail in the description below. It will be noted herein that even though some of the embodiments are described with reference to the tasks of dialogue act classification and scientific abstract classification, the embodiments are equally applicable to various other NLP based tasks.

[021] Dialogue act classification: The study of utterances in dialogues is an intriguing challenge in the field of computational linguistics and has been studied from a variety of perspectives, such as linguistics and psychology, which have been extended to computational strategies. Study of these acts can help in understanding the discourse structure, which in turn can be used to enhance the capabilities of conversational systems. Still, there is an absence of a definite model for understanding the discourse structure, which typically consists of unconstrained interactions between humans. Dialogue Acts (DAs), however, are one of the traits that can be used to understand these complex structures. Dialogue Acts have proven their usefulness in many NLP problems like Spoken Language Understanding (SLU). For example, DAs can be domain-dependent intents, such as "Find books" or "Show flights" in the "Library" and "Flight" domains. DAs are also used in many machine translation tasks where the goal of the Automatic Speech Recognition (ASR) tool is to understand the utterance and respond accordingly; DAs are used in this practice to increase the word recognition accuracy. DAs are also widely used to animate the model of a talking head by making facial expressions resembling human beings, e.g., if a surprising statement has been made to the model, it can imitate a human making a bewildered expression.

[022] Scientific Abstract Classification: There has been a plethora of scientific articles published every year; more than 50 million scientific articles have been published till now, with more and more of these articles coming out each year. With such a large number of scientific articles being published every year, the process of categorizing them properly or searching for a relevant piece of information has become an arduous task. Thus, an organized system of these articles can facilitate the process of searching. To create such systems, an intelligent tool is needed that can facilitate the categorization of these scientific articles into meaningful classes based on the information within them. Ideally, someone seeking relevant information looks into the abstract of the scientific articles and categorizes its sentences into one of several categories such as objective, solution, results, and conclusion. However, these categories are not stated explicitly in these articles, and hence one can find it difficult to comprehend them. One of the major challenges with these scientific articles arises due to the adherence of their writers to a variety of artistic styles, which makes them unstructured and makes it hard to extract the essential elements. For example, one may think it is good to introduce background knowledge before providing the result of the work, while on the other hand, one can describe the objective before providing the results. Hence, it would be beneficial to develop an intelligent tool that can extract these elements categorically, thereby saving both time and human effort.

[023] In both of the tasks defined above, namely dialogue act classification and scientific abstract classification, a proper understanding of context is required to categorize a sentence. This makes the task of sequence labeling different from traditional text classification. Amalgamation of important contextual and current information to classify the sentence is at the center of the sequence labeling task and has traditionally been solved with the help of Conditional Random Fields (CRFs), Hidden Markov Models (HMMs), and so on. CRFs have been proven to perform well in many NLP tasks like POS tagging, Named Entity Recognition, and so on, each of which requires contextual information and past history of sequences to categorize the current input.

[024] Initially, sequence labeling problems were tackled with one of the pre-eminent models in machine learning, i.e., Hidden Markov Models (HMMs). Since their advent, HMMs have been used in many sequence labeling problems and have been widely applied to text processing problems. The CRF, which can be seen as an extension of the HMM, was later proposed. CRFs have been used in many challenging text processing problems and were dominant in most of them. One shortcoming of such approaches was the need to manually provide features to them, which is a time-consuming process.

[025] Later, deep learning based models were introduced that focused on using RNNs, CNNs, and sometimes a hybrid model combining both. These models are equipped with the capability to automatically capture the relevant features from the input and do not completely rely on human intervention. CNNs are known to perform well on short texts and to extract N-gram phrases; however, they pose the difficult task of choosing an optimal window size. Further, dilated CNNs were introduced. To eschew the process of selecting an optimal window size, RNNs can be used, which use their recurrent structure to capture long-term dependencies. RNNs also present some difficulties, as they are said to focus on the current input and its neighboring words at any time step, which consequently results in a summary biased towards the extreme ends of the sentence. On reaching the extremes, they may forget necessary information that appears in the middle of the document.

[026] Hybrid approaches combining both CNNs and RNNs have been proposed to overcome their respective shortcomings. Capsule networks have been shown to perform well on some text classification tasks. Nonetheless, these models only consider the current input for classification and hence lack the context semantics generated by the neighboring sentences, which are beneficial in the sequence labeling task.

[027] To optimize sequence labeling operations, many techniques combining deep learning approaches with CRFs have been shown to outperform many state-of-the-art approaches. A combination of Bi-LSTM and CRF incorporates the contextual information obtained from RNNs with the interdependence between subsequent labels, improving the labeling process. In a conventional system, a multi-hop attention layer is used with CRFs to combine sentence-level contextual information and interdependency between the layers to strengthen the training process. Said system also replaced the character-based word embedding with attention-based pooling on both RNNs and CNNs.

[028] Various embodiments disclosed herein provide a system and method for sequence labeling using a hierarchical capsule based neural network. For example, in one embodiment, the method includes obtaining a sentence representation of sentences associated with a task by encoding the input words of each sentence into fixed-length vectors. After obtaining the word representations, said word representations are concatenated into one single vector of fixed length; this fixed-length vector is then passed first to a feature extraction layer and then to a layer of capsules to further squeeze out the essential word-level features. Similarly, sentence representations for all the sentences in an abstract or dialogue associated with the task are obtained, stacked together, and then passed to a bidirectional long short term memory (Bi-LSTM) layer to get a document representation (whole dialogue and/or abstract representation) enriched with contextual information. The representation obtained at each time-step is used for calculating the likelihood of each label. Finally, with the help of the CRF layer, an optimal label sequence, i.e., a label for every sentence of the document, is obtained by remembering the label history.

[029] An important technical contribution of the disclosed embodiments is a hierarchical neural network architecture which obtains the sentence representation using capsules. It will be understood that obtaining an intermediate representation in an NLP sequence labeling task using capsules has the technical advantage of reducing model training time, the number of parameters, and complexity as compared to conventional models such as attention and transformer based models, and so on.

[030] RNNs are known to extract contextual information by focusing only on the neighboring words. As a consequence, the final hidden state representation (which is normally used in text classification problems) may have contextual information that is biased toward the extremes of the sentence. For calculating a representation vector for each sentence, the disclosed system convolves hidden activations of RNN units (summaries of the text) that are separated by a fixed distance referred to as the 'dilation rate'. This method allows the disclosed system to focus not only on the neighboring vectors but also on hidden state vectors that are scattered throughout the sentence.

[031] Conventionally, CNNs, despite their capability to extract N-word phrases, have the problem of selecting an optimal window size. A short window size may result in lossy information compression, while an increase in window size may lead to an increase in the number of parameters, which increases the burden of the training process. Various embodiments disclose a method of first extracting smoothened contextual information as low-level features by using a Bi-LSTM layer instead of CNNs. These smoothened low-level features are then passed through subsequent layers. It will be understood that using the Bi-LSTMs to capture low-level features in the sentence allows collection of information that can be used to infer more complex ones. These and other features of the disclosed embodiments are described further with reference to FIGS. 1-7 below.

[032] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.

[033] Referring now to the drawings, and more particularly to FIG. 1 through 7, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.

[034] FIG. 1 illustrates an example network implementation 100 of a system 102 for sequence labeling using a hierarchical capsule based neural network, in accordance with an example embodiment. The hierarchical structure of the disclosed hierarchical capsule based neural network aids in dissecting an input structure of data (for example, sentences, abstracts, paragraphs, and so on) at different levels to allow flow of context between sentences in the data. In various embodiments disclosed herein, the hierarchical neural network includes Bi-LSTMs, a dilated convolution operation, capsules and a conditional random field (CRF) layer to understand the discourse and/or abstract structure of the data and predict the next probable label by using the label history.

[035] In an embodiment, the system 102 employs the hierarchical capsule based neural network for the purpose of sequence labeling. In an embodiment, the hierarchical capsule based neural network includes a sentence encoding layer, a document encoding layer, a fully connected layer and a conditional random fields (CRF) layer. The sentence encoding layer includes a word embedding layer, a feature extraction layer including a first plurality of Bi-LSTM layers, a primary capsule layer, and convolutional capsule layers. The document encoding layer includes a second plurality of Bi-LSTM layers.

[036] The word embedding layer obtains a fixed-size vector representation of each word of a sentence. The feature extraction layer, composed of the first plurality of Bi-LSTMs, encodes the whole sentence; a primary capsule layer and convolutional capsule layers then extract the high-level features from the sentence. All the sentence encodings within a dialogue or an abstract are then stacked up together and passed through the second plurality of Bi-LSTMs (in a second Bi-LSTM layer) to squeeze out the contextual information from the sentences. Thereafter, the fully connected layer calculates likelihood scores and, finally, the CRF layer obtains an optimized label sequence for the sentences of the dialogue or the abstract. The architecture for the hierarchical capsule based neural network for sequence labeling is described in detail with reference to FIGS. 1-7.
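By way of a purely illustrative, non-limiting sketch, the composition of these layers may be pictured as follows in PyTorch. This is not the reference implementation of the disclosed network: the capsule stage is reduced here to a single shared projection followed by a squash non-linearity and mean pooling, the CRF decoding step is omitted, and all dimensions (d_word, d_sen, d_cap, d_doc, number of labels) are assumed values chosen only for illustration. The dilated primary capsule layer, dynamic routing, and CRF decoding are sketched separately later in this description.

```python
import torch
import torch.nn as nn

def squash(s, eps=1e-8):
    # non-linear squash: short vectors shrink toward 0, long vectors approach unit length
    n = torch.linalg.norm(s, dim=-1, keepdim=True)
    return (n ** 2 / (1 + n ** 2)) * s / (n + eps)

class HierarchicalLabeler(nn.Module):
    """Illustrative stack: sentence encoder (embedding + Bi-LSTM + simplified capsule stage)
    under a document encoder (second Bi-LSTM) and a fully connected scoring layer."""
    def __init__(self, vocab, d_word=100, d_sen=64, d_cap=32, d_doc=64, labels=5):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_word)
        self.sent_bilstm = nn.LSTM(d_word, d_sen, batch_first=True, bidirectional=True)
        self.to_capsule = nn.Linear(2 * d_sen, d_cap)     # placeholder for the capsule layers
        self.doc_bilstm = nn.LSTM(d_cap, d_doc, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * d_doc, labels)            # likelihood scores per sentence

    def forward(self, doc):                               # doc: (M sentences, N words)
        c, _ = self.sent_bilstm(self.emb(doc))            # per-word context vectors
        s = squash(self.to_capsule(c)).mean(dim=1)        # one vector per sentence (placeholder pooling)
        h, _ = self.doc_bilstm(s.unsqueeze(0))            # contextual information across sentences
        return self.fc(h).squeeze(0)                      # (M, labels); a CRF layer would decode these

model = HierarchicalLabeler(vocab=1000)
print(model(torch.randint(0, 1000, (7, 12))).shape)       # torch.Size([7, 5])
```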

[037] Although the present disclosure is explained considering that the system 102 is implemented on a server, it may be understood that the system 102 may also be implemented in a variety of computing systems 104, such as a laptop computer, a desktop computer, a notebook, a workstation, a cloud-based computing environment and the like. It will be understood that the system 102 may be accessed through one or more devices 106-1, 106-2... 106-N, collectively referred to as devices 106 hereinafter, or applications residing on the devices 106. Examples of the devices 106 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, a Smartphone, a tablet computer, a workstation and the like. The devices 106 are communicatively coupled to the system 102 through a network 108.

[038] In an embodiment, the network 108 may be a wireless or a wired network, or a combination thereof. In an example, the network 108 can be implemented as a computer network, as one of the different types of networks, such as virtual private network (VPN), intranet, local area network (LAN), wide area network (WAN), the internet, and such. The network 108 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), and Wireless Application Protocol (WAP), to communicate with each other. Further, the network 108 may include a variety of network devices, including routers, bridges, servers, computing devices, and storage devices. The network devices within the network 108 may interact with the system 102 through communication links.

[039] As discussed above, the system 102 may be implemented in a computing device 104, such as a hand-held device, a laptop or other portable computer, a tablet computer, a mobile phone, a PDA, a smartphone, and a desktop computer. The system 102 may also be implemented in a workstation, a mainframe computer, a server, and a network server. In an embodiment, the system 102 may be coupled to a data repository, for example, a repository 112. The repository 112 may store data processed, received, and generated by the system 102. In an alternate embodiment, the system 102 may include the data repository 112.

[040] The network environment 100 supports various connectivity options such as BLUETOOTH®, USB, ZigBee, and cellular services. The network environment enables connection of devices 106, such as a Smartphone, with the server 104, and accordingly with the database 112, using any communication link including the Internet, WAN, MAN, and so on. In an exemplary embodiment, the system 102 is implemented to operate as a stand-alone device. In another embodiment, the system 102 may be implemented to work as a loosely coupled device to a smart computing environment. The components and functionalities of the system 102 are described further in detail with reference to FIGS. 2A-7.

[041] Referring collectively to FIGS. 2A-6, components and functionalities of the system 102 for sequence labeling using a hierarchical capsule based NN are described in accordance with an example embodiment. For example, FIGS. 2A-2B illustrate a flow diagram of a method for sequence labeling using a hierarchical capsule based NN, in accordance with an example embodiment of the present disclosure. FIG. 3 illustrates an example block diagram of the hierarchical capsule based NN, in accordance with an example embodiment of the present disclosure. FIG. 4 illustrates a sentence encoder of the hierarchical capsule based NN of FIG. 3, in accordance with an example embodiment of the present disclosure. FIG. 5 illustrates a capsule layer of the sentence encoder of the hierarchical capsule based NN of FIG. 3. FIG. 6 illustrates a document encoder and a CRF layer of the hierarchical capsule based NN of FIG. 3, in accordance with an example embodiment of the present disclosure.

[042] Mathematically sequence labeling can be described as below:

Given evidence E = (e1, e2, ..., en) about a particular event (for example, previous utterances in a conversation), a class sequence O = (o1, o2, ..., on) is to be determined that has the highest posterior probability P(O|E) given the evidence E.

[043] Expanding the formula and applying Bayes' rule,

P(O|E) = P(E|O) P(O) / P(E)

[044] Herein, P(O) represents the prior probability of the class sequence, P(E|O) is the likelihood of O given the evidence E, and the denominator P(E) is common across all the calculations and can be ignored.

Table I: Example of evidences (e) and label sequence (o)

[045] As is illustrated in Table I, the conversation includes a plurality of sentences such as "Hi, good morning are you", "I'm fine, thank you. How are you", "I'm fine too", "Can I ask you a question", and so on. The disclosed hierarchical capsule based neural network is configured to predict labels such as Greetings, Thanking, Y/N Question, and so on for the sequence of sentences of the conversation. The method for sequence labeling is described further below.

[046] At 202 of method 200, a hierarchical capsule based neural network for sequence labeling is employed. A hierarchical capsule based neural network 300 (illustrated with reference to FIG. 3) includes a sentence encoding layer 302, a document encoding layer 304, a feed forward layer 306 and a CRF layer 308. For the sake of brevity of description, the term 'sentence encoding layer' may be used interchangeably with the term 'sentence encoder'. The sentence encoding layer 302 (illustrated with reference to FIG. 4) includes a word embedding layer 402, a feature extraction layer 404, and a primary capsule layer 408. Referring to FIG. 6, the hierarchical capsule based neural network 300 further includes the document encoding layer 304 (including a second Bi-LSTM layer), the fully connected layer 306 and the CRF layer 308.

[047] At 204, the method 200 includes determining an initial sentence representation for each of the plurality of sentences associated with the task (for example, the conversation). The initial sentence representation (410) is determined by the word embedding layer 402 of the sentence encoder 400. Each of the sentence representations includes a concatenated embedding vector. The concatenated embedding vector includes a fixed-length vector Vi corresponding to each word Wi of the sentence. The fixed-length vector corresponding to a sentence is representative of lexical-semantics of words of the sentence. In an embodiment, the fixed-length vector Vi corresponding to a word Wi is obtained from a 'weight matrix' W ∈ R^(dword × |V|), where dword is the vector dimension and |V| is the vocabulary size. Each column j of the weight matrix corresponds to a vector Wj ∈ R^(dword) of the j-th word in the vocabulary. Each Vi represents the lexical-semantics of words obtained after pre-training on a large corpus through unsupervised training.

[048] At 206 of method 200, the feature extraction layer 404 composed of a first plurality of Bi-LSTMs encodes contextual semantics between the words within a sentence of the plurality of sentences using the concatenated embedding vector associated with each sentence to obtain a plurality of context vectors. For example, for a sentence of length N, the concatenated embedding vectors may be {V1, V2, ..., VN}. The contextual semantics between words within a sentence is encoded through it. The output from the feature extraction layer is Ci = [ci^r; ci^l] ∈ R^(2 × dsen) for a word Wi, where ci^r and ci^l are the right and left contexts (hidden activations), and dsen is the number of LSTM units. Finally, for all the N words, the context vectors {C1, C2, ..., CN} are obtained.
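A minimal PyTorch sketch of this word embedding plus Bi-LSTM feature extraction step is given below for illustration; it is not the patent's implementation, and the dimensions dword = 100 and dsen = 64 (so each context vector Ci has 2 × dsen = 128 entries) are assumed values.

```python
import torch
import torch.nn as nn

class SentenceFeatureExtractor(nn.Module):
    """Word embedding layer followed by a Bi-LSTM feature extraction layer."""
    def __init__(self, vocab_size, d_word=100, d_sen=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_word)     # weight matrix W in R^{d_word x |V|}
        self.bilstm = nn.LSTM(d_word, d_sen, batch_first=True,
                              bidirectional=True)              # right and left contexts

    def forward(self, word_ids):
        # word_ids: (batch, N) integer word indices for sentences of length N
        v = self.embedding(word_ids)       # (batch, N, d_word) fixed-length word vectors V_i
        c, _ = self.bilstm(v)              # (batch, N, 2*d_sen) context vectors C_i = [c_i^r; c_i^l]
        return c

enc = SentenceFeatureExtractor(vocab_size=1000)
context = enc(torch.randint(0, 1000, (2, 12)))   # two sentences of 12 tokens each
print(context.shape)                              # torch.Size([2, 12, 128])
```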

[049] At 208 of method 200, the plurality of context vectors (Ci) are convolved while skipping one or more context vectors in between to obtain a capsule map. The capsule map includes a plurality of contextual capsules associated with the plurality of sentences. The plurality of contextual capsules are connected in multiple levels using shared transformation matrices and a routing model, as is described below.

[050] The plurality of context vectors are convolved by the primary capsule layer 408 of the sentence encoder. Capsules replace singular scalar outputs with local "capsules", which are vectors of highly informative outputs known as "instantiated parameters". In text processing, these instantiated parameters can be hypothesized as local orders of the words and their semantic representation. In an embodiment, to capture the semantics and cover a large part of a sentence, the primary capsule layer 408 includes a filter (or shared window) Wb which convolves with the adjacent context vectors (Ci, Ci+1, ...) as well as with distant context vectors Ci+dr, skipping a number of context vectors. The number (or count) of context vectors that are skipped may be referred to as the dilation rate (dr) (marked as label 406 in FIG. 4).

[051] For context vectors Ci, a shared window with holes Wb ∈ R^((2 × dsen) × d), where d is the capsule dimension, convolves with Ci's separated at a distance of dr so as to cover a large part of the sentence. The shared window Wb multiplies the vectors in {Ci+dr} with a stride of one to get a capsule pi:

pi = g(Wb Ci),

where g is a non-linear squash function.

[052] The non-linear squash activation keeps smaller vectors close to 0 and larger vectors close to 1. After the convolution operation, a capsule feature map (P) is created:

P = [p1, p2, ..., pC] ∈ R^((N × C) × d),

stacked with a total of N × C d-dimensional capsules representing the contextual capsules.
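A small NumPy sketch of the squash non-linearity and of the stated formula pi = g(Wb Ci) is given below; it is illustrative only. The exact way the dilation rate dr groups the context vectors is only partially recoverable from the text, so this sketch simply applies the shared window to every context vector with a stride of one; all array sizes are assumed for illustration.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Non-linear squash: short vectors shrink toward 0, long vectors approach unit length."""
    norm = np.linalg.norm(s, axis=-1, keepdims=True)
    return (norm ** 2 / (1.0 + norm ** 2)) * (s / (norm + eps))

def primary_capsule_map(C, W_b):
    """Apply the shared window to each context vector with stride one: p_i = squash(W_b C_i).
    (The dilated selection of which C_i the window touches is not modeled here.)"""
    return np.stack([squash(c @ W_b) for c in C])        # capsule feature map P

rng = np.random.default_rng(0)
C = rng.standard_normal((9, 128))                         # N = 9 context vectors, 2*d_sen = 128
P = primary_capsule_map(C, W_b=rng.standard_normal((128, 16)))
print(P.shape)                                            # (9, 16): 16-dimensional contextual capsules
```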

[053] In an embodiment, an iterative dynamic routing model is used to introduce a coupling effect in which the agreement between lower level capsules (layer l) and higher level capsules (layer l+1) is maintained. In an example scenario, if the number of contextual capsules with low level features at layer l is "m", and the number of contextual capsules at layer (l+1) is "n", then, for a capsule j at layer (l+1), the output vector can be computed by:

vj = g(Σi cij Ws ui),

where cij is the coupling coefficient between capsule i of layer l and capsule j of layer (l+1), determined by iterative dynamic routing, and Ws is the shared weight matrix between layers l and l+1 (FIG. 5). In an embodiment, a softmax function is utilized for the computations. The softmax function is used over all the logits b to determine the connection strength between the capsules. The coupling coefficients cij are calculated iteratively in 'r' rounds by:

cij = exp(bij) / Σk exp(bik)

[054] The logits bij, which are initially equal, determine how strongly capsule j should be coupled with capsule i. Consequently, in spite of using RNNs, which are said to be biased toward the extremes, combining the dilation operation (shown in FIG. 6) inside the Bi-LSTM with dynamic routing between capsules enables the disclosed system to also focus on the middle of the sentence.
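The iterative routing described above can be sketched as follows; this is a generic dynamic-routing loop in NumPy, not the patent's code, and the prediction vectors u_hat, the number of rounds r, and the agreement update follow the standard formulation assumed here.

```python
import numpy as np

def squash(s, eps=1e-8):
    norm = np.linalg.norm(s, axis=-1, keepdims=True)
    return (norm ** 2 / (1.0 + norm ** 2)) * (s / (norm + eps))

def dynamic_routing(u_hat, r=3):
    """Route m child capsules to n parent capsules over r rounds.
    u_hat: (m, n, d) prediction vectors u_{j|i} for each child-parent pair."""
    m, n, _ = u_hat.shape
    b = np.zeros((m, n))                                       # logits b_ij start out equal
    for _ in range(r):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)                   # c_ij = exp(b_ij) / sum_k exp(b_ik)
        s = (c[..., None] * u_hat).sum(axis=0)                 # weighted sum over child capsules
        v = squash(s)                                          # parent capsule outputs v_j
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)           # agreement strengthens the coupling
    return v

rng = np.random.default_rng(0)
parents = dynamic_routing(rng.standard_normal((24, 8, 16)))    # 24 children -> 8 parents of dim 16
print(parents.shape)                                            # (8, 16)
```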

[055] At 210, the method 200 includes computing, by the convolutional capsule layer 306, a final sentence representation for the plurality of sentences. In the convolutional capsule layer 306, the capsules are connected to lower level capsules, which determine the child-parent relationship by multiplying the shared transformation matrices followed by the routing algorithm. In an embodiment, the final sentence representation is calculated by determining the coupling strength between child-parent pair contextual capsules uj|i:

uj|i = Wij ui,

where ui is the child capsule and Wij is the shared weight between capsules i and j.

[056] Finally, the coupling strength between the child-parent capsules is determined by the routing algorithm to produce the parent feature map. The capsules are then flattened out into a single layer and multiplied by a transformation matrix WFC, followed by the routing algorithm, to compute the final sentence representation (sk).
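Continuing the same illustrative sketch (and reusing squash() and dynamic_routing() from the previous block), the child-parent transformation uj|i = Wij ui followed by routing, and the final flatten-plus-WFC step producing the sentence vector sk, could look as follows; every array size here is an assumption made only for demonstration.

```python
import numpy as np
rng = np.random.default_rng(0)

def capsule_layer(children, W, r=3):
    """Prediction vectors u_{j|i} = W_ij u_i from shared transformation matrices,
    followed by dynamic routing (defined in the previous sketch)."""
    # children: (m, d_in); W: (m, n, d_in, d_out)
    u_hat = np.einsum('mi,mnio->mno', children, W)
    return dynamic_routing(u_hat, r=r)

P = rng.standard_normal((24, 16))                                    # primary (child) capsules
conv_caps = capsule_layer(P, rng.standard_normal((24, 8, 16, 16)))   # convolutional capsule layer
flat = conv_caps.reshape(-1, 16)                                     # flatten into a single layer
s_k = capsule_layer(flat, rng.standard_normal((8, 1, 16, 32)))[0]    # multiply by W_FC, then route
print(s_k.shape)                                                      # (32,) final sentence representation
```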

[057] After obtaining all the sentence representations for a dialogue/abstract, the contextual information between the plurality of sentences is captured at 212. The contextual information between the plurality of sentences is captured by using the second Bi-LSTM layer (having the second plurality of Bi-LSTMs 304) that takes a sentence at every time step and produces a sequence of hidden state vectors [h'1, h'2, ..., h'M] corresponding to each of the plurality of sentences (M). At 214, the hidden state vectors are passed through the feed forward layer 306 to output vectors o ∈ R^a, where a is the total number of possible labels for a sentence. The output vectors o provide likelihood scores/probabilities for the possible labels for each sentence of the plurality of sentences.
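The document encoding layer and the feed forward scoring step can be pictured with the following minimal PyTorch sketch; it is illustrative only, the dimensions and the number of labels are assumptions, and the CRF decoding applied on top of these scores is sketched after the next paragraphs.

```python
import torch
import torch.nn as nn

class DocumentEncoder(nn.Module):
    """Second Bi-LSTM over the stacked sentence representations, followed by a feed forward
    layer that outputs likelihood scores for the 'a' possible labels of each sentence."""
    def __init__(self, d_sent=32, d_doc=64, num_labels=5):
        super().__init__()
        self.bilstm = nn.LSTM(d_sent, d_doc, batch_first=True, bidirectional=True)
        self.feed_forward = nn.Linear(2 * d_doc, num_labels)

    def forward(self, sentence_reprs):
        # sentence_reprs: (batch, M, d_sent) stacked s_k vectors for M sentences
        h, _ = self.bilstm(sentence_reprs)     # hidden state vectors h'_1 ... h'_M
        return self.feed_forward(h)            # (batch, M, num_labels) likelihood scores o_i

doc_enc = DocumentEncoder()
scores = doc_enc(torch.randn(1, 7, 32))        # a dialogue/abstract of 7 sentences
print(scores.shape)                             # torch.Size([1, 7, 5])
```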

[058] At 216, the CRF layer 308 obtains an optimized label sequence for the plurality of sentences based at least on a sum of possible labels weighted by the likelihood scores. The CRF layer 308 adds constraints for the final valid prediction labels (or possible labels). For example, in a natural question answering dialogue it would be natural to give a Yes-No-Answer if the responder has been asked a Yes-No-Question; similarly, in an abstract, the Objective is typically defined before moving on to its Solution. The CRF layer can induce constraints for such patterns to generate a final valid inference based on the training data. To model such dependencies, a transition matrix T ∈ R^(a × a) is used, where a is the number of possible labels. An entry T[ai, aj] corresponds to a weight score for transitioning from label i to label j. The score for a label sequence [y1, y2, ..., yM] is calculated as the sum of the labels yi weighted by the probabilities oi computed in the previous layer (FIG. 6) and the transition scores of moving from label y(i-1) to label yi:

score([y1, y2, ..., yM]) = Σi oi[yi] + Σi T[y(i-1), yi]

[059] Finally, taking a softmax over all possible tag sequences yields a probability for the sequence [y1, y2, ..., yM]:

P([y1, y2, ..., yM]) = exp(score([y1, y2, ..., yM])) / Σ(y' ∈ Y) exp(score(y'))

Here, score(y') is the score for a possible sequence y' in Y.

[060] During training, the log-probability of the correct labels provided in the training data is maximized, and while decoding, the output sequence with the maximum score can be predicted using the Viterbi algorithm. An example scenario of sequence labeling by using the proposed capsule based hierarchical neural network is described further below.
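An illustrative NumPy sketch of the sequence score and of Viterbi decoding over the likelihood scores o and the transition matrix T is shown below; it assumes the standard linear-chain CRF formulation (emission score plus transition score), with all values randomly generated purely for demonstration.

```python
import numpy as np

def sequence_score(o, T, labels):
    """score([y_1..y_M]) = sum_i o_i[y_i] + sum_i T[y_{i-1}, y_i]."""
    s = sum(o[i, y] for i, y in enumerate(labels))
    s += sum(T[labels[i - 1], labels[i]] for i in range(1, len(labels)))
    return s

def viterbi_decode(o, T):
    """Return the label sequence with the maximum CRF score."""
    M, a = o.shape
    dp = o[0].copy()                              # best score ending in each label at step 0
    back = np.zeros((M, a), dtype=int)
    for i in range(1, M):
        cand = dp[:, None] + T + o[i][None, :]    # previous label x next label
        back[i] = cand.argmax(axis=0)
        dp = cand.max(axis=0)
    best = [int(dp.argmax())]
    for i in range(M - 1, 0, -1):                 # backtrack through the stored argmaxes
        best.append(int(back[i, best[-1]]))
    return best[::-1]

rng = np.random.default_rng(0)
o = rng.standard_normal((4, 3))                   # likelihood scores for 4 sentences, 3 labels
T = rng.standard_normal((3, 3))                   # transition matrix T[a_i, a_j]
path = viterbi_decode(o, T)
print(path, sequence_score(o, T, path))
```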

[061] FIG. 7 is a block diagram of an exemplary computer system 701 for implementing embodiments consistent with the present disclosure. The computer system 701 may be implemented alone or in combination with components of the system 102 (FIG. 1). Variations of computer system 701 may be used for implementing the devices included in this disclosure. Computer system 701 may comprise a central processing unit ("CPU" or "hardware processor") 702. The hardware processor 702 may comprise at least one data processor for executing program components for executing user- or system-generated requests. The processor may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc. The processor may include a microprocessor, such as AMD Athlon™, Duron™ or Opteron™, ARM's application, embedded or secure processors, IBM PowerPC™, Intel's Core, Itanium™, Xeon™, Celeron™ or other line of processors, etc. The processor 702 may be implemented using mainframe, distributed processor, multi-core, parallel, grid, or other architectures. Some embodiments may utilize embedded technologies like application specific integrated circuits (ASICs), digital signal processors (DSPs), Field Programmable Gate Arrays (FPGAs), etc.

[062] Processor 702 may be disposed in communication with one or more input/output (I/O) devices via I/O interface 703. The I/O interface 703 may employ communication protocols/methods such as, without limitation, audio, analog, digital, monoaural, RCA, stereo, IEEE-1394, serial bus, universal serial bus (USB), infrared, PS/2, BNC, coaxial, component, composite, digital visual interface (DVI), high-definition multimedia interface (HDMI), RF antennas, S-Video, VGA, IEEE 802.11a/b/g/n/x, Bluetooth, cellular (e.g., code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMax, or the like), etc.

[063] Using the I/O interface 703, the computer system 701 may communicate with one or more I/O devices. For example, the input device 704 may be an antenna, keyboard, mouse, joystick, (infrared) remote control, camera, card reader, fax machine, dongle, biometric reader, microphone, touch screen, touchpad, trackball, sensor (e.g., accelerometer, light sensor, GPS, gyroscope, proximity sensor, or the like), stylus, scanner, storage device, transceiver, video device/source, visors, etc.

[064] Output device 705 may be a printer, fax machine, video display (e.g., cathode ray tube (CRT), liquid crystal display (LCD), light-emitting diode (LED), plasma, or the like), audio speaker, etc. In some embodiments, a transceiver 706 may be disposed in connection with the processor 702. The transceiver may facilitate various types of wireless transmission or reception. For example, the transceiver may include an antenna operatively connected to a transceiver chip (e.g., Texas Instruments WiLink WL1283, Broadcom BCM4750IUB8, Infineon Technologies X-Gold 618-PMB9800, or the like), providing IEEE 802.11a/b/g/n, Bluetooth, FM, global positioning system (GPS), 2G/3G HSDPA/HSUPA communications, etc.

[065] In some embodiments, the processor 702 may be disposed in communication with a communication network 708 via a network interface 707. The network interface 707 may communicate with the communication network 708. The network interface may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), transmission control protocol/internet protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc. The communication network 708 may include, without limitation, a direct interconnection, local area network (LAN), wide area network (WAN), wireless network (e.g., using Wireless Application Protocol), the Internet, etc. Using the network interface 707 and the communication network 708, the computer system 701 may communicate with devices 709 and 710. These devices may include, without limitation, personal computer(s), server(s), fax machines, printers, scanners, various mobile devices such as cellular telephones, smartphones (e.g., Apple iPhone, Blackberry, Android-based phones, etc.), tablet computers, eBook readers (Amazon Kindle, Nook, etc.), laptop computers, notebooks, gaming consoles (Microsoft Xbox, Nintendo DS, Sony PlayStation, etc.), or the like. In some embodiments, the computer system 701 may itself embody one or more of these devices.

[066] In some embodiments, the processor 702 may be disposed in communication with one or more memory devices (e.g., RAM 713, ROM 714, etc.) via a storage interface 712. The storage interface may connect to memory devices including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as serial advanced technology attachment (SATA), integrated drive electronics (IDE), IEEE-1394, universal serial bus (USB), fiber channel, small computer systems interface (SCSI), etc. The memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, redundant array of independent discs (RAID), solid-state memory devices, solid-state drives, etc. Variations of memory devices may be used for implementing, for example, any databases utilized in this disclosure.

[067] The memory devices may store a collection of program or database components, including, without limitation, an operating system 716, user interface application 717, user/application data 718 (e.g., any data variables or data records discussed in this disclosure), etc. The operating system 716 may facilitate resource management and operation of the computer system 701. Examples of operating systems include, without limitation, Apple Macintosh OS X, Unix, Unix-like system distributions (e.g., Berkeley Software Distribution (BSD), FreeBSD, NetBSD, OpenBSD, etc.), Linux distributions (e.g., Red Hat, Ubuntu, Kubuntu, etc.), IBM OS/2, Microsoft Windows (XP, Vista/7/8, etc.), Apple iOS, Google Android, Blackberry OS, or the like. User interface 717 may facilitate display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities. For example, user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system 701, such as cursors, icons, check boxes, menus, scrollers, windows, widgets, etc. Graphical user interfaces (GUIs) may be employed, including, without limitation, Apple Macintosh operating systems’ Aqua, IBM OS/2, Microsoft Windows (e.g., Aero, Metro, etc.), Unix X-Windows, web interface libraries (e.g., ActiveX, Java, Javascript, AJAX, HTML, Adobe Flash, etc.), or the like.

[068] In some embodiments, computer system 701 may store user/application data 718, such as the data, variables, records, etc. as described in this disclosure. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle or Sybase. Alternatively, such databases may be implemented using standardized data structures, such as an array, hash, linked list, structured text file (e.g., XML), table, or as object-oriented databases (e.g., using ObjectStore, Poet, Zope, etc.). Such databases may be consolidated or distributed, sometimes among various computer systems discussed above. It is to be understood that the structure and operation of any computer or database component may be combined, consolidated, or distributed in any working combination.

[069] Additionally, in some embodiments, the server, messaging and instructions transmitted or received may emanate from hardware, including operating system, and program code (i.e., application code) residing in a cloud implementation. Further, it should be noted that one or more of the systems and methods provided herein may be suitable for cloud-based implementation. For example, in some embodiments, some or all of the data used in the disclosed methods may be sourced from or stored on any cloud computing platform.

[070] In the example scenario, the hierarchical capsule based neural network, for example the NN 300, is trained using datasets such as SwDA Corpus™, PUBMED™, and NICTA-PIBOSO™, and experimental results were obtained, as described herein.

[071] SwDA Corpus™: The Switchboard Corpus contains 1155 human conversations recorded over telephone communication. The Switchboard corpus has a total of more than 2M utterances, initially divided into about 220 tags. The SwDA coders' manual suggested clustering these tags into 42 different tags. These 42 tags were aimed at facilitating machine learning over the Dialogue Act (DA) annotated part of the Switchboard corpus. The hierarchical tag-set (220 tags) was further compressed into single atomic labels, which capture each individual utterance's function as well as the hierarchical information captured by the originally designed DAMSL schema (Dialogue Act Markup in Several Layers, which consists of 42 tags). One of the tags, '+', has been treated differently in prior work: some approaches concatenate two consecutive user utterances when a '+' tag is present, while others simply ignore the '+' tag. In the experiments herein, both approaches were adopted, and the results were reported after removing the '+' tag and concatenating the current user's last utterance with the current one. The SwDA corpus has a class distribution problem, with class frequencies ranging from a maximum of 36% to a minimum of less than 0.1% of the full SwDA corpus (Table II).
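For readers who want to see the '+' handling concretely, the following Python sketch is one possible preprocessing routine consistent with the description above (merge a '+' utterance into the same speaker's previous utterance and drop the '+' tag); the function name and the (speaker, utterance, tag) tuple format are assumptions made for this sketch, not the authors' preprocessing code.

def merge_plus_utterances(dialogue):
    # dialogue: list of (speaker, utterance, tag) tuples in conversation order
    merged = []
    last_idx_by_speaker = {}
    for speaker, text, tag in dialogue:
        if tag == '+' and speaker in last_idx_by_speaker:
            # concatenate with the same speaker's last utterance, keep its tag
            i = last_idx_by_speaker[speaker]
            s, prev_text, prev_tag = merged[i]
            merged[i] = (s, prev_text + ' ' + text, prev_tag)
        else:
            merged.append((speaker, text, tag))
            last_idx_by_speaker[speaker] = len(merged) - 1
    return merged

# Toy usage:
dialogue = [('A', 'I think that', 'sd'), ('B', 'uh-huh', 'b'),
            ('A', 'it might rain today', '+')]
print(merge_plus_utterances(dialogue))
# [('A', 'I think that it might rain today', 'sd'), ('B', 'uh-huh', 'b')]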

[072] One of the early published results of experiments on the Switchboard corpus reported a mixture of Neural Network models and HMMs; however, there was no clear train-dev-test split. To maintain consistency with the results that follow, a common ground split is utilized.

Table II: SwDA class distribution

[073] PUBMED: The dataset is a collection of sentences obtained from medical abstracts and randomized controlled trials (RCT). It is derived from the PubMed™ database of biomedical literature and contains approximately 200,000 abstracts of randomized controlled trials, totaling up to 2.3M sentences. For training the model, the PUBMED dataset, which is the largest medical abstract dataset, was used; it was released in two configurations, 20K and 200K abstracts of training data, with each sentence labeled with one of the classes: background, objective, method, result, and conclusion. The same train-dev-test splits were used. An example abstract (PMID: 18554189) is shown in Table III.

Table III: An example abstract (PMID:18554189)

[074] NICTA-PIBOSO: The dataset was released for the ALTA 2012 shared task, whose objective was building classifiers for automatically labeling sentences with pre-defined categories. The dataset was collected from the domain of Evidence Based Medicine (EBM), and each sentence is labeled with one of the classes: Population, Intervention, Background, Outcome, Study-Design and Other. The dataset has about 1000 abstracts and 11616 sentences and can be used for the sequence labeling task. The class distribution for all these scientific abstract datasets is given in Table IV.

Table IV: Class distribution for scientific abstract datasets

[075] Training Details: For training the proposed capsule based hierarchical neural network model, the training data as specified in the papers corresponding to the different datasets were used. To initialize the words, 300-dimension GloVe embeddings were used for SwDA and 200-dimension PubMed word2vec embeddings for the scientific abstracts. Bi-LSTMs (for example, the first plurality and the second plurality of Bi-LSTMs) were used for the sentence and document encoders in all the experiments, which take into account the sentences preceding and following the current utterance, while the baseline for SwDA does not. For NICTA, the number of LSTM units used in the sentence and document encoders was kept at 300 (150 for the left and right contexts); similarly, in PUBMED and SwDA these numbers were kept at 400 and 500 units respectively. The dilation rate dr was in the range [2-5]. A total of 20 capsules, each with dimension d of 16, were used for all the experiments. The same routing value r of 3 was also used across all the experiments, because using a larger value may result in overfitting on the test data.
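As one way to make the above hyperparameters concrete, the following configuration sketch mirrors the reported training details; the dataclass and its field names are assumptions for illustration and not the authors' code, and the dilation rate shown is only one value from the reported [2-5] range.

from dataclasses import dataclass

@dataclass
class TrainingConfig:
    embedding_dim: int        # 300 (GloVe) for SwDA, 200 (PubMed word2vec) for abstracts
    encoder_units: int        # total LSTM units in sentence/document encoders
    dilation_rate: int        # dr, reported range [2-5]; value here is illustrative
    num_capsules: int = 20    # capsules used in all experiments
    capsule_dim: int = 16     # capsule dimension d
    routing_iters: int = 3    # routing value r; larger values reportedly risk overfitting

swda = TrainingConfig(embedding_dim=300, encoder_units=500, dilation_rate=2)
pubmed = TrainingConfig(embedding_dim=200, encoder_units=400, dilation_rate=2)
nicta = TrainingConfig(embedding_dim=200, encoder_units=300, dilation_rate=2)
print(swda, pubmed, nicta, sep='\n')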

[076] The performance of the proposed architecture is evaluated on these 4 datasets (PUBMED has 2 parts, 20k and 200k), and the numbers are reported in Table V. On the SwDA corpus the model outperformed the state-of-the-art by about 2%. For the remaining three scientific abstract datasets, PUBMED 20k, 200k and NICTA, the results achieved are comparable with published state-of-the-art models.

Table V: Performance on 4 publicly available datasets

[077] As is seen from above, the proposed capsule based hierarchical neural network model for sequence labeling has achieved state-of-the-art results on datasets from two different domains (Dialogue Systems and Scientific Abstract Classification). Through these results, the efficacy of the proposed system and method on these complex NLP tasks has been demonstrated, by allowing the model to capture a representation of a sentence or a document that is enriched with contextual information.

[078] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.

[079] Various embodiments disclosed herein provide a method and system for sequence labeling using a capsule based hierarchical neural network. The proposed system uses a layer of Bi-LSTMs to first obtain a sentence representation and then performs dilation on it, along with applying capsules, to get an enriched sentence representation. Finally, after obtaining all such sentence representations from the multiple sentences of a task (such as an abstract/conversation), the system utilizes a CRF to get the optimum sequence labeling. The embodiments of the present disclosure utilize a hierarchical structure of the capsule based neural network that can aid in dissecting the input structure at different levels to allow flow of context between sentences. Said hierarchical structure facilitates understanding the discourse/abstract structure and predicting the next probable label by using label history.

[080] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field- programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.

[081] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

[082] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.

[083] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.

[084] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.