

Title:
A METHOD FOR ENABLING IDENTIFICATION OF CONTEXTUAL DATA AND RELATED ELECTRONIC DEVICE
Document Type and Number:
WIPO Patent Application WO/2024/099524
Kind Code:
A1
Abstract:
Disclosed is a method, performed by an electronic device, for enabling identification of contextual data. The method comprises obtaining textual data. The method comprises generating, based on the textual data, a first vector. The method comprises generating, based on the first vector, a first cluster indicative of a first context. The method comprises generating, based on the first vector and the first cluster, a second vector indicative of a frequent occurrence of a second context. The method comprises generating, based on the second vector and the first cluster, a second cluster indicative of the first context and the second context. The method comprises generating, based on the second cluster, second data comprising the first context and the second context.

Inventors:
GAHLOT HIMANSHU (IN)
PRADHAN ANSHUMAN YAJNAVALKYA (IN)
CHINNAMGARI SUNIL KUMAR (IN)
MOHANTY GOURA SUNDAR (IN)
Application Number:
PCT/DK2023/050259
Publication Date:
May 16, 2024
Filing Date:
November 01, 2023
Assignee:
MAERSK AS (DK)
International Classes:
G10L15/183; G06F40/20; G06F40/30
Foreign References:
US20140129560A12014-05-08
CN113742448A2021-12-03
CN114330327A2022-04-12
Claims:
CLAIMS

1. A method, performed by an electronic device, for enabling identification of contextual data, the method comprising:

- obtaining textual data;

- generating, based on the textual data, a first vector;

- generating, based on the first vector, a first cluster indicative of a first context;

- generating, based on the first vector and the first cluster, a second vector indicative of a frequent occurrence of a second context;

- generating, based on the second vector and the first cluster, a second cluster indicative of the first context and the second context; and

- generating, based on the second cluster, second data comprising the first context and the second context.

2. The method according to claim 1, wherein generating the first cluster and/or the second cluster comprises applying a multi-level clustering technique to the first vector and/or the second vector respectively.

3. The method according to claim 2, wherein the multi-level clustering technique comprises a hierarchical density-based clustering model.

4. The method according to claim 3, wherein the featurization technique comprises one or more of: a binary count vectorizer, a reverse Term Frequency - Inverse Document Frequency based model, and a word embedding model.

5. The method according to any of claims 3-4, wherein generating the first vector and/or the second vector comprises:

- applying a dimension reduction technique to the feature vector.

6. The method according to any of the previous claims, the method comprising:

- determining, based on the first cluster, first data indicative of a first token representative of the first context.

7. The method according to claim 6, wherein the method comprises:

- removing the first data from the textual data.

8. The method according to any of the previous claims, wherein generating the second vector comprises performing a transformation based on the first vector and the second vector.

9. The method according to any of claims 6-8, wherein the first data comprises the first token representative of the first context.

10. The method according to any of claims 6-9, wherein the second data comprises the first token representative of the first context and one or more second tokens representative of the second context.

Description:
A METHOD FOR ENABLING IDENTIFICATION OF CONTEXTUAL DATA AND

RELATED ELECTRONIC DEVICE

The present disclosure pertains to the field of textual data processing. The present disclosure relates to a method for enabling identification of contextual data and related electronic device.

BACKGROUND

Companies may deal with massive quantities of textual data obtained from several information sources and addressing multiple subjects. The textual data may be obtained from legal documents, user reviews, user queries, call transcripts of audio files and any other suitable information sources. It may be challenging to identify relevant information (e.g., actionable insights) from such textual data.

SUMMARY

Manual handling of such textual data is error-prone, time-consuming and not readily scalable.

Accordingly, there is a need for an electronic device and a method for enabling identification of contextual data, which mitigate, alleviate, or address the existing shortcomings and provide for an identification of a context indicative of relevant issues and/or topics in textual data.

Disclosed is a method, performed by an electronic device, for enabling identification of contextual data. The method comprises obtaining textual data. The method comprises generating, based on the textual data, a first vector. The method comprises generating, based on the first vector, a first cluster indicative of a first context. The method comprises generating, based on the first vector and the first cluster, a second vector indicative of a frequent occurrence of a second context. The method comprises generating, based on the second vector and the first cluster, a second cluster indicative of the first context and the second context. The method comprises generating, based on the second cluster, second data comprising the first context and the second context.

Disclosed is an electronic device. The electronic device comprises a memory, a processor, and an interface. The electronic device is configured to perform any of the methods disclosed herein.

Disclosed is a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by an electronic device cause the electronic device to perform any of the methods disclosed herein.

It is an advantage of the present disclosure that the disclosed electronic device and method enable identification of contextual data, such as identifying a first context and a second context, in a fast, robust, accurate and scalable manner. The disclosed electronic device and method may allow analyzing large collections of textual data by properly identifying the relevant information (e.g., high quality information) associated with the first context and the second context.

The disclosed electronic device and method may be particularly advantageous for applications since the disclosed electronic device and method may enable a consistent processing of textual data (e.g., from where contextual data can be inferred) by allowing classification, and/or categorization of large collections of textual data. The disclosed electronic device and method may lead to a faster identification of an issue in a user query and provide a timely accurate response. The present disclosure may allow using labelled datasets for identifying user issues, without involvement of a person. The disclosed electronic device and method may enable near real time identification of the issue and enable providing a timely accurate response.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other features and advantages of the present disclosure will become readily apparent to those skilled in the art by the following detailed description of exemplary embodiments thereof with reference to the attached drawings, in which:

Fig. 1 is a diagram illustrating schematically an example process for enabling identification of contextual data according to this disclosure,

Figs. 2A-2B are flow-charts illustrating an exemplary method, performed by an electronic device, for enabling identification of contextual data according to this disclosure, and Fig. 3 is a block diagram illustrating an exemplary electronic device according to this disclosure.

DETAILED DESCRIPTION

Various exemplary embodiments and details are described hereinafter, with reference to the figures when relevant. It should be noted that the figures may or may not be drawn to scale and that elements of similar structures or functions are represented by like reference numerals throughout the figures. It should also be noted that the figures are only intended to facilitate the description of the embodiments. They are not intended as an exhaustive description of the disclosure or as a limitation on the scope of the disclosure. In addition, an illustrated embodiment need not have all the aspects or advantages shown. An aspect or an advantage described in conjunction with a particular embodiment is not necessarily limited to that embodiment and can be practiced in any other embodiments even if not so illustrated, or if not so explicitly described.

The figures are schematic and simplified for clarity, and they merely show details which aid understanding the disclosure, while other details have been left out. Throughout, the same reference numerals are used for identical or corresponding parts.

A context disclosed herein may be seen as information indicative of a matter (e.g., a processing step, an issue, a goal, an aim, a technical error, an environment and/or a topic). In other words, context may be seen as information capable of providing information about a matter (e.g., meaningful and/or relevant information or insight for the matter or issue at hand). Contextual data may be seen as data characterizing a context of the disclosed textual data.

Contextual data may be obtained (e.g., extracted) from an information source of textual data. The information source of the textual data may be one or more of: a chat, a message, an email, a document, an audio file, a video file, a chatbot, and any other suitable information source. In other words, contextual data may be seen as information relevant to an interpretation of the textual data. The textual data to be processed according to this disclosure carries information that is relevant for processing it, such as for moving to the next processing step. For example, the textual data can include: “trying to upload a document”; the context for this example textual data can be “upload” and/or “document”.

A first context disclosed herein may be represented by a first token (e.g., a first term, character, string, or number). In other words, the first context may be seen as a topic (e.g., a term), such as a topic related to a process in which the textual data occurred. A token may be seen as one or more of: a word, a string, a character, and a number. A second context disclosed herein may be represented by one or more second tokens. The second context may be seen as one or more second tokens related to a same semantic domain (e.g., a topic). When identifying contextual data as addressing a first context, the textual data may comprise one or more second tokens representative of a second context which are related to the first context. In other words, the second context may be associated with the first context. Put differently, the second context may be seen as a brief description (e.g., an issue, a problem, a pain point) of the first context (e.g., a topic, a matter). For example, the second context can be indicative of a specific problem faced by a user related to use of a product.

The present disclosure provides a technique that enables accurately identifying contextual data, such as identifying a first token representative of the first context and one or more second tokens representative of the second context. The disclosed technique allows identifying, at a first stage, the first token representative of the first context, such as a contextual topic and/or a contextual matter. The disclosed technique allows identifying, at a second stage, the one or more second tokens representative of the second context, such as one or more issues (e.g., brief descriptions of one or more problems and/or issues) associated with the contextual topic. The disclosed technique may allow processing large collections of textual data and compressing the textual data into identified contextual data, e.g. the relevant and/or meaningful topics (e.g., associated with the first context) and respective issues (e.g., associated with the second context). The present disclosure provides a context modelling technique (e.g., a topic identification technique) for identifying contexts from textual data in a fast, robust, accurate and scalable manner.

The disclosed technique may be particularly useful for processing (e.g. solving, interpreting and/or analyzing) a user query (e.g., textual data) obtained from a user support platform (e.g., a chatbot). For example, a user query may be: “I am facing a login issue while submitting an invoice. Please help me as soon as possible”. The disclosed technique may identify (e.g., tag) the user query as “submit bill”, in which “bill” may be seen as the matter or topic (e.g., the first token representative of the first context) and “submit” may be seen as the issue (e.g., the one or more second tokens representative of the second context). Implementation of the disclosed technique by an electronic device may allow identifying and categorizing user issues in a robust and accurate manner, without the need to perform such actions manually. The implementation of the disclosed technique may also contribute to a reduction in the amount of time required to address the user issue and to less error-prone processing of the user queries.

Fig. 1 is a diagram illustrating schematically an example process 1 for enabling identification of contextual data according to this disclosure. Fig. 1 shows a process 1 for identifying a first context and a second context from textual data. The process 1 is carried out by the electronic device disclosed herein.

The electronic device obtains the textual data 10 from an information source. For example, an information source can be one or more of: a chat, a message, an email, a document, an audio file, a video file, a chatbot, and any other suitable information source. In one or more examples, the textual data can include one or more tokens. For example, a token can be one or more of: a word, a string, a character, and a number. For example, the textual data may include one or more sentences.

The electronic device may remove one or more tokens 12 from the textual data. For example, the one or more tokens can be a stop token and/or a frequently used token and/or an infrequently used token. A stop token may be one or more of: a determiner, a preposition, an adjective, a coordinating conjunction and/or a greeting. In other words, a stop token may be a token that is commonly used in a language. A removal of a stop token may not sacrifice content of the textual data. A frequently used token may be seen as a token which is comprised in the historical textual data (e.g., in 70-80% of sentences comprised in the historical textual data). An infrequently used token may be seen as a token which appears in fewer sentences of the historical textual data than an infrequently used token parameter (e.g., 15 sentences). The historical textual data may be stored in a repository (e.g., a database). The historical textual data may be indicative of textual data previously obtained, by the electronic device, from one or more information sources and stored in a repository. The electronic device may remove the one or more tokens 12 to impose constraints on which tokens can be used for performing the identification of the first and second context. In other words, the electronic device may remove the one or more tokens 12 so that the first and second context are determined from content-carrying tokens, which provides an accurate first and second context.
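As an illustration only, a minimal sketch of this optional token-removal step 12 could look as follows, assuming plain Python tokenization; the stop-token list, the 70-80% share and the 15-sentence count are the illustrative parameters mentioned above, and the helper names are hypothetical:

```python
import re
from collections import Counter

# Hypothetical, non-exhaustive stop-token list for illustration only.
STOP_TOKENS = {"the", "a", "an", "above", "across", "before", "for", "nor",
               "but", "or", "yet", "hello", "good", "morning"}

def tokenize(sentence):
    return re.findall(r"[a-z0-9]+", sentence.lower())

def remove_tokens(sentences, historical_sentences,
                  frequent_share=0.75, infrequent_count=15):
    # Document frequency of each token over the historical textual data.
    df = Counter()
    for s in historical_sentences:
        df.update(set(tokenize(s)))
    n_hist = max(len(historical_sentences), 1)
    frequent = {t for t, c in df.items() if c / n_hist >= frequent_share}
    infrequent = {t for t, c in df.items() if c < infrequent_count}
    drop = STOP_TOKENS | frequent | infrequent
    # Tokens absent from the historical data are kept here; handling them is left open.
    return [" ".join(t for t in tokenize(s) if t not in drop) for s in sentences]
```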

The electronic device generates a feature vector e.g. by applying a binary count vectorizer 14 (e.g., a featurization technique) to the textual data. In other words, the electronic device vectorizes the textual data by generating, based on the textual data, the feature vector, which may be denoted a first feature vector.

The electronic device may generate the first vector optionally by applying a dimension reduction technique 16 to the feature vector. For example, the dimension of the first vector is lower than the dimension of the feature vector. For example, the dimension reduction technique can be seen as a manifold learning technique (e.g., a Uniform Manifold Approximation and Projection, UMAP, algorithm) for dimension reduction. For example, applying the dimension reduction technique 16 to the feature vector is optionally performed after applying the binary count vectorizer 14 to the textual data (e.g., upon vectorizing the textual data). The UMAP algorithm may reduce the dimension of the feature vector to a dimension reduction parameter (e.g., to 2) without losing a considerable amount of information. When the dimension reduction technique is not applied (e.g., by the electronic device) to the textual data, the first vector may be generated (e.g., by the electronic device) by solely applying the binary count vectorizer 14 to the textual data.
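A minimal sketch of the vectorization 14 and the optional dimension reduction 16, under the assumption that scikit-learn's CountVectorizer and the umap-learn package are used; the target dimension of 2 corresponds to the example dimension reduction parameter, and the function name is hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
import umap  # umap-learn package

def generate_first_vector(sentences, reduce_dimension=True):
    vectorizer = CountVectorizer(binary=True)               # binary count vectorizer (14)
    feature_vector = vectorizer.fit_transform(sentences)    # presence/absence of tokens
    if not reduce_dimension:
        # Without dimension reduction, the first vector is the feature vector itself.
        return feature_vector, vectorizer
    reducer = umap.UMAP(n_components=2)                      # dimension reduction parameter, e.g. 2
    first_vector = reducer.fit_transform(feature_vector)     # UMAP dimension reduction (16)
    return first_vector, vectorizer
```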

The electronic device generates, based on the first vector, a first cluster by applying a multi-level clustering technique 18 to the first vector. The first cluster may be indicative of the first context. For example, the first context can be a topic, a matter, a term, and/or an issue. The first context may be a classification and/or a tag of a matter (e.g., a problem related to a topic, such as the first context). The electronic device generates the first cluster based on presence of a topic (e.g., a token) in the textual data. The first cluster may be associated with a most frequent token. The electronic device may label and/or tag the first cluster with the most frequent token. For example, the electronic device may generate one or more first clusters, in which the one or more first clusters are associated with one or more most frequent tokens in the textual data. The multi-level clustering technique may be based on a Hierarchical Density-Based Spatial Clustering of Applications with Noise, HDBSCAN, model. The electronic device may determine, based on the first cluster, first data 20. The first data may be indicative of a first token representative of the first context. The first data may include a first token, such as a topic and/or a matter. The first data may comprise one or more first tokens (e.g., an array of first tokens), such as one or more topics and/or matters. For example, each first token of the one or more first tokens labels each first cluster of the one or more first clusters.
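One possible, non-authoritative sketch of the multi-level clustering 18 and the derivation of the first data 20, assuming the hdbscan package and labelling each cluster with its most frequent token as described above; the minimum cluster size and function name are assumptions:

```python
import numpy as np
import hdbscan  # hdbscan package
from collections import Counter

def generate_first_clusters(first_vector, sentences):
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15)   # hierarchical density-based model
    cluster_ids = clusterer.fit_predict(first_vector)  # -1 marks points assigned to no cluster
    first_data = {}
    for cid in set(cluster_ids) - {-1}:
        members = [sentences[i] for i in np.where(cluster_ids == cid)[0]]
        token_counts = Counter(t for s in members for t in s.split())
        # The most frequent token in the cluster labels (tags) the cluster.
        first_data[cid] = token_counts.most_common(1)[0][0]
    return cluster_ids, first_data
```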

The electronic device may remove the first data 22 from the textual data for provision of removed textual data. The electronic device may remove vectorized textual data (e.g., vectorized data comprised in the first vector) from the first vector.

The electronic device generates a feature vector (denoted second feature vector) by applying a reverse Term Frequency - Inverse Document Frequency 24, TF-IDF, (e.g., a featurization technique) to the removed textual data. The electronic device vectorizes the removed textual data by generating, based on the removed textual data, the feature vector. For example, the reverse TF-IDF assigns higher weights to frequently occurring tokens in the removed textual data and lower weights to rarely occurring tokens in the removed textual data. In other words, the reverse TF-IDF may assign one or more values around “1” to frequently occurring tokens and one or more values around “0” to rarely occurring tokens.
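A hedged sketch of the reverse TF-IDF featurization 24, assuming scikit-learn's TfidfVectorizer; the detailed description characterizes reverse TF-IDF as 1 - (TF-IDF), and applying that mapping only to the non-zero entries is an assumption of this sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def reverse_tfidf(removed_sentences):
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(removed_sentences)  # standard TF-IDF weights in [0, 1]
    reverse = tfidf.copy()
    # 1 - (TF-IDF): frequently occurring tokens move towards 1, rare tokens towards 0.
    reverse.data = 1.0 - reverse.data
    return reverse, vectorizer
```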

The electronic device may perform a transformation 26 based on the first vector and the second vector by merging (e.g., concatenating) the first vector with the feature vector. In other words, the electronic device merges constituent elements of the first vector associated with the first data, such as constituent elements of the first vector identified as topics, with the feature vector. Performing such transformation may result in a vector comprising one or more topics (e.g., associated with the first context) and one or more issues related to each topic of the one or more topics (e.g., associated with the second context).
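A minimal sketch of the transformation 26, merging the topic-related constituent elements of the first vector with the second feature vector by concatenation; selecting those elements by their column index in the first vectorizer's vocabulary is an assumption, as is the function name:

```python
from scipy.sparse import hstack

def transform(first_feature_vector, first_vectorizer, topic_tokens, second_feature_vector):
    vocab = first_vectorizer.vocabulary_                  # token -> column index
    topic_columns = [vocab[t] for t in topic_tokens if t in vocab]
    topic_part = first_feature_vector[:, topic_columns]  # constituent elements tied to the first data
    # One row per sentence: topic indicators (first context) next to issue weights (second context).
    return hstack([topic_part, second_feature_vector])
```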

The electronic device may generate a second vector optionally by applying a dimension reduction technique 28 to the feature vector. For example, the dimension of the second vector is lower than the dimension of the feature vector. For example, the dimension reduction technique can be seen as a manifold learning technique (e.g., a Uniform Manifold Approximation and Projection, UMAP, algorithm) for dimension reduction. For example, applying the dimension reduction technique 28 to the feature vector is optionally performed after applying the reverse TF-IDF 24 to the removed textual data (e.g., upon vectorizing the removed textual data). The UMAP algorithm may reduce the dimension of the feature vector to a dimension reduction parameter (e.g., to 2) without losing a considerable amount of information. When the dimension reduction technique is not applied (e.g., by the electronic device) to the removed textual data, the second vector may be generated (e.g., by the electronic device) by solely applying the reverse TF-IDF 24 to the removed textual data.

The electronic device generates, based on the second vector and the first cluster, a second cluster by applying a multi-level clustering technique 30 to the second vector. The second cluster may be indicative of the first context and the second context. The second cluster may be associated with a topic (e.g., the first context) and respective issues (e.g., the second context). The electronic device may generate one or more second clusters, with each second cluster of the one or more second clusters associated with a topic and respective issues. In other words, the electronic device groups the one or more topics with the respective issues. The multi-level clustering technique may be based on a Hierarchical Density-Based Spatial Clustering of Applications with Noise, HDBSCAN, model.

The electronic device generates, based on the second cluster, second data 32. The second data may comprise the first context and the second context. The second data may include a first token (e.g., a topic) and respective one or more second tokens (e.g., one or more issues). The second data may comprise one or more first tokens (e.g., one or more topics) and one or more second tokens (e.g., one or more issues) associated with each first token of the one or more first tokens.

The electronic device can optionally select 34, based on the second data, one or more first tokens (e.g., one or more topics) that are associated with a plurality of second tokens (e.g., a plurality of issues) for provision of improved second data. Such selection can be useful to remove topics which may not contribute to the identification of relevant contextual data.
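As an illustrative sketch only, the optional selection 34 could keep only those first tokens that are associated with more than one second token, which is one possible reading of "a plurality of second tokens"; the data layout and function name are assumptions:

```python
def select_topics(second_data):
    # Keep only topics (first tokens) associated with more than one issue (second tokens);
    # topics with a single issue are treated here as candidates for removal.
    return {topic: issues for topic, issues in second_data.items() if len(issues) > 1}
```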

Figs. 2A-2B show a flow-chart of an exemplary method, performed by an electronic device, for enabling identification of contextual data according to the disclosure. The electronic device is an electronic device disclosed herein, as illustrated in Figs. 1 and 3.

The method 100 comprises obtaining S102 textual data. In one or more examples, the textual data is obtained (e.g. received, retrieved and/or extracted) by the electronic device from an information source. For example, an information source can be one or more of: a chat, a message, an email, a document, an audio file, a video file, and any other suitable information source. For example, an information source can be a chatbot (e.g., a user support platform). In one or more examples, the textual data can include one or more tokens. For example, a token can be one or more of: a word, a string, a character, and a number. Put differently, the textual data may be any character and/or a string of characters. In one or more examples, the textual data can comprise independent information, such as information obtained from one or more information sources (e.g., from one or more documents and/or one or more messages) addressing different subjects.

The method 100 comprises generating S108, based on the textual data, a first vector. In one or more examples, the first vector is generated by the electronic device by converting the textual data into a vector, such as the first vector. Put differently, the textual data may be vectorized by the electronic device as illustrated in Fig. 1. For example, sentences (e.g., from one or more information sources) comprised in the textual data are vectorized by the electronic device.

The method 100 comprises generating S110, based on the first vector, a first cluster indicative of a first context. In one or more examples, the first context can be a topic and/or a term and/or an issue. The first context can optionally be a topic related to a process. In one or more examples, the first cluster is labelled by the electronic device with a first token indicative of the first context. For example, the first context can be seen as a classification and/or a tag of an issue (e.g., a problem related to a topic, such as the first context). In one or more examples, the first vector comprises information indicative of presence of a first token in the textual data. Put differently, generating the first vector may allow identifying whether a first token is present in the textual data. In one or more examples, the first cluster is generated by the electronic device based on the presence of the first token in the textual data. The first cluster may be associated with a most frequent first token. In one or more examples, one or more first clusters can be generated by the electronic device. A first cluster of the one or more first clusters may be generated by the electronic device for each most frequent (e.g., prevalent) first token comprised in the textual data. One or more first clusters may be associated with respective one or more most frequent first tokens (e.g., topics).

The method 100 comprises generating S116, based on the first vector and the first cluster, a second vector indicative of a frequent occurrence of a second context. Generating the second vector may allow identifying frequently occurring one or more tokens in the textual data. In other words, the second vector may comprise information indicative of how frequently a second token is used in the textual data. In other words, the second vector may comprise information which quantifies how frequent a second token is in the textual data. In one or more examples, the second context can be seen as one or more issues (e.g., user issues). The second context (e.g., one or more issues) may be associated with the first context (e.g., a topic). Put differently, a topic (e.g., the first context) may comprise one or more issues (e.g., one or more problems related to the topic).

The method 100 comprises generating S118, based on the second vector and the first cluster, a second cluster indicative of the first context and the second context. In one or more examples, generating the second cluster can be seen as grouping the first context with the second context based on the second vector (e.g., indicative of a frequent occurrence of the second context) and the first cluster (e.g., indicative of the first context). For example, one or more second clusters can be generated by the electronic device based on the second vector and the one or more first clusters.

The method 100 comprises generating S120, based on the second cluster, second data comprising the first context and the second context. The second data may comprise a topic and corresponding one or more issues (e.g., associated with the topic). In an example context, the disclosed technique may accurately categorize and/or identify a problem and/or an issue submitted by a user via a user support platform. A first problem and/or issue may be identified as “submit bill” and/or “copy of bill” and/or “download bill”. Put differently, the second data may be seen as “submit bill” and/or “copy of bill” and/or “download bill”. For example, “bill” is a topic and “submit”, “download”, “copy” are the one or more issues associated with “bill”. A second problem and/or issue may be identified as “Making booking”, “Download booking”, “Paying booking”. The first and second problems and/or issues may be identified at different timings (e.g., the first and second problems and/or issues are processed by the electronic device separately), or they may be identified simultaneously. In other words, the second data comprises one or more topics and respective issues. The second data may be seen as the contextual data identified in method 100.
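Purely for illustration, the second data from the worked example above could be represented as a mapping from each topic (first token) to its associated issues (second tokens); the dictionary layout is an assumption, not a structure mandated by the method:

```python
second_data = {
    "bill": ["submit", "copy", "download"],
    "booking": ["making", "download", "paying"],
}
for topic, issues in second_data.items():
    for issue in issues:
        print(f"{issue} {topic}")  # e.g. "submit bill", "download booking"
```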

The method may comprise providing the second data comprising the first context and the second context for controlling a control system, such as for generating a response to the user or determining a next processing step of the textual data. For example, the second data can be provided as a list of the most relevant topics, since the clusters are identified across multiple texts and thereby describe issues faced by many users.

In one or more example methods, generating S110 the first cluster comprises applying S110A a multi-level clustering technique to the first vector. The first cluster may be associated with a most frequent first token in the textual data. The first cluster may be associated with a topic. In one or more example methods, generating S118 the second cluster comprises applying S118A a multi-level clustering technique to the second vector. The second cluster may be associated with a most frequent first token and one or more most frequent second tokens in textual data. The second cluster may be associated with a topic and respective issues.

In one or more example methods, the multi-level clustering technique comprises a hierarchical density-based clustering model. In one or more examples, the multi-level clustering technique can comprise one or more of: a density-based clustering model, a hierarchical-based model, and a distribution-based model. In one or more examples, the hierarchical density-based clustering model comprises a Hierarchical Density-Based Spatial Clustering of Applications with Noise, HDBSCAN, model. For example, generating the first cluster by applying the HDBSCAN to the first vector comprises generating the first cluster based on density of the most frequent first token in the textual data. For example, generating the second cluster by applying the HDBSCAN to the second vector comprises generating the second cluster based on density of the most frequent first token and one or more most frequent second tokens in the textual data.

In one or more example methods, generating S108 the first vector comprises generating S108A a feature vector by applying a featurization technique to the textual data. In one or more example methods, generating S116 the second vector comprises generating S116A a feature vector by applying a featurization technique to the textual data.

In one or more examples, the textual data is converted, by the electronic device, into a feature vector by applying a featurization technique to the textual data. The feature vector may be different from the first vector and/or second vector. In other words, the feature vector may be different from the first vector and/or second vector in terms of dimension (e.g., size). In one or more examples, generating the second vector comprises generating the feature vector by applying the featurization technique to the removed textual data (e.g., textual data not comprising the first context).

In one or more example methods, the featurization technique comprises one or more of: a binary count vectorizer, a reverse Term Frequency - Inverse Document Frequency based model, and a word embedding model. In one or more examples, the feature vector is generated by applying a binary count vectorizer to the textual data for provision of the first vector. Put differently, the textual data may be converted into the feature vector by applying the binary count vectorizer to the textual data for provision of the first vector. Applying the binary count vectorizer to the textual data may allow identifying whether a first token (e.g., a topic) is present in the textual data. For example, based on such identification, applying the binary count vectorizer to the textual data also enables determining how frequent the first token is in the textual data.

In one or more examples, the feature vector is generated by applying a reverse Term Frequency - Inverse Document Frequency, TF-IDF, to the textual data for provision of the second vector. Put differently, the textual data may be converted into the feature vector by applying the reverse TF-IDF to the textual data for provision of the second vector. The reverse TF-IDF may be seen as 1 - (TF-IDF). Applying the reverse TF-IDF to the removed textual data may enable not only determining the frequency with which the one or more second tokens (e.g., one or more issues) are present in the removed textual data but also providing the importance of the one or more second tokens (e.g., quantifying the one or more second tokens). For example, the reverse TF-IDF assigns higher weights to frequently occurring second tokens in the removed textual data and lower weights to rarely occurring second tokens in the removed textual data. Applying the reverse TF-IDF to the removed textual data may improve the vectorization of the textual data relative to the vectorization performed for provision of the first vector. For example, the use of reverse TF-IDF may identify one or more first tokens which are not comprised in the first data. In other words, the use of reverse TF-IDF may improve the vectorization of the textual data performed by the binary count vectorizer. The use of reverse TF-IDF may allow identifying one or more topics and respective one or more issues in a fast and accurate manner. Such fast and accurate identification of one or more topics and respective one or more issues in an environment, such as in a client support platform, may lead to an increase in user satisfaction rates.

In one or more example methods, generating S108 the first vector comprises applying S108B a dimension reduction technique to the feature vector. In one or more examples, the first vector is generated by the electronic device by applying the dimension reduction technique to the feature vector. For example, the dimension of the first vector is lower than the dimension of the feature vector. In one or more example methods, generating S116 the second vector comprises applying S116C a dimension reduction technique to the feature vector. In one or more examples, the second vector is generated by the electronic device by applying the dimension reduction technique to the feature vector. For example, the dimension of the second vector is lower than the dimension of the feature vector.

In one or more examples, the dimension reduction technique can be seen as a manifold learning technique for dimension reduction. For example, the dimension reduction technique can comprise a Uniform Manifold Approximation and Projection, UMAP, algorithm. For example, applying the dimension reduction technique to the feature vector comprises generating a low dimension representation of the feature vector for provision of the first vector and/or the second vector.

For example, applying the dimension reduction technique to the feature vector is optionally performed after applying a featurization technique to the textual data (e.g., upon vectorizing the textual data). For example, the feature vector can comprise one or more dimensions. For example, the feature vector can comprise a considerable number of dimensions. The UMAP algorithm may reduce the dimension of the feature vector to a dimension reduction parameter (e.g., to 2) without loss of information. The dimension reduction parameter may be an integer value.

In one or more example methods, the method 100 comprises determining S112, based on the first cluster, first data indicative of a first token representative of the first context. In one or more examples, a token can be one or more of: a word, a string, a character, and a number. In one or more examples, the first token can be a token for identifying a topic (e.g., a matter). In one or more examples, a first token can be seen as a label for a first cluster. In one or more examples, the first data can be an array of first tokens, such as tokens which identify topics. For example, the array of first tokens can comprise a first token (e.g., a topic) which labels a first cluster. For example, an array of first tokens labels (e.g., classifies) one or more first clusters. For example, each first token comprised in the array of first tokens labels each first cluster of the one or more first clusters. In one or more examples, the first data can be a first token for labelling each first cluster. In one or more examples, the first data can be seen as a topic. In one or more examples, the first data can be one or more first tokens for labelling one or more first clusters, respectively. In one or more examples, the first data can be seen as one or more topics.

In one or more example methods, the method 100 comprises removing S114 the first data from the textual data for provision of removed textual data. In one or more examples, removing the first data from the textual data may comprise removing a topic and/or one or more topics from the textual data. For example, removing the first data from the textual data may be equivalent to removing vectorized textual data from the first vector.

In one or more example methods, generating S116 the second vector comprises performing S116B a transformation based on the first vector and the second vector. In one or more examples, performing a transformation based on the first vector and the second vector comprises merging the first vector with the second vector. In one or more examples, performing a transformation based on the first vector and the second vector comprises merging the first vector with the feature vector. In one or more examples, performing a transformation based on the first vector and the second vector comprises merging the second vector (e.g., after applying reverse TF-IDF to the textual data) with constituent elements of the first vector associated with the first data (e.g., an array of one or more first tokens).

In one or more example methods, the first data comprises the first token representative of the first context. In one or more examples, the first data labels and/or tags the first cluster. For example, the first token can be seen as a topic. In one or more example methods, the second data comprises the first token representative of the first context and one or more second tokens representative of the second context. The second data may comprise a topic (e.g., a first token) and corresponding one or more issues (e.g., one or more second tokens). For example, an issue may be identified as “submit bill” and/or “copy of bill” and/or “download bill”. The topic (e.g., the first token) may be “bill”. The one or more issues (e.g., the one or more second tokens) may be “submit” and/or “copy” and/or “download”.

In one or more example methods, the second data comprises the one or more second tokens associated with the first token. For example, the one or more second tokens can be seen as one or more issues. For example, the first token can be seen as a topic which comprises one or more issues.

The electronic device can optionally select, based on the second data, one or more first tokens (e.g., one or more topics) which are associated with a plurality of second tokens (e.g., a plurality of issues), for provision of improved second data. Such selection may be useful to remove tokens (e.g., false positive tokens or topics) which may not contribute to the identification of relevant contextual data. For example, a false positive token may be a recurring name such as “John”.

In one or more example methods, the method 100 comprises obtaining S104 one or more tokens for removal, wherein the one or more tokens comprise one or more of: a stop token, a frequently used token and an infrequently used token. In one or more examples, obtaining the one or more tokens for removal (e.g., a stop token and/or a frequently used token and/or an infrequently used token) comprises generating the one or more tokens by querying a repository (e.g., a database). The repository may comprise historical textual data. For example, the historical textual data can comprise one or more historical tokens. The repository may comprise one or more tokens obtained from one or more textual data (e.g., independent textual data, such as textual data from one or more information sources). In other words, the repository may comprise one or more tokens obtained (e.g., by the electronic device) from previous one or more textual data, such as one or more textual data previously obtained from an information source (e.g., a user support platform). In one or more examples, a stop token can be a determiner (e.g., the, a, an) and/or a preposition (e.g., above, across, before) and/or an adjective (e.g., bad, good, nice) and/or a coordinating conjunction (e.g., for, nor, but, or, yet) and/or a greeting (e.g., good morning, hello). In one or more examples, a stop token can be in any language. The frequently used token may be seen as a token that is comprised (e.g., present) in, for example, 70-80% of the sentences comprised in the historical textual data (e.g., obtained from the repository). The historical textual data (e.g., textual data previously received from one or more users addressing one or more issues) may be stored in a database. For example, the frequently used token parameter (e.g., 70-80%) is a percentage. The frequently used token parameter may be customizable. The infrequently used token may be seen as a token that is not comprised (e.g., not present) in, for example, at least 15 sentences comprised in the historical textual data. For example, the infrequently used token parameter (e.g., 15 sentences) is an integer value. The infrequently used token parameter may be customizable. In one or more examples, the infrequently used token parameter can be seen as the number of times a first token is identified as associated with a first cluster. In other words, a first cluster may be generated by the electronic device when the textual data comprises a token at least 15 times.

In one or more example methods, the method 100 comprises removing S106 the one or more tokens from the textual data. In one or more examples, removing the one or more tokens from the textual data is performed prior to generation of the first and/or the second vector. Put differently, removing the one or more tokens from the textual data may be performed prior to generation of the feature vector by applying a featurization technique to the textual data (e.g., prior to vectorizing the first and/or the second vector). Punctuation marks may be removed by the electronic device from the textual data.

In one or more examples, removing the one or more tokens from the textual data may, at a first stage, allow removing a considerable number of tokens which are not decisive and/or relevant for identifying the first data and the second data. Removing the one or more tokens from the textual data, at a first stage, may speed up and improve the identification of the first context and the second context, such as the generation of the first data and the second data.

Fig. 3 shows a block diagram of an exemplary electronic device 300 according to the disclosure. The electronic device 300 comprises a memory 301, a processor 302, and an interface 303. The electronic device 300 is configured to perform any of the methods disclosed in Figs. 2A-2B. In other words, the electronic device 300 is configured for enabling identification of contextual data. For example, the electronic device 300 is a context provider device for textual data. The electronic device 300 is configured to obtain (e.g., via the interface 303 and/or using the memory 301) textual data.

The electronic device 300 is configured to generate (e.g., using the processor 302), based on the textual data, a first vector.

The electronic device 300 is configured to generate (e.g., using the processor 302), based on the first vector, a first cluster indicative of a first context.

The electronic device 300 is configured to generate (e.g., using the processor 302), based on the first vector and the first cluster, a second vector indicative of a frequent occurrence of a second context.

The electronic device 300 is configured to generate (e.g., using the processor 302), based on the second vector and the first cluster, a second cluster indicative of the first context and the second context.

The electronic device 300 is configured to generate (e.g., using the processor 302), based on the second cluster, second data comprising the first context and the second context.

The processor 302 is optionally configured to perform any of the operations disclosed in Figs. 2A-2B (such as any one or more of: S102, S104, S106, S108, S108A, S108B, S110, S110A, S112, S114, S116, S116A, S116B, S116C, S118, S118A, S120). The operations of the electronic device 300 may be embodied in the form of executable logic routines (e.g., lines of code, software programs, etc.) that are stored on a non-transitory computer readable medium (e.g., the memory 301) and are executed by the processor 302.

Furthermore, the operations of the electronic device 300 may be considered a method that the electronic device 300 is configured to carry out. Also, while the described functions and operations may be implemented in software, such functionality may as well be carried out via dedicated hardware or firmware, or some combination of hardware, firmware and/or software.

The memory 301 may be one or more of: a buffer, a flash memory, a hard drive, a removable media, a volatile memory, a non-volatile memory, a random access memory (RAM), and any other suitable device. In a typical arrangement, the memory 301 may include a non-volatile memory for long term data storage and a volatile memory that functions as system memory for the processor 302. The memory 301 may exchange data with the processor 302 over a data bus. Control lines and an address bus between the memory 301 and the processor 302 also may be present (not shown in Fig. 3). The memory 301 is considered a non-transitory computer readable medium.

The memory 301 may be configured to store textual data, a first vector, a first cluster, first data, a second vector, a second cluster, and second data in a part of the memory.

Embodiments of methods and products (electronic device) according to the disclosure are set out in the following items:

Item 1. A method, performed by an electronic device, for enabling identification of contextual data, the method comprising:

- obtaining textual data;

- generating, based on the textual data, a first vector;

- generating, based on the first vector, a first cluster indicative of a first context;

- generating, based on the first vector and the first cluster, a second vector indicative of a frequent occurrence of a second context;

- generating, based on the second vector and the first cluster, a second cluster indicative of the first context and the second context; and

- generating, based on the second cluster, second data comprising the first context and the second context.

Item 2. The method according to item 1, wherein generating the first cluster and/or the second cluster comprises applying a multi-level clustering technique to the first vector and/or the second vector respectively.

Item 3. The method according to item 2, wherein the multi-level clustering technique comprises a hierarchical density-based clustering model.

Item 4. The method according to any of the previous items, wherein generating the first vector and/or the second vector comprises generating a feature vector by applying a featurization technique to the textual data.

Item 5. The method according to item 4, wherein the featurization technique comprises one or more of: a binary count vectorizer, a reverse Term Frequency - Inverse Document Frequency based model, and a word embedding model.

Item 6. The method according to any of items 4-5, wherein generating the first vector and/or the second vector comprises:

- applying a dimension reduction technique to the feature vector.

Item 7. The method according to any of the previous items, the method comprising:

- determining, based on the first cluster, first data indicative of a first token representative of the first context.

Item 8. The method according to item 7, wherein the method comprises:

- removing the first data from the textual data.

Item 9. The method according to any of the previous items, wherein generating the second vector comprises performing a transformation based on the first vector and the second vector.

Item 10. The method according to any of items 7-9, wherein the first data comprises the first token representative of the first context.

Item 11. The method according to any of items 7-10, wherein the second data comprises the first token representative of the first context and one or more second tokens representative of the second context.

Item 12. The method according to item 11, wherein the second data comprises the one or more second tokens associated with the first token.

Item 13. The method according to any of the previous items, the method comprising:

- obtaining one or more tokens for removal, wherein the one or more tokens comprise one or more of: a stop token, a frequently used token and an infrequently used token; and

- removing the one or more tokens from the textual data.

Item 14. An electronic device comprising a memory, a processor, and an interface, wherein the electronic device is configured to perform any of the methods according to any of items 1-13.

Item 15. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by an electronic device cause the electronic device to perform any of the methods of items 1-13.

The use of the terms “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. does not imply any particular order, but are included to identify individual elements. Moreover, the use of the terms “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. does not denote any order or importance, but rather the terms “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. are used to distinguish one element from another. Note that the words “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. are used here and elsewhere for labelling purposes only and are not intended to denote any specific spatial or temporal ordering. Furthermore, the labelling of a first element does not imply the presence of a second element and vice versa.

It may be appreciated that Figs. 1-3 comprise some circuitries or operations which are illustrated with a solid line and some circuitries or operations which are illustrated with a dashed line. The circuitries or operations which are comprised in a solid line are circuitries or operations which are comprised in the broadest example embodiment. The circuitries or operations which are comprised in a dashed line are example embodiments which may be comprised in, or a part of, or are further circuitries or operations which may be taken in addition to the circuitries or operations of the solid line example embodiments. It should be appreciated that these operations need not be performed in the order presented. Furthermore, it should be appreciated that not all of the operations need to be performed. The exemplary operations may be performed in any order and in any combination.

It is to be noted that the word "comprising" does not necessarily exclude the presence of other elements or steps than those listed. It is to be noted that the words "a" or "an" preceding an element do not exclude the presence of a plurality of such elements.

It should further be noted that any reference signs do not limit the scope of the claims, that the exemplary embodiments may be implemented at least in part by means of both hardware and software, and that several "means", "units" or "devices" may be represented by the same item of hardware.

The various exemplary methods, devices, nodes, and systems described herein are described in the general context of method steps or processes, which may be implemented in one aspect by a computer program product, embodied in a computer- readable medium, including computer-executable instructions, such as program code, executed by computers in networked environments. A computer-readable medium may include removable and non-removable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVD), etc. Generally, program circuitries may include routines, programs, objects, components, data structures, etc. that perform specified tasks or implement specific abstract data types. Computer-executable instructions, associated data structures, and program circuitries represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.

Although features have been shown and described, it will be understood that they are not intended to limit the claimed disclosure, and it will be made obvious to those skilled in the art that various changes and modifications may be made without departing from the scope of the claimed disclosure. The specification and drawings are, accordingly, to be regarded in an illustrative rather than restrictive sense. The claimed disclosure is intended to cover all alternatives, modifications, and equivalents.