

Title:
SYSTEM AND METHOD FOR SUMMARIZING TEXTUAL CLUES
Document Type and Number:
WIPO Patent Application WO/2023/183834
Kind Code:
A1
Abstract:
A method, system, and a non-transitory computer-readable storage medium for textual clue summarization are provided. The method can be applied to a computing device that can obtain a dataset. The dataset may include textual data. The computing device can obtain at least one clue. Each of the at least one clue can identify an aspect of the dataset. The computing device can further generate at least one topic based on a topic model, the dataset, and the at least one clue. The computing device can also obtain a topic summary for each of the at least one topic. The topic summary may include summary text that can be representative of the at least one topic. The computing device can also display the topic summary.

Inventors:
CHANG CHRISTINA (US)
DONG ZIZHUO (US)
XU ZIYUE (US)
PAN RUIQI (US)
JANIS ABRAM (US)
Application Number:
PCT/US2023/064799
Publication Date:
September 28, 2023
Filing Date:
March 22, 2023
Assignee:
HOLLISTER INC (US)
International Classes:
G06F16/34
Foreign References:
US20190362020A1 (2019-11-28)
Attorney, Agent or Firm:
FREIRE, Diego, F. et al. (US)
Claims:
CLAIMS

What is claimed is:

1. A method for clue summarization comprising: obtaining a dataset, wherein the dataset comprises textual data; obtaining at least one clue, wherein each of the at least one clue identifies an aspect of the dataset; generating at least one topic based on a topic model, the dataset, and the at least one clue; obtaining a topic summary for each of the at least one topic, wherein the topic summary comprises summary text that is representative of the at least one topic; and displaying the topic summary.

2. The method of claim 1, wherein the topic model comprises a Latent Dirichlet Allocation (LDA) model.

3. The method of claim 2, wherein obtaining the at least one topic comprises: obtaining the at least one topic based on the topic model, the dataset, the at least one clue, and an optimal number of topics.

4. The method of claim 2, wherein obtaining the at least one topic comprises: obtaining a topic coherence; obtaining a topic overlap; obtaining an optimal number of topics based on a difference between the topic coherence and the topic overlap; and obtaining the at least one topic based on the topic model, the dataset, the at least one clue, and the optimal number of topics.

5. The method of claim 1, wherein obtaining the topic summary for each of the at least one topic comprises: assigning each of the at least one clue to each of the at least one topic; combining each of the at least one clue within a same topic; and obtaining the topic summary for each of the at least one topic.

6. The method of claim 1, wherein the summary text comprises at least one top sentence from the textual data.

7. The method of claim 1, wherein obtaining at least one clue comprises: obtaining the at least one clue based on a natural language processing (NLP) model.

8. The method of claim 7, wherein the NLP model comprises a Bidirectional Encoder Representations from Transformers (BERT) model.

9. The method of claim 8, wherein the BERT model comprises a DistilBERT model trained with the dataset.

10. A computing device, comprising: one or more processors; a non-transitory computer-readable storage medium storing instructions executable by the one or more processors, wherein the one or more processors are configured to: obtain a dataset, wherein the dataset comprises textual data; obtain at least one clue, wherein each of the at least one clue identifies an aspect of the dataset; generate at least one topic based on a topic model, the dataset, and the at least one clue; obtain a topic summary for each of the at least one topic, wherein the topic summary comprises summary text that is representative of the at least one topic; and display the topic summary.

11. The computing device of claim 10, wherein the topic model comprises a Latent Dirichlet Allocation (LDA) model.

12. The computing device of claim 10, wherein the one or more processors configured to obtain the at least one topic are further configured to: obtain the at least one topic based on the topic model, the dataset, the at least one clue, and an optimal number of topics.

13. The computing device of claim 11, wherein the one or more processors configured to obtain the at least one topic are further configured to: obtain a topic coherence; obtain a topic overlap; obtain an optimal number of topics based on a difference between the topic coherence and the topic overlap; and obtain the at least one topic based on the topic model, the dataset, the at least one clue, and the optimal number of topics.

14. The computing device of claim 10, wherein the one or more processors configured to obtain the topic summary for each of the at least one topic are further configured to: assign each of the at least one clue to each of the at least one topic; combine each of the at least one clue within a same topic; and obtain the topic summary for each of the at least one topic.

15. The computing device of claim 10, wherein the summary text comprises at least one top sentence from the textual data.

16. The computing device of claim 10, wherein the one or more processors configured to obtain at least one clue are further configured to: obtain the at least one clue based on a natural language processing (NLP) model.

17. The computing device of claim 16, wherein the NLP model comprises a Bidirectional Encoder Representations from Transformers (BERT) model.

18. The computing device of claim 17, wherein the BERT model comprises a DistilBERT model trained with the dataset.

19. A non-transitory computer-readable storage medium storing a plurality of programs for execution by a computing device having one or more processors, wherein the plurality of programs, when executed by the one or more processors, cause the computing device to perform acts comprising: obtaining a dataset, wherein the dataset comprises textual data; obtaining at least one clue based on a natural language processing (NLP) model, wherein each of the at least one clue identifies an aspect of the dataset; generating at least one topic based on a topic model, the dataset, and the at least one clue; obtaining a topic summary for each of the at least one topic, wherein the topic summary comprises summary text that is representative of the at least one topic; and displaying the topic summary.

20. The non-transitory computer-readable storage medium of claim 19, wherein the plurality of programs further cause the computing device to perform: obtain a topic coherence; obtain a topic overlap; obtain an optimal number of topics based on a difference between the topic coherence and the topic overlap; and obtain the at least one topic based on the topic model, the dataset, the at least one clue, and the optimal number of topics.

Description:
TITLE

SYSTEM AND METHOD FOR SUMMARIZING TEXTUAL CLUES

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application is based upon and claims priority to U.S. Provisional Application No. 63/322,843, filed on March 23, 2022, the entire contents of which are incorporated herein by reference.

BACKGROUND

[0002] This disclosure is related to a system and method for analyzing and summarizing textual clues to identify, summarize, and prioritize relevant topics (or hunches) upon which action may be taken.

[0003] Data may be collected related to user and clinician feedback and experience in order to improve and develop medical devices and procedures. Such data may allow for a better understanding of what users and clinicians desire from medical devices and procedures. There are currently qualitative methods to condense several pieces of information from collected data into an aggregated item. However, these qualitative methods may not be sophisticated enough to provide much insight into the data collected. For example, there may be a large amount of data to analyze, and qualitative methods may not be able to provide a useful summary of such data.

[0004] Accordingly, it is desirable to provide an improved system and method for analyzing collected data.

BRIEF SUMMARY

[0005] A method, system, and a non-transitory computer-readable storage medium for textual clue summarization are provided according to various embodiments.

[0006] In a first aspect of the present disclosure, a method for clue summarization is provided. The method may be applied to a computing device. The computing device may obtain a dataset. The dataset may include textual data. The computing device may also obtain at least one clue. Each of the at least one clue can identify an aspect of the dataset. The computing device may further generate at least one topic based on a topic model, the dataset, and the at least one clue. The computing device may also obtain a topic summary for each of the at least one topic. The topic summary may include summary text that is representative of the at least one topic. The computing device may further display, on a display device, the topic summary.

[0007] In an embodiment of the first aspect, the topic model comprises a Latent Dirichlet Allocation (LDA) model. In such an embodiment, obtaining the at least one topic may include obtaining the at least one topic based on the topic model, the dataset, the at least one clue, and an optimal number of topics. In another such embodiment, obtaining the at least one topic may include obtaining a topic coherence, obtaining a topic overlap, obtaining an optimal number of topics based on a difference between the topic coherence and the topic overlap, and obtaining the at least one topic based on the topic model, the dataset, the at least one clue, and the optimal number of topics.

[0008] In an embodiment of the first aspect, obtaining the topic summary for each of the at least one topic may include assigning each of the at least one clue to each of the at least one topic, combining each of the at least one clue within a same topic, and obtaining the topic summary for each of the at least one topic.

[0009] In an embodiment of the first aspect, the summary text may include at least one top sentence from the textual data.

[0010] In an embodiment of the first aspect, obtaining at least one clue may include obtaining the at least one clue based on a natural language processing (NLP) model. In such an embodiment, the NLP model may include a Bidirectional Encoder Representations from Transformers (BERT) model. In a similar embodiment, the BERT model may include a DistilBERT model trained with the dataset.

[0011] In a second aspect of the present disclosure, a computing device is provided. The computing device may include one or more processors and a non-transitory computer-readable memory storing instructions executable by the one or more processors. The one or more processors may be configured to obtain a dataset. The dataset may include textual data. The one or more processors may also be configured to obtain at least one clue. Each of the at least one clue may identify an aspect of the dataset. The one or more processors may also be configured to generate at least one topic based on a topic model, the dataset, and the at least one clue. The one or more processors may also be configured to obtain a topic summary for each of the at least one topic. The topic summary may include summary text that is representative of the at least one topic. The one or more processors may also be configured to display, on a display, the topic summary.

[0012] In an embodiment of the second aspect, the topic model comprises an LDA model.

[0013] In an embodiment of the second aspect, the one or more processors may further be configured to obtain the at least one topic based on the topic model, the dataset, the at least one clue, and an optimal number of topics.

[0014] In an embodiment of the second aspect, the one or more processors may further be configured to obtain a topic coherence, obtain a topic overlap, obtain an optimal number of topics based on a difference between the topic coherence and the topic overlap, and obtain the at least one topic based on the topic model, the dataset, the at least one clue, and the optimal number of topics.

[0015] In an embodiment of the second aspect, the one or more processors may further be configured to assign each of the at least one clue to each of the at least one topic, combine each of the at least one clue within a same topic, and obtain the topic summary for each of the at least one topic.

[0016] In an embodiment of the second aspect, the summary text may include at least one top sentence from the textual data.

[0017] In an embodiment of the second aspect, the one or more processors may further be configured to obtain the at least one clue based on an NLP model. In such an embodiment, the NLP model may include a BERT model. In a similar embodiment, the BERT model may include a DistilBERT model trained with the dataset.

[0018] In a third aspect of the present disclosure, a non-transitory computer-readable storage medium having instructions stored therein is provided. When the instructions are executed by one or more processors of a computing device, the instructions may cause the device to obtain a dataset. The dataset may include textual data. The instructions may also cause the device to obtain at least one clue based on an NLP model. Each of the at least one clue can identify an aspect of the dataset. The instructions may also cause the device to generate at least one topic based on a topic model, the dataset, and the at least one clue. The instructions may also cause the device to obtain a topic summary for each of the at least one topic. The topic summary may include summary text that may be representative of the at least one topic. The instructions may also cause the device to display, on a display, the topic summary.

[0019] In an embodiment of the third aspect, the instructions may further cause the device to obtain a topic coherence, obtain a topic overlap, obtain an optimal number of topics based on a difference between the topic coherence and the topic overlap, and obtain the at least one topic based on the topic model, the dataset, the at least one clue, and the optimal number of topics.

[0020] The foregoing general description and the following detailed description are examples only and are not restrictive of the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

[0021] The benefits and advantages of the present embodiments will become more readily apparent to those of ordinary skill in the relevant art after reviewing the following detailed description and accompanying drawings, wherein:

[0022] FIG. 1 is a flowchart illustrating the steps of a clue summarization method, according to an embodiment.

[0023] FIG. 2A is a bar graph illustrating the most common words for an example dataset.

[0024] FIG. 2B is a bar graph illustrating the most common bigrams for the example dataset.

[0025] FIG. 2C is a bar graph illustrating the most common trigrams for the example dataset.

[0026] FIG. 3A is a diagram illustrating a topic modeling process.

[0027] FIG. 3B is a bar graph illustrating topics allocation to documents for an example dataset.

[0028] FIG. 4 is a graph illustrating model metrics per number of topics for an example dataset.

[0029] FIG. 5A is an intertopic distance map for an example dataset.

[0030] FIG. 5B is a bar graph illustrating the top-30 most relevant terms for a topic in an example dataset.

[0031] FIG. 6 is a flowchart illustrating the steps of a method for clue summarization, according to another embodiment.

[0032] FIG. 7 is a schematic illustration of a computing environment, according to an embodiment.

DETAILED DESCRIPTION

[0033] While the present disclosure is susceptible of embodiment in various forms, there is shown in the drawings and will hereinafter be described presently preferred embodiments with the understanding that the present disclosure is to be considered an exemplification and is not intended to limit the disclosure to the specific embodiments illustrated. The words “a” or “an” are to be taken to include both the singular and the plural. Conversely, any reference to plural items shall, where appropriate, include the singular. Although the words “first,” “second,” “third,” and the like may be used in the present disclosure to describe various information, such information should not be limited to these words. These words are only used to distinguish one category of information from another. The directional words “top,” “bottom,” “up,” “down,” “front,” “back,” and the like are used for purposes of illustration and, as such, are not limiting. Depending on the context, the word “if” as used herein may be interpreted as “when” or “upon” or “in response to determining.”

[0034] The present disclosure provides a system and method for textual clue summarization. A clue may be a piece of information that helps to identify a particular linguistic feature or aspect of a text. For example, a user can use a clue from the text to understand the entire text based on how the clue is used or appears in the text. The clue summarization method can include the use of text analysis technology to analyze and summarize a textual dataset based on clues obtained about the dataset. For example, text analysis technology can be used to obtain hunches or topics from the text using the clues. The hunches or topics may represent the overall text, and they can then be used to summarize the textual clues. The clue summarization method can use artificial intelligence (AI) tools or models, such as machine learning, for automatic extraction of meaningful insights from large and complex datasets. For example, machine learning models can be trained to uncover patterns and insights in large, complex datasets that might not be immediately obvious to human analysts. Therefore, text analysis technology can be improved with the disclosed clue summarization method.

[0035] Referring now to the figures, FIG. 1 illustrates a clue summarization method 100. The clue summarization method 100 may be applied to a computing device, mobile device, personal computer, cloud service, or server.

[0036] In step 110, the computing device can obtain a dataset. The dataset can include textual data related to overall themes or topics. For example, the dataset can include textual data about medicine and more specifically about medical devices.

[0037] In step 120, the computing device can preprocess the dataset. The text can be preprocessed to standardize the dataset for processing. For example, preprocessing can include converting all letters to lowercase, removing punctuation, and removing special characters so the text in the data can be standardized.
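As a minimal sketch, the standardization described in step 120 might look like the following (the `preprocess` helper and sample text are illustrative, not part of the disclosure):

```python
import re

def preprocess(text: str) -> str:
    """Standardize raw clue text: lowercase, then drop punctuation
    and special characters, as described for step 120."""
    text = text.lower()
    # Keep only letters, digits, and whitespace; replace everything else.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the runs of whitespace left behind by removed characters.
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Health-care: users' feedback (2023)!"))
```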

[0038] In step 130, the computing device can obtain clues. The clues can be created by analyzing the dataset and extracting data. For example, a dataset about medical devices will include clues for understanding what insight the dataset has about medical devices. Such clues could include a field ID, title, lens, text, and a link to a website from which the data was collected. The clues can be obtained from a user looking to further analyze the dataset.

[0039] In step 140, the computing device can generate topics using topic modeling. In an embodiment, topic modeling can include Latent Dirichlet Allocation (LDA), as further discussed below in the Clue Topic Summarization section.

[0040] In step 150, the computing device can assign each clue to a topic.

[0041] In step 160, the computing device can combine all clues within the same topic.

[0042] In step 170, the computing device can create summaries for each topic. For example, the topic can be used to determine the top sentence within the text that represents the topic.
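One simple way to sketch this extractive step is to score each sentence by its overlap with a topic's top words (the `top_sentence` helper and the sample sentences below are hypothetical illustrations, not the disclosed implementation):

```python
def top_sentence(sentences, topic_words):
    """Return the sentence sharing the most words with the topic's
    top terms: a simple stand-in for the extractive step in step 170."""
    def score(sentence):
        return len(set(sentence.lower().split()) & set(topic_words))
    return max(sentences, key=score)

sents = ["the clinic improved patient care",
         "shipping was delayed last week"]
print(top_sentence(sents, ["patient", "care", "clinic"]))
```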

[0043] In step 180, the computing device can display the summaries. The summaries can be displayed as a list of summaries for the topics found. They can be displayed to a user for further analysis or for the user to take action based upon such topics.

[0044] In an embodiment, the dataset may include, for example, feedback from medical device users and clinicians regarding their experience using such medical devices. Clue summarization may include one or two sentences that describe what the clue means in terms of the dataset collected and analyzed. The disclosed system and method may include a multi-layer process for condensing several pieces of information into one aggregated item in each step. The summarization of textual clues from the dataset can be used to extract or determine subconscious observations that such users and clinicians may have in their daily lives. The observations may then be refined into actionable items for improving such medical devices or associated services.

[0045] In an embodiment, the dataset may be analyzed through an exploratory data analysis to acquire insight about the textual data. For example, a dataset may be analyzed to determine the common words used. These common words may indicate what the dataset is related to. FIGS. 2A-2C illustrate an analysis of an example dataset for a number of repeated words.

[0046] FIG. 2A is a bar graph identifying the most common words in an example dataset. According to the example dataset represented by FIG. 2A, the most common word is “health,” which may indicate the dataset is related to health care. The x-axis of the bar graph represents the top words in the example dataset. Each word is represented by a single bar in the graph. The y-axis represents the number of times each word appears in the example dataset.

[0047] FIG. 2B is a bar graph identifying the most common bigrams in the example dataset. According to the example dataset represented by FIG. 2B, the most common bigram is “health care,” which may indicate the dataset is related to health care. The x-axis of the bar graph represents the top bigrams in the example dataset. Each bigram is represented by a single bar in the graph. The y-axis represents the number of times each bigram appears in the example dataset.

[0048] FIG. 2C is a bar graph identifying the most common trigrams in the example dataset. According to the example dataset represented by FIG. 2C, the most common trigram is “fruit vegetable intake,” which may indicate the dataset is related to dietary healthcare. The x-axis of the bar graph represents the top trigrams in the example dataset. Each trigram is represented by a single bar in the graph. The y-axis represents the number of times each trigram appears in the example dataset.
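The word, bigram, and trigram frequency counts behind FIGS. 2A-2C can be sketched with a simple counter (the toy corpus below is illustrative, not the example dataset):

```python
from collections import Counter

def top_ngrams(texts, n, k=3):
    """Count the k most common n-grams across a corpus, as in the
    exploratory analysis illustrated by FIGS. 2A-2C."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts.most_common(k)

docs = ["health care matters", "health care access", "fruit vegetable intake"]
print(top_ngrams(docs, 2, k=1))
```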

[0049] In an embodiment, a clue collection process may include an optimized platform or analytical dashboard for an increased clue collection efficiency. The platform may include a structured text form where information relevant to a specific clue is collected. The dashboard may include generated topics, as well as preliminary visualizations of how clues are distributed over these topics.

[0050] Dataset

[0051] In an embodiment, a medical device user or clinician may provide feedback about the medical device in the form of survey responses, social media comments or reviews, customer support interactions, website reviews, feedback forms, or participation in focus groups. The feedback can include simple yes or no answers to questions or full responses involving complete sentences and paragraphs. The feedback may further include spelling errors, grammatical errors, abbreviations and acronyms, incomplete or inconsistent data, non-standard language (including slang, jargon, or non-English text), ambiguity, and encoding errors. This feedback may be collected into a dataset to be analyzed.

[0052] A. Clue Topic Summarization

[0053] In an embodiment, the dataset may be collected from a central point, such as a data server, or gathered from various points, such as a number of known web links. Clues can then be used to summarize the dataset into topics. Clues can include, for example, a field ID, title, lens, text, and a relevant link. The relevant link can be a link to a website from which text relating to the clue can be collected, for example, a social media post where user feedback has been entered. The dataset can be collected using web scraping to gather text from the set of given web links. This data can be used in a model for training.

[0054] In an embodiment, for each clue, the text from the title, clue text, and text from the relevant link can be combined into a single string. The text can be preprocessed and cleaned by converting all letters to lowercase, removing punctuation, and removing special characters so the text in the data can be standardized.

[0055] In another embodiment, clue topic summarization can include a clustering method. For example, the clustering method may be a K-means clustering method, a mean-shift method, or a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) method. K-means clustering can include a machine learning algorithm used for grouping data points into clusters based on their similarity. The algorithm may work by randomly initializing a set of K centroids, assigning each data point to its closest centroid, and then updating the centroids to the mean of the assigned data points. This process can be repeated iteratively until the centroids no longer move or a maximum number of iterations is reached.

[0056] In an embodiment, the clustering method can include transforming the text into numerical features, then applying dimensionality reduction (principal component analysis) to represent each clue with two features. The clues can then be clustered using a K-means clustering method.
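The initialize-assign-update loop described above can be sketched with a minimal K-means over two-feature points (the points below stand in for PCA-reduced clue vectors; this is an illustration under those assumptions, not the disclosed implementation):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal K-means over 2-D points (e.g., clues reduced to two
    features by PCA), following the loop described above."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                                + (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for i, cluster in enumerate(clusters):
            if cluster:
                updated.append((sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster)))
            else:
                updated.append(centroids[i])
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    return centroids

# Two well-separated toy clusters; K-means recovers their centers.
pts = [(0.1, 0.2), (0.0, 0.1), (5.0, 5.1), (5.2, 4.9)]
centers = sorted(kmeans(pts, 2))
print(centers)
```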

[0057] B. Clue Importance Estimation Data

[0058] In an embodiment, the dataset can be a 243-row dataframe. The dataset can include clues and hunches about the textual data. Clues can be mapped into hunches in groups. For example, the groups can consist of 2-5 clues per hunch. In an embodiment, the dataset can be preprocessed into a binary class: either being selected or not. In one example, the whole text can have more than 4,700 words, with half of the words appearing only once.

[0059] Artificial Intelligence

[0060] In an embodiment, clue summarization may include a text processing pipeline that can categorize and prioritize clues from a dataset. The text processing pipeline may include AI tools that may allow for systematic and automated text processing. For example, the text processing pipeline may include natural language processing (NLP) and machine learning techniques that can analyze and summarize the textual clues. The use of AI tools, such as machine learning, can allow for automatic extraction of meaningful insights from large and complex datasets.

[0061] AI can refer to computer systems that can perform complicated processes such as problem-solving, decision-making, and natural language processing. AI can include machine learning, deep learning, and natural language processing. For example, AI can include machine learning algorithms that learn from a dataset to summarize patterns and use those patterns for decision-making. This can be achieved through the creation of statistical models, architectures, and optimization techniques that enable computer systems to automatically extract meaningful insights from large and complex datasets. One of the key benefits of machine learning is that it can uncover patterns and insights in large, complex datasets that might not be immediately obvious to human analysts.

[0062] NLP can include algorithms and models that enable computers to understand, interpret, and generate natural language. NLP can include applications such as machine translation, sentiment analysis, text classification, question answering, and speech recognition. NLP techniques can include statistical and machine learning methods, deep learning, and rule-based approaches.

[0063] One of the main challenges in NLP can be the ambiguity and complexity of human language. The same sentence can have different meanings depending on the context, tone, and other factors. Therefore, NLP systems need to be able to understand the nuances of language and accurately interpret meaning. NLP models include Bidirectional Encoder Representations from Transformers (BERT), Generative Pre-trained Transformer (GPT), and Transformer.

[0064] BERT is a pre-trained NLP model capable of understanding the context of words in a sentence by analyzing the surrounding words. BERT is based on the Transformer architecture, which allows it to understand the dependencies between words to provide more accurate language understanding. BERT can be pre-trained on large amounts of text data and can be fine-tuned for specific NLP tasks such as question answering, sentiment analysis, and language translation. Fine-tuning the model on a specific task can include using a smaller dataset, in addition to its pre-training on a large corpus of text data.

[0065] Gathering more text data can be helpful in training the models. In an embodiment, more data can be gathered using web scraping. The new text data gathered can preferably be those that span across a diverse range of topics in order to best improve the model. The model can be retrained with additional data to be improved and be able to make more accurate predictions. The larger the input text data corpus is, the better the model will be at summarizing the clue topics and estimating the clue importance.

[0066] GPT is a language model that uses Transformer architecture to generate natural language text. GPT is pre-trained on a large corpus of text using an unsupervised learning approach, where it learns to predict the next word in a sequence of text. The model is then finetuned on specific NLP tasks, such as language translation or text classification, to adapt to the specific domain.

[0067] Transformer is a deep neural network architecture that uses an attention mechanism that allows the model to weigh the importance of different parts of the input sequence when making predictions. The architecture consists of an encoder that processes the input sequence and a decoder that generates the output sequence. The encoder and decoder are made up of multiple layers, each containing a self-attention mechanism and a feedforward neural network.

[0068] Methodology

[0069] A. Clue Topic Summarization

[0070] In an embodiment, to identify the main themes in the clues, topic modeling, for example LDA, can be used on the clues in the dataset. LDA is a statistical model that can be used to identify topics in a collection of documents. It is a generative probabilistic model that assumes that each document in the collection is a mixture of various topics, and each topic is a probability distribution over words. LDA can include first creating a fixed number of topics and then assigning a probability distribution over words to each topic. LDA can further include identifying the distribution of topics in each document, and the distribution of words in each topic. These distributions can be learned by analyzing the co-occurrence patterns of words in the documents. LDA assumes that each document can be represented as a mixture of topics, and each topic can be represented as a distribution over words. LDA can be used to discover underlying topics and their distributions without any prior knowledge about the topics or the documents.

[0071] Topic modeling is a way to analyze a large corpus of text by grouping similar words and phrases together to form topics. This methodology can be used to identify the main ideas from all the clues given. The topics found from the model can be treated as hunches.

[0072] FIG. 3A illustrates a topic modeling process. According to example embodiments shown in FIG. 3A, a set of documents containing textual data can be processed using an LDA model. The LDA model can output a set of topics based on the text. The topics can include a weight for each word chosen for a topic. The weight for each word represents the probability that the particular word belongs to the particular topic. For example, the topics can include a topic 1 and a topic 2. Topic 1 can include, for example, “flower,” “rose,” and “plant.” For the word “flower,” the weight can be 3%, which means “flower” is the word most strongly associated with topic 1, ahead of “rose” and “plant.” Topic 2 can include, for example, “company,” “wage,” and “employee.” For the word “company,” the weight can be 2%, which means “company” is the word most strongly associated with topic 2, ahead of “wage” and “employee.”
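LDA's core assumption, that each document is a mixture of topics and each topic a distribution over words, can be illustrated with a toy computation (the word-topic weights and topic labels below are made up for illustration and do not come from the disclosure):

```python
# Hypothetical word-topic weights in the spirit of FIG. 3A.
word_topic = {
    "flower":  {"t1": 0.030, "t2": 0.001},
    "rose":    {"t1": 0.020, "t2": 0.001},
    "company": {"t1": 0.001, "t2": 0.020},
    "wage":    {"t1": 0.001, "t2": 0.015},
}

def doc_topic_mixture(doc_words):
    """Score each topic by summing its weights over the document's words,
    then normalize into a per-document topic distribution."""
    totals = {"t1": 0.0, "t2": 0.0}
    for w in doc_words:
        for t, p in word_topic.get(w, {}).items():
            totals[t] += p
    s = sum(totals.values())
    return {t: v / s for t, v in totals.items()}

mix = doc_topic_mixture(["flower", "rose", "company"])
print(max(mix, key=mix.get))  # leading topic for this toy document
```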

[0073] FIG. 3B shows topics allocated to documents. According to example embodiments shown in FIG. 3B, the topics can be allocated to each document to determine what each document can refer to. The x-axis of the bar graph represents the top documents used. The y-axis represents the percentage with which each topic is used. For example, document 1 can include topic 1, topic 2, and topic 3.

For document 1, topic 2 can be the leading topic. This means that document 1 may be better explained by topic 2, which involves words related to a company, wage, and employee. For document 2, topic 1 may be the most relevant, meaning flower, rose, and plant may signify what the document is about. Therefore, the use of topics can help understand what an example dataset refers to.

[0074] In an embodiment, an input required to run the topic modeling can be the number of topics to create. The number of topics can be chosen by considering two factors: topic coherence and topic overlap.

[0075] Topic coherence is a measure of the meaningfulness and interpretability of topics generated by the LDA model. It assesses how well the words within a topic are related to each other and to the underlying theme they represent. To calculate topic coherence, the pairwise similarity of top-ranked words in a topic can be compared using external knowledge sources like word embeddings or a thesaurus. Higher coherence scores indicate more meaningful and interpretable topics, and coherence can be used to select the best topic model for a given task.
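One way to sketch the pairwise-similarity coherence score described above; the vectors below are made-up toy embeddings standing in for learned ones:

```python
import numpy as np
from itertools import combinations

# Toy word vectors (hypothetical); a real score would use learned embeddings
embeddings = {
    "flower": np.array([1.0, 0.1]),
    "rose":   np.array([0.9, 0.2]),
    "plant":  np.array([0.8, 0.3]),
    "wage":   np.array([0.1, 1.0]),
}

def coherence(top_words):
    """Average pairwise cosine similarity of a topic's top-ranked words."""
    sims = []
    for a, b in combinations(top_words, 2):
        va, vb = embeddings[a], embeddings[b]
        sims.append(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return float(np.mean(sims))

tight = coherence(["flower", "rose", "plant"])  # related words -> higher coherence
mixed = coherence(["flower", "rose", "wage"])   # unrelated word -> lower coherence
```

A topic whose top words are semantically close scores higher than one contaminated by an unrelated word, which is the behavior the coherence measure is meant to capture.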

[0076] Topic overlap refers to situations where the same set of words appears in multiple topics, resulting in less interpretable and less meaningful topics. Topic overlap can arise when the number of topics is too high, or when the topics are too broad or too specific. To minimize topic overlap and improve topic coherence, it may be important to choose the number of topics carefully. For example, an ideal number of topics can be calculated (as shown in FIG. 4 below). In addition, tuning the hyperparameters (alpha and beta) of the LDA model may also be necessary. The alpha hyperparameter dictates the number of topics for any given clue, and the beta hyperparameter controls the number of words per topic. In addition, post-processing techniques such as merging or splitting topics can be used to further improve the quality of the topics generated by LDA.

[0077] In an embodiment, topic modeling can be run a number of times (for example, 40 times), using from one topic up to 40 topics. Then, the difference between topic coherence and topic overlap can be calculated for each run. The model with the largest difference between topic coherence and topic overlap can indicate the optimal number of topics. Once the topics are chosen, each clue can be assigned to one of the topics.
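The selection rule itself can be sketched as follows; the per-k coherence and overlap numbers below are illustrative stand-ins for values a real run of the topic model would compute:

```python
def select_num_topics(scores):
    """Pick the topic count whose coherence most exceeds its overlap."""
    return max(scores, key=lambda k: scores[k]["coherence"] - scores[k]["overlap"])

# Hypothetical metrics from running the topic model with k = 1..5 topics
scores = {
    1: {"coherence": 0.30, "overlap": 0.05},
    2: {"coherence": 0.45, "overlap": 0.10},
    3: {"coherence": 0.55, "overlap": 0.12},
    4: {"coherence": 0.50, "overlap": 0.30},
    5: {"coherence": 0.48, "overlap": 0.35},
}

best_k = select_num_topics(scores)  # k with the largest coherence - overlap
```

With these toy numbers, k = 3 maximizes the difference, mirroring the selection of 18 topics for the example dataset in FIG. 4.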

[0078] FIG. 4 shows a metric level per number of topics. According to example embodiments shown in FIG. 4, an ideal number of topics can be determined based on the average topic overlap and the topic coherence. For example, for the example dataset shown in FIG. 4, the ideal number of topics can be where the average topic overlap is the lowest and the topic coherence is the greatest (18 topics).

[0079] The text from all clues within the same topic can be combined. A text summarization method can be applied to identify the top sentences that are representative of the topic. For example, the top three sentences that are most representative of the topic can be used. The top three sentences can be determined, for example, based on a calculated probability of each clue being relevant to the topic.
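The top-sentence selection might be sketched as a simple ranking by each sentence's topic-relevance probability; the sentences and scores here are hypothetical:

```python
def top_sentences(sentences, relevance, n=3):
    """Return the n sentences with the highest topic-relevance probability."""
    ranked = sorted(zip(sentences, relevance), key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked[:n]]

sentences = [
    "Funding rose sharply this year.",
    "The program supports research and innovation.",
    "Weather was mild in March.",
    "Grants cover universities and startups.",
]
relevance = [0.70, 0.90, 0.05, 0.60]  # hypothetical per-sentence probabilities

summary = top_sentences(sentences, relevance)
```

The sentence unrelated to the topic receives a low probability and is excluded from the three-sentence summary.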

[0080] B. Clue Importance Estimation

[0081] In an embodiment, when a new clue is obtained, clue importance estimation can be used to determine which clues to use in the clue topic summarization process. In an embodiment, clue importance estimation can be treated as a binary classification problem. Different approaches for clue importance estimation can be used. These include CBOW (Continuous Bag-of-Words) models, word embedding models, and BERT models (described above).

[0082] CBOW is a type of neural network architecture that predicts the current word given its context, which consists of a fixed number of words before and after it. In CBOW, the input layer consists of the one-hot encoding of the context words, and the output layer consists of the one-hot encoding of the target word. The hidden layer, which contains the word embeddings, is typically much smaller than the input and output layers. During training, the network learns to predict the target word based on its context, and the weights of the hidden layer are updated to minimize the difference between the predicted and actual target word. The word embeddings are learned as a byproduct of this training process and are used to represent the words in a continuous, high-dimensional space.

[0083] Word embedding can represent words as dense, low-dimensional vectors in a continuous space. Word embedding allows LDA to capture the semantic relationships between words and improve the quality of the topics generated by the model. Word embedding can map each word to a dense and continuous vector in a low-dimensional space, such that semantically similar words are closer together in this space. These vectors can be learned from large amounts of text data using unsupervised machine learning algorithms, such as Word2Vec or GloVe. Word embedding can be used in LDA to calculate the topic coherence of a topic, which measures how well the words in the topic are related to each other and to the underlying theme they represent.

[0084] CBOW and word embedding can be used as baseline models since they can only predict sentences with words that appear in the training data. In other words, they may not perform any prediction task if a new word is involved in the sentence. Because CBOW and word embedding models are easier to train, a grid search can be performed to tune for the best vector representation of each word and use their F1 score (which measures the balance between precision and recall) as the benchmark. In an embodiment, a CBOW Stochastic Gradient Descent (SGD) classifier can be used for CBOW, as well as a large-window word embedding with a Convolutional Neural Network (CNN) model.

[0085] CBOW SGD classifiers are a type of linear classification algorithm that can be used for both binary and multi-class classification problems. In an SGD classifier, the loss function can typically be a linear SVM (Support Vector Machine) or logistic regression loss. The algorithm can update the weights of the model at each iteration by computing the gradient of the loss function with respect to a randomly selected subset of the training data.
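A hedged sketch of such an SGD classifier, assuming scikit-learn is available; the "sentence vectors" here are random stand-ins for averaged word embeddings, and the class separation is artificial:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Stand-in "sentence vectors": averaged word embeddings for two clue classes
X_important = rng.normal(loc=1.0, size=(50, 8))     # important clues
X_unimportant = rng.normal(loc=-1.0, size=(50, 8))  # unimportant clues
X = np.vstack([X_important, X_unimportant])
y = np.array([1] * 50 + [0] * 50)

# The default hinge loss gives a linear-SVM-style classifier trained with SGD
clf = SGDClassifier(random_state=0).fit(X, y)
micro_f1 = f1_score(y, clf.predict(X), average="micro")
```

The micro-averaged F1 score computed here is the same benchmark metric used in the Results section below.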

[0086] A CNN model is a type of neural network architecture that can consist of multiple layers, including convolutional layers, pooling layers, and fully connected layers. A CNN model, developed to be used in computer vision tasks, can be used to process text data, such as in the case of text classification tasks. For example, a CNN model can be trained to classify text by treating each word in the input sequence as a pixel and using convolutional and pooling layers to learn local features and reduce the dimensionality of the input.

[0087] In an embodiment, a dataset being used may only cover a small set of words. In this case, either a dataset with more words or a different model can be used. For example, a BERT model can be used since it was first trained on a very large corpus before fine-tuning with a specific dataset. Therefore, the BERT model already has a vector representation for all words in the first place. The weight can then be updated based on the data. BERT models can include BERTbase, DistilBERT, and EarlyBERT models.

[0088] BERTbase is a powerful language model that can be fine-tuned for a variety of natural language processing tasks, such as text classification, question-answering, and named entity recognition. BERTbase can contain 12 transformer layers and it can be pre-trained on a large corpus of text data. After pre-training, the BERTbase model can be fine-tuned on a specific natural language processing task using a smaller dataset. This fine-tuning process typically involves replacing the output layer of the model with a task-specific output layer and training the model on a labeled dataset for the target task.

[0089] DistilBERT is a smaller and faster variant of the BERT language model. It has the same underlying architecture as BERT but with a reduced number of parameters, making it more efficient to train and run. During training, it is first trained on a large dataset using the standard BERT architecture. Then, a smaller DistilBERT model is trained to mimic the behavior of the larger BERT model, using the outputs from the BERT model as a teacher signal. This allows the DistilBERT model to learn from the BERT model's predictions and to distill the knowledge of the larger model into a smaller one. Additionally, DistilBERT uses parameter pruning, a technique that removes unimportant or redundant parameters from the model, further reducing its size and computational requirements.
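The teacher-signal idea can be sketched as a temperature-softened KL loss between teacher and student output distributions; the logits below are toy values, and real distillation typically also mixes in a hard-label loss term:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T  # temperature T softens the distribution
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence from the student's to the teacher's softened distribution."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))))

teacher = [2.0, 0.5, -1.0]  # hypothetical teacher logits over 3 classes
matched = distillation_loss(teacher, teacher)       # identical outputs -> zero loss
off = distillation_loss([0.0, 2.0, -1.0], teacher)  # mismatch -> positive loss
```

Minimizing this loss pushes the student's predictions toward the teacher's, which is the mechanism by which the smaller model distills the larger model's knowledge.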

[0090] EarlyBERT is a technique used to improve the efficiency of training BERT-based models. EarlyBERT uses a technique called “early exit” to allow the model to make a prediction before it has processed the entire input sequence. This can be useful in scenarios where there is a time constraint for making a prediction, such as in interactive applications like chatbots or search engines. In EarlyBERT, the model is trained to make a prediction based on a partial input sequence, allowing it to exit early if it is confident in its prediction. This can significantly reduce the computational resources required to make a prediction, as the model does not need to process the entire input sequence if it can make a confident prediction based on a partial sequence. To train the EarlyBERT model, a new loss function can be introduced that takes into account the number of tokens processed before making a prediction. The model can be trained to minimize this loss function while also maximizing the accuracy of its predictions.
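The early-exit rule itself can be sketched as a confidence check after each processed token; the per-token class-probability vectors below are hypothetical model outputs, not real BERT activations:

```python
def early_exit_predict(stepwise_probs, threshold=0.9):
    """Return (predicted class, tokens consumed), exiting once confidence is high."""
    for consumed, probs in enumerate(stepwise_probs, start=1):
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            return best, consumed  # confident early: skip the remaining tokens
    return best, consumed          # never confident: used the whole sequence

# Hypothetical class probabilities after processing 1, 2, and 3 tokens
stepwise = [
    [0.55, 0.45],
    [0.95, 0.05],  # confident here -> exit after 2 of 3 tokens
    [0.97, 0.03],
]
label, used = early_exit_predict(stepwise)
```

In this toy run the model commits to class 0 after two tokens, saving the cost of processing the rest of the sequence, which is the efficiency gain described above.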

[0091] V. Results

[0092] A. Clue Topic Summarization

[0093] In an embodiment, for topic modeling, the optimal number of topics used can be, for example, 18. Summaries for each of the 18 topics can be created. For example, for an example dataset, one of the topics can be about the EU's Horizon Europe program. The following can be the summary from this topic: “horizon europe is the eu’s key funding programme for research and innovation with a budget of €95.5 billion, the result of that effort is a €95.5 billion (in current prices) research programme, including €5 billion that will come out of the eu’s new €750 billion recovery fund.”

[0094] In another embodiment, for an example dataset, the topic can be wearable technology and the topic summary can be “by joining forces, the researchers created the first flexible, stretchable wearable device that combines chemical sensing (glucose, lactate, alcohol and caffeine) with blood pressure monitoring, wearable technology in healthcare includes electronic devices that consumers can wear, like fitbits and smartwatches, and are designed to collect the data of users’ personal health and exercise.”

[0095] FIG. 5A shows an intertopic distance map (via multidimensional scaling). According to an example embodiment shown in FIG. 5A, the intertopic distance map is a visual example of the topic modeling. The intertopic distance map can include 18 topics, each shown as a bubble. The size of a bubble can indicate the marginal topic distribution, that is, the importance of the topic. The closer two bubbles are, the more similar the topics are. The bar chart in FIG. 5B shows the most relevant words in a topic.

[0096] FIG. 5B is a diagram illustrating the top-30 most relevant terms for topic 11. According to example embodiments shown in FIG. 5B, the diagram shows the overall term frequency and the estimated term frequency within the selected topic for each of the top-30 words. The x-axis of the bar graph represents the words in the example dataset; each word is represented by a single bar in the graph. The y-axis represents the frequency of the word. Topic 11 can account for 3.5% of tokens. For example, the word technology is the top word used within topic 11, which may signify that topic 11 deals with technology.

[0097] Overall, this process of identifying topics and creating topic summaries can help make the clue-to-hunch process more efficient. Going from clues to hunches (topics) is where machine learning models are the most beneficial, because there may be a large number of clues that are difficult to use directly. For example, hundreds of clues may be reduced to under 20 topics using machine learning models.

[0098] B. Clue Importance Estimation

[0099] To evaluate the performance of different models, a micro-F1 score can be used. There may be an imbalance of classes, so a micro score might be more indicative than the macro score.

[00100] In an embodiment, the models can be run to determine a micro-F1 score. Table 1 shows the results.

Table 1. Model test

[00101] The CBOW SGD classifier can have the best result, with a micro-F1 of 0.67. However, because SGD may be difficult to implement in real practice, the BERT models may be more practical.

[00102] Among the BERT models, the DistilBERT model may have the best performance, with a micro-F1 score of 0.64 and much less training time compared to the BERTbase model.

[00103] FIG. 6 shows a method 600 for clue summarization. The method may be applied to a computing device, mobile device, personal computer, cloud service, or server.

[00104] In step 610, the computing device can obtain a dataset.

[00105] In step 620, the computing device can obtain at least one clue.

[00106] In step 630, the computing device can generate at least one topic based on a topic model, the dataset, and the at least one clue.

[00107] In step 640, the computing device can obtain a topic summary for each of the at least one topic.

[00108] In step 650, the computing device can display the topic summary.
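Steps 610 through 650 can be sketched end to end with stub components; the topic model and summarizer below are placeholders for illustration only, not the claimed implementations:

```python
def clue_summarization(dataset, clues, topic_model, summarizer):
    """Sketch of method 600: dataset + clues -> topics -> displayed summaries."""
    topics = topic_model(dataset, clues)                # step 630: generate topics
    summaries = {name: summarizer(texts)                # step 640: summarize each
                 for name, texts in topics.items()}
    for name, summary in summaries.items():             # step 650: display
        print(f"{name}: {summary}")
    return summaries

# Stub components (hypothetical): keyword-based grouping, first-sentence summary
def stub_topic_model(dataset, clues):
    return {"topic 1": [c for c in clues if "flower" in c],
            "topic 2": [c for c in clues if "wage" in c]}

def stub_summarizer(texts):
    return texts[0] if texts else ""

dataset = ["flower rose plant", "company wage employee"]  # step 610: obtain dataset
clues = ["flower sales rising", "wage growth slowing"]    # step 620: obtain clues
result = clue_summarization(dataset, clues, stub_topic_model, stub_summarizer)
```

In practice the stub components would be replaced by the LDA topic model and the text summarization method described in the sections above.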

[00109] FIG. 7 shows a computing system 700. According to example embodiments shown schematically in FIG. 7, the computing system 700 can include a computing environment 710 coupled to a user interface 750 and a communication unit 760. The computing environment 710 can include a processor 720, a memory 730, and an I/O (input/output) interface 740. The computing environment 710 can be coupled to the user interface 750 and communication unit 760 through the I/O interface 740.

[00110] The processor 720 can typically control the overall operations of the computing environment 710, such as the operations associated with data acquisition, data processing, and data communications. The processor 720 can include one or more processors to execute instructions to perform all or some of the steps in the above-described methods. Moreover, the processor 720 can include one or more modules that facilitate the interaction between the processor 720 and other components. The processor may be or include a central processing unit (CPU), a microprocessor, a single chip machine, a graphical processing unit (GPU), or the like.

[00111] The memory 730 can store various types of data to support the operation of the computing environment 710. The memory 730 can include predetermined software 731. Examples of such data comprise instructions for any applications or methods operated on the computing environment 710, a dataset, textual data, clues, topics, image data, model data, etc. The memory 730 may be implemented by using any type of volatile or non-volatile memory device, or a combination thereof, such as a static random-access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, or a magnetic or optical disk.

[00112] The I/O interface 740 can provide an interface between the processor 720 and peripheral interface modules, such as a RF circuitry, external port, proximity sensor, audio and speaker circuitry, video and camera circuitry, microphone, accelerometer, display controller, optical sensor controller, intensity sensor controller, other input controllers, keyboard, a click wheel, buttons, and the like. The buttons may include but are not limited to, a home button, a power button, and volume buttons.

[00113] The user interface 750 can include a speaker, lights, display, haptic feedback motor or other similar technologies for communicating with the user.

[00114] Communication unit 760 provides communication between the processing unit, an external device, a mobile device, and a webserver (or cloud). The communication can be done through, for example, WIFI or BLUETOOTH hardware and protocols. The communication unit 760 can be within the computing environment or connected to it.

[00115] In some embodiments, there is also provided a non-transitory computer-readable storage medium comprising a plurality of programs, such as comprised in the memory 730, executable by the processor 720 in the computing environment 710, for performing the above-described methods. For example, the non-transitory computer-readable storage medium may be a ROM, a RAM, or the like.

[00116] The non-transitory computer-readable storage medium has stored therein a plurality of programs for execution by a computing device having one or more processors, where the plurality of programs, when executed by the one or more processors, cause the computing device to perform the above-described method for clue summarization.

[00117] In some embodiments, the computing environment 710 may be implemented with one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), graphical processing units (GPUs), controllers, micro-controllers, microprocessors, or other electronic components, for performing the above methods.

[00118] From the foregoing it will be observed that numerous modifications and variations can be effectuated without departing from the true spirit and scope of the novel concepts of the present disclosure. It is to be understood that no limitation with respect to the specific embodiments illustrated is intended or should be inferred. The disclosure is intended to cover by the appended claims all such modifications as fall within the scope of the claims.