Title:
CORRELATING DATA FROM HETEROGENEOUS SOURCES
Document Type and Number:
WIPO Patent Application WO/2022/119575
Kind Code:
A1
Abstract:
A data processing system for correlating data from heterogeneous sources includes: a main processor and associated memory; a classification model trained with data from a first data source; and a clustering model for clustering content from a second data source. The main processor is programmed to: use the classification model to classify content from the second data source; use the clustering model for clustering content from the first data source into a number of clusters; compare the classified content from the second data source to the clusters of content from the first data source; match a selected cluster of content from the first data source with the classified content from the second data source; and correlate the classified content from the second data source with the selected cluster of content from the first data source.

Inventors:
CHIEN PEI-YUAN (TW)
HUANG I-KAI (TW)
CHOU HONG-WEI (TW)
TAI CHIANG-HSIEN (TW)
Application Number:
PCT/US2020/063140
Publication Date:
June 09, 2022
Filing Date:
December 03, 2020
Assignee:
HEWLETT PACKARD DEVELOPMENT CO (US)
International Classes:
G06F11/36; G06N20/00
Foreign References:
CN105653444B (2018-07-13)
US20200241861A1 (2020-07-30)
US20180307713A1 (2018-10-25)
Attorney, Agent or Firm:
JENNEY, Michael et al. (US)
Claims:
CLAIMS

WHAT IS CLAIMED IS:

1. A method of correlating data from heterogeneous sources, the method comprising addressing defects indicated by a test log from a test scenario executed on a piece of software by:
using a classification model to predict root causes of the defects indicated by the test log;
comparing the predicted root causes from the classification model to root causes identified in a cluster of closed defect reports that document resolution of a previous defect;
matching a selected cluster of closed defect reports to the test log based on the comparing; and
providing the selected cluster of the closed defect reports with the test log to guide debugging of the piece of software from the defects indicated by the test log.

2. The method of claim 1, wherein comparing the predicted root causes from the classification model to root causes identified in a cluster of closed defect reports further comprises:
extracting features of the test log to multi-dimensional vectors that represent semantic and syntactic relationships between words;
extracting features of the closed defect reports in a cluster to multi-dimensional vectors that represent semantic and syntactic relationships between words; and
comparing the vectors from the test log to the vectors from the defect report cluster.

3. The method of claim 1, further comprising:
collecting only closed defect reports from a set time range; and
dividing the collected defect reports into a number of clusters with a trained clustering model.

4. The method of claim 3, wherein, when there are no predicted root causes of the test log matching identified root causes from the cluster of closed defect reports, the method further comprises adjusting the clustering model.

5. The method of claim 1, wherein matching a selected cluster of closed defect reports to the test log further comprises:
identifying a number (K) of predicted root causes from the test log that match an identified root cause from the selected cluster of closed defect reports;
calculating a ratio of K over all the predicted root causes; and
comparing the ratio to a threshold, wherein, when the ratio exceeds the threshold, the selected cluster of closed defect reports is accepted to guide the debugging.

6. The method of claim 5, wherein, when the ratio does not exceed the threshold, the method includes adjusting a clustering model used to form clusters of the closed defect reports.

7. The method of claim 1, further comprising identifying an owner of the test log based on owners of closed defect logs in the selected cluster.

8. The method of claim 2, wherein extracting the features of the test log and closed defect reports to the vectors is performed using Natural Language Processing and the Word2Vec technique.

9. The method of claim 1, further comprising training the classification model with closed defect reports.

10. A data processing system for correlating data from heterogeneous sources, the system comprising:
a main processor and associated memory;
a classification model trained with data from a first data source; and
a clustering model for clustering content from a second data source;
wherein the main processor is programmed to:
use the classification model to classify content from the second data source;
use the clustering model for clustering content from the first data source into a number of clusters;
compare the classified content from the second data source to the clusters of content from the first data source;
match a selected cluster of content from the first data source with the classified content from the second data source; and
correlate the classified content from the second data source with the selected cluster of content from the first data source.

11. The data processing system of claim 10, further comprising:
a database of closed defect reports that document resolution of a software defect, wherein the first data source comprises the database of closed defect reports;
wherein the second data source comprises a test log from a test scenario executed on a piece of software.

12. The data processing system of claim 11, the main processor programmed for:
using the classification model to predict root causes of the defects indicated by the test log;
using the clustering model for clustering the closed defect reports into a number of clusters;
comparing the predicted root causes from the classification model to root causes identified in a cluster of the closed defect reports;
matching a selected cluster of closed defect reports to the test log based on the comparing; and
providing the selected cluster of the closed defect reports with the test log to guide debugging of the piece of software from the defects indicated by the test log.

13. The data processing system of claim 11, further comprising a Natural Language Processor, the main processor being further programmed to use the Natural Language Processor for:
extracting features of the test log to multi-dimensional vectors that represent semantic and syntactic relationships between words;
extracting features of the closed defect reports in a cluster to multi-dimensional vectors that represent semantic and syntactic relationships between words; and
comparing the vectors from the test log to the vectors from the defect report cluster.

14. The data processing system of claim 11, wherein matching a selected cluster of closed defect reports to the test log further comprises:
identifying a number (K) of predicted root causes from the test log that match an identified root cause from the selected cluster of closed defect reports;
calculating a ratio of K over all the predicted root causes; and
comparing the ratio to a threshold, wherein, when the ratio exceeds the threshold, the selected cluster of closed defect reports is accepted to guide the debugging.

15. A method of correlating data from heterogeneous sources wherein data from a first data source is labeled and data from a second data source is unlabeled, the method comprising:
training a classification model with data from the first data source;
using the classification model to classify content from a second data source;
clustering content from the first data source into a number of clusters;
comparing the classified content from the second data source to the clusters of content from the first data source;
matching a selected cluster of content from the first data source with the classified content from the second data source; and
correlating the classified content from the second data source with the selected cluster of content from the first data source so that information can be extracted from both data sources together based on the correlation between the data sources.

Description:
CORRELATING DATA FROM HETEROGENEOUS SOURCES

BACKGROUND

[0001] Extracting useful information from collections of data may be needed in a wide variety of fields and applications. Sometimes two different collections of data may be about the same subject or field. For example, both sets of data may contain measurements over time of a number of variables measured in a mechanical system. However, the data might be organized differently in the two collections. For example, the data in one such collection may be labeled, while similar data in a second collection is not labeled. In such a case, the data collections are considered to be heterogeneous. Because the data in the two collections may be the same type of or similar data, considering the two data sets together may yield more useful information than considering just one of the data sets in isolation. Consequently, if there were a way to correlate the data from the two heterogeneous data sources, data from both sources could then be used to extract more useful information than would be available from a single data set. However, manually correlating the data between the two data sets may be extremely time-consuming or simply beyond human capacity as the size of the data sets increases.

BRIEF DESCRIPTION OF THE DRAWINGS

[0002] The accompanying drawings illustrate various implementations of the principles described herein and are a part of the specification. The illustrated implementations are merely examples and do not limit the scope of the claims.

[0003] Fig. 1 is a flowchart illustrating an example of a method of addressing defects indicated by a test log from a test scenario executed on a piece of software, consistent with the disclosed implementations.

[0004] Fig. 2 is a further flowchart illustrating more detail of the method of Fig. 1, consistent with the disclosed implementations.

[0005] Fig. 3 is a further flowchart illustrating more detail of the method of Fig. 1, consistent with the disclosed implementations.

[0006] Fig. 4 is a further flowchart illustrating more detail of the method of Fig. 1, consistent with the disclosed implementations.

[0007] Fig. 5 is a chart illustrating more possible detail of the method of Fig. 1, consistent with the disclosed implementations.

[0008] Fig. 6 is a diagram of an illustrative system for correlating data from heterogeneous sources, consistent with the disclosed implementations.

[0009] Fig. 7 is a diagram of an illustrative system for correlating data from a test log with data from a database of closed defect reports, consistent with the disclosed implementations.

[0010] Fig. 8 is another diagram of an illustrative system for correlating data from a test log with data from a database of closed defect reports, consistent with the disclosed implementations.

[0011] Fig. 9 is another diagram of an illustrative system for correlating data from a test log with data from a database of closed defect reports, consistent with the disclosed implementations.

[0012] Fig. 10 is a flowchart of a method of correlating data from heterogeneous sources wherein data from a first data source is labeled and data from a second data source is unlabeled, consistent with the disclosed implementations.

[0013] Fig. 11 is another flowchart of a method for correlating data from a test log with data from a database of closed defect reports, consistent with the disclosed implementations.

[0014] Fig. 12 is a diagram of a computing device for correlating data from heterogeneous sources, according to an example of the principles described herein.

[0015] Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.

DETAILED DESCRIPTION

[0016] As noted above, there are a wide variety of situations in which useful or guiding information needs to be extracted by analyzing collected data. Sometimes two different collections or sets of data contain the same type or types of data, or are otherwise related, such that important information might be extracted if both collections of data could be considered together.

[0017] However, the data might be organized differently in the two collections. For example, the data in one such collection may be labeled, while similar data in a second collection is not labeled. In another implementation, similar data in the two collections may be labeled differently based on a different labeling convention used in the different data sets. In such a case, the data collections are considered to be heterogeneous.

[0018] As noted, because the data in the two collections may be the same type of data, or similar or related data, considering the two data sets together may yield useful information that could not be obtained by considering only one of the data sets in isolation. However, the lack of correlation between the two heterogeneous data sets prevents a processing system from readily processing the data collectively from the two heterogeneous data sources. This presents a technical issue that is specific to this computerized environment in which large volumes of data are collected and stored electronically.

Consequently, if there were a way to correlate the data from the two heterogeneous data sources, data from both sources could then be used to extract more useful information than would be available from either data set alone. However, manually correlating the data between the two data sets may be extremely time-consuming or simply beyond human capacity, particularly as the size of the data sets increases.

[0019] One specific scenario where the correlation of heterogeneous data sets is useful is debugging computer software. As bugs or software defects are discovered and corrected, defect reports are written that describe the defect and the solution to the defect. When the defect has been corrected, the report may be referred to as a closed defect report. These closed defect reports may be collected into a database.

[0020] On the other hand, test scenarios are written that provide input to a piece of software under test. A test log is then generated that records the input to the software and the resulting behavior and output of the software under the test scenario. As used herein, “software” refers collectively to software, firmware or any other form of programming or computer instructions that may need to be tested and debugged. These test logs are then analyzed to determine whether there is a defect in the software being tested and may be further analyzed as part of resolving any defect discovered.

[0021] A collection of such test logs will include data that is similar to, or of the same type as, data in the closed defect reports. However, there is no correlation between these test logs and closed defect reports. Thus, a database of test logs and a database of closed defect reports are heterogeneous data sources. However, as will be described herein, correlating data from these heterogeneous data sources may be an effective tool in debugging software.

[0022] Accordingly, the present specification presents a technical solution to the computing-environment problem of correlating two or more heterogeneous data sources so that the data from all the sources can then be processed collectively to obtain information and insights that are not available from any one data source alone.

[0023] Some aspects of the technical solution presented make use of machine learning, which is an established field of artificial intelligence. A machine learning algorithm is an algorithm that is trained using a set of training data. The training data frequently describes a large number of documented events, each with an outcome that depends on a number of variables. The machine learning algorithm trained with the set of training data can then be given data of an event for which the outcome is not known or for which a decision needs to be made. The machine learning algorithm will then predict the outcome of that event or make a required decision based on its training with the data of the training set. As additional data is added to the training set, the algorithm may be adjusted automatically for greater accuracy.

[0024] As used herein, the term “closed defect report” or “defect report” refers to a report of a software defect or bug that was identified, for which a root cause was found and a solution was implemented. Defect reports are generated by people.

As used herein, the term “test log” refers to the data generated when a test scenario is input to a piece of software under test. Test logs are machine generated.

[0026] As used herein, the term “clustering model” refers to a machine learning algorithm that groups or clusters together data objects that are similar in important ways. For example, the clustering model may cluster together closed defect reports that have similar characteristics, especially similar or identical root causes of the defects documented.

[0027] As used herein, the term “classification model” refers to a machine learning algorithm that processes a test log and predicts the root cause of any software defects indicated by the test log. In this case, the classification model is trained with a large set of closed defect reports that document the conditions and root causes of previously identified defects. The classification model then, with respect to a new test log, predicts the root cause of any defect identified by the test log. The classification model may be selected from among several candidate classification methods: after each method has been evaluated, for example by cross-validation or hypothesis testing, the best-performing method may be used.

[0028] As used herein, the term Word2Vec refers to a known Natural Language Processing technique that transforms each word in a text sample into a high-dimensional vector so that the semantic and syntactic relationships of the words are maintained.
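By way of illustration only, a minimal sketch of this embedding step is shown below, assuming Python and the open-source gensim library; the toy corpus, tokenization, and model parameters are illustrative assumptions and not part of the specification.

```python
# A minimal Word2Vec sketch using gensim. The corpus here is a toy stand-in
# for tokenized defect reports and test logs; sizes and parameters are
# illustrative assumptions.
from gensim.models import Word2Vec

corpus = [
    ["null", "pointer", "dereference", "in", "driver", "init"],
    ["timeout", "waiting", "for", "driver", "init", "response"],
    ["memory", "leak", "detected", "in", "print", "queue"],
]

# Each word becomes a high-dimensional vector that preserves semantic and
# syntactic relationships with the words around it.
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

vec = model.wv["driver"]                            # 50-dimensional word vector
similar = model.wv.most_similar("driver", topn=3)   # semantically close words
print(vec.shape, similar)
```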

[0029] As used herein, the term "selected cluster" refers to the cluster of defect reports that is associated with the test log when the test log and defect reports are co-clustered. The defect reports in the selected cluster include a set of root causes, which are then associated with the test log.

[0030] In an example, the selected cluster may indicate a set of root causes which has been selected based on the intersection of the root causes between the results of the clustering method and the classification method. For example, the clustering method may group the selected defect reports into several clusters and place the new test log into the best matching cluster. The root causes of those defect reports in the best matching cluster are obtained and form a set. The classification method, given the same test log, outputs the possible root causes. A number of root causes will be selected, where the number, n, is based on the accumulated probability of the n root causes exceeding a certain threshold. These selected n root causes then form another set.
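A minimal sketch of this accumulated-probability selection, assuming Python; the root-cause labels and probabilities are illustrative, and the 0.8 threshold is an assumption consistent with the 80 percent figure discussed later in the specification.

```python
# Select the top-n predicted root causes whose accumulated probability
# exceeds a threshold.
def select_root_causes(probs, threshold=0.8):
    """probs: dict mapping root-cause label -> predicted probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    selected, total = [], 0.0
    for cause, p in ranked:
        selected.append(cause)
        total += p
        if total > threshold:   # stop once accumulated probability exceeds threshold
            break
    return set(selected)

predicted = {"race condition": 0.45, "bad config": 0.30, "driver bug": 0.15, "other": 0.10}
print(select_root_causes(predicted))   # first three causes accumulate past 0.8
```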

[0031] As used herein, the term “processor” may refer collectively to any positive number of processors or cores functioning together.

[0032] The present specification describes, in an example, a method of addressing defects indicated by a test log from a test scenario executed on a piece of software, the method including: using a classification model to predict root causes of the defects indicated by the test log; comparing the predicted root causes from the classification model to root causes identified in a cluster of closed defect reports that document resolution of a previous defect; matching a selected cluster of closed defect reports to the test log based on the comparing; and providing the selected cluster of the closed defect reports with the test log to guide debugging of the piece of software from the defects indicated by the test log.

[0033] In another example, the present specification describes a data processing system for correlating data from heterogeneous sources, the system including: a main processor and associated memory; a classification model trained with data from a first data source; and a clustering model for clustering content from a second data source. The main processor is programmed to: use the classification model to classify content from the second data source; use the clustering model for clustering content from the first data source into a number of clusters; compare the classified content from the second data source to the clusters of content from the first data source; match a selected cluster of content from the first data source with the classified content from the second data source; and correlate the classified content from the second data source with the selected cluster of content from the first data source.

[0034] In another example, the present specification describes a method of correlating data from heterogeneous sources wherein data from a first data source is labeled and data from a second data source is unlabeled, the method including: training a classification model with data from the first data source; using the classification model to classify content from a second data source; clustering content from the first data source into a number of clusters; comparing the classified content from the second data source to the clusters of content from the first data source; matching a selected cluster of content from the first data source with the classified content from the second data source; and correlating the classified content from the second data source with the selected cluster of content from the first data source so that information can be extracted from both data sources together based on the correlation between the data sources.

[0035] In these and other examples, the present specification provides a method and system to, with a pretrained clustering model, provide a debugger with a set of highly correlated closed defect reports from the past, given a new test log under consideration. The system is trained in an unsupervised manner with features extracted from defect reports, preferably, within a specific time frame. Including all historical data without a time frame may consume too much computing power and burden the clustering algorithm as the dataset could be enormous. Also, focusing on recent defect reports prevents duplicates from being provided.

[0036] For every given test log, extracted into the same feature representation as the defect reports, the system outputs a set of clusters of closed defect reports and matches the test log to one of them, specifically to the cluster whose defect reports point to similar or highly correlated types of issues. By examining and reading the given cluster of defect reports, the current test log can be handled and assigned more effectively.

[0037] Furthermore, there is a validation mechanism for the clustering model. Since the model was trained without labels, it relies on another service to ensure its correctness. This service is a mature classification model that is trained with closed defect reports, including their labels, and has been used to predict root causes for new test logs. Given the comparable nature of defect reports and test logs, test logs can be extracted into the same feature representation and serve as input to the classification model for the prediction of root causes for defects observed in new test logs. By doing so, the comparison between the predicted root causes and the root causes of the defect reports in the target cluster can be performed. In other words, this validates that the defect reports within the cluster are accurate and correlated. Should the model fail to output satisfying clusters, adjustments to the clustering model can be made. Specifically, adjustments to the clustering model can include adjusting the distance function, adjusting the number of clusters produced, or even adopting a new clustering algorithm for the clustering model.

[0038] Turning to the figures, Fig. 1 is a flowchart illustrating an example of a method 100 of addressing defects indicated by a test log from a test scenario executed on a piece of software consistent with the disclosed implementations. As shown in Fig. 1 , the method 100 includes using 102 a classification model to predict root causes of the defects indicated by the test log. The classification model is an algorithm that uses the data from the test log and, based on that data, predicts a root cause or causes of the behavior observed that has been characterized as a defect in the software under test.

[0039] The classification model is a machine-learning algorithm that has been trained using input such as the closed defect reports that document resolved defects. In this way, the classification model is able to predict a root cause of a defect documented in a test log. The method may then, in some examples, include training the classification model with closed defect reports. (Fig. 5, 536).

[0040] The method proceeds with comparing 104 the predicted root causes from the classification model to root causes identified in a cluster of closed defect reports that document resolution of a previous defect. For example, closed defect reports may document somewhat different software behavior and corresponding solutions that all resulted from the same or similar root cause. Accordingly, these closed defect reports are related by their root cause and may be clustered together even though they document variations in the software tested or the software behavior documented.

[0041] This comparison 104 is followed by matching 106 a selected cluster of closed defect reports to the test log. Given the correlation of the test log to the cluster of closed defect reports, the closed defect reports will provide guidance for addressing the defect or defects evident in the test log. Accordingly, the method concludes with providing 108 the selected cluster of the closed defect reports with the test log to guide debugging of the piece of software from the defects indicated by the test log.

[0042] Fig. 2 is a further flowchart illustrating more detail of the method of Fig. 1, consistent with the disclosed implementations. Specifically, Fig. 2 further details the method of comparing 104 the predicted root causes from the classification model to root causes identified in a cluster of closed defect reports. As shown in Fig. 2, this process 200 includes extracting features 210 of the test log to multi-dimensional vectors that represent semantic and syntactic relationships between words. Similarly, the method includes extracting features 212 of the closed defect reports in a cluster to multi-dimensional vectors that represent semantic and syntactic relationships between words.

[0043] In some examples, extracting the features of the test log and the closed defect reports to the vectors may be performed using Natural Language Processing. More specifically, in some examples, the Word2Vec technique is used. (Fig. 5, 534). The Word2Vec technique is a known technique that converts a word or other textual input into a vector using the principles of Natural Language Processing.
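A minimal sketch of this extraction, together with the vector comparison described in the next paragraph, assuming Python with gensim and numpy; the corpus, tokens, and the choice of averaging word vectors and using cosine similarity are illustrative assumptions, as the specification does not prescribe a particular comparison.

```python
# Embed a test log and a defect report as the mean of their Word2Vec word
# vectors, then compare them with cosine similarity. Data is illustrative.
import numpy as np
from gensim.models import Word2Vec

corpus = [
    ["timeout", "waiting", "for", "driver", "init"],
    ["driver", "init", "timeout", "response"],
    ["memory", "leak", "in", "print", "queue"],
]
wv = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50).wv

def doc_vector(tokens):
    """Average the vectors of the tokens known to the model."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

log_vec = doc_vector(["timeout", "in", "driver", "init"])
report_vec = doc_vector(["driver", "init", "timeout", "response"])
print(cosine(log_vec, report_vec))   # closer to 1.0 means more similar
```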

[0044] With these vectors generated, the method 200 proceeds by comparing the vectors 214 from the test log to the vectors from the defect report cluster. This comparison allows the method to match a selected cluster of closed defect reports to the test log.

[0045] Fig. 3 is a further flowchart illustrating more detail of the method of Fig. 1, consistent with the disclosed implementations. As shown in Fig. 3, in some examples, the method 300 may include collecting 316 only closed defect reports from a set time range. For example, if only closed defect reports from within a recent period of time are used, the defect reports may be more likely to include information relevant to solving a current software defect. Additionally, selecting a set period of time within which to consider defect reports may exclude very similar, duplicate or even identical reports that were generated at different times.

[0046] Once the desired collection of defect reports is assembled, whether based on time range or some other basis, the method 300 may include dividing 318 the collected defect reports into a number of clusters with a trained clustering model.

[0047] As noted above with respect to Fig. 1, the method includes matching a selected cluster of closed defect reports to the test log based on the comparing. In some cases, however, there may not be any predicted root causes of the test log matching identified root causes from a cluster of closed defect reports. When there are no predicted root causes of the test log matching identified root causes from the cluster of closed defect reports, the method 300 may further include adjusting 320 the clustering model.

[0048] Fig. 4 is a further flowchart illustrating more detail of the method of Fig. 1, consistent with the disclosed implementations. As shown in Fig. 4, the act of matching a selected cluster of closed defect reports to a given test log may be performed with the following actions.

[0049] The method 400 may begin by identifying 422 a number (K) of predicted root causes from a test log that match an identified root cause from the selected cluster of closed defect reports. Next, the method performs calculating 424 a ratio of K over all the predicted root causes and comparing 426 the ratio to a threshold. When the ratio exceeds the threshold, the selected cluster of closed defect reports is accepted 428 to guide the debugging. When the ratio does not exceed the threshold, the method includes adjusting 430 the clustering model used to form clusters of the closed defect reports.

[0050] Fig. 5 is a chart illustrating more possible detail of the method of Fig. 1, consistent with the disclosed implementations. As shown in Fig. 5, additional aspects of the method 500 may include identifying 532 an owner of the test log. The owner is identified based on the designated owners of the closed defect logs in the selected cluster. Thus, a test log may be assigned to an owner who previously closed a related defect report. This owner may have familiarity with the previous defect, which may facilitate closure of the new test log.

[0051] Fig. 6 is a diagram of an illustrative data processing system for correlating data from heterogeneous sources, consistent with the disclosed implementations. As noted above, one example of such heterogeneous data sources may be test logs and closed defect reports.

[0052] As shown in Fig. 6, the system 600 includes a main processor 640 and associated memory 645. Within the memory 645, the processor 640 has access to a classification model 642 trained with data from a first data source and a clustering model 644 for clustering content from a second data source. The first and second data sources are heterogenous.

[0053] The main processor 640 is programmed to: use 650 the classification model 642 to classify content from the second data source; use 652 the clustering model 644 for clustering content from the first data source into a number of clusters; compare 654 the classified content from the second data source to the clusters of content from the first data source; match 656 a selected cluster of content from the first data source with the classified content from the second data source; and correlate 658 the classified content from the second data source with the selected cluster of content from the first data source.

[0054] Fig. 7 is a diagram of an illustrative data processing system 700 in which the data processing system of Fig. 6 has been specifically applied for correlating data from a test log with data from a database of closed defect reports, consistent with the disclosed implementations. Thus, as shown in Fig. 7, the classification model 642 is trained to predict a root cause of a defect in a test log using closed defect reports. The clustering model is an algorithm that clusters together similar closed defect reports.

[0055] The processor 640 is programmed, in this example, for using 760 the classification model 642 to predict root causes of the defects indicated by the test log; using 762 the clustering model 644 for clustering the closed defect reports into a number of clusters; comparing 764 the predicted root causes from the classification model to root causes identified in a cluster of the closed defect reports; matching 766 a selected cluster of closed defect reports to the test log based on the comparing; and providing 768 the selected cluster of the closed defect reports with the test log to guide debugging of the piece of software from the defects indicated by the test log.

[0056] Fig. 8 is another diagram of the illustrative system of Fig. 7 for correlating data from a test log with data from a database of closed defect reports, consistent with the disclosed implementations. As shown in Fig. 8, the system 800 further includes or has access to a database 846 of closed defect reports that each document resolution of a software defect.

[0057] In this example, the processor 640 is further programmed to match a selected cluster of closed defect reports to the test log by: identifying 870 a number (K) of predicted root causes from the test log that match an identified root cause from the selected cluster of closed defect reports; calculating 872 a ratio of K over all the predicted root causes; and comparing 874 the ratio to a threshold. When the ratio exceeds the threshold, the selected cluster of closed defect reports is accepted 876 and used to guide the debugging.
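A minimal sketch of this matching test, assuming Python; the root-cause sets are illustrative, and the 0.5 threshold is an assumption, as the specification does not fix a threshold value.

```python
# Count the predicted root causes (K) that also appear in the selected
# cluster, compare K over all predictions to a threshold, and accept or
# reject the cluster.
def accept_cluster(predicted_causes, cluster_causes, threshold=0.5):
    matches = predicted_causes & cluster_causes      # the K matching root causes
    ratio = len(matches) / len(predicted_causes)     # K over all predicted causes
    return ratio > threshold, matches

accepted, common = accept_cluster(
    {"race condition", "bad config", "driver bug"},
    {"race condition", "driver bug", "stale cache"},
)
print(accepted, common)   # True: ratio 2/3 exceeds the 0.5 threshold
```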

[0058] Fig. 9 is another diagram of the illustrative system of Fig. 7 for correlating data from a test log with data from a database of closed defect reports, consistent with the disclosed implementations. As shown in Fig. 9, the system 1000 includes a Natural Language Processor 1040. The Natural Language Processor 1040 is programmed for extracting 1090 features of the test log to multi-dimensional vectors that represent semantic and syntactic relationships between words; extracting 1092 features of the closed defect reports in a cluster to multi-dimensional vectors that represent semantic and syntactic relationships between words; and comparing 1094 the vectors from the test log to the vectors from the defect report cluster.

[0059] Fig. 10 is a flowchart of a method of correlating data from heterogeneous sources wherein data from a first data source is labeled and data from a second data source is unlabeled, consistent with the disclosed implementations. As shown in Fig. 10, the method 900 includes: training 978 a classification model with data from the first data source; using 980 the classification model to classify content from a second data source; clustering 982 content from the first data source into a number of clusters; comparing 984 the classified content from the second data source to the clusters of content from the first data source; matching 986 a selected cluster of content from the first data source with the classified content from the second data source; and correlating 988 the classified content from the second data source with the selected cluster of content from the first data source so that information can be extracted from both data sources together based on the correlation between the data sources.

[0060] Fig. 11 is another flowchart of a method for correlating data from a test log with data from a database of closed defect reports, consistent with the disclosed implementations. As shown in Fig. 11, the method includes receiving a test log 1108 for which assistance in debugging is desired. In general, the method will correlate closed defect reports to the test log and identify the closed defect reports that are most likely to be helpful in processing the test log. The method performs this by clustering the test log and the closed defect reports together. The test log is then associated with the closed defect reports in the same cluster. The identified root cause(s) of the closed defect reports in the same cluster may then be attributed to the test log as possible explanations for the test log. These root causes are checked to see if the clustering mechanism is performing the desired clustering.

[0061] The method includes setting 1102 a time range over which software defect reports are collected. This may ensure that the defect reports will be recent enough to be relevant to a test log that has recently been generated. This may also limit the volume of data to be processed. Next, the method collects 1104 the defect reports corresponding to the set time range and extracts 1106 features from the defect reports. These features are statements that describe or define the root cause of the defect reported and corrected in the closed defect reports.

[0062] Elements 1102-1110 may be referred to, collectively, as a data collection phase. In the data collection phase, defect reports within a set time frame from the past are collected and embedded into vectors, while the given test log is also embedded into vectors. The applied algorithm for producing the vectors, e.g., Word2Vec, maintains semantic and syntactic relationships between words.

[0063] The vectors are then prepared for processing by the clustering model which will cluster the defect reports and match the test log to a cluster of defect reports that is most like the test log and most likely to be helpful in processing the test log. In some examples, the clustering uses the distance between the vectors to perform the clustering. The clustering is performed on both the defect report vectors and the test log vector.

[0064] The next phase may be referred to as the generation of clusters phase. With data collected and processed from the previous phase, the clustering model can be trained or used to generate clusters. Referring to Fig. 11, the defect reports are clustered 1112 so that similar reports are clustered together. Similar reports are those that address a similar defect or a similar root cause. The clustering model generates clusters of defect reports and finds the best match among the clusters, i.e., the cluster to which the test log is most correlated. Since the defect reports are closed, their root causes are known. Therefore, the root causes of the defect reports in this cluster are specifically retrieved for validation. The outcome from the clustering is a set of root causes which are found in the cluster to which the test log is assigned. This set of root causes may contain a single root cause or multiple root causes. If the set of root causes is empty, the clustering may be adjusted.
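By way of illustration only, a minimal sketch of this phase is shown below, assuming Python with scikit-learn's KMeans standing in for the unspecified clustering model, and synthetic vectors standing in for the embedded defect reports and test log.

```python
# Cluster embedded defect reports and assign the test log to the nearest
# cluster. The random vectors and the choice of KMeans are illustrative
# assumptions; the specification does not name a clustering algorithm.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
report_vectors = rng.normal(size=(200, 50))   # 200 embedded closed defect reports
test_log_vector = rng.normal(size=(1, 50))    # one embedded test log

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(report_vectors)

# The test log is matched to the best (nearest) cluster; the closed reports
# in that cluster, with their known root causes, are retrieved for validation.
best_cluster = int(kmeans.predict(test_log_vector)[0])
members = np.where(kmeans.labels_ == best_cluster)[0]
print(best_cluster, len(members))
```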

[0065] The next phase may be referred to as prediction of the root cause for the test log. Here, an existing classification model predicts root causes of test logs. Since test logs are extracted into the same feature representation as the defect reports, the new test log can serve as an input to this model, and predicted root causes are output. Once the predicted root causes are given, the top root cause(s) whose accumulated probability exceeds 80 percent are kept for validation.

[0066] Referring to Fig. 11, the test log is submitted 1118 to a classification model, as described herein. The classification model predicts a number N of the most likely root causes for a defect evident in the test log. The accumulated probability of the N root causes including the actual root cause is calculated and compared to a threshold. If the probability exceeds the threshold, the set of N most likely root causes for the defect of the test log is finalized.
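A minimal sketch of this prediction step, assuming Python with a scikit-learn logistic regression standing in for the unspecified classification model; the synthetic training data, labels, and the 0.8 threshold (matching the 80 percent figure above) are illustrative assumptions.

```python
# Train a stand-in classifier on embedded closed defect reports and their
# root-cause labels, then keep the top predicted causes for a new test log
# whose accumulated probability exceeds the threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 50))                     # embedded closed reports
y_train = rng.choice(["race", "config", "driver"], 300)  # their known root causes

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

probs = clf.predict_proba(rng.normal(size=(1, 50)))[0]   # per-class probabilities
ranked = sorted(zip(clf.classes_, probs), key=lambda kv: kv[1], reverse=True)
selected, total = [], 0.0
for cause, p in ranked:
    selected.append(cause)
    total += p
    if total > 0.8:   # accumulated probability exceeds the 80 percent threshold
        break
print(selected)       # the N most likely root causes for the test log
```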

[0067] Features of the test log related to these predicted root causes of the evident defect are extracted 1110, similar to the extraction of the features from the closed defect reports. As noted above, this extraction may result in vectors that can be compared.

[0068] The last phase is the validation phase. This includes finding the intersection between the root causes of the defect reports in the cluster and the top predicted root causes from the test log. Should an intersection of root causes be found that also constitutes the majority of all the root causes in the cluster, the root causes can be submitted for reference. However, if either condition fails to be met, the clustering model is adjusted, for example by adjusting the distance function or the number of clusters, or even by adopting a new clustering algorithm. This adjustment to the clustering model may increase or decrease the number of root causes associated with the cluster mapped by the test log. In some examples, the cluster mapped by the test log will be split into multiple clusters by the refinement of the clustering model, reducing the number of root causes associated with the test log and providing greater specificity for the root causes to be output. In other examples, the cluster does not contain sufficient root causes and the clustering model is adjusted to increase the size of the clusters. This produces a cluster mapped by the test log that is associated with more root causes, providing additional possibilities to explain the test log.

[0069] Referring to Fig. 11, and as noted above, the comparison using the vectors, or some other method, is performed to correlate the data and generate 1114 a number of clusters of closed defect reports that are related to a likely root cause of the defect in the test log. This cluster of closed defect reports will include 1116 a set of M documented root causes.

[0070] The method then determines if the intersection of M and N is null 1120. If the intersection is null, the algorithm used to cluster the defect reports is adjusted 1122 as discussed above. Otherwise, the intersection will include 1124 a number K of root causes common to the cluster of defect reports and the prediction based on the test log.

[0071] Next, a ratio of K to M is taken and compared 1126 to a threshold. If the ratio does not exceed the threshold, the method may adjust the clustering algorithm used. If the ratio does meet the threshold, the cluster of defect reports is accepted and may be used or served to guide debugging of the defect documented in the test log. In some examples, the task of identifying the root cause of the test log is assigned to an operator who closed out the closed defect report(s) containing one or more of the identified root causes in the set K. This operator may have greater familiarity with the particulars of the root cause associated with the test log due to their familiarity with the previous resolution of the closed defect reports.
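A minimal sketch of this validation test, assuming Python; the root-cause sets are illustrative, and the 0.5 threshold is an assumption, as the specification does not fix a value. Note that here, per Fig. 11, the ratio is K over the M cluster root causes rather than over all predicted causes.

```python
# Validate the cluster: K is the intersection of the cluster's M documented
# root causes with the classifier's N predicted root causes, and the ratio
# of K to M is compared to a threshold.
def validate(cluster_causes, predicted_causes, threshold=0.5):
    K = cluster_causes & predicted_causes
    if not K:                                # null intersection: adjust clustering
        return "adjust clustering model"
    ratio = len(K) / len(cluster_causes)     # K over the M cluster root causes
    return "accept cluster" if ratio > threshold else "adjust clustering model"

M = {"race condition", "stale cache", "driver bug"}   # documented in the cluster
N = {"race condition", "driver bug", "bad config"}    # predicted from the test log
print(validate(M, N))   # ratio 2/3 exceeds 0.5, so the cluster is accepted
```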

[0072] Fig. 12 is a diagram of a computing device 1200 for correlating data from heterogeneous sources, according to an example of the principles described herein. The computing device 1200 may be implemented in an electronic device. Examples of electronic devices include servers, desktop computers, laptop computers, personal digital assistants (PDAs), mobile devices, smartphones, gaming systems, and tablets, among other electronic devices.

[0073] The computing device 1200 may be utilized in any data processing scenario, including stand-alone hardware, mobile applications, through a computing network, or combinations thereof. Further, the computing device 1200 may be used in a computing network. In an example, the methods provided by the computing device 1200 are provided as a service over a network by, for example, a third party.

[0074] To achieve its desired functionality, the computing device 1200 includes various hardware components. Among these hardware components may be a number of processors 1230, a number of data storage devices 1240, a number of peripheral device adapters 1232, and a number of network adapters 1234. These hardware components may be interconnected through the use of a number of busses and/or network connections. In an example, the processor 1230, data storage device 1240, peripheral device adapters 1232, and a network adapter 1234 may be communicatively coupled via a bus 1238.

[0075] The processor 1230 may include the hardware architecture to retrieve executable code from the data storage device 1240 and execute the executable code. The executable code may, when executed by the processor 1230, cause the processor 1230 to extract features from test logs and defect reports and then cluster the test logs and defect reports based on the extracted features. The functionality of the computing device 1200 is in accordance with the methods of the present specification described herein. In the course of executing code, the processor 1230 may receive input from and provide output to a number of the remaining hardware units.

[0076] The data storage device 1240 may store data such as executable program code that is executed by the processor 1230 and/or other processing device. The data storage device 1240 may specifically store computer code representing a number of applications that the processor 1230 executes to implement at least the functionality described herein.

[0077] The data storage device 1240 may include various types of memory modules, including volatile and nonvolatile memory. For example, the data storage device 1240 of the present example includes Random Access Memory (RAM) 1242, Read Only Memory (ROM) 1244, and Hard Disk Drive (HDD) memory 1246. Other types of memory may also be utilized, and the present specification contemplates the use of many varying types of memory in the data storage device 1240 as may suit a particular application of the principles described herein. In certain examples, different types of memory in the data storage device 1240 may be used for different data storage needs. For example, in certain examples the processor 1230 may boot from Read Only Memory (ROM) 1244, maintain nonvolatile storage in the Hard Disk Drive (HDD) memory 1246, and execute program code stored in Random Access Memory (RAM) 1242.

[0078] The data storage device 1240 may include a computer readable medium, a computer readable storage medium, or a non-transitory computer readable medium, among others. For example, the data storage device 1240 may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the computer readable storage medium may include, for example, the following: an electrical connection having a number of wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store computer usable program code for use by or in connection with an instruction execution system, apparatus, or device. In another example, a computer readable storage medium may be any non-transitory medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

[0079] The data storage device 1240 may include a database 1248. The database 1248 may include test logs. The database 1248 may include defect reports.

[0080] Hardware adapters, including the peripheral device adapters 1232 in the computing device 1200, enable the processor 1230 to interface with various other hardware elements, external and internal to the computing device 1200. For example, the peripheral device adapters 1232 may provide an interface to input/output devices, such as, for example, the display device 1250. The peripheral device adapters 1232 may also provide access to other external devices such as an external storage device, a number of network devices such as, for example, servers, switches, and routers, client devices, other types of computing devices, and combinations thereof.

[0081] The display device 1250 may be provided to allow a user of the computing device 1200 to interact with and implement the functionality of the computing device 1200. The peripheral device adapters 1232 may also create an interface between the processor 1230 and the display device 1250, a printer, and/or other media output devices. The network adapter 1234 may provide an interface to other computing devices within, for example, a network, thereby enabling the transmission of data between the computing device 1200 and other devices located within the network.

[0082] The computing device 1200 may, when the executable program code is executed by the processor 1230, display a number of graphical user interfaces (GUIs) on the display device 1250 associated with the executable program code representing the number of applications stored on the data storage device 1240. The GUIs may display, for example, interactive screenshots that allow a user to interact with the computing device 1200. Examples of display devices 1250 include a computer screen, a laptop screen, a mobile device screen, a personal digital assistant (PDA) screen, and a tablet screen, among other display devices 1250.

[0083] In an example, the database 1248 stores the corpus of documents being used to generate the training set. The database 1248 may include the labeled documents making up the training set.

[0084] The computing device 1200 further includes a number of modules 1252, 1254 used in the implementation of the systems and methods described herein. The various modules 1252, 1254 within the computing device 1200 include executable program code that may be executed separately. In this example, the various modules 1252, 1254 may be stored as separate computer program products. In another example, the various modules 1252, 1254 within the computing device 1200 may be combined within a number of computer program products; each computer program product including a number of the modules 1252, 1254. Examples of such modules include an extraction module 1252 and a clustering module 1254.

[0085] In Fig. 12, the dashed boxes indicate instructions 1252, 1254 and a database 1248 stored in the data storage device 1240. The solid boxes in the data storage device 1240 indicate examples of different types of devices which may be used to perform the data storage device 1240 functions. For example, the data storage device 1240 may include any combination of RAM 1242, ROM 1244, HDD 1246, and/or other appropriate data storage medium, with the exception of a transient signal as discussed above.

[0086] The preceding description has been presented only to illustrate and describe examples of the principles described. This description is not intended to be exhaustive or to limit these principles to any precise form disclosed. Many modifications and variations are possible in light of the above teaching.