Title:
ANOMALY DETECTION USING LOGS
Document Type and Number:
WIPO Patent Application WO/2024/074883
Kind Code:
A1
Abstract:
Method comprising: converting each log of a sequence of N logs into an identifier among K different identifiers to obtain a sequence of N identifiers; for each n between 0 and N: for each of the K identifiers: counting occurrences of the identifier among the first n identifiers of the sequence to obtain a front frequency of the identifier for the respective n; and for each of the K identifiers: counting occurrences of the identifier among the last N-n identifiers of the sequence to obtain a rear frequency of the identifier for the respective n; arranging the front frequencies and the rear frequencies of the identifiers in a count vector; inputting the count vector into an autoencoder to obtain an output vector for the respective n; determining a difference between the output vector and the count vector; marking the sequence as anomalous if the difference between the output vector and the count vector is larger than a threshold; wherein each of the identifiers is an integer.

Inventors:
SZILÁGYI PÉTER (HU)
HORVATH GABOR (HU)
KADAR ATTILA (HU)
Application Number:
PCT/IB2022/059636
Publication Date:
April 11, 2024
Filing Date:
October 07, 2022
Assignee:
NOKIA SOLUTIONS & NETWORKS OY (FI)
International Classes:
G06F17/40; G06N3/0455; G06N3/088
Foreign References:
EP3979080A1 2022-04-06
Other References:
ZHANG LINMING ET AL: "LogAttn: Unsupervised Log Anomaly Detection with an AutoEncoder Based Attention Mechanism", 7 August 2021, 16TH EUROPEAN CONFERENCE - COMPUTER VISION - ECCV 2020, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, PAGE(S) 222 - 235, XP047604389
CATILLO MARTA ET AL: "AutoLog: Anomaly detection by deep autoencoding of system logs", EXPERT SYSTEMS WITH APPLICATIONS, ELSEVIER, AMSTERDAM, NL, vol. 191, 10 December 2021 (2021-12-10), XP086924337, ISSN: 0957-4174, [retrieved on 20211210], DOI: 10.1016/J.ESWA.2021.116263
Claims:

1. Apparatus comprising: one or more processors, and memory storing instructions that, when executed by the one or more processors, cause the apparatus to perform: converting each log line of a sequence of N log lines among a plurality of log lines into a respective log identifier among K different log identifiers to obtain a sequence of N log identifiers; for each value of n between 0 and N inclusive: for each of the K different log identifiers: counting occurrences of the respective log identifier among the first n log identifiers of the sequence of N log identifiers to obtain a front frequency of the respective log identifier for the respective value of n; and for each of the K different log identifiers: counting occurrences of the respective log identifier among the N-n log identifiers of the sequence of N log identifiers following the first n log identifiers of the sequence of N log identifiers to obtain a rear frequency of the respective log identifier for the respective value of n; arranging the front frequencies and the rear frequencies of the K different log identifiers in a count vector for the respective value of n according to a predefined order; inputting the count vector for the respective value of n into an autoencoder to obtain, from the autoencoder, an output vector for the respective value of n; determining a difference between the output vector for the respective value of n and the count vector for the respective value of n; and at least one of the following: checking whether the difference between the output vector for the respective value of n and the count vector for the respective value of n is larger than a threshold and marking the sequence of N log lines as anomalous if the difference between the output vector for the respective value of n and the count vector for the respective value of n is larger than the threshold; or inputting the difference between the output vector for the respective value of n and the count vector for the respective value of n into the autoencoder as a respective reconstruction error; wherein each of the K different log identifiers is a respective integer.

2. The apparatus according to claim 1, wherein the instructions, when executed by the one or more processors, further cause the apparatus to perform: marking the sequence of N log lines as not anomalous if, for each of the values of n, the difference between the output vector for the respective value of n and the count vector for the respective value of n is not larger than the threshold.

3. The apparatus according to any of claims 1 and 2, wherein the instructions, when executed by the one or more processors, cause the apparatus to perform: the converting each of the log lines of the sequence of N log lines into the respective log identifier by an integer embedding algorithm.

4. The apparatus according to any of claims 1 to 3, wherein the instructions, when executed by the one or more processors, cause the apparatus to perform, for each of the values of n between 0 and N inclusive: the arranging the front frequencies and the rear frequencies of the K different log identifiers in a count vector for the respective value of n according to the predefined order by arranging the front frequencies in a first portion of the count vector for the respective value of n according to a predefined first order, arranging the rear frequencies in a second portion of the count vector for the respective value of n according to a predefined second order, and combining the first portion of the count vector for the respective value of n and the second portion of the count vector to the count vector for the respective value of n according to a predefined rule.

5. The apparatus according to claim 4, wherein the instructions, when executed by the one or more processors, cause the apparatus to perform, for each of the values of n between 0 and N inclusive at least one of the following: the arranging the front frequencies of the log identifiers in the first portion of the count vector for the respective value of n such that, according to the predefined first order, the log identifiers of the front frequencies increase monotonously from a beginning of the first portion of the counter vector towards an end of the first portion of the counter vector or such that the log identifiers decrease monotonously from the beginning of the first portion of the counter vector towards the end of the first portion of the counter vector; or the arranging the rear frequencies of the log identifiers in the second portion of the count vector for the respective value of n such that, according to the predefined second order, the log identifiers of the rear frequencies increase monotonously from a beginning of the second portion of the counter vector towards an end of the second portion of the counter vector or such that the log identifiers decrease monotonously from the beginning of the second portion of the counter vector towards the end of the second portion of the counter vector; or the combining the first portion of the count vector for the respective value of n with the second portion of the count vector for the respective value of n by concatenating the first portion of the count vector for the respective value of n and the second portion of the count vector for the respective value of n according to the predefined rule to obtain the count vector for the respective value of n; or the combining the first portion of the count vector for the respective value of n with the second portion of the count vector for the respective value of n by interleaving the first portion of the count vector for the respective value of n and the second portion of the count vector for the respective value of n according to the predefined rule to obtain the count vector for the respective value of n.

6. The apparatus according to any of claims 1 to 5, wherein the instructions, when executed by the one or more processors, further cause the apparatus to perform: monitoring whether a new log line for a point in time later than the points in time for the log lines in the sequence of N log lines is available; converting the new log line into a respective log identifier among the K different log identifiers; removing the log identifier of the log line with an earliest point in time among the sequence of N log lines from the sequence of N log identifiers and adding the log identifier of the new log line to the sequence of N log identifiers to obtain the sequence of N log identifiers.

7. The apparatus according to any of claims 1 to 6, wherein the instructions, when executed by the one or more processors, further cause the apparatus to perform: sorting the log identifiers of the sequence of N log identifiers such that the points in time of the log lines on which the log identifiers are based are subsequent.

8. The apparatus according to any of claims 1 to 7, wherein the instructions, when executed by the one or more processors, further cause the apparatus to perform: determining a number of different types of log lines from the plurality of log lines or from a source code of a system providing the plurality of log lines; setting K equal to or larger than the number of different types of log lines.

9. The apparatus according to any of claims 1 to 8, wherein the instructions, when executed by the one or more processors, further cause the apparatus to perform: identifying, in a system providing the plurality of log lines, a root cause why the sequence of N log lines is marked as anomalous if the sequence of N log lines is marked as anomalous; mitigating the root cause in the system.

10. Method comprising: converting each log line of a sequence of N log lines among a plurality of log lines into a respective log identifier among K different log identifiers to obtain a sequence of N log identifiers; for each value of n between 0 and N inclusive: for each of the K different log identifiers: counting occurrences of the respective log identifier among the first n log identifiers of the sequence of N log identifiers to obtain a front frequency of the respective log identifier for the respective value of n; and for each of the K different log identifiers: counting occurrences of the respective log identifier among the N-n log identifiers of the sequence of N log identifiers following the first n log identifiers of the sequence of N log identifiers to obtain a rear frequency of the respective log identifier for the respective value of n; arranging the front frequencies and the rear frequencies of the K different log identifiers in a count vector for the respective value of n according to a predefined order; inputting the count vector for the respective value of n into an autoencoder to obtain, from the autoencoder, an output vector for the respective value of n; determining a difference between the output vector for the respective value of n and the count vector for the respective value of n; and at least one of the following: checking whether the difference between the output vector for the respective value of n and the count vector for the respective value of n is larger than a threshold and marking the sequence of N log lines as anomalous if the difference between the output vector for the respective value of n and the count vector for the respective value of n is larger than the threshold; or inputting the difference between the output vector for the respective value of n and the count vector for the respective value of n into the autoencoder as a respective reconstruction error; wherein each of the K different log identifiers is a respective integer.

11. The method according to claim 10, further comprising: marking the sequence of N log lines as not anomalous if, for each of the values of n, the difference between the output vector for the respective value of n and the count vector for the respective value of n is not larger than the threshold.

12. The method according to any of claims 10 and 11, wherein: the converting is performed by converting each of the log lines of the sequence of N log lines into the respective log identifier by an integer embedding algorithm.

13. The method according to any of claims 10 to 12, wherein, for each of the values of n between 0 and N inclusive: the arranging is performed by arranging the front frequencies and the rear frequencies of the K different log identifiers in a count vector for the respective value of n according to the predefined order by arranging the front frequencies in a first portion of the count vector for the respective value of n according to a predefined first order, arranging the rear frequencies in a second portion of the count vector for the respective value of n according to a predefined second order, and combining the first portion of the count vector for the respective value of n and the second portion of the count vector to the count vector for the respective value of n according to a predefined rule.

14. The method according to claim 13, wherein, for each of the values of n between 0 and N inclusive at least one of the following: the arranging is performed by arranging the front frequencies of the log identifiers in the first portion of the count vector for the respective value of n such that, according to the predefined first order, the log identifiers of the front frequencies increase monotonously from a beginning of the first portion of the counter vector towards an end of the first portion of the counter vector or such that the log identifiers decrease monotonously from the beginning of the first portion of the counter vector towards the end of the first portion of the counter vector; or the arranging is performed by arranging the rear frequencies of the log identifiers in the second portion of the count vector for the respective value of n such that, according to the predefined second order, the log identifiers of the rear frequencies increase monotonously from a beginning of the second portion of the counter vector towards an end of the second portion of the counter vector or such that the log identifiers decrease monotonously from the beginning of the second portion of the counter vector towards the end of the second portion of the counter vector; or the combining the first portion of the count vector for the respective value of n with the second portion of the count vector for the respective value of n by concatenating the first portion of the count vector for the respective value of n and the second portion of the count vector for the respective value of n according to the predefined rule to obtain the count vector for the respective value of n; or the combining is performed by combining the first portion of the count vector for the respective value of n with the second portion of the count vector for the respective value of n by interleaving the first portion of the count vector for the respective value of n and the second portion of the count vector for the respective value of n according to the predefined rule to obtain the count vector for the respective value of n.

15. The method according to any of claims 10 to 14, further comprising: monitoring whether a new log line for a point in time later than the points in time for the log lines in the sequence of N log lines is available; converting the new log line into a respective log identifier among the K different log identifiers; removing the log identifier of the log line with an earliest point in time among the sequence of N log lines from the sequence of N log identifiers and adding the log identifier of the new log line to the sequence of N log identifiers to obtain the sequence of N log identifiers.

16. The method according to any of claims 10 to 15, further comprising: sorting the log identifiers of the sequence of N log identifiers such that the points in time of the log lines on which the log identifiers are based are subsequent.

17. The method according to any of claims 10 to 16, further comprising: determining a number of different types of log lines from the plurality of log lines or from a source code of a system providing the plurality of log lines; setting K equal to or larger than the number of different types of log lines.

18. The method according to any of claims 10 to 17, further comprising: identifying, in a system providing the plurality of log lines, a root cause why the sequence of N log lines is marked as anomalous if the sequence of N log lines is marked as anomalous; mitigating the root cause in the system.

19. A computer program product comprising a set of instructions which, when executed on an apparatus, is configured to cause the apparatus to carry out the method according to any of claims 10 to 18.

20. The computer program product according to claim 19, embodied as a computer-readable medium or directly loadable into a computer.

Description:
Anomaly detection using logs

Field of the invention

The present disclosure relates to anomaly detection.

Abbreviations

3GPP 3rd Generation Partnership Project

5G/6G/7G 5th/6th/7th Generation

AE Autoencoder

FM Fault Management

HDFS Hadoop Distributed File System

ID Identifier

ML Machine Learning

PM Performance Management

RCA Root Cause Analytics

Background

Conventionally, performance/failure monitoring and service assurance in telecom networks is based on the collection and analysis of numerical PM/FM counters.

An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). The encoding is validated and refined by attempting to regenerate the input from the encoding. The autoencoder learns a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore insignificant data (“noise”).
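The regenerate-and-compare idea can be illustrated with a deliberately tiny example. This is a minimal sketch, not the network of the disclosure: the weights `U` are set by hand instead of being learned, the input has two dimensions, and the encoding has one. The names `encode`, `decode`, and `reconstruction_error` are hypothetical.

```python
from typing import List

# Hand-set weights for illustration only; a real autoencoder learns them from data.
# U is a unit-length direction along which "normal" inputs are assumed to lie.
U = (0.6, 0.8)


def encode(x: List[float]) -> float:
    # 2-D input -> 1-D encoding (fewer dimensions than the input)
    return U[0] * x[0] + U[1] * x[1]


def decode(code: float) -> List[float]:
    # 1-D encoding -> 2-D reconstruction
    return [U[0] * code, U[1] * code]


def reconstruction_error(x: List[float]) -> float:
    # Squared distance between the input and its regenerated version
    y = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, y))


print(reconstruction_error([3.0, 4.0]))    # close to 0: the input lies along U
print(reconstruction_error([4.0, -3.0]))   # close to 25: the input is orthogonal to U
```

Inputs that match the structure captured by the encoding are regenerated almost exactly, while inputs that do not match it come back distorted, which is the property anomaly detection relies on.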

An autoencoder has two main parts: an encoder that maps the input to a code, and a decoder that reconstructs the input from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function. The simplest way to perform the copying task perfectly would be to duplicate the input. To suppress this behavior, the code space usually has fewer dimensions than the input space.

Summary

It is an object of the present invention to improve the prior art.

According to a first aspect of the invention, there is provided an apparatus comprising: one or more processors, and memory storing instructions that, when executed by the one or more processors, cause the apparatus to perform: converting each log line of a sequence of N log lines among a plurality of log lines into a respective log identifier among K different log identifiers to obtain a sequence of N log identifiers; for each value of n between 0 and N inclusive: for each of the K different log identifiers: counting occurrences of the respective log identifier among the first n log identifiers of the sequence of N log identifiers to obtain a front frequency of the respective log identifier for the respective value of n; and for each of the K different log identifiers: counting occurrences of the respective log identifier among the N-n log identifiers of the sequence of N log identifiers following the first n log identifiers of the sequence of N log identifiers to obtain a rear frequency of the respective log identifier for the respective value of n; arranging the front frequencies and the rear frequencies of the K different log identifiers in a count vector for the respective value of n according to a predefined order; inputting the count vector for the respective value of n into an autoencoder to obtain, from the autoencoder, an output vector for the respective value of n; determining a difference between the output vector for the respective value of n and the count vector for the respective value of n; and at least one of the following: checking whether the difference between the output vector for the respective value of n and the count vector for the respective value of n is larger than a threshold and marking the sequence of N log lines as anomalous if the difference between the output vector for the respective value of n and the count vector for the respective value of n is larger than the threshold; or inputting the difference between the output vector for the respective value of n and the count vector for the respective value of n into the autoencoder as a respective reconstruction error; wherein each of the K different log identifiers is a respective integer.

The instructions, when executed by the one or more processors, may further cause the apparatus to perform: marking the sequence of N log lines as not anomalous if, for each of the values of n, the difference between the output vector for the respective value of n and the count vector for the respective value of n is not larger than the threshold.
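The front and rear counting over every split point n of the first aspect can be sketched as follows. This is an illustrative sketch under stated assumptions: the identifiers are taken to be the integers 0 to K-1, and each count vector is assumed to hold the K front frequencies followed by the K rear frequencies in ascending identifier order. The function name `count_vectors` is hypothetical.

```python
from typing import List


def count_vectors(ids: List[int], K: int) -> List[List[int]]:
    """Return one count vector per split point n = 0..N (inclusive).

    Assumptions (not from the source): identifiers are 0..K-1, and each
    count vector is the K front frequencies followed by the K rear
    frequencies, both in ascending identifier order.
    """
    front = [0] * K          # counts among the first n identifiers
    rear = [0] * K           # counts among the remaining N-n identifiers
    for i in ids:
        rear[i] += 1         # for n = 0 every identifier is in the rear part
    vectors = [front + rear]
    for i in ids:            # move the split point one identifier to the right
        front[i] += 1
        rear[i] -= 1
        vectors.append(front + rear)
    return vectors


vecs = count_vectors([2, 0, 2, 1], K=3)
for n, v in enumerate(vecs):
    print(n, v)
```

For a sequence of N identifiers this yields N+1 vectors of length 2K, and the entries of each vector sum to N. Updating the counts incrementally avoids recounting the whole sequence for every n.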

The instructions, when executed by the one or more processors, may cause the apparatus to perform: the converting each of the log lines of the sequence of N log lines into the respective log identifier by an integer embedding algorithm.
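The passage above does not spell out the integer embedding algorithm, so the following is only a hypothetical stand-in that shows what a mapping from log lines to integer identifiers can look like. It assumes that runs of digits are the variable parts of a log line, masks them, and gives each distinct masked template the next unused integer. The name `to_log_ids` and the sample log lines are invented for illustration.

```python
import re
from typing import Dict, List


def to_log_ids(lines: List[str], table: Dict[str, int]) -> List[int]:
    """Map each log line to an integer ID, one ID per masked template.

    Hypothetical scheme: runs of digits are treated as variable fields and
    masked, and each new template gets the next unused integer.
    """
    ids = []
    for line in lines:
        template = re.sub(r"\d+", "<*>", line)
        if template not in table:
            table[template] = len(table)
        ids.append(table[template])
    return ids


table: Dict[str, int] = {}
ids = to_log_ids(
    ["open block 17", "close block 17", "open block 42"],
    table,
)
print(ids)
```

The first and third lines share a template and therefore share an identifier, so the number of distinct identifiers tracks the number of log line types rather than the number of log lines.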

The instructions, when executed by the one or more processors, may cause the apparatus to perform, for each of the values of n between 0 and N inclusive: the arranging the front frequencies and the rear frequencies of the K different log identifiers in a count vector for the respective value of n according to the predefined order by arranging the front frequencies in a first portion of the count vector for the respective value of n according to a predefined first order, arranging the rear frequencies in a second portion of the count vector for the respective value of n according to a predefined second order, and combining the first portion of the count vector for the respective value of n and the second portion of the count vector to the count vector for the respective value of n according to a predefined rule.

The instructions, when executed by the one or more processors, may cause the apparatus to perform, for each of the values of n between 0 and N inclusive at least one of the following: the arranging the front frequencies of the log identifiers in the first portion of the count vector for the respective value of n such that, according to the predefined first order, the log identifiers of the front frequencies increase monotonously from a beginning of the first portion of the counter vector towards an end of the first portion of the counter vector or such that the log identifiers decrease monotonously from the beginning of the first portion of the counter vector towards the end of the first portion of the counter vector; or the arranging the rear frequencies of the log identifiers in the second portion of the count vector for the respective value of n such that, according to the predefined second order, the log identifiers of the rear frequencies increase monotonously from a beginning of the second portion of the counter vector towards an end of the second portion of the counter vector or such that the log identifiers decrease monotonously from the beginning of the second portion of the counter vector towards the end of the second portion of the counter vector; or the combining the first portion of the count vector for the respective value of n with the second portion of the count vector for the respective value of n by concatenating the first portion of the count vector for the respective value of n and the second portion of the count vector for the respective value of n according to the predefined rule to obtain the count vector for the respective value of n; or the combining the first portion of the count vector for the respective value of n with the second portion of the count vector for the respective value of n by interleaving the first portion of the count vector for the respective value of n and the second portion of the count vector for the respective value of n according to the predefined rule to obtain the count vector for the respective value of n.
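The two ways of combining the portions can be compared in a short sketch. It assumes both portions are already ordered by ascending identifier, and it picks one possible interleaving rule (front and rear frequency of the same identifier placed next to each other); the text only requires that the rule be predefined. The function names are hypothetical.

```python
from typing import List


def concatenate(front: List[int], rear: List[int]) -> List[int]:
    # First portion followed by second portion.
    return front + rear


def interleave(front: List[int], rear: List[int]) -> List[int]:
    # Front and rear frequency of each identifier sit next to each other.
    out: List[int] = []
    for f, r in zip(front, rear):
        out.extend((f, r))
    return out


front = [1, 0, 2]   # front frequencies for identifiers 0, 1, 2 (ascending)
rear = [0, 1, 0]    # rear frequencies for the same identifiers
print(concatenate(front, rear))   # [1, 0, 2, 0, 1, 0]
print(interleave(front, rear))    # [1, 0, 0, 1, 2, 0]
```

Either layout carries the same information. What matters is that the same rule is used for every value of n, so that each position of the count vector always refers to the same identifier and the same portion.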

The instructions, when executed by the one or more processors, may further cause the apparatus to perform: monitoring whether a new log line for a point in time later than the points in time for the log lines in the sequence of N log lines is available; converting the new log line into a respective log identifier among the K different log identifiers; removing the log identifier of the log line with an earliest point in time among the sequence of N log lines from the sequence of N log identifiers and adding the log identifier of the new log line to the sequence of N log identifiers to obtain the sequence of N log identifiers.
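The drop-oldest, add-newest update of the sequence can be sketched with a bounded double-ended queue. This sketch assumes the identifiers arrive in time order, one per new log line; the generator name `sliding_windows` is hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List


def sliding_windows(id_stream: Iterable[int], N: int) -> Iterator[List[int]]:
    """Yield the latest N identifiers each time a new one arrives."""
    # With maxlen set, appending to a full deque drops the oldest entry.
    window: Deque[int] = deque(maxlen=N)
    for log_id in id_stream:
        window.append(log_id)
        if len(window) == N:
            yield list(window)


windows = list(sliding_windows([2, 0, 2, 1, 1], N=3))
print(windows)   # [[2, 0, 2], [0, 2, 1], [2, 1, 1]]
```

Each yielded window is a sequence of N log identifiers that can be passed through the counting and autoencoder steps, which is what makes the approach usable on a stream.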

The instructions, when executed by the one or more processors, may further cause the apparatus to perform: sorting the log identifiers of the sequence of N log identifiers such that the points in time of the log lines on which the log identifiers are based are subsequent.
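Putting the identifiers into time order can be sketched as below, assuming each identifier is stored together with the timestamp of its log line as a `(timestamp, log_id)` pair. The pair layout and the name `ids_in_time_order` are assumptions made for this example.

```python
from typing import List, Tuple


def ids_in_time_order(stamped: List[Tuple[float, int]]) -> List[int]:
    # sorted() is stable, so identifiers with equal timestamps keep their arrival order.
    return [log_id for _, log_id in sorted(stamped, key=lambda pair: pair[0])]


ordered = ids_in_time_order([(3.0, 7), (1.0, 4), (2.0, 4)])
print(ordered)   # [4, 4, 7]
```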

The instructions, when executed by the one or more processors, may further cause the apparatus to perform: determining a number of different types of log lines from the plurality of log lines or from a source code of a system providing the plurality of log lines; setting K equal to or larger than the number of different types of log lines.

The instructions, when executed by the one or more processors, may further cause the apparatus to perform: identifying, in a system providing the plurality of log lines, a root cause why the sequence of N log lines is marked as anomalous if the sequence of N log lines is marked as anomalous; mitigating the root cause in the system.

According to a second aspect of the invention, there is provided a method comprising: converting each log line of a sequence of N log lines among a plurality of log lines into a respective log identifier among K different log identifiers to obtain a sequence of N log identifiers; for each value of n between 0 and N inclusive: for each of the K different log identifiers: counting occurrences of the respective log identifier among the first n log identifiers of the sequence of N log identifiers to obtain a front frequency of the respective log identifier for the respective value of n; and for each of the K different log identifiers: counting occurrences of the respective log identifier among the N-n log identifiers of the sequence of N log identifiers following the first n log identifiers of the sequence of N log identifiers to obtain a rear frequency of the respective log identifier for the respective value of n; arranging the front frequencies and the rear frequencies of the K different log identifiers in a count vector for the respective value of n according to a predefined order; inputting the count vector for the respective value of n into an autoencoder to obtain, from the autoencoder, an output vector for the respective value of n; determining a difference between the output vector for the respective value of n and the count vector for the respective value of n; and at least one of the following: checking whether the difference between the output vector for the respective value of n and the count vector for the respective value of n is larger than a threshold and marking the sequence of N log lines as anomalous if the difference between the output vector for the respective value of n and the count vector for the respective value of n is larger than the threshold; or inputting the difference between the output vector for the respective value of n and the count vector for the respective value of n into the autoencoder as a respective reconstruction error; wherein each of the K different log identifiers is a respective integer.

The method may further comprise: marking the sequence of N log lines as not anomalous if, for each of the values of n, the difference between the output vector for the respective value of n and the count vector for the respective value of n is not larger than the threshold.
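The resulting decision over all split points can be sketched as follows: the sequence counts as anomalous as soon as one value of n exceeds the threshold, and as not anomalous only if none does. The mean squared difference is an assumed measure, since the text only requires "a difference", and `toy_reconstruct` is a hypothetical stand-in for a trained autoencoder.

```python
from typing import Callable, List, Sequence


def mean_squared_difference(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumed difference measure; the source does not fix a particular one.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def is_anomalous(
    count_vectors: List[List[float]],
    reconstruct: Callable[[List[float]], List[float]],
    threshold: float,
) -> bool:
    """True if any split point n exceeds the threshold, False if none does."""
    return any(
        mean_squared_difference(reconstruct(v), v) > threshold
        for v in count_vectors
    )


def toy_reconstruct(v: List[float]) -> List[float]:
    # Hypothetical stand-in: always outputs the single pattern it "knows".
    return [1.0, 1.0, 0.0, 0.0]


normal = [[1, 1, 0, 0], [1, 1, 0, 0]]
odd = [[1, 1, 0, 0], [0, 0, 3, 3]]
print(is_anomalous(normal, toy_reconstruct, threshold=0.5))   # False
print(is_anomalous(odd, toy_reconstruct, threshold=0.5))      # True
```

Because `any` stops at the first count vector that exceeds the threshold, an anomalous sequence can be flagged without evaluating the remaining values of n.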

The converting may be performed by converting each of the log lines of the sequence of N log lines into the respective log identifier by an integer embedding algorithm.

For each of the values of n between 0 and N inclusive: the arranging may be performed by arranging the front frequencies and the rear frequencies of the K different log identifiers in a count vector for the respective value of n according to the predefined order by arranging the front frequencies in a first portion of the count vector for the respective value of n according to a predefined first order, arranging the rear frequencies in a second portion of the count vector for the respective value of n according to a predefined second order, and combining the first portion of the count vector for the respective value of n and the second portion of the count vector to the count vector for the respective value of n according to a predefined rule.

For each of the values of n between 0 and N inclusive at least one of the following may apply: the arranging may be performed by arranging the front frequencies of the log identifiers in the first portion of the count vector for the respective value of n such that, according to the predefined first order, the log identifiers of the front frequencies increase monotonously from a beginning of the first portion of the counter vector towards an end of the first portion of the counter vector or such that the log identifiers decrease monotonously from the beginning of the first portion of the counter vector towards the end of the first portion of the counter vector; or the arranging may be performed by arranging the rear frequencies of the log identifiers in the second portion of the count vector for the respective value of n such that, according to the predefined second order, the log identifiers of the rear frequencies increase monotonously from a beginning of the second portion of the counter vector towards an end of the second portion of the counter vector or such that the log identifiers decrease monotonously from the beginning of the second portion of the counter vector towards the end of the second portion of the counter vector; or the combining the first portion of the count vector for the respective value of n with the second portion of the count vector for the respective value of n by concatenating the first portion of the count vector for the respective value of n and the second portion of the count vector for the respective value of n according to the predefined rule to obtain the count vector for the respective value of n; or the combining may be performed by combining the first portion of the count vector for the respective value of n with the second portion of the count vector for the respective value of n by interleaving the first portion of the count vector for the respective value of n and the second portion of the count vector for the respective value of n according to the predefined rule to obtain the count vector for the respective value of n.

The method may further comprise: monitoring whether a new log line for a point in time later than the points in time for the log lines in the sequence of N log lines is available; converting the new log line into a respective log identifier among the K different log identifiers; removing the log identifier of the log line with an earliest point in time among the sequence of N log lines from the sequence of N log identifiers and adding the log identifier of the new log line to the sequence of N log identifiers to obtain the sequence of N log identifiers.

The method may further comprise: sorting the log identifiers of the sequence of N log identifiers such that the points in time of the log lines on which the log identifiers are based are subsequent.

The method may further comprise: determining a number of different types of log lines from the plurality of log lines or from a source code of a system providing the plurality of log lines; setting K equal to or larger than the number of different types of log lines.

The method may further comprise: identifying, in a system providing the plurality of log lines, a root cause why the sequence of N log lines is marked as anomalous if the sequence of N log lines is marked as anomalous; mitigating the root cause in the system.

The method may be a method of anomaly detection.

According to a third aspect of the invention, there is provided a computer program product comprising a set of instructions which, when executed on an apparatus, is configured to cause the apparatus to carry out the method according to the second aspect. The computer program product may be embodied as a computer-readable medium or directly loadable into a computer.

According to some embodiments of the invention, at least one of the following advantages may be achieved:

• unstructured log files (or log lines) may be exploited for anomaly detection and service assurance;

• high precision, recall value, and/or F1 score;

• low processing requirements, in particular because only few model parameters are needed;

• suitable for parallelization;

• suitable for streaming and online service assurance.

It is to be understood that any of the above modifications can be applied singly or in combination to the respective aspects to which they refer, unless they are explicitly stated as excluding alternatives.

Brief description of the drawings

Further details, features, objects, and advantages are apparent from the following detailed description of the preferred embodiments of the present invention which is to be taken in conjunction with the appended drawings, wherein:

Fig. 1 illustrates some processing tasks according to a method according to some example embodiments of the invention (the numerical values of N, K and the log IDs are examples only);

Fig. 2 illustrates an encoder architecture including a comparator according to some example embodiments of the invention;

Fig. 3 illustrates a method according to some example embodiments of the invention;

Fig. 4 shows an apparatus according to an example embodiment of the invention;

Fig. 5 shows a method according to an example embodiment of the invention; and

Fig. 6 shows an apparatus according to an example embodiment of the invention.

Detailed description of certain embodiments

Herein below, certain embodiments of the present invention are described in detail with reference to the accompanying drawings, wherein the features of the embodiments can be freely combined with each other unless otherwise described. However, it is to be expressly understood that the description of certain embodiments is given by way of example only, and that it is in no way intended to be understood as limiting the invention to the disclosed details.

Moreover, it is to be understood that the apparatus is configured to perform the corresponding method, although in some cases only the apparatus or only the method are described.

With 5G at the latest, telecom software is increasingly virtualized and deployed on cloud infrastructure, inside virtual machines or cloud native containers. Due to the introduction of IT/cloud technology, new means of software monitoring have become available in the telecom domain, most notably computer logs, which provide an unstructured text stream of information encoding infrastructure metrics, application metrics, performance related information, failures and other events (such as errors, information, debug info).

Some example embodiments of the invention autonomously find anomalies in a stream of computer log lines, without a-priori knowledge of which log lines or which patterns in the log lines are or should be considered anomalous. In order to solve this technical problem, at least one of the following two technical problems has to be solved:

1. No off-the-shelf supervised deep learning method can be used as those assume labeled training data, wherein each log line is labeled as “anomaly” or “no anomaly”. Such labeled data sets do not exist in the 5G/6G telecom domain.

2. Since no labeled training data exists, any available training data may contain a mixture of anomalous and non-anomalous logs. Therefore, off-the-shelf unsupervised training methods depending on the presence of clean (non-anomalous) training data cannot be used.

It is challenging to apply machine learning in order to automatically find patterns in logs without having any a-priori knowledge of the content, semantics or presence of anomalies in the logs. There is no pre-defined format or content for computer logs as they are produced by arbitrary logging statements in the program code of the software modules. Therefore, a log analytics method according to some example embodiments of the invention does not have any dependency on the log structure.

According to some example embodiments of the invention, it is determined and indicated whether or not a sequence of log lines (“input log sequence”) is anomalous (i.e., an anomaly, if any, is indicated for the whole input log sequence, not just for a single log line of the input log sequence).

Figs. 1 and 2 illustrate a first phase (pre-processing) and a second phase (autoencoding), respectively, of some example embodiments of the invention.

As shown in Fig. 1, in the pre-processing phase, the textual input log lines are transformed into integers so that similar log lines are denoted by the same integer, called a log identifier (or log type). This action may be called “log line embedding” and may be performed by any suitable integer embedding algorithm, mapping the logs into integer space. (In general, the mapping may be to any one-dimensional space (e.g. the letters of the alphabet and combinations thereof such as A, B, C, ..., Z, AA, AB, ..., AZ, BA, ...), where the members may be mapped to natural numbers. In the context of the present application such a mapping is considered as an integer embedding algorithm, too.)
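As a purely illustrative sketch of such a log line embedding, the following Python code maps log lines to integers by masking digit runs and numbering the resulting templates in order of first appearance. The description does not prescribe any particular embedding algorithm; the class `LogEmbedder`, the placeholder `<*>`, the sample log lines, and the digit-masking heuristic are all assumptions made for this example.

```python
import re


class LogEmbedder:
    """Hypothetical integer embedding: similar log lines receive the same integer."""

    def __init__(self):
        self.template_to_id = {}

    def template(self, line):
        # Assumption: lines that differ only in numbers are of the same type.
        return re.sub(r"\d+", "<*>", line).strip()

    def embed(self, line):
        key = self.template(line)
        if key not in self.template_to_id:
            # Identifiers are assigned as 1, 2, 3, ... in order of first appearance.
            self.template_to_id[key] = len(self.template_to_id) + 1
        return self.template_to_id[key]


embedder = LogEmbedder()
log_ids = [
    embedder.embed(line)
    for line in [
        "Received block 123 of size 456",
        "Deleting block 7",
        "Received block 9 of size 10",
    ]
]
```

Here the first and third log lines share a template and therefore obtain the same log identifier.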

The number of different log identifiers (integers) that may be output by the log line embedding is limited and denoted by K. K is specific to the system (e.g., SW module) that produces the logs. Any given SW module can only produce a limited number of different types of log lines because the structure of the log lines is pre-determined by the source code of the SW module. Thus, K may be determined if the source code of the SW module is known. Thus, for example, the manufacturer of the system (e.g. SW module) may provide the value of K. Alternatively, or if the source code is not known, K can be obtained by observing a sufficiently large (diverse) amount of logs produced by the system (SW module) and determining the number of different types of log lines. Practically, the determined number of types of log lines may be considered K. In some cases, to be on the safe side, one may choose K slightly larger than the determined number of types of log lines. K may be determined in the pre-processing phase or prior to the pre-processing phase.
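For the case where K is obtained by observation, a minimal sketch might look as follows. The function name `estimate_k` and the parameter `margin` are hypothetical; how large a margin is appropriate is not stated in the description.

```python
def estimate_k(observed_ids, margin=0):
    """Number of distinct log identifiers observed, plus an optional safety margin."""
    return len(set(observed_ids)) + margin


k_exact = estimate_k([1, 2, 1, 3, 1, 2])
k_safe = estimate_k([1, 2, 1, 3, 1, 2], margin=2)
```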

After the sequence of log lines is transformed into a sequence of log identifiers, N+1 iterations are performed, wherein in each iteration the sequence of log identifiers is split into two parts. That is, the splitting is performed at all possible positions: if the sequence of log identifiers contains N log identifiers, in each iteration the splitting is performed after position n, wherein n has a respective value between 0 and N inclusive. Thus, there are iterations where the two extreme splits (n=0 or n=N) are performed, where one of the two parts is empty and the other contains the entire sequence.

Then, for each split (for each value of n), a respective count vector of length 2K is constructed by counting the occurrences of each potential log identifier (out of the K possible log identifiers) in front of the split point and behind the split point. The number of occurrences in front of and behind the split point may be denoted as front frequency and rear frequency, respectively. For each split (each value of n), the front frequencies and the rear frequencies are arranged in the respective count vector according to a predetermined order. I.e., the predetermined order is the same for all values of n between 0 and N inclusive.
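The construction of the N+1 count vectors can be sketched as follows. This is one possible illustration under stated assumptions: the log identifiers are taken to be the integers 1 to K, the front frequencies are placed before the rear frequencies, and the function name `count_vectors` is hypothetical. The sketch moves one identifier at a time from the rear part to the front part instead of recounting for every split.

```python
def count_vectors(ids, k):
    """Return the N+1 count vectors (each of length 2K) for the splits n = 0..N.

    Assumes every identifier in `ids` is an integer in 1..k.
    """
    front = [0] * k
    rear = [0] * k
    for log_id in ids:
        rear[log_id - 1] += 1
    vectors = [front + rear]  # n = 0: the front part is empty
    for log_id in ids:
        # Moving the split point one position to the right shifts one
        # occurrence of log_id from the rear part to the front part.
        front[log_id - 1] += 1
        rear[log_id - 1] -= 1
        vectors.append(front + rear)
    return vectors


vectors = count_vectors([1, 2, 1], k=2)
```

For the sequence 1, 2, 1 with K = 2 this yields four vectors, from [0, 0, 2, 1] for n = 0 to [2, 1, 0, 0] for n = N.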

An example of such a predetermined order is as follows: For 1 ≤ i ≤ K, the i-th position in the count vector denotes the number of occurrences of log identifier i in front of the split point (the front frequency), whereas for K + 1 ≤ i ≤ 2K the i-th position in the count vector denotes the number of occurrences of log identifier i − K behind the split point (the rear frequency).

Then, all the count vectors for the different splits (different values of n between 0 and N inclusive) are fed into an Autoencoder (AE), see Fig. 2. In this phase, an AE is trained on the entire set of count vectors created in the previous phase. In each AE training step, the input of the AE is a count vector, and the output of the AE is a reconstruction of the count vector (with some error). The difference between the AE’s input and output (i.e., the reconstruction error) is the loss function used, in the training phase, as the training feedback for the AE. In the training phase, the autoencoding is typically (but not necessarily) repeated for many sequences of N log lines generated by the same system (e.g. a software module).
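To make the training step concrete, the following sketch trains a very small linear autoencoder on count vectors using plain stochastic gradient descent and the squared reconstruction error as loss. It is an assumption-laden toy, not the architecture of Fig. 2: the description does not specify the layer sizes, activation functions or optimizer, and the function `train_autoencoder`, the hidden size, the learning rate and the scaling of the counts by 1/N are all choices made only for this example.

```python
import random


def train_autoencoder(vectors, hidden=2, epochs=200, lr=0.1, seed=0):
    """Train a toy linear autoencoder y = W2 (W1 x); return (reconstruct, history)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]

    def forward(x):
        h = [sum(w1[j][i] * x[i] for i in range(dim)) for j in range(hidden)]
        y = [sum(w2[i][j] * h[j] for j in range(hidden)) for i in range(dim)]
        return h, y

    def mean_loss():
        total = 0.0
        for x in vectors:
            _, y = forward(x)
            total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
        return total / len(vectors)

    history = [mean_loss()]
    for _ in range(epochs):
        for x in vectors:
            h, y = forward(x)
            err = [yi - xi for yi, xi in zip(y, x)]  # reconstruction error per component
            # Back-propagate the error to the hidden layer before changing w2.
            grad_h = [sum(err[i] * w2[i][j] for i in range(dim)) for j in range(hidden)]
            for i in range(dim):
                for j in range(hidden):
                    w2[i][j] -= lr * err[i] * h[j]
            for j in range(hidden):
                for i in range(dim):
                    w1[j][i] -= lr * grad_h[j] * x[i]
        history.append(mean_loss())

    return (lambda x: forward(x)[1]), history


# Count vectors of the identifier sequence 1, 2, 1, 3, 1, 2 (K = 3, N = 6),
# scaled by 1/N (the scaling is an assumption made for numerical convenience).
raw = [
    [0, 0, 0, 3, 2, 1], [1, 0, 0, 2, 2, 1], [1, 1, 0, 2, 1, 1], [2, 1, 0, 1, 1, 1],
    [2, 1, 1, 1, 1, 0], [3, 1, 1, 0, 1, 0], [3, 2, 1, 0, 0, 0],
]
training_vectors = [[c / 6 for c in v] for v in raw]
reconstruct, history = train_autoencoder(training_vectors)
```

The list `history` holds the mean reconstruction error before training and after each epoch, which allows checking that the training feedback actually reduces the error.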

The difference may be calculated, for example, as the square root of the sum of the squares of the differences between corresponding components of the count vector and the output vector.
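This example difference is the Euclidean distance between the two vectors. A minimal sketch (the function name `reconstruction_error` is hypothetical):

```python
import math


def reconstruction_error(count_vector, output_vector):
    """Square root of the sum of squared component-wise differences."""
    return math.sqrt(
        sum((c - o) ** 2 for c, o in zip(count_vector, output_vector))
    )


error = reconstruction_error([0, 0, 2, 1], [3, 4, 2, 1])
```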

For inference, the architecture of Fig. 2 may be used, too. For inference, a new log sequence is input into the trained AE in order to detect whether the new log sequence is anomalous or not. That is, the following actions are performed on the new log sequence:

1. Create a count vector for each of the possible splits of the log sequence (same as described above).

2. Input each of the count vectors to the AE and compare the AE’s output with its input. If the difference between the input and the output is above a pre-defined threshold, the log sequence from which the count vector was created contains an anomaly.
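The two inference actions above can be sketched as follows. The trained AE is represented by an arbitrary callable; the stand-in `stub_autoencoder` used here is not a real model and only serves to make the example runnable. The function name `is_anomalous` and the threshold value are likewise assumptions.

```python
import math


def is_anomalous(count_vectors, autoencoder, threshold):
    """Mark the sequence as anomalous if any count vector is reconstructed poorly."""
    for vector in count_vectors:
        output = autoencoder(vector)
        # math.dist (Python 3.8+) is the Euclidean distance between the vectors.
        if math.dist(vector, output) > threshold:
            return True
    return False


def stub_autoencoder(vector):
    """Stand-in for a trained AE: pretends that counts above 2 cannot be reconstructed."""
    return [min(component, 2) for component in vector]


normal = is_anomalous([[0, 0, 2, 1], [1, 0, 1, 1]], stub_autoencoder, threshold=0.5)
suspicious = is_anomalous([[0, 0, 5, 0], [1, 0, 4, 0]], stub_autoencoder, threshold=0.5)
```

The check stops at the first count vector whose difference exceeds the threshold, since a single such vector is sufficient to mark the whole sequence.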

Fig. 3 illustrates a method according to some example embodiments of the invention. In Fig. 3, the sequence of N log lines is created subsequently from the log lines received from the system (e.g. software module). Also, the method of Fig. 3 creates a training feedback for each count vector. If the method is used for inference, in some example embodiments of the invention, such a training feedback need not be created. On the other hand, during the training phase, a decision between “anomaly” and “no anomaly” may be omitted in some example embodiments. In particular, in some example embodiments, the threshold for the difference between input count vector and output vector to decide whether the sequence of logs comprises an anomaly may be defined only after the training was performed.

The count vector based log analytics method according to an example embodiment was implemented and compared to the best state-of-the-art log analytics methods published in the scientific literature. The comparison was based on the publicly known HDFS data set, which is a benchmark for log analytics. Table 1 shows the performance of multiple state-of-the-art models and the count vector based log analytics method according to the example embodiment.

Based on the above comparison, some benefits of example embodiments of the invention over state-of-the-art methods are the following:

- Unprecedentedly high precision and recall values, making the count vector based method the best performing model in terms of classifying the right log lines as anomalies.

- Lightweight operation: the count vector based method does not need computationally expensive recurrent deep learning models like predict-next or bidirectional predict-next. This is reflected by the number of model parameters: the count vector based method has significantly fewer trained model parameters (the parameters of a single AE) compared to any other state-of-the-art model, which translates into fewer compute cycles during inference and, thus, much higher model throughput (i.e., number of log lines analyzed per second) at a much lower memory footprint. Benefits of lightweight operation also include the potential to be embedded in products or to run on edge clouds.

- Applicable for streaming data by maintaining a memory of the last N log lines and applying the method on the N log lines each time a new log line is received. This gives an immediate anomaly feedback if the latest log line creates an anomaly within the last N log lines. At the same time it may point to the last log line as being the cause of the anomaly. This gives improved explainability and root cause analysis (RCA) potential by narrowing the source of the anomaly down to an individual log line.

- Continuous (in-situ or reinforcement) learning: after deploying a trained model for production, every new log line it processes may be used also as a training feedback to continuously train the AE part of the ML architecture. Therefore, the accuracy of the model may stay high even if the production log data starts to deviate from a previously learned distribution.

The count vector method therefore gives a competitive advantage over existing methods, enabling it to be run at the edge, or even on devices, rather than requiring all log data to be streamed to a central computing platform, e.g. in the cloud. This enables the method to be applied to sensitive data that must stay on the premises, e.g., for security use cases (applied to user activity logs).

Fig. 4 shows an apparatus according to an example embodiment of the invention. The apparatus may be a log analytics device or an element thereof. Fig. 5 shows a method according to an example embodiment of the invention. The apparatus according to Fig. 4 may perform the method of Fig. 5 but is not limited to this method. The method of Fig. 5 may be performed by the apparatus of Fig. 4 but is not limited to being performed by this apparatus.

The apparatus comprises means for converting 110, first means for counting 120, second means for counting 130, means for arranging 140, first means for inputting 150, means for determining 160, means for checking 170, means for marking 180, and second means for inputting 190. The means for converting 110, first means for counting 120, second means for counting 130, means for arranging 140, first means for inputting 150, means for determining 160, means for checking 170, means for marking 180, and second means for inputting 190 may be a converting means, first counting means, second counting means, arranging means, first inputting means, determining means, checking means, marking means, and second inputting means, respectively. The means for converting 110, first means for counting 120, second means for counting 130, means for arranging 140, first means for inputting 150, means for determining 160, means for checking 170, means for marking 180, and second means for inputting 190 may be a converter, first counter, second counter, arranger, first inputter, determiner, checker, marker, and second inputter, respectively. The means for converting 110, first means for counting 120, second means for counting 130, means for arranging 140, first means for inputting 150, means for determining 160, means for checking 170, means for marking 180, and second means for inputting 190 may be a converting processor, first counting processor, second counting processor, arranging processor, first inputting processor, determining processor, checking processor, marking processor, and second inputting processor, respectively.

Some example embodiments (e.g. used for inference) may comprise the means for checking 170 and the means for marking 180 but may not comprise the second means for inputting 190. Some example embodiments (e.g. used for training) may not comprise the means for checking 170 and the means for marking 180 but may comprise the second means for inputting 190. Some example embodiments may comprise the means for checking 170, the means for marking 180, and the second means for inputting 190.

The means for converting 110 converts each log line of a sequence of N log lines into a respective log identifier to obtain a sequence of N log identifiers (S110). The log identifiers are selected among K different log identifiers. Each of the K different log identifiers is a respective integer. The sequence of N log lines may be taken from a plurality of log lines. The sequence of N log lines may be sorted, e.g. according to the time of the event reported in the respective log line.

The first means for counting 120, second means for counting 130, means for arranging 140, first means for inputting 150, means for determining 160, means for checking 170 (if any), means for marking 180 (if any), and second means for inputting 190 (if any) perform the following actions for each value of n between 0 and N inclusive. The actions for different values of n may be performed consecutively or fully or partly in parallel.

The first means for counting 120 counts for each of the K different log identifiers occurrences of the respective log identifier among the first n log identifiers of the sequence of N log identifiers (S120). Thus, the first means for counting 120 obtains a front frequency of the respective log identifier for the respective value of n.

The second means for counting 130 counts, for each of the K different log identifiers, occurrences of the respective log identifier among the N-n log identifiers of the sequence of N log identifiers that follow the first n log identifiers of the sequence of N log identifiers (S130), i.e. among the log identifiers remaining after those counted by the first means for counting 120. Thus, the second means for counting 130 obtains a rear frequency of the respective log identifier for the respective value of n.

The means for arranging 140 arranges the front frequencies and the rear frequencies of the K different log identifiers in a count vector for the respective value of n according to a predefined order (S140). The first means for inputting 150 inputs the count vector for the respective value of n into an autoencoder (S150). Thus an output vector for the respective value of n is obtained from the autoencoder. The means for determining 160 determines a difference between the output vector for the respective value of n and the count vector for the respective value of n (S160).

If the apparatus comprises the means for checking 170 and the means for marking 180, it may perform the following actions:

The means for checking 170 checks whether the difference between the output vector for the respective value of n and the count vector for the respective value of n is larger than a threshold (S170). If the difference between the output vector for the respective value of n and the count vector for the respective value of n is larger than the threshold (S170 = yes), the means for marking 180 marks the sequence of N log lines as anomalous (S180).

If the apparatus comprises the second means for inputting 190, it may perform the following action: The second means for inputting 190 inputs the difference between the output vector for the respective value of n and the count vector for the respective value of n into the autoencoder as a respective reconstruction error (S190).

If the apparatus comprises the means for checking 170, the means for marking 180, and the second means for inputting 190, S190 and the pair of S170 and S180 may be performed in an arbitrary sequence. They may be performed fully or partly in parallel.

Fig. 6 shows an apparatus according to an example embodiment of the invention. The apparatus comprises at least one processor 810, at least one memory 820 storing instructions that, when executed by the at least one processor 810, cause the apparatus at least to perform the method according to Fig. 5 and related description.

In some example embodiments, if an anomaly is detected from the log files, a root cause analysis is performed to identify the root cause of the anomaly. If the root cause is identified in the system (e.g. the telco network) providing the logs, the root cause is mitigated to remove the anomaly. In some example embodiments, the root cause of a log anomaly may be identified automatically by correlating the anomalous logs with previous log lines. The correlation may identify that one or more anomalous log lines are usually preceded by some other log lines. The content of the other log lines may be indicative of the root cause of the anomalous log lines (for example, anomalous log lines indicating errors in a network protocol layer may be systematically preceded by log lines that indicate that a certain network interface card has failed - the co-occurrence of the two log lines suggests that the network interface card failure, which is the content of the preceding logs, is the root cause behind the network protocol layer errors). In such a case, in some example embodiments, the system may automatically restart the interface card in order to mitigate the root cause.
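One simple way to look for log lines that usually precede anomalous log lines is to count the log identifiers found in a short window before each anomalous position. The sketch below illustrates only this counting idea; the function `preceding_id_counts`, the parameter `lookback`, and the sample identifiers are assumptions, and the description does not state how the correlation is actually computed.

```python
from collections import Counter


def preceding_id_counts(ids, anomalous_positions, lookback=3):
    """Count log identifiers occurring within `lookback` lines before each anomalous line."""
    counts = Counter()
    for position in anomalous_positions:
        start = max(0, position - lookback)
        counts.update(ids[start:position])
    return counts


# Hypothetical identifiers: 9 marks the anomalous log lines (positions 1 and 4).
ids = [7, 9, 1, 7, 9, 2]
counts = preceding_id_counts(ids, anomalous_positions=[1, 4], lookback=1)
candidate = counts.most_common(1)
```

An identifier that dominates these counts (here identifier 7) would be a candidate for further inspection as a possible root cause.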

For each value of n between 0 and N, the arrangement of the front and rear frequencies in the respective count vector is the same. However, there are various options for such an arrangement. For example, a first portion of the count vector may comprise the frequencies (“front frequency”) of occurrences of the log identifiers in the first n log lines, and a second portion of the count vector may comprise the frequencies (“rear frequency”) of occurrences of the log identifiers in the remaining log lines. The first and second portions may be combined into the count vector in different ways according to a predefined rule. For example, the first and second portions may be concatenated, wherein either the second portion follows the first portion, or the first portion follows the second portion. Alternatively, the first and second portions may be interleaved. For example, the front frequency and the rear frequency of a certain log identifier may be arranged as neighbors in the count vector. Within the first portion, the front frequencies may be arranged according to some predefined order, e.g. such that the values of the corresponding log identifiers increase (or decrease) monotonically. Within the second portion, the rear frequencies may be arranged according to some predefined order, e.g. such that the values of the corresponding log identifiers increase (or decrease) monotonically. The arrangement of the front frequencies in the first portion and the arrangement of the rear frequencies in the second portion may be the same or different from each other.
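The combination options mentioned above can be illustrated with a short sketch. The function `arrange` and the rule names `"concat"`, `"concat_reversed"` and `"interleave"` are hypothetical labels introduced for this example; both portions are assumed to be ordered by increasing log identifier already.

```python
def arrange(front, rear, rule="concat"):
    """Combine the front and rear portions into one count vector of length 2K."""
    if rule == "concat":
        return list(front) + list(rear)
    if rule == "concat_reversed":
        return list(rear) + list(front)
    if rule == "interleave":
        # Front and rear frequency of the same log identifier become neighbors.
        return [value for pair in zip(front, rear) for value in pair]
    raise ValueError(f"unknown rule: {rule}")


front, rear = [1, 0, 2], [2, 2, 0]
concatenated = arrange(front, rear, "concat")
interleaved = arrange(front, rear, "interleave")
```

Whichever rule is chosen, the same rule has to be used for all values of n and for both training and inference, since the AE learns the positions of the components.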

The log lines of a sequence of N log lines are typically sorted in time (the time of the event reported in the log line) when they are received from the system. If it is not known whether or not the sequence of N log lines is sorted in time or if it is known that the N log lines are not sorted in time, in some example embodiments, the log lines are sorted in time before the detection of anomalous behavior is started based on the (sorted) sequence of log lines. However, it is not necessary that the log lines of the sequence of log lines are sorted in time. They may be sorted according to some other criterion, or they may be unsorted.
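Where sorting in time is desired, a minimal sketch could look as follows. It assumes, purely for illustration, that every log line starts with an ISO 8601 timestamp followed by a space; real log formats differ, and the function name `sort_by_time` and the sample lines are hypothetical.

```python
from datetime import datetime


def sort_by_time(lines):
    """Sort log lines by their leading ISO 8601 timestamp (assumed format)."""
    return sorted(
        lines,
        key=lambda line: datetime.fromisoformat(line.split(" ", 1)[0]),
    )


unsorted_lines = [
    "2022-10-07T12:00:02 Deleting block 7",
    "2022-10-07T12:00:01 Received block 123 of size 456",
]
sorted_lines = sort_by_time(unsorted_lines)
```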

If there are several input log sequences, they may be processed independently from each other, which enables massive parallelization over multiple log data sources. Some of the input log sequences may comprise one or more of the same log lines, or the input log sequences may not have any log lines in common.
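Because the input log sequences are independent, the per-sequence detection can be dispatched to a pool of workers. The sketch below uses a thread pool from the Python standard library only to show the structure; the function `detect_all` is hypothetical, and the toy `detect` callable stands in for the actual count vector based detection.

```python
from concurrent.futures import ThreadPoolExecutor


def detect_all(sequences, detect, max_workers=4):
    """Apply `detect` to every sequence independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(detect, sequences))


def detect(sequence):
    """Toy stand-in for the detection: flags sequences with more than one identifier."""
    return len(set(sequence)) > 1


results = detect_all([[1, 2, 1], [3, 3, 3]], detect)
```

For CPU-bound detection, a process pool or separate machines per log data source may be more appropriate than threads; the structure of the dispatch stays the same.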

Some example embodiments are explained with respect to a 5G network. However, the invention is not limited to 5G. It may be used in other communication networks, too, e.g. in previous or forthcoming generations of 3GPP networks such as 4G, 6G, or 7G, etc. It may be used in non-3GPP communication networks, too, such as in wired communication networks. It may be used in any system which produces a number of log lines.

One piece of information may be transmitted in one or plural messages from one entity to another entity. Each of these messages may comprise further (different) pieces of information.

Names of network elements, network functions, protocols, and methods are based on current standards. In other versions or other technologies, the names of these network elements and/or network functions and/or protocols and/or methods may be different, as long as they provide a corresponding functionality. The same applies correspondingly to the terminal.

If not otherwise stated or otherwise made clear from the context, the statement that two entities are different means that they perform different functions. It does not necessarily mean that they are based on different hardware. That is, each of the entities described in the present description may be based on a different hardware, or some or all of the entities may be based on the same hardware. It does not necessarily mean that they are based on different software. That is, each of the entities described in the present description may be based on different software, or some or all of the entities may be based on the same software. Each of the entities described in the present description may be deployed in the cloud.

According to the above description, it should thus be apparent that example embodiments of the present invention provide, for example, a means for service assurance or a component thereof, an apparatus embodying the same, a method for controlling and/or operating the same, and computer program(s) controlling and/or operating the same as well as mediums carrying such computer program(s) and forming computer program product(s).

Implementations of any of the above described blocks, apparatuses, systems, techniques or methods include, as non-limiting examples, implementations as hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. Each of the entities described in the present description may be embodied in the cloud.

It is to be understood that what is described above is what are presently considered the preferred example embodiments of the present invention. However, it should be noted that the description of the preferred example embodiments is given by way of example only and that various modifications may be made without departing from the scope of the invention as defined by the appended claims.

The terms “first X” and “second X” include the options that “first X” is the same as “second X” and that “first X” is different from “second X”, unless otherwise specified. As used herein, “at least one of the following: <a list of two or more elements>” and “at least one of <a list of two or more elements>” and similar wording, where the list of two or more elements are joined by “and” or “or”, mean at least any one of the elements, or at least any two or more of the elements, or at least all the elements.