Title:
GENERATING AN ONTOLOGY FOR REPRESENTING A SYSTEM
Document Type and Number:
WIPO Patent Application WO/2023/105264
Kind Code:
A1
Abstract:
A computer implemented method is disclosed for generating an ontology representing a system comprising a plurality of logical components. The method comprises identifying, from configuration data for the system, conceptual entities and associated parameters present in the system, and creating an ontology grammar comprising the identified conceptual entities and parameters. The method further comprises mapping each logical component of the system to a conceptual entity of the ontology grammar, and creating an ontology graph comprising instances of ontology grammar conceptual entities. The method further comprises obtaining numerical data representing operation of the system, and updating the ontology graph with values extracted from the numerical data, and extracting, from the obtained numerical data, a set of ontology rules comprising at least one rule describing a pattern of temporal evolution of operation of at least one instance in the ontology graph. Also disclosed is a method for managing an incident occurring in a system of logical components, wherein the system is represented by an ontology generated according to the present disclosure.

Inventors:
SOUALHIA MBARKA (CA)
GÉHBERGER DÁNIEL (CA)
WUHIB FETAHI (CA)
GEORGESCU SORIN-MARIAN (CA)
Application Number:
PCT/IB2021/061405
Publication Date:
June 15, 2023
Filing Date:
December 07, 2021
Assignee:
ERICSSON TELEFON AB L M (SE)
International Classes:
G06N5/02; G06N5/04
Foreign References:
US 2020/0379871 A1 (2020-12-03)
US 10,601,640 B1 (2020-03-24)
US 10,540,383 B2 (2020-01-21)
US 10,360,503 B2 (2019-07-23)
Other References:
LI TINGTING ET AL: "A semantic model-based fault detection approach for building energy systems", BUILDING AND ENVIRONMENT, PERGAMON PRESS, OXFORD, GB, vol. 207, 11 November 2021 (2021-11-11), XP086896955, ISSN: 0360-1323, [retrieved on 20211111], DOI: 10.1016/J.BUILDENV.2021.108548
Attorney, Agent or Firm:
HASELTINE LAKE KEMPNER LLP et al. (GB)
Claims:

CLAIMS

1. A computer implemented method for generating an ontology representing a system comprising a plurality of logical components, the method comprising: identifying, from configuration data for the system, conceptual entities present in the system, and parameters by which their operation is represented, and creating an ontology grammar comprising the identified conceptual entities and parameters; mapping each logical component of the system to a conceptual entity of the ontology grammar, and creating an ontology graph comprising an instance of an ontology grammar conceptual entity, and its associated parameters, for each logical component of the system; obtaining numerical data representing operation of the system, and updating the ontology graph with values extracted from the numerical data for the parameters associated with instances in the ontology graph; and extracting, from the obtained numerical data, a set of ontology rules, the set comprising at least one rule describing a pattern of temporal evolution of operation of at least one instance in the ontology graph.

2. The method of claim 1, further comprising: obtaining numerical data representing live operation of the system; and updating the ontology graph and the set of ontology rules using the live operational data.

3. The method of claim 2, wherein updating the ontology graph and the set of ontology rules using the live operational data comprises: comparing the obtained live operational data to the set of ontology rules, and: for live operational data representing a pattern that is not represented in an existing rule of the set of ontology rules, adding a new rule for the pattern to the set of ontology rules; and for live operational data that is inconsistent with at least one rule of the set of ontology rules, changing the rule to correspond to the live operational data.

4. The method of claim 3, wherein updating the ontology graph and the set of ontology rules using the live operational data further comprises: recording a time at which a new rule is created, or an existing rule is changed.

5. The method of any one of claims 1 to 4, wherein identifying, from configuration data for the system, conceptual entities present in the system, and parameters by which their operation is represented, and creating an ontology grammar comprising the identified conceptual entities and parameters, comprises: if no existing ontology grammar for the system is available, creating a conceptual entity for each term of the configuration data; and if an existing ontology grammar for the system is available, mapping terms from the configuration data to existing conceptual entities in the ontology grammar.

6. The method of claim 5, wherein identifying, from configuration data for the system, conceptual entities present in the system, and parameters by which their operation is represented, and creating an ontology grammar comprising the identified conceptual entities and parameters, further comprises: if an existing ontology grammar for the system is available, and if no mapping for a term to a conceptual entity in the existing ontology grammar can be identified, creating a new conceptual entity in the ontology grammar for the term.

7. The method of claim 5 or 6, wherein creating a conceptual entity in the ontology grammar comprises associating with the created conceptual entity a parameter by which its operation may be represented.

8. The method of any one of claims 5 to 7, wherein creating a conceptual entity in the ontology grammar further comprises including with the conceptual entity at least one topological or functional relationship with another conceptual entity in the ontology grammar, wherein the topological or functional relationship with another conceptual entity in the ontology grammar is extracted from the configuration data.

9. The method of any one of claims 5 to 8, wherein mapping terms from the configuration data to existing conceptual entities in the ontology grammar comprises performing a semantic matching operation between terms from the configuration data and conceptual entities in the existing ontology grammar.

10. The method of any one of claims 1 to 9, wherein identifying, from configuration data for the system, conceptual entities present in the system, and parameters by which their operation is represented, and creating an ontology grammar comprising the identified conceptual entities and parameters, further comprises: if a schema file for the system is available, identifying terms in the configuration data using the schema file for the system.

11. The method of any one of claims 1 to 10, wherein mapping each logical component of the system to a conceptual entity of the ontology grammar, and creating an ontology graph comprising an instance of an ontology grammar conceptual entity, and its associated parameters, for each logical component of the system, comprises: for each term of the configuration data: mapping the term to a conceptual entity of the ontology grammar; and creating in the ontology graph an instance of the mapped conceptual entity and instances of its associated parameters.

12. The method of claim 11, when dependent on claim 8, wherein creating in the ontology graph an instance of the mapped conceptual entity and instances of its associated parameters further comprises: creating in the ontology graph a relationship between the created instance and an instance in the ontology graph of another conceptual entity, in accordance with a topological or functional relationship included with the mapped conceptual entity in the ontology grammar.

13. The method of any one of claims 1 to 11, wherein obtaining numerical data representing operation of the system, and updating the ontology graph with values extracted from the numerical data for the parameters associated with instances in the ontology graph, comprises: mapping parameters from the obtained numerical data to parameters in the ontology grammar; and for each parameter from the obtained numerical data that can be mapped to a parameter in the ontology grammar: identifying the instance in the ontology graph to which the mapped parameter from the obtained numerical data relates; and appending the value of the parameter from the obtained numerical data to a values record in the ontology graph for the corresponding parameter of the identified instance, together with a timestamp associated with the value.

14. The method of claim 13, wherein obtaining numerical data representing operation of the system, and updating the ontology graph with values extracted from the numerical data for the parameters associated with instances in the ontology graph, further comprises: for each parameter from the obtained numerical data that cannot be mapped to a parameter in the ontology grammar: identifying a conceptual entity in the ontology grammar to which the parameter from the obtained numerical data relates; updating the ontology grammar to include the parameter from the obtained numerical data; identifying the instance in the ontology graph to which the parameter from the obtained numerical data relates; creating a values record in the ontology graph corresponding to the parameter from the obtained numerical data; and appending the value of the parameter from the obtained numerical data to the created values record, together with a timestamp associated with the value.

15. The method of any one of claims 1 to 14, wherein the set of ontology rules comprises at least one rule describing a correlation between operation of at least one instance in the ontology graph and at least one of: operation of at least one other instance in the ontology graph; generation of at least one log message.

16. The method of any one of claims 1 to 15, wherein extracting, from the obtained numerical data, a set of ontology rules, the set comprising at least one rule describing a pattern of temporal evolution of operation of at least one instance in the ontology graph, comprises: defining a plurality of items based on the obtained numerical data and log messages generated by the system, wherein an item comprises at least one of: generation of a particular log message; a value of a parameter in the ontology graph fulfilling a criterion; identifying items, or groups of items, whose frequency of appearance over one or more time windows satisfies a frequency threshold; generating from the identified items, or groups of items, at least one rule describing at least one of: a pattern of temporal evolution of operation of a single instance in the ontology graph; a plurality of instances in the ontology graph or log messages whose operation or generation demonstrates a pattern of temporal evolution that is correlated.

17. A computer implemented method for managing an incident occurring in a system of logical components, wherein the system is represented by an ontology generated using a method according to any one of claims 4 to 16, the method comprising: obtaining a time window associated with the incident; identifying, from the ontology, all rules in the set of ontology rules that were either created or changed within the time window associated with the incident, and the ontology instances to which the rules refer; grouping the identified rules according to a relationship between the ontology instances to which they refer; for each group, ordering the rules according to the time at which the rule was created or changed; identifying the group having the greatest relevance to the incident; and outputting as a potential explanation of the incident: the ordered sequence of rules in the identified group; and relationships between ontology instances to which the rules in the identified group refer, according to at least one of the ontology graph or the set of ontology rules.

18. The method of claim 17, further comprising: initiating action to correct the incident on the basis of the output potential explanation.

19. The method of claim 17 or 18, wherein grouping the identified rules according to a relationship between the ontology instances to which they refer comprises: generating a first incident ontology graph comprising vertices corresponding to instances in the ontology graph, wherein each pair of vertices in the first incident ontology graph is connected by an edge if an edge between the pair of instances is present in the ontology graph and at least one of the identified rules refers to both of the instances in the pair; generating a second incident ontology graph comprising vertices corresponding to instances in the ontology graph, wherein a weight of an edge between a pair of vertices in the second incident ontology graph is set to be: if a path between the pair of vertices exists in the first incident ontology graph, the distance between the pair of vertices in the first incident ontology graph, and if no path exists between the pair of vertices in the first incident ontology graph, infinite; clustering the vertices of the second incident ontology graph into disjoint sets according to the edge weights between the vertices; and grouping the identified rules such that each group of rules corresponds to a different set.

20. The method of claim 19, wherein grouping the identified rules such that each group of rules corresponds to a different set comprises, for a given set, assigning an identified rule to a group corresponding to the set if the rule refers to an instance that is represented by a vertex in the set.

21. The method of claim 19 or 20, wherein identifying the group having the greatest relevance to the incident comprises: for each set of vertices clustered from the second incident ontology graph: calculating a distance of each vertex in the set to vertices representing instances involved in the incident; and setting a distance of the set from the incident to be the minimum calculated distance for a vertex in the set; and identifying as the group having the greatest relevance to the incident the group corresponding to the set having the smallest distance from the incident.

22. A computer program product comprising a computer readable medium, the computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform a method as claimed in any one of claims 1 to 21.

23. An ontology node for generating an ontology representing a system comprising a plurality of logical components, the ontology node comprising processing circuitry configured to cause the ontology node to: identify, from configuration data for the system, conceptual entities present in the system, and parameters by which their operation is represented, and create an ontology grammar comprising the identified conceptual entities and parameters; map each logical component of the system to a conceptual entity of the ontology grammar, and create an ontology graph comprising an instance of an ontology grammar conceptual entity, and its associated parameters, for each logical component of the system; obtain numerical data representing operation of the system, and update the ontology graph with values extracted from the numerical data for the parameters associated with instances in the ontology graph; and extract, from the obtained numerical data, a set of ontology rules, the set comprising at least one rule describing a pattern of temporal evolution of operation of at least one instance in the ontology graph.

24. The ontology node as claimed in claim 23, wherein the processing circuitry is further configured to cause the ontology node to carry out the steps of any one or more of claims 2 to 16.

25. A management node for managing an incident occurring in a system of logical components, wherein the system is represented by an ontology generated using a method according to any one of claims 4 to 16, the management node comprising processing circuitry configured to cause the management node to: obtain a time window associated with the incident; identify, from the ontology, all rules in the set of ontology rules that were either created or changed within the time window associated with the incident, and the ontology instances to which the rules refer; group the identified rules according to a relationship between the ontology instances to which they refer; for each group, order the rules according to the time at which the rule was created or changed; identify the group having the greatest relevance to the incident; and output as a potential explanation of the incident: the ordered sequence of rules in the identified group; and relationships between ontology instances to which the rules in the identified group refer, according to at least one of the ontology graph or the set of ontology rules.

26. The management node as claimed in claim 25, wherein the processing circuitry is further configured to cause the management node to carry out the steps of any one or more of claims 18 to 21.

Description:
Generating an Ontology for Representing a System

Technical Field

The present disclosure relates to a computer implemented method for generating an ontology representing a system comprising a plurality of logical components. The present disclosure also relates to a computer implemented method for managing an incident occurring in a system of logical components, wherein the system is represented by an ontology generated according to examples of the present disclosure. The methods may be performed by an ontology node and a management node respectively, and the present disclosure also relates to an ontology node, a management node, and to a computer program product configured, when run on a computer, to carry out methods for generating an ontology and for managing an incident.

Background

Incidents within IT infrastructures are a daily occurrence. Such infrastructures may include for example a Network Operation Center (NOC) of an Internet Service Provider (ISP), a datacenter for a cloud provider, or a regional site of a mobile network operator. Incidents occurring within IT infrastructures may be related to failures (for example, a failed hard disk or inaccessible network storage), performance (for example, slow file access speed, or a frequently buffering video stream), or security (for example, unauthorized access to admin portals). Whenever incidents occur in an IT infrastructure, it is the responsibility of the system administrators of the infrastructure to find an explanation for them.

Finding an explanation for an incident is often a complicated task, even for experienced system administrators, owing in part to the extreme complexity of many infrastructures, which may be constructed from many different components, each requiring specific expertise to understand its inner workings. Considering for example a cloud platform, the platform may be built from commercial off-the-shelf (COTS) hardware (HW), with the servers provided by one vendor and the networking equipment by another. Additional components, including power and cooling equipment, are often supplied by a different dedicated vendor. The operating system for the servers and the networking equipment, the cloud platform software, and the applications that run on the cloud platform itself may also all come from different vendors and/or developers, each following their own protocols and standards to develop and package their products. In this setting, if a user experiences frequent buffering of video streams, the problem could be explained by an issue in the streaming application, in other cloud services the streaming application uses, or in the cloud platform itself, or by operating system-level problems, hardware failures, network problems or malicious activity by an intruder. In order for a system administrator to debug the issue and explain its occurrence, he or she would need to be an expert in all of the possible areas that could have contributed to the incident.

Many existing technologies for incident management rely on an ontology of the managed system, which describes in a structured manner the entities in the managed system and their relations.

US 10,601,640 discloses a system that determines a fault in a cloud computing environment based on at least one log parameter and an existing ontology repository. The disclosed system first receives log information describing a cloud computing task, from which the system can generate a stack token (for example, a computer resource of the cloud computing stack). The system maps the identifier of this stack token to the existing ontology in order to discover whether or not the task is a fault. The system also identifies resolution nodes of the stack node by looking at the nodes having a direct connection with the faulty node.

US 10,540,383 and US 10,360,503 propose systems for automatically generating an ontology from input data using a set of predefined ontology templates for a target domain knowledge. The proposed systems also use logical grouping to detect semantic and syntactic classification of data attributes between the created ontology entities. While building the ontology, the references also present methods to analyse and identify the different ontology elements to determine their links and relationships.

Most of the existing solutions for incident analysis using ontologies, and for generating ontologies, suffer from the drawback of requiring human intervention. In addition, many proposed solutions are based on ontologies that are either provided by a user (for example, an ontology engineer), previously extracted from other sources, or based on specific existing ontology templates. Such systems consequently require prior knowledge about both the managed system and the relationships that exist between different elements in the managed system. Finally, incident analysis methods struggle to determine an explanation for incidents that propagate from one element to another in a managed system.

Summary

It is an aim of the present disclosure to provide methods, an ontology node, a management node, and a computer program product which at least partially address one or more of the challenges mentioned above. It is a further aim of the present disclosure to provide methods, an ontology node, a management node, and a computer program product which cooperate to facilitate generation of an ontology and incident management in a system of logical components.

According to a first aspect of the present disclosure, there is provided a computer implemented method for generating an ontology representing a system comprising a plurality of logical components. The method comprises identifying, from configuration data for the system, conceptual entities present in the system, and parameters by which their operation is represented, and creating an ontology grammar comprising the identified conceptual entities and parameters. The method further comprises mapping each logical component of the system to a conceptual entity of the ontology grammar, and creating an ontology graph comprising an instance of an ontology grammar conceptual entity, and its associated parameters, for each logical component of the system. The method further comprises obtaining numerical data representing operation of the system, and updating the ontology graph with values extracted from the numerical data for the parameters associated with instances in the ontology graph, and extracting, from the obtained numerical data, a set of ontology rules, the set comprising at least one rule describing a pattern of temporal evolution of operation of at least one instance in the ontology graph.

According to another aspect of the present disclosure, there is provided a computer implemented method for managing an incident occurring in a system of logical components, wherein the system is represented by an ontology generated using a method according to aspects or examples of the present disclosure. The method for managing an incident comprises obtaining a time window associated with the incident, and identifying, from the ontology, all rules in the set of ontology rules that were either created or changed within the time window associated with the incident, and the ontology instances to which the rules refer. The method further comprises grouping the identified rules according to a relationship between the ontology instances to which they refer, and, for each group, ordering the rules according to the time at which the rule was created or changed. The method further comprises identifying the group having the greatest relevance to the incident, and outputting, as a potential explanation of the incident, the ordered sequence of rules in the identified group, and relationships between ontology instances to which the rules in the identified group refer, according to at least one of the ontology graph or the set of ontology rules.

According to another aspect of the present disclosure, there is provided a computer program product comprising a computer readable medium, the computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform a method according to aspects or examples of the present disclosure.

According to another aspect of the present disclosure, there is provided an ontology node for generating an ontology representing a system comprising a plurality of logical components. The ontology node comprises processing circuitry configured to cause the ontology node to identify, from configuration data for the system, conceptual entities present in the system, and parameters by which their operation is represented, and create an ontology grammar comprising the identified conceptual entities and parameters. The processing circuitry is further configured to cause the ontology node to map each logical component of the system to a conceptual entity of the ontology grammar, and create an ontology graph comprising an instance of an ontology grammar conceptual entity, and its associated parameters, for each logical component of the system. The processing circuitry is further configured to cause the ontology node to obtain numerical data representing operation of the system, and update the ontology graph with values extracted from the numerical data for the parameters associated with instances in the ontology graph. The processing circuitry is further configured to cause the ontology node to extract, from the obtained numerical data, a set of ontology rules, the set comprising at least one rule describing a pattern of temporal evolution of operation of at least one instance in the ontology graph.

According to another aspect of the present disclosure, there is provided a management node for managing an incident occurring in a system of logical components, wherein the system is represented by an ontology generated using a method according to aspects or examples of the present disclosure. The management node comprises processing circuitry configured to cause the management node to obtain a time window associated with the incident, and identify, from the ontology, all rules in the set of ontology rules that were either created or changed within the time window associated with the incident, and the ontology instances to which the rules refer. The processing circuitry is further configured to cause the management node to group the identified rules according to a relationship between the ontology instances to which they refer, and, for each group, order the rules according to the time at which the rule was created or changed. The processing circuitry is further configured to cause the management node to identify the group having the greatest relevance to the incident, and output, as a potential explanation of the incident, the ordered sequence of rules in the identified group, and relationships between ontology instances to which the rules in the identified group refer, according to at least one of the ontology graph or the set of ontology rules.

Aspects of the present disclosure thus provide methods and nodes that facilitate generation of explanations for incidents that occur in systems of logical components. Examples of the present disclosure provide methods for generating an ontology and associated rules of a managed system, and for identifying possible explanations for incidents that occur in the managed system, using the ontology. Examples of the present disclosure may use historical data to build the ontology and the rules of the managed system, and may update, in a substantially continuous manner, the ontology with live data that is collected for the system. In other examples, the ontology may be generated from live data, and updated to improve its accuracy as time goes on. When an incident occurs, example methods of the present disclosure use the generated ontology and rules to identify relevant events in the managed system and build possible explanations for the incident based on the rules and events.

Methods according to the present disclosure provide a scalable, systematic and automatic approach to generating an ontology and rules for a system of logical components without prior knowledge of the system, as well as finding explanations for events or incidents that such a system may experience.

Brief Description of the Drawings

For a better understanding of the present disclosure, and to show more clearly how it may be carried into effect, reference will now be made, by way of example, to the following drawings in which:

Figure 1 is a flow chart illustrating process steps in a computer implemented method for generating an ontology representing a system;

Figures 2a to 2g show flow charts illustrating process steps in further examples of a computer implemented method for generating an ontology representing a system;

Figure 3 is a flow chart illustrating process steps in a computer implemented method for managing an incident occurring in a system;

Figures 4a to 4d show flow charts illustrating process steps in further examples of a computer implemented method for managing an incident occurring in a system;

Figure 5 is a block diagram illustrating functional modules in an example ontology node;

Figure 6 is a block diagram illustrating functional modules in another example ontology node;

Figure 7 is a block diagram illustrating functional modules in an example management node;

Figure 8 is a block diagram illustrating functional modules in another example management node;

Figure 9 illustrates an example implementation architecture for methods according to the present disclosure;

Figure 10 is a flow chart providing an overview of implementation of methods according to the present disclosure;

Figure 11 illustrates an implementation of steps of the methods of Figures 1 and 2a to 2g;

Figure 12 shows a simplified ontology grammar and ontology instances graph;

Figure 13 illustrates an implementation of additional steps of the methods of Figures 1 and 2a to 2g;

Figure 14 shows extension of the ontology grammar and instance graphs from Figure 12;

Figure 15 illustrates an implementation of additional steps of the methods of Figures 1 and 2a to 2g;

Figure 16 illustrates connections between the rules and the instances in the ontology graph to which they refer;

Figure 17 illustrates how rules may be updated using live data;

Figure 18 illustrates implementation of the methods of Figures 3 and 4a to 4d; and

Figures 19 and 20 illustrate graphs that may be created as part of the methods of Figures 3 and 4.

Detailed Description

As discussed above, examples of the present disclosure provide methods for building and discovering an ontology and rules from data generated by a managed system, and identifying explanations for incidents that the system experiences. The generated ontology and rules enable users without expert knowledge or detailed information about a managed system to identify an explanation for a given incident.

The present disclosure provides two families of methods, which serve different but related purposes. The first family of methods is responsible for analysing data collected from a monitored system in order to discover automatically the different entities characterising the system and their relationships. The entities and their relationships are then used to build an ontology and rules without input from a human expert. Data produced by the monitored system, including for example logs, time series metrics, configuration files, etc. is input to the method, and the method generates as output the constructed ontology and rules. The second family of methods is responsible for performing automated analysis of a reported incident that a managed system may have experienced. The analysis is performed using the generated ontology and rules from methods in the first family when receiving an incident report, for example from a fault detector, a performance degradation detector, a ticket submitted by an end-user, etc. The incident report and generated ontology and rules are input to the method, and the method generates as output the identified explanation.

In order to provide additional context to the present disclosure, there now follows a discussion of certain terms appearing in the present specification.

Ontology: Ontologies are widely used to describe entities and their relations in a structured way. An ontology consists of an ontology grammar and a graph representing the ontology instances. Ontologies are typically used in decision making systems to capture relevant background or “expert” knowledge into a graph model.

Ontology grammar: In the context of the present disclosure, an ontology grammar defines concepts and their relationships. Concepts represent the “logical entities in the system” and their properties can also be described by the grammar. Relationships capture the connections between concepts, including associative relationships (for example a virtual machine runs in a server), and taxonomical relationships (for example a virtual machine is a compute entity). An ontology grammar can be expressed using triplets. For example, a simple grammar for cloud computing could define that a virtual machine (subject) runs in (predicate) a physical server (object).
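
By way of illustration, the triplet form of a grammar can be sketched in a few lines of Python. This is a minimal sketch: the class name and the example concepts below are illustrative assumptions, not taken from the disclosure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    """One grammar statement: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str

# A toy cloud-computing grammar expressed as triplets.
grammar = [
    Triplet("VirtualMachine", "runs_in", "PhysicalServer"),   # associative relationship
    Triplet("VirtualMachine", "is_a", "ComputeEntity"),       # taxonomical relationship
    Triplet("VirtualMachine", "has_parameter", "cpu_load"),   # property of the concept
]

for t in grammar:
    print(t.subject, t.predicate, t.obj)
```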

Ontology instances: The instances describe physical or virtual objects or instances of concepts and relations in the ontology grammar. This may include values associated with concepts. In the previously mentioned cloud scenario, the ontology instances graph can contain multiple actual virtual machine instances that run in one or more real physical servers, and these virtual machines might have values regarding uptime or system load.

Rules: Rules characterize different observations or changes of the managed system state, and form the basis on which inferences may be made leading to conclusions and/or knowledge about system functional behaviour. For example, rules could be used to discover interesting (for example, non-typical or new) relations and knowledge between the ontology instances composing the ontology system.

Incident report: Reports contain a description of an experienced incident. According to examples of the present disclosure, an incident report may present data about an experienced incident, the start time of the incident, and the incident duration. An example of an incident report could be: <CPU1 on server1 overload, time1, 2 min>.

Streaming Data: Streaming Data is a type of data that is continuously generated by different sources. It could include both historical and live data.

Historical Data: Historical Data is a set of data collected about past events or past observations that characterize the state (e.g., operations, load, etc.) of the monitored system.

Live Data: Live data is a set of data collected in (near) real time about events or observations that characterize the state of the monitored system.

Schema file: Schema files describe the possible concepts and their properties in a structured description file (for example, JSON format). Schemas may not be available for every system.
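
A hypothetical schema fragment might look as follows. This is a minimal sketch: the JSON layout and the field names are assumptions, as schema formats vary between systems.

```python
import json

# A hypothetical schema fragment (format and field names are assumptions).
schema_text = """
{
  "concepts": {
    "server":          {"properties": ["cpu_load", "free_memory"]},
    "virtual_machine": {"properties": ["uptime", "system_load"]}
  }
}
"""

schema = json.loads(schema_text)
# The known concept names can then be used to recognise terms in configuration data.
print(sorted(schema["concepts"]))
```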

Configuration input: Input from the managed system, describing current configurations.

Metric Data: Time series data collected at specific intervals about variables that reflect the status of the monitored system (for example, memory allocation, network, disk usage).

Log Data: Textual data written in a file following some specific templates when a specific event (for example, errors, exceptions) occurs in a monitored application or program.
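
The sketch below illustrates, under assumed input formats, how a metric sample and a log line could each be reduced to the fields used later by the methods. The formats, metric name and log template here are hypothetical.

```python
import re

# Hypothetical raw inputs; the formats are assumptions, not from the disclosure.
metric_sample = {"name": "node_memory_free_bytes", "value": 2.1e9, "timestamp": 1639000000}
log_line = "2021-12-07T10:15:02Z ERROR storage-service: disk /dev/sdb unreachable"

# Metric data maps directly onto a (parameter, value, timestamp) triple.
parameter, value, ts = metric_sample["name"], metric_sample["value"], metric_sample["timestamp"]

# Log data is textual; a template-style regular expression recovers the fields of interest.
match = re.match(r"(?P<ts>\S+) (?P<level>\w+) (?P<component>[\w-]+): (?P<message>.*)", log_line)
print(parameter, value, ts)
print(match.group("level"), match.group("component"), match.group("message"))
```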

Figure 1 is a flow chart illustrating process steps in a computer implemented method 100 for generating an ontology representing a system comprising a plurality of logical components. The method 100 may be performed by an ontology node, which may comprise a physical or virtual node, and may be implemented in a computing device, server apparatus and/or in a virtualized environment, for example in a cloud, edge cloud or fog deployment. The ontology node may comprise or be instantiated in any part of a network, for example in a logical core network node of a communication network, network management centre, network operations centre, radio access network node, etc. A radio access network node may comprise a base station, eNodeB, gNodeB, or any other current or future implementation of functionality facilitating the exchange of radio network signals between nodes and/or users of a communication network. Any communication network node may itself be divided between several logical and/or physical functions, and any one or more parts of the ontology node may be instantiated in one or more logical or physical functions of a communication network node. The ontology node may therefore encompass multiple logical entities, as discussed in greater detail below.

Referring to Figure 1 , in step 110 the method 100 comprises identifying, from configuration data for the system, conceptual entities present in the system, and parameters by which their operation is represented, and creating an ontology grammar comprising the identified conceptual entities and parameters. The method further comprises, in step 120, mapping each logical component of the system to a conceptual entity of the ontology grammar, and creating an ontology graph comprising an instance of an ontology grammar conceptual entity, and its associated parameters, for each logical component of the system. In step 130, the method comprises obtaining numerical data representing operation of the system, and updating the ontology graph with values extracted from the numerical data for the parameters associated with instances in the ontology graph. The method further comprises, in step 140, extracting, from the obtained numerical data, a set of ontology rules, the set comprising at least one rule describing a pattern of temporal evolution of operation of at least one instance in the ontology graph.

The ontology generated according to the method 100 thus comprises an ontology grammar, an ontology graph including the instances and values, and extracted rules. The numerical data may be live or historical data, and includes metrics and log messages. The numerical data represents operation, or functioning, of the system. The numerical data may for example represent the status of system instances, parameters describing the system or any instance within the system, individual operations being carried out within the system, etc. The “operation of the system” that is represented by the numerical data thus refers to any aspect of the running of the system, including specific system operations, status of system instances, or any parameter that provides information about the system in a live operational state. The numerical data is therefore representative of the dynamic live system, and is distinguished from the configuration data for the system, which is essentially static. References to “operation” of conceptual entities or instances may be understood in a similar manner as referring to the functional behaviour of the entity or instance.

For the purposes of the present disclosure, a conceptual entity comprises a type or category of logical component in the system, so that each logical component of the system can be represented as an instance of a conceptual entity of the ontology grammar. As discussed above, the set of ontology rules extracted in step 140 comprises at least one rule describing “a pattern of temporal evolution of operation of at least one instance in the ontology graph”. Such rules may vary from comparatively simple statements, for example setting normal operational limits for a single performance metric, to more complex associative statements, for example correlating the behaviour over time of multiple metrics and/or log messages.
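
As a rough illustration of the range of rules mentioned above, a rule might be held as a small record linking the instances it refers to, a statement of the pattern, and the creation/change timestamps used later for incident analysis. All field names and example values here are assumptions, not a format given in the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class OntologyRule:
    """One extracted rule together with the instances it refers to."""
    instances: List[str]                 # instances in the ontology graph that the rule mentions
    description: str                     # human-readable statement of the pattern
    created_at: float                    # time at which the rule was created
    changed_at: Optional[float] = None   # time of the most recent change, if any

# A simple single-instance rule: normal operating limits for one parameter.
r1 = OntologyRule(["server1.cpu_load"],
                  "cpu_load stays below 80% in steady state",
                  created_at=1000.0)

# A more complex associative rule correlating two instances and a log message.
r2 = OntologyRule(["vm3.memory_usage", "db1.query_latency"],
                  "rising memory_usage on vm3 co-occurs with latency spikes on db1 "
                  "and with the log message 'OOM warning'",
                  created_at=1005.0)

rule_set = [r1, r2]
print(len(rule_set))
```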

Figures 2a to 2g show flow charts illustrating process steps in further examples of a computer implemented method 200 for generating an ontology representing a system comprising a plurality of logical components. The method 200 provides various examples of how the steps of the method 100 may be implemented and supplemented to achieve the above discussed and additional functionality. As for the method 100, the method 200 may be performed by an ontology node, which may be a physical or virtual node, and which may encompass multiple logical entities, as discussed more fully above with reference to Figure 1.

Referring to Figure 2a, in a first step 210, the ontology node identifies, from configuration data for the system, conceptual entities present in the system, and parameters by which their operation is represented, and creates an ontology grammar comprising the identified conceptual entities and parameters. System configuration data may include any data that sets or controls configurable parameters of any element or instance of the system. This may include configurable parameters of hardware or software, physical and/or virtual system elements or instances of specific elements.

Sub steps that may be carried out by the ontology node in order to perform step 210 are illustrated in Figure 2c.

Referring now to Figure 2c, in step 211, the ontology node checks whether or not a schema file for the system is available. If a schema file for the system is available, then the ontology node identifies terms in the configuration data using the schema file for the system in step 212. Terms in the configuration data may include any data item or element that refers to a specific element or instance of the system. This may include references to specific hardware elements such as servers, virtual elements such as machines, pods, services, etc. If a schema file for the system is not available, then the ontology node may naively identify terms in the configuration data at step 213 (for example by taking all elements of the input to be terms). In step 214, the ontology node then checks whether or not an existing ontology grammar for the system is available. If an existing ontology grammar for the system is available, the ontology node then maps terms from the configuration data to existing conceptual entities in the ontology grammar in step 215. As illustrated at 215a, mapping terms from the configuration data to existing conceptual entities in the ontology grammar may comprise performing a semantic matching operation between terms from the configuration data and conceptual entities in the existing ontology grammar. If an existing ontology grammar for the system is available, and if no mapping for a term to a conceptual entity in the existing ontology grammar can be identified, the ontology node may create a new conceptual entity in the ontology grammar for the term at step 216. In such examples, it will be appreciated that the method 200 may further comprise subsequently updating instances in an existing ontology graph to be consistent with the newly updated ontology grammar.

If no existing ontology grammar for the system is available, the ontology node creates a conceptual entity for each term of the configuration data in step 217.

As illustrated in Figure 2c, creating a conceptual entity in the ontology grammar, whether at step 216 or 217, may comprise associating with the created conceptual entity a parameter by which its operation may be represented in step 218, and including with the conceptual entity at least one topological or functional relationship with another conceptual entity in the ontology grammar in step 219. The topological or functional relationship with another conceptual entity in the ontology grammar may be extracted from the configuration data.
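
A compact sketch of this branching logic follows. The semantic matching operation is stood in for here by a naive string-similarity comparison, whereas an actual implementation might use embeddings or a domain thesaurus; all names and the data layout below are illustrative assumptions.

```python
from difflib import SequenceMatcher

def semantic_match(term, concepts, threshold=0.8):
    """Return the best-matching existing concept name, or None below the threshold."""
    best, best_score = None, 0.0
    for concept in concepts:
        score = SequenceMatcher(None, term.lower(), concept.lower()).ratio()
        if score > best_score:
            best, best_score = concept, score
    return best if best_score >= threshold else None

def build_grammar(config_terms, existing_grammar=None):
    """Map configuration terms onto an (optional) existing grammar, creating a new
    conceptual entity for every term that cannot be mapped (roughly, steps 215-217)."""
    grammar = dict(existing_grammar) if existing_grammar else {}
    for term in config_terms:
        mapped = semantic_match(term, grammar) if grammar else None
        if mapped is None:
            # New conceptual entity; parameters and relationships are filled in later.
            grammar[term] = {"parameters": [], "relations": []}
    return grammar

existing = {"Server": {"parameters": ["cpu_load"], "relations": []}}
print(build_grammar(["server", "virtual-machine"], existing))
```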

Referring again to Figure 2a, and having created the ontology grammar at step 210, the ontology node then maps, in step 220, each logical component of the system to a conceptual entity of the ontology grammar, and creates an ontology graph comprising an instance of an ontology grammar conceptual entity, and its associated parameters, for each logical component of the system. Sub steps that may be carried out by the ontology node in order to perform step 220 are illustrated in Figure 2d.

Referring now to Figure 2d, the ontology node performs two steps for each term of the configuration data, as illustrated at 223. The steps comprise, at 221, mapping the term to a conceptual entity of the ontology grammar, and, at 222, creating in the ontology graph an instance of the mapped conceptual entity and instances of its associated parameters. As for updating an existing ontology grammar, mapping terms from the configuration data to conceptual entities in the ontology grammar may comprise performing a semantic matching operation between terms from the configuration data and conceptual entities in the ontology grammar.

As illustrated at 222a, creating in the ontology graph an instance of the mapped conceptual entity and instances of its associated parameters may further comprise creating in the ontology graph a relationship between the created instance and an instance in the ontology graph of another conceptual entity, in accordance with a topological or functional relationship included with the mapped conceptual entity in the ontology grammar.
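
The following sketch shows one possible in-memory shape for the ontology graph and how an instance, its parameter records, and a relationship to another instance might be created. The data structures and names are assumptions for illustration only.

```python
# One possible in-memory shape for the ontology graph (illustrative only).
ontology_graph = {"instances": {}, "edges": []}

grammar = {
    "PhysicalServer": {"parameters": ["cpu_load"], "relations": []},
    "VirtualMachine": {"parameters": ["uptime"], "relations": [("runs_in", "PhysicalServer")]},
}

def add_instance(name, concept):
    """Create an instance of a mapped conceptual entity together with empty value
    records for its parameters, and recreate its grammar-level relationships."""
    ontology_graph["instances"][name] = {
        "concept": concept,
        "values": {p: [] for p in grammar[concept]["parameters"]},
    }
    for predicate, other_concept in grammar[concept]["relations"]:
        for other_name, other in ontology_graph["instances"].items():
            if other["concept"] == other_concept:
                ontology_graph["edges"].append((name, predicate, other_name))

add_instance("server1", "PhysicalServer")
add_instance("vm1", "VirtualMachine")
print(ontology_graph["edges"])   # [('vm1', 'runs_in', 'server1')]
```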

Referring again to Figure 2a, following creation of the ontology graph in step 220, the ontology node then, in step 230, obtains numerical data representing operation, or behaviour, of the system, and updates the ontology graph with values extracted from the numerical data for the parameters associated with instances in the ontology graph.

Sub steps that may be carried out by the ontology node in order to perform step 230 are illustrated in Figures 2e and 2f.

Referring initially to Figure 2e, in step 231, the ontology node maps parameters from the obtained numerical data to parameters in the ontology grammar. The ontology node then performs steps 232 and 233 for each parameter from the obtained numerical data that can be mapped to a parameter in the ontology grammar, as illustrated at step 232a. In step 232, the ontology node identifies the instance in the ontology graph to which the mapped parameter from the obtained numerical data relates. In step 233, the ontology node then appends the value of the parameter from the obtained numerical data to a values record in the ontology graph for the corresponding parameter of the identified instance, together with a timestamp associated with the value. The obtained data may for example be streaming data, i.e., continuously generated data from logical components in the system. The data may be metrics and/or data contained in log messages, etc., where "parameter" refers to the term in the data, for example, "free memory", and the value comprises the actual numerical value for the term. The timestamp may be obtained in any suitable manner from the data, for example it may be included with a monitored metric, or it may be the timestamp of a received log message in which the value was included.

Referring now to Figure 2f, the ontology node then performs each of steps 234 to 238 for each parameter from the obtained numerical data that cannot be mapped to a parameter in the ontology grammar, as illustrated at 234a. In step 234, the ontology node identifies a conceptual entity in the ontology grammar to which the parameter from the obtained numerical data relates. The ontology node then updates the ontology grammar to include the parameter from the obtained numerical data in step 235. In step 236, the ontology node identifies the instance in the ontology graph to which the parameter from the obtained numerical data relates, and in step 237, the ontology node creates a values record in the ontology graph corresponding to the parameter from the obtained numerical data. Finally, in step 238, the ontology node appends the value of the parameter from the obtained numerical data to the created values record, together with a timestamp associated with the value. It will be appreciated that steps 234 to 238 illustrated in Figure 2f may also be performed by the ontology node when updating the ontology graph using live operational data in step 260, as discussed in further detail below.
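
A minimal sketch of this value-update handling follows, assuming the same toy data structures as the earlier ontology-graph sketch; it appends timestamped values and extends the grammar and graph when a parameter is not yet known.

```python
# Toy grammar and graph in the same assumed shape as the previous sketch.
grammar = {"PhysicalServer": {"parameters": ["cpu_load"], "relations": []}}
graph = {"instances": {"server1": {"concept": "PhysicalServer",
                                   "values": {"cpu_load": []}}},
         "edges": []}

def update_graph_with_value(graph, grammar, instance, parameter, value, timestamp):
    """Append (value, timestamp) to the values record of `parameter` on `instance`,
    extending the grammar and creating a new values record if the parameter is not
    yet known (roughly, steps 231 to 238)."""
    inst = graph["instances"][instance]
    concept = grammar[inst["concept"]]
    if parameter not in concept["parameters"]:
        concept["parameters"].append(parameter)    # update the ontology grammar
        inst["values"][parameter] = []              # create a new values record
    inst["values"][parameter].append((value, timestamp))

update_graph_with_value(graph, grammar, "server1", "cpu_load", 0.42, 1639000000)
update_graph_with_value(graph, grammar, "server1", "free_memory", 2048, 1639000001)
print(graph["instances"]["server1"]["values"])
```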

Referring again to Figure 2a, after updating the ontology graph in step 230, the ontology node then extracts, from the obtained numerical data, a set of ontology rules in step 240, the set comprising at least one rule describing a pattern of temporal evolution of operation of at least one instance in the ontology graph. As illustrated at 240a, the set of ontology rules may comprise at least one rule describing a correlation between operation of at least one instance in the ontology graph and at least one of operation of at least one other instance in the ontology graph and/or generation of at least one log message. As discussed above, in step 240, the “operation of at least one instance” refers to the running of the instance, which may be described by its status, parameters describing how the instance is running, operations carried out by the instance, etc. For example, operation (or behaviour) of a database instance may be described by several parameters, and the value of one of those parameters may be correlated with a value describing operation (or behaviour) of another instance, or with a log message, etc. The “operation” of an instance referred to in step 240 thus encompasses any aspect of the status, behaviour or running of a particular instance.

Sub steps that may be carried out by the ontology node in order to perform the rule extraction of step 240 are illustrated in Figure 2g.

Referring to Figure 2g, in step 241, the ontology node defines a plurality of items based on the obtained numerical data and log messages generated by the system. An item comprises at least one of generation of a particular log message, as shown at 241a, and/or a value of a parameter in the ontology graph fulfilling a criterion, as illustrated at 241b. The ontology node then identifies in step 242 items, or groups of items, whose frequency of appearance over one or more time windows satisfies a frequency threshold. The ontology node then generates from the identified items, or groups of items, at least one rule in step 243. As illustrated at 243a, the at least one rule describes at least one of a pattern of temporal evolution of operation of a single instance in the ontology graph, and/or a plurality of instances in the ontology graph or log messages whose operation or generation demonstrates a pattern of temporal evolution that is correlated.

For the purposes of the present disclosure, temporal evolution of operation of an instance in the ontology graph refers to how the operation, running or behaviour of the instance, as represented by its parameter values, changes over time. For example, a rule for a single instance may indicate normal limits for operational parameters for that instance. A rule for correlation of multiple instances and/or log messages may indicate associations between instance operation or behaviour and/or the generation of log messages. For example, this may include system components whose operation tends to vary in the same manner, or log messages that are frequently observed when a particular system component, or group of components, is exhibiting particular operational behaviour.
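
A simplified sketch of the item-frequency idea is shown below; it counts single items and pairs of items across time windows and keeps those meeting a support threshold. The item naming scheme and the threshold are assumptions, and a real implementation would likely use a proper frequent-itemset algorithm rather than this brute-force enumeration.

```python
from collections import Counter
from itertools import combinations

def mine_rules(items_per_window, min_support=0.6):
    """items_per_window: one set of observed items per time window, where an item is,
    for example, 'log:OOM warning' or 'vm1.cpu_load>0.9'. Returns the items and pairs
    of items whose frequency of appearance across windows meets the support threshold."""
    counts = Counter()
    for window in items_per_window:
        for size in (1, 2):                           # single items and pairs of items
            for group in combinations(sorted(window), size):
                counts[group] += 1
    n = len(items_per_window)
    return [group for group, count in counts.items() if count / n >= min_support]

windows = [
    {"vm1.cpu_load>0.9", "log:OOM warning"},
    {"vm1.cpu_load>0.9", "log:OOM warning", "db1.latency>100ms"},
    {"vm1.cpu_load>0.9"},
]
print(mine_rules(windows))
```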

Referring now to Figure 2b, and after extracting a set of ontology rules in step 240, the ontology node then obtains numerical data representing live operation of the system in step 250. In step 260, the ontology node updates the ontology graph and the set of ontology rules using the live operational data. Updating the ontology graph using the live operational data in step 260 may comprise performing the steps 234 to 238 illustrated in Figure 2f and discussed above.

As illustrated in Figure 2b, updating the ontology graph and the set of ontology rules using the live operational data may comprise, at step 261, comparing the obtained live operational data to the set of ontology rules. The ontology node may then, for live operational data representing a pattern that is not represented in an existing rule of the set of ontology rules (as shown at 262a), add a new rule for the pattern to the set of ontology rules in step 262. As illustrated, the ontology node may additionally record a time at which the new rule is created. For live operational data that is inconsistent with at least one rule of the set of ontology rules (as shown at 263a), the ontology node may change the rule to correspond to the live operational data. As above, the ontology node may additionally record a time at which the existing rule is changed. Changing a rule may comprise invalidating it, reversing it, adjusting parameter values stipulated in the rule, etc.
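
The update logic might be sketched as follows, assuming each rule is a small record carrying creation and change timestamps as in claim 4. The record layout and the way consistency with live data is signalled are assumptions for illustration.

```python
import time

def update_rules(rule_set, observed_patterns):
    """Compare live observations against the rule set: add a rule for each pattern not
    yet represented, and mark rules contradicted by the live data as changed, recording
    the time of every creation or change."""
    now = time.time()
    known = {rule["pattern"] for rule in rule_set}
    for pattern, consistent in observed_patterns:
        if pattern not in known:
            rule_set.append({"pattern": pattern, "created_at": now, "changed_at": None})
        elif not consistent:
            for rule in rule_set:
                if rule["pattern"] == pattern:
                    rule["changed_at"] = now   # e.g. limits adjusted, or rule invalidated
    return rule_set

rules = [{"pattern": "server1.cpu_load<0.8", "created_at": 0.0, "changed_at": None}]
# Each observation: (pattern seen in live data, is it consistent with the existing rule?)
update_rules(rules, [("server1.cpu_load<0.8", False), ("db1.latency<100ms", True)])
print(rules)
```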

The method 200 thus enables automated generation and updating of an ontology of a system of logical components. This ontology may be used for a range of purposes, including in one example incident analysis and management.

Figure 3 is a flow chart illustrating process steps in a computer implemented method 300 for managing an incident occurring in a system of logical components, wherein the system is represented by an ontology generated using a method according to any of the aspects or examples of the present disclosure. The method 300 may be performed by a management node, which may comprise a physical or virtual node, and may be implemented in a computing device, server apparatus and/or in a virtualized environment, for example in a cloud, edge cloud or fog deployment. The management node may comprise or be instantiated in any part of a network, for example in a logical core network node of a communication network, network management centre, network operations centre, radio access network node, etc. A radio access network node may comprise a base station, eNodeB, gNodeB, or any other current or future implementation of functionality facilitating the exchange of radio network signals between nodes and/or users of a communication network. Any communication network node may itself be divided between several logical and/or physical functions, and any one or more parts of the management node may be instantiated in one or more logical or physical functions of a communication network node. The management node may therefore encompass multiple logical entities, as discussed in greater detail below.

Referring to Figure 3, the method 300 comprises obtaining a time window associated with the incident in step 310, and identifying, from the ontology, all rules in the set of ontology rules that were either created or changed within the time window associated with the incident, and the ontology instances to which the rules refer, in step 320. The method 300 further comprises, in step 330, grouping the identified rules according to a relationship between the ontology instances to which they refer. The relationship may be topological or functional, and may be contained in the ontology graph (having been extracted from configuration data for the system) or may be contained in the ontology rules. For example, two instances may be connected by multiple rules, only one of which appears in the group that is output as a potential cause of the incident. Other rules that connect the two instances may provide a relationship between the instances which can be used to help interpret the potential explanation for the incident.

In step 340, the method 300 comprises, for each group, ordering the rules according to the time at which the rule was created or changed. The method 300 further comprises identifying the group having the greatest relevance to the incident in step 350 and, in step 360, outputting as a potential explanation of the incident the ordered sequence of rules in the identified group, and relationships between ontology instances to which the rules in the identified group refer, according to at least one of the ontology graph or the set of ontology rules.

Figures 4a to 4d show flow charts illustrating process steps in further examples of a computer implemented method 400 for managing an incident occurring in a system of logical components, wherein the system is represented by an ontology generated using a method according to any of the aspects or examples of the present disclosure. The method 400 provides various examples of how the steps of the method 300 may be implemented and supplemented to achieve the above discussed and additional functionality. As for the method 300, the method 400 is performed by a management node, which may be a physical or virtual node, and which may encompass multiple logical entities, as discussed more fully above with reference to Figure 3.

Referring to Figure 4a, in a first step 410, the management node obtains a time window associated with the incident. The management node then identifies, from the ontology, all rules in the set of ontology rules that were either created or changed within the time window associated with the incident, and the ontology instances to which the rules refer, in step 420. In step 430, the management node groups the identified rules according to a relationship between the ontology instances to which they refer.

Sub steps that may be carried out by the management node in order to perform the grouping of step 430 are illustrated in Figure 4c.

Referring to Figure 4c, in step 431 , the management node generates a first incident ontology graph comprising vertices corresponding to instances in the ontology graph. As illustrated at 431a, each pair of vertices in the first incident ontology graph is connected by an edge if an edge between the pair of instances is present in the ontology graph and at least one of the identified rules refers to both of the instances in the pair. In step 432, the management node then generates a second incident ontology graph comprising vertices corresponding to instances in the ontology graph. As illustrated at 432a, a weight of an edge between a pair of vertices in the second incident ontology graph is set to be: if a path between the pair of vertices exists in the first incident ontology graph, the distance between the pair of vertices in the first incident ontology graph, and if no path exists between the pair of vertices in the first incident ontology graph, infinite.

The first incident ontology graph, G^O, is consequently specific to the particular incident, as the edges are dependent on rules that were identified on the basis of the incident time window. The second incident ontology graph, G^Δ, is also specific to the incident, and captures connections between instances in the form of paths between instances (represented by vertices) in the first incident ontology graph. The distance between a pair of vertices in the second incident ontology graph may comprise the number of hops on the shortest path between the vertices.
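As a purely illustrative sketch of step 432, and assuming the networkx library is available, the second incident ontology graph could be derived from the first as follows; the function name and the choice of hop count as the distance measure are assumptions of this sketch rather than requirements of the disclosure.

import math
import networkx as nx

def build_second_incident_graph(g_first: nx.Graph) -> nx.Graph:
    """Step 432: build the complete weighted graph G^Δ whose edge weights are hop
    distances in the first incident ontology graph G^O, or infinity when no path exists."""
    g_second = nx.Graph()
    g_second.add_nodes_from(g_first.nodes)
    # All-pairs shortest path lengths (number of hops) on the first incident ontology graph.
    distances = dict(nx.all_pairs_shortest_path_length(g_first))
    nodes = list(g_first.nodes)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            weight = distances[u].get(v, math.inf)
            g_second.add_edge(u, v, weight=weight)
    return g_second

An infinite weight simply records that the two instances were not connected, directly or via induced rules, during the incident time window.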

In step 433, the management node clusters the vertices of the second incident ontology graph into disjoint sets according to the edge weights between the vertices. The management node then groups the identified rules such that each group of rules corresponds to a different set in step 434. As illustrated at 434a, grouping the identified rules such that each group of rules corresponds to a different set may comprise, for a given set, assigning an identified rule to a group corresponding to the set if the rule refers to an instance that is represented by a vertex in the set.

Referring again to Figure 4a, following the grouping of the identified rules in step 430, the management node then, for each group, orders the rules according to the time at which the rule was created or changed.

Referring now to Figure 4b, in step 450, the management node identifies the group having the greatest relevance to the incident. Sub steps that may be carried out by the management node in order to perform the identification of step 450 are illustrated in Figure 4d.

Referring to Figure 4d, the management node may perform steps 451 and 452 for each set of vertices clustered from the second incident ontology graph. Step 451 comprises calculating a distance of each vertex in the set to vertices representing instances involved in the incident. Such instances may for example be identified in an incident report. Step 452 comprises setting a distance of the set from the incident to be the minimum calculated distance for a vertex in the set. After completing steps 451 and 452 for each set of vertices clustered from the second incident ontology graph, the management node then identifies as the group having the greatest relevance to the incident, the group corresponding to the set having the smallest distance from the incident.

Referring again to Figure 4b, in step 460, the management node outputs as a potential explanation of the incident: the ordered sequence of rules in the identified group, and relationships between ontology instances to which the rules in the identified group refer, according to at least one of the ontology graph or the set of ontology rules. In step 470, the management node may then initiate action to correct the incident on the basis of the output potential explanation.

As discussed above, the methods 100 and 200 may be performed by an ontology node, and the present disclosure provides an ontology node that is adapted to perform any or all of the steps of the above discussed methods. The ontology node may be a physical or virtual node, and may for example comprise a virtualised function that is running in a cloud, edge cloud or fog deployment. The ontology node may for example comprise or be instantiated in any part of a logical core network node, network management centre, network operations centre, radio access node, etc. Any such communication network node may itself be divided between several logical and/or physical functions, and any one or more parts of the management node may be instantiated in one or more logical or physical functions of a communication network node.

Figure 5 is a block diagram illustrating an example ontology node 500 which may implement the method 100 and/or 200, as illustrated in Figures 1 and 2a to 2g, according to examples of the present disclosure, for example on receipt of suitable instructions from a computer program 550. Referring to Figure 5, the ontology node 500 comprises a processor or processing circuitry 502, and may comprise a memory 504 and interfaces 506. The processing circuitry 502 is operable to perform some or all of the steps of the method 100 and/or 200 as discussed above with reference to Figures 1 and 2a to 2g. The memory 504 may contain instructions executable by the processing circuitry 502 such that the ontology node 500 is operable to perform some or all of the steps of the method 100 and/or 200, as illustrated in Figures 1 and 2a to 2g. The instructions may also include instructions for executing one or more telecommunications and/or data communications protocols. The instructions may be stored in the form of the computer program 550. In some examples, the processor or processing circuitry 502 may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include digital signal processors (DSPs), special-purpose digital logic, etc. The processor or processing circuitry 502 may be implemented by any type of integrated circuit, such as an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), etc. The memory 504 may include one or several types of memory suitable for the processor, such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, solid state disk, hard disk drive, etc.

Figure 6 illustrates functional modules in another example of ontology node 600 which may execute examples of the methods 100 and/or 200 of the present disclosure, for example according to computer readable instructions received from a computer program. It will be understood that the modules illustrated in Figure 6 are functional modules, and may be realised in any appropriate combination of hardware and/or software, including for example the hardware and software illustrated in Figure 5. The modules may comprise one or more processors and may be integrated to any degree. Referring to Figure 6, the ontology node 600 is for generating an ontology representing a system comprising a plurality of logical components. The ontology node comprises an ontology builder 610. The ontology builder 610 comprises a grammar module 612 for identifying, from configuration data for the system, conceptual entities present in the system, and parameters by which their operation is represented, and creating an ontology grammar comprising the identified conceptual entities and parameters. The ontology builder 610 further comprises a graph module 614 for mapping each logical component of the system to a conceptual entity of the ontology grammar, and creating an ontology graph comprising an instance of an ontology grammar conceptual entity, and its associated parameters, for each logical component of the system. The ontology builder 610 further comprises a data module for obtaining numerical data representing operation of the system, and updating the ontology graph with values extracted from the numerical data for the parameters associated with instances in the ontology graph. The ontology node 600 further comprises a rules generator 620 for extracting, from the obtained numerical data, a set of ontology rules, the set comprising at least one rule describing a pattern of temporal evolution of operation of at least one instance in the ontology graph. The ontology node 600 may further comprise interfaces 630 which may be operable to facilitate communication with computing and/or other network nodes, including for example management nodes 700, 800, over suitable communication channels.

As discussed above, the methods 300 and 400 may be performed by a management node, and the present disclosure provides a management node that is adapted to perform any or all of the steps of the above discussed methods. The management node may be a physical or virtual node, and may for example comprise a virtualised function that is running in a cloud, edge cloud or fog deployment. The management node may for example comprise or be instantiated in any part of a logical core network node, network management centre, network operations centre, radio access node, etc. Any such communication network node may itself be divided between several logical and/or physical functions, and any one or more parts of the management node may be instantiated in one or more logical or physical functions of a communication network node.

Figure 7 is a block diagram illustrating an example management node 700 which may implement the method 300 and/or 400, as illustrated in Figures 3 and 4a to 4d, according to examples of the present disclosure, for example on receipt of suitable instructions from a computer program 750. Referring to Figure 7, the management node 700 comprises a processor or processing circuitry 702, and may comprise a memory 704 and interfaces 706. The processing circuitry 702 is operable to perform some or all of the steps of the method 300 and/or 400 as discussed above with reference to Figures 3 and 4a to 4d. The memory 704 may contain instructions executable by the processing circuitry 702 such that the management node 700 is operable to perform some or all of the steps of the method 300 and/or 400, as illustrated in Figures 3 and 4a to 4d. The instructions may also include instructions for executing one or more telecommunications and/or data communications protocols. The instructions may be stored in the form of the computer program 750. In some examples, the processor or processing circuitry 702 may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include digital signal processors (DSPs), special-purpose digital logic, etc. The processor or processing circuitry 702 may be implemented by any type of integrated circuit, such as an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), etc. The memory 704 may include one or several types of memory suitable for the processor, such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, solid state disk, hard disk drive, etc.

Figure 8 illustrates functional modules in another example of management node 800 which may execute examples of the methods 300 and/or 400 of the present disclosure, for example according to computer readable instructions received from a computer program. It will be understood that the modules illustrated in Figure 8 are functional modules, and may be realised in any appropriate combination of hardware and/or software, including for example the hardware and software illustrated in Figure 7. The modules may comprise one or more processors and may be integrated to any degree.

Referring to Figure 8, the management node 800 is for managing an incident occurring in a system of logical components, wherein the system is represented by an ontology generated using a method according to examples of the present disclosure. The management node 800 comprises an incident analyser 810. The incident analyser 810 comprises a window module 812 for obtaining a time window associated with the incident and for identifying, from the ontology, all rules in the set of ontology rules that were either created or changed within the time window associated with the incident, and the ontology instances to which the rules refer. The incident analyser 810 further comprises a group module 814 for grouping the identified rules according to a relationship between the ontology instances to which they refer, and, for each group, ordering the rules according to the time at which the rule was created or changed. The incident analyser further comprises an explanation module 816 for identifying the group having the greatest relevance to the incident, and outputting as a potential explanation of the incident the ordered sequence of rules in the identified group, and relationships between ontology instances to which the rules in the identified group refer, according to at least one of the ontology graph or the set of ontology rules. The management node 800 may further comprise interfaces 820 which may be operable to facilitate communication with computing and/or other network nodes, including for example ontology nodes 500, 600, over suitable communication channels.

Figures 1 to 4d discussed above provide an overview of methods which may be performed according to different examples of the present disclosure. These methods may be performed by an ontology node and a management node, as illustrated in Figures 5 to 8, and enable automated generation of an ontology representing a system, and management of incidents occurring within the system.

There now follows a detailed discussion of how different process steps illustrated in Figures 1 to 4d and discussed above may be implemented, for example by ontology and management nodes 500, 600, 700 or 800. The following discussion presents implementation of the methods using a cloud platform as an example system. However, it will be appreciated that the methods discussed may equally be applied to any IT infrastructure.

Figure 9 illustrates an example implementation architecture for the methods 100, 200, 300, 400 enabling the automatic generation of an ontology including rules describing a cloud system, without prior knowledge of the system, and using data generated by the cloud system to gain insights allowing identification of an explanation for a reported incident. Figure 9 illustrates three principal modules: the Ontology Builder 910 and Rules Generator 920, comprised within an ontology node 930, and the Incident Analyzer 940, comprised within a management node 950. The input data to the ontology node 930 and management node 950 are (a) streaming data and configuration data 960 from the managed system, and (b) an incident report 970 that may be considered to trigger performance of the methods 300, 400 to find an explanation for the reported incident. It will be appreciated that the cloud system is assumed to be equipped with a monitoring system that collects streaming data during operation of the system. The streaming data can include both historical and live data from log files, time series metrics, etc. The output of the methods 100, 200 is an ontology representing the cloud system, and the output of the methods 300, 400 is one or more explanations for a reported incident. There now follows a brief discussion of the individual modules illustrated in Figure 9, and discussed above with reference to Figures 5 to 8.

The Ontology Builder module 910, 610 processes data collected from the cloud system to identify (1) the ontology grammar, (2) the main instances characterising the managed system and their characteristics (for example, values), and (3) the different relationships between those instances. The identified instances and their relationships are organized into an ontology graph. The input data to this module are the configuration data and the streaming data (logs, metrics, etc.) that may be collected by a cloud monitoring component. The output data is the Ontology including the description of the grammar, values and relationships of the identified instances belonging to the managed cloud system. This module may also use live data to continuously update the generated ontology graph.

The Rules Generator module 920, 620 uses the provided streaming data as input to identify the rules that characterize the different observations or changes the cloud system experiences, and reflects these changes on the generated ontology graph. The output of this module is the set of Ontology Rules. The generated rules detect important relationships between the ontology instances (for example, normal operating ranges for a metric, metrics that deviate together, events observed at the same time, etc.).

The Incident Analyzer module 940, 810 receives an incident report and finds in the ontology graph generated by the Ontology Builder 910, 610 the entities and relations corresponding to the instance identified in the incident report, in order to find the circumstances that led to the incident. The Incident Analyzer 940, 810 retrieves relevant rules that were created or changed when the instance encountered the reported incident, and identifies other instances in the system that have connections with this instance to build possible explanations. The output of this module is the possible explanation of the reported incident. The explanation can help identify the appropriate procedures to fix the underlying issue.

Figure 10 is a flow chart providing an overview of implementation of the methods disclosed herein: ontology generation via methods 100 or 200, and incident management via methods 300 or 400.

Referring to Figure 10, in a first step 1, the monitoring function of the example cloud system retrieves and makes available configuration data for the cloud system and streaming data generated by the cloud system. The streaming data can be classified into (1) “historical data”, used for generating the system ontology and rules and (2) “live data”, used for continuously updating the generated ontology grammar, graph, and rules. In the absence of historical data, the methods 100, 200 may use the live data to both create and update the ontology grammar, graph, and rules. The following description illustrates an example implementation in which streaming data is used in the form of log files and metrics values.

In a second step 2, the configuration input data is used to first identify the entities that belong to the cloud system and their specific relationships that will form the ontology grammar (steps 110, 210 of the methods 100, 200). Next, the different instances of the entities are determined and mapped to the created ontology grammar in order to create the ontology graph (steps 120, 220 of methods 100, 200).

In step 3, provided with metrics and numerical log values from streaming data, the values/measurements of the ontology instances are determined and updated in the ontology graph (steps 130, 230 of methods 100, 200).

In step 4, the metrics values and log messages are used to discover associations between the generated ontology instances and their values. The goal in this step is to identify strong rules using the collected measurement data and induce, if possible, knowledge that can help in analysing changes experienced by the cloud system (steps 140, 240 of methods 100, 200).

In step 5, when triggered by an external system (for example, fault detector, performance degradation detector, human reported incidents), a received incident report is analysed to find in the ontology graph the entities and relations modelling the incident report instance, in order to identify circumstances that led to the incident. Possible explanations for the observed incident are identified by looking at the sequence of rules and ontology instances captured when the incident was detected (methods 300, 400).

Example implementation of each of the above steps is discussed in further detail below. As above, the following description demonstrates an example implementation of the individual method steps with reference to an example use case for the methods 100 to 400. The example use case is a clustered database service running in pods in a Kubernetes environment. The following discussion demonstrates an ontology including a minimal set of monitored metrics regarding this service, how rules are discovered and finally, how an example incident can be analysed. It will be appreciated that this example use case is merely for the purpose of illustration, and a wide range of other use cases in different technical domains may be envisaged. The methods 100 to 400 may be employed for any system of logical components.

Ontology Building (Steps 110 to 130, 210 to 230 of methods 100, 200)

The goal of these method steps is to first build an ontology grammar and identify ontology instances using configuration data, then further extend the instances by determining values using measured metrics and numerical values in log messages.

Ontology Grammar and Instances (Steps 110, 210, 120, 220 of methods 100, 200)

Figure 11 illustrates an implementation of steps 110, 210, 120, 220 of the methods described herein. The input is expected to be in a structured format (such as JSON) describing entities and their properties in the underlying system. As an example, the configuration input can be based on currently running virtual machines and their properties in an OpenStack cloud, or the currently running pods and their properties in a Kubernetes deployment. The configuration input can also include other relevant information, such as the networking topology between pods in the previous example. It is expected that implementations of the methods 100, 200 can use unique identifiers to match pieces from the different configuration inputs. These steps involve identifying common terms in the configuration input. For example, after observing multiple pod descriptions, the pod and its possible characteristics will become part of the ontology grammar, and the individual instances of pods that are instantiated in the system will become part of the instances graph. There now follows a detailed discussion of Figure 11, with reference to the individual method steps of the methods 100, 200 that are being implemented. Steps A, B and C of Figure 11 implement examples of steps 110 and 210 of the methods 100, 200.

Step A: In case there is no schema file available for the input (step 211), the information available in the configuration data input is converted to terms naively (step 213) by taking all the elements of the input and sending them to the next step. It may be noted that while structured files build on key-value pairs, some of the “values” in such pairs may in fact become part of the ontology. For example, Kubernetes configuration inputs have a key called “kind” that specifies the object that the configuration is related to (pod, deployment, etc.).
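The fragment below gives a minimal sketch of this naive term extraction, assuming a Kubernetes-style pod descriptor as the structured configuration input; the descriptor content and the extract_terms helper are hypothetical and provided only to illustrate step 213.

import json

# Hypothetical Kubernetes-style pod descriptor used purely for illustration.
pod_descriptor = json.loads("""
{
  "kind": "Pod",
  "metadata": {"name": "database-instance-1", "labels": {"app": "database"}},
  "spec": {"containers": [{"name": "db", "image": "postgres:14"}]}
}
""")

def extract_terms(config, prefix=""):
    """Naively flatten a structured configuration input into candidate grammar terms (step 213)."""
    terms = set()
    if isinstance(config, dict):
        for key, value in config.items():
            terms.add(key)
            terms |= extract_terms(value, prefix=key)
    elif isinstance(config, list):
        for item in config:
            terms |= extract_terms(item, prefix=prefix)
    elif prefix == "kind":
        # Some "values" become grammar terms themselves, e.g. the Kubernetes "kind" field.
        terms.add(str(config).lower())
    return terms

# Yields terms such as {'kind', 'pod', 'metadata', 'name', 'labels', 'app', 'spec', 'containers', 'image'}
print(extract_terms(pod_descriptor))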

Step B: Schema files provide details about the structure of individual configuration inputs, for example listing optional components that are not part of the current input or detailing data types and ranges. In this step, the configuration data input is enriched using the schema file, if it is available, to provide a more quickly converging grammar (step 212), then the terms are identified and provided for the next step.

Step C: This step maps the received terms to the existing ontology grammar (step 215). It will be appreciated that the ontology grammar may be empty when processing the first input. In order to achieve increased flexibility, instead of direct word matching, semantic matching of terms can also be applied (step 215a). Semantic matching operates with a heuristic function that gives a similarity score for terms. A configurable constant can then be used to decide if the terms are the same. For example, “pod” and “pods” might be considered to be the same in the grammar. As a result of this step, it is decided if the grammar needs to be updated, for example if an incoming term cannot be matched to a grammar entity (steps 216, 217).
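A minimal sketch of such semantic matching is given below, assuming Python's standard difflib string similarity as the heuristic function and 0.8 as the configurable constant; the disclosure does not prescribe this particular heuristic, threshold or the helper names used here.

from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # configurable constant deciding when two terms are "the same"

def similarity(term_a, term_b):
    """Heuristic similarity score between two terms, in the range [0, 1]."""
    return SequenceMatcher(None, term_a.lower(), term_b.lower()).ratio()

def match_to_grammar(term, grammar_terms):
    """Return the grammar term that the incoming term maps to, or None if the grammar needs updating."""
    best = max(grammar_terms, key=lambda g: similarity(term, g), default=None)
    if best is not None and similarity(term, best) >= SIMILARITY_THRESHOLD:
        return best
    return None

# "pods" is matched to the existing grammar term "pod"; an unmatched term triggers steps 216, 217.
print(match_to_grammar("pods", ["pod", "service", "node"]))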

Steps D, E and F of Figure 11 implement examples of steps 120 and 220 of the methods 100, 200.

Step D: This step of the method creates or updates an ontology instance (steps 120, 220). The terms from the current configuration should now map to the ontology grammar. Using the previous pod example, the term pod and its related terms such as id, image, etc. are already part of the ontology grammar. In this step instances of these terms are created for the actual pod referred to by the configuration input, or if they already exist, the related terms might be updated. As an example, if a network is reconfigured for an already running deployment, the related terms are updated at this step.

Step E: In this step the grammar is updated, either by adding new previously unknown terms, or by changing or merging terms. The existing grammar, ontology instances and the new input are evaluated together to determine if updates are appropriate. Computer systems typically use unique IDs to identify different components. For example, a Kubernetes Pod can be part of a ReplicaSet. The configuration description of the Pod contains the unique ID of the ReplicaSet, and by examining multiple configuration inputs, such connections can be discovered and made part of the grammar.

Step F: After the update to the ontology grammar, existing ontology instances can be revisited and updated accordingly. As it was mentioned in the previous step, during an update, new connections can be added, or deeper updates might be carried out in the grammar. In this step it is ensured that the ontology instances graph is in-line with the new grammar by adding or removing connections and potentially instances.

Figure 12 shows a simplified ontology grammar and ontology instances graph for the previously described database service example. In Kubernetes, various configuration information can be gathered from the running system, describing for example running pods, replicas, deployment, services, and physical nodes. Configuration information related to the Database Service itself might also be used, in addition to Kubernetes configuration information. Figure 12 shows a simplified scenario with an ontology grammar containing concepts for the service, service instance and for pod. The ontology graph shows that the system has 2 database instances running in specific pods. For Kubernetes, schema files are available for configuration descriptors, and as a result, while no instance has been created, CPU usage is already part of the ontology grammar.

Ontology Values (Steps 130, 230 of methods 100, 200)

The goal of this step is to record time stamped numeric values from monitored metrics and numerical values from log messages. The collected numerical data is used in the next step to discover trends and deviations.

Figure 13 illustrates an implementation of steps 130, 230 of the methods described herein. As Figure 13 shows, the building of values relies on the previously created Ontology grammar and Ontology instances. There now follows a detailed discussion of Figure 13, with reference to the individual method steps of the methods 100, 200 that are being implemented.

Step A: In this first step the received metrics and numerical values from logs are mapped to the Ontology grammar (step 231). The purpose of this step is to identify if the grammar contains a term associated with a value. As it was detailed in the previous section, if there are no schema files available for building the ontology grammar, there can be missing terms related to the values. For example, the grammar might already contain a term for total memory but be missing a term for free memory.

Step B: If one or more terms related to values are identified to be missing from the grammar, the grammar is extended (steps 234, 235). It may be expected that the grammar at this point already contains the concept that needs to be extended with the missing values, making the update of the grammar a relatively simple task.

Step C: If there is no instance in the Ontology Instances for the given metric or numerical value, then an instance with the current initial value is added. As an example, if this is the first metric related to the free memory of a given pod in the system, it is assumed that the instance related to the pod itself has been created as described above, and here a related instance for the free memory is added with the given value (steps 236 to 238).

Step D: If an instance related to the given metric or numerical value has already been added to the Ontology Instances, then in this step the current new value is added with a timestamp, creating a time series that will be used in the Rules Generator component (steps 232, 233).
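A minimal sketch of Steps C and D, assuming an in-memory store of ontology value instances; the record_value helper and the identifiers used are hypothetical illustrations only.

import time
from collections import defaultdict

# Maps (instance identifier, metric term) -> list of (timestamp, value) samples.
ontology_values = defaultdict(list)

def record_value(instance_id, metric_term, value, timestamp=None):
    """Step C creates the value instance on first observation; Step D appends later samples,
    building the time series consumed by the Rules Generator."""
    if timestamp is None:
        timestamp = time.time()
    ontology_values[(instance_id, metric_term)].append((timestamp, value))

# Example: CPU usage samples for the pod hosting DataBase instance 1.
record_value("pod-1", "cpu_usage", 42.0)
record_value("pod-1", "cpu_usage", 55.0)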

Figure 14 shows how the ontology grammar and instance graphs from Figure 12 can be extended via the values building step. First, it can be seen that while CPU usage was known, now the ontology instances graph contains actual historical values obtained from input metrics. It can also be seen that the grammar was extended with service level delay and instance level throughput. These are service specific metrics that were not discovered from the configuration input in the grammar and instances building step. For the present example it is assumed that the database service reports these metrics, for example to Prometheus, a monitoring system commonly used with Kubernetes. In this step it is identified that these metrics are not part of the grammar (Step A), and these are new values that are related to known entities. As a result, the grammar is simply extended (Step B), instances are created in the ontology instances graph and values are recorded (Step C).

Rules Discovery (steps 140, 240, 260 of methods 100, 200)

The goal of this step is to discover relevant connections between the ontology instances by using their values identified as discussed above. These connections enable inference of knowledge that assists with analysing incident reports and finding possible explanations for incidents. The idea is to find associations between the changes and/or observations that the identified ontology instances are experiencing in the monitored cloud system. These may include changes of a single instance over time or correlations between instances or between instances and events such as log messages.

Figure 15 illustrates an implementation of steps 140, 240, 260 of the methods described herein. Figure 15 presents the different steps performed by the Rules Generator module, and includes two main parts: the first part concerns generating rules using historical data (Step A and Step B) and the second part describes the generation of rules using live data (Step C and Step D). There now follows a detailed discussion of Figure 15, with reference to the individual method steps of the methods 100, 200 that are being implemented.

Step A (steps 241, 242): This step takes as input ontology instances, including their values, and the content of the historical log messages collected from the monitored cloud system. Items that are frequently associated together over time-windows (for example, 10 s, 1 min, 2 min) are then identified. For example, two values (for example metrics) of a first instance may be characterized by the same variation over time (for example, increase or decrease at the same time). These two metrics may also be matched with the same variation in one metric from another instance. If this observation has been seen frequently, then these two metrics from the first instance and the observed metric from the second instance could be considered as frequent items. More details about how to generate the frequent items can be found in Apriori Itemset Generation, http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/itemset_apriori.html.

Step B (step 243): In this step, the list of frequent items obtained in the previous step is used to build the ontology rules, also referred to as association rules. In this context, association rules are used to uncover how items are associated or correlated to each other over specific time-windows (10 s, 1 min, 2 min, etc.). The rules are used to represent patterns of items that are statistically related (i.e., frequent) in the underlying dataset. Three different methods to measure the association between the items / itemsets are presented in Apriori Algorithm Tutorial, https://towardsdatascience.com/apriori-algorithm-tutorial-2e5eb1d896ab. These methods are:

Support: used to measure how frequently an itemset (containing, for example, an item {X} and another item {Y}) appears in the underlying dataset.

Support({X} -> {Y}) = (Itemsets containing both {X} and {Y}) / (Total number of itemsets)

Confidence: used to define how likely it is that an item {Y} would be present given that another item {X} has been observed in the dataset.

Confidence({X} -> {Y}) = (Itemsets containing both {X} and {Y}) / (Itemsets containing {X})

Lift: controls for the support (frequency) and confidence while calculating the conditional probability of occurrence of an item {Y} given another item {X}.

Lift({X} -> {Y}) = [(Itemsets containing both {X} and {Y}) / (Itemsets containing {X})] / (Fraction of itemsets containing {Y})

To implement this step, many algorithms could be applied to build and discover these association rules, including the Apriori Algorithm (Apriori Algorithm Tutorial, https://towardsdatascience.com/apriori-algorithm-tutorial-2e5eb1d896ab), Eclat Algorithm (Association Rule (Apriori and Eclat Algorithms) with Practical Implementation, https://medium.com/machine-learning-researcher/association-rule-apriori-and-eclat-algorithm-4e963fa972a4), or FP-Growth Algorithm (Understanding (Frequent Pattern) FP Growth Algorithm, https://www.mygreatlearning.com/blog/understanding-fp-growth-algorithm/). For example, the Apriori Algorithm could be used for a large dataset while the Eclat Algorithm could be used for small or medium datasets. The method stores the generated rules in a rule database based on the time (timestamp) when they were created.
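To make the three association measures concrete, the following self-contained sketch computes support, confidence and lift for a candidate rule directly from per-time-window itemsets; it is an illustration of the measures rather than a full Apriori, Eclat or FP-Growth implementation, and the example windows and item labels are invented for this sketch.

def rule_metrics(itemsets, antecedent, consequent):
    """Compute support, confidence and lift for the association rule antecedent -> consequent.
    itemsets is a list of sets, one per observation time window."""
    antecedent, consequent = set(antecedent), set(consequent)
    total = len(itemsets)
    both = sum(1 for s in itemsets if antecedent <= s and consequent <= s)
    with_x = sum(1 for s in itemsets if antecedent <= s)
    with_y = sum(1 for s in itemsets if consequent <= s)
    support = both / total
    confidence = both / with_x if with_x else 0.0
    lift = confidence / (with_y / total) if with_y else 0.0
    return support, confidence, lift

# Items A-G from the database service example, observed over four hypothetical time windows.
windows = [{"A", "B", "C", "D", "E"}, {"A", "B", "C", "D", "E"}, {"F", "G"}, {"A", "B", "C"}]
print(rule_metrics(windows, {"F"}, {"G"}))  # (0.25, 1.0, 4.0)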

Step C (step 261): When live data is received (metrics and numerical log values), the new live data is first compared to the ontology values, and a check is made to determine whether the new data represents a deviation or a new pattern. A deviation or new pattern here represents new changes or observations, which can be used to update the rules and the connections between the ontology instances. When such a deviation is present, updates to the detected instances are sent to the input Ontology instances.

Step D (steps 262 and 263): Given the detected changes (for example, data deviation, new data measurements), a check is initially made of the previously generated rules. The previous rules are then mapped to the changes to induce new rules, update previous rules and store them in the rules database based on their timestamps. New conclusions are consequently drawn from the set of previously discovered rules, the ontology instance changes and/or deviations, and the live data.

To implement this step, inductive algorithms such as those disclosed in Inductive Logic Programming, http://web.stanford.edu/~vinayc/logicprogramming/html/inductive_logic_programming.html, could be used to find the new rules or knowledge given a background theory and a set of examples. There exist several inductive algorithms in the literature including, for example, RULES (RULES: A Simple Rule Extraction System). RULES is a simple inductive learning algorithm for extracting rules from a set of data. Algorithms belonging to the RULES family are usually available in data mining tools, such as Knowledge Extraction based on Evolutionary Learning (KEEL) and Waikato Environment for Knowledge Analysis (WEKA), which are known for knowledge extraction and decision making.

Example Rules Extraction

Continuing the database service example of Figures 12 and 14, the following items may have been obtained from the streaming data generated by the pods in the Kubernetes environment.

A: query delay [ms] in DataBase Service is between min and max values (e.g., [50ms, 700ms])

B: CPU usage in DataBase instance 1 is between min and max values (e.g., 25%- 90%)

C: CPU usage in DataBase instance 2 is between min and max values (e.g., 25%- 90%)

D: number of requests/second received by DataBase instance 1 is between min and max values (e.g., 200/s)

E: number of requests/second received by DataBase instance 2 is between min and max values (e.g., 200/s)

F: log messages about extensive queuing in DataBase instance 2

G: CPU usage in DataBase instance 2 increases higher than the maximum value (e.g., 95%)

Step A (steps 241, 242): The following example itemset is formed by tracking the occurrence of individual items and considering how many times they have been observed together during different time windows:

Step B (step 243): Using the Apriori algorithm, the following association rules can be built, and their connections calculated (given their frequency of occurrences) based on one or more of the three parameters confidence/lift/support as shown in the following table (the values given here are random numbers to illustrate the example):

Rule (1) “A -> BCDE” means that whenever the item A is observed, it is associated with the presence of items B, C, D and E at the same time. Considering the meaning of this rule, with reference to the identified items, when the delay in DataBase service is between the normal min and max values (item A) this implies that the load is balanced between the two Database instances (items D and E) and that the CPU usage is also balanced in both instances (items B and C). This observation facilitates analysis of the different events that may be experienced by the DataBase service. For example, instead of checking all metrics related to the DataBase instances, the pods, CPU usage, etc, it is sufficient to check one part of the rule (presence of item A) to induce the rest of the rule (presence of itemset BCDE).

Rule (2) “BC -> DE” means that when the CPU usage is balanced between the two DataBase instances (items BC) that implies that the throughput (items DE) is also within the normal values between the two instances.

Rule (5) “F -> G” means that when extensive log entries are observed in DataBase instance 2 (item F), DataBase instance 2 is characterized by a CPU usage higher than the normal maximum value (item G).

Figure 16 illustrates rules 1 and 5, and shows the connections between the rules and the instances in the ontology graph to which they refer. Figure 17 illustrates how the rules in the present example may be updated using live data, as discussed below.

Step C (step 261): When receiving live data indicating that the CPU usage on Pod1 decreased to 0% (see Figure 17), a deviation of the values of the ontology instance from established rules is detected, and new rules are created to reflect this in Step D.

Step D (steps 262 and 263): After detecting the deviation in the previous step, rules that do not reflect the current status/events in the managed system are invalidated or changed. Invalidations are also stored with a timestamp in the rule database. In the present example, the decrease to 0% of CPU usage on Pod 1 means that item B is no longer being observed. Logically, it may be assumed that Pod 2 is compensating, meaning that items C and E may well also not be being observed. Rules 1 (A->BCDE), 2 (BC->DE), and 4 (E->ACD) are consequently invalidated and new rules can be created using an inductive algorithm and the new observed items:

H: CPU usage in Pod 1 is 0%

I: new templates observed in log message in DataBase instance 1

J: unbalanced load between the DataBase instances 1 and 2

K: software errors detected on DataBase instance 1

In one example, a new rule “HJ -> IK” can be created. This means that when (H) the CPU usage goes to 0% in Pod 1, and (J) there is an unbalanced load between the two instances, this may be associated with (K) software errors experienced in that instance and (I) new operations being observed in the system. The method may also induce that the load is unbalanced because the other Pod is receiving more requests than usual, which affects the delays of the executed requests on that node. Updates to the ontology instances graph and the new rule (Rule 6: “HJ -> IK”) are illustrated in Figure 17.

Incident Analysis (Methods 300 and 400)

Whenever an incident occurs in a managed system, examples of the methods 300 and 400 may be used to identify possible explanations for the incident, and consequently guide subsequent actions to address the incident. Examples of incidents include reports from end-users about service failures or alarms from monitoring systems. An incident report generally includes, as a minimum, information about what happened and when. Specifically, the incident report details what components are involved in the incident and when the incident was reported or discovered.

Figure 18 illustrates an implementation of the methods 300, 400 described herein. As can be seen from Figure 18, methods 300, 400 operate on the extracted ontology and identified rules and are triggered when an incident report is received. Two parameters are extracted from the incident report or provided by a user. The first parameter is an identification of an instance or instances of the ontology that is associated with what happened during the incident. The identification may vary according to specific deployments. The second parameter is a time window during which a possible explanation for the incident will be sought. This time window will likely end at the time the incident was first experienced and may start immediately before occurrence of whatever event may have caused the incident. In one example, such a time window is provided by the user. In another example, a default time window size ending at the time of the incident report may be used. In another example, the management node implementing the method may search through a set of possible time window sizes to find possible explanations.
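By way of illustration only, a processed incident report of the form <instance_ids, time_window> might be represented as follows; the 15 minute default window size, the field names and the example values are assumptions of this sketch rather than values fixed by the disclosure.

from dataclasses import dataclass
from datetime import datetime, timedelta

DEFAULT_WINDOW = timedelta(minutes=15)  # assumed default time window size

@dataclass
class IncidentReport:
    instance_ids: list                  # ontology instance(s) associated with what happened
    reported_at: datetime               # when the incident was reported or discovered
    window: timedelta = DEFAULT_WINDOW  # may instead be provided explicitly by the user

    def time_window(self):
        """Window in which an explanation is sought, ending when the incident was reported."""
        return (self.reported_at - self.window, self.reported_at)

# Example: "Database service 1 is slow", reported at 4pm, gives the window [3:45pm, 4:00pm].
report = IncidentReport(["Database service 1"], datetime(2021, 12, 7, 16, 0))
print(report.time_window())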

There now follows a detailed discussion of Figure 18, with reference to the individual method steps of the methods 300, 400 that are being implemented.

Step A (steps 310, 410, 320, 420): The method starts when it receives the processed incident report in the form of <instance_ids, time_window>. When this happens, the management node starts by extracting from the ontology database a list of notable events R^Δ that occurred during the time window, along with their associated ontology instances V^Δ. In the context of this step, notable events comprise a change (i.e., creation, update, or removal/invalidation) of ontology rules. These events can be identified by checking the timestamps of the rules and the ontology itself.

Step B (steps 330, 430): These events are then grouped according to how close the relevant instances are to each other. This may be done using a graph data structure that captures the relationships among the instances that make up the ontology. In one embodiment, a first incident ontology graph G^O is generated as follows (step 431). Let G^O(V^O, E^O) be a graph in which the vertices V^O = {v_1, v_2, ...} are the set of ontology instances in the ontology, and E^O is the set of edges such that edge {v_1, v_2} ∈ E^O if and only if:

1. the edge {v_1, v_2} belongs to the ontology instance graph, or

2. there is an induced rule that associates instances v_1 and v_2 during the time window.

From G^O a second incident ontology graph G^Δ(V^Δ, E^Δ) is generated (step 432). This is a complete weighted graph where V^Δ is as defined earlier (Step A) and the weight of edge {v_1, v_2} is defined as the graph distance (for example the number of hops) between v_1 and v_2 on the graph G^O. If there is no path between the two instances on graph G^O, then the distance is set to be infinite. Next (step 433), a graph clustering algorithm (for example, k-spanning tree, Shared Nearest Neighbour (SNN), Vertex Betweenness Clustering (VBC), etc.) is used on G^Δ to group its vertices (i.e., V^Δ) into disjoint subsets (g^i_1, g^i_2, ...). Finally, the notable events (rule changes) in R^Δ = {r_1, r_2, ...} are grouped into groups (g^e_1, g^e_2, ...) such that an event r_i : {v_j, ...} that associates the instances {v_j, ...} with each other belongs to group g^e_k if {v_j, ...} ⊆ g^i_k. In other words, the notable events are grouped with each other according to the grouping of their instances (step 434).

Step C (steps 340, 440): In this step, for each group x of events (i.e., g^e_x) identified in the previous step, the events e_i ∈ g^e_x in the group are sorted according to their timestamps.

Step D (steps 350, 450): The groups are then sorted according to their relevance (i.e., distance) to the incident. For each group x, its relevance is computed using the instance group g^i_x as follows. First (step 451), for each instance v_j ∈ g^i_x in the group, its distance to the incident (denoted by d_j) is calculated as the minimum distance on the graph G^O between v_j and the instances identified in the incident report. The distance of group x from the incident is then calculated (step 452) as the minimum distance among its instances (i.e., min_{v_j ∈ g^i_x} d_j). The group that has the minimum distance has the highest relevance and vice versa (step 454). Groups that have no relevance will have a distance of infinity.

Step E (steps 360, 460): Possible explanations for the incident are then presented as follows. For a group of events x (i.e., g^e_x), an explanation is created as the sequence of events in that group, together with an indication of how successive events in the group are related to each other, in terms of the shortest paths among the instances belonging to the two successive events, described using the associating rules or the ontology grammar. The method returns, for each group of events sorted according to their relevance to the incident, the associated explanations. The method may return explanations from only the group having the highest relevance to the incident, or explanations from multiple groups.
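A condensed, purely illustrative sketch of Steps B to D is given below. It assumes that the networkx library is used for the graph operations, that connected components of the first incident ontology graph stand in for the clustering of G^Δ (the disclosure allows other clustering algorithms), and that the incident instances appear as vertices of the graph; all function and variable names are hypothetical.

import math
import networkx as nx

def group_and_rank(g_first, notable_events, incident_instances):
    """g_first: first incident ontology graph G^O (nx.Graph).
    notable_events: list of (rule_id, timestamp, instances_referred_to).
    incident_instances: instances named in the incident report (assumed to be vertices of g_first)."""
    # Step 433 (simplified): cluster instances that are mutually reachable in G^O.
    clusters = [set(c) for c in nx.connected_components(g_first)]
    # Step 434: assign each notable event to the cluster containing its instances.
    groups = {i: [] for i in range(len(clusters))}
    for rule_id, timestamp, instances in notable_events:
        for i, cluster in enumerate(clusters):
            if set(instances) & cluster:
                groups[i].append((timestamp, rule_id))
                break
    ranked = []
    for i, cluster in enumerate(clusters):
        if not groups[i]:
            continue
        # Steps 451, 452: group distance = minimum hop distance from any cluster vertex
        # to any instance identified in the incident report.
        distance = min(
            (nx.shortest_path_length(g_first, u, v)
             for u in cluster for v in incident_instances
             if nx.has_path(g_first, u, v)),
            default=math.inf,
        )
        # Step C: order the rule changes within the group by timestamp.
        ranked.append((distance, sorted(groups[i])))
    ranked.sort(key=lambda item: item[0])  # Step D: smallest distance = greatest relevance
    return ranked

The group with the smallest distance in the returned list corresponds to the most relevant explanation; its ordered rule changes can then be rendered together with the relationships taken from the ontology graph or rules, as described in Step E.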

Example Incident Analysis

The present example refers again to the database system whose ontology instances graph is presented in Figures 12 and 14. To simplify the incident analysis step, it is assumed that while the system was operating normally, the following (new) rules were discovered when running examples of the method 100, 200:

• Rule 1: query delay value is in range [50ms, 700ms]

• Rule 2: CPU usage of pod 1 is in range [15%,90%]

• Rule 3: CPU usage of pod 2 is in range [15%,95%]

• Rule 4: throughput of database instance 1 is correlated to CPU usage of pod 1

• Rule 5: throughput of database instance 2 is correlated to CPU usage of pod 2

• Rule 6: throughput of Database instance 1 and 2 is correlated

At 3:50pm, owing to a software error, database instance 1 freezes/crashes. This causes all the database queries to go to database instance 2 and therefore the query delay increases. As a consequence of these events, several rules will be changed in the following order:

Rule 2 becomes Rule 2’: CPU usage of pod 1 is 0%

Rule 6 becomes Rule 6': throughput of Database instance 1 and 2 is NOT correlated

Rule 1 becomes Rule 1': query delay value is in the range [50ms, 1500ms]

An incident report is created by a user that says “Database service 1 is slow” since 4pm today.

Step A: instance_id is set to “Database service 1”, the time window is set to [3:45pm-4:00pm], R^Δ is set to Rules 1', 2' and 6'. V^Δ is set to {CPU usage of pod 1, throughput of database instance 1, throughput of database instance 2, query delay value}.

Step B: G^O and G^Δ are created from the instance graph and the rules as shown in Figures 19 and 20. Figure 19 shows graph G^O, which captures the relationships among the ontology instances. Figure 20 shows graph G^Δ, which captures the distances among the notable events (changes in rules). With reference to Figure 19, it will be appreciated that the rules are not part of G^O but are included in the Figure to indicate the rules on the basis of which edges were created in the graph. Running a graph clustering algorithm gives only one cluster including all the vertices in V^Δ. Therefore g^i_1 = V^Δ and g^e_1 = R^Δ. It will be appreciated that a larger example, in which several unrelated events existed, would have resulted in more groups.

Step C: g^e_1 = {Rule 2', Rule 6', Rule 1'}, i.e., the rules are sorted according to their time stamps.

Step D: There is no sorting of groups according to relevance as there is only one group. The distance of this group from the incident is 0, as it includes the instance identified in the incident report. It is therefore considered to be highly relevant.

Step E: The explanation of the incident that could be displayed to the end-user is presented as follows:

[sequence of events: <Rule 2': CPU usage of pod 1 is 0%>, <Rule 6': throughput of Database instance 1 and 2 is NOT correlated>, <Rule 1': query delay value is in the range [50ms, 1500ms]>; relationship between events: (Rule 2', Rule 6') <- (throughput of database instance 1 is correlated to CPU usage of pod 1), (Rule 6', Rule 1') <- (database instance has throughput, database service has database instances, database service has query delay)]

Use Case

Methods according to the present disclosure provide an automated approach to building an ontology describing system functional behaviour, as well as identifying explanations for incidents experienced by the system. These methods can be applied to any complex hierarchical architecture, including for example the 3GPP 5G RAN architecture. The overall architecture of NG-RAN (New Generation-Radio Access Network) is described in ETSI TS 138 401 v16.2.0. This architecture consists of a set of gNBs (next generation nodeBs) which are deployed as central units and distributed units. gNB nodes provide NR (New Radio) user plane and control plane protocol terminations towards the UE (User Equipment) and connect the RAN to 5GC (5G Core).

Several procedures described in ETSI TS 138 401, as well as in other 3GPP 5G RAN specifications, could be used to demonstrate implementation of methods according to the present disclosure. In the following discussion, the UE initial access procedure is described for the purposes of illustration, but it will be appreciated that this is merely one possible example.

During the initial access procedure, gNB-DU, gNB-CU, AMF and UE systems exchange specific messages within well-defined time windows. The initial access procedure may fail as a consequence of various error conditions (locked resources, timing errors/congestion etc.) affecting any of the messages exchanged. Typically, when this happens, the system logs the error(s), which are then analysed by a human expert for identifying an explanation for these incidents and taking corrective actions (for example, application and/or OS re-configuration). This incident handling method is not scalable in the context of 5G, and examples of the present disclosure can provide an automated solution for building a model of the interaction flow and for identifying an explanation for the incident, as discussed below.

In an example implementation of methods and nodes of the present disclosure, the Ontology Builder component receives historical/live data that describes the messages exchanged by gNB-DU, gNB-CU and AMF systems. In addition to this input data, each of these interacting components sends to the Ontology Builder its metrics and log numerical data. Given the input provided, the Ontology Builder generates the ontology grammar and instances graph, and the Rules Generator creates the rules. The generated ontology and rules can help in analysing not only errors affecting exchanged messages, but also incidents occurring in gNB-DU, gNB-CU and/or AMF systems which could have led to these failed messages. Thus, in addition to inferring the message which failed, for example, UE Context Setup Request, it would be possible to identify the cause of the failed message at the application/OS level, for example, messageBufferSize=0. By analysing the explanation(s) provided by the Incident Analyzer component, it is possible to execute an automated remediation procedure which would reduce the duration of system impact and the maintenance cost.

Data forwarded by gNB-DU, gNB-CU and AMF systems to the Ontology Builder, Rules Generator, and Incident Analyzer components is summarized below:

Messages (names & parameters) exchanged between systems

Time stamp when message was sent & received

Log files collected

Time series metrics

Configuration files

Additional uses for the Ontology (grammar, instances graph and rules)

It will be appreciated that an ontology generated according to examples of the methods 100, 200 presented herein may be used for a range of different purposes, in addition or as an alternative to incident analysis according to examples of the methods 300, 400. For example, an ontology generated according to examples of the present disclosure may be used when making changes to a running system. A change request for a system may be received at a relatively high level of abstraction from a user, for example as an intent, which may then be mapped to actual system level components that should be modified to fulfil the intent using an ontology of the system generated according to examples of the present disclosure. In some examples, an impact of requested changes may also be assessed using the generated ontology grammar, graph and rules before actually executing the changes in the system.

Examples of the present disclosure thus provide methods and nodes enabling the automatic generation of an ontology, including rules, which may serve as a knowledge base to provide explanations for incidents that may occur in an IT infrastructure such as a cloud based system.

Example methods for ontology generation may process different types of data (including for example metrics, logs, configuration files, etc.) from the monitored system to discover the different entities characterizing the system and their relationships, and to build an ontology and rules without input from a human expert.

Example methods for incident management may perform an automated analysis of a reported incident experienced by a system, based on the generated ontology, including rules, for the system, so as to provide an explanation for the reported incident. The analysis may be performed on receipt of an incident report, received for example from a fault detector, a performance degradation detector, a ticket submitted by an end-user, etc. An instance identified in the report as having experienced the incident is mapped to the ontology graph of the ontology in order to find the circumstances that led to the incident.

Examples of the present disclosure thus provide an automatic solution for building an ontology from multiple input data types, including time-series metrics, logs, configuration files or a combination of this data, which can assist in providing improved explanations for observed incidents. The solution supports continuous learning so that changes in the managed system can be efficiently captured without rebuilding the entire ontology knowledge base. Examples of the present disclosure also provide fast and efficient analysis of data output by systems (such as cloud based systems) to identify and explain incidents experienced by the system. In this manner, users without expert information about a specific system or environment may gain insights providing an explanation for a given incident such as service failure, service unavailability, delays, performance degradation, etc. The output of the incident analysis presents the circumstances that led to the reported incident, which assists with identifying the appropriate procedures to correct the incident. It will be appreciated that the methods and nodes presented herein can be implemented and deployed within any distributed or centralized infrastructure cloud system, and may be implemented in one module or distributed in different modules that are connected.

The methods of the present disclosure may be implemented in hardware, or as software modules running on one or more processors. The methods may also be carried out according to the instructions of a computer program, and the present disclosure also provides a computer readable medium having stored thereon a program for carrying out any of the methods described herein. A computer program embodying the disclosure may be stored on a computer readable medium, or it could, for example, be in the form of a signal such as a downloadable data signal provided from an Internet website, or it could be in any other form.

It should be noted that the above-mentioned examples illustrate rather than limit the disclosure, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims or numbered embodiments. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim or embodiment, “a” or “an” does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the claims or numbered embodiments. Any reference signs in the claims or numbered embodiments shall not be construed so as to limit their scope.