Title:
INTEGRATED METHOD FOR AUTOMATING ENFORCEMENT OF SERVICE LEVEL AGREEMENTS FOR CLOUD SERVICES
Document Type and Number:
WIPO Patent Application WO/2019/025944
Kind Code:
A1
Abstract:
A method, node and system for enforcement of rules associated with cloud services are provided. A node for probabilistic action triggering for service level agreement, SLA, management for cloud services is provided. The node includes processing circuitry configured to: monitor cloud service data associated with a cloud network, generate a conditional probability distribution, CPD, of performance metrics of the cloud network based on the monitored data, determine a plurality of predicted future states of the performance metrics based on the generated CPD, determine a plurality of transition events, each transition event corresponding to a transition from a first predicted future state that does not meet a predefined rule to a second predicted future state that meets the predefined rule, correlate the plurality of transition events, and trigger at least one management action for execution based on the correlation of the plurality of transition events.

Inventors:
VAKILINIA SHAHIN (CA)
KEMPF JAMES (US)
LEMIEUX YVES (CA)
TRUCHAN CATHERINE (CA)
Application Number:
PCT/IB2018/055688
Publication Date:
February 07, 2019
Filing Date:
July 30, 2018
Assignee:
ERICSSON TELEFON AB L M (SE)
International Classes:
G06N3/04; G06N7/00; G06F9/50; H04L12/24
Other References:
QIU FENG ET AL: "A deep learning approach for VM workload prediction in the cloud", 2016 17TH IEEE/ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING (SNPD), IEEE, 30 May 2016 (2016-05-30), pages 319 - 324, XP032926901, DOI: 10.1109/SNPD.2016.7515919
ZHANG PENGCHENG ET AL: "A Novel QoS Prediction Approach for Cloud Service Based on Bayesian Networks Model", 2016 IEEE INTERNATIONAL CONFERENCE ON MOBILE SERVICES (MS), IEEE, 27 June 2016 (2016-06-27), pages 111 - 118, XP033024147, DOI: 10.1109/MOBSERV.2016.26
LORIDO-BOTRAN TANIA ET AL: "A Review of Auto-scaling Techniques for Elastic Applications in Cloud Environments", JOURNAL OF GRID COMPUTING, SPRINGER NETHERLANDS, DORDRECHT, vol. 12, no. 4, 11 October 2014 (2014-10-11), pages 559 - 592, XP035382406, ISSN: 1570-7873, [retrieved on 20141011], DOI: 10.1007/S10723-014-9314-7
ARABNEJAD HAMID ET AL: "A Comparison of Reinforcement Learning Techniques for Fuzzy Cloud Auto-Scaling", 2017 17TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID), IEEE, 14 May 2017 (2017-05-14), pages 64 - 73, XP033118266, DOI: 10.1109/CCGRID.2017.15
AYUSH DUSIA ET AL: "Recent Advances in Fault Localization in Computer Networks", IEEE COMMUNICATIONS SURVEYS & TUTORIALS, 18 May 2016 (2016-05-18), New York, pages 3030 - 3051, XP055523886, Retrieved from the Internet [retrieved on 20181114], DOI: 10.1109/COMST.2016.2570599
Attorney, Agent or Firm:
WEISBERG, Alan M. (US)
Claims:
What is claimed is:

1. A node (36) for probabilistic action triggering for service level agreement, SLA, management for cloud services, the node (36) comprising:

processing circuitry (42) configured to:

monitor cloud service data associated with a cloud network (34);

generate a conditional probability distribution, CPD, of performance metrics of the cloud network (34) based on the monitored data;

determine a plurality of predicted future states of the performance metrics based on the generated CPD;

determine a plurality of transition events, each transition event corresponding to a transition from a first predicted future state that does not meet a predefined rule to a second predicted future state that meets the predefined rule;

correlate the plurality of transition events; and

trigger at least one management action for execution based on the correlation of the plurality of transition events.

2. The node (36) of Claim 1, wherein the CPD is generated based on a Dynamic Bayesian Network, DBN.

3. The node (36) of Claim 1, wherein the plurality of predicted future states of the performance metrics are determined using a neural network.

4. The node (36) of Claim 3, wherein the predefined rule includes a predefined probability threshold.

5. The node (36) of Claim 1, wherein the plurality of transition events are determined based on a Markov Chain Monte Carlo, MCMC, modeling.

6. The node (36) of Claim 1, wherein the first predicted future state and the second predicted future state are adjacent predicted future states.

7. The node (36) of Claim 1, wherein the processing circuitry (42) is further configured to map each transition event to a management action.

8. The node (36) of Claim 1, wherein the triggering of the at least one management action is based on a negative time value for transitioning from one of the first predicted future state and second predicted future state to a third predicted future state, the third predicted future state not being adjacent to the second predicted future state.

9. The node (36) of Claim 1, wherein the generating of the CPD of performance metrics includes generating correlation coefficients associated with the performance metrics.

10. The node (36) of Claim 1, wherein the performance metrics include one of memory utilization and central processing unit utilization.

11. The node (36) of Claim 1, wherein the generating of the CPD of performance metrics, the determining of the plurality of predicted future states of the performance metrics and determining a plurality of transition events correspond to a logical chain of a Dynamic Bayesian Network, DBN, Long Short-Term Memory, LSTM, neural network and Markov Chain Monte Carlo, MCMC, modeling that are performed in a predefined order.

12. The node (36) of Claim 1, wherein each of the plurality of predicted future states of the performance metrics is associated with a respective probability metric indicating a probability of the predicted future state meeting the predefined rule.

13. The node (36) of Claim 1, wherein the cloud service data includes data corresponding to different cloud service layers.

14. A method for probabilistic action triggering for service level agreement, SLA, management for cloud services, the method comprising:

monitoring (S114) cloud service data associated with a cloud network (34);

generating (S116) a conditional probability distribution, CPD, of performance metrics of the cloud network (34) based on the monitored data;

determining (S118) a plurality of predicted future states of the performance metrics based on the generated CPD;

determining (S120) a plurality of transition events, each transition event corresponding to a transition from a first predicted future state that does not meet a predefined rule to a second predicted future state that meets the predefined rule;

correlating (S122) the plurality of transition events; and

triggering (S124) at least one management action for execution based on the correlation of the plurality of transition events.

15. The method of Claim 14, wherein the CPD is generated based on a Dynamic Bayesian Network, DBN.

16. The method of Claim 14, wherein the plurality of predicted future states of the performance metrics are determined using a neural network.

17. The method of Claim 16, wherein the predefined rule includes a predefined probability threshold.

18. The method of Claim 14, wherein the plurality of transition events are determined based on a Markov Chain Monte Carlo, MCMC, modeling.

19. The method of Claim 14, wherein the first predicted future state and the second predicted future state are adjacent predicted future states.

20. The method of Claim 14, further comprising mapping each transition event to a management action.

21. The method of Claim 14, wherein the triggering of the at least one management action is based on a negative time value for transitioning from one of the first predicted future state and second predicted future state to a third predicted future state, the third predicted future state not being adjacent to the second predicted future state.

22. The method of Claim 14, wherein generating of the CPD of performance metrics includes generating correlation coefficients associated with the performance metrics.

23. The method of Claim 14, wherein the performance metrics include one of memory utilization and central processing unit utilization.

24. The method of Claim 14, wherein the generating of the CPD of performance metrics, the determining of the plurality of predicted future states of the performance metrics and determining a plurality of transition events correspond to a logical chain of a Dynamic Bayesian Network, DBN, Long Short-Term Memory, LSTM, neural network and Markov Chain Monte Carlo, MCMC, modeling that are performed in a predefined order.

25. The method of Claim 14, wherein each of the plurality of predicted future states of the performance metrics is associated with a respective probability metric indicating a probability of the predicted future state meeting the predefined rule.

26. The method of Claim 14, wherein the cloud service data includes data corresponding to different cloud service layers.

Description:
INTEGRATED METHOD FOR AUTOMATING ENFORCEMENT OF SERVICE LEVEL AGREEMENTS FOR CLOUD SERVICES

TECHNICAL FIELD

This disclosure relates to network cloud services, and in particular to a method, node and system for enforcement of rules associated with cloud services.

BACKGROUND

As cloud technologies mature, the complexity of performance management for network cloud ("cloud") services continues to grow. Due to the poor integration between stack layers of the cloud, current cloud performance management makes isolated management decisions along the Service Level Agreement (SLA) boundaries, i.e., applying rules from the SLA, which may lead to contradictory management actions that negatively impact cloud service performance. The situation becomes worse when more layers are added to the cloud stack, e.g., Application Architecture Layer (AAL) and Application Business Logic (ABL) on top of the application layer, and facility layer underneath the infrastructure layer.

Management of cloud services is evolving to co-ordinate different management platforms in various layers to handle dynamic variations of the cloud environment based on the analysis of complex dependencies between system components. Since cloud services continue to grow in complexity, scale, and heterogeneity, distributed sensors and agents embedded within the layers should monitor the cloud performance. As a result, the analysis for managing cloud services becomes challenging.

SUMMARY

Some embodiments advantageously provide a method, node and system for enforcement of rules associated with cloud services.

In the disclosure, an integrated platform is provided to determine predicted situations where corrective action may be taken or is required. A Dynamic Bayesian Network (DBN) is trained and updated by collecting data to calculate the causal dependencies among various entities in different cloud service layers. Then, the correlation values are input into a Long Short-Term Memory (LSTM) neural network to predict the future states of the cloud services. Predicted future states that violate one or more Service Level Agreement (SLA) terms of the cloud services are learned with training data, and if the forecasted future states do not meet the SLA of the cloud services, associated events are generated to trigger management actions, i.e., management actions are triggered based on a predicted future state (a state that has not yet occurred but that has been predicted to possibly occur) of one or more metrics of one or more cloud services. Next, management actions are assigned to different sets of events using a reinforcement learning approach. An aspect of the cloud automation framework is to apply root-cause analysis with a modular design. A set of experiments based on collected data from a real cloud service environment was conducted using the teachings described herein. Experimental results of the integrated platform show that the prediction method is efficient and accurate.
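The event-triggering step described above can be illustrated with a minimal sketch. All state names, probabilities and the threshold below are hypothetical; in the disclosed framework the state probabilities would come from the DBN/LSTM pipeline rather than a hard-coded list.

```python
# Hypothetical sketch: emit an event whenever a predicted future state's
# probability of meeting the SLA drops below a configured threshold.
# State names, probabilities and the 0.9 threshold are illustrative only.

def generate_events(predicted_states, sla_threshold=0.9):
    """predicted_states: list of (state_label, p_meets_sla) per future step.
    Returns one event per step forecast to risk an SLA violation."""
    events = []
    for t, (state, p_meets_sla) in enumerate(predicted_states):
        if p_meets_sla < sla_threshold:
            events.append({"time": t, "state": state, "p": p_meets_sla})
    return events

# Example: forecast probabilities of meeting the SLA at three future steps.
predicted = [("cpu_ok", 0.97), ("cpu_high", 0.62), ("cpu_ok", 0.95)]
print(generate_events(predicted))
```

Each emitted event would then be correlated with others and mapped to a management action downstream.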

Further, this disclosure targets the limitations of using a non-integrated management system for cloud services. First, the correlation among entities located at different layers is calculated via development of causal graph models. Then, the correlation coefficients and consequently Conditional Probability Distributions (CPDs) are input into a Long Short-Term Memory (LSTM) based deep learning module to predict the next states. Events are generated if the predicted probability of one or more states violates SLA thresholds. After finding the correlation among the generated events and their CPDs, the management policy configuration is loaded into a reward-based decision recommender module. Considering the structure of CPDs and applying the Q-Learning algorithm, the recommender module maps the events to one or more actions. A platform architecture may be implemented in a modular structure.

The cloud automation framework described herein advantageously considers the dependency among the cloud entities in different layers to better predict and manage the cloud services. Thus, a DBN model is trained and updated using a mixed TRP-RBP (tree re-parameterization (TRP) and residual belief propagation (RBP)) message passing algorithm to track CPDs of performance metrics over time. Next, cross-correlation coefficients are calculated via an Inverse Fast Fourier Transform (IFFT) of the spectral density of mutual variables to mitigate the complexity of the neural networks. The cross-correlation coefficients are input into an LSTM to predict future cloud states that may violate the SLAs, which are learned with training data. If the forecasted states of the cloud violate the cloud services SLA, associated events are generated to trigger the management actions. Management actions are assigned to a different set of events using a Reinforcement Learning approach. The modular design of the proposed framework assists in root-cause analysis such that the causes can be quickly discovered from CPDs and cross-correlation of events and performance metrics.
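The IFFT-based cross-correlation step can be sketched as follows. This is a minimal illustration of the correlation theorem (circular cross-correlation via the inverse FFT of the cross-spectral density), not the framework's actual implementation; the signals and the 5-sample lag are synthetic.

```python
import numpy as np

def cross_correlation_ifft(x, y):
    """Circular cross-correlation of two equal-length series computed as the
    inverse FFT of their cross-spectral density X(f) * conj(Y(f))."""
    X = np.fft.fft(x)
    Y = np.fft.fft(y)
    return np.real(np.fft.ifft(X * np.conj(Y)))

# Example: two correlated metric traces (e.g., CPU and memory utilization),
# where y leads x by 5 samples.
rng = np.random.default_rng(0)
x = rng.standard_normal(256)
y = np.roll(x, -5)
corr = cross_correlation_ifft(x, y)
print(int(np.argmax(corr)))  # lag at which the correlation peaks
```

Computing the full correlation in one FFT/IFFT pass is O(N log N) instead of O(N^2), which is the complexity-mitigation benefit alluded to above.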

According to one aspect of the disclosure, a node for probabilistic action triggering for service level agreement, SLA, management for cloud services is provided. The node includes processing circuitry configured to: monitor cloud service data associated with a cloud network, generate a conditional probability distribution, CPD, of performance metrics of the cloud network based on the monitored data, determine a plurality of predicted future states of the performance metrics based on the generated CPD, determine a plurality of transition events, each transition event corresponding to a transition from a first predicted future state that does not meet a predefined rule to a second predicted future state that meets the predefined rule, correlate the plurality of transition events, and trigger at least one management action for execution based on the correlation of the plurality of transition events.

According to one embodiment of this aspect, the CPD is generated based on a Dynamic Bayesian Network, DBN. According to another embodiment of this aspect, the plurality of predicted future states of the performance metrics are determined using a neural network. According to another embodiment of this aspect, the predefined rule includes a predefined probability threshold. According to another embodiment of this aspect, the plurality of transition events are determined based on a Markov Chain Monte Carlo, MCMC, modeling.

According to another embodiment of this aspect, the first predicted future state and the second predicted future state are adjacent predicted future states. According to another embodiment of this aspect, the processing circuitry is further configured to map each transition event to a management action. According to another embodiment of this aspect, the triggering of the at least one management action is based on a negative time value for transitioning from one of the first predicted future state and second predicted future state to a third predicted future state, the third predicted future state not being adjacent to the second predicted future state.

According to another embodiment of this aspect, the generating of the CPD of performance metrics includes generating correlation coefficients associated with the performance metrics. According to another embodiment of this aspect, the performance metric includes one of memory utilization and central processing unit utilization. According to another embodiment of this aspect, the generating of the CPD of performance metrics, the determining of the plurality of predicted future states of the performance metrics and the determining of the plurality of transition events correspond to a logical chain of a Dynamic Bayesian Network, DBN, Long Short-Term Memory, LSTM, neural network and Markov Chain Monte Carlo, MCMC, modeling that are performed in a predefined order. According to another embodiment of this aspect, each of the plurality of predicted future states of the performance metrics is associated with a respective probability metric indicating a probability of the predicted future state meeting the predefined rule. According to another embodiment of this aspect, the cloud service data includes data corresponding to different cloud service layers.

According to another aspect of the disclosure, a method for probabilistic action triggering for service level agreement, SLA, management for cloud services is provided. Cloud service data associated with a cloud network is monitored. A conditional probability distribution, CPD, of performance metrics of the cloud network is generated based on the monitored data. A plurality of predicted future states of the performance metrics are determined based on the generated CPD. A plurality of transition events are determined, each transition event corresponding to a transition from a first predicted future state that does not meet a predefined rule to a second predicted future state that meets the predefined rule. The plurality of transition events are correlated. At least one management action for execution is triggered based on the correlation of the plurality of transition events.

According to one embodiment of this aspect, the CPD is generated based on a Dynamic Bayesian Network, DBN. According to another embodiment of this aspect, the plurality of predicted future states of the performance metrics are determined using a neural network. According to another embodiment of this aspect, the predefined rule includes a predefined probability threshold. According to another embodiment of this aspect, the plurality of transition events are determined based on a Markov Chain Monte Carlo, MCMC, modeling.

According to another embodiment of this aspect, the first predicted future state and the second predicted future state are adjacent predicted future states. According to another embodiment of this aspect, each transition event is mapped to a management action. According to another embodiment of this aspect, the triggering of the at least one management action is based on a negative time value for transitioning from one of the first predicted future state and second predicted future state to a third predicted future state, the third predicted future state not being adjacent to the second predicted future state. According to another embodiment of this aspect, generating of the CPD of performance metrics includes generating correlation coefficients associated with the performance metrics. According to another embodiment of this aspect, the performance metric includes one of memory utilization and central processing unit utilization.

According to another embodiment of this aspect, the generating of the CPD of performance metrics, the determining of the plurality of predicted future states of the performance metrics and the determining of the plurality of transition events correspond to a logical chain of a Dynamic Bayesian Network, DBN, Long Short-Term Memory, LSTM, neural network and Markov Chain Monte Carlo, MCMC, modeling that are performed in a predefined order. According to another embodiment of this aspect, each of the plurality of predicted future states of the performance metrics is associated with a respective probability metric indicating a probability of the predicted future state meeting the predefined rule. According to another embodiment of this aspect, the cloud service data includes data corresponding to different cloud service layers.

BRIEF DESCRIPTION OF THE DRAWINGS

A more complete understanding of the present embodiments, and the attendant advantages and features thereof, will be more readily understood by reference to the following detailed description when considered in conjunction with the accompanying drawings wherein:

FIG. 1 is a block diagram of a graph representation of dependencies between entities of cloud-based micro-services;

FIG. 2 is a block diagram of a three-dimensional representation of the correlation among the cloud entities;

FIG. 3 is a block diagram of an exemplary cloud network or system for enforcement of at least one rule associated with cloud services in accordance with the principles of the disclosure;

FIG. 4 is a flow diagram of an example enforcement process of enforcement code using analytics described herein;

FIG. 5 is a simplified flow diagram of the example of FIG. 4 in accordance with the principles of the disclosure;

FIG. 6 is an example flow diagram of another enforcement process of enforcement code in accordance with the principles of the disclosure;

FIG. 7 is a diagram of a simple probabilistic state space model for decision making in accordance with the principles of the disclosure;

FIG. 8 is a block diagram of an example of a mixed automation model including DBN, LSTM and MCMC in accordance with the principles of the disclosure;

FIG. 9 is a diagram of another example architecture in accordance with principles of the disclosure;

FIG. 10 is a flow diagram of an example root cause analysis in accordance with the principles of the disclosure;

FIG. 11 is a diagram of an example cloud implementation of the enforcement process as analytics as a service (AaaS) in accordance with the principles of the disclosure; and

FIG. 12 is an alternate embodiment of the node in accordance with the principles of the disclosure.

DETAILED DESCRIPTION

Existing cloud computing platforms lack integrated analytics and management capabilities. For example, existing systems implement monitoring platforms that collect data from different entities in multiple cloud stack layers. In some existing systems, a 3-D cloud monitoring framework is used to provide multi-layer surveillance and data stream analysis, consisting of a Complex Event Processing (CEP) engine to support the real-time analysis of the collected data. These existing systems may select various metrics from different service layers for comparison with a threshold. However, these existing systems are not focused on predictive analysis of the data. Other platforms may monitor and analyze in conjunction with a fully automated multi-layer cloud resource provisioning system with different features such as adaptive data filtering. Another existing system offers root cause analysis to detect anomalous behavior and analyze anomaly propagation via application of K-means clustering. This existing system suffers from at least one drawback described herein.

From a correlation perspective, some existing systems concentrate on leveraging advancements in stream processing platforms. Other existing systems address cross-network alarm correlation as a basis for root cause analysis. Different clustering algorithms are applied in these existing systems to automatically correlate classes of faults. In one approach of these existing systems, the assumption made is that lower layer issues affect higher layer issues as well as neighbor nodes. In other words, the lower layer alarms appear as the root-cause alarms more frequently than the upper layer alarms. However, direct application of alarms diminishes the resolution and accuracy of the decision-making modules.

Similarly, other existing systems characterize and identify the cloud system according to the analysis of datasets for four different metrics on the platform, namely power consumption, temperature, platform events, and workload. Further analysis of the correlation between the metrics from multiple layers is also performed, in which some of the metrics are highly correlated (i.e., power consumption and temperature), while other metrics are not (i.e., platform events and workload).

The correlation between metrics at different stack layers is divided into three logical groups namely expected, discovered and unexpected. Linear regression is applied to predict the congestion and latency as a function of expected and discovered correlated monitored metrics. A rule-based root-cause analysis is also proposed to avoid congestion. In another existing system, assuming Poisson distribution for failure rate and exponential distribution for repair rates, a Bayesian network is used to analyze the reliability-performance and reliability-energy correlations for cloud infrastructure as a service (IaaS) using a check-pointing fault recovery mechanism. In another existing system, a Bayesian network model of SLA prediction for cloud services is implemented. For example, first, an initial Bayesian network model is established by collecting CPU and memory data from the infrastructure layer, the number of processes from platform layer and the response time and availability from the application layer. Then the Bayesian network is trained and updated to obtain the cloud service SLA prediction model.

In addition to Bayesian networks, neural networks have been discussed as a method for predicting the SLA metrics. For instance, one existing system classifies the state of IBM BlueGreen cloud services through typical machine learning algorithms (rule-based classifier, support vector machine and nearest neighbor search method) applied to the logs of low-level hardware failures and high-level kernel faults. Taking fault tolerance as the objective, fault detection and fault localization techniques are classified into three groups: artificial intelligence techniques (including neural networks), traversing techniques and graphical models. Although combining the neural networks and Bayesian networks is one topic in the artificial intelligence domain, such a combination has not been deployed for cloud SLA prediction or maintenance. As discussed above, cloud SLA prediction typically relies on neural network or Bayesian networks, and not both.

The neural network and Bayesian network approaches have their respective pros and cons. For instance, despite the good performance of neural networks, there is still no comprehensive theoretical understanding of their inner organization. Thus, a neural network acts as a black-box without interpretation and capability of root cause analysis. On the other hand, an important assumption of DBN structure learning is that the data is generated by a stationary process, an assumption that does not hold for the dynamic cloud scenarios.

The disclosure solves at least some of the problems with existing systems. In particular, the analysis implemented in this disclosure and discussed below may be similar to the work mentioned above in that the analysis works with the collected data from various service layers. However, instead of applying neural networks or Bayesian networks separately, these two network approaches are mixed as discussed herein, and outperform the existing prediction solutions.

Some advantages of some embodiments described herein include:

• Combining neural network machine learning with causal graphical models to dynamically predict the cloud system behavior.

• Extracting discrete event alarms from the continuously monitored data and state space of the cloud system.

• Developing a Q-Learning module to map the events to the best recommended decisions that lead to automating cloud SLA management.

• Using modular cloud analytics structure to automate cloud services management.
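The Q-Learning mapping from events to management actions mentioned above can be roughly illustrated with a stateless (bandit-style) sketch. The event names, action names and reward table below are invented for illustration and do not reflect the actual policy configuration or learning setup of the disclosed framework.

```python
import random

# Hypothetical events (generated from predicted SLA violations) and
# candidate management actions. All names and rewards are illustrative.
EVENTS = ["cpu_sla_risk", "mem_sla_risk"]
ACTIONS = ["scale_out", "migrate_vm", "no_op"]

def q_learn(reward_fn, episodes=2000, alpha=0.1, eps=0.2, seed=0):
    """Single-step (stateless) Q-learning: learn Q(event, action) from a
    reward signal using epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(e, a): 0.0 for e in EVENTS for a in ACTIONS}
    for _ in range(episodes):
        e = rng.choice(EVENTS)
        if rng.random() < eps:                       # explore
            a = rng.choice(ACTIONS)
        else:                                        # exploit
            a = max(ACTIONS, key=lambda act: q[(e, act)])
        r = reward_fn(e, a)
        q[(e, a)] += alpha * (r - q[(e, a)])         # stateless update
    return q

# Simulated rewards: scaling out handles CPU risk best, migration memory risk.
def reward(e, a):
    table = {("cpu_sla_risk", "scale_out"): 1.0,
             ("mem_sla_risk", "migrate_vm"): 1.0}
    return table.get((e, a), -0.1)

q = q_learn(reward)
policy = {e: max(ACTIONS, key=lambda act: q[(e, act)]) for e in EVENTS}
print(policy)
```

A full implementation would use multi-step returns over the cloud state space rather than this one-step reward, but the event-to-action mapping principle is the same.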

Before describing in detail exemplary embodiments, it is noted that the embodiments reside primarily in combinations of apparatus components and processing steps related to network cloud services, and in particular to a method, cloud network, node and system for enforcement of rules associated with cloud services via probabilistic action triggering. Accordingly, the system, cloud network, node and method components have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present disclosure so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.

As used herein, relational terms, such as "first" and "second," "top" and "bottom," and the like, may be used solely to distinguish one entity or element from another entity or element without necessarily requiring or implying any physical or logical relationship or order between such entities or elements. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the concepts described herein. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including" when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms used herein should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

In embodiments described herein, the joining term, "in communication with" and the like, may be used to indicate electrical or data communication, which may be accomplished by physical contact, induction, electromagnetic radiation, radio signaling, infrared signaling or optical signaling, for example. One having ordinary skill in the art will appreciate that multiple components may interoperate and modifications and variations are possible of achieving the electrical and data communication.

FIG. 1 is a block diagram of a representation of dependencies between entities of cloud-based micro-services. A 3-tier cloud-based micro-services architecture is represented in FIG. 1 and includes infrastructure layer 12, virtualization layer 14 and application layer 16. The infrastructure layer 12 may include one or more servers 18a-18n, one or more switches 20 (SW 20), one or more IP routers 22, among other entities. The virtualization layer 14 may include one or more virtual machines (VMs) 24a-24n, among other entities. The application layer 16 may include one or more databases 26, one or more applications 28a-28n, and one or more web server applications (WS) 30, among other entities. For clarity, other cloud service layers such as a platform layer are omitted from FIG. 1.

In one example, features and metrics of the micro-services (e.g., hypertext transfer protocol (HTTP) GET requests received by the web server) have an impact on virtualization layer 14 component metrics (e.g., virtual machine (VM) 24 CPU utilization). Consequently, performance metrics such as CPU power consumption and cooling load on commodity servers are also impacted. The database input/output (I/O) throughput is likewise related to the VM 24 memory usage, the VM 24 communication traffic rate, the physical server, and so on. Three different correlation types are identified as follows:

• Vertical Correlation (indicated by dashed lines in FIG. 1) refers to the performance dependencies among entities due to layering in the cloud environment. For example, VM 24 performance in the virtualization layer may be dependent on the IP router in the infrastructure layer, and vice versa, as illustrated in FIG. 1. In other words, one or more performance metrics of one or more entities in a first layer in FIG. 1 may be dependent on one or more entities in a second layer different from the first layer. Similar dependencies exist among container features, VM metrics and physical machines.

• Horizontal Correlation (indicated by solid lines in FIG. 1) describes the dependencies caused by the topology and architecture of the services. Component features at the same service layer that are communicating with each other are horizontally correlated. For instance, as illustrated in FIG. 1, the performance metrics of the application servers 18 such as execution time are tightly related to the metrics of the web-server such as its incoming network traffic, both of which are logically located in the application layer 16.

• Time Correlation represents the statistical correlation of a component's behavior over time. This type of correlation is used in many approaches, such as time series analysis, to predict the future states of dynamic metrics such as CPU, memory, etc.
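The time and horizontal correlations described above can be estimated directly from monitored metric series. The following sketch (illustrative only, not part of the disclosure; the toy web-server/CPU trace and function names are assumptions) computes a lagged Pearson correlation between two metrics:

```python
import numpy as np

def lagged_correlation(x, y, max_lag):
    """Pearson correlation of two z-scored metric series at integer lags.

    Under this alignment convention, corrs[lag] pairs x[t] with
    y[t + |lag|] for negative lags, so a peak at a negative lag means
    the second metric follows the first.
    """
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    corrs = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag < 0:
            c = float(np.mean(x[:lag] * y[-lag:]))
        elif lag > 0:
            c = float(np.mean(x[lag:] * y[:-lag]))
        else:
            c = float(np.mean(x * y))
        corrs[lag] = c
    return corrs

# Toy trace: VM CPU utilization follows web-server request load
# with a two-sample delay.
rng = np.random.default_rng(0)
req = rng.normal(size=500)
cpu = np.roll(req, 2) + 0.1 * rng.normal(size=500)
corrs = lagged_correlation(req, cpu, max_lag=5)
best_lag = max(corrs, key=corrs.get)
```

The peak at a nonzero lag is the kind of cross-layer, time-lagged dependency that the DBN described herein is meant to capture.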

FIG. 2 is a block diagram of a three-dimensional representation of the correlation(s) among cloud entities and/or components such as VMs 24, APs 28 and operating systems (OS) 32 or platforms 32. In FIG. 2, vertical correlation is indicated by solid lines, larger dashed lines indicate horizontal correlation and smaller dashed lines indicate time correlation (e.g., the same Appl at times t-1, t, t+1). In general, the correlation among metrics of cloud entities can be decomposed into the types mentioned above. Thus, entities in different layers and at various locations are correlated with varying time lags. Again, for simplicity, only some entities, such as applications 28, platforms (operating systems (OS)) 32 and VMs 24, are shown in FIG. 2. Aspects such as size and hatching pattern may symbolically represent the metrics and features of cloud entities that vary over time. For example, a micro-service application (Appl(t)) 28 may have metrics and/or features that vary at different times, such as at time (t+2), hence the various hatching patterns and sizes in FIG. 2. In another example, a micro-service application 28 located in a VM 24 is correlated to the metrics of another VM hosting other components. These dependencies are used to predict the future states of the components.

One aspect of the system described herein is to forecast the next states, i.e., predicted future states, of the cloud services. Cloud management decisions can then be made pro-actively, i.e., before the predicted event occurs. Although deep neural networks (DNNs) may outperform causal inference models (e.g., the dynamic Bayesian network (DBN)) in estimation, both can be applied to the cloud systems described herein for at least the following reasons:

• Conventional neural network modules may perform as a black box, without interpretation or the capability of root cause analysis. The DBN described herein supports root cause analysis and, as a result, improves the decision-making process for fault detection and fault localization. In other words, the DBN, as a diagnostic inference tool, helps determine the cause of faults and limits the decision space.

• The DBN described herein at least attempts to account for conditional independence of some variables, to simplify the graph and to decrease the number of parameters needed to estimate the joint probability distribution. This reduces the number of input variables of the neural network and lessens the complexity and scale of the neural networks.
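The parameter reduction from exploiting conditional independence can be illustrated with a simple count. This sketch (the twelve-metric network and its parent counts are hypothetical, chosen only for illustration) compares a full joint table over binary metrics with a Bayesian-network factorization:

```python
def joint_table_params(n_vars, arity=2):
    """Free parameters to specify a full joint distribution
    over n_vars discrete variables of the given arity."""
    return arity ** n_vars - 1

def bayes_net_params(parent_counts, arity=2):
    """Free parameters for a Bayesian network: each node stores a CPD
    with (arity - 1) * arity ** (number of parents) entries."""
    return sum((arity - 1) * arity ** p for p in parent_counts)

# Hypothetical 12-metric network where no metric has more than 3 parents.
full = joint_table_params(12)                              # 4095
sparse = bayes_net_params([0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3])  # 65
```

Here the factorized model needs 65 parameters instead of 4095, which is the sense in which the DBN shrinks the input space handed to the neural network.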

FIG. 3 is a block diagram of an exemplary cloud network 34, i.e., system 34, for enforcement of at least one rule associated with cloud services in accordance with the principles of the disclosure. Cloud network 34 includes one or more nodes 36a-36n. As used herein, node 36 refers to one or more of nodes 36a-36n. Node 36 includes one or more transmitters 38 and one or more receivers 40 for communicating with one or more nodes 36 and/or other entities in cloud network 34. In one or more embodiments, transmitter 38 and receiver 40 include or are replaced by one or more communication interfaces.

Node 36 includes processing circuitry 42. Processing circuitry 42 includes processor 44 and memory 46. In addition to a traditional processor and memory, processing circuitry 42 may comprise integrated circuitry for processing and/or control, e.g., one or more processors and/or processor cores and/or FPGAs (Field Programmable Gate Arrays) and/or ASICs (Application Specific Integrated Circuits). Processor 44 may be configured to access (e.g., write to and/or read from) memory 46, which may comprise any kind of volatile and/or nonvolatile memory, e.g., cache and/or buffer memory and/or RAM (Random Access Memory) and/or ROM (Read-Only Memory) and/or optical memory and/or EPROM (Erasable Programmable Read-Only Memory). Such memory 46 may be configured to store code executable by processor 44 and/or other data, e.g., data pertaining to communication, such as configuration and/or address data of nodes, etc.

Processing circuitry 42 may be configured to control any of the methods and/or processes described herein and/or to cause such methods, signaling and/or processes to be performed, e.g., by node 36. Processor 44 corresponds to one or more processors 44 for performing node 36 functions described herein. Node 36 includes memory 46 that is configured to store data, programmatic software code and/or other information described herein. In one or more embodiments, memory 46 is configured to store enforcement code 48. For example, enforcement code 48 includes instructions that, when executed by processor 44, cause processor 44 to perform the processes described herein with respect to node 36.

The term "node 36" used herein can be any kind of networking entity capable of providing computing resources. It is contemplated that the functions of node 36, in one or more embodiments, are distributed among several physical devices locally or across a network cloud such as a backhaul network and/or the Internet.

FIG. 4 is a flow diagram of an example enforcement process of enforcement code 48 using the analytics described herein. In particular, processing circuitry 42 monitors data (M) as described herein (Block S100). Processing circuitry 42 performs a DBN process on the monitored data (M) and outputs correlations (Φ) generated by the DBN as described herein (Block S102). Processing circuitry 42 performs Fast Fourier Transforms (FFT) and/or Inverse FFTs (IFFT) on the correlations, where the results of the FFT/IFFT are denoted as R (Block S104).

Processing circuitry 42 is configured to apply a neural network "featurizer," such as an LSTM neural network, to the R data, thereby generating/determining and outputting predicted future states (S) of one or more metrics as described herein (Block S106). Processing circuitry 42 is configured to determine transition events (Ψ) from the predicted future states, i.e., converts the predicted future states to event space (Block S108). Processing circuitry 42 is configured to perform a planner process, such as a Reinforcement Learning (RL) process, in order to determine one or more management actions based on the transition events (Block S110). Processing circuitry 42 is configured to execute one or more management actions (Block S112). For example, in one or more embodiments, the flow diagram of FIG. 4 may be re-illustrated as shown in FIG. 5, where M is the monitored data (i.e., cloud service data), Φ are the correlations generated by the DBN, S are the predicted future states determined by the LSTM, Ψ are the transition events generated by the MCMC process, and Λ are the management actions determined by Q-learning.

In particular, cloud service data such as logs and real-time monitoring data are input, i.e., received by receiver 40, to calculate the correlations and conditional probability distributions for the DBN. The deep neural network implementing LSTM inputs the results of the DBN and predicts the future states of the cloud services, i.e., cloud computing services. According to the estimated/predicted future states, transition events are generated. Afterwards, processing circuitry 42 determines at least one management action via a Reinforcement Q-learning method. FIG. 5 depicts one or more embodiments of the modular structure of the platform described herein that encompasses the analysis and planning parts of a Monitor-Analyze-Plan-Execute over a shared Knowledge (MAPE-K) autonomic model. Monitored and collected data from the cloud (M) are filtered and pre-processed for the collection of CPDs (Φ) via Bayesian inference. An LSTM module is configured to estimate/predict future states of the cloud services (S) using correlation coefficients and CPDs. A collection of events, i.e., transition events, may be generated according to the predicted future states. Reinforcement Q-learning is applied to map the events to a set of actions (Λ), such as management actions. In one or more embodiments, results and data for this enforcement process are stored in memory 46.
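The staged flow above, from monitored data M through correlations Φ and predicted states S to transition events Ψ and actions Λ, can be sketched as a simple function chain. The stubs below merely stand in for the DBN, LSTM, MCMC and Q-learning modules and are assumptions for illustration:

```python
from typing import Callable

def enforcement_pipeline(monitored,
                         dbn: Callable, featurizer: Callable,
                         event_gen: Callable, planner: Callable):
    """Chain the MAPE-K-style stages: M -> Phi -> S -> Psi -> Lambda."""
    phi = dbn(monitored)          # correlations / CPDs
    states = featurizer(phi)      # predicted future states
    events = event_gen(states)    # transition events
    actions = planner(events)     # management actions
    return actions

# Placeholder stages: a trivial "correlator", a +0.05 drift "predictor",
# a threshold event generator, and a one-action planner.
actions = enforcement_pipeline(
    monitored=[0.7, 0.9, 0.95],
    dbn=lambda m: {"cpu": m},
    featurizer=lambda phi: [x + 0.05 for x in phi["cpu"]],
    event_gen=lambda s: [i for i, x in enumerate(s) if x > 0.9],
    planner=lambda ev: ["scale_out"] * len(ev),
)
```

The point of the sketch is the data flow, not the stage internals: each module consumes only its predecessor's output, which is what makes the architecture modular.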

Message Passing

Calculating the joint CPDs is a non-deterministic polynomial-time hard (NP-hard) problem in Bayesian network inference. In cloud computing systems, sensors and monitoring agents may not report measurements synchronously. Therefore, asynchronous belief propagation algorithms such as clique tree message passing can be applied to update the CPDs. However, since many loops may exist in the Bayesian network, joint inference methods, tree re-parameterization (TRP) and residual belief propagation (RBP) are applied for faster convergence. Monitored data, in the form of RBP messages, is not considered in updating the CPDs unless there is a minimum predefined change in the metric. These algorithms may substantially reduce the computation.
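The rule that a message is ignored unless the metric changed by a minimum predefined amount can be sketched as follows; the class name, threshold value and message trace are illustrative assumptions, not the disclosed implementation:

```python
class ThresholdedUpdater:
    """Drop RBP-style metric messages whose change is below a
    predefined minimum, so CPD updates run only on meaningful moves."""

    def __init__(self, min_delta):
        self.min_delta = min_delta
        self.last = {}      # last accepted value per metric
        self.updates = 0    # how many times CPDs would be refreshed

    def on_message(self, metric, value):
        prev = self.last.get(metric)
        if prev is not None and abs(value - prev) < self.min_delta:
            return False            # ignored: below the predefined change
        self.last[metric] = value
        self.updates += 1           # a real system would refresh CPDs here
        return True

u = ThresholdedUpdater(min_delta=0.05)
results = [u.on_message("cpu", v) for v in [0.50, 0.51, 0.58, 0.59]]
```

Only the first report and the 0.50 to 0.58 jump trigger an update; the two small moves are filtered out, which is how the computation saving arises.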

Normalization: Raw time series data often have gaps, and the time between reported measurements may be variable. Thus, normalization of the data is required before processing to find the CPDs. Therefore, in one or more embodiments, the data is normalized before computing the cross-correlation.
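One plausible normalization step, sketched here under the assumption of linear interpolation onto a fixed grid followed by z-scoring (the disclosure does not fix a particular scheme), is:

```python
import numpy as np

def normalize_series(times, values, step):
    """Resample an irregularly sampled metric onto a regular grid
    (linear interpolation) and z-score it before cross-correlation."""
    grid = np.arange(times[0], times[-1] + step, step)
    resampled = np.interp(grid, times, values)
    return (resampled - resampled.mean()) / resampled.std()

# Irregular report times (seconds) with gaps, as from real monitoring agents.
t = np.array([0.0, 1.0, 2.5, 6.0, 7.0])
v = np.array([10.0, 12.0, 11.0, 15.0, 14.0])
z = normalize_series(t, v, step=1.0)
```

After this step, every metric series lives on the same time grid with zero mean and unit variance, so cross-correlations between metrics are directly comparable.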

DBN interacting with Monitoring: As shown in FIGS. 4 and 5, DBN may be the only portion of the enforcement process which interacts directly with monitoring data. This may help to enhance the quality of monitoring in the following ways:

· As the error in calculating the CPDs increases, the monitoring frequency of the associated metrics can be increased to better pro-actively calculate the probabilities.

• When message passing algorithms require distinctive information, the pull-driven monitoring structure can be used to obtain the required information from the cloud components.

Fourier Transform

Cloud monitored data, i.e., cloud service data, may be collected, i.e., received at receiver 40, at irregular time intervals across vast arrays of decentralized sensors, which results in noisy data. Regression can be used to approximate the unknown values, which brings some noise into the analysis as well. An alternative solution is to use the spectral density of correlation functions. In this approach, at each time slot, the Fast Fourier Transform (FFT) is calculated, and the correlation functions are calculated via an Inverse FFT (IFFT). The high-frequency part of the spectrum is considered noise and can be filtered out.
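The spectral approach can be sketched with NumPy's FFT routines. The cutoff choice (keep_fraction) and the toy signal are assumptions for illustration only:

```python
import numpy as np

def filtered_cross_correlation(x, y, keep_fraction=0.2):
    """Cross-correlation via FFT/IFFT with the high-frequency part of
    the spectrum zeroed out as noise, per the spectral approach above."""
    n = len(x)
    X = np.fft.fft(x - x.mean())
    Y = np.fft.fft(y - y.mean())
    spec = X * np.conj(Y)                          # cross-spectral density
    freqs = np.fft.fftfreq(n)                      # normalized frequencies
    spec[np.abs(freqs) > keep_fraction / 2] = 0    # drop high-frequency noise
    return np.fft.ifft(spec).real / n              # back to lag domain

rng = np.random.default_rng(1)
sig = np.sin(np.linspace(0.0, 8.0 * np.pi, 256))   # slow periodic metric
noisy = sig + 0.5 * rng.normal(size=256)           # same metric, noisy sensor
corr = filtered_cross_correlation(sig, noisy)
```

Because the slow signal sits in the retained low-frequency band while most of the sensor noise is discarded, the zero-lag correlation remains strongly positive despite the noise.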

Deep Neural Network such as LSTM neural network

The cross-correlation coefficients calculated and output from the DBN as described herein are input into the deep neural network, such as an LSTM neural network. In some embodiments, LSTM is selected for prediction for the following reasons:

• In practice, some neural networks are not capable of handling long-term dependencies, while LSTM can recognize and learn useful long-term dependencies. In one or more embodiments, LSTMs are explicitly designed for this purpose. LSTM excels at remembering values for both long and short durations of time, which may be a better match for DBNs.

• In statistical learning (which includes supervised neural networks), one assumption is that training sets are independently drawn from the probability distribution. However, in the cloud context, the input data is not independent and identically distributed (i.i.d.), but cross-correlated and heteroskedastic. In one or more embodiments, this kind of time correlated data may be better suited for the LSTM approach to predict the future states of the system and/or cloud services.

• Learning approaches in neural networks apply recurrent back-propagation, which is time and process consuming. Therefore, rather than disregarding previous parameters and starting parameter calculations from scratch, it may be more computationally efficient to start with previous parameters.

• Relative insensitivity to the data gap length gives an advantage to LSTM over alternative neural network algorithms. The LSTM network is configured to learn from experience when there are time lags of unknown size in the monitoring data.

Therefore, the deep neural network, such as the LSTM network, determines a plurality of predicted future states based on the output from the DBN.
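While the LSTM itself is not reproduced here, the supervised shape of its inputs can be sketched: sliding windows of past metric values paired with a future target. The window and horizon values below are illustrative assumptions:

```python
import numpy as np

def make_windows(series, window, horizon):
    """Turn a metric series into (input window, future target) pairs,
    the supervised shape an LSTM-style predictor is trained on."""
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])          # past `window` samples
        y.append(series[i + window + horizon - 1])  # value `horizon` ahead
    return np.array(X), np.array(y)

cpu = np.linspace(0.1, 1.0, 10)   # toy utilization trace
X, y = make_windows(cpu, window=3, horizon=1)
```

Each row of X is three consecutive past readings and the matching entry of y is the next reading, i.e., the "predicted future state" the network learns to emit.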

Markov Chain Monte Carlo (MCMC) EVENT GENERATOR

For each of the metrics provided by the cloud network 34, a safe interval is defined based on the SLA attributes. From a user perspective, application metrics (e.g., latency) have a higher priority, while from the cloud service provider's perspective, lower layer performance metrics (e.g., CPU utilization) have greater importance. Therefore, different metrics may have different constraints. A set of events is generated and sent to the decision maker module to perform the Q-learning process for selecting one or more actions to implement. In one or more embodiments, the set of events that are generated are transition events as described in detail with respect to FIG. 7. In other words, the event generator converts one or more of the predicted future states in state space to an event space. By converting the various predicted future states to the event space, the computational complexity of management action selection is reduced such that the complexity may be logarithmic in the number of actions, or equivalently linear in the number of dimensions. At least some of the advantages of choosing to implement DBNs and MCMC for the modeling and event generation processes are as follows:

• Scalability and speed are advantages provided by the use of DBNs (inference) for large monitored datasets. DBN inference is trivial to parallelize and gets better results than running MCMC for the same amount of time. As the scale of the cloud increases, the computation time of MCMC grows exponentially; thus, the MCMC process may not be adequate for very large data sets (with many states).

• Inference is irredeemably biased, whereas MCMC's bias approaches zero. The sample variance of an MCMC estimate usually approaches zero as more samples are considered. Thus, for a limited search space, MCMC outperforms DBN.
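A minimal sketch of MCMC-style event generation follows, assuming a toy two-state chain with hypothetical transition probabilities (in the system described herein, the state space and transition kernel would come from the predicted future states):

```python
import numpy as np

def sample_transition_events(P, violating, start, n_steps, rng):
    """Walk a Markov chain (row-stochastic matrix P) and record
    transition events: steps crossing from a non-violating state
    into an SLA-violating state."""
    events = []
    s = start
    for t in range(n_steps):
        nxt = rng.choice(len(P), p=P[s])
        if not violating[s] and violating[nxt]:
            events.append((t, s, nxt))
        s = nxt
    return events

# Toy chain: state 0 = warning, state 1 = SLA violation.
P = np.array([[0.8, 0.2],
              [0.5, 0.5]])
violating = [False, True]
events = sample_transition_events(P, violating, start=0, n_steps=1000,
                                  rng=np.random.default_rng(42))
rate = len(events) / 1000   # empirical crossing rate into violation
```

Rather than enumerating every joint state, the sampler only surfaces the crossings that matter for SLA enforcement, which is the state-space reduction FIG. 7 illustrates.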

FIG. 6 is an example flow diagram of the enforcement process of enforcement code 48 in accordance with the principles of the disclosure. Processing circuitry 42 is configured to monitor cloud service data associated with a cloud network 34 (Block S114). In one or more embodiments, the cloud service data includes data that corresponds to different cloud service layers. In one or more embodiments, monitoring data includes data received at receiver 40 from a cloud environment as described herein. For example, the monitored data may be in the form of residual belief propagation (RBP) messages received at receiver 40 from one or more other nodes 36 providing one or more cloud services, as described herein. The monitored data may be received at receiver 40 from other network elements as well. In one or more embodiments, the monitored data is stored in memory 46.

Processing circuitry 42 is configured to generate a conditional probability distribution (CPD) of performance metrics of the cloud network 34 based on the monitored data (Block S116). In one or more embodiments, at least one processor 44 is configured to train a dynamic Bayesian network (DBN) using monitoring data, i.e., cloud service data, associated with the cloud services, thereby generating the CPD of performance metrics. At least one processor 44 is configured to generate correlation values based on the training of the DBN, as described herein. In one or more embodiments, generating the CPD of performance metrics includes generating correlation coefficients associated with the performance metrics. Processing circuitry 42 is configured to determine a plurality of predicted future states of the performance metrics based on the generated CPD (Block S118). In one or more embodiments, the performance metrics associated with the predicted future states are determined using a neural network. In one or more embodiments, a first predicted future state and a second predicted future state are adjacent predicted future states.

In one or more embodiments, at least one processor 44 is configured to determine a plurality of future states of the cloud services based on the correlation values, as described herein. In one or more embodiments, each of the plurality of predicted future states of the performance metrics is associated with a respective probability metric indicating a probability of the predicted future state meeting the predefined rule. Processing circuitry 42 is configured to determine a plurality of transition events, where each transition event corresponds to a transition from a first predicted future state that does not meet a predefined rule to a second predicted future state that meets the predefined rule (Block S120). In one or more embodiments, the predefined rule includes a predefined probability threshold. In one or more embodiments, the plurality of transition events are determined based on Markov Chain Monte Carlo, MCMC, modeling or processing. In one or more embodiments, processing circuitry 42 is further configured to map each transition event to a management action.

Processing circuitry 42 is configured to correlate the plurality of transition events (Block S122). Processing circuitry 42 is configured to trigger at least one management action for execution based on the correlation of the plurality of transition events (Block S124). In other words, at least one management action is triggered based on one or more predicted future states of one or more performance metrics, which is in stark contrast to existing systems that rely on current/real-time performance metrics of states to trigger an action. These existing systems are not configured to use the probabilities of predicted future states violating, i.e., meeting, a predefined rule as a basis for determining whether to trigger a management action.

In one or more embodiments, the triggering of the at least one management action is based on a negative time value for transitioning from one of the first predicted future state and the second predicted future state to a third predicted future state, the third predicted future state not being adjacent to the second predicted future state. While a negative time value is used as a reward for the RL in one or more embodiments, other rewards for the RL may be used in accordance with the principles of the disclosure. In one or more embodiments, the performance metrics include memory utilization, central processing unit utilization, latency, disk I/O, and/or other performance metrics. In one or more embodiments, the generating of the CPD of performance metrics, the determining of the plurality of predicted future states of the performance metrics and the determining of the plurality of transition events correspond to a logical chain of a Dynamic Bayesian Network, DBN, Long Short-Term Memory, LSTM, neural network and Markov Chain Monte Carlo, MCMC, modeling that are performed in one or more predefined orders, such as the different predefined orders illustrated in FIG. 4 and FIG. 9, for example.

Some other embodiments

In some embodiments, at least one processor 44 is configured to determine whether at least one of the plurality of future states violates at least one predefined rule, as described herein. At least one processor 44 is configured to, if the at least one of the plurality of future states violates at least one predefined rule, trigger at least one management action, as described herein.

In one or more embodiments, the training of the DBN includes filtering and pre-processing the monitoring data to generate a plurality of Conditional Probability Distributions (CPDs). In one or more embodiments, the correlation values are correlation coefficients. In one or more embodiments, the plurality of predicted future states are based on the CPDs. In one or more embodiments, the determination of the plurality of predicted future states of the cloud services based on the correlation values includes analyzing the correlation values using a neural network to generate neural network data. In one or more embodiments, the predicted future states are based on the neural network data. In one or more embodiments, the determination whether at least one of the plurality of predicted future states violates at least one predefined rule includes determining a plurality of events, i.e., transition events, according to the determined plurality of predicted future states. In one or more embodiments, the plurality of transition events are mapped to the at least one management action.

FIG. 7 is a diagram of a probabilistic state space for decision making in accordance with the principles of the disclosure. In particular, FIG. 7 illustrates a state space generated from Markovian modeling with two performance metrics, where various states, such as S1-S10 and a safe state, are illustrated. While FIG. 7 uses two performance metrics, the teachings of the disclosure are equally applicable to more or fewer than two performance metrics. The transitions between the predicted future states, such as from S1 to S6 and from S7 to S2, are illustrated in FIG. 7. The positioning of each state in state space corresponds to a probability that the state will transition into a violation state such as S6-S10, where a transition event corresponds to a transition from a first state not meeting the probability threshold to a second state that meets the probability threshold. For example, transitioning from S3 to S8 is referred to as a transition event.

Further, while more states may exist than are illustrated in FIG. 7, the MCMC sampling process reduces the state space to (1) one or more predicted future states, such as S1-S5, that indicate that the system may transition towards SLA violation, i.e., transition to a violation state, and (2) SLA violation states such as S6-S10. In other words, the state space is reduced to states corresponding to one or more transition events and an optional "Safe State" that may encompass other states. In one or more embodiments, the MCMC sampling process focuses on transition events, where each transition event corresponds to a transition from a predicted future state (e.g., S1, S2, etc.) that does not meet a predefined rule (i.e., an SLA violation rule) to a predicted future state (e.g., S6, S7, etc.) that meets this predefined rule. In one or more examples, multiple levels or thresholds may be predefined, where one level corresponds to a warning level and another level corresponds to an SLA violation level. Using FIG. 7 as an example, states S1-S5 meet the warning level/rule but not the SLA violation level/rule, while states S6-S10 meet the SLA violation level/rule. Therefore, a transition event, in this example, corresponds to a transition of a state from the warning level to the SLA violation level. In one example, a level or rule is met if the state corresponds to a value that is greater than and/or equal to a predefined threshold or other predefined metric. The dashed line indicates the probabilistic or probability threshold for SLA violation.
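The two-level scheme (warning versus SLA violation) can be sketched as follows; the threshold values and the toy probability trace are assumptions for illustration only:

```python
def classify(prob):
    """Map a state's SLA-violation probability to a level, using two
    predefined thresholds as described above (values are assumptions)."""
    WARNING, VIOLATION = 0.5, 0.8
    if prob >= VIOLATION:
        return "violation"
    if prob >= WARNING:
        return "warning"
    return "safe"

def transition_events(probs):
    """Events where consecutive predicted states cross from a level
    below the violation threshold into the violation level."""
    levels = [classify(p) for p in probs]
    return [(i, levels[i], levels[i + 1])
            for i in range(len(levels) - 1)
            if levels[i] != "violation" and levels[i + 1] == "violation"]

# Predicted violation probabilities for six successive future states.
ev = transition_events([0.3, 0.6, 0.85, 0.9, 0.4, 0.95])
```

Only the two crossings into the violation level produce events; states that stay below the threshold, or remain in violation, generate nothing for the decision maker to act on.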

DECISIONMAKER MODULE

Q-Learning is a reinforcement, off-policy, learning-based approach that is used to determine at least one management action to select. The reinforcement learning strategy takes into consideration the correlation among the transition events. In the Q-Learning method, a future reward is assigned to each combination of management actions. In this disclosure, the reward may be presumed to be the negative of the time that it takes to transition from the adjacent predicted future states back to the normal predicted future states, such as a predicted future state meeting one or more predefined rules and/or not meeting one or more other predefined rules. For example, in one or more embodiments, the predicted state may not meet the SLA violation rule/level but may meet the warning rule/level, or may not meet either rule/level. The Q-Learning algorithm may correspond to a value iteration update.
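A minimal sketch of the value iteration update with the negative-time reward described above follows; the states, actions and recovery times are hypothetical:

```python
def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning value-iteration update. Here the reward is the
    negative time to return to a safe state, per the disclosure."""
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] += alpha * (reward + gamma * best_next - Q[s][a])
    return Q[s][a]

# Hypothetical warning state "S3": "scale_out" recovers in 4 s,
# "migrate" in 9 s, so their rewards are -4 and -9 respectively.
Q = {"S3": {"scale_out": 0.0, "migrate": 0.0}, "safe": {}}
for _ in range(50):
    q_update(Q, "S3", "scale_out", reward=-4.0, s_next="safe")
    q_update(Q, "S3", "migrate", reward=-9.0, s_next="safe")
best = max(Q["S3"], key=Q["S3"].get)
```

After repeated updates the action values converge toward the (negative) recovery times, so the greedy policy selects the action with the fastest return to a safe state.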

FIG. 8 is a block diagram of an example of a mixed automation model including DBN, LSTM and MCMC in accordance with the principles of the disclosure. The application of the Bayesian networks, neural network and MCMC is presented in FIG. 8, in which each section is configured in a logical chain in a predefined order. One advantage of the predefined architecture illustrated in FIG. 8 is its modularity, which allows easy structural additions in the future and parallelism of different components, as illustrated in FIG. 9, which is a diagram of another architecture that outperforms the one represented in FIG. 8. In the arrangement of FIG. 9, DBNs are configured on top of the neural networks (LSTM) and backpropagation of the LSTM goes through the DBNs first. In one or more embodiments, instead of having separate modules, a Deep Bayesian Neural network is embedded, whose loss function is the Kullback-Leibler (KL) divergence of the joint probability of all performance metrics.

FIG. 10 is a flow diagram of an example root cause analysis for identifying a source or cause of an SLA violation, i.e., the cause of why one or more states met an SLA violation rule/level, in accordance with the principles of the disclosure. In one or more embodiments, the root cause analysis is performed separately from and/or in addition to the DBN/MCMC processes. When an event is generated, in one or more embodiments, three methods are activated to find the root cause. One method searches the past correlated events, while the two other methods go through the DBN of performance metrics and find the main attributed metric and the other metrics highly correlated with it. For example, processing circuitry 42 is configured to find or determine time-correlated events (Block S128). Processing circuitry 42 is configured to perform sensitivity analysis of events, described below (Block S130). Processing circuitry 42 is configured to find or determine the attributed metrics (Block S132). Processing circuitry 42 is configured to perform the sort time function (SortTime()) (Block S134). Processing circuitry 42 is configured to perform the sort components function (SortComponents()) for sorting performance metrics based on component type (Block S136). Processing circuitry 42 is configured to perform the sort layer function (SortLayer()) for sorting performance metrics based on layer (Block S138). Processing circuitry 42 is configured to perform the sort correlation function (SortCorrelation()) for sorting performance metrics based on a correlation with a primary performance metric (Block S140). Processing circuitry 42 is configured to perform the root selection function (Root_Selection()) for determining the assigned metrics of the component (Block S142). In other words, SortTime(), SortComponents(), SortLayer() and SortCorrelation() are sorting processes, where one or more of these sorting processes may be selected to be part of the root cause analysis processing. For example, Root_Selection() is the function that selects the data from the one or more sorting processes on which the result of the root selection process is to be based, i.e., based on the Time, Components, Layer and/or Correlation processes. Referring back to the "START", processing circuitry 42 is configured to perform the Bayesian Analysis described herein (Block S144). The paths/methods from the "START" may be activated based on an alarm generation.

For further explanation, it is assumed that the event e happens in state s, and that x denotes a performance metric (related to a specific component in a specific layer at a specific time). The sensitivity analysis module returns the performance metric x with maximum sensitivity in state s (max ΔS/Δx). For this purpose, deviation is applied in a back-propagation manner to find the metrics causing the sensitivity. The Bayesian Analysis finds the x such that:

ArgMax_x P(x|s, e) = p(e|x, s) p(s|x) p(x) / (p(e|s) p(s))

where:

p(x) = probability of violation via performance metric x
p(e) = probability of error e happening in the system
p(s|x) = probability of being in state s given x
p(e|s) = probability of e happening in state s
p(e|x, s) = probability of e happening in state s via performance metric x
Note that historical data may be used to find all of the above posterior and prior probability distributions. Thus, there are many parameters in different components and different layers that are candidates for the root of the event cause. These performance metrics are sorted according to four different criteria: time, location (component), layer, and correlation with the primary performance metric (the metric attributed to the generated event). These sorting algorithms give different weights to these metrics. Also, another machine-learning-based method can be used in Root_Selection() to select the root based on the weights. For example, the metric with the maximum weight is selected as the main root.
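Because the denominator p(e|s)p(s) does not depend on x, the Bayesian root selection reduces to an argmax over unnormalized scores. The following sketch uses hypothetical probability tables (in practice these come from historical data, as noted above):

```python
def root_cause(candidates, p_e_given_xs, p_s_given_x, p_x):
    """Pick the metric x maximizing P(x | s, e). The denominator
    p(e|s) p(s) is constant in x, so the unnormalized score
    p(e|x,s) * p(s|x) * p(x) suffices for the argmax."""
    scores = {x: p_e_given_xs[x] * p_s_given_x[x] * p_x[x]
              for x in candidates}
    return max(scores, key=scores.get), scores

# Hypothetical tables for three candidate metrics.
cands = ["cpu", "mem", "disk_io"]
x_star, scores = root_cause(
    cands,
    p_e_given_xs={"cpu": 0.7, "mem": 0.4, "disk_io": 0.2},
    p_s_given_x={"cpu": 0.6, "mem": 0.5, "disk_io": 0.9},
    p_x={"cpu": 0.3, "mem": 0.4, "disk_io": 0.1},
)
```

With these illustrative numbers, the CPU metric scores 0.7 x 0.6 x 0.3 = 0.126 and is selected as the main root.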

FIG. 11 is a diagram of an example cloud implementation of the enforcement process as analytics as a service (AaaS) in accordance with the principles of the disclosure. The cloud implementation includes management module(s) 52, administrative module(s) 54, Project Management as a Service (PMaaS) applications 56, connectivity 58 between applications, Log Forwarder/Resilient Distributed Dataset (RDD) Application Program Interfaces (APIs) 60, pre-processing module 62, SAR data 64, a Big Data Time Series Database Management System (DBMS), and various databases 68. The cloud implementation may further include RESTful API 70 and a visualization plug-in 72 such as Kibana 74. The cloud implementation may further include OpenStack Ceilometer 76, configuration update (e.g., using the YAML language) 78, cloud automation tools 80 and various automation software 82 such as SaltStack, Ansible, etc. The cloud implementation further includes one or more monitoring applications such as Nagios 84 and Sensu 86. The cloud implementation further includes monitoring functionality (Block S100), such as SNMP and/or DSIM monitoring as described herein, DBN functionality (Block S102) as described above, the neural network featurizer (Block S106) as described above, and event generation (Block S108) as described above.

Real-time data collected by different monitoring modules/processes, such as Ceilometer 76 (e.g., OpenStack), Istio-Prometheus (e.g., Kubernetes), Beats and Nagios 84, are used as input to the enforcement process via a RESTful interface 70. The collected data is saved in a Big-Data NoSQL (Not only Structured Query Language) time series database 68 (e.g., Apache Cassandra). The fast stream processing or preprocessing 62 platform Spark parses the data from the database, or gets the data in a RESTful application program interface (API) 70 format. Logstash, along with a NoSQL database, retrieves the data from the distributed logs. A micro-service component written in Python (using the BayesPy library) is used to set up the DBN (Block S102) from the monitored data. The PySpark API fetches the data from the DBN module. NumPy FFT and IFFT libraries are used to find the cross-correlation among the monitored performance metrics. The TensorFlow API (in Python) and micro module LSTM (Block S106) predict the future states of the system and/or cloud services. The predicted future states can be presented to management and/or administrative module(s) via Kibana 74 or other visualization modules. Alternatively, the forecasted states can be fed to other micro-services to generate cloud alarms and determine cloud service management solutions. These configuration recommendations can be sent to a Dev/Ops platform 80 such as Jenkins/Ansible 82 in the YAML format.
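The FFT/IFFT-based cross-correlation step can be sketched with NumPy as below. This is a minimal illustration of the frequency-domain correlation technique, assuming circular correlation of two synthetic metric series; it is not the implementation from the disclosure.

```python
# Minimal sketch of FFT-based cross-correlation between two monitored
# performance-metric time series, using NumPy's FFT/IFFT as mentioned in
# the text. The example series are synthetic and illustrative.
import numpy as np

def xcorr_fft(x, y):
    """Circular cross-correlation of x and y via the frequency domain."""
    X = np.fft.fft(x)
    Y = np.fft.fft(y)
    # Correlation theorem: corr(x, y) = IFFT(conj(FFT(x)) * FFT(y))
    return np.fft.ifft(np.conj(X) * Y).real

# Two metrics where y is x delayed (rolled) by 3 samples:
x = np.sin(np.linspace(0, 4 * np.pi, 64))
y = np.roll(x, 3)
lag = int(np.argmax(xcorr_fft(x, y)))  # peak at the 3-sample lag
```

The peak of the correlation output locates the lag at which two metrics are most strongly related, which is what lets the system tie a primary metric's event to correlated metrics elsewhere in the stack.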

Apparatus implementation example:

In one or more embodiments, the NoSQL time series database and preprocessing modules are memory intensive and should be given as much access to cache and memory as possible. Storage assigned to the databases should support high input/output operations per second (IOPS). Computing modules (Correlator, DBN, LSTM and MCMC) are process intensive and should be assigned to many parallel processors, such as GPUs.

FIG. 12 is an alternate embodiment of node 36 in accordance with the principles of the disclosure. Node 36 includes an enforcement module 88 configured to: monitor cloud service data associated with a cloud network, generate a conditional probability distribution, CPD, of performance metrics of the cloud network based on the monitored data, determine a plurality of predicted future states of the performance metrics based on the generated CPD, determine a plurality of transition events where each transition event corresponds to a transition from a first predicted future state that does not meet a predefined rule to a second predicted future state that meets the predefined rule, correlate the plurality of transition events, and trigger at least one management action for execution based on the correlation of the plurality of transition events, as described herein.
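The transition-event determination described above can be sketched as a scan over a sequence of predicted states, recording each transition from a state that fails a predefined rule to one that meets it. The rule used here (predicted latency under a threshold) and the values are assumed examples, not part of the disclosure.

```python
# Illustrative sketch of transition-event detection: find each index where
# a predicted state that violates a predefined rule is followed by one
# that satisfies it.

def find_transitions(predicted, rule):
    """Return indices i where predicted[i-1] fails the rule but predicted[i] meets it."""
    events = []
    for i in range(len(predicted) - 1):
        if not rule(predicted[i]) and rule(predicted[i + 1]):
            events.append(i + 1)
    return events

# Assumed example: predicted per-interval latencies and an SLA rule of < 100 ms.
predicted_latency_ms = [120, 95, 130, 140, 80, 75, 110, 90]
rule = lambda v: v < 100
events = find_transitions(predicted_latency_ms, rule)  # [1, 4, 7]
```

Each such event marks a predicted recovery boundary; correlating these events across metrics is what drives the triggered management action.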

In one or more embodiments, enforcement module 88 is configured to train a dynamic Bayesian Network (DBN) using monitoring data associated with the cloud services, generate correlation values based on the training of the DBN, determine a plurality of future states of the cloud services based on the correlation values, determine whether at least one of the plurality of future states violates at least one predefined rule, and if the at least one of the plurality of future states violates at least one predefined rule, trigger at least one management action, as described herein.

Some embodiments

According to one aspect of the disclosure, a method for enforcement of at least one rule for cloud services is provided. A dynamic Bayesian Network (DBN) is trained using monitoring data associated with the cloud services. Correlation values are generated based on the training of the DBN. A plurality of future states of the cloud services are determined based on the correlation values. A determination is made whether at least one of the plurality of future states violates at least one predefined rule. If the at least one of the plurality of future states violates at least one predefined rule, at least one management action is triggered.
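The flow summarized in this aspect (train, correlate, predict, check rules, trigger) can be sketched as a control-flow skeleton. Every stage below is a deliberately simplified placeholder; only the sequencing mirrors the description, and all names and thresholds are assumptions.

```python
# High-level, hypothetical sketch of the enforcement flow: the "training"
# and "prediction" stages are stubbed with toy logic so the end-to-end
# sequence (monitor -> model -> predict -> check rule -> act) is runnable.

def train_dbn(samples):
    # Placeholder "training": empirical probability that load exceeds 0.8.
    return sum(1 for s in samples if s > 0.8) / len(samples)

def predict_states(p_high, horizon):
    # Placeholder prediction: flag future slots as "high" when p_high > 0.5.
    return ["high" if p_high > 0.5 else "ok"] * horizon

def enforce(samples, horizon=3):
    p_high = train_dbn(samples)              # stage 1: model from monitoring data
    states = predict_states(p_high, horizon)  # stage 2: predicted future states
    # Stage 3: rule check -> trigger a management action per violation.
    return ["scale_out" for s in states if s == "high"]

actions = enforce([0.9, 0.85, 0.95, 0.4], horizon=2)  # ["scale_out", "scale_out"]
```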

According to one embodiment of this aspect, the training of the DBN includes filtering and pre-processing the monitoring data to generate a plurality of Conditional Probability Distributions (CPDs). The correlation values are correlation coefficients. According to one embodiment of this aspect, the plurality of future states are based on the CPDs. According to one embodiment of this aspect, the determination of the plurality of future states of the cloud services based on the correlation values includes analyzing the correlation values using a neural network to generate neural network data. The determined future states are based on the neural network data. According to one embodiment of this aspect, the determination whether the at least one of the plurality of future states violates at least one predefined rule includes determining a plurality of events according to the determined plurality of future states. According to one embodiment of this aspect, the plurality of events are mapped to the at least one management action.

According to another aspect of the disclosure, a node 36 for enforcement of at least one rule for cloud services is provided. The node 36 includes at least one processor 44 and at least one memory 46. The at least one processor 44 is configured to: train a dynamic Bayesian Network (DBN) using monitoring data associated with the cloud services, generate correlation values based on the training of the DBN, determine a plurality of future states of the cloud services based on the correlation values, determine whether at least one of the plurality of future states violates at least one predefined rule, and if the at least one of the plurality of future states violates at least one predefined rule, trigger at least one management action.

According to one embodiment of this aspect, the training of the DBN includes filtering and pre-processing the monitoring data to generate a plurality of Conditional Probability Distributions (CPDs). The correlation values are correlation coefficients. According to one embodiment of this aspect, the plurality of future states are based on the CPDs. According to one embodiment of this aspect, the determination of the plurality of future states of the cloud services based on the correlation values includes analyzing the correlation values using a neural network to generate neural network data. The determined future states are based on the neural network data.

According to one embodiment of this aspect, the determination whether the at least one of the plurality of future states violates at least one predefined rule includes determining a plurality of events according to the determined plurality of future states.

According to one embodiment of this aspect, the plurality of events are mapped to the at least one management action.

According to another aspect of the disclosure, a node 36 for enforcement of at least one rule for cloud services is provided. The node 36 includes a training module that is configured to train a dynamic Bayesian Network (DBN) using monitoring data associated with the cloud services, a correlation module that is configured to generate correlation values based on the training of the DBN, a state determination module that is configured to determine a plurality of future states of the cloud services based on the correlation values, a rule violation determination module that is configured to determine whether at least one of the plurality of future states violates at least one predefined rule, and a management module that is configured to, if the at least one of the plurality of future states violates at least one predefined rule, trigger at least one management action.

As will be appreciated by one of skill in the art, the concepts described herein may be embodied as a method, data processing system, and/or computer program product. Accordingly, the concepts described herein may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects, all generally referred to herein as a "circuit" or "module." Furthermore, the disclosure may take the form of a computer program product on a tangible computer usable storage medium having computer program code embodied in the medium that can be executed by a computer. Any suitable tangible computer readable medium may be utilized including hard disks, CD-ROMs, electronic storage devices, optical storage devices, or magnetic storage devices.

Some embodiments are described herein with reference to flowchart illustrations and/or block diagrams of methods, systems and computer program products. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable memory or storage medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

It is to be understood that the functions/acts noted in the blocks may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Although some of the diagrams include arrows on communication paths to show a primary direction of communication, it is to be understood that communication may occur in the opposite direction to the depicted arrows.

Computer program code for carrying out operations of the concepts described herein may be written in an object oriented programming language such as Java® or C++. However, the computer program code for carrying out operations of the disclosure may also be written in conventional procedural programming languages, such as the "C" programming language. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer. In the latter scenario, the remote computer may be connected to the user's computer through a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Many different embodiments have been disclosed herein, in connection with the above description and the drawings. It will be understood that it would be unduly repetitious and obfuscating to literally describe and illustrate every combination and subcombination of these embodiments. Accordingly, all embodiments can be combined in any way and/or combination, and the present specification, including the drawings, shall be construed to constitute a complete written description of all combinations and subcombinations of the embodiments described herein, and of the manner and process of making and using them, and shall support claims to any such combination or subcombination.

It will be appreciated by persons skilled in the art that the embodiments described herein are not limited to what has been particularly shown and described herein above. In addition, unless mention was made above to the contrary, it should be noted that all of the accompanying drawings are not to scale. A variety of modifications and variations are possible in light of the above teachings without departing from the scope of the following claims.