Title:
SERVICE LEVEL AGREEMENT (SLA) ASSURANCE FOR HIGHLY AVAILABLE APPLICATIONS
Document Type and Number:
WIPO Patent Application WO/2017/089866
Kind Code:
A1
Abstract:
A Service Level Agreement (SLA) is specified for a component. To satisfy the SLA, a system receives a component profile describing expected resource consumption of the component, and identifies available resources for an execution environment to execute the component. A component recovery time is calculated for a redundancy model supported by the component and for an execution environment composition supported by the available resources. The component recovery time includes a time duration for each phase in a set of phases of component recovery. The system also determines resource allocation based on the calculated component recovery time to satisfy a recovery time objective specified in the SLA.

Inventors:
KANSO ALI (CA)
HEIDARI PARISA (CA)
Application Number:
PCT/IB2015/059091
Publication Date:
June 01, 2017
Filing Date:
November 24, 2015
Assignee:
ERICSSON TELEFON AB L M (PUBL) (SE)
KANSO ALI (CA)
HEIDARI PARISA (CA)
International Classes:
G06F9/50; G06Q10/06
Foreign References:
US20120233501A12012-09-13
EP1760588A12007-03-07
Other References:
MANAR JAMMAL ET AL: "High availability-aware optimization digest for applications deployment in cloud", 2015 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC), 1 June 2015 (2015-06-01), pages 6822 - 6828, XP055291449, ISBN: 978-1-4673-6432-4, DOI: 10.1109/ICC.2015.7249413
Attorney, Agent or Firm:
RAHMER, David et al. (CA)
Claims:
CLAIMS

What is claimed is:

1. A method for satisfying a Service Level Agreement (SLA) specified for a component, comprising:

receiving a component profile describing expected resource consumption of the component;

identifying available resources for an execution environment to execute the component;

calculating a component recovery time for a redundancy model supported by the component and for an execution environment composition supported by the available resources, wherein the component recovery time includes a time duration for each phase in a set of phases of component recovery; and

determining resource allocation based on the calculated component recovery time to satisfy a recovery time objective specified in the SLA.

2. The method of claim 1, wherein calculating the component recovery time further comprises:

identifying, according to the redundancy model, which one or more phases that a redundant component goes through among a full set of phases in a component recovery process; and

calculating the component recovery time that includes a phase duration for each of the identified one or more phases.

3. The method of claim 1, wherein the set of phases includes one or more of the following: an instantiation phase, a fetch state phase, a parse state phase, a recuperation phase, a normal execution phase and a termination phase.

4. The method of claim 1, wherein the redundancy model is one of redundancy models that include: spare, cold standby, warm standby, hot standby and multi-active.

5. The method of claim 1, further comprising:

determining an action to satisfy the recovery time objective, the action including at least one of: resizing at least a resource allocated to the execution environment, changing a resource location, adding additional components, and changing the redundancy model for the component.

6. The method of claim 1, wherein the execution environment is a virtual machine, a container, or a physical server.

7. The method of claim 1, wherein the component profile further includes attributes specific to each phase in the set of phases of component recovery.

8. The method of claim 1, wherein the execution environment composition specifies an amount, performance and a location of resources allocated to the execution environment.

9. The method of claim 1, wherein calculating the component recovery time further comprises:

calculating a parallelization improvement by taking into account a number of execution threads in the component and a portion of the component that is strictly serial.

10. The method of claim 1, wherein calculating the component recovery time further comprises:

calculating delays caused by each of a set of resources allocated to the component during each phase of the set of phases; and

summing up the delays to obtain the time duration of each phase.

11. The method of claim 1, further comprising:

calculating an execution duration per request for the component based on an average number of instructions that a processor executes per request and an average number of memory transfers per request; and

determining an action to satisfy the SLA with respect to the execution duration.

12. A computer system operable to satisfy a Service Level Agreement (SLA) specified for a component, the computer system comprising circuitry including a processor and a memory, the memory containing instructions executable by the processor, wherein the computer system is operable to:

receive a component profile describing expected resource consumption of the component;

identify available resources for an execution environment to execute the component;

calculate a component recovery time for a redundancy model supported by the component and for an execution environment composition supported by the available resources, wherein the component recovery time includes a time duration for each phase in a set of phases of component recovery; and

determine resource allocation based on the calculated component recovery time to satisfy a recovery time objective specified in the SLA.

13. The computer system of claim 12, wherein the computer system is further operable to:

identify, according to the redundancy model, which one or more phases that a redundant component goes through among a full set of phases in a component recovery process; and

calculate the component recovery time that includes a phase duration for each of the identified one or more phases.

14. The computer system of claim 12, wherein the set of phases includes one or more of the following: an instantiation phase, a fetch state phase, a parse state phase, a recuperation phase, a normal execution phase and a termination phase.

15. The computer system of claim 12, wherein the redundancy model is one of redundancy models that include: spare, cold standby, warm standby, hot standby and multi-active.

16. The computer system of claim 12, wherein the computer system is further operable to:

determine an action to satisfy the recovery time objective, the action including at least one of: resizing at least a resource allocated to the execution environment, changing a resource location, adding additional components, and changing the redundancy model for the component.

17. The computer system of claim 12, wherein the execution environment is a virtual machine, a container, or a physical server.

18. The computer system of claim 12, wherein the component profile further includes attributes specific to each phase in the set of phases of component recovery.

19. The computer system of claim 12, wherein the execution environment composition specifies an amount, performance and a location of resources allocated to the execution environment.

20. The computer system of claim 12, wherein the computer system is further operable to:

calculate a parallelization improvement by taking into account a number of execution threads in the component and a portion of the component that is strictly serial.

21. The computer system of claim 12, wherein the computer system is further operable to:

calculate delays caused by each of a set of resources allocated to the component during each phase of the set of phases; and

sum up the delays to obtain the time duration of each phase.

22. The computer system of claim 12, further comprising:

calculating an execution duration per request for the component based on an average number of instructions that a processor executes per request and an average number of memory transfers per request; and

determining an action to satisfy the SLA with respect to the execution duration.

23. A system operable to satisfy a Service Level Agreement (SLA) specified for a component, the system comprising:

an input module to receive a component profile describing expected resource consumption of the component;

an identifier module to identify available resources for an execution environment to execute the component;

a calculation module to calculate a component recovery time for a redundancy model supported by the component and for an execution environment composition supported by the available resources, wherein the component recovery time includes a time duration for each phase in a set of phases of component recovery; and

an output module to determine resource allocation based on the calculated component recovery time to satisfy a recovery time objective specified in the SLA.

24. A computer readable storage medium storing executable instructions, which when executed by a processor, cause the processor to:

receive a component profile describing expected resource consumption of a component;

identify available resources for an execution environment to execute the component;

calculate a component recovery time for a redundancy model supported by the component and for an execution environment composition supported by the available resources, wherein the component recovery time includes a time duration for each phase in a set of phases of component recovery; and

determine resource allocation based on the calculated component recovery time to satisfy a recovery time objective specified in a Service Level Agreement (SLA).

Description:
SERVICE LEVEL AGREEMENT (SLA) ASSURANCE FOR HIGHLY AVAILABLE APPLICATIONS

TECHNICAL FIELD

[0001] This disclosure relates generally to systems and methods for satisfying a Service Level Agreement (SLA) in a cloud computing environment.

BACKGROUND

[0002] Recently, cloud computing has become the lifeblood of many telecommunication network services and information technology (IT) software applications. With the development of the cloud market, cloud computing can be seen as an opportunity for information and communications technology (ICT) companies to deliver communication and IT services over any fixed or mobile network with high performance and secure end-to-end quality of service (QoS) for end users. Although cloud computing provides benefits to different players in its ecosystem and makes services available anytime, anywhere and in any context, concerns arise regarding the performance and the quality of services offered by the cloud.

[0003] One area of concern is the High Availability (HA) of the applications hosted in the cloud. Since many of these applications are hosted by virtual machines (VMs) residing on physical servers, their availability depends on that of the hosting servers. When a hosting server fails, its VMs, as well as their applications, become inoperative.

[0004] Cloud computing offers the ability to use compute, network, and storage resources on-demand in an execution environment such as a virtualized environment. By virtualizing the physical infrastructure, the resources can be dimensioned at a finer grain, which can allow multiple tenants to share the same underlying infrastructure. Yet a question remains as to how to ensure that the resources allocated to a given software application are sufficient to guarantee that the software application provides its functionality according to its corresponding Service Level Agreement (SLA). The SLA can constrain the expected availability as well as the speed (e.g. bandwidth) of handling requests.

[0005] In order to meet the SLA requirements, solution providers often over-dimension their system's resources, leaving the system functioning at lower capacity most of the time and wasting resources. This conventional data center approach is now being replaced with the cloud computing approach where resources can be added and removed dynamically due to the advances in virtualization technologies including the use of software defined networks, software defined storage, etc. As a result, over-dimensioning is no longer appropriate. Therefore, it would be desirable to provide a system and method that obviate or mitigate the above described problems.

SUMMARY

[0006] It is an object of the present invention to obviate or mitigate at least one disadvantage of the prior art.

[0007] In a first aspect of the present invention, there is provided a method for satisfying an SLA specified for a component. The method comprises: receiving a component profile describing expected resource consumption of the component; identifying available resources for an execution environment to execute the component; calculating a component recovery time for a redundancy model supported by the component and for an execution environment composition supported by the available resources, wherein the component recovery time includes a time duration for each phase in a set of phases of component recovery; and determining resource allocation based on the calculated component recovery time to satisfy a recovery time objective specified in the SLA.

[0008] In another aspect of the present invention, there is provided a computer system operable to satisfy an SLA specified for a component. The computer system comprises circuitry including a processor and a memory, the memory containing instructions executable by the processor. The computer system is operable to: receive a component profile describing expected resource consumption of the component; identify available resources for an execution environment to execute the component; calculate a component recovery time for a redundancy model supported by the component and for an execution environment composition supported by the available resources, wherein the component recovery time includes a time duration for each phase in a set of phases of component recovery; and determine resource allocation based on the calculated component recovery time to satisfy a recovery time objective specified in the SLA.

[0009] In another aspect of the present invention, there is provided a system operable to satisfy an SLA specified for a component. The system comprises: an input module to receive a component profile describing expected resource consumption of the component; an identifier module to identify available resources for an execution environment to execute the component; a calculation module to calculate a component recovery time for a redundancy model supported by the component and for an execution environment composition supported by the available resources, wherein the component recovery time includes a time duration for each phase in a set of phases of component recovery; and an output module to determine resource allocation based on the calculated component recovery time to satisfy a recovery time objective specified in the SLA.

[0010] In yet another aspect of the present invention, there is provided a computer readable storage medium storing executable instructions, which when executed by a processor, cause the processor to: receive a component profile describing expected resource consumption of a component; identify available resources for an execution environment to execute the component; calculate a component recovery time for a redundancy model supported by the component and for an execution environment composition supported by the available resources, wherein the component recovery time includes a time duration for each phase in a set of phases of component recovery; and determine resource allocation based on the calculated component recovery time to satisfy a recovery time objective specified in an SLA.

[0011] Other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] Embodiments of the present invention will now be described, by way of example only, with reference to the attached Figures, wherein:

[0013] Figure 1 illustrates an overall approach for determining resource allocation to satisfy the SLA requirements according to one embodiment.

[0014] Figure 2 illustrates an example of behavior phases of a software component associated with redundancy models according to one embodiment.

[0015] Figure 3A and Figure 3B illustrate a Unified Modeling Language (UML) class diagram according to one embodiment.

[0016] Figure 4 is a flow chart illustrating a method for ensuring that a software component satisfies its SLA requirements according to one embodiment.

[0017] Figure 5 is a block diagram of a network element according to one embodiment.

[0018] Figure 6 is a block diagram of a cloud manager node according to one embodiment.

[0019] Figure 7 is an architectural overview of a cloud network according to one embodiment.

DETAILED DESCRIPTION

[0020] Reference may be made below to specific elements, numbered in accordance with the attached figures. The discussion below should be taken to be exemplary in nature, and not as limiting of the scope of the present invention. The scope of the present invention is defined in the claims, and should not be considered as limited by the implementation details described below, which as one skilled in the art will appreciate, can be modified by replacing elements with equivalent functional elements.

[0021] Systems and methods are provided for allocating resources for an execution environment to host application components, such that the recovery time of a failed component satisfies the requirements of its Service Level Agreement (SLA). In some embodiments, a model can be defined to profile a software component and its resource consumption. Based on this model, a method is derived to determine the amount and performance of resources needed to satisfy its SLA requirements.

[0022] In highly available systems, redundancy models are employed to ensure that when an active component fails, another redundant component can take over its workload. The term "component" herein refers to a software component that is a part of a software application. The redundant component is not necessarily idle. Depending on the particular redundancy model, the redundant component can be another active component that is already serving in the system, a non-instantiated spare component that will be instantiated after the failure, or a standby component in one of the behavior phases to be described in further detail with reference to Figure 2. A component may be backed by any number of active components, spare components and/or standby components. The method described herein calculates the recovery time for a component to fail over to its redundant component under different redundancy models, and thus determines which redundancy model can be used to satisfy the SLA requirements. It is noted that the term "redundancy model" may be used interchangeably with "redundancy scheme."

[0023] The redundant component may be hosted by one or more redundant resources. These redundant resources may be in a different zone or region from the primary resources. Failover from a primary resource to a redundant resource may take place when a failure occurs in the primary resources, during the maintenance of the primary resources, or for maximizing clean energy consumption. The method described herein also verifies whether the redundant resources meet the SLA requirements, especially during a failover procedure. The method calculates the delays caused by the resources allocated to the component, taking into account the network delay between the primary and redundant resources. Thus, it can be determined in which zones or regions the redundant resources should be located to satisfy the SLA requirements.

[0024] Figure 1 illustrates an overall approach for determining resource allocation to satisfy the SLA requirements of a software component according to one embodiment. In this embodiment, an analysis engine 100 receives an input including SLA requirements 110 for a component, a component profile 120 and an available resources description 130. Based on the input, the analysis engine 100 performs an analysis to generate recommended resource allocation 140 for the component and a recommended action 150 for satisfying the SLA requirements 110. Among other things, the SLA requirements 110 specify a recovery time objective and a request execution duration. The component profile 120 describes, among other things, expected resource consumption of the component and a redundancy model supported by the component. The available resources description 130 describes the available resources for deploying and executing the component, such as processing power (e.g. speed and performance of a central processing unit (CPU)), memory (e.g. speed and size of random access memory (RAM)), disk storage and/or network resources.

[0025] The recommended resource allocation 140 for the component may include a recommended amount, performance and/or location of the allocated resources, as well as a recommended redundancy model for the component. The recommended action 150 for the component may include resizing the resources (e.g. changing the size and/or performance of CPU, memory, disk, network, etc.), changing the location of the resources used in the component recovery (e.g. from a remote disk to a local disk), and changing the redundancy model for the component. The recommended resource allocation 140 and recommended action 150 may be applied to the component and its execution environment at deployment time before the component is in service, and/or at runtime such that the component can adapt to changes that may arise during its operation while satisfying the SLA requirements 110.

[0026] In order to determine the proper resource provisioning for a component, the component behavior can be dissected into a set of behavior phases (also referred to as "phases") during a component recovery process. It should be noted that during each of those phases, the component may consume different amounts of various resources. Hence, an analysis is performed for each such phase.

[0027] Figure 2 illustrates an example of the phases associated with a number of redundancy models according to one embodiment. In this example embodiment, the component behavior is dissected into six different phases: instantiation 211, fetch state 212, parse state 213, recuperation 214, normal execution 215 and termination 216. The redundancy models include spare, cold standby, warm standby, hot standby and multi-active. During a recovery, a component may potentially go sequentially through all, or a subset, of these phases according to the redundancy model. For a non-instantiated spare component, if the state exists 220 for the component (and/or its users), the spare component will sequentially go through all of the instantiation 211, fetch state 212, parse state 213 and recuperation 214 phases to reach the normal execution phase 215. It is noted that some phases can be skipped during the recovery process according to the redundancy model that a component implements. For example, a warm standby component may skip the instantiation phase 211 and the fetch state phase 212 during a failover recovery process.

[0028] Figure 2 also shows that the component is monitored by an HA orchestrator 270 that periodically monitors the component behavior at a given monitoring frequency (e.g. at a given number of heartbeats per second). The monitoring interval, which is the inverse of the monitoring frequency, provides an estimate of the detection time for detecting a failure occurring in the component. After a failure is detected, the HA orchestrator 270 may take a given reaction time to react to the detection to start the recovery process.
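The relationship between monitoring frequency, detection time and reaction time can be illustrated with a minimal sketch. The function names are hypothetical, and adding the reaction time to the detection estimate is an assumption made here for illustration; the text only states that the monitoring interval estimates the detection time and that a reaction time follows.

```python
def detection_time_estimate(monitoring_frequency_hz: float) -> float:
    """Return the monitoring interval in seconds (inverse of the frequency).

    The text uses this interval as an estimate of the failure detection time.
    """
    if monitoring_frequency_hz <= 0:
        raise ValueError("monitoring frequency must be positive")
    return 1.0 / monitoring_frequency_hz


def time_until_recovery_starts(monitoring_frequency_hz: float,
                               reaction_time_s: float) -> float:
    """Assumed model: detection estimate plus the orchestrator reaction time."""
    return detection_time_estimate(monitoring_frequency_hz) + reaction_time_s


# Example: 4 heartbeats per second and a 0.5 s reaction time.
print(time_until_recovery_starts(4.0, 0.5))  # 0.25 s + 0.5 s = 0.75
```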

[0029] In each phase the component consumes different amounts of CPU, memory, storage, and network resources. As such, a model can be defined to profile the component resource consumption in each phase. Figure 3A and Figure 3B provide an example Unified Modeling Language (UML) class diagram 300 describing a component profile and available resources according to one embodiment. In this embodiment, a SoftwareComponent class 310 defines the attributes of the component, as well as its association to six different phase classes 321-326. The phase classes 321-326 correspond to the phases 211-216 of Figure 2, respectively. Each phase class can have the specific attributes describing the resource consumption when the component is executing this phase. The phase classes 321-326 and their attributes inherit from the IntermediatePhase class 331 and ServingPhase class 332, which in turn inherit from the Phase class 330.

[0030] Figure 3B illustrates a ResourceGroup class 340, which contains an Operating System class 341, a CPU class 342, a RAM class 343, a Disk class 344 and a Network class 345. Each class defines a number of attributes specifying the size, performance, and/or other characteristics of the resources. A Flavor class 350 is also defined. In a cloud setting, multiple "flavors" of an execution environment can be offered to a cloud tenant, such as tiny, small, medium, large and extra-large. Each flavor specifies the size and processing power of a set of resources, e.g. a medium flavor can have half of the virtual CPU cores, disk space and RAM of a large flavor. The method described herein can be used to determine which flavor should be chosen to host a software component that provides a service while satisfying the SLA requirements.
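As a rough illustration of flavor selection, the sketch below picks the smallest flavor whose estimated recovery time meets a recovery time objective. The Flavor fields, the flavor sizes and the toy estimator are all hypothetical; a real estimator would be built from the per-phase delay calculations described in this disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Flavor:
    """Hypothetical flavor description (field names are assumptions)."""
    name: str
    vcpus: int
    ram_mb: int
    disk_gb: int


def choose_flavor(flavors: Iterable[Flavor],
                  estimate_recovery_time: Callable[[Flavor], float],
                  rto_s: float) -> Optional[Flavor]:
    """Return the smallest flavor whose estimated recovery time meets the RTO.

    Flavors are tried from smallest to largest; None means no flavor fits.
    """
    ordered = sorted(flavors, key=lambda f: (f.vcpus, f.ram_mb, f.disk_gb))
    for flavor in ordered:
        if estimate_recovery_time(flavor) <= rto_s:
            return flavor
    return None


# Illustrative flavors: medium has half the resources of large.
FLAVORS = [
    Flavor("large", vcpus=4, ram_mb=8192, disk_gb=80),
    Flavor("small", vcpus=1, ram_mb=2048, disk_gb=20),
    Flavor("medium", vcpus=2, ram_mb=4096, disk_gb=40),
]


def toy_estimate(flavor: Flavor) -> float:
    """Toy estimator (assumption): recovery time shrinks with more vCPUs."""
    return 8.0 / flavor.vcpus


chosen = choose_flavor(FLAVORS, toy_estimate, rto_s=3.0)
print(chosen.name if chosen else "no flavor satisfies the RTO")
```

Trying the smallest flavor first reflects the stated goal of avoiding over-dimensioning.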

[0031] Although the UML class diagram 300 only shows a VirtualEnvironment 360 as the execution environment for the component, it is understood that the execution environment for the component may be virtual (e.g. VM 361 or Container 362) or physical (e.g. physical CPUs or servers).

[0032] Moreover, the attribute values defined in the UML class diagram 300 may be numerical (e.g. integer or double), Boolean (e.g. true or false), string, or other specified types of values such as location (e.g. in the memory, on a local or remote disk), redundancy model (e.g. spare, cold standby, warm standby, hot standby or multi-active) or recovery action (e.g. restart, failover or switchover).

[0033] The attribute values defined in the UML class diagram 300 for a given component dictate the recovery time of the component. In this example, it is assumed that a component is serving requests coming from users. Each user can have state information regarding its current requests and other credentials. In this example, the state size of the component is proportional to the number of users it is serving. The SoftwareComponent class 310 has a stateLocation attribute that indicates the location of the state information.

[0034] Moreover, Figure 3A shows a PhysicalFlavorDiagnosis class 380, which has attributes including delayByDisk, delayByRAM, delayByNetwork, delayByCPU and phaseDuration. These attribute values may be calculated based on the given attribute values associated with the component and the resources, and may be used to determine the duration of each phase. In general, it will be appreciated that the duration of each phase can be determined in accordance with the delay associated with the disk, the delay associated with the network, the delay associated with the RAM (e.g. memory) and/or the delay associated with the CPU (e.g. processing). The delay associated with the disk can depend on the size of the component, the number of read/write operations, the speed of the disk itself and/or other factors. The delay associated with the network can be determined in accordance with the executable location of the component. The delay associated with the network can further depend on the size of the component, the number of read/write operations, the network link bandwidth and/or other factors. The delay associated with the RAM can depend on the size of the component, the number of read/write operations, the speed of the RAM itself and/or other factors. The delay associated with the CPU can depend on the number of instructions to be executed, the CPU speed, the amount of processing that can be parallelized, and/or other factors.

[0035] The delay associated with the disk, the delay associated with the network, the delay associated with the RAM, and the delay associated with the CPU can be summed together to determine the duration of a phase. Finally, the appropriate phase durations can be summed together, in accordance with the redundancy model used, to determine the overall recovery duration of a component. The overall recovery duration, also referred to as the recovery time, can be compared with the recovery time objective (RTO) defined in the ServiceLevelAgreement class 320 to determine whether the SLA requirements for the component are satisfied.
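The two summations described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the class and function names are hypothetical, the delay values are made up, and only the phase sets for the spare and warm standby models are taken from the description of Figure 2; the other redundancy models are left out rather than guessed.

```python
from dataclasses import dataclass
from typing import Dict

# Recovery phases traversed before normal execution (from Figure 2).
RECOVERY_PHASES = ("instantiation", "fetch_state", "parse_state", "recuperation")

# Phases a redundant component goes through, per redundancy model.
# Only these two mappings are stated in the text; others are omitted.
PHASES_BY_MODEL = {
    "spare": RECOVERY_PHASES,
    "warm_standby": ("parse_state", "recuperation"),
}


@dataclass
class PhaseDelays:
    """Per-resource delays of one phase, in seconds."""
    by_disk: float = 0.0
    by_network: float = 0.0
    by_ram: float = 0.0
    by_cpu: float = 0.0

    def duration(self) -> float:
        # Phase duration is the sum of the four per-resource delays.
        return self.by_disk + self.by_network + self.by_ram + self.by_cpu


def recovery_time(model: str, delays: Dict[str, PhaseDelays]) -> float:
    """Sum the durations of the phases the given redundancy model goes through."""
    return sum(delays[phase].duration() for phase in PHASES_BY_MODEL[model])


def satisfies_rto(model: str, delays: Dict[str, PhaseDelays], rto_s: float) -> bool:
    """Compare the overall recovery time with the recovery time objective."""
    return recovery_time(model, delays) <= rto_s


# Illustrative (made-up) delays in seconds.
DELAYS = {
    "instantiation": PhaseDelays(by_disk=2.0, by_network=1.0, by_ram=0.5, by_cpu=0.5),
    "fetch_state": PhaseDelays(by_disk=0.5, by_network=1.5),
    "parse_state": PhaseDelays(by_cpu=1.0),
    "recuperation": PhaseDelays(by_ram=0.25, by_cpu=0.5),
}

for model in PHASES_BY_MODEL:
    print(model, recovery_time(model, DELAYS), satisfies_rto(model, DELAYS, rto_s=2.0))
```

With these example numbers the spare model misses a 2-second RTO while warm standby meets it, which is the kind of comparison that could drive a change of redundancy model.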

[0036] Figure 3A also shows an HA orchestrator 370, which corresponds to the HA orchestrator 270 of Figure 2. The HA orchestrator 370 monitors the operation of the component with a given monitoring interval and reacts to a component failure within a given reaction time. A FailureType class 390 is also defined to describe the failure rate and the recommended recovery of the component.

[0037] In the following, a method for calculating the duration of each phase (except for the normal execution phase 215) of Figure 2 is described with reference to the attributes defined in Figure 3A and Figure 3B.

[0038] The instantiation phase 211 duration of a component depends on a number of attributes:

[0039] ExecutableSize defines the size in megabytes (MB) of the executable file(s) that is instantiated to start the component.

[0040] ExecutableLocation defines whether the executable is located in the main memory (RAM), on the local disk, or on a remote disk.

[0041] BasicNumberOfInstructions refers to the number of CPU instructions needed irrespective of the executable size, e.g. during the process initialization by the operating system.

[0042] AmountOfInstructionsperMB refers to the amount of CPU instructions that are executed per one MB of the executable size of the component. This is specific to the component, excluding the portion of processing that the operating system has to perform for the instantiation of any new software process in the system. For the purpose of this example, it is assumed that the distribution is uniform, i.e. the averaged value is used even if the CPU usage varies at different points in time during the instantiation.

[0043] NumberOfPhaseCPUInstructions refers to the number of CPU instructions needed to execute the phase. It can be calculated as:

NumberOfPhaseCPUInstructions = AmountOfInstructionsperMB × ExecutableSize + BasicNumberOfInstructions

[0044] AmountOfMemoryTransfer refers to the number of memory transfers (read/write) that are executed during the instantiation phase 211.

[0045] AmountOfDiskTransfer refers to the number of disk transfers (read/write) that are executed during the instantiation phase 211.

[0046] ParallelizationImprovement. According to Amdahl's law, the maximum speedup of using multiple processors depends on (1) the number of execution threads, and (2) the portion of the process that is strictly serial (between 0 and 1, where 1 = 100%). The time needed to execute with n threads, T(n), is therefore equal to:

T(n) = T(1) × (strictlySerial + (1 - strictlySerial) / numberOfEffectiveThreads)

where T(1) is the time to execute with one thread.

[0047] ParallelizationImprovement is defined as:

ParallelizationImprovement = strictlySerial + (1 - strictlySerial) / numberOfEffectiveThreads

[0048] The number of effective threads is defined as the:

Minimum(number of cores, number of threads in the component process)
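The two definitions above can be sketched in Python as follows. This is an illustration only: the function names and the input validation are assumptions, not part of the disclosure. The returned factor multiplies the single-thread time T(1), so a value of 1.0 means no speedup and smaller values mean faster execution.

```python
def effective_threads(number_of_cores: int, component_threads: int) -> int:
    """Minimum(number of cores, number of threads in the component process)."""
    return min(number_of_cores, component_threads)


def parallelization_improvement(strictly_serial: float,
                                number_of_cores: int,
                                component_threads: int) -> float:
    """strictlySerial + (1 - strictlySerial) / numberOfEffectiveThreads."""
    if not 0.0 <= strictly_serial <= 1.0:
        raise ValueError("strictly_serial must be between 0 and 1")
    threads = effective_threads(number_of_cores, component_threads)
    if threads < 1:
        raise ValueError("at least one core and one thread are required")
    return strictly_serial + (1.0 - strictly_serial) / threads


# Hypothetical component: half of the work is strictly serial, the process
# has 8 threads, and the VM exposes 4 cores, so 4 threads are effective.
example_factor = parallelization_improvement(0.5, 4, 8)  # 0.5 + 0.5 / 4
```

A fully serial process (strictlySerial = 1) always yields a factor of 1.0, regardless of how many cores are allocated, which is why adding cores does not help every component.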

[0049] CPUSpeedInMIPS refers to the number of million instructions a CPU core assigned to the VM or container hosting the component can perform in one second. The value of this speed depends on the CPUSpeed (i.e. the clock speed in megahertz or gigahertz) multiplied by the average number of instructions that can be executed in one CPU clock cycle (which differs from one CPU architecture to another).

CPUSpeedInMIPS = CPUSpeed x AveInstructionsPerCycle
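A short Python sketch of this conversion is given below. The function name and the unit handling are assumptions: a clock speed in megahertz multiplied by instructions per cycle gives MIPS directly, whereas a clock speed in gigahertz must first be converted to megahertz.

```python
def cpu_speed_in_mips(cpu_speed: float,
                      ave_instructions_per_cycle: float,
                      unit: str = "MHz") -> float:
    """CPUSpeedInMIPS = CPUSpeed x AveInstructionsPerCycle, with CPUSpeed in MHz."""
    if unit == "GHz":
        cpu_speed_mhz = cpu_speed * 1000.0  # 1 GHz = 1000 MHz
    elif unit == "MHz":
        cpu_speed_mhz = cpu_speed
    else:
        raise ValueError("unit must be 'MHz' or 'GHz'")
    return cpu_speed_mhz * ave_instructions_per_cycle


# Hypothetical core: 2.5 GHz clock, 2 instructions per cycle on average.
example_mips = cpu_speed_in_mips(2.5, 2.0, unit="GHz")
```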

[0050] The duration of the instantiation phase 211 can be calculated as follows:

CalculateInstantiationDuration ( )
{
    if (ExecutableLocation == ON_REMOTE_DISK)
    {
        DelayByDisk = (ExecutableSize + amountOfDiskTransfer) ÷ diskSpeed;
        DelayByNetwork = (ExecutableSize + amountOfDiskTransfer) ÷ bandwidth;
        DelayByRAM = (ExecutableSize + amountOfMemoryTransfer) ÷ ramSpeed;
        DelayByCPU = (numberOfPhaseCPUInstructions ÷ cpuSpeedInMIPS) x ParallelizationImprovement;
    }
    else if (ExecutableLocation == ON_LOCAL_DISK)
    {
        DelayByDisk = (ExecutableSize + amountOfDiskTransfer) ÷ diskSpeed;
        DelayByNetwork = 0;
        DelayByRAM = (ExecutableSize + amountOfMemoryTransfer) ÷ ramSpeed;
        DelayByCPU = (numberOfPhaseCPUInstructions ÷ cpuSpeedInMIPS) x ParallelizationImprovement;
    }
    else if (ExecutableLocation == IN_MEMORY)
    {
        DelayByDisk = 0;
        DelayByNetwork = 0;
        DelayByRAM = (ExecutableSize + amountOfMemoryTransfer) ÷ ramSpeed;
        DelayByCPU = (numberOfPhaseCPUInstructions ÷ cpuSpeedInMIPS) x ParallelizationImprovement;
    }
    InstantiationDuration = DelayByDisk + DelayByNetwork + DelayByRAM + DelayByCPU;
    return InstantiationDuration;
}

[0051] The fetch state phase 212 duration of a component depends on a number of attributes:

[0052] AverageStateSizePerUser is the size in MB of the state of each user of the system. For example, if the component is processing calls of 500 users, then the state is the information that needs to be maintained by the component during these calls, and backed up to ensure service continuity in case of failure.

[0053] NumberOfUsersPerComponent refers to the expected number of users per component. It is equal to averageNumberOfUsers ÷ numberOfInstances. For instance, a web-server component can have 10 instances serving 5000 users, in which case the number of users per component is 500.

[0054] StateLocation refers to the location of the state: in the main memory (RAM), on the local disk, or on a remote disk.

[0055] BasicStateSize refers to the component state size irrespective of the users' states, e.g. the component security certificate, etc.

TotalMemoryTransfer = AverageStateSizePerUser x NumberOfUsersPerComponent + BasicStateSize

[0056] The duration of the fetch state phase 212 can be calculated as follows:

CalculateFetchStatePhaseDuration ( )
{
    if (StateLocation == ON_REMOTE_DISK)
    {
        DelayByDisk = (TotalMemoryTransfer + amountOfDiskTransfer) ÷ diskSpeed;
        DelayByNetwork = (TotalMemoryTransfer + amountOfDiskTransfer) ÷ bandwidth;
        DelayByRAM = (TotalMemoryTransfer + amountOfMemoryTransfer) ÷ ramSpeed;
    }
    else if (StateLocation == ON_LOCAL_DISK)
    {
        DelayByDisk = (TotalMemoryTransfer + amountOfDiskTransfer) ÷ diskSpeed;
        DelayByNetwork = 0;
        DelayByRAM = (TotalMemoryTransfer + amountOfMemoryTransfer) ÷ ramSpeed;
    }
    else if (StateLocation == IN_MEMORY)
    {
        DelayByDisk = 0;
        DelayByNetwork = 0;
        DelayByRAM = 0;
    }
    FetchStateDuration = DelayByDisk + DelayByNetwork + DelayByRAM;
    return FetchStateDuration;
}
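The instantiation and fetch state calculations share the same location-dependent transfer delays, which the following Python sketch factors into one helper. It is a sketch under stated assumptions, not the disclosed implementation: the `Location`, `Environment`, and function names are hypothetical, sizes and transfer amounts are taken to be in MB, speeds in MB per second, and instruction counts in millions so that durations come out in seconds.

```python
from dataclasses import dataclass
from enum import Enum


class Location(Enum):
    IN_MEMORY = "in_memory"
    ON_LOCAL_DISK = "on_local_disk"
    ON_REMOTE_DISK = "on_remote_disk"


@dataclass(frozen=True)
class Environment:
    disk_speed: float         # MB/s (assumed unit)
    bandwidth: float          # MB/s (assumed unit)
    ram_speed: float          # MB/s (assumed unit)
    cpu_speed_in_mips: float  # million instructions per second


def transfer_delays(size_mb, location, env,
                    disk_transfer_mb=0.0, memory_transfer_mb=0.0):
    """Return (DelayByDisk, DelayByNetwork, DelayByRAM) in seconds."""
    on_disk = location in (Location.ON_LOCAL_DISK, Location.ON_REMOTE_DISK)
    delay_by_disk = (size_mb + disk_transfer_mb) / env.disk_speed if on_disk else 0.0
    # Only a remote disk adds a network hop.
    delay_by_network = ((size_mb + disk_transfer_mb) / env.bandwidth
                        if location is Location.ON_REMOTE_DISK else 0.0)
    delay_by_ram = (size_mb + memory_transfer_mb) / env.ram_speed
    return delay_by_disk, delay_by_network, delay_by_ram


def instantiation_duration(executable_size_mb, location, env,
                           million_instructions, improvement,
                           disk_transfer_mb=0.0, memory_transfer_mb=0.0):
    """Disk + network + RAM + CPU delay of the instantiation phase."""
    disk, network, ram = transfer_delays(
        executable_size_mb, location, env, disk_transfer_mb, memory_transfer_mb)
    cpu = million_instructions / env.cpu_speed_in_mips * improvement
    return disk + network + ram + cpu


def fetch_state_duration(total_memory_transfer_mb, state_location, env,
                         disk_transfer_mb=0.0, memory_transfer_mb=0.0):
    """Disk + network + RAM delay of the fetch state phase (no CPU term)."""
    if state_location is Location.IN_MEMORY:
        return 0.0  # state already in memory: all three delays are zero
    return sum(transfer_delays(
        total_memory_transfer_mb, state_location, env,
        disk_transfer_mb, memory_transfer_mb))


# Hypothetical environment and component values.
env = Environment(disk_speed=100.0, bandwidth=50.0,
                  ram_speed=1000.0, cpu_speed_in_mips=5000.0)
remote_start = instantiation_duration(100.0, Location.ON_REMOTE_DISK, env,
                                      million_instructions=500.0, improvement=0.5)
local_fetch = fetch_state_duration(50.0, Location.ON_LOCAL_DISK, env)
```

With these sample numbers the network delay dominates the remote instantiation, which illustrates why moving the executable from a remote disk to a local disk is one of the actions the analysis can recommend.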

[0057] The parse state phase 213 duration of a component depends on a number of attributes:

[0058] AmountOfInstructionsPerMB refers to the amount of CPU instructions that are executed per one MB of the state size of the component.

[0059] NumberOfPhaseCPUInstructions refers to the number of CPU instructions needed to execute the phase. It is equal to:

AmountOfInstructionsPerMB x stateSize + BasicNumberOfInstructions

[0060] The duration of the parse state phase 213 can be calculated as follows:

CalculateParseStatePhaseDuration ( )
{
    DelayByDisk = 0;
    DelayByNetwork = 0;
    DelayByRAM = amountOfMemoryTransfer ÷ ramSpeed;
    DelayByCPU = (numberOfPhaseCPUInstructions ÷ cpuSpeedInMIPS) x ParallelizationImprovement;
    ParseStateDuration = DelayByDisk + DelayByNetwork + DelayByRAM + DelayByCPU;
    return ParseStateDuration;
}

[0061] The recuperation phase 214 duration of a component depends on a number of attributes:

[0062] AmountOfInstructionsPerMB refers to the amount of CPU instructions that are executed per one MB of the state size of the component.

[0063] NumberOfPhaseCPUInstructions refers to the number of CPU instructions needed to execute the phase. It is equal to:

AmountOfInstructionsPerMB x stateSize + BasicNumberOfInstructions

[0064] The duration of the recuperation phase 214 can be calculated as follows:

CalculateRecuperateStatePhaseDuration ( )
{
    DelayByDisk = 0;
    DelayByNetwork = 0;
    DelayByRAM = amountOfMemoryTransfer ÷ ramSpeed;
    DelayByCPU = (numberOfPhaseCPUInstructions ÷ cpuSpeedInMIPS) x ParallelizationImprovement;
    RecuperateStateDuration = DelayByDisk + DelayByNetwork + DelayByRAM + DelayByCPU;
    return RecuperateStateDuration;
}

[0065] The termination phase 216 duration of a component depends on a number of attributes:

[0066] AmountOfInstructionsPerMB refers to the amount of CPU instructions that are executed per one MB of the executable size of the component.

[0067] NumberOfPhaseCPUInstructions refers to the number of CPU instructions needed to execute the phase. It is equal to:

AmountOfInstructionsPerMB x ExecutableSize + BasicNumberOfInstructions

[0068] The duration of the termination phase 216 can be calculated as follows:

CalculateTerminatePhaseDuration ( )
{
    DelayByDisk = 0;
    DelayByNetwork = 0;
    DelayByRAM = amountOfMemoryTransfer ÷ ramSpeed;
    DelayByCPU = (numberOfPhaseCPUInstructions ÷ cpuSpeedInMIPS) x ParallelizationImprovement;
    TerminatePhaseDuration = DelayByDisk + DelayByNetwork + DelayByRAM + DelayByCPU;
    return TerminatePhaseDuration;
}
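The parse state, recuperation, and termination phases involve no disk or network delay, so they can be illustrated with a single Python helper. This is a sketch only: the function name is hypothetical, instruction counts are assumed to be in millions, sizes in MB, and the RAM speed in MB per second. It also assumes that the CPU delay is included in the total of every one of these phases.

```python
def compute_phase_duration(size_mb: float,
                           instructions_per_mb: float,
                           basic_instructions: float,
                           memory_transfer_mb: float,
                           ram_speed: float,
                           cpu_speed_in_mips: float,
                           improvement: float) -> float:
    """DelayByRAM + DelayByCPU for a phase without disk or network delay.

    size_mb is the state size (parse state, recuperation) or the
    executable size (termination).
    """
    # NumberOfPhaseCPUInstructions, in millions of instructions (assumed unit).
    phase_instructions = instructions_per_mb * size_mb + basic_instructions
    delay_by_ram = memory_transfer_mb / ram_speed
    delay_by_cpu = phase_instructions / cpu_speed_in_mips * improvement
    return delay_by_ram + delay_by_cpu


# Hypothetical parse of a 50 MB state: 2 million instructions per MB plus
# 10 million fixed, 20 MB of memory transfer, 1000 MB/s RAM, 5000 MIPS,
# and a parallelization factor of 0.5.
example_parse = compute_phase_duration(50.0, 2.0, 10.0, 20.0, 1000.0, 5000.0, 0.5)
```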

[0069] As shown in the calculation above, the recovery time of a component can depend on the redundancy model that the component supports, as illustrated in Figure 2. According to the redundancy model, the recovery time is the aggregate duration of the various phases that a redundant component goes through. The detection time and the reaction time depend on the HA orchestrator 270 that is used to manage the availability of the components. The recovery starts by terminating the faulty component to stop the propagation of the failure. A spare component will go through all of the phases before the normal execution phase to take over the active assignment, while a hot standby component will skip most of the phases.

[0070] The duration of the recovery of a component can be calculated as follows:

CalculateRecoveryDuration ( )
{
    switch (RedundancyModel) {
        case SPARE: recoveryDuration =
            (detectionTime + reactionTime
            + terminationDuration
            + instantiateDuration
            + fetchStateDuration
            + parseStateDuration
            + recuperationDuration);
            break;
        case COLD_STANDBY: recoveryDuration =
            (detectionTime + reactionTime
            + terminationDuration
            + fetchStateDuration
            + parseStateDuration
            + recuperationDuration);
            break;
        case WARM_STANDBY: recoveryDuration =
            (detectionTime + reactionTime
            + terminationDuration
            + parseStateDuration
            + recuperationDuration);
            break;
        case MULTI_ACTIVE_OR_HOT_STANDBY: recoveryDuration =
            (detectionTime + reactionTime
            + terminationDuration);
            break;
    }
    return recoveryDuration;
}
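The per-model aggregation can also be expressed as a lookup table, as in the following Python sketch. The enum, the phase keys, and the sample durations are illustrative assumptions; only the set of phases counted for each redundancy model is taken from the calculation above.

```python
from enum import Enum
from typing import Mapping


class RedundancyModel(Enum):
    SPARE = "spare"
    COLD_STANDBY = "cold_standby"
    WARM_STANDBY = "warm_standby"
    MULTI_ACTIVE_OR_HOT_STANDBY = "multi_active_or_hot_standby"


# Phases a redundant component still has to go through during recovery.
PHASES_BY_MODEL = {
    RedundancyModel.SPARE: ("termination", "instantiation", "fetch_state",
                            "parse_state", "recuperation"),
    RedundancyModel.COLD_STANDBY: ("termination", "fetch_state",
                                   "parse_state", "recuperation"),
    RedundancyModel.WARM_STANDBY: ("termination", "parse_state", "recuperation"),
    RedundancyModel.MULTI_ACTIVE_OR_HOT_STANDBY: ("termination",),
}


def recovery_duration(model: RedundancyModel,
                      detection_time: float,
                      reaction_time: float,
                      phase_durations: Mapping[str, float]) -> float:
    """Detection and reaction time plus the phases required by the model."""
    phases = PHASES_BY_MODEL[model]
    return detection_time + reaction_time + sum(phase_durations[p] for p in phases)


# Hypothetical phase durations in seconds.
durations = {"termination": 0.1, "instantiation": 3.0, "fetch_state": 0.5,
             "parse_state": 0.2, "recuperation": 0.2}
recovery_by_model = {m: recovery_duration(m, 1.0, 0.5, durations)
                     for m in RedundancyModel}
```

Keeping the phase lists in a table makes it easy to compare models side by side: with the sample values, each step from spare toward hot standby removes one or more phases and shortens the recovery.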

[0071] In addition to the recovery time, the SLA requirements for the component may also place an upper limit on the execution time per request. For example, the ServiceLevelAgreement class 320 in Figure 3A defines a maxExecTimePerRequest attribute to specify such a limit. The request execution duration of a component depends on a number of attributes:

[0072] NumberOfMillionInstructionsPerRequest refers to the average number of instructions, in millions, that the CPU must execute per request.

[0073] AmountOfMemoryTransferPerRequest refers to the average amount of memory read/write per request.

[0074] The duration of the request execution of a component can be calculated as follows:

CalculateRequestExecutionDuration ( )
{
    duration =
        ((NumberOfMillionInstructionsPerRequest ÷ cpuAvailableSpeedInMIPS) x parallelImprovement)
        + (AmountOfMemoryTransferPerRequest ÷ ramSpeed);
    return duration;
}

[0075] The minimum number of requests that can be served per second is inversely proportional to the request execution duration discussed above. It can be deduced as:

NumberOfRequestsPerSecond = 1 ÷ RequestExecutionDuration
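The request execution calculation and the derived request rate can be sketched in Python as follows. The function names, the SLA check helper, and the sample values are assumptions for illustration; the memory transfer is taken to be in MB and the RAM speed in MB per second.

```python
def request_execution_duration(million_instructions_per_request: float,
                               cpu_available_speed_in_mips: float,
                               improvement: float,
                               memory_transfer_per_request_mb: float,
                               ram_speed: float) -> float:
    """CPU delay (scaled by the parallelization factor) plus RAM delay, in seconds."""
    delay_by_cpu = (million_instructions_per_request
                    / cpu_available_speed_in_mips * improvement)
    delay_by_ram = memory_transfer_per_request_mb / ram_speed
    return delay_by_cpu + delay_by_ram


def requests_per_second(duration: float) -> float:
    """NumberOfRequestsPerSecond = 1 / RequestExecutionDuration."""
    if duration <= 0.0:
        raise ValueError("duration must be positive")
    return 1.0 / duration


def meets_max_exec_time(duration: float, max_exec_time_per_request: float) -> bool:
    """Compare against the maxExecTimePerRequest limit of the SLA."""
    return duration <= max_exec_time_per_request


# Hypothetical request: 10 million instructions, 5000 MIPS available,
# no parallel speedup (factor 1.0), 2 MB of memory transfer at 1000 MB/s.
example_duration = request_execution_duration(10.0, 5000.0, 1.0, 2.0, 1000.0)
example_rate = requests_per_second(example_duration)
```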

[0076] Referring again to Figure 1, the analysis engine 100 may perform the calculations of recovery time and request execution time to ensure compliance with the SLA requirements 110. In order to satisfy the SLA requirements 110, the analysis engine 100 may recommend a resource allocation scheme and/or a number of actions. Component-based actions include changing the redundancy model and adding additional components. Execution environment (e.g. VM, container, or physical environment) based actions include changing the location of the allocated resources (e.g. from a remote disk to a local disk), increasing/decreasing the amount and/or upgrading/downgrading the performance of the allocated resources for CPU, memory, disk, network, etc.

[0077] In one embodiment, the analysis engine 100 analyzes the RTO and maximum number of requests per second for a given component (e.g. an HTTP server) deployed on a given VM flavor (e.g. medium-size VM). The analysis engine 100 also analyzes the duration of each of the phases shown in Figure 2. Moreover, the analysis engine 100 can identify the cause of the delay (e.g. CPU, disk, memory, and/or network) in each phase. From the analysis result, the performance bottleneck can be identified.
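One simple way to illustrate the bottleneck identification is to keep the per-resource delays of each phase and pick the largest contributor. The following Python sketch does this; the data layout, the function names, and the sample delays are assumptions, and the analysis engine 100 may identify bottlenecks differently.

```python
from typing import Dict, Mapping


def phase_bottleneck(delays: Mapping[str, float]) -> str:
    """Return the resource with the largest delay in one phase."""
    return max(delays, key=lambda resource: delays[resource])


def bottlenecks_by_phase(phases: Mapping[str, Mapping[str, float]]) -> Dict[str, str]:
    """Map each phase name to the resource that dominates its duration."""
    return {name: phase_bottleneck(delays) for name, delays in phases.items()}


# Hypothetical delay breakdown in seconds for two phases.
breakdown = {
    "instantiation": {"disk": 1.0, "network": 2.0, "ram": 0.1, "cpu": 0.05},
    "parse_state": {"disk": 0.0, "network": 0.0, "ram": 0.02, "cpu": 0.3},
}
example_bottlenecks = bottlenecks_by_phase(breakdown)
```

In this sample, the instantiation phase is limited by the network and the parse state phase by the CPU, which would point to different remedial actions for each phase.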

[0078] For example, two redundancy models such as spare and cold standby can be compared to estimate and analyze how long a recovery will take for each redundancy model. Using this approach, it can be determined whether the VM flavor used (e.g. medium-size VM) is suitable to satisfy the SLA requirements. This approach can determine which redundancy model and VM flavor can be selected to satisfy the SLA requirements. Not only can this approach ensure SLA compliance, it also avoids over-provisioning of resources; for example, when the selected VM flavor offers much better performance than required, fewer resources may be allocated.

[0079] Depending on the redundancy model, a synchronization level can be imposed between the primary and secondary (i.e. redundant) set of resources. If the two sets of resources are located far from each other (e.g. in two different zones or regions), the fetch state phase in particular is impacted. The farther away the secondary resources are located, the more network delay is incurred. Depending on the redundancy model, the recovery time may be impacted. In the case of spare and cold standby redundancy models, the recovery time increases. For warm, hot, and multi-active redundancy models, the recovery time is not directly impacted, but the synchronization will be more costly in terms of performance overhead and energy consumption.

[0080] Accordingly, embodiments of the present disclosure can be employed to better dimension computing resources while satisfying the SLA requirements, thus saving cost.

[0081] Figure 4 is a flow chart illustrating a method 400 for ensuring a software component satisfies its SLA requirements. In particular, an SLA can specify a recovery time objective for a component to recover from failure. The method 400 can be performed by the analysis engine 100 of Figure 1, which may be part of a system such as a cloud management system, a cloud scheduler system, or a data center management system. The method 400 begins with step 410 at which the system receives a component profile describing expected resource consumption of the component. The component profile can include the expected resource consumption of the component including processing power (e.g. CPU), memory (e.g. RAM), storage and/or network resources. The component profile can further include consumption details and/or attributes specific to each phase of a recovery process. The component profile can also further specify one or more redundancy models to be supported for the component. The method 400 proceeds to step 420 at which the system identifies available resources for an execution environment to execute the component. The execution environment may be a virtual or a physical execution environment. For example, a VM composition may be selected for hosting the component. The VM composition can be selected in accordance with the component profile. The VM composition can identify the amount of resources (e.g. CPU, RAM, storage, network, etc.), performance, and the location of resources (e.g. remote disk or local disk) allocated for the VM to host the component.

[0082] The method 400 then proceeds to step 430 at which the system calculates the component recovery time for a redundancy model supported by the component and for an execution environment composition supported by the available resources. The component recovery time includes a time duration for each phase in a set of one or more phases of component recovery. The component recovery time can be determined based on the selected execution environment composition. The component recovery time can be calculated based on the resource values specified in the component profile and/or the selected execution environment composition. The component recovery time can be calculated for the redundancy model(s) supported by the component. The component recovery time can be calculated for at least one or more phases of the redundancy model(s). The duration associated with each phase can be determined in accordance with one or more parameters including, but not limited to, the delay associated with the disk, the delay associated with the network, the delay associated with the memory, and the delay associated with the processing of the component profile on the selected execution environment composition.

[0083] The method 400 further proceeds to step 440 at which the system determines resource allocation based on the calculated component recovery time to satisfy a recovery time objective specified in the SLA. In one embodiment, the calculated recovery time can be compared to the recovery time objective specified in the SLA. Responsive to the recovery time objective not being met by the calculated recovery time, an action for the component can be determined. The action can include changing the redundancy model, adding additional components, modifying attributes in the selected execution environment composition (e.g. change disk location, increase one or more resources), and the like.

[0084] Optionally, the component recovery time can be re-calculated following the modification to the component profile and/or execution environment composition. This process can be repeated iteratively until the calculated recovery time satisfies the recovery time objective.

[0085] A final redundancy model and final execution environment composition for the component can be determined to satisfy the recovery time objective. The final redundancy model and final execution environment composition can be output as a recommended resource allocation and/or recommended action for implementation in the cloud computing environment.
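Steps 430 and 440, together with the optional re-calculation, can be illustrated by a loop over candidate combinations of redundancy model and execution environment composition. The Python sketch below is one possible reading, not the claimed method: the function name, the string labels, the ordering of candidates from least to most resource-intensive, and the lookup table of recovery times are all assumptions.

```python
from typing import Callable, Iterable, Optional, Tuple

Candidate = Tuple[str, str]  # (redundancy model, execution environment composition)


def select_allocation(candidates: Iterable[Candidate],
                      recovery_time: Callable[[str, str], float],
                      recovery_time_objective: float) -> Optional[Candidate]:
    """Return the first candidate whose calculated recovery time meets the RTO.

    Candidates are assumed to be ordered from least to most resource-intensive,
    so the first match also limits over-provisioning.
    """
    for model, composition in candidates:
        if recovery_time(model, composition) <= recovery_time_objective:
            return model, composition
    return None  # no candidate satisfies the recovery time objective


# Hypothetical recovery times in seconds, e.g. produced by the per-phase
# calculations for each combination.
table = {
    ("spare", "small"): 8.0,
    ("spare", "medium"): 5.5,
    ("cold_standby", "medium"): 2.5,
}
choice = select_allocation(table.keys(), lambda m, c: table[(m, c)], 3.0)
```

Returning `None` when nothing fits leaves room for the caller to widen the candidate set, for example by adding components or a larger flavor, before re-running the calculation.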

[0086] Figure 5 is a block diagram illustrating a computer system 500 according to an embodiment. The computer system 500 can be a network node or element serving as a cloud management system, a cloud scheduler system, or a data center management system as have been described herein. The computer system 500 includes circuitry including a processor 502, a memory or instruction repository 504 and a communication interface 506. The communication interface 506 can include at least one input port and at least one output port. The memory 504 contains instructions executable by the processor 502 whereby the computer system 500 is operable to perform the various embodiments as described herein.

[0087] Figure 6 is a block diagram of an example cloud management system 600 (also referred to as a cloud scheduler system or a data center management system) that can include a number of modules. Cloud management system 600 includes an input module 610 to receive a component profile describing expected resource consumption of the component, an identifier module 620 to identify available resources for an execution environment to execute the component, a calculation module 630 to calculate a component recovery time for a redundancy model supported by the component and for an execution environment composition supported by the available resources, and an output module 640 to determine resource allocation based on the calculated component recovery time to satisfy a recovery time objective specified in the SLA. Cloud management system 600 can be configured to perform the various embodiments as have been described herein.

[0088] Figure 7 is an architectural overview of a cloud network 700 that comprises a hierarchy of a cloud computing environment. The cloud network 700 can include a number of different data centers (DCs) 730 at different geographic sites. Each data center 730 site comprises a number of racks 720, and each rack 720 comprises a number of servers 710. One or more of the servers 710 can be selected to host a VM or another type of execution environment for running a software component as has been described in embodiments of the present disclosure.

[0089] The unexpected outage of cloud services has a great impact on business continuity and IT enterprises. One key to avoiding such outages is to properly dimension and allocate the underlying resources to meet SLA requirements. The described embodiments aim to attain an always-on and always-available application by determining optimal resources for hosting the requested applications such that they meet the desired recovery time objectives. Those skilled in the art will appreciate that the proposed systems and methods can be extended to include multiple objectives, such as maximizing the high availability of applications' components and maximizing resource utilization of the used infrastructure.

[0090] Embodiments of the invention may be represented as a software product stored in a machine-readable medium (also referred to as a computer-readable medium, a processor-readable medium, or a computer usable medium having a computer readable program code embodied therein). The non-transitory machine-readable medium may be any suitable tangible medium including a magnetic, optical, or electrical storage medium including a diskette, compact disk read only memory (CD-ROM), digital versatile disc read only memory (DVD-ROM), memory device (volatile or non-volatile), or similar storage mechanism. The machine-readable medium may contain various sets of instructions, code sequences, configuration information, or other data, which, when executed, cause a processor to perform steps in a method according to an embodiment of the invention. Those of ordinary skill in the art will appreciate that other instructions and operations necessary to implement the described invention may also be stored on the machine-readable medium. Software running from the machine-readable medium may interface with circuitry to perform the described tasks.

[0091] The above-described embodiments of the present invention are intended to be examples only. Alterations, modifications and variations may be effected to the particular embodiments by those of skill in the art without departing from the scope of the invention, which is defined solely by the claims appended hereto.